packet-gen // synthetic office traffic

Synthetic office traffic with real packets.

Simulated office personas, from businessman, customer support rep, developer, and executive to idle desk phones and EDR agents, produce real protocol traffic (HTTPS, SMTP, IMAP, SIP/RTP, NTP, beacons) against a paired sink. The aim is packet-level realism: when piped through an encrypted tunnel, the resulting flows should be statistically indistinguishable from those of a real working office.

What's in the box

Three coordinated containers: an originator running a FastAPI control plane and per-persona asyncio tasks, a terminator hosting protocol sinks with on-demand per-SNI certificate minting from an internal CA, and a local ollama content service that pre-generates a corpus of URL paths, HTML page bodies, and email subject/body pairs so wire traffic has realistic size and structure distribution — not uniform-random bytes.
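
As a rough illustration of the "per-persona asyncio tasks" idea, the sketch below starts one task per persona and lets each pick activities at random with idle gaps in between. The persona table, the function names (`run_persona`, `run_office`), and the timing values are hypothetical and are not taken from the project; a real task would open sockets where this one only records an event.

```python
import asyncio
import random

# Hypothetical persona table: persona name -> activities it may perform.
PERSONAS = {
    "developer": ["https_browse", "imap_poll"],
    "desk_phone": ["voip_call", "ntp_query"],
}


async def run_persona(name, activities, events, rounds=3, max_idle=0.01):
    """Pick an activity, record it, idle for a random interval, repeat."""
    for _ in range(rounds):
        activity = random.choice(activities)
        events.append((name, activity))  # placeholder for real protocol traffic
        await asyncio.sleep(random.uniform(0, max_idle))


async def run_office(personas):
    """Run every persona concurrently and return the recorded events."""
    events = []
    tasks = [
        asyncio.create_task(run_persona(name, activities, events))
        for name, activities in personas.items()
    ]
    await asyncio.gather(*tasks)
    return events


if __name__ == "__main__":
    for name, activity in asyncio.run(run_office(PERSONAS)):
        print(name, activity)
```

Because each persona sleeps independently, the recorded events interleave across personas rather than arriving in fixed blocks, which is the property that matters for inter-arrival times.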

The content of the traffic is throwaway. What matters is the packet-level characteristics: SNI values, packet sizes, inter-arrival times, session durations, fan-out across destinations.

| Activity | Wire | TLS | What it produces |
| --- | --- | --- | --- |
| https_browse | TCP 443 | per-SNI cert | multi-page sessions, corpus paths + bodies |
| smtp_send | TCP 587 | STARTTLS | EHLO + DATA, corpus-sourced subject/body |
| imap_poll | TCP 993 | implicit | LOGIN/SELECT/NOOP/SEARCH/LOGOUT |
| voip_call | UDP 5060 + RTP | - | INVITE/200/ACK + 50pps RTP + BYE |
| ntp_query | UDP 123 | - | 48-byte NTPv4 round-trip every ~64s |
| https_beacon | TCP 443 | per-SNI cert | small HTTPS POSTs to EDR / heartbeat endpoints |
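
To make the "48-byte NTPv4" entry concrete, here is a minimal sketch of how such a client request can be assembled. The header layout (LI/VN/Mode byte first, transmit timestamp in the last 8 bytes) follows the standard NTP packet format; the function name `build_ntp_request` is an assumption for illustration and not necessarily how the project builds its packets.

```python
import struct

NTP_PACKET_LEN = 48
# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_EPOCH_DELTA = 2_208_988_800


def build_ntp_request(unix_time: float) -> bytes:
    """Build a 48-byte NTPv4 client request with the transmit timestamp set."""
    # First byte packs LI=0 (no warning), VN=4, Mode=3 (client) -> 0x23.
    li_vn_mode = (0 << 6) | (4 << 3) | 3
    seconds = (int(unix_time) + NTP_UNIX_EPOCH_DELTA) & 0xFFFFFFFF
    fraction = int((unix_time % 1) * 2**32)
    # Bytes 1..39 (stratum, poll, precision, root fields, first three
    # timestamps) are left as zero in a basic client request.
    packet = bytes([li_vn_mode]) + bytes(39) + struct.pack("!II", seconds, fraction)
    assert len(packet) == NTP_PACKET_LEN
    return packet
```

Sending the result over a UDP socket to port 123 and reading back a reply of the same length gives the round-trip listed in the table.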

How the local LLM fits in

Real office traffic isn't uniformly random. HTML pages follow Zipfian length distributions. Office emails are bimodal — short replies plus long threads. URL paths look like /api/v1/customers/42/orders?page=2, not random bytes. TLS record sizes leak this structure even when the body is encrypted. Filling responses with os.urandom(N) produces flat noise an analyst can spot; an LLM-generated corpus produces traffic with the right size and structure distribution.

build time · runs once
personas.yaml → corpus-builder → ollama · llama3.2:3b → JSONL on shared volume
~110 LLM calls total · ~5 min on a small GPU, ~20 min on CPU · ollama is idle after this

run time · always
JSONL corpus → lazy load + random.choice() → originator / terminator
LLM is never on the request hot path · runtime is bounded by Python + asyncio

| Artifact | Voiced as | Sampled by | What gets generated |
| --- | --- | --- | --- |
| paths/{domain}.jsonl | first persona using that domain | originator · https_browse | 40 per domain · short paths, IDs, query strings, static assets |
| pages/{domain}.jsonl | same voice, mixed length buckets | terminator · HTTPS sink (by Host) | 5 per domain · full HTML, ~500 B → 8 kB |
| emails/{persona}.jsonl | role-specific (csr · developer · exec) | originator · smtp_send | 10 per persona · subject + body, bimodal length |
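
The run-time path described above (lazy load plus `random.choice()`) could look roughly like the following. This is a sketch under assumptions: the mount point `/corpus`, the helper names `load_records` and `sample_path`, and the `{"path": ...}` record shape are hypothetical; only the `paths/{domain}.jsonl` layout comes from the table.

```python
import json
import random
from functools import lru_cache
from pathlib import Path

CORPUS_ROOT = Path("/corpus")  # hypothetical shared-volume mount point


@lru_cache(maxsize=None)
def load_records(path: str) -> tuple:
    """Parse a JSONL file on first use; later calls are served from the cache.

    A missing file yields an empty tuple, which is cached as well, so a
    corpus built later is only picked up after a restart in this sketch.
    """
    p = Path(path)
    if not p.exists():
        return ()
    with p.open(encoding="utf-8") as fh:
        return tuple(json.loads(line) for line in fh if line.strip())


def sample_path(domain: str, root: Path = CORPUS_ROOT):
    """Return one random record for the domain, or None if there is no corpus."""
    records = load_records(str(root / "paths" / f"{domain}.jsonl"))
    return random.choice(records) if records else None
```

Because the file is parsed once per process, each request costs a single in-memory `random.choice()`, which keeps the LLM and the disk off the request hot path.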

Why pre-generate vs call live: latency (an LLM call would dominate every request), determinism (pinned model + seed = reproducible traffic, useful for testing analyzers against a fixed baseline), and resource cost. If the corpus is missing — e.g. corpus-builder turned off, or a new domain added since the last build — both sides silently fall back to os.urandom(N) with a bell-curve size, so the system stays up.
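
The fallback described above can be sketched as follows. The mean, standard deviation, and clamping bounds are illustrative assumptions, as are the function names `fallback_body` and `pick_body`; the source only states that the fallback uses `os.urandom(N)` with a bell-curve size.

```python
import os
import random


def fallback_body(mean=2048, stddev=512, min_size=64, max_size=8192) -> bytes:
    """Return random bytes whose length is drawn from a clamped normal curve."""
    size = int(random.gauss(mean, stddev))
    size = max(min_size, min(max_size, size))  # keep the length in a sane range
    return os.urandom(size)


def pick_body(corpus):
    """Prefer a corpus entry; fall back to random bytes when none is available."""
    if corpus:
        return random.choice(corpus)
    return fallback_body()
```

With an empty corpus the caller still gets a body of plausible size, so the activity keeps running; only the size and structure distribution degrades.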
