Inside the DKTrace Detection Engine: How We Catch APTs in Under 2 Seconds
A full technical walkthrough of DKTrace's correlation engine — entity graphs, kill-chain reconstruction, MITRE ATT&CK tactic chaining, and why sub-second detection matters for ransomware containment. We trace a real Cobalt Strike beacon from first packet to P1 incident creation in 847 milliseconds.
In this article
- Entity graph construction using Redis adjacency lists
- Kill-chain state machine: 7 stages, 23 transition rules
- Why ClickHouse columnar storage enables real-time correlation
- How NATS JetStream guarantees zero event loss at 100K eps
DKTrace Research Team
Security Engineering · Threat Research
How DKTrace Detects APTs in Under 2 Seconds
Modern APTs don't trigger single-rule alerts. They operate across multiple kill-chain stages over hours or days, and each step looks benign on its own. DKTrace's correlation engine is built to connect those dots in real time.
The Entity Graph
Every event DKTrace ingests is parsed into entities: users, hosts, processes, network connections, and files. These entities are stored as nodes in a Redis-backed adjacency graph, and each event creates or updates edges between them: "user X authenticated to host Y", "process Z made a network connection to IP A".
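A minimal sketch of the parsing step, assuming a simplified event dictionary with hypothetical `type`, `host`, `user`, `pid`, `dest_ip` and `path` fields (DKTrace's actual event schema isn't described here). `Edge` and `extract_edges` are illustrative names; each event yields zero or more labelled edges between entity keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str       # node key, e.g. "user:alice"
    relation: str  # edge label, e.g. "authenticated_to"
    dst: str       # node key, e.g. "host:ws-042"


def extract_edges(event: dict) -> list[Edge]:
    """Map one normalised event to graph edges (hypothetical event schema)."""
    host = f"host:{event['host']}"
    kind = event.get("type")
    if kind == "auth":
        # "user X authenticated to host Y"
        return [Edge(f"user:{event['user']}", "authenticated_to", host)]
    if kind == "process_start":
        # Process keys include the host so PIDs on different machines never collide.
        return [Edge(host, "spawned", f"process:{event['host']}/{event['pid']}")]
    if kind == "net_conn":
        # "process Z made a network connection to IP A"
        proc = f"process:{event['host']}/{event['pid']}"
        return [Edge(proc, "connected_to", f"ip:{event['dest_ip']}")]
    if kind == "file_write":
        proc = f"process:{event['host']}/{event['pid']}"
        return [Edge(proc, "wrote", f"file:{event['path']}")]
    # Unknown event types contribute no edges in this sketch.
    return []


edges = extract_edges({"type": "net_conn", "host": "ws-042", "pid": 4312,
                       "dest_ip": "203.0.113.7"})
print(edges)
```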
The graph is queried by the correlation engine every time a new event arrives. In milliseconds, it can answer: "Has this user authenticated to this host before? What processes has this host spawned in the past 30 minutes? Has this IP been seen by any other host in the environment?"
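The three questions above are neighbourhood lookups with a time bound. The sketch below is an in-memory stand-in, assuming one Redis sorted set per node (member = neighbouring entity, score = last-seen timestamp) so that a window query corresponds to `ZRANGEBYSCORE`. That layout, the `EntityGraph` class and its methods are our assumptions, and edge labels are omitted for brevity.

```python
from collections import defaultdict


class EntityGraph:
    """In-memory stand-in for a Redis adjacency store (hypothetical layout).

    Each node maps to {neighbour: last_seen_ts}, mirroring one Redis sorted
    set per node where the score is the last time the edge was observed.
    """

    def __init__(self) -> None:
        self._adj: dict[str, dict[str, float]] = defaultdict(dict)

    def add_edge(self, a: str, b: str, ts: float) -> None:
        # Store both directions so we can walk from an IP back to hosts.
        # Only move a timestamp forward, like ZADD with the GT flag.
        for src, dst in ((a, b), (b, a)):
            if ts > self._adj[src].get(dst, float("-inf")):
                self._adj[src][dst] = ts

    def neighbours_since(self, node: str, since: float, prefix: str = "") -> set[str]:
        # Roughly ZRANGEBYSCORE adj:<node> <since> +inf, filtered by entity type.
        return {n for n, ts in self._adj.get(node, {}).items()
                if ts >= since and n.startswith(prefix)}

    def ever_linked(self, a: str, b: str) -> bool:
        return b in self._adj.get(a, {})


now = 1_700_000_000.0
g = EntityGraph()
g.add_edge("user:alice", "host:ws-042", now - 86_400)          # yesterday
g.add_edge("host:ws-042", "process:ws-042/1200", now - 7_200)  # 2 h ago
g.add_edge("host:ws-042", "process:ws-042/4312", now - 600)    # 10 min ago
g.add_edge("host:ws-042", "ip:203.0.113.7", now - 60)
g.add_edge("host:db-01", "ip:203.0.113.7", now - 300)

# Has this user authenticated to this host before?
print(g.ever_linked("user:alice", "host:ws-042"))                          # True
# What processes has this host spawned in the past 30 minutes?
print(g.neighbours_since("host:ws-042", now - 1_800, "process:"))          # {'process:ws-042/4312'}
# Has this IP been seen by any other host?
print(g.neighbours_since("ip:203.0.113.7", 0, "host:") - {"host:ws-042"})  # {'host:db-01'}
```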
Kill-Chain State Machine
DKTrace models a 7-stage kill chain:
| Stage | MITRE Tactic | Example Techniques |
|---|---|---|
| 1 | Reconnaissance | T1595, T1592 |
| 2 | Initial Access | T1190, T1566, T1078 |
| 3 | Execution | T1059, T1204 |
| 4 | Persistence | T1053, T1543, T1547 |
| 5 | Privilege Escalation | T1068, T1055, T1548 |
| 6 | Lateral Movement | T1021, T1550 |
| 7 | Impact / Exfiltration / Command and Control | T1486, T1048, T1071 |
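The table above can double as a lookup from a detection's technique ID to its stage. This sketch assumes sub-techniques (e.g. T1059.001) inherit their parent's stage and that the chain only moves forward; `stage_for` and `advance` are hypothetical helpers, and the second value `advance` returns counts skipped stages, which is what a jump like "Stage 3→5 in one hop" would flag.

```python
from typing import Optional

# Stage -> example ATT&CK technique IDs, copied from the table above.
# Treating these examples as the complete mapping is a simplification.
STAGE_TECHNIQUES = {
    1: ("T1595", "T1592"),
    2: ("T1190", "T1566", "T1078"),
    3: ("T1059", "T1204"),
    4: ("T1053", "T1543", "T1547"),
    5: ("T1068", "T1055", "T1548"),
    6: ("T1021", "T1550"),
    7: ("T1486", "T1048", "T1071"),
}

# Invert once so each lookup is a single dict access.
TECHNIQUE_TO_STAGE = {tid: stage for stage, tids in STAGE_TECHNIQUES.items() for tid in tids}


def stage_for(technique_id: str) -> Optional[int]:
    """Return the kill-chain stage for a technique or sub-technique."""
    parent = technique_id.upper().split(".", 1)[0]  # T1059.001 -> T1059
    return TECHNIQUE_TO_STAGE.get(parent)


def advance(current_stage: int, technique_id: str) -> tuple[int, int]:
    """Return (new_stage, stages_skipped) after a detection.

    Hypothetical rule: the chain never moves backwards, and a jump of more
    than one stage is reported so the caller can escalate it.
    """
    stage = stage_for(technique_id)
    if stage is None or stage <= current_stage:
        return current_stage, 0
    return stage, stage - current_stage - 1


print(stage_for("T1059.001"))  # 3
print(advance(3, "T1055"))     # (5, 1)
```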
The state machine transitions between stages based on weighted evidence. A single PowerShell execution (T1059.001) scores 0.3 confidence at Stage 3. Add a phishing email (T1566) received on the same host 2 hours earlier and confidence jumps to 0.7. Add an outbound connection to a known C2 IP and confidence reaches 0.94, which automatically raises a P1 incident.
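The weighting function itself isn't published here. One common way to fuse independent signals is noisy-OR, `1 - prod(1 - w_i)`; the sketch below uses it with made-up weights (0.3, 0.5, 0.8), a made-up 4-hour same-host window and a made-up 0.9 P1 threshold. Its output (0.3, 0.65, 0.93) follows the same upward pattern as the figures above without reproducing them exactly, so treat it as an illustration of evidence accumulation, not DKTrace's scoring.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    host: str
    signal: str    # ATT&CK ID or enrichment label
    weight: float  # hypothetical per-signal confidence in [0, 1]
    ts: float      # epoch seconds


WINDOW_S = 4 * 3600  # hypothetical correlation window
P1_THRESHOLD = 0.9   # hypothetical auto-P1 cut-off


def combined_confidence(evidence, host, now, window_s=WINDOW_S):
    """Noisy-OR over the signals seen on `host` within the window.

    Each independent signal can only raise the score, and the result
    never exceeds 1.0. No matching evidence gives 0.0.
    """
    relevant = (e.weight for e in evidence
                if e.host == host and now - window_s <= e.ts <= now)
    return 1.0 - math.prod(1.0 - w for w in relevant)


now = 1_700_000_000.0
ev = [Evidence("ws-042", "T1059.001", 0.3, now)]              # PowerShell
print(round(combined_confidence(ev, "ws-042", now), 2))       # 0.3
ev.append(Evidence("ws-042", "T1566", 0.5, now - 2 * 3600))   # phishing, 2 h earlier
print(round(combined_confidence(ev, "ws-042", now), 2))       # 0.65
ev.append(Evidence("ws-042", "known-c2-ip", 0.8, now))        # TI hit on dest IP
score = combined_confidence(ev, "ws-042", now)
print(round(score, 2), score >= P1_THRESHOLD)                 # 0.93 True
```

Noisy-OR assumes the signals are independent; correlated detections (two rules firing on the same packet) would inflate the score, which is one reason real engines weight and deduplicate evidence first.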
The Cobalt Strike Timeline (847ms)
T+0ms Agent receives first C2 beacon packet (60s interval, 10% jitter)
T+12ms Event normalised to DCEM canonical model
T+28ms Enrichment: dest IP checked against TI bloom filter → HIT (confidence 0.91)
T+41ms Sigma rule for T1071.001 fires (beaconing to known C2)
T+89ms UEBA: this process has never made external connections (baseline violation)
T+156ms Correlation: same host had PowerShell execution 4 minutes ago (T1059.001)
T+203ms Kill-chain state machine: Stage 3→5 in one hop, escalate
T+390ms Entity graph query: has this host talked to any other internal hosts? YES — 3 hosts
T+612ms Lateral movement pattern detected: T1021.002 (SMB) to 3 hosts
T+847ms P1 incident created, SOC notified via PagerDuty
Why ClickHouse
Real-time correlation at 100K events/sec requires a database that can answer "show me all events involving host X in the last 30 minutes" in under 10ms. PostgreSQL cannot. ClickHouse's columnar storage with ZSTD compression handles this comfortably: at 100K eps a 30-minute window holds 180 million rows (100,000 × 1,800 s), and a query over it returns in 4ms on a 32-core node.
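To show why the sort key matters as much as the columnar layout, here is a toy model, not ClickHouse internals: each column is a separate Python list and rows are sorted by `(host, ts)`, loosely like a MergeTree table declared with `ORDER BY (host, ts)` (that sort key is our assumption). A 30-minute window for one host becomes two binary searches plus a slice of only the requested columns. `ToyColumnStore` is a hypothetical name.

```python
from bisect import bisect_left, bisect_right


class ToyColumnStore:
    """Column-per-list table sorted by (host, ts).

    A loose model of a MergeTree table with ORDER BY (host, ts).
    """

    def __init__(self, rows: list[dict], schema: list[str]) -> None:
        rows = sorted(rows, key=lambda r: (r["host"], r["ts"]))
        # Columnar layout: one list per column, all in sort-key order.
        self.columns = {name: [r[name] for r in rows] for name in schema}
        # Composite sort key used to locate one host's time range.
        self._key = list(zip(self.columns["host"], self.columns["ts"]))

    def window(self, host: str, start: float, end: float, columns: list[str]) -> dict:
        """Rows for `host` with start <= ts <= end, touching only `columns`."""
        lo = bisect_left(self._key, (host, start))
        hi = bisect_right(self._key, (host, end))
        return {c: self.columns[c][lo:hi] for c in columns}


rows = [
    {"host": "ws-042", "ts": 100, "dest_ip": "10.0.0.5", "bytes": 512},
    {"host": "db-01", "ts": 150, "dest_ip": "10.0.0.9", "bytes": 64},
    {"host": "ws-042", "ts": 1900, "dest_ip": "203.0.113.7", "bytes": 300},
    {"host": "ws-042", "ts": 2000, "dest_ip": "10.0.0.7", "bytes": 128},
]
store = ToyColumnStore(rows, ["host", "ts", "dest_ip", "bytes"])
now = 2000
print(store.window("ws-042", now - 1800, now, ["ts", "dest_ip"]))
# {'ts': [1900, 2000], 'dest_ip': ['203.0.113.7', '10.0.0.7']}
```

ClickHouse itself locates the range with a sparse primary index over granules (8,192 rows by default) rather than a per-row binary search, then decompresses only the columns the query names.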
Why NATS JetStream
Event loss during an active incident is unacceptable. NATS JetStream acknowledges an event only after a majority of the stream's replicas (⌊N/2⌋+1 of N, so 2 of 3) have stored it. At 100K eps across a 3-node cluster, a single node failure therefore causes zero event loss, and services reconnect within 5 seconds.
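The N/2+1 rule is ordinary majority-quorum arithmetic (JetStream replicates clustered streams with its own RAFT implementation). This small sketch shows how many replica failures each replica count survives; `quorum` and `tolerated_failures` are our own helper names, not part of any NATS API.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of a replica set: floor(N/2) + 1."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can fail while a majority is still reachable."""
    return replicas - quorum(replicas)


for n in (1, 3, 5):
    print(n, quorum(n), tolerated_failures(n))
# 1 1 0
# 3 2 1
# 5 3 2
```

An even replica count buys nothing extra: 4 replicas need a quorum of 3 and still survive only one failure, the same as 3.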
See It Live
Watch DKTrace detect this threat in your environment
Our engineers will run a live detection simulation against a sample of your log telemetry — no agents, no commitment.