Deep Dive · March 2026 · 14 min read

Inside the DKTrace Detection Engine: How We Catch APTs in Under 2 Seconds

A full technical walkthrough of DKTrace's correlation engine — entity graphs, kill-chain reconstruction, MITRE ATT&CK tactic chaining, and why sub-second detection matters for ransomware containment. We trace a real Cobalt Strike beacon from first packet to P1 incident creation in 847 milliseconds.

In this article

  • Entity graph construction using Redis adjacency lists
  • Kill-chain state machine: 7 stages, 23 transition rules
  • Why ClickHouse columnar storage enables real-time correlation
  • How NATS JetStream guarantees zero-event-loss at 100K eps

DKTrace Research Team

Security Engineering · Threat Research

How DKTrace Detects APTs in Under 2 Seconds

Modern APTs don't trigger single-rule alerts. They operate across multiple kill-chain stages over hours or days, and each step looks benign on its own. DKTrace's correlation engine is built to connect these dots in real time.

The Entity Graph

Every event DKTrace ingests is parsed into entities: users, hosts, processes, network connections, files. These entities are stored as nodes in a Redis-backed adjacency graph. Each event creates or updates edges between nodes — "user X authenticated to host Y", "process Z made network connection to IP A".

The graph is queried by the correlation engine every time a new event arrives. In milliseconds, it can answer: "Has this user authenticated to this host before? What processes has this host spawned in the past 30 minutes? Has this IP been seen by any other host in the environment?"
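To make the three questions above concrete, here is a minimal in-memory sketch of an adjacency graph with timestamped edges. It is a stand-in for illustration, not the DKTrace implementation: the class name `EntityGraph`, its methods, and the relation names are all hypothetical. We assume a Redis-backed version could hold the same per-node neighbour sets, for example with the last-seen timestamp as a sorted-set score.

```python
from collections import defaultdict


class EntityGraph:
    """In-memory stand-in for a Redis-backed adjacency graph (illustrative only)."""

    def __init__(self):
        # node -> {(direction, relation, neighbour): last_seen_timestamp}
        self._adj = defaultdict(dict)

    def observe(self, src, relation, dst, ts):
        """Create or refresh the edge src -[relation]-> dst, seen at time ts."""
        for node, key in ((src, ("out", relation, dst)), (dst, ("in", relation, src))):
            previous = self._adj[node].get(key, ts)
            self._adj[node][key] = max(ts, previous)

    def seen_before(self, src, relation, dst):
        """Has src ever had this relation to dst?"""
        return ("out", relation, dst) in self._adj.get(src, {})

    def neighbours(self, node, relation, direction="out", since=0.0):
        """Neighbours linked to node by relation, last seen at or after `since`."""
        return {
            other
            for (d, rel, other), ts in self._adj.get(node, {}).items()
            if d == direction and rel == relation and ts >= since
        }


# Hypothetical events; timestamps are seconds, 203.0.113.7 is a documentation IP.
graph = EntityGraph()
now = 10_000.0
graph.observe("user:alice", "authenticated_to", "host:ws-17", now - 7200)
graph.observe("host:ws-17", "spawned", "proc:powershell.exe", now - 240)
graph.observe("host:ws-17", "spawned", "proc:chrome.exe", now - 3600)
graph.observe("host:ws-17", "connected_to", "ip:203.0.113.7", now)
graph.observe("host:db-02", "connected_to", "ip:203.0.113.7", now - 60)

# "What processes has this host spawned in the past 30 minutes?"
recent_procs = graph.neighbours("host:ws-17", "spawned", since=now - 1800)

# "Has this IP been seen by any other host in the environment?"
other_hosts = graph.neighbours("ip:203.0.113.7", "connected_to", direction="in") - {"host:ws-17"}
```

Storing each edge in both directions doubles the writes but lets the "who else has talked to this IP" question be answered from a single node's neighbour set.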

Kill-Chain State Machine

DKTrace models a 7-stage kill chain:

Stage   MITRE Tactic            Example Techniques
1       Reconnaissance          T1595, T1592
2       Initial Access          T1190, T1566, T1078
3       Execution               T1059, T1204
4       Persistence             T1053, T1543, T1547
5       Privilege Escalation    T1068, T1055, T1548
6       Lateral Movement        T1021, T1550
7       Impact / Exfil          T1486, T1048, T1071

The state machine transitions between stages based on weighted evidence. A single T1059 PowerShell execution scores 0.3 confidence at Stage 3. Add a T1566 phishing email 2 hours earlier on the same host and confidence jumps to 0.7. Add an outbound connection to a known C2 IP and you're at 0.94 — automatic P1.
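One way to read the 0.3, 0.7, 0.94 progression is as a noisy-OR combination, where each piece of evidence independently reduces the remaining doubt. The sketch below is an assumption on our part, not a statement of the engine's scoring formula. The weight names, the weights 4/7 and 0.8 (picked so the running total reproduces the figures in the example), and the 0.9 P1 threshold are all hypothetical.

```python
# Hypothetical per-evidence weights. Only the 0.3 for a lone T1059 execution
# comes from the article; the other two are back-solved to match 0.7 and 0.94.
EVIDENCE_WEIGHTS = {
    "T1059_powershell": 0.3,
    "T1566_phishing_same_host": 4 / 7,
    "known_c2_connection": 0.8,
}

P1_THRESHOLD = 0.9  # assumed cut-off for an automatic P1


def combined_confidence(evidence):
    """Noisy-OR: confidence = 1 - product of (1 - weight) over all evidence."""
    remaining_doubt = 1.0
    for name in evidence:
        remaining_doubt *= 1.0 - EVIDENCE_WEIGHTS[name]
    return 1.0 - remaining_doubt


def severity(confidence):
    """Map a confidence score to a severity label under the assumed threshold."""
    return "P1" if confidence >= P1_THRESHOLD else "monitor"


step1 = combined_confidence(["T1059_powershell"])
step2 = combined_confidence(["T1059_powershell", "T1566_phishing_same_host"])
step3 = combined_confidence(
    ["T1059_powershell", "T1566_phishing_same_host", "known_c2_connection"]
)
```

A property of noisy-OR that suits this use is that confidence only ever rises as evidence accumulates and never exceeds 1, so the order in which events arrive does not change the final score.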

The Cobalt Strike Timeline (847ms)

T+0ms     Agent receives first C2 beacon packet (60s interval, 10% jitter)
T+12ms    Event normalised to DCEM canonical model
T+28ms    Enrichment: dest IP checked against TI bloom filter → HIT (confidence 0.91)
T+41ms    Sigma rule T1071.001 fires (beaconing to known C2)
T+89ms    UEBA: this process has never made external connections (baseline violation)
T+156ms   Correlation: same host had PowerShell execution 4 minutes ago (T1059.001)
T+203ms   Kill-chain state machine: Stage 3→5 in one hop, escalate
T+390ms   Entity graph query: has this host talked to any other internal hosts? YES — 3 hosts
T+612ms   Lateral movement pattern detected: T1021.002 (SMB) to 3 hosts
T+847ms   P1 incident created, SOC notified via PagerDuty
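The T+28ms enrichment step relies on a Bloom filter, which answers "definitely not present" or "possibly present" without storing the indicator list itself. Below is a minimal sketch of that idea, assuming a two-step lookup where the filter screens out clean IPs and an exact feed lookup supplies the confidence score. The `BloomFilter` class, the `ti_lookup` helper, the filter sizing, and the feed contents are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, a tunable rate of false positives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits          # must be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def ti_lookup(ip, bloom, feed):
    """Return the feed's confidence for ip, or None if it is not a known indicator."""
    if not bloom.might_contain(ip):
        return None                    # fast path: definitely not in the feed
    return feed.get(ip)                # exact check also rules out false positives


# Hypothetical threat-intel feed keyed by IP (documentation address range).
feed = {"203.0.113.7": 0.91}
bloom = BloomFilter()
for indicator in feed:
    bloom.add(indicator)

hit = ti_lookup("203.0.113.7", bloom, feed)
miss = ti_lookup("198.51.100.20", bloom, feed)
```

The filter keeps the common case cheap: almost all destination IPs are clean and are rejected with a few bit tests, so the exact lookup only runs for the rare candidates.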

Why ClickHouse

Real-time correlation at 100K events/sec requires a database that can answer "show me all events involving host X in the last 30 minutes" in under 10ms. At that rate, a 30-minute window holds 100,000 × 1,800 = 180 million rows. A row-oriented store such as PostgreSQL is not built for this access pattern at that volume. ClickHouse's columnar storage with ZSTD compression handles it comfortably: a 30-minute window query over those 180 million rows returns in 4ms on a 32-core node.
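As a rough illustration of the window query, the sketch below builds a parameterised ClickHouse statement and works out how many rows the window covers. The table name `events` and the columns `host` and `ts` are hypothetical, as are both helper functions. The `{name:Type}` placeholders are ClickHouse's query-parameter syntax. We also assume the table's sort key begins with the host and timestamp columns, since a single-host lookup this fast depends on skipping most of the data rather than scanning it.

```python
def rows_in_window(events_per_sec, window_minutes):
    """Number of rows a time window covers at a steady ingest rate."""
    return events_per_sec * window_minutes * 60


def host_window_query(host, window_minutes=30):
    """Build a parameterised query for all recent events involving one host.

    Returns the SQL text and the parameter values separately, so the values
    are bound by the client rather than concatenated into the statement.
    """
    sql = (
        "SELECT * FROM events "                                   # hypothetical table
        "WHERE host = {host:String} "                             # hypothetical column
        "AND ts >= now() - toIntervalMinute({minutes:UInt32}) "
        "ORDER BY ts"
    )
    return sql, {"host": host, "minutes": window_minutes}


window_rows = rows_in_window(100_000, 30)
sql, params = host_window_query("ws-17")
```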

Why NATS JetStream

Event loss during an active incident is unacceptable. NATS JetStream replicates every event to a quorum of ⌊N/2⌋+1 nodes before acknowledging receipt. At 100K eps across a 3-node cluster, that quorum is 2 nodes, so a single node failure loses no acknowledged events and services reconnect within 5 seconds.
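The majority rule is easy to check by hand. This short sketch computes the quorum size and the number of node failures a cluster can absorb; the function names are ours, not part of any NATS API.

```python
def quorum(replicas):
    """Smallest majority of a replica set: floor(N / 2) + 1."""
    if replicas < 1:
        raise ValueError("replica count must be at least 1")
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """How many replicas can fail while a majority is still reachable."""
    return replicas - quorum(replicas)


three_node = (quorum(3), tolerated_failures(3))   # the 3-node cluster from the text
five_node = (quorum(5), tolerated_failures(5))
```

For the 3-node cluster this gives a quorum of 2 and one tolerated failure. Going from 3 to 4 nodes raises the quorum to 3 without tolerating any more failures, which is why odd cluster sizes are the usual choice.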
