DKTrace High-Availability Architecture — Zero-Downtime SOC
How to deploy DKTrace in active-active HA across two data centres with automated failover under 30 seconds. Covers NATS JetStream clustering, ClickHouse replication, PostgreSQL Patroni, and Redis Sentinel.
DKTrace Research Team
Security Engineering · Threat Research
Design Principle: No Single Point of Failure
A SOC platform that goes offline during an attack is worse than no SOC platform. DKTrace targets 99.95% availability (< 4.4 hours downtime/year).
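The downtime figure follows directly from the availability target. A minimal sketch of the arithmetic, assuming a 365-day year; the function name is illustrative and not part of DKTrace:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_hours(availability_pct: float) -> float:
    """Return the yearly downtime (in hours) permitted by an availability target."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


budget = downtime_budget_hours(99.95)
print(f"{budget:.2f} hours/year")  # 4.38 hours/year
```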
NATS JetStream Cluster (3-Node Minimum)
NATS JetStream acknowledges a published message only after a quorum of the stream's replicas (N/2+1 of N, so 2 of 3 in a three-node cluster) has persisted it:
# nats-cluster.conf
cluster {
  name: dktrace-cluster
  routes: [
    nats://nats-01.internal:6222
    nats://nats-02.internal:6222
    nats://nats-03.internal:6222
  ]
}
jetstream {
  store_dir: /data/nats
  max_memory_store: 8GB
  max_file_store: 500GB
}
At 100K events/sec: a single node failure causes zero event loss. Consumers rebalance within 5 seconds. DKTrace services reconnect with exponential backoff (1s→2s→4s→8s→30s max).
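The quorum rule and the reconnect schedule can both be sketched in a few lines. This is an illustration under stated assumptions, not DKTrace's client code: `quorum_size` and `backoff_delays` are hypothetical names, and the backoff is assumed to be plain doubling capped at 30 seconds. The schedule quoted above jumps from 8s straight to 30s, so the real step sequence may differ.

```python
import itertools
from typing import Iterator


def quorum_size(replicas: int) -> int:
    """Smallest majority of `replicas`, i.e. floor(N/2) + 1."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return replicas // 2 + 1


def backoff_delays(base: float = 1.0, cap: float = 30.0) -> Iterator[float]:
    """Yield reconnect delays that double on each attempt and never exceed `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)  # assumption: pure doubling up to the cap


print(quorum_size(3))  # 2
print(list(itertools.islice(backoff_delays(), 7)))
# [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With three replicas the quorum is two, which is why one node can fail without losing acknowledged messages.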
ClickHouse Replication (2 Shards × 2 Replicas)
<remote_servers>
  <dktrace_cluster>
    <shard>
      <replica><host>clickhouse-01</host><port>9000</port></replica>
      <replica><host>clickhouse-02</host><port>9000</port></replica>
    </shard>
    <shard>
      <replica><host>clickhouse-03</host><port>9000</port></replica>
      <replica><host>clickhouse-04</host><port>9000</port></replica>
    </shard>
  </dktrace_cluster>
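</remote_servers>
canonical_events uses ReplicatedMergeTree, so any node can serve reads. Writes go to a replica of the target shard and are replicated to its peer before the ACK. Note that ClickHouse replicates asynchronously by default, so acknowledging only after replication depends on the insert_quorum setting being enabled.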
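To make the 2 × 2 layout concrete, the sketch below models which replicas hold a given event and whether the cluster still covers every shard after some hosts fail. Host names come from the configuration above; the CRC32-based sharding key and all function names are assumptions for illustration only, since ClickHouse applies its own sharding expression.

```python
import zlib

# Topology from the <remote_servers> block: two shards, two replicas each.
CLUSTER = [
    ["clickhouse-01", "clickhouse-02"],
    ["clickhouse-03", "clickhouse-04"],
]


def shard_for(event_id: str, cluster=CLUSTER) -> int:
    """Map an event to a shard index with a stable hash (hypothetical sharding key)."""
    return zlib.crc32(event_id.encode("utf-8")) % len(cluster)


def read_candidates(event_id: str, cluster=CLUSTER) -> list[str]:
    """Replicas that hold the shard containing this event."""
    return list(cluster[shard_for(event_id, cluster)])


def survives(failed: set[str], cluster=CLUSTER) -> bool:
    """True if every shard still has at least one healthy replica."""
    return all(any(host not in failed for host in shard) for shard in cluster)


print(survives({"clickhouse-01"}))                   # True
print(survives({"clickhouse-01", "clickhouse-03"}))  # True
print(survives({"clickhouse-01", "clickhouse-02"}))  # False
```

The last case shows the limit of this layout: losing both replicas of the same shard makes that shard's data unavailable, even though two nodes are still up.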
PostgreSQL with Patroni
DKTrace's incident-store runs PostgreSQL under Patroni for automated HA.
Disaster Recovery
| Metric | Target | Last test result |
|---|---|---|
| RTO | < 30 minutes | 28 seconds |
| RPO | < 1 minute | 0 events lost |
Last full DR test: March 2026 (pass).
Run DR test: ./scripts/disaster-recovery/failover.sh --dry-run
Run actual failover: ./scripts/disaster-recovery/failover.sh --confirm