Deployment · June 2025 · 13 min read

DKTrace High-Availability Architecture — Zero-Downtime SOC

How to deploy DKTrace in active-active HA across two data centres with automated failover under 30 seconds. Covers NATS JetStream clustering, ClickHouse replication, PostgreSQL Patroni, and Redis Sentinel.


DKTrace Research Team

Security Engineering · Threat Research

Design Principle: No Single Point of Failure

A SOC platform that goes offline during an attack is worse than no SOC platform. DKTrace targets 99.95% availability (< 4.4 hours downtime/year).
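
The downtime figure follows directly from the availability target. A minimal sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not part of DKTrace):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_hours(availability_pct: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


print(f"{downtime_budget_hours(99.95):.2f} hours/year")  # 4.38 hours/year
```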

NATS JetStream Cluster (3-Node Minimum)

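NATS JetStream replicates each stream with Raft. A message is acknowledged only after a quorum of the stream's replicas has persisted it: floor(R/2) + 1 for R replicas, so 2 of 3 in a three-node cluster: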

yaml
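# nats-cluster.conf
# JetStream clustering needs a unique server_name on every node;
# "nats-01" here is this node's name (use nats-02, nats-03 on the others).
server_name: nats-01
cluster {
  name: dktrace-cluster
  # Port on which this node accepts route connections from its peers.
  listen: 0.0.0.0:6222
  routes: [
    nats://nats-01.internal:6222
    nats://nats-02.internal:6222
    nats://nats-03.internal:6222
  ]
}
jetstream {
  store_dir: /data/nats
  max_memory_store: 8GB
  max_file_store: 500GB
}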
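
To make the quorum rule concrete, here is a small sketch of how many node failures a replicated stream can absorb. It assumes the usual majority-quorum rule; the function names are illustrative only:

```python
def quorum(replicas: int) -> int:
    """Smallest majority of a replica set."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replicas - quorum(replicas)


for r in (1, 3, 5):
    print(f"replicas={r} quorum={quorum(r)} tolerated_failures={tolerated_failures(r)}")
```

With three replicas the quorum is two, which is why a three-node cluster is the minimum that survives the loss of one node.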

At 100K events/sec, a single node failure causes zero loss of acknowledged events, because each one has already been persisted on a quorum of nodes. Consumers rebalance within 5 seconds. DKTrace services reconnect with exponential backoff (1s→2s→4s→8s→30s max).
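
The reconnect schedule can be sketched as a capped doubling sequence. This is an illustration rather than DKTrace's actual client code: the generator name is hypothetical, and it assumes plain doubling up to the cap (so it passes through 16s before reaching 30s) with no jitter:

```python
from itertools import islice
from typing import Iterator


def backoff_delays(base: float = 1.0, cap: float = 30.0) -> Iterator[float]:
    """Yield reconnect delays in seconds: base, 2*base, 4*base, ... capped at cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


# First seven delays a client would wait between reconnect attempts.
print(list(islice(backoff_delays(), 7)))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

In practice a small random jitter is often added to each delay so that many services do not reconnect at the same instant after a failover.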

ClickHouse Replication (2 Shards × 2 Replicas)

xml
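<remote_servers>
  <dktrace_cluster>
    <shard>
      <!-- Let ReplicatedMergeTree copy data between replicas instead of the Distributed table writing to each one -->
      <internal_replication>true</internal_replication>
      <replica><host>clickhouse-01</host><port>9000</port></replica>
      <replica><host>clickhouse-02</host><port>9000</port></replica>
    </shard>
    <shard>
      <internal_replication>true</internal_replication>
      <replica><host>clickhouse-03</host><port>9000</port></replica>
      <replica><host>clickhouse-04</host><port>9000</port></replica>
    </shard>
  </dktrace_cluster>
</remote_servers>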

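canonical_events uses ReplicatedMergeTree, so any node can serve reads. Writes go to one replica of the target shard and are replicated to the other before the ACK. Note that ClickHouse replicates asynchronously by default, so this guarantee depends on inserts being issued with insert_quorum = 2.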
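
Before rolling out a topology change, it can help to check that every shard still has at least two replicas. The following is a rough sketch using only the Python standard library. The helper names are hypothetical, and the XML is the shard layout shown above, embedded as a string for illustration:

```python
import xml.etree.ElementTree as ET
from typing import List

CLUSTER_XML = """
<remote_servers>
  <dktrace_cluster>
    <shard>
      <replica><host>clickhouse-01</host><port>9000</port></replica>
      <replica><host>clickhouse-02</host><port>9000</port></replica>
    </shard>
    <shard>
      <replica><host>clickhouse-03</host><port>9000</port></replica>
      <replica><host>clickhouse-04</host><port>9000</port></replica>
    </shard>
  </dktrace_cluster>
</remote_servers>
"""


def shard_replicas(xml_text: str, cluster: str) -> List[List[str]]:
    """Return the replica hostnames of each shard in the named cluster."""
    root = ET.fromstring(xml_text)
    cluster_node = root.find(cluster)
    if cluster_node is None:
        raise ValueError(f"cluster {cluster!r} not found")
    return [
        [replica.findtext("host", default="") for replica in shard.findall("replica")]
        for shard in cluster_node.findall("shard")
    ]


def under_replicated(shards: List[List[str]], minimum: int = 2) -> List[int]:
    """Return the indexes of shards with fewer than `minimum` replicas."""
    return [i for i, hosts in enumerate(shards) if len(hosts) < minimum]


shards = shard_replicas(CLUSTER_XML, "dktrace_cluster")
print(shards)
print(under_replicated(shards))  # an empty list means every shard has a spare replica
```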

PostgreSQL with Patroni

DKTrace's incident-store uses PostgreSQL via Patroni for automated HA:

Leader election via etcd (3-node etcd cluster)
Automatic failover: < 10 seconds
PgBouncer connection pooling (all services connect to PgBouncer, not directly to PostgreSQL)
Read replicas serve all GET requests; leader serves writes
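
The read/write split in the last point can be sketched as a small routing function. This is illustrative only: the PgBouncer addresses and the function name are hypothetical, and round-robin is just one possible way to spread reads across replicas:

```python
import itertools

# Hypothetical PgBouncer endpoints; real hostnames will differ.
LEADER = "pgbouncer-leader.internal:6432"
REPLICAS = ["pgbouncer-replica-1.internal:6432", "pgbouncer-replica-2.internal:6432"]

_replica_cycle = itertools.cycle(REPLICAS)


def pick_backend(http_method: str) -> str:
    """Send GET requests to a read replica (round-robin); everything else to the leader."""
    if http_method.upper() == "GET":
        return next(_replica_cycle)
    return LEADER


print(pick_backend("GET"))
print(pick_backend("POST"))
```

Streaming replicas can lag slightly behind the leader, so a request that must read its own write immediately may still need to go to the leader.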

Disaster Recovery

Metric | Target | Last Tested
RTO | < 30 minutes | 28 seconds
RPO | < 1 minute | 0 events lost
Last full DR test | March 2026 | Pass

Run DR test: ./scripts/disaster-recovery/failover.sh --dry-run

Run actual failover: ./scripts/disaster-recovery/failover.sh --confirm
