
Cheap audit logs in ClickHouse: the schema we're using for Basira's command log

By Behroz Saadat
ClickHouse · Architecture · Observability

Every command Basira's agent runs against a customer database needs to be auditable. Who ran it, against which database, what SQL was executed, what the result was, when it happened. Customers ask us about this in their first security review and we need a real answer, not a gesture at log_statement = 'all' in Postgres.

We've run into the same problem most teams hit at scale: audit logs in Postgres are easy to start and painful to keep. A billion rows of audit data in a transactional Postgres instance bloats the heap, slows VACUUM, fights for buffer cache with your actual workload, and turns every compliance query ("show me everything user X did last quarter") into a sequential scan.

ClickHouse solves this, but only if you pick the right schema. We've been designing the table that will back Basira's agent command-audit trail, and the pattern is the same one we use for every telemetry table in the product today. This post walks through the DDL choice by choice — what we picked, what we rejected, and why.

Why not Postgres

Postgres is fine for the first few million audit rows. The pain starts later:

  • MVCC bloat. Audit tables are write-heavy and almost never updated or deleted in place. But every INSERT still produces a live tuple, and retention-driven DELETEs produce dead tuples that VACUUM has to reclaim. You either pay for a big autovacuum budget or you pay for bloat.
  • Row-major storage. Audit rows have a handful of narrow, frequently filtered columns (actor, target, timestamp) and a wide variable-length payload (the full event details). Postgres stores each row contiguously, so every scan pays for the columns you don't need.
  • Retention is a chore. Dropping old audit data in Postgres means either partitioning the table (and rolling partitions) or running batched DELETEs. Both are operational work you shouldn't have to do.
  • Cost per GB. Managed Postgres storage runs roughly 10-20× what managed ClickHouse object storage costs, and audit data compresses ~5-10× better in ClickHouse than in Postgres because the storage is columnar and audit events are heavily redundant.

If you already run ClickHouse for analytics (we do), the marginal cost of an audit-log table is close to zero.

The schema

Here's the table, followed by a breakdown of each decision:

CREATE TABLE command_audit (
    org_id          String,
    db_instance_id  String,
    actor_id        String,
    event_time      DateTime64(3, 'UTC'),
    command_id      String,
    action          LowCardinality(String),
    target_type     LowCardinality(String),
    target_id       String,
    status          LowCardinality(String),
    sql_text        String CODEC(ZSTD(3)),
    result_summary  String CODEC(ZSTD(3)),
    payload         String CODEC(ZSTD(3))
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (org_id, actor_id, event_time)
TTL toDateTime(event_time) + INTERVAL 365 DAY;

Every field is deliberate. Let's go through the choices that actually matter.

ORDER BY (org_id, actor_id, event_time)

In ClickHouse the ORDER BY clause controls two things: how data is physically sorted on disk within each part, and what the sparse primary index points at. The index lets ClickHouse skip most of the data when your WHERE clause filters on the leading columns.

We put org_id first because Basira is multi-tenant and almost every query is scoped to a single organization. Putting it first gives us free tenant isolation at the physical-layout level — a query for one org never touches another org's data.

Next is actor_id. The single most common audit query is "what did user X do?" or "what did the agent do against database Y?". With actor_id as the second sort key, this is an index range scan: ClickHouse jumps to the actor's region of the part and reads sequentially. No full scan.

event_time comes last. Within a given (org_id, actor_id) pair, rows are sorted chronologically, so "what did user X do between T1 and T2" is also an index range.

The pattern we rejected: putting event_time first. Time-first ordering makes "what happened in the last hour across all actors" fast, but audit queries are rarely shaped like that; they're almost always scoped to an actor. Putting time first would also weaken tenant isolation, because rows from every org would interleave by timestamp instead of sitting in contiguous ranges.
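If you want to see the pruning, EXPLAIN with the indexes setting reports how many parts and granules the primary key selects for a given filter. A minimal sketch (the literal IDs are placeholders):

-- How much does the primary key prune for an actor-scoped query?
EXPLAIN indexes = 1
SELECT count()
FROM command_audit
WHERE org_id = 'org_123'
  AND actor_id = 'actor_456'
  AND event_time >= now() - INTERVAL 7 DAY;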

PARTITION BY toYYYYMM(event_time)

Partitioning is how ClickHouse physically groups parts. Monthly partitions are the sweet spot for audit logs:

  • Retention is a metadata operation. When a partition ages out, the background TTL cleanup drops expired parts wholesale, and you can always remove a month by hand with ALTER TABLE ... DROP PARTITION: zero scan, zero rewrite either way. Daily partitions would work too but create too many parts; yearly partitions are too coarse for retention cleanup.
  • Queries prune partitions. A query for last month's activity only touches one partition. This is independent of the primary-key index and stacks with it.
  • Backups are per-partition. You can copy or restore a single month cheaply.

We use toYYYYMM(event_time) across every MergeTree table in Basira (server_metrics, alert_history, query_stats, and so on) for exactly these reasons.
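Two queries we reach for around partitioning, sketched here with an illustrative partition ID: checking what each month holds, and dropping a month by hand when a cleanup can't wait for TTL.

-- Per-partition row counts and on-disk size
SELECT
    partition,
    sum(rows)                              AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE database = currentDatabase() AND table = 'command_audit' AND active
GROUP BY partition
ORDER BY partition;

-- Drop one month immediately (partition ID shown is a placeholder)
ALTER TABLE command_audit DROP PARTITION '202401';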

LowCardinality(String) for enums

action, target_type, and status are effectively enums — a small set of values repeated billions of times (query_run, migration_applied, connection_opened for action; success, failed, timeout for status).

ClickHouse's LowCardinality type dictionary-encodes the values, storing each row as a small integer index into a per-part dictionary. For columns with fewer than ~10,000 distinct values this typically compresses 10-100× vs raw String, and it speeds up GROUP BY and WHERE filters because equality checks happen on integer indexes.

We don't use LowCardinality for actor_id or target_id — those are high-cardinality UUIDs/IDs where dictionary encoding would hurt more than help.
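A quick sanity check before committing a column to LowCardinality is to count distinct values; the enum-ish columns should be tiny and the ID columns will not be. A sketch:

-- Distinct-value counts per column
SELECT
    uniqExact(action)    AS actions,
    uniqExact(status)    AS statuses,
    uniqExact(actor_id)  AS actors,
    uniqExact(target_id) AS targets
FROM command_audit;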

ZSTD on the large payload columns

sql_text, result_summary, and payload are the bulky free-form columns. The default LZ4 codec is fast but compresses less than ZSTD. For columns you write once and read occasionally — which audit payloads are — ZSTD is the right trade.

We picked ZSTD(3) specifically. Level 3 is the default for zstd and benchmarks consistently as the best ratio/speed trade for most workloads. Levels above 9 get diminishing returns and slow writes noticeably. In our telemetry tables, ZSTD(3) on JSON-shaped payloads gets us 5-10× compression over LZ4.

There's a subtle win here: the large columns are compressed independently of the primary-key columns. A query that only touches org_id, actor_id, event_time, and action never pays the cost of reading or decompressing payload. This is the columnar-storage dividend.
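Both effects are easy to measure from system.columns: the codec ratio per column, and how little the narrow columns weigh next to the payloads. A sketch:

-- Compressed vs. uncompressed bytes per column
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE database = currentDatabase() AND table = 'command_audit'
ORDER BY data_compressed_bytes DESC;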

TTL for retention

TTL toDateTime(event_time) + INTERVAL 365 DAY;

One line, and retention is solved. ClickHouse's background TTL thread drops expired rows (or whole parts if the whole part is expired) automatically. We don't run a retention cron, we don't batch DELETEs, we don't worry about VACUUM. A compliance change from 365 → 730 days is a one-line ALTER TABLE MODIFY TTL and ClickHouse takes care of the rest.
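Spelled out, that hypothetical compliance change looks like this (730 days is the assumed new requirement):

-- Extend retention from 365 to 730 days
ALTER TABLE command_audit
    MODIFY TTL toDateTime(event_time) + INTERVAL 730 DAY;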

The toDateTime(event_time) wrapper is there because event_time is DateTime64(3, 'UTC') (millisecond precision) and a TTL expression has to evaluate to Date or DateTime. This is a common gotcha: we hit it ourselves and it cost us half an hour the first time.

Query patterns: what's fast, what isn't

With this schema, these queries are fast (index range scans):

-- What did this actor do last week?
SELECT event_time, action, target_type, target_id, status
FROM command_audit
WHERE org_id = '...'
  AND actor_id = '...'
  AND event_time >= now() - INTERVAL 7 DAY
ORDER BY event_time DESC;

-- Failed commands by a given actor against a target DB
SELECT actor_id, event_time, sql_text
FROM command_audit
WHERE org_id = '...'
  AND actor_id = '...'
  AND status = 'failed'
  AND target_id = '...';

These queries are not fast with this schema (full scan required):

-- Full-text search inside sql_text or payload
SELECT * FROM command_audit
WHERE org_id = '...'
  AND sql_text LIKE '%DROP TABLE%';

-- Cross-actor aggregations scoped only by time
SELECT action, count() FROM command_audit
WHERE org_id = '...'
  AND event_time >= now() - INTERVAL 1 DAY
GROUP BY action;

For the full-text case, you have two options: add a tokenbf_v1 skip index on sql_text, or move full-text search out of ClickHouse entirely into a tool built for it. For the cross-actor time-range case, a projection with a time-first ORDER BY lets ClickHouse silently route those queries to the projection.
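For reference, hedged sketches of both; the index parameters and the projection name are illustrative, not tuned values we run in production.

-- Option 1: token bloom-filter skip index on sql_text
ALTER TABLE command_audit
    ADD INDEX sql_text_tokens sql_text TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4;
ALTER TABLE command_audit MATERIALIZE INDEX sql_text_tokens;

-- Option 2: a time-first projection for cross-actor, time-scoped queries
ALTER TABLE command_audit
    ADD PROJECTION by_time (SELECT * ORDER BY (org_id, event_time));
ALTER TABLE command_audit MATERIALIZE PROJECTION by_time;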

We haven't needed either yet. Audit queries in practice almost always filter by actor or target. YAGNI applies here.

Multi-tenancy, GDPR, and tamper-evidence

Three operational concerns worth calling out:

Tenant isolation is handled by org_id being first in the sort key. Queries that forget to filter by org_id still work — they'll just scan every org's data and be slow. At the API layer we enforce org_id in every query; the schema is a defense-in-depth layer.

GDPR right-to-delete is where MergeTree gets awkward. You can ALTER TABLE ... DELETE by actor_id, but mutations in ClickHouse rewrite whole parts and are expensive. For audit specifically, the answer is usually retention + pseudonymization: rather than delete on request, you replace PII in payload with a tombstone and let TTL do the rest. For hard-delete requirements, you either accept the mutation cost or move PII out of the audit table entirely.
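A sketch of the pseudonymization path, assuming PII lives only in the free-form columns (the tombstone shape and IDs are illustrative):

-- Mutation: replace PII for one actor with a tombstone; expensive, run rarely
ALTER TABLE command_audit
    UPDATE
        sql_text       = '[redacted]',
        result_summary = '[redacted]',
        payload        = '{"redacted": true, "reason": "gdpr_erasure"}'
    WHERE org_id = 'org_123' AND actor_id = 'actor_456';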

Tamper-evidence isn't something the schema alone gives you. If you need hash-chained audit logs (each row signed with the hash of the previous), compute the hash at ingest and store it in an extra column. We don't ship this today because our customers haven't asked for it, but it's a simple addition.

FAQ

How much does this actually cost at scale? At 1 billion rows of typical audit events (~200 bytes compressed per row), you're storing around 200 GB. On ClickHouse Cloud or a self-hosted cluster with object storage, that's a few dollars per month. The per-row cost is dominated by the ZSTD-compressed payload column.

What about very high cardinality in target_id? Fine. target_id isn't in the primary key and isn't LowCardinality. High-cardinality random IDs are exactly what plain String columns are for.

Why DateTime64(3) instead of DateTime? Millisecond precision matters when events fire in bursts — two commands in the same second are common. The storage cost difference is marginal.

Should I use ReplacingMergeTree for idempotent ingest? Usually no. ReplacingMergeTree helps when you genuinely have duplicates at ingest (at-least-once delivery). If your ingest pipeline already deduplicates, plain MergeTree is simpler and faster.

What if I want to query across multiple orgs (for platform-wide analytics)? Build a materialized view that re-sorts by (action, event_time) and aggregates. Don't query the raw audit table cross-org.
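A sketch of what that could look like; the view name, the hourly granularity, and the SummingMergeTree choice are assumptions, not our production DDL.

-- Hourly platform-wide action counts, maintained incrementally
CREATE MATERIALIZED VIEW command_audit_action_counts
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (action, hour)
AS
SELECT
    action,
    toStartOfHour(event_time) AS hour,
    count() AS commands
FROM command_audit
GROUP BY action, hour;

When querying it, aggregate again (sum(commands) grouped by action), since SummingMergeTree only collapses rows at merge time.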

Closing

The same pattern applies to any event-stream table: partition monthly, order by (tenant, actor-or-entity, time), LowCardinality on the enum columns, ZSTD on the large free-form columns, TTL for retention. We use variants of this in every MergeTree table in Basira — server_metrics, alert_history, query_stats, and now command_audit.

If you're hand-rolling audit logs in Postgres because you don't have a ClickHouse cluster, that might be the bigger cost. A tiny self-hosted ClickHouse node costs less than a month of the Postgres bloat you're about to accrue. And if you're already on ClickHouse for other reasons, adding audit is near-free — as long as you pick the right primary key on the first try.

If you need to audit what's happening inside Postgres specifically, that's a different problem with its own set of trade-offs — we'll cover pgaudit, triggers, and logical decoding in the next post.
