ClickHouse MergeTree Engine: Understanding Parts, Merges, and Performance
If you're running ClickHouse in production, you're running MergeTree. It's the default table engine for almost every use case: analytics, log storage, time-series data, event tracking. ReplacingMergeTree, AggregatingMergeTree, CollapsingMergeTree, and the replicated variants are all built on top of MergeTree.
Understanding how MergeTree handles data internally is the difference between a cluster that runs smoothly and one that regularly hits "too many parts" errors, merge storms, or memory exhaustion. This isn't abstract knowledge. These are the operational issues that wake people up at 3am.
How MergeTree Works
MergeTree's write path is fundamentally different from OLTP databases like PostgreSQL. There's no WAL, no row-level updates, no in-place modification. Instead, MergeTree uses an LSM-tree-inspired approach optimized for high-throughput analytical writes.
The Write Path
When you INSERT data into a MergeTree table, ClickHouse does the following:
- Creates a new "part": The inserted data is written to a new directory on disk. Each part is a self-contained unit containing column files, a primary index, and metadata. Parts are immutable once written, which means there are no partial writes or corruption risks during insertion.
- Sorts by primary key: Within each part, data is sorted by the table's ORDER BY key. This sorting enables efficient range queries and sparse index lookups (a minimal table definition follows this list).
- Writes column files: Each column is stored in a separate file within the part directory. This columnar storage is what makes ClickHouse fast for analytical queries that read only a subset of columns.
- Returns immediately: The INSERT completes as soon as the part is written to disk. There's no waiting for background processing. This is why ClickHouse can achieve very high insert throughput.
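For reference, here's a minimal sketch of a MergeTree table definition showing where the ORDER BY and PARTITION BY clauses fit. The table and column names are illustrative, not from a real schema:
CREATE TABLE events
(
    event_time DateTime,
    event_type LowCardinality(String),
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)  -- one partition per month
ORDER BY (event_type, event_time); -- sorting key, also used for the sparse primary index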
Background Merges
Over time, a table accumulates many small parts. Background merge threads continuously combine smaller parts into larger ones. This is the "merge" in MergeTree.
Merges serve several purposes:
- Reduce part count: Fewer parts means fewer files to open and fewer index lookups during queries
- Apply deduplication: For ReplacingMergeTree, merges are when duplicate rows with the same sorting key are collapsed
- Apply aggregation: For AggregatingMergeTree, merges combine partial aggregate states
- Optimize storage: Merged parts can achieve better compression because larger data blocks have more opportunities for compression algorithms to find patterns
The merge process is automatic. ClickHouse selects which parts to merge based on a heuristic that balances merge size (larger merges are more efficient per byte but take longer) against part count (too many parts hurt query performance).
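Assuming the part_log is enabled on your server (it is in most recent default configurations), you can see which merges the scheduler actually ran, how many parts each one combined, and how long it took:
SELECT
    event_time,
    table,
    part_name,
    rows,
    length(merged_from) AS source_parts,
    duration_ms
FROM system.part_log
WHERE event_type = 'MergeParts'
ORDER BY event_time DESC
LIMIT 10;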
The Part Lifecycle
A part goes through these states:
- Active: Created by an INSERT. Visible to queries. Candidate for merging.
- Merging: Selected by the merge scheduler. While the merge runs, the original parts remain active and queryable.
- Inactive: After a merge completes, the original parts are marked inactive. They're no longer visible to queries but remain on disk temporarily.
- Deleted: Inactive parts are cleaned up after old_parts_lifetime (default: 8 minutes).
You can see all parts for a table with:
SELECT
partition,
name,
active,
rows,
bytes_on_disk,
modification_time
FROM system.parts
WHERE table = 'my_table' AND database = 'default'
ORDER BY modification_time DESC;
The "Too Many Parts" Problem
This is the most common ClickHouse operational issue, and understanding it requires understanding the merge scheduler.
What Causes It
Every INSERT creates a new part. If you insert data in small batches frequently (e.g., one row at a time, or 100 rows every 100ms), you create parts faster than background merges can consolidate them. The part count grows.
ClickHouse has a safety threshold: when the number of active parts in a single partition exceeds parts_to_throw_insert (300 by default in older releases, raised to 3000 in recent versions), the server refuses new INSERTs with the error:
DB::Exception: Too many parts (301). Merges are processing
significantly slower than inserts
At a lower threshold (parts_to_delay_insert, 150 by default in older releases, 1000 in recent versions), ClickHouse starts artificially delaying INSERTs to give merges time to catch up. Your insert latency increases gradually before the hard failure.
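Both thresholds are MergeTree-level settings, so you can check the values in effect on your server (and override them per table if you must, though fixing the insert pattern is the real cure):
SELECT name, value
FROM system.merge_tree_settings
WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert');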
How to Prevent It
The fundamental rule: batch your inserts. Aim for at least 10,000 to 100,000 rows per INSERT, and limit insert frequency to no more than 1 INSERT per second per table.
If your data source produces records one at a time (e.g., a message queue, event stream, or application logging), you have several options:
Buffer tables: Create a Buffer table engine that sits in front of your MergeTree table. Buffer accumulates rows in memory and flushes them as larger batches on a configurable schedule.
CREATE TABLE events_buffer AS events
ENGINE = Buffer(default, events, 16, 10, 60, 10000, 100000, 1000000, 10000000);
This creates a buffer with 16 in-memory layers. Each layer flushes when all of the minimum conditions are met (10 seconds, 10,000 rows, 1 MB) or when any maximum condition is reached (60 seconds, 100,000 rows, 10 MB). INSERTs go to events_buffer, and data appears in events after flushing.
Async inserts: Enable async_insert = 1 at the server, profile, or query level. ClickHouse will buffer incoming small inserts and batch them internally before writing parts. This is simpler than Buffer tables and recommended for most use cases on ClickHouse 22.8+.
SET async_insert = 1;
SET wait_for_async_insert = 1; -- wait for batch flush before returning
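The same settings can also be attached to an individual INSERT, which is handy when only some writers should use the async path. A sketch, reusing the illustrative events table from earlier:
INSERT INTO events SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES ('2024-01-01 00:00:00', 'click', 42, '{}');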
Application-side batching: Accumulate rows in your application and flush at regular intervals. This gives you the most control but requires implementing batching logic in every service that writes to ClickHouse.
How to Diagnose It
Monitor the number of active parts per partition:
SELECT
database,
table,
partition,
count() AS part_count,
sum(rows) AS total_rows,
formatReadableSize(sum(bytes_on_disk)) AS total_size
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition
HAVING part_count > 50
ORDER BY part_count DESC;
If any partition consistently has more than 100 active parts, you're heading toward trouble. Set alerts at 150 (warning) and 250 (critical).
Merge Storms
A merge storm happens when the merge scheduler triggers many large merges simultaneously, consuming all available disk I/O and memory. Queries slow to a crawl because they're competing with merges for I/O bandwidth.
What Triggers Merge Storms
- Data backfills: Inserting a large amount of historical data creates many parts at once. The merge scheduler sees all of them as merge candidates and schedules aggressive merging.
- Partition key changes: Altering the partition key or creating a new table with a different partition scheme can cause a flood of new partitions with small parts.
- Cluster restarts: After a restart, previously deferred merges all run at once.
How to Control Merges
ClickHouse exposes several settings that control merge behavior:
- background_pool_size: Number of background merge threads (default varies by version, typically 16). Reducing this limits concurrent merges but slows overall merge throughput.
- max_bytes_to_merge_at_max_space_in_pool: Maximum total size of parts to merge in a single operation (default: 150GB). Reducing this creates more frequent but smaller merges; it can be adjusted per table, as shown below.
- merge_max_block_size: Maximum number of rows to read from each part per merge iteration. Reducing this decreases merge memory usage but slows merges.
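The last two are MergeTree-level settings, so they can be changed per table without a restart; background_pool_size is a server-level setting that lives in config.xml (or the default profile on older versions). A sketch with an illustrative table name and value:
-- Cap the size of any single merge for this table (value is illustrative)
ALTER TABLE events MODIFY SETTING max_bytes_to_merge_at_max_space_in_pool = 50000000000; -- ~50 GB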
Monitor current merge activity:
SELECT
database,
table,
elapsed,
progress,
num_parts,
formatReadableSize(total_size_bytes_compressed) AS total_size,
formatReadableSize(memory_usage) AS memory
FROM system.merges
ORDER BY elapsed DESC;
If you see many concurrent merges with low progress, merges are competing for resources. Consider reducing background_pool_size temporarily or throttling inserts.
Memory Exhaustion from Unbounded Queries
Unlike PostgreSQL, where a slow query primarily holds a connection and has relatively bounded memory usage, a ClickHouse query can allocate tens of gigabytes of RAM in seconds. A query that scans a large table without a WHERE clause, or a GROUP BY on a high-cardinality column, can consume all available memory and crash the server.
Server-Level Guardrails
Set these to prevent runaway queries from affecting the entire cluster:
-- Per-query memory limit (default: unlimited, which is dangerous)
SET max_memory_usage = '10000000000'; -- 10GB per query
-- Total memory for all queries: max_memory_usage_for_all_queries on older versions;
-- newer releases replace it with max_server_memory_usage in config.xml
SET max_memory_usage_for_all_queries = '50000000000'; -- 50GB total
-- Maximum rows a query can scan (prevents accidental full table scans)
SET max_rows_to_read = '1000000000'; -- 1 billion rows
Configure these in the server's users.xml or config.xml for default profiles, not just per-session.
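To verify which limits are actually in effect for the current user, and whether they were changed from the defaults:
SELECT name, value, changed
FROM system.settings
WHERE name IN ('max_memory_usage', 'max_rows_to_read');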
Monitoring Memory Usage
Track query memory usage over time:
SELECT
query_id,
user,
left(query, 80) AS query_preview,
memory_usage,
formatReadableSize(memory_usage) AS memory_readable,
read_rows,
elapsed
FROM system.processes
ORDER BY memory_usage DESC;
For historical analysis:
SELECT
left(query, 80) AS query_preview,
type,
query_duration_ms,
formatReadableSize(memory_usage) AS peak_memory,
read_rows,
event_time
FROM system.query_log
WHERE type = 'QueryFinish'
AND memory_usage > 1000000000 -- > 1GB
ORDER BY event_time DESC
LIMIT 20;
Replication Lag in Distributed Setups
For Replicated*MergeTree tables, replicas synchronize through ClickHouse Keeper (the built-in replacement for ZooKeeper, available since ClickHouse 21.8). Each insert creates a log entry in Keeper that replicas must fetch and replay.
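For context, a replicated table declares its Keeper path and replica name when it's created, typically via macros. The path and schema below are illustrative:
CREATE TABLE events_replicated
(
    event_time DateTime,
    event_type LowCardinality(String),
    user_id    UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_replicated', '{replica}')
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, event_time);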
What Causes Replication Lag
- Network issues between replicas or between replicas and Keeper
- Slow disk on a replica that can't write parts as fast as the primary produces them
- High insert rates that overwhelm Keeper's throughput
- Keeper instability (high latency, connection drops)
Monitoring Replication Health
SELECT
database,
table,
is_leader,
total_replicas,
active_replicas,
queue_size,
inserts_in_queue,
merges_in_queue,
absolute_delay
FROM system.replicas
WHERE queue_size > 0 OR absolute_delay > 0
ORDER BY absolute_delay DESC;
Key metrics:
- queue_size > 100: Replica is significantly behind
- absolute_delay > 300: More than 5 minutes of replication lag; investigate immediately
- active_replicas < total_replicas: A replica is down or disconnected
Key Metrics to Monitor
If you're setting up ClickHouse monitoring, here's a prioritized list of what to track:
| Priority | Metric | Source | Why It Matters |
|---|---|---|---|
| Critical | Active parts per partition | system.parts | "Too many parts" kills inserts |
| Critical | Merge queue depth | system.merges | Merge storms degrade all queries |
| High | Query memory peaks | system.query_log | Unbounded queries crash servers |
| High | Replication lag | system.replicas | Data consistency depends on it |
| Medium | Insert throughput | system.events | Drops indicate upstream or config issues |
| Medium | ZooKeeper/Keeper latency | system.zookeeper | Replication depends on Keeper health |
| Low | Disk usage per table | system.parts | Capacity planning |
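For the insert-throughput row in particular, the cumulative counters in system.events are the easiest source; RejectedInserts and DelayedInserts also directly reflect the "too many parts" thresholds discussed earlier:
SELECT event, value
FROM system.events
WHERE event IN ('InsertQuery', 'InsertedRows', 'InsertedBytes', 'DelayedInserts', 'RejectedInserts');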
Automating ClickHouse Monitoring
Querying these system tables manually works for one-off investigations, but production ClickHouse deployments need continuous automated monitoring. The challenge is that monitoring ClickHouse through its own system tables creates a "monitoring observing itself" problem: heavy queries against system.query_log on a busy cluster can themselves impact performance.
Basira's approach is to run a lightweight agent that connects as a read-only user, collects from the system tables at configurable intervals, and ships the data to an external analytics backend, sidestepping the self-observation problem. It also tracks the MergeTree-specific metrics (parts, merges, insert health) that general-purpose APM tools miss.
If you're running ClickHouse alongside PostgreSQL, having both database types in a single monitoring tool with consistent pricing ($29/db/month regardless of engine or cluster size) eliminates the need to maintain separate monitoring stacks. The agent setup is API-driven and takes a few minutes.
For teams that prefer to build their own monitoring, the system tables and queries in this guide are a solid starting point. The critical thing is to collect this data continuously, not just when something breaks.