ClickHouse MergeTree Engine: Understanding Parts, Merges, and Performance
If you're running ClickHouse in production, you're running MergeTree. It's the default table engine for almost every use case: analytics, log storage, time-series data, event tracking. ReplacingMergeTree, AggregatingMergeTree, CollapsingMergeTree, and the replicated variants are all built on top of MergeTree.
Understanding how MergeTree handles data internally is the difference between a cluster that runs smoothly and one that regularly hits "too many parts" errors, merge storms, or memory exhaustion. This isn't abstract knowledge. These are the operational issues that wake people up at 3am.
How MergeTree Works
MergeTree's write path is fundamentally different from OLTP databases like PostgreSQL. There's no WAL, no row-level updates, no in-place modification. Instead, MergeTree uses an LSM-tree-inspired approach optimized for high-throughput analytical writes.
The Write Path
When you INSERT data into a MergeTree table, ClickHouse does the following:
- Creates a new "part": The inserted data is written to a new directory on disk. Each part is a self-contained unit containing column files, a primary index, and metadata. Parts are immutable once written, which means there are no partial writes or corruption risks during insertion.
- Sorts by primary key: Within each part, data is sorted by the table's ORDER BY key. This sorting enables efficient range queries and sparse index lookups (a minimal table definition follows this list).
- Writes column files: Each column is stored in a separate file within the part directory. This columnar storage is what makes ClickHouse fast for analytical queries that read only a subset of columns.
- Returns immediately: The INSERT completes as soon as the part is written to disk. There's no waiting for background processing. This is why ClickHouse can achieve very high insert throughput.
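For reference, here's a minimal sketch of a MergeTree table definition showing where the ORDER BY and PARTITION BY clauses fit. The table and column names are illustrative, not from a real schema:
CREATE TABLE events
(
    event_time DateTime,
    event_type LowCardinality(String),
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)  -- one partition per month
ORDER BY (event_type, event_time); -- sorting key, also used for the sparse primary index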
Background Merges
Over time, a table accumulates many small parts. Background merge threads continuously combine smaller parts into larger ones. This is the "merge" in MergeTree.
Merges serve several purposes:
- Reduce part count: Fewer parts means fewer files to open and fewer index lookups during queries
- Apply deduplication: For ReplacingMergeTree, merges are when duplicate rows with the same sorting key are collapsed
- Apply aggregation: For AggregatingMergeTree, merges combine partial aggregate states
- Optimize storage: Merged parts can achieve better compression because larger data blocks have more opportunities for compression algorithms to find patterns
The merge process is automatic. ClickHouse selects which parts to merge based on a heuristic that balances merge size (larger merges are more efficient per byte but take longer) against part count (too many parts hurt query performance).
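Assuming the part_log is enabled on your server (it is in most recent default configurations), you can see which merges the scheduler actually ran, how many parts each one combined, and how long it took:
SELECT
    event_time,
    table,
    part_name,
    rows,
    length(merged_from) AS source_parts,
    duration_ms
FROM system.part_log
WHERE event_type = 'MergeParts'
ORDER BY event_time DESC
LIMIT 10;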
The Part Lifecycle
A part goes through these states:
- Active: Created by an INSERT. Visible to queries. Candidate for merging.
- Merging: Selected by the merge scheduler. While the merge runs, the original parts remain active and queryable.
- Inactive: After a merge completes, the original parts are marked inactive. They're no longer visible to queries but remain on disk temporarily.
- Deleted: Inactive parts are cleaned up after old_parts_lifetime (default: 8 minutes).
You can see all parts for a table with:
SELECT
partition,
name,
active,
rows,
bytes_on_disk,
modification_time
FROM system.parts
WHERE table = 'my_table' AND database = 'default'
ORDER BY modification_time DESC;
The "Too Many Parts" Problem
This is the most common ClickHouse operational issue, and understanding it requires understanding the merge scheduler.
What Causes It
Every INSERT creates a new part. If you insert data in small batches frequently (e.g., one row at a time, or 100 rows every 100ms), you create parts faster than background merges can consolidate them. The part count grows.
ClickHouse has a safety threshold: when the number of active parts in a single partition exceeds parts_to_throw_insert (300 by default in older releases, raised to 3000 in recent versions), the server refuses new INSERTs with the error:
DB::Exception: Too many parts (301). Merges are processing
significantly slower than inserts
At a lower threshold (parts_to_delay_insert, 150 by default in older releases, 1000 in recent versions), ClickHouse starts artificially delaying INSERTs to give merges time to catch up. Your insert latency increases gradually before the hard failure.
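Both thresholds are MergeTree-level settings, so you can check the values in effect on your server (and override them per table if you must, though fixing the insert pattern is the real cure):
SELECT name, value
FROM system.merge_tree_settings
WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert');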
How to Prevent It
The fundamental rule: batch your inserts. Aim for at least 10,000 to 100,000 rows per INSERT, and limit insert frequency to no more than 1 INSERT per second per table.
If your data source produces records one at a time (e.g., a message queue, event stream, or application logging), you have several options:
Buffer tables: Create a Buffer table engine that sits in front of your MergeTree table. Buffer accumulates rows in memory and flushes them as larger batches on a configurable schedule.
CREATE TABLE events_buffer AS events
ENGINE = Buffer(default, events, 16, 10, 60, 10000, 100000, 1000000, 10000000);
This creates a buffer with 16 in-memory layers. Each layer flushes when all of the minimum conditions are met (10 seconds, 10,000 rows, 1 MB) or when any maximum condition is reached (60 seconds, 100,000 rows, 10 MB). INSERTs go to events_buffer, and data appears in events after flushing.
Async inserts: Enable async_insert = 1 at the server, profile, or query level. ClickHouse will buffer incoming small inserts and batch them internally before writing parts. This is simpler than Buffer tables and recommended for most use cases on ClickHouse 22.8+.
SET async_insert = 1;
SET wait_for_async_insert = 1; -- wait for batch flush before returning
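The same settings can also be attached to an individual INSERT, which is handy when only some writers should use the async path. A sketch, reusing the illustrative events table from earlier:
INSERT INTO events SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES ('2024-01-01 00:00:00', 'click', 42, '{}');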
Application-side batching: Accumulate rows in your application and flush at regular intervals. This gives you the most control but requires implementing batching logic in every service that writes to ClickHouse.
How to Diagnose It
Monitor the number of active parts per partition:
SELECT
database,
table,
partition,
count() AS part_count,
sum(rows) AS total_rows,
formatReadableSize(sum(bytes_on_disk)) AS total_size
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition
HAVING part_count > 50
ORDER BY part_count DESC;
If any partition consistently has more than 100 active parts, you're heading toward trouble. Set alerts at 150 (warning) and 250 (critical).
Merge Storms
A merge storm happens when the merge scheduler triggers many large merges simultaneously, consuming all available disk I/O and memory. Queries slow to a crawl because they're competing with merges for I/O bandwidth.
What Triggers Merge Storms
- Data backfills: Inserting a large amount of historical data creates many parts at once. The merge scheduler sees all of them as merge candidates and schedules aggressive merging.
- Partition key changes: Altering the partition key or creating a new table with a different partition scheme can cause a flood of new partitions with small parts.
- Cluster restarts: After a restart, previously deferred merges all run at once.
How to Control Merges
ClickHouse exposes several settings that control merge behavior:
- background_pool_size: Number of background merge threads (default varies by version, typically 16). Reducing this limits concurrent merges but slows overall merge throughput.
- max_bytes_to_merge_at_max_space_in_pool: Maximum total size of parts to merge in a single operation (default: 150GB). Reducing this creates more frequent but smaller merges; it can be adjusted per table, as shown below.
- merge_max_block_size: Maximum number of rows to read from each part per merge iteration. Reducing this decreases merge memory usage but slows merges.
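The last two are MergeTree-level settings, so they can be changed per table without a restart; background_pool_size is a server-level setting that lives in config.xml (or the default profile on older versions). A sketch with an illustrative table name and value:
-- Cap the size of any single merge for this table (value is illustrative)
ALTER TABLE events MODIFY SETTING max_bytes_to_merge_at_max_space_in_pool = 50000000000; -- ~50 GB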
Monitor current merge activity:
SELECT
database,
table,
elapsed,
progress,
num_parts,
formatReadableSize(total_size_bytes_compressed) AS total_size,
formatReadableSize(memory_usage) AS memory
FROM system.merges
ORDER BY elapsed DESC;
If you see many concurrent merges with low progress, merges are competing for resources. Consider reducing background_pool_size temporarily or throttling inserts.
Memory Exhaustion from Unbounded Queries
Unlike PostgreSQL, where a slow query primarily holds a connection and has relatively bounded memory usage, a ClickHouse query can allocate tens of gigabytes of RAM in seconds. A query that scans a large table without a WHERE clause, or a GROUP BY on a high-cardinality column, can consume all available memory and crash the server.
Server-Level Guardrails
Set these to prevent runaway queries from affecting the entire cluster:
-- Per-query memory limit (default: unlimited, which is dangerous)
SET max_memory_usage = '10000000000'; -- 10GB per query
-- Total memory for all queries: max_memory_usage_for_all_queries on older versions;
-- newer releases replace it with max_server_memory_usage in config.xml
SET max_memory_usage_for_all_queries = '50000000000'; -- 50GB total
-- Maximum rows a query can scan (prevents accidental full table scans)
SET max_rows_to_read = '1000000000'; -- 1 billion rows
Configure these in the server's users.xml or config.xml for default profiles, not just per-session.
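To verify which limits are actually in effect for the current user, and whether they were changed from the defaults:
SELECT name, value, changed
FROM system.settings
WHERE name IN ('max_memory_usage', 'max_rows_to_read');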
Monitoring Memory Usage
Track query memory usage over time:
SELECT
query_id,
user,
left(query, 80) AS query_preview,
memory_usage,
formatReadableSize(memory_usage) AS memory_readable,
read_rows,
elapsed
FROM system.processes
ORDER BY memory_usage DESC;
For historical analysis:
SELECT
left(query, 80) AS query_preview,
type,
query_duration_ms,
formatReadableSize(memory_usage) AS peak_memory,
read_rows,
event_time
FROM system.query_log
WHERE type = 'QueryFinish'
AND memory_usage > 1000000000 -- > 1GB
ORDER BY event_time DESC
LIMIT 20;
Replication Lag in Distributed Setups
For Replicated*MergeTree tables, replicas synchronize through ClickHouse Keeper (the built-in replacement for ZooKeeper, available since ClickHouse 21.8). Each insert creates a log entry in Keeper that replicas must fetch and replay.
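For context, a replicated table declares its Keeper path and replica name when it's created, typically via macros. The path and schema below are illustrative:
CREATE TABLE events_replicated
(
    event_time DateTime,
    event_type LowCardinality(String),
    user_id    UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_replicated', '{replica}')
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, event_time);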
What Causes Replication Lag
- Network issues between replicas or between replicas and Keeper
- Slow disk on a replica that can't write parts as fast as the primary produces them
- High insert rates that overwhelm Keeper's throughput
- Keeper instability (high latency, connection drops)
Monitoring Replication Health
SELECT
database,
table,
is_leader,
total_replicas,
active_replicas,
queue_size,
inserts_in_queue,
merges_in_queue,
absolute_delay
FROM system.replicas
WHERE queue_size > 0 OR absolute_delay > 0
ORDER BY absolute_delay DESC;
Key metrics:
- queue_size > 100: Replica is significantly behind
- absolute_delay > 300: More than 5 minutes of replication lag; investigate immediately
- active_replicas < total_replicas: A replica is down or disconnected
Key Metrics to Monitor
If you're setting up ClickHouse monitoring, here's a prioritized list of what to track:
| Priority | Metric | Source | Why It Matters |
|---|---|---|---|
| Critical | Active parts per partition | system.parts | "Too many parts" kills inserts |
| Critical | Merge queue depth | system.merges | Merge storms degrade all queries |
| High | Query memory peaks | system.query_log | Unbounded queries crash servers |
| High | Replication lag | system.replicas | Data consistency depends on it |
| Medium | Insert throughput | system.events | Drops indicate upstream or config issues |
| Medium | ZooKeeper/Keeper latency | system.zookeeper | Replication depends on Keeper health |
| Low | Disk usage per table | system.parts | Capacity planning |
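For the insert-throughput row in particular, the cumulative counters in system.events are the easiest source; RejectedInserts and DelayedInserts also directly reflect the "too many parts" thresholds discussed earlier:
SELECT event, value
FROM system.events
WHERE event IN ('InsertQuery', 'InsertedRows', 'InsertedBytes', 'DelayedInserts', 'RejectedInserts');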
Automating ClickHouse Monitoring
Querying these system tables manually works for one-off investigations, but production ClickHouse deployments need continuous automated monitoring. The challenge is that monitoring ClickHouse through its own system tables creates a "monitoring observing itself" problem: heavy queries against system.query_log on a busy cluster can themselves impact performance.
Basira's approach is to run a lightweight agent that connects as a read-only user, collects from the system tables at configurable intervals, and ships the data to an external analytics backend, sidestepping the self-observation problem. It also tracks the MergeTree-specific metrics (parts, merges, insert health) that general-purpose APM tools miss.
If you're running ClickHouse alongside PostgreSQL, having both database types in a single monitoring tool with consistent pricing ($29/db/month regardless of engine or cluster size) eliminates the need to maintain separate monitoring stacks. The agent setup is API-driven and takes a few minutes.
For teams that prefer to build their own monitoring, the system tables and queries in this guide are a solid starting point. The critical thing is to collect this data continuously, not just when something breaks.