
ClickHouse Monitoring: Why the Gap Exists

By Behroz Saadat
ClickHouse · Observability

Almost nobody has proper ClickHouse monitoring in place. Not for lack of interest, but because the tooling barely exists.

ClickHouse adoption has grown significantly over the past few years. Companies like Cloudflare, Uber, eBay, Deutsche Bank, and Lyft run ClickHouse in production for analytics, log storage, and real-time reporting. ClickHouse Inc. has reported over 1,000 enterprise customers on ClickHouse Cloud, and the open-source version sees tens of thousands of deployments. It's become a default choice for teams that need fast analytical queries over large datasets.

Yet the monitoring ecosystem hasn't kept up. The OLAP world has fewer established monitoring tools than the OLTP world, and ClickHouse specifically has been underserved.

The ClickHouse Monitoring Gap

Try finding a database monitoring tool with first-class ClickHouse support. You'll find plenty of options for PostgreSQL, MySQL, and MongoDB. But ClickHouse? Your options are basically:

Grafana + system tables: Roll your own dashboards by querying system.query_log, system.parts, and system.metrics. This is the most common approach, and it works, but it comes with real operational overhead. You're writing and maintaining custom SQL queries against system tables that change between ClickHouse versions. You need to build alerting logic from scratch, define thresholds based on your own experience, and keep dashboards updated as your cluster topology evolves. For a team with deep ClickHouse expertise, this is manageable. For teams where ClickHouse is one of several data stores, it's a maintenance burden that often gets deprioritized until something breaks. (For a taste of what these hand-rolled queries look like, see the sketch at the end of this section.)

General APM with a ClickHouse integration: Datadog, New Relic, and similar platforms offer ClickHouse "integrations" that collect basic host metrics (CPU, memory, disk) and maybe check that the ClickHouse process is running. Some will pull a handful of metrics from system.metrics. But they don't provide query-level performance tracking, part merge visibility, or insert health monitoring. They treat ClickHouse the same way they'd treat any other process running on a host. That's not database monitoring. That's infrastructure monitoring with a ClickHouse label on it.

ClickHouse Cloud's built-in tools: If you're on ClickHouse Cloud, you get the Query Insights feature, advanced dashboards, and some automated alerting. This is genuinely useful, but it only covers your Cloud instances. Teams running self-hosted ClickHouse alongside ClickHouse Cloud (or alongside PostgreSQL) still need a separate monitoring solution. And Cloud users who want to consolidate their database fleet monitoring into a single view are out of luck.

None of these give you what you actually want: deep query-level performance tracking, part merge visibility, insert throughput monitoring, and intelligent alerting. Not without building it yourself, anyway.
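To make the roll-your-own option concrete, here's the kind of query a hand-rolled Grafana panel typically runs: p95 latency and read volume grouped by query fingerprint over the last hour. This is a sketch, not a drop-in dashboard; it assumes a ClickHouse version recent enough to have the normalized_query_hash column in system.query_log.

-- p95 latency by query fingerprint, last hour (sketch)
SELECT
    normalized_query_hash,
    any(substring(query, 1, 60)) AS sample_query,
    count() AS executions,
    round(quantile(0.95)(query_duration_ms)) AS p95_ms,
    formatReadableSize(sum(read_bytes)) AS bytes_read
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
GROUP BY normalized_query_hash
ORDER BY p95_ms DESC
LIMIT 20;

Multiply this by a dozen panels, version-proof it, and wire up alerting on top, and you have a fair picture of the maintenance burden.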

Why ClickHouse Monitoring Is Different

ClickHouse monitoring isn't just PostgreSQL monitoring with different SQL. The architecture is fundamentally different, and it shows.

  • Merges matter: ClickHouse uses a MergeTree engine where background merges significantly impact write and query performance. You need visibility into merge activity, part counts, and merge queue depth.

  • Insert patterns: Unlike OLTP databases where individual row inserts are normal, ClickHouse performance depends heavily on batch size and insert frequency. Monitoring insert block sizes and flush intervals is critical.

  • Distributed queries: In clustered setups, a single query fans out across shards. Performance bottlenecks can be shard-specific, and you need per-shard visibility (one way to get it is sketched after this list).

  • System log tables: ClickHouse exposes rich telemetry through system.query_log, system.part_log, system.metric_log, and others. But querying these tables on a busy cluster is itself an operational concern.
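For the distributed case specifically, ClickHouse can query system tables across every node in one shot. A minimal sketch, assuming a cluster named 'my_cluster' (substitute your own) and a version that supports the clusterAllReplicas table function:

-- Per-node snapshot of a few key gauges across the whole cluster
SELECT
    hostName() AS host,
    metric,
    value
FROM clusterAllReplicas('my_cluster', system.metrics)
WHERE metric IN ('Query', 'Merge', 'MemoryTracking')
ORDER BY host, metric;

The same pattern works against system.parts and system.merges, which is how per-shard merge visibility usually gets bolted together by hand.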

What We Built for ClickHouse

Basira's ClickHouse collector understands the ClickHouse data model natively:

  • Query performance: Tracks query execution time, rows read, bytes processed, and memory usage from system.query_log. Groups by query fingerprint so you see patterns, not individual executions.

  • Part and merge monitoring: Watches active and total part counts, merge progress, and detects when part counts are growing faster than merges can keep up. The classic "too many parts" problem.

  • Insert health: Monitors insert block sizes, async insert queue depth, and buffer table flush activity.

  • Resource pressure: Tracks memory usage, thread pool saturation, and ZooKeeper latency for replicated tables.

All collected by a lightweight agent that connects as a read-only user with minimal permissions. Same flat $29/db/month pricing as our PostgreSQL monitoring. And because the agent is fully API-driven, you can add ClickHouse monitoring to an automated provisioning workflow without touching a UI.

Common ClickHouse Failure Modes

Understanding what can go wrong helps you know what to monitor. These are the most common ClickHouse production issues:

Too many parts: This is the most frequent ClickHouse failure mode. Every INSERT creates a new "part" (a directory of column files on disk). If you insert data too frequently in small batches, parts accumulate faster than background merges can consolidate them. When a table exceeds roughly 300 active parts in a single partition, ClickHouse starts rejecting inserts with TOO_MANY_PARTS errors. The fix is straightforward: batch your inserts (aim for at least 10,000-100,000 rows per INSERT, no more than once per second per table), or use async inserts or Buffer tables to aggregate small writes. But you need to see the part count trending upward before it hits the threshold. For a deeper dive on how MergeTree parts and merges work, see our MergeTree performance guide.
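A minimal watch query for this looks like the sketch below. The HAVING cutoff is a placeholder: the article's ~300 figure reflects the classic parts_to_throw_insert default, but the limit varies by version and config, so set the warning band well below whatever your servers enforce.

-- Partitions drifting toward the too-many-parts limit (sketch)
SELECT
    database,
    table,
    partition_id,
    count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition_id
HAVING active_parts > 150
ORDER BY active_parts DESC;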

Merge storms: Background merges are normal and necessary. But when many large merges run simultaneously, they can consume all available disk I/O and memory, causing query latency to spike. This often happens after a large data backfill or when partition keys change. Monitoring merge activity (number of concurrent merges, bytes being merged, merge queue depth) gives you early warning. If merges are consistently falling behind, you may need to tune max_bytes_to_merge_at_max_space_in_pool or increase disk throughput.
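To see what the merge pool is doing right now, system.merges gives a live view. A rough sketch; long elapsed times combined with low progress are the tell:

-- In-flight merges, slowest first (sketch)
SELECT
    database,
    table,
    round(elapsed) AS elapsed_s,
    round(progress * 100, 1) AS pct_done,
    formatReadableSize(total_size_bytes_compressed) AS size,
    formatReadableSize(memory_usage) AS memory
FROM system.merges
ORDER BY elapsed DESC;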

Memory exhaustion from unbounded queries: ClickHouse is fast, but a query that scans a billion rows without a WHERE clause will consume all available memory. Unlike PostgreSQL, where a slow query holds a connection but typically has bounded memory usage, a ClickHouse query can allocate tens of gigabytes of RAM in seconds. Server-level settings like max_memory_usage and max_memory_usage_for_all_queries provide guardrails, but you need monitoring to know when queries are approaching those limits.
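For a live view of the hungriest queries, system.processes is the place to look. A sketch:

-- Currently running queries by memory consumption (sketch)
SELECT
    query_id,
    user,
    formatReadableSize(memory_usage) AS mem_now,
    formatReadableSize(peak_memory_usage) AS mem_peak,
    round(elapsed, 1) AS elapsed_s,
    substring(query, 1, 60) AS query_head
FROM system.processes
ORDER BY memory_usage DESC
LIMIT 10;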

Replication lag in distributed setups: For Replicated*MergeTree tables, replicas synchronize through ClickHouse Keeper (or ZooKeeper). Network issues, slow disks on replicas, or high insert rates can cause replication lag. If a replica falls far enough behind, it may need to re-sync from scratch. Monitoring the replication queue depth and lag (via system.replicas) is essential for multi-node deployments.
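The queue and delay columns in system.replicas cover most of what you need here. A starting-point query (column details vary slightly across versions):

-- Replicated tables with a backlog or measurable lag (sketch)
SELECT
    database,
    table,
    is_leader,
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    absolute_delay
FROM system.replicas
WHERE queue_size > 0 OR absolute_delay > 0
ORDER BY absolute_delay DESC;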

Key Metrics and Alert Thresholds

If you're building your own monitoring or want to know what to watch, these are the metrics that matter most:

Metric                    Source              Alert Threshold
Active parts per table    system.parts        > 200 (warn), > 300 (critical)
Concurrent merges         system.merges       > 80% of background_pool_size
Memory usage              system.metrics      > 80% of max_memory_usage_for_all_queries
Replication queue         system.replicas     > 100 entries or > 5 min lag
Insert rows/sec           system.events       Sudden drop > 50% from baseline
Query duration p99        system.query_log    > 2x baseline
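As an example of turning one of these rows into an alert probe, here's a hedged sketch of the "query duration p99 > 2x baseline" check, comparing the last 15 minutes against the preceding 24 hours (the window sizes are arbitrary; tune them to your traffic):

-- p99 now vs. trailing-day baseline (sketch)
WITH
    (
        SELECT quantile(0.99)(query_duration_ms)
        FROM system.query_log
        WHERE type = 'QueryFinish'
          AND event_time BETWEEN now() - INTERVAL 24 HOUR AND now() - INTERVAL 15 MINUTE
    ) AS baseline_p99
SELECT
    quantile(0.99)(query_duration_ms) AS recent_p99,
    baseline_p99,
    recent_p99 > 2 * baseline_p99 AS should_alert
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 15 MINUTE;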

Basira's ClickHouse monitoring tracks all of these automatically and alerts when thresholds are exceeded.

One Dashboard for PostgreSQL and ClickHouse

The real power is having PostgreSQL and ClickHouse monitoring in the same place. A lot of teams use both: PostgreSQL for application data, ClickHouse for analytics or logs. With Basira, you get a unified view of your entire database fleet without paying per-metric fees that make database monitoring costs spiral out of control.

Stop switching between tools and maintaining separate monitoring stacks for each database engine.

Getting Started with ClickHouse Monitoring

If you're running ClickHouse without dedicated monitoring today, here's a practical starting point regardless of which tool you use:

  1. Enable system.query_log: It's enabled by default in most ClickHouse installations, but verify it's not disabled in your server config. This table is the foundation of query performance monitoring.

  2. Create a monitoring user: Don't query system tables with your admin account. Create a dedicated read-only user with access to system.* tables. In ClickHouse, this is straightforward:

-- Replace 'secure_password' with a strong, generated password.
CREATE USER basira_monitor IDENTIFIED BY 'secure_password';
-- Read access to the system tables is all a monitoring agent needs.
GRANT SELECT ON system.* TO basira_monitor;
  3. Set up part count alerts first: Of all the metrics you could monitor, part counts are the most operationally urgent. A runaway part count will take your table offline. Start here.

  4. Monitor merge activity: After part counts, watch merge throughput. If merges are consistently falling behind inserts, you'll hit the too-many-parts wall eventually.

  5. Track query memory usage: Set max_memory_usage and max_memory_usage_for_all_queries at the server level, then monitor how close queries get to those limits.
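For step 5, a hedged sketch that flags recent queries approaching the limit. It assumes max_memory_usage is set to a nonzero value; note that system.settings reflects the current session, so per-user limits may differ from what the monitoring account sees.

-- Finished queries that used more than 80% of the memory limit (sketch)
WITH
    (SELECT toUInt64(value) FROM system.settings WHERE name = 'max_memory_usage') AS mem_limit
SELECT
    query_id,
    formatReadableSize(memory_usage) AS mem,
    query_duration_ms,
    substring(query, 1, 60) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 DAY
  AND memory_usage > 0.8 * mem_limit
ORDER BY memory_usage DESC
LIMIT 10;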

For a deeper understanding of the MergeTree engine and how parts and merges work under the hood, see our MergeTree performance guide.

Deploy the Basira agent and get ClickHouse visibility alongside your PostgreSQL databases. The setup is fully API-driven, and the pricing is the same flat $29/db/month regardless of cluster size.

Stop guessing. Start monitoring.

Basira gives you deep visibility into every query your database runs. Deploy in under a minute.