Skip to main content

Confluent Cloud Monitoring

Log and Monitor

# Discover Available Resources
curl -X GET 'https://api.telemetry.confluent.cloud/v2/metrics/cloud/descriptors/resources' -u '<API_KEY>:<SECRET>'

Discover available metrics

curl -X GET 'https://api.telemetry.confluent.cloud/v2/metrics/cloud/descriptors/metrics?resource_type=kafka' -u '<API_KEY>:<SECRET>'

io.confluent.kafka.server

  • io.confluent.kafka.server/request_bytes
  • io.confluent.kafka.server/response_bytes
  • io.confluent.kafka.server/received_bytes
  • io.confluent.kafka.server/sent_bytes
  • io.confluent.kafka.server/received_records
  • io.confluent.kafka.server/sent_records
  • io.confluent.kafka.server/retained_bytes
  • io.confluent.kafka.server/active_connection_count
  • io.confluent.kafka.server/connection_info
  • io.confluent.kafka.server/request_count
  • io.confluent.kafka.server/deprecated_request_count
  • io.confluent.kafka.server/created_acls_count_per_tenant
  • io.confluent.kafka.server/partition_count
  • io.confluent.kafka.server/successful_authentication_count
  • io.confluent.kafka.server/consumer_lag_offsets
  • io.confluent.kafka.server/max_pending_rebalance_time_milliseconds
  • io.confluent.kafka.server/elastic_cku_count
  • io.confluent.kafka.server/rest_produce_request_bytes
  • io.confluent.kafka.server/producer_latency_avg_milliseconds
  • io.confluent.kafka.server/cluster_link_destination_response_bytes
  • io.confluent.kafka.server/cluster_link_source_response_bytes
  • io.confluent.kafka.server/cluster_active_link_count
  • io.confluent.kafka.server/cluster_link_count
  • io.confluent.kafka.server/cluster_link_task_count
  • io.confluent.kafka.server/cluster_link_mirror_transition_in_error
  • io.confluent.kafka.server/cluster_link_mirror_topic_count
  • io.confluent.kafka.server/cluster_link_mirror_topic_offset_lag
  • io.confluent.kafka.server/cluster_link_mirror_topic_bytes
  • io.confluent.kafka.server/cluster_load_percent
  • io.confluent.kafka.server/cluster_load_percent_avg
  • io.confluent.kafka.server/cluster_load_percent_max
  • io.confluent.kafka.server/dedicated_cku_count
  • io.confluent.kafka.server/hot_partition_ingress
  • io.confluent.kafka.server/hot_partition_egress

io.confluent.tableflow

  • io.confluent.tableflow/snapshots_generated
  • io.confluent.tableflow/rows_added
  • io.confluent.tableflow/rows_read
  • io.confluent.tableflow/bytes_added
  • io.confluent.tableflow/bytes_read
  • io.confluent.tableflow/kafka_topic_offset
  • io.confluent.tableflow/table_offset
  • io.confluent.tableflow/bytes_processed
  • io.confluent.tableflow/num_topics
  • io.confluent.tableflow/storage
  • io.confluent.tableflow/bytes_compacted
  • io.confluent.tableflow/files_compacted
  • io.confluent.tableflow/rows_skipped
  • io.confluent.tableflow/compactions_pending
  • io.confluent.tableflow/compaction_duration
  • io.confluent.tableflow/delta_bytes_added
  • io.confluent.tableflow/delta_rows_added
  • io.confluent.tableflow/delta_versions_generated

io.confluent.kafka.streams

  • io.confluent.kafka.streams/kafka_streams_client_state
  • io.confluent.kafka.streams/thread_thread_state
  • io.confluent.kafka.streams/recording_level
  • io.confluent.kafka.streams/process_ratio
  • io.confluent.kafka.streams/commit_ratio
  • io.confluent.kafka.streams/punctuate_ratio
  • io.confluent.kafka.streams/poll_ratio
  • io.confluent.kafka.streams/node_e2e_latency_max
  • io.confluent.kafka.streams/node_e2e_latency_min
  • io.confluent.kafka.streams/size_all_mem_tables_bytes
  • io.confluent.kafka.streams/estimate_num_keys
  • io.confluent.kafka.streams/block_cache_usage

Alerting

1. Kafka Brokers (The Core)

Broker health is the foundation of your cluster. Alerts here should typically be paged immediately to an on-call engineer.

Alert NameMetric to Monitor (JMX / Prometheus)Critical ThresholdWhy it Matters
Offline PartitionsOfflinePartitionsCount> 0Immediate data unavailability. Producers cannot write, and consumers cannot read from these partitions.
Under-Replicated Partitions (URP)UnderReplicatedPartitions> 0 for > 5 minsA broker is down, slow, or network issues are preventing replication. Risk of data loss if another broker fails.
Active Controller CountActiveControllerCount!= 1If 0, the cluster cannot manage partitions or reassign leaders. If > 1, you have a "split-brain" scenario (very rare but critical).
Network Processor IdleNetworkProcessorAvgIdlePercent< 30%Brokers are bottlenecked on network threads. Time to scale out or optimize client requests.
Request Handler IdleRequestHandlerAvgIdlePercent< 30%Brokers are bottlenecked on CPU/Disk IO. They cannot process requests fast enough.

2. Quorum / Metadata (KRaft or ZooKeeper)

If the metadata layer fails, the brokers cannot function, even if their disks and CPUs are completely healthy.

Alert NameComponentCritical ThresholdWhy it Matters
ZooKeeper DisconnectsZooKeeper> 0Brokers losing connection to ZK will drop out of the cluster, causing massive URP spikes.
Current State / QuorumKRaftCurrentState != 3 (Leader/Follower)The KRaft controller quorum has lost consensus. Cluster management is halted.
Metadata Commit LatencyKRaft / ZK> 100msSlow metadata commits will increase latency for producers creating new topics or partitions.

3. Kafka Consumers (Data Reading)

Consumer alerts generally indicate application-level issues, but they can sometimes point to broker network bottlenecks.

Alert NameMetric / SymptomCritical ThresholdWhy it Matters
Critical Consumer Lagrecords-lag-maxExceeds SLA (e.g., > 10,000 msgs)The consumer is too slow, crashed, or stuck. Data is not being processed in real-time.
High Rebalance Ratejoin-rate / sync-rate> 5 per minute"Stop-the-world" rebalances halt consumption. Usually caused by consumers continually dying, network blips, or bad max.poll.interval.ms configs.
Fetch Latencyfetch-latency-avg> 500msConsumers are waiting too long for brokers to return data. Could indicate broker disk read issues.

4. Kafka Producers (Data Ingestion)

Producers are the entry point. If they fail, data is lost before it even enters the pipeline.

Alert NameMetric / SymptomCritical ThresholdWhy it Matters
High Error Raterecord-error-rate> 1%Producers are dropping messages due to authorization failures, message size limits, or broker unavailability.
Retries Spikingrecord-retry-rateSustained > 5% increaseIndicates transient network issues or overloaded brokers. Increases end-to-end latency.
Request Latencyrequest-latency-avg> 1000ms (varies by acks)Producers are waiting too long for broker acknowledgments, which backs up application buffers.

5. Infrastructure (The Foundation)

Kafka is highly dependent on Disk I/O and Network availability.

Alert NameResourceCritical ThresholdAction Required
Disk Space ExhaustionDisk> 80% capacityURGENT: If a broker disk hits 100%, it will crash and corrupt its log segments. Add disk space or lower retention policies immediately.
High Disk I/O WaitDisk> 20% iowaitDisks are too slow to keep up with incoming writes. Consider moving to faster SSDs.
Network SaturationNetwork> 80% of NIC capacityBrokers cannot replicate or serve data fast enough. You need larger instances or a 10Gbps+ network upgrade.

6. Kafka Connect & Ecosystem (If Applicable)

If you use Kafka Connect or Schema Registry, they need their own guardrails.

  • Kafka Connect Task Failures: Alert if connector-failed-task-count > 0. A failed task means the pipeline is broken, even if the connector itself shows as "Running".
  • Schema Registry Unavailability: Alert on HTTP 5xx errors from the registry. If producers cannot fetch schemas, they will fail to serialize and drop messages.