⚡ Big Data Architecture — Deep Dive

ZooKeeper vs KRaft
on Apache Kafka

A comprehensive technical analysis of Kafka's coordination layer evolution — from external ZooKeeper quorum to the built-in KRaft consensus protocol — for petabyte-scale and real-time streaming architectures.

Scope: PB-scale & Realtime Streaming
Kafka versions: 0.8 → 3.x
Protocol: ZAB vs Raft
KIP-500: Deprecation Path

Background & Origins

Understanding how Kafka's metadata management evolved from a third-party coordination service to a self-contained consensus mechanism.

🦓
Apache ZooKeeper
External Coordination

ZooKeeper is a centralized, highly available coordination service originally developed at Yahoo! and donated to the Apache Software Foundation in 2008. It provides distributed synchronization, configuration management, and naming services via a hierarchical namespace of znodes — essentially a distributed file system optimized for small, frequent reads.


Kafka has relied on ZooKeeper from its earliest releases to store cluster metadata: topic configurations, partition leadership, ISR (in-sync replica) lists, controller election, and — in early versions — consumer group offsets, which later moved into Kafka's internal __consumer_offsets topic. ZooKeeper uses the ZAB (ZooKeeper Atomic Broadcast) protocol — an atomic broadcast protocol in the same family as Paxos, though not a Paxos implementation — to maintain consensus across its quorum of nodes.
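Both ZAB and Raft rest on the same quorum arithmetic: an ensemble of 2f+1 nodes commits a write once a strict majority acknowledges it, and can therefore tolerate f node failures. A minimal sketch of that rule (plain Python for illustration, not ZooKeeper or Kafka API):

```python
def majority(ensemble_size: int) -> int:
    # Smallest ack count that forms a strict majority of the ensemble.
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size: int) -> int:
    # Nodes that may fail while a majority can still be assembled.
    return (ensemble_size - 1) // 2

for n in (3, 5, 7):
    print(f"{n} nodes: majority={majority(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

This is also why ensembles are sized at 3 or 5 nodes: a 4-node ensemble tolerates no more failures than a 3-node one.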

⚙️
Apache KRaft
Built-in Consensus

KRaft (Kafka Raft Metadata Mode) is Kafka's native consensus protocol introduced via KIP-500 (Kafka Improvement Proposal 500), targeting the complete removal of ZooKeeper as a dependency. Introduced in Kafka 2.8 as an early-access feature, it reached production-readiness in Kafka 3.3 (October 2022) and became the sole supported mode from Kafka 4.0 onwards.


KRaft implements a variant of the Raft consensus algorithm (adapted to Kafka's pull-based replication model), embedding metadata management directly within the Kafka process. A dedicated set of controller nodes (or co-located controller+broker nodes) maintains a metadata log — an internal, single-partition topic (__cluster_metadata) replicated via the Raft protocol rather than Kafka's normal ISR mechanism.
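Conceptually, the active controller appends metadata records to its local log and only marks an offset committed once a strict majority of the controller quorum has replicated it. A toy sketch of that commit rule — class and field names here are invented for illustration, not Kafka internals:

```python
from dataclasses import dataclass, field

@dataclass
class MetadataQuorumLeader:
    voter_ids: list                         # e.g. [1, 2, 3]; voter_ids[0] is this leader
    log: list = field(default_factory=list)
    match_offset: dict = field(default_factory=dict)  # highest offset each voter holds
    commit_offset: int = -1                 # nothing committed yet

    def append(self, record):
        # Leader appends locally; the record is NOT yet committed.
        self.log.append(record)
        offset = len(self.log) - 1
        self.match_offset[self.voter_ids[0]] = offset
        return offset

    def on_fetch_ack(self, voter_id, offset):
        # A follower reports it has replicated the log up to `offset`.
        self.match_offset[voter_id] = offset
        # Committed offset = highest offset held by a strict majority of voters.
        held = sorted(self.match_offset.get(v, -1) for v in self.voter_ids)
        majority = len(self.voter_ids) // 2 + 1
        self.commit_offset = held[len(self.voter_ids) - majority]

leader = MetadataQuorumLeader(voter_ids=[1, 2, 3])
offset = leader.append("TopicRecord")       # appended at offset 0, not yet committed
leader.on_fetch_ack(2, offset)              # leader + voter 2 = majority of 3
# leader.commit_offset is now 0
```

The follower acks arrive via the same pull-style fetch path brokers use, which is the main way KRaft's Raft differs from the textbook push-based formulation.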

📅 Historical Timeline
2008
ZooKeeper Open-Sourced
Apache ZooKeeper donated to ASF by Yahoo!, initially part of Hadoop ecosystem.
2011 — Kafka Open-Sourced
Kafka Adopts ZooKeeper
From its first releases, Kafka uses ZooKeeper for broker registration, topic metadata, controller election, and consumer offsets.
2019 — KIP-500 Proposed
ZooKeeper Removal Plan
Community proposes native Kafka metadata quorum to eliminate ZooKeeper dependency entirely.
2021 — Kafka 2.8
KRaft Early Access
KRaft mode ships as a technical preview. Not suitable for production workloads yet.
2022 — Kafka 3.3
KRaft Production-Ready
KRaft declared production-ready; ZooKeeper mode begins official deprecation path.
2025 — Kafka 4.0
ZooKeeper Mode Removed
ZooKeeper support fully removed. KRaft is the only supported metadata management mode.

Architecture Deep Dive

How each coordination model is structurally composed, how metadata flows, and how leadership is elected.

🦓 ZooKeeper Architecture
ZOOKEEPER QUORUM (ZAB Protocol)
  ZK Leader :2181 · ZK Follower :2181 · ZK Follower :2181
      │ metadata
      ▼
KAFKA CLUSTER
  Kafka Controller (elected broker) · Broker 1 :9092 … Broker N :9092
⚠ Two separate systems to manage and scale
⚙️ KRaft Architecture
KRAFT CONTROLLER QUORUM (Raft Protocol)
  __cluster_metadata topic — internal log
  Active Controller (Raft Leader, epoch + offset) · Controller 2 (Raft Follower) · Controller 3 (Raft Follower)
      │ metadata fetch
      ▼
KAFKA BROKER NODES — pull metadata directly from the controller log
  Broker 1 :9092 · Broker 2 :9092 … Broker N :9092
✓ Single unified system — no external dependency
🔄 Metadata Flow Comparison
ZooKeeper Metadata Flow
# ZooKeeper znode hierarchy (legacy)
/kafka
  /brokers
    /ids/0                        # ephemeral → broker alive
    /topics/my-topic/partitions
  /controller                     # ephemeral → leader
  /isr_change_notification
  /admin/reassign_partitions
  /config/topics/my-topic

# Controller writes ZK → notifies ALL brokers
# Brokers read from ZK directly
# O(N×P) notification storm on restarts
KRaft Metadata Flow
# KRaft: metadata as a Kafka log
__cluster_metadata                 # partition 0
  offset 0: RegisterBrokerRecord
  offset 1: TopicRecord            # topic created
  offset 2: PartitionRecord
  offset 3: PartitionChangeRecord
  offset N: ProducerIdsRecord

# Brokers FETCH from controller like consumers
# No ZK watches — event-sourced, append-only
# Snapshots compacted at configurable intervals
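The practical consequence of "metadata as a log" is that any broker — or a restarted controller — can rebuild cluster state by folding records in offset order, and can catch up incrementally from the last offset it applied. A hedged Python sketch (the record shapes are simplified stand-ins, not Kafka's actual record schema):

```python
def apply_record(state, record):
    """Fold one metadata record into an in-memory view of the cluster."""
    kind = record["type"]
    if kind == "RegisterBrokerRecord":
        state.setdefault("brokers", set()).add(record["broker_id"])
    elif kind == "TopicRecord":
        state.setdefault("topics", {})[record["name"]] = {"partitions": {}}
    elif kind in ("PartitionRecord", "PartitionChangeRecord"):
        topic = state["topics"][record["topic"]]
        topic["partitions"][record["partition"]] = {"leader": record["leader"]}
    return state

def replay(log, state=None, from_offset=0):
    """Rebuild state from scratch, or catch up from a known offset."""
    state = {} if state is None else state
    for record in log[from_offset:]:
        apply_record(state, record)
    return state

log = [
    {"type": "RegisterBrokerRecord", "broker_id": 1},
    {"type": "TopicRecord", "name": "orders"},
    {"type": "PartitionRecord", "topic": "orders", "partition": 0, "leader": 1},
    {"type": "PartitionChangeRecord", "topic": "orders", "partition": 0, "leader": 2},
]
state = replay(log)
# state["topics"]["orders"]["partitions"][0]["leader"] == 2
```

Snapshots are just this folded state serialized at some offset, so a restart becomes "load snapshot, then replay the tail" instead of replaying from offset 0.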

Feature-by-Feature Comparison

A systematic breakdown across operational, architectural, and scalability dimensions critical for PB-scale deployments.

Dimension
🦓 ZooKeeper Mode
⚙️ KRaft Mode
External Dependency
Requires separate ZK cluster (3–5 nodes typical). Adds operational overhead, separate monitoring, and JVM management.
Zero external dependencies. Metadata quorum is embedded within Kafka itself — single process to manage.
Consensus Protocol
ZAB (ZooKeeper Atomic Broadcast) — Paxos variant. Leader-follower model. ZK leader processes all writes.
Raft consensus. Active controller is Raft leader. Metadata log replicated with configurable quorum (typically 3 controllers).
Metadata Storage
In-memory tree of znodes in ZK, persisted via transaction log + snapshots. Znodes limited to 1MB by default. All writes serialized through the ZK leader.
Append-only Kafka log (__cluster_metadata). Periodically snapshotted. No size limits from ZK constraints. Allows arbitrary record types.
Partition Scalability
Practically limited to ~200K partitions per cluster due to ZK watch storms and controller reload latency. Performance degrades significantly beyond this.
Supports millions of partitions. Benchmarks show 10–100× improvement in partition leadership handling. Confluent tested 3.3M partitions in KRaft.
Controller Failover
New controller must reload all metadata from ZK. At 200K partitions this can take 30–60+ seconds of unavailability — a critical operational risk.
Standby controllers maintain a full in-memory replica of the metadata log. Failover is near-instantaneous (seconds, not minutes).
Broker Startup
Each broker reads all partition/topic state from ZK on startup. Slow at scale. Fan-out causes ZK read amplification.
Brokers fetch metadata incrementally from the controller using a Kafka Fetch request. Faster, incremental, and bandwidth-efficient.
TLS / Auth Security
Separate TLS configuration for ZK and Kafka. Two sets of certificates, ACLs, and SASL configs. Double the attack surface.
Single unified security domain. One set of certificates and ACL configurations. Simplified compliance posture for regulated industries.
Observability
Requires monitoring two JVM applications with separate metric namespaces. ZK has its own 4-letter commands and JMX beans.
Unified JMX metrics from a single process. Simpler dashboards, single alert surface, and correlated logs.
Deployment Topology
Minimum 6 JVM processes (3 ZK + 3 Kafka) for HA. Odd ZK quorum mandatory. Higher infrastructure cost.
Minimum 3 JVM processes for HA (combined controller+broker). Can also run dedicated controller nodes. Lower resource floor.
Metadata Transactions
Multi-operation ZK transactions are limited. Coordinating complex changes (e.g., partition reassignment) requires multiple round-trips.
All metadata changes are atomic log entries. The controller applies them in order with exactly-once semantics guaranteed by Raft.
Kafka Version Support
Supported in Kafka 0.8 through 3.x. Fully removed in Kafka 4.0. No longer receives bug fixes or features post-3.x.
Available preview in Kafka 2.8+. Production-ready in 3.3+. Sole supported mode in Kafka 4.0+. All future features target KRaft only.
Rolling Upgrades
Kafka brokers and ZK nodes must be rolled separately. Compatible ZK version matrix adds complexity.
Rolling upgrades managed entirely within Kafka. No ZK version compatibility matrix to track. Simpler upgrade runbook.
Cloud-Native Fit
StatefulSets for both ZK and Kafka in Kubernetes. Helm charts are large. Persistent volumes needed for two systems.
Single StatefulSet. Strimzi, Confluent Platform, and managed services such as MSK all primarily target KRaft as of 2024.

Strengths & Weaknesses

An honest assessment of where each approach excels and where it falls short, especially at petabyte-scale and real-time streaming workloads.

🦓 ZooKeeper Mode

✅ Strengths

  • Battle-hardened maturity: 10+ years of production use across thousands of enterprise Kafka clusters globally — failure modes are well-understood.
  • Ecosystem tooling: Extensive tooling, runbooks, monitoring dashboards (Grafana templates, DataDog integrations) are readily available and mature.
  • General-purpose coordination: ZooKeeper can serve as a coordination service for other systems (HBase, HDFS NameNode, Solr) alongside Kafka, potentially sharing infrastructure.
  • Migration pathways proven: Operators have vast documentation, SRE playbooks, and institutional knowledge for ZK-mode cluster operations.
  • Stable under moderate scale: For clusters below 100K partitions with predictable workloads, ZK mode is reliable and well-understood.

❌ Weaknesses

  • Scalability ceiling: The infamous "200K partition wall" — ZK watch fan-out causes severe latency spikes during controller re-elections at high partition counts.
  • Split-brain risk: ZK network partitions can produce ambiguous controller election scenarios, requiring careful timeout tuning.
  • Slow failover: Controller restart requires a full metadata reload from ZK — minutes of unavailability for large PB-scale clusters.
  • Dual operational burden: Two independent distributed systems to patch, tune, monitor, and capacity-plan. JVM heap and GC tuning for both.
  • End-of-life: Deprecated since Kafka 3.x, fully removed in 4.0. No new features. Security patches only until end-of-life.
  • Write bottleneck: All metadata writes serialized through ZK leader. At PB-scale event volumes this creates a coordination bottleneck.

⚙️ KRaft Mode

✅ Strengths

  • Massive partition scalability: Confluent benchmarks demonstrate millions of partitions — critical for multi-tenant PB-scale platforms with thousands of topics.
  • Fast failover: Near-instantaneous controller failover (sub-second in tested scenarios) since standby controllers maintain full metadata state via Raft log replication.
  • Simplified operations: Single JVM, single deployment unit, single monitoring surface, single security domain — dramatically reduces operational complexity.
  • Event-sourced metadata: The __cluster_metadata log provides an auditable, replayable, append-only history of all cluster state changes — excellent for debugging and compliance.
  • Cloud-native: Purpose-built for Kubernetes and containerized environments. Minimal infrastructure footprint reduces compute/storage costs.
  • Future-proof: All Confluent, AWS MSK, and Apache Kafka development exclusively targets KRaft going forward.

❌ Weaknesses

  • Relative immaturity: Production-ready since 2022, meaning fewer years of widespread enterprise battle-testing compared to ZK mode's decade+ history.
  • Migration complexity: Migrating from ZooKeeper to KRaft on live clusters requires careful planning, the KIP-866 migration workflow (including kafka-storage.sh formatting of the new controllers), and maintenance windows.
  • Feature parity gaps (historical): Some advanced ACL and delegation token features lagged behind ZK mode in early KRaft releases (most resolved by Kafka 3.5+).
  • Controller node sizing: Dedicated controller nodes need carefully tuned JVM heap for the metadata log cache — underpowered controllers become the new bottleneck.
  • Snapshot management: The metadata log requires periodic snapshotting and compaction. Poor configuration leads to slow broker restarts while replaying large logs.
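The snapshot trade-off in the last point can be reasoned about with simple arithmetic: restart cost ≈ snapshot load time plus (records accumulated since the last snapshot) ÷ (record apply rate). A back-of-envelope sketch — the rates below are invented placeholders, not measured Kafka numbers:

```python
def restart_estimate_seconds(snapshot_load_s, records_since_snapshot, apply_rate_per_s):
    """Rough restart cost: load the latest snapshot, then replay the log tail."""
    return snapshot_load_s + records_since_snapshot / apply_rate_per_s

# Hypothetical: 5 s snapshot load, 2M records since last snapshot, 400k records/s
print(restart_estimate_seconds(5.0, 2_000_000, 400_000))   # 10.0 seconds
```

Tightening the snapshot interval shrinks the second term, which is exactly what the snapshot-frequency configs control.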

Performance Analysis

Quantitative and qualitative performance characteristics across critical operational dimensions for high-throughput, low-latency streaming systems.

Comparative Scoring (1–10) for PB-Scale / RT Streaming
🦓 ZooKeeper Mode
Partition Scalability
3.5
Controller Failover Speed
4.0
Operational Simplicity
4.5
Ecosystem Maturity
9.5
Security Surface
5.0
Cloud-Native Fit
4.0
Metadata Throughput
4.0
⚙️ KRaft Mode
Partition Scalability
9.7
Controller Failover Speed
9.5
Operational Simplicity
9.0
Ecosystem Maturity
7.2
Security Surface
9.2
Cloud-Native Fit
9.6
Metadata Throughput
9.3
⚡ Key Benchmark Data Points
Metric-by-metric (ZooKeeper Mode → KRaft Mode; source / context in parentheses):
  • Max Tested Partitions: ~200K (practical limit) → 3.3M+ partitions (Confluent KIP-500 benchmarks)
  • Controller Failover Time: 30–120 sec, load-dependent → <1 sec with hot standby (Kafka 3.3 release notes + Confluent testing)
  • Broker Registration, 1000 brokers: minutes, sequential ZK ops → seconds, parallel metadata fetch (Apache Kafka mailing list / KIP-500)
  • Metadata Propagation Latency: proportional to ZK write + notify fan-out → bounded by Raft quorum write + fetch (implementation difference)
  • Memory Overhead (metadata): ZK heap ~1–4GB for large clusters → controller heap ~2–8GB, configurable (empirical production observations)
  • Topic Create Throughput: ~100–200 topics/sec → ~1000+ topics/sec (Confluent Platform benchmarks)
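The failover gap above follows directly from where metadata lives: a ZK-mode controller must read per-partition state after winning election, so failover scales with partition count, while a KRaft standby already holds the state and pays roughly one Raft election. A rough model using the figures above as assumptions (the 4k reads/s rate is illustrative, not a measured ZK limit):

```python
def zk_failover_seconds(partitions, zk_reads_per_second):
    # New ZK-mode controller reloads per-partition state from ZooKeeper.
    return partitions / zk_reads_per_second

def kraft_failover_seconds(raft_election_ms=500):
    # Standby controller already holds the metadata; cost ~= one leader election.
    return raft_election_ms / 1000

# 200k partitions at an assumed ~4k ZK reads/s → 50 s, inside the 30–120 s band
print(zk_failover_seconds(200_000, 4_000))   # 50.0
print(kraft_failover_seconds())              # 0.5
```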

Frameworks & Tooling

Key management, monitoring, and integration frameworks categorized by their alignment with ZooKeeper vs KRaft deployments.

🦓 ZooKeeper Mode
🎛️

Kafka Manager (CMAK)

Cluster management GUI by Yahoo!

Yahoo!'s Cluster Manager for Apache Kafka. ZK-native — reads directly from ZooKeeper znodes. Rich partition rebalance UI. Deprecated in favor of KRaft-compatible tools but still in use at many enterprises.

📊

Kafka Exporter + ZK Exporter

Prometheus / Grafana monitoring

Separate Prometheus exporters for Kafka JMX metrics (kafka-exporter) and ZK metrics (zookeeper_exporter). Maintained separately; two dashboards required per cluster.

🐘

HBase + HDFS Shared ZK

Multi-system coordination

Organizations running HBase and HDFS can share a ZooKeeper ensemble with Kafka, reducing total ZK node count. This is a ZK-specific operational advantage irrelevant in KRaft mode.

🔧

Strimzi Operator (ZK mode)

Kubernetes-native Kafka

Strimzi supported ZK-mode Kubernetes deployments for years, with ZooKeeper configured inside its Kafka Custom Resource Definition (CRD). Now migrating all resources to KRaft-first from version 0.37+.

⚙️ KRaft Mode

Strimzi Operator (KRaft)

Production Kubernetes deployments

The leading open-source Kafka Kubernetes operator. KRaft-first since Strimzi 0.37. Supports dedicated and combined controller/broker topologies. Full lifecycle: rolling upgrades, scale-out, certificate rotation, and storage expansion.

☁️

AWS MSK (Managed Streaming)

Fully managed KRaft on AWS

Amazon MSK defaulted to KRaft mode for new clusters from 2023. Managed controller quorum, automated failover, and integrated IAM-based auth. MSK Express brokers (serverless-like) are KRaft-only.

🏢

Confluent Platform / Cloud

Enterprise Kafka distribution

Confluent Platform 7.4+ ships KRaft as default for new deployments. Confluent Cloud runs KRaft exclusively. Schema Registry, ksqlDB, and Kafka Connect are all KRaft-compatible. Enterprise support SLAs available.

🎯

Redpanda (KRaft-compatible API)

Kafka API-compatible alternative

Redpanda is a C++ Kafka-API-compatible broker with its own Raft implementation (no ZK, no JVM). It validates the architectural principle behind KRaft — strong evidence that built-in Raft consensus suits Kafka-style workloads at scale.

⚙️ KRaft Configuration Reference
Combined Mode (Dev/Small Clusters)
# server.properties — combined controller+broker
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@host1:9093,\
  2@host2:9093,3@host3:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER
inter.broker.listener.name=PLAINTEXT
log.dirs=/var/kafka/data

# Generate cluster UUID:
# kafka-storage.sh random-uuid
Dedicated Controllers (Production PB-scale)
# controller.properties
process.roles=controller
node.id=1
controller.quorum.voters=1@ctrl1:9093,\
  2@ctrl2:9093,3@ctrl3:9093
listeners=CONTROLLER://:9093
metadata.log.dir=/nvme/metadata

# Tune controller heap for large clusters (environment variable, not a property):
# KAFKA_HEAP_OPTS="-Xmx8g -Xms8g"

# broker.properties — separate file
process.roles=broker
node.id=10
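Before rolling these files out, it is worth sanity-checking controller.quorum.voters: node IDs must be unique, and an odd voter count gives a clean majority. A small hypothetical helper (not a Kafka-provided tool):

```python
def parse_quorum_voters(value):
    """Parse a controller.quorum.voters string of the form 'id@host:port,...'."""
    voters = []
    for entry in value.split(","):
        node_id, endpoint = entry.strip().split("@", 1)
        host, port = endpoint.rsplit(":", 1)
        voters.append((int(node_id), host, int(port)))
    ids = [node_id for node_id, _, _ in voters]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate node.id in controller.quorum.voters")
    if len(voters) % 2 == 0:
        raise ValueError("even voter count — use 3 or 5 for a clean majority")
    return voters

voters = parse_quorum_voters("1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093")
# voters == [(1, 'ctrl1', 9093), (2, 'ctrl2', 9093), (3, 'ctrl3', 9093)]
```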

Suitable Ecosystems

Mapping each coordination mode to the architectures, cloud environments, and organizational contexts where it is most appropriate.

🦓 ZooKeeper Mode — Best-Fit Ecosystems

🏛️ Legacy Hadoop Stacks

Organizations running HBase, HDFS NameNode HA, or Apache Solr Cloud that already maintain a ZooKeeper ensemble. Sharing ZK infrastructure avoids additional operational overhead.

🔒 Regulated On-Prem Environments

Financial institutions or healthcare organizations on Kafka 3.x with extensive ZK runbooks, compliance certifications, and change-freeze policies. Migrating carries risk that must be carefully managed.

📦 Moderate Scale (< 100K Partitions)

Stable workloads well within ZK's practical scalability limits, where the ROI of migrating to KRaft doesn't justify operational risk in the short term (on Kafka 3.x only).

⚠️ End-of-Life Warning

All ZK-mode ecosystems have a firm migration deadline: Kafka 4.0 dropped support. Organizations must migrate to KRaft to receive ongoing security patches and feature updates.

⚙️ KRaft Mode — Best-Fit Ecosystems

☁️ Cloud-Native / Kubernetes-First

AWS MSK, Confluent Cloud, Azure HDInsight Kafka, and Google Cloud's Managed Service for Apache Kafka all target KRaft. Kubernetes deployments using Strimzi benefit from single-StatefulSet architecture and simplified lifecycle management.

📈 PB-Scale Data Platforms

Streaming data lakes (Apache Iceberg + Kafka), real-time analytics platforms (Flink + Kafka), and multi-tenant SaaS event buses with millions of partitions. KRaft eliminates ZK as the architectural bottleneck.

⚡ Realtime Streaming (IoT, Fintech)

IoT device telemetry (millions of devices, dynamic topic creation), financial market data (microsecond SLAs, frequent partition reassignments), and edge computing pipelines requiring fast self-healing.

🔄 Modern Lambda/Kappa Architectures

Systems pairing Kafka with Apache Flink (stream processing), Apache Spark Structured Streaming, or Kafka Streams for unified batch+stream processing. KRaft's fast failover is critical for exactly-once semantics pipelines.

🛡️ Zero-Trust / High-Security Environments

Single security domain simplifies certificate management, RBAC, and audit logging. PCI-DSS, SOC2, HIPAA environments benefit from reduced attack surface and unified ACL management.

🔌 Integration Frameworks by Ecosystem
Stream Processing
  • Apache Flink 1.16+
  • Kafka Streams
  • ksqlDB
  • Spark Structured Streaming
  • Apache Samza
Data Integration
  • Kafka Connect
  • Debezium (CDC)
  • Apache NiFi
  • Airbyte + Kafka sink
  • MirrorMaker 2
Storage / Lakehouse
  • Apache Iceberg
  • Apache Hudi
  • Delta Lake
  • Apache Pinot
  • ClickHouse Kafka Engine

Migration Path: ZooKeeper → KRaft

For PB-scale production clusters, migrating from ZooKeeper to KRaft mode is a mandatory and critical operational event requiring careful planning.

📋 Migration Strategy Overview

Kafka provides an official ZK-to-KRaft migration workflow (KIP-866; production-ready in Kafka 3.6) that performs a rolling, online migration without full cluster downtime. The process works in phases:

  1. Phase 1 — Preparation: Upgrade all brokers to Kafka 3.6+ on ZK mode. Verify kafka-metadata-quorum.sh describe output.
  2. Phase 2 — KRaft Controllers: Add dedicated KRaft controller nodes. They observe ZK metadata in "migration mode" without taking over yet.
  3. Phase 3 — Metadata Migration: Trigger migration: ZK metadata is replicated to the KRaft metadata log. Dual-write ensures consistency during transition.
  4. Phase 4 — Broker Migration: Roll each broker to KRaft mode one at a time. Brokers begin reading metadata from KRaft controllers.
  5. Phase 5 — ZK Decommission: Once all brokers are KRaft-mode, ZK ensemble can be safely shut down and decommissioned.
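The five phases above form a strict linear progression with one irreversible edge — ZK decommission. Encoding that in automation guards makes the rollback window explicit; a hypothetical sketch, not Kafka tooling:

```python
PHASES = ["preparation", "kraft_controllers", "metadata_migration",
          "broker_migration", "zk_decommission"]

class MigrationPlan:
    """Tracks progress through the ZK→KRaft phases and the rollback window."""
    def __init__(self):
        self.index = 0

    @property
    def phase(self):
        return PHASES[self.index]

    def advance(self):
        if self.index == len(PHASES) - 1:
            raise RuntimeError("migration already complete")
        self.index += 1
        return self.phase

    def rollback_possible(self):
        # Reverting to ZK mode is only safe before ZK is decommissioned.
        return self.phase != "zk_decommission"

plan = MigrationPlan()
while plan.phase != "broker_migration":
    plan.advance()
print(plan.phase, plan.rollback_possible())   # broker_migration True
plan.advance()
print(plan.phase, plan.rollback_possible())   # zk_decommission False
```

A real runbook would attach verification commands (like the kafka-metadata-quorum.sh checks below) as preconditions on each advance.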
⚠️ Migration Risks & Mitigations
Dual-write Latency Spike

During migration phase, all metadata writes go to both ZK and KRaft. Monitor controller CPU and heap carefully. Use maintenance windows for PB-scale clusters.

ACL / Config Gaps

Verify that all ZK-stored ACLs and custom configs are correctly replicated to the KRaft metadata log before cutting over. Use kafka-acls.sh --list to compare before and after.

Consumer Group State

Consumer group offsets (stored in __consumer_offsets) are unaffected — they live in Kafka itself. Only metadata (topics, ACLs, configs) migrates.

Rollback Plan

Until Phase 5 is complete, rolling back to ZK mode is possible by removing KRaft controllers and reverting broker configs. After ZK decommission, rollback requires full cluster restore from backup.

🔧 Migration Command Reference
# Step 1: Verify current ZK cluster health
kafka-metadata-quorum.sh --bootstrap-server broker:9092 describe --status

# Step 2: Generate KRaft cluster ID (must match existing ZK cluster ID)
kafka-storage.sh info -c /path/to/server.properties

# Step 3: Format KRaft controller storage
kafka-storage.sh format -t <CLUSTER_UUID> -c controller.properties --ignore-formatted

# Step 4: Start KRaft controllers in migration mode
zookeeper.metadata.migration.enable=true    # in controller.properties

# Step 5: Enable broker migration (rolling, one broker at a time)
zookeeper.metadata.migration.enable=true    # add to each broker.properties temporarily

# Step 6: Verify KRaft metadata log after each broker
kafka-metadata-quorum.sh --bootstrap-server broker:9092 describe --replication

# Step 7: Complete migration (finalize + remove ZK config)
kafka-metadata-quorum.sh --bootstrap-server broker:9092 describe --status
# Remove zookeeper.connect from all broker configs
# Decommission ZK nodes

Architecture Recommendation

Final architectural guidance for Big Data engineers operating PB-scale and real-time streaming systems.

⚙️ New Deployments: KRaft — Always

For any greenfield Kafka deployment in 2024 and beyond, KRaft is the only rational choice. ZooKeeper mode is deprecated in 3.x and removed in Kafka 4.0. KRaft delivers superior scalability, faster failover, simpler operations, and is the sole target of all future Kafka development.

  • Use dedicated controller nodes (3 or 5) for PB-scale clusters
  • Allocate NVMe storage for metadata.log.dir
  • Size controller JVM heap at 8–16GB for >500K partitions
  • Deploy via Strimzi (K8s) or Confluent Platform for managed lifecycle
  • Tune metadata.log.max.record.bytes.between.snapshots to control snapshot frequency

🦓 Existing ZK Clusters: Plan Migration Now

Organizations on Kafka 3.x ZK mode must plan their KRaft migration timeline. The migration tooling is mature (3.6+), and the risk of staying on ZK mode — including security exposure and operational debt — outweighs the migration effort.

  • Target Kafka 3.7 or 3.8 for the migration (LTS candidates)
  • Allocate 2–4 week planning phase for PB-scale clusters
  • Run migration in staging environment first with production-like load
  • Schedule migration during low-traffic windows for large clusters
  • Keep ZK ensemble live for 2–4 weeks post-migration for rollback safety
🌿 Decision Framework
New or existing Kafka deployment?
🆕 New Deployment
Use KRaft exclusively
Kafka 3.3+ or 4.0+
🔄 Existing on ZK
Assess Kafka version
Kafka < 3.6
Upgrade first, then migrate
Kafka 3.6+
Run rolling ZK→KRaft migration

⚠️ Kafka 4.0 deadline: ZooKeeper mode is fully removed. All production clusters must be on KRaft before upgrading to Kafka 4.0.

🦓
ZooKeeper Mode
Battle-tested, limited to ~200K partitions, slow failover, dual operational overhead. Deprecated. Suitable only for short-term legacy maintenance.
⚖️
The Verdict
For PB-scale and real-time streaming, KRaft is the definitive answer. The partition scalability, failover speed, and operational simplicity improvements are architecturally transformative.
⚙️
KRaft Mode
Millions of partitions, sub-second failover, single security domain, cloud-native, event-sourced metadata. The future — and present — of Apache Kafka.