⚡ Big Data Architecture — Deep Dive

ZooKeeper vs KRaft
on Apache Kafka

A comprehensive technical analysis of Kafka's coordination layer evolution — from external ZooKeeper quorum to the built-in KRaft consensus protocol — for petabyte-scale and real-time streaming architectures.

Scope: PB-scale & Realtime Streaming
Kafka versions: 0.8 → 3.x
Protocol: ZAB vs Raft
KIP-500: Deprecation Path

Background & Origins

Understanding how Kafka's metadata management evolved from a third-party coordination service to a self-contained consensus mechanism.

🦓
Apache ZooKeeper
External Coordination

ZooKeeper is a centralized, highly available coordination service originally developed at Yahoo! and donated to the Apache Software Foundation in 2008. It provides distributed synchronization, configuration management, and naming services via a hierarchical namespace of znodes — essentially a distributed file system optimized for small, frequent reads.


Kafka has relied on ZooKeeper from its earliest releases to store cluster metadata: topic configurations, partition leadership, ISR (in-sync replica) lists, controller election, and — in early versions — consumer group offsets, which later moved into Kafka's internal __consumer_offsets topic. ZooKeeper uses the ZAB (ZooKeeper Atomic Broadcast) protocol — an atomic broadcast protocol in the same family as Paxos, though not a Paxos implementation — to maintain consensus across its quorum of nodes.
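Both ZAB and Raft rest on the same quorum arithmetic: an ensemble of 2f+1 nodes commits a write once a strict majority acknowledges it, and can therefore tolerate f node failures. A minimal sketch of that rule (plain Python for illustration, not ZooKeeper or Kafka API):

```python
def majority(ensemble_size: int) -> int:
    # Smallest ack count that forms a strict majority of the ensemble.
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size: int) -> int:
    # Nodes that may fail while a majority can still be assembled.
    return (ensemble_size - 1) // 2

for n in (3, 5, 7):
    print(f"{n} nodes: majority={majority(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

This is also why ensembles are sized at 3 or 5 nodes: a 4-node ensemble tolerates no more failures than a 3-node one.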

⚙️
Apache KRaft
Built-in Consensus

KRaft (Kafka Raft Metadata Mode) is Kafka's native consensus protocol introduced via KIP-500 (Kafka Improvement Proposal 500), targeting the complete removal of ZooKeeper as a dependency. Introduced in Kafka 2.8 as an early-access feature, it reached production-readiness in Kafka 3.3 (October 2022) and became the sole supported mode from Kafka 4.0 onwards.


KRaft implements a variant of the Raft consensus algorithm (adapted to Kafka's pull-based replication model), embedding metadata management directly within the Kafka process. A dedicated set of controller nodes (or co-located controller+broker nodes) maintains a metadata log — an internal, single-partition topic (__cluster_metadata) replicated via the Raft protocol rather than Kafka's normal ISR mechanism.
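Conceptually, the active controller appends metadata records to its local log and only marks an offset committed once a strict majority of the controller quorum has replicated it. A toy sketch of that commit rule — class and field names here are invented for illustration, not Kafka internals:

```python
from dataclasses import dataclass, field

@dataclass
class MetadataQuorumLeader:
    voter_ids: list                         # e.g. [1, 2, 3]; voter_ids[0] is this leader
    log: list = field(default_factory=list)
    match_offset: dict = field(default_factory=dict)  # highest offset each voter holds
    commit_offset: int = -1                 # nothing committed yet

    def append(self, record):
        # Leader appends locally; the record is NOT yet committed.
        self.log.append(record)
        offset = len(self.log) - 1
        self.match_offset[self.voter_ids[0]] = offset
        return offset

    def on_fetch_ack(self, voter_id, offset):
        # A follower reports it has replicated the log up to `offset`.
        self.match_offset[voter_id] = offset
        # Committed offset = highest offset held by a strict majority of voters.
        held = sorted(self.match_offset.get(v, -1) for v in self.voter_ids)
        majority = len(self.voter_ids) // 2 + 1
        self.commit_offset = held[len(self.voter_ids) - majority]

leader = MetadataQuorumLeader(voter_ids=[1, 2, 3])
offset = leader.append("TopicRecord")       # appended at offset 0, not yet committed
leader.on_fetch_ack(2, offset)              # leader + voter 2 = majority of 3
# leader.commit_offset is now 0
```

The follower acks arrive via the same pull-style fetch path brokers use, which is the main way KRaft's Raft differs from the textbook push-based formulation.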

📅 Historical Timeline
2008
ZooKeeper Open-Sourced
Apache ZooKeeper donated to ASF by Yahoo!, initially part of Hadoop ecosystem.
2011 — Kafka Open-Sourced
Kafka Adopts ZooKeeper
From its first releases, Kafka uses ZooKeeper for broker registration, topic metadata, controller election, and consumer offsets.
2019 — KIP-500 Proposed
ZooKeeper Removal Plan
Community proposes native Kafka metadata quorum to eliminate ZooKeeper dependency entirely.
2021 — Kafka 2.8
KRaft Early Access
KRaft mode ships as a technical preview. Not suitable for production workloads yet.
2022 — Kafka 3.3
KRaft Production-Ready
KRaft declared production-ready; ZooKeeper mode begins official deprecation path.
2025 — Kafka 4.0
ZooKeeper Mode Removed
ZooKeeper support fully removed. KRaft is the only supported metadata management mode.

Architecture Deep Dive

How each coordination model is structurally composed, how metadata flows, and how leadership is elected.

🦓 ZooKeeper Architecture
ZOOKEEPER QUORUM (ZAB Protocol)
  ZK Leader :2181 · ZK Follower :2181 · ZK Follower :2181
      │ metadata
      ▼
KAFKA CLUSTER
  Kafka Controller (elected broker) · Broker 1 :9092 … Broker N :9092
⚠ Two separate systems to manage and scale
⚙️ KRaft Architecture
KRAFT CONTROLLER QUORUM (Raft Protocol)
  __cluster_metadata topic — internal log
  Active Controller (Raft Leader, epoch + offset) · Controller 2 (Raft Follower) · Controller 3 (Raft Follower)
      │ metadata fetch
      ▼
KAFKA BROKER NODES — pull metadata directly from the controller log
  Broker 1 :9092 · Broker 2 :9092 … Broker N :9092
✓ Single unified system — no external dependency
🔄 Metadata Flow Comparison
ZooKeeper Metadata Flow
# ZooKeeper znode hierarchy (legacy)
/kafka
  /brokers
    /ids/0                        # ephemeral → broker alive
    /topics/my-topic/partitions
  /controller                     # ephemeral → leader
  /isr_change_notification
  /admin/reassign_partitions
  /config/topics/my-topic

# Controller writes ZK → notifies ALL brokers
# Brokers read from ZK directly
# O(N×P) notification storm on restarts
KRaft Metadata Flow
# KRaft: metadata as a Kafka log
__cluster_metadata                 # partition 0
  offset 0: RegisterBrokerRecord
  offset 1: TopicRecord            # topic created
  offset 2: PartitionRecord
  offset 3: PartitionChangeRecord
  offset N: ProducerIdsRecord

# Brokers FETCH from controller like consumers
# No ZK watches — event-sourced, append-only
# Snapshots compacted at configurable intervals
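The practical consequence of "metadata as a log" is that any broker — or a restarted controller — can rebuild cluster state by folding records in offset order, and can catch up incrementally from the last offset it applied. A hedged Python sketch (the record shapes are simplified stand-ins, not Kafka's actual record schema):

```python
def apply_record(state, record):
    """Fold one metadata record into an in-memory view of the cluster."""
    kind = record["type"]
    if kind == "RegisterBrokerRecord":
        state.setdefault("brokers", set()).add(record["broker_id"])
    elif kind == "TopicRecord":
        state.setdefault("topics", {})[record["name"]] = {"partitions": {}}
    elif kind in ("PartitionRecord", "PartitionChangeRecord"):
        topic = state["topics"][record["topic"]]
        topic["partitions"][record["partition"]] = {"leader": record["leader"]}
    return state

def replay(log, state=None, from_offset=0):
    """Rebuild state from scratch, or catch up from a known offset."""
    state = {} if state is None else state
    for record in log[from_offset:]:
        apply_record(state, record)
    return state

log = [
    {"type": "RegisterBrokerRecord", "broker_id": 1},
    {"type": "TopicRecord", "name": "orders"},
    {"type": "PartitionRecord", "topic": "orders", "partition": 0, "leader": 1},
    {"type": "PartitionChangeRecord", "topic": "orders", "partition": 0, "leader": 2},
]
state = replay(log)
# state["topics"]["orders"]["partitions"][0]["leader"] == 2
```

Snapshots are just this folded state serialized at some offset, so a restart becomes "load snapshot, then replay the tail" instead of replaying from offset 0.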

Feature-by-Feature Comparison

A systematic breakdown across operational, architectural, and scalability dimensions critical for PB-scale deployments.

Dimension
🦓 ZooKeeper Mode
⚙️ KRaft Mode
External Dependency
Requires separate ZK cluster (3–5 nodes typical). Adds operational overhead, separate monitoring, and JVM management.
Zero external dependencies. Metadata quorum is embedded within Kafka itself — single process to manage.
Consensus Protocol
ZAB (ZooKeeper Atomic Broadcast) — Paxos variant. Leader-follower model. ZK leader processes all writes.
Raft consensus. Active controller is Raft leader. Metadata log replicated with configurable quorum (typically 3 controllers).
Metadata Storage
In-memory tree of znodes in ZK, persisted via transaction log + snapshots. Znodes limited to 1MB by default. All writes serialized through the ZK leader.
Append-only Kafka log (__cluster_metadata). Periodically snapshotted. No size limits from ZK constraints. Allows arbitrary record types.
Partition Scalability
Practically limited to ~200K partitions per cluster due to ZK watch storms and controller reload latency. Performance degrades significantly beyond this.
Supports millions of partitions. Benchmarks show 10–100× improvement in partition leadership handling. Confluent tested 3.3M partitions in KRaft.
Controller Failover
New controller must reload all metadata from ZK. At 200K partitions this can take 30–60+ seconds of unavailability — a critical operational risk.
Standby controllers maintain a full in-memory replica of the metadata log. Failover is near-instantaneous (seconds, not minutes).
Broker Startup
Each broker reads all partition/topic state from ZK on startup. Slow at scale. Fan-out causes ZK read amplification.
Brokers fetch metadata incrementally from the controller using a Kafka Fetch request. Faster, incremental, and bandwidth-efficient.
TLS / Auth Security
Separate TLS configuration for ZK and Kafka. Two sets of certificates, ACLs, and SASL configs. Double the attack surface.
Single unified security domain. One set of certificates and ACL configurations. Simplified compliance posture for regulated industries.
Observability
Requires monitoring two JVM applications with separate metric namespaces. ZK has its own 4-letter commands and JMX beans.
Unified JMX metrics from a single process. Simpler dashboards, single alert surface, and correlated logs.
Deployment Topology
Minimum 6 JVM processes (3 ZK + 3 Kafka) for HA. Odd ZK quorum mandatory. Higher infrastructure cost.
Minimum 3 JVM processes for HA (combined controller+broker). Can also run dedicated controller nodes. Lower resource floor.
Metadata Transactions
Multi-operation ZK transactions are limited. Coordinating complex changes (e.g., partition reassignment) requires multiple round-trips.
All metadata changes are atomic log entries. The controller applies them in order with exactly-once semantics guaranteed by Raft.
Kafka Version Support
Supported in Kafka 0.8 through 3.x. Fully removed in Kafka 4.0. No longer receives bug fixes or features post-3.x.
Available preview in Kafka 2.8+. Production-ready in 3.3+. Sole supported mode in Kafka 4.0+. All future features target KRaft only.
Rolling Upgrades
Kafka brokers and ZK nodes must be rolled separately. Compatible ZK version matrix adds complexity.
Rolling upgrades managed entirely within Kafka. No ZK version compatibility matrix to track. Simpler upgrade runbook.
Cloud-Native Fit
StatefulSets for both ZK and Kafka in Kubernetes. Helm charts are large. Persistent volumes needed for two systems.
Single StatefulSet. Strimzi, Confluent Platform, and managed services such as MSK all primarily target KRaft as of 2024.

Strengths & Weaknesses

An honest assessment of where each approach excels and where it falls short, especially at petabyte-scale and real-time streaming workloads.

🦓 ZooKeeper Mode

✅ Strengths

  • Battle-hardened maturity: 10+ years of production use across thousands of enterprise Kafka clusters globally — failure modes are well-understood.
  • Ecosystem tooling: Extensive tooling, runbooks, monitoring dashboards (Grafana templates, DataDog integrations) are readily available and mature.
  • General-purpose coordination: ZooKeeper can serve as a coordination service for other systems (HBase, HDFS NameNode, Solr) alongside Kafka, potentially sharing infrastructure.
  • Migration pathways proven: Operators have vast documentation, SRE playbooks, and institutional knowledge for ZK-mode cluster operations.
  • Stable under moderate scale: For clusters below 100K partitions with predictable workloads, ZK mode is reliable and well-understood.

❌ Weaknesses

  • Scalability ceiling: The infamous "200K partition wall" — ZK watch fan-out causes severe latency spikes during controller re-elections at high partition counts.
  • Split-brain risk: ZK network partitions can produce ambiguous controller election scenarios, requiring careful timeout tuning.
  • Slow failover: Controller restart requires a full metadata reload from ZK — minutes of unavailability for large PB-scale clusters.
  • Dual operational burden: Two independent distributed systems to patch, tune, monitor, and capacity-plan. JVM heap and GC tuning for both.
  • End-of-life: Deprecated since Kafka 3.x, fully removed in 4.0. No new features. Security patches only until end-of-life.
  • Write bottleneck: All metadata writes serialized through ZK leader. At PB-scale event volumes this creates a coordination bottleneck.

⚙️ KRaft Mode

✅ Strengths

  • Massive partition scalability: Confluent benchmarks demonstrate millions of partitions — critical for multi-tenant PB-scale platforms with thousands of topics.
  • Fast failover: Near-instantaneous controller failover (sub-second in tested scenarios) since standby controllers maintain full metadata state via Raft log replication.
  • Simplified operations: Single JVM, single deployment unit, single monitoring surface, single security domain — dramatically reduces operational complexity.
  • Event-sourced metadata: The __cluster_metadata log provides an auditable, replayable, append-only history of all cluster state changes — excellent for debugging and compliance.
  • Cloud-native: Purpose-built for Kubernetes and containerized environments. Minimal infrastructure footprint reduces compute/storage costs.
  • Future-proof: All Confluent, AWS MSK, and Apache Kafka development exclusively targets KRaft going forward.

❌ Weaknesses

  • Relative immaturity: Production-ready since 2022, meaning fewer years of widespread enterprise battle-testing compared to ZK mode's decade+ history.
  • Migration complexity: Migrating from ZooKeeper to KRaft on live clusters requires careful planning, the KIP-866 migration workflow (including kafka-storage.sh formatting of the new controllers), and maintenance windows.
  • Feature parity gaps (historical): Some advanced ACL and delegation token features lagged behind ZK mode in early KRaft releases (most resolved by Kafka 3.5+).
  • Controller node sizing: Dedicated controller nodes need carefully tuned JVM heap for the metadata log cache — underpowered controllers become the new bottleneck.
  • Snapshot management: The metadata log requires periodic snapshotting and compaction. Poor configuration leads to slow broker restarts while replaying large logs.
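The snapshot trade-off in the last point can be reasoned about with simple arithmetic: restart cost ≈ snapshot load time plus (records accumulated since the last snapshot) ÷ (record apply rate). A back-of-envelope sketch — the rates below are invented placeholders, not measured Kafka numbers:

```python
def restart_estimate_seconds(snapshot_load_s, records_since_snapshot, apply_rate_per_s):
    """Rough restart cost: load the latest snapshot, then replay the log tail."""
    return snapshot_load_s + records_since_snapshot / apply_rate_per_s

# Hypothetical: 5 s snapshot load, 2M records since last snapshot, 400k records/s
print(restart_estimate_seconds(5.0, 2_000_000, 400_000))   # 10.0 seconds
```

Tightening the snapshot interval shrinks the second term, which is exactly what the snapshot-frequency configs control.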

Performance Analysis

Quantitative and qualitative performance characteristics across critical operational dimensions for high-throughput, low-latency streaming systems.

Comparative Scoring (1–10) for PB-Scale / RT Streaming
🦓 ZooKeeper Mode
Partition Scalability
3.5
Controller Failover Speed
4.0
Operational Simplicity
4.5
Ecosystem Maturity
9.5
Security Surface
5.0
Cloud-Native Fit
4.0
Metadata Throughput
4.0
⚙️ KRaft Mode
Partition Scalability
9.7
Controller Failover Speed
9.5
Operational Simplicity
9.0
Ecosystem Maturity
7.2
Security Surface
9.2
Cloud-Native Fit
9.6
Metadata Throughput
9.3
⚡ Key Benchmark Data Points
Metric-by-metric (ZooKeeper Mode → KRaft Mode; source / context in parentheses):
  • Max Tested Partitions: ~200K (practical limit) → 3.3M+ partitions (Confluent KIP-500 benchmarks)
  • Controller Failover Time: 30–120 sec, load-dependent → <1 sec with hot standby (Kafka 3.3 release notes + Confluent testing)
  • Broker Registration, 1000 brokers: minutes, sequential ZK ops → seconds, parallel metadata fetch (Apache Kafka mailing list / KIP-500)
  • Metadata Propagation Latency: proportional to ZK write + notify fan-out → bounded by Raft quorum write + fetch (implementation difference)
  • Memory Overhead (metadata): ZK heap ~1–4GB for large clusters → controller heap ~2–8GB, configurable (empirical production observations)
  • Topic Create Throughput: ~100–200 topics/sec → ~1000+ topics/sec (Confluent Platform benchmarks)
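The failover gap above follows directly from where metadata lives: a ZK-mode controller must read per-partition state after winning election, so failover scales with partition count, while a KRaft standby already holds the state and pays roughly one Raft election. A rough model using the figures above as assumptions (the 4k reads/s rate is illustrative, not a measured ZK limit):

```python
def zk_failover_seconds(partitions, zk_reads_per_second):
    # New ZK-mode controller reloads per-partition state from ZooKeeper.
    return partitions / zk_reads_per_second

def kraft_failover_seconds(raft_election_ms=500):
    # Standby controller already holds the metadata; cost ~= one leader election.
    return raft_election_ms / 1000

# 200k partitions at an assumed ~4k ZK reads/s → 50 s, inside the 30–120 s band
print(zk_failover_seconds(200_000, 4_000))   # 50.0
print(kraft_failover_seconds())              # 0.5
```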

Frameworks & Tooling

Key management, monitoring, and integration frameworks categorized by their alignment with ZooKeeper vs KRaft deployments.

🦓 ZooKeeper Mode
🎛️

Kafka Manager (CMAK)

Cluster management GUI by Yahoo!

Yahoo!'s Cluster Manager for Apache Kafka. ZK-native — reads directly from ZooKeeper znodes. Rich partition rebalance UI. Deprecated in favor of KRaft-compatible tools but still in use at many enterprises.

📊

Kafka Exporter + ZK Exporter

Prometheus / Grafana monitoring

Separate Prometheus exporters for Kafka JMX metrics (kafka-exporter) and ZK metrics (zookeeper_exporter). Maintained separately; two dashboards required per cluster.

🐘

HBase + HDFS Shared ZK

Multi-system coordination

Organizations running HBase and HDFS can share a ZooKeeper ensemble with Kafka, reducing total ZK node count. This is a ZK-specific operational advantage irrelevant in KRaft mode.

🔧

Strimzi Operator (ZK mode)

Kubernetes-native Kafka

Strimzi supported ZK-mode Kubernetes deployments for years, with ZooKeeper configured inside its Kafka Custom Resource Definition (CRD). Now migrating all resources to KRaft-first from version 0.37+.

⚙️ KRaft Mode

Strimzi Operator (KRaft)

Production Kubernetes deployments

The leading open-source Kafka Kubernetes operator. KRaft-first since Strimzi 0.37. Supports dedicated and combined controller/broker topologies. Full lifecycle: rolling upgrades, scale-out, certificate rotation, and storage expansion.

☁️

AWS MSK (Managed Streaming)

Fully managed KRaft on AWS

Amazon MSK defaulted to KRaft mode for new clusters from 2023. Managed controller quorum, automated failover, and integrated IAM-based auth. MSK Express brokers (serverless-like) are KRaft-only.

🏢

Confluent Platform / Cloud

Enterprise Kafka distribution

Confluent Platform 7.4+ ships KRaft as default for new deployments. Confluent Cloud runs KRaft exclusively. Schema Registry, ksqlDB, and Kafka Connect are all KRaft-compatible. Enterprise support SLAs available.

🎯

Redpanda (KRaft-compatible API)

Kafka API-compatible alternative

Redpanda is a C++ Kafka-API-compatible broker with its own Raft implementation (no ZK, no JVM). It validates the architectural principle behind KRaft — strong evidence that built-in Raft consensus suits Kafka-style workloads at scale.

⚙️ KRaft Configuration Reference
Combined Mode (Dev/Small Clusters)
# server.properties — combined controller+broker
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@host1:9093,\
  2@host2:9093,3@host3:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER
inter.broker.listener.name=PLAINTEXT
log.dirs=/var/kafka/data

# Generate cluster UUID:
# kafka-storage.sh random-uuid
Dedicated Controllers (Production PB-scale)
# controller.properties
process.roles=controller
node.id=1
controller.quorum.voters=1@ctrl1:9093,\
  2@ctrl2:9093,3@ctrl3:9093
listeners=CONTROLLER://:9093
metadata.log.dir=/nvme/metadata

# Tune controller heap for large clusters (environment variable, not a property):
# KAFKA_HEAP_OPTS="-Xmx8g -Xms8g"

# broker.properties — separate file
process.roles=broker
node.id=10
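Before rolling these files out, it is worth sanity-checking controller.quorum.voters: node IDs must be unique, and an odd voter count gives a clean majority. A small hypothetical helper (not a Kafka-provided tool):

```python
def parse_quorum_voters(value):
    """Parse a controller.quorum.voters string of the form 'id@host:port,...'."""
    voters = []
    for entry in value.split(","):
        node_id, endpoint = entry.strip().split("@", 1)
        host, port = endpoint.rsplit(":", 1)
        voters.append((int(node_id), host, int(port)))
    ids = [node_id for node_id, _, _ in voters]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate node.id in controller.quorum.voters")
    if len(voters) % 2 == 0:
        raise ValueError("even voter count — use 3 or 5 for a clean majority")
    return voters

voters = parse_quorum_voters("1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093")
# voters == [(1, 'ctrl1', 9093), (2, 'ctrl2', 9093), (3, 'ctrl3', 9093)]
```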

Suitable Ecosystems

Mapping each coordination mode to the architectures, cloud environments, and organizational contexts where it is most appropriate.

🦓 ZooKeeper Mode — Best-Fit Ecosystems

🏛️ Legacy Hadoop Stacks

Organizations running HBase, HDFS NameNode HA, or Apache Solr Cloud that already maintain a ZooKeeper ensemble. Sharing ZK infrastructure avoids additional operational overhead.

🔒 Regulated On-Prem Environments

Financial institutions or healthcare organizations on Kafka 3.x with extensive ZK runbooks, compliance certifications, and change-freeze policies. Migrating carries risk that must be carefully managed.

📦 Moderate Scale (< 100K Partitions)

Stable workloads well within ZK's practical scalability limits, where the ROI of migrating to KRaft doesn't justify operational risk in the short term (on Kafka 3.x only).

⚠️ End-of-Life Warning

All ZK-mode ecosystems have a firm migration deadline: Kafka 4.0 dropped support. Organizations must migrate to KRaft to receive ongoing security patches and feature updates.

⚙️ KRaft Mode — Best-Fit Ecosystems

☁️ Cloud-Native / Kubernetes-First

AWS MSK, Confluent Cloud, Azure HDInsight Kafka, and Google Cloud's Managed Service for Apache Kafka all target KRaft. Kubernetes deployments using Strimzi benefit from single-StatefulSet architecture and simplified lifecycle management.

📈 PB-Scale Data Platforms

Streaming data lakes (Apache Iceberg + Kafka), real-time analytics platforms (Flink + Kafka), and multi-tenant SaaS event buses with millions of partitions. KRaft eliminates ZK as the architectural bottleneck.

⚡ Realtime Streaming (IoT, Fintech)

IoT device telemetry (millions of devices, dynamic topic creation), financial market data (microsecond SLAs, frequent partition reassignments), and edge computing pipelines requiring fast self-healing.

🔄 Modern Lambda/Kappa Architectures

Systems pairing Kafka with Apache Flink (stream processing), Apache Spark Structured Streaming, or Kafka Streams for unified batch+stream processing. KRaft's fast failover is critical for exactly-once semantics pipelines.

🛡️ Zero-Trust / High-Security Environments

Single security domain simplifies certificate management, RBAC, and audit logging. PCI-DSS, SOC2, HIPAA environments benefit from reduced attack surface and unified ACL management.

🔌 Integration Frameworks by Ecosystem
Stream Processing
  • Apache Flink 1.16+
  • Kafka Streams
  • ksqlDB
  • Spark Structured Streaming
  • Apache Samza
Data Integration
  • Kafka Connect
  • Debezium (CDC)
  • Apache NiFi
  • Airbyte + Kafka sink
  • MirrorMaker 2
Storage / Lakehouse
  • Apache Iceberg
  • Apache Hudi
  • Delta Lake
  • Apache Pinot
  • ClickHouse Kafka Engine

Migration Path: ZooKeeper → KRaft

For PB-scale production clusters, migrating from ZooKeeper to KRaft mode is a mandatory and critical operational event requiring careful planning.

📋 Migration Strategy Overview

Kafka provides an official ZK-to-KRaft migration workflow (KIP-866; production-ready in Kafka 3.6) that performs a rolling, online migration without full cluster downtime. The process works in phases:

  1. Phase 1 — Preparation: Upgrade all brokers to Kafka 3.6+ on ZK mode. Verify kafka-metadata-quorum.sh describe output.
  2. Phase 2 — KRaft Controllers: Add dedicated KRaft controller nodes. They observe ZK metadata in "migration mode" without taking over yet.
  3. Phase 3 — Metadata Migration: Trigger migration: ZK metadata is replicated to the KRaft metadata log. Dual-write ensures consistency during transition.
  4. Phase 4 — Broker Migration: Roll each broker to KRaft mode one at a time. Brokers begin reading metadata from KRaft controllers.
  5. Phase 5 — ZK Decommission: Once all brokers are KRaft-mode, ZK ensemble can be safely shut down and decommissioned.
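The five phases above form a strict linear progression with one irreversible edge — ZK decommission. Encoding that in automation guards makes the rollback window explicit; a hypothetical sketch, not Kafka tooling:

```python
PHASES = ["preparation", "kraft_controllers", "metadata_migration",
          "broker_migration", "zk_decommission"]

class MigrationPlan:
    """Tracks progress through the ZK→KRaft phases and the rollback window."""
    def __init__(self):
        self.index = 0

    @property
    def phase(self):
        return PHASES[self.index]

    def advance(self):
        if self.index == len(PHASES) - 1:
            raise RuntimeError("migration already complete")
        self.index += 1
        return self.phase

    def rollback_possible(self):
        # Reverting to ZK mode is only safe before ZK is decommissioned.
        return self.phase != "zk_decommission"

plan = MigrationPlan()
while plan.phase != "broker_migration":
    plan.advance()
print(plan.phase, plan.rollback_possible())   # broker_migration True
plan.advance()
print(plan.phase, plan.rollback_possible())   # zk_decommission False
```

A real runbook would attach verification commands (like the kafka-metadata-quorum.sh checks below) as preconditions on each advance.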
⚠️ Migration Risks & Mitigations
Dual-write Latency Spike

During migration phase, all metadata writes go to both ZK and KRaft. Monitor controller CPU and heap carefully. Use maintenance windows for PB-scale clusters.

ACL / Config Gaps

Verify that all ZK-stored ACLs and custom configs are correctly replicated to the KRaft metadata log before cutting over. Use kafka-acls.sh --list to compare before and after.

Consumer Group State

Consumer group offsets (stored in __consumer_offsets) are unaffected — they live in Kafka itself. Only metadata (topics, ACLs, configs) migrates.

Rollback Plan

Until Phase 5 is complete, rolling back to ZK mode is possible by removing KRaft controllers and reverting broker configs. After ZK decommission, rollback requires full cluster restore from backup.

🔧 Migration Command Reference
# Step 1: Verify current ZK cluster health
kafka-metadata-quorum.sh --bootstrap-server broker:9092 describe --status

# Step 2: Generate KRaft cluster ID (must match existing ZK cluster ID)
kafka-storage.sh info -c /path/to/server.properties

# Step 3: Format KRaft controller storage
kafka-storage.sh format -t <CLUSTER_UUID> -c controller.properties --ignore-formatted

# Step 4: Start KRaft controllers in migration mode
zookeeper.metadata.migration.enable=true    # in controller.properties

# Step 5: Enable broker migration (rolling, one broker at a time)
zookeeper.metadata.migration.enable=true    # add to each broker.properties temporarily

# Step 6: Verify KRaft metadata log after each broker
kafka-metadata-quorum.sh --bootstrap-server broker:9092 describe --replication

# Step 7: Complete migration (finalize + remove ZK config)
kafka-metadata-quorum.sh --bootstrap-server broker:9092 describe --status
# Remove zookeeper.connect from all broker configs
# Decommission ZK nodes

Architecture Recommendation

Final architectural guidance for Big Data engineers operating PB-scale and real-time streaming systems.

⚙️ New Deployments: KRaft — Always

For any greenfield Kafka deployment in 2024 and beyond, KRaft is the only rational choice. ZooKeeper mode is deprecated in 3.x and removed in Kafka 4.0. KRaft delivers superior scalability, faster failover, simpler operations, and is the sole target of all future Kafka development.

  • Use dedicated controller nodes (3 or 5) for PB-scale clusters
  • Allocate NVMe storage for metadata.log.dir
  • Size controller JVM heap at 8–16GB for >500K partitions
  • Deploy via Strimzi (K8s) or Confluent Platform for managed lifecycle
  • Tune metadata.log.max.record.bytes.between.snapshots to control snapshot frequency

🦓 Existing ZK Clusters: Plan Migration Now

Organizations on Kafka 3.x ZK mode must plan their KRaft migration timeline. The migration tooling is mature (3.6+), and the risk of staying on ZK mode — including security exposure and operational debt — outweighs the migration effort.

  • Target Kafka 3.7 or 3.8 for the migration (LTS candidates)
  • Allocate 2–4 week planning phase for PB-scale clusters
  • Run migration in staging environment first with production-like load
  • Schedule migration during low-traffic windows for large clusters
  • Keep ZK ensemble live for 2–4 weeks post-migration for rollback safety
🌿 Decision Framework
New or existing Kafka deployment?
🆕 New Deployment
Use KRaft exclusively
Kafka 3.3+ or 4.0+
🔄 Existing on ZK
Assess Kafka version
Kafka < 3.6
Upgrade first, then migrate
Kafka 3.6+
Run rolling ZK→KRaft migration

⚠️ Kafka 4.0 deadline: ZooKeeper mode is fully removed. All production clusters must be on KRaft before upgrading to Kafka 4.0.

🦓
ZooKeeper Mode
Battle-tested, limited to ~200K partitions, slow failover, dual operational overhead. Deprecated. Suitable only for short-term legacy maintenance.
⚖️
The Verdict
For PB-scale and real-time streaming, KRaft is the definitive answer. The partition scalability, failover speed, and operational simplicity improvements are architecturally transformative.
⚙️
KRaft Mode
Millions of partitions, sub-second failover, single security domain, cloud-native, event-sourced metadata. The future — and present — of Apache Kafka.