⚙️

HDFS Architecture

How the distributed file system is structured internally

HDFS (Hadoop Distributed File System) is modeled after the Google File System (GFS) paper published in 2003. It follows a master–worker architecture with a single centralized metadata server (NameNode) and many distributed storage nodes (DataNodes). The system is designed for batch processing of large sequential reads, not random access or low-latency retrieval.


Figure 1 — HDFS Architecture: NameNode, DataNodes, Standby NameNode, JournalNodes, and ZooKeeper coordination

💾

How HDFS Organizes Data Storage

Blocks, replication, rack awareness, and the namespace
128 MB · Default block size (CDH default; tunable)
3× · Default replication factor per block
~150 B · NameNode memory per file/dir/block inode
~400 M · Practical file limit per HDFS cluster
🧩 Block-Based Storage

Files are split into fixed-size blocks (default 128 MB). Each block is stored independently on DataNode local disks; unlike traditional file systems, HDFS stores each block as a standalone file on the underlying OS. A 500 MB file therefore becomes four blocks (three full blocks plus one partial) distributed across different DataNodes.

# Check block info for a file
hdfs fsck /user/data/bigfile.csv \
  -files -blocks -locations
# Output: blocks, replicas, DataNode locations
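The block-splitting arithmetic can also be sketched in plain Python — a toy calculation, not Hadoop code; it reads the article's 500 MB figure as 500 MiB:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, the dfs.blocksize default

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the byte sizes of the blocks a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

blocks = split_into_blocks(500 * 1024 * 1024)  # a 500 MiB file
print(len(blocks))          # 4 blocks: three full plus one partial
print(blocks[-1] // 2**20)  # the final block holds the remaining 116 MiB
```

Note that the partial final block only occupies its actual size on disk; HDFS does not pad the last block out to 128 MB.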
🗺️ Rack-Aware Replication

HDFS uses a rack-aware placement policy to maximize data reliability and network efficiency. For a replication factor of 3:

  • Blk 1 · Rack A · Primary (same rack as the client)
  • Blk 1 · Rack B · Replica 2 (different rack)
  • Blk 1 · Rack B · Replica 3 (same rack as Replica 2)

This ensures survival of a full rack failure while reducing cross-rack bandwidth.
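The policy can be sketched as follows (a simplified model of HDFS's default BlockPlacementPolicyDefault; the rack map and node names here are hypothetical):

```python
import random

# Sketch of rack-aware placement for replication factor 3:
# replica 1 on the writer's rack, replica 2 on a different rack,
# replica 3 on the same rack as replica 2 but on a different node.

RACKS = {  # hypothetical cluster topology
    "rackA": ["dn1", "dn2"],
    "rackB": ["dn3", "dn4"],
}

def place_replicas(client_rack: str) -> list[tuple[str, str]]:
    first = (client_rack, random.choice(RACKS[client_rack]))
    remote_rack = random.choice([r for r in RACKS if r != client_rack])
    second = (remote_rack, random.choice(RACKS[remote_rack]))
    # Third replica: same remote rack, different node than replica 2.
    third = (remote_rack, random.choice(
        [n for n in RACKS[remote_rack] if n != second[1]]))
    return [first, second, third]

placement = place_replicas("rackA")
print([rack for rack, _ in placement])  # e.g. ['rackA', 'rackB', 'rackB']
```

Spanning exactly two racks is the deliberate trade-off: one cross-rack transfer per block write, yet the block survives the loss of either rack.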

📁 Namespace & Metadata

The NameNode maintains a hierarchical POSIX-like namespace entirely in RAM for performance. Metadata includes the namespace tree (files and directories), file-to-block mapping, and block-to-DataNode locations. The latter is NOT persisted — DataNodes rebuild it at startup via block reports.

Two key files on the NameNode disk:

fsimage   # periodic namespace snapshot
editLog   # journal of changes since snapshot
# Merged by Secondary NameNode (Hadoop 1)
# or Standby NameNode (Hadoop 2+)
💓 DataNode Heartbeats & Block Reports

DataNodes maintain cluster health through two mechanisms:

  • Heartbeat — sent every 3 seconds. Confirms the DataNode is alive. If no heartbeat for 10 minutes, NameNode declares the node dead and triggers re-replication.
  • Block report — sent every ~1 hour. Lists all blocks stored on that DataNode so the NameNode can verify replication factor compliance.
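The liveness rule above can be sketched as a toy monitor (intervals taken from the text; real HDFS derives the dead-node timeout from dfs.heartbeat.interval and dfs.namenode.heartbeat.recheck-interval, which works out to 10.5 minutes by default):

```python
# Toy version of the NameNode's dead-node check: any DataNode whose last
# heartbeat is older than the timeout is declared dead, and its blocks
# are scheduled for re-replication elsewhere.

HEARTBEAT_INTERVAL_S = 3
DEAD_TIMEOUT_S = 10 * 60  # the article's "10 minutes"

def dead_nodes(last_heartbeat: dict[str, float], now: float) -> set[str]:
    return {dn for dn, t in last_heartbeat.items()
            if now - t > DEAD_TIMEOUT_S}

last = {"dn1": 995.0, "dn2": 400.0}   # seconds; dn2 last seen 601 s ago
print(dead_nodes(last, now=1001.0))   # {'dn2'}
```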
📌 Node failure rate at Yahoo's clusters: 2–3 per 1,000 nodes per day — HDFS was designed assuming hardware failure is the norm, not the exception.
⚠️

Real Operational Problems with HDFS

Issues encountered at production scale by Facebook, Yahoo, DiDi, and others

While HDFS has powered petabyte-scale clusters for over a decade, organizations at scale — including Facebook, Yahoo, and DiDi — have all publicly documented serious operational challenges. These are not theoretical concerns; they are real, recurring pain points in production.

NameNode: Single Point of Failure (Critical)

In Hadoop 1.x, the NameNode was the sole custodian of all metadata. If it failed, the entire cluster became unavailable. Recovery required an administrator to manually reconstruct the namespace from the Secondary NameNode — a process that could take 30+ minutes on large clusters, often with data loss due to the lag in edit log merging. Facebook explicitly cited this as a major risk for their production deployments.

Hadoop 2.0+ HA: Active/Standby NameNode pair using Quorum Journal Manager (QJM) with ≥3 JournalNodes. ZooKeeper coordinates automatic failover. Standby takes over in seconds.
The Small Files Problem (Persistent)

HDFS was designed to store a small number of large files, not millions of small ones. Every file, directory, and block occupies ~150 bytes of NameNode heap memory, regardless of actual file size. 100 million files consume hundreds of gigabytes of RAM. A 1 KB log file consumes the same metadata overhead as a 128 MB file. IoT, streaming, and event-driven pipelines constantly produce small files, overwhelming the NameNode.

# Impact: 10M small files → 3 GB+ NameNode RAM
# Each object: ~150 bytes
10,000,000 × 150 bytes ≈ 1.4 GB heap
# Plus 2 replicas → 3 block entries × 150 bytes each
30,000,000 × 150 bytes ≈ 4.2 GB total
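The arithmetic can be verified with a few lines of Python (the ~150-byte figure is the rule of thumb from the text; actual heap cost varies by Hadoop version and object type):

```python
OBJ_BYTES = 150  # rough NameNode heap cost per metadata object

def heap_gib(num_objects: int, bytes_per_object: int = OBJ_BYTES) -> float:
    """Estimated NameNode heap in GiB for a given object count."""
    return num_objects * bytes_per_object / 2**30

print(round(heap_gib(10_000_000), 1))  # 1.4 GiB for 10M file objects
print(round(heap_gib(30_000_000), 1))  # 4.2 GiB counting 3 block entries each
```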
Mitigations: HAR files (Hadoop Archives), SequenceFiles, Avro containers, CombineFileInputFormat — though none are seamless. Apache Ozone solves this architecturally.
NameNode Memory Scalability Ceiling (Scaling)

Since all namespace metadata is stored in the NameNode's JVM heap, the cluster's file count is bounded by the RAM of a single machine. In practice this limits clusters to around 400 million files. HDFS Federation (Hadoop 2.0+) allows multiple independent NameNodes, each owning a portion of the namespace — but this adds significant operational complexity and does not share metadata across namespaces.

HDFS Federation: multiple NameNodes, each owning a namespace volume. Apache Ozone: moves metadata to RocksDB, tested to 10 billion objects.
No Low-Latency or Random Access (Performance)

HDFS optimizes for high-throughput streaming reads of large files. It explicitly sacrifices low-latency access. The read path involves an RPC call to the NameNode for block locations, then sequential reads from DataNodes. Random reads require seeking to arbitrary block offsets, causing multiple network hops. This makes HDFS unsuitable for interactive analytics, databases, or any workload requiring sub-second response times.

Alluxio provides an in-memory caching layer above HDFS. Apache Spark's data locality features help reduce latency for compute-on-storage patterns.
Write-Once, Append-Only Semantics (Flexibility)

HDFS files are fundamentally write-once, read-many. Appending was added in Hadoop 0.21 but is unreliable under concurrent access. Overwriting a file requires deleting and re-creating it. There is no in-place update support. This creates major challenges for GDPR/CCPA data deletion requirements, schema corrections, and any workload requiring ACID-style updates — such as dimension table updates in a data warehouse.

Apache Hudi, Iceberg, Delta Lake provide ACID transactions and row-level updates on top of HDFS or object stores.
Storage–Compute Coupling (Cloud Readiness)

HDFS ties storage and compute together on the same nodes to exploit data locality (processing data where it lives). While efficient for MapReduce, this model means you cannot scale storage and compute independently. Adding more compute requires adding storage nodes and vice versa. In cloud environments where elastic scaling is the norm, this is a fundamental mismatch — you must provision for peak storage AND peak compute simultaneously, driving up costs.

Cloud object stores (S3, GCS, ADLS) decouple storage from compute entirely. Apache Ozone + YARN/Kubernetes hybrid deployments help on-premise.
Replication Overhead: 3× Storage Amplification (Cost)

The default 3× replication means you need 3 TB of raw disk to store 1 TB of data. At petabyte scale this is enormously expensive. While Erasure Coding (EC) was added in Hadoop 3.0 and can reduce overhead to ~1.5×, it introduces significant CPU overhead for encoding/decoding and increased network I/O during reconstruction. DiDi reported switching from 3× replication to EC 6-3 and saving hundreds of petabytes of storage.

Hadoop 3.0 Erasure Coding (Reed-Solomon). Apache Ozone also supports EC natively, with DiDi reporting ~50% storage cost reduction vs. 3× replication.
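The overhead figures follow directly from the coding parameters; as a sketch, RS(6,3) writes 6 data units plus 3 parity units per stripe:

```python
def replication_overhead(replicas: int = 3) -> float:
    """Raw bytes written per logical byte under N-way replication."""
    return float(replicas)

def ec_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes written per logical byte under Reed-Solomon EC."""
    return (data_units + parity_units) / data_units

print(replication_overhead())  # 3.0x: 3 TB of raw disk per TB of data
print(ec_overhead(6, 3))       # 1.5x for RS(6,3), i.e. "EC 6-3"
print(1 - ec_overhead(6, 3) / replication_overhead())  # 0.5: half the raw disk
```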
Slow NameNode Cold Start (Availability)

When a NameNode restarts from scratch, it must load the entire namespace (fsimage) into memory and replay all edit logs. On large clusters with tens of millions of files, this safe mode period can last 30–60 minutes, during which the cluster is read-only. DataNodes also need time to report their blocks before the NameNode considers the cluster healthy, further delaying availability.

Standby NameNode (Hadoop 2+) eliminates cold start for failover since the standby has already replayed all edits in memory.
🔧

New Tools That Emerged to Solve These Problems

From Apache Ozone to table formats — the modern HDFS-adjacent ecosystem
🪨
Apache Ozone
Next-gen Object Store (Hadoop-native)

Ozone was developed inside the Hadoop project specifically to overcome HDFS scalability limits. It separates metadata management (Ozone Manager, backed by RocksDB) from block management (Storage Container Manager). Tested to 10 billion objects vs. HDFS's ~400M limit. DiDi migrated hundreds of petabytes to Ozone and saw P90 metadata latency drop from 90ms to 17ms and 20%+ read throughput improvement. Supports S3 API and HDFS API simultaneously. Cloudera's TPC-DS benchmark showed Ozone outperforms HDFS by an average of 3.5% on query completion time.

S3-compatible · RocksDB metadata · Erasure Coding · 10B objects tested · Hadoop-native
🏔️
Apache Iceberg
Open Table Format

Originally designed at Netflix, Iceberg solves the lack of ACID semantics on HDFS and object stores. It provides snapshot isolation, schema evolution, hidden partitioning, and time travel without rewriting data. Works on HDFS, S3, GCS, ADLS. Widely considered the leading open table format in 2025–2026 due to its engine-agnostic design (works with Spark, Flink, Trino, Hive, Dremio). Critical for GDPR row-level deletes that HDFS cannot natively support.

ACID transactions · Time travel · Schema evolution · Multi-engine · Hidden partitioning
🔺
Delta Lake
Open Table Format (Databricks)

Created by Databricks in 2019 and open-sourced under the Linux Foundation. Delta Lake adds a transaction log (delta log) on top of Parquet files stored in HDFS or object stores, enabling ACID operations, data versioning, and Z-order clustering. Best suited for Spark-centric ecosystems. Its Delta UniForm feature allows reading Delta tables as Iceberg or Hudi tables for cross-format interoperability.

ACID on Spark · Transaction log · Z-order clustering · Parquet-native · Delta UniForm
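The transaction-log idea can be illustrated with a toy sketch. This is not Delta's actual on-disk format (real Delta writes numbered JSON commit files under _delta_log/ with much richer actions); it only shows how replaying an ordered log of add/remove actions yields the current live file set:

```python
import json

# Toy append-only transaction log: each commit is a JSON entry listing
# data files added and removed; replaying commits in order reconstructs
# the table's current state without mutating any Parquet file in place.

log: list[str] = []

def commit(add=(), remove=()) -> None:
    log.append(json.dumps({"add": list(add), "remove": list(remove)}))

def live_files() -> set[str]:
    files: set[str] = set()
    for entry in log:
        action = json.loads(entry)
        files |= set(action["add"])
        files -= set(action["remove"])
    return files

commit(add=["part-0.parquet", "part-1.parquet"])
commit(add=["part-2.parquet"], remove=["part-0.parquet"])  # a "rewrite"
print(sorted(live_files()))  # ['part-1.parquet', 'part-2.parquet']
```

This is how an update lands on write-once storage: new files are written, old ones are logically removed, and the log — not the file system — defines the table.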
🐝
Apache Hudi
Data Lakehouse Platform

Developed at Uber for their data ingestion pipelines, Hudi (Hadoop Upserts Deletes and Incrementals) was the first table format to bring row-level updates and deletes to HDFS. Its merge-on-read architecture enables near-real-time upserts with low latency. Hudi 1.x now positions itself as a full DLMS (Data Lakehouse Management System) and supports Iceberg as an underlying table format.

Row-level upserts · Merge-on-read · Incremental CDC · Streaming ingestion · Iceberg-compatible
Alluxio
Distributed Cache / Data Orchestration

Alluxio sits between compute engines (Spark, Presto, Hive) and storage systems (HDFS, S3, Azure). It provides an in-memory distributed cache, enabling sub-second data access for hot datasets without changing the underlying storage. It also abstracts the storage layer, allowing transparent migration from HDFS to cloud object stores. Often deployed to eliminate the latency gap when moving from HDFS-local to S3-remote storage.

In-memory cache · Storage abstraction · Transparent tiering · Multi-cloud
🗄️
HopsFS
Distributed HDFS Metadata (Research)

HopsFS replaces the single-node in-memory metadata service with a distributed metadata service built on NDB (MySQL Cluster). It can store 24× more metadata than vanilla HDFS and delivers 2.6× the throughput. Spotify contributed real workload traces for benchmarking. HopsFS is the storage layer for Hopsworks, a managed ML platform. This approach demonstrates that distributing NameNode metadata is technically feasible, influencing design discussions for future HDFS iterations.

Distributed metadata · 24× more files · NewSQL backend · ML platform

HDFS Strengths

Where HDFS genuinely excels and why it remained dominant for a decade
💪 Where HDFS Wins
Extremely high sequential throughput — designed for streaming reads of large files, HDFS can deliver sustained GB/s throughput by reading from multiple DataNodes in parallel.
Proven fault tolerance — 3× replication (or EC in Hadoop 3) means data survives DataNode failures automatically without administrator intervention.
Data locality for MapReduce/Spark — compute tasks are scheduled on nodes that already hold the data, eliminating most network I/O for batch jobs.
Commodity hardware — runs on standard x86 servers with JBODs (Just a Bunch of Disks). No specialized hardware required, dramatically reducing costs.
Battle-tested at extreme scale — Facebook (hundreds of PB), Yahoo (tens of PB), Twitter, LinkedIn, Alibaba all ran HDFS at massive scale.
Strong POSIX-like consistency — HDFS provides read-after-write consistency without coordination overhead; S3, by contrast, was only eventually consistent until late 2020.
Rich Hadoop ecosystem integration — native integration with Hive, HBase, Spark, Pig, Flume, and the entire Apache big data ecosystem without adapters.
❌ Where HDFS Struggles
Small files problem — millions of small files exhaust NameNode heap memory and degrade performance across all cluster operations.
No low-latency access — high-throughput design makes it fundamentally unsuitable for interactive queries, OLTP workloads, or sub-second dashboards.
NameNode is a scalability ceiling — even with HDFS Federation, managing multiple federated namespaces adds significant operational complexity.
Write-once limitation — no native support for record updates or deletes, making compliance (GDPR), corrections, and dimension table updates complex.
3× storage overhead — unless Erasure Coding (Hadoop 3+) is used, every byte of data costs 3× in raw disk, making archival storage expensive.
Storage–compute coupling — cannot scale storage and compute independently. Poor fit for cloud-native, elastic infrastructure patterns.
Operational complexity — requires dedicated Hadoop expertise for tuning, maintenance, upgrades, and incident response. High ops burden compared to managed cloud storage.
🔄

Alternatives to HDFS

Cloud-native, on-premise, and hybrid storage systems replacing or supplementing HDFS
☁️
Amazon S3
Cloud Object Store

The most widely adopted HDFS alternative for cloud workloads. S3 decouples storage from compute entirely. However, raw HDFS read performance can be 3× faster than S3 for large sequential workloads. The s3a:// connector allows Spark/Hive to read from S3 with no code changes. Best combined with a table format (Iceberg/Delta) to add ACID semantics.

Elastic · Managed · 11 9s durability · Lifecycle tiering
🦑
MinIO
S3-Compatible On-Prem Object Store

MinIO is a high-performance, Kubernetes-native object store compatible with the Amazon S3 API. Unlike HDFS, it does not require a Hadoop cluster and can run anywhere from a laptop to a datacenter. Best suited for organizations that want S3 compatibility on-premise without managing a Hadoop cluster. MinIO documents strict read-after-write consistency in distributed mode, comparable to HDFS's strong consistency model.

S3-API · Kubernetes-native · Easy deployment · No Hadoop needed
🐙
Ceph (CephFS / RGW)
Unified Distributed Storage

Ceph is an open-source distributed storage system offering object storage (via the Ceph RADOS Gateway, S3-compatible), block storage (RBD), and a POSIX-compliant file system (CephFS). Unlike HDFS, Ceph has no single metadata-server bottleneck for data placement: clients locate objects by computing placement directly with the CRUSH algorithm, while CephFS metadata is served by a horizontally scalable set of MDS daemons. Widely used in OpenStack and private cloud deployments. Suitable for mixed workloads requiring block, file, and object access simultaneously.

Unified storage · No SPOF metadata · POSIX + S3 · Open-source
🏔️
Google Cloud Storage (GCS)
Managed Cloud Object Store

GCS supports strong read-after-write consistency (unlike early S3) and integrates natively with Dataproc (Google's managed Hadoop). The gs:// connector gives Spark and Hadoop direct access. Google internally uses Colossus (GFS successor) for their own workloads. GCS is the preferred HDFS replacement for GCP-based big data workloads.

Strong consistency · GCP-native · Dataproc integration · Managed
🌊
Azure Data Lake Storage Gen2
Managed Hierarchical Object Store

ADLS Gen2 uniquely combines Azure Blob Storage's object store characteristics with a hierarchical namespace that enables directory-level atomic operations (rename, delete), critical for Hadoop compatibility. This gives it significant performance advantages over flat-namespace S3 for workloads that rely on directory renames (e.g., Spark job output commits). The preferred HDFS replacement for Azure HDInsight and Synapse Analytics.

Hierarchical NS · Atomic rename · Azure-native · Hadoop-compatible
🪨
Apache Ozone (Recap)
Best HDFS Drop-in On-Prem

For organizations that must stay on-premise and need HDFS compatibility without its limitations, Apache Ozone is the recommended evolution path. It supports both HDFS API and S3 API, integrates with the entire Hadoop ecosystem, and scales to billions of objects. Cloudera has adopted it as the default storage for CDP Private Cloud. DiDi, one of the largest Ozone deployments, manages hundreds of petabytes with it.

On-prem preferred · HDFS + S3 API · CDP-integrated · DiDi: 100s PB
📊

Feature Comparison Matrix

HDFS vs modern alternatives across key operational dimensions
Feature / Dimension        | HDFS         | Apache Ozone      | Amazon S3                | Ceph RGW     | MinIO
Small Files Handling       | Poor         | Excellent         | Good                     | Good         | Good
Max Scalability (files)    | ~400M        | 10B+ tested       | Unlimited                | Very High    | High
Sequential Throughput      | Excellent    | Very Good         | Good                     | Good         | Very Good
Random Access Latency      | Poor         | Fair              | Fair                     | Fair         | Fair
POSIX Compliance           | Partial      | Partial           | None                     | CephFS: Full | None
S3 API Compatibility       | No           | Yes               | Native                   | Yes          | Yes
ACID Transactions*         | No           | No (need Iceberg) | No (need Iceberg)        | No           | No
Hadoop Ecosystem Fit       | Native       | Native            | Via s3a://               | Via s3a://   | Via s3a://
Storage Overhead           | 3× (default) | 1.5× (EC 6-3)     | Managed                  | Configurable | Configurable
Operational Complexity     | High         | High              | Low (managed)            | Medium       | Medium
Compute–Storage Decoupling | No           | Partial           | Yes                      | Partial      | Yes
On-Premise Deployment      | Yes          | Yes               | Cloud only               | Yes          | Yes
Data Consistency Model     | Strong       | Strong (Raft)     | Eventual → Strong (2020) | Eventual     | Strong

* ACID transactions are typically provided by a table format layer (Apache Iceberg, Delta Lake, Hudi) on top of any storage system.

📖

References & Further Reading

Primary sources, official documentation, and research papers
[1] Apache Hadoop Docs HDFS High Availability — Official Guide (v3.4.1)
apache.org · Published 2024-10-09
[2] Cloudera Blog Small Files, Big Foils: Metadata and Application Challenges
Cloudera Technical Blog
[3] USENIX Login HDFS Scalability: The Limits to Growth — Konstantin Shvachko
USENIX ;login: 2010, Vol. 36
[4] Apache Ozone Apache Ozone — Official Homepage
ozone.apache.org
[5] ASF Blog How DiDi Scaled to Hundreds of Petabytes with Apache Ozone
Apache Software Foundation Blog · Jan 2026
[6] Cloudera FAQ What Is Apache Ozone? — Cloudera
Cloudera Resources
[7] GTech Analysis Apache Ozone: Powerful Solution for Big Data Projects
gtech.com.tr · Jan 2025
[10] ScienceDirect Small Files Problem in Hadoop: Systematic Literature Review
Journal of King Saud Univ. · 2021
[12] Preferred Networks A Year with Apache Ozone
tech.preferred.jp · Dec 2021
[13] Wiley / CPE An Archive-based Method for Efficiently Handling Small Files in HDFS
Concurrency Comput.: Practice Exp. · 2024
[14] Onehouse Blog Apache Hudi vs Delta Lake vs Iceberg — Lakehouse Feature Comparison
onehouse.ai · Updated Oct 2025
[16] Quora HDFS Alternatives for Hadoop Used in Production
Quora · Expert answers
[17] Arenadata Docs HDFS vs Ozone: Comparison of Features
docs.arenadata.io
[18] ResearchGate Comparative Analysis of HDFS and Apache Ozone
ResearchGate · Mar 2025