HDFS Architecture
HDFS (Hadoop Distributed File System) is modeled after the Google File System (GFS) paper published in 2003. It follows a master–worker architecture with a single centralized metadata server (NameNode) and many distributed storage nodes (DataNodes). The system is designed for batch processing of large sequential reads, not random access or low-latency retrieval.
Figure 1 — HDFS Architecture: NameNode, DataNodes, Standby NameNode, JournalNodes, and ZooKeeper coordination
How HDFS Organizes Data Storage
| Parameter | Default |
|---|---|
| Block size | 128 MB (CDH default; tunable) |
| Replication | 3× factor per block |
| Metadata cost | ~150 bytes per file/dir/block inode |
| NameNode | 1 per HDFS cluster |
Files are split into fixed-size blocks (default 128 MB). Each block is stored independently on DataNode local disks. Unlike traditional file systems, HDFS stores each block of a file as a standalone file on the underlying OS, and the final block occupies only its actual length. A 500 MB file therefore becomes 4 blocks (three full 128 MB blocks plus one 116 MB block) distributed across different DataNodes.
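The block arithmetic can be sketched in a few lines of Python (the function name is ours, not an HDFS API):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size: 128 MB

def split_into_blocks(file_size: int) -> list[int]:
    """Return the block sizes a file of `file_size` bytes occupies in HDFS."""
    full_blocks, remainder = divmod(file_size, BLOCK_SIZE)
    sizes = [BLOCK_SIZE] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block stores only its actual length
    return sizes

blocks = split_into_blocks(500 * 1024 * 1024)   # a 500 MB file
print(len(blocks))                              # 4
print(blocks[-1] // (1024 * 1024))              # 116 (MB in the final block)
```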
HDFS uses a rack-aware placement policy to maximize data reliability and network efficiency. For a replication factor of 3:
| Replica | Location | Rationale |
|---|---|---|
| 1 | Rack A | same rack as the writing client |
| 2 | Rack B | a different rack |
| 3 | Rack B | same rack as replica 2 |
This ensures survival of a full rack failure while reducing cross-rack bandwidth.
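A toy version of that policy for a replication factor of 3 (node and rack names are invented; the real placement logic also weighs node load and available space):

```python
import random

def place_replicas(client_rack: str, racks: dict[str, list[str]]) -> list[str]:
    """Pick 3 DataNodes: one on the client's rack, two together on another rack."""
    first = random.choice(racks[client_rack])
    remote_rack = random.choice([r for r in racks if r != client_rack])
    second, third = random.sample(racks[remote_rack], 2)
    return [first, second, third]

racks = {"rackA": ["dn1", "dn2"], "rackB": ["dn3", "dn4", "dn5"]}
print(place_replicas("rackA", racks))  # e.g. ['dn1', 'dn4', 'dn3']
```

Only one of the three replicas crosses the rack boundary at write time, which is where the cross-rack bandwidth saving comes from.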
The NameNode maintains a hierarchical POSIX-like namespace entirely in RAM for performance. Metadata includes the namespace tree (files and directories), file-to-block mapping, and block-to-DataNode locations. The latter is NOT persisted — DataNodes rebuild it at startup via block reports.
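A toy model of that split, with made-up block and node IDs: the namespace and file-to-block map survive restarts, while block locations live only in memory and are repopulated by block reports:

```python
# Persisted (via fsimage + edit log): namespace tree and file-to-block mapping.
namespace = {"/logs/app.log": ["blk_1001", "blk_1002"]}

# NOT persisted: block-to-DataNode locations, rebuilt from block reports.
block_locations: dict[str, set[str]] = {}

def handle_block_report(datanode: str, blocks: list[str]) -> None:
    """Record which blocks a DataNode holds, as reported at startup."""
    for blk in blocks:
        block_locations.setdefault(blk, set()).add(datanode)

handle_block_report("dn1", ["blk_1001", "blk_1002"])
handle_block_report("dn2", ["blk_1001"])
print(sorted(block_locations["blk_1001"]))  # ['dn1', 'dn2']
```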
Two key files on the NameNode disk:
- fsimage — a periodic checkpoint of the entire namespace (directory tree and file-to-block mapping).
- edits (the edit log) — an append-only journal of every namespace change since the last checkpoint.
DataNodes maintain cluster health through two mechanisms:
- Heartbeat — sent every 3 seconds. Confirms the DataNode is alive. If no heartbeat arrives for about 10 minutes (10 min 30 s with default settings), the NameNode declares the node dead and triggers re-replication of its blocks.
- Block report — sent every 6 hours by default. Lists all blocks stored on that DataNode so the NameNode can verify replication-factor compliance.
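The liveness check itself reduces to a timestamp comparison; a minimal sketch using the intervals above (constant and function names are ours):

```python
import time

HEARTBEAT_INTERVAL = 3        # seconds between heartbeats
DEAD_NODE_TIMEOUT = 10 * 60   # ~10 minutes of silence marks a node dead

def is_dead(last_heartbeat: float, now: float) -> bool:
    """True once a DataNode has been silent long enough to trigger re-replication."""
    return now - last_heartbeat > DEAD_NODE_TIMEOUT

now = time.time()
print(is_dead(now - 5, now))        # False: heartbeat 5 s ago
print(is_dead(now - 11 * 60, now))  # True: silent for 11 min
```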
Real Operational Problems with HDFS
While HDFS has powered petabyte-scale clusters for over a decade, organizations at scale — including Facebook, Yahoo, and DiDi — have all publicly documented serious operational challenges. These are not theoretical concerns; they are real, recurring pain points in production.
In Hadoop 1.x, the NameNode was the sole custodian of all metadata. If it failed, the entire cluster became unavailable. Recovery required an administrator to manually reconstruct the namespace from the Secondary NameNode — a process that could take 30+ minutes on large clusters, often with data loss due to the lag in edit log merging. Facebook explicitly cited this as a major risk for their production deployments.
HDFS was designed to store a small number of large files, not millions of small ones. Every file, directory, and block object occupies roughly 150 bytes of NameNode heap, regardless of actual file size: a 1 KB log file carries the same metadata overhead as a 128 MB file. At 100 million files the heap requirement reaches on the order of 100 GB (a common sizing rule is about 1 GB of heap per million blocks). IoT, streaming, and event-driven pipelines constantly produce small files, overwhelming the NameNode.
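The back-of-envelope arithmetic, assuming ~150 bytes per metadata object and one inode plus one block object per small file (provisioned heaps run far larger, since the 150-byte figure excludes JVM and bookkeeping overhead):

```python
BYTES_PER_OBJECT = 150  # rough NameNode heap cost per inode or block object

def metadata_heap_bytes(n_files: int, blocks_per_file: int = 1) -> int:
    """Raw metadata bytes: one inode object plus one object per block, per file."""
    return n_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

raw_gb = metadata_heap_bytes(100_000_000) / 1024**3
print(f"{raw_gb:.0f} GB")  # ~28 GB of raw metadata objects for 100 M small files
```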
Since all namespace metadata is stored in the NameNode's JVM heap, the cluster's file count is bounded by the RAM of a single machine. In practice this limits clusters to around 400 million files. HDFS Federation (Hadoop 2.0+) allows multiple independent NameNodes, each owning a portion of the namespace — but this adds significant operational complexity and does not share metadata across namespaces.
HDFS optimizes for high-throughput streaming reads of large files. It explicitly sacrifices low-latency access. The read path involves an RPC call to the NameNode for block locations, then sequential reads from DataNodes. Random reads require seeking to arbitrary block offsets, causing multiple network hops. This makes HDFS unsuitable for interactive analytics, databases, or any workload requiring sub-second response times.
HDFS files are fundamentally write-once, read-many. Appending was added in Hadoop 0.21 but is unreliable under concurrent access. Overwriting a file requires deleting and re-creating it. There is no in-place update support. This creates major challenges for GDPR/CCPA data deletion requirements, schema corrections, and any workload requiring ACID-style updates — such as dimension table updates in a data warehouse.
HDFS ties storage and compute together on the same nodes to exploit data locality (processing data where it lives). While efficient for MapReduce, this model means you cannot scale storage and compute independently. Adding more compute requires adding storage nodes and vice versa. In cloud environments where elastic scaling is the norm, this is a fundamental mismatch — you must provision for peak storage AND peak compute simultaneously, driving up costs.
The default 3× replication means you need 3 TB of raw disk to store 1 TB of data. At petabyte scale this is enormously expensive. While Erasure Coding (EC) was added in Hadoop 3.0 and can reduce overhead to ~1.5×, it introduces significant CPU overhead for encoding/decoding and increased network I/O during reconstruction. DiDi reported switching from 3× replication to EC 6-3 and saving hundreds of petabytes of storage.
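The overhead difference is simple arithmetic; a sketch comparing 3× replication with Reed-Solomon EC 6-3 (6 data cells + 3 parity cells per stripe):

```python
def raw_storage(logical: float, scheme: str) -> float:
    """Raw disk needed to store `logical` units of data under a redundancy scheme."""
    if scheme == "replication-3x":
        return logical * 3             # three full copies of every block
    if scheme == "ec-6-3":
        return logical * (6 + 3) / 6   # 1.5x: parity cells instead of copies
    raise ValueError(f"unknown scheme: {scheme}")

print(raw_storage(1000, "replication-3x"))  # 3000 (TB of raw disk per PB of data)
print(raw_storage(1000, "ec-6-3"))          # 1500.0
```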
When a NameNode restarts from scratch, it must load the entire namespace (fsimage) into memory and replay all edit logs. On large clusters with tens of millions of files, this safe mode period can last 30–60 minutes, during which the cluster is read-only. DataNodes also need time to report their blocks before the NameNode considers the cluster healthy, further delaying availability.
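Conceptually the restart is checkpoint-plus-log-replay, which a toy model makes concrete (operation names and paths are invented):

```python
def restore_namespace(fsimage: dict, edits: list[tuple[str, str]]) -> dict:
    """Rebuild the namespace: load the last checkpoint, then replay the edit log."""
    namespace = dict(fsimage)          # load fsimage into memory
    for op, path in edits:             # replay every logged change, in order
        if op == "create":
            namespace[path] = []
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

ns = restore_namespace({"/a": []}, [("create", "/b"), ("delete", "/a")])
print(sorted(ns))  # ['/b']
```

The longer the edit log has grown since the last checkpoint, the longer the replay, which is why checkpoint lag directly extends the safe-mode window.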
New Tools That Emerged to Solve These Problems
Ozone was developed inside the Hadoop project specifically to overcome HDFS scalability limits. It separates metadata management (Ozone Manager, backed by RocksDB) from block management (Storage Container Manager). Tested to 10 billion objects vs. HDFS's ~400M limit. DiDi migrated hundreds of petabytes to Ozone and saw P90 metadata latency drop from 90ms to 17ms and 20%+ read throughput improvement. Supports S3 API and HDFS API simultaneously. Cloudera's TPC-DS benchmark showed Ozone outperforms HDFS by an average of 3.5% on query completion time.
Originally designed at Netflix, Iceberg solves the lack of ACID semantics on HDFS and object stores. It provides snapshot isolation, schema evolution, hidden partitioning, and time travel without rewriting data. Works on HDFS, S3, GCS, ADLS. Widely considered the leading open table format in 2025–2026 due to its engine-agnostic design (works with Spark, Flink, Trino, Hive, Dremio). Critical for GDPR row-level deletes that HDFS cannot natively support.
Created by Databricks and open-sourced under the Linux Foundation in 2019, Delta Lake adds a transaction log (the delta log) on top of Parquet files stored in HDFS or object stores, enabling ACID operations, data versioning, and Z-order clustering. Best suited for Spark-centric ecosystems. Its Delta UniForm feature allows reading Delta tables as Iceberg or Hudi tables for cross-format interoperability.
Developed at Uber for their data ingestion pipelines, Hudi (Hadoop Upserts Deletes and Incrementals) was the first table format to bring row-level updates and deletes to HDFS. Its merge-on-read architecture enables near-real-time upserts with low latency. Hudi 1.x now positions itself as a full DLMS (Data Lakehouse Management System) and supports Iceberg as an underlying table format.
Alluxio sits between compute engines (Spark, Presto, Hive) and storage systems (HDFS, S3, Azure). It provides an in-memory distributed cache, enabling sub-second data access for hot datasets without changing the underlying storage. It also abstracts the storage layer, allowing transparent migration from HDFS to cloud object stores. Often deployed to eliminate the latency gap when moving from HDFS-local to S3-remote storage.
HopsFS replaces the single-node in-memory metadata service with a distributed metadata service built on NDB (MySQL Cluster). It can store 24× more metadata than vanilla HDFS and delivers 2.6× the throughput. Spotify contributed real workload traces for benchmarking. HopsFS is the storage layer for Hopsworks, a managed ML platform. This approach demonstrates that distributing NameNode metadata is technically feasible, influencing design discussions for future HDFS iterations.
HDFS Strengths
- Excellent sequential throughput for large-file, batch-oriented workloads.
- Rack-aware 3× replication that survives disk, node, and full-rack failures.
- Strong consistency with a hierarchical, POSIX-like namespace.
- Data locality: compute is scheduled where the data lives, minimizing network transfer.
- A decade-plus of production hardening at petabyte scale, with native integration across the Hadoop ecosystem.
Alternatives to HDFS
The most widely adopted HDFS alternative for cloud workloads. S3 decouples storage from compute entirely. However, raw HDFS read performance can be 3× faster than S3 for large sequential workloads. The s3a:// connector allows Spark/Hive to read from S3 with no code changes. Best combined with a table format (Iceberg/Delta) to add ACID semantics.
MinIO is a high-performance, Kubernetes-native object store compatible with the Amazon S3 API. Unlike HDFS, it does not require a Hadoop cluster and can run anywhere from a laptop to a datacenter. Best suited for organizations that want S3 compatibility on-premise without managing a Hadoop cluster. It provides strict read-after-write consistency, matching HDFS's strong-consistency guarantees.
Ceph is an open-source distributed storage system offering object storage (via Ceph RADOS Gateway, S3-compatible), block storage (RBD), and a POSIX-compliant file system (CephFS). Unlike HDFS, Ceph has no single metadata server bottleneck — metadata is distributed via the CRUSH algorithm across monitor nodes. Widely used in OpenStack and private cloud deployments. Suitable for mixed workloads requiring block, file, and object access simultaneously.
GCS supports strong read-after-write consistency (unlike early S3) and integrates natively with Dataproc (Google's managed Hadoop). The gs:// connector gives Spark and Hadoop direct access. Google internally uses Colossus (GFS successor) for their own workloads. GCS is the preferred HDFS replacement for GCP-based big data workloads.
ADLS Gen2 uniquely combines Azure Blob Storage's object store characteristics with a hierarchical namespace that enables directory-level atomic operations (rename, delete), critical for Hadoop compatibility. This gives it significant performance advantages over flat-namespace S3 for workloads that rely on directory renames (e.g., Spark job output commits). The preferred HDFS replacement for Azure HDInsight and Synapse Analytics.
For organizations that must stay on-premise and need HDFS compatibility without its limitations, Apache Ozone is the recommended evolution path. It supports both HDFS API and S3 API, integrates with the entire Hadoop ecosystem, and scales to billions of objects. Cloudera has adopted it as the default storage for CDP Private Cloud. DiDi, one of the largest Ozone deployments, manages hundreds of petabytes with it.
Feature Comparison Matrix
| Feature / Dimension | HDFS | Apache Ozone | Amazon S3 | Ceph RGW | MinIO |
|---|---|---|---|---|---|
| Small Files Handling | Poor | Excellent | Good | Good | Good |
| Max Scalability (files) | ~400M | 10B+ tested | Unlimited | Very High | High |
| Sequential Throughput | Excellent | Very Good | Good | Good | Very Good |
| Random Access Latency | Poor | Fair | Fair | Fair | Fair |
| POSIX Compliance | Partial | Partial | None | CephFS: Full | None |
| S3 API Compatibility | No | Yes | Native | Yes | Yes |
| ACID Transactions | No | No (need Iceberg) | No (need Iceberg) | No | No |
| Hadoop Ecosystem Fit | Native | Native | Via s3a:// | Via s3a:// | Via s3a:// |
| Storage Overhead | 3× (default) | 1.5× (EC 6-3) | Managed | Configurable | Configurable |
| Operational Complexity | High | High | Low (managed) | Medium | Medium |
| Compute–Storage Decoupling | No | Partial | Yes | Partial | Yes |
| On-Premise Deployment | Yes | Yes | Cloud only | Yes | Yes |
| Data Consistency Model | Strong | Strong (Raft) | Eventual → Strong (2021) | Strong | Strong |
* ACID transactions are typically provided by a table format layer (Apache Iceberg, Delta Lake, Hudi) on top of any storage system.