A comprehensive technical comparison for architects and engineers designing petabyte-scale data systems
Amazon S3 is a massively scalable object storage service launched in 2006. It decouples compute from storage, enabling independent scaling of each. S3 stores data as flat objects with metadata, accessible via HTTP REST APIs. At petabyte scale, S3 becomes the backbone of nearly every cloud-native data lakehouse architecture, with durability guarantees of 99.999999999% (11 nines) achieved through automatic replication across ≥3 AZs.
HDFS is the fault-tolerant, distributed file system at the core of the Hadoop ecosystem, released in 2006 and inspired by Google's GFS paper (2003). It stores data in large blocks (128 MB by default, often configured to 256 MB) distributed across DataNodes, with a centralized NameNode for metadata. HDFS co-locates compute and storage (data locality), enabling high-throughput sequential reads ideal for large-scale batch processing workloads. It operates best in managed cluster environments.
Context: At petabyte scale, the choice between S3 and HDFS is rarely binary. Modern architectures increasingly adopt a hybrid or lakehouse model — storing cold/warm data in S3 while using HDFS-compatible APIs (via EMR, Databricks, or open-source clusters) for hot compute-intensive workloads. The decision hinges on access patterns, cost structures, team capability, and cloud strategy.
Key trait: S3 was eventually consistent by default until December 2020, when it gained strong read-after-write consistency for all operations — a major architectural improvement for streaming and incremental workloads.
Key trait: The NameNode centralizes all metadata and is the primary scalability bottleneck. In large deployments, the in-memory metadata (~150 bytes per file/block object) limits the namespace to roughly 500M–1B files without HDFS Federation. An HA NameNode (ZKFC + JournalNodes) mitigates the single point of failure but not the memory ceiling.
PB-Scale Performance Note: At petabyte scale with 10,000+ concurrent Spark tasks, S3's aggregate throughput (5,500 GET/s per prefix × N prefixes) often exceeds HDFS total cluster bandwidth. However, S3 network egress costs and inter-request latency become dominant cost/latency factors. Solutions: S3 Express One Zone (~10ms latency), partition-aware prefix design, and intelligent tiering via S3 Intelligent-Tiering.
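Partition-aware prefix design can be sketched as a key-naming scheme that spreads hot reads across many prefixes. The bucket layout below is hypothetical; the 5,500 GET/s figure is S3's documented per-prefix request rate:

```python
import hashlib

GETS_PER_PREFIX = 5_500  # S3's documented per-prefix GET/HEAD rate

def sharded_key(table: str, partition: str, filename: str, shards: int = 16) -> str:
    """Prepend a short hash shard so requests spread across S3 prefixes.

    Hypothetical layout: s3://bucket/<shard>/<table>/<partition>/<file>.
    """
    h = int(hashlib.md5(f"{table}/{partition}/{filename}".encode()).hexdigest(), 16)
    return f"{h % shards:02x}/{table}/{partition}/{filename}"

def aggregate_get_rate(shards: int) -> int:
    """With N shard prefixes, aggregate reads scale to roughly N x 5,500 GET/s."""
    return shards * GETS_PER_PREFIX

print(sharded_key("trips", "dt=2024-01-01", "part-0000.parquet"))
print(aggregate_get_rate(16))  # 16 prefixes -> 88,000 GET/s aggregate
```

Note that hash-prefixing trades away lexicographic listing convenience, so it pairs best with a table format (Iceberg/Hudi) that tracks file locations in metadata rather than via LIST calls.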
The dominant S3 compute engine. Spark's S3A connector (with the magic committer) enables efficient large-scale reads and writes. AWS EMR uses EMRFS for S3 optimizations, including optimized list operations; its consistent-view feature became unnecessary once S3 gained strong consistency. Databricks Auto Optimize auto-compacts small files on S3.
The recommended table format for S3-based lakehouses. Iceberg adds hidden partitioning, schema evolution, time-travel, ACID transactions, and partition pruning on top of S3 object storage. Netflix, Apple, and LinkedIn use Iceberg at exabyte scale on S3. Works seamlessly with Spark, Flink, Trino, and Athena.
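Iceberg's hidden partitioning is set at table creation via a partition transform, so readers never need to filter on a derived partition column. A hedged sketch of the DDL, as it might be issued through Spark SQL (catalog and table names hypothetical):

```python
# Hypothetical Iceberg DDL; days(ts) is a hidden-partitioning transform,
# so queries filtering on ts get partition pruning automatically.
ddl = """
CREATE TABLE lake.events (
    id BIGINT,
    ts TIMESTAMP,
    payload STRING
)
USING iceberg
PARTITIONED BY (days(ts))
TBLPROPERTIES ('format-version' = '2')
"""
# In a Spark session with an Iceberg catalog configured:
# spark.sql(ddl)
```

Format version 2 enables row-level deletes, which the merge-on-read engines (Flink CDC, Trino MERGE) depend on.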
Athena enables serverless SQL directly on S3 data with zero cluster management. Glue Data Catalog acts as the Hive Metastore. Athena v3 (Trino-based) supports Iceberg, Hudi, and Delta Lake natively. At $5/TB scanned, partition pruning via columnar formats (Parquet/ORC) is essential for cost control.
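The cost impact of partition pruning is easy to quantify at Athena's $5/TB-scanned rate. A minimal sketch (table sizes hypothetical):

```python
ATHENA_PRICE_PER_TB = 5.00  # USD per TB scanned

def query_cost(bytes_scanned: int) -> float:
    """Athena bills per byte scanned; shown here at the raw $5/TB rate."""
    return bytes_scanned / 1024**4 * ATHENA_PRICE_PER_TB

full_scan = query_cost(50 * 1024**4)   # 50 TB table, no pruning
pruned = query_cost(200 * 1024**3)     # 200 GB read after partition pruning
print(f"${full_scan:.2f} vs ${pruned:.2f}")  # $250.00 vs $0.98
```

A 250x cost gap per query is why Parquet/ORC layout and partition design are treated as first-class cost controls, not optimizations.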
Flink's StreamingFileSink (now FileSink) writes exactly-once to S3 using multipart uploads committed on checkpoint. Micro-batch compaction on S3 with Flink + Iceberg enables a near-real-time lakehouse. As a simpler managed alternative, Kinesis Data Firehose buffers records and batches them into S3, with built-in Parquet/ORC format conversion.
DuckDB can query Parquet files on S3 directly from a laptop — ideal for data exploration at scale without spinning up clusters. Trino (formerly PrestoSQL) enables distributed interactive SQL on S3 at PB scale with sub-minute query times on Parquet/ORC Iceberg tables.
Hudi is optimized for incremental data ingestion and CDC (Change Data Capture) workloads on S3. Copy-on-Write (CoW) for read-heavy and Merge-on-Read (MoR) for write-heavy pipelines. Uber originally built Hudi to handle millions of trips/day updating S3-backed tables.
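The CoW-vs-MoR choice surfaces as a single writer option. A hedged sketch of a Hudi datasource configuration for a CDC upsert pipeline (table and field names hypothetical; the option keys are standard Hudi datasource options):

```python
# Hypothetical Hudi writer options for a CDC upsert pipeline.
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    # MERGE_ON_READ favors write-heavy CDC ingestion;
    # COPY_ON_WRITE favors read-heavy analytical tables.
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}
# In a Spark session with the Hudi bundle on the classpath:
# df.write.format("hudi").options(**hudi_options).mode("append").save(s3_path)
```

MoR defers merge cost to compaction and read time, which is the right trade when write amplification on S3 would otherwise dominate.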
The original HDFS compute paradigm. YARN's locality-aware scheduler ensures map tasks run on DataNodes holding the input splits, achieving near-zero network I/O for reads. Still used for ETL workloads where predictable, stable throughput outweighs Spark's in-memory speed.
HBase is tightly coupled to HDFS — its WAL (Write-Ahead Log) and HFile storage require HDFS append semantics and atomic renames. Provides millisecond random reads via its LSM-tree architecture. Facebook's Messages ran HBase on HDFS at hundreds of PB. Not feasible on raw S3; even EMR's HBase-on-S3 mode keeps the WAL on cluster HDFS.
Kafka log segments can be tiered to HDFS via Confluent Tiered Storage or the Kafka HDFS Connector (Kafka Connect). Enables long-term log retention on HDFS with Kafka's sequential I/O pattern matching HDFS block-sequential reads perfectly.
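A tiering pipeline via Kafka Connect can be sketched as a connector configuration like the following (connector name, topics, and NameNode address are hypothetical; the keys are Confluent HDFS Sink Connector options):

```json
{
  "name": "logs-to-hdfs",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "8",
    "topics": "app-logs,access-logs",
    "hdfs.url": "hdfs://namenode:8020",
    "flush.size": "100000",
    "rotate.interval.ms": "600000"
  }
}
```

Large `flush.size` values keep output files block-sized, preserving the sequential-I/O match between Kafka segments and HDFS blocks noted above.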
Hive with LLAP (Live Long and Process) daemon caches data in HDFS-adjacent memory, enabling sub-second interactive queries. HDFS's fast metadata enables efficient partition management at scale. Hive ACID on ORC provides row-level insert/update/delete on HDFS-backed tables.
Spark on HDFS achieves maximum performance when co-located with DataNodes. YARN's node-local, rack-local scheduling minimizes data shuffle distance. For iterative ML workloads (MLlib), HDFS-backed RDD caching on local disks outperforms S3-backed Spark significantly.
Flink's state backends (RocksDB) checkpoint directly to HDFS, leveraging append semantics and reliable block storage. HDFS-backed checkpoints are faster to write and restore than S3 for large state sizes (100s of GB). Ideal for stateful CEP and windowed aggregation pipelines.
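The HDFS-backed checkpoint setup reduces to a few Flink configuration keys. A sketch of the relevant `flink-conf.yaml` fragment (paths and interval are hypothetical):

```yaml
# RocksDB state backend with incremental checkpoints to HDFS.
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: hdfs://namenode:8020/flink/checkpoints
state.savepoints.dir: hdfs://namenode:8020/flink/savepoints
execution.checkpointing.interval: 60000
```

Incremental checkpointing is what makes 100s-of-GB state practical: only changed RocksDB SST files are uploaded per checkpoint.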
✔ Building a greenfield cloud-native architecture with no on-premises constraints.
✔ Your data access is primarily read-heavy and batch, tolerating 100–500ms latency.
✔ You want zero operational overhead — no cluster team, no capacity planning.
✔ Your data temperature varies — some data is hot, most is warm/cold. S3 tiering saves dramatically.
✔ You need global multi-region durability (CRR) and compliance (WORM, Object Lock).
✔ You're building a data lakehouse with Iceberg/Delta + Spark/Trino/Athena stack.
✔ Your compute is elastic and bursty — you pay only for active processing cycles.
✔ Team expertise is in cloud-native tools rather than Hadoop ecosystem administration.
✔ You have significant on-premises infrastructure investment and regulatory constraints preventing cloud adoption.
✔ Your workloads are shuffle-intensive and latency-sensitive — e.g., iterative ML, graph processing where data locality eliminates network I/O.
✔ You run HBase or other systems requiring HDFS append semantics and atomic renames.
✔ Your egress costs on cloud would be catastrophic — heavy compute jobs reading PBs repeatedly.
✔ You require sub-10ms read latency from local DataNode reads (e.g., real-time ML inference backed by Hive LLAP).
✔ You already operate a mature Hadoop platform with a dedicated ops team and want incremental migration.
✔ Your data consists of large sequential files accessed in batch patterns — HDFS block streaming is optimally tuned for this.
✔ You need vendor-neutral open-source with full data sovereignty.
Most large-scale enterprises end up here. Use HDFS for hot/operational data (HBase, Kafka, real-time Flink) while simultaneously using S3 as the canonical data lake for historical, analytical, and cold data. DistCp or Flink pipelines replicate from HDFS to S3 for long-term retention and cloud analytics. This pattern is common at LinkedIn, Twitter/X, Alibaba, and Tencent. The S3A connector in Hadoop 3.x enables Spark to read both HDFS and S3 transparently, enabling gradual migration. Platforms like Cloudera CDP and AWS EMR explicitly support this hybrid model.
Multi-layered approach to data quality and reliability on S3:
Tip: Use S3 Lifecycle to auto-move Bronze → Glacier after 90 days. Only Gold tables need Standard tier.
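The Bronze-to-Glacier tip above maps to a single S3 Lifecycle rule. A minimal sketch (prefix and rule ID are hypothetical; the structure matches the S3 Lifecycle configuration API):

```python
import json

# Lifecycle rule: Bronze-layer objects transition to Glacier after 90 days.
lifecycle = {
    "Rules": [
        {
            "ID": "bronze-to-glacier",
            "Status": "Enabled",
            "Filter": {"Prefix": "bronze/"},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }
    ]
}
print(json.dumps(lifecycle, indent=2))
# Apply with boto3:
# s3.put_bucket_lifecycle_configuration(
#     Bucket="datalake", LifecycleConfiguration=lifecycle)
```

Because lifecycle rules are prefix-scoped, the Bronze/Silver/Gold layers must live under distinct prefixes for this tiering to be automatic.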
Classic pattern for combining batch accuracy with real-time speed:
Note: Kappa architecture (Flink only) is increasingly preferred for simplicity on HDFS/S3.
Phased migration from on-prem HDFS to cloud S3:
Architect's Verdict: In 2024–2025, the industry has decisively shifted toward S3 as the long-term storage layer for new big data systems — driven by lakehouse formats (Iceberg, Delta), serverless query engines (Athena, Snowflake on S3), and the elimination of the data locality advantage through high-bandwidth networking. HDFS remains irreplaceable for HBase, stateful Flink checkpoints, and on-premises regulated environments. For PB-scale greenfield systems: adopt the S3 lakehouse pattern with Iceberg + Spark on EMR Serverless or Databricks. For existing HDFS investments: plan a phased migration using DistCp for cold data and S3A connector for hybrid reads, targeting a 3–5 year full migration horizon.