The open-source table format revolutionizing how massive datasets are stored, queried, and managed in modern Big Data platforms — enabling petabyte-scale analytics with full ACID guarantees.
A high-performance open table format for huge analytic datasets, designed at Netflix and donated to the Apache Software Foundation in 2018.
Apache Iceberg is an open table format — not a storage system, not a query engine, not a database. Think of it as a specification and implementation that sits between your data files (Parquet, ORC, Avro) and your compute engines (Spark, Flink, Trino, Athena).
The name "Iceberg" is metaphorical: like an actual iceberg, only a small part is visible (what query engines interact with), while a massive, rich structure lies beneath — tracking every file, schema version, partition, and snapshot of your data.
Before Iceberg, teams using Hive tables suffered from inconsistent reads during writes, painful schema changes, no time-travel, and partitioning schemes baked into folder structures. Iceberg solves all of this elegantly.
| Feature | Hive | Iceberg |
|---|---|---|
| ACID Transactions | Limited (ORC managed tables only) | ✓ |
| Schema Evolution | Limited | Full |
| Time Travel | ✗ | ✓ |
| Hidden Partitioning | ✗ | ✓ |
| Concurrent Writes | ✗ | ✓ |
| Partition Evolution | ✗ | ✓ |
| Row-level Deletes | ✗ | ✓ |
| Multi-engine Access | Limited | Fully Open |
Iceberg's architecture is a carefully designed layered system — each layer serves a specific purpose and communicates upward and downward with precision.
Every Iceberg write operation creates a new immutable snapshot. This is the heart of Iceberg's power.
Every INSERT, UPDATE, or DELETE creates a new snapshot. A snapshot points to a manifest list, which points to manifest files, which point to data files. The full lineage is preserved.
Old snapshots are never deleted until explicitly expired. This enables time travel (read data as of any past snapshot), rollback, and audit trails without any extra tooling.
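The snapshot chain can be modeled as an append-only list of immutable states. Here is a toy sketch in plain Python — not the Iceberg API; the `Snapshot` and `Table` names are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Snapshot:
    """One immutable table state: an id, a parent, and the files it contains."""
    snapshot_id: int
    parent_id: Optional[int]
    data_files: tuple  # file paths, frozen at commit time

class Table:
    def __init__(self):
        self.snapshots = {}     # id -> Snapshot; kept until explicitly expired
        self.current_id = None  # the catalog's "current metadata" pointer

    def commit(self, new_files):
        """Each write appends a new snapshot; old ones stay readable."""
        parent = self.snapshots.get(self.current_id)
        existing = parent.data_files if parent else ()
        snap = Snapshot(len(self.snapshots) + 1, self.current_id,
                        existing + tuple(new_files))
        self.snapshots[snap.snapshot_id] = snap
        self.current_id = snap.snapshot_id
        return snap.snapshot_id

    def read(self, as_of=None):
        """Time travel: read the current snapshot, or any past one by id."""
        sid = as_of if as_of is not None else self.current_id
        return self.snapshots[sid].data_files

t = Table()
v1 = t.commit(["f1.parquet"])
v2 = t.commit(["f2.parquet"])
assert t.read() == ("f1.parquet", "f2.parquet")
assert t.read(as_of=v1) == ("f1.parquet",)  # time travel to the first commit
```

Rollback in this model is just setting `current_id` back to an earlier snapshot id — which is exactly why Iceberg rollback is a metadata-only operation.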
Multiple writers use optimistic locking via the catalog. A writer reads current state, makes changes, and atomically swaps the metadata pointer. Conflicts are detected and retried automatically.
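The optimistic protocol can be sketched as a compare-and-swap on the metadata pointer. This is illustrative Python, not a real catalog client — `Catalog` and `commit_with_retry` are hypothetical names:

```python
import threading

class Catalog:
    """Holds one atomic pointer per table: the current metadata file path."""
    def __init__(self, initial):
        self.pointer = initial
        self._lock = threading.Lock()  # stands in for the catalog's atomic swap

    def swap(self, expected, new):
        """Commit succeeds only if nobody else committed since we read."""
        with self._lock:
            if self.pointer != expected:
                return False  # conflict: another writer committed first
            self.pointer = new
            return True

def commit_with_retry(catalog, write_metadata, max_retries=3):
    for _ in range(max_retries):
        base = catalog.pointer           # 1. read current state
        new = write_metadata(base)       # 2. write a new metadata file
        if catalog.swap(base, new):      # 3. atomically swap the pointer
            return new
        # conflict detected: loop re-reads the new base and retries
    raise RuntimeError("too many concurrent commits")

cat = Catalog("v1.metadata.json")
result = commit_with_retry(cat, lambda base: "v2.metadata.json")
assert result == "v2.metadata.json"
```

No locks are held while data files are written; only the final pointer swap is serialized, which is what lets many writers work concurrently.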
From a query arriving at a compute engine to data files being read from object storage — here's the full journey.
Engine looks up the table in the catalog (Glue, Hive Metastore, REST) and gets the path of the current metadata file (e.g., s3://bucket/table/metadata/v42.metadata.json).
Engine reads the JSON metadata file to get the current snapshot, schema, and partition spec. For time-travel queries, a specific snapshot ID is used instead.
Engine reads the manifest list. Each manifest has partition-level stats. Manifests with partitions that don't match the WHERE clause are skipped entirely — this is coarse pruning.
Within matching manifests, each data file's column min/max/null stats are checked. Files that can't contain matching rows are skipped — this is fine-grained pruning.
Only the qualifying files are opened. Row-level deletes (position/equality delete files) are applied. Parquet column pruning and predicate pushdown happen at the file level.
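Steps 3–5 amount to checking per-file statistics against the predicate before opening anything. A minimal sketch of min/max pruning (the stats layout is simplified — real manifests carry per-column bounds in Avro):

```python
def may_contain(file_stats, column, lo, hi):
    """Min/max pruning: a file can be skipped iff its recorded range
    for `column` cannot overlap the query range [lo, hi]."""
    fmin, fmax = file_stats[column]
    return not (fmax < lo or fmin > hi)

# Hypothetical per-file stats, as a manifest would record them.
files = {
    "f1.parquet": {"order_id": (1, 1000)},
    "f2.parquet": {"order_id": (1001, 2000)},
    "f3.parquet": {"order_id": (2001, 3000)},
}

# WHERE order_id BETWEEN 1500 AND 1600 -- only f2 survives pruning.
to_scan = [f for f, stats in files.items()
           if may_contain(stats, "order_id", 1500, 1600)]
assert to_scan == ["f2.parquet"]
```

The same overlap test runs twice in practice: once against partition-level stats in the manifest list (coarse), then against per-file column stats (fine-grained).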
Writer creates new Parquet/ORC files on object storage. Files are not yet visible to any reader — they're staged in isolation. No locks held during this phase.
Writer creates manifest Avro files listing the new data files with their column statistics (min, max, null counts). These stats are critical for future read performance.
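The statistics written at this step are simple per-column aggregates computed over each file's rows; sketched here in Python (illustrative — not Iceberg's actual Avro manifest schema):

```python
def column_stats(rows, column):
    """min, max, and null count for one column across a file's rows --
    the aggregates a manifest records per data file."""
    values = [r[column] for r in rows if r[column] is not None]
    nulls = sum(1 for r in rows if r[column] is None)
    return {"min": min(values), "max": max(values), "null_count": nulls}

rows = [{"amount": 10}, {"amount": 75}, {"amount": None}]
assert column_stats(rows, "amount") == {"min": 10, "max": 75, "null_count": 1}
```

These few bytes per column are what make the read-path pruning described above possible without opening any Parquet files.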
Writer creates a new metadata JSON file with the new snapshot. The snapshot references the new manifest list combining old + new manifests. Parent snapshot is recorded.
Writer atomically updates the catalog to point to the new metadata file. This single atomic operation makes the entire transaction visible. Either all changes appear or none do.
```sql
-- Query data at a specific snapshot
SELECT * FROM prod.orders VERSION AS OF 8349571234;

-- Query data at a specific timestamp
SELECT * FROM prod.orders TIMESTAMP AS OF '2024-01-15 09:00:00';

-- View all snapshots
SELECT * FROM prod.orders.snapshots ORDER BY committed_at DESC;
```
```sql
-- Add a column safely (no rewrite!)
ALTER TABLE prod.orders ADD COLUMN loyalty_tier STRING;

-- Rename a column safely
ALTER TABLE prod.orders RENAME COLUMN total TO total_amount;

-- Change the partition scheme live
ALTER TABLE prod.orders ADD PARTITION FIELD months(order_date);
```
```sql
MERGE INTO prod.customers t
USING staging.customers_updates s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
```sql
-- Old data: partitioned by day
-- New data: partitioned by hour
-- Iceberg reads both layouts transparently
ALTER TABLE prod.events ADD PARTITION FIELD hours(event_time);

-- Query seamlessly across both
SELECT * FROM prod.events WHERE event_time >= '2024-01-01';
```
Iceberg doesn't just solve one problem — it comprehensively addresses every major pain point in big data table management.
**Data Reliability:** Query historical data at any snapshot without ETL pipelines or separate backup infrastructure. Roll back a bad write in seconds — just update the catalog pointer to a previous snapshot.

**Zero Downtime:** Add, drop, rename, or reorder columns without touching existing files. Iceberg uses field IDs (not names) internally, so renaming is purely a metadata operation — zero data movement.

**Smart Optimization:** Query engines don't need to know partitioning details — Iceberg handles them transparently. Partition schemes can evolve over time; old and new data coexist without migration.

**Performance:** Metadata-driven file pruning eliminates irrelevant files before opening any data. Column statistics at the manifest level skip entire file groups — often 10–100× fewer files scanned than Hive.

**Interoperability:** Spark, Flink, Trino, Presto, Hive, Athena, Snowflake, and DuckDB can all read the same table simultaneously with full consistency. No engine lock-in — switch engines or run several at once.

**Compliance:** Delete specific rows without rewriting entire partitions. Position and equality deletes are tracked in delete files and merged at read time. Ideal for right-to-erasure compliance.

Iceberg is supported by virtually every major data engine, cloud provider, and data tool — making it the de facto standard open table format.
| Feature | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Open Spec | ✓ Apache | ✓ Linux Foundation | ✓ Apache |
| Time Travel | ✓ | ✓ | Limited |
| Multi-Engine | Best | Good | Good |
| Partition Evolve | ✓ | ✗ | ✗ |
| Hidden Partitions | ✓ | ✗ | ✗ |
| Cloud Native | ✓ | ✓ | Partial |
A battle-tested configuration guide for running Apache Iceberg in production with high reliability and performance.
The catalog is your single most important infrastructure choice. It must be highly available, consistent, and support atomic operations.
```properties
spark.sql.catalog.prod=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.prod.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.prod.warehouse=s3://my-datalake/warehouse
spark.sql.catalog.prod.io-impl=org.apache.iceberg.aws.s3.S3FileIO
```
Object storage (S3/GCS/ADLS) is the recommended storage layer. Configure S3 for high-throughput multipart uploads and server-side encryption.
```properties
spark.hadoop.fs.s3a.multipart.size=128M
spark.hadoop.fs.s3a.multipart.threshold=128M
spark.hadoop.fs.s3a.connection.maximum=200
spark.hadoop.fs.s3a.fast.upload=true
spark.hadoop.fs.s3a.block.size=256M
```
Set critical table-level properties for performance, data quality, and operational manageability from the start.
```sql
CREATE TABLE prod.events (
  event_id   BIGINT,
  event_time TIMESTAMP,
  user_id    STRING,
  payload    STRING
) USING iceberg
PARTITIONED BY (days(event_time))
TBLPROPERTIES (
  'write.format.default'='parquet',
  'write.parquet.compression-codec'='zstd',
  'write.target-file-size-bytes'='268435456',        -- 256 MB
  'history.expire.max-snapshot-age-ms'='604800000',  -- 7 days
  'write.metadata.delete-after-commit.enabled'='true',
  'write.metadata.previous-versions-max'='10'
);
```
Schedule regular compaction to merge small files, expire old snapshots, and remove orphaned data. Critical for performance at scale.
```sql
-- Compact small files (run nightly)
CALL prod.system.rewrite_data_files(
  table => 'events',
  strategy => 'binpack',
  options => map(
    'target-file-size-bytes', '268435456',
    'max-concurrent-file-group-rewrites', '10'
  )
);

-- Expire old snapshots (run daily)
CALL prod.system.expire_snapshots(
  table => 'events',
  older_than => TIMESTAMP '2024-01-08 00:00:00',
  retain_last => 10
);

-- Remove orphan files (run weekly)
CALL prod.system.remove_orphan_files(table => 'events');
```
For real-time data pipelines, Apache Flink writes to Iceberg with exactly-once semantics using Flink's checkpoint mechanism.
```java
// Write a DataStream to an Iceberg table in upsert mode
FlinkSink.forRowData(dataStream)
    .table(table)
    .tableLoader(tableLoader)
    .upsert(true)
    .equalityFieldColumns(Arrays.asList("user_id"))
    .writeParallelism(8)
    .append();

// Commits are tied to checkpoints, giving exactly-once delivery
env.enableCheckpointing(60_000);
env.getCheckpointConfig()
    .setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
```
Specific configurations, patterns, and architectural decisions for production deployments handling hundreds of terabytes to multiple petabytes of data.
For petabyte-scale tables with high-cardinality columns, Z-order clustering dramatically reduces files scanned per query by co-locating related data.
```sql
-- Z-order by frequently filtered columns
CALL prod.system.rewrite_data_files(
  table => 'events',
  strategy => 'sort',
  sort_order => 'zorder(user_id, event_type)',
  options => map(
    'target-file-size-bytes', '536870912',       -- 512 MB
    'max-file-group-size-bytes', '10737418240',  -- 10 GB
    'partial-progress.enabled', 'true'
  )
);
```
```sql
-- Rewrite manifests for scan performance
CALL prod.system.rewrite_manifests(
  table => 'events',
  use_caching => true
);

-- Check table health via the metadata log
SELECT * FROM prod.events.metadata_log_entries
ORDER BY timestamp DESC
LIMIT 20;
```
Small files inflate metadata and per-request S3 overhead at PB scale. Target 256–512 MB files. Streaming jobs inevitably create many small files — always pair them with nightly compaction.
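Binpack compaction is essentially first-fit grouping of small files up to the target size. A toy version (sizes in MB, target 256 MB — real rewrites also respect partition boundaries and delete files):

```python
def binpack(file_sizes_mb, target_mb=256):
    """Group small files into rewrite tasks that each total ~target size."""
    groups, current, total = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if total + size > target_mb and current:
            groups.append(current)   # close the full group
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups

# 8 x 64 MB streaming files -> two 256 MB rewrite groups.
groups = binpack([64] * 8)
assert [sum(g) for g in groups] == [256, 256]
```

Each group becomes one rewrite task that reads its small inputs and emits a single file near the target size.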
Project Nessie adds Git-like branching to Iceberg. Create feature branches for ETL development, test transformations, then merge — zero risk to production.
At PB scale, metadata files can accumulate rapidly. Set write.metadata.previous-versions-max=10 and schedule regular snapshot expiry jobs.
Aim for partitions of 1–100GB of data. Too-small partitions create excessive metadata; too-large partitions reduce pruning effectiveness. Day-level or hour-level often works well.
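The sizing rule is simple arithmetic against your ingest rate; for example, with a hypothetical table receiving 1.2 TB/day:

```python
def partition_size_gb(daily_volume_gb, partitions_per_day):
    """Average partition size for a time-based partition spec."""
    return daily_volume_gb / partitions_per_day

daily_gb = 1200  # hypothetical ingest: 1.2 TB/day
assert partition_size_gb(daily_gb, 1) == 1200.0  # days(): above the 1-100 GB guideline
assert partition_size_gb(daily_gb, 24) == 50.0   # hours(): inside the guideline
```

Because of partition evolution, you can start at day granularity and switch to hours later if volume grows — no backfill required.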
Use Iceberg's incremental read API to process only new files since the last run. Ideal for Change Data Capture pipelines — avoids full table scans for downstream processing.
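Conceptually, an incremental read is a diff of two snapshots' file sets. A plain-Python sketch (engines expose this natively, e.g. via snapshot-range read options in Spark):

```python
def incremental_files(snapshots, last_processed_id, current_id):
    """Files added between two snapshots: the set difference of their
    data files. `snapshots` maps snapshot id -> set of file paths."""
    return sorted(snapshots[current_id] - snapshots[last_processed_id])

# Hypothetical snapshot history for a table.
snapshots = {
    1: {"f1.parquet"},
    2: {"f1.parquet", "f2.parquet"},
    3: {"f1.parquet", "f2.parquet", "f3.parquet"},
}

# A downstream job that last saw snapshot 1 reads only the new files.
assert incremental_files(snapshots, 1, 3) == ["f2.parquet", "f3.parquet"]
```

The downstream job only needs to persist the last snapshot id it processed; everything else comes from table metadata.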
Always validate compaction plans before executing them at PB scale. Set partial-progress.enabled=true so large rewrite jobs commit incrementally — a failure partway through keeps the file groups already rewritten instead of discarding all the work.