Open Table Format · Apache Foundation

Apache Iceberg

The open-source table format revolutionizing how massive datasets are stored, queried, and managed in modern Big Data platforms — enabling petabyte-scale analytics with full ACID guarantees.

PB+ Data Scale
ACID Transactions
10× Query Speed
Multi-engine Compatible
[Diagram: the iceberg metaphor. The visible ~10% is the Catalog Layer (table registry & pointers); beneath it, 90% hidden, sit the Metadata Layer (snapshots, schema, stats) and the Data Layer (Parquet, ORC, Avro files).]

What is Apache Iceberg?

A high-performance open table format for huge analytic datasets, designed at Netflix and donated to the Apache Software Foundation in 2018.

Apache Iceberg is an open table format — not a storage system, not a query engine, not a database. Think of it as a specification and implementation that sits between your data files (Parquet, ORC, Avro) and your compute engines (Spark, Flink, Trino, Athena).

The name "Iceberg" is metaphorical: like an actual iceberg, only a small part is visible (what query engines interact with), while a massive, rich structure lies beneath — tracking every file, schema version, partition, and snapshot of your data.

Before Iceberg, teams using Hive tables suffered from inconsistent reads during writes, painful schema changes, no time-travel, and partitioning schemes baked into folder structures. Iceberg solves all of this elegantly.

💡
Core Insight: Iceberg tracks data files at the file level rather than the partition/directory level. This unlocks ACID transactions, schema evolution, and time-travel without rewriting data.
Feature              | Hive    | Iceberg
ACID Transactions    | ✗       | ✓
Schema Evolution     | Limited | Full
Time Travel          | ✗       | ✓
Hidden Partitioning  | ✗       | ✓
Concurrent Writes    | Limited | ✓
Partition Evolution  | ✗       | ✓
Row-level Deletes    | ✗       | ✓
Multi-engine Access  | Limited | Fully Open

How Iceberg is Built

Iceberg's architecture is a carefully designed layered system: each layer serves a specific purpose and holds pointers to the layer below it.

Iceberg Layered Architecture
⚡ Apache Spark
🌊 Apache Flink
🔺 Trino / Presto
☁️ AWS Athena
❄️ Snowflake
📚 Layer 1: Catalog catalog/
Central registry for all Iceberg tables. Stores pointers to the current metadata file. Can be: Hive Metastore, AWS Glue, Nessie (multi-branch), REST Catalog, JDBC, Hadoop. When a table is opened, the catalog resolves the table name → current metadata file location.
📋 Layer 2: Metadata Files metadata/*.json
JSON files capturing complete table state at each snapshot. Contains: current schema, partition specs, snapshot history, sort order, table properties. Each write creates a new metadata file; old ones are preserved for time-travel.
📑 Layer 3: Manifest List + Manifest Files metadata/*.avro
Manifest List: An Avro file listing all manifest files for a snapshot, with partition-level stats.
Manifest Files: Track individual data files with column-level stats (min/max/null counts). Essential for file pruning during query planning.
🗄️ Layer 4: Data Files data/*.parquet
Actual data stored as Parquet, ORC, or Avro files on object storage (S3, GCS, ADLS, HDFS). Delete files (position or equality) handle row-level deletes without full rewrites. Files are immutable — new writes always create new files.
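The four layers chain together through pointers. Here is a minimal Python sketch of that traversal, using hypothetical in-memory structures as stand-ins for the real catalog, JSON, and Avro files:

```python
# Hypothetical in-memory model of Iceberg's four layers (illustration only,
# not the real library). Each layer stores only pointers to the next.

catalog = {"prod.events": "metadata/v3.metadata.json"}            # Layer 1

metadata_files = {                                                # Layer 2
    "metadata/v3.metadata.json": {
        "current-snapshot-id": 3,
        "snapshots": {3: "metadata/snap-3-manifest-list.avro"},
    }
}

manifest_lists = {                                                # Layer 3a
    "metadata/snap-3-manifest-list.avro": ["metadata/m1.avro", "metadata/m2.avro"],
}

manifests = {                                                     # Layer 3b
    "metadata/m1.avro": ["data/a.parquet"],
    "metadata/m2.avro": ["data/b.parquet", "data/c.parquet"],
}

def resolve_data_files(table_name):
    """Walk catalog -> metadata file -> manifest list -> manifests -> data files."""
    meta = metadata_files[catalog[table_name]]
    manifest_list = meta["snapshots"][meta["current-snapshot-id"]]
    files = []
    for manifest in manifest_lists[manifest_list]:
        files.extend(manifests[manifest])
    return files

print(resolve_data_files("prod.events"))
# ['data/a.parquet', 'data/b.parquet', 'data/c.parquet']
```

Note that the catalog holds exactly one mutable pointer per table; everything below it is immutable, which is what makes atomic commits possible.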

The Snapshot Model

Every Iceberg write operation creates a new immutable snapshot. This is the heart of Iceberg's power.

📸

Snapshot = Table State

Every INSERT, UPDATE, or DELETE creates a new snapshot. A snapshot points to a manifest list, which points to manifest files, which point to data files. The full lineage is preserved.

Immutable History

Old snapshots are never deleted until explicitly expired. This enables time travel (read data as of any past snapshot), rollback, and audit trails without any extra tooling.

🔀

Optimistic Concurrency

Multiple writers use optimistic locking via the catalog. A writer reads current state, makes changes, and atomically swaps the metadata pointer. Conflicts are detected and retried automatically.
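The optimistic protocol can be sketched in a few lines of Python. The `Catalog` class and `commit` helper below are hypothetical stand-ins for the atomic pointer swap a real catalog service performs:

```python
# Sketch of optimistic commits via compare-and-swap on the catalog pointer.
# Hypothetical names; a real catalog does the swap server-side.
import threading

class Catalog:
    def __init__(self, pointer):
        self.pointer = pointer
        self._lock = threading.Lock()   # stands in for the catalog's atomic op

    def compare_and_swap(self, expected, new):
        """Move the pointer only if nobody committed since we read it."""
        with self._lock:
            if self.pointer != expected:
                return False            # conflict: another writer won the race
            self.pointer = new
            return True

def commit(catalog, build_new_metadata, max_retries=3):
    for _ in range(max_retries):
        base = catalog.pointer                  # 1. read current state
        new = build_new_metadata(base)          # 2. write files + new metadata
        if catalog.compare_and_swap(base, new): # 3. atomic pointer swap
            return new
    raise RuntimeError("too many concurrent commits")

cat = Catalog("v1.metadata.json")
print(commit(cat, lambda base: f"v2-from-{base}"))  # v2-from-v1.metadata.json
```

On conflict, the writer simply rereads the new state and retries; no reader or writer ever holds a distributed lock.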

How Iceberg Works

From a query arriving at a compute engine to data files being read from object storage — here's the full journey.

Read Path

1

Catalog Lookup

Engine queries the catalog (Glue, Hive Metastore, REST) for the table name → gets the current metadata file path (e.g., s3://bucket/table/metadata/v42.metadata.json).

2

Metadata File Read

Engine reads the JSON metadata file to get the current snapshot, schema, and partition spec. For time-travel queries, a specific snapshot ID is used instead.

3

Manifest Pruning

Engine reads the manifest list. Each manifest has partition-level stats. Manifests with partitions that don't match the WHERE clause are skipped entirely — this is coarse pruning.

4

File-level Pruning

Within matching manifests, each data file's column min/max/null stats are checked. Files that can't contain matching rows are skipped — this is fine-grained pruning.

5

Data File Read

Only the qualifying files are opened. Row-level deletes (position/equality delete files) are applied. Parquet column pruning and predicate pushdown happen at the file level.
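Steps 3 and 4 boil down to two nested pruning loops. A toy Python sketch (hypothetical stats, illustrative only) for a query like `WHERE day = '2024-01-02' AND amount > 150`:

```python
# Two-stage pruning sketch: coarse (manifest/partition level), then
# fine (file-level min/max column stats). Values are made up.

manifests = [
    {"partition_days": {"2024-01-01"}, "files": [
        {"path": "data/a.parquet", "amount_min": 0,   "amount_max": 100},
    ]},
    {"partition_days": {"2024-01-02"}, "files": [
        {"path": "data/b.parquet", "amount_min": 10,  "amount_max": 120},
        {"path": "data/c.parquet", "amount_min": 140, "amount_max": 900},
    ]},
]

def plan_scan(day, amount_gt):
    selected = []
    for m in manifests:
        if day not in m["partition_days"]:    # coarse: skip whole manifest
            continue
        for f in m["files"]:                  # fine: min/max can rule a file out
            if f["amount_max"] > amount_gt:
                selected.append(f["path"])
    return selected

print(plan_scan("2024-01-02", 150))  # ['data/c.parquet']
```

Only one of three files is ever opened; at petabyte scale this planning step is what turns a full scan into a handful of reads.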

Write Path

1

Write Data Files

Writer creates new Parquet/ORC files on object storage. Files are not yet visible to any reader — they're staged in isolation. No locks held during this phase.

2

Create Manifest Files

Writer creates manifest Avro files listing the new data files with their column statistics (min, max, null counts). These stats are critical for future read performance.

3

Create New Snapshot

Writer creates a new metadata JSON file with the new snapshot. The snapshot references the new manifest list combining old + new manifests. Parent snapshot is recorded.

4

Atomic Catalog Commit

Writer atomically updates the catalog to point to the new metadata file. This single atomic operation makes the entire transaction visible. Either all changes appear or none do.

🔒
ACID Guarantee: Because data files are written first (invisible), and only the final catalog pointer swap is the "commit", Iceberg achieves serializable isolation without distributed locks.

Key Operations — SQL Examples

Time Travel Query
-- Query data at a specific snapshot
SELECT * FROM prod.orders
VERSION AS OF 8349571234;

-- Query data at a specific timestamp
SELECT * FROM prod.orders
TIMESTAMP AS OF '2024-01-15 09:00:00';

-- View all snapshots
SELECT * FROM prod.orders.snapshots
ORDER BY committed_at DESC;
Schema Evolution
-- Add a column safely (no rewrite!)
ALTER TABLE prod.orders
ADD COLUMN loyalty_tier STRING;

-- Rename a column safely
ALTER TABLE prod.orders
RENAME COLUMN total TO total_amount;

-- Change partition scheme live
ALTER TABLE prod.orders
ADD PARTITION FIELD months(order_date);
MERGE INTO (Upsert)
MERGE INTO prod.customers t
USING staging.customers_updates s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;
Partition Evolution
-- Old data: partitioned by day
-- New data: partitioned by hour
-- Iceberg handles both transparently!

ALTER TABLE prod.events
ADD PARTITION FIELD hours(event_time);

-- Query seamlessly across both
SELECT * FROM prod.events
WHERE event_time >= '2024-01-01';

Why Iceberg is a Game-Changer

Iceberg doesn't solve just one problem; it addresses the major pain points of big data table management in a single design.

01

🕐 Time Travel & Rollback

Query historical data at any snapshot without ETL pipelines or separate backup infrastructure. Rollback a bad write in seconds — just update the catalog pointer to a previous snapshot.

Data Reliability
02

🧬 Schema Evolution without Rewrite

Add, drop, rename, or reorder columns without touching existing files. Iceberg uses field IDs (not names) internally, so renaming is purely a metadata operation — zero data movement.

Zero Downtime
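The field-ID mechanism can be illustrated in a few lines of Python (hypothetical structures): because readers map field IDs to names through the current schema, a rename touches only the schema, never the data files.

```python
# Data file rows keyed by field ID, as if read from Parquet files that
# carry Iceberg field IDs (illustrative sketch, not the real reader).
rows = [{1: 1001, 2: 49.90}, {1: 1002, 2: 15.00}]

schema_v1 = {1: "customer_id", 2: "total"}
schema_v2 = {1: "customer_id", 2: "total_amount"}  # rename: schema-only change

def project(rows, schema):
    """Resolve each row's field IDs to column names via the active schema."""
    return [{schema[fid]: val for fid, val in row.items()} for row in rows]

print(project(rows, schema_v1)[0])  # {'customer_id': 1001, 'total': 49.9}
print(project(rows, schema_v2)[0])  # {'customer_id': 1001, 'total_amount': 49.9}
```

The same bytes on disk answer both schemas, which is why renames and reorders are zero-data-movement operations.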
03

🗂️ Hidden & Evolving Partitioning

Query engines don't need to know partitioning details — Iceberg handles it transparently. Partition schemes can evolve over time; old and new data coexist without migration.

Smart Optimization
04

⚡ Massive Query Performance

Metadata-driven file pruning eliminates irrelevant files before opening any data. Column statistics at the manifest level skip entire file groups. 10–100× fewer files scanned vs. Hive.

Performance
05

🔗 True Multi-Engine Support

Spark, Flink, Trino, Presto, Hive, Athena, Snowflake, DuckDB all read from the same table simultaneously with full consistency. No engine lock-in — switch or run multiple engines freely.

Interoperability
06

⚖️ GDPR / Row-Level Deletes

Delete specific rows without rewriting entire partitions. Position deletes and equality deletes are tracked in delete files, merged at read time. Ideal for right-to-erasure compliance.

Compliance
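How a position delete file merges at read time can be sketched as follows (illustrative Python, not the real reader; a position delete names (data_file, row_position) pairs to drop):

```python
# Apply position deletes while reading a data file (toy example).
data_file = "data/users-00001.parquet"
rows = ["alice", "bob", "carol", "dave"]             # rows at positions 0..3

position_deletes = {(data_file, 1), (data_file, 3)}  # erase rows 1 and 3

def read_with_deletes(path, rows, deletes):
    """Skip any row whose (file, position) appears in a delete file."""
    return [r for pos, r in enumerate(rows) if (path, pos) not in deletes]

print(read_with_deletes(data_file, rows, position_deletes))
# ['alice', 'carol']
```

The Parquet file is never rewritten; compaction later folds the deletes into fresh files in the background.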

📊 Query Performance vs. Hive (Relative Improvement)

Point Lookup
~100× faster
Partition Scan
~20–50× faster
Full Table Scan
~2–5× faster
Schema Change
Instant (metadata)
Concurrent Writes
Fully supported

The Iceberg Ecosystem

Iceberg is supported by virtually every major data engine, cloud provider, and data tool — making it the de facto standard open table format.

Compute Engines

Apache Spark
Batch + Streaming
🌊
Apache Flink
Real-time Streaming
🔺
Trino / Presto
Interactive SQL
🦆
DuckDB
Local Analytics

Cloud & Managed Services

☁️
AWS Athena
Serverless SQL
❄️
Snowflake
External Tables
🧱
Databricks
Delta ↔ Iceberg
🔵
BigQuery
Managed Iceberg

Catalogs & Tooling

🌿
Project Nessie
Git-like Catalog
🗃️
AWS Glue
Managed Catalog
🐝
Hive Metastore
Traditional Catalog
🚀
Apache Kafka
CDC → Iceberg

📅 Iceberg Timeline

2017
Born at Netflix
Created by Ryan Blue and Daniel Weeks to solve Netflix's petabyte-scale table management problems with Hive.
2018
Donated to Apache
Netflix donated the project to the Apache Software Foundation. First public release followed shortly after.
2020
Apache Top-Level Project
Graduated to Apache Top-Level Project status. Spark integration matured significantly. v0.9 released.
2022
Industry Adoption Explodes
Apple, LinkedIn, Adobe, Airbnb, and hundreds of companies adopt Iceberg. v0.14 introduces row-level deletes.
2024+
Standard Open Format
Snowflake, BigQuery, and Databricks all support Iceberg natively. The open standard for the lakehouse era.

🆚 Iceberg vs Competitors

Feature             | Iceberg  | Delta Lake | Hudi
Open Spec           | ✓ Apache | Linux Fnd  | Apache
Time Travel         | ✓        | ✓          | Limited
Multi-Engine        | Best     | Good       | Good
Partition Evolution | ✓        | ✗          | ✗
Hidden Partitions   | ✓        | ✗          | ✗
Cloud Native        | ✓        | ✓          | Partial

Production Setup Guide

A battle-tested configuration guide for running Apache Iceberg in production with high reliability and performance.

📚

1. Choose Your Catalog

The catalog is your single most important infrastructure choice. It must be highly available, consistent, and support atomic operations.

Spark Config — AWS Glue Catalog
spark.sql.catalog.prod = \
  org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.prod.catalog-impl = \
  org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.prod.warehouse = \
  s3://my-datalake/warehouse
spark.sql.catalog.prod.io-impl = \
  org.apache.iceberg.aws.s3.S3FileIO
🗄️

2. Storage Configuration

Object storage (S3/GCS/ADLS) is the recommended storage layer. Configure S3 for high-throughput multipart uploads and server-side encryption.

S3 Optimized Config
spark.hadoop.fs.s3a.multipart.size=128M
spark.hadoop.fs.s3a.multipart.threshold=128M
spark.hadoop.fs.s3a.connection.maximum=200
spark.hadoop.fs.s3a.fast.upload=true
spark.hadoop.fs.s3a.block.size=256M
⚙️

3. Table Properties

Set critical table-level properties for performance, data quality, and operational manageability from the start.

CREATE TABLE with Best Practices
CREATE TABLE prod.events (
  event_id   BIGINT,
  event_time TIMESTAMP,
  user_id    STRING,
  payload    STRING
) USING iceberg
PARTITIONED BY (days(event_time))
TBLPROPERTIES (
  'write.format.default'='parquet',
  'write.parquet.compression-codec'='zstd',
  'write.target-file-size-bytes'='268435456',
  'history.expire.max-snapshot-age-ms'='604800000',
  'write.metadata.delete-after-commit.enabled'='true',
  'write.metadata.previous-versions-max'='10'
);
🗜️

4. Compaction & Maintenance

Schedule regular compaction to merge small files, expire old snapshots, and remove orphaned data. Critical for performance at scale.

Maintenance Jobs (Spark)
-- Compact small files (run nightly)
CALL prod.system.rewrite_data_files(
  table => 'events',
  strategy => 'binpack',
  options => map(
    'target-file-size-bytes', '268435456',
    'max-concurrent-file-group-rewrites', '10'
  )
);

-- Expire old snapshots (run daily)
CALL prod.system.expire_snapshots(
  table => 'events',
  older_than => TIMESTAMP '2024-01-08 00:00:00',
  retain_last => 10
);

-- Remove orphan files (run weekly)
CALL prod.system.remove_orphan_files(
  table => 'events'
);

🌊 Flink Streaming Ingestion Setup

For real-time data pipelines, Apache Flink writes to Iceberg with exactly-once semantics using Flink's checkpoint mechanism.

Checkpoint Interval: Set Flink checkpoints every 1–5 minutes for a balance of latency and file count. Each checkpoint = one Iceberg commit.
Flink → Iceberg
// FlinkSink to Iceberg table
FlinkSink.forRowData(dataStream)
  .table(table)
  .tableLoader(tableLoader)
  .upsert(true)
  .equalityFieldColumns(
    Arrays.asList("user_id")
  )
  .writeParallelism(8)
  .append();

// Enable checkpointing
env.enableCheckpointing(60_000);
env.getCheckpointConfig()
   .setCheckpointingMode(
     CheckpointingMode.EXACTLY_ONCE
   );

Running Iceberg at Petabyte Scale

Specific configurations, patterns, and architectural decisions for production deployments handling hundreds of terabytes to multiple petabytes of data.

Target File Size
256–512 MB
Optimal Parquet file size balancing S3 request overhead vs. read amplification
Files per Partition
10–100
Keep manageable with compaction. More files = more manifest overhead
Manifest File Limit
<8 MB
Rewrite manifests to keep them small. Large manifests slow query planning
Snapshot Retention
7 Days
Keep 7 days for time-travel SLA. Expire older ones to control metadata bloat
Compaction Frequency
Daily
Run bin-packing compaction on streaming tables nightly during low traffic
Parallelism
100–500
Spark tasks for writing/compacting PB-scale tables. Scale with data volume
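The snapshot-retention rule above (keep 7 days, never drop the newest N) matches the semantics of the expire_snapshots procedure. A small Python sketch of that policy, using hypothetical (snapshot_id, committed_at_ms) tuples:

```python
# Sketch of the expire_snapshots policy: drop snapshots older than a cutoff,
# but always protect the most recent N (illustration, not the real procedure).

def snapshots_to_expire(snapshots, older_than_ms, retain_last):
    """snapshots: iterable of (snapshot_id, committed_at_ms), any order."""
    ordered = sorted(snapshots, key=lambda s: s[1], reverse=True)
    protected = {sid for sid, _ in ordered[:retain_last]}   # newest N are safe
    return [sid for sid, ts in ordered
            if ts < older_than_ms and sid not in protected]

snaps = [(1, 100), (2, 200), (3, 300), (4, 400)]
print(snapshots_to_expire(snaps, older_than_ms=350, retain_last=2))
# [2, 1]
```

Snapshot 3 is older than the cutoff but survives because it sits inside the retain_last window; both conditions must agree before anything is expired.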

🎯 Z-Order Clustering for PB Tables

For petabyte-scale tables with high-cardinality columns, Z-order clustering dramatically reduces files scanned per query by co-locating related data.

Z-Order Compaction
-- Z-order by frequently filtered columns
CALL prod.system.rewrite_data_files(
  table => 'events',
  strategy => 'sort',
  sort_order => 'zorder(user_id, event_type)',
  options => map(
    'target-file-size-bytes', '536870912',
    'max-file-group-size-bytes', '10737418240',
    'partial-progress.enabled', 'true'
  )
);
Manifest Rewriting
-- Rewrite manifests for scan performance
CALL prod.system.rewrite_manifests(
  table => 'events',
  use_caching => true
);

-- Check table health
SELECT *
FROM prod.events.metadata_log_entries
ORDER BY timestamp DESC
LIMIT 20;

🏛️ Reference Architecture: PB-Scale Lakehouse

📡
Ingestion
Kafka · Kinesis
CDC · Firehose
Flink (streaming)
🧊
Iceberg Tables
Raw · Cleansed
Aggregated layers
S3 + Glue Catalog
⚙️
Processing
Spark EMR / Glue
Flink · dbt
Nightly compaction
📊
Consumption
Trino · Athena
Snowflake · BI Tools
ML / Feature Store
💾
Storage: Use S3 Intelligent-Tiering. Separate hot (recent) and cold (historical) data with lifecycle policies.
📈
Monitoring: Track files-per-partition, snapshot count, metadata size, and compaction lag as operational KPIs.
🔐
Security: Use IAM roles per table with S3 bucket policies. Enable server-side encryption (SSE-KMS) for compliance.

🛠️ Operational Best Practices

📏

File Size Discipline

Small files kill S3 ListObjects performance at PB scale. Target 256–512MB files. Streaming jobs create many small files — always pair with nightly compaction.
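Conceptually, the 'binpack' compaction strategy greedily packs small files into rewrite groups near the target size. A simplified Python sketch (the real strategy also respects partition boundaries and sort orders):

```python
# Greedy bin-packing sketch: group small file sizes (MB) into rewrite
# groups that stay at or under the target size (illustration only).

def binpack(file_sizes_mb, target_mb=512):
    groups, current, current_size = [], [], 0
    for size in file_sizes_mb:
        if current and current_size + size > target_mb:
            groups.append(current)            # close the full group
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Twelve 64 MB streaming files -> groups of eight and four (512 MB + 256 MB)
print(binpack([64] * 12, target_mb=512))
```

Twelve tiny files become two large ones, which is exactly the S3 ListObjects relief the paragraph above is asking for.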

🌿

Use Nessie for Branching

Project Nessie adds Git-like branching to Iceberg. Create feature branches for ETL development, test transformations, then merge — zero risk to production.

🚨

Monitor Metadata Growth

At PB scale, metadata files can accumulate rapidly. Set write.metadata.previous-versions-max=10 and schedule regular snapshot expiry jobs.

⚖️

Partition Sizing

Aim for partitions of 1–100GB of data. Too-small partitions create excessive metadata; too-large partitions reduce pruning effectiveness. Day-level or hour-level often works well.

🔄

Incremental Reads

Use Iceberg's incremental read API to process only new files since the last run. Ideal for Change Data Capture pipelines — avoids full table scans for downstream processing.
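The idea reduces to a set difference between snapshots. A toy Python sketch with a hypothetical snapshot-to-files mapping (the real API exposes this as an incremental scan between snapshot IDs):

```python
# Incremental read sketch: process only files added between two snapshots.
snapshot_files = {
    100: {"data/a.parquet", "data/b.parquet"},
    101: {"data/a.parquet", "data/b.parquet", "data/c.parquet"},
    102: {"data/a.parquet", "data/c.parquet", "data/d.parquet"},  # b rewritten
}

def new_files(from_snapshot, to_snapshot):
    """Files present in `to` but not in `from`: the incremental slice."""
    return sorted(snapshot_files[to_snapshot] - snapshot_files[from_snapshot])

print(new_files(100, 102))  # ['data/c.parquet', 'data/d.parquet']
```

A downstream job records the last snapshot ID it processed and reads only this delta on each run, instead of rescanning the table.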

🧪

Test with Dry Runs

Always validate compaction plans before executing at PB scale. Set partial-progress.enabled=true so large compaction jobs commit completed file groups incrementally; a failure then discards only in-flight work, not everything.