The open-source table format revolutionizing how massive datasets are stored, queried, and managed in modern Big Data platforms — enabling petabyte-scale analytics with full ACID guarantees.
A high-performance open table format for huge analytic datasets, designed at Netflix and donated to the Apache Software Foundation in 2018.
Apache Iceberg is an open table format — not a storage system, not a query engine, not a database. Think of it as a specification and implementation that sits between your data files (Parquet, ORC, Avro) and your compute engines (Spark, Flink, Trino, Athena).
The name "Iceberg" is metaphorical: like an actual iceberg, only a small part is visible (what query engines interact with), while a massive, rich structure lies beneath — tracking every file, schema version, partition, and snapshot of your data.
Before Iceberg, teams using Hive tables suffered from inconsistent reads during writes, painful schema changes, no time-travel, and partitioning schemes baked into folder structures. Iceberg solves all of this elegantly.
| Feature | Hive | Iceberg |
|---|---|---|
| ACID Transactions | Limited (ORC managed tables only) | ✓ |
| Schema Evolution | Limited | Full |
| Time Travel | ✗ | ✓ |
| Hidden Partitioning | ✗ | ✓ |
| Concurrent Writes | ✗ | ✓ |
| Partition Evolution | ✗ | ✓ |
| Row-level Deletes | ✗ | ✓ |
| Multi-engine Access | Limited | Fully Open |
Iceberg's architecture is a carefully designed layered system — each layer serves a specific purpose and communicates upward and downward with precision.
Every Iceberg write operation creates a new immutable snapshot. This is the heart of Iceberg's power.
Every INSERT, UPDATE, or DELETE creates a new snapshot. A snapshot points to a manifest list, which points to manifest files, which point to data files. The full lineage is preserved.
Old snapshots are never deleted until explicitly expired. This enables time travel (read data as of any past snapshot), rollback, and audit trails without any extra tooling.
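The snapshot chain can be modeled as an append-only list of immutable states. Here is a toy sketch in plain Python — not the Iceberg API; the `Snapshot` and `Table` names are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Snapshot:
    """One immutable table state: an id, a parent, and the files it contains."""
    snapshot_id: int
    parent_id: Optional[int]
    data_files: tuple  # file paths, frozen at commit time

class Table:
    def __init__(self):
        self.snapshots = {}     # id -> Snapshot; kept until explicitly expired
        self.current_id = None  # the catalog's "current metadata" pointer

    def commit(self, new_files):
        """Each write appends a new snapshot; old ones stay readable."""
        parent = self.snapshots.get(self.current_id)
        existing = parent.data_files if parent else ()
        snap = Snapshot(len(self.snapshots) + 1, self.current_id,
                        existing + tuple(new_files))
        self.snapshots[snap.snapshot_id] = snap
        self.current_id = snap.snapshot_id
        return snap.snapshot_id

    def read(self, as_of=None):
        """Time travel: read the current snapshot, or any past one by id."""
        sid = as_of if as_of is not None else self.current_id
        return self.snapshots[sid].data_files

t = Table()
v1 = t.commit(["f1.parquet"])
v2 = t.commit(["f2.parquet"])
assert t.read() == ("f1.parquet", "f2.parquet")
assert t.read(as_of=v1) == ("f1.parquet",)  # time travel to the first commit
```

Rollback in this model is just setting `current_id` back to an earlier snapshot id — which is exactly why Iceberg rollback is a metadata-only operation.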
Multiple writers use optimistic locking via the catalog. A writer reads current state, makes changes, and atomically swaps the metadata pointer. Conflicts are detected and retried automatically.
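The optimistic protocol can be sketched as a compare-and-swap on the metadata pointer. This is illustrative Python, not a real catalog client — `Catalog` and `commit_with_retry` are hypothetical names:

```python
import threading

class Catalog:
    """Holds one atomic pointer per table: the current metadata file path."""
    def __init__(self, initial):
        self.pointer = initial
        self._lock = threading.Lock()  # stands in for the catalog's atomic swap

    def swap(self, expected, new):
        """Commit succeeds only if nobody else committed since we read."""
        with self._lock:
            if self.pointer != expected:
                return False  # conflict: another writer committed first
            self.pointer = new
            return True

def commit_with_retry(catalog, write_metadata, max_retries=3):
    for _ in range(max_retries):
        base = catalog.pointer           # 1. read current state
        new = write_metadata(base)       # 2. write a new metadata file
        if catalog.swap(base, new):      # 3. atomically swap the pointer
            return new
        # conflict detected: loop re-reads the new base and retries
    raise RuntimeError("too many concurrent commits")

cat = Catalog("v1.metadata.json")
result = commit_with_retry(cat, lambda base: "v2.metadata.json")
assert result == "v2.metadata.json"
```

No locks are held while data files are written; only the final pointer swap is serialized, which is what lets many writers work concurrently.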
From a query arriving at a compute engine to data files being read from object storage — here's the full journey.
Engine looks up the table in the catalog (Glue, Hive Metastore, REST) and gets the path of the current metadata file (e.g., s3://bucket/table/metadata/v42.metadata.json).
Engine reads the JSON metadata file to get the current snapshot, schema, and partition spec. For time-travel queries, a specific snapshot ID is used instead.
Engine reads the manifest list. Each manifest has partition-level stats. Manifests with partitions that don't match the WHERE clause are skipped entirely — this is coarse pruning.
Within matching manifests, each data file's column min/max/null stats are checked. Files that can't contain matching rows are skipped — this is fine-grained pruning.
Only the qualifying files are opened. Row-level deletes (position/equality delete files) are applied. Parquet column pruning and predicate pushdown happen at the file level.
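Steps 3–5 amount to checking per-file statistics against the predicate before opening anything. A minimal sketch of min/max pruning (the stats layout is simplified — real manifests carry per-column bounds in Avro):

```python
def may_contain(file_stats, column, lo, hi):
    """Min/max pruning: a file can be skipped iff its recorded range
    for `column` cannot overlap the query range [lo, hi]."""
    fmin, fmax = file_stats[column]
    return not (fmax < lo or fmin > hi)

# Hypothetical per-file stats, as a manifest would record them.
files = {
    "f1.parquet": {"order_id": (1, 1000)},
    "f2.parquet": {"order_id": (1001, 2000)},
    "f3.parquet": {"order_id": (2001, 3000)},
}

# WHERE order_id BETWEEN 1500 AND 1600 -- only f2 survives pruning.
to_scan = [f for f, stats in files.items()
           if may_contain(stats, "order_id", 1500, 1600)]
assert to_scan == ["f2.parquet"]
```

The same overlap test runs twice in practice: once against partition-level stats in the manifest list (coarse), then against per-file column stats (fine-grained).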
Writer creates new Parquet/ORC files on object storage. Files are not yet visible to any reader — they're staged in isolation. No locks held during this phase.
Writer creates manifest Avro files listing the new data files with their column statistics (min, max, null counts). These stats are critical for future read performance.
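The statistics written at this step are simple per-column aggregates computed over each file's rows; sketched here in Python (illustrative — not Iceberg's actual Avro manifest schema):

```python
def column_stats(rows, column):
    """min, max, and null count for one column across a file's rows --
    the aggregates a manifest records per data file."""
    values = [r[column] for r in rows if r[column] is not None]
    nulls = sum(1 for r in rows if r[column] is None)
    return {"min": min(values), "max": max(values), "null_count": nulls}

rows = [{"amount": 10}, {"amount": 75}, {"amount": None}]
assert column_stats(rows, "amount") == {"min": 10, "max": 75, "null_count": 1}
```

These few bytes per column are what make the read-path pruning described above possible without opening any Parquet files.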
Writer creates a new metadata JSON file with the new snapshot. The snapshot references the new manifest list combining old + new manifests. Parent snapshot is recorded.
Writer atomically updates the catalog to point to the new metadata file. This single atomic operation makes the entire transaction visible. Either all changes appear or none do.
```sql
-- Query data at a specific snapshot
SELECT * FROM prod.orders VERSION AS OF 8349571234;

-- Query data at a specific timestamp
SELECT * FROM prod.orders TIMESTAMP AS OF '2024-01-15 09:00:00';

-- View all snapshots
SELECT * FROM prod.orders.snapshots ORDER BY committed_at DESC;
```
```sql
-- Add a column safely (no rewrite!)
ALTER TABLE prod.orders ADD COLUMN loyalty_tier STRING;

-- Rename a column safely
ALTER TABLE prod.orders RENAME COLUMN total TO total_amount;

-- Change the partition scheme live
ALTER TABLE prod.orders ADD PARTITION FIELD months(order_date);
```
```sql
MERGE INTO prod.customers t
USING staging.customers_updates s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
```sql
-- Old data: partitioned by day
-- New data: partitioned by hour
-- Iceberg reads both layouts transparently
ALTER TABLE prod.events ADD PARTITION FIELD hours(event_time);

-- Query seamlessly across both
SELECT * FROM prod.events WHERE event_time >= '2024-01-01';
```
Iceberg doesn't just solve one problem — it comprehensively addresses every major pain point in big data table management.
**Data Reliability:** Query historical data at any snapshot without ETL pipelines or separate backup infrastructure. Roll back a bad write in seconds — just update the catalog pointer to a previous snapshot.

**Zero Downtime:** Add, drop, rename, or reorder columns without touching existing files. Iceberg uses field IDs (not names) internally, so renaming is purely a metadata operation — zero data movement.

**Smart Optimization:** Query engines don't need to know partitioning details — Iceberg handles them transparently. Partition schemes can evolve over time; old and new data coexist without migration.

**Performance:** Metadata-driven file pruning eliminates irrelevant files before opening any data. Column statistics at the manifest level skip entire file groups — often 10–100× fewer files scanned than Hive.

**Interoperability:** Spark, Flink, Trino, Presto, Hive, Athena, Snowflake, and DuckDB can all read the same table simultaneously with full consistency. No engine lock-in — switch engines or run several at once.

**Compliance:** Delete specific rows without rewriting entire partitions. Position and equality deletes are tracked in delete files and merged at read time. Ideal for right-to-erasure compliance.

Iceberg is supported by virtually every major data engine, cloud provider, and data tool — making it the de facto standard open table format.
| Feature | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Open Spec | ✓ Apache | ✓ Linux Foundation | ✓ Apache |
| Time Travel | ✓ | ✓ | Limited |
| Multi-Engine | Best | Good | Good |
| Partition Evolve | ✓ | ✗ | ✗ |
| Hidden Partitions | ✓ | ✗ | ✗ |
| Cloud Native | ✓ | ✓ | Partial |
A battle-tested configuration guide for running Apache Iceberg in production with high reliability and performance.
The catalog is your single most important infrastructure choice. It must be highly available, consistent, and support atomic operations.
```properties
spark.sql.catalog.prod=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.prod.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.prod.warehouse=s3://my-datalake/warehouse
spark.sql.catalog.prod.io-impl=org.apache.iceberg.aws.s3.S3FileIO
```
Object storage (S3/GCS/ADLS) is the recommended storage layer. Configure S3 for high-throughput multipart uploads and server-side encryption.
```properties
spark.hadoop.fs.s3a.multipart.size=128M
spark.hadoop.fs.s3a.multipart.threshold=128M
spark.hadoop.fs.s3a.connection.maximum=200
spark.hadoop.fs.s3a.fast.upload=true
spark.hadoop.fs.s3a.block.size=256M
```
Set critical table-level properties for performance, data quality, and operational manageability from the start.
```sql
CREATE TABLE prod.events (
  event_id   BIGINT,
  event_time TIMESTAMP,
  user_id    STRING,
  payload    STRING
) USING iceberg
PARTITIONED BY (days(event_time))
TBLPROPERTIES (
  'write.format.default'='parquet',
  'write.parquet.compression-codec'='zstd',
  'write.target-file-size-bytes'='268435456',        -- 256 MB
  'history.expire.max-snapshot-age-ms'='604800000',  -- 7 days
  'write.metadata.delete-after-commit.enabled'='true',
  'write.metadata.previous-versions-max'='10'
);
```
Schedule regular compaction to merge small files, expire old snapshots, and remove orphaned data. Critical for performance at scale.
```sql
-- Compact small files (run nightly)
CALL prod.system.rewrite_data_files(
  table => 'events',
  strategy => 'binpack',
  options => map(
    'target-file-size-bytes', '268435456',
    'max-concurrent-file-group-rewrites', '10'
  )
);

-- Expire old snapshots (run daily)
CALL prod.system.expire_snapshots(
  table => 'events',
  older_than => TIMESTAMP '2024-01-08 00:00:00',
  retain_last => 10
);

-- Remove orphan files (run weekly)
CALL prod.system.remove_orphan_files(table => 'events');
```
For real-time data pipelines, Apache Flink writes to Iceberg with exactly-once semantics using Flink's checkpoint mechanism.
```java
// Write a DataStream to an Iceberg table in upsert mode
FlinkSink.forRowData(dataStream)
    .table(table)
    .tableLoader(tableLoader)
    .upsert(true)
    .equalityFieldColumns(Arrays.asList("user_id"))
    .writeParallelism(8)
    .append();

// Commits are tied to checkpoints, giving exactly-once delivery
env.enableCheckpointing(60_000);
env.getCheckpointConfig()
    .setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
```
Specific configurations, patterns, and architectural decisions for production deployments handling hundreds of terabytes to multiple petabytes of data.
For petabyte-scale tables with high-cardinality columns, Z-order clustering dramatically reduces files scanned per query by co-locating related data.
```sql
-- Z-order by frequently filtered columns
CALL prod.system.rewrite_data_files(
  table => 'events',
  strategy => 'sort',
  sort_order => 'zorder(user_id, event_type)',
  options => map(
    'target-file-size-bytes', '536870912',       -- 512 MB
    'max-file-group-size-bytes', '10737418240',  -- 10 GB
    'partial-progress.enabled', 'true'
  )
);
```
```sql
-- Rewrite manifests for scan performance
CALL prod.system.rewrite_manifests(
  table => 'events',
  use_caching => true
);

-- Check table health via the metadata log
SELECT * FROM prod.events.metadata_log_entries
ORDER BY timestamp DESC
LIMIT 20;
```
Small files inflate metadata and per-request S3 overhead at PB scale. Target 256–512 MB files. Streaming jobs inevitably create many small files — always pair them with nightly compaction.
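Binpack compaction is essentially first-fit grouping of small files up to the target size. A toy version (sizes in MB, target 256 MB — real rewrites also respect partition boundaries and delete files):

```python
def binpack(file_sizes_mb, target_mb=256):
    """Group small files into rewrite tasks that each total ~target size."""
    groups, current, total = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if total + size > target_mb and current:
            groups.append(current)   # close the full group
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups

# 8 x 64 MB streaming files -> two 256 MB rewrite groups.
groups = binpack([64] * 8)
assert [sum(g) for g in groups] == [256, 256]
```

Each group becomes one rewrite task that reads its small inputs and emits a single file near the target size.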
Project Nessie adds Git-like branching to Iceberg. Create feature branches for ETL development, test transformations, then merge — zero risk to production.
At PB scale, metadata files can accumulate rapidly. Set write.metadata.previous-versions-max=10 and schedule regular snapshot expiry jobs.
Aim for partitions of 1–100GB of data. Too-small partitions create excessive metadata; too-large partitions reduce pruning effectiveness. Day-level or hour-level often works well.
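The sizing rule is simple arithmetic against your ingest rate; for example, with a hypothetical table receiving 1.2 TB/day:

```python
def partition_size_gb(daily_volume_gb, partitions_per_day):
    """Average partition size for a time-based partition spec."""
    return daily_volume_gb / partitions_per_day

daily_gb = 1200  # hypothetical ingest: 1.2 TB/day
assert partition_size_gb(daily_gb, 1) == 1200.0  # days(): above the 1-100 GB guideline
assert partition_size_gb(daily_gb, 24) == 50.0   # hours(): inside the guideline
```

Because of partition evolution, you can start at day granularity and switch to hours later if volume grows — no backfill required.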
Use Iceberg's incremental read API to process only new files since the last run. Ideal for Change Data Capture pipelines — avoids full table scans for downstream processing.
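Conceptually, an incremental read is a diff of two snapshots' file sets. A plain-Python sketch (engines expose this natively, e.g. via snapshot-range read options in Spark):

```python
def incremental_files(snapshots, last_processed_id, current_id):
    """Files added between two snapshots: the set difference of their
    data files. `snapshots` maps snapshot id -> set of file paths."""
    return sorted(snapshots[current_id] - snapshots[last_processed_id])

# Hypothetical snapshot history for a table.
snapshots = {
    1: {"f1.parquet"},
    2: {"f1.parquet", "f2.parquet"},
    3: {"f1.parquet", "f2.parquet", "f3.parquet"},
}

# A downstream job that last saw snapshot 1 reads only the new files.
assert incremental_files(snapshots, 1, 3) == ["f2.parquet", "f3.parquet"]
```

The downstream job only needs to persist the last snapshot id it processed; everything else comes from table metadata.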
Always validate compaction plans before executing them at PB scale. Set partial-progress.enabled=true so large rewrite jobs commit incrementally — a failure partway through keeps the file groups already rewritten instead of discarding all the work.