Deep Dive · Data Organization · Spec v1–v3

How Iceberg organizes data

A complete technical reference to file layout, partitioning, bucketing, sorting, metadata architecture, indexing strategies, and query pruning mechanics

File System Layout Partition Transforms Bucket Partitioning Sort Orders Manifest Architecture Column Statistics Bloom Filters (Puffin) Z-Order Clustering File Pruning Compaction
Section 01 — Physical Layout

How Iceberg organizes files on disk

Every Iceberg table is a directory on object storage with a strict, well-defined layout. Understanding this layout is the foundation for everything else — partitioning, metadata, pruning.

s3://warehouse/database/table_name/
├── metadata/                              ← ALL metadata lives here
│   ├── v1.metadata.json                   ← table creation snapshot
│   ├── v2.metadata.json                   ← after first write
│   ├── v3.metadata.json                   ← current (pointed to by catalog)
│   ├── snap-8120-1-abc123.avro            ← manifest list (old snapshot)
│   ├── snap-9012-1-def456.avro            ← manifest list (current snapshot)
│   ├── a1b2c3d4-m0.avro                   ← manifest file (data file list + stats)
│   ├── e5f6a7b8-m0.avro                   ← manifest file
│   └── stats/                             ← Puffin statistics blobs (v2+)
│       └── 00001-stats.puffin             ← Bloom filters, NDV sketches
└── data/                                  ← ALL actual data files live here
    ├── region=APAC/                       ← partition directory (optional, hidden from queries)
    │   ├── 00001-0-abcd-00000.parquet     ← data file
    │   └── 00001-0-abcd-00001.parquet
    ├── region=EMEA/
    │   └── 00002-0-efgh-00000.parquet
    └── region=AMER/
        └── 00003-0-ijkl-00000.parquet
Key insight: The directory structure under data/ is an implementation detail — Iceberg never scans directories. It reads file paths from manifest files. The partition directories exist for compatibility but are completely irrelevant to query planning. This is why partition evolution works: paths don't encode the partition scheme.
3 file types in metadata/:
  Avro: manifest format
  JSON: table metadata format
  Puffin: stats blob format
File Naming Convention

Data file naming anatomy

00001 - 0 - a4b8c2d1e5f6a7b8c9d0e1f2a3b4c5d6 - 00000 .parquet
Task ID: writer task number within the job (00001); used for deduplication detection.
Attempt: task attempt number (0); a retried task produces 1, 2, and so on. Iceberg uses this to deduplicate partial retries.
UUID: random UUID for global uniqueness; prevents collisions even across multiple concurrent jobs writing to the same table.
File Number: sequential within a single task; one task may produce multiple files if it exceeds the target file size.
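As a worked illustration of this anatomy, here is a small hypothetical Python parser for the compact layout shown above. It is a sketch: real writers may embed a dashed UUID or extra segments, and this regex only assumes the 32-hex form used in this example.

```python
import re
from typing import NamedTuple

class DataFileName(NamedTuple):
    task_id: int
    attempt: int
    uuid: str
    file_number: int

# Hypothetical parser: <task>-<attempt>-<32-hex-uuid>-<filenum>.parquet
NAME_RE = re.compile(r"^(\d{5})-(\d+)-([0-9a-f]{32})-(\d{5})\.parquet$")

def parse_data_file_name(name: str) -> DataFileName:
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized data file name: {name}")
    return DataFileName(int(m.group(1)), int(m.group(2)), m.group(3), int(m.group(4)))

p = parse_data_file_name("00001-0-a4b8c2d1e5f6a7b8c9d0e1f2a3b4c5d6-00000.parquet")
print(p.task_id, p.attempt, p.file_number)  # → 1 0 0
```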
Section 02 — Partitioning

Partition transforms — the full spec

Iceberg doesn't store raw column values as partition keys. It applies transform functions that produce derived partition values. This is what enables hidden partitioning and safe evolution.

IDENTITY
identity(col)
Uses the raw column value as the partition value. Equivalent to Hive-style partitioning. Best for low-cardinality columns like region, status, country.
identity(region) → "APAC", "EMEA", "AMER"
BUCKET
bucket(N, col)
Hashes column value using Murmur3 hash, then MOD N to produce bucket 0..N-1. Best for high-cardinality IDs (user_id, order_id). Ensures even distribution. Joins on bucketed columns can be co-located.
bucket(16, user_id) → 0..15
TRUNCATE
truncate(W, col)
Truncates a string or integer to width W. Strings: takes first W characters. Integers: rounds down to nearest multiple of W. Good for prefix-based partitioning.
truncate(3, "abcdef") → "abc"
truncate(100, 1234) → 1200
YEAR
year(ts_col)
Extracts the year from a timestamp or date column. Returns an integer representing years since 1970. Ideal for long-term archival data rarely queried at fine granularity.
year(2024-03-15) → 54 (years since epoch)
MONTH
month(ts_col)
Extracts year+month as an integer (months since epoch). Most common temporal partitioning choice — balances partition count with granularity for daily/hourly data.
month(2024-03-15) → 650 (months since epoch; the 2024-03 partition)
DAY / HOUR
days(ts) / hours(ts)
days() for daily partitions. hours() for hourly — used in near-real-time streaming pipelines where data arrives and is queried at hour granularity.
hours(2024-03-15T14:30Z) → 2024-03-15-14
VOID
void(col)
Always produces NULL. Used during partition evolution to "deactivate" a partition field without removing it from the spec. Old writes retain old partition values; new writes go to NULL partition.
Used in partition evolution transitions only
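The transform examples above can be reproduced in a few lines of Python. This is a sketch of the value-level semantics only (identity and bucket omitted), not Iceberg's actual implementation:

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def truncate_int(width: int, v: int) -> int:
    return v - (v % width)        # rounds toward negative infinity, per spec

def truncate_str(width: int, v: str) -> str:
    return v[:width]              # first W characters

def year_transform(d: date) -> int:
    return d.year - 1970          # years since epoch

def month_transform(d: date) -> int:
    return (d.year - 1970) * 12 + (d.month - 1)   # months since epoch

def day_transform(d: date) -> int:
    return (d - EPOCH).days       # days since epoch

print(truncate_str(3, "abcdef"))           # → abc
print(truncate_int(100, 1234))             # → 1200
print(year_transform(date(2024, 3, 15)))   # → 54
print(month_transform(date(2024, 3, 15)))  # → 650
```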
Multi-dimensional Partitioning

Combining multiple transforms

-- E-commerce events table: partition by time + region + user bucket
CREATE TABLE prod.events.user_actions (
  event_id   BIGINT,
  user_id    BIGINT,
  session_id STRING,
  event_ts   TIMESTAMP,
  event_type STRING,
  region     STRING,
  amount     DOUBLE
) USING iceberg
PARTITIONED BY (
  hours(event_ts),      -- time dimension: ~24 partitions/day
  identity(region),     -- region dimension: ~5 partitions
  bucket(32, user_id)   -- user dimension: 32 buckets, enables join pruning
);

-- Total partitions: 24 * 5 * 32 = 3,840 partitions/day
-- Query: WHERE event_ts='2024-01-15 14:xx' AND region='APAC' AND user_id=12345
-- Iceberg prunes to exactly 1 partition out of 3,840 — without the user naming partition columns
Partition Evolution in action: Your table starts partitioned by months(event_ts). Traffic grows 10x and monthly partitions become too large. You run ALTER TABLE REPLACE PARTITION FIELD months(event_ts) WITH hours(event_ts). Iceberg assigns new files to the new hourly spec. Old Parquet files stay untouched under the old monthly spec. Queries work across both — Iceberg knows which spec each file belongs to via the manifest file's spec-id field.
{
  "partition-specs": [
    { "spec-id": 0, "fields": [ {"name":"event_ts_month", "transform":"month", "source-id":4} ] },  ← OLD spec
    { "spec-id": 1, "fields": [ {"name":"event_ts_hour",  "transform":"hour",  "source-id":4} ] }   ← NEW spec (current)
  ],
  "default-spec-id": 1 ← new writes use spec-id 1
}
Section 03 — Bucketing

Bucket partitioning — the deep mechanics

Buckets are the most powerful and misunderstood feature of Iceberg partitioning. They enable even data distribution, eliminate data skew, and unlock co-located joins — all based on deterministic Murmur3 hashing.

Hash Algorithm

Murmur3 hash function

HASH ALGORITHM

Iceberg uses Murmur3 hash (32-bit) applied to the column's canonical binary representation. The result is taken modulo N to assign a bucket number 0..N-1.

The hash is computed on the type-specific binary encoding — not the string representation — ensuring consistent bucket assignment across all engines (Spark, Trino, Flink all produce the same bucket for the same value).

TYPE ENCODINGS
INT / LONG → 64-bit little-endian (ints are promoted to long so the same value always lands in the same bucket)
FLOAT / DOUBLE → IEEE 754 binary representation
STRING → UTF-8 bytes
UUID → 16 bytes, big-endian
DATE → days since 1970-01-01 as INT32
TIMESTAMP → microseconds since epoch as INT64
DECIMAL → unscaled value as big-endian bytes
// Iceberg bucket assignment — spec algorithm (pseudocode)
function assign_bucket(value, n_buckets, column_type):
    bytes  = to_canonical_bytes(value, column_type)   // type-specific encoding
    hash   = murmur3_x86_32(bytes, seed=0)            // Murmur3 with seed=0
    // CRITICAL: mask off the sign bit before the modulo —
    // this prevents negative bucket numbers from negative hash values
    bucket = (hash AND Integer.MAX_VALUE) % n_buckets // always 0..N-1
    return bucket

// Example: bucket(16, user_id) for user_id=12345
bytes  = little_endian_8(12345)     // → b'\x39\x30\x00\x00\x00\x00\x00\x00'
hash   = murmur3_x86_32(bytes)      // → 1277776311 (illustrative)
bucket = 1277776311 % 16            // → 7
// user_id=12345 always lands in the same bucket (deterministic, engine-independent)
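For readers who want to check bucket assignments locally, here is a self-contained Python sketch of the same algorithm with a from-scratch Murmur3 (x86, 32-bit). The hash constants are the standard reference ones; the printed bucket is whatever the real hash yields, which may differ from the illustrative numbers in the pseudocode.

```python
def rotl32(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Reference Murmur3 (x86 variant, 32-bit), the hash the bucket transform specifies."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    nblocks = len(data) // 4
    for i in range(nblocks):                       # body: 4-byte little-endian blocks
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = rotl32((k * c1) & 0xFFFFFFFF, 15)
        k = (k * c2) & 0xFFFFFFFF
        h = rotl32(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    tail = data[4 * nblocks:]                      # up to 3 trailing bytes
    k = 0
    for b in reversed(tail):
        k = (k << 8) | b
    if tail:
        k = rotl32((k * c1) & 0xFFFFFFFF, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    h ^= len(data)                                 # finalization mix
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def iceberg_bucket(value: int, n_buckets: int) -> int:
    # Per the type-encoding table: INT/LONG hash as 64-bit little-endian
    h = murmur3_x86_32(value.to_bytes(8, "little", signed=True))
    return (h & 0x7FFFFFFF) % n_buckets            # mask sign bit, then modulo: 0..N-1

print(iceberg_bucket(12345, 16))  # deterministic: same bucket in Spark, Trino, Flink
```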
Distribution

bucket(8, user_id) data distribution

Bucket 0: user_id=1001 (h→2931), user_id=2345 (h→9847), user_id=8812 (h→4123)
Bucket 1: user_id=5533 (h→6531), user_id=9901 (h→1111)
Bucket 2: user_id=7720 (h→8823), user_id=3341 (h→5551), user_id=4412 (h→2200)
Bucket 3: user_id=6612 (h→7734)
Bucket 4: user_id=1220 (h→3345), user_id=8001 (h→9009)
Bucket 5: user_id=2210 (h→6643), user_id=9923 (h→4412)
Bucket 6: user_id=4400 (h→2211), user_id=5521 (h→8871), user_id=7731 (h→1098)
Bucket 7: user_id=12345 (h→1277), user_id=9001 (h→5532)
Join Optimization

Bucket joins — co-location magic

Bucket join optimization: When two tables are both bucketed by the same column with the same N, or one is a multiple of the other, Spark/Trino can perform a bucket join — matching bucket 3 of table A with bucket 3 of table B, eliminating the full shuffle. This is one of the biggest performance wins in Iceberg for repeated large joins.
-- Both tables bucketed by user_id with the same N=32
-- Iceberg + Spark can perform a co-located join (no shuffle!)
CREATE TABLE events PARTITIONED BY (bucket(32, user_id));
CREATE TABLE users  PARTITIONED BY (bucket(32, user_id));

-- This join reads bucket_0_of_events JOIN bucket_0_of_users locally,
-- bucket_1_of_events JOIN bucket_1_of_users locally, etc.
-- No cross-machine data movement needed (full shuffle avoided)
SELECT e.*, u.name, u.tier
FROM events e
JOIN users u ON e.user_id = u.user_id
WHERE e.event_ts >= '2024-01-01';
-- Execution: 32 local joins in parallel, zero network bytes for join keys
Sizing Guide

Choosing bucket count N

TOO FEW BUCKETS
Each bucket grows too large, slowing queries that must read a full bucket, and uneven load risks hot buckets. As a floor, choose N ≥ total data size / target file size.
SWEET SPOT
Rule of thumb: N = ceil(total_GB / 128). For a 10TB table: ~80 buckets. For 1TB: ~8 buckets. Align to powers of 2 for join co-location (2, 4, 8, 16, 32, 64...).
TOO MANY BUCKETS
Creates tiny files that are expensive to list and open. Each file has Parquet footer overhead (~8KB). Compaction cost increases. Don't exceed 1 bucket per ~100MB of data.
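The sizing rule of thumb above can be captured in a small hypothetical helper that also rounds up to a power of 2 for bucket-join compatibility:

```python
import math

def suggest_bucket_count(total_gb: float, gb_per_bucket: int = 128) -> int:
    # Rule of thumb from the text: N = ceil(total_GB / 128),
    # then round up to the nearest power of 2 for bucket-join co-location.
    raw = max(1, math.ceil(total_gb / gb_per_bucket))
    return 1 << (raw - 1).bit_length()

print(suggest_bucket_count(10240))  # 10TB → ceil(10240/128)=80 → 128
print(suggest_bucket_count(1024))   # 1TB  → ceil(1024/128)=8   → 8
```

Note that the power-of-2 rounding pushes the 10TB estimate from 80 up to 128; if join co-location is not a goal, the raw value is fine.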
Section 04 — Sort Orders

Sort orders & clustering

Iceberg maintains a sort order specification separate from the partition spec. Sorting determines how rows are physically ordered within each data file — which directly impacts column statistics quality and scan efficiency.

-- Define the table, then attach an explicit sort order (Spark SQL)
CREATE TABLE prod.events (
  event_id BIGINT,
  user_id  BIGINT,
  event_ts TIMESTAMP,
  region   STRING,
  amount   DOUBLE
) USING iceberg
PARTITIONED BY (months(event_ts), identity(region))
TBLPROPERTIES (
  'write.distribution-mode' = 'hash',  -- hash-distribute rows before sorting/writing
  'write.wap.enabled' = 'true'
);

-- Sort within each partition by event_ts ASC, then user_id ASC
ALTER TABLE prod.events WRITE ORDERED BY event_ts ASC NULLS LAST, user_id ASC NULLS LAST;
Sort Key              | Direction | NULLs | Transform | Benefit
event_ts              | ASC       | LAST  | identity  | Time-range queries skip entire files via min/max stats
user_id               | ASC       | LAST  | identity  | User-specific queries (after time filter) are contiguous in file
bucket(16, user_id)   | ASC       | LAST  | bucket    | Co-locates same-bucket rows — combine with bucket partitioning
truncate(2, category) | ASC       | FIRST | truncate  | Groups similar category strings, improves string column stats
Why sort order is critical for pruning: Iceberg tracks min/max per column per file. If a file contains rows from all time ranges (random order), min/max spans the entire range — useless for pruning. If sorted by event_ts, each file's min/max covers a tight time window — 99% of files can be skipped for time-bounded queries. The sort order is stored in table metadata so engines can be aware of clustering and plan optimally.
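The min/max skipping the paragraph describes is easy to sketch. Assuming per-file bounds like those stored in manifests, a planner keeps a file only when its range can overlap the predicate (toy data):

```python
# Per-file [min, max] bounds for the filter column, as a manifest would record them
files = [
    {"path": "f1.parquet", "min": "2024-01-01", "max": "2024-01-07"},
    {"path": "f2.parquet", "min": "2024-01-08", "max": "2024-01-14"},
    {"path": "f3.parquet", "min": "2024-01-15", "max": "2024-01-21"},
]

def prune(files, lo, hi):
    # A file is skippable when its [min, max] lies entirely outside [lo, hi]
    return [f["path"] for f in files if not (f["max"] < lo or f["min"] > hi)]

print(prune(files, "2024-01-10", "2024-01-12"))  # → ['f2.parquet']
```

With a random row order, every file's bounds would span the whole month and nothing could be pruned; sorting by event_ts is what makes the ranges tight.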
Multi-dimensional Sort

Z-Order clustering — 2D locality

❌ Sorted by user_id only
Files are sorted by user_id. Query on (user_id=X AND event_ts between A and B) finds user rows contiguous, but event_ts values are scattered across all files — can't prune on time.
✓ Z-Order by (user_id, event_ts)
Z-Order interleaves bits of both dimensions. Row locality for both user_id AND event_ts within the same file — queries on either dimension (or both) benefit from file pruning.
import org.apache.iceberg.spark.actions.SparkActions

// Z-order compaction: rewrite files with Z-order clustering
SparkActions.get(spark)
  .rewriteDataFiles(table)
  .zOrder("user_id", "event_ts")                  // Z-order on 2 columns
  .option("target-file-size-bytes", "536870912")  // 512MB target
  .option("min-file-size-bytes", "268435456")     // files under 256MB are rewrite candidates
  .execute()
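The bit interleaving behind Z-ordering can be shown in a few lines of Python. This is a toy sketch; real implementations interleave fixed-width, normalized encodings of the column values:

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    # Z-value: x occupies even bit positions, y odd ones, so nearby
    # (x, y) pairs produce nearby sort keys in BOTH dimensions
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

print(interleave_bits(0b0011, 0b0101))  # → 39 (0b100111)

# Sorting a grid by z-value visits it in the locality-preserving Z curve
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda p: interleave_bits(*p))
print(cells[:4])  # → [(0, 0), (1, 0), (0, 1), (1, 1)]
```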
Section 05 — Metadata Architecture

The complete metadata stack

Iceberg metadata is a 4-level hierarchy. Each level serves a specific purpose. Understanding each file type is essential for debugging, performance tuning, and building tooling on top of Iceberg.

Level 1

metadata.json — the table definition

{
  "format-version": 2,
  "table-uuid": "9c12102f-7e93-4b71-aa35-b2123fe12cd4",
  "location": "s3://warehouse/db/events",
  "last-sequence-number": 412, // monotonically increasing, tracks operation order
  "last-column-id": 8, // max column ID assigned (IDs never reused!)
  "current-schema-id": 2,
  "schemas": [ /* all historical schemas kept */ ],
  "default-spec-id": 1,
  "partition-specs": [ /* all partition specs — never deleted */ ],
  "default-sort-order-id": 1,
  "sort-orders": [ /* sort order definitions */ ],
  "current-snapshot-id": 9012345678901234, // ← THIS is what makes a table "current"
  "snapshots": [
    { "snapshot-id":8120.., "manifest-list":"s3://.../snap-8120.avro", "parent-id":7341.. },
    { "snapshot-id":9012.., "manifest-list":"s3://.../snap-9012.avro", "parent-id":8120.. }
  ],
  "refs": {
    "main": {"type":"branch", "snapshot-id":9012..},
    "ml_v3": {"type":"tag", "snapshot-id":8120..} // named tag
  },
  "statistics": [ /* Puffin file references */ ],
  "properties": { "write.target-file-size-bytes":"536870912" }
}
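Query planning starts by resolving the current snapshot from this document. With a toy metadata dict shaped like the excerpt above (values invented for illustration), the lookup is:

```python
meta = {
    "current-snapshot-id": 9012,
    "snapshots": [
        {"snapshot-id": 8120, "manifest-list": "s3://wh/db/events/metadata/snap-8120.avro"},
        {"snapshot-id": 9012, "manifest-list": "s3://wh/db/events/metadata/snap-9012.avro"},
    ],
}

def current_manifest_list(meta: dict) -> str:
    # Find the snapshot record matching current-snapshot-id; its
    # manifest-list path is where scan planning begins
    sid = meta["current-snapshot-id"]
    snap = next(s for s in meta["snapshots"] if s["snapshot-id"] == sid)
    return snap["manifest-list"]

print(current_manifest_list(meta))  # → s3://wh/db/events/metadata/snap-9012.avro
```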
Levels 2 & 3

Manifest list & manifest files

Manifest List (snap-9012.avro)
  manifest_path              s3://…/a1b2-m0.avro
  manifest_length            18291
  partition_spec_id          1
  content                    DATA (0)
  sequence_number            412
  added_snapshot_id          9012345…
  added_data_files_count     24
  existing_data_files_count  3,841
  deleted_data_files_count   0
  partitions (summary)       region IN (APAC, EMEA)
  lower_bounds               event_ts ≥ 2024-01-15
  upper_bounds               event_ts ≤ 2024-01-15
Manifest File (a1b2-m0.avro) — 1 row per data file
  status                ADDED (1)
  snapshot_id           9012345…
  data_sequence_number  412
  file_path             s3://…/data/…/00001.parquet
  file_format           PARQUET
  partition             {month: 2024-01, region: APAC}
  record_count          2,412,881
  file_size_in_bytes    524,288,000
  column_sizes          {1: 4MB, 2: 8MB, 3: 12MB…}
  value_counts          {1: 2412881, 2: 2412881…}
  null_value_counts     {1: 0, 7: 12}
  lower_bounds          {3: 2024-01-01 00:00:00}
  upper_bounds          {3: 2024-01-31 23:59:59}
The manifest list is the key to fast planning: It contains partition-level summaries per manifest. A planner reading the manifest list never needs to open manifest files for partitions that don't match. For a query on 1 day out of 3 years, Iceberg skips 99.9% of manifests by reading only the manifest list (a few KB), then reads only the matching manifest files for fine-grained file pruning.
Format Versions

Spec v1 vs v2 vs v3

Capability                     | Spec v1 (2018) | Spec v2 (2021)        | Spec v3 (2024)
ACID Transactions              | ✓              | ✓                     | ✓
Row-level Deletes              | ✗              | ✓ Position + Equality | ✓ + DVs
Deletion Vectors (DVs)         | ✗              | ✗                     | ✓ (bitmap-based)
Row Lineage (sequence numbers) | ✗              | ✗                     | ✓
Branches & Tags                | ✗              | ✓                     | ✓
Puffin Statistics              | ✗              | ⚠ Partial             | ✓ Full (NDV, Bloom, etc.)
Multi-table Transactions       | ✗              | ✗                     | ✓ (catalog-level)
Variant Type (semi-structured) | ✗              | ✗                     | ✓
Geometry / Geography types     | ✗              | ✗                     | ✓
Section 06 — Indexing

Indexing strategies in Iceberg

Iceberg doesn't have traditional database indexes. Instead, it builds a multi-layer system of statistics, bloom filters, and data layout optimizations that collectively act as a distributed index. Each layer adds another level of pruning power.

📐 Min/Max Column Statistics LEVEL 3

Every manifest entry (one per data file) stores lower_bounds and upper_bounds per column — the minimum and maximum value in that file for each column.

Stored as the column's native binary representation. During query planning, the planner checks: does the query predicate overlap with [min, max]? If not, skip the file — no S3 read required.

Best for: Sorted or time-series columns (event_ts, date). Worst for high-cardinality unordered columns (UUID, random string IDs) where every file's range spans the full domain.

🧮 Null / Value Counts LEVEL 3

Each manifest entry tracks value_counts (non-null rows), null_value_counts, and nan_value_counts per column per file.

Enables fast IS NULL / IS NOT NULL predicate pruning. Also feeds query optimizer cost estimates — knowing a column is 90% NULL in a file lets the planner avoid projecting that column.

No extra cost: These stats are collected during the write path and stored inline in the manifest Avro record — zero additional overhead at read time.

🎯 Bloom Filters (Puffin) LEVEL 3 — v2+

Bloom filters answer "does this file contain value X?" with zero false negatives and a configurable false positive rate (default ~1%). In practice they are written per column into the Parquet files themselves, enabled via table properties, while Puffin files hold table-level statistics blobs referenced from table metadata.

Critical for point lookups and IN-list queries on high-cardinality columns where min/max is useless (e.g., WHERE user_id = 12345 — every file has user IDs from 1 to 100M, so min/max doesn't help).

Storage cost: ~1.2MB per 10M rows per column at 1% FPR. Worth it for frequently filtered high-cardinality columns like user_id, device_id, session_id.
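The membership test is the whole trick. A toy Python Bloom filter (hash-based, not Parquet's split-block layout) shows the no-false-negatives guarantee that makes file skipping safe:

```python
import hashlib

class BloomSketch:
    """Toy Bloom filter illustrating the skip test; not the split-block format."""
    def __init__(self, n_bits: int = 1 << 20, n_hashes: int = 7):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, value: str):
        for i in range(self.n_hashes):      # k independent hash positions
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, value: str):
        for p in self._positions(value):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, value: str) -> bool:
        # False → value is definitely absent (safe to skip the file)
        # True  → value may be present (must read the file)
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(value))

f = BloomSketch()
for uid in ("12345", "67890"):
    f.add(uid)
print(f.might_contain("12345"))  # → True (no false negatives, ever)
print(f.might_contain("99999"))  # almost certainly False → skip this file
```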

📊 NDV Sketches (Puffin) LEVEL 3 — v3

Number-of-distinct-values (NDV) estimates using Apache DataSketches Theta sketches. Stored in Puffin blobs. Gives the query optimizer an accurate estimate of how many distinct values a column has — without a full scan.

Powers cost-based optimizer decisions: join order selection, hash vs sort-merge join choice, broadcast join threshold. Without NDV, optimizers must guess — leading to bad plans on skewed data.

Automatic maintenance: Iceberg's ANALYZE TABLE command refreshes NDV sketches. It can be scheduled as a lightweight background job.

🗺️ Partition-level Pruning LEVEL 2

The manifest list stores partition-level summary statistics for each manifest file. Before even reading individual manifest files, the planner eliminates entire manifests based on partition ranges.

This is two-level pruning: first eliminate manifests (manifest list → small read), then eliminate files within surviving manifests (manifest file → medium read), then eliminate row groups inside Parquet files (Parquet statistics → no extra read).

Cascade effect: A query targeting 1 day of data in a 10-year table eliminates 3,649 days of manifests from the manifest list read alone — before touching a single manifest file.
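The two-level cascade can be sketched with toy manifest summaries: partition ranges eliminate whole manifests before per-file stats are even consulted (days counted since epoch, data invented for illustration):

```python
# Each manifest carries a partition-range summary plus per-file entries
manifests = [
    {"path": "m0.avro", "days": (0, 364),      # year 1
     "files": [{"path": "f-old.parquet", "day": 100}]},
    {"path": "m1.avro", "days": (365, 729),    # year 2
     "files": [{"path": "f-new-a.parquet", "day": 400},
               {"path": "f-new-b.parquet", "day": 700}]},
]

def plan(manifests, day):
    files = []
    for m in manifests:
        lo, hi = m["days"]
        if not (lo <= day <= hi):
            continue                     # level 2: whole manifest skipped
        for f in m["files"]:
            if f["day"] == day:          # level 3: per-file stats
                files.append(f["path"])
    return files

print(plan(manifests, 700))  # → ['f-new-b.parquet']
```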

📦 Parquet Row Group Stats LEVEL 4 — In-file

Inside each Parquet file, row groups (128MB chunks) carry their own column statistics: min, max, null count, and distinct count. Parquet page-level statistics go even deeper.

Iceberg's pruning pipeline doesn't stop at files — after determining which files to read, the Parquet reader itself skips row groups that can't satisfy predicates. This is in-file index pruning.

Column indexes (Parquet v2): Page-level statistics and offset indexes enable skipping individual pages within a row group. Combined with Iceberg's sort order (which clusters rows), this achieves near-index performance.

Puffin File Format

Puffin — statistics blob storage

What is Puffin? Puffin is Iceberg's binary file format for storing extended statistics that don't fit in Avro manifest records. Each Puffin file contains a list of "blobs": arbitrary byte sequences with a type tag and properties. Standardized blob types include apache-datasketches-theta-v1 (NDV sketches) and, in v3, deletion-vector-v1 (bitmap deletion vectors); engines may register further types for structures such as HLL sketches and Bloom filters.
Theta Sketch (NDV)
Apache DataSketches Theta sketch for near-distinct-value estimation. Mergeable — multiple task outputs can be combined. Typical size: 2–8KB per column.
Use: COUNT DISTINCT optimization, join cost estimation
HLL++ Sketch
HyperLogLog++ for cardinality estimation. More space-efficient than Theta for very high cardinality. Sub-1% error rate at 4KB size.
Use: High-cardinality columns (device IDs, user IDs, UUIDs)
Bloom Filter
Split-block Bloom filter (Parquet-compatible format). Per-file, per-column. Enables O(1) membership test: "does this file have user_id=X?"
Use: Point lookups, IN list queries, foreign key joins
-- Enable Bloom filters for specific columns
ALTER TABLE prod.events SET TBLPROPERTIES (
  -- Write Bloom filters for high-cardinality lookup columns
  'write.parquet.bloom-filter-enabled.column.user_id'    = 'true',
  'write.parquet.bloom-filter-enabled.column.session_id' = 'true',
  'write.parquet.bloom-filter-enabled.column.device_id'  = 'true',
  -- Cap filter size per column (a larger filter lowers the false positive rate)
  'write.parquet.bloom-filter-max-bytes' = '1048576',  -- 1MB max per column
  -- Collect column statistics for the optimizer
  'write.parquet.statistics.enabled' = 'true'
);

-- Refresh table statistics (updates Puffin blobs)
ANALYZE TABLE prod.events COMPUTE STATISTICS FOR ALL COLUMNS;
Section 07 — Query Pruning

The complete pruning pipeline

When a query hits an Iceberg table, it goes through a 6-stage pruning cascade. Each stage eliminates data at a different granularity — from entire manifests down to individual Parquet pages. Understanding this pipeline explains why Iceberg can achieve 99%+ skip ratios.

01
Catalog Resolution (milliseconds)
Query engine calls the catalog (Glue, Nessie, Hive Metastore) to get the current metadata.json path. Reads metadata.json (~10KB JSON) to identify current snapshot ID and manifest list path. No data files are touched.
~1 HTTP call
02
Manifest List Scan — Partition Elimination (milliseconds)
Reads the manifest list Avro file (one record per manifest). Each record contains partition summary stats. Planners evaluate predicates against partition ranges: if a manifest's partition range can't intersect the query's WHERE clause, skip the entire manifest. For a 3-year table queried on today: skips ~1,095 manifests by reading a single file of ~100KB.
Eliminates 90–99%
03
Bloom Filter Check — Point Lookup Acceleration (milliseconds)
For point lookup predicates (WHERE user_id = X), Iceberg loads Puffin-stored Bloom filters for surviving manifests. For each data file referenced in a manifest, queries the Bloom filter: "does this file contain user_id=12345?" — answer is NO (certain skip) or MAYBE (must read). Eliminates files that definitely don't contain the target value. Typical skip: 90–99% for uniform hash distributions.
O(1) per file
04
Manifest File Scan — Column-level File Pruning (milliseconds)
Reads surviving manifest Avro files (one record per data file). Uses per-file lower_bounds / upper_bounds per column to apply range predicates. If a predicate is event_ts BETWEEN A AND B, files where upper_bound < A or lower_bound > B are skipped. This is especially powerful for sorted tables — files contain tight time windows with non-overlapping ranges.
Eliminates 80–99%
05
Parquet Row Group Pruning (file open cost)
For surviving Parquet files, the Parquet reader checks row group statistics (min/max per column per row group, stored in the Parquet footer). Row groups that can't satisfy predicates are skipped without reading their pages. A 512MB Parquet file typically has 4–8 row groups of 64–128MB. Row group pruning can reduce actual bytes read to 10–25% of the file.
Row group level
06
Parquet Page-level + Column Projection (I/O reduction)
Within a surviving row group, Parquet's column index and offset index (v2 spec) enable skipping individual data pages. Additionally, column projection ensures only requested columns are read from object storage — Parquet's columnar layout means SELECTing 5 of 50 columns reads only 10% of the file bytes. Together, these reduce actual S3 bytes read to sometimes <1% of total table size.
Page level
Cumulative skip ratio: On a well-organized Iceberg table (partitioned + sorted + bloom filters), a targeted query (specific time range + specific user) can skip 99.9%+ of total data across all 6 levels before touching a single Parquet row. The remaining reads are then served from OS page cache or S3 intelligent tiering — further reducing effective latency.
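The cumulative effect is just multiplication of per-stage survival fractions. With illustrative (assumed) keep-rates for each stage:

```python
# Assumed per-stage "keep" fractions — illustrative, not measured
stages = [
    ("manifest-list partition pruning", 0.05),
    ("bloom filter check",              0.10),
    ("manifest min/max file pruning",   0.20),
    ("parquet row-group pruning",       0.25),
    ("page skipping + projection",      0.10),
]

surviving = 1.0
for name, keep in stages:
    surviving *= keep

print(f"{surviving:.6f}")  # → 0.000025, i.e. 99.9975% of bytes skipped
```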
Section 08 — Compaction & Maintenance

Compaction — keeping data healthy

Streaming writes and frequent small batch jobs produce many small files. Small files destroy pruning efficiency (many manifest entries to check), waste S3 list overhead, and slow Parquet reads. Compaction is the maintenance operation that rebalances file sizes and improves layout.

⚠ Before Compaction (streaming)
part-0001.parquet (2MB)
part-0002.parquet (1MB)
part-0003.parquet (3MB)
part-0004.parquet (2MB)
part-0005.parquet (1MB)
part-0006.parquet (4MB)
part-0007.parquet (2MB)
part-0008.parquet (1MB)
Bin-pack compaction
Groups files to fill target size
Target: 512MB
[2+1+3+2+1+4+2+1=16MB]
→ 1 output file
✓ After Compaction
part-0001-compacted.parquet (16MB)
8 files → 1 file
8 manifest entries → 1 entry
Better min/max range coverage
Faster listing, faster planning
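Bin-pack grouping itself is simple. A first-fit-decreasing sketch over the file sizes shown above (one of several packing heuristics; not Iceberg's exact planner):

```python
def bin_pack(sizes_mb, target_mb):
    # First-fit decreasing: place each file (largest first) into the
    # first group that still has room under the target size
    groups = []
    for size in sorted(sizes_mb, reverse=True):
        for g in groups:
            if sum(g) + size <= target_mb:
                g.append(size)
                break
        else:
            groups.append([size])
    return groups

files = [2, 1, 3, 2, 1, 4, 2, 1]   # the streaming files above (MB)
print(bin_pack(files, 512))        # → [[4, 3, 2, 2, 2, 1, 1, 1]]  (one output file)
print(bin_pack(files, 8))          # → [[4, 3, 1], [2, 2, 2, 1, 1]]  (tighter target)
```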
Strategies

Compaction strategies

📦 Bin-Packing

Default strategy. Groups small files together until they fill the target file size (~512MB). Uses a bin-packing algorithm to minimize file count while hitting target size.

Does NOT change the sort order of rows within or across files. Simply concatenates content. Best for tables that are already well-sorted or where you just need to reduce file count after streaming ingestion.

Trigger: Schedule every 1–6 hours for streaming tables. Run on partitions older than 1 hour (recent partitions are still actively written).

🔀 Sort + Rewrite

Most powerful for query performance. Reads all files in a partition, sorts all rows by the table's sort order, then writes out optimally-sized files. This maximizes min/max pruning effectiveness.

Cost: significantly more compute than bin-packing (full sort of all rows). Best run daily or weekly on hot partitions where read performance matters most.

Combine with Z-Order for multi-dimensional locality: rewriteDataFiles.zOrder("user_id", "event_ts") applies Z-curve ordering across both dimensions.

import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.expressions.Expressions
import org.apache.iceberg.spark.actions.SparkActions

val table   = catalog.loadTable(TableIdentifier.of("prod", "events"))
val actions = SparkActions.get(spark)

// ─── STEP 1: Compact small files (bin-pack, 512MB target) ───────────────
actions.rewriteDataFiles(table)
  .option("target-file-size-bytes", "536870912")       // 512MB
  .option("min-file-size-bytes", "134217728")          // files under 128MB are rewrite candidates
  .option("max-file-group-size-bytes", "10737418240")  // 10GB max per group
  .option("partial-progress.enabled", "true")          // commit partial progress
  .filter(Expressions.lessThan("event_ts", ninetyMinutesAgo()))
  .execute()

// ─── STEP 2: Sort-rewrite hot partitions for optimal pruning ────────────
actions.rewriteDataFiles(table)
  .zOrder("user_id", "event_ts")                       // Z-order on 2 most-queried cols
  .option("target-file-size-bytes", "536870912")
  .filter(Expressions.and(
    Expressions.greaterThanOrEqual("event_ts", sevenDaysAgo()),
    Expressions.lessThan("event_ts", yesterday())
  ))
  .execute()

// ─── STEP 3: Rewrite manifests (reduce manifest file count) ─────────────
actions.rewriteManifests(table)
  .option("min-manifests-count-to-merge", "10")
  .option("target-manifest-file-size-bytes", "8388608")  // 8MB
  .execute()

// ─── STEP 4: Expire old snapshots (retain 7 days) ───────────────────────
actions.expireSnapshots(table)
  .expireOlderThan(sevenDaysAgo())
  .retainLast(100)             // always keep at least 100 snapshots
  .cleanExpiredFiles(true)     // delete unreferenced data files
  .execute()

// ─── STEP 5: Remove orphan files (safety cleanup) ───────────────────────
actions.deleteOrphanFiles(table)
  .olderThan(threeDaysAgo())   // safety buffer: in-flight writes may take time
  .execute()
Delete Modes

Copy-on-Write vs Merge-on-Read

Copy-on-Write (COW)
  On DELETE/UPDATE     Rewrites entire affected files
  Write amplification  High (full file rewrite)
  Read overhead        Zero (no delete files)
  Compaction needed    Less often
  Best for             Read-heavy, infrequent updates
  Use case             Batch ETL, daily corrections

Merge-on-Read (MOR)
  On DELETE/UPDATE     Writes small delete files
  Write amplification  Low (append delete file)
  Read overhead        Merge delete files at scan time
  Compaction needed    Regularly (clean delete files)
  Best for             Write-heavy, CDC, streaming
  Use case             CDC pipelines, GDPR deletes
Deletion Vectors (Spec v3): Iceberg v3 introduces Deletion Vectors (DVs), a bitmap-based delete format similar to the deletion vectors in Delta Lake. A DV is a roaring bitmap stored in a Puffin blob that marks which row positions in a data file are deleted. DVs are more space-efficient than position-delete files and don't require sorting. On read, the engine applies the bitmap mask to skip deleted rows, with no full merge required. Expected 2–10x reduction in delete overhead for CDC workloads.
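The read-time mechanics of a deletion vector reduce to applying a position mask. A minimal sketch (real DVs are roaring bitmaps in Puffin blobs; a plain Python set stands in here):

```python
# Rows of one data file, addressed by position
rows = ["alice", "bob", "carol", "dave"]

# Deleted row positions, as a MOR delete would record them
deletion_vector = {1, 3}

# At scan time the engine yields only rows whose position is not masked
live = [r for i, r in enumerate(rows) if i not in deletion_vector]
print(live)  # → ['alice', 'carol']
```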