A comparative technical analysis of the top open-source Data Catalog frameworks for PB-scale HDFS/Apache Iceberg environments — covering ingestion architecture, scalability, security, and deep integration patterns.
Managing petabytes of data across HDFS while migrating aggressively to Apache Iceberg creates a dual challenge: operational metadata volatility (Iceberg's immutable manifest files generate thousands of new metadata artefacts per commit) and semantic metadata complexity (schemas, lineage, ownership, PII tags, and quality SLOs must be discoverable across hundreds of tables and pipelines).
A production-grade Data Catalog must satisfy five simultaneous constraints at PB scale: near-real-time ingestion of metadata events, horizontal scalability of the metadata store, native Iceberg awareness (snapshots, manifest lists, partition specs), fine-grained security integrated with Apache Ranger, and tight coupling with query engines like Spark and Trino so every query sees the authoritative, current schema.
DataHub (v1.x, open-source) emerges as the clear leader for PB-scale HDFS/Iceberg environments due to its stream-first architecture (Kafka-backed MCE/MAE pipeline), native Iceberg REST Catalog integration introduced in v1.0 (Feb 2025), horizontally scalable dual-store (MySQL/Postgres + Elasticsearch), and a maturing Apache Ranger authorisation plugin. For organisations with legacy Hadoop estates requiring classification-driven policy propagation, Apache Atlas is a compelling secondary option.
Event-driven, Kafka-backed ingestion with dual-store (RDBMS + ES). Native Iceberg REST Catalog in v1.0 (2025). Highest velocity OSS community.
Deep Hadoop hooks, tag-based lineage propagation, Ranger ABAC integration. Best for heavily regulated Hadoop-native environments.
Lightweight discovery, Google-like UX, Neo4j graph backend. Community activity significantly declined — not recommended for new deployments.
Evaluated across six dimensions critical for PB-scale HDFS/Iceberg deployments.
Push-vs-Pull fundamentally determines latency and operational coupling. Stream-based systems (DataHub) achieve sub-minute metadata freshness; pull-based systems (Amundsen) rely on scheduled crawls.
Graph databases (JanusGraph/Neo4j) excel at relationship traversal but struggle with write-heavy, high-partition workloads. Relational+Elasticsearch separates concerns and scales writes independently.
True Iceberg-native support means tracking snapshots, manifest lists, partition evolution, and statistics — not just reading schema from Hive Metastore. Only DataHub v1.x achieves this with its REST Catalog.
| Dimension | DataHub | Apache Atlas | Amundsen |
|---|---|---|---|
| Ingestion Model | Stream (Kafka MCE) + Pull | Hook-based (Hive, Spark, Kafka) + REST API | Pull only (Airflow-scheduled) |
| Metadata Freshness | Near real-time (<60s via Kafka) | Near real-time (hook-triggered) | Batch (hourly/daily crawl) |
| Primary Store | MySQL / PostgreSQL (documents) + Elasticsearch (search + graph via MAE) | JanusGraph (BerkeleyDB or HBase back-end) + Solr / Elasticsearch | Neo4j (graph) + Elasticsearch |
| Graph Backend | Relational + ES Graph (no dedicated graph DB requirement) | JanusGraph — deep graph traversal but write-bottleneck at scale | Neo4j — strong for discovery, limited at PB partition counts |
| Iceberg Native | Full REST Catalog (v1.0+) · Snapshots · Manifest Lists · Stats | Via Hive hook (partial) — no snapshot-level awareness | Via generic extractor only — schema-level only |
| Partition Scalability | High — stateless MCE consumers, ES sharding | Medium — JanusGraph write throughput bounded by BerkeleyDB | Low — Neo4j node density degrades at 10M+ partitions |
| Column Lineage | ✅ End-to-end, SQL-parsed | ✅ Classification propagation via lineage graph | ⚠️ Basic table-level only |
| Ranger Integration | ✅ Plugin (Ranger authoriser for DataHub policies) | ✅ Native — tag-driven Ranger policies (gold standard) | ❌ None |
| PII / Classification | ✅ AI-powered auto-classification + Actions Framework | ✅ Classification system (PII, SENSITIVE, EXPIRES_ON) + lineage propagation | ⚠️ Manual tagging only |
| Deployment Complexity | High — Kafka, MySQL/PG, ES, GMS pods (K8s Helm chart) | High — JanusGraph, Solr/ES, Kafka, ZooKeeper | Low — Neo4j + ES + microservices (easiest) |
| Community Health (2025) | Very Active — 12,500+ Slack, 10k+ stars, Acryl backing | Active — Apache ASF governance, enterprise adoption | Dormant — no active roadmap (Dec 2025) |
| OSS License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
At 10M Iceberg partitions, DataHub adds one Elasticsearch shard and scales GMS pods — no schema migrations or graph re-indexing required.
At 10M+ partitions with HBase back-end, Atlas is viable but requires careful region server tuning. BerkeleyDB is a non-starter at this scale.
Not recommended for PB-scale HDFS/Iceberg. Amundsen is architecturally unsuited and the project is effectively dormant.
DataHub's architecture is purpose-built for distributed, event-driven metadata management. Its separation of concerns between ingestion, storage, search, and serving tiers enables independent scaling at PB volumes.
Three key topics form the backbone of DataHub's event pipeline. The MetadataChangeProposal (MCP) topic carries proposals to mutate metadata entities (datasets, schemas, ownership, tags). The MetadataAuditEvent (MAE) topic is published after GMS successfully commits a change — downstream consumers (search indexer, graph builder) react to MAE, not MCP. The FailedMCP topic acts as a dead-letter queue for proposals rejected by GMS validation.
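The routing logic among these three topics can be sketched in a few lines. This is an illustrative stand-in, not DataHub code: the `Broker` class and the validation rule are hypothetical, while the topic names come from the text above.

```python
from dataclasses import dataclass, field

@dataclass
class Broker:
    """Toy in-memory stand-in for the three Kafka topics."""
    topics: dict = field(default_factory=lambda: {
        "MetadataChangeProposal": [],
        "MetadataAuditEvent": [],
        "FailedMCP": [],
    })

    def publish(self, topic: str, event: dict) -> None:
        self.topics[topic].append(event)

def gms_process(broker: Broker, mcp: dict) -> None:
    """Consume an MCP: commit valid proposals and emit an MAE; dead-letter the rest."""
    if "urn" in mcp and "aspect" in mcp:          # stand-in for GMS validation
        broker.publish("MetadataAuditEvent", {"urn": mcp["urn"], "committed": True})
    else:
        broker.publish("FailedMCP", mcp)          # dead-letter queue

broker = Broker()
gms_process(broker, {"urn": "urn:li:dataset:(iceberg,warehouse.transactions,PROD)",
                     "aspect": "schemaMetadata"})   # valid → MAE emitted
gms_process(broker, {"aspect": "ownership"})        # missing urn → FailedMCP
```

The key property this models is that downstream consumers only ever see committed changes (MAE), never raw proposals (MCP).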
GMS is the authoritative write-and-read service. Built in Spring Boot, it exposes both a GraphQL API (consumed by the React frontend) and a REST API (consumed by programmatic clients and integrations). GMS validates incoming MCPs, writes entity aspects to the relational store, and emits MAEs downstream. It is stateless — any number of GMS pods can run behind a load balancer.
MySQL/PostgreSQL stores entity aspects — structured JSON documents keyed by entity URN and aspect name. This gives DataHub a strongly consistent, transactional primary store that handles schema evolution trivially. Elasticsearch receives a projected, denormalised copy via the MAE Consumer and provides sub-second full-text search, faceted filtering, and graph-edge queries across hundreds of millions of entities.
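The append-only aspect layout can be modelled with a few lines of Python. The storage shape here — versioned documents keyed by `(urn, aspect name)` — follows the description above, but the function and key names are illustrative, not DataHub's actual schema.

```python
from collections import defaultdict

# (urn, aspect_name) -> append-only list of versioned aspect documents
store = defaultdict(list)

def write_aspect(urn: str, aspect: str, doc: dict) -> int:
    """Append a new version of an aspect; return the new version number."""
    store[(urn, aspect)].append(doc)
    return len(store[(urn, aspect)])

urn = "urn:li:dataset:(iceberg,warehouse.transactions,PROD)"
v1 = write_aspect(urn, "schemaMetadata", {"fields": ["id", "amount"]})
v2 = write_aspect(urn, "schemaMetadata", {"fields": ["id", "amount", "email"]})

latest = store[(urn, "schemaMetadata")][-1]   # serving tier reads the latest version
```

Because writes only ever append, schema evolution never requires migrating old rows; Elasticsearch holds only a projection of the latest version.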
The Python-based ingestion system supports both push-based (Kafka emitter from Spark listeners) and pull-based (scheduled recipes via Airflow). Recipes are YAML-configured pipelines with a Source, Transformer, and Sink pattern. The Iceberg Source plugin reads table metadata via the Iceberg REST Catalog API or directly from the Hive Metastore, extracting schema fields, partition specs, snapshot history, and table statistics.
```yaml
# DataHub Iceberg ingestion recipe
source:
  type: iceberg
  config:
    catalog:
      name: production
      type: rest
      uri: http://datahub-iceberg-catalog:8181
    user_ownership_property: owner
    table_parallelism: 16
    include_table_stats: true
    include_snapshots: true
sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: kafka-broker:9092
```
The React frontend communicates exclusively via GraphQL against GMS. The Actions Framework allows DataHub to react to metadata changes programmatically: trigger PII review workflows when a column is classified, send Slack alerts when a deprecated dataset is queried, or propagate glossary terms downstream via lineage — all without custom code.
Trace the exact path a metadata change takes from the moment a Spark job commits an Iceberg snapshot to the moment it appears in the DataHub UI and is enforced by Ranger policies.
A Spark job writing to warehouse.transactions via spark.sql("INSERT INTO ...") calls the Iceberg CommitTableTransaction. This atomically replaces the metadata.json pointer with a new snapshot entry, appending a new manifest list file and one or more manifest files to HDFS. This is a purely Iceberg-internal operation — no catalog notification has been sent yet.
```scala
// Spark Iceberg write with DataHub listener
spark.conf.set("spark.extraListeners",
  "io.openlineage.spark.agent.OpenLineageSparkListener")
spark.conf.set("spark.openlineage.transport.type", "datahub")
spark.conf.set("spark.openlineage.transport.url", "http://datahub-gms:8080")
```
The OpenLineage Spark Listener (attached to SparkContext) intercepts the SparkListenerJobEnd event. It extracts input/output datasets, schema (from Iceberg's current snapshot schema), and job metadata, then emits an OpenLineage RunEvent (JSON) to the DataHub Kafka transport. This is a non-blocking async call — the Spark job does not wait for DataHub acknowledgement.
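The emitted RunEvent is a small JSON document. The sketch below shows its minimal shape; the top-level field names follow the OpenLineage spec, while the dataset names, job name, and producer URL are illustrative placeholders.

```python
import datetime
import json

# Minimal OpenLineage RunEvent shape (illustrative values).
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "producer": "https://example.com/spark-agent",          # placeholder
    "run": {"runId": "00000000-0000-0000-0000-000000000000"},
    "job": {"namespace": "spark", "name": "load_transactions"},
    "inputs":  [{"namespace": "hdfs://namenode", "name": "warehouse.staging_tx"}],
    "outputs": [{"namespace": "hdfs://namenode", "name": "warehouse.transactions"}],
}

# Serialised and handed to the DataHub transport asynchronously —
# the Spark job never blocks on this call.
payload = json.dumps(event)
```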
The DataHub Kafka transport converts the OpenLineage RunEvent into one or more MetadataChangeProposals (MCPs). Separate MCPs are generated for: the Dataset entity (schema, partition spec), the DataJob entity (lineage), and the DataJobInputOutput aspect (upstream/downstream linkage). These are published to the MetadataChangeProposal_v1 Kafka topic with key = entity_urn, enabling ordered processing per entity.
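Why keying on `entity_urn` yields per-entity ordering: a fixed key always hashes to the same partition, and Kafka guarantees order within a partition. The sketch below illustrates this; note that Kafka's default partitioner uses murmur2, not the MD5 stand-in used here.

```python
import hashlib

NUM_PARTITIONS = 12  # illustrative partition count for the MCP topic

def partition_for(entity_urn: str) -> int:
    """Deterministically map a message key to a partition (MD5 stand-in for murmur2)."""
    digest = hashlib.md5(entity_urn.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

urn = "urn:li:dataset:(urn:li:dataPlatform:iceberg,warehouse.transactions,PROD)"

# Every MCP for the same table lands on the same partition, so the
# consumer processes that table's changes strictly in commit order.
assert partition_for(urn) == partition_for(urn)
```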
The MCE Consumer (Kafka consumer group) reads MCPs and submits them to GMS via the internal REST API. GMS validates the proposal against the registered entity registry, checks authorization (does the emitter have EDIT_ENTITY_SCHEMA privileges?), writes the aspect as a new row to MySQL/PostgreSQL, and publishes a MetadataAuditEvent (MAE) to the MetadataAuditEvent_v4 Kafka topic confirming the commit. End-to-end latency from Spark job completion to MAE publication: typically 15–45 seconds in production at scale.
The MAE Consumer reads the MetadataAuditEvent and performs two actions: (a) updates the denormalised Elasticsearch document for the dataset entity — reflecting new schema fields, updated statistics, and lineage graph edges; (b) updates the graph index (stored as Elasticsearch edges) for column-level lineage. After this step, the DataHub UI returns the new schema in search results and the lineage panel shows the latest Spark job as an upstream producer.
Concurrently, the Actions Framework subscribes to the MAE stream. If the new schema contains a column matching a PII pattern (e.g., email, ssn, credit_card), the configured action automatically: (a) applies the PII glossary term to the column, (b) emits a Ranger sync event to propagate the masking policy to the Iceberg table, (c) sends a Slack notification to the data steward for review confirmation. This entire governance loop completes within 60–90 seconds of the original Iceberg commit.
At PB scale with 100+ concurrent Spark jobs, the critical path is: Iceberg commit → OpenLineage event (async, ~1s) → Kafka publish (async, ~500ms) → GMS persist (sync within consumer, ~2–5s) → MAE publish → ES index update (~5–15s). Total: ~10–60 seconds to metadata visibility depending on Kafka consumer lag and ES indexing throughput. This is orders of magnitude faster than pull-based crawlers (which operate on hourly/daily schedules).
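The stage figures above can be sanity-checked with simple arithmetic; the gap between this base budget and the quoted ~60-second upper bound is the variable tail from Kafka consumer lag and ES indexing throughput.

```python
# Stage latency ranges from the critical path above, in seconds (min, max).
stages = {
    "openlineage_event": (1.0, 1.0),
    "kafka_publish":     (0.5, 0.5),
    "gms_persist":       (2.0, 5.0),
    "es_index":          (5.0, 15.0),
}

lo = sum(a for a, _ in stages.values())   # best-case pipeline latency
hi = sum(b for _, b in stages.values())   # worst-case before consumer lag
# ≈ 8.5–21.5 s base budget; consumer lag stretches this toward ~60 s.
```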
Iceberg's append-only metadata design creates a "metadata explosion" problem: every commit generates new manifest files, and tables with frequent small writes accumulate thousands of snapshot entries. Without mitigation, the DataHub catalog ingests millions of redundant metadata events per day.
Each Spark INSERT/MERGE creates: 1 new Snapshot entry, 1 new Manifest List (snap-*.avro), N new Manifest files (*.avro) — one per output partition. A 1,000-partition table with hourly writes produces ~24,000 manifest files/day. Without snapshot expiry, the metadata.json file grows unboundedly.
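The manifest-accumulation figure follows directly from the per-commit accounting above:

```python
# Each hourly commit appends one manifest file per output partition.
partitions_per_commit = 1_000
commits_per_day = 24

manifest_files_per_day = partitions_per_commit * commits_per_day  # 24,000
# Plus 24 manifest-list files and 24 snapshot entries per day.
```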
The Iceberg Source plugin does not ingest individual manifest files — it reads the current snapshot's schema, statistics, and partition spec as a single DataHub Dataset aspect update. This compresses the explosion: DataHub sees one metadata change per table per commit, not one per manifest file.
By keying MCP Kafka messages on entity_urn (the table URN), Kafka log compaction retains only the latest MCP per table. Back-pressure from rapid successive commits is absorbed by the compacted log — GMS processes the final state, not every intermediate snapshot.
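A minimal model of log compaction on a URN-keyed topic — after compaction, only the latest record per key survives, so GMS replays final state rather than every intermediate snapshot:

```python
# Simulated compacted log: (key, value) records in arrival order.
log = [
    ("urn:li:dataset:(iceberg,warehouse.transactions,PROD)", {"snapshot": 101}),
    ("urn:li:dataset:(iceberg,warehouse.accounts,PROD)",     {"snapshot": 7}),
    ("urn:li:dataset:(iceberg,warehouse.transactions,PROD)", {"snapshot": 102}),
    ("urn:li:dataset:(iceberg,warehouse.transactions,PROD)", {"snapshot": 103}),
]

compacted = {}
for key, value in log:        # later records overwrite earlier ones per key
    compacted[key] = value

# Three rapid commits to warehouse.transactions collapse to one record.
```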
Mitigation strategies:

- **Snapshot expiry:** set `history.expire.min-snapshots-to-keep` and `history.expire.max-snapshot-age-ms` at the Iceberg level. Expire old snapshots before they're ingested — DataHub only sees the active snapshot tree. Run `ExpireSnapshots` as a daily Spark job.
- **Manifest rewriting:** use the `RewriteManifests` action to consolidate thousands of small manifests into fewer large ones. This reduces Iceberg metadata overhead and accelerates DataHub's snapshot read during ingestion.
- **Stateful ingestion:** set `stateful_ingestion.enabled: true` with a state store (S3 or local) to checkpoint which tables have changed since the last run.
- **Partition filtering:** use a `partition_patterns` filter to exclude dynamically-generated temp partitions.
- **Consumer-lag monitoring:** watch `MetadataChangeProposal_v1` consumer group lag. A growing lag indicates GMS or ES is falling behind — scale GMS pods or ES indexing threads before the lag cascades into stale metadata.

```yaml
# datahub-iceberg-recipe.yml — PB-scale production settings
source:
  type: iceberg
  config:
    catalog:
      name: prod_warehouse
      type: rest
      uri: http://iceberg-rest-catalog:8181
    ## Stateful ingestion — only emit changed tables
    stateful_ingestion:
      enabled: true
      state_provider:
        type: datahub
      remove_stale_metadata: true
    ## Partition handling for 10M+ partition tables
    include_table_stats: true
    include_partition_stats: false       # Disable per-partition values
    max_partitions_for_statistics: 1000  # Sample top-N only
    include_snapshots: true
    max_snapshot_history: 10             # Last N snapshots only
    ## Parallelism for large warehouses
    table_parallelism: 32
    schema_parallelism: 16
sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: kafka-prod-1:9092,kafka-prod-2:9092,kafka-prod-3:9092
      producer_config:
        compression.type: snappy
        batch.size: 65536
        linger.ms: 50
```
At PB scale, security governance cannot be manual. The catalog must enforce policies automatically as new data arrives, and those policies must propagate to the query engines enforcing access at runtime.
DataHub implements a two-tier authorization model. Roles (Admin, Editor, Viewer) define coarse-grained capability sets. Policies provide fine-grained attribute-based control: rules like "only members of the pii-stewards group can view datasets tagged PII:HIGH" or "domain owners can edit glossary terms on their domain's datasets". Policies are evaluated in real-time by GMS on every API request — no caching means policy changes take effect immediately.
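The attribute-based shape of such a policy decision can be sketched directly from the quoted rule. This is a hypothetical illustration of the decision logic, not DataHub's policy engine, which evaluates structured policies inside GMS.

```python
def can_view(user_groups: set, dataset_tags: set) -> bool:
    """Policy from the text: only pii-stewards may view datasets tagged PII:HIGH."""
    if "PII:HIGH" in dataset_tags:
        return "pii-stewards" in user_groups
    return True  # untagged datasets fall through to coarse-grained role checks

# A steward sees the sensitive dataset; an analyst does not.
assert can_view({"pii-stewards"}, {"PII:HIGH"})
assert not can_view({"analysts"}, {"PII:HIGH"})
```

Because GMS evaluates these rules on every request without caching, flipping a policy changes the answer on the very next API call.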
DataHub ships a Ranger Authoriser Plugin that externalises DataHub's authorization decisions to Apache Ranger. When configured, GMS calls the Ranger Policy Engine (via REST) for every metadata access check. This allows a single Ranger policy admin to control access to both data (HDFS, Hive, Iceberg tables via Ranger's Iceberg plugin) and metadata (DataHub catalog entries) from one place.
DataHub's AI-powered classification engine scans column names and (optionally) data samples against a configurable dictionary of PII patterns. When a PII column is detected, the Actions Framework automatically: (1) applies the Classified: PII glossary term; (2) triggers a governance workflow requiring steward approval; (3) emits a Ranger sync event to apply data masking. Apache Atlas's classification approach — PII, SENSITIVE, EXPIRES_ON tags with lineage propagation — is arguably more mature for Hadoop-native environments.
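A dictionary-based column-name scan in the spirit described above can be sketched as follows. The pattern list is an assumption for illustration, not DataHub's shipped dictionary; production classification can also sample data values.

```python
import re

# Hypothetical PII pattern dictionary (illustrative, not DataHub's).
PII_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (
    r"email",
    r"(^|_)ssn($|_)",
    r"credit[_-]?card",
)]

def classify(column_name: str) -> bool:
    """Return True if the column name matches any configured PII pattern."""
    return any(p.search(column_name) for p in PII_PATTERNS)

# Matching columns would trigger the Actions Framework governance workflow.
assert classify("customer_email")
assert classify("credit_card_number")
assert not classify("order_id")
```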
Atlas's native classification system is deeply integrated with Ranger's Tag-Based Policy (TBP) engine. When Atlas classifies a Hive/Iceberg column as PII, Ranger's tag-sync daemon (TagSync) propagates that classification to Ranger's policy engine — which then enforces masking on all query engines (Hive, Spark, Trino) without any manual policy update. This classify once, enforce everywhere pattern is Atlas's strongest differentiator for regulatory compliance in Hadoop environments.
When a data steward classifies transactions.credit_card_number as PII:HIGH in DataHub, Ranger's TagSync daemon detects the tag change (via DataHub's Ranger plugin webhook), updates the Ranger policy store, and within seconds all Spark and Trino queries against that column return XXXX-XXXX-XXXX-[last4] without any per-engine policy change.
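The masking transform itself is simple; a sketch of the format quoted above is shown below. The function name is illustrative — in production this masking runs inside the query engines via Ranger's policies, never in application code.

```python
def mask_card_number(card: str) -> str:
    """Mask all but the last four digits, per the XXXX-XXXX-XXXX-[last4] format."""
    digits = "".join(ch for ch in card if ch.isdigit())
    return f"XXXX-XXXX-XXXX-{digits[-4:]}"

masked = mask_card_number("4111-1111-1111-1234")  # → "XXXX-XXXX-XXXX-1234"
```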
The catalog is only valuable if query engines — Spark, Trino, Apache Doris — use it as the authoritative source of schema, partition information, and access policies. This requires tight, bidirectional integration.
Configure Spark with the DataHub Iceberg REST Catalog as the Iceberg catalog implementation. Spark reads the current table metadata (schema, partition spec, snapshot) directly from DataHub's catalog service — ensuring every Spark session sees the authoritative current state, not a cached or stale Hive Metastore entry.
```properties
# spark-defaults.conf
spark.sql.catalog.dh=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.dh.type=rest
spark.sql.catalog.dh.uri=http://datahub:8181/iceberg
spark.sql.catalog.dh.warehouse=hdfs:///warehouse
```
Trino's Iceberg connector supports the REST Catalog protocol natively. Pointing Trino at DataHub's Iceberg REST endpoint means every SHOW SCHEMAS, DESCRIBE TABLE, and query planning call reads the current Iceberg metadata from DataHub — not from a potentially-stale Hive Metastore. OpenLineage's Trino event listener pushes lineage after each query.
```properties
# trino/etc/catalog/iceberg.properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://datahub:8181/iceberg
iceberg.rest-catalog.warehouse=hdfs:///warehouse
iceberg.rest-catalog.security=OAUTH2
```
Apache Doris (v2.x+) supports Iceberg external catalogs via its Multi-Catalog feature. Configured with the Iceberg REST Catalog URI, Doris reads partition metadata and snapshot info from DataHub's catalog service at query planning time. DataHub ingests Doris query logs via the datahub-rest lineage API to track read lineage from Doris analytical queries.
```sql
-- Doris: create an Iceberg catalog
CREATE CATALOG iceberg_dh PROPERTIES (
  "type" = "iceberg",
  "iceberg.catalog.type" = "rest",
  "uri" = "http://datahub:8181/iceberg",
  "warehouse" = "hdfs:///warehouse"
);
```
The canonical pattern for guaranteeing that all query engines see the same current Iceberg table state is to route all catalog reads through DataHub's Iceberg REST Catalog service, rather than allowing each engine to maintain its own Hive Metastore connection.
The Iceberg REST Catalog specification (introduced in Iceberg 0.14) defines a standard HTTP API for catalog operations. DataHub's implementation of this spec means any Iceberg-compatible engine can use DataHub as its catalog without engine-specific integration. This is the critical architectural shift — DataHub moves from a passive observer to an active participant in the Iceberg table lifecycle.
When Spark commits an Iceberg snapshot through DataHub's REST Catalog, DataHub atomically updates its metadata store and makes the new snapshot visible to all catalog clients. There is no window where Trino could see an old schema while Spark has already committed a new one — the catalog is the single serialisation point.
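The serialisation guarantee rests on optimistic compare-and-swap of the snapshot pointer: a commit succeeds only if the writer's expected snapshot matches the current one. The toy model below illustrates the property; the class and method names are illustrative, not the REST Catalog API.

```python
import threading

class RestCatalogTable:
    """Toy model of a catalog-managed table with CAS-style snapshot commits."""

    def __init__(self):
        self._lock = threading.Lock()
        self.snapshot_id = 0

    def commit(self, expected_snapshot: int, new_snapshot: int) -> bool:
        with self._lock:                      # the single serialisation point
            if self.snapshot_id != expected_snapshot:
                return False                  # a concurrent writer won; caller retries
            self.snapshot_id = new_snapshot
            return True

table = RestCatalogTable()
assert table.commit(expected_snapshot=0, new_snapshot=1)      # Spark commits
assert not table.commit(expected_snapshot=0, new_snapshot=2)  # stale writer rejected
```

Readers (Trino, Doris) always observe either the old snapshot or the new one — never a half-applied state.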
The DataHub docs cover the `iceberg` ingestion plugin in detail, including catalog types, stateful ingestion, and partition handling: docs.datahub.com/docs/generated/ingestion/sources/iceberg

Recommendation: Deploy DataHub v1.x (open-source) as the primary Data Catalog and Iceberg REST Catalog server. Integrate Apache Ranger as the enforcement plane via DataHub's Ranger Authoriser plugin and TagSync. Use OpenLineage listeners on all Spark and Trino clusters for near-real-time lineage. Implement stateful ingestion recipes with partition sampling and snapshot expiry to contain the metadata explosion problem. Evaluate Apache Polaris (incubating) in 12–18 months as a lightweight Iceberg-native metastore complement to DataHub's governance layer.