Principal Architecture Report · 2025

Data Catalog & Metadata Governance at Petabyte Scale

A comparative technical analysis of the top open-source Data Catalog frameworks for PB-scale HDFS/Apache Iceberg environments — covering ingestion architecture, scalability, security, and deep integration patterns.

Contenders Evaluated: DataHub · Atlas · Amundsen
Focus Stack: HDFS · Iceberg · Spark · Trino
Governance Scope: Ranger · RBAC · PII · ABAC
Data Volume Addressed: PB+
Table Partitions / Manifests: 10M+

The Metadata Governance Imperative

Managing petabytes of data across HDFS while migrating aggressively to Apache Iceberg creates a dual challenge: operational metadata volatility (Iceberg's immutable manifest files generate thousands of new metadata artefacts per commit) and semantic metadata complexity (schemas, lineage, ownership, PII tags, and quality SLOs must be discoverable across hundreds of tables and pipelines).

A production-grade Data Catalog must satisfy five simultaneous constraints at PB scale: near-real-time ingestion of metadata events, horizontal scalability of the metadata store, native Iceberg awareness (snapshots, manifest lists, partition specs), fine-grained security integrated with Apache Ranger, and tight coupling with query engines like Spark and Trino so every query sees the authoritative, current schema.

🏆 Top Recommendation

DataHub (v1.x, open-source) emerges as the clear leader for PB-scale HDFS/Iceberg environments due to its stream-first architecture (Kafka-backed MCE/MAE pipeline), native Iceberg REST Catalog integration introduced in v1.0 (Feb 2025), horizontally scalable dual-store (MySQL/Postgres + Elasticsearch), and a maturing Apache Ranger authorisation plugin. For organisations with legacy Hadoop estates requiring classification-driven policy propagation, Apache Atlas is a compelling secondary option.

📦 DataHub — 👑 Recommended
Stream-first · Iceberg REST · Kafka-native

Event-driven, Kafka-backed ingestion with dual-store (RDBMS + ES). Native Iceberg REST Catalog in v1.0 (2025). Highest-velocity OSS community.

🔗 Apache Atlas
Hook-based · JanusGraph · Ranger-native

Deep Hadoop hooks, tag-based lineage propagation, Ranger ABAC integration. Best for heavily regulated Hadoop-native environments.

🔍 Amundsen
Pull-based · Neo4j · Dormant

Lightweight discovery, Google-like UX, Neo4j graph backend. Community activity significantly declined — not recommended for new deployments.

Top 3 Contenders for PB-Scale Environments

Evaluated across the dimensions most critical for PB-scale HDFS/Iceberg deployments.

📥 Ingestion Architecture

Push-vs-Pull fundamentally determines latency and operational coupling. Stream-based systems (DataHub) achieve sub-minute metadata freshness; pull-based systems (Amundsen) rely on scheduled crawls.

⚖️ Metadata Store Scalability

Graph databases (JanusGraph/Neo4j) excel at relationship traversal but struggle with write-heavy, high-partition workloads. Relational+Elasticsearch separates concerns and scales writes independently.

🧊 Iceberg Native Support

True Iceberg-native support means tracking snapshots, manifest lists, partition evolution, and statistics — not just reading schema from Hive Metastore. Only DataHub v1.x achieves this with its REST Catalog.

Dimension | DataHub | Apache Atlas | Amundsen
Ingestion Model | Stream (Kafka MCE) + pull | Hook-based (Hive, Spark, Kafka) + REST API | Pull only (Airflow-scheduled)
Metadata Freshness | Near real-time (<60s via Kafka) | Near real-time (hook-triggered) | Batch (hourly/daily crawl)
Primary Store | MySQL / PostgreSQL (documents) + Elasticsearch (search + graph via MAE) | JanusGraph (BerkeleyDB or HBase back-end) + Solr / Elasticsearch | Neo4j (graph) + Elasticsearch
Graph Backend | Relational + ES graph (no dedicated graph DB requirement) | JanusGraph — deep graph traversal but write bottleneck at scale | Neo4j — strong for discovery, limited at PB partition counts
Iceberg Native | Full REST Catalog (v1.0+) · snapshots · manifest lists · stats | Via Hive hook (partial) — no snapshot-level awareness | Via generic extractor only — schema-level only
Partition Scalability | High — stateless MCE consumers, ES sharding | Medium — JanusGraph write throughput bounded by BerkeleyDB | Low — Neo4j node density degrades at 10M+ partitions
Column Lineage | ✅ End-to-end, SQL-parsed | ✅ Classification propagation via lineage graph | ⚠️ Basic table-level only
Ranger Integration | ✅ Plugin (Ranger authoriser for DataHub policies) | ✅ Native — tag-driven Ranger policies (gold standard) | ❌ None
PII / Classification | ✅ AI-powered auto-classification + Actions Framework | ✅ Classification system (PII, SENSITIVE, EXPIRES_ON) + lineage propagation | ⚠️ Manual tagging only
Deployment Complexity | High — Kafka, MySQL/PG, ES, GMS pods (K8s Helm chart) | High — JanusGraph, Solr/ES, Kafka, ZooKeeper | Low — Neo4j + ES + microservices (easiest)
Community Health (2025) | Very active — 12,500+ Slack members, 10k+ stars, Acryl backing | Active — Apache ASF governance, enterprise adoption | Dormant — no active roadmap (Dec 2025)
OSS License | Apache 2.0 | Apache 2.0 | Apache 2.0

Scalability Deep-Dive: How Metadata Stores Handle Millions of Partitions

DataHub — MySQL + ES

  • Entities stored as JSON documents in MySQL — horizontally scalable with read replicas
  • Elasticsearch sharded across cluster nodes — handles 100M+ entity search
  • Stateless GMS (Generalized Metadata Service) — Kubernetes HPA scales pods independently
  • MCE Kafka consumer groups allow parallel partition ingestion

At 10M Iceberg partitions, DataHub scales by adding Elasticsearch shards/nodes and GMS pods — no schema migrations or graph re-indexing required.

Apache Atlas — JanusGraph

  • JanusGraph with BerkeleyDB back-end: single-host write bottleneck
  • HBase back-end mitigates write throughput but adds operational complexity
  • Solr/ES search index must be kept consistent with JanusGraph — dual-write risk
  • Graph traversal for lineage is extremely fast, but the ingest workload is write-heavy at PB scale

At 10M+ partitions with HBase back-end, Atlas is viable but requires careful region server tuning. BerkeleyDB is a non-starter at this scale.

Amundsen — Neo4j

  • Neo4j Community Edition: single-node limit — not suitable for PB metadata
  • Neo4j Enterprise (causal clustering) is costly and still has write scalability ceiling
  • Graph node density degrades performance beyond ~5–10M nodes in practice
  • No stream ingestion — crawl latency means stale metadata post-Iceberg commit

Not recommended for PB-scale HDFS/Iceberg. Amundsen is architecturally unsuited and the project is effectively dormant.

Internal Architecture — Component Breakdown

DataHub's architecture is purpose-built for distributed, event-driven metadata management. Its separation of concerns between ingestion, storage, search, and serving tiers enables independent scaling at PB volumes.

DataHub Component Architecture

Metadata flows from the source systems through the ingestion and event tiers into the dual store, and out to the serving layer:

  • 🧊 Iceberg — REST Catalog / Spark
  • 🐝 Hive / HDFS — Metastore hook
  • Spark / Trino — OpenLineage listener
  • 🔄 dbt / Airflow — DAG / model lineage
  • 🐍 datahub-ingestion — Python SDK · recipes · plugins
  • 📡 REST emitter / Kafka emitter — MCP · MCE · MCL
  • 📨 Apache Kafka — MetadataChangeProposal · MetadataAuditEvent · FailedMCP (dead-letter)
  • 🧠 GMS (Generalized Metadata Service) — Spring Boot · GraphQL API · REST API
  • ⚙️ MCE Consumer — async Kafka consumer group
  • 📣 MAE Consumer — search index + graph updates
  • 🗄️ MySQL / PostgreSQL — entity documents · aspects · relationships
  • 🔍 Elasticsearch — full-text search · graph index · facets
  • 🗃️ Iceberg REST Catalog — snapshot store · manifest registry
  • 🖥️ DataHub Frontend — React · GraphQL · search UI
  • 📊 Lineage Graph UI — column-level · cross-domain
  • 🤖 Actions Framework — webhook · PII workflow · Slack

Core Components Explained

📨 Kafka Message Bus — The Backbone

Three key topics form the backbone of DataHub's event pipeline. The MetadataChangeProposal (MCP) topic carries proposals to mutate metadata entities (datasets, schemas, ownership, tags). The MetadataAuditEvent (MAE) topic is published after GMS successfully commits a change — downstream consumers (search indexer, graph builder) react to MAE, not MCP. The FailedMCP topic acts as a dead-letter queue for proposals rejected by GMS validation.
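
For illustration, here is a minimal sketch of how a producer constructs and emits an MCP with the DataHub Python SDK (acryl-datahub). The GMS URL and dataset URN are placeholders; the REST emitter is shown for brevity — the Kafka emitter follows the same pattern.

# Sketch: emit a dataset-properties aspect as a MetadataChangeProposal (URLs/URNs assumed)
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")

mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="iceberg", name="warehouse.transactions", env="PROD"),
    aspect=DatasetPropertiesClass(description="Iceberg transactions table"),
)
emitter.emit(mcp)  # GMS validates the proposal, persists the aspect, then publishes an MAE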

🧠 GMS — Generalized Metadata Service

GMS is the authoritative write-and-read service. Built in Spring Boot, it exposes both a GraphQL API (consumed by the React frontend) and a REST API (consumed by programmatic clients and integrations). GMS validates incoming MCPs, writes entity aspects to the relational store, and emits MAEs downstream. It is stateless — any number of GMS pods can run behind a load balancer.
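
As a sketch of that GraphQL surface (endpoint path, token, and URN are assumptions for illustration), a client can read a dataset's current schema straight from GMS:

# Hypothetical example: fetch a dataset's schema fields from GMS via GraphQL
import requests

GMS = "http://datahub-gms:8080"  # assumed GMS address
QUERY = """
query {
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:iceberg,warehouse.transactions,PROD)") {
    name
    schemaMetadata { fields { fieldPath nativeDataType } }
  }
}
"""
resp = requests.post(
    f"{GMS}/api/graphql",
    json={"query": QUERY},
    headers={"Authorization": "Bearer <personal-access-token>"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["data"]["dataset"]["schemaMetadata"]["fields"])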

🗄️ Dual-Store: Relational + Elasticsearch

MySQL/PostgreSQL stores entity aspects — structured JSON documents keyed by entity URN and aspect name. This gives DataHub a strongly consistent, transactional primary store that handles schema evolution trivially. Elasticsearch receives a projected, denormalised copy via the MAE Consumer and provides sub-second full-text search, faceted filtering, and graph-edge queries across hundreds of millions of entities.
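
To make the URN + aspect keying concrete, a rough sketch (Python SDK graph client; the URN is a placeholder) of reading a single aspect back for one entity:

# Sketch: fetch one aspect for one entity URN via GMS (backed by the relational store)
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import DatasetPropertiesClass

graph = DataHubGraph(DatahubClientConfig(server="http://datahub-gms:8080"))
urn = "urn:li:dataset:(urn:li:dataPlatform:iceberg,warehouse.transactions,PROD)"

# Each (urn, aspect name, version) maps to a row in MySQL/PostgreSQL;
# Elasticsearch only holds the denormalised search/graph projection.
props = graph.get_aspect(entity_urn=urn, aspect_type=DatasetPropertiesClass)
print(props.description if props else "aspect not yet ingested")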

🐍 Ingestion Framework

The Python-based ingestion system supports both push-based (Kafka emitter from Spark listeners) and pull-based (scheduled recipes via Airflow). Recipes are YAML-configured pipelines with a Source, Transformer, and Sink pattern. The Iceberg Source plugin reads table metadata via the Iceberg REST Catalog API or directly from the Hive Metastore, extracting schema fields, partition specs, snapshot history, and table statistics.

# DataHub Iceberg ingestion recipe (YAML)
source:
  type: iceberg
  config:
    catalog:
      name: production
      type: rest
      uri:  http://datahub-iceberg-catalog:8181
    user_ownership_property: owner
    table_parallelism: 16
    include_table_stats: true
    include_snapshots:   true
sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: kafka-broker:9092

🖥️ Frontend + Actions Framework

The React frontend communicates exclusively via GraphQL against GMS. The Actions Framework allows DataHub to react to metadata changes programmatically: trigger PII review workflows when a column is classified, send Slack alerts when a deprecated dataset is queried, or propagate glossary terms downstream via lineage — all without custom code.
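
Under the hood, an action is conceptually a consumer on the audit-event stream. The following hand-rolled sketch illustrates the idea with a plain Kafka consumer (broker address and webhook URL are assumptions; real MAE messages are Avro-encoded via the schema registry, and the Actions Framework handles all of this through its own configuration rather than custom code).

# Conceptual sketch only: react to metadata audit events with a plain Kafka consumer
import json
import requests
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka-broker:9092",   # assumed broker
    "group.id": "governance-actions-sketch",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["MetadataAuditEvent_v4"])    # topic name per the pipeline description below

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # JSON assumed here for brevity; production events require Avro deserialisation.
    event = json.loads(msg.value())
    if "schemaMetadata" in json.dumps(event):    # naive trigger: any schema aspect change
        requests.post("https://hooks.slack.com/services/<assumed>",
                      json={"text": f"Schema change detected: {event.get('urn', 'unknown urn')}"})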

From Iceberg Commit to Catalog — Near Real-Time

Trace the exact path a metadata change takes from the moment a Spark job commits an Iceberg snapshot to the moment it appears in the DataHub UI and is enforced by Ranger policies.

Step 1 — Spark Job Commits Iceberg Snapshot

A Spark job writing to warehouse.transactions via spark.sql("INSERT INTO ...") triggers an Iceberg table commit. The commit writes a new metadata.json containing the new snapshot entry, atomically swaps the table's current-metadata pointer, and appends a new manifest list file plus one or more manifest files to HDFS. This is a purely Iceberg-internal operation — no catalog notification has been sent yet.

// Spark session with the OpenLineage → DataHub listener registered.
// spark.extraListeners is a static conf — set it at session build time or via spark-submit --conf.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.extraListeners",
    "io.openlineage.spark.agent.OpenLineageSparkListener")
  .config("spark.openlineage.transport.type", "datahub")
  .config("spark.openlineage.transport.url", "http://datahub-gms:8080")
  .getOrCreate()

Step 2 — OpenLineage Listener Emits Run Events

The OpenLineage Spark Listener (attached to SparkContext) intercepts the SparkListenerJobEnd event. It extracts input/output datasets, schema (from Iceberg's current snapshot schema), and job metadata, then emits an OpenLineage RunEvent (JSON) to the DataHub Kafka transport. This is a non-blocking async call — the Spark job does not wait for DataHub acknowledgement.
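
The RunEvent payload follows the OpenLineage spec's run/job/inputs/outputs structure; a trimmed, illustrative shape (dataset names, UUID, and schema facet values are placeholders):

# Illustrative OpenLineage RunEvent shape (trimmed; values are placeholders)
run_event = {
    "eventType": "COMPLETE",
    "eventTime": "2025-06-01T12:00:05Z",
    "run": {"runId": "0f1e2d3c-4b5a-6789-abcd-ef0123456789"},
    "job": {"namespace": "spark", "name": "warehouse_transactions_insert"},
    "inputs": [{"namespace": "hdfs://nn:8020", "name": "warehouse.raw_events"}],
    "outputs": [{
        "namespace": "hdfs://nn:8020",
        "name": "warehouse.transactions",
        "facets": {"schema": {"fields": [{"name": "credit_card_number", "type": "string"}]}},
    }],
    "producer": "https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark",
}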

Step 3 — MCP Published to Kafka Topic

The DataHub Kafka transport converts the OpenLineage RunEvent into one or more MetadataChangeProposals (MCPs). Separate MCPs are generated for: the Dataset entity (schema, partition spec), the DataJob entity (lineage), and the DataJobInputOutput aspect (upstream/downstream linkage). These are published to the MetadataChangeProposal_v1 Kafka topic with key = entity_urn, enabling ordered processing per entity.

Step 4 — GMS Validates & Persists Aspects

The MCE Consumer (Kafka consumer group) reads MCPs and submits them to GMS via the internal REST API. GMS validates the proposal against the registered entity registry, checks authorization (does the emitter have EDIT_ENTITY_SCHEMA privileges?), writes the aspect as a new row to MySQL/PostgreSQL, and publishes a MetadataAuditEvent (MAE) to the MetadataAuditEvent_v4 Kafka topic confirming the commit. End-to-end latency from Spark job completion to MAE publication: typically 15–45 seconds in production at scale.

Step 5 — MAE Consumer Updates Elasticsearch Index

The MAE Consumer reads the MetadataAuditEvent and performs two actions: (a) updates the denormalised Elasticsearch document for the dataset entity — reflecting new schema fields, updated statistics, and lineage graph edges; (b) updates the graph index (stored as Elasticsearch edges) for column-level lineage. After this step, the DataHub UI returns the new schema in search results and the lineage panel shows the latest Spark job as an upstream producer.

Step 6 — Actions Framework Triggers Governance Workflows

Concurrently, the Actions Framework subscribes to the MAE stream. If the new schema contains a column matching a PII pattern (e.g., email, ssn, credit_card), the configured action automatically: (a) applies the PII glossary term to the column, (b) emits a Ranger sync event to propagate the masking policy to the Iceberg table, (c) sends a Slack notification to the data steward for review confirmation. This entire governance loop completes within 60–90 seconds of the original Iceberg commit.

End-to-End Latency Budget

At PB scale with 100+ concurrent Spark jobs, the critical path is: Iceberg commit → OpenLineage event (async, ~1s) → Kafka publish (async, ~500ms) → GMS persist (sync within consumer, ~2–5s) → MAE publish → ES index update (~5–15s). Total: ~10–60 seconds to metadata visibility depending on Kafka consumer lag and ES indexing throughput. This is orders of magnitude faster than pull-based crawlers (which operate on hourly/daily schedules).

Taming High-Churn Iceberg Metadata at PB Scale

Iceberg's append-only metadata design creates a "metadata explosion" problem: every commit generates new manifest files, and tables with frequent small writes accumulate thousands of snapshot entries. Without mitigation, the DataHub catalog ingests millions of redundant metadata events per day.

The Problem: Iceberg Metadata Lifecycle

What Happens During an Iceberg Commit

Each Spark INSERT/MERGE creates: 1 new Snapshot entry, 1 new Manifest List (snap-*.avro), N new Manifest files (*.avro) — one per output partition. A 1,000-partition table with hourly writes produces ~24,000 manifest files/day. Without snapshot expiry, the metadata.json file grows unboundedly.
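
A quick way to see this growth is to inspect the table's snapshot log directly; a sketch assuming pyiceberg and a REST catalog endpoint (the URI and table name are placeholders):

# Sketch: count retained snapshots/manifest lists on a high-churn table with pyiceberg
from pyiceberg.catalog import load_catalog

catalog = load_catalog("prod", **{"type": "rest", "uri": "http://iceberg-rest-catalog:8181"})
tbl = catalog.load_table("warehouse.transactions")

snapshots = tbl.metadata.snapshots            # one entry (and one manifest list) per commit
print(f"snapshots retained: {len(snapshots)}")
current = tbl.current_snapshot()
if current:
    print(f"current snapshot: {current.snapshot_id}, manifest list: {current.manifest_list}")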

DataHub's Ingestion-Level Mitigation

The Iceberg Source plugin does not ingest individual manifest files — it reads the current snapshot's schema, statistics, and partition spec as a single DataHub Dataset aspect update. This compresses the explosion: DataHub sees one metadata change per table per commit, not one per manifest file.

Snapshot Deduplication via Kafka Key Compaction

By keying MCP Kafka messages on entity_urn (the table URN), log compaction on the topic retains only the latest MCP per table. When consumers fall behind a burst of rapid successive commits, the compacted log absorbs the back-pressure — GMS catches up on the final state rather than replaying every intermediate snapshot.

Operational Strategies

  • Iceberg Snapshot Expiry: Configure history.expire.min-snapshots-to-keep and history.expire.max-snapshot-age-ms at the Iceberg level. Expire old snapshots before they're ingested — DataHub only sees the active snapshot tree. Run ExpireSnapshots as a daily Spark job (a sketch of this maintenance job, together with manifest rewriting, follows this list).
  • Manifest Rewriting (Small-File Compaction): Use Iceberg's RewriteManifests action to consolidate thousands of small manifests into fewer large ones. This reduces Iceberg metadata overhead and accelerates DataHub's snapshot read during ingestion.
  • Selective Aspect Ingestion: Configure the DataHub Iceberg recipe to emit only changed aspects (schema, statistics) rather than re-emitting all aspects on every crawl. Use stateful_ingestion.enabled: true with a state store (S3 or local) to checkpoint which tables have changed since the last run.
  • Partition Statistics Sampling: For tables with millions of partitions, enable partition spec ingestion (the partition column definitions) rather than per-partition value ingestion. Use DataHub's partition_patterns filter to exclude dynamically-generated temp partitions.
  • Kafka Consumer Lag Monitoring: Alert on MetadataChangeProposal_v1 consumer group lag. A growing lag indicates GMS or ES is falling behind — scale GMS pods or ES indexing threads before the lag cascades into stale metadata.
  • Tiered Retention in Elasticsearch: Use ILM (Index Lifecycle Management) in Elasticsearch to roll over old entity history snapshots to cheaper storage tiers (warm/cold), keeping only the current entity state in the hot tier. This prevents ES disk from growing unboundedly with historical aspect versions.
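
A minimal sketch of the daily maintenance job referenced above, invoking Iceberg's Spark stored procedures through PySpark (the catalog name, table name, and cutoff timestamp are placeholders):

# Sketch: daily Iceberg maintenance — expire old snapshots, then compact manifests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Expire snapshots older than the cutoff timestamp, always retaining the last 10
spark.sql("""
  CALL dh.system.expire_snapshots(
    table => 'warehouse.transactions',
    older_than => TIMESTAMP '2025-05-25 00:00:00',
    retain_last => 10)
""")

# Consolidate small manifest files so catalog ingestion reads fewer metadata objects
spark.sql("CALL dh.system.rewrite_manifests(table => 'warehouse.transactions')")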

Stateful Ingestion Configuration (DataHub Recipe)

# datahub-iceberg-recipe.yml — PB-scale production settings
source:
  type: iceberg
  config:
    catalog:
      name: prod_warehouse
      type: rest
      uri:  http://iceberg-rest-catalog:8181
    ## Stateful ingestion — only emit changed tables
    stateful_ingestion:
      enabled: true
      state_provider:
        type: datahub
      remove_stale_metadata: true
    ## Partition handling for 10M+ partition tables
    include_table_stats: true
    include_partition_stats: false      # Disable per-partition values
    max_partitions_for_statistics: 1000  # Sample top-N only
    include_snapshots: true
    max_snapshot_history: 10            # Last N snapshots only
    ## Parallelism for large warehouses
    table_parallelism: 32
    schema_parallelism: 16
sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: kafka-prod-1:9092,kafka-prod-2:9092,kafka-prod-3:9092
      producer_config:
        compression.type: snappy
        batch.size: 65536
        linger.ms: 50

RBAC, ABAC, Ranger Integration & PII Discovery

At PB scale, security governance cannot be manual. The catalog must enforce policies automatically as new data arrives, and those policies must propagate to the query engines enforcing access at runtime.

DataHub Native RBAC / ABAC

DataHub implements a two-tier authorization model. Roles (Admin, Editor, Viewer) define coarse-grained capability sets. Policies provide fine-grained attribute-based control: rules like "only members of the pii-stewards group can view datasets tagged PII:HIGH" or "domain owners can edit glossary terms on their domain's datasets". Policies are evaluated in real-time by GMS on every API request — no caching means policy changes take effect immediately.

  • Role-based templates + attribute-based policy overlay
  • Group membership sourced from LDAP/OIDC (SSO)
  • Resource-level policies: Dataset, Dashboard, DataFlow
  • Privilege-level: VIEW, EDIT, MANAGE, ADMIN per resource type

Apache Ranger Integration

DataHub ships a Ranger Authoriser Plugin that externalises DataHub's authorization decisions to Apache Ranger. When configured, GMS calls the Ranger Policy Engine (via REST) for every metadata access check. This allows a single Ranger policy admin to control access to both data (HDFS, Hive, Iceberg tables via Ranger's Iceberg plugin) and metadata (DataHub catalog entries) from one place.

  • Ranger policy engine evaluates DataHub metadata access
  • Tag-based policies: classify in DataHub → Ranger enforces on data plane
  • Full audit trail in Ranger's Solr/Kafka audit sink
  • Enables unified ABAC across HDFS + Iceberg + DataHub

PII Discovery & Auto-Classification

DataHub's AI-powered classification engine scans column names and (optionally) data samples against a configurable dictionary of PII patterns. When a PII column is detected, the Actions Framework automatically: (1) applies the Classified: PII glossary term; (2) triggers a governance workflow requiring steward approval; (3) emits a Ranger sync event to apply data masking. Apache Atlas's classification approach — PII, SENSITIVE, EXPIRES_ON tags with lineage propagation — is arguably more mature for Hadoop-native environments.

  • Column-name pattern matching (regex configurable)
  • Optional data sampling for value-level classification
  • Automatic lineage propagation of PII tags to downstream tables
  • Ranger masking policy triggered on PII tag assignment
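
A stripped-down sketch of the column-name matching step, applying a PII glossary term to a dataset with the Python emitter (the regexes, URNs, and glossary-term name are illustrative — DataHub's built-in classifier and Actions Framework do this without custom code):

# Sketch: naive PII pattern match on column names, then tag the dataset with a glossary term
import re
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass, GlossaryTermAssociationClass, GlossaryTermsClass)

PII_PATTERNS = re.compile(r"(email|ssn|credit_card|phone)", re.IGNORECASE)  # illustrative
columns = ["txn_id", "amount", "credit_card_number"]

if any(PII_PATTERNS.search(c) for c in columns):
    emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")
    emitter.emit(MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn("iceberg", "warehouse.transactions", "PROD"),
        aspect=GlossaryTermsClass(
            terms=[GlossaryTermAssociationClass(urn="urn:li:glossaryTerm:Classification.PII")],
            auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:datahub"),
        ),
    ))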

Apache Atlas — Classification-Driven Governance

Atlas's native classification system is deeply integrated with Ranger's Tag-Based Policy (TBP) engine. When Atlas classifies a Hive/Iceberg column as PII, Ranger's tag-sync daemon (TagSync) propagates that classification to Ranger's policy engine — which then enforces masking on all query engines (Hive, Spark, Trino) without any manual policy update. This classify once, enforce everywhere pattern is Atlas's strongest differentiator for regulatory compliance in Hadoop environments.

  • Pluggable authoriser: Simple / Ranger / None
  • Tag-Based Policy (TBP) propagation via TagSync
  • Classification inheritance through lineage graph
  • Data masking policies enforced at query engine level via Ranger

How Ranger + DataHub Enforces Access on Iceberg Tables

🧊 Iceberg Table (HDFS · warehouse/db/table) → 🔖 DataHub Catalog (PII tag applied) → 🔄 Ranger TagSync (propagates PII tag) → 🛡️ Ranger Policy Engine (masking · row filter · deny) → Spark / Trino (enforces masking at query time)

End-to-End Policy Flow

When a data steward classifies transactions.credit_card_number as PII:HIGH in DataHub, Ranger's TagSync daemon detects the tag change (via DataHub's Ranger plugin webhook), updates the Ranger policy store, and within seconds all Spark and Trino queries against that column return XXXX-XXXX-XXXX-[last4] without any per-engine policy change.

Ensuring Query Engines Always See the Latest Truth

The catalog is only valuable if query engines — Spark, Trino, Apache Doris — use it as the authoritative source of schema, partition information, and access policies. This requires tight, bidirectional integration.

Apache Spark
Primary Iceberg Write Engine

Catalog Integration

Configure Spark with the DataHub Iceberg REST Catalog as the Iceberg catalog implementation. Spark reads the current table metadata (schema, partition spec, snapshot) directly from DataHub's catalog service — ensuring every Spark session sees the authoritative current state, not a cached or stale Hive Metastore entry.

-- spark-defaults.conf
spark.sql.catalog.dh=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.dh.type=rest
spark.sql.catalog.dh.uri=http://datahub:8181/iceberg
spark.sql.catalog.dh.warehouse=hdfs:///warehouse
  • OpenLineage Spark Listener → DataHub (near real-time lineage)
  • Iceberg REST Catalog ensures atomic schema reads
  • Ranger plugin enforces column masking at scan level
🔵
Trino
Interactive Query Engine

Catalog Integration

Trino's Iceberg connector supports the REST Catalog protocol natively. Pointing Trino at DataHub's Iceberg REST endpoint means every SHOW SCHEMAS, DESCRIBE TABLE, and query planning call reads the current Iceberg metadata from DataHub — not from a potentially-stale Hive Metastore. OpenLineage's Trino event listener pushes lineage after each query.

# trino/etc/catalog/iceberg.properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://datahub:8181/iceberg
iceberg.rest-catalog.warehouse=hdfs:///warehouse
iceberg.rest-catalog.security=OAUTH2
  • REST Catalog protocol — no separate Hive Metastore dependency
  • Snapshot isolation: each query sees a consistent table snapshot
  • Column-level access control via Ranger + Trino system access control
🟢
Apache Doris
Real-Time Analytical Engine

Catalog Integration

Apache Doris (v2.x+) supports Iceberg external catalogs via its Multi-Catalog feature. Configured with the Iceberg REST Catalog URI, Doris reads partition metadata and snapshot info from DataHub's catalog service at query planning time. DataHub ingests Doris query logs via the datahub-rest lineage API to track read lineage from Doris analytical queries.

-- Doris: Create Iceberg Catalog
CREATE CATALOG iceberg_dh
PROPERTIES (
  "type" = "iceberg",
  "iceberg.catalog.type" = "rest",
  "uri" = "http://datahub:8181/iceberg",
  "warehouse" = "hdfs:///warehouse"
);
  • Multi-Catalog reads Iceberg REST for fresh partition info
  • DataHub ingests Doris audit log for read lineage
  • Ranger integration via Doris's built-in Ranger plugin (v2.1+)

Guaranteeing Metadata Consistency — The "Single Source of Truth" Pattern

The canonical pattern for guaranteeing that all query engines see the same current Iceberg table state is to route all catalog reads through DataHub's Iceberg REST Catalog service, rather than allowing each engine to maintain its own Hive Metastore connection.

Why REST Catalog Is the Key

The Iceberg REST Catalog specification (introduced in Iceberg 0.14) defines a standard HTTP API for catalog operations. DataHub's implementation of this spec means any Iceberg-compatible engine can use DataHub as its catalog without engine-specific integration. This is the critical architectural shift — DataHub moves from a passive observer to an active participant in the Iceberg table lifecycle.

Consistency at Commit Time

When Spark commits an Iceberg snapshot through DataHub's REST Catalog, DataHub atomically updates its metadata store and makes the new snapshot visible to all catalog clients. There is no window where Trino could see an old schema while Spark has already committed a new one — the catalog is the single serialisation point.
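
To make "single serialisation point" concrete, a small check (pyiceberg assumed; the URI and table name are placeholders) that two independent catalog clients resolve the same current snapshot immediately after a commit:

# Sketch: two independent REST-catalog clients see the same committed snapshot
from pyiceberg.catalog import load_catalog

props = {"type": "rest", "uri": "http://datahub:8181/iceberg"}
client_a = load_catalog("spark_side", **props)
client_b = load_catalog("trino_side", **props)

snap_a = client_a.load_table("warehouse.transactions").current_snapshot()
snap_b = client_b.load_table("warehouse.transactions").current_snapshot()

# Both reads go through the same catalog service, so the snapshot ids must match —
# there is no HMS-style window where one engine still resolves the old metadata.json.
assert snap_a.snapshot_id == snap_b.snapshot_id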

Integration Checklist

  • Remove direct HMS connections: All engines must read Iceberg metadata via DataHub REST Catalog, not direct Hive Metastore JDBC/Thrift — this eliminates the HMS/Iceberg metadata split-brain problem.
  • Configure Iceberg catalog credential vending: DataHub's REST Catalog supports HDFS credential delegation tokens — Spark/Trino receive short-lived credentials for HDFS data access scoped to the specific tables they are querying.
  • Enable OpenLineage on all engines: Spark, Trino, and Doris should all emit OpenLineage events to DataHub. This builds a complete cross-engine lineage graph — critical for PII propagation and impact analysis.
  • Ranger access control synchronisation: Configure Ranger's Iceberg plugin to read policies from Ranger's policy store (which DataHub's Ranger plugin keeps updated). This ensures access decisions are enforced at the engine level, not just the catalog level.
  • Schema evolution alerts: Use DataHub Actions to alert downstream consumers when an Iceberg schema evolution (column addition/rename/drop) is committed — before their next query fails due to an unexpected schema change.

Sources & Further Reading

[1] DataHub 1.0 Released: Next-Gen Metadata & AI Management — Acryl Data / DataHub Blog, February 2025. Covers the Iceberg REST Catalog beta, v1.6 full support, and roadmap. datahub.com/blog/datahub-1-0-is-here
[2] DataHub Ingestion Framework Architecture — Official DataHub documentation. Describes the MCP/MAE pipeline, Kafka topics, and GMS internals. docs.datahub.com/docs/architecture/metadata-ingestion
[3] DataHub Iceberg Source Documentation — Official DataHub docs for the iceberg ingestion plugin, including catalog types, stateful ingestion, and partition handling. docs.datahub.com/docs/generated/ingestion/sources/iceberg
[4] DataHub Iceberg REST Catalog (Beta) — Documentation for DataHub acting as an Iceberg REST Catalog server. docs.datahub.com/docs/iceberg-catalog
[5] Configuring Apache Ranger Authorization with DataHub — DataHub docs on the Ranger Authoriser plugin for externalising DataHub's access control to Ranger. docs.datahub.com/docs/how/configuring-authorization-with-apache-ranger
[6] What Is Apache Atlas? Architecture, Features & Alternatives (2026) — Atlan guide covering Atlas's RBAC/ABAC, Ranger integration, JanusGraph scalability, and classification-driven governance. atlan.com/what-is-apache-atlas
[7] Apache Atlas Official Documentation — Classification system (PII, SENSITIVE, EXPIRES_ON), Ranger integration, and tag propagation via the lineage graph. atlas.apache.org
[8] Amundsen vs DataHub: How to Choose (2026 Guide) — Atlan analysis documenting Amundsen's dormant state as of December 2025 and recommending DataHub for new deployments. atlan.com/amundsen-vs-datahub
[9] Comprehensive Data Catalog Comparison (2025) — Onehouse analysis comparing Glue, DataHub, Unity, Polaris, and Gravitino across open table format support, connector breadth, and access control. onehouse.ai/blog/comprehensive-data-catalog-comparison
[10] Open-Source Data Governance Frameworks: Strategic Analysis — TheDataGuy, August 2025. Compares OpenMetadata, DataHub, Apache Atlas, and Amundsen across governance, PII, and community dimensions. thedataguy.pro/blog
[11] Apache Ranger — RBAC vs ABAC — Privacera blog explaining Ranger's tag-based (TBAC), role-based (RBAC), and attribute-based (ABAC) policy models, and how Atlas classifications drive Ranger policies. privacera.com/blog
[12] DataHub Data Governance Platform — DataHub Cloud product page documenting RBAC/ABAC policies, the Actions Framework, PII workflow automation, and Ranger integration. datahub.com/products/data-governance
[13] Apache Iceberg Official Documentation — Iceberg REST Catalog specification, snapshot management, manifest lifecycle, and the ExpireSnapshots API. iceberg.apache.org/docs/latest
[14] OpenLineage Specification — Standard metadata format for data lineage across Spark, Trino, and Airflow; the interop layer between compute engines and DataHub. openlineage.io

📌 Architecture Decision Record

Recommendation: Deploy DataHub v1.x (open-source) as the primary Data Catalog and Iceberg REST Catalog server. Integrate Apache Ranger as the enforcement plane via DataHub's Ranger Authoriser plugin and TagSync. Use OpenLineage listeners on all Spark and Trino clusters for near-real-time lineage. Implement stateful ingestion recipes with partition sampling and snapshot expiry for the metadata explosion problem. Evaluate Apache Polaris (incubating) in 12–18 months as a lightweight Iceberg-native metastore complement to DataHub's governance layer.