Open Source · Lakehouse Architecture · v1.0
Brand Data Platform
Architecture
Full-stack open-source lakehouse · 10PB scale · ACID compliant · Real-time + Batch · High Availability
Designed by: Data Architecture Team
⬡ 10 PB Storage
✦ ACID / Lakehouse
⚡ Batch + Streaming
🔐 AuthN / AuthZ
♾ High Availability
🔍 Direct Query
// 01 — Requirements
Platform Criteria Mapping
💾
10 PB Data Volume
Petabyte-scale object storage with horizontal scalability and erasure coding.
MinIO · HDFS
📄
Text + Mixed Data Types
Structured, semi-structured, unstructured text. JSON, Parquet, ORC, CSV, plain text, logs.
Iceberg · MinIO
🔄
ACID Transactions
Row-level ACID, snapshot isolation, time travel, schema evolution via open table format.
Apache Iceberg
⚡
Batch + Streaming
Unified processing for both bulk batch ETL and real-time event stream processing.
Spark · Flink
🔍
Direct Query on Data
Query-in-place on object storage — no ETL to a separate RDBMS required.
Trino · Druid
🔐
AuthN + AuthZ
Centralized identity via OAuth2/OIDC. Column/row-level data access control.
Keycloak · Apache Ranger
🏗️
Lakehouse Architecture
Medallion (Bronze→Silver→Gold) layers on object storage with open table format.
Iceberg · Nessie
🌡️
High Availability
Multi-node clusters, replication factor ≥ 3, no single point of failure, auto-failover.
K8s · MinIO Erasure
// 02 — Platform Layers
7-Layer Data Platform Architecture
🗄️
RDBMS
PostgreSQL, MySQL, Oracle — transactional structured data
Structured · CDC
📦
NoSQL / Document
MongoDB, Cassandra, Elasticsearch — flexible schemas
Semi-structured
📡
Event / IoT Streams
Sensors, clickstream, app events — high-frequency data
Real-time · JSON
📁
Files / Object Store
CSV, JSON, XML, Parquet, PDF, text documents
Unstructured
🔗
REST APIs / SaaS
CRM, ERP, marketing platforms via HTTP connectors
Batch Pull
📋
Log Systems
Application logs, audit logs, system metrics (syslog, fluentd)
Text · High-vol
▼ Ingestion Pipeline ▼
🔀
Apache Kafka
Distributed event streaming bus. Message broker for real-time event ingestion at scale
Streaming · 10M+ msg/s · HA
🔌
Kafka Connect
Connector framework for JDBC, S3, HDFS, Elasticsearch, and 200+ source/sink connectors
Source/Sink · Scalable
🔄
Debezium
Change Data Capture (CDC) — captures row-level changes from RDBMS in real-time
CDC · MySQL/PG
🌊
Apache NiFi
Visual dataflow orchestration for batch file ingestion — drag-drop pipeline builder
Batch · Visual ETL
🔗
Airbyte
Open-source ELT platform — 300+ pre-built connectors for SaaS and cloud sources
ELT · 300+ conn.
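As a sketch, a Debezium MySQL source is typically registered by POSTing a connector config to the Kafka Connect REST API. Hostnames, credentials, topic prefix, and table lists below are illustrative placeholders, and Debezium 2.x property names are assumed:

```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.internal",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "${file:/secrets/mysql.properties:password}",
    "database.server.id": "5401",
    "topic.prefix": "erp",
    "table.include.list": "erp.orders,erp.customers",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-history.erp"
  }
}
```

POSTed to the Connect REST endpoint, this streams MySQL binlog changes into Kafka topics under the `erp.` prefix, ready for a downstream Iceberg sink.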
▼ Persisted to Object Storage ▼
💾
MinIO
S3-compatible distributed object storage. Erasure coding for 10PB+ scale. Multi-tenant. On-premise or cloud
10PB+ · S3 API · Erasure · HA
🧊
Apache Iceberg
Open table format: ACID transactions, snapshot isolation, time travel, hidden partitioning, schema evolution
ACID · Time Travel · Schema Evo
📚
Project Nessie
Git-like transactional catalog for Iceberg. Branch, commit, merge — multi-table transactions
Catalog · Git-like
🗂️
Apache Parquet / ORC
Columnar storage formats — high compression, predicate pushdown, efficient analytics I/O
Columnar · Compressed
🥉 Bronze Layer
Raw ingested data — immutable, schema-on-read, full fidelity. Never deleted.
🥈 Silver Layer
Cleansed, validated, deduplicated, typed. Schema enforced. Business entities formed.
🥇 Gold Layer
Aggregated, business-ready data marts. Optimized for BI, ML, and analytics.
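The Silver-layer contract — deduplication, type enforcement, dropping invalid rows — can be sketched in plain Python, independent of the Spark or Flink job that would implement it in production. The record shape here is hypothetical:

```python
# Illustrative Bronze→Silver cleansing step (assumed record shape);
# in production this logic runs as a Spark/Flink transform over Iceberg tables.
def bronze_to_silver(records):
    """Deduplicate by event id, enforce types, drop invalid rows."""
    silver = {}
    for rec in records:
        try:
            event_id = str(rec["id"])
            row = {
                "id": event_id,
                "amount": float(rec["amount"]),   # enforce numeric type
                "ts": str(rec["ts"]),             # timestamps kept as ISO strings
            }
        except (KeyError, TypeError, ValueError):
            continue                              # invalid rows never reach Silver
        silver[event_id] = row                    # last write wins on duplicates
    return list(silver.values())

bronze = [
    {"id": 1, "amount": "9.99",  "ts": "2024-01-01T00:00:00Z"},
    {"id": 1, "amount": "19.99", "ts": "2024-01-01T00:05:00Z"},  # duplicate id
    {"id": 2, "amount": "bad"},                                  # invalid -> dropped
]
print(bronze_to_silver(bronze))
```

The "last write wins" choice mirrors CDC upsert semantics; a real pipeline would resolve duplicates by event timestamp or binlog position instead.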
▼ Processing + Transformation ▼
✨
Apache Spark
Distributed batch processing engine. Spark SQL, DataFrame API. Bronze→Silver→Gold transforms
Batch ETL · ML · SQL
🌊
Apache Flink
Real-time stateful stream processing. Exactly-once semantics, event time, windowing, CEP
Stream · Exactly-once · Low-latency
🎯
Apache Airflow
Workflow orchestration via DAGs. Schedule, monitor, retry batch pipelines. 1000+ operators
Orchestration · DAG · Scheduler
📓
Apache Zeppelin / Jupyter
Interactive notebooks for data exploration, ad-hoc analysis, and ML experimentation
Notebook · EDA
▼ Query Engines ▼
⚡
Apache Trino
Federated SQL query engine — query Iceberg, Hive, RDBMS, Kafka from a single SQL interface. Sub-second to minutes
Federated · ANSI SQL · Direct Query
🚀
Apache Druid
Real-time OLAP database — sub-second queries on streaming + historical data. Power dashboards
Real-time OLAP · Sub-second · TimeSeries
💻
Spark SQL
SQL interface over Spark cluster — complex batch SQL transformations on Iceberg tables
Batch SQL · Complex Joins
📊
Hive Metastore
Metadata registry for tables, schemas, partitions — used by Trino, Spark, Flink as catalog backend
Metadata · Schema Reg
▼ Governance + Security ▼
🗝️
Keycloak
Identity & access management. OAuth2 / OIDC / SAML. SSO for all platform services. MFA support
AuthN · OAuth2 · SSO
🛡️
Apache Ranger
Fine-grained row/column/table ACL policies. Audit logging. Integrates with Trino, Spark, HDFS, Kafka
AuthZ · Row/Col ACL · Audit
🗺️
Apache Atlas
Data catalog, metadata management, data lineage tracking, classification (PII, sensitive)
Catalog · Lineage · Classification
🔏
MinIO IAM Policies
S3-compatible bucket-level and prefix-level access policies. Integrated with Keycloak via OIDC
S3 Policy · Bucket ACL
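A minimal sketch of a MinIO access policy, written in the AWS S3 policy grammar that MinIO accepts; the bucket name `lakehouse-silver` is a hypothetical placeholder for a Silver-layer bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::lakehouse-silver",
        "arn:aws:s3:::lakehouse-silver/*"
      ]
    }
  ]
}
```

Attached to a group mapped from a Keycloak OIDC claim, a read-only policy like this lets analysts query Silver data through Trino while writes stay restricted to the pipeline service accounts.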
▼ Business Intelligence & Consumption ▼
📈
Apache Superset
Modern BI platform — 40+ chart types, dashboards, SQL Lab, role-based access. Connects via Trino/Druid
BI · Dashboards · SQL Lab
📉
Metabase
Business-friendly self-service analytics — no-code question builder for non-technical users
Self-service · No-code
🤖
MLflow
ML experiment tracking, model registry, and serving — connects to Gold layer feature store
MLOps · Model Reg
🔌
REST / JDBC API
Trino JDBC/REST API exposes Gold layer data to external apps, microservices, data science tools
Integration · JDBC
// 03 — Data Flow
End-to-End Data Flow
⬛ Batch Path: NiFi / Airbyte (batch ingest) → … → Iceberg Tables (Bronze/Silver)
⚡ Streaming Path: … → Apache Druid (real-time OLAP)
// 04 — Component Deep Dive
Technology Decision Rationale
Why MinIO?
Mature open-source object store with first-class S3 API compatibility at 10PB+ scale. Erasure coding provides fault tolerance with configurable parity. No Hadoop dependency.
Sizing
10PB across 16+ nodes, erasure set EC:8+4 (12 drives per set). JBOD hardware recommended. Throughput: 10+ GB/s per cluster.
HA Configuration
MinIO Operator on Kubernetes. Multi-site active-active replication. Load balancer with health checks.
S3 Compatible · Erasure Code · Multi-tenant · K8s Operator
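The erasure-coding overhead can be sanity-checked with simple arithmetic. The node and drive counts below mirror the section 06 sizing table and are illustrative:

```python
# Back-of-envelope MinIO capacity math (illustrative: 16 nodes × 16 drives
# × 6 TB, erasure code EC:8+4 as in the sizing table).
def usable_tb(nodes, drives_per_node, drive_tb, data_shards, parity_shards):
    raw = nodes * drives_per_node * drive_tb
    # EC:8+4 stores 8 data + 4 parity shards per set, so usable = raw × 8/12
    return raw * data_shards / (data_shards + parity_shards)

raw_tb = 16 * 16 * 6                  # 1536 TB ≈ 1.5 PB raw per cluster
print(usable_tb(16, 16, 6, 8, 4))     # 1024.0 TB ≈ 1 PB usable
```

Reaching 10 PB usable therefore needs roughly ten times this drive count, added by scaling node count and/or drive sizes beyond the 16-node minimum.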
Why Apache Iceberg?
ACID Guarantees
Snapshot isolation ensures readers are never blocked by writers. Atomic commits. Concurrent writes via optimistic concurrency control.
Time Travel
Query any historical snapshot, e.g. in Trino: SELECT * FROM orders FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC'. Auditing and point-in-time recovery built in.
Hidden Partitioning
No partition columns needed in queries — Iceberg prunes files via table metadata, so users never have to know the partition scheme.
ACID · Time Travel · Schema Evo · Compaction
Why Apache Kafka?
Role in Architecture
Central nervous system of real-time data. Acts as buffer between source systems and processing engines. Decouples producers from consumers.
Throughput
10M+ messages/second per cluster. Configurable retention (hours to months). Replication factor = 3 for HA. KRaft mode (no ZooKeeper).
Kafka Connect + Debezium
CDC from RDBMS (MySQL binlog, PG WAL). Sink to MinIO/Iceberg. Schema Registry for Avro/JSON schema management.
KRaft · Schema Registry · Debezium CDC · KSQL
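Broker disk sizing follows directly from message rate × message size × retention × replication. A hedged back-of-envelope with illustrative rates:

```python
# Rough Kafka broker-storage estimate for a retention window
# (illustrative numbers; tune message size and retention to your workload).
def retention_bytes(msgs_per_sec, avg_msg_bytes, retention_hours, replication):
    return msgs_per_sec * avg_msg_bytes * retention_hours * 3600 * replication

# 10M msg/s × 1 KiB × 24 h retention × replication factor 3
total = retention_bytes(10_000_000, 1024, 24, 3)
print(total / 1e15)  # ≈ 2.65 PB for a single day of retention
```

At headline rates, even one day of retention dwarfs the 6-broker × 12 TB NVMe tier in section 06 — which is why long-term history is offloaded to MinIO/Iceberg and Kafka retention stays short.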
Why Flink over Spark Streaming?
True event-time processing with watermarks. Millisecond latency. Native stateful operators. Checkpointing for fault tolerance without micro-batch lag.
Iceberg Sink
Flink 1.16+ native Iceberg sink with exactly-once semantics. Writes directly to Bronze/Silver Iceberg tables in MinIO. Configurable compaction.
Use Cases
Real-time aggregations, fraud detection, sessionization, joins of streams, enrichment lookups against Iceberg tables.
Exactly-once · CEP · Windowing · Flink SQL
Why Trino?
Direct Query on Storage
Query Iceberg tables on MinIO directly — zero data movement. MPP engine with dynamic partition pruning. Predicate pushdown to storage.
Federated Queries
Single SQL to JOIN across Iceberg, PostgreSQL, Kafka, Elasticsearch, MongoDB simultaneously. No ETL pre-work needed.
Cluster Sizing
1 coordinator + N workers (horizontal scale). Workers sized at 128–256GB RAM. Fault-tolerant execution with spill to disk.
ANSI SQL · Iceberg Conn. · MPP · Fault Tolerant
Keycloak (AuthN)
Central IdP: OAuth2/OIDC tokens issued to all services (Trino, Kafka, Superset, MinIO). SSO via LDAP/AD integration. MFA enforced.
Apache Ranger (AuthZ)
Fine-grained policies: which user/group can SELECT which table/column. Row-level filters. Masks PII columns. Full audit trail.
Apache Atlas (Governance)
Data lineage from ingestion → Bronze → Silver → Gold → Dashboard. Auto-classify PII, sensitive fields. Data dictionary.
OAuth2/OIDC · Row/Col ACL · PII Masking · Audit Log
// 05 — Technology Matrix
Open-Source Stack Decision Matrix
| Domain | Selected Tool | Category | ACID | Scale | HA | License |
|---|---|---|---|---|---|---|
| Object Storage | MinIO | Storage | ✓ | 10PB+ | Active-Active | AGPL-3 |
| Table Format | Apache Iceberg | Lakehouse | Full ACID | Exabytes | Snapshot | Apache 2 |
| Data Catalog | Project Nessie | Catalog | Multi-table | ✓ | ✓ | Apache 2 |
| Event Streaming | Apache Kafka | Messaging | At-least-once | 10M msg/s | Replication 3x | Apache 2 |
| CDC | Debezium | Ingestion | Exactly-once | ✓ | ✓ | Apache 2 |
| Batch Ingest | Apache NiFi / Airbyte | ELT/ETL | Partial | ✓ | Cluster | Apache 2 |
| Batch Processing | Apache Spark | Compute | via Iceberg | PB-scale | YARN/K8s | Apache 2 |
| Stream Processing | Apache Flink | Compute | Exactly-once | ✓ | Checkpoint | Apache 2 |
| Orchestration | Apache Airflow | Workflow | — | Celery/K8s | HA Mode | Apache 2 |
| Interactive SQL | Apache Trino | Query | via Iceberg | PB-scale | Fault-tolerant | Apache 2 |
| Real-time OLAP | Apache Druid | Query | Eventual | Streaming | Replicated | Apache 2 |
| Authentication | Keycloak | Security | — | Cluster | Active-Passive | Apache 2 |
| Authorization | Apache Ranger | Security | — | ✓ | HA | Apache 2 |
| Data Governance | Apache Atlas | Governance | — | ✓ | ✓ | Apache 2 |
| BI / Visualization | Apache Superset | BI | — | Stateless | Multi-replica | Apache 2 |
| MLOps | MLflow | ML Platform | — | ✓ | Basic | Apache 2 |
| Infrastructure | Kubernetes | Orchestration | — | Unlimited | Multi-node | Apache 2 |
// 06 — Infrastructure
Deployment & Sizing Guidelines
| Component | Min Nodes | CPU / Node | RAM / Node | Storage / Node | HA Mode |
|---|---|---|---|---|---|
| MinIO | 16 nodes | 16 cores | 64 GB | 16× HDD 6TB (≈ 1.5PB raw / cluster) | Erasure EC:8+4 |
| Kafka Brokers | 6 nodes | 32 cores | 128 GB | 12 TB NVMe | Replication 3x |
| Spark Workers | 20 nodes | 32 cores | 256 GB | 2 TB NVMe local | YARN / K8s |
| Flink Workers | 10 nodes | 16 cores | 128 GB | 1 TB NVMe | Checkpointing |
| Trino Workers | 10 nodes | 32 cores | 256 GB | 2 TB NVMe (spill) | Fault-tolerant exec |
| Apache Druid | 8 nodes | 16 cores | 128 GB | 4 TB SSD | Tiered segments |
| Keycloak | 3 nodes | 8 cores | 16 GB | 100 GB SSD | Active-Passive |
| Airflow | 3 nodes | 8 cores | 32 GB | 500 GB SSD | Celery / K8s exec |
| Superset | 3 nodes | 8 cores | 16 GB | Stateless | Multi-replica + Redis |
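Summed up, the minimum footprint above works out as follows — pure arithmetic over the table's numbers, which remain planning-level estimates:

```python
# Minimum cluster footprint from the sizing table above (illustrative only).
cluster = {  # component: (nodes, cores_per_node, ram_gb_per_node)
    "minio":    (16, 16,  64),
    "kafka":    ( 6, 32, 128),
    "spark":    (20, 32, 256),
    "flink":    (10, 16, 128),
    "trino":    (10, 32, 256),
    "druid":    ( 8, 16, 128),
    "keycloak": ( 3,  8,  16),
    "airflow":  ( 3,  8,  32),
    "superset": ( 3,  8,  16),
}
nodes  = sum(n for n, _, _ in cluster.values())
cores  = sum(n * c for n, c, _ in cluster.values())
ram_gb = sum(n * r for n, _, r in cluster.values())
print(nodes, cores, ram_gb)  # 79 nodes, 1768 cores, 11968 GB RAM
```

Roughly 79 nodes, ~1.8k cores, and ~12 TB of RAM at the stated minimums — a useful baseline when budgeting racks or Kubernetes capacity before growth headroom is added.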