Open Source · Lakehouse Architecture · v1.0
Brand Data Platform
Architecture
Full-stack open-source lakehouse · 10PB scale · ACID compliant · Real-time + Batch · High Availability
Designed by: Data Architecture Team
⬡ 10 PB Storage
✦ ACID / Lakehouse
⚡ Batch + Streaming
🔐 AuthN / AuthZ
♾ High Availability
🔍 Direct Query
// 01 — Requirements
Platform Criteria Mapping
💾
10 PB Data Volume
Petabyte-scale object storage with horizontal scalability and erasure coding.
MinIO · HDFS
📄
Text + Mixed Data Types
Structured, semi-structured, unstructured text. JSON, Parquet, ORC, CSV, plain text, logs.
Iceberg · MinIO
🔄
ACID Transactions
Row-level ACID, snapshot isolation, time travel, schema evolution via open table format.
Apache Iceberg
⚡
Batch + Streaming
Unified processing for both bulk batch ETL and real-time event stream processing.
Spark · Flink
🔍
Direct Query on Data
Query-in-place on object storage — no ETL to a separate RDBMS required.
Trino · Druid
🔐
AuthN + AuthZ
Centralized identity via OAuth2/OIDC. Column/row-level data access control.
Keycloak · Apache Ranger
🏗️
Lakehouse Architecture
Medallion (Bronze→Silver→Gold) layers on object storage with open table format.
Iceberg · Nessie
🌡️
High Availability
Multi-node clusters, replication factor ≥ 3, no single point of failure, auto-failover.
K8s · MinIO Erasure
// 02 — Platform Layers
7-Layer Data Platform Architecture
🗄️
RDBMS
PostgreSQL, MySQL, Oracle — transactional structured data
Structured · CDC
📦
NoSQL / Document
MongoDB, Cassandra, Elasticsearch — flexible schemas
Semi-structured
📡
Event / IoT Streams
Sensors, clickstream, app events — high-frequency data
Real-time · JSON
📁
Files / Object Store
CSV, JSON, XML, Parquet, PDF, text documents
Unstructured
🔗
REST APIs / SaaS
CRM, ERP, marketing platforms via HTTP connectors
Batch Pull
📋
Log Systems
Application logs, audit logs, system metrics (syslog, fluentd)
Text · High-vol
▼ Ingestion Pipeline ▼
🔀
Apache Kafka
Distributed event streaming bus. Message broker for real-time event ingestion at scale
Streaming · 10M+ msg/s · HA
🔌
Kafka Connect
Connector framework for JDBC, S3, HDFS, Elasticsearch, and 200+ source/sink connectors
Source/Sink · Scalable
🔄
Debezium
Change Data Capture (CDC) — captures row-level changes from RDBMS in real-time
CDC · MySQL/PG
🌊
Apache NiFi
Visual dataflow orchestration for batch file ingestion — drag-drop pipeline builder
Batch · Visual ETL
🔗
Airbyte
Open-source ELT platform — 300+ pre-built connectors for SaaS and cloud sources
ELT · 300+ conn.
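As a sketch, a Debezium MySQL source is typically registered by POSTing a connector config to the Kafka Connect REST API. Hostnames, credentials, topic prefix, and table lists below are illustrative placeholders, and Debezium 2.x property names are assumed:

```json
{
  "name": "orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.internal",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "${file:/secrets/mysql.properties:password}",
    "database.server.id": "5401",
    "topic.prefix": "erp",
    "table.include.list": "erp.orders,erp.customers",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-history.erp"
  }
}
```

POSTed to the Connect REST endpoint, this streams MySQL binlog changes into Kafka topics under the `erp.` prefix, ready for a downstream Iceberg sink.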
▼ Persisted to Object Storage ▼
💾
MinIO
S3-compatible distributed object storage. Erasure coding for 10PB+ scale. Multi-tenant. On-premise or cloud
10PB+ · S3 API · Erasure · HA
🧊
Apache Iceberg
Open table format: ACID transactions, snapshot isolation, time travel, hidden partitioning, schema evolution
ACID · Time Travel · Schema Evo
📚
Project Nessie
Git-like transactional catalog for Iceberg. Branch, commit, merge — multi-table transactions
Catalog · Git-like
🗂️
Apache Parquet / ORC
Columnar storage formats — high compression, predicate pushdown, efficient analytics I/O
Columnar · Compressed
🥉 Bronze Layer
Raw ingested data — immutable, schema-on-read, full fidelity. Never deleted.
🥈 Silver Layer
Cleansed, validated, deduplicated, typed. Schema enforced. Business entities formed.
🥇 Gold Layer
Aggregated, business-ready data marts. Optimized for BI, ML, and analytics.
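The Silver-layer contract — deduplication, type enforcement, dropping invalid rows — can be sketched in plain Python, independent of the Spark or Flink job that would implement it in production. The record shape here is hypothetical:

```python
# Illustrative Bronze→Silver cleansing step (assumed record shape);
# in production this logic runs as a Spark/Flink transform over Iceberg tables.
def bronze_to_silver(records):
    """Deduplicate by event id, enforce types, drop invalid rows."""
    silver = {}
    for rec in records:
        try:
            event_id = str(rec["id"])
            row = {
                "id": event_id,
                "amount": float(rec["amount"]),   # enforce numeric type
                "ts": str(rec["ts"]),             # timestamps kept as ISO strings
            }
        except (KeyError, TypeError, ValueError):
            continue                              # invalid rows never reach Silver
        silver[event_id] = row                    # last write wins on duplicates
    return list(silver.values())

bronze = [
    {"id": 1, "amount": "9.99",  "ts": "2024-01-01T00:00:00Z"},
    {"id": 1, "amount": "19.99", "ts": "2024-01-01T00:05:00Z"},  # duplicate id
    {"id": 2, "amount": "bad"},                                  # invalid -> dropped
]
print(bronze_to_silver(bronze))
```

The "last write wins" choice mirrors CDC upsert semantics; a real pipeline would resolve duplicates by event timestamp or binlog position instead.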
▼ Processing + Transformation ▼
✨
Apache Spark
Distributed batch processing engine. Spark SQL, DataFrame API. Bronze→Silver→Gold transforms
Batch ETL · ML · SQL
🌊
Apache Flink
Real-time stateful stream processing. Exactly-once semantics, event time, windowing, CEP
Stream · Exactly-once · Low-latency
🎯
Apache Airflow
Workflow orchestration via DAGs. Schedule, monitor, retry batch pipelines. 1000+ operators
Orchestration · DAG · Scheduler
📓
Apache Zeppelin / Jupyter
Interactive notebooks for data exploration, ad-hoc analysis, and ML experimentation
Notebook · EDA
▼ Query Engines ▼
⚡
Apache Trino
Federated SQL query engine — query Iceberg, Hive, RDBMS, Kafka from a single SQL interface. Sub-second to minutes
Federated · ANSI SQL · Direct Query
🚀
Apache Druid
Real-time OLAP database — sub-second queries on streaming + historical data. Power dashboards
Real-time OLAP · Sub-second · TimeSeries
💻
Spark SQL
SQL interface over Spark cluster — complex batch SQL transformations on Iceberg tables
Batch SQL · Complex Joins
📊
Hive Metastore
Metadata registry for tables, schemas, partitions — used by Trino, Spark, Flink as catalog backend
Metadata · Schema Reg
▼ Governance + Security ▼
🗝️
Keycloak
Identity & access management. OAuth2 / OIDC / SAML. SSO for all platform services. MFA support
AuthN · OAuth2 · SSO
🛡️
Apache Ranger
Fine-grained row/column/table ACL policies. Audit logging. Integrates with Trino, Spark, HDFS, Kafka
AuthZ · Row/Col ACL · Audit
🗺️
Apache Atlas
Data catalog, metadata management, data lineage tracking, classification (PII, sensitive)
Catalog · Lineage · Classification
🔏
MinIO IAM Policies
S3-compatible bucket-level and prefix-level access policies. Integrated with Keycloak via OIDC
S3 Policy · Bucket ACL
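A minimal sketch of a MinIO access policy, written in the AWS S3 policy grammar that MinIO accepts; the bucket name `lakehouse-silver` is a hypothetical placeholder for a Silver-layer bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::lakehouse-silver",
        "arn:aws:s3:::lakehouse-silver/*"
      ]
    }
  ]
}
```

Attached to a group mapped from a Keycloak OIDC claim, a read-only policy like this lets analysts query Silver data through Trino while writes stay restricted to the pipeline service accounts.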
▼ Business Intelligence & Consumption ▼
📈
Apache Superset
Modern BI platform — 40+ chart types, dashboards, SQL Lab, role-based access. Connects via Trino/Druid
BI · Dashboards · SQL Lab
📉
Metabase
Business-friendly self-service analytics — no-code question builder for non-technical users
Self-service · No-code
🤖
MLflow
ML experiment tracking, model registry, and serving — connects to Gold layer feature store
MLOps · Model Reg
🔌
REST / JDBC API
Trino JDBC/REST API exposes Gold layer data to external apps, microservices, data science tools
Integration · JDBC
// 03 — Data Flow
End-to-End Data Flow
⬛ Batch Path: NiFi / Airbyte (batch ingest) → … → Iceberg Tables (Bronze/Silver)
⚡ Streaming Path: … → Apache Druid (real-time OLAP)
// 04 — Component Deep Dive
Technology Decision Rationale
Why MinIO?
Mature open-source object store with first-class S3 API compatibility at 10PB+ scale. Erasure coding provides fault tolerance with configurable parity. No Hadoop dependency.
Sizing
10PB across 16+ nodes, erasure set EC:8+4 (12 drives per set). JBOD hardware recommended. Throughput: 10+ GB/s per cluster.
HA Configuration
MinIO Operator on Kubernetes. Multi-site active-active replication. Load balancer with health checks.
S3 Compatible · Erasure Code · Multi-tenant · K8s Operator
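The erasure-coding overhead can be sanity-checked with simple arithmetic. The node and drive counts below mirror the section 06 sizing table and are illustrative:

```python
# Back-of-envelope MinIO capacity math (illustrative: 16 nodes × 16 drives
# × 6 TB, erasure code EC:8+4 as in the sizing table).
def usable_tb(nodes, drives_per_node, drive_tb, data_shards, parity_shards):
    raw = nodes * drives_per_node * drive_tb
    # EC:8+4 stores 8 data + 4 parity shards per set, so usable = raw × 8/12
    return raw * data_shards / (data_shards + parity_shards)

raw_tb = 16 * 16 * 6                  # 1536 TB ≈ 1.5 PB raw per cluster
print(usable_tb(16, 16, 6, 8, 4))     # 1024.0 TB ≈ 1 PB usable
```

Reaching 10 PB usable therefore needs roughly ten times this drive count, added by scaling node count and/or drive sizes beyond the 16-node minimum.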
Why Apache Iceberg?
ACID Guarantees
Snapshot isolation ensures readers are never blocked by writers. Atomic commits. Concurrent writes via optimistic concurrency control.
Time Travel
Query any historical snapshot, e.g. in Trino: SELECT * FROM orders FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC'. Auditing and point-in-time recovery built in.
Hidden Partitioning
No partition columns needed in queries — Iceberg prunes files via table metadata, so users never have to know the partition scheme.
ACID · Time Travel · Schema Evo · Compaction
Why Apache Kafka?
Role in Architecture
Central nervous system of real-time data. Acts as buffer between source systems and processing engines. Decouples producers from consumers.
Throughput
10M+ messages/second per cluster. Configurable retention (hours to months). Replication factor = 3 for HA. KRaft mode (no ZooKeeper).
Kafka Connect + Debezium
CDC from RDBMS (MySQL binlog, PG WAL). Sink to MinIO/Iceberg. Schema Registry for Avro/JSON schema management.
KRaft · Schema Registry · Debezium CDC · KSQL
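Broker disk sizing follows directly from message rate × message size × retention × replication. A hedged back-of-envelope with illustrative rates:

```python
# Rough Kafka broker-storage estimate for a retention window
# (illustrative numbers; tune message size and retention to your workload).
def retention_bytes(msgs_per_sec, avg_msg_bytes, retention_hours, replication):
    return msgs_per_sec * avg_msg_bytes * retention_hours * 3600 * replication

# 10M msg/s × 1 KiB × 24 h retention × replication factor 3
total = retention_bytes(10_000_000, 1024, 24, 3)
print(total / 1e15)  # ≈ 2.65 PB for a single day of retention
```

At headline rates, even one day of retention dwarfs the 6-broker × 12 TB NVMe tier in section 06 — which is why long-term history is offloaded to MinIO/Iceberg and Kafka retention stays short.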
Why Flink over Spark Streaming?
True event-time processing with watermarks. Millisecond latency. Native stateful operators. Checkpointing for fault tolerance without micro-batch lag.
Iceberg Sink
Flink 1.16+ native Iceberg sink with exactly-once semantics. Writes directly to Bronze/Silver Iceberg tables in MinIO. Configurable compaction.
Use Cases
Real-time aggregations, fraud detection, sessionization, joins of streams, enrichment lookups against Iceberg tables.
Exactly-once · CEP · Windowing · Flink SQL
Why Trino?
Direct Query on Storage
Query Iceberg tables on MinIO directly — zero data movement. MPP engine with dynamic partition pruning. Predicate pushdown to storage.
Federated Queries
Single SQL to JOIN across Iceberg, PostgreSQL, Kafka, Elasticsearch, MongoDB simultaneously. No ETL pre-work needed.
Cluster Sizing
1 coordinator + N workers (horizontal scale). Workers sized at 128–256GB RAM. Fault-tolerant execution with spill to disk.
ANSI SQL · Iceberg Conn. · MPP · Fault Tolerant
Keycloak (AuthN)
Central IdP: OAuth2/OIDC tokens issued to all services (Trino, Kafka, Superset, MinIO). SSO via LDAP/AD integration. MFA enforced.
Apache Ranger (AuthZ)
Fine-grained policies: which user/group can SELECT which table/column. Row-level filters. Masks PII columns. Full audit trail.
Apache Atlas (Governance)
Data lineage from ingestion → Bronze → Silver → Gold → Dashboard. Auto-classify PII, sensitive fields. Data dictionary.
OAuth2/OIDC · Row/Col ACL · PII Masking · Audit Log
// 05 — Technology Matrix
Open-Source Stack Decision Matrix
| Domain | Selected Tool | Category | ACID | Scale | HA | License |
|---|---|---|---|---|---|---|
| Object Storage | MinIO | Storage | ✓ | 10PB+ | Active-Active | AGPL-3 |
| Table Format | Apache Iceberg | Lakehouse | Full ACID | Exabytes | Snapshot | Apache 2 |
| Data Catalog | Project Nessie | Catalog | Multi-table | ✓ | ✓ | Apache 2 |
| Event Streaming | Apache Kafka | Messaging | At-least-once | 10M msg/s | Replication 3x | Apache 2 |
| CDC | Debezium | Ingestion | Exactly-once | ✓ | ✓ | Apache 2 |
| Batch Ingest | Apache NiFi / Airbyte | ELT/ETL | Partial | ✓ | Cluster | Apache 2 |
| Batch Processing | Apache Spark | Compute | via Iceberg | PB-scale | YARN/K8s | Apache 2 |
| Stream Processing | Apache Flink | Compute | Exactly-once | ✓ | Checkpoint | Apache 2 |
| Orchestration | Apache Airflow | Workflow | — | Celery/K8s | HA Mode | Apache 2 |
| Interactive SQL | Apache Trino | Query | via Iceberg | PB-scale | Fault-tolerant | Apache 2 |
| Real-time OLAP | Apache Druid | Query | Eventual | Streaming | Replicated | Apache 2 |
| Authentication | Keycloak | Security | — | Cluster | Active-Passive | Apache 2 |
| Authorization | Apache Ranger | Security | — | ✓ | HA | Apache 2 |
| Data Governance | Apache Atlas | Governance | — | ✓ | ✓ | Apache 2 |
| BI / Visualization | Apache Superset | BI | — | Stateless | Multi-replica | Apache 2 |
| MLOps | MLflow | ML Platform | — | ✓ | Basic | Apache 2 |
| Infrastructure | Kubernetes | Orchestration | — | Unlimited | Multi-node | Apache 2 |
// 06 — Infrastructure
Deployment & Sizing Guidelines
| Component | Min Nodes | CPU / Node | RAM / Node | Storage / Node | HA Mode |
|---|---|---|---|---|---|
| MinIO | 16 nodes | 16 cores | 64 GB | 16× HDD 6TB (≈ 1.5PB raw / cluster) | Erasure EC:8+4 |
| Kafka Brokers | 6 nodes | 32 cores | 128 GB | 12 TB NVMe | Replication 3x |
| Spark Workers | 20 nodes | 32 cores | 256 GB | 2 TB NVMe local | YARN / K8s |
| Flink Workers | 10 nodes | 16 cores | 128 GB | 1 TB NVMe | Checkpointing |
| Trino Workers | 10 nodes | 32 cores | 256 GB | 2 TB NVMe (spill) | Fault-tolerant exec |
| Apache Druid | 8 nodes | 16 cores | 128 GB | 4 TB SSD | Tiered segments |
| Keycloak | 3 nodes | 8 cores | 16 GB | 100 GB SSD | Active-Passive |
| Airflow | 3 nodes | 8 cores | 32 GB | 500 GB SSD | Celery / K8s exec |
| Superset | 3 nodes | 8 cores | 16 GB | Stateless | Multi-replica + Redis |
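Summed up, the minimum footprint above works out as follows — pure arithmetic over the table's numbers, which remain planning-level estimates:

```python
# Minimum cluster footprint from the sizing table above (illustrative only).
cluster = {  # component: (nodes, cores_per_node, ram_gb_per_node)
    "minio":    (16, 16,  64),
    "kafka":    ( 6, 32, 128),
    "spark":    (20, 32, 256),
    "flink":    (10, 16, 128),
    "trino":    (10, 32, 256),
    "druid":    ( 8, 16, 128),
    "keycloak": ( 3,  8,  16),
    "airflow":  ( 3,  8,  32),
    "superset": ( 3,  8,  16),
}
nodes  = sum(n for n, _, _ in cluster.values())
cores  = sum(n * c for n, c, _ in cluster.values())
ram_gb = sum(n * r for n, _, r in cluster.values())
print(nodes, cores, ram_gb)  # 79 nodes, 1768 cores, 11968 GB RAM
```

Roughly 79 nodes, ~1.8k cores, and ~12 TB of RAM at the stated minimums — a useful baseline when budgeting racks or Kubernetes capacity before growth headroom is added.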