Open Source · Lakehouse Architecture · v1.0

Brand Data Platform Architecture

Full-stack open-source lakehouse · 10PB scale · ACID compliant · Real-time + Batch · HA
Designed by: Data Architecture Team

⬡ 10 PB Storage ✦ ACID / Lakehouse ⚡ Batch + Streaming 🔐 AuthN / AuthZ ♾ High Availability 🔍 Direct Query
Platform Criteria Mapping
💾
10 PB Data Volume
Petabyte-scale object storage with horizontal scalability and erasure coding.
MinIO · HDFS
📄
Text + Mixed Data Types
Structured, semi-structured, unstructured text. JSON, Parquet, ORC, CSV, plain text, logs.
Iceberg · MinIO
🔄
ACID Transactions
Row-level ACID, snapshot isolation, time travel, schema evolution via open table format.
Apache Iceberg
Batch + Streaming
Unified processing for both bulk batch ETL and real-time event stream processing.
Spark · Flink
🔍
Direct Query on Data
Query-in-place on object storage — no ETL to a separate RDBMS required.
Trino · Druid
🔐
AuthN + AuthZ
Centralized identity via OAuth2/OIDC. Column/row-level data access control.
Keycloak · Apache Ranger
🏗️
Lakehouse Architecture
Medallion (Bronze→Silver→Gold) layers on object storage with open table format.
Iceberg · Nessie
🌡️
High Availability
Multi-node clusters, replication factor ≥ 3, no single point of failure, auto-failover.
K8s · MinIO Erasure
7-Layer Data Platform Architecture
L1 🌐
DATA SOURCES
All origin systems feeding raw data into the platform
Ingress
🗄️
RDBMS
PostgreSQL, MySQL, Oracle — transactional structured data
Structured · CDC
📦
NoSQL / Document
MongoDB, Cassandra, Elasticsearch — flexible schemas
Semi-structured
📡
Event / IoT Streams
Sensors, clickstream, app events — high-frequency data
Real-time · JSON
📁
Files / Object Store
CSV, JSON, XML, Parquet, PDF, text documents
Unstructured
🔗
REST APIs / SaaS
CRM, ERP, marketing platforms via HTTP connectors
Batch Pull
📋
Log Systems
Application logs, audit logs, system metrics (syslog, fluentd)
Text · High-vol
▼ Ingestion Pipeline ▼
L2
INGESTION LAYER
Batch + streaming pipelines — pull, push, CDC, event-driven
Pipeline
🔀
Apache Kafka
Distributed event streaming bus. Message broker for real-time event ingestion at scale
Streaming · 10M+ msg/s · HA
🔌
Kafka Connect
Connector framework for JDBC, S3, HDFS, Elasticsearch, and 200+ source/sink connectors
Source/Sink · Scalable
🔄
Debezium
Change Data Capture (CDC) — captures row-level changes from RDBMS in real-time
CDC · MySQL/PG
🌊
Apache NiFi
Visual dataflow orchestration for batch file ingestion — drag-drop pipeline builder
Batch · Visual ETL
🔗
Airbyte
Open-source ELT platform — 300+ pre-built connectors for SaaS and cloud sources
ELT · 300+ conn.
▼ Persisted to Object Storage ▼
L3 🏗️
STORAGE + LAKEHOUSE LAYER
Petabyte-scale object storage with ACID open table format — Medallion architecture
Core
💾
MinIO
S3-compatible distributed object storage. Erasure coding for 10PB+ scale. Multi-tenant. On-premise or cloud
10PB+ · S3 API · Erasure · HA
🧊
Apache Iceberg
Open table format: ACID transactions, snapshot isolation, time travel, hidden partitioning, schema evolution
ACID · Time Travel · Schema Evo
📚
Project Nessie
Git-like transactional catalog for Iceberg. Branch, commit, merge — multi-table transactions
Catalog · Git-like
🗂️
Apache Parquet / ORC
Columnar storage formats — high compression, predicate pushdown, efficient analytics I/O
Columnar · Compressed
🥉 Bronze Layer
Raw ingested data — immutable, schema-on-read, full fidelity. Never deleted.
🥈 Silver Layer
Cleansed, validated, deduplicated, typed. Schema enforced. Business entities formed.
🥇 Gold Layer
Aggregated, business-ready data marts. Optimized for BI, ML, and analytics.
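The Bronze, Silver, and Gold responsibilities above can be illustrated with a minimal, dependency-free Python sketch. In the platform itself these steps would run as Spark jobs over Iceberg tables; the record fields (`order_id`, `amount`, `ts`) and the helper names `to_silver` and `to_gold` are assumptions made up for this example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw events as they might land in Bronze: untyped strings,
# including a duplicate and a row with an invalid amount.
bronze = [
    {"order_id": "1", "amount": "10.50", "ts": "2024-01-01T09:00:00"},
    {"order_id": "1", "amount": "10.50", "ts": "2024-01-01T09:00:00"},
    {"order_id": "2", "amount": "oops", "ts": "2024-01-01T10:00:00"},
    {"order_id": "3", "amount": "4.50", "ts": "2024-01-02T08:30:00"},
]


def to_silver(rows):
    """Cleanse, validate, deduplicate, and type the raw rows."""
    seen, out = set(), []
    for row in rows:
        try:
            typed = {
                "order_id": int(row["order_id"]),
                "amount": float(row["amount"]),
                "ts": datetime.fromisoformat(row["ts"]),
            }
        except (KeyError, ValueError):
            continue  # rejected rows remain available in Bronze
        if typed["order_id"] in seen:
            continue  # drop duplicates by business key
        seen.add(typed["order_id"])
        out.append(typed)
    return out


def to_gold(rows):
    """Aggregate Silver rows into a daily revenue mart."""
    daily = defaultdict(float)
    for row in rows:
        daily[row["ts"].date().isoformat()] += row["amount"]
    return dict(daily)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Bronze is left untouched by both functions, which mirrors the "immutable, never deleted" rule for the raw layer.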
▼ Processing + Transformation ▼
L4 ⚙️
PROCESSING LAYER
Unified batch and streaming compute + workflow orchestration
Compute
Apache Spark
Distributed batch processing engine. Spark SQL, DataFrame API. Bronze→Silver→Gold transforms
Batch ETL · ML · SQL
🌊
Apache Flink
Real-time stateful stream processing. Exactly-once semantics, event time, windowing, CEP
Stream · Exactly-once · Low-latency
🎯
Apache Airflow
Workflow orchestration via DAGs. Schedule, monitor, retry batch pipelines. 1000+ operators
Orchestration · DAG · Scheduler
📓
Apache Zeppelin / Jupyter
Interactive notebooks for data exploration, ad-hoc analysis, and ML experimentation
Notebook · EDA
▼ Query Engines ▼
L5 🔍
QUERY LAYER
Interactive, federated, and real-time OLAP query engines — direct on storage
Query
Trino
Federated SQL query engine: query Iceberg, Hive, RDBMS, and Kafka from a single SQL interface. Query latency ranges from sub-second to minutes.
Federated · ANSI SQL · Direct Query
🚀
Apache Druid
Real-time OLAP database: sub-second queries on streaming and historical data. Powers dashboards.
Real-time OLAP · Sub-second · TimeSeries
💻
Spark SQL
SQL interface over Spark cluster — complex batch SQL transformations on Iceberg tables
Batch SQL · Complex Joins
📊
Hive Metastore
Metadata registry for tables, schemas, partitions — used by Trino, Spark, Flink as catalog backend
Metadata · Schema Reg
▼ Governance + Security ▼
L6 🔐
SECURITY & GOVERNANCE LAYER
Authentication, authorization, data catalog, lineage, and compliance
Security
🗝️
Keycloak
Identity & access management. OAuth2 / OIDC / SAML. SSO for all platform services. MFA support
AuthN · OAuth2 · SSO
🛡️
Apache Ranger
Fine-grained row/column/table ACL policies. Audit logging. Integrates with Trino, Spark, HDFS, Kafka
AuthZ · Row/Col ACL · Audit
🗺️
Apache Atlas
Data catalog, metadata management, data lineage tracking, classification (PII, sensitive)
Catalog · Lineage · Classification
🔏
MinIO IAM Policies
S3-compatible bucket-level and prefix-level access policies. Integrated with Keycloak via OIDC
S3 Policy · Bucket ACL
▼ Business Intelligence & Consumption ▼
L7 📊
BI & CONSUMPTION LAYER
Self-service analytics, dashboards, APIs, and ML model serving
Serving
📈
Apache Superset
Modern BI platform — 40+ chart types, dashboards, SQL Lab, role-based access. Connects via Trino/Druid
BI · Dashboards · SQL Lab
📉
Metabase
Business-friendly self-service analytics — no-code question builder for non-technical users
Self-service · No-code
🤖
MLflow
ML experiment tracking, model registry, and serving — connects to Gold layer feature store
MLOps · Model Reg
🔌
REST / JDBC API
Trino JDBC/REST API exposes Gold layer data to external apps, microservices, data science tools
Integration · JDBC
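As a rough illustration of the REST path, the sketch below builds (but does not send) a request to Trino's `/v1/statement` endpoint using only the Python standard library. The host name, user name, and the table `iceberg.gold.daily_revenue` are hypothetical; a real deployment would normally use a Trino client library and attach the OAuth2 token issued by Keycloak.

```python
import urllib.request


def build_trino_request(base_url, user, sql):
    """Build a POST request for Trino's client protocol without sending it."""
    return urllib.request.Request(
        url=f"{base_url}/v1/statement",
        data=sql.encode("utf-8"),          # the SQL text is the request body
        headers={"X-Trino-User": user},    # identifies the querying user
        method="POST",
    )


# Hypothetical coordinator address and Gold-layer table.
req = build_trino_request(
    "https://trino.example.internal:8443",
    "analyst",
    "SELECT day, revenue FROM iceberg.gold.daily_revenue LIMIT 10",
)
print(req.get_method(), req.full_url)
```

When such a request is actually submitted, the coordinator answers with a JSON document containing a `nextUri` that the client polls until the result set is complete.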
End-to-End Data Flow
⬛ Batch Path
🗄️ Source DB (RDBMS/Files) → 🌊 NiFi / Airbyte (Batch Ingest) → 🥉 MinIO Bronze (Raw Parquet) → Spark ETL (Transform) → 🥈 Silver Layer (Iceberg) → 🥇 Gold Layer (Aggregated) → 🎯 Airflow DAG (Orchestrates) → Trino Query (Direct SQL) → 📊 Superset BI (Dashboards)
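The batch path is a dependency chain, which is the kind of structure an Airflow DAG encodes. The sketch below is not Airflow code; it uses Python's standard `graphlib` module to derive a valid run order from declared dependencies, and the task names are illustrative labels for the stages listed above.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (names are illustrative).
batch_path = {
    "ingest_nifi_airbyte": {"source_db"},
    "bronze_minio": {"ingest_nifi_airbyte"},
    "spark_etl": {"bronze_minio"},
    "silver_iceberg": {"spark_etl"},
    "gold_iceberg": {"silver_iceberg"},
    "trino_query": {"gold_iceberg"},
    "superset_dashboard": {"trino_query"},
}

# static_order() yields tasks so that every dependency runs first.
run_order = list(TopologicalSorter(batch_path).static_order())
print(" -> ".join(run_order))
```

An orchestrator adds scheduling, retries, and monitoring on top of this ordering, which is the role the flow assigns to Airflow.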
⚡ Streaming Path
📡 Events / IoT (Producers) → 🔀 Kafka Topic (Event Bus) → 🌊 Flink Job (Stream Proc) → 🧊 Iceberg Tables (Bronze/Silver) → 🚀 Apache Druid (Real-time OLAP) → 📈 Live Dashboard (Superset) → 🔔 Alerts / API (Real-time)
Technology Decision Rationale
S3
MinIO — Object Storage
Petabyte-scale · S3-compatible · Self-hosted
Why MinIO?
Open-source object store with S3 API compatibility that scales to 10PB+. Erasure coding provides fault tolerance with configurable parity. No Hadoop dependency needed.
Sizing
10PB across 16+ nodes, erasure coding with 8 data + 4 parity shards per set (EC:8+4, 12 drives per set). JBOD hardware recommended. Throughput: 10+ GB/s per cluster.
HA Configuration
MinIO Operator on Kubernetes. Multi-site active-active replication. Load balancer with health checks.
S3 Compatible · Erasure Code · Multi-tenant · K8s Operator
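A quick capacity check helps relate the erasure-coding scheme to the 10PB target. The sketch below assumes that EC:8+4 means 8 data and 4 parity shards, uses the per-node figures from the sizing guidelines (16 nodes, 16 drives of 6TB each), and counts 1PB as 1000TB. The function name `capacity_tb` is made up for this example.

```python
import math


def capacity_tb(nodes, drives_per_node, drive_tb, data_shards, parity_shards):
    """Return (raw, usable) capacity in TB for one erasure-coded pool."""
    raw = nodes * drives_per_node * drive_tb
    usable = raw * data_shards / (data_shards + parity_shards)
    return raw, usable


raw_tb, usable_tb = capacity_tb(16, 16, 6, data_shards=8, parity_shards=4)

# Number of pools of this size needed to hold 10 PB of usable data.
pools_for_10pb = math.ceil(10_000 / usable_tb)

print(f"raw={raw_tb} TB, usable={usable_tb:.0f} TB, pools for 10 PB={pools_for_10pb}")
```

Under these assumptions a single 16-node pool offers roughly 1PB usable, so the 10PB target implies about ten such pools or larger drives; the exact layout is a deployment decision this sketch does not settle.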
🧊
Apache Iceberg — Table Format
ACID · Time Travel · Schema Evolution
ACID Guarantees
Snapshot isolation ensures readers are never blocked by writers. Atomic commits. Concurrent writes via optimistic concurrency control.
Time Travel
Query any historical snapshot, for example in Trino: SELECT * FROM orders FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC' (Spark SQL uses TIMESTAMP AS OF '2024-01-01'). Auditing and point-in-time recovery are built in.
Hidden Partitioning
No partition columns in queries. Iceberg optimizes reads via metadata without user knowledge of partition scheme.
ACID · Time Travel · Schema Evo · Compaction
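The interplay of optimistic concurrency, immutable snapshots, and time travel can be shown with a toy model. This is not Iceberg's API or metadata layout; `ToyTable`, `CommitConflict`, and the integer timestamps are assumptions chosen to keep the idea visible: a commit succeeds only if it was based on the current snapshot, and old snapshots stay readable.

```python
import bisect
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when a writer's base snapshot is no longer the current one."""


@dataclass
class ToyTable:
    # Each snapshot is (commit_timestamp, immutable tuple of rows).
    snapshots: list = field(default_factory=lambda: [(0, ())])

    def current_id(self):
        return len(self.snapshots) - 1

    def commit(self, base_id, ts, new_rows):
        """Atomically append a snapshot if base_id is still current."""
        if base_id != self.current_id():
            raise CommitConflict(f"based on {base_id}, current is {self.current_id()}")
        _, rows = self.snapshots[base_id]
        self.snapshots.append((ts, rows + tuple(new_rows)))
        return self.current_id()

    def as_of(self, ts):
        """Time travel: rows of the latest snapshot committed at or before ts."""
        times = [t for t, _ in self.snapshots]
        idx = bisect.bisect_right(times, ts) - 1
        if idx < 0:
            raise ValueError("no snapshot at or before the requested time")
        return self.snapshots[idx][1]


table = ToyTable()
base = table.current_id()                      # both writers start from snapshot 0
table.commit(base, ts=100, new_rows=["order-1"])

conflicts = 0
try:
    table.commit(base, ts=110, new_rows=["order-2"])   # stale base: rejected
except CommitConflict:
    conflicts += 1
    table.commit(table.current_id(), ts=120, new_rows=["order-2"])  # refresh, retry

print(table.as_of(105), table.as_of(120))
```

Readers calling `as_of` never wait for writers because existing snapshots are never modified, which is the property the ACID paragraph describes.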
Apache Kafka — Event Streaming
Distributed · Durable · High-throughput
Role in Architecture
Central nervous system of real-time data. Acts as buffer between source systems and processing engines. Decouples producers from consumers.
Throughput
10M+ messages/second per cluster. Configurable retention (hours to months). Replication factor = 3 for HA. KRaft mode (no ZooKeeper).
Kafka Connect + Debezium
CDC from RDBMS (MySQL binlog, PG WAL). Sink to MinIO/Iceberg. Schema Registry for Avro/JSON schema management.
KRaft · Schema Registry · Debezium CDC · KSQL
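For the CDC path, a Debezium PostgreSQL source connector is typically registered by posting a JSON document to the Kafka Connect REST API (`/connectors`). The sketch below only assembles that document. The connector name, host, credentials, database, and table are placeholders, and the `topic.prefix` key assumes Debezium 2.x (earlier versions used `database.server.name`).

```python
import json

# Hypothetical connector definition; all hosts, names, and credentials are placeholders.
connector = {
    "name": "orders-pg-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",              # logical decoding plugin reading the PG WAL
        "database.hostname": "orders-db.example.internal",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "change-me",       # use a secret store in practice
        "database.dbname": "orders",
        "topic.prefix": "brand",                # topics become brand.<schema>.<table>
        "table.include.list": "public.orders",
    },
}

payload = json.dumps(connector, indent=2)
print(payload)
```

With this configuration, row-level changes from `public.orders` would be published to the Kafka topic `brand.public.orders`, from which a sink connector or Flink job can write to Iceberg.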
🌊
Apache Flink — Stream Processing
Exactly-once · Stateful · Low-latency
Why Flink over Spark Streaming?
True event-time processing with watermarks. Millisecond latency. Native stateful operators. Checkpointing for fault tolerance without micro-batch lag.
Iceberg Sink
Flink 1.16+ native Iceberg sink with exactly-once semantics. Writes directly to Bronze/Silver Iceberg tables in MinIO. Configurable compaction.
Use Cases
Real-time aggregations, fraud detection, sessionization, joins of streams, enrichment lookups against Iceberg tables.
Exactly-once · CEP · Windowing · Flink SQL
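Event-time windowing with watermarks is easier to follow in a small simulation. The function below is a simplified model, not Flink's implementation: the watermark is taken as the highest timestamp seen minus an allowed out-of-order delay, a tumbling window fires once the watermark reaches its end, and events for already-fired windows are dropped. The name `tumbling_counts` and the sample timestamps are assumptions.

```python
from collections import defaultdict


def tumbling_counts(events, window_ms, max_out_of_order_ms):
    """Count events per tumbling event-time window, firing on watermark progress."""
    open_windows = defaultdict(int)
    watermark = float("-inf")
    emitted = []
    for ts in events:
        start = ts - ts % window_ms
        if start + window_ms <= watermark:
            continue  # late event: its window has already fired
        open_windows[start] += 1
        watermark = max(watermark, ts - max_out_of_order_ms)
        # Fire every open window whose end is at or before the watermark.
        for w in sorted(w for w in open_windows if w + window_ms <= watermark):
            emitted.append((w, open_windows.pop(w)))
    # End of input: flush the windows that are still open.
    for w in sorted(open_windows):
        emitted.append((w, open_windows.pop(w)))
    return emitted


# Timestamps in milliseconds, deliberately out of order.
events = [1000, 1500, 2100, 1900, 3600, 1200, 4100]
result = tumbling_counts(events, window_ms=1000, max_out_of_order_ms=500)
print(result)
```

The event at 1900 arrives after 2100 but is still counted because the watermark has not passed its window, while the event at 1200 arrives after that window fired and is discarded.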
🔍
Trino — Query Engine
Federated SQL · Direct Query · ANSI SQL
Direct Query on Storage
Query Iceberg tables on MinIO directly — zero data movement. MPP engine with dynamic partition pruning. Predicate pushdown to storage.
Federated Queries
Single SQL to JOIN across Iceberg, PostgreSQL, Kafka, Elasticsearch, MongoDB simultaneously. No ETL pre-work needed.
Cluster Sizing
1 coordinator + N workers (horizontal scale). Workers sized at 128–256GB RAM. Fault-tolerant execution with spill to disk.
ANSI SQL · Iceberg Conn. · MPP · Fault Tolerant
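The effect of predicate pushdown and partition pruning can be sketched as a filter over per-file statistics: a file is read only if its value range can overlap the query predicate. This is an illustration of the idea, not Trino's planner; the `DataFile` type, the `prune` function, and the object paths are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    min_day: str  # smallest order date in the file (ISO format)
    max_day: str  # largest order date in the file (ISO format)


def prune(files, lo, hi):
    """Keep only files whose [min_day, max_day] range overlaps [lo, hi]."""
    return [f for f in files if f.max_day >= lo and f.min_day <= hi]


files = [
    DataFile("s3://lake/gold/orders/part-01.parquet", "2024-01-01", "2024-01-31"),
    DataFile("s3://lake/gold/orders/part-02.parquet", "2024-02-01", "2024-02-29"),
    DataFile("s3://lake/gold/orders/part-03.parquet", "2024-03-01", "2024-03-31"),
]

# Equivalent to: WHERE order_day BETWEEN '2024-02-10' AND '2024-02-20'
to_scan = prune(files, "2024-02-10", "2024-02-20")
print([f.path for f in to_scan])
```

ISO-formatted dates compare correctly as strings, so the example needs no date parsing; only one of the three files has to be read for this predicate.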
🔐
Keycloak + Apache Ranger
AuthN + AuthZ · Zero-Trust · Audit
Keycloak (AuthN)
Central IdP: OAuth2/OIDC tokens issued to all services (Trino, Kafka, Superset, MinIO). SSO via LDAP/AD integration. MFA enforced.
Apache Ranger (AuthZ)
Fine-grained policies: which user/group can SELECT which table/column. Row-level filters. Masks PII columns. Full audit trail.
Apache Atlas (Governance)
Data lineage from ingestion → Bronze → Silver → Gold → Dashboard. Auto-classify PII, sensitive fields. Data dictionary.
OAuth2/OIDC · Row/Col ACL · PII Masking · Audit Log
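Row-level filtering and column masking can be pictured as two steps applied to a result set based on the caller's groups. The sketch below is a conceptual model only and does not reflect Apache Ranger's policy format or API; the policy structure, group names, and the `apply_policy` helper are assumptions.

```python
def apply_policy(rows, user_groups, policy):
    """Apply a row filter, then mask columns the user's groups may not see."""
    allowed = [r for r in rows if policy["row_filter"](r, user_groups)]
    masked_cols = {
        col for col, groups in policy["unmasked_for"].items()
        if not (groups & user_groups)
    }
    return [
        {k: ("****" if k in masked_cols else v) for k, v in r.items()}
        for r in allowed
    ]


# Hypothetical policy: users see rows for their own region, and only
# members of "admins" see the email column in clear text.
policy = {
    "row_filter": lambda row, groups: "admins" in groups
    or f"region-{row['region']}" in groups,
    "unmasked_for": {"email": {"admins"}},
}

rows = [
    {"customer": "c1", "region": "eu", "email": "a@example.com"},
    {"customer": "c2", "region": "us", "email": "b@example.com"},
]

analyst_view = apply_policy(rows, {"region-eu"}, policy)
admin_view = apply_policy(rows, {"admins"}, policy)
print(analyst_view)
```

In the platform, the groups would come from the Keycloak-issued token and the policy would be enforced inside the query engine plugin rather than in application code.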
Open-Source Stack Decision Matrix
Domain | Selected Tool | Category | ACID | Scale | HA | License
Object Storage | MinIO | Storage | - | 10PB+ | Active-Active | AGPL-3
Table Format | Apache Iceberg | Lakehouse | Full ACID | Exabytes | Snapshot | Apache 2
Data Catalog | Project Nessie | Catalog | Multi-table | - | - | Apache 2
Event Streaming | Apache Kafka | Messaging | At-least-once | 10M msg/s | Replication 3x | Apache 2
CDC | Debezium | Ingestion | Exactly-once | - | - | Apache 2
Batch Ingest | Apache NiFi / Airbyte | ELT/ETL | Partial | - | Cluster | Apache 2
Batch Processing | Apache Spark | Compute | via Iceberg | PB-scale | YARN/K8s | Apache 2
Stream Processing | Apache Flink | Compute | Exactly-once | - | Checkpoint | Apache 2
Orchestration | Apache Airflow | Workflow | - | Celery/K8s | HA Mode | Apache 2
Interactive SQL | Trino | Query | via Iceberg | PB-scale | Fault-tolerant | Apache 2
Real-time OLAP | Apache Druid | Query | Eventual | Streaming | Replicated | Apache 2
Authentication | Keycloak | Security | - | Cluster | Active-Passive | Apache 2
Authorization | Apache Ranger | Security | - | - | HA | Apache 2
Data Governance | Apache Atlas | Governance | - | - | - | Apache 2
BI / Visualization | Apache Superset | BI | - | Stateless | Multi-replica | Apache 2
MLOps | MLflow | ML Platform | - | - | Basic | Apache 2
Infrastructure | Kubernetes | Orchestration | - | Unlimited | Multi-node | Apache 2
Deployment & Sizing Guidelines
Component | Min Nodes | CPU / Node | RAM / Node | Storage / Node | HA Mode
MinIO | 16 nodes | 16 cores | 64 GB | 16× HDD 6TB (≈ 1.5PB raw / cluster) | Erasure EC:8+4
Kafka Brokers | 6 nodes | 32 cores | 128 GB | 12 TB NVMe | Replication 3x
Spark Workers | 20 nodes | 32 cores | 256 GB | 2 TB NVMe local | YARN / K8s
Flink Workers | 10 nodes | 16 cores | 128 GB | 1 TB NVMe | Checkpointing
Trino Workers | 10 nodes | 32 cores | 256 GB | 2 TB NVMe (spill) | Fault-tolerant exec
Apache Druid | 8 nodes | 16 cores | 128 GB | 4 TB SSD | Tiered segments
Keycloak | 3 nodes | 8 cores | 16 GB | 100 GB SSD | Active-Passive
Airflow | 3 nodes | 8 cores | 32 GB | 500 GB SSD | Celery / K8s exec
Superset | 3 nodes | 8 cores | 16 GB | Stateless | Multi-replica + Redis
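For budgeting, the minimum footprint implied by the guidelines above can be totalled with a few lines of arithmetic. The figures are copied from the sizing table (minimum nodes, cores per node, RAM per node); the dictionary layout is only a convenience for this sketch, and real sizing would also need headroom for coordinators, Kubernetes control-plane nodes, and growth.

```python
# component: (min nodes, CPU cores per node, RAM GB per node), from the sizing table
sizing = {
    "MinIO": (16, 16, 64),
    "Kafka Brokers": (6, 32, 128),
    "Spark Workers": (20, 32, 256),
    "Flink Workers": (10, 16, 128),
    "Trino Workers": (10, 32, 256),
    "Apache Druid": (8, 16, 128),
    "Keycloak": (3, 8, 16),
    "Airflow": (3, 8, 32),
    "Superset": (3, 8, 16),
}

total_nodes = sum(n for n, _, _ in sizing.values())
total_cores = sum(n * cores for n, cores, _ in sizing.values())
total_ram_gb = sum(n * ram for n, _, ram in sizing.values())

print(f"{total_nodes} nodes, {total_cores} cores, {total_ram_gb} GB RAM")
```

The listed minimums add up to 79 nodes, 1768 cores, and 11968 GB of RAM.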