⬡ Big Data Platform Deep Dive · 2024

Everything About
Databricks

The unified analytics platform built on Apache Spark — designed to scale from gigabytes to petabytes, power AI/ML workloads, and bridge the gap between Data Engineering, Data Science, and Analytics in a single Lakehouse.

10,000+
Enterprise Customers
Exabytes
Data Processed Daily
3× Faster
vs. Standard Spark
AWS · Azure · GCP
Multi-cloud Native

What is Databricks?

A cloud-native, unified data intelligence platform combining data engineering, streaming, ML, and BI in one place.

Databricks was founded in 2013 by the original creators of Apache Spark at UC Berkeley's AMPLab. It commercializes Spark as a fully managed cloud service — dramatically simplifying large-scale data processing that once required an entire infrastructure team.

At its core, Databricks implements the Lakehouse Architecture — a paradigm that merges the scalability and low cost of a Data Lake with the reliability, performance, and ACID transaction guarantees of a Data Warehouse.

💡 The Lakehouse Concept

Store raw data cheaply in cloud object storage (S3, ADLS, GCS), then apply schema, governance, and query performance on top — without copying data into a separate warehouse system.

🔷

Unified Platform

One platform for ETL pipelines, SQL analytics, machine learning, streaming, and BI — eliminating the "data silo" problem between teams.

Apache Spark Engine

Databricks Runtime is an optimized, proprietary fork of Apache Spark with performance-critical patches, providing 2–5× faster execution.

🌊

Delta Lake Storage

An open-source storage layer providing ACID transactions, schema enforcement, time travel, and CDC on top of Parquet files.

How Databricks Works

From cluster provisioning to distributed job execution — understanding the core mechanics.

▲ Databricks Lakehouse Platform Stack
🖥️ Notebooks: Python · SQL · Scala · R
⚙️ Workflows: Job Orchestration
🤖 AutoML: Model Training
📊 SQL Warehouse: BI & Dashboards
↕ Control Plane (Databricks-managed)
🔥 Databricks Runtime: Optimized Spark + Photon
🔄 Delta Live Tables: Declarative ETL Pipelines
🤝 MLflow: ML Lifecycle Management
↕ Data Plane (Customer's Cloud Account)
🔷 Delta Lake: ACID · Time Travel · CDC
🗃️ Unity Catalog: Governance & Lineage
🔗 External Data: Kafka · JDBC · APIs
↕ Storage Layer
☁️ Amazon S3: AWS
🗄️ Azure ADLS Gen2: Azure
🪣 Google GCS: GCP

⚙️ Cluster Lifecycle

1

Cluster Request

User or job triggers cluster creation. Control plane instructs your cloud provider (via API) to spin up EC2/VMs.

2

Driver + Workers Start

One Driver node coordinates tasks. N Worker nodes execute parallel computation using Spark executor JVMs.

3

Job Execution (DAG)

Spark converts code into a Directed Acyclic Graph (DAG) of stages, distributed across worker cores for parallel execution.

4

Auto-termination

Cluster auto-terminates after idle timeout. Results are persisted to Delta Lake / cloud storage — no data loss.
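A toy illustration of step 3's stage scheduling (plain Python using the stdlib graphlib module, not actual Spark internals): downstream stages run only after every upstream stage has finished, which is exactly what a DAG of stages enforces.

```python
from graphlib import TopologicalSorter

# Toy stage graph: stage -> set of upstream stages it depends on.
# A join stage, for example, waits for both of its scan stages.
stages = {
    "scan_events": set(),
    "scan_users": set(),
    "join": {"scan_events", "scan_users"},
    "aggregate": {"join"},
    "write_delta": {"aggregate"},
}

def execution_order(dag):
    """Return one valid order in which Spark-like stages could run."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(stages)
# Downstream stages always come after their dependencies
assert order.index("join") > order.index("scan_events")
assert order[-1] == "write_delta"
```

Spark additionally runs independent stages in parallel across worker cores; the topological order here only captures the dependency constraint.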

🔷 Delta Lake: The Magic Layer

Delta Lake is what makes Databricks uniquely reliable for large-scale production. It wraps Parquet files with a transaction log (_delta_log/) that records every change.
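As an illustrative sketch only (the real Delta protocol has more action types and fields, such as commit metadata, remove actions with timestamps, and file statistics), the transaction log can be mimicked as zero-padded JSON commit files replayed in version order:

```python
import json
import os
import tempfile

# Simplified mock of _delta_log: each commit is a JSON-lines file named
# by its 20-digit version, containing "add"/"remove" actions.
def write_commit(log_dir, version, added_files):
    actions = [{"add": {"path": p, "dataChange": True}} for p in added_files]
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return path

def live_files(log_dir):
    """Replay commits in version order to derive the current file set."""
    files = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files

log = tempfile.mkdtemp()
write_commit(log, 0, ["part-000.parquet"])
write_commit(log, 1, ["part-001.parquet"])
assert live_files(log) == {"part-000.parquet", "part-001.parquet"}
```

Replaying the log to a given version is also what makes time travel possible: stop the replay at version N and you have the table as of that commit.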

🔐
ACID Transactions
Atomicity, Consistency, Isolation, Durability — even on S3.
⏳
Time Travel
Query historical snapshots: VERSION AS OF 5 or TIMESTAMP AS OF '2024-01-01'
🔄
Schema Evolution
Safely add/remove columns. Schema enforcement catches bad data at write time.
📡
Change Data Feed (CDC)
Read only inserted/updated/deleted rows since last sync — crucial for streaming pipelines.

📄 Core PySpark + Delta Lake Patterns

# ── Reading from Delta Lake ──────────────────────────────────────────
df = spark.read.format("delta").load("abfss://raw@datalake.dfs.core.windows.net/events")

# ── Writing with MERGE (upsert) ───────────────────────────────────────
from delta.tables import DeltaTable

# newData: incoming DataFrame of user records (assumed defined upstream)
deltaTable = DeltaTable.forPath(spark, "/mnt/silver/users")
deltaTable.alias("target").merge(
    newData.alias("source"),
    "target.user_id = source.user_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# ── Time Travel Query ─────────────────────────────────────────────────
df_v5 = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("/mnt/gold/orders")

# ── Structured Streaming ──────────────────────────────────────────────
streamDf = spark.readStream.format("delta").load("/mnt/bronze/events")
query = streamDf.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/mnt/checkpoints/events") \
    .trigger(processingTime="1 minute") \
    .start("/mnt/silver/events_cleaned")

Databricks Architecture

Two-plane architecture separating control logic from data processing for security and scalability.

Control Plane

Databricks-Managed Layer

Runs in Databricks' cloud account. Handles:

  • Web UI, REST API, CLI
  • Cluster Manager & Job Scheduler
  • Notebook collaboration & versioning
  • Workflow orchestration (Databricks Workflows)
  • Unity Catalog metadata & governance
  • Authentication & RBAC
🔒 Security note

Only metadata and credentials pass through the control plane. Actual data never leaves your cloud environment.

Data Plane

Your Cloud Account

Runs inside your AWS/Azure/GCP account. Contains:

  • Spark clusters (EC2, Azure VMs, GCE)
  • Delta Lake files in object storage
  • VPC/VNet with private networking
  • Notebooks executed in secure containers
  • Compute-optimized nodes (GPU/CPU)
  • Customer-managed encryption keys
💡 Bring Your Own Cloud

You pay AWS/Azure/GCP directly for compute — Databricks charges separately for the platform layer (DBUs).
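A back-of-envelope sketch of that two-part bill. The rates in the example are invented placeholders, not price-list values; check your cloud provider and Databricks pricing pages for real numbers.

```python
# Rough cluster cost: cloud VM cost plus Databricks DBU cost,
# billed separately. All rates below are illustrative assumptions.
def hourly_cluster_cost(workers, vm_per_hour, dbu_per_node_hour, dbu_rate):
    nodes = workers + 1  # workers plus one driver node
    cloud_cost = nodes * vm_per_hour
    databricks_cost = nodes * dbu_per_node_hour * dbu_rate
    return cloud_cost + databricks_cost

# e.g. 8 workers, $2.00/hr VMs, 2 DBU/node-hr, $0.30/DBU (assumed)
cost = hourly_cluster_cost(8, 2.00, 2, 0.30)
assert round(cost, 2) == 23.40
```

Auto-termination (covered in the cluster lifecycle above) is what keeps the compute term of this equation from running while clusters sit idle.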

🏅 Medallion Architecture (Bronze → Silver → Gold)

The de facto data organisation pattern in Databricks Lakehouses: progressively refining raw data into business-ready assets.

🥉 Bronze Layer

Raw Ingestion

  • Exact copy of source data
  • Schema-on-read
  • No transformation, ever
  • Append-only or full snapshot
  • Retained indefinitely (audit)
🥈 Silver Layer

Cleansed & Conformed

  • Deduplication & null handling
  • Schema enforcement
  • Type casting & standardisation
  • Business key resolution
  • Joined/enriched datasets
🥇 Gold Layer

Business-Ready

  • Aggregated KPIs & facts
  • Domain-specific data marts
  • Optimised for BI tools
  • Feature store for ML
  • SLA-governed, well-documented
💎 Platinum (optional)

ML Features / Serving

  • Feature Store tables
  • Model serving endpoints
  • Real-time feature lookups
  • Experiment tracking via MLflow
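The layer logic above can be sketched with plain Python records for illustration; on Databricks each step would be a Spark DataFrame transformation writing a Delta table per layer.

```python
# Medallion layer logic, sketched with plain Python dicts.
bronze = [  # raw ingestion: exact copy of source, duplicates and nulls kept
    {"order_id": 1, "amount": "10.5", "region": "eu"},
    {"order_id": 1, "amount": "10.5", "region": "eu"},  # duplicate
    {"order_id": 2, "amount": None,  "region": "us"},   # bad record
    {"order_id": 3, "amount": "7.0", "region": "eu"},
]

def to_silver(rows):
    """Cleanse: drop nulls, cast types, deduplicate on the business key."""
    seen, out = set(), []
    for r in rows:
        if r["amount"] is None or r["order_id"] in seen:
            continue
        seen.add(r["order_id"])
        out.append({**r, "amount": float(r["amount"])})
    return out

def to_gold(rows):
    """Aggregate into a business-ready mart: revenue per region."""
    revenue = {}
    for r in rows:
        revenue[r["region"]] = revenue.get(r["region"], 0.0) + r["amount"]
    return revenue

silver = to_silver(bronze)
gold = to_gold(silver)
assert len(silver) == 2          # duplicate and bad record removed
assert gold == {"eu": 17.5}      # aggregated, BI-ready figure
```

Note that Bronze keeps the duplicate and the bad record untouched; all cleansing decisions are deferred to Silver so they can be replayed or revised later.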

🧩 Key Platform Components

Runtime

Photon Engine

Databricks' native vectorized query engine written in C++ — replaces the Java-based Spark SQL engine for 2–12× faster queries on large scans.

🌊
Pipelines

Delta Live Tables (DLT)

Declarative pipeline framework. Define your pipeline as SQL/Python expectations, and DLT handles dependency resolution, retries, and data quality checks.

🤖
ML/AI

MLflow + Feature Store

Open-source ML lifecycle tool (experiment tracking, model registry, deployment). Feature Store ensures consistent feature computation between training and serving.

🏛️
Governance

Unity Catalog

Unified data governance for tables, views, ML models, files. Fine-grained column/row level security, audit logs, automated lineage, and data discovery.

🔁
Orchestration

Databricks Workflows

Native job scheduler with DAG-based task dependency, retry logic, email alerting, and integrations with dbt, Airflow, and external webhooks.
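A minimal Jobs API payload for a two-task DAG might look like the following sketch (the notebook paths, job name, and schedule are placeholder assumptions):

```json
{
  "name": "nightly-medallion-refresh",
  "tasks": [
    {
      "task_key": "bronze_ingest",
      "notebook_task": {"notebook_path": "/pipelines/bronze_ingest"}
    },
    {
      "task_key": "silver_transform",
      "depends_on": [{"task_key": "bronze_ingest"}],
      "notebook_task": {"notebook_path": "/pipelines/silver_transform"}
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 2 * * ?",
    "timezone_id": "UTC"
  },
  "max_concurrent_runs": 1
}
```

The `depends_on` field is what expresses the DAG: `silver_transform` only starts once `bronze_ingest` succeeds, with retries and alerting configurable per task.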

📡
Streaming

Structured Streaming

Kafka → Delta Lake pipelines with exactly-once semantics. Trigger modes: micro-batch (seconds to minutes) or continuous processing (milliseconds).

Why Databricks? The Competitive Edge

What separates Databricks from alternatives like Snowflake, AWS EMR, or raw Spark on Kubernetes.

Capability | Databricks | Snowflake | AWS EMR | BigQuery
Batch ETL at Scale | ✅ Native Spark | ⚡ SQL only | ✅ DIY Spark | ⚡ SQL-focused
ML / Deep Learning | ✅ GPU clusters, MLflow | ❌ Limited | ⚡ Manual setup | ⚡ Vertex AI separate
Real-time Streaming | ✅ Structured Streaming + DLT | ⚡ Snowpipe only | ✅ Kinesis integration | ⚡ Dataflow needed
Open Format Storage | ✅ Delta / Iceberg / Hudi | ❌ Proprietary format | ✅ Open formats | ❌ Proprietary
Multi-language Support | ✅ Python, Scala, SQL, R | ⚡ SQL + Snowpark | ✅ Any JVM lang | ⚡ SQL + Python
Data Governance | ✅ Unity Catalog | ✅ Strong | ❌ Manual / Lake Formation | ✅ IAM + DLP
Managed Operations | ✅ Fully managed clusters | ✅ Serverless | ❌ Self-managed | ✅ Serverless
Cost at Petabyte Scale | ✅ Storage + DBU separation | ❌ High storage cost | ⚡ Spot instances help | ⚡ Per-query pricing
💰

Cost Efficiency at Scale

Data stored as open Parquet/Delta files on cheap object storage (~$0.023/GB/month on S3, versus roughly $40/TB/month for Snowflake on-demand storage). Pay for compute only when running.

🔓

No Vendor Lock-in

Delta Lake, Apache Spark, MLflow, and Apache Iceberg are all open-source. Your data format is portable — you can read Delta files with any Spark cluster.

🚀

Photon + Runtime Optimisations

Databricks Runtime applies 100+ optimisations over open-source Spark: adaptive query execution, Z-ordering, liquid clustering, and ZSTD compression.

🤝

Unified Team Collaboration

Data Engineers, Data Scientists, ML Engineers, and Analysts all work on the same platform — shared notebooks, lineage, and governance via Unity Catalog.

🌐

Multi-cloud Portability

Run on AWS, Azure, or GCP — or all three. Same APIs, same notebooks, same governance. Deploy where your data or compliance requirements dictate.

🧠

GenAI & LLM Integration

Mosaic AI (formerly MosaicML), Vector Search, and Foundation Model APIs allow fine-tuning and deploying LLMs directly within the Lakehouse — on your data.

What is Databricks Good For?

Industry-specific and cross-industry workloads where Databricks delivers outsized value.

⚙️ Data Engineering

Large-scale ETL / ELT Pipelines

Transform terabytes to petabytes daily. Delta Live Tables brings CI/CD-style pipeline reliability with automatic retries, data quality assertions, and lineage tracking.

Best for:

CDC from operational DBs, event log aggregation, complex multi-hop transformations, regulatory reporting pipelines.

🤖 Machine Learning

End-to-end ML Lifecycle

From feature engineering on PB datasets → distributed training with Horovod/DeepSpeed → MLflow experiment tracking → model registry → real-time serving endpoints.

Best for:

Fraud detection, recommendation systems, predictive maintenance, NLP at scale, large language model fine-tuning.

📡 Streaming Analytics

Real-time Data Pipelines

Ingest from Kafka, Kinesis, Event Hubs with exactly-once semantics. Trigger alerts, update dashboards, or feed operational systems with sub-minute latency.

Best for:

IoT telemetry, clickstream analytics, financial market data, real-time inventory, live personalization.

📊 SQL Analytics / BI

Serverless SQL Warehouses

Run ANSI SQL on Delta tables with Photon engine. Sub-second queries on billions of rows. Connect Tableau, Power BI, Looker, or Superset via JDBC/ODBC.

Best for:

Enterprise dashboards, ad-hoc exploration, self-service analytics, cost-efficient alternative to Snowflake for compute-heavy BI.

🧬 Genomics / Life Sciences

Scientific Computing at Scale

Process whole-genome sequencing (WGS) datasets using Glow (genomics library on Spark). Run population-scale GWAS, variant annotation, and cohort analysis.

🏦 Financial Services

Risk, Compliance & Trading

Monte Carlo simulations across millions of scenarios. AML/fraud model training. Regulatory reporting (BCBS 239, CCAR) with full audit trail via Delta time travel.

Production Setup — Petabyte Scale

A battle-tested checklist for deploying Databricks at PB+ data volumes in enterprise production environments.

⚠️ Petabyte Scale Principles

At PB scale: 1) Storage costs dominate → use Z-ordering, liquid clustering, OPTIMIZE. 2) Network I/O is the bottleneck → keep compute in same region/AZ as storage. 3) Autoscaling needs tuning → aggressive scale-down wastes startup time, aggressive scale-up wastes money.
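A quick sizing sketch behind the compaction advice: many small files explode file counts and metadata overhead, which OPTIMIZE fixes by rewriting data toward the commonly cited 128–512 MB target. The numbers below are illustrative.

```python
import math

# File count a table compacts to at a given target file size.
def target_file_count(table_bytes, target_file_mb=256):
    return max(1, math.ceil(table_bytes / (target_file_mb * 1024**2)))

one_tib = 1024**4
# A 1 TiB table at 256 MiB per file compacts to 4096 files...
assert target_file_count(one_tib) == 4096
# ...while the same table left as 1 MiB files is ~1M files to list,
# open, and track in the transaction log on every query.
assert target_file_count(one_tib, target_file_mb=1) == 1024**2
```

Multiply that gap by a thousand for a petabyte and it is clear why scheduled OPTIMIZE jobs are non-negotiable at this scale.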

🏗️
Step 1
Infrastructure & Networking

Isolate the data plane in a private VNet/VPC. Use VPC peering or Private Link to connect to source systems.

  • Enable VPC/VNet injection (no public IPs on workers)
  • Configure NAT Gateway for outbound traffic only
  • Use Private Link for Databricks control plane communication
  • Place clusters in same AZ as primary object storage bucket
  • Dedicated subnets per workspace (at least /24 for PB workloads; each cluster node consumes an IP in both the public and private subnet)
# Terraform: Databricks on Azure (private)
resource "azurerm_databricks_workspace" "prod" {
  name                        = "dbx-prod"
  resource_group_name         = var.rg
  location                    = "eastus2"
  sku                         = "premium"
  public_network_access_enabled = false
  custom_parameters {
    virtual_network_id        = var.vnet_id
    public_subnet_name        = "dbx-public"
    private_subnet_name       = "dbx-private"
    no_public_ip              = true
  }
}
☁️
Step 2
Storage Architecture

Organise your Lakehouse storage using Medallion pattern with separate containers per zone.

  • Separate storage accounts per environment (dev/staging/prod)
  • Enable hierarchical namespace on ADLS Gen2; on S3/GCS, rely on the Delta table format for transactional commits
  • Set lifecycle policies: archive Bronze after 90 days to cold tier
  • Enable versioning on Gold layer containers
  • Use customer-managed keys (CMK) for encryption at rest
# Recommended folder structure
abfss://bronze@prod.dfs.core.windows.net/
  ├── source_system_a/
  ├── source_system_b/
abfss://silver@prod.dfs.core.windows.net/
  ├── domain_users/
  ├── domain_orders/
abfss://gold@prod.dfs.core.windows.net/
  ├── mart_finance/
  ├── mart_operations/
Step 3
Cluster Configuration (PB Scale)

For PB workloads, use instance fleets with autoscaling and spot/preemptible instances on workers.

# Recommended cluster config (JSON)
{
  "cluster_name": "prod-etl-large",
  "spark_version": "14.3.x-scala2.12",
  "node_type_id": "Standard_E64ds_v4",
  "driver_node_type_id": "Standard_E32ds_v4",
  "autoscale": {
    "min_workers": 4,
    "max_workers": 200
  },
  "enable_elastic_disk": true,
  "azure_attributes": {
    "availability": "SPOT_WITH_FALLBACK_AZURE",
    "spot_bid_max_price": 100
  },
  "spark_conf": {
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.delta.autoCompact.enabled": "true",
    "spark.sql.shuffle.partitions": "auto",
    "spark.databricks.photon.enabled": "true"
  }
}
🏛️
Step 4
Unity Catalog & Governance

Unity Catalog is mandatory for production. It provides fine-grained access control, lineage, and auditing.

  • Create one metastore per cloud region
  • Use 3-level namespace: catalog.schema.table
  • Apply row-level filters and column masking for PII
  • Enable audit logging to SIEM (Splunk, Sentinel)
  • Use Service Principals for all automated jobs (no personal tokens)
-- Unity Catalog: Fine-grained permissions
GRANT SELECT ON TABLE prod_catalog.gold.orders
  TO `data-analysts@company.com`;

-- Row filters are SQL UDFs attached to the table
CREATE FUNCTION prod_catalog.gold.orders_region_filter(region STRING)
  RETURN is_account_group_member(CONCAT(region, '-analysts'));

ALTER TABLE prod_catalog.gold.orders
  SET ROW FILTER prod_catalog.gold.orders_region_filter ON (region);
🔧
Step 5
Delta Lake Optimisation (PB)

At petabyte scale, proper Delta table tuning is critical for performance and cost.

-- 1. OPTIMIZE + Z-ORDER (run weekly on Gold)
OPTIMIZE prod_catalog.gold.events
  ZORDER BY (event_date, user_id);

-- 2. Liquid Clustering (DBR 13.3+)
--    replaces static partitioning and ZORDER
ALTER TABLE prod_catalog.silver.events
  CLUSTER BY (event_date, region);

-- 3. VACUUM old files (30-day retention)
VACUUM prod_catalog.gold.events
  RETAIN 720 HOURS;

-- 4. Bloom filter for high-cardinality lookups
CREATE BLOOMFILTER INDEX
  ON TABLE prod_catalog.gold.events
  FOR COLUMNS (user_id OPTIONS (fpp=0.1));
📊
Step 6
Observability & Cost Control

Monitor costs, performance, and data quality proactively — before they become incidents.

  • Enable Cluster Usage & DBU dashboards in Databricks Admin Console
  • Export cluster events to Spark UI / Log Analytics
  • Set up DLT data quality expectations with @dlt.expect_or_fail
  • Budget alerts via Databricks Cost Management or AWS Cost Explorer
  • Use Spot interruption handling with autoscaling retry policies
  • Enable predictive autoscaling (Databricks Enhanced Autoscaling)
# DLT Data Quality Expectations
import dlt

@dlt.table
@dlt.expect_or_drop("valid_order_id",
    "order_id IS NOT NULL")
@dlt.expect_or_fail("positive_amount",
    "amount > 0")
def silver_orders():
    return dlt.read("bronze_orders") \
        .filter("status != 'CANCELLED'")

Production Config Reference

Key Spark and Databricks configurations for petabyte-scale production deployments.

🔥 Spark Performance Configs

# spark-defaults.conf for PB workloads

# Adaptive Query Execution (AQE)
spark.sql.adaptive.enabled                          true
spark.sql.adaptive.coalescePartitions.enabled       true
spark.sql.adaptive.skewJoin.enabled                 true

# Dynamic Partition Pruning
spark.sql.optimizer.dynamicPartitionPruning.enabled true

# Shuffle tuning (auto for Databricks 14+)
spark.sql.shuffle.partitions                        auto

# Memory management
spark.memory.fraction                              0.8
spark.memory.storageFraction                       0.3

# Delta-specific
spark.databricks.delta.optimizeWrite.enabled        true
spark.databricks.delta.autoCompact.enabled          true
# Only disable the retention check if you understand VACUUM data-loss risks
spark.databricks.delta.retentionDurationCheck.enabled false

# Photon engine (all SQL workloads)
spark.databricks.photon.enabled                     true

# Kryo serialization (often ~30% faster than Java serialization)
spark.serializer         org.apache.spark.serializer.KryoSerializer

💾 Recommended Instance Types

General ETL
AWS: r6id.4xlarge → r6id.32xlarge
Azure: Standard_E16ds_v5 → E96ds_v5
Memory-optimised, NVMe local SSD for spill
SQL / BI Workloads
AWS: m6id.4xlarge + Photon
Azure: Standard_D32ds_v5 + Photon
Balanced compute with fast local disk
ML Training (GPU)
AWS: p4d.24xlarge (A100 x8)
Azure: Standard_ND96asr_v4
NVLink interconnect for distributed training
Streaming (Low-latency)
AWS: c6id.4xlarge → c6id.16xlarge
Azure: Standard_F32s_v2
Compute-optimised, high network bandwidth

🏆 PB-Scale Production Checklist

Infrastructure
  • Private VPC/VNet with no public worker IPs
  • Instance Fleet with Spot + On-demand fallback
  • Enhanced autoscaling enabled
  • Cluster pools to reduce startup time
  • Same region/AZ as primary storage
Data & Storage
  • Medallion architecture (Bronze/Silver/Gold)
  • Liquid clustering on high-traffic tables
  • Weekly OPTIMIZE + VACUUM jobs
  • Target file size 128–512 MB
  • ZSTD compression for columnar files
Security & Governance
  • Unity Catalog with 3-level namespace
  • Service Principal per pipeline (no PATs)
  • CMK encryption + TLS in transit
  • Audit logs to SIEM
  • IP access lists on workspaces
Operations
  • CI/CD via Databricks Asset Bundles (DABs)
  • DLT expectations for data quality SLAs
  • Budget alerts & DBU cost tagging
  • Terraform IaC for all workspaces
  • Multi-workspace promotion (dev→staging→prod)