The unified analytics platform built on Apache Spark — designed to scale from gigabytes to petabytes, power AI/ML workloads, and bridge the gap between Data Engineering, Data Science, and Analytics in a single Lakehouse.
A cloud-native, unified data intelligence platform combining data engineering, streaming, ML, and BI in one place.
Databricks was founded in 2013 by the original creators of Apache Spark at UC Berkeley's AMPLab. It commercializes Spark as a fully managed cloud service — dramatically simplifying large-scale data processing that once required an entire infrastructure team.
At its core, Databricks implements the Lakehouse Architecture — a paradigm that merges the scalability and low cost of a Data Lake with the reliability, performance, and ACID transaction guarantees of a Data Warehouse.
Store raw data cheaply in cloud object storage (S3, ADLS, GCS), then apply schema, governance, and query performance on top — without copying data into a separate warehouse system.
One platform for ETL pipelines, SQL analytics, machine learning, streaming, and BI — eliminating the "data silo" problem between teams.
Databricks Runtime is an optimized, proprietary fork of Apache Spark with performance-critical patches that typically deliver 2–5× faster execution than open-source Spark.
An open-source storage layer providing ACID transactions, schema enforcement, time travel, and CDC on top of Parquet files.
From cluster provisioning to distributed job execution — understanding the core mechanics.
User or job triggers cluster creation. Control plane instructs your cloud provider (via API) to spin up EC2/VMs.
One Driver node coordinates tasks. N Worker nodes execute parallel computation using Spark executor JVMs.
Spark converts code into a Directed Acyclic Graph (DAG) of stages, distributed across worker cores for parallel execution.
Cluster auto-terminates after idle timeout. Results are persisted to Delta Lake / cloud storage — no data loss.
Delta Lake is what makes Databricks uniquely reliable for large-scale production. It wraps Parquet files with a transaction log (_delta_log/) that records every change.
```python
# ── Reading from Delta Lake ──────────────────────────────────────────
df = spark.read.format("delta") \
    .load("abfss://raw@datalake.dfs.core.windows.net/events")

# ── Writing with MERGE (upsert) ──────────────────────────────────────
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/mnt/silver/users")
(deltaTable.alias("target")
    .merge(newData.alias("source"), "target.user_id = source.user_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# ── Time Travel Query ────────────────────────────────────────────────
# SQL equivalents: VERSION AS OF 5 or TIMESTAMP AS OF '2024-01-01'
df_v5 = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("/mnt/gold/orders")

# ── Structured Streaming ─────────────────────────────────────────────
streamDf = spark.readStream.format("delta").load("/mnt/bronze/events")

query = (streamDf.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .trigger(processingTime="1 minute")
    .start("/mnt/silver/events_cleaned"))
```
Two-plane architecture separating control logic from data processing for security and scalability.
Runs in Databricks' cloud account. Handles:
Only metadata and credentials pass through the control plane. Actual data never leaves your cloud environment.
Runs inside your AWS/Azure/GCP account. Contains:
You pay AWS/Azure/GCP directly for compute — Databricks charges separately for the platform layer (DBUs).
The de facto data organisation pattern in Databricks Lakehouses — progressively refining raw data through Bronze, Silver, and Gold layers into business-ready assets.
Databricks' native vectorized query engine written in C++ — it replaces the JVM-based Spark SQL execution engine for supported operators, giving 2–12× faster queries on large scans.
Declarative pipeline framework. Define your pipeline as SQL/Python expectations, and DLT handles dependency resolution, retries, and data quality checks.
Open-source ML lifecycle tool (experiment tracking, model registry, deployment). Feature Store ensures consistent feature computation between training and serving.
Unified data governance for tables, views, ML models, files. Fine-grained column/row level security, audit logs, automated lineage, and data discovery.
Native job scheduler with DAG-based task dependency, retry logic, email alerting, and integrations with dbt, Airflow, and external webhooks.
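A hedged sketch of what such a job definition looks like as a Jobs API payload (job name, notebook paths, and email are illustrative placeholders), with DAG-style `depends_on` dependencies, retries, and failure alerting:

```json
{
  "name": "nightly-etl",
  "email_notifications": { "on_failure": ["data-eng@company.com"] },
  "tasks": [
    {
      "task_key": "ingest_bronze",
      "notebook_task": { "notebook_path": "/pipelines/ingest" },
      "max_retries": 2
    },
    {
      "task_key": "build_silver",
      "depends_on": [{ "task_key": "ingest_bronze" }],
      "notebook_task": { "notebook_path": "/pipelines/silver" }
    },
    {
      "task_key": "build_gold",
      "depends_on": [{ "task_key": "build_silver" }],
      "notebook_task": { "notebook_path": "/pipelines/gold" }
    }
  ]
}
```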
Kafka → Delta Lake pipelines with exactly-once semantics. Trigger modes: micro-batch (seconds to minutes) or continuous processing (milliseconds).
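A minimal sketch of such a pipeline (broker, topic, and paths are placeholders; a Delta-enabled Spark session and a reachable Kafka cluster are assumed, so the function is defined but not invoked here). The checkpoint location is what provides the exactly-once guarantee: on restart, Spark resumes from the last committed Kafka offsets.

```python
def start_bronze_ingest(spark, brokers: str, topic: str,
                        target_path: str, checkpoint_path: str):
    """Stream a Kafka topic into a Bronze Delta table.

    Not invoked here: requires a live Kafka cluster and a
    Delta-enabled Spark session.
    """
    raw = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", brokers)
                .option("subscribe", topic)
                .option("startingOffsets", "earliest")
                .load())

    # Kafka delivers key/value as binary; cast to strings for Bronze.
    events = raw.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )

    # The checkpoint directory stores committed offsets, which is what
    # gives end-to-end exactly-once delivery into Delta.
    return (events.writeStream
                  .format("delta")
                  .outputMode("append")
                  .option("checkpointLocation", checkpoint_path)
                  .trigger(processingTime="30 seconds")
                  .start(target_path))
```

Switching the trigger to `continuous="1 second"` moves the same pipeline into continuous-processing mode for millisecond latencies, at the cost of some operational restrictions.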
What separates Databricks from alternatives like Snowflake, AWS EMR, or raw Spark on Kubernetes.
| Capability | Databricks | Snowflake | AWS EMR | BigQuery |
|---|---|---|---|---|
| Batch ETL at Scale | ✅ Native Spark | ⚡ SQL only | ✅ DIY Spark | ⚡ SQL-focused |
| ML / Deep Learning | ✅ GPU clusters, MLflow | ❌ Limited | ⚡ Manual setup | ⚡ Vertex AI separate |
| Real-time Streaming | ✅ Structured Streaming + DLT | ⚡ Snowpipe only | ✅ Kinesis integration | ⚡ Dataflow needed |
| Open Format Storage | ✅ Delta / Iceberg / Hudi | ❌ Proprietary format | ✅ Open formats | ❌ Proprietary |
| Multi-language Support | ✅ Python, Scala, SQL, R | ⚡ SQL + Snowpark | ✅ Any JVM lang | ⚡ SQL + Python |
| Data Governance | ✅ Unity Catalog | ✅ Strong | ❌ Manual / Lake Formation | ✅ IAM + DLP |
| Managed Operations | ✅ Fully managed clusters | ✅ Serverless | ❌ Self-managed | ✅ Serverless |
| Cost at Petabyte Scale | ✅ Storage + DBU separation | ❌ High storage cost | ⚡ Spot instances help | ⚡ Per-query pricing |
Data stored as open Parquet/Delta files on cheap object storage (~$0.023/GB/month on S3 vs $40+/TB/month for Snowflake on-demand storage). You pay for compute only while clusters run.
Delta Lake, Apache Spark, MLflow, and Apache Iceberg are all open-source. Your data format is portable — you can read Delta files with any Spark cluster.
Databricks Runtime applies 100+ optimisations over open-source Spark: adaptive query execution, Z-ordering, liquid clustering, and ZSTD compression.
Data Engineers, Data Scientists, ML Engineers, and Analysts all work on the same platform — shared notebooks, lineage, and governance via Unity Catalog.
Run on AWS, Azure, or GCP — or all three. Same APIs, same notebooks, same governance. Deploy where your data or compliance requirements dictate.
Mosaic AI (formerly MosaicML), Vector Search, and Foundation Model APIs allow fine-tuning and deploying LLMs directly within the Lakehouse — on your data.
Industry-specific and cross-industry workloads where Databricks delivers outsized value.
Transform terabytes to petabytes daily. Delta Live Tables brings CI/CD-style pipeline reliability with automatic retries, data quality assertions, and lineage tracking.
CDC from operational DBs, event log aggregation, complex multi-hop transformations, regulatory reporting pipelines.
From feature engineering on PB datasets → distributed training with Horovod/DeepSpeed → MLflow experiment tracking → model registry → real-time serving endpoints.
Fraud detection, recommendation systems, predictive maintenance, NLP at scale, large language model fine-tuning.
Ingest from Kafka, Kinesis, Event Hubs with exactly-once semantics. Trigger alerts, update dashboards, or feed operational systems with sub-minute latency.
IoT telemetry, clickstream analytics, financial market data, real-time inventory, live personalization.
Run ANSI SQL on Delta tables with Photon engine. Sub-second queries on billions of rows. Connect Tableau, Power BI, Looker, or Superset via JDBC/ODBC.
Enterprise dashboards, ad-hoc exploration, self-service analytics, cost-efficient alternative to Snowflake for compute-heavy BI.
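As an illustrative sketch of programmatic access to a SQL warehouse, using the open-source `databricks-sql-connector` package (hostname, HTTP path, token, and the example table are placeholders; the function is defined but not invoked, since it needs a live warehouse):

```python
def query_warehouse(hostname: str, http_path: str, token: str):
    """Run a query against a Databricks SQL warehouse.

    Connection details come from the warehouse's "Connection details"
    tab. Not invoked here: requires a live warehouse and a valid token.
    """
    # Import inside the function so the sketch loads without the package.
    from databricks import sql  # pip install databricks-sql-connector

    with sql.connect(server_hostname=hostname,
                     http_path=http_path,
                     access_token=token) as conn:
        with conn.cursor() as cursor:
            cursor.execute(
                "SELECT order_date, SUM(amount) AS revenue "
                "FROM prod_catalog.gold.orders "
                "GROUP BY order_date ORDER BY order_date DESC LIMIT 10")
            return cursor.fetchall()
```

BI tools like Tableau or Power BI use the same endpoint via the JDBC/ODBC drivers instead of this Python client.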
Process whole-genome sequencing (WGS) datasets using Glow (genomics library on Spark). Run population-scale GWAS, variant annotation, and cohort analysis.
Monte Carlo simulations across millions of scenarios. AML/fraud model training. Regulatory reporting (BCBS 239, CCAR) with full audit trail via Delta time travel.
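To make the Monte Carlo case concrete, here is a plain-Python sketch of a one-day Value-at-Risk estimate under a normal-returns assumption (parameters are illustrative); at scale, the same per-scenario function is mapped over a Spark DataFrame of scenario ids so millions of paths run in parallel:

```python
import random

def simulate_pnl(n_paths: int, mu: float, sigma: float, seed: int = 42) -> list:
    # One-day P&L per scenario under a normal-returns assumption.
    # On Databricks this per-scenario logic is typically distributed
    # across workers rather than run in a single loop.
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n_paths)]

def value_at_risk(pnl: list, confidence: float = 0.99) -> float:
    # Empirical VaR: the loss at the (1 - confidence) quantile
    # of the simulated P&L distribution.
    losses = sorted(pnl)
    idx = int((1.0 - confidence) * len(losses))
    return -losses[idx]

var99 = value_at_risk(simulate_pnl(50_000, 0.0, 1.0), 0.99)  # roughly 2.33 for N(0, 1)
```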
A battle-tested checklist for deploying Databricks at PB+ data volumes in enterprise production environments.
At PB scale:
1) Storage costs dominate → use Z-ordering, liquid clustering, and regular OPTIMIZE.
2) Network I/O is the bottleneck → keep compute in the same region/AZ as storage.
3) Autoscaling needs tuning → over-aggressive scale-down pays cluster startup time again later; over-aggressive scale-up wastes money.
Isolate the data plane in a private VNet/VPC. Use VPC peering or Private Link to connect to source systems.
```hcl
# Terraform: Databricks on Azure (private)
resource "azurerm_databricks_workspace" "prod" {
  name                          = "dbx-prod"
  resource_group_name           = var.rg
  location                      = "eastus2"
  sku                           = "premium"
  public_network_access_enabled = false

  custom_parameters {
    virtual_network_id  = var.vnet_id
    public_subnet_name  = "dbx-public"
    private_subnet_name = "dbx-private"
    no_public_ip        = true
  }
}
```
Organise your Lakehouse storage using Medallion pattern with separate containers per zone.
```
# Recommended folder structure
abfss://bronze@prod.dfs.core.windows.net/
├── source_system_a/
├── source_system_b/
abfss://silver@prod.dfs.core.windows.net/
├── domain_users/
├── domain_orders/
abfss://gold@prod.dfs.core.windows.net/
├── mart_finance/
├── mart_operations/
```
For PB workloads, use instance fleets with autoscaling and spot/preemptible instances on workers.
Recommended cluster config (JSON):

```json
{
  "cluster_name": "prod-etl-large",
  "spark_version": "14.3.x-scala2.12",
  "node_type_id": "Standard_E64ds_v4",
  "driver_node_type_id": "Standard_E32ds_v4",
  "autoscale": { "min_workers": 4, "max_workers": 200 },
  "enable_elastic_disk": true,
  "azure_attributes": {
    "availability": "SPOT_WITH_FALLBACK_AZURE",
    "spot_bid_max_price": 100
  },
  "spark_conf": {
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.delta.autoCompact.enabled": "true",
    "spark.sql.shuffle.partitions": "auto",
    "spark.databricks.photon.enabled": "true"
  }
}
```
Unity Catalog is mandatory for production. It provides fine-grained access control, lineage, and auditing.
Objects are addressed with a three-level namespace: catalog.schema.table.

```sql
-- Unity Catalog: fine-grained permissions
GRANT SELECT ON TABLE prod_catalog.gold.orders
TO `data-analysts@company.com`;

-- Row-level security: a SQL UDF used as a row filter
-- (current_user_region() stands in for your own region-lookup function)
CREATE FUNCTION orders_region_filter(region STRING)
RETURN region = current_user_region();

ALTER TABLE prod_catalog.gold.orders
SET ROW FILTER orders_region_filter ON (region);
```
At petabyte scale, proper Delta table tuning is critical for performance and cost.
```sql
-- 1. OPTIMIZE + Z-ORDER (run weekly on Gold)
OPTIMIZE prod_catalog.gold.events
ZORDER BY (event_date, user_id);

-- 2. Liquid Clustering (Databricks 13+) replaces static partitioning
ALTER TABLE prod_catalog.silver.events
CLUSTER BY (event_date, region);

-- 3. VACUUM old files (30-day retention = 720 hours)
VACUUM prod_catalog.gold.events RETAIN 720 HOURS;

-- 4. Bloom filter index for high-cardinality lookups
CREATE BLOOMFILTER INDEX ON TABLE prod_catalog.gold.events
FOR COLUMNS (user_id OPTIONS (fpp = 0.1));
```
Monitor costs, performance, and data quality proactively — before they become incidents.
```python
# DLT data quality expectations
import dlt

@dlt.table
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_fail("positive_amount", "amount > 0")
def silver_orders():
    return dlt.read("bronze_orders").filter("status != 'CANCELLED'")
```
Key Spark and Databricks configurations for petabyte-scale production deployments.
```
# spark-defaults.conf for PB workloads

# Adaptive Query Execution (AQE)
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.sql.adaptive.skewJoin.enabled true

# Dynamic Partition Pruning
spark.sql.optimizer.dynamicPartitionPruning.enabled true

# Shuffle tuning (auto for Databricks 14+)
spark.sql.shuffle.partitions auto

# Memory management
spark.memory.fraction 0.8
spark.memory.storageFraction 0.3

# Delta-specific
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true
# Allows VACUUM below the default retention window -- use with care
spark.databricks.delta.retentionDurationCheck.enabled false

# Photon engine (all SQL workloads)
spark.databricks.photon.enabled true

# Kryo serialization (faster than Java serialization)
spark.serializer org.apache.spark.serializer.KryoSerializer
```