Big Data · Query Engine Deep Dive

Apache Doris vs Apache Spark

A thorough, no-fluff comparison of two powerhouses in the modern data stack.

🔵 Real-Time OLAP 🟠 Unified Analytics Engine ⚖️ Know When to Use Each
🔵
Apache Doris
Real-Time Analytical Database (MPP OLAP)

Apache Doris is a modern MPP (Massively Parallel Processing) analytical database built specifically for online analytical processing (OLAP). It originated at Baidu, was open-sourced in 2017, and presents itself as a MySQL-compatible SQL database that can ingest real-time data and serve sub-second ad-hoc queries at scale.

Think of it as a high-performance analytical database — data goes in, queries come out fast. No Spark context, no JVM tuning for data scientists; just SQL over your warehouse data.

MPP Engine MySQL Protocol Real-Time Ingestion Columnar Storage Sub-second Queries
🟠
Apache Spark
Unified Distributed Analytics Engine

Apache Spark is a general-purpose distributed computing engine that handles batch, streaming, machine learning, and graph processing under one unified API. Born at UC Berkeley's AMPLab in 2009, it displaced MapReduce by keeping intermediate data in memory, which makes some workloads up to 100× faster.

Think of it as a programmable data processing fabric — you write transformations (Python, Scala, Java, R), and Spark distributes them across a cluster. It is a compute engine, not a database.

In-Memory DAG Batch + Streaming MLlib PySpark API Spark SQL
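To make the "programmable fabric" idea concrete, here is a toy stdlib-Python analogy of the driver/executor split: the dataset is partitioned, a transformation runs on each partition in parallel, and results are collected back to the driver. `run_job` and `map_partition` are illustrative names, not Spark APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def map_partition(partition):
    # The "transformation": square every record in this partition.
    return [x * x for x in partition]

def run_job(records, num_partitions=4):
    # Split the dataset into partitions, as a cluster manager would
    # distribute data blocks across executors.
    partitions = [records[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        mapped = pool.map(map_partition, partitions)
    # "Collect" to the driver and reduce.
    return sum(x for part in mapped for x in part)

total = run_job(list(range(10)))  # sum of squares 0..9 = 285
```

In actual PySpark the whole job is roughly `sc.parallelize(records).map(lambda x: x * x).sum()`; Spark's value is that the same one-liner scales from a laptop to thousands of cores.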
🏗️
Doris Architecture: Shared-Nothing MPP
Two process types, no external dependencies, pure SQL serving layer

FE — Frontend Node

  • Query planning, parsing & optimization (CBO + RBO)
  • Metadata management via BDB JE (no ZooKeeper needed)
  • MySQL-compatible protocol handler on port 9030
  • Manages tablet assignments and replication policies
  • Follower FEs provide HA via a Paxos-like replication protocol (BDB JE)

BE — Backend Node

  • Columnar storage in a Doris-native segment format (similar in spirit to ORC/Parquet)
  • Vectorized query execution engine (SIMD, AVX2)
  • Local data caching: hot/warm/cold tier support
  • Segment-based columnar files with zone maps & bitmap indexes
  • Handles data ingestion via Stream Load, Routine Load, Broker Load

Storage Models

  • Duplicate Key: raw event data, log storage, no deduplication
  • Aggregate Key: pre-aggregated metrics (SUM, MAX, REPLACE)
  • Unique Key: upsert semantics, CDC from MySQL/Kafka
  • Primary Key (Merge-on-Read): high-frequency updates with MoR compaction
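The merge semantics behind the Aggregate and Unique Key models can be sketched in a few lines of plain Python. This is a conceptual model, not Doris code; during compaction Doris applies the same per-key logic across sorted segment files.

```python
def aggregate_key_merge(rows, agg="SUM"):
    """Merge rows that share a key, as Doris's Aggregate Key model does.
    rows: list of (key, value) tuples in arrival order."""
    merged = {}
    for key, value in rows:
        if key not in merged:
            merged[key] = value
        elif agg == "SUM":
            merged[key] += value
        elif agg == "MAX":
            merged[key] = max(merged[key], value)
        elif agg == "REPLACE":
            merged[key] = value  # last write wins; Unique Key upserts behave like REPLACE

    return merged

rows = [("2024-01-01", 10), ("2024-01-02", 5), ("2024-01-01", 7)]
aggregate_key_merge(rows, "SUM")      # {"2024-01-01": 17, "2024-01-02": 5}
aggregate_key_merge(rows, "REPLACE")  # {"2024-01-01": 7, "2024-01-02": 5}
```

The Duplicate Key model is simply the absence of this merge step: every row is kept as-is.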
🏗️
Spark Architecture: Driver + Executor DAG Engine
Pluggable cluster manager, rich in-memory RDD/DataFrame abstraction

Driver Program

  • SparkContext / SparkSession entry point
  • Catalyst optimizer converts SQL → logical → physical plan
  • DAG Scheduler breaks plan into stages of tasks
  • Task Scheduler dispatches tasks to executors
  • Tungsten engine generates optimized JVM bytecode

Executor Processes

  • Run on worker nodes (YARN, Kubernetes, Mesos, Standalone)
  • Each executor has JVM heap; tasks share executor memory
  • Cached RDDs/DataFrames live in executor on-heap or off-heap memory
  • Shuffle service moves data between stages (sort-merge)
  • Dynamic Resource Allocation scales executors up/down

Core Modules

  • Spark SQL: ANSI SQL + DataFrame API, Hive/Glue metastore integration
  • Structured Streaming: micro-batch (exactly-once) and experimental continuous (at-least-once) modes
  • MLlib: distributed ML — linear models, trees, embeddings, pipelines
  • GraphX: property graph processing with Pregel API
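How the DAG Scheduler "breaks the plan into stages" can be illustrated with a minimal sketch: stages are cut wherever a wide (shuffle) dependency appears, while narrow transformations are pipelined together. This is plain Python with an invented `split_into_stages` helper, not Spark internals.

```python
def split_into_stages(lineage):
    """lineage: list of (op_name, is_wide). Wide dependencies require a
    shuffle, so they start a new stage; narrow ops pipeline within one."""
    stages, current = [], []
    for name, is_wide in lineage:
        if is_wide and current:
            stages.append(current)   # close the stage before the shuffle
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

lineage = [
    ("read", False), ("filter", False), ("map", False),
    ("groupByKey", True),              # wide: shuffle boundary
    ("mapValues", False),
    ("join", True),                    # wide: second shuffle boundary
    ("save", False),
]
split_into_stages(lineage)
# -> [['read', 'filter', 'map'], ['groupByKey', 'mapValues'], ['join', 'save']]
```

Each resulting stage becomes a set of parallel tasks (one per partition) that the Task Scheduler ships to executors.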
Dimension · 🔵 Apache Doris · 🟠 Apache Spark

Category · what kind of system
  🔵 OLAP database: persistent storage plus a query engine
  🟠 Compute engine: no native storage; reads from any source

Query Latency · typical interactive query
  🔵 10 ms – 1 s: sub-second on pre-loaded data
  🟠 2 s – minutes: job startup overhead of roughly 2–10 s

Data Freshness · ingest-to-query delay
  🔵 Seconds: real-time via Routine Load / Stream Load
  🟠 Minutes – hours: Structured Streaming can reach minute-level freshness

Query Interface · how you talk to it
  🔵 MySQL-compatible SQL: connect any MySQL client or BI tool directly
  🟠 DataFrame API + Spark SQL: Python/Scala/Java/R via SparkSession

Concurrency · simultaneous queries
  🔵 High (thousands): designed for many concurrent BI users
  🟠 Limited (tens): each job consumes cluster resources heavily

ETL / Transformation · data-pipeline capability
  🔵 Basic SQL only: not designed for complex multi-hop ETL
  🟠 Excellent: this is the core use case, with rich transformation APIs

Machine Learning · built-in ML capabilities
  🔵 None built in: SQL UDFs are possible, but it is not the tool for ML
  🟠 MLlib + Spark ML: distributed training and feature engineering

Streaming · real-time stream processing
  🔵 Kafka → Doris: ingests from Kafka but does no stream compute
  🟠 Structured Streaming: full stateful stream processing with windowing

Storage · where data lives
  🔵 Self-managed columnar: local BE disks, or S3 for cold data
  🟠 External only: HDFS, S3, GCS, ADLS, Delta, Iceberg, Hudi

Joins at Scale · multi-table join performance
  🔵 Colocated + broadcast joins: partition colocation eliminates the shuffle for star schemas
  🟠 Sort-merge shuffle joins: shuffles are expensive; AQE mitigates adaptively

Cluster Management · ops complexity
  🔵 Simple (two process types): no external dependencies such as ZooKeeper
  🟠 Complex: needs YARN/K8s, a Hive metastore, and careful tuning

Scalability · data-size limits
  🔵 PB-scale (proven): Baidu runs it at petabyte scale internally
  🟠 Effectively unbounded: scales out over cloud object storage

Language Support · developer APIs
  🔵 SQL only: Java/C++ UDFs via a plugin API
  🟠 Python · Scala · Java · R: a full native programming model

Table Format Support · open lakehouse formats
  🔵 External catalog: reads Delta/Iceberg/Hudi via external catalogs
  🟠 Native Delta/Iceberg/Hudi: full read/write with time travel and ACID

Cost Model · typical deployment cost
  🔵 Higher $/TB stored: always-on cluster with fast disks
  🟠 Lower $/TB stored: ephemeral clusters over cheap object storage
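The "Joins at Scale" row deserves a concrete picture. Both engines avoid a shuffle when one join side is small enough to broadcast: build a hash table of the dimension table once, ship it to every partition of the fact table, and probe locally. A minimal plain-Python sketch (illustrative names, not either engine's API):

```python
def broadcast_hash_join(fact_partitions, dim_rows, fact_key=0, dim_key=0):
    # Build the hash table of the small dimension table once ("broadcast"),
    # then probe it independently on every fact partition -- no shuffle.
    dim_index = {row[dim_key]: row for row in dim_rows}
    joined = []
    for partition in fact_partitions:      # in Doris/Spark these probes run in parallel
        for row in partition:
            match = dim_index.get(row[fact_key])
            if match is not None:
                joined.append(row + match[1:])
    return joined

facts = [[(1, 100), (2, 50)], [(1, 25), (3, 10)]]   # two partitions of (cust_id, amount)
dims = [(1, "gold"), (2, "silver")]                  # small (cust_id, tier) table
broadcast_hash_join(facts, dims)
# -> [(1, 100, 'gold'), (2, 50, 'silver'), (1, 25, 'gold')]
```

When neither side fits in memory, both engines fall back to shuffling rows by join key; Doris's colocation trick is to pre-place matching buckets on the same node so that shuffle never happens for star-schema queries.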

🔵 Doris Performance Profile

Ad-hoc SQL Query (< 1M rows): 98/100
Dashboard / BI Concurrency: 95/100
Real-Time Ingestion Throughput: 88/100
Point Lookup (Unique Key): 92/100
Large-Scale Batch ETL: 42/100
ML / Complex Transformations: 18/100

🟠 Spark Performance Profile

Large-Scale Batch ETL (TB+): 97/100
ML Training at Scale: 94/100
Complex Multi-Step Pipelines: 95/100
Structured Streaming: 85/100
Interactive Sub-Second SQL: 30/100
High-Concurrency BI (1000+ users): 20/100

Benchmark context: Doris has ranked near the top of TPC-H and TPC-DS style benchmarks at data sizes up to 10 TB, often beating ClickHouse and Presto on star-schema workloads. Spark's Adaptive Query Execution (AQE) in Spark 3.x significantly closed the gap for complex SQL, but startup latency (JVM spin-up plus executor allocation) still prevents sub-second responses. Doris's vectorized engine (added in 2021) achieves a 3–10× speedup over the pre-vectorized version by applying SIMD (AVX2) operations to columnar batches.
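The row-at-a-time vs vectorized distinction can be sketched in plain Python. A real engine gains far more than this toy shows, because tight loops over contiguous columns compile down to SIMD instructions and stay cache-resident; the point here is only the difference in access pattern:

```python
from array import array

def rowwise_sum_filtered(rows):
    # Row-at-a-time: per-row dispatch overhead, poor cache locality.
    total = 0
    for row in rows:
        if row["status"] == "ok":
            total += row["amount"]
    return total

def columnar_sum_filtered(status_col, amount_col):
    # Vectorized style: one tight loop over two contiguous columns.
    # Engines like Doris apply SIMD to exactly this access pattern.
    return sum(a for s, a in zip(status_col, amount_col) if s == "ok")

rows = [{"status": "ok", "amount": 3}, {"status": "err", "amount": 9},
        {"status": "ok", "amount": 4}]
status_col = [r["status"] for r in rows]
amount_col = array("q", (r["amount"] for r in rows))  # contiguous int64 column
rowwise_sum_filtered(rows)                      # 7
columnar_sum_filtered(status_col, amount_col)   # 7
```

Both functions compute the same answer; the columnar layout is what makes batch-at-a-time SIMD execution possible at the engine level.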

✅ Doris Strengths

  • Sub-second latency: Vectorized execution + pre-loaded columnar data enables 10ms–500ms queries that BI tools demand.
  • 🔌
    Drop-in MySQL replacement: Any tool that speaks MySQL (Grafana, Superset, Tableau, DBeaver) connects instantly — zero SDK changes.
  • 🔄
    True real-time freshness: Routine Load ingests Kafka topics with seconds-level delay, visible immediately to queries.
  • 👥
    Massive concurrency: Thousands of simultaneous queries from BI dashboards without resource contention; connection pooling is trivial.
  • 🧩
    Simplicity of operations: Two process types (FE + BE), no ZooKeeper, no Hadoop. A 3-node cluster in 30 minutes.
  • 🗄️
    Multiple table models: Duplicate, Aggregate, Unique, Primary Key models serve different data patterns without separate systems.
  • ❄️
    Cold-hot tiering: Hot data on SSD, cold data auto-migrated to S3 — cost-efficient for time-series data.

⚠️ Doris Weaknesses

  • 🚫
    Not a compute engine: Cannot run arbitrary Python or complex multi-step transformations outside of SQL.
  • 💾
    Storage cost: Always-on cluster with fast disks is expensive compared to serverless compute over cheap object storage.
  • 🤖
    No ML workloads: Feature engineering and model training are outside its scope entirely.
  • 📦
    Smaller ecosystem: far fewer integrations and tutorials, and a much smaller community than Spark's; Chinese-language docs still dominate.
  • 🔁
    Schema rigidity: DDL schema changes require careful planning; ALTER TABLE on large datasets can be slow.
  • 📊
    Limited graph / text: No native graph processing or full-text search at Elasticsearch level.

✅ Spark Strengths

  • 🔀
    Universal ETL engine: Reads any source (S3, HDFS, RDBMS, Kafka, APIs), transforms with full code expressiveness, writes anywhere.
  • 🤖
    Integrated ML pipeline: Feature engineering → model training → scoring in a single Spark job with MLlib & SparkML pipelines.
  • 🌊
    Stateful streaming: Stateful aggregations, event-time windowing, watermarks, exactly-once semantics via Structured Streaming.
  • 🌍
    Massive ecosystem: Delta Lake, Iceberg, Hudi, dbt, Great Expectations, Airflow, MLflow — everything integrates with Spark.
  • 📈
    Infinite horizontal scale: Scales to exabytes; clusters resize dynamically in cloud environments (EMR, Databricks, GCP Dataproc).
  • 💰
    Cost-efficient for batch: Ephemeral spot clusters on object storage dramatically reduce cost for nightly batch jobs.
  • 🔬
    Rich language support: PySpark, Scala, Java, R — data engineers and data scientists share the same engine.

⚠️ Spark Weaknesses

  • 🐢
    High query latency: Job startup time (JVM + executor allocation) makes sub-second interactive queries impossible.
  • 👥
    Poor concurrency: Running 100+ simultaneous queries saturates cluster resources; requires careful queue management.
  • ⚙️
    Complex tuning: Memory fractions, shuffle partitions, AQE settings, GC tuning — a full-time expertise area.
  • 📉
    OOM sensitivity: Skewed joins and large shuffles cause executor OOM errors that require expert debugging.
  • 🔌
    Not a database: BI tools need a persistent SQL endpoint; Spark requires Thrift Server or Databricks SQL Warehouse — extra infra.
  • 🔄
    Real-time freshness gap: True second-level freshness requires additional infra (Kafka + Flink/Delta Streaming) on top of Spark.
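On the tuning point above: the keys below are real Spark configuration settings, but the values are illustrative starting points only; the right numbers depend on data volume, skew, and cluster size.

```python
# Commonly-tuned Spark settings (keys are real; values are example
# starting points, not recommendations for any particular cluster).
spark_tuning = {
    "spark.sql.shuffle.partitions": "400",          # default 200; size for ~128 MB per task
    "spark.sql.adaptive.enabled": "true",           # AQE: coalesce partitions at runtime
    "spark.sql.adaptive.skewJoin.enabled": "true",  # split skewed join partitions
    "spark.memory.fraction": "0.6",                 # heap share for execution + storage
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "2g",          # off-heap headroom vs container OOM kills
    "spark.dynamicAllocation.enabled": "true",      # scale executor count with load
}
# In PySpark each pair would be applied via SparkSession.builder.config(key, value).
```

That this list is only a starting subset is exactly the weakness the bullet describes: getting these right across workloads is a full-time expertise area.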

🔵 Doris Ecosystem

Best-in-class for BI serving and real-time analytics stacks.

Apache Kafka Apache Flink (write) MySQL CDC Apache Superset Grafana Tableau Power BI dbt (via MySQL adapter) DataX / SeaTunnel Spark (as sink) Hive External Catalog Iceberg Catalog JDBC Catalog (PG, MySQL) SelectDB Cloud

🟠 Spark Ecosystem

Largest ecosystem in the data engineering world — integrates with everything.

Delta Lake Apache Iceberg Apache Hudi MLflow Apache Airflow dbt (Spark adapter) Databricks AWS EMR GCP Dataproc Azure HDInsight Apache Kafka TensorFlow / PyTorch Hugging Face Great Expectations Apache Hive / Glue
Scenario 01

Real-Time Business Dashboard

Your product team needs a live dashboard showing today's orders, revenue, and funnel metrics refreshing every 30 seconds with sub-second page loads. 50 simultaneous users.

🔵 Use Doris

Kafka → Doris Routine Load → Superset/Grafana. Latency: <1s. Concurrency: no problem.

Scenario 02

Nightly ETL — 10TB Data Lake

Every night you need to join 20+ tables across your data lake, apply complex business logic, deduplicate, and write cleaned Parquet to S3. Job runs for 2-4 hours.

🟠 Use Spark

Ephemeral EMR cluster on spot instances. Cost-efficient, massively scalable, expressive Python transforms.

Scenario 03

ML Feature Engineering Pipeline

Your data science team needs to compute 500+ features from raw clickstream events, train gradient boosting models, and deploy weekly. Data: 5TB/day.

🟠 Use Spark

PySpark + MLlib/SparkML pipelines + MLflow tracking. Feature store writes back to serving layer.

Scenario 04

User-Facing Analytics (SaaS Product)

You're building an analytics page inside your SaaS product — each customer queries their own data. 10,000 tenants, queries must return in <500ms, served via your API.

🔵 Use Doris

Doris's high concurrency + MySQL protocol makes it trivial to query from any backend language. Spark would collapse under this load.

Scenario 05

Event Stream Processing (Stateful)

Count unique users per session, detect fraud patterns in payment events, compute 5-minute rolling windows with exactly-once semantics from Kafka at 1M events/sec.

🟠 Use Spark

Structured Streaming with stateful mapGroupsWithState. (Flink is also a top choice here.)

Scenario 06

Log Analytics & Observability

Ingest 100GB/hour of application logs. Ops team needs instant drill-down: "show me all errors from service X in the last 10 minutes" — must complete in <2 seconds.

🔵 Use Doris

Doris Duplicate Key model + bitmap/bloom filter indexes = blazing-fast predicate pushdown on log data.
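The index-driven skipping behind this scenario can be sketched as zone-map pruning: each segment stores per-column min/max values, so the scan skips segments that cannot possibly contain matching rows before reading a single row. A minimal sketch with assumed field names, not Doris's internal layout:

```python
def prune_segments(segments, lo, hi):
    """Keep only segments whose zone map [min_ts, max_ts] can overlap
    the queried time window [lo, hi]; everything else is never read."""
    return [s for s in segments if not (s["max_ts"] < lo or s["min_ts"] > hi)]

segments = [
    {"id": 0, "min_ts": 0,   "max_ts": 99},
    {"id": 1, "min_ts": 100, "max_ts": 199},
    {"id": 2, "min_ts": 200, "max_ts": 299},
]
[s["id"] for s in prune_segments(segments, 150, 260)]   # [1, 2]
```

On time-ordered log data this is devastatingly effective: a "last 10 minutes" predicate eliminates nearly every segment up front, and bitmap/bloom filter indexes then narrow the survivors further.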

Scenario 07

Data Lakehouse Migration

Moving 500TB from legacy Oracle to an open lakehouse (Delta/Iceberg on S3). Need full ACID, schema evolution, time-travel, and to reprocess historical partitions.

🟠 Use Spark

Spark + Delta Lake is the de-facto standard. MERGE INTO, Z-ordering, auto-compaction, GDPR delete.

Scenario 08

Multi-Dimensional Ad-Hoc Analysis

Business analysts run free-form SQL across a 500M-row fact table with 20 dimension tables daily. Queries vary wildly. No pre-defined aggregations. SLA: <3s per query.

🔵 Use Doris

Doris's CBO + colocation join + vectorized engine handles star-schema ad-hoc without pre-computing cubes.

Scenario 09

End-to-End Lakehouse Pipeline

Build a modern data platform: ingest → transform → serve. You need ETL flexibility AND fast query serving, both at petabyte scale, in one coherent architecture.

⚡ Use Both Together

Spark for ETL → write to Doris for serving. This is the most common production pattern at companies like ByteDance, Meituan, and JD.com.

Ask yourself these questions in order ↓

Do you need query results in < 1 second?
YES → latency-critical
🔵 Lean toward Doris
NO → throughput-oriented
🟠 Lean toward Spark
Are you running ML training or complex Python transformations?
YES → compute-heavy
🟠 Spark (only viable choice)
NO → SQL is enough
🔵 Doris handles it well
Do you need > 100 concurrent query users?
YES → concurrency matters
🔵 Doris (Spark will break)
NO → batch / pipeline
🟠 Spark is fine
Is your data already in S3/HDFS and you want zero data copy?
YES → stay in lake
🟠 Spark reads native
NO → load into system
🔵 Doris ingestion pipeline
Need BOTH fast serving AND heavy transformation?
⚡ Use Both — Spark for ETL → Doris for Serving
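The checklist above can be collapsed into a small function. This is just the decision tree restated, with the combined serving-plus-ETL case checked first because it overrides the rest; thresholds and argument names are illustrative.

```python
def choose_engine(needs_subsecond, needs_ml_or_python, concurrent_users,
                  data_stays_in_lake, needs_serving_and_etl):
    """Encode the decision checklist. All arguments are booleans except
    concurrent_users, an estimated peak query-user count."""
    if needs_serving_and_etl:
        return "both"              # Spark for ETL -> Doris for serving
    if needs_ml_or_python:
        return "spark"             # only viable choice for compute-heavy work
    if needs_subsecond or concurrent_users > 100:
        return "doris"             # latency- or concurrency-critical
    if data_stays_in_lake:
        return "spark"             # reads S3/HDFS natively, zero data copy
    return "doris"

choose_engine(True, False, 500, False, False)   # 'doris'
choose_engine(False, True, 10, True, False)     # 'spark'
choose_engine(False, False, 50, False, True)    # 'both'
```

Real decisions weigh more factors (cost, team skills, existing infra), but these five questions cover the large majority of cases.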

🏁 The Real Verdict

Apache Doris is a laser-focused analytical database — it does one thing better than almost anything else: serve fast, concurrent SQL queries over fresh data. If you're building dashboards, real-time reports, or user-facing analytics where milliseconds matter and hundreds of users query simultaneously, Doris is your answer. It's operationally simple, MySQL-compatible, and will make your BI team ecstatic.

Apache Spark is a general-purpose distributed computing engine — the glue of the modern data stack. If you're transforming terabytes, training models, migrating data lakes, or building complex multi-step pipelines, Spark's expressiveness, ecosystem depth, and horizontal scalability are unmatched. It's the engine powering most Fortune 500 data platforms.

The most powerful architecture uses both: Spark handles the heavy lifting — cleaning, enriching, aggregating — and writes results to Doris, which serves them to users at sub-second speed. This Spark → Doris pattern is battle-tested at massive scale (ByteDance, Meituan, JD.com) and represents the ideal modern analytical architecture.

Doris = Serve Layer Spark = Transform Layer Doris for BI Concurrency Spark for ETL + ML Doris for Real-Time Freshness Spark for Data Lakes Best: Use Both Together