Big Data · Query Engine Deep Dive

Apache Doris vs Apache Spark

A thorough, no-fluff comparison of two powerhouses in the modern data stack.

🔵 Real-Time OLAP 🟠 Unified Analytics Engine ⚖️ Know When to Use Each
🔵
Apache Doris
Real-Time Analytical Database (MPP OLAP)

Apache Doris is a modern MPP (Massively Parallel Processing) analytical database built specifically for online analytical processing (OLAP). It originated at Baidu, was open-sourced in 2017, and presents itself as a MySQL-compatible SQL database that can ingest real-time data and serve sub-second ad-hoc queries at scale.

Think of it as a high-performance analytical database — data goes in, queries come out fast. No Spark context, no JVM tuning for data scientists; just SQL over your warehouse data.

MPP Engine MySQL Protocol Real-Time Ingestion Columnar Storage Sub-second Queries
🟠
Apache Spark
Unified Distributed Analytics Engine

Apache Spark is a general-purpose distributed computing engine that handles batch, streaming, machine learning, and graph processing under one unified API. Born at UC Berkeley's AMPLab in 2009, it displaced MapReduce by keeping intermediate data in memory, which makes some workloads up to 100× faster.

Think of it as a programmable data processing fabric — you write transformations (Python, Scala, Java, R), and Spark distributes them across a cluster. It is a compute engine, not a database.

In-Memory DAG Batch + Streaming MLlib PySpark API Spark SQL
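To make the "programmable fabric" idea concrete, here is a toy stdlib-Python analogy of the driver/executor split: the dataset is partitioned, a transformation runs on each partition in parallel, and results are collected back to the driver. `run_job` and `map_partition` are illustrative names, not Spark APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def map_partition(partition):
    # The "transformation": square every record in this partition.
    return [x * x for x in partition]

def run_job(records, num_partitions=4):
    # Split the dataset into partitions, as a cluster manager would
    # distribute data blocks across executors.
    partitions = [records[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        mapped = pool.map(map_partition, partitions)
    # "Collect" to the driver and reduce.
    return sum(x for part in mapped for x in part)

total = run_job(list(range(10)))  # sum of squares 0..9 = 285
```

In actual PySpark the whole job is roughly `sc.parallelize(records).map(lambda x: x * x).sum()`; Spark's value is that the same one-liner scales from a laptop to thousands of cores.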
🏗️
Doris Architecture: Shared-Nothing MPP
Two process types, no external dependencies, pure SQL serving layer

FE — Frontend Node

  • Query planning, parsing & optimization (CBO + RBO)
  • Metadata management via BDB JE (no ZooKeeper needed)
  • MySQL-compatible protocol handler on port 9030
  • Manages tablet assignments and replication policies
  • Follower FEs provide HA via a Paxos-like replication protocol (BDB JE)

BE — Backend Node

  • Columnar storage in a Doris-native segment format (similar in spirit to ORC/Parquet)
  • Vectorized query execution engine (SIMD, AVX2)
  • Local data caching: hot/warm/cold tier support
  • Segment-based columnar files with zone maps & bitmap indexes
  • Handles data ingestion via Stream Load, Routine Load, Broker Load

Storage Models

  • Duplicate Key: raw event data, log storage, no deduplication
  • Aggregate Key: pre-aggregated metrics (SUM, MAX, REPLACE)
  • Unique Key: upsert semantics, CDC from MySQL/Kafka
  • Primary Key (Merge-on-Read): high-frequency updates with MoR compaction
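The merge semantics behind the Aggregate and Unique Key models can be sketched in a few lines of plain Python. This is a conceptual model, not Doris code; during compaction Doris applies the same per-key logic across sorted segment files.

```python
def aggregate_key_merge(rows, agg="SUM"):
    """Merge rows that share a key, as Doris's Aggregate Key model does.
    rows: list of (key, value) tuples in arrival order."""
    merged = {}
    for key, value in rows:
        if key not in merged:
            merged[key] = value
        elif agg == "SUM":
            merged[key] += value
        elif agg == "MAX":
            merged[key] = max(merged[key], value)
        elif agg == "REPLACE":
            merged[key] = value  # last write wins; Unique Key upserts behave like REPLACE

    return merged

rows = [("2024-01-01", 10), ("2024-01-02", 5), ("2024-01-01", 7)]
aggregate_key_merge(rows, "SUM")      # {"2024-01-01": 17, "2024-01-02": 5}
aggregate_key_merge(rows, "REPLACE")  # {"2024-01-01": 7, "2024-01-02": 5}
```

The Duplicate Key model is simply the absence of this merge step: every row is kept as-is.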
🏗️
Spark Architecture: Driver + Executor DAG Engine
Pluggable cluster manager, rich in-memory RDD/DataFrame abstraction

Driver Program

  • SparkContext / SparkSession entry point
  • Catalyst optimizer converts SQL → logical → physical plan
  • DAG Scheduler breaks plan into stages of tasks
  • Task Scheduler dispatches tasks to executors
  • Tungsten engine generates optimized JVM bytecode

Executor Processes

  • Run on worker nodes (YARN, Kubernetes, Mesos, Standalone)
  • Each executor has JVM heap; tasks share executor memory
  • Cached RDDs/DataFrames live in executor on-heap or off-heap memory
  • Shuffle service moves data between stages (sort-merge)
  • Dynamic Resource Allocation scales executors up/down

Core Modules

  • Spark SQL: ANSI SQL + DataFrame API, Hive/Glue metastore integration
  • Structured Streaming: micro-batch (exactly-once) and experimental continuous (at-least-once) modes
  • MLlib: distributed ML — linear models, trees, embeddings, pipelines
  • GraphX: property graph processing with Pregel API
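How the DAG Scheduler "breaks the plan into stages" can be illustrated with a minimal sketch: stages are cut wherever a wide (shuffle) dependency appears, while narrow transformations are pipelined together. This is plain Python with an invented `split_into_stages` helper, not Spark internals.

```python
def split_into_stages(lineage):
    """lineage: list of (op_name, is_wide). Wide dependencies require a
    shuffle, so they start a new stage; narrow ops pipeline within one."""
    stages, current = [], []
    for name, is_wide in lineage:
        if is_wide and current:
            stages.append(current)   # close the stage before the shuffle
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

lineage = [
    ("read", False), ("filter", False), ("map", False),
    ("groupByKey", True),              # wide: shuffle boundary
    ("mapValues", False),
    ("join", True),                    # wide: second shuffle boundary
    ("save", False),
]
split_into_stages(lineage)
# -> [['read', 'filter', 'map'], ['groupByKey', 'mapValues'], ['join', 'save']]
```

Each resulting stage becomes a set of parallel tasks (one per partition) that the Task Scheduler ships to executors.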
Dimension · 🔵 Apache Doris · 🟠 Apache Spark

Category · what kind of system
  🔵 OLAP database: persistent storage plus a query engine
  🟠 Compute engine: no native storage; reads from any source

Query Latency · typical interactive query
  🔵 10 ms – 1 s: sub-second on pre-loaded data
  🟠 2 s – minutes: job startup overhead of roughly 2–10 s

Data Freshness · ingest-to-query delay
  🔵 Seconds: real-time via Routine Load / Stream Load
  🟠 Minutes – hours: Structured Streaming can reach minute-level freshness

Query Interface · how you talk to it
  🔵 MySQL-compatible SQL: connect any MySQL client or BI tool directly
  🟠 DataFrame API + Spark SQL: Python/Scala/Java/R via SparkSession

Concurrency · simultaneous queries
  🔵 High (thousands): designed for many concurrent BI users
  🟠 Limited (tens): each job consumes cluster resources heavily

ETL / Transformation · data-pipeline capability
  🔵 Basic SQL only: not designed for complex multi-hop ETL
  🟠 Excellent: this is the core use case, with rich transformation APIs

Machine Learning · built-in ML capabilities
  🔵 None built in: SQL UDFs are possible, but it is not the tool for ML
  🟠 MLlib + Spark ML: distributed training and feature engineering

Streaming · real-time stream processing
  🔵 Kafka → Doris: ingests from Kafka but does no stream compute
  🟠 Structured Streaming: full stateful stream processing with windowing

Storage · where data lives
  🔵 Self-managed columnar: local BE disks, or S3 for cold data
  🟠 External only: HDFS, S3, GCS, ADLS, Delta, Iceberg, Hudi

Joins at Scale · multi-table join performance
  🔵 Colocated + broadcast joins: partition colocation eliminates the shuffle for star schemas
  🟠 Sort-merge shuffle joins: shuffles are expensive; AQE mitigates adaptively

Cluster Management · ops complexity
  🔵 Simple (two process types): no external dependencies such as ZooKeeper
  🟠 Complex: needs YARN/K8s, a Hive metastore, and careful tuning

Scalability · data-size limits
  🔵 PB-scale (proven): Baidu runs it at petabyte scale internally
  🟠 Effectively unbounded: scales out over cloud object storage

Language Support · developer APIs
  🔵 SQL only: Java/C++ UDFs via a plugin API
  🟠 Python · Scala · Java · R: a full native programming model

Table Format Support · open lakehouse formats
  🔵 External catalog: reads Delta/Iceberg/Hudi via external catalogs
  🟠 Native Delta/Iceberg/Hudi: full read/write with time travel and ACID

Cost Model · typical deployment cost
  🔵 Higher $/TB stored: always-on cluster with fast disks
  🟠 Lower $/TB stored: ephemeral clusters over cheap object storage
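The "Joins at Scale" row deserves a concrete picture. Both engines avoid a shuffle when one join side is small enough to broadcast: build a hash table of the dimension table once, ship it to every partition of the fact table, and probe locally. A minimal plain-Python sketch (illustrative names, not either engine's API):

```python
def broadcast_hash_join(fact_partitions, dim_rows, fact_key=0, dim_key=0):
    # Build the hash table of the small dimension table once ("broadcast"),
    # then probe it independently on every fact partition -- no shuffle.
    dim_index = {row[dim_key]: row for row in dim_rows}
    joined = []
    for partition in fact_partitions:      # in Doris/Spark these probes run in parallel
        for row in partition:
            match = dim_index.get(row[fact_key])
            if match is not None:
                joined.append(row + match[1:])
    return joined

facts = [[(1, 100), (2, 50)], [(1, 25), (3, 10)]]   # two partitions of (cust_id, amount)
dims = [(1, "gold"), (2, "silver")]                  # small (cust_id, tier) table
broadcast_hash_join(facts, dims)
# -> [(1, 100, 'gold'), (2, 50, 'silver'), (1, 25, 'gold')]
```

When neither side fits in memory, both engines fall back to shuffling rows by join key; Doris's colocation trick is to pre-place matching buckets on the same node so that shuffle never happens for star-schema queries.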

🔵 Doris Performance Profile

Ad-hoc SQL Query (< 1M rows): 98/100
Dashboard / BI Concurrency: 95/100
Real-Time Ingestion Throughput: 88/100
Point Lookup (Unique Key): 92/100
Large-Scale Batch ETL: 42/100
ML / Complex Transformations: 18/100

🟠 Spark Performance Profile

Large-Scale Batch ETL (TB+): 97/100
ML Training at Scale: 94/100
Complex Multi-Step Pipelines: 95/100
Structured Streaming: 85/100
Interactive Sub-Second SQL: 30/100
High-Concurrency BI (1000+ users): 20/100

Benchmark context: Doris has ranked near the top of TPC-H and TPC-DS style benchmarks at data sizes up to 10 TB, often beating ClickHouse and Presto on star-schema workloads. Spark's Adaptive Query Execution (AQE) in Spark 3.x significantly closed the gap for complex SQL, but startup latency (JVM spin-up plus executor allocation) still prevents sub-second responses. Doris's vectorized engine (added in 2021) achieves a 3–10× speedup over the pre-vectorized version by applying SIMD (AVX2) operations to columnar batches.
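The row-at-a-time vs vectorized distinction can be sketched in plain Python. A real engine gains far more than this toy shows, because tight loops over contiguous columns compile down to SIMD instructions and stay cache-resident; the point here is only the difference in access pattern:

```python
from array import array

def rowwise_sum_filtered(rows):
    # Row-at-a-time: per-row dispatch overhead, poor cache locality.
    total = 0
    for row in rows:
        if row["status"] == "ok":
            total += row["amount"]
    return total

def columnar_sum_filtered(status_col, amount_col):
    # Vectorized style: one tight loop over two contiguous columns.
    # Engines like Doris apply SIMD to exactly this access pattern.
    return sum(a for s, a in zip(status_col, amount_col) if s == "ok")

rows = [{"status": "ok", "amount": 3}, {"status": "err", "amount": 9},
        {"status": "ok", "amount": 4}]
status_col = [r["status"] for r in rows]
amount_col = array("q", (r["amount"] for r in rows))  # contiguous int64 column
rowwise_sum_filtered(rows)                      # 7
columnar_sum_filtered(status_col, amount_col)   # 7
```

Both functions compute the same answer; the columnar layout is what makes batch-at-a-time SIMD execution possible at the engine level.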

✅ Doris Strengths

  • Sub-second latency: Vectorized execution + pre-loaded columnar data enables 10ms–500ms queries that BI tools demand.
  • 🔌
    Drop-in MySQL replacement: Any tool that speaks MySQL (Grafana, Superset, Tableau, DBeaver) connects instantly — zero SDK changes.
  • 🔄
    True real-time freshness: Routine Load ingests Kafka topics with seconds-level delay, visible immediately to queries.
  • 👥
    Massive concurrency: Thousands of simultaneous queries from BI dashboards without resource contention; connection pooling is trivial.
  • 🧩
    Simplicity of operations: Two process types (FE + BE), no ZooKeeper, no Hadoop. A 3-node cluster in 30 minutes.
  • 🗄️
    Multiple table models: Duplicate, Aggregate, Unique, Primary Key models serve different data patterns without separate systems.
  • ❄️
    Cold-hot tiering: Hot data on SSD, cold data auto-migrated to S3 — cost-efficient for time-series data.

⚠️ Doris Weaknesses

  • 🚫
    Not a compute engine: Cannot run arbitrary Python or complex multi-step transformations outside of SQL.
  • 💾
    Storage cost: Always-on cluster with fast disks is expensive compared to serverless compute over cheap object storage.
  • 🤖
    No ML workloads: Feature engineering and model training are outside its scope entirely.
  • 📦
    Smaller ecosystem: far fewer integrations and tutorials, and a much smaller community than Spark's; Chinese-language docs still dominate.
  • 🔁
    Schema rigidity: DDL schema changes require careful planning; ALTER TABLE on large datasets can be slow.
  • 📊
    Limited graph / text: No native graph processing or full-text search at Elasticsearch level.

✅ Spark Strengths

  • 🔀
    Universal ETL engine: Reads any source (S3, HDFS, RDBMS, Kafka, APIs), transforms with full code expressiveness, writes anywhere.
  • 🤖
    Integrated ML pipeline: Feature engineering → model training → scoring in a single Spark job with MLlib & SparkML pipelines.
  • 🌊
    Stateful streaming: Stateful aggregations, event-time windowing, watermarks, exactly-once semantics via Structured Streaming.
  • 🌍
    Massive ecosystem: Delta Lake, Iceberg, Hudi, dbt, Great Expectations, Airflow, MLflow — everything integrates with Spark.
  • 📈
    Infinite horizontal scale: Scales to exabytes; clusters resize dynamically in cloud environments (EMR, Databricks, GCP Dataproc).
  • 💰
    Cost-efficient for batch: Ephemeral spot clusters on object storage dramatically reduce cost for nightly batch jobs.
  • 🔬
    Rich language support: PySpark, Scala, Java, R — data engineers and data scientists share the same engine.

⚠️ Spark Weaknesses

  • 🐢
    High query latency: Job startup time (JVM + executor allocation) makes sub-second interactive queries impossible.
  • 👥
    Poor concurrency: Running 100+ simultaneous queries saturates cluster resources; requires careful queue management.
  • ⚙️
    Complex tuning: Memory fractions, shuffle partitions, AQE settings, GC tuning — a full-time expertise area.
  • 📉
    OOM sensitivity: Skewed joins and large shuffles cause executor OOM errors that require expert debugging.
  • 🔌
    Not a database: BI tools need a persistent SQL endpoint; Spark requires Thrift Server or Databricks SQL Warehouse — extra infra.
  • 🔄
    Real-time freshness gap: True second-level freshness requires additional infra (Kafka + Flink/Delta Streaming) on top of Spark.
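On the tuning point above: the keys below are real Spark configuration settings, but the values are illustrative starting points only; the right numbers depend on data volume, skew, and cluster size.

```python
# Commonly-tuned Spark settings (keys are real; values are example
# starting points, not recommendations for any particular cluster).
spark_tuning = {
    "spark.sql.shuffle.partitions": "400",          # default 200; size for ~128 MB per task
    "spark.sql.adaptive.enabled": "true",           # AQE: coalesce partitions at runtime
    "spark.sql.adaptive.skewJoin.enabled": "true",  # split skewed join partitions
    "spark.memory.fraction": "0.6",                 # heap share for execution + storage
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "2g",          # off-heap headroom vs container OOM kills
    "spark.dynamicAllocation.enabled": "true",      # scale executor count with load
}
# In PySpark each pair would be applied via SparkSession.builder.config(key, value).
```

That this list is only a starting subset is exactly the weakness the bullet describes: getting these right across workloads is a full-time expertise area.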

🔵 Doris Ecosystem

Best-in-class for BI serving and real-time analytics stacks.

Apache Kafka Apache Flink (write) MySQL CDC Apache Superset Grafana Tableau Power BI dbt (via MySQL adapter) DataX / SeaTunnel Spark (as sink) Hive External Catalog Iceberg Catalog JDBC Catalog (PG, MySQL) SelectDB Cloud

🟠 Spark Ecosystem

Largest ecosystem in the data engineering world — integrates with everything.

Delta Lake Apache Iceberg Apache Hudi MLflow Apache Airflow dbt (Spark adapter) Databricks AWS EMR GCP Dataproc Azure HDInsight Apache Kafka TensorFlow / PyTorch Hugging Face Great Expectations Apache Hive / Glue
Scenario 01

Real-Time Business Dashboard

Your product team needs a live dashboard showing today's orders, revenue, and funnel metrics refreshing every 30 seconds with sub-second page loads. 50 simultaneous users.

🔵 Use Doris

Kafka → Doris Routine Load → Superset/Grafana. Latency: <1s. Concurrency: no problem.

Scenario 02

Nightly ETL — 10TB Data Lake

Every night you need to join 20+ tables across your data lake, apply complex business logic, deduplicate, and write cleaned Parquet to S3. Job runs for 2-4 hours.

🟠 Use Spark

Ephemeral EMR cluster on spot instances. Cost-efficient, massively scalable, expressive Python transforms.

Scenario 03

ML Feature Engineering Pipeline

Your data science team needs to compute 500+ features from raw clickstream events, train gradient boosting models, and deploy weekly. Data: 5TB/day.

🟠 Use Spark

PySpark + MLlib/SparkML pipelines + MLflow tracking. Feature store writes back to serving layer.

Scenario 04

User-Facing Analytics (SaaS Product)

You're building an analytics page inside your SaaS product — each customer queries their own data. 10,000 tenants, queries must return in <500ms, served via your API.

🔵 Use Doris

Doris's high concurrency + MySQL protocol makes it trivial to query from any backend language. Spark would collapse under this load.

Scenario 05

Event Stream Processing (Stateful)

Count unique users per session, detect fraud patterns in payment events, compute 5-minute rolling windows with exactly-once semantics from Kafka at 1M events/sec.

🟠 Use Spark

Structured Streaming with stateful mapGroupsWithState. (Flink is also a top choice here.)

Scenario 06

Log Analytics & Observability

Ingest 100GB/hour of application logs. Ops team needs instant drill-down: "show me all errors from service X in the last 10 minutes" — must complete in <2 seconds.

🔵 Use Doris

Doris Duplicate Key model + bitmap/bloom filter indexes = blazing-fast predicate pushdown on log data.
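The index-driven skipping behind this scenario can be sketched as zone-map pruning: each segment stores per-column min/max values, so the scan skips segments that cannot possibly contain matching rows before reading a single row. A minimal sketch with assumed field names, not Doris's internal layout:

```python
def prune_segments(segments, lo, hi):
    """Keep only segments whose zone map [min_ts, max_ts] can overlap
    the queried time window [lo, hi]; everything else is never read."""
    return [s for s in segments if not (s["max_ts"] < lo or s["min_ts"] > hi)]

segments = [
    {"id": 0, "min_ts": 0,   "max_ts": 99},
    {"id": 1, "min_ts": 100, "max_ts": 199},
    {"id": 2, "min_ts": 200, "max_ts": 299},
]
[s["id"] for s in prune_segments(segments, 150, 260)]   # [1, 2]
```

On time-ordered log data this is devastatingly effective: a "last 10 minutes" predicate eliminates nearly every segment up front, and bitmap/bloom filter indexes then narrow the survivors further.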

Scenario 07

Data Lakehouse Migration

Moving 500TB from legacy Oracle to an open lakehouse (Delta/Iceberg on S3). Need full ACID, schema evolution, time-travel, and to reprocess historical partitions.

🟠 Use Spark

Spark + Delta Lake is the de-facto standard. MERGE INTO, Z-ordering, auto-compaction, GDPR delete.

Scenario 08

Multi-Dimensional Ad-Hoc Analysis

Business analysts run free-form SQL across a 500M-row fact table with 20 dimension tables daily. Queries vary wildly. No pre-defined aggregations. SLA: <3s per query.

🔵 Use Doris

Doris's CBO + colocation join + vectorized engine handles star-schema ad-hoc without pre-computing cubes.

Scenario 09

End-to-End Lakehouse Pipeline

Build a modern data platform: ingest → transform → serve. You need ETL flexibility AND fast query serving, both at petabyte scale, in one coherent architecture.

⚡ Use Both Together

Spark for ETL → write to Doris for serving. This is the most common production pattern at companies like ByteDance, Meituan, and JD.com.

Ask yourself these questions in order ↓

Do you need query results in < 1 second?
YES → latency-critical
🔵 Lean toward Doris
NO → throughput-oriented
🟠 Lean toward Spark
Are you running ML training or complex Python transformations?
YES → compute-heavy
🟠 Spark (only viable choice)
NO → SQL is enough
🔵 Doris handles it well
Do you need > 100 concurrent query users?
YES → concurrency matters
🔵 Doris (Spark will break)
NO → batch / pipeline
🟠 Spark is fine
Is your data already in S3/HDFS and you want zero data copy?
YES → stay in lake
🟠 Spark reads native
NO → load into system
🔵 Doris ingestion pipeline
Need BOTH fast serving AND heavy transformation?
⚡ Use Both — Spark for ETL → Doris for Serving
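The checklist above can be collapsed into a small function. This is just the decision tree restated, with the combined serving-plus-ETL case checked first because it overrides the rest; thresholds and argument names are illustrative.

```python
def choose_engine(needs_subsecond, needs_ml_or_python, concurrent_users,
                  data_stays_in_lake, needs_serving_and_etl):
    """Encode the decision checklist. All arguments are booleans except
    concurrent_users, an estimated peak query-user count."""
    if needs_serving_and_etl:
        return "both"              # Spark for ETL -> Doris for serving
    if needs_ml_or_python:
        return "spark"             # only viable choice for compute-heavy work
    if needs_subsecond or concurrent_users > 100:
        return "doris"             # latency- or concurrency-critical
    if data_stays_in_lake:
        return "spark"             # reads S3/HDFS natively, zero data copy
    return "doris"

choose_engine(True, False, 500, False, False)   # 'doris'
choose_engine(False, True, 10, True, False)     # 'spark'
choose_engine(False, False, 50, False, True)    # 'both'
```

Real decisions weigh more factors (cost, team skills, existing infra), but these five questions cover the large majority of cases.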

🏁 The Real Verdict

Apache Doris is a laser-focused analytical database — it does one thing better than almost anything else: serve fast, concurrent SQL queries over fresh data. If you're building dashboards, real-time reports, or user-facing analytics where milliseconds matter and hundreds of users query simultaneously, Doris is your answer. It's operationally simple, MySQL-compatible, and will make your BI team ecstatic.

Apache Spark is a general-purpose distributed computing engine — the glue of the modern data stack. If you're transforming terabytes, training models, migrating data lakes, or building complex multi-step pipelines, Spark's expressiveness, ecosystem depth, and horizontal scalability are unmatched. It's the engine powering most Fortune 500 data platforms.

The most powerful architecture uses both: Spark handles the heavy lifting — cleaning, enriching, aggregating — and writes results to Doris, which serves them to users at sub-second speed. This Spark → Doris pattern is battle-tested at massive scale (ByteDance, Meituan, JD.com) and represents the ideal modern analytical architecture.

Doris = Serve Layer Spark = Transform Layer Doris for BI Concurrency Spark for ETL + ML Doris for Real-Time Freshness Spark for Data Lakes Best: Use Both Together