StarRocks (formerly DorisDB) is a next-generation, open-source, fully vectorized MPP (Massively Parallel Processing) OLAP database engine. It originated as a fork of Apache Doris — itself an Apache Software Foundation top-level project originally developed at Baidu — was later renamed from DorisDB to StarRocks, and has since diverged substantially, with its own cost-based optimizer, pipeline execution engine, and storage-layer improvements. The project is licensed under Apache 2.0 and was donated to the Linux Foundation in 2023.
Typical use cases:
• Ad-hoc exploratory analytics
• Real-time metrics monitoring
• Multidimensional OLAP analysis
• High-concurrency reporting

Core engine design:
• Columnar storage format
• Cost-Based Optimizer (CBO)
• Pipeline execution engine
• Adaptive query scheduling

Interfaces and ecosystem:
• ANSI SQL-99 + extensions
• JDBC/ODBC drivers
• Spark/Flink connectors
• BI tool native support
StarRocks processes queries using a pipeline-based, fully vectorized MPP engine. A query enters through a Frontend (FE) node, gets parsed and optimized by the CBO, distributed as fragments to Backend (BE) nodes, executed in parallel with vectorized operators, and the results are merged and returned.
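These stages can be observed directly with EXPLAIN, which prints the plan as numbered PLAN FRAGMENT sections, with EXCHANGE operators marking the shuffle boundaries between BE nodes. A sketch against the `events` table whose schema appears later in this document:

```sql
-- Inspect the distributed plan the CBO produces for a simple aggregation.
EXPLAIN
SELECT event_type, COUNT(*) AS cnt
FROM events
WHERE event_date = '2024-06-01'
GROUP BY event_type;
-- The output lists plan fragments: scan plus partial aggregation run on each BE,
-- then an EXCHANGE shuffles rows for the final merge returned through the FE.
```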
StarRocks uses a classic separation of Frontend (Metadata/Coordination) and Backend (Compute/Storage) nodes. In shared-nothing deployments, BE nodes own data locally. In shared-data (cloud-native) mode, storage is decoupled to object storage.
[Architecture diagram: BI tools (Tableau/Superset) and connectors reach the FE over port 9030 (MySQL protocol) or push data via Stream Load; the FE leader and two follower read replicas coordinate metadata through BDB-JE with Raft; BE nodes handle execution and columnar storage and scale horizontally, keeping hot data on local disks and cold/shared data in object storage.]
The FE handles the control plane:
Query optimization — CBO generates and selects the optimal physical plan.
Metadata management — stores all catalog, table, partition, and tablet metadata using BDB-JE (Berkeley DB Java Edition) with Raft-based consensus for HA.
Load scheduling — coordinates data ingestion jobs (Broker Load, Routine Load, Stream Load).

The BE handles the data plane:
Query execution — receives plan fragments from FE, executes vectorized operators (scan, filter, join, agg, sort).
Data exchange — shuffles intermediate data between BEs during joins/aggregations over the internal network.
Compaction — background merging of delta rowsets into base rowsets for read efficiency.
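Both roles can be inspected at runtime over the MySQL protocol; a quick cluster health check might look like:

```sql
SHOW FRONTENDS;  -- role (LEADER/FOLLOWER), alive status, edit-log sync state
SHOW BACKENDS;   -- alive status, tablet counts, disk usage per BE
```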
In published benchmarks (such as SSB- and TPC-H-style workloads), StarRocks is often reported to be 3–10x faster than Hive, Presto/Trino, and ClickHouse on multi-table join analytics, though results vary with workload and configuration. Here's a deep look at its key strengths.
| Feature | StarRocks | ClickHouse | Presto/Trino | Apache Druid |
|---|---|---|---|---|
| Multi-table JOINs | ✅ Excellent | ⚠️ Limited | ✅ Good | ❌ Poor |
| Real-time Ingestion | ✅ <1s latency | ✅ Good | ❌ Query only | ✅ Sub-second |
| Data Lake Query | ✅ Native Catalogs | ⚠️ Limited | ✅ Excellent | ❌ No |
| UPDATE / DELETE | ✅ Primary Key | ⚠️ Mutation | ❌ No | ⚠️ Segment |
| Materialized Views | ✅ Auto-rewrite | ✅ Manual | ❌ No | ⚠️ Rollup only |
| Open Source License | ✅ Apache 2.0 | ✅ Apache 2.0 | ✅ Apache 2.0 | ✅ Apache 2.0 |
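The UPDATE/DELETE row deserves a concrete illustration: on a primary-key table, StarRocks accepts standard DML and applies it via delete-and-insert on the key, where ClickHouse only approximates the same with asynchronous mutations. A sketch against the `events` primary-key table defined below:

```sql
-- Correct a single user's record in place (requires a primary-key table).
UPDATE events
SET amount = 0
WHERE event_date = '2024-06-01' AND user_id = 42;

-- Remove the rows entirely.
DELETE FROM events
WHERE event_date = '2024-06-01' AND user_id = 42;
```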
Deploying StarRocks at PB scale requires careful hardware sizing, replication topology, partitioning strategy, and operational tooling. Below is a production-grade reference deployment.
| Role | Count | CPU | RAM | Storage | Network |
|---|---|---|---|---|---|
| FE Leader | 1 | 16–32 cores | 64–128 GB | 500GB SSD (meta) | 25 GbE |
| FE Follower | 2 | 16–32 cores | 64–128 GB | 500GB SSD (meta) | 25 GbE |
| BE Node (hot) | 20–100+ | 32–64 cores | 256–512 GB | 8x 3.8TB NVMe SSD | 25–100 GbE |
| S3 / HDFS | — | — | — | Unlimited (cold data) | 10 GbE min |
On multi-homed hosts, set priority_networks to specify the right network interface. A representative fe.conf for production:
meta_dir = /data/starrocks/fe/meta
priority_networks = 10.0.0.0/24
http_port = 8030
rpc_port = 9020
query_port = 9030
edit_log_port = 9010
sys_log_level = INFO
# Start leader first, then add followers via SQL:
# ALTER SYSTEM ADD FOLLOWER "fe2:9010";
A representative be.conf, spreading storage across multiple NVMe devices:
storage_root_path = /nvme0/starrocks;/nvme1/starrocks;/nvme2/starrocks
priority_networks = 10.0.0.0/24
be_port = 9060
be_http_port = 8040
heartbeat_service_port = 9050
brpc_port = 8060
mem_limit = 80%
# Register BE to FE cluster:
# ALTER SYSTEM ADD BACKEND "be1:9050";
CREATE TABLE events (
event_date DATE NOT NULL,
user_id BIGINT NOT NULL,
event_type VARCHAR(64),
properties JSON,
amount DECIMAL(18,4),
created_at DATETIME
) ENGINE=OLAP
PRIMARY KEY (event_date, user_id)
PARTITION BY RANGE(event_date) (
START ("2024-01-01") END ("2026-12-31")
EVERY INTERVAL 1 MONTH
)
DISTRIBUTED BY HASH(user_id) BUCKETS 128
PROPERTIES (
"replication_num" = "3",
"storage_medium" = "SSD",
"storage_cooldown_ttl" = "30 day"
);
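With monthly range partitions and 128 hash buckets, each partition is split into 128 tablets, each replicated three times. Partition layout, storage medium, and cooldown state can be checked per partition, and filters on the partition column are pruned automatically:

```sql
-- One row per monthly partition: range, bucket count, replication, medium.
SHOW PARTITIONS FROM events;

-- A filter on event_date is pruned to the matching partitions only.
SELECT COUNT(*) FROM events
WHERE event_date BETWEEN '2024-06-01' AND '2024-06-30';
```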
Routine Load continuously consumes Kafka partitions into the target table; ingestion parallelism is controlled by desired_concurrent_number. A sketch (the job name events_ingest is illustrative; COLUMNS TERMINATED BY applies only to CSV and is omitted here since the format is JSON):
CREATE ROUTINE LOAD events_ingest ON events
COLUMNS (event_date, user_id, event_type, properties, amount, created_at)
PROPERTIES (
"desired_concurrent_number" = "8",
"max_batch_interval" = "10",
"max_error_number" = "1000",
"format" = "json"
)
FROM KAFKA (
"kafka_broker_list" = "broker1:9092,broker2:9092",
"kafka_topic" = "prod.events",
"kafka_partitions" = "0,1,2,3,4,5,6,7"
);
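Once created, the job runs continuously in the background. Assuming it was named events_ingest (illustrative), it can be monitored and controlled with:

```sql
SHOW ROUTINE LOAD FOR events_ingest;   -- state, lag, consumed/error row counts
PAUSE ROUTINE LOAD FOR events_ingest;  -- stop consuming without losing offsets
RESUME ROUTINE LOAD FOR events_ingest;
```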
An asynchronous materialized view keeps the aggregates fresh (the view name events_summary_mv is illustrative):
CREATE MATERIALIZED VIEW events_summary_mv
DISTRIBUTED BY HASH(event_date) BUCKETS 32
REFRESH ASYNC EVERY(INTERVAL 5 MINUTE)
AS
SELECT
event_date,
event_type,
COUNT(*) AS event_count,
SUM(amount) AS total_amount,
COUNT(DISTINCT user_id) AS dau
FROM events
GROUP BY event_date, event_type;
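Because StarRocks supports transparent query rewrite, queries against the base table can be answered from the view when their aggregation rolls up from it; EXPLAIN confirms whether the rewrite fired:

```sql
-- If auto-rewrite applies, the plan scans the materialized view
-- (assumed here to be named events_summary_mv) instead of events.
EXPLAIN
SELECT event_date, SUM(amount) AS total_amount
FROM events
GROUP BY event_date;
```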
Resource groups isolate workloads from each other (the group name bi_group is illustrative):
CREATE RESOURCE GROUP bi_group
TO (user=bi_user, role=analyst)
WITH (
"cpu_core_limit" = "32",
"mem_limit" = "40%",
"concurrency_limit" = "50",
"type" = "normal"
);
CREATE RESOURCE GROUP etl_group
TO (user=etl_user)
WITH (
"cpu_core_limit" = "8",
"mem_limit" = "20%",
"type" = "slow_query"
);
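Group definitions, classifiers, and limits can then be verified with:

```sql
SHOW RESOURCE GROUPS ALL;  -- every group, its classifiers, and resource limits
```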
StarRocks does not operate in isolation — it's most powerful as the query and serving layer in a broader big data platform. Below are the best open-source frameworks for each layer of the data stack.
This is the recommended full-stack open-source big data architecture with StarRocks as the core analytical engine — covering data sources, ingestion, processing, storage, query, and BI layers.
| Layer | Framework | Role in Platform | Interface with StarRocks |
|---|---|---|---|
| Batch | Apache Spark | Large-scale data transformation, ML feature prep | spark-connector-starrocks (Sink/Source) |
| Batch | dbt | SQL-based in-warehouse ELT transformations | dbt-starrocks adapter (MySQL dialect) |
| Batch | Apache Airflow | Pipeline orchestration, scheduling | StarRocksHook, MySQLOperator |
| Streaming | Apache Flink | Real-time ETL, CEP, windowed aggregation | flink-connector-starrocks (StreamLoad) |
| Streaming | Apache Kafka | Message bus, event streaming backbone | Routine Load (native consumer) |
| Streaming | Debezium | CDC from RDBMS to Kafka | Via Kafka → Routine Load / Flink |
| Storage | Apache Iceberg | Open table format for data lake | External Catalog (Iceberg Catalog) |
| Storage | S3 / HDFS | Raw file storage, cold data tier | Broker Load, Shared-data mode |
| Storage | Apache Hudi | Incremental upsert table format | External Catalog (Hudi Catalog) |
| Query | StarRocks FE | CBO optimization, metadata, SQL parsing | Core engine |
| Query | StarRocks BE | Vectorized execution, columnar storage | Core engine |
| BI | Apache Superset | Self-serve dashboards, SQL Lab | MySQL connector / SQLAlchemy |
| BI | Grafana | Real-time operational metrics dashboards | MySQL data source plugin |
| Governance | Apache Atlas | Data lineage and catalog | Via Hive Metastore integration |
| Governance | Apache Ranger | Centralized security and RBAC | Ranger StarRocks plugin |
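As one concrete integration from the table above, an Iceberg external catalog lets StarRocks query lake tables in place, without ingestion. A sketch assuming a Hive Metastore at metastore:9083 and an iceberg_lake.raw_db.page_views table (all names hypothetical):

```sql
CREATE EXTERNAL CATALOG iceberg_lake
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "hive",
    "hive.metastore.uris" = "thrift://metastore:9083"
);

-- Lake tables are then addressable as catalog.database.table:
SELECT COUNT(*) FROM iceberg_lake.raw_db.page_views;
```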
StarRocks is licensed under Apache 2.0 — github.com/StarRocks/starrocks