What is dbt?
dbt (data build tool) is an open-source command-line tool that enables data analysts and engineers to transform raw data in their warehouses using SQL.
🎯 Core Definition
dbt is the T in ELT (Extract, Load, Transform). It operates directly inside your data warehouse, letting you write modular SQL SELECT statements that dbt compiles and executes, creating tables and views for your analytics layer. Think of it as software engineering applied to data transformation.
dbt Core (Open Source)
The free, open-source CLI version. Run dbt locally or on your infrastructure. Full control over scheduling, orchestration, and compute. Works with BigQuery, Snowflake, Redshift, Databricks, Spark, DuckDB, and more.
dbt Cloud (Managed SaaS)
A managed platform with a web IDE, job scheduling, CI/CD, documentation hosting, observability, and team collaboration features — built on top of dbt Core. Includes a Semantic Layer for metric definitions.
The ELT Revolution
Traditional ETL transformed data before loading (expensive pipelines). dbt enables ELT: load raw data first into cloud warehouses, then transform in place using SQL. The warehouse is the engine — not a middleware layer.
Analytics Engineering
dbt popularized the Analytics Engineer role — bridging data engineering and analytics. Analysts own and maintain transformation pipelines using software engineering best practices like version control and testing.
How dbt Works
dbt orchestrates SQL transformations across a directed acyclic graph (DAG) of models. Here's the full execution lifecycle.
📄 A dbt Model (SQL file)
```sql
-- models/marts/orders_daily.sql
-- Reference other models with ref()
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

customers AS (
    SELECT * FROM {{ ref('stg_customers') }}
),

daily_agg AS (
    SELECT
        DATE_TRUNC('day', o.created_at) AS order_date,
        c.country,
        COUNT(o.order_id) AS total_orders,
        SUM(o.revenue)    AS total_revenue,
        AVG(o.revenue)    AS avg_order_value
    FROM orders o
    JOIN customers c USING (customer_id)
    GROUP BY 1, 2
)

SELECT * FROM daily_agg
```
⚙️ Model Config (YAML)
```yaml
# models/marts/schema.yml
version: 2

models:
  - name: orders_daily
    description: "Daily order aggregation by country"
    config:
      materialized: incremental
      unique_key: [order_date, country]
      partition_by:
        field: order_date
        data_type: date
        granularity: day
    columns:
      - name: order_date
        tests:
          - not_null
          - unique  # per country partition
      - name: total_revenue
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
```
🔧 Key dbt CLI Commands
```shell
# Initialize a new project
dbt init my_project

# Compile SQL without executing (shows rendered SQL)
dbt compile

# Run all models (creates tables/views in warehouse)
dbt run

# Run only changed models + downstream (slim CI)
dbt run --select state:modified+

# Run a specific model and all upstream dependencies
dbt run --select +orders_daily

# Execute all data quality tests
dbt test

# Generate and serve the documentation site
dbt docs generate && dbt docs serve

# Run incremental models only
dbt run --select config.materialized:incremental

# Snapshot slowly changing dimensions (SCD Type 2)
dbt snapshot
```
📋 Materializations Explained
| Type | What dbt Creates | When to Use | Cost |
|---|---|---|---|
| view | Virtual SQL view (no data stored) | Staging models, lightweight transforms | Query-time compute only |
| table | Full physical table (drop + rebuild) | Final mart tables, aggregations | Full rebuild on every run |
| incremental | Appends/merges only new rows | Large event tables, fact tables | Low — processes only delta |
| ephemeral | CTE injected into referencing models | Intermediate logic, no object created | Zero storage cost |
| snapshot | SCD Type 2 history table | Track row changes over time | Grows with history |
| dynamic table | Warehouse-managed refresh (Snowflake) | Near-real-time refresh without Airflow | Auto-refreshed by warehouse |
Why dbt? What Makes it Great?
dbt brings software engineering discipline — version control, testing, documentation, modularity — to the data transformation layer.
Dependency Management via DAG
Using ref() and source() functions, dbt auto-resolves all model dependencies and builds a Directed Acyclic Graph — no manual pipeline wiring needed.
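Concretely, raw tables are declared once in YAML and referenced with source(); dbt resolves both ref() and source() to fully qualified names at compile time and adds each reference as an edge in the DAG. A minimal sketch (schema and table names here are hypothetical):

```yaml
# models/staging/sources.yml
version: 2

sources:
  - name: raw           # group name used as source('raw', ...)
    schema: raw_data    # actual warehouse schema
    tables:
      - name: orders
      - name: customers
```

A model selecting `FROM {{ source('raw', 'orders') }}` then compiles to the qualified `raw_data.orders` table and gains a lineage edge back to that source.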
Built-in Data Testing
Define schema tests (not_null, unique, accepted_values, relationships) and custom SQL tests. Tests run against your warehouse — no separate testing framework needed.
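Custom SQL tests ("singular tests") are just SELECTs that return failing rows — the test passes when zero rows come back. A sketch against the orders_daily model described elsewhere in this document:

```sql
-- tests/assert_no_negative_revenue.sql
-- Fails if any daily aggregate has negative revenue
SELECT order_date, country, total_revenue
FROM {{ ref('orders_daily') }}
WHERE total_revenue < 0
```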
Auto-generated Documentation
Every model, column, test, and source is auto-documented into a browsable data catalog with lineage graphs. No more stale wikis — docs live alongside code.
DRY Code with Macros
Jinja2 templating + macros let you write reusable SQL logic. The dbt-utils and dbt-expectations packages add hundreds of pre-built macros and tests to your project.
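A macro is Jinja that expands into SQL wherever it is called. A minimal sketch, modeled on the common cents-to-dollars example from the dbt documentation:

```sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, precision=2) %}
    ROUND({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

In a model, `SELECT {{ cents_to_dollars('amount_cents') }} AS amount_usd` compiles to `SELECT ROUND(amount_cents / 100.0, 2) AS amount_usd`.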
Git-based Version Control
All models are plain .sql files — they live in Git. Teams can PR, review, branch, rollback, and audit transformations exactly like application code.
Warehouse-native Compute
dbt doesn't run compute itself — it pushes all execution down to your data warehouse (BigQuery, Snowflake, Spark, etc.), leveraging the warehouse's MPP engine and auto-scaling without adding a separate compute layer to manage.
Modular Layered Architecture
dbt enforces staging → intermediate → mart layers. Each layer has a clear responsibility, making pipelines understandable, testable, and maintainable at scale.
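A staging model in this pattern is typically just renaming, casting, and light cleanup over a single source — a minimal sketch (table and column names are hypothetical):

```sql
-- models/staging/stg_orders.sql
SELECT
    id         AS order_id,
    user_id    AS customer_id,
    amount     AS revenue,
    created_at
FROM {{ source('raw', 'orders') }}
```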
Semantic Layer (MetricFlow)
Define business metrics once in YAML (revenue, DAU, churn rate). All BI tools and queries consume a single consistent definition — no more metric disagreements.
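In recent dbt versions, metrics are defined in YAML on top of semantic models. The exact spec is set by MetricFlow and evolves between releases, so treat this as a rough sketch of the shape (names are hypothetical):

```yaml
# models/marts/metrics.yml -- rough sketch of a MetricFlow metric definition
metrics:
  - name: total_revenue
    label: Total Revenue
    type: simple
    type_params:
      measure: revenue   # measure defined in a semantic model
```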
Huge Ecosystem (Packages)
dbt Hub has 200+ open-source packages: dbt-utils, dbt-expectations, dbt-audit-helper, elementary, re_data, and adapters for every major warehouse.
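Adding a package is a two-step flow: declare it in packages.yml, then run `dbt deps` (the version pins below are illustrative):

```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_expectations
    version: 0.10.1
```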
🏆 What dbt is Particularly Good At
- Complex multi-hop transformations — staging → intermediate → aggregation → mart layers with full lineage
- Incremental processing — handling petabyte-scale fact tables by only processing new/changed data
- Slowly changing dimension tracking — SCD Type 2 via dbt Snapshots with zero custom code
- Cross-team data contracts — schema tests enforce SLAs between producers and consumers
- CI/CD for data pipelines — slim CI runs only affected models on PRs, saving hours of runtime
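The SCD Type 2 tracking mentioned above is configured in a snapshot block — a minimal sketch (source and column names are assumptions):

```sql
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}

{{ config(
    target_schema = 'snapshots',
    unique_key = 'customer_id',
    strategy = 'timestamp',
    updated_at = 'updated_at'
) }}

SELECT * FROM {{ source('raw', 'customers') }}

{% endsnapshot %}
```

On each `dbt snapshot` run, dbt adds `dbt_valid_from` / `dbt_valid_to` columns and closes out the old row whenever a tracked source row changes.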
How dbt is Built Internally
dbt Core is a Python application composed of several key subsystems that work together to compile, plan, and execute your transformation pipeline.
dbt Core Internal Subsystems
- CLI & config loader — entry point for commands like `dbt run` and `dbt test`; loads project config from dbt_project.yml and connection profiles from profiles.yml
- Parser — reads every .sql, .yml, and .py file, partially renders Jinja, and resolves ref() and source() calls; outputs a manifest.json — the complete project graph
- Graph & node selector — builds the DAG and applies selection flags (--select, --exclude)
- Runner — executes nodes against the warehouse, writes run_results.json, and manages retries and partial failures
- Artifacts — manifest.json (project graph), run_results.json (execution results), catalog.json (schema introspection), sources.json. Used for state-based selection in CI/CD

📁 Project File Structure
```text
my_dbt_project/
├── dbt_project.yml        # Project config
├── profiles.yml           # DB connections
├── packages.yml           # Package deps
├── models/
│   ├── staging/           # Raw → clean
│   │   ├── stg_orders.sql
│   │   └── schema.yml
│   ├── intermediate/      # Business logic
│   │   └── int_orders_with_customers.sql
│   └── marts/             # Analytics-ready
│       ├── orders_daily.sql
│       └── schema.yml
├── macros/                # Jinja macros
├── tests/                 # Custom SQL tests
├── snapshots/             # SCD Type 2
├── seeds/                 # Static CSV data
├── analyses/              # Ad-hoc SQL
└── target/                # Compiled output
    ├── manifest.json
    ├── run_results.json
    └── catalog.json
```
🔄 Incremental Strategy Logic
```sql
-- models/fact/fact_events.sql
{{ config(
    materialized = 'incremental',
    unique_key = 'event_id',
    incremental_strategy = 'merge',
    partition_by = {
        'field': 'event_date',
        'data_type': 'date',
        'granularity': 'day'
    },
    cluster_by = ['user_id', 'event_type']
) }}

SELECT
    event_id,
    user_id,
    event_type,
    DATE(event_timestamp) AS event_date,
    event_timestamp,
    properties
FROM {{ source('raw', 'events') }}

-- Only new records on incremental runs
{% if is_incremental() %}
WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}
```
How dbt Builds Your Pipeline
When you run dbt run, a precise sequence of steps transforms your SQL files into warehouse tables and views.
1. Project & Profile Loading — reads dbt_project.yml (model configs, variables, hooks) and profiles.yml (warehouse credentials + connection params). Validates adapter version compatibility.
2. File Parsing & Manifest Generation — every .sql and .yml file is parsed. Jinja is partially rendered to detect ref(), source(), and config() calls. A full manifest.json is written to target/.
3. DAG Construction & Topological Sort — all dependencies form a DAG. dbt uses topological sort (Kahn's algorithm) to determine the correct execution order. Models with no dependencies run first; downstream models follow. Cycles cause an error.
4. Node Selection & Filtering — applies --select, --exclude, and --selector flags. Supports graph operators: +model (upstream), model+ (downstream), state:modified (changed since last run), and tag/config selectors.
5. Jinja Compilation (Full Render) — each model's SQL is fully rendered: all macros expanded, ref() replaced with actual schema-qualified table names, config blocks resolved. Compiled SQL is saved to target/compiled/ for debugging.
6. Parallel Execution via Thread Pool — models whose dependencies have all succeeded are eligible to run in parallel. dbt uses a configurable thread count (default 4). Each thread opens a warehouse connection and submits the DDL/DML. Threads respect DAG ordering automatically.
7. Pre-hooks, Run, Post-hooks — for each model: run pre-hook SQL → execute model DDL/DML → run post-hook SQL. Hooks can grant permissions, update audit tables, or run custom SQL around each model execution.
8. Result Collection & Artifact Write — dbt collects row counts, timing, and success/error status for every node and writes run_results.json. Failed models are reported; their downstream dependents are automatically skipped to prevent data corruption.
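The topological sort in step 3 can be sketched in Python — a simplified illustration of Kahn's algorithm, not dbt's actual implementation (model names are taken from the earlier examples):

```python
from collections import deque

def topo_sort(deps):
    """Kahn's algorithm. deps maps each model to the models it depends on."""
    indegree = {n: len(parents) for n, parents in deps.items()}
    dependents = {n: [] for n in deps}   # reverse edges: who depends on me?
    for node, parents in deps.items():
        for p in parents:
            dependents[p].append(node)
    # Start with models that have no dependencies
    queue = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in dependents[node]:
            indegree[child] -= 1
            if indegree[child] == 0:     # all upstreams scheduled
                queue.append(child)
    if len(order) != len(deps):
        raise ValueError("Cycle detected in model graph")  # mirrors dbt's cycle error
    return order

deps = {
    "stg_orders": [],
    "stg_customers": [],
    "orders_daily": ["stg_orders", "stg_customers"],
}
print(topo_sort(deps))  # ['stg_orders', 'stg_customers', 'orders_daily']
```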
🧵 Parallelism Model
dbt's parallelism is bounded by the DAG structure. Independent branches run concurrently; dependent models wait. With --threads 8, up to 8 models execute simultaneously. For very wide pipelines, this dramatically cuts total run time. On dbt Cloud, threads can reach 32+.
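This dependency-aware scheduling can be sketched with a thread pool — again a simplified illustration of the idea, not dbt's actual runner, and it assumes an acyclic graph:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_dag(deps, run_model, threads=4):
    """Run each model as soon as all of its dependencies have finished.

    deps maps model -> list of upstream models; assumes an acyclic graph.
    """
    done = set()
    futures = {}          # future -> model name
    pending = set(deps)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while pending or futures:
            # Submit every model whose upstream dependencies are all complete
            for m in [m for m in pending if all(d in done for d in deps[m])]:
                pending.discard(m)
                futures[pool.submit(run_model, m)] = m
            # Block until at least one running model finishes
            finished, _ = wait(futures, return_when=FIRST_COMPLETED)
            for f in finished:
                done.add(futures.pop(f))
    return done

deps = {
    "stg_orders": [],
    "stg_customers": [],
    "orders_daily": ["stg_orders", "stg_customers"],  # waits for both
}
results = []
run_dag(deps, results.append, threads=2)
print(results)  # both staging models run first, orders_daily last
```

The two independent staging models run concurrently; orders_daily is only submitted once both have completed — the same bound the DAG imposes on dbt's threads.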
Setting Up dbt in Production
A production dbt deployment requires orchestration, CI/CD, environment separation, observability, and security. Here's the complete setup.
🏗️ Production Stack Architecture
🔐 profiles.yml (Production)
```yaml
# profiles.yml
my_project:
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-company-dev
      dataset: dbt_{{ env_var('DBT_USER') }}
      threads: 4
      timeout_seconds: 300
    prod:
      type: bigquery
      method: service-account
      project: my-company-prod
      dataset: dbt_production
      keyfile: "{{ env_var('BQ_KEYFILE_PATH') }}"
      threads: 16
      timeout_seconds: 3600
      location: US
      maximum_bytes_billed: 10000000000000
  target: dev
```
⚙️ Airflow DAG for dbt
```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from cosmos import DbtDag, ProjectConfig, ProfileConfig

# Option A: Astronomer Cosmos (recommended)
dbt_dag = DbtDag(
    dag_id="dbt_production_daily",
    project_config=ProjectConfig(
        dbt_project_path="/usr/local/airflow/dbt",
    ),
    profile_config=ProfileConfig(
        profile_name="my_project",
        target_name="prod",
    ),
    schedule_interval="0 4 * * *",  # 4 AM UTC
    operator_args={
        "vars": {"run_date": "{{ ds }}"},
        "dbt_cmd_flags": ["--threads", "16"],
    },
)

# Option B: BashOperator (simpler)
run_dbt = BashOperator(
    task_id="dbt_run",
    bash_command="""
        dbt run --target prod \
            --select tag:daily \
            --threads 16
    """,
)
```
🔄 GitHub Actions CI/CD Pipeline
```yaml
# .github/workflows/dbt_ci.yml
name: dbt CI/CD Pipeline

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  dbt_ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dbt
        run: pip install dbt-bigquery==1.8.0

      - name: dbt deps
        run: dbt deps

      - name: dbt compile (syntax check)
        run: dbt compile --target prod

      - name: Slim CI - run only changed models
        # Uses dbt state to only run modified models + downstream
        run: |
          dbt run --target prod \
            --select state:modified+ \
            --defer \
            --state ./prod-artifacts/

      - name: dbt test (only on changed models)
        run: |
          dbt test --target prod \
            --select state:modified+ \
            --defer --state ./prod-artifacts/

  dbt_prod_deploy:
    needs: dbt_ci
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Full production run
        run: dbt run --target prod --threads 16

      - name: Run all tests in production
        run: dbt test --target prod --threads 8
```
dbt at Petabyte (PB) Data Volume
Running dbt against petabyte-scale datasets demands careful materialization strategies, partitioning, incremental patterns, and warehouse-specific tuning.
⚠️ The Core Challenge at PB Scale
At petabyte volumes, a full table materialization means scanning and rewriting terabytes on every run — costing hours and thousands of dollars. The solution: aggressive incremental processing, partition pruning, and warehouse-native optimizations.
Strategy 1: Partition-First Design
Partition all large fact tables by date. Every query and incremental merge scans only the relevant partitions, reducing compute by 100x–1000x. Combine with clustering on high-cardinality filter columns (user_id, event_type).
Strategy 2: Micro-batch Incremental
Instead of daily full refreshes, run incrementals every 15–60 minutes processing only new records. Use insert_overwrite strategy to replace entire date partitions atomically — faster than row-level MERGE on huge tables.
Strategy 3: Model Decomposition
Break one massive model into partitioned sub-models. Process each partition in parallel using dbt's thread pool or Airflow parallelism. Recombine downstream. This converts one 6-hour job into many 15-minute jobs running in parallel.
Strategy 4: Intermediate Caching
Materialize expensive intermediate aggregations as table or incremental models rather than ephemeral or view. Avoids recomputing the same expensive joins/aggregations multiple times across dependent models.
🔧 PB-Scale BigQuery Configuration
```sql
-- models/fact/fact_events_pb.sql (handles petabyte-scale event logs)
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',  -- replaces whole partitions atomically
    unique_key = 'event_id',
    partition_by = {
        'field': 'event_date',
        'data_type': 'date',
        'granularity': 'day'
    },
    cluster_by = ['user_id', 'event_type', 'platform'],
    require_partition_filter = true,            -- NEVER allow full table scans!
    on_schema_change = 'append_new_columns',
    hours_to_expiration = none,                 -- production data never expires
    labels = {'pipeline': 'dbt', 'team': 'data_engineering'}
) }}

WITH raw_events AS (
    SELECT
        event_id,
        user_id,
        LOWER(event_type) AS event_type,
        platform,
        DATE(event_timestamp, 'UTC') AS event_date,
        event_timestamp,
        JSON_VALUE(properties, '$.session_id') AS session_id,
        properties
    FROM {{ source('raw_events', 'events') }}
    WHERE event_id IS NOT NULL
      AND event_timestamp IS NOT NULL
    -- Incremental filter: only process the last N days on each run
    {% if is_incremental() %}
      AND DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)   -- 3-day lookback window
    {% else %}
      -- First full build: last 90 days only (bootstrap gradually)
      AND DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
    {% endif %}
)

SELECT * FROM raw_events
```
⚙️ dbt_project.yml — Production Tuning
```yaml
# dbt_project.yml
name: my_project
version: '1.0.0'
config-version: 2

vars:
  lookback_days: 3                    # global incremental window
  max_bytes_billed: 10000000000000    # 10TB query cost guard

models:
  my_project:
    staging:
      +materialized: view             # always views in staging
      +schema: staging
    intermediate:
      +materialized: ephemeral        # no storage for intermediate
    marts:
      +materialized: table            # final marts are full tables
      +schema: marts
      +grants:
        select: ["group:bi-analysts@company.com"]
    fact:
      +materialized: incremental      # all facts are incremental
      +schema: fact
      +tags: ["fact", "incremental"]

on-run-start:
  - "CALL my_project.pre_run_checks()"   # custom pre-run validation

on-run-end:
  - "{{ log_run_metadata() }}"           # write run stats to audit table
  - "{{ elementary.upload_dbt_artifacts() }}"
```
📋 PB Scale Best Practices Checklist
- Always partition large tables — date-based partitioning is mandatory for fact tables over 10GB. Enable `require_partition_filter` to prevent accidental full scans.
- Use the `insert_overwrite` strategy — for BigQuery and Spark, replacing full date partitions is faster than row-level MERGE/UPSERT at PB scale; MERGE causes full partition scans.
- Lookback-window incremental logic — process the last N days (not just the latest timestamp) to handle late-arriving events. Tune N based on your data SLA.
- Cluster by high-selectivity columns — Snowflake micro-partitions and BigQuery clustering reduce bytes scanned by 10x–100x for filtered queries.
- Avoid SELECT * from huge tables — always project only the needed columns. Use staging models as a projection + filtering layer before joins.
- Set query cost guards — BigQuery's `maximum_bytes_billed` and Snowflake's `statement_timeout_in_seconds` prevent runaway queries from costing thousands.
- Increase thread count for production — the default of 4 threads is too low. Use 16–32 threads in production to parallelize independent model branches.
- Use dbt Slim CI in PRs — `state:modified+` selection means PRs only run changed models, not the full pipeline, saving 95%+ of CI runtime and cost.
- Bootstrap large tables in batches — on the first build, use a custom macro to process historical data in monthly or weekly chunks instead of one massive query.
- Monitor with Elementary or Monte Carlo — attach data observability to detect schema changes, volume anomalies, and freshness violations before they impact dashboards.
- Materialize expensive intermediates — if an intermediate model is referenced by 3+ downstream models, materialize it as a table to avoid recomputing it multiple times.
- Separate dev / CI / prod environments — dev uses personal schemas (`dbt_username`), CI uses isolated schemas, prod writes to blessed schemas. Never let a dev run overwrite production data.
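One way to implement the batch-bootstrap item above is a var-driven date window. A hedged sketch — the macro name and vars are hypothetical, not dbt built-ins:

```sql
-- macros/incremental_window.sql
{% macro incremental_window(ts_column) %}
  {% if var('backfill_start', none) is not none %}
    -- Backfill mode: process one explicit chunk, e.g.
    -- dbt run --select fact_events --vars '{backfill_start: "2024-01-01", backfill_end: "2024-01-31"}'
    DATE({{ ts_column }}) BETWEEN '{{ var("backfill_start") }}' AND '{{ var("backfill_end") }}'
  {% elif is_incremental() %}
    -- Normal run: rolling lookback window
    DATE({{ ts_column }}) >= DATE_SUB(CURRENT_DATE(), INTERVAL {{ var("lookback_days", 3) }} DAY)
  {% else %}
    TRUE  -- first build without backfill vars
  {% endif %}
{% endmacro %}
```

Calling `WHERE {{ incremental_window('event_timestamp') }}` in a fact model lets an orchestrator loop over monthly chunks for the initial build, then fall back to the lookback window for scheduled runs.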
📊 Environment Strategy Summary
| Environment | Target | Schema | Threads | Trigger | Data Scope |
|---|---|---|---|---|---|
| Development | dev | dbt_username | 4 | Manual / dbt CLI | Last 7–30 days |
| CI / PR | ci | dbt_ci_prXXX | 8 | Pull Request | Modified models + 3 days data |
| Staging | staging | dbt_staging | 16 | Merge to main | Last 90 days |
| Production | prod | dbt_production | 32 | Scheduled (Airflow/dbt Cloud) | Full historical data |
🚀 Key Numbers at PB Scale (Reference)
- BigQuery: can scan terabytes per minute at high slot counts. With partitioning, incremental runs drop to ~10–50GB scanned per run.
- Snowflake: X-Large warehouse (16 nodes) handles most dbt workloads. Use multi-cluster warehouses for concurrent runs.
- Databricks: Use dbt-spark or dbt-databricks. Enable Delta Lake for MERGE support and time travel. Optimize for Photon engine.
- Incremental runs at PB scale: Should complete in <30 min with proper partitioning. Full rebuilds reserved for schema migrations only.