Big Data Platform Design Guide

Everything about
dbt — Data Build Tool

A comprehensive guide covering what dbt is, how it works, why it matters in modern data stacks, its internal architecture, and how to deploy it at petabyte scale in production.

🔧 dbt Core & Cloud 📦 Models & Materializations 🧪 Testing & Docs 🏗️ Architecture 🚀 Production at PB Scale

What is dbt?

dbt (data build tool) is an open-source command-line tool that enables data analysts and engineers to transform raw data in their warehouses using SQL.

🎯 Core Definition

dbt is the T in ELT (Extract, Load, Transform). It operates directly inside your data warehouse, letting you write modular SQL SELECT statements that dbt compiles and executes, creating tables and views for your analytics layer. Think of it as software engineering applied to data transformation.

30K+
Companies using dbt
ELT
Paradigm (not ETL)
SQL
Primary language
Jinja2
Templating engine
📦

dbt Core (Open Source)

The free, open-source CLI version. Run dbt locally or on your infrastructure. Full control over scheduling, orchestration, and compute. Works with BigQuery, Snowflake, Redshift, Databricks, Spark, DuckDB, and more.

☁️

dbt Cloud (Managed SaaS)

A managed platform with a web IDE, job scheduling, CI/CD, documentation hosting, observability, and team collaboration features — built on top of dbt Core. Includes a Semantic Layer for metric definitions.

🧠

The ELT Revolution

Traditional ETL transformed data before loading (expensive pipelines). dbt enables ELT: load raw data first into cloud warehouses, then transform in place using SQL. The warehouse is the engine — not a middleware layer.

🔄

Analytics Engineering

dbt created the Analytics Engineer role — bridging data engineering and analytics. Analysts own and maintain transformation pipelines using software engineering best practices like version control and testing.


How dbt Works

dbt orchestrates SQL transformations across a directed acyclic graph (DAG) of models. Here's the full execution lifecycle.

📝
SQL Models
Write SELECT statements
🔧
Jinja Compile
Macros + refs resolved
🔗
DAG Resolution
Dependency ordering
⚙️
Warehouse Execution
DDL/DML runs in-warehouse
Tests + Docs
Auto-validation & catalog

📄 A dbt Model (SQL file)

-- models/marts/orders_daily.sql
-- Reference other models with ref()

WITH orders AS (
  SELECT * FROM {{ ref('stg_orders') }}
),

customers AS (
  SELECT * FROM {{ ref('stg_customers') }}
),

daily_agg AS (
  SELECT
    DATE_TRUNC('day', o.created_at) AS order_date,
    c.country,
    COUNT(o.order_id)        AS total_orders,
    SUM(o.revenue)           AS total_revenue,
    AVG(o.revenue)           AS avg_order_value
  FROM orders o
  JOIN customers c USING (customer_id)
  GROUP BY 1, 2
)

SELECT * FROM daily_agg

⚙️ Model Config (YAML)

# models/marts/schema.yml
version: 2

models:
  - name: orders_daily
    description: "Daily order aggregation by country"
    config:
      materialized: incremental
      unique_key: [order_date, country]
      partition_by:
        field: order_date
        data_type: date
        granularity: day

    columns:
      - name: order_date
        tests:
          - not_null
          # not unique on its own; the table's grain is (order_date, country)
      - name: total_revenue
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0

🔧 Key dbt CLI Commands

# Initialize a new project
dbt init my_project

# Compile SQL without executing (shows rendered SQL)
dbt compile

# Run all models (creates tables/views in warehouse)
dbt run

# Run only changed models + downstream (slim CI; requires prior-run artifacts)
dbt run --select state:modified+ --state path/to/prod-artifacts

# Run specific model and all upstream dependencies
dbt run --select +orders_daily

# Execute all data quality tests
dbt test

# Generate and serve documentation site
dbt docs generate && dbt docs serve

# Run incremental models only
dbt run --select config.materialized:incremental

# Snapshot slowly changing dimensions (SCD Type 2)
dbt snapshot

📋 Materializations Explained

Type          | What dbt Creates                      | When to Use                             | Cost
view          | Virtual SQL view (no data stored)     | Staging models, lightweight transforms  | Query-time compute only
table         | Full physical table (drop + rebuild)  | Final mart tables, aggregations         | Full rebuild on every run
incremental   | Appends/merges only new rows          | Large event tables, fact tables         | Low — processes only delta
ephemeral     | CTE injected into referencing models  | Intermediate logic, no object created   | Zero storage cost
snapshot      | SCD Type 2 history table              | Track row changes over time             | Grows with history
dynamic table | Warehouse-managed refresh (Snowflake) | Near-real-time refresh without Airflow  | Auto-refreshed by warehouse

Why dbt? What Makes it Great?

dbt brings software engineering discipline — version control, testing, documentation, modularity — to the data transformation layer.

🔗

Dependency Management via DAG

Using ref() and source() functions, dbt auto-resolves all model dependencies and builds a Directed Acyclic Graph — no manual pipeline wiring needed.

🧪

Built-in Data Testing

Define schema tests (not_null, unique, accepted_values, relationships) and custom SQL tests. Tests run against your warehouse — no separate testing framework needed.

📚

Auto-generated Documentation

Every model, column, test, and source is auto-documented into a browsable data catalog with lineage graphs. No more stale wikis — docs live alongside code.

♻️

DRY Code with Macros

Jinja2 templating + macros let you write reusable SQL logic. The dbt-utils and dbt-expectations packages add hundreds of pre-built macros and tests to your project.

🔀

Git-based Version Control

All models are plain .sql files — they live in Git. Teams can PR, review, branch, rollback, and audit transformations exactly like application code.

Warehouse-native Compute

dbt doesn't run compute itself — it pushes all execution to your data warehouse (BigQuery, Snowflake, Spark, etc.), leveraging the warehouse's MPP engine and auto-scaling without adding a separate compute layer to operate.

🏗️

Modular Layered Architecture

dbt enforces staging → intermediate → mart layers. Each layer has a clear responsibility, making pipelines understandable, testable, and maintainable at scale.

📊

Semantic Layer (MetricFlow)

Define business metrics once in YAML (revenue, DAU, churn rate). All BI tools and queries consume a single consistent definition — no more metric disagreements.

🔌

Huge Ecosystem (Packages)

dbt Hub has 200+ open-source packages: dbt-utils, dbt-expectations, dbt-audit-helper, elementary, re_data, and adapters for every major warehouse.

🏆 What dbt is Particularly Good At

  • Complex multi-hop transformations — staging → intermediate → aggregation → mart layers with full lineage
  • Incremental processing — handling petabyte-scale fact tables by only processing new/changed data
  • Slowly changing dimension tracking — SCD Type 2 via dbt Snapshots with zero custom code
  • Cross-team data contracts — schema tests enforce SLAs between producers and consumers
  • CI/CD for data pipelines — slim CI runs only affected models on PRs, saving hours of runtime

How dbt is Built Internally

dbt Core is a Python application composed of several key subsystems that work together to compile, plan, and execute your transformation pipeline.

dbt Core Internal Subsystems

🖥️
CLI / Programmatic Interface
Entry point. Parses user commands (dbt run, dbt test), loads profiles and project config from dbt_project.yml and profiles.yml
📂
Manifest Builder (Parser)
Reads all .sql, .yml, .py files. Parses Jinja, resolves ref() and source() calls. Outputs a manifest.json — the complete project graph
🔗
DAG Engine (Graph Module)
Builds and validates the Directed Acyclic Graph using NetworkX. Performs topological sort for execution order. Detects cycles. Applies node selectors (--select, --exclude)
Jinja2 Compilation Engine
Renders Jinja templates with context: model configs, macros, variables. Produces raw SQL. Supports custom macro libraries. Injects adapter-specific SQL via the dispatch system
🔀
Adapter Layer (Plugin Interface)
Abstract interface to each warehouse (BigQuery, Snowflake, Redshift, Spark, DuckDB…). Each adapter implements: connection pool, DDL generation, column type mapping, incremental merge strategies
🏃
Runner (Execution Engine)
Thread-pool based parallel executor. Respects DAG order. Sends compiled SQL to adapter. Captures results, timing, and row counts into run_results.json. Manages retries and partial failures
🗄️
Artifacts Store
Writes manifest.json (project graph), run_results.json (execution results), catalog.json (schema introspection), sources.json. Used for state-based selection in CI/CD
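The manifest-building step can be illustrated with a toy sketch: a regex that mimics how dbt discovers ref() calls and turns them into the dependency mapping the DAG engine consumes. dbt's real parser is far more involved (Jinja-aware, handles sources, configs, and Python models); the model names and SQL below are made up for illustration.

```python
import re

# Matches {{ ref('model_name') }} calls in model SQL (simplified).
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")

def extract_refs(sql: str):
    """Return model names referenced via {{ ref('...') }}."""
    return REF_PATTERN.findall(sql)

def build_dependency_graph(models: dict):
    """Map each model name to the upstream models it depends on."""
    return {name: extract_refs(sql) for name, sql in models.items()}

models = {
    "stg_orders": "SELECT * FROM raw.orders",
    "stg_customers": "SELECT * FROM raw.customers",
    "orders_daily": """
        SELECT * FROM {{ ref('stg_orders') }} o
        JOIN {{ ref('stg_customers') }} c USING (customer_id)
    """,
}

graph = build_dependency_graph(models)
print(graph["orders_daily"])  # ['stg_orders', 'stg_customers']
```

This mapping is the essence of what manifest.json encodes for each node before the DAG engine takes over.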

📁 Project File Structure

my_dbt_project/
├── dbt_project.yml       # Project config
├── profiles.yml          # DB connections
├── packages.yml          # Package deps
├── models/
│   ├── staging/          # Raw → clean
│   │   ├── stg_orders.sql
│   │   └── schema.yml
│   ├── intermediate/     # Business logic
│   │   └── int_orders_with_customers.sql
│   └── marts/            # Analytics-ready
│       ├── orders_daily.sql
│       └── schema.yml
├── macros/               # Jinja macros
├── tests/                # Custom SQL tests
├── snapshots/            # SCD Type 2
├── seeds/                # Static CSV data
├── analyses/             # Ad-hoc SQL
└── target/               # Compiled output
    ├── manifest.json
    ├── run_results.json
    └── catalog.json

🔄 Incremental Strategy Logic

-- models/fact/fact_events.sql
{{
  config(
    materialized = 'incremental',
    unique_key   = 'event_id',
    incremental_strategy = 'merge',
    partition_by = {
      'field': 'event_date',
      'data_type': 'date',
      'granularity': 'day'
    },
    cluster_by = ['user_id', 'event_type']
  )
}}

SELECT
  event_id,
  user_id,
  event_type,
  DATE(event_timestamp) AS event_date,
  event_timestamp,
  properties
FROM {{ source('raw', 'events') }}

-- Only new records on incremental runs
{% if is_incremental() %}
  WHERE event_timestamp >
    (SELECT MAX(event_timestamp)
     FROM {{ this }})
{% endif %}
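Conceptually, is_incremental() gates a high-watermark filter: keep only source rows newer than the newest row already in the target table. A plain-Python sketch of that semantics (rows as dicts; this is illustrative, not dbt code):

```python
from datetime import datetime

def incremental_filter(existing_rows, source_rows, ts_key="event_timestamp"):
    """Mimic the is_incremental() pattern: on the first run take everything,
    afterwards take only rows newer than the target's max timestamp."""
    if not existing_rows:                       # first run: full build
        return list(source_rows)
    watermark = max(r[ts_key] for r in existing_rows)
    return [r for r in source_rows if r[ts_key] > watermark]

existing = [{"event_id": 1, "event_timestamp": datetime(2024, 1, 1)}]
source = [
    {"event_id": 1, "event_timestamp": datetime(2024, 1, 1)},  # already loaded
    {"event_id": 2, "event_timestamp": datetime(2024, 1, 2)},  # new
]
print(incremental_filter(existing, source))  # only event_id 2 survives
```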

How dbt Builds Your Pipeline

When you run dbt run, a precise sequence of steps transforms your SQL files into warehouse tables and views.

  • 1

    Project & Profile Loading

    Reads dbt_project.yml (model configs, variables, hooks) and profiles.yml (warehouse credentials + connection params). Validates adapter version compatibility.

  • 2

    File Parsing & Manifest Generation

    Every .sql and .yml file is parsed. Jinja is partially rendered to detect ref(), source(), and config() calls. A full manifest.json is written to target/.

  • 3

    DAG Construction & Topological Sort

    All dependencies form a DAG. dbt uses topological sort (Kahn's algorithm) to determine the correct execution order. Models with no dependencies run first; downstream models follow. Cycles cause an error.

  • 4

    Node Selection & Filtering

    Applies --select, --exclude, and --selector flags. Supports graph operators: +model (upstream), model+ (downstream), state:modified (changed since last run), and tag/config selectors.

  • 5

    Jinja Compilation (Full Render)

    Each model's SQL is fully rendered: all macros expanded, ref() replaced with actual schema-qualified table names, config blocks resolved. Compiled SQL is saved to target/compiled/ for debugging.

  • 6

    Parallel Execution via Thread Pool

    Models whose dependencies have all succeeded are eligible to run in parallel. dbt uses a configurable thread count (default 4). Each thread opens a warehouse connection and submits the DDL/DML. Threads respect DAG ordering automatically.

  • 7

    Pre-hooks, Run, Post-hooks

    For each model: run pre-hook SQL → execute model DDL/DML → run post-hook SQL. Hooks enable grant permissions, update audit tables, or run custom SQL around each model execution.

  • 8

    Result Collection & Artifact Write

    dbt collects row counts, timing, and success/error status for every node. Writes run_results.json. Failed models are reported; their downstream dependents are automatically skipped to prevent data corruption.
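The topological ordering in step 3 can be sketched with a minimal Kahn's algorithm over a hypothetical deps mapping (model → its upstream models), including the cycle check dbt performs:

```python
from collections import deque

def topological_order(deps: dict):
    """Kahn's algorithm: repeatedly emit nodes with no unmet dependencies."""
    indegree = {node: len(parents) for node, parents in deps.items()}
    children = {node: [] for node in deps}
    for node, parents in deps.items():
        for parent in parents:
            children[parent].append(node)
    queue = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(deps):         # leftover nodes => a cycle exists
        raise ValueError("cycle detected in model graph")
    return order

deps = {
    "stg_orders": [],
    "stg_customers": [],
    "orders_daily": ["stg_orders", "stg_customers"],
}
print(topological_order(deps))  # staging models first, the mart last
```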

🧵 Parallelism Model

dbt's parallelism is bounded by the DAG structure. Independent branches run concurrently; dependent models wait. With --threads 8, up to 8 models execute simultaneously. For very wide pipelines, this dramatically cuts total run time. On dbt Cloud, threads can reach 32+.
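A simplified sketch of this DAG-bounded parallelism: a thread pool that only submits a model once all of its upstreams have finished. The deps mapping and run_model callback are illustrative, not dbt's API.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_dag(deps: dict, run_model, threads: int = 4):
    """Run models in DAG order; independent branches execute concurrently."""
    done = set()
    running = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while len(done) < len(deps):
            # submit every model whose upstreams have all finished
            for node, parents in deps.items():
                if node not in done and node not in running \
                        and all(p in done for p in parents):
                    running[node] = pool.submit(run_model, node)
            finished, _ = wait(list(running.values()),
                               return_when=FIRST_COMPLETED)
            for node in [n for n, f in running.items() if f in finished]:
                running.pop(node).result()   # re-raise failures
                done.add(node)
    return done

order = []
deps = {"stg_orders": [], "stg_customers": [],
        "orders_daily": ["stg_orders", "stg_customers"]}
run_dag(deps, order.append, threads=2)
print(order[0])  # a staging model always runs before the mart
```

A real runner also skips downstream nodes after a failure instead of raising, which is the behavior step 8 describes.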


Setting Up dbt in Production

A production dbt deployment requires orchestration, CI/CD, environment separation, observability, and security. Here's the complete setup.

🏗️ Production Stack Architecture

📁
Git Repo
GitHub / GitLab
🔄
CI/CD
GitHub Actions / GitLab CI
🎯
Orchestrator
Airflow / Prefect / dbt Cloud
Warehouse
BigQuery / Snowflake / Databricks
📊
Observability
Elementary / re_data / Monte Carlo

🔐 profiles.yml (Production)

my_project:
  outputs:

    dev:
      type: bigquery
      method: oauth
      project: my-company-dev
      dataset: dbt_{{ env_var('DBT_USER') }}
      threads: 4
      timeout_seconds: 300

    prod:
      type: bigquery
      method: service-account
      project: my-company-prod
      dataset: dbt_production
      keyfile: "{{ env_var('BQ_KEYFILE_PATH') }}"
      threads: 16
      timeout_seconds: 3600
      location: US
      maximum_bytes_billed: 10000000000000

  target: dev

⚙️ Airflow DAG for dbt

from airflow import DAG
from airflow.operators.bash import BashOperator
from cosmos import DbtDag, ProjectConfig, ProfileConfig

# Option A: Using Astronomer Cosmos (recommended)
dbt_dag = DbtDag(
    dag_id="dbt_production_daily",
    project_config=ProjectConfig(
        dbt_project_path="/usr/local/airflow/dbt",
    ),
    profile_config=ProfileConfig(
        profile_name="my_project",
        target_name="prod",
    ),
    schedule_interval="0 4 * * *", # 4 AM UTC
    operator_args={
        "vars": {"run_date": "{{ ds }}"},
        "dbt_cmd_flags": ["--threads", "16"],
    },
)

# Option B: BashOperator (simpler)
run_dbt = BashOperator(
    task_id="dbt_run",
    bash_command="""
      dbt run --target prod \
               --select tag:daily \
               --threads 16
    """
)

🔄 GitHub Actions CI/CD Pipeline

# .github/workflows/dbt_ci.yml
name: dbt CI/CD Pipeline

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  dbt_ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dbt
        run: pip install dbt-bigquery==1.8.0

      - name: dbt deps
        run: dbt deps

      - name: dbt compile (syntax check)
        run: dbt compile --target prod

      - name: Slim CI - run only changed models
        # Uses dbt state to only run modified models + downstream
        run: |
          dbt run --target prod \
                  --select state:modified+ \
                  --defer \
                  --state ./prod-artifacts/

      - name: dbt test (only on changed models)
        run: |
          dbt test --target prod \
                   --select state:modified+ \
                   --defer --state ./prod-artifacts/

  dbt_prod_deploy:
    needs: dbt_ci
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install dbt
        run: pip install dbt-bigquery==1.8.0
      - name: dbt deps
        run: dbt deps
      - name: Full production run
        run: dbt run --target prod --threads 16
      - name: Run all tests in production
        run: dbt test --target prod --threads 8

dbt at Petabyte (PB) Data Volume

Running dbt against petabyte-scale datasets demands careful materialization strategies, partitioning, incremental patterns, and warehouse-specific tuning.

⚠️ The Core Challenge at PB Scale

At petabyte volumes, a full table materialization means scanning and rewriting terabytes on every run — costing hours and thousands of dollars. The solution: aggressive incremental processing, partition pruning, and warehouse-native optimizations.

📅

Strategy 1: Partition-First Design

Partition all large fact tables by date. Every query and incremental merge scans only the relevant partitions, reducing compute by 100x–1000x. Combine with clustering on high-cardinality filter columns (user_id, event_type).

Strategy 2: Micro-batch Incremental

Instead of daily full refreshes, run incrementals every 15–60 minutes processing only new records. Use insert_overwrite strategy to replace entire date partitions atomically — faster than row-level MERGE on huge tables.

🔀

Strategy 3: Model Decomposition

Break one massive model into partitioned sub-models. Process each partition in parallel using dbt's thread pool or Airflow parallelism. Recombine downstream. This converts one 6-hour job into many 15-minute jobs running in parallel.

💾

Strategy 4: Intermediate Caching

Materialize expensive intermediate aggregations as table or incremental models rather than ephemeral or view. Avoids recomputing the same expensive joins/aggregations multiple times across dependent models.
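The difference between insert_overwrite and row-level merge can be sketched with a toy partition store — a dict keyed by partition date. This is purely illustrative Python, not warehouse code: insert_overwrite drops and rewrites whole partitions, avoiding the per-row reconciliation a MERGE performs.

```python
from datetime import date

def insert_overwrite(table: dict, new_rows: list):
    """table maps partition date -> list of rows. Every partition touched
    by the batch is dropped wholesale and rebuilt from the batch."""
    touched = {r["event_date"] for r in new_rows}
    for d in touched:
        table[d] = []                        # drop the partition atomically
    for r in new_rows:
        table[r["event_date"]].append(r)     # rewrite it from the new batch

table = {date(2024, 1, 1): [{"event_date": date(2024, 1, 1), "v": "old"}]}
insert_overwrite(table, [{"event_date": date(2024, 1, 1), "v": "new"}])
print(table[date(2024, 1, 1)])  # the old row is gone, replaced by the batch
```

This is why the batch must contain the complete contents of every partition it touches — a partial batch would silently drop rows.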

🔧 PB-Scale BigQuery Configuration

-- models/fact/fact_events_pb.sql  (handles petabyte-scale event logs)
{{
  config(
    materialized          = 'incremental',
    incremental_strategy  = 'insert_overwrite',   -- replaces whole partitions atomically
    unique_key            = 'event_id',
    partition_by = {
      'field'       : 'event_date',
      'data_type'   : 'date',
      'granularity' : 'day'
    },
    cluster_by            = ['user_id', 'event_type', 'platform'],
    require_partition_filter = true,              -- NEVER allow full table scans!
    on_schema_change      = 'append_new_columns',
    hours_to_expiration   = None,                 -- production data never expires
    labels                = {'pipeline': 'dbt', 'team': 'data_engineering'},
  )
}}

WITH raw_events AS (
  SELECT
    event_id,
    user_id,
    LOWER(event_type)                     AS event_type,
    platform,
    DATE(event_timestamp, 'UTC')           AS event_date,
    event_timestamp,
    JSON_VALUE(properties, '$.session_id') AS session_id,
    properties
  FROM {{ source('raw_events', 'events') }}
  WHERE event_id IS NOT NULL
    AND event_timestamp IS NOT NULL

  -- Incremental filter: only process the last N days on each run
  {% if is_incremental() %}
    AND DATE(event_timestamp) >=
        DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)  -- 3-day lookback window
  {% else %}
    -- First full build: last 90 days only (bootstrap gradually)
    AND DATE(event_timestamp) >=
        DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
  {% endif %}
)

SELECT * FROM raw_events

⚙️ dbt_project.yml — Production Tuning

name: my_project
version: '1.0.0'
config-version: 2

vars:
  lookback_days: 3                     # global incremental window
  max_bytes_billed: 10000000000000      # 10TB query cost guard

models:
  my_project:
    staging:
      +materialized: view              # always views in staging
      +schema: staging

    intermediate:
      +materialized: ephemeral         # no storage for intermediate

    marts:
      +materialized: table             # final marts are full tables
      +schema: marts
      +grants:
        select: ["group:bi-analysts@company.com"]

    fact:
      +materialized: incremental       # all facts are incremental
      +schema: fact
      +tags: ["fact", "incremental"]

on-run-start:
  - "CALL my_project.pre_run_checks()"  # custom pre-run validation

on-run-end:
  - "{{ log_run_metadata() }}"          # write run stats to audit table
  - "{{ elementary.upload_dbt_artifacts() }}"

📋 PB Scale Best Practices Checklist

  • Always partition large tables — date-based partitioning is mandatory for fact tables over 10GB. Enable require_partition_filter to prevent accidental full scans.
  • Use insert_overwrite strategy — for BigQuery and Spark, replacing full date partitions is faster than row-level MERGE/UPSERT at PB scale. MERGE causes full partition scans.
  • Lookback window incremental logic — process last N days (not just latest timestamp) to handle late-arriving events. Tune N based on your data SLA.
  • Cluster by high-selectivity columns — Snowflake micro-partitions and BigQuery clustering reduce bytes scanned by 10x–100x for filtered queries.
  • Avoid SELECT * from huge tables — always project only needed columns. Use staging models as a projection + filtering layer before joins.
  • Set query cost guards — BigQuery's maximum_bytes_billed and Snowflake's statement_timeout_in_seconds prevent runaway queries from costing thousands.
  • Increase thread count for production — default 4 threads is too low. Use 16–32 threads in production to parallelize independent model branches.
  • Use dbt Slim CI in PRs — state:modified+ selection means PRs only run changed models, not the full pipeline. Saves 95%+ of CI runtime and cost.
  • Bootstrap large tables in batches — on first build, use a custom macro to process historical data in monthly or weekly chunks instead of one massive query.
  • Monitor with Elementary or Monte Carlo — attach data observability to detect schema changes, volume anomalies, and freshness violations before they impact dashboards.
  • Materialize expensive intermediates — if an intermediate CTE is referenced by 3+ downstream models, materialize it as a table to avoid recomputing it multiple times.
  • Separate dev / CI / prod environments — dev uses personal schemas (dbt_username), CI uses isolated schemas, prod writes to blessed schemas. Never let dev run overwrite production data.
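The batch-bootstrap item above can be sketched as a small driver script: split the historical range into monthly chunks and invoke dbt once per chunk via --vars. The model name and start_date/end_date vars are assumptions (your model's WHERE clause would need to consume them); dry_run=True just prints the commands instead of executing them.

```python
from datetime import date
import subprocess

def month_chunks(start: date, end: date):
    """Split [start, end) into month-aligned (chunk_start, chunk_end) pairs."""
    chunks, cur = [], start
    while cur < end:
        nxt = date(cur.year + (cur.month == 12), cur.month % 12 + 1, 1)
        chunks.append((cur, min(nxt, end)))
        cur = nxt
    return chunks

def backfill(start: date, end: date, dry_run: bool = True):
    for lo, hi in month_chunks(start, end):
        cmd = ["dbt", "run", "--select", "fact_events_pb",   # hypothetical model
               "--vars", f"{{start_date: '{lo}', end_date: '{hi}'}}"]
        if dry_run:
            print(" ".join(cmd))         # inspect commands before running
        else:
            subprocess.run(cmd, check=True)

backfill(date(2023, 1, 1), date(2023, 4, 1))  # three monthly chunks
```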

📊 Environment Strategy Summary

Environment | Target  | Schema         | Threads | Trigger                       | Data Scope
Development | dev     | dbt_username   | 4       | Manual / dbt CLI              | Last 7–30 days
CI / PR     | ci      | dbt_ci_prXXX   | 8       | Pull Request                  | Modified models + 3 days data
Staging     | staging | dbt_staging    | 16      | Merge to main                 | Last 90 days
Production  | prod    | dbt_production | 32      | Scheduled (Airflow/dbt Cloud) | Full historical data

🚀 Key Numbers at PB Scale (Reference)

  • BigQuery: can scan multiple TB per minute given sufficient slots. With partitioning, reduce scans to ~10–50GB per incremental run.
  • Snowflake: X-Large warehouse (16 nodes) handles most dbt workloads. Use multi-cluster warehouses for concurrent runs.
  • Databricks: Use dbt-spark or dbt-databricks. Enable Delta Lake for MERGE support and time travel. Optimize for Photon engine.
  • Incremental runs at PB scale: Should complete in <30 min with proper partitioning. Full rebuilds reserved for schema migrations only.