Big Data Platform Design Guide

Everything about
dbt — Data Build Tool

A comprehensive guide covering what dbt is, how it works, why it matters in modern data stacks, its internal architecture, and how to deploy it at petabyte scale in production.

🔧 dbt Core & Cloud 📦 Models & Materializations 🧪 Testing & Docs 🏗️ Architecture 🚀 Production at PB Scale

What is dbt?

dbt (data build tool) is an open-source command-line tool that enables data analysts and engineers to transform raw data in their warehouses using SQL.

🎯 Core Definition

dbt is the T in ELT (Extract, Load, Transform). It operates directly inside your data warehouse, letting you write modular SQL SELECT statements that dbt compiles and executes, creating tables and views for your analytics layer. Think of it as software engineering applied to data transformation.

30K+
Companies using dbt
ELT
Paradigm (not ETL)
SQL
Primary language
Jinja2
Templating engine
📦

dbt Core (Open Source)

The free, open-source CLI version. Run dbt locally or on your infrastructure. Full control over scheduling, orchestration, and compute. Works with BigQuery, Snowflake, Redshift, Databricks, Spark, DuckDB, and more.

☁️

dbt Cloud (Managed SaaS)

A managed platform with a web IDE, job scheduling, CI/CD, documentation hosting, observability, and team collaboration features — built on top of dbt Core. Includes a Semantic Layer for metric definitions.

🧠

The ELT Revolution

Traditional ETL transformed data before loading (expensive pipelines). dbt enables ELT: load raw data first into cloud warehouses, then transform in place using SQL. The warehouse is the engine — not a middleware layer.

🔄

Analytics Engineering

dbt created the Analytics Engineer role — bridging data engineering and analytics. Analysts own and maintain transformation pipelines using software engineering best practices like version control and testing.


How dbt Works

dbt orchestrates SQL transformations across a directed acyclic graph (DAG) of models. Here's the full execution lifecycle.

📝
SQL Models
Write SELECT statements
🔧
Jinja Compile
Macros + refs resolved
🔗
DAG Resolution
Dependency ordering
⚙️
Warehouse Execution
DDL/DML runs in-warehouse
Tests + Docs
Auto-validation & catalog

📄 A dbt Model (SQL file)

-- models/marts/orders_daily.sql
-- Reference other models with ref()

WITH orders AS (
  SELECT * FROM {{ ref('stg_orders') }}
),

customers AS (
  SELECT * FROM {{ ref('stg_customers') }}
),

daily_agg AS (
  SELECT
    DATE_TRUNC('day', o.created_at) AS order_date,
    c.country,
    COUNT(o.order_id)        AS total_orders,
    SUM(o.revenue)           AS total_revenue,
    AVG(o.revenue)           AS avg_order_value
  FROM orders o
  JOIN customers c USING (customer_id)
  GROUP BY 1, 2
)

SELECT * FROM daily_agg

⚙️ Model Config (YAML)

# models/marts/schema.yml
version: 2

models:
  - name: orders_daily
    description: "Daily order aggregation by country"
    config:
      materialized: incremental
      unique_key: [order_date, country]
      partition_by:
        field: order_date
        data_type: date
        granularity: day

    columns:
      - name: order_date
        tests:
          - not_null
          # not unique on its own; the table's grain is (order_date, country)
      - name: total_revenue
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0

🔧 Key dbt CLI Commands

# Initialize a new project
dbt init my_project

# Compile SQL without executing (shows rendered SQL)
dbt compile

# Run all models (creates tables/views in warehouse)
dbt run

# Run only changed models + downstream (slim CI; requires prior-run artifacts)
dbt run --select state:modified+ --state path/to/prod-artifacts

# Run specific model and all upstream dependencies
dbt run --select +orders_daily

# Execute all data quality tests
dbt test

# Generate and serve documentation site
dbt docs generate && dbt docs serve

# Run incremental models only
dbt run --select config.materialized:incremental

# Snapshot slowly changing dimensions (SCD Type 2)
dbt snapshot

📋 Materializations Explained

Type          | What dbt Creates                      | When to Use                             | Cost
view          | Virtual SQL view (no data stored)     | Staging models, lightweight transforms  | Query-time compute only
table         | Full physical table (drop + rebuild)  | Final mart tables, aggregations         | Full rebuild on every run
incremental   | Appends/merges only new rows          | Large event tables, fact tables         | Low — processes only delta
ephemeral     | CTE injected into referencing models  | Intermediate logic, no object created   | Zero storage cost
snapshot      | SCD Type 2 history table              | Track row changes over time             | Grows with history
dynamic table | Warehouse-managed refresh (Snowflake) | Near-real-time refresh without Airflow  | Auto-refreshed by warehouse

Why dbt? What Makes it Great?

dbt brings software engineering discipline — version control, testing, documentation, modularity — to the data transformation layer.

🔗

Dependency Management via DAG

Using ref() and source() functions, dbt auto-resolves all model dependencies and builds a Directed Acyclic Graph — no manual pipeline wiring needed.

🧪

Built-in Data Testing

Define schema tests (not_null, unique, accepted_values, relationships) and custom SQL tests. Tests run against your warehouse — no separate testing framework needed.

📚

Auto-generated Documentation

Every model, column, test, and source is auto-documented into a browsable data catalog with lineage graphs. No more stale wikis — docs live alongside code.

♻️

DRY Code with Macros

Jinja2 templating + macros let you write reusable SQL logic. The dbt-utils and dbt-expectations packages add hundreds of pre-built macros and tests to your project.

🔀

Git-based Version Control

All models are plain .sql files — they live in Git. Teams can PR, review, branch, rollback, and audit transformations exactly like application code.

Warehouse-native Compute

dbt doesn't run compute itself — it pushes all execution to your data warehouse (BigQuery, Snowflake, Spark, etc.), leveraging the warehouse's MPP engine and auto-scaling without adding a separate compute layer to operate.

🏗️

Modular Layered Architecture

dbt enforces staging → intermediate → mart layers. Each layer has a clear responsibility, making pipelines understandable, testable, and maintainable at scale.

📊

Semantic Layer (MetricFlow)

Define business metrics once in YAML (revenue, DAU, churn rate). All BI tools and queries consume a single consistent definition — no more metric disagreements.

🔌

Huge Ecosystem (Packages)

dbt Hub has 200+ open-source packages: dbt-utils, dbt-expectations, dbt-audit-helper, elementary, re_data, and adapters for every major warehouse.

🏆 What dbt is Particularly Good At

  • Complex multi-hop transformations — staging → intermediate → aggregation → mart layers with full lineage
  • Incremental processing — handling petabyte-scale fact tables by only processing new/changed data
  • Slowly changing dimension tracking — SCD Type 2 via dbt Snapshots with zero custom code
  • Cross-team data contracts — schema tests enforce SLAs between producers and consumers
  • CI/CD for data pipelines — slim CI runs only affected models on PRs, saving hours of runtime

How dbt is Built Internally

dbt Core is a Python application composed of several key subsystems that work together to compile, plan, and execute your transformation pipeline.

dbt Core Internal Subsystems

🖥️
CLI / Programmatic Interface
Entry point. Parses user commands (dbt run, dbt test), loads profiles and project config from dbt_project.yml and profiles.yml
📂
Manifest Builder (Parser)
Reads all .sql, .yml, .py files. Parses Jinja, resolves ref() and source() calls. Outputs a manifest.json — the complete project graph
🔗
DAG Engine (Graph Module)
Builds and validates the Directed Acyclic Graph using NetworkX. Performs topological sort for execution order. Detects cycles. Applies node selectors (--select, --exclude)
Jinja2 Compilation Engine
Renders Jinja templates with context: model configs, macros, variables. Produces raw SQL. Supports custom macro libraries. Injects adapter-specific SQL via the dispatch system
🔀
Adapter Layer (Plugin Interface)
Abstract interface to each warehouse (BigQuery, Snowflake, Redshift, Spark, DuckDB…). Each adapter implements: connection pool, DDL generation, column type mapping, incremental merge strategies
🏃
Runner (Execution Engine)
Thread-pool based parallel executor. Respects DAG order. Sends compiled SQL to adapter. Captures results, timing, and row counts into run_results.json. Manages retries and partial failures
🗄️
Artifacts Store
Writes manifest.json (project graph), run_results.json (execution results), catalog.json (schema introspection), sources.json. Used for state-based selection in CI/CD
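The manifest-building step can be illustrated with a toy sketch: a regex that mimics how dbt discovers ref() calls and turns them into the dependency mapping the DAG engine consumes. dbt's real parser is far more involved (Jinja-aware, handles sources, configs, and Python models); the model names and SQL below are made up for illustration.

```python
import re

# Matches {{ ref('model_name') }} calls in model SQL (simplified).
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")

def extract_refs(sql: str):
    """Return model names referenced via {{ ref('...') }}."""
    return REF_PATTERN.findall(sql)

def build_dependency_graph(models: dict):
    """Map each model name to the upstream models it depends on."""
    return {name: extract_refs(sql) for name, sql in models.items()}

models = {
    "stg_orders": "SELECT * FROM raw.orders",
    "stg_customers": "SELECT * FROM raw.customers",
    "orders_daily": """
        SELECT * FROM {{ ref('stg_orders') }} o
        JOIN {{ ref('stg_customers') }} c USING (customer_id)
    """,
}

graph = build_dependency_graph(models)
print(graph["orders_daily"])  # ['stg_orders', 'stg_customers']
```

This mapping is the essence of what manifest.json encodes for each node before the DAG engine takes over.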

📁 Project File Structure

my_dbt_project/
├── dbt_project.yml       # Project config
├── profiles.yml          # DB connections
├── packages.yml          # Package deps
├── models/
│   ├── staging/          # Raw → clean
│   │   ├── stg_orders.sql
│   │   └── schema.yml
│   ├── intermediate/     # Business logic
│   │   └── int_orders_with_customers.sql
│   └── marts/            # Analytics-ready
│       ├── orders_daily.sql
│       └── schema.yml
├── macros/               # Jinja macros
├── tests/                # Custom SQL tests
├── snapshots/            # SCD Type 2
├── seeds/                # Static CSV data
├── analyses/             # Ad-hoc SQL
└── target/               # Compiled output
    ├── manifest.json
    ├── run_results.json
    └── catalog.json

🔄 Incremental Strategy Logic

-- models/fact/fact_events.sql
{{
  config(
    materialized = 'incremental',
    unique_key   = 'event_id',
    incremental_strategy = 'merge',
    partition_by = {
      'field': 'event_date',
      'data_type': 'date',
      'granularity': 'day'
    },
    cluster_by = ['user_id', 'event_type']
  )
}}

SELECT
  event_id,
  user_id,
  event_type,
  DATE(event_timestamp) AS event_date,
  event_timestamp,
  properties
FROM {{ source('raw', 'events') }}

-- Only new records on incremental runs
{% if is_incremental() %}
  WHERE event_timestamp >
    (SELECT MAX(event_timestamp)
     FROM {{ this }})
{% endif %}
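Conceptually, is_incremental() gates a high-watermark filter: keep only source rows newer than the newest row already in the target table. A plain-Python sketch of that semantics (rows as dicts; this is illustrative, not dbt code):

```python
from datetime import datetime

def incremental_filter(existing_rows, source_rows, ts_key="event_timestamp"):
    """Mimic the is_incremental() pattern: on the first run take everything,
    afterwards take only rows newer than the target's max timestamp."""
    if not existing_rows:                       # first run: full build
        return list(source_rows)
    watermark = max(r[ts_key] for r in existing_rows)
    return [r for r in source_rows if r[ts_key] > watermark]

existing = [{"event_id": 1, "event_timestamp": datetime(2024, 1, 1)}]
source = [
    {"event_id": 1, "event_timestamp": datetime(2024, 1, 1)},  # already loaded
    {"event_id": 2, "event_timestamp": datetime(2024, 1, 2)},  # new
]
print(incremental_filter(existing, source))  # only event_id 2 survives
```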

How dbt Builds Your Pipeline

When you run dbt run, a precise sequence of steps transforms your SQL files into warehouse tables and views.

  • 1

    Project & Profile Loading

    Reads dbt_project.yml (model configs, variables, hooks) and profiles.yml (warehouse credentials + connection params). Validates adapter version compatibility.

  • 2

    File Parsing & Manifest Generation

    Every .sql and .yml file is parsed. Jinja is partially rendered to detect ref(), source(), and config() calls. A full manifest.json is written to target/.

  • 3

    DAG Construction & Topological Sort

    All dependencies form a DAG. dbt uses topological sort (Kahn's algorithm) to determine the correct execution order. Models with no dependencies run first; downstream models follow. Cycles cause an error.

  • 4

    Node Selection & Filtering

    Applies --select, --exclude, and --selector flags. Supports graph operators: +model (upstream), model+ (downstream), state:modified (changed since last run), and tag/config selectors.

  • 5

    Jinja Compilation (Full Render)

    Each model's SQL is fully rendered: all macros expanded, ref() replaced with actual schema-qualified table names, config blocks resolved. Compiled SQL is saved to target/compiled/ for debugging.

  • 6

    Parallel Execution via Thread Pool

    Models whose dependencies have all succeeded are eligible to run in parallel. dbt uses a configurable thread count (default 4). Each thread opens a warehouse connection and submits the DDL/DML. Threads respect DAG ordering automatically.

  • 7

    Pre-hooks, Run, Post-hooks

    For each model: run pre-hook SQL → execute model DDL/DML → run post-hook SQL. Hooks enable grant permissions, update audit tables, or run custom SQL around each model execution.

  • 8

    Result Collection & Artifact Write

    dbt collects row counts, timing, and success/error status for every node. Writes run_results.json. Failed models are reported; their downstream dependents are automatically skipped to prevent data corruption.
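The topological ordering in step 3 can be sketched with a minimal Kahn's algorithm over a hypothetical deps mapping (model → its upstream models), including the cycle check dbt performs:

```python
from collections import deque

def topological_order(deps: dict):
    """Kahn's algorithm: repeatedly emit nodes with no unmet dependencies."""
    indegree = {node: len(parents) for node, parents in deps.items()}
    children = {node: [] for node in deps}
    for node, parents in deps.items():
        for parent in parents:
            children[parent].append(node)
    queue = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(deps):         # leftover nodes => a cycle exists
        raise ValueError("cycle detected in model graph")
    return order

deps = {
    "stg_orders": [],
    "stg_customers": [],
    "orders_daily": ["stg_orders", "stg_customers"],
}
print(topological_order(deps))  # staging models first, the mart last
```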

🧵 Parallelism Model

dbt's parallelism is bounded by the DAG structure. Independent branches run concurrently; dependent models wait. With --threads 8, up to 8 models execute simultaneously. For very wide pipelines, this dramatically cuts total run time. On dbt Cloud, threads can reach 32+.
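A simplified sketch of this DAG-bounded parallelism: a thread pool that only submits a model once all of its upstreams have finished. The deps mapping and run_model callback are illustrative, not dbt's API.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_dag(deps: dict, run_model, threads: int = 4):
    """Run models in DAG order; independent branches execute concurrently."""
    done = set()
    running = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while len(done) < len(deps):
            # submit every model whose upstreams have all finished
            for node, parents in deps.items():
                if node not in done and node not in running \
                        and all(p in done for p in parents):
                    running[node] = pool.submit(run_model, node)
            finished, _ = wait(list(running.values()),
                               return_when=FIRST_COMPLETED)
            for node in [n for n, f in running.items() if f in finished]:
                running.pop(node).result()   # re-raise failures
                done.add(node)
    return done

order = []
deps = {"stg_orders": [], "stg_customers": [],
        "orders_daily": ["stg_orders", "stg_customers"]}
run_dag(deps, order.append, threads=2)
print(order[0])  # a staging model always runs before the mart
```

A real runner also skips downstream nodes after a failure instead of raising, which is the behavior step 8 describes.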


Setting Up dbt in Production

A production dbt deployment requires orchestration, CI/CD, environment separation, observability, and security. Here's the complete setup.

🏗️ Production Stack Architecture

📁
Git Repo
GitHub / GitLab
🔄
CI/CD
GitHub Actions / GitLab CI
🎯
Orchestrator
Airflow / Prefect / dbt Cloud
Warehouse
BigQuery / Snowflake / Databricks
📊
Observability
Elementary / re_data / Monte Carlo

🔐 profiles.yml (Production)

my_project:
  outputs:

    dev:
      type: bigquery
      method: oauth
      project: my-company-dev
      dataset: dbt_{{ env_var('DBT_USER') }}
      threads: 4
      timeout_seconds: 300

    prod:
      type: bigquery
      method: service-account
      project: my-company-prod
      dataset: dbt_production
      keyfile: "{{ env_var('BQ_KEYFILE_PATH') }}"
      threads: 16
      timeout_seconds: 3600
      location: US
      maximum_bytes_billed: 10000000000000

  target: dev

⚙️ Airflow DAG for dbt

from airflow import DAG
from airflow.operators.bash import BashOperator
from cosmos import DbtDag, ProjectConfig, ProfileConfig

# Option A: Using Astronomer Cosmos (recommended)
dbt_dag = DbtDag(
    dag_id="dbt_production_daily",
    project_config=ProjectConfig(
        dbt_project_path="/usr/local/airflow/dbt",
    ),
    profile_config=ProfileConfig(
        profile_name="my_project",
        target_name="prod",
    ),
    schedule_interval="0 4 * * *", # 4 AM UTC
    operator_args={
        "vars": {"run_date": "{{ ds }}"},
        "dbt_cmd_flags": ["--threads", "16"],
    },
)

# Option B: BashOperator (simpler)
run_dbt = BashOperator(
    task_id="dbt_run",
    bash_command="""
      dbt run --target prod \
               --select tag:daily \
               --threads 16
    """
)

🔄 GitHub Actions CI/CD Pipeline

# .github/workflows/dbt_ci.yml
name: dbt CI/CD Pipeline

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  dbt_ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dbt
        run: pip install dbt-bigquery==1.8.0

      - name: dbt deps
        run: dbt deps

      - name: dbt compile (syntax check)
        run: dbt compile --target prod

      - name: Slim CI - run only changed models
        # Uses dbt state to only run modified models + downstream
        run: |
          dbt run --target prod \
                  --select state:modified+ \
                  --defer \
                  --state ./prod-artifacts/

      - name: dbt test (only on changed models)
        run: |
          dbt test --target prod \
                   --select state:modified+ \
                   --defer --state ./prod-artifacts/

  dbt_prod_deploy:
    needs: dbt_ci
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install dbt
        run: pip install dbt-bigquery==1.8.0
      - name: dbt deps
        run: dbt deps
      - name: Full production run
        run: dbt run --target prod --threads 16
      - name: Run all tests in production
        run: dbt test --target prod --threads 8

dbt at Petabyte (PB) Data Volume

Running dbt against petabyte-scale datasets demands careful materialization strategies, partitioning, incremental patterns, and warehouse-specific tuning.

⚠️ The Core Challenge at PB Scale

At petabyte volumes, a full table materialization means scanning and rewriting terabytes on every run — costing hours and thousands of dollars. The solution: aggressive incremental processing, partition pruning, and warehouse-native optimizations.

📅

Strategy 1: Partition-First Design

Partition all large fact tables by date. Every query and incremental merge scans only the relevant partitions, reducing compute by 100x–1000x. Combine with clustering on high-cardinality filter columns (user_id, event_type).

Strategy 2: Micro-batch Incremental

Instead of daily full refreshes, run incrementals every 15–60 minutes processing only new records. Use insert_overwrite strategy to replace entire date partitions atomically — faster than row-level MERGE on huge tables.

🔀

Strategy 3: Model Decomposition

Break one massive model into partitioned sub-models. Process each partition in parallel using dbt's thread pool or Airflow parallelism. Recombine downstream. This converts one 6-hour job into many 15-minute jobs running in parallel.

💾

Strategy 4: Intermediate Caching

Materialize expensive intermediate aggregations as table or incremental models rather than ephemeral or view. Avoids recomputing the same expensive joins/aggregations multiple times across dependent models.
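The difference between insert_overwrite and row-level merge can be sketched with a toy partition store — a dict keyed by partition date. This is purely illustrative Python, not warehouse code: insert_overwrite drops and rewrites whole partitions, avoiding the per-row reconciliation a MERGE performs.

```python
from datetime import date

def insert_overwrite(table: dict, new_rows: list):
    """table maps partition date -> list of rows. Every partition touched
    by the batch is dropped wholesale and rebuilt from the batch."""
    touched = {r["event_date"] for r in new_rows}
    for d in touched:
        table[d] = []                        # drop the partition atomically
    for r in new_rows:
        table[r["event_date"]].append(r)     # rewrite it from the new batch

table = {date(2024, 1, 1): [{"event_date": date(2024, 1, 1), "v": "old"}]}
insert_overwrite(table, [{"event_date": date(2024, 1, 1), "v": "new"}])
print(table[date(2024, 1, 1)])  # the old row is gone, replaced by the batch
```

This is why the batch must contain the complete contents of every partition it touches — a partial batch would silently drop rows.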

🔧 PB-Scale BigQuery Configuration

-- models/fact/fact_events_pb.sql  (handles petabyte-scale event logs)
{{
  config(
    materialized          = 'incremental',
    incremental_strategy  = 'insert_overwrite',   -- replaces whole partitions atomically
    unique_key            = 'event_id',
    partition_by = {
      'field'       : 'event_date',
      'data_type'   : 'date',
      'granularity' : 'day'
    },
    cluster_by            = ['user_id', 'event_type', 'platform'],
    require_partition_filter = true,              -- NEVER allow full table scans!
    on_schema_change      = 'append_new_columns',
    hours_to_expiration   = None,                 -- production data never expires
    labels                = {'pipeline': 'dbt', 'team': 'data_engineering'},
  )
}}

WITH raw_events AS (
  SELECT
    event_id,
    user_id,
    LOWER(event_type)                     AS event_type,
    platform,
    DATE(event_timestamp, 'UTC')           AS event_date,
    event_timestamp,
    JSON_VALUE(properties, '$.session_id') AS session_id,
    properties
  FROM {{ source('raw_events', 'events') }}
  WHERE event_id IS NOT NULL
    AND event_timestamp IS NOT NULL

  -- Incremental filter: only process the last N days on each run
  {% if is_incremental() %}
    AND DATE(event_timestamp) >=
        DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)  -- 3-day lookback window
  {% else %}
    -- First full build: last 90 days only (bootstrap gradually)
    AND DATE(event_timestamp) >=
        DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
  {% endif %}
)

SELECT * FROM raw_events

⚙️ dbt_project.yml — Production Tuning

name: my_project
version: '1.0.0'
config-version: 2

vars:
  lookback_days: 3                     # global incremental window
  max_bytes_billed: 10000000000000      # 10TB query cost guard

models:
  my_project:
    staging:
      +materialized: view              # always views in staging
      +schema: staging

    intermediate:
      +materialized: ephemeral         # no storage for intermediate

    marts:
      +materialized: table             # final marts are full tables
      +schema: marts
      +grants:
        select: ["group:bi-analysts@company.com"]

    fact:
      +materialized: incremental       # all facts are incremental
      +schema: fact
      +tags: ["fact", "incremental"]

on-run-start:
  - "CALL my_project.pre_run_checks()"  # custom pre-run validation

on-run-end:
  - "{{ log_run_metadata() }}"          # write run stats to audit table
  - "{{ elementary.upload_dbt_artifacts() }}"

📋 PB Scale Best Practices Checklist

  • Always partition large tables — date-based partitioning is mandatory for fact tables over 10GB. Enable require_partition_filter to prevent accidental full scans.
  • Use insert_overwrite strategy — for BigQuery and Spark, replacing full date partitions is faster than row-level MERGE/UPSERT at PB scale. MERGE causes full partition scans.
  • Lookback window incremental logic — process last N days (not just latest timestamp) to handle late-arriving events. Tune N based on your data SLA.
  • Cluster by high-selectivity columns — Snowflake micro-partitions and BigQuery clustering reduce bytes scanned by 10x–100x for filtered queries.
  • Avoid SELECT * from huge tables — always project only needed columns. Use staging models as a projection + filtering layer before joins.
  • Set query cost guards — BigQuery's maximum_bytes_billed and Snowflake's statement_timeout_in_seconds prevent runaway queries from costing thousands.
  • Increase thread count for production — default 4 threads is too low. Use 16–32 threads in production to parallelize independent model branches.
  • Use dbt Slim CI in PRs — state:modified+ selection means PRs only run changed models, not the full pipeline. Saves 95%+ of CI runtime and cost.
  • Bootstrap large tables in batches — on first build, use a custom macro to process historical data in monthly or weekly chunks instead of one massive query.
  • Monitor with Elementary or Monte Carlo — attach data observability to detect schema changes, volume anomalies, and freshness violations before they impact dashboards.
  • Materialize expensive intermediates — if an intermediate CTE is referenced by 3+ downstream models, materialize it as a table to avoid recomputing it multiple times.
  • Separate dev / CI / prod environments — dev uses personal schemas (dbt_username), CI uses isolated schemas, prod writes to blessed schemas. Never let dev run overwrite production data.
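The batch-bootstrap item above can be sketched as a small driver script: split the historical range into monthly chunks and invoke dbt once per chunk via --vars. The model name and start_date/end_date vars are assumptions (your model's WHERE clause would need to consume them); dry_run=True just prints the commands instead of executing them.

```python
from datetime import date
import subprocess

def month_chunks(start: date, end: date):
    """Split [start, end) into month-aligned (chunk_start, chunk_end) pairs."""
    chunks, cur = [], start
    while cur < end:
        nxt = date(cur.year + (cur.month == 12), cur.month % 12 + 1, 1)
        chunks.append((cur, min(nxt, end)))
        cur = nxt
    return chunks

def backfill(start: date, end: date, dry_run: bool = True):
    for lo, hi in month_chunks(start, end):
        cmd = ["dbt", "run", "--select", "fact_events_pb",   # hypothetical model
               "--vars", f"{{start_date: '{lo}', end_date: '{hi}'}}"]
        if dry_run:
            print(" ".join(cmd))         # inspect commands before running
        else:
            subprocess.run(cmd, check=True)

backfill(date(2023, 1, 1), date(2023, 4, 1))  # three monthly chunks
```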

📊 Environment Strategy Summary

Environment | Target  | Schema         | Threads | Trigger                       | Data Scope
Development | dev     | dbt_username   | 4       | Manual / dbt CLI              | Last 7–30 days
CI / PR     | ci      | dbt_ci_prXXX   | 8       | Pull Request                  | Modified models + 3 days data
Staging     | staging | dbt_staging    | 16      | Merge to main                 | Last 90 days
Production  | prod    | dbt_production | 32      | Scheduled (Airflow/dbt Cloud) | Full historical data

🚀 Key Numbers at PB Scale (Reference)

  • BigQuery: can scan multiple TB per minute given sufficient slots. With partitioning, reduce scans to ~10–50GB per incremental run.
  • Snowflake: X-Large warehouse (16 nodes) handles most dbt workloads. Use multi-cluster warehouses for concurrent runs.
  • Databricks: Use dbt-spark or dbt-databricks. Enable Delta Lake for MERGE support and time travel. Optimize for Photon engine.
  • Incremental runs at PB scale: Should complete in <30 min with proper partitioning. Full rebuilds reserved for schema migrations only.