
The Complete Terraform
Deep Dive

From the pre-2014 infrastructure chaos to advanced GitOps pipelines — a comprehensive technical reference for engineers who want to truly understand Terraform.

5 Parts, Full Coverage · Core Architecture & Internals · Real-World Patterns & Code · References & Cited Sources
Part 1 · Origins & The "Why"

Before Terraform: The Pre-2014 Infrastructure Crisis

To understand why Terraform was built, we must first understand the pain it was designed to eliminate — a chaotic era of manual clicking, brittle scripts, and zero reproducibility.

🔥 The Problem: Manual Provisioning & Imperative Scripting

🖱️ Manual Provisioning (ClickOps)

Engineers would log into the AWS Console, Azure Portal, or physical data-center tools and click through wizards to create servers, networks, and databases. This approach had catastrophic failures at scale:

  • No reproducibility: "Works on my region" — no one could recreate the same environment twice. Staging drifted from prod within weeks.
  • Zero audit trail: Who opened port 22 to 0.0.0.0/0 in production? The console didn't tell you.
  • Human error at scale: One misclick in a production firewall rule or load-balancer wizard could trigger a multi-hour outage, and post-incident review had nothing to reconstruct what had actually changed.
  • No version control: Infrastructure had no Git history, no rollback, no peer review. You couldn't git revert a broken datacenter.
  • Knowledge silos: Only "the guy who built it" knew how it worked. When he left, the company lost institutional memory entirely.

📜 Imperative Scripting (Bash / Python / PowerShell)

The natural evolution was automation scripts. Teams wrote Bash wrappers around AWS CLI commands or Python scripts using boto (the original AWS SDK for Python, before boto3). These solved reproducibility but introduced a new class of failures:

  • Idempotency failures: Running the same script twice either crashed on duplicates (aws iam create-role fails with EntityAlreadyExists) or silently created duplicates (aws ec2 create-vpc happily makes a second VPC). Scripts had to manually check current state before every action — a logic nightmare.
  • State blindness: Scripts had no memory. They didn't know what existed in the cloud. Every run was a guess.
  • Order dependency: Creating a subnet before the VPC exists causes an error. Engineers had to manually encode the correct order of 50+ API calls.
  • Error handling hell: A half-run script that crashed after creating 3 of 10 resources left the environment in an unknown partial state — the worst possible scenario.
  • No parallelism: Scripts ran sequentially. Deploying 20 independent S3 buckets took 20× as long as deploying 1.
  • Provider lock-in: An AWS Bash script was 100% AWS-specific. A multi-cloud strategy meant maintaining separate toolchains.
💡
The Historical Turning Point (2011–2014)

Netflix's "Chaos Monkey" era revealed that the industry needed infrastructure as code with real state awareness. CloudFormation (2011) was AWS's first attempt, but it was JSON-only, AWS-specific, and had no concept of cross-provider orchestration. Google Deployment Manager and Azure ARM Templates followed — all cloud-siloed. The market needed a provider-agnostic, open-source solution. HashiCorp delivered Terraform in July 2014.

🧠 The Philosophy: Declarative vs. Imperative

⚙️ Imperative Model ("How")

You describe the steps to achieve a goal. The execution engine follows your instructions literally.

imperative.sh (bash)
# You must check IF it exists first
if ! aws ec2 describe-vpcs \
    --filters "Name=tag:Name,Values=my-vpc" \
    | grep -q VpcId; then
  aws ec2 create-vpc \
    --cidr-block 10.0.0.0/16 \
    --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=my-vpc}]'
fi

# You must look the ID up yourself
VPC_ID=$(aws ec2 describe-vpcs \
  --filters "Name=tag:Name,Values=my-vpc" \
  --query 'Vpcs[0].VpcId' --output text)

# Then create the subnet — manually ordered
aws ec2 create-subnet \
  --vpc-id "$VPC_ID" \
  --cidr-block 10.0.1.0/24

Problems: manual state checks, order management, error handling, no parallelism.

📋 Declarative Model ("What")

You describe the desired end state. The engine computes how to get there — and what to change if reality differs.

declarative.tf (HCL)
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "public" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

Terraform infers order, parallelizes independent resources, and checks current state automatically.

🗄️ Why "State" is Terraform's Most Critical Concept

The terraform.tfstate file is Terraform's source of truth about the real world. Without it, Terraform would be forced to re-read all cloud APIs on every run (slow, rate-limited, and sometimes impossible) to know what it previously created. State enables:

🔗 Resource Mapping

State maps your HCL resource blocks to real cloud resource IDs: aws_vpc.main → vpc-0abc123def456. Without this mapping, Terraform cannot update or delete the resource.

📐 Diff Computation

During terraform plan, Terraform compares desired config → state → real world. This three-way diff determines exactly what changes are necessary and generates a precise execution plan.

⚡ Performance

State is a local cache of remote API responses. Terraform reads attributes from state instead of querying cloud APIs for every attribute of every resource, dramatically speeding up plan operations.

⚠️
State is not a backup. It's a live operational record. If state drifts from reality (someone manually deleted a resource in the console), Terraform will not know until terraform plan compares against the real API. This is called State Drift — one of the most dangerous Terraform failure modes.

🌐 Ecosystem: Terraform vs. Configuration Management

A critical misconception is that Terraform replaces Ansible or Chef. They solve different layers of the stack and are complementary, not competing.

| Dimension | Terraform | Ansible | Chef / Puppet |
|---|---|---|---|
| Primary domain | Cloud infrastructure orchestration | OS configuration, application deployment | OS configuration, compliance enforcement |
| Model | Declarative | Imperative (can be idempotent) | Declarative (DSL) |
| State management | Full state file + locking | Stateless (re-reads reality) | Chef Server / Puppet DB |
| Cloud resources | Excellent (3000+ providers) | Modules exist but limited | Not a primary use case |
| Installs packages on VMs | No (not its job) | Yes (primary use case) | Yes (primary use case) |
| Multi-cloud | Yes (AWS + GCP + Azure in one config) | Partial | Partial |
| Agentless | Yes | Yes (SSH) | No (Chef Client / Puppet Agent) |

🔗 Why Big Data Stacks Need Both

In a production Big Data platform (e.g., an EMR/Spark cluster with Kafka and Cassandra), the workflow is layered:

1

Terraform: Provision the Cloud Layer

Create the VPC, subnets, security groups, IAM roles, EMR cluster nodes, MSK Kafka brokers, and S3 buckets. Terraform knows nothing about what OS packages should be installed — it just creates the machines.

2

Ansible: Configure the Application Layer

After EC2 instances boot, Ansible connects via SSH to install JVM, configure Kafka topic settings, set up Cassandra ring topology, install monitoring agents (Prometheus Node Exporter), and tune kernel parameters (vm.swappiness, net.core.somaxconn).

3

Terraform Output → Ansible Inventory

Terraform outputs the private IP addresses of brokers and workers into a dynamic Ansible inventory file. This closes the loop: Terraform builds the infrastructure, Ansible configures it. Together, they achieve a fully automated, reproducible Big Data environment.
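A minimal sketch of that handoff, assuming a hypothetical self-managed broker fleet defined elsewhere as aws_instance.broker (with count set) and the hashicorp/local provider for writing the file:

inventory.tf (HCL)
# Expose broker addresses to operators and downstream tooling
output "kafka_broker_ips" {
  value = aws_instance.broker[*].private_ip
}

# Render a static Ansible INI inventory alongside the config
resource "local_file" "ansible_inventory" {
  filename = "${path.module}/inventory.ini"
  content = join("\n", concat(
    ["[kafka_brokers]"],
    aws_instance.broker[*].private_ip
  ))
}

In larger setups, a dynamic inventory plugin (e.g., Ansible's aws_ec2 plugin reading instance tags) replaces the generated file.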

Part 2 · Core Architecture & Internals

The Engine: How Terraform Actually Works

Terraform is not just a CLI wrapper around cloud APIs. It's a sophisticated graph-theoretic execution engine with a plugin-based provider system and a carefully designed lifecycle.

⚙️ Terraform Core vs. Providers

Terraform's architecture is cleanly split into two planes that communicate over a local gRPC channel (via HashiCorp's go-plugin system). Providers are separate Go binaries, built with the Terraform Plugin Framework or the older Plugin SDK:

TERRAFORM CORE: HCL Parser / Config Loader (reads .tf files into the resource graph) · DAG Engine / Graph Builder (topological sort, parallel walk) · State Manager (read / write / lock the state file) · Plan & Apply Engine (diff computation, CRUD dispatch via plugin RPC)
        ⇅ gRPC
PROVIDERS (PLUGINS): AWS Provider (Go binary implementing CRUD for all AWS resources) · Azure Provider (Go binary calling Azure Resource Manager APIs) · GCP Provider (Go binary calling Google Cloud REST APIs) · Custom / community providers (Kubernetes, Vault, Datadog, GitHub, MongoDB Atlas, 3000+ on the registry)

Terraform Core Responsibilities

  • Parsing HCL configuration files into an abstract resource graph
  • Building and walking the Directed Acyclic Graph (DAG)
  • Reading, writing, and locking the state file
  • Computing the diff between desired state and current state
  • Calling provider plugins via gRPC for CRUD operations
  • Handling output, variables, locals, and data sources

Provider Plugin Responsibilities

  • Implementing the ResourceServer gRPC interface for each resource
  • Translating Terraform's resource schema into cloud API calls
  • Handling authentication and retries with the cloud provider
  • Mapping cloud API response fields back to Terraform attributes
  • Downloaded and cached in .terraform/providers/ during init
  • Versioned independently from Terraform Core (semver)

📊 Graph Theory: The DAG Engine

Terraform uses a Directed Acyclic Graph (DAG) to model all resource relationships. Each node is a resource; each directed edge represents a dependency (A must exist before B can be created). The DAG is the core data structure that enables both correctness and parallelism.

How the DAG is Built

1. Parse all .tf files

Every resource, data source, variable, and output becomes a node.

2. Infer implicit dependencies

When resource B references aws_vpc.main.id, Terraform creates a directed edge: VPC → Subnet. Attribute references like this are how most dependencies are discovered.

3. Add explicit edges

The depends_on meta-argument adds edges between resources that share no attribute reference. Modules create sub-graphs.

4. Topological sort

Terraform performs a DFS-based topological sort to find a valid execution order.

5. Parallel walk

Nodes with no unsatisfied dependencies are dispatched to a goroutine pool for concurrent execution.

Parallel vs Sequential Execution

Parallel: Resources with no dependency relationship between them (e.g., 5 independent S3 buckets) are created simultaneously using Go goroutines. Terraform uses a configurable parallelism setting (default: 10 concurrent operations via -parallelism=N).
🔗
Sequential: Resources that reference each other must be built in order. A Subnet referencing a VPC ID must wait until the VPC is created and its ID is available in state.
dependency_example.tf (HCL)
# VPC must exist first (node 1)
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# These 3 are INDEPENDENT of each other
# → Terraform creates them IN PARALLEL
resource "aws_subnet" "az1" {
  vpc_id = aws_vpc.main.id
}
resource "aws_subnet" "az2" {
  vpc_id = aws_vpc.main.id
}
resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
}

🔄 The Terraform Lifecycle: Deep Mechanics

🚀 terraform init

What happens:

  • Reads the required_providers block and queries the Terraform Registry (registry.terraform.io) or a private registry for provider binaries
  • Downloads provider binaries to .terraform/providers/ (platform-specific: darwin_arm64, linux_amd64, etc.)
  • Writes .terraform.lock.hcl with SHA-256 checksums to pin provider versions
  • Initializes the backend (configures S3, Consul, etc. for remote state)
  • Downloads and initializes modules listed in module blocks
State impact: Does not modify the state file. Only configures the local Terraform workspace.
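For reference, a minimal sketch of the terraform block that init consumes when resolving providers and configuring the backend (bucket and key names are placeholders):

versions.tf (HCL)
terraform {
  required_version = ">= 1.9"

  required_providers {
    # init downloads this into .terraform/providers/ and pins
    # its checksums in .terraform.lock.hcl
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  # init initializes this backend for remote state
  backend "s3" {
    bucket = "my-org-tfstate"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}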

📋 terraform plan

What happens:

  • Performs a refresh: reads the current state file and queries cloud APIs to detect drift
  • Builds the full DAG from HCL configuration
  • Computes a three-way diff: (desired config) vs (state) vs (real cloud resources)
  • Produces a human-readable execution plan: which resources to CREATE, UPDATE, REPLACE, or DESTROY
  • Can output a binary plan file: terraform plan -out=plan.tfplan for reproducible applies
State impact: Detects drift by refreshing values in memory, but by default persists nothing and makes no changes to real infrastructure. Read-only against the cloud.

terraform apply

What happens:

  • If no plan file provided, re-runs plan and prompts for confirmation
  • Walks the DAG and dispatches CRUD operations to providers via gRPC
  • After each successful resource operation, immediately writes the result to state — partial applies are safe because state is updated incrementally
  • On failure mid-apply, state reflects all successfully completed resources, allowing for re-runs
  • Outputs are computed and written to state at the end
⚠️
State impact: Heavily modifies state. State is locked at the start (with remote backends) and unlocked after completion. Never interrupt an apply.

💥 terraform destroy

What happens:

  • Reverses the dependency graph: resources that were created last are destroyed first
  • Creates a "destroy plan" — equivalent to planning against an empty configuration, so every managed resource is scheduled for deletion
  • Calls Delete on each resource via the provider gRPC interface
  • Removes destroyed resources from the state file incrementally
  • Honors prevent_destroy = true lifecycle rules — will refuse to destroy protected resources (see the sketch below)
🔴
State impact: Resources are removed from state as they are destroyed. A successful destroy results in an empty state. This is irreversible for stateful resources like databases.
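A minimal sketch of that prevent_destroy guard (the resource and bucket name are illustrative):

protected.tf (HCL)
resource "aws_s3_bucket" "audit_logs" {
  bucket = "myorg-audit-logs"

  lifecycle {
    # terraform destroy (and any plan that would delete this
    # resource) errors out instead of removing the bucket
    prevent_destroy = true
  }
}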
Part 3 · Advanced Engineering & Patterns

Advanced Patterns: State, Structure & DRY Code

Production Terraform engineering requires far more than writing resource blocks. This section covers the patterns that separate a working prototype from a maintainable, team-safe infrastructure platform.

🔒 Remote Backends & State Locking

Local state (terraform.tfstate on disk) is dangerous in team environments. Remote backends solve the fundamental problem of shared state with concurrent access control.

The Race Condition Problem

💥
Scenario: Alice runs terraform apply on her laptop. Bob runs terraform apply on his laptop simultaneously. Both read the same state, both compute independent plans, both write back a different state file. One overwrites the other. Resources exist in the cloud that are orphaned from state. Chaos ensues.

The Solution: Locking

Remote backends implement a distributed lock using the underlying storage's atomic operations. In S3 + DynamoDB, a lock record is written to DynamoDB before any state mutation and deleted after. Any concurrent Terraform run reads the lock and waits (or fails fast).
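A sketch of the lock table that setup assumes: the S3 backend requires a DynamoDB table whose partition key is a string attribute named LockID (the table name below is a placeholder):

lock_table.tf (HCL)
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST" # on-demand, no capacity planning
  hash_key     = "LockID"          # key the backend writes lock records under

  attribute {
    name = "LockID"
    type = "S"
  }
}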

S3 Remote Backend (Industry Standard)

backend.tf (HCL)
terraform {
  backend "s3" {
    bucket         = "my-org-tfstate"
    key            = "prod/vpc/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    # DynamoDB for distributed locking
    dynamodb_table = "terraform-state-locks"
    # Use KMS for encryption at rest
    kms_key_id     = "alias/terraform-state"
  }
}
Common remote backends: S3 + DynamoDB · Terraform Cloud · GCS · Azure Blob · Consul

🌍 Multi-Environment: Workspaces vs. Terragrunt

📁 Terraform Workspaces

Workspaces allow multiple state files within the same backend configuration. Each workspace maps to a separate terraform.tfstate object (e.g., env:/dev/terraform.tfstate).

workspaces.sh (bash)
terraform workspace new dev
terraform workspace new staging
terraform workspace new prod
terraform workspace select prod

# Reference the active workspace in config:
main.tf (HCL)
locals {
  env_config = {
    dev     = { instance_type = "t3.small"  }
    prod    = { instance_type = "m5.4xlarge" }
  }
  cfg = local.env_config[terraform.workspace]
}

Limitations

  • All environments share the same code — no isolation of backend config
  • Cannot have different providers per workspace easily
  • Easy to accidentally apply to the wrong workspace (dangerous; a guard sketch follows this list)
  • Best for simple, ephemeral environments (feature branches, testing)
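One way to blunt the wrong-workspace risk, sketched with a lifecycle precondition (assumes Terraform 1.4+ for the built-in terraform_data resource; the workspace list is illustrative):

workspace_guard.tf (HCL)
# Fails the plan early if run in an unrecognized workspace
resource "terraform_data" "workspace_guard" {
  lifecycle {
    precondition {
      condition     = contains(["dev", "staging", "prod"], terraform.workspace)
      error_message = "Unknown workspace. Use dev, staging, or prod."
    }
  }
}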

🌲 Terragrunt (Recommended for Teams)

Terragrunt is a thin wrapper around Terraform that adds DRY configuration, remote state management, and cross-module dependencies. Created by Gruntwork.

Directory Structure (tree)
infrastructure/
├── _base/
│   └── vpc/          # Reusable Terraform module
├── dev/
│   └── vpc/
│       └── terragrunt.hcl
├── staging/
│   └── vpc/
│       └── terragrunt.hcl
└── prod/
    └── vpc/
        └── terragrunt.hcl
prod/vpc/terragrunt.hcl (HCL)
terraform {
  source = "../../_base/vpc"
}

remote_state {
  backend = "s3"
  config  = {
    bucket = "my-prod-tfstate"
    key    = "vpc/terraform.tfstate"
  }
}

inputs = {
  env          = "prod"
  cidr_block   = "10.0.0.0/16"
  az_count     = 3
}

Terragrunt advantages: separate backend per environment, run-all apply across stacks, dependency blocks between modules, DRY backend configuration via root.hcl.
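A sketch of that DRY backend pattern (the bucket name is a placeholder; path_relative_to_include() and find_in_parent_folders() are built-in Terragrunt functions):

root.hcl (Terragrunt HCL)
remote_state {
  backend = "s3"
  config = {
    bucket         = "my-org-tfstate"
    # each child stack automatically gets its own state key,
    # e.g. prod/vpc/terraform.tfstate
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"
  }
}

Each child terragrunt.hcl then inherits it with include "root" { path = find_in_parent_folders("root.hcl") }.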

🔁 DRY Infrastructure: for_each, Dynamic Blocks & Modules

Repeating resource blocks for each environment or configuration variant violates the DRY principle. Terraform provides three powerful constructs to eliminate repetition:

🔄 for_each

Creates multiple resource instances from a map or set. Each instance has a unique key.

for_each.tf (HCL)
variable "buckets" {
  default = {
    logs   = "us-east-1"
    assets = "eu-west-1"
    backup = "ap-south-1"
  }
}

resource "aws_s3_bucket" "b" {
  for_each = var.buckets
  bucket   = "myorg-${each.key}"

  # Note: the provider meta-argument must be a static alias
  # (e.g. provider = aws.eu_west_1); it cannot be an
  # expression like each.value.
  tags = { region = each.value }
}

🧩 Dynamic Blocks

Generates nested configuration blocks programmatically from a list or map.

dynamic.tf (HCL)
resource "aws_security_group" "sg" {
  dynamic "ingress" {
    for_each = var.ingress_rules
    content {
      from_port = ingress.value.port
      to_port   = ingress.value.port
      protocol  = "tcp"
      cidr_blocks = ingress.value.cidrs
    }
  }
}

📦 Modules

Reusable, versioned packages of Terraform resources. The primary unit of abstraction.

main.tf (HCL)
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "production"
  cidr = "10.0.0.0/16"
  azs  = ["us-east-1a", "us-east-1b"]

  enable_nat_gateway = true
}
Part 4 · Real-World Implementation & Safety

Production Implementation: Security, CI/CD & Code

Knowing the internals is necessary but not sufficient. Production Terraform requires hardened secrets management, automated pipelines, and thoughtfully structured code.

🔐 Secrets Management

🔑 HashiCorp Vault

Best for: Dynamic secrets, database credentials, PKI, SSH signing. Vault generates short-lived credentials on demand — no static secrets in config.

vault_provider.tf (HCL)
data "vault_generic_secret" "db" {
  path = "secret/prod/db"
}

resource "aws_db_instance" "db" {
  password = data.vault_generic_secret.db.data["password"]
}

🏭 AWS KMS + SSM / Secrets Manager

Best for: AWS-native secrets. Store encrypted values in Secrets Manager, reference via Terraform data source. KMS encrypts state at rest.

ssm_secret.tf (HCL)
data "aws_secretsmanager_secret_version" "db_pass" {
  secret_id = "prod/db/password"
}

locals {
  db_pass = jsondecode(
    data.aws_secretsmanager_secret_version.db_pass.secret_string
  )["password"]
}

💻 Environment Variables

Best for: CI/CD pipeline credentials injected at runtime. Never store secrets in .tf files or commit them to git. Use TF_VAR_ prefix.

env_vars.sh (bash)
# CI/CD pipeline injects these
export TF_VAR_db_password="$SECRET"
export AWS_ACCESS_KEY_ID="$KEY"
export AWS_SECRET_ACCESS_KEY="$SECRET_KEY"

# NEVER do this in .tf files:
# password = "hardcoded123" ← DANGER

# Always mark sensitive variables
🔴
Mark variables sensitive = true to prevent Terraform from printing values in plan output.
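A minimal sketch:

sensitive_var.tf (HCL)
variable "db_password" {
  type      = string
  sensitive = true # plan/apply output shows (sensitive value) instead of the secret
}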

🔄 CI/CD: GitOps Pipeline with GitHub Actions

A standard GitOps Terraform pipeline enforces that all infrastructure changes go through code review before being applied, and that applies only happen from a controlled, auditable environment — not engineer laptops.

1

Pull Request Opened → terraform plan

GitHub Actions triggers on PR open/push. Runs terraform fmt -check, terraform validate, and terraform plan. The plan output is posted as a PR comment for human review. Tools like Atlantis or Terraform Cloud handle this natively.

2

Code Review & Security Scan

Reviewers inspect the plan diff. Automated tools like Checkov, tfsec, or Semgrep scan for security misconfigurations (open security groups, unencrypted buckets). Policy-as-code tools like Sentinel or OPA/Conftest enforce organizational policies.

3

PR Merge → terraform apply

On merge to main, the pipeline applies the saved plan file (not a fresh plan — this guarantees what was reviewed is what gets applied). Uses OIDC-based short-lived credentials instead of long-lived AWS access keys.

4

Drift Detection (Scheduled)

A nightly scheduled job runs terraform plan with the -detailed-exitcode flag. If exit code 2 (changes detected), it alerts the team via Slack/PagerDuty. This catches manual console changes before they cause incidents.

.github/workflows/terraform.yml (YAML)
name: Terraform CI/CD
on:
  pull_request:  # Plan on PR
  push:
    branches: [main]  # Apply on merge

permissions:
  id-token: write   # OIDC for AWS
  contents: read
  pull-requests: write  # Post plan comment

jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123:role/GitHubActions
          aws-region: us-east-1

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.9.0"

      - run: terraform init
      - run: terraform fmt -check -recursive
      - run: terraform validate

      - name: Terraform Plan
        run: terraform plan -out=plan.tfplan -no-color

      - name: Terraform Apply
        # plan.tfplan comes from the step above in this same run;
        # persist it as an artifact if you need to apply the exact
        # plan that was reviewed on the PR
        run: terraform apply -auto-approve plan.tfplan
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'

💻 Production Code: HA VPC + Managed Database (AWS)

📁
Module Structure: This example deploys a High-Availability 3-AZ VPC with public/private subnets, NAT Gateways, and an RDS Aurora PostgreSQL cluster across 3 availability zones using a modular structure following the Single Responsibility Principle.
modules/vpc/main.tf (HCL)
# ─── HIGH-AVAILABILITY VPC MODULE ────────────────────────────────────

data "aws_availability_zones" "available" {
  state = "available"
}

locals {
  azs             = slice(data.aws_availability_zones.available.names, 0, var.az_count)
  public_cidrs    = [for i, az in local.azs : cidrsubnet(var.cidr_block, 8, i)]
  private_cidrs   = [for i, az in local.azs : cidrsubnet(var.cidr_block, 8, i + 10)]
}

# ─── VPC ─────────────────────────────────────────────────────────────
resource "aws_vpc" "this" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(var.tags, { Name = "${var.name}-vpc" })
}

# ─── PUBLIC SUBNETS (one per AZ) ─────────────────────────────────────
resource "aws_subnet" "public" {
  for_each                = toset(local.azs)
  vpc_id                  = aws_vpc.this.id
  cidr_block              = local.public_cidrs[index(local.azs, each.key)]
  availability_zone       = each.key
  map_public_ip_on_launch = true

  tags = merge(var.tags, {
    Name = "${var.name}-public-${each.key}"
    Tier = "public"
  })
}

# ─── PRIVATE SUBNETS (one per AZ) ────────────────────────────────────
resource "aws_subnet" "private" {
  for_each          = toset(local.azs)
  vpc_id            = aws_vpc.this.id
  cidr_block        = local.private_cidrs[index(local.azs, each.key)]
  availability_zone = each.key

  tags = merge(var.tags, {
    Name = "${var.name}-private-${each.key}"
    Tier = "private"
  })
}

# ─── INTERNET GATEWAY ────────────────────────────────────────────────
resource "aws_internet_gateway" "this" {
  vpc_id = aws_vpc.this.id
  tags   = merge(var.tags, { Name = "${var.name}-igw" })
}

# ─── ELASTIC IPs + NAT GATEWAYS (one per public subnet for HA) ───────
resource "aws_eip" "nat" {
  for_each = toset(local.azs)
  domain   = "vpc"
  tags     = merge(var.tags, { Name = "${var.name}-eip-${each.key}" })
}

resource "aws_nat_gateway" "this" {
  for_each      = toset(local.azs)
  allocation_id = aws_eip.nat[each.key].id
  subnet_id     = aws_subnet.public[each.key].id
  depends_on    = [aws_internet_gateway.this]
  tags          = merge(var.tags, { Name = "${var.name}-nat-${each.key}" })
}

# ─── ROUTE TABLES ────────────────────────────────────────────────────
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.this.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.this.id
  }
  tags = merge(var.tags, { Name = "${var.name}-rt-public" })
}

resource "aws_route_table" "private" {
  for_each = toset(local.azs)
  vpc_id   = aws_vpc.this.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.this[each.key].id
  }
  tags = merge(var.tags, { Name = "${var.name}-rt-private-${each.key}" })
}

# ─── ROUTE TABLE ASSOCIATIONS ────────────────────────────────────────
resource "aws_route_table_association" "public" {
  for_each       = aws_subnet.public
  subnet_id      = each.value.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  for_each       = aws_subnet.private
  subnet_id      = each.value.id
  route_table_id = aws_route_table.private[each.key].id
}
modules/vpc/variables.tf (HCL)
variable "name" {
  description = "Prefix for all resource names"
  type        = string
}

variable "cidr_block" {
  description = "VPC CIDR block"
  type        = string
  default     = "10.0.0.0/16"
  validation {
    condition     = can(cidrhost(var.cidr_block, 0))
    error_message = "Must be a valid CIDR block."
  }
}

variable "az_count" {
  description = "Number of availability zones (2 or 3)"
  type        = number
  default     = 3
  validation {
    condition     = var.az_count >= 2 && var.az_count <= 3
    error_message = "az_count must be 2 or 3."
  }
}

variable "tags" {
  description = "Common tags applied to all resources"
  type        = map(string)
  default     = {}
}
modules/vpc/outputs.tf (HCL)
output "vpc_id" {
  description = "VPC ID"
  value       = aws_vpc.this.id
}

output "private_subnet_ids" {
  description = "IDs of all private subnets"
  value       = [for s in aws_subnet.private : s.id]
}

output "public_subnet_ids" {
  description = "IDs of all public subnets"
  value       = [for s in aws_subnet.public : s.id]
}

output "vpc_cidr" {
  value = aws_vpc.this.cidr_block
}
modules/database/main.tf (HCL)
# ─── AURORA POSTGRESQL CLUSTER (MULTI-AZ HA) ─────────────────────────

resource "aws_db_subnet_group" "this" {
  name       = "${var.name}-db-subnets"
  subnet_ids = var.private_subnet_ids
  tags       = var.tags
}

resource "aws_security_group" "db" {
  name   = "${var.name}-db-sg"
  vpc_id = var.vpc_id

  dynamic "ingress" {
    for_each = var.allowed_security_groups
    content {
      from_port       = 5432
      to_port         = 5432
      protocol        = "tcp"
      security_groups = [ingress.value]
    }
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  tags = var.tags
}

resource "aws_rds_cluster" "this" {
  cluster_identifier      = "${var.name}-aurora"
  engine                  = "aurora-postgresql"
  engine_version          = "15.4"
  database_name           = var.db_name
  master_username         = var.master_username
  master_password         = var.master_password # Injected from Vault/SSM
  db_subnet_group_name    = aws_db_subnet_group.this.name
  vpc_security_group_ids  = [aws_security_group.db.id]
  storage_encrypted       = true
  kms_key_id              = var.kms_key_arn
  deletion_protection     = var.deletion_protection
  skip_final_snapshot     = false
  final_snapshot_identifier = "${var.name}-aurora-final" # static name; timestamp() here would churn every plan

  lifecycle {
    prevent_destroy = true  # Safety net for production DB
    ignore_changes  = [master_password] # Managed externally
  }
  tags = var.tags
}

# ─── CLUSTER INSTANCES (reader + writer) ─────────────────────────────
resource "aws_rds_cluster_instance" "instances" {
  count              = var.instance_count
  identifier         = "${var.name}-instance-${count.index}"
  cluster_identifier = aws_rds_cluster.this.id
  instance_class     = var.instance_class
  engine             = aws_rds_cluster.this.engine
  engine_version     = aws_rds_cluster.this.engine_version

  performance_insights_enabled = true
  monitoring_interval          = 60
  auto_minor_version_upgrade   = true
  tags                         = var.tags
}
environments/prod/main.tf (root module, HCL)
terraform {
  required_version = ">= 1.9"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  backend "s3" {
    bucket         = "myorg-tfstate-prod"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

provider "aws" {
  region = var.aws_region
  default_tags {
    tags = {
      Environment = "prod"
      ManagedBy   = "terraform"
      Team        = "platform"
    }
  }
}

# ─── COMPOSE MODULES ─────────────────────────────────────────────────
module "vpc" {
  source     = "../../modules/vpc"
  name       = "prod"
  cidr_block = "10.0.0.0/16"
  az_count   = 3
  tags       = var.common_tags
}

module "database" {
  source                   = "../../modules/database"
  name                     = "prod"
  vpc_id                   = module.vpc.vpc_id
  private_subnet_ids       = module.vpc.private_subnet_ids
  db_name                  = "appdb"
  master_username          = "dbadmin"
  master_password          = var.db_password  # from TF_VAR_db_password
  instance_class           = "db.r6g.2xlarge"
  instance_count           = 3
  kms_key_arn              = var.kms_key_arn
  deletion_protection      = true
  allowed_security_groups  = []
  tags                     = var.common_tags
}

output "db_endpoint" {
  value     = module.database.cluster_endpoint
  sensitive = true
}
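The root module above references several inputs (var.aws_region, var.common_tags, var.db_password, var.kms_key_arn) without showing their declarations. A sketch of the companion variables file they imply (types assumed):

environments/prod/variables.tf (HCL)
variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "common_tags" {
  type    = map(string)
  default = {}
}

variable "db_password" {
  type      = string
  sensitive = true # injected via TF_VAR_db_password in CI
}

variable "kms_key_arn" {
  type = string
}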
Part 5 · Troubleshooting

Terraform Traps: Common Failures & How to Debug

Every Terraform practitioner eventually hits these failure modes. Understanding them deeply — and knowing the exact commands to diagnose them — separates production engineers from weekend experimenters.

🔄 Trap 1: Cycle Errors

Symptom: Error: Cycle: aws_security_group.a, aws_security_group.b

Occurs when resource A references resource B which references resource A — creating a circular dependency in the DAG. Terraform cannot determine which to create first.

Common Causes

  • Two security groups that reference each other's ID in their ingress rules
  • A module that outputs its own input
  • Misuse of depends_on creating an indirect cycle

The Fix

fix_cycle.tf (HCL)
# WRONG: creates a cycle
resource "aws_security_group" "a" {
  ingress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [aws_security_group.b.id]
  }
}

# FIX: use aws_security_group_rule
resource "aws_security_group" "a" {}
resource "aws_security_group" "b" {}
resource "aws_security_group_rule" "a_to_b" {
  type                     = "ingress"
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.a.id
  security_group_id        = aws_security_group.b.id
}

📉 Trap 2: State Drift

Symptom: Resources exist in the cloud but not in state, or vice versa. Plan shows unexpected changes.

Occurs when humans manually modify cloud resources (console changes, emergency hotfixes) without updating Terraform configuration. This is the most dangerous operational failure mode.

Diagnosis & Remediation

drift_commands.sh (bash)
# 1. Detect drift (refresh from real APIs)
terraform plan -refresh-only

# 2. If resource exists in cloud but not state:
terraform import aws_s3_bucket.my_bucket my-bucket-name

# 3. If resource in state but deleted in cloud:
terraform state rm aws_s3_bucket.my_bucket

# 4. If resource config drifted:
terraform apply -refresh-only # accept real state

# 5. View raw state for investigation:
terraform state list
terraform state show aws_s3_bucket.my_bucket
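Since Terraform 1.5, step 2 can also be expressed declaratively with an import block, which makes the import itself reviewable in a pull request:

import.tf (HCL)
# Plan shows the import; apply records it in state.
# Remove the import block once it has been applied.
import {
  to = aws_s3_bucket.my_bucket
  id = "my-bucket-name"
}

resource "aws_s3_bucket" "my_bucket" {
  bucket = "my-bucket-name"
}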

🔒 Trap 3: State Lock Stuck

Symptom: Error acquiring the state lock: ConditionalCheckFailedException

Occurs when a previous Terraform process was killed mid-apply (Ctrl+C, CI timeout, network loss) and failed to release the DynamoDB lock. The state is safe but no one can apply.

unlock.sh (bash)
# Get lock ID from error message, then:
terraform force-unlock <LOCK_ID>

# Or directly in DynamoDB (emergency):
aws dynamodb delete-item \
  --table-name terraform-state-locks \
  --key '{"LockID": {"S": "path/to/state"}}'

# ⚠️ Verify no other process is running first!

🔄 Trap 4: Resource Replace on Change

Symptom: Plan shows -/+ destroy and then create replacement for an in-place change you expected.

Some resource attributes are "ForceNew" — changing them requires destroying the old resource and creating a new one (e.g., changing an EC2 instance's AMI or its subnet forces replacement, as does changing an RDS cluster's engine). This can cause production downtime.

lifecycle_trick.tf (HCL)
resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.medium"

  # Create new instance before destroying old
  lifecycle {
    create_before_destroy = true
  }
}

Use create_before_destroy = true for zero-downtime replacements. Combine with a load balancer to drain traffic from the old instance before destruction.

🔍 Debugging with TF_LOG

Terraform's TF_LOG environment variable enables detailed logging at multiple verbosity levels. This is your most powerful debugging tool for provider errors, API rate limits, and network issues.

Log Levels (ascending verbosity)

| Level | Contents | Use Case |
|---|---|---|
| ERROR | Fatal errors only | Production alerting |
| WARN | Warnings + errors | Deprecation notices |
| INFO | High-level flow | General debugging |
| DEBUG | All operations | Provider API calls |
| TRACE | Everything (incl. gRPC) | Provider internals |

Debug Commands

debug.sh (bash)
# Full debug to file (TRACE = everything)
export TF_LOG=TRACE
export TF_LOG_PATH=./terraform.log
terraform apply

# Debug only the provider (not core)
export TF_LOG_PROVIDER=DEBUG

# See exact AWS API calls made:
TF_LOG=DEBUG terraform apply 2>&1 | grep -i "request"

# Validate config without cloud calls:
terraform validate

# Plan with detailed resource reason:
terraform plan -out=p.tfplan
terraform show -json p.tfplan | jq \
  '.resource_changes[]
   | select(.change.actions[]
     == "delete")'

📋 Terraform Traps Quick Reference

| Trap | Root Cause | Immediate Command | Prevention |
|---|---|---|---|
| Cycle Error | Circular resource references in DAG | terraform graph piped to dot -Tsvg to visualize | Use _rule resources to break cycles |
| State Drift | Manual cloud console changes | terraform plan -refresh-only | Nightly drift-detection CI job |
| Stuck Lock | Killed mid-apply process | terraform force-unlock <ID> | Use CI with proper timeout handling |
| Force Replace | Immutable attribute changed | terraform plan to preview | lifecycle { create_before_destroy } |
| Provider Timeout | Cloud API slow / rate limited | TF_LOG=DEBUG terraform apply | Set a timeouts block in the resource |
| State Corruption | Concurrent applies or manual edits | Restore from S3 versioned backup | Always use remote state + locking |
| Variable Leakage | Sensitive values printed in logs | Check sensitive = true flags | Mark all secrets as sensitive |
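For the Provider Timeout row, a sketch of the timeouts block (supported by many resources, including aws_rds_cluster; the values are illustrative):

timeouts.tf (HCL)
resource "aws_rds_cluster" "example" {
  cluster_identifier = "example"
  engine             = "aurora-postgresql"

  # Override the provider's default deadlines for slow cloud APIs
  timeouts {
    create = "90m"
    update = "60m"
    delete = "60m"
  }
}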
References

Sources & Further Reading

All information in this guide is grounded in official documentation, academic research, and industry publications. The following references are cited throughout.

1
HashiCorp Terraform Official Documentation — Comprehensive reference for all Terraform commands, configuration, backends, and providers. Primary source for lifecycle mechanics, state management, and plugin architecture.
2
Terraform Internals: Resource Graph — Official deep-dive into the DAG engine, topological sort implementation, and parallel walk algorithm.
3
Terraform State: Purpose of Terraform State — HashiCorp's official explanation of why state is required, what it contains, and how it maps resources to real infrastructure.
4
Terraform GitHub — Historical Changelog since v0.1 (July 2014) — Primary source for the historical evolution of Terraform from its initial release through the plugin protocol changes.
5
Terragrunt by Gruntwork — Official Terragrunt documentation covering DRY backends, run-all operations, and cross-module dependencies.
6
Atlantis: Terraform Pull Request Automation — Documentation for the GitOps workflow tool that automates plan/apply on pull requests.
7
Terraform Plugin Framework (gRPC Protocol) — Technical specification for how providers implement the gRPC server interface and communicate with Terraform Core.
8
AWS Provider Documentation (registry.terraform.io) — Complete reference for all AWS resources, data sources, and provider configuration used in the code examples.
9
Checkov — Infrastructure as Code Security Scanning — Tool for static analysis of Terraform code to detect security misconfigurations before apply.
10
HashiCorp Vault: Secrets Engines — Official documentation for integrating Vault dynamic secrets with Terraform for database credential injection.
11
OpenTofu Documentation (Terraform Fork) — Community-maintained fork of Terraform under the Linux Foundation, providing insight into the open-source Terraform codebase and community direction post-BSL license change (2023).
12
GitHub Actions OIDC for AWS Authentication — Recommended approach for authenticating GitHub Actions to AWS without long-lived access keys using OIDC federation.