🗄️ Big Data Platform Design

Amazon S3 — Complete Reference
for Data Engineers

From fundamentals to petabyte-scale production architecture. Everything you need to design, build, and operate S3 as the backbone of your Big Data Platform.

335T+ — objects stored globally
11 9s — durability guarantee
5 TB — max single object size
Effectively unlimited — total storage scale

🧠 What is Amazon S3?

Simple Storage Service — the world's most widely used object storage, and the de facto data lake foundation.

🗃️

Object Storage (Not File, Not Block)

S3 stores data as objects in flat namespaces called buckets. Unlike a filesystem, there are no directories — only keys (paths) and values (bytes). Every object gets a unique URL and optional metadata.


s3://my-datalake/year=2025/month=03/events.parquet
📐

Core Concepts

  • Bucket — global namespace container, tied to a region
  • Object — file + metadata, up to 5 TB
  • Key — full path of the object within a bucket
  • Prefix — logical folder simulation using "/" in key names
  • Versioning — keep history of all object versions
  • ETag — object hash used for integrity checks (a plain MD5 only for single-part, non-KMS uploads; multipart ETags carry a -N suffix)
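Because the namespace is flat, "folders" exist only at LIST time: the server groups keys by a delimiter. A minimal pure-Python sketch of that grouping logic (it mimics, rather than calls, the real ListObjectsV2 API):

```python
def group_keys(keys, prefix="", delimiter="/"):
    """Mimic ListObjectsV2 delimiter grouping: keys under `prefix` are
    split into CommonPrefixes ("folders") and direct Contents ("files")."""
    folders, files = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the first delimiter becomes a pseudo-folder
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            files.append(key)
    return sorted(folders), files

keys = [
    "year=2025/month=03/events.parquet",
    "year=2025/month=04/events.parquet",
    "manifest.json",
]
print(group_keys(keys))                      # (['year=2025/'], ['manifest.json'])
print(group_keys(keys, prefix="year=2025/")) # two month "folders", no direct files
```

The real API behaves the same way: `s3.list_objects_v2(Bucket=..., Prefix="year=2025/", Delimiter="/")` returns `CommonPrefixes` for the pseudo-folders and `Contents` for direct keys.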
🔌

API-First Design

Everything is HTTP REST. GET / PUT / DELETE / HEAD / LIST — five verbs cover nearly everything. This makes S3 integrable with virtually any language, framework, or system.


REST API AWS SDK S3-compatible API CLI / boto3
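Since every object is just an HTTP resource, its address can be built by hand. A sketch of the virtual-hosted-style URL format (bucket name in the hostname, percent-encoded key in the path; bucket and key here are illustrative):

```python
from urllib.parse import quote

def object_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    """Virtual-hosted-style S3 URL: bucket in the hostname, key in the path.
    quote() keeps '/' intact but percent-encodes characters like '=' or spaces."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"

print(object_url("my-datalake", "year=2025/month=03/events.parquet"))
# https://my-datalake.s3.us-east-1.amazonaws.com/year%3D2025/month%3D03/events.parquet
```

Actually fetching a private object this way additionally requires a SigV4 signature (or a presigned URL generated by an SDK); the URL shape itself is all this sketch shows.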
🌐

S3 as Data Lake Foundation

In modern Big Data architecture, S3 acts as the central data lake — a single source of truth where raw, processed, and curated data all live. Every processing engine (Spark, Flink, Athena, Redshift) reads from and writes to S3.


Apache Spark AWS Glue Athena Redshift Spectrum

🔑 S3 in the Data Architecture Landscape

[Architecture diagram] Data sources (Kafka / streams, databases / RDS, APIs / SaaS) feed Amazon S3 as the storage layer — the data lake foundation (11 nines durability, effectively unlimited scale) — which is read by the compute/query layer: Apache Spark / Flink / EMR, AWS Athena (serverless SQL), and Redshift Spectrum (analytics DW).

⚙️ How S3 Works

Under the hood: request routing, consistency model, data paths, and operations.

📡 Request Lifecycle — PUT an Object

1

DNS Resolution + TLS Handshake

Client resolves bucket.s3.region.amazonaws.com. The bucket name routes to the correct AWS region's S3 fleet. TLS 1.3 secures all traffic in transit.

2

Authentication & Authorization (SigV4)

AWS Signature Version 4 signs every request with HMAC-SHA256. IAM evaluates the identity — IAM Role, User, or Service Account — against the bucket policy and resource policy before any data movement starts.
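The SigV4 signing key behind that HMAC-SHA256 signature is derived by chaining HMACs over the date, region, and service. A self-contained sketch of just that derivation chain (request canonicalization and the final signature step are omitted; the secret key is a placeholder):

```python
import hashlib
import hmac

def hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def sigv4_signing_key(secret_key: str, date: str, region: str,
                      service: str = "s3") -> bytes:
    """Derive the SigV4 signing key: an HMAC chain over
    date (YYYYMMDD) -> region -> service -> the literal 'aws4_request'."""
    k_date = hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = hmac_sha256(k_date, region)
    k_service = hmac_sha256(k_region, service)
    return hmac_sha256(k_service, "aws4_request")

key = sigv4_signing_key("EXAMPLE-SECRET", "20250318", "us-east-1")
print(len(key))  # 32 — always a 256-bit signing key
```

Because the key is scoped to a single day, region, and service, a leaked signature cannot be replayed elsewhere — which is why IAM can evaluate every request independently.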

3

Data Ingestion — Frontend Service

The S3 frontend service (load-balancer tier) accepts the TCP stream. For objects >100 MB, Multipart Upload splits data into concurrent parts (min 5 MB each) that are uploaded in parallel and assembled atomically on completion.

4

Durability: Multi-AZ Replication

S3 Standard replicates every object across ≥3 Availability Zones. Each AZ independently stores the data on multiple physical drives with Reed-Solomon erasure coding. The PUT only returns 200 OK after AZ-redundant replication is confirmed.

5

Metadata Indexing

A distributed metadata service records the object key, size, ETag, storage class, ACL, and version ID. This index is what powers the LIST API and enables S3 Inventory, Lifecycle, and Replication rules to function at scale.

6

Strong Read-After-Write Consistency

Since December 2020, S3 offers strong consistency for all operations — GET, PUT, DELETE, LIST. A successful PUT is immediately visible to all subsequent GETs with no eventual consistency lag. This is critical for Big Data pipelines.

🔢 Key Operations & APIs

  • PutObject — upload single object (≤5GB)
  • GetObject — download full object or byte-range
  • DeleteObject — remove object (soft with versioning)
  • ListObjectsV2 — paginate up to 1000 keys/page
  • CreateMultipartUpload — start chunked upload
  • CopyObject — server-side copy without re-transfer
  • SelectObjectContent — query CSV/JSON/Parquet in-place
  • HeadObject — get metadata without downloading data

📊 Throughput & Limits

  • Per-prefix: 5,500 GET/HEAD, 3,500 PUT/DELETE per sec
  • Scaling trick: Use randomized prefixes to shard across S3 partitions
  • Max object size: 5 TB (requires multipart for >5 GB)
  • Max bucket count: 100 per account by default (raisable via service quota request)
  • No bucket size limit: store exabytes in one bucket
  • S3 Transfer Acceleration: routes uploads through CloudFront edge locations for faster long-distance PUTs

🔒 Security Model

  • Encryption at rest: SSE-S3 (free), SSE-KMS (auditable), SSE-C (customer key)
  • Encryption in transit: TLS enforced via bucket policy
  • Block Public Access: account-level guardrail
  • Bucket Policy: resource-based JSON IAM policy
  • VPC Endpoint: private access, traffic never leaves AWS network
  • Macie: ML-based PII/sensitive data detection
  • Object Lock: WORM for compliance (GOVERNANCE / COMPLIANCE)
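The TLS-in-transit enforcement mentioned above is typically a Deny statement keyed on aws:SecureTransport. A sketch that builds such a bucket policy as JSON (bucket name hypothetical; applying it would go through boto3's put_bucket_policy):

```python
import json

def tls_only_policy(bucket: str) -> str:
    """Bucket policy denying any request that arrives over plain HTTP.
    Apply with boto3: s3.put_bucket_policy(Bucket=bucket, Policy=policy)."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket}",      # bucket-level actions (e.g. List)
                f"arn:aws:s3:::{bucket}/*",    # object-level actions
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }
    return json.dumps(policy, indent=2)

print(tls_only_policy("my-company-curated-prod"))
```

An explicit Deny overrides any Allow, so even a role with s3:* cannot read or write this bucket over unencrypted HTTP.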

🔔 Events & Integrations

  • S3 Event Notifications → SQS / SNS / Lambda
  • Amazon EventBridge integration → 100+ AWS services, fine-grained routing
  • Object Lambda: transform data on GET without copying
  • S3 Batch Operations: invoke Lambda on billions of objects
  • Replication (CRR/SRR): async object-level replication
  • S3 Access Logs: detailed request logging to S3
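An S3 Event Notification delivers a JSON document with one record per affected object. A minimal Lambda-style handler sketch that extracts bucket, key, and size (keys arrive URL-encoded, with spaces as '+'; the sample bucket name is hypothetical):

```python
from urllib.parse import unquote_plus

def handler(event, context=None):
    """Lambda-style handler for S3 Event Notifications: pull out the
    event name, bucket, decoded key, and size of each affected object."""
    results = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        results.append({
            "event": record["eventName"],
            "bucket": s3["bucket"]["name"],
            # Keys are URL-encoded in the notification payload
            "key": unquote_plus(s3["object"]["key"]),
            "size": s3["object"].get("size", 0),
        })
    return results

sample_event = {"Records": [{
    "eventName": "ObjectCreated:Put",
    "s3": {"bucket": {"name": "my-company-raw-prod"},
           "object": {"key": "events/year%3D2025/part-00001.parquet",
                      "size": 1048576}},
}]}
print(handler(sample_event))
```

The same record shape arrives whether the notification is routed through SQS, SNS, or directly to Lambda, so this parsing logic is reusable across all three targets.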

🏗️ S3 Internal Architecture

How AWS built a system that stores trillions of objects with 11 nines of durability.

🌐 Multi-Layer Architecture

[Architecture diagram] Client tier (SDK / CLI, HTTP REST API, presigned URLs, VPC endpoints, Transfer Acceleration) → frontend service layer (load balancing, SigV4 auth, rate limiting, TLS termination, request routing) → metadata service (key → location mapping; distributed, strongly consistent NoSQL) → storage nodes holding Reed-Solomon erasure-coded chunks distributed across AZs and physical hosts — e.g., us-east-1a / 1b / 1c, each with 3+ replicas and independent power and networking.

🔢 Erasure Coding

S3 uses Reed-Solomon erasure coding rather than simple 3× replication. Data is split into k data chunks and m parity chunks. The system can reconstruct the full object even if m chunks are lost — providing durability comparable to 6× replication at ~1.5× storage cost.


Example: k=10 data + m=4 parity → survives 4 simultaneous drive failures
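The storage overhead of a k+m scheme is (k+m)/k, and the number of survivable simultaneous chunk losses is m. A quick check of the numbers above:

```python
def erasure_overhead(k: int, m: int) -> tuple:
    """For k data chunks + m parity chunks, return the storage overhead
    factor and the number of simultaneous chunk losses survivable (= m)."""
    return (k + m) / k, m

overhead, survivable = erasure_overhead(10, 4)
print(f"{overhead:.2f}x storage, survives {survivable} losses")
# 1.40x storage, survives 4 losses
# Compare 3x replication: 3.0x storage, survives only 2 lost copies
```

So a 10+4 code gives better failure tolerance than triple replication for less than half the raw storage, which is the economics behind the "~1.5× storage cost" figure above.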

🗂️ Metadata Sharding

S3's metadata service is a distributed key-value store sharded by bucket + key-prefix hash. This is why prefix design matters — sequential prefixes (e.g., timestamps) concentrate load on a single shard, while varied prefixes distribute it across many shard servers. S3 does repartition hot prefixes automatically over time, but sustained bursts against sequential keys can still hit 503 Slow Down throttling before repartitioning catches up.

📦 Multipart Upload Internals

For large objects, S3 Multipart Upload lets you upload parts (5 MB–5 GB each) in parallel. Parts are stored as separate objects internally, then assembled via CompleteMultipartUpload which is an atomic operation — either all parts succeed or the whole upload fails.
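One externally visible trace of this part-wise storage is the multipart ETag: it is the MD5 of the concatenated per-part MD5 digests, suffixed with the part count — not the MD5 of the whole object. A sketch of that widely observed (but not formally contractual) format:

```python
import hashlib

def multipart_etag(parts: list) -> str:
    """Compute the ETag S3 reports for a multipart upload:
    md5(md5(part1) + md5(part2) + ...) followed by '-<part count>'.
    `parts` is a list of bytes objects, one per uploaded part."""
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"

etag = multipart_etag([b"a" * 1024, b"b" * 1024])
print(etag)  # 32 hex chars, then '-2' for two parts
```

This is why a multipart-uploaded object's ETag never matches a local `md5sum` of the file — integrity checks across multipart uploads must either replay the same part-size split or use checksum features instead.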

🔄 Consistency Architecture

S3 achieves strong consistency through a distributed consensus protocol at the metadata layer. All reads are routed through the metadata service, which ensures the latest committed version is returned. This was a fundamental architectural redesign deployed in 2020.

🚀 Why S3 for Big Data?

S3 wins in every dimension that matters for large-scale data platforms.

📊 S3 vs Alternatives for Data Lake

| Dimension | Amazon S3 | HDFS | Azure ADLS | GCS |
|---|---|---|---|---|
| Durability | 11 nines (99.999999999%) | ~4 nines (3× replication) | 11+ nines | 11+ nines |
| Scalability | ∞ (no practical limit) | Cluster-bound, hard limit | ∞ | ∞ |
| Cost/GB/mo | ~$0.023 (Standard) | $0.05–0.10 (ops incl.) | ~$0.018 | ~$0.020 |
| Operational overhead | Zero (fully managed) | High (cluster admin) | Low | Low |
| Compute separation | Native | Coupled | Native | Native |
| Ecosystem | Universal (Spark, Flink, dbt…) | Hadoop-centric | Azure-first | GCP-first |
| Strong consistency | Yes (since 2020) | Yes | Yes | Yes |
⚖️

Decoupled Compute & Storage

With HDFS, scaling storage requires scaling compute (and vice versa). S3 breaks this coupling — you can run a 1-node Spark cluster or a 10,000-node EMR fleet against the same S3 data. Elasticity becomes cost-efficient at any scale.

💸

Dramatic Cost Tiering

S3's storage classes reduce cost by up to 95% for cold data. Active warehouse tables stay in Standard; 90-day-old audit logs move to Glacier Instant; 3-year compliance data moves to Glacier Deep Archive. Lifecycle rules automate all of this.

🔗

Universal Connector

Every major data tool supports S3 natively: Apache Spark, Flink, Hive, Presto, Trino, dbt, Airbyte, Fivetran, Databricks, Snowflake, and 200+ more. S3 is the common language of Big Data.

🛡️

Enterprise Security & Compliance

SOC 2, HIPAA, PCI-DSS, GDPR, FedRAMP. Object Lock for WORM compliance. KMS-managed encryption keys. Fine-grained IAM policies. VPC-only access. Macie for data classification. S3 is battle-tested for regulated industries.

💡
Key Insight: The "Data Lakehouse" pattern (Delta Lake, Apache Iceberg, Apache Hudi on S3) gives you both the cheap scale of a data lake AND the ACID transactions of a warehouse — all sitting on S3 as the storage layer.

📦 S3 Storage Classes

Choose the right tier for each data access pattern to minimize cost without sacrificing availability.

Storage Class Decision Matrix

S3 Standard — DEFAULT
Frequently accessed data: hot data lake tables, recent partitions, active pipeline staging. No retrieval fee. ms latency · $0.023/GB

S3 Intelligent-Tiering — AUTO
Automatically moves objects between tiers based on access patterns. Best for unpredictable workloads. Small per-object monitoring fee. ms latency · $0.023–$0.004/GB

S3 Standard-IA — INFREQUENT
Accessed less than once a month. Disaster recovery, older year partitions. 30-day minimum storage charge. ms latency · $0.0125/GB

S3 One Zone-IA — SINGLE AZ
Infrequently accessed, reproducible data. Stored in one AZ only — lower cost but no AZ-failure protection. Good for thumbnails and re-derivable data. ms latency · $0.010/GB

Glacier Instant Retrieval — ARCHIVE
Long-lived archive accessed roughly once a quarter. Medical images, media assets. Same instant access as Standard-IA with cheaper storage. ms latency · $0.004/GB

Glacier Flexible Retrieval — COLD
Rarely accessed bulk backups and compliance archives. Retrieval: 1–5 min (Expedited), 3–5 hr (Standard), 5–12 hr (Bulk). 1 min–12 hr · $0.0036/GB

Glacier Deep Archive — DEEP FREEZE
Cheapest S3 storage. 7–10 year compliance retention, regulatory records. 180-day minimum storage. 12–48 hr retrieval · $0.00099/GB

🔄 Lifecycle Policy Example

Automatically transition objects across tiers based on age:

JSON
{
  "Rules": [
    {
      "ID": "DataLakeLifecycle",
      "Status": "Enabled",
      "Filter": { "Prefix": "data/processed/" },
      "Transitions": [
        { "Days": 30,  "StorageClass": "STANDARD_IA" },
        { "Days": 90,  "StorageClass": "GLACIER_IR" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "NoncurrentVersionTransitions": [
        { "NoncurrentDays": 7, "StorageClass": "STANDARD_IA" }
      ],
      "NoncurrentVersionExpiration": { "NoncurrentDays": 90 }
    }
  ]
}
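The same rule can be built and sanity-checked in code before it is applied. A sketch (rule-ID naming and the validation are mine; the dict shape matches what boto3's put_bucket_lifecycle_configuration expects, and the bucket name in the comment is hypothetical):

```python
def lifecycle_rule(prefix: str, transitions: list,
                   noncurrent_expire_days: int = 90) -> dict:
    """Build one lifecycle rule dict in the shape boto3's
    put_bucket_lifecycle_configuration expects. `transitions` is a list of
    (days, storage_class) tuples; ages must strictly increase."""
    days = [d for d, _ in transitions]
    assert days == sorted(set(days)), "transition Days must strictly increase"
    return {
        "ID": f"tiering-{prefix.strip('/').replace('/', '-')}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": d, "StorageClass": sc} for d, sc in transitions],
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_expire_days},
    }

rule = lifecycle_rule("data/processed/",
                      [(30, "STANDARD_IA"), (90, "GLACIER_IR"), (365, "DEEP_ARCHIVE")])
# boto3 usage (not executed here):
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-company-curated-prod",
#     LifecycleConfiguration={"Rules": [rule]})
print(rule["ID"], len(rule["Transitions"]))
```

Validating transition ordering up front matters because a misordered policy fails only at apply time, with a less helpful API error.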

🔥 Production at Petabyte Scale

Designing and operating S3 when your data lake holds 1 PB+ across hundreds of billions of objects.

🗂️ Bucket Strategy at PB Scale

  • Separation of concerns: Raw / Staged / Curated / Archive buckets
  • One bucket per environment: dev / staging / prod (never share)
  • Region co-location: S3 + Compute in same region → zero egress cost
  • S3 Inventory: daily CSV manifest of all objects for auditing
  • Requester Pays: charge consumers for cross-account access

my-company-raw-prod
my-company-staged-prod
my-company-curated-prod
my-company-archive-prod

📁 Partition Design Strategy

Good partitioning is the single most impactful performance decision for query engines (Athena, Spark).

# ❌ BAD — Sequential keys hot-spot S3 shards
s3://bucket/2025-03-18-00001.parquet

# ✅ GOOD — Hive-style partition pruning
s3://bucket/events/
  year=2025/month=03/day=18/
    region=us-east/
      part-00001.parquet

# ✅ BEST at PB scale — prefix randomization
s3://bucket/a3f2/events/year=2025/...
s3://bucket/7c1e/events/year=2025/...
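A deterministic way to get the "BEST" layout above is to derive the shard prefix from a hash of a stable id, so writes spread across prefixes while any reader can recompute the exact key. A sketch with hypothetical names:

```python
import hashlib

def partitioned_key(event_id: str, year: int, month: int, day: int,
                    dataset: str = "events", shard_width: int = 4) -> str:
    """Hive-style partitioned key with a leading hash shard: spreads
    sequential writes across many S3 prefixes yet stays derivable
    from (event_id, date) alone."""
    shard = hashlib.md5(event_id.encode()).hexdigest()[:shard_width]
    return (f"{shard}/{dataset}/"
            f"year={year:04d}/month={month:02d}/day={day:02d}/"
            f"{event_id}.parquet")

print(partitioned_key("evt-00042", 2025, 3, 18))
# e.g. '<4 hex chars>/events/year=2025/month=03/day=18/evt-00042.parquet'
```

With a 4-hex shard there are 65,536 possible prefixes; at the per-prefix limits quoted above, even a few hundred active shards already push aggregate throughput into the hundreds of thousands of requests per second.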

⚡ Performance Tuning at Scale

Key levers to maximize S3 throughput for large-scale jobs

🔀 Prefix Sharding

Each prefix partition handles 3,500 PUT/5,500 GET per sec. Shard across 100+ prefixes to reach 500K+ RPS. Add random 4-hex prefix to all keys.

📦 Multipart Upload

Always use Multipart for files >100 MB. Use 64–128 MB part sizes. Saturate network with 20–50 concurrent part uploads per object.

🔌 S3 Select / Byte Range

Use S3 Select to push filter predicates down to S3 for CSV/JSON/Parquet. Use byte-range GET to read only relevant row groups in Parquet files.

💾 File Sizing

Target 128–512 MB Parquet files. Too small = LIST API overhead. Too large = wasted reads. Use compaction jobs (Spark/Glue) to maintain file sizes.
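The compaction target math is simple: total partition bytes divided by the target file size, rounded up. A sketch (256 MB chosen as a midpoint of the 128–512 MB range above):

```python
import math

def target_file_count(total_bytes: int,
                      target_file_bytes: int = 256 * 1024 * 1024) -> int:
    """How many output files a compaction job should produce to land
    in the 128-512 MB sweet spot (targeting 256 MB here)."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# A 1 TiB partition of tiny files compacts into ~4096 files of ~256 MB
n = target_file_count(1 * 1024**4)
print(n)  # 4096
# In Spark, this feeds straight into: df.repartition(n).write.parquet(...)
```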

🏛️ Data Lakehouse Architecture at PB Scale

[Architecture diagram] Ingestion (Kafka, Firehose, DMS / CDC, API Gateway) → S3 raw/ (JSON, Avro, CSV) → processing (Spark on EMR, AWS Glue ETL, Apache Flink, dbt transforms) → S3 curated/ (Parquet, Delta, Iceberg tables) → query/serve (Amazon Athena, Redshift Spectrum, Trino / Presto, QuickSight / BI) — all registered in the AWS Glue Data Catalog / Hive Metastore (table metadata, schema, partition info).

🔒 Production Security Checklist

  • 🔐
    Block All Public Access — Enable at account level. No exceptions for data lakes. All access via IAM roles only.
  • 🔑
    SSE-KMS Encryption — Use customer-managed KMS keys (CMK) for each environment. Enforce via bucket policy (aws:SecureTransport + s3:x-amz-server-side-encryption).
  • 🌐
    VPC Gateway Endpoints — All S3 access from within VPC goes through private endpoint. No internet traversal. Zero data egress cost for VPC traffic.
  • 📋
    CloudTrail + S3 Access Logs — Log all data-plane operations. Feed to Security Lake or SIEM. Retain 1 year minimum for compliance.
  • 🛡️
    Bucket Versioning + Object Lock — Enable versioning on curated buckets. Use COMPLIANCE mode Object Lock for regulatory data. Protects against ransomware and accidental deletion.
  • 🔁
    Cross-Region Replication (CRR) — Replicate curated data to DR region. Use RTC (Replication Time Control) for 99.99% of objects replicated within 15 minutes SLA.

📋 Config & Code Reference

Production-ready configurations for setting up S3 in a Big Data platform.

🏗️ Terraform — Production S3 Data Lake Bucket

TERRAFORM
# ═══ S3 Data Lake — Production Setup ═══

resource "aws_s3_bucket" "datalake_curated" {
  bucket = "my-company-curated-prod"

  tags = {
    Environment = "production"
    DataClass   = "curated"
    Team        = "data-platform"
  }
}

# ── Block all public access ──
resource "aws_s3_bucket_public_access_block" "curated" {
  bucket                  = aws_s3_bucket.datalake_curated.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# ── SSE-KMS Encryption ──
resource "aws_s3_bucket_server_side_encryption_configuration" "curated" {
  bucket = aws_s3_bucket.datalake_curated.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.datalake.arn
    }
    bucket_key_enabled = true  # Reduces KMS API costs by 99%
  }
}

# ── Versioning ──
resource "aws_s3_bucket_versioning" "curated" {
  bucket = aws_s3_bucket.datalake_curated.id
  versioning_configuration { status = "Enabled" }
}

# ── Lifecycle Policy ──
resource "aws_s3_bucket_lifecycle_configuration" "curated" {
  bucket = aws_s3_bucket.datalake_curated.id

  rule {
    id     = "tiering"
    status = "Enabled"

    transition {
      days          = 90
      storage_class = "STANDARD_IA"
    }
    transition {
      days          = 365
      storage_class = "GLACIER_IR"
    }

    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}

# ── CRR to DR Region ──
resource "aws_s3_bucket_replication_configuration" "curated" {
  bucket = aws_s3_bucket.datalake_curated.id
  role   = aws_iam_role.replication.arn

  rule {
    id     = "disaster-recovery"
    status = "Enabled"

    destination {
      bucket        = aws_s3_bucket.datalake_curated_dr.arn
      storage_class = "STANDARD_IA"

      replication_time {
        status = "Enabled"   # RTC: 15-min SLA
        time { minutes = 15 }
      }
    }
  }
}

🐍 Python (boto3) — PB-Scale Multipart Upload

PYTHON
import boto3
from boto3.s3.transfer import TransferConfig
import concurrent.futures

# ── Optimized config for PB-scale uploads ──
transfer_config = TransferConfig(
    multipart_threshold = 100 * 1024 * 1024,   # 100 MB
    multipart_chunksize = 128 * 1024 * 1024,   # 128 MB parts
    max_concurrency     = 20,                   # 20 threads/object
    use_threads         = True
)

s3 = boto3.client('s3', region_name='us-east-1')

def upload_to_datalake(local_path: str, s3_key: str):
    """Upload with optimised multipart + SSE-KMS + metadata."""
    s3.upload_file(
        Filename   = local_path,
        Bucket     = 'my-company-raw-prod',
        Key        = s3_key,
        Config     = transfer_config,
        ExtraArgs  = {
            'ServerSideEncryption': 'aws:kms',
            'SSEKMSKeyId'        : 'arn:aws:kms:...',
            'StorageClass'       : 'INTELLIGENT_TIERING',
            'ContentType'        : 'application/octet-stream',
            'Metadata'           : {
                'pipeline-version': 'v2.1',
                'source-system'   : 'kafka-prod'
            }
        }
    )

# ── Paginated LIST across billions of objects ──
def list_objects_paginated(bucket: str, prefix: str):
    paginator = s3.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix)
    for page in pages:
        for obj in page.get('Contents', []):
            yield obj['Key'], obj['Size']

# ── Byte-range GET (read only a Parquet row group) ──
def read_parquet_footer(bucket: str, key: str) -> bytes:
    response = s3.get_object(
        Bucket    = bucket,
        Key       = key,
        Range     = 'bytes=-8192'  # Last 8KB (Parquet footer)
    )
    return response['Body'].read()

⚡ Apache Spark — Read from S3 at Scale

PYTHON / PYSPARK
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PB-Scale S3 Reader") \
    .config("spark.hadoop.fs.s3a.impl",
            "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.WebIdentityTokenCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.connection.maximum", "200") \
    .config("spark.hadoop.fs.s3a.fast.upload",       "true") \
    .config("spark.hadoop.fs.s3a.block.size",        "134217728") \
    .config("spark.sql.parquet.filterPushdown",      "true") \
    .config("spark.sql.parquet.mergeSchema",         "false") \
    .getOrCreate()

# ── Read with partition pruning ──
df = spark.read.parquet(
    "s3a://my-company-curated-prod/events/"
).filter(
    "year = 2025 AND month = 3"   # Only reads matching partitions
)

# ── Write with optimal settings ──
df.repartition(200) \
  .write \
  .mode("overwrite") \
  .partitionBy("year", "month", "day") \
  .option("parquet.block.size", "134217728") \
  .parquet("s3a://my-company-curated-prod/events_v2/")

💰 Cost Estimation for 1 PB Data Lake

| Layer | Volume | Storage Class | Cost/Month |
|---|---|---|---|
| Hot (curated, <30 days) | 50 TB | S3 Standard | ~$1,150 |
| Warm (30–365 days) | 200 TB | Standard-IA | ~$2,500 |
| Cold (1–3 years) | 500 TB | Glacier Instant | ~$2,000 |
| Archive (>3 years) | 250 TB | Deep Archive | ~$248 |
| Total: 1 PB | 1,000 TB | Mixed tiers | ~$5,900/mo |
⚠️
Note: costs above are storage only. Add GET/PUT request costs (~$0.004/10K req), data transfer out to the internet ($0.09/GB), and KMS API calls. Real totals typically run 1.3–1.6× the storage cost. Monitor with AWS Cost Explorer and S3 Storage Lens.
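The table's arithmetic can be reproduced (and re-run with your own volumes) in a few lines. Note it uses decimal TB (1 TB = 1,000 GB) and the per-GB list prices from the table, which are assumptions tied to one region and point in time:

```python
# $/GB-month list prices assumed from the table above (region- and time-dependent)
PRICES = {
    "STANDARD":     0.023,
    "STANDARD_IA":  0.0125,
    "GLACIER_IR":   0.004,
    "DEEP_ARCHIVE": 0.00099,
}

def monthly_storage_cost(tiers: dict) -> float:
    """tiers: storage class -> volume in decimal TB (1 TB = 1000 GB).
    Returns the monthly storage-only cost in USD."""
    return sum(PRICES[cls] * tb * 1000 for cls, tb in tiers.items())

lake = {"STANDARD": 50, "STANDARD_IA": 200, "GLACIER_IR": 500, "DEEP_ARCHIVE": 250}
storage = monthly_storage_cost(lake)
print(f"storage: ${storage:,.1f}/mo")  # storage: $5,897.5/mo (the ~$5,900 above)
print(f"with requests/transfer (1.3-1.6x): "
      f"${storage * 1.3:,.0f}-${storage * 1.6:,.0f}/mo")
```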