Big Data Platform Design

The MinIO
Complete Guide

A production-grade, S3-compatible, cloud-native object store built for AI/ML, analytics, and petabyte-scale Big Data workloads.

325+ GB/s GET throughput
PB-scale storage
S3 API compatible
AGPL v3 open source
MinIO in the Data Stack
Spark
Flink
Trino
Hive
↓ S3 API ↓
MinIO Gateway Layer
↓ Erasure Code ↓
Drive 1
Drive 2
Drive 3
Drive N
↑ Replication ↑
Site A
Site B
Site C
🪣
What is MinIO?
Object storage reimagined for the modern data stack

Definition

MinIO is a high-performance, S3-compatible object storage system written in Go. It is designed to run on commodity hardware and cloud infrastructure, storing any amount of unstructured data — from a few gigabytes to exabytes — as flat objects (files) in buckets.

Unlike traditional file systems (POSIX) or block storage, MinIO treats every piece of data — logs, images, ML models, Parquet files, videos — as an object with a key, a value (the data), and metadata. There is no directory hierarchy at the OS level; the flat namespace is infinitely scalable.

Core Identity

S3-Native API

100% compatible with the AWS S3 API. Any application, SDK, or tool built for S3 (Boto3, AWS CLI, Spark, etc.) works with MinIO without code changes.
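
As a quick sanity check, the same AWS CLI commands used against S3 work unchanged when `--endpoint-url` points at MinIO. The endpoint, credentials, and bucket below are placeholders, not values from a real deployment:

```shell
# Placeholder credentials -- substitute your own deployment's values
export AWS_ACCESS_KEY_ID="admin"
export AWS_SECRET_ACCESS_KEY="SuperSecretPassw0rd123!"

# Create a bucket, upload a file, list it -- exactly as against AWS S3
aws --endpoint-url https://minio.internal:9000 s3 mb s3://demo
aws --endpoint-url https://minio.internal:9000 s3 cp data.parquet s3://demo/
aws --endpoint-url https://minio.internal:9000 s3 ls s3://demo/
```

The same applies to SDKs: Boto3, for example, only needs `endpoint_url` set on the client.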

Design Goal

Speed First

Written in Go for low-latency, high-concurrency IO. Single-binary deployment with zero external dependencies. Saturates 100 GbE NICs on commodity NVMe hardware.

Deployment

Cloud Native

First-class Kubernetes operator, Helm chart, and Operator Console. Runs on bare metal, VMs, or any K8s distribution — on-premises or in the cloud.

License

Open Source (AGPL v3)

Community edition is licensed under AGPL v3. A commercial license is available from MinIO for use in closed-source products. Full source code is on GitHub, with over 47k stars.

💡

Why "object" storage?

Object storage decouples metadata from data, enabling virtually unlimited scale, unlike file systems limited by inode counts or block stores limited by block size. Each object is addressed by a unique key within its bucket, which maps to a stable URL, making it addressable across distributed systems without a central namespace server.

⚙️
How MinIO Works
Core internals — erasure coding, healing, replication

1 · Erasure Coding (Reed-Solomon)

MinIO does not use simple replication. Instead, it applies Reed-Solomon erasure coding to split every object into data shards and parity shards across the drives/nodes in an erasure set.

  • Default EC:4 — 12 data shards + 4 parity shards across 16 drives.
  • Tolerates the loss of as many drives as there are parity shards: 4 per set by default, up to N/2 at maximum parity.
  • Storage overhead is just 33% (vs 200% for 3× replication).
  • On read, MinIO reconstructs the object from any 12 of the 16 shards; data and parity shards are interchangeable for reconstruction.
  • Background healing automatically recomputes missing shards when drives return.
bash — Erasure Set Math
# Erasure set = drives used for one coding group
# Standard: 16 drives per set (12 data + 4 parity)
minio server \
  http://minio{1...4}/data{1...4}   # 4 nodes × 4 drives = 16 drives / set

# Or specify explicitly:
MINIO_ERASURE_SET_DRIVE_COUNT=16

# Verify protection level at runtime:
mc admin info myminio | grep "EC:"
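
The overhead arithmetic above can be sketched directly with shell arithmetic. The drive size is illustrative; the shard counts are the default 16-drive, EC:4 layout from the text:

```shell
# Erasure-set sizing sketch for the default layout: 16 drives, EC:4 parity
drives=16; parity=4; drive_tb=2        # illustrative 2 TB drives

data=$((drives - parity))              # data shards per set
raw_tb=$((drives * drive_tb))          # raw capacity of the set
usable_tb=$((data * drive_tb))         # capacity available for objects
overhead_pct=$((100 * parity / data))  # parity overhead relative to data

echo "data=$data raw=${raw_tb}TB usable=${usable_tb}TB overhead=${overhead_pct}%"
# prints: data=12 raw=32TB usable=24TB overhead=33%
```

Compare with 3x replication: the same 24 TB of objects would need 72 TB raw, a 200% overhead.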

2 · Inline Bitrot Detection

Every shard is checksummed using HighwayHash-256 at write time. On every read, checksums are verified. Silent data corruption (bitrot) is detected instantly and the corrupted shard is healed from parity — without operator intervention.

3 · Distributed Mode & Server Pools

In production, MinIO runs as a distributed cluster of nodes. All nodes are equal peers — there is no master. The cluster is composed of one or more Server Pools, each a homogeneous group of nodes+drives forming their own erasure sets.

  • Horizontal scaling: Add a new Server Pool to expand capacity non-disruptively.
  • Object placement: Objects are placed on pools based on available free space (weighted).
  • Quorum writes: A PUT must be acknowledged by a write quorum of drives (the data-shard count, or N/2 + 1 when parity is configured at its maximum of N/2) before success is confirmed.
  • Read quorum: Only data shards needed — no parity required for reads under normal conditions.

4 · Active-Active Site Replication

For disaster recovery and geo-distribution, MinIO supports Active-Active replication across multiple independent MinIO deployments (sites). Every write to any site propagates to all peers in near real-time via internal queuing.

  • All sites remain fully writable — no primary/secondary model.
  • Conflict resolution uses last-writer-wins semantics.
  • Policies, users, groups, and IAM settings also replicate automatically.
  • Typical RPO: <1 second on a 10 GbE WAN link.
🔄

Bucket Replication vs. Site Replication

Bucket Replication (S3-compatible) copies objects from one bucket to another, even across vendors (e.g., MinIO → AWS S3). Site Replication replicates the entire namespace including IAM, policies, and all buckets — recommended for DR at PB scale.
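
A one-way bucket replication rule to an external S3 target can be configured with mc. The alias, bucket names, and credentials here are hypothetical, and the flags reflect recent mc releases (older releases used a different ARN-based syntax):

```shell
# Replication requires versioning on both source and target buckets
mc version enable myminio/logs

# Replicate myminio/logs to an AWS S3 bucket (credentials are placeholders)
mc replicate add myminio/logs \
  --remote-bucket 'https://ACCESSKEY:SECRETKEY@s3.amazonaws.com/logs-dr' \
  --replicate "delete,delete-marker,existing-objects"

# Check queue depth and failures
mc replicate status myminio/logs
```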

🏗️
Architecture Deep Dive
Layers, components, and data flow
MinIO Production Stack — Full Layers
Client Layer
Apache Spark
Apache Flink
Trino / Presto
Jupyter / MLflow
AWS CLI / SDK
Custom Apps
↕ HTTPS / S3 API (REST) ↕
Gateway / LB
NGINX / HAProxy
MinIO Console (UI)
MinIO Operator (K8s)
IAM / STS
↕ Internal gRPC / HTTP ↕
MinIO Nodes
Node 1 (minio)
Node 2 (minio)
Node 3 (minio)
Node N (minio)
↕ Erasure Set I/O ↕
Storage Drives
NVMe /data1
NVMe /data2
NVMe /data3
NVMe /dataN

Metadata Management

MinIO stores object metadata alongside data as xl.meta files within the same erasure set. This eliminates a separate metadata database and keeps metadata access local, reducing latency. Bucket-level metadata and IAM configuration are likewise persisted in the object store itself (under the reserved .minio.sys prefix), so no external database such as etcd is required.

Lifecycle & Tiering

MinIO supports ILM (Information Lifecycle Management) — objects automatically transition between storage tiers (hot NVMe → warm HDD → cold cloud) based on age or access patterns. The remote tier can be another MinIO, AWS S3, GCS, or Azure Blob.
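
A tiering setup has two steps: register a remote tier, then attach a lifecycle rule that transitions objects to it. The tier name, bucket names, and credentials below are placeholders:

```shell
# 1. Register a remote S3-compatible tier (values are placeholders)
mc admin tier add s3 myminio COLD-S3 \
  --endpoint https://s3.amazonaws.com \
  --access-key EXAMPLEKEY --secret-key EXAMPLESECRET \
  --bucket archive-bucket --region us-east-1

# 2. Transition objects older than 30 days to that tier
mc ilm rule add myminio/logs --transition-days 30 --transition-tier COLD-S3
```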

Networking Requirements for PB-Scale

  • Minimum: 10 GbE between all nodes in a server pool. Erasure coding requires all drives to be written in parallel — network is often the bottleneck, not disk.
  • Recommended: 25 GbE or 100 GbE for high-throughput workloads (ML training data, large-scale ETL).
  • Topology: All nodes in a pool should be on the same L2 segment (single rack or spine-leaf) to minimize latency variance.
  • Load balancer: Deploy NGINX, HAProxy, or F5 in front of MinIO for health-checking and TLS termination. Use round-robin or least-connection strategies.
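
A minimal NGINX fragment for the load-balancer role described above might look like the following; the hostnames and certificate paths are placeholders:

```nginx
upstream minio_s3 {
    least_conn;                          # spread long-lived S3 connections
    server minio1.internal:9000 max_fails=2 fail_timeout=10s;
    server minio2.internal:9000 max_fails=2 fail_timeout=10s;
    server minio3.internal:9000 max_fails=2 fail_timeout=10s;
    server minio4.internal:9000 max_fails=2 fail_timeout=10s;
}

server {
    listen 443 ssl;
    server_name s3.internal;
    ssl_certificate     /etc/nginx/certs/public.crt;
    ssl_certificate_key /etc/nginx/certs/private.key;

    ignore_invalid_headers off;          # S3 clients send non-standard headers
    client_max_body_size 0;              # no cap on object size
    proxy_buffering off;                 # stream large GETs/PUTs end-to-end

    location / {
        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_http_version 1.1;
        proxy_pass http://minio_s3;
    }
}
```
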
Why MinIO is Great
The competitive advantages that matter for Big Data
Performance

Fastest Object Store

Benchmarked at 325 GB/s GET and 165 GB/s PUT on a 32-node NVMe cluster. In MinIO's published benchmarks it outpaces Ceph and the major cloud object stores in raw throughput at comparable hardware cost.

Simplicity

Single Binary

The entire MinIO server is one statically compiled binary (~120 MB). No JVM, no package manager, no runtime dependencies. Runs anywhere Go runs — including ARM and s390x.

Compatibility

True S3 Parity

Covers the S3 feature set that Big Data workloads rely on: multipart upload, pre-signed URLs, bucket versioning, object locking (WORM), object tagging, lifecycle policies, server-side encryption, and event notifications.

Cost

70–90% Cheaper

Running on your own hardware (or spot VMs) vs. AWS S3 for PB-scale workloads typically yields 70–90% cost savings. No egress fees for on-prem deployments.

Security

Enterprise-Grade Security

TLS everywhere, SSE-S3 / SSE-KMS / SSE-C encryption, LDAP/AD integration, OpenID Connect, attribute-based access control (ABAC), and audit logging built-in.

Ecosystem

Works with Everything

Native integrations: Apache Spark (Hadoop S3A), Flink, Trino, Hive, Presto, DeltaLake, Apache Iceberg, Hudi, MLflow, Kubeflow, Airflow, dbt, and more.

Ops

Kubernetes Native

Official MinIO Operator auto-manages tenant lifecycle, auto-healing, certificate rotation, upgrades, and scaling on any K8s. Operator Console provides a unified management UI.

Observability

Prometheus + Grafana

Exposes 100+ Prometheus metrics out of the box. Pre-built Grafana dashboards for throughput, capacity, errors, healing status, and replication lag.
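
mc can emit a ready-made Prometheus scrape job for the cluster metrics endpoint; the output shape below is approximate and the target hostname is a placeholder:

```shell
# Generate a Prometheus scrape config for the cluster metrics endpoint
mc admin prometheus generate myminio

# Roughly the following, for pasting into prometheus.yml
# (a bearer_token line is included unless metrics auth is set to public):
# scrape_configs:
# - job_name: minio-job
#   metrics_path: /minio/v2/metrics/cluster
#   static_configs:
#   - targets: ['minio1.internal:9000']
```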

🏆

MinIO is best for…

AI/ML training data lakes, data lake-house architectures (Iceberg/Delta), log aggregation, time-series data stores, media/CDN backends, container registry storage (Harbor), backup targets, and any workload needing S3-compatible storage at massive scale without cloud vendor lock-in.

Performance Benchmarks
Real-world throughput numbers for planning

Reference Hardware Config (32 nodes × 32 NVMe)

Benchmark published by MinIO on 32-node cluster, each with dual AMD EPYC, 512 GB RAM, 32× 2 TB NVMe, 2× 100 GbE NICs.

Throughput vs. Object Size (32-node NVMe cluster)

GET Throughput

  • 1 MB objects: 95 GB/s
  • 64 MB objects: 220 GB/s
  • 256 MB objects: 325 GB/s
  • AWS S3 (equivalent hardware): ~45 GB/s

PUT Throughput

  • 1 MB objects: 55 GB/s
  • 64 MB objects: 120 GB/s
  • 256 MB objects: 165 GB/s
  • AWS S3 (equivalent hardware): ~30 GB/s
🧠

ML/AI Workload Tip

For large model checkpoints and training datasets (10–500 GB objects), use MinIO's multipart upload (128 MB part size) to saturate network bandwidth. Enable MINIO_STORAGE_CLASS_STANDARD=EC:2 for hot training data to reduce parity overhead and maximize IOPS.
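
The part size is a client-side setting. With the AWS CLI it can be pinned as below; the endpoint and bucket are placeholders, and REDUCED_REDUNDANCY is the standard S3 storage-class name that MinIO maps to its RRS class (EC:2 in the configuration above):

```shell
# Pin multipart chunk size and parallelism for large-object transfers
aws configure set default.s3.multipart_chunksize 128MB
aws configure set default.s3.max_concurrent_requests 32

# Upload a checkpoint with reduced parity for maximum IOPS
aws --endpoint-url https://minio.internal:9000 \
  s3 cp checkpoint.pt s3://training/checkpoints/ \
  --storage-class REDUCED_REDUNDANCY
```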

📊

Benchmark Tool

Use warp (MinIO's official benchmark tool) to validate your hardware before going to production. It tests GET, PUT, DELETE, and mixed workloads with configurable concurrency and object sizes across your actual cluster topology.
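
A representative warp run, assuming the cluster and credentials from the setup section later in this guide (flags reflect current warp releases):

```shell
# 5-minute GET benchmark: 256 MiB objects, 64 concurrent workers, TLS
warp get \
  --host minio1.internal:9000,minio2.internal:9000 \
  --access-key admin --secret-key 'SuperSecretPassw0rd123!' \
  --duration 5m --obj.size 256MiB --concurrent 64 --tls

# Mixed GET/PUT/DELETE/STAT workload, closer to production traffic
warp mixed --host minio1.internal:9000 \
  --access-key admin --secret-key 'SuperSecretPassw0rd123!' \
  --duration 10m
```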

🚀
Production Setup — PB Scale
Step-by-step guide to a durable, high-throughput MinIO cluster

Phase 0 — Hardware Planning

  • Nodes: Minimum 4 nodes; recommended 8–32 for PB workloads. Always use multiples of 4.
  • Drives per node: 4, 8, or 16 drives. Prefer NVMe for hot; SATA SSD or HDD for warm/cold.
  • RAM: 32–128 GB per node. MinIO caches drive metadata in RAM.
  • Network: 25 GbE minimum for production; 100 GbE for high-throughput ML/analytics.
  • OS: RHEL 8/9, Ubuntu 22.04 LTS, Rocky Linux 9. XFS filesystem on all data drives.
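
The planning rules above reduce to a quick sizing calculation. The usable-capacity target and drive size below are illustrative assumptions, not recommendations:

```shell
# Sizing sketch: nodes needed for ~1 PB usable at EC:4 (illustrative numbers)
target_usable_tb=1024      # 1 PB usable
drive_tb=15                # per-drive capacity (e.g. 15.36 TB NVMe, rounded)
drives_per_node=16
set_size=16; parity=4
data=$((set_size - parity))

raw_tb=$((target_usable_tb * set_size / data))                # raw capacity needed
drives=$(( (raw_tb + drive_tb - 1) / drive_tb ))              # ceil(raw / drive size)
nodes=$(( (drives + drives_per_node - 1) / drives_per_node )) # ceil(drives / node)
nodes=$(( (nodes + 3) / 4 * 4 ))                              # round up to multiple of 4

echo "raw=${raw_tb}TB drives=${drives} nodes=${nodes}"
# prints: raw=1365TB drives=91 nodes=8
```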

Phase 1 — OS & Disk Preparation

bash — Format and mount drives (all nodes)
# Format each drive as XFS (faster than ext4 for object workloads)
for disk in /dev/nvme{0..3}n1; do
  mkfs.xfs -L "minio-$(basename $disk)" -f $disk
done

# Mount with noatime,nodiratime for performance
cat >> /etc/fstab <<EOF
LABEL=minio-nvme0n1  /data1  xfs  defaults,noatime,nodiratime  0 2
LABEL=minio-nvme1n1  /data2  xfs  defaults,noatime,nodiratime  0 2
LABEL=minio-nvme2n1  /data3  xfs  defaults,noatime,nodiratime  0 2
LABEL=minio-nvme3n1  /data4  xfs  defaults,noatime,nodiratime  0 2
EOF
mount -a

# Verify
df -h | grep /data
bash — Kernel tuning for high-throughput IO
# Network stack tuning
cat >> /etc/sysctl.d/99-minio.conf <<EOF
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_congestion_control = bbr
vm.swappiness = 1
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
EOF
sysctl -p /etc/sysctl.d/99-minio.conf

# Set I/O scheduler to none (pass-through) for NVMe
for dev in /sys/block/nvme*/queue/scheduler; do
  echo none > $dev
done

Phase 2 — Install MinIO Binary

bash — Install on all nodes
# Download latest MinIO (amd64)
wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
mv minio /usr/local/bin/

# Create minio user (never run as root in production)
useradd -r -s /sbin/nologin minio-user
chown -R minio-user:minio-user /data{1..4}

# Install mc (MinIO client)
wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc && mv mc /usr/local/bin/

Phase 3 — Environment Configuration

bash — /etc/default/minio (production config)
# Credentials (use Vault or K8s secrets in real deployments)
MINIO_ROOT_USER="admin"
MINIO_ROOT_PASSWORD="SuperSecretPassw0rd123!"

# Cluster topology: 4 nodes × 4 drives = 16 drives (EC:4 parity)
MINIO_VOLUMES="https://minio{1...4}.internal:9000/data{1...4}"

# API port, console port, and TLS certs directory
MINIO_OPTS="--address :9000 --console-address :9001 --certs-dir /etc/minio/certs"

# TLS: place public.crt and private.key in /etc/minio/certs/
# (MinIO loads them from the certs directory; there is no cert env variable)

# Storage class: EC:4 standard, EC:2 reduced redundancy
MINIO_STORAGE_CLASS_STANDARD="EC:4"
MINIO_STORAGE_CLASS_RRS="EC:2"

# Enable transparent compression for compressible formats
MINIO_COMPRESS_ENABLE="on"
MINIO_COMPRESS_EXTENSIONS=".log,.txt,.csv,.json"
MINIO_COMPRESS_MIME_TYPES="text/plain,application/json"
# Optional: also compress objects that will be encrypted
MINIO_COMPRESS_ALLOW_ENCRYPTION="on"

# Prometheus metrics (scrape at :9000/minio/v2/metrics/cluster)
MINIO_PROMETHEUS_AUTH_TYPE="public"

# Audit logging via webhook to an HTTP log collector
# (for Kafka, use the MINIO_AUDIT_KAFKA_* variables instead)
MINIO_AUDIT_WEBHOOK_ENABLE_collector="on"
MINIO_AUDIT_WEBHOOK_ENDPOINT_collector="http://logcollector.internal:8080/minio-audit"
systemd — /etc/systemd/system/minio.service
[Unit]
Description=MinIO Object Storage
After=network-online.target
Wants=network-online.target

[Service]
WorkingDirectory=/usr/local
EnvironmentFile=/etc/default/minio
ExecStart=/usr/local/bin/minio server $MINIO_OPTS $MINIO_VOLUMES
User=minio-user
Group=minio-user
Restart=always
RestartSec=5s
LimitNOFILE=1048576
TasksMax=infinity
TimeoutStopSec=120
SendSIGKILL=no

[Install]
WantedBy=multi-user.target
bash — Start and verify cluster
# Enable and start on ALL nodes
systemctl daemon-reload
systemctl enable --now minio

# Verify cluster health
mc alias set myminio https://minio1.internal:9000 admin SuperSecretPassw0rd123!
mc admin info myminio

# Expected output excerpt:
#  Servers: 4  Drives: 16  Online: 16  Offline: 0
#  Status:  16 online, 0 offline drives
#  Used: 0 B / 64 TB total

Phase 4 — Multi-Site Replication (DR)

bash — Active-Active site replication setup
# Register aliases for both sites
mc alias set site-a https://minio-site-a.internal:9000 admin pass1
mc alias set site-b https://minio-site-b.internal:9000 admin pass2

# Enable site replication (run once from either site)
mc admin replicate add site-a site-b

# Verify replication status
mc admin replicate info site-a

# Add a third DR site later:
mc admin replicate add site-a site-b site-c

Phase 5 — Kubernetes Deployment (Operator)

bash — MinIO Operator via Helm
# Install MinIO Operator
helm repo add minio-operator https://operator.min.io
helm install --namespace minio-operator --create-namespace \
  operator minio-operator/operator

# Deploy a MinIO tenant (4 servers × 4 drives)
helm install --namespace minio-tenant --create-namespace \
  tenant minio-operator/tenant \
  --set "tenant.pools[0].servers=4" \
  --set "tenant.pools[0].volumesPerServer=4" \
  --set "tenant.pools[0].size=2Ti" \
  --set "tenant.pools[0].storageClassName=local-nvme"
1 · Hardware → XFS format drives, tune OS kernel (sysctl, scheduler)

Foundation phase: every data drive formatted as XFS with noatime, network stack tuned for large transfers, I/O scheduler set to none for NVMe.

2 · Install MinIO binary + configure /etc/default/minio

Single binary, no package dependencies. Configure MINIO_VOLUMES with your node expansion syntax to define the erasure set topology.

3 · TLS everywhere + load balancer

Use Let's Encrypt or an internal CA. Place NGINX/HAProxy in front for client-facing TLS termination and health-check routing. MinIO nodes communicate over TLS internally.

4 · Observability: Prometheus + Grafana + alerting

Scrape /minio/v2/metrics/cluster. Import MinIO's official Grafana dashboards (IDs: 13502, 15305). Set alerts on drive offline, healing rate, and replication lag.

5 · Multi-site replication for DR

Enable mc admin replicate add across geographically separated sites. Test failover quarterly by simulating a site outage and verifying zero data loss.

6 · ILM policies + storage tiering

Set up lifecycle rules to move cold data to HDD or cloud (AWS S3, GCS) after N days. This keeps hot NVMe headroom ≥ 20% for write performance.

📊
MinIO vs. Alternatives
How MinIO compares in the Big Data storage landscape
🐙 Ceph (RGW)
  • S3-compatible (RGW)
  • Block + file + object
  • Complex to operate
  • Lower raw throughput
  • Large footprint (many daemons)
  • Strong community
  • Slow metadata (RADOS)
  • Hard K8s integration
☁️ AWS S3
  • The S3 API standard
  • Global availability
  • Expensive at PB scale
  • High egress fees
  • No on-prem option
  • Vendor lock-in
  • Managed (zero ops)
  • Massive ecosystem
Feature                 | MinIO          | Ceph RGW        | AWS S3         | HDFS
S3 API                  | ✓ Full parity  | ✓ Most features | ✓ Native       | ✗ Not S3
Max throughput          | 325 GB/s       | ~100 GB/s       | ~50 GB/s*      | ~150 GB/s
Operational complexity  | Low (1 binary) | Very high       | None (managed) | High (NameNode)
On-premises             | ✓              | ✓               | ✗              | ✓
Kubernetes native       | ✓ Operator     | Partial (Rook)  | ✓ EKS          | ✗ Limited
Iceberg / Delta         | ✓ Native       | ✓ Via S3        | ✓ Native       | ✓ Partial
Egress cost @ 1 PB/mo   | $0 (on-prem)   | $0 (on-prem)    | ~$90,000       | $0

* AWS S3 throughput is per-prefix limited; aggregate across prefixes is higher.

🛡️
Production Best Practices
Hard-won lessons for PB-scale deployments
Storage

Keep free space ≥ 20%

MinIO write performance degrades sharply above 80% capacity. Use ILM tiering rules to automatically push cold data to cheaper storage before hitting this threshold.

Network

Dedicated storage VLAN

Isolate MinIO inter-node traffic from client traffic using separate NICs or VLANs. This prevents noisy-neighbor bandwidth contention on shared 10 GbE switches.

Security

IAM per service account

Never share root credentials. Create a dedicated MinIO service account per application with least-privilege bucket policies. Rotate credentials via Vault or K8s Secrets.
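
A least-privilege account for a single application can be sketched as below. The policy, user, and bucket names are hypothetical, and the flags reflect recent mc releases (older releases used mc admin policy add/set):

```shell
# Least-privilege policy scoped to one bucket (names are placeholders)
cat > spark-rw.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::lakehouse", "arn:aws:s3:::lakehouse/*"]
  }]
}
EOF

mc admin policy create myminio spark-rw spark-rw.json
mc admin user add myminio spark-svc 'GeneratedRandomSecret'
mc admin policy attach myminio spark-rw --user spark-svc
```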

Reliability

Test healing regularly

Run mc admin heal --recursive monthly. Simulate drive failures in staging to measure MTTR and validate that parity reconstruction keeps pace with load.

Scale

Use server pools, not drive expansion

When adding capacity, add full Server Pools (new nodes + drives) rather than adding drives to existing nodes. Pools are the safe, non-disruptive scale-out unit.

Performance

S3 multipart for large objects

For objects > 128 MB, always use multipart upload (128–256 MB parts). This enables parallel upload across drives, dramatically increasing throughput for ML datasets and backups.

🔑

Encryption Strategy for PB Deployments

Use SSE-KMS (Server-Side Encryption with KMS) backed by HashiCorp Vault or AWS KMS. Every object gets a unique data encryption key (DEK) derived from a master key — so compromising one object never exposes others. Enable it by setting MINIO_KMS_KES_ENDPOINT and MINIO_KMS_KES_KEY_NAME environment variables.
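
A minimal KES connection sketch for /etc/default/minio follows; every value is a placeholder for your own KES deployment, and the paths assume the certs directory used earlier in this guide:

```shell
# KES (Key Encryption Service) connection -- values are placeholders
MINIO_KMS_KES_ENDPOINT="https://kes.internal:7373"
MINIO_KMS_KES_KEY_NAME="minio-default-key"
MINIO_KMS_KES_CERT_FILE="/etc/minio/certs/kes-client.crt"
MINIO_KMS_KES_KEY_FILE="/etc/minio/certs/kes-client.key"
MINIO_KMS_KES_CAPATH="/etc/minio/certs/kes-ca.crt"

# Then enforce SSE-KMS per bucket:
# mc encrypt set sse-kms minio-default-key myminio/sensitive-bucket
```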