System Design Deep Dive

Distributed Storage &
File Systems

A comprehensive guide to distributed storage architecture, file systems, failure modes, split-brain scenarios, and recovery strategies.

🗃️ DSS Architecture 📂 DFS Internals ⚖️ CAP Theorem 🔥 Failure Models 🧠 Split-Brain 🔄 Consensus Algorithms
💾

Distributed Storage Systems (DSS)

How data is spread, replicated and managed across multiple nodes as a unified system

🌐 What is a Distributed Storage System?

A Distributed Storage System (DSS) stores data across multiple physical or virtual devices connected through a network, while presenting applications with a unified storage interface. Unlike centralized storage, data is broken into smaller components, replicated for redundancy, and spread across nodes that communicate to serve read and write requests.

Each storage node has its own processing power and memory, enabling parallel operations. A central control plane manages data distribution, replication, and fault tolerance. The result: high scalability, performance, and resilience.

📊
Why it matters: Research indicates that 85% of organizations experienced data loss incidents in 2024, and that 93% of businesses that suffer prolonged data loss (10+ days) file for bankruptcy within a year, underscoring the critical importance of robust distributed storage.
Core Operation

How Data Flows

1. Data Partitioning

Incoming data is divided into smaller manageable chunks or blocks for distributed placement.

2. Replication

Each chunk is typically replicated 3× across different nodes or availability zones to ensure fault tolerance.

3. Metadata Tracking

A central metadata service maps each block to its physical location(s) across the cluster.

4. Access & Retrieval

Clients query metadata to find block locations, then retrieve directly from data nodes — avoiding central bottlenecks.
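The four steps above can be sketched end to end. This is a toy model, not any real system's API: the tiny chunk size, the in-memory metadata dict, and the hash-based replica placement are all illustrative assumptions.

```python
import hashlib

CHUNK_SIZE = 4           # bytes per chunk; real systems use 64-128 MB blocks
REPLICATION_FACTOR = 3

def partition(data: bytes) -> list:
    """1. Data Partitioning: divide data into fixed-size chunks."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def place_replicas(chunk_id: str, nodes: list) -> list:
    """2. Replication: pick 3 distinct nodes, hash-based and deterministic."""
    start = int(hashlib.sha256(chunk_id.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(REPLICATION_FACTOR)]

def write(data: bytes, nodes: list, metadata: dict, storage: dict) -> None:
    for idx, chunk in enumerate(partition(data)):
        chunk_id = f"blk_{idx}"
        metadata[chunk_id] = place_replicas(chunk_id, nodes)  # 3. metadata tracking
        for node in metadata[chunk_id]:
            storage.setdefault(node, {})[chunk_id] = chunk    # copy to each replica

def read(metadata: dict, storage: dict) -> bytes:
    """4. Access & Retrieval: look up locations, fetch directly from data nodes."""
    chunk_ids = sorted(metadata, key=lambda c: int(c.split("_")[1]))
    return b"".join(storage[metadata[c][0]][c] for c in chunk_ids)

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
metadata, storage = {}, {}
write(b"hello distributed world", nodes, metadata, storage)
assert read(metadata, storage) == b"hello distributed world"
assert all(len(set(locs)) == 3 for locs in metadata.values())  # 3 distinct replicas
```

Note that `read` never touches a central data path: it consults metadata once, then fetches from data nodes directly, which is the point of step 4.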

Architecture

Key Roles & Components

  • Storage Nodes: Store actual data blocks; handle read/write ops; report health & capacity
  • Metadata Server / NameNode: Manages file-system metadata, namespace, and block-location maps — does not store file content
  • Coordination Nodes: Oversee data replication, cluster state, and leader election
  • Control Plane: Manages data distribution, replication factor enforcement, and fault tolerance policies
  • Client Interface: Unified API that hides underlying complexity from applications
💡
When a data node fails, the system automatically redistributes its data to maintain the configured replication factor without manual intervention.
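A minimal sketch of that self-healing step, assuming a simple in-memory metadata map (block → list of replica nodes); the names and structures are illustrative, not a real system's API:

```python
def handle_node_failure(failed: str, metadata: dict, storage: dict,
                        live_nodes: list, rf: int = 3) -> None:
    """Restore the replication factor after a data-node failure by copying
    each lost block from a surviving replica to a healthy node."""
    storage.pop(failed, None)                        # node and its blocks are gone
    for blk, locs in metadata.items():
        locs[:] = [n for n in locs if n != failed]   # drop the dead replica
        while len(locs) < rf:                        # under-replicated?
            source = locs[0]                         # any surviving replica
            target = next(n for n in live_nodes if n not in locs)
            storage.setdefault(target, {})[blk] = storage[source][blk]
            locs.append(target)

metadata = {"blk_0": ["dn1", "dn2", "dn3"]}
storage = {n: {"blk_0": b"data"} for n in ("dn1", "dn2", "dn3")}
handle_node_failure("dn2", metadata, storage, live_nodes=["dn1", "dn3", "dn4"])
assert metadata["blk_0"] == ["dn1", "dn3", "dn4"]    # replication factor restored
assert storage["dn4"]["blk_0"] == b"data"            # block re-replicated
```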

🏗️ Distributed Storage — Master–Slave Architecture (HDFS-style)

[Diagram: master–slave (HDFS-style) architecture. A Client Application sends metadata requests to the NameNode (master), which holds the metadata and namespace, the block-location map, and the FsImage + EditLog, and returns block locations. The client then reads and writes blocks directly on DataNode 1 (blocks A, B, D, Zone A), DataNode 2 (blocks B, C, E, Zone A), and DataNode 3 (blocks A, C, D, Zone B), with replicas 1-3 placed in Zones C and D.]
Storage Types

Block Storage

Divides data into fixed-size blocks, each with a unique address. Ideal for databases and random I/O with low latency (e.g., NVMe-oF volumes, AWS EBS).

Storage Types

Object Storage

Stores data as objects with metadata and unique identifiers in a flat namespace. Highly scalable for unstructured data (e.g., Amazon S3, Ceph RADOS).

Storage Types

File Storage (DFS)

Presents files in hierarchical namespaces across distributed nodes, appearing as a single logical filesystem (e.g., HDFS, GFS, GlusterFS).

🔁 Replication Models

Synchronous

Data is copied to all replicas before write is acknowledged to the client. Guarantees strong consistency but increases write latency. Used in financial systems.

pseudocode
// Synchronous write path
function write(data):
  primary.write(data)
  replica1.write(data)  // await
  replica2.write(data)  // await
  return ACK  // only after ALL replicas confirm
Asynchronous

Write is acknowledged immediately after the primary stores data. Replicas are updated asynchronously. Lower latency, eventual consistency risk. Used in content delivery.

pseudocode
// Asynchronous write path
function write(data):
  primary.write(data)
  replicate_async(data, [r1, r2])  // queued for background replication
  return ACK  // immediate acknowledgment, before replicas confirm

📁

Distributed File Systems (DFS)

Hierarchical file abstractions spanning multiple nodes — HDFS, GFS, Ceph, and beyond

📂 What is a Distributed File System?

A Distributed File System (DFS) allows data to be stored across multiple storage nodes and locations while appearing to users and applications as a single, unified system. Multiple clients access data stored on different servers, with each server holding primary copies and replicas to ensure fault tolerance and availability.

Core Characteristics

  • Transparency: Users see one logical storage system
  • Replication: Multiple copies of each data block
  • Scalability: Horizontal expansion by adding nodes
  • Fault Tolerance: Automatic re-replication on node failure
  • Consistency Models: Configurable from strong to eventual

Engineering Challenges

  • Keeping replicas synchronized under high write loads
  • Network overhead for every data transfer
  • Intelligent load balancing to avoid hotspots
  • Continuous health monitoring and automatic re-replication
  • Security across multiple nodes and regions
Google

GFS / Colossus

GFS pioneered the pattern: single master + chunk servers. Colossus evolved this with a distributed metadata model, eliminating the single-master bottleneck and enabling parallelized metadata operations across multiple servers.

Use case: Large sequential reads, batch analytics, search indexing
Apache

HDFS

Master-slave: single NameNode manages FsImage (namespace tree) + EditLog (change journal). DataNodes store actual blocks and send heartbeats. Designed for enormous sequential file reads (MapReduce).

Use case: Hadoop ecosystem, big data batch processing
Meta

Haystack / f4

Haystack solved massive photo storage by reducing metadata I/O: each photo stored in large volumes with an in-memory index. f4 added erasure coding for warm storage, reducing storage footprint by ~50%.

Use case: Billions of small files (photos, media blobs)

📝 HDFS Write Path — How a File Gets Written

[Diagram: HDFS write path. 1. Client initiates a write to the NameNode. 2. NameNode splits the file into blocks and assigns DataNodes. 3. DataNode 1 stores the block and pipelines it to DataNode 2. 4. DataNode 2 replicates the block and pipelines it to DataNode 3. 5. DataNode 3 writes the final replica and the ACK propagates back up the chain to the client. Pipeline replication: 3 replicas written in a chain for fault tolerance.]
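The pipeline can be sketched as a chain: each node stores the block, forwards it downstream, and only ACKs once the rest of the chain has. A toy recursive version (the node names, the in-memory `storage` dict, and the fixed block id are assumptions for illustration):

```python
def pipeline_write(block: bytes, pipeline: list, storage: dict) -> bool:
    """Chain replication: each node stores the block, forwards it to the
    next node, and returns an ACK only after the rest of the chain ACKs."""
    if not pipeline:
        return True                               # end of chain: ACK flows back
    head, *rest = pipeline
    storage.setdefault(head, {})["blk_A"] = block     # store locally first
    return pipeline_write(block, rest, storage)       # forward downstream

storage = {}
acked = pipeline_write(b"block-A", ["dn1", "dn2", "dn3"], storage)
assert acked
assert all("blk_A" in storage[n] for n in ("dn1", "dn2", "dn3"))
```

The design choice here is that the client streams to only one DataNode; the nodes themselves fan the data down the chain, which spreads the network load across the cluster.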

⚖️

CAP Theorem

The fundamental trade-off governing all distributed storage design decisions

⚖️ You Can Only Pick Two

The CAP theorem (conjectured by Eric Brewer in 2000 and formally proved in 2002) states that a distributed system cannot simultaneously guarantee all three of: Consistency, Availability, and Partition Tolerance. Because network partitions cannot be ruled out in practice, the real choice during a partition is between consistency and availability. This theorem guides every distributed storage architecture decision.

[Diagram: CAP triangle, pick 2 of 3. CA: PostgreSQL, MySQL Cluster. CP: MongoDB, HBase, ZooKeeper. AP: Cassandra, DynamoDB, CouchDB.]
🔵
Consistency
All replicas return the same data at any given time
🟢
Availability
System always responds to requests, even under failure
🟣
Partition Tolerance
System operates despite network partitions between nodes

🔥

Distributed System Failure Modes

The taxonomy of how and why distributed systems fail — and what each failure pattern means for recovery

⚠️
Failure is the default, not the exception. In large distributed systems with hundreds of nodes, hardware failures occur daily. The system must be architected to detect, isolate, and recover from each failure category automatically — without operator intervention.
🛑
Fail-Stop

Fail-Stop Failure

The node simply stops functioning and the rest of the system quickly detects it is gone. The clearest, easiest failure to handle.

Behavior

  • Node halts all operations immediately
  • Failure is detectable via missed heartbeats
  • No corrupted state propagated

Recovery

  • Leader election to reassign responsibilities
  • Re-replication of blocks from the failed node
  • Apache Kafka & Kubernetes use health checks + timeouts
💥
Crash Failure

Crash Failure

Similar to fail-stop, but the component fails silently — its failure may not be immediately obvious to the rest of the system.

Behavior

  • Node crashes without alerting peers
  • Detection requires waiting for timeouts
  • Transient failures (GC pause, network hiccup) can look identical

Recovery

  • Timeout-based failure detection with exponential backoff
  • Heartbeat monitoring with careful tuning to avoid false positives
  • Leader election after confirmed timeout
🕸️
Network Partition

Network Partition

Communication between components is interrupted, but each component continues to operate independently — leading to divergent state.

Causes

  • Network outages, switch failures, firewall misconfiguration
  • Each partition believes the other has failed
  • Both sides may promote themselves to leader → split-brain

Recovery

  • Quorum-based consensus (only majority partition acts)
  • Eventual consistency + conflict resolution when healed
  • CRDTs (Conflict-free Replicated Data Types)
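As a concrete example of conflict-free merging, here is a minimal G-Counter, one of the simplest CRDTs: each node increments only its own slot, and merge takes the element-wise maximum, so replicas that diverged during a partition converge deterministically when it heals.

```python
class GCounter:
    """Grow-only counter CRDT: each node increments its own slot only;
    merge takes the element-wise max, so merges commute and converge."""
    def __init__(self, node_id: str):
        self.node_id, self.counts = node_id, {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        for node, c in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), c)

# Two replicas diverge during a partition, then merge when it heals:
a, b = GCounter("A"), GCounter("B")
a.increment(3)                       # partition side A sees +3
b.increment(2)                       # partition side B sees +2
a.merge(b); b.merge(a)
assert a.value() == b.value() == 5   # both converge, no conflict to resolve
```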
☠️
Byzantine Failure

Byzantine Failure

Nodes exhibit arbitrary or malicious behavior — sending conflicting, incorrect, or inconsistent information to different peers.

Behavior

  • Compromised or buggy node sends different data to different peers
  • Extremely difficult to detect and diagnose
  • E.g., a voting node sending different tallies to different replicas

Recovery

  • Byzantine Fault Tolerance (BFT) algorithms — PBFT
  • Majority consensus to filter malicious components
  • Blockchain systems use BFT-inspired techniques
Transient Failure

Transient Failure

Failures that occur temporarily and may resolve on their own — caused by transient environmental conditions or network glitches.

Behavior

  • Temporary unavailability (GC pause, CPU spike)
  • Challenging to reproduce and diagnose
  • Risk: system may incorrectly treat as permanent failure

Recovery

  • Retry mechanisms with exponential backoff
  • Careful timeout tuning to distinguish from real failures
  • Monitoring and logging for pattern identification
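A minimal sketch of retry with exponential backoff and jitter; the `flaky_read` simulation and the parameter values are illustrative.

```python
import random
import time

def retry_with_backoff(op, max_attempts: int = 5,
                       base_delay: float = 0.05, cap: float = 1.0):
    """Retry a flaky operation, doubling the delay each attempt (with
    jitter) so a transient failure has time to clear before the next try."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except TimeoutError:
            if attempt == max_attempts:
                raise                # looks permanent: give up and escalate
            delay = min(cap, base_delay * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.0))   # jittered backoff

# Simulate a GC pause that clears after two failed attempts:
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("node paused")
    return "value"

assert retry_with_backoff(flaky_read) == "value"
assert calls["n"] == 3               # two retries, then success
```

The jitter matters: without it, many clients that failed together retry together, hammering the recovering node in synchronized waves.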
🐌
Performance Failure

Performance Degradation

Nodes degrade in performance, leading to slower response times or reduced throughput — a "gray failure" that's hard to detect.

Behavior

  • Resource contention, bottlenecks, hardware degradation
  • Node is still "alive" but responds slowly
  • Can cascade to affect dependent services

Recovery

  • Circuit breakers to shed load from slow nodes
  • Load balancing to redistribute traffic
  • Prometheus/Grafana for continuous performance monitoring
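A minimal circuit-breaker sketch; real implementations add a half-open probing state for recovery, which is omitted here for brevity, and the threshold value is an illustrative assumption.

```python
class CircuitBreaker:
    """After `threshold` consecutive failures the circuit opens and calls
    are rejected immediately instead of queueing behind a slow node."""
    def __init__(self, threshold: int = 3):
        self.threshold, self.failures, self.is_open = threshold, 0, False

    def call(self, op):
        if self.is_open:
            raise RuntimeError("circuit open: load shed")
        try:
            result = op()
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.is_open = True          # stop sending traffic
            raise
        self.failures = 0                    # success resets the count
        return result

cb = CircuitBreaker(threshold=2)
def slow_node():
    raise TimeoutError("gray failure")

for _ in range(2):                           # two timeouts trip the breaker
    try:
        cb.call(slow_node)
    except TimeoutError:
        pass
assert cb.is_open
try:
    cb.call(lambda: "ok")                    # shed instantly, no timeout wait
    shed = False
except RuntimeError:
    shed = True
assert shed
```

Shedding fast is the point: callers fail in microseconds instead of stacking up behind the degraded node's timeout, which is what turns a gray failure into a cascade.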

🧠

Split-Brain Scenarios

When a cluster divides into isolated groups that each believe they are the sole authority

🧠 What is Split-Brain?

Split-brain occurs when a distributed system has two or more active leaders simultaneously. It is caused by a network partition or communication failure that divides cluster nodes into isolated groups, each unaware of the other — and each believing it is the only active set of nodes.

The result: servers may record the same data inconsistently, compete for resources, or make conflicting decisions. This can shut down the cluster or, worse, cause silent data corruption.

🚨
The Zombie Leader Problem: A leader node experiencing an intermittent failure (e.g., a stop-the-world GC pause) gets declared dead, and a new leader is elected. When the original leader recovers, it doesn't know it was replaced. The system now has two active leaders issuing conflicting commands.

🕸️ Split-Brain Scenario — 5-Node Cluster During Network Partition

[Diagram: 5-node cluster during a network partition. Zone A, majority partition (3 nodes → quorum ✅): 👑 Node 1 is the NEW leader at gen=2, with 📦 Nodes 2 and 3 following at gen=2; this side accepts reads and writes, elects the new leader, and increments the generation. Zone B, minority partition (2 nodes → no quorum ❌): 👻 Node 4, the old leader, and 📦 Node 5 are stuck at gen=1; this side should reject writes (no quorum) or it risks data inconsistency and corruption.]

🕰️ Generation Clocks — Detecting Zombie Leaders

Every time a new leader is elected, the generation number (also called "epoch" or "term" in Raft) is incremented. This generation number is included in every request the leader sends to followers. When an old leader reconnects, its stale generation number immediately identifies it as a zombie leader — allowing all nodes to ignore its requests.

Raft-style generation clock logic
// When a node receives a request from a leader:
function handle_leader_request(request, node_state):
  if request.generation < node_state.current_generation:
    // Stale leader — zombie detected!
    reject(request, reason="STALE_GENERATION")
    send_current_term(request.sender, node_state.current_generation)
    return

  if request.generation > node_state.current_generation:
    // We are behind — step down from any leadership role
    node_state.current_generation = request.generation
    node_state.role = FOLLOWER

  // Normal processing: valid leader with current generation
  process(request)

🔄

Failure Handling & Recovery Strategies

Consensus algorithms, quorum systems, fencing, and self-healing mechanisms

N/2+1
Quorum formula for N nodes
3×
Default replication factor (HDFS)
2f+1
Nodes needed to tolerate f crash failures
Odd
Node count prevents tie votes
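These formulas are small enough to check directly; the helper names are illustrative.

```python
def quorum(n: int) -> int:
    """Minimum votes for a majority in an n-node cluster: N/2 + 1."""
    return n // 2 + 1

def tolerated_crash_failures(n: int) -> int:
    """n >= 2f + 1 nodes are needed to tolerate f crash failures."""
    return (n - 1) // 2

assert quorum(5) == 3                        # 5 nodes: 3 votes needed
assert quorum(4) == 3                        # an even size needs just as many
assert tolerated_crash_failures(3) == tolerated_crash_failures(4) == 1
assert tolerated_crash_failures(5) == 2      # hence odd cluster sizes
```

The 3-node vs 4-node comparison is why odd sizes are recommended: the fourth node raises the quorum requirement without tolerating any additional failure.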

⚙️ Raft Consensus Algorithm — Leader Election

Raft is one of the most widely used consensus protocols. To elect a leader, a candidate must receive votes from a majority (quorum) of nodes. In a 5-node system, at least 3 nodes must agree. Because each node casts at most one vote per term, this guarantees that at most one leader can be elected in any given term.

Normal Operation → Leader Fails → New Election

Phase 1 (Normal): 👑 N1 is Leader; 📦 N2 and 📦 N3 are Followers.
Phase 2 (N1 Fails): 💀 N1 is dead; 🗳️ N2 becomes a Candidate; 📦 N3 votes for it.
Phase 3 (N2 Elected): N1 is gone; 👑 N2 is the NEW Leader; 📦 N3 remains a Follower.

🛡️ Split-Brain Prevention Strategies

Require a majority (N/2 + 1) of nodes to agree before any write or leadership decision. Only the partition with quorum continues operating. The minority partition rejects writes and waits for reconnection.

  • etcd + Raft: 5-node cluster requires 3 votes. If Zone A has 3 nodes and Zone B has 2, only Zone A proceeds.
  • ZooKeeper: Ensemble of servers uses leader election with quorum to prevent dual-leader scenarios.
  • DynamoDB Global Tables: Uses quorum-based techniques across regions to determine the authoritative source for updates.
  • Deploy clusters with odd numbers of nodes (3, 5, 7) to always guarantee a quorum majority.
💡
A node that cannot reach quorum (N/2 + 1) knows it is in the minority and stops making decisions — even if it's otherwise healthy.

Fencing forcefully isolates or shuts down the other partition to guarantee only one active node can modify data at any time. STONITH = "Shoot The Other Node In The Head."

  • Power switches or management cards physically power off the presumed-dead node
  • Cuts network access to shared storage from the fenced node
  • Guarantees no zombie leader can continue issuing commands
  • Combined with quorum in enterprise cluster managers (Pacemaker/Corosync)
  • Used in high-availability database clusters to prevent dual-primary scenarios

Nodes send periodic heartbeat signals to prove they are alive. Missing heartbeats trigger failure detection — but heartbeat intervals must be carefully tuned to avoid false positives from transient network lag.

  • Set heartbeat intervals based on measured network latency characteristics
  • Use adaptive timeouts that account for known transient conditions (GC pauses, high CPU)
  • Combine with short timeouts + robust failure detection mechanisms
  • etcd: configure heartbeat-interval and election-timeout for your environment
  • Avoid declaring nodes dead due to a single missed heartbeat — use a threshold (e.g., 3 consecutive misses)
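A minimal threshold-based failure detector along these lines; the interval and threshold values are illustrative, and time is passed in explicitly to keep the sketch deterministic.

```python
class HeartbeatMonitor:
    """Mark a node suspect only after `miss_threshold` consecutive missed
    heartbeat intervals, so one lost packet or a short pause is not fatal."""
    def __init__(self, interval: float = 1.0, miss_threshold: int = 3):
        self.interval, self.miss_threshold = interval, miss_threshold
        self.last_seen = {}                  # node -> last heartbeat time

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def suspected(self, node: str, now: float) -> bool:
        last = self.last_seen.get(node)
        if last is None:
            return False                     # never seen: not yet monitored
        missed = (now - last) / self.interval
        return missed >= self.miss_threshold

mon = HeartbeatMonitor(interval=1.0, miss_threshold=3)
mon.heartbeat("dn1", now=0.0)
assert not mon.suspected("dn1", now=2.5)     # 2 misses: still trusted
assert mon.suspected("dn1", now=3.0)         # 3 misses: mark suspect
```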

For AP systems that prioritize availability, allow divergence during partition and reconcile state when the partition heals using conflict resolution mechanisms.

  • CRDTs (Conflict-free Replicated Data Types) are data structures designed for automatic conflict-free merging
  • Vector clocks track causality to determine which updates happened before others
  • Last-Write-Wins (LWW): Simplest strategy — most recent timestamp wins on conflict
  • Application-level resolution: Expose conflicts to the application for domain-specific merging
  • Amazon's original Dynamo design used vector clocks for reconciliation; Apache Cassandra resolves write conflicts with last-write-wins timestamps (its Lightweight Transactions use Paxos when stronger guarantees are needed)
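A minimal vector-clock sketch showing how causality is compared: two updates that are neither before nor after each other are concurrent, i.e., a genuine conflict that one of the strategies above must resolve.

```python
def vc_increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter bumped."""
    bumped = dict(clock)
    bumped[node] = bumped.get(node, 0) + 1
    return bumped

def vc_compare(a: dict, b: dict) -> str:
    """Causal order of two versions: before, after, equal, or concurrent."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"                      # true conflict: must resolve

v1 = vc_increment({}, "A")                   # write on node A
v2 = vc_increment(v1, "A")                   # later write on node A
v3 = vc_increment(v1, "B")                   # divergent write on node B
assert vc_compare(v1, v2) == "before"        # causally ordered: no conflict
assert vc_compare(v2, v3) == "concurrent"    # partition-era divergence
```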

Consensus algorithms like Paxos and Raft mathematically guarantee that only one leader can be elected in any given "term" or "epoch," preventing split-brain at the protocol level.

  • Raft: Simpler and more understandable than Paxos. Used in etcd, CockroachDB, TiKV. Leader holds authority for a "term"; election requires majority.
  • Paxos: Academic gold standard. More complex; variations include Multi-Paxos (used in Chubby, Google Spanner).
  • Term/Epoch number: Monotonically increasing — a node with a higher term automatically supersedes one with a lower term.
  • Candidates campaign for votes; a vote can only be cast once per term per node.
  • Mathematical proof: it is impossible for two leaders to win a majority in the same term.
Raft leader election — simplified
// Candidate node starts election:
candidate.term += 1          // increment term
candidate.vote_for_self()
votes = broadcast_vote_request(candidate.term, candidate.log_index)

if votes >= quorum(cluster.size):
  become_leader()            // won majority → safe to lead
  broadcast_heartbeat()      // suppress other elections
else:
  revert_to_follower()       // lost → wait for next timeout

🔧 Node Failure — Complete Recovery Playbook

① Failure Detection

Heartbeat monitor detects missed responses. After configured threshold (e.g., 3 consecutive misses), the node is marked as suspected failed. Monitoring tools (Prometheus, Grafana) alert operators.

② Confirmation & Fencing

Cluster confirms failure via quorum check. Fencing isolates the failed node from shared storage or powers it down (STONITH). Prevents zombie leader from causing damage.

③ Leader Election (if failed node was leader)

Remaining nodes initiate election using Raft/Paxos. Candidate with most up-to-date log and fresh term number campaigns. Majority vote → new leader promoted. Generation/term number incremented.

④ Re-Replication

Metadata server (NameNode) detects under-replicated blocks. Schedules re-replication to bring blocks back to desired replication factor (e.g., 3×). Re-replication prioritized by urgency (blocks with only 1 copy first).

⑤ Load Rebalancing

Data from the failed node is redistributed across healthy nodes. Load balancer updates routing tables to exclude the failed node. Traffic seamlessly redirected — clients experience no downtime.

⑥ Node Recovery (optional)

When the failed node comes back online, it rejoins with a clean state. It syncs missed updates from the current leader using its generation number to verify it's rejoining safely (not as a zombie).


📊

Systems Comparison

Key distributed storage and file systems compared across architecture, consistency model, and use case

System | Type | Architecture | Consistency | CAP | Best For
HDFS | DFS | Single NameNode + DataNodes (master-slave) | Strong | CP | Large sequential batch reads (MapReduce)
GFS / Colossus | DFS | Distributed metadata servers + chunk servers | Strong | CP | Google-scale search, analytics
Amazon S3 | Object | Distributed key-value (NoSQL-style) | Eventual → Strong | AP | Unstructured data, media, backups
Apache Cassandra | DB / KV | Leaderless P2P ring (consistent hashing) | Tunable / Eventual | AP | High-write, globally distributed apps
etcd | KV Store | Raft consensus cluster (odd nodes) | Strong (linearizable) | CP | Config, service discovery, coordination
MongoDB | Document | Replica sets + sharding (primary-secondary) | Strong primary / Eventual secondaries | CP | Flexible-schema applications
GlusterFS | DFS | Distributed hash (brick-based, no master) | Eventual (with sync risk) | AP | NAS-style shared storage, VMs
Ceph | Unified | RADOS (object) + RBD (block) + CephFS (file) | Strong | CP | Cloud infrastructure, OpenStack, Kubernetes

🎯 Key Design Takeaways

  • Always deploy cluster nodes in odd numbers (3, 5, 7) to guarantee quorum majority
  • Choose CP (Consistency + Partition Tolerance) for financial transactions, config stores, and coordination services
  • Choose AP (Availability + Partition Tolerance) for content delivery, social feeds, and systems that tolerate stale reads
  • Design failure detection to distinguish transient from permanent failures using tuned timeouts
  • Use generation clocks / term numbers to prevent zombie leaders from causing split-brain
  • Implement fencing (STONITH) in high-availability systems where data correctness is critical
  • Maintain at least 3× replication across independent failure domains (racks, zones, data centers)
  • Monitor with Prometheus/Grafana; automate re-replication and leader election — humans are too slow