Resilience Patterns
Systems that recover gracefully from failures — absorbing shocks and returning to normal operation without data loss or prolonged outages.
The circuit breaker wraps calls to external dependencies. After a threshold of failures, it "opens" and fast-fails requests instead of waiting for timeouts — protecting the caller from cascade failures.
State flow: Closed (normal ops) → Open (fast fail) → Half-Open (probe) → Closed (recovered)
Circuit Breaker
Prevent cascading failures by short-circuiting calls to failing services after a configurable error threshold. Three states: Closed (normal), Open (fail fast), Half-Open (probe).
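The three states above can be sketched as a small call wrapper. This is a minimal illustration, not any specific library's API; the `CircuitBreaker` name, thresholds, and timeout defaults are all assumptions:

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: Closed -> Open -> Half-Open."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"   # allow a single probe request
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or crossing the threshold, (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "closed"   # success (or a successful probe) closes it
            return result
```

Production implementations typically add a rolling error-rate window and per-endpoint breakers rather than a simple consecutive-failure counter.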
Retry with Exponential Backoff
Retry transient failures with increasing wait times between attempts, plus jitter to prevent thundering herd. Combine with circuit breaker for compound protection.
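A minimal sketch of retry with exponential backoff and full jitter (function name, attempt counts, and delay defaults are illustrative):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry fn on exception, sleeping up to base_delay * 2**attempt each round."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise   # out of attempts: surface the last error
            # Full jitter: sleep a random amount up to the exponential cap,
            # so synchronized clients don't retry in lockstep (thundering herd).
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

In practice you would retry only on errors known to be transient (timeouts, 503s), never on validation or auth failures.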
Bulkhead Isolation
Isolate components into separate thread pools, connection pools, or processes. A failure or saturation in one bulkhead cannot exhaust resources in another.
Independent pool sizes per bulkhead (e.g., 20, 50, and 5 connections)
- Separate connection pools per downstream service
- Semaphore-based bulkhead for lightweight isolation
- Resource quota enforcement per tenant
Timeout & Deadline Propagation
Every outbound call must have a bounded deadline. In distributed systems, propagate context deadlines (gRPC context, HTTP headers) so all downstream services honor the same cut-off.
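A minimal sketch of the idea in plain Python. The `Deadline` class is illustrative; gRPC carries this in its context object, and HTTP services commonly forward the remaining budget in a header:

```python
import time

class Deadline:
    """Carry an absolute cut-off through a call chain; each hop gets what's left."""

    def __init__(self, timeout_s):
        self.expires_at = time.monotonic() + timeout_s

    def remaining(self):
        return max(0.0, self.expires_at - time.monotonic())

    def expired(self):
        return self.remaining() == 0.0

def call_downstream(deadline):
    """Skip the call entirely if the budget is already spent."""
    if deadline.expired():
        raise TimeoutError("deadline exceeded before call")
    # In a real client, pass deadline.remaining() as the request timeout,
    # e.g. an HTTP client's timeout parameter.
    return deadline.remaining()
```

The key property: every hop uses the *remaining* budget, so the whole chain respects the caller's original cut-off instead of stacking independent timeouts.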
Fallback & Graceful Degradation
When a service fails, return a cached value, a default, or a reduced-feature response — rather than returning an error to the end user.
- Stale cache fallback: Serve last known good response
- Static default: Return empty list, zero count
- Feature flag: Disable non-critical features
- Local computation: Run simplified offline logic
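The strategies above can be chained in preference order. A sketch, with `with_fallback` and the cached data as illustrative stand-ins:

```python
def with_fallback(primary, fallbacks):
    """Try primary; on failure, walk an ordered list of fallback strategies."""
    try:
        return primary()
    except Exception:
        for fb in fallbacks:
            try:
                return fb()
            except Exception:
                continue   # this fallback failed too; try the next one
        raise   # nothing worked: surface the error after all

# Hypothetical last-known-good cache populated by earlier successful calls.
stale_cache = {"recommendations": ["top-seller-1", "top-seller-2"]}

def fetch_recommendations():
    raise IOError("recommendation service down")

result = with_fallback(fetch_recommendations, [
    lambda: stale_cache["recommendations"],  # stale cache: last known good
    lambda: [],                              # static default: empty list
])
```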
Idempotency & At-Least-Once Safety
Design operations to be safe to execute multiple times. Use idempotency keys so that retried requests don't cause duplicate side effects.
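A sketch of idempotency-key handling. The in-memory dict stands in for a database table; in a real system the check-and-insert must be atomic (e.g., a unique-key constraint), since this naive version has a race between check and write:

```python
processed = {}   # idempotency key -> stored result (a DB table in practice)

def handle_payment(idempotency_key, amount, charge_fn):
    """Execute charge_fn at most once per key; replays return the stored result."""
    if idempotency_key in processed:
        return processed[idempotency_key]    # duplicate retry: no new side effect
    result = charge_fn(amount)
    processed[idempotency_key] = result
    return result
```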
Event Sourcing
Every state change is recorded as an event. State is derived by replaying the event log. This enables:
- Complete audit trail for debugging incidents
- Point-in-time state reconstruction
- Easy recovery by replaying from last snapshot
- Temporal decoupling between write and read models
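A minimal event-sourcing sketch: state is a pure fold over the event log, and recovery from a snapshot just seeds the fold. Event shapes and names are illustrative:

```python
def apply(balance, event):
    """Pure reducer: fold one event into the current state."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    return balance   # unknown events are ignored (forward compatibility)

def replay(events, snapshot=0):
    """Rebuild state by replaying the log, optionally from a snapshot."""
    state = snapshot
    for event in events:
        state = apply(state, event)
    return state

log = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]
```

Replaying the full log and replaying the tail from a snapshot must agree; that equivalence is what makes snapshot-based recovery safe.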
Fault Tolerance
Designing systems that continue operating correctly even when individual components fail — tolerating failures without impacting correctness or availability.
| Failure Type | Characteristics | Detection Strategy | Tolerance Approach |
|---|---|---|---|
| Crash Fault | Node stops responding entirely | Heartbeat timeout, health check failure | Replication, failover, leader election |
| Omission Fault | Messages lost in transit (receive or send) | Acknowledgment timeout, sequence number gaps | Reliable messaging, ACK + retransmit |
| Byzantine Fault | Node sends incorrect/malicious data | Cross-validation across replicas | BFT consensus (needs 3f+1 nodes) |
| Network Partition | Split-brain — nodes can't communicate | Quorum loss detection | CAP trade-offs, fencing tokens |
| Timing Fault | Clock skew, delayed message delivery | Logical clocks, hybrid logical clocks | Vector clocks, Lamport timestamps |
Single-Leader (Primary-Replica)
All writes go to one leader; replicas serve reads. Simple consistency model. Leader failure requires election (Raft, Paxos).
- Sync replication: strong consistency, higher latency
- Async replication: lower latency, risk of data loss on failover
- Semi-sync: at least one replica confirms before ack
Multi-Leader (Multi-Master)
Multiple nodes accept writes. Write conflicts must be resolved via Last-Write-Wins (LWW), CRDTs, or application-level merge.
- Geo-distributed writes across regions
- Offline-first clients (CouchDB)
- Requires conflict detection (vector clocks)
Leaderless (Quorum)
No designated leader. Clients write to W nodes and read from R nodes out of N replicas; consistency is tunable via quorum overlap.
N = total replicas, W = write quorum, R = read quorum. Strong consistency requires W + R > N; smaller W and R trade consistency for availability and lower latency.
- Read repair: fix stale nodes on read
- Anti-entropy: background Merkle tree sync
- Hinted handoff: store writes for unavailable nodes
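The quorum-overlap condition is a one-liner; the helper name is illustrative:

```python
def quorum_overlaps(n, w, r):
    """W + R > N guarantees every read quorum intersects every write quorum,
    so a read always contacts at least one replica holding the latest write."""
    return w + r > n

# Dynamo-style defaults: N=3, W=2, R=2 gives strongly consistent reads;
# W=1, R=1 favors latency and availability but can return stale data.
```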
Raft Algorithm
Understandable consensus with explicit leader election. Used in etcd, CockroachDB, TiKV, and most modern distributed databases.
- Tolerates up to floor((N-1)/2) node failures; a majority must survive
- Log matching property ensures consistency
- Leader completeness: elected leader has all committed entries
Paxos (Multi-Paxos)
The foundational consensus protocol. Complex but proven. Multi-Paxos elects a stable proposer to skip Phase 1 for subsequent instances.
Saga — Distributed Transaction Coordinator
Choreography Saga: Each service publishes events and listens to others. Decentralized, no coordinator, but harder to track overall state.
Orchestration Saga: A central saga orchestrator sends commands and handles failures with explicit compensation logic.
High Availability
Designing systems to achieve maximum uptime through redundancy, automated failover, health checking, and elimination of single points of failure.
Parallel availability: A(total) = 1 − (1−A₁)(1−A₂) ... (redundancy adds availability)
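A quick check of the parallel-availability formula (the helper is illustrative; the math assumes independent failures):

```python
def parallel_availability(*availabilities):
    """A_total = 1 - product(1 - A_i): probability at least one replica is up."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down

# Two independent 99% nodes in parallel:
# 1 - (0.01 * 0.01) = 0.9999, i.e. "two nines" become "four nines".
```

Note the independence assumption: correlated failures (shared power, shared deploy pipeline) erode this gain in practice.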
Active-Passive Failover
One active node handles all traffic; a passive standby is ready to take over. Automatic failover via health check + DNS cutover or floating IP.
- Hot standby: continuously syncing, fast failover (<30s)
- Warm standby: periodic sync, slower failover (1-5 min)
- Cold standby: restore from backup, minutes to hours
Active-Active Clustering
All nodes serve traffic simultaneously. Load balanced across the cluster. No wasted standby capacity. Requires shared state or distributed state management.
Traffic split evenly across the cluster (≈33% per node with three nodes)
- Full resource utilization — no idle standby
- Session affinity or distributed session store required
- Requires CAP trade-off decisions for shared state
Multi-Region Active-Active
Deploy across multiple geographic regions. Route users to nearest region. Regions replicate data asynchronously, with conflict resolution for writes.
- Global Traffic Manager / Anycast routing
- Region-local writes with async cross-region replication
- Conflict-free replicated data types (CRDTs)
- Survives entire region failure transparently
Layered Health Probe Design
Health checks should verify the actual ability to serve traffic, not just "the process is running."
Load Balancing Algorithms
| Algorithm | Best For |
|---|---|
| Round Robin | Homogeneous, stateless |
| Least Connections | Long-lived connections |
| Least Response Time | Mixed workload latencies |
| IP Hash / Consistent | Session stickiness |
| Resource Based | CPU/memory-aware routing |
| Power of Two Choices | Large distributed systems |
Blue-Green Deployment
Maintain two identical production environments (blue = current, green = new). Switch traffic atomically at the load balancer. Instant rollback by switching back.
Canary Release
Gradually shift traffic to the new version. Start with 1%, monitor error rates and latency, then incrementally increase: 1% → 5% → 20% → 100%.
Rolling Update
Replace instances one by one (or in batches of k). At least N-k healthy instances serve traffic at all times. Works with immutable infrastructure.
High Scalability
Designing systems that gracefully handle increasing load — from thousands to billions of requests — through horizontal scaling, partitioning, and async architectures.
Horizontal Scaling
Add more identical instances behind a load balancer. Stateless services scale linearly. The simplest and most common scaling technique for web tiers.
Functional Decomposition
Split a monolith into microservices. Each service scales independently based on its own bottleneck — not every service needs 100 instances.
Data Partitioning
Shard data across nodes so each node owns a subset. Lookups route to the appropriate shard. Scale-out data layer for read and write throughput.
Multi-Level Caching Hierarchy
- Browser: HTTP Cache-Control, offline cache
- CDN: Varnish / CloudFront
- Edge: Redis @ edge
- In-process: Caffeine / LRU Map
- Distributed: Redis Cluster
- Database: Materialized views
Consistent Hashing
Map both data keys and nodes onto a virtual ring. Each key belongs to the nearest node clockwise. Adding/removing a node only migrates keys from adjacent neighbors — not a full reshuffle.
Virtual nodes (vnodes) smooth the key distribution and let heterogeneous nodes carry load proportional to their capacity.
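A sketch of a ring with virtual nodes. MD5 is used here only as a convenient stable hash; class and parameter names are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with vnodes; each key maps to the first node clockwise."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []   # sorted list of (ring position, node)
        for node in nodes:
            for i in range(vnodes):
                # Each physical node owns many points on the ring.
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring point at or after the key's hash, wrapping around.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The defining property: removing a node reassigns only the keys that node owned; every other key keeps its owner.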
Range-Based Sharding
Assign contiguous key ranges to shards. Simple to implement and supports range queries within a shard. Risk: hot spots if keys are temporally skewed (e.g., all writes go to the "latest" shard).
Directory-Based Sharding
A lookup service (directory) maps each key to its shard. Maximum flexibility — can rebalance or migrate shards without rehashing. The directory itself is a bottleneck; cache it aggressively.
- Supports arbitrary partition reassignment
- Directory must be highly available (replicated)
- Used by MongoDB (config servers), Vitess
Message Queue Patterns
- Work queue (competing consumers): Multiple workers pull from the same queue. Each message is processed once. Scale workers by queue depth. Used in Celery, SQS, RabbitMQ work queues.
- Publish/subscribe: Each subscriber receives every message. Topics → subscriptions. Used for event broadcasting, cache invalidation, audit logging.
- Event log: Ordered, durable log of events (Kafka, Kinesis). Consumers maintain offsets, enabling replay, backfill, and multiple independent consumers on the same log.
Reactive Scaling
Scale based on observed metrics: CPU utilization, memory, request queue depth, custom metrics.
Predictive Scaling
Pre-scale based on historical patterns (time-of-day, day-of-week), ML forecasts, or scheduled events (sale launches).
- AWS Predictive Scaling uses ML on CloudWatch history
- Pre-warm instance pools before expected traffic spikes
- Combine with reactive for best of both worlds
- Cold start mitigation: keep min instances warm
Big Data Architecture
Architectures for processing, storing, and serving data at petabyte scale — from ingestion pipelines to analytical query engines.
Lambda Architecture
Combines a batch layer (accurate, high-latency), speed layer (approximate, low-latency), and a serving layer that merges both views.
- Batch layer: Spark / Hadoop
- Speed layer: Flink / Storm
- Serving layer: Druid / Cassandra
Kappa Architecture
Simplifies Lambda by using only a streaming layer (Kafka + Flink). Reprocessing is done by replaying the Kafka log with a new consumer group.
Data Lakehouse
Storage tiers: raw / landing → cleaned / dedupe → aggregated / enriched
Delta Lake / Iceberg ACID Guarantees:
- Serializable ACID transactions on object storage
- Schema evolution without rewriting all data
- Time travel: query data as of any past timestamp
- UPSERT / MERGE / DELETE via snapshot isolation
Windowing Strategies
Group stream events into finite windows for aggregation. Choice of window type directly affects accuracy and latency.
| Window Type | Description | Use Case |
|---|---|---|
| Tumbling | Fixed, non-overlapping intervals | Hourly metrics |
| Sliding | Fixed size, rolling every T | Moving averages |
| Session | Gap-based, user activity scoped | User sessions |
| Global | All events since beginning | Running totals |
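A tumbling-window aggregation over event time can be sketched directly. Window size and the count/sum aggregate are illustrative choices:

```python
from collections import defaultdict

def tumbling_windows(events, size_s):
    """Assign each (event_time, value) pair to a fixed, non-overlapping
    window and aggregate (here: count and sum) per window."""
    windows = defaultdict(lambda: {"count": 0, "sum": 0})
    for ts, value in events:
        window_start = (ts // size_s) * size_s   # floor to window boundary
        w = windows[window_start]
        w["count"] += 1
        w["sum"] += value
    return dict(windows)

events = [(3, 10), (7, 20), (12, 5)]   # (event_time_seconds, value)
# With 10s windows: events at t=3 and t=7 share window 0; t=12 falls in window 10.
```

A sliding window would assign each event to every window covering its timestamp; a session window would instead group events separated by less than a gap threshold.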
Watermarks & Late Data
In event-time processing, watermarks tell the engine "we won't see events older than T anymore." They allow emitting window results while handling some late-arriving data.
Exactly-Once Semantics
Guaranteeing that each event is processed exactly once: neither lost (the failure mode of at-most-once delivery) nor duplicated (the failure mode of at-least-once delivery).
- Flink: Distributed snapshots (Chandy-Lamport) + transactional sinks
- Kafka: Idempotent producers + transactions (atomic write across partitions + offsets)
- Spark: Structured Streaming write-ahead log + idempotent sinks
Distributed Systems Design
Fundamental theory and practical patterns for building correct, performant distributed systems — from CAP theorem trade-offs to service mesh observability.
Consistency
Every read receives the most recent write or an error. All nodes see the same data at the same time. Requires coordination on writes.
Availability
Every request gets a response (not necessarily the latest data). System remains operational even during node failures. No single point of failure.
Partition Tolerance
System continues operating despite network partitions. In distributed systems, network partitions are inevitable — so you must choose between C and A when a partition occurs.
| Model | Guarantee | Latency | Examples |
|---|---|---|---|
| Linearizable | Real-time ordering — strongest. Reads always see latest write globally. | High (synchronous coordination) | Zookeeper, etcd, Spanner |
| Sequential | Operations appear in some sequential order consistent with program order per process. | Moderate | Single-leader MySQL |
| Causal | Causally related operations are seen in order. Concurrent ops can differ across nodes. | Low-Moderate | MongoDB causal sessions, COPS |
| Read-Your-Writes | A process always sees its own writes. Other processes may still be stale. | Low | MySQL GTID sticky routing |
| Eventual | If no new updates, all replicas eventually converge. No timing guarantee. | Very Low | Cassandra, DynamoDB (default) |
Distributed Locking
Mutual exclusion across nodes. Use fencing tokens to prevent the "zombie write" problem where a lock holder that paused for GC believes it still holds the lock.
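The fencing check can be sketched as storage that remembers the highest token it has seen; `FencedStore` is an illustrative stand-in for the protected resource:

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        # A paused "zombie" lock holder resumes with an old token and is
        # refused, even though it still believes it holds the lock.
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.data[key] = value
```

The lock service must hand out monotonically increasing tokens (e.g., a ZooKeeper zxid or an epoch number) for this to work; the storage side only needs the comparison above.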
Leader Election
One node is designated leader to coordinate work. Raft, ZAB (ZooKeeper), and bully algorithm are common approaches.
- Epoch numbers prevent split-brain (old leader must respect new epoch)
- Lease-based leadership: leader holds time-bounded lease from majority
- Watch mechanism (ZK) for followers to detect leader failure instantly
Distributed Barrier & Semaphore
Coordinate a group of processes to a synchronization point. All workers wait at a barrier until all have reached it (e.g., all map tasks complete before reduce).
Three Pillars of Observability
Metrics. What: Numeric time-series (counters, gauges, histograms). Tools: Prometheus + Grafana. Patterns: RED (Rate, Errors, Duration) for services; USE (Utilization, Saturation, Errors) for resources.
Logs. What: Structured event records with context. Tools: ELK stack, Loki + Grafana. Patterns: structured JSON logging, correlation IDs, log levels with sampling.
Traces. What: End-to-end request flows across services via spans with parent-child relationships. Tools: Jaeger, Zipkin, Tempo. Standard: OpenTelemetry.
Service Mesh (Istio / Linkerd)
Sidecar proxies intercept all inter-service traffic — providing resilience, observability, and security without code changes.
- mTLS: Automatic mutual TLS between services
- Traffic management: Retries, timeouts, circuit breaking at proxy layer
- Canary: Traffic splitting by weight or header
- Telemetry: Automatic metrics and trace injection
Rate Limiting & Backpressure
Protect services from overload. Rate limiting at the edge; backpressure signals propagated through the system.
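A token bucket is one common edge-side limiter. This sketch (illustrative names) refills continuously and rejects requests when drained, leaving the caller to shed load or signal backpressure:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: refills at `rate` tokens/s up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        # Lazy refill: add tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False   # over the limit: reject, or propagate backpressure
```

The `capacity` controls burst tolerance while `rate` controls the sustained throughput, which is why token buckets are preferred over fixed windows at the edge.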
| System | Type | Default Consistency | Replication | HA Strategy |
|---|---|---|---|---|
| PostgreSQL | RDBMS | Serializable (single node) | Streaming replication (sync/async) | Patroni, pgBouncer, Citus |
| Cassandra | Wide-column | Eventual (tunable) | Leaderless, N=3 typical | Multi-DC, RF≥3 per DC |
| Kafka | Log / Queue | Sequential per partition | ISR (In-Sync Replicas) | min.insync.replicas=2, acks=all |
| Redis | In-memory | Strong (single node) | Async to replicas | Sentinel / Cluster mode |
| Elasticsearch | Search | Near real-time (1s refresh) | Primary + replica shards | Cross-cluster replication (CCR) |
| Apache Flink | Streaming | Exactly-once (checkpoints) | State backend replication | Standby TaskManagers, JobManager HA via ZK |