Resilience Patterns
Systems that recover gracefully from failures — absorbing shocks and returning to normal operation without data loss or prolonged outages.
The circuit breaker wraps calls to external dependencies. After a threshold of failures, it "opens" and fast-fails requests instead of waiting for timeouts — protecting the caller from cascade failures.
State flow: Closed (normal ops) → Open (fast fail) → Half-Open (probe) → Closed (recovered)
Circuit Breaker
Prevent cascading failures by short-circuiting calls to failing services after a configurable error threshold. Three states: Closed (normal), Open (fail fast), Half-Open (probe).
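The three states above can be sketched as a small call wrapper. This is a minimal illustration, not any specific library's API; the `CircuitBreaker` name, thresholds, and timeout defaults are all assumptions:

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: Closed -> Open -> Half-Open."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"   # allow a single probe request
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or crossing the threshold, (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "closed"   # success (or a successful probe) closes it
            return result
```

Production implementations typically add a rolling error-rate window and per-endpoint breakers rather than a simple consecutive-failure counter.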
Retry with Exponential Backoff
Retry transient failures with increasing wait times between attempts, plus jitter to prevent thundering herd. Combine with circuit breaker for compound protection.
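A minimal sketch of retry with exponential backoff and full jitter (function name, attempt counts, and delay defaults are illustrative):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry fn on exception, sleeping up to base_delay * 2**attempt each round."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise   # out of attempts: surface the last error
            # Full jitter: sleep a random amount up to the exponential cap,
            # so synchronized clients don't retry in lockstep (thundering herd).
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

In practice you would retry only on errors known to be transient (timeouts, 503s), never on validation or auth failures.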
Bulkhead Isolation
Isolate components into separate thread pools, connection pools, or processes. A failure or saturation in one bulkhead cannot exhaust resources in another.
Independent pool sizes per bulkhead (e.g., 20, 50, and 5 connections)
- Separate connection pools per downstream service
- Semaphore-based bulkhead for lightweight isolation
- Resource quota enforcement per tenant
Timeout & Deadline Propagation
Every outbound call must have a bounded deadline. In distributed systems, propagate context deadlines (gRPC context, HTTP headers) so all downstream services honor the same cut-off.
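A minimal sketch of the idea in plain Python. The `Deadline` class is illustrative; gRPC carries this in its context object, and HTTP services commonly forward the remaining budget in a header:

```python
import time

class Deadline:
    """Carry an absolute cut-off through a call chain; each hop gets what's left."""

    def __init__(self, timeout_s):
        self.expires_at = time.monotonic() + timeout_s

    def remaining(self):
        return max(0.0, self.expires_at - time.monotonic())

    def expired(self):
        return self.remaining() == 0.0

def call_downstream(deadline):
    """Skip the call entirely if the budget is already spent."""
    if deadline.expired():
        raise TimeoutError("deadline exceeded before call")
    # In a real client, pass deadline.remaining() as the request timeout,
    # e.g. an HTTP client's timeout parameter.
    return deadline.remaining()
```

The key property: every hop uses the *remaining* budget, so the whole chain respects the caller's original cut-off instead of stacking independent timeouts.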
Fallback & Graceful Degradation
When a service fails, return a cached value, a default, or a reduced-feature response — rather than returning an error to the end user.
- Stale cache fallback: Serve last known good response
- Static default: Return empty list, zero count
- Feature flag: Disable non-critical features
- Local computation: Run simplified offline logic
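The strategies above can be chained in preference order. A sketch, with `with_fallback` and the cached data as illustrative stand-ins:

```python
def with_fallback(primary, fallbacks):
    """Try primary; on failure, walk an ordered list of fallback strategies."""
    try:
        return primary()
    except Exception:
        for fb in fallbacks:
            try:
                return fb()
            except Exception:
                continue   # this fallback failed too; try the next one
        raise   # nothing worked: surface the error after all

# Hypothetical last-known-good cache populated by earlier successful calls.
stale_cache = {"recommendations": ["top-seller-1", "top-seller-2"]}

def fetch_recommendations():
    raise IOError("recommendation service down")

result = with_fallback(fetch_recommendations, [
    lambda: stale_cache["recommendations"],  # stale cache: last known good
    lambda: [],                              # static default: empty list
])
```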
Idempotency & At-Least-Once Safety
Design operations to be safe to execute multiple times. Use idempotency keys so that retried requests don't cause duplicate side effects.
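A sketch of idempotency-key handling. The in-memory dict stands in for a database table; in a real system the check-and-insert must be atomic (e.g., a unique-key constraint), since this naive version has a race between check and write:

```python
processed = {}   # idempotency key -> stored result (a DB table in practice)

def handle_payment(idempotency_key, amount, charge_fn):
    """Execute charge_fn at most once per key; replays return the stored result."""
    if idempotency_key in processed:
        return processed[idempotency_key]    # duplicate retry: no new side effect
    result = charge_fn(amount)
    processed[idempotency_key] = result
    return result
```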
Event Sourcing
Every state change is recorded as an event. State is derived by replaying the event log. This enables:
- Complete audit trail for debugging incidents
- Point-in-time state reconstruction
- Easy recovery by replaying from last snapshot
- Temporal decoupling between write and read models
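A minimal event-sourcing sketch: state is a pure fold over the event log, and recovery from a snapshot just seeds the fold. Event shapes and names are illustrative:

```python
def apply(balance, event):
    """Pure reducer: fold one event into the current state."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    return balance   # unknown events are ignored (forward compatibility)

def replay(events, snapshot=0):
    """Rebuild state by replaying the log, optionally from a snapshot."""
    state = snapshot
    for event in events:
        state = apply(state, event)
    return state

log = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]
```

Replaying the full log and replaying the tail from a snapshot must agree; that equivalence is what makes snapshot-based recovery safe.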
Fault Tolerance
Designing systems that continue operating correctly even when individual components fail — tolerating failures without impacting correctness or availability.
| Failure Type | Characteristics | Detection Strategy | Tolerance Approach |
|---|---|---|---|
| Crash Fault | Node stops responding entirely | Heartbeat timeout, health check failure | Replication, failover, leader election |
| Omission Fault | Messages lost in transit (receive or send) | Acknowledgment timeout, sequence number gaps | Reliable messaging, ACK + retransmit |
| Byzantine Fault | Node sends incorrect/malicious data | Cross-validation across replicas | BFT consensus (needs 3f+1 nodes) |
| Network Partition | Split-brain — nodes can't communicate | Quorum loss detection | CAP trade-offs, fencing tokens |
| Timing Fault | Clock skew, delayed message delivery | Logical clocks, hybrid logical clocks | Vector clocks, Lamport timestamps |
Single-Leader (Primary-Replica)
All writes go to one leader; replicas serve reads. Simple consistency model. Leader failure requires election (Raft, Paxos).
- Sync replication: strong consistency, higher latency
- Async replication: lower latency, risk of data loss on failover
- Semi-sync: at least one replica confirms before ack
Multi-Leader (Multi-Master)
Multiple nodes accept writes. Write conflicts must be resolved via Last-Write-Wins (LWW), CRDTs, or application-level merge.
- Geo-distributed writes across regions
- Offline-first clients (CouchDB)
- Requires conflict detection (vector clocks)
Leaderless (Quorum)
No designated leader. Clients write to W nodes and read from R nodes out of N replicas; consistency is tunable via quorum overlap.
N = total replicas, W = write quorum, R = read quorum. Strong consistency requires W + R > N; smaller W and R trade consistency for availability and lower latency.
- Read repair: fix stale nodes on read
- Anti-entropy: background Merkle tree sync
- Hinted handoff: store writes for unavailable nodes
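The quorum-overlap condition is a one-liner; the helper name is illustrative:

```python
def quorum_overlaps(n, w, r):
    """W + R > N guarantees every read quorum intersects every write quorum,
    so a read always contacts at least one replica holding the latest write."""
    return w + r > n

# Dynamo-style defaults: N=3, W=2, R=2 gives strongly consistent reads;
# W=1, R=1 favors latency and availability but can return stale data.
```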
Raft Algorithm
Understandable consensus with explicit leader election. Used in etcd, CockroachDB, TiKV, and most modern distributed databases.
- Tolerates up to floor((N-1)/2) node failures; a majority must survive
- Log matching property ensures consistency
- Leader completeness: elected leader has all committed entries
Paxos (Multi-Paxos)
The foundational consensus protocol. Complex but proven. Multi-Paxos elects a stable proposer to skip Phase 1 for subsequent instances.
Saga — Distributed Transaction Coordinator
Choreography Saga: Each service publishes events and listens to others. Decentralized, no coordinator, but harder to track overall state.
Orchestration Saga: A central saga orchestrator sends commands and handles failures with explicit compensation logic.
High Availability
Designing systems to achieve maximum uptime through redundancy, automated failover, health checking, and elimination of single points of failure.
Parallel availability: A(total) = 1 − (1−A₁)(1−A₂) ... (redundancy adds availability)
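A quick check of the parallel-availability formula (the helper is illustrative; the math assumes independent failures):

```python
def parallel_availability(*availabilities):
    """A_total = 1 - product(1 - A_i): probability at least one replica is up."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down

# Two independent 99% nodes in parallel:
# 1 - (0.01 * 0.01) = 0.9999, i.e. "two nines" become "four nines".
```

Note the independence assumption: correlated failures (shared power, shared deploy pipeline) erode this gain in practice.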
Active-Passive Failover
One active node handles all traffic; a passive standby is ready to take over. Automatic failover via health check + DNS cutover or floating IP.
- Hot standby: continuously syncing, fast failover (<30s)
- Warm standby: periodic sync, slower failover (1-5 min)
- Cold standby: restore from backup, minutes to hours
Active-Active Clustering
All nodes serve traffic simultaneously. Load balanced across the cluster. No wasted standby capacity. Requires shared state or distributed state management.
Traffic split evenly across the cluster (≈33% per node with three nodes)
- Full resource utilization — no idle standby
- Session affinity or distributed session store required
- Requires CAP trade-off decisions for shared state
Multi-Region Active-Active
Deploy across multiple geographic regions. Route users to nearest region. Regions replicate data asynchronously, with conflict resolution for writes.
- Global Traffic Manager / Anycast routing
- Region-local writes with async cross-region replication
- Conflict-free replicated data types (CRDTs)
- Survives entire region failure transparently
Layered Health Probe Design
Health checks should verify the actual ability to serve traffic, not just "the process is running."
Load Balancing Algorithms
| Algorithm | Best For |
|---|---|
| Round Robin | Homogeneous, stateless |
| Least Connections | Long-lived connections |
| Least Response Time | Mixed workload latencies |
| IP Hash / Consistent | Session stickiness |
| Resource Based | CPU/memory-aware routing |
| Power of Two Choices | Large distributed systems |
Blue-Green Deployment
Maintain two identical production environments (blue = current, green = new). Switch traffic atomically at the load balancer. Instant rollback by switching back.
Canary Release
Gradually shift traffic to the new version. Start with 1%, monitor error rates and latency, then incrementally increase: 1% → 5% → 20% → 100%.
Rolling Update
Replace instances one by one (or in batches of k). At least N-k healthy instances serve traffic at all times. Works with immutable infrastructure.
High Scalability
Designing systems that gracefully handle increasing load — from thousands to billions of requests — through horizontal scaling, partitioning, and async architectures.
Horizontal Scaling
Add more identical instances behind a load balancer. Stateless services scale linearly. The simplest and most common scaling technique for web tiers.
Functional Decomposition
Split a monolith into microservices. Each service scales independently based on its own bottleneck — not every service needs 100 instances.
Data Partitioning
Shard data across nodes so each node owns a subset. Lookups route to the appropriate shard. Scale-out data layer for read and write throughput.
Multi-Level Caching Hierarchy
- Browser: HTTP Cache-Control, offline cache
- CDN: Varnish / CloudFront
- Edge: Redis @ edge
- In-process: Caffeine / LRU Map
- Distributed: Redis Cluster
- Database: Materialized views
Consistent Hashing
Map both data keys and nodes onto a virtual ring. Each key belongs to the nearest node clockwise. Adding/removing a node only migrates keys from adjacent neighbors — not a full reshuffle.
Virtual nodes (vnodes) smooth the key distribution and let heterogeneous nodes carry load proportional to their capacity.
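A sketch of a ring with virtual nodes. MD5 is used here only as a convenient stable hash; class and parameter names are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with vnodes; each key maps to the first node clockwise."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []   # sorted list of (ring position, node)
        for node in nodes:
            for i in range(vnodes):
                # Each physical node owns many points on the ring.
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring point at or after the key's hash, wrapping around.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The defining property: removing a node reassigns only the keys that node owned; every other key keeps its owner.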
Range-Based Sharding
Assign contiguous key ranges to shards. Simple to implement and supports range queries within a shard. Risk: hot spots if keys are temporally skewed (e.g., all writes go to the "latest" shard).
Directory-Based Sharding
A lookup service (directory) maps each key to its shard. Maximum flexibility — can rebalance or migrate shards without rehashing. The directory itself is a bottleneck; cache it aggressively.
- Supports arbitrary partition reassignment
- Directory must be highly available (replicated)
- Used by MongoDB (config servers), Vitess
Message Queue Patterns
- Work queue (competing consumers): Multiple workers pull from the same queue. Each message is processed once. Scale workers by queue depth. Used in Celery, SQS, RabbitMQ work queues.
- Publish/subscribe: Each subscriber receives every message. Topics → subscriptions. Used for event broadcasting, cache invalidation, audit logging.
- Event log: Ordered, durable log of events (Kafka, Kinesis). Consumers maintain offsets, enabling replay, backfill, and multiple independent consumers on the same log.
Reactive Scaling
Scale based on observed metrics: CPU utilization, memory, request queue depth, custom metrics.
Predictive Scaling
Pre-scale based on historical patterns (time-of-day, day-of-week), ML forecasts, or scheduled events (sale launches).
- AWS Predictive Scaling uses ML on CloudWatch history
- Pre-warm instance pools before expected traffic spikes
- Combine with reactive for best of both worlds
- Cold start mitigation: keep min instances warm
Big Data Architecture
Architectures for processing, storing, and serving data at petabyte scale — from ingestion pipelines to analytical query engines.
Lambda Architecture
Combines a batch layer (accurate, high-latency), speed layer (approximate, low-latency), and a serving layer that merges both views.
- Batch layer: Spark / Hadoop
- Speed layer: Flink / Storm
- Serving layer: Druid / Cassandra
Kappa Architecture
Simplifies Lambda by using only a streaming layer (Kafka + Flink). Reprocessing is done by replaying the Kafka log with a new consumer group.
Data Lakehouse
Storage tiers: raw / landing → cleaned / dedupe → aggregated / enriched
Delta Lake / Iceberg ACID Guarantees:
- Serializable ACID transactions on object storage
- Schema evolution without rewriting all data
- Time travel: query data as of any past timestamp
- UPSERT / MERGE / DELETE via snapshot isolation
Windowing Strategies
Group stream events into finite windows for aggregation. Choice of window type directly affects accuracy and latency.
| Window Type | Description | Use Case |
|---|---|---|
| Tumbling | Fixed, non-overlapping intervals | Hourly metrics |
| Sliding | Fixed size, rolling every T | Moving averages |
| Session | Gap-based, user activity scoped | User sessions |
| Global | All events since beginning | Running totals |
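A tumbling-window aggregation over event time can be sketched directly. Window size and the count/sum aggregate are illustrative choices:

```python
from collections import defaultdict

def tumbling_windows(events, size_s):
    """Assign each (event_time, value) pair to a fixed, non-overlapping
    window and aggregate (here: count and sum) per window."""
    windows = defaultdict(lambda: {"count": 0, "sum": 0})
    for ts, value in events:
        window_start = (ts // size_s) * size_s   # floor to window boundary
        w = windows[window_start]
        w["count"] += 1
        w["sum"] += value
    return dict(windows)

events = [(3, 10), (7, 20), (12, 5)]   # (event_time_seconds, value)
# With 10s windows: events at t=3 and t=7 share window 0; t=12 falls in window 10.
```

A sliding window would assign each event to every window covering its timestamp; a session window would instead group events separated by less than a gap threshold.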
Watermarks & Late Data
In event-time processing, watermarks tell the engine "we won't see events older than T anymore." They allow emitting window results while handling some late-arriving data.
Exactly-Once Semantics
Guaranteeing that each event is processed exactly once: neither lost (the failure mode of at-most-once delivery) nor duplicated (the failure mode of at-least-once delivery).
- Flink: Distributed snapshots (Chandy-Lamport) + transactional sinks
- Kafka: Idempotent producers + transactions (atomic write across partitions + offsets)
- Spark: Structured Streaming write-ahead log + idempotent sinks
Distributed Systems Design
Fundamental theory and practical patterns for building correct, performant distributed systems — from CAP theorem trade-offs to service mesh observability.
Consistency
Every read receives the most recent write or an error. All nodes see the same data at the same time. Requires coordination on writes.
Availability
Every request gets a response (not necessarily the latest data). System remains operational even during node failures. No single point of failure.
Partition Tolerance
System continues operating despite network partitions. In distributed systems, network partitions are inevitable — so you must choose between C and A when a partition occurs.
| Model | Guarantee | Latency | Examples |
|---|---|---|---|
| Linearizable | Real-time ordering — strongest. Reads always see latest write globally. | High (synchronous coordination) | Zookeeper, etcd, Spanner |
| Sequential | Operations appear in some sequential order consistent with program order per process. | Moderate | Single-leader MySQL |
| Causal | Causally related operations are seen in order. Concurrent ops can differ across nodes. | Low-Moderate | MongoDB causal sessions, COPS |
| Read-Your-Writes | A process always sees its own writes. Other processes may still be stale. | Low | MySQL GTID sticky routing |
| Eventual | If no new updates, all replicas eventually converge. No timing guarantee. | Very Low | Cassandra, DynamoDB (default) |
Distributed Locking
Mutual exclusion across nodes. Use fencing tokens to prevent the "zombie write" problem where a lock holder that paused for GC believes it still holds the lock.
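The fencing check can be sketched as storage that remembers the highest token it has seen; `FencedStore` is an illustrative stand-in for the protected resource:

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        # A paused "zombie" lock holder resumes with an old token and is
        # refused, even though it still believes it holds the lock.
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.data[key] = value
```

The lock service must hand out monotonically increasing tokens (e.g., a ZooKeeper zxid or an epoch number) for this to work; the storage side only needs the comparison above.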
Leader Election
One node is designated leader to coordinate work. Raft, ZAB (ZooKeeper), and bully algorithm are common approaches.
- Epoch numbers prevent split-brain (old leader must respect new epoch)
- Lease-based leadership: leader holds time-bounded lease from majority
- Watch mechanism (ZK) for followers to detect leader failure instantly
Distributed Barrier & Semaphore
Coordinate a group of processes to a synchronization point. All workers wait at a barrier until all have reached it (e.g., all map tasks complete before reduce).
Three Pillars of Observability
Metrics. What: Numeric time-series (counters, gauges, histograms). Tools: Prometheus + Grafana. Patterns: RED (Rate, Errors, Duration) for services; USE (Utilization, Saturation, Errors) for resources.
Logs. What: Structured event records with context. Tools: ELK stack, Loki + Grafana. Patterns: structured JSON logging, correlation IDs, log levels with sampling.
Traces. What: End-to-end request flows across services via spans with parent-child relationships. Tools: Jaeger, Zipkin, Tempo. Standard: OpenTelemetry.
Service Mesh (Istio / Linkerd)
Sidecar proxies intercept all inter-service traffic — providing resilience, observability, and security without code changes.
- mTLS: Automatic mutual TLS between services
- Traffic management: Retries, timeouts, circuit breaking at proxy layer
- Canary: Traffic splitting by weight or header
- Telemetry: Automatic metrics and trace injection
Rate Limiting & Backpressure
Protect services from overload. Rate limiting at the edge; backpressure signals propagated through the system.
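A token bucket is one common edge-side limiter. This sketch (illustrative names) refills continuously and rejects requests when drained, leaving the caller to shed load or signal backpressure:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: refills at `rate` tokens/s up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        # Lazy refill: add tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False   # over the limit: reject, or propagate backpressure
```

The `capacity` controls burst tolerance while `rate` controls the sustained throughput, which is why token buckets are preferred over fixed windows at the edge.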
| System | Type | Default Consistency | Replication | HA Strategy |
|---|---|---|---|---|
| PostgreSQL | RDBMS | Serializable (single node) | Streaming replication (sync/async) | Patroni, pgBouncer, Citus |
| Cassandra | Wide-column | Eventual (tunable) | Leaderless, N=3 typical | Multi-DC, RF≥3 per DC |
| Kafka | Log / Queue | Sequential per partition | ISR (In-Sync Replicas) | min.insync.replicas=2, acks=all |
| Redis | In-memory | Strong (single node) | Async to replicas | Sentinel / Cluster mode |
| Elasticsearch | Search | Near real-time (1s refresh) | Primary + replica shards | Cross-cluster replication (CCR) |
| Apache Flink | Streaming | Exactly-once (checkpoints) | State backend replication | Standby TaskManagers, JobManager HA via ZK |