High Availability Patterns: Build Reliable Distributed Systems
Learn essential high availability patterns including redundancy, failover, load balancing, and SLA calculations. Practical strategies for building systems that stay online.
Availability measures how often a system is operational. High availability (HA) means the system stays up even when components fail. For critical systems, downtime costs money, reputation, and sometimes lives.
The CAP theorem tells us that during partitions, we choose between consistency and availability. High availability patterns help minimize the time spent in that trade-off by preventing failures and handling them gracefully when they occur.
This post covers practical patterns for building highly available systems.
Introduction
High availability means systems that remain operational even when components fail. We measure it as a percentage of uptime over time:
graph LR
A[99%] --> B[87.6 hours downtime/year]
A --> C[3.65 days downtime/year]
D[99.9%] --> E[8.76 hours downtime/year]
D --> F[525.6 minutes downtime/year]
G[99.99%] --> H[52.6 minutes downtime/year]
G --> I[8.6 seconds downtime/day]
The “nines” matter. Each additional nine represents a tenfold reduction in downtime. Whether that matters depends on your business context. A video streaming service can probably survive 4 hours of downtime per year. A hospital’s monitoring system cannot.
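The downtime figures above follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year (the function name `allowedDowntime` is illustrative and not part of any library):

```javascript
// Convert an availability percentage into the downtime it allows.
// Assumes a 365-day year; leap years shift the numbers slightly.
function allowedDowntime(availabilityPercent) {
  const unavailable = 1 - availabilityPercent / 100;
  const minutesPerYear = 365 * 24 * 60;
  const secondsPerDay = 24 * 60 * 60;
  return {
    hoursPerYear: (unavailable * minutesPerYear) / 60,
    minutesPerYear: unavailable * minutesPerYear,
    secondsPerDay: unavailable * secondsPerDay,
  };
}

for (const target of [99, 99.9, 99.99]) {
  const d = allowedDowntime(target);
  console.log(
    `${target}%: ${d.hoursPerYear.toFixed(2)} h/year, ${d.secondsPerDay.toFixed(1)} s/day`,
  );
}
```

Running the same function against a proposed target is a quick way to check whether the downtime budget is realistic for your on-call and deployment practices.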
Redundancy
The first line of defense against failure is having backup components. Redundancy means duplicating critical components so that if one fails, another takes over.
Types of Redundancy
Active-active: Multiple replicas serve traffic simultaneously. If one fails, others continue without interruption.
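// Active-active: every server takes a share of the traffic
const servers = ["server1", "server2", "server3"];
let nextIndex = 0;
async function handleRequest(request) {
  // Rotate the starting server so load is spread across all replicas
  const start = nextIndex;
  nextIndex = (nextIndex + 1) % servers.length;
  for (let i = 0; i < servers.length; i++) {
    const server = servers[(start + i) % servers.length];
    try {
      return await sendToServer(server, request);
    } catch (error) {
      // This replica failed; fall through to the next one
    }
  }
  throw new Error("All servers unavailable");
}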
Active-passive: One primary server handles traffic. Standby servers are ready but not processing requests until failover.
// Active-passive: standby takes over on primary failure
const primary = "primaryServer";
const standby = "standbyServer";
async function handleRequest(request) {
try {
return await sendToServer(primary, request);
} catch (error) {
// Failover to standby
console.log("Primary failed, activating standby");
return await sendToServer(standby, request);
}
}
Active-active requires more complex conflict resolution but provides better resource utilization. Active-passive is simpler but wastes resources on idle standby capacity.
These trade-offs shape your redundancy strategy. Pick active-active when downtime is unacceptable and the budget exists to run multiple hot sites. Pick active-passive when some downtime is tolerable and simplicity matters more.
Failover Patterns
Failover is the process of switching from a failed component to a backup. Several patterns exist:
Automatic vs Manual Failover
Automatic failover detects failures and switches without human intervention. Manual failover requires an operator to trigger the switch.
Automatic failover is faster but riskier. If detection is imperfect, you might fail over unnecessarily, causing a cascade of problems. Manual failover gives you control but introduces human delay.
For most production systems, a hybrid approach works: automatic detection with automatic failover for minor issues, manual intervention for major events.
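One common way to reduce false positives in automatic detection is to require several consecutive failed probes before declaring a component down, and several consecutive successes before trusting it again. The sketch below illustrates that idea; the class name and the default thresholds are assumptions chosen for the example, not values from any particular tool.

```javascript
// Failure detector with hysteresis: a target is marked down only after
// `failThreshold` consecutive failed probes, and marked up again only
// after `riseThreshold` consecutive successful probes.
class FailureDetector {
  constructor(failThreshold = 3, riseThreshold = 2) {
    this.failThreshold = failThreshold;
    this.riseThreshold = riseThreshold;
    this.healthy = true;
    this.streak = 0; // consecutive results that contradict the current state
  }

  // Record one probe result and return the (possibly updated) state.
  record(probeSucceeded) {
    if (probeSucceeded === this.healthy) {
      this.streak = 0; // result agrees with the current state
      return this.healthy;
    }
    this.streak++;
    const needed = this.healthy ? this.failThreshold : this.riseThreshold;
    if (this.streak >= needed) {
      this.healthy = !this.healthy;
      this.streak = 0;
    }
    return this.healthy;
  }
}
```

Higher thresholds make failover slower but less likely to trigger on a single dropped packet, so the values are a direct trade between detection time and false positives.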
Health Checks
Failover needs accurate failure detection:
// Health check endpoint
app.get("/health", async (req, res) => {
const health = {
status: "ok",
timestamp: Date.now(),
checks: {
database: await checkDatabase(),
cache: await checkCache(),
disk: await checkDiskSpace(),
},
};
// Return unhealthy if any critical check fails
const isHealthy =
health.checks.database === "ok" && health.checks.cache === "ok";
res.status(isHealthy ? 200 : 503).json(health);
});
async function checkDatabase() {
try {
await db.query("SELECT 1");
return "ok";
} catch (error) {
return "failed";
}
}
Health checks should verify actual functionality, not just process liveness. A process can be running but unable to serve requests.
Failover Time
Failover introduces latency. Components of failover time:
graph TD
A[Failover Time] --> B[Failure Detection]
A --> C[Decision to Failover]
A --> D[DNS/Route Update]
A --> E[New Instance Startup]
A --> F[Health Check Propagation]
B --> B1[Usually 5-30 seconds]
D --> D1[Can be 30-60 seconds for DNS]
DNS-based failover is slow because DNS records are cached and TTLs can be 60 seconds or more. Floating IP or anycast approaches are faster.
Load Balancing
Load balancers distribute traffic across multiple servers. They also help with availability by routing around failed instances.
Load Balancing Algorithms
Round robin: Send each request to the next server in sequence. Simple but does not account for varying request complexity.
// Round robin implementation
let currentIndex = 0;
const servers = ["server1", "server2", "server3"];
function getNextServer() {
const server = servers[currentIndex];
currentIndex = (currentIndex + 1) % servers.length;
return server;
}
Least connections: Send new requests to the server with the fewest active connections. Better for varying request durations.
IP hash: Route requests from the same client IP to the same server. Useful when session state is stored locally.
Weighted: Assign weights to servers based on capacity. More powerful servers receive more traffic.
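Weighted balancing can be implemented in several ways. The sketch below uses smooth weighted round robin, the approach nginx uses for weighted upstreams, because it interleaves servers instead of sending a burst to the heaviest one. The host names and weights are made up for the example.

```javascript
// Smooth weighted round robin.
// On every pick: add each server's weight to its running score, choose the
// highest score, then subtract the total weight from the chosen server.
class WeightedRoundRobin {
  constructor(servers) {
    // servers: [{ host, weight }]
    this.servers = servers.map((s) => ({ ...s, current: 0 }));
    this.totalWeight = servers.reduce((sum, s) => sum + s.weight, 0);
  }

  next() {
    let best = null;
    for (const s of this.servers) {
      s.current += s.weight;
      if (best === null || s.current > best.current) best = s;
    }
    best.current -= this.totalWeight;
    return best.host;
  }
}

const lb = new WeightedRoundRobin([
  { host: "big", weight: 5 },
  { host: "small-1", weight: 1 },
  { host: "small-2", weight: 1 },
]);
const picks = Array.from({ length: 7 }, () => lb.next());
console.log(picks.join(", "));
```

Over one full cycle of seven picks, the server with weight 5 is chosen five times and each of the others once, with the lighter servers spread through the sequence.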
Load Balancer Health Checks
// Load balancer health monitoring
const servers = [
{ host: "server1", healthy: true, connections: 10 },
{ host: "server2", healthy: true, connections: 5 },
{ host: "server3", healthy: false, connections: 0 },
];
function routeRequest(request) {
const healthy = servers.filter((s) => s.healthy);
if (healthy.length === 0) {
throw new Error("No healthy servers");
}
// Choose server with least connections
const target = healthy.reduce((a, b) =>
a.connections < b.connections ? a : b,
);
target.connections++;
return sendRequest(target.host, request).finally(() => target.connections--);
}
SLA Calculations
Service Level Agreements define expected availability. Calculating SLA helps you understand what your system needs to deliver.
SLA Composition
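For components chained in series, end-to-end availability is the product of the individual availabilities, so the result is always lower than even the weakest component: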
// Calculate combined SLA
function calculateSLA(slaValues) {
// For independent components, combined availability
// is the product of individual availabilities
const combined = slaValues.reduce((acc, sla) => {
// Convert percentage to decimal
const availability = sla / 100;
return acc * availability;
}, 1);
return (combined * 100).toFixed(2) + "%";
}
// Example: a request passes through a load balancer (99.99%)
// and three services in sequence (99.9% each)
// = 99.99% * 99.9% * 99.9% * 99.9%
const sla = calculateSLA([99.99, 99.9, 99.9, 99.9]);
console.log(sla); // Output: 99.69%
Every additional component in the request path lowers the combined availability. This is why HA architectures prefer minimal chains of dependencies.
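Redundant components behave differently from serial ones: a group of replicas is down only when every replica is down at once. The sketch below contrasts the two formulas. It assumes failures are independent, which real deployments rarely achieve (shared power, shared deploys, shared bugs), so treat the result as an upper bound. The helper names are illustrative.

```javascript
// Availabilities are fractions in [0, 1], not percentages.

// Serial chain: every component must be up.
const serial = (parts) => parts.reduce((acc, a) => acc * a, 1);

// Redundant group: the group fails only if every member fails.
// Assumes independent failures.
const parallel = (parts) => 1 - parts.reduce((acc, a) => acc * (1 - a), 1);

// Three redundant app servers (99.9% each) behind one load balancer (99.99%).
const appTier = parallel([0.999, 0.999, 0.999]);
const endToEnd = serial([0.9999, appTier]);
console.log((endToEnd * 100).toFixed(4) + "%");
```

Under these assumptions the redundant app tier is so reliable that the single load balancer dominates the result, which comes out at roughly 99.99%. That is the usual outcome of this exercise: the non-redundant component sets the ceiling.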
Planning for SLA
| Target SLA | Max Downtime/Year | Max Downtime/Month |
|---|---|---|
| 99% | 3.65 days | 7.30 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
99.999% (“five nines”) is extremely difficult. It allows only about 5 minutes of downtime per year. Most systems target 99.9% or 99.99%.
Circuit Breakers
Circuit breakers prevent cascading failures. When a downstream service is failing, the circuit breaker trips and fast-fails requests instead of overwhelming the dying service.
class CircuitBreaker {
constructor(failureThreshold = 5, timeout = 60000) {
this.failureThreshold = failureThreshold;
this.timeout = timeout;
this.failures = 0;
this.state = "CLOSED"; // CLOSED, OPEN, HALF_OPEN
this.nextAttempt = 0;
}
async call(fn) {
if (this.state === "OPEN") {
if (Date.now() > this.nextAttempt) {
this.state = "HALF_OPEN";
} else {
throw new Error("Circuit breaker is OPEN");
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
onSuccess() {
this.failures = 0;
this.state = "CLOSED";
}
onFailure() {
this.failures++;
if (this.failures >= this.failureThreshold) {
this.state = "OPEN";
this.nextAttempt = Date.now() + this.timeout;
}
}
}
The circuit breaker gives failing services time to recover instead of being buried by retry storms.
Graceful Degradation
When full functionality is impossible, provide partial functionality. This is graceful degradation.
// Graceful degradation example
async function getProductDetails(productId) {
const fullDetails = {
reviews: null,
relatedProducts: null,
priceHistory: null,
};
try {
fullDetails.reviews = await getReviews(productId);
} catch (error) {
console.log("Reviews unavailable");
}
try {
fullDetails.relatedProducts = await getRelated(productId);
} catch (error) {
console.log("Related products unavailable");
}
try {
fullDetails.priceHistory = await getPriceHistory(productId);
} catch (error) {
console.log("Price history unavailable");
}
return fullDetails;
}
Users see available information immediately instead of facing a blank screen or error message.
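The example above awaits each optional call in turn, so a slow reviews service delays everything after it. A possible variation, sketched here with `Promise.allSettled`, starts all optional calls at once and fills in `null` for any that fail. The `fetchers` parameter (a map from section name to async function) is a shape invented for this sketch.

```javascript
// Fetch optional sections in parallel; a failed section becomes null.
// `fetchers` maps a section name to an async function taking the product id.
async function loadOptionalSections(productId, fetchers) {
  const names = Object.keys(fetchers);
  const settled = await Promise.allSettled(
    names.map((name) => fetchers[name](productId)),
  );
  const details = {};
  settled.forEach((outcome, i) => {
    if (outcome.status === "fulfilled") {
      details[names[i]] = outcome.value;
    } else {
      console.warn(`${names[i]} unavailable:`, outcome.reason);
      details[names[i]] = null;
    }
  });
  return details;
}
```

Total latency is then bounded by the slowest section rather than the sum of all of them, which matters most when a dependency is degraded and responding slowly.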
Leader Election
When active-passive redundancy is used, the standby needs to know when to take over. Leader election handles this coordination — deciding which component is primary at any given moment.
Election Algorithms
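Single-node election (heartbeat lease): The primary holds a lease in durable storage and renews it with a heartbeat. If the heartbeat stops and the lease expires, the standby promotes itself. Simple, but vulnerable to split-brain: a primary that was only paused or partitioned may still write to shared storage after the standby has taken over.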
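// Heartbeat-based leader election
// `storage` is any lock service exposing acquireLock() and renewLock()
class LeaderElection {
  constructor(storage, leaseDuration = 15000) {
    this.storage = storage;
    this.leaseDuration = leaseDuration;
    this.isLeader = false;
    this.renewTimer = null;
  }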
async tryAcquireLeadership(lockKey) {
try {
const result = await this.storage.acquireLock(
lockKey,
this.leaseDuration,
);
this.isLeader = result;
if (this.isLeader) {
this.startHeartbeat(lockKey);
}
return this.isLeader;
} catch (error) {
return false;
}
}
startHeartbeat(lockKey) {
this.renewTimer = setInterval(async () => {
try {
await this.storage.renewLock(lockKey, this.leaseDuration);
} catch (error) {
console.log("Leadership lost:", error.message);
this.isLeader = false;
clearInterval(this.renewTimer);
}
}, this.leaseDuration / 3);
}
}
Raft-based election: Distributed consensus algorithm where nodes vote on leadership. A candidate needs majority votes to become leader. Used by etcd, CockroachDB.
The algorithm cycles through states: when a leader stops sending heartbeats, followers become candidates and request votes from peers. If a candidate gets majority votes, it becomes the new leader. If not, the cycle repeats.
graph LR
A[Start] --> B{Current leader alive?}
B -->|Yes| C[Continue serving]
B -->|No| D[Node becomes Candidate]
D --> E[Request votes from peers]
E --> F{Received majority?}
F -->|Yes| G[Become Leader]
F -->|No| H{Another won election?}
H -->|Yes| C
H -->|No| D
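ZooKeeper (ZAB protocol): ZooKeeper uses the ZAB (ZooKeeper Atomic Broadcast) protocol internally to keep its own ensemble consistent. Clients build leader election on top of that guarantee: each candidate creates an ephemeral sequential znode, and the candidate with the lowest sequence number becomes leader.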
The insight is that ZooKeeper provides total ordering across all nodes, so every node agrees on who should be leader without needing to talk to each other directly.
// ZooKeeper-based leader election (pseudocode; zkClient is a hypothetical promise-based client)
async function runForLeadership(zkClient, nodePath, onPredecessorGone) {
  // Create ephemeral sequential node
  const myNode = await zkClient.create(nodePath + "/leader-", {
    ephemeral: true,
    sequential: true,
  });
  const myName = myNode.split("/").pop();
  const children = await zkClient.getChildren(nodePath);
  const sorted = children.sort();
  if (sorted[0] === myName) {
    // Lowest sequence number: I'm the leader
    await zkClient.setData(nodePath, JSON.stringify({ leader: myNode }));
    return true;
  }
  // Watch only the node just ahead of mine; the callback should re-check leadership
  const previousNode = sorted[sorted.indexOf(myName) - 1];
  await zkClient.exists(nodePath + "/" + previousNode, onPredecessorGone);
  return false;
}
Split-Brain Prevention
Split-brain occurs when two nodes both believe they’re primary. Prevention strategies:
| Strategy | How It Works | Trade-off |
|---|---|---|
| Minority shutdown | If network partition occurs, minority side loses leadership | Some requests fail during partition |
| Fencing tokens | Leader must present incrementing token to storage | Storage must support fencing |
| Majority quorum | Leadership requires majority vote | Cannot tolerate partition > N/2 |
| Red zone | Partitioned leaders refuse writes | Some capacity wasted during partition |
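Fencing tokens are the least intuitive row in the table, so here is a minimal sketch of the idea. The storage layer remembers the highest token it has seen and rejects writes that carry an older one. `FencedStore` is an in-memory stand-in written for this example; a real system needs the check to happen inside the storage service itself.

```javascript
// In-memory stand-in for a storage service that enforces fencing tokens.
class FencedStore {
  constructor() {
    this.highestToken = 0;
    this.data = new Map();
  }

  // Reject any write carrying a token older than one already seen.
  write(key, value, fencingToken) {
    if (fencingToken < this.highestToken) {
      throw new Error(
        `Stale fencing token ${fencingToken} (current ${this.highestToken})`,
      );
    }
    this.highestToken = fencingToken;
    this.data.set(key, value);
  }
}

const store = new FencedStore();
store.write("config", "from leader A", 33); // leader A was elected with token 33
store.write("config", "from leader B", 34); // B is elected later with token 34
try {
  store.write("config", "late write from A", 33); // A resumes after a long pause
} catch (err) {
  console.log(err.message);
}
```

The old leader does not need to know it has been replaced. Its write is refused because the token it holds is older than the one the new leader has already presented.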
Election Checklist
- Pick an election algorithm that matches your consistency requirements
- Always use fencing tokens when writing to shared storage
- Test network partition scenarios before production breaks them for you
- Set up alerts for election events so you know when leadership changes
Database High Availability
Databases have their own availability challenges because they own state. The replication topology you choose affects both read throughput and how failures play out.
Replication Topologies
Primary-Replica (async): Writes go to the primary, replicas copy asynchronously. Simple to set up but replicas can lag behind during heavy write periods.
// Replica lag monitoring
class ReplicaMonitor {
  constructor(primary, walBytesPerSecond) {
    this.primary = primary; // must expose getWALPosition()
    this.walBytesPerSecond = walBytesPerSecond; // measured average WAL write rate
  }
  async checkReplicationHealth(replicas) {
    // Read the primary position once so every replica is compared to the same point
    const primaryPosition = await this.primary.getWALPosition();
    const results = await Promise.all(
      replicas.map(async (replica) => {
        const lagBytes = primaryPosition - (await replica.getWALPosition());
        return {
          replica: replica.name,
          lagBytes,
          // Rough estimate: bytes behind divided by the average write rate
          lagSeconds: lagBytes / this.walBytesPerSecond,
        };
      }),
    );
    const lagging = results.filter((r) => r.lagSeconds > 30);
    if (lagging.length > 0) {
      console.warn("Replicas lagging:", lagging);
    }
    return results;
  }
}
Semi-synchronous replication: A write is acknowledged only after at least one replica confirms it. An acknowledged write survives the loss of the primary, at the cost of added latency on every write.
Multi-primary (active-active): All nodes accept writes. Conflict resolution becomes your problem. Makes sense when writes need to happen in any region.
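The simplest conflict-resolution policy for multi-primary setups is last-write-wins (LWW). The sketch below shows it for a single record; the field names are illustrative. LWW silently discards the losing write and depends on reasonably synchronized clocks, so it is a starting point rather than a recommendation for data where lost updates matter.

```javascript
// Last-write-wins merge for one record. Each version carries the timestamp
// of the write and the id of the node that accepted it.
function mergeLastWriteWins(a, b) {
  if (a.timestamp !== b.timestamp) {
    return a.timestamp > b.timestamp ? a : b;
  }
  // Tie-break on node id so every node picks the same winner.
  return a.nodeId > b.nodeId ? a : b;
}

const fromUsEast = { value: "blue", timestamp: 1700000000100, nodeId: "us-east-1" };
const fromEuWest = { value: "green", timestamp: 1700000000250, nodeId: "eu-west-1" };

// The result is the same regardless of the order in which versions arrive.
console.log(mergeLastWriteWins(fromUsEast, fromEuWest).value);
console.log(mergeLastWriteWins(fromEuWest, fromUsEast).value);
```

The deterministic tie-break matters: without it, two regions receiving the same pair of versions could each keep a different one and never converge.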
Database Failover Patterns
| Pattern | Description | RTO | RPO |
|---|---|---|---|
| Automatic failover | Primary fails, replica promoted automatically | 30-60s | Potential data loss if async |
| Planned failover | Manual switch for maintenance | 60-120s | Zero (sync rep) |
| Multi-region failover | Cross-datacenter failover | 2-5min | Varies by replication |
sequenceDiagram
participant Primary as Primary DB
participant Replica as Read Replica
participant Arbiter as Arbiter/Quorum
participant App as Application
Primary->>Replica: Async WAL replication
App->>Primary: Write
Primary->>App: Acknowledged
Primary->>Primary: FAILURE
Replica->>Arbiter: "Primary down"
Arbiter->>Replica: "Promote yourself"
Replica->>App: "New primary ready"
Note over App: Configure new connection string
App->>Replica: Write (new primary)
Database HA Checklist
- Pick replication mode based on how much data loss you can tolerate (RPO)
- Monitor replica lag continuously with alerts — do not assume replicas are caught up
- Use connection pooling that reconnects automatically on primary failover
- Test failover with a real database, not just in theory
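On the application side, surviving a database failover mostly comes down to retrying failed operations for long enough that the new primary is available. A minimal sketch of retry with exponential backoff and jitter follows; the function name, option names, and defaults are assumptions for illustration, and `operation` stands for whatever call your database client exposes.

```javascript
// Retry an async operation with exponential backoff and full jitter.
// `operation` is any function returning a promise.
async function retryWithBackoff(operation, options = {}) {
  const { attempts = 5, baseDelayMs = 100, maxDelayMs = 5000 } = options;
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === attempts - 1) break; // no sleep after the final attempt
      const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      const delay = Math.random() * ceiling; // jitter avoids synchronized retries
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Only retry operations that are safe to repeat. A write that timed out may still have been applied on the old primary, so non-idempotent writes need an idempotency key or a read-before-retry check.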
Rate Limiting for High Availability
Rate limiting protects systems from overload. When traffic spikes or a DoS attack occurs, rate limiting ensures some users get service rather than all users getting no service.
Rate Limiting Algorithms
Token bucket: Requests consume tokens that refill at a fixed rate. Good for smooth, predictable limiting with burst allowance.
class TokenBucketRateLimiter {
constructor(rate, capacity) {
this.rate = rate; // tokens per second
this.capacity = capacity;
this.tokens = capacity;
this.lastRefill = Date.now();
}
allowRequest() {
this.refill();
if (this.tokens >= 1) {
this.tokens -= 1;
return true;
}
return false;
}
refill() {
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.rate);
this.lastRefill = now;
}
}
Sliding window: Requests counted in a rolling time window. More accurate than fixed windows but uses more memory to track timestamps.
class SlidingWindowRateLimiter {
constructor(maxRequests, windowMs) {
this.maxRequests = maxRequests;
this.windowMs = windowMs;
this.requests = [];
}
allowRequest() {
const now = Date.now();
const windowStart = now - this.windowMs;
// Remove old requests
this.requests = this.requests.filter((t) => t > windowStart);
if (this.requests.length < this.maxRequests) {
this.requests.push(now);
return true;
}
return false;
}
}
Hierarchical Rate Limiting
For distributed systems, use multiple levels:
graph TD
A[Request] --> B{Global Rate Limit}
B -->|Allowed| C{Service Rate Limit}
B -->|Rejected| G[429 Too Many Requests]
C -->|Allowed| D{Instance Rate Limit}
C -->|Rejected| H[429 - Service]
D -->|Allowed| E[Process Request]
D -->|Rejected| I[429 - Instance]
| Level | Scope | Purpose |
|---|---|---|
| Global | All services | Prevent DDoS |
| Service | Per microservice | Protect downstream |
| Instance | Per server/container | Prevent resource exhaustion |
| User | Per user/account | Fairness |
Rate Limiting Checklist
- Apply rate limiting at multiple levels: global, service, and instance
- Return 429 with Retry-After header so clients know when to retry
- Use token bucket for smooth rate limiting; sliding window when accuracy matters
- Watch rate limit rejections — they often signal an attack or bug before anything else fails
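For the Retry-After item, a token bucket makes the value easy to compute: it is the time until one full token has been refilled. The sketch below is framework-neutral and returns a plain object; the function names and response shape are assumptions for illustration.

```javascript
// Seconds until a token bucket holds at least one token again.
// `tokens` is the current (possibly fractional) count and `rate` is the
// refill rate in tokens per second.
function retryAfterSeconds(tokens, rate) {
  if (tokens >= 1) return 0;
  return Math.ceil((1 - tokens) / rate);
}

// Build the status and headers for a rejected request.
// Retry-After is sent as a whole number of seconds.
function rejectionResponse(tokens, rate) {
  return {
    status: 429,
    headers: { "Retry-After": String(retryAfterSeconds(tokens, rate)) },
    body: { error: "Too Many Requests" },
  };
}

console.log(rejectionResponse(0, 0.25));
```

Rounding up is deliberate: telling a client to come back slightly late is harmless, while telling it to come back too early produces another rejection.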
Health Check Design
Health checks determine whether an instance is fit to serve traffic. Bad health checks cause unnecessary failovers or worse — they mask failures until users experience them.
Deep Health Checks vs Liveness
// Liveness check - is the process alive?
app.get("/health/liveness", (req, res) => {
res.status(200).json({ status: "ok" });
});
// Readiness check - can the service handle requests?
app.get("/health/readiness", async (req, res) => {
const checks = {
dependencies: await checkDependencies(),
capacity: await checkCapacity(),
configured: await checkConfiguration(),
};
const healthy = Object.values(checks).every((c) => c.ok);
res.status(healthy ? 200 : 503).json({
status: healthy ? "ready" : "not_ready",
checks,
});
});
async function checkDependencies() {
try {
await db.query("SELECT 1");
await redis.ping();
return { ok: true };
} catch (error) {
return { ok: false, error: error.message };
}
}
Health Check Best Practices
| Practice | Why It Matters |
|---|---|
| Check dependencies, not just process | Process can be running but unable to serve |
| Use timeout on all checks | Prevent cascade from slow dependency |
| Avoid caches for health checks | Cache staleness masks actual dependency health |
| Return specific failure reasons | Faster debugging when health check fails |
| Separate readiness from liveness | Failed liveness = restart the instance; failed readiness = stop sending it traffic |
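The timeout practice in the table can be sketched with `Promise.race`. The helper names and the 2-second default below are assumptions for illustration; the `{ ok, error }` result matches the shape returned by `checkDependencies` earlier in this section.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, label = "check") {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Run one check under a deadline and normalise the result to { ok, error }.
async function runCheck(label, checkFn, ms = 2000) {
  try {
    await withTimeout(checkFn(), ms, label);
    return { ok: true };
  } catch (error) {
    return { ok: false, error: error.message };
  }
}
```

Keep the check timeout shorter than the probe timeout configured on the load balancer or orchestrator. Otherwise the probe gives up first and the specific failure reason never reaches the response.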
Health Check Checklist
- Have separate liveness and readiness endpoints
- Readiness should check dependencies, not just whether the process is alive
- Timeouts on all checks prevent cascade failures from slow dependencies
- Include enough detail in health check responses so debugging is faster
When to Use / When Not to Use
| Scenario | Recommendation |
|---|---|
| Critical systems (healthcare, finance) | High availability mandatory |
| SLA of 99.99%+ | Active-active with automatic failover |
| Cost-sensitive applications | Active-passive with manual failover |
| Stateless services | Load balancer with health checks |
| Stateful services with persistence | Database replication with failover |
When TO Use Active-Active
- Multiple geographic regions: Users in different regions hit their nearest datacenter
- High traffic volumes: Single active cannot handle load even with vertical scaling
- Zero-downtime requirements: Other nodes are already serving traffic, so losing one causes little or no visible interruption
- Read-heavy workloads: All replicas serve reads, dramatically increasing throughput
When NOT to Use Active-Active
- Complex conflict resolution: When data has mutable state that is hard to partition
- Strong consistency requirements: Synchronizing writes across active nodes is expensive
- Limited budget: Running multiple active datacenters costs significantly more
- Simple applications: The complexity cost outweighs availability benefits
Active-Active vs Active-Passive Decision Matrix
| Criteria | Active-Active | Active-Passive |
|---|---|---|
| Cost | 2-3x (all sites active) | 1.5-2x (standby idle) |
| Complexity | High (conflict resolution, sync) | Medium (failover logic) |
| Failover Speed | Instant (traffic split) | 30s-5min (promotion) |
| Write Throughput | Higher (all nodes write) | Limited to primary |
| Data Consistency | Complex (multi-master sync) | Simple (async replication) |
| Geographic Diversity | Excellent (multi-region) | Good (standby in other DC) |
| Resource Utilization | High (all nodes busy) | Low (standby idle) |
| Rollback Complexity | Complex (already processing) | Simple (demote standby) |
Multi-Region Deployment Patterns
When deploying across multiple geographic regions, consider these patterns:
| Pattern | Description | Use Case |
|---|---|---|
| Primary-Secondary | One region accepts writes, others replicate asynchronously | Read-heavy, geo-distributed users |
| Primary-Primary | All regions accept writes, conflict resolution required | Write-heavy, globally distributed users |
| CQRS + Global Traffic Routing | Separate read/write paths, route users to nearest region | Complex domains, maximum performance |
| Stateless Microservices + Regional Databases | Compute is stateless and globally distributed | Cloud-native, Kubernetes-based |
Cross-region replication considerations:
// Multi-region write latency expectations
const REPLICATION_LATENCY = {
"us-east-1 to eu-west-1": "~100-150ms RTT",
"us-east-1 to ap-southeast-1": "~200-250ms RTT",
"eu-west-1 to ap-southeast-1": "~150-200ms RTT",
};
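// Consistency vs latency tradeoff in multi-region
// writeToRegion and backgroundSyncToOtherRegions are application-specific helpers
const timeout = (ms) =>
  new Promise((_, reject) =>
    setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms),
  );
// Resolve once `needed` writes succeed; reject once that can no longer happen
function waitForAcks(writes, needed) {
  return new Promise((resolve, reject) => {
    const acks = [];
    let failures = 0;
    for (const write of writes) {
      write.then(
        (ack) => {
          acks.push(ack);
          if (acks.length === needed) resolve(acks);
        },
        () => {
          failures++;
          if (failures > writes.length - needed) {
            reject(new Error("Quorum not reached"));
          }
        },
      );
    }
  });
}
async function writeMultiRegion(key, value, options = {}) {
  const { consistencyLevel = "QUORUM", regions = ["us-east-1", "eu-west-1"] } =
    options;
  if (consistencyLevel === "ALL_REGIONS") {
    // Strongest consistency, highest latency
    return Promise.all(regions.map((region) => writeToRegion(region, key, value)));
  } else if (consistencyLevel === "QUORUM") {
    // A majority of regions must acknowledge within 5 seconds
    const writes = regions.map((r) => writeToRegion(r, key, value));
    const needed = Math.floor(regions.length / 2) + 1;
    return Promise.race([waitForAcks(writes, needed), timeout(5000)]);
  } else {
    // Local region only, async replicate
    await writeToRegion("local", key, value);
    backgroundSyncToOtherRegions(regions, key, value);
    return { success: true, replicated: false };
  }
}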
Stateful vs Stateless Failover Differences
| Aspect | Stateless Services | Stateful Services |
|---|---|---|
| Failover Complexity | Low (just restart somewhere) | High (must preserve state) |
| State Recovery | None needed | Must recover from replica or WAL |
| Failover Time | 5-30 seconds (container restart) | 30s-5min (state recovery) |
| Data Loss Risk | None | Possible (unreplicated writes) |
| Scaling | Horizontal (easy) | Requires consistent hashing/sharding |
| Session Affinity | Not needed | Often required (or externalize state) |
Stateful failover sequence:
sequenceDiagram
participant Primary as Primary DB
participant Standby as Standby DB
participant LB as Load Balancer
participant App as Application
Primary->>Standby: Replicate WAL continuously
Note over Primary,Standby: Async replication with lag
Primary->>LB: Health check OK
App->>LB: Write request
LB->>Primary: Route to primary
Primary->>Primary: CRASH - stops responding
Standby->>Primary: Health check fails
Standby->>LB: "I am available"
LB->>Standby: Promote to primary
Note over App: Brief write failure during failover
App->>LB: Retry write
LB->>Standby: Route to new primary
Standby->>App: Write acknowledged
Note over Primary,Standby: Re-sync when old primary recovers
Trade-off Analysis
HA Architecture Decisions
| Decision | Pros | Cons | Best For |
|---|---|---|---|
| Active-Active vs Active-Passive | Active-Active: instant failover, better utilisation | Active-Active: complex, expensive | Active-Active: critical systems, high traffic |
| Synchronous vs Async Replication | Sync: zero RPO | Sync: higher latency | Sync: financial systems; Async: web apps |
| Automatic vs Manual Failover | Auto: faster | Auto: risk of false positives | Auto: non-critical; Manual: critical |
| Multi-Region vs Single-Region | Multi: full DC failure survived | Multi: higher latency, more complex | Multi: 99.99%+ SLA required |
| Shared Storage vs Shared Nothing | Shared: simple | Shared: single point of failure | Shared nothing: maximum availability |
Kubernetes HA Patterns
For Kubernetes-based deployments:
| Pattern | Description | Trade-off |
|---|---|---|
| PodDisruptionBudget (PDB) | Ensures minimum pods available during disruptions | May delay node drains |
| PodAntiAffinity | Spreads pods across nodes/AZs | Requires enough nodes |
| Multi-AZ vs multi-region | AZ failure = local; Region failure = full DR | AZ simpler, Region safer |
| StatefulSets | Ordered deployment, persistent storage | More complex than Deployments |
# Example: HA Kubernetes deployment with anti-affinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: needs at least as many zones as replicas
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - api-server
              topologyKey: topology.kubernetes.io/zone
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-server
      containers:
        - name: api-server
          image: registry.example.com/api-server:1.0.0 # placeholder image
---
# Example: PodDisruptionBudget for minimum availability
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2 # Keep at least 2 pods available during voluntary disruptions
  selector:
    matchLabels:
      app: api-server
Production Failure Scenarios
| Failure Scenario | Impact | Mitigation |
|---|---|---|
| Load balancer instance failure | All traffic fails to backend | Run multiple LBs; health checks remove failed instances |
| Primary database failure | Write operations fail | Automatic failover to standby with health monitoring |
| DNS cache poisoning | Traffic routed to wrong/invalid IPs | Short TTLs; DNSSEC; use floating IP instead |
| Cascade failure | One component failure triggers others | Circuit breakers; bulkhead pattern; graceful degradation |
| Datacenter power failure | Entire site goes down | Multi-datacenter active-active; UPS and generators |
| Network partition between DCs | Split-brain risk | Quorum-based decisions; automatic failover lock |
| Disk full on primary | Database writes fail | Monitoring; auto-scaling storage; archive old data |
Real-world Failure Scenarios
Netflix 2012 Outage Incident
What happened: Netflix, a company famously built on availability-first principles, experienced a major outage in 2012. Despite designing for high availability, the Elastic Load Balancer configuration during a routine deployment caused approximately 5% of streaming sessions to fail for 18 hours.
Root cause: A configuration change deployed during peak traffic disabled health checks on three availability zones simultaneously. Netflix’s availability patterns routed traffic away from unhealthy zones, but the cumulative load on the remaining zones exceeded their capacity threshold.
Impact: An estimated 3 million Netflix users were unable to stream content during the outage. Customer support tickets increased by 400%.
Lesson learned: Availability patterns reduce the blast radius of component failures, but they don’t eliminate single points of failure in the control plane. Load balancer and deployment configuration deserve as much redundancy testing as data layer replication.
Heroku 2012 Multi-AZ Outage
What happened: In 2012, Heroku experienced a significant outage when the underlying AWS infrastructure in the US-east-1 region suffered a multi-availability-zone failure. Heroku’s platform assumed that at least one AZ would always be healthy, a reasonable assumption that proved false during this incident.
Root cause: Heroku’s routing layer had a hard dependency on AWS Route 53 for DNS failover. When AWS’s DNS infrastructure in US-east-1 experienced elevated latency, the routing layer’s health checking mechanism failed to detect healthy alternatives in other regions quickly enough.
Impact: Over 50,000 Heroku applications experienced complete unavailability for approximately 4 hours. Many applications had no means of manual failover since the platform managed the infrastructure entirely.
Lesson learned: Availability patterns are only as reliable as their dependencies. The failure of an upstream dependency can cascade through availability mechanisms designed to prevent cascading failures.
Cloudflare 2019 WAF Outage
What happened: On July 2, 2019, Cloudflare experienced a global outage affecting millions of websites. The incident took down Cloudflare’s core proxying, CDN, and WAF functionality, causing 502 errors for customers worldwide.
Root cause: A newly deployed WAF managed rule containing a poorly written regex targeting XSS attacks caused catastrophic backtracking. The regex contained patterns like .*(?:.*=.*) which, on certain input strings, made the number of matching steps grow much faster than the input length. For example, matching x=xxxxxxxxxxxxxxxxxxxx (20 x’s) required 555 steps. That super-linear growth overwhelmed CPU on all HTTP/HTTPS serving cores globally.
The rule was deployed at 13:42 UTC via Cloudflare’s Quicksilver system, which propagates configuration changes globally in seconds. Within minutes, the backtracking exhausted CPU across their entire HTTP/HTTPS fleet.
Impact: Approximately 20% of global traffic was affected. Major websites including financial institutions, news organisations, and communication platforms were unreachable for up to 30 minutes. The fix (disabling the rule) was not deployed until 14:07 UTC, 25 minutes later, partly because the operations team had difficulty accessing internal control systems through their own degraded edge.
Lesson learned: Even software-layer rule deployments can create catastrophic failure modes at global scale. The speed of configuration propagation (Quicksilver pushed to all PoPs in seconds) meant the damage was instant and worldwide. This underscores that availability at the application layer is bounded by the safety of your configuration deployment pipeline — and that rules affecting regex evaluation deserve the same rigorous testing and gradual rollout as any other code change.
Common Pitfalls / Anti-Patterns
These patterns represent the most common mistakes teams make when designing for high availability. Each is avoidable with proper planning.
Pitfall 1: Designing for Five Nines Without Budget
Problem: Targeting 99.999% availability requires significant investment in redundancy, monitoring, and processes. Teams targeting it without the budget to support it often miss the target.
Solution: Start with 99.9% (three nines), measure what actually causes downtime, and improve incrementally. As a rule of thumb, each additional nine costs roughly 10x more than the previous one.
Pitfall 2: Ignoring Dependency SLAs
Problem: Your service has 99.99% uptime but depends on a service with 99% uptime. Your actual availability is 99.99% × 99% = 98.99%.
Solution: Map all dependencies and calculate composite SLA. If a dependency is weak, add redundancy for that specific dependency or accept the lower composite SLA.
Pitfall 3: Automatic Failover Without Testing
Problem: Automatic failover sounds great until it triggers unnecessarily due to a false positive, causing a cascade of problems.
Solution: Test failover regularly (at least quarterly). Use manual failover for initial deployments until you have confidence in detection accuracy.
Pitfall 4: Forgetting About Session State
Problem: When failover happens, in-memory session state is lost. Users are logged out, shopping carts are emptied.
Solution: Externalize session state to Redis or similar. Use stateless request processing wherever possible. For sessions that must be local, use session affinity (with awareness of failover).
Chaos Engineering
Game Days for HA Patterns
Run these chaos experiments to validate availability patterns:
| Chaos Experiment | What It Validates | Success Criteria |
|---|---|---|
| Kill random app server | Load balancer routes around failure | < 1% error rate |
| Terminate primary DB | Failover completes successfully | < 60s downtime |
| Network partition (single DC) | Quorum maintained | Reads continue, writes queued |
| Fill disk on replica | Monitoring detects issue | Alert < 5 min, graceful degradation |
| Restart all app servers simultaneously | Traffic spikes handled | Rate limiting prevents cascade |
Pre-Game Day Checklist
- [ ] Define steady-state hypothesis
- [ ] Establish baseline metrics
- [ ] Notify stakeholders of experiment window
- [ ] Verify recent backup is available
- [ ] Confirm rollback plan
- [ ] Have on-call ready to intervene
- [ ] Document expected outcome
- [ ] Establish abort criteria
Security Checklist
- Load balancer accessible only via TLS (HTTPS)
- Health check endpoints authenticated
- Failover mechanisms protected from unauthorized trigger
- DNS records protected with short TTLs and DNSSEC
- Cross-datacenter traffic encrypted
- Secrets for failover mechanisms rotated regularly
- Audit logging of all failover events
Interview Questions
What to cover:
- Active-active means all nodes run hot, handling traffic simultaneously — failover is immediate because traffic is already being served elsewhere
- Active-passive keeps a standby idle until needed — failover takes time to promote the standby
- Active-active gives you instant failover and better resource use, but adds complexity around conflict resolution and costs more since all sites run full capacity
- Active-passive is simpler operationally but wastes resources on idle standby
- Pick active-active when downtime is unacceptable and you have the budget to run multiple sites. Pick active-passive when some downtime is tolerable and cost matters more.
What to cover:
- The composite SLA is the product of all individual SLAs — convert percentages to decimals, multiply, convert back
- Example: 99.99% × 99.9% × 99.9% ≈ 99.79%, which is only two nines even though each component had three or four
- The insight is that every additional dependency shrinks your overall availability
- In HA design, this means keeping dependency chains as short as possible
- If one dependency is weaker than the rest, add redundancy specifically for that dependency or accept the lower composite number
What to cover:
- Circuit breaker monitors failures to a downstream service — once the threshold is hit, it trips and fast-fails all requests instead of letting them pile up
- Three states: CLOSED (normal operation) → OPEN (failing fast) → HALF_OPEN (testing if the downstream has recovered)
- Use it when your service has multiple downstream dependencies and a cascade failure is likely if one starts struggling
- The point is to stop hammering a dying service and give it breathing room to recover
- Key parameters to configure: failure threshold and how long to wait before trying again (half-open)
- Real example: your database is slow; the circuit breaker trips so your app servers stop accumulating connections
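The three states can be sketched as a small class. This is a simplified, synchronous illustration rather than a production implementation: the option names (`failureThreshold`, `resetTimeoutMs`, `now`) are assumptions for this example, and a real breaker would wrap asynchronous calls and usually limit HALF_OPEN to a single in-flight trial.

```javascript
// Minimal circuit breaker sketch. `now` is injectable so timing can be tested.
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 10000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
    this.state = "CLOSED";
    this.failures = 0;
    this.openedAt = 0;
  }

  // Runs `fn` through the breaker. `fn` is any function that throws on failure.
  call(fn) {
    if (this.state === "OPEN") {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error("circuit open: failing fast");
      }
      this.state = "HALF_OPEN"; // wait elapsed: allow one trial call
    }
    try {
      const result = fn();
      this.state = "CLOSED"; // success closes the circuit and resets the count
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed trial call, or too many failures, opens the circuit.
      if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
        this.state = "OPEN";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

While the breaker is OPEN, `call` throws without invoking `fn` at all, which is what gives the struggling downstream service room to recover.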
What to cover:
- Failure detection: usually 5-30 seconds, depends on how often you run health checks
- Decision to failover: automatic is faster but riskier (might trigger unnecessarily), manual gives you control but introduces human delay
- DNS or route update: 30-60+ seconds — DNS caching with long TTLs is usually the culprit
- New instance startup: 30-120 seconds — depends on whether you pre-warm instances
- Health check propagation: 5-15 seconds for clients to realize the new instance is healthy
- To minimize: run health checks frequently, keep DNS TTLs short, pre-warm instances, use floating IPs instead of DNS for faster rerouting
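Adding up the stages gives a rough end-to-end failover budget. The sketch below uses the ranges listed above and treats the stages as strictly sequential, which is an assumption. The decision step is left out because no figure is given for it, and the DNS upper bound is open-ended ("60+"), so the real maximum can be higher.

```javascript
// Stage durations in seconds, taken from the ranges listed above.
const failoverStages = [
  { name: "failure detection", min: 5, max: 30 },
  { name: "DNS or route update", min: 30, max: 60 },
  { name: "new instance startup", min: 30, max: 120 },
  { name: "health check propagation", min: 5, max: 15 },
];

// Sums the best-case and worst-case durations across all stages.
function totalFailoverSeconds(stages) {
  return stages.reduce(
    (total, stage) => ({ min: total.min + stage.min, max: total.max + stage.max }),
    { min: 0, max: 0 }
  );
}

const failoverTotal = totalFailoverSeconds(failoverStages);
console.log(`${failoverTotal.min}-${failoverTotal.max} seconds`); // 70-225 seconds
```

Comparing this total against your RTO shows which stage to attack first. In this example instance startup dominates the worst case, which is why pre-warming pays off.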
What to cover:
- Liveness asks "is the process alive?" If it fails, the container or pod gets restarted
- Readiness asks "can this service handle traffic?" If it fails, the instance gets removed from the load balancer pool
- Liveness checks should be cheap and fast — just check if the process responds
- Readiness should verify actual functionality: can it reach the database, the cache, does it have its config?
- If you use readiness logic for a liveness check, you risk restarting containers when the issue is just a temporary dependency outage
What to cover:
- Graceful degradation means providing partial functionality when full functionality isn't available
- Instead of showing users a blank screen or an error, they see whatever the system can still deliver
- Implementation: wrap each dependency call in a try-catch and fill in a sensible default when it fails
- Example: an e-commerce product page shows the product, price, and availability even if the reviews and recommendations services are down
- The principle: fail fast on core features (you need those to function), fail gracefully on peripheral ones
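A product page along these lines might look like the sketch below. The `services` object and its methods (`getProduct`, `getReviews`, `getRecommendations`) are hypothetical placeholders that are assumed to return Promises. `Promise.allSettled` is used here in place of one try-catch per call, since it has the same effect while running the peripheral calls in parallel.

```javascript
// Hypothetical service clients are passed in; each method returns a Promise.
async function loadProductPage(productId, services) {
  // Core data: let failures propagate, the page is useless without it.
  const product = await services.getProduct(productId);

  // Peripheral data: fetch in parallel and fall back to defaults on failure.
  const [reviews, recommendations] = await Promise.allSettled([
    services.getReviews(productId),
    services.getRecommendations(productId),
  ]);

  return {
    product,
    reviews: reviews.status === "fulfilled" ? reviews.value : [],
    recommendations: recommendations.status === "fulfilled" ? recommendations.value : [],
    // Lets the caller log or display that the page is partial.
    degraded: reviews.status === "rejected" || recommendations.status === "rejected",
  };
}
```

The `degraded` flag is an illustrative addition: it gives monitoring a signal that users are seeing a reduced page even though no error was returned.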
What to cover:
- Split-brain happens when a network partition leaves two nodes both thinking they're primary — both accept writes, which leads to data divergence or corruption
- Prevention strategies: minority shutdown (the side with fewer nodes loses leadership), fencing tokens (storage rejects writes from the old primary), majority quorum (you need majority votes to be leader), red zone (partitioned leaders refuse to accept writes)
- Fencing is a common approach: the primary must present an incrementing token when writing to storage, and the storage rejects any write with a stale token
- Always test this — deliberately create network partitions to see if your split-brain protection actually works
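The fencing idea can be illustrated with two small classes. Both `FencedStorage` and `LeaderElection` are hypothetical names for this sketch; in a real system the token would come from a coordination service and the check would live inside the storage layer itself.

```javascript
// Storage that rejects any write carrying a token older than the newest seen.
class FencedStorage {
  constructor() {
    this.highestToken = 0;
    this.data = new Map();
  }

  // Returns true if the write was accepted, false if the token was stale.
  write(token, key, value) {
    if (token < this.highestToken) {
      return false; // stale primary: reject
    }
    this.highestToken = token;
    this.data.set(key, value);
    return true;
  }
}

// Stand-in for a lock service that issues a larger token on every election.
class LeaderElection {
  constructor() {
    this.token = 0;
  }

  electLeader() {
    this.token += 1;
    return this.token;
  }
}

const election = new LeaderElection();
const storage = new FencedStorage();

const oldPrimaryToken = election.electLeader();
storage.write(oldPrimaryToken, "balance", 100); // accepted

// A partition triggers a new election; the old primary does not know yet.
const newPrimaryToken = election.electLeader();
storage.write(newPrimaryToken, "balance", 90); // accepted

const staleWriteAccepted = storage.write(oldPrimaryToken, "balance", 120);
console.log(staleWriteAccepted); // false
```

The old primary still believes it is the leader, but its write is rejected because the storage has already seen a newer token.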
What to cover:
- Synchronous replication waits for the replicas to confirm the write before acknowledging it: no data loss on primary failure, but every write now has to wait for a round-trip to the replicas
- Asynchronous replication acknowledges writes immediately: lower latency, but the replica might lag behind if the network hiccups or the replica is under load, and writes that have not yet shipped are lost if the primary fails
- Semi-synchronous is the middle ground: wait for at least one replica to confirm, not all of them
- The choice comes down to your RPO tolerance — financial systems usually need synchronous replication, most web apps can live with asynchronous
- Watch out for async replication lag: the primary can be perfectly healthy while the replica silently falls behind due to network issues or heavy write load
What to cover:
- Without rate limiting: a traffic spike hits your service, it starts timing out, clients retry, the retry load makes things worse, and you're in a cascade failure
- Rate limiting at multiple levels: global (stops DDoS), service (protects your downstream dependencies), instance (prevents your own resources from being exhausted)
- Token bucket gives you smooth limiting with burst allowance; sliding window gives you more accuracy but uses more memory
- When a limit is hit, return 429 with a Retry-After header so clients know when to come back
- Rate limit rejections are an early warning signal — a spike in 429s often means an attack is underway or a client has gone rogue
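A token bucket is small enough to sketch in full. This is a single-process illustration, assuming one bucket per client held in memory; the option names and the injectable `now` clock are choices made for this example. A multi-instance deployment would need shared state, which is not shown.

```javascript
class TokenBucket {
  constructor({ capacity, refillPerSecond, now = Date.now }) {
    this.capacity = capacity; // maximum burst size
    this.refillPerSecond = refillPerSecond; // sustained rate
    this.now = now;
    this.tokens = capacity;
    this.lastRefill = now();
  }

  // Returns { allowed: true } or { allowed: false, retryAfterSeconds }.
  tryRemoveToken() {
    // Top up based on elapsed time, never exceeding capacity.
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = current;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return { allowed: true };
    }
    // Time until one whole token is available, rounded up to whole seconds.
    const retryAfterSeconds = Math.ceil((1 - this.tokens) / this.refillPerSecond);
    return { allowed: false, retryAfterSeconds };
  }
}
```

When `allowed` is false, an HTTP handler would respond with status 429 and set the `Retry-After` header to `retryAfterSeconds`.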
What to cover:
- Chaos engineering means deliberately injecting failures to test whether your system actually handles them the way you think it does
- Game days are planned chaos experiments run during maintenance windows — you notify stakeholders, define success criteria, and have someone ready to intervene if things go sideways
- Common experiments: kill a random app server, terminate the primary database, simulate a network partition, fill up a disk
- Success criteria: error rate stays below a threshold, failover completes within your RTO, graceful degradation actually works
- Before you start: define what "normal" looks like (steady-state hypothesis), capture baseline metrics, have a rollback plan
- Netflix created Chaos Monkey to randomly kill servers in production — that spawned the whole discipline of chaos engineering
What to cover:
- RTO (Recovery Time Objective) is how long you can afford to be down — the maximum acceptable time to restore service after a failure
- RPO (Recovery Point Objective) is how much data you can afford to lose — the maximum acceptable time window for data loss
- RTO drives your failover speed requirements: a 5-minute RTO means failover must complete in under 5 minutes, which constrains your architecture choices
- RPO drives your replication strategy: a 1-hour RPO means you can lose up to 1 hour of data, so async replication might be fine; a 0 RPO requires synchronous replication
- Financial systems often have 0 RPO (no data loss acceptable) and short RTOs (minutes), which pushes toward synchronous replication and fast failover
- Most web apps can tolerate some data loss (RPO of minutes to hours) and moderate downtime (RTO of 30-60 minutes), which opens up async replication and simpler failover
What to cover:
- Stateless services: just restart somewhere else, no state to recover. Failover time is dominated by instance startup and health check propagation (5-30 seconds typically)
- Stateful services: must preserve or recover state before serving traffic. Failover involves replica promotion, WAL replay, connection string updates (30s-5min typically)
- The hard part is the state itself — if you are mid-write when the primary dies, async replication means that write might be lost
- Stateful failover sequence: detect failure, let the standby finish replaying the WAL it has already received, promote it, update connection strings, verify that any remaining replicas are following the new primary
- Session state is a common trap: in-memory sessions are lost on failover. Always externalize session state to Redis or similar
- Consistent hashing helps with stateful scaling — when a node fails, only its key space remaps rather than everything
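The consistent hashing point can be made concrete with a small hash ring. This is a sketch under simplifying assumptions: the FNV-1a hash is chosen only because it is short, the lookup is a linear scan where a real implementation would use binary search, and the class and parameter names are invented for this example.

```javascript
// FNV-1a 32-bit hash: simple and deterministic, not cryptographic.
function hash32(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

class HashRing {
  constructor(nodes, virtualNodes = 100) {
    this.virtualNodes = virtualNodes;
    this.ring = []; // sorted array of { point, node }
    nodes.forEach((node) => this.addNode(node));
  }

  // Each node is placed at many points to even out the key distribution.
  addNode(node) {
    for (let i = 0; i < this.virtualNodes; i++) {
      this.ring.push({ point: hash32(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  removeNode(node) {
    this.ring = this.ring.filter((entry) => entry.node !== node);
  }

  // First ring entry at or after the key's hash, wrapping to the start.
  getNode(key) {
    const h = hash32(key);
    const entry = this.ring.find((e) => e.point >= h) || this.ring[0];
    return entry.node;
  }
}

const ring = new HashRing(["node-a", "node-b", "node-c"]);
const ownerBefore = ring.getNode("session:42");
ring.removeNode("node-b");
const ownerAfter = ring.getNode("session:42");
// ownerAfter differs from ownerBefore only if node-b owned the key.
```

Removing a node deletes only that node's points from the ring, so keys owned by the surviving nodes keep their owner. With naive `hash(key) % nodeCount` placement, most keys would move instead.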
What to cover:
- Multi-AZ: instances span availability zones within a single region. Failure of one AZ is handled by the others. Latency between AZs is low (1-5ms). Cost is moderate
- Multi-region: instances span different geographic regions. Failure of an entire region is survivable. Latency between regions is higher (100-250ms depending on distance). Cost is significantly higher
- Multi-AZ protects against AZ failures (power, networking, hardware) but not region-level disasters (natural disasters, regulatory events)
- Multi-region is usually needed for 99.99%+ availability: at that target the yearly downtime budget is about 52 minutes, and a single region-wide outage can exceed it
- The catch with multi-region: you now have cross-region replication latency, which affects both performance and RPO. Strong consistency across regions is expensive
- Most applications start with multi-AZ and only add multi-region when they have specific HA requirements and the budget to support it
What to cover:
- CAP theorem: during a network partition (P), you must choose between consistency (C) and availability (A). You cannot have both when the network is split
- In practice, "partition" means your system detects network connectivity issues. The choice is how to respond: sacrifice availability (reject requests to maintain consistency) or sacrifice consistency (serve stale data to stay available)
- Most modern systems choose availability by default — they stay up during partitions and accept that reads might be stale or writes might be lost
- PACELC extends CAP: when there is no partition (E = else), you still have a latency vs consistency trade-off. Strong consistency requires coordination, which adds latency
- PACELC clarifies that even in the absence of partitions, you are always making a choice: do you want consistency (slower, more coordination) or speed (faster, potential staleness)?
- In HA design, this means accepting that highly consistent systems will be slower and less available during recovery, while highly available systems will have weaker consistency guarantees
What to cover:
- The bulkhead pattern isolates components so that failure in one area does not spread to others — named after ship bulkheads that contain flooding to a single compartment
- In microservices: each service has its own connection pool to the database. If one service misbehaves and exhausts its pool connections, other services can still function because they have their own pools
- Without bulkheads: one service's connection leak exhausts the shared connection pool, and now every service is blocked — cascade failure
- Implementation: set per-service connection limits that are smaller than the total available. Monitor and alert on connection pool utilization per service
- Bulkheads work alongside circuit breakers: bulkheads contain the blast radius, circuit breakers stop the flood of retries
- Kubernetes namespaces can act as bulkheads — resource quotas prevent a noisy namespace from consuming all cluster resources
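A bulkhead at the application level can be as simple as a concurrency cap per dependency. The `Bulkhead` class below is a hypothetical sketch: it rejects immediately when the compartment is full, whereas a real implementation might queue callers for a bounded time instead.

```javascript
// A bulkhead caps how many calls one dependency may have in flight.
class Bulkhead {
  constructor(name, maxConcurrent) {
    this.name = name;
    this.maxConcurrent = maxConcurrent;
    this.inFlight = 0;
  }

  // Runs an async function if a slot is free; rejects immediately otherwise.
  async run(fn) {
    if (this.inFlight >= this.maxConcurrent) {
      throw new Error(`bulkhead "${this.name}" is full`);
    }
    this.inFlight += 1;
    try {
      return await fn();
    } finally {
      this.inFlight -= 1; // always free the slot, even on failure
    }
  }
}

// One compartment per downstream dependency (the limits are illustrative).
const ordersBulkhead = new Bulkhead("orders-db", 10);
const searchBulkhead = new Bulkhead("search", 5);
```

If calls to the orders database hang and fill all ten slots, further orders calls fail fast, while search calls continue to run in their own compartment.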
What to cover:
- Each downstream dependency needs its own check, aggregated into an overall health status. The health endpoint should reflect the actual serving capability of the service
- Structure: return status (ok/degraded/down) plus per-dependency detail so operators can diagnose quickly. Avoid just returning "unhealthy" with no information
- Set timeouts on every dependency check — a slow dependency should not block your health check from completing. If a check times out, treat it as failed
- Segment checks: a liveness probe should be cheap and fast (is the process alive?), a readiness probe should verify actual functionality (can this instance serve traffic?)
- Do not use cached values in health checks — staleness masks actual dependency health. Check the dependency directly
- Consider dependency injection for health checks: makes testing easier and allows you to mock failing dependencies in chaos experiments
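An aggregated health endpoint with per-dependency timeouts might be structured like this. The shape of the `checks` object, including the `critical` flag that decides between "degraded" and "down", is an assumption made for this sketch rather than an established convention.

```javascript
// Resolves with the check result, or rejects if it takes longer than timeoutMs.
function withTimeout(promise, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timed out")), timeoutMs);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `checks` maps a dependency name to { check: async function, critical: boolean }.
// Check functions are injected, so tests can substitute failing ones.
async function healthStatus(checks, timeoutMs = 1000) {
  const names = Object.keys(checks);
  const results = await Promise.allSettled(
    names.map((name) => withTimeout(checks[name].check(), timeoutMs))
  );

  const dependencies = {};
  let status = "ok";
  results.forEach((result, i) => {
    const name = names[i];
    if (result.status === "fulfilled") {
      dependencies[name] = "ok";
      return;
    }
    // A timeout is treated the same as any other failure.
    dependencies[name] = `failed: ${result.reason?.message ?? String(result.reason)}`;
    if (checks[name].critical) {
      status = "down";
    } else if (status === "ok") {
      status = "degraded";
    }
  });
  return { status, dependencies };
}
```

A readiness endpoint could return this object as JSON and map "down" to a non-2xx status code, while the liveness endpoint stays a trivial handler that never calls dependencies.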
What to cover:
- Quotas prevent any single tenant, service, or user from consuming all available resources. Without quotas, one misbehaving client can exhaust connection pools, memory, or CPU, affecting everyone
- Rate limiting handles request volume; quotas handle resource consumption per entity. Both are needed for complete protection
- Implement hierarchical quotas: global quotas (total system capacity), per-service quotas (prevent one service from dominating), per-tenant quotas (fairness between customers)
- When a quota is exceeded: return 429 with a Retry-After header so clients can back off and retry later rather than hammering immediately
- Monitor quota utilization — approaching quota limits is an early warning of an oncoming failure. A sudden spike in 429s often precedes a cascade
- Set conservative default quotas and allow customers to request increases through a defined process, so you can validate before granting
What to cover:
- Start with your target SLA and work backward: 99.99% allows ~52 minutes of downtime per year, which constrains your failover time, RTO, and how quickly you must detect failures
- Map all dependencies and their individual SLAs. Composite SLA = product of all SLAs. If any dependency does not meet your target, add redundancy for that dependency specifically
- Capacity planning for HA is different from normal capacity planning: you must plan for degraded operation (N-1 redundancy). If one component fails, the remaining components must handle the full load
- Stress test during failover: when your primary fails, how much load can your standby handle? If it cannot handle peak load, you need larger standby or active-active
- Watch for resource exhaustion during failover: connection pools, thread pools, and CPU all spike during the transition period. Pre-warm standby instances to reduce startup time
- Review capacity quarterly. Traffic patterns change, and what was sufficient 6 months ago might be under-provisioned now. Growth + failover = capacity crisis
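Two of these calculations are simple enough to sketch: the yearly downtime budget for a target SLA, and the node count needed for N-1 redundancy. The function names and the load figures in the usage lines are hypothetical.

```javascript
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600

// Allowed downtime per year for an availability target given as a fraction.
function downtimeMinutesPerYear(availability) {
  return (1 - availability) * MINUTES_PER_YEAR;
}

// Nodes needed so peak load still fits after `toleratedFailures` nodes are lost.
function nodesRequired(peakLoad, capacityPerNode, toleratedFailures = 1) {
  return Math.ceil(peakLoad / capacityPerNode) + toleratedFailures;
}

console.log(downtimeMinutesPerYear(0.9999).toFixed(1)); // 52.6
console.log(downtimeMinutesPerYear(0.999).toFixed(1)); // 525.6

// Example: 9,000 requests/s at peak, 2,000 requests/s per node.
console.log(nodesRequired(9000, 2000)); // 6
```

In the example, five nodes are enough to carry peak load, so six are provisioned and the loss of any one node still leaves enough capacity.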
What to cover:
- PodDisruptionBudget (PDB): ensures a minimum number of pods remain available during voluntary disruptions (node drains, upgrades). Prevents too many replicas going down simultaneously
- PodAntiAffinity: scheduling constraint that spreads pods across failure domains (AZs, nodes). Ensures a single AZ or node failure does not take out all replicas
- StatefulSets: for stateful workloads that need stable network identity and persistent storage. Ordered deployment and scaling ensure pods come up in the right sequence
- PDB handles voluntary disruptions; anti-affinity handles AZ failures. Both are needed for a complete HA posture
- StatefulSets are more complex than Deployments — only use them when you genuinely need stable identity and ordered operations. Most workloads are fine with Deployments
- Combine PDB + anti-affinity + multi-AZ node pools: PDB prevents disruption cascades, anti-affinity spreads across AZs, multi-AZ pools ensure physical separation
What to cover:
- Active-active means all regions serve traffic and accept writes. Traffic is routed to the nearest region (geographic DNS, anycast). Failover is instant because traffic is already being served elsewhere
- Key challenge 1: conflict resolution. If the same record is modified in two regions simultaneously, how do you merge? Last-write-wins is simple but loses data; conflict resolution is complex and expensive
- Key challenge 2: cross-region replication latency. Writes in region A must be replicated to region B before being considered durable. If you need strong consistency, every write waits for cross-region acknowledgment (slow)
- Key challenge 3: topology. You need a way to route users to the nearest region (latency-based DNS, GeoDNS, anycast IP). You need health checks that span regions to detect when a region should be removed from the pool
- Most active-active systems accept eventual consistency across regions — writes are acknowledged locally and replicated asynchronously. Conflicts are resolved using timestamps, vector clocks, or application logic
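- For databases: multi-region stores take different approaches. Cassandra resolves concurrent writes with last-write-wins timestamps, while CockroachDB orders writes through consensus so conflicts do not arise in the first place, at the cost of cross-region write latency. For application state: externalize to a distributed store that handles replication (Redis Cluster, etcd)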
- Cost is 2-3x single region because all regions are hot. Only adopt this for genuine 99.99%+ requirements with the operational complexity budget to match
Further Reading
- AWS Architecture Blog: Building Fault-Tolerant Systems
- Google SRE Book - Chapter 6 on handling overload
- Netflix Chaos Engineering - Principles of chaos engineering
- Raft Consensus Algorithm Paper - Original Raft paper for leader election
- Microsoft Azure Well-Architected Framework - Reliability checklist
Quick Recap Checklist
- Availability measures the proportion of time a system remains operational and accessible
- High availability (99.9%+) typically requires redundancy at every layer of the stack
- The nines of availability: 99% = 3.65 days/year downtime, 99.9% = 8.76 hours, 99.99% = 52.6 minutes
- Active-passive redundancy keeps a warm standby ready for failover — simpler but slower to switch
- Active-active redundancy runs multiple replicas simultaneously — complex but handles load during failures
- Failover detection requires health checks at multiple levels (application, service, infrastructure)
- Geographic distribution reduces the blast radius of regional outages and natural disasters
- Global server load balancing (GSLB) routes traffic based on latency, health, and proximity
- Health checks must be granular — a partially degraded service should report degraded, not healthy
- Graceful degradation allows a system to remain partially available when full functionality is impossible
Conclusion
High availability is not a feature you add at the end — it is an architectural discipline that shapes every decision from the start. The patterns covered in this post provide a toolkit for building systems that remain operational when components fail.
Redundancy, failover, load balancing, and circuit breakers work together to prevent single points of failure from cascading into system-wide outages. SLA planning ensures you understand the availability commitments you are making and can design accordingly.
The key principles to remember:
- Design for failure — assume components will fail and build accordingly
- Keep dependencies minimal — every additional dependency lowers your composite SLA
- Test failover regularly — an untested failover plan is not a failover plan
- Monitor the right signals — latency, traffic, errors, and saturation
Building highly available systems requires balancing complexity against reliability. Not every system needs five nines of availability. Choose the availability target that matches your business context and budget, then design intentionally to meet that target.