High Availability Patterns: Build Reliable Distributed Systems

Learn essential high availability patterns including redundancy, failover, load balancing, and SLA calculations. Practical strategies for building systems that stay online.

published: March 22, 2026 reading time: 47 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

High availability is not a feature you bolt on at the end; it is an architectural discipline that shapes every decision from how you replicate data to how you detect failures. The core patterns are straightforward: redundancy with active-active or active-passive replicas, failover logic that switches to backups automatically or manually, load balancing to route traffic away from unhealthy instances, and circuit breakers that stop cascading failures when downstream services are already struggling. SLA planning forces honest conversations about what downtime actually costs. The composite availability of a chain of serial dependencies is the product of their individual availabilities, which means every additional component in the request path reduces overall availability.

High Availability Patterns: Building Reliable Distributed Systems

Availability measures how often a system is operational. High availability (HA) means the system stays up even when components fail. For critical systems, downtime costs money, reputation, and sometimes lives.

The CAP theorem tells us that during partitions, we choose between consistency and availability. High availability patterns help minimize the time spent in that trade-off by preventing failures and handling them gracefully when they occur.

This post covers practical patterns for building highly available systems.

Introduction

High availability means systems that remain operational even when components fail. We measure it as a percentage of uptime over time:

graph LR
    A[99%] --> B[87.6 hours downtime/year]
    A --> C[3.65 days downtime/year]
    D[99.9%] --> E[8.76 hours downtime/year]
    D --> F[525.6 minutes downtime/year]
    G[99.99%] --> H[52.6 minutes downtime/year]
    G --> I[8.6 seconds downtime/day]

The “nines” matter. Each additional nine represents a tenfold reduction in downtime. Whether that matters depends on your business context. A video streaming service can probably survive 4 hours of downtime per year. A hospital’s monitoring system cannot.

Redundancy

The first line of defense against failure is having backup components. Redundancy means duplicating critical components so that if one fails, another takes over.

Types of Redundancy

Active-active: Multiple replicas serve traffic simultaneously. If one fails, others continue without interruption.

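// Active-active: all servers handle requests
const servers = ["server1", "server2", "server3"];
let nextIndex = 0;

async function handleRequest(request) {
  // Rotate the starting server so every replica takes a share of the traffic
  const start = nextIndex;
  nextIndex = (nextIndex + 1) % servers.length;

  // If the chosen server fails, fall through to the remaining ones
  for (let i = 0; i < servers.length; i++) {
    const server = servers[(start + i) % servers.length];
    try {
      return await sendToServer(server, request);
    } catch (error) {
      continue; // Try next server
    }
  }
  throw new Error("All servers unavailable");
}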

Active-passive: One primary server handles traffic. Standby servers are ready but not processing requests until failover.

// Active-passive: standby takes over on primary failure
const primary = "primaryServer";
const standby = "standbyServer";

async function handleRequest(request) {
  try {
    return await sendToServer(primary, request);
  } catch (error) {
    // Failover to standby
    console.log("Primary failed, activating standby");
    return await sendToServer(standby, request);
  }
}

Active-active requires more complex conflict resolution but provides better resource utilization. Active-passive is simpler but wastes resources on idle standby capacity.

These trade-offs shape your redundancy strategy. Pick active-active when downtime is unacceptable and the budget exists to run multiple hot sites. Pick active-passive when some downtime is tolerable and simplicity matters more.

Failover Patterns

Failover is the process of switching from a failed component to a backup. Several patterns exist:

Automatic vs Manual Failover

The two modes differ in who initiates the switch and how long the decision takes.

Automatic failover detects failure through health checks and triggers the switch to standby without human involvement. The detection-to-promotion window is typically 30 seconds to 2 minutes for DNS-based approaches, or under 10 seconds for floating IP approaches. The risk is false positives. If your health check fires incorrectly (a transient network blip, or a slow disk that recovers moments later), you fail over unnecessarily. The old primary is still alive and still accepting writes, while the standby has taken over. Now you have two primaries accepting divergent writes. Split-brain like this takes longer to resolve than the original failure.

Manual failover requires an operator to evaluate the situation and trigger the switch. The delay is typically 5-15 minutes depending on human response time, on-call rotation, and whether the operator has enough context to make the call confidently. The upside is that no false positive can trigger a failover — a human explicitly decides that the primary is truly dead before promoting standby. This makes manual failover safer for complex failures where the system state is ambiguous.

The hybrid approach is what most production systems actually use: automatic detection combined with a staged response. Contained failures, such as a single instance that fails repeated health checks, trigger automatic failover. Major events, such as a full datacenter outage or a cascading failure affecting multiple components, require manual intervention. The staged response gives you speed for common failures and human judgment for rare, high-impact ones. Concretely: when a health check fails, wait 30 seconds and check again. If it fails again, send an alert and begin automatic failover for that specific component. If the failure is zone-level and affects more than 30% of your capacity, page the on-call engineer and let them decide.

Aspect	Automatic	Manual	Hybrid
Speed	30s-2min	5-15min	Fast for minor, human-paced for major
False positive risk	High if detection is loose	None	Low for staged triggers
Operator context	None	Full situation awareness	Partial for auto triggers
Complexity	Health check tuning, lock management	Runbook, on-call process	Both
Best for	Stateless services, low-complexity failures	Stateful systems, multi-component failures	Most production systems

The main tuning knob for automatic failover is your health check threshold. Set it too short and you get false positives. Set it too long and you prolong downtime unnecessarily. A reasonable starting point is 3 consecutive failures over 30 seconds for application-level health checks, with exponential backoff on retry to avoid hammering a struggling component.
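The threshold described above can be sketched as follows. This is a minimal illustration, not a production detector: `FailureDetector` and `backoffDelayMs` are hypothetical names, and the defaults (3 failures, 1 second base delay, 30 second cap) are assumptions chosen to match the starting point suggested here.

```javascript
// Hypothetical helper: treats a target as down only after N consecutive failed probes.
class FailureDetector {
  constructor(failureThreshold = 3) {
    this.failureThreshold = failureThreshold;
    this.consecutiveFailures = 0;
  }

  // Record one probe result; returns true once the target should be treated as down.
  record(probeSucceeded) {
    this.consecutiveFailures = probeSucceeded
      ? 0
      : this.consecutiveFailures + 1;
    return this.isDown();
  }

  isDown() {
    return this.consecutiveFailures >= this.failureThreshold;
  }
}

// Exponential backoff between retries, capped so probing never stops entirely.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const detector = new FailureDetector(3);
const probeResults = [false, true, false, false, false];
const verdicts = probeResults.map((ok) => detector.record(ok));
console.log(verdicts); // [false, false, false, false, true]
console.log(backoffDelayMs(3)); // 8000
```

A single success resets the counter, so an isolated blip never reaches the threshold; only an unbroken run of failures triggers failover.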

Health Checks

Failover needs accurate failure detection:

// Health check endpoint
app.get("/health", async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    cache: await checkCache(),
    disk: await checkDiskSpace(),
  };

  // Unhealthy if any critical check fails; disk is reported but not treated as critical
  const isHealthy = checks.database === "ok" && checks.cache === "ok";

  res.status(isHealthy ? 200 : 503).json({
    status: isHealthy ? "ok" : "unhealthy",
    timestamp: Date.now(),
    checks,
  });
});

async function checkDatabase() {
  try {
    await db.query("SELECT 1");
    return "ok";
  } catch (error) {
    return "failed";
  }
}

Health checks should verify actual functionality, not just process liveness. A process can be running but unable to serve requests.

Failover Time

Failover introduces latency. Components of failover time:

graph TD
    A[Failover Time] --> B[Failure Detection]
    A --> C[Decision to Failover]
    A --> D[DNS/Route Update]
    A --> E[New Instance Startup]
    A --> F[Health Check Propagation]

    B --> B1[Usually 5-30 seconds]
    D --> D1[Can be 30-60 seconds for DNS]

DNS-based failover is slow because DNS records are cached and TTLs can be 60 seconds or more. Floating IP or anycast approaches are faster.

Load Balancing

Load balancers distribute traffic across multiple servers. They also help with availability by routing around failed instances.

Load Balancing Algorithms

Round robin: Send each request to the next server in sequence. Simple but does not account for varying request complexity.

// Round robin implementation
let currentIndex = 0;
const servers = ["server1", "server2", "server3"];

function getNextServer() {
  const server = servers[currentIndex];
  currentIndex = (currentIndex + 1) % servers.length;
  return server;
}

Least connections: Send new requests to the server with the fewest active connections. Better for varying request durations.

IP hash: Route requests from the same client IP to the same server. Useful when session state is stored locally.

Weighted: Assign weights to servers based on capacity. More powerful servers receive more traffic.
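IP hash and weighted selection can be sketched in the same style as the round robin example. This is an illustration under assumptions: the hosts and weights are made up, the hash is a hand-rolled FNV-1a (any stable hash would do), and the weighted picker uses the "smooth weighted round robin" technique, which is one of several ways to implement weighting.

```javascript
// Illustrative pool; hosts and weights are assumptions.
const pool = [
  { host: "server1", weight: 5 },
  { host: "server2", weight: 1 },
  { host: "server3", weight: 1 },
];

// FNV-1a hash of a string, kept as an unsigned 32-bit integer.
function hashString(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// IP hash: the same client IP always maps to the same server
// (as long as the pool itself does not change).
function pickByIpHash(clientIp, servers) {
  return servers[hashString(clientIp) % servers.length].host;
}

// Weighted: smooth weighted round robin. Each pick adds every server's weight
// to its running score, selects the highest score, then subtracts the total weight.
function createWeightedPicker(servers) {
  const state = servers.map((s) => ({ ...s, current: 0 }));
  const totalWeight = state.reduce((sum, s) => sum + s.weight, 0);
  return function next() {
    let best = state[0];
    for (const s of state) {
      s.current += s.weight;
      if (s.current > best.current) best = s;
    }
    best.current -= totalWeight;
    return best.host;
  };
}

const nextServer = createWeightedPicker(pool);
const sequence = Array.from({ length: 7 }, () => nextServer());
console.log(sequence); // server1 appears 5 times, server2 and server3 once each
console.log(pickByIpHash("203.0.113.7", pool) === pickByIpHash("203.0.113.7", pool)); // true
```

Note that plain modulo hashing remaps most clients whenever a server is added or removed; consistent hashing reduces that churn at the cost of more bookkeeping.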

Load Balancer Health Checks

// Load balancer health monitoring
const servers = [
  { host: "server1", healthy: true, connections: 10 },
  { host: "server2", healthy: true, connections: 5 },
  { host: "server3", healthy: false, connections: 0 },
];

function routeRequest(request) {
  const healthy = servers.filter((s) => s.healthy);

  if (healthy.length === 0) {
    throw new Error("No healthy servers");
  }

  // Choose server with least connections
  const target = healthy.reduce((a, b) =>
    a.connections < b.connections ? a : b,
  );

  target.connections++;
  return sendRequest(target.host, request).finally(() => target.connections--);
}

SLA Calculations

Service Level Agreements define expected availability. Calculating SLA helps you understand what your system needs to deliver.

SLA Composition

End-to-end availability is lower than that of the weakest component, because a request succeeds only if every component in its path is up:

// Calculate combined SLA
function calculateSLA(slaValues) {
  // For independent components, combined availability
  // is the product of individual availabilities

  const combined = slaValues.reduce((acc, sla) => {
    // Convert percentage to decimal
    const availability = sla / 100;
    return acc * availability;
  }, 1);

  return (combined * 100).toFixed(2) + "%";
}

// Example: load balancer (99.99%) followed by three components
// the request must pass through in sequence (99.9% each)
// = 99.99% * 99.9% * 99.9% * 99.9%
const sla = calculateSLA([99.99, 99.9, 99.9, 99.9]);
console.log(sla); // Output: 99.69%

Adding more components in series reduces overall availability. This is why HA architectures prefer short dependency chains and add redundancy in parallel rather than in series.
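The product rule applies to components a request must pass through one after another. Redundant replicas, where any one can serve the request, combine differently: the tier fails only when every replica is down at once. A minimal sketch, assuming failures are independent (real systems only approximate this, since replicas often share power, network, or deploys); the function names are illustrative.

```javascript
// Availabilities here are fractions (0.999), not percentages.

// Serial chain: every component must be up.
function serialAvailability(parts) {
  return parts.reduce((acc, a) => acc * a, 1);
}

// Redundant replicas: the tier is down only if all replicas are down together.
// Assumes independent failures.
function parallelAvailability(replicaAvailability, replicas) {
  return 1 - (1 - replicaAvailability) ** replicas;
}

const appTier = parallelAvailability(0.999, 3); // three redundant app servers
const endToEnd = serialAvailability([0.9999, appTier]); // load balancer, then app tier

console.log((appTier * 100).toFixed(7) + "%"); // very close to 100%
console.log((endToEnd * 100).toFixed(4) + "%"); // roughly 99.99%, bounded by the load balancer
```

With redundancy in the app tier, the single load balancer becomes the limiting term, which is the usual argument for making the load balancer itself redundant.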

Planning for SLA

Target SLA	Max Downtime/Year	Max Downtime/Month
99%	3.65 days	7.30 hours
99.9%	8.76 hours	43.8 minutes
99.99%	52.6 minutes	4.38 minutes
99.999%	5.26 minutes	26.3 seconds

99.999% (“five nines”) is extremely difficult. It allows only 5 minutes of downtime per year. Most systems target 99.9% or 99.99%.
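The figures in the table follow directly from the SLA percentage. A small sketch of the arithmetic, assuming a 365-day year and treating a month as one twelfth of a year; `downtimeBudget` is an illustrative name.

```javascript
const HOURS_PER_YEAR = 365 * 24; // 8760, ignoring leap years

// Allowed downtime for a given SLA percentage.
function downtimeBudget(slaPercent) {
  const unavailableFraction = 1 - slaPercent / 100;
  const minutesPerYear = unavailableFraction * HOURS_PER_YEAR * 60;
  return {
    minutesPerYear,
    minutesPerMonth: minutesPerYear / 12,
  };
}

for (const sla of [99, 99.9, 99.99, 99.999]) {
  const { minutesPerYear, minutesPerMonth } = downtimeBudget(sla);
  console.log(
    `${sla}%: ${minutesPerYear.toFixed(1)} min/year, ${minutesPerMonth.toFixed(2)} min/month`,
  );
}
```

For 99.9% this gives 525.6 minutes per year (8.76 hours) and 43.8 minutes per month, matching the table.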

Circuit Breakers

Circuit breakers prevent cascading failures. When a downstream service is failing, the circuit breaker trips and fast-fails requests instead of overwhelming the dying service.

class CircuitBreaker {
  constructor(failureThreshold = 5, timeout = 60000) {
    this.failureThreshold = failureThreshold;
    this.timeout = timeout;
    this.failures = 0;
    this.state = "CLOSED"; // CLOSED, OPEN, HALF_OPEN
    this.nextAttempt = 0;
  }

  async call(fn) {
    if (this.state === "OPEN") {
      if (Date.now() > this.nextAttempt) {
        this.state = "HALF_OPEN";
      } else {
        throw new Error("Circuit breaker is OPEN");
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  onSuccess() {
    this.failures = 0;
    this.state = "CLOSED";
  }

  onFailure() {
    this.failures++;
    if (this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.nextAttempt = Date.now() + this.timeout;
    }
  }
}

The circuit breaker gives failing services time to recover instead of being buried by retry storms.

Graceful Degradation

When full functionality is impossible, provide partial functionality. This is graceful degradation.

// Graceful degradation example
async function getProductDetails(productId) {
  const fullDetails = {
    reviews: null,
    relatedProducts: null,
    priceHistory: null,
  };

  try {
    fullDetails.reviews = await getReviews(productId);
  } catch (error) {
    console.log("Reviews unavailable");
  }

  try {
    fullDetails.relatedProducts = await getRelated(productId);
  } catch (error) {
    console.log("Related products unavailable");
  }

  try {
    fullDetails.priceHistory = await getPriceHistory(productId);
  } catch (error) {
    console.log("Price history unavailable");
  }

  return fullDetails;
}

Users see available information immediately instead of facing a blank screen or error message.

Leader Election

Leader Election Overview

Leader election becomes necessary whenever you run active-passive redundancy. The standby server cannot simply take over whenever it feels like it — that would cause split-brain, where both nodes accept writes independently. Instead, a coordination mechanism decides which component holds the primary role at any moment, and that decision must be reached reliably even when network partitions occur.

The sections below explore three practical approaches to leader election, from a simple heartbeat-based scheme suitable for single-node setups to distributed consensus algorithms used by production-grade systems like etcd and CockroachDB. Each approach makes different trade-offs around consistency, availability, and implementation complexity.

Leader Election Deep Dive

When active-passive redundancy is used, the standby needs to know when to take over. Leader election handles this coordination — deciding which component is primary at any given moment.

Election Algorithms

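Single-node election: The primary writes a heartbeat to durable storage. If the heartbeat stops, the standby promotes itself. Simple, but vulnerable to split-brain: a primary that is paused or cut off from the storage can keep acting as leader after the standby has already promoted itself.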

// Heartbeat-based leader election
// `storage` is any client exposing acquireLock/renewLock with lease semantics
class LeaderElection {
  constructor(storage, leaseDuration = 15000) {
    this.storage = storage;
    this.leaseDuration = leaseDuration;
    this.isLeader = false;
    this.renewTimer = null;
  }

  async tryAcquireLeadership(lockKey) {
    try {
      const result = await this.storage.acquireLock(
        lockKey,
        this.leaseDuration,
      );
      this.isLeader = result;
      if (this.isLeader) {
        this.startHeartbeat(lockKey);
      }
      return this.isLeader;
    } catch (error) {
      this.isLeader = false;
      return false;
    }
  }

  startHeartbeat(lockKey) {
    // Renew well before the lease expires (every third of the lease duration)
    this.renewTimer = setInterval(async () => {
      try {
        await this.storage.renewLock(lockKey, this.leaseDuration);
      } catch (error) {
        console.log("Leadership lost:", error.message);
        this.isLeader = false;
        clearInterval(this.renewTimer);
      }
    }, this.leaseDuration / 3);
  }
}

Raft-based election: Distributed consensus algorithm where nodes vote on leadership. A candidate needs majority votes to become leader. Used by etcd, CockroachDB.

The algorithm cycles through states: when a leader stops sending heartbeats, followers become candidates and request votes from peers. If a candidate gets majority votes, it becomes the new leader. If not, the cycle repeats.

graph LR
    A[Start] --> B{Current leader alive?}
    B -->|Yes| C[Continue serving]
    B -->|No| D[Node becomes Candidate]
    D --> E[Request votes from peers]
    E --> F{Received majority?}
    F -->|Yes| G[Become Leader]
    F -->|No| H{Another won election?}
    H -->|Yes| C
    H -->|No| D
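The "received majority" decision in the diagram comes down to simple arithmetic. A minimal sketch of that arithmetic only, not of Raft itself (terms, logs, and election timeouts are left out); the function names are illustrative.

```javascript
// Votes needed to win an election in a cluster of the given size.
function quorumSize(clusterSize) {
  return Math.floor(clusterSize / 2) + 1;
}

// How many nodes can be unreachable while a majority still exists.
function toleratedFailures(clusterSize) {
  return clusterSize - quorumSize(clusterSize);
}

// A candidate votes for itself, then adds the votes granted by its peers.
function wonElection(peerVotesGranted, clusterSize) {
  return 1 + peerVotesGranted >= quorumSize(clusterSize);
}

for (const n of [3, 4, 5]) {
  console.log(`${n} nodes: quorum ${quorumSize(n)}, tolerates ${toleratedFailures(n)} failure(s)`);
}
// 3 nodes: quorum 2, tolerates 1 failure(s)
// 4 nodes: quorum 3, tolerates 1 failure(s)
// 5 nodes: quorum 3, tolerates 2 failure(s)

console.log(wonElection(1, 3)); // true: own vote plus one peer
console.log(wonElection(1, 5)); // false: two votes out of five
```

A 4-node cluster tolerates no more failures than a 3-node one, which is why consensus clusters are usually sized with an odd number of nodes.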

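ZooKeeper (ZAB protocol): ZooKeeper uses the ZAB (ZooKeeper Atomic Broadcast) protocol internally to keep its own servers consistent. Applications build leader election on top of it: each candidate creates an ephemeral sequential znode, and the candidate whose znode has the lowest sequence number becomes leader.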

The insight is that ZooKeeper provides total ordering across all nodes, so every node agrees on who should be leader without needing to talk to each other directly.

// ZooKeeper-based leader election (pseudocode)
async function runForLeadership(zkClient, nodePath) {
  // Create ephemeral sequential node
  const myNode = await zkClient.create(nodePath + "/leader-", {
    ephemeral: true,
    sequential: true,
  });
  const myName = myNode.split("/").pop();
  await checkLeadership(zkClient, nodePath, myNode, myName);
}

async function checkLeadership(zkClient, nodePath, myNode, myName) {
  const children = await zkClient.getChildren(nodePath);
  const sorted = children.sort();

  if (sorted[0] === myName) {
    // I'm the leader
    await zkClient.setData(nodePath, JSON.stringify({ leader: myNode }));
  } else {
    // Watch the previous node; when it disappears, re-check leadership
    const previousNode = sorted[sorted.indexOf(myName) - 1];
    await zkClient.exists(nodePath + "/" + previousNode, () =>
      checkLeadership(zkClient, nodePath, myNode, myName),
    );
  }
}

Split-Brain Prevention

Split-brain occurs when two nodes both believe they’re primary. Prevention strategies:

Strategy	How It Works	Trade-off
Minority shutdown	If network partition occurs, minority side loses leadership	Some requests fail during partition
Fencing tokens	Leader must present incrementing token to storage	Storage must support fencing
Majority quorum	Leadership requires majority vote	No leader can be elected while no partition holds a majority
Red zone	Partitioned leaders refuse writes	Some capacity wasted during partition
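The fencing token row is the least intuitive of these, so here is a minimal in-memory sketch. `LockService` and `FencedStorage` are hypothetical stand-ins: in a real system the token would come from the lock or consensus service, and the check would be enforced by the storage layer itself.

```javascript
// Hypothetical lock service: every successful acquisition returns a larger token.
class LockService {
  constructor() {
    this.nextToken = 0;
  }
  acquire() {
    return ++this.nextToken;
  }
}

// Hypothetical store that rejects writes carrying a token older than one it has seen.
class FencedStorage {
  constructor() {
    this.highestToken = 0;
    this.data = new Map();
  }

  write(key, value, fencingToken) {
    if (fencingToken < this.highestToken) {
      throw new Error(
        `Stale fencing token ${fencingToken}; current is ${this.highestToken}`,
      );
    }
    this.highestToken = fencingToken;
    this.data.set(key, value);
  }
}

const locks = new LockService();
const storage = new FencedStorage();

const oldLeaderToken = locks.acquire(); // 1
storage.write("config", "from old leader", oldLeaderToken);

// The old leader stalls, its lease expires, and a new leader is elected.
const newLeaderToken = locks.acquire(); // 2
storage.write("config", "from new leader", newLeaderToken);

// The old leader wakes up and tries to write with its outdated token.
let rejected = false;
try {
  storage.write("config", "late write from old leader", oldLeaderToken);
} catch (error) {
  rejected = true;
}
console.log(rejected, storage.data.get("config")); // true from new leader
```

The old leader never needs to learn that it was replaced; the storage refuses its write regardless, which is what makes fencing robust against paused or partitioned nodes.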

Election Checklist

Pick an election algorithm that matches your consistency requirements
Always use fencing tokens when writing to shared storage
Test network partition scenarios before production breaks them for you
Set up alerts for election events so you know when leadership changes

Database High Availability

Database High Availability Overview

Databases present a distinct availability challenge compared to stateless services: they own persistent state. A web server that crashes can be replaced with a fresh instance in seconds. A database that loses its primary requires careful handling to ensure no data is lost and the replica promoted is actually caught up with the last committed write.

The sections below break down the replication topologies that address this problem, the different failover patterns available, and the specific failure modes unique to databases. Understanding these patterns is essential before designing a database HA architecture, because the consequences of getting it wrong — data loss, corruption, extended recovery time — are harder to recover from than a crashed application server.

Database HA Patterns

Databases have their own availability challenges because they own state. The replication topology you choose affects both read throughput and how failures play out.

Replication Topologies

Primary-Replica (async): Writes go to the primary, replicas copy asynchronously. Simple to set up but replicas can lag behind during heavy write periods.

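// Replica lag monitoring
class ReplicaMonitor {
  // primary and replicas are assumed to expose getWALPosition() returning a byte offset;
  // walBytesPerSecond is the typical WAL write rate, used to estimate lag in seconds
  constructor(primary, walBytesPerSecond) {
    this.primary = primary;
    this.walBytesPerSecond = walBytesPerSecond;
  }

  async checkReplicationHealth(replicas) {
    // Read the primary position once so all replicas are compared to the same point
    const primaryPosition = await this.primary.getWALPosition();
    const results = await Promise.all(
      replicas.map(async (replica) => {
        const replicaPosition = await replica.getWALPosition();
        // Clamp at zero: a replica may advance past the sampled primary position
        const lagBytes = Math.max(0, primaryPosition - replicaPosition);
        return {
          replica: replica.name,
          lagBytes,
          lagSeconds: lagBytes / this.walBytesPerSecond,
        };
      }),
    );

    const lagging = results.filter((r) => r.lagSeconds > 30);
    if (lagging.length > 0) {
      console.warn("Replicas lagging:", lagging);
    }
    return results;
  }
}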

Synchronous replication (semi-sync): Write is acknowledged only after at least one replica confirms. No acknowledged write is lost on primary failure, as long as the confirming replica survives, but every write pays the latency of the extra round trip.

Multi-primary (active-active): All nodes accept writes. Conflict resolution becomes your problem. Makes sense when writes need to happen in any region.

Database Failover Patterns

Pattern	Description	RTO	RPO
Automatic failover	Primary fails, replica promoted automatically	30-60s	Potential data loss if async
Planned failover	Manual switch for maintenance	60-120s	Zero (sync rep)
Multi-region failover	Cross-datacenter failover	2-5min	Varies by replication

sequenceDiagram
    participant Primary as Primary DB
    participant Replica as Read Replica
    participant Arbiter as Arbiter/Quorum
    participant App as Application

    Primary->>Replica: Async WAL replication
    App->>Primary: Write
    Primary->>App: Acknowledged

    Primary->>Primary: FAILURE
    Replica->>Arbiter: "Primary down"
    Arbiter->>Replica: "Promote yourself"
    Replica->>App: "New primary ready"

    Note over App: Configure new connection string
    App->>Replica: Write (new primary)

Database HA Checklist

Pick replication mode based on how much data loss you can tolerate (RPO)
Monitor replica lag continuously with alerts — do not assume replicas are caught up
Use connection pooling that reconnects automatically on primary failover
Test failover with a real database, not just in theory

Rate Limiting for High Availability

Rate Limiting Overview

Without rate limiting, a system is one traffic spike away from cascade failure. When load increases beyond capacity, latency climbs, clients retry, the retry traffic makes load worse, and eventually everything grinds to a halt. Rate limiting breaks this spiral by placing a hard ceiling on resource consumption — when that ceiling is reached, excess requests get a clear rejection (HTTP 429) instead of joining a queue that will never drain.

The sections below cover the standard algorithms used for rate limiting (token bucket and sliding window), how to apply them at multiple levels of a distributed system for defense-in-depth, and the specific failure scenarios where rate limiting is the first line of protection. Each layer of rate limiting addresses a different threat vector, from global DDoS attacks to a single misbehaving client monopolizing a service’s connections.

Rate Limiting for HA

Rate limiting protects systems from overload. When traffic spikes or a DoS attack occurs, rate limiting ensures some users get service rather than all users getting no service.

Rate Limiting Algorithms

Token bucket: Requests consume tokens that refill at a fixed rate. Good for smooth, predictable limiting with burst allowance.

class TokenBucketRateLimiter {
  constructor(rate, capacity) {
    this.rate = rate; // tokens per second
    this.capacity = capacity;
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  allowRequest() {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }

  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.rate);
    this.lastRefill = now;
  }
}

Sliding window: Requests counted in a rolling time window. More accurate than fixed windows but uses more memory to track timestamps.

class SlidingWindowRateLimiter {
  constructor(maxRequests, windowMs) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.requests = [];
  }

  allowRequest() {
    const now = Date.now();
    const windowStart = now - this.windowMs;

    // Remove old requests
    this.requests = this.requests.filter((t) => t > windowStart);

    if (this.requests.length < this.maxRequests) {
      this.requests.push(now);
      return true;
    }
    return false;
  }
}

Hierarchical Rate Limiting

For distributed systems, use multiple levels:

graph TD
    A[Request] --> B{Global Rate Limit}
    B -->|Allowed| C{Service Rate Limit}
    B -->|Rejected| G[429 Too Many Requests]
    C -->|Allowed| D{Instance Rate Limit}
    C -->|Rejected| H[429 - Service]
    D -->|Allowed| E[Process Request]
    D -->|Rejected| I[429 - Instance]

Level	Scope	Purpose
Global	All services	Prevent DDoS
Service	Per microservice	Protect downstream
Instance	Per server/container	Prevent resource exhaustion
User	Per user/account	Fairness
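The layered check in the diagram can be sketched as a loop over limiters ordered from widest scope to narrowest. This is a simplified illustration: the `TokenBucket` here is a compact variant with an injectable clock so the output is deterministic, and the rates and capacities are arbitrary assumptions.

```javascript
// Compact token bucket with an injectable clock (milliseconds).
class TokenBucket {
  constructor(ratePerSecond, capacity, now = Date.now) {
    this.rate = ratePerSecond;
    this.capacity = capacity;
    this.tokens = capacity;
    this.now = now;
    this.lastRefill = now();
  }

  allowRequest() {
    const t = this.now();
    const refill = ((t - this.lastRefill) / 1000) * this.rate;
    this.tokens = Math.min(this.capacity, this.tokens + refill);
    this.lastRefill = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Check limits from widest scope to narrowest and report which level rejected.
function checkLimits(levels) {
  for (const { name, limiter } of levels) {
    if (!limiter.allowRequest()) {
      return { allowed: false, status: 429, rejectedBy: name };
    }
  }
  return { allowed: true, status: 200, rejectedBy: null };
}

const frozenClock = () => 0; // no refill during the demo
const levels = [
  { name: "global", limiter: new TokenBucket(1000, 100, frozenClock) },
  { name: "service", limiter: new TokenBucket(100, 10, frozenClock) },
  { name: "instance", limiter: new TokenBucket(10, 2, frozenClock) },
];

const results = [1, 2, 3].map(() => checkLimits(levels));
console.log(results.map((r) => r.rejectedBy)); // [null, null, "instance"]
```

One simplification to be aware of: a request rejected at a narrow level has already consumed tokens at the wider levels. Whether to refund those tokens is a design choice this sketch does not address.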

Rate Limiting Checklist

Apply rate limiting at multiple levels: global, service, and instance
Return 429 with Retry-After header so clients know when to retry
Use token bucket for smooth rate limiting; sliding window when accuracy matters
Watch rate limit rejections — they often signal an attack or bug before anything else fails
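For the Retry-After item, the delay can be derived from the limiter's own state. A minimal sketch for a token bucket, assuming the caller can read the current token count and refill rate; `retryAfterSeconds` and `buildRejection` are illustrative names, and wiring the result into a web framework is left out.

```javascript
// Whole seconds until a bucket holding `tokens`, refilling at `ratePerSecond`,
// has at least one token again. Retry-After accepts a non-negative integer of seconds.
function retryAfterSeconds(tokens, ratePerSecond) {
  if (tokens >= 1) return 0;
  return Math.ceil((1 - tokens) / ratePerSecond);
}

// Build the pieces of a 429 response as plain data.
function buildRejection(tokens, ratePerSecond) {
  return {
    status: 429,
    headers: { "Retry-After": String(retryAfterSeconds(tokens, ratePerSecond)) },
    body: { error: "Too Many Requests" },
  };
}

console.log(buildRejection(0.25, 0.5).headers); // { 'Retry-After': '2' }
```

Rounding up matters: telling a client to retry slightly too early just produces another rejection.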

Health Check Design

Health Check Overview

A health check sounds simple in theory — just report whether the service is healthy. In practice, the design of your health checks determines whether your load balancer routes traffic to a dying instance, whether your failover triggers when it should not, and whether operators can diagnose failures from the check output alone. Getting this wrong in either direction is costly: too lenient and you route traffic to broken instances; too strict and you trigger unnecessary failovers that create their own turbulence.

The sections below cover the distinction between liveness and readiness probes, the structural patterns that make health checks debuggable in production, and the specific best practices that prevent the most common health check failure modes. These patterns apply whether you are running on Kubernetes with native probes or implementing health endpoints in your own application code.

Health Check Design Patterns

Health checks determine whether an instance is fit to serve traffic. Bad health checks cause unnecessary failovers or worse — they mask failures until users experience them.

Deep Health Checks vs Liveness

// Liveness check - is the process alive?
app.get("/health/liveness", (req, res) => {
  res.status(200).json({ status: "ok" });
});

// Readiness check - can the service handle requests?
app.get("/health/readiness", async (req, res) => {
  const checks = {
    dependencies: await checkDependencies(),
    capacity: await checkCapacity(),
    configured: await checkConfiguration(),
  };

  const healthy = Object.values(checks).every((c) => c.ok);
  res.status(healthy ? 200 : 503).json({
    status: healthy ? "ready" : "not_ready",
    checks,
  });
});

async function checkDependencies() {
  try {
    await db.query("SELECT 1");
    await redis.ping();
    return { ok: true };
  } catch (error) {
    return { ok: false, error: error.message };
  }
}

Health Check Best Practices

Practice	Why It Matters
Check dependencies, not just process	Process can be running but unable to serve
Use timeout on all checks	Prevent cascade from slow dependency
Avoid caches for health checks	Cache staleness masks actual dependency health
Return specific failure reasons	Faster debugging when health check fails
Separate readiness from liveness	Liveness = restart needed; readiness = traffic pause
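The timeout practice in the table is the one most often skipped, so here is a minimal sketch of it. `withTimeout` and `runCheck` are hypothetical helpers, the 50 ms limit is only for the demo, and the two check functions are stand-ins for real dependency probes.

```javascript
// Reject if the wrapped promise does not settle within timeoutMs.
function withTimeout(promise, timeoutMs, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run one named check and turn any failure or timeout into a structured result.
async function runCheck(name, checkFn, timeoutMs = 2000) {
  try {
    await withTimeout(checkFn(), timeoutMs, name);
    return { name, ok: true };
  } catch (error) {
    return { name, ok: false, error: error.message };
  }
}

// Stand-ins for real probes.
const fastCheck = () => Promise.resolve();
const hungCheck = () => new Promise(() => {}); // never settles

async function demo() {
  const results = await Promise.all([
    runCheck("database", fastCheck, 50),
    runCheck("cache", hungCheck, 50),
  ]);
  console.log(results);
  return results;
}

demo();
```

Running the checks in parallel means the endpoint responds within roughly one timeout period even when several dependencies hang, and the error message names the dependency that stalled.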

Health Check Checklist

Have separate liveness and readiness endpoints
Readiness should check dependencies, not just whether the process is alive
Timeouts on all checks prevent cascade failures from slow dependencies
Include enough detail in health check responses so debugging is faster

When to Use / When Not to Use

Scenario	Recommendation
Critical systems (healthcare, finance)	High availability mandatory
SLA of 99.99%+	Active-active with automatic failover
Cost-sensitive applications	Active-passive with manual failover
Stateless services	Load balancer with health checks
Stateful services with persistence	Database replication with failover

When TO Use Active-Active

Multiple geographic regions: Users in different regions hit their nearest datacenter
High traffic volumes: Single active cannot handle load even with vertical scaling
Near-zero-downtime requirements: Surviving replicas keep serving when one fails, so there is no promotion step to wait for
Read-heavy workloads: All replicas serve reads, dramatically increasing throughput

When NOT to Use Active-Active

Complex conflict resolution: When data has mutable state that is hard to partition
Strong consistency requirements: Synchronizing writes across active nodes is expensive
Limited budget: Running multiple active datacenters costs significantly more
Simple applications: The complexity cost outweighs availability benefits

Active-Active vs Active-Passive Decision Matrix

Criteria	Active-Active	Active-Passive
Cost	2-3x (all sites active)	1.5-2x (standby idle)
Complexity	High (conflict resolution, sync)	Medium (failover logic)
Failover Speed	Instant (traffic split)	30s-5min (promotion)
Write Throughput	Higher (all nodes write)	Limited to primary
Data Consistency	Complex (multi-master sync)	Simple (async replication)
Geographic Diversity	Excellent (multi-region)	Good (standby in other DC)
Resource Utilization	High (all nodes busy)	Low (standby idle)
Rollback Complexity	Complex (already processing)	Simple (demote standby)

Multi-Region Deployment Patterns

When deploying across multiple geographic regions, consider these patterns:

Pattern	Description	Use Case
Primary-Secondary	One region accepts writes, others replicate asynchronously	Read-heavy, geo-distributed users
Primary-Primary	All regions accept writes, conflict resolution required	Write-heavy, globally distributed users
CQRS + Global Traffic Routing	Separate read/write paths, route users to nearest region	Complex domains, maximum performance
Stateless Microservices + Regional Databases	Compute is stateless and globally distributed	Cloud-native, Kubernetes-based

Cross-region replication considerations:

// Multi-region write latency expectations
const REPLICATION_LATENCY = {
  "us-east-1 to eu-west-1": "~100-150ms RTT",
  "us-east-1 to ap-southeast-1": "~200-250ms RTT",
  "eu-west-1 to ap-southeast-1": "~150-200ms RTT",
};

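// Consistency vs latency tradeoff in multi-region
// writeToRegion and backgroundSyncToOtherRegions are assumed helpers
async function writeMultiRegion(key, value, options = {}) {
  const {
    consistencyLevel = "QUORUM",
    regions = ["us-east-1", "eu-west-1"],
    timeoutMs = 5000,
  } = options;

  if (consistencyLevel === "ALL_REGIONS") {
    // Strongest consistency, highest latency
    return Promise.all(
      regions.map((region) => writeToRegion(region, key, value)),
    );
  } else if (consistencyLevel === "QUORUM") {
    // Majority of regions must acknowledge before the timeout
    const needed = Math.floor(regions.length / 2) + 1;
    return new Promise((resolve, reject) => {
      const acks = [];
      let failures = 0;
      const timer = setTimeout(() => reject(new Error("Quorum write timed out")), timeoutMs);
      for (const region of regions) {
        writeToRegion(region, key, value).then(
          (ack) => {
            acks.push(ack);
            if (acks.length === needed) {
              clearTimeout(timer);
              resolve(acks);
            }
          },
          () => {
            // Give up once too many regions have failed to reach a majority
            if (++failures > regions.length - needed) {
              clearTimeout(timer);
              reject(new Error("Quorum not reachable"));
            }
          },
        );
      }
    });
  } else {
    // Local region only, async replicate
    await writeToRegion("local", key, value);
    backgroundSyncToOtherRegions(regions, key, value);
    return { success: true, replicated: false };
  }
}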

Stateful vs Stateless Failover Differences

Aspect	Stateless Services	Stateful Services
Failover Complexity	Low (just restart somewhere)	High (must preserve state)
State Recovery	None needed	Must recover from replica or WAL
Failover Time	5-30 seconds (container restart)	30s-5min (state recovery)
Data Loss Risk	None	Possible (unreplicated writes)
Scaling	Horizontal (easy)	Requires consistent hashing/sharding
Session Affinity	Not needed	Often required (or externalize state)

Stateful failover sequence:

sequenceDiagram
    participant Primary as Primary DB
    participant Standby as Standby DB
    participant LB as Load Balancer
    participant App as Application

    Primary->>Standby: Replicate WAL continuously
    Note over Primary,Standby: Async replication with lag

    Primary->>LB: Health check OK
    App->>LB: Write request
    LB->>Primary: Route to primary

    Primary->>Primary: CRASH - stops responding
    Standby->>Primary: Health check fails
    Standby->>LB: "I am available"
    LB->>Standby: Promote to primary

    Note over App: Brief write failure during failover
    App->>LB: Retry write
    LB->>Standby: Route to new primary
    Standby->>App: Write acknowledged

    Note over Primary,Standby: Re-sync when old primary recovers

Trade-off Analysis

HA Architecture Decisions

Decision	Pros	Cons	Best For
Active-Active vs Active-Passive	Active-Active: instant failover, better utilization	Active-Active: complex, expensive	Active-Active: critical systems, high traffic
Synchronous vs Async Replication	Sync: zero RPO	Sync: higher latency	Sync: financial systems; Async: web apps
Automatic vs Manual Failover	Auto: faster	Auto: risk of false positives	Auto: non-critical; Manual: critical
Multi-Region vs Single-Region	Multi: full DC failure survived	Multi: higher latency, more complex	Multi: 99.99%+ SLA required
Shared Storage vs Shared Nothing	Shared: simple	Shared: single point of failure	Shared nothing: maximum availability

Kubernetes HA Patterns

For Kubernetes-based deployments:

Pattern	Description	Trade-off
PodDisruptionBudget (PDB)	Ensures minimum pods available during disruptions	May delay node drains
PodAntiAffinity	Spreads pods across nodes/AZs	Requires enough nodes
multi-AZ vs multi-region	AZ failure = local; Region failure = full DR	AZ simpler, Region safer
StatefulSets	Ordered deployment, persistent storage	More complex than Deployments

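# Example: HA Kubernetes deployment with anti-affinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server # must match the selector and the rules below
    spec:
      containers:
        - name: api-server
          image: registry.example.com/api-server:1.0.0 # placeholder image
      affinity:
        podAntiAffinity:
          # Hard rule: at most one replica per zone, so 3 replicas need 3 zones
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - api-server
              topologyKey: topology.kubernetes.io/zone
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-server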

# Example: PodDisruptionBudget for minimum availability
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2 # Keep at least 2 pods running during voluntary disruptions
  selector:
    matchLabels:
      app: api-server

Production Failure Scenarios

Failure Scenario	Impact	Mitigation
Load balancer instance failure	No traffic reaches the backends	Run multiple LBs; health checks remove failed instances
Primary database failure	Write operations fail	Automatic failover to standby with health monitoring
DNS cache poisoning	Traffic routed to wrong/invalid IPs	Short TTLs; DNSSEC; use floating IP instead
Cascade failure	One component failure triggers others	Circuit breakers; bulkhead pattern; graceful degradation
Datacenter power failure	Entire site goes down	Multi-datacenter active-active; UPS and generators
Network partition between DCs	Split-brain risk	Quorum-based decisions; automatic failover lock
Disk full on primary	Database writes fail	Monitoring; auto-scaling storage; archive old data

Real-world Failure Scenarios

Netflix 2012 Outage Incident

What happened: Netflix, a company famously built on availability-first principles, experienced a major outage in 2012. Despite the company’s high-availability design, an Elastic Load Balancer configuration change made during a routine deployment caused approximately 5% of streaming sessions to fail for 18 hours.

Root cause: A configuration change deployed during peak traffic disabled health checks on three availability zones simultaneously. Netflix’s availability patterns routed traffic away from unhealthy zones, but the cumulative load on the remaining zones exceeded their capacity threshold.

Impact: An estimated 3 million Netflix users were unable to stream content during the outage. Customer support tickets increased by 400%.

Lesson learned: Availability patterns reduce the blast radius of component failures, but they don’t eliminate single points of failure in the control plane. Load balancer and deployment configuration deserve as much redundancy testing as data layer replication.

Heroku 2012 Multi-AZ Outage

What happened: In 2012, Heroku experienced a significant outage when the underlying AWS infrastructure in the us-east-1 region suffered a multi-availability-zone failure. Heroku’s platform assumed that at least one AZ would always be healthy, a reasonable assumption that proved false during this incident.

Root cause: Heroku’s routing layer had a hard dependency on AWS Route 53 for DNS failover. When AWS’s DNS infrastructure in us-east-1 experienced elevated latency, the routing layer’s health checking mechanism failed to detect healthy alternatives in other regions quickly enough.

Impact: Over 50,000 Heroku applications experienced complete unavailability for approximately 4 hours. Many applications had no means of manual failover since the platform managed the infrastructure entirely.

Lesson learned: Availability patterns are only as reliable as their dependencies. The failure of an upstream dependency can cascade through availability mechanisms designed to prevent cascading failures.

Cloudflare 2019 WAF Outage

What happened: On July 2, 2019, Cloudflare experienced a global outage affecting millions of websites. The incident took down Cloudflare’s core proxying, CDN, and WAF functionality, causing 502 errors for customers worldwide.

Root cause: A newly deployed WAF managed rule contained a poorly written regular expression, intended to detect XSS attacks, that triggered catastrophic backtracking. The regex contained patterns like .*(?:.*=.*) which, on certain input strings, made the number of matching steps grow much faster than the input length. For example, matching x=xxxxxxxxxxxxxxxxxxxx (20 x’s) required 555 steps. This super-linear growth exhausted CPU on all HTTP/HTTPS serving cores globally.

The rule was deployed at 13:42 UTC via Cloudflare’s Quicksilver system, which propagates configuration changes globally in seconds. Within minutes, the backtracking exhausted CPU across their entire HTTP/HTTPS fleet.

Impact: Approximately 20% of global traffic was affected. Major websites including financial institutions, news organisations, and communication platforms were unreachable for up to 30 minutes. The fix (disabling the rule) was not deployed until 14:07 UTC, 25 minutes later, partly because the operations team had difficulty reaching internal control systems that sat behind their own degraded edge.

Lesson learned: Even software-layer rule deployments can create catastrophic failure modes at global scale. The speed of configuration propagation (Quicksilver pushed to all PoPs in seconds) meant the damage was instant and worldwide. This underscores that availability at the application layer is bounded by the safety of your configuration deployment pipeline — and that rules affecting regex evaluation deserve the same rigorous testing and gradual rollout as any other code change.
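
The "gradual rollout" idea from the lesson above can be sketched as a loop that widens exposure only while health checks pass. This is an illustration under stated assumptions, not a description of Cloudflare's tooling: `stagedRollout`, `applyToFraction`, `isHealthy`, and `rollback` are hypothetical names for hooks the caller would supply.

```javascript
// Roll a configuration change out in stages, checking health after each one.
// `applyToFraction(fraction)` pushes the change to that share of the fleet,
// `isHealthy()` reports whether error rates and CPU still look normal, and
// `rollback()` reverts the change. All three are caller-supplied (assumed).
async function stagedRollout(stages, applyToFraction, isHealthy, rollback) {
  for (const fraction of stages) {
    await applyToFraction(fraction);
    if (!(await isHealthy())) {
      await rollback();
      return { completed: false, stoppedAt: fraction };
    }
  }
  return { completed: true, stoppedAt: null };
}
```

A call such as `stagedRollout([0.01, 0.1, 0.5, 1], ...)` would stop at the 1% stage if the change saturates CPU, which limits the blast radius to a small share of traffic.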

Common Pitfalls / Anti-Patterns

These patterns represent the most common mistakes teams make when designing for high availability. Each is avoidable with proper planning.

Pitfall 1: Designing for Five Nines Without Budget

Problem: Targeting 99.999% availability requires significant investment in redundancy, monitoring, and processes. Teams targeting it without the budget to support it often miss the target.

Solution: Start with 99.9% (three nines), measure what actually causes downtime, and improve incrementally. As a rough rule of thumb, each additional nine costs about 10x more than the previous one.

Pitfall 2: Ignoring Dependency SLAs

Problem: Your service has 99.99% uptime but depends on a service with 99% uptime. Your actual availability is 99.99% × 99% = 98.99%.

Solution: Map all dependencies and calculate composite SLA. If a dependency is weak, add redundancy for that specific dependency or accept the lower composite SLA.
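
The arithmetic above, and the effect of adding redundancy to the weak dependency, can be checked with a few lines of code. This is a minimal sketch: the function names are illustrative, and the redundancy formula assumes the copies fail independently, which real deployments only approximate.

```javascript
// Availability of components in series: every one must be up.
function serialAvailability(availabilities) {
  return availabilities.reduce((product, a) => product * a, 1);
}

// Availability of redundant copies in parallel: at least one must be up.
// Assumes failures are independent.
function redundantAvailability(availability, copies) {
  return 1 - (1 - availability) ** copies;
}

// The example from the text: a 99.99% service depending on a 99% service.
const naive = serialAvailability([0.9999, 0.99]);
// The same system with the weak dependency duplicated.
const hardened = serialAvailability([0.9999, redundantAvailability(0.99, 2)]);

console.log((naive * 100).toFixed(2) + "%"); // 98.99%
console.log((hardened * 100).toFixed(2) + "%"); // 99.98%
```

Duplicating only the 99% dependency brings the composite figure close to the service's own 99.99%, which is why targeted redundancy is usually cheaper than hardening everything.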

Pitfall 3: Automatic Failover Without Testing

Problem: Automatic failover sounds great until it triggers unnecessarily due to a false positive, causing a cascade of problems.

Solution: Test failover regularly (at least quarterly). Use manual failover for initial deployments until you have confidence in detection accuracy.
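
One common way to reduce false positives is to require several consecutive failed health checks before triggering failover. The sketch below illustrates that idea; `FailureDetector` and `failureThreshold` are hypothetical names, and the threshold of 3 is an arbitrary starting point rather than a recommendation.

```javascript
// Tracks health-check results and only reports a failure after several
// consecutive failed checks, so one dropped probe does not trigger failover.
class FailureDetector {
  constructor(failureThreshold = 3) {
    this.failureThreshold = failureThreshold;
    this.consecutiveFailures = 0;
  }

  // Record one health-check result; returns true when failover should start.
  record(healthy) {
    this.consecutiveFailures = healthy ? 0 : this.consecutiveFailures + 1;
    return this.consecutiveFailures >= this.failureThreshold;
  }
}
```

The trade-off is detection time: with checks every 5 seconds and a threshold of 3, a real failure takes about 15 seconds to confirm.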

Pitfall 4: Forgetting About Session State

Problem: When failover happens, in-memory session state is lost. Users are logged out and shopping carts are emptied.

Solution: Externalize session state to Redis or similar. Use stateless request processing wherever possible. For sessions that must be local, use session affinity (with awareness of failover).

Chaos Engineering

Game Days for HA Patterns

Run these chaos experiments to validate availability patterns:

Chaos Experiment	What It Validates	Success Criteria
Kill random app server	Load balancer routes around failure	< 1% error rate
Terminate primary DB	Failover completes successfully	< 60s downtime
Network partition (single DC)	Quorum maintained	Reads continue, writes queued
Fill disk on replica	Monitoring detects issue	Alert < 5 min, graceful degradation
Restart all app servers simultaneously	Traffic spikes handled	Rate limiting prevents cascade

Pre-Game Day Checklist

- [ ] Define steady-state hypothesis
- [ ] Establish baseline metrics
- [ ] Notify stakeholders of experiment window
- [ ] Verify recent backup is available
- [ ] Confirm rollback plan
- [ ] Have on-call ready to intervene
- [ ] Document expected outcome
- [ ] Establish abort criteria

Security Checklist

- [ ] Load balancer accessible only via TLS (HTTPS)
- [ ] Health check endpoints authenticated
- [ ] Failover mechanisms protected from unauthorized trigger
- [ ] DNS records protected with short TTLs and DNSSEC
- [ ] Cross-datacenter traffic encrypted
- [ ] Secrets for failover mechanisms rotated regularly
- [ ] Audit logging of all failover events

Interview Questions

1. Explain the difference between active-active and active-passive redundancy. When would you choose one over the other?

What to cover:

Active-active means all nodes run hot, handling traffic simultaneously — failover is immediate because traffic is already being served elsewhere
Active-passive keeps a standby idle until needed — failover takes time to promote the standby
Active-active gives you instant failover and better resource use, but adds complexity around conflict resolution and costs more since all sites run full capacity
Active-passive is simpler operationally but wastes resources on idle standby
Pick active-active when downtime is unacceptable and you have the budget to run multiple sites. Pick active-passive when some downtime is tolerable and cost matters more.

2. How do you calculate the composite SLA for a system with multiple dependencies?

What to cover:

The composite SLA is the product of all individual SLAs — convert percentages to decimals, multiply, convert back
Example: 99.99% × 99.9% × 99.9% ≈ 99.79%, which is only two nines even though each component had three or four
The insight is that every additional dependency shrinks your overall availability
In HA design, this means keeping dependency chains as short as possible
If one dependency is weaker than the rest, add redundancy specifically for that dependency or accept the lower composite number

3. What is the circuit breaker pattern and when would you use it?

What to cover:

Circuit breaker monitors failures to a downstream service — once the threshold is hit, it trips and fast-fails all requests instead of letting them pile up
Three states: CLOSED (normal operation) → OPEN (failing fast) → HALF_OPEN (testing if the downstream has recovered)
Use it when your service has multiple downstream dependencies and a cascade failure is likely if one starts struggling
The point is to stop hammering a dying service and give it breathing room to recover
Key parameters to configure: failure threshold and how long to wait before trying again (half-open)
Real example: your database is slow; the circuit breaker trips so your app servers stop accumulating connections
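
The three states described above can be sketched in a small class. This is a simplified illustration, not a production library: `CircuitBreaker`, `failureThreshold`, and `resetTimeoutMs` are hypothetical names, and in the half-open state this version lets every concurrent caller through rather than exactly one.

```javascript
// Circuit breaker with CLOSED, OPEN, and HALF_OPEN states.
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetTimeoutMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock, mainly for tests
    this.state = "CLOSED";
    this.failures = 0;
    this.openedAt = 0;
  }

  async call(fn) {
    if (this.state === "OPEN") {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error("circuit open: failing fast");
      }
      this.state = "HALF_OPEN"; // wait is over: let a trial request through
    }
    try {
      const result = await fn();
      this.state = "CLOSED"; // success closes the circuit again
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      // A failed trial request, or too many failures, opens the circuit.
      if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
        this.state = "OPEN";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```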

4. What are the components of failover time and how would you reduce each one?

What to cover:

Failure detection: usually 5-30 seconds, depends on how often you run health checks
Decision to failover: automatic is faster but riskier (might trigger unnecessarily), manual gives you control but introduces human delay
DNS or route update: 30-60+ seconds — DNS caching with long TTLs is usually the culprit
New instance startup: 30-120 seconds — depends on whether you pre-warm instances
Health check propagation: 5-15 seconds for clients to realize the new instance is healthy
To minimize: run health checks frequently, keep DNS TTLs short, pre-warm instances, use floating IPs instead of DNS for faster rerouting
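
Adding up the phase durations quoted above gives a rough failover window. The sketch below assumes the phases run one after another and leaves out the decision phase, for which the text gives no range; `failoverWindow` is an illustrative name.

```javascript
// Failover phases with the [best, worst] durations in seconds quoted above.
// The DNS figure is quoted as "30-60+", so the worst case is a lower bound.
const phases = {
  failureDetection: [5, 30],
  dnsOrRouteUpdate: [30, 60],
  instanceStartup: [30, 120],
  healthCheckPropagation: [5, 15],
};

// Sum the best and worst cases across phases, assuming they run sequentially.
function failoverWindow(phaseRanges) {
  return Object.values(phaseRanges).reduce(
    ([best, worst], [lo, hi]) => [best + lo, worst + hi],
    [0, 0],
  );
}

const [best, worst] = failoverWindow(phases);
console.log(`Failover takes roughly ${best}-${worst} seconds`); // 70-225 seconds
```

Seen this way, instance startup and DNS dominate the worst case, which is why pre-warmed standbys and floating IPs tend to give the largest reductions.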

5. What's the difference between liveness and readiness probes in health checking?

What to cover:

Liveness asks "is the process alive?" If it fails, the container or pod gets restarted
Readiness asks "can this service handle traffic?" If it fails, the instance gets removed from the load balancer pool
Liveness checks should be cheap and fast — just check if the process responds
Readiness should verify actual functionality: can it reach the database, the cache, does it have its config?
If you use readiness logic for a liveness check, you risk restarting containers when the issue is just a temporary dependency outage

6. How does graceful degradation improve user experience during failures?

What to cover:

Graceful degradation means providing partial functionality when full functionality isn't available
Instead of showing users a blank screen or an error, they see whatever the system can still deliver
Implementation: wrap each dependency call in a try-catch and fill in a sensible default when it fails
Example: an e-commerce product page shows the product, price, and availability even if the reviews and recommendations services are down
The principle: fail fast on core features (you need those to function), fail gracefully on peripheral ones

7. What is split-brain in HA systems and how do you prevent it?

What to cover:

Split-brain happens when a network partition leaves two nodes both thinking they're primary — both accept writes, which leads to data divergence or corruption
Prevention strategies: minority shutdown (the side with fewer nodes loses leadership), fencing tokens (storage rejects writes from the old primary), majority quorum (you need majority votes to be leader), red zone (partitioned leaders refuse to accept writes)
Fencing is a common approach: the primary must present an incrementing token when writing to storage, and the storage rejects any write with a stale token
Always test this — deliberately create network partitions to see if your split-brain protection actually works
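
The fencing-token approach above can be sketched as follows. This is a minimal illustration with hypothetical names (`FencedStorage`, `highestToken`); it assumes tokens come from a lock or election service that hands a strictly larger number to each new leader.

```javascript
// Storage that accepts a write only if its fencing token is at least as new
// as the highest token seen so far.
class FencedStorage {
  constructor() {
    this.highestToken = 0;
    this.data = new Map();
  }

  write(key, value, token) {
    if (token < this.highestToken) {
      throw new Error(`stale fencing token ${token} (current ${this.highestToken})`);
    }
    this.highestToken = token;
    this.data.set(key, value);
  }
}

const storage = new FencedStorage();
storage.write("balance", 100, 1); // old primary holds token 1
storage.write("balance", 120, 2); // new primary after failover holds token 2
// If the old primary wakes up and writes with token 1, the call throws
// and the stored value stays at 120.
```

The important property is that the check happens in the storage layer, so it still works when the old primary does not know it has been replaced.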

8. What are the trade-offs between synchronous and asynchronous database replication?

What to cover:

Synchronous replication waits for at least one replica to confirm the write before acknowledging it — no data loss on primary failure but every write now has to wait for a round-trip to the replica
Asynchronous replication acknowledges writes immediately — lower latency but the replica might lag behind if the network hiccups or the replica is under load
Semi-synchronous is the middle ground: the primary waits for at least one replica to acknowledge receiving the write, but not for every replica to apply it
The choice comes down to your RPO tolerance — financial systems usually need synchronous replication, most web apps can live with asynchronous
Watch out for async replication lag: the primary can be perfectly healthy while the replica silently falls behind due to network issues or heavy write load

9. Why does rate limiting matter for high availability and how does it prevent cascade failures?

What to cover:

Without rate limiting: a traffic spike hits your service, it starts timing out, clients retry, the retry load makes things worse, and you're in a cascade failure
Rate limiting at multiple levels: global (stops DDoS), service (protects your downstream dependencies), instance (prevents your own resources from being exhausted)
Token bucket gives you smooth limiting with burst allowance; sliding window gives you more accuracy but uses more memory
When a limit is hit, return 429 with a Retry-After header so clients know when to come back
Rate limit rejections are an early warning signal — a spike in 429s often means an attack is underway or a client has gone rogue
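
The token bucket mentioned above can be sketched in a few lines. This is an illustrative, single-process version with hypothetical names (`TokenBucket`, `tryRemove`); a real deployment would usually keep the counters in a shared store so every instance enforces the same limit.

```javascript
// Token bucket: holds up to `capacity` tokens and refills at `refillPerSecond`.
// Each request spends one token; an empty bucket means "reject with 429".
class TokenBucket {
  constructor(capacity, refillPerSecond, now = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity;
    this.lastRefill = now;
  }

  // Returns { allowed: true } or { allowed: false, retryAfterSeconds }.
  tryRemove(now = Date.now()) {
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond,
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return { allowed: true };
    }
    // Time until one full token is available, for the Retry-After header.
    const retryAfterSeconds = Math.ceil((1 - this.tokens) / this.refillPerSecond);
    return { allowed: false, retryAfterSeconds };
  }
}
```

When `tryRemove` returns `allowed: false`, the handler would respond with 429 and set `Retry-After` to `retryAfterSeconds`. The `capacity` value controls how large a burst is tolerated.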

10. How does chaos engineering help validate HA patterns in production systems?

What to cover:

Chaos engineering means deliberately injecting failures to test whether your system actually handles them the way you think it does
Game days are planned chaos experiments run during maintenance windows — you notify stakeholders, define success criteria, and have someone ready to intervene if things go sideways
Common experiments: kill a random app server, terminate the primary database, simulate a network partition, fill up a disk
Success criteria: error rate stays below a threshold, failover completes within your RTO, graceful degradation actually works
Before you start: define what "normal" looks like (steady-state hypothesis), capture baseline metrics, have a rollback plan
Netflix created Chaos Monkey to randomly kill servers in production — that spawned the whole discipline of chaos engineering

11. What is the difference between RTO and RPO, and how do they influence your HA architecture choices?

What to cover:

RTO (Recovery Time Objective) is how long you can afford to be down — the maximum acceptable time to restore service after a failure
RPO (Recovery Point Objective) is how much data you can afford to lose — the maximum acceptable time window for data loss
RTO drives your failover speed requirements: a 5-minute RTO means failover must complete in under 5 minutes, which constrains your architecture choices
RPO drives your replication strategy: a 1-hour RPO means you can lose up to 1 hour of data, so async replication might be fine; a 0 RPO requires synchronous replication
Financial systems often have 0 RPO (no data loss acceptable) and short RTOs (minutes), which pushes toward synchronous replication and fast failover
Most web apps can tolerate some data loss (RPO of minutes to hours) and moderate downtime (RTO of 30-60 minutes), which opens up async replication and simpler failover

12. How do you handle stateful failover differently from stateless failover, and why is state the hard part?

What to cover:

Stateless services: just restart somewhere else, no state to recover. Failover time is dominated by instance startup and health check propagation (5-30 seconds typically)
Stateful services: must preserve or recover state before serving traffic. Failover involves replica promotion, WAL replay, connection string updates (30s-5min typically)
The hard part is the state itself — if you are mid-write when the primary dies, async replication means that write might be lost
Stateful failover sequence: detect failure, replay any WAL the standby has received but not yet applied, promote the standby, update connection strings, verify that remaining replicas follow the new primary
Session state is a common trap: in-memory sessions are lost on failover. Always externalize session state to Redis or similar
Consistent hashing helps with stateful scaling — when a node fails, only its key space remaps rather than everything
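
The consistent-hashing point above can be illustrated with a small hash ring. This is a sketch under stated assumptions: `HashRing` and its methods are hypothetical names, the FNV-1a hash is chosen only because it needs no dependencies, and 100 virtual nodes per server is an arbitrary default.

```javascript
// 32-bit FNV-1a hash; fine for a demo, not for adversarial input.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

class HashRing {
  constructor(nodes, virtualNodes = 100) {
    this.virtualNodes = virtualNodes;
    this.ring = []; // sorted list of { point, node }
    nodes.forEach((node) => this.addNode(node));
  }

  // Place several virtual points per node to even out the key distribution.
  addNode(node) {
    for (let i = 0; i < this.virtualNodes; i++) {
      this.ring.push({ point: fnv1a(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  removeNode(node) {
    this.ring = this.ring.filter((entry) => entry.node !== node);
  }

  // The owner is the first ring point at or after the key's hash, wrapping
  // around to the start of the ring if there is none.
  getNode(key) {
    const h = fnv1a(key);
    const entry = this.ring.find((e) => e.point >= h) ?? this.ring[0];
    return entry.node;
  }
}
```

After `removeNode("node-b")`, only keys that `node-b` owned move to a new owner; every other key keeps its node, so a failure does not reshuffle the whole key space.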

13. What are the trade-offs between multi-AZ and multi-region deployment for high availability?

What to cover:

Multi-AZ: instances span availability zones within a single region. Failure of one AZ is handled by the others. Latency between AZs is low (1-5ms). Cost is moderate
Multi-region: instances span different geographic regions. Failure of an entire region is survivable. Latency between regions is higher (100-250ms depending on distance). Cost is significantly higher
Multi-AZ protects against AZ failures (power, networking, hardware) but not region-level disasters (natural disasters, regulatory events)
Multi-region is usually needed for 99.99%+ availability because a single region-wide outage can consume the entire annual downtime budget (about 52 minutes)
The catch with multi-region: you now have cross-region replication latency, which affects both performance and RPO. Strong consistency across regions is expensive
Most applications start with multi-AZ and only add multi-region when they have specific HA requirements and the budget to support it

14. How does CAP theorem constrain HA architecture decisions, and what does PACELC add?

What to cover:

CAP theorem: during a network partition (P), you must choose between consistency (C) and availability (A). You cannot have both when the network is split
In practice, "partition" means your system detects network connectivity issues. The choice is how to respond: sacrifice availability (reject requests to maintain consistency) or sacrifice consistency (serve stale data to stay available)
Many internet-facing systems favor availability by default: they stay up during partitions and accept that reads might be stale or writes might be lost
PACELC extends CAP: when there is no partition (E = else), you still have a latency vs consistency trade-off. Strong consistency requires coordination, which adds latency
PACELC clarifies that even in the absence of partitions, you are always making a choice: do you want consistency (slower, more coordination) or speed (faster, potential staleness)?
In HA design, this means accepting that highly consistent systems will be slower and less available during recovery, while highly available systems will have weaker consistency guarantees

15. What is a bulkhead pattern and how does it prevent cascade failures in HA systems?

What to cover:

The bulkhead pattern isolates components so that failure in one area does not spread to others — named after ship bulkheads that contain flooding to a single compartment
In microservices: each service has its own connection pool to the database. If one service misbehaves and exhausts its pool connections, other services can still function because they have their own pools
Without bulkheads: one service's connection leak exhausts the shared connection pool, and now every service is blocked — cascade failure
Implementation: set per-service connection limits that are smaller than the total available. Monitor and alert on connection pool utilization per service
Bulkheads work alongside circuit breakers: bulkheads contain the blast radius, circuit breakers stop the flood of retries
Kubernetes namespaces can act as bulkheads — resource quotas prevent a noisy namespace from consuming all cluster resources
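
The per-service limit described above can be sketched as a small concurrency guard. This is an illustration, not a connection pool: `Bulkhead` and `maxConcurrent` are hypothetical names, and this version rejects immediately when full instead of queueing.

```javascript
// Bulkhead: caps the number of in-flight calls for one caller so a slow or
// leaking caller cannot use up capacity that other callers need.
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.inFlight = 0;
  }

  // Runs `task` if a slot is free; otherwise rejects immediately.
  async run(task) {
    if (this.inFlight >= this.maxConcurrent) {
      throw new Error("bulkhead full");
    }
    this.inFlight++;
    try {
      return await task();
    } finally {
      this.inFlight--; // always free the slot, even when the task fails
    }
  }
}

// One bulkhead per caller: "reports" can exhaust its 2 slots without
// touching the 8 slots reserved for "checkout" (names are illustrative).
const bulkheads = { checkout: new Bulkhead(8), reports: new Bulkhead(2) };
```

Rejecting fast when the bulkhead is full keeps the failure local to one caller, and the rejection count is a useful metric to alert on.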

16. How do you design health checks for a system with multiple downstream dependencies?

What to cover:

Each downstream dependency needs its own check, aggregated into an overall health status. The health endpoint should reflect the actual serving capability of the service
Structure: return status (ok/degraded/down) plus per-dependency detail so operators can diagnose quickly. Avoid just returning "unhealthy" with no information
Set timeouts on every dependency check — a slow dependency should not block your health check from completing. If a check times out, treat it as failed
Segment checks: a liveness probe should be cheap and fast (is the process alive?), a readiness probe should verify actual functionality (can this instance serve traffic?)
Do not use cached values in health checks — staleness masks actual dependency health. Check the dependency directly
Consider dependency injection for health checks: makes testing easier and allows you to mock failing dependencies in chaos experiments
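
The aggregation and timeout rules above can be sketched as a single function. This is one possible shape, not a standard API: `checkHealth`, `withTimeout`, and the `critical` flag are assumptions about how a service might register its dependencies.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), ms);
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Run every dependency check in parallel, each with its own timeout.
// `checks` maps a dependency name to { check, critical }.
async function checkHealth(checks, timeoutMs = 1000) {
  const entries = await Promise.all(
    Object.entries(checks).map(async ([name, { check, critical }]) => {
      try {
        await withTimeout(check(), timeoutMs);
        return [name, { ok: true, critical }];
      } catch (err) {
        // A timed-out check is treated the same as a failed one.
        return [name, { ok: false, critical, error: err.message }];
      }
    }),
  );
  const details = Object.fromEntries(entries);
  const failed = Object.values(details).filter((d) => !d.ok);
  const status = failed.some((d) => d.critical)
    ? "down"
    : failed.length > 0
      ? "degraded"
      : "ok";
  return { status, details };
}
```

Because the checks are passed in as functions, a chaos experiment or unit test can substitute a failing or hanging check without touching the real dependency.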

17. What is the role of quotas and resource limits in maintaining availability during traffic spikes?

What to cover:

Quotas prevent any single tenant, service, or user from consuming all available resources. Without quotas, one misbehaving client can exhaust connection pools, memory, or CPU, affecting everyone
Rate limiting handles request volume; quotas handle resource consumption per entity. Both are needed for complete protection
Implement hierarchical quotas: global quotas (total system capacity), per-service quotas (prevent one service from dominating), per-tenant quotas (fairness between customers)
When a quota is exceeded: return 429 with a Retry-After header so clients can back off and retry later rather than hammering immediately
Monitor quota utilization — approaching quota limits is an early warning of an oncoming failure. A sudden spike in 429s often precedes a cascade
Set conservative default quotas and allow customers to request increases through a defined process, so you can validate before granting

18. How would you approach capacity planning for a system targeting 99.99% availability?

What to cover:

Start with your target SLA and work backward: 99.99% allows ~52 minutes of downtime per year, which constrains your failover time, RTO, and how quickly you must detect failures
Map all dependencies and their individual SLAs. Composite SLA = product of all SLAs. If any dependency does not meet your target, add redundancy for that dependency specifically
Capacity planning for HA is different from normal capacity planning: you must plan for degraded operation (N-1 redundancy). If one component fails, the remaining components must handle the full load
Stress test during failover: when your primary fails, how much load can your standby handle? If it cannot handle peak load, you need larger standby or active-active
Watch for resource exhaustion during failover: connection pools, thread pools, and CPU all spike during the transition period. Pre-warm standby instances to reduce startup time
Review capacity quarterly. Traffic patterns change, and what was sufficient 6 months ago might be under-provisioned now. Growth + failover = capacity crisis
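
The N-1 rule above reduces to simple arithmetic. The sketch below uses illustrative names and made-up numbers (1,000 req/s per instance, 2,500 req/s peak), and it assumes load spreads evenly across the surviving instances.

```javascript
// Can the system still carry peak load after losing `failures` instances?
function survivesFailures({ instances, capacityPerInstance, peakLoad, failures = 1 }) {
  const remaining = instances - failures;
  return remaining > 0 && remaining * capacityPerInstance >= peakLoad;
}

// Smallest instance count that still serves peak load with `failures` down.
function instancesNeeded({ capacityPerInstance, peakLoad, failures = 1 }) {
  return Math.ceil(peakLoad / capacityPerInstance) + failures;
}

// Three instances at 1,000 req/s each look fine for a 2,500 req/s peak,
// but losing one leaves only 2,000 req/s of capacity.
console.log(survivesFailures({ instances: 3, capacityPerInstance: 1000, peakLoad: 2500 })); // false
console.log(instancesNeeded({ capacityPerInstance: 1000, peakLoad: 2500 })); // 4
```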

19. What are the key differences between Kubernetes PodDisruptionBudgets, anti-affinity rules, and StatefulSets for HA?

What to cover:

PodDisruptionBudget (PDB): ensures a minimum number of pods remain available during voluntary disruptions (node drains, upgrades). Prevents too many replicas going down simultaneously
PodAntiAffinity: scheduling constraint that spreads pods across failure domains (AZs, nodes). Ensures a single AZ or node failure does not take out all replicas
StatefulSets: for stateful workloads that need stable network identity and persistent storage. Ordered deployment and scaling ensure pods come up in the right sequence
PDB handles voluntary disruptions; anti-affinity handles AZ failures. Both are needed for a complete HA posture
StatefulSets are more complex than Deployments — only use them when you genuinely need stable identity and ordered operations. Most workloads are fine with Deployments
Combine PDB + anti-affinity + multi-AZ node pools: PDB prevents disruption cascades, anti-affinity spreads across AZs, multi-AZ pools ensure physical separation

20. How do you design a multi-region active-active architecture and what are the key challenges?

What to cover:

Active-active means all regions serve traffic and accept writes. Traffic is routed to the nearest region (geographic DNS, anycast). Failover is instant because traffic is already being served elsewhere
Key challenge 1: conflict resolution. If the same record is modified in two regions simultaneously, how do you merge? Last-write-wins is simple but loses data; conflict resolution is complex and expensive
Key challenge 2: cross-region replication latency. Writes in region A must be replicated to region B before being considered durable. If you need strong consistency, every write waits for cross-region acknowledgment (slow)
Key challenge 3: topology. You need a way to route users to the nearest region (latency-based DNS, GeoDNS, anycast IP). You need health checks that span regions to detect when a region should be removed from the pool
Most active-active systems accept eventual consistency across regions — writes are acknowledged locally and replicated asynchronously. Conflicts are resolved using timestamps, vector clocks, or application logic
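For databases: multi-writer systems take different approaches. Cassandra accepts writes in every region and resolves conflicts with last-write-wins timestamps, while CockroachDB avoids conflicts by running writes through consensus. For application state: externalize to a distributed store that handles replication (Redis Cluster, etcd)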
Cost is 2-3x single region because all regions are hot. Only adopt this for genuine 99.99%+ requirements with the operational complexity budget to match
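
The last-write-wins strategy from key challenge 1 can be sketched as a merge function. This is a simplified illustration with hypothetical field names (`timestamp`, `region`), and it assumes clocks across regions are closely synchronized, which is itself a hard problem.

```javascript
// Last-write-wins merge of two versions of the same record.
// The region name breaks ties so every replica picks the same winner.
function lastWriteWins(a, b) {
  if (a.timestamp !== b.timestamp) {
    return a.timestamp > b.timestamp ? a : b;
  }
  return a.region > b.region ? a : b;
}

const fromUsEast = { value: { cart: ["book"] }, timestamp: 1000, region: "us-east" };
const fromEuWest = { value: { cart: ["lamp"] }, timestamp: 1005, region: "eu-west" };

// The us-east write is silently discarded: the "loses data" cost noted above.
console.log(lastWriteWins(fromUsEast, fromEuWest).value); // { cart: [ 'lamp' ] }
```

For data like shopping carts, where both writes should survive, an application-level merge is usually a better fit than last-write-wins.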

Quick Recap Checklist

Conclusion

High availability is not a feature you add at the end — it is an architectural discipline that shapes every decision from the start. The patterns covered in this post provide a toolkit for building systems that remain operational when components fail.

Redundancy, failover, load balancing, and circuit breakers work together to prevent single points of failure from cascading into system-wide outages. SLA planning ensures you understand the availability commitments you are making and can design accordingly.

The key principles to remember:

Design for failure — assume components will fail and build accordingly
Keep dependencies minimal — every additional dependency lowers your composite SLA
Test failover regularly — an untested failover plan is not a failover plan
Monitor the right signals — latency, traffic, errors, and saturation

Building highly available systems requires balancing complexity against reliability. Not every system needs five nines of availability. Choose the availability target that matches your business context and budget, then design intentionally to meet that target.
