Consistency Models in Distributed Systems: Complete Guide
Learn about strong, weak, eventual, and causal consistency models. Understand read-your-writes, monotonic reads, and picking the right model for your system.
Consistency Models in Distributed Systems
Strong consistency sounds simple: reads always return the most recent write. In distributed systems, though, strong consistency is expensive. The CAP theorem forces us to think about consistency trade-offs, and the PACELC theorem reminds us that even without partitions, consistency has a latency cost.
Knowing where different consistency models apply helps you make better architectural decisions. Most systems do not need strong consistency everywhere. Finding where you can relax consistency guarantees can dramatically improve performance.
This post builds on the CAP theorem and PACELC theorem discussions.
Introduction
Think of consistency as a dial you can turn, not a binary switch:
graph LR
A[Strong<br/>Consistency] --> B[Sequential<br/>Consistency]
B --> C[Causal<br/>Consistency]
C --> D[Eventual<br/>Consistency]
D --> E[Weak<br/>Consistency]
Strong consistency at one end, weak at the other. Most practical systems sit somewhere in between.
Core Concepts
Strong consistency guarantees that any read sees all preceding writes. The system appears to have a single copy of the data. This is what most developers assume when they think about database transactions.
Linearizability vs Sequential Consistency
There is a formal distinction that trips up even experienced engineers: linearizability and sequential consistency are not the same thing.
Sequential consistency (as described above) guarantees that all nodes see operations in the same total order, but that order does not have to correspond to real time. Operations can appear to happen before or after their actual wall-clock time.
Linearizability adds a real-time requirement on top of sequential consistency. Every operation appears to complete atomically at some point between its invocation and its response, and those points are ordered in real time. If write A completes before write B starts (in wall-clock time), then B must be visible after A on all nodes.
The practical difference: linearizability gives you the illusion of a single copy of data across all time. Sequential consistency only guarantees order, not real-time ordering. Linearizability is strictly stronger.
Most databases that claim “strong consistency” mean linearizability. Paxos and Raft provide linearizability. Single-leader replication with reads from the leader is linearizable (if the leader does not change during the read).
Concrete example: sequentially consistent but not linearizable.
Consider two processes writing to the same variable, with a write from region A (x=2) and a write from region B (x=3) that complete in overlapping real time. Under sequential consistency, all replicas must agree on the same total order (e.g., B’s write first, then A’s), but they might not reflect real time — A’s write “wins” even if it started after B’s completed. Under linearizability, the real-time ordering matters: if B’s write completes before A’s write starts, then A’s write must appear after B’s everywhere. A system can be sequentially consistent (all replicas agree on order) but not linearizable (the agreed-upon order does not match real time).
// Linearizability test: does the read reflect a write that completed before it started?
// Wall clock: write starts at t=0, completes at t=5
// read starts at t=6, completes at t=10
// Linearizability requires: read sees the write
// Sequential consistency: read might see the write OR the initial state (order is preserved, but real time is not)
// Real-world example of the difference:
// Sequential consistency: process A writes x=1, then x=2. Process B reads x.
// B might see x=2 even if B's read starts before A's write completes in wall-clock time,
// as long as B sees A's writes in order (x=1 then x=2, never x=2 then x=1).
// Linearizability: B's read at t=6 must see x=2 because A's write completed at t=5.
Consistency Model Comparison Table
Here is how real-world systems map to consistency models:
| System | Default Model | Strongest Model Available | Notes |
|---|---|---|---|
| DynamoDB | Eventual (per-item) | Linearizable (per-item) | Per-item linearizability; cross-item transactions are limited |
| Cassandra | Eventual (ONE) | QUORUM / ALL | Tunable consistency; QUORUM gives linearizability |
| Google Spanner | Linearizable | Linearizable (TrueTime) | GPS + atomic clocks bound clock uncertainty to ±7ms |
| CockroachDB | Snapshot + serializable | Serializable | Serializable isolation level via distributed transactions |
| etcd | Linearizable | Linearizable | Uses Raft consensus |
| MongoDB (wired tiger) | Snapshot / linearizable | Linearizable (majority read) | Single-shard linearizability; multi-shard needs transactions |
| Amazon Aurora | Session (read-after-write) | Linearizable (MCI) | Multi-master with cache-coherent invalidation |
| Azure Cosmos DB | Session (default) | Strong (bounded staleness) | Five consistency models tunable per request |
| Riak | Eventual | Strong (ALL) / causal | CRDT-native; strong requires ALL replicas |
| Zookeeper (ZK) | Linearizable | Linearizable | Single-object operations; transactions across objects |
| Kafka | Per-partition ordering | Exactly-once (transactional) | Within a partition, all consumers see messages in order |
Why it matters: Teams often pick a database based on its advertised consistency model without realizing they can tune it. Cassandra with QUORUM is strongly consistent. Cosmos DB with bounded staleness is cheaper than strong but still bounded. The default model is not the only model.
Bank account balances are the standard example:
// Strong consistency: read after write always returns the new value
async function transfer(from, to, amount) {
const account = await db.account.findOne({ id: from });
if (account.balance >= amount) {
await db.account.update(
{ id: from },
{ balance: account.balance - amount },
);
await db.account.update({ id: to }, { balance: account.balance + amount });
return { success: true };
}
return { success: false, reason: "Insufficient funds" };
}
// After this completes, any read of 'from' balance shows the new value
const balance = await db.account.findOne({ id: from });
// balance is definitely reduced, even if you read from a different replica
Strong consistency uses consensus protocols like Paxos or Raft to coordinate reads and writes across nodes. This coordination adds latency. In a geo-distributed system, a strongly consistent write might take 100-300ms. An eventually consistent write might take 5ms. The extra latency is usually worth it for data that matters.
Sequential Consistency
Sequential consistency sits between strong and eventual. It guarantees that all nodes see operations in the same order, but that order does not have to match real time.
Imagine two processes writing to the same variable:
// Sequential consistency: both replicas see writes in the same order
// but that order might not match real-world time
// Process A writes x = 1
// Process B writes x = 2
// Process C reads x
// Valid under sequential consistency: C might read 1 or 2
// Invalid: C reading x = 0 (the initial value) after both writes completed
With sequential consistency, you cannot have time travel paradoxes where a later write appears to happen before an earlier write. But you also cannot assume that a read reflects all writes that finished before it in real time.
Causal Consistency
Causal consistency is weaker. It only guarantees that causally related operations appear in order across all nodes. Operations with no causal relationship can be observed in different orders by different nodes.
Causally related means: if operation A causes operation B to happen, then A must be visible before B everywhere. Example:
// Causal consistency example
// Post 1: "I got promoted" (write by user)
// Post 2: "Celebrating tonight" (comment on post 1)
// Every node sees post 2 only after seeing post 1
// Because post 2 is causally dependent on post 1
// But "I got promoted" and "Weather is nice today" from the same user
// Are not causally related - they can appear in any order
Causal consistency is attractive because it is the strongest model that can be implemented without coordination in certain systems. The Dynamo paper popularized eventual consistency with causal metadata.
Eventual Consistency Models
Eventual consistency makes the weakest guarantee: if no new updates are made to an object, eventually all reads will return the last written value. This section covers the core eventual consistency variants and their practical implications.
Eventual Consistency
The word “eventually” is doing a lot of work here. It could mean milliseconds or hours. Different systems give different guarantees.
Bounded vs Unbounded Eventual Consistency
An important distinction often overlooked: bounded vs unbounded eventual consistency.
| Type | Definition | Real-World Examples |
|---|---|---|
| Bounded | Convergence happens within a known, finite time | DynamoDB (typically seconds), Cassandra (usually < 1 second normally) |
| Unbounded | Convergence will happen, but no time bound is guaranteed | DNS propagation (minutes to hours), CDN cache expiry (hours to days) |
Why this matters:
// Bounded eventual consistency - you can reason about staleness
// DynamoDB withDynamoDB Streams: staleness typically < 5 seconds
async function readWithStalenessBound(key) {
const item = await dynamodb.getItem({ Key: key }).promise();
// You know the item is at most 5 seconds stale (bounded)
return item;
}
// Unbounded eventual consistency - no promises about when
// Traditional DNS: TTL might be 24 hours, but convergence is not guaranteed
async function resolveWithDNS(hostname) {
const addresses = await dns.lookup(hostname);
// Could return old IP for hours; no bounded guarantee
return addresses;
}
Convergence time factors:
| Factor | Impact on Convergence Time |
|---|---|
| Network distance | Geo-distributed replicas add 50-200ms per hop |
| Load average | High load increases replication lag |
| Failure recovery | Node recovery triggers anti-entropy, can take minutes |
| Quorum availability | If quorum is impaired, writes may stall |
For bounded eventual consistency, typical convergence times:
- Same-region replicas: 10-100ms
- Cross-region replicas: 100ms - 2s
- Under failure conditions: can extend to 30s or more
For unbounded eventual consistency:
- DNS: minutes to 48+ hours (depending on TTL and propagation)
- CDN: hours to days
- Some gossip protocols: theoretically unbounded, practically O(log N)
Session Consistency Models
Session consistency guarantees apply within a single client session, providing practical guarantees like read-your-writes and monotonicity that users expect in real applications. This section covers session guarantees and their implementations.
Session Consistency
Session consistency guarantees that within a single client session, certain properties hold. This is not a single model but a family of guarantees:
| Session Guarantee | Definition | Implementation |
|---|---|---|
| Read-your-writes | Session sees its own writes | Route reads to primary or last-write replica |
| Monotonic reads | Session never sees older values | Sticky sessions, read your writes routing |
| Monotonic writes | Session’s writes are ordered | Per-client write sequencing |
| Write-follows-reads | Write is ordered after reads that preceded it | Causal ordering tracking |
DynamoDB session consistency example:
// DynamoDB: Session-level consistency guarantees
// Using a client's session token to ensure read-your-writes
const dynamodb = new AWS.DynamoDB({ region: "us-east-1" });
const docClient = new AWS.DynamoDB.DocumentClient({
dynamodb,
sessionToken: "user-session-id", // Enables session guarantees
});
// With ConsistentRead=true, guarantees read-your-writes within session
const item = await docClient
.get({
TableName: "Posts",
Key: { userId, postId },
ConsistentRead: true, // Strong within this session
})
.promise();
PRAM (Pipelined Random Access Memory) Consistency
PRAM consistency is a weak model that guarantees:
Writes from a single process are observed by all processes in the order they were issued. Writes from different processes may be observed in different orders.
In simpler terms: your own writes appear in order to everyone; other people’s writes may appear in different orders to different observers.
// PRAM consistency example
// Process A writes x=1, then x=2
// Process B writes y=1, then y=2
// All processes see:
// - A's writes in order: x=1 then x=2
// - B's writes in order: y=1 then y=2
// But some might see B's writes interleaved with A's differently
// Valid PRAM ordering: x=1, x=2, y=1, y=2
// Also valid: y=1, y=2, x=1, x=2
// Also valid: x=1, y=1, x=2, y=2
// NOT valid: x=2, x=1 (A's writes out of order)
PRAM vs Causal consistency:
- PRAM: Only your own write ordering is guaranteed
- Causal: Write ordering that has causal dependency is guaranteed
PRAM is relatively easy to implement efficiently and provides useful guarantees for some workloads.
// Eventual consistency: write returns immediately
async function writeEventual(key, value) {
// Write to local replica, return immediately
await localCache.set(key, value);
return { success: true, version: ++localVersion };
}
// Later, background replication propagates the write
// During the propagation window, reads might return stale values
async function read(key) {
return await localCache.get(key); // Could be stale during propagation
}
Amazon DynamoDB uses eventual consistency by default. Google Spanner uses strong consistency. Most databases let you choose per-query.
Read Your Own Writes
Read-your-writes consistency (also called read-after-write consistency) guarantees that after a client writes a value, subsequent reads by that same client see that write.
This is intuitive for single-node systems but tricky in distributed systems:
// Without read-your-writes: user might not see their own post
await api.createPost({ title: "Hello World" });
const posts = await api.getMyPosts();
// posts might not include "Hello World" if the read hits a replica
// that has not yet received the write
// With read-your-writes: the read routes to the right place
await api.createPost({ title: "Hello World" });
const posts = await api.getMyPosts({ consistentRead: true });
// posts definitely includes "Hello World"
Most social media applications need read-your-writes for user-generated content. Users get confused when they post something and then do not see it. This is one of the most common bugs I see in production systems.
Monotonic Reads
Monotonic reads guarantee that if a client reads a value V at time T, it will never read an older value at a later time. Reads never go backward in time.
This prevents a jarring experience where data appears to roll back:
// Without monotonic reads: timeline can go backward
await api.createComment({ text: "Great post!" });
const comments1 = await api.getComments(postId); // Includes "Great post!"
// ... some time passes, replication lag increases ...
const comments2 = await api.getComments(postId); // "Great post!" disappeared!
// With monotonic reads: once you see a value, you never unsee it
const comments = await api.getComments(postId, { monotonic: true });
// Once "Great post!" appears, it stays visible for this client session
Monotonic reads are important for any application where users see a chronological list of items. Without this guarantee, pagination can show duplicate items or skip items that were added while you were reading.
Monotonic Writes
Monotonic writes guarantee that writes from a client appear in the order they were issued. This sounds obvious, but in distributed systems, a client might issue two writes that arrive at different replicas in different orders.
// Without monotonic writes: writes can be applied out of order
await api.updateProfile({ name: "Alice" });
await api.updateProfile({ status: "Busy" });
// The replica receiving updates might apply status before name
// resulting in name being set but status showing previous value
// With monotonic writes: writes are serialized per client
await api.updateProfile({ name: "Alice" }, { serial: true });
await api.updateProfile({ status: "Busy" }, { serial: true });
// Updates are applied in order, regardless of network timing
This is less commonly discussed but matters for operations that build on each other.
ACID vs BASE
Most developers learn ACID (Atomicity, Consistency, Isolation, Durability) as the default model for databases. In distributed systems, the alternative is BASE (Basically Available, Soft state, Eventually consistent).
These are not competing worldviews — they represent different trade-off philosophies.
| Property | ACID | BASE |
|---|---|---|
| Atomicity | All or nothing | All or nothing (per node) |
| Consistency | Invariants preserved at transaction end | Invariants eventually restored |
| Isolation | Serializable or weaker | Best effort |
| Durability | Synchronous write to disk | Asynchronous, may lose recent writes |
| Availability | May sacrifice for consistency | Prioritizes availability |
| Typical use | Financial, inventory, order management | Web scale, social networks, caching layers |
ACID systems (PostgreSQL, MySQL/InnoDB) use two-phase locking for isolation. They will stall or reject writes if locks cannot be acquired. BASE systems (most NoSQL) trade strict isolation for availability and partition tolerance. Neither is wrong — they serve different purposes.
The key insight: you do not have to choose one paradigm for your entire application. Many production systems use both — ACID databases as the system of record for transactions, and BASE stores (or caches) for read-heavy derived views. The challenge is knowing which data belongs in which model.
In practice: financial ledgers, inventory counts, and payment state belong in ACID. Social feeds, cached user preferences, and analytics dashboards belong in BASE.
Trade-off Analysis
This table provides a decision-matrix comparison for consistency model selection across practical dimensions:
| Dimension | Linearizable | Sequential | Causal | Session | Eventual |
|---|---|---|---|---|---|
| Write latency | 100-300ms (geo-dist) | 50-200ms | 20-100ms | 10-50ms | 5-20ms |
| Read latency | 100-300ms (geo-dist) | 50-200ms | 20-100ms | 10-50ms | 1-10ms |
| Coordination required | All nodes (quorum) | All nodes (total order) | Causally related ops | Per-session | None |
| Staleness window | Zero | Zero | Minimal | Session-scoped | Unbounded or bounded |
| Implementation complexity | High (consensus) | Medium | Medium | Low | Low |
| Failure behavior | Unavailable if quorum lost | Unavailable if quorum lost | Degraded (causal gaps) | Degraded (session breaks) | Available (may be stale) |
| CAP classification | CP | CP | CA hybrid | CA hybrid | AP |
| Cross-region writes | Expensive (all regions) | Expensive | Moderate (causal only) | Cheap (session sticky) | Cheapest (async) |
Quorum Configuration Quick Reference
| Replicas (N) | Strong (R+W>N) | Fast Reads (R=1) | Fast Writes (W=1) |
|---|---|---|---|
| 3 | R=2, W=2 | R=1, W=3 | R=3, W=1 |
| 5 | R=3, W=3 | R=1, W=5 | R=5, W=1 |
| 7 | R=4, W=4 | R=1, W=7 | R=7, W=1 |
Trade-off Summary
Consistency, availability, latency, and complexity form a fundamental trade-off space in distributed systems. Here is a consolidated view of the key decisions:
| Trade-off Dimension | Strong Consistency | Weak/Eventual Consistency |
|---|---|---|
| Latency | High (quorum round-trips, consensus) | Low (local writes, async replication) |
| Throughput | Lower (coordination overhead) | Higher (parallel local writes) |
| Availability | Lower (fails if quorum unreachable) | Higher (continues during partitions) |
| Complexity | Higher (consensus protocols, 2PC) | Lower (eventual merge, CRDTs) |
| Debugging | Easier (total order, linearizable) | Harder (staleness, race conditions) |
| UX Risk | Lower (reads always fresh) | Higher (stale reads, rollback) |
| Consistency Model | Best For | Avoid When |
|---|---|---|
| Linearizable | Financial transactions, locks, inventory | Latency matters, geo-distributed |
| Sequential | Leader-based reads, single-region | Multi-region with causality needs |
| Causal | Social apps, collaborative editing | Total order required |
| Session | User-generated content, dashboards | Financial or inventory updates |
| Eventual | Caches, counters, analytics | Writes must be immediately visible |
| Weak | High-throughput logging, sensor data | Any correctness dependency |
Production Failure Scenarios
| Failure Scenario | Impact | Mitigation |
|---|---|---|
| Replica returns stale data after write | User sees old value after confirming write | Implement read-your-writes by routing reads to quorum or primary |
| Monotonic read violation | Data appears to roll back, breaking user experience | Use sticky sessions or read-your-writes guarantees |
| Causal ordering violation | Comment appears before post it references | Implement vector clocks for causal tracking |
| Conflict resolution produces wrong result | LWW picks wrong version in concurrent edits | Use CRDTs for automatically mergeable data |
| Replication lag spikes | Extended staleness window during high load | Monitor lag, alert on thresholds, scale replicas |
Conflict Resolution Strategies
When multiple replicas accept concurrent writes, conflicts must be resolved. This section covers conflict resolution strategies, how to choose the right model, and common mistakes to avoid.
Choosing the Right Consistency Model
Picking a consistency model is not a one-time architectural decision — it is a per-operation decision that affects both performance and correctness.
Decision Framework
Answer these questions in order:
1. What happens if reads return stale data?
- Financial transactions, inventory, payment state: Strong consistency (linearizable)
- Social media feeds, likes, view counts: Eventual consistency is fine
- User-generated content (posts, comments): Session consistency (read-your-writes)
2. What happens if writes are lost?
- If a write being lost causes financial harm or data corruption: Strong consistency with quorum
- If lost writes are acceptable and convergence is eventual: Eventual consistency
3. What is your tolerance for latency?
- User-facing operations needing fast response: Eventual or session consistency
- Background jobs, analytics: Eventual consistency is fine
- Synchronous operations where users wait for confirmation: Strong consistency
4. Are operations causally dependent?
- If operation B depends on operation A (e.g., comment on a post): Causal consistency at minimum
- If operations are independent: Eventual consistency is usually sufficient
Consistency Model Selection Table
| Use Case | Recommended Model | Implementation |
|---|---|---|
| Financial transfers, payment processing | Linearizable (strong) | QUORUM, Raft/Paxos, 2PC |
| User posts and comments | Session (read-your-writes) | Sticky sessions, primary routing |
| Product catalog, pricing (display) | Bounded eventual | DynamoDB, Cassandra with LOCAL_QUORUM |
| Social media likes, follower counts | Eventual | DynamoDB, Cassandra ONE |
| Collaborative editing (text, documents) | Causal / CRDT | RGA, OR-Set, operational transformation |
| Distributed locks, leader election | Linearizable | etcd, Zookeeper (Zab/Raft) |
| Configuration, feature flags | Strong / linearizable | etcd, Zookeeper |
| Recommendation engine, analytics | Eventual | Cassandra, DynamoDB streams |
| Shopping cart operations | Session / monotonic | DynamoDB with session token |
Common Mistakes
- Using strong consistency everywhere. This is the default for most developers because it is the mental model from single-node databases. Most systems do not need it everywhere, and paying the coordination cost globally tanks performance.
- Choosing eventual consistency without measuring lag. “Eventual” is not “fast.” Under load, eventual consistency lag can spike to seconds or minutes. If your users notice stale data, you need bounded eventual at minimum.
- Forgetting read-your-writes for user-facing apps. This is the single most common consistency bug in production. Users post content and do not see it. It is jarring and entirely preventable.
| Strategy | How It Works | Trade-offs |
|---|---|---|
| Last-Write-Wins (LWW) | Highest timestamp wins | Simple but can lose writes; depends on clock sync |
| First-Write-Wins | First write wins | Conservative but wasteful |
| CRDTs | Data structures that merge automatically | No conflicts by design; limited to certain types |
| Application Merge | App code decides how to combine | Flexible but requires custom logic |
| Manual Resolution | User decides | Best for critical conflicts; bad UX otherwise |
CRDT Patterns and Examples
CRDTs (Conflict-free Replicated Data Types) provide mathematically proven conflict resolution by design. This section covers common CRDT implementations and when to use them.
CRDT Example: G-Counter
// Grow-only counter - merges by taking max of each replica's value
class GCounter {
constructor(replicaId) {
this.replicaId = replicaId;
this.counts = {}; // { replicaId: value }
}
increment() {
this.counts[this.replicaId] = (this.counts[this.replicaId] || 0) + 1;
}
merge(other) {
for (const [id, value] of Object.entries(other.counts)) {
this.counts[id] = Math.max(this.counts[id] || 0, value);
}
}
value() {
return Object.values(this.counts).reduce((a, b) => a + b, 0);
}
}
CRDT Example: LWW-Register (Last-Write-Wins)
LWW-Register stores a value and a timestamp. On merge, the write with the higher timestamp wins:
// Last-Write-Wins Register - simplest conflict-free register
class LWWRegister {
constructor(replicaId) {
this.replicaId = replicaId;
this.value = null;
this.timestamp = 0;
}
set(value) {
// Use hybrid logical clock or wall clock with care
this.timestamp = Date.now();
this.value = value;
}
merge(other) {
if (other.timestamp > this.timestamp) {
this.value = other.value;
this.timestamp = other.timestamp;
}
// If our timestamp is newer or equal, keep our value
}
get() {
return this.value;
}
}
// Usage: inventory items, user preferences, configuration
const inventory = new LWWRegister("replica-1");
inventory.set({ item: "widget", count: 10 });
// Concurrent write from another replica with timestamp 1001
inventory.merge({ value: { item: "widget", count: 15 }, timestamp: 1001 });
console.log(inventory.get()); // { item: 'widget', count: 15 }
CRDT Example: OR-Set (Observed-Remove Set)
OR-Set allows adding and removing elements where add-wins-over-remove for concurrent operations:
// Observed-Remove Set - add wins over remove for concurrent ops
class ORSet {
constructor(replicaId) {
this.replicaId = replicaId;
this.items = new Map(); // element -> { addedBy: replicaId, addTag: number }
this.tombstones = new Map(); // element -> Set of removed tags
}
add(element) {
const tag = Date.now() + Math.random();
this.items.set(element, {
addedBy: this.replicaId,
addTag: tag,
removed: false,
});
}
remove(element) {
// Mark as removed but don't delete - we need to track for merging
const entry = this.items.get(element);
if (entry) {
entry.removed = true;
}
}
merge(other) {
for (const [element, entry] of other.items) {
const ourEntry = this.items.get(element);
if (!ourEntry) {
// Element doesn't exist locally, add it
this.items.set(element, { ...entry });
} else if (entry.addTag > ourEntry.addTag) {
// Other replica added this element more recently
this.items.set(element, { ...entry });
} else if (entry.removed && !ourEntry.removed) {
// Other replica removed it, but we haven't - add wins
// Keep the element
ourEntry.removed = false;
}
// If both removed, stays removed (no-op)
}
}
contains(element) {
const entry = this.items.get(element);
return entry && !entry.removed;
}
}
// Usage: shopping cart items, collaborative document elements
const cart = new ORSet("user-1");
cart.add("item-A");
cart.add("item-B");
cart.remove("item-B");
// Concurrent removal of item-B and addition of item-C from another device
cart.merge({
items: new Map([
["item-B", { addedBy: "user-1", addTag: 100, removed: true }],
["item-C", { addedBy: "user-1", addTag: 101, removed: false }],
]),
});
console.log(cart.contains("item-A")); // true
console.log(cart.contains("item-B")); // true (add wins)
console.log(cart.contains("item-C")); // true
CRDT Comparison Table
| CRDT Type | Operations | Merge Strategy | Use Cases |
|---|---|---|---|
| G-Counter | Increment only | Take max per replica | Vote counts, page views |
| PN-Counter | Increment + Decrement | Add positives, subtract negatives | Account balances |
| LWW-Register | Set value | Higher timestamp wins | Configuration, preferences |
| OR-Set | Add + Remove | Add wins over concurrent remove | Shopping carts, collaborative editing |
| RGA | Add + Remove (ordered) | Conflation of concurrent removes | Chat messages, collaborative text |
Transaction Coordination Protocols
Strong consistency in distributed transactions requires coordination protocols. This section covers two-phase commit, three-phase commit, and alternatives like the Saga pattern for long-running workflows.
Two-Phase Commit (2PC)
2PC is an atomic commitment protocol with two phases:
Phase 1 — Prepare:
The coordinator sends a PREPARE message to all participants. Each participant votes YES if it can commit (has locked resources, written redo logs) or NO if it must abort.
Phase 2 — Commit/Abort:
If all participants vote YES, the coordinator sends COMMIT and all participants apply the change. If any vote NO, the coordinator sends ABORT and all participants roll back.
// Simplified 2PC coordinator implementation
class TwoPhaseCommitCoordinator {
async execute(transaction, participants) {
// Phase 1: Prepare
const votes = await Promise.all(
participants.map((p) => p.prepare(transaction)),
);
const canCommit = votes.every((v) => v === "YES");
if (canCommit) {
// Phase 2a: Commit
await Promise.all(participants.map((p) => p.commit(transaction)));
return { success: true };
} else {
// Phase 2b: Abort
await Promise.all(participants.map((p) => p.abort(transaction)));
return { success: false, reason: "Participant voted NO" };
}
}
}
2PC Failure Modes
| Failure Point | Problem | Outcome |
|---|---|---|
| Coordinator crashes before prepare | Participants lock resources indefinitely | Timeout-based rollback required |
| Coordinator crashes after prepare but before commit | Participants do not know decision | Block waiting for coordinator |
| Participant crashes after prepare | Has locks but voted YES, does not know decision | Block or heuristic commit possible |
| Network partition during commit | Some participants commit, others do not | Inconsistent state |
The blocking problem is 2PC’s main weakness: if the coordinator fails after participants have voted YES, they block indefinitely waiting for the commit or abort decision.
Three-Phase Commit (3PC)
3PC adds a third phase to eliminate the blocking problem under synchronous networks with the finite upper bound on processor arrest:
Phase 1 — CanCommit: Coordinator asks if participants can commit (similar to 2PC prepare, but participants do not lock resources yet).
Phase 2 — PreCommit: If all say YES, coordinator sends PRE-COMMIT. Participants acknowledge and lock resources.
Phase 3 — DoCommit: Coordinator sends DO-COMMIT. Participants apply the change and release locks.
// 3PC eliminates blocking by adding the PreCommit phase
// If a participant times out in PreCommit, it can safely commit
// because everyone already voted YES and received PreCommit
// Timeout handling in 3PC participant
class ThreePhaseParticipant {
async handlePreCommit(transaction) {
this.state = "PRE_COMMITTED";
await this.writeRedoLog(transaction);
await this.acquireLocks(transaction);
// Send acknowledgment
await this.coordinator.ackPreCommit(transaction.id);
}
// If coordinator crashes here and we time out waiting for DoCommit:
// We can safely commit because:
// 1. We received PreCommit (meaning all voted YES)
// 2. We know enough participants are in PreCommitted or Committed state
async onDoCommitTimeout() {
if (this.state === "PRE_COMMITTED") {
await this.doCommit();
}
}
}
2PC vs 3PC Comparison
| Aspect | 2PC | 3PC |
|---|---|---|
| Blocking | Yes (coordinator failure) | No (under sync network assumptions) |
| Network rounds | 2 | 3 |
| Latency | Lower | Higher |
| Failure handling | Participant block on coordinator | Safe commit on timeout in PreCommit |
| Real-world usage | Most databases (with variations) | Rarely used directly |
| Assumptions | Crash-stop, eventually synchronous | Eventually synchronous, bounded delay |
Why 3PC is not widely used:
- The assumption of eventually synchronous networks does not hold in geo-distributed systems
- The protocol is sensitive to timing — if messages are delayed beyond bounds, the safety guarantees break
- In practice, Paxos/Raft consensus protocols are preferred over 3PC for strong consistency
Sagas
For workflows that span multiple services and cannot hold locks for minutes or hours, the Saga pattern replaces atomic commit with a sequence of local transactions, each with a compensating transaction to undo its work:
// Saga pattern: compensating transactions instead of 2PC
class OrderSaga {
steps = [
{ name: "reserveInventory", compensate: "releaseInventory" },
{ name: "processPayment", compensate: "refundPayment" },
{ name: "createShipment", compensate: "cancelShipment" },
];
async execute() {
const completed = [];
for (const step of this.steps) {
try {
const result = await this.executeStep(step);
completed.push({ step, result });
} catch (error) {
// Compensate in reverse order
for (const { step } of completed.reverse()) {
await this.executeCompensation(step);
}
throw error;
}
}
}
}
Sagas trade atomicity for availability and low latency — if a compensation fails, you need alerting and manual intervention, but the system remains available.
Chain Replication
Chain replication is an alternative to Paxos/Raft for achieving strong consistency. In wide-area networks, it can offer lower write latency than quorum-based approaches. This section covers how chain replication works, its guarantees, and comparison with Raft.
How Chain Replication Works
Data is replicated along a chain of nodes: Head → Middle → Tail. Writes arrive at the Head and propagate sequentially down the chain. Reads are served by the Tail (which has seen all committed writes).
graph LR
W[Write Request] --> H[Head Node]
H --> M1[Node 2]
M1 --> M2[Node 3]
M2 --> T[Tail Node]
R[Read Request] --> T
T --> RV[Return Value]
Write path:
- Head receives write, appends to its log
- Head forwards to next node; each node appends to its log and forwards
- When Tail receives and acknowledges, the write is committed
- Tail sends commit acknowledgment back up the chain to Head
- Head responds to client
Read path:
- Client sends read to Tail
- Tail has the most up-to-date committed data (all writes pass through it)
- Tail returns the value directly
Chain Replication Guarantees
Because writes propagate in order and the Tail always holds the latest committed state, reads from the Tail are strongly consistent without needing quorum.
// Chain replication write latency: O(chain_length)
// In a 3-node chain: Head → Node2 → Tail
// Write goes through all 3 nodes before client gets acknowledgment
// Compare to Raft quorum: write must reach (N/2 + 1) nodes
// In a 3-node Raft cluster, quorum is 2 — same as chain replication
// Difference becomes apparent with more nodes:
// Raft quorum with 5 nodes: 3 nodes
// Chain replication with 5 nodes: all 5 nodes in chain
// Chain replication drawback: failure of Head or Tail requires reconfiguration
// Raft can elect new leader from any majority node
Chain Replication vs Raft Comparison
| Aspect | Chain Replication | Raft Consensus |
|---|---|---|
| Write latency | O(N) sequential propagation | O(N) quorum (but parallel) |
| Read latency | O(1) — direct to Tail | O(N) if leader, O(N) if followers |
| Strong consistency | Yes (reads from Tail) | Yes (leader or quorum reads) |
| Leader bottleneck | No (writes flow through chain) | Yes (all writes go through leader) |
| Failure handling | Head or Tail failure needs rechain | Any server can become leader |
| Geo-distribution | Chain topology is fixed | Flexible majority quorum placement |
| Implementation | Simpler (deterministic chain) | More complex (leader election) |
| Used by | Azure Cosmos DB (some configs), CORFU | etcd, CockroachDB, Consul |
CORFU uses chain replication with flash storage for high-performance consensus in data centers.
Quorum-Based Systems
Quorum-based replication underpins many distributed databases. If you understand the math behind quorum selection, you can pick the right consistency level for your workload. This section covers the R/W/N model, quorum math, and practical trade-offs.
The R/W/N Model
In a system with N replicas, a quorum is a subset of replicas that must respond for an operation to succeed.
For read quorum R and write quorum W:
- A write must be acknowledged by W replicas
- A read must be acknowledged by R replicas
- To guarantee strong consistency: R + W > N
This inequality ensures that every read quorum overlaps with every write quorum — at least one replica in any read quorum has seen the latest write.
Quorum Math Table
| N | R | W | R + W > N | Consistency Guarantee | Use Case |
|---|---|---|---|---|---|
| 3 | 1 | 1 | No | Weak (eventual) | Maximum availability, any replica |
| 3 | 2 | 1 | No | Weak (eventual) | Fast writes, weaker reads |
| 3 | 1 | 2 | No | Weak (eventual) | Fast reads, weaker writes |
| 3 | 2 | 2 | Yes | Strong (quorum overlap guaranteed) | Cassandra QUORUM, DynamoDB |
| 3 | 3 | 3 | Yes | Strongest (all nodes) | Maximum durability, highest latency |
| 5 | 3 | 3 | Yes | Strong | Common production configuration |
| 5 | 2 | 3 | Yes | Strong | Read-heavy workloads |
| 5 | 3 | 2 | Yes | Strong | Write-heavy workloads |
Quorum Selection Trade-offs
// DynamoDB: N=3 replicas (automatic)
// Strong consistency: R=3, W=3 (all replicas must acknowledge)
// Eventually consistent: R=1, W=1 (any replica)
// Session consistency: R=1, W=1 with session sticky
// Cassandra: N=3, configurable
// ONE: R=1, W=1 — fastest, weakest consistency
// QUORUM: R=2, W=2 — balanced, strong consistency
// ALL: R=3, W=3 — slowest, strongest durability
// The key insight: R+W > N guarantees that reads see writes
// This is why QUORUM in a 3-node cluster (R=2, W=2) is strongly consistent
// Even if the leader crashes mid-write, quorum ensures overlap
Quorum with Node Failures: Read Repair Implications
When some replicas are down, quorum operations continue but staleness can increase:
// Scenario: N=5, R=3, W=3
// If 2 replicas are down:
// - Writes still succeed (W=3 remaining replicas)
// - Reads still succeed (R=3 remaining replicas)
// - BUT: the 2 dead replicas miss the write
// When dead replicas come back online:
// - They must sync via anti-entropy (read repair only catches divergent reads)
// - During the recovery window, reads from the dead replicas would be stale
// Best practice: monitor replica health
// If more than (N-W) replicas are down, writes should fail
// Because you cannot guarantee W replicas will persist the write
const MIN_REPLICAS_FOR_WRITE = N - (N - W);
if (availableReplicas < MIN_REPLICAS_FOR_WRITE) {
throw new Error("Insufficient replicas for safe write");
}
Sloppy Quorums and Hinted Handoff
Some systems (notably Dynamo) use sloppy quorums: when the primary quorum is unavailable, writes go to healthy nodes outside the preferred location. The data is hinted to be handed back when the original node recovers.
// Sloppy quorum: write succeeds even if preferred nodes are down
// Example: N=3, preferred nodes [A, B, C], all down
// Coordinator writes to [D, E, F] instead (sloppy quorum)
// When A comes back, D/E/F send hinted writes to A
// Trade-off: sloppy quorums sacrifice strict consistency for availability
// A read during the handoff window might not see a recent write
// But the system remains available during partial failures
// Hinted handoff sequence:
// 1. D receives write for A (A is down)
// 2. D stores the write with a hint: "deliver to A when available"
// 3. D periodically checks if A is back
// 4. When A recovers, D sends the hinted data to A
// 5. A applies the write and acknowledges to D
Data Repair Mechanisms
Distributed databases use two complementary mechanisms to maintain consistency: read repair and anti-entropy. This section explains both approaches and when each matters in production.
Read Repair (Reactive)
Read repair happens during read operations. When a replica returns stale data during a read, the system proactively pushes the fresh data to that replica.
// Conceptual read repair flow
async function readWithRepair(key) {
// Read from multiple replicas
const [replica1, replica2, replica3] = await Promise.all([
readReplica("replica1", key),
readReplica("replica2", key),
readReplica("replica3", key),
]);
// Check for divergence
const versions = [replica1, replica2, replica3];
const latest = findMostRecent(versions);
// If any replica is stale, repair it
for (const replica of versions) {
if (replica.version < latest.version) {
// Asynchronously push the correct value
repairReplica(replica, latest);
}
}
return latest.value;
}
Characteristics:
- Triggered by: Read operations
- Scope: Only repairs replicas that were contacted during the read
- Latency impact: Minimal (async repair)
- Coverage: Limited - replicas not read may remain stale
Anti-Entropy (Proactive)
Anti-entropy is a background process that continuously compares replicas and repairs any divergence, regardless of whether reads have occurred.
// Conceptual anti-entropy using Merkle trees
class AntiEntropyService {
constructor(replicas) {
this.replicas = replicas;
this.merkleTrees = new Map();
}
// Build Merkle tree for a replica's data range
buildMerkleTree(replica, range) {
const tree = new MerkleTree();
for (const [key, value] of this.getRange(replica, range)) {
tree.insert(hash(key + value.version));
}
return tree;
}
// Compare trees to find divergent keys
async compareReplicas(replicaA, replicaB) {
const treeA = this.buildMerkleTree(replicaA, "all");
const treeB = this.buildMerkleTree(replicaB, "all");
// Find nodes that differ
const diff = treeA.diff(treeB);
// For each divergent node, traverse down to find actual keys
const divergentKeys = [];
for (const node of diff) {
if (node.isLeaf()) {
divergentKeys.push(node.key);
} else {
// Fetch children for deeper comparison
const children = await this.fetchChildren(replicaA, node);
divergentKeys.push(...children);
}
}
// Sync divergent keys
await this.syncKeys(replicaA, replicaB, divergentKeys);
}
}
Characteristics:
- Triggered by: Background scheduled process (e.g., Cassandra nodetool repair)
- Scope: All replicas, regardless of recent reads
- Latency impact: Can be significant during repair operations
- Coverage: Comprehensive - finds all divergence
Comparison Table
| Aspect | Read Repair | Anti-Entropy |
|---|---|---|
| Trigger | On-read | Background/scheduled |
| Scope | Only read replicas | All replicas |
| Staleness window | Until next read of stale replica | Zero (when repair completes) |
| Resource cost | Low (per-read) | High (full tree comparison) |
| Failure detection | Detects on read | Detects all divergence |
| Example | Cassandra, DynamoDB | Cassandra nodetool repair, Riak |
When Each Mechanism Matters
| Scenario | Read Repair Sufficient? | Need Anti-Entropy? |
|---|---|---|
| Low read frequency | No - replicas may go stale for long periods | Yes |
| High read frequency | Yes - repairs happen frequently | No |
| Write-heavy workloads | No - stale replicas accumulate | Yes |
| Read-heavy workloads | Yes - continuous reads repair frequently | Optional |
| Compliance/audit requirements | No - need guaranteed convergence | Yes |
Practical note: Most production systems use both. Cassandra uses read-repair for immediate repairs during reads and nodetool repair (anti-entropy) as a periodic background job to ensure all replicas converge.
CAP Theorem Implications
Eric Brewer’s CAP theorem (2000) states that a distributed system cannot simultaneously provide Consistency, Availability, and Partition tolerance. In practice, partitions happen — networks fail, nodes become unreachable. When they do, you must choose between consistency and availability.
The CAP choice is not permanent. You can choose different trade-offs for different parts of your system, and you can even change the choice dynamically.
| Scenario | Trade-off | Examples |
|---|---|---|
| Partition occurs | CP (consistency over availability) | Zookeeper, etcd, MongoDB majority |
| Partition heals | System recovers with full consistency | All systems return to normal |
| No partition (normal ops) | Both consistency and availability | Depends on latency vs consistency needs |
CP systems (Zookeeper, etcd, MongoDB with majority) will refuse to serve reads or accept writes if a majority cannot be reached. They will never return stale or potentially incorrect data.
AP systems (DynamoDB, Cassandra, Riak) will continue serving reads and accepting writes even during partitions. They may return stale data but they remain available.
The nuance most engineers miss: “availability” in CAP is not about latency, it is about whether the system can respond at all. An AP system that returns 503 because it is overloaded is not “available” in CAP terms. A CP system that returns “I cannot confirm this write” is still available — it is just not accepting writes.
PACELC extends this: even without partitions, latency and consistency are in trade-off. A strongly consistent system has higher write latency because of coordination. This is often the more relevant trade-off in geo-distributed deployments.
Delta CRDTs
Standard CRDTs (G-Counter, OR-Set) track the full state of a data structure. For large data structures like counters or registers, this is acceptable. For growing data structures like lists or text, full-state CRDTs become expensive — the merge payload grows with the size of the data structure.
Delta CRDTs solve this by propagating only the delta (change) since the last sync, rather than the full state.
// Delta-G-Counter: only propagate incremented deltas, not full state
class DeltaGCounter {
constructor(replicaId) {
this.replicaId = replicaId;
this.fullState = {}; // { replicaId: count }
this.pendingDeltas = {}; // { replicaId: deltaSinceLastSync }
}
increment() {
this.fullState[this.replicaId] = (this.fullState[this.replicaId] || 0) + 1;
this.pendingDeltas[this.replicaId] =
(this.pendingDeltas[this.replicaId] || 0) + 1;
}
// Get delta to send to another replica
getDelta() {
const delta = { ...this.pendingDeltas };
this.pendingDeltas = {}; // Clear pending after sending
return delta;
}
// Merge received delta (not full state)
mergeDelta(delta) {
for (const [id, value] of Object.entries(delta)) {
this.fullState[id] = Math.max(this.fullState[id] || 0, value);
}
}
value() {
return Object.values(this.fullState).reduce((a, b) => a + b, 0);
}
}
// Usage: large counters that change frequently but need efficient sync
// Instead of sending full counter state (which could be huge), send only deltas
When Delta CRDTs matter: High-frequency counters (page views, vote counts) where the full state grows unbounded, but you need efficient network utilization. Most production CRDT deployments for large-scale systems use delta propagation.
Byzantine Fault Tolerance in Consistency
The consistency models discussed so far assume crash-stop failures: a node either works correctly or stops responding. Byzantine failures are worse — a node can behave arbitrarily, sending incorrect data, contradicting itself, or selectively responding to some nodes but not others.
Byzantine Fault Tolerance (BFT) is relevant for:
- Systems where nodes may be compromised (adversarial environments)
- Cryptocurrency blockchains (Bitcoin, Byzantine fault tolerance for consensus)
- Aerospace and critical infrastructure
Practical BFT protocols like PBFT (Practical Byzantine Fault Tolerance) require more than two-thirds of nodes to be honest. This is significantly more expensive than crash-fault tolerant protocols like Raft, which only needs a majority.
For most web and business applications, crash-fault tolerance (Raft, Paxos) is sufficient. Byzantine fault tolerance is overkill and adds substantial overhead. The exception is if you are building systems where nodes may behave maliciously — in which case you are probably working on consensus for cryptocurrencies or distributed ledgers, not general web services.
Common Pitfalls / Anti-Patterns
Pitfall 1: Assuming “Eventually Consistent” Means “Quickly Consistent”
Problem: Teams assume eventual consistency means consistent within seconds when it can take minutes or longer during high load or failures.
Solution: Measure your actual staleness distribution under various conditions. Do not make UX assumptions without data. Consider bounded staleness models if users need timeliness guarantees.
Pitfall 2: Ignoring Monotonic Read Violations
Problem: When reads route to different replicas, users can see data “go backward” - seeing a value, then an older value. This is deeply confusing even if technically “eventually consistent.”
Solution: Use sticky sessions or read-your-writes guarantees for user-facing applications. Route reads to the same replica within a session.
Pitfall 3: Using Vector Clocks Without Understanding Them
Problem: Implementing vector clocks for causal consistency seems simple, but the memory and bandwidth overhead grows with replica count and history length.
Solution: Most databases handle this internally (DynamoDB, Cassandra). Only implement custom vector clocks if you understand the trade-offs and have measured the overhead.
Pitfall 4: Testing Consistency at Wrong Layer
Problem: Testing consistency only at the unit test level misses distributed race conditions that only appear under concurrent load.
Solution: Use chaos testing frameworks (Jepsen, Chaos Monkey) to inject failures and verify consistency guarantees hold. Test your application, not just the database.
Real-World Case Studies
Amazon DynamoDB: Eventual Consistency at Scale
DynamoDB launched with eventual consistency as the default reading model. It was a deliberate engineering trade-off — strong consistency required quorum reads that doubled latency and tripled cost. The team knew most workloads could tolerate brief staleness.
The Dynamo paper (2007) described how they handled the Cassandra counter incident in 2010: a burst of traffic to a popular item caused replication lag, and thousands of reads returned stale data for several minutes. Their mitigation was straightforward — they rate-limited the popular key and let anti-entropy catch up. The incident revealed that bounded vs unbounded eventual consistency matters even at Amazon’s scale.
DynamoDB now offers both eventual and strong consistency options per read. The eventual model still handles the majority of requests — it’s faster and cheaper.
Google Spanner: Strong Consistency Across Geo-Distributed Nodes
Spanner achieves strong consistency across globally distributed nodes using TrueTime — GPS and atomic clock hardware that bounds clock uncertainty to ±7ms. Writes go through Paxos consensus, and reads quorum on the latest Paxos sequence number.
The operational cost is real: Spanner writes take 10-40ms in the same region, 100-200ms cross-region. This is why Spanner is expensive. The trade-off is deliberate — if you need globally consistent data with availability guarantees, you pay in latency and cost.
A known failure: Spanner’s leader lease mechanism means that if a datacenter loses power and its Paxos leader goes down, there is a lease expiration period where no writes can be accepted (typically 10 seconds). During this window the system is unavailable for writes, even though TrueTime could theoretically allow it.
Cassandra: Tunable Consistency Meets the 2018 Outage
Cassandra’s tunable consistency model lets you choose per-query — ONE, QUORUM, ALL. The problem: developers often chose ONE for speed, and did not understand what that meant for durability.
In October 2018, a large Cassandra deployment experienced data loss after a node failure during a configuration change. The root cause was a regression in how Cassandra’s failure detection and repair mechanisms interacted with the ONE consistency level for writes. With replication factor 3 and consistency level ONE, losing one node could silently lose writes that the other replicas had not yet acknowledged.
The fix was not purely technical — it required updating operational runbooks to mandate QUORUM for all writes in production, and adding tooling to enforce this automatically.
Quick Recap Checklist
- Eventual consistency guarantees that if no new updates are made, all replicas will eventually return the same value
- Strong consistency ensures all reads receive the most recent write — requires coordination (e.g., Paxos, Raft)
- Sequential consistency guarantees that all nodes see operations in the same total order as they were initiated
- Causal consistency ensures causally related operations are seen by all nodes in order
- Read-your-writes consistency guarantees a client always sees their own previous writes
- Monotonic reads guarantee that once a client reads a value, subsequent reads never return older values
- Monotonic writes guarantee that all writes by a client are processed in order across replicas
- Session consistency provides a good balance — a client’s reads within a session are consistent
- The CAP theorem constrains what consistency models are achievable during network partitions
- PACELC extends CAP to describe latency-consistency trade-offs even when the system is healthy
Real-world Failure Scenarios
Scenario 1: Google Gmail’s Eventual Consistency Bug (2009)
What happened: In 2009, Gmail experienced a bug where sent emails were occasionally not delivered to recipients for several hours, despite the sender receiving a “sent successfully” confirmation.
Root cause: Gmail used an eventually consistent storage backend for message routing metadata. A routine storage cluster migration caused the routing index to become stale. Messages were correctly stored but the lookup index pointed to the wrong storage partition.
Impact: Approximately 0.5% of emails sent during the affected window were delayed by 2-8 hours. Business-critical emails (contract updates, interview invitations) arrived late, causing real-world consequences.
Lesson learned: Eventual consistency is acceptable for non-critical data, but the “eventual” window can extend unpredictably during infrastructure operations. Critical notification systems need either synchronous confirmation reads or application-level delivery tracking.
Scenario 2: Amazon DynamoDB’s Strict Consistency Confusion
What happened: In 2014, developers building on DynamoDB noticed inconsistent behaviour when using strongly consistent reads after write operations. In some multi-partition scenarios, reads immediately following writes returned stale values despite DynamoDB’s documentation stating strong consistency was the default.
Root cause: DynamoDB’s global tables replicate across regions using an eventually consistent merge. Strongly consistent reads are only guaranteed within a single region. Cross-region strongly consistent reads were silently falling back to eventual consistency without clear documentation.
Impact: Several applications assumed cross-region strong consistency, leading to data divergence during failover scenarios. One company discovered their disaster recovery data was 15 minutes stale when they needed it most.
Lesson learned: Understand the exact consistency guarantees of your database — not just the marketing name. “Strong consistency” can have geographic and operational scope limitations that aren’t immediately obvious.
Scenario 3: Facebook’s TAO Consistency Model in Practice
What happened: Facebook’s TAO distributed data store experienced a bug where cached object relationships (e.g., “user X is friend of user Y”) would occasionally show contradictory results when queried from different data centre locations simultaneously.
Root cause: TAO uses a hierarchy of caches with eventual consistency between tiers. A bug in the invalidation propagation caused stale friendship edges to persist in secondary caches after the primary cache was invalidated.
Impact: Users occasionally saw friend suggestions that contradicted their actual friend list, or saw content permissions applied incorrectly. The inconsistency was subtle enough that most users attributed it to UI bugs rather than data inconsistency.
Lesson learned: Multi-tier caching systems compound consistency problems. Each cache tier adds its own “eventual” window to writes. Cache invalidation is harder than cache population — design invalidation as carefully as you design the caching logic itself.
Interview Questions
This is a read-your-writes consistency violation. The write went to one replica but the subsequent read hit a different replica that had not yet received the write. Fixes include routing reads to the primary or last-written replica for the session, using sticky sessions, or issuing reads with a higher consistency level (QUORUM) after a write.
The cart needs strong consistency or at minimum read-your-writes. If a user adds an item and it disappears, they lose a sale. However, price and inventory checks can use eventual consistency — it is acceptable if the displayed price is slightly stale, as long as the checkout process re-verifies. Session consistency with QUORUM reads works well here.
Read-your-writes guarantees that after writing value V, subsequent reads by the same session see V or a more recent value. Monotonic reads guarantee that if a session sees value V at time T, it will never see an older value at time T'>T. Read-your-writes is about your own writes; monotonic reads is about not going backward in time for any reason.
Strong consistency (linearizable) — using QUORUM or ALL consistency levels, or distributed transactions with two-phase commit. The cost is latency and throughput, but for financial operations where incorrect balances cause real harm, this is non-negotiable. Many systems use a separate strongly-consistent ledger for the actual money movement and use eventual consistency only for display/read purposes.
PACELC states: if there is a partition (P), you choose between availability (A) and consistency (C); else (E), even without partitions, latency (L) and consistency (C) are in trade-off. Strong consistency requires coordination — writes must go through a consensus protocol or quorum, which adds round trips. In a geo-distributed setting, a strongly consistent write might cross continents. An eventually consistent write can commit locally and replicate asynchronously, cutting latency significantly. The trade-off is that you may read stale data while replication is in flight.
For a read-heavy workload with strong consistency, R=3, W=3 (QUORUM) works well. R+W=6 > N=5 guarantees strong consistency. With 5 replicas, quorum is 3 — you need 3 replicas to acknowledge both reads and writes. For read-heavy workloads, you might also consider R=3, W=2 — this still gives R+W=5 > 5 for strong consistency, and reads are faster (3 replicas vs 3 for writes). The trade-off is slightly higher write latency since writes need 3 acknowledgments but reads only need 3 of 5 (same as quorum in this case).
2PC has a blocking problem: if the coordinator crashes after participants have voted YES, they wait forever. 3PC adds a PreCommit phase so participants know the decision before committing. If a participant times out waiting for DoCommit after PreCommit, it can safely commit because all participants already voted YES. However, 3PC assumes eventually synchronous networks — if messages are delayed beyond bounds, the safety guarantees break. In geo-distributed systems, this assumption does not hold. In practice, Paxos/Raft are preferred over 3PC because they handle leader elections more robustly and have cleaner failure semantics.
An RGA (Replicated Growable Array) or similar ordered sequence CRDT is ideal for collaborative text editing. RGA handles concurrent inserts at the same position by using causal ordering — if two users insert "A" and "B" at position 5 simultaneously, the result is deterministic regardless of network ordering. Deletions are typically implemented as tombstones so that concurrent delete and insert operations merge correctly (add-wins or delete-wins depending on semantics). This avoids the complexity of Operational Transformation while providing automatic conflict resolution. The key property: no matter what order operations arrive, the final document state is identical on all replicas.
Read repair is reactive — when a client reads from multiple replicas and detects a stale value, the system asynchronously pushes the fresh data to stale replicas during that read. Anti-entropy is proactive — a background process continuously compares Merkle trees (or similar) across all replica pairs to find and repair divergence, regardless of whether reads have occurred. You would rely only on read repair when: reads are frequent (stale replicas get repaired quickly), write volume is low (staleness accumulates slowly), and you can tolerate replicas being stale between their own reads. You need anti-entropy when: reads are infrequent (replicas stay stale for long periods), write volume is high (stale replicas accumulate missed writes), or compliance requires guaranteed convergence within a bounded time.
Chain replication's O(1) read latency from the Tail node is attractive, but it has a critical vulnerability: the Head and Tail are single points of failure. If the Tail fails, you cannot serve reads until a reconfiguration completes. If the Head fails, writes stop propagating. Raft elections can elect any majority node as leader, providing better fault tolerance. Additionally, chain replication's sequential propagation means write latency is O(chain_length) — in a wide-area chain spanning multiple data centers, this adds significant latency. Raft's parallel quorum-based approach can complete writes with fewer round-trips in geo-distributed settings. Chain replication shines in single-datacenter or high-performance local-area deployments where the topology is stable.
LWW appears simple but has a critical hidden cost: it depends on synchronized wall clocks across distributed nodes, which is notoriously unreliable. Clock skew, NTP drift, or VM pauses can cause the "last" write to actually be an older write that arrived later. This silently drops user data — imagine a user adding an item to their cart while another device removes an item, and due to clock skew the remove wins even though it happened first. Instead, I would recommend an OR-Set (Observed-Remove Set) CRDT for shopping carts — it preserves all add operations and uses causal ordering to handle concurrent adds and removes correctly. The trade-off is slightly more storage overhead, but the semantics are correct for e-commerce.
Common causes of replication lag spikes: (1) Network congestion or increased latency between datacenters — check cross-region latency metrics; (2) Write burst overwhelming replicas — check write throughput vs replica capacity; (3) Disk IOPS saturation on replicas — check disk queue depth and IOPS utilization; (4) Replica recovery after node failure — anti-entropy or read-repair catching up; (5) Garbage collection pauses in JVM-based replicas causing them to fall behind. To diagnose: check replication_lag_seconds per replica, compare lag spike timestamps with write throughput graphs, check disk and network metrics at the same time windows, and look for correlation with failure events or scaling events. Alerts should fire when lag exceeds acceptable bounds for your bounded eventual consistency window.
A sloppy quorum writes to the first N healthy nodes when preferred nodes are unavailable, rather than failing. The data is "hinted" to be delivered to the correct node when it recovers. You might choose a sloppy quorum when: (1) you prioritize write availability over immediate consistency — a node being down should not block all writes; (2) you have geographically distributed replicas and want writes to succeed even if a datacenter is partially unavailable; (3) your use case tolerates brief staleness as long as writes eventually succeed. The trade-off is that during the handoff window, reads might not see the most recent write. Dynamo-style systems use sloppy quorums deliberately — they are designed for highest availability, not strongest consistency.
The classic approach is a dual-path architecture: financial writes go through a strongly consistent path (Paxos/Raft quorum, or 2PC with a reliable coordinator), while analytics read from a separate eventually consistent replica. The key is event-driven replication: when a financial transaction commits, publish an event to a streaming system (Kafka, Kinesis) that asynchronously replicates to the analytics store. The analytics store uses its own eventual consistency model and can be tuned for read throughput rather than consistency. At consistency boundaries (e.g., month-end reporting), you take a snapshot from the strongly consistent store rather than relying on replication to have caught up. This pattern is used by most large-scale financial systems — the "source of truth" is strongly consistent and expensive; derived views are eventually consistent and cheap to query.
With N=3 and consistency level ONE, data loss occurs when: (1) A write is acknowledged after being written to only one replica, and that replica fails before the other two receive it — the write is lost; (2) A read returns data from a replica that later fails before anti-entropy repairs the other replicas — reads from that replica may have returned stale data in the window between the write and the failure; (3) A network partition splits the cluster such that a minority of replicas accept writes and then fail. Prevention: use QUORUM (R=2, W=2) for all production writes — R+W=4 > N=3 guarantees strong consistency and that at least one replica in any read quorum has seen every write. At minimum, use LOCAL_QUORUM for geographically distributed deployments to ensure reads and writes quorum within the local datacenter before acknowledging.
A vector clock is a per-object, per-replica timestamp structure that tracks the logical time at each replica. Unlike a wall-clock timestamp (which is single-valued), a vector clock is a map like {replica_A: 3, replica_B: 1, replica_C: 5}. It can capture causality: if replica A's clock is [3, 1, 5] and it receives an update from replica B with clock [3, 2, 5], A's merged clock is [3, 2, 5]. This lets you determine whether two versions of an object are causally related (one happened before the other) or concurrent (neither caused the other). You use vector clocks when: you need to implement causal consistency (Dynamo-style), you need to distinguish between "this version is older" vs "this version is concurrent but different," and your conflict resolution is more sophisticated than last-write-wins. The cost: vector clocks grow with replica count and history length, which is why most production systems either limit history or switch to LWW after a bounded window.
Isolation levels in ACID databases and consistency models in distributed systems are related but not identical. Isolation levels (READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, SERIALIZABLE) govern how concurrent transactions interact within a single database node. Consistency models govern how data appears across distributed replicas. SERIALIZABLE isolation maps roughly to linearizability — all operations appear to happen in a global total order. READ COMMITTED maps roughly to session consistency — you see only committed data, but you may see your own writes multiple times or not see them immediately. The key difference: isolation levels are about concurrency within one machine; consistency models are about state across multiple machines. A database can have SERIALIZABLE isolation locally but still be eventually consistent across replicas if replication is asynchronous. This is why PostgreSQL's serializable snapshot isolation (SSI) is not the same as linearizability across a distributed setup — the serializability guarantee holds within one node, not across asynchronous replicas.
MongoDB's "local" read concern reads from the primary replica, which means you always get the latest data the primary has acknowledged. However, it does not guarantee linearizability because: (1) The primary may have acknowledged a write that has not yet been replicated to a majority of replicas. If the primary fails and a new primary is elected from a replica that has not received this write, a subsequent read could go to the new primary and see a different (older) value. (2) "local" reads do not wait for replication to confirm the write is durable on a majority — they read whatever the primary has locally, which may not survive a failover. To get linearizability in MongoDB, you need "majority" read concern, which reads from the primary but waits to confirm the data has been replicated to a majority before returning it. This maps to the R+W > N model: majority reads guarantee you are reading from a replica that has seen all acknowledged writes.
You are likely seeing variable replication lag due to network conditions and consistency level. With a multi-region deployment and eventual consistency: cross-region network latency varies based on load, congestion, and routing — 50-200ms is typical for US-East to EU-West, but spikes happen during network congestion or BGP re-routing. The 200ms reads are likely hitting a replica that has already received the latest replication batch. The 2-5 second reads are hitting replicas still catching up on the replication queue. This is the "eventually" in eventual consistency doing what it says — convergence is not guaranteed within a bounded time unless you use bounded eventual consistency (with explicit latency SLOs) or strong consistency (QUORUM reads across regions, which would be 200-400ms per read). The fix depends on requirements: if users need to see their own writes immediately, implement read-your-writes by routing reads to the primary region for the session. If all reads need to be fast and consistent, you need a read-your-writes session guarantee plus replication lag monitoring.
Adding a cache layer fundamentally changes the consistency model because the cache and database can diverge. The cache provides a shortcut for reads — if data is in cache, you skip the database entirely. If the cache is not invalidated or updated correctly after a write, you get stale reads from cache. This can violate even read-your-writes consistency: a user writes data, it goes to the database, the cache is not invalidated, the user reads back and hits the stale cache. The solution is cache-aside with careful invalidation: after any write, delete or update the cache entry so the next read fetches fresh data from the database. You can also use write-through cache (cache updated synchronously with database, more consistent but adds write latency) or read-through cache (cache populated on miss, no consistency benefit). Cache TTLs introduce another problem: even with correct invalidation, a long TTL means reads can be arbitrarily stale if the cache entry was populated before a write. The rule: caches are best for data that is read frequently and written rarely. For frequently-written data, the cache adds more consistency headaches than performance benefit unless you invalidate aggressively.
Observability Checklist
Metrics to Capture
read_staleness_seconds(histogram) - Time between write and read reflecting itconsistency_level_used(counter) - Breakdown by consistency levelconflict_resolution_duration_ms(histogram) - Time spent resolving conflictsmonotonic_violations_total(counter) - When reads go backward in timereplication_lag_seconds(gauge) - Per replica, per datacenter
Testing Eventual Consistency
// Jepsen-style test for eventual consistency
async function testEventualConsistency(clients, writeKey, writeValue) {
// Write to all clients simultaneously
await Promise.all(clients.map((c) => c.write(writeKey, writeValue)));
// Poll until all clients return the value
const deadline = Date.now() + 30000; // 30 second timeout
while (Date.now() < deadline) {
const results = await Promise.all(clients.map((c) => c.read(writeKey)));
if (results.every((r) => r === writeValue)) {
return { success: true, timeMs: Date.now() - start };
}
await sleep(100);
}
return { success: false, results };
}
Alerts to Configure
| Alert | Threshold | Severity |
|---|---|---|
| Staleness > 1s | 1000ms | Warning |
| Staleness > 10s | 10000ms | Critical |
| Monotonic violations | > 0 in 5 minutes | Critical |
| Conflict rate > 1% | 1% of writes | Warning |
Security Checklist
- Authentication required for all replica communication
- TLS encryption for all inter-node replication traffic
- Audit logging of consistency level changes
- Network policies restricting replica access
- Encryption at rest for all replica data
- Client credentials rotated periodically
Further Reading
Academic Papers
- Dynamo: Amazon’s Highly Available Key-value Store — The paper that popularized eventual consistency at web scale
- Paxos Made Simple — Leslie Lamport’s accessible explanation of the consensus protocol
- Consistency and Consensus in Distributed Systems — Comprehensive survey of consistency models
Standards and Specifications
- CAP Theorem (Brewer’s Presentation) — Eric Brewer’s original PODC keynote presenting the CAP conjecture
- Base vs ACID — Stonebraker’s critique of ACID and defense of BASE
Production Guides
- Azure Cosmos DB consistency levels — Practical guide to tuning consistency in a globally distributed database
- Cassandra documentation: Tuning consistency — Real-world consistency tuning for Cassandra deployments
- Jepsen Consistency Testing — Formal testing of distributed systems consistency guarantees
Related Posts
For latency trade-offs, see the PACELC theorem post. For building highly available systems with consistency guarantees, see Availability Patterns.
Conclusion
- Consistency models exist on a spectrum from strong to weak.
- Read-your-writes matters for user-facing applications - always verify writes are visible.
- Monotonic reads prevent jarring rollback experiences in pagination and feeds.
- Causal consistency covers most real-world scenarios efficiently.
- Conflict resolution (LWW, CRDTs, app-merge) must match your data’s semantics.
- Most databases let you configure consistency per query - use this flexibility.
Copy/Paste Checklist
- [ ] Audit application to identify consistency requirements per operation
- [ ] Use strong consistency (QUORUM, LINEARIZABLE) for financial/inventory ops
- [ ] Implement read-your-writes for user-generated content
- [ ] Use sticky sessions to ensure monotonic reads
- [ ] Choose conflict resolution strategy: LWW, CRDT, or application merge
- [ ] Monitor replication lag and staleness metrics
- [ ] Test consistency guarantees under failure injection
- [ ] Document consistency requirements in ADR (Architecture Decision Records) Category
Related Posts
Google Spanner: Globally Distributed SQL at Scale
Google Spanner architecture combining relational model with horizontal scalability, TrueTime API for global consistency, and F1 database implementation.
Gossip Protocol: Scalable State Propagation
Learn how gossip protocols enable scalable state sharing in distributed systems. Covers epidemic broadcast, anti-entropy, SWIM failure detection, and real-world applications like Cassandra and Consul.
Synchronous Replication: Data Consistency Across Nodes
Learn about synchronous replication patterns in distributed databases, including Quorum-based replication and write-ahead log shipping for zero RPO.