Multi-Paxos

Multi-Paxos extends basic Paxos to achieve consensus on sequences of values, enabling practical replicated state machines for distributed systems and databases.

published: reading time: 25 min read author: GeekWorkBench

Introduction

Basic Paxos agrees on a single value. Run it once, and all non-faulty nodes commit to the same value. Run it again with a different value, and they commit to that. But what if you need to agree on a sequence of values, like a replicated log of commands? That’s what Multi-Paxos solves.

The extension is conceptually simple: treat each slot in the log as a separate instance of basic Paxos. The complexity comes from optimization—running Paxos from scratch for every log entry would be prohibitively expensive.

Core Concepts

Multi-Paxos optimizes away the prepare phase for consecutive entries once a leader is established. The insight is that if a proposer has recently won a majority vote, it can reasonably assume it remains the leader for a period. During this stable period, it can skip Phase 1 and go directly to Phase 2 for subsequent entries.

This optimization transforms Multi-Paxos from a two-phase protocol per entry to essentially a one-phase protocol after the initial leader election.

sequenceDiagram
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2

    rect rgb(40, 40, 60)
        Note over L,F2: First entry: Full Paxos (Phase 1 + Phase 2)
        L->>F1: Prepare(n=10)
        L->>F2: Prepare(n=10)
        F1-->>L: Promise(v=nil)
        F2-->>L: Promise(v=nil)
        L->>F1: Accept(n=10, v="cmd1")
        L->>F2: Accept(n=10, v="cmd1")
        F1-->>L: Accepted
        F2-->>L: Accepted
    end

    rect rgb(40, 60, 40)
        Note over L,F2: Subsequent entries: Optimized (Phase 2 only)
        L->>F1: Accept(n=10, v="cmd2")
        L->>F2: Accept(n=10, v="cmd2")
        F1-->>L: Accepted
        F2-->>L: Accepted

        L->>F1: Accept(n=10, v="cmd3")
        L->>F2: Accept(n=10, v="cmd3")
        F1-->>L: Accepted
        F2-->>L: Accepted
    end

Role of the Leader

Multi-Paxos typically designates a stable leader to avoid contention. Without leadership, multiple nodes might propose conflicting values for the same log position, causing retries and latency spikes.

The leader election can use basic Paxos itself (running a round of prepare messages to see if any node can get a majority). Some implementations piggyback on the first client’s request to establish leadership lazily.

Log Gaps and Truncation

Multi-Paxos truncates log entries at the last known stable index, skipping over gaps where commands were already decided. The gaps happen naturally when leaders change: the new leader may not have entries that were committed by the previous leader if they had not been fully replicated.

graph LR
    L1["Node 1<br/>Log: [A] [B] [C] [D] [E]"]
    L2["Node 2<br/>Log: [A] [B] [C]"]
    L3["Node 3<br/>Log: [A] [B] [C] [D]"]

    L1 -->|Committed| C1["Idx 1-5 committed"]
    L2 -->|Truncated| C2["Missing idx 4,5"]
    L3 -->|Truncated| C3["Missing idx 5"]

When a new leader takes over, it may have gaps in its log where other nodes already committed entries. Multi-Paxos handles this by having the leader re-propose missing entries rather than copying them — ensuring consistency without a full synchronization.

Skipped Log Indices

What happens if a client request needs to go into log position 100, but the log only has 50 entries? In practice, the leader detects gaps and either waits for the gap to be filled or replicates a no-op command to fill the gap. Some implementations use a snapshot-based approach where the state machine state is checkpointed and old log entries are discarded.

Multi-Paxos vs Raft

Raft was designed to be understandable by explicitly separating concerns. Multi-Paxos is more abstract and leaves many details unspecified. This distinction has practical implications.

graph LR
    subgraph "Design Philosophy"
        M[Multi-Paxos<br/>Abstract model<br/>Minimal assumptions] --> R[Raft<br/>Concrete specification<br/>Every case defined]
    end

    subgraph "Practical Trade-offs"
        M2[Multi-Paxos<br/>Fewer network rounds<br/>for stable workloads<br/>Harder to implement] --> R2[Raft<br/>Easier to implement<br/>Correctly<br/>More rounds]
    end
AspectMulti-PaxosRaft
Network Rounds (stable leader)1 RTT per entry1 RTT per entry
Leader ElectionVia Paxos (2+ RTTs)Heartbeat-based
Log Compactionunspecified (varies)Defined clearly (snapshot + log truncation)
Membership ChangesComplex, varies by implementationJoint consensus or single-server changes
Implementation DifficultyHigh (many unspecified details)Moderate (clear specification)
Leader StabilityCritical for performanceImportant but more resilient
Disk I/OImplementation-dependentTypically batched snapshots

Raft separates leader election from log replication explicitly. Multi-Paxos treats these as the same mechanism. The Raft paper gives you reference pseudo-code for all components; Multi-Paxos papers focus on correctness proofs rather than implementation guidance.

For new projects, Raft is usually the better choice. The clarity of specification reduces implementation bugs. For existing systems (Chubby, Zookeeper’s Zab), understanding Multi-Paxos helps you reason about their behavior.

Checkpointing and State Machine Recovery

Multi-Paxos logs grow indefinitely. Without checkpointing, recovery time grows proportionally to log length. Checkpointing (or snapshotting) solves this by periodically saving the state machine state and discarding old log entries.

What a Checkpoint Contains

A checkpoint is not just a memory dump. It must capture everything needed to reconstruct the state machine at a specific log index:

class Checkpoint:
    def __init__(self, last_applied_index, state, log_length):
        self.last_applied_index = last_applied_index
        # The highest log index whose command has been applied
        # All earlier entries are now reflected in state

        self.state = state
        # The actual state machine state (key-value pairs,
        # document contents, etc.)

        self.log_length = log_length
        # How many log entries existed when checkpoint was taken
        # Entries before this index can be discarded

The Replay Process

When a node restarts, it must reconstruct its state by reading the checkpoint and replaying log entries after the checkpoint index.

graph TB
    subgraph "State Machine Recovery"
        S1[Read last checkpoint<br/>last_applied=500<br/>state=Snapshot_500] --> S2[Find log entries<br/>501, 502, 503...]
        S2 --> S3[Replay entry 501<br/>Apply to state]
        S3 --> S4[Replay entry 502<br/>Apply to state]
        S4 --> S5[Replay entry 503<br/>Apply to state]
        S5 --> S6[State machine matches<br/>other replicas]
    end

Checkpoint Creation and Log Truncation

Checkpoints must be created consistently across replicas to avoid divergence.

sequenceDiagram
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2
    participant SM as State Machine

    Note over L: Checkpoint interval reached

    L->>SM: Create checkpoint<br/>state=apply(log 1-500)
    L->>F1: Send checkpoint + index=500
    L->>F2: Send checkpoint + index=500
    Note over SM: Log entries 1-500 now reflected<br/>in state

    F1->>F1: Apply checkpoint locally
    F2->>F2: Apply checkpoint locally

    L->>F1: Truncate log entries 1-500
    L->>F2: Truncate log entries 1-500

    Note over L,F2: All replicas now at<br/>checkpoint index 500

The leader coordinates checkpoint creation to ensure all replicas can truncate the same log entries. If a follower is slow, the leader waits before truncating.

Consistency Requirements

A checkpoint is only trustworthy if it reflects a consistent view of the state machine. In Multi-Paxos terms, this means the checkpoint index must be a committed log position.

If you checkpoint at index 400 but index 401 was not yet committed (no majority acknowledged), you might checkpoint state that other nodes never saw. After a failure, the replica might diverge.

Proper implementations only checkpoint at committed indices.

Scalability Considerations

Multi-Paxos scales well for write-heavy workloads under stable leadership, but certain limits emerge as cluster size increases.

Leader Bottleneck

The designated leader becomes a bottleneck at high throughput. Every write must go through the leader, which sends Accept messages to all followers and waits for majority acknowledgment. As more nodes join the cluster, the leader’s network and disk I/O burden grows proportionally.

For a 3-node cluster, the leader sends 2 messages per entry. For a 7-node cluster, it sends 6 messages per entry. The network overhead scales with cluster size, and disk I/O remains constant per entry regardless of cluster size.

# Throughput estimation under stable leadership
def estimate_throughput(node_count, fsync_latency_ms, network_latency_ms):
    # Time per entry = max(fsync_latency, network_round_trip) + processing
    round_trip = network_latency_ms * 2  # leader -> follower -> leader
    entry_time_ms = max(fsync_latency_ms, round_trip)
    max_throughput = 1000 / entry_time_ms
    return max_throughput

# 3 nodes, 5ms fsync, 1ms network: ~200 entries/sec
# 7 nodes, 5ms fsync, 1ms network: ~200 entries/sec (fsync bound)
# Same throughput, more network overhead for larger clusters

Read Scalability Workarounds

Multi-Paxos clusters often implement follower reads for horizontal read scalability. Followers serve stale reads by applying committed log entries to their local state machine. This trades consistency for throughput.

Client --> Leader: write x=1
Leader --> Followers: Accept(n=10, v="write x=1")
Followers --> Leader: Accepted
Leader --> Client: committed

Client --> Follower: read x  (returns local state, potentially stale)

For strong consistency reads, clients must route to the leader. Some systems use lease-based follower reads where the leader grants followers permission to serve reads for a time window, accepting staleness in exchange for distributed read throughput.

Membership Change Overhead

Adding nodes to a Multi-Paxos cluster requires careful coordination. The protocol does not specify membership changes, so implementations typically:

  1. Pause client traffic
  2. Propagate new configuration through the log
  3. Wait for the new configuration to commit
  4. Resume traffic

This pause duration grows with the number of nodes being added, making large-scale cluster expansion disruptive.

Durability and Performance Trade-offs

Multi-Paxos provides strong durability guarantees by default: a value is only committed when a majority acknowledges it. However, this creates performance tensions that production deployments must navigate.

The Durability Throughput Tension

Synchronous persistence before acknowledgment creates a hard throughput ceiling. Each entry must survive a crash to be considered committed, which means fsync completes before the leader responds to clients.

ConfigurationDurabilityThroughputCrash Loss
Sync per entryHighestLowest (~100-1000/sec)None
Batch fsync (8 entries)HighModerate (~5000/sec)Last batch
Async append + background syncModerateHigh (~20000/sec)Several seconds of writes

Battery-Backed Write Caches

Enterprise storage systems use battery-backed write caches (BBWC) to accelerate durability acknowledgments. The cache is battery-protected, so data in the cache survives power loss. The storage system acknowledges writes after they hit the BBWC, before data reaches disk.

Leader --> Storage: write entry 501
Storage --> BBWC: append entry
BBWC --> Storage: acknowledged (durable even if power lost)
Storage --> Leader: Accepted
Leader --> Followers: Accept(n=10, v="entry 501")

This enables high throughput while maintaining durability guarantees. If power fails, the BBWC flushes to disk on recovery.

epoch-based Leases

Some Multi-Paxos variants use epoch-based leases to batch acknowledgments across multiple entries. The leader proposes a range of entries and followers acknowledge the entire epoch with a single round trip.

# Epoch-based batching
def on_client_request(value):
    epoch = current_epoch
    entries.append(Entry(epoch, sequence++, value))

    if should_flush_epoch():
        # Send all entries in epoch together
        send_accept_batch(entries)
        wait_for_majority_ack(entries)
        commit_epoch(epoch)
        entries.clear()

This amortizes network round trips across many entries but introduces a vulnerability window. If the leader crashes before the epoch commits, all entries in the epoch are lost.

Relationship to Other Patterns

Multi-Paxos forms the backbone of many replicated log systems. My post on Two-Phase Commit discusses a different approach to coordinating distributed transactions, though it lacks the fault-tolerance guarantees that Multi-Paxos provides.

For building complete replicated services, you’ll need more than just Multi-Paxos. My posts on Database Replication and Distributed Transactions explore how consensus fits into broader system designs.

Practical Applications

Google’s Spanner uses Multi-Paxos variants for synchronizing replicas across data centers. Chubby uses it for lock service. Many distributed databases use the same underlying pattern: a replicated log of commands that is applied to each replica’s state machine.

The reason this works in practice: basic Paxos is expensive, but you only need it once to establish leadership. After that, the common case is fast.

Production Failure Scenarios

Scenario 1: The Phantom Write Problem

What happened: A financial trading system using Multi-Paxos lost orders during a leadership election. The old leader had received and logged an order but the new leader, having a shorter log, did not have the order committed.

Root cause: The client sent an order to the old leader, which wrote it to its local log and returned success to the client. However, the old leader crashed before replicating the entry to a majority of followers. The new leader, with a different log prefix, was elected and continued processing without the lost order.

Impact: Several customer orders were silently dropped. The trading system had to reconcile its actual state against what clients believed had succeeded.

Lesson learned: Clients must wait for commit acknowledgments from a majority before considering a request complete. Idempotency keys help with client-side retry safety.

Scenario 2: The Stale Checkpoint Catastrophe

What happened: A distributed database using Multi-Paxos experienced silent data corruption after a routine maintenance event when a checkpoint was restored from an uncommitted log index.

Root cause: The checkpoint was taken at log index 400, but index 401 had not yet been committed (no majority acknowledgment). When a follower took a local checkpoint including index 401’s partial state, and then crashed and restarted, it restored state that other replicas had never seen.

Impact: After recovery, that replica diverged from the cluster. Detecting the divergence required expensive checksum verification across all data.

Lesson learned: Checkpoints must only be taken at committed log indices. The committed index is the highest index that a majority has acknowledged—nothing less is safe.

Scenario 3: The Fsync Bottleneck Chain

What happened: A high-throughput key-value store using Multi-Paxos experienced severe write latency spikes during peak load, eventually timing out clients.

Root cause: The system ran on spinning disks where fsync latency dominated. At high throughput, the single leader became a bottleneck—the disk could not keep up with the rate of accepted proposals requiring synchronous persistence. Write queues grew, latency spiked, and clients timed out.

Impact: The service became effectively unavailable during peak load despite having adequate CPU and network capacity.

Lesson learned: Profile disk I/O patterns before deploying Multi-Paxos. Consider batching fsync calls, using battery-backed write caches, or switching to SSDs for write-heavy workloads.

Common Pitfalls / Anti-Patterns

Multi-Paxos sounds simple, but production implementations face real challenges. The gap between theory and practice is substantial.

Persistence: If the leader crashes before replicating to a majority, clients may think a command succeeded when it actually failed. Careful clients must track which entries are actually committed.

Checkpointing: The log grows indefinitely. Periodic checkpointing of the state machine lets you truncate old entries, but taking consistent checkpoints across a distributed system is its own problem.

Membership changes: Adding or removing nodes while the system is running requires careful coordination. Most implementations pause the protocol, change membership, and resume.

Leader Stability Challenges

Multi-Paxos needs a relatively stable leader to get its performance benefits. When leadership flips often, the protocol collapses back toward basic Paxos.

Consider a scenario with three nodes (N1, N2, N3) and a client issuing commands:

# Scenario: Leadership changes during command processing
# Time T1: N1 is leader
client_cmd_1 = {"seq": 1, "value": "write x=1"}
# N1 sends Accept to N2, N3

# Time T2: N1 fails, N2 becomes leader before cmd_1 commits
# N2 receives client_cmd_2 = {"seq": 1, "value": "write x=2"}
# N2 proposes seq=1 with value "write x=2"

# Time T3: N1 recovers, N3 thinks N1 is still leader
# N3 receives Accept from N1 for seq=1, N2 for seq=1
# Conflict: same sequence number, different values

If clients retry with the same sequence numbers after leader changes, conflicting proposals can emerge. The system must either assign sequence numbers centrally (defeating distributed benefits) or accept that some commands will be lost and require client-level retry logic.

Log Compaction Interaction

Checkpointing interacts with Multi-Paxos in subtle ways. A checkpoint must capture the complete state, but determining “complete” in a distributed setting is tricky.

graph TB
    subgraph "Checkpoint Creation Process"
        C1[Leader creates checkpoint<br/>of state machine state]
        C2[Send checkpoint to all followers<br/>asynchronously]
        C3[Followers apply checkpoint<br/>locally]
        C4[Leader truncates log<br/>at checkpoint index]
        C5[Clients can now read<br/>consistent snapshot]

        C1 --> C2 --> C3 --> C4 --> C5
    end

Here is the tricky part: a follower might be at log index 100 while the leader checkpointed at index 150. If the follower crashes and restarts, it must reconstruct state from the checkpoint plus log entries 101-150. If any of those entries were lost during the crash, the follower has an inconsistent view that no amount of Paxos rounds will fix.

Disk I/O Bottlenecks

Multi-Paxos requires durability before acknowledging accepted proposals. Each Accept message typically triggers a fsync to persistent storage.

# Typical accept handler with persistence
def on_accept(proposal_id, value):
    # 1. Write to log (sequential I/O, relatively fast)
    log.append(proposal_id, value)

    # 2. Persist to disk (fsync - orders of magnitude slower)
    log.flush_to_disk()  # ~5-10ms on spinning disk
    log.sync()           # fsync call

    # 3. Update in-memory state
    state.update(value)

    # 4. Send response
    return Accepted(proposal_id)

On a system with spinning disks, fsync latency dominates. A single leader processing thousands of commands per second can become bottlenecked on disk I/O. SSDs help, but the fundamental issue stays: every accepted proposal requires synchronous persistence.

Some implementations batch multiple accepts before fsync, trading durability for throughput. This means a crash might lose several recently-accepted commands.

Concurrency Control Limitations

Multi-Paxos decides on a sequence of values but provides no native concurrency control for overlapping writes to the same keys.

# Multi-Paxos does not handle this:
# Client A: write x = 1, read x
# Client B: write x = 2, read x
# If both operations touch the same key "x",
# Multi-Paxos guarantees they are ordered,
# but provides no mechanism to detect
# or prevent conflicting access patterns

# You must layer your own concurrency control:
# - External locking (Zookeeper advisory locks)
# - Optimistic concurrency (version vectors)
# - Pessimistic locking (2PL, MVCC)

For simple key-value stores, this is manageable. For complex transactions touching multiple keys, you need additional coordination.

Persistent State Requirements

Before sending Accept messages, a proper Multi-Paxos implementation must persist certain state to ensure crash safety.

# State that MUST be persisted before sending Accept
class AcceptorState:
    def on_prepare_response(self, proposal_id, accepted_id, accepted_value):
        # Persist promise before sending ack
        # If we crash after sending Promise but before persisting,
        # we might accept a lower-numbered proposal after restart
        self.promised_id = proposal_id
        persist_to_disk({
            'promised_id': self.promised_id,
            'accepted_id': self.accepted_id,  # Also persist accepted
            'accepted_value': self.accepted_value
        })

    def on_accept(self, proposal_id, value):
        # Must persist BEFORE sending Accepted response
        # If we crash after sending response but before persisting,
        # we might lose knowledge of what we accepted
        if proposal_id >= self.promised_id:
            self.accepted_id = proposal_id
            self.accepted_value = value
            persist_to_disk({
                'accepted_id': self.accepted_id,
                'accepted_value': self.accepted_value
            })
            return Accepted(proposal_id)

The invariant is simple: once you have promised not to accept lower proposals, or have accepted a value, that state must survive crashes. Violating this leads to split-brain scenarios where different nodes believe different values were chosen.

Quick Recap

Before diving into implementation, ensure you understand:

  • Basic Paxos agrees on a single value; Multi-Paxos extends this to a sequence of values
  • The optimization: skip Phase 1 (prepare) for consecutive entries once a leader is stable
  • A stable leader enables 1-RTT per entry instead of 2-RTT
  • Log gaps occur during leader changes and are handled by re-proposing missing entries
  • Checkpointing must occur at committed log indices to maintain consistency
  • Disk I/O (fsync) is the primary bottleneck for high-throughput Multi-Paxos implementations
  • Multi-Paxos provides no native concurrency control—layer it on top
  • Membership changes require pausing the protocol in most implementations

Interview Questions

1. What problem does Multi-Paxos solve that basic Paxos cannot?

Key points:

  • Basic Paxos only agrees on a single value; Multi-Paxos agrees on a sequence of values
  • This enables replicated state machines where a distributed log drives state changes
  • Each log position becomes a separate Paxos instance, but optimized for the common case
2. Explain the key optimization Multi-Paxos uses to reduce latency.

Key points:

  • After a leader wins a majority vote, it assumes leadership is stable for a period
  • Subsequent entries skip Phase 1 (prepare phase) and go directly to Phase 2 (accept phase)
  • This reduces 2-RTT per entry to 1-RTT per entry after leader election
3. What happens when a leader crashes during Multi-Paxos execution?

Key points:

  • A new leader must be elected, typically via Paxos itself (prepare messages)
  • The new leader may have gaps in its log where other nodes committed entries
  • Multi-Paxos handles this by re-proposing missing entries, not copying them
  • This ensures consistency without requiring full synchronization
4. Why is leader stability critical for Multi-Paxos performance?

Key points:

  • Frequent leadership changes collapse the protocol back toward basic Paxos
  • Each leadership change requires new Phase 1 (prepare) rounds before Phase 2
  • This eliminates the 1-RTT optimization benefit for subsequent entries
  • Systems must balance leader leases, heartbeats, and failure detection carefully
5. What is the purpose of log checkpointing in Multi-Paxos?

Key points:

  • Without checkpointing, recovery time grows proportionally to log length
  • Checkpointing periodically saves the state machine state and discards old log entries
  • A checkpoint must capture everything needed to reconstruct state at a specific log index
  • Only checkpoint at committed indices—uncommitted state would cause divergence
6. Describe the disk I/O bottleneck in Multi-Paxos and potential solutions.

Key points:

  • Every accepted proposal requires fsync to persistent storage before acknowledgment
  • fsync latency (~5-10ms on spinning disks) dominates throughput
  • Solutions: batch multiple accepts before fsync (trades durability), use SSDs, pipeline writes
  • Batching means a crash might lose several recently-accepted commands
7. How does Multi-Paxos handle membership changes (adding/removing nodes)?

Key points:

  • Membership changes require careful coordination to maintain consistency
  • Most implementations pause the protocol, change membership, then resume
  • The Paxos mechanism itself does not specify how to handle membership changes
  • Joint consensus or single-server changes are common approaches (as in Raft)
8. What concurrency control limitations does Multi-Paxos have?

Key points:

  • Multi-Paxos decides the order of values but provides no conflict detection
  • Overlapping writes to the same keys are ordered but not prevented
  • Additional layering required: advisory locks (Zookeeper), optimistic concurrency (version vectors), or pessimistic locking (2PL, MVCC)
  • For multi-key transactions, additional coordination beyond Multi-Paxos is essential
9. Compare Multi-Paxos and Raft in terms of implementation complexity.

Key points:

  • Multi-Paxos is abstract with many unspecified details—high implementation difficulty
  • Raft provides a concrete specification with pseudo-code for all components
  • Raft separates leader election from log replication explicitly; Multi-Paxos treats them as the same
  • For new projects, Raft is usually the better choice due to clarity of specification
10. What state must be persisted before sending Accept messages in Multi-Paxos?

Key points:

  • Promise state: once promised not to accept lower-numbered proposals, that must survive crashes
  • Accepted state: once accepted a value, that must persist before sending Accepted response
  • The invariant: both promised_id and accepted_id/accepted_value must survive crashes
  • Violating this leads to split-brain scenarios where nodes disagree on chosen values
11. How does Multi-Paxos relate to Google's Spanner and Chubby?

Key points:

  • Spanner uses Multi-Paxos variants for synchronizing replicas across data centers
  • Chubby uses Multi-Paxos for its lock service coordination
  • These systems demonstrate that basic Paxos, despite its theoretical nature, underlies production distributed systems
  • The optimization (1-RTT after leader election) makes it fast enough for production use
12. What is the relationship between Basic Paxos and Multi-Paxos?

Key points:

  • Multi-Paxos is not a separate protocol—it extends Basic Paxos
  • Each log slot is treated as an independent Basic Paxos instance
  • Leader election in Multi-Paxos is itself implemented using Basic Paxos
  • The "multi" part is an optimization layer on top of repeated Basic Paxos instances
13. Explain log truncation and why gaps might appear in a Multi-Paxos log.

Key points:

  • Truncation removes entries at the last known stable index, skipping gaps
  • Gaps naturally occur during leader changes
  • The new leader may not have entries that were committed by the previous leader
  • The leader re-proposes missing entries rather than copying them from other nodes
14. What happens when a follower crashes and restarts in a Multi-Paxos cluster?

Key points:

  • The follower must reconstruct state from its last checkpoint plus log entries after that index
  • If those log entries were lost during the crash, the follower has an inconsistent view
  • This inconsistency cannot be fixed by additional Paxos rounds
  • Proper persistence of promised and accepted state before responding prevents this
15. Why might clients need retry logic in a Multi-Paxos-based system?

Key points:

  • If the leader crashes after receiving a request but before replicating to a majority, the command may be lost
  • Careful clients must track which entries are actually committed before considering a request successful
  • Client retries with the same sequence numbers after leader changes can cause conflicting proposals
  • The system must either assign sequence numbers centrally or accept lost commands requiring client-level retry
16. How does leader stability affect Multi-Paxos throughput compared to basic Paxos?

Key points:

  • Under stable leadership, Multi-Paxos achieves 1-RTT per entry vs 2-RTT for basic Paxos
  • The optimization eliminates prepare phases for consecutive entries once leader is established
  • With frequent leadership changes, Multi-Paxos collapses to basic Paxos performance
  • Leader leases and failure detection tuning directly impact effective throughput
17. What are the trade-offs between synchronous and asynchronous checkpointing?

Key points:

  • Synchronous checkpointing pauses the protocol during checkpoint creation, ensuring all replicas stay in sync
  • Asynchronous checkpointing allows the leader to continue processing but risks divergent checkpoints if the leader crashes mid-creation
  • Synchronous is safer but impacts latency; asynchronous is faster but requires careful crash recovery handling
  • Hybrid approaches checkpoint when the system is idle or during low-traffic periods
18. Why does Multi-Paxos require a stable leader for its performance optimization?

Key points:

  • The optimization skips Phase 1 (prepare phase) assuming leadership remains stable
  • If leadership changes frequently, each new leader must run Phase 1 before Phase 2, losing the 1-RTT benefit
  • Systems use leader leases (time-bound authority) to maintain stability while allowing failure detection
  • The trade-off: aggressive leases improve performance but risk longer outages if the leader fails unexpectedly
19. What happens to pending client requests when a Multi-Paxos leader fails?

Key points:

  • Client requests that the old leader accepted but did not commit to a majority are lost
  • The client must retry the request, but retrying with the same sequence number risks conflicting proposals if a new leader is already processing different values
  • Idempotency keys help clients safely retry without causing duplicate commands
  • The new leader re-proposes missing log entries, but requests that only reached the old leader's log are effectively dropped
20. How does epoch-based batching improve Multi-Paxos throughput and what are its risks?

Key points:

  • Epoch-based batching collects multiple entries and sends them as a single accept round, amortizing network overhead
  • Followers acknowledge an entire epoch with one round trip, reducing per-entry latency
  • The risk: if the leader crashes before the epoch commits, all entries in that epoch are lost
  • Systems must balance batch size against crash loss window—larger batches mean higher throughput but greater loss on failure

Further Reading

Conclusion

Multi-Paxos extends the theoretical Paxos algorithm into a practical protocol for replicating logs and state machines. The optimization from two phases to one per entry makes it fast enough for production use.

Understanding Multi-Paxos helps when working with systems like Zookeeper, etcd, or distributed databases. But for new projects, Raft usually provides a cleaner foundation.

Category

Related Posts

Paxos Consensus Algorithm

Paxos is a family of consensus algorithms for achieving agreement in distributed systems despite failures, pioneered by Leslie Lamport.

#distributed-systems #consensus #paxos

Google Chubby: The Lock Service That Inspired ZooKeeper

Explore Google Chubby's architecture, lock-based coordination, Paxos integration, cell hierarchy, and its influence on distributed systems design.

#distributed-systems #databases #google

Raft Consensus Algorithm

Raft is a consensus algorithm designed to be understandable, decomposing the problem into leader election, log replication, and safety.

#distributed-systems #consensus #raft