Raft Consensus Algorithm

Raft is a consensus algorithm designed to be understandable, decomposing the problem into leader election, log replication, and safety.

published: reading time: 24 min read author: GeekWorkBench updated: March 24, 2026

Introduction

Raft was designed with a specific goal: be understandable. Diego Ongaro and John Ousterhout published the algorithm in 2014, explicitly stating that their aim was to create a consensus algorithm that practitioners could actually reason about. Their user study showed that people understood Raft better than Paxos, which matters because consensus algorithms are notoriously difficult to implement correctly.

The algorithm breaks the consensus problem into three independent subproblems: leader election, log replication, and safety. This decomposition makes it easier to reason about and implement than Paxos.

Leader Election

At any given time, each node in a Raft cluster is in one of three states: leader, follower, or candidate. All nodes start as followers. If a follower receives no communication from a leader within an election timeout, it becomes a candidate and initiates an election.

stateDiagram-v2
    [*] --> Follower
    Follower --> Candidate: Election timeout
    Candidate --> Follower: Another node wins
    Candidate --> Leader: Receives votes from majority
    Leader --> Follower: Higher term discovered
    Candidate --> Follower: Higher term discovered

    state Leader {
        [*] --> Active
        Active --> Active: Heartbeat received
    }

To win an election, a candidate must receive votes from a majority of nodes. Each node votes for at most one candidate per term, and nodes vote for the first candidate that requests their vote if that candidate’s log is at least as up-to-date as their own.

The election timeout is randomized within a range (typically 150-300ms). This randomness breaks ties and ensures that split votes are rare. When they do happen, nodes wait for a new timeout before trying again.

Log Replication

Once a leader is elected, it begins replicating log entries to followers. Clients send commands to the leader, which appends them to its log and then sends AppendEntries RPCs to all followers in parallel. Once a majority of nodes have written the entry, the leader applies it to the state machine and returns the result to the client.

The log is structured as a sequence of entries, each containing a term number and a command. The term numbers serve as logical clocks, helping nodes identify which entries are stale or conflicting.

sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2

    C->>L: Command: x=1
    L->>L: Append to local log
    L->>F1: AppendEntries(term=3, entry=x:1)
    L->>F2: AppendEntries(term=3, entry=x:1)
    F1-->>L: Success
    F2-->>L: Success
    L->>L: Apply to state machine
    L-->>C: OK
    Note over L: Committed: entry x=1

Safety

Raft’s safety guarantee is straightforward: if a command has been applied to the state machine on any node, no other node may apply a different command for the same log index. This is enforced through the voting mechanism—nodes will not vote for a candidate whose log is behind their own.

When a leader crashes, some of its log entries may not have been fully replicated to all followers. The new leader must bring all followers up to date. This process can involve sending conflicting entries or truncating follower’s logs.

Membership Changes

Raft handles cluster membership changes through joint consensus. When you need to add or remove nodes, the cluster transitions through a joint configuration where old and new nodes must agree before the transition completes. This prevents the scenario where two disjoint majorities could form and make conflicting decisions.

The joint consensus approach adds complexity but guarantees safety during transitions. You add nodes incrementally, giving the system time to replicate data and adjust quotas.

AppendEntries RPC

The AppendEntries RPC is how Raft’s leader replicates log entries to followers. Here’s the core structure:

type AppendEntriesArgs struct {
    Term         int    // Leader's current term
    LeaderID     int    // So followers can redirect clients
    PrevLogIndex int    // Index of log entry immediately before new ones
    PrevLogTerm  int    // Term of PrevLogIndex entry
    Entries      []byte // Log entries to store (empty for heartbeat)
    LeaderCommit int    // Leader's commitIndex
}

type AppendEntriesReply struct {
    Term    int  // Current term, for leader to update itself
    Success bool // True if follower contained entry matching PrevLogIndex/PrevLogTerm

    // Optimization: tell leader who to backtrack to
    ConflictIndex int // First index conflict starts at
    ConflictTerm  int // Term at ConflictIndex
}

A heartbeat is just an AppendEntries with an empty Entries slice. Followers expect heartbeats within the election timeout range (typically 150-300ms) to confirm the leader is still active.

Joint Consensus Membership Change

When adding or removing nodes, Raft uses a two-phase approach:

sequenceDiagram
    participant L as Leader
    participant O1 as Old Node 1
    participant O2 as Old Node 2
    participant N1 as New Node 3

    rect rgb(40, 40, 60)
        Note over L,N1: Phase 1: Joint Configuration
        L->>O1: C_old + C_new (joint config)
        L->>O2: C_old + C_new (joint config)
        L->>N1: C_old + C_new (joint config)
        O1-->>L: OK
        O2-->>L: OK
        N1-->>L: OK (catching up)
        Note over L: Committed joint config
    end

    rect rgb(40, 60, 40)
        Note over L,N1: Phase 2: New Configuration Only
        L->>O1: C_new only
        L->>O2: C_new only
        L->>N1: C_new only
        O1-->>L: OK
        O2-->>L: OK
        N1-->>L: OK
        Note over L: New config committed
    end

The key steps:

  1. Leader receives request to add/remove node(s)
  2. Leader creates joint configuration C_old + C_new
  3. Leader replicates joint config to all nodes
  4. Once joint config is committed, leader creates C_new config
  5. New config is replicated; old nodes outside new config can shut down

Performance Characteristics

Raft’s performance is dominated by the leader. It can process hundreds to thousands of commands per second depending on network latency and the number of followers. Write throughput scales with the leader’s ability to send RPCs in parallel to followers.

Read throughput can be improved by scaling followers for read-heavy workloads, but only if you accept stale reads. Strictly linear reads require confirming with the leader that it is still the current leader (using lease-based approaches or heartbeat confirmation).

Performance Bottleneck Table

ComponentLatency ContributorBottleneck
Leader electionClock drift + election timeoutWorst-case ~election timeout
Log replicationDisk fsync on followersDisk I/O is the ceiling
SnapshottingState machine serializationCPU and memory
Commit latencyMajority ack + disk writefsync latency
RPC round-tripNetworkGeographic distance

Performance Reference Table

ComponentTypical ValueWhat Affects It
Election timeout150-300ms randomNetwork jitter, load
Heartbeat interval50msThroughput vs sensitivity tradeoff
Log replication latency1-2 RTTsNetwork RTT, follower catch-up speed
Commit latency10-50msReplication factor, network latency
Snapshot sizeVariesApplication state size, snapshot frequency
Recovery timeElection timeout + catch-upHow far behind the new leader is

The leader can become a bottleneck under heavy write load. etcd mitigates this with batching and pipelining — batching multiple log entries into a single AppendEntries RPC reduces per-entry overhead. CockroachDB uses lease-based follower reads to scale read throughput without staleness risk.

Pre-Vote Mechanism

A partitioned node that re-joins the cluster can disrupt elections if it doesn’t realize its term is stale. When it sends a RequestVote with an old (higher) term, it can knock out the current leader and trigger an unnecessary election. The Pre-Vote phase prevents this.

Before becoming a candidate, a node first asks all other nodes: “Would you vote for me if I started an election?” If a majority responds yes, the node increments its term and proceeds with the real election. If not enough nodes respond, the node knows it is partitioned and does not disrupt the cluster.

sequenceDiagram
    participant F as Follower (partitioned)
    participant N1 as Node 1
    participant N2 as Node 2
    participant L as Leader

    F->>N1: Pre-Vote(term=9)?
    F->>N2: Pre-Vote(term=9)?
    Note over N1,N2: Current term = 5, leader is active
    N1-->>F: No (term too high, I'm with leader)
    N2-->>F: No (same reason)
    Note over F: Two rejections received — knows it is behind
    Note over F: Does NOT increment term
    Note over F: Does NOT send RequestVote
    Note over F: Stays quiet until it learns the new term

The Pre-Vote check forces the partitioned node to accept reality before it can cause damage.

Log Compaction and Snapshotting

Raft logs grow indefinitely. In production, you must compact the log — snapshotting is the mechanism that makes this safe.

Instead of keeping every log entry forever, the leader periodically creates a snapshot of the application state machine and discards all log entries up to that point. Each snapshot covers a contiguous prefix of the log.

graph LR
    A["Log entries 1-1000"] -->|Snapshot created| B["Log entries 1001-onwards"]
    A -->|Kept| S["Snapshot: state at idx 1000"]

When a follower falls too far behind, the leader sends its snapshot instead of log entries. The follower then replays the snapshot to rebuild its state, then continues from there.

Snapshot contents typically include:

  • Last included index (the last log entry in the snapshot)
  • Last included term (term of that index)
  • Application state (the actual data)
  • Last cluster configuration (membership info)

The tradeoff: snapshotting requires the application to support state machine snapshots, and sending large snapshots across the network is expensive. etcd recommends snapshotting every 10,000-20,000 entries for typical workloads.

Why Raft Wins in Practice

Raft’s emphasis on understandability has paid off. The algorithm has been implemented in many projects, including etcd (which backs Kubernetes), CockroachDB, and TiKV. The clear separation of concerns makes it easier to debug and verify.

Compared to Paxos, Raft has a stronger notion of leadership. A single leader processes all requests and coordinates replication, which simplifies the protocol at the cost of concentrating load on one node. When the leader fails, a new election happens quickly.

Relationship to Other Algorithms

Raft and Paxos achieve the same safety guarantees through different means. Paxos is more abstract and general; Raft is more prescriptive and practical. If you want to understand the theoretical foundations, I recommend reading about the CAP Theorem and how consensus algorithms navigate the consistency-availability trade-off.

Raft also relates closely to leader election concepts. My post on Leader Election explores different approaches to selecting leaders in distributed systems.

Limitations

Raft does not solve all distributed systems problems. It provides consensus for a single log, but building a full replicated state machine requires additional engineering. The algorithm also assumes bounded message delivery times, which can be violated in networks with long delays or partitions.

The leader can become a bottleneck. If the leader becomes overloaded or the network to the leader degrades, the entire cluster’s throughput suffers. Some systems route reads through followers with freshness checks, but this adds complexity.

Production Failure Scenarios

Scenario 1: Network Partition During Election

When a network partition splits the cluster, both sides may have a leader with different terms.

What happens:

  • Partition A (majority): Leader continues accepting writes, term increments
  • Partition B (minority): Followers timeout, elect new candidate with older term
  • When partition heals, node with lower term detects conflict and reverts to follower

Recovery:

  • Minority partition’s uncommitted entries become invalid
  • Leader from majority partition reconciles logs across all nodes
  • Client requests that did not commit return with errors or must be retried

Scenario 2: Disk I/O Bottleneck Causing Spurious Elections

If the leader experiences disk I/O delays (fsync latency spikes), heartbeats may be delayed beyond the election timeout.

What happens:

  • Followers timeout and become candidates
  • If disk latency clears and leader sends heartbeats, higher term propagates
  • Original leader detects higher term and steps down

Mitigation:

  • Use SSDs with consistent low-latency fsync
  • Adjust election timeout range to be longer than worst-case disk latency
  • Implement Pre-Vote to filter out spurious elections

Scenario 3: Snapshot Transfer During Catch-up

When a new node joins or a fallen behind node recovers, it may need a full snapshot instead of log entries.

What happens:

  • Leader detects follower is too far behind
  • Leader sends InstallSnapshot RPC with full state
  • Follower discards its log and replays snapshot
  • Follower then requests log entries from snapshot point

Risk:

  • Large snapshots can saturate network bandwidth
  • Follower is unavailable for reads during snapshot application

Trade-off Analysis

Leader-centric Design

Trade-offBenefitCost
Single leader writesSimple consistency modelLeader becomes throughput bottleneck
Heartbeat-based authorityClear leadership boundaryNetwork latency affects all operations
Client redirectAutomatic load balancingRequires redirect logic in clients

Membership Changes Trade-offs

Trade-offBenefitCost
Joint consensusSafety during transitionsTwo-phase complexity, slower changes
Incremental node addGradual replication rebuildQuorum shift during transition
Old node graceful exitNo downtime during maintenanceMust wait for catch-up before removing

Consistency vs Availability

Trade-offBenefitCost
Committed after majorityDurable writes, clear safetyHigher latency for geo-distributed clusters
Follower readsScaled read throughputPotential stale data reads
Lease-based readsReduced leader communicationClock skew risk, bounded staleness

Quick Recap Checklist

  • Raft decomposes consensus into three subproblems: leader election, log replication, and safety
  • Nodes exist in one of three states: follower, candidate, or leader
  • A candidate must receive votes from a majority of nodes to become leader
  • Election timeout is randomized (typically 150-300ms) to prevent split votes
  • Log entries contain term numbers that serve as logical clocks
  • An entry is committed once a majority of nodes have written it
  • Raft safety guarantee: no other node can apply a different command for the same log index
  • Membership changes use joint consensus (two-phase approach)
  • Pre-Vote mechanism prevents partitioned nodes from disrupting elections
  • Log compaction via snapshotting discards old entries to prevent unbounded growth

Interview Questions

1. Explain the three roles a Raft node can assume and the conditions for transitioning between them.
Nodes take on one of three roles at any given time. Followers are passive, responding to requests from leaders and candidates. Candidates step up when their election timer fires with no heartbeat from a leader. Leaders handle all client traffic and send heartbeats to maintain authority. The transitions work like this: a follower becomes a candidate when its timer fires, a candidate becomes leader when it gets majority votes, and both candidate and leader step back down to follower when they encounter a higher term number.
2. Why does Raft randomize the election timeout, and what problem does this solve?
Without randomization, all followers would hit their election timers simultaneously during a leader outage. They'd all become candidates at once and split votes endlessly. The 150-300ms random window gives one node a head start, letting it gather majority votes before the others even wake up. After a split vote, the same randomization kicks in again to break the tie.
3. How does Raft ensure the safety property that only one value can be committed for a given log index?
The voting restriction enforces this. Nodes only vote for candidates whose logs are at least as up-to-date as their own, meaning either a higher last term or the same term with equal or greater length. This blocks candidates with incomplete logs from winning elections, so any leader that gets elected already has all previously committed entries. Once a leader commits an entry, no future leader can have a different entry at that index.
4. Describe the Raft log replication process step by step when a client submits a command.
The client sends a command to the leader. The leader appends it to its local log, then fires off AppendEntries RPCs to all followers in parallel. Each follower writes the entry to its own log and replies. Once the leader sees acknowledgments from a majority (including itself), the entry is committed. The leader applies it to its state machine and responds to the client. Subsequent heartbeats tell followers to apply the entry themselves.
5. What is the purpose of term numbers in Raft, and how do they function as logical clocks?
Terms are monotonically increasing integers that increment each time an election starts. They act as logical clocks, letting nodes reason about freshness. If a node's term is lower than an incoming request, it rejects that request because it knows its information is stale. Terms also help nodes detect when they should step aside. A leader that sees a higher term knows there's a newer leader out there and steps down.
6. Explain the two-phase approach Raft uses for membership changes and why it is necessary.
Joint consensus uses two phases. First, the cluster transitions to a joint configuration C_old + C_new, which requires majority agreement from both old and new nodes before committing. Second, once that's stable, the leader switches to C_new alone. A direct one-step transition is dangerous because two disjoint majorities could form: {1,2} in the old config and {3,4,5} in the new, allowing conflicting decisions to slip through. The joint config forces an overlap that prevents this split-brain scenario.
7. What is the Pre-Vote mechanism and what problem does it solve?
When a partitioned node re-joins with a stale high term, it can accidentally knock out the current leader by sending RequestVote with that old term. The Pre-Vote phase prevents this by asking all nodes first: "Would you vote for me if I ran an election?" If a majority says yes, the node increments its term and proceeds for real. If not enough nodes respond yes, the node knows it's behind and stays quiet, avoiding the disruption.
8. How does Raft handle log compaction and snapshotting?
Logs grow forever, so Raft uses snapshotting to bound storage. The leader periodically snapshots the application state machine and discards all log entries up to that point. A snapshot contains the last included index, its term, the application state, and the cluster configuration. When a follower falls too far behind, the leader sends the snapshot directly instead of replaying potentially thousands of entries over the network.
9. What is the role of the AppendEntries RPC, and what fields does it contain?
AppendEntries is how the leader replicates log entries and maintains authority. It carries Term (the leader's current term), LeaderID (for redirecting clients), PrevLogIndex and PrevLogTerm (to verify log consistency), Entries (the new entries to write, empty for heartbeats), and LeaderCommit (the commit index). When Entries is empty, it's just a heartbeat. Followers respond with Success plus optional ConflictIndex and ConflictTerm to tell the leader where its log diverges.
10. What are the performance characteristics and bottlenecks of Raft?
Raft is leader-bound. Write throughput scales with how fast the leader can send RPCs in parallel to followers. The actual bottleneck is usually disk I/O on followers, specifically fsync latency for committing entries. Leader election latency is bounded by the election timeout. Geographic distance kills performance because every write needs a round trip to the majority. Mitigations include batching multiple entries per RPC, pipelining RPCs in flight, and offloading reads to followers with lease-based freshness checks.
11. How does Raft compare to Paxos in terms of understandability and practical implementation?
Raft was designed to be understandable; Paxos was designed to be theoretically correct. Raft breaks the problem into three clear subproblems with explicit implementation steps. Paxos is more abstract and general, but the original paper only describes single-decree Paxos, and building a practical system requires figuring out Multi-Paxos on your own. The strong leadership model in Raft concentrates load on one node, which simplifies reasoning but creates a bottleneck. Paxos allows any node to propose, which is more flexible but needs more coordination.
12. What is linearizability in the context of Raft, and how does it affect read operations?
Linearizability means every operation appears to take effect at some instant between when it was called and when the response came back. For Raft reads to be linearizable, you need to confirm the leader is still actually the leader, otherwise you might read stale data from a deposed leader. Lease-based reads assume the leader stays valid for the lease duration without heartbeats. If you accept stale reads, you can route reads to followers and scale read throughput horizontally, trading consistency for performance.
13. What happens during the recovery process when a Raft leader crashes with uncommitted entries?
When the leader crashes, some entries may be in flight, written to some followers but not yet committed. The new leader must reconcile everyone's logs. If a follower is behind, the leader sends the missing entries. If there's a conflict, the follower reports back where its log diverged (ConflictIndex, ConflictTerm), and the leader adjusts. Entries that were never committed by the old leader are not automatically committed by the new one; they have to survive another round of replication first.
14. Describe the commitment process in Raft. When is an entry considered committed?
An entry is committed when a majority of nodes have written it to their logs. The leader tracks this with commitIndex. Once committed, the leader applies the entry to its state machine. It then notifies followers of the new commitIndex via the LeaderCommit field in AppendEntries, and followers apply the entry on their end. Committed entries survive leader crashes because the majority already has them durable on disk.
15. What are the key differences between Raft's AppendEntries RPC and RequestVote RPC?
AppendEntries handles log replication and heartbeats. It includes PrevLogIndex and PrevLogTerm so followers can verify consistency, plus the actual Entries to write and LeaderCommit to advance the commit index. RequestVote is for election campaigning. It carries the candidate's last log term and log length so voters can enforce the election restriction. One subtlety: RequestVote can be sent during Pre-Vote without incrementing the candidate's term, allowing nodes to gauge viability before disrupting the cluster.
16. What are the assumptions Raft makes about network behavior, and what happens when these are violated?
Raft assumes messages arrive within the election timeout window and that terms increase monotonically. If messages are delayed indefinitely, followers may elect a new leader even though the old one is still alive, resulting in split-brain. Network partitions can prevent the cluster from reaching consensus, and Raft chooses availability over consistency during partitions. Byzantine failures (malicious or corrupted messages) are not handled by the base protocol. In practice, you need additional mechanisms for security-sensitive deployments.
17. How does etcd use Raft, and what optimizations does it implement?
etcd uses Raft to replicate its key-value state across nodes, providing the consistent store that Kubernetes uses for service discovery and configuration. It batches multiple log entries into a single AppendEntries RPC to reduce per-entry overhead and pipelines RPCs to overlap round trips. Snapshots trigger every 10,000-20,000 entries in typical workloads. For reads, etcd supports lease-based linearizable reads without touching the leader, improving read latency under low-contention workloads.
18. What is the purpose of the election restriction in Raft, and how is it enforced?
The election restriction prevents nodes with incomplete logs from disrupting committed state. When a candidate requests a vote, the voter checks whether the candidate's log is at least as up-to-date as its own. The comparison works as follows: the candidate with the higher last term wins, or if terms are equal, the longer log wins. This ensures any leader that gets elected has all previously committed entries. The restriction is enforced in the RequestVote handler, which rejects candidates that fall short.
19. Explain the concept of joint consensus in detail and why a direct transition is unsafe.
Joint consensus is a two-phase approach for cluster membership changes. In phase one, the leader creates a joint configuration C_old + C_new and replicates it to all nodes. This can't commit unless a majority of both old and new nodes agree. Once the joint config commits, phase two begins: the leader replicates C_new alone. A direct transition from C_old to C_new is unsafe because the membership change might result in two separate majorities that never overlap, allowing each to make conflicting decisions. The joint config forces a window where both configurations must agree.
20. How does the Raft leader handle read-heavy workloads, and what are the trade-offs involved?
If all reads go through the leader, the leader becomes a bottleneck. You can route reads to followers with a freshness check: the follower confirms with the leader that it hasn't been deposed. Lease-based reads are faster but depend on clock skew bounds. The trade-off is that follower reads can return stale data if a partition just happened and the old leader is still serving reads on the minority side. For true linearizability, you need to confirm leadership before each read.

Raft vs Paxos Comparison

While Raft and Paxos both solve the consensus problem and provide the same safety guarantees, their approaches differ significantly in practice.

AspectRaftPaxos
Design goalUnderstandabilityTheoretical correctness
StructureStrong leadership, single leaderMulti-decree, any node can propose
DecompositionLeader election + Log replication + SafetySingle protocol with variants
ComplexityLower (modular subproblems)Higher (abstract elegance)
Implementation guidanceExplicit steps providedAbstract, requires extensions
Cluster membership changesBuilt-in joint consensusRequires additional protocols
Production adoptionetcd, CockroachDB, TiKVChubby (Google), Spanner

Raft’s stronger notion of leadership means all requests flow through a single leader, which simplifies reasoning but concentrates load. Paxos allows any node to propose values, which is more flexible but requires more complex coordination.

The original Paxos paper (Lamport, 1998) described single-decree Paxos, and Multi-Paxos extensions are needed for practical systems. Raft was designed from the ground up as a complete consensus protocol suitable for building replicated state machines.

Further Reading

Conclusion

Raft succeeded at its primary goal: being understandable. It provides a practical foundation for building replicated systems without the complexity of more general consensus protocols. If you’re building a distributed system that needs consensus, Raft is often the right starting point.

For more on how consensus fits into broader distributed systems patterns, see my post on Consistency Models.

Category

Related Posts

etcd: Distributed Key-Value Store for Configuration

Deep dive into etcd architecture using Raft consensus, watches for reactive configuration, leader election patterns, and Kubernetes integration.

#distributed-systems #databases #etcd

Leader Election in Distributed Systems

Leader election is the process of designating a single node as the coordinator among a set of distributed nodes, critical for consensus protocols.

#distributed-systems #leader-election #consensus

Multi-Paxos

Multi-Paxos extends basic Paxos to achieve consensus on sequences of values, enabling practical replicated state machines for distributed systems and databases.

#distributed-systems #consensus #paxos