Two-Phase Commit Protocol Explained

Learn the two-phase commit protocol for distributed transactions: prepare and commit phases, coordinator role, failure handling, and trade-offs.

published: reading time: 29 min read author: GeekWorkBench

Two-Phase Commit Protocol Explained

Two-phase commit (2PC) is a protocol for achieving atomic commitment across multiple database nodes. All nodes either commit or rollback together. No partial states.

The idea is appealing: distributed transactions that behave like single-node transactions. In practice, 2PC has failure modes that make it problematic, and most systems reach for saga instead.

This post explains how 2PC works, the failure scenarios that cause trouble, and why it fell out of favor.

Introduction

2PC works in two phases: prepare and commit.

Phase 1: Prepare

The coordinator sends a prepare message to all participants. Each participant votes:

  • Vote Yes: The participant has completed its work, its locks are held, and it is ready to commit.
  • Vote No: Something went wrong. The participant rolls back and releases locks.
%%{ wrappingType: "word"}%%
sequenceDiagram
    participant Coordinator
    participant P1 as Participant 1
    participant P2 as Participant 2
    participant P3 as Participant 3
    Coordinator->>+P1: Prepare
    Coordinator->>+P2: Prepare
    Coordinator->>+P3: Prepare
    P1->>-Coordinator: Vote Yes
    P2->>-Coordinator: Vote Yes
    P3->>-Coordinator: Vote No

If any participant votes no, the coordinator sends rollback to all. Done.

Phase 2: Commit

If all participants vote yes, the coordinator sends commit to all. Each participant commits its local transaction and releases locks.

%%{ wrappingType: "word"}%%
sequenceDiagram
    participant Coordinator
    participant P1 as Participant 1
    participant P2 as Participant 2
    participant P3 as Participant 3
    Coordinator->>+P1: Commit
    Coordinator->>+P2: Commit
    Coordinator->>+P3: Commit
    P1->>-Coordinator: Committed
    P2->>-Coordinator: Committed
    P3->>-Coordinator: Committed

All participants must commit for the transaction to succeed.

Core Concepts

Coordinator and Participants

The coordinator runs the whole show. It manages voting, decides the outcome, and tells participants what to do.

In practice, the coordinator is often a separate service or embedded in the application. Some databases (like Percona XtraDB) implement 2PC internally for distributed transactions.

The coordinator must be reliable. If it crashes, participants may be left in limbo.

Embedded vs External Coordinator

AspectEmbedded CoordinatorExternal Coordinator
DefinitionCoordinator logic runs within the application process initiating the transactionCoordinator is a separate service/process that manages 2PC across participants
ExampleApplication code directly coordinates MySQL/PostgreSQL XA transactionsDedicated service like Narayana, WebSphere LTCO, or custom coordinator service
AdvantagesSimpler deployment; no additional service to operate; lower latency for single-site transactionsCoordinator survives application crashes; easier to monitor and manage; better for distributed multi-service transactions
DisadvantagesApplication crash kills the coordinator; harder to monitor; participants may block if application is overloadedAdditional dependency; network hop to coordinator; requires HA configuration for coordinator
Coordinator SPOFYes — if the application crashes, coordinator dies with itYes — unless configured in HA mode with consensus (e.g., Paxos-based)
When to UseSingle application coordinating local database partitions; simple XA transactions within one database clusterMulti-service transactions; transactions spanning multiple applications; when coordinator must survive application restarts
RecoveryApplication restart recovers coordinator state from persistent storageCoordinator service restart with state persisted to disk or replicated via consensus

Failure Handling

2PC has several failure scenarios that cause problems.

Participant Crashes Before Voting

If a participant crashes before voting, the coordinator times out and sends rollback. Other participants roll back. The crashed participant, when it recovers, also rolls back (or has nothing to do if it had not started work).

This case is manageable.

Participant Crashes After Voting Yes but Before Commit

This is where things break down. The participant has voted yes — it has acquired locks and finished its work. Now it is waiting for the coordinator to say commit or rollback.

If the coordinator crashes before sending the decision, this participant is stuck. It holds its locks indefinitely. Other participants have committed (if they received the commit message before crashing) or rolled back (if they received rollback or voted no).

The participant cannot decide unilaterally. It cannot commit (because the coordinator might have sent rollback to others). It cannot rollback (because the coordinator might have sent commit to others).

This is the blocking problem. The participant must wait for the coordinator to recover.

Coordinator Crashes

If the coordinator crashes after collecting yes votes but before sending commit, participants are stuck. They must wait for the coordinator to recover.

If the coordinator never comes back, participants stay stuck. In practice, systems fall back on timeouts and manual intervention.

Network Partition

If a participant cannot reach the coordinator, it blocks. If the coordinator cannot reach a participant, it must treat it as a no vote and send rollback.

Network partitions are common in distributed systems. 2PC does not handle them gracefully.

Alternatives to 2PC

Why 2PC Is Rarely Used

The blocking problem is what kills 2PC in practice. In a system that needs to stay available — which is most systems — waiting indefinitely for coordinator recovery is not an option.

Other protocols address this. Three-phase commit (3PC) adds a pre-commit phase to eliminate blocking, but it still assumes a synchronous system and makes stronger network assumptions. It is rarely used in practice either.

The saga pattern avoids locking entirely. Instead of atomic commitment, saga uses compensating transactions. If step 3 fails, steps 1 and 2 are undone. The system remains available. The cost is temporary inconsistency. For saga pattern details, see Saga Pattern.

Event sourcing stores events, not state. The event log is the source of truth. Rebuilding state is a matter of replaying events. This avoids distributed transactions altogether.

Three-Phase Commit (3PC)

3PC attempts to solve the blocking problem by adding a third phase between prepare and commit. The key difference is that 3PC is non-blocking, assuming synchronous enough network conditions.

Phase 1: Prepare — Same as 2PC, coordinator asks participants to vote.

Phase 2: Pre-commit — If all vote yes, coordinator sends pre-commit. Participants acknowledge receipt and stay in pre-commit state.

Phase 3: Commit — Coordinator sends commit, participants commit.

The pre-commit phase acts as an insurance policy. If the coordinator crashes during pre-commit, participants know a commit decision was made. They can decide to commit rather than block indefinitely.

Aspect2PC3PC
BlockingYes — blocks on coordinator failureNon-blocking (under synchronous assumptions)
PhasesPrepare → CommitPrepare → Pre-commit → Commit
Coordinator crashParticipants block indefinitelyParticipants can decide (commit) if pre-commit seen
Network partitionMay block indefinitelyMay still block if timeout assumptions fail
Message complexity3 messages per participant5 messages per participant
Failure handlingBlocking; manual interventionStill assumes network synchrony; less practical
In-doubt stateYes — participant stuck if crashPre-commit state gives participants decision capability
Real-world usageEmbedded in databases (XA)Rarely used; considered impractical for most systems
Atomicity guaranteeAll-or-nothingAll-or-nothing (same as 2PC)

3PC sounds better on paper. The problem is that it assumes the network behaves predictably — that messages arrive within a known timeout and that nodes do not fail simultaneously. In real distributed systems, these assumptions break. Networks partition. Nodes crash and come back. The protocol that should eliminate blocking can still block under realistic failure scenarios.

Most engineers who need distributed transaction guarantees end up with either a properly configured 2PC (with Paxos-based coordinator) or saga. 3PC occupies an awkward middle ground: it solves the blocking problem in theory but adds complexity without solving the underlying network assumptions that make it impractical.

Database Implementations and Practical Usage

Database Implementations

Most real-world 2PC implementations don’t look like the textbook version. The blocking problem is well-known, so database teams built around it. Spanner, CockroachDB, and TiDB each took a different path — but they share one idea: make the coordinator state survive crashes.

Database-Specific Implementations

Google Spanner

Spanner uses Paxos groups as the coordinator. Instead of one node deciding and hoping it doesn’t crash, the group agrees on the decision. When a transaction touches multiple participant groups:

  1. The coordinator leader proposes a prepare timestamp via Paxos
  2. Participants acknowledge via Paxos consensus
  3. The coordinator leader assigns a commit timestamp and broadcasts via Paxos

The coordinator is the Paxos leader — if it dies, another node picks up immediately. Spanner’s TrueTime adds another trick: even during uncertainty windows, participants know how long to wait before giving up.

Spanner’s Paxos integration solves the blocking problem by replicating coordinator state across all group members. Any surviving group member can drive recovery after a failure.

# Simplified Spanner-style Paxos-coordinated 2PC
class PaxosCoordinator:
    def __init__(self, participants, paxos_group):
        self.participants = participants
        self.paxos_group = paxos_group  # Replicated coordinator state

    def execute(self, transaction):
        # Propose commit decision to Paxos group (not single node)
        proposal = {'decision': 'commit', 'txn': transaction, 'timestamp': None}

        # Paxos consensus among group members
        decided_value = self.paxos_group.propose(proposal)

        if decided_value['decision'] == 'commit':
            # Broadcast commit to all participants
            for p in self.participants:
                p.commit(transaction)
        else:
            for p in self.participants:
                p.rollback(transaction)
CockroachDB

CockroachDB went with distributed commit, where the transaction record itself is a Raft entity. The leaseholder of the transaction record acts as the coordinator — not a separate process. When a transaction commits:

  1. The transaction record gets updated with a commit timestamp
  2. This update goes into the Raft log across all replica nodes
  3. Participants read the committed transaction record to figure out what happened

The outcome lives in Raft, so it survives coordinator crashes. No single-node SPOF.

TiDB

TiDB splits things across three components: the TiDB Server handles coordination, PD hands out timestamps, and TiKV stores everything with Raft replication underneath. Fail the TiDB Server and the transaction record survives in TiKV.

DatabaseCoordinator TypeConsensus LayerBlocking Problem Solved?
SpannerPaxos group leaderPaxosYes — replicated coordinator state
CockroachDBTransaction leaseholderRaftYes — transaction record replicated
TiDBTiDB Server + PDRaft (via TiKV)Yes — commit decision in replicated Raft log

When to Use / Recovery

When to Use 2PC

Use 2PC when:

  • You need atomic commitment across multiple nodes with strong consistency
  • Your network is stable and transactions are short (minimizing blocking window)
  • You are using a distributed database with Paxos-based coordinator (like Spanner) that eliminates the coordinator single point of failure
  • Occasional blocking during coordinator recovery is acceptable
  • Regulatory requirements demand ACID semantics across distributed nodes

Avoid 2PC when:

  • Your system must stay available during network partitions
  • Transactions are long-running (blocking window becomes unacceptable)
  • You have multiple independent services with separate databases
  • You cannot guarantee coordinator reliability (no HA configuration)
  • The blocking problem is unacceptable for your availability requirements

Recovery Protocol

Crash recovery in 2PC is where things get uncomfortable. When a participant restarts with a transaction in flight, it has to figure out what happened while it was gone — without asking the coordinator (because the coordinator might be down too).

The trick is that participants log their state to WAL before doing anything else. If you voted Yes, that goes to WAL before the vote message leaves. This means a restarted participant can reconstruct its state by reading its own log.

Recovery Protocol Implementation

WAL Entry Types

- TXN_PREPARE: Voted Yes, locks acquired, waiting for coordinator decision
- TXN_COMMIT: Decision came back commit, locks released
- TXN_ROLLBACK: Decision came back rollback (or voted No), locks released

Participant Recovery Procedure

On restart, scan for transactions stuck in PREPARED state. For each one:

  1. Check the coordinator’s decision — either from the coordinator’s recovery log (old style) or from the replicated Paxos/Raft log (modern systems)
  2. If the decision was COMMIT, commit locally and release locks
  3. If the decision was ROLLBACK, roll back and release locks
  4. If the coordinator is still unreachable, stay in prepared state and wait

Coordinator Recovery Procedure

The coordinator must write its decision to WAL before telling participants. If you crash after collecting Yes votes but before persisting, you’ve got an inconsistent mess that needs manual intervention.

On restart, the coordinator reads its WAL for unresolved transactions and resends the decision to participants that haven’t acknowledged.

The Critical Rule

The coordinator MUST write its decision to persistent storage before sending it to participants. If the coordinator crashes after collecting yes votes but before persisting, and the participants have already committed, the system enters an inconsistent state that requires manual intervention.

Modern databases sidestep this entirely — the commit decision lives in a replicated log, so it survives coordinator crashes without anyone having to remember anything.

Implementation Sketch

A simplified 2PC coordinator looks like this:

class TwoPhaseCommitCoordinator:
    def __init__(self, participants):
        self.participants = participants
        self.state = 'init'

    def execute(self, transaction):
        # Phase 1: Prepare
        votes = []
        for participant in self.participants:
            vote = participant.prepare(transaction)
            votes.append(vote)

        if any(vote == 'no' for vote in votes):
            self.state = 'rollback'
            for participant in self.participants:
                participant.rollback(transaction)
            return 'rolled back'

        # Phase 2: Commit
        self.state = 'commit'
        for participant in self.participants:
            participant.commit(transaction)
        return 'committed'

Real implementations must handle timeouts, crashes, and retries.

Trade-off Analysis

2PC vs Alternatives

Aspect2PCSaga PatternEvent Sourcing
AtomicityFull atomic commitmentEventual consistency via compensationState rebuilt from event log
BlockingYes — coordinator failure blocksNever blocksNo blocking
LocksHolds locks until commit/rollbackNo distributed locksNo locks
ComplexityRequires coordinator + participantsApplication-level compensation logicEvent replay complexity
AvailabilityCompromised during partitionsHigh availability maintainedHigh availability
ConsistencyStrong consistencyEventual consistencyStrong consistency (event log)
Latency2-3 round trips (prepare + commit)No distributed coordination overheadWrite to local event log
RecoveryWAL-based; coordinator must surviveCompensating transactions replayReplay events to rebuild state
Use CaseTightly coupled distributed DBsMicroservices, long-running processesAudit trails, CQRS, immutable logs

When to Choose Each Approach

Choose 2PC when:

  • All participants are tightly coupled databases under your control
  • Strong consistency is non-negotiable
  • Coordinator can be made highly available (Paxos-based)
  • Transaction duration is short and predictable

Choose Saga when:

  • Services have independent databases
  • Availability trumps immediate consistency
  • Business transactions span service boundaries
  • You can tolerate temporary inconsistency

Choose Event Sourcing when:

  • You need a complete audit trail
  • Rebuilding state from events is acceptable
  • You want to avoid distributed transactions entirely
  • Temporal queries and event replay are important

Production Failure Scenarios

FailureImpactMitigation
Participant crashes after voting Yes but before commitParticipant blocks indefinitely waiting for decisionUse Paxos-based consensus for coordinator; implement timeout-based lock release
Coordinator crashes after collecting Yes votesParticipants block waiting for decisionRun coordinator in HA with consensus; use persistent state for recovery
Network partition during commit phaseSome participants commit, others rollbackDesign for partition tolerance; use saga pattern for compensation
Participant recovery with uncertain stateParticipant doesn’t know whether to commit or rollbackUse WAL to determine state; implement recovery protocol; log prepare state
Coordinator timeout misconfigurationPremature rollback or indefinite waitingSet appropriate timeout values based on network characteristics
All participants vote NoCoordinator sends rollback; this is normalEnsure proper error handling; investigate root cause of votes

Common Pitfalls / Anti-Patterns

Ignoring the blocking problem: The coordinator crash scenario where participants block indefinitely is not theoretical. In production systems with network issues, it happens. Do not use 2PC without a plan for this.

Long-running transactions with 2PC: The longer the transaction, the longer participants hold locks. Long transactions increase contention and blocking window. Keep transactions short.

Using 2PC for microservice transactions: 2PC across independent services with separate databases is generally the wrong approach. Services should own their data and use saga or choreography for cross-service consistency.

Not planning for coordinator failure: If the coordinator is a single point of failure, you will eventually have a bad day. Use HA coordinator configuration or consensus-based coordination.

Underestimating timeout configuration: Timeouts that are too short cause unnecessary rollbacks. Timeouts that are too long cause long blocking. Tune based on actual network characteristics.

Assuming atomicity provides isolation: 2PC provides atomicity but not isolation. Concurrent transactions can still see each other’s uncommitted results. Use appropriate isolation levels.

Interview Questions

1. What problem does the two-phase commit protocol solve?
  • 2PC ensures atomic commitment across multiple database nodes in a distributed system
  • All participants either commit or rollback together — no partial states
  • It provides the illusion of a single-node transaction across multiple nodes
  • Solves the coordination problem: how to get independent nodes to agree on a single outcome
2. Describe the two phases of 2PC and what happens in each.
  • Phase 1 (Prepare): Coordinator sends prepare to all participants; each participant votes Yes (ready to commit, locks held) or No (rollback, release locks)
  • Phase 2 (Commit): If all vote Yes, coordinator sends commit to all; participants commit locally and release locks; if any vote No, coordinator sends rollback to all
  • The coordinator manages the voting and decision distribution
3. What is the blocking problem in 2PC and why is it serious?
  • Blocking occurs when a participant votes Yes but the coordinator crashes before sending the decision
  • The participant holds locks and cannot decide unilaterally — it cannot commit (coordinator might have sent rollback) and cannot rollback (coordinator might have sent commit)
  • The participant must wait indefinitely for coordinator recovery
  • Other participants may have already committed or rolled back, creating inconsistent state
  • This is unacceptable in high-availability systems
4. How do modern distributed databases like Spanner solve the blocking problem?
  • Spanner uses Paxos groups as the coordinator — the coordinator state is replicated across multiple nodes
  • If the coordinator leader crashes, another node can immediately take over and drive recovery
  • CockroachDB uses distributed commit where the transaction record is a Raft entity
  • TiDB stores the commit decision in TiKV which is Raft-replicated
  • The key insight: coordinator state must survive crashes, not just the coordinator process
5. What is the difference between an embedded and external coordinator?
  • Embedded: coordinator runs within the application process (e.g., XA transactions in MySQL); simpler but app crash kills coordinator
  • External: coordinator is a separate service (e.g., Narayana, WebSphere LTCO); survives app crashes but adds operational complexity
  • Both are single points of failure unless the coordinator itself is HA/replicated
  • External coordinators are better for multi-service transactions spanning applications
6. How does WAL (Write-Ahead Logging) help with 2PC recovery?
  • Participants write their state to WAL before sending vote messages
  • TXN_PREPARE = voted Yes, locks held, waiting; TXN_COMMIT = decided commit; TXN_ROLLBACK = decided rollback
  • On restart, participants scan WAL for PREPARED transactions
  • They check the coordinator's decision from replicated log (or ask coordinator) and complete accordingly
  • Critical rule: coordinator must write decision to WAL before sending to participants
7. Why is 2PC rarely used in microservice architectures?
  • Microservices have independent databases — 2PC across separate databases is complex and fragile
  • The blocking problem is unacceptable for available systems
  • Long-running transactions with 2PC cause lock contention across service boundaries
  • Saga pattern is preferred: each service handles its own transaction, compensating transactions handle failures
  • Saga trades atomicity and isolation for availability and scalability
8. What are the key metrics to monitor in a 2PC production deployment?
  • Transaction commit rate vs rollback rate (baseline and anomalies)
  • Prepare and commit phase durations (latency spikes indicate problems)
  • Coordinator crash count and recovery time
  • Participant blocking time (time stuck in PREPARED state)
  • Transaction timeout rate (misconfigured timeouts cause unnecessary rollbacks)
  • Alert thresholds: participants in prepared state > N seconds, rollback rate exceeds baseline
9. How does 3PC attempt to solve the blocking problem and why does it still fail in practice?
  • 3PC adds a pre-commit phase: Prepare → Pre-commit → Commit
  • If coordinator crashes after sending pre-commit, participants know a commit decision was made
  • Participants can decide to commit rather than block indefinitely
  • Problem: 3PC assumes synchronous network — if messages timeout unexpectedly, the protocol breaks down
  • Network partitions and simultaneous node failures still cause blocking in practice
  • Real-world systems find Paxos-based 2PC or saga more practical
10. What is the critical rule for coordinator crash recovery in 2PC?
  • The coordinator MUST write its decision to persistent storage (WAL or replicated log) BEFORE sending to participants
  • If coordinator crashes after collecting yes votes but before persisting, participants may have committed while coordinator lost the decision
  • This creates an inconsistent state requiring manual intervention
  • Modern databases avoid this by storing the commit decision in a Raft/Paxos replicated log
  • Recovery: coordinator reads WAL for unresolved transactions and resends decisions to unacknowledged participants
11. What is the role of the prepare phase in 2PC and why is it important?
  • The prepare phase is where the coordinator asks all participants to vote on whether they can commit
  • Each participant must write TXN_PREPARE to WAL before sending vote — this ensures the vote is durable
  • Participants acquire locks during prepare and hold them until the decision arrives
  • If any participant votes No, the transaction aborts immediately — no locks remain held
  • The prepare phase creates the "in-doubt" window: participants are committed to vote but awaiting decision
12. Explain the difference between 2PC and saga pattern for distributed transactions.
  • 2PC provides atomic commitment — all participants commit or all rollback together; saga provides eventual consistency via compensating transactions
  • 2PC blocks participants during coordinator failure; saga never blocks because it avoids distributed locks
  • 2PC requires a coordinator and participants to support the protocol; saga can be implemented at the application layer without database support
  • Saga trades isolation and atomicity for availability — intermediate states are visible
  • 2PC is suitable when strong consistency is required and blocking is acceptable; saga is better for microservices and long-running transactions
13. How does network partition affect 2PC and what are the outcomes?
  • If a participant cannot reach the coordinator during prepare, it blocks indefinitely waiting for the prepare message
  • If the coordinator cannot reach a participant, it must treat it as a No vote and send rollback to all
  • If partition occurs after participants vote Yes but before commit arrives, affected participants block indefinitely
  • Network partitions can cause split-brain scenarios where some participants commit and others rollback
  • CAP theorem implications: 2PC chooses consistency over availability during partitions — it will block rather than proceed inconsistently
14. What is an in-doubt transaction in 2PC and how should it be handled?
  • An in-doubt transaction occurs when a participant has voted Yes but not yet received the commit or rollback decision from the coordinator
  • The participant holds locks and cannot make a unilateral decision
  • Handling options: wait for coordinator recovery (may block indefinitely), use timeout-based lock release (risks inconsistency), consult coordinator's WAL on recovery
  • Modern databases with Paxos/Raft-based coordinators solve in-doubt states by replicating the decision
  • Monitoring in-doubt duration is critical — alerts should fire if participants exceed threshold time in prepared state
15. What are the isolation guarantees provided by 2PC?
  • 2PC provides atomicity but NOT isolation — these are separate properties in ACID
  • Concurrent transactions can see each other's uncommitted results depending on isolation level
  • The locks held during prepare phase provide serializability at the cost of concurrency
  • 2PC does not prevent dirty reads unless appropriate isolation levels are configured
  • Understanding isolation levels (Read Committed, Serializable, etc.) is essential for correct 2PC usage
16. Why does 2PC not work well with long-running transactions?
  • Long-running transactions hold locks for extended periods, blocking other transactions
  • The blocking window is proportional to transaction duration — longer transactions mean longer potential blocking
  • Network timeouts are harder to set correctly for long transactions — too short causes premature rollback, too long causes extended blocking
  • Deadlock probability increases with transaction duration and lock contention
  • Saga pattern is preferred for long-running business transactions to avoid lock contention
17. How does TiDB's architecture differ from Spanner and CockroachDB in solving the 2PC blocking problem?
  • TiDB uses a three-component architecture: TiDB Server (coordinator), PD (timestamp allocation), TiKV (storage with Raft replication)
  • The commit decision is stored in TiKV which is Raft-replicated, surviving TiDB Server failures
  • PD provides globally consistent timestamps, enabling distributed transaction ordering
  • Spanner uses Paxos groups with TrueTime for geo-replication; CockroachDB uses distributed Raft transaction records
  • TiDB's separation of compute (TiDB Server) and storage (TiKV) provides horizontal scalability
18. What is the message complexity of 2PC and how does it scale?
  • 2PC requires 3 messages per participant: prepare, vote response, commit/rollback
  • Total messages = 3N where N is the number of participants — scales linearly with participants
  • All messages must arrive within timeout windows — network latency impacts coordinator timeout configuration
  • 3PC requires 5 messages per participant (adds pre-commit and ack), increasing overhead by 67%
  • High participant counts make 2PC coordination expensive and block longer due to stragglers
19. What are the security considerations when implementing 2PC?
  • Coordinator-to-participant communication must be authenticated to prevent unauthorized transaction commands
  • Encryption (TLS) is needed to protect vote values and commit decisions from interception
  • Authorization: only authorized participants should join transactions — implement access control lists
  • Audit logging of all commit/rollback decisions for compliance and forensic analysis
  • Input validation at each participant prevents malicious transaction data from propagating
  • Coordinator management APIs must be protected to prevent coordinator compromise
20. How would you design a monitoring dashboard for a production 2PC deployment?
  • Key metrics: commit rate vs rollback rate, prepare/commit phase durations, in-doubt transaction count and duration
  • Coordinator health: crash count, recovery time, HA failover events
  • Participant health: blocking time in prepared state, vote distribution (yes/no ratio)
  • Network health: timeout rates, partition events, latency percentiles
  • Alerts: participants in prepared state > threshold, coordinator unavailable, rollback rate spike
  • Dashboards should show end-to-end transaction latency, not just 2PC protocol latency

Further Reading

Consensus Algorithms and 2PC

2PC is fundamentally a consensus problem — all participants must agree on a single decision (commit or abort). Understanding the relationship between consensus algorithms and 2PC illuminates why modern distributed databases moved toward consensus-based coordinators.

The Consensus Problem

In distributed systems, consensus requires agreement among distributed processes on a single value. The FLP impossibility result proved that no deterministic consensus algorithm can guarantee termination in an asynchronous system if even one process might fail. This means any practical consensus algorithm must make trade-offs between safety (agreement) and liveness (termination).

2PC sidesteps this by making a different assumption: the coordinator is trusted and reliable. If the coordinator fails, the system trades liveness for safety — participants block rather than risk inconsistency. This works when coordinator failure is rare and recovery is fast.

Paxos and Raft as Coordinator Backends

Modern databases solve the coordinator reliability problem by implementing the coordinator as a replicated state machine using consensus algorithms:

Paxos (used by Google Spanner, etcd, CockroachDB’s HA coordinator):

  • Guarantees safety even during network partitions
  • Uses leader lease to ensure only one leader proposes values
  • Coordinator state is replicated across a quorum of nodes
  • If coordinator crashes, a new leader is elected from the surviving quorum

Raft (used by CockroachDB, TiDB):

  • Designed to be understandable compared to Paxos
  • Leader-based — all writes go through the leader
  • Coordinator state lives in the Raft log, replicated to followers
  • Leader election provides automatic failover

Why Not Use Pure Consensus for Transactions?

Using a consensus algorithm as the transaction coordinator means every transaction decision requires a round of consensus (typically 2 round trips for Raft/Paxos). This adds latency compared to a single-node coordinator. The trade-off:

  • Single-node coordinator: Fast but SPOF
  • Consensus-backed coordinator: Higher latency but fault-tolerant

For high-value financial transactions, the fault tolerance justifies the latency. For low-latency OLTP workloads, the cost may be prohibitive.

Implementing Consensus-Backed 2PC

# Consensus-backed coordinator concept
class ConsensusBackedCoordinator:
    def __init__(self, paxos_group):
        self.paxos_group = paxos_group
        self.leader_lease_duration = 10  # seconds

    def begin_transaction(self, transaction):
        # Propose transaction decision to consensus group
        proposal = TransactionDecision(
            txn_id=transaction.id,
            decision='prepare',
            timestamp=self.paxos_group.get_current_timestamp()
        )

        # Wait for quorum agreement
        agreed_value = self.paxos_group.propose(proposal)

        # If we're the leader, we can proceed
        if self.paxos_group.is_leader():
            return self.execute_transaction(transaction)
        else:
            # Forward to current leader
            return self.forward_to_leader(transaction)

The Hybrid Approach

Most production systems use a hybrid: a highly available coordinator (potentially consensus-backed) that manages 2PC protocol, while the underlying storage uses Raft/Paxos for data replication. The coordinator manages the protocol semantics; the storage layer manages data safety.

This separation of concerns lets each layer be optimized independently — coordinator for low-latency protocol handling, storage layer for safe data replication.

Conclusion

Two-phase commit provides atomic commitment across distributed nodes. It works in theory. In practice, the blocking problem makes it unsuitable for high-availability systems.

When a participant votes yes and the coordinator crashes before sending the decision, the participant blocks indefinitely. This is unacceptable in systems that must stay available.

Most microservice architectures use saga instead. Saga sacrifices isolation and atomicity for availability. The trade-off is usually the right one. For cross-service business transactions, eventual consistency with compensating transactions is more practical than 2PC’s blocking semantics.

sequenceDiagram
    participant Coordinator
    participant P1 as Participant 1
    participant P2 as Participant 2
    Coordinator->>+P1: Prepare
    Coordinator->>+P2: Prepare
    P1->>-Coordinator: Vote Yes
    P2->>-Coordinator: Vote Yes
    Coordinator->>+P1: Commit
    Coordinator->>+P2: Commit

Key Points

  • Two-phase commit has a prepare phase and a commit phase
  • If the coordinator crashes after prepare, participants block
  • 2PC is unsuitable for high-availability microservice architectures
  • Modern systems use saga patterns and consensus algorithms instead
  • Raft and Paxos provide fault-tolerant coordination without blocking

Production Checklist

Production Readiness Checklist

# Two-Phase Commit Production Readiness

- [ ] Coordinator running in HA mode (Paxos-based if possible)
- [ ] Transaction timeout values tuned for network characteristics
- [ ] Participant recovery protocol implemented and tested
- [ ] Monitoring for prepared-state blocking time
- [ ] Alerting for coordinator crashes
- [ ] Rollback rate and latency monitored
- [ ] Audit logging for all commit/rollback decisions
- [ ] Clear escalation path for stuck transactions
- [ ] Documented why 2PC was chosen over saga

Observability Checklist

Metrics
  • Transaction commit rate vs rollback rate
  • Prepare phase duration
  • Commit phase duration
  • Coordinator crash count and recovery time
  • Participant blocking time (time in prepared state)
  • Transaction timeout rate
Logs
  • Log prepare and vote from each participant
  • Log coordinator decision (commit or rollback)
  • Log participant state changes
  • Include transaction ID and participant IDs for correlation
  • Log timeout and recovery events
Alerts
  • Alert when participants are in prepared state too long
  • Alert on coordinator crash
  • Alert when transaction timeout rate increases
  • Alert when rollback rate exceeds baseline
Security Checklist
  • Authenticate all participants in the transaction
  • Authorize which participants can join which transactions
  • Encrypt coordinator-to-participant communication
  • Audit log all commit and rollback decisions
  • Validate transaction inputs at each participant
  • Protect coordinator management APIs

Category

Related Posts

Distributed Transactions: ACID vs BASE Trade-offs

Explore distributed transaction patterns: ACID vs BASE trade-offs, two-phase commit, saga pattern, eventual consistency, and choosing the right model.

#distributed-systems #transactions #consistency

Google Spanner: Globally Distributed SQL at Scale

Google Spanner architecture combining relational model with horizontal scalability, TrueTime API for global consistency, and F1 database implementation.

#distributed-systems #databases #google

Gossip Protocol: Scalable State Propagation

Learn how gossip protocols enable scalable state sharing in distributed systems. Covers epidemic broadcast, anti-entropy, SWIM failure detection, and real-world applications like Cassandra and Consul.

#distributed-systems #gossip-protocol #consistency