The Saga Pattern: Managing Distributed Transactions

Learn saga pattern for distributed transactions without two-phase commit. Understand choreography vs orchestration with practical examples and production considerations.

Author: GeekWorkBench. Reading time: 39 min.

The saga pattern is one of the most practical ways to manage distributed transactions without the complexity of two-phase commit. When a business operation spans multiple services, each with its own database, a single ACID transaction is not possible. A saga breaks the operation into a sequence of local transactions, each paired with a compensating transaction that can undo it if a later step fails.

Introduction

ACID transactions do not scale across services. When the order, inventory, and payment services each have their own database, you cannot wrap a single transaction around all three. Two-phase commit is theoretically possible but practically problematic (more on that in Two-Phase Commit).

The saga pattern offers an alternative. Instead of locking resources across services, a saga breaks a distributed transaction into a sequence of local transactions, each with a compensating transaction that can undo it.

Topic-Specific Deep Dives

Core Concepts

A saga represents a distributed transaction as a series of steps. Each step is a local transaction on one service. After each step, the saga either continues to the next step or, if a step fails, runs compensating transactions to undo the previous steps.

Step 1: Reserve Inventory (compensate: release inventory)
Step 2: Charge Payment (compensate: refund payment)
Step 3: Create Shipment (compensate: cancel shipment)

If step 3 fails, you run the compensation for step 2 (refund) and then step 1 (release inventory). The saga undoes what it did, leaving the system consistent.

sequenceDiagram
    participant Saga
    participant Inv as Inventory
    participant Pay as Payment
    participant Ship as Shipping
    Saga->>+Inv: Reserve
    Inv->>-Saga: OK
    Saga->>+Pay: Charge
    Pay->>-Saga: OK
    Saga->>+Ship: Ship
    Ship->>-Saga: OK
    Note over Saga: Success - all steps complete

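A saga does not provide isolation. Each local transaction commits on its own, so other transactions can observe intermediate state (for example, reserved inventory for an order whose payment has not yet been charged) before the saga completes or compensates.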

Example: order processing saga

A complete order processing saga with three services:

def create_order_saga(order):
    saga = OrderSaga()

    # Step 1: Reserve inventory
    reservation = saga.reserve_inventory(order.items)
    if not reservation.success:
        return OrderResult(rejected=True, reason="Insufficient inventory")

    # Step 2: Authorize payment
    authorization = saga.authorize_payment(order.payment, order.total)
    if not authorization.success:
        saga.compensate_inventory(reservation.id)
        return OrderResult(rejected=True, reason="Payment declined")

    # Step 3: Create shipment
    shipment = saga.create_shipment(order.address, order.items)
    if not shipment.success:
        saga.reverse_payment(authorization.id)
        saga.compensate_inventory(reservation.id)
        return OrderResult(rejected=True, reason="Shipping unavailable")

    # All steps succeeded
    return OrderResult(confirmed=True, order_id=order.id, shipment=shipment)

compensate_inventory and reverse_payment are the compensating transactions. On failure they run in the reverse order of the steps they undo.

Saga vs two-phase commit

Two-phase commit (2PC) locks resources until all participants confirm. It provides atomicity but at the cost of availability. If the coordinator fails during the commit phase, participants may be left waiting indefinitely.

Saga takes a different approach. It sacrifices isolation and atomicity for availability and scalability. Steps execute one at a time. Failures trigger compensation, not rollback.

For a deep dive on 2PC and why it is often avoided, see Two-Phase Commit.

Architecture Deep Dives

There are two ways to implement a saga. Choreography uses events between services to coordinate. Orchestration uses a central orchestrator to manage the sequence.

Choreographed saga

Each service knows its own step and its own compensation. When a step completes, the service emits an event. The next service reacts to that event. If something fails, services emit failure events that trigger compensations.

graph LR
    Order[Order Service] -->|OrderCreated| Inv[Inventory]
    Inv -->|InventoryReserved| Pay[Payment]
    Pay -->|PaymentCharged| Ship[Shipping]
    Ship -->|ShipmentCreated| Order

In this flow, each service reacts to the previous step’s event. The behavior is distributed. Each service knows only its own piece.

When Payment fails after Inventory is reserved:

graph LR
    Pay -->|PaymentFailed| Inv
    Inv -->|InventoryReleased| Order

The compensation logic lives in each service. Inventory responds to the failure by releasing the reservation.
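
A minimal sketch of this flow, using a toy in-process event bus in place of a real broker. The `EventBus`, `InventoryService`, and `PaymentService` classes and the `declined_orders` parameter are assumptions made for illustration; only the event names come from the diagrams above.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process stand-in for a message broker (illustration only)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # every event type published, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


class InventoryService:
    def __init__(self, bus):
        self.bus = bus
        self.reserved = set()
        bus.subscribe("OrderCreated", self.on_order_created)
        bus.subscribe("PaymentFailed", self.on_payment_failed)

    def on_order_created(self, order):
        self.reserved.add(order["id"])
        self.bus.publish("InventoryReserved", order)

    def on_payment_failed(self, order):
        # The compensation lives in the service that owns the step
        self.reserved.discard(order["id"])
        self.bus.publish("InventoryReleased", order)


class PaymentService:
    def __init__(self, bus, declined_orders=()):
        self.bus = bus
        self.declined_orders = set(declined_orders)  # hypothetical failure switch
        bus.subscribe("InventoryReserved", self.on_inventory_reserved)

    def on_inventory_reserved(self, order):
        if order["id"] in self.declined_orders:
            self.bus.publish("PaymentFailed", order)
        else:
            self.bus.publish("PaymentCharged", order)


bus = EventBus()
inventory = InventoryService(bus)
payment = PaymentService(bus, declined_orders={"order-2"})

bus.publish("OrderCreated", {"id": "order-1"})  # happy path
bus.publish("OrderCreated", {"id": "order-2"})  # payment fails, inventory compensates
```

Neither service calls the other directly. Each one only knows which events it listens for and which it emits, which is what makes the overall flow hard to see from any single service.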

Orchestrated saga

A central orchestrator manages the sequence. It decides what step runs next, handles failures, and triggers compensations. The orchestrator knows the entire workflow.

graph LR
    Orch[Order Orchestrator] -->|Reserve| Inv
    Inv -->|OK| Orch
    Orch -->|Charge| Pay
    Pay -->|OK| Orch
    Orch -->|Ship| Ship
    Ship -->|OK| Orch

The orchestrator keeps state about what has completed. If Payment fails, the orchestrator tells Inventory to release and returns an error to the client.

For a full comparison, see Service Orchestration.

Implementing saga compensation

Compensation logic must be deterministic. If step 3 fails after steps 1 and 2 succeeded, you must undo step 2 and then step 1. Running a compensation twice, or running compensations in the wrong order, can leave the system in an inconsistent state.

# SagaResult, PaymentDeclined, and ShippingUnavailable are assumed to be defined elsewhere
class OrderSaga:
    def __init__(self, inventory, payment, shipping):
        # Clients for the three participating services
        self.inventory = inventory
        self.payment = payment
        self.shipping = shipping
        self.steps = []  # (step_name, result) for each completed step, in order

    def execute(self, order):
        self.steps = []  # start every run with an empty step log
        try:
            # Step 1: Reserve inventory
            reservation = self.inventory.reserve(order.items)
            self.steps.append(('reserve', reservation))

            # Step 2: Process payment
            charge = self.payment.charge(order.payment_info, order.total)
            self.steps.append(('charge', charge))

            # Step 3: Create shipment
            shipment = self.shipping.create(order.address)
            self.steps.append(('ship', shipment))

            return SagaResult(success=True, shipment=shipment)

        except PaymentDeclined:
            # Compensate step 1
            self.inventory.release(self.steps[0][1].reservation_id)
            return SagaResult(success=False, reason="Payment declined")

        except ShippingUnavailable:
            # Compensate step 2
            self.payment.refund(self.steps[1][1].charge_id)
            # Compensate step 1
            self.inventory.release(self.steps[0][1].reservation_id)
            return SagaResult(success=False, reason="Shipping unavailable")

The saga tracks completed steps in order. On failure, it runs compensations in reverse order. This is the compensating transaction pattern.
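
Index-based lookups such as `self.steps[0]` work for three steps but get brittle as a saga grows. A more general sketch, assuming each step can be expressed as an action paired with its compensation (`Step` and `run_saga` are hypothetical names used only for illustration):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    compensation: Callable[[], None]


def run_saga(steps: List[Step]) -> Optional[str]:
    """Run steps in order; on failure, undo the completed steps newest-first.

    Returns None on success, or the name of the step that failed.
    """
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step never completed, so only earlier steps are undone
            for done in reversed(completed):
                done.compensation()
            return step.name
        completed.append(step)
    return None


calls = []


def fail_shipping():
    raise RuntimeError("shipping unavailable")


failed_step = run_saga([
    Step("reserve", lambda: calls.append("reserve"), lambda: calls.append("release")),
    Step("charge", lambda: calls.append("charge"), lambda: calls.append("refund")),
    Step("ship", fail_shipping, lambda: calls.append("cancel")),
])
```

Here `calls` ends up as reserve, charge, refund, release: the refund runs before the release because compensations walk the completed list backwards. This sketch does not handle a compensation that itself fails; a production version would need retries and alerting at that point.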

Idempotency in sagas

Sagas execute across unreliable networks. Messages may be delivered twice. Services may crash mid-operation. Your saga must handle idempotency.

Reserving the same inventory twice should not double-reserve. Charging the same payment twice should not double-charge.

def reserve_inventory(self, reservation_id, items):
    # Idempotency check
    if self.inventory.is_reserved(reservation_id):
        return self.inventory.get_reservation(reservation_id)

    # Actually reserve
    return self.inventory.create_reservation(reservation_id, items)

def charge_payment(self, charge_id, payment_info, amount):
    # Idempotency check using charge_id
    existing = self.payment.get_charge(charge_id)
    if existing:
        return existing

    # Process charge
    return self.payment.create_charge(charge_id, payment_info, amount)

Include an idempotency key (the reservation ID from step 1) in the compensation call. Check before acting.
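
The same check applies on the compensation side. A minimal sketch, assuming an in-memory store keyed by reservation ID (`InventoryStore` and its methods are hypothetical, standing in for the real inventory service):

```python
class InventoryStore:
    """In-memory stand-in for the inventory service (illustration only)."""

    def __init__(self, stock):
        self.stock = dict(stock)
        self.reservations = {}  # reservation_id -> items currently held

    def reserve(self, reservation_id, items):
        if reservation_id in self.reservations:
            return  # duplicate delivery: this reservation already exists
        for sku, qty in items.items():
            self.stock[sku] -= qty
        self.reservations[reservation_id] = dict(items)

    def release(self, reservation_id):
        # pop() makes the release idempotent: a second call finds nothing to undo
        items = self.reservations.pop(reservation_id, None)
        if items is None:
            return
        for sku, qty in items.items():
            self.stock[sku] += qty


store = InventoryStore({"item-a": 5})
store.reserve("res-1", {"item-a": 2})
store.reserve("res-1", {"item-a": 2})  # duplicate message, ignored
stock_after_reserve = store.stock["item-a"]

store.release("res-1")
store.release("res-1")  # duplicate compensation, ignored
stock_after_release = store.stock["item-a"]
```

In a real service the check and the update would need to happen in one database transaction, otherwise two concurrent deliveries could both pass the check.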

Nested sagas

Large workflows sometimes need sub-sagas. Rather than one monolithic saga with dozens of steps, you can decompose into nested sagas where a step’s “execute” is itself a saga.

For example, an order fulfillment saga might have a step called ProcessPayment that internally runs authorize → capture as a nested saga. If the nested saga fails, the parent saga treats it as a single failed step and compensates accordingly.

Why nest sagas?

  • Reusability: The ProcessPayment nested saga can be reused across multiple parent sagas (order, subscription, refund)
  • Readability: Top-level saga reads like a business workflow, not a technical protocol
  • Scoped failures: If payment fails, you know exactly which sub-step failed without scrolling through 20 parent steps

Concurrency control with nested sagas:

Nested sagas introduce concurrency at the parent level. While the payment nested saga is running, other parent sagas may also be running and trying to access shared resources. Use optimistic or pessimistic locking at the parent level.

class ParentSaga:
    def execute(self):
        # Step 1: Reserve inventory (parent-level lock on inventory record)
        with self.lock('inventory', self.order.inventory_id):
            self.reserve_inventory()

        # Step 2: Run payment as nested saga (has its own compensation)
        payment_result = PaymentNestedSaga().execute(self.payment_context)

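        if not payment_result.success:
            # Parent-level compensation for inventory, taken under the same lock
            with self.lock('inventory', self.order.inventory_id):
                self.release_inventory()
            return SagaResult(success=False, reason="Payment failed")

        # Step 3: Create shipment
        shipment = self.create_shipment()
        return SagaResult(success=True, shipment=shipment)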

Optimistic locking: Read the resource version before modifying. On update, check the version hasn’t changed. If it has, abort and retry.

Pessimistic locking: Acquire a lock before accessing the resource. Blocks other sagas from accessing it until the lock releases. Simpler but reduces concurrency.

Use optimistic locking for most cases (higher throughput). Use pessimistic locking only when the cost of a concurrent modification is very high (financial transactions, inventory with hard limits).
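
A minimal sketch of the optimistic approach, assuming a record that carries a version counter. `VersionedRecord` stands in for a database row, and `write_if_unchanged` plays the role of an `UPDATE ... WHERE version = ?` statement; both names are hypothetical.

```python
import threading


class VersionConflict(Exception):
    pass


class VersionedRecord:
    """A value plus a version counter, standing in for a database row."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._mutex = threading.Lock()  # makes the compare-and-set atomic

    def read(self):
        with self._mutex:
            return self.value, self.version

    def write_if_unchanged(self, new_value, expected_version):
        with self._mutex:
            if self.version != expected_version:
                raise VersionConflict()
            self.value = new_value
            self.version += 1


def reserve_units(record, units, max_attempts=3):
    """Decrement stock optimistically, retrying when another writer got in first."""
    for _ in range(max_attempts):
        stock, version = record.read()
        if stock < units:
            return False  # not enough stock: retrying will not help
        try:
            record.write_if_unchanged(stock - units, version)
            return True
        except VersionConflict:
            continue  # re-read the record and try again
    return False


record = VersionedRecord(10)
ok = reserve_units(record, 3)
```

No lock is held between the read and the write, so concurrent sagas only pay a cost when they actually collide.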

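Saga state should also be persisted so that a crashed saga can be recovered:

from datetime import datetime, timezone


class SagaState:
    """Persisted saga state for crash recovery."""

    def __init__(self, saga_id: str):
        self.saga_id = saga_id
        self.steps: list[tuple[str, str, dict]] = []  # [(step_name, status, data)]
        self.status = "pending"
        self.created_at = datetime.now(timezone.utc)  # timezone-aware; utcnow() is deprecated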

    def mark_step_started(self, step_name: str, step_data: dict):
        self.steps.append((step_name, "started", step_data))
        self._persist()

    def mark_step_completed(self, step_name: str, result_data: dict):
        # Update step status
        for i, (name, status, data) in enumerate(self.steps):
            if name == step_name and status == "started":
                self.steps[i] = (name, "completed", {**data, **result_data})
                break
        self._persist()

    def mark_step_compensated(self, step_name: str):
        for i, (name, status, data) in enumerate(self.steps):
            if name == step_name and status == "completed":
                self.steps[i] = (name, "compensated", data)
                break
        self._persist()

    def get_pending_steps(self) -> list[str]:
        return [name for name, status, _ in self.steps if status == "started"]

    def get_completed_steps(self) -> list[str]:
        return [name for name, status, _ in self.steps if status == "completed"]

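    def to_dict(self) -> dict:
        return {
            "saga_id": self.saga_id,
            "steps": self.steps,
            "status": self.status,
            "created_at": self.created_at.isoformat(),
        }

    def _persist(self):
        # Save to durable storage (database, etc.); `db` is an assumed storage handle
        db.sagas.upsert(self.saga_id, self.to_dict())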
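
One way a restarted orchestrator might use this persisted state is sketched below. The `plan_recovery` function is hypothetical, and the policy it encodes (verify steps left in "started", then compensate completed steps) is an assumption; resuming forward from the last completed step is an equally valid choice.

```python
def plan_recovery(steps):
    """Decide what a restarted orchestrator should look at first.

    `steps` is a list of (step_name, status, data) tuples in execution order,
    the same shape that SagaState.steps uses.

    Returns (to_verify, to_compensate):
      - to_verify: steps left in "started", whose outcome is unknown
      - to_compensate: steps marked "completed", newest first
    """
    to_verify = [name for name, status, _ in steps if status == "started"]
    to_compensate = [
        name for name, status, _ in reversed(steps) if status == "completed"
    ]
    return to_verify, to_compensate


persisted = [
    ("reserve_inventory", "completed", {"reservation_id": "res-1"}),
    ("charge_payment", "started", {"charge_id": "chg-1"}),
]
to_verify, to_compensate = plan_recovery(persisted)
```

A step in `to_verify` may or may not have taken effect, so the orchestrator has to ask the service (using the idempotency key stored in the step data) before deciding whether it needs compensating too.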

Failure Handling

Saga failures fall into several categories.

Transient failures: Network timeouts, temporary unavailability. Retry with backoff. If it keeps failing, treat as permanent.

Permanent failures: Insufficient inventory, card declined. These will not succeed on retry. Trigger compensation.

Unknown state: A service crashes mid-operation. When it recovers, determine what happened. Did the transaction commit before the crash? Did it not? This requires idempotency and state tracking.

Design compensations to be idempotent. If compensation runs twice, the second run should have no effect.
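
The transient-versus-permanent split can be sketched as a small retry helper. The exception classes, `call_with_retry`, and its parameters are hypothetical names; real services would map their own error types onto the two categories.

```python
import random
import time


class TransientError(Exception):
    """Timeouts, temporary unavailability: worth retrying."""


class PermanentError(Exception):
    """Declined card, insufficient stock: retrying will not help."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry `operation` on TransientError with exponential backoff and jitter.

    A PermanentError propagates immediately. After `max_attempts` transient
    failures the error is escalated to PermanentError so the saga compensates.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            if attempt == max_attempts:
                raise PermanentError(f"gave up after {attempt} attempts") from exc
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))  # jitter spreads out retries


attempts = {"count": 0}


def flaky_charge():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("timeout")
    return "charged"


# The demo passes a no-op sleep so it does not actually wait
result = call_with_retry(flaky_charge, sleep=lambda _: None)
```

Retrying `flaky_charge` is only safe because the charge itself is assumed to be idempotent; without an idempotency key, a timeout followed by a retry could charge twice.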

Performance considerations

Saga trades a synchronous two-round-trip 2PC for a sequential multi-step approach. The math is worth understanding.

2PC latency profile: Two network round-trips (prepare + commit), but all participants vote and commit in parallel. Typical: 10-50ms per participant phase.

Saga latency profile: Each step adds its own latency. If step 1 takes 20ms, step 2 takes 30ms, step 3 takes 15ms, your total is 65ms plus orchestration overhead. Steps run sequentially, not in parallel.

For a 3-step saga vs 2PC across the same 3 services:

| Metric | 2PC | Saga |
| --- | --- | --- |
| Happy path latency | ~20-40ms (parallel phases) | ~50-80ms (sequential steps) |
| Failure recovery | Blocks until coordinator recovers | Compensation runs immediately |
| Availability | Lower (blocking on coordinator) | Higher (no coordinator SPOF) |
| Lock duration | All locks held during both phases | Each lock released after its step |

Saga’s latency overhead is real but often acceptable. If each step is 10-30ms (typical for service calls), a 5-step saga runs in 50-150ms total. Compare that to a user-facing API timeout (usually 1-5 seconds) and the overhead is negligible.

Where saga latency hurts is high-throughput, low-latency paths (trading systems, real-time pricing). In those cases, consider whether you can pipeline steps (some steps don’t depend on previous step results and can overlap).
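
The effect of pipelining can be worked out from the step durations. The sketch below reuses the 20ms, 30ms, and 15ms figures from earlier and assumes, purely for illustration, that reserving inventory and charging payment do not depend on each other while shipping depends on both. The function names are hypothetical.

```python
def sequential_latency(durations):
    """Total latency when steps run one after another."""
    return sum(durations.values())


def pipelined_latency(durations, depends_on):
    """Total latency when each step starts as soon as its dependencies finish.

    `depends_on` maps a step to the list of steps it must wait for.
    """
    finish = {}

    def finish_time(step):
        if step not in finish:
            start = max(
                (finish_time(dep) for dep in depends_on.get(step, [])),
                default=0,
            )
            finish[step] = start + durations[step]
        return finish[step]

    return max(finish_time(step) for step in durations)


durations = {"reserve": 20, "charge": 30, "ship": 15}  # milliseconds
depends_on = {"ship": ["reserve", "charge"]}           # assumed dependency graph

sequential_ms = sequential_latency(durations)           # 20 + 30 + 15
pipelined_ms = pipelined_latency(durations, depends_on)  # max(20, 30) + 15
```

Under these assumptions the pipelined saga takes 45ms instead of 65ms. The cost is that overlapping steps can both succeed before a failure elsewhere is known, so each still needs its compensation.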

Trade-off Analysis

The saga pattern involves several fundamental trade-offs compared to other distributed transaction approaches.

Saga vs Alternative Patterns

| Aspect | Saga | Two-Phase Commit | TCC (Try-Confirm-Cancel) |
| --- | --- | --- | --- |
| Atomicity | Eventual (via compensation) | True atomic (all-or-nothing) | Eventual (via Confirm/Cancel) |
| Isolation | None (partial states visible) | Full isolation | Partial (Try reserves) |
| Availability | High (no coordinator blocking) | Low (coordinator is SPOF) | Medium (Try phase reserves resources) |
| Complexity | Medium (compensation logic) | High (coordinator protocol) | High (3-phase interface per service) |
| Rollback Mechanism | Compensation transactions | Automatic rollback | Cancel / Confirm semantics |
| Latency | Per-step cumulative | Two round-trips (parallel phases) | Three round-trips per operation |
| Service Coupling | Each service owns compensation | All services must support 2PC | All services must implement TCC |
| Failure Handling | Application-defined compensation | Coordinator-driven | Try phase failures auto-cancel |

Choreography vs Orchestration Trade-offs

| Factor | Choreography | Orchestration |
| --- | --- | --- |
| Central Point of Failure | None (fully distributed) | Orchestrator is a potential failure point |
| Logic Location | Distributed across services | Centralized in orchestrator |
| Adding New Steps | All participating services must change | Only orchestrator changes |
| Understanding Workflow | Harder to see overall flow | Easier to see complete workflow |
| Debugging | Requires distributed tracing | Easier (centralized logs) |
| Scalability | Scales with number of services | Limited by orchestrator capacity |
| Team Independence | High (services evolve independently) | Low (teams depend on orchestrator) |
| Suitable For | 2-4 services, simple event flows | 5+ services, complex workflows |

Design Decision Matrix

| Your Constraint | Recommended Approach |
| --- | --- |
| Maximum availability required | Saga (either flavor) |
| True ACID isolation required | Single database or 2PC (with caveats) |
| Cross-service transactions | Saga |
| Can’t modify participant code | Saga (only compensation needed) |
| High-frequency, low-latency | Pipelined saga or avoid distributed transaction |
| Long-running (hours to days) | Orchestrated saga with durable execution (Temporal) |
| Simple 2-3 step workflow | Choreographed saga |
| Complex multi-team workflow | Orchestrated saga |
| Already using workflow engine | Use the engine’s saga support (Temporal, Step Functions) |

Cost-Benefit Summary

Saga Benefits:

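  • No blocking coordinator: participants never hold locks while waiting for a commit decision (an orchestrator, if you use one, still has to be made durable)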
  • Steps execute sequentially — each lock is released after its step completes
  • Works across service boundaries with separate databases
  • Compensation is application-defined and domain-appropriate
  • Easier to retrofit onto existing services

Saga Costs:

  • No isolation — application must handle partial states
  • Compensation can fail, leaving the system in an inconsistent state
  • Longer cumulative latency (sum of step latencies vs parallel phases)
  • Application must implement idempotency throughout
  • Testing is more complex (failure scenarios, compensation ordering)

Saga Testing Strategies

Testing sagas is hard because they span services and involve time. A structured approach helps.

Unit testing compensations

Test each compensation in isolation first. The compensation is the most critical piece — if it fails, your saga is stuck.

# Test that releasing inventory twice has no effect (idempotency)
def test_release_inventory_idempotent():
    inventory = InMemoryInventory()
    inventory.reserve("order-123", ["item-a"])

    # First release — succeeds
    result1 = inventory.release("order-123")
    assert result1.success
    assert not inventory.is_reserved("order-123")

    # Second release — should be idempotent (no error)
    result2 = inventory.release("order-123")
    assert result2.success  # Still succeeds, even though already released

    # Third release — still idempotent
    result3 = inventory.release("order-123")
    assert result3.success

Integration testing failure scenarios

The real test is whether your saga handles failures correctly. Set up test infrastructure that simulates failures at each step.

# Test: step 3 fails, step 2 compensation runs correctly
def test_saga_step3_failure_triggers_step2_compensation():
    inventory = MockInventory()  # Always succeeds
    payment = MockPayment()      # Always succeeds
    shipping = MockShipping(fail_on="create")  # Fails on create

    saga = OrderSaga(inventory, payment, shipping)
    result = saga.execute(order_with_3_items)

    assert not result.success
    assert result.reason == "Shipping unavailable"
    assert payment.refund_was_called()        # Step 2 compensated
    assert inventory.release_was_called()      # Step 1 compensated
    assert not shipping.shipment_created      # Step 3 failed, so no shipment exists

Testing compensations run in correct order

A common saga bug is compensation running in the wrong order. Write a test that explicitly verifies reverse order.

def test_compensation_runs_in_reverse_order():
    call_order = []

    class StepFailed(Exception):
        pass

    class TrackingService:
        def __init__(self, name, fail=False):
            self.name = name
            self.fail = fail

        def do_step(self):
            call_order.append(f"do-{self.name}")
            if self.fail:
                raise StepFailed(self.name)

        def compensate(self):
            call_order.append(f"compensate-{self.name}")

    class OrderedSaga:
        def __init__(self):
            self.s1 = TrackingService("step1")
            self.s2 = TrackingService("step2")
            self.s3 = TrackingService("step3", fail=True)
            self.completed = []

        def execute(self):
            for service in (self.s1, self.s2, self.s3):
                service.do_step()  # step3 raises StepFailed
                self.completed.append(service)

        def compensate(self):
            # Undo only the steps that completed, newest first: s2, then s1
            for service in reversed(self.completed):
                service.compensate()

    saga = OrderedSaga()
    try:
        saga.execute()  # Fails on step 3
    except StepFailed:
        saga.compensate()

    # step3 was attempted but never completed, so it is not compensated
    assert call_order == [
        "do-step1", "do-step2", "do-step3",
        "compensate-step2", "compensate-step1",
    ]

Chaos testing sagas in production

Once your saga is running in production, inject failures to verify it handles them:

  1. Kill a service mid-saga and verify compensation runs
  2. Introduce network latency and verify timeouts trigger correctly
  3. Fill up a resource (disk, connection pool) and verify graceful degradation
  4. Split the network between two services and verify saga completes or compensates correctly

Observability & Tracing

Observability

Sagas span multiple services. Without tracing, debugging a failed saga means grepping logs across 5 services and trying to piece together what happened. With distributed tracing (OpenTelemetry, Zipkin, Jaeger), you get a single trace ID that follows the saga across all services.

Trace structure for sagas

A saga trace has a parent span for the overall saga, with child spans for each step and compensation.

Trace: order-123-saga (trace-id: abc123)
├── Span: saga_created (service: order-service)
├── Span: step.reserve_inventory (service: inventory-service)
│   └── Span: compensate.reserve_inventory (service: inventory-service)
├── Span: step.charge_payment (service: payment-service)
│   └── Span: compensate.charge_payment (service: payment-service)
├── Span: step.create_shipment (service: shipping-service)
└── Span: saga_completed / saga_failed

Implementing trace context propagation

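Pass trace context through saga steps by injecting it into the headers of each outgoing call. Span links, shown further down, tie a compensation back to the step it undoes.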

from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

class OrderSaga:
    def execute(self, order, context=None):
        # Extract incoming trace context (if triggered by HTTP request).
        # With ctx=None the new span is parented to whatever span is current.
        ctx = extract(context) if context else None

        # start_as_current_span makes the saga span current, so inject() below
        # propagates this span rather than its parent
        with tracer.start_as_current_span("order_saga", context=ctx) as span:
            span.set_attribute("saga.id", order.saga_id)
            span.set_attribute("saga.type", "order_fulfillment")

            # Inject trace context into step calls
            headers = {}
            inject(headers)  # Injects current trace into headers

            try:
                # Step 1: Reserve inventory (pass headers for trace propagation)
                inventory_ctx = self.inventory.reserve(order.items, headers)

                # Step 2: Charge payment
                payment_ctx = self.payment.charge(order.payment, headers)

                # Step 3: Create shipment
                shipment = self.shipping.create(order.address, headers)

                span.set_status(Status(StatusCode.OK))
                return Success(shipment)

            except Exception as e:
                span.record_exception(e)
                span.set_status(Status(StatusCode.ERROR, str(e)))
                raise

When a step fails and compensation runs, link the compensation span to the original step span so you can see the pairing in the trace.

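from opentelemetry.trace import Link

def compensate_inventory(self, reservation_id, original_span):
    # A link (not a parent/child edge) ties the compensation to the step it undoes
    with tracer.start_as_current_span(
        "compensate.reserve_inventory",
        links=[Link(original_span.get_span_context())]
    ) as span:
        # `name` is available on SDK spans; fall back for non-recording spans
        span.set_attribute("compensation.for", getattr(original_span, "name", "unknown"))
        span.set_attribute("reservation.id", reservation_id)
        self.inventory.release(reservation_id)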

This way, in your trace viewer, you see the compensate span linked back to its original do span — paired visually rather than guessing from logs.

Framework & Decision Guides

Building a saga from scratch is educational but painful in production. Use a framework that handles state persistence, retry logic, observability, and distributed tracing out of the box.

Temporal

The strongest choice for saga orchestration. Temporal provides durable workflow execution — if your service crashes mid-saga, Temporal persists the workflow state and resumes it from where it left off. No need to build your own saga state machine.

  • Strengths: Durable execution (survives worker crashes), built-in retries with backoff, activity heartbeats, sandboxed workflow code, strong OpenTelemetry integration
  • Good for: Complex multi-step business workflows, long-running sagas (hours to days)
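  • Trade-offs: Self-hosting is operationally heavy; Temporal Cloud pricing can be significant at scale

# Temporal workflow example (Python SDK)
# Order, OrderResult, and the activity functions are assumed to be defined elsewhere
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> OrderResult:
        # Activities with automatic retry
        reservation = await workflow.execute_activity(
            reserve_inventory,
            order.items,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )

        try:
            charge = await workflow.execute_activity(
                charge_payment,
                args=[order.payment, order.total],  # multiple arguments go through args=
                start_to_close_timeout=timedelta(seconds=30),
            )
        except ActivityError:
            # A failed activity surfaces in the workflow as ActivityError; the
            # original error (for example a declined payment) is its __cause__.
            # This assumes the activity marks declines as non-retryable.
            await workflow.execute_activity(
                release_inventory,
                reservation.id,
                start_to_close_timeout=timedelta(seconds=30),
            )
            return OrderResult(rejected=True, reason="Payment declined")

        shipment = await workflow.execute_activity(
            create_shipment,
            order.address,
            start_to_close_timeout=timedelta(seconds=30),
        )

        return OrderResult(confirmed=True, shipment=shipment)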

AWS Step Functions

Managed saga orchestration on AWS. Integrates tightly with AWS services (Lambda, ECS, SQS, DynamoDB). Good if you’re already all-in on AWS.

  • Strengths: Fully managed, pay-per-state-transition, tight Lambda integration, visual workflow designer
  • Good for: AWS-centric architectures, medium-complexity workflows
  • Trade-offs: Vendor lock-in, expensive at high step counts, debugging can be opaque

Conductor (Netflix)

Conductor is an open-source saga orchestrator from Netflix. Good for microservices that need workflow orchestration without heavy operational overhead.

  • Strengths: Open source (self-hostable), JSON-based workflow definitions, HTTP-based workers
  • Good for: Teams wanting open-source without Temporal complexity
  • Trade-offs: Not as battle-tested as Temporal at extreme scale, less mature ecosystem

Comparison

| Framework | Durability | Open Source | Complexity | Best For |
| --- | --- | --- | --- | --- |
| Temporal | Excellent (durable execution) | Yes (server) + cloud | Medium | Complex long-running workflows |
| AWS Step Functions | Good (managed) | No | Low | AWS-centric, simple workflows |
| Conductor | Good | Yes (fully) | Medium | Open-source preference |

For most production scenarios, Temporal is the right call. The durable execution guarantee alone eliminates a whole class of saga state loss bugs.

When to Use / When Not to Use the Saga Pattern

| Criteria | Saga (Choreography) | Saga (Orchestration) | Two-Phase Commit |
| --- | --- | --- | --- |
| Atomicity | Eventual | Eventual | True atomicity |
| Isolation | None | None | Full isolation |
| Availability | High | High | Low (blocks on coordinator failure) |
| Complexity | Distributed logic | Centralized logic | Distributed but synchronous |
| Compensation | Each service knows its own | Orchestrator directs | Automatic rollback |
| Latency | Per-step latency | Per-step + orchestration overhead | Two round-trips |
| Debugging | Harder (distributed) | Easier (centralized) | Moderate |
| Rollback Cost | Compensation required | Compensation required | Free (automatic) |

Use saga when:

  • Operations span multiple services with separate databases
  • You cannot, or choose not to, use 2PC (avoiding it is usually the right call)
  • Business transactions map naturally to a sequence of steps
  • You can define compensating transactions for each step
  • Eventual consistency is acceptable for your domain
  • You need availability over strong isolation

Avoid saga when:

  • Steps have tight interdependencies that require strict isolation
  • Rollback must be immediate and guaranteed
  • Compensation is expensive or impossible (sending an email, charging a card with long refund times)
  • Your domain requires all-or-nothing atomicity that saga cannot provide
  • The inconsistency window is unacceptable for your use case

Production Considerations

Saga Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Step fails and compensation also fails | System left in inconsistent state | Design idempotent compensations; implement retry with exponential backoff; alert on repeated compensation failures |
| Service crashes mid-step | Step may or may not have completed; unknown state | Use idempotency keys; implement saga state tracking; use durable workflow engines |
| Concurrent sagas interfere | One saga’s uncommitted data affects another saga’s read | Implement optimistic concurrency control; use application-level locks for critical resources |
| Compensation runs on already-succeeded step | Double compensation causes incorrect state | Track completed steps explicitly; prevent compensation on committed steps |
| Saga state lost (orchestrator crash) | Cannot determine what steps completed | Persist saga state to durable storage; use workflow engines that handle this |
| Infinite retry loop | System stuck in repeated failed attempts | Implement max retry count; move to dead letter state after threshold; alert |

Diagrams & State Machines

Failure Flow Diagram

graph TD
    Start[Start Saga] --> Step1[Execute Step 1]
    Step1 --> Step1OK{Step 1 OK?}
    Step1OK -->|No| Fail1[Return Error]
    Step1OK -->|Yes| Step2[Execute Step 2]
    Step2 --> Step2OK{Step 2 OK?}
    Step2OK -->|No| Comp1[Compensate Step 1]
    Comp1 --> Fail2[Return Error]
    Step2OK -->|Yes| Step3[Execute Step 3]
    Step3 --> Step3OK{Step 3 OK?}
    Step3OK -->|No| Comp2A[Compensate Step 2]
    Comp2A --> Comp2B[Compensate Step 1]
    Comp2B --> Fail3[Return Error]
    Step3OK -->|Yes| Success[Saga Complete]

Saga Execution State Machine

stateDiagram-v2
    [*] --> Pending: Saga created
    Pending --> Executing: First step started
    Executing --> Executing: Step N completed
    Executing --> Compensating: Step failed
    Compensating --> Compensating: Compensation in progress
    Compensating --> Completed: All compensations done
    Executing --> Completed: Final step succeeded
    Compensating --> Failed: Compensation failed after retries
    Pending --> Failed: Immediate failure (validation error)
    Completed --> [*]
    Failed --> [*]

State Definitions:

| State | Description |
| --- | --- |
| Pending | Saga created but not yet started. Initial state before first step execution. |
| Executing | One or more steps have completed successfully. Saga is processing subsequent steps. |
| Compensating | A step failed and compensation is running in reverse order to undo completed steps. |
| Completed | All steps succeeded or all compensations succeeded. Terminal success state. |
| Failed | Saga reached an unrecoverable state (compensation failed or max retries exceeded). |

State Transition Rules:

  1. Once in Executing, cannot return to Pending
  2. Compensating can only be entered from Executing
  3. Completed and Failed are terminal states
  4. From Compensating, success leads to Completed, persistent failure leads to Failed
  5. Only Pending and Failed can be safely retried at the saga level
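
These transitions can be enforced in code with a lookup table. The sketch below mirrors the edges in the state diagram above; `SagaStatus`, `ALLOWED`, and `transition` are hypothetical names for illustration.

```python
from enum import Enum


class SagaStatus(Enum):
    PENDING = "pending"
    EXECUTING = "executing"
    COMPENSATING = "compensating"
    COMPLETED = "completed"
    FAILED = "failed"


# One entry per state, listing the states it may move to
ALLOWED = {
    SagaStatus.PENDING: {SagaStatus.EXECUTING, SagaStatus.FAILED},
    SagaStatus.EXECUTING: {
        SagaStatus.EXECUTING, SagaStatus.COMPENSATING, SagaStatus.COMPLETED,
    },
    SagaStatus.COMPENSATING: {
        SagaStatus.COMPENSATING, SagaStatus.COMPLETED, SagaStatus.FAILED,
    },
    SagaStatus.COMPLETED: set(),  # terminal
    SagaStatus.FAILED: set(),     # terminal
}


class IllegalTransition(Exception):
    pass


def transition(current, target):
    """Return `target` if the move is allowed, otherwise raise IllegalTransition."""
    if target not in ALLOWED[current]:
        raise IllegalTransition(f"{current.name} -> {target.name}")
    return target


status = SagaStatus.PENDING
status = transition(status, SagaStatus.EXECUTING)
status = transition(status, SagaStatus.COMPENSATING)
status = transition(status, SagaStatus.COMPLETED)
```

Routing every status change through one function like this turns an invalid move, such as Completed back to Executing, into an immediate error rather than silently corrupted saga state.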

Choreography vs Orchestration: Trade-off Analysis

| Factor | Choreography | Orchestration | Recommendation |
| --- | --- | --- | --- |
| Coupling | Low: services own their own logic | High: orchestrator knows all steps | Choreography for independent teams |
| Complexity | Distributed: each service knows its trigger | Centralised: orchestrator is complex | Orchestration for complex flows |
| Visibility | Poor: no central view of saga state | High: orchestrator tracks all state | Orchestration for debugging |
| Scalability | High: no central bottleneck | Limited by orchestrator capacity | Choreography at scale |
| Failure Handling | Each service handles its own failures | Centralised retry and compensation | Orchestration for guaranteed ordering |

Production Failure Scenarios

Scenario 1: Payment Service Unavailable During Saga

What happened: A saga was executing an order placement across three services: inventory, payment, and shipping. The payment service became unavailable for 45 seconds mid-saga after the inventory reservation succeeded.

Root cause: The payment service’s database connection pool was exhausted due to a misconfigured max-connections setting. Requests queued up behind the pool limit and timed out.

Impact: 127 sagas entered the compensating state simultaneously. The compensating transaction to release inventory ran successfully, but the refund for charges already processed was delayed. 23 customers reported duplicate charges before idempotency checks caught them.

Lesson learned: Always implement idempotency keys at every saga step. The payment step used an idempotency key derived from the saga ID and step number, which prevented double-charging on retry — but the compensation path did not use an idempotency key, causing the delayed refund to appear as a duplicate in monitoring.
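
A key derived from the saga ID and step number can cover the compensation path as well by adding the direction to the key. This is a minimal sketch; the `idempotency_key` function and its `phase` values are assumptions, not the format used in the incident.

```python
def idempotency_key(saga_id: str, step_number: int, phase: str) -> str:
    """Build a deterministic key for one saga step in one direction.

    `phase` is "do" for the forward action and "undo" for its compensation,
    so a refund gets its own stable key instead of going without one.
    """
    if phase not in ("do", "undo"):
        raise ValueError(f"unknown phase: {phase}")
    return f"{saga_id}:{step_number}:{phase}"


charge_key = idempotency_key("saga-42", 2, "do")
refund_key = idempotency_key("saga-42", 2, "undo")
retry_refund_key = idempotency_key("saga-42", 2, "undo")  # same key on every retry
```

Because the key is computed rather than randomly generated, a retried refund after a crash presents the same key, and the payment service can recognise it as a duplicate.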

Scenario 2: Circular Dependency in Choreography

What happened: Two sagas in an e-commerce platform developed a circular dependency. Saga A needed to update product pricing and Saga B needed to update product availability. Both sagas published events that triggered each other’s next steps, creating an infinite loop that saturated the message broker.

Root cause: The choreography design did not account for the relationship between pricing updates and availability updates. A pricing change can affect which warehouse fulfils an order, which changes availability, which can affect pricing through promotional rules.

Impact: Message broker queue depth grew to 2.3 million messages in 3 minutes before the circuit breaker triggered. The system recovered after restarting both sagas and replaying from the last stable checkpoint.

Lesson learned: Model all cross-saga dependencies explicitly before deploying choreography. Implement circuit breakers per event type, not per saga. Always set message TTL and dead-letter queues for choreography events.
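A per-event-type circuit breaker can be sketched as a sliding-window counter that opens for a cooldown period once an event type exceeds its budget. This is an illustrative sketch only: `EventTypeBreaker`, its thresholds, and the in-memory dead-letter list are assumptions, and a real broker would supply TTL and dead-letter queues itself.

```python
from collections import defaultdict, deque


class EventTypeBreaker:
    """Opens per event type when too many events arrive within a window."""

    def __init__(self, max_events: int, window_seconds: float, cooldown_seconds: float):
        self.max_events = max_events
        self.window = window_seconds
        self.cooldown = cooldown_seconds
        self._seen = defaultdict(deque)  # event type -> recent timestamps
        self._open_until = {}            # event type -> time the breaker closes
        self.dead_letters = []           # events rejected while open

    def allow(self, event_type: str, now: float) -> bool:
        if now < self._open_until.get(event_type, 0.0):
            return False
        stamps = self._seen[event_type]
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] > self.window:
            stamps.popleft()
        stamps.append(now)
        if len(stamps) > self.max_events:
            # Budget exceeded: open the breaker for this event type only.
            self._open_until[event_type] = now + self.cooldown
            stamps.clear()
            return False
        return True

    def publish(self, event_type: str, payload: dict, now: float, send) -> bool:
        """Forward the event via `send`, or park it in the dead-letter list."""
        if self.allow(event_type, now):
            send(event_type, payload)
            return True
        self.dead_letters.append((event_type, payload))
        return False
```

Because the counters are keyed by event type, a runaway pricing event does not stop unrelated availability events from flowing.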

Scenario 3: Orchestrator State Loss

What happened: A Kubernetes pod running the saga orchestrator was terminated during a 12-step order fulfilment saga. The orchestrator had been persisting state every 10 steps, and the saga was at step 7. The replacement pod could not determine whether step 6 had completed.

Root cause: The orchestrator persisted saga state asynchronously to reduce latency, using a write-ahead log with batch commits. The batch had not been committed when the pod was terminated, losing visibility into steps 1–6.

Impact: The replacement pod retried from step 1, causing double reservation of inventory and double payment capture. The idempotency layer caught the double payment but not the double inventory reservation, leading to inventory inconsistency that took 4 hours to reconcile manually.

Lesson learned: Persist saga state synchronously at every step boundary rather than in batches of N steps. Use the outbox pattern so that step completion and state persistence are atomic.
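A minimal sketch of per-step persistence with an outbox, using SQLite from the Python standard library. The table names (`saga_steps`, `outbox`) and function names are hypothetical, and the sketch assumes saga state and the outbox live in the same database so one local transaction can cover both writes.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS saga_steps (
            saga_id TEXT NOT NULL,
            step_index INTEGER NOT NULL,
            status TEXT NOT NULL,
            PRIMARY KEY (saga_id, step_index)
        );
        CREATE TABLE IF NOT EXISTS outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            saga_id TEXT NOT NULL,
            payload TEXT NOT NULL
        );
    """)
    return conn


def record_step_completed(conn, saga_id: str, step_index: int, event: dict) -> None:
    """Write the step status and its outbox event in one transaction."""
    # `with conn` commits on success and rolls back if anything raises.
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO saga_steps (saga_id, step_index, status) "
            "VALUES (?, ?, 'completed')",
            (saga_id, step_index),
        )
        # Only emit the event the first time this step is recorded.
        if cur.rowcount == 1:
            conn.execute(
                "INSERT INTO outbox (saga_id, payload) VALUES (?, ?)",
                (saga_id, json.dumps(event)),
            )


def last_completed_step(conn, saga_id: str):
    """Return the highest completed step index, or None if none is recorded."""
    row = conn.execute(
        "SELECT MAX(step_index) FROM saga_steps "
        "WHERE saga_id = ? AND status = 'completed'",
        (saga_id,),
    ).fetchone()
    return row[0]
```

On recovery, a replacement orchestrator would call `last_completed_step` and resume from the next step instead of restarting from step 1. A separate relay process (not shown) would read the `outbox` table and publish the events.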

Scenario 4: Compensating Transaction Loop

What happened: A saga that books travel (flight, hotel, car) experienced a network partition between the hotel and flight services. The saga compensated the flight booking, then the network partition resolved, and the hotel service’s compensation event arrived late. The orchestrator interpreted the late event as a new failure and began compensating again.

Root cause: Events arrived out of order because the hotel service had a slower consumer group than the flight service. The compensation for the hotel ran after the saga had already reached the Completed state.

Impact: Three customers had their flight rebooked unnecessarily when the second compensation attempt ran after the original compensation had already released the inventory. One customer was moved from their booked flight to a later flight with a 6-hour delay.

Lesson learned: Use version numbers or vector clocks on saga events to detect and discard late events. Implement idempotency in compensation handlers using the original saga ID. Never compensate a saga that has already reached a terminal state.
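The guard described in this lesson can be sketched with a monotonically increasing version number per saga. This is a simplified illustration: `SagaRecord`, `apply_event`, and the event names are hypothetical, and a plain integer version stands in for vector clocks.

```python
from dataclasses import dataclass, field

TERMINAL = {"completed", "failed"}


@dataclass
class SagaRecord:
    saga_id: str
    state: str = "executing"
    version: int = 0  # version of the last event applied
    applied: list = field(default_factory=list)


def apply_event(saga: SagaRecord, event_version: int, event_type: str) -> bool:
    """Apply an event; return False if it was discarded as late or stale."""
    # Never act on a saga that has already reached a terminal state.
    if saga.state in TERMINAL:
        return False
    # Discard events that are older than (or equal to) what was already applied.
    if event_version <= saga.version:
        return False
    saga.version = event_version
    saga.applied.append(event_type)
    if event_type == "StepFailed":
        saga.state = "compensating"
    elif event_type == "CompensationDone":
        saga.state = "completed"
    return True
```

With this check in place, a compensation event that arrives after the saga has completed is dropped instead of starting a second round of compensation.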

Common Pitfalls / Anti-Patterns

Treating a saga as an ACID transaction: A saga does not provide isolation. Concurrent sagas can see each other’s partial results. If saga A reserves inventory and saga B reads inventory before A completes, B may make decisions based on data that A later undoes. Handle this at the application level.

Non-idempotent steps or compensations: If a step or compensation runs twice due to retries, the effect should be the same as running once. Reserving inventory twice should not double-reserve. Always check before acting.

Compensation order errors: Compensations must run in reverse order of execution. If step 1’s compensation runs before step 2’s, you may leave the system in a worse state. Explicitly track execution order.

Ignoring the inconsistency window: During saga execution, the system is in an inconsistent state. Other operations may read partial results. Design your application to handle this (show pending states, use optimistic UI).

Long-running compensations: Compensations can take time (refunds, cancellations). While compensating, the system is still inconsistent. Minimize compensation time and alert if compensations are taking too long.

Not planning for compensation failure: What happens if compensation fails? The saga is stuck. Implement retry with backoff, then move to a dead letter state that requires manual intervention.
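Two of the pitfalls above, compensation order and compensation failure, can be illustrated together. The sketch below is a minimal in-process runner under stated assumptions: `run_saga` and `_retry` are hypothetical names, steps are plain callables, and the runner keeps compensating the remaining steps even when one compensation is given up on, which is a design choice rather than a rule.

```python
import time


def _retry(fn, max_attempts: int, base_delay: float, sleep) -> bool:
    """Call fn with exponential backoff; return True once it succeeds."""
    for attempt in range(max_attempts):
        try:
            fn()
            return True
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return False


def run_saga(steps, max_attempts: int = 3, base_delay: float = 0.1, sleep=time.sleep):
    """Run (name, action, compensation) steps in order.

    Returns ("completed", []), ("compensated", []), or
    ("dead_letter", [names of steps whose compensation kept failing]).
    """
    completed = []  # execution order is tracked explicitly
    try:
        for name, action, compensation in steps:
            action()
            completed.append((name, compensation))
    except Exception:
        stuck = []
        # Undo in reverse order of execution.
        for name, compensation in reversed(completed):
            if not _retry(compensation, max_attempts, base_delay, sleep):
                stuck.append(name)  # needs manual intervention
        return ("dead_letter", stuck) if stuck else ("compensated", [])
    return ("completed", [])
```

A `dead_letter` outcome is the point at which an alert should fire and an operator takes over.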

Interview Questions

1. What is the Saga pattern and why was it developed as an alternative to two-phase commit?

Expected answer points:

  • Saga is a distributed transaction pattern that chains local transactions with compensating actions
  • 2PC blocks resources and has a single point of failure (coordinator crash leaves participants waiting)
  • Saga was developed because ACID transactions do not scale across service boundaries — each service owns its database and cannot participate in a shared transaction
  • Saga trades ACID isolation for availability and scalability by using eventual consistency

2. What is the difference between choreographed and orchestrated sagas?

Expected answer points:

  • Choreographed saga: each service reacts to events from the previous service, behavior is fully distributed, no central coordinator
  • Orchestrated saga: a central orchestrator manages the workflow, knows the entire sequence, decides next steps and triggers compensations
  • Orchestration is easier to debug and reason about; choreography scales better but is harder to trace
  • Choreography couples services through event contracts; orchestration couples services to the orchestrator

3. What is a compensating transaction and why is it central to the Saga pattern?

Expected answer points:

  • A compensating transaction is an action that undoes a previously completed local transaction
  • Examples: refund payment, release inventory reservation, cancel shipment
  • When a step fails, the saga runs compensations for all completed steps in reverse order
  • Compensations must be deterministic and idempotent — running twice should have the same effect as running once
  • Not all operations can be compensated (sending an email, physical goods already shipped)

4. Why is idempotency critical in saga implementations?

Expected answer points:

  • Sagas run across unreliable networks — messages can be delivered twice, services can crash mid-operation
  • If a step retry delivers the same reservation request twice, you should not double-reserve inventory
  • Compensation runs must also be idempotent — refunding twice should not over-refund
  • Implementation: check before acting using an idempotency key (e.g., charge_id, reservation_id)
  • Without idempotency, retries cause data corruption and inconsistent state

5. What are the key states in a Saga execution state machine?

Expected answer points:

  • Pending: saga created but not started
  • Executing: one or more steps have completed successfully
  • Compensating: a step failed and compensation is running in reverse order
  • Completed: all steps succeeded or all compensations succeeded — terminal success state
  • Failed: unrecoverable state (compensation failed after retries, validation error, max retries exceeded)
  • Only Pending and Failed states are safely retryable at the saga level

6. How does optimistic locking work in the context of sagas, and when would you prefer it over pessimistic locking?

Expected answer points:

  • Optimistic locking: read the resource version before modifying; on update, check that the version has not changed; if it has changed, abort and retry
  • Pessimistic locking: acquire a lock before accessing a resource, blocks other sagas until lock releases
  • Optimistic locking provides higher throughput since locks are not held; pessimistic locking reduces concurrency
  • Use optimistic locking for most cases; prefer pessimistic locking for financial transactions and inventory with hard limits
  • In nested sagas, parent-level locking protects shared resources while nested sagas handle sub-workflows
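The read-check-write cycle of optimistic locking can be sketched with a versioned in-memory store. The names `InventoryStore`, `VersionConflict`, and `reserve` are hypothetical, and the store is not thread-safe; with a real database the version check would typically sit in the `WHERE` clause of the `UPDATE` statement.

```python
class VersionConflict(Exception):
    """Raised when the resource changed since it was read."""


class InventoryStore:
    """Versioned in-memory resource (hypothetical, single-threaded)."""

    def __init__(self, stock: int):
        self._stock = stock
        self._version = 0

    def read(self):
        return self._stock, self._version

    def write(self, new_stock: int, expected_version: int) -> None:
        # Reject the write if another saga updated the resource first.
        if expected_version != self._version:
            raise VersionConflict(
                f"expected version {expected_version}, found {self._version}"
            )
        self._stock = new_stock
        self._version += 1


def reserve(store: InventoryStore, quantity: int, max_retries: int = 3) -> bool:
    """Reserve stock; re-read and retry when a concurrent update wins."""
    for _ in range(max_retries):
        stock, version = store.read()
        if stock < quantity:
            return False  # not enough stock, nothing to retry
        try:
            store.write(stock - quantity, version)
            return True
        except VersionConflict:
            continue  # someone else changed the stock; read again
    raise VersionConflict("gave up after retries")
```

No lock is held between `read` and `write`, which is where the throughput advantage over pessimistic locking comes from.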

7. What are the main trade-offs between Saga and Two-Phase Commit?

Expected answer points:

  • 2PC provides true atomicity and isolation; Saga provides eventual consistency with no isolation
  • 2PC blocks on coordinator failure; in a saga, each participant commits locally and holds no locks while waiting, so a coordinator outage delays the saga but does not block participants
  • Saga has higher latency (sequential steps) vs 2PC (parallel phases), but 2PC locks are held longer
  • 2PC has lower availability due to coordinator blocking; Saga has higher availability
  • Saga compensation is application-defined and can fail; 2PC rollback is automatic
  • Saga works across service boundaries with separate databases; 2PC requires all participants to support it

8. What is a nested saga and when would you use this pattern?

Expected answer points:

  • A nested saga is where a step's "execute" is itself a saga with its own compensation logic
  • Example: a Payment step runs authorize → capture as a nested saga; if it fails, parent saga treats it as one failed step
  • Benefits: reusability (ProcessPayment nested saga reused across order, subscription, refund workflows), readability (top-level reads like business workflow), scoped failures
  • Introduces concurrency at parent level — requires optimistic or pessimistic locking at parent level
  • Nested sagas help decompose large monolithic workflows into manageable pieces

9. How do you handle the case where a compensation itself fails?

Expected answer points:

  • Design idempotent compensations — if compensation runs twice the second run should have no effect
  • Implement retry with exponential backoff for transient failures
  • After max retries exceeded, move the saga to a dead letter state that requires manual intervention
  • Alert on repeated compensation failures — this indicates systemic issues
  • Design compensations to be as reliable as forward actions; avoid compensations that can themselves fail permanently (e.g., physical goods already delivered)

10. What are the key observability metrics and logs you should track for a production saga implementation?

Expected answer points:

  • Metrics: saga completion rate (success vs failure vs compensating), execution duration by type, compensation count and success rate, step failure rate by type, concurrent saga count
  • Logs: saga start with correlation ID, each step start/completion/compensation with step index, compensating transaction ID, saga outcome, all relevant IDs (saga, correlation, step, compensation)
  • Alerts: saga taking longer than threshold, compensation repeatedly failing, failure rate exceeding baseline, max retry count reached, stuck sagas
  • Distributed tracing (OpenTelemetry/Zipkin/Jaeger): parent span for overall saga, child spans per step and compensation, link compensation spans to original step spans

11. How does distributed tracing help debug saga failures across services?

Expected answer points:

  • Sagas span multiple services — without tracing, debugging means grepping logs across all services and piecing together what happened
  • Distributed tracing (OpenTelemetry, Zipkin, Jaeger) gives each saga a single trace ID that propagates through all service calls
  • Trace structure: parent span for the overall saga, child spans for each step and compensation, with links pairing compensation spans to original step spans
  • Trace context propagation: pass trace context through saga steps using baggage or span links so the full execution path is visible
  • When a step fails and compensation runs, link the compensation span to the original step span so you can see the pairing visually in the trace viewer

12. Why is saga state persistence critical and how would you implement it?

Expected answer points:

  • If the saga orchestrator crashes mid-execution, without persisted state you cannot determine which steps completed and which need compensation
  • Persist saga state to durable storage (database) after each step: saga ID, step name, status (started/completed/compensated), result data
  • The saga state machine tracks Pending, Executing, Compensating, Completed, and Failed states
  • On recovery, read persisted state and determine whether to resume, complete compensation, or treat as failed
  • Workflow engines like Temporal handle this automatically with durable workflow execution — if a worker crashes, the workflow resumes from where it left off

13. How does saga differ from other distributed transaction patterns like TCC (Try-Confirm-Cancel) and 2PC?

Expected answer points:

  • TCC uses three phases per operation: Try (reserve resources), Confirm (commit the reservation), Cancel (release the reservation). TCC requires infrastructure support from all participants
  • Saga uses two operations per step: the forward action and a separate compensating action. Simpler to implement but compensation can fail
  • 2PC is atomic across all participants but blocks on coordinator failure; saga is eventually consistent but never blocks waiting for a coordinator
  • TCC provides some isolation at the Try phase; saga provides no isolation between steps
  • Saga is easier to retrofit onto existing services since each service only needs to implement its own compensation; TCC requires participants to implement Try/Confirm/Cancel interfaces

14. What happens when a network partition occurs during saga execution?

Expected answer points:

  • Network partitions cause timeouts and indeterminate state — you do not know whether the remote service received your request and processed it
  • Apply idempotency logic so duplicate messages after partition recovery do not cause duplicate operations
  • The saga must distinguish between "step is still running" and "step completed but network failed before response arrived"
  • Timeout-based compensation: if a step does not respond within a threshold, trigger compensation for completed steps
  • Use idempotency keys and state checks on recovery to determine whether to retry or continue as-is
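The "completed but the response was lost" case can be sketched as follows. Everything here is a hypothetical stand-in: `FlakyPaymentService` simulates a service that processes the first request but drops its response, and `charge_with_recovery` assumes the service exposes a status lookup by idempotency key.

```python
class FlakyPaymentService:
    """Processes the first request but loses its response (hypothetical)."""

    def __init__(self):
        self.charges = {}  # idempotency key -> amount
        self._drop_next_response = True

    def charge(self, key: str, amount: int) -> str:
        # Idempotent: a repeated key does not create a second charge.
        self.charges.setdefault(key, amount)
        if self._drop_next_response:
            self._drop_next_response = False
            raise TimeoutError("response lost after processing")
        return "charged"

    def status(self, key: str) -> str:
        return "charged" if key in self.charges else "unknown"


def charge_with_recovery(service, key: str, amount: int) -> str:
    """Charge once, resolving an indeterminate timeout with a state check."""
    try:
        return service.charge(key, amount)
    except TimeoutError:
        # Indeterminate: the request may or may not have been processed.
        if service.status(key) == "charged":
            return "charged"  # step completed; continue the saga as-is
        # Not processed; the idempotency key makes a retry safe.
        return service.charge(key, amount)
```

The state check is what separates "retry" from "continue as-is"; without it, the saga would have to guess.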

15. What are the main testing strategies for saga implementations and why is each important?

Expected answer points:

  • Unit testing compensations: test each compensation in isolation, verify idempotency (running twice has no additional effect)
  • Integration testing failure scenarios: simulate step failures at each point and verify compensations run in correct reverse order
  • Compensation order testing: explicitly verify that compensations run in reverse execution order — this is the most common saga bug
  • Chaos testing in production: kill services mid-saga, introduce latency, fill resources, split networks — verify saga handles each gracefully
  • Distributed tracing verification: confirm trace IDs propagate correctly through all steps and compensations
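As an illustration of the first strategy, the following sketch unit-tests a compensation for idempotency with the standard `unittest` module. The `Inventory` class and its `reserve`/`release` methods are hypothetical stand-ins for a real service.

```python
import unittest


class Inventory:
    """In-memory inventory with idempotent reserve and release (hypothetical)."""

    def __init__(self, stock: int):
        self.stock = stock
        self.reservations = {}  # reservation id -> quantity

    def reserve(self, reservation_id: str, qty: int) -> None:
        if reservation_id in self.reservations:
            return  # already reserved; do nothing on retry
        if qty > self.stock:
            raise ValueError("insufficient stock")
        self.stock -= qty
        self.reservations[reservation_id] = qty

    def release(self, reservation_id: str) -> None:
        # Compensation: returns stock only if the reservation still exists.
        qty = self.reservations.pop(reservation_id, None)
        if qty is not None:
            self.stock += qty


class CompensationTests(unittest.TestCase):
    def test_release_is_idempotent(self):
        inv = Inventory(stock=10)
        inv.reserve("r1", 4)
        inv.release("r1")
        inv.release("r1")  # second run must have no additional effect
        self.assertEqual(inv.stock, 10)

    def test_reserve_is_idempotent(self):
        inv = Inventory(stock=10)
        inv.reserve("r1", 4)
        inv.reserve("r1", 4)  # retry of the same step
        self.assertEqual(inv.stock, 6)

    def test_release_without_reservation_is_a_no_op(self):
        inv = Inventory(stock=10)
        inv.release("never-reserved")
        self.assertEqual(inv.stock, 10)
```

The last case matters in practice: a compensation can arrive for a step whose forward action never took effect.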

16. What are the best practice guidelines for designing saga step granularity?

Expected answer points:

  • Steps should represent business activities, not technical operations — each step should be meaningful in the domain
  • Too fine-grained: too many steps means many compensations to manage and track; high orchestration overhead
  • Too coarse-grained: large steps mean expensive compensations; if one step fails you may have to undo a lot of work
  • Use nested sagas to decompose coarse steps into sub-workflows while keeping the top-level saga readable
  • Compensation cost should be proportional to forward action cost — avoid steps where compensation is significantly more expensive than the forward action

17. How does the saga pattern relate to eventual consistency and what are the implications for application design?

Expected answer points:

  • Saga provides eventual consistency: after the saga completes (success or compensation), the system reaches a consistent state
  • During saga execution, the system is in a temporarily inconsistent state — other operations can observe partial results
  • Application must handle intermediate states: show users pending states, use optimistic UI, design for the inconsistency window
  • Do not assume that because step 1 succeeded, step 2 will also succeed — handle failure at each step
  • Eventual consistency is acceptable for most business workflows (order fulfillment, booking, subscriptions) where strict isolation is not required

18. What is the role of correlation IDs in saga debugging and how should they be implemented?

Expected answer points:

  • A correlation ID is a unique identifier assigned to a saga instance at creation and propagated through all subsequent step calls and events
  • Include correlation ID in all logs so you can filter all services by the same ID to reconstruct the full saga execution
  • Propagate correlation ID through HTTP headers, message queue properties, or gRPC metadata depending on your transport
  • Do not expose internal IDs (database primary keys) directly — use the correlation ID as the public-facing identifier in logs and traces
  • Distributed tracing tools use trace IDs as correlation IDs when properly configured
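A minimal sketch of correlation ID propagation using only the Python standard library. The names `correlation_id`, `CorrelationFilter`, `build_logger`, `start_saga`, and `outgoing_headers` are hypothetical, and the `X-Correlation-ID` header name is a common convention assumed here, not a standard.

```python
import contextvars
import logging
import uuid

# Holds the correlation ID for the saga currently being processed.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def build_logger(name: str = "saga", stream=None) -> logging.Logger:
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(levelname)s %(message)s")
    )
    handler.addFilter(CorrelationFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def start_saga(logger: logging.Logger) -> str:
    """Assign a correlation ID at saga creation and log the start."""
    cid = str(uuid.uuid4())
    correlation_id.set(cid)
    logger.info("saga started")
    return cid


def outgoing_headers() -> dict:
    """Headers to attach to downstream HTTP calls (header name is assumed)."""
    return {"X-Correlation-ID": correlation_id.get()}
```

Every log line written through this logger starts with the saga's correlation ID, so filtering by that ID across services reconstructs the full execution.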

19. How do you decide between choreography and orchestration for a given saga implementation?

Expected answer points:

  • Choreography works well for simple workflows with 2-4 services where the logic is straightforward and teams are independent
  • Orchestration is better for complex workflows with many steps, multiple teams, or workflows that benefit from centralized error handling
  • Choreography creates loose coupling through event contracts but makes it harder to see the overall workflow state
  • Orchestration centralizes logic in one place making it easier to debug and modify but creates a coupling point to the orchestrator
  • Consider the team structure: if one team owns the workflow, orchestration gives them clear ownership; if multiple teams share responsibility, choreography may reduce coordination overhead

20. What are the key security considerations when implementing saga patterns?

Expected answer points:

  • Authenticate and authorize saga trigger endpoints — do not allow arbitrary saga instantiation without access control
  • Validate all saga input parameters to prevent injection attacks through saga payloads
  • Do not log sensitive data (payment info, passwords, PII) in saga context — logs may be stored in centralized logging systems
  • Encrypt saga state at rest if using external storage for persistence — saga state may contain business-sensitive information
  • Audit log all saga state changes (start, step completion, compensation, completion, failure) with correlation IDs for forensic analysis

Further Reading

For distributed transaction fundamentals, see Distributed Transactions. For two-phase commit, see Two-Phase Commit.

For workflow patterns, see Service Orchestration and Service Choreography.

Quick Recap Checklist

  • Saga breaks distributed transactions into steps with compensating transactions
  • If a step fails, previous steps are compensated in reverse order
  • Saga provides eventual consistency, not ACID isolation
  • Two implementations: choreography (distributed) and orchestration (centralized)
  • Idempotency is essential for safe retries and preventing duplicate operations

Saga Production Checklist

  • Each saga step has a defined compensation transaction
  • Compensation transactions are tested with injected failures
  • Idempotency implemented for all saga steps
  • Saga state persisted for crash recovery
  • Timeout values tuned for each step (long enough to complete, short enough to recover)
  • Concurrent saga execution handled correctly
  • Monitoring and alerting for saga failures configured

Conclusion

The saga pattern manages distributed transactions without 2PC. Break the transaction into steps, and define a compensation for each step. If a step fails, run the compensations in reverse order.

Choreographed sagas distribute behavior across services. Orchestrated sagas centralize coordination in an orchestrator. Both work; the choice depends on workflow complexity and team structure.

Sagas trade ACID isolation for availability. The application must handle partial states and concurrent access. This complexity is inherent to distributed transactions; saga just makes it explicit rather than pretending it does not exist.

sequenceDiagram
    participant Saga
    participant Inv as Inventory
    participant Pay as Payment
    Saga->>+Inv: Step 1: Reserve
    Inv->>-Saga: OK
    Saga->>+Pay: Step 2: Charge
    Pay->>-Saga: Fail
    Saga->>+Inv: Compensate

Reference Checklists

Production Checklist

  • Idempotent step and compensation handlers implemented
  • Saga state persisted to durable storage
  • Compensation logic tested for each step
  • Maximum retry count configured per step
  • Monitoring for saga execution duration
  • Alerting for compensation failures
  • Concurrent saga handling planned (optimistic locking or locks)
  • User-facing pending state handling designed
  • Dead letter handling for unrecoverable saga states
  • Correlation IDs in all saga logs

Observability Checklist

Metrics
  • Saga completion rate (success vs failure vs compensating)
  • Saga execution duration by type
  • Compensation execution count and success rate
  • Step failure rate by step type
  • Concurrent saga count
  • Average number of steps per saga
Logs
  • Log saga start with correlation ID and input parameters
  • Log each step start, completion, and compensation with step index
  • Include compensating transaction ID in compensation logs
  • Log saga outcome (success, failure, compensating)
  • Include all relevant IDs: saga ID, correlation ID, step IDs, compensation IDs
Alerts
  • Alert when saga takes longer than expected threshold
  • Alert when compensation repeatedly fails
  • Alert when saga failure rate exceeds normal baseline
  • Alert when max retry count reached on a step
  • Alert on stuck sagas (no progress for extended period)

Security Checklist

  • Authenticate and authorize saga trigger endpoints
  • Validate saga input parameters to prevent injection
  • Audit log all saga state changes (start, step completion, compensation)
  • Do not log sensitive data (payment info, passwords) in saga context
  • Encrypt saga state at rest if using external storage
  • Use correlation IDs for tracing without exposing internal IDs
