The Saga Pattern: Managing Distributed Transactions

Learn the saga pattern for distributed transactions without two-phase commit. Understand choreography vs orchestration with practical examples and production considerations.

published: March 22, 2026 reading time: 50 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

The Saga pattern lets you handle distributed transactions across microservices without the pain of two-phase commit. Instead of locking resources, each step does its work and defines a way to undo it. If something breaks, compensations run backward through completed steps. You can either let services coordinate through events (choreography) or give one central orchestrator full control (orchestration) — the right choice depends on your workflow complexity and team setup.


The saga pattern is one of the most practical patterns for managing distributed transactions without the complexity of two-phase commit. When a business operation spans multiple services — each with their own database — a single ACID transaction is not possible. The saga breaks this into a sequence of local transactions, each with a compensating transaction that can undo it if something fails.

Introduction

ACID transactions do not scale across services. When the order, inventory, and payment services each have their own database, you cannot wrap a single transaction around all three. Two-phase commit is theoretically possible but practically problematic (more on that in Two-Phase Commit).

The saga pattern offers an alternative. Instead of locking resources across services, a saga breaks a distributed transaction into a sequence of local transactions, each with a compensating transaction that can undo it.

Topic-Specific Deep Dives

Core Concepts

A saga represents a distributed transaction as a series of steps. Each step is a local transaction on one service. After each step, the saga either continues to the next step or, if a step fails, runs compensating transactions to undo the previous steps.

Step 1: Reserve Inventory (compensate: release inventory)
Step 2: Charge Payment (compensate: refund payment)
Step 3: Create Shipment (compensate: cancel shipment)

If step 3 fails, you run the compensation for step 2 (refund) and then step 1 (release inventory). The saga undoes what it did, leaving the system consistent.
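The three steps above can be sketched as a list of actions paired with compensations. This is a minimal in-memory illustration, not a production implementation; `SagaStep`, `run_saga`, and the lambda bodies are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # local transaction on one service
    compensation: Callable[[], None]  # undoes the action if a later step fails


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_shipment() -> None:
    raise RuntimeError("shipping unavailable")


steps = [
    SagaStep("reserve", lambda: log.append("reserve"), lambda: log.append("release")),
    SagaStep("charge", lambda: log.append("charge"), lambda: log.append("refund")),
    SagaStep("ship", fail_shipment, lambda: log.append("cancel")),
]

ok = run_saga(steps)
print(ok, log)
```

Only steps that completed are compensated: the failed shipment step never took effect, so its compensation is not called.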

sequenceDiagram
    participant Saga
    participant Inv as Inventory
    participant Pay as Payment
    participant Ship as Shipping
    Saga->>+Inv: Reserve
    Inv->>-Saga: OK
    Saga->>+Pay: Charge
    Pay->>-Saga: OK
    Saga->>+Ship: Ship
    Ship->>-Saga: OK
    Note over Saga: Success - all steps complete

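A saga does not provide isolation: between steps, other transactions can observe intermediate state, such as inventory that is reserved for an order whose payment has not yet been charged.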

Example: order processing saga

A complete order processing saga with three services:

def create_order_saga(order):
    saga = OrderSaga()

    # Step 1: Reserve inventory
    reservation = saga.reserve_inventory(order.items)
    if not reservation.success:
        return OrderResult(rejected=True, reason="Insufficient inventory")

    # Step 2: Authorize payment
    authorization = saga.authorize_payment(order.payment, order.total)
    if not authorization.success:
        saga.compensate_inventory(reservation.id)
        return OrderResult(rejected=True, reason="Payment declined")

    # Step 3: Create shipment
    shipment = saga.create_shipment(order.address, order.items)
    if not shipment.success:
        saga.reverse_payment(authorization.id)
        saga.compensate_inventory(reservation.id)
        return OrderResult(rejected=True, reason="Shipping unavailable")

    # All steps succeeded
    return OrderResult(confirmed=True, order_id=order.id, shipment=shipment)

compensate_inventory and reverse_payment are the compensating transactions. They run in reverse order on failure.

Saga vs two-phase commit

Two-phase commit (2PC) locks resources until all participants confirm. It provides atomicity but at the cost of availability. If the coordinator fails during the commit phase, participants may be left waiting indefinitely.

Saga takes a different approach. It sacrifices isolation and atomicity for availability and scalability. Steps execute one at a time. Failures trigger compensation, not rollback.

For a deep dive on 2PC and why it is often avoided, see Two-Phase Commit.

Architecture Deep Dives

The choice between choreography and orchestration shapes how teams debug a saga and how it behaves under load. Neither approach is better across the board. Each makes different trade-offs in coupling, visibility, and operational complexity.

Choreography distributes saga logic across services. Each service knows its own step and reacts to events emitted by the previous one. There is no central coordinator, so teams stay loosely coupled and can evolve independently. The trade-off is visibility: no single place shows the overall workflow. Debugging a failed choreographed saga means following an event chain across multiple services, which requires distributed tracing and domain knowledge of each service’s event handlers.

Orchestration centralises the workflow in a dedicated orchestrator. It holds the full step sequence, decides what runs next, and triggers compensations on failure. This gives you a single source of truth for saga state. Debugging is more straightforward: you check the orchestrator logs. The cost is that the orchestrator becomes a coupling point and a potential bottleneck. Teams that own steps must coordinate with the orchestrator team when adding or modifying workflows.

After comparing the two patterns, this section also covers implementation concerns that apply regardless of which style you choose: designing deterministic compensations, ensuring idempotency across unreliable networks, and decomposing large workflows into nested sagas.

Choreographed saga

Each service knows its own step and its own compensation. When a step completes, the service emits an event. The next service reacts to that event. If something fails, services emit failure events that trigger compensations.

graph LR
    Order[Order Service] -->|OrderCreated| Inv[Inventory]
    Inv -->|InventoryReserved| Pay[Payment]
    Pay -->|PaymentCharged| Ship[Shipping]
    Ship -->|ShipmentCreated| Order

In this flow, each service reacts to the previous step’s event. The behavior is distributed. Each service knows only its own piece.

When Payment fails after Inventory is reserved:

graph LR
    Pay -->|PaymentFailed| Inv
    Inv -->|InventoryReleased| Order

The compensation logic lives in each service. Inventory responds to the failure by releasing the reservation.
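The failure flow above can be sketched with a tiny in-process event bus. This is an illustration under simplifying assumptions: `EventBus`, the handler names, and the `card_ok` flag are hypothetical, and dispatch is synchronous, whereas a real deployment would use a message broker.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class EventBus:
    """Synchronous stand-in for a message broker."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.history.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


bus = EventBus()
reserved = set()


def inventory_on_order_created(event: dict) -> None:
    reserved.add(event["order_id"])
    bus.publish("InventoryReserved", event)


def payment_on_inventory_reserved(event: dict) -> None:
    # The payment service emits either a success or a failure event.
    bus.publish("PaymentCharged" if event["card_ok"] else "PaymentFailed", event)


def inventory_on_payment_failed(event: dict) -> None:
    # Compensation lives in the inventory service itself.
    reserved.discard(event["order_id"])
    bus.publish("InventoryReleased", event)


bus.subscribe("OrderCreated", inventory_on_order_created)
bus.subscribe("InventoryReserved", payment_on_inventory_reserved)
bus.subscribe("PaymentFailed", inventory_on_payment_failed)

bus.publish("OrderCreated", {"order_id": "order-123", "card_ok": False})
print(bus.history)
```

No handler knows the whole workflow. The sequence only becomes visible by reading the event history, which is why choreographed sagas lean on tracing.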

Orchestrated saga

A central orchestrator manages the sequence. It decides what step runs next, handles failures, and triggers compensations. The orchestrator knows the entire workflow.

graph LR
    Orch[Order Orchestrator] -->|Reserve| Inv
    Inv -->|OK| Orch
    Orch -->|Charge| Pay
    Pay -->|OK| Orch
    Orch -->|Ship| Ship
    Ship -->|OK| Orch

The orchestrator keeps state about what has completed. If Payment fails, the orchestrator tells Inventory to release and returns an error to the client.

For a full comparison, see Service Orchestration.

Implementing saga compensation

Compensation logic must be deterministic. If step 3 fails after steps 1 and 2 succeeded, you must undo step 2 and then step 1. Running a compensation twice, or running compensations in the wrong order, causes problems.

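class OrderSaga:
    def __init__(self, inventory, payment, shipping):
        self.inventory = inventory
        self.payment = payment
        self.shipping = shipping
        self.steps = []  # completed steps, in execution order

    def execute(self, order):
        self.steps = []  # start each run with a clean step log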
        try:
            # Step 1: Reserve inventory
            reservation = self.inventory.reserve(order.items)
            self.steps.append(('reserve', reservation))

            # Step 2: Process payment
            charge = self.payment.charge(order.payment_info, order.total)
            self.steps.append(('charge', charge))

            # Step 3: Create shipment
            shipment = self.shipping.create(order.address)
            self.steps.append(('ship', shipment))

            return SagaResult(success=True, shipment=shipment)

        except PaymentDeclined:
            # Compensate step 1
            self.inventory.release(self.steps[0][1].reservation_id)
            return SagaResult(success=False, reason="Payment declined")

        except ShippingUnavailable:
            # Compensate step 2
            self.payment.refund(self.steps[1][1].charge_id)
            # Compensate step 1
            self.inventory.release(self.steps[0][1].reservation_id)
            return SagaResult(success=False, reason="Shipping unavailable")

The saga tracks completed steps in order. On failure, it runs compensations in reverse order. This is the compensating transaction pattern.

Idempotency in sagas

Sagas execute across unreliable networks. Messages may be delivered twice. Services may crash mid-operation. Your saga must handle idempotency.

Reserving the same inventory twice must not double-reserve. Charging the same payment twice must not double-charge.

def reserve_inventory(self, reservation_id, items):
    # Idempotency check
    if self.inventory.is_reserved(reservation_id):
        return self.inventory.get_reservation(reservation_id)

    # Actually reserve
    return self.inventory.create_reservation(reservation_id, items)

def charge_payment(self, charge_id, payment_info, amount):
    # Idempotency check using charge_id
    existing = self.payment.get_charge(charge_id)
    if existing:
        return existing

    # Process charge
    return self.payment.create_charge(charge_id, payment_info, amount)

Include an idempotency key (the reservation ID from step 1) in the compensation call. Check before acting.
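The same check-before-acting idea applies to the compensation side. The sketch below assumes an in-memory inventory service keyed by reservation ID; `InventoryService`, `released`, and `release_count` are hypothetical names used only to show the effect.

```python
from typing import Dict, List, Set


class InventoryService:
    def __init__(self) -> None:
        self.reservations: Dict[str, List[str]] = {}
        self.released: Set[str] = set()  # reservation IDs already compensated
        self.release_count = 0           # how many times stock was actually returned

    def reserve(self, reservation_id: str, items: List[str]) -> None:
        # A repeated reserve with the same ID keeps the first reservation.
        self.reservations.setdefault(reservation_id, items)

    def release(self, reservation_id: str) -> bool:
        # Check before acting: a repeated release is a no-op that still succeeds.
        if reservation_id in self.released:
            return True
        if self.reservations.pop(reservation_id, None) is not None:
            self.release_count += 1
        self.released.add(reservation_id)
        return True


inventory = InventoryService()
inventory.reserve("res-1", ["item-a"])
first = inventory.release("res-1")
second = inventory.release("res-1")  # duplicate delivery of the compensation
print(first, second, inventory.release_count)
```

Returning success on the duplicate call matters: the caller cannot distinguish "already released" from "released just now", and it should not have to.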

Nested sagas

Large workflows sometimes need sub-sagas. Rather than one monolithic saga with dozens of steps, you can decompose into nested sagas where a step’s “execute” is itself a saga.

For example, an order fulfillment saga might have a step called ProcessPayment that internally runs authorize → capture as a nested saga. If the nested saga fails, the parent saga treats it as a single failed step and compensates accordingly.

Why nest sagas?

Reusability: The ProcessPayment nested saga can be reused across multiple parent sagas (order, subscription, refund)
Readability: Top-level saga reads like a business workflow, not a technical protocol
Scoped failures: If payment fails, you know exactly which sub-step failed without scrolling through 20 parent steps
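A rough sketch of the ProcessPayment example: the parent runs the child saga inside one of its own steps and treats a child failure as a single failed step. `run_steps`, the step tuples, and the failing `capture` function are hypothetical and exist only for this illustration.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_steps(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, undo the completed ones in reverse."""
    done: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            log.append(f"fail:{name}")
            for undo_name, undo in reversed(done):
                undo()
                log.append(f"undo:{undo_name}")
            return False
        log.append(f"do:{name}")
        done.append((name, compensation))
    return True


log: List[str] = []


def capture() -> None:
    raise RuntimeError("capture rejected")


def process_payment() -> None:
    # Nested saga: authorize -> capture, with its own compensation scope.
    child: List[Step] = [
        ("authorize", lambda: None, lambda: None),
        ("capture", capture, lambda: None),
    ]
    if not run_steps(child, log):
        raise RuntimeError("payment saga failed")  # surfaces as one failed parent step


parent: List[Step] = [
    ("reserve_inventory", lambda: None, lambda: None),
    ("process_payment", process_payment, lambda: None),
]

ok = run_steps(parent, log)
print(ok, log)
```

The child cleans up its own partial work (undoing authorize) before the parent compensates inventory, so each level only needs to know about its own steps.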

Concurrency control with nested sagas:

Nested sagas introduce concurrency at the parent level. While the payment nested saga is running, other parent sagas may also be running and trying to access shared resources. Use optimistic or pessimistic locking at the parent level.

class ParentSaga:
    def execute(self):
        # Step 1: Reserve inventory (parent-level lock on inventory record)
        with self.lock('inventory', self.order.inventory_id):
            self.reserve_inventory()

        # Step 2: Run payment as nested saga (has its own compensation)
        payment_result = PaymentNestedSaga().execute(self.payment_context)

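        if not payment_result.success:
            # Parent-level compensation for inventory (re-acquire the same lock)
            with self.lock('inventory', self.order.inventory_id):
                self.release_inventory()
            return SagaResult(success=False, reason="Payment failed")

        # Step 3: Create shipment
        self.create_shipment()
        return SagaResult(success=True)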

Optimistic locking: Read the resource version before modifying. On update, check the version hasn’t changed. If it has, abort and retry.

Pessimistic locking: Acquire a lock before accessing the resource. Blocks other sagas from accessing it until the lock releases. Simpler but reduces concurrency.

Use optimistic locking for most cases (higher throughput). Use pessimistic locking only when the cost of a concurrent modification is very high (financial transactions, inventory with hard limits).
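Optimistic locking as described above can be sketched with a versioned record and a compare-and-set write. This is a single-process illustration: `InventoryStore`, `Record`, and `VersionConflict` are hypothetical, and in a real database the version check and the write would have to happen atomically (for example, a conditional update on the version column).

```python
from dataclasses import dataclass
from typing import Dict


class VersionConflict(Exception):
    pass


@dataclass
class Record:
    quantity: int
    version: int = 0


class InventoryStore:
    def __init__(self) -> None:
        self.records: Dict[str, Record] = {}

    def read(self, sku: str) -> Record:
        current = self.records[sku]
        return Record(current.quantity, current.version)  # return a snapshot copy

    def write(self, sku: str, quantity: int, expected_version: int) -> None:
        # Reject the write if someone else modified the record since it was read.
        if self.records[sku].version != expected_version:
            raise VersionConflict(sku)
        self.records[sku] = Record(quantity, expected_version + 1)


def reserve(store: InventoryStore, sku: str, amount: int, max_attempts: int = 3) -> bool:
    for _ in range(max_attempts):
        snapshot = store.read(sku)
        if snapshot.quantity < amount:
            return False  # permanent failure: not enough stock
        try:
            store.write(sku, snapshot.quantity - amount, snapshot.version)
            return True
        except VersionConflict:
            continue  # another saga won the race; re-read and retry
    return False


store = InventoryStore()
store.records["sku-1"] = Record(quantity=5)
reserved = reserve(store, "sku-1", 2)
print(reserved, store.read("sku-1"))
```

A writer holding a stale version (here, version 0 after the reservation bumped it to 1) gets a `VersionConflict` instead of silently overwriting the newer state.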

from datetime import datetime, timezone

class SagaState:
    """Persisted saga state for crash recovery."""

    def __init__(self, saga_id: str):
        self.saga_id = saga_id
        self.steps: list[tuple[str, str, dict]] = []  # [(step_name, status, data)]
        self.status = "pending"
        self.created_at = datetime.now(timezone.utc)

    def mark_step_started(self, step_name: str, step_data: dict):
        self.steps.append((step_name, "started", step_data))
        self._persist()

    def mark_step_completed(self, step_name: str, result_data: dict):
        # Update step status
        for i, (name, status, data) in enumerate(self.steps):
            if name == step_name and status == "started":
                self.steps[i] = (name, "completed", {**data, **result_data})
                break
        self._persist()

    def mark_step_compensated(self, step_name: str):
        for i, (name, status, data) in enumerate(self.steps):
            if name == step_name and status == "completed":
                self.steps[i] = (name, "compensated", data)
                break
        self._persist()

    def get_pending_steps(self) -> list[str]:
        return [name for name, status, _ in self.steps if status == "started"]

    def get_completed_steps(self) -> list[str]:
        return [name for name, status, _ in self.steps if status == "completed"]

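    def to_dict(self) -> dict:
        return {
            "saga_id": self.saga_id,
            "steps": self.steps,
            "status": self.status,
            "created_at": self.created_at.isoformat(),
        }

    def _persist(self):
        # Save to durable storage (database, etc.); `db` is a placeholder for your storage client
        db.sagas.upsert(self.saga_id, self.to_dict())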
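One way to use persisted step records after a crash is to derive a recovery plan from them. The sketch below assumes backward recovery (verify anything left in "started", then compensate completed steps newest first). `plan_recovery` and the action labels are hypothetical, and a real system might prefer to resume forward instead.

```python
from typing import List, Tuple

StepRecord = Tuple[str, str, dict]  # (step_name, status, data), as stored by the saga state


def plan_recovery(steps: List[StepRecord]) -> List[Tuple[str, str]]:
    """Return (action, step_name) pairs describing what to do after a restart."""
    plan: List[Tuple[str, str]] = []
    # Steps left in "started" are in an unknown state: check them first.
    for name, status, _ in steps:
        if status == "started":
            plan.append(("verify", name))
    # Completed steps are undone newest first.
    for name, status, _ in reversed(steps):
        if status == "completed":
            plan.append(("compensate", name))
    return plan


recovered_steps: List[StepRecord] = [
    ("reserve_inventory", "completed", {}),
    ("charge_payment", "completed", {}),
    ("create_shipment", "started", {}),
]

plan = plan_recovery(recovered_steps)
print(plan)
```

Steps already marked "compensated" are skipped, so running the planner again after a second crash does not schedule the same compensation twice.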

Failure Handling

Saga failures fall into three categories, and each requires a different response.

Failure Type	Examples	Default Behaviour	Fallback
Transient	Network timeout, connection pool exhaustion, 503 Service Unavailable	Retry with exponential backoff (3-5 attempts)	Treat as permanent after max retries
Permanent	Insufficient inventory, card declined, invalid input	Trigger compensation immediately	N/A, will never succeed
Unknown State	Service crashes mid-step, response lost in transit	Check idempotency key / query step status	Recover state or treat as permanent

Transient failures are temporary glitches that resolve on their own. A network blip, a connection pool draining under load, or a service restarting mid-deployment. The right response is retry with exponential backoff. Start small (100-500ms), double each attempt, cap at 10-15s, and stop after 3-5 tries. Add jitter so concurrent sagas don’t pile up on the same retry window.

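import random
from time import sleep

# TransientError and PermanentError are application-defined exception classes
def execute_with_retry(step_fn, context, max_retries=3):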
    delay = 0.5
    for attempt in range(max_retries):
        try:
            return step_fn(context)
        except TransientError as e:
            if attempt < max_retries - 1:
                sleep(delay + random.uniform(0, delay * 0.5))
                delay *= 2
                continue
            raise PermanentError(f"Failed after {max_retries} retries") from e
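The retry loop above doubles the delay without an upper bound. A small sketch of the capped schedule described earlier follows; `backoff_delays` and its parameters are hypothetical names, and the defaults simply mirror the ranges mentioned in the text.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base: float = 0.5,
    cap: float = 10.0,
    jitter: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Exponential backoff: double each attempt, cap the delay, add proportional jitter."""
    rng = rng or random.Random()
    delays: List[float] = []
    delay = base
    for _ in range(attempts):
        capped = min(delay, cap)
        # Jitter spreads concurrent sagas across the retry window.
        delays.append(capped + rng.uniform(0, capped * jitter))
        delay *= 2
    return delays


no_jitter = backoff_delays(6, jitter=0.0)
with_jitter = backoff_delays(6, rng=random.Random(7))
print(no_jitter)
```

With jitter disabled the schedule is 0.5, 1, 2, 4, 8, 10 seconds: the sixth delay would have been 16 seconds without the cap.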

Permanent failures will never succeed on retry. Retrying a declined card or zero-inventory item is wasted time that keeps the system inconsistent longer. Detect them early and trigger compensation immediately with no retry delay.

def handle_permanent_failure(saga, failed_step_index):
    for index in range(failed_step_index - 1, -1, -1):
        compensate_step(saga, saga.steps[index])
    saga.status = "failed"
    saga.persist()

Unknown state is the hardest category. You sent “reserve inventory” and the service crashed or the response was lost. Did it execute? The answer is the idempotency check-then-act pattern: check whether a step already completed using its idempotency key before running it.

def execute_step_with_idempotency(step_id, step_fn):
    existing = get_step_result(step_id)
    if existing is not None:
        return existing
    result = step_fn()
    store_step_result(step_id, result)
    return result

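When state cannot be determined, mark the step as ambiguous and alert an operator. Workflow engines like Temporal reduce this problem through durable execution: workflow progress is persisted as an event history, so a crashed worker resumes from the last recorded step. Individual activities can still be retried after a lost response, so they should remain idempotent.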

Compensation failure is the meta-risk. If the compensation itself fails (payment service is down when you try to refund), you are stuck with an inconsistent system. Retry compensation with exponential backoff, just like a forward step. If it keeps failing, move the saga to a dead letter queue for manual review and alert aggressively.

Design compensations to be idempotent as well. If a compensation runs twice, the second run should have no effect.
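The retry-then-park behaviour for failing compensations might look like the following. It is a sketch under assumptions: `compensate_or_park` is a hypothetical helper, the dead letter queue is modelled as a plain list, and `sleep` is injectable so the example runs without waiting.

```python
import time
from typing import Callable, List


def compensate_or_park(
    saga_id: str,
    compensation: Callable[[], None],
    dead_letters: List[str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry a compensation with backoff; park the saga for manual review if it keeps failing."""
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            compensation()
            return True
        except Exception:
            if attempt < max_attempts - 1:
                sleep(delay)
                delay *= 2
    dead_letters.append(saga_id)  # an operator or alerting job picks this up
    return False


# Example: the payment service stays down, so the refund never succeeds.
attempts: List[int] = []


def refund() -> None:
    attempts.append(1)
    raise ConnectionError("payment service unavailable")


dead_letters: List[str] = []
recovered = compensate_or_park("saga-42", refund, dead_letters, sleep=lambda _: None)
print(recovered, len(attempts), dead_letters)
```

Because the compensation may have partially run before each failure, this only works safely when the compensation itself is idempotent.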

Performance considerations

Saga trades a synchronous two-round-trip 2PC for a sequential multi-step approach. The math is worth understanding.

2PC latency profile: Two network round-trips (prepare + commit), but all participants vote and commit in parallel. Typical: 10-50ms per participant phase.

Saga latency profile: Each step adds its own latency. If step 1 takes 20ms, step 2 takes 30ms, step 3 takes 15ms, your total is 65ms plus orchestration overhead. Steps run sequentially, not in parallel.

For a 3-step saga vs 2PC across the same 3 services:

Metric	2PC	Saga
Happy path latency	~20-40ms (parallel phases)	~50-80ms (sequential steps)
Failure recovery	Blocks until coordinator recovers	Compensation runs immediately
Availability	Lower (blocking on coordinator)	Higher (no blocking coordinator)
Lock duration	All locks held during both phases	Each lock released after its step

Saga’s latency overhead is real but often acceptable. If each step is 10-30ms (typical for service calls), a 5-step saga runs in 50-150ms total. Compare that to a user-facing API timeout (usually 1-5 seconds) and the overhead is negligible.

Where saga latency hurts is high-throughput, low-latency paths (trading systems, real-time pricing). In those cases, consider whether you can pipeline steps (some steps don’t depend on previous step results and can overlap).
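The latency arithmetic can be made concrete with a few lines. This ignores orchestration overhead and assumes, purely for illustration, that the first two steps are independent of each other; `sequential_latency_ms` and `pipelined_latency_ms` are hypothetical helper names.

```python
from typing import List


def sequential_latency_ms(step_latencies: List[float]) -> float:
    """Steps run one after another, so latencies add up."""
    return sum(step_latencies)


def pipelined_latency_ms(stages: List[List[float]]) -> float:
    """Steps within a stage overlap; stages still run one after another."""
    return sum(max(stage) for stage in stages)


sequential = sequential_latency_ms([20, 30, 15])
pipelined = pipelined_latency_ms([[20, 30], [15]])
print(sequential, pipelined)
```

Overlapping the 20ms and 30ms steps brings the total from 65ms to 45ms, because the stage costs only as much as its slowest step. The saving is only available when a step truly does not need the previous step's result.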

Trade-off Analysis

The saga pattern involves several fundamental trade-offs compared to other distributed transaction approaches.

Testing Strategies

Sagas are harder to test than single-service transactions because they span multiple services and involve time-sensitive compensation logic. A unit test that verifies one step is not enough. You also need to verify that compensations run in the right order and that idempotency holds under retries. Concurrent sagas should not interfere either.

Testing sagas works best as a layered approach. Start with unit tests for individual compensations. Verify each is idempotent and deterministic. Then move to integration tests that simulate failures at each step and confirm the correct compensations fire in reverse order. Finally, run chaos tests in a staging environment: kill services mid-saga, inject network latency, exhaust connection pools.

The sub-sections that follow compare Saga to alternative patterns (2PC and TCC) across key dimensions, summarise the costs and benefits, and then walk through concrete testing strategies with code examples.

Saga vs Alternative Patterns

Aspect	Saga	Two-Phase Commit	TCC (Try-Confirm-Cancel)
Atomicity	Eventual (via compensation)	True atomic (all-or-nothing)	Eventual (via Confirm/Cancel)
Isolation	None (partial states visible)	Full isolation	Partial (Try reserves)
Availability	High (no coordinator blocking)	Low (coordinator is SPOF)	Medium (Try phase reserves resources)
Complexity	Medium (compensation logic)	High (coordinator protocol)	High (Try, Confirm, and Cancel operations per service)
Rollback Mechanism	Compensation transactions	Automatic rollback	Cancel / Confirm semantics
Latency	Per-step cumulative	Two round-trips (parallel phases)	Two round-trips per operation (Try, then Confirm or Cancel)
Service Coupling	Each service owns compensation	All services must support 2PC	All services must implement TCC
Failure Handling	Application-defined compensation	Coordinator-driven	Try phase failures auto-cancel

Choreography vs Orchestration Trade-offs

Factor	Choreography	Orchestration
Central Point of Failure	None (fully distributed)	Orchestrator is a potential failure point
Logic Location	Distributed across services	Centralized in orchestrator
Adding New Steps	All participating services must change	Only orchestrator changes
Understanding Workflow	Harder to see overall flow	Easier to see complete workflow
Debugging	Requires distributed tracing	Easier (centralized logs)
Scalability	Scales with number of services	Limited by orchestrator capacity
Team Independence	High (services evolve independently)	Low (teams depend on orchestrator)
Suitable For	2-4 services, simple event flows	5+ services, complex workflows

Design Decision Matrix

Your Constraint	Recommended Approach
Maximum availability required	Saga (either flavor)
True ACID isolation required	Single database or 2PC (with caveats)
Cross-service transactions	Saga
Can’t modify participant code	Saga (only compensation needed)
High-frequency, low-latency	Pipelined saga or avoid distributed transaction
Long-running (hours to days)	Orchestrated saga with durable execution (Temporal)
Simple 2-3 step workflow	Choreographed saga
Complex multi-team workflow	Orchestrated saga
Already using workflow engine	Use the engine’s saga support (Temporal, Step Functions)

Cost-Benefit Summary

Saga Benefits:

No blocking coordinator: a crashed participant or orchestrator does not leave locks held across services
Steps execute sequentially — each lock is released after its step completes
Works across service boundaries with separate databases
Compensation is application-defined and domain-appropriate
Easier to retrofit onto existing services

Saga Costs:

No isolation — application must handle partial states
Compensation can fail, leaving the system in an inconsistent state
Longer cumulative latency (sum of step latencies vs parallel phases)
Application must implement idempotency throughout
Testing is more complex (failure scenarios, compensation ordering)

Saga Testing Strategies

Testing sagas is hard because they span services and involve time. A structured approach helps.

Unit testing compensations

Test each compensation in isolation first. The compensation is the most critical piece — if it fails, your saga is stuck.

# Test that releasing inventory twice has no effect (idempotency)
def test_release_inventory_idempotent():
    inventory = InMemoryInventory()
    inventory.reserve("order-123", ["item-a"])

    # First release — succeeds
    result1 = inventory.release("order-123")
    assert result1.success
    assert not inventory.is_reserved("order-123")

    # Second release — should be idempotent (no error)
    result2 = inventory.release("order-123")
    assert result2.success  # Still succeeds, even though already released

    # Third release — still idempotent
    result3 = inventory.release("order-123")
    assert result3.success

Integration testing failure scenarios

The real test is whether your saga handles failures correctly. Set up test infrastructure that simulates failures at each step.

# Test: step 3 fails, step 2 compensation runs correctly
def test_saga_step3_failure_triggers_step2_compensation():
    inventory = MockInventory()  # Always succeeds
    payment = MockPayment()      # Always succeeds
    shipping = MockShipping(fail_on="create")  # Fails on create

    saga = OrderSaga(inventory, payment, shipping)
    result = saga.execute(order_with_3_items)

    assert result.failed
    assert result.failed_step == "create_shipment"
    assert payment.refund_was_called()        # Step 2 compensated
    assert inventory.release_was_called()      # Step 1 compensated
    assert not shipping.shipment_created      # Step 3 failed, so no shipment exists

Testing compensations run in correct order

The most common saga bug is compensation running in the wrong order. Write a test that explicitly verifies reverse order.

def test_compensation_runs_in_reverse_order():
    call_order = []

    class TrackingService:
        def __init__(self, name, fail=False):
            self.name = name
            self.fail = fail

        def do_step(self):
            call_order.append(f"do-{self.name}")
            if self.fail:
                raise RuntimeError(f"{self.name} failed")

        def compensate(self):
            call_order.append(f"compensate-{self.name}")

    class OrderedSaga:
        def __init__(self):
            self.services = [
                TrackingService("step1"),
                TrackingService("step2"),
                TrackingService("step3", fail=True),  # Fails here
            ]
            self.completed = []

        def execute(self):
            for service in self.services:
                try:
                    service.do_step()
                except RuntimeError:
                    self.compensate()
                    return False
                self.completed.append(service)
            return True

        def compensate(self):
            # Undo completed steps only, newest first: step2, then step1.
            # Step 3 failed and never took effect, so it is not compensated.
            for service in reversed(self.completed):
                service.compensate()

    saga = OrderedSaga()
    assert saga.execute() is False  # Fails on step 3

    assert call_order == [
        "do-step1", "do-step2", "do-step3",
        "compensate-step2", "compensate-step1",
    ]

Chaos testing sagas in production

Once your saga is running in production, inject failures to verify it handles them:

Kill a service mid-saga and verify compensation runs
Introduce network latency and verify timeouts trigger correctly
Fill up a resource (disk, connection pool) and verify graceful degradation
Split the network between two services and verify saga completes or compensates correctly

Observability & Tracing

When a saga runs across five services and fails at step four, finding the root cause means tracking what happened in inventory, payment, shipping, and the orchestrator. Without observability tooling, you are left grepping logs from each service and trying to reconstruct the timeline by hand. With distributed tracing and structured logging, the saga’s entire execution path becomes a single searchable unit.

This section covers what to capture for saga observability and why, the trace structure that makes sagas debuggable with a concrete span hierarchy, and how to propagate trace context through saga steps so the full execution path stays linked even when compensations fire asynchronously.

Observability

Sagas span multiple services. Without tracing, debugging a failed saga means grepping logs across 5 services and trying to piece together what happened. With distributed tracing (OpenTelemetry, Zipkin, Jaeger), you get a single trace ID that follows the saga across all services.

Trace structure for sagas

A saga trace has a parent span for the overall saga, with child spans for each step and compensation.

Trace: order-123-saga (trace-id: abc123)
├── Span: saga_created (service: order-service)
├── Span: step.reserve_inventory (service: inventory-service)
│   └── Span: compensate.reserve_inventory (service: inventory-service)
├── Span: step.charge_payment (service: payment-service)
│   └── Span: compensate.charge_payment (service: payment-service)
├── Span: step.create_shipment (service: shipping-service)
└── Span: saga_completed / saga_failed

Implementing trace context propagation

Pass trace context through saga steps by injecting it into the headers of each outgoing call, and use span links to tie compensations back to the steps they undo.

from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

class OrderSaga:
    def execute(self, order, context=None):
        # Extract incoming trace context (if triggered by HTTP request);
        # `context` is the carrier, e.g. the incoming request headers
        parent_ctx = extract(context) if context else None

        # start_as_current_span makes the saga span current, so inject()
        # below propagates this span rather than whatever was active before
        with tracer.start_as_current_span("order_saga", context=parent_ctx) as span:
            span.set_attribute("saga.id", order.saga_id)
            span.set_attribute("saga.type", "order_fulfillment")

            # Inject trace context into step calls
            headers = {}
            inject(headers)  # Injects current trace into headers

            try:
                # Step 1: Reserve inventory (pass headers for trace propagation)
                inventory_ctx = self.inventory.reserve(order.items, headers)

                # Step 2: Charge payment
                payment_ctx = self.payment.charge(order.payment, headers)

                # Step 3: Create shipment
                shipment = self.shipping.create(order.address, headers)

                span.set_status(Status(StatusCode.OK))
                return Success(shipment)

            except Exception as e:
                span.record_exception(e)
                span.set_status(Status(StatusCode.ERROR, str(e)))
                raise

When a step fails and compensation runs, link the compensation span to the original step span so you can see the pairing in the trace.

from opentelemetry.trace import Link

def compensate_inventory(self, reservation_id, original_span):
    with tracer.start_as_current_span(
        "compensate.reserve_inventory",
        links=[Link(original_span.get_span_context())]
    ) as span:
        # .name is available on SDK spans; non-recording spans do not expose it
        span.set_attribute("compensation.for", getattr(original_span, "name", "unknown"))
        span.set_attribute("reservation.id", reservation_id)
        self.inventory.release(reservation_id)

This way, in your trace viewer, you see the compensate span linked back to its original do span — paired visually rather than guessing from logs.

Framework & Decision Guides

Building a saga from scratch is educational but not practical for production. You end up writing state persistence, retry logic, idempotency checks, and distributed tracing integration yourself. All of these take time to get right and are easy to get subtly wrong. A framework that handles these concerns out of the box gets you to production faster and with fewer surprises.

The main options fall into three categories. Durable workflow engines like Temporal give you crash-resilient execution with automatic state recovery. Managed cloud services like AWS Step Functions handle persistence and retries without any infrastructure overhead. Self-hosted orchestrators like Netflix Conductor offer control over the runtime without depending on a cloud vendor.

The comparison below covers each framework’s strengths, ideal use cases, and trade-offs. Following that, a decision matrix maps your constraints (availability needs, team size, latency requirements) to the right approach.

Temporal

The strongest choice for saga orchestration. Temporal provides durable workflow execution — if your service crashes mid-saga, Temporal persists the workflow state and resumes it from where it left off. No need to build your own saga state machine.

Strengths: Durable execution (survives worker crashes), built-in retries with backoff, activity heartbeats, sandboxed workflow code, strong OpenTelemetry integration
Good for: Complex multi-step business workflows, long-running sagas (hours to days)
Trade-offs: Self-hosting is operationally heavy; Temporal Cloud pricing can be significant at scale

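# Temporal workflow example (Python SDK)
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError

@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> OrderResult:
        # Activities with automatic retry
        reservation = await workflow.execute_activity(
            reserve_inventory,
            order.items,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )

        try:
            charge = await workflow.execute_activity(
                charge_payment,
                args=[order.payment, order.total],  # multiple arguments go through args
                start_to_close_timeout=timedelta(seconds=30),
            )
        except ActivityError:
            # Activity failures reach the workflow wrapped in ActivityError.
            # The activity should raise a non-retryable error for a declined
            # card so it is not retried indefinitely.
            await workflow.execute_activity(
                release_inventory,
                reservation.id,
                start_to_close_timeout=timedelta(seconds=30),
            )
            return OrderResult(rejected=True, reason="Payment declined")

        # A failure here would also need refund and release compensations (omitted for brevity)
        shipment = await workflow.execute_activity(
            create_shipment,
            order.address,
            start_to_close_timeout=timedelta(seconds=30),
        )

        return OrderResult(confirmed=True, shipment=shipment)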

AWS Step Functions

Managed saga orchestration on AWS. Integrates tightly with AWS services (Lambda, ECS, SQS, DynamoDB). Good if you’re already all-in on AWS.

Strengths: Fully managed, pay-per-state-transition, tight Lambda integration, visual workflow designer
Good for: AWS-centric architectures, medium-complexity workflows
Trade-offs: Vendor lock-in, expensive at high step counts, debugging can be opaque

Conductor (Netflix)

Conductor is an open-source workflow orchestrator originally developed at Netflix that can be used to run sagas. Good for microservices that need workflow orchestration without heavy operational overhead.

Strengths: Open source (self-hostable), JSON-based workflow definitions, HTTP-based workers
Good for: Teams wanting open-source without Temporal complexity
Trade-offs: Not as battle-tested as Temporal at extreme scale, less mature ecosystem

Comparison

Framework	Durability	Open Source	Complexity	Best For
Temporal	Excellent (durable execution)	Yes (server) + cloud	Medium	Complex long-running workflows
AWS Step Functions	Good (managed)	No	Low	AWS-centric, simple to medium-complexity workflows
Conductor	Good	Yes (fully)	Medium	Open-source preference

For most production scenarios, Temporal is the right call. The durable execution guarantee alone eliminates a whole class of saga state loss bugs.

When to Use / When Not to Use the Saga Pattern

Criteria	Saga (Choreography)	Saga (Orchestration)	Two-Phase Commit
Atomicity	Eventual	Eventual	True atomicity
Isolation	None	None	Full isolation
Availability	High	High	Low (blocks on coordinator failure)
Complexity	Distributed logic	Centralized logic	Distributed but synchronous
Compensation	Each service knows its own	Orchestrator directs	Automatic rollback
Latency	Per-step latency	Per-step + orchestration overhead	Two round-trips
Debugging	Harder (distributed)	Easier (centralized)	Moderate
Rollback Cost	Compensation required	Compensation required	Free (automatic)

Use saga when:

Operations span multiple services with separate databases
You cannot or should not use 2PC (avoiding it is usually the right call)
Business transactions map naturally to a sequence of steps
You can define compensating transactions for each step
Eventual consistency is acceptable for your domain
You need availability over strong isolation

Avoid saga when:

Steps have tight interdependencies that require strict isolation
Rollback must be immediate and guaranteed
Compensation is expensive or impossible (sending an email, charging a card with long refund times)
Your domain requires all-or-nothing atomicity that saga cannot provide
The inconsistency window is unacceptable for your use case

Production Considerations

Staging environments rarely expose the failure modes that surface in production. A saga that passes every integration test can still break when real network partitions occur, connection pools exhaust under concurrent load, or Kubernetes evicts a pod mid-transaction. Production introduces latencies, failures, and load patterns that are difficult to simulate reliably in pre-production environments.

The failure scenarios below walk through the most common production failures and their specific mitigations. After that come two diagram types that help teams reason about saga behaviour at the system level: failure flow diagrams and state machine diagrams. These are not decorative. They are the shared vocabulary teams use when debugging a step failure at 3 AM.

Key areas where production reality diverges from staging:

Network partitions: Staging uses controlled network conditions. Production sees partial partitions, asymmetric latency, and DNS failures that staging cannot replicate.
Concurrent load: A saga tested in isolation behaves differently under real concurrent load. Two sagas racing for the same inventory reveal race conditions that single-threaded tests miss.
Resource exhaustion: Connection pools, thread pools, and memory limits behave differently under production load. A misconfigured pool limit that passes staging tests can cause cascading failures under real traffic.
Partial deployments: A rolling deployment can leave some services on version N and others on N+1 mid-saga. The saga must handle step results from a service running different code than expected.

Saga Failure Scenarios

Failure	Impact	Mitigation
Step fails and compensation also fails	System left in inconsistent state	Design idempotent compensations; implement retry with exponential backoff; alert on repeated compensation failures
Service crashes mid-step	Step may or may not have completed; unknown state	Use idempotency keys; implement saga state tracking; use durable workflow engines
Concurrent sagas interfere	One saga’s uncommitted data affects another saga’s read	Implement optimistic concurrency control; use application-level locks for critical resources
Compensation runs twice on the same step	Double compensation causes incorrect state (for example, a double refund)	Track completed and compensated steps explicitly; make compensations idempotent; skip steps that are already compensated
Saga state lost (orchestrator crash)	Cannot determine what steps completed	Persist saga state to durable storage; use workflow engines that handle this
Infinite retry loop	System stuck in repeated failed attempts	Implement max retry count; move to dead letter state after threshold; alert

Diagrams & State Machines

Diagrams serve two purposes when designing sagas. A failure flow diagram maps the decision tree that the saga follows when each step succeeds or fails, making it clear which compensations fire and in what order. A state machine diagram defines the legal state transitions a saga instance can go through, from creation to terminal completion or failure.

The two diagrams below show the standard patterns. They are not specific to any framework or implementation. You can map them directly onto Temporal workflows, Step Functions state machines, or a custom orchestrator.

Failure Flow Diagram

graph TD
    Start[Start Saga] --> Step1[Execute Step 1]
    Step1 --> Step1OK{Step 1 OK?}
    Step1OK -->|No| Fail1[Return Error<br/>nothing to compensate]
    Step1OK -->|Yes| Step2[Execute Step 2]
    Step2 --> Step2OK{Step 2 OK?}
    Step2OK -->|No| Comp1[Compensate Step 1]
    Comp1 --> Fail2[Return Error]
    Step2OK -->|Yes| Step3[Execute Step 3]
    Step3 --> Step3OK{Step 3 OK?}
    Step3OK -->|No| Comp2A[Compensate Step 2]
    Comp2A --> Comp2B[Compensate Step 1]
    Comp2B --> Fail3[Return Error]
    Step3OK -->|Yes| Success[Saga Complete]
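
The decision tree above reduces to a small loop. Here is a minimal sketch of it, assuming each step is a pair of plain callables and that a failed step rolls back its own local transaction, so only earlier steps need compensating. `SagaStep` and `run_saga` are hypothetical names for illustration, not part of any framework.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]      # forward local transaction
    compensate: Callable[[], None]  # undoes a completed action


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step did not complete, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_shipping() -> None:
    raise RuntimeError("shipping unavailable")


steps = [
    SagaStep("inventory", lambda: log.append("reserve"), lambda: log.append("release")),
    SagaStep("payment", lambda: log.append("charge"), lambda: log.append("refund")),
    SagaStep("shipping", fail_shipping, lambda: log.append("cancel shipment")),
]
ok = run_saga(steps)
```

With step 3 failing, the log records the two forward actions followed by their compensations in reverse: refund first, then release. A real orchestrator would also persist progress and retry failed compensations, which this sketch leaves out.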

Saga Execution State Machine

stateDiagram-v2
    [*] --> Pending: Saga created
    Pending --> Executing: First step started
    Executing --> Executing: Step N completed
    Executing --> Compensating: Step failed
    Compensating --> Compensating: Compensation in progress
    Compensating --> Completed: All compensations done
    Executing --> Completed: Final step succeeded
    Compensating --> Failed: Compensation failed after retries
    Pending --> Failed: Immediate failure (validation error)
    Completed --> [*]
    Failed --> [*]

State Definitions:

State	Description
Pending	Saga created but not yet started. Initial state before first step execution.
Executing	At least one step has started. Saga is running forward steps in order.
Compensating	A step failed and compensation is running in reverse order to undo completed steps.
Completed	All steps succeeded, or a step failed and all compensations succeeded. Terminal state; the system is consistent either way.
Failed	Saga reached an unrecoverable state (compensation failed or max retries exceeded).

State Transition Rules:

Once in Executing, cannot return to Pending
Compensating can only be entered from Executing
Completed and Failed are terminal states
From Compensating, success leads to Completed, persistent failure leads to Failed
Only Pending and Failed can be safely retried at the saga level
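
One way to enforce these rules is a transition table checked on every state change. The sketch below copies the edges from the state diagram; `SagaState`, `TRANSITIONS`, and `transition` are illustrative names, and a production orchestrator would persist the state alongside the check.

```python
from enum import Enum


class SagaState(Enum):
    PENDING = "Pending"
    EXECUTING = "Executing"
    COMPENSATING = "Compensating"
    COMPLETED = "Completed"
    FAILED = "Failed"


# Legal transitions, one entry per arrow in the state diagram.
TRANSITIONS = {
    SagaState.PENDING: {SagaState.EXECUTING, SagaState.FAILED},
    SagaState.EXECUTING: {
        SagaState.EXECUTING,
        SagaState.COMPENSATING,
        SagaState.COMPLETED,
    },
    SagaState.COMPENSATING: {
        SagaState.COMPENSATING,
        SagaState.COMPLETED,
        SagaState.FAILED,
    },
    SagaState.COMPLETED: set(),  # terminal
    SagaState.FAILED: set(),     # terminal
}


class IllegalTransition(Exception):
    pass


def transition(current: SagaState, target: SagaState) -> SagaState:
    """Return the target state, or raise if the move is not in the table."""
    if target not in TRANSITIONS[current]:
        raise IllegalTransition(f"{current.value} -> {target.value}")
    return target
```

Calling `transition(SagaState.COMPLETED, SagaState.COMPENSATING)` raises `IllegalTransition`, which is the guard that stops a late event from re-opening a finished saga.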

Choreography vs Orchestration: Trade-off Analysis

Factor	Choreography	Orchestration	Recommendation
Coupling	Low — services own their own logic	High — orchestrator knows all steps	Choreography for independent teams
Complexity	Distributed — each service knows its trigger	Centralised — orchestrator is complex	Orchestration for complex flows
Visibility	Poor — no central view of saga state	High — orchestrator tracks all state	Orchestration for debugging
Scalability	High — no central bottleneck	Limited by orchestrator capacity	Choreography at scale
Failure Handling	Each service handles its own failures	Centralised retry and compensation	Orchestration for guaranteed ordering

Production Failure Scenarios

Scenario 1: Payment Service Unavailable During Saga

What happened: A saga was executing an order placement across three services: inventory, payment, and shipping. The payment service became unavailable for 45 seconds mid-saga after the inventory reservation succeeded.

Root cause: The payment service’s database connection pool was exhausted due to a misconfigured max-connections setting. Requests queued up behind the pool limit and timed out.

Impact: 127 sagas entered the compensating state simultaneously. The compensating transaction to release inventory ran successfully, but the refund for charges already processed was delayed. 23 customers reported duplicate charges before idempotency checks caught them.

Lesson learned: Implement idempotency keys at every saga step, including compensations. The payment step's composite idempotency key prevented double-charging on retry, but the compensation path had no equally specific key, so the delayed refund appeared as a duplicate in monitoring.

The payment step used a composite idempotency key combining the order ID, step sequence number, and a nonce. When the payment service received a charge request with key order-abc:step-2:nonce-x, it checked the idempotency store first. If the key existed, it returned the cached result without re-executing. This caught all 23 duplicate charge attempts that arrived during the 45-second outage. The refund compensation, however, used only the transaction ID as its lookup key, which was not unique enough to distinguish retried refunds from genuinely new ones. A retried refund carrying the same transaction ID was logged as a separate refund event, which is what triggered the duplicate charge alerts.

The delayed refund made things worse. The idempotency store had a 24-hour TTL, so when the refund finally arrived after the connection pool recovered, it was accepted and processed. The monitoring dashboard then displayed it alongside the original refund, making it look like two separate refunds had been issued. The compensation path needs the same idempotency key structure as the forward path: saga ID, step number, and compensation type encoded together. Without that, a retried compensation gets treated as a distinct compensation rather than a retried one. Idempotency on forward steps alone is incomplete.
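
A sketch of that key structure, with saga ID, step number, and compensation type encoded together. The key format and the in-memory store are assumptions for illustration; the incident's actual store and format are not shown here.

```python
from typing import Callable, Dict


def idempotency_key(saga_id: str, step: int, kind: str) -> str:
    """kind is "forward" or a compensation type such as "refund"."""
    return f"{saga_id}:step-{step}:{kind}"


class IdempotentExecutor:
    """Runs an operation once per key and replays the stored result afterwards."""

    def __init__(self) -> None:
        # Stand-in for a durable idempotency store (hypothetical).
        self._results: Dict[str, str] = {}

    def execute(self, key: str, operation: Callable[[], str]) -> str:
        if key in self._results:
            return self._results[key]  # retry: cached result, no side effect
        result = operation()
        self._results[key] = result
        return result


refunds_issued = []


def issue_refund() -> str:
    refunds_issued.append("refund")
    return f"refund-{len(refunds_issued)}"


executor = IdempotentExecutor()
key = idempotency_key("saga-abc", 2, "refund")
first = executor.execute(key, issue_refund)
retry = executor.execute(key, issue_refund)  # delayed retry of the same compensation
```

The retried refund returns the first result and issues nothing new. In a real service the check and the write must be a single atomic operation, for example a unique-key insert, because two concurrent retries could otherwise both pass the lookup.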

Scenario 2: Circular Dependency in Choreography

What happened: Two sagas in an e-commerce platform developed a circular dependency. Saga A needed to update product pricing and Saga B needed to update product availability. Both sagas published events that triggered each other’s next steps, creating an infinite loop that saturated the message broker.

Root cause: The choreography design did not account for the relationship between pricing updates and availability updates. A pricing change can affect which warehouse fulfils an order, which changes availability, which can affect pricing through promotional rules.

Impact: Message broker queue depth grew to 2.3 million messages in 3 minutes before the circuit breaker triggered. The system recovered after restarting both sagas and replaying from the last stable checkpoint.

Lesson learned: Model all cross-saga dependencies explicitly before deploying choreography. Implement circuit breakers per event type, not per saga. Always set message TTL and dead-letter queues for choreography events.

The two event types driving the loop were ProductPriceChanged and WarehouseAvailabilityChanged. Saga A consumed WarehouseAvailabilityChanged, recalculated fulfillment-region pricing, and emitted ProductPriceChanged. Saga B did the inverse: it consumed ProductPriceChanged, ran its pricing engine, and emitted WarehouseAvailabilityChanged after its availability model updated. Normally this propagated in one direction only. The problem surfaced during a promotional campaign that updated both price and availability within a 200ms window. Both sagas processed their inputs simultaneously and each emitted an event that fed back into the other, which started the cycle over.

The circuit breaker that cut the loop had an open duration of 30 seconds and a threshold of 1000 messages per minute per event type. When ProductPriceChanged exceeded that threshold, the breaker opened for that event type specifically, blocking ProductPriceChanged consumption for 30 seconds. This gave the in-flight WarehouseAvailabilityChanged events time to drain and broke the feedback cycle. After 30 seconds the breaker closed automatically and the sagas resumed from their current state, not from the beginning. If the breaker had been applied per saga instead of per event type, one saga’s open breaker would not have stopped the other from continuing to emit.
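
A per-event-type breaker can be sketched as a sliding-window counter keyed by event type. This is an illustrative in-process version with hypothetical names (`EventTypeBreaker`, `allow`) and a caller-supplied clock. Real brokers and resilience libraries expose their own mechanisms.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class EventTypeBreaker:
    """Opens per event type when the rate over a sliding window exceeds a threshold."""

    def __init__(self, threshold: int, window_s: float, open_s: float) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self.open_s = open_s
        self._seen: Dict[str, Deque[float]] = defaultdict(deque)
        self._open_until: Dict[str, float] = {}

    def allow(self, event_type: str, now: float) -> bool:
        """Return True if an event of this type may be consumed at time `now`."""
        if now < self._open_until.get(event_type, 0.0):
            return False  # breaker is open for this event type only
        seen = self._seen[event_type]
        while seen and seen[0] <= now - self.window_s:
            seen.popleft()  # drop timestamps that fell out of the window
        seen.append(now)
        if len(seen) > self.threshold:
            self._open_until[event_type] = now + self.open_s
            seen.clear()
            return False
        return True


# Small numbers for readability: more than 3 events per 60 s opens for 30 s.
breaker = EventTypeBreaker(threshold=3, window_s=60.0, open_s=30.0)
results = [breaker.allow("ProductPriceChanged", t) for t in (0.0, 1.0, 2.0, 3.0)]
other = breaker.allow("WarehouseAvailabilityChanged", 3.5)
blocked = breaker.allow("ProductPriceChanged", 10.0)
resumed = breaker.allow("ProductPriceChanged", 33.5)
```

The fourth ProductPriceChanged event trips the breaker, while WarehouseAvailabilityChanged keeps flowing because each event type has its own counter and open timer.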

Scenario 3: Orchestrator State Loss

What happened: A Kubernetes pod running the saga orchestrator was terminated during a 12-step order fulfilment saga. The orchestrator had been persisting state every 10 steps, and the saga was at step 7. The replacement pod could not determine whether step 6 had completed.

Root cause: The orchestrator persisted saga state asynchronously to reduce latency, using a write-ahead log with batch commits. The batch had not been committed when the pod was terminated, losing visibility into steps 1–6.

Impact: The replacement pod retried from step 1, causing double reservation of inventory and double payment capture. The idempotency layer caught the double payment but not the double inventory reservation, leading to inventory inconsistency that took 4 hours to reconcile manually.

Lesson learned: Persist saga state synchronously at every step boundary, not in batches. Use the outbox pattern to guarantee that step completion and state persistence are atomic. Consider checkpointing after every step, not every N steps.

The persistence bug was a write-ahead log that batch-committed every 10 steps. The orchestrator wrote to the WAL immediately but only flushed to the database in batches of 10. The WAL itself was durable on the pod's local disk, since it used fsync-on-write, but the batch database commit was asynchronous and unacknowledged. When the Kubernetes pod received SIGTERM during a rolling update at step 7, the local WAL held steps 1 through 7, but the first batch of 10 had not yet been flushed, so none of those steps had reached the database. The replacement pod queried the database, found no step records for the saga, and restarted from step 1.

The double inventory reservation exposed a flaw in the idempotency logic. The payment step used an idempotency key built from the order ID plus a deterministic nonce, so when the payment step was retried, the payment service recognized the key and returned the cached capture without double-charging. The inventory step used only the order ID as its idempotency key. The original step 1 execution reserved inventory under order-abc. When the saga restarted and retried step 1, the inventory service found an existing reservation for order-abc but it was in “pending” state, not “confirmed”. The retry logic treated this as a new reservation and created a second hold. The duplicate only surfaced when a concurrent saga tried to reserve the same SKU and hit insufficient stock.
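
Checkpointing at every step boundary can be sketched with a synchronous commit after each step. The example uses an in-memory SQLite database as a stand-in for durable storage; the table name, columns, and function names are assumptions for illustration.

```python
import sqlite3
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def init_store(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS saga_steps ("
        "saga_id TEXT, step_index INTEGER, name TEXT, "
        "PRIMARY KEY (saga_id, step_index))"
    )
    conn.commit()


def completed_count(conn: sqlite3.Connection, saga_id: str) -> int:
    row = conn.execute(
        "SELECT COUNT(*) FROM saga_steps WHERE saga_id = ?", (saga_id,)
    ).fetchone()
    return row[0]


def run_from_checkpoint(conn: sqlite3.Connection, saga_id: str, steps: List[Step]) -> None:
    """Resume after the last persisted step; commit a record after every step."""
    start = completed_count(conn, saga_id)
    for index in range(start, len(steps)):
        name, action = steps[index]
        action()
        conn.execute(
            "INSERT INTO saga_steps (saga_id, step_index, name) VALUES (?, ?, ?)",
            (saga_id, index, name),
        )
        conn.commit()  # synchronous commit at the step boundary, no batching


executed: List[str] = []
crash_once = {"armed": True}


def make_step(name: str) -> Callable[[], None]:
    def action() -> None:
        if name == "ship" and crash_once["armed"]:
            crash_once["armed"] = False
            raise RuntimeError("pod terminated")  # simulated crash mid-saga
        executed.append(name)
    return action


steps = [(n, make_step(n)) for n in ("reserve", "charge", "ship")]
conn = sqlite3.connect(":memory:")
init_store(conn)
try:
    run_from_checkpoint(conn, "saga-1", steps)
except RuntimeError:
    pass
run_from_checkpoint(conn, "saga-1", steps)  # the "replacement pod" resumes at step 3
```

The second run skips the two recorded steps and executes only the third. A crash between a step's side effect and its commit is still possible, so the step itself must remain idempotent; per-step checkpoints shrink that window to one step instead of ten.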

Scenario 4: Compensating Transaction Loop

What happened: A saga that books travel (flight, hotel, car) experienced a network partition between the hotel and flight services. The saga compensated the flight booking, then the network partition resolved, and the hotel service’s compensation event arrived late. The orchestrator interpreted the late event as a new failure and began compensating again.

Root cause: Events arrived out of order because the hotel service had a slower consumer group than the flight service. The compensation for the hotel ran after the saga had already reached the Completed state.

Impact: Three customers had their flight rebooked unnecessarily when the second compensation attempt ran after the original compensation had already released the inventory. One customer was moved from their booked flight to a later flight with a 6-hour delay.

Lesson learned: Use version numbers or vector clocks on saga events to detect and discard late events. Implement idempotency in compensation handlers using the original saga ID. Never compensate a saga that has already reached a terminal state.

The throughput gap was about 8 to 1. The flight service consumer processed 800 messages per second with a single-threaded consumer. The hotel service consumer processed 100 messages per second through a shared thread pool across 12 partitions. When the network partition cleared after 90 seconds, the flight consumer had cleared its 72,000-message backlog and moved on to new events. The hotel consumer had only cleared 9,000 of its 90,000 pending messages. The hotel compensation event arrived roughly 3 minutes after the saga had already reached Completed.

The fix was a version counter on each saga instance, included in every emitted event. When the late hotel compensation arrived with version 14, the saga had already moved to version 15. The orchestrator rejected any event whose version lagged behind its current state, logging it as late and taking no action. The version counter was a simple monotonic integer incremented on each state transition. More complex systems would use a vector clock mapping each service name to its last-seen version, which handles cross-saga scenarios where different services may have observed events in different orders.
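
The version check is small enough to sketch. This assumes each event carries the saga version it was emitted against and that the orchestrator bumps the version on every accepted transition. `SagaInstance` and `apply` are illustrative names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SagaInstance:
    saga_id: str
    version: int = 0
    state: str = "Executing"
    ignored: List[int] = field(default_factory=list)

    def apply(self, event_version: int, new_state: str) -> bool:
        """Apply an event only if it is not stale and the saga is not terminal."""
        terminal = self.state in ("Completed", "Failed")
        if terminal or event_version < self.version:
            self.ignored.append(event_version)  # record as late, take no action
            return False
        self.state = new_state
        self.version += 1  # monotonic counter, bumped on every state transition
        return True


saga = SagaInstance("trip-42", version=14)
accepted = saga.apply(14, "Compensating")  # current event: version becomes 15
late = saga.apply(14, "Compensating")      # late event still carrying version 14
```

The first event is accepted and moves the saga to version 15; the second carries version 14 and is ignored. The terminal-state check is a second guard, so a saga in Completed or Failed never starts compensating again regardless of the version an event carries.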

Common Pitfalls / Anti-Patterns

Treating saga as ACID transaction: Saga does not provide isolation. Concurrent sagas can see each other’s partial results. If saga A reserves inventory and saga B reads inventory before A completes, B may make decisions on uncommitted data. Handle this at the application level.

Non-idempotent steps or compensations: If a step or compensation runs twice due to retries, the effect should be the same as running once. Reserving inventory twice for the same request should not create two reservations. Always check before acting.

Compensation order errors: Compensations must run in reverse order of execution. If step 2’s compensation runs before step 1’s, you may leave the system in a worse state. Explicitly track execution order.

Ignoring the inconsistency window: During a saga execution, the system is in an inconsistent state. Other operations may read partial results. Design your application to handle this (show pending states, use optimistic UI).

Long-running compensations: Compensations can take time (refunds, cancellations). While compensating, the system is still inconsistent. Minimize compensation time and alert if compensations are taking too long.

Not planning for compensation failure: What happens if compensation fails? The saga is stuck. Implement retry with backoff, then move to a dead letter state that requires manual intervention.
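
The last pitfall, retry with backoff and then a dead letter state, can be sketched as follows. The function name, the delay values, and the list standing in for a dead-letter queue are assumptions for illustration.

```python
import time
from typing import Callable, List


def compensate_with_retry(
    compensate: Callable[[], None],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True on success, False once retries are exhausted."""
    for attempt in range(max_attempts):
        try:
            compensate()
            return True
        except Exception:
            if attempt < max_attempts - 1:
                # Exponential backoff: base delay doubles after each failure.
                sleep(base_delay_s * (2 ** attempt))
    return False


dead_letter: List[str] = []  # stand-in for a dead-letter table or queue
delays: List[float] = []     # captures backoff delays instead of really sleeping


def always_failing_refund() -> None:
    raise ConnectionError("payment service unreachable")


if not compensate_with_retry(always_failing_refund, sleep=delays.append):
    dead_letter.append("saga-abc:refund")  # park for manual intervention and alert
```

Injecting `sleep` keeps the example fast and makes the backoff schedule easy to verify in a unit test. In production, adding jitter to the delay helps avoid synchronized retries across many sagas.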

Interview Questions

1. What is the Saga pattern and why was it developed as an alternative to two-phase commit?

Expected answer points:

Saga is a distributed transaction pattern that chains local transactions with compensating actions
2PC blocks resources and has a single point of failure (coordinator crash leaves participants waiting)
Saga was developed because ACID transactions do not scale across service boundaries — each service owns its database and cannot participate in a shared transaction
Saga trades ACID isolation for availability and scalability by using eventual consistency

2. What is the difference between choreographed and orchestrated sagas?

Expected answer points:

Choreographed saga: each service reacts to events from the previous service, behavior is fully distributed, no central coordinator
Orchestrated saga: a central orchestrator manages the workflow, knows the entire sequence, decides next steps and triggers compensations
Orchestration is easier to debug and reason about; choreography scales better but is harder to trace
Choreography couples services through event contracts; orchestration couples services to the orchestrator

3. What is a compensating transaction and why is it central to the Saga pattern?

Expected answer points:

A compensating transaction is an action that undoes a previously completed local transaction
Examples: refund payment, release inventory reservation, cancel shipment
When a step fails, the saga runs compensations for all completed steps in reverse order
Compensations must be deterministic and idempotent — running twice should have the same effect as running once
Not all operations can be compensated (sending an email, physical goods already shipped)

4. Why is idempotency critical in saga implementations?

Expected answer points:

Sagas run across unreliable networks — messages can be delivered twice, services can crash mid-operation
If a step retry delivers the same reservation request twice, you should not double-reserve inventory
Compensation runs must also be idempotent — refunding twice should not over-refund
Implementation: check before acting using an idempotency key (e.g., charge_id, reservation_id)
Without idempotency, retries cause data corruption and inconsistent state

5. What are the key states in a Saga execution state machine?

Expected answer points:

Pending: saga created but not started
Executing: at least one step has started and forward steps are running in order
Compensating: a step failed and compensation is running in reverse order
Completed: all steps succeeded or all compensations succeeded — terminal success state
Failed: unrecoverable state (compensation failed after retries, validation error, max retries exceeded)
Only Pending and Failed states are safely retryable at the saga level

6. How does optimistic locking work in the context of sagas, and when would you prefer it over pessimistic locking?

Expected answer points:

Optimistic locking: read the resource version before modifying, on update check version has not changed, if changed abort and retry
Pessimistic locking: acquire a lock before accessing a resource, blocks other sagas until lock releases
Optimistic locking provides higher throughput since locks are not held; pessimistic locking reduces concurrency
Use optimistic locking for most cases; prefer pessimistic locking for financial transactions and inventory with hard limits
In nested sagas, parent-level locking protects shared resources while nested sagas handle sub-workflows
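
The optimistic locking answer can be illustrated with a version column and a compare-and-set write. This is an in-memory sketch with hypothetical names (`InventoryStore`, `compare_and_set`, `reserve`). A database would express the same check as an UPDATE guarded by the expected version.

```python
from dataclasses import dataclass
from typing import Dict


class VersionConflict(Exception):
    pass


@dataclass
class InventoryRecord:
    sku: str
    available: int
    version: int = 0


class InventoryStore:
    def __init__(self) -> None:
        self._rows: Dict[str, InventoryRecord] = {}

    def put(self, record: InventoryRecord) -> None:
        self._rows[record.sku] = record

    def read(self, sku: str) -> InventoryRecord:
        row = self._rows[sku]
        return InventoryRecord(row.sku, row.available, row.version)  # snapshot copy

    def compare_and_set(self, sku: str, expected_version: int, available: int) -> None:
        """Write only if nobody else has updated the row since it was read."""
        row = self._rows[sku]
        if row.version != expected_version:
            raise VersionConflict(sku)
        row.available = available
        row.version += 1


def reserve(store: InventoryStore, sku: str, qty: int, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        snapshot = store.read(sku)
        if snapshot.available < qty:
            return False  # insufficient stock, retrying will not help
        try:
            store.compare_and_set(sku, snapshot.version, snapshot.available - qty)
            return True
        except VersionConflict:
            continue  # another saga won the race: re-read and try again
    return False


store = InventoryStore()
store.put(InventoryRecord("sku-1", available=5))
stale = store.read("sku-1")         # saga B reads version 0
first = reserve(store, "sku-1", 3)  # saga A reserves, version becomes 1
```

Saga B's stale snapshot now fails the version check instead of silently overwriting saga A's reservation, so B has to re-read and decide again with current data.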

7. What are the main trade-offs between Saga and Two-Phase Commit?

Expected answer points:

2PC provides true atomicity and isolation; Saga provides eventual consistency with no isolation
2PC blocks participants on coordinator failure; a saga holds no cross-service locks, so an orchestrator outage delays the saga without blocking other work
Saga has higher latency (sequential steps) vs 2PC (parallel phases), but 2PC locks are held longer
2PC has lower availability due to coordinator blocking; Saga has higher availability
Saga compensation is application-defined and can fail; 2PC rollback is automatic
Saga works across service boundaries with separate databases; 2PC requires all participants to support it

8. What is a nested saga and when would you use this pattern?

Expected answer points:

A nested saga is where a step's "execute" is itself a saga with its own compensation logic
Example: a Payment step runs authorize → capture as a nested saga; if it fails, parent saga treats it as one failed step
Benefits: reusability (ProcessPayment nested saga reused across order, subscription, refund workflows), readability (top-level reads like business workflow), scoped failures
Introduces concurrency at parent level — requires optimistic or pessimistic locking at parent level
Nested sagas help decompose large monolithic workflows into manageable pieces

9. How do you handle the case where a compensation itself fails?

Expected answer points:

Design idempotent compensations — if compensation runs twice the second run should have no effect
Implement retry with exponential backoff for transient failures
After max retries exceeded, move the saga to a dead letter state that requires manual intervention
Alert on repeated compensation failures — this indicates systemic issues
Design compensations to be as reliable as forward actions; avoid compensations that can themselves fail permanently (e.g., physical goods already delivered)

10. What are the key observability metrics and logs you should track for a production saga implementation?

Expected answer points:

Metrics: saga completion rate (success vs failure vs compensating), execution duration by type, compensation count and success rate, step failure rate by type, concurrent saga count
Logs: saga start with correlation ID, each step start/completion/compensation with step index, compensating transaction ID, saga outcome, all relevant IDs (saga, correlation, step, compensation)
Alerts: saga taking longer than threshold, compensation repeatedly failing, failure rate exceeding baseline, max retry count reached, stuck sagas
Distributed tracing (OpenTelemetry/Zipkin/Jaeger): parent span for overall saga, child spans per step and compensation, link compensation spans to original step spans

11. How does distributed tracing help debug saga failures across services?

Expected answer points:

Sagas span multiple services — without tracing, debugging means grepping logs across all services and piecing together what happened
Distributed tracing (OpenTelemetry, Zipkin, Jaeger) gives each saga a single trace ID that propagates through all service calls
Trace structure: parent span for the overall saga, child spans for each step and compensation, with links pairing compensation spans to original step spans
Trace context propagation: pass trace context through saga steps using baggage or span links so the full execution path is visible
When a step fails and compensation runs, link the compensation span to the original step span so you can see the pairing visually in the trace viewer

12. Why is saga state persistence critical and how would you implement it?

Expected answer points:

If the saga orchestrator crashes mid-execution, without persisted state you cannot determine which steps completed and which need compensation
Persist saga state to durable storage (database) after each step: saga ID, step name, status (started/completed/compensated), result data
The saga state machine tracks Pending, Executing, Compensating, Completed, and Failed states
On recovery, read persisted state and determine whether to resume, complete compensation, or treat as failed
Workflow engines like Temporal handle this automatically with durable workflow execution — if a worker crashes, the workflow resumes from where it left off

13. How does saga differ from other distributed transaction patterns like TCC (Try-Confirm-Cancel) and 2PC?

Expected answer points:

TCC uses three phases per operation: Try (reserve resources), Confirm (commit the reservation), Cancel (release the reservation). TCC requires infrastructure support from all participants
Saga uses two operations per step: the forward action and a separate compensating action. Simpler to implement but compensation can fail
2PC is atomic across all participants but blocks on coordinator failure; saga is eventually consistent but never blocks waiting for a coordinator
TCC provides some isolation at the Try phase; saga provides no isolation between steps
Saga is easier to retrofit onto existing services since each service only needs to implement its own compensation; TCC requires participants to implement Try/Confirm/Cancel interfaces

14. What happens when a network partition occurs during saga execution?

Expected answer points:

Network partitions cause timeouts and indeterminate state — you do not know whether the remote service received your request and processed it
Apply idempotency logic so duplicate messages after partition recovery do not cause duplicate operations
The saga must distinguish between "step is still running" and "step completed but network failed before response arrived"
Timeout-based compensation: if a step does not respond within a threshold, trigger compensation for completed steps
Use idempotency keys and state checks on recovery to determine whether to retry or continue as-is

15. What are the main testing strategies for saga implementations and why is each important?

Expected answer points:

Unit testing compensations: test each compensation in isolation, verify idempotency (running twice has no additional effect)
Integration testing failure scenarios: simulate step failures at each point and verify compensations run in correct reverse order
Compensation order testing: explicitly verify that compensations run in reverse execution order — this is the most common saga bug
Chaos testing in production: kill services mid-saga, inject latency, exhaust resources, partition networks, and verify the saga handles each gracefully
Distributed tracing verification: confirm trace IDs propagate correctly through all steps and compensations

16. What are the best practice guidelines for designing saga step granularity?

Expected answer points:

Steps should represent business activities, not technical operations — each step should be meaningful in the domain
Too fine-grained: too many steps means many compensations to manage and track; high orchestration overhead
Too coarse-grained: large steps mean expensive compensations; if one step fails you may have to undo a lot of work
Use nested sagas to decompose coarse steps into sub-workflows while keeping the top-level saga readable
Compensation cost should be proportional to forward action cost — avoid steps where compensation is significantly more expensive than the forward action

17. How does the saga pattern relate to eventual consistency and what are the implications for application design?

Expected answer points:

Saga provides eventual consistency: after the saga completes (success or compensation), the system reaches a consistent state
During saga execution, the system is in a temporarily inconsistent state — other operations can observe partial results
Application must handle intermediate states: show users pending states, use optimistic UI, design for the inconsistency window
Do not assume that because step 1 succeeded, step 2 will also succeed — handle failure at each step
Eventual consistency is acceptable for most business workflows (order fulfillment, booking, subscriptions) where strict isolation is not required

18. What is the role of correlation IDs in saga debugging and how should they be implemented?

Expected answer points:

A correlation ID is a unique identifier assigned to a saga instance at creation and propagated through all subsequent step calls and events
Include correlation ID in all logs so you can filter all services by the same ID to reconstruct the full saga execution
Propagate correlation ID through HTTP headers, message queue properties, or gRPC metadata depending on your transport
Do not expose internal IDs (database primary keys) directly — use the correlation ID as the public-facing identifier in logs and traces
Distributed tracing tools use trace IDs as correlation IDs when properly configured
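
A sketch of correlation ID propagation using only the standard library: a context variable holds the ID, a logging filter stamps it on every record, and a helper builds outgoing headers. The `X-Correlation-ID` header name is a common convention assumed here, not a standard, and the function names are illustrative.

```python
import contextvars
import io
import logging
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copies the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


stream = io.StringIO()  # captures output here; a real service logs to stdout or a collector
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(correlation_id)s %(message)s"))
handler.addFilter(CorrelationFilter())

logger = logging.getLogger("saga")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False


def start_saga() -> str:
    """Assign a correlation ID at saga creation and log the start event."""
    cid = str(uuid.uuid4())
    correlation_id.set(cid)
    logger.info("saga started")
    return cid


def outgoing_headers() -> dict:
    """Headers to attach to downstream HTTP calls made by a saga step."""
    return {"X-Correlation-ID": correlation_id.get()}


cid = start_saga()
headers = outgoing_headers()
```

Each downstream service would read the header, set the same context variable, and log with the same filter, so filtering logs by one ID reconstructs the whole saga.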

19. How do you decide between choreography and orchestration for a given saga implementation?

Expected answer points:

Choreography works well for simple workflows with 2-4 services where the logic is straightforward and teams are independent
Orchestration is better for complex workflows with many steps, multiple teams, or workflows that benefit from centralized error handling
Choreography creates loose coupling through event contracts but makes it harder to see the overall workflow state
Orchestration centralizes logic in one place making it easier to debug and modify but creates a coupling point to the orchestrator
Consider the team structure: if one team owns the workflow, orchestration gives them clear ownership; if multiple teams share responsibility, choreography may reduce coordination overhead

20. What are the key security considerations when implementing saga patterns?

Expected answer points:

Authenticate and authorize saga trigger endpoints — do not allow arbitrary saga instantiation without access control
Validate all saga input parameters to prevent injection attacks through saga payloads
Do not log sensitive data (payment info, passwords, PII) in saga context — logs may be stored in centralized logging systems
Encrypt saga state at rest if using external storage for persistence — saga state may contain business-sensitive information
Audit log all saga state changes (start, step completion, compensation, completion, failure) with correlation IDs for forensic analysis
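
One way to act on the logging point above is to mask sensitive fields before a saga context reaches the logger. The sketch below assumes the context is a nested structure of dicts and lists with string keys; the field names in `SENSITIVE_KEYS` and the `redact` helper are hypothetical and would need to match your own payloads.

```python
# Hypothetical set of field names treated as sensitive; adjust to your payloads.
SENSITIVE_KEYS = {"card_number", "cvv", "password", "ssn", "email"}

MASK = "***REDACTED***"


def redact(value):
    """Return a copy of a saga context with sensitive fields masked, recursively."""
    if isinstance(value, dict):
        return {
            key: MASK
            if isinstance(key, str) and key.lower() in SENSITIVE_KEYS
            else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    # Scalars are returned unchanged.
    return value
```

Calling `redact(context)` at the logging boundary leaves the original context untouched, so the saga steps still see the real values while the logs only see the masked copy.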

Quick Recap Checklist

The saga pattern manages distributed transactions without 2PC. Break the transaction into steps, and define a compensation for each step. If a step fails, run the compensations in reverse order. Saga provides eventual consistency, not ACID isolation. Two implementations exist: choreography distributes behavior across services, while orchestration centralizes coordination in an orchestrator. Both work; the choice depends on workflow complexity and team structure. Idempotency is essential for safe retries and preventing duplicate operations.

Reference Checklists

The checklists below consolidate the key requirements from this guide into actionable items. Use the production checklist as a pre-deployment gate: every item should be verified before a saga goes live. The observability checklist covers metrics, logs, and alerts to monitor in production. The security checklist covers authentication, validation, and audit concerns that are easy to overlook when focusing on compensation logic.

Production Checklist

Saga breaks distributed transactions into steps with compensating transactions
If a step fails, previous steps are compensated in reverse order
Saga provides eventual consistency, not ACID isolation
Two implementations: choreography (distributed) and orchestration (centralized)
Idempotency is essential for safe retries

Saga Production Checklist

Each saga step has a defined compensation transaction
Compensation transactions are tested with injected failures
Idempotency implemented for all saga steps
Saga state persisted for crash recovery
Timeout values tuned for each step (long enough to complete, short enough to recover)
Concurrent saga execution handled correctly
Monitoring and alerting for saga failures configured
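
The "tested with injected failures" item can be illustrated with a small in-memory sketch. It assumes a hypothetical `run_saga` executor and `SagaStep` class written for this example; it does not persist state, retry, or handle a compensation that itself fails, all of which a production saga needs.

```python
class SagaStep:
    """A local transaction paired with the compensation that undoes it."""

    def __init__(self, name, action, compensation):
        self.name = name
        self.action = action
        self.compensation = compensation


def run_saga(steps, context):
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    for step in steps:
        try:
            step.action(context)
            completed.append(step)
        except Exception as exc:
            # The failed step never completed, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensation(context)
            return {"status": "compensated", "failed_step": step.name, "error": str(exc)}
    return {"status": "completed", "failed_step": None, "error": None}


def demo_injected_failure():
    """Inject a failure into the payment step and record every call made."""
    calls = []

    def fail_charge(ctx):
        raise RuntimeError("payment declined")

    steps = [
        SagaStep(
            "create_order",
            lambda ctx: calls.append("create_order"),
            lambda ctx: calls.append("cancel_order"),
        ),
        SagaStep(
            "reserve_inventory",
            lambda ctx: calls.append("reserve_inventory"),
            lambda ctx: calls.append("release_inventory"),
        ),
        SagaStep(
            "charge_payment",
            fail_charge,
            lambda ctx: calls.append("refund_payment"),
        ),
    ]
    return run_saga(steps, {}), calls
```

A test built on `demo_injected_failure` can assert that inventory is released before the order is cancelled, and that no refund is attempted for a charge that never succeeded.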

Conclusion

The saga pattern manages distributed transactions without 2PC. Break the transaction into steps, and define a compensation for each step. If a step fails, run the compensations in reverse order.

Choreographed sagas distribute behavior across services. Orchestrated sagas centralize coordination in an orchestrator. Both work; the choice depends on workflow complexity and team structure.

Sagas trade ACID isolation for availability. The application must handle partial states and concurrent access. This complexity is inherent to distributed transactions; saga just makes it explicit rather than pretending it does not exist.

sequenceDiagram
    participant Saga
    participant Inv as Inventory
    participant Pay as Payment
    Saga->>+Inv: Step 1: Reserve
    Inv->>-Saga: OK
    Saga->>+Pay: Step 2: Charge
    Pay->>-Saga: Fail
    Saga->>+Inv: Compensate

Observability Checklist

Metrics

Saga completion rate (success vs failure vs compensating)
Saga execution duration by type
Compensation execution count and success rate
Step failure rate by step type
Concurrent saga count
Average number of steps per saga
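
As a rough illustration of how a few of these metrics could be derived, the sketch below keeps counters in memory. The `SagaMetrics` class and the outcome labels are assumptions for this example; in practice you would most likely emit these values through your existing metrics library rather than compute them by hand.

```python
from collections import Counter, defaultdict


class SagaMetrics:
    """In-memory counters for a handful of the saga metrics listed above."""

    def __init__(self):
        self.outcomes = Counter()            # outcome label -> count
        self.step_failures = Counter()       # step type -> failure count
        self.durations = defaultdict(list)   # saga type -> duration samples

    def record_saga(self, saga_type, outcome, duration_seconds):
        """Record one finished saga; outcome is e.g. 'success' or 'failure'."""
        self.outcomes[outcome] += 1
        self.durations[saga_type].append(duration_seconds)

    def record_step_failure(self, step_type):
        self.step_failures[step_type] += 1

    def completion_rate(self):
        """Fraction of recorded sagas that ended in 'success'."""
        total = sum(self.outcomes.values())
        return self.outcomes["success"] / total if total else 0.0

    def mean_duration(self, saga_type):
        """Average duration in seconds for one saga type."""
        samples = self.durations.get(saga_type, [])
        return sum(samples) / len(samples) if samples else 0.0
```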

Logs

Log saga start with correlation ID and input parameters
Log each step start, completion, and compensation with step index
Include compensating transaction ID in compensation logs
Log saga outcome (success, failure, compensating)
Include all relevant IDs: saga ID, correlation ID, step IDs, compensation IDs

Alerts

Alert when saga takes longer than expected threshold
Alert when compensation repeatedly fails
Alert when saga failure rate exceeds normal baseline
Alert when max retry count reached on a step
Alert on stuck sagas (no progress for extended period)
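
The stuck-saga alert can be driven by a periodic check over persisted saga state. The sketch below assumes each saga record is a dict with `saga_id`, `status`, and a timezone-aware `last_progress_at` timestamp; those field names, the terminal status labels, and the 15-minute default are all hypothetical and should be tuned to your workflows.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical status labels for sagas that need no further progress.
TERMINAL_STATUSES = {"completed", "compensated", "failed"}


def find_stuck_sagas(sagas, now, max_idle=timedelta(minutes=15)):
    """Return IDs of in-flight sagas with no progress for longer than max_idle."""
    return [
        saga["saga_id"]
        for saga in sagas
        if saga["status"] not in TERMINAL_STATUSES
        and now - saga["last_progress_at"] > max_idle
    ]
```

Passing `now` as an argument, instead of reading the clock inside the function, keeps the check easy to test with fixed timestamps.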

Security Checklist

Authenticate and authorize saga trigger endpoints
Validate saga input parameters to prevent injection
Audit log all saga state changes (start, step completion, compensation)
Do not log sensitive data (payment info, passwords) in saga context
Encrypt saga state at rest if using external storage
Use correlation IDs for tracing without exposing internal IDs

