The Saga Pattern: Managing Distributed Transactions
Learn the saga pattern for distributed transactions without two-phase commit. Understand choreography vs orchestration through practical examples and production considerations.
The saga pattern is one of the most practical patterns for managing distributed transactions without the complexity of two-phase commit. When a business operation spans multiple services — each with their own database — a single ACID transaction is not possible. The saga breaks this into a sequence of local transactions, each with a compensating transaction that can undo it if something fails.
Introduction
ACID transactions do not scale across services. When order service, inventory service, and payment service each have their own databases, you cannot wrap a single transaction around all three. Two-phase commit is theoretically possible but practically problematic (more on that in Two-Phase Commit).
The saga pattern offers an alternative. Instead of locking resources across services, a saga breaks a distributed transaction into a sequence of local transactions, each with a compensating transaction that can undo it.
Topic-Specific Deep Dives
Core Concepts
A saga represents a distributed transaction as a series of steps. Each step is a local transaction on one service. After each step, the saga either continues to the next step or, if a step fails, runs compensating transactions to undo the previous steps.
Step 1: Reserve Inventory (compensate: release inventory)
Step 2: Charge Payment (compensate: refund payment)
Step 3: Create Shipment (compensate: cancel shipment)
If step 3 fails, you run the compensation for step 2 (refund) and then step 1 (release inventory). The saga undoes what it did, leaving the system consistent.
sequenceDiagram
participant Saga
participant Inv as Inventory
participant Pay as Payment
participant Ship as Shipping
Saga->>+Inv: Reserve
Inv->>-Saga: OK
Saga->>+Pay: Charge
Pay->>-Saga: OK
Saga->>+Ship: Ship
Ship->>-Saga: OK
Note over Saga: Success - all steps complete
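Note that a saga does not provide isolation: between steps, other transactions can observe intermediate state, such as inventory that is reserved but not yet paid for.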
Example: order processing saga
A complete order processing saga with three services:
def create_order_saga(order):
    saga = OrderSaga()

    # Step 1: Reserve inventory
    reservation = saga.reserve_inventory(order.items)
    if not reservation.success:
        return OrderResult(rejected=True, reason="Insufficient inventory")

    # Step 2: Authorize payment
    authorization = saga.authorize_payment(order.payment, order.total)
    if not authorization.success:
        saga.compensate_inventory(reservation.id)
        return OrderResult(rejected=True, reason="Payment declined")

    # Step 3: Create shipment
    shipment = saga.create_shipment(order.address, order.items)
    if not shipment.success:
        saga.reverse_payment(authorization.id)
        saga.compensate_inventory(reservation.id)
        return OrderResult(rejected=True, reason="Shipping unavailable")

    # All steps succeeded
    return OrderResult(confirmed=True, order_id=order.id, shipment=shipment)
Here compensate_inventory and reverse_payment are the compensating transactions. On failure they run in the reverse order of the steps they undo.
Saga vs two-phase commit
Two-phase commit (2PC) locks resources until all participants confirm. It provides atomicity but at the cost of availability. If the coordinator fails during the commit phase, participants may be left waiting indefinitely.
Saga takes a different approach. It sacrifices isolation and atomicity for availability and scalability. Steps execute one at a time. Failures trigger compensation, not rollback.
For a deep dive on 2PC and why it is often avoided, see Two-Phase Commit.
Architecture Deep Dives
There are two ways to implement a saga. Choreography uses events between services to coordinate. Orchestration uses a central orchestrator to manage the sequence.
Choreographed saga
Each service knows its own step and its own compensation. When a step completes, the service emits an event. The next service reacts to that event. If something fails, services emit failure events that trigger compensations.
graph LR
Order[Order Service] -->|OrderCreated| Inv[Inventory]
Inv -->|InventoryReserved| Pay[Payment]
Pay -->|PaymentCharged| Ship[Shipping]
Ship -->|ShipmentCreated| Order
In this flow, each service reacts to the previous step’s event. The behavior is distributed. Each service knows only its own piece.
When Payment fails after Inventory is reserved:
graph LR
Pay -->|PaymentFailed| Inv
Inv -->|InventoryReleased| Order
The compensation logic lives in each service. Inventory responds to the failure by releasing the reservation.
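The event flow above can be sketched with a small in-memory event bus. This is only an illustration: `EventBus` and `wire_services` are hypothetical names, the bus is synchronous, and a real system would use a message broker with durable delivery.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Synchronous in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict) -> None:
        self.history.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


def wire_services(bus: EventBus, payment_ok: bool) -> Dict[str, int]:
    """Register one handler per service; each knows only its own step."""
    stock = {"reserved": 0}

    def on_order_created(event: Dict) -> None:
        # Inventory service: reserve, then announce it
        stock["reserved"] += event["qty"]
        bus.publish("InventoryReserved", event)

    def on_inventory_reserved(event: Dict) -> None:
        # Payment service: charge, or announce the failure
        bus.publish("PaymentCharged" if payment_ok else "PaymentFailed", event)

    def on_payment_failed(event: Dict) -> None:
        # Inventory service: compensate by releasing the reservation
        stock["reserved"] -= event["qty"]
        bus.publish("InventoryReleased", event)

    bus.subscribe("OrderCreated", on_order_created)
    bus.subscribe("InventoryReserved", on_inventory_reserved)
    bus.subscribe("PaymentFailed", on_payment_failed)
    return stock


bus = EventBus()
stock = wire_services(bus, payment_ok=False)
bus.publish("OrderCreated", {"order_id": "order-123", "qty": 2})
```

No component in this sketch holds the whole workflow; the sequence only becomes visible by reading `bus.history`, which is the visibility cost of choreography.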
Orchestrated saga
A central orchestrator manages the sequence. It decides what step runs next, handles failures, and triggers compensations. The orchestrator knows the entire workflow.
graph LR
Orch[Order Orchestrator] -->|Reserve| Inv
Inv -->|OK| Orch
Orch -->|Charge| Pay
Pay -->|OK| Orch
Orch -->|Ship| Ship
Ship -->|OK| Orch
The orchestrator keeps state about what has completed. If Payment fails, the orchestrator tells Inventory to release and returns an error to the client.
For a full comparison, see Service Orchestration.
Implementing saga compensation
Compensation logic must be deterministic. If step 3 fails after step 2 succeeded, you must undo step 2. Running compensation twice or in the wrong order causes problems.
class OrderSaga:
    def __init__(self, inventory, payment, shipping):
        # Service clients are injected; completed steps are recorded in order
        self.inventory = inventory
        self.payment = payment
        self.shipping = shipping
        self.steps = []

    def execute(self, order):
        # PaymentDeclined, ShippingUnavailable and SagaResult are assumed
        # to be defined elsewhere in the application
        try:
            # Step 1: Reserve inventory
            reservation = self.inventory.reserve(order.items)
            self.steps.append(('reserve', reservation))

            # Step 2: Process payment
            charge = self.payment.charge(order.payment_info, order.total)
            self.steps.append(('charge', charge))

            # Step 3: Create shipment
            shipment = self.shipping.create(order.address)
            self.steps.append(('ship', shipment))

            return SagaResult(success=True, shipment=shipment)
        except PaymentDeclined:
            # Compensate step 1
            self.inventory.release(self.steps[0][1].reservation_id)
            return SagaResult(success=False, reason="Payment declined")
        except ShippingUnavailable:
            # Compensate step 2
            self.payment.refund(self.steps[1][1].charge_id)
            # Compensate step 1
            self.inventory.release(self.steps[0][1].reservation_id)
            return SagaResult(success=False, reason="Shipping unavailable")
The saga tracks completed steps in order. On failure, it runs compensations in reverse order. This is the compensating transaction pattern.
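The reverse-order rule can also be written generically rather than once per exception type. The following is a minimal sketch, assuming each step is supplied as a hypothetical `(name, action, compensation)` tuple; `run_saga` is an illustrative helper, not part of any library.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with the compensation that undoes it.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Undo only the steps that actually completed, newest first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return False, log
        completed.append(step)
        log.append(f"done:{name}")
    return True, log


def _fail() -> None:
    raise RuntimeError("shipping unavailable")


ok, log = run_saga([
    ("reserve", lambda: None, lambda: None),
    ("charge", lambda: None, lambda: None),
    ("ship", _fail, lambda: None),
])
```

The failed step itself is not compensated, because it never completed. A production version would also need to handle a compensation that raises, which this sketch does not.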
Idempotency in sagas
Sagas execute across unreliable networks. Messages may be delivered twice. Services may crash mid-operation. Every saga step must therefore be idempotent.
Reserving the same inventory twice should not double-reserve. Charging the same payment twice should not double-charge.
def reserve_inventory(self, reservation_id, items):
    # Idempotency check
    if self.inventory.is_reserved(reservation_id):
        return self.inventory.get_reservation(reservation_id)
    # Actually reserve
    return self.inventory.create_reservation(reservation_id, items)

def charge_payment(self, charge_id, payment_info, amount):
    # Idempotency check using charge_id
    existing = self.payment.get_charge(charge_id)
    if existing:
        return existing
    # Process charge
    return self.payment.create_charge(charge_id, payment_info, amount)
Include an idempotency key (the reservation ID from step 1) in the compensation call. Check before acting.
Nested sagas
Large workflows sometimes need sub-sagas. Rather than one monolithic saga with dozens of steps, you can decompose into nested sagas where a step’s “execute” is itself a saga.
For example, an order fulfillment saga might have a step called ProcessPayment that internally runs authorize → capture as a nested saga. If the nested saga fails, the parent saga treats it as a single failed step and compensates accordingly.
Why nest sagas?
- Reusability: The ProcessPayment nested saga can be reused across multiple parent sagas (order, subscription, refund)
- Readability: Top-level saga reads like a business workflow, not a technical protocol
- Scoped failures: If payment fails, you know exactly which sub-step failed without scrolling through 20 parent steps
Concurrency control with nested sagas:
Nested sagas introduce concurrency at the parent level. While the payment nested saga is running, other parent sagas may also be running and trying to access shared resources. Use optimistic or pessimistic locking at the parent level.
class ParentSaga:
    def execute(self):
        # Step 1: Reserve inventory (parent-level lock on inventory record)
        with self.lock('inventory', self.order.inventory_id):
            self.reserve_inventory()

        # Step 2: Run payment as nested saga (has its own compensation)
        payment_result = PaymentNestedSaga().execute(self.payment_context)
        if not payment_result.success:
            # Parent-level compensation for inventory
            self.release_inventory()  # Uses same lock
            return Failed

        # Step 3: Create shipment
        self.create_shipment()
Optimistic locking: Read the resource version before modifying. On update, check the version hasn’t changed. If it has, abort and retry.
Pessimistic locking: Acquire a lock before accessing the resource. Blocks other sagas from accessing it until the lock releases. Simpler but reduces concurrency.
Use optimistic locking for most cases (higher throughput). Use pessimistic locking only when the cost of a concurrent modification is very high (financial transactions, inventory with hard limits).
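To make the optimistic variant concrete, here is a minimal sketch of a version check with retry. `InventoryStore`, `Record`, and `VersionConflict` are hypothetical in-memory stand-ins; with a real database the compare-and-set would typically be a conditional update on a version column.

```python
from dataclasses import dataclass
from typing import Dict


class VersionConflict(Exception):
    """Raised when the record changed since it was read."""


@dataclass
class Record:
    quantity: int
    version: int = 0


class InventoryStore:
    def __init__(self) -> None:
        self.records: Dict[str, Record] = {}

    def read(self, key: str) -> Record:
        rec = self.records[key]
        return Record(rec.quantity, rec.version)  # snapshot copy

    def compare_and_set(self, key: str, new_quantity: int, expected_version: int) -> None:
        current = self.records[key]
        if current.version != expected_version:
            raise VersionConflict(key)
        self.records[key] = Record(new_quantity, expected_version + 1)


def reserve(store: InventoryStore, key: str, qty: int, max_attempts: int = 3) -> bool:
    for _ in range(max_attempts):
        snapshot = store.read(key)
        if snapshot.quantity < qty:
            return False  # permanent failure: not enough stock
        try:
            store.compare_and_set(key, snapshot.quantity - qty, snapshot.version)
            return True
        except VersionConflict:
            continue  # another saga won the race; re-read and retry
    return False


store = InventoryStore()
store.records["sku-1"] = Record(quantity=5)
```

A writer that still holds a stale version is rejected instead of silently overwriting the newer value, which is the property the saga relies on when no lock is held.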
from datetime import datetime, timezone


class SagaState:
    """Persisted saga state for crash recovery."""

    def __init__(self, saga_id: str):
        self.saga_id = saga_id
        self.steps: list[tuple[str, str, dict]] = []  # [(step_name, status, data)]
        self.status = "pending"
        self.created_at = datetime.now(timezone.utc)

    def mark_step_started(self, step_name: str, step_data: dict):
        self.steps.append((step_name, "started", step_data))
        self._persist()

    def mark_step_completed(self, step_name: str, result_data: dict):
        # Update step status
        for i, (name, status, data) in enumerate(self.steps):
            if name == step_name and status == "started":
                self.steps[i] = (name, "completed", {**data, **result_data})
                break
        self._persist()

    def mark_step_compensated(self, step_name: str):
        for i, (name, status, data) in enumerate(self.steps):
            if name == step_name and status == "completed":
                self.steps[i] = (name, "compensated", data)
                break
        self._persist()

    def get_pending_steps(self) -> list[str]:
        return [name for name, status, _ in self.steps if status == "started"]

    def get_completed_steps(self) -> list[str]:
        return [name for name, status, _ in self.steps if status == "completed"]

    def to_dict(self) -> dict:
        return {
            "saga_id": self.saga_id,
            "steps": self.steps,
            "status": self.status,
            "created_at": self.created_at.isoformat(),
        }

    def _persist(self):
        # Save to durable storage; `db` stands in for your database client
        db.sagas.upsert(self.saga_id, self.to_dict())
Failure Handling
Saga failures fall into several categories.
Transient failures: Network timeouts, temporary unavailability. Retry with backoff. If it keeps failing, treat as permanent.
Permanent failures: Insufficient inventory, card declined. These will not succeed on retry. Trigger compensation.
Unknown state: A service crashes mid-operation. When it recovers, determine what happened. Did the transaction commit before the crash? Did it not? This requires idempotency and state tracking.
Design compensations to be idempotent. If compensation runs twice, the second run should have no effect.
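The transient versus permanent distinction can be sketched as a retry helper with exponential backoff. `TransientError`, `PermanentError`, and `call_with_retry` are hypothetical names for illustration; the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Worth retrying (timeout, temporary unavailability)."""


class PermanentError(Exception):
    """Will not succeed on retry (card declined, no stock)."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry transient failures with backoff; escalate once attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            if attempt == max_attempts:
                # Retries exhausted: treat as permanent so the saga compensates.
                raise PermanentError(f"gave up after {attempt} attempts") from exc
            sleep(base_delay * 2 ** (attempt - 1))  # delay doubles each attempt
    raise AssertionError("unreachable")
```

A `PermanentError` raised by the operation itself is not caught here, so it propagates immediately and the caller can start compensation. Adding random jitter to the delay is a common refinement that this sketch leaves out.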
Performance considerations
Saga trades a synchronous two-round-trip 2PC for a sequential multi-step approach. The math is worth understanding.
2PC latency profile: Two network round-trips (prepare + commit), but all participants vote and commit in parallel. Typical: 10-50ms per participant phase.
Saga latency profile: Each step adds its own latency. If step 1 takes 20ms, step 2 takes 30ms, step 3 takes 15ms, your total is 65ms plus orchestration overhead. Steps run sequentially, not in parallel.
For a 3-step saga vs 2PC across the same 3 services:
| Metric | 2PC | Saga |
|---|---|---|
| Happy path latency | ~20-40ms (parallel phases) | ~50-80ms (sequential steps) |
| Failure recovery | Blocks until coordinator recovers | Compensation runs immediately |
| Availability | Lower (blocking on coordinator) | Higher (no coordinator SPOF) |
| Lock duration | All locks held during both phases | Each lock released after its step |
Saga’s latency overhead is real but often acceptable. If each step is 10-30ms (typical for service calls), a 5-step saga runs in 50-150ms total. Compare that to a user-facing API timeout (usually 1-5 seconds) and the overhead is negligible.
Where saga latency hurts is high-throughput, low-latency paths (trading systems, real-time pricing). In those cases, consider whether you can pipeline steps (some steps don’t depend on previous step results and can overlap).
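As a rough illustration of pipelining, the sketch below overlaps two steps that do not depend on each other using `asyncio.gather`. The step names (`fraud_check`, `shipping_quote`) and latencies are made up for the example, and `call_service` merely sleeps in place of a network call.

```python
import asyncio
import time


async def call_service(name: str, latency: float) -> str:
    await asyncio.sleep(latency)  # stand-in for a network call
    return name


async def pipelined_saga() -> list:
    # Step 1 must finish first: later steps need the reservation.
    reserved = await call_service("reserve_inventory", 0.02)
    # These two hypothetical steps are independent, so they run concurrently.
    fraud, quote = await asyncio.gather(
        call_service("fraud_check", 0.03),
        call_service("shipping_quote", 0.03),
    )
    return [reserved, fraud, quote]


start = time.perf_counter()
results = asyncio.run(pipelined_saga())
elapsed = time.perf_counter() - start
print(f"completed {results} in {elapsed * 1000:.0f} ms")
```

Run sequentially, these three calls would take about 80 ms; overlapped, the two independent steps cost roughly the latency of the slower one. The trade-off is that a failure in one overlapped step may require compensating the other even though it succeeded.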
Trade-off Analysis
The saga pattern involves several fundamental trade-offs compared to other distributed transaction approaches.
Saga vs Alternative Patterns
| Aspect | Saga | Two-Phase Commit | TCC (Try-Confirm-Cancel) |
|---|---|---|---|
| Atomicity | Eventual (via compensation) | True atomic (all-or-nothing) | Eventual (via Confirm/Cancel) |
| Isolation | None (partial states visible) | Full isolation | Partial (Try reserves) |
| Availability | High (no coordinator blocking) | Low (coordinator is SPOF) | Medium (Try phase reserves resources) |
| Complexity | Medium (compensation logic) | High (coordinator protocol) | High (three operations per service) |
| Rollback Mechanism | Compensation transactions | Automatic rollback | Cancel operation releases Try reservations |
| Latency | Per-step cumulative | Two round-trips (parallel phases) | Two round-trips (Try, then Confirm or Cancel) |
| Service Coupling | Each service owns compensation | All services must support 2PC | All services must implement TCC |
| Failure Handling | Application-defined compensation | Coordinator-driven | Try phase failures auto-cancel |
Choreography vs Orchestration Trade-offs
| Factor | Choreography | Orchestration |
|---|---|---|
| Central Point of Failure | None (fully distributed) | Orchestrator is a potential failure point |
| Logic Location | Distributed across services | Centralized in orchestrator |
| Adding New Steps | All participating services must change | Only orchestrator changes |
| Understanding Workflow | Harder to see overall flow | Easier to see complete workflow |
| Debugging | Requires distributed tracing | Easier (centralized logs) |
| Scalability | Scales with number of services | Limited by orchestrator capacity |
| Team Independence | High (services evolve independently) | Low (teams depend on orchestrator) |
| Suitable For | 2-4 services, simple event flows | 5+ services, complex workflows |
Design Decision Matrix
| Your Constraint | Recommended Approach |
|---|---|
| Maximum availability required | Saga (either flavor) |
| True ACID isolation required | Single database or 2PC (with caveats) |
| Cross-service transactions | Saga |
| Can’t modify participant code | Saga (only compensation needed) |
| High-frequency, low-latency | Pipelined saga or avoid distributed transaction |
| Long-running (hours to days) | Orchestrated saga with durable execution (Temporal) |
| Simple 2-3 step workflow | Choreographed saga |
| Complex multi-team workflow | Orchestrated saga |
| Already using workflow engine | Use the engine’s saga support (Temporal, Step Functions) |
Cost-Benefit Summary
Saga Benefits:
- No blocking commit coordinator: an orchestrator, if used, can still fail, but it holds no cross-service locks while it is down
- Steps execute sequentially — each lock is released after its step completes
- Works across service boundaries with separate databases
- Compensation is application-defined and domain-appropriate
- Easier to retrofit onto existing services
Saga Costs:
- No isolation — application must handle partial states
- Compensation can fail, leaving the system in an inconsistent state
- Longer cumulative latency (sum of step latencies vs parallel phases)
- Application must implement idempotency throughout
- Testing is more complex (failure scenarios, compensation ordering)
Saga Testing Strategies
Testing sagas is hard because they span services and involve time. A structured approach helps.
Unit testing compensations
Test each compensation in isolation first. The compensation is the most critical piece — if it fails, your saga is stuck.
# Test that releasing inventory twice has no effect (idempotency)
def test_release_inventory_idempotent():
    inventory = InMemoryInventory()
    inventory.reserve("order-123", ["item-a"])

    # First release: succeeds
    result1 = inventory.release("order-123")
    assert result1.success
    assert not inventory.is_reserved("order-123")

    # Second release: should be idempotent (no error)
    result2 = inventory.release("order-123")
    assert result2.success  # Still succeeds, even though already released

    # Third release: still idempotent
    result3 = inventory.release("order-123")
    assert result3.success
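The test above relies on an `InMemoryInventory` test double that the article does not show. One possible minimal version, written as an assumption about its interface (`reserve`, `release`, `is_reserved`, and a result object with a `success` flag), could look like this:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Result:
    success: bool


class InMemoryInventory:
    """Hypothetical in-memory test double with idempotent operations."""

    def __init__(self) -> None:
        self._reservations: Dict[str, List[str]] = {}

    def reserve(self, order_id: str, items: List[str]) -> Result:
        # setdefault keeps the first reservation if the call is repeated
        self._reservations.setdefault(order_id, list(items))
        return Result(success=True)

    def is_reserved(self, order_id: str) -> bool:
        return order_id in self._reservations

    def release(self, order_id: str) -> Result:
        # pop with a default makes a repeated release a no-op
        self._reservations.pop(order_id, None)
        return Result(success=True)
```

Keying both operations on the order ID is what makes them idempotent: a repeated call finds the state already as requested and changes nothing.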
Integration testing failure scenarios
The real test is whether your saga handles failures correctly. Set up test infrastructure that simulates failures at each step.
# Test: step 3 fails, so steps 2 and 1 are compensated
def test_saga_step3_failure_triggers_step2_compensation():
    inventory = MockInventory()                # Always succeeds
    payment = MockPayment()                    # Always succeeds
    shipping = MockShipping(fail_on="create")  # Fails on create

    saga = OrderSaga(inventory, payment, shipping)
    result = saga.execute(order_with_3_items)

    assert result.failed
    assert result.failed_step == "create_shipment"
    assert payment.refund_was_called()         # Step 2 compensated
    assert inventory.release_was_called()      # Step 1 compensated
    assert not shipping.shipment_created       # Step 3 failed, so no shipment exists
Testing compensations run in correct order
A common saga bug is compensation running in the wrong order. Write a test that explicitly verifies reverse order.
def test_compensation_runs_in_reverse_order():
    call_order = []

    class TrackingService:
        def __init__(self, name):
            self.name = name

        def do_step(self):
            call_order.append(f"do-{self.name}")

        def compensate(self):
            call_order.append(f"compensate-{self.name}")

    class OrderedSaga:
        def __init__(self):
            self.s1 = TrackingService(name="step1")
            self.s2 = TrackingService(name="step2")
            self.s3 = TrackingService(name="step3")

        def execute(self):
            for service in (self.s1, self.s2, self.s3):
                service.do_step()

        def compensate(self):
            # Should run in reverse: s3, s2, s1
            for service in (self.s3, self.s2, self.s1):
                service.compensate()

    saga = OrderedSaga()
    saga.execute()
    # Simulate a failure detected after step 3 by compensating explicitly
    saga.compensate()

    assert call_order == [
        "do-step1", "do-step2", "do-step3",
        "compensate-step3", "compensate-step2", "compensate-step1",
    ]
Chaos testing sagas in production
Once your saga is running in production, inject failures to verify it handles them:
- Kill a service mid-saga and verify compensation runs
- Introduce network latency and verify timeouts trigger correctly
- Fill up a resource (disk, connection pool) and verify graceful degradation
- Split the network between two services and verify saga completes or compensates correctly
Observability & Tracing
Observability
Sagas span multiple services. Without tracing, debugging a failed saga means grepping logs across 5 services and trying to piece together what happened. With distributed tracing (OpenTelemetry, Zipkin, Jaeger), you get a single trace ID that follows the saga across all services.
Trace structure for sagas
A saga trace has a parent span for the overall saga, with child spans for each step and compensation.
Trace: order-123-saga (trace-id: abc123)
├── Span: saga_created (service: order-service)
├── Span: step.reserve_inventory (service: inventory-service)
│ └── Span: compensate.reserve_inventory (service: inventory-service)
├── Span: step.charge_payment (service: payment-service)
│ └── Span: compensate.charge_payment (service: payment-service)
├── Span: step.create_shipment (service: shipping-service)
└── Span: saga_completed / saga_failed
Implementing trace context propagation
Pass trace context through saga steps by injecting it into outgoing request headers and extracting it on the receiving side.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)


class OrderSaga:
    def execute(self, order, context=None):
        # Extract incoming trace context (if triggered by HTTP request)
        ctx = extract(context) if context else None

        # start_as_current_span makes the saga span current,
        # so inject() below propagates it to the step calls
        with tracer.start_as_current_span("order_saga", context=ctx) as span:
            span.set_attribute("saga.id", order.saga_id)
            span.set_attribute("saga.type", "order_fulfillment")

            # Inject trace context into step calls
            headers = {}
            inject(headers)  # Injects current trace into headers

            try:
                # Step 1: Reserve inventory (pass headers for trace propagation)
                inventory_ctx = self.inventory.reserve(order.items, headers)
                # Step 2: Charge payment
                payment_ctx = self.payment.charge(order.payment, headers)
                # Step 3: Create shipment
                shipment = self.shipping.create(order.address, headers)

                span.set_status(Status(StatusCode.OK))
                return Success(shipment)  # Success is assumed to be defined elsewhere
            except Exception as e:
                span.record_exception(e)
                span.set_status(Status(StatusCode.ERROR, str(e)))
                raise
When a step fails and compensation runs, link the compensation span to the original step span so you can see the pairing in the trace.
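from opentelemetry.trace import Link

def compensate_inventory(self, reservation_id, original_span):
    # A link ties the compensation span to the step span it undoes
    with tracer.start_as_current_span(
        "compensate.reserve_inventory",
        links=[Link(original_span.get_span_context())],
    ) as span:
        # .name is available on SDK spans; pass the step name explicitly otherwise
        span.set_attribute("compensation.for", original_span.name)
        span.set_attribute("reservation.id", reservation_id)
        self.inventory.release(reservation_id)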
This way, in your trace viewer, you see the compensate span linked back to its original do span — paired visually rather than guessing from logs.
Framework & Decision Guides
Building saga from scratch is educational but painful for production. Use a framework that handles state persistence, retry logic, observability, and distributed tracing out of the box.
Temporal
The strongest choice for saga orchestration. Temporal provides durable workflow execution — if your service crashes mid-saga, Temporal persists the workflow state and resumes it from where it left off. No need to build your own saga state machine.
- Strengths: Durable execution (survives worker crashes), built-in retries with backoff, activity heartbeats, sandboxed workflow code, strong OpenTelemetry integration
- Good for: Complex multi-step business workflows, long-running sagas (hours to days)
- Trade-offs: Self-hosting is operationally heavy; Temporal Cloud pricing can be significant at scale
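# Temporal workflow example (Python SDK)
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError

# reserve_inventory, charge_payment, release_inventory and create_shipment
# are activities assumed to be defined elsewhere with @activity.defn.


@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: Order) -> OrderResult:
        # Activities with automatic retry
        reservation = await workflow.execute_activity(
            reserve_inventory,
            order.items,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        try:
            charge = await workflow.execute_activity(
                charge_payment,
                args=[order.payment, order.total],  # multiple arguments go through args=
                start_to_close_timeout=timedelta(seconds=30),
                # A decline will not succeed on retry, so do not retry it
                retry_policy=RetryPolicy(non_retryable_error_types=["PaymentDeclined"]),
            )
        except ActivityError:
            # Activity exceptions reach the workflow wrapped in ActivityError;
            # inspect the error's cause to tell a decline from other failures.
            await workflow.execute_activity(
                release_inventory,
                reservation.id,
                start_to_close_timeout=timedelta(seconds=30),
            )
            return OrderResult(rejected=True, reason="Payment declined")
        shipment = await workflow.execute_activity(
            create_shipment,
            order.address,
            start_to_close_timeout=timedelta(seconds=30),
        )
        return OrderResult(confirmed=True, shipment=shipment)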
AWS Step Functions
Managed saga orchestration on AWS. Integrates tightly with AWS services (Lambda, ECS, SQS, DynamoDB). Good if you’re already all-in on AWS.
- Strengths: Fully managed, pay-per-state-transition, tight Lambda integration, visual workflow designer
- Good for: AWS-centric architectures, medium-complexity workflows
- Trade-offs: Vendor lock-in, expensive at high step counts, debugging can be opaque
Conductor (Netflix)
Conductor is an open-source workflow orchestrator originally developed at Netflix. Good for microservices that need workflow orchestration without heavy operational overhead.
- Strengths: Open source (self-hostable), JSON-based workflow definitions, HTTP-based workers
- Good for: Teams wanting open-source without Temporal complexity
- Trade-offs: Not as battle-tested as Temporal at extreme scale, less mature ecosystem
Comparison
| Framework | Durability | Open Source | Complexity | Best For |
|---|---|---|---|---|
| Temporal | Excellent (durable execution) | Yes (server) + cloud | Medium | Complex long-running workflows |
| AWS Step Functions | Good (managed) | No | Low | AWS-centric, simple workflows |
| Conductor | Good | Yes (fully) | Medium | Open-source preference |
For most production scenarios, Temporal is the right call. The durable execution guarantee alone eliminates a whole class of saga state loss bugs.
When to Use / When Not to Use the Saga Pattern
| Criteria | Saga (Choreography) | Saga (Orchestration) | Two-Phase Commit |
|---|---|---|---|
| Atomicity | Eventual | Eventual | True atomicity |
| Isolation | None | None | Full isolation |
| Availability | High | High | Low (blocks on coordinator failure) |
| Complexity | Distributed logic | Centralized logic | Distributed but synchronous |
| Compensation | Each service knows its own | Orchestrator directs | Automatic rollback |
| Latency | Per-step latency | Per-step + orchestration overhead | Two round-trips |
| Debugging | Harder (distributed) | Easier (centralized) | Moderate |
| Rollback Cost | Compensation required | Compensation required | Free (automatic) |
Use saga when:
- Operations span multiple services with separate databases
- You cannot or should not use 2PC (avoiding it is usually the right call)
- Business transactions map naturally to a sequence of steps
- You can define compensating transactions for each step
- Eventual consistency is acceptable for your domain
- You need availability over strong isolation
Avoid saga when:
- Steps have tight interdependencies that require strict isolation
- Rollback must be immediate and guaranteed
- Compensation is expensive or impossible (sending an email, charging a card with long refund times)
- Your domain requires all-or-nothing atomicity that saga cannot provide
- The inconsistency window is unacceptable for your use case
Production Considerations
Saga Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Step fails and compensation also fails | System left in inconsistent state | Design idempotent compensations; implement retry with exponential backoff; alert on repeated compensation failures |
| Service crashes mid-step | Step may or may not have completed; unknown state | Use idempotency keys; implement saga state tracking; use durable workflow engines |
| Concurrent sagas interfere | One saga’s uncommitted data affects another saga’s read | Implement optimistic concurrency control; use application-level locks for critical resources |
| Compensation runs on already-succeeded step | Double compensation causes incorrect state | Track completed steps explicitly; prevent compensation on committed steps |
| Saga state lost (orchestrator crash) | Cannot determine what steps completed | Persist saga state to durable storage; use workflow engines that handle this |
| Infinite retry loop | System stuck in repeated failed attempts | Implement max retry count; move to dead letter state after threshold; alert |
Diagrams & State Machines
Failure Flow Diagram
graph TD
Start[Start Saga] --> Step1[Execute Step 1]
Step1 --> Step1OK{Step 1 OK?}
Step1OK -->|No| Fail1[Return Error<br/>Nothing to compensate]
Step1OK -->|Yes| Step2[Execute Step 2]
Step2 --> Step2OK{Step 2 OK?}
Step2OK -->|No| Comp1[Compensate Step 1]
Comp1 --> Fail2[Return Error]
Step2OK -->|Yes| Step3[Execute Step 3]
Step3 --> Step3OK{Step 3 OK?}
Step3OK -->|No| Comp2A[Compensate Step 2]
Comp2A --> Comp2B[Compensate Step 1]
Comp2B --> Fail3[Return Error]
Step3OK -->|Yes| Success[Saga Complete]
Saga Execution State Machine
stateDiagram-v2
[*] --> Pending: Saga created
Pending --> Executing: First step started
Executing --> Executing: Step N completed
Executing --> Compensating: Step failed
Compensating --> Compensating: Compensation in progress
Compensating --> Completed: All compensations done
Executing --> Completed: Final step succeeded
Compensating --> Failed: Compensation failed after retries
Pending --> Failed: Immediate failure (validation error)
Completed --> [*]
Failed --> [*]
State Definitions:
| State | Description |
|---|---|
| Pending | Saga created but not yet started. Initial state before first step execution. |
| Executing | One or more steps have completed successfully. Saga is processing subsequent steps. |
| Compensating | A step failed and compensation is running in reverse order to undo completed steps. |
| Completed | All steps succeeded or all compensations succeeded. Terminal success state. |
| Failed | Saga reached an unrecoverable state (compensation failed or max retries exceeded). |
State Transition Rules:
- Once in Executing, cannot return to Pending
- Compensating can only be entered from Executing
- Completed and Failed are terminal states
- From Compensating, success leads to Completed, persistent failure leads to Failed
- Only Pending and Failed can be safely retried at the saga level
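The transitions in the diagram can be enforced in code with an explicit table of allowed moves. This is a sketch under the assumption that saga status is a simple enum; `SagaStatus`, `ALLOWED`, and `transition` are hypothetical names.

```python
from enum import Enum


class SagaStatus(Enum):
    PENDING = "pending"
    EXECUTING = "executing"
    COMPENSATING = "compensating"
    COMPLETED = "completed"
    FAILED = "failed"


# Mirrors the arrows of the state diagram; terminal states allow nothing.
ALLOWED = {
    SagaStatus.PENDING: {SagaStatus.EXECUTING, SagaStatus.FAILED},
    SagaStatus.EXECUTING: {
        SagaStatus.EXECUTING,
        SagaStatus.COMPENSATING,
        SagaStatus.COMPLETED,
    },
    SagaStatus.COMPENSATING: {
        SagaStatus.COMPENSATING,
        SagaStatus.COMPLETED,
        SagaStatus.FAILED,
    },
    SagaStatus.COMPLETED: set(),
    SagaStatus.FAILED: set(),
}


class IllegalTransition(Exception):
    """Raised when a requested move is not in the transition table."""


def transition(current: SagaStatus, target: SagaStatus) -> SagaStatus:
    if target not in ALLOWED[current]:
        raise IllegalTransition(f"{current.value} -> {target.value}")
    return target
```

Routing every status change through one function like this makes rules such as "Compensating can only be entered from Executing" a checked invariant rather than a convention.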
Choreography vs Orchestration: Trade-off Analysis
| Factor | Choreography | Orchestration | Recommendation |
|---|---|---|---|
| Coupling | Low — services own their own logic | High — orchestrator knows all steps | Choreography for independent teams |
| Complexity | Distributed — each service knows its trigger | Centralised — orchestrator is complex | Orchestration for complex flows |
| Visibility | Poor — no central view of saga state | High — orchestrator tracks all state | Orchestration for debugging |
| Scalability | High — no central bottleneck | Limited by orchestrator capacity | Choreography at scale |
| Failure Handling | Each service handles its own failures | Centralised retry and compensation | Orchestration for guaranteed ordering |
Production Failure Scenarios
Scenario 1: Payment Service Unavailable During Saga
What happened: A saga was executing an order placement across three services: inventory, payment, and shipping. The payment service became unavailable for 45 seconds mid-saga after the inventory reservation succeeded.
Root cause: The payment service’s database connection pool was exhausted due to a misconfigured max-connections setting. Requests queued up behind the pool limit and timed out.
Impact: 127 sagas entered the compensating state simultaneously. The compensating transaction to release inventory ran successfully, but the refund for charges already processed was delayed. 23 customers reported duplicate charges before idempotency checks caught them.
Lesson learned: Always implement idempotency keys at every saga step. The payment step used an idempotency key derived from the saga ID and step number, which prevented double-charging on retry — but the compensation path did not use an idempotency key, causing the delayed refund to appear as a duplicate in monitoring.
Scenario 2: Circular Dependency in Choreography
What happened: Two sagas in an e-commerce platform developed a circular dependency. Saga A needed to update product pricing and Saga B needed to update product availability. Both sagas published events that triggered each other’s next steps, creating an infinite loop that saturated the message broker.
Root cause: The choreography design did not account for the relationship between pricing updates and availability updates. A pricing change can affect which warehouse fulfils an order, which changes availability, which can affect pricing through promotional rules.
Impact: Message broker queue depth grew to 2.3 million messages in 3 minutes before the circuit breaker triggered. The system recovered after restarting both sagas and replaying from the last stable checkpoint.
Lesson learned: Model all cross-saga dependencies explicitly before deploying choreography. Implement circuit breakers per event type, not per saga. Always set message TTL and dead-letter queues for choreography events.
Scenario 3: Orchestrator State Loss
What happened: A Kubernetes pod running the saga orchestrator was terminated during a 12-step order fulfilment saga. The orchestrator had been persisting state every 10 steps, and the saga was at step 7. The replacement pod could not determine whether step 6 had completed.
Root cause: The orchestrator persisted saga state asynchronously to reduce latency, using a write-ahead log with batch commits. The batch had not been committed when the pod was terminated, losing visibility into steps 1–6.
Impact: The replacement pod retried from step 1, causing double reservation of inventory and double payment capture. The idempotency layer caught the double payment but not the double inventory reservation, leading to inventory inconsistency that took 4 hours to reconcile manually.
Lesson learned: Persist saga state synchronously at every step boundary, not in batches. Use the outbox pattern to guarantee that step completion and state persistence are atomic. Consider checkpointing after every step, not every N steps.
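The following sketch shows the idea with SQLite from the Python standard library: the step record and its outgoing event are written in one local transaction, so a crash leaves either both or neither. The table names and the helpers `open_store`, `complete_step`, and `last_completed_step` are assumptions for illustration; a separate relay process (not shown) would publish rows from the outbox table.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS saga_steps ("
        "saga_id TEXT, step INTEGER, status TEXT, PRIMARY KEY (saga_id, step))"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, saga_id TEXT, payload TEXT)"
    )
    conn.commit()
    return conn


def complete_step(conn: sqlite3.Connection, saga_id: str, step: int, event: dict) -> None:
    """Record step completion and its outgoing event in one local transaction."""
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute(
            "INSERT INTO saga_steps (saga_id, step, status) VALUES (?, ?, 'completed')",
            (saga_id, step),
        )
        conn.execute(
            "INSERT INTO outbox (saga_id, payload) VALUES (?, ?)",
            (saga_id, json.dumps(event)),
        )


def last_completed_step(conn: sqlite3.Connection, saga_id: str) -> int:
    """On recovery, resume after this step; 0 means nothing has completed yet."""
    row = conn.execute(
        "SELECT COALESCE(MAX(step), 0) FROM saga_steps WHERE saga_id = ?", (saga_id,)
    ).fetchone()
    return row[0]
```

A replacement pod would call `last_completed_step` before doing any work. The primary key on `(saga_id, step)` also rejects a second completion of the same step.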
Scenario 4: Compensating Transaction Loop
What happened: A saga that books travel (flight, hotel, car) experienced a network partition between the hotel and flight services. The saga compensated the flight booking, then the network partition resolved, and the hotel service’s compensation event arrived late. The orchestrator interpreted the late event as a new failure and began compensating again.
Root cause: Events arrived out of order because the hotel service had a slower consumer group than the flight service. The compensation for the hotel ran after the saga had already reached the Completed state.
Impact: Three customers had their flight rebooked unnecessarily when the second compensation attempt ran after the original compensation had already released the inventory. One customer was moved from their booked flight to a later flight with a 6-hour delay.
Lesson learned: Use version numbers or vector clocks on saga events to detect and discard late events. Implement idempotency in compensation handlers using the original saga ID. Never compensate a saga that has already reached a terminal state.
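A minimal sketch of the version check, assuming each event carries a monotonically increasing version per saga. `SagaRecord` and `accept_event` are hypothetical names, and vector clocks would need a richer comparison than the integer check used here.

```python
from dataclasses import dataclass, field

TERMINAL_STATES = {"Completed", "Failed"}


@dataclass
class SagaRecord:
    saga_id: str
    state: str = "Executing"
    last_version: int = 0
    applied: list = field(default_factory=list)


def accept_event(saga: SagaRecord, event_version: int, event_type: str) -> bool:
    """Apply an event only if the saga is still live and the event is newer."""
    if saga.state in TERMINAL_STATES:
        return False  # never act on a saga that has already finished
    if event_version <= saga.last_version:
        return False  # stale or duplicate event, discard
    saga.last_version = event_version
    saga.applied.append(event_type)
    return True
```

With this guard in place, a late compensation event for a finished saga is dropped instead of being read as a new failure.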
Common Pitfalls / Anti-Patterns
Treating a saga as an ACID transaction: A saga does not provide isolation. Concurrent sagas can see each other’s partial results. If saga A reserves inventory and saga B reads inventory before A completes, B may make decisions based on a reservation that A later compensates away. Handle this at the application level.
Non-idempotent steps or compensations: If a step or compensation runs twice due to retries, the effect should be the same as running once. Reserving inventory twice should not double-reserve. Always check before acting.
Compensation order errors: Compensations must run in reverse order of execution. If step 2’s compensation runs before step 1’s, you may leave the system in a worse state. Explicitly track execution order.
Ignoring the inconsistency window: During a saga execution, the system is in an inconsistent state. Other operations may read partial results. Design your application to handle this (show pending states, use optimistic UI).
Long-running compensations: Compensations can take time (refunds, cancellations). While compensating, the system is still inconsistent. Minimize compensation time and alert if compensations are taking too long.
Not planning for compensation failure: What happens if compensation fails? The saga is stuck. Implement retry with backoff, then move to a dead letter state that requires manual intervention.
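Two of the pitfalls above, compensation order and compensation failure, can be sketched together. The function below is an illustration under assumptions: `compensate` is a hypothetical helper, the retry limits are arbitrary, and the returned list stands in for a real dead letter store. It keeps compensating the remaining steps after one is parked, which is one possible design choice rather than the only one.

```python
import time
from typing import Callable, List, Tuple

Compensation = Tuple[str, Callable[[], None]]


def compensate(
    completed: List[Compensation],
    max_retries: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run compensations in reverse order; return names that need manual intervention."""
    dead_letter: List[str] = []
    for name, undo in reversed(completed):
        for attempt in range(max_retries):
            try:
                undo()
                break
            except Exception:
                if attempt == max_retries - 1:
                    dead_letter.append(name)  # give up, park for an operator
                else:
                    sleep(base_delay * 2 ** attempt)  # exponential backoff
    return dead_letter
```

Passing `sleep` as a parameter keeps the backoff testable without real delays. A non-empty result should raise an alert.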
Interview Questions
Expected answer points:
- Saga is a distributed transaction pattern that chains local transactions with compensating actions
- 2PC blocks resources and has a single point of failure (coordinator crash leaves participants waiting)
- Saga was developed because ACID transactions do not scale across service boundaries — each service owns its database and cannot participate in a shared transaction
- Saga trades ACID isolation for availability and scalability by using eventual consistency
Expected answer points:
- Choreographed saga: each service reacts to events from the previous service, behavior is fully distributed, no central coordinator
- Orchestrated saga: a central orchestrator manages the workflow, knows the entire sequence, decides next steps and triggers compensations
- Orchestration is easier to debug and reason about; choreography scales better but is harder to trace
- Choreography couples services through event contracts; orchestration couples services to the orchestrator
Expected answer points:
- A compensating transaction is an action that undoes a previously completed local transaction
- Examples: refund payment, release inventory reservation, cancel shipment
- When a step fails, the saga runs compensations for all completed steps in reverse order
- Compensations must be deterministic and idempotent — running twice should have the same effect as running once
- Not all operations can be compensated (sending an email, physical goods already shipped)
Expected answer points:
- Sagas run across unreliable networks — messages can be delivered twice, services can crash mid-operation
- If a step retry delivers the same reservation request twice, you should not double-reserve inventory
- Compensation runs must also be idempotent — refunding twice should not over-refund
- Implementation: check before acting using an idempotency key (e.g., charge_id, reservation_id)
- Without idempotency, retries cause data corruption and inconsistent state
Expected answer points:
- Pending: saga created but not started
- Executing: one or more steps have completed successfully
- Compensating: a step failed and compensation is running in reverse order
- Completed: all steps succeeded or all compensations succeeded — terminal success state
- Failed: unrecoverable state (compensation failed after retries, validation error, max retries exceeded)
- Only Pending and Failed states are safely retryable at the saga level
Expected answer points:
- Optimistic locking: read the resource version before modifying, on update check version has not changed, if changed abort and retry
- Pessimistic locking: acquire a lock before accessing a resource, blocks other sagas until lock releases
- Optimistic locking provides higher throughput since locks are not held; pessimistic locking reduces concurrency
- Use optimistic locking for most cases; prefer pessimistic locking for financial transactions and inventory with hard limits
- In nested sagas, parent-level locking protects shared resources while nested sagas handle sub-workflows
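The optimistic variant described above can be sketched as a version check on update. `InventoryStore`, `VersionConflict`, and `reserve` are hypothetical names, and the in-memory dictionary stands in for a database row with a version column.

```python
class VersionConflict(Exception):
    pass


class InventoryStore:
    """Hypothetical in-memory store; each item carries a version counter."""

    def __init__(self) -> None:
        self._items = {}  # sku -> (quantity, version)

    def put(self, sku: str, quantity: int) -> None:
        self._items[sku] = (quantity, 0)

    def read(self, sku: str):
        return self._items[sku]

    def update(self, sku: str, quantity: int, expected_version: int) -> None:
        _, version = self._items[sku]
        if version != expected_version:
            raise VersionConflict(sku)  # someone else wrote since we read
        self._items[sku] = (quantity, version + 1)


def reserve(store: InventoryStore, sku: str, amount: int, max_attempts: int = 3) -> bool:
    """Read, check, and write back; retry from the read if the version moved."""
    for _ in range(max_attempts):
        quantity, version = store.read(sku)
        if quantity < amount:
            return False
        try:
            store.update(sku, quantity - amount, expected_version=version)
            return True
        except VersionConflict:
            continue  # another saga won the race; re-read and retry
    return False
```

No lock is held between the read and the write. The cost is a retry whenever two sagas touch the same item at the same time.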
Expected answer points:
- 2PC provides true atomicity and isolation; Saga provides eventual consistency with no isolation
- 2PC blocks on coordinator failure; Saga participants hold no locks while waiting on a coordinator, so they are never blocked by one
- Saga has higher latency (sequential steps) vs 2PC (parallel phases), but 2PC locks are held longer
- 2PC has lower availability due to coordinator blocking; Saga has higher availability
- Saga compensation is application-defined and can fail; 2PC rollback is automatic
- Saga works across service boundaries with separate databases; 2PC requires all participants to support it
Expected answer points:
- A nested saga is where a step's "execute" is itself a saga with its own compensation logic
- Example: a Payment step runs authorize → capture as a nested saga; if it fails, parent saga treats it as one failed step
- Benefits: reusability (ProcessPayment nested saga reused across order, subscription, refund workflows), readability (top-level reads like business workflow), scoped failures
- Introduces concurrency at parent level — requires optimistic or pessimistic locking at parent level
- Nested sagas help decompose large monolithic workflows into manageable pieces
Expected answer points:
- Design idempotent compensations — if compensation runs twice the second run should have no effect
- Implement retry with exponential backoff for transient failures
- After max retries exceeded, move the saga to a dead letter state that requires manual intervention
- Alert on repeated compensation failures — this indicates systemic issues
- Design compensations to be as reliable as forward actions; avoid compensations that can themselves fail permanently (e.g., physical goods already delivered)
Expected answer points:
- Metrics: saga completion rate (success vs failure vs compensating), execution duration by type, compensation count and success rate, step failure rate by type, concurrent saga count
- Logs: saga start with correlation ID, each step start/completion/compensation with step index, compensating transaction ID, saga outcome, all relevant IDs (saga, correlation, step, compensation)
- Alerts: saga taking longer than threshold, compensation repeatedly failing, failure rate exceeding baseline, max retry count reached, stuck sagas
- Distributed tracing (OpenTelemetry/Zipkin/Jaeger): parent span for overall saga, child spans per step and compensation, link compensation spans to original step spans
Expected answer points:
- Sagas span multiple services — without tracing, debugging means grepping logs across all services and piecing together what happened
- Distributed tracing (OpenTelemetry, Zipkin, Jaeger) gives each saga a single trace ID that propagates through all service calls
- Trace structure: parent span for the overall saga, child spans for each step and compensation, with links pairing compensation spans to original step spans
- Trace context propagation: pass trace context through saga steps using baggage or span links so the full execution path is visible
- When a step fails and compensation runs, link the compensation span to the original step span so you can see the pairing visually in the trace viewer
Expected answer points:
- If the saga orchestrator crashes mid-execution, without persisted state you cannot determine which steps completed and which need compensation
- Persist saga state to durable storage (database) after each step: saga ID, step name, status (started/completed/compensated), result data
- The saga state machine tracks Pending, Executing, Compensating, Completed, and Failed states
- On recovery, read persisted state and determine whether to resume, complete compensation, or treat as failed
- Workflow engines like Temporal handle this automatically with durable workflow execution — if a worker crashes, the workflow resumes from where it left off
Expected answer points:
- TCC uses three phases per operation: Try (reserve resources), Confirm (commit the reservation), Cancel (release the reservation). TCC requires infrastructure support from all participants
- Saga uses two operations per step: the forward action and a separate compensating action. Simpler to implement but compensation can fail
- 2PC is atomic across all participants but blocks on coordinator failure; saga is eventually consistent but never blocks waiting for a coordinator
- TCC provides some isolation at the Try phase; saga provides no isolation between steps
- Saga is easier to retrofit onto existing services since each service only needs to implement its own compensation; TCC requires participants to implement Try/Confirm/Cancel interfaces
Expected answer points:
- Network partitions cause timeouts and indeterminate state — you do not know whether the remote service received your request and processed it
- Apply idempotency logic so duplicate messages after partition recovery do not cause duplicate operations
- The saga must distinguish between "step is still running" and "step completed but network failed before response arrived"
- Timeout-based compensation: if a step does not respond within a threshold, trigger compensation for completed steps
- Use idempotency keys and state checks on recovery to determine whether to retry or continue as-is
Expected answer points:
- Unit testing compensations: test each compensation in isolation, verify idempotency (running twice has no additional effect)
- Integration testing failure scenarios: simulate step failures at each point and verify compensations run in correct reverse order
- Compensation order testing: explicitly verify that compensations run in reverse execution order — this is the most common saga bug
- Chaos testing in production: kill services mid-saga, introduce latency, fill resources, split networks — verify saga handles each gracefully
- Distributed tracing verification: confirm trace IDs propagate correctly through all steps and compensations
Expected answer points:
- Steps should represent business activities, not technical operations — each step should be meaningful in the domain
- Too fine-grained: too many steps means many compensations to manage and track; high orchestration overhead
- Too coarse-grained: large steps mean expensive compensations; if one step fails you may have to undo a lot of work
- Use nested sagas to decompose coarse steps into sub-workflows while keeping the top-level saga readable
- Compensation cost should be proportional to forward action cost — avoid steps where compensation is significantly more expensive than the forward action
Expected answer points:
- Saga provides eventual consistency: after the saga completes (success or compensation), the system reaches a consistent state
- During saga execution, the system is in a temporarily inconsistent state — other operations can observe partial results
- Application must handle intermediate states: show users pending states, use optimistic UI, design for the inconsistency window
- Do not assume that because step 1 succeeded, step 2 will also succeed — handle failure at each step
- Eventual consistency is acceptable for most business workflows (order fulfillment, booking, subscriptions) where strict isolation is not required
Expected answer points:
- A correlation ID is a unique identifier assigned to a saga instance at creation and propagated through all subsequent step calls and events
- Include correlation ID in all logs so you can filter all services by the same ID to reconstruct the full saga execution
- Propagate correlation ID through HTTP headers, message queue properties, or gRPC metadata depending on your transport
- Do not expose internal IDs (database primary keys) directly — use the correlation ID as the public-facing identifier in logs and traces
- Distributed tracing tools use trace IDs as correlation IDs when properly configured
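As a small sketch of propagation within one service, the example below stores the correlation ID in a `contextvars.ContextVar`, stamps it on log records through a `logging.Filter`, and copies it into outgoing headers. The header name `X-Correlation-ID` is a common convention assumed here, not a standard, and `start_saga` and `outgoing_headers` are hypothetical helpers.

```python
import contextvars
import logging
import uuid

# Holds the ID for the current execution context; "-" means no saga is active.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def start_saga() -> str:
    """Assign a fresh correlation ID when a saga instance is created."""
    cid = str(uuid.uuid4())
    correlation_id.set(cid)
    return cid


def outgoing_headers() -> dict:
    """Headers to add to downstream HTTP calls made by a saga step."""
    return {"X-Correlation-ID": correlation_id.get()}
```

A handler that has this filter attached can then reference `%(correlation_id)s` in its format string.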
Expected answer points:
- Choreography works well for simple workflows with 2-4 services where the logic is straightforward and teams are independent
- Orchestration is better for complex workflows with many steps, multiple teams, or workflows that benefit from centralized error handling
- Choreography creates loose coupling through event contracts but makes it harder to see the overall workflow state
- Orchestration centralizes logic in one place making it easier to debug and modify but creates a coupling point to the orchestrator
- Consider the team structure: if one team owns the workflow, orchestration gives them clear ownership; if multiple teams share responsibility, choreography may reduce coordination overhead
Expected answer points:
- Authenticate and authorize saga trigger endpoints — do not allow arbitrary saga instantiation without access control
- Validate all saga input parameters to prevent injection attacks through saga payloads
- Do not log sensitive data (payment info, passwords, PII) in saga context — logs may be stored in centralized logging systems
- Encrypt saga state at rest if using external storage for persistence — saga state may contain business-sensitive information
- Audit log all saga state changes (start, step completion, compensation, completion, failure) with correlation IDs for forensic analysis
Further Reading
For distributed transaction fundamentals, see Distributed Transactions. For two-phase commit, see Two-Phase Commit.
For workflow patterns, see Service Orchestration and Service Choreography.
Quick Recap Checklist
The saga pattern manages distributed transactions without 2PC. Break the transaction into steps, and define a compensation for each step. If a step fails, run the compensations in reverse order. Saga provides eventual consistency, not ACID isolation. Two implementations exist: choreography distributes behavior across services, while orchestration centralizes coordination in an orchestrator. Both work; the choice depends on workflow complexity and team structure. Idempotency is essential for safe retries and preventing duplicate operations.
Reference Checklists
Production Checklist
- Saga breaks distributed transactions into steps with compensating transactions
- If a step fails, previous steps are compensated in reverse order
- Saga provides eventual consistency, not ACID isolation
- Two implementations: choreography (distributed) and orchestration (centralized)
- Idempotency is essential for safe retries
Saga Production Checklist
- Each saga step has a defined compensation transaction
- Compensation transactions are tested with injected failures
- Idempotency implemented for all saga steps
- Saga state persisted for crash recovery
- Timeout values tuned for each step (long enough to complete, short enough to recover)
- Concurrent saga execution handled correctly
- Monitoring and alerting for saga failures configured
Conclusion
The saga pattern manages distributed transactions without 2PC. Break the transaction into steps, and define a compensation for each step. If a step fails, run the compensations in reverse order.
Choreographed sagas distribute behavior across services. Orchestrated sagas centralize coordination in an orchestrator. Both work; the choice depends on workflow complexity and team structure.
Sagas trade ACID isolation for availability. The application must handle partial states and concurrent access. This complexity is inherent to distributed transactions; saga just makes it explicit rather than pretending it does not exist.
sequenceDiagram
participant Saga
participant Inv as Inventory
participant Pay as Payment
Saga->>+Inv: Step 1: Reserve
Inv->>-Saga: OK
Saga->>+Pay: Step 2: Charge
Pay->>-Saga: Fail
Saga->>+Inv: Compensate
Inv->>-Saga: Released
- Idempotent step and compensation handlers implemented
- Saga state persisted to durable storage
- Compensation logic tested for each step
- Maximum retry count configured per step
- Monitoring for saga execution duration
- Alerting for compensation failures
- Concurrent saga handling planned (optimistic locking or locks)
- User-facing pending state handling designed
- Dead letter handling for unrecoverable saga states
- Correlation IDs in all saga logs
Observability Checklist
Metrics
- Saga completion rate (success vs failure vs compensating)
- Saga execution duration by type
- Compensation execution count and success rate
- Step failure rate by step type
- Concurrent saga count
- Average number of steps per saga
Logs
- Log saga start with correlation ID and input parameters
- Log each step start, completion, and compensation with step index
- Include compensating transaction ID in compensation logs
- Log saga outcome (success, failure, compensating)
- Include all relevant IDs: saga ID, correlation ID, step IDs, compensation IDs
Alerts
- Alert when saga takes longer than expected threshold
- Alert when compensation repeatedly fails
- Alert when saga failure rate exceeds normal baseline
- Alert when max retry count reached on a step
- Alert on stuck sagas (no progress for extended period)
Security Checklist
- Authenticate and authorize saga trigger endpoints
- Validate saga input parameters to prevent injection
- Audit log all saga state changes (start, step completion, compensation)
- Do not log sensitive data (payment info, passwords) in saga context
- Encrypt saga state at rest if using external storage
- Use correlation IDs for tracing without exposing internal IDs