Asynchronous Communication in Microservices: Events and Patterns
Deep dive into asynchronous communication patterns for microservices including event-driven architecture, message queues, and choreography vs orchestration.
Asynchronous Communication: Events, Messages, and Event-Driven Patterns
In synchronous systems, services call each other and wait. Service A calls Service B and blocks until B responds. If B is slow, A is slow. If B is down, A fails. This works fine until it does not.
Asynchronous communication breaks this coupling. Service A sends a message and continues. Service B picks it up when ready. The two services never wait for each other.
This post covers events vs commands vs queries, message brokers and when to use each, choreography vs orchestration, and the practical problems you will hit in production.
Introduction
Synchronous communication means calling a service and waiting for a response before continuing. Service A calls Service B, blocks until B responds, then proceeds. This is simple to understand and debug, but it creates tight coupling. If Service B is slow, Service A is slow. If Service B is down, Service A fails.
Asynchronous communication breaks this coupling. Service A sends a message and moves on. Service B receives the message when it is ready and processes it on its own timeline. The two services do not wait for each other.
graph LR
A[Service A] -->|async message| B[(Message Broker)]
B -->|deliver when ready| C[Service B]
A -->|sync call| D[Service D]
D -->|immediate response| A
The diagram shows the difference. Service A sends a message to a broker and continues working. Service B picks up the message later. Meanwhile, Service A makes a synchronous call to Service D and waits for the response.
Why Asynchronous Communication Matters
Microservices fail. Networks partition. Disks fill. When services communicate asynchronously, they do not share failure modes. If the payment service is down, the order service can still accept orders. The orders queue up in a broker and get processed when the payment service recovers. The order service does not crash and users do not see errors.
Independent scaling is another benefit. The checkout service might handle 100 requests per second. The inventory service can only handle 50. A queue between them absorbs the difference. You scale consumers independently without redesigning the producers.
Latency improves too. Service A does not wait for Service B to finish work. It sends a message and immediately moves to the next task.
Core Concepts
These three words get mixed up constantly, so let us be clear.
Commands are directed requests: “do this thing.” They expect exactly one handler. When you send ReserveInventory, you expect the inventory service to act on it. Commands imply intent.
Events are facts: “this thing happened.” They are broadcast. When InventoryReserved is emitted, notification, analytics, and fulfillment services can all respond. Events do not imply that anyone is listening.
Queries are requests for data. In synchronous systems, queries return data immediately. In asynchronous systems, you might send a query message and wait for a response, or use a separate query service that maintains a read model. This leads to CQRS patterns where read and write models are completely separated.
graph LR
subgraph Commands
CMD[ReserveInventory] --> IS[Inventory Service]
end
subgraph Events
EV[InventoryReserved] --> NS[Notification Service]
EV --> AS[Analytics Service]
EV --> FS[Fulfillment Service]
end
IS --> EV
Naming conventions help distinguish them. Commands use verb-noun: CreateOrder, CancelReservation, UpdateInventory. Events use noun-verb past tense: OrderCreated, ReservationCancelled, InventoryUpdated. Queries are typically questions: GetOrderStatus, ListAvailableItems.
Message Queues and Brokers
Messages have to go somewhere. Message brokers store and forward messages between services.
RabbitMQ
RabbitMQ implements the AMQP protocol with flexible routing through exchanges and queues. Producers publish to exchanges, which route to queues based on binding rules. Consumers receive from queues.
RabbitMQ supports multiple exchange types:
- Direct: Routes to queue matching the routing key exactly
- Fanout: Routes to all bound queues
- Topic: Routes to queues matching wildcard patterns
- Headers: Routes based on message header values
graph LR
P[Publisher] -->|publish| X[Exchange]
X -->|direct| Q1[Queue 1]
X -->|fanout| Q2[Queue 2]
X -->|topic| Q3[Queue 3]
RabbitMQ is a solid general-purpose broker. It is mature, well-documented, and runs in many production environments. The trade-off is that it is not designed for extremely high throughput or infinite retention.
Apache Kafka
Kafka is a distributed log rather than a traditional message queue. Messages are appended to partitions and retained for a configurable period (or indefinitely). Consumers track their position in the log rather than consuming and removing messages.
This design gives you:
- Replay: Consumers can re-read historical messages to rebuild state
- Multiple consumers: The same message can be consumed by different consumer groups independently
- Infinite retention: Events can be kept forever and processed later
Kafka handles millions of messages per second across distributed partitions.
AWS SQS and SNS
AWS offers managed messaging services that remove operational burden.
SQS (Simple Queue Service) is a fully managed point-to-point queue. You create queues, send messages, and receive messages. AWS handles scaling, availability, and maintenance. SQS has two types: standard queues (at-least-once delivery, best-effort ordering) and FIFO queues (exactly-once processing, strict ordering).
SNS (Simple Notification Service) is a pub/sub service. You create topics, subscribe endpoints (SQS queues, HTTP endpoints, Lambda functions, email, SMS), and publish messages. SNS fan-out delivers copies to all subscribers.
Many architectures use both: SNS for pub/sub fan-out to multiple consumers, SQS for durable point-to-point processing with load leveling.
Publish-Subscribe Patterns
Pub/sub is a messaging pattern where producers publish messages to topics rather than sending directly to specific consumers. Subscribers receive messages from topics they are interested in.
Fan-out is the key property: one message reaches multiple subscribers. This is fundamentally different from point-to-point queues where each message goes to exactly one consumer.
graph LR
Pub[Publisher] -->|message| Topic[Topic]
Topic -->|copy 1| Sub1[Subscriber 1]
Topic -->|copy 2| Sub2[Subscriber 2]
Topic -->|copy 3| Sub3[Subscriber 3]
Topic Design
Topics should be organized around meaningful categories. Flat topics work for simple systems:
user.created
order.placed
payment.processed
Hierarchical topics enable broader subscriptions:
users/
users.created
users.updated
users.deleted
orders/
orders.placed
orders.updated
orders.cancelled
Subscribing to orders/ captures all order events. Subscribing to orders.placed captures only placement events.
Subscription Types
Durable subscriptions persist when subscribers go offline. When a subscriber reconnects, it receives messages that arrived during the offline period. This matters for services that restart or clients that disconnect.
Shared subscriptions distribute messages across multiple instances of a service. If you run three notification service instances, shared subscription means each instance gets approximately one-third of the messages. This enables horizontal scaling.
Choreography vs Orchestration
When a business operation spans multiple services, someone has to coordinate the steps. Two approaches: choreography and orchestration.
Choreography
In choreography, services emit events and react to each other’s events. No central coordinator exists. Each service knows only its own trigger and reaction.
graph LR
Order[Order Service] -->|OrderPlaced| Inv[Inventory Service]
Inv -->|InventoryReserved| Pay[Payment Service]
Pay -->|PaymentCharged| Ship[Shipping Service]
Ship -->|ShipmentCreated| Notify[Notification Service]
The order service does not know what happens after placing an order. It emits OrderPlaced and moves on. The inventory service reacts by reserving inventory and emitting InventoryReserved. Payment reacts, then shipping, then notification.
For a deeper look at choreography patterns, see Service Choreography.
Orchestration
In orchestration, a central process (the orchestrator) coordinates the entire workflow. The orchestrator knows the complete sequence, decides what to do at each step, and handles failures.
graph LR
Orch[Order Orchestrator] -->|Reserve| Inv[Inventory Service]
Orch -->|Charge| Pay[Payment Service]
Orch -->|Schedule| Ship[Shipping Service]
Inv -->|Reserved| Orch
Pay -->|Charged| Orch
Ship -->|Scheduled| Orch
The orchestrator sends commands to each service and receives responses. Based on those responses, it decides the next step. If something fails, it triggers compensating transactions to undo previous steps.
For more on orchestration, see Service Orchestration.
Which to Choose
Choreography works well when services are truly independent, workflows are linear, and you want to avoid central points of failure. It is simpler at first but behavior becomes scattered as workflows grow complex.
Orchestration works better when workflows have branching logic, compensation is complex, and you need visibility into the complete transaction state. The orchestrator becomes critical infrastructure but gives you control.
Many production systems use both. Core business workflows with complex compensation run through orchestrators. Peripheral side effects (notifications, analytics, logging) happen through choreography.
Idempotency Considerations
At-least-once delivery is the norm in asynchronous systems. Messages may be delivered more than once due to retries, network partitions, or consumer crashes. Services must handle duplicate messages safely.
Idempotency means processing a message multiple times produces the same result as processing it once.
def handle_order_placed(event):
# Check if already processed
if order_processed(event.order_id):
return
# Process the order
process_order(event.order_id)
# Mark as processed
mark_order_processed(event.order_id)
This pattern uses a deduplication table keyed on message ID. Before processing, check if the ID exists. After processing, insert the ID. The database enforces uniqueness.
Idempotency Keys
Every message should carry a unique identifier. Producers generate the ID. Consumers check against it.
{
"message_id": "msg-uuid-12345",
"type": "OrderPlaced",
"order_id": "ord-67890",
"timestamp": "2026-03-24T10:30:00Z"
}
Store processed IDs in a database, Redis set, or any persistent store with reasonable performance for lookup.
Idempotent Operations
Some operations are naturally idempotent. Updating a record to a specific value is idempotent: setting status = 'shipped' twice produces the same result as setting it once. Creating a record with a deterministic ID is idempotent: inserting order-123 twice typically fails on the second attempt if the ID is unique-constrained.
Operations that transfer resources (charging a card, deducting inventory) are not naturally idempotent. Deducting 10 units twice causes incorrect state. These require explicit idempotency handling.
Handling Eventual Consistency
Synchronous systems provide strong consistency: after a write, all subsequent reads see that write. Asynchronous systems provide eventual consistency: after a write, reads will eventually reflect that change, but the delay is unknown.
This has real implications for user experience and system design.
User Experience
Users might see stale data. They place an order and immediately check their order list. The order might not appear yet because the notification service has not processed the OrderPlaced event and updated the view.
Solutions include optimistic UI (show the result immediately and reconcile later), polling (refresh the UI after a delay), or WebSockets (push updates to the client when events are processed).
Read Models and Projections
In event-driven systems, the current state is often derived from events. The event log is the source of truth. Read models are projections built from events.
graph LR
Events[Event Log] -->|project| RM1[Read Model: User Orders]
Events -->|project| RM2[Read Model: Order Analytics]
Events -->|project| RM3[Read Model: Inventory Status]
If a read model is wrong, you rebuild it from the event log. This is useful for fixing bugs in projections without changing underlying data.
The trade-off is that read models lag behind writes. The lag might be milliseconds or seconds during high load. Design UIs and expectations around this lag.
Compensation and Sagas
When a multi-step transaction fails partway through, you must undo the completed steps. This is compensation. The saga pattern manages this.
For example, if payment fails after inventory is reserved:
- Reserve inventory (succeeds)
- Charge payment (fails)
- Compensate: release inventory reservation
Each step has a corresponding compensation action. If a later step fails, compensation actions run in reverse order.
For details on saga patterns, see Saga Pattern. For event-driven fundamentals, see Event-Driven Architecture. For message queue types, see Message Queue Types.
When to Use / When Not to Use Asynchronous Communication
Trade-off Table
| Scenario | Use Asynchronous | Use Synchronous Instead |
|---|---|---|
| Services operate at different speeds | Message queue absorbs difference | Fast service waits on slow |
| Fault isolation required | Failures do not cascade | One failure affects callers |
| Independent scaling needed | Producers and consumers scale separately | Must scale together |
| Multiple consumers need same data | Pub/sub broadcasts to all | Multiple calls to same service |
| Replay capability needed | Rebuild state from event log | No replay without additional infra |
| Long-running operations | Initiate and return immediately | User waits blocking |
| Audit trail required | Event log is immutable history | Request logs may not capture full state |
| Immediate consistency needed | Eventual consistency only | Strong consistency guaranteed |
When to Use Asynchronous Communication
Use async when:
- Services operate at different speeds and you need to absorb the difference
- You want fault isolation so failures do not cascade across services
- You need independent scaling of producers and consumers
- Multiple services need to react to the same event
- You need replay capability to rebuild state or recover from failures
- Operations are long-running and blocking the caller is impractical
- You need an immutable audit trail of what happened in the system
- You are building event-driven architecture with event sourcing
Avoid async when:
- You need immediate consistency between services
- Latency budgets are tight and every millisecond matters
- Your team lacks experience debugging distributed async systems
- The workflow is simple request-response with no real benefit from decoupling
- You need predictable latency for real-time user interactions
- Debugging simplicity is more important than loose coupling
Production Challenges
Asynchronous systems introduce operational complexity that synchronous systems avoid.
Observability is harder. Request tracing requires correlation IDs propagated through messages. You need to track messages from publication through consumption to completion. Distributed tracing tools help but require instrumentation.
Debugging is more complex. A user reports an order was not created. In a synchronous system, you trace the request. In an async system, you ask: did the order service publish the event? Did the queue deliver it? Did the payment service receive it? Multiple logs across multiple services must be correlated.
Ordering is not guaranteed. Unless your broker provides ordering guarantees (Kafka partitions, SQS FIFO), messages may arrive out of order. If ordering matters, handle it in application logic with sequence numbers or timestamps.
Backpressure is implicit. Producers might send faster than consumers can process. Without limits, queues grow unbounded and latency spikes. Configure queue depth limits and consumer prefetch.
Failure Flow Diagrams
Message Retry Flow
When a consumer fails to process a message, the broker retries with backoff.
sequenceDiagram
participant Pub as Publisher
participant Broker as Message Broker
participant Cons as Consumer
participant DLQ as Dead Letter Queue
Pub->>Broker: Publish message
Broker->>Cons: Deliver message
Cons->>Cons: Process (attempt 1)
Cons->>Broker: NACK / Failure
Broker->>Cons: Retry with backoff
Cons->>Cons: Process (attempt 2)
Cons->>Broker: NACK / Failure
Broker->>Cons: Retry with backoff
Cons->>Cons: Process (attempt 3)
Cons->>Broker: NACK / Failure
Broker->>DLQ: Route to Dead Letter Queue
The broker tracks delivery attempts. After configured retries exhausted, the message routes to a dead letter queue for manual inspection or automated handling.
Consumer Crash Recovery
When a consumer crashes mid-processing, messages are reprocessed by another consumer instance.
sequenceDiagram
participant Broker as Message Broker
participant Cons1 as Consumer 1
participant Cons2 as Consumer 2
Broker->>Cons1: Deliver message
Cons1->>Cons1: Process partially
Cons1--xBroker: Crash (acknowledged but not done)
Note over Broker: Message still in flight
Broker->>Cons2: Redeliver message
Cons2->>Cons2: Process from scratch
Cons2->>Broker: ACK
Consumer groups handle failover. If Consumer 1 crashes, Consumer 2 picks up the message. This is why idempotency is essential.
Broker Failure and Recovery
When the message broker itself fails, messages in transit may be lost.
stateDiagram-v2
[*] --> Publishing
Publishing --> Delivered: Message persisted
Delivered --> Delivered: Consumer ACK
Delivered --> Lost: Broker crashes before persist
Lost --> [*]
Publishing --> Lost: Broker crashes during write
Delivered --> Redelivered: Consumer crash detected
Redelivered --> Delivered: Redelivery succeeds
Durable brokers persist messages to disk before acknowledging. Configure producer acks and broker replication factor appropriately for your durability requirements.
Eventual Consistency Flow
Updates propagate through the system over time, not instantly.
sequenceDiagram
participant C as Client
participant SvcA as Service A
participant Broker as Event Bus
participant SvcB as Service B
participant RM as Read Model
C->>SvcA: Update request
SvcA->>Broker: Publish event
Broker->>SvcA: Persisted
SvcA-->>C: 200 OK (optimistic)
C->>SvcA: Read request
SvcA->>RM: Query read model
RM-->>SvcA: (stale) Old value
Note over RM: Few ms delay
Broker->>SvcB: Deliver event
SvcB->>RM: Update read model
RM-->>SvcB: Updated
C->>SvcA: Read request
SvcA->>RM: Query read model
RM-->>SvcA: (consistent) New value
The client receives success before the update propagates. Subsequent reads may return stale data until the event is processed and the read model is updated.
Real-world Failure Scenarios
These scenarios come from production incidents at companies running large-scale distributed message systems.
Scenario: The Poison Message
A message with malformed payload gets published to a high-throughput queue. Consumer attempts to parse it, throws an exception, NACKs the message. The broker retries with backoff. After max retries, it routes to DLQ. Meanwhile, thousands of identical retry attempts have consumed processing time and filled logs.
What went wrong: No schema validation at the producer. The message was never validated before publishing.
Prevention: Implement schema registry (Apache Avro, Protobuf) with backward/forward compatibility checks. Validate messages at the producer before publishing. Add dead letter queue monitoring alerts before the DLQ fills up.
Scenario: The Unbounded Queue
A consumer service deploys a new version with a memory leak. The old consumer unregisters slowly during the deployment. Messages back up in the queue. By the time the deployment completes, the queue has 500,000 pending messages. The new consumer takes 40 minutes to drain the backlog while processing at 10% of expected throughput due to the memory leak exacerbating GC pauses.
What went wrong: No queue depth monitoring. Consumer scaling did not account for deployment overlap. No backpressure mechanism.
Prevention: Set queue depth alerts at configurable thresholds. Implement consumer prefetch limits to control memory usage. Use circuit breakers to reject messages when downstream services are unhealthy. Test deployment behavior with injected failures.
Scenario: The Clock Skew Event
A system relies on message timestamps for ordering. Two services running in different data centers have clocks 30 seconds apart. Service A processes an event at t=0 and publishes with timestamp 00:00:00. Service B processes the same logical event 1 second later but its clock shows 00:00:31 due to skew. Downstream consumers order events by timestamp and see B’s event as happening after A’s event when the logical order is reversed.
What went wrong: Relying on wall-clock timestamps from multiple machines without clock synchronization.
Prevention: Use logical clocks (Lamport timestamps, vector clocks) for causal ordering. If using physical timestamps, ensure NTP synchronization across all services. Include sequence numbers for critical ordering requirements.
Scenario: The Schema Break
Team A deploys a new version of the inventory service that changes the InventoryReserved event schema. The new version removes a field that fulfillment service depends on. Fulfillment service has not been updated yet. When InventoryReserved events arrive with the new schema, fulfillment service silently ignores them or throws unhandled exceptions. Orders stop shipping with no error logged because the message was delivered successfully.
What went wrong: No schema versioning strategy. No consumer contract testing. Breaking changes deployed without coordination.
Prevention: Use schema registry with strict compatibility rules. Implement consumer-driven contracts (CDC) testing. Maintain backward compatibility for at least one version cycle. Blue-green deploy consumers before producers change schemas.
Scenario: The Retry Storm
A downstream notification service becomes temporarily unavailable. Message retry kicks in with exponential backoff. The notification service recovers, but the retry backoff has not caught up to real-time yet. Meanwhile, the dead letter queue handler notices the DLQ filling and automatically replays old messages. The combination creates a sudden burst of 10x normal message volume that overwhelms the recovered notification service.
What went wrong: DLQ auto-replay without rate limiting. Retry backoff not aligned with actual recovery detection.
Prevention: Implement jitter on retry delays to spread load. Rate-limit DLQ reprocessing. Use circuit breakers with half-open state to test recovery before resuming full load. Monitor retry rates per message type to detect thundering herd conditions.
Observability Hooks
Asynchronous systems require different observability approaches than synchronous systems. You cannot observe a request trace end-to-end because there is no direct request path.
Message Tracing
Every message should carry a correlation ID that spans from publication through consumption.
import json
import uuid
from dataclasses import dataclass, asdict
from typing import Optional
@dataclass
class EventEnvelope:
event_type: str
payload: dict
correlation_id: str
message_id: str
timestamp: str
version: str = "1.0"
@classmethod
def create(cls, event_type: str, payload: dict, correlation_id: Optional[str] = None):
return cls(
event_type=event_type,
payload=payload,
correlation_id=correlation_id or str(uuid.uuid4()),
message_id=str(uuid.uuid4()),
timestamp="2026-03-24T10:30:00Z"
)
def to_json(self) -> str:
return json.dumps(asdict(self))
@classmethod
def from_json(cls, data: str) -> "EventEnvelope":
return cls(**json.loads(data))
Producer Instrumentation
import structlog
from typing import Any
logger = structlog.get_logger()
class InstrumentedProducer:
def __init__(self, broker_client):
self.broker = broker_client
async def publish(self, topic: str, event: EventEnvelope):
logger.info(
"event_published",
topic=topic,
event_type=event.event_type,
message_id=event.message_id,
correlation_id=event.correlation_id
)
try:
await self.broker.publish(topic, event.to_json())
logger.info(
"event_published_success",
topic=topic,
message_id=event.message_id
)
except Exception as e:
logger.error(
"event_published_failed",
topic=topic,
message_id=event.message_id,
error=str(e)
)
raise
Consumer Instrumentation
class InstrumentedConsumer:
def __init__(self, broker_client, handlers: dict):
self.broker = broker_client
self.handlers = handlers
async def process_message(self, message: str) -> bool:
event = EventEnvelope.from_json(message)
logger.info(
"event_received",
topic=message.topic,
event_type=event.event_type,
message_id=event.message_id,
correlation_id=event.correlation_id
)
handler = self.handlers.get(event.event_type)
if not handler:
logger.warning(
"no_handler_for_event",
event_type=event.event_type,
message_id=event.message_id
)
return False
try:
await handler(event)
logger.info(
"event_processed",
event_type=event.event_type,
message_id=event.message_id
)
return True
except Exception as e:
logger.error(
"event_processing_failed",
event_type=event.event_type,
message_id=event.message_id,
error=str(e)
)
raise
Key Metrics to Track
| Metric | Purpose | Alert Threshold |
|---|---|---|
| Messages published per second | Throughput monitoring | Drop > 50% |
| Consumer lag by partition | Processing backlog | Lag growing continuously |
| Dead letter queue depth | Failed processing | > 100 messages |
| Consumer retry rate | Transient vs permanent failures | > 30% retries |
| Event processing duration | Performance baseline | p99 > SLA |
| Duplicate event rate | Upstream producer issues | Spike detection |
CQRS Deep Dive
Command Query Responsibility Segregation separates read and write operations into distinct models. In synchronous systems, the same model handles both. In CQRS, writes go through a command model and reads come from one or more projected read models.
graph LR
subgraph Commands
CMD[Command] --> CH[Command Handler]
CH --> DB[(Write DB)]
DB --> EV[Event Log]
end
subgraph Queries
RM1[Read Model 1] --> Q1[Query]
RM2[Read Model 2] --> Q2[Query]
EV --> RM1
EV --> RM2
end
The event log is the source of truth. Read models are projections built from events. This separation gives you independent scaling of reads and writes, optimized read models per query pattern, and the ability to rebuild read models from the event log.
When to use CQRS:
- Read and write workloads have different scaling needs
- Multiple query patterns require different data structures
- Team wants to evolve read and write sides independently
- Event sourcing provides the event log backing
Trade-offs:
- Added complexity with separate models
- Eventual consistency between write and read models
- Mapping between command inputs and event representations
Quick Recap
graph LR
A[Service A] -->|Event| B[(Message Broker)]
B -->|Event| C[Service B]
B -->|Event| D[Service C]
B -->|Event| E[Service D]
Key Points
- Asynchronous communication decouples services in time and space
- Events broadcast to multiple consumers; commands target one handler
- Message brokers (RabbitMQ, Kafka, SQS/SNS) handle delivery guarantees
- Idempotency is essential because at-least-once delivery is the norm
- Eventual consistency means updates propagate over time, not instantly
- Correlation IDs enable tracing messages across service boundaries
- Dead letter queues capture messages that fail after max retries
- Consumer lag monitoring prevents stale data from accumulating
When to Choose Asynchronous
- Services operate at different speeds and queues absorb the difference
- Fault isolation matters so one failure does not cascade
- Multiple consumers need to react to the same event
- You need replay capability to rebuild state from event history
- Operations are long-running and blocking is impractical
Production Checklist
# Asynchronous Communication Production Readiness
- [ ] Idempotent message handlers implemented
- [ ] Correlation IDs in all messages
- [ ] Dead letter queue configured and monitored
- [ ] Consumer lag alerting configured
- [ ] Message retry with exponential backoff
- [ ] Schema registry for event versioning
- [ ] Distributed tracing across message consumers
- [ ] Consumer group failover tested
- [ ] Broker durability settings configured (acks, replication)
- [ ] Backpressure handling via prefetch limits
Interview Questions
Expected answer points:
- Decoupling: Services do not wait for each other, reducing temporal coupling
- Independent scaling: Producers and consumers scale at different rates
- Fault isolation: Failure in one service does not cascade to others
- Latency improvement: Service can move to next task without waiting
- Backpressure handling: Queues absorb bursts when services operate at different speeds
Expected answer points:
- Message retention: Kafka retains messages indefinitely; RabbitMQ removes messages after consumption
- Replay capability: Kafka allows re-reading historical messages; RabbitMQ does not
- Consumer model: Kafka uses consumer groups tracking offset; RabbitMQ uses dedicated queues
- Ordering guarantees: Kafka maintains partition ordering; RabbitMQ ordering is exchange-dependent
- Use case fit: Kafka for event streaming and audit logs; RabbitMQ for task queues and request-response
Expected answer points:
- Deduplication table: Store processed message IDs with a unique constraint
- Idempotency keys: Each message carries a unique identifier checked before processing
- Idempotent operations: Use deterministic IDs for record creation; updates to specific values are naturally idempotent
- Database enforcement: Unique constraints prevent duplicate processing
- Non-idempotent operations: Resource transfers require explicit deduplication logic
Expected answer points:
- Advantages: Loose coupling, scalability, extensibility, multiple consumers per event
- Disadvantages: Eventual consistency, debugging complexity, ordering issues
- Challenge: Harder to trace end-to-end requests without correlation IDs
- Challenge: Duplicate event handling requires idempotency
Expected answer points:
- Definition: Sequence of local transactions where each has a compensating action for rollback
- Failure handling: If a step fails, compensating actions run in reverse order
- Example: Reserve inventory (compensate: release) → Charge payment (compensate: refund)
- Orchestration vs choreography: Central orchestrator coordinates; or services emit events and react
- Trade-off: Complexity of compensation logic vs distributed commit protocols
Expected answer points:
- Choreography: Services emit and react to events without central coordinator; behavior scattered across services
- Orchestration: Central process coordinates entire workflow; knows complete sequence
- When choreography works: Linear workflows, independent services, no central point of failure
- When orchestration works: Branching logic, complex compensation, need for visibility
- Hybrid approach: Core workflows use orchestration; peripheral side effects use choreography
Expected answer points:
- Stale data: Users may see outdated information after writes
- Solutions: Optimistic UI (show result immediately), polling (refresh after delay), WebSockets (push updates)
- Read models: Separate read models built from events may lag behind writes
- UX design: Set user expectations around propagation delay; provide loading states
Expected answer points:
- Correlation IDs: Each message carries a unique ID propagated from origin through all services
- Event envelope: Standard structure with message_id, correlation_id, event_type, timestamp
- Instrumentation: Logging at publish, deliver, and processing stages with consistent IDs
- Distributed tracing: Tools like Jaeger or Zipkin span across async message consumers
- Alerting: Track processing duration, failure rates, and consumer lag per correlation ID
Expected answer points:
- Point-to-point (queues): Exactly one consumer per message, load leveling, task distribution
- Pub/sub (topics): Multiple consumers receive copies, fan-out, event broadcasting
- Hybrid: SNS fan-out to multiple SQS queues for durable point-to-point processing
- Consider ordering: P2P preserves order within a queue; topics do not guarantee order across subscribers
Expected answer points:
- Dead Letter Queue (DLQ): Failed messages route to DLQ after max retries
- DLQ monitoring: Alert on DLQ depth to detect processing failures
- Manual inspection: DLQ messages preserved for debugging and manual replay
- Exponential backoff: Retry with increasing delays to handle transient failures
- Idempotency: Ensure reprocessing from DLQ produces same result
Expected answer points:
- Problem: Dual-write (write to DB then publish to broker) can lose events if the process crashes between the two operations
- Solution: Write events to an outbox table in the same transaction as business data, then publish from the outbox separately
- Implementation: Transactional outbox ensures atomicity; a separate process polls the outbox and publishes to the broker
- CDC approach: Use change data capture (Debezium) to publish outbox changes to the broker
- Benefit: Guarantees at-least-once delivery without distributed two-phase commit
Expected answer points:
- CQRS: Separates read and write models; writes go through commands, reads come from projected views
- Event sourcing: Stores all state changes as a sequence of immutable events instead of current state
- Combination: Event log is the source of truth; read models are projections of the event log
- Benefits: Complete audit trail, replay capability, independent read/write scaling, multiple optimized read models
- Trade-offs: Added complexity, eventual consistency, event schema evolution required
Expected answer points:
- At-least-once: Messages may be delivered more than once but never lost; requires idempotent consumers
- At-most-once: Messages may be lost but never delivered more than once; acceptable for low-value notifications
- Exactly-once: Messages delivered exactly once to the consumer; combines broker deduplication with consumer idempotency
- Implementation: Kafka enables exactly-once with transactional producers and consumers; SQS FIFO provides exactly-once with deduplication
- Trade-off: Exactly-once has higher latency and lower throughput than at-least-once
Expected answer points:
- Partition-based ordering: Kafka maintains ordering within a partition; use consistent partitioning key for related messages
- FIFO queues: SQS FIFO provides strict ordering within a single queue
- Application-level ordering: Include sequence numbers or vector clocks in messages; reorder in consumer logic
- Handling out-of-order: Use correlation IDs to group related messages; buffer and sort by timestamp if needed
- Limitation: Most pub/sub systems do not guarantee cross-topic ordering
Expected answer points:
- Message model: RabbitMQ uses logical exchanges with binding rules; Kafka uses physical logs with consumer groups
- Retention: RabbitMQ removes messages after ACK; Kafka retains messages for configurable duration
- Consumer model: RabbitMQ uses push-based delivery to consumers; Kafka uses pull-based consumption from partitions
- Replay: Kafka allows replay from any offset; RabbitMQ cannot replay consumed messages
- Scaling: Kafka scales by adding partitions; RabbitMQ scales by adding queues and consumers
Expected answer points:
- Schema registry: Use Avro, Protobuf, or JSON Schema to define message schemas centrally
- Compatibility checking: Enforce backward compatibility (consumers can read old messages) or forward compatibility (producers can write old messages)
- Versioning strategy: Include version field in message envelope; support multiple schema versions simultaneously
- Evolution rules: Adding optional fields is safe; removing fields or changing types requires new version
- Validation pipeline: Validate at producer before publishing and optionally at consumer on receipt
Expected answer points:
- Saga as orchestration: Orchestrator sends commands and handles responses; coordinates compensation on failure
- Saga as choreography: Services emit events that trigger next steps in other services; compensation triggered by failure events
- Event-driven sagas: Each saga step publishes a success/failure event; subsequent steps react to these events
- Compensation events: Failure events trigger compensating actions in reverse order of completed steps
- Trade-off: Choreography is simpler but harder to debug; orchestration provides visibility but adds central dependency
Expected answer points:
- Replication lag: Cross-datacenter replication introduces latency; monitor mirror maker or MM2 lag closely
- Ordering across DCs: Kafka only guarantees ordering within a partition; cross-DC ordering requires careful partition strategy
- Schema conflicts: Different datacenters may run different schema versions; schema registry must be globally consistent
- Network partitions: Datacenter-level network splits can cause replication stalls; need circuit breakers
- Cost of replication: Cross-DC replication bandwidth is expensive; compress messages and batch where possible
Expected answer points:
- Consumer group concept: All consumers in a group share messages; each message goes to one consumer in the group
- Partition assignment: Kafka assigns partitions to consumers in a group; scaling adds rebalancing
- Rebalancing impact: Rebalance pauses consumption and triggers retry; use sticky partitioning to minimize reshuffling
- Shared subscriptions: Multiple consumers in a group receive ~1/N of messages
- Capacity planning: Maximum parallelism equals number of partitions; more consumers than partitions leaves some idle
Expected answer points:
- Prefetch limits: Consumer acknowledges only after processing; controls how many messages are in flight
- Queue depth limits: Reject or throttle new messages when queue exceeds threshold
- Circuit breakers: Stop accepting messages when downstream failure rate exceeds threshold
- Consumer scaling: Add consumer instances to process backlog faster
- Rate limiting: Producer respects consumer capacity signals; use token bucket or leaky bucket algorithms
Further Reading
Core Concepts:
- Event-Driven Architecture - Foundational patterns for event systems
- Message Queue Types - Detailed comparison of broker implementations
- Service Choreography - Decentralized service coordination
- Service Orchestration - Centralized workflow management
- Saga Pattern - Distributed transaction management
- Pub/Sub Patterns - Topic-based messaging patterns
Message Brokers:
- RabbitMQ Documentation - AMQP protocol reference
- Apache Kafka Documentation - Event streaming platform docs
- AWS SQS/SNS Documentation - Managed messaging services
Advanced Patterns:
- CQRS Pattern - Command Query Responsibility Segregation
- Event Sourcing - Immutable event log as source of truth
- Outbox Pattern - Reliable dual-write pattern for transactional outbox
Related Posts:
- Event-Driven Architecture - Foundational patterns for event systems
- Message Queue Types - Detailed comparison of broker implementations
- Pub/Sub Patterns - Topic-based messaging patterns
- Service Choreography - Decentralized service coordination
- Service Orchestration - Centralized workflow management
Conclusion
Asynchronous communication lets microservices scale independently, survive failures gracefully, and evolve separately. Events and messages decouple services in time and space.
Message brokers like RabbitMQ and Kafka handle the infrastructure. Pub/sub broadcasts events to multiple consumers. Choreography and orchestration offer different trade-offs for multi-service workflows. Idempotency and eventual consistency are solvable problems.
The complexity is real. You deal with out-of-order messages, duplicate processing, distributed debugging, and lag between writes and reads. Before adopting async wholesale, start with bounded contexts where the benefits are clear: high write volume, independent scaling needs, fault isolation requirements, or genuinely decoupled services.
Category
Related Posts
CQRS and Event Sourcing: Distributed Data Management
Learn about Command Query Responsibility Segregation and Event Sourcing patterns for managing distributed data in microservices architectures.
Amazon Architecture: Lessons from the Pioneer of Microservices
Learn how Amazon pioneered service-oriented architecture, the famous 'two-pizza team' rule, and how they built the foundation for AWS.
Client-Side Discovery: Direct Service Routing in Microservices
Explore client-side service discovery patterns, how clients directly query the service registry, and when this approach works best.