AWS SQS and SNS: Cloud Messaging Services
Learn AWS SQS for point-to-point queues and SNS for pub/sub notifications, including FIFO ordering, message filtering, and common use cases.
SQS and SNS are managed messaging services that eliminate the operational burden of running your own broker. SQS gives you point-to-point queues; SNS gives you pub/sub topics. AWS handles the infrastructure — automatic scaling, high availability, pay-per-use pricing. No capacity planning, no clusters to maintain. This makes them useful for production workloads where you need durability without dedicated ops effort. The Message Queue Types post covers the underlying patterns.
Introduction
AWS offers two managed messaging services that handle most asynchronous communication needs in cloud applications: SQS for point-to-point queues and SNS for pub/sub notifications. Both are fully managed — no servers to provision, no clusters to maintain — and scale automatically from a single message per day to millions per second.
Core Concepts
AWS SQS: Point-to-Point Queues
SQS gives you managed message queues without running your own broker. You create a queue, send messages, and consume them.
SQS: Simple Queue Service
Queue Types
SQS has two types of queues. Standard queues offer best-effort ordering and at-least-once delivery with nearly unlimited throughput. FIFO queues provide exactly-once processing and guarantee ordering within message groups.
Message Lifecycle
A producer sends a message to the queue. A consumer polls for messages. After processing, the consumer deletes the message from the queue. SQS holds messages until deletion — failing to delete means the message reappears after the visibility timeout.
Working with SQS
import json
import boto3

sqs = boto3.client('sqs')

# Create a FIFO queue (the name must end in .fifo)
queue_url = sqs.create_queue(
    QueueName='tasks.fifo',
    Attributes={'FifoQueue': 'true', 'ContentBasedDeduplication': 'true'}
)['QueueUrl']

# Send a message; value is your own payload, defined elsewhere
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=json.dumps({'task': 'process', 'data': value}),
    MessageGroupId='task-group'
)

# Receive messages with long polling
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20
)

# The 'Messages' key is absent when the queue is empty
for msg in response.get('Messages', []):
    process(json.loads(msg['Body']))  # process() is your own handler
    # Delete only after processing succeeds
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])
Key SQS Features
Visibility timeout is how long a message stays invisible after your consumer picks it up. If your consumer crashes mid-processing, the message reappears once the timeout expires. The catch: your consumers need to handle duplicates.
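Handling duplicates usually means making the consumer idempotent. The sketch below is one minimal way to do that, assuming an in-memory set keyed by `MessageId`; the `IdempotentHandler` class is hypothetical, and a real deployment would need a shared store, and possibly a business key instead of `MessageId`, since a producer that retries a send creates a new `MessageId`.

```python
import json


class IdempotentHandler:
    """Wraps a handler so a redelivered message is processed at most once.

    Assumption: an in-memory set is enough for illustration. Multiple
    consumer processes would need a shared store instead.
    """

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()

    def handle(self, message):
        message_id = message["MessageId"]
        if message_id in self._seen:
            return False  # duplicate delivery, skip side effects
        self._handler(json.loads(message["Body"]))
        self._seen.add(message_id)  # record only after the handler succeeds
        return True


processed = []
consumer = IdempotentHandler(processed.append)
msg = {"MessageId": "m-1", "Body": json.dumps({"task": "process"})}

first = consumer.handle(msg)   # processed
second = consumer.handle(msg)  # simulated redelivery after a timeout, skipped
```

Recording the ID only after the handler returns means a crash mid-processing still leads to a retry rather than a silently lost message.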
Dead letter queues catch messages that fail repeatedly. Set a redrive policy and messages go to your DLQ after N failed attempts. You can inspect what went wrong without blocking the queue.
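As a rough illustration, the redrive policy is a JSON string stored in the queue's `RedrivePolicy` attribute. The helper name `redrive_attributes` and the ARN below are placeholders for this sketch, not values from a real account.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the Attributes entry that routes repeatedly failing messages to a DLQ."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical ARN, for illustration only
attrs = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:tasks-dlq", 5)
```

The resulting dictionary can be passed as `Attributes=` to `create_queue` or `set_queue_attributes` on the source queue.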
Message retention defaults to 4 days and can be raised to 14. Your consumers can be down for a weekend and messages survive. This buffers downtime without losing work.
Long polling cuts down on empty responses. SQS waits up to 20 seconds for messages to arrive before replying. Fewer API calls, lower costs, less waiting.
AWS SNS: Pub/Sub Notifications
SNS is a managed pub/sub service. You create topics, subscribe endpoints (email, SMS, HTTP, Lambda, SQS, mobile push), then publish messages and SNS fans them out to all subscribers.
SNS Topic Operations
How Topics Work
You create an SNS topic. Subscribe endpoints to it (email, SMS, HTTP, Lambda, SQS, mobile push). Publish a message, SNS fans it out to all subscribers. That’s the whole model.
Message Filtering
Subscribers can use filter policies to receive only messages they care about:
sns.subscribe(
    TopicArn=topic_arn,
    Protocol='sqs',
    Endpoint='sqs-arn',
    Attributes={'FilterPolicy': json.dumps({
        'event': ['order.placed', 'order.cancelled'],
        'region': ['us-west', 'us-east']
    })}
)
Messages not matching the filter policy are not delivered to that subscriber.
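To reason about a policy before deploying it, a small local matcher can help. This is a simplified sketch that only models exact string matching (every policy key must be present, and its value must be one of the listed options); the function name `matches_filter_policy` is hypothetical, and operators such as prefix or numeric matching are not covered.

```python
def matches_filter_policy(policy, attributes):
    """Return True if the attributes satisfy every key in the policy.

    Simplified model: each policy key maps to a list of accepted
    string values, and all keys must match.
    """
    for key, accepted in policy.items():
        if attributes.get(key) not in accepted:
            return False
    return True


policy = {
    "event": ["order.placed", "order.cancelled"],
    "region": ["us-west", "us-east"],
}

delivered = matches_filter_policy(policy, {"event": "order.placed", "region": "us-west"})
dropped = matches_filter_policy(policy, {"event": "order.shipped", "region": "us-west"})
missing = matches_filter_policy(policy, {"event": "order.placed"})  # no region attribute
```

Note that a message lacking one of the policy's keys is filtered out, which is a common source of silent drops.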
SNS Message Batching
SNS supports message batching to lower costs and handle more throughput. The PublishBatch API lets you send up to 10 messages at once.
# Send batch of messages (up to 10 per batch)
entries = [
    {'Id': '1', 'Message': json.dumps({'event': 'order.placed', 'order_id': '1001'})},
    {'Id': '2', 'Message': json.dumps({'event': 'order.placed', 'order_id': '1002'})},
    {'Id': '3', 'Message': json.dumps({'event': 'order.placed', 'order_id': '1003'})},
]
sns.publish_batch(TopicArn=topic_arn, PublishBatchRequestEntries=entries)
Batching reduces costs at scale: 100 messages means 10 API calls instead of 100.
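When the number of events is not a neat multiple of 10, the entries have to be chunked first. A minimal sketch of that step, assuming a hypothetical helper `to_batches` and made-up order events:

```python
import json


def to_batches(events, batch_size=10):
    """Split events into lists of PublishBatch request entries."""
    if not 1 <= batch_size <= 10:
        raise ValueError("PublishBatch accepts 1 to 10 entries per call")
    batches = []
    for start in range(0, len(events), batch_size):
        chunk = events[start:start + batch_size]
        batches.append([
            # Id only needs to be unique within a single batch
            {"Id": str(i), "Message": json.dumps(event)}
            for i, event in enumerate(chunk)
        ])
    return batches


events = [{"event": "order.placed", "order_id": str(1000 + n)} for n in range(25)]
batches = to_batches(events)
```

Each inner list would be passed as `PublishBatchRequestEntries` to `sns.publish_batch`. The response separates `Successful` and `Failed` entries, so a caller should check `Failed` rather than assume the whole batch went through.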
SNS FIFO Topics
SNS supports FIFO (First-In-First-Out) topics that provide strict ordering and exactly-once delivery. FIFO topics are designed for scenarios where message order matters, such as financial transactions or inventory updates.
# Create FIFO topic
fifo_topic_arn = sns.create_topic(
    Name='order-events.fifo',
    Attributes={'FifoTopic': 'true', 'ContentBasedDeduplication': 'true'}
)['TopicArn']

# Publish with message group ID for ordering
sns.publish(
    TopicArn=fifo_topic_arn,
    Message=json.dumps({'event': 'order.placed', 'order_id': '12345'}),
    MessageGroupId='order-processing'  # Ensures ordering within group
)
| Feature | SNS Standard | SNS FIFO |
|---|---|---|
| Ordering | No guarantee | Per message group |
| Deduplication | None | 5-minute window |
| Throughput | Unlimited | 300 messages/sec per topic |
| Message group | N/A | Groups messages for ordering |
| Cost | Per message + delivery | Higher (per message) |
SNS FIFO is a good fit when you need messages for the same entity (same order, same user) processed in order.
Capacity and Scaling
SQS and SNS scale automatically, but your architecture decisions determine how well they hold up under load.
Message Volume Estimation
For SQS, figure out your peak messages per second and pick Standard (unlimited throughput) or FIFO (3000 messages/sec with batching). FIFO message groups let you parallelize while keeping order within each group.
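The FIFO figure is just arithmetic: this article's numbers imply 300 API calls per second, each carrying up to 10 messages. A quick sketch of that check, with hypothetical helper names and the limits taken from the text above rather than from a live quota lookup:

```python
FIFO_API_CALLS_PER_SECOND = 300  # implied by 3000 messages/sec at batch size 10
MAX_BATCH_SIZE = 10


def fifo_capacity(batch_size=MAX_BATCH_SIZE):
    """Messages per second a FIFO queue can absorb at a given batch size."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError("batch_size must be between 1 and 10")
    return FIFO_API_CALLS_PER_SECOND * batch_size


def fits_in_fifo(peak_messages_per_second, batch_size=MAX_BATCH_SIZE):
    """True if the expected peak stays within the FIFO limit."""
    return peak_messages_per_second <= fifo_capacity(batch_size)


unbatched = fifo_capacity(1)
full_batches = fifo_capacity(10)
ok = fits_in_fifo(2500)                      # fits with full batches
too_much = fits_in_fifo(2500, batch_size=5)  # half-full batches are not enough
```

If the peak does not fit, the options are larger batches, a Standard queue, or a quota change, depending on whether ordering is really required.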
For SNS, throughput is effectively unlimited, but delivery costs scale with subscriber count. If you have 100K subscribers and publish 1M messages/day, delivery costs dominate your bill.
Backpressure Handling
SQS consumers control their own polling rate. Long polling with WaitTimeSeconds=20 naturally throttles when the queue is empty. Set MaxNumberOfMessages based on your processing capacity.
For burst traffic, SQS buffers automatically. But if your consumers fall behind, messages pile up and ApproximateAgeOfOldestMessage climbs. Watch this metric and scale consumers out.
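# Adaptive polling based on queue depth
attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=['ApproximateNumberOfMessages']
)['Attributes']
queue_depth = int(attrs['ApproximateNumberOfMessages'])

response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10 if queue_depth > 100 else 1,
    WaitTimeSeconds=5 if queue_depth > 100 else 20
)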
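Deciding how far to scale out can be estimated from the arrival rate, the backlog, and what one consumer can handle. The formula below is a back-of-the-envelope sketch, not an AWS-provided calculation; the function name, the rates, and the five-minute drain target are assumptions for illustration.

```python
import math


def consumers_needed(arrival_rate, per_consumer_rate, backlog=0, drain_seconds=300):
    """Estimate consumers required to keep up and drain the backlog in time.

    arrival_rate and per_consumer_rate are in messages per second.
    """
    if per_consumer_rate <= 0 or drain_seconds <= 0:
        raise ValueError("per_consumer_rate and drain_seconds must be positive")
    required_rate = arrival_rate + backlog / drain_seconds
    return max(1, math.ceil(required_rate / per_consumer_rate))


steady_state = consumers_needed(200, 50)
catching_up = consumers_needed(200, 50, backlog=30_000, drain_seconds=300)
idle = consumers_needed(0, 50)
```

With 200 messages per second arriving and each consumer handling 50, four consumers keep pace; a 30,000 message backlog to clear in five minutes raises that to six.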
SNS Delivery Rate Limiting
SNS limits deliveries per subscriber to 30K/second by default. For Lambda, that maps to concurrent invocations. If you need higher rates, file a service quota increase.
Cross-region SNS subscriptions add data transfer charges. Keep publishers and subscribers in the same region when you can.
Scaling Patterns
Horizontal consumer scaling: Each SQS queue supports multiple consumers across different machines. The visibility timeout lets failed processing be retried by another consumer, but a message can be delivered more than once, so keep handlers idempotent.
SNS fan-out scaling: Give each consumer group its own SQS queue rather than adding consumers with different workloads to one queue. This avoids head-of-line blocking where one slow consumer holds up everyone sharing that queue.
FIFO scaling: FIFO queues with message group IDs let you order messages within groups while processing groups in parallel. Keep message groups around independent entities (order IDs, user IDs).
SNS and SQS Patterns
A common pattern is SNS fan-out to multiple SQS queues. One event, multiple consumers, each with its own queue.
graph LR
Publisher -->|publish| SNS[SNS Topic]
SNS -->|deliver| Q1[SQS Queue: Analytics]
SNS -->|deliver| Q2[SQS Queue: Notifications]
SNS -->|deliver| Q3[SQS Queue: Audit]
This combines SNS’s pub/sub with SQS’s queuing. Each consumer gets its own queue, so retry logic and parallel processing work independently.
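# SNS publishes to multiple SQS queues (configured via topic subscription)
sns.publish(TopicArn=topic_arn, Message=json.dumps(event))

# Each consumer group has its own queue.
# Without raw message delivery, the SQS body is an SNS envelope whose
# 'Message' field holds the published payload.

# Analytics queue consumer
response = sqs.receive_message(QueueUrl=analytics_queue_url, WaitTimeSeconds=20)
for msg in response.get('Messages', []):
    run_analytics(json.loads(json.loads(msg['Body'])['Message']))
    sqs.delete_message(QueueUrl=analytics_queue_url, ReceiptHandle=msg['ReceiptHandle'])

# Notifications queue consumer
response = sqs.receive_message(QueueUrl=notifications_queue_url, WaitTimeSeconds=20)
for msg in response.get('Messages', []):
    send_notification(json.loads(json.loads(msg['Body'])['Message']))
    sqs.delete_message(QueueUrl=notifications_queue_url, ReceiptHandle=msg['ReceiptHandle'])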
This gives you SNS topic-based routing, SQS per-consumer queuing and retry, and independent scaling per consumer group.
Fan-Out to SQS
The fan-out pattern uses SNS topics to publish a message once, then routes copies to multiple SQS queues. Each consumer processes from its own queue independently.
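For the copies to arrive, each queue needs a policy that lets the topic send to it. A minimal sketch of such a queue policy follows; the helper name `fanout_queue_policy` and both ARNs are placeholders.

```python
import json


def fanout_queue_policy(queue_arn, topic_arn):
    """Queue policy letting one SNS topic send messages to one SQS queue."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            # Restrict delivery to this specific topic
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    })


# Hypothetical ARNs, for illustration only
policy = fanout_queue_policy(
    "arn:aws:sqs:us-east-1:123456789012:analytics",
    "arn:aws:sns:us-east-1:123456789012:order-events",
)
```

The string would be set as the queue's `Policy` attribute, for example through `sqs.set_queue_attributes(QueueUrl=..., Attributes={'Policy': policy})`, before subscribing the queue to the topic.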
Cost Optimization
SQS and SNS costs scale with API calls and message delivery. Optimizing both brings down your bill.
SQS Cost Factors
SQS charges per API request:
| Request Type | Standard Queue | FIFO Queue |
|---|---|---|
| Send, Receive, Delete | $0.40 per million | $0.50 per million |
| Other operations | $0.40 per million | $0.50 per million |
Reducing SQS Costs
Long polling is the main way to reduce SQS costs. Short polling (the default) bills per request regardless of whether a message arrives. Long polling waits up to 20 seconds, batching multiple empty responses into one billable request.
# Enable long polling to reduce costs
sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20  # Long polling: wait up to 20s
)
For high-throughput queues, batching with ReceiveMessage (up to 10 messages per call) cuts down billable requests.
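The saving from long polling on a mostly idle queue can be worked out directly. The sketch below assumes a single consumer that short polls once per second versus one that long polls with a 20 second wait, uses the Standard queue price quoted above, and ignores any free tier; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
PRICE_PER_MILLION = 0.40  # USD, Standard queue figure from the table above


def daily_requests(seconds_per_request):
    """Requests per day for one consumer polling an idle queue back to back."""
    return SECONDS_PER_DAY // seconds_per_request


def monthly_cost(requests_per_day, days=30):
    """Request cost in USD over the given number of days."""
    return requests_per_day * days * PRICE_PER_MILLION / 1_000_000


short_poll = daily_requests(1)   # assumption: one short poll per second
long_poll = daily_requests(20)   # WaitTimeSeconds=20 on an empty queue
ratio = short_poll // long_poll
```

Under these assumptions the idle consumer makes 86,400 requests per day with short polling and 4,320 with long polling, a twentyfold reduction per consumer.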
SNS Cost Factors
SNS charges per message published plus per message delivered:
| Operation | Cost per Million |
|---|---|
| Publish | $0.50 |
| Subscribe/Confirm | $0.40 |
| Delivery to SQS/Lambda | No charge |
| Delivery to HTTP/S | $0.60 |
| Delivery to Mobile Push | $1.50 - $6.00 |
Reducing SNS Costs
Use message batching with PublishBatch to cut publish costs. For delivery, use filter policies so you do not send messages to subscribers who will just discard them.
# Batch publish to reduce costs
entries = [
    {'Id': str(i), 'Message': json.dumps({'event': f'event-{i}'})}
    for i in range(10)
]
sns.publish_batch(TopicArn=topic_arn, PublishBatchRequestEntries=entries)
# 10 messages for the price of 1 publish call + 10 deliveries
Cross-region SNS subscriptions add data transfer costs. Keep subscribers in the same region when you can.
SQS vs SNS: Choosing and Knowing When Not To
SQS vs SNS: When to Use Which
| Aspect | SQS | SNS |
|---|---|---|
| Pattern | Point-to-point queue | Pub/sub |
| Delivery | Pull (consumers poll) | Push (subscribers receive) |
| Multiple consumers | Single consumer per message | All subscribers receive |
| Ordering | FIFO queues available | FIFO topics available |
| Throughput | Unlimited (standard), 3000/s (FIFO) | Unlimited |
| Cost | Per API call | Per message published + per delivery |
Pick SQS when you need work distribution across consumers, each processing a message once. It handles burst traffic well, with visibility timeout and redrive built in.
Pick SNS when multiple consumers need the same message, when you want push-based delivery, or when broadcasting events to many subscribers. Simpler than running your own pub/sub.
When Not to Use SQS and SNS
When Not to Use SQS
- When you need push-based delivery: SQS is pull-based; consumers must poll
- When you need strict ordering at Standard-queue throughput: Standard queues offer only best-effort ordering, and FIFO queues cap throughput
- When you need exactly-once delivery without deduplication logic: Standard queues deliver at-least-once
- When multiple consumers each need every message: each message is handled by a single consumer; fan out with SNS instead
When Not to Use SNS
- When you need message persistence: SNS does not store messages for later retrieval, so subscribers must be reachable during the retry window
- When you need strict ordering on a standard topic: only FIFO topics preserve order, and only within a message group
- When you need exactly-once without client deduplication: SNS delivers at-least-once
- When you have many small subscribers: Each subscription incurs delivery costs
SNS vs EventBridge
EventBridge is a serverless event bus, the successor to CloudWatch Events, that adds event routing rules, schema discovery, and SaaS integrations. The choice depends on your use case.
| Aspect | SNS | EventBridge |
|---|---|---|
| Architecture | Pub/sub topic | Event bus with routing rules |
| Schema registry | None | Built-in schema registry |
| SaaS integrations | None | 200+ SaaS sources |
| Event routing | Topic-based | Rule-based with filtering |
| Archive and replay | No | Yes (configurable retention) |
| Cost | Per message + delivery | Per event + processing |
| Dead letter handling | DLQ per subscription | DLQ per target |
EventBridge shines when you need SaaS integrations (Salesforce, Zendesk, third-party webhooks) or schema validation. SNS is simpler and cheaper for pure pub/sub fan-out within AWS.
# EventBridge rule-based routing example
import json
import boto3

events = boto3.client('events')

# Create a rule that matches EC2 instance state changes
events.put_rule(
    Name='ec2-state-changes',
    EventPattern=json.dumps({
        'source': ['aws.ec2'],
        'detail-type': ['EC2 Instance State-change Notification']
    }),
    State='ENABLED'
)

# Add targets; Lambda and SQS targets are authorized through their
# resource-based policies rather than a RoleArn
events.put_targets(
    Rule='ec2-state-changes',
    Targets=[
        {'Id': '1', 'Arn': 'lambda-arn'},
        {'Id': '2', 'Arn': 'sqs-arn'}
    ]
)
Within your application stack, SNS handles most messaging well. EventBridge costs more but adds value when you need event routing, schema management, or SaaS ingestion.
Comparison to Self-Managed Solutions
Managed services like SQS and SNS remove operational burden. You do not provision servers, manage replication, or tune performance. AWS handles availability and durability.
The tradeoff is vendor lock-in. Your code depends on AWS APIs, and moving to another platform means rewriting the messaging layer. Self-managed solutions (Kafka, RabbitMQ) give you portability but need more ops work.
For understanding messaging patterns that apply regardless of platform, see message queue types and pub/sub patterns.
AWS PrivateLink/VPC Endpoint Configuration
PrivateLink keeps SQS and SNS traffic inside the AWS network, avoiding the public internet and giving you private connectivity from within a VPC.
SQS VPC Endpoints
# Create VPC endpoint for SQS
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-012345678 \
  --service-name com.amazonaws.us-east-1.sqs \
  --vpc-endpoint-type Interface \
  --subnet-ids subnet-012345678 subnet-876543210 \
  --security-group-ids sg-012345678
VPC endpoints use ENIs in your subnets. Set up the security group to allow traffic on port 443 from your application servers.
SNS VPC Endpoints
# Create VPC endpoint for SNS
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-012345678 \
  --service-name com.amazonaws.us-east-1.sns \
  --vpc-endpoint-type Interface \
  --subnet-ids subnet-012345678 subnet-876543210 \
  --security-group-ids sg-012345678
IAM Policies for VPC Endpoints
A resource policy on the queue can tie access to the VPC endpoint by matching on aws:sourceVpce:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage"],
      "Resource": "arn:aws:sqs:us-east-1:123456789:my-queue",
      "Condition": {
        "StringEquals": {
          "aws:sourceVpce": "vpce-012345678"
        }
      }
    }
  ]
}
This queue policy grants ReceiveMessage and DeleteMessage only to requests that arrive through your VPC endpoint. An Allow statement does not block callers who hold their own IAM permissions, so add an explicit Deny for other paths if the queue must be reachable only through the endpoint.
Trade-off Analysis
| Scenario | SQS Standard | SQS FIFO | SNS Standard | SNS FIFO |
|---|---|---|---|---|
| Message ordering | Best-effort | Per message group | None | Per message group |
| Delivery guarantee | At-least-once | Exactly-once (5-min deduplication window) | At-least-once | Exactly-once (5-min deduplication window) |
| Throughput | Unlimited | 3000/sec with batching | Unlimited | 300/sec |
| Delivery model | Pull (consumer polls) | Pull (consumer polls) | Push (fan-out) | Push (fan-out) |
| Multiple consumers per message | Single | Single | All subscribers | All subscribers |
| Relative cost at scale | Low (API calls) | Higher (per message) | Higher (per delivery) | Highest (per message + delivery) |
| Complexity | Low | Medium | Low | Medium |
| Best for | Background jobs, task queues | Order-critical processing | Event broadcasting, notifications | Order-critical fan-out |
Choosing Between SQS and SNS
Prefer SQS when:
- Work items must be processed exactly once and in order within an entity (use a FIFO queue)
- Consumer needs exclusive access to messages
- You need visibility timeout and automatic retry per message
- Buffering for burst traffic is important
Prefer SNS when:
- Multiple independent systems need the same event
- Push-based notification delivery is required
- Broadcasting to many subscribers at once
- Decoupling producer from consumer processing time
Prefer SNS + SQS (fan-out) when:
- You want broadcast semantics with queue-based processing
- Different consumer groups need independent retry and throttle handling
- You want SNS simplicity with SQS durability guarantees
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| SQS broker failure | Queue temporarily unavailable; messages not sent or received | SQS manages replication; multi-AZ deployment is automatic |
| SNS broker failure | Messages not delivered to subscribers | Retry failed publishes with backoff; implement dead letter handling |
| Consumer crash mid-processing | Message becomes visible again after visibility timeout | Use visibility timeout appropriately; implement idempotent processing |
| SNS subscription deleted | Messages silently dropped for that subscriber | Use CloudWatch to monitor subscription status |
| SQS queue deletion | All messages permanently lost | Restrict sqs:DeleteQueue with IAM; persist critical messages outside the queue |
| Throughput limit exceeded | Messages rejected or throttled | Request service quota increase; use exponential backoff |
| FIFO ordering violation | Messages processed out of order | Use message group IDs correctly; single consumer per group |
| SNS filter policy misconfiguration | Subscribers receive no messages or wrong messages | Test filter policies; monitor filtered-out message counts |
| Lambda throttling | SNS retries with exponential backoff | Request more concurrent executions |
| Lambda timeout | Treated as invocation failure | Set appropriate timeout + DLQ |
| Lambda crash | Message not processed | DLQ for failed messages |
| Invalid payload to Lambda | Lambda throws on unmarshal | Validate payloads in the handler; route failures to a DLQ |
| Lambda permission denied | SNS retries with exponential backoff up to max retries, then DLQ | Verify IAM role has lambda:InvokeFunction permission |
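Several mitigations above rely on exponential backoff. The sketch below shows one common variant, full jitter, where each delay is drawn uniformly between zero and a capped exponential ceiling. The function name, base, and cap are illustrative assumptions; boto3 already retries throttled calls on its own, so this applies to application-level retries.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Full-jitter backoff: delay n is uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


# Seeded only so the example is repeatable
delays = backoff_delays(8, rng=random.Random(42))
ceilings = [min(30.0, 0.5 * 2 ** n) for n in range(8)]
```

The jitter spreads retries from many clients over time, which avoids synchronized retry bursts against a throttled queue or topic.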
SNS to Lambda: DLQ Configuration
When SNS delivers to Lambda, attach a dead letter queue to the subscription to capture messages SNS could not deliver. Custom delivery policies apply only to HTTP/S endpoints; for Lambda, SNS follows a fixed retry schedule:
# SNS-to-Lambda subscription with a dead letter queue.
# dlq_arn must be an SQS queue ARN whose queue policy lets SNS send to it.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol='lambda',
    Endpoint=lambda_arn,
    Attributes={
        'RedrivePolicy': json.dumps({'deadLetterTargetArn': dlq_arn})
    }
)

# In Lambda, log the failure and re-raise so the invocation counts as failed
def handler(event, context):
    try:
        process_event(event)
    except Exception as e:
        print(json.dumps({'original': event, 'error': str(e)}))
        raise  # Lambda's async retry and OnFailure destination take over
Configure Lambda async invocation settings to align with SNS retry behavior:
# Configure Lambda async settings via boto3
lambda_client.put_function_event_invoke_config(
    FunctionName='my-function',
    MaximumRetryAttempts=2,
    MaximumEventAgeInSeconds=3600,
    DestinationConfig={
        'OnFailure': {
            'Destination': 'arn:aws:sqs:us-east-1:123456789:my-dlq'
        }
    }
)
SNS-to-Lambda chains the SNS retry on top of Lambda’s own retry. Set both to avoid duplicate processing or message loss.
Common Pitfalls / Anti-Patterns
Not Setting Visibility Timeout Correctly
If visibility timeout is too short, messages are redelivered before the consumer finishes. If too long, a message whose consumer crashed stays hidden and is retried late, and in a FIFO queue it holds up the rest of its message group. Set it based on expected processing time plus a buffer.
Queue Configuration Pitfalls
Using Standard Queues When FIFO Is Needed
Standard queues offer best-effort ordering. If your business requires ordering, use FIFO queues with message group IDs.
Not Polling Efficiently
Short polling (default) wastes API calls. Use long polling (WaitTimeSeconds > 0) to reduce costs and latency.
Forgetting to Delete Messages After Processing
SQS does not auto-delete. Always call DeleteMessage after successful processing or messages will be reprocessed.
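A consumer loop that deletes only on success might look like the sketch below. `process_batch` takes the client as a parameter (assumed to be `boto3.client('sqs')` in real use); `StubSqs` is a hypothetical in-memory stand-in included only so the example runs without AWS.

```python
import json


def process_batch(client, queue_url, handler):
    """Receive up to 10 messages; delete each one only after handler succeeds."""
    response = client.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    deleted = 0
    for message in response.get("Messages", []):
        try:
            handler(json.loads(message["Body"]))
        except Exception:
            # Leave the message alone: it reappears after the visibility timeout
            continue
        client.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        deleted += 1
    return deleted


class StubSqs:
    """In-memory stand-in for an SQS client, for running this sketch offline."""

    def __init__(self, bodies):
        self.messages = [
            {"MessageId": str(i), "ReceiptHandle": f"rh-{i}", "Body": body}
            for i, body in enumerate(bodies)
        ]
        self.deleted = []

    def receive_message(self, **kwargs):
        return {"Messages": list(self.messages)} if self.messages else {}

    def delete_message(self, QueueUrl, ReceiptHandle):
        self.deleted.append(ReceiptHandle)
        self.messages = [m for m in self.messages if m["ReceiptHandle"] != ReceiptHandle]


def handler(payload):
    if payload.get("fail"):
        raise RuntimeError("simulated processing failure")


stub = StubSqs([json.dumps({"task": "a"}), json.dumps({"task": "b", "fail": True})])
deleted = process_batch(stub, "https://example.invalid/queue", handler)
```

The failed message stays in the queue and becomes a candidate for the redrive policy once it exceeds the configured receive count.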
Mixing Message Types in One Queue
Different consumers processing different message types in one queue leads to coupling and processing errors. Use separate queues per message type.
Not Handling SNS Delivery Failures
If Lambda cannot be invoked or an HTTP endpoint is unreachable, SNS retries. But without a DLQ, messages are dropped once retries are exhausted. Always configure dead letter queues for failed deliveries.
Interview Questions
Expected answer points:
- Standard queues provide at-least-once delivery (messages may be delivered more than once); FIFO provides exactly-once processing
- Standard queues offer best-effort ordering; FIFO preserves message order within message groups
- FIFO throughput is lower (300 messages/sec without batching, 3000 with); Standard has unlimited throughput
- FIFO uses message group IDs to preserve order within each group while allowing different groups to be processed in parallel
Expected answer points:
- After a consumer receives a message, it becomes invisible to other consumers for the visibility timeout duration
- If the consumer crashes before deleting the message, it reappears after the visibility timeout expires
- Setting too short: message redelivered before the consumer finishes; setting too long: failed messages wait longer before a retry
- Best practice: set timeout to longer than expected processing time plus buffer
Expected answer points:
- A DLQ captures messages that fail processing after a configured number of attempts
- For SQS: configure a redrive policy to send messages to DLQ after maxReceiveCount failures
- For SNS: attach a redrive policy (deadLetterTargetArn) to the subscription to capture failed deliveries
- DLQs enable failure analysis without blocking the main queue or losing messages
Expected answer points:
- One SNS topic fans out to multiple SQS queues, one per consumer group
- Solves the pub/sub to point-to-point conversion: SNS broadcasts, SQS queues deliver
- Each consumer group processes independently with its own retry and visibility timeout
- Example: analytics, notifications, and audit each get their own queue processing the same events
Expected answer points:
- Short polling (default): SQS responds immediately even if queue is empty, billing per request
- Long polling: SQS waits up to 20 seconds for messages to arrive before responding
- Batches multiple empty responses into one billable request
- Set WaitTimeSeconds > 0 to enable long polling
Expected answer points:
- Subscribers define filter policies in JSON; SNS only delivers messages matching the policy
- Reduces unnecessary processing: subscribers receive only relevant messages
- Messages not matching filter are not delivered (not charged)
- Example: an SQS queue only interested in order.placed events can filter out order.cancelled
Expected answer points:
- SNS FIFO is a pub/sub topic; SQS FIFO is a point-to-point queue
- SNS FIFO delivers to all matching subscribers; SQS FIFO delivers to one consumer
- Both SNS FIFO and SQS FIFO preserve ordering per message group
- Both provide exactly-once delivery and 5-minute deduplication window
Expected answer points:
- EventBridge has built-in schema registry for event validation and discovery
- EventBridge supports 200+ SaaS integrations (Salesforce, Zendesk, etc.)
- EventBridge has rule-based routing with filtering (not just topic-based)
- EventBridge supports archive and replay with configurable retention; SNS does not
- EventBridge is more expensive: per event + processing vs SNS per message + delivery
Expected answer points:
- For HTTP/S endpoints, SNS retries with backoff based on the delivery policy (minDelayTarget, maxDelayTarget, numRetries); for Lambda it follows a fixed AWS-managed retry schedule
- After retries exhausted, message goes to DLQ if configured
- Lambda async invocation has separate retry behavior (MaximumRetryAttempts)
- Chain: SNS retry → Lambda retry; configure both to avoid duplicates or message loss
Expected answer points:
- SQS: Enable long polling to batch empty responses; batch receives with MaxNumberOfMessages=10
- SNS: Use PublishBatch API (up to 10 messages per call); use filter policies to avoid unnecessary deliveries
- Both: Keep publishers and subscribers in the same region to avoid cross-region data transfer costs
- SNS FIFO costs more than standard (per message + delivery)
Expected answer points:
- ContentBasedDeduplication uses SHA-256 hash of the message body as the deduplication ID automatically
- Eliminates the need to generate and pass MessageDeduplicationId on every publish
- Use when message body is unique per intended message (same content means duplicate)
- Use manual MessageDeduplicationId when body may repeat but intent is unique (e.g., same action for different entities)
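The difference between the two approaches can be illustrated locally. This sketch only mimics the idea of hashing the body with SHA-256; it is not a claim about the exact bytes the service hashes, and the helper names and the tenant example are hypothetical.

```python
import hashlib


def content_dedup_id(body):
    """Deduplication ID derived from the message body alone."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def explicit_dedup_id(entity_id, action):
    """Deduplication ID that also encodes which entity the action targets."""
    return f"{entity_id}:{action}"


# Same body sent on behalf of two different tenants
body = '{"action": "refresh-cache"}'

content_ids = (content_dedup_id(body), content_dedup_id(body))
explicit_ids = (
    explicit_dedup_id("tenant-1", "refresh-cache"),
    explicit_dedup_id("tenant-2", "refresh-cache"),
)
```

The content-derived IDs collide, so the second message would be treated as a duplicate inside the deduplication window, while the explicit IDs keep the two intents distinct.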
Expected answer points:
- SQS maximum message size is 256KB (262,144 bytes)
- For larger payloads: store payload in S3 or DynamoDB, send reference URL/ID in SQS message
- Extended client library (Java) handles this pattern automatically with S3
- Tradeoff: adds latency (extra S3/DynamoDB call) but keeps SQS for coordination
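A sketch of the size check and the pointer message, assuming the 256KB limit stated above; the helper names and the `s3_bucket` and `s3_key` field names are hypothetical, and the actual upload to S3 is left out.

```python
import json

SQS_MAX_BYTES = 262_144  # 256KB limit quoted above


def needs_offload(body):
    """True if the UTF-8 encoded body exceeds the SQS size limit."""
    return len(body.encode("utf-8")) > SQS_MAX_BYTES


def pointer_message(bucket, key):
    """Small message body that references a payload stored in S3."""
    return json.dumps({"s3_bucket": bucket, "s3_key": key})


at_limit = needs_offload("x" * 262_144)
over_limit = needs_offload("x" * 262_145)
multibyte = needs_offload("é" * 200_000)  # 2 bytes per character in UTF-8
pointer = pointer_message("my-payloads", "orders/1001.json")
```

Measuring encoded bytes rather than characters matters for non-ASCII payloads, which can cross the limit well before the character count does.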
Expected answer points:
- If a subscriber has a filter policy but receives a message that does not match, the message is silently discarded
- No notification to publisher or subscriber that message was filtered out
- Monitor the NumberOfNotificationsFilteredOut metric in CloudWatch to detect silent drops
- Always test filter policies before deployment; consider default subscription without filters
Expected answer points:
- HTTP/HTTPS: SNS sends a POST with SubscribeURL that must be visited to confirm; times out in 3 days
- Lambda: no manual confirmation step; when the topic and function are in the same account, the subscription is confirmed as soon as it is created
- The function needs a resource-based permission that allows sns.amazonaws.com to invoke it
- HTTP/S endpoints must be reachable and able to handle the confirmation request
Expected answer points:
- With an event source mapping, the Lambda service polls SQS on your behalf; each received message becomes invisible for the visibility timeout
- If Lambda runs longer than visibility timeout, SQS makes the message visible again and another Lambda can pick it up
- Set the queue's visibility timeout well above the function timeout (AWS recommends at least six times) to avoid duplicate processing
- If Lambda crashes before deleting, message reappears after visibility timeout; idempotent processing is essential
Expected answer points:
- Messages with the same GroupId are processed in order; messages with different GroupIds can be processed in parallel
- Enables ordering guarantee within an entity (same order, same user) while scaling horizontally
- Each consumer instance processes messages from one or more groups independently
- Design groups around independent entities: one group per order ID, user ID, or entity requiring ordering
Expected answer points:
- Publisher account creates SNS topic with resource-based policy allowing subscriber account
- Policy must grant sns:Subscribe to the subscriber account or to specific IAM roles in it
- Subscriber account creates the SQS queue, allows the topic to call sqs:SendMessage in the queue policy, and subscribes the queue to the cross-account topic
- Data transfer charges apply when the topic and queue are in different regions
Expected answer points:
- Messages in other message groups continue processing normally; they are not blocked
- Messages in the stuck group accumulate until the consumer recovers, the message moves to a DLQ, or the retention period expires
- FIFO within a group is preserved; other groups are independent
- Monitor ApproximateAgeOfOldestMessage and ApproximateNumberOfMessagesNotVisible to detect stuck messages
Expected answer points:
- Delivery policy controls retry behavior for HTTP/S endpoints when delivery fails (error response or timeout)
- minDelayTarget: minimum delay between retries (default 20 seconds)
- maxDelayTarget: maximum delay between retries
- numRetries: maximum retry attempts before the message is dropped or moved to the DLQ
- backoffFunction: linear, arithmetic, geometric, or exponential backoff between retries
Expected answer points:
- SQS-managed keys (SSE-SQS): no additional charge, keys owned and rotated by SQS, no management overhead
- Customer managed KMS key (SSE-KMS): allows key policies, audit logging via CloudTrail, control over rotation; incurs KMS key and request charges
- CMK enables stricter access controls: only authorized consumers can decrypt messages
- Use CMK when regulatory or compliance requirements mandate customer control over encryption keys
Further Reading
- AWS SQS Developer Guide - Official AWS documentation for SQS patterns and best practices
- AWS SNS Developer Guide - Official AWS documentation for SNS pub/sub patterns
- Message Queue Types - Understanding queue patterns beyond AWS
- Pub/Sub Patterns - Event-driven architecture fundamentals
- System Design Fundamentals - Core concepts for distributed systems
Conclusion
SQS is pull-based point-to-point queuing. SNS is push-based pub/sub. Standard queues give you unlimited throughput with at-least-once delivery; FIFO gives you exactly-once with ordering. SNS fan-out to multiple SQS queues combines pub/sub flexibility with queuing durability. Visibility timeout controls when messages reappear if not acknowledged. Dead letter queues capture failures. Long polling reduces empty responses and costs. Server-side encryption with KMS protects messages at rest.
Pre-Deployment Checklist
- SQS visibility timeout set based on expected processing time
- SQS long polling enabled (WaitTimeSeconds > 0)
- Dead letter queue configured for failed message handling
- FIFO queues used when ordering is required
- Message group IDs set correctly for FIFO ordering
- Idempotent message processing implemented
- SSE-KMS encryption enabled for SQS queues
- IAM policies scoped to minimum required permissions
- VPC endpoints configured for private network access
- CloudWatch alarms set for queue depth and message age
- SNS filter policies tested before deployment
- SNS dead letter queue configured for failed deliveries
- DeleteMessage called after successful processing
- SNS subscription permissions reviewed for cross-account access
Pre-Deployment and Operations
Metrics to Monitor
- SQS queue depth: ApproximateNumberOfMessagesVisible
- SQS old message age: ApproximateAgeOfOldestMessage (critical for ordering)
- SNS delivery rate: NumberOfMessagesPublished, NumberOfNotificationsDelivered
- SNS delivery success/failure: NumberOfNotificationsFailed
- SNS filter policy drops: NumberOfNotificationsFilteredOut
- SQS empty receives: NumberOfEmptyReceives (long polling effectiveness)
- FIFO group ordering lag: tracked per message group with custom application metrics, since SQS publishes metrics per queue
Logs to Capture
- SQS sendMessage, receiveMessage, and deleteMessage events
- SNS publish and delivery events
- SNS subscription creation and deletion
- SQS visibility timeout expirations
- Dead letter queue arrivals (via DLQ subscription)
- CloudTrail API calls for administrative actions
Alerts to Configure
- SQS queue depth exceeds threshold
- Oldest message age exceeds SLA threshold
- SNS delivery failure rate exceeds threshold
- SNS filtered-out message rate is abnormal
- SQS long polling not effective (empty responses)
- FIFO message group lag for critical groups
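As one example, the oldest-message-age alert could be defined through CloudWatch's `put_metric_alarm`. The sketch below only builds the keyword arguments; the helper name, alarm name, period, and evaluation window are assumptions to adapt, not recommended values.

```python
def oldest_message_alarm(queue_name, max_age_seconds):
    """Keyword arguments for a CloudWatch alarm on SQS message age."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,            # evaluate one-minute datapoints
        "EvaluationPeriods": 5,  # alarm after five consecutive breaches
        "Threshold": float(max_age_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
    }


alarm = oldest_message_alarm("tasks.fifo", 900)
```

In real use the dictionary would be unpacked into `boto3.client('cloudwatch').put_metric_alarm(**alarm)`, typically along with an `AlarmActions` entry pointing at a notification topic.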
Security Checklist
- Authentication: Use IAM roles for AWS SDK clients; avoid long-term access keys
- Authorization: Use IAM policies for SQS/SNS access; principle of least privilege
- Encryption in transit: Enable TLS; use VPC endpoints for private access
- Encryption at rest: Enable SQS server-side encryption (SSE) with KMS
- VPC endpoints: Use AWS PrivateLink to keep traffic within AWS network
- Message content: Do not send sensitive data unencrypted; use SNS message encryption or application-level encryption
- Cross-account access: Use resource policies for cross-account SNS subscriptions
- Audit logging: Enable CloudTrail for all SQS and SNS API operations