Amazon DynamoDB: Scalable NoSQL with Predictable Performance

Deep dive into Amazon DynamoDB architecture, partitioned tables, eventual consistency, on-demand capacity, and the single-digit millisecond SLA.

published: March 24, 2026 reading time: 47 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

DynamoDB spreads data across partitioned tables using consistent hashing, delivering single-digit millisecond latency at any scale. You can run it in on-demand mode for unpredictable traffic or provisioned mode with auto-scaling for steady workloads, and secondary indexes unlock flexible query patterns beyond simple key-value access. The DAX in-memory cache sits in front of DynamoDB for read-heavy microsecond-latency requirements. The real win is predictable performance without operational overhead: DynamoDB handles splitting hot partitions, replication across availability zones, and capacity adjustments automatically.

Amazon DynamoDB: Scalable NoSQL with Predictable Performance

Introduction

DynamoDB traces back to an internal Amazon project tackling the shopping cart’s scalability problems in the mid-2000s. The 2007 Dynamo paper introduced consistent hashing, vector clocks, and eventual consistency, ideas that shifted how engineers approached distributed storage.

Amazon launched DynamoDB as a managed service in 2012. Today it powers massive-scale applications at companies like Airbnb, Dropbox, and Netflix. The core trade-off from the original paper: relax strong consistency, gain scalability and availability.

Core Concepts

DynamoDB organizes data in tables. Each table holds items, and each item has attributes. DynamoDB is schemaless except for the primary key, so items can have different attributes.

Primary keys come in two flavors:

Simple: single attribute (partition key only)
Composite: two attributes (partition key + sort key)

# DynamoDB item example
{
    "UserId": "user-12345",        # Partition key
    "OrderId": "order-9876",        # Sort key (if composite)
    "Status": "shipped",
    "TotalAmount": 129.99,
    "Items": ["SKU-001", "SKU-042"],  # List attribute
    "Metadata": {                    # Map attribute
        "ShippingAddress": "123 Main St",
        "Carrier": "UPS"
    }
}

DynamoDB spreads data across storage nodes called partitions. The partition key determines which partition holds your data, and DynamoDB routes requests accordingly.

Data Distribution: Consistent Hashing

DynamoDB uses consistent hashing for data distribution. Hash the partition key to find the partition. Add or remove nodes, and only a fraction of the data needs to move. This is the core mechanism behind DynamoDB’s horizontal scaling.

graph TD
    A[Key Space 0-2^32] --> B[Partition 1]
    A --> C[Partition 2]
    A --> D[Partition 3]
    A --> E[Partition N]

    B --> F[Replica 1]
    C --> G[Replica 2]
    D --> H[Replica 3]

    F --> I[us-east-1a]
    G --> J[us-east-1b]
    H --> K[us-east-1c]

Each partition replicates across three availability zones automatically. AWS uses a Paxos variant for partition leader election, though the details are internal.

Partitions split when they exceed capacity (roughly 10GB or 3,000 write capacity units). As your table grows, DynamoDB handles the redistribution.

Hot Partitions: Problems and Mitigation

A hot partition occurs when one partition key receives disproportionately high traffic. Since DynamoDB routes all requests for that key to a single partition, it becomes a bottleneck while other partitions sit idle.

Common causes:

Using a low-cardinality attribute as partition key (e.g., “status” with values like “active”, “pending”, “completed”)
Viral content where one item gets massively more reads than others
Time-based keys causing all writes during peak hours to hit one partition

Mitigation strategies:

Randomized partition keys: Add a random suffix to distribute writes across partitions

import random

def generate_partition_key(user_id):
    # Spread writes across 10 virtual partitions
    random_suffix = random.randint(0, 9)
    return f"{user_id}#{random_suffix}"

# Trade-off: now you must query all 10 partitions to find a user

Write sharding with high-cardinality salts: Use multiple distinct values that map to the same underlying entity
Read centralization: For read-heavy hot items, use DAX (DynamoDB Accelerator) to cache aggressively
Adaptive capacity: DynamoDB now automatically distributes traffic for eventually consistent reads, but writes still hit the partition directly

Warning signs of hot partitions:

ProvisionedThroughputExceededException errors affecting only certain keys
ConsumedWriteCapacityUnits showing one partition at 100% while others are under 10%
Latency spikes on specific keys during traffic bursts

# Monitoring hot partition with CloudWatch
import boto3

cloudwatch = boto3.client('cloudwatch')

# Get partition-level metrics
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/DynamoDB',
    MetricName='ConsumedWriteCapacityUnits',
    Dimensions=[
        {'Name': 'TableName', 'Value': 'Orders'}
    ],
    StartTime='2024-01-01T00:00:00Z',
    EndTime='2024-01-02T00:00:00Z',
    Period=3600,
    Statistics=['Sum']
)

When hot partitions are unavoidable: Consider whether DynamoDB is the right fit, or split the entity into multiple tables with different partition schemes.

Consistency Models

DynamoDB lets you pick consistency per request:

Eventually Consistent Reads (default):

Returns data within milliseconds
Might occasionally return stale data
Highest throughput, lowest latency
Costs 0.5 RCU

Strongly Consistent Reads:

Always returns the most recent write
Higher latency due to synchronous replication
Costs 1 RCU

import boto3

dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamodb.Table('Orders')

# Eventually consistent read (default)
response = table.get_item(Key={'OrderId': '123'})

# Strongly consistent read
response = table.get_item(
    Key={'OrderId': '123'},
    ConsistentRead=True
)

For read-heavy applications, eventual consistency is the obvious choice. For inventory checks or anything needing fresh data, strong consistency is worth the extra RCU.

Capacity Management

DynamoDB offers two capacity modes, switchable once per day:

Provisioned Mode

Provisioned mode asks you to declare upfront how many reads and writes per second your table needs. DynamoDB reserves that capacity and bills you by the hour regardless of actual traffic. The core unit is the RCU (Read Capacity Unit) and WCU (Write Capacity Unit). One RCU delivers one strongly consistent read per second for items up to 4KB, while one WCU delivers one write per second for items up to 1KB. Larger items consume proportionally more capacity.

DynamoDB provisions partition-level capacity based on your table’s total RCU and WCU. If your table has 100 RCU and 50 WCU spread across 10 partitions, each partition gets roughly 10 RCU and 5 WCU of reserved throughput. That reservation is where the cost advantage comes from at sustained throughput — you pay a fixed hourly rate that works out 4–5x cheaper than on-demand for continuous high-utilization workloads.

Auto-scaling is where provisioned mode gets flexible. You set a minimum and maximum capacity bound and a target utilization percentage (AWS defaults to 70%). CloudWatch metrics drive the scaling decisions: when consumed capacity trends above the target for a sustained period, DynamoDB adds capacity up to the maximum. When utilization drops, it scales down to the minimum. The tradeoff is adjustment lag — DynamoDB evaluates utilization over a cooldown period, so a sudden traffic spike briefly exceeds your provisioned capacity before auto-scaling reacts. For most production workloads this gap is bridged by the SDK’s built-in retry with exponential backoff.

# Provisioned capacity specification
{
    'TableName': 'Orders',
    'ProvisionedThroughput': {
        'ReadCapacityUnits': 100,    # 100 strongly consistent reads/sec
        'WriteCapacityUnits': 50     # 50 writes/sec
    }
}

Provisioned mode makes sense when your traffic is predictable or when you can absorb the brief throttling during auto-scaling lag. It is the default choice for production tables with known load patterns. The fixed cost lets you forecast billing accurately and reserve capacity at a steep discount compared to on-demand pricing.

On-Demand Mode

On-demand mode is DynamoDB’s fully elastic capacity model. You pay per read and write request with no capacity reservations, no hourly commitments, and no need to estimate your workload. DynamoDB instantly adapts to traffic from zero to millions of requests per second without any configuration changes on your part.

The pricing mechanics are straightforward: one read request unit (on-demand) is consumed per strongly consistent read or two eventually consistent reads per second, and one write request unit per write per second. This maps directly to the RCU and WCU concept from provisioned mode, but you never think about those units — DynamoDB tracks your actual consumed throughput and bills accordingly. At sustained high throughput, on-demand runs 4–5x more expensive than provisioned. What you are paying for is the ability to go from near-zero traffic to a 100x spike in seconds without pre-arranging capacity.

On-demand eliminates the capacity planning step entirely. For new tables where you do not yet know the load, for migration projects with temporary high write volumes, or for event-driven workloads that spike unpredictably, on-demand means you never throttle due to under-provisioning. The tradeoff is that you cannot set billing alerts at the capacity level — you only know your cost after traffic has already moved through. DynamoDB adapts capacity in roughly 4-minute windows, so traffic bursts within a single window see throttling before the service scales up. For most burst scenarios, the SDK’s built-in retry with exponential backoff bridges the gap.

On-demand works best when your workload is genuinely unpredictable or when you are in the early stages of understanding your traffic patterns. Once the load stabilizes, you can switch to provisioned with auto-scaling and immediately cut costs by 60–80% for the same effective throughput.

On-Demand vs Provisioned Break-Even Calculator

Use this framework to decide which capacity mode makes financial sense for your workload:

def calculate_monthly_cost_dynamodb(mode, rcu_per_second, wcu_per_second):
    """
    Calculate monthly DynamoDB cost for a given mode and capacity.

    Pricing as of 2024 (us-east-1):
    - Provisioned RCU: $0.00013 per RCU-hour
    - Provisioned WCU: $0.000065 per WCU-hour
    - On-demand: $0.25 per million read units
    - On-demand: $1.25 per million write units

    Each RCU supports 1 strongly consistent read/sec or 2 eventually consistent reads/sec
    Each WCU supports 1 write/sec
    """
    if mode == 'provisioned':
        hours_per_month = 730  # Average month
        rcu_cost = rcu_per_second * hours_per_month * 0.00013
        wcu_cost = wcu_per_second * hours_per_month * 0.000065
        return rcu_cost + wcu_cost
    elif mode == 'ondemand':
        # Convert per-second capacity to monthly request units
        reads_per_month = rcu_per_second * 3600 * 24 * 30
        writes_per_month = wcu_per_second * 3600 * 24 * 30
        # On-demand pricing (reads are per RCU, not per read unit)
        # 1 RCU = 1 strongly consistent read/sec = 2 eventually consistent reads/sec
        read_units = reads_per_month  # Assuming strongly consistent
        write_units = writes_per_month
        return (read_units / 1_000_000) * 0.25 + (write_units / 1_000_000) * 1.25

# Example: 100 RCU, 50 WCU sustained
rcu = 100
wcu = 50
provisioned_cost = calculate_monthly_cost_dynamodb('provisioned', rcu, wcu)
ondemand_cost = calculate_monthly_cost_dynamodb('ondemand', rcu, wcu)

print(f"Provisioned: ${provisioned_cost:.2f}/month")
print(f"On-demand: ${ondemand_cost:.2f}/month")
print(f"On-demand is {ondemand_cost/provisioned_cost:.1f}x more expensive at sustained load")

Break-even calculation:

The break-even point occurs when on-demand pricing equals provisioned pricing:

provisioned_monthly = RCU * 730 * $0.00013 + WCU * 730 * $0.000065
ondemand_monthly = (RCU * 2592000 / 1M) * $0.25 + (WCU * 2592000 / 1M) * $1.25

For 100 RCU and 50 WCU sustained: provisioned is ~$12.47/month while on-demand is ~$52.19/month at the same throughput. The break-even multiplier is roughly 4-5x for typical workloads.

Practical decision matrix:

Scenario	Recommended Mode	Reason
Steady 24/7 traffic	Provisioned	4-5x cheaper at sustained load
Predictable daily peaks	Provisioned + auto-scaling	Base capacity + elastic peak handling
Unpredictable traffic	On-demand	No throttling risk, pay per use
New table / unknown load	On-demand	Avoid over-provisioning
Migration with known load	Provisioned	More cost-effective once load known
Burst-heavy (night/day)	Provisioned + auto-scaling	Reserve for base, scale for bursts
< 25 RCU or < 25 WCU average	On-demand minimum	Below provisioned floor pricing

Auto-scaling as a hybrid approach:

For workloads with a predictable baseline but occasional spikes, provisioned with auto-scaling captures most savings while handling bursts:

# Configure auto-scaling for a table
appautoscaling = boto3.client('application-autoscaling')

# Register scalable target
appautoscaling.register_scalable_target(
    ServiceNamespace='dynamodb',
    ResourceId='table/Orders',
    ScalableDimension='dynamodb:table:WriteCapacityUnits',
    MinCapacity=10,   # Never below 10 WCU
    MaxCapacity=1000  # Can scale to 1000 WCU
)

# Define scaling policy
appautoscaling.put_scaling_policy(
    ServiceNamespace='dynamodb',
    ResourceId='table/Orders',
    ScalableDimension='dynamodb:table:WriteCapacityUnits',
    PolicyName='OrdersWriteCapacityScalingPolicy',
    TargetTrackingConfiguration={
        'TargetValue': 70.0,  # Keep utilization at 70%
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'DynamoDBWriteCapacityUtilization'
        }
    }
)

Auto-scaling has a 4-5 minute adjustment period. For very spiky workloads that exceed auto-scaling response time, on-demand or a manual capacity increase handles the spike.

Primary Operations

Core DynamoDB operations:

PutItem / BatchWriteItem: Write items, with optional conditional writes

table.put_item(
    Item={'UserId': '123', 'Email': 'user@example.com'},
    ConditionExpression='attribute_not_exists(UserId)'  # Prevent overwrites
)

GetItem / BatchGetItem: Retrieve items by primary key

# Single item retrieval
response = table.get_item(Key={'UserId': '123'})

# Batch retrieval (up to 100 items)
response = table.batch_get_item(
    RequestItems={
        'Orders': {'Keys': [{'OrderId': '1'}, {'OrderId': '2'}]}
    }
)

Query: Retrieve items by partition key and optional sort key range

# Query all orders for user-123 placed in 2024
response = table.query(
    KeyConditionExpression='UserId = :uid AND begins_with(OrderId, :year)',
    ExpressionAttributeValues={
        ':uid': 'user-123',
        ':year': '2024'
    }
)

Scan: Full table scan — expensive, avoid in production

# Scan reads every item - use sparingly
response = table.scan()

Error Handling and Retry Logic

DynamoDB throws specific exceptions that require different handling strategies. The most common is ProvisionedThroughputExceededException, which occurs when you exceed your reserved read or write capacity.

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError
import time
import random

dynamodb = boto3.resource('dynamodb')

# Configure retries with exponential backoff and jitter
config = Config(
    retries={
        'max_attempts': 10,
        'mode': 'adaptive'  # Adaptive mode adds rate-limiting awareness
    }
)
dynamodb = boto3.resource('dynamodb', config=config)

def put_item_with_backoff(table_name, item, max_retries=8):
    """Put item with exponential backoff and jitter."""
    table = dynamodb.Table(table_name)
    base_delay = 0.025  # 25ms base (DynamoDB SDK default)

    for attempt in range(max_retries):
        try:
            table.put_item(Item=item)
            return True
        except ClientError as e:
            error_code = e.response['Error']['Code']

            if error_code == 'ProvisionedThroughputExceededException':
                # Exponential backoff with full jitter
                delay = random.uniform(0, base_delay * (2 ** attempt))
                time.sleep(delay)
                base_delay = min(base_delay * 2, 5.0)  # Cap at 5 seconds
            elif error_code == 'ThrottlingException':
                # Same handling as ProvisionedThroughputExceeded
                delay = random.uniform(0, base_delay * (2 ** attempt))
                time.sleep(delay)
            elif error_code == 'InternalServerError':
                # Transient server-side error - retry
                delay = random.uniform(0, base_delay * (2 ** attempt))
                time.sleep(delay)
            else:
                # Non-retryable error
                raise

    raise Exception(f"Failed after {max_retries} attempts")

Common DynamoDB exceptions and their handling:

Exception	Cause	Retry Strategy
`ProvisionedThroughputExceededException`	RCU/WCU exceeded	Exponential backoff with jitter, 25ms base
`ThrottlingException`	Request rate too high	Same as above
`InternalServerError`	DynamoDB internal error	Retry with backoff
`RequestLimitExceeded`	Too many requests simultaneously	Backoff and reduce concurrency
`ResourceNotFoundException`	Table or GSI does not exist	Do not retry - fix the code
`ConditionalCheckFailedException`	Condition expression not met	Do not retry - business logic issue
`TransactionCanceledException`	Transaction conflict	Retry with backoff after resolution

Exponential backoff with jitter formula:

delay = random(0, min(cap, base * 2^attempt))

Parameter	Value	Notes
base_delay	25ms	SDK default, adjust based on table size
max_delay	5000ms	Cap to avoid long waits
max_attempts	10	After 10 failures, something is wrong
jitter	full	Random value between 0 and calculated delay

For provisioned capacity with auto-scaling, the SDK’s built-in retry handles most throttling automatically. For on-demand, throttling is more frequent during traffic spikes since DynamoDB adapts capacity in 4-minute windows.

Secondary Indexes

Primary key lookups are efficient, but access patterns often need other attributes. DynamoDB provides two index types:

Global Secondary Index (GSI):

Own partition key and optional sort key
Queries across the entire table
eventual or strong consistency
Projects attributes from main table

Local Secondary Index (LSI):

Shares partition key with main table
Different sort key
Strongly consistent reads only
Must be created with table, cannot be modified later

# Creating a GSI for email lookups
table.update(
    GlobalSecondaryIndexUpdates=[{
        'Create': {
            'IndexName': 'EmailIndex',
            'KeySchema': [{'AttributeName': 'Email', 'KeyType': 'HASH'}],
            'Projection': {'ProjectionType': 'ALL'},
            'ProvisionedThroughput': {
                'ReadCapacityUnits': 10,
                'WriteCapacityUnits': 10
            }
        }
    }]
)

GSIs handle most flexible access patterns. LSI only matters when you need strongly consistent queries within a partition key.

Sparse Index Tricks for GSIs

GSIs only index items that have the GSI key attribute. This behavior enables powerful patterns where you selectively include items in an index.

Pattern: Sparse GSI for Rarely-Accessed Conditional Data. Only items with the GSI key attribute get indexed. Items without that attribute are invisible to the GSI. This lets you maintain a sparse index containing only a subset of items.

# Example: GSI for "orders with issues only"
# Items without IssueFlag are not indexed, keeping GSI small

# Order item with no issue - not indexed
table.put_item(Item={
    'OrderId': 'order-001',
    'CustomerId': 'cust-123',
    'Status': 'delivered',
    'Total': 99.99
    # No IssueFlag attribute - invisible to IssueIndex GSI
})

# Order item with issue - indexed
table.put_item(Item={
    'OrderId': 'order-002',
    'CustomerId': 'cust-123',
    'Status': 'delivered',
    'Total': 99.99,
    'IssueFlag': 'REFUND_REQUESTED',
    'IssueReason': 'damaged_in_transit'
    # IssueFlag present - appears in IssueIndex GSI
})

Pattern: Sparse GSI for Soft-Deleted Items. Instead of deleting items immediately, set a Deleted attribute. Query the GSI for items without the deleted flag to find active records. This pattern is useful when you need to maintain deleted item history in DynamoDB Streams but want efficient queries for active items.

# Soft delete - add Deleted attribute instead of deleting
table.update_item(
    Key={'OrderId': 'order-001'},
    UpdateExpression='SET Deleted = :del, DeletedAt = :now',
    ExpressionAttributeValues={':del': True, ':now': int(time.time())}
)

# Query GSI for non-deleted orders only
response = table.query(
    IndexName='StatusIndex',
    KeyConditionExpression='CustomerId = :cid AND #status = :status',
    FilterExpression='attribute_not_exists(Deleted)',
    ExpressionAttributeNames={'#status': 'Status'},
    ExpressionAttributeValues={':cid': 'cust-123', ':status': 'pending'}
)

Pattern: Versioned Data with Sparse GSI. Store multiple versions of an item using a version key. Only the current version has the CurrentVersion flag, making the GSI index small and efficient for “current” queries.

# Historical versions without CurrentVersion
table.put_item(Item={
    'EntityId': 'user-profile-123',
    'VersionId': 'v1',
    'Data': {'name': 'Alice', 'email': 'alice@v1.com'},
    'UpdatedAt': 1704067200
})

# Current version with CurrentVersion attribute
table.put_item(Item={
    'EntityId': 'user-profile-123',
    'VersionId': 'v2',
    'Data': {'name': 'Alice', 'email': 'alice@v2.com'},
    'UpdatedAt': 1704153600,
    'CurrentVersion': True  # Sparse index only includes this
})

# Query GSI to get current version only
response = table.query(
    IndexName='CurrentVersionIndex',
    KeyConditionExpression='EntityId = :eid',
    FilterExpression='attribute_exists(CurrentVersion)'
)

Sparse GSI trade-offs:

Aspect	Consideration
GSI size	GSI contains only items with the key attribute - can be much smaller than main table
Write amplification	Every write with the GSI key attribute consumes GSI write capacity
Null handling	Items without GSI key are invisible - cannot query for “missing” items
TTL interaction	TTL-deleted items remain in GSI until TTL processes them
Update behavior	Updating an item to remove the GSI key removes it from the GSI

Common sparse GSI use cases:

Orders with issues (separate index from all orders)
Active subscriptions vs expired
Featured products flag
User notifications that need efficient unread queries
Entities pending approval vs approved

DAX: DynamoDB Accelerator In-Memory Cache

DAX is a fully managed, in-memory cache for DynamoDB. It sits in front of DynamoDB and caches read results, dramatically reducing read latency for frequently accessed items.

When DAX Makes Sense

DAX makes sense when your read pattern looks like this: repeated reads to the same hot keys, with data that changes infrequently compared to how often you read it. If your application queries the same set of items thousands of times per second and those items only update once every few minutes or longer, DAX eliminates the DynamoDB round-trip entirely. The latency difference is real — sub-millisecond cache hits versus 3-5ms DynamoDB reads changes how tightly-coupled request chains behave, like a product catalog serving a flash sale where the same featured items are read by every session.

The cost-benefit comes down to whether cutting DynamoDB latency by 3ms actually improves what the user sees. If your application already batches reads behind a slower downstream service, 3ms off DynamoDB may not matter. DAX helps most when DynamoDB is the outermost latency bound in your request chain — session stores, user profiles, game state, or configuration data where the application directly returns DynamoDB data to the end user. For batch processing pipelines or async workflows where DynamoDB feeds into a slower operation, the cache benefit rarely justifies the cost.

DAX also helps during traffic spikes that would otherwise throttle. If your provisioned capacity is sized for 80% average utilization and you experience a sudden 5x traffic burst, DAX absorbs the excess reads from cache while DynamoDB adapts. DAX caches the result, so repeated reads during the spike are served from memory. Once the burst subsides, cache hit rates normalize. For sustained high-throughput workloads where you can afford the provisioned capacity, DAX is rarely the right answer — you would simply provision more RCU. DAX makes sense when the spike is short and unpredictable and you do not want to pay for idle capacity between spikes.

When DAX is Not the Answer

DAX does not work for workloads that need fresh data on every read. DAX serves eventually consistent data — if an item was written milliseconds ago, a cache hit returns the stale value. For inventory counts, seat reservations, financial balances, or any domain where reading stale data causes business loss, DAX is not a viable caching layer. You would need to implement explicit cache invalidation on every write, which reintroduces the latency you were trying to eliminate and adds significant complexity. At that point, ElastiCache with write-through caching is a better fit, though it requires application-level cache management.

Write-heavy workloads also get no benefit from DAX. The cache only stores read results, so a workload that is 90% writes — like IoT telemetry ingestion, event logging, or time-series data collection — gets no performance improvement from an in-memory cache. The write path still hits DynamoDB directly on every operation, and the cache infrastructure sits idle. If your application is predominantly writes, DAX is a pure cost with zero performance return. The same applies to workloads with very low read repetition — if each item is read only once or twice before being replaced, the cache hit rate stays near zero and you are paying for infrastructure that is not helping.

DAX also has a subtler problem with strongly consistent reads. If your application requires ConsistentRead=True on every request, DAX cannot help because it only supports eventually consistent reads. Some teams adopt DAX expecting transparent performance improvement and then discover that critical features were silently degraded to eventual consistency. Before deploying DAX, audit every query that needs fresh data and decide whether to exclude those operations from DAX or accept the consistency trade-off explicitly. For applications where the correctness of every read is paramount — healthcare records, payment processing, concurrency-sensitive game mechanics — DAX is not the answer.

DAX Architecture

DAX is a clustered in-memory cache that sits between your application and DynamoDB. When your application calls GetItem, the request goes to the DAX cluster first — a cache hit returns from memory in sub-millisecond time, while a cache miss transparently fetches from DynamoDB, caches the result, and returns it to your application. Your code calls the same DynamoDB API; DAX handles the cache logic transparently.

graph LR
    A[Application] --> B[DAX Cluster]
    B --> C[Node 1 Cache]
    B --> D[Node 2 Cache]
    B --> E[Node 3 Cache]
    C --> F[DynamoDB]
    D --> F
    E --> F

The cluster is where the HA story lives. A DAX cluster runs across multiple nodes (minimum three for production), each caching the same data in a distributed in-memory store. Reads are distributed across nodes using consistent hashing, so cache hits are spread across the cluster rather than concentrated on a single node. If one node fails, the cluster detects the failure via health checks and routes requests to the remaining nodes without your application noticing. DAX also uses write-through caching: when you write an item to DynamoDB via DAX, the cluster automatically invalidates the cached copy so the next read returns the fresh value.

Cache invalidation is event-driven. DAX subscribes to the DynamoDB Streams of your table — any write, update, or delete to an item in DynamoDB generates a stream event that DAX consumes and uses to evict the affected item from cache. This keeps the cache coherent without any application-level invalidation calls. The invalidation happens within seconds of the DynamoDB write, which is why DAX is eventually consistent by design — a write that happened milliseconds ago may still return the old value on a cache hit before the invalidation propagates.

Node sizing in DAX is simple: you pick a node type (similar to EC2 instance types) and a number of nodes. Larger nodes cache more items; more nodes mean higher throughput and better fault tolerance. DAX handles replication, failover, and cache distribution automatically, so you never touch the cluster configuration directly.

Code Example

import boto3
from amazon.dax.runtime import Cluster

# DAX client - same API as DynamoDB
dax = boto3.resource('dynamodb', endpoint_url='https://dax.us-east-1.amazonaws.com')

# Same get_item call, but now goes through DAX cache
table = dax.Table('Orders')
response = table.get_item(Key={'OrderId': '123'})

# DAX automatically handles:
# - Cache hit: returns from memory (< 1ms)
# - Cache miss: fetches from DynamoDB, caches result, returns
# - Item updated in DynamoDB: DAX invalidates that item

DAX vs ElastiCache Comparison

Aspect	DAX	ElastiCache (Redis)
Integration	Native DynamoDB API	Requires code changes
Cache invalidation	Automatic on DynamoDB writes	Manual invalidation logic
Strong consistency	Not supported (eventual only)	Possible with write-through
Item-level caching	Yes	Yes
Query caching	No	Yes
Cluster management	Fully managed	Requires cluster management

Bottom line: DAX is simpler if you only need item-level caching with DynamoDB. Use ElastiCache if you need query result caching or cross-table data caching.

NoSQL Databases covers the NoSQL landscape DynamoDB sits in
CAP Theorem explains why DynamoDB chose eventual consistency
Consistent Hashing describes DynamoDB’s partitioning strategy
Database Replication explains the cross-AZ replication model

Streams and Triggers

DynamoDB Streams captures a time-ordered sequence of item changes:

INSERT: New item added
MODIFY: Item updated
REMOVE: Item deleted

DynamoDB Streams Limitations

Streams are powerful but come with constraints that matter at scale:

Ordered per-partition only:

DynamoDB Streams maintains arrival order only within a single partition key. If you have events for UserId=A and UserId=B interleaved in time, the stream guarantees ordering for A’s events among themselves and B’s events among themselves, but not across A and B. This matters for event processing where causality across partition keys matters.

# Stream record example - note the partition key determines ordering scope
for record in stream_records:
    event_source_arn = record['eventSourceARN']  # Table and stream ARN
    event_name = record['eventName']             # INSERT/MODIFY/REMOVE
    partition_key = record['dynamodb']['Keys']['UserId']['S']  # Ordering scope
    sequence_number = record['dynamodb']['SequenceNumber']
    # Events for same UserId arrive in sequence; cross-partition ordering is not guaranteed

24-hour retention cap:

Streams retain only 24 hours of data. For longer retention, you must pipe events to Kinesis Data Streams, Kafka, or a custom consumer that writes to persistent storage.

# Kinesis Data Streams as a longer-retention alternative
import boto3

kinesis = boto3.client('kinesis')
dynamodb = boto3.client('dynamodb')

# Enable DynamoDB Streams to Kinesis integration
dynamodb.enable_kinesis_streaming_destination(
    TableName='Users',
    StreamArn='arn:aws:kinesis:us-east-1:123456789:stream/user-events'
)

Shard consumption parallelism:

A stream’s shards determine your Lambda concurrency. A single shard processes roughly 1MB/second and supports one concurrent Lambda invocation. If your table has 100 active partitions, you need at least 100 shards for full parallel processing. You can pre-split shards using UpdateTable with ProvisionedThroughput.

Limitation	Impact	Mitigation
Per-partition ordering only	Cross-partition event ordering not guaranteed	Design partition keys around event co-ordering needs
24-hour retention	Long-term event history unavailable	Pipe to Kinesis/Kafka for archival
Shard-level parallelism	High-partition tables need corresponding shards	Pre-split shards before enabling streams
No dead-letter queue	Failed Lambda events lost after retry exhaustion	Configure SQS DLQ on Lambda async invocation
GSI updates not tracked	GSI changes do not generate stream events	Query GSI separately if GSI change capture needed

Build trigger-like behavior with Lambda:

# Lambda function triggered by DynamoDB stream
def lambda_handler(event, context):
    for record in event['Records']:
        if record['eventName'] == 'INSERT':
            new_item = dynamodb_types.unmarshal(record['dynamodb']['NewImage'])
            send_welcome_email(new_item['Email'])

Streams retain 24 hours of changes, fine for most use cases. For longer retention, pipe events to Kinesis or a custom consumer.

TTL Behavior and Tombstone Garbage Collection

DynamoDB TTL automatically deletes items after a specified timestamp, but the deletion process works differently than a direct DeleteItem call.

How TTL deletion works:

When an item’s TTL attribute passes the current time, DynamoDB marks it as expired
The expired item appears in your table but becomes invisible to queries
A background process removes the item within 48 hours (typically minutes)
The deletion generates a REMOVE event in DynamoDB Streams

# Setting TTL on an item - TTL attribute must be a Unix timestamp in seconds
import time

table.put_item(
    Item={
        'UserId': 'user-123',
        'Email': 'user@example.com',
        'SessionData': '...',
        'TTL': int(time.time()) + 86400 * 30  # Expire in 30 days
    }
)

# Check if TTL is enabled on the table
table.meta.client.describe_time_to_live(TableName='Users')

Tombstone behavior:

Unlike Cassandra where deletions create tombstones that persist until compaction, DynamoDB tombstones exist only during the 48-hour deletion window. After that, the item is permanently removed.

Garbage collection impact:

TTL deletions do not consume write capacity
Items deleted by TTL do not appear in on-demand backup exports
Point-in-time recovery includes TTL-deleted items within the retention window
Global Tables replicate TTL deletions across regions

Common TTL pitfalls:

Pitfall	Impact	Mitigation
TTL attribute in past	Item immediately invisible	Always set TTL to future timestamps
TTL on GSI partition key	GSI updates queued, delayed deletion	Avoid TTL on frequently indexed attributes
TTL during backups	Backup may include soon-to-expire items	Check TTL values after restore
Timezone confusion	TTL is Unix epoch seconds, not local time	Use `int(time.time())` or equivalent

TTL is approximate: DynamoDB guarantees items expire within 48 hours of the TTL timestamp, though in practice it usually happens within minutes. Do not use TTL for time-sensitive deletions like session expiry where exact timing matters.

Global Tables

Global Tables replicate across AWS regions:

Active-active replication for low-latency global access
Automatic conflict resolution (last-writer-wins by default)
Sub-second replication between regions

# Create global table in two regions
dynamodb_client.create_table(
    TableName='Users',
    # ... table definition ...
    StreamSpecification={'StreamViewType': 'NEW_AND_OLD_IMAGES'}
)

# Enable multi-region replication
dynamodb_client.create_table(
    TableName='Users',
    # Specify replicas in us-west-2 and eu-west-1
)

Global Tables use all-or-nothing replication: either all replicas accept a write or none do. This gives eventual consistency with conflict resolution.

Backup and Restore

DynamoDB provides three backup mechanisms with different trade-offs: on-demand backups, point-in-time recovery, and cross-region backup replication.

On-Demand Backups

On-demand backups create a full snapshot of the table at a point in time. They do not affect read or write performance and retain until you delete them.

import boto3

dynamodb = boto3.client('dynamodb')

# Create on-demand backup
response = dynamodb.create_backup(
    TableName='Orders',
    BackupName='Orders-backup-2024-01-15'
)

backup_arn = response['BackupDetails']['BackupArn']
backup_creation_date = response['BackupDetails']['BackupCreationDateTime']

# Restore from backup
dynamodb.restore_table_from_backup(
    TargetTableName='Orders-Restored',
    BackupARN=backup_arn
)

# List all backups
backups = dynamodb.list_backups(TableName='Orders')
for backup in backups['BackupSummaries']:
    print(f"{backup['BackupName']}: {backup['BackupStatus']}")

On-demand backups capture the entire table including TTL-deleted items still within the 48-hour window.

Point-in-Time Recovery

Point-in-time recovery (PITR) continuously backs up your table with second-level granularity for the last 35 days. Enable it per table:

# Enable point-in-time recovery
dynamodb.update_continuous_backups(
    TableName='Orders',
    PointInTimeRecoverySpecification={
        'PointInTimeRecoveryEnabled': True
    }
)

# Check PITR status
status = dynamodb.describe_continuous_backups(TableName='Orders')
pitr_status = status['ContinuousBackupsDescription']['PointInTimeRecoveryDescription']['PointInTimeRecoveryStatus']
print(f"PITR status: {pitr_status}")

# Restore to specific time (must be within last 35 days)
dynamodb.restore_table_to_point_in_time(
    SourceTableName='Orders',
    TargetTableName='Orders-PITR-Restore',
    RestoreDateTime=datetime(2024, 1, 10, 12, 0, 0)
)

PITR has no performance impact and does not consume write capacity. Restoration typically takes minutes to hours depending on table size.

Cross-Region Backup with AWS Backup

For disaster recovery across regions, AWS Backup manages cross-region backup copies:

# Create cross-region backup plan with AWS Backup
backup = boto3.client('backup')

# Define backup rule with copy to another region
backup_plan = {
    'BackupPlanName': 'DynamoDB-CrossRegion-Backup',
    'Rules': [{
        'RuleName': 'CopyToDrRegion',
        'TargetBackupVaultName': 'backup-vault-us-east-1',
        'CopyActions': [{
            'DestinationBackupVaultArn': 'arn:aws:backup:us-west-2:123456789:backup-vault/dr-vault',
            'Lifecycle': {'DeleteAfterDays': 30}
        }],
        'Lifecycle': {'DeleteAfterDays': 7}
    }]
}

backup.create_backup_plan(BackupPlan=backup_plan)

Backup Comparison Table

Choosing a backup strategy comes down to three things: how far back you need to recover, whether you need cross-region protection, and what compliance requirements you have. On-demand backups work for tables where recovery is infrequent and always to a specific known-good point in time — weekly snapshots for disaster recovery, for example, or backups taken before major schema migrations. They are manual, retained until deleted, and priced per backup based on table size. If your table is 50GB and you take one backup per week, you pay for 50GB of storage per week retained.

Point-in-time recovery acts as the operational safety net that runs continuously in the background. Once enabled, it logs every write for 35 days and lets you restore to any second within that window. The granularity is second-level, which matters if your data changes frequently and you cannot afford to lose more than minutes of work. PITR costs a flat hourly rate plus storage for the change logs, and it has no performance impact on the table. Enable it on every production table by default and treat it as baseline insurance. The limitation is that it does not protect against region-level failures; if the entire region goes dark, your PITR restore point is in the same region.

AWS Backup cross-region copies fill the gap for disaster recovery planning that requires geographic redundancy. If your compliance posture demands that backups survive the loss of a primary region, AWS Backup can orchestrate continuous copies to a secondary region on a configurable schedule and retention policy. This costs more — you pay for data transfer between regions plus storage in the destination region — and restore time includes cross-region data transfer. For most applications, PITR plus on-demand backups to S3 in a secondary region is sufficient. For regulated industries with explicit RTO and RPO requirements, the cross-region AWS Backup approach with documented and tested runbooks is the right choice.

Feature	On-Demand Backup	Point-in-Time Recovery	AWS Backup (Cross-Region)
Retention	Until deleted	35 days	Configurable (e.g., 30 days)
Granularity	Full table snapshot	Second-level within window	Configurable schedule
Performance impact	None	None	None
Cost	Per-backup storage	Continuous backup storage	Per-copy storage + transfer
Cross-region	Manual export to S3	Not native	Native managed copies
Restore time	Minutes to hours	Minutes to hours	Minutes to hours

Export to S3 for Long-Term Archival

For compliance or analytics needs, export table data to S3:

# Export to S3 (using Data Pipeline historically, now S3 Export directly)
dynamodb.export_table_to_point_in_time(
    TableArn='arn:aws:dynamodb:us-east-1:123456789:table/Orders',
    ExportTime=datetime(2024, 1, 15, 0, 0, 0),
    S3Bucket='my-dynamodb-exports',
    S3Prefix='exports/orders-2024-01-15/',
    ExportFormat='DYNAMODB_JSON'
)

Exports do not impact production performance and read from backup storage, not the live table.

Production Failure Scenarios

Failure	Impact	Mitigation
Hot partition key	One partition absorbs disproportionate traffic; throttles while other partitions are underutilized	Use partition key design that distributes traffic (add random suffix, use composite keys); split high-traffic items across multiple keys
Provisioned capacity exceeded	Requests throttled with `ProvisionedThroughputExceededException`	Enable auto-scaling; set appropriate RCU/WCU; use exponential backoff with jitter on retries
On-demand mode cost spike	Unpredictable traffic patterns can cause unexpectedly high bills	Set billing alerts; use provisioned capacity for predictable baseline + on-demand for spikes
DynamoDB outage in single region	Applications depending on that region fail until DNS failover	Use Global Tables for multi-region active-active; implement cross-region read replicas if active-active not needed
Cache invalidation race	DAX returns stale data after DynamoDB write due to eventual consistency	DAX is not strong-consistency compatible; use short TTL; invalidate DAX entry explicitly on writes
Transaction conflict	Concurrent transactions modifying same item in opposite directions both fail	Design item access patterns to minimize conflicts; use conditional writes with exponential backoff
TTL expired but not deleted immediately	Items remain viewable briefly after TTL expires	TTL is approximate (within 48 hours of actual expiry); if strict expiry needed, handle deletion in application logic

Common Pitfalls / Anti-Patterns

When to Use DynamoDB

DynamoDB works best when you design your access patterns upfront in exchange for operational simplicity. You define your access patterns before deployment, model your data around those patterns, and DynamoDB handles everything else — scaling, replication, failover, capacity management. This is a different mental model from relational databases where you can evolve queries as requirements change. The benefit is that DynamoDB never needs you to manage shards, tune indexes, or provision hardware. The trap is treating it like a general-purpose database and then fighting the model when you need a query that does not fit.

“Known access patterns” is not a trivial requirement. It means you can enumerate every query you will run against the table and express them as partition key lookups, sort key ranges, or GSI queries. If your application needs ad-hoc filtering — find all orders where status = 'pending' regardless of which user placed them — DynamoDB is not a good fit. You would need a GSI with status as the partition key, which only works if that is your primary access pattern, not an occasional investigative query. The 400KB item limit also means large objects and documents belong in S3, not DynamoDB. Design around entity relationships that map cleanly to primary keys and sort keys.

Predictable performance in DynamoDB means something specific: your latency stays in the single-digit millisecond range regardless of table size, because DynamoDB routes requests at the partition level with no central coordinator. This is different from databases that slow down as data grows. But that predictability requires good partition key design — a hot partition destroys the guarantee and introduces the exact latency spikes DynamoDB is supposed to prevent. Before choosing DynamoDB, invest time in partition key strategy. If your access patterns are settled and your data model fits a key-value or document structure, DynamoDB scales without operational burden. If those conditions do not hold, the model will fight you.

When Not to Use DynamoDB

DynamoDB is the wrong choice when you need complex queries across multiple entities. There are no JOINs, no aggregations, and no ad-hoc filtering across tables — PostgreSQL or Elasticsearch handle those better. Strong consistency across multiple tables also does not work well since transactions can span multiple items but not multiple tables; cross-table consistency demands application-level coordination. Full-text search is not available natively — you would need OpenSearch or Elasticsearch as a sidecar. The 400KB item size limit rules out large objects, so media and big blobs belong in S3 instead. And if SQL compatibility matters, DynamoDB has none — migrating from relational databases is painful.

Quick Recap Checklist

Run through this before your next DynamoDB interview or when auditing a live table. Each item catches a real mistake engineers make in production.

Interview Questions

1. Explain how DynamoDB's partition key determines data distribution and what happens when a single partition key receives disproportionately high traffic.

Expected answer points:

DynamoDB hashes the partition key to determine which partition stores the item; requests for the same partition key always route to the same partition
A hot partition occurs when one key receives excessive traffic relative to other partitions, causing throttling while other partitions remain underutilized
Mitigation strategies include adding random suffixes to partition keys, write sharding with high-cardinality salts, using DAX for read-heavy hot items, and adaptive capacity for eventually consistent reads

2. What is the difference between eventually consistent reads and strongly consistent reads in DynamoDB, and when would you choose each?

Expected answer points:

Eventually consistent reads return data within milliseconds and might occasionally return stale data; they cost 0.5 RCU per read
Strongly consistent reads always return the most recent write due to synchronous replication across AZs; they cost 1 RCU and have higher latency
Choose eventually consistent reads for read-heavy applications like feeds, timelines, or dashboards where slightly stale data is acceptable
Choose strongly consistent reads for inventory checks, payment processing, or any operation requiring fresh data accuracy

3. How does consistent hashing enable DynamoDB's horizontal scaling?

Expected answer points:

Consistent hashing maps partition keys to a key space (0–2^32); each partition owns a range of that space
When partitions are added or removed, only a fraction of keys remap to different partitions, minimizing data movement
Partitions split when they exceed approximately 10GB or 3,000 write capacity units; DynamoDB automatically redistributes data
This mechanism allows DynamoDB to scale horizontally without a central bottleneck or coordinator

4. When should you choose provisioned capacity mode versus on-demand capacity mode?

Expected answer points:

Provisioned mode suits steady, predictable workloads where you can forecast RCU/WCU needs; 4–5x cheaper at sustained throughput
On-demand mode suits variable, spiky, or unpredictable workloads where you pay per request without capacity planning
Use provisioned with auto-scaling for a predictable baseline plus elastic peak handling
On-demand is ideal for new tables with unknown load and migration workloads with temporary high capacity needs
Break-even is approximately 4–5x: on-demand costs 4–5x more than provisioned for the same sustained throughput

5. What is the difference between a Global Secondary Index (GSI) and a Local Secondary Index (LSI)?

Expected answer points:

GSI has its own partition key and optional sort key; queries span the entire table; supports eventual or strong consistency; can be created/modified after table creation
LSI shares the same partition key as the main table but has a different sort key; only supports strongly consistent reads; must be created with the table and cannot be modified later
GSIs are more flexible and commonly used for alternate access patterns; LSIs are only needed when you require strongly consistent queries within a partition key

6. What is a sparse index in DynamoDB and what are two practical patterns that exploit it?

Expected answer points:

GSIs only index items that have the GSI key attribute; items without that attribute are invisible to the index, keeping the GSI small
Pattern 1 – Conditional indexing: Create a GSI for "orders with issues only" by including an IssueFlag attribute selectively; orders without issues are not indexed
Pattern 2 – Soft deletes: Use a Deleted attribute; items without Deleted appear in the "active" GSI; querying for items without Deleted efficiently returns only active records
Trade-off: write amplification (every item with the GSI key consumes GSI write capacity) and null handling (items without the key are invisible)

7. How does DynamoDB Streams work and what are its key limitations?

Expected answer points:

Streams capture a time-ordered sequence of item-level changes: INSERT, MODIFY, REMOVE events with before/after images
Limitation 1 – Ordered per-partition only: cross-partition event ordering is not guaranteed; design partition keys around event co-ordering needs
Limitation 2 – 24-hour retention: long-term event history requires piping to Kinesis Data Streams, Kafka, or custom persistent storage
Limitation 3 – Shard-level parallelism: full parallel processing requires corresponding number of shards (1MB/sec per shard); pre-split shards for high-partition tables
Limitation 4 – No native dead-letter queue: failed Lambda events lost after retry exhaustion; configure SQS DLQ on Lambda async invocation
Limitation 5 – GSI updates not tracked: GSI changes do not generate stream events

8. How does DynamoDB TTL work and what are common pitfalls to avoid?

Expected answer points:

TTL deletes items automatically after a specified timestamp; DynamoDB marks items as expired when TTL attribute passes current time, then removes them within 48 hours (typically minutes)
TTL generates REMOVE events in DynamoDB Streams so consumers can react to expirations
Pitfall 1 – TTL attribute in the past: item immediately becomes invisible; always set TTL to future timestamps
Pitfall 2 – TTL on GSI partition key: GSI updates are queued, causing delayed deletions
Pitfall 3 – Timezone confusion: TTL is Unix epoch seconds, not local time; always use `int(time.time())` or equivalent
TTL does not consume write capacity; items deleted by TTL do not appear in on-demand backup exports; PITR includes TTL-deleted items within the retention window

9. What is DAX (DynamoDB Accelerator) and when is it the right choice versus ElastiCache?

Expected answer points:

DAX is a fully managed, in-memory cache that sits in front of DynamoDB; cache hits return in microseconds vs. single-digit milliseconds from DynamoDB
DAX is the right choice for read-heavy workloads with hot data, microsecond latency requirements, and burst read patterns where you need to handle traffic spikes without throttling
DAX is NOT suitable for write-heavy workloads (does not cache writes), real-time data requirements, or when you need strongly consistent reads
Choose DAX over ElastiCache when you need native DynamoDB API compatibility and item-level caching without code changes
Choose ElastiCache when you need query result caching, cross-table data caching, or write-through caching for strong consistency

10. Explain the retry strategy for `ProvisionedThroughputExceededException` in DynamoDB.

Expected answer points:

Use exponential backoff with jitter: delay = random(0, min(cap, base × 2^attempt)) where base is 25ms and cap is 5000ms
Set max_attempts to 10; after 10 failures something is wrong and you should investigate capacity design
Adaptive retry mode in the SDK adds rate-limiting awareness, adjusting retry behavior based on throttling signals from DynamoDB
For provisioned capacity with auto-scaling, the SDK's built-in retry handles most throttling automatically
For on-demand, throttling is more frequent during traffic spikes since DynamoDB adapts capacity in 4-minute windows; exponential backoff bridges those windows

11. How do Global Tables provide multi-region replication in DynamoDB?

Expected answer points:

Global Tables provide active-active replication across AWS regions, enabling low-latency global access without a custom replication layer
They use all-or-nothing replication: either all replicas accept a write or none do
Automatic conflict resolution uses last-writer-wins by default; applications can implement custom conflict resolution if needed
Sub-second replication between regions is guaranteed by the managed service
Global Tables replicate TTL deletions across regions

12. What are the three backup mechanisms in DynamoDB and when would you use each?

Expected answer points:

On-demand backups: Full table snapshots retained until deleted; no performance impact; good for scheduled weekly/monthly snapshots and migrations
Point-in-time recovery (PITR): Continuous second-level backup for last 35 days; no performance impact; good for any table requiring recent point-in-time restore capability
AWS Backup cross-region: Managed service for disaster recovery with configurable retention and lifecycle; good for compliance requiring cross-region backup copies
Export to S3: For long-term archival and compliance; reads from backup storage, not live table; no performance impact
TTL deletion is NOT captured in on-demand backup exports but IS included in PITR within the retention window

13. What is the difference between Query and Scan operations in DynamoDB, and when would you use each?

Expected answer points:

Query retrieves items by partition key and optional sort key range; efficient because it targets specific partitions
Scan reads every item in a table or GSI; highly inefficient as it reads unrelated data and consumes large amounts of RCU
Use Query for all production access patterns; it should be your primary operation
Use Scan only for one-off administrative tasks, data exports, or when you genuinely need to process all items; never in hot production paths
BatchGetItem can retrieve up to 100 items by primary key in a single call, reducing round trips compared to multiple GetItem calls

14. DynamoDB transactions can span multiple items but not multiple tables. How would you implement cross-table consistency?

Expected answer points:

DynamoDB transactions (TransactWriteItems, TransactGetItems) operate within a single table or across tables in the same AWS account and region
For true cross-table consistency, use application-level coordination: write to both tables, use conditional writes to verify state, and implement compensating transactions for rollback
Alternatively, use DynamoDB Streams to trigger a Lambda that updates secondary tables and handles failures with retry logic or dead-letter queues
For strong cross-region consistency, DynamoDB transactions are not supported across regions; Global Tables provide eventual consistency across regions only

15. How would you design a DynamoDB schema for a multi-tenant SaaS application?

Expected answer points:

Use tenant ID as the leading element of the partition key (e.g., TenantID or TenantID#EntityType) to ensure data isolation and even distribution
If a single tenant's data exceeds partition limits, add a sub-entity ID as a sort key or use hierarchical keys (TenantID#UserID as PK, OrderID as SK)
GSIs should also start with TenantID to maintain tenant isolation in alternate access patterns
Implement row-level security at the application layer by always including TenantID in queries
Consider separate tables per tenant for extreme isolation requirements, but manage operational complexity

16. What strategies can you use to handle transactions and race conditions in DynamoDB?

Expected answer points:

Conditional writes: Use `ConditionExpression='attribute_not_exists(pk)'` or `attribute_exists(pk)'` to prevent overwrites or ensure presence
TransactWriteItems: Atomically write or delete up to 100 items across tables; fails if any condition fails
Optimistic locking: Store a version attribute; update only if version matches expected value using conditional expressions
Exponential backoff with jitter: Handle `TransactionCanceledException` due to conflicts between concurrent transactions modifying the same items
Design item access patterns to minimize conflicts: avoid having multiple concurrent processes modifying the same item in opposite directions

17. What are the item size limits in DynamoDB and how would you model data that exceeds those limits?

Expected answer points:

DynamoDB has a 400KB maximum item size limit (attributes plus key names and values)
For large objects, store the data in S3 and save the S3 object key as a DynamoDB attribute; never store media or large blobs directly in DynamoDB
For documents exceeding 400KB, split them into chunks (chunk 1, chunk 2, etc.) with a composite sort key, then reassemble in the application
Store metadata in DynamoDB for querying while the actual content lives in S3

18. How does DynamoDB handle network partitions and what consistency guarantees does it provide under the CAP theorem?

Expected answer points:

DynamoDB is a CP (Consistent and Partition-tolerant) system under the CAP theorem during network partitions
When a network partition occurs, DynamoDB sacrifices availability and returns errors for strongly consistent reads/writes until the partition heals
Eventually consistent reads may still be served during a partition (reducing RCU cost by half)
The 2017 DynamoDB paper clarified that DynamoDB chose to give up strong consistency for scalability and availability; you can achieve strong consistency at the cost of higher latency and RCU
DynamoDB's single-digit millisecond SLA is a deliberate trade-off: predictable performance over absolute consistency

19. What monitoring and alerting would you configure for a production DynamoDB table?

Expected answer points:

CloudWatch metrics to monitor: ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests, UserErrors, SystemErrors
Partition-level metrics via CloudWatch Contributor Insights to identify hot partition keys
Billing alerts: set CloudWatch alarms for unusual spend spikes, especially in on-demand mode
PITR status monitoring: ensure point-in-time recovery remains enabled
DAX cache hit/miss ratio if DAX is in use; high miss rate indicates cache sizing issues
Stream metrics: Lambda invocation errors and iterator age for stream consumers

20. What are the main reasons teams regret choosing DynamoDB and how can you avoid those pitfalls?

Expected answer points:

Choosing DynamoDB for complex relational data: no JOINs, no ad-hoc queries, no aggregations; teams end up building workarounds that negate DynamoDB's benefits
Poor partition key design: hot partitions cause throttling and unpredictable costs; invest upfront in partition key strategy
Underestimating GSI costs: each GSI has its own provisioned throughput; excessive GSIs multiply both cost and operational complexity
Vendor lock-in concerns: DynamoDB is a proprietary managed service with no SQL compatibility; ensure the team accepts this trade-off before committing
Over-engineering early: start simple with on-demand mode and evolve capacity mode as load becomes predictable; avoid premature optimization

Conclusion

DynamoDB embraces eventual consistency to achieve scale and availability. Predictable performance, managed operations, and flexible data modeling make it a solid choice for many modern applications.

Key points:

Partition key design matters for performance and cost
Eventually consistent reads halve RCU costs
GSIs enable query flexibility at the cost of additional throughput
On-demand mode simplifies capacity planning for variable workloads
Global Tables give active-active replication across regions

DynamoDB is not the right fit for complex queries (look at PostgreSQL), multi-item transactions (Aurora), or teams who cannot accept vendor lock-in. For high-throughput, access-pattern-driven data with predictable scaling needs, DynamoDB performs where few alternatives can.

The 2017 DynamoDB paper clarified the hard choices: always writable, scalable, and simple came at the cost of strong consistency. The managed service keeps those trade-offs while adding operational simplicity the original Dynamo lacked.

Amazon DynamoDB: Scalable NoSQL with Predictable Performance

Introduction

Core Concepts

Data Distribution: Consistent Hashing

Hot Partitions: Problems and Mitigation

Consistency Models

Capacity Management

Provisioned Mode

On-Demand Mode

On-Demand vs Provisioned Break-Even Calculator

Primary Operations

Error Handling and Retry Logic

Secondary Indexes

Sparse Index Tricks for GSIs

DAX: DynamoDB Accelerator In-Memory Cache

When DAX Makes Sense

When DAX is Not the Answer

DAX Architecture

Code Example

DAX vs ElastiCache Comparison

Streams and Triggers

DynamoDB Streams Limitations

TTL Behavior and Tombstone Garbage Collection

Global Tables

Backup and Restore

On-Demand Backups

Point-in-Time Recovery

Cross-Region Backup with AWS Backup

Backup Comparison Table

Export to S3 for Long-Term Archival

Production Failure Scenarios

Common Pitfalls / Anti-Patterns

When to Use DynamoDB

When Not to Use DynamoDB

Quick Recap Checklist

Interview Questions

Further Reading

Conclusion

Category

Tags

Related Posts

Apache Cassandra: Distributed Column Store Built for Scale

Google Spanner: Globally Distributed SQL at Scale

Column-Family Databases: Cassandra and HBase Architecture