Cache Patterns: Thundering Herd, Stampede Prevention, and Cache Warming
A comprehensive guide to advanced cache patterns — thundering herd, cache stampede prevention with distributed locking and probabilistic early expiration, and cache warming strategies.
Introduction
This guide covers thundering herd, circuit breakers, request coalescing, distributed locking, cache warming strategies, tiered caching, race conditions, invalidation, and production concerns you’ll face when scaling cache layers.
Core Concepts
The Thundering Herd Problem
When a popular cache entry expires, every request that hits it tries to rebuild from the database simultaneously. The database wasn’t designed for that kind of concurrent load. It slows down, requests pile up, and now you’ve turned a caching problem into a cascading failure.
graph TD
A[Cache Miss] --> B[Request DB]
B --> C[100 more requests hit expired key]
C --> D[All 100 try DB simultaneously]
D --> E[DB overwhelmed]
E --> F[Requests timeout]
F --> G[Cache never repopulated]
This is the cache stampede problem. Thundering herd is the same problem at larger scale — typically when an entire cache tier goes down or restarts.
graph TD
A[Cache goes down] --> B[All requests hit DB]
B --> C[DB overwhelmed]
C --> D[System crashes]
D --> E[Cache never recovers]
Circuit Breaker Pattern
Design your cache to degrade gracefully under failure. When cache failures exceed a threshold, bypass it entirely and go directly to the database.
class CircuitBreakerCache:
def __init__(self, redis_client, db, failure_threshold=5, timeout=30):
self.redis = redis_client
self.db = db
self.failure_count = 0
self.last_failure_time = 0
self.timeout = timeout
self.failure_threshold = failure_threshold
self.state = "closed" # closed, open, half-open
def get(self, key):
# If circuit open, skip cache and go to DB directly
if self.state == "open":
if time.time() - self.last_failure_time > self.timeout:
self.state = "half-open"
else:
return self.db.query(key)
try:
value = self.redis.get(key)
if value:
return json.loads(value)
# Cache miss - get from DB
result = self.db.query(key)
self.redis.setex(key, 3600, json.dumps(result))
return result
except Exception as e:
self.failure_count += 1
self.last_failure_time = time.time()
if self.failure_count >= self.failure_threshold:
self.state = "open"
# Fallback to DB directly
return self.db.query(key)
Request Coalescing
Multiple requests for the same key get coalesced into a single database query.
import asyncio
from concurrent.futures import Future
class RequestCoalescer:
def __init__(self):
self.pending = {} # key -> Future
async def get(self, key, fetch_func):
# Check if there's already a pending request for this key
if key in self.pending:
# Wait for the existing request
return await self.pending[key]
# Create a future for this request
future = Future()
async def fetch_and_store():
try:
result = await fetch_func(key)
future.set_result(result)
except Exception as e:
future.set_exception(e)
finally:
del self.pending[key]
self.pending[key] = future
asyncio.create_task(fetch_and_store())
return await future
Topic-Specific Deep Dives (H1)
Stampede Prevention
Three main approaches to prevent cache stampedes, each with different trade-offs.
Distributed Locking
Only one process rebuilds the cache; others wait or serve stale data.
import redis
import time
import json
def get_with_lock(key, fetch_func, lock_timeout=5, stale_timeout=30):
# Try to get from cache
value = redis.get(key)
if value:
return json.loads(value)
# Try to acquire lock
lock_key = f"lock:{key}"
lock = redis.lock(lock_key, timeout=lock_timeout)
if lock.acquire(blocking=True, blocking_timeout=1):
try:
# Double-check cache (another process might have populated)
value = redis.get(key)
if value:
return json.loads(value)
# Fetch from source
result = fetch_func(key)
redis.setex(key, 3600, json.dumps(result))
return result
finally:
lock.release()
else:
# Didn't get lock - wait briefly and retry cache
time.sleep(0.1)
value = redis.get(key)
if value:
return json.loads(value)
# Still nothing - either serve stale or error
return None
This prevents the stampede but adds latency for the first request. The others wait or get stale data.
Probabilistic Early Expiration
Refresh the cache before it expires, based on probability that decreases as TTL decreases.
import time
import random
import hashlib
def get_with_probabilistic_early_expiration(key, fetch_func, ttl=3600, beta=0.3):
cached = redis.get(key)
if cached:
return json.loads(cached)
# Get current TTL
current_ttl = redis.ttl(key)
if current_ttl <= 0:
current_ttl = ttl
# Calculate probability of early expiration
# Lower beta = less aggressive early refresh
remaining_ratio = current_ttl / ttl
probability = 1 - (remaining_ratio ** beta)
if random.random() < probability:
# Refresh in background (don't block)
def background_refresh():
try:
result = fetch_func(key)
redis.setex(key, ttl, json.dumps(result))
except Exception as e:
pass # Log but don't fail
# In production, submit to thread pool or queue
import threading
threading.Thread(target=background_refresh).start()
else:
# Normal cache miss handling
result = fetch_func(key)
redis.setex(key, ttl, json.dumps(result))
return result
return fetch_func(key) # Fallback
The math is: as TTL decreases, probability of refresh increases. Popular items get refreshed before they expire. The beta parameter controls aggressiveness.
Hysteresis (Stale-While-Refresh)
Keep serving stale data while refreshing in background, even after expiration.
def get_with_hysteresis(key, fetch_func, ttl=3600, stale_ttl=60):
value = redis.get(key)
if value:
return json.loads(value)
# Check for stale version
stale_value = redis.get(f"{key}:stale")
if stale_value:
# Return stale, trigger background refresh
def background_refresh():
try:
result = fetch_func(key)
redis.setex(key, ttl, json.dumps(result))
redis.delete(f"{key}:stale")
except Exception as e:
pass
import threading
threading.Thread(target=background_refresh).start()
return json.loads(stale_value)
# No stale version - must refresh synchronously
result = fetch_func(key)
redis.setex(key, ttl, json.dumps(result))
return result
This ensures users never hit a cache miss but requires careful handling of staleness.
Stampede Detection Counter Pattern
A lightweight stampede detector using atomic increment:
def get_with_stampede_detection(key, fetch_func, threshold=10, window=5):
"""
Detect stampede condition by counting concurrent misses.
Returns (value, was_stampede).
"""
stampede_key = f"stampede:{key}"
count = redis.incr(stampede_key)
if count == 1:
# First request - set expiry window
redis.expire(stampede_key, window)
if count > threshold:
# Stampede detected - log and potentially trigger mitigation
logger.warning("cache_stampede_detected",
key=key,
concurrent_requests=count)
return redis.get(key), True
try:
value = fetch_func(key)
redis.setex(key, 3600, json.dumps(value))
return value, False
finally:
# Only decrement if we're the last request in window
remaining = redis.decr(stampede_key)
if remaining <= 0:
redis.delete(stampede_key)
Cache Warming
On startup or after cache invalidation, the cache is empty. Users experience slow requests until popular items are repopulated. Cache warming pre-loads the cache before users need it.
When to Warm Proactively
Cache warming makes sense when:
- Known hot dataset: You can enumerate the top N keys that will be accessed
- After planned outage: Cache restarts, deployments, maintenance windows
- Predictable traffic spikes: Black Friday, product launches, marketing campaigns
- Regulatory/compliance: Some data must be in cache before users access it
Proactive Warming on Startup
def warm_cache_on_startup(redis_client, db, popular_keys):
"""
Load popular items into cache before users arrive.
Run this after cache restart or deployment.
"""
print(f"Warming cache with {len(popular_keys)} keys...")
for key in popular_keys:
try:
# Fetch from database
data = db.query(key)
# Store in cache with longer TTL
redis_client.setex(key, 86400, json.dumps(data)) # 24hr TTL
except Exception as e:
print(f"Failed to warm key {key}: {e}")
print("Cache warming complete")
# Common patterns for popular keys:
# - Top 1000 products by sales
# - Top 100 users by activity
# - Category listing pages
# - Feature flags and config
Background Warming
Warm cache entries in the background as they expire or as traffic patterns emerge.
import threading
import time
from collections import Counter
class BackgroundWarmingCache:
def __init__(self, redis_client, db, warm_threshold=100):
self.redis = redis_client
self.db = db
self.access_count = Counter()
self.warm_threshold = warm_threshold
self.running = True
# Start background thread
self.warmer_thread = threading.Thread(target=self._warm_loop)
self.warmer_thread.daemon = True
self.warmer_thread.start()
def track_access(self, key):
"""Track cache access for warming decisions"""
self.access_count[key] += 1
def get(self, key):
value = self.redis.get(key)
if value:
self.track_access(key)
return json.loads(value)
return None
def _warm_loop(self):
while self.running:
# Find hot keys that aren't in cache
for key, count in self.access_count.most_common(100):
if count >= self.warm_threshold and not self.redis.exists(key):
try:
data = self.db.query(key)
self.redis.setex(key, 86400, json.dumps(data))
self.access_count[key] = 0 # Reset after warming
except Exception:
pass
time.sleep(60) # Check every minute
When NOT to Warm
Don’t warm if:
- Access pattern is unpredictable: You’ll waste resources preloading unused data
- Data is highly dynamic: Values change faster than TTL expires
- Wide working set: Too much data to warm meaningfully before first user requests hit
- Just-in-time is fast enough: Database can handle initial cache misses
Demand-Fetch with Background Warming Hybrid
Instead of warming everything upfront, use a “speculative warming” approach:
class SpeculativeWarmer:
def __init__(self, redis_client, db, hot_threshold=100, warm_window=60):
self.redis = redis_client
self.db = db
self.hot_threshold = hot_threshold
self.warm_window = warm_window
self.access_log = Counter()
self.running = True
def record_access(self, key):
"""Called on every cache access"""
self.access_log[key] += 1
def speculatively_warm(self):
"""Run periodically to warm hot-but-missing keys"""
now = time.time()
for key, count in self.access_log.most_common(50):
if count >= self.hot_threshold:
if not self.redis.exists(key):
try:
data = self.db.fetch(key)
self.redis.setex(key, 3600, json.dumps(data))
self.access_log[key] = 0 # Reset after warming
except Exception:
pass
# Decay counts slightly to prioritize recent hot keys
for k in self.access_log:
self.access_log[k] = max(0, self.access_log[k] // 2)
Tiered Caching
Not all caches are equal. Using multiple cache tiers with different characteristics gives you the best of both worlds: fast access for hot data, larger capacity for warm data.
graph TD
A[Request] --> B[L1 Cache<br/>CPU Cache / In-Memory<br/>~MB, <1ms]
B -->|miss| C[L2 Cache<br/>Redis / Memcached<br/>~GB, <10ms]
C -->|miss| D[L3 Cache<br/>CDN / Edge<br/>~TB, <100ms]
D -->|miss| E[Database<br/>~TB, 10-100ms]
L1 + L2 Pattern
import redis
import threading
import time
class TieredCache:
def __init__(self, l1_size=1000, l2_client=None, ttl=3600):
# L1: local memory cache (thread-safe dict with LRU)
from collections import OrderedDict
class LRUCache:
def __init__(self, capacity):
self.capacity = capacity
self.cache = OrderedDict()
def get(self, key):
if key in self.cache:
self.cache.move_to_end(key)
return self.cache[key]
return None
def put(self, key, value):
if key in self.cache:
self.cache.move_to_end(key)
self.cache[key] = value
if len(self.cache) > self.capacity:
self.cache.popitem(last=False)
self.l1 = LRUCache(l1_size)
self.l2 = l2_client # Redis
self.ttl = ttl
def get(self, key):
# Check L1 first
value = self.l1.get(key)
if value:
return value
# Check L2
if self.l2:
value = self.l2.get(key)
if value:
# Promote to L1
self.l1.put(key, value)
return value
return None
def put(self, key, value):
# Write to both tiers
self.l1.put(key, value)
if self.l2:
self.l2.setex(key, self.ttl, value)
CDN + Origin Pattern
graph TD
A[User Request] --> B[CDN Edge<br/>~100ms TTL]
B -->|miss| C[CDN Origin<br/>~1hr TTL]
C -->|miss| D[Application<br/>~24hr TTL]
D -->|miss| E[Database]
For web content:
# CDN cache control headers
Cache-Control: public, max-age=100, s-maxage=100 # Edge cache 100s
Cache-Control: public, max-age=3600, s-maxage=3600 # Origin cache 1hr
# Private content (user-specific)
Cache-Control: private, max-age=3600 # Only browser caches this
Local Cache (L1) Pitfalls
L1 (local in-memory) caches give sub-millisecond access but introduce problems that don’t exist with a centralized L2 cache.
Memory Leaks in Process-Level Caches
Python’s dict-based LRU cache won’t automatically shrink:
# BAD: Unbounded local cache
l1_cache = {} # Grows forever
# GOOD: Bounded LRU cache with max size
from functools import lru_cache
@lru_cache(maxsize=1000)
def local_cache_lookup(key):
return l2.get(key)
Stale Data Across Nodes
This is the hardest L1 problem. When data updates in L2, different nodes may have different L1 values:
graph TD
A[Update to L2] --> B[Node 1 L1 invalidated]
A --> C[Node 2 L1 NOT invalidated yet]
A --> D[Node 3 L1 NOT invalidated yet]
B --> E[Node 1 sees new value]
C --> F[Node 3 sees stale value - 5 minutes old]
D --> G[Node 4 sees stale value - 5 minutes old]
Solutions:
- Short L1 TTL: Only a few seconds, essentially using L1 as a read buffer
- L1 Invalidation Broadcast: When L2 updates, publish invalidation to all nodes
- Version-Tagged Entries: L1 checks version against L2 before serving
class L1WithVersionCheck:
def __init__(self, l2_client, l1_size=1000):
self.l2 = l2_client
self.l1 = LRUCache(l1_size)
def get(self, key):
l1_entry = self.l1.get(key)
if l1_entry:
data, l1_version = l1_entry
l2_entry = self.l2.get(key)
if l2_entry:
_, l2_version = json.loads(l2_entry)
if l1_version >= l2_version:
return data # L1 is fresh enough
# L1 is stale, invalidate and fetch from L2
self.l1.delete(key)
# Fetch from L2 and promote to L1
l2_value = self.l2.get(key)
if l2_value:
data, version = json.loads(l2_value)
self.l1.put(key, (data, version))
return data
return None
NUMA Awareness
On multi-socket servers, a local L1 on one socket can’t be accessed efficiently from the other socket. If you’re running on such hardware, consider CPU-affinitized local caches or just use L2 for everything.
Race Conditions & Invalidation
When multiple processes or machines update the same cache key independently, you can end up with stale data overwriting fresh data, or fresh data being immediately clobbered by stale data. This is especially dangerous with the hysteresis and early expiration patterns.
The “Last Writer Wins” Problem
With distributed locking, the lock ensures only one process rebuilds at a time. But what happens when the database itself is being updated while the cache rebuilds?
# Timeline of a race condition
# T1: Process A acquires lock, starts DB query
# T2: Process B updates DB with new value
# T3: Process A finishes, writes stale value to cache
# T4: Cache now contains stale data until next expiration
# Result: fresh DB data is invisible because cache says otherwise
Handling Concurrent Writes with Version Vectors
One solution is to tag cache entries with a version or timestamp from the database:
def get_with_version(key, fetch_func, ttl=3600):
value = redis.get(key)
if value:
data, version = json.loads(value)
# Verify version against DB
current_version = db.get_version(key)
if version >= current_version:
return data
# Cache miss or stale version - must refresh
data, version = fetch_func(key)
redis.setex(key, ttl, json.dumps((data, version)))
return data
CAS (Compare-And-Swap) Pattern for Cache Updates
Use Redis WATCH/MULTI/EXEC for atomic read-modify-write:
def update_cache_atomic(key, update_func):
"""
Atomically read cache, apply update, write back.
Retries if cache was modified by another process.
"""
retries = 3
while retries > 0:
try:
redis.watch(key)
value = redis.get(key)
if value:
current = json.loads(value)
updated = update_func(current)
redis.multi()
redis.setex(key, 3600, json.dumps(updated))
redis.exec()
return updated
else:
redis.unwatch()
return None
except redis.WatchError:
retries -= 1
continue
Cache Invalidation Deep Dive
Cache invalidation is notoriously hard. Getting it wrong means users see stale data, or your cache never helps because entries are constantly evicted.
Write-Through vs Write-Back vs Write-Around
Three strategies keep cache and database in sync:
graph LR
subgraph WriteThrough
A[Write] --> B[Write to Cache]
B --> C[Write to DB]
C --> D[Return]
end
subgraph WriteBack
X[Write] --> Y[Write to Cache]
Y --> Z[Async Write to DB]
Z --> W[Return immediately]
end
subgraph WriteAround
P[Write] --> Q[Write directly to DB]
Q --> R[Invalidate Cache]
R --> S[Return]
end
Write-Through: Every write updates both cache and DB synchronously. Strong consistency but adds latency to every write. Best for read-heavy workloads where data must be correct.
Write-Back: Write to cache, return immediately, sync to DB asynchronously. Lowest write latency but risk of data loss if cache fails before DB is updated. Use when durability is handled at the DB level (e.g., SSDs, WAL).
Write-Around: Write directly to DB, invalidate (don’t update) the cache. Subsequent reads pull from DB and repopulate cache. Good for workloads where stale reads are acceptable for a brief window.
Invalidation Strategies
Immediate Invalidation: Update the cache immediately when the database changes:
def update_with_immediate_invalidation(key, value, db):
# Write to database
db.update(key, value)
# Immediately update cache
redis.setex(key, 3600, json.dumps(value))
Problem: If the cache update fails but DB update succeeds, you’re out of sync.
Delete-On-Write (Cache-Aside with Invalidation): The cleanest approach: write to DB, delete from cache. Don’t try to update the cache:
def update_with_delete_on_write(key, value, db):
db.update(key, value)
redis.delete(key) # Simple, reliable
# Next read will repopulate from DB
This avoids the “update cache failed” problem entirely.
TTL Selection Trade-offs
| TTL Setting | Pros | Cons | Best For |
|---|---|---|---|
| Short (seconds) | High freshness | Low hit rate | Real-time data, rapidly changing values |
| Medium (minutes) | Balanced | Stale reads possible | Most web content |
| Long (hours) | High hit rate | Stale data risk | Configuration, reference data |
| No TTL + manual invalidation | Perfect hit rate | Complexity, staleness risk | Stable reference data |
Rule of thumb: Set TTL to 1-10x the expected refresh interval. If data changes every ~10 minutes, a 30-60 minute TTL is reasonable.
Monitoring Invalidation Health
Track these metrics to catch invalidation problems early:
# Track staleness by comparing cache timestamp with DB timestamp
def check_invalidation_health(key):
cache_value = redis.get(key)
db_value = db.get(key)
if not cache_value or not db_value:
return {"status": "unknown"}
cache_ts = json.loads(cache_value).get("updated_at", 0)
db_ts = db.get_updated_at(key)
staleness_seconds = cache_ts - db_ts
return {
"status": "stale" if staleness_seconds > 60 else "fresh",
"staleness_seconds": staleness_seconds
}
Sharding Strategies
When a single cache instance isn’t enough, you need to distribute across multiple nodes.
Consistent Hashing
import hashlib
from bisect import bisect
class ConsistentHashRing:
def __init__(self, nodes, replicas=150):
self.replicas = replicas
self.ring = {}
self.sorted_keys = []
for node in nodes:
for i in range(replicas):
key = f"{node}:{i}"
hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
self.ring[hash_key] = node
self.sorted_keys.append(hash_key)
self.sorted_keys.sort()
def get_node(self, key):
if not self.ring:
return None
hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
idx = bisect(self.sorted_keys, hash_key)
if idx >= len(self.sorted_keys):
idx = 0
return self.ring[self.sorted_keys[idx]]
def add_node(self, node):
for i in range(self.replicas):
key = f"{node}:{i}"
hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
self.ring[hash_key] = node
self.sorted_keys.append(hash_key)
self.sorted_keys.sort()
def remove_node(self, node):
for i in range(self.replicas):
key = f"{node}:{i}"
hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
del self.ring[hash_key]
self.sorted_keys.remove(hash_key)
Trade-off Analysis
Stampede Prevention Comparison
| Pattern | Stampede Protection | Latency Impact | Complexity | Consistency Risk |
|---|---|---|---|---|
| Distributed Locking | Strong | First request blocked | Medium | Low |
| Probabilistic Early | Probabilistic | Minimal | Low | Low |
| Hysteresis | Strong | Zero (always served) | Medium | Medium |
| Request Coalescing | Strong | First request blocked | Medium | Low |
| No Protection | None | None | None | High (cascade) |
Write Strategy Comparison
| Strategy | Write Latency | Read Performance | Data Loss Risk | Complexity |
|---|---|---|---|---|
| Write-Through | High (sync) | Best | None | Low |
| Write-Back | Lowest (async) | Best | High | High |
| Write-Around | Medium | Initially poor | None | Lowest |
| Delete-on-Write | Medium | Initially poor | None | Lowest |
Cache Tier Comparison
| Tier | Capacity | Latency | Shared Across Nodes | Best For |
|---|---|---|---|---|
| L1 | MB | <1ms | No | Hottest 0.1% of keys |
| L2 | GB | 1-10ms | Yes | Hot dataset |
| L3 | TB | 10-100ms | Yes (global) | CDN edge, static assets |
| DB | TB+ | 10-100ms | Yes | Source of truth |
Invalidation Strategy Comparison
| Strategy | Staleness Window | Implementation | Failure Mode |
|---|---|---|---|
| TTL-based | Full TTL | Simplest | Stale data up to TTL |
| Immediate | None | Moderate | Cache/DB out of sync |
| Delete-on-Write | Until next read | Simple | Brief read miss |
| Version-tagged | None (checked) | Complex | Requires version sync |
When to Use / When Not to Use
| Pattern | When to Use | When Not to Use |
|---|---|---|
| Distributed Locking | Preventing cache stampede on popular keys; ensuring single-flight rebuilds | High-frequency writes; when lock contention is a problem |
| Probabilistic Early Expiration | Popular items with predictable access; reducing miss latency | Unpredictable access patterns; items with short TTLs |
| Hysteresis (Stale-While-Refresh) | Latency-sensitive applications; users must never wait | When stale data is unacceptable; very short TTLs |
| Circuit Breaker | External cache service; preventing cascade failures | Local/in-process caches; when cache is always available |
| Request Coalescing | High concurrency on same keys; thundering herd prevention | Low traffic; unique keys per request |
| Cache Warming | After deployments; known hot dataset; reducing cold-start latency | Unknown access patterns; dynamic content |
| Tiered Caching (L1+L2) | Multiple application instances; sub-millisecond access needed for hot data | Single instance; uniform access patterns |
| CDN + Origin | Static content; global user base; reducing origin load | Dynamic, personalized content; real-time data |
Decision Guide
graph TD
A[Problem] --> B{Cache unavailable?}
B -->|Yes| C[Use Circuit Breaker]
B -->|No| D{Stampede on miss?}
D -->|Yes| E{Hot key access pattern?}
D -->|No| F{Tiered Latency Needed?}
E -->|Predictable| G[Probabilistic Early Expiration]
E -->|High Concurrency| H[Request Coalescing]
E -->|Low Latency Critical| I[Locking + Hysteresis]
F -->|Yes| J[L1 + L2 or CDN + Origin]
F -->|No| K[Standard Cache-Aside]
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Lock holder crashes | Lock never released; key permanently blocked | Use TTL on locks; implement lock timeout/auto-release |
| Hysteresis serves too-stale data | Users see outdated content beyond acceptable window | Set reasonable stale_ttl limit; monitor staleness metrics |
| Circuit breaker stuck open | Cache permanently bypassed; database always hit | Implement half-open state for testing; set maximum open time |
| Tiered cache inconsistency | L1 has stale data while L2 has updated | Use invalidation broadcast; prefer L2 as source of truth |
| Request coalescing memory leak | Pending futures accumulate if requests fail | Implement timeout on futures; cleanup on error |
| CDN serves stale after purge | Users see old content | Use versioned URLs; implement proper cache invalidation |
| Probabilistic early expiration overload | Too many background refreshes | Tune beta parameter; limit concurrent refreshes |
Common Pitfalls / Anti-Patterns
1. Lock Without Timeout
Locks without TTL become permanent if the holder crashes.
# BAD: Lock without timeout
redis.set(f"lock:{key}", "1") # No TTL - forever if process dies
# GOOD: Lock with TTL
redis.set(f"lock:{key}", "1", nx=True, ex=5) # Auto-expires in 5 seconds
2. Not Handling Lock Acquisition Failure
Assuming you always get the lock leads to deadlocks.
# BAD: Assume lock always acquired
lock.acquire()
try:
result = fetch_and_cache()
finally:
lock.release()
# GOOD: Handle failure gracefully
if not lock.acquire(blocking=True, blocking_timeout=1):
# Wait and retry from cache, or serve stale
time.sleep(0.1)
return redis.get(key)
try:
result = fetch_and_cache()
finally:
lock.release()
3. Probabilistic Early Expiration Beta Too Aggressive
Beta controls refresh aggressiveness. Too high causes unnecessary refreshes.
# BAD: Beta too high (aggressive) - refreshes constantly
get_with_probabilistic_early_expiration(key, fetch_func, beta=1.0)
# Almost every request triggers refresh for low TTL items
# GOOD: Beta around 0.3 - measured refresh
get_with_probabilistic_early_expiration(key, fetch_func, beta=0.3)
# Only refreshes when TTL gets low enough
4. Hysteresis Without Staleness Limit
Serving stale forever defeats the purpose of cache invalidation.
# BAD: No limit on how long to serve stale
if redis.get(key):
return json.loads(redis.get(key)) # Always return whatever
# GOOD: Limit staleness window
def get_with_hysteresis(key, fetch_func, ttl=3600, max_stale_ttl=60):
value = redis.get(key)
if value:
return json.loads(value)
stale_value = redis.get(f"{key}:stale")
if stale_value:
if redis.ttl(f"{key}:stale") > -max_stale_ttl:
# Serving within acceptable stale window
trigger_background_refresh(key, fetch_func)
return json.loads(stale_value)
# No stale available - must refresh synchronously
result = fetch_func(key)
redis.setex(key, ttl, json.dumps(result))
return result
5. Tiered Cache Without Invalidation Strategy
Updates to L2 don’t propagate to L1, causing stale reads.
# BAD: Write to L2 but L1 still has old data
def update(key, value):
l2.setex(key, 3600, value)
# L1 still has old value - will be served until evicted
# GOOD: Delete from both tiers
def update(key, value):
l2.setex(key, 3600, value)
l1.delete(key) # Invalidate L1
Quick Recap
Key Bullets
- Stampede protection is essential for production caches with popular expiring keys
- Locking, hysteresis, and probabilistic early expiration all solve stampede differently
- Circuit breakers prevent cascade failures when cache is unavailable
- Tiered caching (L1+L2) gives sub-millisecond access for hottest data
- Request coalescing reduces thundering herd by collapsing concurrent requests
- Cache warming prevents cold-start latency after deployments or restarts
Copy/Paste Checklist
# Stampede protection - Locking pattern
lock = redis.lock(f"lock:{key}", timeout=5)
if lock.acquire(blocking=True, blocking_timeout=1):
try:
value = redis.get(key)
if not value:
value = fetch_from_db()
redis.setex(key, 3600, value)
finally:
lock.release()
# Hysteresis pattern
value = redis.get(key)
if not value:
stale = redis.get(f"{key}:stale")
if stale:
trigger_background_refresh(key)
return json.loads(stale)
value = fetch_from_db()
redis.setex(key, 3600, value)
# Circuit breaker
if circuit_breaker.state == "open":
return db.query(key) # Skip cache
try:
value = redis.get(key)
if value:
return json.loads(value)
except Exception as e:
circuit_breaker.record_failure()
return db.query(key)
# Tiered cache get
value = l1.get(key)
if value:
return value
value = l2.get(key)
if value:
l1.put(key, value) # Promote to L1
return value
value = db.query(key)
l2.setex(key, 3600, value)
l1.put(key, value)
return value
# Deployment checklist:
# [ ] Implement stampede protection before going to production
# [ ] Monitor lock contention and circuit breaker state
# [ ] Plan cache warming strategy for deployments
# [ ] Set appropriate TTLs for each cache tier
# [ ] Log tiered cache hit distribution (L1 vs L2)
# [ ] Have circuit breaker fallback to direct DB access
Observability Checklist
Metrics to Track
- Stampede Rate: Number of simultaneous cache misses for same key
- Lock Wait Time: How long requests wait for cache lock acquisition
- Circuit Breaker State: Time spent in open/closed/half-open states
- Stale Hit Ratio: Percentage of cache hits served while refreshing
- L1 vs L2 Hit Distribution: Split between local and distributed cache tiers
- Background Refresh Rate: How often early expiration triggers
# Stampede detection
def detect_stampede(redis_client, key, threshold=10):
"""
Detect if multiple requests trying to rebuild same key.
Uses INCR with short TTL as a lightweight counter.
"""
stampede_key = f"stampede:{key}"
count = redis_client.incr(stampede_key)
if count == 1:
redis_client.expire(stampede_key, 5) # 5 second window
if count > threshold:
logger.warning("cache_stampede_detected", key=key, concurrent_requests=count)
return True
return False
# Hysteresis monitoring
def monitor_staleness(redis_client, key):
stale_key = f"{key}:stale"
if redis_client.exists(stale_key):
ttl = redis_client.ttl(key)
logger.info("serving_stale_data", key=key, fresh_ttl=ttl)
Logs to Capture
logger = structlog.get_logger()
# Circuit breaker state changes
def log_circuit_breaker(previous_state, new_state, key=None):
logger.warning("circuit_breaker_state_change",
previous_state=previous_state,
new_state=new_state,
cache_key=key,
timestamp=time.time())
# Cache tier miss tracking
def log_tiered_miss(l1_hit=False, l2_hit=False, origin_required=True):
logger.info("cache_tier_miss",
l1_hit=l1_hit,
l2_hit=l2_hit,
origin_required=origin_required,
latency_tier="l1" if l1_hit else "l2" if l2_hit else "origin")
Alert Rules
- alert: HighLockContention
expr: rate(cache_lock_wait_seconds_total[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High lock contention on cache keys"
- alert: CircuitBreakerOpen
expr: cache_circuit_breaker_open_duration_seconds > 60
for: 1m
labels:
severity: critical
annotations:
summary: "Circuit breaker open for more than 60 seconds"
- alert: HighStaleHitRatio
expr: cache_stale_hits_total / cache_total_hits > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "More than 10% of cache hits are serving stale data"
Security Checklist
- Lock key injection - Validate key names in lock acquisition; prevent large/special key injection
- Rate limit lock attempts - Prevent denial of service via lock exhaustion
- Stale data exposure - Ensure stale cache doesn’t expose sensitive data between users
- Monitor circuit breaker state - An open circuit breaker may cause database overload
- Tiered cache data isolation - Ensure L1 (local) cache doesn’t leak data between users/tenants
- Background refresh access control - Background refresh threads should have same permission checks
- CDN edge security - Don’t cache sensitive content at CDN edge
# Secure lock implementation
class SecureLockingCache:
def __init__(self, redis_client, max_lock_timeout=5):
self.redis = redis_client
self.max_lock_timeout = max_lock_timeout
def acquire_lock(self, key):
# Validate key format - no special characters that could cause issues
if not self._is_valid_key(key):
raise ValueError(f"Invalid cache key: {key}")
lock_key = f"lock:{key}"
lock_timeout = min(self.max_lock_timeout, 5) # Cap at 5 seconds
# Use SET NX EX for atomic lock acquisition
acquired = self.redis.set(lock_key, "1", nx=True, ex=lock_timeout)
return acquired
def _is_valid_key(self, key):
# Basic validation - alphanumeric, colons, dashes, underscores
import re
return bool(re.match(r'^[\w:.-]+$', key))
Real-World Case Studies
No real-world case studies are currently documented for this topic.
Interview Questions
A cache stampede happens when a popular cache entry expires and multiple concurrent requests all try to rebuild the cache simultaneously. Since the cache is empty, all requests hit the database at once, overwhelming it and potentially causing cascading failures. The thundering herd is the same problem at a larger scale, typically when an entire cache tier goes down.
- Multiple requests detect the same cache miss
- All requests independently query the database
- Database becomes a bottleneck under concurrent load
- Cache never gets repopulated because requests keep timing out
Distributed Locking: Only one process acquires the lock and rebuilds the cache; others wait or serve stale data. This guarantees no stampede but adds latency for the first request and complexity around lock management.
Probabilistic Early Expiration: Refreshes the cache before it expires based on a probability that increases as TTL decreases. This spreads the refresh load across time but doesn't guarantee stampede protection.
- Use locking when: stampede protection is critical, high-concurrency on hot keys, lock infrastructure is already available
- Use probabilistic when: latency variation is acceptable, access patterns are predictable, you want to avoid lock overhead
Hysteresis keeps serving stale data while asynchronously refreshing the cache after the nominal TTL expires. The stale version is kept for a short window while a background thread repopulates the fresh value.
Advantages:
- Users never experience a cache miss — response is always immediate
- Eliminates request spikes at TTL expiration boundaries
- Simple mental model: fresh or stale, no "loading" state
Use when:
- Staleness tolerance exists (not financial data, medical records)
- Latency sensitivity is high (user-facing requests)
- A reasonable
max_stale_ttlcan be set and enforced
A circuit breaker wraps cache access and tracks failures. When failures exceed a threshold, the breaker "opens" — subsequent requests skip the cache entirely and go directly to the database. After a timeout, it enters a "half-open" state to test if the cache has recovered.
Role in cascade prevention:
- Isolates cache failures from database access
- Allows cache to recover under reduced load
- Provides graceful degradation: system keeps working (slower) even when cache is down
Tiered caching uses a fast, small local cache (L1 — CPU cache or in-memory dict) in front of a slower, larger distributed cache (L2 — Redis/Memcached). Requests check L1 first, then L2, then the database.
L1 (Local/In-Memory):
- Sub-millisecond access, limited size (MB range)
- Hot data only: top 0.1% of accessed keys by a single node
- Must be thread-safe and bounded to prevent memory leaks
L2 (Distributed):
- Higher latency (~1-10ms) but larger capacity (GB range)
- Shared across all application nodes
- Source of truth for cache data across nodes
Request coalescing collapses multiple in-flight requests for the same key into a single database query. When a request finds another request already fetching the same key, it waits for that request's result rather than spawning a new database call.
Implementation uses a pending-request registry (e.g., a dict of Futures). When the first request starts, it creates a Future and stores it. Subsequent requests receive the same Future and await it. When the fetch completes, all waiters receive the result.
- Dramatically reduces concurrent DB queries during cache misses
- Works at the application level, not the cache level
- Effective for thundering herd after cache restart
Write-Through: Synchronously update both cache and database on every write. Strong consistency but adds latency to writes. Best for read-heavy workloads where data must be correct.
Write-Back: Write to cache, return immediately, sync to database asynchronously. Lowest write latency but risk of data loss if cache fails before DB is updated. Use with durable storage.
Write-Around: Write directly to database, invalidate (not update) the cache. Simple and reliable — avoids the "cache update failed" problem. Brief staleness window after writes. Good for workloads where stale reads are briefly acceptable.
Race conditions occur when multiple processes independently read, modify, and write cache entries without coordination. The classic problem: Process A reads cache (stale), Process B updates DB and invalidates cache, Process A writes its (now stale) value back to cache, overwriting the fresh data.
Solutions:
- Version vectors: Tag each cache entry with a DB timestamp. Only serve/update cache if the version is newer than what's in the DB.
- CAS (Compare-And-Swap): Use atomic operations (Redis WATCH/MULTI/EXEC) to detect concurrent modifications and retry.
- Delete-on-write: Instead of updating cache on DB writes, delete the cache entry. Let the next read rebuild it. Eliminates the race entirely by not allowing concurrent updates.
Stale data across nodes: When L2 is updated, different nodes' L1 caches may still hold old values. Without an invalidation broadcast, some nodes serve stale data for the L1 TTL duration.
Memory leaks: Unbounded in-process caches (dicts, lists) grow without limit as more keys are accessed. Must use bounded LRU structures.
Invalidation complexity: L1 invalidation requires either broadcast (complex) or version checking (overhead). Simpler solutions just use short L1 TTLs or skip L1 entirely.
NUMA effects: On multi-socket servers, accessing an L1 cache on a different NUMA node is slow. CPU-affinitized caches or skipping L1 may be better.
A multi-tier warming strategy works best:
- Pre-deployment: Identify the top N keys by expected traffic (top products, categories, user segments). Warm these into L2 before the traffic spike.
- Tiered by priority: Warm the most popular 1000 keys first with longest TTL, then secondary keys. Use different TTLs for different tiers.
- Background warming: Use a speculative warmer that tracks access counts and pre-loads keys that are trending hot but not yet in cache.
- During the spike: Rely on hysteresis and probabilistic early expiration to keep hot keys fresh without lock contention.
- Monitor: Watch hit rates and staleness. If L1 hit rate drops below threshold, temporarily increase L1 TTL to reduce L2 load.
Critical metrics for cache observability fall into four categories:
- Hit/Miss Ratio: Track L1 hit rate, L2 hit rate, and overall cache hit rate. Alert when hit rate drops below baseline (indicates warming failure or cache invalidation issue).
- Latency Percentiles: P50, P95, P99 latency at each cache tier. Latency spikes often precede cache failures.
- Stampede Detection: Monitor concurrent miss counts per key. High concurrent misses on a single key signals a stampede in progress.
- Circuit Breaker State: Track time spent in open vs closed state. Extended open state means cache is bypassed and database is under stress.
- Staleness Metrics: Track how often stale data is served (hysteresis) vs fresh. Rising stale ratios indicate TTL settings are too aggressive.
Alert thresholds should be based on baseline measurements, not arbitrary values. Set warning alerts at 2 standard deviations and critical at 3 standard deviations from normal.
The algorithm calculates refresh probability using: probability = 1 - (remaining_ttl / total_ttl) ^ beta
As TTL decreases, the probability increases. The beta parameter controls the curve shape:
- beta = 0.3 (recommended): Gradual probability increase. Refreshes start when TTL is ~30% remaining. Most requests still serve cached data.
- beta too high (>1.0): Aggressive refresh. Probability climbs steeply even when TTL is high. Results in excessive background refreshes, wasting resources and potentially causing lock contention.
- beta too low (<0.1): Rarely refreshes. Effectively becomes standard TTL expiration — stampede risk returns at expiration boundary.
- beta = 1.0: Linear probability increase. Starts refreshing at 50% TTL remaining. More aggressive than typical use cases.
The formula was derived from research on web caching and balances refresh overhead against miss latency. The key insight is that popular items benefit from early refresh because they expire at predictable times.
Cache Stampede: A single popular cache key expires, causing multiple requests to simultaneously rebuild that specific key.
Thundering Herd: An entire cache tier goes down (restart, crash, network partition) or a massive traffic spike hits, causing ALL keys to be unavailable simultaneously. Every request hits the database at once.
Why thundering herd is harder:
- Scale: A stampede affects one key; thundering herd affects thousands. Request coalescing helps but the pending request queue itself can become a bottleneck.
- Recovery paradox: As the cache recovers, the recovery itself triggers another thundering herd because all nodes warm up simultaneously.
- No single lock helps: Distributed locking on one key doesn't help when the problem is all keys becoming unavailable at once.
- Solutions are operational: Requires pre-warming, gradual recovery, traffic shedding, or a " cache proxy" layer that serves stale during recovery.
Cache-Aside (Lazy Loading): Application checks cache first; on miss, fetches from DB and populates cache. Most common pattern.
Read-Through: Cache automatically loads data from DB on miss. Application only talks to cache, never DB directly. Cache is responsible for loading.
Write-Behind (Write-Back): Application writes to cache, which asynchronously persists to DB. Highest write performance but risk of data loss.
- Use cache-aside when: You need explicit control over cache contents, some data is accessed rarely, or you want to avoid caching data you might never read.
- Use read-through when: You want simplified application code and the cache can determine the loading strategy (e.g., multi-tier where L1 loads from L2).
- Use write-behind when: Write throughput is critical (logging, analytics), cache has battery-backed RAM, and brief data loss on crash is acceptable.
Cache invalidation across microservices requires explicit ownership and propagation:
- Event-driven invalidation: When Service A updates shared data, publish an event (Kafka, SNS) that all services consuming that data subscribe to and invalidate their local cache entries.
- Cache ownership model: Assign one service as the "owner" of each cached entity. Only the owner writes to the cache; other services read but never write directly to shared cache keys.
- Short TTL with safety net: Use short TTLs (30-60 seconds) for shared data as a safety net. Even if invalidation fails, staleness is bounded.
- Version-tagged entries: Include a version number from the database. Services check version before serving cached data. Reduces race conditions but adds complexity.
- Distributed cache tier: Use a shared Redis/Memcached cluster instead of local caches. Invalidation propagates instantly. Trade-off: single point of failure, network dependency.
Best practice: Use event-driven invalidation with short TTL as fallback. The event ensures timely invalidation; the TTL bounds potential staleness.
Availability vs Consistency Trade-off:
- Locking + synchronous refresh: Strong consistency but reduced availability. If lock holder crashes, requests wait or get stale. Timeouts must be set carefully.
- Hysteresis: High availability (always serves data) but potential staleness. Acceptable when brief staleness is tolerable (e.g., product catalog, user preferences).
- Probabilistic early expiration: Balanced approach. Most requests get fresh data with low latency. Some requests trigger background refresh. Works when access patterns are predictable.
Choosing:
- Consistency critical: Financial data, inventory counts, leaderboards. Use locking with version checking. Accept higher latency.
- Availability critical: User-facing content, media, product pages. Use hysteresis with staleness limits. Accept brief staleness.
- Both important: Use tiered approach. Hysteresis for most data; locking for critical keys that must be consistent.
Never completely sacrifice availability — a slow response is better than an error. Circuit breakers ensure the database isn't overwhelmed, preserving overall system health.
Consistent Hashing: Maps cache keys to a hash ring (0 to 2^32). Each cache node is assigned multiple positions on the ring (via virtual nodes). A key is stored on the first node found clockwise from its hash position.
Why it matters: When a node is added or removed, only K/N keys are remapped (where K is total keys, N is nodes). With simple modulo, K/N of ALL keys remap, causing a massive cache invalidation wave.
- Virtual nodes (replicas): Assigning 150-200 virtual nodes per physical node improves distribution balance. More replicas = better load distribution but more memory for ring metadata.
- Monotonicity: When adding nodes, existing keys don't need to move (beyond the redistribution boundary). Critical for production deployments.
Trade-offs:
- Simple modulo: Perfect distribution but catastrophic reshuffle on node changes. Good for fixed-size, never-shrink cache clusters.
- Consistent hashing: Graceful scaling with imperfect distribution. Slight hotspots possible with skewed key access patterns. Add replicas to improve balance.
CDN Edge Caches: Distributed globally, serve content from PoPs (points of presence) close to users. Typically HTTP caching (Varnish, Nginx, Cloudflare, Akamai).
- Invalidation: Hard to invalidate. Requires purge API calls to each edge node. Purge propagation takes seconds to minutes. Not suitable for rapidly changing data.
- Consistency: Eventual consistency at scale. Different edge nodes may serve different versions during propagation. TTL-based staleness is acceptable.
- Best for: Static assets (images, CSS, JS), API responses with long TTLs, versioned content (URL-based versioning solves invalidation).
Application Caches (Redis/Memcached):
- Invalidation: Immediate via delete command. Direct control. Can implement sophisticated invalidation strategies.
- Consistency: Can achieve strong consistency with delete-on-write or version vectors. Critical for transactional data.
- Best for: Dynamic content, user-specific data, frequently changing data, session storage.
Hybrid approach: Use CDN for static assets with long TTLs and versioned URLs. Use application cache for dynamic content. The two layers have orthogonal characteristics and complement each other.
Cache Poisoning: An attacker injects malicious data into the cache by exploiting cache key collision or header manipulation.
- Key validation: Strictly validate cache key format. Reject keys with special characters, extremely long lengths, or injection patterns.
- Key namespacing: Prefix all keys with a namespace (e.g.,
user:,product:). Prevents cross-namespace pollution. - Input sanitization: Normalize inputs before using as cache keys. Same logical request should always produce the same key.
Cache Timing Attacks: Attacker measures response times to determine if data was served from cache vs database, potentially revealing sensitive data existence.
- Constant-time responses: Add jitter to responses to mask cache hit/miss timing differences.
- Separate cache for sensitive data: Don't mix sensitive and non-sensitive data in the same cache infrastructure.
- Rate limiting: Limit requests per IP/key to prevent systematic probing.
General cache security: Network segmentation (cache shouldn't be directly internet-accessible), TLS for cache connections, authentication for cache commands, no sensitive data in cache keys themselves.
Social media timelines have a specific access pattern: reads vastly outnumber writes, but writes create urgent freshness requirements (new posts must appear quickly).
Recommended approach:
- Write-around with immediate invalidation: Writes go to the database, then the user's timeline cache key is deleted. Subsequent reads rebuild the timeline from the database.
- Aggressive TTL: Timeline cache TTL of 30-60 seconds. Even without writes, timelines refresh frequently. Balances freshness with cache hit rate.
- User-specific warming: When a user posts, proactively warm their timeline for their followers in the background (within 5-10 seconds).
- Hysteresis for timeline reads: Serve stale timeline data while refreshing in background. Users see their own post immediately (via write-around); follower timeline updates within seconds.
Special consideration for fan-out:
- Push model: When a user posts, write to each follower's timeline cache immediately. High write amplification but best freshness. Use for small follower counts.
- Pull model: Timelines are always computed at read time from recent posts. Best for celebrities with millions of followers. Cache recent posts and assemble at read time.
- Hybrid: Push to first 1000 followers, pull for the rest. Balances freshness with scalability.
Further Reading
Official Documentation
- Redis Cache Patterns — Redis official guide to caching patterns
- AWS ElastiCache Best Practices — AWS guidance for managed Redis caching
Books
- Designing Data-Intensive Applications by Martin Kleppmann — Chapters 5 and 6 cover caching in depth
- Redis in Action by Josiah L. Carlson — Practical Redis patterns for production systems
Tools & Utilities
- cachegrand — Modern high-performance cache server with advanced features
- stale-lru — Stale-while-revalidate LRU implementation for Python
- python-cacheguard — Thread-safe caching with stampede protection
Related Posts on GeekWorkBench
- Caching Strategies — The five strategies these patterns work with
- Cache Eviction Policies — How to decide what to evict
- CDN Deep Dive — Tiered caching at the edge
- Distributed Caching — Multi-node caching patterns
Conclusion
Cache stampede protection matters when you have popular expiring entries. Cache warming matters after deployments. Tiered caching matters when you’re big enough that one cache tier isn’t enough.
Start with the basics. Add stampede protection early because it bites you unexpectedly. The other patterns you add when you have the operational need.
Don’t over-engineer before you have the problem.
Category
Related Posts
Distributed Caching: Scaling Cache Across Multiple Nodes
A comprehensive guide to distributed caching — consistent hashing, cache sharding, replica consistency, cache clustering, and handling the unique challenges of multi-node cache environments.
Caching Strategies: A Practical Guide
Learn the main caching patterns — cache-aside, write-through, write-behind, and refresh-ahead — plus how to pick TTLs, invalidate stale data, and distribute caches across nodes.
Cache Stampede Prevention: Protecting Your Cache
Learn how single-flight, request coalescing, and probabilistic early expiration prevent cache stampedes that can overwhelm your database.