Cache Patterns: Thundering Herd, Stampede Prevention, and Cache Warming

A comprehensive guide to advanced cache patterns — thundering herd, cache stampede prevention with distributed locking and probabilistic early expiration, and cache warming strategies.

published: March 10, 2026 reading time: 40 min read author: Geek Workbench

Introduction

This guide covers thundering herd, circuit breakers, request coalescing, distributed locking, cache warming strategies, tiered caching, race conditions, invalidation, and production concerns you’ll face when scaling cache layers.

Core Concepts

The Thundering Herd Problem

When a popular cache entry expires, every request that hits it tries to rebuild from the database simultaneously. The database wasn’t designed for that kind of concurrent load. It slows down, requests pile up, and now you’ve turned a caching problem into a cascading failure.

graph TD
    A[Cache Miss] --> B[Request DB]
    B --> C[100 more requests hit expired key]
    C --> D[All 100 try DB simultaneously]
    D --> E[DB overwhelmed]
    E --> F[Requests timeout]
    F --> G[Cache never repopulated]

This is the cache stampede problem. Thundering herd is the same problem at larger scale — typically when an entire cache tier goes down or restarts.

graph TD
    A[Cache goes down] --> B[All requests hit DB]
    B --> C[DB overwhelmed]
    C --> D[System crashes]
    D --> E[Cache never recovers]

Circuit Breaker Pattern

Design your cache to degrade gracefully under failure. When cache failures exceed a threshold, bypass it entirely and go directly to the database.

class CircuitBreakerCache:
    def __init__(self, redis_client, db, failure_threshold=5, timeout=30):
        self.redis = redis_client
        self.db = db
        self.failure_count = 0
        self.last_failure_time = 0
        self.timeout = timeout
        self.failure_threshold = failure_threshold
        self.state = "closed"  # closed, open, half-open

    def get(self, key):
        # If circuit open, skip cache and go to DB directly
        if self.state == "open":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "half-open"
            else:
                return self.db.query(key)

        try:
            value = self.redis.get(key)
            if value:
                return json.loads(value)

            # Cache miss - get from DB
            result = self.db.query(key)
            self.redis.setex(key, 3600, json.dumps(result))
            return result
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = time.time()

            if self.failure_count >= self.failure_threshold:
                self.state = "open"

            # Fallback to DB directly
            return self.db.query(key)

Request Coalescing

Multiple requests for the same key get coalesced into a single database query.

import asyncio
from concurrent.futures import Future

class RequestCoalescer:
    def __init__(self):
        self.pending = {}  # key -> Future

    async def get(self, key, fetch_func):
        # Check if there's already a pending request for this key
        if key in self.pending:
            # Wait for the existing request
            return await self.pending[key]

        # Create a future for this request
        future = Future()

        async def fetch_and_store():
            try:
                result = await fetch_func(key)
                future.set_result(result)
            except Exception as e:
                future.set_exception(e)
            finally:
                del self.pending[key]

        self.pending[key] = future
        asyncio.create_task(fetch_and_store())

        return await future

Topic-Specific Deep Dives (H1)

Stampede Prevention

Three main approaches to prevent cache stampedes, each with different trade-offs.

Distributed Locking

Only one process rebuilds the cache; others wait or serve stale data.

import redis
import time
import json

def get_with_lock(key, fetch_func, lock_timeout=5, stale_timeout=30):
    # Try to get from cache
    value = redis.get(key)
    if value:
        return json.loads(value)

    # Try to acquire lock
    lock_key = f"lock:{key}"
    lock = redis.lock(lock_key, timeout=lock_timeout)

    if lock.acquire(blocking=True, blocking_timeout=1):
        try:
            # Double-check cache (another process might have populated)
            value = redis.get(key)
            if value:
                return json.loads(value)

            # Fetch from source
            result = fetch_func(key)
            redis.setex(key, 3600, json.dumps(result))
            return result
        finally:
            lock.release()
    else:
        # Didn't get lock - wait briefly and retry cache
        time.sleep(0.1)
        value = redis.get(key)
        if value:
            return json.loads(value)

        # Still nothing - either serve stale or error
        return None

This prevents the stampede but adds latency for the first request. The others wait or get stale data.

Probabilistic Early Expiration

Refresh the cache before it expires, based on probability that decreases as TTL decreases.

import time
import random
import hashlib

def get_with_probabilistic_early_expiration(key, fetch_func, ttl=3600, beta=0.3):
    cached = redis.get(key)
    if cached:
        return json.loads(cached)

    # Get current TTL
    current_ttl = redis.ttl(key)
    if current_ttl <= 0:
        current_ttl = ttl

    # Calculate probability of early expiration
    # Lower beta = less aggressive early refresh
    remaining_ratio = current_ttl / ttl
    probability = 1 - (remaining_ratio ** beta)

    if random.random() < probability:
        # Refresh in background (don't block)
        def background_refresh():
            try:
                result = fetch_func(key)
                redis.setex(key, ttl, json.dumps(result))
            except Exception as e:
                pass  # Log but don't fail

        # In production, submit to thread pool or queue
        import threading
        threading.Thread(target=background_refresh).start()
    else:
        # Normal cache miss handling
        result = fetch_func(key)
        redis.setex(key, ttl, json.dumps(result))
        return result

    return fetch_func(key)  # Fallback

The math is: as TTL decreases, probability of refresh increases. Popular items get refreshed before they expire. The beta parameter controls aggressiveness.

Hysteresis (Stale-While-Refresh)

Keep serving stale data while refreshing in background, even after expiration.

def get_with_hysteresis(key, fetch_func, ttl=3600, stale_ttl=60):
    value = redis.get(key)
    if value:
        return json.loads(value)

    # Check for stale version
    stale_value = redis.get(f"{key}:stale")
    if stale_value:
        # Return stale, trigger background refresh
        def background_refresh():
            try:
                result = fetch_func(key)
                redis.setex(key, ttl, json.dumps(result))
                redis.delete(f"{key}:stale")
            except Exception as e:
                pass

        import threading
        threading.Thread(target=background_refresh).start()
        return json.loads(stale_value)

    # No stale version - must refresh synchronously
    result = fetch_func(key)
    redis.setex(key, ttl, json.dumps(result))
    return result

This ensures users never hit a cache miss but requires careful handling of staleness.

Stampede Detection Counter Pattern

A lightweight stampede detector using atomic increment:

def get_with_stampede_detection(key, fetch_func, threshold=10, window=5):
    """
    Detect stampede condition by counting concurrent misses.
    Returns (value, was_stampede).
    """
    stampede_key = f"stampede:{key}"
    count = redis.incr(stampede_key)

    if count == 1:
        # First request - set expiry window
        redis.expire(stampede_key, window)

    if count > threshold:
        # Stampede detected - log and potentially trigger mitigation
        logger.warning("cache_stampede_detected",
                      key=key,
                      concurrent_requests=count)
        return redis.get(key), True

    try:
        value = fetch_func(key)
        redis.setex(key, 3600, json.dumps(value))
        return value, False
    finally:
        # Only decrement if we're the last request in window
        remaining = redis.decr(stampede_key)
        if remaining <= 0:
            redis.delete(stampede_key)

Cache Warming

On startup or after cache invalidation, the cache is empty. Users experience slow requests until popular items are repopulated. Cache warming pre-loads the cache before users need it.

When to Warm Proactively

Cache warming makes sense when:

Known hot dataset: You can enumerate the top N keys that will be accessed
After planned outage: Cache restarts, deployments, maintenance windows
Predictable traffic spikes: Black Friday, product launches, marketing campaigns
Regulatory/compliance: Some data must be in cache before users access it

Proactive Warming on Startup

def warm_cache_on_startup(redis_client, db, popular_keys):
    """
    Load popular items into cache before users arrive.
    Run this after cache restart or deployment.
    """
    print(f"Warming cache with {len(popular_keys)} keys...")

    for key in popular_keys:
        try:
            # Fetch from database
            data = db.query(key)

            # Store in cache with longer TTL
            redis_client.setex(key, 86400, json.dumps(data))  # 24hr TTL
        except Exception as e:
            print(f"Failed to warm key {key}: {e}")

    print("Cache warming complete")

# Common patterns for popular keys:
# - Top 1000 products by sales
# - Top 100 users by activity
# - Category listing pages
# - Feature flags and config

Background Warming

Warm cache entries in the background as they expire or as traffic patterns emerge.

import threading
import time
from collections import Counter

class BackgroundWarmingCache:
    def __init__(self, redis_client, db, warm_threshold=100):
        self.redis = redis_client
        self.db = db
        self.access_count = Counter()
        self.warm_threshold = warm_threshold
        self.running = True

        # Start background thread
        self.warmer_thread = threading.Thread(target=self._warm_loop)
        self.warmer_thread.daemon = True
        self.warmer_thread.start()

    def track_access(self, key):
        """Track cache access for warming decisions"""
        self.access_count[key] += 1

    def get(self, key):
        value = self.redis.get(key)
        if value:
            self.track_access(key)
            return json.loads(value)
        return None

    def _warm_loop(self):
        while self.running:
            # Find hot keys that aren't in cache
            for key, count in self.access_count.most_common(100):
                if count >= self.warm_threshold and not self.redis.exists(key):
                    try:
                        data = self.db.query(key)
                        self.redis.setex(key, 86400, json.dumps(data))
                        self.access_count[key] = 0  # Reset after warming
                    except Exception:
                        pass

            time.sleep(60)  # Check every minute

When NOT to Warm

Don’t warm if:

Access pattern is unpredictable: You’ll waste resources preloading unused data
Data is highly dynamic: Values change faster than TTL expires
Wide working set: Too much data to warm meaningfully before first user requests hit
Just-in-time is fast enough: Database can handle initial cache misses

Demand-Fetch with Background Warming Hybrid

Instead of warming everything upfront, use a “speculative warming” approach:

class SpeculativeWarmer:
    def __init__(self, redis_client, db, hot_threshold=100, warm_window=60):
        self.redis = redis_client
        self.db = db
        self.hot_threshold = hot_threshold
        self.warm_window = warm_window
        self.access_log = Counter()
        self.running = True

    def record_access(self, key):
        """Called on every cache access"""
        self.access_log[key] += 1

    def speculatively_warm(self):
        """Run periodically to warm hot-but-missing keys"""
        now = time.time()

        for key, count in self.access_log.most_common(50):
            if count >= self.hot_threshold:
                if not self.redis.exists(key):
                    try:
                        data = self.db.fetch(key)
                        self.redis.setex(key, 3600, json.dumps(data))
                        self.access_log[key] = 0  # Reset after warming
                    except Exception:
                        pass

        # Decay counts slightly to prioritize recent hot keys
        for k in self.access_log:
            self.access_log[k] = max(0, self.access_log[k] // 2)

Tiered Caching

Not all caches are equal. Using multiple cache tiers with different characteristics gives you the best of both worlds: fast access for hot data, larger capacity for warm data.

graph TD
    A[Request] --> B[L1 Cache<br/>CPU Cache / In-Memory<br/>~MB, <1ms]
    B -->|miss| C[L2 Cache<br/>Redis / Memcached<br/>~GB, <10ms]
    C -->|miss| D[L3 Cache<br/>CDN / Edge<br/>~TB, <100ms]
    D -->|miss| E[Database<br/>~TB, 10-100ms]

L1 + L2 Pattern

import redis
import threading
import time

class TieredCache:
    def __init__(self, l1_size=1000, l2_client=None, ttl=3600):
        # L1: local memory cache (thread-safe dict with LRU)
        from collections import OrderedDict

        class LRUCache:
            def __init__(self, capacity):
                self.capacity = capacity
                self.cache = OrderedDict()

            def get(self, key):
                if key in self.cache:
                    self.cache.move_to_end(key)
                    return self.cache[key]
                return None

            def put(self, key, value):
                if key in self.cache:
                    self.cache.move_to_end(key)
                self.cache[key] = value
                if len(self.cache) > self.capacity:
                    self.cache.popitem(last=False)

        self.l1 = LRUCache(l1_size)
        self.l2 = l2_client  # Redis
        self.ttl = ttl

    def get(self, key):
        # Check L1 first
        value = self.l1.get(key)
        if value:
            return value

        # Check L2
        if self.l2:
            value = self.l2.get(key)
            if value:
                # Promote to L1
                self.l1.put(key, value)
                return value

        return None

    def put(self, key, value):
        # Write to both tiers
        self.l1.put(key, value)
        if self.l2:
            self.l2.setex(key, self.ttl, value)

CDN + Origin Pattern

graph TD
    A[User Request] --> B[CDN Edge<br/>~100ms TTL]
    B -->|miss| C[CDN Origin<br/>~1hr TTL]
    C -->|miss| D[Application<br/>~24hr TTL]
    D -->|miss| E[Database]

For web content:

# CDN cache control headers
Cache-Control: public, max-age=100, s-maxage=100  # Edge cache 100s
Cache-Control: public, max-age=3600, s-maxage=3600  # Origin cache 1hr

# Private content (user-specific)
Cache-Control: private, max-age=3600  # Only browser caches this

Local Cache (L1) Pitfalls

L1 (local in-memory) caches give sub-millisecond access but introduce problems that don’t exist with a centralized L2 cache.

Memory Leaks in Process-Level Caches

Python’s dict-based LRU cache won’t automatically shrink:

# BAD: Unbounded local cache
l1_cache = {}  # Grows forever

# GOOD: Bounded LRU cache with max size
from functools import lru_cache

@lru_cache(maxsize=1000)
def local_cache_lookup(key):
    return l2.get(key)

Stale Data Across Nodes

This is the hardest L1 problem. When data updates in L2, different nodes may have different L1 values:

graph TD
    A[Update to L2] --> B[Node 1 L1 invalidated]
    A --> C[Node 2 L1 NOT invalidated yet]
    A --> D[Node 3 L1 NOT invalidated yet]
    B --> E[Node 1 sees new value]
    C --> F[Node 3 sees stale value - 5 minutes old]
    D --> G[Node 4 sees stale value - 5 minutes old]

Solutions:

Short L1 TTL: Only a few seconds, essentially using L1 as a read buffer
L1 Invalidation Broadcast: When L2 updates, publish invalidation to all nodes
Version-Tagged Entries: L1 checks version against L2 before serving

class L1WithVersionCheck:
    def __init__(self, l2_client, l1_size=1000):
        self.l2 = l2_client
        self.l1 = LRUCache(l1_size)

    def get(self, key):
        l1_entry = self.l1.get(key)
        if l1_entry:
            data, l1_version = l1_entry
            l2_entry = self.l2.get(key)
            if l2_entry:
                _, l2_version = json.loads(l2_entry)
                if l1_version >= l2_version:
                    return data  # L1 is fresh enough
            # L1 is stale, invalidate and fetch from L2
            self.l1.delete(key)

        # Fetch from L2 and promote to L1
        l2_value = self.l2.get(key)
        if l2_value:
            data, version = json.loads(l2_value)
            self.l1.put(key, (data, version))
            return data
        return None

NUMA Awareness

On multi-socket servers, a local L1 on one socket can’t be accessed efficiently from the other socket. If you’re running on such hardware, consider CPU-affinitized local caches or just use L2 for everything.

Race Conditions & Invalidation

When multiple processes or machines update the same cache key independently, you can end up with stale data overwriting fresh data, or fresh data being immediately clobbered by stale data. This is especially dangerous with the hysteresis and early expiration patterns.

The “Last Writer Wins” Problem

With distributed locking, the lock ensures only one process rebuilds at a time. But what happens when the database itself is being updated while the cache rebuilds?

# Timeline of a race condition
# T1: Process A acquires lock, starts DB query
# T2: Process B updates DB with new value
# T3: Process A finishes, writes stale value to cache
# T4: Cache now contains stale data until next expiration

# Result: fresh DB data is invisible because cache says otherwise

Handling Concurrent Writes with Version Vectors

One solution is to tag cache entries with a version or timestamp from the database:

def get_with_version(key, fetch_func, ttl=3600):
    value = redis.get(key)
    if value:
        data, version = json.loads(value)
        # Verify version against DB
        current_version = db.get_version(key)
        if version >= current_version:
            return data

    # Cache miss or stale version - must refresh
    data, version = fetch_func(key)
    redis.setex(key, ttl, json.dumps((data, version)))
    return data

CAS (Compare-And-Swap) Pattern for Cache Updates

Use Redis WATCH/MULTI/EXEC for atomic read-modify-write:

def update_cache_atomic(key, update_func):
    """
    Atomically read cache, apply update, write back.
    Retries if cache was modified by another process.
    """
    retries = 3
    while retries > 0:
        try:
            redis.watch(key)
            value = redis.get(key)
            if value:
                current = json.loads(value)
                updated = update_func(current)
                redis.multi()
                redis.setex(key, 3600, json.dumps(updated))
                redis.exec()
                return updated
            else:
                redis.unwatch()
                return None
        except redis.WatchError:
            retries -= 1
            continue

Cache Invalidation Deep Dive

Cache invalidation is notoriously hard. Getting it wrong means users see stale data, or your cache never helps because entries are constantly evicted.

Write-Through vs Write-Back vs Write-Around

Three strategies keep cache and database in sync:

graph LR
    subgraph WriteThrough
    A[Write] --> B[Write to Cache]
    B --> C[Write to DB]
    C --> D[Return]
    end

    subgraph WriteBack
    X[Write] --> Y[Write to Cache]
    Y --> Z[Async Write to DB]
    Z --> W[Return immediately]
    end

    subgraph WriteAround
    P[Write] --> Q[Write directly to DB]
    Q --> R[Invalidate Cache]
    R --> S[Return]
    end

Write-Through: Every write updates both cache and DB synchronously. Strong consistency but adds latency to every write. Best for read-heavy workloads where data must be correct.

Write-Back: Write to cache, return immediately, sync to DB asynchronously. Lowest write latency but risk of data loss if cache fails before DB is updated. Use when durability is handled at the DB level (e.g., SSDs, WAL).

Write-Around: Write directly to DB, invalidate (don’t update) the cache. Subsequent reads pull from DB and repopulate cache. Good for workloads where stale reads are acceptable for a brief window.

Invalidation Strategies

Immediate Invalidation: Update the cache immediately when the database changes:

def update_with_immediate_invalidation(key, value, db):
    # Write to database
    db.update(key, value)
    # Immediately update cache
    redis.setex(key, 3600, json.dumps(value))

Problem: If the cache update fails but DB update succeeds, you’re out of sync.

Delete-On-Write (Cache-Aside with Invalidation): The cleanest approach: write to DB, delete from cache. Don’t try to update the cache:

def update_with_delete_on_write(key, value, db):
    db.update(key, value)
    redis.delete(key)  # Simple, reliable
    # Next read will repopulate from DB

This avoids the “update cache failed” problem entirely.

TTL Selection Trade-offs

TTL Setting	Pros	Cons	Best For
Short (seconds)	High freshness	Low hit rate	Real-time data, rapidly changing values
Medium (minutes)	Balanced	Stale reads possible	Most web content
Long (hours)	High hit rate	Stale data risk	Configuration, reference data
No TTL + manual invalidation	Perfect hit rate	Complexity, staleness risk	Stable reference data

Rule of thumb: Set TTL to 1-10x the expected refresh interval. If data changes every ~10 minutes, a 30-60 minute TTL is reasonable.

Monitoring Invalidation Health

Track these metrics to catch invalidation problems early:

# Track staleness by comparing cache timestamp with DB timestamp
def check_invalidation_health(key):
    cache_value = redis.get(key)
    db_value = db.get(key)

    if not cache_value or not db_value:
        return {"status": "unknown"}

    cache_ts = json.loads(cache_value).get("updated_at", 0)
    db_ts = db.get_updated_at(key)

    staleness_seconds = cache_ts - db_ts
    return {
        "status": "stale" if staleness_seconds > 60 else "fresh",
        "staleness_seconds": staleness_seconds
    }

Sharding Strategies

When a single cache instance isn’t enough, you need to distribute across multiple nodes.

Consistent Hashing

import hashlib
from bisect import bisect

class ConsistentHashRing:
    def __init__(self, nodes, replicas=150):
        self.replicas = replicas
        self.ring = {}
        self.sorted_keys = []

        for node in nodes:
            for i in range(replicas):
                key = f"{node}:{i}"
                hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
                self.ring[hash_key] = node
                self.sorted_keys.append(hash_key)

        self.sorted_keys.sort()

    def get_node(self, key):
        if not self.ring:
            return None

        hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
        idx = bisect(self.sorted_keys, hash_key)
        if idx >= len(self.sorted_keys):
            idx = 0
        return self.ring[self.sorted_keys[idx]]

    def add_node(self, node):
        for i in range(self.replicas):
            key = f"{node}:{i}"
            hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
            self.ring[hash_key] = node
            self.sorted_keys.append(hash_key)
        self.sorted_keys.sort()

    def remove_node(self, node):
        for i in range(self.replicas):
            key = f"{node}:{i}"
            hash_key = int(hashlib.md5(key.encode()).hexdigest(), 16)
            del self.ring[hash_key]
            self.sorted_keys.remove(hash_key)

Trade-off Analysis

Stampede Prevention Comparison

Pattern	Stampede Protection	Latency Impact	Complexity	Consistency Risk
Distributed Locking	Strong	First request blocked	Medium	Low
Probabilistic Early	Probabilistic	Minimal	Low	Low
Hysteresis	Strong	Zero (always served)	Medium	Medium
Request Coalescing	Strong	First request blocked	Medium	Low
No Protection	None	None	None	High (cascade)

Write Strategy Comparison

Strategy	Write Latency	Read Performance	Data Loss Risk	Complexity
Write-Through	High (sync)	Best	None	Low
Write-Back	Lowest (async)	Best	High	High
Write-Around	Medium	Initially poor	None	Lowest
Delete-on-Write	Medium	Initially poor	None	Lowest

Cache Tier Comparison

Tier	Capacity	Latency	Shared Across Nodes	Best For
L1	MB	<1ms	No	Hottest 0.1% of keys
L2	GB	1-10ms	Yes	Hot dataset
L3	TB	10-100ms	Yes (global)	CDN edge, static assets
DB	TB+	10-100ms	Yes	Source of truth

Invalidation Strategy Comparison

Strategy	Staleness Window	Implementation	Failure Mode
TTL-based	Full TTL	Simplest	Stale data up to TTL
Immediate	None	Moderate	Cache/DB out of sync
Delete-on-Write	Until next read	Simple	Brief read miss
Version-tagged	None (checked)	Complex	Requires version sync

When to Use / When Not to Use

Pattern	When to Use	When Not to Use
Distributed Locking	Preventing cache stampede on popular keys; ensuring single-flight rebuilds	High-frequency writes; when lock contention is a problem
Probabilistic Early Expiration	Popular items with predictable access; reducing miss latency	Unpredictable access patterns; items with short TTLs
Hysteresis (Stale-While-Refresh)	Latency-sensitive applications; users must never wait	When stale data is unacceptable; very short TTLs
Circuit Breaker	External cache service; preventing cascade failures	Local/in-process caches; when cache is always available
Request Coalescing	High concurrency on same keys; thundering herd prevention	Low traffic; unique keys per request
Cache Warming	After deployments; known hot dataset; reducing cold-start latency	Unknown access patterns; dynamic content
Tiered Caching (L1+L2)	Multiple application instances; sub-millisecond access needed for hot data	Single instance; uniform access patterns
CDN + Origin	Static content; global user base; reducing origin load	Dynamic, personalized content; real-time data

Decision Guide

graph TD
    A[Problem] --> B{Cache unavailable?}
    B -->|Yes| C[Use Circuit Breaker]
    B -->|No| D{Stampede on miss?}
    D -->|Yes| E{Hot key access pattern?}
    D -->|No| F{Tiered Latency Needed?}
    E -->|Predictable| G[Probabilistic Early Expiration]
    E -->|High Concurrency| H[Request Coalescing]
    E -->|Low Latency Critical| I[Locking + Hysteresis]
    F -->|Yes| J[L1 + L2 or CDN + Origin]
    F -->|No| K[Standard Cache-Aside]

Production Failure Scenarios

Failure	Impact	Mitigation
Lock holder crashes	Lock never released; key permanently blocked	Use TTL on locks; implement lock timeout/auto-release
Hysteresis serves too-stale data	Users see outdated content beyond acceptable window	Set reasonable `stale_ttl` limit; monitor staleness metrics
Circuit breaker stuck open	Cache permanently bypassed; database always hit	Implement half-open state for testing; set maximum open time
Tiered cache inconsistency	L1 has stale data while L2 has updated	Use invalidation broadcast; prefer L2 as source of truth
Request coalescing memory leak	Pending futures accumulate if requests fail	Implement timeout on futures; cleanup on error
CDN serves stale after purge	Users see old content	Use versioned URLs; implement proper cache invalidation
Probabilistic early expiration overload	Too many background refreshes	Tune beta parameter; limit concurrent refreshes

Common Pitfalls / Anti-Patterns

1. Lock Without Timeout

Locks without TTL become permanent if the holder crashes.

# BAD: Lock without timeout
redis.set(f"lock:{key}", "1")  # No TTL - forever if process dies

# GOOD: Lock with TTL
redis.set(f"lock:{key}", "1", nx=True, ex=5)  # Auto-expires in 5 seconds

2. Not Handling Lock Acquisition Failure

Assuming you always get the lock leads to deadlocks.

# BAD: Assume lock always acquired
lock.acquire()
try:
    result = fetch_and_cache()
finally:
    lock.release()

# GOOD: Handle failure gracefully
if not lock.acquire(blocking=True, blocking_timeout=1):
    # Wait and retry from cache, or serve stale
    time.sleep(0.1)
    return redis.get(key)

try:
    result = fetch_and_cache()
finally:
    lock.release()

3. Probabilistic Early Expiration Beta Too Aggressive

Beta controls refresh aggressiveness. Too high causes unnecessary refreshes.

# BAD: Beta too high (aggressive) - refreshes constantly
get_with_probabilistic_early_expiration(key, fetch_func, beta=1.0)
# Almost every request triggers refresh for low TTL items

# GOOD: Beta around 0.3 - measured refresh
get_with_probabilistic_early_expiration(key, fetch_func, beta=0.3)
# Only refreshes when TTL gets low enough

4. Hysteresis Without Staleness Limit

Serving stale forever defeats the purpose of cache invalidation.

# BAD: No limit on how long to serve stale
if redis.get(key):
    return json.loads(redis.get(key))  # Always return whatever

# GOOD: Limit staleness window
def get_with_hysteresis(key, fetch_func, ttl=3600, max_stale_ttl=60):
    value = redis.get(key)
    if value:
        return json.loads(value)

    stale_value = redis.get(f"{key}:stale")
    if stale_value:
        if redis.ttl(f"{key}:stale") > -max_stale_ttl:
            # Serving within acceptable stale window
            trigger_background_refresh(key, fetch_func)
            return json.loads(stale_value)

    # No stale available - must refresh synchronously
    result = fetch_func(key)
    redis.setex(key, ttl, json.dumps(result))
    return result

5. Tiered Cache Without Invalidation Strategy

Updates to L2 don’t propagate to L1, causing stale reads.

# BAD: Write to L2 but L1 still has old data
def update(key, value):
    l2.setex(key, 3600, value)
    # L1 still has old value - will be served until evicted

# GOOD: Delete from both tiers
def update(key, value):
    l2.setex(key, 3600, value)
    l1.delete(key)  # Invalidate L1

Quick Recap

Key Bullets

Stampede protection is essential for production caches with popular expiring keys
Locking, hysteresis, and probabilistic early expiration all solve stampede differently
Circuit breakers prevent cascade failures when cache is unavailable
Tiered caching (L1+L2) gives sub-millisecond access for hottest data
Request coalescing reduces thundering herd by collapsing concurrent requests
Cache warming prevents cold-start latency after deployments or restarts

Copy/Paste Checklist

# Stampede protection - Locking pattern
lock = redis.lock(f"lock:{key}", timeout=5)
if lock.acquire(blocking=True, blocking_timeout=1):
    try:
        value = redis.get(key)
        if not value:
            value = fetch_from_db()
            redis.setex(key, 3600, value)
    finally:
        lock.release()

# Hysteresis pattern
value = redis.get(key)
if not value:
    stale = redis.get(f"{key}:stale")
    if stale:
        trigger_background_refresh(key)
        return json.loads(stale)
    value = fetch_from_db()
    redis.setex(key, 3600, value)

# Circuit breaker
if circuit_breaker.state == "open":
    return db.query(key)  # Skip cache
try:
    value = redis.get(key)
    if value:
        return json.loads(value)
except Exception as e:
    circuit_breaker.record_failure()
    return db.query(key)

# Tiered cache get
value = l1.get(key)
if value:
    return value
value = l2.get(key)
if value:
    l1.put(key, value)  # Promote to L1
    return value
value = db.query(key)
l2.setex(key, 3600, value)
l1.put(key, value)
return value

# Deployment checklist:
# [ ] Implement stampede protection before going to production
# [ ] Monitor lock contention and circuit breaker state
# [ ] Plan cache warming strategy for deployments
# [ ] Set appropriate TTLs for each cache tier
# [ ] Log tiered cache hit distribution (L1 vs L2)
# [ ] Have circuit breaker fallback to direct DB access

Observability Checklist

Metrics to Track

Stampede Rate: Number of simultaneous cache misses for same key
Lock Wait Time: How long requests wait for cache lock acquisition
Circuit Breaker State: Time spent in open/closed/half-open states
Stale Hit Ratio: Percentage of cache hits served while refreshing
L1 vs L2 Hit Distribution: Split between local and distributed cache tiers
Background Refresh Rate: How often early expiration triggers

# Stampede detection
def detect_stampede(redis_client, key, threshold=10):
    """
    Detect if multiple requests trying to rebuild same key.
    Uses INCR with short TTL as a lightweight counter.
    """
    stampede_key = f"stampede:{key}"
    count = redis_client.incr(stampede_key)
    if count == 1:
        redis_client.expire(stampede_key, 5)  # 5 second window

    if count > threshold:
        logger.warning("cache_stampede_detected", key=key, concurrent_requests=count)
        return True
    return False

# Hysteresis monitoring
def monitor_staleness(redis_client, key):
    stale_key = f"{key}:stale"
    if redis_client.exists(stale_key):
        ttl = redis_client.ttl(key)
        logger.info("serving_stale_data", key=key, fresh_ttl=ttl)

Logs to Capture

logger = structlog.get_logger()

# Circuit breaker state changes
def log_circuit_breaker(previous_state, new_state, key=None):
    logger.warning("circuit_breaker_state_change",
        previous_state=previous_state,
        new_state=new_state,
        cache_key=key,
        timestamp=time.time())

# Cache tier miss tracking
def log_tiered_miss(l1_hit=False, l2_hit=False, origin_required=True):
    logger.info("cache_tier_miss",
        l1_hit=l1_hit,
        l2_hit=l2_hit,
        origin_required=origin_required,
        latency_tier="l1" if l1_hit else "l2" if l2_hit else "origin")

Alert Rules

- alert: HighLockContention
  expr: rate(cache_lock_wait_seconds_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High lock contention on cache keys"

- alert: CircuitBreakerOpen
  expr: cache_circuit_breaker_open_duration_seconds > 60
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Circuit breaker open for more than 60 seconds"

- alert: HighStaleHitRatio
  expr: cache_stale_hits_total / cache_total_hits > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "More than 10% of cache hits are serving stale data"

Security Checklist

Lock key injection - Validate key names in lock acquisition; prevent large/special key injection
Rate limit lock attempts - Prevent denial of service via lock exhaustion
Stale data exposure - Ensure stale cache doesn’t expose sensitive data between users
Monitor circuit breaker state - An open circuit breaker may cause database overload
Tiered cache data isolation - Ensure L1 (local) cache doesn’t leak data between users/tenants
Background refresh access control - Background refresh threads should have same permission checks
CDN edge security - Don’t cache sensitive content at CDN edge

# Secure lock implementation
class SecureLockingCache:
    def __init__(self, redis_client, max_lock_timeout=5):
        self.redis = redis_client
        self.max_lock_timeout = max_lock_timeout

    def acquire_lock(self, key):
        # Validate key format - no special characters that could cause issues
        if not self._is_valid_key(key):
            raise ValueError(f"Invalid cache key: {key}")

        lock_key = f"lock:{key}"
        lock_timeout = min(self.max_lock_timeout, 5)  # Cap at 5 seconds

        # Use SET NX EX for atomic lock acquisition
        acquired = self.redis.set(lock_key, "1", nx=True, ex=lock_timeout)
        return acquired

    def _is_valid_key(self, key):
        # Basic validation - alphanumeric, colons, dashes, underscores
        import re
        return bool(re.match(r'^[\w:.-]+$', key))

Real-World Case Studies

No real-world case studies are currently documented for this topic.

Interview Questions

1. What is a cache stampede and how does it occur in production systems?

A cache stampede happens when a popular cache entry expires and multiple concurrent requests all try to rebuild the cache simultaneously. Since the cache is empty, all requests hit the database at once, overwhelming it and potentially causing cascading failures. The thundering herd is the same problem at a larger scale, typically when an entire cache tier goes down.

Multiple requests detect the same cache miss
All requests independently query the database
Database becomes a bottleneck under concurrent load
Cache never gets repopulated because requests keep timing out

2. Compare distributed locking vs probabilistic early expiration for preventing cache stampedes. When would you choose one over the other?

Distributed Locking: Only one process acquires the lock and rebuilds the cache; others wait or serve stale data. This guarantees no stampede but adds latency for the first request and complexity around lock management.

Probabilistic Early Expiration: Refreshes the cache before it expires based on a probability that increases as TTL decreases. This spreads the refresh load across time but doesn't guarantee stampede protection.

Use locking when: stampede protection is critical, high-concurrency on hot keys, lock infrastructure is already available
Use probabilistic when: latency variation is acceptable, access patterns are predictable, you want to avoid lock overhead

3. Explain the hysteresis (stale-while-revalidate) pattern. What are its advantages and the conditions under which it should be used?

Hysteresis keeps serving stale data while asynchronously refreshing the cache after the nominal TTL expires. The stale version is kept for a short window while a background thread repopulates the fresh value.

Advantages:

Users never experience a cache miss — response is always immediate
Eliminates request spikes at TTL expiration boundaries
Simple mental model: fresh or stale, no "loading" state

Use when:

Staleness tolerance exists (not financial data, medical records)
Latency sensitivity is high (user-facing requests)
A reasonable max_stale_ttl can be set and enforced

4. How does a circuit breaker pattern work in the context of caching? What is its role in preventing cascade failures?

A circuit breaker wraps cache access and tracks failures. When failures exceed a threshold, the breaker "opens" — subsequent requests skip the cache entirely and go directly to the database. After a timeout, it enters a "half-open" state to test if the cache has recovered.

Role in cascade prevention:

Isolates cache failures from database access
Allows cache to recover under reduced load
Provides graceful degradation: system keeps working (slower) even when cache is down

5. Describe tiered caching (L1 + L2). How do you decide what data should live in L1 vs L2?

Tiered caching uses a fast, small local cache (L1 — CPU cache or in-memory dict) in front of a slower, larger distributed cache (L2 — Redis/Memcached). Requests check L1 first, then L2, then the database.

L1 (Local/In-Memory):

Sub-millisecond access, limited size (MB range)
Hot data only: top 0.1% of accessed keys by a single node
Must be thread-safe and bounded to prevent memory leaks

L2 (Distributed):

Higher latency (~1-10ms) but larger capacity (GB range)
Shared across all application nodes
Source of truth for cache data across nodes

6. What is request coalescing and how does it prevent the thundering herd problem?

Request coalescing collapses multiple in-flight requests for the same key into a single database query. When a request finds another request already fetching the same key, it waits for that request's result rather than spawning a new database call.

Implementation uses a pending-request registry (e.g., a dict of Futures). When the first request starts, it creates a Future and stores it. Subsequent requests receive the same Future and await it. When the fetch completes, all waiters receive the result.

Dramatically reduces concurrent DB queries during cache misses
Works at the application level, not the cache level
Effective for thundering herd after cache restart

7. What are the trade-offs between write-through, write-back, and write-around cache strategies?

Write-Through: Synchronously update both cache and database on every write. Strong consistency but adds latency to writes. Best for read-heavy workloads where data must be correct.

Write-Back: Write to cache, return immediately, sync to database asynchronously. Lowest write latency but risk of data loss if cache fails before DB is updated. Use with durable storage.

Write-Around: Write directly to database, invalidate (not update) the cache. Simple and reliable — avoids the "cache update failed" problem. Brief staleness window after writes. Good for workloads where stale reads are briefly acceptable.

8. How do race conditions manifest in distributed caching, and what patterns address them?

Race conditions occur when multiple processes independently read, modify, and write cache entries without coordination. The classic problem: Process A reads cache (stale), Process B updates DB and invalidates cache, Process A writes its (now stale) value back to cache, overwriting the fresh data.

Solutions:

Version vectors: Tag each cache entry with a DB timestamp. Only serve/update cache if the version is newer than what's in the DB.
CAS (Compare-And-Swap): Use atomic operations (Redis WATCH/MULTI/EXEC) to detect concurrent modifications and retry.
Delete-on-write: Instead of updating cache on DB writes, delete the cache entry. Let the next read rebuild it. Eliminates the race entirely by not allowing concurrent updates.

9. What are the common pitfalls when using local (L1) caches in a distributed application?

Stale data across nodes: When L2 is updated, different nodes' L1 caches may still hold old values. Without an invalidation broadcast, some nodes serve stale data for the L1 TTL duration.

Memory leaks: Unbounded in-process caches (dicts, lists) grow without limit as more keys are accessed. Must use bounded LRU structures.

Invalidation complexity: L1 invalidation requires either broadcast (complex) or version checking (overhead). Simpler solutions just use short L1 TTLs or skip L1 entirely.

NUMA effects: On multi-socket servers, accessing an L1 cache on a different NUMA node is slow. CPU-affinitized caches or skipping L1 may be better.

10. How would you design a cache warming strategy for a system that experiences predictable traffic spikes (e.g., e-commerce site before Black Friday)?

A multi-tier warming strategy works best:

Pre-deployment: Identify the top N keys by expected traffic (top products, categories, user segments). Warm these into L2 before the traffic spike.
Tiered by priority: Warm the most popular 1000 keys first with longest TTL, then secondary keys. Use different TTLs for different tiers.
Background warming: Use a speculative warmer that tracks access counts and pre-loads keys that are trending hot but not yet in cache.
During the spike: Rely on hysteresis and probabilistic early expiration to keep hot keys fresh without lock contention.
Monitor: Watch hit rates and staleness. If L1 hit rate drops below threshold, temporarily increase L1 TTL to reduce L2 load.

11. Design a monitoring and observability strategy for a production cache system. What metrics are most critical and how would you alert on them?

Critical metrics for cache observability fall into four categories:

Hit/Miss Ratio: Track L1 hit rate, L2 hit rate, and overall cache hit rate. Alert when hit rate drops below baseline (indicates warming failure or cache invalidation issue).
Latency Percentiles: P50, P95, P99 latency at each cache tier. Latency spikes often precede cache failures.
Stampede Detection: Monitor concurrent miss counts per key. High concurrent misses on a single key signals a stampede in progress.
Circuit Breaker State: Track time spent in open vs closed state. Extended open state means cache is bypassed and database is under stress.
Staleness Metrics: Track how often stale data is served (hysteresis) vs fresh. Rising stale ratios indicate TTL settings are too aggressive.

Alert thresholds should be based on baseline measurements, not arbitrary values. Set warning alerts at 2 standard deviations and critical at 3 standard deviations from normal.

12. How does the probabilistic early expiration algorithm work mathematically? What happens when beta is set too high or too low?

The algorithm calculates refresh probability using: probability = 1 - (remaining_ttl / total_ttl) ^ beta

As TTL decreases, the probability increases. The beta parameter controls the curve shape:

beta = 0.3 (recommended): Gradual probability increase. Refreshes start when TTL is ~30% remaining. Most requests still serve cached data.
beta too high (>1.0): Aggressive refresh. Probability climbs steeply even when TTL is high. Results in excessive background refreshes, wasting resources and potentially causing lock contention.
beta too low (<0.1): Rarely refreshes. Effectively becomes standard TTL expiration — stampede risk returns at expiration boundary.
beta = 1.0: Linear probability increase. Starts refreshing at 50% TTL remaining. More aggressive than typical use cases.

The formula was derived from research on web caching and balances refresh overhead against miss latency. The key insight is that popular items benefit from early refresh because they expire at predictable times.

13. Explain the concept of "thundering herd" and how it differs from a cache stampede. What makes thundering herd harder to solve?

Cache Stampede: A single popular cache key expires, causing multiple requests to simultaneously rebuild that specific key.

Thundering Herd: An entire cache tier goes down (restart, crash, network partition) or a massive traffic spike hits, causing ALL keys to be unavailable simultaneously. Every request hits the database at once.

Why thundering herd is harder:

Scale: A stampede affects one key; thundering herd affects thousands. Request coalescing helps but the pending request queue itself can become a bottleneck.
Recovery paradox: As the cache recovers, the recovery itself triggers another thundering herd because all nodes warm up simultaneously.
No single lock helps: Distributed locking on one key doesn't help when the problem is all keys becoming unavailable at once.
Solutions are operational: Requires pre-warming, gradual recovery, traffic shedding, or a " cache proxy" layer that serves stale during recovery.

14. What is the difference between cache-aside, read-through, and write-behind caching patterns? When would you use each?

Cache-Aside (Lazy Loading): Application checks cache first; on miss, fetches from DB and populates cache. Most common pattern.

Read-Through: Cache automatically loads data from DB on miss. Application only talks to cache, never DB directly. Cache is responsible for loading.

Write-Behind (Write-Back): Application writes to cache, which asynchronously persists to DB. Highest write performance but risk of data loss.

Use cache-aside when: You need explicit control over cache contents, some data is accessed rarely, or you want to avoid caching data you might never read.
Use read-through when: You want simplified application code and the cache can determine the loading strategy (e.g., multi-tier where L1 loads from L2).
Use write-behind when: Write throughput is critical (logging, analytics), cache has battery-backed RAM, and brief data loss on crash is acceptable.

15. How would you handle cache invalidation in a microservices architecture where multiple services share cached data?

Cache invalidation across microservices requires explicit ownership and propagation:

Event-driven invalidation: When Service A updates shared data, publish an event (Kafka, SNS) that all services consuming that data subscribe to and invalidate their local cache entries.
Cache ownership model: Assign one service as the "owner" of each cached entity. Only the owner writes to the cache; other services read but never write directly to shared cache keys.
Short TTL with safety net: Use short TTLs (30-60 seconds) for shared data as a safety net. Even if invalidation fails, staleness is bounded.
Version-tagged entries: Include a version number from the database. Services check version before serving cached data. Reduces race conditions but adds complexity.
Distributed cache tier: Use a shared Redis/Memcached cluster instead of local caches. Invalidation propagates instantly. Trade-off: single point of failure, network dependency.

Best practice: Use event-driven invalidation with short TTL as fallback. The event ensures timely invalidation; the TTL bounds potential staleness.

16. What are the implications of cache stampede protection on system availability and consistency? How do you choose when to sacrifice one for the other?

Availability vs Consistency Trade-off:

Locking + synchronous refresh: Strong consistency but reduced availability. If lock holder crashes, requests wait or get stale. Timeouts must be set carefully.
Hysteresis: High availability (always serves data) but potential staleness. Acceptable when brief staleness is tolerable (e.g., product catalog, user preferences).
Probabilistic early expiration: Balanced approach. Most requests get fresh data with low latency. Some requests trigger background refresh. Works when access patterns are predictable.

Choosing:

Consistency critical: Financial data, inventory counts, leaderboards. Use locking with version checking. Accept higher latency.
Availability critical: User-facing content, media, product pages. Use hysteresis with staleness limits. Accept brief staleness.
Both important: Use tiered approach. Hysteresis for most data; locking for critical keys that must be consistent.

Never completely sacrifice availability — a slow response is better than an error. Circuit breakers ensure the database isn't overwhelmed, preserving overall system health.

17. Describe how consistent hashing works for cache sharding. What are the trade-offs compared to simple modulo-based sharding?

Consistent Hashing: Maps cache keys to a hash ring (0 to 2^32). Each cache node is assigned multiple positions on the ring (via virtual nodes). A key is stored on the first node found clockwise from its hash position.

Why it matters: When a node is added or removed, only K/N keys are remapped (where K is total keys, N is nodes). With simple modulo, K/N of ALL keys remap, causing a massive cache invalidation wave.

Virtual nodes (replicas): Assigning 150-200 virtual nodes per physical node improves distribution balance. More replicas = better load distribution but more memory for ring metadata.
Monotonicity: When adding nodes, existing keys don't need to move (beyond the redistribution boundary). Critical for production deployments.

Trade-offs:

Simple modulo: Perfect distribution but catastrophic reshuffle on node changes. Good for fixed-size, never-shrink cache clusters.
Consistent hashing: Graceful scaling with imperfect distribution. Slight hotspots possible with skewed key access patterns. Add replicas to improve balance.

18. How do CDN edge caches differ from application-level caches in terms of invalidation strategy and consistency guarantees?

CDN Edge Caches: Distributed globally, serve content from PoPs (points of presence) close to users. Typically HTTP caching (Varnish, Nginx, Cloudflare, Akamai).

Invalidation: Hard to invalidate. Requires purge API calls to each edge node. Purge propagation takes seconds to minutes. Not suitable for rapidly changing data.
Consistency: Eventual consistency at scale. Different edge nodes may serve different versions during propagation. TTL-based staleness is acceptable.
Best for: Static assets (images, CSS, JS), API responses with long TTLs, versioned content (URL-based versioning solves invalidation).

Application Caches (Redis/Memcached):

Invalidation: Immediate via delete command. Direct control. Can implement sophisticated invalidation strategies.
Consistency: Can achieve strong consistency with delete-on-write or version vectors. Critical for transactional data.
Best for: Dynamic content, user-specific data, frequently changing data, session storage.

Hybrid approach: Use CDN for static assets with long TTLs and versioned URLs. Use application cache for dynamic content. The two layers have orthogonal characteristics and complement each other.

19. What security considerations are specific to cache implementations? How can cache poisoning and cache timing attacks be mitigated?

Cache Poisoning: An attacker injects malicious data into the cache by exploiting cache key collision or header manipulation.

Key validation: Strictly validate cache key format. Reject keys with special characters, extremely long lengths, or injection patterns.
Key namespacing: Prefix all keys with a namespace (e.g., user:, product:). Prevents cross-namespace pollution.
Input sanitization: Normalize inputs before using as cache keys. Same logical request should always produce the same key.

Cache Timing Attacks: Attacker measures response times to determine if data was served from cache vs database, potentially revealing sensitive data existence.

Constant-time responses: Add jitter to responses to mask cache hit/miss timing differences.
Separate cache for sensitive data: Don't mix sensitive and non-sensitive data in the same cache infrastructure.
Rate limiting: Limit requests per IP/key to prevent systematic probing.

General cache security: Network segmentation (cache shouldn't be directly internet-accessible), TLS for cache connections, authentication for cache commands, no sensitive data in cache keys themselves.

20. Design a cache strategy for a read-heavy workload with occasional burst writes (e.g., a social media timeline). How do you balance freshness with performance?

Social media timelines have a specific access pattern: reads vastly outnumber writes, but writes create urgent freshness requirements (new posts must appear quickly).

Recommended approach:

Write-around with immediate invalidation: Writes go to the database, then the user's timeline cache key is deleted. Subsequent reads rebuild the timeline from the database.
Aggressive TTL: Timeline cache TTL of 30-60 seconds. Even without writes, timelines refresh frequently. Balances freshness with cache hit rate.
User-specific warming: When a user posts, proactively warm their timeline for their followers in the background (within 5-10 seconds).
Hysteresis for timeline reads: Serve stale timeline data while refreshing in background. Users see their own post immediately (via write-around); follower timeline updates within seconds.

Special consideration for fan-out:

Push model: When a user posts, write to each follower's timeline cache immediately. High write amplification but best freshness. Use for small follower counts.
Pull model: Timelines are always computed at read time from recent posts. Best for celebrities with millions of followers. Cache recent posts and assemble at read time.
Hybrid: Push to first 1000 followers, pull for the rest. Balances freshness with scalability.

Conclusion

Cache stampede protection matters when you have popular expiring entries. Cache warming matters after deployments. Tiered caching matters when you’re big enough that one cache tier isn’t enough.

Start with the basics. Add stampede protection early because it bites you unexpectedly. The other patterns you add when you have the operational need.

Don’t over-engineer before you have the problem.