Rate Limiting: Token Bucket, Sliding Window, and Distributed Systems

Rate limiting protects APIs from abuse. Learn token bucket, sliding window, fixed window algorithms and distributed rate limiting at scale.

published: March 22, 2026 reading time: 27 min read author: GeekWorkBench

Rate Limiting: Token Bucket, Sliding Window, and Distributed Systems

Every API has a capacity limit. Physical servers have finite CPU and memory. Databases can only handle so many connections. Rate limiting protects your services from being overwhelmed—whether by malicious attackers, runaway scripts, or too many legitimate users at once.

Without it, one misconfigured client can take down your entire API. A retry loop hitting a temporary backend hiccup creates a retry storm. Your service goes from degraded to dead.

Introduction

Rate limiting does more than block abuse. Here are the less obvious reasons you might want it.

Fairness. Without rate limits, heavy users crowd out light users. A single tenant uploading large files could saturate bandwidth for everyone else.

Cost control. Your infrastructure costs money. Rate limits let you charge different customers based on their usage.

Predictability. Hard limits mean clients can plan retries with proper backoff. No guesswork about when the system will cut them off.

Protection from cascading failures. Slow upstream services cause clients to retry more aggressively. Without rate limiting, retry storms make the original problem worse.

Core Concepts

Token Bucket

The token bucket algorithm is the most common approach. Picture a bucket holding a fixed number of tokens. Tokens get added to the bucket at a constant rate. Each request consumes one token. If the bucket is empty, requests get rejected.

import time
import threading

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.tokens = capacity
        self.last_refill = time.time()
        self.lock = threading.Lock()

    def allow_request(self) -> bool:
        # Note: This is a simplified illustration. The `self.lock` provides
        # thread-safety for this single-instance implementation. For distributed
        # rate limiting across multiple processes or servers, atomic operations
        # (e.g., Redis Lua scripts or atomic increments) are required.
        with self.lock:
            self._refill()
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        new_tokens = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + new_tokens)
        self.last_refill = now

Two things to keep in mind: bursts up to the bucket size are allowed, and the long-term average equals the refill rate. A bucket with capacity 100 and refill rate of 10 per second handles a burst of 100 requests, then settles into 10 per second.

Users get burst capacity when they need it, then settle into the sustainable rate.

Sliding Window

The sliding window algorithm fixes the boundary issues that fixed windows have. Rather than resetting counters at window boundaries, it tracks requests within a rolling time period.

import time
from collections import deque

class SlidingWindow:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.requests = deque()
        self.lock = threading.Lock()

    def allow_request(self) -> bool:
        with self.lock:
            now = time.time()
            cutoff = now - self.window_seconds

            # Remove expired entries
            while self.requests and self.requests[0] < cutoff:
                self.requests.popleft()

            if len(self.requests) < self.max_requests:
                self.requests.append(now)
                return True
            return False

The advantage over fixed window: no spike at window boundaries. With fixed windows, a client could make 100 requests at 11:59:59 and another 100 at 12:00:00, effectively doubling the rate. Sliding window prevents this.

The memory cost is proportional to the number of requests in the window. For high-volume APIs, this can add up.

Fixed Window

Fixed window is the simplest algorithm. Divide time into fixed intervals. Each interval has a request budget. Count resets at interval boundaries.

import time

class FixedWindow:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.window_start = time.time()
        self.count = 0
        self.lock = threading.Lock()

    def allow_request(self) -> bool:
        with self.lock:
            now = time.time()
            if now >= self.window_start + self.window_seconds:
                self.window_start = now
                self.count = 0

            if self.count < self.max_requests:
                self.count += 1
                return True
            return False

Fixed window is simple to implement and debug. The boundary spike is a real problem, but some applications can live with it. Memory stays constant.

Leaky Bucket

Leaky bucket works like a bucket with a hole in the bottom. Think of incoming requests as water being poured into the bucket — they arrive at irregular, bursty intervals just like water splashes in. The bucket holds them temporarily, and they drain out through the hole at a steady, controlled rate, like water trickling out. If too much water accumulates before it can drain (the bucket overflows), new requests are rejected. This smooths out bursty inbound traffic into a steady outflow, protecting a downstream service that cannot handle sudden spikes.

import time
import threading

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.water = 0
        self.last_leak = time.time()
        self.lock = threading.Lock()

    def allow_request(self) -> bool:
        with self.lock:
            self._leak()
            if self.water < self.capacity:
                self.water += 1
                return True
            return False

    def _leak(self):
        now = time.time()
        elapsed = now - self.last_leak
        leaked = elapsed * self.leak_rate
        self.water = max(0, self.water - leaked)
        self.last_leak = now

Leaky bucket drains at a steady rate, so it smooths your outbound traffic. If your API calls a third-party service with rate limits, leaky bucket prevents you from overwhelming it.

Implementing Rate Limits in APIs

HTTP Headers

APIs use headers to communicate rate limit status. No universal standard exists, but common conventions have emerged.

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1640995200
Retry-After: 3600

When a client exceeds the limit, return 429 Too Many Requests:

@app.route('/api/resource')
def get_resource():
    if not rate_limiter.allow_request():
        reset_time = rate_limiter.get_reset_time()
        return {
            'error': 'Rate limit exceeded',
            'retry_after': reset_time - time.time()
        }, 429
    return {'data': 'result'}

Clients should read these headers and back off. Use exponential backoff with jitter to avoid thundering herd problems.

Distributed Rate Limiting

In a single-server setup, these algorithms work fine. Modern applications run across multiple servers though. A client could hit different servers with separate counters and bypass the rate limit entirely.

Distributed rate limiting uses a shared store to coordinate limits across instances.

Redis-Based Token Bucket

-- Redis token bucket implementation
-- KEYS[1] = rate limit key
-- ARGV[1] = capacity
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = current timestamp
-- ARGV[4] = requested tokens (usually 1)

local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens
local elapsed = now - last_refill
local new_tokens = elapsed * refill_rate
tokens = math.min(capacity, tokens + new_tokens)

local allowed = 0
if tokens >= requested then
    tokens = tokens - requested
    allowed = 1
end

redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)

return {allowed, tokens}

This Lua script runs atomically in Redis, so race conditions disappear. The expiry stops the key from hanging around after the client stops making requests.

Redis-Based Sliding Window

Sliding window with Redis uses a sorted set. Each request gets a timestamp as the score and a unique member. To check the limit: remove entries older than the window, count remaining entries, add the new entry if within limit.

import redis
import time
import uuid

class RedisSlidingWindow:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.redis = redis.Redis()

    def allow_request(self, key: str) -> bool:
        now = time.time()
        window_start = now - self.window_seconds
        request_id = str(uuid.uuid4())

        pipe = self.redis.pipeline()
        # Remove old entries
        pipe.zremrangebyscore(key, 0, window_start)
        # Count current entries
        pipe.zcard(key)
        # Add new entry if allowed
        results = pipe.execute()

        current_count = results[1]
        if current_count < self.max_requests:
            self.redis.zadd(key, {request_id: now})
            self.redis.expire(key, self.window_seconds + 1)
            return True
        return False

Rate Limiting Strategies

Per-User vs Per-IP

Rate limits can apply to authenticated users or to IP addresses. Per-user limits work better for shared IPs (multiple users behind corporate NAT). Per-IP limits protect against attacks from single sources.

Use both. Per-IP limits catch unauthenticated abuse. Per-user limits ensure authenticated users share resources fairly.

Tiered Limits

Different subscription tiers get different rate limits. Free tier could get 100 requests per minute. Paid tier gets 1000. Enterprise gets unlimited.

def get_rate_limit(user: User) -> tuple[int, int]:
    tier = user.subscription_tier
    if tier == 'free':
        return (100, 60)  # 100 requests per 60 seconds
    elif tier == 'paid':
        return (1000, 60)
    elif tier == 'enterprise':
        return (10000, 60)

Global vs Endpoint-Specific

You might want different limits for different endpoints. Public data endpoints could have generous limits. Write operations should have stricter limits. Admin endpoints might be limited to internal networks only.

ENDPOINT_LIMITS = {
    '/api/public/search': (1000, 60),
    '/api/private/write': (100, 60),
    '/api/admin/': (10000, 60),
}

def check_rate_limit(endpoint: str, user: User) -> bool:
    limit = ENDPOINT_LIMITS.get(endpoint, (100, 60))
    return sliding_window.allow_request(f"user:{user.id}:{endpoint}", *limit)

Handling Rate Limit Violations

When a client exceeds the limit, return a proper response.

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 3600

{
    "error": "rate_limit_exceeded",
    "message": "API rate limit exceeded. Please retry after 3600 seconds.",
    "limit": 1000,
    "reset_at": "2026-03-22T13:00:00Z"
}

Log rate limit violations. A client repeatedly hitting limits might be malicious, or might need a higher tier. Either way, you want visibility into this. Consider graduated responses. First violation gets a warning header. Repeated violations get actual rejections. This helps legitimate clients that are misconfigured.

When to Use Which Algorithm

Token bucket suits APIs where burst usage is normal. Users make many requests quickly while actively using a feature, then go idle.

Sliding window suits APIs needing smooth, predictable limiting. Better for services calling downstream APIs with strict rate limits.

Fixed window suits when you need simple, debuggable limits. The boundary spike might be acceptable for your use case.

Leaky bucket suits when you need to send requests at a steady rate to a downstream service.

For most web APIs, token bucket with Redis handles the job. It is well-understood, memory-efficient, and works naturally in distributed systems.

Trade-off Analysis

Key Observations

Token bucket vs Sliding window: Token bucket allows bursts which users expect. Sliding window enforces stricter consistency but costs more memory. For most APIs, token bucket wins. For financial or regulatory compliance where exact limits matter, sliding window is better.
Redis sorted sets vs Lua counters: Sorted sets (for sliding window) use more memory but support precise windowing. Lua counters use less memory but approximate behavior. At high scale, Lua counters win on memory efficiency.
Single Redis vs Redis Cluster: Single Redis handles ~100K ops/sec on modern hardware. Redis Cluster adds latency from cross-node coordination. Design for single Redis first, cluster only when proven necessary.
Fail-open vs fail-closed: Fail-open maximizes availability but removes protection during outages. Fail-closed protects resources but causes denial-of-service when rate limiting fails. Internal services protecting databases should fail closed; user-facing APIs can often fail open with degraded logging.

Algorithm Selection Matrix

Use this matrix to choose an algorithm based on your requirements:

Requirement	Recommended Algorithm	Why
Burst-friendly traffic	Token Bucket	Absorbs bursts naturally
Strict rate enforcement	Sliding Window	No boundary spikes
Minimal implementation	Fixed Window	Simple counters
Outbound throttling	Leaky Bucket	Steady outflow rate
Memory constrained	Token Bucket / Fixed Window	Constant memory
High accuracy needed	Sliding Window	Precise windowing

Distributed Systems Trade-offs

When implementing rate limiting across multiple nodes:

Consistency vs Availability: Strong consistency (Redis Cluster) adds latency but ensures accurate limits. Eventual consistency (local fallback) is faster but may allow slight limit bypass during coordination delays.

Latency vs Accuracy: Lua scripts atomic but add ~1ms per request. Application-side checks are faster but require careful locking. For most APIs, the Lua overhead is acceptable.

Storage efficiency vs Precision: Sorted sets track every request precisely but memory grows with window size. Approximate algorithms (token bucket with Lua) use constant memory but average over time rather than counting exact requests.

Cost-Benefit Summary

Factor	Token Bucket	Sliding Window	Fixed Window	Leaky Bucket
Implementation cost	Medium	High	Low	Medium
Operational cost	Low	Medium	Low	Low
User experience	Good (bursts allowed)	Best (smooth)	Poor (boundary spike)	Poor (no bursts)
Infrastructure impact	Low	Medium (more Redis ops)	Low	Low
Compliance suitability	Moderate	Best	Poor	Moderate

Real-world Failure Scenarios

Understanding how rate limiting fails in production helps you design more robust systems. These scenarios are based on real-world incidents.

The Akamai Rewrite Incident

A major SaaS provider deployed a rate limiting rule that dropped requests exceeding 1,000 per minute. The rule worked perfectly in staging. In production, their CDN (Akamai) was rewriting request headers, inflating header counts and triggering the limit for legitimate users making just 50 requests. The result: 12 hours of intermittent outages affecting 30% of users.

Lesson: Test rate limiting with production traffic patterns, not synthetic requests. CDNs, proxies, and load balancers modify headers in ways staging environments may not replicate.

The Leap Second Chaos

During a leap second, a distributed rate limiter’s sliding window implementation began rejecting all requests. The system used Redis sorted sets with timestamp-based expiry. The leap second caused timestamp calculations to overflow, making all requests appear to be outside the window.

Lesson: Use monotonic time sources where possible. Test your rate limiting implementation against edge cases: leap seconds, year rollovers, timezone changes, and clock adjustments.

The Noisy Neighbor Problem

A single tenant on a multi-tenant SaaS platform began sending webhook notifications to their entire user base. The webhook calls hit the rate limiter, but because all tenants shared the same Redis cluster, the noisy neighbor’s traffic caused latency spikes for all other tenants. The rate limiter itself became a bottleneck.

Lesson: Isolate rate limiting state by tenant. Use separate Redis instances or at minimum separate key namespaces. Monitor for noisy neighbors before they affect others.

The Kubernetes HPA Over-correction

A Kubernetes deployment scaled from 2 to 20 pods during a traffic spike. Each pod initialized its own local rate limiter with full burst capacity. With 20 pods each allowing bursts, the effective limit was 20x the intended limit. The downstream database was overwhelmed before the rate limiting caught up.

Lesson: Local rate limiting does not work with auto-scaling. Use centralized rate limiting (Redis) when your service scales horizontally. Alternatively, use a rate limiting sidecar that coordinates across pods.

The Gradual Rollout Failure

A company rolled out new per-user rate limits to 5% of traffic. The new limits worked. They rolled out to 25%. Still worked. At 50%, Redis latency spiked to 500ms. At 100%, the rate limiting Redis was overwhelmed and began timing out. The entire API became unreliable.

Lesson: Test rate limiting scalability before rollout. Distributed counters add Redis round-trips to every request. Profile your Redis under expected load, and add headroom for spikes.

Production Failure Scenarios

Failure	Impact	Mitigation
Rate limit counter overflow	Limits enforced incorrectly; bypassed or too restrictive	Use appropriate data types; set maximum values; monitor for unusual patterns
Redis unavailable for distributed limits	Rate limiting fails open or closed	Decide on fail-open vs fail-closed; implement local fallback with reduced limits
Clock skew in sliding window	Requests incorrectly counted or allowed	Use Redis server time; NTP sync all rate-limiting servers
Thundering herd on limit reset	All clients retry simultaneously at window boundary	Use jitter in retry logic; consider sliding window instead of fixed
Client bypass via IP spoofing	Rate limits circumvented by malicious actors	Combine per-IP with per-user limits; implement multiple enforcement layers

Common Pitfalls / Anti-Patterns

Not Being Atomic

Race conditions in distributed rate limiting cause limit bypass. Two requests arriving simultaneously on different servers both check the counter, both see room, both increment. Both get through when only one should.

Use atomic operations. Redis Lua scripts, conditional writes, or distributed locks. Pick one that matches your consistency requirements.

Ignoring Header Consistency

When using multiple rate-limiting instances, headers might show different values to the same client. Request hits Server A, sees 99 remaining. Request hits Server B, sees 100 remaining. This confuses clients tracking their usage.

Use a shared counter store and return values from it. Accept the latency cost.

Allowing Unrestricted Bursts

Token bucket capacity sets the maximum burst. Large bursts can overwhelm downstream services even if the long-term average is within limits.

Set burst limits based on what your infrastructure can actually handle, not on what seems generous.

Quick Recap

Key Bullets:

Token bucket suits burst-friendly APIs; sliding window suits consistent-rate APIs
Distributed rate limiting requires atomic operations via Redis Lua or similar
Combine per-IP and per-user limits for defense in depth
Always include Retry-After header; clients should respect it
Monitor rate limit metrics to detect abuse and misconfiguration

Copy/Paste Checklist:

Rate Limiting Implementation:
[ ] Choose algorithm based on use case
[ ] Implement atomic distributed counter (Redis Lua)
[ ] Set burst capacity based on downstream tolerance
[ ] Implement per-IP and per-user limits
[ ] Add X-RateLimit-* headers to all responses
[ ] Include Retry-After on 429 responses
[ ] Implement graceful degradation (fail-open vs fail-closed)
[ ] Add jitter to client retry logic
[ ] Monitor violation rates per client
[ ] Test with concurrent requests from multiple nodes

Observability Checklist

Metrics:
- Requests allowed vs rejected per endpoint
- Rate limit headroom (remaining vs total)
- Distributed counter sync latency
- 429 response rate by endpoint and client
- Token bucket utilization over time
Logs:
- Rate limit violations with client identifier
- Counter sync operations between nodes
- Algorithm selection per endpoint
- Burst detection (sudden spikes in requests)
Alerts:
- 429 response rate exceeds threshold (indicates abuse or misconfiguration)
- Rate limit headroom approaching zero
- Counter desynchronization between Redis instances
- Unusual spike in requests from single source

Security Checklist

Rate limiting enforced server-side (not just client-side)
Per-IP limits combined with per-user limits
Distributed rate limiting synchronized across all instances
Atomic operations prevent counter bypass
Rate limit headers do not expose internal implementation details
Graduated response (warn before blocking) for misconfigured clients
Anonymize PII in rate limiting logs

Interview Questions

1. What is the difference between rate limiting and quotas?

Rate limiting controls the rate of requests—how many requests per second, minute, or hour. Quotas control total usage over a time period—how many requests per month. A rate limit of 100 requests per minute means you can burst to 100 quickly or spread them evenly; a quota of 100,000 per month means you get 100,000 total, regardless of distribution.

Use rate limits to prevent abuse and system overload. Use quotas for billing and capacity planning. Most APIs implement both: a rate limit prevents momentary abuse, a quota prevents sustained overages.

2. Compare token bucket, sliding window, and fixed window algorithms for rate limiting.

Token bucket: Tokens accumulate up to a bucket size, then refill at a steady rate. Allows bursts up to the bucket size, then limits sustained rate. Good for user-facing APIs where users make bursts while actively using features.

Sliding window: Tracks requests in a rolling time window. Smoother than fixed window, more accurate, but requires more memory. Good when you need consistent limiting without boundary spikes.

Fixed window: Simplest—reset counters at the window boundary. Has a boundary spike problem where users get double the limit at window edges. Use only when spike tolerance is acceptable.

3. How do you implement rate limiting in a distributed system with multiple servers?

Local (in-memory) rate limiting does not work across multiple servers—a user could get N requests per server, effectively getting N times the limit. You need shared state.

Redis is the standard solution. Use atomic operations like INCR and EXPIRE to increment counters atomically. Use Lua scripts for complex algorithms like sliding window that need multiple operations to be atomic. This ensures all servers see the same count.

Consider Redis Cluster for sharding if your rate limiting needs exceed a single Redis instance's capacity. Network latency to Redis adds 1-2ms per request—acceptable for most use cases, problematic for ultra-low latency paths.

4. What is the thundering herd problem in rate limiting?

The thundering herd happens when many requests are queued or retried simultaneously after a rate limit resets or expires. All clients hit the system at once, overwhelming it even though they were rate limited before.

Mitigations: add jitter to client retry logic so retries are spread over time rather than synchronized. For cache stampede specifically, use probabilistic early expiration—some requests proactively refresh the cache before it expires, preventing all requests from hitting the backend when the cache clears.

5. How should a client handle a 429 Too Many Requests response?

Read the Retry-After header if present—it tells you how long to wait. If no Retry-After header, use exponential backoff with jitter. Start with a small delay, double it for each subsequent failure, add random jitter to prevent synchronization with other clients.

Do not spin retrying immediately—that just increases load and guarantees another 429. Respect the rate limit. If you consistently hit rate limits, you need to either reduce usage, request higher limits, or use caching to avoid redundant requests.

6. What is per-IP versus per-user rate limiting and when would you use each?

Per-IP rate limiting tracks requests by client IP address. Good for preventing abuse from anonymous clients or distributed attacks. But IPv6 and NAT make IP tracking unreliable, and shared IPs (corporate networks, mobile carriers) penalize legitimate users.

Per-user rate limiting tracks by authenticated user ID or API key. More accurate for metered APIs where you bill per user. Requires authentication—cannot apply to anonymous endpoints.

Use both in layers: per-IP as a first-pass defense against obvious abuse, per-user for accurate limiting of authenticated requests. This gives you protection against both anonymous floods and individual user overages.

7. How do leaky bucket and token bucket algorithms differ?

Leaky bucket processes requests at a constant rate regardless of arrival pattern. Requests queue in the bucket and drain at the configured rate. If the bucket is full, new requests are rejected. Think of a physical bucket with a hole—water drains at a steady rate regardless of how fast you pour it in.

Token bucket adds tokens to a bucket at a steady rate. Requests consume tokens—if a token is available, the request proceeds. This allows bursts up to the bucket capacity while enforcing an average rate. A user with accumulated tokens can make many requests quickly, then slows down as tokens replenish.

Token bucket suits most APIs. Leaky bucket suits APIs calling downstream services that cannot handle bursts.

8. What metrics should you monitor for rate limiting?

Monitor violation rate per client or IP—this tells you if abuse is increasing. Monitor rate limit headroom (how close users are to their limits) to identify users approaching limits before they hit them. Monitor 429 response rate as a percentage of total requests to detect systemic misconfiguration.

At the system level, monitor Redis latency for distributed rate limiting—if Redis is slow, your rate limiting adds latency to every request. Monitor false positives: legitimate users hitting rate limits because they share an IP with bad actors.

9. What is adaptive rate limiting?

Adaptive rate limiting adjusts limits based on system load or detected anomalies. When the system is under stress, limits tighten—fewer requests per client. When the system is healthy, limits expand. This provides graceful degradation under load rather than hard cutoffs.

Implementation: monitor system metrics (CPU, latency P99, queue depth). When metrics exceed thresholds, reduce limits proportionally. When metrics improve, restore limits gradually. Some systems integrate with circuit breakers—circuit open on a downstream service automatically reduces rate limits to protect it.

10. What are the trade-offs between L4 and L7 rate limiting?

L4 (transport layer) rate limiting works on IP and TCP/UDP—very fast, simple to implement at the network level. Cannot inspect application payload, so limited to IP-based limiting. Hardware network devices handle this efficiently.

L7 (application layer) rate limiting can inspect request content—user ID, API key, request path, headers. More sophisticated limiting decisions but requires more processing. Adds latency. Suitable for scenarios requiring fine-grained control over application layer data.

Use L4 for gross limiting at the network edge, L7 for API-level limiting where you need to distinguish between users or endpoints. Most architectures do both.

11. What is rate limit bypass and how do you prevent it?

Rate limit bypass occurs when clients find ways to circumvent rate limiting—using multiple IP addresses, spoofing headers, rotating user agents, or exploiting inconsistencies between enforcement points.

Prevention: implement defense in depth with multiple layers (gateway + application), combine per-IP with per-user limits, use fingerprinting (IP + header combination), monitor for anomalous patterns like many requests from sequential IPs, and ensure atomic operations in distributed implementations.

12. How does GC (garbage collection) pressure affect token bucket implementations?

Token bucket in high-concurrency languages must handle atomic operations carefully. In Go, use sync/atomic. In Java, use AtomicLong or LongAdder. In Python, the GIL serializes access but the refill computation still needs locking to prevent race conditions.

Memory pressure from frequent object allocation (for timestamps, tokens) can cause GC pauses that delay refill calculations, making the algorithm less accurate under load.

13. What is the difference between fail-open and fail-closed rate limiting?

Fail-open: when the rate limiting system is unavailable (Redis down), requests are allowed through. Maximizes availability but removes protection during failures.

Fail-closed: when rate limiting is unavailable, requests are rejected. Protects the system but causes outages when dependencies fail.

Choose based on your risk tolerance. User-facing APIs often fail open to avoid blocking legitimate users. Internal services protecting downstream resources often fail closed.

14. How would you implement hierarchical rate limiting across multiple tiers (user → plan → global)?

Hierarchical rate limiting enforces limits at multiple granularities simultaneously. A user has per-endpoint limits, per-plan limits (all endpoints combined), and a global system limit. All three must pass for a request to proceed.

Implementation: check in order from narrowest to broadest scope. If user limit passes, check plan limit. If plan limit passes, check global limit. Reject on first failure. Use Redis Lua scripts to perform multi-level checks atomically.

15. What are token credit systems and how do they differ from standard token bucket?

Token credit systems pre-allocate a pool of tokens (credits) to users, which they spend on API calls. Credits may replenish on a schedule (monthly allowance) or via purchase. This differs from token bucket where tokens accumulate continuously.

Credits are better for billing scenarios where you need predictable monthly allowances and usage tracking. Standard token bucket is better for traffic shaping without usage accounting.

16. How do you handle rate limiting for authenticated vs anonymous users?

Anonymous users get limited by IP address or fingerprint. Authenticated users get limited by user ID or API key. This ensures anonymous abuse does not consume the authenticated user's quota.

Implementation: check anonymous limits first (faster rejection of obvious abuse). If authenticated, also check per-user limits. Return different error messages and retry windows based on which limit was hit.

17. What is a retry storm and how does rate limiting help prevent it?

A retry storm occurs when many clients retry simultaneously after a service disruption. Without rate limiting, retries multiply the load on recovering services, potentially causing the outage to continue or worsen.

Rate limiting on both client and server side prevents storms. Clients should implement exponential backoff with jitter. Servers should return Retry-After headers and reject immediately rather than timing out slowly.

18. How would you design a multi-tenant SaaS rate limiting system?

Multi-tenant rate limiting requires isolation between tenants while maintaining shared infrastructure efficiency. Use tenant ID as the rate limit key prefix. Each tenant gets their own bucket or counter.

Consider: per-tenant limits can be static (config) or dynamic (based on subscription tier). Use Redis hashes with tenant ID as key for efficient per-tenant storage. Monitor for noisy neighbor problems where one tenant's traffic affects others.

19. What is the role of jitter in client retry strategies?

Jitter adds randomness to retry delays to prevent synchronized retries from multiple clients. Without jitter, all clients waiting for the same Retry-After duration retry at the same instant, creating a thundering herd.

Types: full jitter (random delay between 0 and cap), equal jitter (half the delay plus random), decorrelated jitter (each delay independent). Full jitter is simplest and generally effective.

20. How do you test rate limiting implementations?

Unit test each algorithm in isolation: verify bucket empties at refill rate, sliding window removes old entries, fixed window resets correctly.

Integration test: concurrent requests from multiple nodes hitting a shared Redis store. Verify atomicity—max N concurrent requests should succeed, N+1 should fail.

Chaos test: kill Redis mid-request, verify fail-open or fail-closed behavior as designed. Test clock skew scenarios for sliding window.

Conclusion

Rate limiting protects your infrastructure from abuse, controls costs, and ensures fair resource allocation across clients. Token bucket handles bursts naturally, sliding window provides smooth limiting, and distributed implementations using Redis ensure consistency across multiple servers.

The right algorithm depends on your use case. Token bucket suits most APIs. Use sliding window when you need consistent rate limiting without boundary spikes. Use fixed window only when simplicity matters more than accuracy.

Implementation checklist: choose your algorithm, set burst capacity based on downstream tolerance, implement per-IP and per-user limits, add rate limit headers to responses, include Retry-After on 429s, monitor violation rates, and test with concurrent requests from multiple nodes.

Rate Limiting: Token Bucket, Sliding Window, and Distributed Systems

Introduction

Core Concepts

Token Bucket

Sliding Window

Fixed Window

Leaky Bucket

Implementing Rate Limits in APIs

HTTP Headers

Distributed Rate Limiting

Redis-Based Token Bucket

Redis-Based Sliding Window

Rate Limiting Strategies

Per-User vs Per-IP

Tiered Limits

Global vs Endpoint-Specific

Handling Rate Limit Violations

When to Use Which Algorithm

Trade-off Analysis

Key Observations

Algorithm Selection Matrix

Distributed Systems Trade-offs

Cost-Benefit Summary

Real-world Failure Scenarios

The Akamai Rewrite Incident

The Leap Second Chaos

The Noisy Neighbor Problem

The Kubernetes HPA Over-correction

The Gradual Rollout Failure

Production Failure Scenarios

Common Pitfalls / Anti-Patterns

Not Being Atomic

Ignoring Header Consistency

Allowing Unrestricted Bursts

Quick Recap

Observability Checklist

Security Checklist

Interview Questions

Further Reading

Conclusion

Category

Tags

Related Posts

API Versioning: Managing Change Without Breaking Clients

GraphQL vs REST: Choosing the Right API Paradigm

RESTful API Design: Best Practices for Building Web APIs