Rate Limiting: Token Bucket, Sliding Window, and Distributed Systems
Rate limiting protects APIs from abuse. Learn token bucket, sliding window, fixed window algorithms and distributed rate limiting at scale.
Rate Limiting: Token Bucket, Sliding Window, and Distributed Systems
Every API has a capacity limit. Physical servers have finite CPU and memory. Databases can only handle so many connections. Rate limiting protects your services from being overwhelmed—whether by malicious attackers, runaway scripts, or too many legitimate users at once.
Without it, one misconfigured client can take down your entire API. A retry loop hitting a temporary backend hiccup creates a retry storm. Your service goes from degraded to dead.
Introduction
Rate limiting does more than block abuse. Here are the less obvious reasons you might want it.
Fairness. Without rate limits, heavy users crowd out light users. A single tenant uploading large files could saturate bandwidth for everyone else.
Cost control. Your infrastructure costs money. Rate limits let you charge different customers based on their usage.
Predictability. Hard limits mean clients can plan retries with proper backoff. No guesswork about when the system will cut them off.
Protection from cascading failures. Slow upstream services cause clients to retry more aggressively. Without rate limiting, retry storms make the original problem worse.
Core Concepts
Token Bucket
The token bucket algorithm is the most common approach. Picture a bucket holding a fixed number of tokens. Tokens get added to the bucket at a constant rate. Each request consumes one token. If the bucket is empty, requests get rejected.
import time
import threading
class TokenBucket:
def __init__(self, capacity: int, refill_rate: float):
self.capacity = capacity
self.refill_rate = refill_rate # tokens per second
self.tokens = capacity
self.last_refill = time.time()
self.lock = threading.Lock()
def allow_request(self) -> bool:
# Note: This is a simplified illustration. The `self.lock` provides
# thread-safety for this single-instance implementation. For distributed
# rate limiting across multiple processes or servers, atomic operations
# (e.g., Redis Lua scripts or atomic increments) are required.
with self.lock:
self._refill()
if self.tokens >= 1:
self.tokens -= 1
return True
return False
def _refill(self):
now = time.time()
elapsed = now - self.last_refill
new_tokens = elapsed * self.refill_rate
self.tokens = min(self.capacity, self.tokens + new_tokens)
self.last_refill = now
Two things to keep in mind: bursts up to the bucket size are allowed, and the long-term average equals the refill rate. A bucket with capacity 100 and refill rate of 10 per second handles a burst of 100 requests, then settles into 10 per second.
Users get burst capacity when they need it, then settle into the sustainable rate.
Sliding Window
The sliding window algorithm fixes the boundary issues that fixed windows have. Rather than resetting counters at window boundaries, it tracks requests within a rolling time period.
import time
from collections import deque
class SlidingWindow:
def __init__(self, max_requests: int, window_seconds: int):
self.max_requests = max_requests
self.window_seconds = window_seconds
self.requests = deque()
self.lock = threading.Lock()
def allow_request(self) -> bool:
with self.lock:
now = time.time()
cutoff = now - self.window_seconds
# Remove expired entries
while self.requests and self.requests[0] < cutoff:
self.requests.popleft()
if len(self.requests) < self.max_requests:
self.requests.append(now)
return True
return False
The advantage over fixed window: no spike at window boundaries. With fixed windows, a client could make 100 requests at 11:59:59 and another 100 at 12:00:00, effectively doubling the rate. Sliding window prevents this.
The memory cost is proportional to the number of requests in the window. For high-volume APIs, this can add up.
Fixed Window
Fixed window is the simplest algorithm. Divide time into fixed intervals. Each interval has a request budget. Count resets at interval boundaries.
import time
class FixedWindow:
def __init__(self, max_requests: int, window_seconds: int):
self.max_requests = max_requests
self.window_seconds = window_seconds
self.window_start = time.time()
self.count = 0
self.lock = threading.Lock()
def allow_request(self) -> bool:
with self.lock:
now = time.time()
if now >= self.window_start + self.window_seconds:
self.window_start = now
self.count = 0
if self.count < self.max_requests:
self.count += 1
return True
return False
Fixed window is simple to implement and debug. The boundary spike is a real problem, but some applications can live with it. Memory stays constant.
Leaky Bucket
Leaky bucket works like a bucket with a hole in the bottom. Think of incoming requests as water being poured into the bucket — they arrive at irregular, bursty intervals just like water splashes in. The bucket holds them temporarily, and they drain out through the hole at a steady, controlled rate, like water trickling out. If too much water accumulates before it can drain (the bucket overflows), new requests are rejected. This smooths out bursty inbound traffic into a steady outflow, protecting a downstream service that cannot handle sudden spikes.
import time
import threading
class LeakyBucket:
def __init__(self, capacity: int, leak_rate: float):
self.capacity = capacity
self.leak_rate = leak_rate
self.water = 0
self.last_leak = time.time()
self.lock = threading.Lock()
def allow_request(self) -> bool:
with self.lock:
self._leak()
if self.water < self.capacity:
self.water += 1
return True
return False
def _leak(self):
now = time.time()
elapsed = now - self.last_leak
leaked = elapsed * self.leak_rate
self.water = max(0, self.water - leaked)
self.last_leak = now
Leaky bucket drains at a steady rate, so it smooths your outbound traffic. If your API calls a third-party service with rate limits, leaky bucket prevents you from overwhelming it.
Implementing Rate Limits in APIs
HTTP Headers
APIs use headers to communicate rate limit status. No universal standard exists, but common conventions have emerged.
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1640995200
Retry-After: 3600
When a client exceeds the limit, return 429 Too Many Requests:
@app.route('/api/resource')
def get_resource():
if not rate_limiter.allow_request():
reset_time = rate_limiter.get_reset_time()
return {
'error': 'Rate limit exceeded',
'retry_after': reset_time - time.time()
}, 429
return {'data': 'result'}
Clients should read these headers and back off. Use exponential backoff with jitter to avoid thundering herd problems.
Distributed Rate Limiting
In a single-server setup, these algorithms work fine. Modern applications run across multiple servers though. A client could hit different servers with separate counters and bypass the rate limit entirely.
Distributed rate limiting uses a shared store to coordinate limits across instances.
Redis-Based Token Bucket
-- Redis token bucket implementation
-- KEYS[1] = rate limit key
-- ARGV[1] = capacity
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = current timestamp
-- ARGV[4] = requested tokens (usually 1)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
-- Refill tokens
local elapsed = now - last_refill
local new_tokens = elapsed * refill_rate
tokens = math.min(capacity, tokens + new_tokens)
local allowed = 0
if tokens >= requested then
tokens = tokens - requested
allowed = 1
end
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
return {allowed, tokens}
This Lua script runs atomically in Redis, so race conditions disappear. The expiry stops the key from hanging around after the client stops making requests.
Redis-Based Sliding Window
Sliding window with Redis uses a sorted set. Each request gets a timestamp as the score and a unique member. To check the limit: remove entries older than the window, count remaining entries, add the new entry if within limit.
import redis
import time
import uuid
class RedisSlidingWindow:
def __init__(self, max_requests: int, window_seconds: int):
self.max_requests = max_requests
self.window_seconds = window_seconds
self.redis = redis.Redis()
def allow_request(self, key: str) -> bool:
now = time.time()
window_start = now - self.window_seconds
request_id = str(uuid.uuid4())
pipe = self.redis.pipeline()
# Remove old entries
pipe.zremrangebyscore(key, 0, window_start)
# Count current entries
pipe.zcard(key)
# Add new entry if allowed
results = pipe.execute()
current_count = results[1]
if current_count < self.max_requests:
self.redis.zadd(key, {request_id: now})
self.redis.expire(key, self.window_seconds + 1)
return True
return False
Rate Limiting Strategies
Per-User vs Per-IP
Rate limits can apply to authenticated users or to IP addresses. Per-user limits work better for shared IPs (multiple users behind corporate NAT). Per-IP limits protect against attacks from single sources.
Use both. Per-IP limits catch unauthenticated abuse. Per-user limits ensure authenticated users share resources fairly.
Tiered Limits
Different subscription tiers get different rate limits. Free tier could get 100 requests per minute. Paid tier gets 1000. Enterprise gets unlimited.
def get_rate_limit(user: User) -> tuple[int, int]:
tier = user.subscription_tier
if tier == 'free':
return (100, 60) # 100 requests per 60 seconds
elif tier == 'paid':
return (1000, 60)
elif tier == 'enterprise':
return (10000, 60)
Global vs Endpoint-Specific
You might want different limits for different endpoints. Public data endpoints could have generous limits. Write operations should have stricter limits. Admin endpoints might be limited to internal networks only.
ENDPOINT_LIMITS = {
'/api/public/search': (1000, 60),
'/api/private/write': (100, 60),
'/api/admin/': (10000, 60),
}
def check_rate_limit(endpoint: str, user: User) -> bool:
limit = ENDPOINT_LIMITS.get(endpoint, (100, 60))
return sliding_window.allow_request(f"user:{user.id}:{endpoint}", *limit)
Handling Rate Limit Violations
When a client exceeds the limit, return a proper response.
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 3600
{
"error": "rate_limit_exceeded",
"message": "API rate limit exceeded. Please retry after 3600 seconds.",
"limit": 1000,
"reset_at": "2026-03-22T13:00:00Z"
}
Log rate limit violations. A client repeatedly hitting limits might be malicious, or might need a higher tier. Either way, you want visibility into this. Consider graduated responses. First violation gets a warning header. Repeated violations get actual rejections. This helps legitimate clients that are misconfigured.
When to Use Which Algorithm
Token bucket suits APIs where burst usage is normal. Users make many requests quickly while actively using a feature, then go idle.
Sliding window suits APIs needing smooth, predictable limiting. Better for services calling downstream APIs with strict rate limits.
Fixed window suits when you need simple, debuggable limits. The boundary spike might be acceptable for your use case.
Leaky bucket suits when you need to send requests at a steady rate to a downstream service.
For most web APIs, token bucket with Redis handles the job. It is well-understood, memory-efficient, and works naturally in distributed systems.
Trade-off Analysis
Key Observations
-
Token bucket vs Sliding window: Token bucket allows bursts which users expect. Sliding window enforces stricter consistency but costs more memory. For most APIs, token bucket wins. For financial or regulatory compliance where exact limits matter, sliding window is better.
-
Redis sorted sets vs Lua counters: Sorted sets (for sliding window) use more memory but support precise windowing. Lua counters use less memory but approximate behavior. At high scale, Lua counters win on memory efficiency.
-
Single Redis vs Redis Cluster: Single Redis handles ~100K ops/sec on modern hardware. Redis Cluster adds latency from cross-node coordination. Design for single Redis first, cluster only when proven necessary.
-
Fail-open vs fail-closed: Fail-open maximizes availability but removes protection during outages. Fail-closed protects resources but causes denial-of-service when rate limiting fails. Internal services protecting databases should fail closed; user-facing APIs can often fail open with degraded logging.
Algorithm Selection Matrix
Use this matrix to choose an algorithm based on your requirements:
| Requirement | Recommended Algorithm | Why |
|---|---|---|
| Burst-friendly traffic | Token Bucket | Absorbs bursts naturally |
| Strict rate enforcement | Sliding Window | No boundary spikes |
| Minimal implementation | Fixed Window | Simple counters |
| Outbound throttling | Leaky Bucket | Steady outflow rate |
| Memory constrained | Token Bucket / Fixed Window | Constant memory |
| High accuracy needed | Sliding Window | Precise windowing |
Distributed Systems Trade-offs
When implementing rate limiting across multiple nodes:
Consistency vs Availability: Strong consistency (Redis Cluster) adds latency but ensures accurate limits. Eventual consistency (local fallback) is faster but may allow slight limit bypass during coordination delays.
Latency vs Accuracy: Lua scripts atomic but add ~1ms per request. Application-side checks are faster but require careful locking. For most APIs, the Lua overhead is acceptable.
Storage efficiency vs Precision: Sorted sets track every request precisely but memory grows with window size. Approximate algorithms (token bucket with Lua) use constant memory but average over time rather than counting exact requests.
Cost-Benefit Summary
| Factor | Token Bucket | Sliding Window | Fixed Window | Leaky Bucket |
|---|---|---|---|---|
| Implementation cost | Medium | High | Low | Medium |
| Operational cost | Low | Medium | Low | Low |
| User experience | Good (bursts allowed) | Best (smooth) | Poor (boundary spike) | Poor (no bursts) |
| Infrastructure impact | Low | Medium (more Redis ops) | Low | Low |
| Compliance suitability | Moderate | Best | Poor | Moderate |
Real-world Failure Scenarios
Understanding how rate limiting fails in production helps you design more robust systems. These scenarios are based on real-world incidents.
The Akamai Rewrite Incident
A major SaaS provider deployed a rate limiting rule that dropped requests exceeding 1,000 per minute. The rule worked perfectly in staging. In production, their CDN (Akamai) was rewriting request headers, inflating header counts and triggering the limit for legitimate users making just 50 requests. The result: 12 hours of intermittent outages affecting 30% of users.
Lesson: Test rate limiting with production traffic patterns, not synthetic requests. CDNs, proxies, and load balancers modify headers in ways staging environments may not replicate.
The Leap Second Chaos
During a leap second, a distributed rate limiter’s sliding window implementation began rejecting all requests. The system used Redis sorted sets with timestamp-based expiry. The leap second caused timestamp calculations to overflow, making all requests appear to be outside the window.
Lesson: Use monotonic time sources where possible. Test your rate limiting implementation against edge cases: leap seconds, year rollovers, timezone changes, and clock adjustments.
The Noisy Neighbor Problem
A single tenant on a multi-tenant SaaS platform began sending webhook notifications to their entire user base. The webhook calls hit the rate limiter, but because all tenants shared the same Redis cluster, the noisy neighbor’s traffic caused latency spikes for all other tenants. The rate limiter itself became a bottleneck.
Lesson: Isolate rate limiting state by tenant. Use separate Redis instances or at minimum separate key namespaces. Monitor for noisy neighbors before they affect others.
The Kubernetes HPA Over-correction
A Kubernetes deployment scaled from 2 to 20 pods during a traffic spike. Each pod initialized its own local rate limiter with full burst capacity. With 20 pods each allowing bursts, the effective limit was 20x the intended limit. The downstream database was overwhelmed before the rate limiting caught up.
Lesson: Local rate limiting does not work with auto-scaling. Use centralized rate limiting (Redis) when your service scales horizontally. Alternatively, use a rate limiting sidecar that coordinates across pods.
The Gradual Rollout Failure
A company rolled out new per-user rate limits to 5% of traffic. The new limits worked. They rolled out to 25%. Still worked. At 50%, Redis latency spiked to 500ms. At 100%, the rate limiting Redis was overwhelmed and began timing out. The entire API became unreliable.
Lesson: Test rate limiting scalability before rollout. Distributed counters add Redis round-trips to every request. Profile your Redis under expected load, and add headroom for spikes.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Rate limit counter overflow | Limits enforced incorrectly; bypassed or too restrictive | Use appropriate data types; set maximum values; monitor for unusual patterns |
| Redis unavailable for distributed limits | Rate limiting fails open or closed | Decide on fail-open vs fail-closed; implement local fallback with reduced limits |
| Clock skew in sliding window | Requests incorrectly counted or allowed | Use Redis server time; NTP sync all rate-limiting servers |
| Thundering herd on limit reset | All clients retry simultaneously at window boundary | Use jitter in retry logic; consider sliding window instead of fixed |
| Client bypass via IP spoofing | Rate limits circumvented by malicious actors | Combine per-IP with per-user limits; implement multiple enforcement layers |
Common Pitfalls / Anti-Patterns
Not Being Atomic
Race conditions in distributed rate limiting cause limit bypass. Two requests arriving simultaneously on different servers both check the counter, both see room, both increment. Both get through when only one should.
Use atomic operations. Redis Lua scripts, conditional writes, or distributed locks. Pick one that matches your consistency requirements.
Ignoring Header Consistency
When using multiple rate-limiting instances, headers might show different values to the same client. Request hits Server A, sees 99 remaining. Request hits Server B, sees 100 remaining. This confuses clients tracking their usage.
Use a shared counter store and return values from it. Accept the latency cost.
Allowing Unrestricted Bursts
Token bucket capacity sets the maximum burst. Large bursts can overwhelm downstream services even if the long-term average is within limits.
Set burst limits based on what your infrastructure can actually handle, not on what seems generous.
Quick Recap
Key Bullets:
- Token bucket suits burst-friendly APIs; sliding window suits consistent-rate APIs
- Distributed rate limiting requires atomic operations via Redis Lua or similar
- Combine per-IP and per-user limits for defense in depth
- Always include Retry-After header; clients should respect it
- Monitor rate limit metrics to detect abuse and misconfiguration
Copy/Paste Checklist:
Rate Limiting Implementation:
[ ] Choose algorithm based on use case
[ ] Implement atomic distributed counter (Redis Lua)
[ ] Set burst capacity based on downstream tolerance
[ ] Implement per-IP and per-user limits
[ ] Add X-RateLimit-* headers to all responses
[ ] Include Retry-After on 429 responses
[ ] Implement graceful degradation (fail-open vs fail-closed)
[ ] Add jitter to client retry logic
[ ] Monitor violation rates per client
[ ] Test with concurrent requests from multiple nodes
Observability Checklist
-
Metrics:
- Requests allowed vs rejected per endpoint
- Rate limit headroom (remaining vs total)
- Distributed counter sync latency
- 429 response rate by endpoint and client
- Token bucket utilization over time
-
Logs:
- Rate limit violations with client identifier
- Counter sync operations between nodes
- Algorithm selection per endpoint
- Burst detection (sudden spikes in requests)
-
Alerts:
- 429 response rate exceeds threshold (indicates abuse or misconfiguration)
- Rate limit headroom approaching zero
- Counter desynchronization between Redis instances
- Unusual spike in requests from single source
Security Checklist
- Rate limiting enforced server-side (not just client-side)
- Per-IP limits combined with per-user limits
- Distributed rate limiting synchronized across all instances
- Atomic operations prevent counter bypass
- Rate limit headers do not expose internal implementation details
- Graduated response (warn before blocking) for misconfigured clients
- Anonymize PII in rate limiting logs
Interview Questions
Rate limiting controls the rate of requests—how many requests per second, minute, or hour. Quotas control total usage over a time period—how many requests per month. A rate limit of 100 requests per minute means you can burst to 100 quickly or spread them evenly; a quota of 100,000 per month means you get 100,000 total, regardless of distribution.
Use rate limits to prevent abuse and system overload. Use quotas for billing and capacity planning. Most APIs implement both: a rate limit prevents momentary abuse, a quota prevents sustained overages.
Token bucket: Tokens accumulate up to a bucket size, then refill at a steady rate. Allows bursts up to the bucket size, then limits sustained rate. Good for user-facing APIs where users make bursts while actively using features.
Sliding window: Tracks requests in a rolling time window. Smoother than fixed window, more accurate, but requires more memory. Good when you need consistent limiting without boundary spikes.
Fixed window: Simplest—reset counters at the window boundary. Has a boundary spike problem where users get double the limit at window edges. Use only when spike tolerance is acceptable.
Local (in-memory) rate limiting does not work across multiple servers—a user could get N requests per server, effectively getting N times the limit. You need shared state.
Redis is the standard solution. Use atomic operations like INCR and EXPIRE to increment counters atomically. Use Lua scripts for complex algorithms like sliding window that need multiple operations to be atomic. This ensures all servers see the same count.
Consider Redis Cluster for sharding if your rate limiting needs exceed a single Redis instance's capacity. Network latency to Redis adds 1-2ms per request—acceptable for most use cases, problematic for ultra-low latency paths.
The thundering herd happens when many requests are queued or retried simultaneously after a rate limit resets or expires. All clients hit the system at once, overwhelming it even though they were rate limited before.
Mitigations: add jitter to client retry logic so retries are spread over time rather than synchronized. For cache stampede specifically, use probabilistic early expiration—some requests proactively refresh the cache before it expires, preventing all requests from hitting the backend when the cache clears.
Read the Retry-After header if present—it tells you how long to wait. If no Retry-After header, use exponential backoff with jitter. Start with a small delay, double it for each subsequent failure, add random jitter to prevent synchronization with other clients.
Do not spin retrying immediately—that just increases load and guarantees another 429. Respect the rate limit. If you consistently hit rate limits, you need to either reduce usage, request higher limits, or use caching to avoid redundant requests.
Per-IP rate limiting tracks requests by client IP address. Good for preventing abuse from anonymous clients or distributed attacks. But IPv6 and NAT make IP tracking unreliable, and shared IPs (corporate networks, mobile carriers) penalize legitimate users.
Per-user rate limiting tracks by authenticated user ID or API key. More accurate for metered APIs where you bill per user. Requires authentication—cannot apply to anonymous endpoints.
Use both in layers: per-IP as a first-pass defense against obvious abuse, per-user for accurate limiting of authenticated requests. This gives you protection against both anonymous floods and individual user overages.
Leaky bucket processes requests at a constant rate regardless of arrival pattern. Requests queue in the bucket and drain at the configured rate. If the bucket is full, new requests are rejected. Think of a physical bucket with a hole—water drains at a steady rate regardless of how fast you pour it in.
Token bucket adds tokens to a bucket at a steady rate. Requests consume tokens—if a token is available, the request proceeds. This allows bursts up to the bucket capacity while enforcing an average rate. A user with accumulated tokens can make many requests quickly, then slows down as tokens replenish.
Token bucket suits most APIs. Leaky bucket suits APIs calling downstream services that cannot handle bursts.
Monitor violation rate per client or IP—this tells you if abuse is increasing. Monitor rate limit headroom (how close users are to their limits) to identify users approaching limits before they hit them. Monitor 429 response rate as a percentage of total requests to detect systemic misconfiguration.
At the system level, monitor Redis latency for distributed rate limiting—if Redis is slow, your rate limiting adds latency to every request. Monitor false positives: legitimate users hitting rate limits because they share an IP with bad actors.
Adaptive rate limiting adjusts limits based on system load or detected anomalies. When the system is under stress, limits tighten—fewer requests per client. When the system is healthy, limits expand. This provides graceful degradation under load rather than hard cutoffs.
Implementation: monitor system metrics (CPU, latency P99, queue depth). When metrics exceed thresholds, reduce limits proportionally. When metrics improve, restore limits gradually. Some systems integrate with circuit breakers—circuit open on a downstream service automatically reduces rate limits to protect it.
L4 (transport layer) rate limiting works on IP and TCP/UDP—very fast, simple to implement at the network level. Cannot inspect application payload, so limited to IP-based limiting. Hardware network devices handle this efficiently.
L7 (application layer) rate limiting can inspect request content—user ID, API key, request path, headers. More sophisticated limiting decisions but requires more processing. Adds latency. Suitable for scenarios requiring fine-grained control over application layer data.
Use L4 for gross limiting at the network edge, L7 for API-level limiting where you need to distinguish between users or endpoints. Most architectures do both.
Rate limit bypass occurs when clients find ways to circumvent rate limiting—using multiple IP addresses, spoofing headers, rotating user agents, or exploiting inconsistencies between enforcement points.
Prevention: implement defense in depth with multiple layers (gateway + application), combine per-IP with per-user limits, use fingerprinting (IP + header combination), monitor for anomalous patterns like many requests from sequential IPs, and ensure atomic operations in distributed implementations.
Token bucket in high-concurrency languages must handle atomic operations carefully. In Go, use sync/atomic. In Java, use AtomicLong or LongAdder. In Python, the GIL serializes access but the refill computation still needs locking to prevent race conditions.
Memory pressure from frequent object allocation (for timestamps, tokens) can cause GC pauses that delay refill calculations, making the algorithm less accurate under load.
Fail-open: when the rate limiting system is unavailable (Redis down), requests are allowed through. Maximizes availability but removes protection during failures.
Fail-closed: when rate limiting is unavailable, requests are rejected. Protects the system but causes outages when dependencies fail.
Choose based on your risk tolerance. User-facing APIs often fail open to avoid blocking legitimate users. Internal services protecting downstream resources often fail closed.
Hierarchical rate limiting enforces limits at multiple granularities simultaneously. A user has per-endpoint limits, per-plan limits (all endpoints combined), and a global system limit. All three must pass for a request to proceed.
Implementation: check in order from narrowest to broadest scope. If user limit passes, check plan limit. If plan limit passes, check global limit. Reject on first failure. Use Redis Lua scripts to perform multi-level checks atomically.
Token credit systems pre-allocate a pool of tokens (credits) to users, which they spend on API calls. Credits may replenish on a schedule (monthly allowance) or via purchase. This differs from token bucket where tokens accumulate continuously.
Credits are better for billing scenarios where you need predictable monthly allowances and usage tracking. Standard token bucket is better for traffic shaping without usage accounting.
Anonymous users get limited by IP address or fingerprint. Authenticated users get limited by user ID or API key. This ensures anonymous abuse does not consume the authenticated user's quota.
Implementation: check anonymous limits first (faster rejection of obvious abuse). If authenticated, also check per-user limits. Return different error messages and retry windows based on which limit was hit.
A retry storm occurs when many clients retry simultaneously after a service disruption. Without rate limiting, retries multiply the load on recovering services, potentially causing the outage to continue or worsen.
Rate limiting on both client and server side prevents storms. Clients should implement exponential backoff with jitter. Servers should return Retry-After headers and reject immediately rather than timing out slowly.
Multi-tenant rate limiting requires isolation between tenants while maintaining shared infrastructure efficiency. Use tenant ID as the rate limit key prefix. Each tenant gets their own bucket or counter.
Consider: per-tenant limits can be static (config) or dynamic (based on subscription tier). Use Redis hashes with tenant ID as key for efficient per-tenant storage. Monitor for noisy neighbor problems where one tenant's traffic affects others.
Jitter adds randomness to retry delays to prevent synchronized retries from multiple clients. Without jitter, all clients waiting for the same Retry-After duration retry at the same instant, creating a thundering herd.
Types: full jitter (random delay between 0 and cap), equal jitter (half the delay plus random), decorrelated jitter (each delay independent). Full jitter is simplest and generally effective.
Unit test each algorithm in isolation: verify bucket empties at refill rate, sliding window removes old entries, fixed window resets correctly.
Integration test: concurrent requests from multiple nodes hitting a shared Redis store. Verify atomicity—max N concurrent requests should succeed, N+1 should fail.
Chaos test: kill Redis mid-request, verify fail-open or fail-closed behavior as designed. Test clock skew scenarios for sliding window.
Further Reading
- Martin Fowler’s Rate Limiter — Core patterns explained
- Stripe’s API Rate Limits — Real-world engineering perspective
- Cloudflare Rate Limiting — CDN-level limiting implementation
- Redis Rate Limiting — Atomic operations for distributed counters
- AWS API Gateway Throttling — Cloud provider approach
Defense in depth matters. Combine rate limiting at your API gateway with rate limiting in your application code. Use atomic operations for distributed counters. Add observability to detect abuse and misconfiguration before they become outages.
Conclusion
Rate limiting protects your infrastructure from abuse, controls costs, and ensures fair resource allocation across clients. Token bucket handles bursts naturally, sliding window provides smooth limiting, and distributed implementations using Redis ensure consistency across multiple servers.
The right algorithm depends on your use case. Token bucket suits most APIs. Use sliding window when you need consistent rate limiting without boundary spikes. Use fixed window only when simplicity matters more than accuracy.
Implementation checklist: choose your algorithm, set burst capacity based on downstream tolerance, implement per-IP and per-user limits, add rate limit headers to responses, include Retry-After on 429s, monitor violation rates, and test with concurrent requests from multiple nodes.
Category
Related Posts
API Versioning: Managing Change Without Breaking Clients
Learn API versioning strategies: URL path, header, and query parameter approaches. Understand backward compatibility, deprecation practices, and migration patterns.
GraphQL vs REST: Choosing the Right API Paradigm
Compare GraphQL and REST APIs, understand when to use each approach, schema design, queries, mutations, and trade-offs between the two paradigms.
RESTful API Design: Best Practices for Building Web APIs
Learn REST principles, resource naming, HTTP methods, status codes, and best practices. Design clean, maintainable, and scalable RESTful APIs.