Load Balancing: The Traffic Controller of Modern Infrastructure
Learn how load balancers distribute traffic across servers, the differences between L4 and L7 load balancing, and when to use software vs hardware solutions.
Load balancing exists because every system has a breaking point. Push enough traffic toward a single server and it buckles. The load balancer sits between users and your server pool, spreading requests around so nothing collapses.
I think of load balancers as air traffic control for your network. They do not just route packets. They make decisions based on real-time conditions, health status, and configured policies. Without them, scaling beyond a handful of servers becomes a nightmare of manual failover and prayer.
Core Concepts
A load balancer accepts incoming traffic and picks which backend server handles each request. The client only sees one destination IP address. The load balancer keeps up appearances while doing the actual work behind the scenes.
The flow goes like this: client sends a request to the load balancer’s virtual IP, the load balancer evaluates its routing algorithm and selects a healthy backend based on current load and policy, the request gets forwarded, the server processes it, and the response returns through the load balancer to the client.
The bidirectional proxy model means the load balancer can inspect, modify, and optimize traffic in both directions. Some setups use direct server return where responses bypass the load balancer, but request forwarding always goes through it.
graph TD
Client1[Client] --> LB[Load Balancer]
Client2[Client] --> LB
Client3[Client] --> LB
LB --> Server1[Server 1]
LB --> Server2[Server 2]
LB --> Server3[Server 3]
Server1 --> LB
Server2 --> LB
Server3 --> LB
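The request flow described above can be sketched as a tiny selection loop. This is only an illustration, assuming a round-robin policy and a boolean health flag; the `Backend` and `RoundRobinBalancer` names are hypothetical and not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    """Cycles through backends, skipping any that are marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next = 0  # index where the next search starts

    def pick(self):
        # Try each backend at most once, starting from the rotation pointer.
        for offset in range(len(self.backends)):
            index = (self._next + offset) % len(self.backends)
            candidate = self.backends[index]
            if candidate.healthy:
                self._next = (index + 1) % len(self.backends)
                return candidate
        raise RuntimeError("no healthy backends available")


pool = [Backend("server-1"), Backend("server-2"), Backend("server-3")]
balancer = RoundRobinBalancer(pool)
pool[1].healthy = False  # simulate a failed health check on server-2
picks = [balancer.pick().name for _ in range(4)]
print(picks)  # ['server-1', 'server-3', 'server-1', 'server-3']
```

A real load balancer also forwards the bytes and tracks connections; the sketch covers only the "pick a healthy backend" step.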
Layer 4 vs Layer 7 Load Balancing
Network engineers talk about Layer 4 (transport) and Layer 7 (application) load balancing. The layer tells you how deep into the network stack the load balancer inspects when making routing decisions.
Layer 4 Load Balancing
Layer 4 load balancers operate at the transport layer. They route based on source and destination IP addresses plus port numbers, without looking inside the actual request. This makes them faster and able to handle more throughput since parsing happens at a lower level.
Picture L4 as a postal sorter that only looks at the street address, not what is written in the letter. It routes based on network-level information and can process millions of requests per second with minimal latency.
Layer 4 works well for TCP-based protocols like databases or SSH connections, or any protocol where raw throughput matters more than content inspection.
Layer 7 Load Balancing
Layer 7 load balancers operate at the application layer. They can inspect HTTP headers, URLs, cookies, and request bodies. This opens up sophisticated routing based on what the user actually requested.
With L7, you can route based on URL path, send API requests to one cluster and static assets to another, or direct mobile users to a different backend. You can also terminate SSL at the load balancer, which lets it inspect the decrypted traffic before forwarding it.
The cost is higher resource usage. Parsing HTTP is more expensive than reading IP and port numbers. Modern L7 balancers are heavily optimized, but this distinction matters when designing systems that need maximum performance.
graph LR
subgraph "Layer 4"
L4Req[IP + Port] --> L4Dec[Routing Decision]
end
subgraph "Layer 7"
L7Req[HTTP Headers<br/>URL Path<br/>Cookies] --> L7Dec[Routing Decision]
end
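To make the contrast concrete, here is a rough sketch of the information each layer has available when it picks a backend. The pool addresses, the path prefixes, and the `X-Client-Id` header are assumptions made up for the example.

```python
import hashlib

L4_POOL = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
L7_POOLS = {
    "api": ["10.0.1.1", "10.0.1.2"],
    "static": ["10.0.2.1"],
    "default": ["10.0.3.1"],
}


def stable_index(key: str, size: int) -> int:
    """Map a string key to a pool index with a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % size


def l4_pick(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> str:
    # Layer 4: only addresses and ports are visible, so hash the flow.
    flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}"
    return L4_POOL[stable_index(flow, len(L4_POOL))]


def l7_pick(path: str, headers: dict) -> str:
    # Layer 7: the parsed HTTP request is available, so route on content.
    if path.startswith("/api/"):
        pool = L7_POOLS["api"]
    elif path.startswith("/static/"):
        pool = L7_POOLS["static"]
    else:
        pool = L7_POOLS["default"]
    # Spread requests inside the chosen pool by a client identifier header.
    client = headers.get("X-Client-Id", "anonymous")
    return pool[stable_index(client, len(pool))]


print(l4_pick("203.0.113.5", 51234, "198.51.100.1", 443))
print(l7_pick("/api/users", {"X-Client-Id": "abc123"}))
```

The L4 function never sees the path, which is why it cannot split API traffic from static assets.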
Software vs Hardware Load Balancers
Back in the day, load balancers were expensive hardware appliances. Companies like F5 sold dedicated network devices for tens of thousands of dollars, with specialized ASICs for packet processing.
The industry shifted toward software. HAProxy, Nginx, and cloud offerings like AWS ALB or Google Cloud Load Balancing commoditized load balancing. You can deploy capable software balancers on commodity hardware or use managed cloud services.
Software wins on flexibility and cost. You can modify routing logic with code changes, integrate with container orchestration, and scale by deploying more instances. Hardware appliance licensing makes less sense in cloud environments.
But hardware still has a place. Regulated industries sometimes require dedicated appliances for compliance. Extremely high-throughput environments, like major video streaming platforms, still use custom ASIC-based solutions. For most web applications though, software approaches work fine.
Sticky Sessions
Server-local session state creates problems. If User A logs into Server 1 and their next request goes to Server 2, that server has no memory of the login. The user appears logged out.
Sticky sessions route a particular user’s requests to the same backend server. The load balancer tracks which client maps to which server, using cookies, client IP, or some other identifier.
Cookie-based sticky sessions insert a tracking cookie that identifies the target server. IP-based affinity hashes the client IP to always return the same backend. Header-based approaches use a custom header set by an upstream service.
Sticky sessions cause their own headaches though. They complicate maintenance windows since you cannot take down a server without disconnecting active users. They make horizontal scaling harder because you cannot freely redistribute load. Many applications work better with session state stored in a distributed cache like Redis.
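The maintenance problem is easy to see with IP-based affinity. The sketch below assumes the simplest possible scheme, hashing the client IP modulo the number of backends; the addresses and server names are placeholders.

```python
import hashlib


def backend_for(client_ip: str, backends: list) -> str:
    """IP-based affinity: hash the client IP onto the current backend list."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


clients = [f"203.0.113.{i}" for i in range(1, 101)]
three = ["server-1", "server-2", "server-3"]
two = ["server-1", "server-2"]  # server-3 taken down for maintenance

before = {ip: backend_for(ip, three) for ip in clients}
after = {ip: backend_for(ip, two) for ip in clients}

# Clients whose backend changed lose any server-local session state.
moved = [ip for ip in clients if before[ip] != after[ip]]
print(f"{len(moved)} of {len(clients)} clients were remapped")
```

Every client that was pinned to the removed server has to move, and with plain modulo hashing some clients on the surviving servers can be reshuffled as well. Keeping session state in a shared store makes the remapping harmless.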
Health Checks and Failover
Load balancers continuously check that backend servers can handle traffic. Health checks run at configurable intervals, testing whether each server responds correctly. A server that fails too many health checks gets marked unhealthy and removed from rotation.
Health checks range from simple TCP connection tests to full HTTP requests with expected response validation. Deeper checks catch more real failures but add latency and load. Most setups use multiple check types: lighter checks more frequently with occasional deeper validation.
graph TD
LB[Load Balancer] --> HC[Health Checker]
HC -->|TCP ping| S1[Server 1]
HC -->|TCP ping| S2[Server 2]
HC -->|TCP ping| S3[Server 3]
S1 -->|Healthy| S1Status[✓ Healthy]
S2 -->|Timeout| S2Status[✗ Unhealthy]
S3 -->|Healthy| S3Status[✓ Healthy]
S2Status -->|Remove| Pool[Removed from pool]
When a server fails health checks, the load balancer stops routing traffic to it. Existing connections may be terminated or allowed to drain depending on configuration. Once the server recovers and passes health checks again, it rejoins the pool automatically in most systems.
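The "fails too many checks, then recovers" behaviour amounts to a small state machine per backend. Here is a minimal sketch, assuming consecutive-failure and consecutive-success thresholds; the `fall` and `rise` names are borrowed from HAProxy's server options, but the class itself is hypothetical.

```python
class HealthTracker:
    """Tracks one backend's status using consecutive failure/success thresholds."""

    def __init__(self, fall: int = 3, rise: int = 2):
        self.fall = fall      # consecutive failures before marking unhealthy
        self.rise = rise      # consecutive successes before marking healthy
        self.healthy = True
        self._streak = 0      # consecutive results that contradict current status

    def record(self, check_passed: bool) -> bool:
        if check_passed == self.healthy:
            # Result agrees with the current status: reset the contrary streak.
            self._streak = 0
        else:
            self._streak += 1
            threshold = self.fall if self.healthy else self.rise
            if self._streak >= threshold:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy


tracker = HealthTracker(fall=3, rise=2)
results = [True, False, False, True, False, False, False, True, True]
history = [tracker.record(r) for r in results]
print(history)
# [True, True, True, True, True, True, False, False, True]
```

Two isolated failures do not remove the server; three in a row do, and it takes two consecutive passes to return to the pool.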
SSL Termination
Handling HTTPS at the load balancer layer has practical benefits. SSL termination means the load balancer decrypts incoming HTTPS traffic, inspects it if L7, and forwards unencrypted traffic to backend servers. This reduces cryptographic load on application servers.
You manage certificates in one place rather than on every backend. Backend communication can use plain HTTP, reducing CPU overhead on application servers. Your load balancer can inject security headers, rewrite URLs, and perform other transformations on decrypted traffic.
The tradeoff involves trust. Traffic travels unencrypted between load balancer and backend servers, typically within your internal network. In cloud environments, this is usually fine. For sensitive data, you might use SSL passthrough where encrypted traffic flows all the way to backend servers, or re-encrypt before forwarding.
Choosing the Right Load Balancer
Choosing a load balancer depends on your requirements. For simple HTTP traffic with moderate scale, Nginx or HAProxy on a couple of virtual machines works well. They are battle-tested, documented, and free.
Cloud providers offer managed load balancers that integrate with their ecosystems. AWS Application Load Balancer handles L7 routing with rule-based decisions. Network Load Balancer provides ultra-low-latency L4 forwarding for TCP workloads. These services scale automatically and reduce operational overhead.
If you run Kubernetes, the ingress controller often handles load balancing. Options like ingress-nginx, Traefik, or cloud-specific controllers provide L7 routing with tight integration into the container scheduler.
For microservices, service meshes like Istio or Linkerd include load balancing as part of their service-to-service communication layer. These handle traffic shaping, circuit breaking, and retries alongside basic load distribution.
When to Use Each Approach
When to Use Load Balancing
Load balancing is essential when:
- You run multiple backend servers serving the same application
- You need high availability (single server failure should not cause outage)
- You want to scale horizontally by adding more servers
- You need to perform maintenance without downtime
- Traffic volume exceeds what a single server can handle
- You want to protect against server overload and failures
When to Use Layer 4 (L4) Load Balancing
L4 is the right choice when:
- You need maximum throughput with minimal latency
- You are load balancing TCP/UDP protocols beyond HTTP
- You do not need to inspect application-layer data
- You are routing database connections or streaming data
- Raw performance matters more than routing intelligence
When to Use Layer 7 (L7) Load Balancing
L7 is the right choice when:
- You need content-based routing (URL path, headers, cookies)
- You want to terminate SSL at the load balancer
- You need to implement cookie-based sticky sessions
- You are serving multiple applications on the same IP
- You want to rewrite URLs or redirect requests
Global Server Load Balancing
When your application spans multiple geographic regions, simple load balancing breaks down. A user in Tokyo hitting servers in Virginia is a recipe for slow page loads and frustrated users. GSLB solves this by routing users to the closest healthy backend based on geography, server health, and real-time load.
GSLB works at the DNS level. Your DNS responder returns different IP addresses depending on where the query originates. The user gets routed to the nearest healthy region without ever knowing other regions exist.
graph TD
UserAP[User - APAC] --> DNS[GSLB DNS]
UserEU[User - EU] --> DNS
UserUS[User - US] --> DNS
DNS -->|Return APAC IP| LB_APAC[Load Balancer - Tokyo]
DNS -->|Return EU IP| LB_EU[Load Balancer - Frankfurt]
DNS -->|Return US IP| LB_US[Load Balancer - Virginia]
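The DNS-level decision can be sketched as "pick the lowest-latency healthy region for this client". The region names match the diagram; the IP addresses and latency figures are made-up placeholders, and a real GSLB would derive them from measurements.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    vip: str
    healthy: bool
    latency_ms: dict  # hypothetical latency estimate per client geography


REGIONS = [
    Region("tokyo", "198.51.100.10", True, {"APAC": 20, "EU": 240, "US": 150}),
    Region("frankfurt", "198.51.100.20", True, {"APAC": 230, "EU": 15, "US": 90}),
    Region("virginia", "198.51.100.30", True, {"APAC": 160, "EU": 85, "US": 10}),
]


def resolve(client_geo: str, regions=REGIONS) -> str:
    """Return the VIP of the lowest-latency healthy region for this client."""
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy region to answer with")
    best = min(candidates, key=lambda r: r.latency_ms[client_geo])
    return best.vip


print(resolve("APAC"))       # 198.51.100.10 (Tokyo)
REGIONS[0].healthy = False   # Tokyo goes dark
print(resolve("APAC"))       # 198.51.100.30 (Virginia is the next closest)
```

Because the answer is delivered through DNS, how quickly clients follow a change also depends on the record's TTL.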
Anycast vs GSLB
One approach worth understanding is Anycast routing. With Anycast, multiple servers share the same IP address, and the network routes packets to the nearest one based on BGP path metrics. This works at the network layer without application involvement.
GSLB gives you more control. You can weight traffic differently, factor in server health beyond just network reachability, and route based on application-layer signals. Cloud providers offer GSLB as a service: AWS Route 53 geolocation routing, Google Cloud Load Balancing with Cloud CDN, and Azure Traffic Manager all fit here.
For globally distributed systems, combining Anycast and GSLB works well. Anycast routes to the nearest region at the network level. GSLB handles fine-grained routing within and between regions based on application health.
Health Checking Across Regions
GSLB health checks must account for entire regions going dark. A health check that only verifies server TCP connectivity misses regional outages. Proper GSLB checks server-level health, regional health, latency thresholds, and capacity limits.
Most GSLB implementations probe from multiple vantage points to distinguish between a slow region and a slow internet path. If three out of five probes fail from different locations, the region gets marked unhealthy.
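The multi-vantage-point rule is a simple quorum. A minimal sketch, assuming five probe locations and the "three failures out of five" threshold from the text; the location names are placeholders.

```python
def region_is_healthy(probe_results: dict, max_failures: int = 2) -> bool:
    """probe_results maps a vantage point name to True if its probe succeeded."""
    failures = sum(1 for ok in probe_results.values() if not ok)
    return failures <= max_failures


# Two failed probes: likely a bad internet path, so the region stays in rotation.
print(region_is_healthy(
    {"london": True, "sydney": False, "sao-paulo": False, "oregon": True, "mumbai": True}
))  # True

# Three failed probes from different locations: treat the region as down.
print(region_is_healthy(
    {"london": False, "sydney": False, "sao-paulo": False, "oregon": True, "mumbai": True}
))  # False
```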
Rate Limiting at the Load Balancer
Load balancers sit in the perfect spot to enforce rate limits. They see every request before it reaches your application, which makes them the first line of defense against abuse, runaway scripts, and intentional DDoS.
Token Bucket Algorithm
The token bucket algorithm is the most common approach. Each client receives a bucket that fills with tokens at a steady rate. Every request consumes a token. When the bucket is empty, requests get rejected or delayed.
# NGINX rate limiting (limit_req uses the closely related leaky bucket method)
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/s;
server {
    location /api/ {
        limit_req zone=api_limit burst=200 nodelay;
    }
}
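For readers who want to see the token bucket itself, here is a minimal single-client sketch in Python. The class name and the injectable `clock` parameter are choices made for this example (the clock makes the demo deterministic); it is not how any particular load balancer implements rate limiting.

```python
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.clock = clock
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill based on the time elapsed since the last call, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


fake_now = [0.0]  # controllable clock for the demo
bucket = TokenBucket(rate=2, capacity=5, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(6)]
print(burst)         # [True, True, True, True, True, False]

fake_now[0] = 1.0    # one second later, two tokens have been refilled
after_refill = [bucket.allow() for _ in range(3)]
print(after_refill)  # [True, True, False]
```

A per-client limiter would keep one bucket per key (source IP, API key, and so on) and evict idle entries.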
Sliding Window Counter
Token bucket allows bursts. If you need more predictable rate limiting, the sliding window counter maintains a rolling count of requests per time window. It uses more memory than token bucket but provides smoother rate limiting.
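# HAProxy sliding window configuration (frontend name and the 100-request threshold are illustrative)
frontend http_in
    stick-table type ip size 100k expire 60s store http_req_rate(60s)
    http-request track-sc0 src
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 100 }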
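One common way to approximate a sliding window is to keep counters for the current and previous fixed windows and weight the previous one by how much of it still overlaps the sliding window. The sketch below illustrates that idea for a single client; it is not a description of HAProxy's internals, and the class name and `clock` parameter are assumptions.

```python
import time


class SlidingWindowCounter:
    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self.limit = limit
        self.window = window        # window length in seconds
        self.clock = clock
        self.current_index = -1     # index of the fixed window being counted
        self.current_count = 0
        self.previous_count = 0

    def allow(self) -> bool:
        now = self.clock()
        index = int(now // self.window)
        if index != self.current_index:
            # Roll forward; the old count only matters if it is the adjacent window.
            adjacent = index == self.current_index + 1
            self.previous_count = self.current_count if adjacent else 0
            self.current_count = 0
            self.current_index = index
        # Weight the previous window by the share of it still inside the sliding window.
        elapsed_fraction = (now - index * self.window) / self.window
        estimated = self.previous_count * (1.0 - elapsed_fraction) + self.current_count
        if estimated < self.limit:
            self.current_count += 1
            return True
        return False


fake_now = [0.0]
limiter = SlidingWindowCounter(limit=10, window=60, clock=lambda: fake_now[0])

first = [limiter.allow() for _ in range(11)]
print(first.count(True))   # 10 allowed, the 11th is rejected

fake_now[0] = 90.0         # halfway through the next window
second = [limiter.allow() for _ in range(6)]
print(second)              # [True, True, True, True, True, False]
```

Halfway into the second window, half of the previous 10 requests still count, so only 5 more are admitted.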
Limiting by Different Keys
Rate limits should apply at multiple granularities:
| Key | Use Case |
|---|---|
| Source IP | Single user/clients behind NAT |
| API Key | Authenticated API consumers |
| Header (User-Agent) | Bot detection |
| JWT claim | Per-user service tier limits |
Cloud load balancers usually provide rate limiting through an attached web application firewall rather than in the balancer itself. AWS ALB supports rate-based rules through AWS WAF. Google Cloud Load Balancing integrates with Cloud Armor for advanced rate-based policies.
Circuit Breaker Pattern Integration
Health checks remove failed backends from rotation. Circuit breakers work one level deeper. They prevent your load balancer from hammering a struggling backend that has not fully failed but is showing signs of strain.
How Circuit Breakers Interact with Load Balancers
A well-designed circuit breaker sits between the load balancer and the backend, monitoring error rates and latency. When a backend starts returning errors or exceeding latency thresholds, the circuit breaker trips to the “open” state and immediately returns failures without forwarding requests.
stateDiagram-v2
Closed --> Open : Error threshold exceeded
Open --> HalfOpen : Cool-down period elapsed
HalfOpen --> Closed : Probe request succeeds
HalfOpen --> Open : Probe request fails
The load balancer still performs health checks on a backend whose circuit is open. After the cool-down period, the breaker moves to half-open and lets a probe request through. If the probe succeeds, the circuit closes and traffic flows again; if it fails, the circuit reopens.
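The state diagram maps directly onto code. Here is a minimal sketch assuming a consecutive-failure threshold and a fixed cool-down; the class and method names are hypothetical, and this simplified version does not limit how many probes pass while half-open.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown      # seconds to stay open before probing
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "half_open"   # cool-down elapsed: let a probe through
                return True
            return False                   # fail fast without touching the backend
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        if self.state == "half_open":
            self._trip()                   # probe failed: reopen immediately
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = "open"
        self.failures = 0
        self.opened_at = self.clock()


fake_now = [0.0]
breaker = CircuitBreaker(failure_threshold=3, cooldown=30.0, clock=lambda: fake_now[0])
for _ in range(3):
    breaker.record_failure()
print(breaker.state, breaker.allow_request())   # open False

fake_now[0] = 30.0
print(breaker.allow_request(), breaker.state)   # True half_open
breaker.record_success()
print(breaker.state)                            # closed
```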
Implementation Approaches
For Kubernetes deployments, service meshes like Istio implement circuit breaking at the sidecar proxy level. You configure outlier detection on your DestinationRule:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: backend-service
spec:
  host: backend-service
  trafficPolicy:
    outlierDetection:
      consecutiveGatewayErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
For traditional deployments, libraries like Hystrix (Java, now in maintenance mode), Resilience4j (Java), or Polly (.NET) implement circuit breakers in your application code. The load balancer health checks still matter: they handle complete failures while circuit breakers handle degraded states.
Canary and Blue-Green Deployments
Load balancers make deployment strategies like canary releases and blue-green deployments possible without downtime. Instead of replacing all servers at once, you shift traffic gradually and monitor for problems.
Blue-Green Deployment
Blue-green keeps two identical environments. The current production environment (blue) handles live traffic while the new version (green) sits idle. When you are ready to deploy, you shift all traffic from blue to green at the load balancer level with a single routing rule change.
graph LR
subgraph "Blue Environment (Current)"
B1[Server 1]
B2[Server 2]
end
subgraph "Green Environment (New)"
G1[Server 1']
G2[Server 2']
end
LB[Load Balancer] -->|Route to Blue| B1
LB -->|Route to Blue| B2
LB -.->|Ready to switch| G1
LB -.->|Ready to switch| G2
If something goes wrong, you flip traffic back to blue instantly. The old environment stays warm during the deployment window so rollback never requires rebuilding anything.
Canary Deployment
Canary releases shift a small percentage of traffic to the new version while the rest stays on the current version. You route 5% of users to the new build, monitor error rates and latency, and gradually increase the percentage.
# HAProxy canary configuration
backend canary_backend
    server new_app_1 10.0.1.101:8080 weight 5        # 5% of traffic
    server current_app_1 10.0.1.201:8080 weight 95
    # Increase weight gradually as confidence builds
    # 5% -> 10% -> 25% -> 50% -> 100%
Canary deployments require more sophisticated monitoring. You need to compare error rates and latency between the canary and production groups. If the canary shows an error rate one percentage point higher than production while serving 5% of traffic, that is a real signal worth investigating. Load balancers that export metrics to Prometheus or Datadog make this kind of comparison straightforward.
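The comparison between canary and production can be automated with a simple rule. The sketch below is one possible shape for it; the thresholds (`max_delta`, `min_requests`) and the three verdict strings are assumptions for illustration, not recommended values.

```python
import random


def pick_version(canary_percent: float, rng=random) -> str:
    """Send roughly canary_percent of requests to the canary build."""
    return "canary" if rng.random() * 100 < canary_percent else "stable"


def canary_verdict(canary_errors, canary_total, stable_errors, stable_total,
                   max_delta=0.01, min_requests=500):
    """Compare error rates and return 'wait', 'promote', or 'rollback'."""
    if canary_total < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    canary_rate = canary_errors / canary_total
    stable_rate = stable_errors / stable_total if stable_total else 0.0
    # Roll back when the canary's error rate exceeds production's by more than max_delta.
    return "rollback" if canary_rate - stable_rate > max_delta else "promote"


print(canary_verdict(2, 100, 95, 19000))    # wait
print(canary_verdict(30, 1000, 95, 19000))  # rollback (3.0% vs 0.5%)
print(canary_verdict(6, 1000, 95, 19000))   # promote (0.6% vs 0.5%)
```

Requiring a minimum number of canary requests avoids reacting to noise: at 5% of traffic, the canary's sample is small early in the rollout.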
Load Balancer Role in Deployment Safety
The load balancer is the control point for both strategies:
- Instant rollback: Change backend weights to route 100% back to the old version
- Gradual rollout: Shift traffic incrementally to the new version
- Health monitoring integration: Automatically pause rollout if backend error rates spike
- Connection draining: Move users off servers being decommissioned gracefully
Both strategies depend on having enough capacity to run two environments simultaneously. In cloud environments with auto-scaling, this is easier. You spin up green instances alongside blue, shift traffic, then scale down blue.
Kubernetes Ingress and Service Mesh Load Balancing
In Kubernetes, load balancing works differently than in traditional setups. Each node runs kube-proxy, which programs iptables or IPVS rules to load balance Service traffic across pods. But this only handles Service traffic at L4; it does not provide HTTP-aware routing for requests entering the cluster.
Ingress Controllers
For external traffic entering the cluster, Ingress resources define how HTTP/HTTPS routing works. The Ingress controller (like ingress-nginx, Traefik, or cloud-specific controllers) implements those rules and load balances requests at the cluster edge.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /users
            pathType: Prefix
            backend:
              service:
                name: users-service
                port:
                  number: 80
          - path: /products
            pathType: Prefix
            backend:
              service:
                name: products-service
                port:
                  number: 80
Ingress controllers handle L7 routing based on host and path, SSL termination, and canary routing through weighted backend services.
Service Mesh Load Balancing
Service meshes like Istio and Linkerd move load balancing out of the application code and into a sidecar proxy running alongside each pod. Every outbound request goes through the sidecar, which makes routing decisions based on service mesh policies.
With a service mesh, you get per-request routing, chaos injection for testing, automatic retries with backoff, and fine-grained traffic shaping. Your application code has no idea any of this is happening.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 90
        - destination:
            host: reviews
            subset: v2
          weight: 10
The sidecar proxy handles health checking, load balancing algorithms, and circuit breaking transparently. Your application code has no awareness of which version is handling any given request.
Comparison: Ingress vs Service Mesh
| Aspect | Ingress Controller | Service Mesh |
|---|---|---|
| Scope | North-south traffic (into cluster) | East-west traffic (within cluster) |
| Deployment | Cluster-level controller | Per-pod sidecars |
| Protocol | HTTP/HTTPS primarily | HTTP, gRPC, and raw TCP |
| Complexity | Lower | Higher |
| Use Case | Exposing services externally | Service-to-service communication |
Most real deployments end up needing both. The Ingress controller handles traffic entering your cluster from outside. The service mesh handles how those requests reach your application services and how services talk to each other once they are inside.
Topic-Specific Deep Dives
When Not to Rely on Load Balancing Alone
Load balancing is not enough when:
- You need instant failover (health check detection and recovery both take time)
- You have state that cannot be distributed (use shared storage)
- You expect traffic spikes you cannot pre-scale for (combine with auto-scaling)
Trade-off Analysis
Layer 4 vs Layer 7 Trade-offs
| Aspect | Layer 4 | Layer 7 |
|---|---|---|
| Performance | Higher throughput, lower latency | Higher latency, more CPU usage |
| Routing Intelligence | IP + port only | Content-based (URL, headers, cookies) |
| SSL Termination | Usually passthrough (some L4 balancers add TLS listeners) | Full inspection and termination |
| Protocol Support | Any TCP/UDP protocol | HTTP/HTTPS primarily |
| Memory Usage | Lower | Higher (stateful inspection) |
| Complexity | Simpler configuration | More configuration options |
Software vs Hardware Trade-offs
| Aspect | Software Load Balancers | Hardware Load Balancers |
|---|---|---|
| Cost | Low / open source | High ($50K+ appliances) |
| Flexibility | Easily modified, code changes | Fixed functionality, firmware updates |
| Scalability | Horizontal (add more VMs) | Vertical (bigger appliance) or clustering |
| Performance | Good for most web apps | Extreme throughput (ASICs) |
| Compliance | May not meet regulatory needs | Often compliance-certified |
| Operations | Requires maintenance | Managed appliance support |
Sticky Sessions Trade-offs
| Aspect | Without Sticky Sessions | With Sticky Sessions |
|---|---|---|
| Load Distribution | Perfect distribution possible | Uneven when clients have different behavior |
| Session State | Must use shared storage (Redis) | Simpler if server-local state is acceptable |
| Scaling | Easy horizontal scaling | Harder — cannot freely move clients |
| Failure Handling | User sessions survive server failure | User loses session if sticky server dies |
| Complexity | Requires external session store | Simpler initially |
Health Check Depth Trade-offs
| Check Type | What It Catches | Cost / Trade-off |
|---|---|---|
| TCP Connect | Port open, basic reachability | Fast, low overhead, but misses app failures |
| HTTP Request | Application responding | More accurate, slightly higher latency |
| Deep HTTP | Response body validation | Catches more failures, highest latency |
| Custom Script | Business logic verification | Most accurate, operational complexity |
Rate Limiting Algorithm Trade-offs
| Algorithm | Burst Handling | Memory Usage | Predictability |
|---|---|---|---|
| Token Bucket | Allows bursts | Low | Variable |
| Sliding Window | No bursts | Higher | Consistent |
| Fixed Window | Allows bursts | Lowest | Inconsistent at boundaries |
| Leaky Bucket | Smooths output | Low | Consistent |
Observability and Security
Metrics
- Request rate (requests per second by backend)
- Response latency (p50, p95, p99 per backend)
- Backend health status (healthy/unhealthy/draining)
- Active connections per backend
- Connection rate (new connections per second)
- SSL handshake rate and latency (if terminating SSL)
- Backend error rate (5xx responses from backends)
- Health check success/failure rate
- Backend response time trends
Logs
- Backend health check failures with details
- SSL handshake failures (certificate errors, protocol mismatches)
- Connection timeouts from backends
- Routing decisions for L7 (which rule matched)
- Backend server added/removed events
- Rate limiting events
- Connection errors and disconnections
Alerts
- Any backend unhealthy for more than 30 seconds
- All backends unhealthy (complete outage)
- Request error rate exceeds threshold
- p99 latency exceeds service level objective
- Active connections approach limits
- Health check failure rate increases
- Unusual traffic patterns (potential attack)
Security Checklist
- Restrict access to load balancer management interface
- Use TLS for load balancer to backend communication (internal encryption)
- Implement access controls on health check endpoints
- Monitor for traffic anomalies indicating attack
- Use private VIPs for internal load balancers (not internet-facing)
- Rotate SSL certificates if terminated at load balancer
- Implement rate limiting at load balancer layer
- Log all administrative changes to load balancer config
- Use network ACLs to restrict which clients can reach load balancer
- Enable audit logging for compliance
- Protect against DDoS at load balancer level
- Verify backend servers are not directly accessible (all traffic through LB)
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Load balancer itself fails | Complete service outage | Deploy redundant load balancers; use VRRP/keepalived |
| Backend server fails silently | Requests routed to dead server; errors for users | Implement health checks; remove failed servers quickly |
| Health check misconfiguration | False positives remove healthy servers | Use multiple check types; set appropriate thresholds |
| Sticky session overload | One server gets all traffic; cascade failure | Minimize sticky sessions; use session storage (Redis) |
| SSL termination bottleneck | Load balancer CPU maxes out on encryption | Use SSL offloading hardware; scale horizontally |
| Connection exhaustion | No new connections accepted; service hangs | Monitor connection counts; implement connection limits |
| ARP cache issues with VIP | Traffic routing breaks; intermittent failures | Use keepalived with proper priority; monitor ARP tables |
| Misconfigured routing rules | Traffic goes to wrong backend; data issues | Test rules in staging; implement gradual rollout |
Common Pitfalls / Anti-Patterns
Single Point of Failure
A single load balancer is a single point of failure.
graph TD
A[Client] --> B[Single LB]
B --> C[Server 1]
B --> D[Server 2]
Use redundant load balancers with keepalived or use managed cloud load balancing with built-in redundancy.
Ignoring Health Check Tuning
Overly aggressive health checks cause flapping; overly lenient ones delay failure detection.
# Too aggressive - causes flapping
health_check {
    interval: 1s    # Check every second
    timeout: 1s     # 1 second timeout
    failures: 1     # One failure removes server
}
# Better - balanced
health_check {
    interval: 10s   # Check every 10 seconds
    timeout: 3s     # 3 second timeout
    failures: 3     # Three failures removes server
    success: 2      # Two successes brings server back
}
Not Planning for Connection Draining
Abruptly removing a server drops active connections.
# Allow existing connections to complete
# NGINX (main context): after a graceful shutdown or reload signal,
# give workers up to 60 seconds to finish in-flight connections
worker_shutdown_timeout 60s;
Overusing Sticky Sessions
Sticky sessions defeat load balancing benefits and cause issues.
# Problem: All of User A's requests go to Server 1
# If Server 1 fails, User A loses session
# Better: Store sessions in Redis
# All servers can serve User A
session_store: redis
Not Monitoring Backend Load
A load balancer may distribute requests evenly while individual backends struggle.
# Round-robin ignores how busy each backend is
balance roundrobin    # Equal request counts, not equal load
# Better: leastconn, or weights based on actual capacity
balance leastconn
Design for Resilience
Design for Redundancy From the Start
Never deploy a single load balancer in production. Use at least two in an active-passive or active-active configuration. For cloud deployments, use the managed load balancer which handles redundancy automatically. The moment you tell yourself “we will add redundancy later” is the moment you guarantee an outage.
Health Checks Are Your Safety Net
Configure health checks to match your actual application requirements. A TCP connect check verifies the port is open but not that the application is working. An HTTP check that validates the response code and optionally the response body catches actual application failures. Set thresholds that avoid flapping. Three failures to remove, two successes to restore is a reasonable starting point.
Prefer L7 When You Need Intelligence
L4 gives you raw performance. L7 gives you control. If you need content-based routing, SSL termination, sticky sessions, or any kind of request inspection, use L7. The performance difference only matters at extreme scale. For most web applications, L7 is the right default.
Distribute Load by Actual Capacity
Round-robin distributes requests evenly, not load. If your backends have different capacities or are running different versions, use weighted distribution. Least-connections balancing (least_conn in NGINX, leastconn in HAProxy) routes to the backend with the fewest active connections, which is a better proxy for actual load on request-response workloads.
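The selection rules above fit in a few lines. This sketch assumes the balancer already tracks active connection counts per backend; the backend names, counts, and weights are invented for the example.

```python
def least_conn(active: dict) -> str:
    """Pick the backend with the fewest active connections (ties: first listed)."""
    return min(active, key=active.get)


def weighted_least_conn(active: dict, weights: dict) -> str:
    """Pick the backend with the fewest active connections per unit of capacity."""
    return min(active, key=lambda name: active[name] / weights[name])


active = {"small-1": 8, "small-2": 12, "large-1": 14}
weights = {"small-1": 1, "small-2": 1, "large-1": 4}  # large-1 has 4x the capacity

print(least_conn(active))                    # small-1
print(weighted_least_conn(active, weights))  # large-1 (14 / 4 = 3.5 per unit)
```

Plain least-connections sends the next request to the small server, while the weighted variant recognizes that the large server is proportionally less busy.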
Session and Monitoring
Decouple Session State
Store session state in a distributed cache like Redis rather than relying on sticky sessions. This lets the load balancer do its job properly, distributing requests based on actual load rather than routing constraints. It also means a failed backend does not take user sessions down with it.
Monitor What Matters
Track backend error rates, not just request counts. A backend serving 100 requests per second with 10% errors is worse than one serving 50 requests with 0% errors. Set up alerts on error rate increases before they become outages.
Test Failure Modes Regularly
Run chaos engineering experiments. Terminate a backend server and watch how the load balancer handles it. Verify your monitoring alerts fire. Confirm users recover. Do this in staging regularly enough that you are not surprised when it happens in production.
Operational Excellence
Plan Connection Draining
When taking a backend offline for maintenance, configure connection draining. This allows existing requests to complete while preventing new connections. Set the drain timeout long enough for your worst-case request duration, typically 60 seconds.
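Draining boils down to two rules: stop assigning new connections, and remove the backend once it is idle or the timeout expires. A minimal sketch, with a hypothetical `DrainableBackend` class and an injectable clock so the example is deterministic:

```python
import time


class DrainableBackend:
    def __init__(self, name: str, clock=time.monotonic):
        self.name = name
        self.active = 0               # in-flight requests
        self.draining_since = None    # None means the backend is in normal rotation
        self.clock = clock

    def start_drain(self):
        self.draining_since = self.clock()

    def accepts_new(self) -> bool:
        # A draining backend finishes existing work but takes nothing new.
        return self.draining_since is None

    def safe_to_remove(self, drain_timeout: float = 60.0) -> bool:
        if self.draining_since is None:
            return False
        timed_out = self.clock() - self.draining_since >= drain_timeout
        return self.active == 0 or timed_out


fake_now = [0.0]
backend = DrainableBackend("server-1", clock=lambda: fake_now[0])
backend.active = 2
backend.start_drain()
print(backend.accepts_new(), backend.safe_to_remove())  # False False

backend.active = 0           # in-flight requests finished
print(backend.safe_to_remove())                         # True
```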
Keep Load Balancers Simple
The load balancer should route traffic and enforce policies. Do not try to make it do too much. Offload compression, authentication, and complex business logic to your application servers. Load balancers do one thing well: distributing traffic intelligently.
Interview Questions
Layer 4 operates at the transport layer, making routing decisions based on IP address and port number without inspecting the actual request content. Layer 7 operates at the application layer, enabling content-based routing using HTTP headers, URLs, cookies, and request bodies.
Choose L4 when you need maximum throughput with minimal latency, are load balancing non-HTTP protocols like databases or SSH, or do not need application-layer inspection. Choose L7 when you need content-based routing, SSL termination, sticky sessions, or URL rewriting. L7 adds latency due to parsing but provides much greater routing intelligence.
Sticky sessions route a particular user's requests to the same backend server using cookies, client IP hashing, or custom headers. The load balancer tracks which client maps to which server.
Problems they create: complicates horizontal scaling since you cannot freely redistribute load, makes maintenance windows difficult since taking down a server disconnects active users, creates session affinity that defeats load balancing benefits, and causes user-facing errors when a sticky server fails. Best practice is to store session state in a distributed cache like Redis instead of relying on sticky sessions.
Health checks are periodic tests the load balancer runs against backends to verify they can handle traffic. Types include:
- TCP connect: Verifies the port is open and accepting connections
- HTTP/HTTPS: Makes a full request and validates the response code and optionally the response body
- Custom: Application-specific checks that verify actual service functionality
Health checks run at configurable intervals. A server failing too many checks gets marked unhealthy and removed from rotation. Lighter checks run more frequently; occasional deep validation catches real failures. Configure thresholds to avoid flapping — too aggressive removes healthy servers, too lenient means slow failure detection.
In active-passive, one load balancer actively handles traffic while the other sits idle, monitoring via a protocol like VRRP/keepalived. When the active fails, the passive takes over the virtual IP. This provides redundancy but at 50% utilization of your load balancer capacity.
In active-active, multiple load balancers handle traffic simultaneously. This maximizes utilization and provides redundancy, but requires more coordination. Cloud managed load balancers typically operate in active-active mode by default with built-in redundancy across availability zones.
SSL termination means the load balancer decrypts incoming HTTPS traffic, inspects it (if L7), and forwards unencrypted traffic to backend servers. Benefits: reduces cryptographic load on application servers, centralizes certificate management, enables the load balancer to inject security headers and rewrite URLs.
Trade-offs: traffic between load balancer and backend travels unencrypted (typically fine within a private network), adds CPU load on the load balancer for decryption, and creates a trust boundary where traffic is briefly unencrypted. For sensitive environments, use SSL passthrough (encrypted end-to-end) or re-encrypt before forwarding to backends.
GSLB routes users to the closest or most appropriate geographic region based on DNS responses. Unlike simple DNS round-robin, GSLB considers server health, load, and latency to make routing decisions. You need GSLB when your application runs across multiple regions and you want optimal user experience by minimizing latency.
GSLB health checks must account for regional failures, not just individual server failures. If an entire region becomes unavailable, GSLB should route users to the next closest healthy region. Implementation options include cloud provider services (AWS Route 53, Azure Traffic Manager, Google Cloud Load Balancing) or dedicated GSLB appliances.
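The regional failover rule can be sketched in a few lines. This assumes the GSLB already has a per-region health flag and a latency estimate for the client; the `Region` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    name: str
    healthy: bool      # result of region-level health checks
    latency_ms: float  # estimated latency from the client's resolver


def choose_region(regions: list[Region]) -> Optional[str]:
    """Return the lowest-latency healthy region, or None if every region is down."""
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.latency_ms).name
```

When the closest region is marked unhealthy, the same call returns the next closest one, which is the behavior described above.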
Load balancer health checks handle complete server failures — when a server stops responding entirely, health checks remove it from rotation. Circuit breakers work at a finer granularity, monitoring error rates and latency to detect degraded backends that are still responding but poorly.
When a circuit breaker opens, it immediately returns failures without forwarding requests to a struggling backend, giving that backend time to recover. The backend keeps being probed while the circuit is open, either by health checks or by a few trial requests let through after a cool-down period (the half-open state). Once the backend recovers, the circuit breaker closes and allows traffic again. Service meshes like Istio implement circuit breaking at the sidecar proxy level with configurable outlier detection policies.
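A minimal circuit breaker sketch follows. It is a simplification: it trips on consecutive failures rather than on error rate or latency, and the `max_failures` and `reset_after` parameters are illustrative names, not any particular proxy's settings. Time is passed in explicitly so the behavior is easy to test.

```python
class CircuitBreaker:
    """Closed -> open after `max_failures` consecutive errors.
    Open -> trial requests allowed after `reset_after` seconds.
    A single success closes the circuit again."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let trial requests through once the cool-down has passed.
        return now - self.opened_at >= self.reset_after

    def record(self, success: bool, now: float) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now  # open, or re-open after a failed trial
```

A caller checks `allow()` before forwarding and reports the outcome with `record()`. A failed trial request re-opens the circuit and restarts the cool-down.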
A canary deployment routes a small percentage of production traffic to a new version while the majority stays on the current version. At the load balancer, you set backend weights — start with 5% to the new version, monitor error rates and latency for both groups, and gradually increase the weight as confidence builds.
For example, in HAProxy you would define both versions as backends and adjust weights: 95% to current, 5% to new. Incrementally shift to 90/10, 75/25, 50/50, and finally 0/100 when the canary is proven. If the canary shows elevated error rates or latency, immediately shift traffic back to the stable version. This requires proper observability — compare error rates and latency between groups, not just absolute numbers.
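The weight progression and rollback rule can be sketched as follows. This is an illustration of the logic only, not HAProxy configuration; the step list mirrors the percentages above, and the `tolerance` parameter is an assumed threshold you would tune.

```python
import random

CANARY_STEPS = [5, 10, 25, 50, 100]  # percent of traffic sent to the new version


def pick_version(canary_percent: int, rng: random.Random) -> str:
    """Send roughly `canary_percent` of requests to the canary."""
    return "canary" if rng.random() * 100 < canary_percent else "stable"


def next_step(current: int, canary_error_rate: float, stable_error_rate: float,
              tolerance: float = 0.01) -> int:
    """Advance one step if the canary is no worse than stable (within
    `tolerance`); otherwise roll back to 0%."""
    if canary_error_rate > stable_error_rate + tolerance:
        return 0
    later = [s for s in CANARY_STEPS if s > current]
    return later[0] if later else current
```

Note that `next_step` compares the canary against the stable group rather than against a fixed number, which matches the advice to compare groups instead of looking at absolute values.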
Token bucket allows bursts. Each client gets a bucket that fills with tokens at a steady rate — for example, 100 tokens per second. A request consumes one token. When the bucket is empty, requests are rejected or delayed. If traffic is below the rate limit, tokens accumulate, allowing brief bursts up to the bucket size.
Sliding window counter maintains a rolling time window and counts requests within it. It keeps slightly more state per client than a token bucket but provides smoother rate limiting without burst capability. Both algorithms enforce the same long-run average rate but behave differently over short intervals: token bucket suits bursty traffic patterns, sliding window suits more predictable traffic.
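A token bucket fits in a few lines. This sketch assumes one bucket per client and takes the current time as an argument instead of reading a clock, which keeps it deterministic; the class and parameter names are illustrative.

```python
class TokenBucket:
    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the bucket size.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1  # each request consumes one token
            return True
        return False
```

With `rate=2` and `capacity=3`, three back-to-back requests pass, the fourth is rejected, and one more is admitted half a second later when a token has refilled.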
Critical metrics include: request rate per backend (requests/second), response latency percentiles (p50, p95, p99) per backend, backend health status (healthy/unhealthy/draining), active connections per backend, backend error rate (5xx responses from backends), SSL handshake rate and latency if terminating SSL, and health check success/failure rate.
Also monitor load balancer CPU and memory utilization, connection exhaustion indicators (no new connections accepted), unusual traffic patterns (potential DDoS or abuse), and backend response time trends to catch degradation before it causes errors. Set up alerts on backend error rate increases, p99 latency exceeding SLO, and all backends becoming unhealthy simultaneously.
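Latency percentiles are simple to compute from raw samples. The sketch below uses the nearest-rank method, which is one of several common definitions; production metrics systems usually approximate percentiles from histograms instead. The function names are illustrative.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def p99_alert(latencies_ms: list[float], slo_ms: float) -> bool:
    """True when p99 latency exceeds the SLO threshold."""
    return percentile(latencies_ms, 99) > slo_ms
```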
Load balancers perform health checks at configurable intervals by sending probe requests to backend servers. Common types include TCP connect checks (verifies port is open), HTTP checks (makes request and validates response), and custom application-level checks.
When a backend fails consecutive health checks, the load balancer marks it unhealthy and removes it from the server pool. Existing connections may be terminated or allowed to drain depending on configuration. The load balancer stops routing new traffic to the failed backend. Once the backend recovers and passes enough consecutive health checks (typically 2-3 successes), it automatically rejoins the pool.
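The consecutive-check thresholds amount to a small state machine. In this sketch the `fall` and `rise` names are borrowed from HAProxy's server check options, but the class itself is a hypothetical illustration, not HAProxy's implementation.

```python
class BackendHealth:
    def __init__(self, fall: int = 3, rise: int = 2):
        self.fall = fall     # consecutive failures before removal
        self.rise = rise     # consecutive successes before rejoining
        self.healthy = True
        self.streak = 0      # consecutive results that contradict the current state

    def observe(self, check_passed: bool) -> bool:
        """Record one health check result and return the resulting state."""
        if check_passed == self.healthy:
            self.streak = 0  # result agrees with the current state
        else:
            self.streak += 1
            threshold = self.rise if check_passed else self.fall
            if self.streak >= threshold:
                self.healthy = check_passed
                self.streak = 0
        return self.healthy
```

Because a single passing check resets the failure streak, an occasional timeout does not remove a healthy server, which is what prevents flapping.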
Round-robin distributes requests evenly across backends in rotation. Each server gets the next request in sequence. Simple but does not account for server capacity differences or current load.
Least connections routes to the backend with the fewest active connections at request time. It works better for workloads where request duration varies, because it adapts to how busy each server actually is rather than assuming every request costs the same.
IP hash computes a hash of the client IP address to determine which backend handles the request. This provides session persistence since the same client always goes to the same server. The downside is uneven load distribution when many users share one IP (for example behind a corporate NAT) or when a few clients generate most of the traffic. Adding or removing backends also remaps most clients unless you use consistent hashing.
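The three algorithms can be sketched side by side. These are simplified illustrations with hypothetical names; real load balancers add weights, health filtering, and locking around shared state.

```python
import hashlib
import itertools


class RoundRobin:
    """Hand out backends in a fixed rotation."""

    def __init__(self, backends: list[str]):
        self._cycle = itertools.cycle(backends)

    def pick(self) -> str:
        return next(self._cycle)


def least_connections(active: dict[str, int]) -> str:
    """Pick the backend with the fewest active connections."""
    return min(active, key=active.get)


def ip_hash(client_ip: str, backends: list[str]) -> str:
    """Map a client IP to a backend; stable while the backend list is unchanged."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return backends[int.from_bytes(digest[:4], "big") % len(backends)]
```

The modulo in `ip_hash` is the reason adding or removing a backend reshuffles existing mappings: changing `len(backends)` changes the result for most clients.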
Blue-green deployment maintains two identical environments. The current production environment (blue) handles live traffic while the new version (green) sits idle. Deployment involves shifting all traffic from blue to green at the load balancer level with a single routing rule change.
Advantages: instant rollback (flip traffic back to blue if issues occur), zero downtime since both environments stay warm, and no partial deployment states where some users get the new version and others do not. The old environment stays running during the deployment window so rollback never requires rebuilding anything. Database migrations still need care: if both environments share a database, schema changes must stay compatible with the old version until the switch is verified.
Anycast routing has multiple servers share the same IP address. The network routes packets to the nearest server based on BGP path metrics, without application involvement. This works at the network layer and is commonly used by CDNs.
GSLB provides more control. You can weigh traffic differently, factor in server health beyond just network reachability, and route based on application-layer signals. GSLB works at the DNS level, returning different IP addresses based on where the query originates. Unlike Anycast, GSLB can distinguish between a slow region and a slow internet path, and can perform health checks on application functionality.
SSL termination at load balancer: Benefits include reduced cryptographic load on application servers, centralized certificate management, ability to inspect and modify encrypted traffic (L7), and URL rewriting. Drawbacks are CPU load on the load balancer for decryption and traffic traveling unencrypted between LB and backends.
SSL passthrough: Traffic remains encrypted end-to-end, no trust boundary issues. The backend servers handle decryption CPU load. Drawbacks include inability to perform L7 routing on encrypted traffic, no central certificate management, and no traffic inspection capability.
Connection draining allows existing requests to complete on a backend server being taken offline while preventing new connections from being routed to it. The load balancer stops sending new traffic but allows in-flight requests to finish.
Without connection draining, abruptly removing a server drops active user connections, causing errors and poor experience. With draining configured, users complete their current tasks while new requests go to healthy backends. Set the drain timeout long enough for your worst-case request duration, typically 30-60 seconds. This enables maintenance windows without user-facing impact.
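The draining rule can be sketched as a small state object. The class and method names here are hypothetical; the point is the two conditions under which a draining backend can finally be removed.

```python
class DrainingBackend:
    def __init__(self, name: str):
        self.name = name
        self.draining = False
        self.in_flight = 0  # requests currently being served

    def accepts_new(self) -> bool:
        """The balancer skips this backend for new requests once draining."""
        return not self.draining

    def start_request(self) -> None:
        if self.draining:
            raise RuntimeError(f"{self.name} is draining")
        self.in_flight += 1

    def finish_request(self) -> None:
        self.in_flight -= 1

    def safe_to_remove(self, drain_started: float, now: float, timeout: float) -> bool:
        """Remove once idle, or once the drain timeout has expired."""
        if not self.draining:
            return False
        return self.in_flight == 0 or now - drain_started >= timeout
```

The timeout branch is the trade-off in the text: set it too short and long requests get cut off, set it too long and maintenance waits on a stuck connection.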
Load balancers sit in the perfect spot for rate limiting since they see every request before it reaches applications. Common approaches include token bucket (allows bursts up to bucket size) and sliding window counter (smooth rate limiting without bursts). Rate limits can be applied by source IP, API key, User-Agent header, or JWT claims.
For DDoS protection, load balancers can absorb some volumetric traffic by distributing it across backends, though extreme attacks require dedicated DDoS mitigation services. Many cloud load balancers integrate with WAF rules for rate-based policies, and some offer IP reputation filtering.
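As an illustration of per-client limiting, here is a sliding window counter keyed by an arbitrary string such as a source IP or API key. It is a sketch of the common two-window approximation, with hypothetical names, and it takes the current time as an argument so it stays deterministic.

```python
from collections import defaultdict


class SlidingWindowCounter:
    def __init__(self, limit: int, window: float):
        self.limit = limit    # maximum requests per window
        self.window = window  # window length in seconds
        # Per key: [window index, count in current window, count in previous window]
        self.state = defaultdict(lambda: [0, 0, 0])

    def allow(self, key: str, now: float) -> bool:
        idx = int(now // self.window)
        s = self.state[key]
        if idx != s[0]:
            # Roll forward; keep the old count only if the windows are adjacent.
            s[2] = s[1] if idx == s[0] + 1 else 0
            s[0], s[1] = idx, 0
        # Weight the previous window by the fraction that still overlaps.
        overlap = 1.0 - (now % self.window) / self.window
        estimated = s[2] * overlap + s[1]
        if estimated < self.limit:
            s[1] += 1
            return True
        return False
```

Each key needs only three numbers of state, so the limiter scales to many clients, and the weighted previous window prevents the double burst that a fixed window allows at window boundaries.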
Ingress controllers handle north-south traffic (into the cluster), implementing L7 routing based on host and path, SSL termination, and weighted backend services for canary deployments. They sit at the cluster edge, where traffic from an external load balancer is terminated and routed to services.
Service meshes handle east-west traffic (within the cluster) through per-pod sidecar proxies. Every outbound request goes through the sidecar, which makes routing decisions based on mesh policies. Service meshes provide per-request routing, fault injection, automatic retries, and fine-grained traffic shaping.
Most deployments need both. The ingress controller handles traffic entering the cluster from outside, while the service mesh handles how those requests reach application services and how services communicate with each other inside the cluster.
Design for redundancy from the start with at least two load balancers in active-passive or active-active configuration. Active-passive uses VRRP/keepalived where the standby monitors via heartbeat and takes over the virtual IP when active fails. Active-active distributes load across multiple LBs for better utilization.
Key considerations: Deploy load balancers in different availability zones, use managed cloud load balancers with built-in redundancy, implement connection draining for planned maintenance, configure proper health checks with appropriate thresholds to avoid flapping, and monitor load balancer health metrics alongside backend metrics.
Choose managed cloud load balancer when: You want automatic scaling and high availability built-in, you prefer reducing operational overhead, you need tight integration with other cloud services (auto-scaling groups, security groups, WAF), or you are building cloud-native applications.
Choose software load balancers on VMs when: You need specific configuration options not available in the managed offering, cost is a major factor (managed LBs have per-hour pricing), you need consistent load balancing across multi-cloud or hybrid environments, or you want more control over routing logic and integration with external tools.
Further Reading
- Load Balancing Algorithms — Deep dive into specific algorithms: round-robin, weighted round-robin, least connections, IP hash, and adaptive load balancing
- CAP Theorem — Fundamental trade-offs in distributed systems that affect load balancing design decisions
- Microservices Architecture — How load balancing fits into larger distributed system patterns
- Service Mesh Deep Dive — Sidecar proxies and mesh-level load balancing for Kubernetes environments
- HAProxy Architecture Guide — Detailed documentation on HAProxy’s internal workings and configuration best practices
- Envoy Proxy Documentation — For understanding modern L7 proxy and service mesh implementation patterns
- AWS Application Load Balancer Documentation — Cloud-native load balancing patterns and integrations
Conclusion
Key Bullets
- Load balancers distribute traffic across multiple servers to improve availability and scalability
- L4 operates at transport layer (IP + port) for maximum performance
- L7 operates at application layer (HTTP) for intelligent routing
- Health checks continuously monitor backend availability
- Sticky sessions route users to the same backend but reduce flexibility
- SSL termination at load balancer reduces backend cryptographic load
- Software load balancers (HAProxy, Nginx) work well for most use cases
- Cloud managed load balancers reduce operational overhead
- Redundancy is critical: never have a single load balancer in production
Copy/Paste Checklist
# Check HAProxy backend status (via socket)
echo "show stat" | socat stdio /var/run/haproxy.sock
# Check Nginx upstream status (requires a status or API endpoint to be enabled; port and path depend on your configuration)
curl http://localhost:8500/upstream_conf
# Test health check endpoint
curl -I http://backend1:8080/health
# Check active connections
ss -s
# View HAProxy metrics
echo "show info" | socat stdio /var/run/haproxy.sock
# Test backend directly (bypass load balancer)
curl -H "Host: example.com" http://backend1:8080/