The Eight Fallacies of Distributed Computing
Explore the classic assumptions developers make about networked systems that lead to failures. Learn how to avoid these pitfalls in distributed architecture.
Introduction
Fallacy 1: The Network is Reliable
The first and most dangerous fallacy. Production networks fail constantly: cables get cut, switches reboot, routers misconfigure. In any large enough system, you should plan for network failures daily.
graph TD
A[Client Request] --> B[Network]
B -->|Packet Loss| C[Request Timeout]
B -->|Latency Spike| D[Request Slow]
B -->|Partition| E[Request Failed]
C --> F[Retry Logic]
D --> F
E --> G[Fallback Behavior]
F --> G
The fix: implement retries with exponential backoff, circuit breakers, and fallback behavior. When designing a distributed system, assume the network will fail and design accordingly. See the CAP theorem for what happens during network partitions.
Fallacy 2: Latency is Zero
Light takes about 67ms to cross the US at the speed of fiber. A round trip between New York and Tokyo runs 100-150ms. Developers who treat network calls like function calls soon discover their “simple” distributed system responds in seconds, not milliseconds.
// Naive approach: multiple sequential network calls
async function getUserDashboard(userId) {
const user = await fetchUser(userId); // 50ms
const posts = await fetchPosts(userId); // 80ms
const friends = await fetchFriends(userId); // 60ms
const notifications = await fetchNotifs(userId); // 40ms
// Total: 230ms sequential
// Better: parallel calls
const [user, posts, friends, notifs] = await Promise.all([
fetchUser(userId),
fetchPosts(userId),
fetchFriends(userId),
fetchNotifications(userId),
]);
// Total: ~80ms (time of slowest call)
}
Even without partitions, latency and consistency trade off against each other. See the PACELC theorem. Synchronous replication for strong consistency adds latency. Async replication is faster but introduces potential inconsistency.
Fallacy 3: Bandwidth is Infinite
Early network proponents assumed bandwidth was effectively unlimited. You hit the wall fast when replicating large datasets or streaming video across data centers.
Real bandwidth limits hit in ways that surprise teams:
- Replicating a 500GB database across regions at 100Mbps takes over 11 hours
- Streaming 4K video to 10,000 concurrent viewers needs ~300Gbps
- A single gigabit network link maxes out at about 125MB/s of actual throughput
The geo-distribution post covers how to architect when data must span regions—and the latency and bandwidth costs that come with it.
Fallacy 4: The Network is Secure
Internal networks get compromised. Services get repurposed. Attackers find ways in. Zero trust architecture exists because assuming internal networks are safe has killed systems. Every service should authenticate every request, even from within the same data center.
The mTLS post covers mutual TLS for service-to-service authentication. The secrets management post discusses how to handle credentials in a distributed environment.
Fallacy 5: Topology Doesn’t Change
Static network diagrams go stale the moment you draw them. Services get added, removed, migrated, and scaled. Containers move between hosts. VMs get provisioned and deprovisioned. Your “stable” service discovery configuration becomes stale within weeks.
graph LR
A[Service A] -->|Initially| B[Service B]
A -->|After migration| C[Service B in DC2]
A -->|With service mesh| D[Service B via mesh]
Dynamic service discovery (covered in DNS Service Discovery and Service Registry) handles topology changes. But you still need to design for instances appearing and disappearing without warning.
Fallacy 6: There is One Administrator
In a distributed system, multiple teams manage different parts of the infrastructure. The network team controls routers. The database team manages replicas. The platform team handles Kubernetes. The security team enforces firewall rules. Changes made by any one can break things for all.
Coordination overhead grows fast with system size. A simple change like adding a new firewall port requires tickets, approvals, and maintenance windows. Distributed systems require strong processes alongside strong architecture.
The service boundaries post discusses how to structure teams and services to minimize coordination overhead while maintaining proper separation.
DevOps and SRE Organizational Patterns
Fallacy 6 manifests most painfully in multi-team environments. When hundreds of services are managed by dozens of teams, coordination becomes the bottleneck. These patterns help manage the complexity.
Team topologies for distributed systems
The Team Topologies book from SKF and Red Hat identifies four fundamental team patterns that map well to distributed systems:
| Team Type | Responsibility | Interaction Pattern |
|---|---|---|
| Stream-aligned | Deliver user-facing features | Self-sufficient, flow work |
| Platform | Provide internal platforms/tools | Enable stream teams |
| Enabling | Help teams overcome obstacles | Temporary, embed with teams |
| Complicated Subsystem | Own complex domain expertise | Consulted by stream teams |
In distributed systems, platform teams become critical. They own the shared infrastructure that stream-aligned teams consume: service mesh, observability stack, database platforms, networking policies.
SRE practices that counteract Fallacy 6
SRE (Site Reliability Engineering) was invented at Google to bridge the gap between engineering and operations in large-scale distributed systems:
Error Budgets: Rather than mandating that every change goes through a lengthy approval process, error budgets allow teams to move fast as long as the system remains reliable. If your SLA is 99.9% uptime, you have 0.1% downtime budget to spend on risk. When the budget is full, slow down. When it is healthy, move fast.
Toil Reduction: SRE defines toil as manual, repetitive, automatable work. Teams drowning in toil cannot move fast. SRE practices push organizations to automate operational work so engineers can focus on improving the system rather than running it.
SLOs and SLIs: Service Level Objectives give teams shared definitions of reliability. When everyone agrees that “reliable” means 99.95% availability with p99 latency under 200ms, cross-team discussions become easier. SLOs become the common language for reliability decisions.
Blameless Postmortems: Failures in complex distributed systems almost never trace back to “one person made a mistake.” Multiple contributing factors across team boundaries are almost always involved. Blameless postmortems focus on systemic causes rather than individual fault, which means people actually report near-misses instead of hiding them.
# Example: SRE error budget policy
# .sre/config.yaml
error_budget_policy:
availability_target: 99.95 # 4 hours downtime per year
slo_window: 30d
warning_threshold: 50% # Remaining budget
critical_threshold: 25% # Remaining budget
# Actions tied to budget consumption
when_below_50%:
- Increase testing rigor
- Require additional review for risky changes
when_below_25%:
- Freeze non-critical deployments
- Root cause analysis required
- Executive notification
when_depleted:
- Emergency retrospective
- Post-mortem mandatory
- Deployment freeze until resolved
Cross-team coordination mechanisms
As distributed systems grow, you need formal mechanisms to coordinate changes across team boundaries:
Architecture Decision Records (ADRs): When a decision affects multiple teams, an ADR documents the context, decision, and consequences. This creates a searchable record of why the system is built the way it is. New team members can read ADRs to understand architectural decisions made before they joined.
Service Level Agreements (SLAs) Between Teams: Downstream teams depend on upstream services. Formalizing those dependencies as SLAs—with explicit latency, availability, and error rate commitments—makes it clear what each team owes to others.
Design Reviews for Cross-Cutting Changes: Changes to shared infrastructure, API contracts, or data schemas should go through a design review process that includes affected teams. This catches breaking changes before they happen.
Dark Launches and Feature Flags: Rather than coordinating deployments across teams, feature flags let teams deploy code independently. A dark launch exposes new functionality to a small subset of traffic while the team monitors for issues. This decouples deployment from release.
# Feature flag example for cross-team coordination
# When Team A wants to change an API that Team B depends on,
# Team A can:
# 1. Deploy the change behind a feature flag
# 2. Test with their own traffic
# 3. Gradually increase traffic while monitoring
# 4. Roll back instantly if issues arise
# No coordination with Team B required until the flag is fully enabled
def gradual_rollout(flag_name: str, initial_percentage: float = 1.0,
increment: float = 10.0, interval_seconds: float = 300):
"""
Gradually increase traffic to a new feature.
Returns when flag reaches 100% or is disabled.
"""
current_percentage = initial_percentage
while current_percentage < 100.0:
if is_flag_disabled(flag_name):
print(f"Flag {flag_name} disabled - rolling back")
return False
metrics = get_flag_metrics(flag_name)
if metrics.error_rate > 0.01: # 1% error threshold
print(f"Error rate {metrics.error_rate} exceeds threshold - rolling back")
disable_flag(flag_name)
return False
current_percentage = min(current_percentage + increment, 100.0)
set_flag_percentage(flag_name, current_percentage)
print(f"Flag {flag_name} at {current_percentage}%")
sleep(interval_seconds)
return True
Practical guidelines for reducing coordination overhead
-
Embrace self-service: Build platforms and tooling that let teams operate independently. If deploying a new service requires filing a ticket with the platform team, you have a bottleneck.
-
Make defaults safe: Design systems so the safe choice is the default. If encryption should be on by default, make it impossible to turn off. This reduces the number of decisions teams must make and mistakes they can make.
-
Invest in observability: When something breaks in a distributed system, the first question is always “which service?” Good observability—logs, metrics, traces—lets teams diagnose issues without coordinating with other teams.
-
Use service meshes for network policies: Rather than coordinating firewall rules across teams, a service mesh like Istio or Linkerd lets you express network policy as code. Teams define what their service needs to access; the mesh enforces it.
-
Create shared libraries for common patterns: When multiple teams implement the same pattern (retries, timeouts, circuit breakers), a shared library means they all benefit from improvements and bug fixes without coordination.
Fallacy 7: Transport Cost is Zero
This extends “latency is zero” but focuses on serialization and deserialization. Converting an object to JSON, sending it across the network, and converting it back takes CPU time and memory. For high-throughput systems, serialization alone can dominate processing time.
// Efficient binary protocol vs JSON
const userJson = JSON.stringify(user); // CPU: ~500μs for complex object
const userProtobuf = protobuf.encode(user); // CPU: ~50μs, 60% smaller
// Impact at scale: 10,000 requests/second
// JSON: 5 seconds CPU just for serialization
// Protobuf: 0.5 seconds CPU
gRPC with Protocol Buffers outperforms REST with JSON for internal service communication. The API contracts post covers designing service interfaces that minimize overhead.
Fallacy 8: The Network is Homogeneous
Assuming all network paths are equal leads to subtle performance problems. A call from service A to service B in the same data center might take 1ms. The same call routed through a load balancer might take 5ms. A call that crosses a WAN might take 50ms. A call that goes through a poorly configured firewall might time out.
Network paths are never uniform. Quality of service policies, routing anomalies, and congestion all affect performance unpredictably. Load balancing and load balancing algorithms posts cover how to route traffic intelligently despite network heterogeneity.
Core Concepts
These fallacies compound in production. A team assumes latency is negligible, so they make synchronous calls between services. Then a network partition hits. Because latency is actually non-zero, timeouts trigger slowly. Retries overwhelm the system. The cascade spreads.
The availability patterns and resilience patterns posts cover how to build systems that survive when assumptions fail—because they always do.
Real incident case studies
Understanding fallacies through real-world failures helps cement why these assumptions are dangerous.
AWS US-EAST-1 Outage (December 2021) - Fallacy 1: Network is Reliable
What happened: A US-EAST-1 availability zone experienced a power failure that cascaded. The outage lasted over 7 hours and affected thousands of services including major platforms like Amazon, Netflix, and Disney+.
Root fallacy violations:
- Network is Reliable: The single-AZ design assumed failure wouldn’t spread across availability zones properly
- Latency is Zero: Synchronous cross-service dependencies meant one failing component brought down entire systems
- One Administrator: The change management process for restoring power took hours of coordination
Key lessons:
- Assume any single component can fail catastrophically
- Design for partial availability - degrade gracefully
- Have runbooks for multi-step recovery procedures
Cloudflare DNS Outage (July 2019) - Fallacy 1 & 4: Network is Reliable, Network is Secure
What happened: A Layer 3 network routing misconfiguration caused Cloudflare’s DNS service to drop approximately 15% of global traffic. Over 400 million websites and services were unreachable.
Root fallacy violations:
- Network is Reliable: A single routing configuration error propagated globally
- Network is Homogeneous: The assumption that all network paths behave consistently proved false
Key lessons:
- BGP routing errors can propagate globally in minutes
- Have multiple DNS providers for critical services
- Monitor for unexpected routing changes
GitHub Outage (February 2018) - Fallacy 2: Latency is Zero
What happened: GitHub experienced 24 hours of degraded performance after a routine maintenance operation caused excessive load on their primary MySQL databases. The cascading failure affected over 35 million developers.
Root fallacy violations:
- Latency is Zero: Maintenance operations weren’t scheduled with latency impact analysis
- Transport Cost is Zero: The serialization of large MySQL result sets caused memory pressure
Key lessons:
- Always analyze latency impact of maintenance operations
- Have rollback procedures that account for replication lag
- Monitor database connection pool exhaustion
The compounding effect: How fallacies cascade
flowchart TD
A[Assume latency is zero] --> B[Make synchronous calls between services]
B --> C[Network partition occurs]
C --> D[Timeouts trigger slowly due to no timeout budgets]
D --> E[Retries overwhelm healthy services]
E --> F[Cascade failure spreads across system]
G[Assume network is reliable] --> H[No circuit breakers configured]
H --> C
I[Assume topology doesn't change] --> J[Hardcoded IP addresses used]
J --> K[Service migration causes connection failures]
K --> C
The compounding pattern: Violating one fallacy often amplifies violations of others. A network partition (fallacy 1) becomes catastrophic when combined with no timeout budgets (fallacy 2) and no circuit breakers (fallacy 1 again).
Designing for Reality
Use these fallacies as a checklist when designing distributed systems:
graph TD
A[Designing Distributed System] --> B{Ask These Questions}
B --> C[What happens when network calls fail?]
B --> D[What is actual latency between services?]
B --> E[Are we exceeding bandwidth budgets?]
B --> F[Do we authenticate internal calls?]
B --> G[How does topology changes affect us?]
B --> H[Who coordinates cross-team changes?]
B --> I[Is serialization cost included in latency budgets?]
B --> J[Are all network paths equally reliable?]
If you cannot answer these questions confidently, your system will fail in production. Not might—will.
Production Failure Scenarios
| Failure Scenario | Root Fallacy | Mitigation |
|---|---|---|
| Cascading timeout during network blip | Latency is zero, Network is reliable | Circuit breakers, bulkheads, timeouts with budgets |
| Database replication falls behind | Bandwidth is infinite | Monitor replication lag, compress streams, batch |
| Service discovery returns stale endpoints | Topology doesn’t change | Use dynamic discovery with short TTLs |
| Lateral movement after internal breach | Network is secure | Zero trust, mTLS, every call authenticated |
| Single team bottleneck on deployment | One administrator | Decentralize ownership, self-service platforms |
Common Pitfalls / Anti-Patterns
Pitfall 1: Treating network calls like function calls
Problem: Calling a remote service as if it were a local function, without timeouts, retries, or error handling.
Solution: Every network call must have: timeout, retry logic (with backoff), fallback behavior, and proper error handling.
Pitfall 2: Assuming homogeneous network conditions
Problem: Treating all environments (dev, staging, production) as having similar network characteristics.
Solution: Test against production-like network conditions. Use chaos engineering to inject latency and partition services.
Pitfall 3: Ignoring serialization overhead
Problem: Choosing JSON because it is readable, ignoring CPU and bandwidth costs at scale.
Solution: Profile serialization cost. For high-throughput internal services, consider Protocol Buffers, Avro, or MessagePack.
Self-Assessment Quiz: Is Your Architecture Violating These Fallacies?
Test your system design against each fallacy:
Fallacy 1: Network is Reliable
- Do you have retry logic with exponential backoff for all network calls?
- Are circuit breakers configured for all downstream services?
- Do you have fallback behavior when services are unavailable?
- Have you tested what happens when a network partition occurs?
Checklist: Fallacy 2 - Latency is Zero
- Have you measured actual latency between all service pairs?
- Are timeouts set based on measured P99 latency, not arbitrary values?
- Do you use parallel calls where possible instead of sequential chains?
- Is latency included in your SLA calculations?
Checklist: Fallacy 3 - Bandwidth is Infinite
- Have you calculated bandwidth requirements for data replication?
- Do you compress data streams between datacenters?
- Are large data transfers batched or streamed rather than bulk-loaded?
- Do you monitor bandwidth utilization and alert on limits?
Fallacy 4: Network is Secure
- Do you use mTLS for all internal service-to-service communication?
- Is every API endpoint authenticated, even internal ones?
- Do you rotate secrets and certificates automatically?
- Do you log and audit all cross-service communication?
Checklist: Fallacy 5 - Topology Doesn’t Change
- Do you use dynamic service discovery instead of static IPs?
- Are services designed to handle instances appearing/disappearing?
- Do you use Kubernetes or similar for dynamic load balancing?
- Are DNS TTLs short enough for rapid changes?
Checklist: Fallacy 6 - There is One Administrator
- Do you have clear ownership boundaries for each service?
- Are cross-team changes coordinated through ADRs (Architecture Decision Records)?
- Do you have self-service deployment pipelines?
- Is there a clear escalation path for incidents?
Checklist: Fallacy 7 - Transport Cost is Zero
- Have you profiled serialization costs at your expected throughput?
- Are you using efficient serialization (Protobuf, Avro) for internal communication?
- Do you stream large datasets rather than loading them all at once?
- Is serialization cost included in latency budgets?
Fallacy 8: Network is Homogeneous
- Do you use latency-aware routing (not just round-robin)?
- Have you tested behavior across different network paths?
- Do you monitor per-path latency, not just aggregate?
- Do you have quality-of-service policies for different traffic types?
Scoring
Count your unchecked boxes:
- 0-4 unchecked: Your architecture is resilient - well done!
- 5-10 unchecked: Some vulnerabilities exist - review weak areas
- 11+ unchecked: High risk - prioritize fixing these issues before production
Quick Recap Checklist
Use this checklist to verify your distributed system design addresses each fallacy before production deployment:
Network Reliability
- All network calls have retry logic with exponential backoff
- Circuit breakers are configured for all downstream services
- Fallback behavior is defined for when services are unavailable
- Timeouts are set based on measured P99 latency, not arbitrary values
Latency Awareness
- Actual latency measured between all service pairs across regions
- Parallel fan-out used for independent service calls
- Latency budgets include serialization and deserialization costs
- Async I/O preferred over synchronous blocking calls
Bandwidth Planning
- Bandwidth requirements calculated for all data replication streams
- Compression enabled for cross-datacenter traffic
- Large transfers batched or streamed, not bulk-loaded
- Monitoring in place for bandwidth utilization alerts
Security Posture
- mTLS enabled for all internal service-to-service communication
- Every API endpoint authenticated, including internal ones
- Secrets and certificates rotated automatically
- Zero trust assumed for all network hops
Topology Dynamics
- Dynamic service discovery used instead of static IPs
- Services handle instances appearing/disappearing gracefully
- DNS TTLs configured appropriately for rapid changes
- Short-lived connections rebuilt cleanly after service migration
Team Coordination
- Clear ownership boundaries defined for each service
- Architecture Decision Records (ADRs) maintained for cross-team decisions
- Self-service deployment pipelines available to all teams
- SLOs documented and shared across team boundaries
Transport Efficiency
- Serialization costs profiled at expected throughput
- Binary protocols (Protobuf, Avro) used for high-throughput internal communication
- Large datasets streamed rather than loaded entirely into memory
- Transport cost included in latency budgets
Network Heterogeneity
- Latency-aware routing configured (not just round-robin)
- Per-path latency monitored separately from aggregate metrics
- Quality-of-service policies defined for different traffic types
- Firewall and load balancer latencies factored into timeout calculations
When to Use / When Not to Use
| Scenario | Recommendation |
|---|---|
| Greenfield distributed system | Design explicitly for all eight fallacies |
| Adding distribution to monolith | Start with async communication |
| Multi-region deployment | Assume all fallacies apply doubled |
| Internal microservices | Treat internal traffic as untrusted |
| High-throughput systems | Profile serialization cost early |
Fallacy trade-off decision matrix
When designing a distributed system, each fallacy creates a set of competing concerns. This matrix maps each fallacy against the engineering trade-offs it forces you to make.
| Fallacy | You assume… | Reality forces… | Typical cost if violated | Mitigating pattern |
|---|---|---|---|---|
| 1. Network is reliable | Calls always succeed | Partitions happen, routes flap, links drop | Data loss, cascade failures | Circuit breakers, retries with backoff |
| 2. Latency is zero | Calls are free | Cross-DC latency dominates response time | Broken SLAs, timeouts, user-facing lag | Async I/O, parallel fan-out, latency budgets |
| 3. Bandwidth is infinite | Push all data freely | Replication streams saturate links | Replication lag, missed SLAs | Compression, batching, CDC, read replicas |
| 4. Network is secure | Internal == trusted | Lateral movement, compromised services | Data breach, internal exfiltration | mTLS, zero trust, every-call auth |
| 5. Topology doesn’t change | Endpoints are stable | Instances migrate, pods reschedule, IPs change | Connection failures, stale lookups | Dynamic discovery, short DNS TTLs |
| 6. One administrator | Single team owns all | Cross-team change coordination takes weeks | Deployment bottlenecks, incidents | Self-service platforms, ADRs, feature flags |
| 7. Transport cost is zero | Serialize freely | JSON encoding burns CPU at scale | CPU bottlenecks, throughput cap | Protobuf, Avro, MessagePack |
| 8. Network is homogeneous | All paths equal | Hotspots, QoS throttling, firewall timeouts | Unpredictable latency, timeouts | Latency-aware routing, per-path monitoring |
How to use this matrix: For each row, check whether your current architecture accounts for the “Reality forces” column. If your design ignores a fallacy, the “Typical cost if violated” column tells you what failure mode to expect. The “Mitigating pattern” column points to the specific pattern that addresses each trade-off on this site.
Interview Questions
These questions test understanding of the eight fallacies and their practical implications in distributed system design.
Expected answer points:
- Networks fail constantly through cable cuts, switch reboots, and router misconfigurations
- In large enough systems, network failures should be expected daily, not treated as exceptional events
- Unreliable networks cause cascading failures when services make synchronous calls without timeouts
- The assumption leads to designs without retry logic, circuit breakers, or fallback behavior
- Production networks experience packet loss, latency spikes, and partitions that break naive assumptions
Expected answer points:
- Local function calls complete in nanoseconds; network calls take milliseconds to hundreds of milliseconds
- Sequential network calls compound latency (multiple 50ms calls = 200ms+ total)
- Developers treating network calls like function calls create systems that respond in seconds, not milliseconds
- Non-zero latency means timeouts must be explicitly configured with appropriate budgets
- Parallel execution of independent calls becomes essential for performance
- The PACELC theorem shows latency and consistency trade off against each other
Expected answer points:
- Cross-region database replication of 500GB at 100Mbps takes over 11 hours
- 4K video streaming to 10,000 concurrent viewers requires approximately 300Gbps
- A gigabit network link maxes out at about 125MB/s of actual throughput
- Teams encounter unexpected throttling and performance degradation during data-intensive operations
- Compression, batching, and streaming become necessities rather than optimizations
Expected answer points:
- Internal networks get compromised through lateral movement, insider threats, and misconfiguration
- Services get repurposed and expose unexpected attack surfaces
- Zero trust requires every service to authenticate every request, even from within the same data center
- mTLS provides mutual authentication between services
- Every network hop is treated as potentially hostile, regardless of physical location
Expected answer points:
- Services get added, removed, migrated, and scaled constantly in production environments
- Containers move between hosts and VMs get provisioned/deprovisioned dynamically
- Static IPs become stale within weeks as infrastructure changes
- Dynamic service discovery with short TTLs handles topology changes gracefully
- DNS Service Discovery and Service Registries provide real-time endpoint tracking
- Service meshes enforce network policies as code rather than relying on static firewall rules
Expected answer points:
- Multiple teams manage different infrastructure parts: network team, database team, platform team, security team
- Adding a firewall port requires tickets, approvals, and maintenance windows across teams
- Coordination overhead becomes the bottleneck as system size grows
- Team Topologies patterns: stream-aligned, platform, enabling, and complicated subsystem teams
- SRE practices with error budgets, SLOs, and blameless postmortems help manage complexity
- ADRs (Architecture Decision Records) create searchable records of cross-team decisions
Expected answer points:
- Transport cost includes serialization (object to JSON), transmission, and deserialization on the other side
- JSON.stringify on a complex object takes approximately 500 microseconds; Protobuf encoding takes approximately 50 microseconds
- At 10,000 requests/second: JSON uses 5 seconds of CPU for serialization versus 0.5 seconds for Protobuf
- High-throughput systems find serialization overhead dominates processing time
- Efficient binary protocols like gRPC with Protocol Buffers significantly reduce transport overhead
Expected answer points:
- Intra-datacenter calls might take 1ms, calls through load balancers 5ms, WAN calls 50ms
- Quality of service policies, routing anomalies, and congestion create unpredictable performance
- Round-robin ignores latency differences and sends equal traffic to fast and slow paths
- Latency-aware routing must consider actual path performance, not just node selection
- Poorly configured firewalls may cause calls to time out entirely on specific paths
- Monitoring per-path latency rather than aggregate reveals hidden performance issues
Expected answer points:
- Assuming latency is zero leads to synchronous calls between services
- Network partitions hit, and because latency is non-zero, timeouts trigger slowly
- Slow-triggering timeouts cause retries to accumulate and overwhelm healthy services
- Assuming network is reliable with no circuit breakers allows cascade to spread
- Hardcoded IPs after service migration cause connection failures during topology changes
- The compounding pattern: violating one fallacy amplifies violations of others
Expected answer points:
- Chaos engineering injects controlled failures into production to test system behavior
- Teams inject latency, partition services, and simulate failures to validate assumptions
- Netflix's Chaos Monkey randomly terminates instances to ensure resilience
- Discovers hidden vulnerabilities before they cause actual outages
- Tests whether circuit breakers, retries, and fallback behavior actually work under failure conditions
Expected answer points:
- Circuit breakers trip when a downstream service fails repeatedly, stopping further requests
- Bulkheads partition resources (connection pools, threads) so failures in one service don't exhaust resources for others
- Circuit breakers: use when services call other services that might be slow or unavailable during peak load
- Bulkheads: use when you need to isolate critical services from noisy neighbors consuming all resources
- Both patterns address network unreliability by limiting blast radius of failures
Expected answer points:
- The CAP theorem states that during a network partition (P), you must choose between consistency (C) and availability (A)
- Assuming the network is reliable ignores the possibility of partitions entirely
- In reality, partitions happen; designs must choose between serving stale data (AP) or refusing requests (CP)
- When network is reliable and no partition occurs, you can have both consistency and availability (CA)
- The fallacy leads to designs that assume CA without handling the P case when partitions occur
Expected answer points:
- Feature flags decouple deployment from release, allowing teams to deploy independently
- Team A can change an API that Team B depends on by deploying behind a flag
- Testing happens with own traffic before exposing to dependent teams
- Gradual rollout with monitoring allows rollback without cross-team coordination
- Dark launches expose new functionality to small traffic subsets while team monitors for issues
- No coordination with Team B required until the flag is fully enabled
Expected answer points:
- Error budget = allowed downtime based on SLA (e.g., 99.95% availability = 4 hours downtime per year)
- Teams can move fast as long as system reliability remains within budget
- When budget is full (reliable), teams move fast; when budget is depleted, they slow down
- Removes need for lengthy approval processes when system is healthy
- At 50% budget remaining: increase testing rigor, require additional review
- At 25% budget remaining: freeze non-critical deployments, require root cause analysis
Expected answer points:
- PACELC extends CAP: if there is a partition (P), you choose between latency (L) and consistency (C)
- When no partition occurs, you still face a latency vs consistency tradeoff
- Synchronous replication for strong consistency adds latency due to coordination overhead
- Asynchronous replication is faster but introduces potential inconsistency
- Designing for low latency often means accepting eventual consistency
- The fallacy "latency is zero" ignores this fundamental tradeoff that affects real system performance
Expected answer points:
- Self-service platforms: teams can deploy new services without filing tickets with platform team
- Safe defaults: encryption on by default, impossible to turn off, reduces decision mistakes
- Observability investment: logs, metrics, traces let teams diagnose issues without coordinating with other teams
- Service meshes: network policy as code instead of coordinating firewall rules
- Shared libraries for common patterns (retries, timeouts, circuit breakers): all teams benefit from improvements without individual coordination
Expected answer points:
- Protocol Buffers produce messages 60% smaller than equivalent JSON
- Encoding time is approximately 10x faster (50 microseconds vs 500 microseconds for complex objects)
- gRPC uses HTTP/2 for multiplexing multiple requests over a single connection
- Binary protocol parsing is more efficient than string-based JSON parsing
- For internal service communication with high throughput requirements, Protobuf significantly reduces CPU overhead
- REST with JSON remains acceptable for external APIs where human readability matters
Expected answer points:
- A US-EAST-1 availability zone power failure cascaded for over 7 hours
- Affected thousands of services including Amazon, Netflix, and Disney+
- Fallacy 1 (Network is Reliable): single-AZ design assumed failure wouldn't spread properly
- Fallacy 2 (Latency is Zero): synchronous cross-service dependencies meant one failing component brought down entire systems
- Fallacy 6 (One Administrator): change management process for restoring power took hours of coordination
- Key lessons: assume any single component can fail catastrophically, design for partial availability, have runbooks for multi-step recovery
Expected answer points:
- SLOs give teams shared definitions of reliability (e.g., 99.95% availability, p99 latency under 200ms)
- Cross-team discussions become easier when everyone agrees on what "reliable" means
- SLOs become the common language for reliability decisions across team boundaries
- Downstream teams depend on upstream services; SLAs formalize those dependencies with explicit commitments
- Teams can make trade-off decisions (speed vs reliability) based on shared SLO understanding
Expected answer points:
- Cross-region database replication exposes bandwidth limitations dramatically
- Replicating 500GB at 100Mbps takes over 11 hours, making naive replication strategies impractical
- Solutions include: compressing replication streams, batching updates, using change data capture (CDC)
- Async replication becomes necessary when bandwidth cannot keep up with write rates
- Geo-distribution architectures must account for bandwidth costs and latency between regions
- The fallacy leads teams to design replication strategies that assume no bandwidth constraints
Further Reading
Explore these related topics to deepen your understanding of distributed systems design and the fallacies that govern them.
Fundamental Theorems:
- CAP Theorem - Understanding consistency and availability trade-offs during network partitions
- PACELC Theorem - The latency versus consistency tradeoff that persists even when networks are reliable
Resilience Patterns:
- Resilience Patterns - Circuit breakers, bulkheads, retry strategies, and fallback behavior
- Circuit Breaker Pattern - Implementing fault isolation to prevent cascade failures
- Availability Patterns - Designing systems that remain operational during partial failures
Operational Practices:
- Chaos Engineering - Injecting controlled failures to validate system resilience
- Service Registry - Dynamic service discovery for handling topology changes
- DNS Service Discovery - Finding services dynamically as topology changes
Security:
- mTLS - Mutual TLS for authenticating service-to-service communication
Team Coordination:
- Service Boundaries - Structuring teams and services to minimize coordination overhead
Observability:
- Distributed Tracing - Tracking requests across service boundaries to diagnose latency issues
For deeper exploration of failure handling and real-world case studies, see Incident Response.
Conclusion
The eight fallacies describe assumptions that seem reasonable but are always wrong in production.
- Assume the network is unreliable, latency is non-zero, and bandwidth is finite.
- Treat internal traffic as untrusted—security boundaries exist at every network hop.
- Design for topology changes. Use dynamic service discovery.
- Account for transport cost (serialization) in performance budgets.
Copy/Paste Checklist
- [ ] Every network call has timeout and retry logic
- [ ] Circuit breakers protect against cascade failures
- [ ] Fallback behavior defined for when services are unavailable
- [ ] Authentication required for all service-to-service calls
- [ ] Dynamic service discovery instead of static configuration
- [ ] Serialization cost included in latency budgets
- [ ] Network topology changes handled gracefully
- [ ] Monitoring for latency, error rates, and bandwidth utilization
For deeper exploration of failure handling, see Chaos Engineering and Circuit Breaker Pattern.
Category
Related Posts
Distributed Systems Primer: Key Concepts for Modern Architecture
A practical introduction to distributed systems fundamentals. Learn about failure modes, replication strategies, consensus algorithms, and the core challenges of building distributed software.
Graceful Degradation: Systems That Bend Instead Break
Design systems that maintain core functionality when components fail through fallback strategies, degradation modes, and progressive service levels.
Microservices vs Monolith: Choosing the Right Architecture
Understand the fundamental differences between monolithic and microservices architectures, their trade-offs, and how to decide which approach fits your project.