The Eight Fallacies of Distributed Computing

Explore the classic assumptions developers make about networked systems that lead to failures. Learn how to avoid these pitfalls in distributed architecture.

published: March 24, 2026 reading time: 32 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

Peter Deutsch nailed it with eight assumptions about distributed systems that sound obvious until your system hits production and they're all wrong. This post works through each fallacy with real incidents to show why: networks fail constantly, latency compounds across service calls, internal traffic shouldn't be trusted, and topology changes are constant rather than exceptional. Each fallacy comes with failure stories from AWS, Cloudflare, and GitHub alongside the patterns that actually mitigate them—circuit breakers, parallel fan-out, and treating every service call like it might time out.

Introduction

Fallacy 1: The Network is Reliable

The first and most dangerous fallacy. Production networks fail constantly: cables get cut, switches reboot, routers misconfigure. In any large enough system, you should plan for network failures daily.

graph TD
    A[Client Request] --> B[Network]
    B -->|Packet Loss| C[Request Timeout]
    B -->|Latency Spike| D[Request Slow]
    B -->|Partition| E[Request Failed]
    C --> F[Retry Logic]
    D --> F
    E --> G[Fallback Behavior]
    F --> G

The fix: implement retries with exponential backoff, circuit breakers, and fallback behavior. When designing a distributed system, assume the network will fail and design accordingly. See the CAP theorem for what happens during network partitions.

Fallacy 2: Latency is Zero

Light takes about 67ms to cross the US at the speed of fiber. A round trip between New York and Tokyo runs 100-150ms. Developers who treat network calls like function calls soon discover their “simple” distributed system responds in seconds, not milliseconds.

// Naive approach: multiple sequential network calls
async function getUserDashboard(userId) {
  const user = await fetchUser(userId); // 50ms
  const posts = await fetchPosts(userId); // 80ms
  const friends = await fetchFriends(userId); // 60ms
  const notifications = await fetchNotifs(userId); // 40ms
  // Total: 230ms sequential

  // Better: parallel calls
  const [user, posts, friends, notifs] = await Promise.all([
    fetchUser(userId),
    fetchPosts(userId),
    fetchFriends(userId),
    fetchNotifications(userId),
  ]);
  // Total: ~80ms (time of slowest call)
}

Even without partitions, latency and consistency trade off against each other. See the PACELC theorem. Synchronous replication for strong consistency adds latency. Async replication is faster but introduces potential inconsistency.

Fallacy 3: Bandwidth is Infinite

Early network proponents assumed bandwidth was effectively unlimited. You hit the wall fast when replicating large datasets or streaming video across data centers.

Real bandwidth limits hit in ways that surprise teams:

Replicating a 500GB database across regions at 100Mbps takes over 11 hours
Streaming 4K video to 10,000 concurrent viewers needs ~300Gbps
A single gigabit network link maxes out at about 125MB/s of actual throughput

The geo-distribution post covers how to architect when data must span regions—and the latency and bandwidth costs that come with it.

Fallacy 4: The Network is Secure

Internal networks get compromised. Services get repurposed. Attackers find ways in. Zero trust architecture exists because assuming internal networks are safe has killed systems. Every service should authenticate every request, even from within the same data center.

The mTLS post covers mutual TLS for service-to-service authentication. The secrets management post discusses how to handle credentials in a distributed environment.

Fallacy 5: Topology Doesn’t Change

Static network diagrams go stale the moment you draw them. Services get added, removed, migrated, and scaled. Containers move between hosts. VMs get provisioned and deprovisioned. Your “stable” service discovery configuration becomes stale within weeks.

graph LR
    A[Service A] -->|Initially| B[Service B]
    A -->|After migration| C[Service B in DC2]
    A -->|With service mesh| D[Service B via mesh]

Dynamic service discovery (covered in DNS Service Discovery and Service Registry) handles topology changes. But you still need to design for instances appearing and disappearing without warning.

Fallacy 6: There is One Administrator

In a distributed system, multiple teams manage different parts of the infrastructure. The network team controls routers. The database team manages replicas. The platform team handles Kubernetes. The security team enforces firewall rules. Changes made by any one can break things for all.

Coordination overhead grows fast with system size. A simple change like adding a new firewall port requires tickets, approvals, and maintenance windows. Distributed systems require strong processes alongside strong architecture.

The service boundaries post discusses how to structure teams and services to minimize coordination overhead while maintaining proper separation.

DevOps and SRE Organizational Patterns

Fallacy 6 manifests most painfully in multi-team environments. When hundreds of services are managed by dozens of teams, coordination becomes the bottleneck. These patterns help manage the complexity.

Team topologies for distributed systems

The Team Topologies book from SKF and Red Hat identifies four fundamental team patterns that map well to distributed systems:

Team Type	Responsibility	Interaction Pattern
Stream-aligned	Deliver user-facing features	Self-sufficient, flow work
Platform	Provide internal platforms/tools	Enable stream teams
Enabling	Help teams overcome obstacles	Temporary, embed with teams
Complicated Subsystem	Own complex domain expertise	Consulted by stream teams

In distributed systems, platform teams become critical. They own the shared infrastructure that stream-aligned teams consume: service mesh, observability stack, database platforms, networking policies.

SRE practices that counteract Fallacy 6

SRE (Site Reliability Engineering) was invented at Google to bridge the gap between engineering and operations in large-scale distributed systems:

Error Budgets: Rather than mandating that every change goes through a lengthy approval process, error budgets allow teams to move fast as long as the system remains reliable. If your SLA is 99.9% uptime, you have 0.1% downtime budget to spend on risk. When the budget is full, slow down. When it is healthy, move fast.

Toil Reduction: SRE defines toil as manual, repetitive, automatable work. Teams drowning in toil cannot move fast. SRE practices push organizations to automate operational work so engineers can focus on improving the system rather than running it.

SLOs and SLIs: Service Level Objectives give teams shared definitions of reliability. When everyone agrees that “reliable” means 99.95% availability with p99 latency under 200ms, cross-team discussions become easier. SLOs become the common language for reliability decisions.

Blameless Postmortems: Failures in complex distributed systems almost never trace back to “one person made a mistake.” Multiple contributing factors across team boundaries are almost always involved. Blameless postmortems focus on systemic causes rather than individual fault, which means people actually report near-misses instead of hiding them.

# Example: SRE error budget policy
# .sre/config.yaml
error_budget_policy:
  availability_target: 99.95 # 4 hours downtime per year
  slo_window: 30d
  warning_threshold: 50% # Remaining budget
  critical_threshold: 25% # Remaining budget

  # Actions tied to budget consumption
  when_below_50%:
    - Increase testing rigor
    - Require additional review for risky changes
  when_below_25%:
    - Freeze non-critical deployments
    - Root cause analysis required
    - Executive notification
  when_depleted:
    - Emergency retrospective
    - Post-mortem mandatory
    - Deployment freeze until resolved

Cross-team coordination mechanisms

As distributed systems grow, you need formal mechanisms to coordinate changes across team boundaries:

Architecture Decision Records (ADRs): When a decision affects multiple teams, an ADR documents the context, decision, and consequences. This creates a searchable record of why the system is built the way it is. New team members can read ADRs to understand architectural decisions made before they joined.

Service Level Agreements (SLAs) Between Teams: Downstream teams depend on upstream services. Formalizing those dependencies as SLAs—with explicit latency, availability, and error rate commitments—makes it clear what each team owes to others.

Design Reviews for Cross-Cutting Changes: Changes to shared infrastructure, API contracts, or data schemas should go through a design review process that includes affected teams. This catches breaking changes before they happen.

Dark Launches and Feature Flags: Rather than coordinating deployments across teams, feature flags let teams deploy code independently. A dark launch exposes new functionality to a small subset of traffic while the team monitors for issues. This decouples deployment from release.

# Feature flag example for cross-team coordination
# When Team A wants to change an API that Team B depends on,
# Team A can:
# 1. Deploy the change behind a feature flag
# 2. Test with their own traffic
# 3. Gradually increase traffic while monitoring
# 4. Roll back instantly if issues arise
# No coordination with Team B required until the flag is fully enabled

def gradual_rollout(flag_name: str, initial_percentage: float = 1.0,
                    increment: float = 10.0, interval_seconds: float = 300):
    """
    Gradually increase traffic to a new feature.
    Returns when flag reaches 100% or is disabled.
    """
    current_percentage = initial_percentage
    while current_percentage < 100.0:
        if is_flag_disabled(flag_name):
            print(f"Flag {flag_name} disabled - rolling back")
            return False

        metrics = get_flag_metrics(flag_name)
        if metrics.error_rate > 0.01:  # 1% error threshold
            print(f"Error rate {metrics.error_rate} exceeds threshold - rolling back")
            disable_flag(flag_name)
            return False

        current_percentage = min(current_percentage + increment, 100.0)
        set_flag_percentage(flag_name, current_percentage)
        print(f"Flag {flag_name} at {current_percentage}%")
        sleep(interval_seconds)

    return True

Practical guidelines for reducing coordination overhead

Embrace self-service: Build platforms and tooling that let teams operate independently. If deploying a new service requires filing a ticket with the platform team, you have a bottleneck.
Make defaults safe: Design systems so the safe choice is the default. If encryption should be on by default, make it impossible to turn off. This reduces the number of decisions teams must make and mistakes they can make.
Invest in observability: When something breaks in a distributed system, the first question is always “which service?” Good observability—logs, metrics, traces—lets teams diagnose issues without coordinating with other teams.
Use service meshes for network policies: Rather than coordinating firewall rules across teams, a service mesh like Istio or Linkerd lets you express network policy as code. Teams define what their service needs to access; the mesh enforces it.
Create shared libraries for common patterns: When multiple teams implement the same pattern (retries, timeouts, circuit breakers), a shared library means they all benefit from improvements and bug fixes without coordination.

Fallacy 7: Transport Cost is Zero

This extends “latency is zero” but focuses on serialization and deserialization. Converting an object to JSON, sending it across the network, and converting it back takes CPU time and memory. For high-throughput systems, serialization alone can dominate processing time.

// Efficient binary protocol vs JSON
const userJson = JSON.stringify(user); // CPU: ~500μs for complex object
const userProtobuf = protobuf.encode(user); // CPU: ~50μs, 60% smaller

// Impact at scale: 10,000 requests/second
// JSON: 5 seconds CPU just for serialization
// Protobuf: 0.5 seconds CPU

gRPC with Protocol Buffers outperforms REST with JSON for internal service communication. The API contracts post covers designing service interfaces that minimize overhead.

Fallacy 8: The Network is Homogeneous

Assuming all network paths are equal leads to subtle performance problems. A call from service A to service B in the same data center might take 1ms. The same call routed through a load balancer might take 5ms. A call that crosses a WAN might take 50ms. A call that goes through a poorly configured firewall might time out.

Network paths are never uniform. Quality of service policies, routing anomalies, and congestion all affect performance unpredictably. Load balancing and load balancing algorithms posts cover how to route traffic intelligently despite network heterogeneity.

Core Concepts

These fallacies compound in production. A team assumes latency is negligible, so they make synchronous calls between services. Then a network partition hits. Because latency is actually non-zero, timeouts trigger slowly. Retries overwhelm the system. The cascade spreads.

The availability patterns and resilience patterns posts cover how to build systems that survive when assumptions fail—because they always do.

Real incident case studies

Understanding fallacies through real-world failures helps cement why these assumptions are dangerous.

AWS US-EAST-1 Outage (December 2021) - Fallacy 1: Network is Reliable

What happened: A US-EAST-1 availability zone experienced a power failure that cascaded. The outage lasted over 7 hours and affected thousands of services including major platforms like Amazon, Netflix, and Disney+.

Root fallacy violations:

Network is Reliable: The single-AZ design assumed failure wouldn’t spread across availability zones properly
Latency is Zero: Synchronous cross-service dependencies meant one failing component brought down entire systems
One Administrator: The change management process for restoring power took hours of coordination

Key lessons:

Assume any single component can fail catastrophically
Design for partial availability - degrade gracefully
Have runbooks for multi-step recovery procedures

Cloudflare DNS Outage (July 2019) - Fallacy 1 & 4: Network is Reliable, Network is Secure

What happened: A Layer 3 network routing misconfiguration caused Cloudflare’s DNS service to drop approximately 15% of global traffic. Over 400 million websites and services were unreachable.

Root fallacy violations:

Network is Reliable: A single routing configuration error propagated globally
Network is Homogeneous: The assumption that all network paths behave consistently proved false

Key lessons:

BGP routing errors can propagate globally in minutes
Have multiple DNS providers for critical services
Monitor for unexpected routing changes

GitHub Outage (February 2018) - Fallacy 2: Latency is Zero

What happened: GitHub experienced 24 hours of degraded performance after a routine maintenance operation caused excessive load on their primary MySQL databases. The cascading failure affected over 35 million developers.

Root fallacy violations:

Latency is Zero: Maintenance operations weren’t scheduled with latency impact analysis
Transport Cost is Zero: The serialization of large MySQL result sets caused memory pressure

Key lessons:

Always analyze latency impact of maintenance operations
Have rollback procedures that account for replication lag
Monitor database connection pool exhaustion

The compounding effect: How fallacies cascade

flowchart TD
    A[Assume latency is zero] --> B[Make synchronous calls between services]
    B --> C[Network partition occurs]
    C --> D[Timeouts trigger slowly due to no timeout budgets]
    D --> E[Retries overwhelm healthy services]
    E --> F[Cascade failure spreads across system]

    G[Assume network is reliable] --> H[No circuit breakers configured]
    H --> C

    I[Assume topology doesn't change] --> J[Hardcoded IP addresses used]
    J --> K[Service migration causes connection failures]
    K --> C

The compounding pattern: Violating one fallacy often amplifies violations of others. A network partition (fallacy 1) becomes catastrophic when combined with no timeout budgets (fallacy 2) and no circuit breakers (fallacy 1 again).

Designing for Reality

Use these fallacies as a checklist when designing distributed systems:

graph TD
    A[Designing Distributed System] --> B{Ask These Questions}
    B --> C[What happens when network calls fail?]
    B --> D[What is actual latency between services?]
    B --> E[Are we exceeding bandwidth budgets?]
    B --> F[Do we authenticate internal calls?]
    B --> G[How does topology changes affect us?]
    B --> H[Who coordinates cross-team changes?]
    B --> I[Is serialization cost included in latency budgets?]
    B --> J[Are all network paths equally reliable?]

If you cannot answer these questions confidently, your system will fail in production. Not might—will.

Production Failure Scenarios

Failure Scenario	Root Fallacy	Mitigation
Cascading timeout during network blip	Latency is zero, Network is reliable	Circuit breakers, bulkheads, timeouts with budgets
Database replication falls behind	Bandwidth is infinite	Monitor replication lag, compress streams, batch
Service discovery returns stale endpoints	Topology doesn’t change	Use dynamic discovery with short TTLs
Lateral movement after internal breach	Network is secure	Zero trust, mTLS, every call authenticated
Single team bottleneck on deployment	One administrator	Decentralize ownership, self-service platforms

Common Pitfalls / Anti-Patterns

Pitfall 1: Treating network calls like function calls

Problem: Calling a remote service as if it were a local function, without timeouts, retries, or error handling.

Solution: Every network call must have: timeout, retry logic (with backoff), fallback behavior, and proper error handling.

Pitfall 2: Assuming homogeneous network conditions

Problem: Treating all environments (dev, staging, production) as having similar network characteristics.

Solution: Test against production-like network conditions. Use chaos engineering to inject latency and partition services.

Pitfall 3: Ignoring serialization overhead

Problem: Choosing JSON because it is readable, ignoring CPU and bandwidth costs at scale.

Solution: Profile serialization cost. For high-throughput internal services, consider Protocol Buffers, Avro, or MessagePack.

Self-Assessment Quiz: Is Your Architecture Violating These Fallacies?

This quiz goes through all eight fallacies with four yes/no questions per fallacy. Answer based on what your system actually does, not what you planned to build eventually. If you answer “I don’t know,” count that as a gap.

Set aside 10-15 minutes. Keep your architecture diagrams and runbooks handy so you can check real behavior instead of intended behavior. The gap this quiz targets is the difference between “we built it to handle that” and “we tested that it handles that.”

The eight sections cover network reliability, latency measurement, bandwidth planning, internal authentication, service discovery, cross-team ownership, serialization overhead, and network path diversity.

When you finish, count your unchecked boxes and use the scoring guide to see where you stand. Five or more unchecked items means you have assumptions that have not been tested in production. That is where incidents come from. The goal is not a perfect score — it is knowing which gaps to fix first.

Fallacy 1: Network is Reliable

Do you have retry logic with exponential backoff for all network calls?
Are circuit breakers configured for all downstream services?
Do you have fallback behavior when services are unavailable?
Have you tested what happens when a network partition occurs?

Checklist: Fallacy 2 - Latency is Zero

Have you measured actual latency between all service pairs?
Are timeouts set based on measured P99 latency, not arbitrary values?
Do you use parallel calls where possible instead of sequential chains?
Is latency included in your SLA calculations?

Checklist: Fallacy 3 - Bandwidth is Infinite

Have you calculated bandwidth requirements for data replication?
Do you compress data streams between datacenters?
Are large data transfers batched or streamed rather than bulk-loaded?
Do you monitor bandwidth utilization and alert on limits?

Fallacy 4: Network is Secure

Do you use mTLS for all internal service-to-service communication?
Is every API endpoint authenticated, even internal ones?
Do you rotate secrets and certificates automatically?
Do you log and audit all cross-service communication?

Checklist: Fallacy 5 - Topology Doesn’t Change

Do you use dynamic service discovery instead of static IPs?
Are services designed to handle instances appearing/disappearing?
Do you use Kubernetes or similar for dynamic load balancing?
Are DNS TTLs short enough for rapid changes?

Checklist: Fallacy 6 - There is One Administrator

Do you have clear ownership boundaries for each service?
Are cross-team changes coordinated through ADRs (Architecture Decision Records)?
Do you have self-service deployment pipelines?
Is there a clear escalation path for incidents?

Checklist: Fallacy 7 - Transport Cost is Zero

Have you profiled serialization costs at your expected throughput?
Are you using efficient serialization (Protobuf, Avro) for internal communication?
Do you stream large datasets rather than loading them all at once?
Is serialization cost included in latency budgets?

Fallacy 8: Network is Homogeneous

Do you use latency-aware routing (not just round-robin)?
Have you tested behavior across different network paths?
Do you monitor per-path latency, not just aggregate?
Do you have quality-of-service policies for different traffic types?

Scoring

Count your unchecked boxes:

0-4 unchecked: Your architecture is resilient - well done!
5-10 unchecked: Some vulnerabilities exist - review weak areas
11+ unchecked: High risk - prioritize fixing these issues before production

Quick Recap Checklist

Use this checklist to verify your distributed system design addresses each fallacy before production deployment:

Network Reliability

All network calls have retry logic with exponential backoff
Circuit breakers are configured for all downstream services
Fallback behavior is defined for when services are unavailable
Timeouts are set based on measured P99 latency, not arbitrary values

Latency Awareness

Actual latency measured between all service pairs across regions
Parallel fan-out used for independent service calls
Latency budgets include serialization and deserialization costs
Async I/O preferred over synchronous blocking calls

Bandwidth Planning

Bandwidth requirements calculated for all data replication streams
Compression enabled for cross-datacenter traffic
Large transfers batched or streamed, not bulk-loaded
Monitoring in place for bandwidth utilization alerts

Security Posture

mTLS enabled for all internal service-to-service communication
Every API endpoint authenticated, including internal ones
Secrets and certificates rotated automatically
Zero trust assumed for all network hops

Topology Dynamics

Dynamic service discovery used instead of static IPs
Services handle instances appearing/disappearing gracefully
DNS TTLs configured appropriately for rapid changes
Short-lived connections rebuilt cleanly after service migration

Team Coordination

Clear ownership boundaries defined for each service
Architecture Decision Records (ADRs) maintained for cross-team decisions
Self-service deployment pipelines available to all teams
SLOs documented and shared across team boundaries

Transport Efficiency

Serialization costs profiled at expected throughput
Binary protocols (Protobuf, Avro) used for high-throughput internal communication
Large datasets streamed rather than loaded entirely into memory
Transport cost included in latency budgets

Network Heterogeneity

Latency-aware routing configured (not just round-robin)
Per-path latency monitored separately from aggregate metrics
Quality-of-service policies defined for different traffic types
Firewall and load balancer latencies factored into timeout calculations

When to Use / When Not to Use

Scenario	Recommendation
Greenfield distributed system	Design explicitly for all eight fallacies
Adding distribution to monolith	Start with async communication
Multi-region deployment	Assume all fallacies apply doubled
Internal microservices	Treat internal traffic as untrusted
High-throughput systems	Profile serialization cost early

Fallacy trade-off decision matrix

When designing a distributed system, each fallacy creates a set of competing concerns. This matrix maps each fallacy against the engineering trade-offs it forces you to make.

Fallacy	You assume…	Reality forces…	Typical cost if violated	Mitigating pattern
1. Network is reliable	Calls always succeed	Partitions happen, routes flap, links drop	Data loss, cascade failures	Circuit breakers, retries with backoff
2. Latency is zero	Calls are free	Cross-DC latency dominates response time	Broken SLAs, timeouts, user-facing lag	Async I/O, parallel fan-out, latency budgets
3. Bandwidth is infinite	Push all data freely	Replication streams saturate links	Replication lag, missed SLAs	Compression, batching, CDC, read replicas
4. Network is secure	Internal == trusted	Lateral movement, compromised services	Data breach, internal exfiltration	mTLS, zero trust, every-call auth
5. Topology doesn’t change	Endpoints are stable	Instances migrate, pods reschedule, IPs change	Connection failures, stale lookups	Dynamic discovery, short DNS TTLs
6. One administrator	Single team owns all	Cross-team change coordination takes weeks	Deployment bottlenecks, incidents	Self-service platforms, ADRs, feature flags
7. Transport cost is zero	Serialize freely	JSON encoding burns CPU at scale	CPU bottlenecks, throughput cap	Protobuf, Avro, MessagePack
8. Network is homogeneous	All paths equal	Hotspots, QoS throttling, firewall timeouts	Unpredictable latency, timeouts	Latency-aware routing, per-path monitoring

How to use this matrix: For each row, check whether your current architecture accounts for the “Reality forces” column. If your design ignores a fallacy, the “Typical cost if violated” column tells you what failure mode to expect. The “Mitigating pattern” column points to the specific pattern that addresses each trade-off on this site.

Interview Questions

These questions test understanding of the eight fallacies and their practical implications in distributed system design.

1. Why is assuming the network is reliable considered the most dangerous fallacy in distributed computing?

Expected answer points:

Networks fail constantly through cable cuts, switch reboots, and router misconfigurations
In large enough systems, network failures should be expected daily, not treated as exceptional events
Unreliable networks cause cascading failures when services make synchronous calls without timeouts
The assumption leads to designs without retry logic, circuit breakers, or fallback behavior
Production networks experience packet loss, latency spikes, and partitions that break naive assumptions

2. Explain why latency being non-zero fundamentally changes distributed system design compared to local procedure calls.

Expected answer points:

Local function calls complete in nanoseconds; network calls take milliseconds to hundreds of milliseconds
Sequential network calls compound latency (multiple 50ms calls = 200ms+ total)
Developers treating network calls like function calls create systems that respond in seconds, not milliseconds
Non-zero latency means timeouts must be explicitly configured with appropriate budgets
Parallel execution of independent calls becomes essential for performance
The PACELC theorem shows latency and consistency trade off against each other

3. What practical problems arise from assuming bandwidth is infinite in distributed systems?

Expected answer points:

Cross-region database replication of 500GB at 100Mbps takes over 11 hours
4K video streaming to 10,000 concurrent viewers requires approximately 300Gbps
A gigabit network link maxes out at about 125MB/s of actual throughput
Teams encounter unexpected throttling and performance degradation during data-intensive operations
Compression, batching, and streaming become necessities rather than optimizations

4. How does zero trust architecture address the fallacy that internal networks are secure?

Expected answer points:

Internal networks get compromised through lateral movement, insider threats, and misconfiguration
Services get repurposed and expose unexpected attack surfaces
Zero trust requires every service to authenticate every request, even from within the same data center
mTLS provides mutual authentication between services
Every network hop is treated as potentially hostile, regardless of physical location

5. Why do static network diagrams become obsolete quickly, and what replaces them in modern architectures?

Expected answer points:

Services get added, removed, migrated, and scaled constantly in production environments
Containers move between hosts and VMs get provisioned/deprovisioned dynamically
Static IPs become stale within weeks as infrastructure changes
Dynamic service discovery with short TTLs handles topology changes gracefully
DNS Service Discovery and Service Registries provide real-time endpoint tracking
Service meshes enforce network policies as code rather than relying on static firewall rules

6. Describe how Fallacy 6 (there is one administrator) manifests in large-scale distributed systems and organizational patterns that mitigate it.

Expected answer points:

Multiple teams manage different infrastructure parts: network team, database team, platform team, security team
Adding a firewall port requires tickets, approvals, and maintenance windows across teams
Coordination overhead becomes the bottleneck as system size grows
Team Topologies patterns: stream-aligned, platform, enabling, and complicated subsystem teams
SRE practices with error budgets, SLOs, and blameless postmortems help manage complexity
ADRs (Architecture Decision Records) create searchable records of cross-team decisions

7. What is transport cost in distributed systems, and why does it matter more at scale?

Expected answer points:

Transport cost includes serialization (object to JSON), transmission, and deserialization on the other side
JSON.stringify on a complex object takes approximately 500 microseconds; Protobuf encoding takes approximately 50 microseconds
At 10,000 requests/second: JSON uses 5 seconds of CPU for serialization versus 0.5 seconds for Protobuf
High-throughput systems find serialization overhead dominates processing time
Efficient binary protocols like gRPC with Protocol Buffers significantly reduce transport overhead

8. Explain how network path heterogeneity causes subtle performance problems that round-robin load balancing cannot solve.

Expected answer points:

Intra-datacenter calls might take 1ms, calls through load balancers 5ms, WAN calls 50ms
Quality of service policies, routing anomalies, and congestion create unpredictable performance
Round-robin ignores latency differences and sends equal traffic to fast and slow paths
Latency-aware routing must consider actual path performance, not just node selection
Poorly configured firewalls may cause calls to time out entirely on specific paths
Monitoring per-path latency rather than aggregate reveals hidden performance issues

9. How do the eight fallacies compound in production, creating cascading failures?

Expected answer points:

Assuming latency is zero leads to synchronous calls between services
Network partitions hit, and because latency is non-zero, timeouts trigger slowly
Slow-triggering timeouts cause retries to accumulate and overwhelm healthy services
Assuming network is reliable with no circuit breakers allows cascade to spread
Hardcoded IPs after service migration cause connection failures during topology changes
The compounding pattern: violating one fallacy amplifies violations of others

10. What is chaos engineering, and how does it help teams discover which fallacies their systems violate?

Expected answer points:

Chaos engineering injects controlled failures into production to test system behavior
Teams inject latency, partition services, and simulate failures to validate assumptions
Netflix's Chaos Monkey randomly terminates instances to ensure resilience
Discovers hidden vulnerabilities before they cause actual outages
Tests whether circuit breakers, retries, and fallback behavior actually work under failure conditions

11. Compare circuit breakers versus bulkheads as patterns for preventing cascade failures. When would you use each?

Expected answer points:

Circuit breakers trip when a downstream service fails repeatedly, stopping further requests
Bulkheads partition resources (connection pools, threads) so failures in one service don't exhaust resources for others
Circuit breakers: use when services call other services that might be slow or unavailable during peak load
Bulkheads: use when you need to isolate critical services from noisy neighbors consuming all resources
Both patterns address network unreliability by limiting blast radius of failures

12. What is the relationship between the CAP theorem and Fallacy 1 (the network is reliable)?

Expected answer points:

The CAP theorem states that during a network partition (P), you must choose between consistency (C) and availability (A)
Assuming the network is reliable ignores the possibility of partitions entirely
In reality, partitions happen; designs must choose between serving stale data (AP) or refusing requests (CP)
When network is reliable and no partition occurs, you can have both consistency and availability (CA)
The fallacy leads to designs that assume CA without handling the P case when partitions occur

13. How do feature flags help reduce coordination overhead across teams managing distributed systems?

Expected answer points:

Feature flags decouple deployment from release, allowing teams to deploy independently
Team A can change an API that Team B depends on by deploying behind a flag
Testing happens with own traffic before exposing to dependent teams
Gradual rollout with monitoring allows rollback without cross-team coordination
Dark launches expose new functionality to small traffic subsets while team monitors for issues
No coordination with Team B required until the flag is fully enabled

14. Describe the error budget concept from SRE and how it changes team behavior regarding deployments and risk.

Expected answer points:

Error budget = allowed downtime based on SLA (e.g., 99.95% availability = 4 hours downtime per year)
Teams can move fast as long as system reliability remains within budget
When budget is full (reliable), teams move fast; when budget is depleted, they slow down
Removes need for lengthy approval processes when system is healthy
At 50% budget remaining: increase testing rigor, require additional review
At 25% budget remaining: freeze non-critical deployments, require root cause analysis

15. Why does the PACELC theorem matter when designing for latency in distributed systems?

Expected answer points:

PACELC extends CAP: if there is a partition (P), you choose between latency (L) and consistency (C)
When no partition occurs, you still face a latency vs consistency tradeoff
Synchronous replication for strong consistency adds latency due to coordination overhead
Asynchronous replication is faster but introduces potential inconsistency
Designing for low latency often means accepting eventual consistency
The fallacy "latency is zero" ignores this fundamental tradeoff that affects real system performance

16. What organizational patterns help reduce the multi-team coordination overhead described in Fallacy 6?

Expected answer points:

Self-service platforms: teams can deploy new services without filing tickets with platform team
Safe defaults: encryption on by default, impossible to turn off, reduces decision mistakes
Observability investment: logs, metrics, traces let teams diagnose issues without coordinating with other teams
Service meshes: network policy as code instead of coordinating firewall rules
Shared libraries for common patterns (retries, timeouts, circuit breakers): all teams benefit from improvements without individual coordination

17. How does gRPC with Protocol Buffers address the transport cost fallacy more effectively than REST with JSON?

Expected answer points:

Protocol Buffers produce messages 60% smaller than equivalent JSON
Encoding time is approximately 10x faster (50 microseconds vs 500 microseconds for complex objects)
gRPC uses HTTP/2 for multiplexing multiple requests over a single connection
Binary protocol parsing is more efficient than string-based JSON parsing
For internal service communication with high throughput requirements, Protobuf significantly reduces CPU overhead
REST with JSON remains acceptable for external APIs where human readability matters

18. What happened during the AWS US-EAST-1 outage of December 2021, and which fallacies did it demonstrate?

Expected answer points:

A US-EAST-1 availability zone power failure cascaded for over 7 hours
Affected thousands of services including Amazon, Netflix, and Disney+
Fallacy 1 (Network is Reliable): single-AZ design assumed failure wouldn't spread properly
Fallacy 2 (Latency is Zero): synchronous cross-service dependencies meant one failing component brought down entire systems
Fallacy 6 (One Administrator): change management process for restoring power took hours of coordination
Key lessons: assume any single component can fail catastrophically, design for partial availability, have runbooks for multi-step recovery

19. Explain how Service Level Objectives (SLOs) help different teams agree on reliability targets in distributed systems.

Expected answer points:

SLOs give teams shared definitions of reliability (e.g., 99.95% availability, p99 latency under 200ms)
Cross-team discussions become easier when everyone agrees on what "reliable" means
SLOs become the common language for reliability decisions across team boundaries
Downstream teams depend on upstream services; SLAs formalize those dependencies with explicit commitments
Teams can make trade-off decisions (speed vs reliability) based on shared SLO understanding

20. What is the relationship between Fallacy 3 (bandwidth is infinite) and database replication strategies across geographic regions?

Expected answer points:

Cross-region database replication exposes bandwidth limitations dramatically
Replicating 500GB at 100Mbps takes over 11 hours, making naive replication strategies impractical
Solutions include: compressing replication streams, batching updates, using change data capture (CDC)
Async replication becomes necessary when bandwidth cannot keep up with write rates
Geo-distribution architectures must account for bandwidth costs and latency between regions
The fallacy leads teams to design replication strategies that assume no bandwidth constraints

Conclusion

The eight fallacies describe assumptions that seem reasonable but are always wrong in production.

Assume the network is unreliable, latency is non-zero, and bandwidth is finite.
Treat internal traffic as untrusted—security boundaries exist at every network hop.
Design for topology changes. Use dynamic service discovery.
Account for transport cost (serialization) in performance budgets.

Copy/Paste Checklist

- [ ] Every network call has timeout and retry logic
- [ ] Circuit breakers protect against cascade failures
- [ ] Fallback behavior defined for when services are unavailable
- [ ] Authentication required for all service-to-service calls
- [ ] Dynamic service discovery instead of static configuration
- [ ] Serialization cost included in latency budgets
- [ ] Network topology changes handled gracefully
- [ ] Monitoring for latency, error rates, and bandwidth utilization

For deeper exploration of failure handling, see Chaos Engineering and Circuit Breaker Pattern.

Introduction

Fallacy 1: The Network is Reliable

Fallacy 2: Latency is Zero

Fallacy 3: Bandwidth is Infinite

Fallacy 4: The Network is Secure

Fallacy 5: Topology Doesn’t Change

Fallacy 6: There is One Administrator

DevOps and SRE Organizational Patterns

Team topologies for distributed systems

SRE practices that counteract Fallacy 6

Cross-team coordination mechanisms

Practical guidelines for reducing coordination overhead

Fallacy 7: Transport Cost is Zero

Fallacy 8: The Network is Homogeneous

Core Concepts

Real incident case studies

AWS US-EAST-1 Outage (December 2021) - Fallacy 1: Network is Reliable

Cloudflare DNS Outage (July 2019) - Fallacy 1 & 4: Network is Reliable, Network is Secure

GitHub Outage (February 2018) - Fallacy 2: Latency is Zero

The compounding effect: How fallacies cascade

Designing for Reality

Production Failure Scenarios

Common Pitfalls / Anti-Patterns

Pitfall 1: Treating network calls like function calls

Pitfall 2: Assuming homogeneous network conditions

Pitfall 3: Ignoring serialization overhead

Self-Assessment Quiz: Is Your Architecture Violating These Fallacies?

Fallacy 1: Network is Reliable

Checklist: Fallacy 2 - Latency is Zero

Checklist: Fallacy 3 - Bandwidth is Infinite

Fallacy 4: Network is Secure

Checklist: Fallacy 5 - Topology Doesn’t Change

Checklist: Fallacy 6 - There is One Administrator

Checklist: Fallacy 7 - Transport Cost is Zero

Fallacy 8: Network is Homogeneous

Scoring

Quick Recap Checklist

Network Reliability

Latency Awareness

Bandwidth Planning

Security Posture

Topology Dynamics

Team Coordination

Transport Efficiency

Network Heterogeneity

When to Use / When Not to Use

Fallacy trade-off decision matrix

Interview Questions

Further Reading

Conclusion

Copy/Paste Checklist

Category

Tags

Related Posts

Distributed Systems Primer: Key Concepts for Modern Architecture

Graceful Degradation: Systems That Bend Instead Break

Microservices vs Monolith: Choosing the Right Architecture