The Eight Fallacies of Distributed Computing

Explore the classic assumptions developers make about networked systems that lead to failures. Learn how to avoid these pitfalls in distributed architecture.

published: reading time: 31 min read author: GeekWorkBench

Introduction

Fallacy 1: The Network is Reliable

The first and most dangerous fallacy. Production networks fail constantly: cables get cut, switches reboot, routers misconfigure. In any large enough system, you should plan for network failures daily.

graph TD
    A[Client Request] --> B[Network]
    B -->|Packet Loss| C[Request Timeout]
    B -->|Latency Spike| D[Request Slow]
    B -->|Partition| E[Request Failed]
    C --> F[Retry Logic]
    D --> F
    E --> G[Fallback Behavior]
    F --> G

The fix: implement retries with exponential backoff, circuit breakers, and fallback behavior. When designing a distributed system, assume the network will fail and design accordingly. See the CAP theorem for what happens during network partitions.


Fallacy 2: Latency is Zero

Light takes about 67ms to cross the US at the speed of fiber. A round trip between New York and Tokyo runs 100-150ms. Developers who treat network calls like function calls soon discover their “simple” distributed system responds in seconds, not milliseconds.

// Naive approach: multiple sequential network calls
async function getUserDashboard(userId) {
  const user = await fetchUser(userId); // 50ms
  const posts = await fetchPosts(userId); // 80ms
  const friends = await fetchFriends(userId); // 60ms
  const notifications = await fetchNotifs(userId); // 40ms
  // Total: 230ms sequential

  // Better: parallel calls
  const [user, posts, friends, notifs] = await Promise.all([
    fetchUser(userId),
    fetchPosts(userId),
    fetchFriends(userId),
    fetchNotifications(userId),
  ]);
  // Total: ~80ms (time of slowest call)
}

Even without partitions, latency and consistency trade off against each other. See the PACELC theorem. Synchronous replication for strong consistency adds latency. Async replication is faster but introduces potential inconsistency.


Fallacy 3: Bandwidth is Infinite

Early network proponents assumed bandwidth was effectively unlimited. You hit the wall fast when replicating large datasets or streaming video across data centers.

Real bandwidth limits hit in ways that surprise teams:

  • Replicating a 500GB database across regions at 100Mbps takes over 11 hours
  • Streaming 4K video to 10,000 concurrent viewers needs ~300Gbps
  • A single gigabit network link maxes out at about 125MB/s of actual throughput

The geo-distribution post covers how to architect when data must span regions—and the latency and bandwidth costs that come with it.


Fallacy 4: The Network is Secure

Internal networks get compromised. Services get repurposed. Attackers find ways in. Zero trust architecture exists because assuming internal networks are safe has killed systems. Every service should authenticate every request, even from within the same data center.

The mTLS post covers mutual TLS for service-to-service authentication. The secrets management post discusses how to handle credentials in a distributed environment.


Fallacy 5: Topology Doesn’t Change

Static network diagrams go stale the moment you draw them. Services get added, removed, migrated, and scaled. Containers move between hosts. VMs get provisioned and deprovisioned. Your “stable” service discovery configuration becomes stale within weeks.

graph LR
    A[Service A] -->|Initially| B[Service B]
    A -->|After migration| C[Service B in DC2]
    A -->|With service mesh| D[Service B via mesh]

Dynamic service discovery (covered in DNS Service Discovery and Service Registry) handles topology changes. But you still need to design for instances appearing and disappearing without warning.


Fallacy 6: There is One Administrator

In a distributed system, multiple teams manage different parts of the infrastructure. The network team controls routers. The database team manages replicas. The platform team handles Kubernetes. The security team enforces firewall rules. Changes made by any one can break things for all.

Coordination overhead grows fast with system size. A simple change like adding a new firewall port requires tickets, approvals, and maintenance windows. Distributed systems require strong processes alongside strong architecture.

The service boundaries post discusses how to structure teams and services to minimize coordination overhead while maintaining proper separation.

DevOps and SRE Organizational Patterns

Fallacy 6 manifests most painfully in multi-team environments. When hundreds of services are managed by dozens of teams, coordination becomes the bottleneck. These patterns help manage the complexity.

Team topologies for distributed systems

The Team Topologies book from SKF and Red Hat identifies four fundamental team patterns that map well to distributed systems:

Team TypeResponsibilityInteraction Pattern
Stream-alignedDeliver user-facing featuresSelf-sufficient, flow work
PlatformProvide internal platforms/toolsEnable stream teams
EnablingHelp teams overcome obstaclesTemporary, embed with teams
Complicated SubsystemOwn complex domain expertiseConsulted by stream teams

In distributed systems, platform teams become critical. They own the shared infrastructure that stream-aligned teams consume: service mesh, observability stack, database platforms, networking policies.

SRE practices that counteract Fallacy 6

SRE (Site Reliability Engineering) was invented at Google to bridge the gap between engineering and operations in large-scale distributed systems:

Error Budgets: Rather than mandating that every change goes through a lengthy approval process, error budgets allow teams to move fast as long as the system remains reliable. If your SLA is 99.9% uptime, you have 0.1% downtime budget to spend on risk. When the budget is full, slow down. When it is healthy, move fast.

Toil Reduction: SRE defines toil as manual, repetitive, automatable work. Teams drowning in toil cannot move fast. SRE practices push organizations to automate operational work so engineers can focus on improving the system rather than running it.

SLOs and SLIs: Service Level Objectives give teams shared definitions of reliability. When everyone agrees that “reliable” means 99.95% availability with p99 latency under 200ms, cross-team discussions become easier. SLOs become the common language for reliability decisions.

Blameless Postmortems: Failures in complex distributed systems almost never trace back to “one person made a mistake.” Multiple contributing factors across team boundaries are almost always involved. Blameless postmortems focus on systemic causes rather than individual fault, which means people actually report near-misses instead of hiding them.

# Example: SRE error budget policy
# .sre/config.yaml
error_budget_policy:
  availability_target: 99.95 # 4 hours downtime per year
  slo_window: 30d
  warning_threshold: 50% # Remaining budget
  critical_threshold: 25% # Remaining budget

  # Actions tied to budget consumption
  when_below_50%:
    - Increase testing rigor
    - Require additional review for risky changes
  when_below_25%:
    - Freeze non-critical deployments
    - Root cause analysis required
    - Executive notification
  when_depleted:
    - Emergency retrospective
    - Post-mortem mandatory
    - Deployment freeze until resolved

Cross-team coordination mechanisms

As distributed systems grow, you need formal mechanisms to coordinate changes across team boundaries:

Architecture Decision Records (ADRs): When a decision affects multiple teams, an ADR documents the context, decision, and consequences. This creates a searchable record of why the system is built the way it is. New team members can read ADRs to understand architectural decisions made before they joined.

Service Level Agreements (SLAs) Between Teams: Downstream teams depend on upstream services. Formalizing those dependencies as SLAs—with explicit latency, availability, and error rate commitments—makes it clear what each team owes to others.

Design Reviews for Cross-Cutting Changes: Changes to shared infrastructure, API contracts, or data schemas should go through a design review process that includes affected teams. This catches breaking changes before they happen.

Dark Launches and Feature Flags: Rather than coordinating deployments across teams, feature flags let teams deploy code independently. A dark launch exposes new functionality to a small subset of traffic while the team monitors for issues. This decouples deployment from release.

# Feature flag example for cross-team coordination
# When Team A wants to change an API that Team B depends on,
# Team A can:
# 1. Deploy the change behind a feature flag
# 2. Test with their own traffic
# 3. Gradually increase traffic while monitoring
# 4. Roll back instantly if issues arise
# No coordination with Team B required until the flag is fully enabled

def gradual_rollout(flag_name: str, initial_percentage: float = 1.0,
                    increment: float = 10.0, interval_seconds: float = 300):
    """
    Gradually increase traffic to a new feature.
    Returns when flag reaches 100% or is disabled.
    """
    current_percentage = initial_percentage
    while current_percentage < 100.0:
        if is_flag_disabled(flag_name):
            print(f"Flag {flag_name} disabled - rolling back")
            return False

        metrics = get_flag_metrics(flag_name)
        if metrics.error_rate > 0.01:  # 1% error threshold
            print(f"Error rate {metrics.error_rate} exceeds threshold - rolling back")
            disable_flag(flag_name)
            return False

        current_percentage = min(current_percentage + increment, 100.0)
        set_flag_percentage(flag_name, current_percentage)
        print(f"Flag {flag_name} at {current_percentage}%")
        sleep(interval_seconds)

    return True

Practical guidelines for reducing coordination overhead

  1. Embrace self-service: Build platforms and tooling that let teams operate independently. If deploying a new service requires filing a ticket with the platform team, you have a bottleneck.

  2. Make defaults safe: Design systems so the safe choice is the default. If encryption should be on by default, make it impossible to turn off. This reduces the number of decisions teams must make and mistakes they can make.

  3. Invest in observability: When something breaks in a distributed system, the first question is always “which service?” Good observability—logs, metrics, traces—lets teams diagnose issues without coordinating with other teams.

  4. Use service meshes for network policies: Rather than coordinating firewall rules across teams, a service mesh like Istio or Linkerd lets you express network policy as code. Teams define what their service needs to access; the mesh enforces it.

  5. Create shared libraries for common patterns: When multiple teams implement the same pattern (retries, timeouts, circuit breakers), a shared library means they all benefit from improvements and bug fixes without coordination.


Fallacy 7: Transport Cost is Zero

This extends “latency is zero” but focuses on serialization and deserialization. Converting an object to JSON, sending it across the network, and converting it back takes CPU time and memory. For high-throughput systems, serialization alone can dominate processing time.

// Efficient binary protocol vs JSON
const userJson = JSON.stringify(user); // CPU: ~500μs for complex object
const userProtobuf = protobuf.encode(user); // CPU: ~50μs, 60% smaller

// Impact at scale: 10,000 requests/second
// JSON: 5 seconds CPU just for serialization
// Protobuf: 0.5 seconds CPU

gRPC with Protocol Buffers outperforms REST with JSON for internal service communication. The API contracts post covers designing service interfaces that minimize overhead.


Fallacy 8: The Network is Homogeneous

Assuming all network paths are equal leads to subtle performance problems. A call from service A to service B in the same data center might take 1ms. The same call routed through a load balancer might take 5ms. A call that crosses a WAN might take 50ms. A call that goes through a poorly configured firewall might time out.

Network paths are never uniform. Quality of service policies, routing anomalies, and congestion all affect performance unpredictably. Load balancing and load balancing algorithms posts cover how to route traffic intelligently despite network heterogeneity.


Core Concepts

These fallacies compound in production. A team assumes latency is negligible, so they make synchronous calls between services. Then a network partition hits. Because latency is actually non-zero, timeouts trigger slowly. Retries overwhelm the system. The cascade spreads.

The availability patterns and resilience patterns posts cover how to build systems that survive when assumptions fail—because they always do.

Real incident case studies

Understanding fallacies through real-world failures helps cement why these assumptions are dangerous.

AWS US-EAST-1 Outage (December 2021) - Fallacy 1: Network is Reliable

What happened: A US-EAST-1 availability zone experienced a power failure that cascaded. The outage lasted over 7 hours and affected thousands of services including major platforms like Amazon, Netflix, and Disney+.

Root fallacy violations:

  • Network is Reliable: The single-AZ design assumed failure wouldn’t spread across availability zones properly
  • Latency is Zero: Synchronous cross-service dependencies meant one failing component brought down entire systems
  • One Administrator: The change management process for restoring power took hours of coordination

Key lessons:

  • Assume any single component can fail catastrophically
  • Design for partial availability - degrade gracefully
  • Have runbooks for multi-step recovery procedures

Cloudflare DNS Outage (July 2019) - Fallacy 1 & 4: Network is Reliable, Network is Secure

What happened: A Layer 3 network routing misconfiguration caused Cloudflare’s DNS service to drop approximately 15% of global traffic. Over 400 million websites and services were unreachable.

Root fallacy violations:

  • Network is Reliable: A single routing configuration error propagated globally
  • Network is Homogeneous: The assumption that all network paths behave consistently proved false

Key lessons:

  • BGP routing errors can propagate globally in minutes
  • Have multiple DNS providers for critical services
  • Monitor for unexpected routing changes

GitHub Outage (February 2018) - Fallacy 2: Latency is Zero

What happened: GitHub experienced 24 hours of degraded performance after a routine maintenance operation caused excessive load on their primary MySQL databases. The cascading failure affected over 35 million developers.

Root fallacy violations:

  • Latency is Zero: Maintenance operations weren’t scheduled with latency impact analysis
  • Transport Cost is Zero: The serialization of large MySQL result sets caused memory pressure

Key lessons:

  • Always analyze latency impact of maintenance operations
  • Have rollback procedures that account for replication lag
  • Monitor database connection pool exhaustion

The compounding effect: How fallacies cascade

flowchart TD
    A[Assume latency is zero] --> B[Make synchronous calls between services]
    B --> C[Network partition occurs]
    C --> D[Timeouts trigger slowly due to no timeout budgets]
    D --> E[Retries overwhelm healthy services]
    E --> F[Cascade failure spreads across system]

    G[Assume network is reliable] --> H[No circuit breakers configured]
    H --> C

    I[Assume topology doesn't change] --> J[Hardcoded IP addresses used]
    J --> K[Service migration causes connection failures]
    K --> C

The compounding pattern: Violating one fallacy often amplifies violations of others. A network partition (fallacy 1) becomes catastrophic when combined with no timeout budgets (fallacy 2) and no circuit breakers (fallacy 1 again).



Designing for Reality

Use these fallacies as a checklist when designing distributed systems:

graph TD
    A[Designing Distributed System] --> B{Ask These Questions}
    B --> C[What happens when network calls fail?]
    B --> D[What is actual latency between services?]
    B --> E[Are we exceeding bandwidth budgets?]
    B --> F[Do we authenticate internal calls?]
    B --> G[How does topology changes affect us?]
    B --> H[Who coordinates cross-team changes?]
    B --> I[Is serialization cost included in latency budgets?]
    B --> J[Are all network paths equally reliable?]

If you cannot answer these questions confidently, your system will fail in production. Not might—will.


Production Failure Scenarios

Failure ScenarioRoot FallacyMitigation
Cascading timeout during network blipLatency is zero, Network is reliableCircuit breakers, bulkheads, timeouts with budgets
Database replication falls behindBandwidth is infiniteMonitor replication lag, compress streams, batch
Service discovery returns stale endpointsTopology doesn’t changeUse dynamic discovery with short TTLs
Lateral movement after internal breachNetwork is secureZero trust, mTLS, every call authenticated
Single team bottleneck on deploymentOne administratorDecentralize ownership, self-service platforms

Common Pitfalls / Anti-Patterns

Pitfall 1: Treating network calls like function calls

Problem: Calling a remote service as if it were a local function, without timeouts, retries, or error handling.

Solution: Every network call must have: timeout, retry logic (with backoff), fallback behavior, and proper error handling.

Pitfall 2: Assuming homogeneous network conditions

Problem: Treating all environments (dev, staging, production) as having similar network characteristics.

Solution: Test against production-like network conditions. Use chaos engineering to inject latency and partition services.

Pitfall 3: Ignoring serialization overhead

Problem: Choosing JSON because it is readable, ignoring CPU and bandwidth costs at scale.

Solution: Profile serialization cost. For high-throughput internal services, consider Protocol Buffers, Avro, or MessagePack.


Self-Assessment Quiz: Is Your Architecture Violating These Fallacies?

Test your system design against each fallacy:

Fallacy 1: Network is Reliable

  • Do you have retry logic with exponential backoff for all network calls?
  • Are circuit breakers configured for all downstream services?
  • Do you have fallback behavior when services are unavailable?
  • Have you tested what happens when a network partition occurs?

Checklist: Fallacy 2 - Latency is Zero

  • Have you measured actual latency between all service pairs?
  • Are timeouts set based on measured P99 latency, not arbitrary values?
  • Do you use parallel calls where possible instead of sequential chains?
  • Is latency included in your SLA calculations?

Checklist: Fallacy 3 - Bandwidth is Infinite

  • Have you calculated bandwidth requirements for data replication?
  • Do you compress data streams between datacenters?
  • Are large data transfers batched or streamed rather than bulk-loaded?
  • Do you monitor bandwidth utilization and alert on limits?

Fallacy 4: Network is Secure

  • Do you use mTLS for all internal service-to-service communication?
  • Is every API endpoint authenticated, even internal ones?
  • Do you rotate secrets and certificates automatically?
  • Do you log and audit all cross-service communication?

Checklist: Fallacy 5 - Topology Doesn’t Change

  • Do you use dynamic service discovery instead of static IPs?
  • Are services designed to handle instances appearing/disappearing?
  • Do you use Kubernetes or similar for dynamic load balancing?
  • Are DNS TTLs short enough for rapid changes?

Checklist: Fallacy 6 - There is One Administrator

  • Do you have clear ownership boundaries for each service?
  • Are cross-team changes coordinated through ADRs (Architecture Decision Records)?
  • Do you have self-service deployment pipelines?
  • Is there a clear escalation path for incidents?

Checklist: Fallacy 7 - Transport Cost is Zero

  • Have you profiled serialization costs at your expected throughput?
  • Are you using efficient serialization (Protobuf, Avro) for internal communication?
  • Do you stream large datasets rather than loading them all at once?
  • Is serialization cost included in latency budgets?

Fallacy 8: Network is Homogeneous

  • Do you use latency-aware routing (not just round-robin)?
  • Have you tested behavior across different network paths?
  • Do you monitor per-path latency, not just aggregate?
  • Do you have quality-of-service policies for different traffic types?

Scoring

Count your unchecked boxes:

  • 0-4 unchecked: Your architecture is resilient - well done!
  • 5-10 unchecked: Some vulnerabilities exist - review weak areas
  • 11+ unchecked: High risk - prioritize fixing these issues before production

Quick Recap Checklist

Use this checklist to verify your distributed system design addresses each fallacy before production deployment:

Network Reliability

  • All network calls have retry logic with exponential backoff
  • Circuit breakers are configured for all downstream services
  • Fallback behavior is defined for when services are unavailable
  • Timeouts are set based on measured P99 latency, not arbitrary values

Latency Awareness

  • Actual latency measured between all service pairs across regions
  • Parallel fan-out used for independent service calls
  • Latency budgets include serialization and deserialization costs
  • Async I/O preferred over synchronous blocking calls

Bandwidth Planning

  • Bandwidth requirements calculated for all data replication streams
  • Compression enabled for cross-datacenter traffic
  • Large transfers batched or streamed, not bulk-loaded
  • Monitoring in place for bandwidth utilization alerts

Security Posture

  • mTLS enabled for all internal service-to-service communication
  • Every API endpoint authenticated, including internal ones
  • Secrets and certificates rotated automatically
  • Zero trust assumed for all network hops

Topology Dynamics

  • Dynamic service discovery used instead of static IPs
  • Services handle instances appearing/disappearing gracefully
  • DNS TTLs configured appropriately for rapid changes
  • Short-lived connections rebuilt cleanly after service migration

Team Coordination

  • Clear ownership boundaries defined for each service
  • Architecture Decision Records (ADRs) maintained for cross-team decisions
  • Self-service deployment pipelines available to all teams
  • SLOs documented and shared across team boundaries

Transport Efficiency

  • Serialization costs profiled at expected throughput
  • Binary protocols (Protobuf, Avro) used for high-throughput internal communication
  • Large datasets streamed rather than loaded entirely into memory
  • Transport cost included in latency budgets

Network Heterogeneity

  • Latency-aware routing configured (not just round-robin)
  • Per-path latency monitored separately from aggregate metrics
  • Quality-of-service policies defined for different traffic types
  • Firewall and load balancer latencies factored into timeout calculations

When to Use / When Not to Use

ScenarioRecommendation
Greenfield distributed systemDesign explicitly for all eight fallacies
Adding distribution to monolithStart with async communication
Multi-region deploymentAssume all fallacies apply doubled
Internal microservicesTreat internal traffic as untrusted
High-throughput systemsProfile serialization cost early

Fallacy trade-off decision matrix

When designing a distributed system, each fallacy creates a set of competing concerns. This matrix maps each fallacy against the engineering trade-offs it forces you to make.

FallacyYou assume…Reality forces…Typical cost if violatedMitigating pattern
1. Network is reliableCalls always succeedPartitions happen, routes flap, links dropData loss, cascade failuresCircuit breakers, retries with backoff
2. Latency is zeroCalls are freeCross-DC latency dominates response timeBroken SLAs, timeouts, user-facing lagAsync I/O, parallel fan-out, latency budgets
3. Bandwidth is infinitePush all data freelyReplication streams saturate linksReplication lag, missed SLAsCompression, batching, CDC, read replicas
4. Network is secureInternal == trustedLateral movement, compromised servicesData breach, internal exfiltrationmTLS, zero trust, every-call auth
5. Topology doesn’t changeEndpoints are stableInstances migrate, pods reschedule, IPs changeConnection failures, stale lookupsDynamic discovery, short DNS TTLs
6. One administratorSingle team owns allCross-team change coordination takes weeksDeployment bottlenecks, incidentsSelf-service platforms, ADRs, feature flags
7. Transport cost is zeroSerialize freelyJSON encoding burns CPU at scaleCPU bottlenecks, throughput capProtobuf, Avro, MessagePack
8. Network is homogeneousAll paths equalHotspots, QoS throttling, firewall timeoutsUnpredictable latency, timeoutsLatency-aware routing, per-path monitoring

How to use this matrix: For each row, check whether your current architecture accounts for the “Reality forces” column. If your design ignores a fallacy, the “Typical cost if violated” column tells you what failure mode to expect. The “Mitigating pattern” column points to the specific pattern that addresses each trade-off on this site.


Interview Questions

These questions test understanding of the eight fallacies and their practical implications in distributed system design.

1. Why is assuming the network is reliable considered the most dangerous fallacy in distributed computing?

Expected answer points:

  • Networks fail constantly through cable cuts, switch reboots, and router misconfigurations
  • In large enough systems, network failures should be expected daily, not treated as exceptional events
  • Unreliable networks cause cascading failures when services make synchronous calls without timeouts
  • The assumption leads to designs without retry logic, circuit breakers, or fallback behavior
  • Production networks experience packet loss, latency spikes, and partitions that break naive assumptions
2. Explain why latency being non-zero fundamentally changes distributed system design compared to local procedure calls.

Expected answer points:

  • Local function calls complete in nanoseconds; network calls take milliseconds to hundreds of milliseconds
  • Sequential network calls compound latency (multiple 50ms calls = 200ms+ total)
  • Developers treating network calls like function calls create systems that respond in seconds, not milliseconds
  • Non-zero latency means timeouts must be explicitly configured with appropriate budgets
  • Parallel execution of independent calls becomes essential for performance
  • The PACELC theorem shows latency and consistency trade off against each other
3. What practical problems arise from assuming bandwidth is infinite in distributed systems?

Expected answer points:

  • Cross-region database replication of 500GB at 100Mbps takes over 11 hours
  • 4K video streaming to 10,000 concurrent viewers requires approximately 300Gbps
  • A gigabit network link maxes out at about 125MB/s of actual throughput
  • Teams encounter unexpected throttling and performance degradation during data-intensive operations
  • Compression, batching, and streaming become necessities rather than optimizations
4. How does zero trust architecture address the fallacy that internal networks are secure?

Expected answer points:

  • Internal networks get compromised through lateral movement, insider threats, and misconfiguration
  • Services get repurposed and expose unexpected attack surfaces
  • Zero trust requires every service to authenticate every request, even from within the same data center
  • mTLS provides mutual authentication between services
  • Every network hop is treated as potentially hostile, regardless of physical location
5. Why do static network diagrams become obsolete quickly, and what replaces them in modern architectures?

Expected answer points:

  • Services get added, removed, migrated, and scaled constantly in production environments
  • Containers move between hosts and VMs get provisioned/deprovisioned dynamically
  • Static IPs become stale within weeks as infrastructure changes
  • Dynamic service discovery with short TTLs handles topology changes gracefully
  • DNS Service Discovery and Service Registries provide real-time endpoint tracking
  • Service meshes enforce network policies as code rather than relying on static firewall rules
6. Describe how Fallacy 6 (there is one administrator) manifests in large-scale distributed systems and organizational patterns that mitigate it.

Expected answer points:

  • Multiple teams manage different infrastructure parts: network team, database team, platform team, security team
  • Adding a firewall port requires tickets, approvals, and maintenance windows across teams
  • Coordination overhead becomes the bottleneck as system size grows
  • Team Topologies patterns: stream-aligned, platform, enabling, and complicated subsystem teams
  • SRE practices with error budgets, SLOs, and blameless postmortems help manage complexity
  • ADRs (Architecture Decision Records) create searchable records of cross-team decisions
7. What is transport cost in distributed systems, and why does it matter more at scale?

Expected answer points:

  • Transport cost includes serialization (object to JSON), transmission, and deserialization on the other side
  • JSON.stringify on a complex object takes approximately 500 microseconds; Protobuf encoding takes approximately 50 microseconds
  • At 10,000 requests/second: JSON uses 5 seconds of CPU for serialization versus 0.5 seconds for Protobuf
  • High-throughput systems find serialization overhead dominates processing time
  • Efficient binary protocols like gRPC with Protocol Buffers significantly reduce transport overhead
8. Explain how network path heterogeneity causes subtle performance problems that round-robin load balancing cannot solve.

Expected answer points:

  • Intra-datacenter calls might take 1ms, calls through load balancers 5ms, WAN calls 50ms
  • Quality of service policies, routing anomalies, and congestion create unpredictable performance
  • Round-robin ignores latency differences and sends equal traffic to fast and slow paths
  • Latency-aware routing must consider actual path performance, not just node selection
  • Poorly configured firewalls may cause calls to time out entirely on specific paths
  • Monitoring per-path latency rather than aggregate reveals hidden performance issues
9. How do the eight fallacies compound in production, creating cascading failures?

Expected answer points:

  • Assuming latency is zero leads to synchronous calls between services
  • Network partitions hit, and because latency is non-zero, timeouts trigger slowly
  • Slow-triggering timeouts cause retries to accumulate and overwhelm healthy services
  • Assuming network is reliable with no circuit breakers allows cascade to spread
  • Hardcoded IPs after service migration cause connection failures during topology changes
  • The compounding pattern: violating one fallacy amplifies violations of others
10. What is chaos engineering, and how does it help teams discover which fallacies their systems violate?

Expected answer points:

  • Chaos engineering injects controlled failures into production to test system behavior
  • Teams inject latency, partition services, and simulate failures to validate assumptions
  • Netflix's Chaos Monkey randomly terminates instances to ensure resilience
  • Discovers hidden vulnerabilities before they cause actual outages
  • Tests whether circuit breakers, retries, and fallback behavior actually work under failure conditions
11. Compare circuit breakers versus bulkheads as patterns for preventing cascade failures. When would you use each?

Expected answer points:

  • Circuit breakers trip when a downstream service fails repeatedly, stopping further requests
  • Bulkheads partition resources (connection pools, threads) so failures in one service don't exhaust resources for others
  • Circuit breakers: use when services call other services that might be slow or unavailable during peak load
  • Bulkheads: use when you need to isolate critical services from noisy neighbors consuming all resources
  • Both patterns address network unreliability by limiting blast radius of failures
12. What is the relationship between the CAP theorem and Fallacy 1 (the network is reliable)?

Expected answer points:

  • The CAP theorem states that during a network partition (P), you must choose between consistency (C) and availability (A)
  • Assuming the network is reliable ignores the possibility of partitions entirely
  • In reality, partitions happen; designs must choose between serving stale data (AP) or refusing requests (CP)
  • When network is reliable and no partition occurs, you can have both consistency and availability (CA)
  • The fallacy leads to designs that assume CA without handling the P case when partitions occur
13. How do feature flags help reduce coordination overhead across teams managing distributed systems?

Expected answer points:

  • Feature flags decouple deployment from release, allowing teams to deploy independently
  • Team A can change an API that Team B depends on by deploying behind a flag
  • Testing happens with own traffic before exposing to dependent teams
  • Gradual rollout with monitoring allows rollback without cross-team coordination
  • Dark launches expose new functionality to small traffic subsets while team monitors for issues
  • No coordination with Team B required until the flag is fully enabled
14. Describe the error budget concept from SRE and how it changes team behavior regarding deployments and risk.

Expected answer points:

  • Error budget = allowed downtime based on SLA (e.g., 99.95% availability = 4 hours downtime per year)
  • Teams can move fast as long as system reliability remains within budget
  • When budget is full (reliable), teams move fast; when budget is depleted, they slow down
  • Removes need for lengthy approval processes when system is healthy
  • At 50% budget remaining: increase testing rigor, require additional review
  • At 25% budget remaining: freeze non-critical deployments, require root cause analysis
15. Why does the PACELC theorem matter when designing for latency in distributed systems?

Expected answer points:

  • PACELC extends CAP: if there is a partition (P), you choose between latency (L) and consistency (C)
  • When no partition occurs, you still face a latency vs consistency tradeoff
  • Synchronous replication for strong consistency adds latency due to coordination overhead
  • Asynchronous replication is faster but introduces potential inconsistency
  • Designing for low latency often means accepting eventual consistency
  • The fallacy "latency is zero" ignores this fundamental tradeoff that affects real system performance
16. What organizational patterns help reduce the multi-team coordination overhead described in Fallacy 6?

Expected answer points:

  • Self-service platforms: teams can deploy new services without filing tickets with platform team
  • Safe defaults: encryption on by default, impossible to turn off, reduces decision mistakes
  • Observability investment: logs, metrics, traces let teams diagnose issues without coordinating with other teams
  • Service meshes: network policy as code instead of coordinating firewall rules
  • Shared libraries for common patterns (retries, timeouts, circuit breakers): all teams benefit from improvements without individual coordination
17. How does gRPC with Protocol Buffers address the transport cost fallacy more effectively than REST with JSON?

Expected answer points:

  • Protocol Buffers produce messages 60% smaller than equivalent JSON
  • Encoding time is approximately 10x faster (50 microseconds vs 500 microseconds for complex objects)
  • gRPC uses HTTP/2 for multiplexing multiple requests over a single connection
  • Binary protocol parsing is more efficient than string-based JSON parsing
  • For internal service communication with high throughput requirements, Protobuf significantly reduces CPU overhead
  • REST with JSON remains acceptable for external APIs where human readability matters
18. What happened during the AWS US-EAST-1 outage of December 2021, and which fallacies did it demonstrate?

Expected answer points:

  • A US-EAST-1 availability zone power failure cascaded for over 7 hours
  • Affected thousands of services including Amazon, Netflix, and Disney+
  • Fallacy 1 (Network is Reliable): single-AZ design assumed failure wouldn't spread properly
  • Fallacy 2 (Latency is Zero): synchronous cross-service dependencies meant one failing component brought down entire systems
  • Fallacy 6 (One Administrator): change management process for restoring power took hours of coordination
  • Key lessons: assume any single component can fail catastrophically, design for partial availability, have runbooks for multi-step recovery
19. Explain how Service Level Objectives (SLOs) help different teams agree on reliability targets in distributed systems.

Expected answer points:

  • SLOs give teams shared definitions of reliability (e.g., 99.95% availability, p99 latency under 200ms)
  • Cross-team discussions become easier when everyone agrees on what "reliable" means
  • SLOs become the common language for reliability decisions across team boundaries
  • Downstream teams depend on upstream services; SLAs formalize those dependencies with explicit commitments
  • Teams can make trade-off decisions (speed vs reliability) based on shared SLO understanding
20. What is the relationship between Fallacy 3 (bandwidth is infinite) and database replication strategies across geographic regions?

Expected answer points:

  • Cross-region database replication exposes bandwidth limitations dramatically
  • Replicating 500GB at 100Mbps takes over 11 hours, making naive replication strategies impractical
  • Solutions include: compressing replication streams, batching updates, using change data capture (CDC)
  • Async replication becomes necessary when bandwidth cannot keep up with write rates
  • Geo-distribution architectures must account for bandwidth costs and latency between regions
  • The fallacy leads teams to design replication strategies that assume no bandwidth constraints

Further Reading

Explore these related topics to deepen your understanding of distributed systems design and the fallacies that govern them.

Fundamental Theorems:

  • CAP Theorem - Understanding consistency and availability trade-offs during network partitions
  • PACELC Theorem - The latency versus consistency tradeoff that persists even when networks are reliable

Resilience Patterns:

Operational Practices:

Security:

  • mTLS - Mutual TLS for authenticating service-to-service communication

Team Coordination:

Observability:

For deeper exploration of failure handling and real-world case studies, see Incident Response.


Conclusion

The eight fallacies describe assumptions that seem reasonable but are always wrong in production.

  • Assume the network is unreliable, latency is non-zero, and bandwidth is finite.
  • Treat internal traffic as untrusted—security boundaries exist at every network hop.
  • Design for topology changes. Use dynamic service discovery.
  • Account for transport cost (serialization) in performance budgets.

Copy/Paste Checklist

- [ ] Every network call has timeout and retry logic
- [ ] Circuit breakers protect against cascade failures
- [ ] Fallback behavior defined for when services are unavailable
- [ ] Authentication required for all service-to-service calls
- [ ] Dynamic service discovery instead of static configuration
- [ ] Serialization cost included in latency budgets
- [ ] Network topology changes handled gracefully
- [ ] Monitoring for latency, error rates, and bandwidth utilization

For deeper exploration of failure handling, see Chaos Engineering and Circuit Breaker Pattern.

Category

Related Posts

Distributed Systems Primer: Key Concepts for Modern Architecture

A practical introduction to distributed systems fundamentals. Learn about failure modes, replication strategies, consensus algorithms, and the core challenges of building distributed software.

#distributed-systems #system-design #architecture

Graceful Degradation: Systems That Bend Instead Break

Design systems that maintain core functionality when components fail through fallback strategies, degradation modes, and progressive service levels.

#distributed-systems #fault-tolerance #resilience

Microservices vs Monolith: Choosing the Right Architecture

Understand the fundamental differences between monolithic and microservices architectures, their trade-offs, and how to decide which approach fits your project.

#microservices #monolith #architecture