System Design Roadmap: From Fundamentals to Distributed Systems Mastery
Master system design with this comprehensive learning path covering distributed systems, scalability, databases, caching, messaging, and real-world case studies for interview prep.
System design is the discipline of architecting scalable, reliable, and efficient software systems. This roadmap is for engineers who have built things and want to understand how the pieces fit together at scale — the APIs they consume, the databases backing their apps, the messaging patterns holding their services together.
It assumes basic programming experience and focuses on designing systems that handle millions of users. By the end, you will be able to design and critique systems like Twitter, Netflix, and Uber.
Before You Start
- Basic programming knowledge in at least one language
- Understanding of how the web works (HTTP, DNS, TCP/IP)
- Basic data structures and algorithms knowledge
- Understanding of SQL and relational databases basics
The Roadmap
🌐 Networking & APIs
💾 Data Storage
⚡ Caching & CDN
📬 Messaging & Queues
🏗️ Microservices
📊 Observability
🎯 Case Studies
🚀 Advanced Patterns
🎯 Next Steps
Timeline & Milestones
📅 Estimated Timeline
🎓 Capstone Track
Design
- Define functional and non-functional requirements
- Estimate capacity (QPS, storage, bandwidth)
- Design high-level architecture with component diagram
- Choose databases, caching strategy, and messaging patterns
- Document API contracts and data models
- Identify failure points and mitigation strategies
Implementation
- Set up relational and/or NoSQL database with proper schema
- Implement RESTful API with versioning and documentation
- Add caching layer (Redis) with eviction policy
- Configure CDN for static asset delivery
- Set up message queue for async processing
- Write unit and integration tests
Observability
- Add structured logging with correlation IDs
- Set up Prometheus metrics and Grafana dashboards
- Integrate distributed tracing (Jaeger)
- Configure alerting rules for golden signals
- Create runbooks for common failure scenarios
Hardening
- Implement circuit breakers and bulkheads
- Add rate limiting and throttling
- Configure graceful degradation and fallbacks
- Set up mTLS or API gateway authentication
- Document disaster recovery plan
- Perform load testing and optimize bottlenecks
Chaos testing
- Define failure scenarios (service crashes, network latency, dependency failures)
- Inject failures using chaos engineering tools
- Verify circuit breaker triggers and recovery behavior
- Measure recovery time objectives (RTO) and recovery point objectives (RPO)
- Document findings and present architecture review
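The capacity-estimation item in the capstone checklist (QPS, storage, bandwidth) is mostly back-of-the-envelope arithmetic. A minimal sketch follows; the function name, its parameters, and the default peak factor of 3 are illustrative assumptions, not values prescribed by this roadmap.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class CapacityEstimate:
    avg_qps: float
    peak_qps: float
    storage_gb_per_year: float
    egress_mbps: float


def estimate_capacity(
    daily_active_users: int,
    requests_per_user_per_day: float,
    writes_per_user_per_day: float,
    bytes_per_write: int,
    bytes_per_response: int,
    peak_factor: float = 3.0,  # assumption: peak traffic is ~3x the daily average
) -> CapacityEstimate:
    """Rough estimate of request rate, yearly storage growth, and egress bandwidth."""
    total_requests = daily_active_users * requests_per_user_per_day
    avg_qps = total_requests / SECONDS_PER_DAY
    peak_qps = avg_qps * peak_factor

    # Storage grows with every write; replication and indexes are ignored here.
    storage_bytes_per_year = (
        daily_active_users * writes_per_user_per_day * bytes_per_write * 365
    )

    # Average outbound bandwidth in bits per second, based on response size.
    egress_bits_per_second = avg_qps * bytes_per_response * 8

    return CapacityEstimate(
        avg_qps=avg_qps,
        peak_qps=peak_qps,
        storage_gb_per_year=storage_bytes_per_year / 1e9,
        egress_mbps=egress_bits_per_second / 1e6,
    )
```

Numbers like these are only meant to decide orders of magnitude (one database or several, one region or many), so rounding aggressively is fine.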
Milestone Markers
| Milestone | When | What you can do |
|---|---|---|
| Foundation | Week 3 | Complete Sections 1-2, understand CAP/PACELC trade-offs, design basic RESTful APIs with load balancing |
| Storage & Retrieval | Week 7 | Handle distributed data with proper database selection, sharding, and replication strategies |
| Architecture | Week 12 | Build microservices with service mesh, implement messaging patterns, containerize with Docker and deploy to Kubernetes |
| Scaling & Performance | Week 15 | Add caching, CDN, search scaling, implement resilience patterns (circuit breaker, bulkhead), optimize performance for high QPS |
| Capstone Complete | Week 16 | End-to-end system designed, implemented, observable, hardened, and chaos-tested with documented trade-offs and architecture review |
Core Topics: When to Use / When Not to Use
Load Balancing — When to Use vs When Not to Use
| When to Use | When NOT to Use |
|---|---|
| Traffic distribution across multiple servers or instances | Single-server applications with no scaling requirements |
| High-availability setups where failover is critical | Low-traffic applications where the overhead of load balancing is unjustified |
| Canary releases or blue-green deployments requiring gradual traffic shifting | Environments with very low latency requirements where an extra hop is unacceptable |
| DDoS protection and rate limiting at the edge | Systems already behind a managed cloud load balancer (e.g., AWS ALB) that provides these natively |
| Geographic routing for multi-region deployments | Simple development environments where round-robin DNS is sufficient |
Trade-off Summary: Load balancers distribute traffic and enable high availability, but they add an extra network hop, more moving parts to operate, and, unless deployed redundantly, a potential single point of failure. They pay off once you run more than one instance; in front of a single server they are mostly overhead.
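To make traffic distribution and failover concrete, here is a minimal sketch of round-robin selection that skips unhealthy backends. The class and method names are hypothetical, and a real load balancer would drive `mark_down` and `mark_up` from active health checks rather than manual calls.

```python
class RoundRobinBalancer:
    """Cycles through backends in order, skipping any marked unhealthy."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._next = 0  # index of the next backend to try

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call so we never loop forever.
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")
```

Round-robin is the simplest policy; least-connections or weighted variants follow the same shape with a different selection rule inside `pick`.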
Database Sharding — When to Use vs When Not to Use
| When to Use | When NOT to Use |
|---|---|
| Single database nearing capacity limits (as a rough rule of thumb, more than 1 TB of data or 10K QPS) | Small datasets that fit comfortably on a single database instance |
| Multi-tenant applications needing logical data isolation | Regulatory environments requiring serializable transactions across shards |
| High write throughput that saturates a single primary | Applications with complex join queries that span shards |
| Horizontal scaling needed to handle traffic spikes | When proper indexing and query optimization would solve the performance problem |
| Global applications needing geographic data locality | Short-lived projects where sharding complexity outlasts the project’s lifetime |
Trade-off Summary: Sharding distributes data and load across multiple database instances, enabling horizontal scaling and geographic locality, but introduces complexity in query routing, cross-shard transactions, and operational overhead. Only shard when you’ve exhausted vertical scaling and read replicas.
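The query-routing complexity mentioned above comes down to mapping each key to a shard. One common approach is consistent hashing, sketched below under stated assumptions: the class name, the shard names, and the choice of 100 virtual nodes per shard are all illustrative, and SHA-256 is used only because it gives a hash that is stable across processes (Python's built-in `hash` is not).

```python
import bisect
import hashlib


def _stable_hash(value: str) -> int:
    """64-bit hash that is identical across processes and machines."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    """Maps keys to shards so that adding a shard moves only a fraction of keys."""

    def __init__(self, shards, vnodes: int = 100):
        self._ring = []  # sorted list of (hash, shard) virtual nodes
        for shard in shards:
            self.add_shard(shard, vnodes)

    def add_shard(self, shard: str, vnodes: int = 100) -> None:
        # Several virtual nodes per shard smooth out the key distribution.
        for i in range(vnodes):
            bisect.insort(self._ring, (_stable_hash(f"{shard}#{i}"), shard))

    def shard_for(self, key: str) -> str:
        if not self._ring:
            raise RuntimeError("ring has no shards")
        # First virtual node clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_left(self._ring, (_stable_hash(key), ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]
```

With plain `hash(key) % shard_count`, changing the shard count remaps most keys; on a ring like this one, keys that move after `add_shard` only ever move to the new shard.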
Caching — When to Use vs When Not to Use
| When to Use | When NOT to Use |
|---|---|
| Expensive computations or database queries that are repeated frequently | Frequently changing data where cache invalidation is complex and error-prone |
| Read-heavy workloads (90%+ reads) where staleness is acceptable | Write-heavy workloads requiring strong consistency on every read |
| Session data, user preferences, leaderboards, and frequently accessed data | Data requiring ACID transactions across multiple entities |
| API response caching to reduce downstream service load | Small datasets where database query optimization is faster |
| Distributed systems needing to reduce inter-service latency | Situations where cache infrastructure complexity exceeds the cost of additional database capacity |
Trade-off Summary: Caching dramatically reduces latency and backend load but trades strong consistency for speed. Cache invalidation is notoriously difficult — use caches for read-heavy, eventually-consistent data and keep invalidation strategies simple. When in doubt, cache less data rather than more.
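One way to keep invalidation simple is the cache-aside pattern with a time-to-live: read from the cache, fall back to the database on a miss, and delete the entry on writes. The sketch below uses an in-memory dictionary as a stand-in for a real cache such as Redis; `TTLCache`, `cache_aside_get`, and the `load_from_db` callback are hypothetical names for illustration.

```python
import time


class TTLCache:
    """In-memory stand-in for a cache server; entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}     # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # lazily evict expired entries
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl)

    def delete(self, key):
        self._store.pop(key, None)


def cache_aside_get(cache: TTLCache, key, load_from_db):
    """Return the cached value, loading and caching it on a miss."""
    value = cache.get(key)
    if value is None:  # note: this sketch cannot cache a legitimate None value
        value = load_from_db(key)
        cache.set(key, value)
    return value
```

On the write path, update the database first and then call `cache.delete(key)`; the TTL bounds how long a stale entry can survive if that delete is ever missed.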
Message Queues — When to Use vs When Not to Use
| When to Use | When NOT to Use |
|---|---|
| Async processing for tasks that can be deferred (email, notifications, logging) | Synchronous request/response where the caller waits for immediate completion |
| Decoupling producers from consumers so they can scale independently | Simple point-to-point communication that doesn’t need durability or backpressure |
| Handling traffic spikes where downstream services can’t keep up with producer rate | Low-latency requirements where even millisecond queue latency is unacceptable |
| Reliable event delivery in distributed systems needing guaranteed at-least-once semantics | When exactly-once delivery is required without expensive deduplication logic |
| Event-driven microservices where services react to domain events | Monolithic applications where shared memory or function calls are simpler and faster |
Trade-off Summary: Message queues enable resilient, decoupled architectures with natural backpressure and durability, but add infrastructure complexity and eventual consistency. They’re essential for microservices and async workflows but overkill for simple request/response patterns.
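At-least-once delivery means a consumer can see the same message twice, so handlers are usually made idempotent. A minimal sketch follows, assuming each message is a dictionary with a unique `id` field and that an in-memory set is an acceptable stand-in for a durable deduplication store; both are assumptions for illustration.

```python
class IdempotentConsumer:
    """Wraps a handler so that redelivered messages are processed only once."""

    def __init__(self, handler):
        self._handler = handler
        self._processed_ids = set()  # stand-in for a durable store (e.g. a DB table)

    def handle(self, message: dict) -> bool:
        """Process the message; return False if it was a duplicate."""
        message_id = message["id"]
        if message_id in self._processed_ids:
            return False  # already handled; safe to acknowledge again
        self._handler(message["body"])
        # Record the ID only after the handler succeeds, so a crash mid-handler
        # leads to a retry rather than a silently dropped message.
        self._processed_ids.add(message_id)
        return True
```

In production the ID record and the handler's side effects would ideally be committed in the same transaction; otherwise a crash between the two can still cause a repeat.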
Microservices Architecture — When to Use vs When Not to Use
| When to Use | When NOT to Use |
|---|---|
| Large teams (10+ engineers) needing independent deployment and ownership | Small teams or simple applications where monolithic deployment is simpler and faster |
| Polyglot persistence requirements where different services need different database types | When services are tightly coupled by data consistency requirements and frequent cross-service calls |
| Independent scaling of specific components under heavy load | Applications with predictable, uniform load that doesn’t benefit from per-component scaling |
| Multiple release velocities across teams — some teams move faster than others | Startups needing rapid iteration where time-to-market beats architectural purity |
| Regulatory or security isolation requirements between service boundaries | Simple CRUD applications without complex domain logic or multiple ownership boundaries |
Trade-off Summary: Microservices enable independent scaling, polyglot persistence, and team autonomy, but trade that for distributed systems complexity — network unreliability, data consistency challenges, and operational overhead. Start with a well-modularized monolith and extract services only when you have clear, stable bounded contexts.
CDN (Content Delivery Network) — When to Use vs When Not to Use
| When to Use | When NOT to Use |
|---|---|
| Static asset delivery (images, CSS, JavaScript, fonts) to global users | Dynamic, personalized content that can’t be cached at the edge |
| Reducing latency for geographically distributed users | Applications serving a localized audience in a single region |
| DDoS protection and web application firewall at the edge | When origin server infrastructure is already globally distributed |
| Video streaming, large file downloads, or media-heavy applications | Small applications with minimal traffic where CDN cost outweighs benefits |
| API response caching for immutable or rarely-changing data | Real-time data, WebSocket streams, or frequently personalized content |
Trade-off Summary: CDNs dramatically reduce latency for global static content delivery and provide DDoS protection, but introduce caching complexity and potential staleness for dynamic content. Use them aggressively for static assets and for API responses that are immutable or rarely change; this is one of the highest-leverage optimizations in system design.
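What a CDN may cache is largely controlled by the `Cache-Control` response header sent by the origin. The sketch below picks a header value per response; the path conventions (content-hashed asset filenames, an `/api/` prefix) and the specific lifetimes are assumptions for illustration, not recommendations from this roadmap.

```python
import re

# Assumption: build tooling emits content-hashed filenames such as app.3f9a1c2b.js.
_FINGERPRINTED_ASSET = re.compile(r"\.[0-9a-f]{8,}\.(js|css|png|jpg|svg|woff2)$")


def cache_control_for(path: str, personalized: bool = False) -> str:
    """Choose a Cache-Control header value for a response served via a CDN."""
    if personalized:
        # Per-user content must never be stored by shared caches.
        return "private, no-store"
    if _FINGERPRINTED_ASSET.search(path):
        # The URL changes whenever the content changes, so cache for a year.
        return "public, max-age=31536000, immutable"
    if path.startswith("/api/"):
        # Short lifetime for rarely-changing API responses.
        return "public, max-age=60"
    # Everything else (e.g. HTML pages): cache briefly.
    return "public, max-age=300"
```

Fingerprinted assets sidestep invalidation entirely, since a new deploy produces new URLs instead of requiring a purge.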
Circuit Breaker — When to Use vs When Not to Use
| When to Use | When NOT to Use |
|---|---|
| Stopping a failing or overloaded downstream dependency from causing cascading failures in its callers | Systems where failure is acceptable and degradation is not needed |
| When integrating with third-party or external services with unknown reliability | Tightly coupled internal services where the failure of one component means the whole system is broken |
| Critical paths where partial availability is preferred over complete unavailability | Short-lived batch jobs where failures can be simply retried |
| When you need graceful degradation instead of hard failures | Scenarios where all-or-nothing behavior is required (no partial responses) |
| Multi-tenant systems where one tenant causing issues shouldn’t affect others | Simple synchronous calls where timeout alone is sufficient |
Trade-off Summary: Circuit breakers prevent cascading failures and enable graceful degradation, but add implementation complexity and require careful tuning of thresholds. They’re essential for resilient distributed systems but become unnecessary overhead for tightly coupled monoliths or services with no external dependencies.
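The thresholds that need tuning are easier to see in code. Below is a minimal, single-threaded sketch of the three-state breaker (closed, open, half-open); the class name, default threshold, and reset timeout are illustrative assumptions, and a production version would also need locking and per-dependency instances.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock  # injectable so timeouts can be tested without sleeping
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_timeout:
            return "half_open"  # allow a trial call through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("dependency unavailable; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens it.
            if self.state == "half_open" or self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        # Any success resets the breaker to closed.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers typically catch `CircuitOpenError` and return a fallback (cached data, a default, or a degraded response) instead of propagating the failure.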
Rate Limiting — When to Use vs When Not to Use
| When to Use | When NOT to Use |
|---|---|
| Protecting APIs from abuse, DDoS, or unintentional usage spikes | Internal services behind a trusted firewall with known consumers |
| Enforcing quotas and usage-based billing for multi-tenant platforms | Systems where every request must be processed regardless of volume |
| Preventing backend service overload during traffic spikes | Low-traffic applications where operational costs exceed the benefit |
| Compliance requirements for third-party API integrations with strict limits | Scenarios where client-side throttling is sufficient |
| Security hardening against credential stuffing and brute-force attacks | When fair resource allocation between users is not a concern |
Trade-off Summary: Rate limiting protects services from abuse and overload but can reject legitimate traffic and requires careful algorithm selection (token bucket vs. sliding window vs. fixed window). Use it at the API gateway level for centralized control, and combine with per-service limits for defense in depth.
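Of the algorithms named above, the token bucket is often a reasonable default because it permits short bursts while enforcing an average rate. A minimal single-process sketch follows; the class name and parameters are illustrative, and a gateway-level limiter would keep this state in a shared store rather than in memory.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` while enforcing `rate` requests per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self._rate = rate            # tokens added per second
        self._capacity = capacity    # maximum burst size
        self._tokens = float(capacity)
        self._clock = clock          # injectable so refill can be tested
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never exceeding capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # caller should reject, e.g. with HTTP 429
```

Keeping one bucket per API key or client IP gives per-tenant limits; a fixed or sliding window trades this burst tolerance for simpler counting.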
Resources
Networking Fundamentals
- TCP/IP Illustrated by W. Richard Stevens
- HTTP: The Definitive Guide
- Cloudflare Learning Center - DNS
- Let’s Encrypt Documentation
Core Theory
- CAP Theorem Papers and Resources
- Database Internals by Alex Petrov
- Designing Data-Intensive Applications by Martin Kleppmann
- Dynamo: Amazon’s Highly Available Key-value Store
Databases
Caching
Messaging & Queues
- Apache Kafka Documentation
- RabbitMQ Tutorials
- AWS SQS Developer Guide
- Enterprise Integration Patterns
Microservices
Observability
Case Studies & Practice
- System Design Interview Blog
- Exponent - System Design Prep
- System Design Primer
- Awesome System Design
Advanced Topics