Database Capacity Planning: A Practical Guide
Plan for growth before you hit walls. This guide covers growth forecasting, compute and storage sizing, IOPS requirements, and cloud vs on-prem decisions.
Database Capacity Planning: A Practical Guide
Capacity planning is the unsexy work that prevents dramatic failures. Nobody writes blog posts about databases that had enough capacity. They write posts about databases that ran out of disk at 3am, or whose CPU maxed out during a product launch, or that fell over because connection pools were exhausted.
This guide covers how to size database infrastructure for current needs and future growth, with practical formulas and monitoring approaches.
Growth Forecasting
Before you can plan capacity, you need to understand growth. How is your data growing? How are your users growing? What about query volume?
Data Growth Modeling
Start with your current data volume. How large is your database today?
Workload Characterization
Data volume alone does not tell the whole story. A database with 100GB might be perfectly fine with modest resources, or it might be hopelessly underpowered depending on query patterns. Characterize your workload:
Sizing Compute (CPU)
CPU sizing depends on query complexity and concurrency requirements. A database running simple CRUD operations needs less CPU than one running complex analytical queries.
For CPU, focus on single-threaded performance. PostgreSQL cannot parallelize a single query across multiple cores for most operations (though parallel query exists, it has limits). A database with 8 fast cores is often better than one with 16 slow cores.
Rule of thumb for general workloads: start with 4-8 cores for a production database and scale up based on monitoring. If you consistently see CPU above 70-80%, that is a signal to upgrade.
Sizing Memory
Memory is where PostgreSQL caches data and does its work. More memory means fewer disk I/O operations, which are orders of magnitude slower than memory access.
The key metric is effective cache size:
SHOW effective_cache_size;
This setting tells PostgreSQL how much memory is available for caching. A reasonable starting point is 75% of your total RAM. If you have 32GB RAM, set effective_cache_size to 24GB.
PostgreSQL’s shared_buffers default of 128MB is far too small for production. For a server with 32GB RAM, shared_buffers of 8-16GB is reasonable.
work_mem is per-operation (sort, hash) and maintenance_work_mem is for maintenance operations like VACUUM and CREATE INDEX. Both should be sized appropriately for your workload.
Sizing Storage
Storage sizing requires estimating current data size, growth rate, and headroom.
Current and Projected Size
Calculate current table and index sizes (see queries above). Add 30-50% for headroom and growth between storage expansions.
Storage Type Considerations
Different storage types suit different workloads: SSD (NVMe) for high I/O mixed read/write workloads, HDD for large analytical databases with sequential I/O, and network storage in cloud environments where latency varies. For most OLTP workloads, SSD or NVMe storage is the better choice. The performance difference is significant.
IOPS Requirements
IOPS (Input/Output Operations Per Second) measures storage throughput. Database workloads are I/O intensive, especially with poor cache hit rates.
Calculate required IOPS by understanding your workload. Cloud providers specify IOPS for their storage volumes—make sure your volume type supports your workload’s I/O requirements.
Connection Pool Sizing
PostgreSQL connections have overhead. Each connection uses memory (roughly 5-10MB per connection in work_mem) and competes for CPU. A database with 1000 idle connections is less efficient than one with 100 busy connections.
Direct Connection Limits
max_connections sets the absolute maximum. Typical production values range from 100 to 1000 depending on workload. More is not always better—each idle connection consumes resources.
Connection Poolers
For most production workloads, use a connection pooler like PgBouncer or Pgpool-II. PgBouncer pools connections at the transaction level, allowing many application connections to share few database connections. The tradeoff is that you cannot use session-level features (like prepared statements in some configurations, or SET statements that persist across transactions).
Pool Size Calculation
A reasonable starting formula for default_pool_size: pool_size = (core_count * 2) + effective_io_concurrency. On a 4-core database with SSD storage, that is (4 * 2) + 4 = 12.
But this varies by workload. OLTP workloads with short transactions can often use larger pools. Analytical workloads with long transactions need smaller pools. Monitor connection utilization to tune further.
Cloud vs On-Premises Tradeoffs
The cloud versus on-premises decision affects capacity planning significantly.
Cloud gives you elasticity and variety—you can scale up when needed and choose instance types optimized for compute, memory, or I/O. Managed services like RDS and Cloud SQL handle some operational burden. But at large scale, cloud database costs become significant. A $50,000/month database in the cloud might cost $10,000/month on-premises. You also have less control over kernel parameters and storage optimization, and IOPS often costs extra.
Many organizations use a hybrid approach: core production data on-premises or in a private cloud, with burst capacity or read replicas in public cloud for traffic spikes.
Monitoring Trends to Predict Capacity Needs
Capacity planning is not a one-time exercise. You need ongoing monitoring to predict when you will need to scale.
Key Metrics to Track
- Disk utilization: If consistently above 70-80%, plan storage expansion.
- CPU utilization: Sustained above 70% indicates underpowered CPU.
- Memory pressure: Check for swap usage, which indicates insufficient RAM.
- Connection counts: Watch for approaching
max_connections. - Query latency trends: Slow queries often precede capacity crises.
Capacity Triggers
Establish thresholds that trigger review: disk usage above 70%, CPU above 80% sustained, memory pressure indicators, or connection counts above 70% of max.
Practical Capacity Planning Process
flowchart TD
Start["Start: New workload or growth trigger"] --> Measure["1. Measure current:<br/>Disk, CPU, Memory,<br/>Connections, IOPS"]
Measure --> Characterize["2. Characterize workload:<br/>Read/write ratio?<br/>Query complexity?<br/>Peak patterns?"]
Characterize --> Project["3. Project growth:<br/>3, 6, 12 month<br/>trajectory"]
Project --> Size["4. Size resources:<br/>Compute, Storage,<br/>Memory, IOPS"]
Size --> Buffer["5. Add 30-50% buffer:<br/>Growth headroom +<br/>Traffic spikes"]
Buffer --> Monitor["6. Monitor trends<br/>Quarterly review:<br/>Adjust projections"]
Monitor -->|<--|Adjust| Project
A practical approach to capacity planning: baseline your current usage, characterize your workload, project growth, add buffer, and monitor trends. Update projections quarterly.
Production Failure Scenarios
| Failure | Cause | Mitigation |
|---|---|---|
| Disk exhaustion mid-operation | Underestimated data growth, no monitoring | Track disk weekly, trigger review at 60% |
| CPU saturation during peak | Undersized compute for query complexity | Profile query CPU, add read replicas before peaks |
| Memory OOM crashes | shared_buffers + work_mem undersized | Size effective_cache_size at 75% of RAM |
| Connection pool exhaustion | max_connections too low | Deploy PgBouncer, right-size pool |
| IOPS throttling | Cloud volume IOPS limits hit | Choose appropriate volume types |
Real-World Case Study: GitHub’s Capacity Planning Failures
GitHub has published extensively about their PostgreSQL scaling journey, including several capacity planning failures that cost them significant engineering time and caused service disruptions.
Failure 1: The storage exhaustion cascade
In 2016, GitHub’s PostgreSQL database cluster approached disk capacity on their main database servers. Binary logging enabled for replication consumed more disk than anticipated, an unintended backup retention policy accumulated old binary logs, and growth had been underestimated by 3x. When the primary disk reached 97% capacity, PostgreSQL began experiencing write slowdowns. Their alerting was set at 90% capacity—giving only 7% headroom at a growth rate that consumed that headroom in hours. The incident resulted in 18 hours of degraded service, including periods where the website was read-only.
What they changed: GitHub implemented a capacity planning process tracking disk usage week-over-week, with alerts at 70% and 85% thresholds. They separated WAL volume onto a dedicated volume to isolate backup I/O from production I/O. They also built an internal tool called “Bloat” that tracks waste as a percentage of total storage.
Failure 2: Connection pool exhaustion at 10,000 repositories
As GitHub scaled from thousands to millions of repositories, their connection pooling configuration (PgBouncer) became a bottleneck. They had sized their pool for peak connection rates at small scale, but as the number of repositories grew, simultaneous connections from various internal services grew super-linearly. At 10,000 repositories, their connection pool exhausted its maximum and began refusing connections.
What they changed: GitHub moved to a connection-per-worker model where each foreground service maintains its own connection, and background jobs use a separate pool with lower limits. They also implemented pg_stat_activity monitoring to track connection states and identify leaks early.
The lesson: Capacity planning failures compound. A wrong growth estimate cascades into alerting gaps, monitoring blind spots, and configuration assumptions that worked at small scale but fail at production scale.
Quick Recap Checklist
- Size CPU at 4-8 cores for general workloads; scale up when CPU exceeds 70-80%
- Set
effective_cache_sizeto 75% of RAM for query planner optimization - Set
shared_buffersto 8-16GB on a 32GB server (25% is often too high) work_memshould be 4-10MB per concurrent sort/hash operationmaintenance_work_memshould be 512MB-2GB for index builds and vacuum- Monitor disk usage weekly; trigger capacity review at 70%, expand at 80%
- Use SSD/NVMe storage for OLTP workloads — HDD cannot handle high IOPS
- Connection pool size formula:
(core_count * 2) + effective_io_concurrency - Add 30-50% headroom above projected storage need for bloat and growth
- Track disk growth week-over-week to establish accurate growth rate
- Separate WAL onto dedicated volume to isolate backup I/O from production
- Monitor connection utilization — alert at 70% of max_connections
- Use PgBouncer or similar for connection pooling in production
- Plan compute and storage upgrades when metrics exceed 70-80% sustained
- Review and adjust capacity quarterly based on actual growth patterns
Related Posts
- Database Scaling Strategies - Horizontal and vertical scaling approaches
- Database Monitoring - Key metrics to track
- Cost Optimization - Optimizing database costs
Interview Questions
The critical path is time-to-capacity: 400GB used, 100GB free, 10GB/day growth means 10 days before disk is full. This is already critical. Immediate action: provision additional storage within 48 hours. For the new customer, calculate: 50% more data volume means 200GB additional at current growth rate, plus the 10GB/day ongoing growth. That's 300GB+ needed within 30-60 days. My plan: Week 1: provision storage to get to 200GB headroom (100GB immediate + 100GB for 10 days growth). Week 2: implement or verify monitoring on table-level storage consumption to identify bloat or unnecessary data. Month 1: provision the larger storage tier accounting for new customer load (add ~300GB). Month 2-3: evaluate read replicas or archiving strategies for older data to reduce primary disk load.
The standard formula is: pool_size = (core_count * 2) + effective_spindle_count. On a 4-core server with an NVMe SSD: pool_size = (4 × 2) + 4 = 12. This accounts for CPU cores (parallelism) and I/O concurrency (spindle count). For PostgreSQL specifically, the key constraint is max_connections (global limit) and per-query parallelism settings. Set max_connections high enough to handle your pool size plus superuser_reserved_connections (typically 3). Monitor actual connection usage with pg_stat_activity — if you consistently see connections idle in transaction for long periods, your pool is too large or your application is holding connections improperly. For applications with high connection churn (serverless, microservices), use connection poolers like PgBouncer or PgCat to multiplex hundreds of application connections over a smaller number of actual PostgreSQL connections.
IOPS (Input/Output Operations Per Second) measures the number of discrete I/O operations per second—critical for random read/write patterns like database page accesses. Throughput (MB/s) measures data transfer rate—critical for sequential operations like full table scans or bulk loads. A typical 7200 RPM spinning disk delivers ~100 IOPS and ~100 MB/s throughput. An NVMe SSD delivers ~500K IOPS and ~3 GB/s throughput. For a write-heavy PostgreSQL database: calculate your write IOPS requirement by understanding your wal_file_size generation rate (each 16MB WAL segment = 1 IOPS at minimum) and your write-ahead log write pattern. A database generating 1GB of WAL per minute at 16MB per WAL file = 64 WAL writes per second. NVMe is strongly recommended for any PostgreSQL write workload above 1,000 IOPS.
Three-step approach: First, establish the growth curve. Pull 90 days of historical data from pg_stat_database (tup_returned, tup_fetched, tup_inserted, tup_updated, tup_deleted) and pg_stat_bgwriter to understand write volume and query patterns. Calculate growth rate per week. Second, model the growth curve: linear growth (e.g., 1TB to 5TB then to 10TB = ~5% per week), exponential growth (doubling periods), or step-function growth (acquisition-driven). For 1TB to 10TB in 24 months, average growth is ~375GB/month or ~12.5GB/day. Third, plan provisioning triggers: at 70% of current storage tier (e.g., provision new tier at 700GB for 1TB volume), provision the next tier 60 days before you hit 90% of current tier. Budget for 3x the storage you think you need in year 2 because growth often accelerates.
Managed database services change capacity planning in several ways: (1) Storage is elastic but IOPS may be tied to instance size — you need to verify IOPS limits match your workload; (2) Memory management changes — AWS manages shared_buffers and you can't tune it directly; (3) CPU credits apply to burstable instances — check if your workload sustains high CPU or relies on bursts; (4) Network limits tied to instance type — larger instances have higher network throughput; (5) Connection pooling is still your responsibility (RDS Proxy helps but doesn't eliminate PgBouncer need); (6) Backup and maintenance windows are managed but affect availability; (7) Max connections scales with instance size — plan for connection pooling regardless. For RDS specifically: monitor CloudWatch metrics for CPU, memory, storage, and IOPS; set up alerts before hitting instance limits. Use read replicas for read scaling — they handle read traffic but not write scaling.
Query time differences between dev and production at similar data volumes usually indicate: (1) Insufficient memory — query is spilling to disk due to work_mem being too low; check work_mem setting and increase per-query memory allocation; (2) Different query plan — dev might have fresh statistics while production has stale ones; run ANALYZE on production; (3) Shared buffers — production cache hit ratio might be low because dev fits in memory while production doesn't; (4) Connection pool configuration — production might have connection queueing while dev has direct connection; (5) Network latency — prod might be cross-region or higher network latency; (6) CPU contention — prod might have CPU contention from other processes; (7) Disk I/O — production disks might be slower (cloud volumes vs local SSD) or saturated. The first diagnostic step is EXPLAIN (ANALYZE, BUFFERS) on both systems to compare actual execution plans and cache hit rates.
Base calculation: 100GB/day × 90 days = 9TB base storage. Add factors: (1) Index overhead — typically 20-30% additional storage for indexes; (2) Write amplification — VACUUM generates dead tuples; plan for 10-20% dead tuple headroom during high-write periods; (3) WAL volume — writes generate WAL at roughly 25-50% of data volume; allocate separate storage for WAL if using replication; (4) Growth rate: if data grows 10% month-over-month, the 90th day load is ~30% higher than day 1; (5) Temp space — sorting and hash joins use temp files; allocate 10-20% for temp space. Total: ~14-16TB at day 90. Provision storage to reach 80% at 60 days (giving 30 days headroom). Cloud: consider automated scaling but verify scaling speed — some cloud volumes take minutes to hours to expand.
effective_cache_size and why does it matter for query performance?effective_cache_size is a PostgreSQL configuration that tells the query planner how much memory is available for caching. It does not allocate memory — it is a hint for the planner. Set it to 75% of your total RAM. For a 32GB server: effective_cache_size = 24GB. Why it matters: the planner uses this value to estimate the cost of different join strategies and scan methods. With a higher effective_cache_size, the planner knows it can cache more data and will prefer index scans over sequential scans for larger tables. Set it too low and the planner assumes more disk I/O, potentially choosing slower plans. Set it too high and the planner might underestimate I/O cost for very large tables. The planner also considers shared_buffers and OS cache — effective_cache_size represents total cache available including OS-level file system cache.
work_mem, maintenance_work_mem, and query performance?work_mem is the memory per-operator (sort, hash join, hash aggregate) for query execution. Too low causes disk spilling (sort to temp files, hash joins to disk). Set per-session or per-query, not globally too high or you'll run out of memory under concurrent load. Formula: available_memory = total_RAM - shared_buffers - OS_overhead. For a 32GB server with 8GB shared_buffers: ~24GB available. If you expect 50 concurrent queries, set work_mem to ~400MB (24GB / 50 = 480MB) as a starting point. maintenance_work_mem is for maintenance operations (VACUUM, CREATE INDEX, REINDEX, ANALYZE). Set to 1-2GB on a 32GB server for faster index builds and vacuum. Sort operations during REINDEX benefit from higher maintenance_work_mem. Key difference: work_mem is per-operation and resets after each operator; maintenance_work_mem is per-session and lasts through the maintenance operation.
High cache hit ratio with high latency is a classic sign of a bottleneck other than memory. Diagnosis order: (1) Check CPU — high CPU usage means query complexity is the bottleneck, not data access; (2) Check I/O latency — iostat on the host shows disk queue depth and latency; if latency > 1ms for SSDs, the disk might be saturated; (3) Check network — if the database is remote, network latency adds to query time; (4) Check lock contention — SELECT * FROM pg_stat_activity WHERE wait_event_type IS NOT NULL shows queries waiting for locks; (5) Check connection queueing — if max_connections is near limit, queries wait for a connection; (6) Check for long-running queries — pg_stat_activity shows queries with high runtime. The question to ask: where does time go? Use EXPLAIN (ANALYZE, BUFFERS) on representative slow queries to identify which operator (sort, hash, seqscan) is taking the most time.
The bgwriter (background writer) is responsible for writing dirty buffers to disk. Key settings: bgwriter_delay (default 200ms) — how often bgwriter wakes to write buffers; bgwriter_lru_maxpages (default 100) — max buffers bgwriter can write per round; bgwriter_lru_multiplier (default 2.0) — how aggressively bgwriter writes ahead of demand. Tuning needed when: pg_stat_bgwriter shows high buffers_backend (backend processes writing directly) and buffers_backend_fsync (backends waiting for fsync). High buffers_backend means the bgwriter can't keep up — increase bgwriter_lru_maxpages or decrease bgwriter_delay. High buffers_backend_fsync means disk can't keep up with write demand — the issue is disk I/O, not bgwriter tuning. For write-heavy OLTP workloads, lowering bgwriter_delay to 50-100ms and raising bgwriter_lru_maxpages to 1000 often helps. For mostly-read workloads, the defaults are fine.
Three-year provisioning strategy: Year 1: Start with 2TB storage (1TB data + 1TB headroom/growth). Monitor growth monthly; provision additional storage when usage exceeds 70%. Year 2: At 1TB year-1 growth rate, provision to 4-5TB. Add read replicas if read traffic has grown. Consider partitioning large tables by time to improve query performance and enable archival. Year 3: At this scale (10TB+), evaluate sharding or moving hot data to a faster tier. Ongoing: monitor week-over-week growth rate; provisioning cycle should be 60-90 days ahead of consumption. Key metrics to track: disk growth per week, query latency trends, cache hit ratio. Tools: Grafana dashboards showing disk usage over time with projection lines. Budget: cloud storage at scale is significant cost — consider lifecycle policies to archive old data to cheaper storage.
Replication adds several capacity considerations: (1) Network bandwidth — replica must receive all WAL; for a 10GB/day write database, replica needs 10GB/day network throughput; (2) Storage on replica — replica stores the same data as primary; for 10TB primary, you need 10TB replica; (3) WAL accumulation — if replica falls behind, WAL files accumulate on primary; primary disk can fill up if replica is down for hours; (4) Replica apply lag — replica applies WAL sequentially; large transactions cause replica lag; (5) Connection capacity — max_connections on replica must handle read traffic; (6) Memory pressure — replicas have same shared_buffers as primary; under read load, memory pressure differs. Best practice: size replica with same or slightly less compute than primary (primary handles writes, replica handles reads); ensure network between primary and replica has sufficient bandwidth for WAL streaming; monitor pg_stat_replication for lag.
max_connections setting affect capacity planning?max_connections is the absolute ceiling on database connections. Each connection consumes memory (5-10MB per connection in work_mem) and competes for CPU. A database with 1000 idle connections is less efficient than one with 100 busy connections. Setting max_connections too high: memory pressure from connection overhead can cause swap or OOM. Setting it too low: application gets "too many connections" errors. Capacity planning for max_connections: (1) Determine pool_size using (core_count * 2) + effective_io_concurrency; (2) Set max_connections to pool_size + superuser_reserved_connections (typically 3) + overhead for admin connections; (3) Monitor actual connection usage — if consistently above 80% of max, either increase max_connections or reduce usage via pooling. Cloud RDS limits max_connections by instance size (e.g., db.r5.large has 4500 max connections). Plan for connection pooling regardless — never use direct connections at high scale.
Logical replication capacity considerations: (1) CPU overhead — logical replication runs as background workers that decode WAL and apply changes; each publisher and subscriber consumes CPU; (2) Network — logical replication sends row-level changes, not pages; for write-heavy workloads, network usage can be higher than physical replication; (3) Disk — the wal_sender and wal_receiver processes generate temporary files during large transactions; (4) Memory — subscriptions maintain replication state in memory; large tables with many changes can increase memory usage; (5) Schema changes — logical replication does not replicate DDL changes automatically; each tenant schema update must be applied manually or via separate mechanism; (6) Latency — logical replication is asynchronous with typical lag of milliseconds to seconds; not suitable for synchronous cross-region writes. For multi-tenant: each tenant as a separate publication provides isolation but increases overhead; consider row-level security for simpler multi-tenant rather than separate publications.
Backup and DR add capacity considerations: (1) Backup storage — full pg_dump or base backup + WAL takes storage; for a 10TB database, full backups are 10TB each; maintain multiple point-in-time recovery images; (2) Backup I/O — pg_basebackup during production hours competes with normal I/O; schedule during low-traffic windows; (3) WAL volume — PITR requires WAL archiving; WAL generation rate × retention period = WAL storage needed; a database doing 100GB/day writes might generate 30-50GB/day WAL; (4) Restoration time — for DR, restoration speed matters; a 10TB database might take 2-4 hours to restore; this affects RTO; (5) Network for DR copies — copying backups to offsite storage uses bandwidth; (6) Point-in-time recovery testing — needs capacity on DR site to run the restored database. Plan: allocate backup storage at 2-3x the database size for retention; ensure WAL volume is accounted for in storage provisioning; test restore times annually.
Partitioning helps capacity planning in several ways: (1) Improves query performance on large tables by limiting scans to relevant partitions — reduces I/O and memory pressure; (2) Enables data lifecycle management — older partitions can be moved to cheaper storage or archived; (3) Makes vacuum and indexing more efficient — each partition is vacuumed separately, reducing autovacuum overhead; (4) Improves availability — a corrupted partition doesn't affect the entire table; (5) Simplifies capacity planning — you can size each partition independently based on expected data growth for that time period. Trade-offs: more partitions mean more partition management overhead, more entries in pg_class, and potentially more planning overhead for queries that span many partitions. For time-series data (common in capacity planning), range partitioning by month or quarter is standard. For multi-tenant applications, list partitioning by tenant can isolate tenants but adds management complexity.
random_page_cost parameter affect capacity planning decisions?random_page_cost is the planner's estimate of the cost of a non-sequential page fetch from disk. Default is 4.0 (assuming disk is 4x slower than sequential). For SSD/NVMe storage, you should set random_page_cost to 1.1 or close to seq_page_cost (1.0). Why it matters: the planner uses this value to decide between index scans and sequential scans. If random_page_cost is set too high (appropriate for spinning disks but not for SSD), the planner may prefer sequential scans even when indexes would be faster. Setting it correctly for SSD: ALTER SYSTEM SET random_page_cost = 1.1. This tells the planner that random access is only slightly more expensive than sequential, so it will use indexes more aggressively. For capacity planning: if you have mixed storage (SSD for hot data, HDD for cold data), use tablespaces to separate them and set random_page_cost per tablespace.
Staged scaling approach: Month 1-3 (current scale): Single primary database with vertical scaling (instance size up to largest practical). Optimize queries, add connection pooling, tune autovacuum. Start read replica for read scaling. Month 4-6 (10x scale): Add read replicas (2-3 replicas for read traffic). Begin vertical scaling toward max instance size. Implement caching layer (Redis) for hot data. Consider table partitioning for large tables. Month 7-9 (50x scale): Evaluate sharding if vertical scaling is maxed. Implement read/write splitting at the application layer. Consider moving to a distributed database (CockroachDB, Spanner) if globally distributed. Month 10-12 (100x scale): Sharding is likely necessary. Evaluate managed database services that auto-scale (Amazon Aurora, CockroachDB). Implement multi-region deployment if global users. Key throughout: monitor growth weekly, not monthly. The fastest path to failure is assuming linear growth when growth is actually exponential.
Leading indicators (before hard limits hit): (1) Disk growth rate — track week-over-week growth; if trending toward filling current storage in < 60 days, start procurement; (2) Connection count trend — if growing toward 70% of max_connections, increase pooling or max_connections; (3) Query latency p99 trend — if increasing week-over-week, investigate before it becomes critical; (4) Cache hit ratio decline — gradual decline indicates working set growing beyond cache; (5) Replication lag growth — indicates replica can't keep up with primary write rate; (6) CPU utilization trend — if consistently above 70%, plan upgrade; (7) Autovacuum queue — if tables consistently have high n_dead_tup between autovacuum runs, autovacuum is overwhelmed. Set up Grafana dashboards tracking these metrics with 30-day trends and projections. Alert on rate of change, not just absolute values.
Further Reading
- PostgreSQL Resource Configuration — Memory and connection settings
- Disk Usage Monitoring — Understanding storage consumption
- PgBouncer — Connection pooler for PostgreSQL
- pg_activity — Top-like activity monitoring for PostgreSQL
- PostgreSQL Monitoring Best Practices — DataDog’s comprehensive guide
Conclusion
Capacity planning is continuous, not a one-time event. The goal is not to predict perfectly—it is to have enough visibility into trends that you are not surprised.
Build monitoring into your routine. Know your growth rate. Understand your workload characteristics. Make capacity decisions based on data, not guesswork.
Good capacity planning happens before anyone notices a problem. You want boring infrastructure—the kind that simply works, reliably, as load grows.
Category
Related Posts
Database Backup Strategies: Full, Incremental, and PITR
Learn database backup strategies: full, incremental, and differential backups. Point-in-time recovery, WAL archiving, and RTO/RPO planning.
Connection Pooling: HikariCP, pgBouncer, and ProxySQL
Learn connection pool sizing, HikariCP, pgBouncer, and ProxySQL, timeout settings, idle management, and when pooling helps or hurts performance.
Failover Automation
Automatic failover patterns. Health checks, failure detection, split-brain prevention, and DNS TTL management during database failover.