Database Capacity Planning: A Practical Guide

Plan for growth before you hit walls. This guide covers growth forecasting, compute and storage sizing, IOPS requirements, and cloud vs on-prem decisions.

published: March 26, 2026 reading time: 33 min read author: GeekWorkBench updated: June 17, 2026

Quick Summary

Database capacity planning prevents dramatic failures by sizing infrastructure for current needs and future growth. This guide covers growth forecasting, compute and storage sizing, IOPS requirements, and connection pool sizing. The practical approach: baseline your current usage, characterize your workload, project growth, add 30-50% buffer, and monitor trends quarterly. Many failures compound from wrong growth estimates cascading into alerting gaps and monitoring blind spots.

Database Capacity Planning: A Practical Guide

Capacity planning is the unsexy work that prevents dramatic failures. Nobody writes blog posts about databases that had enough capacity. They write posts about databases that ran out of disk at 3am, or whose CPU maxed out during a product launch, or that fell over because connection pools were exhausted.

This guide covers how to size database infrastructure for current needs and future growth, with practical formulas and monitoring approaches.

Growth Forecasting

Before you can plan capacity, you need to understand growth. How is your data growing? How are your users growing? What about query volume?

Data Growth Modeling

Start with your current data volume. How large is your database today?

To get an accurate baseline, run these queries against your PostgreSQL instance:

-- Database total size
SELECT pg_database.datname,
       pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;

-- Per-table size including indexes
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;

Once you have your baseline, project forward. Ask: what is my growth rate per week or per month? Are you adding users, adding data per user, or both? A SaaS application growing 15% month-over-month might need 3x storage in 12 months. A logging system might grow 50% per week and need 10x in the same timeframe. The growth model drives everything else in this guide.

Workload Characterization

Data volume alone does not tell the whole story. A database with 100GB might be perfectly fine with modest resources, or it might be hopelessly underpowered depending on query patterns. Characterize your workload across three dimensions:

Read vs. Write ratio. OLTP workloads typically run 80-90% reads with short transactions. Analytical workloads might be 50-50 or write-heavy during ETL windows. Run this query to check your recent write volume:

SELECT sum(tup_inserted + tup_updated + tup_deleted) AS total_writes
FROM pg_stat_database
WHERE datname = current_database();

Query complexity. Look at pg_stat_statements to see which queries consume the most time and CPU. Are your top queries simple point-lookups (fast, low CPU) or do you have multi-way joins across large tables (slow, high CPU)?

Peak patterns. When do your traffic spikes occur? A database that handles batch jobs at midnight needs different resource margins than one serving interactive users during business hours. Check pg_stat_activity over a 24-hour window to map your query concurrency patterns.

Sizing Compute (CPU)

CPU sizing depends on query complexity and concurrency requirements. A database running simple CRUD operations needs less CPU than one running complex analytical queries.

For CPU, focus on single-threaded performance. PostgreSQL cannot parallelize a single query across multiple cores for most operations (though parallel query exists, it has limits). A database with 8 fast cores is often better than one with 16 slow cores.

Rule of thumb for general workloads: start with 4-8 cores for a production database and scale up based on monitoring. If you consistently see CPU above 70-80%, that is a signal to upgrade.

Sizing Memory

Memory is where PostgreSQL caches data and does its work. More memory means fewer disk I/O operations, which are orders of magnitude slower than memory access.

The key metric is effective cache size:

SHOW effective_cache_size;

This setting tells PostgreSQL how much memory is available for caching. A reasonable starting point is 75% of your total RAM. If you have 32GB RAM, set effective_cache_size to 24GB.

PostgreSQL’s shared_buffers default of 128MB is far too small for production. For a server with 32GB RAM, shared_buffers of 8-16GB is reasonable.

work_mem is per-operation (sort, hash) and maintenance_work_mem is for maintenance operations like VACUUM and CREATE INDEX. Both should be sized appropriately for your workload.

Sizing Storage

Storage sizing requires estimating current data size, growth rate, and headroom.

Current and Projected Size

Calculate current table and index sizes using the queries from the Data Growth Modeling section above. Then project forward by estimating your growth rate.

Growth rates depend on your application type. A typical pattern:

Application Type	Growth Rate	Notes
SaaS with expanding user base	10-20% per month	Users + data per user compound
Content/media platform	30-50% per month	Media-heavy, lower per-user overhead
IoT/data ingestion	50-100% per month	High write volume, minimal reads
Gaming/social platform	15-30% per month	Viral growth phases common

Once you have your monthly growth rate, calculate how much space you will need in 3, 6, and 12 months. Add 30-50% headroom for bloat, VACUUM overhead, and unexpected spikes. For example: a 500GB database growing at 20% per month needs roughly 1.2TB in 6 months and 3TB in 12 months. Plan storage expansions at 70% utilization to give yourself time to provision.

Storage Type Considerations

Storage type selection determines your database’s I/O ceiling. NVMe/SSD, HDD, and network-attached storage each serve different workload profiles, and the choice comes down to tradeoffs between throughput, latency, cost, and capacity.

NVMe and SSD hit the lowest latency and highest IOPS. NVMe drives talk to the CPU over PCIe directly, skipping the SATA controller bottleneck. A modern NVMe drive pushes 500K+ IOPS with sub-millisecond latency. A spinning disk barely clears ~100 IOPS at 5-10ms latency. For a PostgreSQL database handling 5,000 transactions per second, spinning disk storage hits a hard ceiling—NVMe removes that bottleneck entirely. The cost per GB runs higher than HDD, but NVMe prices have fallen hard; a 1TB NVMe drive goes for under $100 now.

HDD still makes sense for specific workloads. Large analytical databases running sequential scans across many terabytes benefit from HDD’s sequential read throughput and lower cost per GB. If your workload is read-heavy, mostly sequential, and capacity-hungry rather than IOPS-hungry, HDD can be 5-10x cheaper per GB. The problem is random I/O—database writes, index updates, and VACUUM activity all generate random I/O patterns that spinning disks handle poorly. Mixing sequential analytical scans with random OLTP I/O on the same HDD is a good way to watch your query times climb.

Network-attached storage in cloud environments adds latency. The path from database to storage goes through the network stack, costing you 0.5-2ms compared to local NVMe. For latency-sensitive OLTP, that matters. For analytical or batch workloads where queries run for minutes or hours, the added latency is effectively zero. Cloud storage also has throughput caps that can throttle large sequential scans.

Storage Type	IOPS (approx)	Latency	Cost/GB (2026)	Best For
NVMe (local)	500K-1M	<1ms	$0.08-0.12	OLTP, mixed read/write
SATA SSD	50K-100K	0.5-2ms	$0.05-0.10	General purpose
HDD (7200 RPM)	75-150	5-15ms	$0.02-0.04	Sequential analytics
Cloud network storage	Varies	1-5ms	$0.08-0.20	Cloud burst, large capacity

Cloud environments have their own gotchas. AWS gp3 volumes give you 3,000 IOPS baseline at 125 MB/s throughput but can run out of steam at high block sizes. gp2 volumes burst to 3,000 IOPS but degrade as you chew through the burst budget. Provisioned IOPS (io1/io2) volumes deliver guaranteed performance at 3-5x the cost of gp3. On Google Cloud, regional SSD performs best but adds network latency compared to zonal SSD. Always test storage performance with fio or pg_test_fsync before committing to a volume type in production.

For most new deployments, go with local NVMe if you can, or cloud SSD (gp3/io2) for managed environments. Reserve HDD for specific analytical workloads where sequential throughput clearly matters more than IOPS, and where the cost savings are real enough to justify the operational risk.

IOPS Requirements

IOPS (Input/Output Operations Per Second) measures storage throughput. Database workloads are I/O intensive, especially with poor cache hit rates. A database with a 90% cache hit rate still needs to write every transaction to disk through the WAL.

Calculate your required IOPS by profiling your actual workload with pg_io_metrics (PostgreSQL 17+) or by instrumenting with pg_stat_bgwriter. The key metrics to capture:

WAL writes per second: each 16MB WAL file generates 1 write IOPS minimum
Random read IOPS: driven by cache misses on large table scans
Random write IOPS: driven by index updates and VACUUM activity

For OLTP workloads, estimate your IOPS requirement as: (transactions_per_second * WAL_size_per_transaction) + (index_scans * random_read_cost). A conservative baseline for a 5,000 TPS PostgreSQL database with 16KB per transaction and moderate indexing: ~800-1,200 IOPS at steady state. Burst traffic can spike 3-5x that.

Cloud providers specify IOPS for their storage volumes. AWS gp3 volumes deliver 3,000 IOPS baseline at 125 MB/s throughput. Provisioned IOPS (io2) volumes can reach 256,000 IOPS. Make sure your volume type supports your workload’s I/O requirements, and test with pg_test_fsync before going to production.

Connection Pool Sizing

PostgreSQL connections have overhead. Each connection uses memory (roughly 5-10MB per connection in work_mem) and competes for CPU. A database with 1000 idle connections is less efficient than one with 100 busy connections.

Direct Connection Limits

max_connections sets the absolute maximum number of PostgreSQL connections. Typical production values range from 100 to 1000 depending on workload. More is not always better—each idle connection consumes roughly 5-10MB of memory through work_mem allocation, and the PostgreSQL process scheduler must context-switch between connections even when they are idle.

The right max_connections value depends on three factors: your pooler size (pool connections count toward max), your expected peak concurrency (how many queries execute simultaneously), and your available memory (each connection can consume memory under load). For a 4-core database running a typical web application, max_connections = 200 is reasonable if you are using PgBouncer with default_pool_size = 20. The remaining headroom handles admin connections, replication connections, and burst traffic.

Setting max_connections too high relative to your available memory causes PostgreSQL to swap or OOM. Setting it too low causes “too many connections” errors that take down applications. If you find yourself consistently hitting the limit, add PgBouncer before raising max_connections. A pooler lets 200 application connections share 20 actual database connections.

Connection Poolers

For most production workloads, use a connection pooler like PgBouncer or Pgpool-II. Without a pooler, each application instance maintains its own database connections, and at scale this creates a problem: 50 application servers each opening 20 connections is 1,000 PostgreSQL connections, most of them idle at any given moment. A pooler lets 1,000 application connections share 20-50 actual database connections efficiently.

PgBouncer is the most common choice. It operates at transaction level by default, meaning a client connection only occupies a real PostgreSQL connection during an active transaction. This lets hundreds of clients share dozens of real connections. The downside: session-level features break. You cannot use prepared statements with the default prepared statement caching, and SET statements that persist across transactions do not work. For most web applications this is fine.

Pgpool-II offers more features including built-in query caching, load balancing across replicas, and session-level pooling. It adds complexity and higher memory overhead. Use it when you need automatic query routing or caching at the pooler layer.

PgCat is a newer option written in Rust, designed for high-performance connection multiplexing with less overhead than PgBouncer. It supports session and transaction modes and has lower memory footprint. Consider it for greenfield deployments on modern infrastructure.

The pooler you choose matters less than using one. Direct database connections at scale cause connection exhaustion, CPU overhead from context switching, and memory waste from idle connections.

Pool Size Calculation

A reasonable starting formula for default_pool_size: pool_size = (core_count * 2) + effective_io_concurrency. On a 4-core database with SSD storage, that is (4 * 2) + 4 = 12.

But this varies by workload. OLTP workloads with short transactions can often use larger pools. Analytical workloads with long transactions need smaller pools. Monitor connection utilization to tune further.

Cloud vs On-Premises Tradeoffs

The cloud versus on-premises decision affects capacity planning significantly.

Cloud gives you elasticity and variety—you can scale up when needed and choose instance types optimized for compute, memory, or I/O. Managed services like RDS and Cloud SQL handle some operational burden. But at large scale, cloud database costs become significant. A $50,000/month database in the cloud might cost $10,000/month on-premises. You also have less control over kernel parameters and storage optimization, and IOPS often costs extra.

Many organizations use a hybrid approach: core production data on-premises or in a private cloud, with burst capacity or read replicas in public cloud for traffic spikes.

Monitoring Trends to Predict Capacity Needs

Capacity planning is not a one-time exercise. You need ongoing monitoring to predict when you will need to scale.

Key Metrics to Track

Disk utilization: If consistently above 70-80%, plan storage expansion.
CPU utilization: Sustained above 70% indicates underpowered CPU.
Memory pressure: Check for swap usage, which indicates insufficient RAM.
Connection counts: Watch for approaching max_connections.
Query latency trends: Slow queries often precede capacity crises.

Capacity Triggers

Set thresholds that force a capacity review before you hit the wall. These differ from alerts—alerts tell you something is already broken, while triggers give you lead time to act.

For a production PostgreSQL database, these are the numbers I come back to:

Disk above 70%. At 70% disk usage, you have roughly 30% headroom. For a database growing 10GB per day, that headroom is gone in three days. Set your trigger at 70% so you have 60 days to provision new storage. When you hit 80%, you are in emergency mode—execute the plan you made at 70%.

CPU above 80% sustained. Brief spikes are normal. Sustained 80%+ means queries are competing for cycles and latency is degrading. Pull pg_stat_statements to check whether the load comes from a few bad queries or general growth. Bad queries: optimize. Growth: scale compute.

Memory pressure. Swap usage is the clearest signal. If the OS is swapping, shared_buffers and work_mem are undersized. Run vmstat and look for non-zero si values. Even before swap kicks in, low free RAM with heavy disk cache means I/O is suffering.

Connections above 70% of max. Connection exhaustion fails queries with a generic “too many connections” message that is painful to debug. At 70% of max, start evaluating whether to add PgBouncer or raise the limit. At 90%, you are one traffic spike away from outages.

p99 latency trending up. Slow queries are often the first warning of an approaching capacity problem. If your p99 latency is climbing week-over-week, investigate before it becomes a crisis. Set a baseline and alert when p99 exceeds 2x that baseline.

These thresholds compound. A database at 65% disk, 75% CPU, and 60% connections is fine. One at 72% disk, 82% CPU, and 73% connections is circling a failure where multiple resources hit their limits at once. That is the failure mode capacity planning prevents—not any single threshold in isolation.

Practical Capacity Planning Process

flowchart TD
    Start["Start: New workload or growth trigger"] --> Measure["1. Measure current:<br/>Disk, CPU, Memory,<br/>Connections, IOPS"]
    Measure --> Characterize["2. Characterize workload:<br/>Read/write ratio?<br/>Query complexity?<br/>Peak patterns?"]
    Characterize --> Project["3. Project growth:<br/>3, 6, 12 month<br/>trajectory"]
    Project --> Size["4. Size resources:<br/>Compute, Storage,<br/>Memory, IOPS"]
    Size --> Buffer["5. Add 30-50% buffer:<br/>Growth headroom +<br/>Traffic spikes"]
    Buffer --> Monitor["6. Monitor trends<br/>Quarterly review:<br/>Adjust projections"]
    Monitor -->|&lt;--|Adjust| Project

A practical approach to capacity planning: baseline your current usage, characterize your workload, project growth, add buffer, and monitor trends. Update projections quarterly.

Production Failure Scenarios

Failure	Cause	Mitigation
Disk exhaustion mid-operation	Underestimated data growth, no monitoring	Track disk weekly, trigger review at 60%
CPU saturation during peak	Undersized compute for query complexity	Profile query CPU, add read replicas before peaks
Memory OOM crashes	`shared_buffers` + `work_mem` undersized	Size `effective_cache_size` at 75% of RAM
Connection pool exhaustion	`max_connections` too low	Deploy PgBouncer, right-size pool
IOPS throttling	Cloud volume IOPS limits hit	Choose appropriate volume types

Real-World Case Study: GitHub’s Capacity Planning Failures

GitHub has published extensively about their PostgreSQL scaling journey, including several capacity planning failures that cost them significant engineering time and caused service disruptions.

Failure 1: The storage exhaustion cascade

In 2016, GitHub’s PostgreSQL database cluster approached disk capacity on their main database servers. Binary logging enabled for replication consumed more disk than anticipated, an unintended backup retention policy accumulated old binary logs, and growth had been underestimated by 3x. When the primary disk reached 97% capacity, PostgreSQL began experiencing write slowdowns. Their alerting was set at 90% capacity—giving only 7% headroom at a growth rate that consumed that headroom in hours. The incident resulted in 18 hours of degraded service, including periods where the website was read-only.

What they changed: GitHub implemented a capacity planning process tracking disk usage week-over-week, with alerts at 70% and 85% thresholds. They separated WAL volume onto a dedicated volume to isolate backup I/O from production I/O. They also built an internal tool called “Bloat” that tracks waste as a percentage of total storage.

Failure 2: Connection pool exhaustion at 10,000 repositories

As GitHub scaled from thousands to millions of repositories, their connection pooling configuration (PgBouncer) became a bottleneck. They had sized their pool for peak connection rates at small scale, but as the number of repositories grew, simultaneous connections from various internal services grew super-linearly. At 10,000 repositories, their connection pool exhausted its maximum and began refusing connections.

What they changed: GitHub moved to a connection-per-worker model where each foreground service maintains its own connection, and background jobs use a separate pool with lower limits. They also implemented pg_stat_activity monitoring to track connection states and identify leaks early.

The lesson: Capacity planning failures compound. A wrong growth estimate cascades into alerting gaps, monitoring blind spots, and configuration assumptions that worked at small scale but fail at production scale.

Quick Recap Checklist

Size CPU at 4-8 cores for general workloads; scale up when CPU exceeds 70-80%
Set effective_cache_size to 75% of RAM for query planner optimization
Set shared_buffers to 8-16GB on a 32GB server (25% is often too high)
work_mem should be 4-10MB per concurrent sort/hash operation
maintenance_work_mem should be 512MB-2GB for index builds and vacuum
Monitor disk usage weekly; trigger capacity review at 70%, expand at 80%
Use SSD/NVMe storage for OLTP workloads — HDD cannot handle high IOPS
Connection pool size formula: (core_count * 2) + effective_io_concurrency
Add 30-50% headroom above projected storage need for bloat and growth
Track disk growth week-over-week to establish accurate growth rate
Separate WAL onto dedicated volume to isolate backup I/O from production
Monitor connection utilization — alert at 70% of max_connections
Use PgBouncer or similar for connection pooling in production
Plan compute and storage upgrades when metrics exceed 70-80% sustained
Review and adjust capacity quarterly based on actual growth patterns

Database Scaling Strategies - Horizontal and vertical scaling approaches
Database Monitoring - Key metrics to track
Cost Optimization - Optimizing database costs

Interview Questions

1. Your PostgreSQL database is consuming 400GB of a 500GB disk. Growth rate is 10GB/day. A new customer onboarding next quarter will add 50% more data volume. What's your capacity plan?

The critical path is time-to-capacity: 400GB used, 100GB free, 10GB/day growth means 10 days before disk is full. This is already critical. Immediate action: provision additional storage within 48 hours. For the new customer, calculate: 50% more data volume means 200GB additional at current growth rate, plus the 10GB/day ongoing growth. That's 300GB+ needed within 30-60 days. My plan: Week 1: provision storage to get to 200GB headroom (100GB immediate + 100GB for 10 days growth). Week 2: implement or verify monitoring on table-level storage consumption to identify bloat or unnecessary data. Month 1: provision the larger storage tier accounting for new customer load (add ~300GB). Month 2-3: evaluate read replicas or archiving strategies for older data to reduce primary disk load.

2. How do you calculate the right number of connections for your connection pool?

The standard formula is: pool_size = (core_count * 2) + effective_spindle_count. On a 4-core server with an NVMe SSD: pool_size = (4 × 2) + 4 = 12. This accounts for CPU cores (parallelism) and I/O concurrency (spindle count). For PostgreSQL specifically, the key constraint is max_connections (global limit) and per-query parallelism settings. Set max_connections high enough to handle your pool size plus superuser_reserved_connections (typically 3). Monitor actual connection usage with pg_stat_activity — if you consistently see connections idle in transaction for long periods, your pool is too large or your application is holding connections improperly. For applications with high connection churn (serverless, microservices), use connection poolers like PgBouncer or PgCat to multiplex hundreds of application connections over a smaller number of actual PostgreSQL connections.

3. What's the difference between IOPS and throughput, and how do you size storage for a write-heavy database?

IOPS (Input/Output Operations Per Second) measures the number of discrete I/O operations per second—critical for random read/write patterns like database page accesses. Throughput (MB/s) measures data transfer rate—critical for sequential operations like full table scans or bulk loads. A typical 7200 RPM spinning disk delivers ~100 IOPS and ~100 MB/s throughput. An NVMe SSD delivers ~500K IOPS and ~3 GB/s throughput. For a write-heavy PostgreSQL database: calculate your write IOPS requirement by understanding your wal_file_size generation rate (each 16MB WAL segment = 1 IOPS at minimum) and your write-ahead log write pattern. A database generating 1GB of WAL per minute at 16MB per WAL file = 64 WAL writes per second. NVMe is strongly recommended for any PostgreSQL write workload above 1,000 IOPS.

4. You need to forecast capacity for a database that will grow from 1TB to 10TB over 2 years. How do you approach this?

Three-step approach: First, establish the growth curve. Pull 90 days of historical data from pg_stat_database (tup_returned, tup_fetched, tup_inserted, tup_updated, tup_deleted) and pg_stat_bgwriter to understand write volume and query patterns. Calculate growth rate per week. Second, model the growth curve: linear growth (e.g., 1TB to 5TB then to 10TB = ~5% per week), exponential growth (doubling periods), or step-function growth (acquisition-driven). For 1TB to 10TB in 24 months, average growth is ~375GB/month or ~12.5GB/day. Third, plan provisioning triggers: at 70% of current storage tier (e.g., provision new tier at 700GB for 1TB volume), provision the next tier 60 days before you hit 90% of current tier. Budget for 3x the storage you think you need in year 2 because growth often accelerates.

5. Your organization is moving from on-premises PostgreSQL to Amazon RDS. How does capacity planning change for a managed database service?

Managed database services change capacity planning in several ways: (1) Storage is elastic but IOPS may be tied to instance size — you need to verify IOPS limits match your workload; (2) Memory management changes — AWS manages shared_buffers and you can't tune it directly; (3) CPU credits apply to burstable instances — check if your workload sustains high CPU or relies on bursts; (4) Network limits tied to instance type — larger instances have higher network throughput; (5) Connection pooling is still your responsibility (RDS Proxy helps but doesn't eliminate PgBouncer need); (6) Backup and maintenance windows are managed but affect availability; (7) Max connections scales with instance size — plan for connection pooling regardless. For RDS specifically: monitor CloudWatch metrics for CPU, memory, storage, and IOPS; set up alerts before hitting instance limits. Use read replicas for read scaling — they handle read traffic but not write scaling.

6. A database query that runs in 100ms on a dev machine takes 10 seconds in production with similar data volumes. What capacity planning factors could explain this?

Query time differences between dev and production at similar data volumes usually indicate: (1) Insufficient memory — query is spilling to disk due to work_mem being too low; check work_mem setting and increase per-query memory allocation; (2) Different query plan — dev might have fresh statistics while production has stale ones; run ANALYZE on production; (3) Shared buffers — production cache hit ratio might be low because dev fits in memory while production doesn't; (4) Connection pool configuration — production might have connection queueing while dev has direct connection; (5) Network latency — prod might be cross-region or higher network latency; (6) CPU contention — prod might have CPU contention from other processes; (7) Disk I/O — production disks might be slower (cloud volumes vs local SSD) or saturated. The first diagnostic step is EXPLAIN (ANALYZE, BUFFERS) on both systems to compare actual execution plans and cache hit rates.

7. You need to size storage for a database that will receive 100GB of data per day. How do you plan for a 90-day retention period with growth?

Base calculation: 100GB/day × 90 days = 9TB base storage. Add factors: (1) Index overhead — typically 20-30% additional storage for indexes; (2) Write amplification — VACUUM generates dead tuples; plan for 10-20% dead tuple headroom during high-write periods; (3) WAL volume — writes generate WAL at roughly 25-50% of data volume; allocate separate storage for WAL if using replication; (4) Growth rate: if data grows 10% month-over-month, the 90th day load is ~30% higher than day 1; (5) Temp space — sorting and hash joins use temp files; allocate 10-20% for temp space. Total: ~14-16TB at day 90. Provision storage to reach 80% at 60 days (giving 30 days headroom). Cloud: consider automated scaling but verify scaling speed — some cloud volumes take minutes to hours to expand.

8. How do you size effective_cache_size and why does it matter for query performance?

effective_cache_size is a PostgreSQL configuration that tells the query planner how much memory is available for caching. It does not allocate memory — it is a hint for the planner. Set it to 75% of your total RAM. For a 32GB server: effective_cache_size = 24GB. Why it matters: the planner uses this value to estimate the cost of different join strategies and scan methods. With a higher effective_cache_size, the planner knows it can cache more data and will prefer index scans over sequential scans for larger tables. Set it too low and the planner assumes more disk I/O, potentially choosing slower plans. Set it too high and the planner might underestimate I/O cost for very large tables. The planner also considers shared_buffers and OS cache — effective_cache_size represents total cache available including OS-level file system cache.

9. What is the relationship between work_mem, maintenance_work_mem, and query performance?

work_mem is the memory per-operator (sort, hash join, hash aggregate) for query execution. Too low causes disk spilling (sort to temp files, hash joins to disk). Set per-session or per-query, not globally too high or you'll run out of memory under concurrent load. Formula: available_memory = total_RAM - shared_buffers - OS_overhead. For a 32GB server with 8GB shared_buffers: ~24GB available. If you expect 50 concurrent queries, set work_mem to ~400MB (24GB / 50 = 480MB) as a starting point. maintenance_work_mem is for maintenance operations (VACUUM, CREATE INDEX, REINDEX, ANALYZE). Set to 1-2GB on a 32GB server for faster index builds and vacuum. Sort operations during REINDEX benefit from higher maintenance_work_mem. Key difference: work_mem is per-operation and resets after each operator; maintenance_work_mem is per-session and lasts through the maintenance operation.

10. A database shows 99% cache hit ratio but p99 latency is high. How do you diagnose this?

High cache hit ratio with high latency is a classic sign of a bottleneck other than memory. Diagnosis order: (1) Check CPU — high CPU usage means query complexity is the bottleneck, not data access; (2) Check I/O latency — iostat on the host shows disk queue depth and latency; if latency > 1ms for SSDs, the disk might be saturated; (3) Check network — if the database is remote, network latency adds to query time; (4) Check lock contention — SELECT * FROM pg_stat_activity WHERE wait_event_type IS NOT NULL shows queries waiting for locks; (5) Check connection queueing — if max_connections is near limit, queries wait for a connection; (6) Check for long-running queries — pg_stat_activity shows queries with high runtime. The question to ask: where does time go? Use EXPLAIN (ANALYZE, BUFFERS) on representative slow queries to identify which operator (sort, hash, seqscan) is taking the most time.

11. How does PostgreSQL's bgwriter configuration affect write performance and when should you tune it?

The bgwriter (background writer) is responsible for writing dirty buffers to disk. Key settings: bgwriter_delay (default 200ms) — how often bgwriter wakes to write buffers; bgwriter_lru_maxpages (default 100) — max buffers bgwriter can write per round; bgwriter_lru_multiplier (default 2.0) — how aggressively bgwriter writes ahead of demand. Tuning needed when: pg_stat_bgwriter shows high buffers_backend (backend processes writing directly) and buffers_backend_fsync (backends waiting for fsync). High buffers_backend means the bgwriter can't keep up — increase bgwriter_lru_maxpages or decrease bgwriter_delay. High buffers_backend_fsync means disk can't keep up with write demand — the issue is disk I/O, not bgwriter tuning. For write-heavy OLTP workloads, lowering bgwriter_delay to 50-100ms and raising bgwriter_lru_maxpages to 1000 often helps. For mostly-read workloads, the defaults are fine.

12. You need to plan capacity for a database that will grow from 1TB to 50TB over 3 years. Walk through the provisioning strategy.

Three-year provisioning strategy: Year 1: Start with 2TB storage (1TB data + 1TB headroom/growth). Monitor growth monthly; provision additional storage when usage exceeds 70%. Year 2: At 1TB year-1 growth rate, provision to 4-5TB. Add read replicas if read traffic has grown. Consider partitioning large tables by time to improve query performance and enable archival. Year 3: At this scale (10TB+), evaluate sharding or moving hot data to a faster tier. Ongoing: monitor week-over-week growth rate; provisioning cycle should be 60-90 days ahead of consumption. Key metrics to track: disk growth per week, query latency trends, cache hit ratio. Tools: Grafana dashboards showing disk usage over time with projection lines. Budget: cloud storage at scale is significant cost — consider lifecycle policies to archive old data to cheaper storage.

13. What capacity planning considerations apply when enabling PostgreSQL streaming replication?

Replication adds several capacity considerations: (1) Network bandwidth — replica must receive all WAL; for a 10GB/day write database, replica needs 10GB/day network throughput; (2) Storage on replica — replica stores the same data as primary; for 10TB primary, you need 10TB replica; (3) WAL accumulation — if replica falls behind, WAL files accumulate on primary; primary disk can fill up if replica is down for hours; (4) Replica apply lag — replica applies WAL sequentially; large transactions cause replica lag; (5) Connection capacity — max_connections on replica must handle read traffic; (6) Memory pressure — replicas have same shared_buffers as primary; under read load, memory pressure differs. Best practice: size replica with same or slightly less compute than primary (primary handles writes, replica handles reads); ensure network between primary and replica has sufficient bandwidth for WAL streaming; monitor pg_stat_replication for lag.

14. How does changing the PostgreSQL max_connections setting affect capacity planning?

max_connections is the absolute ceiling on database connections. Each connection consumes memory (5-10MB per connection in work_mem) and competes for CPU. A database with 1000 idle connections is less efficient than one with 100 busy connections. Setting max_connections too high: memory pressure from connection overhead can cause swap or OOM. Setting it too low: application gets "too many connections" errors. Capacity planning for max_connections: (1) Determine pool_size using (core_count * 2) + effective_io_concurrency; (2) Set max_connections to pool_size + superuser_reserved_connections (typically 3) + overhead for admin connections; (3) Monitor actual connection usage — if consistently above 80% of max, either increase max_connections or reduce usage via pooling. Cloud RDS limits max_connections by instance size (e.g., db.r5.large has 4500 max connections). Plan for connection pooling regardless — never use direct connections at high scale.

15. You are planning to use PostgreSQL logical replication for a multi-tenant saas application. What capacity considerations apply?

Logical replication capacity considerations: (1) CPU overhead — logical replication runs as background workers that decode WAL and apply changes; each publisher and subscriber consumes CPU; (2) Network — logical replication sends row-level changes, not pages; for write-heavy workloads, network usage can be higher than physical replication; (3) Disk — the wal_sender and wal_receiver processes generate temporary files during large transactions; (4) Memory — subscriptions maintain replication state in memory; large tables with many changes can increase memory usage; (5) Schema changes — logical replication does not replicate DDL changes automatically; each tenant schema update must be applied manually or via separate mechanism; (6) Latency — logical replication is asynchronous with typical lag of milliseconds to seconds; not suitable for synchronous cross-region writes. For multi-tenant: each tenant as a separate publication provides isolation but increases overhead; consider row-level security for simpler multi-tenant rather than separate publications.

16. How do you factor backup and disaster recovery into capacity planning?

Backup and DR add capacity considerations: (1) Backup storage — full pg_dump or base backup + WAL takes storage; for a 10TB database, full backups are 10TB each; maintain multiple point-in-time recovery images; (2) Backup I/O — pg_basebackup during production hours competes with normal I/O; schedule during low-traffic windows; (3) WAL volume — PITR requires WAL archiving; WAL generation rate × retention period = WAL storage needed; a database doing 100GB/day writes might generate 30-50GB/day WAL; (4) Restoration time — for DR, restoration speed matters; a 10TB database might take 2-4 hours to restore; this affects RTO; (5) Network for DR copies — copying backups to offsite storage uses bandwidth; (6) Point-in-time recovery testing — needs capacity on DR site to run the restored database. Plan: allocate backup storage at 2-3x the database size for retention; ensure WAL volume is accounted for in storage provisioning; test restore times annually.

17. What is the relationship between table partitioning and capacity planning?

Partitioning helps capacity planning in several ways: (1) Improves query performance on large tables by limiting scans to relevant partitions — reduces I/O and memory pressure; (2) Enables data lifecycle management — older partitions can be moved to cheaper storage or archived; (3) Makes vacuum and indexing more efficient — each partition is vacuumed separately, reducing autovacuum overhead; (4) Improves availability — a corrupted partition doesn't affect the entire table; (5) Simplifies capacity planning — you can size each partition independently based on expected data growth for that time period. Trade-offs: more partitions mean more partition management overhead, more entries in pg_class, and potentially more planning overhead for queries that span many partitions. For time-series data (common in capacity planning), range partitioning by month or quarter is standard. For multi-tenant applications, list partitioning by tenant can isolate tenants but adds management complexity.

18. How does the random_page_cost parameter affect capacity planning decisions?

random_page_cost is the planner's estimate of the cost of a non-sequential page fetch from disk. Default is 4.0 (assuming disk is 4x slower than sequential). For SSD/NVMe storage, you should set random_page_cost to 1.1 or close to seq_page_cost (1.0). Why it matters: the planner uses this value to decide between index scans and sequential scans. If random_page_cost is set too high (appropriate for spinning disks but not for SSD), the planner may prefer sequential scans even when indexes would be faster. Setting it correctly for SSD: ALTER SYSTEM SET random_page_cost = 1.1. This tells the planner that random access is only slightly more expensive than sequential, so it will use indexes more aggressively. For capacity planning: if you have mixed storage (SSD for hot data, HDD for cold data), use tablespaces to separate them and set random_page_cost per tablespace.

19. A startup expects 100x growth in users over 12 months. How do you stage the database infrastructure scaling?

Staged scaling approach: Month 1-3 (current scale): Single primary database with vertical scaling (instance size up to largest practical). Optimize queries, add connection pooling, tune autovacuum. Start read replica for read scaling. Month 4-6 (10x scale): Add read replicas (2-3 replicas for read traffic). Begin vertical scaling toward max instance size. Implement caching layer (Redis) for hot data. Consider table partitioning for large tables. Month 7-9 (50x scale): Evaluate sharding if vertical scaling is maxed. Implement read/write splitting at the application layer. Consider moving to a distributed database (CockroachDB, Spanner) if globally distributed. Month 10-12 (100x scale): Sharding is likely necessary. Evaluate managed database services that auto-scale (Amazon Aurora, CockroachDB). Implement multi-region deployment if global users. Key throughout: monitor growth weekly, not monthly. The fastest path to failure is assuming linear growth when growth is actually exponential.

20. What capacity planning metrics should be tracked as leading indicators of impending capacity exhaustion?

Leading indicators (before hard limits hit): (1) Disk growth rate — track week-over-week growth; if trending toward filling current storage in < 60 days, start procurement; (2) Connection count trend — if growing toward 70% of max_connections, increase pooling or max_connections; (3) Query latency p99 trend — if increasing week-over-week, investigate before it becomes critical; (4) Cache hit ratio decline — gradual decline indicates working set growing beyond cache; (5) Replication lag growth — indicates replica can't keep up with primary write rate; (6) CPU utilization trend — if consistently above 70%, plan upgrade; (7) Autovacuum queue — if tables consistently have high n_dead_tup between autovacuum runs, autovacuum is overwhelmed. Set up Grafana dashboards tracking these metrics with 30-day trends and projections. Alert on rate of change, not just absolute values.

Conclusion

Capacity planning is continuous, not a one-time event. The goal is not to predict perfectly—it is to have enough visibility into trends that you are not surprised.

Build monitoring into your routine. Know your growth rate. Understand your workload characteristics. Make capacity decisions based on data, not guesswork.

Good capacity planning happens before anyone notices a problem. You want boring infrastructure—the kind that simply works, reliably, as load grows.

Database Capacity Planning: A Practical Guide

Growth Forecasting

Data Growth Modeling

Workload Characterization

Sizing Compute (CPU)

Sizing Memory

Sizing Storage

Current and Projected Size

Storage Type Considerations

IOPS Requirements

Connection Pool Sizing

Direct Connection Limits

Connection Poolers

Pool Size Calculation

Cloud vs On-Premises Tradeoffs

Monitoring Trends to Predict Capacity Needs

Key Metrics to Track

Capacity Triggers

Practical Capacity Planning Process

Production Failure Scenarios

Real-World Case Study: GitHub’s Capacity Planning Failures

Failure 1: The storage exhaustion cascade

Failure 2: Connection pool exhaustion at 10,000 repositories

Quick Recap Checklist

Related Posts

Interview Questions

Further Reading

Conclusion

Category

Tags

Related Posts

Database Backup Strategies: Full, Incremental, and PITR

Connection Pooling: HikariCP, pgBouncer, and ProxySQL

Failover Automation