Database Monitoring: Metrics, Tools, and Alerting
Keep your PostgreSQL database healthy with comprehensive monitoring. This guide covers query latency, connection usage, disk I/O, cache hit ratios, and alerting with pg_stat_statements and Prometheus.
Database Monitoring: Metrics, Tools, and Alerting
A database without monitoring is flying blind. You will not know you are out of disk space until writes start failing. You will not know queries are slowing down until users complain. You will not know replication lag is growing until you have a replica that is hours behind.
This guide covers the essential metrics to track, tools for collecting them, and how to set up alerting that actually helps rather than just creating noise.
Introduction
Database monitoring is how you know whether your database is healthy before users complain. Without it, you are reactive: you find out about problems when they become outages. With the right metrics and alerting, you find out before — disk filling up, replication lag growing, queries slowing down, connections approaching the limit.
This guide covers the essential metrics to track (query latency, connection usage, disk I/O, cache hit ratios, replication lag), the tools to collect them (pg_stat_statements, Prometheus exporters, pgAdmin), and how to configure alerting that surfaces real problems without generating noise. It also covers building dashboards that give you situational awareness at a glance.
Key Metrics to Track
Not all metrics are equally important. Focus on the ones that directly indicate user-facing problems or impending failures.
Query Latency
Query latency is the most user-visible metric. Slow queries mean slow page loads, frustrated users, and potential timeouts.
Track these latency percentiles:
- p50 (median): The typical response time
- p95: What 19 out of 20 requests experience
- p99: The worst experiences (excluding outliers)
-- Check query latency from pg_stat_statements
SELECT
query,
calls,
total_exec_time / calls AS avg_ms,
stddev_exec_time AS stddev_ms,
min_exec_time,
max_exec_time,
rows / calls AS avg_rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
The pg_stat_statements extension must be enabled:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
PostgreSQL’s pg_stat_statements reports total execution time, not per-call latency percentiles. For true percentiles, consider pg_stat_kcache or external tools.
Connection Usage
PostgreSQL has a finite connection limit. When connections are exhausted, new connections fail. This is a hard failure that affects users directly.
-- Current connection status
SELECT
state,
COUNT(*) AS count
FROM pg_stat_activity
GROUP BY state;
-- Connection limits and usage
SELECT
setting::int AS max_connections,
(SELECT COUNT(*) FROM pg_stat_activity) AS current_connections,
(SELECT COUNT(*) FROM pg_stat_activity WHERE state = 'active') AS active_connections;
Track connection usage as a percentage of max. If you are regularly above 70-80%, investigate why—either scale max_connections or reduce connection usage through pooling.
Disk I/O
Disk I/O bottlenecks are common database problems. Slow I/O means queries wait for data to come from disk rather than being served from memory.
-- I/O statistics
SELECT
schedinfo.io_read_count,
schedinfo.io_write_count,
schedinfo.io_read_ms,
schedinfo.io_write_ms
FROM pg_stat_activity act
JOIN pg_proc AS proc ON act.procpid = proc.oid
JOIN pg_catalog.pg_statio_user_tables AS statio ON statio.relid = act.query
LIMIT 1;
For aggregate I/O statistics:
SELECT
pg_stat_database.datname,
blks_read,
blks_hit,
ROUND(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();
A cache hit ratio below 95-99% indicates your working set does not fit in memory, or there are problematic queries scanning large amounts of data.
Buffer Cache Hit Ratio
The buffer cache hit ratio measures how often PostgreSQL finds data in memory versus on disk. A low ratio means lots of disk I/O.
-- System-wide buffer cache hit ratio
SELECT
pg_stat_database.datname,
pg_stat_database.blks_hit,
pg_stat_database.blks_read,
ROUND(100.0 * pg_stat_database.blks_hit /
NULLIF(pg_stat_database.blks_hit + pg_stat_database.blks_read, 0), 2) AS hit_ratio
FROM pg_stat_database
WHERE pg_stat_database.datname = current_database();
A hit ratio above 99% is excellent. Below 95% warrants investigation—either you need more memory, or queries are scanning too much data.
Replication Lag
If you are using streaming replication (read replicas), monitor lag carefully. A replica that falls behind provides stale data and takes load off the primary incorrectly.
-- Check replication lag on replica
SELECT
now() - pg_last_xact_replay_timestamp() AS replication_lag;
-- On primary, check replica status
SELECT
client_addr,
state,
sent_lsn - write_lsn AS sent_lag,
write_lsn - flush_lsn AS write_lag,
flush_lsn - replay_lsn AS replay_lag
FROM pg_stat_replication;
Lag is measured in bytes (LSN difference) and time. The time-based lag is more useful for user-facing impact. Replication lag above a few seconds can cause problems for applications that expect read-after-write consistency.
Slow Query Log Analysis
Identifying slow queries is the first step to fixing them. PostgreSQL can log slow queries automatically.
Configuring Slow Query Logging
-- Log queries taking longer than 100ms
ALTER SYSTEM SET log_min_duration_statement = 100;
SELECT pg_reload_conf();
This logs the query text, duration, and binding parameters. Review logs regularly to find candidates for optimization.
Analyzing Slow Queries
Once you have slow query data, analyze patterns:
-- Top slowest queries by average time
SELECT
LEFT(query, 100) AS query_preview,
calls,
total_exec_time / calls AS avg_ms,
max_exec_time,
rows / calls AS avg_rows
FROM pg_stat_statements
WHERE calls > 10
ORDER BY avg_ms DESC
LIMIT 20;
-- Most frequently called slow queries
SELECT
LEFT(query, 100) AS query_preview,
calls,
total_exec_time / calls AS avg_ms,
total_exec_time AS total_ms
FROM pg_stat_statements
WHERE calls > 100
ORDER BY total_exec_time DESC
LIMIT 20;
The first query finds queries that are slow even when called infrequently. The second finds queries called so often that even small slowdowns add up.
Query Execution Plans
For slow queries, examine the execution plan:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT *
FROM orders
WHERE user_id = 123
ORDER BY created_at DESC
LIMIT 10;
Key things to look for:
- Seq Scan on large tables: Often indicates missing indexes
- High actual rows vs estimated rows: Statistics are stale, run ANALYZE
- Nested Loop on large datasets: May benefit from hash join
- Sort nodes with high memory: May need work_mem increase
Monitoring Tools
PostgreSQL includes built-in stats. For production monitoring, you will want something more comprehensive.
pg_stat_statements
This is the essential tool for query-level visibility.
Enable it in postgresql.conf:
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = top
DataDog
DataDog provides PostgreSQL monitoring with minimal setup. The agent auto-discovers PostgreSQL metrics and provides dashboards out of the box.
Key DataDog metrics for PostgreSQL:
postgresql.connections: Current connectionspostgresql.rows_fetched: Rows returned by queriespostgresql.rows_inserted/updated/deleted: Write ratespostgresql.buffer_cache_hit_ratio: Cache effectivenesspostgresql.active_temporary_files: Temporary file creation (indicates memory pressure)
DataDog integrates well with other monitoring data—correlating database metrics with application traces, logs, and infrastructure metrics in one place.
Prometheus and Grafana
For a more DIY approach, Prometheus + Grafana provides powerful monitoring. The postgres_exporter scrapes PostgreSQL metrics for Prometheus.
# prometheus.yml
scrape_configs:
- job_name: "postgres"
static_configs:
- targets: ["localhost:9187"]
Common Prometheus queries for PostgreSQL:
# Connection utilization
postgres_connections{state="active"} / postgres_max_connections
# Cache hit ratio
rate(postgres_blks_hit_total[5m]) / (rate(postgres_blks_hit_total[5m]) + rate(postgres_blks_read_total[5m]))
# Slow query rate
rate(postgres_slow_query_seconds_total[5m])
Grafana dashboards visualize these metrics. The official PostgreSQL dashboard template provides a reasonable starting point.
pg_stat_monitor
pg_stat_monitor is an improved version of pg_stat_statements that provides histogram data, allowing true percentile calculations.
CREATE EXTENSION pg_stat_monitor;
SELECT pg_stat_monitor_settings();
-- Query with timing histogram
SELECT
query,
calls,
total_exec_time,
min_exec_time,
avg_exec_time,
max_exec_time,
p90_exec_time,
p95_exec_time,
p99_exec_time
FROM pg_stat_monitor()
ORDER BY total_exec_time DESC
LIMIT 20;
This gives you actual p90/p95/p99 latencies directly from the database, which is more useful than averages alone.
flowchart TD
subgraph DB["PostgreSQL Internals"]
SS[pg_stat_statements<br/>Query statistics]
SA[pg_stat_activity<br/>Connection states]
SB[pg_stat_bgwriter<br/>Write buffers]
SR[pg_stat_replication<br/>Replica lag]
end
subgraph Collection["Collection Layer"]
EX[postgres_exporter<br/>Prometheus metrics]
DD[DataDog Agent<br/>Custom metrics]
BC[pg_stat_monitor<br/>Percentile histograms]
end
subgraph Store["Storage & Visualization"]
PROM[Prometheus<br/>Time-series DB]
GRAF[Grafana<br/>Dashboards]
DD2[DataDog<br/>Dashboards + APM]
end
subgraph Alert["Alerting"]
AL[AlertManager<br/>Paging]
PN[PagerDuty<br/>On-call routing]
end
DB --> EX
DB --> DD
DB --> BC
SS --> EX
SA --> EX
SR --> EX
EX --> PROM
DD --> DD2
BC --> PROM
PROM --> GRAF
DD2 --> AL
PROM --> AL
AL --> PN
When to Use / When Not to Use
Use built-in PostgreSQL stats (pg_stat_statements, pg_stat_activity) when you need quick visibility without extra setup, during initial investigation of an issue, or when evaluating whether to adopt a heavier monitoring stack.
Use pg_stat_monitor over pg_stat_statements when you need true latency percentiles (p90/p95/p99) directly from the database, since pg_stat_statements only gives averages and totals.
Use DataDog or Prometheus+Grafana when you need to correlate database metrics with application traces, infrastructure metrics, and logs in a single view, or when managing multiple database instances across environments.
Use custom metrics exported to Prometheus when you have specific business-level SLIs (like order placement latency) that standard database exports do not capture.
Do not rely solely on averages — a database with a p99 of 5 seconds looks fine if you only track the mean. Always use percentile-based alerting.
Do not set up alert rules for every metric you track. Alert fatigue is real — if everything triggers an alert, nothing gets attention.
Setting Up Alerts for Thresholds
Alerts should notify you of problems, not create noise that trains you to ignore them. Good alerts are actionable and specific.
Critical Threshold Alerts
Disk space critical
SELECT
df.mountpoint,
df.device,
df.blocks - df.free AS used_blocks,
df.blocks AS total_blocks,
ROUND(100.0 * (df.blocks - df.free) / df.blocks) AS used_pct
FROM pg_catalog.pg_stat_device_stats df
WHERE df.mountpoint = '/data';
Alert threshold: 85% disk usage. At 90%, you might not have room for VACUUM or emergency operations.
Connection utilization high
Alert when connections exceed 80% of max. At this threshold, some capacity remains for burst traffic, but you should investigate.
Replication lag
Alert when replica lag exceeds 30 seconds. Beyond this, read-after-write consistency guarantees break down.
Long-running queries
Alert when queries exceed 60 seconds. These often indicate problems and can hold locks that block other queries.
Alert Design Principles
Alert Anti-Patterns
Avoid alerting on:
- CPU spikes that resolve quickly: Brief CPU spikes are normal during batch operations
- Single slow queries: Track these via slow query log, alert only if they are increasing
- Cache hit ratio drops: Brief fluctuations are normal; alert only on sustained drops
Alerting Best Practices
-
Set reasonable time windows: Alert only when a condition persists for 5+ minutes, not on momentary spikes.
-
Use multiple severity levels: Warning (investigate) vs Critical (respond now).
-
Include context in alerts: The alert should say what is wrong, which system, and when it started.
-
Alert on symptoms, not causes: “Database latency is high” is better than “Cache hit ratio is low”—the symptom is what affects users.
-
Review and tune regularly: If you are ignoring an alert, either fix the underlying issue or adjust the threshold.
Building a Monitoring Dashboard
A good database dashboard shows the key metrics at a glance:
- Connection count (current and trend)
- Query latency (p50, p95, p99)
- Transactions per second
- Cache hit ratio
- Replication lag (if applicable)
- Disk usage (current and trend)
- Active queries (top offenders)
- Locks (blocked queries)
Most monitoring tools provide templates. Build from there based on your specific alerts.
Production Failure Scenarios
| Failure | Cause | Mitigation |
|---|---|---|
| Monitoring blind spot during incident | Alerts set only on CPU/disk but not on query latency spike | Alert on p95/p99 latency and query error rate, not just infrastructure metrics |
| Alert fatigue desensitizes team | Hundreds of warnings fire daily for minor fluctuations | Raise thresholds, use composite alerts (multiple conditions must fire together) |
| pg_stat_statements query plan cached | Slow query captured with placeholder values prevents plan optimization | Use pg_stat_statements_reset() after fixing a query to clear stale entries |
| Monitoring dashboard lies about freshness | Metric export interval is 60s but dashboard refreshes every 5s | Verify metric scrape intervals match dashboard expectations |
| Lock monitoring missed cascading failures | pg_locks view not monitored, deadlocks cascade across tables | Monitor pg_stat_activity.state = 'active' combined with wait_event_type, set deadlock detection alerts |
Trade-Off Table: Monitoring Tools
| Dimension | pg_stat_statements | auto_explain | DataDog | Prometheus + Grafana |
|---|---|---|---|---|
| Setup effort | None (built-in) | Low | Medium | Medium-High |
| Latency percentiles | Averages only | N/A | Yes (histograms) | Yes (histograms) |
| Query plan analysis | No | Yes (actual plans) | Via integration | Via exporter |
| Cross-service correlation | No | No | Yes (APM + infra) | Yes (if metrics aligned) |
| Alerting | No (external tool needed) | No | Yes (built-in) | AlertManager required |
| Cost | Free | Free | Per-host pricing | Free (self-hosted) |
| Best for | Initial investigation, tuning | Plan-level debugging | Production observability | Large-scale infrastructure |
Real-World Case Study: PagerDuty’s PostgreSQL Monitoring Setup
PagerDuty, a company that builds incident management software, runs PostgreSQL as a core operational database. Their engineering team has written extensively about their PostgreSQL monitoring architecture.
What they built: PagerDuty’s monitoring stack uses Prometheus exporters for infrastructure metrics (CPU, disk, network), combined with a PostgreSQL-specific exporter that captures pg_stat_statements, pg_stat_activity, pg_stat_user_tables, replication lag, and vacuum activity. They store all metrics in Prometheus and visualize in Grafana with per-team dashboards.
Key monitoring patterns they emphasize:
- Cardinality management: PagerDuty runs
pg_stat_statementstracking top 10,000 queries by total time, but they avoid high-cardinality labels (like individual user IDs) on PostgreSQL metrics to prevent Prometheus cardinality explosions. They aggregate user-level data into buckets (e.g., “heavy user” vs “light user”) rather than tracking per-user. - Alert fatigue prevention: They use a three-tier alert model. Tier 1: page immediately for customer-facing impact (error rate > 1%, p99 latency > 5s). Tier 2: notify Slack channel for degraded performance (cache hit ratio < 90%, replication lag > 10s). Tier 3: ticket created for trend analysis (disk growth rate > 10%/week, slow query count increasing). This prevents page fatigue while ensuring real incidents get immediate attention.
- Query-level observability: They use
pg_stat_statementswithpg_stat_statements.track = alland aggregate by query normalized text (stripping literal values). They tag each query with a role (api, worker, background) to isolate which component is causing issues. - Capacity-triggered scaling: When
pg_stat_database.tup_returneddrops below baseline by 20% for more than 5 minutes (suggesting table bloat or index degradation), they automatically trigger a vacuum or reindex job.
The lesson: The value of monitoring is not in having dashboards—it’s in having a clear escalation path from metric to action. PagerDuty’s three-tier alert model is a pattern any team can adopt: separate alerts by impact severity, assign different response protocols for each tier, and tune thresholds based on observed baseline rather than arbitrary values.
Quick Recap Checklist
- Track query latency percentiles: p50, p95, p99 — not just averages
- Monitor connection utilization — alert when approaching 80% of max_connections
- Cache hit ratio above 99% is excellent; below 95% warrants investigation
- Set
log_min_duration_statementto 100-500ms to capture slow queries - Use
pg_stat_statementsto identify top queries by total time - Use
pg_stat_activityto diagnose currently running queries and lock contention - Use
pg_stat_user_tablesto identify hot tables and index usage effectiveness - Alert on disk usage at 85% — at 90% there may be no room for VACUUM
- Alert on replication lag: warning at 10s, critical at 30s
- Alert on long-running queries (60+ seconds) that may hold locks
- Monitor
pg_stat_activity.state = 'active'combined withwait_event_typefor deadlocks - Use
auto_explain.log_min_durationto capture query plans for slow queries automatically - Review slow query log regularly to find optimization candidates
- Use
EXPLAIN (ANALYZE, BUFFERS)to examine query execution plans - Set composite alerts (multiple conditions must fire together) to reduce noise
- Use three-tier alerting: page immediately, notify Slack, create ticket
Related Posts
- Metrics, Monitoring, and Alerting - Broader observability practices
- Database Migration Strategies - Day-to-day operational guidance
- Query Execution Plans - Understanding and optimizing query plans
Quick Reference: Alert Thresholds
Disk usage: Critical at 90%, Warning at 80%. At 90% there may be no room for VACUUM or emergency operations.
Connection utilization: Warning at 70% of max_connections, Critical at 85%. Investigate pool sizing or query efficiency before hitting limits.
Cache hit ratio: Warning below 95%, Critical below 90%. Sustained drops indicate queries scanning too much data.
Replication lag: Warning at 10 seconds, Critical at 30 seconds. Beyond 30 seconds read-after-write consistency breaks down.
Query latency: Set p95 alerts at 2-3x your baseline. Set p99 alerts at 5x baseline or at your SLA threshold.
Interview Questions
A high cache hit ratio with degraded latency suggests the bottleneck is not memory-related. Investigate in this order: First, check pg_stat_statements for new slow queries—maybe a new query pattern emerged that wasn't present in the previous baseline. Second, check pg_stat_activity for lock contention: run SELECT * FROM pg_stat_activity WHERE wait_event_type IS NOT NULL to see if queries are waiting on locks. Third, check replication lag if using replicas: a lagging replica could cause read latency to appear as write latency if your application uses synchronous replication. Fourth, check for index bloat using pgstatindex() on your most-used indexes—a bloated index has higher tree depth and requires more I/O per lookup even if the buffer cache is warm. Fifth, check disk I/O latency (iostat on the host): a saturated disk queue causes latency even with high cache hit ratios. Finally, check for increased connection counts—if max_connections is being approached, queries may be waiting in the connection queue.
Threshold setting is an iterative process. Start with theoretical limits: max_connections (e.g., 100), disk capacity (e.g., 500GB), shared_buffers (e.g., 8GB). Set initial thresholds at 70% (warning) and 85% (critical) of theoretical limits. Then refine based on observed baseline: capture metrics for 2-4 weeks to establish baseline behavior. Set warning thresholds at baseline + 3 standard deviations for highly variable metrics (query latency), and at fixed percentages for monotonic resources (disk usage, connection count). Use composite alerts to reduce noise: fire only when both cache hit ratio drops below 90% AND transaction latency exceeds 500ms simultaneously. Review and adjust thresholds quarterly or after any significant load change.
pg_stat_statements, pg_stat_activity, and pg_stat_user_tables?pg_stat_statements tracks query performance across all queries globally: total calls, total time, total I/O time, and cache hit ratio per normalized query. It requires the extension to be installed (CREATE EXTENSION pg_stat_statements) and does not track which session or user ran a query. pg_stat_activity is session-level: it shows every currently active connection, the query each session is running, wait events, transaction state, and query start time. Use it to diagnose currently running queries and lock contention in real time. pg_stat_user_tables (and pg_stat_sys_tables) tracks per-table statistics: sequential scans vs index scans, individual table hit/read/insert/update/delete rates, and vacuum/autovacuum activity. Use it to identify which tables are hot spots and whether indexes are being used effectively.
Configure pg_stat_statements with track_utility = on to capture all query types including COPY and VACUUM. Set a low log_min_duration_statement threshold (e.g., 100ms) during peak hours only, using a scheduled configuration change or pg_scheduler. Capture pg_stat_statements snapshots hourly and diff them to identify queries whose total time increases disproportionately during peak periods. Look specifically at total_exec_time / calls (average per execution) rather than total time alone—a query that runs fast but 100,000 times per hour will show high total time. Use pg_stat_activity with state = 'active' and query_start > now() - interval '5 minutes' filtered to active queries during peak to catch currently-running slow queries. For recurring issues, add auto_explain with auto_explain.log_min_duration = '500ms' to capture query plans for slow queries automatically.
Stable p50 with increasing p99 is a classic sign of a subset of queries experiencing degradation while most queries are fine. This rules out system-wide issues (CPU, memory, disk) and points to query-specific problems. Diagnosis steps: (1) Compare p99 query list this week vs last week in pg_stat_statements — look for queries with high max_exec_time that were previously faster; (2) Check for new queries added to the system; (3) Check if specific business events caused traffic pattern changes (new feature launch, scheduled jobs); (4) Examine execution plans for the degraded queries — look for changes like degraded index scans to sequential scans, or increased row estimates causing different join strategies; (5) Check for lock contention on specific tables or rows — pg_stat_activity with wait_event_type filters for specific tables; (6) Check for parameter changes in pg_stat_statements settings that might have changed query plans.
Alert design principles for sustainable monitoring: (1) Alert on symptoms, not causes — "database latency is high" is better than "cache hit ratio is low"; the symptom is what affects users; (2) Use composite conditions — fire only when disk > 80% AND query latency p99 > 1s simultaneously; single thresholds fire too easily; (3) Require persistence — fire only when condition persists for 5+ minutes, not on momentary spikes; (4) Use graduated severity — warning (investigate within 24h) vs critical (respond now) vs page (wake someone up); (5) Include context — every alert should state: what is wrong, which system, since when, and suggested first action; (6) Tune quarterly — if you ignore an alert, either fix the underlying issue or adjust the threshold; (7) Review alert rate — if you're getting more than 10 warnings per day per system, thresholds are too sensitive; (8) Separate business impact from system metrics — "checkout success rate dropped" is more actionable than "CPU at 90%".
pg_stat_activity differs from pg_stat_statements in database monitoring.pg_stat_activity is real-time session monitoring — it shows what's happening RIGHT NOW. For each active connection: current state (idle, active, waiting), current query (or last query if idle), query start time, wait events if blocked, transaction age. Use for: diagnosing currently running queries, identifying blocking locks, seeing connection counts by state, watching query execution in real time. pg_stat_statements is historical query aggregation — it summarizes ALL queries across all sessions over time. Metrics: total calls, total time, total I/O time, cache hit ratio per normalized query. Use for: identifying top queries by total time, comparing query performance week-over-week, finding queries to optimize. Key difference: pg_stat_activity tells you what's running NOW; pg_stat_statements tells you what has been running OVER TIME. Both are essential for complete monitoring.
Database health check in order: (1) Disk space: SELECT pg_size_pretty(pg_database_size(current_database())) and check df.blocks - df.free; (2) Connection status: SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state; (3) Long-running queries: SELECT now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '5 minutes'; (4) Cache hit ratio: SELECT blks_hit * 100.0 / NULLIF(blks_hit + blks_read, 0) AS cache_hit_ratio FROM pg_stat_database; (5) Replication lag: SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag on replica; (6) Lock contention: SELECT * FROM pg_locks WHERE granted = false; (7) Autovacuum status: SELECT relname, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5; (8) Top queries by total time: from pg_stat_statements. Run these in under 2 minutes to get a complete health picture.
pg_stat_user_tables and pg_stat_user_indexes? When would you use each?pg_stat_user_tables tracks table-level statistics: number of live/dead tuples, sequential scans vs index scans,_INSERT/UPDATE/DELETE counts, last vacuum and analyze times, vacuum and analyze count. Use when: diagnosing which tables are hot spots, checking if autovacuum is keeping up with dead tuples, understanding read vs write patterns per table, verifying ANALYZE ran recently. pg_stat_user_indexes tracks per-index statistics: number of index scans (idx_scan), tuples read and fetched via the index, index size. Use when: identifying unused indexes (idx_scan = 0 — these can be dropped), comparing which indexes are actually being used for queries, diagnosing if adding an index improved access patterns. Combine both: high sequential scans on a table with a corresponding index that has low idx_scan indicates the index is not being used and queries are doing full table scans.
Database DoS indicators and monitoring: (1) Connection flood — spike in connection count above normal baseline; alert when current_connections > 80% of max_connections from unknown sources; (2) Query flood — sudden spike in queries per second beyond normal traffic patterns; compare against historical baseline from pg_stat_statements; (3) Expensive query pattern — many concurrent queries with similar expensive patterns (full table scans) suggest an attack or misconfigured application; (4) Authentication failures — monitor PostgreSQL logs for spike in failed login attempts; (5) Lock exhaustion — many queries waiting on the same lock; attacker might try to hold a lock for extended period. Mitigations: connection limiters at the application layer, PgBouncer with strict pool sizing, rate limiting on the application server, firewall rules restricting access to known IP ranges, failed login monitoring and temporary IP blocks.
Mixed workload monitoring strategy: (1) Separate connection pools — OLTP uses one pool (smaller, higher max connections), OLAP uses a separate pool (larger, fewer connections, different work_mem settings); (2) Separate resource groups or query priority — PostgreSQL doesn't have built-in resource governance, but Linux cgroups can limit OLAP query resource usage; (3) OLTP monitoring — monitor transaction latency p50/p95/p99, connection utilization, lock wait ratio; (4) OLAP monitoring — monitor query duration, temp file usage (indicates work_mem spilling), I/O rates during large scans; (5) Conflict monitoring — OLAP scans holding SHARE locks can block OLTP writes; monitor pg_stat_activity for lock waits involving long-running queries; (6) Alerting — set separate p99 thresholds for OLTP (e.g., 100ms) and OLAP (e.g., 60s); mixing them in one alert rule hides OLTP issues. Dashboard: two separate panels for OLTP latency and OLAP query duration.
Capacity planning dashboard metrics: (1) Disk usage over time — current used vs total, with projection line; (2) Disk growth rate — GB per week; (3) CPU utilization trend — average and p95 over time; (4) Memory utilization — shared_buffers vs effective_cache_size vs total RAM; (5) Connection count trend — current vs max_connections; (6) Cache hit ratio trend — weekly average; (7) Query latency trend — p50, p95, p99 over time; (8) Replication lag — if using replicas; (9) Table and index sizes — top 10 largest tables, top 10 fastest growing; (10) Dead tuple accumulation — n_dead_tup trend for top tables. Projection: use Grafana's threshold visualization to show when current growth rate hits 80% and 90% of provisioned capacity. This gives ops team runway visibility — they should know 60-90 days before hitting limits.
EXPLAIN (ANALYZE, BUFFERS) to diagnose query performance issues?EXPLAIN ANALYZE with BUFFERS shows actual execution details: (1) Look for rows estimates vs actual rows — large discrepancy (> 10x) means stale statistics, run ANALYZE; (2) Look for sequential scans on large tables — often indicates missing index; (3) Check sort and hash operations — high actual rows vs estimated rows for sort indicates work_mem too low, causing disk spills; (4) Look for nested loop on large datasets — may benefit from hash join or merge join; (5) Buffers: shared_hit (from cache) vs shared_read (from disk) — low hit ratio indicates cache is too small for working set; (6) Actual time per node — identifies which step is slowest; (7)loops — if loops > 1, the operation ran multiple times, multiply time per loop to get total contribution to query time. The key is comparing estimated vs actual for rows and time to pinpoint where planner assumptions differ from reality.
pg_stat_bgwriter and what metrics indicate it needs tuning?pg_stat_bgwriter tracks the background writer process. Metrics to monitor: (1) checkpoints — frequency of checkpoints; too many checkpoints indicate issues with checkpoint_segments or checkpoint_timeout; (2) buffers_checkpoint — buffers written during checkpoint; (3) buffers_clean — buffers written by the bgwriter; high buffers_clean relative to buffers_checkpoint means bgwriter is doing more work than checkpoints; (4) maxwritten_clean — how many times bgwriter stopped because it reached bgwriter_lru_maxpages; if this is high, bgwriter can't keep up. High maxwritten_clean: increase bgwriter_lru_maxpages or decrease bgwriter_delay. High buffers_backend (backends writing directly): same tuning. High buffers_backend_fsync: this is disk being overwhelmed, tune disk or reduce write volume. The goal is a healthy balance where bgwriter handles most buffer writes and backends rarely write directly.
pg_partman monitoring additions: (1) Partition count — monitor number of partitions; too many partitions cause performance degradation; pg_partman maintains a part_config table tracking this; (2) Partition creation lag — ensure partitions are being created ahead of data arrival; check partman's run_maintenance_log; (3) Parent table bloat — parent table should be tiny (only contains partition boundaries); if it grows, something is inserting directly into parent; (4) Detached partition monitoring — pg_partman can detach old partitions for archival; monitor that this is happening on schedule; (5) Child table sizes — monitor that partitions are evenly sized; lopsided sizes indicate skewed data patterns. SQL for partition monitoring: SELECT parent_table, COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema') GROUP BY parent_table. Alert on partition count exceeding thresholds (e.g., > 1000 partitions warrants review).
Same-query slowness with no data change causes: (1) Statistics are stale — run ANALYZE on the table; (2) Index was dropped or disabled — check pg_stat_user_indexes for idx_scan changes; (3) Another query is holding a lock — check pg_stat_activity for wait events on the same table; (4) Connection pooling issue — query might be running under a different session or with different settings; (5) Configuration change — check postgresql.conf for work_mem, random_page_cost, or other planner-affecting settings that changed; (6) Planner regression — PostgreSQL sometimes chooses different plans for the same query due to planner statistics noise; use SET plan_seed to reproduce; (7) Memory pressure — another session might be consuming memory, causing this query to spill to disk; (8) Table bloat — VACUUM hasn't run recently and dead tuples are slowing scans. Start with EXPLAIN (ANALYZE, BUFFERS) and compare to an old EXPLAIN if saved.
Query optimization detection monitoring: (1) Track p99 query duration per query — when p99 increases > 2x baseline for specific queries, flag for review; (2) Monitor sequential scan rate per table — high sequential scans without corresponding data growth indicate index loss or statistics issues; (3) Monitor work_mem spills — queries with high temp buffer usage (sort to disk) via pg_stat_statements.total_plan_time vs total_exec_time; (4) Track query plan changes — if using pg_stat_statements with plan capture, compare plans over time; a plan change often precedes performance regression; (5) Monitor index usage ratio — for each table, compare seq_scan vs idx_scan; increasing seq_scan on tables with indexes indicates optimizer is not using indexes; (6) Set up slow query capture with EXPLAIN plans — auto_explain.log_min_duration captures plans for slow queries. Alert threshold: any query whose p99 increases > 50% week-over-week, or any table with > 10x increase in sequential scans without corresponding data growth.
HA cluster monitoring additions: (1) Replication lag — per replica, in bytes and seconds; alert at 30s for normal workloads, lower for financial applications; (2) Replica connection status — ensure replicas are connected and receiving WAL; (3) Failover readiness — check pg_stat_replication for state and lag; (4) Primary election — monitor for unexpected primary switches; (5) Patroni or similar orchestrator health — ensure consensus is healthy; (6) Quorum presence — for quorum-based systems, ensure majority is reachable; (7) DNS/VIP resolution — if using virtual IP, ensure it resolves correctly after failover; (8) Connection pool health — PgBouncer or PgCat instances remain reachable after failover; (9) Time to recovery — track how long failover takes from start to new primary accepting writes; (10) Split brain detection — for active-active setups, monitor for writes to both sides simultaneously. HA monitoring is only valuable if tested — run failover drills annually to verify monitoring and procedures work.
PgBouncer monitoring metrics: (1) Pool active vs idle — active connections using DB, idle connections in pool; high active count approaching pool limit indicates shortage; (2) Waiting clients — if clients are queuing, pool is undersized or queries are holding connections too long; (3) Server connection lifetime — long-lived server connections indicate connections are being reused rather than created/destroyed (good); (4) Query wait time — how long queries wait in queue before being executed; (5) Bytes received/sent — throughput monitoring; (6) Error rate — connection errors, query errors. PgBouncer show stats command shows: num_ws_acquired, num_wq_auth, num_server_connections, etc. Alert thresholds: wait queue > 0 sustained (indicates pool undersized), server connection errors > 0 (indicates backend issues), pool utilization > 90% sustained. Monitoring: export PgBouncer stats to Prometheus via postgres_exporter and visualize in Grafana.
Production PostgreSQL go-live checklist: (1) Disk space — alert at 85%, critical at 90%; (2) Connection count — alert at 80% of max_connections; (3) Cache hit ratio — alert below 95%; (4) Query latency p99 — baseline established, alert at 2x baseline; (5) Replication lag — alert at 30s for replicas; (6) Long-running queries — alert at 60s; (7) Lock waits — alert for queries waiting > 10s; (8) Database errors — log errors to monitoring system; (9) Backup success/failure — verify backups are running and test restore; (10) Autovacuum activity — ensure autovacuum is running and not falling behind on any table. Beyond these minimums, add: slow query logging with auto_explain, monitoring for specific critical queries, custom business metrics (order placement latency), and capacity planning dashboards showing growth trends. The minimums prevent outages; the additions prevent slowdowns and enable optimization.
Further Reading
- PostgreSQL Monitoring Overview — Official monitoring documentation
- pg_stat_statements — Query performance statistics
- pgMustard — PostgreSQL performance monitoring service
- Supabase Database Debugging — Practical debugging patterns
- DataDog PostgreSQL Monitoring — Comprehensive monitoring guide
Conclusion
Database monitoring is not optional. The metrics exist, the tools exist, and ignoring them leads to incidents that could have been prevented.
Start with the basics: track query latency, connections, disk usage, and cache hit ratios. Add slow query logging. Set up alerts for critical thresholds.
As you mature, add more sophisticated monitoring: query-level percentiles, detailed replication stats, correlation with application traces.
The goal is visibility. You cannot fix what you cannot see. Monitoring turns guesswork into data-driven decisions about where to optimize, when to scale, and how to prevent problems before they become incidents.
Category
Related Posts
Connection Pooling: HikariCP, pgBouncer, and ProxySQL
Learn connection pool sizing, HikariCP, pgBouncer, and ProxySQL, timeout settings, idle management, and when pooling helps or hurts performance.
Vacuuming and Reindexing in PostgreSQL
PostgreSQL's MVCC requires regular maintenance. This guide explains dead tuples, VACUUM vs VACUUM FULL, autovacuum tuning, REINDEX strategies, and how to monitor bloat with pg_stat_user_tables.
Metrics, Monitoring, and Alerting: From SLIs to Alerts
Learn the RED and USE methods, SLIs/SLOs/SLAs, and how to build alerting systems that catch real problems. Includes examples for web services and databases.