Database Monitoring: Metrics, Tools, and Alerting

Keep your PostgreSQL database healthy with comprehensive monitoring. This guide covers query latency, connection usage, disk I/O, cache hit ratios, and alerting with pg_stat_statements and Prometheus.

published: reading time: 31 min read author: GeekWorkBench

Database Monitoring: Metrics, Tools, and Alerting

A database without monitoring is flying blind. You will not know you are out of disk space until writes start failing. You will not know queries are slowing down until users complain. You will not know replication lag is growing until you have a replica that is hours behind.

This guide covers the essential metrics to track, tools for collecting them, and how to set up alerting that actually helps rather than just creating noise.

Introduction

Database monitoring is how you know whether your database is healthy before users complain. Without it, you are reactive: you find out about problems when they become outages. With the right metrics and alerting, you find out before — disk filling up, replication lag growing, queries slowing down, connections approaching the limit.

This guide covers the essential metrics to track (query latency, connection usage, disk I/O, cache hit ratios, replication lag), the tools to collect them (pg_stat_statements, Prometheus exporters, pgAdmin), and how to configure alerting that surfaces real problems without generating noise. It also covers building dashboards that give you situational awareness at a glance.

Key Metrics to Track

Not all metrics are equally important. Focus on the ones that directly indicate user-facing problems or impending failures.

Query Latency

Query latency is the most user-visible metric. Slow queries mean slow page loads, frustrated users, and potential timeouts.

Track these latency percentiles:

  • p50 (median): The typical response time
  • p95: What 19 out of 20 requests experience
  • p99: The worst experiences (excluding outliers)
-- Check query latency from pg_stat_statements
SELECT
  query,
  calls,
  total_exec_time / calls AS avg_ms,
  stddev_exec_time AS stddev_ms,
  min_exec_time,
  max_exec_time,
  rows / calls AS avg_rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

The pg_stat_statements extension must be enabled:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

PostgreSQL’s pg_stat_statements reports total execution time, not per-call latency percentiles. For true percentiles, consider pg_stat_kcache or external tools.

Connection Usage

PostgreSQL has a finite connection limit. When connections are exhausted, new connections fail. This is a hard failure that affects users directly.

-- Current connection status
SELECT
  state,
  COUNT(*) AS count
FROM pg_stat_activity
GROUP BY state;
-- Connection limits and usage
SELECT
  setting::int AS max_connections,
  (SELECT COUNT(*) FROM pg_stat_activity) AS current_connections,
  (SELECT COUNT(*) FROM pg_stat_activity WHERE state = 'active') AS active_connections;

Track connection usage as a percentage of max. If you are regularly above 70-80%, investigate why—either scale max_connections or reduce connection usage through pooling.

Disk I/O

Disk I/O bottlenecks are common database problems. Slow I/O means queries wait for data to come from disk rather than being served from memory.

-- I/O statistics
SELECT
  schedinfo.io_read_count,
  schedinfo.io_write_count,
  schedinfo.io_read_ms,
  schedinfo.io_write_ms
FROM pg_stat_activity act
JOIN pg_proc AS proc ON act.procpid = proc.oid
JOIN pg_catalog.pg_statio_user_tables AS statio ON statio.relid = act.query
LIMIT 1;

For aggregate I/O statistics:

SELECT
  pg_stat_database.datname,
  blks_read,
  blks_hit,
  ROUND(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();

A cache hit ratio below 95-99% indicates your working set does not fit in memory, or there are problematic queries scanning large amounts of data.

Buffer Cache Hit Ratio

The buffer cache hit ratio measures how often PostgreSQL finds data in memory versus on disk. A low ratio means lots of disk I/O.

-- System-wide buffer cache hit ratio
SELECT
  pg_stat_database.datname,
  pg_stat_database.blks_hit,
  pg_stat_database.blks_read,
  ROUND(100.0 * pg_stat_database.blks_hit /
    NULLIF(pg_stat_database.blks_hit + pg_stat_database.blks_read, 0), 2) AS hit_ratio
FROM pg_stat_database
WHERE pg_stat_database.datname = current_database();

A hit ratio above 99% is excellent. Below 95% warrants investigation—either you need more memory, or queries are scanning too much data.

Replication Lag

If you are using streaming replication (read replicas), monitor lag carefully. A replica that falls behind provides stale data and takes load off the primary incorrectly.

-- Check replication lag on replica
SELECT
  now() - pg_last_xact_replay_timestamp() AS replication_lag;
-- On primary, check replica status
SELECT
  client_addr,
  state,
  sent_lsn - write_lsn AS sent_lag,
  write_lsn - flush_lsn AS write_lag,
  flush_lsn - replay_lsn AS replay_lag
FROM pg_stat_replication;

Lag is measured in bytes (LSN difference) and time. The time-based lag is more useful for user-facing impact. Replication lag above a few seconds can cause problems for applications that expect read-after-write consistency.

Slow Query Log Analysis

Identifying slow queries is the first step to fixing them. PostgreSQL can log slow queries automatically.

Configuring Slow Query Logging

-- Log queries taking longer than 100ms
ALTER SYSTEM SET log_min_duration_statement = 100;
SELECT pg_reload_conf();

This logs the query text, duration, and binding parameters. Review logs regularly to find candidates for optimization.

Analyzing Slow Queries

Once you have slow query data, analyze patterns:

-- Top slowest queries by average time
SELECT
  LEFT(query, 100) AS query_preview,
  calls,
  total_exec_time / calls AS avg_ms,
  max_exec_time,
  rows / calls AS avg_rows
FROM pg_stat_statements
WHERE calls > 10
ORDER BY avg_ms DESC
LIMIT 20;
-- Most frequently called slow queries
SELECT
  LEFT(query, 100) AS query_preview,
  calls,
  total_exec_time / calls AS avg_ms,
  total_exec_time AS total_ms
FROM pg_stat_statements
WHERE calls > 100
ORDER BY total_exec_time DESC
LIMIT 20;

The first query finds queries that are slow even when called infrequently. The second finds queries called so often that even small slowdowns add up.

Query Execution Plans

For slow queries, examine the execution plan:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT *
FROM orders
WHERE user_id = 123
ORDER BY created_at DESC
LIMIT 10;

Key things to look for:

  • Seq Scan on large tables: Often indicates missing indexes
  • High actual rows vs estimated rows: Statistics are stale, run ANALYZE
  • Nested Loop on large datasets: May benefit from hash join
  • Sort nodes with high memory: May need work_mem increase

Monitoring Tools

PostgreSQL includes built-in stats. For production monitoring, you will want something more comprehensive.

pg_stat_statements

This is the essential tool for query-level visibility.

Enable it in postgresql.conf:

shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = top

DataDog

DataDog provides PostgreSQL monitoring with minimal setup. The agent auto-discovers PostgreSQL metrics and provides dashboards out of the box.

Key DataDog metrics for PostgreSQL:

  • postgresql.connections: Current connections
  • postgresql.rows_fetched: Rows returned by queries
  • postgresql.rows_inserted/updated/deleted: Write rates
  • postgresql.buffer_cache_hit_ratio: Cache effectiveness
  • postgresql.active_temporary_files: Temporary file creation (indicates memory pressure)

DataDog integrates well with other monitoring data—correlating database metrics with application traces, logs, and infrastructure metrics in one place.

Prometheus and Grafana

For a more DIY approach, Prometheus + Grafana provides powerful monitoring. The postgres_exporter scrapes PostgreSQL metrics for Prometheus.

# prometheus.yml
scrape_configs:
  - job_name: "postgres"
    static_configs:
      - targets: ["localhost:9187"]

Common Prometheus queries for PostgreSQL:

# Connection utilization
postgres_connections{state="active"} / postgres_max_connections

# Cache hit ratio
rate(postgres_blks_hit_total[5m]) / (rate(postgres_blks_hit_total[5m]) + rate(postgres_blks_read_total[5m]))

# Slow query rate
rate(postgres_slow_query_seconds_total[5m])

Grafana dashboards visualize these metrics. The official PostgreSQL dashboard template provides a reasonable starting point.

pg_stat_monitor

pg_stat_monitor is an improved version of pg_stat_statements that provides histogram data, allowing true percentile calculations.

CREATE EXTENSION pg_stat_monitor;
SELECT pg_stat_monitor_settings();
-- Query with timing histogram
SELECT
  query,
  calls,
  total_exec_time,
  min_exec_time,
  avg_exec_time,
  max_exec_time,
  p90_exec_time,
  p95_exec_time,
  p99_exec_time
FROM pg_stat_monitor()
ORDER BY total_exec_time DESC
LIMIT 20;

This gives you actual p90/p95/p99 latencies directly from the database, which is more useful than averages alone.

flowchart TD
    subgraph DB["PostgreSQL Internals"]
        SS[pg_stat_statements<br/>Query statistics]
        SA[pg_stat_activity<br/>Connection states]
        SB[pg_stat_bgwriter<br/>Write buffers]
        SR[pg_stat_replication<br/>Replica lag]
    end

    subgraph Collection["Collection Layer"]
        EX[postgres_exporter<br/>Prometheus metrics]
        DD[DataDog Agent<br/>Custom metrics]
        BC[pg_stat_monitor<br/>Percentile histograms]
    end

    subgraph Store["Storage & Visualization"]
        PROM[Prometheus<br/>Time-series DB]
        GRAF[Grafana<br/>Dashboards]
        DD2[DataDog<br/>Dashboards + APM]
    end

    subgraph Alert["Alerting"]
        AL[AlertManager<br/>Paging]
        PN[PagerDuty<br/>On-call routing]
    end

    DB --> EX
    DB --> DD
    DB --> BC
    SS --> EX
    SA --> EX
    SR --> EX
    EX --> PROM
    DD --> DD2
    BC --> PROM
    PROM --> GRAF
    DD2 --> AL
    PROM --> AL
    AL --> PN

When to Use / When Not to Use

Use built-in PostgreSQL stats (pg_stat_statements, pg_stat_activity) when you need quick visibility without extra setup, during initial investigation of an issue, or when evaluating whether to adopt a heavier monitoring stack.

Use pg_stat_monitor over pg_stat_statements when you need true latency percentiles (p90/p95/p99) directly from the database, since pg_stat_statements only gives averages and totals.

Use DataDog or Prometheus+Grafana when you need to correlate database metrics with application traces, infrastructure metrics, and logs in a single view, or when managing multiple database instances across environments.

Use custom metrics exported to Prometheus when you have specific business-level SLIs (like order placement latency) that standard database exports do not capture.

Do not rely solely on averages — a database with a p99 of 5 seconds looks fine if you only track the mean. Always use percentile-based alerting.

Do not set up alert rules for every metric you track. Alert fatigue is real — if everything triggers an alert, nothing gets attention.

Setting Up Alerts for Thresholds

Alerts should notify you of problems, not create noise that trains you to ignore them. Good alerts are actionable and specific.

Critical Threshold Alerts

Disk space critical

SELECT
  df.mountpoint,
  df.device,
  df.blocks - df.free AS used_blocks,
  df.blocks AS total_blocks,
  ROUND(100.0 * (df.blocks - df.free) / df.blocks) AS used_pct
FROM pg_catalog.pg_stat_device_stats df
WHERE df.mountpoint = '/data';

Alert threshold: 85% disk usage. At 90%, you might not have room for VACUUM or emergency operations.

Connection utilization high

Alert when connections exceed 80% of max. At this threshold, some capacity remains for burst traffic, but you should investigate.

Replication lag

Alert when replica lag exceeds 30 seconds. Beyond this, read-after-write consistency guarantees break down.

Long-running queries

Alert when queries exceed 60 seconds. These often indicate problems and can hold locks that block other queries.

Alert Design Principles

Alert Anti-Patterns

Avoid alerting on:

  • CPU spikes that resolve quickly: Brief CPU spikes are normal during batch operations
  • Single slow queries: Track these via slow query log, alert only if they are increasing
  • Cache hit ratio drops: Brief fluctuations are normal; alert only on sustained drops

Alerting Best Practices

  1. Set reasonable time windows: Alert only when a condition persists for 5+ minutes, not on momentary spikes.

  2. Use multiple severity levels: Warning (investigate) vs Critical (respond now).

  3. Include context in alerts: The alert should say what is wrong, which system, and when it started.

  4. Alert on symptoms, not causes: “Database latency is high” is better than “Cache hit ratio is low”—the symptom is what affects users.

  5. Review and tune regularly: If you are ignoring an alert, either fix the underlying issue or adjust the threshold.

Building a Monitoring Dashboard

A good database dashboard shows the key metrics at a glance:

  1. Connection count (current and trend)
  2. Query latency (p50, p95, p99)
  3. Transactions per second
  4. Cache hit ratio
  5. Replication lag (if applicable)
  6. Disk usage (current and trend)
  7. Active queries (top offenders)
  8. Locks (blocked queries)

Most monitoring tools provide templates. Build from there based on your specific alerts.

Production Failure Scenarios

FailureCauseMitigation
Monitoring blind spot during incidentAlerts set only on CPU/disk but not on query latency spikeAlert on p95/p99 latency and query error rate, not just infrastructure metrics
Alert fatigue desensitizes teamHundreds of warnings fire daily for minor fluctuationsRaise thresholds, use composite alerts (multiple conditions must fire together)
pg_stat_statements query plan cachedSlow query captured with placeholder values prevents plan optimizationUse pg_stat_statements_reset() after fixing a query to clear stale entries
Monitoring dashboard lies about freshnessMetric export interval is 60s but dashboard refreshes every 5sVerify metric scrape intervals match dashboard expectations
Lock monitoring missed cascading failurespg_locks view not monitored, deadlocks cascade across tablesMonitor pg_stat_activity.state = 'active' combined with wait_event_type, set deadlock detection alerts

Trade-Off Table: Monitoring Tools

Dimensionpg_stat_statementsauto_explainDataDogPrometheus + Grafana
Setup effortNone (built-in)LowMediumMedium-High
Latency percentilesAverages onlyN/AYes (histograms)Yes (histograms)
Query plan analysisNoYes (actual plans)Via integrationVia exporter
Cross-service correlationNoNoYes (APM + infra)Yes (if metrics aligned)
AlertingNo (external tool needed)NoYes (built-in)AlertManager required
CostFreeFreePer-host pricingFree (self-hosted)
Best forInitial investigation, tuningPlan-level debuggingProduction observabilityLarge-scale infrastructure

Real-World Case Study: PagerDuty’s PostgreSQL Monitoring Setup

PagerDuty, a company that builds incident management software, runs PostgreSQL as a core operational database. Their engineering team has written extensively about their PostgreSQL monitoring architecture.

What they built: PagerDuty’s monitoring stack uses Prometheus exporters for infrastructure metrics (CPU, disk, network), combined with a PostgreSQL-specific exporter that captures pg_stat_statements, pg_stat_activity, pg_stat_user_tables, replication lag, and vacuum activity. They store all metrics in Prometheus and visualize in Grafana with per-team dashboards.

Key monitoring patterns they emphasize:

  • Cardinality management: PagerDuty runs pg_stat_statements tracking top 10,000 queries by total time, but they avoid high-cardinality labels (like individual user IDs) on PostgreSQL metrics to prevent Prometheus cardinality explosions. They aggregate user-level data into buckets (e.g., “heavy user” vs “light user”) rather than tracking per-user.
  • Alert fatigue prevention: They use a three-tier alert model. Tier 1: page immediately for customer-facing impact (error rate > 1%, p99 latency > 5s). Tier 2: notify Slack channel for degraded performance (cache hit ratio < 90%, replication lag > 10s). Tier 3: ticket created for trend analysis (disk growth rate > 10%/week, slow query count increasing). This prevents page fatigue while ensuring real incidents get immediate attention.
  • Query-level observability: They use pg_stat_statements with pg_stat_statements.track = all and aggregate by query normalized text (stripping literal values). They tag each query with a role (api, worker, background) to isolate which component is causing issues.
  • Capacity-triggered scaling: When pg_stat_database.tup_returned drops below baseline by 20% for more than 5 minutes (suggesting table bloat or index degradation), they automatically trigger a vacuum or reindex job.

The lesson: The value of monitoring is not in having dashboards—it’s in having a clear escalation path from metric to action. PagerDuty’s three-tier alert model is a pattern any team can adopt: separate alerts by impact severity, assign different response protocols for each tier, and tune thresholds based on observed baseline rather than arbitrary values.

Quick Recap Checklist

  • Track query latency percentiles: p50, p95, p99 — not just averages
  • Monitor connection utilization — alert when approaching 80% of max_connections
  • Cache hit ratio above 99% is excellent; below 95% warrants investigation
  • Set log_min_duration_statement to 100-500ms to capture slow queries
  • Use pg_stat_statements to identify top queries by total time
  • Use pg_stat_activity to diagnose currently running queries and lock contention
  • Use pg_stat_user_tables to identify hot tables and index usage effectiveness
  • Alert on disk usage at 85% — at 90% there may be no room for VACUUM
  • Alert on replication lag: warning at 10s, critical at 30s
  • Alert on long-running queries (60+ seconds) that may hold locks
  • Monitor pg_stat_activity.state = 'active' combined with wait_event_type for deadlocks
  • Use auto_explain.log_min_duration to capture query plans for slow queries automatically
  • Review slow query log regularly to find optimization candidates
  • Use EXPLAIN (ANALYZE, BUFFERS) to examine query execution plans
  • Set composite alerts (multiple conditions must fire together) to reduce noise
  • Use three-tier alerting: page immediately, notify Slack, create ticket

Quick Reference: Alert Thresholds

Disk usage: Critical at 90%, Warning at 80%. At 90% there may be no room for VACUUM or emergency operations.

Connection utilization: Warning at 70% of max_connections, Critical at 85%. Investigate pool sizing or query efficiency before hitting limits.

Cache hit ratio: Warning below 95%, Critical below 90%. Sustained drops indicate queries scanning too much data.

Replication lag: Warning at 10 seconds, Critical at 30 seconds. Beyond 30 seconds read-after-write consistency breaks down.

Query latency: Set p95 alerts at 2-3x your baseline. Set p99 alerts at 5x baseline or at your SLA threshold.

Interview Questions

1. Your PostgreSQL database shows a cache hit ratio of 99% but query latency has doubled over the past week. What do you investigate?

A high cache hit ratio with degraded latency suggests the bottleneck is not memory-related. Investigate in this order: First, check pg_stat_statements for new slow queries—maybe a new query pattern emerged that wasn't present in the previous baseline. Second, check pg_stat_activity for lock contention: run SELECT * FROM pg_stat_activity WHERE wait_event_type IS NOT NULL to see if queries are waiting on locks. Third, check replication lag if using replicas: a lagging replica could cause read latency to appear as write latency if your application uses synchronous replication. Fourth, check for index bloat using pgstatindex() on your most-used indexes—a bloated index has higher tree depth and requires more I/O per lookup even if the buffer cache is warm. Fifth, check disk I/O latency (iostat on the host): a saturated disk queue causes latency even with high cache hit ratios. Finally, check for increased connection counts—if max_connections is being approached, queries may be waiting in the connection queue.

2. How do you set meaningful alert thresholds for a PostgreSQL database?

Threshold setting is an iterative process. Start with theoretical limits: max_connections (e.g., 100), disk capacity (e.g., 500GB), shared_buffers (e.g., 8GB). Set initial thresholds at 70% (warning) and 85% (critical) of theoretical limits. Then refine based on observed baseline: capture metrics for 2-4 weeks to establish baseline behavior. Set warning thresholds at baseline + 3 standard deviations for highly variable metrics (query latency), and at fixed percentages for monotonic resources (disk usage, connection count). Use composite alerts to reduce noise: fire only when both cache hit ratio drops below 90% AND transaction latency exceeds 500ms simultaneously. Review and adjust thresholds quarterly or after any significant load change.

3. What's the difference between pg_stat_statements, pg_stat_activity, and pg_stat_user_tables?

pg_stat_statements tracks query performance across all queries globally: total calls, total time, total I/O time, and cache hit ratio per normalized query. It requires the extension to be installed (CREATE EXTENSION pg_stat_statements) and does not track which session or user ran a query. pg_stat_activity is session-level: it shows every currently active connection, the query each session is running, wait events, transaction state, and query start time. Use it to diagnose currently running queries and lock contention in real time. pg_stat_user_tables (and pg_stat_sys_tables) tracks per-table statistics: sequential scans vs index scans, individual table hit/read/insert/update/delete rates, and vacuum/autovacuum activity. Use it to identify which tables are hot spots and whether indexes are being used effectively.

4. How would you monitor for a slow query that only occurs during peak traffic?

Configure pg_stat_statements with track_utility = on to capture all query types including COPY and VACUUM. Set a low log_min_duration_statement threshold (e.g., 100ms) during peak hours only, using a scheduled configuration change or pg_scheduler. Capture pg_stat_statements snapshots hourly and diff them to identify queries whose total time increases disproportionately during peak periods. Look specifically at total_exec_time / calls (average per execution) rather than total time alone—a query that runs fast but 100,000 times per hour will show high total time. Use pg_stat_activity with state = 'active' and query_start > now() - interval '5 minutes' filtered to active queries during peak to catch currently-running slow queries. For recurring issues, add auto_explain with auto_explain.log_min_duration = '500ms' to capture query plans for slow queries automatically.

5. Your monitoring shows that query latency p99 increased from 50ms to 500ms over the past week, but p50 remained stable at 30ms. What does this indicate and how do you diagnose it?

Stable p50 with increasing p99 is a classic sign of a subset of queries experiencing degradation while most queries are fine. This rules out system-wide issues (CPU, memory, disk) and points to query-specific problems. Diagnosis steps: (1) Compare p99 query list this week vs last week in pg_stat_statements — look for queries with high max_exec_time that were previously faster; (2) Check for new queries added to the system; (3) Check if specific business events caused traffic pattern changes (new feature launch, scheduled jobs); (4) Examine execution plans for the degraded queries — look for changes like degraded index scans to sequential scans, or increased row estimates causing different join strategies; (5) Check for lock contention on specific tables or rows — pg_stat_activity with wait_event_type filters for specific tables; (6) Check for parameter changes in pg_stat_statements settings that might have changed query plans.

6. How do you configure alerting to avoid both false negatives (missing real incidents) and alert fatigue?

Alert design principles for sustainable monitoring: (1) Alert on symptoms, not causes — "database latency is high" is better than "cache hit ratio is low"; the symptom is what affects users; (2) Use composite conditions — fire only when disk > 80% AND query latency p99 > 1s simultaneously; single thresholds fire too easily; (3) Require persistence — fire only when condition persists for 5+ minutes, not on momentary spikes; (4) Use graduated severity — warning (investigate within 24h) vs critical (respond now) vs page (wake someone up); (5) Include context — every alert should state: what is wrong, which system, since when, and suggested first action; (6) Tune quarterly — if you ignore an alert, either fix the underlying issue or adjust the threshold; (7) Review alert rate — if you're getting more than 10 warnings per day per system, thresholds are too sensitive; (8) Separate business impact from system metrics — "checkout success rate dropped" is more actionable than "CPU at 90%".

7. Explain how pg_stat_activity differs from pg_stat_statements in database monitoring.

pg_stat_activity is real-time session monitoring — it shows what's happening RIGHT NOW. For each active connection: current state (idle, active, waiting), current query (or last query if idle), query start time, wait events if blocked, transaction age. Use for: diagnosing currently running queries, identifying blocking locks, seeing connection counts by state, watching query execution in real time. pg_stat_statements is historical query aggregation — it summarizes ALL queries across all sessions over time. Metrics: total calls, total time, total I/O time, cache hit ratio per normalized query. Use for: identifying top queries by total time, comparing query performance week-over-week, finding queries to optimize. Key difference: pg_stat_activity tells you what's running NOW; pg_stat_statements tells you what has been running OVER TIME. Both are essential for complete monitoring.

8. A developer asks you to check "database health" — what specific queries or commands do you run?

Database health check in order: (1) Disk space: SELECT pg_size_pretty(pg_database_size(current_database())) and check df.blocks - df.free; (2) Connection status: SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state; (3) Long-running queries: SELECT now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '5 minutes'; (4) Cache hit ratio: SELECT blks_hit * 100.0 / NULLIF(blks_hit + blks_read, 0) AS cache_hit_ratio FROM pg_stat_database; (5) Replication lag: SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag on replica; (6) Lock contention: SELECT * FROM pg_locks WHERE granted = false; (7) Autovacuum status: SELECT relname, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5; (8) Top queries by total time: from pg_stat_statements. Run these in under 2 minutes to get a complete health picture.

9. What is the difference between pg_stat_user_tables and pg_stat_user_indexes? When would you use each?

pg_stat_user_tables tracks table-level statistics: number of live/dead tuples, sequential scans vs index scans,_INSERT/UPDATE/DELETE counts, last vacuum and analyze times, vacuum and analyze count. Use when: diagnosing which tables are hot spots, checking if autovacuum is keeping up with dead tuples, understanding read vs write patterns per table, verifying ANALYZE ran recently. pg_stat_user_indexes tracks per-index statistics: number of index scans (idx_scan), tuples read and fetched via the index, index size. Use when: identifying unused indexes (idx_scan = 0 — these can be dropped), comparing which indexes are actually being used for queries, diagnosing if adding an index improved access patterns. Combine both: high sequential scans on a table with a corresponding index that has low idx_scan indicates the index is not being used and queries are doing full table scans.

10. How would you monitor for a potential denial-of-service attack against the database?

Database DoS indicators and monitoring: (1) Connection flood — spike in connection count above normal baseline; alert when current_connections > 80% of max_connections from unknown sources; (2) Query flood — sudden spike in queries per second beyond normal traffic patterns; compare against historical baseline from pg_stat_statements; (3) Expensive query pattern — many concurrent queries with similar expensive patterns (full table scans) suggest an attack or misconfigured application; (4) Authentication failures — monitor PostgreSQL logs for spike in failed login attempts; (5) Lock exhaustion — many queries waiting on the same lock; attacker might try to hold a lock for extended period. Mitigations: connection limiters at the application layer, PgBouncer with strict pool sizing, rate limiting on the application server, firewall rules restricting access to known IP ranges, failed login monitoring and temporary IP blocks.

11. Your database is a mix of OLTP (short transactions) and OLAP (long analytical queries). How do you configure monitoring for both workloads?

Mixed workload monitoring strategy: (1) Separate connection pools — OLTP uses one pool (smaller, higher max connections), OLAP uses a separate pool (larger, fewer connections, different work_mem settings); (2) Separate resource groups or query priority — PostgreSQL doesn't have built-in resource governance, but Linux cgroups can limit OLAP query resource usage; (3) OLTP monitoring — monitor transaction latency p50/p95/p99, connection utilization, lock wait ratio; (4) OLAP monitoring — monitor query duration, temp file usage (indicates work_mem spilling), I/O rates during large scans; (5) Conflict monitoring — OLAP scans holding SHARE locks can block OLTP writes; monitor pg_stat_activity for lock waits involving long-running queries; (6) Alerting — set separate p99 thresholds for OLTP (e.g., 100ms) and OLAP (e.g., 60s); mixing them in one alert rule hides OLTP issues. Dashboard: two separate panels for OLTP latency and OLAP query duration.

12. What metrics would you include in a capacity planning dashboard for a PostgreSQL database?

Capacity planning dashboard metrics: (1) Disk usage over time — current used vs total, with projection line; (2) Disk growth rate — GB per week; (3) CPU utilization trend — average and p95 over time; (4) Memory utilization — shared_buffers vs effective_cache_size vs total RAM; (5) Connection count trend — current vs max_connections; (6) Cache hit ratio trend — weekly average; (7) Query latency trend — p50, p95, p99 over time; (8) Replication lag — if using replicas; (9) Table and index sizes — top 10 largest tables, top 10 fastest growing; (10) Dead tuple accumulation — n_dead_tup trend for top tables. Projection: use Grafana's threshold visualization to show when current growth rate hits 80% and 90% of provisioned capacity. This gives ops team runway visibility — they should know 60-90 days before hitting limits.

13. How do you use PostgreSQL's EXPLAIN (ANALYZE, BUFFERS) to diagnose query performance issues?

EXPLAIN ANALYZE with BUFFERS shows actual execution details: (1) Look for rows estimates vs actual rows — large discrepancy (> 10x) means stale statistics, run ANALYZE; (2) Look for sequential scans on large tables — often indicates missing index; (3) Check sort and hash operations — high actual rows vs estimated rows for sort indicates work_mem too low, causing disk spills; (4) Look for nested loop on large datasets — may benefit from hash join or merge join; (5) Buffers: shared_hit (from cache) vs shared_read (from disk) — low hit ratio indicates cache is too small for working set; (6) Actual time per node — identifies which step is slowest; (7)loops — if loops > 1, the operation ran multiple times, multiply time per loop to get total contribution to query time. The key is comparing estimated vs actual for rows and time to pinpoint where planner assumptions differ from reality.

14. What is the purpose of pg_stat_bgwriter and what metrics indicate it needs tuning?

pg_stat_bgwriter tracks the background writer process. Metrics to monitor: (1) checkpoints — frequency of checkpoints; too many checkpoints indicate issues with checkpoint_segments or checkpoint_timeout; (2) buffers_checkpoint — buffers written during checkpoint; (3) buffers_clean — buffers written by the bgwriter; high buffers_clean relative to buffers_checkpoint means bgwriter is doing more work than checkpoints; (4) maxwritten_clean — how many times bgwriter stopped because it reached bgwriter_lru_maxpages; if this is high, bgwriter can't keep up. High maxwritten_clean: increase bgwriter_lru_maxpages or decrease bgwriter_delay. High buffers_backend (backends writing directly): same tuning. High buffers_backend_fsync: this is disk being overwhelmed, tune disk or reduce write volume. The goal is a healthy balance where bgwriter handles most buffer writes and backends rarely write directly.

15. How do you set up monitoring for PostgreSQL extensions like pg_partman for time-based partitioning?

pg_partman monitoring additions: (1) Partition count — monitor number of partitions; too many partitions cause performance degradation; pg_partman maintains a part_config table tracking this; (2) Partition creation lag — ensure partitions are being created ahead of data arrival; check partman's run_maintenance_log; (3) Parent table bloat — parent table should be tiny (only contains partition boundaries); if it grows, something is inserting directly into parent; (4) Detached partition monitoring — pg_partman can detach old partitions for archival; monitor that this is happening on schedule; (5) Child table sizes — monitor that partitions are evenly sized; lopsided sizes indicate skewed data patterns. SQL for partition monitoring: SELECT parent_table, COUNT(*) FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema') GROUP BY parent_table. Alert on partition count exceeding thresholds (e.g., > 1000 partitions warrants review).

16. A query that was fast yesterday is slow today, but the data volume hasn't changed significantly. What do you check?

Same-query slowness with no data change causes: (1) Statistics are stale — run ANALYZE on the table; (2) Index was dropped or disabled — check pg_stat_user_indexes for idx_scan changes; (3) Another query is holding a lock — check pg_stat_activity for wait events on the same table; (4) Connection pooling issue — query might be running under a different session or with different settings; (5) Configuration change — check postgresql.conf for work_mem, random_page_cost, or other planner-affecting settings that changed; (6) Planner regression — PostgreSQL sometimes chooses different plans for the same query due to planner statistics noise; use SET plan_seed to reproduce; (7) Memory pressure — another session might be consuming memory, causing this query to spill to disk; (8) Table bloat — VACUUM hasn't run recently and dead tuples are slowing scans. Start with EXPLAIN (ANALYZE, BUFFERS) and compare to an old EXPLAIN if saved.

17. How would you configure monitoring to detect when query optimization is needed?

Query optimization detection monitoring: (1) Track p99 query duration per query — when p99 increases > 2x baseline for specific queries, flag for review; (2) Monitor sequential scan rate per table — high sequential scans without corresponding data growth indicate index loss or statistics issues; (3) Monitor work_mem spills — queries with high temp buffer usage (sort to disk) via pg_stat_statements.total_plan_time vs total_exec_time; (4) Track query plan changes — if using pg_stat_statements with plan capture, compare plans over time; a plan change often precedes performance regression; (5) Monitor index usage ratio — for each table, compare seq_scan vs idx_scan; increasing seq_scan on tables with indexes indicates optimizer is not using indexes; (6) Set up slow query capture with EXPLAIN plans — auto_explain.log_min_duration captures plans for slow queries. Alert threshold: any query whose p99 increases > 50% week-over-week, or any table with > 10x increase in sequential scans without corresponding data growth.

18. What monitoring would you set up for a PostgreSQL HA (High Availability) cluster?

HA cluster monitoring additions: (1) Replication lag — per replica, in bytes and seconds; alert at 30s for normal workloads, lower for financial applications; (2) Replica connection status — ensure replicas are connected and receiving WAL; (3) Failover readiness — check pg_stat_replication for state and lag; (4) Primary election — monitor for unexpected primary switches; (5) Patroni or similar orchestrator health — ensure consensus is healthy; (6) Quorum presence — for quorum-based systems, ensure majority is reachable; (7) DNS/VIP resolution — if using virtual IP, ensure it resolves correctly after failover; (8) Connection pool health — PgBouncer or PgCat instances remain reachable after failover; (9) Time to recovery — track how long failover takes from start to new primary accepting writes; (10) Split brain detection — for active-active setups, monitor for writes to both sides simultaneously. HA monitoring is only valuable if tested — run failover drills annually to verify monitoring and procedures work.

19. How do you monitor the effectiveness of connection pooling with PgBouncer?

PgBouncer monitoring metrics: (1) Pool active vs idle — active connections using DB, idle connections in pool; high active count approaching pool limit indicates shortage; (2) Waiting clients — if clients are queuing, pool is undersized or queries are holding connections too long; (3) Server connection lifetime — long-lived server connections indicate connections are being reused rather than created/destroyed (good); (4) Query wait time — how long queries wait in queue before being executed; (5) Bytes received/sent — throughput monitoring; (6) Error rate — connection errors, query errors. PgBouncer show stats command shows: num_ws_acquired, num_wq_auth, num_server_connections, etc. Alert thresholds: wait queue > 0 sustained (indicates pool undersized), server connection errors > 0 (indicates backend issues), pool utilization > 90% sustained. Monitoring: export PgBouncer stats to Prometheus via postgres_exporter and visualize in Grafana.

20. What are the minimum monitoring requirements for a production PostgreSQL database before go-live?

Production PostgreSQL go-live checklist: (1) Disk space — alert at 85%, critical at 90%; (2) Connection count — alert at 80% of max_connections; (3) Cache hit ratio — alert below 95%; (4) Query latency p99 — baseline established, alert at 2x baseline; (5) Replication lag — alert at 30s for replicas; (6) Long-running queries — alert at 60s; (7) Lock waits — alert for queries waiting > 10s; (8) Database errors — log errors to monitoring system; (9) Backup success/failure — verify backups are running and test restore; (10) Autovacuum activity — ensure autovacuum is running and not falling behind on any table. Beyond these minimums, add: slow query logging with auto_explain, monitoring for specific critical queries, custom business metrics (order placement latency), and capacity planning dashboards showing growth trends. The minimums prevent outages; the additions prevent slowdowns and enable optimization.

Further Reading

Conclusion

Database monitoring is not optional. The metrics exist, the tools exist, and ignoring them leads to incidents that could have been prevented.

Start with the basics: track query latency, connections, disk usage, and cache hit ratios. Add slow query logging. Set up alerts for critical thresholds.

As you mature, add more sophisticated monitoring: query-level percentiles, detailed replication stats, correlation with application traces.

The goal is visibility. You cannot fix what you cannot see. Monitoring turns guesswork into data-driven decisions about where to optimize, when to scale, and how to prevent problems before they become incidents.

Category

Related Posts

Connection Pooling: HikariCP, pgBouncer, and ProxySQL

Learn connection pool sizing, HikariCP, pgBouncer, and ProxySQL, timeout settings, idle management, and when pooling helps or hurts performance.

#database #connection-pooling #performance

Vacuuming and Reindexing in PostgreSQL

PostgreSQL's MVCC requires regular maintenance. This guide explains dead tuples, VACUUM vs VACUUM FULL, autovacuum tuning, REINDEX strategies, and how to monitor bloat with pg_stat_user_tables.

#database #postgresql #vacuum

Metrics, Monitoring, and Alerting: From SLIs to Alerts

Learn the RED and USE methods, SLIs/SLOs/SLAs, and how to build alerting systems that catch real problems. Includes examples for web services and databases.

#observability #monitoring #metrics