Alerting in Production: Paging, Runbooks, and On-Call
Build effective alerting systems that wake people up for real emergencies: alert fatigue prevention, runbook automation, and healthy on-call practices.
Introduction
PagerDuty vs Static Thresholds vs SLO-Based Alerting
Use PagerDuty-style urgent alerting when you have customer-facing outages that need immediate human response: complete service unavailability, data integrity risks, or security breaches. These warrant 3am pages and should be rare.
Use static threshold alerting when you have clear, known failure modes with fixed boundaries: CPU > 90%, disk > 80%, error rate > 1%. Static thresholds work for infrastructure-level signals that have predictable normal ranges.
Use SLO-based alerting when you want to alert on user impact rather than internal metrics. SLO alerts fire when your error budget is burning faster than sustainable, catching both sudden spikes and slow leaks. This is the most user-centric approach.
Use anomaly-based alerting when normal behavior varies too much for static thresholds — for example, traffic patterns that change by time of day or day of week. Anomaly detection adapts to patterns but requires historical data and can produce false positives.
The practical stack: SLO alerts for customer-facing services, static thresholds for infrastructure health, anomaly detection for high-variance business metrics.
When to Page vs When to Slack
Page (PagerDuty, SMS, call) when the issue requires human action within 15 minutes: service is down, data is at risk, security incident in progress. If it cannot be fixed by an engineer clicking something in the next 15 minutes, it probably does not need a page.
Slack (or email) when the issue is important but can wait: a disk at 75% has days of runway, a non-critical service is degraded but users can work around it, capacity planning needs attention.
The litmus test: if you wake someone up, can they take a meaningful action within 15 minutes? If not, it should not page.
flowchart TD
A[Metric Fires] --> B{User impact?}
B -->|Yes| C{SLO budget burning?}
B -->|No| D{Remediable in 15min?}
C -->|Yes| E[SLO Burn Rate Alert<br/>Slack + Page if fast burn]
C -->|No| F[Static Threshold Alert<br/>Slack only]
D -->|Yes| G[Static Threshold Alert<br/>Slack only]
D -->|No| H[Log/Capacity Alert<br/>Ticketing system]
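The flowchart above can be sketched as a small function. This is a minimal illustration, not a production router: the `Route` enum, the `route_alert` name, and the boolean inputs are hypothetical stand-ins for whatever signals your alerting pipeline exposes.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"      # Slack plus a page to on-call
    SLACK = "slack"    # Slack only
    TICKET = "ticket"  # ticketing system


def route_alert(user_impact: bool, budget_burning: bool,
                fast_burn: bool, remediable_in_15_min: bool) -> Route:
    """Mirror the flowchart: check user impact first, then burn rate or remediability."""
    if user_impact:
        if budget_burning and fast_burn:
            return Route.PAGE    # SLO burn-rate alert, fast burn
        return Route.SLACK       # slow burn, or static threshold with budget intact
    if remediable_in_15_min:
        return Route.SLACK       # static threshold alert
    return Route.TICKET          # log/capacity alert
```

Keeping the decision in one pure function makes the paging policy easy to review and to unit test when thresholds change.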
Alerting Philosophy: Symptoms vs Causes
The first question to ask before creating any alert: does this represent a symptom a user is experiencing, or a cause that needs investigation? Many teams alert on causes (high CPU, memory pressure, disk I/O) when they should be alerting on symptoms: error rates, latency spikes, request failures.
A user does not care that your CPU hit 90%. They care that their checkout page is loading slowly or returning errors. Alert on the symptom, investigate the cause.
The Metrics & Monitoring guide covers this distinction in more detail, but the key principle is simple: your alerting should answer the question “is the user okay?” without requiring deep system knowledge.
Defining SLOs and Error Budgets
Service Level Objectives give your alerting meaning. Without SLOs, you have no rational basis for deciding what deserves a page and what can wait.
Define SLOs based on user experience:
# Example SLO definitions
checkout_service:
availability: 99.9% # 43 minutes downtime per month
latency_p99: 2000ms # 99% of requests under 2 seconds
error_rate: 0.1% # at most 0.1% of requests return 5xx
Error budgets are the flip side of SLOs. If your availability SLO is 99.9%, you have a 43-minute error budget per month. Alerts should fire when you are burning through that budget faster than you can sustain, even if the system has not completely failed.
This approach shifts alerting from “something is wrong” to “we are at risk of missing our commitments to users.”
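The 43-minute figure follows directly from the SLO. A minimal sketch of the arithmetic, assuming a 30-day month (the function names are illustrative, not from any library):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def budget_remaining_minutes(slo_percent: float, downtime_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left after downtime already incurred; negative means the SLO is missed."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


print(round(error_budget_minutes(99.9), 1))  # 43.2
```

A single 23-minute outage would consume more than half of that monthly budget, which is why burn rate matters more than any one threshold crossing.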
Alert Severity Levels and Routing
Not everything deserves the same response. Use a clear severity hierarchy:
| Severity | Definition | Response Time | Channel |
|---|---|---|---|
| SEV1 | User-facing outage, data loss risk | 5 minutes | PagerDuty + SMS + Call |
| SEV2 | Degraded performance, partial outage | 15 minutes | PagerDuty + Slack |
| SEV3 | Non-critical issue, capacity risk | 2 hours | Slack |
| SEV4 | Informational, maintenance soon | Next business day | Ticket |
Route alerts based on severity and on-call schedules. A SEV1 fires regardless of time; a SEV3 on a Saturday afternoon might wait until Monday if the on-call engineer is on a hike.
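One way to keep the severity table enforceable is to encode it as data. The sketch below is an assumption-laden illustration: `SeverityPolicy`, `POLICIES`, and `notify_now` are hypothetical names, the "Ticket" channel for SEV4 is taken from the checklist at the end of this post, and treating SEV1 and SEV2 as the only severities that notify outside business hours is one reading of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    response_time: str
    channels: tuple
    notify_outside_business_hours: bool


POLICIES = {
    "SEV1": SeverityPolicy("5 minutes", ("PagerDuty", "SMS", "Call"), True),
    "SEV2": SeverityPolicy("15 minutes", ("PagerDuty", "Slack"), True),
    "SEV3": SeverityPolicy("2 hours", ("Slack",), False),
    "SEV4": SeverityPolicy("Next business day", ("Ticket",), False),
}


def notify_now(severity: str, business_hours: bool) -> bool:
    """Decide whether to notify immediately or hold until business hours."""
    policy = POLICIES[severity]
    return business_hours or policy.notify_outside_business_hours
```

With this shape, a SEV3 on a Saturday returns `False` from `notify_now` and waits, while a SEV1 always goes out.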
Runbook Writing and Automation
A runbook is not documentation. It is a decision tree for stressful situations. When an alert fires at 2am, you should not be reading documentation. You should be following steps.
Good runbooks have three properties: they are skimmable, they have clear commands to run, and they include escalation points.
# Runbook: High Error Rate on Checkout Service
## Symptoms
- Error rate > 1% for 5 minutes
- Checkout failures appearing in logs
## Investigation Steps
1. Check payment processor status → [link to status page]
2. Check database connections: `SELECT count(*) FROM pg_stat_activity WHERE state = 'active';`
3. Check recent deployments: `git log --oneline -10`
4. Check feature flags: [Plaid link]
## Mitigation
- If deployment-related: rollback with `./scripts/rollback.sh checkout-service`
- If database: scale up connection pool
- If payment processor: enable circuit breaker
## Escalation
- SEV1: Call Platform Lead immediately
- Still stuck after 20 minutes: Page Engineering Manager
Automate what you can. If a runbook step can be scripted, script it. Runbook automation reduces mean time to resolution because you are not copy-pasting commands under pressure.
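As a starting point for automation, the mitigation branch of the runbook above can be written as a function that maps findings to the next action. This is only a sketch: the three boolean inputs are hypothetical results of the investigation steps, and checking deployments first is an assumed precedence, not something the runbook prescribes.

```python
def pick_mitigation(deployed_recently: bool, db_pool_exhausted: bool,
                    processor_down: bool) -> str:
    """Return the runbook mitigation matching the first confirmed finding."""
    if deployed_recently:
        return "./scripts/rollback.sh checkout-service"
    if db_pool_exhausted:
        return "scale up connection pool"
    if processor_down:
        return "enable circuit breaker"
    # Nothing matched: follow the escalation section instead of guessing.
    return "escalate"
```

Returning the action as text rather than executing it keeps a human in the loop; a later iteration could run the rollback script directly once the checks are trusted.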
The Logging Best Practices post has examples of log queries you can embed directly into runbooks for faster diagnosis.
On-Call Rotation Best Practices
Healthy on-call rotations share a few characteristics: fairness, manageable load, and psychological safety.
Rotate frequently enough that no single engineer bears the burden. A two-week rotation is common. One week is better for reducing context-switching overhead.
Compensate people for being on-call. This is table stakes. Being on-call interrupts sleep, social life, and personal time. Pay them for that disruption, either in extra pay or time off.
Separate primary and secondary on-call. Primary gets paged first. If they do not acknowledge within 5 minutes, secondary gets paged. This prevents single points of failure.
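The primary/secondary handoff is simple enough to express in a few lines. Paging tools implement this for you through escalation policies; the sketch below only illustrates the rule, and `next_page_target` is a hypothetical name.

```python
from typing import Optional

ACK_TIMEOUT_SECONDS = 5 * 60  # secondary is paged after 5 minutes without an ack


def next_page_target(seconds_since_first_page: float,
                     acknowledged: bool) -> Optional[str]:
    """Return who should be paged now, or None once someone has acknowledged."""
    if acknowledged:
        return None
    if seconds_since_first_page < ACK_TIMEOUT_SECONDS:
        return "primary"
    return "secondary"
```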
Post-Incident Review and Alert Tuning
After every SEV1 and SEV2, conduct a post-incident review. The goal is to understand what happened, why the alert did or did not help, and what to fix.
Use the “5 Whys” technique: start with the incident, then ask why five times to get to root cause.
## Post-Incident Review: Checkout Outage 2026-03-15
**Duration:** 23 minutes
**Impact:** ~340 failed checkouts
**Root Cause:** Database connection pool exhausted after a slow query
**5 Whys:**
1. Why did checkouts fail? → Database connections were exhausted
2. Why were connections exhausted? → A query was holding connections for >30 seconds
3. Why was the query slow? → Missing index on orders.user_id
4. Why was the index missing? → Added in staging but not in production migration
5. Why did the migration not run? → Production migration was blocked by a lint check
**Alert Feedback:** The "high error rate" alert fired, but engineers had to investigate for 8 minutes before finding the database connection issue. We should add a separate alert for connection pool utilization >80%.
Tune alerts based on post-incident findings. If an alert fired but was not actionable, adjust the threshold or add context. If something should have alerted but did not, add a new alert.
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Alert storm from one root cause | Hundreds of pages; on-call ignores all | Use alert grouping and deduplication; route related alerts to single incident |
| SLO alert fires but system already recovering | Wasted wake-up; engineer annoyed | Add `for: 5m` to filter out transient issues; use burn-rate alerts instead of raw thresholds |
| Alert routing to wrong team | Wrong people paged; issue unaddressed | Validate routing rules quarterly; test escalation paths quarterly |
| Runbook steps no longer accurate | MTTR increases; engineers confused | Review and update runbooks after every SEV1; version control runbooks |
| On-call rotation gap | Alert fires but nobody acknowledges; extended outage | Overlap on-call schedules; test handoff process; have fallback escalation |
| Static threshold too sensitive | Pages fire for normal traffic spikes | Add baseline deviation; use relative thresholds (CPU > 2x normal) not absolute |
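The relative threshold in the last row (CPU > 2x normal) can be sketched as follows. Using the median of recent samples as "normal" is an assumption made here for robustness against outliers; the function name and parameters are illustrative.

```python
import statistics


def exceeds_relative_threshold(current: float, baseline_samples: list,
                               factor: float = 2.0) -> bool:
    """True when the current value is more than `factor` times the recent median."""
    if not baseline_samples:
        return False  # no baseline yet, so do not alert
    baseline = statistics.median(baseline_samples)
    return current > factor * baseline
```

A host that normally sits at 35% CPU would alert at 71%, while a host that normally runs at 60% would not alert at 90%, which is the behavior an absolute threshold cannot give you.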
Observability Hooks for Alerting
Alerting systems need their own monitoring. If your alerting system fails, you lose visibility at the worst possible moment.
Alert on Alerting Itself
| What to Monitor | Metric | Alert Threshold |
|---|---|---|
| Alertmanager down | up{job="alertmanager"} == 0 | Page immediately |
| Alert delivery latency | alertmanager_notification_latency_seconds > 30s | Page if > 1 minute |
| Alert storm detected | alerts_firing{severity="critical"} > 20 | Page if sustained > 5 min |
| Notification queue backing up | alertmanager_alerts_pending > 100 | Warning if > 50 |
| On-call acknowledgment rate | MTTA by engineer | Track but do not alert on |
Alert Quality Metrics to Track
# False positive rate: alerts that fire but require no action
sum(rate(alerts_firing{action="none"}[1h])) / sum(rate(alerts_firing[1h]))
# P95 time to acknowledge by severity (the quantile, not the mean)
histogram_quantile(0.95,
sum(rate(alert_ack_time_seconds_bucket[1h])) by (le, severity)
)
# Alert volume by service and severity
sum by (service, severity) (rate(alerts_firing[1h]))
# Alert volume trend: is the warning rate over the last hour higher than in the same hour a week ago?
sum(rate(alerts_firing{severity="warning"}[1h])) /
sum(rate(alerts_firing{severity="warning"}[1h] offset 7d))
Track these metrics in a dashboard. If false positive rate exceeds 30%, start pruning alerts. If MTTA is trending up, on-call load may be too high.
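If your alert history lives outside Prometheus, for example in an exported incident log, the same false positive check is easy to compute directly. A minimal sketch, assuming each alert record has been reduced to the action taken, with `"none"` meaning no action was required (both function names are hypothetical):

```python
def false_positive_rate(actions_taken: list) -> float:
    """Fraction of alerts that fired but required no action."""
    if not actions_taken:
        return 0.0
    no_action = sum(1 for action in actions_taken if action == "none")
    return no_action / len(actions_taken)


def needs_pruning(actions_taken: list, threshold: float = 0.30) -> bool:
    """Apply the 30% rule of thumb from the paragraph above."""
    return false_positive_rate(actions_taken) > threshold
```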
Alert Routing Observability
# Alert to verify alert routing is healthy
groups:
- name: alerting-health
rules:
- alert: AlertManagerDown
expr: up{job="alertmanager"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "AlertManager instance {{ $labels.instance }} is down"
description: "Alert routing is unavailable. All alerts will queue or fail."
- alert: AlertDeliveryLatency
expr: histogram_quantile(0.95, sum(rate(alertmanager_notification_latency_seconds_bucket[5m])) by (le)) > 30
for: 5m
labels:
severity: warning
annotations:
summary: "Alert delivery latency above 30 seconds"
description: "P95 alert delivery latency is {{ $value }}s. Alerts may arrive late during incidents."
- alert: AlertQueueBackingUp
expr: alertmanager_alerts_pending / alertmanager_alerts_maximum > 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "Alert notification queue above 80%"
description: "Alert queue is backing up. Check AlertManager connectivity."
Common Pitfalls / Anti-Patterns
Alerting on causes instead of symptoms. CPU at 90% does not tell you if users are affected. Error rate > 1% does. Alert on what users experience, then investigate internal signals to find the cause.
Static thresholds without context. A CPU at 90% on a Monday morning after a deployment is suspicious. The same CPU at 90% during peak traffic on Black Friday might be expected. Pair static thresholds with service-level context.
No alert deduplication. When a database fails, you do not want 50 alerts from 50 affected services. Group alerts by root cause so engineers investigate one incident, not 50 pages.
Alerting without runbooks. A page without a runbook means the on-call engineer has to start from scratch. Every page-worthy alert needs a linked runbook.
Failing to tune alerts. If an alert fires every week and nobody fixes it, the alert is useless. Either fix the underlying issue or remove the alert. Constant alerts train engineers to ignore pages.
Compensating for bad architecture with alerts. If your database fills up every month, fix the cleanup job, do not just alert on it. Alerts mask problems; they do not solve them.
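To make the deduplication pitfall concrete, here is a minimal grouping sketch. It assumes each alert already carries a `root_cause` label, which is the hard part in practice and is hypothetical here; Alertmanager's grouping does the equivalent using the labels you configure.

```python
from collections import defaultdict


def group_by_root_cause(alerts: list) -> dict:
    """Collapse alerts into one incident per root cause, listing affected services."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[alert["root_cause"]].append(alert["service"])
    return dict(incidents)


# Fifty services failing because of one database become a single incident.
alerts = [{"service": f"svc-{i}", "root_cause": "db-primary"} for i in range(50)]
incidents = group_by_root_cause(alerts)
```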
Trade-off Analysis
| Alerting Strategy | Precision | Recall | Alert Volume | Best For |
|---|---|---|---|---|
| SLO-based alerts | High | High | Low | User-facing services |
| Resource-based alerts | Medium | Low | Medium | Infrastructure |
| Error-rate alerts | High | Medium | Medium | API services |
| Anomaly-based alerts | Low | High | High | Novel failure modes |
| Blackbox monitoring | High | Low | Low | External dependencies |
| PagerDuty vs alternatives | Cost | Integrations | On-call Features |
|---|---|---|---|
| PagerDuty | High | Largest | Most mature |
| OpsGenie | Medium | Large | Good |
| Squadcast | Low | Growing | Good |
| Slack + bots | Lowest | Varies | Limited |
| Custom (webhooks) | Infrastructure | Varies | DIY |
Interview Questions
What to cover:
- Static thresholds use fixed values like CPU > 80%; simple but requires knowledge of normal baselines
- SLO-based alerting fires when error budget burns faster than sustainable; catches user impact directly
- Use static thresholds for infrastructure health (disk, memory, process count)
- Use SLO-based for customer-facing services where you care about actual user experience
- SLO alerts reduce noise during expected high-traffic periods if error budget is still healthy
What to cover:
- Ask: can a human do something actionable in 15 minutes? If not, it can wait
- Page for: complete service unavailability, data loss risk, security breach in progress
- Slack/email for: degraded performance, capacity planning issues, non-critical failures
- Consider: what is the cost of not responding immediately vs the cost of waking someone?
- Build escalation: primary on-call pages first, secondary pages after 5 min no-ack
What to cover:
- Define SLO: 99.9% availability, p99 latency < 2s, error rate < 0.1%
- Alert on symptoms: error rate spike, latency degradation, success rate drop
- Investigate causes: database connections, payment processor health, dependency failures
- Use burn-rate alerting: multi-window (1h, 6h, 3d) to catch both fast burns and slow leaks
- Every symptom alert needs a runbook with investigation steps and mitigation playbooks
What to cover:
- Skimmable under pressure: clear headers, numbered steps, commands you can copy-paste
- Three parts: symptoms (what triggered it), investigation steps, mitigation steps
- Include exact commands with real examples, not generic documentation
- Escalation path: when to call in more help, who to wake up
- Version control and review after every SEV1; stale runbooks are worse than none
What to cover:
- Track false positive rate: alerts that fire but require no action. Target < 20%
- Run alert review sessions: ask team "which alerts do you ignore and why?"
- Remove or tune alerts that fire constantly without being actionable
- Group related alerts to single incident to prevent alert storms
- Use SLO-based alerting to reduce noise during acceptable performance windows
- Create culture where it is safe to say "this alert is not useful" and prioritize fixing it
What to cover:
- Error budget = acceptable downtime based on SLO (99.9% = 43 min/month)
- Alert when budget burns fast, not when system fails completely
- Fast burn alert (1h window): burning 14.4x sustainable rate → page
- Slow leak alert (3d window): burning 3x sustainable rate → investigate
- This shifts thinking from "something is broken" to "we are at risk of missing commitments"
What to cover:
- Alert on AlertManager being down: `up{job="alertmanager"} == 0` → page immediately
- Monitor alert delivery latency: > 30s means late arrival during incidents
- Track notification queue depth: > 80% means alerts may queue or fail
- Test escalation paths quarterly with fake alerts to verify routing
- Track MTTA (mean time to acknowledge) by engineer to spot on-call overload
What to cover:
- Conduct review for every SEV1 and SEV2 within 48 hours
- Use 5 Whys to find root cause, not blame individuals
- Ask: did alerting help or hurt? Were symptoms clear? Was the runbook useful?
- Output: action items with owners and deadlines, not just discussion
- Add new alerts if something should have fired but did not; tune or remove if unhelpful
What to cover:
- PagerDuty: enterprise-grade, most integrations, mature escalation policies, higher cost
- OpsGenie: good integrations, flexible scheduling, moderate pricing
- Squadcast: simpler, lower cost, growing integrations, good for smaller teams
- Key features: scheduling, escalation rules, duty roster, alert grouping, analytics
- Choose based on team size, existing toolchain, and budget
What to cover:
- Use alert grouping/deduplication: database failure triggers one incident, not 50 separate alerts
- Route to the team responsible for the root cause, not every affected service team
- Suppression: when a high-severity alert fires, suppress lower-priority related alerts
- Use correlation IDs or CMDB relationships to link dependent services
- Test grouping logic regularly; mis-routing creates noise and delays resolution
What to cover:
- Burn-rate alerting uses multi-window comparison: 1h burning 14.4x sustainable rate = fast burn, page; 3d burning 3x = slow leak, investigate
- Traditional threshold fires when a metric crosses a fixed value; burn-rate fires when error budget consumption exceeds threshold
- Burn-rate catches both sudden spikes (fast burn) and slow degradation (slow leak) that threshold misses
- Burn-rate reduces false positives during expected high-traffic events when budget is still healthy
- Best for: SLO-based alerting where you care about user impact, not just metric values
What to cover:
- Alert on symptoms at the entry point (API error rate, latency) not on downstream internal metrics
- Use circuit breakers to prevent cascade; alert when circuit breaker trips
- Set up Dead Letter Queue (DLQ) monitoring: messages piling up indicate downstream failures
- Use distributed tracing to correlate cascade failures across services
- Group alerts by root cause: one underlying database issue should produce one incident, not 20
What to cover:
- Anomaly-based alerting uses historical data to detect deviations from normal patterns
- Works for high-variance metrics where static thresholds do not apply (traffic patterns, business metrics)
- Requires significant historical data to establish baseline; cold start is problematic
- False positives: legitimate large changes (Black Friday, product launches) trigger alerts
- False negatives: gradual degradation may not appear as anomaly if change is slow
- Hybrid approach: use anomaly for high-variance, static for clear failure modes
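The points above can be illustrated with the simplest form of anomaly detection, a z-score against recent history. This is a sketch under stated assumptions: real systems usually model seasonality as well, and the `min_samples` and `z_threshold` defaults are arbitrary illustrative values.

```python
import statistics


def is_anomalous(value: float, history: list,
                 z_threshold: float = 3.0, min_samples: int = 5) -> bool:
    """Flag values more than z_threshold standard deviations from the historical mean."""
    if len(history) < min_samples:
        return False  # cold start: not enough data to define "normal"
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean  # perfectly flat history: any change is a deviation
    return abs(value - mean) / stdev > z_threshold
```

The cold-start guard and the fixed threshold show both weaknesses listed above: a new metric never alerts, and a legitimate Black Friday spike would be flagged unless the history already contains similar days.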
What to cover:
- Track MTTA by engineer, severity, and time of day to identify patterns
- High MTTA often indicates on-call rotation is overloaded or alert routing is wrong
- Reduce MTTA: ensure alerts route to correct team, primary on-call has clear escalation path
- Secondary escalation should trigger after 5-10 minutes of no acknowledgment
- Post-incident review should include MTTA analysis — if acknowledgment was slow, why?
- MTTA target typically 5 minutes for SEV1, 15 minutes for SEV2
What to cover:
- Use `rate()` over multiple time windows: 1h, 6h, 3d
- 1h window at 14.4x burn rate = fast burn (page immediately)
- 6h window at 6x burn rate = sustainable burn exceeded (warning + page)
- 3d window at 3x burn rate = slow leak (investigate soon)
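- PromQL: `sum(rate(errors_total[1h])) / sum(rate(requests_total[1h])) > 14.4 * (1 - slo / 100)`, where `slo` is the target in percent (for example 99.9) and `requests_total` is an assumed request counter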
- Add `for: 5m` to avoid flapping on transient spikes
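The burn-rate arithmetic behind these windows is worth seeing outside PromQL. Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO period. A minimal sketch using the thresholds from the list above (function names are illustrative, and the SLO is assumed to be below 100%):

```python
def burn_rate(errors: float, requests: float, slo_percent: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if requests == 0:
        return 0.0  # no traffic means no budget is being spent
    allowed_error_ratio = 1 - slo_percent / 100  # assumes slo_percent < 100
    return (errors / requests) / allowed_error_ratio


def classify(rate_1h: float, rate_6h: float, rate_3d: float) -> str:
    """Map per-window burn rates to a response, checking the fastest window first."""
    if rate_1h >= 14.4:
        return "page"
    if rate_6h >= 6:
        return "warning"
    if rate_3d >= 3:
        return "investigate"
    return "ok"
```

For a 99.9% SLO, 20 errors in 1,000 requests over the last hour is a 2% error ratio against an allowed 0.1%, a burn rate of roughly 20, which is well into paging territory.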
What to cover:
- Toil: manual work that must be repeated indefinitely; alerting noise is a major source of toil
- Excessive pages that require no action drain engineer time and increase fatigue
- Address toil by: pruning alerts that fire without action, grouping related alerts
- Run quarterly alert reviews: ask team "which alerts do you ignore and why?"
- Target false positive rate below 20%; if above, start removing or tuning alerts
- Track toil metrics: alerts per week per engineer, time spent on non-actionable pages
What to cover:
- Structure: symptoms → investigation tree → mitigation options → escalation path
- Create decision branches: if X, do Y; if Z, escalate to team W
- Embed exact commands with real examples, not generic documentation
- Include links to dashboards, logs, and runbooks for each investigation step
- After incident, review runbook: did it help? Was the tree complete?
- Version control runbooks alongside application code
What to cover:
- Alert grouping: database failure should produce one incident, not 50 pages
- Suppression: when SEV1 fires, suppress lower-priority related alerts for same root cause
- Use incident channel instead of page for cascading failures — Slack war room
- Set alert storm detection: >20 critical alerts in 5 minutes triggers automatic grouping
- Rate limiting: limit pages to once per 15 minutes per service during major incidents
- Post-incident: identify why storm occurred and add grouping/suppression
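The rate-limiting point above can be sketched as a small per-service limiter. This is an illustration only: `PageRateLimiter` is a hypothetical class, timestamps are passed in as seconds so the logic stays testable, and most paging tools offer an equivalent setting that should be preferred over custom code.

```python
class PageRateLimiter:
    """Allow at most one page per service within a minimum interval."""

    def __init__(self, min_interval_s: float = 15 * 60):
        self.min_interval_s = min_interval_s
        self._last_page = {}  # service name -> timestamp of the last page sent

    def should_page(self, service: str, now_s: float) -> bool:
        last = self._last_page.get(service)
        if last is not None and now_s - last < self.min_interval_s:
            return False  # still inside the quiet window for this service
        self._last_page[service] = now_s
        return True


limiter = PageRateLimiter()
```

Suppressed pages should still be attached to the open incident so that nothing is lost, only the interruption.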
What to cover:
- Collect data: alert volume per service, false positive rate, MTTA, time to resolution
- Run team retrospective: which alerts saved the day? Which did we ignore?
- Identify top 5 noisiest alerts and investigate: are they actionable or just noise?
- For each noisy alert: tune threshold, add context, or remove entirely
- Check alert coverage: are there failure modes we do not alert on at all?
- Document decisions: if we keep an alert despite low action rate, note why
- Track progress: target reduction in false positive rate from baseline
Further Reading
- Metrics, Monitoring & Alerting - Complete observability stack integration
- Logging Best Practices - Log collection and analysis patterns
- Distributed Tracing - End-to-end request correlation
- SRE Post-Mortem Template - Building blameless incident reviews
- Alert Fatigue Guide - Recognizing and addressing alert fatigue patterns
Conclusion
Key Takeaways
- Alert on symptoms users experience, not internal root causes
- SLO-based alerting catches both fast burns and slow leaks
- Every page-worthy alert needs a runbook with actionable steps
- Tune alerts after every SEV1 and SEV2; dead alerts train engineers to ignore pages
- Track alert quality metrics: false positive rate, MTTA, alert volume over time
Alerting Checklist
# 1. Define SLOs for customer-facing services
# availability: 99.9% (43 min/month budget)
# latency_p99: 2000ms
# error_rate: 0.1%
# 2. Configure multi-window burn-rate alerts
# 1h window: page if burning 14.4x sustainable rate
# 6h window: warning if burning 6x sustainable rate
# 3d window: investigate if burning 3x sustainable rate
# 3. Set up alert routing
# PagerDuty: SEV1 critical (page + SMS + call)
# Slack: SEV2/SEV3 warnings
# Ticket: SEV4 informational
# 4. Write runbooks for every page-worthy alert
# Investigation steps with exact commands
# Mitigation steps with rollback procedures
# Escalation path with contact info
# 5. Tune quarterly
# Review false positive rate — target < 20%
# Review MTTA — target < 5 minutes for critical
# Remove or fix alerts that fire but require no action
# 6. Monitor the monitoring
# up{job="alertmanager"} == 1
# MTTA by engineer tracked weekly
# Alert volume by severity trended monthly