Alerting in Production: Paging, Runbooks, and On-Call

Build effective alerting systems that wake people up for real emergencies: alert fatigue prevention, runbook automation, and healthy on-call practices.

published: March 25, 2026 · reading time: 22 min · author: GeekWorkBench · updated: June 17, 2026

Quick Summary

Effective alerting means waking people up only for real emergencies, without letting alert fatigue set in. This guide walks through SLO-based alerting, burn-rate alerts, runbook structure, and on-call rotation design. You will learn how to route alerts by severity, write runbooks that people can actually follow at 2am, and tune alerts after every SEV1 and SEV2. By the end you will have what you need to build an alerting system that catches user-facing issues fast without burning out your team.

Introduction

PagerDuty vs Static Thresholds vs SLO-Based Alerting

Many production systems run all four approaches in the table below at once, and that is not overengineering. Each one catches different failure modes, and the combination matters more than any single method.

Alerting Type	Best For	Limitations
PagerDuty-style urgent	Complete outages, security breaches, data loss risk	Overused if you apply it to recoverable degradations
Static threshold	Infrastructure metrics with known normal ranges	Produces false positives during expected spikes
SLO-based	Customer-facing services where user experience is the real signal	Needs upfront SLO and error budget definition
Anomaly-based	Business metrics with variable baselines	Cold start is rough; false positives pile up during launches

Picture the same incident through each lens. A database connection pool starts exhausting: a static threshold catches it at 80% connections, an SLO alert fires if checkout failures burn through the error budget, an anomaly detector flags the daily rhythm deviation, and PagerDuty pages only if retries are finally exhausted and users are locked out of checkout. Static thresholds get you to the root cause fast. SLO alerts tell you if users are actually hurting. Anomaly detection sometimes catches the slow leak before it turns into a full outage.

The practical stack is SLO alerts for customer-facing services, static thresholds for infrastructure health, and anomaly detection for the high-variance business metrics that defy fixed rules. Each covers gaps the others leave open. Static thresholds are fast and noisy. SLO alerts are quieter but require definition work upfront. Anomaly detection handles the unpredictable, but treat it as a supplement, not a replacement for the others.

When to Page vs When to Slack

Page (PagerDuty, SMS, call) when the issue requires human action within 15 minutes: service is down, data is at risk, security incident in progress. If an engineer cannot do anything useful about it in the next 15 minutes, it probably does not need a page.

Slack (or email) when the issue is important but can wait: a disk at 75% has days of runway, a non-critical service is degraded but users can work around it, capacity planning needs attention.

The litmus test: if you wake someone up, can they take meaningful action within 15 minutes? If not, it should not page.

flowchart TD
    A[Metric Fires] --> B{User impact?}
    B -->|Yes| C{SLO budget burning?}
    B -->|No| D{Remediable in 15min?}
    C -->|Yes| E[SLO Burn Rate Alert<br/>Slack + Page if fast burn]
    C -->|No| F[Static Threshold Alert<br/>Slack only]
    D -->|Yes| G[Static Threshold Alert<br/>Slack only]
    D -->|No| H[Log/Capacity Alert<br/>Ticketing system]
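
The flowchart above can be read as a small routing function. The sketch below is one possible encoding of it in Python; the function name, parameter names, and channel strings are illustrative assumptions, not part of any alerting tool.

```python
def route_alert(user_impact: bool, budget_burning: bool = False,
                fast_burn: bool = False,
                remediable_in_15min: bool = False) -> list[str]:
    """Return notification channels for a firing metric, following the flowchart.

    All names here are hypothetical; adapt them to your own alert labels.
    """
    if user_impact:
        if budget_burning:
            # SLO burn-rate alert: always Slack, page only on a fast burn
            return ["slack", "page"] if fast_burn else ["slack"]
        # Users are affected but the error budget is healthy
        return ["slack"]
    if remediable_in_15min:
        # No user impact, but someone can fix it quickly: static threshold alert
        return ["slack"]
    # Nothing a responder can do right now: log or capacity ticket
    return ["ticket"]


print(route_alert(user_impact=True, budget_burning=True, fast_burn=True))
print(route_alert(user_impact=False, remediable_in_15min=False))
```

Note that only one branch ever pages, which matches the intent of the flowchart: a page requires both user impact and a fast error budget burn.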

Alerting Philosophy: Symptoms vs Causes

The first question to ask before creating any alert: does this represent a symptom a user is experiencing, or a cause that needs investigation? Many teams alert on causes (high CPU, memory pressure, disk I/O) when they should be alerting on symptoms: error rates, latency spikes, request failures.

A user does not care that your CPU hit 90%. They care that their checkout page is loading slowly or returning errors. Alert on the symptom, investigate the cause.

The Metrics & Monitoring guide covers this distinction in more detail, but the key principle is simple: your alerting should answer the question “is the user okay?” without requiring deep system knowledge.

Defining SLOs and Error Budgets

Service Level Objectives give your alerting meaning. Without SLOs, you have no rational basis for deciding what deserves a page and what can wait.

Define SLOs based on user experience:

# Example SLO definitions
checkout_service:
  availability: 99.9% # about 43 minutes of downtime per 30-day month
  latency_p99: 2000ms # 99% of requests complete in under 2 seconds
  error_rate: 0.1% # at most 0.1% of requests return 5xx

Error budgets are the flip side of SLOs. If your availability SLO is 99.9%, you have a 43-minute error budget per month. When you burn through that budget, alerts should fire, even if the system has not completely failed.

This approach shifts alerting from “something is wrong” to “we are at risk of missing our commitments to users.”
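
To make the 43-minute figure concrete, here is a minimal sketch of the arithmetic, assuming a 30-day window. The function names are made up for illustration.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return 1 - downtime_minutes / budget


print(round(error_budget_minutes(99.9), 1))    # 43.2
print(round(budget_remaining(99.9, 23), 2))    # 0.47
```

A single 23-minute outage against a 99.9% SLO spends a little over half of the monthly budget, which is why one bad incident can justify slowing down risky changes for the rest of the window.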

Alert Severity Levels and Routing

Not everything deserves the same response. Use a clear severity hierarchy:

Severity	Definition	Response Time	Channel
SEV1	User-facing outage, data loss risk	5 minutes	PagerDuty + SMS + Call
SEV2	Degraded performance, partial outage	15 minutes	PagerDuty + Slack
SEV3	Non-critical issue, capacity risk	2 hours	Slack
SEV4	Informational, maintenance soon	Next business day	Email

Route alerts based on severity and on-call schedules. A SEV1 fires regardless of time; a SEV3 on a Saturday afternoon might wait until Monday if the on-call engineer is on a hike.
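
The severity table can be kept as data so routing logic and documentation stay in sync. The following is a sketch under that assumption; the class, dictionary, and channel names are hypothetical and not tied to a specific paging product's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    response_minutes: Optional[int]  # None means "next business day"
    channels: tuple


# Values copied from the severity table above
POLICIES = {
    "SEV1": SeverityPolicy(5, ("pagerduty", "sms", "call")),
    "SEV2": SeverityPolicy(15, ("pagerduty", "slack")),
    "SEV3": SeverityPolicy(120, ("slack",)),
    "SEV4": SeverityPolicy(None, ("email",)),
}


def channels_for(severity: str) -> tuple:
    """Look up where an alert of this severity should be delivered."""
    return POLICIES[severity].channels


def interrupts_on_call(severity: str) -> bool:
    """True if this severity pages a human rather than posting to a channel."""
    return "pagerduty" in channels_for(severity)


for sev in POLICIES:
    print(sev, channels_for(sev), interrupts_on_call(sev))
```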

Runbook Writing and Automation

A runbook is not documentation. It is a decision tree for stressful situations. When an alert fires at 2am, you should not be reading documentation. You should be following steps.

Good runbooks have three properties: they are skimmable, they have clear commands to run, and they include escalation points.

# Runbook: High Error Rate on Checkout Service

## Symptoms

- Error rate > 1% for 5 minutes
- Checkout failures appearing in logs

## Investigation Steps

1. Check payment processor status → [link to status page]
2. Check database connections: `SELECT count(*) FROM pg_stat_activity WHERE state = 'active';`
3. Check recent deployments: `git log --oneline -10`
4. Check feature flags: [link to feature flag dashboard]

## Mitigation

- If deployment-related: rollback with `./scripts/rollback.sh checkout-service`
- If database: scale up connection pool
- If payment processor: enable circuit breaker

## Escalation

- SEV1: Call Platform Lead immediately
- Still stuck after 20 minutes: Page Engineering Manager

Automate what you can. If a runbook step can be scripted, script it. Runbook automation reduces mean time to resolution because you are not copy-pasting commands under pressure.
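
As a rough illustration of scripting the mitigation section of the runbook above, the sketch below maps a diagnosed cause to a next step. Only the rollback command comes from the runbook; the other two steps have no concrete command there, so they stay as manual instructions rather than invented commands. The function and dictionary names are assumptions.

```python
import shlex

# Mitigation per diagnosed cause, taken from the example runbook.
MITIGATIONS = {
    "deployment": {"command": "./scripts/rollback.sh checkout-service"},
    "database": {"manual": "Scale up the connection pool"},
    "payment_processor": {"manual": "Enable the circuit breaker"},
}


def plan_mitigation(cause: str) -> dict:
    """Return the next step for a diagnosed cause without executing anything."""
    step = MITIGATIONS.get(cause)
    if step is None:
        # Unknown cause: fall through to the runbook's escalation section
        return {"action": "escalate", "detail": "Follow the escalation section"}
    if "command" in step:
        return {"action": "run", "argv": shlex.split(step["command"])}
    return {"action": "manual", "detail": step["manual"]}


print(plan_mitigation("deployment"))
print(plan_mitigation("database"))
print(plan_mitigation("dns"))
```

Keeping the planning step separate from execution means the on-call engineer can review the command before running it, for example by passing `argv` to `subprocess.run(argv, check=True)` after confirmation.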

The Logging Best Practices post has examples of log queries you can embed directly into runbooks for faster diagnosis.

On-Call Rotation Best Practices

Healthy on-call rotations share a few characteristics: fairness, manageable load, and psychological safety.

Rotate frequently enough that no single engineer bears the burden. A two-week rotation is common. One week is better for reducing context-switching overhead.

Compensate people for being on-call. This is table stakes. Being on-call interrupts sleep, social life, and personal time. Pay them for that disruption, either in extra pay or time off.

Separate primary and secondary on-call. Primary gets paged first. If they do not acknowledge within 5 minutes, secondary gets paged. This prevents single points of failure.
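
The primary/secondary handoff is simple enough to express as a function of elapsed time. This is a minimal sketch, assuming the 5-minute acknowledgment timeout described above; the function name and role strings are illustrative.

```python
def paged_so_far(minutes_since_alert: float, acknowledged: bool,
                 ack_timeout_minutes: float = 5) -> list[str]:
    """Return which on-call roles should have been paged by now."""
    if acknowledged:
        # Someone owns the incident; stop escalating
        return []
    if minutes_since_alert < ack_timeout_minutes:
        return ["primary"]
    # Primary did not acknowledge in time: bring in secondary as well
    return ["primary", "secondary"]


print(paged_so_far(2, acknowledged=False))
print(paged_so_far(6, acknowledged=False))
print(paged_so_far(6, acknowledged=True))
```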

Post-Incident Review and Alert Tuning

After every SEV1 and SEV2, conduct a post-incident review. The goal is to understand what happened, why the alert did or did not help, and what to fix.

Use the “5 Whys” technique: start with the incident, then ask why five times to get to root cause.

## Post-Incident Review: Checkout Outage 2026-03-15

**Duration:** 23 minutes
**Impact:** ~340 failed checkouts
**Root Cause:** Database connection pool exhausted after a slow query

**5 Whys:**

1. Why did checkouts fail? → Database connections were exhausted
2. Why were connections exhausted? → A query was holding connections for >30 seconds
3. Why was the query slow? → Missing index on orders.user_id
4. Why was the index missing? → Added in staging but not in production migration
5. Why did the migration not run? → Production migration was blocked by a lint check

**Alert Feedback:** The "high error rate" alert fired, but engineers had to investigate for 8 minutes before finding the database connection issue. We should add a separate alert for connection pool utilization >80%.

Tune alerts based on post-incident findings. If an alert fired but was not actionable, adjust the threshold or add context. If something should have alerted but did not, add a new alert.

Production Failure Scenarios

Failure	Impact	Mitigation
Alert storm from one root cause	Hundreds of pages; on-call ignores all	Use alert grouping and deduplication; route related alerts to single incident
SLO alert fires but system already recovering	Wasted wake-up; engineer annoyed	Add `for: 5m` to filter out transient issues; use burn-rate alerts instead of static thresholds
Alert routing to wrong team	Wrong people paged; issue unaddressed	Validate routing rules and test escalation paths quarterly
Runbook steps no longer accurate	MTTR increases; engineers confused	Review and update runbooks after every SEV1; version control runbooks
On-call rotation gap	Alert fires but nobody acknowledges; extended outage	Overlap on-call schedules; test handoff process; have fallback escalation
Static threshold too sensitive	Pages fire for normal traffic spikes	Add baseline deviation; use relative thresholds (CPU > 2x normal) not absolute

Observability Hooks for Alerting

Alerting systems need their own monitoring. If your alerting system fails, you lose visibility at the worst possible moment.

Alert on Alerting Itself

What to Monitor	Metric	Alert Threshold
Alertmanager down	`up{job="alertmanager"} == 0`	Page immediately
Alert delivery latency	p95 of `alertmanager_notification_latency_seconds` > 30	Warn at > 30 s; page if > 1 minute
Alert storm detected	`alerts_firing{severity="critical"} > 20`	Page if sustained > 5 min
Notification queue backing up	`prometheus_notifications_queue_length / prometheus_notifications_queue_capacity > 0.8`	Warning if sustained > 10 min
On-call acknowledgment rate	MTTA by engineer	Track but do not alert on

Alert Quality Metrics to Track

# False positive rate: alerts that fire but require no action
sum(rate(alerts_firing{action="none"}[1h])) / sum(rate(alerts_firing[1h]))

# Time to acknowledge, 95th percentile by severity (a tail view of MTTA)
histogram_quantile(0.95,
  sum(rate(alert_ack_time_seconds_bucket[1h])) by (le, severity)
)

# Alert volume by service and severity
sum by (service, severity) (rate(alerts_firing[1h]))

# Week-over-week page volume: are we paging more this week than last?
sum(increase(alerts_firing{severity="critical"}[7d])) /
  sum(increase(alerts_firing{severity="critical"}[7d] offset 7d))

Track these metrics in a dashboard. If false positive rate exceeds 30%, start pruning alerts. If MTTA is trending up, on-call load may be too high.
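
If alert history is exported from the paging tool rather than scraped as metrics, the same checks can be run offline. The sketch below assumes each record is a dictionary with `name` and `action` keys; that record shape is a made-up example, not a specific vendor's export format.

```python
from collections import Counter


def false_positive_rate(alerts: list[dict]) -> float:
    """Share of alerts that fired but required no action."""
    if not alerts:
        return 0.0
    no_action = sum(1 for a in alerts if a["action"] == "none")
    return no_action / len(alerts)


def needs_pruning(alerts: list[dict], threshold: float = 0.30) -> bool:
    """Apply the 30% rule of thumb from the paragraph above."""
    return false_positive_rate(alerts) > threshold


def noisiest(alerts: list[dict], top: int = 5) -> list[str]:
    """Alert names with the most no-action firings, worst first."""
    counts = Counter(a["name"] for a in alerts if a["action"] == "none")
    return [name for name, _ in counts.most_common(top)]


history = [
    {"name": "HighCPU", "action": "none"},
    {"name": "HighCPU", "action": "none"},
    {"name": "CheckoutErrors", "action": "rollback"},
    {"name": "DiskFull", "action": "none"},
]
print(false_positive_rate(history), needs_pruning(history), noisiest(history, top=1))
```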

Alert Routing Observability

# Alert to verify alert routing is healthy
groups:
  - name: alerting-health
    rules:
      - alert: AlertManagerDown
        expr: up{job="alertmanager"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "AlertManager instance {{ $labels.instance }} is down"
          description: "Alert routing is unavailable. All alerts will queue or fail."

      - alert: AlertDeliveryLatency
        expr: histogram_quantile(0.95, sum(rate(alertmanager_notification_latency_seconds_bucket[5m])) by (le)) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Alert delivery latency above 30 seconds"
          description: "P95 alert delivery latency is {{ $value }}s. Alerts may arrive late during incidents."

      - alert: AlertQueueBackingUp
        expr: prometheus_notifications_queue_length / prometheus_notifications_queue_capacity > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Alert notification queue above 80%"
          description: "Alert queue is backing up. Check AlertManager connectivity."

Common Pitfalls / Anti-Patterns

Alerting on causes instead of symptoms. CPU at 90% does not tell you if users are affected. Error rate > 1% does. Alert on what users experience, then investigate internal signals to find the cause.

Static thresholds without context. A CPU at 90% on a Monday morning after a deployment is suspicious. The same CPU at 90% during peak traffic on Black Friday might be expected. Pair static thresholds with service-level context.

No alert deduplication. When a database fails, you do not want 50 alerts from 50 affected services. Group alerts by root cause so engineers investigate one incident, not 50 pages.

Alerting without runbooks. A page without a runbook means the on-call engineer has to start from scratch. Every page-worthy alert needs a linked runbook.

Failing to tune alerts. If an alert fires every week and nobody fixes it, the alert is useless. Either fix the underlying issue or remove the alert. Constant alerts train engineers to ignore pages.

Compensating for bad architecture with alerts. If your database fills up every month, fix the cleanup job, do not just alert on it. Alerts mask problems; they do not solve them.
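
The deduplication pitfall above is easier to see with a small example. This sketch groups raw alerts into incidents by a shared label, which is the same idea Alertmanager applies with its `group_by` route setting. The `root_cause` label and the alert dictionary shape are assumptions for illustration.

```python
from collections import defaultdict


def group_into_incidents(alerts: list[dict], key: str = "root_cause") -> dict:
    """Collapse alerts that share a grouping label into one incident each.

    Alerts without the label fall back to their own service name, so they
    still produce an incident instead of being dropped.
    """
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[alert.get(key, alert["service"])].append(alert["service"])
    return dict(incidents)


# 50 services failing because of one database, plus one unrelated alert
alerts = [{"service": f"svc-{i}", "root_cause": "db-primary"} for i in range(50)]
alerts.append({"service": "batch-reports"})

incidents = group_into_incidents(alerts)
print(len(alerts), "alerts ->", len(incidents), "incidents")
```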

Trade-off Analysis

Alerting Strategy	Precision	Recall	Alert Volume	Best For
SLO-based alerts	High	High	Low	User-facing services
Resource-based alerts	Medium	Low	Medium	Infrastructure
Error-rate alerts	High	Medium	Medium	API services
Anomaly-based alerts	Low	High	High	Novel failure modes
Blackbox monitoring	High	Low	Low	External dependencies

PagerDuty vs alternatives	Cost	Integrations	On-call Features
PagerDuty	High	Largest	Most mature
OpsGenie	Medium	Large	Good
Squadcast	Low	Growing	Good
Slack + bots	Lowest	Varies	Limited
Custom (webhooks)	Infrastructure	Varies	DIY

Interview Questions

1. What is the difference between static threshold alerting and SLO-based alerting, and when would you use each?

What to cover:

Static thresholds use fixed values like CPU > 80%; simple but requires knowledge of normal baselines
SLO-based alerting fires when error budget burns faster than sustainable; catches user impact directly
Use static thresholds for infrastructure health (disk, memory, process count)
Use SLO-based for customer-facing services where you care about actual user experience
SLO alerts reduce noise during expected high-traffic periods if error budget is still healthy

2. How do you decide whether an alert should page someone at 3am or wait until business hours?

What to cover:

Ask: can a human do something actionable in 15 minutes? If not, it can wait
Page for: complete service unavailability, data loss risk, security breach in progress
Slack/email for: degraded performance, capacity planning issues, non-critical failures
Consider: what is the cost of not responding immediately vs the cost of waking someone?
Build escalation: primary on-call pages first, secondary pages after 5 min no-ack

3. Walk through how you would design alerting for a checkout service. What signals would you use?

What to cover:

Define SLO: 99.9% availability, p99 latency < 2s, error rate < 0.1%
Alert on symptoms: error rate spike, latency degradation, success rate drop
Investigate causes: database connections, payment processor health, dependency failures
Use burn-rate alerting: multi-window (1h, 6h, 3d) to catch both fast burns and slow leaks
Every symptom alert needs a runbook with investigation steps and mitigation playbooks

4. What makes a good runbook? Describe the structure and key elements.

What to cover:

Skimmable under pressure: clear headers, numbered steps, commands you can copy-paste
Three parts: symptoms (what triggered it), investigation steps, mitigation steps
Include exact commands with real examples, not generic documentation
Escalation path: when to call in more help, who to wake up
Version control and review after every SEV1; stale runbooks are worse than none

5. How would you reduce alert fatigue on a team that has been ignoring pages?

What to cover:

Track false positive rate: alerts that fire but require no action. Target < 20%
Run alert review sessions: ask team "which alerts do you ignore and why?"
Remove or tune alerts that fire constantly without being actionable
Group related alerts to single incident to prevent alert storms
Use SLO-based alerting to reduce noise during acceptable performance windows
Create culture where it is safe to say "this alert is not useful" and prioritize fixing it

6. What is an error budget and how does it change your alerting philosophy?

What to cover:

Error budget = acceptable downtime based on SLO (99.9% = 43 min/month)
Alert when budget burns fast, not when system fails completely
Fast burn alert (1h window): burning 14.4x sustainable rate → page
Slow leak alert (3d window): burning 3x sustainable rate → investigate
This shifts thinking from "something is broken" to "we are at risk of missing commitments"

7. How do you validate that your alerting system itself is working?

What to cover:

Alert on AlertManager being down: up{job="alertmanager"} == 0 → page immediately
Monitor alert delivery latency: > 30s means late arrival during incidents
Track notification queue depth: > 80% means alerts may queue or fail
Test escalation paths quarterly with fake alerts to verify routing
Track MTTA (mean time to acknowledge) by engineer to spot on-call overload

8. Describe the post-incident review process and how it feeds back into alerting improvements.

What to cover:

Conduct review for every SEV1 and SEV2 within 48 hours
Use 5 Whys to find root cause, not blame individuals
Ask: did alerting help or hurt? Were symptoms clear? Was the runbook useful?
Output: action items with owners and deadlines, not just discussion
Add new alerts if something should have fired but did not; tune or remove if unhelpful

9. What is the difference between PagerDuty, OpsGenie, and Squadcast for on-call management?

What to cover:

PagerDuty: enterprise-grade, most integrations, mature escalation policies, higher cost
OpsGenie: good integrations, flexible scheduling, moderate pricing
Squadcast: simpler, lower cost, growing integrations, good for smaller teams
Key features: scheduling, escalation rules, duty roster, alert grouping, analytics
Choose based on team size, existing toolchain, and budget

10. How do you handle alert routing when multiple services are affected by one underlying failure?

What to cover:

Use alert grouping/deduplication: database failure triggers one incident, not 50 separate alerts
Route to the team responsible for the root cause, not every affected service team
Suppression: when a high-severity alert fires, suppress lower-priority related alerts
Use correlation IDs or CMDB relationships to link dependent services
Test grouping logic regularly; mis-routing creates noise and delays resolution

11. What is the difference between burn-rate alerting and traditional threshold alerting? When does burn-rate win?

What to cover:

Burn-rate alerting uses multi-window comparison: 1h burning 14.4x sustainable rate = fast burn, page; 3d burning 3x = slow leak, investigate
Traditional threshold fires when a metric crosses a fixed value; burn-rate fires when error budget consumption exceeds threshold
Burn-rate catches both sudden spikes (fast burn) and slow degradation (slow leak) that threshold misses
Burn-rate reduces false positives during expected high-traffic events when budget is still healthy
Best for: SLO-based alerting where you care about user impact, not just metric values

12. How do you design alerts for a microservices architecture where failures can cascade?

What to cover:

Alert on symptoms at the entry point (API error rate, latency) not on downstream internal metrics
Use circuit breakers to prevent cascade; alert when circuit breaker trips
Set up Dead Letter Queue (DLQ) monitoring: messages piling up indicate downstream failures
Use distributed tracing to correlate cascade failures across services
Group alerts by root cause: one underlying database issue should produce one incident, not 20

13. What is anomaly-based alerting and what are its main failure modes?

What to cover:

Anomaly-based alerting uses historical data to detect deviations from normal patterns
Works for high-variance metrics where static thresholds do not apply (traffic patterns, business metrics)
Requires significant historical data to establish baseline; cold start is problematic
False positives: legitimate large changes (Black Friday, product launches) trigger alerts
False negatives: gradual degradation may not appear as anomaly if change is slow
Hybrid approach: use anomaly for high-variance, static for clear failure modes

14. How do you measure and improve Mean Time To Acknowledge (MTTA) for your on-call team?

What to cover:

Track MTTA by engineer, severity, and time of day to identify patterns
High MTTA often indicates on-call rotation is overloaded or alert routing is wrong
Reduce MTTA: ensure alerts route to correct team, primary on-call has clear escalation path
Secondary escalation should trigger after 5-10 minutes of no acknowledgment
Post-incident review should include MTTA analysis — if acknowledgment was slow, why?
MTTA target typically 5 minutes for SEV1, 15 minutes for SEV2

15. Describe how you would implement multi-window burn-rate alerting in Prometheus.

What to cover:

Compute an error ratio with `rate()` over multiple time windows: 1h, 6h, 3d
1h window at 14.4x burn rate = fast burn (page immediately)
6h window at 6x burn rate = sustainable burn exceeded (warning + page)
3d window at 3x burn rate = slow leak (investigate soon)
PromQL (1h window, 99.9% SLO; metric names are illustrative): `sum(rate(errors_total[1h])) / sum(rate(requests_total[1h])) > 14.4 * (1 - 0.999)`
Add `for: 5m` to avoid flapping on transient spikes
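
The burn-rate arithmetic behind those thresholds can be checked outside Prometheus. The sketch below assumes a 99.9% SLO and takes per-window error and request counts as plain numbers; the function names and the shape of `counts` are hypothetical.

```python
SLO = 0.999

# (window, burn-rate threshold, action), using the values from the list above
WINDOWS = [
    ("1h", 14.4, "page"),
    ("6h", 6.0, "warning + page"),
    ("3d", 3.0, "investigate"),
]


def burn_rate(errors: int, requests: int, slo: float = SLO) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if requests == 0:
        return 0.0
    return (errors / requests) / (1 - slo)


def evaluate(counts: dict, slo: float = SLO) -> list[str]:
    """counts maps a window label to (errors, requests) seen in that window."""
    actions = []
    for label, threshold, action in WINDOWS:
        errors, requests = counts[label]
        if burn_rate(errors, requests, slo) > threshold:
            actions.append(action)
    return actions


counts = {"1h": (20, 1000), "6h": (50, 10000), "3d": (400, 100000)}
print(evaluate(counts))
```

A burn rate of 1 means the budget would last exactly the SLO window. In the sample data the 1h window burns at roughly 20x and the 3d window at roughly 4x, so both the fast-burn and slow-leak conditions trigger while the 6h window (about 5x) stays quiet.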

16. What is the "toil" problem in on-call alerting and how do you address it?

What to cover:

Toil: manual work that must be repeated indefinitely; alerting noise is a major source of toil
Excessive pages that require no action drain engineer time and increase fatigue
Address toil by: pruning alerts that fire without action, grouping related alerts
Run quarterly alert reviews: ask team "which alerts do you ignore and why?"
Target false positive rate below 20%; if above, start removing or tuning alerts
Track toil metrics: alerts per week per engineer, time spent on non-actionable pages

17. How do you design runbooks for issues that require multi-step investigation across systems?

What to cover:

Structure: symptoms → investigation tree → mitigation options → escalation path
Create decision branches: if X, do Y; if Z, escalate to team W
Embed exact commands with real examples, not generic documentation
Include links to dashboards, logs, and runbooks for each investigation step
After incident, review runbook: did it help? Was the tree complete?
Version control runbooks alongside application code

18. How do you handle alert storms during major incidents? What strategies prevent overwhelming on-call?

What to cover:

Alert grouping: database failure should produce one incident, not 50 pages
Suppression: when SEV1 fires, suppress lower-priority related alerts for same root cause
Use incident channel instead of page for cascading failures (Slack war room)
Set alert storm detection: >20 critical alerts in 5 minutes triggers automatic grouping
Rate limiting: limit pages to once per 15 minutes per service during major incidents
Post-incident: identify why storm occurred and add grouping/suppression

19. How would you design a quarterly alerting review process to keep alert quality high?

What to cover:

Collect data: alert volume per service, false positive rate, MTTA, time to resolution
Run team retrospective: which alerts saved the day? Which did we ignore?
Identify top 5 noisiest alerts and investigate: are they actionable or just noise?
For each noisy alert: tune threshold, add context, or remove entirely
Check alert coverage: are there failure modes we do not alert on at all?
Document decisions: if we keep an alert despite low action rate, note why
Track progress: target reduction in false positive rate from baseline

Conclusion

Key Takeaways

Alert on symptoms users experience, not internal root causes
SLO-based alerting catches both fast burns and slow leaks
Every page-worthy alert needs a runbook with actionable steps
Tune alerts after every SEV1 and SEV2; noisy, unactionable alerts train engineers to ignore pages
Track alert quality metrics: false positive rate, MTTA, alert volume over time

Alerting Checklist

# 1. Define SLOs for customer-facing services
# availability: 99.9% (43 min/month budget)
# latency_p99: 2000ms
# error_rate: 0.1%

# 2. Configure multi-window burn-rate alerts
# 1h window: page if burning 14.4x sustainable rate
# 6h window: warning if burning 6x sustainable rate
# 3d window: investigate if burning 3x sustainable rate

# 3. Set up alert routing
# PagerDuty + SMS + call: SEV1
# PagerDuty + Slack: SEV2
# Slack: SEV3
# Email: SEV4 informational

# 4. Write runbooks for every page-worthy alert
# Investigation steps with exact commands
# Mitigation steps with rollback procedures
# Escalation path with contact info

# 5. Tune quarterly
# Review false positive rate — target < 20%
# Review MTTA — target < 5 minutes for critical
# Remove or fix alerts that fire but require no action

# 6. Monitor the monitoring
# up{job="alertmanager"} == 1
# MTTA by engineer tracked weekly
# Alert volume by severity trended monthly
