Alerting in Production: Paging, Runbooks, and On-Call

Build effective alerting systems that wake people up for real emergencies: alert fatigue prevention, runbook automation, and healthy on-call practices.

Reading time: 21 min read · Author: GeekWorkBench

Introduction

PagerDuty vs Static Thresholds vs SLO-Based Alerting

Use PagerDuty-style urgent alerting when you have customer-facing outages that need immediate human response: complete service unavailability, data integrity risks, or security breaches. These warrant 3am pages and should be rare.

Use static threshold alerting when you have clear, known failure modes with fixed boundaries: CPU > 90%, disk > 80%, error rate > 1%. Static thresholds work for infrastructure-level signals that have predictable normal ranges.

Use SLO-based alerting when you want to alert on user impact rather than internal metrics. SLO alerts fire when your error budget is burning faster than sustainable, catching both sudden spikes and slow leaks. This is the most user-centric approach.

Use anomaly-based alerting when normal behavior varies too much for static thresholds — for example, traffic patterns that change by time of day or day of week. Anomaly detection adapts to patterns but requires historical data and can produce false positives.
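
As a rough illustration of the difference from a static threshold, the sketch below flags a value only when it deviates strongly from its own history (for example, the same hour on previous weeks). The function name, the z-score threshold of 3, and the sample numbers are assumptions made for this example, not part of any particular tool.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], current: float, z_threshold: float = 3.0) -> bool:
    """Flag `current` if it is more than `z_threshold` standard deviations from history."""
    if len(history) < 2:
        # Cold start: without enough history there is no baseline to compare against.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is a deviation.
        return current != mu
    return abs(current - mu) / sigma > z_threshold


# Requests per minute at the same hour on previous weeks (made-up numbers).
same_hour_last_weeks = [100, 102, 98, 101, 99]
print(is_anomalous(same_hour_last_weeks, 101))  # within normal variation
print(is_anomalous(same_hour_last_weeks, 130))  # far outside the usual range
```

The cold-start branch reflects the trade-off noted above: until enough history exists, this kind of check cannot say anything useful.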

The practical stack: SLO alerts for customer-facing services, static thresholds for infrastructure health, anomaly detection for high-variance business metrics.

When to Page vs When to Slack

Page (PagerDuty, SMS, call) when the issue requires human action within 15 minutes: service is down, data is at risk, security incident in progress. If an engineer cannot do anything useful about it in the next 15 minutes, it probably does not need a page.

Slack (or email) when the issue is important but can wait: a disk at 75% has days of runway, a non-critical service is degraded but users can work around it, capacity planning needs attention.

The litmus test: if you wake someone up, can they take a meaningful action within those 15 minutes? If not, it should not page.

flowchart TD
    A[Metric Fires] --> B{User impact?}
    B -->|Yes| C{SLO budget burning?}
    B -->|No| D{Remediable in 15min?}
    C -->|Yes| E[SLO Burn Rate Alert<br/>Slack + Page if fast burn]
    C -->|No| F[Static Threshold Alert<br/>Slack only]
    D -->|Yes| G[Static Threshold Alert<br/>Slack only]
    D -->|No| H[Log/Capacity Alert<br/>Ticketing system]
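
The same decision flow can be written as a small function, which makes the branches easy to unit test. This is a minimal sketch: the `Signal` fields and the channel names are hypothetical stand-ins for whatever your alerting pipeline actually exposes.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    user_impact: bool          # are users seeing errors or slowness?
    slo_budget_burning: bool   # is the error budget burning faster than sustainable?
    fast_burn: bool            # is the burn fast enough to justify a page?
    remediable_in_15min: bool  # can an engineer act on it within 15 minutes?


def route(signal: Signal) -> list[str]:
    """Return the notification channels for a signal, mirroring the flowchart."""
    if signal.user_impact:
        if signal.slo_budget_burning:
            # SLO burn-rate alert: always Slack, page only on a fast burn.
            return ["slack", "page"] if signal.fast_burn else ["slack"]
        # User impact without budget burn: static threshold alert, Slack only.
        return ["slack"]
    if signal.remediable_in_15min:
        # No user impact but quickly fixable: static threshold alert, Slack only.
        return ["slack"]
    # No user impact and no quick fix: log/capacity alert, goes to ticketing.
    return ["ticket"]
```

Only one path ends in a page, which matches the idea that pages should be rare.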

Alerting Philosophy: Symptoms vs Causes

The first question to ask before creating any alert: does this represent a symptom a user is experiencing, or a cause that needs investigation? Many teams alert on causes (high CPU, memory pressure, disk I/O) when they should be alerting on symptoms: error rates, latency spikes, request failures.

A user does not care that your CPU hit 90%. They care that their checkout page is loading slowly or returning errors. Alert on the symptom, investigate the cause.

The Metrics & Monitoring guide covers this distinction in more detail, but the key principle is simple: your alerting should answer the question “is the user okay?” without requiring deep system knowledge.

Defining SLOs and Error Budgets

Service Level Objectives give your alerting meaning. Without SLOs, you have no rational basis for deciding what deserves a page and what can wait.

Define SLOs based on user experience:

# Example SLO definitions
checkout_service:
  availability: 99.9% # ~43 minutes of downtime per 30-day month
  latency_p99: 2000ms # 99% of requests complete in under 2 seconds
  error_rate: 0.1% # at most 0.1% of requests return 5xx

Error budgets are the flip side of SLOs. If your availability SLO is 99.9%, you have an error budget of roughly 43 minutes per month (0.1% of a 30-day month). When you burn through that budget too quickly, alerts should fire, even if the system has not completely failed.

This approach shifts alerting from “something is wrong” to “we are at risk of missing our commitments to users.”
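
The arithmetic behind the budget and the burn rate is short enough to sketch. The function names below are made up for this example. The 14.4x (1h) and 3x (3d) thresholds are the ones used later in this post, and a 30-day window is assumed.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for an availability SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def burn_rate(errors: int, requests: int, slo_percent: float) -> float:
    """How many times faster than sustainable the error budget is being spent.

    1.0 means the budget would last exactly the SLO window.
    """
    if requests == 0:
        return 0.0
    allowed_error_ratio = 1 - slo_percent / 100
    return (errors / requests) / allowed_error_ratio


def classify(burn_1h: float, burn_3d: float) -> str:
    """Map burn rates over two windows to an action."""
    if burn_1h >= 14.4:
        return "page"         # fast burn
    if burn_3d >= 3:
        return "investigate"  # slow leak
    return "ok"


print(round(error_budget_minutes(99.9), 1))      # about 43.2 minutes for 99.9%
print(round(burn_rate(150, 10_000, 99.9), 1))    # 1.5% errors against a 0.1% allowance
```

A 1.5% error rate against a 99.9% SLO is a burn rate of about 15, which crosses the fast-burn threshold even though most requests still succeed.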

Alert Severity Levels and Routing

Not everything deserves the same response. Use a clear severity hierarchy:

| Severity | Definition | Response Time | Channel |
| --- | --- | --- | --- |
| SEV1 | User-facing outage, data loss risk | 5 minutes | PagerDuty + SMS + Call |
| SEV2 | Degraded performance, partial outage | 15 minutes | PagerDuty + Slack |
| SEV3 | Non-critical issue, capacity risk | 2 hours | Slack |
| SEV4 | Informational, maintenance soon | Next business day | Email |

Route alerts based on severity and on-call schedules. A SEV1 fires regardless of time; a SEV3 on a Saturday afternoon might wait until Monday if the on-call engineer is on a hike.
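
One way to keep a severity policy like this from drifting is to hold it as data that both routing code and tests can read. The sketch below is illustrative only: `Policy`, `POLICIES`, and the channel identifiers are hypothetical names, and "next business day" is approximated as one day.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Policy:
    response_time: timedelta
    channels: tuple[str, ...]


# Mirrors the severity hierarchy described above.
POLICIES = {
    "SEV1": Policy(timedelta(minutes=5), ("pagerduty", "sms", "call")),
    "SEV2": Policy(timedelta(minutes=15), ("pagerduty", "slack")),
    "SEV3": Policy(timedelta(hours=2), ("slack",)),
    "SEV4": Policy(timedelta(days=1), ("email",)),  # approximation of "next business day"
}


def channels_for(severity: str) -> tuple[str, ...]:
    """Look up notification channels, failing loudly on an unknown severity."""
    try:
        return POLICIES[severity].channels
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


def wakes_someone_up(severity: str) -> bool:
    """Only severities routed through the pager interrupt people off-hours."""
    return "pagerduty" in channels_for(severity)
```

Failing on an unknown severity is a deliberate choice here: a mislabeled alert should surface as an error rather than silently route nowhere.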

Runbook Writing and Automation

A runbook is not documentation. It is a decision tree for stressful situations. When an alert fires at 2am, you should not be reading documentation. You should be following steps.

Good runbooks have three properties: they are skimmable, they have clear commands to run, and they include escalation points.

# Runbook: High Error Rate on Checkout Service

## Symptoms

- Error rate > 1% for 5 minutes
- Checkout failures appearing in logs

## Investigation Steps

1. Check payment processor status → [link to status page]
2. Check database connections: `SELECT count(*) FROM pg_stat_activity WHERE state = 'active';`
3. Check recent deployments: `git log --oneline -10`
4. Check feature flags: [Plaid link]

## Mitigation

- If deployment-related: rollback with `./scripts/rollback.sh checkout-service`
- If database: scale up connection pool
- If payment processor: enable circuit breaker

## Escalation

- SEV1: Call Platform Lead immediately
- Still stuck after 20 minutes: Page Engineering Manager

Automate what you can. If a runbook step can be scripted, script it. Runbook automation reduces mean time to resolution because you are not copy-pasting commands under pressure.
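
As a first step toward automation, the decision tree of the sample runbook can be encoded so a bot or CLI can suggest the next action. This is a sketch under assumptions: the cause labels and the `next_step` helper are hypothetical, while the mitigation strings and the 20-minute escalation come from the runbook above.

```python
from typing import Optional

# Mitigations from the sample runbook, keyed by diagnosed cause (labels are hypothetical).
MITIGATIONS = {
    "deployment": "./scripts/rollback.sh checkout-service",
    "database": "scale up connection pool",
    "payment_processor": "enable circuit breaker",
}

ESCALATE_AFTER_MINUTES = 20


def next_step(cause: Optional[str], minutes_elapsed: int) -> str:
    """Suggest the next runbook action for the on-call engineer."""
    if cause in MITIGATIONS:
        # A known cause maps directly to its mitigation.
        return MITIGATIONS[cause]
    if minutes_elapsed >= ESCALATE_AFTER_MINUTES:
        # Still stuck after the runbook's time limit: escalate.
        return "page Engineering Manager"
    return "continue investigation"
```

The function only suggests a step. Whether to execute a rollback automatically is a separate decision that depends on how much you trust the diagnosis.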

The Logging Best Practices post has examples of log queries you can embed directly into runbooks for faster diagnosis.

On-Call Rotation Best Practices

Healthy on-call rotations share a few characteristics: fairness, manageable load, and psychological safety.

Rotate frequently enough that no single engineer bears the burden. A two-week rotation is common. One week is better for reducing context-switching overhead.

Compensate people for being on-call. This is table stakes. Being on-call interrupts sleep, social life, and personal time. Pay them for that disruption, either in extra pay or time off.

Separate primary and secondary on-call. Primary gets paged first. If they do not acknowledge within 5 minutes, secondary gets paged. This prevents single points of failure.
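
The escalation rule is simple enough to express directly. Below is a minimal sketch, assuming the 5-minute acknowledgment window described above. The function and role names are illustrative, and real paging tools implement this logic for you.

```python
from datetime import datetime, timedelta
from typing import Optional

ACK_TIMEOUT = timedelta(minutes=5)


def who_to_page(fired_at: datetime, now: datetime, acked_at: Optional[datetime]) -> Optional[str]:
    """Return which on-call role should be paged right now, or None if acknowledged."""
    if acked_at is not None:
        return None  # someone has the incident; stop paging
    if now - fired_at >= ACK_TIMEOUT:
        return "secondary"  # primary did not acknowledge in time
    return "primary"


fired = datetime(2026, 3, 15, 2, 0)
print(who_to_page(fired, fired + timedelta(minutes=2), None))  # primary
print(who_to_page(fired, fired + timedelta(minutes=6), None))  # secondary
```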

Post-Incident Review and Alert Tuning

After every SEV1 and SEV2, conduct a post-incident review. The goal is to understand what happened, why the alert did or did not help, and what to fix.

Use the “5 Whys” technique: start with the incident, then ask why five times to get to root cause.

## Post-Incident Review: Checkout Outage 2026-03-15

**Duration:** 23 minutes
**Impact:** ~340 failed checkouts
**Root Cause:** Database connection pool exhausted after a slow query

**5 Whys:**

1. Why did checkouts fail? → Database connections were exhausted
2. Why were connections exhausted? → A query was holding connections for >30 seconds
3. Why was the query slow? → Missing index on orders.user_id
4. Why was the index missing? → Added in staging but not in production migration
5. Why did the migration not run? → Production migration was blocked by a lint check

**Alert Feedback:** The "high error rate" alert fired, but engineers had to investigate for 8 minutes before finding the database connection issue. We should add a separate alert for connection pool utilization >80%.

Tune alerts based on post-incident findings. If an alert fired but was not actionable, adjust the threshold or add context. If something should have alerted but did not, add a new alert.

Production Failure Scenarios

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Alert storm from one root cause | Hundreds of pages; on-call ignores all | Use alert grouping and deduplication; route related alerts to single incident |
| SLO alert fires but system already recovering | Wasted wake-up; engineer annoyed | Add `for: 5m` to filter out transient issues; use burn-rate alerts instead of thresholds |
| Alert routing to wrong team | Wrong people paged; issue unaddressed | Validate routing rules quarterly; test escalation paths quarterly |
| Runbook steps no longer accurate | MTTR increases; engineers confused | Review and update runbooks after every SEV1; version control runbooks |
| On-call rotation gap | Alert fires but nobody acknowledges; extended outage | Overlap on-call schedules; test handoff process; have fallback escalation |
| Static threshold too sensitive | Pages fire for normal traffic spikes | Add baseline deviation; use relative thresholds (CPU > 2x normal) not absolute |
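
To make the grouping mitigation concrete, the sketch below collapses alerts that share a root-cause label into one incident. The `root_cause` and `service` keys are assumptions for this example. In practice the grouping key would come from whatever labels your alerts already carry.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict], key: str = "root_cause") -> dict[str, list[dict]]:
    """Group alerts into incidents by a shared label."""
    incidents = defaultdict(list)
    for alert in alerts:
        # Fall back to the service name when no root cause has been attached.
        group = alert.get(key) or alert.get("service", "unknown")
        incidents[group].append(alert)
    return dict(incidents)


def pages_to_send(alerts: list[dict]) -> list[str]:
    """One page per incident instead of one page per alert."""
    return [
        f"{cause}: {len(group)} related alerts"
        for cause, group in group_alerts(alerts).items()
    ]


# Fifty services failing because of one database, plus one unrelated alert.
storm = [{"service": f"svc-{i}", "root_cause": "db-primary"} for i in range(50)]
storm.append({"service": "search"})
print(pages_to_send(storm))  # two pages instead of fifty-one
```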

Observability Hooks for Alerting

Alerting systems need their own monitoring. If your alerting system fails, you lose visibility at the worst possible moment.

Alert on Alerting Itself

| What to Monitor | Metric | Alert Threshold |
| --- | --- | --- |
| Alertmanager down | up{job="alertmanager"} == 0 | Page immediately |
| Alert delivery latency | alertmanager_notification_latency_seconds > 30s | Page if > 1 minute |
| Alert storm detected | alerts_firing{severity="critical"} > 20 | Page if sustained > 5 min |
| Notification queue backing up | alertmanager_alerts_pending > 100 | Warning if > 50 |
| On-call acknowledgment rate | MTTA by engineer | Track but do not alert on |

Alert Quality Metrics to Track

# False positive rate: alerts that fire but require no action
sum(rate(alerts_firing{action="none"}[1h])) / sum(rate(alerts_firing[1h]))

# Time to acknowledge, 95th percentile, by severity (tracks the slow tail alongside MTTA)
histogram_quantile(0.95,
  sum(rate(alert_ack_time_seconds_bucket[1h])) by (le, severity)
)

# Alert volume by service and severity
sum by (service, severity) (rate(alerts_firing[1h]))

# Week-over-week alert volume: are we alerting more this week than last?
sum(rate(alerts_firing{severity="warning"}[7d])) /
  sum(rate(alerts_firing{severity="warning"}[7d] offset 7d))

Track these metrics in a dashboard. If false positive rate exceeds 30%, start pruning alerts. If MTTA is trending up, on-call load may be too high.
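
If alert history is easier to export than to query, the same review can be done offline. The sketch below assumes each exported record has a `name` and an `action` field, with `action == "none"` meaning nobody had to do anything. Both field names are hypothetical.

```python
from collections import Counter

PRUNE_THRESHOLD = 0.30  # the 30% false positive rate mentioned above


def false_positive_rate(alerts: list[dict]) -> float:
    """Fraction of alerts that fired but required no action."""
    if not alerts:
        return 0.0
    no_action = sum(1 for a in alerts if a["action"] == "none")
    return no_action / len(alerts)


def noisiest_alerts(alerts: list[dict], top_n: int = 5) -> list[tuple[str, int]]:
    """Alert names ranked by how often they fired without needing action."""
    counts = Counter(a["name"] for a in alerts if a["action"] == "none")
    return counts.most_common(top_n)


def should_prune(alerts: list[dict]) -> bool:
    return false_positive_rate(alerts) > PRUNE_THRESHOLD
```

Ranking by non-actionable firings gives the review session a concrete starting list instead of a single aggregate number.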

Alert Routing Observability

# Alert to verify alert routing is healthy
groups:
  - name: alerting-health
    rules:
      - alert: AlertManagerDown
        expr: up{job="alertmanager"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "AlertManager instance {{ $labels.instance }} is down"
          description: "Alert routing is unavailable. All alerts will queue or fail."

      - alert: AlertDeliveryLatency
        expr: histogram_quantile(0.95, sum(rate(alertmanager_notification_latency_seconds_bucket[5m])) by (le)) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Alert delivery latency above 30 seconds"
          description: "P95 alert delivery latency is {{ $value }}s. Alerts may arrive late during incidents."

      - alert: AlertQueueBackingUp
        expr: alertmanager_alerts_pending / alertmanager_alerts_maximum > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Alert notification queue above 80%"
          description: "Alert queue is backing up. Check AlertManager connectivity."

Common Pitfalls / Anti-Patterns

Alerting on causes instead of symptoms. CPU at 90% does not tell you if users are affected. Error rate > 1% does. Alert on what users experience, then investigate internal signals to find the cause.

Static thresholds without context. CPU at 90% on a Monday morning after a deployment is suspicious. The same 90% during peak traffic on Black Friday might be expected. Pair static thresholds with service-level context.

No alert deduplication. When a database fails, you do not want 50 alerts from 50 affected services. Group alerts by root cause so engineers investigate one incident, not 50 pages.

Alerting without runbooks. A page without a runbook means the on-call engineer has to start from scratch. Every page-worthy alert needs a linked runbook.

Failing to tune alerts. If an alert fires every week and nobody fixes it, the alert is useless. Either fix the underlying issue or remove the alert. Constant alerts train engineers to ignore pages.

Compensating for bad architecture with alerts. If your database fills up every month, fix the cleanup job, do not just alert on it. Alerts mask problems; they do not solve them.

Trade-off Analysis

| Alerting Strategy | Precision | Recall | Alert Volume | Best For |
| --- | --- | --- | --- | --- |
| SLO-based alerts | High | High | Low | User-facing services |
| Resource-based alerts | Medium | Low | Medium | Infrastructure |
| Error-rate alerts | High | Medium | Medium | API services |
| Anomaly-based alerts | Low | High | High | Novel failure modes |
| Blackbox monitoring | High | Low | Low | External dependencies |

| PagerDuty vs alternatives | Cost | Integrations | On-call Features |
| --- | --- | --- | --- |
| PagerDuty | High | Largest | Most mature |
| OpsGenie | Medium | Large | Good |
| Squadcast | Low | Growing | Good |
| Slack + bots | Lowest | Varies | Limited |
| Custom (webhooks) | Infrastructure | Varies | DIY |

Interview Questions

1. What is the difference between static threshold alerting and SLO-based alerting, and when would you use each?

What to cover:

  • Static thresholds use fixed values like CPU > 80%; simple but requires knowledge of normal baselines
  • SLO-based alerting fires when error budget burns faster than sustainable; catches user impact directly
  • Use static thresholds for infrastructure health (disk, memory, process count)
  • Use SLO-based for customer-facing services where you care about actual user experience
  • SLO alerts reduce noise during expected high-traffic periods if error budget is still healthy
2. How do you decide whether an alert should page someone at 3am or wait until business hours?

What to cover:

  • Ask: can a human do something actionable in 15 minutes? If not, it can wait
  • Page for: complete service unavailability, data loss risk, security breach in progress
  • Slack/email for: degraded performance, capacity planning issues, non-critical failures
  • Consider: what is the cost of not responding immediately vs the cost of waking someone?
  • Build escalation: primary on-call pages first, secondary pages after 5 min no-ack
3. Walk through how you would design alerting for a checkout service. What signals would you use?

What to cover:

  • Define SLO: 99.9% availability, p99 latency < 2s, error rate < 0.1%
  • Alert on symptoms: error rate spike, latency degradation, success rate drop
  • Investigate causes: database connections, payment processor health, dependency failures
  • Use burn-rate alerting: multi-window (1h, 6h, 3d) to catch both fast burns and slow leaks
  • Every symptom alert needs a runbook with investigation steps and mitigation playbooks
4. What makes a good runbook? Describe the structure and key elements.

What to cover:

  • Skimmable under pressure: clear headers, numbered steps, commands you can copy-paste
  • Three parts: symptoms (what triggered it), investigation steps, mitigation steps
  • Include exact commands with real examples, not generic documentation
  • Escalation path: when to call in more help, who to wake up
  • Version control and review after every SEV1; stale runbooks are worse than none
5. How would you reduce alert fatigue on a team that has been ignoring pages?

What to cover:

  • Track false positive rate: alerts that fire but require no action. Target < 20%
  • Run alert review sessions: ask team "which alerts do you ignore and why?"
  • Remove or tune alerts that fire constantly without being actionable
  • Group related alerts to single incident to prevent alert storms
  • Use SLO-based alerting to reduce noise during acceptable performance windows
  • Create culture where it is safe to say "this alert is not useful" and prioritize fixing it
6. What is an error budget and how does it change your alerting philosophy?

What to cover:

  • Error budget = acceptable downtime based on SLO (99.9% = 43 min/month)
  • Alert when budget burns fast, not when system fails completely
  • Fast burn alert (1h window): burning 14.4x sustainable rate → page
  • Slow leak alert (3d window): burning 3x sustainable rate → investigate
  • This shifts thinking from "something is broken" to "we are at risk of missing commitments"
7. How do you validate that your alerting system itself is working?

What to cover:

  • Alert on AlertManager being down: up{job="alertmanager"} == 0 → page immediately
  • Monitor alert delivery latency: > 30s means late arrival during incidents
  • Track notification queue depth: > 80% means alerts may queue or fail
  • Test escalation paths quarterly with fake alerts to verify routing
  • Track MTTA (mean time to acknowledge) by engineer to spot on-call overload
8. Describe the post-incident review process and how it feeds back into alerting improvements.

What to cover:

  • Conduct review for every SEV1 and SEV2 within 48 hours
  • Use 5 Whys to find root cause, not blame individuals
  • Ask: did alerting help or hurt? Were symptoms clear? Was the runbook useful?
  • Output: action items with owners and deadlines, not just discussion
  • Add new alerts if something should have fired but did not; tune or remove if unhelpful
9. What is the difference between PagerDuty, OpsGenie, and Squadcast for on-call management?

What to cover:

  • PagerDuty: enterprise-grade, most integrations, mature escalation policies, higher cost
  • OpsGenie: good integrations, flexible scheduling, moderate pricing
  • Squadcast: simpler, lower cost, growing integrations, good for smaller teams
  • Key features: scheduling, escalation rules, duty roster, alert grouping, analytics
  • Choose based on team size, existing toolchain, and budget
10. How do you handle alert routing when multiple services are affected by one underlying failure?

What to cover:

  • Use alert grouping/deduplication: database failure triggers one incident, not 50 separate alerts
  • Route to the team responsible for the root cause, not every affected service team
  • Suppression: when a high-severity alert fires, suppress lower-priority related alerts
  • Use correlation IDs or CMDB relationships to link dependent services
  • Test grouping logic regularly; mis-routing creates noise and delays resolution
11. What is the difference between burn-rate alerting and traditional threshold alerting? When does burn-rate win?

What to cover:

  • Burn-rate alerting uses multi-window comparison: 1h burning 14.4x sustainable rate = fast burn, page; 3d burning 3x = slow leak, investigate
  • Traditional threshold fires when a metric crosses a fixed value; burn-rate fires when error budget consumption exceeds threshold
  • Burn-rate catches both sudden spikes (fast burn) and slow degradation (slow leak) that threshold misses
  • Burn-rate reduces false positives during expected high-traffic events when budget is still healthy
  • Best for: SLO-based alerting where you care about user impact, not just metric values
12. How do you design alerts for a microservices architecture where failures can cascade?

What to cover:

  • Alert on symptoms at the entry point (API error rate, latency) not on downstream internal metrics
  • Use circuit breakers to prevent cascade; alert when circuit breaker trips
  • Set up Dead Letter Queue (DLQ) monitoring: messages piling up indicate downstream failures
  • Use distributed tracing to correlate cascade failures across services
  • Group alerts by root cause: one underlying database issue should produce one incident, not 20
13. What is anomaly-based alerting and what are its main failure modes?

What to cover:

  • Anomaly-based alerting uses historical data to detect deviations from normal patterns
  • Works for high-variance metrics where static thresholds do not apply (traffic patterns, business metrics)
  • Requires significant historical data to establish baseline; cold start is problematic
  • False positives: legitimate large changes (Black Friday, product launches) trigger alerts
  • False negatives: gradual degradation may not appear as anomaly if change is slow
  • Hybrid approach: use anomaly for high-variance, static for clear failure modes
14. How do you measure and improve Mean Time To Acknowledge (MTTA) for your on-call team?

What to cover:

  • Track MTTA by engineer, severity, and time of day to identify patterns
  • High MTTA often indicates on-call rotation is overloaded or alert routing is wrong
  • Reduce MTTA: ensure alerts route to correct team, primary on-call has clear escalation path
  • Secondary escalation should trigger after 5-10 minutes of no acknowledgment
  • Post-incident review should include MTTA analysis — if acknowledgment was slow, why?
  • MTTA target typically 5 minutes for SEV1, 15 minutes for SEV2
15. Describe how you would implement multi-window burn-rate alerting in Prometheus.

What to cover:

  • Compute the error ratio with `rate()` over multiple time windows: 1h, 6h, 3d
  • 1h window at 14.4x burn rate = fast burn (page immediately)
  • 6h window at 6x burn rate = sustainable burn exceeded (warning + page)
  • 3d window at 3x burn rate = slow leak (investigate soon)
  • PromQL (illustrative metric names; `slo` is the target in percent): `sum(rate(errors_total[1h])) / sum(rate(requests_total[1h])) > 14.4 * (1 - slo / 100)`
  • Add `for: 5m` to avoid flapping on transient spikes
16. What is the "toil" problem in on-call alerting and how do you address it?

What to cover:

  • Toil: manual work that must be repeated indefinitely; alerting noise is a major source of toil
  • Excessive pages that require no action drain engineer time and increase fatigue
  • Address toil by: pruning alerts that fire without action, grouping related alerts
  • Run quarterly alert reviews: ask team "which alerts do you ignore and why?"
  • Target false positive rate below 20%; if above, start removing or tuning alerts
  • Track toil metrics: alerts per week per engineer, time spent on non-actionable pages
17. How do you design runbooks for issues that require multi-step investigation across systems?

What to cover:

  • Structure: symptoms → investigation tree → mitigation options → escalation path
  • Create decision branches: if X, do Y; if Z, escalate to team W
  • Embed exact commands with real examples, not generic documentation
  • Include links to dashboards, logs, and runbooks for each investigation step
  • After incident, review runbook: did it help? Was the tree complete?
  • Version control runbooks alongside application code
18. How do you handle alert storms during major incidents? What strategies prevent overwhelming on-call?

What to cover:

  • Alert grouping: database failure should produce one incident, not 50 pages
  • Suppression: when SEV1 fires, suppress lower-priority related alerts for same root cause
  • Use incident channel instead of page for cascading failures: a Slack war room
  • Set alert storm detection: >20 critical alerts in 5 minutes triggers automatic grouping
  • Rate limiting: limit pages to once per 15 minutes per service during major incidents
  • Post-incident: identify why storm occurred and add grouping/suppression
19. How would you design a quarterly alerting review process to keep alert quality high?

What to cover:

  • Collect data: alert volume per service, false positive rate, MTTA, time to resolution
  • Run team retrospective: which alerts saved the day? Which did we ignore?
  • Identify top 5 noisiest alerts and investigate: are they actionable or just noise?
  • For each noisy alert: tune threshold, add context, or remove entirely
  • Check alert coverage: are there failure modes we do not alert on at all?
  • Document decisions: if we keep an alert despite low action rate, note why
  • Track progress: target reduction in false positive rate from baseline

Conclusion

Key Takeaways

  • Alert on symptoms users experience, not internal root causes
  • SLO-based alerting catches both fast burns and slow leaks
  • Every page-worthy alert needs a runbook with actionable steps
  • Tune alerts after every SEV1 and SEV2; dead alerts train engineers to ignore pages
  • Track alert quality metrics: false positive rate, MTTA, alert volume over time

Alerting Checklist

# 1. Define SLOs for customer-facing services
# availability: 99.9% (43 min/month budget)
# latency_p99: 2000ms
# error_rate: 0.1%

# 2. Configure multi-window burn-rate alerts
# 1h window: page if burning 14.4x sustainable rate
# 6h window: warning if burning 6x sustainable rate
# 3d window: investigate if burning 3x sustainable rate

# 3. Set up alert routing
# PagerDuty: SEV1 (page + SMS + call), SEV2 (page + Slack)
# Slack: SEV3 warnings
# Email: SEV4 informational

# 4. Write runbooks for every page-worthy alert
# Investigation steps with exact commands
# Mitigation steps with rollback procedures
# Escalation path with contact info

# 5. Tune quarterly
# Review false positive rate — target < 20%
# Review MTTA — target < 5 minutes for critical
# Remove or fix alerts that fire but require no action

# 6. Monitor the monitoring
# up{job="alertmanager"} == 1
# MTTA by engineer tracked weekly
# Alert volume by severity trended monthly
