Incident Response: Detection, Response, and Post-Mortems
Build an effective incident response process: from detection and escalation to resolution and blameless post-mortems that prevent recurrence.
Introduction
When something breaks in production, the minutes matter. Not just for user impact, but because stressed engineers under time pressure make bad decisions. An incident response process gives you a playbook so your team can focus on fixing the problem, not figuring out who does what. Without one, organizations end up running fire drills, missing escalations, and sending confused updates that prolong outages and baffle stakeholders.
Good incident response covers three phases that feed into each other. Detection gets the right people alerted quickly through monitoring and alerting systems. Response gets the problem solved through coordination, investigation, and whatever mitigation works fastest. Learning is where you figure out why it happened and how to stop it happening again. Weakness in any one phase drags down the whole system.
This guide walks through building an incident response process from scratch: severity levels and when to use each, escalation paths and who to page, communication templates that keep stakeholders informed, the Incident Commander role and why it matters, and the blameless post-mortem that turns failures into improvements. By the end, you will have the structure to handle incidents consistently, communicate clearly under pressure, and extract learning that compounds over time.
When to Use
SEV1 vs. SEV2: How to Decide
Declare SEV1 when you have complete service unavailability, data loss or corruption, or a security breach. If customers cannot complete their primary task at all, that is SEV1. The bar for declaring SEV1 should be low — it is better to over-communicate severity and scale back than to under-respond.
Declare SEV2 when a major feature is degraded but customers can work around it. Search returning errors for 20% of users is SEV2. Payment processing slow but completing is SEV2. SEV2 means the response is urgent but the business is not on fire.
The hard call is partial availability. A service that is technically up but serving errors for a subset of users is usually SEV2, but it is SEV1 if those users cannot complete their primary task at all. When in doubt, declare the higher severity and demote it later if it turns out to be minor.
War Room vs. Async Slack
Open a war room (video call, dedicated channel) when you have a SEV1, when multiple teams need to coordinate simultaneously, or when the incident is actively worsening and needs rapid decision-making. War rooms add coordination overhead, so use them only when the speed of parallel investigation outweighs the overhead.
Use async Slack updates when you have a SEV2, when the incident is stable and slowly improving, or when only one team is working the problem. Async keeps people informed without tying up multiple engineers in a call.
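The criteria above can be condensed into a small decision helper. This is a minimal sketch, not part of any incident tool: the function name `coordination_mode` and its parameters are hypothetical and only encode the rules stated in this section.

```python
def coordination_mode(severity: str, teams_involved: int, worsening: bool) -> str:
    """Return "war_room" or "async" based on the criteria in this section.

    A war room is opened for any SEV1, for multi-team coordination,
    or when the incident is actively getting worse.
    """
    if severity == "SEV1" or teams_involved > 1 or worsening:
        return "war_room"
    return "async"


if __name__ == "__main__":
    # A stable SEV2 worked by a single team stays in async Slack updates.
    print(coordination_mode("SEV2", teams_involved=1, worsening=False))
```

Writing the rule down this way makes the default explicit: async is the fallback, and each war room trigger has to be argued for.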
Rollback vs. Fix-Forward
Roll back when the deployment caused the incident, a known-good state exists and can be restored quickly, and the fix would take longer than the rollback. Rollback is the right call when you can be confident within 10 minutes that reverting solves the problem.
Fix forward when the deployment is not the cause, when rollback would cause more disruption (for example, by dropping in-flight transactions), or when the fix is simpler than the rollback procedure. Fix-forward requires confidence that the fix will not make things worse.
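One way to make this decision repeatable is to encode the criteria as a function. The sketch below is illustrative only: `RecoveryContext`, its fields, and `choose_recovery` are hypothetical names, and the time estimates are whatever your responders judge in the moment.

```python
from dataclasses import dataclass


@dataclass
class RecoveryContext:
    deployment_caused_incident: bool
    known_good_state_available: bool
    rollback_minutes: float  # estimated time to complete a rollback
    fix_minutes: float  # estimated time to ship a forward fix
    rollback_disrupts_in_flight_work: bool


def choose_recovery(ctx: RecoveryContext) -> str:
    """Return "rollback" or "fix-forward" using the criteria described above."""
    can_roll_back = (
        ctx.deployment_caused_incident
        and ctx.known_good_state_available
        and not ctx.rollback_disrupts_in_flight_work
    )
    # Roll back only when it is both possible and no slower than the fix.
    if can_roll_back and ctx.rollback_minutes <= ctx.fix_minutes:
        return "rollback"
    return "fix-forward"


if __name__ == "__main__":
    ctx = RecoveryContext(True, True, 5, 45, False)
    print(choose_recovery(ctx))
```

The function cannot supply the confidence that a fix is safe; it only records which inputs the Incident Commander should ask for before making the call.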
Incident Lifecycle
flowchart TD
A[Alert Fires<br/>or User Reports] --> B{Classify Severity}
B -->|SEV1| C[Open War Room<br/>Page Secondary On-Call]
B -->|SEV2| D[Open Incident Channel<br/>Assign IC]
C --> E[Investigate<br/>Identify Root Cause]
D --> E
E --> F{Rollback<br/>or Fix-Forward?}
F -->|Rollback| G[Execute Rollback<br/>Monitor Recovery]
F -->|Fix-Forward| H[Implement Fix<br/>Test in Staging]
G --> I{Service<br/>Recovered?}
H --> I
I -->|Yes| J[Declare Resolved<br/>Update Status Page]
I -->|No| E
J --> K[Schedule Post-Mortem<br/>Track Action Items]
Incident Classification and Severity
Not all incidents are created equal. Define severity levels so everyone knows the stakes.
| Severity | Definition | Example | Response |
|---|---|---|---|
| SEV1 | Complete outage or data loss | Checkout service down, database corrupted | Immediate all-hands |
| SEV2 | Major feature degraded | Search returning errors for 20% of queries | Response in 15 minutes |
| SEV3 | Minor feature degraded | PDF exports slow but working | Response in 2 hours |
| SEV4 | Cosmetic or minor issue | Wrong logo color on one page | Next sprint |
Classify based on user impact, not technical cause. A single-user issue is SEV4 even if the root cause is interesting. A 1% error rate on a critical path is SEV2.
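The table can be expressed as a classification function that takes only user-impact signals as input. This is a rough sketch: the `Severity` enum and the keyword arguments of `classify` are hypothetical, and a real classifier would need your own definition of what counts as a critical path.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


def classify(
    *,
    full_outage: bool = False,
    data_loss: bool = False,
    security_breach: bool = False,
    critical_path_degraded: bool = False,
    minor_feature_degraded: bool = False,
) -> Severity:
    """Map user-impact signals to a severity, checking the worst case first."""
    if full_outage or data_loss or security_breach:
        return Severity.SEV1
    if critical_path_degraded:
        return Severity.SEV2
    if minor_feature_degraded:
        return Severity.SEV3
    return Severity.SEV4


if __name__ == "__main__":
    # A 1% error rate on checkout degrades a critical path.
    print(classify(critical_path_degraded=True).name)
```

None of the inputs describe the technical cause, which keeps the classification tied to user impact as the paragraph above recommends.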
Detection Sources and Alerting
The Alerting in Production post covers how to build alerts that page the right people. This is where that investment pays off.
Detection can come from automated monitoring, user reports, or employee reports. Automated detection is faster and more reliable. When PagerDuty fires at 3am, you have context: error rates, latency, which service is affected.
User-reported incidents are harder. Create a clear channel (Slack channel, dedicated email, emergency hotline) that routes to the on-call engineer. Train customer support to recognize severity.
Escalation Paths and Communication
Escalation is about getting the right people involved quickly.
# Example escalation policy (illustrative, PagerDuty-style)
escalation_policy:
  name: Platform Team Escalation
  description: Primary on-call, then secondary, then manager
  levels:
    - recipients:
        - user: primary-oncall
      delay_minutes: 0
    - recipients:
        - user: secondary-oncall
      delay_minutes: 15
    - recipients:
        - user: platform-manager
      delay_minutes: 30
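To check that a policy like this behaves as intended, it can help to model the delays in a few lines of code. The sketch below is an assumption-laden illustration, not PagerDuty's API: `ESCALATION_LEVELS` and `paged_recipients` are hypothetical names that mirror the 0, 15, and 30 minute delays above.

```python
# Escalation levels mirroring the policy above: (delay in minutes, recipient).
ESCALATION_LEVELS = [
    (0, "primary-oncall"),
    (15, "secondary-oncall"),
    (30, "platform-manager"),
]


def paged_recipients(minutes_unacknowledged: float) -> list[str]:
    """Return everyone who should have been paged by now, in escalation order."""
    return [
        recipient
        for delay, recipient in ESCALATION_LEVELS
        if minutes_unacknowledged >= delay
    ]


if __name__ == "__main__":
    # Twenty minutes without an acknowledgement reaches the secondary on-call.
    print(paged_recipients(20))
```

A model like this is useful in a quarterly escalation test: compare who was actually paged against what the function says should have happened.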
During an incident, communicate early and often. Users would rather hear “we are aware and investigating” than hear nothing.
Status page updates:
**2026-03-25 14:32 UTC - Investigating**
We are seeing elevated error rates on our checkout service. Our team is investigating and will update in 30 minutes.
**2026-03-25 14:45 UTC - Identified**
We have identified the issue as a database connection pool exhaustion caused by a slow-running query. We are working on a fix.
**2026-03-25 15:02 UTC - Resolved**
The issue has been resolved. Checkout service is operating normally. We will publish a post-mortem within 5 business days.
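Updates in this shape are easy to generate from a template so that nobody formats timestamps by hand mid-incident. The following is a minimal sketch: `format_status_update` and `VALID_STATES` are hypothetical names, and the three states are simply the ones used in the example above.

```python
from datetime import datetime, timezone

# The three states used in the example updates above.
VALID_STATES = ("Investigating", "Identified", "Resolved")


def format_status_update(when: datetime, state: str, message: str) -> str:
    """Render one status page entry with a UTC timestamp header."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown state: {state!r}")
    # Pass a timezone-aware datetime; it is converted to UTC for display.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return f"**{stamp} - {state}**\n{message}"


if __name__ == "__main__":
    when = datetime(2026, 3, 25, 14, 32, tzinfo=timezone.utc)
    print(format_status_update(
        when,
        "Investigating",
        "We are seeing elevated error rates on our checkout service.",
    ))
```

Rejecting unknown states keeps the public timeline consistent, which matters when several people post updates during a long incident.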
Internal communication should be in a dedicated incident Slack channel. Do not let incident discussion pollute regular team channels.
Active Incident Management
One person leads the incident. This is the Incident Commander (IC). The IC coordinates, communicates, and makes decisions. They are not necessarily the most technical person; they are the person who can keep the incident moving.
The IC role:
- Declares incident (sets severity, creates Slack channel, pages on-call)
- Coordinates responders (who is investigating, who is fixing, who is communicating)
- Tracks progress (main incident timeline)
- Makes calls (rollback vs. fix-forward, customer communication)
- Closes incident (declares resolved, schedules post-mortem)
Use a war room for SEV1s. Video conference, screen share, one channel for coordination. For SEV2s, a Slack channel with async updates often suffices.
The Chaos Engineering post has more on building systems that fail gracefully, which reduces incident frequency and severity.
Blameless Post-Mortem Process
After a SEV1 or SEV2, conduct a post-mortem. The purpose is learning, not blame. Blameless means: focus on systems and processes, not individuals.
Timeline first. Reconstruct what happened and when. Include when the incident was detected, when it was escalated, when mitigation started.
## Post-Mortem: Checkout Service Outage
**Date:** 2026-03-25
**Duration:** 47 minutes
**Impact:** 1,247 failed checkout attempts
**Severity:** SEV1
### Timeline (UTC)
- 13:15 - Last successful deployment to checkout service
- 13:47 - Alerting fires: error rate > 5%
- 13:48 - Primary on-call acknowledges
- 13:52 - Incident channel created, IC assigned
- 14:01 - Database connection issue identified
- 14:08 - Query timeout identified as cause
- 14:15 - Rollback initiated
- 14:34 - Rollback complete, service recovering
- 14:47 - Error rates back to normal
### Root Cause
A deployment introduced a query that did not use an index, causing full table scans on the orders table. Under load, this exhausted the connection pool.
### Contributing Factors
- No query plan review in deployment process
- Load testing did not include the new query pattern
- Connection pool size was not monitored
### Action Items
| Item | Owner | Due |
| -------------------------------------------- | ------ | ---------- |
| Add query plan review to CI | @sarah | 2026-04-01 |
| Add connection pool monitoring | @james | 2026-04-05 |
| Update load testing to include checkout flow | @ops | 2026-04-10 |
### What Went Well
- Detection was fast (3 minutes from failure to alert)
- Rollback procedure worked as documented
- Communication was clear and timely
Share post-mortems widely. Blameless only works if people believe it. Seeing the same process applied to senior engineers as junior engineers builds trust.
Improving Detection and Response
Post-mortems are useless if action items are not tracked. Put them in your project management tool. Review action items in weekly ops meetings.
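If your project management tool can export action items, a short script can surface overdue ones before the weekly review. This is a sketch under assumptions: the `ActionItem` structure is hypothetical, and the `done` flags in the usage example are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


def completion_rate(items: list[ActionItem]) -> float:
    """Fraction of items marked done; an empty list counts as fully complete."""
    if not items:
        return 1.0
    return sum(item.done for item in items) / len(items)


if __name__ == "__main__":
    items = [
        ActionItem("Add query plan review to CI", "@sarah", date(2026, 4, 1), done=True),
        ActionItem("Add connection pool monitoring", "@james", date(2026, 4, 5)),
        ActionItem("Update load testing to include checkout flow", "@ops", date(2026, 4, 10)),
    ]
    for item in overdue(items, today=date(2026, 4, 6)):
        print(f"OVERDUE: {item.title} ({item.owner}, due {item.due})")
```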
Track your incident metrics over time:
- Mean Time to Detection (MTTD): How long between failure and alert?
- Mean Time to Acknowledge (MTTA): How long between alert and someone looking at it?
- Mean Time to Resolution (MTTR): How long from alert to fix?
Set targets for each. If your MTTD is 15 minutes and your target is 5, you need better alerting. If your MTTR is 2 hours and your target is 30 minutes, you need better runbooks or faster rollback procedures.
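The three metrics are simple averages over per-incident timestamps. The sketch below assumes you record four timestamps per incident; `IncidentTimes` and `response_metrics` are hypothetical names, and the sample timestamps are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class IncidentTimes:
    failed_at: datetime  # when the failure actually began
    alerted_at: datetime  # when the alert fired
    acknowledged_at: datetime  # when a responder acknowledged the alert
    resolved_at: datetime  # when service was restored


def _minutes(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def response_metrics(incidents: list[IncidentTimes]) -> dict[str, float]:
    """Compute MTTD, MTTA, and MTTR in minutes, using the definitions above."""
    return {
        "mttd": mean(_minutes(i.failed_at, i.alerted_at) for i in incidents),
        "mtta": mean(_minutes(i.alerted_at, i.acknowledged_at) for i in incidents),
        "mttr": mean(_minutes(i.alerted_at, i.resolved_at) for i in incidents),
    }


if __name__ == "__main__":
    day = (2026, 3, 25)
    incidents = [
        IncidentTimes(
            datetime(*day, 13, 44), datetime(*day, 13, 47),
            datetime(*day, 13, 48), datetime(*day, 14, 47),
        ),
        IncidentTimes(
            datetime(*day, 9, 0), datetime(*day, 9, 7),
            datetime(*day, 9, 10), datetime(*day, 9, 37),
        ),
    ]
    print(response_metrics(incidents))
```

Note that MTTD depends on knowing when the failure began, which usually has to be reconstructed from logs or metrics during the post-mortem.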
Production Failure Scenarios
| Failure | Impact | Mitigation |
|---|---|---|
| Severity misclassification (under-response) | SEV2 treated as SEV3, wrong people paged, slow resolution | Default to higher severity, demote later if warranted; track over- and under-escalations in post-mortems |
| Escalation policy not triggering due to on-call rotation gap | Alert fires but nobody sees it for 30 minutes | Test escalation policies quarterly, simulate on-call transitions, have a fallback communication channel |
| Status page update causing customer panic due to inaccurate information | Customers react to incomplete data, support tickets spike | Have a reviewer check status page updates before posting, verify facts before publishing |
| Post-mortem action items never implemented | Same incident recurs six months later | Assign action item owners with due dates, review in weekly ops meeting, track in project management tool |
| War room too large, too many people talking | Noise drowns signal, IC cannot coordinate | Strict attendee list, use a separate investigation channel for parallel work, IC manages participation |
Incident Response Trade-off Analysis
| Factor | War Room | Async Slack |
|---|---|---|
| Coordination speed | Fast (everyone on call) | Slow (check messages periodically) |
| Noise level | High | Low |
| Multi-team alignment | Good | Requires explicit updates |
| Best for | SEV1, rapidly evolving | SEV2, stable investigation |
The rollback decision breaks down along similar lines:
| Factor | Rollback | Fix-Forward |
|---|---|---|
| Time to recovery | Usually faster | Variable |
| Risk | Lower (known-good state) | Depends on fix quality |
| When to use | Deployment-caused incidents | Non-deployment causes |
| Trade-off | May not fix root cause | Fixes the actual problem |
Incident Response Observability
Track these metrics to understand your incident response health:
| Metric | What It Tells You | Target |
|---|---|---|
| MTTD (Mean Time to Detection) | Speed of automated detection | < 5 minutes |
| MTTA (Mean Time to Acknowledge) | On-call responsiveness | < 15 minutes |
| MTTR (Mean Time to Resolution) | Overall incident resolution speed | < 30 minutes for SEV1 |
| False positive alert rate | Alert quality | < 10% |
| Alert to war room open time | How fast coordination starts | < 5 minutes |
Alert fatigue is a real problem. If engineers ignore alerts because 80% are false positives, a real incident gets missed. Review alert quality monthly. If an alert fires and nobody acts on it within 15 minutes, it was either not important enough to page or it was a false positive.
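A monthly alert review can start from a simple export of fired alerts, each labeled with whether anyone acted on it. The sketch below assumes that labeling already exists: the dictionary keys `name` and `actionable` and both function names are hypothetical, and the 10% threshold matches the target in the table above.

```python
from collections import defaultdict


def false_positive_rate(alerts: list[dict]) -> float:
    """Fraction of fired alerts that nobody needed to act on."""
    if not alerts:
        return 0.0
    false_positives = sum(1 for alert in alerts if not alert["actionable"])
    return false_positives / len(alerts)


def noisy_alerts(alerts: list[dict], threshold: float = 0.10) -> list[str]:
    """Names of alerts whose own false positive rate exceeds the threshold."""
    by_name: dict[str, list[dict]] = defaultdict(list)
    for alert in alerts:
        by_name[alert["name"]].append(alert)
    return sorted(
        name
        for name, fired in by_name.items()
        if false_positive_rate(fired) > threshold
    )


if __name__ == "__main__":
    fired = [
        {"name": "HighErrorRate", "actionable": True},
        {"name": "HighErrorRate", "actionable": True},
        {"name": "DiskAlmostFull", "actionable": False},
        {"name": "DiskAlmostFull", "actionable": False},
        {"name": "DiskAlmostFull", "actionable": True},
    ]
    print(false_positive_rate(fired), noisy_alerts(fired))
```

Breaking the rate down per alert name is what makes the review actionable: the overall number tells you there is a problem, and the per-alert list tells you which rule to tune or delete.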
Key commands:
# Check PagerDuty escalation policy status
pd-cli escalation-policy list --team "Platform Team"
# Count currently firing alerts by service
curl -s -G "http://prometheus:9090/api/v1/query" --data-urlencode 'query=ALERTS{alertstate="firing"}' | jq '.data.result | group_by(.metric.service) | map({service: .[0].metric.service, count: length})'
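# Count recent Warning events across all namespaces
# (Kubernetes retains events for about an hour by default, so this is a snapshot, not a quarterly incident count)
kubectl get events --all-namespaces --field-selector type=Warning --no-headers | wc -l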
Common Pitfalls / Anti-Patterns
Skipping post-mortems for minor incidents. Every SEV1 and SEV2 deserves a post-mortem. SEV3s and SEV4s can be documented in a sentence, but you should still review patterns. If your SEV4 count is growing, something is wrong.
Not acting on post-mortem action items. A post-mortem without tracked action items is just a document. If the same class of incident happens twice, the first post-mortem failed.
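Severity inflation. Declaring everything SEV1 because you want attention teaches your team to ignore the noise, which makes real SEV1s harder to spot. Defaulting to the higher severity when you are genuinely unsure is fine; reserve SEV1 for actual outages, data loss, or security breaches.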
Treating all incidents as fire drills. Not every incident requires everyone to drop everything. A SEV3 can wait until business hours. Waking people up unnecessarily builds resentment that surfaces when a real SEV1 happens.
No clear IC. When everyone is investigating, nobody is coordinating. The IC makes calls, tracks the timeline, and manages communication. Without one, you get parallel investigations, duplicated effort, and conflicting status updates.
Closing Thoughts
Effective incident response is not about being perfect. It is about being consistent. When every incident follows the same playbook, with the same severity definitions, escalation paths, and communication templates, your team can focus on the problem instead of the process.
Run the process, learn from it, and improve it. Incidents are inevitable. Outages are optional.
Trade-off Analysis
No single incident response approach works for every situation. These are the key trade-offs you will face.
Severity Declaration: SEV1 vs SEV2
The choice between declaring a SEV1 or SEV2 shapes how much coordination overhead you get versus how much attention the incident receives.
| Factor | SEV1 | SEV2 |
|---|---|---|
| War room | All hands, full focus | Lead + 2-3 responders |
| Communication | Executive updates | Team-level updates |
| Resolution pressure | Maximum | High |
| Overhead cost | High | Moderate |
Default to higher severity and demote later. It is easier to explain why a SEV1 response turned out to be over-engineered than why a SEV2 became a SEV1 mid-incident. Partial availability is the hardest case: when in doubt, declare higher.
Incident Commander: Single IC vs Distributed Investigation
Having one person own coordination versus letting everyone investigate in parallel is a fundamental trade-off.
| Approach | Pros | Cons |
|---|---|---|
| Single IC | Clear decisions, no duplicated effort, one timeline | IC must resist investigating — coordination is a full-time job |
| No IC (everyone investigates) | More parallel coverage | Conflicting updates, duplicated work, no clear go/no-go |
The IC role is non-negotiable for SEV1 and SEV2. For SEV3 and SEV4, a brief Slack thread with a designated lead is usually sufficient.
Rollback vs Fix-Forward
Whether to revert a deployment or push a fix depends on the situation and what you know.
| Factor | Rollback | Fix-Forward |
|---|---|---|
| When deployment caused incident | Yes | Only if fix is faster |
| Known-good state available | Yes | N/A |
| Fix is trivial and fast | No | Yes |
| In-flight transactions would be lost | No | Yes |
| Root cause unknown | Risky | Risky either way |
The 10-minute rule: If you cannot be confident within 10 minutes that rollback solves the problem, evaluate fix-forward. Both are valid — the goal is fastest time-to-resolution.
Post-Mortem Timing: Same Day vs Within 5 Days
| Timing | When to use | Trade-off |
|---|---|---|
| Same day | Minor incidents, quick wins | Memory is freshest, but contributing factors that surface later may be missed |
| Within 5 business days | SEV1/SEV2 | Time to process, but action items can lose urgency |
For SEV1 and SEV2, 5 business days gives enough distance to see patterns rather than just symptoms, while staying close enough to the incident for accurate recall.
Communication Templates vs Ad-Hoc Updates
| Approach | Pros | Cons |
|---|---|---|
| Templates | Faster, consistent, harder to forget key info | Feels robotic if over-used |
| Ad-hoc | Flexible, context-specific | Inconsistency, missing stakeholders |
Use templates for severity, status page, and stakeholder updates. Adapt the tone but keep the structure. For SEV1, pre-built templates in a shared doc save minutes that matter.
Interview Questions
Expected answer points:
- Detection: monitoring, alerting, and user reports — the faster you know, the faster you respond
- Containment: isolate the blast radius to prevent further damage while you investigate
- Investigation: root cause analysis under pressure — distinguish symptoms from causes
- Resolution: fix the problem and verify the fix works in production
- Post-mortem: blameless review to identify systemic improvements, not assign blame
Expected answer points:
- SEV1: complete service unavailability, data loss, security breach — the business is on fire
- SEV2: major feature degraded but customers can work around it — urgent but not catastrophic
- SEV3: minor feature impairment, low user impact — addressed in normal workflow
- When in doubt, declare a higher severity and demote later — under-responding is more damaging
- Partial availability is the hardest call — treat it as SEV2 unless proven otherwise
Expected answer points:
- Open a war room for SEV1, multiple-team coordination, or actively worsening incidents
- Use async Slack for SEV2, stable incidents, or single-team investigations
- War rooms add coordination overhead — justify it with the speed of parallel investigation
- Keep war room scope tight: focused on stopping the bleeding, not complete diagnosis
Expected answer points:
- Follow-the-sun works for global teams — high coverage at the cost of time zone disruption
- Fixed rotations minimize context-switching but can lead to stale knowledge
- Dev-first rotation (same team builds and runs) increases ownership but raises burnout risk
- Managed services (PagerDuty, Opsgenie) offload scheduling complexity but add cost
- Track MTTA and MTTR by rotation — if responders are burned out, numbers show it
Expected answer points:
- A runbook is a step-by-step playbook for a specific incident type
- Good runbooks are precise, scoped to one problem, and updated after every incident
- Bad runbooks are vague, try to cover too many scenarios, or become outdated
- Template-driven runbooks are more maintainable than rigid scripts
- Every runbook should have a clear exit criterion — when is this incident resolved?
Expected answer points:
- Identify dependent services early — what breaks when this component fails?
- Use circuit breakers to stop failures from propagating upstream
- Throttle traffic intentionally to keep the system alive at reduced capacity
- Capacity buffers and graceful degradation prevent total collapse under load
- Chaos engineering tests these failure modes before they happen in production
Expected answer points:
- Blast radius analysis estimates how far the impact spreads from the failure point
- Do it immediately after containment — understand scope before adding more responders
- Categorize impact: user-facing vs internal, revenue vs reputation, current vs potential
- The goal is right-sizing the response — enough to contain, not so much it creates noise
Expected answer points:
- Public status page updates within minutes of declaring an incident
- Use a single source of truth — one channel, one incident room, one status page
- Initial communication: what happened, current impact, what you are doing
- Follow-up communication: every 15-30 min for SEV1, less for SEV2
- Never speculate about root cause publicly until it is confirmed
Expected answer points:
- A blameless post-mortem focuses on systemic causes, not individual fault
- Engineers who fear blame will hide mistakes, making systemic problems invisible
- Effective post-mortems identify contributing factors across people, process, and tooling
- Action items must be specific, assigned, and tracked to completion
- Review post-mortems regularly — patterns across incidents reveal structural issues
Expected answer points:
- Symptoms are observable effects — error rates, latency spikes, failed requests
- Root cause is the underlying mechanism that produces the symptom
- Ask "why" five times: why did users see errors → why did the service fail → why did it crash
- Correlation is not causation — a spike in metric A followed by failure B does not mean A caused B
- Instrumentation and tracing help separate coincidence from causality
Expected answer points:
- Missing metrics: services without RED (Rate, Errors, Duration) coverage hide failure signals
- Silent failures: catch-all error handlers that swallow exceptions without alerting
- Alert fatigue: too many alerts cause responders to ignore real incidents or respond to them late
- No causation signal: alerts show symptoms but give no path to diagnosis
- Stale dashboards: dashboards that are not queried during incidents show outdated state
Expected answer points:
- Rotation frequency matters more than rotation length — weekly is better than monthly
- Compensate on-call time regardless of incident frequency
- Allow engineers to opt out of on-call during high-stress personal periods
- Track alert volume per service — unhealthy services burn out on-call engineers
- Regular on-call training and game days reduce stress during actual incidents
Expected answer points:
- MTTR (Mean Time To Resolution) measures average time from alert to restored service
- High-fidelity alerting reduces MTTR by cutting time spent in detection
- Runbook maturity reduces MTTR by removing the "what do we do now" phase
- Parallel investigation (war room) reduces MTTR for complex incidents
- Never trade off resolution quality for speed — incomplete fixes cause repeat incidents
Expected answer points:
- The incident commander owns coordination, not investigation — they facilitate, not fix
- They manage the war room, delegate tasks, and track action items
- They own external communication and stakeholder updates
- Separating command from investigation prevents tunnel vision and missed angles
- Rotation during long incidents prevents fatigue-driven decision errors
Expected answer points:
- Declare the breach immediately and update the status page: transparency beats avoidance
- Focus engineering effort on the fastest viable fix, not the perfect fix
- Communication cadence doubles — every 10-15 minutes for active SLA breach
- Account management should be pre-warned before customers call in
- Post-incident: review the SLA itself — was it achievable given the system's true capacity?
Expected answer points:
- Declare incident severity — do not wait for full understanding to start responding
- Open the war room and page the incident commander
- Check dashboards for upstream dependency failures — look for causal signals, not correlations
- Throttle or shed traffic to preserve partial service — degraded is better than dead
- Enable circuit breakers on dependent services to stop further propagation
Expected answer points:
- Start with a clear framing: this is about systems and processes, not individuals
- Use the 5 whys technique to trace backward from incident to root cause
- Every action item needs an owner and a deadline — vague follow-ups are useless
- Share post-mortems broadly — organizational learning compounds over time
- Track action item completion rate — if it is low, the process has no teeth
Expected answer points:
- Start with alerting: instrument services so failures produce signals, not silence
- Define severity levels and response SLAs — even preliminary ones create structure
- Create runbooks for the top 3 most common incidents — build habit before building completeness
- Run game days: simulate failures in staging to stress-test the process
- Retrospect after every incident, no matter how small — build the learning muscle
Expected answer points:
- Chronic incidents recur multiple times in a short window — same root cause, different symptoms
- Warning signs: repeat pages, a service that never leaves the incident state, growing defect backlog
- Management approach: stop treating symptoms, allocate dedicated time to root-cause elimination
- Elevate to project mode — incidents become tracked work items, not firefighting
- Communicate chronic status to stakeholders — ongoing instability requires expectation management
Expected answer points:
- During active incident: document actions taken, not analysis — real-time notes are enough
- Formal documentation and process updates happen after resolution, not during
- Build templates and runbooks in peacetime so incident responders do not create process during crisis
- Compliance should be baked into tooling (automated rollback, mandatory SLO gates) not human memory
- Post-incident: the review process is where rigor and documentation matter — do not rush it
Further Reading
- PagerDuty Incident Response Documentation - Comprehensive incident response guides and templates
- SRE Book: Post-Mortem Culture - Google’s guide to blameless post-mortems
- NIST SP 800-61r2 - Computer Security Incident Handling Guide
- Atlassian Incident Handbook - Practical incident management playbooks
- FireHydrant Runbook Template - Template for creating effective runbooks
Conclusion
Key Takeaways
- Define severity levels and use them consistently — classify by user impact, not technical interest
- Declare SEV1 early and demote if needed — under-responding is worse than over-responding
- The Incident Commander coordinates, does not investigate — separate the coordination role from the fix role
- Blameless post-mortems focus on systems and processes, not individuals
- Track MTTD, MTTA, MTTR over time — you cannot improve what you do not measure
- Action items from post-mortems must be tracked and reviewed
Incident Response Checklist
# 1. Define severity levels (SEV1/SEV2/SEV3/SEV4) in your runbook
# 2. Set up PagerDuty escalation with 0 -> 15 -> 30 minute delays
# 3. Create status page templates for Investigating / Identified / Resolved
# 4. Assign an Incident Commander for every SEV1 and SEV2
# 5. Open separate Slack channel per incident: #inc-YYYY-MM-DD-description
# 6. Document timeline as the incident progresses, not after
# 7. Conduct blameless post-mortem within 5 business days of SEV1/SEV2
# 8. Create action items in project management tool with owners and due dates
# 9. Review MTTD, MTTA, MTTR monthly in ops review meeting
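Step 5 is easy to automate so that channel names stay consistent under pressure. The sketch below is one possible approach: `incident_channel_name` is a hypothetical helper, and the 80-character limit and lowercase-with-hyphens format are assumptions about Slack's naming rules that you should confirm for your workspace. The leading `#` is omitted because it is display syntax, not part of the name.

```python
import re
from datetime import date


def incident_channel_name(day: date, description: str, max_length: int = 80) -> str:
    """Build a channel name in the form inc-YYYY-MM-DD-description."""
    # Lowercase the description and collapse anything non-alphanumeric to hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    name = f"inc-{day.isoformat()}-{slug}"
    # Truncate to the assumed limit without leaving a trailing hyphen.
    return name[:max_length].rstrip("-")


if __name__ == "__main__":
    print(incident_channel_name(date(2026, 3, 25), "Checkout Service Outage!"))
```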
For more on building resilient systems, see Chaos Engineering. For alerting best practices, see Alerting in Production. For monitoring best practices and SLOs, see Observability Engineering.