Incident Response: Detection, Response, and Post-Mortems

Build an effective incident response process: from detection and escalation to resolution and blameless post-mortems that prevent recurrence.

published: March 25, 2026 | updated: May 14, 2026 | author: GeekWorkBench | reading time: 26 min read

Quick Summary

When stuff breaks, you want engineers fixing it, not debating who does what. That's what a solid incident response process gets you. The Incident Commander is the one role you can't skip for SEV1 and SEV2: they run the show while others dig into the problem. Post-mortems should be blameless, focused on what the system, not the person, needs to do better. And track your metrics: MTTD, MTTA, MTTR. You can't fix what you don't measure.

Introduction

When something breaks in production, the minutes matter. Not just for user impact, but because stressed engineers under time pressure make bad decisions. An incident response process gives you a playbook so your team can focus on fixing the problem, not figuring out who does what. Without one, organizations end up running fire drills, missing escalations, and sending confused updates that prolong outages and baffle stakeholders.

Good incident response covers three phases that feed into each other. Detection gets the right people alerted quickly through monitoring and alerting systems. Response gets the problem solved through coordination, investigation, and whatever mitigation works fastest. Learning is where you figure out why it happened and how to stop it happening again. Weakness in any one phase drags down the whole system.

This guide walks through building an incident response process from scratch: severity levels and when to use each, escalation paths and who to page, communication templates that keep stakeholders informed, the Incident Commander role and why it matters, and the blameless post-mortem that turns failures into improvements. By the end, you will have the structure to handle incidents consistently, communicate clearly under pressure, and extract learning that compounds over time.

When to Use

SEV1 vs. SEV2: How to Decide

Declare SEV1 when you have complete service unavailability, data loss or corruption, or a security breach. If customers cannot complete their primary task at all, that is SEV1. The bar for declaring SEV1 should be low — it is better to over-communicate severity and scale back than to under-respond.

Declare SEV2 when a major feature is degraded but customers can work around it. Search returning errors for 20% of users is SEV2. Payment processing slow but completing is SEV2. SEV2 means the response is urgent but the business is not on fire.

The hard call is partial availability. A service that is technically up but serving errors for a subset of users could be SEV2. When in doubt, declare the higher severity and demote it later if it turns out to be minor.

The practical difference between SEV1 and SEV2 is the coordination overhead and who gets paged. A SEV1 opens a war room, pages the secondary on-call, and typically triggers executive notification. A SEV2 pages only the primary on-call and runs through a Slack channel. Using SEV1 for a SEV2 wastes everyone’s time on an over-produced response. Using SEV2 for something that is actually a SEV1 delays the right response and makes the incident harder to resolve.

The specific threshold to memorize is whether the core user action is blocked. If users cannot complete the primary workflow, that is SEV1 regardless of the technical cause. If users can complete their primary workflow through an alternative path or with degraded performance, that is SEV2. A checkout service that errors for 5% of users due to a payment provider issue is SEV2 because users can retry or use an alternative payment method. A checkout service that errors for 100% of users is SEV1 because no checkout completes at all.
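
The threshold above can be written down as a small decision function. This is a minimal sketch in Python; the `Impact` fields and the `classify_severity` name are illustrative assumptions, not part of any tool mentioned in this guide.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    """Observed user impact. All field names are hypothetical."""
    primary_workflow_blocked: bool = False  # users cannot complete the core action at all
    data_loss_or_corruption: bool = False
    security_breach: bool = False
    major_feature_degraded: bool = False    # degraded, but a workaround or retry exists
    minor_feature_degraded: bool = False


def classify_severity(impact: Impact) -> str:
    """Map user impact to a severity level, checking the worst case first."""
    if (impact.primary_workflow_blocked
            or impact.data_loss_or_corruption
            or impact.security_breach):
        return "SEV1"
    if impact.major_feature_degraded:
        return "SEV2"
    if impact.minor_feature_degraded:
        return "SEV3"
    return "SEV4"


# Checkout erroring for 5% of users, retry works: degraded but not blocked.
print(classify_severity(Impact(major_feature_degraded=True)))    # SEV2
# Checkout erroring for 100% of users: the primary workflow is blocked.
print(classify_severity(Impact(primary_workflow_blocked=True)))  # SEV1
```

The function deliberately takes impact as input and knows nothing about the technical cause, which matches the rule that severity follows what users can and cannot do.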

War Room vs. Async Slack

Open a war room (video call, dedicated channel) when you have a SEV1, when multiple teams need to coordinate simultaneously, or when the incident is actively worsening and needs rapid decision-making. War rooms add coordination overhead, so use them only when the speed of parallel investigation outweighs the overhead.

Use async Slack updates when you have a SEV2, when the incident is stable and slowly improving, or when only one team is working the problem. Async keeps people informed without tying up multiple engineers in a call.

A war room works when you need three things: real-time synchronization between multiple responders, rapid decision-making on a moving situation, and a single source of truth for what is happening. If you have an incident where the database team needs to coordinate with the application team and both need to report to a single incident commander who is making go/no-go calls, a war room prevents the confusion of cross-team communication happening in separate threads.

The failure mode for war rooms is too many people in them. A war room with 20 people becomes a cacophony of simultaneous conversations that drowns out the signal. The incident commander should actively manage the guest list: on-call responders and subject matter experts only. Everyone else follows along in the Slack channel. A good war room has fewer than eight people. The incident commander is the hub, not a participant — they facilitate the investigation, they do not run the diagnosis themselves.

Rollback vs. Fix-Forward

Roll back when the deployment caused the incident, a known-good state exists and can be restored quickly, and the fix would take longer than the rollback. Rollback is the right call when you can be confident within 10 minutes that reverting solves the problem.

Fix-forward when the deployment is not the cause, when rollback would cause more disruption (for example, in-flight transactions), and when the fix is simpler than the rollback procedure. Fix-forward requires confidence that the fix will not make things worse.
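
One way to make this call repeatable is to encode the conditions as a function the IC can reason through out loud. The following is a rough sketch, not a prescribed procedure: the function name, the parameter names, and the minute estimates are assumptions supplied by whoever is on the call.

```python
def choose_mitigation(
    deployment_caused_incident: bool,
    known_good_state_available: bool,
    rollback_minutes: float,
    fix_minutes: float,
    rollback_disrupts_in_flight_work: bool = False,
) -> str:
    """Return "rollback" or "fix-forward" using the rules described above."""
    # Rollback cannot help when the deployment is not the cause
    # or when there is no known-good state to return to.
    if not deployment_caused_incident or not known_good_state_available:
        return "fix-forward"
    # A rollback that breaks in-flight transactions causes more disruption.
    if rollback_disrupts_in_flight_work:
        return "fix-forward"
    # Otherwise pick whichever path restores service sooner. Ties go to
    # rollback because it returns to a known-good state.
    return "rollback" if rollback_minutes <= fix_minutes else "fix-forward"


print(choose_mitigation(True, True, rollback_minutes=10, fix_minutes=45))   # rollback
print(choose_mitigation(False, True, rollback_minutes=10, fix_minutes=45))  # fix-forward
```

The estimates will be guesses under pressure. The value of writing the rule down is that the team argues about the inputs rather than the decision.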

Incident Lifecycle

flowchart TD
    A[Alert Fires<br/>or User Reports] --> B{Classify Severity}
    B -->|SEV1| C[Open War Room<br/>Page Secondary On-Call]
    B -->|SEV2| D[Open Incident Channel<br/>Assign IC]
    C --> E[Investigate<br/>Identify Root Cause]
    D --> E
    E --> F{Rollback<br/>or Fix-Forward?}
    F -->|Rollback| G[Execute Rollback<br/>Monitor Recovery]
    F -->|Fix-Forward| H[Implement Fix<br/>Test in Staging]
    G --> I{Service<br/>Recovered?}
    H --> I
    I -->|Yes| J[Declare Resolved<br/>Update Status Page]
    I -->|No| E
    J --> K[Schedule Post-Mortem<br/>Track Action Items]

Incident Classification and Severity

Not all incidents are created equal. Define severity levels so everyone knows the stakes.

Severity	Definition	Example	Response
SEV1	Complete outage or data loss	Checkout service down, database corrupted	Immediate all-hands
SEV2	Major feature degraded	Search returning errors for 20% of queries	Response in 15 minutes
SEV3	Minor feature degraded	PDF exports slow but working	Response in 2 hours
SEV4	Cosmetic or minor issue	Wrong logo color on one page	Next sprint

Classify based on user impact, not technical cause. A single-user issue is SEV4 even if the root cause is interesting. A 1% error rate on a critical path is SEV2.

Detection Sources and Alerting

The Alerting in Production post covers how to build alerts that page the right people. This is where that investment pays off.

Detection can come from automated monitoring, user reports, or employee reports. Automated detection is faster and more reliable. When PagerDuty fires at 3am, you have context: error rates, latency, which service is affected.

User-reported incidents are harder. Create a clear channel (a Slack channel, a dedicated email address, or an emergency hotline) that routes to the on-call. Train customer support to recognize severity.

Escalation Paths and Communication

Escalation is about getting the right people involved quickly.

# Example escalation policy (illustrative YAML modeled on a PagerDuty escalation policy, not an exact export format)
escalation_policy:
  name: Platform Team Escalation
  description: Primary on-call, then secondary, then manager
  levels:
    - recipients:
        - user: primary-oncall
      delay_minutes: 0
    - recipients:
        - user: secondary-oncall
      delay_minutes: 15
    - recipients:
        - user: platform-manager
      delay_minutes: 30
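
To see how those delays play out, the sketch below computes who has been paged a given number of minutes after an alert fires with no acknowledgement. The `LEVELS` list mirrors the policy above. The function name is hypothetical, and nothing here calls the PagerDuty API.

```python
# Each tuple is (delay in minutes after the alert fires, recipient),
# mirroring the three levels of the example policy.
LEVELS = [
    (0, "primary-oncall"),
    (15, "secondary-oncall"),
    (30, "platform-manager"),
]


def paged_so_far(minutes_unacknowledged: float) -> list[str]:
    """Return every recipient paged by this point if nobody has acknowledged."""
    return [who for delay, who in LEVELS if minutes_unacknowledged >= delay]


print(paged_so_far(5))   # ['primary-oncall']
print(paged_so_far(20))  # ['primary-oncall', 'secondary-oncall']
print(paged_so_far(30))  # ['primary-oncall', 'secondary-oncall', 'platform-manager']
```

Walking through the timeline this way is a cheap check that an unacknowledged page reaches a human well inside your response target.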

During an incident, communicate early and often. Users would rather hear “we are aware and investigating” than be left in silence.

Status page updates:

**2026-03-25 14:32 UTC - Investigating**
We are seeing elevated error rates on our checkout service. Our team is investigating and will update in 30 minutes.

**2026-03-25 14:45 UTC - Identified**
We have identified the issue as a database connection pool exhaustion caused by a slow-running query. We are working on a fix.

**2026-03-25 15:02 UTC - Resolved**
The issue has been resolved. Checkout service is operating normally. We will publish a post-mortem within 5 business days.
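
Updates in this shape are easy to generate from a template so that nobody has to compose formatting mid-incident. The sketch below is one possible helper: `render_status_update` and `VALID_STATES` are hypothetical names, and it only builds the text, it does not post to any status page service.

```python
from datetime import datetime, timezone

# The three states used in the example updates above.
VALID_STATES = ("Investigating", "Identified", "Resolved")


def render_status_update(state: str, message: str, when: datetime) -> str:
    """Format one status page entry: bold UTC timestamp and state, then the message."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown state: {state!r}")
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return f"**{stamp} - {state}**\n{message}"


update = render_status_update(
    "Investigating",
    "We are seeing elevated error rates on our checkout service. "
    "Our team is investigating and will update in 30 minutes.",
    datetime(2026, 3, 25, 14, 32, tzinfo=timezone.utc),
)
print(update)
```

Rejecting unknown states keeps the public vocabulary small and predictable, which matters more than flexibility when customers are reading every word.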

Internal communication should be in a dedicated incident Slack channel. Do not let incident discussion pollute regular team channels.

Active Incident Management

One person leads the incident. This is the Incident Commander (IC). The IC coordinates, communicates, and makes decisions. They are not necessarily the most technical person; they are the person who can keep the incident moving.

The IC role:

Declares incident (sets severity, creates Slack channel, pages on-call)
Coordinates responders (who is investigating, who is fixing, who is communicating)
Tracks progress (main incident timeline)
Makes calls (rollback vs. fix-forward, customer communication)
Closes incident (declares resolved, schedules post-mortem)
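
The bookkeeping side of the IC role (declare, track the timeline, close) can be sketched as a small record type. This is an illustration under assumptions: the `Incident` class and its fields are hypothetical, and the channel name follows the `#inc-YYYY-MM-DD-description` pattern from the checklist at the end of this guide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Incident:
    """Minimal incident record an IC might keep. All names are hypothetical."""
    description: str      # short slug, for example "checkout-errors"
    severity: str         # "SEV1" or "SEV2"
    declared_at: datetime
    timeline: list = field(default_factory=list)
    resolved_at: Optional[datetime] = None

    @property
    def channel(self) -> str:
        """Per-incident Slack channel name: #inc-YYYY-MM-DD-description."""
        return f"#inc-{self.declared_at:%Y-%m-%d}-{self.description}"

    def log(self, when: datetime, note: str) -> None:
        """Append a timeline entry while the incident is in progress."""
        self.timeline.append((when, note))

    def resolve(self, when: datetime) -> None:
        """Mark the incident resolved and record that on the timeline."""
        self.resolved_at = when
        self.log(when, "Declared resolved")


inc = Incident("checkout-errors", "SEV1", datetime(2026, 3, 25, 13, 52, tzinfo=timezone.utc))
inc.log(inc.declared_at, "Incident channel created, IC assigned")
inc.log(datetime(2026, 3, 25, 14, 15, tzinfo=timezone.utc), "Rollback initiated")
print(inc.channel)  # #inc-2026-03-25-checkout-errors
for when, note in inc.timeline:
    print(f"{when:%H:%M} - {note}")
```

A timeline captured this way during the incident becomes the first section of the post-mortem with no reconstruction from memory.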

Use a war room for SEV1s. Video conference, screen share, one channel for coordination. For SEV2s, a Slack channel with async updates often suffices.

The Chaos Engineering post has more on building systems that fail gracefully, which reduces incident frequency and severity.

Blameless Post-Mortem Process

After a SEV1 or SEV2, conduct a post-mortem. The purpose is learning, not blame. Blameless means: focus on systems and processes, not individuals.

Timeline first. Reconstruct what happened and when. Include when the incident was detected, when it was escalated, when mitigation started.

## Post-Mortem: Checkout Service Outage

**Date:** 2026-03-25
**Duration:** 47 minutes (alert at 13:47 to rollback complete at 14:34)
**Impact:** 1,247 failed checkout attempts
**Severity:** SEV1

### Timeline (UTC)

- 13:15 - Last successful deployment to checkout service
- 13:44 - Error rate begins climbing (failure start, 3 minutes before the alert)
- 13:47 - Alerting fires: error rate > 5%
- 13:48 - Primary on-call acknowledges
- 13:52 - Incident channel created, IC assigned
- 14:01 - Database connection issue identified
- 14:08 - Query timeout identified as cause
- 14:15 - Rollback initiated
- 14:34 - Rollback complete, service recovering
- 14:47 - Error rates back to normal

### Root Cause

A deployment introduced a query that did not use an index, causing full table scans on the orders table. Under load, this exhausted the connection pool.

### Contributing Factors

- No query plan review in deployment process
- Load testing did not include the new query pattern
- Connection pool size was not monitored

### Action Items

| Item                                         | Owner  | Due        |
| -------------------------------------------- | ------ | ---------- |
| Add query plan review to CI                  | @sarah | 2026-04-01 |
| Add connection pool monitoring               | @james | 2026-04-05 |
| Update load testing to include checkout flow | @ops   | 2026-04-10 |

### What Went Well

- Detection was fast (3 minutes from failure to alert)
- Rollback procedure worked as documented
- Communication was clear and timely

Share post-mortems widely. Blameless only works if people believe it. Seeing the same process applied to senior engineers as to junior engineers builds trust.

Improving Detection and Response

Post-mortems are useless if action items are not tracked. Put them in your project management tool. Review action items in weekly ops meetings.
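
If your project management tool can export action items, a few lines of code are enough to surface the overdue ones before the weekly ops meeting. The sketch below reuses the three items from the example post-mortem; the `ActionItem` type, the completion flags, and the review date are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose due date has passed, oldest first."""
    late = [item for item in items if not item.done and item.due < today]
    return sorted(late, key=lambda item: item.due)


# Completion status below is hypothetical.
items = [
    ActionItem("Add query plan review to CI", "@sarah", date(2026, 4, 1), done=True),
    ActionItem("Add connection pool monitoring", "@james", date(2026, 4, 5)),
    ActionItem("Update load testing to include checkout flow", "@ops", date(2026, 4, 10)),
]

for item in overdue(items, today=date(2026, 4, 8)):
    print(f"OVERDUE: {item.title} ({item.owner}, due {item.due})")
```

Run against a review date of 2026-04-08, only the connection pool monitoring item is reported, since the first item is done and the third is not yet due.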

Track your incident metrics over time:

Mean Time to Detection (MTTD): How long between failure and alert?
Mean Time to Acknowledge (MTTA): How long between alert and someone looking at it?
Mean Time to Resolution (MTTR): How long from alert to fix?

Set targets for each. If your MTTD is 15 minutes and your target is 5, you need better alerting. If your MTTR is 2 hours and your target is 30 minutes, you need better runbooks or faster rollback procedures.
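
All three metrics fall out of four timestamps per incident. The sketch below uses the definitions given above (MTTD from failure to alert, MTTA from alert to acknowledgement, MTTR from alert to fix). The incident records and the `mean_minutes` helper are hypothetical.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when the failure began, when the alert fired,
# when someone acknowledged it, and when service was restored.
incidents = [
    {
        "failed": datetime(2026, 3, 10, 10, 0),
        "alerted": datetime(2026, 3, 10, 10, 3),
        "acked": datetime(2026, 3, 10, 10, 4),
        "resolved": datetime(2026, 3, 10, 11, 3),
    },
    {
        "failed": datetime(2026, 4, 2, 9, 0),
        "alerted": datetime(2026, 4, 2, 9, 7),
        "acked": datetime(2026, 4, 2, 9, 12),
        "resolved": datetime(2026, 4, 2, 9, 37),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average gap in minutes between two timestamps across all records."""
    return mean((r[end_key] - r[start_key]).total_seconds() / 60 for r in records)


mttd = mean_minutes(incidents, "failed", "alerted")    # failure to alert
mtta = mean_minutes(incidents, "alerted", "acked")     # alert to acknowledgement
mttr = mean_minutes(incidents, "alerted", "resolved")  # alert to fix

print(f"MTTD {mttd:.1f} min, MTTA {mtta:.1f} min, MTTR {mttr:.1f} min")
# MTTD 5.0 min, MTTA 3.0 min, MTTR 45.0 min
```

The hard part in practice is the "failed" timestamp, which usually has to be reconstructed from metrics during the post-mortem rather than read from the paging tool.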

Production Failure Scenarios

Failure	Impact	Mitigation
Severity misclassification (under-response)	SEV2 treated as SEV3, wrong people paged, slow resolution	Default to higher severity, demote later if warranted; track over- and under-escalations in post-mortems
Escalation policy not triggering due to on-call rotation gap	Alert fires but nobody sees it for 30 minutes	Test escalation policies quarterly, simulate on-call transitions, have a fallback communication channel
Status page update causing customer panic due to inaccurate information	Customers react to incomplete data, support tickets spike	Have a reviewer check status page updates before posting, verify facts before publishing
Post-mortem action items never implemented	Same incident recurs six months later	Assign action item owners with due dates, review in weekly ops meeting, track in project management tool
War room too large, too many people talking	Noise drowns signal, IC cannot coordinate	Strict attendee list, use a separate investigation channel for parallel work, IC manages participation

Incident Response Trade-off Analysis

Factor	War Room	Async Slack
Coordination speed	Fast (everyone on call)	Slow (check messages periodically)
Noise level	High	Low
Multi-team alignment	Good	Requires explicit updates
Best for	SEV1, rapidly evolving	SEV2, stable investigation

Factor	Rollback	Fix-Forward
Time to recovery	Usually faster	Variable
Risk	Lower (known-good state)	Depends on fix quality
When to use	Deployment-caused incidents	Non-deployment causes
Trade-off	May not fix root cause	Fixes the actual problem

Incident Response Observability

Track these metrics to understand your incident response health:

Metric	What It Tells You	Target
MTTD (Mean Time to Detection)	Speed of automated detection	< 5 minutes
MTTA (Mean Time to Acknowledge)	On-call responsiveness	< 15 minutes
MTTR (Mean Time to Resolution)	Overall incident resolution speed	< 30 minutes for SEV1
False positive alert rate	Alert quality	< 10%
Alert to war room open time	How fast coordination starts	< 5 minutes

Alert fatigue is a real problem. If engineers ignore alerts because 80% are false positives, a real incident gets missed. Review alert quality monthly. If an alert fires and nobody acts on it within 15 minutes, it was either not important enough to page or it was a false positive.
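
A monthly alert review can start from a simple tally of which pages led to action. The sketch below assumes you can export pages with a flag saying whether anyone needed to act; the record shape, alert names, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical export of last month's pages. "actionable" records whether the
# page led to any action (a fix, a mitigation, or a ticket).
pages = [
    {"alert": "checkout-error-rate", "actionable": True},
    {"alert": "disk-usage-warning", "actionable": False},
    {"alert": "disk-usage-warning", "actionable": False},
    {"alert": "search-latency", "actionable": True},
    {"alert": "disk-usage-warning", "actionable": False},
]


def false_positive_rate(records) -> float:
    """Share of pages that nobody needed to act on."""
    if not records:
        return 0.0
    noise = sum(1 for r in records if not r["actionable"])
    return noise / len(records)


def noisiest_alerts(records):
    """Count non-actionable pages per alert name, worst offender first."""
    counts = Counter(r["alert"] for r in records if not r["actionable"])
    return counts.most_common()


print(f"False positive rate: {false_positive_rate(pages):.0%}")  # 60%
print(noisiest_alerts(pages))  # [('disk-usage-warning', 3)]
```

Sorting by offender turns the review into a short list of alerts to fix, demote to a ticket, or delete.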

Key commands:

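# Check PagerDuty escalation policy status
# (pd-cli and its flags are illustrative; use the command your PagerDuty CLI or API client actually provides)
pd-cli escalation-policy list --team "Platform Team"

# Count currently firing alerts by service
# (-G with --data-urlencode stops curl from treating the braces as URL globbing)
curl -s -G "http://prometheus:9090/api/v1/query" --data-urlencode 'query=ALERTS{alertstate="firing"}' | jq '.data.result | group_by(.metric.service) | map({service: .[0].metric.service, count: length})'

# Count Warning events across all namespaces as a rough signal of cluster trouble
# (kubectl get events has no --since flag; the API server retains events for about an hour by default)
kubectl get events --all-namespaces --field-selector type=Warning --no-headers | wc -l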

Common Pitfalls / Anti-Patterns

Skipping post-mortems for minor incidents. Every SEV1 and SEV2 deserves a post-mortem. SEV3s and SEV4s can be documented in a sentence, but you should still review patterns. If your SEV4 count is growing, something is wrong.

Not acting on post-mortem action items. A post-mortem without tracked action items is just a document. If the same class of incident happens twice, the first post-mortem failed.

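Severity inflation. Declaring everything SEV1 to get attention makes real SEV1s harder to spot, because your team learns to ignore the noise. Default high when the impact is genuinely unclear, but save SEV1 for actual outages and demote promptly once the impact turns out to be smaller.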

Treating all incidents as fire drills. Not every incident requires everyone to drop everything. A SEV3 can wait until business hours. Waking people up unnecessarily builds resentment that surfaces when a real SEV1 happens.

No clear IC. When everyone is investigating, nobody is coordinating. The IC makes calls, tracks the timeline, and manages communication. Without one, you get parallel investigations, duplicated effort, and conflicting status updates.

Closing Thoughts

Effective incident response is not about being perfect. It is about being consistent. When every incident follows the same playbook (same severity definitions, same escalation paths, same communication templates), your team can focus on the problem instead of the process.

Run the process, learn from it, and improve it. Incidents are inevitable. Outages are optional.

Trade-off Analysis

No single incident response approach works for every situation. These are the key trade-offs you will face.

Severity Declaration: SEV1 vs SEV2

The choice between declaring a SEV1 or SEV2 shapes how much coordination overhead you get versus how much attention the incident receives.

Factor	SEV1	SEV2
War room	All hands, full focus	Lead + 2-3 responders
Communication	Executive updates	Team-level updates
Resolution pressure	Maximum	High
Overhead cost	High	Moderate

Default to higher severity and demote later. It is easier to explain why a SEV1 was over-engineered than why a SEV2 became a SEV1 mid-incident. Partial availability is the hardest case: when in doubt, declare higher.

Incident Commander: Single IC vs Distributed Investigation

Having one person own coordination versus letting everyone investigate in parallel is a fundamental trade-off.

Approach	Pros	Cons
Single IC	Clear decisions, no duplicated effort, one timeline	IC must resist investigating — coordination is a full-time job
No IC (everyone investigates)	More parallel coverage	Conflicting updates, duplicated work, no clear go/no-go

The IC role is non-negotiable for SEV1 and SEV2. For SEV3 and SEV4, a brief Slack thread with a designated lead is usually sufficient.

Rollback vs Fix-Forward

Whether to revert a deployment or push a fix depends on the situation and what you know.

Factor	Rollback	Fix-Forward
When deployment caused incident	Yes	Only if fix is faster
Known-good state available	Yes	N/A
Fix is trivial and fast	No	Yes
In-flight transactions would be lost	No	Yes
Root cause unknown	Risky	Risky either way

The 10-minute rule: If you cannot be confident within 10 minutes that rollback solves the problem, evaluate fix-forward. Both are valid — the goal is fastest time-to-resolution.

Post-Mortem Timing: Same Day vs Within 5 Days

Timing	When to use	Trade-off
Same day	Minor incidents, quick wins	Memory is freshest, but contributing factors that take time to surface may be missed
Within 5 business days	SEV1/SEV2	Time to process, but action items can lose urgency

For SEV1 and SEV2, 5 business days gives enough distance to see patterns rather than just symptoms, while staying close enough to the incident for accurate recall.

Communication Templates vs Ad-Hoc Updates

Approach	Pros	Cons
Templates	Faster, consistent, harder to forget key info	Feels robotic if over-used
Ad-hoc	Flexible, context-specific	Inconsistency, missing stakeholders

Use templates for severity, status page, and stakeholder updates. Adapt the tone but keep the structure. For SEV1, pre-built templates in a shared doc save minutes that matter.

Interview Questions

1. Define the key phases of an incident response lifecycle and explain why each matters.

Expected answer points:

Detection: monitoring, alerting, and user reports — the faster you know, the faster you respond
Containment: isolate the blast radius to prevent further damage while you investigate
Investigation: root cause analysis under pressure — distinguish symptoms from causes
Resolution: fix the problem and verify the fix works in production
Post-mortem: blameless review to identify systemic improvements, not assign blame

2. What is the difference between SEV1, SEV2, and SEV3? How do you decide which severity to assign?

Expected answer points:

SEV1: complete service unavailability, data loss, security breach — the business is on fire
SEV2: major feature degraded but customers can work around it — urgent but not catastrophic
SEV3: minor feature impairment, low user impact — addressed in normal workflow
When in doubt, declare a higher severity and demote later — under-responding is more damaging
Partial availability is the hardest call: treat it as at least SEV2, and declare SEV1 when in doubt about whether the primary workflow is blocked

3. When should you open a war room versus handling an incident asynchronously in Slack?

Expected answer points:

Open a war room for SEV1, multiple-team coordination, or actively worsening incidents
Use async Slack for SEV2, stable incidents, or single-team investigations
War rooms add coordination overhead — justify it with the speed of parallel investigation
Keep war room scope tight: focused on stopping bleeding, not complete diagnosis

4. What makes an effective on-call rotation? How do you balance coverage quality against engineer burnout?

Expected answer points:

Follow-the-sun works for global teams — high coverage at the cost of time zone disruption
Fixed rotations minimize context-switching but can lead to stale knowledge
Dev-first rotation (same team builds and runs) increases ownership but raises burnout risk
Managed services (PagerDuty, Opsgenie) offload scheduling complexity but add cost
Track MTTA and MTTR by rotation — if responders are burned out, numbers show it

5. What is a runbook and what separates a good one from a bad one?

Expected answer points:

A runbook is a step-by-step playbook for a specific incident type
Good runbooks are precise, scoped to one problem, and updated after every incident
Bad runbooks are vague, try to cover too many scenarios, or become outdated
Template-driven runbooks are more maintainable than rigid scripts
Every runbook should have a clear exit criterion — when is this incident resolved?

6. How do you prevent cascading failures during incident response?

Expected answer points:

Identify dependent services early — what breaks when this component fails?
Use circuit breakers to stop failures from propagating upstream
Throttle traffic intentionally to keep the system alive at reduced capacity
Capacity buffers and graceful degradation prevent total collapse under load
Chaos engineering tests these failure modes before they happen in production
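
The circuit breaker point is easier to discuss with code in front of you. Below is a minimal single-threaded sketch of the pattern, not a production implementation: the class name, thresholds, and the injectable `clock` parameter are assumptions for illustration, and a real service would more likely use an established library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Closed, then open after repeated failures, then half-open after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                # Fail fast instead of piling load onto a struggling dependency.
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Cooldown elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky():
    raise ConnectionError("payment provider timed out")


breaker = CircuitBreaker(failure_threshold=2, reset_after_seconds=30.0)
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError as exc:
        print("rejected:", exc)
    except ConnectionError as exc:
        print("failed:", exc)
```

The first two calls reach the dependency and fail. The third is rejected without touching it, which is what keeps one failing component from dragging its callers down with it.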

7. What is a blast radius analysis and when should you perform one during an incident?

Expected answer points:

Blast radius analysis estimates how far the impact spreads from the failure point
Do it immediately after containment — understand scope before adding more responders
Categorize impact: user-facing vs internal, revenue vs reputation, current vs potential
The goal is right-sizing the response — enough to contain, not so much it creates noise

8. How should you communicate incident status to stakeholders during an ongoing incident?

Expected answer points:

Public status page updates within minutes of declaring an incident
Use a single source of truth — one channel, one incident room, one status page
Initial communication: what happened, current impact, what you are doing
Follow-up communication: every 15-30 min for SEV1, less for SEV2
Never speculate about root cause publicly until it is confirmed

9. What is a blameless post-mortem and why is it important for organizational learning?

Expected answer points:

A blameless post-mortem focuses on systemic causes, not individual fault
Engineers who fear blame will hide mistakes, making systemic problems invisible
Effective post-mortems identify contributing factors across people, process, and tooling
Action items must be specific, assigned, and tracked to completion
Review post-mortems regularly — patterns across incidents reveal structural issues

10. How do you distinguish root cause from symptom during incident investigation?

Expected answer points:

Symptoms are observable effects — error rates, latency spikes, failed requests
Root cause is the underlying mechanism that produces the symptom
Ask "why" five times: why did users see errors → why did the service fail → why did it crash
Correlation is not causation — a spike in metric A followed by failure B does not mean A caused B
Instrumentation and tracing help separate coincidence from causality

11. What are the most common monitoring and alerting failures that prolong incidents?

Expected answer points:

Missing metrics: services without RED (Rate, Errors, Duration) coverage hide failure signals
Silent failures: catch-all error handlers that swallow exceptions without alerting
Alert fatigue: too many alerts cause responders to ignore real incidents or respond to them late
No causation signal: alerts show symptoms but give no path to diagnosis
Stale dashboards: dashboards nobody maintains between incidents show outdated state when they are finally needed

12. How do you design on-call schedules that balance coverage quality with engineer wellbeing?

Expected answer points:

Rotation frequency matters more than rotation length — weekly is better than monthly
Compensate on-call time regardless of incident frequency
Allow engineers to opt out of on-call during high-stress personal periods
Track alert volume per service — unhealthy services burn out on-call engineers
Regular on-call training and game days reduce stress during actual incidents

13. What is MTTR and how do you optimize it without sacrificing resolution quality?

Expected answer points:

MTTR (Mean Time To Resolution) measures average time from alert to restored service
High-fidelity alerting reduces MTTR by pointing responders at the failing component, cutting time spent on triage
Runbook maturity reduces MTTR by removing the "what do we do now" phase
Parallel investigation (war room) reduces MTTR for complex incidents
Never trade off resolution quality for speed — incomplete fixes cause repeat incidents

14. What is the role of a dedicated incident commander during a major outage?

Expected answer points:

The incident commander owns coordination, not investigation — they facilitate, not fix
They manage the war room, delegate tasks, and track action items
They own external communication and stakeholder updates
Separating command from investigation prevents tunnel vision and missed angles
Rotation during long incidents prevents fatigue-driven decision errors

15. How do you handle an SLA breach situation when customers are actively being impacted?

Expected answer points:

Declare the breach immediately and update the status page: transparency beats avoidance
Focus engineering effort on the fastest viable fix, not the perfect fix
Communication cadence doubles — every 10-15 minutes for active SLA breach
Account management should be pre-warned before customers call in
Post-incident: review the SLA itself — was it achievable given the system's true capacity?

16. What steps would you take in the first 5 minutes of a suspected cascading failure?

Expected answer points:

Declare incident severity — do not wait for full understanding to start responding
Open the war room and page the incident commander
Check dashboards for upstream dependency failures — look for causal signals, not correlations
Throttle or shed traffic to preserve partial service — degraded is better than dead
Enable circuit breakers on dependent services to stop further propagation

17. How do you facilitate an effective post-mortem meeting without it turning into a blame session?

Expected answer points:

Start with a clear framing: this is about systems and processes, not individuals
Use the 5 whys technique to trace backward from incident to root cause
Every action item needs an owner and a deadline — vague follow-ups are useless
Share post-mortems broadly — organizational learning compounds over time
Track action item completion rate — if it is low, the process has no teeth

18. How would you design an incident response process for a team that has never done formal on-call?

Expected answer points:

Start with alerting: instrument services so failures produce signals, not silence
Define severity levels and response SLAs — even preliminary ones create structure
Create runbooks for the top 3 most common incidents — build habit before building completeness
Run game days: simulate failures in staging to stress-test the process
Retrospect after every incident, no matter how small — build the learning muscle

19. What are the signs that an incident is becoming chronic and how do you manage it differently?

Expected answer points:

Chronic incidents recur multiple times in a short window — same root cause, different symptoms
Warning signs: repeat pages, a service that never leaves the incident state, growing defect backlog
Management approach: stop treating symptoms, allocate dedicated time to root-cause elimination
Elevate to project mode — incidents become tracked work items, not firefighting
Communicate chronic status to stakeholders — ongoing instability requires expectation management

20. How do you balance the need for speed in incident response with documentation and process compliance?

Expected answer points:

During active incident: document actions taken, not analysis — real-time notes are enough
Formal documentation and process updates happen after resolution, not during
Build templates and runbooks in peacetime so incident responders do not create process during crisis
Compliance should be baked into tooling (automated rollback, mandatory SLO gates) not human memory
Post-incident: the review process is where rigor and documentation matter — do not rush it

Conclusion

Key Takeaways

Define severity levels and use them consistently — classify by user impact, not technical interest
Declare SEV1 early and demote if needed — under-responding is worse than over-responding
The Incident Commander coordinates, does not investigate — separate the coordination role from the fix role
Blameless post-mortems focus on systems and processes, not individuals
Track MTTD, MTTA, MTTR over time — you cannot improve what you do not measure
Action items from post-mortems must be tracked and reviewed

Incident Response Checklist

# 1. Define severity levels (SEV1/SEV2/SEV3/SEV4) in your runbook
# 2. Set up PagerDuty escalation with 0 -> 15 -> 30 minute delays
# 3. Create status page templates for Investigating / Identified / Resolved
# 4. Assign an Incident Commander for every SEV1 and SEV2
# 5. Open separate Slack channel per incident: #inc-YYYY-MM-DD-description
# 6. Document timeline as the incident progresses, not after
# 7. Conduct blameless post-mortem within 5 business days of SEV1/SEV2
# 8. Create action items in project management tool with owners and due dates
# 9. Review MTTD, MTTA, MTTR monthly in ops review meeting

For more on building resilient systems, see Chaos Engineering. For alerting best practices, see Alerting in Production. For monitoring best practices and SLOs, see Observability Engineering.
