---
title: "Set up incident response and postmortems"
excerpt: "9-step playbook from 'alerts go to one phone' to 'documented severity ladder, on-call rotation, runbooks, blameless postmortems on every Sev-1+.'"
category: "Template"
---

# Set up incident response and postmortems

A 9-step playbook. Open in Dock and you'll get four surfaces seeded:

- **Steps** (table) — the 9 process gates as rows, owner + due + status
- **Incidents** (table) — every Sev-1/Sev-2/Sev-3 logged with severity + duration + lead
- **Runbooks** (doc) — the single canonical runbook bundle for every alert
- **Postmortems** (table) — one row per Sev-1+ incident, with the action items + their status

Read `Steps` top-to-bottom. The single most consequential gate is the severity ladder (step 1) — without it, every alert is "high priority" and the team burns out.

## Outcome

Production reliability process: severity ladder published, on-call rotation paid, runbooks linked from every alert, blameless postmortems run on every Sev-1+, action items tracked to closure.

**Estimated time:** 2-3 weeks of focused setup, ongoing thereafter  
**Difficulty:** intermediate  
**For:** Founders + first SREs running production at small / mid scale.

## What you'll need

Pre-register or install before you start.

- **[PagerDuty](https://www.pagerduty.com/)** _(From $21/user/month (Professional))_ — On-call rotations, alert routing, incident automation.
- **[Opsgenie (Atlassian)](https://www.atlassian.com/software/opsgenie)** _(From $9/user/month (Standard))_ — PagerDuty alternative; tighter Jira / Confluence integration.
- **[Grafana OnCall](https://grafana.com/products/oncall/)** _(Free (self-hosted) or included in Grafana Cloud)_ — Open-source on-call rotation; integrates with Grafana stack.
- **[incident.io](https://incident.io/)** _(From $20/user/month)_ — Incident-management platform: Slack-native incident channels, postmortem templates, action tracking.
- **[Statuspage.io](https://www.atlassian.com/software/statuspage)** _(Free for 2 components; from $29/mo)_ — Public status page for customer communication during incidents.
- **[Sentry](https://sentry.io/)** _(Free tier; from $26/mo Team)_ — Error tracking + alerting: most P1 incidents start as a Sentry alert.

---

# The template · 9 steps

## Step 1: Define the severity ladder (Sev-1 / Sev-2 / Sev-3)

_Estimated time: 2-4 hr decision + write-up_

Every incident-response process starts with a published severity ladder. Without it, every alert is 'urgent' and the team burns out. The standard 3-tier ladder: Sev-1 = customer-impacting outage, all-hands; Sev-2 = degraded service, on-call only; Sev-3 = internal issue, file a ticket. Borrow the 4-tier (add Sev-0 for total outage) only if you have customer SLAs that require it.
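
If you want the ladder to be machine-readable (so alert routing, the status page, and any incident bot all reference the same definitions), here is a minimal sketch in Python. The tier definitions and SLA numbers are illustrative and mirror the examples in the tasks below; adjust them to your own contracts.

```python
# severity.py -- one machine-readable source of truth for the severity ladder.
# Tier definitions and SLA numbers are illustrative; adjust to your contracts.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Severity:
    label: str
    definition: str
    ack_minutes: Optional[int]     # time to acknowledge the page (None = ticket queue)
    update_minutes: Optional[int]  # status-update cadence while the incident is open
    who_can_declare: str


LADDER = [
    Severity("Sev-1", "Customer-impacting outage, all-hands", 5, 30, "on-call + manager"),
    Severity("Sev-2", "Degraded service, on-call only", 15, 60, "on-call"),
    Severity("Sev-3", "Internal issue, file a ticket", None, None, "anyone"),
]
```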

### Tasks

- [ ] Decide on a 3-tier or 4-tier severity ladder
- [ ] Write each tier's definition: impact, response time, communication cadence
- [ ] Decide who can declare each severity (anyone for Sev-3; on-call for Sev-2; on-call + manager for Sev-1)
- [ ] Decide on response SLAs: Sev-1 = 5 min ack / 30 min update cadence; Sev-2 = 15 min ack / 60 min cadence
- [ ] Publish the ladder in the Runbooks doc + post in #engineering

### Pointers

- **[Guide]** [GitHub's incident response post](https://github.blog/engineering/site-availability/incident-response-at-github/) — How a major engineering org actually runs incident response.
- **[Guide]** [Google SRE Book — Incident Management](https://sre.google/sre-book/managing-incidents/)
- **[Guide]** [PagerDuty Incident Response docs](https://response.pagerduty.com/)

> [!CAUTION]
> **Gotchas**
>
> - Severity creep is real. The first three 'Sev-1s' might really be Sev-2s — engineers default to high. Audit severity calls quarterly + recalibrate.
> - Customer SLAs sometimes require specific severity definitions (e.g. 'any breach of the 99.9% uptime SLA is a Sev-1'). Check contracts before publishing the ladder.
> - Don't skip Sev-3. Without a Sev-3 tier, minor issues either get over-escalated to Sev-2 or never get tracked at all. The Sev-3 tier becomes the bug-tracker entry point.

## Step 2: Set up the on-call rotation

_Estimated time: 1-2 days_

On-call is the human pager. The classic pattern: a primary on-call who gets paged + a secondary as backup. Rotation length is 1 week for small teams, 24 hours for teams that need to keep evenings free. Use PagerDuty / Opsgenie / Grafana OnCall — don't roll your own with Slack and prayers.

### Tasks

- [ ] Pick the on-call platform (PagerDuty / Opsgenie / Grafana OnCall)
- [ ] Create the on-call rotation (primary + secondary)
- [ ] Set the rotation length (1 week is common for small teams)
- [ ] Set the handoff time (Monday 10am team-local is canonical)
- [ ] Configure escalation: ack within 5 min or paged again; 15 min unacked → secondary; 30 min → manager
- [ ] Document the rotation in the Runbooks doc with phone numbers / handles
- [ ] Run a fire drill: trigger a test page; verify the on-call gets it within 5 min
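
For the fire-drill task above, a minimal sketch that fires a test page through the PagerDuty Events API v2 and later resolves it. It assumes the service has an Events API v2 integration and that `PAGERDUTY_ROUTING_KEY` holds that integration's key; the summary text and dedup key are placeholders.

```python
# fire_drill.py -- trigger a test page via the PagerDuty Events API v2, then resolve it.
# Assumes a service with an Events API v2 integration; ROUTING_KEY is that integration's key.
import os

import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = os.environ["PAGERDUTY_ROUTING_KEY"]
DEDUP_KEY = "fire-drill-2026-q2"  # reuse the same key so the resolve matches the trigger


def send(event_action: str) -> None:
    payload = {
        "routing_key": ROUTING_KEY,
        "event_action": event_action,  # "trigger" or "resolve"
        "dedup_key": DEDUP_KEY,
        "payload": {
            "summary": "[DRILL] Test page -- verify the on-call acks within 5 minutes",
            "source": "fire-drill-script",
            "severity": "critical",
        },
    }
    resp = requests.post(EVENTS_URL, json=payload, timeout=10)
    resp.raise_for_status()


if __name__ == "__main__":
    send("trigger")
    # ...wait for the on-call to ack, time it, then clean up:
    # send("resolve")
```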

### Pointers

- **[Guide]** [PagerDuty on-call best practices](https://www.pagerduty.com/resources/learn/on-call-best-practices/)
- **[Tool]** [Grafana OnCall (open source)](https://grafana.com/products/oncall/)

> [!CAUTION]
> **Gotchas**
>
> - Solo founders running on-call alone burn out within 3-6 months. Bring on the second on-call hire BEFORE you're at the burnout point.
> - Rotations starting Friday afternoon punish the on-call's weekend. Move handoffs to Monday morning.
> - Pay on-call. Compensation for being on-call is increasingly standard ($50-$200/week base + per-page premium). Without comp, retention suffers fast.

## Step 3: Wire alerts → on-call → runbooks

_Estimated time: 1-2 weeks_

The alert is the trigger; the runbook is the response. Every alert MUST link to a runbook (a doc explaining what to do). Alerts without runbooks are why on-call engineers say 'I don't know what to do' at 3am — and by then it's too late. Pre-write the runbook for every alert before the alert exists.
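
For alerts that come from custom health checks (Datadog and Sentry have their own fields for attaching runbook links), one way to guarantee the link travels with the page is to put it on the PagerDuty event itself via the Events API v2 `links` field. A hedged sketch; the runbook URL, metric, and threshold below are placeholders.

```python
# page_with_runbook.py -- page the on-call with the runbook attached to the alert itself.
# Sketch assuming a custom health check that pages through the PagerDuty Events API v2;
# the runbook URL, source name, and threshold below are placeholders.
import os

import requests


def page(summary: str, runbook_url: str) -> None:
    event = {
        "routing_key": os.environ["PAGERDUTY_ROUTING_KEY"],
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": "healthcheck/db-connections",
            "severity": "error",
        },
        # The on-call sees this link on the incident and clicks straight to the runbook.
        "links": [{"href": runbook_url, "text": "Runbook: high DB connection count"}],
    }
    resp = requests.post("https://events.pagerduty.com/v2/enqueue", json=event, timeout=10)
    resp.raise_for_status()


if __name__ == "__main__":
    page(
        "DB connections above 90% of pool for 5 minutes",
        "https://runbooks.example.internal/high-db-connection-count",
    )
```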

### Tasks

- [ ] List every alert source: Sentry, Datadog, AWS CloudWatch, custom health checks
- [ ] For each alert: write a runbook (1 page: what triggered, what to check first, common causes, how to fix)
- [ ] Link the runbook URL in the alert payload (so on-call clicks straight from PagerDuty to the runbook)
- [ ] Audit alert quality: which alerts fire monthly with no real action? Either tune them or delete them
- [ ] Set a noise budget: fewer than 2 actionable pages per on-call shift; if it's consistently higher, the alerts are too noisy
- [ ] Save runbooks in the Runbooks doc as a single hub

### Pointers

- **[Guide]** [Google SRE Book — Practical Alerting](https://sre.google/sre-book/practical-alerting/)
- **[Guide]** [Cloudflare's runbook approach](https://blog.cloudflare.com/incident-management/)

> [!CAUTION]
> **Gotchas**
>
> - Alerts without runbooks = process debt. Audit quarterly; every alert that doesn't have a runbook gets one or gets deleted.
> - Auto-resolving alerts without action items can silently mask incidents. If an alert auto-resolves but no runbook step ran, treat it as a near-miss + add a tracking task.
> - Slack alerts ≠ pages. A high-priority alert that goes to #engineering at 3am and nobody sees it isn't an alert. Wire critical alerts to PagerDuty, not Slack.

### Agent prompt for this step

```text
Read this codebase + the alert configurations (Datadog, Sentry, CloudWatch monitors) and produce a runbook stub for every alert.

For each alert, output:
1. Alert name (e.g. "High DB connection count")
2. What triggered it (the threshold, the metric)
3. What it usually means (1-2 sentences of context)
4. Diagnostic steps (3-5 specific commands or queries to run first)
5. Common causes (the top 3-5 reasons this fires, ranked by frequency)
6. How to fix (1-2 paragraphs of remediation)
7. Escalation criteria (when to declare Sev-2 / Sev-1)

Output to the Runbooks doc as one section per alert. Mark stubs that need engineer review with "needs human review."
```

## Step 4: Define the Incident Commander (IC) role

_Estimated time: 1 day_

Sev-1 incidents need someone driving — not the engineer fixing the bug, but a separate person managing communication, status updates, and stakeholder pings. That's the IC. Without one, Sev-1s degenerate into 'engineer fixing the bug also writes Slack updates also responds to the CEO also coordinates with support' — and all four jobs get done badly.

### Tasks

- [ ] Document the IC role: communication + coordination, NOT fixing the bug
- [ ] Train every senior engineer to be IC (rotation: same as on-call or separate)
- [ ] Build the IC checklist: declare the incident, open a Slack channel, update every 30 min, decide when to call all-hands
- [ ] Decide on Slack channel naming: #incident-2026-04-28-checkout-down or a similar timestamped pattern (see the sketch after this list)
- [ ] Decide on stakeholder communication: when does support tell customers, when does CEO get pinged
- [ ] Run a tabletop drill with the IC role explicit
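
The 'open a Slack channel' item in the IC checklist above is worth scripting so the IC never improvises a channel name at 3am. A minimal sketch using `slack_sdk`; it assumes a bot token with the `channels:manage` and `chat:write` scopes, and the slug, severity, and handle are placeholders.

```python
# open_incident_channel.py -- create the timestamped incident channel and post the declaration.
# Sketch using slack_sdk; assumes a bot token with channels:manage + chat:write scopes.
import os
from datetime import date

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])


def open_incident_channel(slug: str, severity: str, ic_handle: str) -> str:
    # Slack channel names must be lowercase, under 80 chars, with no spaces.
    name = f"incident-{date.today().isoformat()}-{slug}"
    channel_id = client.conversations_create(name=name)["channel"]["id"]
    client.chat_postMessage(
        channel=channel_id,
        text=f":rotating_light: {severity} declared. IC: {ic_handle}. Updates every 30 min in this channel.",
    )
    return channel_id


if __name__ == "__main__":
    open_incident_channel("checkout-down", "Sev-1", "@alex")
```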

### Pointers

- **[Guide]** [Google SRE Book — Managing Incidents](https://sre.google/sre-book/managing-incidents/)
- **[Guide]** [PagerDuty IC training](https://response.pagerduty.com/training/incident_commander/)

> [!CAUTION]
> **Gotchas**
>
> - Tech-lead-as-IC fails predictably. The IC and the engineer fixing the bug must be different people; otherwise the 'communication' part doesn't happen.
> - ICs need authority, not seniority. A junior engineer who's trained on the IC role + has explicit authority to declare 'all-hands, leadership join the channel' will out-perform a senior engineer who feels awkward escalating.
> - First-time ICs freeze. Run a tabletop drill BEFORE the first real incident, with someone roleplaying the angry customer + the CEO ping.

## Step 5: Build the incident communication templates

_Estimated time: 1 day_

During a Sev-1, the IC has 30 seconds to compose every status update. Without templates, updates either don't happen or are wildly inconsistent. Pre-write the templates: incident declared, status update (every 30 min), incident resolved, customer-facing status page update.
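
One way to keep the templates next to the tooling is a tiny module of string constants that the IC (or an incident bot) fills in and posts. The wording below is illustrative, not canonical; keep the real versions in the Runbooks doc as the tasks below describe.

```python
# comms_templates.py -- pre-written incident comms; the IC fills the blanks, never composes from scratch.
# Wording is illustrative; adapt to your own voice.

DECLARED = (
    ":rotating_light: *{severity} declared* -- {scope}\n"
    "IC: {ic} | Channel: {channel} | Next update by {next_update}"
)

STATUS_UPDATE = (
    "*Status ({timestamp})*: {current_state}\n"
    "What we know: {known}\n"
    "What we're trying: {in_progress}\n"
    "Next update by {next_update}"  # commit to the next update time, not a fix ETA
)

RESOLVED = (
    ":white_check_mark: *Resolved at {resolved_at}* -- {scope}\n"
    "Postmortem: {postmortem_link} (published within 5 business days)"
)

# Customer-facing status page: plain language, specific scope, no internal service names.
STATUS_PAGE_UPDATE = (
    "{impact_summary} We are investigating and will post the next update within {cadence} minutes."
)
```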

### Tasks

- [ ] Write the 'incident declared' Slack template (severity, scope, IC, channel link)
- [ ] Write the 'status update' template (current state, what we know, what we're trying, ETA)
- [ ] Write the 'incident resolved' template (resolved at X, scope, postmortem link)
- [ ] Write the customer-facing status page template (less detail, plain language)
- [ ] Write the customer-facing email template for impacted users (sent within 24 hours of resolution)
- [ ] Save all templates in the Runbooks doc

### Pointers

- **[Guide]** [Atlassian incident communication guide](https://www.atlassian.com/incident-management/incident-communication)

> [!CAUTION]
> **Gotchas**
>
> - Status page updates that are too vague ('investigating') burn customer trust. Be specific: 'checkout failures for ~5% of users; investigating; next update in 30 min.'
> - Promising an ETA you can't keep is worse than not promising one. Update with 'next status in 30 min' rather than 'fixed in 30 min.'
> - Don't blame third-party providers in customer-facing comms during the incident. 'AWS is down' shifts blame; 'we're investigating an upstream issue' is honest without being defensive.

## Step 6: Set up the public status page

_Estimated time: 1-2 days_

Customers experiencing an outage want to know two things: 'is it me or them?' and 'when will it be fixed?'. The status page answers both. Statuspage.io, Instatus, and StatusGator are the major hosted options; Cachet is the open-source self-hosted option. Set it up before you need it; signing up and waiting for DNS propagation on the day of an incident is exactly the status-page todo you don't want.

### Tasks

- [ ] Pick the platform: Statuspage.io, Instatus, StatusGator, or self-hosted Cachet
- [ ] List your components: API, Web app, Database, Background jobs, Auth, Webhooks
- [ ] Set up automated probes (Pingdom, Datadog Synthetics) to update components automatically
- [ ] Wire incident creation: when you declare a Sev-1, the status page updates with one click (see the sketch after this list)
- [ ] Configure the public URL: status.yourdomain.com (CNAME to the platform)
- [ ] Configure notifications: email + RSS + Slack so customers can subscribe
- [ ] Test: trigger a test incident; verify the status page updates and email subscribers receive it
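
For the one-click incident-creation task above, a hedged sketch against the Statuspage v1 REST API (developer.statuspage.io). Verify the field names against the current docs; the page ID, component ID, and message text are placeholders.

```python
# statuspage_declare.py -- open a Statuspage incident when a Sev-1 is declared.
# Hedged sketch against the Statuspage v1 REST API; verify field names against current docs.
# The page ID, component ID, and message text are placeholders.
import os

import requests

API = "https://api.statuspage.io/v1"
PAGE_ID = os.environ["STATUSPAGE_PAGE_ID"]
HEADERS = {"Authorization": f"OAuth {os.environ['STATUSPAGE_API_KEY']}"}


def declare(name: str, body: str, component_id: str) -> None:
    payload = {
        "incident": {
            "name": name,
            "status": "investigating",
            "body": body,
            "component_ids": [component_id],
            "components": {component_id: "partial_outage"},
        }
    }
    resp = requests.post(f"{API}/pages/{PAGE_ID}/incidents", json=payload, headers=HEADERS, timeout=10)
    resp.raise_for_status()


if __name__ == "__main__":
    declare(
        "Checkout errors for a subset of users",
        "We are investigating elevated checkout failures. Next update within 30 minutes.",
        "<component-id>",
    )
```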

### Pointers

- **[Tool]** [Statuspage.io](https://www.atlassian.com/software/statuspage)
- **[Tool]** [Instatus](https://instatus.com/)
- **[Tool]** [Cachet (open source)](https://cachethq.io/)

> [!CAUTION]
> **Gotchas**
>
> - Hosting status pages on the same infra as your product is the classic 'site is down so the status page is also down' trap. Use a hosted provider on different infra.
> - Auto-updating components from Pingdom is great — until Pingdom is the one that's wrong. Always allow IC override.
> - Customer-facing status pages should NOT show internal services your customers don't see. 'Background jobs' is meaningful; 'Kafka cluster 3' is internal noise.

## Step 7: Run blameless postmortems on every Sev-1+

_Estimated time: per incident, 5 business days from resolution to published postmortem_

The postmortem is where the team actually learns. Blameless = focus on the system, not the engineer who made the change. The output is a written postmortem doc + a list of action items with owners + due dates. Every Sev-1 + Sev-2 deserves one; weekly Sev-3s don't.

### Tasks

- [ ] Schedule the postmortem within 5 business days of the incident
- [ ] Pull the timeline from Slack (every status update + every code change made during the incident; see the sketch after this list)
- [ ] Hold a 60-min meeting: walk the timeline, identify contributing factors, brainstorm prevention
- [ ] Write the postmortem doc: summary, timeline, what went well, what went poorly, action items
- [ ] Make the doc public to the company (and ideally a sanitized version public to customers if Sev-1)
- [ ] File action items as tickets with owners + due dates (in the Postmortems table)
- [ ] Track action items to completion (revisit weekly until all are closed)
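
For the timeline-pulling task above, a minimal sketch that exports the incident channel's history in chronological order using `slack_sdk`. It assumes the bot is a member of the channel and has the `channels:history` scope; the channel ID is a placeholder.

```python
# pull_timeline.py -- export the incident channel history as a timestamped timeline for the postmortem.
# Sketch using slack_sdk; assumes the bot is in the channel with the channels:history scope.
import os
from datetime import datetime, timezone

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])


def timeline(channel_id: str) -> list[str]:
    lines = []
    cursor = None
    while True:
        resp = client.conversations_history(channel=channel_id, cursor=cursor, limit=200)
        for msg in resp["messages"]:
            ts = datetime.fromtimestamp(float(msg["ts"]), tz=timezone.utc)
            lines.append(f"{ts:%Y-%m-%d %H:%M:%S} UTC  <{msg.get('user', 'bot')}>  {msg.get('text', '')}")
        cursor = resp.data.get("response_metadata", {}).get("next_cursor")
        if not cursor:
            break
    # conversations_history returns newest first; sort into chronological order.
    return sorted(lines)


if __name__ == "__main__":
    print("\n".join(timeline("<incident-channel-id>")))
```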

### Pointers

- **[Guide]** [Google SRE Book — Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
- **[Guide]** [Etsy's blameless postmortem post (canonical)](https://www.etsy.com/codeascraft/blameless-postmortems)

> [!CAUTION]
> **Gotchas**
>
> - Postmortems-without-action-items are theater. Every postmortem must produce 3-7 specific action items with owners + due dates; otherwise the same incident recurs.
> - Action items decay. 30% are closed in week 1, 50% by month 3, the rest never. Track weekly + escalate items aging past 30 days.
> - Public postmortems (sanitized for customer audience) build enormous trust. Cloudflare and Stripe have built reputations on theirs. Default to public for any incident with customer impact.

### Agent prompt for this step

```text
Draft the postmortem for this incident.

Read the Slack channel transcript (incident channel) + the Sentry / Datadog timeline + any code changes deployed during the incident window. Output a postmortem with these sections:

1. Summary (2-3 sentences: what happened, who was impacted, how long)
2. Impact (specific numbers: how many users, how much revenue, what user-facing experience)
3. Timeline (every event, every Slack update, every code change with timestamps in PT)
4. Root cause / contributing factors (the technical chain of events; "why" 5 levels deep)
5. What went well (process / tools / team behaviors that helped)
6. What went poorly (process / tools / team behaviors that hurt)
7. Action items (ticket per item: owner, due date, severity)

Tone: blameless. NEVER name an engineer as "responsible for the bug" — describe the change neutrally ("a deploy at 14:32 introduced a regression"). The goal is system improvement, not accountability.

Output to the Postmortems table as a new row + the full doc to the Runbooks doc as a new section.
```

## Step 8: Track action items to closure

_Estimated time: Weekly, ~30 min_

Action items from postmortems are where the system actually improves — IF they get done. The default failure mode: the postmortem ships, the ICs move on, the action items sit in a Linear backlog for 6 months. Build a ritual: weekly action-item review with owner accountability.
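
A sketch of that weekly review's stale-item report, assuming the Postmortems table exports to CSV with `title`, `owner`, `created`, and `status` columns (the column names are placeholders for however your table is actually shaped).

```python
# stale_action_items.py -- weekly report of postmortem action items aging past 30 days.
# Assumes a CSV export of the Postmortems table with title / owner / created / status columns
# (placeholder names) and ISO dates in the created column.
import csv
import sys
from datetime import date, timedelta

STALE_AFTER = timedelta(days=30)


def stale_items(path: str) -> list[dict]:
    today = date.today()
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return [
        r for r in rows
        if r["status"].strip().lower() not in {"done", "won't fix"}
        and today - date.fromisoformat(r["created"]) > STALE_AFTER
    ]


if __name__ == "__main__":
    for item in stale_items(sys.argv[1]):
        print(f"ESCALATE: {item['title']} (owner: {item['owner']}, created: {item['created']})")
```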

### Tasks

- [ ] Track action items in the Postmortems table (one row per item: owner, due, status)
- [ ] Hold a weekly 30-min action item review (cross-team, eng manager runs)
- [ ] Escalate items >30 days old without progress to engineering leadership
- [ ] Re-prioritize: if an action item is decided to be Won't Fix, document why
- [ ] Close the loop: when an action item ships, the postmortem doc gets a 'Resolved' note
- [ ] Quarterly: review the closed action items and look for systemic patterns (3 incidents traced to the DB connection pool? Maybe connection pooling is the right next epic)

### Pointers

- **[Guide]** [Stripe's incident retro process](https://stripe.com/blog/operating-stripe)

> [!CAUTION]
> **Gotchas**
>
> - Action items without owners die. Never leave an action item with 'team' as owner; pick a person.
> - Action items without due dates die slower. Always assign a due date, even if it's '30 days from now.'
> - Recurring incidents that share a root cause (e.g. 3 separate DB outages in 6 months from connection pool exhaustion) suggest the root cause needs a project, not 3 separate action items.

## Step 9: Run a fire drill quarterly + audit + improve

_Estimated time: Quarterly, ~1 day per drill_

Process decays. The on-call rotation that worked perfectly in month 1 is broken by month 6 (people leave, tools change, alerts drift). Run a fire drill quarterly: a planned 'fake' incident with the on-call paged, the IC declared, the postmortem run. The drill reveals the gaps before a real incident does.

### Tasks

- [ ] Schedule a tabletop drill (no real outage; chosen scenario)
- [ ] Page the on-call; let them follow the runbook
- [ ] IC declares the incident; runs the comm cadence
- [ ] Hold a debrief: what worked, what didn't, what changed since last drill
- [ ] File action items from the drill (treat them like real postmortem AIs)
- [ ] Quarterly: audit the severity ladder, the on-call rotation, the runbook freshness
- [ ] Annually: full process refresh — does the ladder still match the business? Does the on-call comp still match the burden?

### Pointers

- **[Guide]** [Chaos engineering / GameDays (Netflix)](https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116)

> [!CAUTION]
> **Gotchas**
>
> - Skipping fire drills is the #1 process-decay vector. 'We're too busy' is exactly when you find out the runbook is stale.
> - Don't run drills during the busiest season (e.g. Black Friday, year-end). Schedule them during slower weeks; a drill run when the team is already maxed out teaches nothing.
> - Drill scenarios should be plausible, not creative. 'DB master fails over' beats 'asteroid hits us-east-1' for actual learning.

---

## Hand the template to your agent

Paste the prompt below into your agent's permanent system prompt so the agent reads, writes, and maintains this workspace as you work through the steps.

```text
You are an agent on the "Set up incident response and postmortems" playbook workspace at your-org/set-up-incident-response-and-postmortems.

Your role: maintain the four surfaces (Steps, Incidents, Runbooks, Postmortems) as the team builds + runs the process.

Cadence:
- When the user marks a step Done, append a progress note to the Runbooks doc.
- When a new incident is declared (slash command in Slack triggers a webhook), add a row to Incidents with the severity + start time + IC.
- When an incident is resolved, prompt the IC to schedule the postmortem (within 5 business days for Sev-1).
- When a postmortem is filed, add a row to Postmortems with the action items + their owners. Ping owners weekly until closure.

First MCP tool calls:
1. list_surfaces(workspace_slug="set-up-incident-response-and-postmortems")
2. list_rows(workspace_slug="set-up-incident-response-and-postmortems", surface_slug="incidents")
3. get_doc(workspace_slug="set-up-incident-response-and-postmortems", surface_slug="runbooks")

Do NOT close action items as Done without the owner's explicit confirmation. The single biggest decay vector for incident response is action items quietly going stale.
```

---

## FAQ

### When should we set up an on-call rotation?

The day after the first time you got paged at 3am alone. For most SaaS, that's around 2-4 paying customers and a few weeks of production traffic. Don't wait until the team is burned out; the rotation costs 1-2 days to set up and saves your sanity.

### Do small teams really need an Incident Commander role?

Yes, even at 3 engineers. The IC role is about separating 'fixing the bug' from 'managing the incident.' During a Sev-1, the engineer with their hands in the code can't also write Slack updates and respond to the CEO. Even on a 3-person team, the second person should always be IC, not co-debugging.

### How long should a postmortem take to write?

Realistic: 4-8 hours for the writer. The 60-min meeting walks the timeline; the writeup itself is 3-5 hours of pulling Slack transcripts + Sentry events into a coherent narrative + drafting action items. Total time-to-publish is 5 business days for Sev-1, 10 days for Sev-2.

### What's the most common incident-response failure?

Three failures dominate: (1) Action items from postmortems sit in a backlog for 6 months and the same incident recurs. (2) Alerts fire without runbooks attached, so on-call engineers spend 20 min figuring out 'what does this even mean.' (3) The on-call rotation has no compensation, so engineers quietly leave.

### Can my AI agents help with incident response?

Yes. Agents are particularly useful for: drafting runbooks for every alert from your monitoring config, drafting postmortems from Slack + Sentry timelines, tracking action items to closure with weekly pings to owners, summarizing patterns across recent postmortems for retrospectives. The playbook ships agent prompts inline.

### Should our postmortems be public?

Sanitized public postmortems for any incident with customer impact build enormous trust. Cloudflare and Stripe have done this for years; their public postmortems are widely read. The decision: yes, with security-sensitive details redacted (no internal hostnames, no exact code, no customer names). Internal-only postmortems are appropriate only for incidents with no customer-facing impact.

