Define the severity ladder (Sev-1 / Sev-2 / Sev-3)
2-4 hr decision + write-up
Every incident-response process starts with a published severity ladder. Without one, every alert is treated as 'urgent' and the team burns out. The standard 3-tier ladder: Sev-1 = customer-impacting outage, all-hands; Sev-2 = degraded service, on-call only; Sev-3 = internal issue, file a ticket. Adopt the 4-tier variant (which adds Sev-0 for total outage) only if you have customer SLAs that require it.
- Decide on a 3-tier or 4-tier severity ladder
- Write each tier's definition: impact, response time, communication cadence
- Decide who can declare each severity (anyone for Sev-3; on-call for Sev-2; on-call + manager for Sev-1)
- Decide on response SLAs: Sev-1 = 5 min ack / 30 min update cadence; Sev-2 = 15 min ack / 60 min cadence
- Publish the ladder in the Runbooks doc + post in #engineering
- Severity creep is real. The first 3 incidents declared Sev-1 may really be Sev-2s, because engineers default to high. Audit severity calls quarterly and recalibrate.
- Customer SLAs sometimes require specific severity definitions (e.g. 'Sev-1 = 99.9% uptime SLA trigger'). Check contracts before publishing the ladder.
- Don't skip Sev-3. Without a Sev-3 tier, minor issues either get over-escalated to Sev-2 or are never tracked. The Sev-3 tier becomes the bug-tracker entry point.
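The ladder above can also be kept as a small machine-readable definition next to the runbook, so that tooling and the published doc do not drift apart. The following is a minimal Python sketch, not a prescribed implementation. The names `SeverityTier`, `LADDER`, `can_declare`, and `ack_overdue` are hypothetical. It assumes that "on-call + manager" means both roles must sign off on a Sev-1, and it leaves the Sev-3 SLA unset because the checklist does not define one.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class SeverityTier:
    name: str
    impact: str                      # tier definition from the ladder
    response: str                    # who responds
    required_roles: FrozenSet[str]   # roles that must agree to declare (assumption)
    ack_minutes: Optional[int]       # None = no SLA stated in the checklist
    update_minutes: Optional[int]    # communication cadence; None = not stated


# 3-tier ladder with the example SLAs from the checklist.
LADDER = {
    tier.name: tier
    for tier in (
        SeverityTier("Sev-1", "customer-impacting outage", "all-hands",
                     frozenset({"on-call", "manager"}), 5, 30),
        SeverityTier("Sev-2", "degraded service", "on-call only",
                     frozenset({"on-call"}), 15, 60),
        # Anyone can declare Sev-3, so no roles are required.
        SeverityTier("Sev-3", "internal issue", "file a ticket",
                     frozenset(), None, None),
    )
}


def can_declare(tier_name: str, roles_present: set) -> bool:
    """True if every role required for this tier is among roles_present."""
    return LADDER[tier_name].required_roles <= frozenset(roles_present)


def ack_overdue(tier_name: str, minutes_since_page: float) -> bool:
    """True if the acknowledgement SLA for this tier has been missed."""
    limit = LADDER[tier_name].ack_minutes
    return limit is not None and minutes_since_page > limit
```

Under these assumptions, `can_declare("Sev-1", {"on-call"})` is false until a manager also signs off, while `can_declare("Sev-3", set())` is true for anyone. If contracts require a Sev-0 tier, it would be one more entry in `LADDER`.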