Incident severity model
Runbook
Source: doc/operations/Incident_Severity_Model.md · doc/operations/runbooks/Incident_Communication_Runbook.md
Severity levels
| Sev | Definition | Response | Comm |
|---|---|---|---|
| Sev1 | Customer-impacting outage, data loss risk, security breach | Page on-call immediately (24/7) | Incident channel + status page within 15 min |
| Sev2 | Degraded service, single-tenant impact, partial feature outage | Page on-call in business hours | Internal stakeholder updates |
| Sev3 | Background error rate elevated, non-customer-visible | Ticket queue | None until triage |
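
The table above can be kept as data, so that alert routing and documentation read from one definition. The following is a minimal sketch, not the project's implementation: the names `Severity`, `SeverityPolicy`, `POLICIES`, and `policy_for` are hypothetical, and only the row text comes from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "Sev1"
    SEV2 = "Sev2"
    SEV3 = "Sev3"


@dataclass(frozen=True)
class SeverityPolicy:
    definition: str
    response: str
    comm: str


# Row text copied from the severity table; the structure is an assumption.
POLICIES = {
    Severity.SEV1: SeverityPolicy(
        definition="Customer-impacting outage, data loss risk, security breach",
        response="Page on-call immediately (24/7)",
        comm="Incident channel + status page within 15 min",
    ),
    Severity.SEV2: SeverityPolicy(
        definition="Degraded service, single-tenant impact, partial feature outage",
        response="Page on-call in business hours",
        comm="Internal stakeholder updates",
    ),
    Severity.SEV3: SeverityPolicy(
        definition="Background error rate elevated, non-customer-visible",
        response="Ticket queue",
        comm="None until triage",
    ),
}


def policy_for(name: str) -> SeverityPolicy:
    """Look up a policy by label, case-insensitively ("sev1", "SEV1", "Sev1")."""
    # Severity(...) raises ValueError for labels that are not in the table.
    return POLICIES[Severity(name.strip().capitalize())]
```

For example, `policy_for("sev2").response` returns the Sev2 response text, and an unknown label such as `"Sev4"` raises `ValueError` instead of silently falling back to a default.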
Triage flow
flowchart TB
A[Alert fires] --> B{Customer impact?}
B -- Outage / data loss --> S1[Sev1]
B -- Degraded / partial --> S2[Sev2]
B -- Internal only --> S3[Sev3]
S1 --> P1[Page on-call immediately]
S1 --> C1[Open incident channel<br/>+ status page]
S1 --> R1[Open runbook]
S1 --> COMM1[Incident comms<br/>every 30 min]
S1 --> POST1[Postmortem required<br/>RCA if root cause spans layers]
S2 --> P2[Page on-call in business hours]
S2 --> R2[Open runbook]
S2 --> COMM2[Internal updates]
S2 --> POST2[Postmortem required]
S3 --> TKT[Ticket queue]
S3 --> R3[Triage; runbook if<br/>recurrent class]
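
The flowchart above can be read as a lookup from customer impact to a severity and a list of follow-up actions. The sketch below illustrates that reading; `Impact`, `TRIAGE`, and `triage` are hypothetical names, and the three impact categories are taken from the edge labels in the flowchart.

```python
from enum import Enum, auto


class Impact(Enum):
    OUTAGE_OR_DATA_LOSS = auto()
    DEGRADED_OR_PARTIAL = auto()
    INTERNAL_ONLY = auto()


# Each branch of the flowchart: impact -> (severity, follow-up actions).
TRIAGE = {
    Impact.OUTAGE_OR_DATA_LOSS: ("Sev1", [
        "Page on-call immediately",
        "Open incident channel + status page",
        "Open runbook",
        "Incident comms every 30 min",
        "Postmortem required (RCA if root cause spans layers)",
    ]),
    Impact.DEGRADED_OR_PARTIAL: ("Sev2", [
        "Page on-call in business hours",
        "Open runbook",
        "Internal updates",
        "Postmortem required",
    ]),
    Impact.INTERNAL_ONLY: ("Sev3", [
        "Ticket queue",
        "Triage; runbook if recurrent class",
    ]),
}


def triage(impact: Impact) -> tuple[str, list[str]]:
    """Return the severity and a copy of the action list for an impact class."""
    severity, actions = TRIAGE[impact]
    return severity, list(actions)
```

Returning a copy of the action list keeps callers from mutating the shared table by accident.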
RCA trigger
A postmortem becomes an RCA when:
- The same owner-layer breaks more than once.
- Recovery required debugging across multiple layers.
- Observability gaps materially increased recovery time.
- The fix should drive follow-up design, tests, or operator workflow changes.
→ See RCAs on record.
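
The RCA triggers listed above can be expressed as a simple check over postmortem findings. This sketch assumes that any single trigger is enough to escalate, which the list does not state explicitly; `PostmortemFindings` and `needs_rca` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PostmortemFindings:
    # One flag per trigger in the list above.
    owner_layer_broke_more_than_once: bool = False
    debugging_spanned_multiple_layers: bool = False
    observability_gap_slowed_recovery: bool = False
    fix_drives_followup_changes: bool = False


def needs_rca(findings: PostmortemFindings) -> bool:
    """Return True if the postmortem should be escalated to an RCA.

    Assumption: the triggers are alternatives, so one is sufficient.
    """
    return any((
        findings.owner_layer_broke_more_than_once,
        findings.debugging_spanned_multiple_layers,
        findings.observability_gap_slowed_recovery,
        findings.fix_drives_followup_changes,
    ))
```

If the intended rule is stricter (for example, two or more triggers), only the body of `needs_rca` needs to change.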
Comms responsibility
→ Read source: Incident_Communication_Runbook.md.
| Role | Sev1 | Sev2 |
|---|---|---|
| On-call engineer | Mitigation + tech updates | Mitigation |
| Comm lead | Status page + customer comms | Internal updates |
| IC (incident commander) | Coordinates handoffs | (not required) |
| Recorder | Timeline + evidence | (not required) |
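
One way to use the role table is to derive which roles must be staffed for a given severity. The sketch below is illustrative only: `RESPONSIBILITIES` and `required_roles` are hypothetical names, "(not required)" is modeled as `None`, and Sev3 returns an empty list because the table does not cover it.

```python
from typing import Optional

# Duties copied from the role table; None stands for "(not required)".
RESPONSIBILITIES: dict[str, dict[str, Optional[str]]] = {
    "On-call engineer": {"Sev1": "Mitigation + tech updates", "Sev2": "Mitigation"},
    "Comm lead": {"Sev1": "Status page + customer comms", "Sev2": "Internal updates"},
    "IC (incident commander)": {"Sev1": "Coordinates handoffs", "Sev2": None},
    "Recorder": {"Sev1": "Timeline + evidence", "Sev2": None},
}


def required_roles(severity: str) -> list[str]:
    """List the roles that have a duty at the given severity, in table order."""
    return [
        role
        for role, duties in RESPONSIBILITIES.items()
        if duties.get(severity) is not None
    ]
```

Under these assumptions, a Sev1 requires all four roles, while a Sev2 requires only the on-call engineer and the comm lead.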
Postmortem template
flowchart TB
A[1. Summary<br/>impact + duration + root cause] --> B[2. Timeline<br/>UTC, minute granularity]
B --> C[3. Detection<br/>how, when, who]
C --> D[4. Impact<br/>customers, revenue, data]
D --> E[5. Root cause<br/>owning layer, concrete defect]
E --> F[6. What went well]
F --> G[7. What went wrong]
G --> H[8. Action items<br/>owner + ETA]
Postmortems are blameless. Every action item must have an owner and an ETA, and must be tracked to closure.
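
The action-item rule can be checked mechanically before a postmortem is accepted. This is a minimal sketch under assumed names (`ActionItem`, `missing_assignment`, `still_open`); it does not reflect any existing tracker integration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    eta: Optional[date] = None
    closed: bool = False


def missing_assignment(items: list[ActionItem]) -> list[ActionItem]:
    """Items that lack an owner or an ETA and so fail the template rule."""
    return [item for item in items if not item.owner or item.eta is None]


def still_open(items: list[ActionItem]) -> list[ActionItem]:
    """Items that have not yet been tracked to closure."""
    return [item for item in items if not item.closed]
```

A review step could reject a postmortem while `missing_assignment(...)` is non-empty and keep it on a follow-up list until `still_open(...)` is empty.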