Incident severity model

Runbook

Source: doc/operations/Incident_Severity_Model.md · doc/operations/runbooks/Incident_Communication_Runbook.md

Severity levels

| Sev | Definition | Response | Comms |
|---|---|---|---|
| Sev1 | Customer-impacting outage, data loss risk, security breach | Page on-call immediately (24/7) | Incident channel + status page within 15 min |
| Sev2 | Degraded service, single-tenant impact, partial feature outage | Page on-call in business hours | Internal stakeholder updates |
| Sev3 | Background error rate elevated, non-customer-visible | Ticket queue | None until triage |
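
For tooling that needs to look up the policy for a given severity, the table can be expressed as data. The following is a minimal sketch: the `Severity`, `SeverityPolicy`, `POLICIES`, and `describe` names are hypothetical and not part of the source documents; only the string values come from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "Sev1"
    SEV2 = "Sev2"
    SEV3 = "Sev3"


@dataclass(frozen=True)
class SeverityPolicy:
    definition: str
    response: str
    comms: str


# Values copied from the severity table; the structure itself is an assumption.
POLICIES = {
    Severity.SEV1: SeverityPolicy(
        definition="Customer-impacting outage, data loss risk, security breach",
        response="Page on-call immediately (24/7)",
        comms="Incident channel + status page within 15 min",
    ),
    Severity.SEV2: SeverityPolicy(
        definition="Degraded service, single-tenant impact, partial feature outage",
        response="Page on-call in business hours",
        comms="Internal stakeholder updates",
    ),
    Severity.SEV3: SeverityPolicy(
        definition="Background error rate elevated, non-customer-visible",
        response="Ticket queue",
        comms="None until triage",
    ),
}


def describe(sev: Severity) -> str:
    """Return a one-line summary of the response and comms policy."""
    policy = POLICIES[sev]
    return f"{sev.value}: {policy.response}; comms: {policy.comms}"


if __name__ == "__main__":
    for sev in Severity:
        print(describe(sev))
```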

Triage flow

flowchart TB
    A[Alert fires] --> B{Customer impact?}
    B -- Outage / data loss --> S1[Sev1]
    B -- Degraded / partial --> S2[Sev2]
    B -- Internal only --> S3[Sev3]

    S1 --> P1[Page on-call immediately]
    S1 --> C1[Open incident channel<br/>+ status page]
    S1 --> R1[Open runbook]
    S1 --> COMM1[Incident comms<br/>every 30 min]
    S1 --> POST1[Postmortem required<br/>RCA if root cause spans layers]

    S2 --> P2[Page on-call in business hours]
    S2 --> R2[Open runbook]
    S2 --> COMM2[Internal updates]
    S2 --> POST2[Postmortem required]

    S3 --> TKT[Ticket queue]
    S3 --> R3[Triage; runbook if<br/>recurrent class]
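
The triage flow above can be sketched as a pair of lookups: customer impact selects a severity, and the severity selects the follow-up actions. This is an illustrative sketch only; the `Impact` enum, the dictionary names, and the `triage` function are assumptions, while the labels mirror the flowchart nodes.

```python
from enum import Enum


class Impact(Enum):
    OUTAGE_OR_DATA_LOSS = "Outage / data loss"
    DEGRADED_OR_PARTIAL = "Degraded / partial"
    INTERNAL_ONLY = "Internal only"


# Decision node: "Customer impact?"
SEVERITY_BY_IMPACT = {
    Impact.OUTAGE_OR_DATA_LOSS: "Sev1",
    Impact.DEGRADED_OR_PARTIAL: "Sev2",
    Impact.INTERNAL_ONLY: "Sev3",
}

# Outgoing edges of each severity node in the flowchart.
ACTIONS_BY_SEVERITY = {
    "Sev1": [
        "Page on-call immediately",
        "Open incident channel + status page",
        "Open runbook",
        "Incident comms every 30 min",
        "Postmortem required (RCA if root cause spans layers)",
    ],
    "Sev2": [
        "Page on-call in business hours",
        "Open runbook",
        "Internal updates",
        "Postmortem required",
    ],
    "Sev3": [
        "Ticket queue",
        "Triage; runbook if recurrent class",
    ],
}


def triage(impact: Impact) -> tuple[str, list[str]]:
    """Map an assessed customer impact to a severity and its action list."""
    sev = SEVERITY_BY_IMPACT[impact]
    return sev, ACTIONS_BY_SEVERITY[sev]


if __name__ == "__main__":
    sev, actions = triage(Impact.DEGRADED_OR_PARTIAL)
    print(sev)
    for action in actions:
        print(" -", action)
```

Keeping the impact-to-severity mapping separate from the action lists means a change to one severity's checklist does not touch the classification logic.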

RCA trigger

A postmortem becomes an RCA when:

  • The same owner-layer breaks more than once.
  • Recovery required debugging across multiple layers.
  • Observability gaps materially increased recovery time.
  • The fix should drive follow-up design, tests, or operator workflow changes.
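
The four triggers above can be encoded as a simple checklist. The sketch below assumes that any single trigger is enough to escalate a postmortem to an RCA; the source list does not say whether the conditions combine with "any" or "all", so treat that as an assumption. The `PostmortemFindings` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostmortemFindings:
    same_layer_broke_before: bool = False
    cross_layer_debugging: bool = False
    observability_gap_slowed_recovery: bool = False
    fix_drives_followup_changes: bool = False


# Field name -> trigger wording from the list above.
TRIGGERS = {
    "same_layer_broke_before": "The same owner-layer breaks more than once",
    "cross_layer_debugging": "Recovery required debugging across multiple layers",
    "observability_gap_slowed_recovery": "Observability gaps materially increased recovery time",
    "fix_drives_followup_changes": "The fix should drive follow-up design, tests, or operator workflow changes",
}


def rca_triggers(findings: PostmortemFindings) -> list[str]:
    """Return the wording of every trigger that applies."""
    return [text for name, text in TRIGGERS.items() if getattr(findings, name)]


def needs_rca(findings: PostmortemFindings) -> bool:
    """Assumption: one matching trigger is sufficient."""
    return bool(rca_triggers(findings))


if __name__ == "__main__":
    findings = PostmortemFindings(cross_layer_debugging=True)
    print(needs_rca(findings), rca_triggers(findings))
```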

→ See RCAs on record.

Comms responsibility

→ Read source: Incident_Communication_Runbook.md.

| Role | Sev1 | Sev2 |
|---|---|---|
| On-call engineer | Mitigation + tech updates | Mitigation |
| Comm lead | Status page + customer comms | Internal updates |
| IC (incident commander) | Coordinates handoffs | (not required) |
| Recorder | Timeline + evidence | (not required) |
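
As an illustration of how the role table might be used when staffing an incident, the sketch below returns only the roles required at a given severity. The `RESPONSIBILITIES` mapping and the `required_roles` function are hypothetical; `None` stands in for "(not required)". The table does not cover Sev3, so the sketch returns no roles for it.

```python
from typing import Optional

# Role -> {severity: responsibility}; None means the role is not required.
RESPONSIBILITIES: dict[str, dict[str, Optional[str]]] = {
    "On-call engineer": {"Sev1": "Mitigation + tech updates", "Sev2": "Mitigation"},
    "Comm lead": {"Sev1": "Status page + customer comms", "Sev2": "Internal updates"},
    "IC (incident commander)": {"Sev1": "Coordinates handoffs", "Sev2": None},
    "Recorder": {"Sev1": "Timeline + evidence", "Sev2": None},
}


def required_roles(sev: str) -> dict[str, str]:
    """Return the roles that must be filled for a severity, with their duties."""
    return {
        role: duties[sev]
        for role, duties in RESPONSIBILITIES.items()
        if duties.get(sev)  # skips None and severities absent from the table
    }


if __name__ == "__main__":
    for role, duty in required_roles("Sev2").items():
        print(f"{role}: {duty}")
```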

Postmortem template

flowchart TB
    A[1. Summary<br/>impact + duration + root cause] --> B[2. Timeline<br/>UTC, minute granularity]
    B --> C[3. Detection<br/>how, when, who]
    C --> D[4. Impact<br/>customers, revenue, data]
    D --> E[5. Root cause<br/>owning layer, concrete defect]
    E --> F[6. What went well]
    F --> G[7. What went wrong]
    G --> H[8. Action items<br/>owner + ETA]

Postmortems are blameless. Every action item must be assigned an owner and tracked to closure.
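
The "owner + ETA" and "tracked to closure" requirements lend themselves to a mechanical check. The following is a minimal sketch under stated assumptions: the `ActionItem` class, its fields, and both helper functions are hypothetical, and the example items are invented purely for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Section order taken from the postmortem template flowchart.
SECTIONS = [
    "Summary",
    "Timeline",
    "Detection",
    "Impact",
    "Root cause",
    "What went well",
    "What went wrong",
    "Action items",
]


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    eta: Optional[date] = None
    closed: bool = False


def missing_owner_or_eta(items: list[ActionItem]) -> list[ActionItem]:
    """Items that violate the 'owner + ETA' requirement."""
    return [item for item in items if not item.owner or item.eta is None]


def still_open(items: list[ActionItem]) -> list[ActionItem]:
    """Items that have not yet been tracked to closure."""
    return [item for item in items if not item.closed]


if __name__ == "__main__":
    # Invented example data, not taken from any real postmortem.
    items = [
        ActionItem("Add alert on queue depth", owner="alice", eta=date(2030, 1, 15)),
        ActionItem("Document failover steps"),
    ]
    for item in missing_owner_or_eta(items):
        print("Needs owner/ETA:", item.description)
```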

Where to look next