Operations

Implemented Runbook

Operating GPUaaS: local-dev setup, observability, runbooks, incident severity, lab topology.

Pages

| Page | Audience |
| --- | --- |
| Local dev setup | Developers, QA, on-call new joiners |
| Observability stack | SRE, platform-control, on-call |
| Runbook index | All on-call |
| Incident severity model | All on-call, comm leads |
| Lab topology | Platform, MAAS operators |

Operating principles

flowchart TB
    A[Alert / SLO breach / incident] --> B{Severity}
    B -- Sev1 --> C[Page on-call<br/>incident channel<br/>comm runbook]
    B -- Sev2 --> D[Page on-call<br/>business hours]
    B -- Sev3 --> E[Ticket queue]
    C & D & E --> F[Open runbook<br/>follow steps<br/>capture evidence]
    F --> G{Resolved?}
    G -- yes --> H[Close incident<br/>postmortem for Sev1/Sev2]
    G -- no --> I[Escalate or open RCA]
    H --> J{Same failure class<br/>twice?}
    J -- yes --> K[RCA required]
    I --> K
    K --> L[Update runbook,<br/>add test, add gate]
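The flow above can also be read as two small decisions: where an incident is routed by severity, and what follow-up is owed once the runbook has been worked. The following is a minimal sketch of that reading, not the platform's actual tooling. The names `Severity`, `Incident`, `route`, and `follow_up` are hypothetical, the action strings are simply the flowchart's node labels, and "same failure class twice" is assumed to mean the incident's failure class already appears in a set of previously seen classes.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Routing labels copied from the flowchart nodes (hypothetical representation).
ROUTES = {
    Severity.SEV1: ["page on-call", "incident channel", "comm runbook"],
    Severity.SEV2: ["page on-call (business hours)"],
    Severity.SEV3: ["ticket queue"],
}


@dataclass
class Incident:
    severity: Severity
    failure_class: str  # assumed label used to detect repeat failures
    resolved: bool


def route(severity):
    """Return the notification targets for a given severity."""
    return list(ROUTES[severity])


def follow_up(incident, seen_failure_classes):
    """Return follow-up actions after the runbook steps have been followed."""
    actions = []
    if incident.resolved:
        actions.append("close incident")
        # Postmortems apply to Sev1 and Sev2 only.
        if incident.severity in (Severity.SEV1, Severity.SEV2):
            actions.append("postmortem")
        # A repeat of an already-seen failure class forces an RCA.
        rca_required = incident.failure_class in seen_failure_classes
    else:
        actions.append("escalate or open RCA")
        rca_required = True  # the unresolved path always leads to RCA
    if rca_required:
        actions += ["RCA required", "update runbook", "add test", "add gate"]
    return actions
```

For example, `follow_up(Incident(Severity.SEV3, "disk-full", True), {"disk-full"})` closes the incident without a postmortem but still requires an RCA, because the failure class has been seen before.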

Source docs

Read these for the policy and ground rules: