Operations
Implemented Runbook
Operating GPUaaS: local-dev setup, observability, runbooks, incident severity, and lab topology.
Pages
| Page | Audience |
|---|---|
| Local dev setup | Developers, QA, on-call new joiners |
| Observability stack | SRE, platform-control, on-call |
| Runbook index | All on-call |
| Incident severity model | All on-call, comm leads |
| Lab topology | Platform, MAAS operators |
Operating principles
```mermaid
flowchart TB
    A[Alert / SLO breach / incident] --> B{Severity}
    B -- Sev1 --> C[Page on-call<br/>incident channel<br/>comm runbook]
    B -- Sev2 --> D[Page on-call<br/>business hours]
    B -- Sev3 --> E[Ticket queue]
    C & D & E --> F[Open runbook<br/>follow steps<br/>capture evidence]
    F --> G{Resolved?}
    G -- yes --> H[Close incident<br/>postmortem for Sev1/Sev2]
    G -- no --> I[Escalate or open RCA]
    H --> J{Same failure class<br/>twice?}
    J -- yes --> K[RCA required]
    I --> K
    K --> L[Update runbook,<br/>add test, add gate]
```
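The flowchart above can also be read as a small decision function. The following is a minimal sketch of that reading, not an implementation used by GPUaaS. The names `Severity`, `Outcome`, `handle`, and `prior_same_class_failures` are hypothetical. It also assumes that "same failure class twice" means at least one earlier incident of the same class, and that an unresolved incident always ends up on the RCA path.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# First-response actions per severity, copied from the flowchart labels.
FIRST_RESPONSE = {
    Severity.SEV1: ["page on-call", "incident channel", "comm runbook"],
    Severity.SEV2: ["page on-call (business hours)"],
    Severity.SEV3: ["ticket queue"],
}


@dataclass
class Outcome:
    first_response: list  # actions taken before opening the runbook
    postmortem: bool      # postmortem owed when the incident is closed
    rca_required: bool    # RCA, then runbook update, test, and gate


def handle(severity, resolved, prior_same_class_failures):
    """Walk one incident through the flowchart and report what it requires.

    prior_same_class_failures is a hypothetical count of earlier incidents
    in the same failure class, not counting the current one.
    """
    if resolved:
        # Closing path: postmortem only for Sev1/Sev2, and an RCA only
        # if this failure class has now been seen twice (assumption).
        postmortem = severity in (Severity.SEV1, Severity.SEV2)
        rca_required = prior_same_class_failures >= 1
    else:
        # Escalation path: the flowchart routes this straight to RCA.
        # No postmortem is marked because the incident is not closed yet.
        postmortem = False
        rca_required = True
    return Outcome(FIRST_RESPONSE[severity], postmortem, rca_required)
```

For example, `handle(Severity.SEV3, True, 1)` reports no postmortem but a required RCA, because a repeat failure triggers the RCA regardless of severity. Whenever `rca_required` is true, the follow-up in the flowchart is the same: update the runbook, add a test, add a gate.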
Source docs
Read these for the policy and ground rules: