Runbook inventory
Source: doc/operations/runbooks/ · 38 runbooks · catalog manifest at runbooks.catalog.json
Every operational failure mode has a runbook. The catalog manifest (runbooks.catalog.json) is machine-readable and powers the in-product admin runbook panel.
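Because the manifest is machine-readable, tooling can read it directly. The sketch below tallies runbooks per category. The manifest schema is not shown on this page, so the `runbooks` array and the `file` and `category` field names are assumptions made for illustration, and the sample contains only three entries.

```python
import json
from collections import Counter

# Hypothetical manifest excerpt. The real runbooks.catalog.json schema may
# differ; the "runbooks", "file", and "category" keys are assumed names.
SAMPLE_MANIFEST = """
{
  "runbooks": [
    {"file": "Billing_Worker_Failure_Runbook.md", "category": "Billing / Payments"},
    {"file": "Queue_Backlog_Runbook.md", "category": "Billing / Payments"},
    {"file": "Terminal_Gateway_Incident_Runbook.md", "category": "Terminal / Proxy"}
  ]
}
"""


def count_by_category(manifest_text: str) -> Counter:
    """Tally runbooks per category from manifest JSON text."""
    manifest = json.loads(manifest_text)
    return Counter(entry["category"] for entry in manifest["runbooks"])


counts = count_by_category(SAMPLE_MANIFEST)
for category, n in sorted(counts.items()):
    print(f"{category}: {n}")
```

The same tally, run against the full manifest, could be compared with the per-category counts in the distribution below.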
By category
pie title Runbook distribution (38 total)
    "GPU Slice" : 4
    "App Platform" : 5
    "Billing / Payments" : 3
    "IAM / Auth" : 3
    "Node / MAAS" : 5
    "Terminal / Proxy" : 4
    "Platform Recovery" : 4
    "Database" : 1
    "Other" : 9
GPU Slice (4)
App Platform (5)
| Runbook | When to use |
| --- | --- |
| App_Artifact_Lifecycle_Incident_Runbook.md | OCI artifact promote/trust/digest issues |
| App_Catalog_Incident_Runbook.md | App manifest registration or catalog page failures |
| App_Platform_Operator_Incident_Runbook.md | Platform-operator app (system app) incident |
| App_Runtime_Billing_Incident_Runbook.md | App-runtime usage records / billing alignment |
| App_Runtime_Lifecycle_Incident_Runbook.md | App instance stuck in create/start/stop/release |
Billing / Payments (3)
| Runbook | When to use |
| --- | --- |
| Billing_Worker_Failure_Runbook.md | Accrual loop wedged, depleted-balance enforcement not triggering |
| Webhook_Processing_Outage_Runbook.md | Stripe webhook backlog or signature failures |
| Queue_Backlog_Runbook.md | NATS DLQ or worker backlog growth |
IAM / Auth (3)
| Runbook | When to use |
| --- | --- |
| IAM_Role_Assignment_and_Membership_Incident_Runbook.md | Membership grant/revoke or role-resolution bug |
| User_Onboarding_Auth_Context_Runbook.md | New-user signup not bootstrapping tenant+project |
| Tenant_Project_Authorization_Runbook.md | Scope-resolution or cross-tenant data leak suspicion |
| Enterprise_Federation_Incident_Runbook.md | Enterprise OIDC SSO breakage |
Node / MAAS (5)
| Runbook | When to use |
| --- | --- |
| Node_Onboarding_Runbook.md | Enrolling a new node end-to-end |
| Node_Agent_Control_Plane_Recovery_2026-03.md | Node-agent ↔ API mTLS or cert rotation issues |
| MAAS_H200_Host_Image_Pipeline_Runbook.md | MAAS image pipeline for H200 hosts |
| Three_Host_Lab_Incident_Runbook.md | Three-host dev/CI/MaaS lab issues |
| Fleet_Telemetry_Incident_Runbook.md | Host telemetry pipeline failures |
Terminal / Proxy (4)
| Runbook | When to use |
| --- | --- |
| Terminal_Gateway_Incident_Runbook.md | Browser terminal stuck/failed to connect |
| Proxied_App_UI_Incident_Runbook.md | Embedded app UI proxy failures |
| API_Degradation_Runbook.md | API p50/p99 latency or error budget burn |
| Platform_Control_Disk_Cleanup_Runbook.md | Platform-control disk space recovery |
Platform Recovery (4)
| Runbook | When to use |
| --- | --- |
| Platform_Control_K3s_Recovery_Runbook.md | K3s control-plane recovery |
| JWKS_Compromise_Breakglass_Runbook.md | Suspected JWKS key compromise |
| Key_Rotation_and_Compromise_Response_Runbook.md | Routine + emergency key rotation |
| Vault_Bootstrap_and_Root_Token_Runbook.md | Vault initial bootstrap, root token handling |
Other
| Runbook | When to use |
| --- | --- |
| Database_Latency_or_Failover_Runbook.md | Postgres latency, replication lag, failover |
| CLI_Incident_and_Support_Triage_Runbook.md | CLI break/regression triage |
| Python_SDK_Incident_and_Observability_Runbook.md | Python SDK issues |
| Provisioning_Workflow_Stuck_Runbook.md | Allocation stuck in requested / provisioning for too long |
| Slurm_Reference_Deploying_Stuck_Runbook.md | Reference Slurm controller stuck during deploy |
| Admin_Ops_Dashboard_Usage_Runbook.md | Operating the in-product admin ops panel |
| Agent_Orchestrator_v2_Runbook.md | Multi-agent execution operations |
| Incident_Communication_Runbook.md | Severity-aligned stakeholder communication |
Lifecycle and ownership
flowchart LR
    A[Alert fires<br/>SLO breach / error budget / page] --> B{Severity?}
    B -- Sev1 --> C[Page on-call<br/>+ incident channel<br/>+ comm runbook]
    B -- Sev2 --> D[Page on-call<br/>during business hours]
    B -- Sev3 --> E[Ticket queue]
    C --> F[Open runbook]
    D --> F
    E --> F
    F --> G[Follow steps<br/>capture evidence]
    G --> H{Resolved?}
    H -- yes --> I[Close incident<br/>postmortem if Sev1/Sev2]
    H -- no --> J[Escalate or<br/>open RCA]
    I --> K[Update runbook if<br/>missing/incorrect step]
Detail: Incident severity model.
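The routing in the flowchart can be read as two small decision functions: one for what happens when an alert fires, one for what happens after the runbook steps are followed. The sketch below is only an illustration of that flow; the `Severity` enum, the function names, and the action strings are hypothetical and do not come from any platform code.

```python
from enum import Enum
from typing import List


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


def initial_actions(severity: Severity) -> List[str]:
    """Actions taken when an alert fires, per the flowchart's severity branch."""
    if severity is Severity.SEV1:
        actions = ["page on-call", "open incident channel", "follow comm runbook"]
    elif severity is Severity.SEV2:
        actions = ["page on-call during business hours"]
    else:
        actions = ["file in ticket queue"]
    # Every branch converges on opening the relevant runbook.
    return actions + ["open runbook"]


def closing_actions(severity: Severity, resolved: bool) -> List[str]:
    """Actions taken after following the runbook steps and capturing evidence."""
    if not resolved:
        return ["escalate or open RCA"]
    actions = ["close incident"]
    if severity in (Severity.SEV1, Severity.SEV2):
        actions.append("write postmortem")
    actions.append("update runbook if a step was missing or incorrect")
    return actions


print(initial_actions(Severity.SEV2))
print(closing_actions(Severity.SEV3, resolved=True))
```

Keeping the two decisions separate mirrors the diagram: severity only affects how the incident is raised and whether a postmortem follows, not which runbook steps are executed.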
Every runbook in this index follows the same structure:
- When to use — the symptom or alert that should open this runbook.
- Owning team — who's accountable for the underlying system.
- Diagnostic steps — what to check first, in order.
- Mitigations — bounded actions that may resolve the incident.
- Escalation — when to open an RCA or page senior on-call.
- Evidence to capture — links/queries/timestamps to record for postmortem.
The format is enforced by the runbooks.catalog.json schema. A new runbook must be registered there before any alert links to it.
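As a rough illustration of these two rules, the sketch below checks a draft runbook for the six section names and flags alert links that point at unregistered runbooks. It is a plain text scan, not the schema validation the catalog actually uses, and the function names and the `New_Draft_Runbook.md` file are hypothetical.

```python
from typing import Iterable, List

# Section names taken from the structure list above.
REQUIRED_SECTIONS = (
    "When to use",
    "Owning team",
    "Diagnostic steps",
    "Mitigations",
    "Escalation",
    "Evidence to capture",
)


def missing_sections(runbook_text: str) -> List[str]:
    """Return required section names that never appear in the runbook text.

    Assumption: a case-insensitive substring match is enough for a first pass.
    """
    lowered = runbook_text.lower()
    return [name for name in REQUIRED_SECTIONS if name.lower() not in lowered]


def unregistered_links(alert_links: Iterable[str], registered: Iterable[str]) -> List[str]:
    """Return runbook files linked from alerts but absent from the catalog."""
    return sorted(set(alert_links) - set(registered))


draft = "When to use\n...\nOwning team\n...\nDiagnostic steps\n...\nMitigations\n..."
print(missing_sections(draft))

registered = {"Queue_Backlog_Runbook.md"}
links = ["Queue_Backlog_Runbook.md", "New_Draft_Runbook.md"]  # second file is hypothetical
print(unregistered_links(links, registered))
```

A check like this could run in CI ahead of the schema validation, so a draft fails early with a readable message.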
Where to look next