Runbook inventory

Source: doc/operations/runbooks/ · 38 runbooks · catalog manifest at runbooks.catalog.json

Every operational failure mode has a runbook. The catalog manifest (runbooks.catalog.json) is machine-readable and powers the in-product admin runbook panel.
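
Because the manifest is machine-readable, a panel or script can group runbooks by category directly from it. The sketch below is illustrative only: the real schema of runbooks.catalog.json is not shown on this page, so the top-level `runbooks` array and the `file` and `category` keys are assumptions, and the sample data is a hand-written excerpt rather than the actual manifest.

```python
import json
from collections import defaultdict

# Hypothetical manifest excerpt. The "runbooks", "file", and "category" keys
# are assumptions; the real runbooks.catalog.json schema may differ.
SAMPLE_MANIFEST = """
{
  "runbooks": [
    {"file": "Billing_Worker_Failure_Runbook.md", "category": "Billing / Payments"},
    {"file": "Queue_Backlog_Runbook.md", "category": "Billing / Payments"},
    {"file": "Node_Onboarding_Runbook.md", "category": "Node / MAAS"}
  ]
}
"""


def group_by_category(manifest_text: str) -> dict[str, list[str]]:
    """Return a mapping of category name to sorted runbook file names."""
    manifest = json.loads(manifest_text)
    groups: dict[str, list[str]] = defaultdict(list)
    for entry in manifest["runbooks"]:
        groups[entry["category"]].append(entry["file"])
    return {category: sorted(files) for category, files in groups.items()}


grouped = group_by_category(SAMPLE_MANIFEST)
for category, files in grouped.items():
    print(f"{category}: {len(files)} runbook(s)")
```

In practice the text would be read from the manifest file instead of an inline string; the grouping logic stays the same.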

By category

pie title Runbook distribution (38 total)
    "GPU Slice" : 4
    "App Platform" : 5
    "Billing / Payments" : 3
    "IAM / Auth" : 3
    "Node / MAAS" : 5
    "Terminal / Proxy" : 4
    "Platform Recovery" : 4
    "Database" : 1
    "Other" : 9

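The chart source above could be regenerated from per-category counts instead of being edited by hand, which also confirms that the slices add up to the stated total. This is a minimal sketch, not the project's actual tooling: `render_pie` is a hypothetical helper, and the counts are copied from the chart itself.

```python
# Per-category counts, copied from the distribution chart above.
COUNTS = {
    "GPU Slice": 4,
    "App Platform": 5,
    "Billing / Payments": 3,
    "IAM / Auth": 3,
    "Node / MAAS": 5,
    "Terminal / Proxy": 4,
    "Platform Recovery": 4,
    "Database": 1,
    "Other": 9,
}


def render_pie(title: str, counts: dict[str, int]) -> str:
    """Render Mermaid pie-chart source with the total embedded in the title."""
    total = sum(counts.values())
    lines = [f"pie title {title} ({total} total)"]
    lines += [f'    "{label}" : {count}' for label, count in counts.items()]
    return "\n".join(lines)


chart_source = render_pie("Runbook distribution", COUNTS)
print(chart_source)
```

Computing the total from the counts keeps the title and the slices from drifting apart when a runbook is added or recategorized.
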
GPU Slice (4)

Runbook	When to use
`GPU_Slice_Cleanup_Blocked_Slot_Runbook.md`	Slot stuck in `cleanup_blocked` (mounted host storage, wipe failure, drift)
`GPU_Slice_Image_Pipeline_Runbook.md`	Slice image build/verify/import/cache invalidation issues
`GPU_Slice_Node_Manual_Bootstrap_Runbook.md`	Manual host bootstrap when MAAS commissioning didn't apply the slice firmware profile
`GPU_Slice_Infra_Enablement_Proposal_v1.md`	Enabling new physical hosts as slice-capable (joint infra + platform work)

App Platform (5)

Runbook	When to use
`App_Artifact_Lifecycle_Incident_Runbook.md`	OCI artifact promote/trust/digest issues
`App_Catalog_Incident_Runbook.md`	App manifest registration or catalog page failures
`App_Platform_Operator_Incident_Runbook.md`	Platform-operator app (system app) incident
`App_Runtime_Billing_Incident_Runbook.md`	App-runtime usage records / billing alignment
`App_Runtime_Lifecycle_Incident_Runbook.md`	App instance stuck in create/start/stop/release

Billing / Payments (3)

Runbook	When to use
`Billing_Worker_Failure_Runbook.md`	Accrual loop wedged, depleted-balance enforcement not triggering
`Webhook_Processing_Outage_Runbook.md`	Stripe webhook backlog or signature failures
`Queue_Backlog_Runbook.md`	NATS DLQ or worker backlog growth

IAM / Auth (3)

Runbook	When to use
`IAM_Role_Assignment_and_Membership_Incident_Runbook.md`	Membership grant/revoke or role-resolution bug
`User_Onboarding_Auth_Context_Runbook.md`	New-user signup not bootstrapping tenant+project
`Tenant_Project_Authorization_Runbook.md`	Scope-resolution or cross-tenant data leak suspicion
`Enterprise_Federation_Incident_Runbook.md`	Enterprise OIDC SSO breakage

Node / MAAS (5)

Runbook	When to use
`Node_Onboarding_Runbook.md`	Enrolling a new node end-to-end
`Node_Agent_Control_Plane_Recovery_2026-03.md`	Node-agent ↔ API mTLS or cert rotation issues
`MAAS_H200_Host_Image_Pipeline_Runbook.md`	MAAS image pipeline for H200 hosts
`Three_Host_Lab_Incident_Runbook.md`	Three-host dev/CI/MAAS lab issues
`Fleet_Telemetry_Incident_Runbook.md`	Host telemetry pipeline failures

Terminal / Proxy (4)

Runbook	When to use
`Terminal_Gateway_Incident_Runbook.md`	Browser terminal stuck/failed to connect
`Proxied_App_UI_Incident_Runbook.md`	Embedded app UI proxy failures
`API_Degradation_Runbook.md`	API p50/p99 latency or error budget burn
`Platform_Control_Disk_Cleanup_Runbook.md`	Platform-control disk space recovery

Platform Recovery (4)

Runbook	When to use
`Platform_Control_K3s_Recovery_Runbook.md`	K3s control-plane recovery
`JWKS_Compromise_Breakglass_Runbook.md`	Suspected JWKS key compromise
`Key_Rotation_and_Compromise_Response_Runbook.md`	Routine + emergency key rotation
`Vault_Bootstrap_and_Root_Token_Runbook.md`	Vault initial bootstrap, root token handling

Other

Runbook	When to use
`Database_Latency_or_Failover_Runbook.md`	Postgres latency, replication lag, failover
`CLI_Incident_and_Support_Triage_Runbook.md`	CLI break/regression triage
`Python_SDK_Incident_and_Observability_Runbook.md`	Python SDK issues
`Provisioning_Workflow_Stuck_Runbook.md`	Allocation stuck in `requested` / `provisioning` for too long
`Slurm_Reference_Deploying_Stuck_Runbook.md`	Reference Slurm controller stuck during deploy
`Admin_Ops_Dashboard_Usage_Runbook.md`	Operating the in-product admin ops panel
`Agent_Orchestrator_v2_Runbook.md`	Multi-agent execution operations
`Incident_Communication_Runbook.md`	Severity-aligned stakeholder communication

Lifecycle and ownership

flowchart LR
    A[Alert fires<br/>SLO breach / error budget / page] --> B{Severity?}
    B -- Sev1 --> C[Page on-call<br/>+ incident channel<br/>+ comm runbook]
    B -- Sev2 --> D[Page on-call<br/>during business hours]
    B -- Sev3 --> E[Ticket queue]
    C --> F[Open runbook]
    D --> F
    E --> F
    F --> G[Follow steps<br/>capture evidence]
    G --> H{Resolved?}
    H -- yes --> I[Close incident<br/>postmortem if Sev1/Sev2]
    H -- no --> J[Escalate or<br/>open RCA]
    I --> K[Update runbook if<br/>missing/incorrect step]

Detail: Incident severity model.
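
The routing in the flowchart above can be sketched as a small lookup. This is an illustration of the diagram only, not the alerting system's real configuration: the `Severity` enum, the function names, and the action strings are hypothetical labels that paraphrase the flowchart nodes.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


def initial_actions(severity: Severity) -> list[str]:
    """Return the first-response actions the flowchart assigns to a severity."""
    if severity is Severity.SEV1:
        return [
            "page on-call",
            "open incident channel",
            "follow communication runbook",
        ]
    if severity is Severity.SEV2:
        return ["page on-call during business hours"]
    return ["file in ticket queue"]


def needs_postmortem(severity: Severity) -> bool:
    """Per the flowchart, closed Sev1 and Sev2 incidents get a postmortem."""
    return severity in (Severity.SEV1, Severity.SEV2)


for sev in Severity:
    print(sev.name, initial_actions(sev), "postmortem:", needs_postmortem(sev))
```

Every severity then converges on the same next step: open the matching runbook, follow it, and capture evidence.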

Runbook format

Every runbook in this index follows the same structure:

When to use — the symptom or alert that should open this runbook.
Owning team — who's accountable for the underlying system.
Diagnostic steps — what to check first, in order.
Mitigations — bounded actions that may resolve the incident.
Escalation — when to open an RCA or page senior on-call.
Evidence to capture — links/queries/timestamps to record for postmortem.

The format is enforced by the runbooks.catalog.json schema. A new runbook must be registered there before any alert links to it.
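
A pre-merge check along these lines could catch an incomplete or unregistered runbook before an alert points at it. The sketch uses plain Python rather than a JSON Schema validator, and every key name (`file`, `when_to_use`, `owning_team`, and so on) is an assumption, since the actual schema is not reproduced on this page.

```python
# Hypothetical manifest keys mapped to the six sections listed above.
REQUIRED_FIELDS = {
    "when_to_use": "When to use",
    "owning_team": "Owning team",
    "diagnostic_steps": "Diagnostic steps",
    "mitigations": "Mitigations",
    "escalation": "Escalation",
    "evidence_to_capture": "Evidence to capture",
}


def missing_sections(entry: dict) -> list[str]:
    """Return the section names that are absent or empty in a catalog entry."""
    return [label for key, label in REQUIRED_FIELDS.items() if not entry.get(key)]


def can_link_from_alert(runbook_file: str, catalog: list[dict]) -> bool:
    """A runbook is linkable only if it is registered and structurally complete."""
    return any(
        entry.get("file") == runbook_file and not missing_sections(entry)
        for entry in catalog
    )


complete = {"file": "Example_Runbook.md", **{key: "filled in" for key in REQUIRED_FIELDS}}
incomplete = {**complete, "file": "Draft_Runbook.md", "escalation": ""}
catalog = [complete, incomplete]

print(missing_sections(incomplete))
print(can_link_from_alert("Example_Runbook.md", catalog))
```

The file names `Example_Runbook.md` and `Draft_Runbook.md` are placeholders, not entries from the real catalog.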

Where to look next

Runbook detail index (full sorted view)
Incident severity model
Observability stack
RCAs on record — the three published post-incident analyses