Runbook inventory

Source: doc/operations/runbooks/ · 38 runbooks · catalog manifest at runbooks.catalog.json

Every operational failure mode has a runbook. The catalog manifest (runbooks.catalog.json) is machine-readable and powers the in-product admin runbook panel.
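Because the manifest is machine-readable, per-category counts like those in the next section can be derived from it rather than maintained by hand. The sketch below is a minimal illustration, assuming the manifest is a JSON object with a top-level `runbooks` array whose entries carry `file` and `category` fields. That layout and those field names are hypothetical, so check them against the real schema before relying on this.

```python
import json
from collections import Counter


def count_by_category(manifest_text: str) -> Counter:
    """Count runbooks per category from the manifest JSON text."""
    manifest = json.loads(manifest_text)
    # Hypothetical layout: {"runbooks": [{"file": ..., "category": ...}, ...]}
    return Counter(entry["category"] for entry in manifest["runbooks"])


# Inline sample standing in for the contents of runbooks.catalog.json.
sample = json.dumps({
    "runbooks": [
        {"file": "Billing_Worker_Failure_Runbook.md", "category": "Billing / Payments"},
        {"file": "Queue_Backlog_Runbook.md", "category": "Billing / Payments"},
        {"file": "Node_Onboarding_Runbook.md", "category": "Node / MAAS"},
    ]
})

counts = count_by_category(sample)
for category, n in sorted(counts.items()):
    print(f"{category}: {n}")
```

Comparing the total of these counts with the number of files under doc/operations/runbooks/ is a cheap way to spot a runbook that was added without being registered.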

By category

pie title Runbook distribution (38 total)
    "GPU Slice" : 4
    "App Platform" : 5
    "Billing / Payments" : 3
    "IAM / Auth" : 3
    "Node / MAAS" : 5
    "Terminal / Proxy" : 4
    "Platform Recovery" : 4
    "Database" : 1
    "Other" : 9

GPU Slice (4)

| Runbook | When to use |
| --- | --- |
| GPU_Slice_Cleanup_Blocked_Slot_Runbook.md | Slot stuck in cleanup_blocked (mounted host storage, wipe failure, drift) |
| GPU_Slice_Image_Pipeline_Runbook.md | Slice image build/verify/import/cache invalidation issues |
| GPU_Slice_Node_Manual_Bootstrap_Runbook.md | Manual host bootstrap when MAAS commissioning didn't apply the slice firmware profile |
| GPU_Slice_Infra_Enablement_Proposal_v1.md | Enabling new physical hosts as slice-capable (joint infra + platform work) |

App Platform (5)

| Runbook | When to use |
| --- | --- |
| App_Artifact_Lifecycle_Incident_Runbook.md | OCI artifact promote/trust/digest issues |
| App_Catalog_Incident_Runbook.md | App manifest registration or catalog page failures |
| App_Platform_Operator_Incident_Runbook.md | Platform-operator app (system app) incident |
| App_Runtime_Billing_Incident_Runbook.md | App-runtime usage records / billing alignment |
| App_Runtime_Lifecycle_Incident_Runbook.md | App instance stuck in create/start/stop/release |

Billing / Payments (3)

| Runbook | When to use |
| --- | --- |
| Billing_Worker_Failure_Runbook.md | Accrual loop wedged, depleted-balance enforcement not triggering |
| Webhook_Processing_Outage_Runbook.md | Stripe webhook backlog or signature failures |
| Queue_Backlog_Runbook.md | NATS DLQ or worker backlog growth |

IAM / Auth (4)

| Runbook | When to use |
| --- | --- |
| IAM_Role_Assignment_and_Membership_Incident_Runbook.md | Membership grant/revoke or role-resolution bug |
| User_Onboarding_Auth_Context_Runbook.md | New-user signup not bootstrapping tenant and project |
| Tenant_Project_Authorization_Runbook.md | Scope-resolution failure or suspected cross-tenant data leak |
| Enterprise_Federation_Incident_Runbook.md | Enterprise OIDC SSO breakage |

Node / MAAS (5)

| Runbook | When to use |
| --- | --- |
| Node_Onboarding_Runbook.md | Enrolling a new node end-to-end |
| Node_Agent_Control_Plane_Recovery_2026-03.md | Node-agent ↔ API mTLS or cert rotation issues |
| MAAS_H200_Host_Image_Pipeline_Runbook.md | MAAS image pipeline for H200 hosts |
| Three_Host_Lab_Incident_Runbook.md | Three-host dev/CI/MAAS lab issues |
| Fleet_Telemetry_Incident_Runbook.md | Host telemetry pipeline failures |

Terminal / Proxy (4)

| Runbook | When to use |
| --- | --- |
| Terminal_Gateway_Incident_Runbook.md | Browser terminal stuck/failed to connect |
| Proxied_App_UI_Incident_Runbook.md | Embedded app UI proxy failures |
| API_Degradation_Runbook.md | API p50/p99 latency or error budget burn |
| Platform_Control_Disk_Cleanup_Runbook.md | Platform-control disk space recovery |

Platform Recovery (4)

| Runbook | When to use |
| --- | --- |
| Platform_Control_K3s_Recovery_Runbook.md | K3s control-plane recovery |
| JWKS_Compromise_Breakglass_Runbook.md | Suspected JWKS key compromise |
| Key_Rotation_and_Compromise_Response_Runbook.md | Routine + emergency key rotation |
| Vault_Bootstrap_and_Root_Token_Runbook.md | Vault initial bootstrap, root token handling |

Other

| Runbook | When to use |
| --- | --- |
| Database_Latency_or_Failover_Runbook.md | Postgres latency, replication lag, failover |
| CLI_Incident_and_Support_Triage_Runbook.md | CLI break/regression triage |
| Python_SDK_Incident_and_Observability_Runbook.md | Python SDK issues |
| Provisioning_Workflow_Stuck_Runbook.md | Allocation stuck in requested / provisioning for too long |
| Slurm_Reference_Deploying_Stuck_Runbook.md | Reference Slurm controller stuck during deploy |
| Admin_Ops_Dashboard_Usage_Runbook.md | Operating the in-product admin ops panel |
| Agent_Orchestrator_v2_Runbook.md | Multi-agent execution operations |
| Incident_Communication_Runbook.md | Severity-aligned stakeholder communication |

Lifecycle and ownership

flowchart LR
    A[Alert fires<br/>SLO breach / error budget / page] --> B{Severity?}
    B -- Sev1 --> C[Page on-call<br/>+ incident channel<br/>+ comm runbook]
    B -- Sev2 --> D[Page on-call<br/>during business hours]
    B -- Sev3 --> E[Ticket queue]
    C --> F[Open runbook]
    D --> F
    E --> F
    F --> G[Follow steps<br/>capture evidence]
    G --> H{Resolved?}
    H -- yes --> I[Close incident<br/>postmortem if Sev1/Sev2]
    H -- no --> J[Escalate or<br/>open RCA]
    I --> K[Update runbook if<br/>missing/incorrect step]

Detail: Incident severity model.
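The routing in the flowchart above can be written down as a small decision function, which is handy when wiring alert rules to paging policy. This is only a sketch of the diagram: the `Severity` enum, the function names, and the action strings are illustrative and do not come from any existing tooling.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


def route_alert(severity: Severity) -> list[str]:
    """Return the first-response actions for an alert, mirroring the flowchart."""
    if severity is Severity.SEV1:
        actions = ["page on-call", "open incident channel", "start comm runbook"]
    elif severity is Severity.SEV2:
        actions = ["page on-call during business hours"]
    else:
        actions = ["file in ticket queue"]
    # Every branch converges on opening the relevant runbook.
    return actions + ["open runbook"]


def needs_postmortem(severity: Severity) -> bool:
    """Sev1 and Sev2 incidents get a postmortem when they are closed."""
    return severity in (Severity.SEV1, Severity.SEV2)


for sev in Severity:
    print(sev.name, route_alert(sev), "postmortem:", needs_postmortem(sev))
```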

Runbook format

Every runbook in this index follows the same structure:

  1. When to use — the symptom or alert that should open this runbook.
  2. Owning team — who's accountable for the underlying system.
  3. Diagnostic steps — what to check first, in order.
  4. Mitigations — bounded actions that may resolve the incident.
  5. Escalation — when to open an RCA or page senior on-call.
  6. Evidence to capture — links/queries/timestamps to record for postmortem.
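The six-part structure above lends itself to a simple lint over a runbook draft. The sketch below is illustrative only: it assumes each section appears as a Markdown heading whose text starts with the section name, which this page does not confirm, and `missing_sections` is a hypothetical helper rather than part of any existing tooling.

```python
import re

REQUIRED_SECTIONS = [
    "When to use",
    "Owning team",
    "Diagnostic steps",
    "Mitigations",
    "Escalation",
    "Evidence to capture",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required section names that have no matching heading."""
    # Collect heading text from lines such as "## When to use" (assumed convention).
    headings = [
        match.group(1).strip().lower()
        for match in re.finditer(r"^#{1,6}[ \t]+(.+)$", runbook_text, flags=re.MULTILINE)
    ]
    return [
        name
        for name in REQUIRED_SECTIONS
        if not any(heading.startswith(name.lower()) for heading in headings)
    ]


# A draft that stops after the mitigations section.
draft = """# Example Runbook
## When to use
## Owning team
## Diagnostic steps
## Mitigations
"""
print(missing_sections(draft))  # ['Escalation', 'Evidence to capture']
```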

The format is enforced by the runbooks.catalog.json schema. A new runbook must be registered in the manifest before any alert links to it.
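The registration rule can be checked mechanically before an alert definition is merged. A rough sketch follows, assuming the manifest exposes a `runbooks` array of entries with a `file` field and that each alert definition carries a `runbook` field naming the file. Both shapes, and the alert names in the sample, are hypothetical.

```python
def unregistered_runbooks(manifest: dict, alerts: list[dict]) -> list[str]:
    """Return runbook files referenced by alerts but absent from the manifest."""
    registered = {entry["file"] for entry in manifest["runbooks"]}
    # Alerts without a runbook link are ignored here; flag them separately if needed.
    referenced = {alert["runbook"] for alert in alerts if alert.get("runbook")}
    return sorted(referenced - registered)


# Hypothetical manifest and alert definitions for illustration.
manifest = {"runbooks": [{"file": "API_Degradation_Runbook.md"}]}
alerts = [
    {"name": "api-error-budget-burn", "runbook": "API_Degradation_Runbook.md"},
    {"name": "queue-depth-high", "runbook": "Queue_Backlog_Runbook.md"},
]
print(unregistered_runbooks(manifest, alerts))  # ['Queue_Backlog_Runbook.md']
```

Run in CI, a non-empty result would fail the build, so an alert can never point at a runbook the admin panel does not know about.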

Where to look next