Runbook index (full)

All 38 runbooks under doc/operations/runbooks/, alphabetised. For the categorised view, see Runbook inventory in /now.

A–C

Runbook
Admin_Ops_Dashboard_Usage_Runbook.md
Agent_Orchestrator_v2_Runbook.md
API_Degradation_Runbook.md
App_Artifact_Lifecycle_Incident_Runbook.md
App_Catalog_Incident_Runbook.md
App_Platform_Operator_Incident_Runbook.md
App_Runtime_Billing_Incident_Runbook.md
App_Runtime_Lifecycle_Incident_Runbook.md
Billing_Worker_Failure_Runbook.md
CLI_Incident_and_Support_Triage_Runbook.md

D–I

Runbook
Database_Latency_or_Failover_Runbook.md
Enterprise_Federation_Incident_Runbook.md
Fleet_Telemetry_Incident_Runbook.md
GPU_Slice_Cleanup_Blocked_Slot_Runbook.md
GPU_Slice_Image_Pipeline_Runbook.md
GPU_Slice_Infra_Enablement_Proposal_v1.md
GPU_Slice_Node_Manual_Bootstrap_Runbook.md
IAM_Role_Assignment_and_Membership_Incident_Runbook.md
Incident_Communication_Runbook.md

J–P

Runbook
JWKS_Compromise_Breakglass_Runbook.md
Key_Rotation_and_Compromise_Response_Runbook.md
MAAS_H200_Host_Image_Pipeline_Runbook.md
Node_Agent_Control_Plane_Recovery_2026-03.md
Node_Onboarding_Runbook.md
Platform_Control_Disk_Cleanup_Runbook.md
Platform_Control_K3s_Recovery_Runbook.md
Provisioning_Workflow_Stuck_Runbook.md
Proxied_App_UI_Incident_Runbook.md
Python_SDK_Incident_and_Observability_Runbook.md

Q–Z

Runbook
Queue_Backlog_Runbook.md
Slurm_Reference_Deploying_Stuck_Runbook.md
Tenant_Project_Authorization_Runbook.md
Terminal_Gateway_Incident_Runbook.md
Three_Host_Lab_Incident_Runbook.md
User_Onboarding_Auth_Context_Runbook.md
Vault_Bootstrap_and_Root_Token_Runbook.md
Webhook_Processing_Outage_Runbook.md

Catalog manifest

doc/operations/runbooks/runbooks.catalog.json is the machine-readable index used by the in-product admin runbook panel. New runbooks must register there before being linked from alerts.
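
A registration check along these lines could run before an alert is wired to a new runbook. This is a minimal sketch only: the manifest schema is not documented on this page, so the top-level `runbooks` array, the per-entry `file` key, and the helper name `unregistered_runbooks` are all assumptions made for illustration.

```python
import json
from pathlib import Path


def unregistered_runbooks(runbook_dir, catalog_path):
    """Return runbook filenames present on disk but absent from the catalog.

    Assumed (hypothetical) manifest shape:
    {"runbooks": [{"file": "API_Degradation_Runbook.md"}, ...]}
    """
    catalog = json.loads(Path(catalog_path).read_text(encoding="utf-8"))
    registered = {entry["file"] for entry in catalog.get("runbooks", [])}
    on_disk = {p.name for p in Path(runbook_dir).glob("*.md")}
    return sorted(on_disk - registered)


if __name__ == "__main__":
    root = Path("doc/operations/runbooks")
    manifest = root / "runbooks.catalog.json"
    if manifest.exists():
        for name in unregistered_runbooks(root, manifest):
            print(f"not registered: {name}")
```

If the real manifest keys entries differently (for example by slug rather than filename), only the set comprehension that builds `registered` would need to change.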

Format requirements

Every runbook contains:

  1. When to use — alert/symptom that opens it.
  2. Owning team.
  3. Diagnostic steps — ordered.
  4. Mitigations — bounded actions.
  5. Escalation — when to open an RCA / page senior on-call.
  6. Evidence to capture — timestamps, queries, links.
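
The six requirements above could be checked mechanically, for instance in a pre-merge lint. The sketch below assumes each requirement appears as a Markdown heading whose text starts with the section title; that layout, and the names `REQUIRED_SECTIONS` and `missing_sections`, are assumptions for illustration rather than an existing tool.

```python
import re

# Section titles taken from the list above. Matching them as Markdown
# heading text is an assumption about how runbooks are laid out.
REQUIRED_SECTIONS = [
    "When to use",
    "Owning team",
    "Diagnostic steps",
    "Mitigations",
    "Escalation",
    "Evidence to capture",
]

# Captures the text of ATX headings such as "## Mitigations".
HEADING = re.compile(r"^#{1,6}[ \t]+(.*?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown_text):
    """Return required section titles that have no matching heading."""
    headings = [h.lower() for h in HEADING.findall(markdown_text)]
    return [
        title
        for title in REQUIRED_SECTIONS
        if not any(h.startswith(title.lower()) for h in headings)
    ]
```

For example, `missing_sections(Path("Queue_Backlog_Runbook.md").read_text())` (with `pathlib.Path` imported) would return an empty list for a runbook that has all six headings. The check covers presence only; it does not verify that diagnostic steps are ordered or that mitigations are bounded.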

Where to look next