SRE Runbook Index

Architecture

  • Ops runbook delivery and mapping model:
    doc/operations/Ops_Runbook_Architecture.md
  • Release smoke validation checklist:
    doc/operations/Release_Smoke_Checklist.md

Core Runbooks

  1. Admin Ops dashboard interpretation and decision-first triage
     doc/operations/runbooks/Admin_Ops_Dashboard_Usage_Runbook.md
  2. API degradation and high error rate
     doc/operations/runbooks/API_Degradation_Runbook.md
  3. Queue backlog and worker saturation
     doc/operations/runbooks/Queue_Backlog_Runbook.md
  4. Billing worker failure
     doc/operations/runbooks/Billing_Worker_Failure_Runbook.md
  5. Webhook processing outage
     doc/operations/runbooks/Webhook_Processing_Outage_Runbook.md
  6. Provisioning workflow stuck/failing
     doc/operations/runbooks/Provisioning_Workflow_Stuck_Runbook.md
  7. Database latency or failover
     doc/operations/runbooks/Database_Latency_or_Failover_Runbook.md
  8. Incident communication and stakeholder updates
     doc/operations/runbooks/Incident_Communication_Runbook.md
  9. Terminal gateway incidents (Option C cutover/rollback)
     doc/operations/runbooks/Terminal_Gateway_Incident_Runbook.md
  10. Node onboarding and bootstrap controls
      doc/operations/runbooks/Node_Onboarding_Runbook.md
  11. Tenant/project authorization failures
      doc/operations/runbooks/Tenant_Project_Authorization_Runbook.md
  12. User onboarding and auth context failures
      doc/operations/runbooks/User_Onboarding_Auth_Context_Runbook.md
  13. IAM role assignment and membership incident response
      doc/operations/runbooks/IAM_Role_Assignment_and_Membership_Incident_Runbook.md
  14. App catalog browse/filter and entitlement incident response
      doc/operations/runbooks/App_Catalog_Incident_Runbook.md
  15. Fleet telemetry incident response (CPU/GPU/Memory/Storage)
      doc/operations/runbooks/Fleet_Telemetry_Incident_Runbook.md
  16. CLI incident and support triage
      doc/operations/runbooks/CLI_Incident_and_Support_Triage_Runbook.md
  17. Python SDK incident and observability triage
      doc/operations/runbooks/Python_SDK_Incident_and_Observability_Runbook.md
  18. App artifact lifecycle and trust incident response
      doc/operations/runbooks/App_Artifact_Lifecycle_Incident_Runbook.md
  19. Slurm reference instance stuck in deploying
      doc/operations/runbooks/Slurm_Reference_Deploying_Stuck_Runbook.md
  20. Platform-control disk cleanup
      doc/operations/runbooks/Platform_Control_Disk_Cleanup_Runbook.md
  21. Platform-control k3s recovery after disk-full or bad local image rollout
      doc/operations/runbooks/Platform_Control_K3s_Recovery_Runbook.md

Runbook Template

  • Trigger condition
  • Impact and blast radius
  • Immediate mitigation
  • Deep diagnosis
  • Recovery steps
  • Post-incident follow-up
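
The template above could be checked mechanically. The following is a minimal sketch, not an existing tool in this repository: the function name `missing_sections` is hypothetical, and it assumes each template section appears in a runbook as a Markdown heading (for example `## Trigger condition`), which this index does not confirm.

```python
import re

# Section names copied from the Runbook Template list above.
TEMPLATE_SECTIONS = [
    "Trigger condition",
    "Impact and blast radius",
    "Immediate mitigation",
    "Deep diagnosis",
    "Recovery steps",
    "Post-incident follow-up",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the template sections that have no matching heading, in template order.

    Assumption (hypothetical convention): each section is written as a
    Markdown heading such as "## Trigger condition". Matching ignores case.
    """
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(
            r"^#{1,6}[ \t]+(.+?)[ \t]*$", runbook_text, flags=re.MULTILINE
        )
    }
    return [name for name in TEMPLATE_SECTIONS if name.lower() not in headings]
```

A reviewer might run this over each file under doc/operations/runbooks/ and treat a non-empty result as a prompt to fill in the missing sections, rather than as a hard failure.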