SRE Runbook Index
Architecture
- Ops runbook delivery and mapping model
doc/operations/Ops_Runbook_Architecture.md
- Release smoke validation checklist
doc/operations/Release_Smoke_Checklist.md
Core Runbooks
- Admin Ops dashboard interpretation and decision-first triage
doc/operations/runbooks/Admin_Ops_Dashboard_Usage_Runbook.md
- API degradation and high error rate
doc/operations/runbooks/API_Degradation_Runbook.md
- Queue backlog and worker saturation
doc/operations/runbooks/Queue_Backlog_Runbook.md
- Billing worker failure
doc/operations/runbooks/Billing_Worker_Failure_Runbook.md
- Webhook processing outage
doc/operations/runbooks/Webhook_Processing_Outage_Runbook.md
- Provisioning workflow stuck/failing
doc/operations/runbooks/Provisioning_Workflow_Stuck_Runbook.md
- Database latency or failover
doc/operations/runbooks/Database_Latency_or_Failover_Runbook.md
- Incident communication and stakeholder updates
doc/operations/runbooks/Incident_Communication_Runbook.md
- Terminal gateway incidents (Option C cutover/rollback)
doc/operations/runbooks/Terminal_Gateway_Incident_Runbook.md
- Node onboarding and bootstrap controls
doc/operations/runbooks/Node_Onboarding_Runbook.md
- Tenant/project authorization failures
doc/operations/runbooks/Tenant_Project_Authorization_Runbook.md
- User onboarding and auth context failures
doc/operations/runbooks/User_Onboarding_Auth_Context_Runbook.md
- IAM role assignment and membership incident response
doc/operations/runbooks/IAM_Role_Assignment_and_Membership_Incident_Runbook.md
- App catalog browse/filter and entitlement incident response
doc/operations/runbooks/App_Catalog_Incident_Runbook.md
- Fleet telemetry incident response (CPU/GPU/Memory/Storage)
doc/operations/runbooks/Fleet_Telemetry_Incident_Runbook.md
- CLI incident and support triage
doc/operations/runbooks/CLI_Incident_and_Support_Triage_Runbook.md
- Python SDK incident and observability triage
doc/operations/runbooks/Python_SDK_Incident_and_Observability_Runbook.md
- App artifact lifecycle and trust incident response
doc/operations/runbooks/App_Artifact_Lifecycle_Incident_Runbook.md
- Slurm reference instance stuck in deploying
doc/operations/runbooks/Slurm_Reference_Deploying_Stuck_Runbook.md
- Platform-control disk cleanup
doc/operations/runbooks/Platform_Control_Disk_Cleanup_Runbook.md
- Platform-control k3s recovery after disk-full or bad local image rollout
doc/operations/runbooks/Platform_Control_K3s_Recovery_Runbook.md
Runbook Template
- Trigger condition
- Impact and blast radius
- Immediate mitigation
- Deep diagnosis
- Recovery steps
- Post-incident follow-up
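
The template above can be checked mechanically. The following is a minimal sketch, not existing tooling in this repository: the function name `missing_sections` and the matching rule (a line that starts with the section name once leading heading or bullet markers are stripped, case-insensitive) are assumptions made for illustration.

```python
import re

# Section names copied from the Runbook Template list above.
TEMPLATE_SECTIONS = [
    "Trigger condition",
    "Impact and blast radius",
    "Immediate mitigation",
    "Deep diagnosis",
    "Recovery steps",
    "Post-incident follow-up",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the template sections not found in the runbook text.

    Assumption: a section counts as present when some line starts with its
    name after leading '#', '-', '*', digits, dots, and spaces are removed.
    """
    found = set()
    for line in runbook_text.splitlines():
        # Strip heading, bullet, and numbering markers from the line start.
        normalized = re.sub(r"^[#\-\*\s\d\.]+", "", line).strip().lower()
        for section in TEMPLATE_SECTIONS:
            if normalized.startswith(section.lower()):
                found.add(section)
    # Preserve template order in the result.
    return [s for s in TEMPLATE_SECTIONS if s not in found]
```

Under these assumptions, a runbook that covers every section yields an empty list, and any non-empty result names the sections still to be written.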