Provisioning Workflow Stuck/Failure Runbook

Trigger

  • Spike in provisioning.failed / provisioning.release_failed.
  • Allocations stuck in requested, provisioning, or releasing beyond the SLO.
  • Control-loop alerts firing:
      • GPUaaSProvisioningQueueDepthHigh
      • GPUaaSProvisioningDispatchLatencyP95High
      • GPUaaSProvisioningTimeoutRateHigh
      • GPUaaSProvisioningFailureBurst
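
A trigger can also be confirmed from the metrics side. A minimal PromQL sketch, assuming the provisioning_failures_total and provisioning_timeouts_total counters listed under Diagnosis below are exposed to Prometheus:

      # Failure and timeout rates over the last 5 minutes; a sustained non-zero
      # failure rate alongside GPUaaSProvisioningFailureBurst confirms the trigger.
      sum(rate(provisioning_failures_total[5m]))
      sum(rate(provisioning_timeouts_total[5m]))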

Impact

  • Users cannot start or release GPU allocations reliably.
  • Potential billing/support impact for stuck states.

Immediate Mitigation

  1. Confirm queue lag for the PROVISIONING stream and worker health (see the lag check below).
  2. Pause new provisioning requests if the failure rate is severe.
  3. Prioritize draining stuck release requests to limit billing exposure.
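
For step 1, the absolute backlog and its recent trend together show whether the consumer is keeping up. A PromQL sketch, assuming the nats_consumer_lag gauge referenced under Diagnosis:

      # Current backlog for the PROVISIONING stream.
      max(nats_consumer_lag{stream="PROVISIONING"})
      # Change in lag over the last 10 minutes; positive and growing means the
      # workers are not draining the queue.
      delta(nats_consumer_lag{stream="PROVISIONING"}[10m])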

Diagnosis

  1. Check provisioning-worker logs for SSH/runtime and DB errors.
  2. Verify NATS consumer lag and redelivery counts:
      • nats_consumer_lag{stream="PROVISIONING"}
  3. Confirm control-loop metrics around dispatch and timeout behavior (example queries below):
      • provisioning_dispatch_latency_seconds (p95)
      • provisioning_timeouts_total
      • provisioning_failures_total
  4. Validate node probe status and network reachability for target nodes.
  5. Inspect failed allocation rows (failure_reason, release_failed_reason).
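
The metrics above can be queried directly in Prometheus. A sketch, assuming provisioning_dispatch_latency_seconds is a histogram (i.e. a _bucket series exists) and the counters need no extra label filtering:

      # Consumer lag for the PROVISIONING stream.
      nats_consumer_lag{stream="PROVISIONING"}

      # p95 dispatch latency over 5-minute windows.
      histogram_quantile(0.95,
        sum(rate(provisioning_dispatch_latency_seconds_bucket[5m])) by (le))

      # Timeout and failure rates.
      sum(rate(provisioning_timeouts_total[5m]))
      sum(rate(provisioning_failures_total[5m]))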

Recovery

  1. Restart the provisioning-worker if its consumer is unhealthy.
  2. Reprocess/replay recoverable events from the DLQ.
  3. Force-release irrecoverable active allocations per policy/runbook.
  4. For GPU slice allocations that leave slots cleanup_blocked, use the API-first slot recovery flow in doc/operations/runbooks/GPU_Slice_Cleanup_Blocked_Slot_Runbook.md.
  5. Confirm state transitions resume (requested -> provisioning -> active, releasing -> released); see the metrics check below.
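
Recovery can be confirmed from the same series used during Diagnosis: the backlog should trend toward zero and the failure/timeout rates should return to their pre-incident baseline. A PromQL sketch:

      # Backlog should be shrinking (negative delta) and approaching zero.
      delta(nats_consumer_lag{stream="PROVISIONING"}[15m])
      # Failure and timeout rates should return to baseline.
      sum(rate(provisioning_failures_total[15m]))
      sum(rate(provisioning_timeouts_total[15m]))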

Post-Incident

  • Capture the top failure reasons and the affected allocation count (example query below).
  • File follow-up fixes for root causes (SSH creds, node health, queue pressure).
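
If the failure counter carries a per-reason label (the "reason" label below is an assumption; the authoritative breakdown is the failure_reason column on the allocation rows), the top reasons over the incident window can be pulled with:

      # Top 5 failure reasons over the last 6 hours; the "reason" label is an
      # assumption. Fall back to the failure_reason column if it is absent.
      topk(5, sum by (reason) (increase(provisioning_failures_total[6h])))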