# GPU Slice `cleanup_blocked` Slot Recovery Runbook

## Purpose

Recover GPU slice slots that are stuck in `cleanup_blocked` without direct
database edits.

Use the API/read-model path first so the recovery is auditable and safe to
automate later. Do not update `node_resource_slots` or
`allocation_resource_claims` by hand during normal operations.
## When To Use This

Use this runbook when:

- an admin node card shows `blocked > 0` for slice slots;
- `GET /api/v1/admin/nodes/{node_id}/resource-slots` returns one or more slots in `cleanup_blocked`;
- a slice allocation failed during `requested` or `provisioning` and left slot state behind.
Do not use this for:
- active slice allocations that still have a running VM;
- bare-metal allocations;
- cases where the node itself is unhealthy and must be drained first.
## API Surfaces

Primary read-model and recovery APIs:

- `GET /api/v1/admin/nodes/{node_id}/resource-slots`
- `POST /api/v1/admin/allocations/{allocation_id}/force-cleanup`

`force-cleanup` behavior:

- with `runtime_verified_absent=false`:
    - the allocation is failed;
    - its claims are failed;
    - slots remain `cleanup_blocked` for further operator inspection.
- with `runtime_verified_absent=true`:
    - the allocation is failed;
    - its claims are failed;
    - slots return to `available`.
The API is intentionally conservative. Set `runtime_verified_absent=true` only
after checking that no slice VM, libvirt domain, or conflicting runtime state
remains on the host.
## Operator Workflow

### 1. Identify the blocked slot and owning allocation

Query node slots:

```bash
curl -sS \
  -H "Authorization: Bearer $TOKEN" \
  "$BASE_URL/api/v1/admin/nodes/$NODE_ID/resource-slots" | jq
```
Find entries with:

- `status == "cleanup_blocked"`
- `health_state == "failed"`
- `health_detail.allocation_id` present

Record:

- `node_id`
- `slot_id`
- `slot_index`
- `allocation_id`
- `health_detail.reason`
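As a convenience, a jq filter along these lines can extract the blocked slots
and the fields above in one pass. It assumes the endpoint returns a JSON array
of slot objects; adjust the leading path (for example `.slots[]`) if your read
model wraps the array:

```bash
curl -sS \
  -H "Authorization: Bearer $TOKEN" \
  "$BASE_URL/api/v1/admin/nodes/$NODE_ID/resource-slots" |
  jq '[.[]
       | select(.status == "cleanup_blocked")
       | {slot_id, slot_index,
          allocation_id: .health_detail.allocation_id,
          reason: .health_detail.reason}]'
```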
### 2. Check the allocation state

Confirm the allocation is no longer active:

```bash
curl -sS \
  -H "Authorization: Bearer $TOKEN" \
  "$BASE_URL/api/v1/admin/allocations/$ALLOCATION_ID" | jq
```

Expected recovery candidates are allocations stuck in:

- `requested`
- `provisioning`
- `failed`
If the allocation is still active, do not force-clean it. Investigate the
runtime first.
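A small guard makes this check scriptable. This is a sketch that assumes the
allocation object exposes its state in a top-level `status` field; rename the
jq path if your schema differs:

```bash
STATUS=$(curl -sS \
  -H "Authorization: Bearer $TOKEN" \
  "$BASE_URL/api/v1/admin/allocations/$ALLOCATION_ID" | jq -r '.status')

case "$STATUS" in
  requested|provisioning|failed)
    echo "allocation $ALLOCATION_ID is a recovery candidate ($STATUS)" ;;
  *)
    echo "allocation $ALLOCATION_ID is $STATUS; do not force-clean" >&2
    exit 1 ;;
esac
```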
### 3. Verify host runtime state

On the slice host, verify whether any runtime for that failed allocation
still exists. Use host evidence, not database assumptions.

Typical checks:

```bash
sudo virsh list --all | grep gpuaas-slice
sudo virsh dominfo "gpuaas-slice-<allocation-id-without-dashes>" || true
sudo ps -ef | grep -E 'dnsmasq|qemu' | grep -v grep
sudo ls /var/lib/libvirt/dnsmasq/
```
Use host-specific checks as needed:
- stale libvirt domain still defined;
- qemu process still running;
- dnsmasq reservation conflict still present;
- raw disk still marked in use by a stale domain.
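These checks can be folded into one helper that automation can call later. The
sketch below is an assumption about how such a check might look, not an
existing tool; the domain-name convention is taken from the `virsh dominfo`
example above. It exits non-zero if any residue is found:

```bash
#!/usr/bin/env bash
# verify_runtime_absent.sh <allocation-id>
# Exits 0 only if no slice runtime residue is found for the allocation.
set -euo pipefail

ALLOC_ID="$1"
DOMAIN="gpuaas-slice-${ALLOC_ID//-/}"   # strip dashes, per the naming above
residue=0

# Stale libvirt domain still defined?
if sudo virsh dominfo "$DOMAIN" >/dev/null 2>&1; then
  echo "residue: libvirt domain $DOMAIN still defined" >&2
  residue=1
fi

# qemu process for this domain still running?
if sudo pgrep -f "qemu.*$DOMAIN" >/dev/null; then
  echo "residue: qemu process for $DOMAIN still running" >&2
  residue=1
fi

exit "$residue"
```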
### 4. Apply force-cleanup through the API

If runtime state is not yet verified absent, fail the allocation but keep the
slot blocked:

```bash
curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-Idempotency-Key: $(uuidgen)" \
  "$BASE_URL/api/v1/admin/allocations/$ALLOCATION_ID/force-cleanup" \
  -d '{
    "reason": "ops cleanup after failed slice provisioning",
    "runtime_verified_absent": false
  }' | jq
```
If runtime state is verified absent, return the slot to `available`:

```bash
curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-Idempotency-Key: $(uuidgen)" \
  "$BASE_URL/api/v1/admin/allocations/$ALLOCATION_ID/force-cleanup" \
  -d '{
    "reason": "ops cleanup after verifying no slice runtime remains",
    "runtime_verified_absent": true
  }' | jq
```

This writes an audit log entry with:

- action: `admin.allocation.force_cleanup.applied`
- the submitted `runtime_verified_absent` value
- the operator-provided reason
### 5. Confirm slot recovery

Re-read the slot inventory:

```bash
curl -sS \
  -H "Authorization: Bearer $TOKEN" \
  "$BASE_URL/api/v1/admin/nodes/$NODE_ID/resource-slots" | jq
```

Expected outcome after a successful verified cleanup:

- the affected slot is `available`
- `health_state` is `healthy`
- `health_detail` is empty

If the slot remains `cleanup_blocked`, re-check host runtime residue before
retrying with `runtime_verified_absent=true`.
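The confirmation can also be expressed as a single assertion. As before, this
sketch assumes a top-level array of slot objects and a `$SLOT_ID` recorded in
step 1:

```bash
curl -sS \
  -H "Authorization: Bearer $TOKEN" \
  "$BASE_URL/api/v1/admin/nodes/$NODE_ID/resource-slots" |
  jq -e --arg slot "$SLOT_ID" \
    '.[] | select(.slot_id == $slot)
         | .status == "available" and .health_state == "healthy"' \
  >/dev/null \
  && echo "slot $SLOT_ID recovered" \
  || echo "slot $SLOT_ID not recovered; re-check host residue" >&2
```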
## Automation Notes

This runbook is intended to become automatable because the state transitions
are already exposed through APIs.

Safe automation shape:

- detect `cleanup_blocked` via the node resource slot API;
- extract `allocation_id` from `health_detail`;
- run host verification logic;
- call `force-cleanup`;
- re-read the slot inventory and stop if the slot does not become `available`.

Do not automate direct SQL edits to slot or claim tables.
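A minimal sketch of that loop, assuming the response shapes used earlier in
this runbook and a host-side `verify_runtime_absent.sh` helper like the one
sketched in step 3 (both are assumptions, not confirmed contracts):

```bash
#!/usr/bin/env bash
# Sweep one node for cleanup_blocked slice slots and recover them via the API.
set -euo pipefail

SLOTS_URL="$BASE_URL/api/v1/admin/nodes/$NODE_ID/resource-slots"

curl -sS -H "Authorization: Bearer $TOKEN" "$SLOTS_URL" |
  jq -r '.[] | select(.status == "cleanup_blocked")
             | .health_detail.allocation_id' |
while read -r alloc_id; do
  # Only claim runtime_verified_absent=true when host verification passes.
  if ./verify_runtime_absent.sh "$alloc_id"; then
    verified=true
  else
    verified=false
  fi

  curl -sS -X POST \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -H "X-Idempotency-Key: $(uuidgen)" \
    "$BASE_URL/api/v1/admin/allocations/$alloc_id/force-cleanup" \
    -d "{\"reason\": \"automated cleanup sweep\",
         \"runtime_verified_absent\": $verified}" >/dev/null
done

# Re-read and fail (non-zero exit) if any slot is still blocked.
curl -sS -H "Authorization: Bearer $TOKEN" "$SLOTS_URL" |
  jq -e '[.[] | select(.status == "cleanup_blocked")] | length == 0' >/dev/null
```

Note that `verify_runtime_absent.sh` must run on the slice host (for example
via ssh), while the API calls run wherever the sweep does; wiring that split
is deployment-specific.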
## Current Example

Example observed on j22u15:

- one slot in `cleanup_blocked`;
- `health_detail.reason` reported a dnsmasq reservation conflict with an
  active VM on the same host;
- the owning failed allocation remained recoverable through
  `POST /api/v1/admin/allocations/{allocation_id}/force-cleanup`.
That is the expected class of incident this runbook covers.