GPU Slice cleanup_blocked Slot Recovery Runbook

Purpose

Recover GPU slice slots that are stuck in cleanup_blocked without direct database edits.

Use the API/read-model path first so the recovery is auditable and safe to automate later. Do not update node_resource_slots or allocation_resource_claims by hand during normal operations.

When To Use This

Use this runbook when:

  • an admin node card shows blocked > 0 for slice slots;
  • GET /api/v1/admin/nodes/{node_id}/resource-slots returns one or more slots in cleanup_blocked;
  • a slice allocation failed during requested or provisioning and left slot state behind.

Do not use this for:

  • active slice allocations that still have a running VM;
  • bare-metal allocations;
  • cases where the node itself is unhealthy and must be drained first.

API Surfaces

Primary read-model and recovery APIs:

  • GET /api/v1/admin/nodes/{node_id}/resource-slots
  • POST /api/v1/admin/allocations/{allocation_id}/force-cleanup

force-cleanup behavior:

  • with runtime_verified_absent=false:
      • the allocation is marked failed;
      • its claims are marked failed;
      • slots remain cleanup_blocked for further operator inspection.
  • with runtime_verified_absent=true:
      • the allocation is marked failed;
      • its claims are marked failed;
      • slots return to available.

The API is intentionally conservative. Set runtime_verified_absent=true only after checking that no slice VM, libvirt domain, or conflicting runtime state remains on the host.

Operator Workflow

1. Identify the blocked slot and owning allocation

Query node slots:

curl -sS \
  -H "Authorization: Bearer $TOKEN" \
  "$BASE_URL/api/v1/admin/nodes/$NODE_ID/resource-slots" | jq

Find entries with:

  • status == "cleanup_blocked"
  • health_state == "failed"
  • health_detail.allocation_id

Record:

  • node_id
  • slot_id
  • slot_index
  • allocation_id
  • health_detail.reason
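The filtering and field extraction above can be wrapped in a single jq helper. This is a sketch only: it assumes the read model returns a top-level "slots" array with the field names listed above; adjust the paths to the actual response shape.

```shell
# list_blocked_slots: filter a slot-inventory response down to the
# cleanup_blocked entries and the fields this step records.
# Assumes a top-level "slots" array (verify against the real payload).
list_blocked_slots() {
  jq '[.slots[]
       | select(.status == "cleanup_blocked")
       | {slot_id, slot_index,
          allocation_id: .health_detail.allocation_id,
          reason: .health_detail.reason}]'
}

# Usage:
# curl -sS -H "Authorization: Bearer $TOKEN" \
#   "$BASE_URL/api/v1/admin/nodes/$NODE_ID/resource-slots" | list_blocked_slots
```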

2. Check the allocation state

Confirm the allocation is no longer active:

curl -sS \
  -H "Authorization: Bearer $TOKEN" \
  "$BASE_URL/api/v1/admin/allocations/$ALLOCATION_ID" | jq

Expected recovery candidates are allocations stuck in:

  • requested
  • provisioning
  • failed

If the allocation is still active, do not force-clean it. Investigate the runtime first.
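Scripts should fail closed on this check. The helper below encodes the candidate states listed above; the commented fetch assumes a top-level ".status" field on the allocation response, which should be verified against the real payload.

```shell
# is_recovery_candidate: return 0 only for allocation states that this
# runbook considers safe to force-clean.
is_recovery_candidate() {
  case "$1" in
    requested|provisioning|failed) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (the ".status" field name is an assumption):
# status=$(curl -sS -H "Authorization: Bearer $TOKEN" \
#   "$BASE_URL/api/v1/admin/allocations/$ALLOCATION_ID" | jq -r '.status')
# is_recovery_candidate "$status" || { echo "allocation still active" >&2; exit 1; }
```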

3. Verify host runtime state

On the slice host, verify whether any runtime for that failed allocation still exists. Use host evidence, not database assumptions.

Typical checks:

sudo virsh list --all | grep gpuaas-slice
sudo virsh dominfo "gpuaas-slice-<allocation-id-without-dashes>" || true
sudo ps -ef | grep -E 'dnsmasq|qemu' | grep -v grep
sudo ls /var/lib/libvirt/dnsmasq/

Typical residue to look for, extending with host-specific checks as needed:

  • stale libvirt domain still defined;
  • qemu process still running;
  • dnsmasq reservation conflict still present;
  • raw disk still marked in use by a stale domain.
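One of the checks above can be folded into a pass/fail helper. The sketch below greps captured `virsh list --all` output rather than calling virsh directly, so it runs anywhere; the domain naming convention is the one shown in step 3, and this covers only the stale-domain case, not qemu or dnsmasq residue.

```shell
# has_stale_domain: return 0 if captured `virsh list --all` output still
# mentions a slice domain for the given allocation id (dashes removed).
has_stale_domain() {
  virsh_output=$1
  alloc_no_dashes=$2
  printf '%s\n' "$virsh_output" | grep -q "gpuaas-slice-$alloc_no_dashes"
}

# Usage on the slice host:
# has_stale_domain "$(sudo virsh list --all)" "$ALLOC_NO_DASHES" \
#   && echo "residue found: do NOT set runtime_verified_absent=true" >&2
```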

4. Apply force-cleanup through the API

If runtime state is not yet verified absent, fail the allocation but keep the slot blocked:

curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-Idempotency-Key: $(uuidgen)" \
  "$BASE_URL/api/v1/admin/allocations/$ALLOCATION_ID/force-cleanup" \
  -d '{
    "reason": "ops cleanup after failed slice provisioning",
    "runtime_verified_absent": false
  }' | jq

If runtime state is verified absent, return the slot to available:

curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-Idempotency-Key: $(uuidgen)" \
  "$BASE_URL/api/v1/admin/allocations/$ALLOCATION_ID/force-cleanup" \
  -d '{
    "reason": "ops cleanup after verifying no slice runtime remains",
    "runtime_verified_absent": true
  }' | jq

This writes an audit log entry with:

  • action: admin.allocation.force_cleanup.applied
  • runtime_verified_absent
  • operator-provided reason
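The two requests above differ only in the flag and the reason, so building the payload in one place keeps them consistent. This sketch constructs the documented body with jq; the endpoint and headers are the ones shown above, and `-d @-` simply feeds the generated JSON to curl on stdin.

```shell
# build_force_cleanup_payload: emit the JSON body used by both calls above.
# $1: operator-provided reason, $2: true|false for runtime_verified_absent
build_force_cleanup_payload() {
  jq -n --arg reason "$1" --argjson verified "$2" \
    '{reason: $reason, runtime_verified_absent: $verified}'
}

# Usage:
# build_force_cleanup_payload "ops cleanup after failed slice provisioning" false |
# curl -sS -X POST \
#   -H "Authorization: Bearer $TOKEN" \
#   -H "Content-Type: application/json" \
#   -H "X-Idempotency-Key: $(uuidgen)" \
#   -d @- \
#   "$BASE_URL/api/v1/admin/allocations/$ALLOCATION_ID/force-cleanup" | jq
```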

5. Confirm slot recovery

Re-read the slot inventory:

curl -sS \
  -H "Authorization: Bearer $TOKEN" \
  "$BASE_URL/api/v1/admin/nodes/$NODE_ID/resource-slots" | jq

Expected outcome after a successful verified cleanup:

  • affected slot is available
  • health_state is healthy
  • health_detail is empty

If the slot remains cleanup_blocked, re-check host runtime residue before retrying with runtime_verified_absent=true.
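Step 5 reduces to asserting that no slot on the node is still cleanup_blocked. The helper below is a sketch with a hypothetical name, assuming the same top-level "slots" array as in step 1; a count of zero confirms recovery.

```shell
# assert_slots_recovered: succeed only if zero slots in the given
# slot-inventory JSON remain cleanup_blocked.
# Assumes a top-level "slots" array (verify against the real payload).
assert_slots_recovered() {
  blocked=$(printf '%s' "$1" |
    jq '[.slots[] | select(.status == "cleanup_blocked")] | length')
  [ "$blocked" -eq 0 ]
}

# Usage:
# slots_json=$(curl -sS -H "Authorization: Bearer $TOKEN" \
#   "$BASE_URL/api/v1/admin/nodes/$NODE_ID/resource-slots")
# assert_slots_recovered "$slots_json" \
#   || echo "slot still cleanup_blocked; re-check host residue" >&2
```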

Automation Notes

This runbook is designed to be automatable: every state transition it uses is already exposed through the APIs above.

Safe automation shape:

  1. detect cleanup_blocked via node resource slot API;
  2. extract allocation_id from health_detail;
  3. run host verification logic;
  4. call force-cleanup;
  5. re-read the slot inventory and stop if the slot does not become available.

Do not automate direct SQL edits to slot or claim tables.

Current Example

Example observed on j22u15:

  • one slot in cleanup_blocked
  • health_detail.reason reported a dnsmasq reservation conflict with an active VM on the same host
  • the owning failed allocation remained recoverable through POST /api/v1/admin/allocations/{allocation_id}/force-cleanup

That is the expected class of incident this runbook covers.