GPU Slice cleanup_blocked Slot Recovery Runbook

Purpose

Recover GPU slice slots that are stuck in cleanup_blocked without direct database edits.

Use the API/read-model path first so the recovery is auditable and safe to automate later. Do not update node_resource_slots or allocation_resource_claims by hand during normal operations.

When To Use This

Use this runbook when:

  • an admin node card shows blocked > 0 for slice slots;
  • GET /api/v1/admin/nodes/{node_id}/resource-slots returns one or more slots in cleanup_blocked;
  • a slice allocation failed during requested or provisioning and left slot state behind.

Do not use this for:

  • active slice allocations that still have a running VM;
  • bare-metal allocations;
  • cases where the node itself is unhealthy and must be drained first.

API Surfaces

Primary read-model and recovery APIs:

  • GET /api/v1/admin/nodes/{node_id}/resource-slots
  • POST /api/v1/admin/allocations/{allocation_id}/force-cleanup

force-cleanup behavior:

  • with runtime_verified_absent=false:
      • the allocation is marked failed;
      • its claims are marked failed;
      • slots remain cleanup_blocked for further operator inspection.
  • with runtime_verified_absent=true:
      • the allocation is marked failed;
      • its claims are marked failed;
      • slots return to available.

The API is intentionally conservative. Set runtime_verified_absent=true only after checking that no slice VM, libvirt domain, or conflicting runtime state remains on the host.

Operator Workflow

1. Identify the blocked slot and owning allocation

Query node slots:

curl -sS \
  -H "Authorization: Bearer $TOKEN" \
  "$BASE_URL/api/v1/admin/nodes/$NODE_ID/resource-slots" | jq

Find entries with:

  • status == "cleanup_blocked"
  • health_state == "failed"
  • health_detail.allocation_id

Record:

  • node_id
  • slot_id
  • slot_index
  • allocation_id
  • health_detail.reason
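The filtering and field extraction above can be wrapped in a single jq helper. This is a sketch only: it assumes the read model returns a top-level "slots" array with the field names listed above; adjust the paths to the actual response shape.

```shell
# list_blocked_slots: filter a slot-inventory response down to the
# cleanup_blocked entries and the fields this step records.
# Assumes a top-level "slots" array (verify against the real payload).
list_blocked_slots() {
  jq '[.slots[]
       | select(.status == "cleanup_blocked")
       | {slot_id, slot_index,
          allocation_id: .health_detail.allocation_id,
          reason: .health_detail.reason}]'
}

# Usage:
# curl -sS -H "Authorization: Bearer $TOKEN" \
#   "$BASE_URL/api/v1/admin/nodes/$NODE_ID/resource-slots" | list_blocked_slots
```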

2. Check the allocation state

Confirm the allocation is no longer active:

curl -sS \
  -H "Authorization: Bearer $TOKEN" \
  "$BASE_URL/api/v1/admin/allocations/$ALLOCATION_ID" | jq

Expected recovery candidates are allocations stuck in:

  • requested
  • provisioning
  • failed

If the allocation is still active, do not force-clean it. Investigate the runtime first.
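Scripts should fail closed on this check. The helper below encodes the candidate states listed above; the commented fetch assumes a top-level ".status" field on the allocation response, which should be verified against the real payload.

```shell
# is_recovery_candidate: return 0 only for allocation states that this
# runbook considers safe to force-clean.
is_recovery_candidate() {
  case "$1" in
    requested|provisioning|failed) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (the ".status" field name is an assumption):
# status=$(curl -sS -H "Authorization: Bearer $TOKEN" \
#   "$BASE_URL/api/v1/admin/allocations/$ALLOCATION_ID" | jq -r '.status')
# is_recovery_candidate "$status" || { echo "allocation still active" >&2; exit 1; }
```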

3. Verify host runtime state

On the slice host, verify whether any runtime for that failed allocation still exists. Use host evidence, not database assumptions.

Typical checks:

sudo virsh list --all | grep gpuaas-slice
sudo virsh dominfo "gpuaas-slice-<allocation-id-without-dashes>" || true
sudo ps -ef | grep -E 'dnsmasq|qemu' | grep -v grep
sudo ls /var/lib/libvirt/dnsmasq/

Typical residue to look for, extending with host-specific checks as needed:

  • stale libvirt domain still defined;
  • qemu process still running;
  • dnsmasq reservation conflict still present;
  • raw disk still marked in use by a stale domain.
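One of the checks above can be folded into a pass/fail helper. The sketch below greps captured `virsh list --all` output rather than calling virsh directly, so it runs anywhere; the domain naming convention is the one shown in step 3, and this covers only the stale-domain case, not qemu or dnsmasq residue.

```shell
# has_stale_domain: return 0 if captured `virsh list --all` output still
# mentions a slice domain for the given allocation id (dashes removed).
has_stale_domain() {
  virsh_output=$1
  alloc_no_dashes=$2
  printf '%s\n' "$virsh_output" | grep -q "gpuaas-slice-$alloc_no_dashes"
}

# Usage on the slice host:
# has_stale_domain "$(sudo virsh list --all)" "$ALLOC_NO_DASHES" \
#   && echo "residue found: do NOT set runtime_verified_absent=true" >&2
```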

4. Apply force-cleanup through the API

If runtime state is not yet verified absent, fail the allocation but keep the slot blocked:

curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-Idempotency-Key: $(uuidgen)" \
  "$BASE_URL/api/v1/admin/allocations/$ALLOCATION_ID/force-cleanup" \
  -d '{
    "reason": "ops cleanup after failed slice provisioning",
    "runtime_verified_absent": false
  }' | jq

If runtime state is verified absent, return the slot to available:

curl -sS -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "X-Idempotency-Key: $(uuidgen)" \
  "$BASE_URL/api/v1/admin/allocations/$ALLOCATION_ID/force-cleanup" \
  -d '{
    "reason": "ops cleanup after verifying no slice runtime remains",
    "runtime_verified_absent": true
  }' | jq

This writes an audit log entry with:

  • action: admin.allocation.force_cleanup.applied
  • runtime_verified_absent
  • operator-provided reason
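The two requests above differ only in the flag and the reason, so building the payload in one place keeps them consistent. This sketch constructs the documented body with jq; the endpoint and headers are the ones shown above, and `-d @-` simply feeds the generated JSON to curl on stdin.

```shell
# build_force_cleanup_payload: emit the JSON body used by both calls above.
# $1: operator-provided reason, $2: true|false for runtime_verified_absent
build_force_cleanup_payload() {
  jq -n --arg reason "$1" --argjson verified "$2" \
    '{reason: $reason, runtime_verified_absent: $verified}'
}

# Usage:
# build_force_cleanup_payload "ops cleanup after failed slice provisioning" false |
# curl -sS -X POST \
#   -H "Authorization: Bearer $TOKEN" \
#   -H "Content-Type: application/json" \
#   -H "X-Idempotency-Key: $(uuidgen)" \
#   -d @- \
#   "$BASE_URL/api/v1/admin/allocations/$ALLOCATION_ID/force-cleanup" | jq
```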

5. Confirm slot recovery

Re-read the slot inventory:

curl -sS \
  -H "Authorization: Bearer $TOKEN" \
  "$BASE_URL/api/v1/admin/nodes/$NODE_ID/resource-slots" | jq

Expected outcome after a successful verified cleanup:

  • affected slot is available
  • health_state is healthy
  • health_detail is empty

If the slot remains cleanup_blocked, re-check host runtime residue before retrying with runtime_verified_absent=true.
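Step 5 reduces to asserting that no slot on the node is still cleanup_blocked. The helper below is a sketch with a hypothetical name, assuming the same top-level "slots" array as in step 1; a count of zero confirms recovery.

```shell
# assert_slots_recovered: succeed only if zero slots in the given
# slot-inventory JSON remain cleanup_blocked.
# Assumes a top-level "slots" array (verify against the real payload).
assert_slots_recovered() {
  blocked=$(printf '%s' "$1" |
    jq '[.slots[] | select(.status == "cleanup_blocked")] | length')
  [ "$blocked" -eq 0 ]
}

# Usage:
# slots_json=$(curl -sS -H "Authorization: Bearer $TOKEN" \
#   "$BASE_URL/api/v1/admin/nodes/$NODE_ID/resource-slots")
# assert_slots_recovered "$slots_json" \
#   || echo "slot still cleanup_blocked; re-check host residue" >&2
```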

Automation Notes

This runbook is designed to be automatable: every state transition it uses is already exposed through the APIs above.

Safe automation shape:

  1. detect cleanup_blocked via node resource slot API;
  2. extract allocation_id from health_detail;
  3. run host verification logic;
  4. call force-cleanup;
  5. re-read the slot inventory and stop if the slot does not become available.

Do not automate direct SQL edits to slot or claim tables.

Current Example

Example observed on j22u15:

  • one slot in cleanup_blocked
  • health_detail.reason reported a dnsmasq reservation conflict with an active VM on the same host
  • the owning failed allocation remained recoverable through POST /api/v1/admin/allocations/{allocation_id}/force-cleanup

That is the expected class of incident this runbook covers.