# Allocation lifecycle

Status: Implemented Contract

Source: `packages/services/provisioning/orchestrator/service.go` · `cmd/provisioning-worker/` · `doc/architecture/State_Machines.md` · `doc/api/asyncapi.draft.yaml`

The allocation FSM is the central state machine of the platform. The same machine drives both the `baremetal` and `gpu_slice` capacity shapes.
## State diagram

```mermaid
stateDiagram-v2
    [*] --> requested: API POST /allocations<br/>(orchestrator reserves slots/node + outbox)
    requested --> provisioning: worker picks up<br/>provisioning.requested
    provisioning --> active: node-agent reports ready
    provisioning --> failed: image / vfio / virt-install /<br/>readiness / lease conflict
    active --> releasing: user release OR<br/>billing force-release OR<br/>admin force-release
    releasing --> released: cleanup proof complete
    releasing --> release_failed: cleanup retries exhausted
    release_failed --> releasing: user retry OR<br/>admin force-release
    failed --> [*]
    released --> [*]
```
## Status semantics

| Status | Meaning | Who can transition out |
|---|---|---|
| `requested` | Slots/node reserved in DB; outbox written | Worker (on `provisioning.requested`) |
| `provisioning` | Node-agent executing typed tasks | Worker (on terminal task result) |
| `active` | Resources usable by tenant; billing accruing | User / billing / admin |
| `releasing` | Cleanup running | Worker |
| `released` | Cleanup proven complete | (terminal) |
| `failed` | Provisioning gave up | (terminal) |
| `release_failed` | Cleanup exhausted retries; billing stopped | User retry / admin force-release |
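The diagram and table above can be sketched as an allowed-transitions map. This is an illustrative sketch, not the orchestrator's actual API: `allocStatus`, `allowed`, and `canTransition` are hypothetical names.

```go
package main

import "fmt"

// allocStatus mirrors the statuses in the table above.
type allocStatus string

const (
	statusRequested     allocStatus = "requested"
	statusProvisioning  allocStatus = "provisioning"
	statusActive        allocStatus = "active"
	statusReleasing     allocStatus = "releasing"
	statusReleased      allocStatus = "released"
	statusFailed        allocStatus = "failed"
	statusReleaseFailed allocStatus = "release_failed"
)

// allowed maps each status to its legal next statuses.
// released and failed are terminal: no outgoing edges.
var allowed = map[allocStatus][]allocStatus{
	statusRequested:     {statusProvisioning},
	statusProvisioning:  {statusActive, statusFailed},
	statusActive:        {statusReleasing},
	statusReleasing:     {statusReleased, statusReleaseFailed},
	statusReleaseFailed: {statusReleasing}, // user retry / admin force-release
}

// canTransition reports whether the FSM permits from -> to.
func canTransition(from, to allocStatus) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(statusActive, statusReleasing))   // true
	fmt.Println(canTransition(statusReleased, statusReleasing)) // false: terminal
}
```

Guarding every `UPDATE allocations SET status=…` with a check like this keeps illegal edges (e.g. `released → releasing`) out of the database even if two workers race.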
## Provisioning sequence

```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant API as cmd/api
    participant ORCH as orchestrator
    participant DB as Postgres
    participant OR as outbox-relay
    participant NATS as NATS
    participant W as provisioning-worker
    participant TMP as Temporal
    participant RT as runtime_agent.go
    participant N as cmd/node-agent
    participant BW as billing-worker
    participant NR as notification-relay
    U->>API: POST /allocations<br/>{sku, gpus, region, ssh_key_ids}
    API->>ORCH: CreateRequested(...)
    ORCH->>DB: BEGIN
    ORCH->>DB: SELECT sku_catalog<br/>(shape, allowed_counts)
    alt baremetal
        ORCH->>DB: SELECT nodes FOR UPDATE SKIP LOCKED<br/>filter active + not-claimed
        ORCH->>DB: INSERT allocation, node_exclusive claim
    else gpu_slice
        ORCH->>DB: listSlicePlacementCandidates (filter + os_images)
        ORCH->>ORCH: rankSlicePlacementCandidates (NUMA-fit, best-fit)
        ORCH->>DB: lockAvailableSliceSlots FOR UPDATE SKIP LOCKED
        ORCH->>DB: UPDATE slots SET status='reserved'
        ORCH->>DB: INSERT allocation + N slot claims
    end
    ORCH->>DB: INSERT outbox_events: provisioning.requested
    ORCH->>DB: COMMIT
    ORCH-->>API: allocation_id, status='requested'
    API-->>U: 201 Created
    OR->>DB: SELECT outbox FOR UPDATE SKIP LOCKED
    OR->>NATS: publish provisioning.requested
    OR->>DB: UPDATE status='published'
    NATS->>W: deliver provisioning.requested
    W->>TMP: StartWorkflow(provisioning-{event_id})
    TMP->>W: ExecuteActivity(HandleProvisionRequested)
    W->>DB: UPDATE allocation status='provisioning'
    RT->>DB: INSERT node_tasks (slice.vm_provision)
    N->>API: GET /tasks/wait (mTLS)
    API->>DB: claim queued task → dispatched
    API-->>N: task_id, signed params
    N->>N: execute provision phases
    N->>API: POST /tasks/{id}/result
    API->>DB: UPDATE node_tasks status='completed'
    RT->>DB: poll node_tasks (250ms) → completed
    W->>DB: UPDATE allocation status='active'<br/>(active_at = clock_timestamp())
    W->>DB: INSERT outbox: provisioning.active
    OR->>NATS: publish provisioning.active
    NATS->>BW: start accruing
    NATS->>NR: WS notify tenant
```
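Step 8 of the sequence (`rankSlicePlacementCandidates`) is described only as "NUMA-fit, best-fit". A minimal sketch of what such a ranking could look like, assuming a candidate carries a free-slot count and a NUMA-fit flag; the real scoring lives in the orchestrator and may differ:

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical placement candidate for a slice request.
type candidate struct {
	NodeID    string
	FreeSlots int  // free same-node slots
	NUMAFit   bool // all requested slots fit on one NUMA node
}

// rankCandidates drops nodes that cannot hold the request, then sorts
// NUMA-fitting candidates first and, within each group, prefers the
// tightest fit (fewest leftover slots) to reduce fragmentation.
func rankCandidates(cands []candidate, want int) []candidate {
	fit := make([]candidate, 0, len(cands))
	for _, c := range cands {
		if c.FreeSlots >= want {
			fit = append(fit, c)
		}
	}
	sort.SliceStable(fit, func(i, j int) bool {
		if fit[i].NUMAFit != fit[j].NUMAFit {
			return fit[i].NUMAFit // NUMA-fit candidates first
		}
		return fit[i].FreeSlots < fit[j].FreeSlots // then best fit
	})
	return fit
}

func main() {
	ranked := rankCandidates([]candidate{
		{"node-a", 8, false},
		{"node-b", 2, true},
		{"node-c", 4, true},
	}, 2)
	fmt.Println(ranked[0].NodeID) // node-b: NUMA-fit and tightest
}
```

The orchestrator then takes the ranked list into `lockAvailableSliceSlots`, so a candidate that loses the `FOR UPDATE SKIP LOCKED` race is simply skipped in favor of the next one.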
## Release sequence

```mermaid
sequenceDiagram
    autonumber
    participant U as User or billing or admin
    participant API as cmd/api
    participant DB as Postgres
    participant NATS as NATS
    participant W as provisioning-worker
    participant N as cmd/node-agent
    U->>API: POST /allocations/:id/release<br/>(or auto: depleted balance)
    API->>DB: UPDATE allocation status='releasing'<br/>+ outbox: provisioning.releasing.requested
    NATS-->>W: deliver releasing.requested
    W->>N: dispatch slice.vm_release / allocation.deprovision_user
    N->>N: graceful shutdown → destroy → undefine → wipe leases
    N-->>W: result {released, hard_stopped, wiped, leases_released}
    alt all cleanup steps succeeded
        W->>DB: UPDATE allocation status='released'
        W->>DB: INSERT outbox: provisioning.releasing.completed
    else cleanup failed after retries
        W->>DB: UPDATE allocation status='release_failed'
        W->>DB: INSERT outbox: provisioning.release_failed
        Note over W: billing stops on release_failed
    end
```
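The worker's decision at the end of the sequence can be sketched as a pure mapping from the node-agent's cleanup result to the next status. Field and function names here are illustrative, not the worker's actual types:

```go
package main

import "fmt"

// cleanupResult is a hypothetical shape of the node-agent's release
// report (the diagram shows released, hard_stopped, wiped, leases_released).
type cleanupResult struct {
	Released       bool // VM destroyed and undefined
	LeasesReleased bool // DHCP/IP leases wiped
	HardStopped    bool // graceful shutdown timed out; hard destroy used
}

// nextStatus maps a cleanup attempt to the allocation's next status.
// A hard stop still counts as released as long as cleanup is proven
// complete; only exhausted retries land in release_failed.
func nextStatus(r cleanupResult, retriesExhausted bool) string {
	if r.Released && r.LeasesReleased {
		return "released"
	}
	if retriesExhausted {
		return "release_failed" // billing stops here too
	}
	return "releasing" // stay put; worker retries cleanup
}

func main() {
	fmt.Println(nextStatus(cleanupResult{Released: true, LeasesReleased: true, HardStopped: true}, false))
	fmt.Println(nextStatus(cleanupResult{}, true))
}
```

Keeping this mapping total (every result lands somewhere) is what guarantees the FSM never strands an allocation in `releasing` without either a retry or a terminal state.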
## Failure → state mapping

| Failure | Lands in | Recovery |
|---|---|---|
| SKU not found / wrong shape / count not allowed | request rejected with `sku_unavailable` | User picks a different SKU |
| No same-node slots in region | request rejected with `sku_unavailable` | Add capacity / try another region |
| Lease conflict on node-agent | `failed` | Workflow retries; lease reconciler clears expired leases |
| Image download / SHA mismatch | `failed` | Operator checks image catalog |
| VFIO bind required but host wrong | `failed` | Operator runs host bootstrap + reboot |
| Slice VM SSH readiness times out | `failed` | Inspect cloud-init log on boot disk |
| Guest readiness marker never appears | `failed` | Switch to preinstalled-driver image |
| Release SSH unreachable, hard-destroy succeeds | `released` with `hard_stopped=true` | None |
| Release retries exhausted | `release_failed` | Admin `POST /api/v1/admin/allocations/{id}/force-release` |
## Concurrency model

| Step | Mechanism | Guarantee |
|---|---|---|
| Place baremetal | `SELECT … FOR UPDATE SKIP LOCKED` on `nodes` | At most one allocation per node |
| Place slice | `SELECT … FOR UPDATE SKIP LOCKED` on candidate slot rows | At most one allocation per slot |
| Reserve slot atomically | `UPDATE` inside the same tx as the allocation insert + outbox row | Never published-but-uncommitted |
| Outbox publish | `SELECT … FOR UPDATE SKIP LOCKED` on `outbox_events` | At-least-once with dedupe upstream |
| Task claim | CTE `UPDATE node_tasks … WHERE status='queued' RETURNING` | One node-agent claims; identity-bound |
| Temporal workflow id | `provisioning-{event_id}` | Idempotent; replay-safe |
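The "never published-but-uncommitted" row is the transactional outbox pattern: the slot reservation and the outbox row ride the same transaction, so a crash before `COMMIT` leaves nothing for the relay to publish. A minimal sketch, assuming hypothetical table names and a simplified `execer` interface in place of `*sql.Tx` (the real method takes a context and returns a `Result`):

```go
package main

import "fmt"

// execer is a simplified stand-in for *sql.Tx so the flow can be
// exercised without a database.
type execer interface {
	Exec(query string, args ...any) error
}

// reserveSlots marks the previously locked slots reserved and writes
// the outbox row inside the caller's transaction. The outbox-relay
// only ever reads committed rows, so an event can never be published
// for an allocation that was rolled back.
func reserveSlots(tx execer, allocID string, slotIDs []string) error {
	if err := tx.Exec(
		`UPDATE gpu_slice_slots SET status='reserved', allocation_id=$1 WHERE slot_id = ANY($2)`,
		allocID, slotIDs,
	); err != nil {
		return err
	}
	return tx.Exec(
		`INSERT INTO outbox_events (topic, payload) VALUES ('provisioning.requested', $1)`,
		allocID,
	)
}

// fakeTx records statements, standing in for a real transaction.
type fakeTx struct{ queries []string }

func (f *fakeTx) Exec(q string, _ ...any) error {
	f.queries = append(f.queries, q)
	return nil
}

func main() {
	tx := &fakeTx{}
	_ = reserveSlots(tx, "alloc-1", []string{"s1", "s2"})
	fmt.Println(len(tx.queries)) // 2: slot update + outbox insert, one tx
}
```

The same shape covers the baremetal branch; only the claim statement differs.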
## Timing rule (from RCA)

Any worker transaction that waits for external work must not use `now()` for terminal-state timestamps: in Postgres, `now()` returns the transaction start time, so a long wait silently backdates the stamp. Use `clock_timestamp()` for `allocation.active_at`, `released_at`, failure timestamps, and any outbox `occurred_at` written after a long wait.

Source: RCA 2026-03-provisioning-workflow-recovery-gaps.
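A sketch of what the rule looks like in a worker query; the table and column names are illustrative, but the `now()` vs `clock_timestamp()` distinction is standard Postgres behavior:

```go
package main

import (
	"fmt"
	"strings"
)

// markActiveSQL stamps active_at at statement execution time.
// now() would return the transaction start time, which for a worker
// that waited minutes for a node-agent is minutes in the past.
const markActiveSQL = `
UPDATE allocations
SET status = 'active',
    active_at = clock_timestamp()
WHERE id = $1 AND status = 'provisioning'`

// usesStatementClock reports whether a query takes its timestamp at
// statement time rather than transaction start.
func usesStatementClock(q string) bool {
	return strings.Contains(q, "clock_timestamp()")
}

func main() {
	fmt.Println(usesStatementClock(markActiveSQL)) // true
}
```

A check like `usesStatementClock` could even back a lint in CI over worker queries that write terminal-state timestamps.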
## Where to look next

- Outbox & event flow — how `provisioning.*` events propagate
- GPU slice as-built — slice-specific phases inside `provisioning`
- Billing & ledger — what `active` means for the meter
- Allocation timeline UX (source) — the read model