# Allocation lifecycle

Status: Implemented Contract

Source: `packages/services/provisioning/orchestrator/service.go` · `cmd/provisioning-worker/` · `doc/architecture/State_Machines.md` · `doc/api/asyncapi.draft.yaml`

The allocation FSM is the central state machine of the platform. The same machine drives both the `baremetal` and `gpu_slice` capacity shapes.
## State diagram

```mermaid
stateDiagram-v2
    [*] --> requested: API POST /allocations<br/>(orchestrator reserves slots/node + outbox)
    requested --> provisioning: worker picks up<br/>provisioning.requested
    provisioning --> active: node-agent reports ready
    provisioning --> failed: image / vfio / virt-install /<br/>readiness / lease conflict
    active --> releasing: user release OR<br/>billing force-release OR<br/>admin force-release
    releasing --> released: cleanup proof complete
    releasing --> release_failed: cleanup retries exhausted
    release_failed --> releasing: user retry OR<br/>admin force-release
    failed --> [*]
    released --> [*]
```
## Status semantics

| Status | Meaning | Who can transition out |
|---|---|---|
| `requested` | Slots/node reserved in DB; outbox written | Worker (on `provisioning.requested`) |
| `provisioning` | Node-agent executing typed tasks | Worker (on terminal task result) |
| `active` | Resources usable by tenant; billing accruing | User / billing / admin |
| `releasing` | Cleanup running | Worker |
| `released` | Cleanup proven complete | (terminal) |
| `failed` | Provisioning gave up | (terminal) |
| `release_failed` | Cleanup exhausted retries; billing stopped | User retry / admin force-release |
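The diagram and table above can be sketched as an allowed-transitions map. This is an illustrative sketch, not the orchestrator's actual API: `allocStatus`, `allowed`, and `canTransition` are hypothetical names.

```go
package main

import "fmt"

// allocStatus mirrors the statuses in the table above.
type allocStatus string

const (
	statusRequested     allocStatus = "requested"
	statusProvisioning  allocStatus = "provisioning"
	statusActive        allocStatus = "active"
	statusReleasing     allocStatus = "releasing"
	statusReleased      allocStatus = "released"
	statusFailed        allocStatus = "failed"
	statusReleaseFailed allocStatus = "release_failed"
)

// allowed maps each status to its legal next statuses.
// released and failed are terminal: no outgoing edges.
var allowed = map[allocStatus][]allocStatus{
	statusRequested:     {statusProvisioning},
	statusProvisioning:  {statusActive, statusFailed},
	statusActive:        {statusReleasing},
	statusReleasing:     {statusReleased, statusReleaseFailed},
	statusReleaseFailed: {statusReleasing}, // user retry / admin force-release
}

// canTransition reports whether the FSM permits from -> to.
func canTransition(from, to allocStatus) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(statusActive, statusReleasing))   // true
	fmt.Println(canTransition(statusReleased, statusReleasing)) // false: terminal
}
```

Guarding every `UPDATE allocations SET status=…` with a check like this keeps illegal edges (e.g. `released → releasing`) out of the database even if two workers race.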
## Provisioning sequence

```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant API as cmd/api
    participant ORCH as orchestrator
    participant DB as Postgres
    participant OR as outbox-relay
    participant NATS as NATS
    participant W as provisioning-worker
    participant TMP as Temporal
    participant RT as runtime_agent.go
    participant N as cmd/node-agent
    participant BW as billing-worker
    participant NR as notification-relay
    U->>API: POST /allocations<br/>{sku, gpus, region, ssh_key_ids}
    API->>ORCH: CreateRequested(...)
    ORCH->>DB: BEGIN
    ORCH->>DB: SELECT sku_catalog<br/>(shape, allowed_counts)
    alt baremetal
        ORCH->>DB: SELECT nodes FOR UPDATE SKIP LOCKED<br/>filter active + not-claimed
        ORCH->>DB: INSERT allocation, node_exclusive claim
    else gpu_slice
        ORCH->>DB: listSlicePlacementCandidates (filter + os_images)
        ORCH->>ORCH: rankSlicePlacementCandidates (NUMA-fit, best-fit)
        ORCH->>DB: lockAvailableSliceSlots FOR UPDATE SKIP LOCKED
        ORCH->>DB: UPDATE slots SET status='reserved'
        ORCH->>DB: INSERT allocation + N slot claims
    end
    ORCH->>DB: INSERT outbox_events: provisioning.requested
    ORCH->>DB: COMMIT
    ORCH-->>API: allocation_id, status='requested'
    API-->>U: 201 Created
    OR->>DB: SELECT outbox FOR UPDATE SKIP LOCKED
    OR->>NATS: publish provisioning.requested
    OR->>DB: UPDATE status='published'
    NATS->>W: deliver provisioning.requested
    W->>TMP: StartWorkflow(provisioning-{event_id})
    TMP->>W: ExecuteActivity(HandleProvisionRequested)
    W->>DB: UPDATE allocation status='provisioning'
    RT->>DB: INSERT node_tasks (slice.vm_provision)
    N->>API: GET /tasks/wait (mTLS)
    API->>DB: claim queued task → dispatched
    API-->>N: task_id, signed params
    N->>N: execute provision phases
    N->>API: POST /tasks/{id}/result
    API->>DB: UPDATE node_tasks status='completed'
    RT->>DB: poll node_tasks (250ms) → completed
    W->>DB: UPDATE allocation status='active'<br/>(active_at = clock_timestamp())
    W->>DB: INSERT outbox: provisioning.active
    OR->>NATS: publish provisioning.active
    NATS->>BW: start accruing
    NATS->>NR: WS notify tenant
```
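Step 8 of the sequence (`rankSlicePlacementCandidates`) is described only as "NUMA-fit, best-fit". A minimal sketch of what such a ranking could look like, assuming a candidate carries a free-slot count and a NUMA-fit flag; the real scoring lives in the orchestrator and may differ:

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical placement candidate for a slice request.
type candidate struct {
	NodeID    string
	FreeSlots int  // free same-node slots
	NUMAFit   bool // all requested slots fit on one NUMA node
}

// rankCandidates drops nodes that cannot hold the request, then sorts
// NUMA-fitting candidates first and, within each group, prefers the
// tightest fit (fewest leftover slots) to reduce fragmentation.
func rankCandidates(cands []candidate, want int) []candidate {
	fit := make([]candidate, 0, len(cands))
	for _, c := range cands {
		if c.FreeSlots >= want {
			fit = append(fit, c)
		}
	}
	sort.SliceStable(fit, func(i, j int) bool {
		if fit[i].NUMAFit != fit[j].NUMAFit {
			return fit[i].NUMAFit // NUMA-fit candidates first
		}
		return fit[i].FreeSlots < fit[j].FreeSlots // then best fit
	})
	return fit
}

func main() {
	ranked := rankCandidates([]candidate{
		{"node-a", 8, false},
		{"node-b", 2, true},
		{"node-c", 4, true},
	}, 2)
	fmt.Println(ranked[0].NodeID) // node-b: NUMA-fit and tightest
}
```

The orchestrator then takes the ranked list into `lockAvailableSliceSlots`, so a candidate that loses the `FOR UPDATE SKIP LOCKED` race is simply skipped in favor of the next one.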
## Release sequence

```mermaid
sequenceDiagram
    autonumber
    participant U as User or billing or admin
    participant API as cmd/api
    participant DB as Postgres
    participant NATS as NATS
    participant W as provisioning-worker
    participant N as cmd/node-agent
    U->>API: POST /allocations/:id/release<br/>(or auto: depleted balance)
    API->>DB: UPDATE allocation status='releasing'<br/>+ outbox: provisioning.releasing.requested
    NATS-->>W: deliver releasing.requested
    W->>N: dispatch slice.vm_release / allocation.deprovision_user
    N->>N: graceful shutdown → destroy → undefine → wipe leases
    N-->>W: result {released, hard_stopped, wiped, leases_released}
    alt all cleanup steps succeeded
        W->>DB: UPDATE allocation status='released'
        W->>DB: INSERT outbox: provisioning.releasing.completed
    else cleanup failed after retries
        W->>DB: UPDATE allocation status='release_failed'
        W->>DB: INSERT outbox: provisioning.release_failed
        Note over W: billing stops on release_failed
    end
```
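The worker's decision at the end of the sequence can be sketched as a pure mapping from the node-agent's cleanup result to the next status. Field and function names here are illustrative, not the worker's actual types:

```go
package main

import "fmt"

// cleanupResult is a hypothetical shape of the node-agent's release
// report (the diagram shows released, hard_stopped, wiped, leases_released).
type cleanupResult struct {
	Released       bool // VM destroyed and undefined
	LeasesReleased bool // DHCP/IP leases wiped
	HardStopped    bool // graceful shutdown timed out; hard destroy used
}

// nextStatus maps a cleanup attempt to the allocation's next status.
// A hard stop still counts as released as long as cleanup is proven
// complete; only exhausted retries land in release_failed.
func nextStatus(r cleanupResult, retriesExhausted bool) string {
	if r.Released && r.LeasesReleased {
		return "released"
	}
	if retriesExhausted {
		return "release_failed" // billing stops here too
	}
	return "releasing" // stay put; worker retries cleanup
}

func main() {
	fmt.Println(nextStatus(cleanupResult{Released: true, LeasesReleased: true, HardStopped: true}, false))
	fmt.Println(nextStatus(cleanupResult{}, true))
}
```

Keeping this mapping total (every result lands somewhere) is what guarantees the FSM never strands an allocation in `releasing` without either a retry or a terminal state.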
## Failure → state mapping

| Failure | Lands in | Recovery |
|---|---|---|
| SKU not found / wrong shape / count not allowed | request rejected with `sku_unavailable` | User picks a different SKU |
| No same-node slots in region | request rejected with `sku_unavailable` | Add capacity / try another region |
| Lease conflict on node-agent | `failed` | Workflow retries; lease reconciler clears expired leases |
| Image download / SHA mismatch | `failed` | Operator checks image catalog |
| VFIO bind required but host wrong | `failed` | Operator runs host bootstrap + reboot |
| Slice VM SSH readiness times out | `failed` | Inspect cloud-init log on boot disk |
| Guest readiness marker never appears | `failed` | Switch to preinstalled-driver image |
| Release SSH unreachable, hard-destroy succeeds | `released` with `hard_stopped=true` | None |
| Release retries exhausted | `release_failed` | Admin `POST /api/v1/admin/allocations/{id}/force-release` |
## Concurrency model

| Step | Mechanism | Guarantee |
|---|---|---|
| Place baremetal | `SELECT … FOR UPDATE SKIP LOCKED` on `nodes` | At most one allocation per node |
| Place slice | `SELECT … FOR UPDATE SKIP LOCKED` on candidate slot rows | At most one allocation per slot |
| Reserve slot atomically | `UPDATE` inside the same tx as the allocation insert + outbox row | Never published-but-uncommitted |
| Outbox publish | `SELECT … FOR UPDATE SKIP LOCKED` on `outbox_events` | At-least-once with dedupe upstream |
| Task claim | CTE `UPDATE node_tasks … WHERE status='queued' RETURNING` | One node-agent claims; identity-bound |
| Temporal workflow id | `provisioning-{event_id}` | Idempotent; replay-safe |
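The "never published-but-uncommitted" row is the transactional outbox pattern: the slot reservation and the outbox row ride the same transaction, so a crash before `COMMIT` leaves nothing for the relay to publish. A minimal sketch, assuming hypothetical table names and a simplified `execer` interface in place of `*sql.Tx` (the real method takes a context and returns a `Result`):

```go
package main

import "fmt"

// execer is a simplified stand-in for *sql.Tx so the flow can be
// exercised without a database.
type execer interface {
	Exec(query string, args ...any) error
}

// reserveSlots marks the previously locked slots reserved and writes
// the outbox row inside the caller's transaction. The outbox-relay
// only ever reads committed rows, so an event can never be published
// for an allocation that was rolled back.
func reserveSlots(tx execer, allocID string, slotIDs []string) error {
	if err := tx.Exec(
		`UPDATE gpu_slice_slots SET status='reserved', allocation_id=$1 WHERE slot_id = ANY($2)`,
		allocID, slotIDs,
	); err != nil {
		return err
	}
	return tx.Exec(
		`INSERT INTO outbox_events (topic, payload) VALUES ('provisioning.requested', $1)`,
		allocID,
	)
}

// fakeTx records statements, standing in for a real transaction.
type fakeTx struct{ queries []string }

func (f *fakeTx) Exec(q string, _ ...any) error {
	f.queries = append(f.queries, q)
	return nil
}

func main() {
	tx := &fakeTx{}
	_ = reserveSlots(tx, "alloc-1", []string{"s1", "s2"})
	fmt.Println(len(tx.queries)) // 2: slot update + outbox insert, one tx
}
```

The same shape covers the baremetal branch; only the claim statement differs.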
## Timing rule (from RCA)

Any worker transaction that waits for external work must not use `now()` for terminal-state timestamps: in Postgres, `now()` returns the transaction start time, so a long wait silently backdates the stamp. Use `clock_timestamp()` for `allocation.active_at`, `released_at`, failure timestamps, and any outbox `occurred_at` written after a long wait.

Source: RCA 2026-03-provisioning-workflow-recovery-gaps.
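A sketch of what the rule looks like in a worker query; the table and column names are illustrative, but the `now()` vs `clock_timestamp()` distinction is standard Postgres behavior:

```go
package main

import (
	"fmt"
	"strings"
)

// markActiveSQL stamps active_at at statement execution time.
// now() would return the transaction start time, which for a worker
// that waited minutes for a node-agent is minutes in the past.
const markActiveSQL = `
UPDATE allocations
SET status = 'active',
    active_at = clock_timestamp()
WHERE id = $1 AND status = 'provisioning'`

// usesStatementClock reports whether a query takes its timestamp at
// statement time rather than transaction start.
func usesStatementClock(q string) bool {
	return strings.Contains(q, "clock_timestamp()")
}

func main() {
	fmt.Println(usesStatementClock(markActiveSQL)) // true
}
```

A check like `usesStatementClock` could even back a lint in CI over worker queries that write terminal-state timestamps.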
## Where to look next

- Outbox & event flow — how `provisioning.*` events propagate
- GPU slice as-built — slice-specific phases inside `provisioning`
- Billing & ledger — what `active` means for the meter
- Allocation timeline UX (source) — the read model