
# Allocation lifecycle

**Implemented Contract**

Source: `packages/services/provisioning/orchestrator/service.go` · `cmd/provisioning-worker/` · `doc/architecture/State_Machines.md` · `doc/api/asyncapi.draft.yaml`

The allocation FSM is the platform's central state machine. The same machine drives both `baremetal` and `gpu_slice` capacity shapes.

## State diagram

```mermaid
stateDiagram-v2
    [*] --> requested: API POST /allocations<br/>(orchestrator reserves slots/node + outbox)
    requested --> provisioning: worker picks up<br/>provisioning.requested
    provisioning --> active: node-agent reports ready
    provisioning --> failed: image / vfio / virt-install /<br/>readiness / lease conflict
    active --> releasing: user release OR<br/>billing force-release OR<br/>admin force-release
    releasing --> released: cleanup proof complete
    releasing --> release_failed: cleanup retries exhausted
    release_failed --> releasing: user retry OR<br/>admin force-release
    failed --> [*]
    released --> [*]
```

## Status semantics

| Status | Meaning | Who can transition out |
|---|---|---|
| `requested` | Slots/node reserved in DB; outbox written | Worker (on `provisioning.requested`) |
| `provisioning` | Node-agent executing typed tasks | Worker (on terminal task result) |
| `active` | Resources usable by tenant; billing accruing | User / billing / admin |
| `releasing` | Cleanup running | Worker |
| `released` | Cleanup proven complete | — (terminal) |
| `failed` | Provisioning gave up | — (terminal) |
| `release_failed` | Cleanup exhausted retries; billing stopped | User retry / admin force-release |

## Provisioning sequence

```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant API as cmd/api
    participant ORCH as orchestrator
    participant DB as Postgres
    participant OR as outbox-relay
    participant NATS as NATS
    participant W as provisioning-worker
    participant TMP as Temporal
    participant RT as runtime_agent.go
    participant N as cmd/node-agent
    participant BW as billing-worker
    participant NR as notification-relay

    U->>API: POST /allocations<br/>{sku, gpus, region, ssh_key_ids}
    API->>ORCH: CreateRequested(...)
    ORCH->>DB: BEGIN
    ORCH->>DB: SELECT sku_catalog<br/>(shape, allowed_counts)
    alt baremetal
        ORCH->>DB: SELECT nodes FOR UPDATE SKIP LOCKED<br/>filter active + not-claimed
        ORCH->>DB: INSERT allocation, node_exclusive claim
    else gpu_slice
        ORCH->>DB: listSlicePlacementCandidates (filter + os_images)
        ORCH->>ORCH: rankSlicePlacementCandidates (NUMA-fit, best-fit)
        ORCH->>DB: lockAvailableSliceSlots FOR UPDATE SKIP LOCKED
        ORCH->>DB: UPDATE slots SET status='reserved'
        ORCH->>DB: INSERT allocation + N slot claims
    end
    ORCH->>DB: INSERT outbox_events: provisioning.requested
    ORCH->>DB: COMMIT
    ORCH-->>API: allocation_id, status='requested'
    API-->>U: 201 Created

    OR->>DB: SELECT outbox FOR UPDATE SKIP LOCKED
    OR->>NATS: publish provisioning.requested
    OR->>DB: UPDATE status='published'

    NATS->>W: deliver provisioning.requested
    W->>TMP: StartWorkflow(provisioning-{event_id})
    TMP->>W: ExecuteActivity(HandleProvisionRequested)

    W->>DB: UPDATE allocation status='provisioning'
    RT->>DB: INSERT node_tasks (slice.vm_provision)
    N->>API: GET /tasks/wait (mTLS)
    API->>DB: claim queued task → dispatched
    API-->>N: task_id, signed params
    N->>N: execute provision phases
    N->>API: POST /tasks/{id}/result
    API->>DB: UPDATE node_tasks status='completed'

    RT->>DB: poll node_tasks (250ms) → completed
    W->>DB: UPDATE allocation status='active'<br/>(active_at = clock_timestamp())
    W->>DB: INSERT outbox: provisioning.active
    OR->>NATS: publish provisioning.active
    NATS->>BW: start accruing
    NATS->>NR: WS notify tenant
```

## Release sequence

```mermaid
sequenceDiagram
    autonumber
    participant U as User or billing or admin
    participant API as cmd/api
    participant DB as Postgres
    participant NATS as NATS
    participant W as provisioning-worker
    participant N as cmd/node-agent

    U->>API: POST /allocations/:id/release<br/>(or auto: depleted balance)
    API->>DB: UPDATE allocation status='releasing'<br/>+ outbox: provisioning.releasing.requested
    NATS-->>W: deliver releasing.requested
    W->>N: dispatch slice.vm_release / allocation.deprovision_user
    N->>N: graceful shutdown → destroy → undefine → wipe leases
    N-->>W: result {released, hard_stopped, wiped, leases_released}

    alt all cleanup steps succeeded
        W->>DB: UPDATE allocation status='released'
        W->>DB: INSERT outbox: provisioning.releasing.completed
    else cleanup failed after retries
        W->>DB: UPDATE allocation status='release_failed'
        W->>DB: INSERT outbox: provisioning.release_failed
        Note over W: billing stops on release_failed
    end
```

## Failure → state mapping

| Failure | Lands in | Recovery |
|---|---|---|
| SKU not found / wrong shape / count not allowed | `requested` rejected with `sku_unavailable` | User picks a different SKU |
| No same-node slots in region | `requested` rejected with `sku_unavailable` | Add capacity / try another region |
| Lease conflict on node-agent | `failed` | Workflow retries; lease reconciler clears expired |
| Image download / SHA mismatch | `failed` | Operator checks image catalog |
| VFIO bind required but host wrong | `failed` | Operator runs host bootstrap + reboot |
| Slice VM SSH readiness times out | `failed` | Inspect cloud-init log on boot disk |
| Guest readiness marker never appears | `failed` | Switch to preinstalled-driver image |
| Release SSH unreachable, hard-destroy succeeds | `released`, `hard_stopped=true` | None |
| Release retries exhausted | `release_failed` | Admin `POST /api/v1/admin/allocations/{id}/force-release` |

## Concurrency model

| Step | Mechanism | Guarantee |
|---|---|---|
| Place baremetal | `SELECT … FOR UPDATE SKIP LOCKED` on nodes | At most one allocation per node |
| Place slice | `SELECT … FOR UPDATE SKIP LOCKED` on candidate slot rows | At most one allocation per slot |
| Reserve slot | Atomic `UPDATE` inside same tx as allocation insert + outbox row | Never published-but-uncommitted |
| Outbox publish | `SELECT … FOR UPDATE SKIP LOCKED` on `outbox_events` | At-least-once with dedupe upstream |
| Task claim | CTE `UPDATE node_tasks … WHERE status='queued' … RETURNING` | One node-agent claims; identity-bound |
| Temporal workflow id | `provisioning-{event_id}` | Idempotent; replay-safe |

## Timing rule (from RCA)

Any worker transaction that waits for external work must not use `now()` for terminal-state timestamps: in Postgres, `now()` returns the transaction start time, so a transaction that waited would backdate the result. Use `clock_timestamp()` for `allocation.active_at`, `released_at`, failure timestamps, and any outbox `occurred_at` created after a long wait.

Source: RCA 2026-03-provisioning-workflow-recovery-gaps.

## Where to look next