Allocation Provisioning Task Timeline v1
Purpose
Allocation provisioning now includes asynchronous host work that can take several minutes, especially for GPU VM slices. The user-facing allocation state alone is not enough to explain progress or delays. Allocation detail should expose an app-runtime-style task timeline backed by the same durable task data used by provisioning workers and node agents.
This is the allocation counterpart to app instance operation timelines.
Problem Observed
For GPU slice allocation f73c9691-cbd1-4bf4-bb1f-af28a23102e2 on April 18,
2026:
- allocation create and placement completed quickly;
- slice.vm_provision was queued at 2026-04-18T01:31:58Z;
- node-agent claimed it about 1 second later;
- node-agent completed it at 2026-04-18T01:36:38Z;
- real provision duration was about 279 seconds.
The UI showed a generic long-running message, but the platform had the more
specific task state in node_tasks.
The same investigation exposed a timestamp bug: the provisioning worker keeps a
DB transaction open while waiting for node-agent task completion, so now()
inside that transaction records transaction-start time, not completion time.
Long-running worker transactions must use clock_timestamp() for terminal
state timestamps and outbox event occurred_at.
Timeline Model
Allocation detail should present two layers:
- allocation lifecycle state;
- node/runtime task timeline.
Example:
requested
-> placement_reserved
-> provisioning_started
-> node_task_queued(slice.vm_provision)
-> node_task_dispatched
-> guest_booting
-> ssh_ready
-> active
The first implementation can derive most of this from existing rows:
| Timeline item | Source |
|---|---|
| requested | allocations.created_at |
| placement_reserved | allocation_resource_claims.status and slot/node claim rows |
| provisioning_started | allocations.provisioning_started_at |
| node_task_queued | node_tasks.created_at / issued_at |
| node_task_dispatched | node_tasks.dispatched_at |
| node_task_completed | node_tasks.completed_at |
| active | allocations.active_at |
| failed | allocations.failure_reason and failed node task error |
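As a rough illustration of the derivation in the table, the sketch below builds timeline items from rows that have already been fetched as dictionaries. The function name `derive_timeline`, the item keys `name` and `at`, and the `kind` column holding the task name are assumptions for illustration. It leaves out `placement_reserved` and `failed`, since the table gives no single timestamp column for them.

```python
from typing import Any


def derive_timeline(
    allocation: dict[str, Any], node_tasks: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Build timeline items from rows that were already fetched."""
    items: list[dict[str, Any]] = []

    def add(name: str, at: Any, **extra: Any) -> None:
        # A missing timestamp means the step has not happened yet.
        if at is not None:
            items.append({"name": name, "at": at, **extra})

    add("requested", allocation.get("created_at"))
    add("provisioning_started", allocation.get("provisioning_started_at"))
    for task in node_tasks:
        # "kind" is a hypothetical column holding e.g. "slice.vm_provision".
        label = task.get("kind")
        queued_at = task.get("created_at") or task.get("issued_at")
        add("node_task_queued", queued_at, task=label)
        add("node_task_dispatched", task.get("dispatched_at"), task=label)
        add("node_task_completed", task.get("completed_at"), task=label)
    add("active", allocation.get("active_at"))

    # Same-format ISO-8601 UTC strings sort chronologically; sort() is stable,
    # so items with equal timestamps keep the insertion order above.
    items.sort(key=lambda item: item["at"])
    return items
```

Items whose source timestamp is still NULL are simply absent, so a partially provisioned allocation yields a shorter timeline rather than placeholder entries.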
For richer progress, node-agent should include a redacted progress array in
node_tasks.output or update a future node_task_events table. Example phases:
- image prepared or reused;
- cloud-init seed generated;
- libvirt domain defined;
- VM started;
- guest agent or SSH reachable;
- driver readiness probe completed.
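One possible shape for the progress array is sketched below. The helper `append_progress`, the entry keys `seq`, `phase`, and `at`, and the phase identifiers used in the usage note are all hypothetical. Each entry carries only a phase name and a timestamp, which is one way to keep the array redacted by construction.

```python
from datetime import datetime, timezone
from typing import Any


def append_progress(output: dict[str, Any], phase: str, at: datetime) -> dict[str, Any]:
    """Return a copy of a task output document with one more progress entry."""
    existing = output.get("progress", [])
    entry = {
        "seq": len(existing) + 1,  # gives the UI a stable phase ordering
        "phase": phase,            # short identifier only, no free-form detail
        "at": at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    # Build a new document rather than mutating the caller's copy.
    return {**output, "progress": [*existing, entry]}
```

A node-agent could call this once per phase, for example with `"image_prepared"` and then `"vm_started"`, and write the returned document back to node_tasks.output.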
API Direction
Add a read-only project-scoped surface:
Admin/operator equivalent:
Response shape should be generic enough for bare metal and slices:
{
  "allocation_id": "uuid",
  "status": "provisioning",
  "items": [
    {
      "kind": "allocation_state",
      "name": "provisioning_started",
      "status": "succeeded",
      "started_at": "2026-04-18T01:31:58Z"
    },
    {
      "kind": "node_task",
      "name": "slice.vm_provision",
      "status": "running",
      "task_id": "uuid",
      "started_at": "2026-04-18T01:31:59Z",
      "completed_at": null,
      "duration_seconds": 180,
      "summary": "Creating GPU slice VM"
    }
  ]
}
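The example reports a duration_seconds value for a task that is still running, which suggests the field means elapsed time so far when completed_at is null. A minimal sketch under that assumption follows; `parse_ts` and `node_task_item` are hypothetical helper names, and the current time is passed in so the result is reproducible.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def parse_ts(value: str) -> datetime:
    """Parse the second-precision UTC format used in the example response."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def node_task_item(
    name: str,
    status: str,
    task_id: str,
    started_at: str,
    completed_at: Optional[str],
    summary: str,
    now: datetime,
) -> dict[str, Any]:
    """Build one node_task timeline item in the response shape above."""
    # Running tasks are measured up to "now"; finished tasks up to completion.
    end = parse_ts(completed_at) if completed_at else now
    return {
        "kind": "node_task",
        "name": name,
        "status": status,
        "task_id": task_id,
        "started_at": started_at,
        "completed_at": completed_at,
        "duration_seconds": int((end - parse_ts(started_at)).total_seconds()),
        "summary": summary,
    }
```

With started_at 2026-04-18T01:31:59Z and a current time of 01:34:59Z, this yields the 180 seconds shown in the example.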
Contract rules:
- update doc/api/openapi.draft.yaml before implementation;
- project users can read only their own/project allocations;
- admin endpoint requires explicit admin authorization; read-only access does not need an audit record unless platform policy later requires read auditing;
- task params and output must be sanitized before returning to the UI;
- never return SSH keys, access secrets, task signatures, cloud-init content, or provider credentials.
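One way to satisfy the sanitization rules is an allow-list: only keys known to be safe are copied into the response, so a newly added secret field is dropped by default. The sketch below assumes this approach. The constant names and the function `sanitize` are hypothetical, and the allow-lists contain only the two keys this document mentions (allocation_id in params, progress in output).

```python
from typing import Any

# Hypothetical allow-lists; extend only after reviewing each key.
SAFE_PARAM_KEYS = frozenset({"allocation_id"})
SAFE_OUTPUT_KEYS = frozenset({"progress"})


def sanitize(document: dict[str, Any], allowed: frozenset) -> dict[str, Any]:
    """Keep only allow-listed top-level keys of a task params/output document."""
    return {key: value for key, value in document.items() if key in allowed}
```

This filters top-level keys only. Values under an allowed key, such as progress entries, still need to be written without secrets in the first place.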
Implementation Notes
Initial backend can query node_tasks by params->>'allocation_id' and join
the allocation row. This is acceptable for v1 because allocation-scoped node
tasks already carry the allocation id.
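The v1 lookup might look like the sketch below, written for a PostgreSQL driver that uses `%s` placeholders (psycopg does). The join column `allocations.id`, the selected column list, and the function name `node_tasks_query` are assumptions. Only node_tasks.params, output, and the timestamp columns are named in this document.

```python
# Assumes node_tasks.params is a json/jsonb column and allocations.id is the
# allocation primary key (an assumed column name).
NODE_TASKS_FOR_ALLOCATION = """
SELECT t.created_at, t.dispatched_at, t.completed_at, t.output
FROM node_tasks AS t
JOIN allocations AS a
  ON a.id::text = t.params->>'allocation_id'
WHERE t.params->>'allocation_id' = %s
ORDER BY t.created_at
"""


def node_tasks_query(allocation_id: str) -> tuple[str, tuple[str, ...]]:
    """Return the SQL text and bound parameters for a DB-API cursor.execute()."""
    # The id is always passed as a bound parameter, never interpolated.
    return NODE_TASKS_FOR_ALLOCATION, (allocation_id,)
```

A caller would pass the result straight through, as in `cursor.execute(*node_tasks_query(allocation_id))`. If the lookup becomes hot, an expression index on `params->>'allocation_id'` is one option to evaluate.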
Recommended later schema:
This lets node-agent stream progress without mutating a large node_tasks.output
document and gives the UI stable phase ordering.
Timing Rule
Any worker transaction that waits for external work must not use now() for
terminal state timestamps after the wait. Use clock_timestamp() for:
- allocation active_at, released_at, and failure timestamps;
- node/resource slot terminal updated_at;
- outbox occurred_at created after a long wait.
now() is still acceptable for short transaction bookkeeping where
transaction-start time is intended.
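The difference between the two functions can be modelled without a database. The sketch below is a toy simulation, not worker code: `FakeClock` and `Transaction` are hypothetical classes that mimic PostgreSQL's behaviour, where now() is frozen at transaction start and clock_timestamp() reads the actual current time on every call.

```python
class FakeClock:
    """Manually advanced clock, in seconds, so the example is deterministic."""

    def __init__(self, start: float = 0.0) -> None:
        self._t = start

    def advance(self, seconds: float) -> None:
        self._t += seconds

    def read(self) -> float:
        return self._t


class Transaction:
    """Models the two PostgreSQL time functions inside one open transaction."""

    def __init__(self, clock: FakeClock) -> None:
        self._clock = clock
        self._started_at = clock.read()  # captured once, at BEGIN

    def now(self) -> float:
        # Transaction-start time: identical for every call in this transaction.
        return self._started_at

    def clock_timestamp(self) -> float:
        # Actual current time: keeps moving while the transaction stays open.
        return self._clock.read()


clock = FakeClock()
txn = Transaction(clock)
clock.advance(279)  # worker waits for the node-agent task to finish
skew = txn.clock_timestamp() - txn.now()
```

Here `skew` equals the full wait, matching the roughly 279 seconds by which a terminal timestamp written with now() would have been too early in the observed case.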