Allocation Provisioning Task Timeline v1

Purpose

Allocation provisioning now includes asynchronous host work that can take several minutes, especially for GPU VM slices. The user-facing allocation state alone is not enough to explain progress or delays. Allocation detail should expose an app-runtime-style task timeline backed by the same durable task data used by provisioning workers and node agents.

This is the allocation counterpart to app instance operation timelines.

Problem Observed

For GPU slice allocation f73c9691-cbd1-4bf4-bb1f-af28a23102e2 on April 18, 2026:

  1. allocation create and placement completed quickly;
  2. slice.vm_provision was queued at 2026-04-18T01:31:58Z;
  3. node-agent claimed it about 1 second later;
  4. node-agent completed it at 2026-04-18T01:36:38Z;
  5. real provision duration was about 279 seconds.
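The durations above fall straight out of the recorded timestamps; a minimal sketch in Python (timestamps copied from the incident, variable names illustrative):

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp like 2026-04-18T01:31:58Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


queued_at = parse_ts("2026-04-18T01:31:58Z")     # slice.vm_provision queued
claimed_at = parse_ts("2026-04-18T01:31:59Z")    # node-agent claim, ~1s later
completed_at = parse_ts("2026-04-18T01:36:38Z")  # node-agent completion

queue_delay = (claimed_at - queued_at).total_seconds()          # 1.0
provision_duration = (completed_at - claimed_at).total_seconds()  # 279.0
```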

The UI showed a generic long-running message, but the platform had the more specific task state in node_tasks.

The same investigation exposed a timestamp bug: the provisioning worker holds a DB transaction open while waiting for node-agent task completion, and inside that transaction now() returns the transaction start time, not the completion time. Long-running worker transactions must use clock_timestamp() for terminal-state timestamps and for outbox event occurred_at.

Timeline Model

Allocation detail should present two layers:

  1. allocation lifecycle state;
  2. node/runtime task timeline.

Example:

requested
  -> placement_reserved
  -> provisioning_started
  -> node_task_queued(slice.vm_provision)
  -> node_task_dispatched
  -> guest_booting
  -> ssh_ready
  -> active

The first implementation can derive most of this from existing rows:

Timeline item          Source
-------------          ------
requested              allocations.created_at
placement_reserved     allocation_resource_claims.status and slot/node claim rows
provisioning_started   allocations.provisioning_started_at
node_task_queued       node_tasks.created_at / issued_at
node_task_dispatched   node_tasks.dispatched_at
node_task_completed    node_tasks.completed_at
active                 allocations.active_at
failed                 allocations.failure_reason and failed node task error
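The derivation in the table above can be a pure function over existing rows; a v1 sketch, assuming rows arrive as plain dicts whose field names follow the table (helper and key names are illustrative):

```python
from typing import Any


def derive_timeline(allocation: dict[str, Any],
                    node_tasks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Build timeline items from existing allocation and node_tasks rows."""
    items: list[dict[str, Any]] = []
    if allocation.get("created_at"):
        items.append({"kind": "allocation_state", "name": "requested",
                      "status": "succeeded", "started_at": allocation["created_at"]})
    if allocation.get("provisioning_started_at"):
        items.append({"kind": "allocation_state", "name": "provisioning_started",
                      "status": "succeeded",
                      "started_at": allocation["provisioning_started_at"]})
    for task in node_tasks:
        status = "succeeded" if task.get("completed_at") else "running"
        items.append({"kind": "node_task", "name": task["type"], "status": status,
                      "task_id": task["id"],
                      "started_at": task.get("dispatched_at") or task["created_at"],
                      "completed_at": task.get("completed_at")})
    if allocation.get("active_at"):
        items.append({"kind": "allocation_state", "name": "active",
                      "status": "succeeded", "started_at": allocation["active_at"]})
    # ISO-8601 UTC strings sort lexicographically, so this is chronological order.
    items.sort(key=lambda i: i["started_at"])
    return items
```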

For richer progress, node-agent should include a redacted progress array in node_tasks.output or update a future node_task_events table. Example phases:

  1. image prepared or reused;
  2. cloud-init seed generated;
  3. libvirt domain defined;
  4. VM started;
  5. guest agent or SSH reachable;
  6. driver readiness probe completed.
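One way node-agent could record these phases is a small append helper that keeps node_tasks.output bounded to known phases and free of secrets; a sketch, assuming output is a JSON document the agent owns (phase identifiers below mirror the list above but are illustrative names):

```python
import json
from datetime import datetime, timezone

# Hypothetical phase identifiers; the real set should be an agreed enum.
ALLOWED_PHASES = [
    "image_prepared", "cloud_init_seeded", "domain_defined",
    "vm_started", "guest_reachable", "driver_ready",
]


def append_progress(output_json: str, phase: str, message: str = "") -> str:
    """Append a redacted progress entry to a node_tasks.output document."""
    if phase not in ALLOWED_PHASES:
        raise ValueError(f"unknown phase: {phase}")
    doc = json.loads(output_json) if output_json else {}
    doc.setdefault("progress", []).append({
        "phase": phase,
        "message": message,  # free text, must already be secret-free
        "occurred_at": datetime.now(timezone.utc).isoformat(),
    })
    return json.dumps(doc)
```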

API Direction

Add a read-only project-scoped surface:

GET /api/v1/allocations/{allocation_id}/timeline

Admin/operator equivalent:

GET /api/v1/admin/allocations/{allocation_id}/timeline

Response shape should be generic enough for bare metal and slices:

{
  "allocation_id": "uuid",
  "status": "provisioning",
  "items": [
    {
      "kind": "allocation_state",
      "name": "provisioning_started",
      "status": "succeeded",
      "started_at": "2026-04-18T01:31:58Z"
    },
    {
      "kind": "node_task",
      "name": "slice.vm_provision",
      "status": "running",
      "task_id": "uuid",
      "started_at": "2026-04-18T01:31:59Z",
      "completed_at": null,
      "duration_seconds": 180,
      "summary": "Creating GPU slice VM"
    }
  ]
}

Contract rules:

  1. update doc/api/openapi.draft.yaml before implementation;
  2. project users can read only their own/project allocations;
  3. the admin endpoint requires explicit admin authorization; audit logging is not needed for read-only access unless platform policy later requires it;
  4. task params and output must be sanitized before returning to the UI;
  5. never return SSH keys, access secrets, task signatures, cloud-init content, or provider credentials.
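Rules 4 and 5 can be enforced with a recursive deny-list pass before serialization; a minimal sketch (key names are illustrative and the real deny-list should come from platform policy, not this hardcoded set):

```python
# Hypothetical deny-list; extend from platform policy.
SENSITIVE_KEYS = {
    "ssh_key", "ssh_private_key", "access_token", "task_signature",
    "cloud_init", "user_data", "provider_credentials", "password",
}


def sanitize(value):
    """Recursively drop sensitive keys from task params/output before returning to the UI."""
    if isinstance(value, dict):
        return {k: sanitize(v) for k, v in value.items() if k not in SENSITIVE_KEYS}
    if isinstance(value, list):
        return [sanitize(v) for v in value]
    return value
```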

Implementation Notes

Initial backend can query node_tasks by params->>'allocation_id' and join the allocation row. This is acceptable for v1 because allocation-scoped node tasks already carry the allocation id.
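The v1 lookup is just a filter on the task params; in SQL that is roughly a `WHERE params->>'allocation_id' = $1` predicate, and the in-memory equivalent is a one-liner. A sketch with rows as dicts (query text and function name are illustrative):

```python
# Hypothetical query text; table and column names follow the notes above.
TIMELINE_TASKS_SQL = """
SELECT t.*
FROM node_tasks t
WHERE t.params->>'allocation_id' = %(allocation_id)s
ORDER BY t.created_at
"""


def tasks_for_allocation(node_tasks, allocation_id):
    """v1 lookup: allocation-scoped node tasks carry the allocation id in params."""
    return [t for t in node_tasks
            if t.get("params", {}).get("allocation_id") == allocation_id]
```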

Recommended later schema:

node_task_events(
  id,
  node_task_id,
  allocation_id,
  phase,
  status,
  message,
  details,
  occurred_at
)

This lets node-agent stream progress without mutating a large node_tasks.output document and gives the UI stable phase ordering.

Timing Rule

Any worker transaction that waits for external work must not use now() for terminal state timestamps after the wait. Use clock_timestamp() for:

  1. allocation active_at, released_at, and failure timestamps;
  2. node/resource slot terminal updated_at;
  3. outbox occurred_at created after a long wait.

now() is still acceptable for short transaction bookkeeping where transaction-start time is intended.
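The difference is easy to see in a toy simulation; a sketch where FakeTxn.now() is frozen at transaction start (mirroring Postgres now()) while clock_timestamp() keeps advancing:

```python
import time


class FakeTxn:
    """Toy model of Postgres timestamp semantics inside one transaction."""

    def __init__(self):
        self._start = time.monotonic()  # frozen at BEGIN, like now()

    def now(self) -> float:
        return self._start              # transaction-start time

    def clock_timestamp(self) -> float:
        return time.monotonic()         # actual wall-clock time


txn = FakeTxn()
time.sleep(0.05)  # stand-in for waiting on node-agent completion
stale = txn.now()
fresh = txn.clock_timestamp()
# fresh - stale is the wait duration; stamping completion with now() would
# backdate it to before the work even started.
```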