Allocation Provisioning Task Timeline v1

Purpose

Allocation provisioning now includes asynchronous host work that can take several minutes, especially for GPU VM slices. The user-facing allocation state alone is not enough to explain progress or delays. Allocation detail should expose an app-runtime-style task timeline backed by the same durable task data used by provisioning workers and node agents.

This is the allocation counterpart to app instance operation timelines.

Problem Observed

For GPU slice allocation f73c9691-cbd1-4bf4-bb1f-af28a23102e2 on April 18, 2026:

  1. allocation create and placement completed quickly;
  2. slice.vm_provision was queued at 2026-04-18T01:31:58Z;
  3. node-agent claimed it about 1 second later;
  4. node-agent completed it at 2026-04-18T01:36:38Z;
  5. real provision duration was about 279 seconds.
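The durations above fall straight out of the recorded timestamps; a minimal sketch in Python (timestamps copied from the incident, variable names illustrative):

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp like 2026-04-18T01:31:58Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


queued_at = parse_ts("2026-04-18T01:31:58Z")     # slice.vm_provision queued
claimed_at = parse_ts("2026-04-18T01:31:59Z")    # node-agent claim, ~1s later
completed_at = parse_ts("2026-04-18T01:36:38Z")  # node-agent completion

queue_delay = (claimed_at - queued_at).total_seconds()          # 1.0
provision_duration = (completed_at - claimed_at).total_seconds()  # 279.0
```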

The UI showed a generic long-running message, but the platform had the more specific task state in node_tasks.

The same investigation exposed a timestamp bug: the provisioning worker holds a DB transaction open while waiting for node-agent task completion, and inside that transaction now() returns the transaction start time, not the completion time. Long-running worker transactions must use clock_timestamp() for terminal-state timestamps and for outbox event occurred_at.

Timeline Model

Allocation detail should present two layers:

  1. allocation lifecycle state;
  2. node/runtime task timeline.

Example:

requested
  -> placement_reserved
  -> provisioning_started
  -> node_task_queued(slice.vm_provision)
  -> node_task_dispatched
  -> guest_booting
  -> ssh_ready
  -> active

The first implementation can derive most of this from existing rows:

Timeline item          Source
-------------          ------
requested              allocations.created_at
placement_reserved     allocation_resource_claims.status and slot/node claim rows
provisioning_started   allocations.provisioning_started_at
node_task_queued       node_tasks.created_at / issued_at
node_task_dispatched   node_tasks.dispatched_at
node_task_completed    node_tasks.completed_at
active                 allocations.active_at
failed                 allocations.failure_reason and failed node task error
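The derivation in the table above can be a pure function over existing rows; a v1 sketch, assuming rows arrive as plain dicts whose field names follow the table (helper and key names are illustrative):

```python
from typing import Any


def derive_timeline(allocation: dict[str, Any],
                    node_tasks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Build timeline items from existing allocation and node_tasks rows."""
    items: list[dict[str, Any]] = []
    if allocation.get("created_at"):
        items.append({"kind": "allocation_state", "name": "requested",
                      "status": "succeeded", "started_at": allocation["created_at"]})
    if allocation.get("provisioning_started_at"):
        items.append({"kind": "allocation_state", "name": "provisioning_started",
                      "status": "succeeded",
                      "started_at": allocation["provisioning_started_at"]})
    for task in node_tasks:
        status = "succeeded" if task.get("completed_at") else "running"
        items.append({"kind": "node_task", "name": task["type"], "status": status,
                      "task_id": task["id"],
                      "started_at": task.get("dispatched_at") or task["created_at"],
                      "completed_at": task.get("completed_at")})
    if allocation.get("active_at"):
        items.append({"kind": "allocation_state", "name": "active",
                      "status": "succeeded", "started_at": allocation["active_at"]})
    # ISO-8601 UTC strings sort lexicographically, so this is chronological order.
    items.sort(key=lambda i: i["started_at"])
    return items
```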

For richer progress, node-agent should include a redacted progress array in node_tasks.output or update a future node_task_events table. Example phases:

  1. image prepared or reused;
  2. cloud-init seed generated;
  3. libvirt domain defined;
  4. VM started;
  5. guest agent or SSH reachable;
  6. driver readiness probe completed.
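One way node-agent could record these phases is a small append helper that keeps node_tasks.output bounded to known phases and free of secrets; a sketch, assuming output is a JSON document the agent owns (phase identifiers below mirror the list above but are illustrative names):

```python
import json
from datetime import datetime, timezone

# Hypothetical phase identifiers; the real set should be an agreed enum.
ALLOWED_PHASES = [
    "image_prepared", "cloud_init_seeded", "domain_defined",
    "vm_started", "guest_reachable", "driver_ready",
]


def append_progress(output_json: str, phase: str, message: str = "") -> str:
    """Append a redacted progress entry to a node_tasks.output document."""
    if phase not in ALLOWED_PHASES:
        raise ValueError(f"unknown phase: {phase}")
    doc = json.loads(output_json) if output_json else {}
    doc.setdefault("progress", []).append({
        "phase": phase,
        "message": message,  # free text, must already be secret-free
        "occurred_at": datetime.now(timezone.utc).isoformat(),
    })
    return json.dumps(doc)
```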

API Direction

Add a read-only project-scoped surface:

GET /api/v1/allocations/{allocation_id}/timeline

Admin/operator equivalent:

GET /api/v1/admin/allocations/{allocation_id}/timeline

Response shape should be generic enough for bare metal and slices:

{
  "allocation_id": "uuid",
  "status": "provisioning",
  "items": [
    {
      "kind": "allocation_state",
      "name": "provisioning_started",
      "status": "succeeded",
      "started_at": "2026-04-18T01:31:58Z"
    },
    {
      "kind": "node_task",
      "name": "slice.vm_provision",
      "status": "running",
      "task_id": "uuid",
      "started_at": "2026-04-18T01:31:59Z",
      "completed_at": null,
      "duration_seconds": 180,
      "summary": "Creating GPU slice VM"
    }
  ]
}

Contract rules:

  1. update doc/api/openapi.draft.yaml before implementation;
  2. project users can read only their own/project allocations;
  3. the admin endpoint requires explicit admin authorization; audit logging is not needed for read-only access unless platform policy later requires it;
  4. task params and output must be sanitized before returning to the UI;
  5. never return SSH keys, access secrets, task signatures, cloud-init content, or provider credentials.
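Rules 4 and 5 can be enforced with a recursive deny-list pass before serialization; a minimal sketch (key names are illustrative and the real deny-list should come from platform policy, not this hardcoded set):

```python
# Hypothetical deny-list; extend from platform policy.
SENSITIVE_KEYS = {
    "ssh_key", "ssh_private_key", "access_token", "task_signature",
    "cloud_init", "user_data", "provider_credentials", "password",
}


def sanitize(value):
    """Recursively drop sensitive keys from task params/output before returning to the UI."""
    if isinstance(value, dict):
        return {k: sanitize(v) for k, v in value.items() if k not in SENSITIVE_KEYS}
    if isinstance(value, list):
        return [sanitize(v) for v in value]
    return value
```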

Implementation Notes

Initial backend can query node_tasks by params->>'allocation_id' and join the allocation row. This is acceptable for v1 because allocation-scoped node tasks already carry the allocation id.
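The v1 lookup is just a filter on the task params; in SQL that is roughly a `WHERE params->>'allocation_id' = $1` predicate, and the in-memory equivalent is a one-liner. A sketch with rows as dicts (query text and function name are illustrative):

```python
# Hypothetical query text; table and column names follow the notes above.
TIMELINE_TASKS_SQL = """
SELECT t.*
FROM node_tasks t
WHERE t.params->>'allocation_id' = %(allocation_id)s
ORDER BY t.created_at
"""


def tasks_for_allocation(node_tasks, allocation_id):
    """v1 lookup: allocation-scoped node tasks carry the allocation id in params."""
    return [t for t in node_tasks
            if t.get("params", {}).get("allocation_id") == allocation_id]
```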

Recommended later schema:

node_task_events(
  id,
  node_task_id,
  allocation_id,
  phase,
  status,
  message,
  details,
  occurred_at
)

This lets node-agent stream progress without mutating a large node_tasks.output document and gives the UI stable phase ordering.

Timing Rule

Any worker transaction that waits for external work must not use now() for terminal state timestamps after the wait. Use clock_timestamp() for:

  1. allocation active_at, released_at, and failure timestamps;
  2. node/resource slot terminal updated_at;
  3. outbox occurred_at created after a long wait.

now() is still acceptable for short transaction bookkeeping where transaction-start time is intended.
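The difference is easy to see in a toy simulation; a sketch where FakeTxn.now() is frozen at transaction start (mirroring Postgres now()) while clock_timestamp() keeps advancing:

```python
import time


class FakeTxn:
    """Toy model of Postgres timestamp semantics inside one transaction."""

    def __init__(self):
        self._start = time.monotonic()  # frozen at BEGIN, like now()

    def now(self) -> float:
        return self._start              # transaction-start time

    def clock_timestamp(self) -> float:
        return time.monotonic()         # actual wall-clock time


txn = FakeTxn()
time.sleep(0.05)  # stand-in for waiting on node-agent completion
stale = txn.now()
fresh = txn.clock_timestamp()
# fresh - stale is the wait duration; stamping completion with now() would
# backdate it to before the work even started.
```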