
Node-agent task catalog

Implemented

Source: cmd/node-agent/agent.go:32-64 (task type registry) · cmd/node-agent/*.go (handlers)

The node-agent is a bounded typed-task executor. It is not a remote shell. Every task has a typed input, a structured output, and a signed contract. Tasks come from cmd/api over mTLS pull (GET /internal/v1/nodes/{id}/tasks/wait) and results are posted back (POST .../tasks/{id}/result).

sequenceDiagram
    autonumber
    participant W as cmd/provisioning-worker
    participant A as cmd/api
    participant PG as Postgres
    participant N as cmd/node-agent
    participant H as Host OS

    W->>A: enqueueTask(node_id, type, params)
    A->>PG: INSERT node_tasks (status='queued')
    N->>A: GET /tasks/wait (long-poll, mTLS)
    A->>PG: CTE: claim queued task → dispatched
    A-->>N: {task_id, task_type, params, signature, expires_at}
    N->>N: verify signature + expiry
    N->>H: execute typed handler
    N->>A: POST /tasks/{id}/result
    A->>PG: UPDATE node_tasks status='completed'
    W->>PG: poll node_tasks (250ms)
    W-->>W: terminal → continue workflow

Task taxonomy

flowchart LR
    classDef slice fill:#fff3e0,stroke:#e65100
    classDef bm    fill:#e3f2fd,stroke:#1565c0
    classDef user  fill:#e8f5e9,stroke:#2e7d32
    classDef diag  fill:#f3e5f5,stroke:#6a1b9a

    subgraph Slice["GPU slice family"]
        T1[slice.topology_discover]:::slice
        T2[slice.vm_provision]:::slice
        T3[slice.vm_release]:::slice
    end

    subgraph BM["Baremetal allocation"]
        T4[allocation.provision_user]:::bm
        T5[allocation.deprovision_user]:::bm
    end

    subgraph Diag["Diagnostics"]
        T6[diag.health_probe]:::diag
    end

Slice family

slice.topology_discover

Implemented

Operator-onboarding task. Scans the host (/sys/bus/pci/devices, /sys/block, OVS, IPoIB, kernel cmdline) and returns a candidate slot map plus host readiness blockers.

| Output key | Meaning |
| --- | --- |
| prerequisites.kvm_available | /dev/kvm exists |
| prerequisites.iommu_group_count | non-zero ⇒ VFIO usable |
| prerequisites.iommu_kernel_args | intel_iommu=on \| amd_iommu=on \| iommu=pt present |
| prerequisites.cpu_virtualization | {vmx, svm} flags |
| prerequisites.loaded_modules | vfio_pci, vfio_iommu_type1, etc. |
| prerequisites.required_commands | cloud-localds, qemu-img, virt-install, virsh, ovs-vsctl, findmnt |
| prerequisites.slice_network | OVS bridge, NAT, IP forwarding, IPoIB netplan checks |
| prerequisites.reboot_required | marker files present |
| gpu_devices[] | NVIDIA GPUs (vendor 0x10de) with PCI/IOMMU/NUMA |
| fabric_devices[] | Mellanox (vendor 0x15b3) with VF detection |
| nvme_devices[] | NVMe namespaces with stable /dev/disk/by-id/ path |
| candidate_slots[] | GPU↔NVMe↔fabric VF pairs with deterministic MAC + IP |
| candidate_summary.blockers[] | Reasons slots are not schedulable |

Approval required: approval_required: true is always set on the output. Discovery never writes node_resource_slots rows.

Source: cmd/node-agent/slice_topology.go.

slice.vm_provision

Implemented

Provisions a slice VM with N slots passed through. Each phase is timed; on failure, the deferred cleanup steps run before the error is reported.

flowchart TB
    P0([slice.vm_provision])
    P0 --> P1[1. lease_acquire<br/>per-slot exclusive JSON lease]
    P1 --> P2[2. host_dependencies<br/>apt-install qemu-kvm/libvirt/ovs]
    P2 --> P3[3. host_passthrough_check<br/>/dev/kvm + modprobe vfio-pci]
    P3 --> P4[4. vfio_bind_check<br/>GPU + fabric VF → vfio-pci]
    P4 --> P5[5. image_stat_download<br/>fetch from image_url if missing]
    P5 --> P6[6. image_digest_verify]
    P6 --> P7[7. cloud_init_dir]
    P7 --> P8[8. terminal_key]
    P8 --> P9[9. guest_telemetry_register]
    P9 --> P10[10. cloud_init_seed_files]
    P10 --> P11[11. cloud_localds]
    P11 --> P12[12. runtime_validate<br/>OVS br-exists + NVMe unmounted]
    P12 --> P13[13. dhcp_reservation]
    P13 --> P14[14. image_write_convert<br/>qemu-img convert → boot NVMe]
    P14 --> P15[15. virt_install]
    P15 --> P16[16. readiness<br/>SSH + /var/lib/gpuaas/slice-ready]
    P16 --> P17[17. performance_probe]
    P17 --> OK([success])

    P1 & P4 & P14 & P15 & P16 -.fail.-> CL[cleanup:<br/>drop DHCP / unregister telemetry /<br/>release leases / rm cloud_init_dir]
    CL --> ERR([error])

Inputs (selected): allocation_id, vm_name, image_path, image_sha256, image_trusted, driver_strategy (cloud-init|preinstalled|none), default_username, ssh_public_keys[], slots[], ovs_bridge, graceful_timeout_seconds (30–900).

Each slot carries slot_id, slot_index, pci_address (GPU), fabric_device (VF PCI), nvme_device, numa_node, vcpu_count (default 12), memory_mib (default 65536), mac_address, private_ip.

Output includes vm_name, default_user, private_ip, ssh_port=22, slot_count, readiness, performance, timings.<phase>_ms, raw_vnc=false, console_model=gateway_required.

Source: cmd/node-agent/slice_vm.go:113-635 (handler + provision flow).

slice.vm_release

Implemented

flowchart TB
    R0([slice.vm_release]) --> R1[virsh shutdown]
    R1 --> R2{running after<br/>graceful_timeout?}
    R2 -- yes --> R3[virsh destroy<br/>hard_stopped=true]
    R2 -- no --> R4[virsh undefine<br/>--nvram fallback]
    R3 --> R4
    R4 --> R5[re-bind GPU + fabric VF<br/>to vfio-pci]
    R5 --> R6[rm cloud_init_dir<br/>rm DHCP reservation]
    R6 --> R7{wipe=true?}
    R7 -- yes --> R8[zero each NVMe]
    R7 -- no --> R9
    R8 --> R9[release slot leases<br/>unregister guest telemetry]
    R9 --> DONE([result])

Source: cmd/node-agent/slice_vm.go:1175-1221.

Baremetal family

allocation.provision_user

Implemented RCA

Creates the OS user on a bare-metal host, applies SSH authorized keys, verifies user/home presence.

Drove RCA: 2026-03-node-api-mtls-identity-handoff (the DB recorded the task as completed while the node never executed it; the fix made the task pull idempotent and identity-bound).

allocation.deprovision_user

Implemented

Inverse of allocation.provision_user: removes the OS user and revokes SSH access.

Diagnostics

diag.health_probe

Implemented

Periodic node health check. Reports kernel version, free RAM, KVM presence, VFIO module status, GPU detection, fabric link state.

Security model

| Property | Mechanism |
| --- | --- |
| Transport | mTLS — node identity is its enrollment cert (24 h TTL, X5C renewal via step-ca) |
| Task signing | Every task params payload is signed by cmd/api; node-agent verifies before execution |
| Expiry | Each task carries expires_at; node-agent rejects expired tasks |
| No remote shell | Only the registered handlers run. No exec.task or equivalent |
| Sandboxed paths | Image paths restricted to /var/lib/gpuaas/slice-images or /var/lib/libvirt/images; cloud-init dirs to /var/lib/gpuaas/slices/ |
| Single-use terminal keys | Each provision injects the host's per-instance ed25519 pubkey; the gateway uses it to broker SSH |
| Audit | API writes audit_logs for every task enqueue with actor, role, action, target, result |

Detail: Node Task Signing Lifecycle (source).

Where to look next