
Node-agent task catalog

Implemented

Source: cmd/node-agent/agent.go:32-64 (task type registry) · cmd/node-agent/*.go (handlers)

The node-agent is a bounded typed-task executor. It is not a remote shell. Every task has a typed input, a structured output, and a signed contract. Tasks come from cmd/api over mTLS pull (GET /internal/v1/nodes/{id}/tasks/wait) and results are posted back (POST .../tasks/{id}/result).

sequenceDiagram
    autonumber
    participant W as cmd/provisioning-worker
    participant A as cmd/api
    participant PG as Postgres
    participant N as cmd/node-agent
    participant H as Host OS

    W->>A: enqueueTask(node_id, type, params)
    A->>PG: INSERT node_tasks (status='queued')
    N->>A: GET /tasks/wait (long-poll, mTLS)
    A->>PG: CTE: claim queued task → dispatched
    A-->>N: {task_id, task_type, params, signature, expires_at}
    N->>N: verify signature + expiry
    N->>H: execute typed handler
    N->>A: POST /tasks/{id}/result
    A->>PG: UPDATE node_tasks status='completed'
    W->>PG: poll node_tasks (250ms)
    W-->>W: terminal → continue workflow

Task taxonomy

flowchart LR
    classDef slice fill:#fff3e0,stroke:#e65100
    classDef bm    fill:#e3f2fd,stroke:#1565c0
    classDef user  fill:#e8f5e9,stroke:#2e7d32
    classDef diag  fill:#f3e5f5,stroke:#6a1b9a

    subgraph Slice["GPU slice family"]
        T1[slice.topology_discover]:::slice
        T2[slice.vm_provision]:::slice
        T3[slice.vm_release]:::slice
    end

    subgraph BM["Baremetal allocation"]
        T4[allocation.provision_user]:::bm
        T5[allocation.deprovision_user]:::bm
    end

    subgraph Diag["Diagnostics"]
        T6[diag.health_probe]:::diag
    end

Slice family

slice.topology_discover

Implemented

Operator-onboarding task. Scans the host (/sys/bus/pci/devices, /sys/block, OVS, IPoIB, kernel cmdline) and returns a candidate slot map plus host readiness blockers.

| Output key | Meaning |
| --- | --- |
| prerequisites.kvm_available | /dev/kvm exists |
| prerequisites.iommu_group_count | non-zero ⇒ VFIO usable |
| prerequisites.iommu_kernel_args | intel_iommu=on \| amd_iommu=on \| iommu=pt present |
| prerequisites.cpu_virtualization | {vmx, svm} flags |
| prerequisites.loaded_modules | vfio_pci, vfio_iommu_type1, etc. |
| prerequisites.required_commands | cloud-localds, qemu-img, virt-install, virsh, ovs-vsctl, findmnt |
| prerequisites.slice_network | OVS bridge, NAT, IP forwarding, IPoIB netplan checks |
| prerequisites.reboot_required | marker files present |
| gpu_devices[] | NVIDIA GPUs (vendor 0x10de) with PCI/IOMMU/NUMA |
| fabric_devices[] | Mellanox (vendor 0x15b3) with VF detection |
| nvme_devices[] | NVMe namespaces with stable /dev/disk/by-id/ path |
| candidate_slots[] | GPU↔NVMe↔fabric VF pairs with deterministic MAC + IP |
| candidate_summary.blockers[] | Reasons slots are not schedulable |

Approval required: approval_required: true is always set on the output. Discovery never writes node_resource_slots rows.

Source: cmd/node-agent/slice_topology.go.

slice.vm_provision

Implemented

Provisions a slice VM with N slots passed through. Each phase is timed; on failure, the deferred cleanup steps run before the error is reported.

flowchart TB
    P0([slice.vm_provision])
    P0 --> P1[1. lease_acquire<br/>per-slot exclusive JSON lease]
    P1 --> P2[2. host_dependencies<br/>apt-install qemu-kvm/libvirt/ovs]
    P2 --> P3[3. host_passthrough_check<br/>/dev/kvm + modprobe vfio-pci]
    P3 --> P4[4. vfio_bind_check<br/>GPU + fabric VF → vfio-pci]
    P4 --> P5[5. image_stat_download<br/>fetch from image_url if missing]
    P5 --> P6[6. image_digest_verify]
    P6 --> P7[7. cloud_init_dir]
    P7 --> P8[8. terminal_key]
    P8 --> P9[9. guest_telemetry_register]
    P9 --> P10[10. cloud_init_seed_files]
    P10 --> P11[11. cloud_localds]
    P11 --> P12[12. runtime_validate<br/>OVS br-exists + NVMe unmounted]
    P12 --> P13[13. dhcp_reservation]
    P13 --> P14[14. image_write_convert<br/>qemu-img convert → boot NVMe]
    P14 --> P15[15. virt_install]
    P15 --> P16[16. readiness<br/>SSH + /var/lib/gpuaas/slice-ready]
    P16 --> P17[17. performance_probe]
    P17 --> OK([success])

    P1 & P4 & P14 & P15 & P16 -.fail.-> CL[cleanup:<br/>drop DHCP / unregister telemetry /<br/>release leases / rm cloud_init_dir]
    CL --> ERR([error])

Inputs (selected): allocation_id, vm_name, image_path, image_sha256, image_trusted, driver_strategy (cloud-init|preinstalled|none), default_username, ssh_public_keys[], slots[], ovs_bridge, graceful_timeout_seconds (30–900).

Each slot carries slot_id, slot_index, pci_address (GPU), fabric_device (VF PCI), nvme_device, numa_node, vcpu_count (default 12), memory_mib (default 65536), mac_address, private_ip.

Output includes vm_name, default_user, private_ip, ssh_port=22, slot_count, readiness, performance, timings.<phase>_ms, raw_vnc=false, console_model=gateway_required.

Source: cmd/node-agent/slice_vm.go:113-635 (handler + provision flow).

slice.vm_release

Implemented

flowchart TB
    R0([slice.vm_release]) --> R1[virsh shutdown]
    R1 --> R2{running after<br/>graceful_timeout?}
    R2 -- yes --> R3[virsh destroy<br/>hard_stopped=true]
    R2 -- no --> R4[virsh undefine<br/>--nvram fallback]
    R3 --> R4
    R4 --> R5[re-bind GPU + fabric VF<br/>to vfio-pci]
    R5 --> R6[rm cloud_init_dir<br/>rm DHCP reservation]
    R6 --> R7{wipe=true?}
    R7 -- yes --> R8[zero each NVMe]
    R7 -- no --> R9
    R8 --> R9[release slot leases<br/>unregister guest telemetry]
    R9 --> DONE([result])

Source: cmd/node-agent/slice_vm.go:1175-1221.

Baremetal family

allocation.provision_user

Implemented RCA

Creates the OS user on a bare-metal host, applies SSH authorized keys, verifies user/home presence.

Drove RCA: 2026-03-node-api-mtls-identity-handoff (the DB recorded the task as completed while the node never executed it; the fix made the task pull idempotent and identity-bound).

allocation.deprovision_user

Implemented

Inverse of allocation.provision_user: removes the OS user and revokes SSH access.

Diagnostics

diag.health_probe

Implemented

Periodic node health check. Reports kernel version, free RAM, KVM presence, VFIO module status, GPU detection, fabric link state.

Security model

| Property | Mechanism |
| --- | --- |
| Transport | mTLS — node identity is its enrollment cert (24 h TTL, X5C renewal via step-ca) |
| Task signing | Every task params payload is signed by cmd/api; node-agent verifies before execution |
| Expiry | Each task carries expires_at; node-agent rejects expired tasks |
| No remote shell | Only the registered handlers run. No exec.task or equivalent |
| Sandboxed paths | Image paths restricted to /var/lib/gpuaas/slice-images or /var/lib/libvirt/images; cloud-init dirs to /var/lib/gpuaas/slices/ |
| Single-use terminal keys | Each provision injects the host's per-instance ed25519 pubkey; the gateway uses it to broker SSH |
| Audit | API writes audit_logs for every task enqueue with actor, role, action, target, result |

Detail: Node Task Signing Lifecycle (source).

Where to look next