GPU slice (as-built)¶
Status: Implemented
Source: cmd/node-agent/slice_*.go · packages/services/provisioning/orchestrator/service.go:1499-1915 · scripts/seed.sql
This page describes the slice path as the code actually runs it. For design intent and proposals, see Allocation Capacity Shapes and GPU Slices v1 (source).
At a glance¶
```mermaid
flowchart LR
U([Tenant]) -->|POST /allocations<br/>sku=h200-sxm-slice<br/>gpus=N| API[cmd/api]
API --> ORCH[orchestrator]
ORCH -->|reserve N slots<br/>NUMA-fit, fabric-unique| DB[(Postgres)]
ORCH --> WK[provisioning-worker]
WK -->|slice.vm_provision| AG[node-agent]
AG -->|libvirt + cloud-init +<br/>VFIO + OVS| VM["Slice VM<br/>N GPUs · N NVMe · N IB VF"]
VM -.guest telemetry POST.-> AG
```
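As a concrete example, the tenant request at the left of the diagram might be issued like this. This is a hypothetical client sketch: the field names (`sku`, `gpus`) are inferred from the diagram, not confirmed against cmd/api's actual request schema.

```go
import (
	"bytes"
	"encoding/json"
	"net/http"
)

// Hypothetical client sketch. Field names (sku, gpus) are inferred
// from the diagram above, not taken from cmd/api's actual schema.
func requestSlice(apiBaseURL string) (*http.Response, error) {
	body, err := json.Marshal(map[string]any{
		"sku":  "h200-sxm-slice", // slice SKU from the catalog
		"gpus": 4,                // must be one of allowed_gpu_counts
	})
	if err != nil {
		return nil, err
	}
	return http.Post(apiBaseURL+"/allocations", "application/json", bytes.NewReader(body))
}
```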
Data model¶
```mermaid
erDiagram
sku_catalog ||--o{ nodes : "sku"
sku_catalog {
text sku
text capacity_shape "gpu_slice|baremetal"
int[] allowed_gpu_counts
jsonb resource_profile
}
nodes ||--o{ node_resource_slots : "hosts"
nodes ||--o{ allocations : "runs"
node_resource_slots ||--o{ allocation_resource_claims : "claimed"
node_resource_slots {
uuid id
int slot_index
text status
text capacity_shape
text pci_address "GPU PCI"
text nvme_device
text mac_address
inet private_ip
int numa_node
jsonb capacity_metadata "storage_ownership, fabric_claim_mode, fabric_vf_pci_address, destructive_wipe_policy, sku"
}
allocation_resource_claims {
uuid allocation_id
uuid slot_id
text claim_kind "slot|node_exclusive"
text status
}
```
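For orientation, here is a minimal Go sketch of the slot row and its `capacity_metadata` as structs. These are illustrative types mirroring the columns above, not the repository's actual models.

```go
// Illustrative structs mirroring the columns above; not the
// repository's actual types. JSON tags follow the capacity_metadata
// keys shown in the ER diagram.
type CapacityMetadata struct {
	StorageOwnership      string `json:"storage_ownership"`       // must be "slice"
	FabricClaimMode       string `json:"fabric_claim_mode"`       // must be "per_slot_vf"
	FabricVFPCIAddress    string `json:"fabric_vf_pci_address"`   // fabric VF passed to the VM
	DestructiveWipePolicy string `json:"destructive_wipe_policy"` // must be non-empty
	SKU                   string `json:"sku"`
}

type NodeResourceSlot struct {
	ID            string
	SlotIndex     int
	Status        string // "available", "reserved", ...
	CapacityShape string // "gpu_slice" on this path
	PCIAddress    string // GPU PCI address
	NVMeDevice    string
	MACAddress    string
	PrivateIP     string
	NUMANode      int
	Metadata      CapacityMetadata // capacity_metadata jsonb
}
```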
Orchestrator: slot reservation¶
Entrypoint: Service.reserveSliceSlotsForAllocation (service.go:1596). Runs inside the same DB transaction as allocation creation.
```mermaid
flowchart TB
A[POST /allocations] --> B[BEGIN TX]
B --> C[SELECT sku_catalog]
C --> D{capacity_shape == gpu_slice?}
D -- no --> ERR1[sku_unavailable]
D -- yes --> E{gpus_total in allowed_counts?}
E -- no --> ERR2[sku_unavailable]
E -- yes --> F[listSlicePlacementCandidates<br/>filter slots + os_images join]
F --> G{candidates exist?}
G -- no --> ERR3[sku_unavailable]
G -- yes --> H[rankSlicePlacementCandidates<br/>NUMA-fit, best-fit, slot index]
H --> I[lockAvailableSliceSlots<br/>FOR UPDATE SKIP LOCKED]
I --> J[selectSliceSlotIDs<br/>same NUMA when possible<br/>reject duplicate fabric VFs]
J --> K[UPDATE slots status='reserved'<br/>INSERT allocation + N claims<br/>INSERT outbox row]
K --> COMMIT[COMMIT]
```
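The same flow, condensed into a Go sketch. The helper names follow the diagram, but every signature and type here (`AllocationRequest`, `errSKUUnavailable`, the `Service` methods) is an assumption, not the actual API at service.go:1596.

```go
import (
	"context"
	"database/sql"
	"slices"
)

// Condensed control-flow sketch of the reservation path; runs inside
// the allocation-creation transaction. Helper names follow the
// diagram; all signatures and types are assumed.
func (s *Service) reserveSliceSlots(ctx context.Context, tx *sql.Tx, req AllocationRequest) error {
	sku, err := s.loadSKU(ctx, tx, req.SKU)
	if err != nil {
		return err
	}
	if sku.CapacityShape != "gpu_slice" || !slices.Contains(sku.AllowedGPUCounts, req.GPUsTotal) {
		return errSKUUnavailable
	}
	cands, err := s.listSlicePlacementCandidates(ctx, tx, req) // slot filter + os_images join
	if err != nil {
		return err
	}
	if len(cands) == 0 {
		return errSKUUnavailable
	}
	ranked := rankSlicePlacementCandidates(cands, req.GPUsTotal) // NUMA-fit, best-fit, slot index
	locked, err := s.lockAvailableSliceSlots(ctx, tx, ranked)    // FOR UPDATE SKIP LOCKED
	if err != nil {
		return err
	}
	// Same NUMA when possible; duplicate fabric VFs rejected.
	slotIDs, err := selectSliceSlotIDs(locked, req.GPUsTotal)
	if err != nil {
		return err
	}
	// Mark slots reserved, insert the allocation + N claims, and write
	// the outbox row, all inside the caller's transaction.
	return s.claimSlots(ctx, tx, req, slotIDs)
}
```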
Slot filter (the SQL that defines schedulability)¶
The candidate query at service.go:1682-1755 requires (a hedged SQL paraphrase follows the list):

- `node.status = 'active'`, matching region, tenant boundary
- `slot.status = 'available'`, `capacity_shape = 'gpu_slice'`, no parent slot
- `sharing_model = 'exclusive_device'`, `max_claims = 1`, `compute_milli ≥ 1000`
- Non-empty `pci_address`, `nvme_device`, `mac_address`, `private_ip`
- `capacity_metadata.storage_ownership = 'slice'`
- `capacity_metadata.destructive_wipe_policy != ''`
- `capacity_metadata.fabric_claim_mode = 'per_slot_vf'`
- `capacity_metadata.fabric_vf_pci_address != ''`
- A viable `os_images` row (cached or fetchable) for `target='vm_slice'` and the SKU
- No other active `slot` claim sharing the same `fabric_vf_pci_address` on that node
- No active `node_exclusive` claim on that node
- No active `baremetal` allocation on that node
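As a rough paraphrase of the core predicate, in the shape of a query fragment the orchestrator might embed. The `parent_slot_id` column name is a guess, and the os_images join plus the duplicate-VF / `node_exclusive` / `baremetal` exclusions are elided; the real query at service.go:1682-1755 is the source of truth.

```go
// Paraphrase of the schedulability predicate; not the actual query.
// parent_slot_id is a guessed column name; joins and the exclusion
// subqueries from the list above are elided.
const sliceSlotFilterSketch = `
  n.status = 'active' AND n.region = $1            -- plus tenant boundary
  AND s.status = 'available'
  AND s.capacity_shape = 'gpu_slice'
  AND s.parent_slot_id IS NULL                     -- "no parent slot"
  AND s.sharing_model = 'exclusive_device'
  AND s.max_claims = 1
  AND s.compute_milli >= 1000
  AND s.pci_address <> '' AND s.nvme_device <> ''
  AND s.mac_address <> '' AND s.private_ip IS NOT NULL
  AND s.capacity_metadata->>'storage_ownership' = 'slice'
  AND coalesce(s.capacity_metadata->>'destructive_wipe_policy', '') <> ''
  AND s.capacity_metadata->>'fabric_claim_mode' = 'per_slot_vf'
  AND coalesce(s.capacity_metadata->>'fabric_vf_pci_address', '') <> ''`
```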
Ranking heuristic¶
`rankSlicePlacementCandidates` (service.go:1889) scores candidates by the following criteria, sketched as a comparator below:

- Single-NUMA fit — whether `gpus_total` slots can be drawn from one NUMA group
- Remaining slots after the pick (best-fit; smaller is better)
- NUMA group count in the chosen subset
- First slot index (deterministic tie-break)
- Node ID (final tie-break)
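That ordering, as a Go comparator. The candidate struct and its fields are assumptions shaped by the list above, not the actual types at service.go:1889.

```go
import "sort"

// Sketch of the ranking above as a sort comparator; the candidate
// struct and field names are assumed.
type placementCandidate struct {
	nodeID         string
	singleNUMAFit  bool // all gpus_total slots fit in one NUMA group
	remainingSlots int  // free slots left on the node after the pick
	numaGroups     int  // NUMA groups spanned by the chosen subset
	firstSlotIndex int
}

func rankCandidates(cands []placementCandidate) {
	sort.SliceStable(cands, func(i, j int) bool {
		a, b := cands[i], cands[j]
		switch {
		case a.singleNUMAFit != b.singleNUMAFit:
			return a.singleNUMAFit // single-NUMA fits first
		case a.remainingSlots != b.remainingSlots:
			return a.remainingSlots < b.remainingSlots // best-fit: tighter is better
		case a.numaGroups != b.numaGroups:
			return a.numaGroups < b.numaGroups
		case a.firstSlotIndex != b.firstSlotIndex:
			return a.firstSlotIndex < b.firstSlotIndex // deterministic tie-break
		default:
			return a.nodeID < b.nodeID // final tie-break
		}
	})
}
```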
Node-agent: provision¶
Entrypoint: runSliceVMProvision (slice_vm.go:478). Seventeen phase-timed steps; failure at any step triggers the deferred cleanup. A sketch of the step/cleanup pattern follows the diagram.
```mermaid
flowchart TB
P0([slice.vm_provision])
P0 --> P1["1. lease_acquire<br/>/var/lib/gpuaas/node-scheduler/leases/{slot_id}.json"]
P1 --> P2["2. host_dependencies<br/>apt-install qemu-kvm/libvirt/ovs/cloud-image-utils"]
P2 --> P3[3. host_passthrough_check<br/>stat /dev/kvm + modprobe vfio-pci]
P3 --> P4["4. vfio_bind_check<br/>GPU + fabric VF → vfio-pci<br/>(driver_override + bind)"]
P4 --> P5["5. image_stat_download<br/>fetch from image_url if absent<br/>(64 GiB cap)"]
P5 --> P6[6. image_digest_verify<br/>sha256 if fresh or untrusted]
P6 --> P7["7. cloud_init_dir<br/>/var/lib/gpuaas/slices/{allocation_id}/"]
P7 --> P8["8. terminal_key<br/>ensure host ed25519 keypair<br/>(/var/lib/gpuaas/terminal/id_ed25519)"]
P8 --> P9[9. guest_telemetry_register<br/>per-allocation token]
P9 --> P10[10. cloud_init_seed_files<br/>user-data + meta-data 0o600]
P10 --> P11[11. cloud_localds → seed.iso]
P11 --> P12[12. runtime_validate<br/>OVS br-exists + NVMe unmounted]
P12 --> P13[13. dhcp_reservation<br/>/etc/dnsmasq.d/]
P13 --> P14[14. image_write_convert<br/>qemu-img convert -O raw → boot NVMe]
P14 --> P15[15. virt_install<br/>UEFI, host-passthrough,<br/>--host-device per GPU+VF]
P15 --> P16["16. readiness<br/>SSH on private_ip + /var/lib/gpuaas/slice-ready"]
P16 --> P17[17. performance_probe<br/>best-effort nvidia-smi snapshot]
P17 --> OK([return success])
P1 & P4 & P14 & P15 & P16 -.fail.-> CL[deferred cleanup:<br/>• drop DHCP reservation<br/>• unregister guest telemetry<br/>• release slot leases<br/>• RemoveAll cloud_init_dir]
CL --> ERR([return error])
classDef ok fill:#e8f5e9,stroke:#2e7d32
classDef wrn fill:#fff3e0,stroke:#e65100
class OK ok
class ERR,CL wrn
```
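The step/cleanup pattern in miniature: phases run in order and are timed, and cleanups registered by completed phases run in reverse order only when a later phase fails. This is the shape of runSliceVMProvision, not its code; the `phase` type and `runPhases` are illustrative.

```go
import (
	"context"
	"fmt"
	"log"
	"time"
)

// Illustrative phase runner. Each phase may return a cleanup (e.g.
// drop the DHCP reservation, release slot leases) for the failure path.
type phase struct {
	name string
	run  func(ctx context.Context) (cleanup func(), err error)
}

func runPhases(ctx context.Context, phases []phase) (err error) {
	var cleanups []func()
	defer func() {
		if err == nil {
			return
		}
		for i := len(cleanups) - 1; i >= 0; i-- { // LIFO, like defer
			cleanups[i]()
		}
	}()
	for _, p := range phases {
		start := time.Now()
		cleanup, runErr := p.run(ctx)
		if cleanup != nil {
			cleanups = append(cleanups, cleanup)
		}
		if runErr != nil {
			return fmt.Errorf("phase %s failed after %s: %w", p.name, time.Since(start), runErr)
		}
		log.Printf("phase %s ok in %s", p.name, time.Since(start))
	}
	return nil
}
```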
virt-install command shape¶
buildSliceVMVirtInstallArgs (slice_vm.go:1240) constructs:
```
virt-install \
--name=<vm_name> \
--memory=<sum_of_slot_mib> \
--vcpus=<sum_of_slot_vcpu> \
--cpu=host-passthrough,cache.mode=passthrough \
--os-variant=ubuntu24.04 \
--controller=scsi,model=virtio-scsi \
--disk=path=<boot_nvme>,format=raw,bus=scsi,target.rotation_rate=1,driver.cache=none,driver.io=native,driver.discard=unmap,driver.detect_zeroes=unmap \
--disk=path=<seed_iso>,device=cdrom \
--network=bridge=ovsbr0,virtualport_type=openvswitch,mac=<boot_mac> \
--graphics=none --noautoconsole \
--boot=uefi,loader_secure=no --tpm=none --import \
[--disk for each additional slot NVMe] \
[--numatune=<boot.numa_node>] \
--host-device=pci_<gpu> --host-device=pci_<fabric_vf> # per slot
```
Notes:
- Multiple slots → one libvirt domain with N disks and N PCI host-device pairs (sketched below).
- Memory / vCPU are summed across slots.
- `loader_secure=no` because the driver auto-install path doesn't have a signed module.
- `--graphics=none` — no SPICE/VNC; console only via the terminal gateway.
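How per-slot resources fold into the single domain, as a Go sketch. It reuses the illustrative NodeResourceSlot type from the data-model section and assumes a uniform per-slot memory/vCPU profile; buildSliceVMVirtInstallArgs (slice_vm.go:1240) is the authoritative builder.

```go
import "fmt"

// Sketch of folding N slots into one libvirt domain, per the notes
// above. Uniform per-slot memory/vCPU profile assumed.
func sliceDomainArgs(vmName string, slots []NodeResourceSlot, memPerSlotMiB, vcpuPerSlot int) []string {
	args := []string{
		"--name=" + vmName,
		// Memory and vCPUs are summed across slots.
		fmt.Sprintf("--memory=%d", memPerSlotMiB*len(slots)),
		fmt.Sprintf("--vcpus=%d", vcpuPerSlot*len(slots)),
	}
	for i, s := range slots {
		if i > 0 {
			// Slot 0's NVMe is the boot disk (written via qemu-img
			// convert); the rest attach as additional raw disks.
			args = append(args, "--disk=path="+s.NVMeDevice+",format=raw,bus=scsi")
		}
		// One GPU + fabric-VF host-device pair per slot, in the
		// pci_<address> shape shown in the command above.
		args = append(args,
			"--host-device=pci_"+s.PCIAddress,
			"--host-device=pci_"+s.Metadata.FabricVFPCIAddress)
	}
	return args
}
```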
Networking¶
```mermaid
flowchart LR
subgraph Host[Slice host]
UP[Uplink NIC<br/>default route]
IPT[iptables NAT<br/>POSTROUTING MASQUERADE<br/>10.100.0.0/24]
OVS{{OVS bridge ovsbr0<br/>10.100.0.1/24}}
DNS[dnsmasq<br/>per-VM MAC→IP]
IB[IPoIB ibp*<br/>192.168.x.0/24]
TELE[node-agent<br/>:9110/internal/v1/guest-telemetry]
end
subgraph VM["Slice VM (slot 0..N)"]
E1[eth0 virtio<br/>10.100.0.10]
IB1[ib VF passthrough]
GPU1[GPU vfio-pci]
NVME1[NVMe passthrough]
HELP[gpuaas-metrics-helper]
end
E1 <--> OVS
OVS <--> IPT
IPT <--> UP
OVS -.dhcp.-> DNS
IB1 <-.fabric.-> IB
HELP -. POST + per-alloc token .-> TELE
```
| Plane | Path | Source of truth |
|---|---|---|
| Management | BF3 VF → OVS `ovsbr0` → VM virtio NIC | dnsmasq reservation + slot row `mac_address`/`private_ip` (sketch below) |
| Workload fabric | Mellanox VF → VM (passthrough) | slot row `capacity_metadata.fabric_vf_pci_address` |
| Telemetry | VM helper → host gateway `:9110` | per-allocation token registered at provision time |
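The management-plane binding in the first row reduces to one dnsmasq `dhcp-host` reservation per VM, keyed by the slot row. A hedged sketch of writing it (the `dhcp-host` directive is standard dnsmasq syntax; the exact contents the node-agent renders are not shown here):

```go
import (
	"fmt"
	"os"
	"path/filepath"
)

// Hedged sketch: render the per-VM dnsmasq reservation from the slot
// row's mac_address and private_ip. The file path matches the
// filesystem layout below; the directive format is standard dnsmasq.
func writeDHCPReservation(vmName, mac, ip string) error {
	line := fmt.Sprintf("dhcp-host=%s,%s,%s\n", mac, ip, vmName)
	return os.WriteFile(filepath.Join("/etc/dnsmasq.d", vmName+".conf"), []byte(line), 0o644)
}
```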
Release¶
runSliceVMRelease (slice_vm.go:1175):
```mermaid
flowchart TB
R0([slice.vm_release]) --> R1["virsh shutdown <vm>"]
R1 --> R2{running after<br/>graceful_timeout?}
R2 -- yes --> R3["virsh destroy<br/>hard_stopped=true"]
R2 -- no --> R4["virsh undefine<br/>(--nvram fallback)"]
R3 --> R4
R4 --> R5[re-bind GPU + fabric VF<br/>to vfio-pci]
R5 --> R6[RemoveAll cloud_init_dir<br/>remove DHCP reservation]
R6 --> R7{wipe=true?}
R7 -- yes --> R8[zero each NVMe]
R7 -- no --> R9
R8 --> R9[release slot leases<br/>unregister guest telemetry]
R9 --> DONE([return result])
```
Output keys: `vm_name`, `released=true`, `hard_stopped`, `wiped`, `slot_count`, `leases_released`.
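Those keys map naturally onto a result struct; the JSON tags below match the listed keys, while the struct itself is an assumption.

```go
// Illustrative result shape for slice.vm_release; JSON tags match the
// output keys listed above, the struct itself is assumed.
type sliceReleaseResult struct {
	VMName         string `json:"vm_name"`
	Released       bool   `json:"released"`     // true on success
	HardStopped    bool   `json:"hard_stopped"` // virsh destroy was needed
	Wiped          bool   `json:"wiped"`        // NVMe devices were zeroed (wipe=true)
	SlotCount      int    `json:"slot_count"`
	LeasesReleased int    `json:"leases_released"`
}
```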
Filesystem layout (on a slice host)¶
```
/var/lib/gpuaas/
├── slice-images/ # image cache
├── slices/
│ └── <allocation_id>/ # per-allocation cloud-init seed
│ ├── user-data.yaml 0o600
│ ├── meta-data.yaml 0o600
│ └── seed.iso 0o644
├── node-scheduler/
│ └── leases/<slot_id>.json 0o600 (per-slot mutex)
├── terminal/
│ ├── id_ed25519 0o600 (host terminal key)
│ └── id_ed25519.pub 0o644
└── site-bootstrap/
└── h200-slice-vm.reboot-required (touched if host bootstrap pending)
/etc/dnsmasq.d/<vm_name>.conf (per-slice MAC→IP reservation)
/etc/netplan/60-ipoib.yaml (IPoIB host config)
/etc/gpuaas/metrics-helper.env (rendered into the guest cloud-init)
/usr/local/bin/gpuaas-metrics-helper (rendered into the guest cloud-init)
```
Environment variables on the node-agent¶
| Var | Default | Effect |
|---|---|---|
| `GPUAAS_SLICE_RUNTIME_VFIO_BIND` | unset (off) | When `1`, allow runtime rebinding of GPU/fabric VFs to vfio-pci. With the default off, a wrong driver is a hard error. |
| `GPUAAS_NODE_SCHEDULER_LEASE_TTL_SECONDS` | 86400 | Slot-lease TTL in seconds. Clamped to [300, 604800] (sketch below). |
| `GPUAAS_SLICE_TERMINAL_SSH_KEY_PATH` | `/var/lib/gpuaas/terminal/id_ed25519` | Override the host terminal key location. |
| `GPUAAS_TELEMETRY_ADDR` | `:9110` | Port the guest helper POSTs to. |
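The TTL clamp, as a minimal sketch; the parsing details are assumed, only the default and the bounds come from the table above.

```go
import (
	"os"
	"strconv"
	"time"
)

// Sketch of the lease-TTL clamp described above: default 86400
// seconds, clamped to [300, 604800]. Parsing details are assumed.
func leaseTTL() time.Duration {
	ttl := 86400 // default: one day
	if v := os.Getenv("GPUAAS_NODE_SCHEDULER_LEASE_TTL_SECONDS"); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			ttl = n
		}
	}
	ttl = min(max(ttl, 300), 604800) // clamp to [5 min, 7 days]
	return time.Duration(ttl) * time.Second
}
```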
What's explicitly not in the slice path¶
| Item | Status |
|---|---|
| Sub-GPU partitioning (MIG / vGPU / MPS) | Designed — reserved as `gpu_partition`, `gpu_shared` |
| Multi-node slice clusters | Designed — non-goal for v1 |
| Live migration | Not supported (PCI passthrough) |
| Secure Boot / TPM | Disabled (`loader_secure=no`, `--tpm=none`) |
| Raw VNC console | Disabled (`--graphics=none`) — terminal gateway only |
| Full-reimage isolation | Designed — gated by `maas.enabled=true` |
| Driver-baked golden images | Catalog supports it; build pipeline designed |
Where to look next¶
- Capacity shapes & SKUs
- Node-agent task catalog
- GPU slice trail — full curated reading path
- Position vs other clouds
- Design intent (source): Allocation Capacity Shapes and GPU Slices v1
- Decisions (source): GPU Slice End-to-End Readiness Decisions v1
- Networking (source): Slice Networking Architecture v1
- Telemetry (source): Slice Guest Telemetry and Benchmark v1