GPU slice (as-built)

Implemented

Source: cmd/node-agent/slice_*.go · packages/services/provisioning/orchestrator/service.go:1499-1915 · scripts/seed.sql

This page describes the slice path as the code actually runs it. For design intent and proposals, see Allocation Capacity Shapes and GPU Slices v1 (source).

At a glance

flowchart LR
    U([Tenant]) -->|POST /allocations<br/>sku=h200-sxm-slice<br/>gpus=N| API[cmd/api]
    API --> ORCH[orchestrator]
    ORCH -->|reserve N slots<br/>NUMA-fit, fabric-unique| DB[(Postgres)]
    ORCH --> WK[provisioning-worker]
    WK -->|slice.vm_provision| AG[node-agent]
    AG -->|libvirt + cloud-init +<br/>VFIO + OVS| VM["Slice VM<br/>N GPUs · N NVMe · N IB VF"]
    VM -.guest telemetry POST.-> AG

Data model

erDiagram
    sku_catalog ||--o{ nodes : "sku"
    sku_catalog {
        text   sku
        text   capacity_shape "gpu_slice|baremetal"
        int[]  allowed_gpu_counts
        jsonb  resource_profile
    }
    nodes ||--o{ node_resource_slots : "hosts"
    nodes ||--o{ allocations : "runs"
    node_resource_slots ||--o{ allocation_resource_claims : "claimed"
    node_resource_slots {
        uuid   id
        int    slot_index
        text   status
        text   capacity_shape
        text   pci_address "GPU PCI"
        text   nvme_device
        text   mac_address
        inet   private_ip
        int    numa_node
        jsonb  capacity_metadata "storage_ownership, fabric_claim_mode, fabric_vf_pci_address, destructive_wipe_policy, sku"
    }
    allocation_resource_claims {
        uuid allocation_id
        uuid slot_id
        text claim_kind "slot|node_exclusive"
        text status
    }
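
The capacity_metadata JSONB column carries the per-slot fields the candidate filter below depends on. A minimal Go sketch of how that document could be modeled; the struct and Go field names are illustrative, and only the JSON keys come from the schema above.

package slice

// SlotCapacityMetadata mirrors the JSON keys stored in
// node_resource_slots.capacity_metadata. The struct name is illustrative;
// the keys are the ones listed in the ER diagram.
type SlotCapacityMetadata struct {
    StorageOwnership      string `json:"storage_ownership"`       // must be "slice" for slice slots
    FabricClaimMode       string `json:"fabric_claim_mode"`       // must be "per_slot_vf"
    FabricVFPCIAddress    string `json:"fabric_vf_pci_address"`   // fabric VF passed through to the VM
    DestructiveWipePolicy string `json:"destructive_wipe_policy"` // non-empty; consulted when wipe=true on release
    SKU                   string `json:"sku"`                     // e.g. "h200-sxm-slice"
}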

Orchestrator: slot reservation

Entrypoint: Service.reserveSliceSlotsForAllocation (service.go:1596). Runs inside the same DB transaction as allocation creation.

flowchart TB
    A[POST /allocations] --> B[BEGIN TX]
    B --> C[SELECT sku_catalog]
    C --> D{capacity_shape == gpu_slice?}
    D -- no --> ERR1[sku_unavailable]
    D -- yes --> E{gpus_total in allowed_counts?}
    E -- no --> ERR2[sku_unavailable]
    E -- yes --> F[listSlicePlacementCandidates<br/>filter slots + os_images join]
    F --> G{candidates exist?}
    G -- no --> ERR3[sku_unavailable]
    G -- yes --> H[rankSlicePlacementCandidates<br/>NUMA-fit, best-fit, slot index]
    H --> I[lockAvailableSliceSlots<br/>FOR UPDATE SKIP LOCKED]
    I --> J[selectSliceSlotIDs<br/>same NUMA when possible<br/>reject duplicate fabric VFs]
    J --> K[UPDATE slots status='reserved'<br/>INSERT allocation + N claims<br/>INSERT outbox row]
    K --> COMMIT[COMMIT]
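
A condensed Go sketch of the reserve-and-claim step, assuming a database/sql transaction. The helper name, the simplified SQL, and the literal 'active' claim status are assumptions; the real reserveSliceSlotsForAllocation also performs candidate listing, ranking, and the outbox insert inside the same transaction.

package orchestrator

import (
    "context"
    "database/sql"
    "fmt"
)

// reserveSlots is a simplified sketch of the reserve step: lock each chosen
// slot with SKIP LOCKED, mark it reserved, and record one claim per slot.
// It must run in the same transaction that inserts the allocation row.
func reserveSlots(ctx context.Context, tx *sql.Tx, allocationID string, slotIDs []string) error {
    for _, slotID := range slotIDs {
        // SKIP LOCKED lets a concurrent reservation pass over rows another
        // transaction already holds instead of blocking on them.
        var locked string
        err := tx.QueryRowContext(ctx,
            `SELECT id FROM node_resource_slots
              WHERE id = $1 AND status = 'available'
              FOR UPDATE SKIP LOCKED`, slotID).Scan(&locked)
        if err != nil {
            return fmt.Errorf("slot %s not lockable: %w", slotID, err)
        }
        if _, err := tx.ExecContext(ctx,
            `UPDATE node_resource_slots SET status = 'reserved' WHERE id = $1`, slotID); err != nil {
            return err
        }
        if _, err := tx.ExecContext(ctx,
            `INSERT INTO allocation_resource_claims (allocation_id, slot_id, claim_kind, status)
             VALUES ($1, $2, 'slot', 'active')`, allocationID, slotID); err != nil {
            return err
        }
    }
    return nil
}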

Slot filter (the SQL that defines schedulability)

The candidate query at service.go:1682-1755 requires (a compressed sketch of the query shape follows the list):

  • node.status = 'active', matching region, tenant boundary
  • slot.status = 'available', capacity_shape = 'gpu_slice', no parent slot
  • sharing_model = 'exclusive_device', max_claims = 1, compute_milli ≥ 1000
  • Non-empty pci_address, nvme_device, mac_address, private_ip
  • capacity_metadata.storage_ownership = 'slice'
  • capacity_metadata.destructive_wipe_policy != ''
  • capacity_metadata.fabric_claim_mode = 'per_slot_vf'
  • capacity_metadata.fabric_vf_pci_address != ''
  • A viable os_images row (cached or fetchable) for target='vm_slice' and the SKU
  • No other active slot claim sharing the same fabric_vf_pci_address on that node
  • No active node_exclusive claim on that node
  • No active baremetal allocation on that node
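
The bullets translate into a single SELECT over nodes, node_resource_slots, and existing claims. A compressed sketch of that shape, embedded as a Go constant; column names such as parent_slot_id, region, sharing_model, and compute_milli are simplifications of the real query, which also joins os_images and carries tenant parameters.

package orchestrator

// sliceSlotCandidateQuery is a compressed sketch of the filter; the real
// query also joins os_images and applies the tenant boundary.
const sliceSlotCandidateQuery = `
SELECT s.id, s.node_id, s.slot_index, s.numa_node,
       s.capacity_metadata->>'fabric_vf_pci_address' AS fabric_vf
  FROM node_resource_slots s
  JOIN nodes n ON n.id = s.node_id
 WHERE n.status = 'active'
   AND n.region = $1
   AND s.status = 'available'
   AND s.capacity_shape = 'gpu_slice'
   AND s.parent_slot_id IS NULL
   AND s.sharing_model = 'exclusive_device'
   AND s.max_claims = 1
   AND s.compute_milli >= 1000
   AND s.pci_address <> '' AND s.nvme_device <> ''
   AND s.mac_address <> '' AND s.private_ip IS NOT NULL
   AND s.capacity_metadata->>'storage_ownership' = 'slice'
   AND s.capacity_metadata->>'destructive_wipe_policy' <> ''
   AND s.capacity_metadata->>'fabric_claim_mode' = 'per_slot_vf'
   AND s.capacity_metadata->>'fabric_vf_pci_address' <> ''
   -- plus: a viable os_images row for target='vm_slice' and the SKU,
   -- no other active claim sharing the fabric VF on that node,
   -- no active node_exclusive claim, no active baremetal allocation
`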

Ranking heuristic

rankSlicePlacementCandidates (service.go:1889) scores candidates by (a comparator sketch follows the list):

  1. Single-NUMA fit: whether all gpus_total slots can be drawn from one NUMA group
  2. Remaining slots after the pick (best-fit; smaller is better)
  3. NUMA group count in the chosen subset
  4. First slot index (deterministic tie-break)
  5. Node ID (final tie-break)
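
A sketch of that ordering as a stable Go sort; the placementCandidate fields are illustrative stand-ins for what rankSlicePlacementCandidates actually computes per node.

package orchestrator

import "sort"

// placementCandidate is an illustrative shape for one node's candidate set.
type placementCandidate struct {
    NodeID         string
    SingleNUMAFit  bool // all requested slots available in one NUMA group
    RemainingSlots int  // slots left on the node after this pick (best-fit)
    NUMAGroupCount int  // NUMA groups spanned by the chosen subset
    FirstSlotIndex int  // lowest slot_index in the chosen subset
}

// rankCandidates orders candidates by the five tiers described above.
func rankCandidates(cands []placementCandidate) {
    sort.SliceStable(cands, func(i, j int) bool {
        a, b := cands[i], cands[j]
        if a.SingleNUMAFit != b.SingleNUMAFit {
            return a.SingleNUMAFit // NUMA-fitting nodes first
        }
        if a.RemainingSlots != b.RemainingSlots {
            return a.RemainingSlots < b.RemainingSlots // best-fit: fewer leftovers wins
        }
        if a.NUMAGroupCount != b.NUMAGroupCount {
            return a.NUMAGroupCount < b.NUMAGroupCount
        }
        if a.FirstSlotIndex != b.FirstSlotIndex {
            return a.FirstSlotIndex < b.FirstSlotIndex
        }
        return a.NodeID < b.NodeID // final deterministic tie-break
    })
}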

Node-agent: provision

Entrypoint: runSliceVMProvision (slice_vm.go:478). The provision path runs 17 phase-timed steps; a failure at any step triggers the deferred cleanup shown below.

flowchart TB
    P0([slice.vm_provision])
    P0 --> P1["1. lease_acquire<br/>/var/lib/gpuaas/node-scheduler/leases/{slot_id}.json"]
    P1 --> P2["2. host_dependencies<br/>apt-install qemu-kvm/libvirt/ovs/cloud-image-utils"]
    P2 --> P3[3. host_passthrough_check<br/>stat /dev/kvm + modprobe vfio-pci]
    P3 --> P4["4. vfio_bind_check<br/>GPU + fabric VF → vfio-pci<br/>(driver_override + bind)"]
    P4 --> P5["5. image_stat_download<br/>fetch from image_url if absent<br/>(64 GiB cap)"]
    P5 --> P6[6. image_digest_verify<br/>sha256 if fresh or untrusted]
    P6 --> P7["7. cloud_init_dir<br/>/var/lib/gpuaas/slices/{allocation_id}/"]
    P7 --> P8["8. terminal_key<br/>ensure host ed25519 keypair<br/>(/var/lib/gpuaas/terminal/id_ed25519)"]
    P8 --> P9[9. guest_telemetry_register<br/>per-allocation token]
    P9 --> P10[10. cloud_init_seed_files<br/>user-data + meta-data 0o600]
    P10 --> P11[11. cloud_localds → seed.iso]
    P11 --> P12[12. runtime_validate<br/>OVS br-exists + NVMe unmounted]
    P12 --> P13[13. dhcp_reservation<br/>/etc/dnsmasq.d/]
    P13 --> P14[14. image_write_convert<br/>qemu-img convert -O raw → boot NVMe]
    P14 --> P15[15. virt_install<br/>UEFI, host-passthrough,<br/>--host-device per GPU+VF]
    P15 --> P16["16. readiness<br/>SSH on private_ip + /var/lib/gpuaas/slice-ready"]
    P16 --> P17[17. performance_probe<br/>best-effort nvidia-smi snapshot]
    P17 --> OK([return success])

    P1 & P4 & P14 & P15 & P16 -.fail.-> CL[deferred cleanup:<br/>• drop DHCP reservation<br/>• unregister guest telemetry<br/>• release slot leases<br/>• RemoveAll cloud_init_dir]
    CL --> ERR([return error])

    classDef ok  fill:#e8f5e9,stroke:#2e7d32
    classDef wrn fill:#fff3e0,stroke:#e65100
    class OK ok
    class ERR,CL wrn
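
Step 4's driver_override + bind follows the standard PCI sysfs flow. A minimal sketch of such a helper; the name bindToVFIO is hypothetical, and the GPUAAS_SLICE_RUNTIME_VFIO_BIND gate plus error recovery are elided.

package agent

import (
    "fmt"
    "os"
    "path/filepath"
)

// bindToVFIO rebinds one PCI function (e.g. "0000:17:00.0") to vfio-pci via
// sysfs: set driver_override, unbind the current driver, then ask the kernel
// to re-probe so vfio-pci claims the device.
func bindToVFIO(pciAddr string) error {
    dev := filepath.Join("/sys/bus/pci/devices", pciAddr)

    if err := os.WriteFile(filepath.Join(dev, "driver_override"), []byte("vfio-pci"), 0o644); err != nil {
        return fmt.Errorf("driver_override %s: %w", pciAddr, err)
    }
    // Unbind whatever driver currently owns the device, if any.
    if _, err := os.Stat(filepath.Join(dev, "driver")); err == nil {
        if err := os.WriteFile(filepath.Join(dev, "driver/unbind"), []byte(pciAddr), 0o200); err != nil {
            return fmt.Errorf("unbind %s: %w", pciAddr, err)
        }
    }
    // drivers_probe rebinds the device, honoring driver_override.
    if err := os.WriteFile("/sys/bus/pci/drivers_probe", []byte(pciAddr), 0o200); err != nil {
        return fmt.Errorf("drivers_probe %s: %w", pciAddr, err)
    }
    return nil
}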

virt-install command shape

buildSliceVMVirtInstallArgs (slice_vm.go:1240) constructs:

virt-install \
  --name=<vm_name> \
  --memory=<sum_of_slot_mib>       \
  --vcpus=<sum_of_slot_vcpu>       \
  --cpu=host-passthrough,cache.mode=passthrough \
  --os-variant=ubuntu24.04         \
  --controller=scsi,model=virtio-scsi \
  --disk=path=<boot_nvme>,format=raw,bus=scsi,target.rotation_rate=1,driver.cache=none,driver.io=native,driver.discard=unmap,driver.detect_zeroes=unmap \
  --disk=path=<seed_iso>,device=cdrom \
  --network=bridge=ovsbr0,virtualport_type=openvswitch,mac=<boot_mac> \
  --graphics=none --noautoconsole \
  --boot=uefi,loader_secure=no --tpm=none --import \
  [--disk for each additional slot NVMe] \
  [--numatune=<boot.numa_node>] \
  --host-device=pci_<gpu>  --host-device=pci_<fabric_vf>   # per slot

Notes:

  • Multiple slots → one libvirt domain with N disks and N PCI host-device pairs.
  • Memory / vCPU are summed across slots (see the sketch after these notes).
  • loader_secure=no because the driver auto-install path doesn't have a signed module.
  • --graphics=none — no SPICE/VNC; console only via the terminal gateway.
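
A trimmed sketch of the summation and per-slot host-device fan-out described in the notes; the sliceSlot struct is illustrative, and the real builder at slice_vm.go:1240 also handles disks, NUMA tuning, and the remaining flags.

package agent

import "fmt"

type sliceSlot struct {
    MemoryMiB       int
    VCPUs           int
    GPUPCIAddress   string
    FabricVFAddress string
}

// virtInstallResourceArgs sums memory/vCPU across slots and emits one
// --host-device pair (GPU + fabric VF) per slot.
func virtInstallResourceArgs(slots []sliceSlot) []string {
    var memMiB, vcpus int
    for _, s := range slots {
        memMiB += s.MemoryMiB
        vcpus += s.VCPUs
    }
    args := []string{
        fmt.Sprintf("--memory=%d", memMiB),
        fmt.Sprintf("--vcpus=%d", vcpus),
    }
    for _, s := range slots {
        args = append(args,
            "--host-device=pci_"+s.GPUPCIAddress,
            "--host-device=pci_"+s.FabricVFAddress,
        )
    }
    return args
}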

Networking

flowchart LR
    subgraph Host[Slice host]
        UP[Uplink NIC<br/>default route]
        IPT[iptables NAT<br/>POSTROUTING MASQUERADE<br/>10.100.0.0/24]
        OVS{{OVS bridge ovsbr0<br/>10.100.0.1/24}}
        DNS[dnsmasq<br/>per-VM MAC→IP]
        IB[IPoIB ibp*<br/>192.168.x.0/24]
        TELE[node-agent<br/>:9110/internal/v1/guest-telemetry]
    end
    subgraph VM["Slice VM (slot 0..N)"]
        E1[eth0 virtio<br/>10.100.0.10]
        IB1[ib VF passthrough]
        GPU1[GPU vfio-pci]
        NVME1[NVMe passthrough]
        HELP[gpuaas-metrics-helper]
    end

    E1 <--> OVS
    OVS <--> IPT
    IPT <--> UP
    OVS -.dhcp.-> DNS
    IB1 <-.fabric.-> IB
    HELP -. POST + per-alloc token .-> TELE
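
A sketch of the guest helper's telemetry POST, assuming a bearer-style Authorization header and a JSON body; the payload fields and header scheme are assumptions, and only the gateway port and path come from the diagram.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// telemetrySample is an illustrative payload; the real helper's fields come
// from the rendered metrics-helper script, not from this sketch.
type telemetrySample struct {
    AllocationID string  `json:"allocation_id"`
    GPUUtilPct   float64 `json:"gpu_util_pct"`
    CollectedAt  string  `json:"collected_at"`
}

// postSample sends one sample to the host gateway, authenticated with the
// per-allocation token registered at provision time.
func postSample(gatewayAddr, token string, s telemetrySample) error {
    body, err := json.Marshal(s)
    if err != nil {
        return err
    }
    url := fmt.Sprintf("http://%s/internal/v1/guest-telemetry", gatewayAddr)
    req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
    if err != nil {
        return err
    }
    req.Header.Set("Authorization", "Bearer "+token)
    req.Header.Set("Content-Type", "application/json")

    client := &http.Client{Timeout: 5 * time.Second}
    resp, err := client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode >= 300 {
        return fmt.Errorf("telemetry POST: %s", resp.Status)
    }
    return nil
}
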
| Plane | Path | Source of truth |
| --- | --- | --- |
| Management | BF3 VF → OVS ovsbr0 → VM virtio NIC | dnsmasq reservation + slot row mac_address/private_ip |
| Workload fabric | Mellanox VF → VM (passthrough) | slot row capacity_metadata.fabric_vf_pci_address |
| Telemetry | VM helper → host gateway :9110 | per-allocation token registered at provision time |
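
The per-VM reservation in /etc/dnsmasq.d/ is a single dhcp-host line keyed on the slot row's mac_address and private_ip. A minimal sketch; the helper name is illustrative, and the dnsmasq reload that presumably follows is omitted.

package agent

import (
    "fmt"
    "os"
    "path/filepath"
)

// writeDHCPReservation pins the VM's MAC to its slot-assigned private IP so
// dnsmasq always hands the same lease to the slice VM's virtio NIC.
func writeDHCPReservation(vmName, mac, privateIP string) error {
    conf := fmt.Sprintf("dhcp-host=%s,%s,%s\n", mac, privateIP, vmName)
    path := filepath.Join("/etc/dnsmasq.d", vmName+".conf")
    return os.WriteFile(path, []byte(conf), 0o644)
}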

Release

runSliceVMRelease (slice_vm.go:1175):

flowchart TB
    R0([slice.vm_release]) --> R1["virsh shutdown &lt;vm&gt;"]
    R1 --> R2{running after<br/>graceful_timeout?}
    R2 -- yes --> R3["virsh destroy<br/>hard_stopped=true"]
    R2 -- no --> R4["virsh undefine<br/>(--nvram fallback)"]
    R3 --> R4
    R4 --> R5[re-bind GPU + fabric VF<br/>to vfio-pci]
    R5 --> R6[RemoveAll cloud_init_dir<br/>remove DHCP reservation]
    R6 --> R7{wipe=true?}
    R7 -- yes --> R8[zero each NVMe]
    R7 -- no --> R9
    R8 --> R9[release slot leases<br/>unregister guest telemetry]
    R9 --> DONE([return result])

Output keys: vm_name, released=true, hard_stopped, wiped, slot_count, leases_released.
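
A condensed sketch of the graceful-then-hard stop at the top of the release flow, assuming virsh is driven via exec with a polling loop on domstate; the helper name and grace handling are illustrative.

package agent

import (
    "context"
    "os/exec"
    "strings"
    "time"
)

// stopDomain asks for a graceful shutdown, then falls back to virsh destroy
// if the domain is still running when the grace period expires. It reports
// hardStopped=true when the fallback was used.
func stopDomain(ctx context.Context, vmName string, grace time.Duration) (hardStopped bool, err error) {
    _ = exec.CommandContext(ctx, "virsh", "shutdown", vmName).Run() // best effort

    deadline := time.Now().Add(grace)
    for time.Now().Before(deadline) {
        out, _ := exec.CommandContext(ctx, "virsh", "domstate", vmName).Output()
        if !strings.Contains(string(out), "running") {
            return false, nil
        }
        time.Sleep(2 * time.Second)
    }
    // Still running after the grace period: pull the plug.
    if err := exec.CommandContext(ctx, "virsh", "destroy", vmName).Run(); err != nil {
        return true, err
    }
    return true, nil
}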

Filesystem layout (on a slice host)

/var/lib/gpuaas/
├── slice-images/                       # image cache
├── slices/
│   └── <allocation_id>/                # per-allocation cloud-init seed
│       ├── user-data.yaml              0o600
│       ├── meta-data.yaml              0o600
│       └── seed.iso                    0o644
├── node-scheduler/
│   └── leases/<slot_id>.json           0o600  (per-slot mutex)
├── terminal/
│   ├── id_ed25519                      0o600  (host terminal key)
│   └── id_ed25519.pub                  0o644
└── site-bootstrap/
    └── h200-slice-vm.reboot-required   (touched if host bootstrap pending)

/etc/dnsmasq.d/<vm_name>.conf           (per-slice MAC→IP reservation)
/etc/netplan/60-ipoib.yaml              (IPoIB host config)
/etc/gpuaas/metrics-helper.env          (rendered into the guest cloud-init)
/usr/local/bin/gpuaas-metrics-helper    (rendered into the guest cloud-init)
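
The lease file under node-scheduler/leases/ acts as a per-slot mutex. A sketch of acquiring it with O_EXCL and stamping a TTL; the lease fields are assumptions, and only the path shape and 0o600 mode come from the layout above.

package agent

import (
    "encoding/json"
    "fmt"
    "os"
    "path/filepath"
    "time"
)

// slotLease is an illustrative lease payload.
type slotLease struct {
    SlotID       string    `json:"slot_id"`
    AllocationID string    `json:"allocation_id"`
    AcquiredAt   time.Time `json:"acquired_at"`
    ExpiresAt    time.Time `json:"expires_at"`
}

// acquireSlotLease creates leases/<slot_id>.json with O_EXCL so a second
// provision attempt for the same slot fails fast instead of racing.
func acquireSlotLease(dir, slotID, allocationID string, ttl time.Duration) error {
    path := filepath.Join(dir, slotID+".json")
    f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o600)
    if err != nil {
        return fmt.Errorf("slot %s already leased: %w", slotID, err)
    }
    defer f.Close()

    now := time.Now().UTC()
    return json.NewEncoder(f).Encode(slotLease{
        SlotID:       slotID,
        AllocationID: allocationID,
        AcquiredAt:   now,
        ExpiresAt:    now.Add(ttl),
    })
}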

Environment variables on the node-agent

| Var | Default | Effect |
| --- | --- | --- |
| GPUAAS_SLICE_RUNTIME_VFIO_BIND | unset (off) | When set to 1, allows runtime rebinding of GPU/fabric VFs to vfio-pci; with the default off, a wrong driver is a hard error. |
| GPUAAS_NODE_SCHEDULER_LEASE_TTL_SECONDS | 86400 | Slot-lease TTL in seconds; clamped to [300, 604800]. |
| GPUAAS_SLICE_TERMINAL_SSH_KEY_PATH | /var/lib/gpuaas/terminal/id_ed25519 | Overrides the host terminal key location. |
| GPUAAS_TELEMETRY_ADDR | :9110 | Address the guest helper POSTs telemetry to. |
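
A small sketch of reading and clamping the lease TTL; the helper name is illustrative, and the default plus bounds come from the table above.

package agent

import (
    "os"
    "strconv"
    "time"
)

// leaseTTL reads GPUAAS_NODE_SCHEDULER_LEASE_TTL_SECONDS, defaulting to
// 86400 seconds and clamping the result into [300, 604800].
func leaseTTL() time.Duration {
    const (
        defTTL = 86400
        minTTL = 300
        maxTTL = 604800
    )
    secs := defTTL
    if v := os.Getenv("GPUAAS_NODE_SCHEDULER_LEASE_TTL_SECONDS"); v != "" {
        if n, err := strconv.Atoi(v); err == nil {
            secs = n
        }
    }
    if secs < minTTL {
        secs = minTTL
    }
    if secs > maxTTL {
        secs = maxTTL
    }
    return time.Duration(secs) * time.Second
}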

What's explicitly not in the slice path

| Item | Status |
| --- | --- |
| Sub-GPU partitioning (MIG / vGPU / MPS) | Designed; reserved as gpu_partition, gpu_shared |
| Multi-node slice clusters | Designed; a non-goal for v1 |
| Live migration | Not supported (PCI passthrough) |
| Secure Boot / TPM | Disabled (loader_secure=no, --tpm=none) |
| Raw VNC console | Disabled (--graphics=none); terminal gateway only |
| Full-reimage isolation | Designed; gated by maas.enabled=true |
| Driver-baked golden images | Catalog supports it; build pipeline designed |

Where to look next