Trail: GPU Slice¶
End-to-end reading path through every layer of the GPU slice product. Each step carries its own diagram and concrete facts; nothing requires bouncing out to source docs to follow the story.
Trail map¶
flowchart TB
classDef impl fill:#d1e7dd,stroke:#0a3622,color:#0a3622
classDef des fill:#fff3cd,stroke:#332701,color:#332701
classDef dec fill:#cfe2ff,stroke:#0e2240,color:#0e2240
classDef run fill:#e9d6ff,stroke:#1e1530,color:#1e1530
classDef cmp fill:#f8d7da,stroke:#42101e,color:#42101e
S1[1. What is a slice]:::impl --> S2[2. Capacity shapes & SKUs]:::impl
S2 --> S3[3. Slot model & data]:::impl
S3 --> S4[4. Orchestrator placement]:::impl
S4 --> S5[5. Node-agent task catalog]:::impl
S5 --> S6[6. Provisioning phases]:::impl
S6 --> S7[7. Networking model]:::impl
S7 --> S8[8. Telemetry model]:::des
S8 --> S9[9. Release & cleanup]:::impl
S9 --> S10[10. Design intent]:::dec
S10 --> S11[11. Runbooks]:::run
S11 --> S12[12. Position vs other clouds]:::cmp
1. What is a slice¶
Implemented
A GPU slice is a tenant VM that owns one or more contiguous slot bundles on a single physical GPU host. Each slot bundle is one GPU PCI device + one Mellanox SR-IOV VF + one NVMe namespace + one private IP + one MAC. The slice runs Ubuntu, accessed via SSH or browser terminal, with the GPU(s) visible inside the guest via VFIO passthrough.
flowchart LR
subgraph Host[H200 host - shared across tenants]
direction TB
GPU0[GPU 0<br/>PCI 0000:1b:00.0]:::gpu
GPU1[GPU 1<br/>PCI 0000:3c:00.0]:::gpu
VF0[IB VF 0<br/>PCI 0000:1a:00.2]:::vf
VF1[IB VF 1<br/>PCI 0000:3a:00.2]:::vf
N0[NVMe 0<br/>/dev/disk/by-id/nvme-eui.aaa]:::nvme
N1[NVMe 1<br/>/dev/disk/by-id/nvme-eui.bbb]:::nvme
end
subgraph T1[Tenant A slice VM]
VM1[Ubuntu guest<br/>nvidia-smi shows 1 GPU<br/>IB visible as ibp]:::vm
end
subgraph T2[Tenant B slice VM]
VM2[Ubuntu guest<br/>nvidia-smi shows 1 GPU<br/>separate IB VF]:::vm
end
GPU0 -.vfio passthrough.-> VM1
VF0 -.vfio passthrough.-> VM1
N0 -.virtio-scsi.-> VM1
GPU1 -.vfio passthrough.-> VM2
VF1 -.vfio passthrough.-> VM2
N1 -.virtio-scsi.-> VM2
classDef gpu fill:#fff3e0,stroke:#e65100
classDef vf fill:#e3f2fd,stroke:#1565c0
classDef nvme fill:#eceff1,stroke:#455a64
classDef vm fill:#e8f5e9,stroke:#2e7d32
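In code terms the bundle is a single unit; a minimal sketch in Go (type and field names are hypothetical, not the repo's actual model):

```go
// SlotBundle sketches the per-slot resource bundle described above:
// everything a slice VM needs travels together as one slot row.
type SlotBundle struct {
	SlotIndex   int    // position on the host; drives MAC/IP derivation
	GPUPCI      string // e.g. "0000:1b:00.0", VFIO passthrough
	FabricVFPCI string // Mellanox SR-IOV VF, also VFIO passthrough
	NVMeDevice  string // "/dev/disk/by-id/nvme-eui...", attached via virtio-scsi
	MACAddress  string // "52:54:xx:xx:xx:xx"
	PrivateIP   string // "10.100.0.{10+i}"
	NUMANode    int    // input to single-NUMA placement ranking
}
```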
Why a slice (not a container or whole node):
| Approach | Isolation | Multi-tenant economics | Used here? |
|---|---|---|---|
| Container | Shared kernel; cgroups + namespaces | Best (no virt overhead) | No — too weak for GPU |
| Whole node | Strongest (no shared anything) | Worst (whole node per tenant) | Yes, as baremetal shape |
| VM with passthrough | Separate kernel + dedicated passthrough HW | Good | Yes — this is gpu_slice |
| MIG / vGPU | Hardware partition or vendor driver | Best of both | Not v1 |
2. Capacity shapes & SKUs¶
Implemented
flowchart TB
REQ([POST /api/v1/allocations<br/>sku=h200-sxm-slice<br/>gpus_total=N])
REQ --> CHECK{capacity_shape}
CHECK -- gpu_slice --> SLICE[Reserve N slot rows<br/>same node, NUMA-fit]
CHECK -- baremetal --> BM[Lock entire node]
SLICE --> PROFILE{Resolve VM profile}
PROFILE --> P1[h200_1g_24c_64g<br/>1 GPU / 24 vCPU / 64 GiB]
PROFILE --> P2[h200_2g_48c_128g<br/>2 GPU / 48 vCPU / 128 GiB]
PROFILE --> P4[h200_4g_96c_256g<br/>4 GPU / 96 vCPU / 256 GiB]
PROFILE --> P8[h200_8g_192c_512g<br/>8 GPU / 192 vCPU / 512 GiB]
| SKU | Shape | allowed_gpu_counts | Hourly | Source |
|---|---|---|---|---|
| h200-sxm-slice | gpu_slice | [1, 2, 4, 8] | 450 ¢/hr | scripts/seed.sql:18 |
| h200-sxm-baremetal-8g | baremetal | n/a (fixed 8) | (operator-set) | seed |
Each SKU carries a resource_profile jsonb that names the slice VM profile family. The orchestrator stamps the chosen profile onto the allocation snapshot at create time so later seed edits don't mutate running allocations.
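A hedged sketch of that resolution and create-time stamping, with illustrative types (the real logic lives in the orchestrator):

```go
import (
	"fmt"
	"slices"
)

// SKURow mirrors the sku_catalog columns this section describes;
// ResourceProfile is simplified here to a gpu-count -> profile-name map.
type SKURow struct {
	SKU              string
	CapacityShape    string
	AllowedGPUCounts []int
	ResourceProfile  map[int]string // e.g. 2 -> "h200_2g_48c_128g"
}

// resolveSliceProfile validates the request against the SKU and returns
// the profile the orchestrator stamps onto the allocation snapshot.
func resolveSliceProfile(row SKURow, gpusTotal int) (string, error) {
	if row.CapacityShape != "gpu_slice" {
		return "", fmt.Errorf("sku %s: not a gpu_slice shape", row.SKU)
	}
	if !slices.Contains(row.AllowedGPUCounts, gpusTotal) {
		return "", fmt.Errorf("gpus_total=%d not in %v", gpusTotal, row.AllowedGPUCounts)
	}
	profile, ok := row.ResourceProfile[gpusTotal]
	if !ok {
		return "", fmt.Errorf("no VM profile for %d GPUs", gpusTotal)
	}
	return profile, nil // stamped at create time; later seed edits can't touch it
}
```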
→ Detail: Capacity shapes & SKUs
3. Slot model & data¶
Implemented
erDiagram
nodes ||--o{ node_resource_slots : "hosts"
nodes ||--o{ allocations : "currently runs"
sku_catalog ||--o{ allocations : "billed against"
allocations ||--o{ allocation_resource_claims : "binds"
node_resource_slots ||--o{ allocation_resource_claims : "claimed by"
os_images ||--o{ node_image_cache : "cached on nodes"
nodes ||--o{ node_image_cache : "caches"
nodes {
uuid id PK
text status "active|drained|unavailable"
text region_code
text sku FK
uuid org_id "nullable; null=shared"
}
node_resource_slots {
uuid id PK
uuid node_id FK
int slot_index
text status "available|reserved|provisioning|active|releasing|cleanup|disabled"
text capacity_shape "gpu_slice"
text pci_address "GPU PCI"
text nvme_device "/dev/disk/by-id/..."
text mac_address "52:54:xx:xx:xx:xx"
inet private_ip "10.100.0.{10+i}"
int numa_node
jsonb capacity_metadata "storage_ownership, fabric_claim_mode, fabric_vf_pci_address, destructive_wipe_policy, sku"
}
allocation_resource_claims {
uuid allocation_id FK
uuid slot_id FK
text claim_kind "slot|node_exclusive"
text status "reserved|provisioning|active|releasing"
}
sku_catalog {
text sku PK
text capacity_shape
int_array allowed_gpu_counts
jsonb resource_profile
}
os_images {
text slug PK
text target "vm_slice|baremetal"
text_array compatible_skus
text digest_sha256
}
Required capacity_metadata keys before a slot becomes schedulable (the orchestrator filter rejects rows missing any):
| Key | Required value | Why |
|---|---|---|
| storage_ownership | slice | NVMe is dedicated to tenant slice use, not a host share |
| destructive_wipe_policy | non-empty | Slot has an erase contract for tenant data on release |
| fabric_claim_mode | per_slot_vf | Each slot must own its VF; proves no shared parent BF/IB device is claimed concurrently |
| fabric_vf_pci_address | non-empty PCI | The actual VF that gets passed through |
| sku | optional | Per-slot SKU override |
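The real filter runs in SQL inside the candidate query, but the gate reduces to a few checks; a minimal sketch, assuming the capacity_metadata jsonb decodes to a string map:

```go
// slotSchedulable mirrors the orchestrator filter described above.
func slotSchedulable(meta map[string]string) bool {
	switch {
	case meta["storage_ownership"] != "slice":
		return false // NVMe must be dedicated to the tenant slice, not a host share
	case meta["destructive_wipe_policy"] == "":
		return false // no erase contract means no tenant data allowed on the slot
	case meta["fabric_claim_mode"] != "per_slot_vf":
		return false // each slot must own its VF, not a shared parent device
	case meta["fabric_vf_pci_address"] == "":
		return false // nothing to pass through
	}
	return true // "sku" stays optional: per-slot override only
}
```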
4. Orchestrator placement¶
Implemented
sequenceDiagram
autonumber
participant API as cmd/api
participant ORCH as orchestrator
participant DB as Postgres
participant OR as outbox-relay
API->>ORCH: CreateRequested(sku, gpus_total)
ORCH->>DB: BEGIN
ORCH->>DB: SELECT sku_catalog<br/>(capacity_shape, allowed_gpu_counts)
Note over ORCH: validate shape == gpu_slice<br/>+ gpus_total in allowed_counts
ORCH->>DB: listSlicePlacementCandidates(filter + os_images join)
Note over ORCH: candidates grouped by node,<br/>ordered NUMA + slot_index
ORCH->>ORCH: rankSlicePlacementCandidates<br/>(single-NUMA fit > remaining slots > NUMA groups > slot_index)
loop for each ranked candidate
ORCH->>DB: SELECT slots FOR UPDATE SKIP LOCKED
ORCH->>ORCH: selectSliceSlotIDs(reject duplicate fabric VFs)
alt enough slots locked
ORCH->>DB: UPDATE slots SET status='reserved'
ORCH->>DB: INSERT allocation + N claims
ORCH->>DB: INSERT outbox: provisioning.requested
ORCH->>DB: COMMIT
ORCH-->>API: allocation_id
else not enough
ORCH->>DB: ROLLBACK locked slots
Note over ORCH: try next candidate
end
end
OR->>DB: poll outbox FOR UPDATE SKIP LOCKED
OR-->>OR: publish provisioning.requested → NATS
The candidate query at service.go:1682 enforces every invariant: region match, tenant boundary, complete slot metadata, no overlapping fabric VF claim, no active baremetal on the node, and a viable os_images row for the SKU.
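The SKIP LOCKED step is what keeps concurrent placements safe: a loser simply sees fewer lockable rows and moves on to the next candidate. A hedged sketch of the claim query (shape illustrative, not the literal service.go:1682 SQL):

```go
import (
	"context"
	"database/sql"
)

// lockAvailableSlots locks candidate rows without blocking concurrent
// placements; rows already locked by another transaction are skipped.
func lockAvailableSlots(ctx context.Context, tx *sql.Tx, nodeID string) ([]string, error) {
	rows, err := tx.QueryContext(ctx, `
		SELECT id
		FROM node_resource_slots
		WHERE node_id = $1 AND status = 'available'
		ORDER BY numa_node, slot_index
		FOR UPDATE SKIP LOCKED`, nodeID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}
```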
Ranking scores, in priority order (sketched as a comparator below):
- Single-NUMA fit — can gpus_total slots come from one NUMA group? (Better cross-GPU bandwidth.)
- Remaining slots after the pick (best-fit; smaller is better — preserves clean larger blocks).
- NUMA group count in the chosen subset.
- First slot index (deterministic tie-break).
- Node ID (final tie-break).
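Expressed as a comparator, the ordering looks roughly like this (the candidate struct is illustrative, not the repo's type):

```go
import "sort"

type placementCandidate struct {
	SingleNUMAFit  bool   // all gpus_total slots fit in one NUMA group
	RemainingSlots int    // free slots left on the node after this pick
	NUMAGroups     int    // NUMA groups spanned by the chosen subset
	FirstSlotIndex int
	NodeID         string
}

// rank orders candidates exactly as the list above: NUMA fit first,
// then best-fit, then fewer NUMA groups, then deterministic tie-breaks.
func rank(cands []placementCandidate) {
	sort.Slice(cands, func(i, j int) bool {
		a, b := cands[i], cands[j]
		if a.SingleNUMAFit != b.SingleNUMAFit {
			return a.SingleNUMAFit
		}
		if a.RemainingSlots != b.RemainingSlots {
			return a.RemainingSlots < b.RemainingSlots
		}
		if a.NUMAGroups != b.NUMAGroups {
			return a.NUMAGroups < b.NUMAGroups
		}
		if a.FirstSlotIndex != b.FirstSlotIndex {
			return a.FirstSlotIndex < b.FirstSlotIndex
		}
		return a.NodeID < b.NodeID
	})
}
```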
5. Node-agent task catalog¶
Implemented
flowchart LR
classDef slice fill:#fff3e0,stroke:#e65100
classDef bm fill:#e3f2fd,stroke:#1565c0
classDef diag fill:#f3e5f5,stroke:#6a1b9a
subgraph TaskTypes[Typed task contract — cmd/node-agent/agent.go]
T1[slice.topology_discover]:::slice
T2[slice.vm_provision]:::slice
T3[slice.vm_release]:::slice
T4[allocation.provision_user]:::bm
T5[allocation.deprovision_user]:::bm
T6[diag.health_probe]:::diag
end
T1 -.outputs.-> O1["gpu_devices<br/>fabric_devices<br/>nvme_devices<br/>candidate_slots<br/>blockers<br/>approval_required=true"]
T2 -.outputs.-> O2["vm_name, private_ip,<br/>ssh_port=22,<br/>readiness.ssh_ready,<br/>readiness.guest_ready,<br/>timings.phase_ms"]
T3 -.outputs.-> O3["released, hard_stopped,<br/>wiped, leases_released"]
Every task follows the same contract: typed inputs, signed at the API boundary, pulled by the node over mTLS, structured outputs. No remote shell.
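A hedged sketch of what such an envelope might look like on the wire (field names illustrative, not the literal agent.go contract):

```go
import "encoding/json"

// TaskEnvelope is the shape the node-agent pulls over mTLS: a signed,
// typed message rather than a shell command.
type TaskEnvelope struct {
	TaskID    string          `json:"task_id"`
	Type      string          `json:"type"`      // e.g. "slice.vm_provision"
	Inputs    json.RawMessage `json:"inputs"`    // decoded per task type
	Signature []byte          `json:"signature"` // minted at the API boundary
}

// SliceVMProvisionInputs shows how one task type's inputs stay strongly
// typed end to end.
type SliceVMProvisionInputs struct {
	AllocationID string   `json:"allocation_id"`
	SlotIDs      []string `json:"slot_ids"`
	ImageSlug    string   `json:"image_slug"`
}
```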
→ Full task catalog: Node-agent task catalog
6. Provisioning phases¶
Implemented
runSliceVMProvision at cmd/node-agent/slice_vm.go:478 — 17 phase-timed steps. Failure between phases 1–16 triggers deferred cleanup.
stateDiagram-v2
[*] --> lease_acquire
lease_acquire --> host_dependencies: per-slot JSON lease created
host_dependencies --> host_passthrough_check: apt-get done
host_passthrough_check --> vfio_bind_check: /dev/kvm OK
vfio_bind_check --> image_stat_download: GPUs + fabric VFs bound to vfio-pci
image_stat_download --> image_digest_verify: image present or downloaded
image_digest_verify --> cloud_init_dir: sha256 verified
cloud_init_dir --> terminal_key: dir ready
terminal_key --> guest_telemetry_register: host ed25519 keypair ready
guest_telemetry_register --> cloud_init_seed_files: per-allocation token minted
cloud_init_seed_files --> cloud_localds: user-data + meta-data written
cloud_localds --> runtime_validate: seed.iso packed
runtime_validate --> dhcp_reservation: OVS bridge exists + NVMe unmounted
dhcp_reservation --> image_write_convert: per-VM MAC→IP in dnsmasq
image_write_convert --> virt_install: qemu-img convert → boot NVMe
virt_install --> readiness: libvirt domain running
readiness --> performance_probe: SSH + guest marker reached
performance_probe --> [*]: success
lease_acquire --> cleanup: error
vfio_bind_check --> cleanup: error
image_write_convert --> cleanup: error
virt_install --> cleanup: error
readiness --> cleanup: error
cleanup --> [*]: deferred cleanup<br/>drop DHCP, unregister telemetry,<br/>release leases, rm cloud_init_dir
Key phase facts:
| Phase | Detail |
|---|---|
| lease_acquire | File-based exclusive-create JSON lease per slot under /var/lib/gpuaas/node-scheduler/leases/{slot_id}.json. TTL default 24 h. |
| vfio_bind_check | GPU + fabric VF must be bound to vfio-pci. With GPUAAS_SLICE_RUNTIME_VFIO_BIND unset, a wrong driver is a hard error → operator must run host bootstrap. |
| image_stat_download | Image must live under /var/lib/gpuaas/slice-images or /var/lib/libvirt/images. 64 GiB cap on download. |
| terminal_key | Host's per-instance ed25519 pubkey is injected as authorized_key — the terminal gateway uses this to broker SSH. |
| image_write_convert | qemu-img convert -O raw <image> <boot NVMe> — destructive write to the slot's NVMe. |
| virt_install | UEFI, host-passthrough CPU, virtio-scsi, OVS bridge, --host-device per GPU + VF. loader_secure=no, --tpm=none, --graphics=none. |
| readiness | SSH on private_ip:22 + presence of /var/lib/gpuaas/slice-ready. Bounded by graceful_timeout_seconds (30–900, default 300). |
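The lease_acquire contract is worth seeing concretely: exclusive-create makes the filesystem the arbiter, so two provisions can never hold the same slot. A sketch under assumed lease-body fields:

```go
import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// acquireSlotLease creates the per-slot lease file with O_EXCL: if the
// file already exists, another task owns the slot and this one fails fast.
func acquireSlotLease(slotID, taskID string) (string, error) {
	path := filepath.Join("/var/lib/gpuaas/node-scheduler/leases", slotID+".json")
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if err != nil {
		return "", fmt.Errorf("slot %s already leased: %w", slotID, err)
	}
	defer f.Close()
	lease := map[string]any{
		"task_id":    taskID,
		"expires_at": time.Now().Add(24 * time.Hour), // default TTL 24 h
	}
	if err := json.NewEncoder(f).Encode(lease); err != nil {
		os.Remove(path) // never leave a half-written lease behind
		return "", err
	}
	return path, nil
}
```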
7. Networking model¶
Implemented
flowchart TB
subgraph Host[Slice host networking]
direction TB
UP[Uplink NIC<br/>default route]
IPT[iptables NAT<br/>POSTROUTING MASQUERADE<br/>src 10.100.0.0/24]
OVS{{OVS bridge ovsbr0<br/>10.100.0.1/24}}
DNS[dnsmasq<br/>per-VM MAC→IP reservation<br/>etc/dnsmasq.d/]
IB[IPoIB ibp*<br/>192.168.x.0/24<br/>netplan 60-ipoib.yaml]
TELE[node-agent<br/>:9110/internal/v1/guest-telemetry]
end
subgraph VM[Slice VM]
E0[eth0 virtio<br/>private_ip 10.100.0.10]
IB0[ib0 SR-IOV VF<br/>passthrough]
GPU[GPU vfio-pci]
NVME[NVMe virtio-scsi]
HELP[guest helper<br/>gpuaas-metrics-helper]
end
E0 <--> OVS
OVS <--> IPT
IPT <--> UP
OVS -.dhcp lease.-> DNS
IB0 <-.RDMA.-> IB
HELP -. POST + per-alloc token .-> TELE
| Plane | Path | Source of truth |
|---|---|---|
| Management/public | OVS bridge ovsbr0 → virtio NIC | mac_address + private_ip on slot row; dnsmasq config writes the reservation |
| Workload fabric | Per-slot Mellanox VF passthrough | capacity_metadata.fabric_vf_pci_address |
| Telemetry channel | VM helper → http://10.100.0.1:9110 | per-allocation token registered at provision time |
| East-west across slices | Denied by default | OVS bridge isolation (no inter-VM flows defined) |
Slot 0 (the boot slot) gets MAC 52:54: + the first 4 bytes of sha256(node_id:0) and IP 10.100.0.10; slot N gets IP 10.100.0.{10+N}. Addressing is deterministic, so reboots and re-provisions always land on the same MAC→IP reservation.
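A sketch of that derivation; the exact hash-input encoding ("node_id:slot_index") is an assumption read off the formula above:

```go
import (
	"crypto/sha256"
	"fmt"
)

// slotMAC derives the stable per-slot MAC: locally administered 52:54
// prefix plus the first 4 bytes of sha256("<node_id>:<slot_index>").
func slotMAC(nodeID string, slotIndex int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", nodeID, slotIndex)))
	return fmt.Sprintf("52:54:%02x:%02x:%02x:%02x", sum[0], sum[1], sum[2], sum[3])
}

// slotIP maps slot N to 10.100.0.{10+N}, matching the dnsmasq reservation.
func slotIP(slotIndex int) string {
	return fmt.Sprintf("10.100.0.%d", 10+slotIndex)
}
```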
8. Telemetry model¶
Designed
Source: Slice_Guest_Telemetry_and_Benchmark_v1.md
sequenceDiagram
autonumber
participant VM as Slice VM
participant GH as guest helper<br/>(installed via cloud-init)
participant NA as node-agent<br/>:9110
participant API as cmd/api
participant UI as Web UI
Note over VM,GH: provisioning phase 9<br/>per-allocation token minted
GH->>GH: every 30s: nvidia-smi snapshot
GH->>NA: POST /internal/v1/guest-telemetry<br/>Authorization: Bearer <token><br/>{gpu_util, mem_util, temp, power}
NA-->>GH: 200
NA->>NA: validate token + allocation_id
NA->>API: forward as allocation metric<br/>(source=slice_guest)
API->>UI: WS push allocation telemetry
Three explicit telemetry sources defined in the spec:
| Source | Used by | Why |
|---|---|---|
| slice_guest | gpu_slice allocations | Host nvidia-smi is misleading once GPUs are vfio-pci'd |
| host_local_probe | baremetal allocations | Platform-owned host probe (replaces Netdata long-term) |
| unavailable | Either, when neither source is available | Explicit; never silently falls back to host GPU values |
The provisioning task registers the per-allocation token; the release task unregisters it. Host Netdata stays available for operator-only node health, never tenant metrics.
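Since this layer is Designed rather than Implemented, the sketch below is purely illustrative of the loop in the diagram: endpoint, cadence, and bearer token come from the spec; everything else is assumed.

```go
import (
	"bytes"
	"net/http"
	"time"
)

// reportLoop posts one GPU snapshot every 30 s with the per-allocation
// bearer token. sample is expected to return a JSON body built from an
// nvidia-smi query inside the guest.
func reportLoop(token string, sample func() ([]byte, error)) {
	for range time.Tick(30 * time.Second) {
		body, err := sample()
		if err != nil {
			continue // skip the tick; never substitute host-side values
		}
		req, err := http.NewRequest(http.MethodPost,
			"http://10.100.0.1:9110/internal/v1/guest-telemetry",
			bytes.NewReader(body))
		if err != nil {
			continue
		}
		req.Header.Set("Authorization", "Bearer "+token)
		req.Header.Set("Content-Type", "application/json")
		if resp, err := http.DefaultClient.Do(req); err == nil {
			resp.Body.Close()
		}
	}
}
```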
9. Release & cleanup¶
Implemented
runSliceVMRelease at cmd/node-agent/slice_vm.go:1175.
stateDiagram-v2
[*] --> graceful_shutdown
graceful_shutdown --> waiting: virsh shutdown
waiting --> shutdown_clean: VM exited before timeout
waiting --> hard_destroy: still running at graceful_timeout
hard_destroy --> shutdown_clean: virsh destroy (hard_stopped=true)
shutdown_clean --> undefine
undefine --> vfio_rebind: virsh undefine [--nvram]
vfio_rebind --> cleanup_files: GPU + fabric VF stay bound to vfio-pci
cleanup_files --> wipe_decision: RemoveAll cloud_init_dir<br/>remove dnsmasq reservation
wipe_decision --> wipe_nvme: in.Wipe=true
wipe_decision --> drop_state: in.Wipe=false
wipe_nvme --> drop_state: zero each NVMe
drop_state --> [*]: release per-slot leases<br/>unregister guest telemetry
Output keys reported back to the worker:
```json
{
  "vm_name": "...",
  "released": true,
  "hard_stopped": false,
  "wiped": false,
  "slot_count": 4,
  "leases_released": 4
}
```
hard_stopped=true is an auditable signal that graceful shutdown failed; it surfaces in allocation history as a follow-up flag.
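A hedged sketch of the graceful-then-hard ladder from the state diagram, shelling out to virsh (helper name and polling cadence are illustrative):

```go
import (
	"context"
	"os/exec"
	"strings"
	"time"
)

// stopDomain asks libvirt nicely, waits out the graceful timeout, then
// forces the domain off and reports hard_stopped for the audit trail.
func stopDomain(ctx context.Context, name string, graceful time.Duration) (hardStopped bool, err error) {
	_ = exec.CommandContext(ctx, "virsh", "shutdown", name).Run()
	deadline := time.Now().Add(graceful)
	for time.Now().Before(deadline) {
		out, _ := exec.CommandContext(ctx, "virsh", "domstate", name).Output()
		if strings.Contains(string(out), "shut off") {
			return false, nil // clean exit within graceful_timeout
		}
		time.Sleep(2 * time.Second)
	}
	// Still running at the timeout: virsh destroy, then flag it.
	err = exec.CommandContext(ctx, "virsh", "destroy", name).Run()
	return true, err
}
```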
10. Design intent¶
Decided
The six design docs that decided the v1 slice shape:
flowchart LR
A[Allocation_Capacity_Shapes_and<br/>_GPU_Slices_v1] --> B[GPU_Slice_End_to_End<br/>_Readiness_Decisions_v1]
B --> C[Hierarchical_Placement_and<br/>_Node_Scheduler_v1]
A --> D[Slice_Networking_Architecture_v1]
A --> E[GPU_Slice_Implementation_Checklist_v1]
A --> F[Slice_Guest_Telemetry_and<br/>_Benchmark_v1]
classDef src fill:#fff3cd,stroke:#332701
class A,B,C,D,E,F src
Read in order:
1. Allocation_Capacity_Shapes_and_GPU_Slices_v1.md — master design proposal (1388 lines)
2. GPU_Slice_End_to_End_Readiness_Decisions_v1.md — 10 decisions closing the v1 architecture
3. Hierarchical_Placement_and_Node_Scheduler_v1.md — placement layering (region → cluster → node → executor)
4. Slice_Networking_Architecture_v1.md — two-plane networking, public ingress, OVS extensibility
5. Slice_Guest_Telemetry_and_Benchmark_v1.md — telemetry contract
6. GPU_Slice_Implementation_Checklist_v1.md — phased plan (Phases 1–8)
11. Runbooks¶
Runbook
Four slice-specific runbooks live under doc/operations/runbooks/. When in doubt, use this decision tree:
flowchart TB
INC[Slice incident or onboarding] --> Q1{What broke?}
Q1 -- slot stuck cleanup_blocked --> R1[Cleanup_Blocked_Slot_Runbook]
Q1 -- image fails to clone/verify --> R2[Image_Pipeline_Runbook]
Q1 -- new host won't bootstrap --> R3[Node_Manual_Bootstrap_Runbook]
Q1 -- joint infra+platform onboarding --> R4[Infra_Enablement_Proposal]
classDef rb fill:#e9d6ff,stroke:#1e1530
class R1,R2,R3,R4 rb
| Runbook | When |
|---|---|
| Cleanup-blocked slot | Slot stuck cleanup_blocked — typically mounted host storage on NVMe, wipe verification failed, or drift detected |
| Image pipeline | Image build/verify/import or cache invalidation issue |
| Node manual bootstrap | MAAS commissioning didn't apply slice firmware profile; manual host prep needed |
| Infra enablement proposal | Joint infra + platform process for enabling a new slice-capable host pool |
12. Position vs other GPU clouds¶
quadrantChart
title GPU cloud isolation vs operational simplicity
x-axis "Operationally simple" --> "Operationally rich"
y-axis "Weaker isolation" --> "Stronger isolation"
quadrant-1 "Hyperscaler HPC"
quadrant-2 "GPUaaS sweet spot"
quadrant-3 "Boutique container clouds"
quadrant-4 "Bare metal direct"
"RunPod / Vast / Together": [0.25, 0.25]
"TensorDock / FluidStack": [0.30, 0.30]
"Lambda 1-Click": [0.45, 0.55]
"CoreWeave (K8s)": [0.70, 0.55]
"DGX Cloud": [0.80, 0.65]
"AWS / Azure / GCP HPC": [0.85, 0.80]
"GPUaaS slice": [0.55, 0.78]
GPUaaS lands deliberately in the VM-with-passthrough middle ground between the hyperscalers and the boutique container clouds:
- Stronger isolation than RunPod / Vast / Together because of VFIO + per-slot dedicated NVMe + per-slot SR-IOV IB VF
- Operationally simpler than CoreWeave (no K8s + GPU Operator + Multus stack)
- Less feature-rich than AWS / Azure / GCP HPC — no MIG, no confidential compute, no multi-node clusters, no managed driver pipelines
→ Full comparison: Position vs other clouds and Product comparisons → External clouds
End-to-end recap¶
sequenceDiagram
autonumber
participant U as Tenant
participant API as cmd/api
participant ORCH as orchestrator
participant PW as provisioning-worker
participant NA as node-agent
participant VM as Slice VM
participant BW as billing-worker
U->>API: POST /allocations sku=h200-sxm-slice gpus=N
API->>ORCH: place
ORCH-->>API: allocation_id, outbox emitted
API-->>U: 201 status=requested
PW->>NA: slice.vm_provision 17 phases
NA->>VM: virt-install + cloud-init
VM-->>NA: SSH + readiness marker
NA-->>PW: result private_ip timings readiness
PW->>API: allocation.status=active
API-->>U: WS notify — billing starts
BW->>BW: accrue every 60s
U->>API: terminal-token + WS
API->>VM: relay via terminal-gateway
Note over VM: tenant works
U->>API: release
API->>PW: enqueue slice.vm_release
PW->>NA: slice.vm_release
NA->>VM: virsh shutdown / destroy
NA->>NA: re-bind to vfio-pci, wipe leases
NA-->>PW: released
PW->>API: status=released
BW->>BW: stop accrual