Trail: GPU Slice¶
End-to-end reading path through every layer of the GPU slice product. Each step carries its own diagram and concrete facts; nothing requires bouncing out to source docs to follow the story.
Trail map¶
flowchart TB
classDef impl fill:#d1e7dd,stroke:#0a3622,color:#0a3622
classDef des fill:#fff3cd,stroke:#332701,color:#332701
classDef dec fill:#cfe2ff,stroke:#0e2240,color:#0e2240
classDef run fill:#e9d6ff,stroke:#1e1530,color:#1e1530
classDef cmp fill:#f8d7da,stroke:#42101e,color:#42101e
S1[1. What is a slice]:::impl --> S2[2. Capacity shapes & SKUs]:::impl
S2 --> S3[3. Slot model & data]:::impl
S3 --> S4[4. Orchestrator placement]:::impl
S4 --> S5[5. Node-agent task catalog]:::impl
S5 --> S6[6. Provisioning phases]:::impl
S6 --> S7[7. Networking model]:::impl
S7 --> S8[8. Telemetry model]:::des
S8 --> S9[9. Release & cleanup]:::impl
S9 --> S10[10. Design intent]:::dec
S10 --> S11[11. Runbooks]:::run
S11 --> S12[12. Position vs other clouds]:::cmp
1. What is a slice¶
Implemented
A GPU slice is a tenant VM that owns one or more contiguous slot bundles on a single physical GPU host. Each slot bundle is one GPU PCI device + one Mellanox SR-IOV VF + one NVMe namespace + one private IP + one MAC. The slice runs Ubuntu, accessed via SSH or browser terminal, with the GPU(s) visible inside the guest via VFIO passthrough.
flowchart LR
subgraph Host[H200 host - shared across tenants]
direction TB
GPU0[GPU 0<br/>PCI 0000:1b:00.0]:::gpu
GPU1[GPU 1<br/>PCI 0000:3c:00.0]:::gpu
VF0[IB VF 0<br/>PCI 0000:1a:00.2]:::vf
VF1[IB VF 1<br/>PCI 0000:3a:00.2]:::vf
N0[NVMe 0<br/>/dev/disk/by-id/nvme-eui.aaa]:::nvme
N1[NVMe 1<br/>/dev/disk/by-id/nvme-eui.bbb]:::nvme
end
subgraph T1[Tenant A slice VM]
VM1[Ubuntu guest<br/>nvidia-smi shows 1 GPU<br/>IB visible as ibp]:::vm
end
subgraph T2[Tenant B slice VM]
VM2[Ubuntu guest<br/>nvidia-smi shows 1 GPU<br/>separate IB VF]:::vm
end
GPU0 -.vfio passthrough.-> VM1
VF0 -.vfio passthrough.-> VM1
N0 -.virtio-scsi.-> VM1
GPU1 -.vfio passthrough.-> VM2
VF1 -.vfio passthrough.-> VM2
N1 -.virtio-scsi.-> VM2
classDef gpu fill:#fff3e0,stroke:#e65100
classDef vf fill:#e3f2fd,stroke:#1565c0
classDef nvme fill:#eceff1,stroke:#455a64
classDef vm fill:#e8f5e9,stroke:#2e7d32
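In code terms the bundle is a single unit; a minimal sketch in Go (type and field names are hypothetical, not the repo's actual model):

```go
// SlotBundle sketches the per-slot resource bundle described above:
// everything a slice VM needs travels together as one slot row.
type SlotBundle struct {
	SlotIndex   int    // position on the host; drives MAC/IP derivation
	GPUPCI      string // e.g. "0000:1b:00.0", VFIO passthrough
	FabricVFPCI string // Mellanox SR-IOV VF, also VFIO passthrough
	NVMeDevice  string // "/dev/disk/by-id/nvme-eui...", attached via virtio-scsi
	MACAddress  string // "52:54:xx:xx:xx:xx"
	PrivateIP   string // "10.100.0.{10+i}"
	NUMANode    int    // input to single-NUMA placement ranking
}
```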
Why a slice (not a container or whole node):
| Approach | Isolation | Multi-tenant economics | Used here? |
|---|---|---|---|
| Container | Shared kernel; cgroups + namespaces | Best (no virt overhead) | No — too weak for GPU |
| Whole node | Strongest (no shared anything) | Worst (whole node per tenant) | Yes, as baremetal shape |
| VM with passthrough | Separate kernel + dedicated passthrough HW | Good | Yes — this is gpu_slice |
| MIG / vGPU | Hardware partition or vendor driver | Best of both | Not v1 |
2. Capacity shapes & SKUs¶
Implemented
flowchart TB
REQ([POST /api/v1/allocations<br/>sku=h200-sxm-slice<br/>gpus_total=N])
REQ --> CHECK{capacity_shape}
CHECK -- gpu_slice --> SLICE[Reserve N slot rows<br/>same node, NUMA-fit]
CHECK -- baremetal --> BM[Lock entire node]
SLICE --> PROFILE{Resolve VM profile}
PROFILE --> P1[h200_1g_24c_64g<br/>1 GPU / 24 vCPU / 64 GiB]
PROFILE --> P2[h200_2g_48c_128g<br/>2 GPU / 48 vCPU / 128 GiB]
PROFILE --> P4[h200_4g_96c_256g<br/>4 GPU / 96 vCPU / 256 GiB]
PROFILE --> P8[h200_8g_192c_512g<br/>8 GPU / 192 vCPU / 512 GiB]
| SKU | Shape | allowed_gpu_counts | Hourly | Source |
|---|---|---|---|---|
| h200-sxm-slice | gpu_slice | [1, 2, 4, 8] | 450 ¢/hr | scripts/seed.sql:18 |
| h200-sxm-baremetal-8g | baremetal | n/a (fixed 8) | (operator-set) | seed |
Each SKU carries a resource_profile jsonb that names the slice VM profile family. The orchestrator stamps the chosen profile onto the allocation snapshot at create time so later seed edits don't mutate running allocations.
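A hedged sketch of that resolution and create-time stamping, with illustrative types (the real logic lives in the orchestrator):

```go
import (
	"fmt"
	"slices"
)

// SKURow mirrors the sku_catalog columns this section describes;
// ResourceProfile is simplified here to a gpu-count -> profile-name map.
type SKURow struct {
	SKU              string
	CapacityShape    string
	AllowedGPUCounts []int
	ResourceProfile  map[int]string // e.g. 2 -> "h200_2g_48c_128g"
}

// resolveSliceProfile validates the request against the SKU and returns
// the profile the orchestrator stamps onto the allocation snapshot.
func resolveSliceProfile(row SKURow, gpusTotal int) (string, error) {
	if row.CapacityShape != "gpu_slice" {
		return "", fmt.Errorf("sku %s: not a gpu_slice shape", row.SKU)
	}
	if !slices.Contains(row.AllowedGPUCounts, gpusTotal) {
		return "", fmt.Errorf("gpus_total=%d not in %v", gpusTotal, row.AllowedGPUCounts)
	}
	profile, ok := row.ResourceProfile[gpusTotal]
	if !ok {
		return "", fmt.Errorf("no VM profile for %d GPUs", gpusTotal)
	}
	return profile, nil // stamped at create time; later seed edits can't touch it
}
```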
→ Detail: Capacity shapes & SKUs
3. Slot model & data¶
Implemented
erDiagram
nodes ||--o{ node_resource_slots : "hosts"
nodes ||--o{ allocations : "currently runs"
sku_catalog ||--o{ allocations : "billed against"
allocations ||--o{ allocation_resource_claims : "binds"
node_resource_slots ||--o{ allocation_resource_claims : "claimed by"
os_images ||--o{ node_image_cache : "cached on nodes"
nodes ||--o{ node_image_cache : "caches"
nodes {
uuid id PK
text status "active|drained|unavailable"
text region_code
text sku FK
uuid org_id "nullable; null=shared"
}
node_resource_slots {
uuid id PK
uuid node_id FK
int slot_index
text status "available|reserved|provisioning|active|releasing|cleanup|disabled"
text capacity_shape "gpu_slice"
text pci_address "GPU PCI"
text nvme_device "/dev/disk/by-id/..."
text mac_address "52:54:xx:xx:xx:xx"
inet private_ip "10.100.0.{10+i}"
int numa_node
jsonb capacity_metadata "storage_ownership, fabric_claim_mode, fabric_vf_pci_address, destructive_wipe_policy, sku"
}
allocation_resource_claims {
uuid allocation_id FK
uuid slot_id FK
text claim_kind "slot|node_exclusive"
text status "reserved|provisioning|active|releasing"
}
sku_catalog {
text sku PK
text capacity_shape
int_array allowed_gpu_counts
jsonb resource_profile
}
os_images {
text slug PK
text target "vm_slice|baremetal"
text_array compatible_skus
text digest_sha256
}
Required capacity_metadata keys before a slot becomes schedulable (the orchestrator filter rejects rows missing any):
| Key | Required value | Why |
|---|---|---|
| storage_ownership | slice | NVMe is dedicated to tenant slice use, not a host share |
| destructive_wipe_policy | non-empty | Slot has an erase contract for tenant data on release |
| fabric_claim_mode | per_slot_vf | Each slot must own its VF; proves no shared parent BF/IB device is claimed concurrently |
| fabric_vf_pci_address | non-empty PCI | The actual VF that gets passed through |
| sku | optional | Per-slot SKU override |
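The real filter runs in SQL inside the candidate query, but the gate reduces to a few checks; a minimal sketch, assuming the capacity_metadata jsonb decodes to a string map:

```go
// slotSchedulable mirrors the orchestrator filter described above.
func slotSchedulable(meta map[string]string) bool {
	switch {
	case meta["storage_ownership"] != "slice":
		return false // NVMe must be dedicated to the tenant slice, not a host share
	case meta["destructive_wipe_policy"] == "":
		return false // no erase contract means no tenant data allowed on the slot
	case meta["fabric_claim_mode"] != "per_slot_vf":
		return false // each slot must own its VF, not a shared parent device
	case meta["fabric_vf_pci_address"] == "":
		return false // nothing to pass through
	}
	return true // "sku" stays optional: per-slot override only
}
```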
4. Orchestrator placement¶
Implemented
sequenceDiagram
autonumber
participant API as cmd/api
participant ORCH as orchestrator
participant DB as Postgres
participant OR as outbox-relay
API->>ORCH: CreateRequested(sku, gpus_total)
ORCH->>DB: BEGIN
ORCH->>DB: SELECT sku_catalog<br/>(capacity_shape, allowed_gpu_counts)
Note over ORCH: validate shape == gpu_slice<br/>+ gpus_total in allowed_counts
ORCH->>DB: listSlicePlacementCandidates(filter + os_images join)
Note over ORCH: candidates grouped by node,<br/>ordered NUMA + slot_index
ORCH->>ORCH: rankSlicePlacementCandidates<br/>(single-NUMA fit > remaining slots > NUMA groups > slot_index)
loop for each ranked candidate
ORCH->>DB: SELECT slots FOR UPDATE SKIP LOCKED
ORCH->>ORCH: selectSliceSlotIDs(reject duplicate fabric VFs)
alt enough slots locked
ORCH->>DB: UPDATE slots SET status='reserved'
ORCH->>DB: INSERT allocation + N claims
ORCH->>DB: INSERT outbox: provisioning.requested
ORCH->>DB: COMMIT
ORCH-->>API: allocation_id
else not enough
ORCH->>DB: ROLLBACK locked slots
Note over ORCH: try next candidate
end
end
OR->>DB: poll outbox FOR UPDATE SKIP LOCKED
OR-->>OR: publish provisioning.requested → NATS
The candidate query at service.go:1682 enforces every invariant: region match, tenant boundary, complete slot metadata, no overlapping fabric VF claim, no active baremetal on the node, and a viable os_images row for the SKU.
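The SKIP LOCKED step is what keeps concurrent placements safe: a loser simply sees fewer lockable rows and moves on to the next candidate. A hedged sketch of the claim query (shape illustrative, not the literal service.go:1682 SQL):

```go
import (
	"context"
	"database/sql"
)

// lockAvailableSlots locks candidate rows without blocking concurrent
// placements; rows already locked by another transaction are skipped.
func lockAvailableSlots(ctx context.Context, tx *sql.Tx, nodeID string) ([]string, error) {
	rows, err := tx.QueryContext(ctx, `
		SELECT id
		FROM node_resource_slots
		WHERE node_id = $1 AND status = 'available'
		ORDER BY numa_node, slot_index
		FOR UPDATE SKIP LOCKED`, nodeID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}
```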
Ranking scores, in priority order (sketched as a comparator below):
- Single-NUMA fit — can gpus_total slots come from one NUMA group? (Better cross-GPU bandwidth.)
- Remaining slots after the pick (best-fit; smaller is better — preserves clean larger blocks).
- NUMA group count in the chosen subset.
- First slot index (deterministic tie-break).
- Node ID (final tie-break).
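Expressed as a comparator, the ordering looks roughly like this (the candidate struct is illustrative, not the repo's type):

```go
import "sort"

type placementCandidate struct {
	SingleNUMAFit  bool   // all gpus_total slots fit in one NUMA group
	RemainingSlots int    // free slots left on the node after this pick
	NUMAGroups     int    // NUMA groups spanned by the chosen subset
	FirstSlotIndex int
	NodeID         string
}

// rank orders candidates exactly as the list above: NUMA fit first,
// then best-fit, then fewer NUMA groups, then deterministic tie-breaks.
func rank(cands []placementCandidate) {
	sort.Slice(cands, func(i, j int) bool {
		a, b := cands[i], cands[j]
		if a.SingleNUMAFit != b.SingleNUMAFit {
			return a.SingleNUMAFit
		}
		if a.RemainingSlots != b.RemainingSlots {
			return a.RemainingSlots < b.RemainingSlots
		}
		if a.NUMAGroups != b.NUMAGroups {
			return a.NUMAGroups < b.NUMAGroups
		}
		if a.FirstSlotIndex != b.FirstSlotIndex {
			return a.FirstSlotIndex < b.FirstSlotIndex
		}
		return a.NodeID < b.NodeID
	})
}
```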
5. Node-agent task catalog¶
Implemented
flowchart LR
classDef slice fill:#fff3e0,stroke:#e65100
classDef bm fill:#e3f2fd,stroke:#1565c0
classDef diag fill:#f3e5f5,stroke:#6a1b9a
subgraph TaskTypes[Typed task contract — cmd/node-agent/agent.go]
T1[slice.topology_discover]:::slice
T2[slice.vm_provision]:::slice
T3[slice.vm_release]:::slice
T4[allocation.provision_user]:::bm
T5[allocation.deprovision_user]:::bm
T6[diag.health_probe]:::diag
end
T1 -.outputs.-> O1["gpu_devices<br/>fabric_devices<br/>nvme_devices<br/>candidate_slots<br/>blockers<br/>approval_required=true"]
T2 -.outputs.-> O2["vm_name, private_ip,<br/>ssh_port=22,<br/>readiness.ssh_ready,<br/>readiness.guest_ready,<br/>timings.phase_ms"]
T3 -.outputs.-> O3["released, hard_stopped,<br/>wiped, leases_released"]
Every task follows the same contract: typed inputs, signed at the API boundary, pulled by the node over mTLS, structured outputs. No remote shell.
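A hedged sketch of what such an envelope might look like on the wire (field names illustrative, not the literal agent.go contract):

```go
import "encoding/json"

// TaskEnvelope is the shape the node-agent pulls over mTLS: a signed,
// typed message rather than a shell command.
type TaskEnvelope struct {
	TaskID    string          `json:"task_id"`
	Type      string          `json:"type"`      // e.g. "slice.vm_provision"
	Inputs    json.RawMessage `json:"inputs"`    // decoded per task type
	Signature []byte          `json:"signature"` // minted at the API boundary
}

// SliceVMProvisionInputs shows how one task type's inputs stay strongly
// typed end to end.
type SliceVMProvisionInputs struct {
	AllocationID string   `json:"allocation_id"`
	SlotIDs      []string `json:"slot_ids"`
	ImageSlug    string   `json:"image_slug"`
}
```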
→ Full task catalog: Node-agent task catalog
6. Provisioning phases¶
Implemented
runSliceVMProvision at cmd/node-agent/slice_vm.go:478 — 17 phase-timed steps. Failure between phases 1–16 triggers deferred cleanup.
stateDiagram-v2
[*] --> lease_acquire
lease_acquire --> host_dependencies: per-slot JSON lease created
host_dependencies --> host_passthrough_check: apt-get done
host_passthrough_check --> vfio_bind_check: /dev/kvm OK
vfio_bind_check --> image_stat_download: GPUs + fabric VFs bound to vfio-pci
image_stat_download --> image_digest_verify: image present or downloaded
image_digest_verify --> cloud_init_dir: sha256 verified
cloud_init_dir --> terminal_key: dir ready
terminal_key --> guest_telemetry_register: host ed25519 keypair ready
guest_telemetry_register --> cloud_init_seed_files: per-allocation token minted
cloud_init_seed_files --> cloud_localds: user-data + meta-data written
cloud_localds --> runtime_validate: seed.iso packed
runtime_validate --> dhcp_reservation: OVS bridge exists + NVMe unmounted
dhcp_reservation --> image_write_convert: per-VM MAC→IP in dnsmasq
image_write_convert --> virt_install: qemu-img convert → boot NVMe
virt_install --> readiness: libvirt domain running
readiness --> performance_probe: SSH + guest marker reached
performance_probe --> [*]: success
lease_acquire --> cleanup: error
vfio_bind_check --> cleanup: error
image_write_convert --> cleanup: error
virt_install --> cleanup: error
readiness --> cleanup: error
cleanup --> [*]: deferred cleanup<br/>drop DHCP, unregister telemetry,<br/>release leases, rm cloud_init_dir
Key phase facts:
| Phase | Detail |
|---|---|
| lease_acquire | File-based exclusive-create JSON lease per slot under /var/lib/gpuaas/node-scheduler/leases/{slot_id}.json. TTL default 24 h. |
| vfio_bind_check | GPU + fabric VF must be bound to vfio-pci. With GPUAAS_SLICE_RUNTIME_VFIO_BIND unset, a wrong driver is a hard error → operator must run host bootstrap. |
| image_stat_download | Image must live under /var/lib/gpuaas/slice-images or /var/lib/libvirt/images. 64 GiB cap on download. |
| terminal_key | Host's per-instance ed25519 pubkey is injected as authorized_key — the terminal gateway uses this to broker SSH. |
| image_write_convert | qemu-img convert -O raw <image> <boot NVMe> — destructive write to the slot's NVMe. |
| virt_install | UEFI, host-passthrough CPU, virtio-scsi, OVS bridge, --host-device per GPU + VF. loader_secure=no, --tpm=none, --graphics=none. |
| readiness | SSH on private_ip:22 + presence of /var/lib/gpuaas/slice-ready. Bounded by graceful_timeout_seconds (30–900, default 300). |
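The lease_acquire contract is worth seeing concretely: exclusive-create makes the filesystem the arbiter, so two provisions can never hold the same slot. A sketch under assumed lease-body fields:

```go
import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// acquireSlotLease creates the per-slot lease file with O_EXCL: if the
// file already exists, another task owns the slot and this one fails fast.
func acquireSlotLease(slotID, taskID string) (string, error) {
	path := filepath.Join("/var/lib/gpuaas/node-scheduler/leases", slotID+".json")
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if err != nil {
		return "", fmt.Errorf("slot %s already leased: %w", slotID, err)
	}
	defer f.Close()
	lease := map[string]any{
		"task_id":    taskID,
		"expires_at": time.Now().Add(24 * time.Hour), // default TTL 24 h
	}
	if err := json.NewEncoder(f).Encode(lease); err != nil {
		os.Remove(path) // never leave a half-written lease behind
		return "", err
	}
	return path, nil
}
```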
7. Networking model¶
Implemented
flowchart TB
subgraph Host[Slice host networking]
direction TB
UP[Uplink NIC<br/>default route]
IPT[iptables NAT<br/>POSTROUTING MASQUERADE<br/>src 10.100.0.0/24]
OVS{{OVS bridge ovsbr0<br/>10.100.0.1/24}}
DNS[dnsmasq<br/>per-VM MAC→IP reservation<br/>etc/dnsmasq.d/]
IB[IPoIB ibp*<br/>192.168.x.0/24<br/>netplan 60-ipoib.yaml]
TELE[node-agent<br/>:9110/internal/v1/guest-telemetry]
end
subgraph VM[Slice VM]
E0[eth0 virtio<br/>private_ip 10.100.0.10]
IB0[ib0 SR-IOV VF<br/>passthrough]
GPU[GPU vfio-pci]
NVME[NVMe virtio-scsi]
HELP[guest helper<br/>gpuaas-metrics-helper]
end
E0 <--> OVS
OVS <--> IPT
IPT <--> UP
OVS -.dhcp lease.-> DNS
IB0 <-.RDMA.-> IB
HELP -. POST + per-alloc token .-> TELE
| Plane | Path | Source of truth |
|---|---|---|
| Management/public | OVS bridge ovsbr0 → virtio NIC | mac_address + private_ip on slot row; dnsmasq config writes the reservation |
| Workload fabric | Per-slot Mellanox VF passthrough | capacity_metadata.fabric_vf_pci_address |
| Telemetry channel | VM helper → http://10.100.0.1:9110 | per-allocation token registered at provision time |
| East-west across slices | Denied by default | OVS bridge isolation (no inter-VM flows defined) |
Slot 0 (the boot slot) gets MAC 52:54: + the first 4 bytes of sha256(node_id:0) and IP 10.100.0.10; slot N gets IP 10.100.0.{10+N}. Addressing is deterministic, so reboots and re-provisions always land on the same MAC→IP reservation.
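A sketch of that derivation; the exact hash-input encoding ("node_id:slot_index") is an assumption read off the formula above:

```go
import (
	"crypto/sha256"
	"fmt"
)

// slotMAC derives the stable per-slot MAC: locally administered 52:54
// prefix plus the first 4 bytes of sha256("<node_id>:<slot_index>").
func slotMAC(nodeID string, slotIndex int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", nodeID, slotIndex)))
	return fmt.Sprintf("52:54:%02x:%02x:%02x:%02x", sum[0], sum[1], sum[2], sum[3])
}

// slotIP maps slot N to 10.100.0.{10+N}, matching the dnsmasq reservation.
func slotIP(slotIndex int) string {
	return fmt.Sprintf("10.100.0.%d", 10+slotIndex)
}
```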
8. Telemetry model¶
Designed
Source: Slice_Guest_Telemetry_and_Benchmark_v1.md
sequenceDiagram
autonumber
participant VM as Slice VM
participant GH as guest helper<br/>(installed via cloud-init)
participant NA as node-agent<br/>:9110
participant API as cmd/api
participant UI as Web UI
Note over VM,GH: provisioning phase 9<br/>per-allocation token minted
GH->>GH: every 30s: nvidia-smi snapshot
GH->>NA: POST /internal/v1/guest-telemetry<br/>Authorization: Bearer <token><br/>{gpu_util, mem_util, temp, power}
NA-->>GH: 200
NA->>NA: validate token + allocation_id
NA->>API: forward as allocation metric<br/>(source=slice_guest)
API->>UI: WS push allocation telemetry
Three explicit telemetry sources defined in the spec:
| Source | Used by | Why |
|---|---|---|
| slice_guest | gpu_slice allocations | Host nvidia-smi is misleading once GPUs are vfio-pci'd |
| host_local_probe | baremetal allocations | Platform-owned host probe (replaces Netdata long-term) |
| unavailable | Either, when neither source is available | Explicit; never silently falls back to host GPU values |
The provisioning task registers the per-allocation token; the release task unregisters it. Host Netdata stays available for operator-only node health, never tenant metrics.
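Since this layer is Designed rather than Implemented, the sketch below is purely illustrative of the loop in the diagram: endpoint, cadence, and bearer token come from the spec; everything else is assumed.

```go
import (
	"bytes"
	"net/http"
	"time"
)

// reportLoop posts one GPU snapshot every 30 s with the per-allocation
// bearer token. sample is expected to return a JSON body built from an
// nvidia-smi query inside the guest.
func reportLoop(token string, sample func() ([]byte, error)) {
	for range time.Tick(30 * time.Second) {
		body, err := sample()
		if err != nil {
			continue // skip the tick; never substitute host-side values
		}
		req, err := http.NewRequest(http.MethodPost,
			"http://10.100.0.1:9110/internal/v1/guest-telemetry",
			bytes.NewReader(body))
		if err != nil {
			continue
		}
		req.Header.Set("Authorization", "Bearer "+token)
		req.Header.Set("Content-Type", "application/json")
		if resp, err := http.DefaultClient.Do(req); err == nil {
			resp.Body.Close()
		}
	}
}
```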
9. Release & cleanup¶
Implemented
runSliceVMRelease at cmd/node-agent/slice_vm.go:1175.
stateDiagram-v2
[*] --> graceful_shutdown
graceful_shutdown --> waiting: virsh shutdown
waiting --> shutdown_clean: VM exited before timeout
waiting --> hard_destroy: still running at graceful_timeout
hard_destroy --> shutdown_clean: virsh destroy (hard_stopped=true)
shutdown_clean --> undefine
undefine --> vfio_rebind: virsh undefine [--nvram]
vfio_rebind --> cleanup_files: GPU + fabric VF stay bound to vfio-pci
cleanup_files --> wipe_decision: RemoveAll cloud_init_dir<br/>remove dnsmasq reservation
wipe_decision --> wipe_nvme: in.Wipe=true
wipe_decision --> drop_state: in.Wipe=false
wipe_nvme --> drop_state: zero each NVMe
drop_state --> [*]: release per-slot leases<br/>unregister guest telemetry
Output keys reported back to the worker:
```json
{
  "vm_name": "...",
  "released": true,
  "hard_stopped": false,
  "wiped": false,
  "slot_count": 4,
  "leases_released": 4
}
```
hard_stopped=true is an auditable signal that graceful shutdown failed; it surfaces in allocation history as a follow-up flag.
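A hedged sketch of the graceful-then-hard ladder from the state diagram, shelling out to virsh (helper name and polling cadence are illustrative):

```go
import (
	"context"
	"os/exec"
	"strings"
	"time"
)

// stopDomain asks libvirt nicely, waits out the graceful timeout, then
// forces the domain off and reports hard_stopped for the audit trail.
func stopDomain(ctx context.Context, name string, graceful time.Duration) (hardStopped bool, err error) {
	_ = exec.CommandContext(ctx, "virsh", "shutdown", name).Run()
	deadline := time.Now().Add(graceful)
	for time.Now().Before(deadline) {
		out, _ := exec.CommandContext(ctx, "virsh", "domstate", name).Output()
		if strings.Contains(string(out), "shut off") {
			return false, nil // clean exit within graceful_timeout
		}
		time.Sleep(2 * time.Second)
	}
	// Still running at the timeout: virsh destroy, then flag it.
	err = exec.CommandContext(ctx, "virsh", "destroy", name).Run()
	return true, err
}
```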
10. Design intent¶
Decided
The six design docs that decided the v1 slice shape:
flowchart LR
A[Allocation_Capacity_Shapes_and<br/>_GPU_Slices_v1] --> B[GPU_Slice_End_to_End<br/>_Readiness_Decisions_v1]
B --> C[Hierarchical_Placement_and<br/>_Node_Scheduler_v1]
A --> D[Slice_Networking_Architecture_v1]
A --> E[GPU_Slice_Implementation_Checklist_v1]
A --> F[Slice_Guest_Telemetry_and<br/>_Benchmark_v1]
classDef src fill:#fff3cd,stroke:#332701
class A,B,C,D,E,F src
Read in order:
1. Allocation_Capacity_Shapes_and_GPU_Slices_v1.md — master design proposal (1388 lines)
2. GPU_Slice_End_to_End_Readiness_Decisions_v1.md — 10 decisions closing the v1 architecture
3. Hierarchical_Placement_and_Node_Scheduler_v1.md — placement layering (region → cluster → node → executor)
4. Slice_Networking_Architecture_v1.md — two-plane networking, public ingress, OVS extensibility
5. Slice_Guest_Telemetry_and_Benchmark_v1.md — telemetry contract
6. GPU_Slice_Implementation_Checklist_v1.md — phased plan (Phases 1–8)
11. Runbooks¶
Runbook
Four slice-specific runbooks live under doc/operations/runbooks/. When in doubt, use this decision tree:
flowchart TB
INC[Slice incident or onboarding] --> Q1{What broke?}
Q1 -- slot stuck cleanup_blocked --> R1[Cleanup_Blocked_Slot_Runbook]
Q1 -- image fails to clone/verify --> R2[Image_Pipeline_Runbook]
Q1 -- new host won't bootstrap --> R3[Node_Manual_Bootstrap_Runbook]
Q1 -- joint infra+platform onboarding --> R4[Infra_Enablement_Proposal]
classDef rb fill:#e9d6ff,stroke:#1e1530
class R1,R2,R3,R4 rb
| Runbook | When |
|---|---|
| Cleanup-blocked slot | Slot stuck cleanup_blocked — typically mounted host storage on NVMe, wipe verification failed, or drift detected |
| Image pipeline | Image build/verify/import or cache invalidation issue |
| Node manual bootstrap | MAAS commissioning didn't apply slice firmware profile; manual host prep needed |
| Infra enablement proposal | Joint infra + platform process for enabling a new slice-capable host pool |
12. Position vs other GPU clouds¶
quadrantChart
title GPU cloud isolation vs operational simplicity
x-axis "Operationally simple" --> "Operationally rich"
y-axis "Weaker isolation" --> "Stronger isolation"
quadrant-1 "Hyperscaler HPC"
quadrant-2 "GPUaaS sweet spot"
quadrant-3 "Boutique container clouds"
quadrant-4 "Bare metal direct"
"RunPod / Vast / Together": [0.25, 0.25]
"TensorDock / FluidStack": [0.30, 0.30]
"Lambda 1-Click": [0.45, 0.55]
"CoreWeave (K8s)": [0.70, 0.55]
"DGX Cloud": [0.80, 0.65]
"AWS / Azure / GCP HPC": [0.85, 0.80]
"GPUaaS slice": [0.55, 0.78]
GPUaaS lands deliberately in the VM-with-passthrough middle ground between the hyperscalers and the boutique container clouds:
- Stronger isolation than RunPod / Vast / Together because of VFIO + per-slot dedicated NVMe + per-slot SR-IOV IB VF
- Operationally simpler than CoreWeave (no K8s + GPU Operator + Multus stack)
- Less feature-rich than AWS / Azure / GCP HPC — no MIG, no confidential compute, no multi-node clusters, no managed driver pipelines
→ Full comparison: Position vs other clouds and Product comparisons → External clouds
End-to-end recap¶
sequenceDiagram
autonumber
participant U as Tenant
participant API as cmd/api
participant ORCH as orchestrator
participant PW as provisioning-worker
participant NA as node-agent
participant VM as Slice VM
participant BW as billing-worker
U->>API: POST /allocations sku=h200-sxm-slice gpus=N
API->>ORCH: place
ORCH-->>API: allocation_id, outbox emitted
API-->>U: 201 status=requested
PW->>NA: slice.vm_provision 17 phases
NA->>VM: virt-install + cloud-init
VM-->>NA: SSH + readiness marker
NA-->>PW: result private_ip timings readiness
PW->>API: allocation.status=active
API-->>U: WS notify — billing starts
BW->>BW: accrue every 60s
U->>API: terminal-token + WS
API->>VM: relay via terminal-gateway
Note over VM: tenant works
U->>API: release
API->>PW: enqueue slice.vm_release
PW->>NA: slice.vm_release
NA->>VM: virsh shutdown / destroy
NA->>NA: re-bind to vfio-pci, wipe leases
NA-->>PW: released
PW->>API: status=released
BW->>BW: stop accrual