# Capacity shapes & SKUs
**Status:** Implemented · Decided

**Source:** `scripts/seed.sql` · `packages/services/provisioning/orchestrator/service.go` · `doc/architecture/Allocation_Capacity_Shapes_and_GPU_Slices_v1.md`
## Two capacity shapes today
| Shape | Tenancy unit | Lifecycle | Schedulable inventory |
|---|---|---|---|
| `baremetal` | Whole physical node | MAAS deploy + release (optional full reimage) | `nodes` rows |
| `gpu_slice` | One VM per allocation with 1–N slot bundles | libvirt/QEMU on a slice-mode node | `node_resource_slots` rows |
```mermaid
flowchart LR
    classDef bm fill:#e3f2fd,stroke:#1565c0
    classDef sl fill:#fff3e0,stroke:#e65100
    REQ([Allocation request<br/>sku, gpus_total, region])
    REQ --> SK{capacity_shape?}
    SK -- baremetal --> BM[Lock whole node<br/>NOT EXISTS active claim]:::bm
    SK -- gpu_slice --> SL[Reserve N slot rows<br/>same node<br/>NUMA-fit ranking]:::sl
    BM --> ALOC[Allocation row +<br/>node_exclusive claim]:::bm
    SL --> ALOC2[Allocation row +<br/>N slot claims]:::sl
    ALOC --> EMIT[Outbox →<br/>provisioning.requested]
    ALOC2 --> EMIT
```
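The branch above determines what gets claimed: a single `node_exclusive` claim for `baremetal`, or one slot claim per requested GPU for `gpu_slice`. A minimal Go sketch of that decision, using illustrative types rather than the orchestrator's actual API:

```go
// Illustrative sketch of the capacity-shape branch; names are not the
// orchestrator's real types.
package capacity

import "fmt"

type Shape string

const (
	ShapeBaremetal Shape = "baremetal"
	ShapeGPUSlice  Shape = "gpu_slice"
)

type AllocationRequest struct {
	SKU       string
	GPUsTotal int
	Region    string
	Shape     Shape
}

// planClaims returns the claim kind and how many claims the allocation needs.
func planClaims(req AllocationRequest) (claimKind string, claimCount int, err error) {
	switch req.Shape {
	case ShapeBaremetal:
		// Whole node: one node_exclusive claim locks the host.
		return "node_exclusive", 1, nil
	case ShapeGPUSlice:
		// One slot claim per requested GPU, all on the same node.
		return "slot", req.GPUsTotal, nil
	default:
		return "", 0, fmt.Errorf("unknown capacity shape %q", req.Shape)
	}
}
```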
## Seeded SKUs
From scripts/seed.sql:
### `h200-sxm-slice` — NVIDIA H200 GPU VM
```yaml
sku: h200-sxm-slice
vendor: NVIDIA
display_name: NVIDIA H200 GPU VM
gpus_total: 8
capacity_shape: gpu_slice
allowed_gpu_counts: [1, 2, 4, 8]
gpu_hourly_price_minor: 450   # USD cents
active: true
resource_profile:
  fabric: ib
  slice_unit_gpu_count: 1
  default_slice_vm_profile: h200_1g_24c_64g
  slice_vm_profiles:
    h200_1g_12c_64g:  { gpu_count: 1, vcpu_count: 12,  memory_mib: 65536,  notes: "early prototype shape" }
    h200_1g_24c_64g:  { gpu_count: 1, vcpu_count: 24,  memory_mib: 65536,  notes: "latest tested" }
    h200_2g_48c_128g: { gpu_count: 2, vcpu_count: 48,  memory_mib: 131072, derived_from: h200_1g_24c_64g }
    h200_4g_96c_256g: { gpu_count: 4, vcpu_count: 96,  memory_mib: 262144, derived_from: h200_1g_24c_64g }
    h200_8g_192c_512g: { gpu_count: 8, vcpu_count: 192, memory_mib: 524288, derived_from: h200_1g_24c_64g }
```
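The `slice_vm_profiles` map is what turns a requested GPU count into concrete VM sizing. A hedged sketch of that lookup in Go, hard-coding the profiles seeded above (the helper and its names are illustrative, not the seed loader's real code):

```go
// Illustrative only: resolve a requested GPU count to one of the seeded
// slice_vm_profiles for h200-sxm-slice.
package capacity

import "fmt"

type SliceVMProfile struct {
	GPUCount  int
	VCPUCount int
	MemoryMiB int
}

// Mirrors the slice_vm_profiles block in the SKU above.
var h200SliceProfiles = map[string]SliceVMProfile{
	"h200_1g_24c_64g":   {GPUCount: 1, VCPUCount: 24, MemoryMiB: 65536},
	"h200_2g_48c_128g":  {GPUCount: 2, VCPUCount: 48, MemoryMiB: 131072},
	"h200_4g_96c_256g":  {GPUCount: 4, VCPUCount: 96, MemoryMiB: 262144},
	"h200_8g_192c_512g": {GPUCount: 8, VCPUCount: 192, MemoryMiB: 524288},
}

// profileForGPUCount picks the profile whose gpu_count matches the request.
func profileForGPUCount(gpusTotal int) (string, SliceVMProfile, error) {
	for name, p := range h200SliceProfiles {
		if p.GPUCount == gpusTotal {
			return name, p, nil
		}
	}
	return "", SliceVMProfile{}, fmt.Errorf("no slice VM profile for %d GPUs", gpusTotal)
}
```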
### `h200-sxm-baremetal-8g` — NVIDIA H200 full node

**Status:** Decided
## Slice slot model
Each row in node_resource_slots represents one schedulable bundle on a slice-mode node:
```mermaid
erDiagram
    nodes ||--o{ node_resource_slots : hosts
    node_resource_slots {
        uuid id PK
        uuid node_id FK
        int slot_index
        text status "available|reserved|provisioning|active|releasing|cleanup|failed|disabled"
        text capacity_shape "'gpu_slice'"
        text pci_address "GPU PCI"
        text nvme_device "/dev/disk/by-id/..."
        text mac_address "52:54:xx:xx:xx:xx"
        inet private_ip "10.100.0.{10+slot_index}"
        int numa_node
        text sharing_model "exclusive_device"
        int max_claims "1 for v1"
        jsonb capacity_metadata
    }
    node_resource_slots ||--o{ allocation_resource_claims : claimed
    allocation_resource_claims {
        uuid allocation_id FK
        uuid slot_id FK
        text claim_kind "slot|node_exclusive"
        text status "reserved|provisioning|active|releasing"
    }
```
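For reference, a Go-side view of one slot row, sketched from the diagram above; field names and types are illustrative and the orchestrator's real struct may differ:

```go
// Illustrative struct for one node_resource_slots row.
package capacity

import "net"

type NodeResourceSlot struct {
	ID               string            // uuid
	NodeID           string            // uuid of the hosting node
	SlotIndex        int
	Status           string            // available|reserved|provisioning|active|releasing|cleanup|failed|disabled
	CapacityShape    string            // always "gpu_slice" for slot rows
	PCIAddress       string            // GPU PCI address passed through to the VM
	NVMeDevice       string            // /dev/disk/by-id/... dedicated to the slice
	MACAddress       string            // 52:54:xx:xx:xx:xx
	PrivateIP        net.IP            // 10.100.0.{10+slot_index}
	NUMANode         int
	SharingModel     string            // "exclusive_device" in v1
	MaxClaims        int               // 1 in v1
	CapacityMetadata map[string]string // required keys listed in the next section
}
```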
## Required `capacity_metadata` keys
A slot must carry all of these before it becomes schedulable:
| Key | Required value | Why |
|---|---|---|
| `storage_ownership` | `slice` | Asserts the NVMe is dedicated to tenant slice use (not a host share) |
| `destructive_wipe_policy` | non-empty | Declares the erase contract on release |
| `fabric_claim_mode` | `per_slot_vf` | Concurrency safety: a shared parent BF/IB device is not safe for concurrent tenants |
| `fabric_vf_pci_address` | non-empty PCI address | The actual VF that gets passed through |
| `sku` | optional | Per-slot SKU override |
Source: packages/services/provisioning/orchestrator/service.go:1696–1709.
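A paraphrase of that check as a small Go helper, assuming the metadata arrives as a string map; the authoritative validation is the `service.go` range cited above:

```go
// Hedged paraphrase of the schedulability check described above; the actual
// validation lives in packages/services/provisioning/orchestrator/service.go.
package capacity

import "fmt"

// slotSchedulable reports whether a slot's capacity_metadata carries every
// required key with an acceptable value.
func slotSchedulable(meta map[string]string) error {
	if meta["storage_ownership"] != "slice" {
		return fmt.Errorf("storage_ownership must be %q", "slice")
	}
	if meta["destructive_wipe_policy"] == "" {
		return fmt.Errorf("destructive_wipe_policy must be non-empty")
	}
	if meta["fabric_claim_mode"] != "per_slot_vf" {
		return fmt.Errorf("fabric_claim_mode must be %q", "per_slot_vf")
	}
	if meta["fabric_vf_pci_address"] == "" {
		return fmt.Errorf("fabric_vf_pci_address must be a non-empty PCI address")
	}
	// "sku" is optional: a per-slot override of the node-level SKU.
	return nil
}
```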
## Allowed GPU counts
The orchestrator validates that `gpus_total` is in `allowed_gpu_counts` and that N slots can be reserved on the same node. Cross-node multi-GPU is not supported.
```mermaid
flowchart TB
    R([request: gpus_total=N])
    R --> A{N in allowed_gpu_counts?}
    A -- no --> X1[sku_unavailable]
    A -- yes --> B[list candidate slots<br/>filter by region, tenant boundary, slot metadata]
    B --> C{any node has N free slots?}
    C -- no --> X2[sku_unavailable]
    C -- yes --> D[rank candidates:<br/>1. single-NUMA fit<br/>2. remaining slots<br/>3. NUMA groups<br/>4. slot index]
    D --> E[FOR UPDATE SKIP LOCKED on chosen node]
    E --> F{still N available?}
    F -- no --> G[try next candidate]
    G --> E
    F -- yes --> H[reserve N slots, write outbox]
    H --> OK([allocation requested])
```
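The "still N available?" re-check above is what the row locks guard. A sketch of that reservation step in Go against the schema from the ER diagram, using Postgres-style placeholders; the real query, ranking, and retry loop live in the orchestrator service:

```go
// Sketch of the same-node reservation step, assuming the node_resource_slots
// schema shown earlier; not the orchestrator's actual query.
package capacity

import (
	"context"
	"database/sql"
	"fmt"
)

// reserveSlots tries to lock n available slots on one node and returns their
// IDs, or an error if another allocation raced us and fewer than n remain.
func reserveSlots(ctx context.Context, tx *sql.Tx, nodeID string, n int) ([]string, error) {
	rows, err := tx.QueryContext(ctx, `
		SELECT id
		FROM node_resource_slots
		WHERE node_id = $1 AND status = 'available'
		ORDER BY numa_node, slot_index
		LIMIT $2
		FOR UPDATE SKIP LOCKED`, nodeID, n)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	if err := rows.Err(); err != nil {
		return nil, err
	}
	if len(ids) < n {
		// Caller falls back to the next candidate node, as in the flow above.
		return nil, fmt.Errorf("only %d of %d slots still available on node %s", len(ids), n, nodeID)
	}
	return ids, nil
}
```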
## Why these constraints exist
- **Same-node only (`gpu_slice`):** PCI passthrough requires all devices to sit on one physical host. Cross-node placement would need an RDMA-aware overlay that isn't built.
- **Whole-GPU only (`gpu_slice`):** MIG/vGPU/MPS partitioning is Designed (the capacity shapes doc reserves `gpu_partition` / `gpu_shared`) but not implemented. The scheduler explicitly ignores any slot with `parent_slot_id`, `sharing_model != exclusive_device`, `max_claims > 1`, or `compute_milli < 1000` (see the sketch after this list).
- **Operator-approved slot map:** topology discovery returns candidates with `approval_required: true`; only operator action creates the actual `node_resource_slots` rows.
- **Per-slot VF fabric:** a shared host BF/IB parent device is not safe to share between concurrent tenants. Each slot must have its own SR-IOV VF.
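A minimal sketch of the whole-GPU filter referenced in the second bullet, with illustrative field names:

```go
// Illustrative whole-GPU eligibility filter; mirrors the constraints listed
// above rather than the scheduler's actual code.
package capacity

// slotFilterInput captures just the fields the v1 filter looks at.
type slotFilterInput struct {
	ParentSlotID string
	SharingModel string
	MaxClaims    int
	ComputeMilli int
}

// eligibleForV1 reports whether the scheduler would consider this slot:
// whole GPU, exclusive device, single claim, no parent partition.
func eligibleForV1(s slotFilterInput) bool {
	return s.ParentSlotID == "" &&
		s.SharingModel == "exclusive_device" &&
		s.MaxClaims <= 1 &&
		s.ComputeMilli >= 1000
}
```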
## Status by area
| Area | Status |
|---|---|
| `baremetal` allocation create/release | Implemented |
| `gpu_slice` allocation reservation (orchestrator) | Implemented |
| Slice VM provisioning (node-agent) | Implemented |
| Topology discovery (`slice.topology_discover`) | Implemented |
| MIG / vGPU / MPS sub-GPU shapes | Designed (reserved as `gpu_partition`, `gpu_shared`) |
| Multi-node slice clusters | Designed (non-goal for v1) |
| Full-reimage isolation between allocations | Designed (gated by `maas.enabled=true`) |
## Trace points

- GPU Slice as-built → the full code-grounded write-up
- GPU Slice trail → curated reading path for slice
- Design docs (under `source/`):