# Capacity shapes & SKUs
**Status:** Implemented · Decided

**Source:** `scripts/seed.sql` · `packages/services/provisioning/orchestrator/service.go` · `doc/architecture/Allocation_Capacity_Shapes_and_GPU_Slices_v1.md`
## Two capacity shapes today
| Shape | Tenancy unit | Lifecycle | Schedulable inventory |
|---|---|---|---|
| `baremetal` | Whole physical node | MAAS deploy + release (optional full reimage) | `nodes` rows |
| `gpu_slice` | One VM per allocation with 1–N slot bundles | libvirt/QEMU on a slice-mode node | `node_resource_slots` rows |
```mermaid
flowchart LR
    classDef bm fill:#e3f2fd,stroke:#1565c0
    classDef sl fill:#fff3e0,stroke:#e65100
    REQ([Allocation request<br/>sku, gpus_total, region])
    REQ --> SK{capacity_shape?}
    SK -- baremetal --> BM[Lock whole node<br/>NOT EXISTS active claim]:::bm
    SK -- gpu_slice --> SL[Reserve N slot rows<br/>same node<br/>NUMA-fit ranking]:::sl
    BM --> ALOC[Allocation row +<br/>node_exclusive claim]:::bm
    SL --> ALOC2[Allocation row +<br/>N slot claims]:::sl
    ALOC --> EMIT[Outbox →<br/>provisioning.requested]
    ALOC2 --> EMIT
```
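The branch above determines what gets claimed: a single `node_exclusive` claim for `baremetal`, or one slot claim per requested GPU for `gpu_slice`. A minimal Go sketch of that decision, using illustrative types rather than the orchestrator's actual API:

```go
// Illustrative sketch of the capacity-shape branch; names are not the
// orchestrator's real types.
package capacity

import "fmt"

type Shape string

const (
	ShapeBaremetal Shape = "baremetal"
	ShapeGPUSlice  Shape = "gpu_slice"
)

type AllocationRequest struct {
	SKU       string
	GPUsTotal int
	Region    string
	Shape     Shape
}

// planClaims returns the claim kind and how many claims the allocation needs.
func planClaims(req AllocationRequest) (claimKind string, claimCount int, err error) {
	switch req.Shape {
	case ShapeBaremetal:
		// Whole node: one node_exclusive claim locks the host.
		return "node_exclusive", 1, nil
	case ShapeGPUSlice:
		// One slot claim per requested GPU, all on the same node.
		return "slot", req.GPUsTotal, nil
	default:
		return "", 0, fmt.Errorf("unknown capacity shape %q", req.Shape)
	}
}
```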
## Seeded SKUs
From scripts/seed.sql:
### `h200-sxm-slice` — NVIDIA H200 GPU VM
```yaml
sku: h200-sxm-slice
vendor: NVIDIA
display_name: NVIDIA H200 GPU VM
gpus_total: 8
capacity_shape: gpu_slice
allowed_gpu_counts: [1, 2, 4, 8]
gpu_hourly_price_minor: 450   # USD cents
active: true
resource_profile:
  fabric: ib
  slice_unit_gpu_count: 1
  default_slice_vm_profile: h200_1g_24c_64g
  slice_vm_profiles:
    h200_1g_12c_64g:  { gpu_count: 1, vcpu_count: 12,  memory_mib: 65536,  notes: "early prototype shape" }
    h200_1g_24c_64g:  { gpu_count: 1, vcpu_count: 24,  memory_mib: 65536,  notes: "latest tested" }
    h200_2g_48c_128g: { gpu_count: 2, vcpu_count: 48,  memory_mib: 131072, derived_from: h200_1g_24c_64g }
    h200_4g_96c_256g: { gpu_count: 4, vcpu_count: 96,  memory_mib: 262144, derived_from: h200_1g_24c_64g }
    h200_8g_192c_512g: { gpu_count: 8, vcpu_count: 192, memory_mib: 524288, derived_from: h200_1g_24c_64g }
```
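The `slice_vm_profiles` map is what turns a requested GPU count into concrete VM sizing. A hedged sketch of that lookup in Go, hard-coding the profiles seeded above (the helper and its names are illustrative, not the seed loader's real code):

```go
// Illustrative only: resolve a requested GPU count to one of the seeded
// slice_vm_profiles for h200-sxm-slice.
package capacity

import "fmt"

type SliceVMProfile struct {
	GPUCount  int
	VCPUCount int
	MemoryMiB int
}

// Mirrors the slice_vm_profiles block in the SKU above.
var h200SliceProfiles = map[string]SliceVMProfile{
	"h200_1g_24c_64g":   {GPUCount: 1, VCPUCount: 24, MemoryMiB: 65536},
	"h200_2g_48c_128g":  {GPUCount: 2, VCPUCount: 48, MemoryMiB: 131072},
	"h200_4g_96c_256g":  {GPUCount: 4, VCPUCount: 96, MemoryMiB: 262144},
	"h200_8g_192c_512g": {GPUCount: 8, VCPUCount: 192, MemoryMiB: 524288},
}

// profileForGPUCount picks the profile whose gpu_count matches the request.
func profileForGPUCount(gpusTotal int) (string, SliceVMProfile, error) {
	for name, p := range h200SliceProfiles {
		if p.GPUCount == gpusTotal {
			return name, p, nil
		}
	}
	return "", SliceVMProfile{}, fmt.Errorf("no slice VM profile for %d GPUs", gpusTotal)
}
```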
### `h200-sxm-baremetal-8g` — NVIDIA H200 full node

**Status:** Decided
## Slice slot model
Each row in node_resource_slots represents one schedulable bundle on a slice-mode node:
```mermaid
erDiagram
    nodes ||--o{ node_resource_slots : hosts
    node_resource_slots {
        uuid id PK
        uuid node_id FK
        int slot_index
        text status "available|reserved|provisioning|active|releasing|cleanup|failed|disabled"
        text capacity_shape "'gpu_slice'"
        text pci_address "GPU PCI"
        text nvme_device "/dev/disk/by-id/..."
        text mac_address "52:54:xx:xx:xx:xx"
        inet private_ip "10.100.0.{10+slot_index}"
        int numa_node
        text sharing_model "exclusive_device"
        int max_claims "1 for v1"
        jsonb capacity_metadata
    }
    node_resource_slots ||--o{ allocation_resource_claims : claimed
    allocation_resource_claims {
        uuid allocation_id FK
        uuid slot_id FK
        text claim_kind "slot|node_exclusive"
        text status "reserved|provisioning|active|releasing"
    }
```
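For reference, a Go-side view of one slot row, sketched from the diagram above; field names and types are illustrative and the orchestrator's real struct may differ:

```go
// Illustrative struct for one node_resource_slots row.
package capacity

import "net"

type NodeResourceSlot struct {
	ID               string            // uuid
	NodeID           string            // uuid of the hosting node
	SlotIndex        int
	Status           string            // available|reserved|provisioning|active|releasing|cleanup|failed|disabled
	CapacityShape    string            // always "gpu_slice" for slot rows
	PCIAddress       string            // GPU PCI address passed through to the VM
	NVMeDevice       string            // /dev/disk/by-id/... dedicated to the slice
	MACAddress       string            // 52:54:xx:xx:xx:xx
	PrivateIP        net.IP            // 10.100.0.{10+slot_index}
	NUMANode         int
	SharingModel     string            // "exclusive_device" in v1
	MaxClaims        int               // 1 in v1
	CapacityMetadata map[string]string // required keys listed in the next section
}
```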
## Required `capacity_metadata` keys
A slot must carry all of these before it becomes schedulable:
| Key | Required value | Why |
|---|---|---|
| `storage_ownership` | `slice` | Asserts the NVMe is dedicated to tenant slice use (not a host share) |
| `destructive_wipe_policy` | non-empty | Declares the erase contract on release |
| `fabric_claim_mode` | `per_slot_vf` | Concurrency safety: a shared parent BF/IB device is not safe for concurrent tenants |
| `fabric_vf_pci_address` | non-empty PCI address | The actual VF that gets passed through |
| `sku` | optional | Per-slot SKU override |
Source: packages/services/provisioning/orchestrator/service.go:1696–1709.
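A paraphrase of that check as a small Go helper, assuming the metadata arrives as a string map; the authoritative validation is the `service.go` range cited above:

```go
// Hedged paraphrase of the schedulability check described above; the actual
// validation lives in packages/services/provisioning/orchestrator/service.go.
package capacity

import "fmt"

// slotSchedulable reports whether a slot's capacity_metadata carries every
// required key with an acceptable value.
func slotSchedulable(meta map[string]string) error {
	if meta["storage_ownership"] != "slice" {
		return fmt.Errorf("storage_ownership must be %q", "slice")
	}
	if meta["destructive_wipe_policy"] == "" {
		return fmt.Errorf("destructive_wipe_policy must be non-empty")
	}
	if meta["fabric_claim_mode"] != "per_slot_vf" {
		return fmt.Errorf("fabric_claim_mode must be %q", "per_slot_vf")
	}
	if meta["fabric_vf_pci_address"] == "" {
		return fmt.Errorf("fabric_vf_pci_address must be a non-empty PCI address")
	}
	// "sku" is optional: a per-slot override of the node-level SKU.
	return nil
}
```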
## Allowed GPU counts
The orchestrator validates that `gpus_total` is in `allowed_gpu_counts` and that N slots can be reserved on the same node. Cross-node multi-GPU is not supported.
```mermaid
flowchart TB
    R([request: gpus_total=N])
    R --> A{N in allowed_gpu_counts?}
    A -- no --> X1[sku_unavailable]
    A -- yes --> B[list candidate slots<br/>filter by region, tenant boundary, slot metadata]
    B --> C{any node has N free slots?}
    C -- no --> X2[sku_unavailable]
    C -- yes --> D[rank candidates:<br/>1. single-NUMA fit<br/>2. remaining slots<br/>3. NUMA groups<br/>4. slot index]
    D --> E[FOR UPDATE SKIP LOCKED on chosen node]
    E --> F{still N available?}
    F -- no --> G[try next candidate]
    G --> E
    F -- yes --> H[reserve N slots, write outbox]
    H --> OK([allocation requested])
```
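The "still N available?" re-check above is what the row locks guard. A sketch of that reservation step in Go against the schema from the ER diagram, using Postgres-style placeholders; the real query, ranking, and retry loop live in the orchestrator service:

```go
// Sketch of the same-node reservation step, assuming the node_resource_slots
// schema shown earlier; not the orchestrator's actual query.
package capacity

import (
	"context"
	"database/sql"
	"fmt"
)

// reserveSlots tries to lock n available slots on one node and returns their
// IDs, or an error if another allocation raced us and fewer than n remain.
func reserveSlots(ctx context.Context, tx *sql.Tx, nodeID string, n int) ([]string, error) {
	rows, err := tx.QueryContext(ctx, `
		SELECT id
		FROM node_resource_slots
		WHERE node_id = $1 AND status = 'available'
		ORDER BY numa_node, slot_index
		LIMIT $2
		FOR UPDATE SKIP LOCKED`, nodeID, n)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	if err := rows.Err(); err != nil {
		return nil, err
	}
	if len(ids) < n {
		// Caller falls back to the next candidate node, as in the flow above.
		return nil, fmt.Errorf("only %d of %d slots still available on node %s", len(ids), n, nodeID)
	}
	return ids, nil
}
```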
## Why these constraints exist
- **Same-node only (`gpu_slice`):** PCI passthrough requires all devices to sit on one physical host. Cross-node placement would need an RDMA-aware overlay that isn't built.
- **Whole-GPU only (`gpu_slice`):** MIG/vGPU/MPS partitioning is Designed (the capacity shapes doc reserves `gpu_partition` / `gpu_shared`) but not implemented. The scheduler explicitly ignores any slot with `parent_slot_id`, `sharing_model != exclusive_device`, `max_claims > 1`, or `compute_milli < 1000` (see the sketch after this list).
- **Operator-approved slot map:** topology discovery returns candidates with `approval_required: true`; only operator action creates the actual `node_resource_slots` rows.
- **Per-slot VF fabric:** a shared host BF/IB parent device is not safe to share between concurrent tenants. Each slot must have its own SR-IOV VF.
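A minimal sketch of the whole-GPU filter referenced in the second bullet, with illustrative field names:

```go
// Illustrative whole-GPU eligibility filter; mirrors the constraints listed
// above rather than the scheduler's actual code.
package capacity

// slotFilterInput captures just the fields the v1 filter looks at.
type slotFilterInput struct {
	ParentSlotID string
	SharingModel string
	MaxClaims    int
	ComputeMilli int
}

// eligibleForV1 reports whether the scheduler would consider this slot:
// whole GPU, exclusive device, single claim, no parent partition.
func eligibleForV1(s slotFilterInput) bool {
	return s.ParentSlotID == "" &&
		s.SharingModel == "exclusive_device" &&
		s.MaxClaims <= 1 &&
		s.ComputeMilli >= 1000
}
```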
## Status by area
| Area | Status |
|---|---|
| `baremetal` allocation create/release | Implemented |
| `gpu_slice` allocation reservation (orchestrator) | Implemented |
| Slice VM provisioning (node-agent) | Implemented |
| Topology discovery (`slice.topology_discover`) | Implemented |
| MIG / vGPU / MPS sub-GPU shapes | Designed (reserved as `gpu_partition`, `gpu_shared`) |
| Multi-node slice clusters | Designed (non-goal for v1) |
| Full-reimage isolation between allocations | Designed (gated by `maas.enabled=true`) |
## Trace points

- GPU Slice as-built → the full code-grounded write-up
- GPU Slice trail → curated reading path for slice
- Design docs (under `source/`):