
Capacity shapes & SKUs

Status: Implemented · Decided

Source: scripts/seed.sql · packages/services/provisioning/orchestrator/service.go · doc/architecture/Allocation_Capacity_Shapes_and_GPU_Slices_v1.md

Two capacity shapes today

| Shape | Tenancy unit | Lifecycle | Schedulable inventory |
| --- | --- | --- | --- |
| baremetal | Whole physical node | MAAS deploy + release (optional full reimage) | nodes rows |
| gpu_slice | One VM per allocation with 1–N slot bundles | libvirt/QEMU on a slice-mode node | node_resource_slots rows |
flowchart LR
    classDef bm fill:#e3f2fd,stroke:#1565c0
    classDef sl fill:#fff3e0,stroke:#e65100

    REQ([Allocation request<br/>sku, gpus_total, region])
    REQ --> SK{capacity_shape?}
    SK -- baremetal --> BM[Lock whole node<br/>NOT EXISTS active claim]:::bm
    SK -- gpu_slice --> SL[Reserve N slot rows<br/>same node<br/>NUMA-fit ranking]:::sl
    BM --> ALOC[Allocation row +<br/>node_exclusive claim]:::bm
    SL --> ALOC2[Allocation row +<br/>N slot claims]:::sl
    ALOC --> EMIT[Outbox →<br/>provisioning.requested]
    ALOC2 --> EMIT
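
The branch at the top of the flow can be read as a small dispatch on capacity_shape. The sketch below is illustrative only: AllocationRequest, routeByShape, and the example region are hypothetical names, not the orchestrator's actual types or API.

```go
package main

import "fmt"

// AllocationRequest carries the request fields shown in the flowchart.
type AllocationRequest struct {
	SKU           string
	CapacityShape string // "baremetal" or "gpu_slice"
	GPUsTotal     int
	Region        string
}

// routeByShape mirrors the capacity_shape branch: baremetal locks a whole node
// and writes a node_exclusive claim; gpu_slice reserves N slot rows on a single
// node. Both paths end with an outbox write of provisioning.requested.
func routeByShape(req AllocationRequest) (claimKind string, err error) {
	switch req.CapacityShape {
	case "baremetal":
		return "node_exclusive", nil
	case "gpu_slice":
		if req.GPUsTotal < 1 {
			return "", fmt.Errorf("gpu_slice request needs gpus_total >= 1, got %d", req.GPUsTotal)
		}
		return "slot", nil
	default:
		return "", fmt.Errorf("unknown capacity_shape %q", req.CapacityShape)
	}
}

func main() {
	kind, err := routeByShape(AllocationRequest{SKU: "h200-sxm-slice", CapacityShape: "gpu_slice", GPUsTotal: 4, Region: "eu-1"})
	fmt.Println(kind, err) // slot <nil>
}
```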

Seeded SKUs

From scripts/seed.sql:

h200-sxm-slice — NVIDIA H200 GPU VM

sku: h200-sxm-slice
vendor: NVIDIA
display_name: NVIDIA H200 GPU VM
gpus_total: 8
capacity_shape: gpu_slice
allowed_gpu_counts: [1, 2, 4, 8]
gpu_hourly_price_minor: 450  # USD cents
active: true
resource_profile:
  fabric: ib
  slice_unit_gpu_count: 1
  default_slice_vm_profile: h200_1g_24c_64g
  slice_vm_profiles:
    h200_1g_12c_64g:   { gpu_count: 1, vcpu_count: 12,  memory_mib:  65536, notes: "early prototype shape" }
    h200_1g_24c_64g:   { gpu_count: 1, vcpu_count: 24,  memory_mib:  65536, notes: "latest tested" }
    h200_2g_48c_128g:  { gpu_count: 2, vcpu_count: 48,  memory_mib: 131072, derived_from: h200_1g_24c_64g }
    h200_4g_96c_256g:  { gpu_count: 4, vcpu_count: 96,  memory_mib: 262144, derived_from: h200_1g_24c_64g }
    h200_8g_192c_512g: { gpu_count: 8, vcpu_count: 192, memory_mib: 524288, derived_from: h200_1g_24c_64g }
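
For gpu_slice requests, the seeded resource_profile already pins a VM shape per GPU count. Below is a minimal sketch of how a profile could be resolved from gpus_total; SliceVMProfile and pickProfile are hypothetical names, and preferring default_slice_vm_profile for the 1-GPU case is an assumption based on the "latest tested" note above.

```go
package main

import (
	"fmt"
	"sort"
)

// SliceVMProfile mirrors one entry under slice_vm_profiles in the seed data.
type SliceVMProfile struct {
	GPUCount  int
	VCPUCount int
	MemoryMiB int
}

// pickProfile chooses the profile whose gpu_count matches the request. If the
// SKU's default profile matches, it wins; otherwise the first match in sorted
// name order is used so the choice stays deterministic.
func pickProfile(profiles map[string]SliceVMProfile, defaultName string, gpusTotal int) (string, error) {
	if p, ok := profiles[defaultName]; ok && p.GPUCount == gpusTotal {
		return defaultName, nil
	}
	names := make([]string, 0, len(profiles))
	for name := range profiles {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		if profiles[name].GPUCount == gpusTotal {
			return name, nil
		}
	}
	return "", fmt.Errorf("no slice_vm_profile with gpu_count=%d", gpusTotal)
}

func main() {
	profiles := map[string]SliceVMProfile{
		"h200_1g_24c_64g":  {GPUCount: 1, VCPUCount: 24, MemoryMiB: 65536},
		"h200_2g_48c_128g": {GPUCount: 2, VCPUCount: 48, MemoryMiB: 131072},
		"h200_4g_96c_256g": {GPUCount: 4, VCPUCount: 96, MemoryMiB: 262144},
	}
	fmt.Println(pickProfile(profiles, "h200_1g_24c_64g", 4)) // h200_4g_96c_256g <nil>
}
```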

h200-sxm-baremetal-8g — NVIDIA H200 full node

Status: Decided

sku: h200-sxm-baremetal-8g
capacity_shape: baremetal
gpu_count: 8
exclusive_node: true

Slice slot model

Each row in node_resource_slots represents one schedulable bundle on a slice-mode node:

erDiagram
    nodes ||--o{ node_resource_slots : hosts
    node_resource_slots {
        uuid   id PK
        uuid   node_id FK
        int    slot_index
        text   status "available|reserved|provisioning|active|releasing|cleanup|failed|disabled"
        text   capacity_shape "'gpu_slice'"
        text   pci_address "GPU PCI"
        text   nvme_device "/dev/disk/by-id/..."
        text   mac_address "52:54:xx:xx:xx:xx"
        inet   private_ip "10.100.0.{10+slot_index}"
        int    numa_node
        text   sharing_model "exclusive_device"
        int    max_claims "1 for v1"
        jsonb  capacity_metadata
    }
    node_resource_slots ||--o{ allocation_resource_claims : claimed
    allocation_resource_claims {
        uuid allocation_id FK
        uuid slot_id FK
        text claim_kind "slot|node_exclusive"
        text status "reserved|provisioning|active|releasing"
    }

Required capacity_metadata keys

A slot must carry all of the required keys below before it becomes schedulable:

| Key | Required value | Why |
| --- | --- | --- |
| storage_ownership | slice | Asserts the NVMe is dedicated to tenant slice use (not a host share) |
| destructive_wipe_policy | non-empty | Declares the erase contract on release |
| fabric_claim_mode | per_slot_vf | Concurrency safety: a shared parent BF/IB device is not sufficient isolation |
| fabric_vf_pci_address | non-empty PCI address | The actual VF that gets passed through |
| sku | optional | Per-slot SKU override |

Source: packages/services/provisioning/orchestrator/service.go:1696–1709.
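
The check reduces to a simple predicate over the slot's capacity_metadata. The sketch below paraphrases the documented rules; slotMetadataSchedulable is a hypothetical name, the example metadata values in main are made up, and the real validation lives at the service.go lines cited above.

```go
package main

import (
	"errors"
	"fmt"
)

// slotMetadataSchedulable returns nil only when a slot's capacity_metadata
// carries every required key with an acceptable value. "sku" is optional and
// therefore not checked.
func slotMetadataSchedulable(md map[string]any) error {
	if v, _ := md["storage_ownership"].(string); v != "slice" {
		return errors.New(`storage_ownership must be "slice"`)
	}
	if v, _ := md["destructive_wipe_policy"].(string); v == "" {
		return errors.New("destructive_wipe_policy must be non-empty")
	}
	if v, _ := md["fabric_claim_mode"].(string); v != "per_slot_vf" {
		return errors.New(`fabric_claim_mode must be "per_slot_vf"`)
	}
	if v, _ := md["fabric_vf_pci_address"].(string); v == "" {
		return errors.New("fabric_vf_pci_address must be a non-empty PCI address")
	}
	return nil
}

func main() {
	err := slotMetadataSchedulable(map[string]any{
		"storage_ownership":       "slice",
		"destructive_wipe_policy": "secure_erase", // example value only
		"fabric_claim_mode":       "per_slot_vf",
		"fabric_vf_pci_address":   "0000:c3:00.2", // example value only
	})
	fmt.Println(err) // <nil>: slot is schedulable
}
```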

Allowed GPU counts

The orchestrator validates gpus_total ∈ allowed_gpu_counts and that N slots can be reserved on the same node. Cross-node multi-GPU is not supported.

flowchart TB
    R([request: gpus_total=N])
    R --> A{N in allowed_gpu_counts?}
    A -- no --> X1[sku_unavailable]
    A -- yes --> B[list candidate slots<br/>filter by region, tenant boundary, slot metadata]
    B --> C{any node has N free slots?}
    C -- no --> X2[sku_unavailable]
    C -- yes --> D[rank candidates:<br/>1. single-NUMA fit<br/>2. remaining slots<br/>3. NUMA groups<br/>4. slot index]
    D --> E[FOR UPDATE SKIP LOCKED on chosen node]
    E --> F{still N available?}
    F -- no --> G[try next candidate]
    G --> E
    F -- yes --> H[reserve N slots, write outbox]
    H --> OK([allocation requested])
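
A compact way to read the flow above: validate the count, then order candidate nodes by the four documented tie-breakers before locking. The comparator below is a sketch; the field names are invented, and the direction of the "remaining slots" and "NUMA groups" tie-breaks (preferring fuller nodes and tighter placements) is an assumption, not confirmed by the source.

```go
package main

import (
	"fmt"
	"sort"
)

// gpuCountAllowed implements the first gate: gpus_total must be one of the
// SKU's allowed_gpu_counts.
func gpuCountAllowed(n int, allowed []int) bool {
	for _, a := range allowed {
		if a == n {
			return true
		}
	}
	return false
}

// candidate summarizes one node that has at least N free slots.
type candidate struct {
	nodeID        string
	singleNUMAFit bool // the N slots fit on one NUMA node
	remaining     int  // free slots left on the node after reserving N
	numaGroups    int  // NUMA nodes spanned by the chosen slots
	minSlotIndex  int
}

// rank orders candidates by the documented criteria: single-NUMA fit first,
// then (assumed) fewest remaining slots, fewest NUMA groups, lowest slot index.
func rank(cands []candidate) {
	sort.Slice(cands, func(i, j int) bool {
		a, b := cands[i], cands[j]
		if a.singleNUMAFit != b.singleNUMAFit {
			return a.singleNUMAFit
		}
		if a.remaining != b.remaining {
			return a.remaining < b.remaining
		}
		if a.numaGroups != b.numaGroups {
			return a.numaGroups < b.numaGroups
		}
		return a.minSlotIndex < b.minSlotIndex
	})
}

func main() {
	fmt.Println(gpuCountAllowed(3, []int{1, 2, 4, 8})) // false -> sku_unavailable
	cands := []candidate{
		{nodeID: "node-b", singleNUMAFit: false, remaining: 2, numaGroups: 2, minSlotIndex: 0},
		{nodeID: "node-a", singleNUMAFit: true, remaining: 4, numaGroups: 1, minSlotIndex: 2},
	}
	rank(cands)
	fmt.Println(cands[0].nodeID) // node-a: single-NUMA fit wins
}
```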

Why these constraints exist

  • Same-node only (gpu_slice): PCI passthrough requires devices to be on one physical host. Cross-node allocation would need an RDMA-aware overlay that isn't built.
  • Whole-GPU only (gpu_slice): MIG/vGPU/MPS partitioning is Designed (the capacity shapes doc reserves gpu_partition / gpu_shared) but not implemented. The scheduler explicitly ignores any slot with parent_slot_id, sharing_model != exclusive_device, max_claims > 1, or compute_milli < 1000; see the sketch after this list.
  • Operator-approved slot map: topology discovery returns candidates with approval_required: true; only operator action creates the actual node_resource_slots rows.
  • Per-slot VF fabric: sharing the host's BF/IB parent device between slices is not concurrency-safe. Each slot must have its own SR-IOV VF.
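
The whole-GPU-only rule referenced in the list above reduces to a simple eligibility predicate. The field names follow the ER diagram plus the reserved partition columns (parent_slot_id, compute_milli); the function itself and the sample inputs are illustrative, not the scheduler's actual code.

```go
package main

import "fmt"

// slotEligibleV1 reports whether a slot can be scheduled by the v1 gpu_slice
// scheduler: whole, exclusively owned GPUs only. Partitioned or shared slots
// (reserved for the future gpu_partition / gpu_shared shapes) are skipped.
func slotEligibleV1(parentSlotID *string, sharingModel string, maxClaims, computeMilli int) bool {
	if parentSlotID != nil {
		return false // child of a partitioned GPU
	}
	if sharingModel != "exclusive_device" {
		return false
	}
	if maxClaims > 1 {
		return false
	}
	if computeMilli < 1000 {
		return false // less than one whole GPU of compute
	}
	return true
}

func main() {
	fmt.Println(slotEligibleV1(nil, "exclusive_device", 1, 1000)) // true
	fmt.Println(slotEligibleV1(nil, "shared", 1, 500))            // false
}
```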

Status by area

| Area | Status |
| --- | --- |
| baremetal allocation create/release | Implemented |
| gpu_slice allocation reservation (orchestrator) | Implemented |
| Slice VM provisioning (node-agent) | Implemented |
| Topology discovery (slice.topology_discover) | Implemented |
| MIG / vGPU / MPS sub-GPU shapes | Designed (reserved as gpu_partition, gpu_shared) |
| Multi-node slice clusters | Designed (non-goal for v1) |
| Full-reimage isolation between allocations | Designed (gated by maas.enabled=true) |

Trace points