# Allocation Capacity Shapes and GPU Slices v1

## Purpose
Define how GPUaaS should evolve from whole-node allocations to multiple capacity shapes on the same physical node:
- `baremetal`: one allocation claims an entire physical node.
- `gpu_slice`: one allocation claims one or more host-local resource slots on a physical node.
This document is a design proposal. It does not change the current API or schema until the OpenAPI, database schema, placement service, billing, and node-agent contracts are updated.
Companion architecture docs:
- doc/architecture/Slice_Networking_Architecture_v1.md: slice VM management network, BF3/OVS, IPAM, public ingress, console access, and networking drift.
## Context
Current allocation placement is node-exclusive:
- allocation create selects a `nodes` row,
- `allocations.node_id` is the placement anchor,
- node availability is effectively binary,
- one active/releasing allocation blocks the whole node.
That model is correct for full bare-metal leasing, but it cannot represent an H200 host split into independent VM-backed slices where each slice owns a GPU, local NVMe device, IB/VF device, network identity, and NUMA placement.
The prototype in ~/Downloads/GPUaaS demonstrates the intended H200 product
shape: an 8-GPU node can expose eight fixed slice bundles. Each bundle maps:
- one GPU PCI device,
- one fabric/network device or VF,
- one raw NVMe device or volume,
- one NUMA node,
- one VM network identity.
The prototype is useful for intent, but GPUaaS must model this centrally rather than as a node-local Flask/libvirt control plane. The platform model must also support non-H200 nodes, including AMD GPU hosts that use RoCE rather than InfiniBand.
Prototype-derived implementation constraints that must not be lost:
- slice inventory is a host-approved topology map, not a blind GPU count;
- node bootstrap has hard prerequisites such as IOMMU, VFIO, SR-IOV, OVS, libvirt, cloud-init tooling, and sometimes a reboot;
- VM disks may be destructive raw NVMe assignments, so disk safety and wipe verification are part of scheduling correctness;
- DHCP/NAT/overlay/public ingress state must be reconciled with control-plane allocation state;
- any VM console access must be authenticated and audited, not exposed as raw VNC from the node.
## Terminology

### Physical Node
A GPU server enrolled in GPUaaS and managed by node-agent. A node has lifecycle, identity, health, region, SKU class, topology, and host-level prerequisites.
### Capacity Shape

The type of allocation realization:

- `baremetal`: exclusive node lease.
- `gpu_slice`: VM or isolated runtime lease on a subset of node-local resources.
Capacity shape is a platform primitive. It is not an app-specific concept.
Future fractional or shared-GPU products should be added as explicit capacity shapes, for example `gpu_partition` or `gpu_shared`, only after the underlying hardware/runtime mechanism is selected and tested. Do not overload `gpu_slice` to mean arbitrary time sharing, MPS, MIG, vGPU, or software multiplexing.
### Resource Slot

A schedulable bundle of host-local resources that can be safely assigned to one allocation. A slot is a platform-approved schedulable unit, not necessarily a whole physical GPU forever. For the first H200 slice mode, each slot is expected to map to one full GPU device, but the abstraction must leave room for future child slots that represent MIG/vGPU/fractional resources under a parent physical GPU. For H200 slice mode, a slot is therefore not just a GPU count; it is a bundle:
- GPU device identity and PCI address,
- optional MIG profile or full-GPU flag,
- NVMe device or volume identity,
- network fabric and VF/device identity,
- NUMA locality,
- VM/runtime metadata such as MAC/IP pool reference,
- CPU and memory reservation metadata,
- health and lifecycle state.
Fractional or shared-GPU inventory, if introduced later, should still be modeled as approved slots or child slots with explicit parentage and capacity metadata. The scheduler must never infer fractional availability from raw GPU memory or device count alone.
### Resource Claim
The durable binding between an allocation and one or more resource slots.
### Allocation
The customer-facing lease and billing anchor. An allocation is no longer synonymous with "one node". It is a request for capacity that is realized by a node claim or slot claims.
## Core Decision

Do not make `allocations.node_id` the only source of placement truth going forward.
Instead, model placement as:
- `allocations`: customer lease, lifecycle, billing anchor, request metadata.
- `allocation_placements`: optional resolved placement summary/read model for an allocation.
- `allocation_resource_claims`: one row per claimed physical node or resource slot.
- `node_resource_slots`: host-local schedulable inventory discovered or approved per node.

`allocations.node_id` can remain as a compatibility/read-model field during migration, but new scheduling logic should depend on placement/claim tables. `allocation_resource_claims` is the required relational source of truth.

`allocation_placements` is not required as a separate table for v1 if the same read-model fields can live safely on `allocations`, for example `capacity_shape`, `placement_status`, and a compatibility `node_id`. Treat the placement summary as an implementation choice:

- prefer columns on `allocations` if the summary remains 1:1 and simple;
- use a separate `allocation_placements` table only if the summary grows enough to justify its own ownership, lifecycle, or refresh semantics;
- never make scheduling correctness depend on the placement summary instead of claims and slots.
## Placement Rules

### Baremetal
A baremetal allocation:
- requires an active node matching region, SKU, tenant boundary, and health,
- claims the whole node,
- blocks all resource slots on that node,
- is blocked if any slot on that node is already reserved/provisioning/active,
- may use MAAS full reimage isolation on release when policy requires.
### GPU Slice

A `gpu_slice` allocation:
- requires one or more compatible slots,
- must fit on a single physical node for the first implementation,
- cannot span node boundaries,
- must lock selected slots atomically,
- must fail with `sku_unavailable` when no same-node compatible slot set exists,
- should prefer topology-compatible placement, such as same-NUMA placement for multi-GPU slice requests, when possible.
Fabric compatibility is part of slot compatibility:
- NVIDIA H200/IB slice products should require slots with the expected InfiniBand/HCA capability when the SKU or app profile requires it.
- AMD/RoCE slice products should require RoCE-capable NIC/VF resources and should validate the RoCE path separately from IB checks.
- Apps that do not require a low-latency fabric can use products whose fabric requirement is `ethernet` or `none`, but they must not be scheduled onto a SKU that promises IB/RoCE semantics unless slot validation passed.
### Mixed Node State
A node can be:
- fully available,
- partially allocated,
- fully allocated by slices,
- exclusively allocated by baremetal,
- draining,
- cleanup blocked,
- unavailable.
Node occupancy must become a derived aggregate of node status plus slot and claim states, not a single active-allocation predicate.
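A minimal Go sketch of that derivation, assuming illustrative state names rather than the final schema; draining and unavailable come from node status and admin intent and are omitted here:

```go
package placement

// SlotState mirrors node_resource_slots.status; only the states the
// derivation needs are named here.
type SlotState string

const (
	SlotAvailable SlotState = "available"
	SlotCleanup   SlotState = "cleanup"
	SlotFailed    SlotState = "failed"
)

type NodeOccupancy string

const (
	FullyAvailable     NodeOccupancy = "fully_available"
	PartiallyAllocated NodeOccupancy = "partially_allocated"
	FullySliced        NodeOccupancy = "fully_allocated_by_slices"
	BaremetalExclusive NodeOccupancy = "exclusively_allocated"
	CleanupBlocked     NodeOccupancy = "cleanup_blocked"
)

// DeriveOccupancy aggregates claim and slot state instead of using a
// single active-allocation predicate.
func DeriveOccupancy(hasNodeExclusiveClaim bool, slots []SlotState) NodeOccupancy {
	if hasNodeExclusiveClaim {
		return BaremetalExclusive
	}
	var free, busy, blocked int
	for _, s := range slots {
		switch s {
		case SlotAvailable:
			free++
		case SlotCleanup, SlotFailed:
			blocked++
		default: // reserved, provisioning, active, releasing
			busy++
		}
	}
	switch {
	case blocked > 0 && busy == 0 && free == 0:
		return CleanupBlocked
	case busy > 0 && free == 0:
		return FullySliced
	case busy > 0 || blocked > 0:
		return PartiallyAllocated
	default:
		return FullyAvailable
	}
}
```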
### Mixed Node Transitions
Node mode is derived from active claims plus admin intent. The scheduler should apply these transition rules:
| From | To | Rule |
|---|---|---|
| fully available | slice use | allowed when node supports slice mode and has approved slots |
| fully available | baremetal use | allowed when no slots or claims are reserved/active |
| partially allocated by slices | baremetal use | blocked until all slice claims are released and cleanup passes |
| partially allocated by slices | draining | allowed; no new slice claims are scheduled, existing allocations continue until user/admin release |
| fully allocated by slices | draining | allowed; node becomes available only after all slots release and cleanup succeeds |
| any active allocation | unavailable | allowed for health failure; billing/allocation policy decides whether to fail, migrate later, or keep assigned |
| slot cleanup failed | cleanup blocked | block only affected slot unless health evidence indicates node-wide corruption |
| cleanup blocked | fully/partially available | allowed after admin repair or successful probe clears blocked slots |
| baremetal active | slice use | blocked until baremetal allocation releases and node cleanup/reimage completes |
| host share/storage mounted | slice use | blocked until admin/infra drain unmounts or remaps the devices and approves destructive slice ownership |
| slice active | baremetal/share use | blocked until every slice claim releases, cleanup proof passes, and infra remounts or recreates the baremetal storage layout |
| slice-capable | baremetal-capable | mode is derived from claims plus storage ownership; admins may drain to force the next allocation shape |
Admin drain should be non-disruptive by default. It prevents new placement but does not stop active customer allocations unless a separate retire/force-release policy is invoked.
Slice capability must be a promotion decision, not a hardware inference. The platform should treat the MAAS `slice-host` tag or equivalent resource-pool membership as admin intent, deploy-time firmware profile evidence as readiness evidence, and approved `node_resource_slots` as the scheduling source of truth. Passing the readiness checks without the `slice-host` tag leaves the node baremetal-only for GPUaaS. Having the tag without a ready firmware profile leaves the node a slice candidate, but not deployable. Having both without approved slots leaves the node infra-ready, but not schedulable for `gpu_slice`.
MAAS and platform inventory tags should be composable. Use firmware profile intent tags such as `gpuaas-profile-slice-vm` and `gpuaas-profile-baremetal` to select the deploy-time cloud-init profile, plus independent hardware tags such as `gpu-nvidia-h200`, `server-dell-xe9680l`, and `fabric-bf3`. SKU derivation should combine hardware tags, selected firmware profile, readiness state, and approved slots instead of overloading one SKU-specific tag. This keeps the same path usable for H100, B-series, AMD, or later accelerator hosts.
The slice VM and baremetal firmware profiles must be maintained independently because their RACADM settings may diverge. A machine should not carry both GPUaaS profile tags in automated onboarding unless infra is explicitly testing a lab override; otherwise the cloud-init helper should block the node to avoid applying conflicting firmware policy.
The user-facing catalog and admin views should distinguish capability from current occupancy:
- `baremetal_only`: no slice VM profile intent is recorded for the node;
- `slice_candidate`: slice VM profile intent exists but readiness or slot approval is incomplete;
- `both_capable`: baremetal and slice products can target the node, subject to mutually exclusive placement;
- `baremetal_active`: whole-node placement currently blocks slice placement;
- `slice_active`: one or more slice claims currently block baremetal placement.
Nodes with convertible NVMe pools require an explicit storage-ownership transition even if the compute claims are clear. A mounted host share partition is valid in baremetal/share mode, but it is not valid evidence for a schedulable slice slot. Slice approval must happen after the node is drained, the share storage is unmounted or remapped, and the destructive wipe policy is accepted.
## Proposed Data Model

This is directional and must be translated into `db_schema_v1.sql` only after API contracts are updated.

### node_resource_slots
Fields:
- `id uuid primary key`
- `node_id uuid not null references nodes(id)`
- `slot_index integer not null`
- `shape text not null` such as `gpu_slice`
- `status text not null` such as `available`, `reserved`, `provisioning`, `active`, `releasing`, `cleanup`, `failed`, `disabled`
- `gpu_model text`
- `gpu_count integer not null default 1`
- `gpu_pci_addr text`
- `gpu_uuid text`
- `mig_profile text null`
- `nvme_device text null`
- `network_fabric text null` such as `infiniband`, `roce`, `ethernet`, or `none`
- `network_device text null`
- `fabric_pci_addr text null`
- `fabric_device_id text null`
- `fabric_metadata jsonb not null default '{}'`
- `numa_node integer null`
- `vcpu_count integer null`
- `memory_mib integer null`
- `mac_address text null`
- `private_ip inet null`
- `capacity_metadata jsonb not null default '{}'`
- `health_metadata jsonb not null default '{}'`
- uniqueness on `(node_id, slot_index)`
Fractional/shared-GPU support should extend this model without changing the claim invariant. The schema and admin slot contract may carry reserved fields before the scheduler uses them:
- `parent_slot_id uuid null references node_resource_slots(id)`, or equivalent parent resource identity for child slots under a physical GPU;
- `sharing_model text` such as `exclusive_device`, `mig_partition`, `mdev_vgpu`, `time_sliced`, `mps`, or `software_shared`;
- `profile_name text` for the vendor/runtime profile, for example a MIG or vGPU profile;
- quantitative capacity such as `gpu_memory_mib`, `compute_milli`, or `max_claims`;
- explicit compatibility and reset metadata in `capacity_metadata`.
For slice-capable slots, `capacity_metadata.fabric_claim_mode` is the fabric approval gate. The v1 GPU VM slice product is per-slot VF backed: schedulable whole-GPU slice slots must set `fabric_claim_mode=per_slot_vf` and include a non-empty `fabric_vf_pci_address` for the isolated VF or equivalent attachment. `fabric_device` may carry the parent BF/IB device for operator context, but duplicate parent fabric devices are not sufficient proof of safe concurrency. Slots with a missing `fabric_claim_mode`, or with `exclusive_device` or `shared_host`, are not schedulable for GPU VM slices. Keep those slots disabled or cleanup-blocked until cloud-init/approval automation records the per-slot VF identity. Do not set `per_slot_vf` as a workaround for bad discovery; if the candidate map reused the same BF/IB device because discovery was incomplete, the node-agent/node-scheduler must validate and publish the VF-backed slot model before the marketplace can advertise concurrent slices.
These fields are reserved for model extensibility only. The first slice implementation must keep `capacity_shape=gpu_slice` as exclusive whole-GPU VM slots unless a separate fractional/shared-GPU phase changes the public capacity shape enum, scheduler, node-agent, and billing rules. Admin visibility of these fields does not make fractional/shared GPU products schedulable.
The v1 slice scheduler must explicitly ignore any slot with `parent_slot_id` set, a `sharing_model` other than `exclusive_device`, `max_claims` greater than one, or `compute_milli` below one whole accelerator unit. This prevents early topology experiments from being sold as whole-GPU slices by mistake.
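A hedged sketch of that eligibility predicate, with field names assumed from the directional slot model above rather than a finalized schema:

```go
package scheduler

// Slot carries only the fields the v1 eligibility rule inspects.
type Slot struct {
	ParentSlotID    *string
	SharingModel    string // "exclusive_device", "mig_partition", ...
	MaxClaims       int
	ComputeMilli    int    // 1000 == one whole accelerator
	FabricClaimMode string // must be "per_slot_vf" for v1 GPU VM slices
	FabricVFPCIAddr string
	Status          string
}

// EligibleForV1Slice returns false for any slot the v1 scheduler must ignore.
func EligibleForV1Slice(s Slot) bool {
	if s.Status != "available" {
		return false
	}
	if s.ParentSlotID != nil {
		return false // child/fractional slots are not schedulable in v1
	}
	if s.SharingModel != "exclusive_device" {
		return false
	}
	if s.MaxClaims > 1 {
		return false
	}
	if s.ComputeMilli != 0 && s.ComputeMilli < 1000 {
		return false // below one whole accelerator unit
	}
	// Fabric approval gate: per-slot VF identity must be recorded.
	if s.FabricClaimMode != "per_slot_vf" || s.FabricVFPCIAddr == "" {
		return false
	}
	return true
}
```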
GPU_Slice_End_to_End_Readiness_Decisions_v1.md is the current decision record
for the final v1 slice model: same-node placement, per-slot VF-backed fabric,
node-scheduler validation, MAAS tag intent, and launch/task UX boundaries.
Do not hardcode the slot schema around InfiniBand. H200 SXM systems may use InfiniBand/HCA passthrough. AMD GPU systems we have used are expected to use RoCE over Ethernet NICs. Both need first-class fabric metadata because placement, network validation, VM attachment, and app compatibility differ.
The latest tested prototype uses fixed per-slot VM defaults of 24 vCPUs and 64 GiB memory for a one-GPU slice. Earlier scripts used or discussed 12 vCPUs, so these values must not be buried inside node-agent scripts. They should come from SKU/profile metadata or approved slot capacity so scheduling, billing, app validation, and benchmark expectations all see the same resource envelope.
Slot discovery must produce a candidate map, not an automatically trusted map.
The prototype can derive GPU/NVMe ordering from host commands, but IB devices,
BF3 VFs, skipped NVMe device numbers, MAC/IP assignment, and NUMA-safe grouping
need admin approval or a site-specific topology profile before the slots become
available.
Discovery requirements:
- enumerate GPUs from PCI/sysfs rather than hardcoded device lists;
- enumerate NVMe devices by topology and admin-approved disk identity, not by hardcoded size filters such as `3.5T|3.8T`;
- enumerate IB/RDMA devices from `/sys/class/infiniband` and PCI metadata, including Mellanox vendor ID `15b3` where applicable;
- include IB/RDMA port state, link layer, GUID, PCI address, and NUMA metadata in the candidate bundle;
- surface the candidate map for admin approval before slots become schedulable.
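As an illustration of the candidate-map direction, a small Go sketch that enumerates IB/RDMA devices from `/sys/class/infiniband`; the `CandidateFabricDevice` shape is an assumption, not the final contract:

```go
package discovery

import (
	"os"
	"path/filepath"
	"strings"
)

type CandidateFabricDevice struct {
	Name     string // e.g. mlx5_0
	PCIAddr  string // resolved from the sysfs device symlink
	NodeGUID string
	NUMANode string
}

// DiscoverIBDevices builds candidate fabric bundles from sysfs. The result
// is advisory and must still pass admin approval before becoming slots.
func DiscoverIBDevices() ([]CandidateFabricDevice, error) {
	entries, err := os.ReadDir("/sys/class/infiniband")
	if err != nil {
		return nil, err // no RDMA devices or no sysfs support
	}
	var out []CandidateFabricDevice
	for _, e := range entries {
		base := filepath.Join("/sys/class/infiniband", e.Name())
		dev := CandidateFabricDevice{Name: e.Name()}
		// The "device" symlink points at the PCI function.
		if target, err := filepath.EvalSymlinks(filepath.Join(base, "device")); err == nil {
			dev.PCIAddr = filepath.Base(target)
		}
		if b, err := os.ReadFile(filepath.Join(base, "node_guid")); err == nil {
			dev.NodeGUID = strings.TrimSpace(string(b))
		}
		if b, err := os.ReadFile(filepath.Join(base, "device", "numa_node")); err == nil {
			dev.NUMANode = strings.TrimSpace(string(b))
		}
		out = append(out, dev)
	}
	return out, nil
}
```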
The first platform API for this flow is `POST /api/v1/admin/nodes/{node_id}/slice-topology/discovery`, which queues a signed `slice.topology_discover` node task, and `GET /api/v1/admin/nodes/{node_id}/slice-topology/discovery`, which reads the latest advisory task output. This deliberately uses `node_tasks.output` as the candidate-report store in v1 so discovered topology remains separate from approved `node_resource_slots`; promotion into schedulable slots is still an explicit admin action through the resource-slots API.
### allocation_placements

Fields:

- `allocation_id uuid primary key references allocations(id)`
- `capacity_shape text not null`
- `node_id uuid not null references nodes(id)`
- `placement_status text not null`
- `placement_metadata jsonb not null default '{}'`
- `created_at`, `updated_at`

This table gives fast read access for API/UI without overloading `allocations.node_id`.
### allocation_resource_claims

Fields:

- `id uuid primary key`
- `allocation_id uuid not null references allocations(id)`
- `node_id uuid not null references nodes(id)`
- `slot_id uuid null references node_resource_slots(id)`
- `claim_kind text not null` such as `node_exclusive` or `slot`
- `status text not null` such as `reserved`, `provisioning`, `active`, `releasing`, `released`, `failed`
- `resource_snapshot jsonb not null`
- `created_at`, `released_at`

For baremetal, create one `node_exclusive` claim with `slot_id = null`. For slices, create one slot claim per selected slot.
## Scheduling Transaction
Placement and allocation creation must remain atomic.
Longer-term placement responsibility is defined in
doc/architecture/Hierarchical_Placement_and_Node_Scheduler_v1.md. The current
control-plane slot-selection flow is a bootstrap implementation. The intended
direction is region and cluster placement in the control plane, with
node-agent-owned node scheduling for host-local GPU, fabric, disk, CPU, memory,
and network bundle selection.
For gpu_slice:
- start transaction,
- select candidate active nodes by region, tenant boundary, SKU/product class, and shape support, excluding nodes with a node-exclusive claim,
- select compatible slots on each candidate node with `FOR UPDATE SKIP LOCKED`,
- require all requested slots to come from the same node,
- insert allocation,
- insert placement row,
- insert claim rows,
- mark selected slots `reserved`,
- write outbox event,
- commit.
For baremetal:
- lock candidate node,
- ensure no active/reserved slot claims exist,
- insert allocation,
- insert node-exclusive claim,
- mark placement,
- write outbox event,
- commit.
The current `NOT EXISTS` active-allocation-on-node predicate should be replaced with claim-aware predicates.
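A directional sketch of claim-aware slot selection, using Go's standard `database/sql` and the directional table names above; this is not the final scheduler, and the caller iterates candidate nodes:

```go
package scheduler

import (
	"context"
	"database/sql"
)

// reserveSlots locks up to gpuCount compatible slots on one candidate node.
// The caller treats fewer returned rows than requested as a miss and moves
// on to the next candidate node.
func reserveSlots(ctx context.Context, tx *sql.Tx, nodeID string, gpuCount int) ([]string, error) {
	rows, err := tx.QueryContext(ctx, `
		SELECT id
		FROM node_resource_slots
		WHERE node_id = $1
		  AND shape = 'gpu_slice'
		  AND status = 'available'
		  AND NOT EXISTS (            -- claim-aware, not allocation-on-node
		        SELECT 1 FROM allocation_resource_claims c
		        WHERE c.node_id = $1
		          AND c.claim_kind = 'node_exclusive'
		          AND c.status IN ('reserved','provisioning','active','releasing'))
		ORDER BY slot_index
		LIMIT $2
		FOR UPDATE SKIP LOCKED`, nodeID, gpuCount)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}
```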
## Fragmentation Strategy
Fragmentation avoidance is a v1 scheduling requirement, not a later optimization. With eight-slot nodes, greedy small allocations can quickly make a future 4-GPU request impossible even when aggregate free GPUs exist.
Phase 3 should ship with a deterministic topology-aware best-fit policy:
- group candidate slots by node and topology group, such as NUMA domain or approved slot group;
- filter groups that satisfy the requested GPU count, fabric, storage, and health constraints;
- prefer the candidate that leaves the smallest usable remainder while still respecting allowed future shapes;
- pack from stable slot ordering inside each topology group;
- avoid placing 1-GPU requests into a clean 4-GPU group when another fragmented group can satisfy the request;
- expose a scheduling reason when aggregate capacity exists but topology-safe capacity does not.
For H200 v1 with allowed counts [1, 2, 4], the scheduler should understand
approved groups such as two 4-slot NUMA groups when topology discovery supports
that. A 4-GPU request should reserve a whole 4-slot group. A 2-GPU request
should prefer a group that is already partially used or a paired subgroup that
does not strand a 4-GPU placement.
Optional later controls:
- reservation/draining mode for preserving large shapes during high demand;
- per-SKU fragmentation budgets;
- admin visibility showing why free slots cannot satisfy a request;
- defragmentation through natural release only. Live migration is not a v1 defragmentation mechanism.
This same packing policy should preserve baremetal optionality. Prefer placing new slice requests on nodes that are already in slice mode before converting a clean baremetal-capable node. If a baremetal request needs a node currently occupied by slices, v1 should support drain plus natural release, and later a user-visible stop-and-recreate evacuation flow. Transparent live migration is not expected for GPU/NVMe/IB passthrough VMs.
Implementation note: keep concurrency control in PostgreSQL, but do not assume the whole best-fit decision must be one SQL statement. The likely v1 shape is:
- query candidate nodes and compatible free slots with enough metadata to score topology groups;
- score fragmentation/topology candidates in Go using deterministic ordering;
- start or continue a short transaction;
- lock the selected slot rows with `FOR UPDATE SKIP LOCKED`;
- revalidate that the selected group still satisfies the request;
- commit claims, slot state changes, allocation state, and outbox atomically.
If the selected group cannot be locked because another scheduler won the race, retry with the next scored candidate. Prototype this query early because it is a core scheduler risk: SQL should narrow the candidate set, while Go may own the fragmentation scoring if the topology rules become too complex for maintainable SQL.
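A minimal Go sketch of the deterministic scoring step, with illustrative group shapes; only the ordering logic is shown:

```go
package scheduler

import "sort"

// Group is one topology group (for example a NUMA domain) on one node,
// with its free slots in stable order.
type Group struct {
	NodeID    string
	NUMANode  int
	FreeSlots []string
}

// scoreCandidates orders groups so the winner leaves the smallest usable
// remainder; ties break on stable identifiers for determinism.
func scoreCandidates(groups []Group, want int) []Group {
	fit := groups[:0]
	for _, g := range groups {
		if len(g.FreeSlots) >= want {
			fit = append(fit, g)
		}
	}
	sort.SliceStable(fit, func(i, j int) bool {
		ri, rj := len(fit[i].FreeSlots)-want, len(fit[j].FreeSlots)-want
		if ri != rj {
			return ri < rj // prefer the smallest usable remainder
		}
		if fit[i].NodeID != fit[j].NodeID {
			return fit[i].NodeID < fit[j].NodeID
		}
		return fit[i].NUMANode < fit[j].NUMANode
	})
	return fit
}
```

The caller walks the scored order, locking and revalidating each group inside a short transaction until one commits; a lost race falls through to the next candidate.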
## Marketplace Capacity Accounting
Catalog capacity must not treat slice slots as additional bare-metal nodes. A physical H200 host can be sold either as a full bare-metal node or as approved GPU VM slots, but those are competing views of the same underlying GPUs.
`NodeSummary.slot_summary.by_sku[]` reports schedulable slot capacity using the SKU stored in slot metadata. Marketplace and launch-wizard code should use this per-SKU breakdown for `gpu_slice` products and use `node.sku` plus `in_use` only for full bare-metal products. A node with any active or pending slice claim is not bare-metal free until every slice claim has released and cleanup proof has completed.
For j22u05, raw slot inventory can show seven available H200 slice slots while one slice is active. That number is not additive with H200 bare-metal capacity. The launchable count may be lower than raw available slots when fabric, NVMe, or BF ownership rules exclude otherwise-free slots. Until the BF/fabric model is settled, the UI should prefer an explicit "capacity blocked by topology/fabric policy" reason over silently advertising those slots as generally launchable.
## Node-Agent Runtime Implications

Node-agent remains scoped to a physical node, but node tasks need placement details:

- `allocation_id`
- `capacity_shape`
- `claim_ids`
- `slot_ids`
- device bundle snapshot
- requested OS/image/runtime profile
- isolation/release policy
For gpu_slice, the node-agent should execute typed tasks such as:
- check host prerequisites such as IOMMU, VFIO, SR-IOV, OVS, libvirt, UEFI firmware, cloud-init tooling, RDMA tools, and reboot-required status;
- discover and report candidate topology bundles;
- prepare VM disk or clone image;
- create cloud-init;
- attach GPU/fabric/NVMe/network devices;
- start/stop/delete VM;
- report VM IP/access endpoint;
- reconcile libvirt domain, DHCP lease, device binding, and disk state;
- wipe/reclaim slot resources;
- verify VM service readiness before reporting active;
- verify slot health after release.
Do not turn node-agent into a generic remote shell. The VM/slice lifecycle must be implemented as bounded typed tasks with signed inputs and structured outputs.
## Slice Guest Image Profiles
The first implementation can bootstrap drivers into an Ubuntu cloud image, but that path is slow and produces a confusing first-login experience when drivers are still installing. Production GPU VM launch should expose image/runtime profiles in the same launch wizard as SSH, storage, and network choices.
Minimum profiles:
- `ubuntu-24.04-base`: minimal Ubuntu image with GPU/RDMA driver bootstrap and health checks. Use when users want to bring their own CUDA/toolchain stack.
- `ubuntu-24.04-cuda-dev`: prebuilt CUDA-enabled developer image with NVIDIA utilities, CUDA toolkit/runtime, RDMA userland, headers, and common build tools. Use as the default for GPU VM slices.
Image selection is part of the allocation request model, not a host-local special case. The scheduler still selects slots by capacity and topology; the node-agent consumes the selected image/runtime profile while preparing the VM disk. Cached or prebuilt images must be invalidated by image version/digest so CI and deploy acceleration cannot serve stale guest artifacts.
## Hypervisor Decision For v1
Use a bounded libvirt/QEMU implementation for the first VM-backed slice runtime, controlled by node-agent typed tasks. The node-agent may call libvirt or a small local slice manager, but the control-plane contract should remain task-based so we can replace the local implementation without changing API semantics.
Networking details are owned by
doc/architecture/Slice_Networking_Architecture_v1.md.
Initial assumptions:
- GPU assignment uses VFIO passthrough of full GPU PCI devices for H200 slices.
- Management/public network assignment uses BF3 virtual functions attached through OVS, with one vNIC per VM and optional QoS throttling per vNIC.
- Fabric assignment uses the validated IB device exposed in the slot bundle. The IB card does not require an IP address for RDMA. IPoIB can be supported later when a workload requires it, but it is not required for v1.
- NVMe assignment uses a raw device, partition, volume, or host-backed disk declared in the slot bundle.
- Public ingress, when enabled, is not carried over IB. It is NAT or firewall mapping from public IP to the VM management vNIC, or a controlled overlay such as Tailscale/Funnel while firewall control is being resolved.
- MIG, SR-IOV variants, and vendor-specific partitioning are future extensions represented in slot metadata rather than separate allocation concepts.
GPU and fabric devices should normally remain bound to vfio-pci while a node
is operating in slice mode. Reattaching to the host GPU driver on every slice
release adds latency and device churn. Reattach devices only when a node is
drained, transitioned back to baremetal use, or repaired by an admin workflow.
Known constraint: PCI passthrough pins the VM to a physical node and specific devices. Live migration is not supported for v1 and should not be promised by the API or UI. Recovery from node failure is a reprovision/recreate workflow, not a live migration workflow.
## Host Bootstrap And Prerequisites
The slice-capable node bootstrap should be explicit and observable. Prototype setup requires packages and host configuration roughly equivalent to:
- `qemu-kvm`, `libvirt`, `virt-install`, bridge tooling, OVS;
- `cloud-image-utils`/`genisoimage` for cloud-init seed images;
- `qemu-img` image conversion support;
- RDMA/InfiniBand diagnostics and runtime packages;
- kernel arguments such as `intel_iommu=on iommu=pt` for Intel hosts;
- `vfio-pci` loaded at boot;
- host IP forwarding and controlled NAT when private management networking is used;
- optional hugepage reservation for tuned slice VM profiles that use QEMU hugepage-backed memory;
- optional persistent `driverctl` overrides for approved GPU and IB PCI devices when the node is dedicated to slice mode.
The first enablement of IOMMU/VFIO may require a host reboot. Node-agent bootstrap should report this as a lifecycle state such as `reboot_required` rather than silently continuing with partial capability. After reboot, bootstrap must be idempotent and able to resume.
Latest prototype tuning from GPUaaS_0416 adds host-side performance settings
that should be treated as an infra-approved slice-node profile rather than
blind application logic:
- `default_hugepagesz=1G hugepagesz=1G hugepages=512` plus `--memorybacking=hugepages=yes` for the tested VM shape.
- persistent VFIO binding for both H200 GPUs and ConnectX-7/IB devices with `driverctl`, after stopping host NVIDIA/Fabric Manager/RDMA services.
- host network sysctls for larger TCP/RDMA buffers and backlog.
- guest cloud-init tuning for NVIDIA server drivers, RDMA packages, larger buffers, IPoIB queue sizes, and unlimited memlock.
- UEFI boot with Secure Boot disabled to avoid guest driver/MOK prompts.
Do not copy the prototype's storage behavior directly. Its host setup unmounts
/dev/nvme*p1 share partitions and then treats the parent NVMe devices as raw
slice disks. In GPUaaS production flow, this can be a valid node-mode
transition only when infra has explicitly drained baremetal/share use,
unmounted or remapped the devices, assigned the disk to tenant-slice use, and
approved destructive wipe/reimage. Blind unmounting from node-agent lifecycle
code is not allowed.
## Slice Drift Reconciliation
The control plane cannot assume that VM state equals allocation state. The node-agent should periodically or on-demand reconcile:
- libvirt domain existence and runtime state;
- expected PCI devices detached/attached to the right VM;
- raw disk ownership, mount state, and wipe status;
- DHCP/IP lease for the expected MAC address;
- OVS port or VF attachment state;
- public NAT, firewall, or overlay exposure mappings;
- guest health and optional agent/SSH reachability.
Drift should surface in admin node details and block reuse only for the affected slot unless the evidence points to node-wide corruption.
## VM Readiness Gate
The prototype assumes a slice VM is ready after a fixed wait. GPUaaS should
report a slice allocation as active only after node-agent verifies the
management endpoint is usable.
Initial readiness gate:
- wait for the expected DHCP/IPAM lease or configured static private IP;
- poll TCP `:22` or the image/profile-specific management port;
- optionally verify guest-agent or cloud-init completion when the image supports it;
- fail the provisioning task with structured output if readiness times out.
The default timeout should be configurable by policy or SKU/profile. A practical starting point is 120 seconds for cloud-image slices.
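A minimal node-agent-side sketch of the TCP readiness poll, assuming the lease/IP is already known and the timeout comes from policy or SKU/profile metadata:

```go
package nodeagent

import (
	"fmt"
	"net"
	"time"
)

// waitForManagementPort polls the management endpoint (TCP :22 by default)
// until it accepts a connection or the configured timeout expires.
func waitForManagementPort(ip string, port int, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	addr := net.JoinHostPort(ip, fmt.Sprint(port))
	for time.Now().Before(deadline) {
		conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
		if err == nil {
			conn.Close()
			return nil // endpoint usable; continue with optional guest checks
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("slice VM %s:%d not reachable within %s", ip, port, timeout)
}
```

A call such as `waitForManagementPort(ip, 22, 120*time.Second)` matches the suggested cloud-image default.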
## SKU and Pricing Direction
SKU should become a sellable product/profile rather than only a node class. However, avoid one public SKU per slice size unless pricing or guarantees differ materially. The first H200 product model should keep the catalog simple:
- `h200-sxm-baremetal-8g`: exclusive 8-GPU H200 node.
- `h200-sxm-slice`: shared-node H200 GPU VM product with a constrained user-selected GPU count. The internal SKU may keep the `slice` suffix, but user-facing catalog copy should say "GPU VM" or "GPU VM slice", not "SXM Slice".
The user-facing slice flow should present `h200-sxm-slice` once, then let the user select from platform-approved GPU counts. It should remain a launch wizard: capacity, SSH keys, storage, and network/fabric choices belong inside the same flow so users do not leave provisioning to complete required setup. Do not expose raw 1..8 counts unless all counts are topology-safe and operationally useful.
Initial allowed counts:
- 1 GPU: smallest practical slice; maps cleanly to one GPU slot bundle.
- 2 GPUs: useful medium shape; should prefer or require same-NUMA placement depending on discovered topology.
- 4 GPUs: larger shape; should require a topology-aligned group, likely a same-NUMA or same-board boundary.
Initial disallowed or deferred counts:
- 3, 5, 6, and 7 GPUs: defer until a real workload justifies the capacity fragmentation and placement complexity.
- 8 GPUs as a slice is allowed only as a topology-aligned full-host VM shape for the current platform-control rollout. The scheduler still treats it as a slice allocation so users get the VM/runtime flow; operators must see that it consumes every GPU slot on the host.
This gives a clean product surface:
```yaml
sku: h200-sxm-slice
capacity_shape: gpu_slice
gpu_model: h200
allowed_gpu_counts: [1, 2, 4, 8]
same_node_required: true
exclusive_node: false
topology_policy:
  gpu_count_1: any_healthy_slot
  gpu_count_2: numa_aligned_preferred_or_required
  gpu_count_4: numa_aligned_required
  gpu_count_8: full_host_slot_group_required
fabric_claim_mode: per_slot_vf
fabric_vf_pci_address: <slot VF PCI address>
fabric_parent_device: <optional parent BF/IB PCI/device>
```

```yaml
sku: h200-sxm-baremetal-8g
capacity_shape: baremetal
gpu_model: h200
gpu_count: 8
exclusive_node: true
```
SKU metadata should include:
- `capacity_shape`
- accelerator vendor, model, and class
- allowed GPU counts for slice products, or fixed GPU count for baremetal
- whether it requires exclusive node access
- local storage requirement
- network fabric requirement, for example `infiniband`, `roce`, `ethernet`, or `none`
- NIC/VF attachment and validation requirements
- NUMA constraints
- compatible node topology profile
- price model
For future fractional or shared-GPU SKUs, metadata must be more specific than
gpu_count. It should include the partition/share mechanism, allowed profiles,
GPU memory or compute-share units, whether multiple claims can coexist on the
same parent GPU, and whether the profile is exclusive, overcommitted, or
time-sliced. Do not publish a fractional SKU until those semantics are explicit
and validated by node-agent reconciliation.
## Slice VM Runtime Profiles

Do not encode the VM shape as a single hardcoded value such as `vcpu_per_gpu=12` or `vcpu_per_gpu=24`. The prototype is still changing, and different workloads may need different CPU/memory envelopes for the same GPU count. Model the VM shape as a named runtime profile selected by SKU, app profile, admin override, or a later user-facing product option.
Short-term implementation can store this in `sku_catalog.resource_profile` without adding a new table:

```json
{
  "fabric": "ib",
  "slice_unit_gpu_count": 1,
  "default_slice_vm_profile": "h200_1g_24c_64g",
  "slice_vm_profiles": {
    "h200_1g_12c_64g": {
      "gpu_count": 1,
      "vcpu_count": 12,
      "memory_mib": 65536,
      "notes": "early prototype shape"
    },
    "h200_1g_24c_64g": {
      "gpu_count": 1,
      "vcpu_count": 24,
      "memory_mib": 65536,
      "hugepages": {
        "enabled": true,
        "page_size": "1G"
      },
      "guest_driver_profile": "ubuntu_24_04_nvidia_570_server_rdma",
      "notes": "latest tested prototype shape"
    }
  }
}
```
For multi-GPU counts, the default profile can scale from the selected one-GPU profile, but it should still be materialized as an explicit profile once tested so benchmark expectations, placement, and billing exports have stable names. Example derived profiles:
- `h200_2g_48c_128g`
- `h200_4g_96c_256g`
Selection rule for v1:
- if an app profile requires a specific slice VM profile, use that profile;
- otherwise use the SKU `default_slice_vm_profile`;
- allow admin-only overrides during lab tuning;
- write the selected profile name and resolved CPU/memory values into the allocation placement/claim snapshot so later seed changes do not mutate running allocations.
- record whether the profile requires hugepage-backed memory and fail provisioning if the host has not reported that capability.
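A sketch of this selection rule, assuming the `resource_profile` JSON above has been decoded into illustrative Go types:

```go
package scheduler

import "fmt"

type VMProfile struct {
	GPUCount  int
	VCPUCount int
	MemoryMiB int
	Hugepages bool
}

type SKUResourceProfile struct {
	DefaultSliceVMProfile string
	SliceVMProfiles       map[string]VMProfile
}

// resolveVMProfile applies: app-profile requirement first, then the SKU
// default, with an optional admin override for lab tuning.
func resolveVMProfile(sku SKUResourceProfile, appRequired, adminOverride string, hostHugepages bool) (string, VMProfile, error) {
	name := sku.DefaultSliceVMProfile
	if appRequired != "" {
		name = appRequired
	}
	if adminOverride != "" {
		name = adminOverride
	}
	p, ok := sku.SliceVMProfiles[name]
	if !ok {
		return "", VMProfile{}, fmt.Errorf("unknown slice VM profile %q", name)
	}
	if p.Hugepages && !hostHugepages {
		return "", VMProfile{}, fmt.Errorf("profile %q needs hugepages the host has not reported", name)
	}
	// The caller writes the name plus resolved values into the claim
	// snapshot so later seed changes cannot mutate running allocations.
	return name, p, nil
}
```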
When this stabilizes or needs per-region/per-node-pool variants, promote it to a first-class `slice_vm_profiles` table with audit history. Until then, keeping it in SKU metadata keeps the rollout fast and avoids schema churn while still preventing hidden constants in node-agent.
For NVIDIA H200 slice VMs, the v1 guest bootstrap profile is `ubuntu_24_04_nvidia_570_server_rdma`: Ubuntu 24.04 cloud image, UEFI with Secure Boot disabled, NVIDIA 570 server driver utilities, kernel headers for the running guest kernel, `linux-modules-extra` for the running guest kernel, `pciutils`, and RDMA/IB tools (`rdma-core`, `ibverbs-utils`, `infiniband-diags`). Provisioning readiness must not stop at SSH reachability; the node-agent waits for cloud-init to finish and requires a successful `nvidia-smi -L` before it marks the slice VM guest ready.
## Billing

Billing should stay allocation-based. There should be one usage record per allocation billing interval, not one independent billable record per slot. The billing worker should derive billable dimensions from immutable allocation placement/claim snapshots:
- rate source: SKU price model at allocation start, with any baremetal or slice premium encoded in the SKU/price record;
- GPU count: number of claimed GPU slots or fixed baremetal GPU count;
- capacity shape: `baremetal` or `gpu_slice`;
- node id and slot ids: recorded for audit/export, not as independent billing accounts;
- effective resource count changes: require an explicit resize or repair state transition, not silent billing drift.
If a 4-GPU slice cannot provide all four GPUs after launch, the allocation should enter a degraded/failed operational state according to provisioning policy. Do not silently bill it as 3 GPUs unless a future resize workflow changes the allocation contract and writes an auditable ledger adjustment.
Baremetal and slice can both be expressed as GPU-hour SKUs, but they do not need the same price. Slice may carry a markup for virtualization/isolation overhead, or baremetal may carry a node-exclusive minimum. The ledger should record the SKU, capacity shape, GPU count, slot count, and node id so usage exports explain why the price was charged.
For future fractional/shared GPU products, do not assume GPU count remains the
only billable dimension. The usage snapshot should support billable accelerator
units such as GPU-memory GiB-hours, compute-share hours, partition-profile
hours, or vendor vGPU-profile hours. The rate still comes from the SKU/price
record at allocation start, but the units must come from claim snapshots so a
later profile change cannot rewrite historical billing.
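A directional Go sketch of the snapshot-driven usage record, with field names assumed from the claim model above:

```go
package billing

import "time"

// ClaimSnapshot is the immutable placement/claim snapshot captured at
// allocation (or resize/repair) time.
type ClaimSnapshot struct {
	CapacityShape string // "baremetal" | "gpu_slice"
	SKU           string
	GPUCount      int      // claimed GPU slots or fixed baremetal count
	NodeID        string
	SlotIDs       []string // recorded for audit/export only
}

type UsageRecord struct {
	AllocationID  string
	SKU           string
	CapacityShape string
	GPUCount      int
	NodeID        string
	SlotIDs       []string
	PeriodStart   time.Time
	PeriodEnd     time.Time
}

// buildUsageRecord never re-reads live placement; a changed resource count
// requires an explicit resize/repair transition that writes a new snapshot.
func buildUsageRecord(allocID string, snap ClaimSnapshot, start, end time.Time) UsageRecord {
	return UsageRecord{
		AllocationID:  allocID,
		SKU:           snap.SKU,
		CapacityShape: snap.CapacityShape,
		GPUCount:      snap.GPUCount,
		NodeID:        snap.NodeID,
		SlotIDs:       snap.SlotIDs,
		PeriodStart:   start,
		PeriodEnd:     end,
	}
}
```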
## Vendor And Fabric Variants
The SKU catalog should distinguish product family and fabric without forcing a separate SKU for every GPU count.
Examples:
```yaml
sku: h200-sxm-slice
accelerator_vendor: nvidia
accelerator_model: h200
capacity_shape: gpu_slice
network_fabric: infiniband
allowed_gpu_counts: [1, 2, 4, 8]
```

```yaml
sku: amd-mi300x-roce-slice
accelerator_vendor: amd
accelerator_model: mi300x
capacity_shape: gpu_slice
network_fabric: roce
allowed_gpu_counts: [1, 2, 4]
```
The exact AMD SKU name should match the actual GPU model we onboard. The important rule is that RoCE is not treated as "missing IB"; it is a different fabric capability with its own validation and placement rules.
## OS Image Catalog
GPU slices are VM-like allocations, so the platform needs an OS image catalog before slices become generally usable. The image chosen for a slice is part of the allocation runtime contract, not a node-local implementation detail.
Do not rely on ad hoc files copied onto one node. Store image metadata in the control plane and make the artifact immutable by digest. The binary image can be hosted in platform object storage, an OCI artifact registry, a controlled HTTP location, or a MAAS image reference, but GPUaaS should track the digest and compatibility rules centrally.
Initial image targets:
- `baremetal`: image references are MAAS deploy images or MAAS-compatible image aliases. These are used for full-node deploy/reimage flows.
- `vm_slice`: image references are cloud images such as `qcow2` or `raw`, suitable for node-agent download/cache/clone and cloud-init injection.
Directional table:
```
os_images
  id uuid primary key
  slug text unique not null
  display_name text not null
  version text not null
  target text not null                  -- baremetal | vm_slice
  family text not null                  -- ubuntu | rocky | custom
  architecture text not null            -- amd64 | arm64
  accelerator_vendor text null          -- nvidia | amd | any
  network_fabric text null              -- infiniband | roce | ethernet | none | any
  image_format text not null            -- maas | qcow2 | raw | oci
  source_type text not null             -- maas | platform_storage | oci_artifact | http
  source_ref text not null
  digest_sha256 text not null
  size_bytes bigint null
  cloud_init_supported boolean not null default true
  guest_agent_supported boolean not null default false
  default_username text not null
  driver_strategy text not null         -- none | preinstalled_nvidia | preinstalled_rocm | install_on_boot
  compatibility_metadata jsonb not null default '{}'
  active boolean not null default true
  created_at timestamptz not null
  updated_at timestamptz not null
  retired_at timestamptz null
```
First implementation default:
```yaml
slug: ubuntu-24.04-gpuaas-slice
target: vm_slice
family: ubuntu
architecture: amd64
source_type: platform_storage
image_format: qcow2
driver_strategy: install_on_boot
cloud_init_supported: true
```
For GPU slice products, we should also support accelerator-specific optimized images:
```yaml
slug: h200-ubuntu-24.04-gpuaas-slice
target: vm_slice
family: ubuntu
architecture: amd64
accelerator_vendor: nvidia
accelerator_model: h200
source_type: platform_storage
image_format: qcow2
driver_strategy: preinstalled_nvidia
cloud_init_supported: true
compatible_skus: ["h200-sxm-slice"]
```
This is likely the preferred production path for H200 slices because it reduces startup time and removes driver install variability from the hot allocation path. The tradeoff is image maintenance: every driver/CUDA/security baseline change becomes an image build, publish, verify, and rollout event. GPUaaS should therefore keep both concepts:
- generic slice image, useful for development, CPU-only validation, and fallback bootstrapping;
- accelerator-specific slice image, preferred for production GPU products where startup latency and repeatability matter.
Admin API direction, contract-first when implemented:
- `GET /api/v1/admin/os-images`: list images with filters for target, active, family, accelerator vendor, and fabric.
- `POST /api/v1/admin/os-images`: add an image metadata record and optionally trigger verification/import.
- `GET /api/v1/admin/os-images/{image_id}`: inspect image metadata, compatibility, verification status, and current usage.
- `PATCH /api/v1/admin/os-images/{image_id}`: update mutable fields such as display name, active flag, default marker, compatibility metadata, or notes.
- `DELETE /api/v1/admin/os-images/{image_id}`: retire the image by default. Hard delete should be allowed only when the image has never been referenced by an allocation, app instance, or audit event.
Privileged image mutations require audit logs. The API should reject missing digests for immutable artifact sources. A later verification task should download or inspect the artifact, validate digest/format/cloud-init support, and mark it usable for scheduling.
Slice provisioning should pass `os_image_id`, `source_ref`, `digest_sha256`, and compatibility metadata to node-agent. Node-agent then downloads or reuses a local cache under a controlled path such as `/var/lib/gpuaas/images`, verifies the digest, clones a fresh disk for the allocation, injects cloud-init, and never boots directly from the shared cached image.
The prototype clones an Ubuntu cloud image directly to the raw NVMe device with `qemu-img convert -O raw`. GPUaaS can use the same fast path, but it needs guardrails:
- verify the destination block device is the approved slot disk and is not mounted by the host;
- require the image digest to match the control-plane catalog entry before cloning;
- treat clone/write failures as provisioning failures with structured node-task output;
- never allow a user-provided path to select the destination disk;
- explicitly wipe or recreate the disk on release before the slot is reusable.
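A hedged Go sketch of the digest and destination guardrails; the `qemu-img` invocation mirrors the prototype's usage, and the `/proc/mounts` check is a deliberately coarse illustration:

```go
package nodeagent

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"os/exec"
	"strings"
)

// verifyImageDigest streams the cached image and compares against the
// control-plane catalog digest before any clone is allowed.
func verifyImageDigest(path, wantSHA256 string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return err
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != wantSHA256 {
		return fmt.Errorf("image digest mismatch: got %s want %s", got, wantSHA256)
	}
	return nil
}

// cloneToSlotDisk refuses destinations that are not the approved slot disk
// or that appear mounted by the host.
func cloneToSlotDisk(image, approvedDisk, dest string) error {
	if dest != approvedDisk {
		return fmt.Errorf("destination %s is not the approved slot disk", dest)
	}
	mounts, err := os.ReadFile("/proc/mounts")
	if err != nil {
		return err
	}
	if strings.Contains(string(mounts), dest) {
		return fmt.Errorf("refusing clone: %s appears mounted on the host", dest)
	}
	return exec.Command("qemu-img", "convert", "-O", "raw", image, dest).Run()
}
```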
Manual ISO installer flows are useful for lab debugging, but they should not be part of the default customer slice product. If kept, expose them only as an admin/debug operation with console access, audit logging, and no billing promise until the VM reaches a managed-ready state.
Latest prototype runtime details worth carrying into implementation:
- guest boot uses UEFI with secure boot disabled because the driver and passthrough path currently assume that compatibility mode;
- optional cloud-init guest driver installation installs NVIDIA server driver packages and RDMA/IB packages, then reboots the guest;
- production should prefer accelerator-specific prebuilt images over hot-path guest driver installation, but the cloud-init install path remains useful for lab fallback;
- the prototype pins each one-GPU slice to NUMA-local CPU IDs, but the current
sample command requests 24 vCPUs while its static cpuset examples list only
12 CPU IDs per NUMA side. GPUaaS should derive a valid CPU set from topology
and requested `vcpu_count`, not copy the static list.
Latest prototype benchmark evidence:
| Metric | Bare metal | One-GPU VM slice | Delta |
|---|---|---|---|
| DCGM FP64 | 627 TFLOPS | 614 TFLOPS | -2.1% |
| HBM bandwidth | 4000 GiB/s | 4000 GiB/s | 0% |
| IB write bandwidth | 363 Gb/s | 206 Gb/s | -43% |
| vLLM OPT-125M | 2404 tokens/s | 1993 tokens/s | -17% |
Interpretation: PCI passthrough is strong enough for GPU compute and memory bandwidth, but fabric and vCPU tuning remain real scheduling/profile inputs. The first production profile should record expected benchmark thresholds and fail readiness or mark slots degraded when a node falls materially below them.
Follow-up tuning backlog:
- Fabric/BF3/IB: validate MTU, multi-queue, queue depth, IRQ affinity, NUMA locality, VF/representor mode, OVS offload, and per-vNIC QoS before deciding the production networking profile. The current IB delta is large enough that it should be tracked as a performance workstream, not treated as a normal virtualization cost.
- NVIDIA/GPU guest path: compare guest driver versions, CUDA stack, persistence mode, fabric manager behavior, and GPU reset behavior between bare metal and VM images. Prefer prebuilt accelerator-specific images once the working combination is known.
- CPU/vLLM: tune vCPU count, cpuset generation, CPU governor, hugepages, NUMA memory policy, tensor parallel settings, and container runtime options. The vLLM delta may be more CPU/orchestration-related than GPU-related.
- Benchmark contract: store benchmark profile name, command versions, driver versions, and expected thresholds with the slice VM runtime profile so readiness/acceptance tests are reproducible.
## OS Image Build And Rollout
The image catalog needs a separate build/publish pipeline spec before implementation. This allocation design assumes the pipeline provides:
- repeatable image build inputs, including base OS, kernel, driver/CUDA or ROCm stack, guest agent, and cloud-init support;
- automated boot and smoke tests before an image can become `active`;
- digest and optional signature/attestation capture;
- canary rollout by SKU, region, or node pool before fleet-wide defaulting;
- rollback by changing the default image pointer while keeping old image metadata available for existing allocations;
- node-agent cache eviction rules that never delete an image backing a running VM.
For now, the admin API can store and retire image records manually. Production slice rollout should not depend on manual image creation long term.
## App Runtime Compatibility

Apps should not care whether the underlying allocation is a node or a slice when their contract only needs allocation-local execution. The app should target an allocation, and the platform should decide whether that allocation is suitable.

### Allow Apps on Slices?
Yes, but only when the app profile declares compatibility with the allocation shape and resource envelope.
Initial rule:
- Launchable OCI/Jupyter/vLLM-style single-node apps may run on `gpu_slice` allocations if:
    - the allocation has enough GPU/CPU/memory/storage for the app request,
    - the slice runtime exposes Docker/Compose or the required container runtime,
    - the network exposure mode is supported for slice VMs,
    - the profile does not require host-level privileges.
- Single-node Slurm or Kubernetes profiles may run on `gpu_slice` allocations if the adapter keeps the entire control plane and worker runtime inside one allocation and does not require cross-node networking.
- Multi-node Slurm, Kubernetes, and other cluster-style control-plane app profiles should default to `baremetal` or remain unavailable until their adapter and the slice networking model explicitly support slice-hosted clusters.
- Apps that require multi-node workers, host networking, privileged device control, or full-node reimage must declare `requires_capacity_shape: baremetal`.
This rule does not prohibit infra-managed controller VMs on slice-capable hosts. For example, an operator may create a small Slurm controller or Kubernetes control-plane VM with no GPU claim, such as 24 vCPUs, 64 GiB RAM, and a 50 GiB disk, using virt-manager or a future platform-managed `system_vm` workflow. That is an infrastructure placement decision, not the same as advertising the Slurm or Kubernetes customer app adapter as slice-compatible.
The platform should eventually model these as system/control-plane placements:
- `capacity_shape=system_vm` or equivalent internal placement type;
- CPU, memory, disk, network, and host affinity requirements;
- no customer GPU slot claim unless explicitly needed;
- admin-only lifecycle and audit trail;
- exclusion from normal customer slice availability and billing unless product policy says otherwise.
Until project/private networking is available, the safe product boundary is:
- single-node Slurm/Kubernetes on one allocation can be considered slice compatible;
- multi-node Slurm/Kubernetes clusters should not be advertised on slices;
- baremetal remains the default for real multi-node cluster products.
### App Manifest Additions
Phase 4 should start with a small manifest contract:
- `requires_capacity_shape`, for example `["baremetal"]` or `["baremetal", "gpu_slice"]`;
- `min_gpu_count`;
- `requires_exclusive_node`.
The platform can infer most v1 fabric, device, runtime, and networking checks from the selected SKU, allocation placement, OCI profile, and app adapter. More specific declarations can be added later only when a real app needs them, such as explicit accelerator runtime, required fabric access, host networking, or slice NAT tolerance.
The app create flow should keep using `placement_intent.target_allocation_id` for allocation-local apps. The validation layer should inspect the target allocation placement and reject incompatible profiles before dispatching node tasks, as in the sketch below.
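A minimal validation sketch against the manifest fields from App Manifest Additions; the types are illustrative, not the final contract:

```go
package apps

import "fmt"

type Manifest struct {
	RequiresCapacityShape []string // e.g. ["baremetal", "gpu_slice"]
	MinGPUCount           int
	RequiresExclusiveNode bool
}

type Placement struct {
	CapacityShape string
	GPUCount      int
	ExclusiveNode bool
}

// validateTarget rejects incompatible profiles before node tasks dispatch.
func validateTarget(m Manifest, p Placement) error {
	shapeOK := false
	for _, s := range m.RequiresCapacityShape {
		if s == p.CapacityShape {
			shapeOK = true
			break
		}
	}
	if !shapeOK {
		return fmt.Errorf("app requires capacity shape %v, allocation is %s",
			m.RequiresCapacityShape, p.CapacityShape)
	}
	if p.GPUCount < m.MinGPUCount {
		return fmt.Errorf("app needs %d GPUs, allocation has %d", m.MinGPUCount, p.GPUCount)
	}
	if m.RequiresExclusiveNode && !p.ExclusiveNode {
		return fmt.Errorf("app requires exclusive node access")
	}
	return nil
}
```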
Future fractional/shared-GPU support should be opt-in per app adapter. Existing
gpu_slice compatibility must not automatically imply fractional compatibility,
because device visibility, isolation, driver behavior, performance guarantees,
and reset semantics differ between MIG, vGPU, MPS, and time-sliced sharing.
Initial fractional-compatible candidates are likely simple allocation-local
Jupyter or vLLM profiles after validation. Slurm, Kubernetes, and other
control-plane apps should remain on baremetal or whole-GPU gpu_slice
allocations until their adapters explicitly support fractional devices.
### App Instance Members

`app_instance_members.bound_node_id` is not sufficient for slices. Member views should eventually include:

- `bound_allocation_id`
- `bound_node_id`
- `bound_slot_ids`
- `capacity_shape`
- resource snapshot
This keeps app detail pages understandable when multiple app instances run on different slices of the same physical node.
## UI Implications

### User Allocation Flow
The allocation flow should ask for product shape before placement details:
- choose SKU/product: baremetal node or GPU slice,
- choose region,
- choose GPU count,
- optionally choose storage/network add-ons,
- review price and availability,
- create allocation.
Users should not be asked to pick raw PCI devices.
### Admin Nodes
Node detail must show:
- physical node status,
- agent version/health,
- total slots,
- used/reserved/cleanup/disabled slots,
- active allocations per slot,
- whether baremetal is currently blocked by slice use,
- whether slice scheduling is blocked by node-exclusive/baremetal use,
- bootstrap capability status and whether a reboot is required,
- per-slot drift status for VM, device, disk, network, and cleanup evidence,
- optional admin console action when a slice VM requires manual debugging.
### App Deploy Wizard
The app deploy wizard should show only compatible target allocations. If a Jupyter profile supports slices, slice allocations appear. If a Slurm profile requires baremetal, slice allocations are hidden or marked incompatible.
## Slice Networking
Slice networking is a first-class part of slot compatibility, but the detailed
network architecture is maintained separately in
doc/architecture/Slice_Networking_Architecture_v1.md.
For placement purposes, this document depends on three decisions from that networking spec:
- slice VMs have a management/public plane and a workload fabric plane;
- public ingress is a management-plane IPAM/firewall/proxy concern, not an IB feature;
- console access, if provided, must use an authenticated/audited gateway rather than raw node-exposed VNC.
## Isolation and Cleanup
Baremetal and slice release paths are different.
Baremetal:
- revoke user/session access,
- stop app runtimes,
- optionally MAAS full reimage,
- return node to active pool.
GPU slice:
- gracefully stop VM or slice runtime;
- hard-stop only after the configured graceful timeout expires;
- wipe raw disk or recreate image;
- reset GPU/network devices where supported;
- release DHCP/IP/MAC lease;
- validate the slot can run a health smoke;
- mark slot available.
Slice cleanup failure should block only the affected slot unless the failure indicates node-wide device/driver corruption.
Raw disk cleanup must be explicit. `virsh undefine --remove-all-storage` is not sufficient proof that a raw NVMe device was wiped or is safe for another tenant. The release task should unmount any host-visible partitions, run the configured wipe/blkdiscard/reimage operation, and verify the disk no longer contains the previous tenant's filesystem signature before marking the slot available.
Raw NVMe wipe verification is a blocking release criterion for slice reuse. A slot with unverified wipe state must remain `cleanup` or `cleanup_blocked` and must not be scheduled to another tenant.
The VM shutdown sequence should be:
- request graceful shutdown, for example `virsh shutdown`;
- wait for a configurable timeout, defaulting to 120 seconds unless policy says otherwise;
- use hard stop, for example `virsh destroy`, only as a fallback;
- continue disk, network, and device cleanup after the VM is confirmed stopped.
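A hedged sketch of the release sequence, using the `virsh` and `blkdiscard` commands cited above; treating empty `wipefs` output as proof that no known filesystem signature survives is an assumption for illustration:

```go
package nodeagent

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// releaseSliceVM stops the domain gracefully, falls back to a hard stop,
// then performs the blocking wipe-and-verify step before slot reuse.
func releaseSliceVM(domain, disk string, graceful time.Duration) error {
	_ = exec.Command("virsh", "shutdown", domain).Run() // request graceful stop
	deadline := time.Now().Add(graceful)
	for time.Now().Before(deadline) {
		out, _ := exec.Command("virsh", "domstate", domain).Output()
		if strings.Contains(string(out), "shut off") {
			break
		}
		time.Sleep(5 * time.Second)
	}
	_ = exec.Command("virsh", "destroy", domain).Run() // fallback hard stop

	// Destructive wipe, then verify no filesystem signature survives.
	if err := exec.Command("blkdiscard", disk).Run(); err != nil {
		return fmt.Errorf("wipe failed, slot must stay cleanup_blocked: %w", err)
	}
	sig, err := exec.Command("wipefs", disk).Output()
	if err != nil || len(strings.TrimSpace(string(sig))) > 0 {
		return fmt.Errorf("wipe verification failed for %s, blocking slot reuse", disk)
	}
	return nil
}
```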
## Capacity Read Models
Needed API/read-model surfaces:
- region/SKU availability by capacity shape,
- node summary with aggregate slot counts,
- node detail with per-slot state for admins,
- allocation detail with placement shape and claims,
- app target allocation picker with compatibility reasons.
For product UX, users should see product availability, not raw slot internals. For operators, slot internals must be visible.
## Migration Strategy
Phase 1: Document and read-model groundwork.
- Add capacity shape fields to docs/contracts.
- Add slot/claim tables behind feature flags.
- Keep existing whole-node path unchanged.
- Backfill existing active allocations as `capacity_shape=baremetal` with a node-exclusive claim.
Phase 2: Admin inventory.
- Let node-agent/bootstrap report discovered resource topology.
- Let admins approve or correct slot bundles.
- Show slot inventory in Admin Nodes.
- Add admin OS image catalog CRUD for slice VM and baremetal image metadata.
- Add image verification/import tasks before images become selectable.
- Add topology discovery evidence for IB/RDMA devices, NVMe identity, NUMA, BF3/management network identity, MAC/IP reservations, and CPU/memory slot capacity.
Phase 3: Slice allocation.
- Add `gpu_slice` SKU entries.
- Add claim-aware placement.
- Add topology-aware best-fit placement for `[1, 2, 4]` H200 slice counts, with control-plane region/cluster placement and node-agent-owned host-local bundle validation.
- Add node-agent node-scheduler planning for complete slice bundles: GPU, IB/RDMA, raw NVMe, CPU, memory, MAC/IP, and NUMA policy.
- Add node-agent typed tasks for VM slice lifecycle.
- Add slice release/cleanup verification.
- Add usage-record dimensions sourced from allocation claim snapshots.
- Resolve the default `vm_slice` OS image from SKU/profile when the user does not explicitly choose one.
- Enforce graceful shutdown, readiness gates, and blocking raw NVMe wipe verification before slot reuse.
Phase 3a: OS image pipeline.
- Add build/publish/verify pipeline spec for slice images.
- Add image canary/default/rollback model.
- Add node-agent image cache policy.
- Keep manual admin image registration as the bootstrap path only.
Phase 4: App compatibility.
- Add app profile compatibility declarations.
- Filter target allocation picker by compatibility.
- Enable launchable OCI/Jupyter/vLLM on slices first.
- Enable single-node Slurm/Kubernetes profiles on slices only when the adapter runs fully inside one allocation.
- Keep multi-node Slurm/Kubernetes cluster profiles baremetal-only until slice networking supports clusters.
- Separately evaluate infra-managed `system_vm` placements for Slurm controllers, Kubernetes control planes, and similar admin-owned services on slice-capable hosts.
Deferred Phase 5: Fractional/shared GPU model readiness.
- Keep the first implementation disabled for fractional/shared scheduling.
- Reserve model semantics for child slots under a parent physical GPU.
- Require explicit capacity shapes such as `gpu_partition` or `gpu_shared` before exposing a user-facing fractional product.
- Require node-agent support for the selected mechanism, for example MIG, vGPU/mdev, SR-IOV-style accelerator partitioning, MPS, or time slicing.
- Extend billing dimensions from whole-GPU count to claim-snapshotted accelerator units before charging for fractional products.
- Make app compatibility opt-in and adapter-specific; do not treat whole-GPU slice support as fractional support.
## Open Questions
- Is the first slice implementation always one GPU per VM, or do we support multi-slot VM allocations immediately?
- Should CPU and memory be first-class slot resources or node-level soft capacity constraints in the first slice?
- Is local NVMe always exclusive per slot, or can some profiles use shared project storage instead?
- Is InfiniBand required for all H200 slices, or should network device requirements be SKU-specific?
- What RoCE validation is sufficient for AMD slices before a slot can be marked available?
- Should node-agent call libvirt/QEMU directly, or should it call a bounded local slice manager while preserving the same typed-task contract?
- Which VM network attachment should be the default for kind, MAAS lab, and production: bridge, macvtap, OVS, or another controlled model?
- How do we expose private slice endpoints through existing app access models?
- What is the minimum cleanup validation before a slot can be reused?
- Should OS images be stored primarily as OCI artifacts, platform object storage objects, or both?
- Do we require image signing/attestation before a slice image is active?
- What image cache eviction policy should node-agent use when a node hosts many slice image versions?
- Which GPU driver strategy is safest for each slice family: preinstalled driver, install-on-boot, or node-agent mounted driver payload?
- What is the exact source of truth for per-slice CPU and memory: SKU defaults, approved slot metadata, or both?
- Which node-local IPAM implementation should back private slice networking: dnsmasq, libvirt network DHCP, central DHCP/IPAM, or a small node-agent-owned allocator with reconciliation?
- Which admin-only console gateway should replace direct VNC exposure for manual slice debugging?
- What destructive-disk safety checks are mandatory before a raw NVMe device can be assigned or reused?
- Should topology discovery be entirely node-agent driven, or should each site maintain a reviewed topology profile for known server layouts?
- Which guest readiness signal is required per OS image family: SSH only, cloud-init completion, guest agent, or an app-specific probe?
- What policy key or SKU/profile value controls graceful VM shutdown timeout?
- Should infra-managed controller VMs be modeled as a separate internal `system_vm` capacity shape, or as reserved CPU/memory/disk placements on a slice-capable node?
- Which fractional/shared-GPU mechanism, if any, should be supported first: MIG, NVIDIA vGPU/mdev, SR-IOV-style accelerator partitioning, CUDA MPS, time slicing, or app-level software multiplexing?
- Should fractional inventory be represented as child rows in `node_resource_slots`, a separate accelerator partition table, or generated only after an admin approves a parent GPU partition profile?
- What billable units should fractional GPU products use: GPU-memory GiB-hour, compute-share hour, named partition-profile hour, or another SKU-defined unit?
## Non-Goals For First Slice
- Cross-node sliced allocations.
- Arbitrary overcommit of GPUs, NVMe, or IB devices.
- Generic VM cloud product semantics.
- User-visible PCI/device selection.
- Running every app type on slices immediately.
- Replacing MAAS baremetal lifecycle.
- Live migration or live defragmentation of GPU slice VMs.
- Fractional/shared-GPU scheduling, including MIG, vGPU, MPS, time slicing, or software multiplexing.