# GPU Slice End-to-End Readiness Decisions v1

## Purpose

Close the remaining architecture and product decisions before continuing end-to-end GPU slice validation.

This document is the bridge between:

- Allocation_Capacity_Shapes_and_GPU_Slices_v1.md
- Hierarchical_Placement_and_Node_Scheduler_v1.md
- GPU_Slice_Implementation_Checklist_v1.md
- Product_UI_System_Redesign_v1.md
- UX_Redesign_Implementation_Plan.md
- GPU_Slice_Infra_Enablement_Proposal_v1.md

It does not replace those documents. It records the decisions that unblock the next implementation and test pass.
## Finalized Decisions

### 1. Fabric Model
The v1 GPU VM slice product uses per-slot VF-backed fabric.
Each schedulable whole-GPU slice slot must have:
- one physical GPU PCI function;
- one raw NVMe device approved for tenant slice ownership;
- one unique fabric VF or equivalent isolated fabric attachment;
- one management MAC/private-IP reservation;
- SKU/profile-owned CPU and memory;
- cleanup/wipe policy metadata.
The scheduler must not treat a shared host BF/IB parent device as a source of schedulable concurrency. Parent fabric metadata is operator context only. A slot is schedulable only when `capacity_metadata.fabric_claim_mode = per_slot_vf` and `capacity_metadata.fabric_vf_pci_address` is non-empty.
This preserves max performance and gives a clean extension point for future fractional or shared GPU products. Fractional/shared GPU support remains a future capacity shape and must not be hidden under `gpu_slice`.
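For illustration, the schedulability rule above reduces to a small predicate over slot capacity metadata. The sketch below assumes the metadata is available as a plain mapping; the example PCI address and the `host_shared` value are made up:

```python
from typing import Any, Mapping


def slot_is_schedulable(capacity_metadata: Mapping[str, Any]) -> bool:
    """Return True only for slots backed by a dedicated per-slot fabric VF.

    Slots that only reference the shared host BF/IB parent device are operator
    context and must not count as schedulable concurrency.
    """
    if capacity_metadata.get("fabric_claim_mode") != "per_slot_vf":
        return False
    vf_address = capacity_metadata.get("fabric_vf_pci_address") or ""
    return bool(str(vf_address).strip())


# A slot with a dedicated VF is schedulable; a parent-only slot is not.
assert slot_is_schedulable(
    {"fabric_claim_mode": "per_slot_vf", "fabric_vf_pci_address": "0000:3b:00.2"}
)
assert not slot_is_schedulable({"fabric_claim_mode": "host_shared"})
```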
### 2. Same-Node Placement

Multi-GPU slice allocations must not span nodes.

If a user asks for 4 GPUs and the platform has two nodes with two schedulable slots each, the request is unavailable. The platform must not reserve capacity across both nodes for one VM.
The user-facing message should distinguish:
- no capacity in region;
- capacity exists but no single node can satisfy the requested count;
- capacity exists but topology/fabric/storage readiness blocks scheduling.
The API can initially return `sku_unavailable`, but the product should evolve toward structured unavailable reasons so the catalog and wizard can explain the blocker without direct DB inspection.
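A sketch of what structured unavailable reasons could look like; the reason codes and response shape below are illustrative, not the current API contract:

```python
from dataclasses import dataclass
from enum import Enum


class UnavailableReason(str, Enum):
    # Illustrative reason codes; the current API only returns sku_unavailable.
    NO_CAPACITY_IN_REGION = "no_capacity_in_region"
    NO_SINGLE_NODE_FITS_COUNT = "no_single_node_fits_count"
    NODE_READINESS_BLOCKED = "node_readiness_blocked"


@dataclass
class UnavailableResponse:
    sku: str
    requested_gpu_count: int
    reason: UnavailableReason
    detail: str = ""


# Example: four GPUs requested, but no single node has four schedulable slots.
resp = UnavailableResponse(
    sku="h200-slice",
    requested_gpu_count=4,
    reason=UnavailableReason.NO_SINGLE_NODE_FITS_COUNT,
    detail="Capacity exists in region, but the largest same-node bundle is 2 GPUs.",
)
```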
### 3. Controller Scheduler and Node Scheduler Boundary

GPUaaS uses a hierarchical placement model.

For the current code path, the controller scheduler already performs durable reservation using `node_resource_slots` and `allocation_resource_claims`. That is acceptable for the first end-to-end test, but it is not the final boundary.
The target model is:
- controller scheduler chooses region, pool, and candidate node;
- node scheduler validates host-local state and returns a concrete bundle plan;
- controller records the returned plan as durable claims;
- runtime executor creates or releases the VM from the accepted plan.
The node scheduler may live inside node-agent initially. It must be a bounded typed-task path, not an SSH shell or unstructured script runner.
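As a sketch of that boundary expressed as typed messages (all type and field names below are assumptions, not the existing schema):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BundleRequest:
    # Controller -> node scheduler: the candidate node and requested shape.
    allocation_id: str
    node_id: str
    gpu_count: int
    capacity_shape: str = "gpu_slice"


@dataclass
class BundlePlan:
    # Node scheduler -> controller: a concrete, host-validated bundle.
    allocation_id: str
    node_id: str
    slot_ids: List[str]
    fabric_vf_pci_addresses: List[str]
    nvme_devices: List[str]
    lease_id: str


def record_claims(plan: BundlePlan) -> List[dict]:
    """Controller side: turn an accepted plan into durable claim rows.

    In the real system these would become allocation_resource_claims records;
    plain dicts are used here for illustration.
    """
    return [
        {"allocation_id": plan.allocation_id, "node_id": plan.node_id, "slot_id": slot_id}
        for slot_id in plan.slot_ids
    ]
```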
### 4. Node Scheduler Lease Model
The node scheduler must take a host-local lease before reporting that a bundle is safe to use.
Minimum lease requirements:
- one lease per slot or per bundle;
- lease file under `/var/lib/gpuaas/node-scheduler/leases`;
- lease contains allocation ID, task ID, slot IDs, device identities, and expiry;
- stale leases are reconciled by node-agent before new placement;
- controller claims remain the durable source of truth after reservation.
This protects against same-node races where the controller has stale inventory or two node tasks are in flight concurrently.
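A minimal sketch of a host-local lease take, assuming one JSON lease file per bundle under the directory above; the file layout and field names are illustrative:

```python
import json
import os
import time
import uuid
from pathlib import Path
from typing import List

LEASE_DIR = Path("/var/lib/gpuaas/node-scheduler/leases")


def take_bundle_lease(allocation_id: str, task_id: str, slot_ids: List[str],
                      device_identities: List[str], ttl_seconds: int = 900) -> Path:
    """Write a host-local lease file before reporting a bundle as safe to use.

    O_CREAT | O_EXCL makes the create atomic on the local filesystem, so two
    concurrent node tasks cannot both take a lease for the same allocation.
    """
    LEASE_DIR.mkdir(parents=True, exist_ok=True)
    lease = {
        "lease_id": str(uuid.uuid4()),
        "allocation_id": allocation_id,
        "task_id": task_id,
        "slot_ids": slot_ids,
        "device_identities": device_identities,
        "expires_at": time.time() + ttl_seconds,
    }
    path = LEASE_DIR / f"{allocation_id}.json"
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump(lease, f)
    return path
```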
### 5. Node Mode and Baremetal Preservation
Nodes can be tagged as slice-capable, baremetal-capable, or both-capable, but placement is mutually exclusive at runtime.
Initial states:
- `baremetal_only`: no slice profile tag or no approved slots;
- `slice_candidate`: profile tag exists but readiness or slot approval is incomplete;
- `both_capable`: node has profile tag, readiness evidence, and approved slots, with no active claims;
- `slice_active`: at least one slice claim is reserved/provisioning/active;
- `baremetal_active`: whole-node claim is active or reserved.
Scheduler policy should prefer already-sliced nodes for new slice requests so clean nodes remain available for baremetal. Freeing a sliced node for baremetal is a drain/release/reconcile operation, not transparent live migration.
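For illustration, these states map onto a small enum derived from inventory facts; the helper below is a sketch under those assumptions, not the inventory schema itself:

```python
from enum import Enum


class NodeMode(str, Enum):
    BAREMETAL_ONLY = "baremetal_only"
    SLICE_CANDIDATE = "slice_candidate"
    BOTH_CAPABLE = "both_capable"
    SLICE_ACTIVE = "slice_active"
    BAREMETAL_ACTIVE = "baremetal_active"


def derive_node_mode(has_profile_tag: bool, readiness_ok: bool, approved_slots: int,
                     active_slice_claims: int, active_baremetal_claim: bool) -> NodeMode:
    """Derive the runtime-exclusive node mode from inventory facts (illustrative inputs)."""
    if active_baremetal_claim:
        return NodeMode.BAREMETAL_ACTIVE
    if active_slice_claims > 0:
        return NodeMode.SLICE_ACTIVE
    if not has_profile_tag:
        return NodeMode.BAREMETAL_ONLY
    if not readiness_ok or approved_slots == 0:
        return NodeMode.SLICE_CANDIDATE
    return NodeMode.BOTH_CAPABLE
```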
### 6. Host Bootstrap and Cloud-Init
Standard MAAS commissioning remains the default.
GPUaaS-specific firmware/profile work runs through deploy-time cloud-init or an infra wrapper, selected by MAAS tags:
- `gpuaas-profile-slice-vm`;
- `gpuaas-profile-baremetal`;
- hardware tags such as `gpu-nvidia-h200`;
- server tags such as `server-dell-xe9680l`;
- fabric tags such as `fabric-bf3`.
The helper checks profile readiness, applies one-time BIOS/RACADM drift only
when approved, performs at most one controlled reboot for that profile pass,
and records evidence under /var/lib/gpuaas/firmware-profile/.
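As a sketch of the tag-driven part of that helper, the snippet below only covers profile selection from MAAS tags and evidence recording; the BIOS/RACADM drift and reboot handling are out of scope here, and the evidence file layout and function names are assumptions:

```python
import json
import time
from pathlib import Path
from typing import Optional, Set

EVIDENCE_DIR = Path("/var/lib/gpuaas/firmware-profile")

PROFILE_TAGS = {"gpuaas-profile-slice-vm", "gpuaas-profile-baremetal"}


def select_profile(maas_tags: Set[str]) -> Optional[str]:
    """Pick the GPUaaS profile for this host from its MAAS tags, if any."""
    selected = PROFILE_TAGS & maas_tags
    if len(selected) > 1:
        raise ValueError(f"conflicting GPUaaS profile tags: {sorted(selected)}")
    return next(iter(selected), None)


def record_profile_evidence(profile: str, drift_applied: bool, rebooted: bool) -> Path:
    """Record one profile pass so later runs can skip already-applied work."""
    EVIDENCE_DIR.mkdir(parents=True, exist_ok=True)
    evidence = {
        "profile": profile,
        "drift_applied": drift_applied,
        "rebooted": rebooted,
        "completed_at": time.time(),
    }
    path = EVIDENCE_DIR / f"{profile}.json"
    path.write_text(json.dumps(evidence, indent=2))
    return path
```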
Runtime bootstrap remains separate from firmware readiness. Libvirt, OVS, VFIO, image cache, hugepages, host networking, and node-agent installation are deployed-host bootstrap responsibilities.
### 7. Slice Images
End-to-end testing needs at least two image tracks:
- base Ubuntu slice image;
- CUDA/NVIDIA developer image.
The slice runtime must resolve image metadata from SKU/profile or approved slot metadata. The first test may use a manually registered path, but productized launch must use an image catalog with compatibility metadata.
Images must have:
- immutable digest;
- compatibility shape (`gpu_slice` or `baremetal`);
- accelerator compatibility;
- default username;
- readiness contract;
- rollout state such as active/canary/deprecated.
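A small sketch of what one catalog entry could carry, with fields taken from the list above; the dataclass and the example values are illustrative:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class RolloutState(str, Enum):
    ACTIVE = "active"
    CANARY = "canary"
    DEPRECATED = "deprecated"


@dataclass
class SliceImage:
    name: str                           # e.g. "ubuntu-slice-base" (illustrative)
    digest: str                         # immutable content digest
    compatibility_shape: str            # "gpu_slice" or "baremetal"
    accelerator_compatibility: List[str]
    default_username: str
    readiness_contract: str             # e.g. "cloud-init done + SSH reachable"
    rollout_state: RolloutState = RolloutState.CANARY


cuda_dev = SliceImage(
    name="cuda-dev",                    # illustrative
    digest="sha256:0000000000000000000000000000000000000000000000000000000000000000",
    compatibility_shape="gpu_slice",
    accelerator_compatibility=["nvidia-h200"],
    default_username="ubuntu",
    readiness_contract="cloud-init done + nvidia-smi succeeds",
)
```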
### 8. Task Visibility
Allocation timeline pagination is sufficient for the immediate slice test, but the product direction is a first-class task read model.
The task model must cover:
- allocation provisioning and release;
- GPU slice slot reservation;
- image clone/cache;
- cloud-init and boot;
- readiness probes;
- MAAS onboarding and decommission;
- image build/import/verify.
The UI should initially load a small latest-event page and support loading older events. Backend pagination and sorting are required; the UI must not load unbounded task histories for long-lived allocations.
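A hedged sketch of that paginated read pattern; the cursor semantics and names below are assumptions, not the existing API:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskEvent:
    event_id: int           # monotonically increasing per allocation (assumed)
    task_type: str          # e.g. "gpu_slice_slot_reservation"
    state: str              # e.g. "running", "succeeded", "failed"
    message: str


@dataclass
class TaskEventPage:
    events: List[TaskEvent]        # newest first
    next_cursor: Optional[int]     # pass back to fetch older events; None when exhausted


def fetch_latest_events(all_events: List[TaskEvent], cursor: Optional[int] = None,
                        page_size: int = 20) -> TaskEventPage:
    """Return one bounded page, newest first, instead of the full task history."""
    ordered = sorted(all_events, key=lambda e: e.event_id, reverse=True)
    if cursor is not None:
        ordered = [e for e in ordered if e.event_id < cursor]
    page = ordered[:page_size]
    next_cursor = page[-1].event_id if len(ordered) > page_size else None
    return TaskEventPage(events=page, next_cursor=next_cursor)
```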
### 9. Launch UX
The product should expose one H200 launch flow with branching inside the wizard:
- baremetal SKU path: fixed GPU count, whole-node semantics;
- GPU slice path: selectable allowed counts, same-node availability, image selection, SSH/access, storage, network, and review.
Region is a top-bar and wizard context even while only one region is enabled.
Initial display value is US Buffalo / us-buffalo-1 or the currently mapped
platform region label for the MAAS site. Future regions can be disabled rather
than hidden.
### 10. Test Node Scope

Current slice validation nodes:

| Node   | MAAS IP       | Intended use                                  |
|--------|---------------|-----------------------------------------------|
| j22u15 | 10.177.36.197 | Primary slice test host                       |
| j22u11 | 10.177.36.100 | Secondary slice test host                     |
| j27u15 | 10.177.36.198 | Baremetal or temporary slice test if retagged |
j22u05 is being returned and should not be used for new slice validation.
## Immediate Engineering Order
Complete these before broad UI redesign work resumes; a minimal smoke-check sketch follows the list:
- make GPUaaS inventory reflect the MAAS slice tags for the active test nodes;
- run host readiness/bootstrap on `j22u15`;
- run topology discovery and confirm per-slot fabric VFs exist;
- approve only slots with raw NVMe slice ownership and unique per-slot fabric VF metadata;
- verify the base slice image path/digest exists on the node;
- launch one 1-GPU slice allocation;
- verify VM boot, cloud-init, SSH/terminal, and `nvidia-smi`;
- release the allocation and verify wipe/cleanup evidence before slot reuse;
- repeat on `j22u11` only after the primary path works.
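A hedged sketch of that 1-GPU smoke pass as a script. The `gpuaas_client` module and every call on it are hypothetical stand-ins that only mirror the steps above; the SKU and image names are also made up:

```python
# Hypothetical smoke check for the first 1-GPU slice pass on j22u15.
# gpuaas_client and all of its methods are illustrative assumptions.
import time

import gpuaas_client  # hypothetical internal client


def run_slice_smoke_test(node: str = "j22u15") -> None:
    alloc = gpuaas_client.create_allocation(
        sku="h200-slice", gpu_count=1, node_hint=node, image="ubuntu-slice-base"
    )
    try:
        # Wait for boot and cloud-init to report the allocation active.
        deadline = time.time() + 30 * 60
        while gpuaas_client.get_allocation(alloc.id).state != "active":
            if time.time() > deadline:
                raise TimeoutError("allocation did not become active in 30 minutes")
            time.sleep(15)

        # SSH in and confirm the GPU is visible inside the VM.
        output = gpuaas_client.ssh_exec(alloc.id, "nvidia-smi -L")
        assert "H200" in output, f"unexpected nvidia-smi output: {output!r}"
    finally:
        gpuaas_client.release_allocation(alloc.id)

    # Slot reuse is only allowed once wipe/cleanup evidence is present.
    assert gpuaas_client.get_wipe_evidence(alloc.id), "missing wipe evidence"
```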
## Explicit Non-Goals For This Pass
- fractional GPU scheduling;
- live migration or automatic evacuation;
- multi-node slice VMs;
- shared host-owned fabric as a concurrency model;
- public ingress for slice VMs beyond the existing private management path;
- full product UI rewrite before the slice runtime path is proven.