GPU Slice Infra Enablement Proposal v1¶
Purpose¶
Propose a complete GPUaaS-provided enablement path for GPU slice hosts. This is intentionally framed as a target profile, scripts, and validation plan for infra review and adoption, not an open-ended request for infra to design the whole flow.
The platform control plane can continue implementation with slice scheduling disabled by default. End-to-end slice provisioning should only be enabled after infra approves the host profile and the target node has passed the checks below.
Proposed Target¶
Create MAAS-backed GPUaaS profile tags for nodes approved to run GPUaaS VM-backed GPU slices or GPUaaS baremetal allocations. The first tested slice target is Dell XE9680L/H200, but the MAAS profile tags should stay generic so the same deploy-time cloud-init check/apply flow can be reused for H100, B-series, AMD, or later accelerator hosts with hardware-specific validation added underneath the same platform role.
The profile prepares a node for VM-backed one-GPU slices where each slice gets:
- one full GPU through VFIO passthrough;
- one infra-approved raw NVMe device or volume;
- one BF3/OVS-backed management vNIC or the approved fallback OVS private-NAT bridge;
- optional IB/RDMA passthrough when the slot profile requires it;
- SKU/profile-driven CPU and memory, starting with the tested 24 vCPU / 64 GiB one-GPU shape;
- optional tuned host profile: 1 GiB hugepages, persistent VFIO binding, guest Secure Boot disabled, and network/RDMA sysctl tuning.
This profile should be opt-in per node or resource pool. Do not infer slice eligibility from GPU count alone. Nodes that remain baremetal-only should not run the slice VM firmware profile.
Capacity Mode Source Of Truth¶
MAAS tagging and GPUaaS readiness evidence serve different purposes:
- the MAAS tag/profile records operator intent that a node is allowed to become a GPUaaS slice host;
- deploy-time firmware profile evidence records whether that intended slice host is technically ready for the slice-host deploy path;
- platform slot approval records the exact GPU, disk, fabric, CPU, and memory bundles that are schedulable by GPUaaS.
A node must not be marked as GPU-slice-supported from readiness output alone. The correct promotion rule is:
slice_supported = has_gpuaas_slice_host_tag
AND firmware_profile_ready
AND approved_node_resource_slots_exist
If the slice-host tag/profile is absent, the node remains baremetal-only for GPUaaS even if the hardware would pass the readiness checks. If the tag/profile is present but the firmware profile fails, the node is an intended slice host but not deployable. If both pass but slots are not approved, the node is slice-ready from an infra perspective but not schedulable by the platform.
The provisioning path should expose these states distinctly:
- baremetal_only: node can serve whole-node allocations only;
- slice_candidate: operator selected the node for slice hosting, but readiness or slot approval is incomplete;
- both_capable: node can serve either whole-node or slice allocations, but not both at the same time;
- slice_active: at least one slice claim is active or reserved, so baremetal placement is blocked until drain and cleanup finish;
- baremetal_active: a whole-node allocation is active, so slice placement is blocked until release and the slice-mode transition passes.
For the initial automation, the platform can derive this state from the MAAS
tag/profile, the latest deploy-time firmware profile state under
/var/lib/gpuaas/firmware-profile/, and approved node_resource_slots. Longer
term, GPUaaS should persist an explicit node capability/mode read model so the
catalog can say "baremetal only", "slice candidate", or "both capable" without
re-querying raw host bootstrap logs.
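For the initial derivation, a minimal sketch of the promotion rule is shown below. The evidence filename state.json, its status field, and the two input variables are assumptions used only for illustration; the shipped helper defines the real evidence layout.

#!/usr/bin/env bash
# Sketch: derive a node's GPUaaS capability state from the three inputs above.
# Assumed, not contractual:
#   HAS_SLICE_TAG   - "1" when MAAS reports gpuaas-profile-slice-vm on the node
#   APPROVED_SLOTS  - count of approved node_resource_slots from the platform
#   state.json      - assumed firmware-profile evidence file with a "status" field
set -euo pipefail
HAS_SLICE_TAG="${HAS_SLICE_TAG:-0}"
APPROVED_SLOTS="${APPROVED_SLOTS:-0}"
EVIDENCE=/var/lib/gpuaas/firmware-profile/state.json

firmware_ready=0
if [ -f "$EVIDENCE" ] && grep -q '"status": *"ready"' "$EVIDENCE"; then
  firmware_ready=1
fi

if [ "$HAS_SLICE_TAG" != "1" ]; then
  echo "baremetal_only"      # no operator intent: whole-node allocations only
elif [ "$firmware_ready" != "1" ] || [ "$APPROVED_SLOTS" -eq 0 ]; then
  echo "slice_candidate"     # intent present; readiness or slot approval incomplete
else
  echo "both_capable"        # slice_supported: tag AND firmware_ready AND approved slots
fi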
Composable MAAS Tags¶
Use small composable MAAS tags instead of encoding SKU identity into one tag. This lets GPUaaS derive catalog SKUs and validation profiles from the same metadata as new hardware arrives.
Recommended initial tags:
| Tag | Meaning |
|---|---|
| gpuaas-profile-slice-vm | Profile intent: node may be prepared for GPUaaS VM-backed GPU slices. |
| gpuaas-profile-baremetal | Profile intent: node may be prepared for GPUaaS baremetal allocations. |
| gpu-nvidia-h200 | Hardware identity: NVIDIA H200 GPU node. |
| server-dell-xe9680l | Server/chassis profile used for BIOS/RACADM expectations. |
| fabric-bf3 | Fabric profile: BF3/ConnectX resources are expected for slice networking/RDMA validation. |
Future tags should follow the same pattern, for example gpu-nvidia-h100,
gpu-nvidia-b200, gpu-amd-mi300x, fabric-cx7, or a site-specific storage
pool tag. The gpuaas-profile-slice-vm tag selects the slice VM firmware
profile. Hardware tags by themselves do not make a node slice-supported.
For SKU derivation:
- GPU/server/fabric tags describe what products a node could support;
- gpuaas-profile-slice-vm decides whether slice products may be considered;
- deploy-time firmware profile evidence decides whether the selected slice-host profile is ready;
- approved node_resource_slots decide exact schedulable GPU count, CPU, memory, disk, and fabric bundles.
For example, a node tagged gpu-nvidia-h200 without
gpuaas-profile-slice-vm should feed the H200 baremetal SKU only. A node
tagged gpu-nvidia-h200 plus gpuaas-profile-slice-vm, with ready firmware
profile evidence and approved slots, can feed both the H200 baremetal SKU and
H200 GPU slice SKUs, subject to mutually exclusive placement.
The slice VM and baremetal firmware profiles are intentionally separate. Infra correctly identified that baremetal may need a different RACADM profile than slice VMs. The GPUaaS cloud-init helper therefore selects one firmware profile from the profile tag and blocks machines that carry both profile tags by default.
Immediate Ask For Infra¶
Use standard MAAS commissioning. Do not add GPUaaS commissioning scripts to the normal MAAS path for this model.
Infra only needs to provide profile intent through MAAS tags or an equivalent resource-pool/profile mechanism. GPUaaS deploy cloud-init will read the selected profile intent, run a fast readiness check, and apply the one-time BIOS/RACADM profile only when the node is not already prepared.
Initial tags:
| Tag | Meaning |
|---|---|
| gpuaas-profile-slice-vm | Node is intended to support GPUaaS VM-backed GPU slices. |
| gpuaas-profile-baremetal | Node is intended to support GPUaaS baremetal allocations. |
| gpu-nvidia-h200 | Hardware identity: NVIDIA H200 GPU node. |
| server-dell-xe9680l | Server/chassis profile used for BIOS/RACADM expectations. |
| fabric-bf3 | BF3/ConnectX fabric is expected for slice networking/RDMA validation. |
A node should carry exactly one GPUaaS profile tag for automated onboarding. Hardware tags by themselves do not imply slice readiness.
Example tag creation:
maas maas280 tags create name=gpuaas-profile-slice-vm \
comment="GPUaaS slice VM profile intent"
maas maas280 tags create name=gpuaas-profile-baremetal \
comment="GPUaaS baremetal profile intent"
maas maas280 tags create name=gpu-nvidia-h200 \
comment="NVIDIA H200 GPU nodes"
maas maas280 tags create name=server-dell-xe9680l \
comment="Dell XE9680L platform nodes"
maas maas280 tags create name=fabric-bf3 \
comment="Nodes with BF3/ConnectX fabric expected for GPUaaS slice profiles"
Example slice VM assignment:
maas maas280 tag update-nodes gpuaas-profile-slice-vm add=<system_id>
maas maas280 tag update-nodes gpu-nvidia-h200 add=<system_id>
maas maas280 tag update-nodes server-dell-xe9680l add=<system_id>
maas maas280 tag update-nodes fabric-bf3 add=<system_id>
GPUaaS cloud-init then follows this rule (see the sketch after this list):
- if the selected profile/version is already verified, skip firmware changes;
- if the fast readiness check passes, record readiness evidence and continue;
- if BIOS/RACADM drift is detected and this is onboarding/reprofile, apply the profile and perform one controlled reboot;
- after reboot, verify again and continue node-agent/bootstrap;
- if readiness still fails, fail onboarding/reprofile and keep the node out of scheduling.
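A minimal sketch of that decision order is below, assuming a hypothetical helper command named gpuaas-firmware-profile with --check and --apply modes; the real helper, its flags, and its exit codes are what GPUaaS ships for review, not this sketch.

#!/usr/bin/env bash
# Sketch of the deploy-time decision order only; "gpuaas-firmware-profile" and
# its --check/--apply flags are hypothetical names for illustration.
set -euo pipefail
STATE_DIR=/var/lib/gpuaas/firmware-profile
PROFILE_VERSION="${GPUAAS_FIRMWARE_PROFILE_VERSION:?}"
mkdir -p "$STATE_DIR"

# 1. Selected profile/version already verified: skip firmware changes.
if [ -f "$STATE_DIR/verified-$PROFILE_VERSION" ]; then
  exit 0
fi

# 2. Fast readiness check passes: record evidence and continue.
if gpuaas-firmware-profile --check; then
  touch "$STATE_DIR/verified-$PROFILE_VERSION"
  exit 0
fi

# 3. Drift during onboarding/reprofile: apply the profile, one controlled reboot.
if [ "${GPUAAS_FIRMWARE_APPLY_IF_NEEDED:-0}" = "1" ]; then
  gpuaas-firmware-profile --apply
  if [ "${GPUAAS_FIRMWARE_REBOOT_ON_CHANGE:-0}" = "1" ]; then
    systemctl reboot
  fi
fi

# 4/5. The unit runs again after reboot; persistent failure never writes the
#      verified marker, so the node stays out of scheduling.
exit 1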
This work is one-time per node profile. It runs on first onboarding, profile tag change, profile version change, or explicit admin repair. It must not run as a customer allocation hot-path operation.
The deploy-time helper is:
Cloud-init should install it as a systemd oneshot with the selected profile environment. For the first H200 slice VM profile:
GPUAAS_FIRMWARE_PROFILE_TAG=gpuaas-profile-slice-vm
GPUAAS_FIRMWARE_PROFILE_VERSION=h200-xe9680l-slice-vm-v1
GPUAAS_FIRMWARE_APPLY_IF_NEEDED=1
GPUAAS_FIRMWARE_REBOOT_ON_CHANGE=1
GPUAAS_FIRMWARE_MAX_REBOOT_ATTEMPTS=1
GPUAAS_FIRMWARE_EXPECTED_BIOS_ATTRS="BIOS.ProcSettings.ProcVirtualization=Enabled BIOS.ProcSettings.LogicalProc=Enabled BIOS.IntegratedDevices.SriovGlobalEnable=Enabled"
The helper stores evidence under /var/lib/gpuaas/firmware-profile/ and
disables its systemd service after the selected profile/version is verified.
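One way cloud-init could install that oneshot is sketched below; the helper path /usr/local/sbin/gpuaas-firmware-profile, the unit name, and the environment file location are assumptions for illustration, and only the GPUAAS_FIRMWARE_* variable names come from the profile above.

# Sketch only: install the deploy-time helper as a systemd oneshot.
install -d -m 0750 /etc/gpuaas
cat > /etc/gpuaas/firmware-profile.env <<'EOF'
GPUAAS_FIRMWARE_PROFILE_TAG=gpuaas-profile-slice-vm
GPUAAS_FIRMWARE_PROFILE_VERSION=h200-xe9680l-slice-vm-v1
GPUAAS_FIRMWARE_APPLY_IF_NEEDED=1
GPUAAS_FIRMWARE_REBOOT_ON_CHANGE=1
GPUAAS_FIRMWARE_MAX_REBOOT_ATTEMPTS=1
EOF

cat > /etc/systemd/system/gpuaas-firmware-profile.service <<'EOF'
[Unit]
Description=GPUaaS deploy-time firmware profile check/apply
After=network-online.target

[Service]
Type=oneshot
EnvironmentFile=/etc/gpuaas/firmware-profile.env
ExecStart=/usr/local/sbin/gpuaas-firmware-profile
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now gpuaas-firmware-profile.service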
Do not put these actions into MAAS commissioning:
- destructive NVMe unmount/wipe/blkdiscard;
- GRUB/kernel argument changes for the deployed OS;
- package installs for libvirt/OVS/node-agent runtime;
- host NVIDIA service disablement or persistent VFIO binding;
- slot approval or scheduler enablement.
Those remain deployed-host bootstrap, explicit slice-mode transition, or node-agent topology approval work after the node is intentionally moved into slice mode.
Review Decisions We Need¶
1. Firmware and BIOS¶
Use standard MAAS commissioning. GPUaaS firmware profile checks run in deploy-time cloud-init during onboarding, reprofile, or explicit admin repair. The minimum slice VM firmware settings are:
- CPU virtualization enabled: ProcVirtualization=Enabled;
- hyperthreading enabled: LogicalProc=Enabled;
- SR-IOV enabled: SriovGlobalEnable=Enabled;
- VT-d/IOMMU enabled where exposed by the platform;
- Secure Boot policy documented for host and guest. The current VM runtime profile disables guest Secure Boot to avoid NVIDIA driver/MOK prompts.
The deploy-time helper is:
Recommended operating model:
- infra tags the node with exactly one GPUaaS profile intent tag;
- GPUaaS deploy cloud-init installs the helper with the matching profile version;
- the helper skips when the cached state already matches the selected profile/version and reads back as ready;
- if drift is found during onboarding/reprofile, the helper stages RACADM changes and performs one controlled reboot;
- after reboot, the helper verifies readiness and records evidence under /var/lib/gpuaas/firmware-profile/;
- if verification fails, onboarding fails and the node remains unschedulable.
This avoids depending on iDRAC Redfish credentials from the platform. On
j22u05, local RACADM could read and stage BIOS settings as root without a
separate Redfish password.
2. Deployed Host Bootstrap¶
Run the repo-owned deployed-host bootstrap through MAAS deploy userdata, cloud-init, or an infra wrapper:
Expected responsibilities:
- install KVM/libvirt/OVMF/cloud-init tooling;
- stage intel_iommu=on iommu=pt and VFIO modules;
- install OVS and baseline private-NAT support;
- optionally reserve 1 GiB hugepages when the selected slice runtime profile requires it;
- optionally apply host network/RDMA sysctl tuning after benchmark approval;
- run the deployed-host runtime prepare helper to stop host NVIDIA services that hold passthrough devices, fix libvirt traversal permissions, and clean stale GPUaaS dnsmasq reservations;
- report when a reboot is required instead of marking the node schedulable.
The bootstrap should be idempotent. A node is not slice-schedulable just because this script completed; node-agent topology discovery and slot approval still gate scheduling.
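A short sketch of the "report a reboot, don't mark schedulable" behavior is below; the /run/gpuaas marker path and the two checks are illustrative assumptions, not the shipped bootstrap.

#!/usr/bin/env bash
# Sketch: idempotent reboot-required reporting inside the deployed-host bootstrap.
set -euo pipefail
reboot_required=0

# IOMMU args must be active in the running kernel, not just staged in GRUB.
grep -qw 'intel_iommu=on' /proc/cmdline || reboot_required=1

# VFIO must actually load on the running kernel.
modprobe vfio-pci 2>/dev/null || reboot_required=1

if [ "$reboot_required" = "1" ]; then
  # Report instead of marking the node schedulable; slot approval still gates scheduling.
  mkdir -p /run/gpuaas && touch /run/gpuaas/reboot-required
  echo "bootstrap: reboot required before slice runtime prerequisites are ready" >&2
fi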
3. Raw NVMe Ownership And Node Mode¶
GPUaaS proposes a convertible raw-device pool for nodes that can operate in either baremetal/share mode or slice mode. The same physical NVMe devices may be used by the host while the node is baremetal, then unmounted and handed to slice VMs after an explicit mode transition.
The platform proposal is stricter than the prototype because the transition must be deliberate and recorded:
- mounted host share disks must not be approved as live slice disks;
- devices with mounted child partitions must block slot approval until the node has been drained from baremetal/share use and infra has approved unmount or remap;
- devices with unexpected filesystem signatures must block approval until the operator has explicitly wiped, remapped, or recorded the expected signature for the current mode;
- the platform should record the approved disk identity, current ownership mode, and destructive-wipe policy before scheduling;
- released slots cannot be reused until node-agent proves wipe/blkdiscard or a site-approved secure erase completed.
Mode rule:
- while any slice is running, the node cannot be used for baremetal or host share storage;
- while a baremetal allocation or host share mount owns the devices, slice slots remain disabled or cleanup_blocked;
- transition from baremetal/share to slice requires drain, unmount/remap, topology rediscovery, and explicit slot approval;
- transition from slice back to baremetal requires all slice allocations to be released, cleanup proof to pass, and infra to remount/recreate the host storage layout.
For j22u05, this is the current end-to-end blocker. The devices initially
approved as slice NVMe were mapped to host share partitions. That is acceptable only
after an explicit baremetal/share-to-slice transition; until then the slots
should stay cleanup_blocked.
GPUaaS now provides a repeatable transition helper:
sudo scripts/ops/gpuaas_slice_storage_transition.sh --apply \
--devices /dev/nvme0n1,/dev/nvme1n1,/dev/nvme2n1,/dev/nvme3n1,/dev/nvme5n1,/dev/nvme6n1,/dev/nvme7n1,/dev/nvme8n1
The helper is dry-run by default and requires --apply for destructive
changes. It backs up and comments host-share fstab entries, unmounts matching
share mounts, rejects the current OS disk, wipes stale filesystem signatures
and partition tables, runs blkdiscard, and records a host manifest under
/var/log/gpuaas/. Infra should review the script and site-specific device
selection before adding it to deployed-host mode-transition automation.
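A review-first flow might run the same device list without --apply and inspect the recorded output before the destructive pass; the manifest filename pattern under /var/log/gpuaas/ is not specified here, so the listing below is only a starting point.

# Dry-run (default mode): no --apply, so no destructive changes are made.
sudo scripts/ops/gpuaas_slice_storage_transition.sh \
    --devices /dev/nvme0n1,/dev/nvme1n1,/dev/nvme2n1,/dev/nvme3n1,/dev/nvme5n1,/dev/nvme6n1,/dev/nvme7n1,/dev/nvme8n1

# Review the recorded host manifests before and after the --apply run.
ls -lt /var/log/gpuaas/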
j22u05 applied this model on 2026-04-17. The approved slice-owned raw devices
are /dev/nvme0n1, /dev/nvme1n1, /dev/nvme2n1, /dev/nvme3n1,
/dev/nvme5n1, /dev/nvme6n1, /dev/nvme7n1, and /dev/nvme8n1.
/dev/nvme4n1 remains the OS disk and must never be approved as slice storage.
3.1 Packing To Free Baremetal Nodes¶
When baremetal demand arrives, the platform should try to keep future options open by packing new slice allocations onto already-sliced nodes before opening a fresh node. That helps preserve fully available nodes for baremetal.
For v1, freeing a node that already has active slices is a drain problem, not a live migration promise:
- mark the node draining so no new slices land there;
- place new slice requests on other compatible nodes when capacity exists;
- wait for users to release existing slices or use an explicit admin/user stop-and-recreate flow if the product supports it;
- after the last slice releases and cleanup passes, transition the node back to baremetal/share mode.
Live migration of GPU/NVMe/IB passthrough VMs is not part of v1. A later "evacuate slices" feature can be implemented as planned shutdown, reprovision on another node, and user-visible restart semantics rather than transparent live migration.
4. BF3 / OVS Networking¶
GPUaaS proposes the BF3/OVS target state and will provide the bootstrap script shape, but the site-specific VF count, uplink mapping, and QoS numbers need infra review because they depend on fabric policy. The target state is:
- v1 default: node-local OVS private-NAT management network for VM SSH/control access;
- BF3 path: create the required VFs, attach them to OVS, and define one vNIC per VM when the site profile is ready;
- QoS: define the per-vNIC rate-limit knobs in the reviewed site profile, not in ad-hoc node-agent shell;
- inter-slice networking: keep OVS as the extension point for future VLAN, overlay, OVH-style segregation, or controlled slice-to-slice communication;
- public ingress: keep NAT/overlay ingress explicit and auditable; do not expose raw VNC or unauthenticated console paths.
This lets us help implement the BF3 bootstrap while keeping the control-plane contract stable: each approved slot reports a management MAC, private IP reservation, OVS bridge/port identity, and drift status.
5. Slot Topology Approval¶
Node-agent discovery should produce candidate topology only. Infra/operator approval turns candidates into schedulable slots.
Candidate output should include:
- GPU PCI address, NUMA node, IOMMU group, and VFIO binding state;
- NVMe identity, mount/signature state, and approved destructive-wipe status;
- IB/RDMA device identity, port state, GUID, and passthrough readiness;
- BF3/OVS management network identity;
- CPU/memory profile compatibility;
- host prerequisites: KVM, IOMMU groups, VFIO, OVS, libvirt, hugepages, reboot required state.
The scheduler must ignore incomplete or unapproved candidates even if they appear physically present.
Proposed Validation Path¶
Phase A: Profile Tag And Read-Only Evidence¶
Tag j22u05 and one additional candidate of the same hardware class with the
agreed composable MAAS tags. Run the deploy-time firmware profile helper in
read-only mode first if infra wants to review output before allowing RACADM
apply.
Success criteria:
- MAAS tags express profile intent and hardware identity;
- BIOS values are readable where RACADM is available;
- VMX/VT-d/SR-IOV status is reported;
- BF3, H200, NVMe, and IB visibility are reported by the host bootstrap and node-agent discovery path;
- no destructive changes occur in read-only mode.
Phase B: Apply Firmware Profile¶
Apply the minimal BIOS profile through deploy-time cloud-init only when the selected profile/version is missing or drifted. Power-cycle through the helper, MAAS, or BMC as appropriate for the host.
Success criteria (spot checks are sketched after this list):
- /dev/kvm exists after boot;
- IOMMU groups are present;
- vfio-pci can load;
- SR-IOV capability is visible for BF3 where expected;
- firmware profile state is recorded as ready for the selected profile/version.
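Hedged spot checks for these criteria could look like the commands below; the lspci vendor filter (15b3 for Mellanox/ConnectX) is only an example and the exact function to inspect is site-specific.

# /dev/kvm exists after boot
test -c /dev/kvm && echo "kvm: ok"

# IOMMU groups are present
ls /sys/kernel/iommu_groups | head

# vfio-pci can load
sudo modprobe vfio-pci && lsmod | grep -w vfio_pci

# SR-IOV capability is visible on the BF3/ConnectX function (vendor filter is an example)
sudo lspci -d 15b3: -vvv | grep -i 'SR-IOV'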
Phase C: Deploy Host Bootstrap¶
Run deployed-host bootstrap and reboot if required.
Success criteria (spot checks are sketched after this list):
- libvirt/OVS services are installed and active;
- IOMMU kernel args are active;
- VFIO modules are active;
- optional hugepage profile reports enough configured/free pages;
- host GPU persistence/fabric-manager services are disabled or proven safe for VFIO passthrough;
- /var/lib/libvirt permissions allow libvirt-qemu traversal;
- stale GPUaaS dnsmasq reservations are absent before provisioning;
- node-agent preflight reports slice runtime prerequisites as ready.
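Illustrative spot checks for these criteria are below, assuming Ubuntu service names (libvirtd, openvswitch-switch) and the libvirt-qemu user; adjust for the actual OS image.

# libvirt/OVS services installed and active (Ubuntu service names assumed)
systemctl is-active libvirtd openvswitch-switch

# IOMMU kernel args active in the running kernel
grep -o 'intel_iommu=[^ ]*\|iommu=pt' /proc/cmdline

# VFIO modules active
lsmod | grep -E '^vfio'

# optional hugepage profile: configured vs free pages
grep -E 'HugePages_(Total|Free)|Hugepagesize' /proc/meminfo

# /var/lib/libvirt traversal for libvirt-qemu
sudo -u libvirt-qemu test -x /var/lib/libvirt && echo "libvirt traversal: ok"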
Phase D: Storage Remap and Topology Approval¶
GPUaaS provides the storage transition checks and asks infra to approve the raw-device set that is safe for tenant slice use. Platform reruns node-agent topology discovery and approves only safe slot bundles.
Success criteria (an identity-capture sketch follows this list):
- no approved slice disk has mounted child partitions;
- destructive wipe policy is recorded;
- approved slice disks use stable by-id/WWN identity, never volatile /dev/nvmeXn1 names;
- all approved slots have GPU, NVMe, network identity, CPU/memory profile, and health state.
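A hedged way to capture stable identities for the approved device set is sketched below; it only maps volatile names to by-id/WWN values and does not decide which devices are safe for destructive use.

# Map volatile /dev/nvmeXn1 names to stable identities for the slot approval record.
lsblk -d -o NAME,WWN,SERIAL,MODEL,SIZE /dev/nvme?n1

# by-id symlinks give a copy-pasteable stable path per device.
ls -l /dev/disk/by-id/ | grep nvme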
Phase E: First End-to-End Slice¶
Run one one-GPU slice on j22u05, then release it.
Success criteria (spot checks are sketched after this list):
- VM boots with 24 vCPU / 64 GiB shape;
- expected GPU appears in guest;
- SSH readiness gate passes;
- release performs graceful shutdown then cleanup;
- raw NVMe wipe verification passes;
- slot returns to available only after cleanup proof.
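Hedged spot checks for this phase could look like the commands below; the guest user, the private IP, and the post-release device path are illustrative, with the IP taken from the lab evidence that follows.

# Expected GPU and shape visible in the guest (user/IP are illustrative).
ssh ubuntu@10.100.0.10 'lspci -nn | grep -i nvidia'
ssh ubuntu@10.100.0.10 'nproc; free -g'

# After release, a coarse wipe check on the host: wipefs without options only
# lists remaining filesystem signatures, so empty output is the desired result.
sudo wipefs /dev/nvme0n1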
Current j22u05 lab evidence reached the first active slice on
2026-04-17:
allocation_id=23bd713b-6073-4ef7-8a51-2897e902365d
vm=gpuaas-slice-23bd713b60734ef78a512897e902365d
private_ip=10.100.0.10
readiness.ssh_ready=true
shape=24 vCPU / 64 GiB
Host adjustments required to reach that state:
- stop nvidia-persistenced.service and nvidia-fabricmanager.service for slice-mode passthrough;
- set /var/lib/libvirt to 0755;
- keep dnsmasq clean of stale gpuaas-gpuaas-slice-*.conf reservations;
- use virt-install --tpm=none until vTPM is explicitly added and tested;
- use a 20m node-task TTL / 30m worker activity window for slice VM provisioning.
Additional terminal validation on 2026-04-17:
allocation_id=158ea147-0115-407f-bbd5-1850c32b9517
vm=gpuaas-slice-158ea1470115407fbbd51850c32b9517
private_ip=10.100.0.10
browser-terminal backend=slice SSH relay through node-agent
terminal smoke=session_ready plus command-output marker returned
relay key path=/var/lib/gpuaas/terminal/id_ed25519
Terminal enablement proposal (a relay-key sketch follows this list):
- During slice VM provisioning, node-agent creates or reuses a node-local SSH relay key and injects the public key into the guest cloud-init user.
- During terminal open, terminal-gateway includes the allocation shape and target guest host/port in the typed node task.
- For gpu_slice, node-agent opens the browser terminal by SSHing into the guest using the node-local relay key. For baremetal, it keeps the local UNIX-user PTY path.
- Node-agent release cleanup must use a virsh undefine --nvram fallback for UEFI slice VMs so stale shut-off domains do not keep raw disks marked in-use.
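A minimal sketch of the relay-key portion is below; the key path matches the evidence above, while the guest user name and the cloud-init fragment are assumptions about node-agent internals rather than the implemented flow.

# Create or reuse the node-local relay key (path from the lab evidence above).
KEY=/var/lib/gpuaas/terminal/id_ed25519
if [ ! -f "$KEY" ]; then
  install -d -m 0700 "$(dirname "$KEY")"
  ssh-keygen -t ed25519 -N '' -f "$KEY" -C gpuaas-terminal-relay
fi

# Inject the public key into the guest cloud-init user (fragment is illustrative).
cat <<EOF
users:
  - name: ubuntu
    ssh_authorized_keys:
      - $(cat "$KEY.pub")
EOF

# Terminal open for gpu_slice: SSH into the guest with the relay key.
ssh -i "$KEY" -o StrictHostKeyChecking=accept-new ubuntu@10.100.0.10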
Closed infra/model decision:
- Early j22u05 metadata mapped the same BF/fabric PCI device into more than one slice slot. A second concurrent slice failed while the first VM held that device.
- The v1 GPU VM slice fabric model is now per-slot VF-backed. Duplicate parent BF/fabric devices are operator context only; they are not concurrency proof.
- Schedulable slots must carry fabric_claim_mode=per_slot_vf and a unique fabric_vf_pci_address or equivalent isolated fabric attachment.
- j22u05 is no longer in the active test pool. Current end-to-end validation should use j22u15 first, then j22u11.
What GPUaaS Will Provide For Review¶
We are not asking infra to design the GPUaaS slice path. GPUaaS will provide the host profile tags, cloud-init/deploy scripts, validation commands, and first-node evidence. Infra review is needed for site-specific safety and fabric choices:
- confirm the MAAS tag names and assignment workflow for profile intent and hardware identity;
- confirm whether GPUaaS cloud-init may apply first-onboarding RACADM changes or whether infra will pre-align those values before deploy;
- identify the raw NVMe devices that are safe for destructive tenant slice use on j22u05;
- review the BF3 VF/OVS/QoS bootstrap profile and provide site-specific values where needed;
- confirm whether hugepages and the current network/RDMA tuning should be enabled for the first benchmark pass or kept off until A/B testing;
- review the host GPU service policy for VFIO passthrough mode;
- review the first-slot validation evidence before we enable scheduling beyond a lab node.
GPUaaS Commitments¶
GPUaaS will:
- keep slice scheduling disabled until approved slots exist;
- reject mounted/unsafe NVMe devices before clone or wipe;
- treat discovery as advisory and require explicit slot approval;
- keep CPU/memory/network shape in SKU/profile metadata, not node-agent scripts;
- record every manual host change in the runbook until it moves into cloud-init or infra bootstrap;
- expose slot health/drift so operators can see exactly why a node is not schedulable;
- help adapt the provided scripts into MAAS deploy/cloud-init workflows after infra approves the reviewed profile.