GPU Slice Node Manual Bootstrap Runbook
Purpose
Record the temporary manual steps used to prepare a GPU slice host before the same checks are moved into MAAS commissioning or infra bootstrap.
This runbook is for lab validation only. Production slice hosts should be marked schedulable only after automated bootstrap verifies the same prerequisites.
For the infra-facing enablement proposal, see
doc/operations/runbooks/GPU_Slice_Infra_Enablement_Proposal_v1.md.
2026-04-24 Remediation Notes
Live drift was confirmed on slice hosts j22u11 and j22u15:
- ovsbr0 existed with 10.100.0.1/24;
- dnsmasq was active and slice reservations existed;
- the iptables NAT rule for 10.100.0.0/24 was missing;
- the default route remained the host uplink on eno8303;
- slice traffic could therefore leave with 10.100.x.x source addresses and hit the firewall directly.
Infra separately confirmed this symptom from firewall logs. For the current
private-nat slice mode, that means guest egress can fail even when the host
looks healthy enough to schedule.
Immediate live remediation applied:
- installed /usr/local/sbin/gpuaas-slice-network-baseline.sh on j22u11 and j22u15;
- installed /etc/systemd/system/gpuaas-slice-network-baseline.service on both nodes;
- verified on both nodes:
  - iptables -t nat -S POSTROUTING contains -A POSTROUTING -s 10.100.0.0/24 -j MASQUERADE
  - tmfifo_net0 has 192.168.100.1/30
  - BF3 peer 192.168.100.2 responds to ping
Important service-model note:
The network baseline script works as a manual one-shot, but when it runs under the
persistent systemd service it must not call systemctl enable --now dnsmasq from inside
the script. The service-managed form should:
- run the script with systemd management disabled; and
- restart dnsmasq.service from ExecStartPost= (see the unit sketch after this list).
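A minimal sketch of that service-managed form, assuming the baseline script honors a
GPUAAS_SKIP_SYSTEMD=1 environment variable; the variable name and unit body below are
illustrative, not the shipped files:

cat <<'EOF' | sudo tee /etc/systemd/system/gpuaas-slice-network-baseline.service
[Unit]
Description=GPUaaS slice host network baseline (repair on boot)
Wants=network-online.target
After=network-online.target openvswitch-switch.service

[Service]
Type=oneshot
# Keep the script from managing systemd units while systemd is running it.
Environment=GPUAAS_SKIP_SYSTEMD=1
ExecStart=/usr/local/sbin/gpuaas-slice-network-baseline.sh
# Restart dnsmasq here instead of enabling it from inside the script.
ExecStartPost=/usr/bin/systemctl restart dnsmasq.service

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now gpuaas-slice-network-baseline.service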
Control-plane follow-up completed in code:
- node-agent slice topology discovery now reports live slice_network evidence;
- for the current private-nat mode, missing NAT is a hard blocker;
- missing IP forwarding is also surfaced as a blocker;
- tmfifo presence and IPv4 are exposed as validation evidence rather than being implicitly assumed from bootstrap.
Verification still required after this remediation:
- force or wait for node-agent topology rediscovery and confirm /admin/nodes no longer reports false-green slice networking;
- verify a slice VM on j22u15 actually reaches the internet with NAT in place;
- inspect the remaining cleanup_blocked slot separately, because NAT repair does not by itself clear stale slot lifecycle residue;
- add BM vs slice-VM benchmark validation so infra and product can confirm the networking/performance profile is correct, not merely bootable.
Automated Provisioning Parity Target
Manual repair is only a temporary way to discover the required host state. MAAS stock-image bootstrap, custom-image bootstrap, and any future node-agent repair task must converge to the same slice host baseline before the node is marked schedulable:
- MAAS machine has the gpuaas-profile-slice-vm intent tag;
- node-agent is enrolled and reporting the current host instance;
- VFIO/GPU passthrough preparation has completed;
- fabric VFs and IPoIB networking have completed;
- /etc/netplan/60-ipoib.yaml exists and an ibp* interface owns the site fabric address;
- a fabric route exists for the configured IPoIB subnet;
- ovsbr0 owns 10.100.0.1/24;
- dnsmasq has the GPU subnet configuration;
- libvirtd is active before launching slice VMs;
- node-agent topology evidence reports no hard slice-network blockers.
The service names must also be normalized. Early manual hosts used
gpuaas-slice-host-net-baseline; newer site-bootstrap hosts use
gpuaas-slice-network-baseline. Treat either name as historical evidence only;
the automated path should converge on one service name and the node-agent
readiness check should verify behavior, not just service existence.
2026-04-26 reimage comparison:
- known-good j22u15 matched the parity target after manual/bootstrap repair;
- custom-image j22u11 had node-agent, Netdata, VFIO, and OVS, but was missing the IPoIB netplan/fabric route and gpuaas-slice-network-baseline;
- bare-metal-intent j22u05 correctly lacked slice services because MAAS tagged it gpuaas-profile-baremetal.
Do not treat the j22u11 state as a successful slice-host image. It is a useful
partial-convergence signal showing that the custom image has the static runtime
pieces, while the site/profile bootstrap is still missing the fabric-network
baseline.
Automated Parity Check
Run the repo-owned parity checker after every MAAS deploy or reimage before trusting slice capacity in the scheduler:
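A local-only first pass might look like the following; the exact local-only flag set is
an assumption, but --host and --user match the API example below:

scripts/ops/gpuaas_slice_host_parity_check.sh \
--host 100.75.6.89 \
--user hpcadmin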
The checker emits one machine-readable line per target condition:
GPUAAS_SLICE_PARITY status=PASS check=node_agent detail=active/running
GPUAAS_SLICE_PARITY status=FAIL check=ipoib_netplan detail=command\ exited\ with\ 1 remediation=render\ /etc/netplan/60-ipoib.yaml\ from\ site/profile\ bootstrap
Use the API topology evidence check when a platform-admin token and GPUaaS node ID are available:
scripts/ops/gpuaas_slice_host_parity_check.sh \
--host 100.75.6.89 \
--user hpcadmin \
--api-base https://api.gpuaas.localhost \
--api-token-file /tmp/platform-admin.token \
--node-id b2d44b62-1efd-4aed-92c1-f56244b2b3d8
Expected result for a schedulable slice host:
- all hard local checks report status=PASS;
- historical_network_service is either PASS or at most WARN;
- api_topology reports completed with no candidate_summary.blockers;
- the final summary line has failures=0.
If only local checks are being run, api_topology reports WARN because the
script cannot prove node-agent topology evidence without an API token and node
ID. Treat that as acceptable for first-pass host diagnosis, but not as final
scheduler approval.
The checker intentionally validates behavior instead of only package presence:
gpuaas-slice-network-baseline, VFIO/KVM/IOMMU, /etc/netplan/60-ipoib.yaml,
ibp* fabric address and route, ovsbr0, dnsmasq, libvirt, NAT, IPv4
forwarding, and optional tmfifo_net0 management evidence.
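As an illustration of a behavior check rather than a package-presence check, a NAT
verification along these lines (sketch only; the subnet and output format follow the
examples above, not the checker's actual source):

# Verify the slice NAT rule exists as live behavior, not just that a unit file is installed.
if iptables -t nat -S POSTROUTING | grep -q -- '-s 10.100.0.0/24 -j MASQUERADE'; then
    echo "GPUAAS_SLICE_PARITY status=PASS check=nat_masquerade detail=rule-present"
else
    echo "GPUAAS_SLICE_PARITY status=FAIL check=nat_masquerade detail=rule-missing"
fi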
2026-04-24 Guest Telemetry and Benchmark Direction
The slice metrics gap should be closed without exposing guest dashboards directly.
Recommended product boundary:
- bare-metal allocations keep host Netdata and Open Netdata;
- slice allocations use guest telemetry collected through node-agent over the controlled 10.100.x.x management path;
- host Netdata remains operator-only for slice-node health and network diagnosis.
Direct guest Netdata proxying is not the preferred surface because it expands tenant VM network exposure and recreates the same proxy/session problems we already had to solve for other apps.
Reference design note:
Repo-owned benchmark capture harness:
scripts/ops/gpuaas_benchmark_capture.sh \
--host 10.177.36.197 \
--user ubuntu \
--label baremetal-j22u15 \
--output dist/benchmarks/baremetal-j22u15.json
scripts/ops/gpuaas_benchmark_capture.sh \
--host 10.100.0.10 \
--user gpuuser \
--ssh-key /var/lib/gpuaas/terminal/id_ed25519 \
--label slice-vm-j22u15-s01 \
--output dist/benchmarks/slice-vm-j22u15-s01.json
Optional probes:
- --fio-target /dev/nvme0n1 for a read-only storage capture on an approved benchmark target;
- --rdma-peer <host> and optional --rdma-device <name> for ib_write_bw output when infra provides the peer path.
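For illustration, a slice-VM capture that adds both optional probes to the invocation
above (the fio target and RDMA peer are placeholders and must be infra-approved first):

scripts/ops/gpuaas_benchmark_capture.sh \
--host 10.100.0.10 \
--user gpuuser \
--ssh-key /var/lib/gpuaas/terminal/id_ed25519 \
--label slice-vm-j22u15-s01 \
--output dist/benchmarks/slice-vm-j22u15-s01.json \
--fio-target /dev/nvme0n1 \
--rdma-peer <approved-peer-host> \
--rdma-device <name>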
Use the same harness for BM and VM so the comparison artifact reflects the same measurement method and command set.
Automation Artifacts
The manual flow is now split into two repo-owned automation entry points:
- scripts/maas/commissioning/50-gpuaas-slice-firmware-preflight.sh runs during MAAS commissioning. It installs Dell RACADM from the signed Dell OpenManage repository when needed, reads the slice-critical BIOS settings, validates KVM/IOMMU/GPU/BF3/SR-IOV visibility, reports NVMe whole-device inventory with stable by-id paths, child partitions, mountpoints, filesystem types, and wipefs signatures, and prints GPUAAS_SLICE_* result lines into the commissioning output. It is read-only by default. Set GPUAAS_SLICE_APPLY_BIOS=1 only after infra approval to stage BIOS changes and queue a RACADM power-cycle job.
- scripts/ops/gpuaas_slice_host_deploy_bootstrap.sh runs on the deployed Ubuntu host through cloud-init, MAAS deploy userdata, or an operator shell. It installs the persistent runtime packages, stages intel_iommu=on iommu=pt, loads VFIO modules, enables libvirt/OVMF, installs the OVS/NAT/RShim repair service from scripts/ops/gpuaas_slice_host_net_baseline.sh, and runs the deployed-host runtime prepare helper from scripts/ops/gpuaas_slice_host_runtime_prepare.sh.
Commissioning should decide whether the node is firmware-capable. Deployed-host bootstrap should make the installed OS slice-runtime capable. Slot approval must still wait for node-agent topology discovery and operator approval; neither script marks GPU/NVMe/IB bundles schedulable.
Initial MAAS upload path after infra review:
maas maas280 node-scripts create \
name=50-gpuaas-slice-firmware-preflight \
script@=scripts/maas/commissioning/50-gpuaas-slice-firmware-preflight.sh
Keep the first run in read-only mode. After the commissioning output is
reviewed for the target hardware class, set GPUAAS_SLICE_APPLY_BIOS=1 through
the MAAS script environment or a site-specific wrapper to let RACADM queue the
BIOS commit and power-cycle job.
Lab Host: j22u05
Observed on 2026-04-16:
- Host: j22u05
- Tailscale IP: 100.127.75.81
- OS: Ubuntu 24.04.4 LTS
- CPU: Intel Xeon Platinum 8570
- Platform blocker: VMX is disabled in firmware.
Evidence:
x86/cpu: VMX (outside TXT) disabled by BIOS
kvm_intel: VMX not enabled (by BIOS) in MSR_IA32_FEAT_CTL
Until firmware VT-x/VMX is enabled, /dev/kvm will remain missing and
node-agent slice VM preflight must fail fast before image clone or disk writes.
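A minimal sketch of that fail-fast preflight (illustrative shell only; the real check
lives in the node-agent):

# Stop before any image clone or disk write if KVM is unavailable.
if [ ! -e /dev/kvm ]; then
    echo "slice VM preflight failed: /dev/kvm missing (firmware VT-x/VMX likely disabled)" >&2
    exit 1
fi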
Post-reboot validation on 2026-04-16 after applying the OS-side changes:
cmdline=... intel_iommu=on iommu=pt
kvm=missing
iommu_groups=0
libvirtd=active
DMAR: IOMMU enabled
iommu: Default domain type: Passthrough (set via kernel command line)
x86/cpu: VMX (outside TXT) disabled by BIOS
Conclusion: the Linux boot configuration is now staged correctly, but the host is still not slice-VM capable. Infra must enable firmware virtualization features, at minimum VT-x/VMX and VT-d/IOMMU, through BIOS/BMC/Redfish/MAAS commissioning before this node can pass node-agent preflight.
Additional discovery on 2026-04-16:
- Hardware: Dell PowerEdge XE9680L.
- BIOS version: Dell 2.7.5.
- iDRAC/BMC address from local IPMI: 10.177.3.241.
- MAAS site: maas280.
- MAAS system ID: a4mkx3.
- GPUaaS node ID: 273666b0-485d-4bf5-a54f-67c625ad3544.
- Platform-control MAAS site record says default power credentials are configured, but the active Vault runtime did not contain kv/maas-sites/9995eff3-1967-4615-ac78-29c9202702b3/power/default.
- Platform-control MAAS CLI initially had only the older lab-maas-a profile. A maas280 profile was added with the operator-provided API key on 2026-04-16.
- No local Dell management tooling was installed on the host initially: racadm, syscfg, omreport, omconfig, and dsu were absent.
- MAAS power parameters for system a4mkx3 show power_type=ipmi, power_address=10.177.3.241, and power_user=maas.
- The MAAS IPMI credentials can query power state but do not authenticate to iDRAC Redfish. Both Basic auth and Redfish SessionService login returned 401.
RACADM validation on 2026-04-16:
- The prototype path openmanage/930/focal is stale for these Ubuntu 24.04 hosts. It returns 404, and the prototype script only simulated BIOS changes.
- The signed Dell OpenManage repository path that exposed RACADM on j22u05 was openmanage/11010/jammy.
- Installed packages: srvadmin-hapi, srvadmin-idracadm7, srvadmin-idracadm8, version 11.0.1.0.
- Local RACADM reads worked as root without Redfish credentials: racadm getversion, racadm get BIOS.ProcSettings, and racadm get BIOS.IntegratedDevices.
- Read-only BIOS values on j22u05: ProcVirtualization=Disabled, LogicalProc=Disabled, SriovGlobalEnable=Disabled.
Current blocker:
- Configure BlueField-3 virtual functions, OVS attachment, and per-vNIC QoS policy through infra bootstrap.
- Reboot and re-run the full slice validation below.
Minimal local RACADM install path used for lab validation:
curl -fsSL https://linux.dell.com/repo/pgp_pubkeys/0x1285491434D8786F.asc \
| sudo gpg --dearmor -o /usr/share/keyrings/dell-openmanage.gpg
echo 'deb [signed-by=/usr/share/keyrings/dell-openmanage.gpg] http://linux.dell.com/repo/community/openmanage/11010/jammy jammy main' \
| sudo tee /etc/apt/sources.list.d/linux.dell.com.sources.list
sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y srvadmin-idracadm8
sudo racadm get BIOS.ProcSettings
sudo racadm get BIOS.IntegratedDevices
Do not use the prototype's --allow-unauthenticated install path in the
commissioning script. Use Dell's signed repo key and a scoped APT keyring.
Minimum BIOS changes applied on j22u05 on 2026-04-16:
sudo racadm set BIOS.ProcSettings.ProcVirtualization Enabled
sudo racadm set BIOS.ProcSettings.LogicalProc Enabled
sudo racadm set BIOS.IntegratedDevices.SriovGlobalEnable Enabled
sudo racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW
RACADM job IDs:
- BIOS commit: JID_763788807801
- Power cycle: RID_763788809560
Post-power-cycle validation:
cmdline=... intel_iommu=on iommu=pt
kvm=present
iommu_groups=510
kvm_intel loaded
vfio, vfio_iommu_type1, vfio_pci loaded
ProcVirtualization=Enabled
LogicalProc=Enabled
SriovGlobalEnable=Enabled
DMAR: Intel(R) Virtualization Technology for Directed I/O
This clears the firmware/KVM blocker. It does not complete the networking bootstrap for GPU slices.
BlueField-3 / OVS Bootstrap Gap
Infra confirmed that slice networking needs BF3 virtual functions wired into OVS, with one vNIC per VM and optional QoS throttling per vNIC. This remains a separate prerequisite from BIOS enablement.
Initial j22u05 observation after BIOS enablement:
0000:bc:00.0 Ethernet controller: Mellanox MT43244 BlueField-3 integrated ConnectX-7
0000:bc:00.1 Ethernet controller: Mellanox MT43244 BlueField-3 integrated ConnectX-7
0000:bc:00.2 DMA controller: Mellanox MT43244 BlueField-3 SoC Management Interface
openvswitch-switch=inactive
The current prototype host does not create BF3 SR-IOV VFs either
(sriov_numvfs=0). It uses a node-local OVS bridge plus libvirt tap ports
(vnet*) for VM management networking, while BF3 is validated separately for
RShim/RDMA.
Latest prototype notes from maas280:/root/node-investigation/GPUaaS:
- The tested VM shape is now documented as 24 vCPUs and 64 GiB RAM for a one-GPU slice.
- Guest boot uses UEFI with secure boot disabled.
- The optional guest driver path installs NVIDIA server driver packages and RDMA/IB packages through cloud-init, then reboots the VM.
- Benchmark results show near-native GPU compute and HBM bandwidth, but lower IB and vLLM performance: FP64 -2.1%, HBM 0%, IB write bandwidth -43%, vLLM OPT-125M -17%.
- The updated prototype install.sh should not be copied directly into automation: its apt-get install command is split across lines without shell continuations. Use the repo-owned bootstrap scripts instead.
Latest local archive review from ~/Downloads/GPUaaS_0416:
- setup_host.sh now force-stops host NVIDIA/Fabric Manager/DCGM services, removes native NVIDIA modules, and binds both GPUs and IB devices to vfio-pci. Production bootstrap should do this only after the node is explicitly placed in slice mode, and should use persistent driverctl overrides generated from approved topology, not hardcoded PCI lists (see the driverctl sketch after this list).
- install.sh stages 1 GiB hugepages with hugepages=512, and app.py launches VMs with --memorybacking=hugepages=yes. Treat this as part of the tuned h200_1g_24c_64g runtime profile. It needs benchmark validation against non-hugepage mode and must be reported as a host capability before scheduling a profile that requires it.
- The prototype disables UEFI Secure Boot for slice VMs. Keep this in the VM runtime profile because guest NVIDIA driver installation can otherwise hang on MOK/Secure Boot prompts.
- The prototype adds host and guest network tuning: net.core.rmem_max, net.core.wmem_max, TCP rmem/wmem, backlog, ib_ipoib queue sizes, and unlimited memlock. These are promising for the observed IB/vLLM delta, but should be introduced as a named tuning profile and benchmarked before becoming the default.
- The prototype unmounts /dev/nvme*p1 share partitions before using parent NVMe devices as VM disks. Do not automate this behavior as-is. GPUaaS should instead require an infra-approved storage profile that declares which disks are tenant slice disks, blocks mounted child partitions, and requires explicit destructive wipe approval before slot reuse.
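A sketch of the persistent driverctl overrides mentioned above (the PCI address is a
placeholder; the real list must come from the approved slice topology profile, not a
hardcoded script):

# Persist a vfio-pci binding across reboots instead of hand-binding at boot.
sudo driverctl set-override 0000:1b:00.0 vfio-pci
# Review what is currently overridden on the host.
sudo driverctl list-overrides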
Follow-up benchmarks to run once infra provides approved raw NVMe slice disks:
- hugepages on/off for the 24 vCPU / 64 GiB profile;
- guest driver profile nvidia-driver-570-server versus any future pre-baked platform image catalog default;
- host and guest network sysctl tuning on/off;
- IB bandwidth before/after IPoIB queue and memlock tuning;
- vLLM throughput with 12, 24, and any proposed higher vCPU profile.
Low-risk lab baseline applied on j22u05:
openvswitch-switch=active
dnsmasq=active
rshim=active
gpuaas-slice-host-net-baseline=active
ovsbr0=10.100.0.1/24
tmfifo_net0=192.168.100.1/30
NAT: POSTROUTING -s 10.100.0.0/24 -j MASQUERADE
BF3 RShim peer 192.168.100.2 ping succeeds
BF3 PF 0000:bc:00.0: sriov_totalvfs=16, sriov_numvfs=0
BF3 PF 0000:bc:00.1: sriov_totalvfs=16, sriov_numvfs=0
Live slice repair applied on j22u05 on 2026-04-17:
- Replaced /opt/gpuaas/node-agent/gpuaas-node-agent with a local build that disables Secure Boot for new slice VMs and installs the guest NVIDIA/RDMA runtime during cloud-init. The previous binary was backed up under /opt/gpuaas/node-agent/gpuaas-node-agent.backup-*.
- Repaired allocation 158ea147-0115-407f-bbd5-1850c32b9517 in place after confirming PCI passthrough was correct but guest drivers were missing.
- Installed guest packages: nvidia-driver-570-server, nvidia-utils-570-server, rdma-core, ibverbs-utils, infiniband-diags, linux-headers-$(uname -r), and linux-modules-extra-$(uname -r).
- The original VM had Secure Boot enabled, so modprobe nvidia failed with "Key was rejected by service". The libvirt domain was redefined to use /usr/share/OVMF/OVMF_CODE_4M.fd and /usr/share/OVMF/OVMF_VARS_4M.fd with secure='no'; the old NVRAM file was backed up before reset.
- Final validation inside the slice showed nvidia-smi reporting an NVIDIA H200 with driver 570.211.01 and ibv_devinfo -l reporting one HCA (ibp6s0).
The reusable network baseline script is scripts/ops/gpuaas_slice_host_net_baseline.sh.
For full deployed-host setup, use scripts/ops/gpuaas_slice_host_deploy_bootstrap.sh.
Set GPUAAS_SLICE_REBOOT_IF_REQUIRED=1 only when the caller is allowed to
reboot the node automatically after GRUB or initramfs changes.
For the current tuned prototype profile, infra can explicitly request 512x 1 GiB hugepages during deployed-host bootstrap:
For host network buffer/backlog tuning from the prototype, enable the performance profile in the network baseline:
Both settings are opt-in because they change host-wide behavior and should be paired with benchmark evidence for the target node class.
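A sketch of what the opt-in network performance profile could stage, based on the sysctl
names from the prototype notes above (the file name and every value below are
placeholders, not benchmarked defaults):

cat <<'EOF' | sudo tee /etc/sysctl.d/90-gpuaas-slice-net-perf.conf
# Placeholder values; benchmark before adopting for a node class.
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
net.ipv4.tcp_rmem = 4096 87380 268435456
net.ipv4.tcp_wmem = 4096 65536 268435456
net.core.netdev_max_backlog = 250000
EOF
sudo sysctl --system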
On j22u05, the network baseline script and service are installed as:
/usr/local/sbin/gpuaas-slice-host-net-baseline
/etc/systemd/system/gpuaas-slice-host-net-baseline.service
The service repairs the OVS bridge IP, dnsmasq config, NAT rule, and
tmfifo_net0 address after reboot. It intentionally does not create BF3 VFs or
configure QoS.
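For context, those post-reboot repairs amount to something like the following
(illustrative commands only; the authoritative logic is in
scripts/ops/gpuaas_slice_host_net_baseline.sh):

# Re-assert the OVS bridge and RShim tmfifo addresses.
ip addr replace 10.100.0.1/24 dev ovsbr0
ip addr replace 192.168.100.1/30 dev tmfifo_net0
# Add the slice NAT rule only if it is not already present.
iptables -t nat -C POSTROUTING -s 10.100.0.0/24 -j MASQUERADE 2>/dev/null \
|| iptables -t nat -A POSTROUTING -s 10.100.0.0/24 -j MASQUERADE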
Infra bootstrap still needs to decide and implement:
- Enable or verify BF3 embedded switch / representor mode for VM networking.
- Create the required SR-IOV VFs for the slice capacity profile.
- Attach VFs or representors to the node-local OVS bridge.
- Apply QoS shaping policy per VM vNIC where required.
- Reconcile the resulting MAC/VF/OVS-port map into control-plane slot inventory before marking slots schedulable.
The dry-run/apply helper for this approval path is:
IDRAC_HOST=10.177.3.241 \
IDRAC_USER=<redacted> \
IDRAC_PASSWORD=<redacted> \
scripts/ops/dell_redfish_slice_firmware.py
Lab Node-Agent Refresh
Manual lab change on j22u05 on 2026-04-17:
- Built cmd/node-agent locally from commit 80d8a259.
- Copied the binary to /tmp/gpuaas-node-agent-80d8a259.
- Backed up the previous running binary under /opt/gpuaas/node-agent/gpuaas-node-agent.backup-<UTC timestamp>.
- Installed the new binary at /opt/gpuaas/node-agent/gpuaas-node-agent.
- Restarted gpuaas-node-agent.
Validation:
previous node-agent: version=5bb47e4 built_at=2026-04-16T05:55:02Z
current node-agent: version=80d8a259 built_at=2026-04-17T11:21:12Z
service state: active
Reason: 80d8a259 adds node-agent-managed dnsmasq host reservations for slice
VM MAC/private-IP pairs and keeps the existing slice VM lifecycle task support.
This was applied manually only to the lab node so the change can be validated
before it becomes part of the normal node-agent lifecycle or provisioning flow.
Additional lab refreshes on 2026-04-17:
- 7123a111 was installed manually after the first provisioning failure showed a cleanup path writing the invalid slot health state unhealthy. The schema accepts unknown, healthy, degraded, and failed; failed cleanup now marks the slot health state as failed.
- c45003f7 was installed manually after storage validation found that an approved slot disk can have mounted child partitions even when the parent device itself is not reported as mounted by findmnt --source <disk>. Node-agent now checks lsblk child mountpoints before clone or wipe (see the sketch after this list).
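A minimal sketch of that child-mountpoint check (the device path is an example; the real
check is in node-agent):

# findmnt --source on the parent device misses mounted children; lsblk lists them.
disk=/dev/nvme0n1
if lsblk -nro MOUNTPOINT "$disk" | grep -q '[^[:space:]]'; then
    echo "refusing clone/wipe: $disk or one of its child partitions is mounted" >&2
fi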
Current validation:
These refreshes were fast-path lab deployments only. They must still move through the normal platform-control deploy path before they are considered part of the managed environment.
First Slice Runtime Validation Attempt
Manual lab validation on j22u05 on 2026-04-17 proved the control-plane and
node-agent task path far enough to expose a real host topology blocker:
- Public allocation API created a one-GPU h200-sxm-slice allocation.
- The scheduler selected an approved slot on node 273666b0-485d-4bf5-a54f-67c625ad3544.
- The node-agent claimed and executed the slice.vm_provision task.
- The libvirt/cloud-init path reached VM launch but failed during runtime validation and cleanup.
Critical storage finding: the NVMe device assigned to the selected slot was host-mounted rather than an approved raw slice disk (see the operational impact notes below). Other local NVMe devices also looked like host data disks rather than approved raw slice disks:
/dev/nvme2n1p1 mounted at /share3
/dev/nvme3n1p1 mounted at /share4
/dev/nvme5n1p1 mounted at /share5
/dev/nvme6n1p1 mounted at /share6
/dev/nvme7n1p1 mounted at /share7
/dev/nvme8n1p1 mounted at /share8
/dev/nvme1n1 had existing partitions and no mountpoint at inspection time
Operational impact:
- j22u05 must not be treated as a schedulable slice target until infra provides or approves raw unmounted NVMe devices for slice use.
- All j22u05 control-plane slots were manually moved to cleanup_blocked/failed so the scheduler will not reuse them.
- The failed test allocation was manually marked failed with a failure reason indicating that the slot NVMe was host-mounted and requires infra remap or cleanup approval.
- During cleanup investigation, wipefs --all --force /dev/nvme0n1 erased disk signatures before the mounted child partition was identified. Infra should review /share2 before rebooting or relying on that mount.
Required infra follow-up before re-enabling this host for slice runtime tests:
- Decide which NVMe devices are tenant-slice disks versus host share disks.
- Remove host mounts from slice-owned disks, or update slot inventory to use only dedicated raw devices.
- Approve destructive wipe or reimage for any disk assigned to a tenant slice.
- Re-run node-agent topology discovery and approve slots from the candidate map only after mounted child partitions and unexpected filesystems are absent.
- Keep mounted child partition detection in node-agent and commissioning checks; parent-device mount checks are not sufficient.
Apply only after infra approval:
IDRAC_HOST=10.177.3.241 \
IDRAC_USER=<redacted> \
IDRAC_PASSWORD=<redacted> \
scripts/ops/dell_redfish_slice_firmware.py --apply
The helper is dry-run by default. It uses Redfish session authentication, discovers current BIOS attributes, prints only non-secret virtualization-related values, and plans these Dell BIOS changes unless overridden:
- ProcVirtualization=Enabled
- SriovGlobalEnable=Enabled
If this BIOS revision uses different attribute names, run the helper with
explicit --attr NAME=VALUE pairs after confirming the names from Redfish.
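An example of that override form, assuming --attr is repeatable (attribute names must be
confirmed from Redfish for the installed BIOS revision first):

IDRAC_HOST=10.177.3.241 \
IDRAC_USER=<redacted> \
IDRAC_PASSWORD=<redacted> \
scripts/ops/dell_redfish_slice_firmware.py --apply \
--attr ProcVirtualization=Enabled \
--attr SriovGlobalEnable=Enabled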
Manual OS-Side Changes Applied
The following changes were applied manually on j22u05:
sudo cp /etc/default/grub /etc/default/grub.gpuaas-backup-<timestamp>
sudo sed -i -E 's|^GRUB_CMDLINE_LINUX=.*|GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"|' /etc/default/grub
printf "vfio\nvfio_iommu_type1\nvfio_pci\n" | sudo tee /etc/modules-load.d/gpuaas-vfio.conf
sudo update-grub
sudo systemctl enable --now libvirtd
virtqemud.service was not present on this host; libvirtd is the active
libvirt daemon.
Required Reboot
The GRUB kernel arguments require a reboot before validation:
After the host returns:
cat /proc/cmdline
test -e /dev/kvm && echo kvm=present || echo kvm=missing
find /sys/kernel/iommu_groups -mindepth 1 -maxdepth 1 | wc -l
lsmod | egrep '^(kvm|kvm_intel|vfio|vfio_pci|vfio_iommu_type1)'
sudo dmesg | egrep -i 'kvm|iommu|dmar|vt-d|vmx|disabled by bios' | tail -120
Expected healthy state:
- /proc/cmdline contains intel_iommu=on iommu=pt.
- /dev/kvm exists.
- IOMMU groups are non-empty.
- kvm_intel and VFIO modules are loaded.
- dmesg no longer reports VMX disabled by BIOS.
If /dev/kvm is still missing and dmesg reports VMX disabled by BIOS, the
next action is firmware/BMC/MAAS-side enablement, not another node-agent change.
Infra Bootstrap Target
Move this into MAAS commissioning or the node bootstrap pipeline:
- Enable or validate firmware virtualization: VT-x/VMX and VT-d/IOMMU.
- Apply kernel args: intel_iommu=on iommu=pt for Intel hosts; AMD hosts need the equivalent IOMMU policy.
- Install slice packages: qemu-kvm, libvirt-daemon-system, libvirt-clients, virtinst, openvswitch-switch, cloud-image-utils, ovmf, rdma-core, and optional host tools such as rshim, infiniband-diags, ibverbs-utils, and driverctl (see the example after this list).
- Configure VFIO modules to load at boot.
- Reboot if kernel args or firmware state changed.
- Validate /dev/kvm, IOMMU groups, VFIO, OVS, libvirt, RDMA, and image cache.
- Only approve node_resource_slots after validation passes.
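For reference, the package-install step in the list above might look like this (sketch
only; the repo-owned deploy bootstrap script is the source of truth):

sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y \
qemu-kvm libvirt-daemon-system libvirt-clients virtinst \
openvswitch-switch cloud-image-utils ovmf rdma-core \
rshim infiniband-diags ibverbs-utils driverctl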
Current ownership split:
- MAAS commissioning script: firmware read/apply, RACADM availability, KVM/IOMMU live-effect validation, H200 GPU inventory, BF3/SR-IOV visibility inventory, and read-only NVMe identity/mount/signature evidence.
- Deployed-host bootstrap script: OS packages, GRUB kernel args, VFIO modules, libvirt, OVS private NAT, RShim reachability repair service, host GPU service quiescing for slice mode, libvirt traversal permission repair, and stale GPUaaS dnsmasq reservation cleanup.
- Explicit slice-mode storage transition: destructive unmount, fstab disable, wipefs, blkdiscard, and manifest recording for infra-approved raw slice devices.
- Infra BF3 profile: embedded switch mode, virtual functions, representor or VF attachment model, and per-vNIC QoS.
- Persistent VFIO binding: apply only after the node is intentionally in slice mode and the approved topology profile is known.
- Node-agent future work: typed topology discovery, slot candidate reporting, drift reconciliation, and VM lifecycle execution.
j22u05 Slice Storage Transition
Manual lab change on j22u05 on 2026-04-17:
- Installed scripts/ops/gpuaas_slice_storage_transition.sh as /usr/local/sbin/gpuaas-slice-storage-transition.
- Converted the eight approved tenant slice NVMe devices from host-share mode to slice-owned raw devices:
sudo /usr/local/sbin/gpuaas-slice-storage-transition --apply \
--devices /dev/nvme0n1,/dev/nvme1n1,/dev/nvme2n1,/dev/nvme3n1,/dev/nvme5n1,/dev/nvme6n1,/dev/nvme7n1,/dev/nvme8n1
The script:
- rejected the OS disk (/dev/nvme4n1) as a slice candidate;
- backed up /etc/fstab to /etc/fstab.gpuaas-slice-storage-backup-20260417152532;
- unmounted /share2 through /share8;
- commented all /share1 through /share8 fstab entries with the gpuaas-slice-storage-disabled marker;
- ran wipefs --all --force on old child partitions and parent devices;
- ran blkdiscard -f on each approved slice NVMe;
- wrote the transition manifest to /var/log/gpuaas/slice-storage-transition-20260417152532.txt.
Post-change host validation:
/dev/nvme0n1 raw, unmounted
/dev/nvme1n1 raw, unmounted
/dev/nvme2n1 raw, unmounted
/dev/nvme3n1 raw, unmounted
/dev/nvme5n1 raw, unmounted
/dev/nvme6n1 raw, unmounted
/dev/nvme7n1 raw, unmounted
/dev/nvme8n1 raw, unmounted
/dev/nvme4n1 remains the mounted OS disk
Control-plane slot approval was updated through the admin
/resource-slots API:
- slot status: available;
- health state: healthy;
- one-GPU slice shape: 24 vCPU and 64 GiB memory;
- capacity_metadata.storage_ownership=slice;
- capacity_metadata.storage_mode=slice;
- capacity_metadata.destructive_wipe_policy=blkdiscard;
- capacity_metadata.storage_transition_manifest points to the manifest above.
Fast-path node-agent refresh:
- The initial post-transition topology discovery exposed that /sys/bus/pci/devices entries are symlinks on this host, so the node-agent skipped GPUs and fabric devices.
- The discovery code was patched to accept PCI sysfs symlink entries.
- The advisory NVMe candidate filter was patched to ignore non-standard names such as nvme4c4n1 and mounted devices such as the OS disk.
- A fast-path lab binary was installed at /opt/gpuaas/node-agent/gpuaas-node-agent and the previous binary was backed up under /opt/gpuaas/node-agent/gpuaas-node-agent.backup-*.
Reboot follow-up:
- A reboot proved that /dev/nvmeXn1 kernel names are not stable on j22u05. The BOSS OS disk moved from /dev/nvme4n1 before reboot to /dev/nvme7n1 after reboot.
- Approved slice slots must therefore use stable disk identity, not volatile kernel paths. The node-agent was patched to report /dev/disk/by-id/nvme-eui.* paths and keep the kernel path only as advisory metadata (see the resolution sketch after this list).
- Slot approval was rewritten to use storage_identity_kind=nvme_wwn_by_id.
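A sketch of mapping a volatile kernel path to its stable by-id alias (the kernel device
below is an example; node-agent does the equivalent internally):

# The by-id symlinks point at the kernel device, so walk the directory and compare targets.
kernel_dev=/dev/nvme0n1
for link in /dev/disk/by-id/nvme-eui.*; do
    if [ "$(readlink -f "$link")" = "$kernel_dev" ]; then
        echo "stable identity: $link"
    fi
done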
Final topology discovery result after the by-id patch:
gpu_devices=8
fabric_devices=12
nvme_devices=9
mounted_nvme_devices=1
candidate_slots=8
candidate slot NVMe map:
0 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a103331
1 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a10dcab
2 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a103489
3 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a10dc41
4 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a10dc98
5 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a10e101
6 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a10e115
7 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a103483
This host is now in slice mode. Do not remount /share* or use these NVMe
devices for baremetal/share workloads while any slice slots are available,
reserved, provisioning, active, releasing, or cleanup.
j22u05 Slice Runtime Preparation
Manual lab changes applied on 2026-04-17 to get the first VM to boot:
- Installed scripts/ops/gpuaas_slice_host_runtime_prepare.sh as the proposed host runtime preparation helper.
- Stopped nvidia-persistenced.service and nvidia-fabricmanager.service. These services held /dev/nvidia*, NVSwitch, and NVLink devices and blocked libvirt from detaching 0000:1b:00.0 for VFIO passthrough.
- Set /var/lib/libvirt to 0755. It was 0700 root:root, which prevented the libvirt-qemu process from reading its generated domain-*/master-key.aes.
- Removed stale /etc/dnsmasq.d/gpuaas-gpuaas-slice-*.conf reservations left by failed lab attempts and restarted dnsmasq.
- Patched node-agent VM launch to pass --tpm=none because virt-install auto-added a vTPM for Ubuntu 24.04 and this host's swtpm_setup failed.
- Increased the deployed provisioning node-task TTL to 20m and the worker Temporal activity window to 30m for slice VM import/readiness.
Current active smoke evidence:
allocation_id=23bd713b-6073-4ef7-8a51-2897e902365d
status=active
vm=gpuaas-slice-23bd713b60734ef78a512897e902365d
private_ip=10.100.0.10
ssh_port=22
slot=0
shape=24 vCPU / 64 GiB
node-agent readiness: ssh_ready=true
These runtime steps must move into MAAS/deployed-host bootstrap before another node is enabled for slices. In particular, slice-mode hosts should not run host GPU persistence/fabric-manager services unless the approved VFIO/BF profile says they are safe for the passthrough model.
j22u05 Slice Terminal Follow-Up
Manual lab changes applied on 2026-04-17 after the first active slice showed that browser console worked for baremetal but not for slices:
- Root cause: terminal.open always opened a local baremetal UNIX user with user.Lookup(username). Slice users live inside the VM, so the node-agent returned "lookup terminal user: unknown user ..." for slice allocations.
- Platform-control terminal-gateway was fast-deployed with terminal task payloads that include capacity_shape, target_host, and target_port.
- The j22u05 node-agent was fast-deployed with a slice terminal backend. For capacity_shape=gpu_slice, node-agent starts an SSH PTY into the guest using /var/lib/gpuaas/terminal/id_ed25519.
- Slice VM provisioning now creates the node-local terminal relay key if missing and injects its public key into the allocation user's cloud-init ssh_authorized_keys.
- Release cleanup now retries virsh undefine <vm> --nvram when the first undefine fails, because UEFI slice VMs can otherwise remain as shut-off domains and keep raw NVMe disks marked in-use by libvirt.
Live validation evidence:
allocation_id=158ea147-0115-407f-bbd5-1850c32b9517
status=active
vm=gpuaas-slice-158ea1470115407fbbd51850c32b9517
private_ip=10.100.0.10
terminal task params: capacity_shape=gpu_slice target_host=10.100.0.10 target_port=22
node-agent terminal command: ssh -tt -i /var/lib/gpuaas/terminal/id_ed25519 ... u_adda9308def04a2f@10.100.0.10
terminal websocket smoke: session_ready plus command-output marker returned
direct relay-key SSH smoke: slice-terminal-ok, guest hostname returned
Additional lab cleanup performed:
- Released the pre-terminal-key slice allocation 23bd713b-6073-4ef7-8a51-2897e902365d.
- Manually undefined its stale libvirt domain with virsh undefine --nvram after observing that the old release path left the shut-off domain behind.
- Manually reset slots 0 and 1 from cleanup_blocked to available after failed validation retries that did not create running VMs.
- Left the current validation slice 158ea147-0115-407f-bbd5-1850c32b9517 active for UI/browser verification.
Known follow-up before concurrent slices:
- Current slot metadata maps the same fabric_device (0000:1a:00.0) into multiple slice slots. A second concurrent slice failed while the first slice held that device. Platform scheduling now treats non-empty duplicate fabric devices as exclusive shared constraints, so j22u05 will not place another slice on a slot with the same fabric device while the first claim is reserved/provisioning/active/releasing. We still need infra to confirm the final BF/VF model for safe multi-slice fabric sharing.