GPU Slice Node Manual Bootstrap Runbook
Purpose
Record the temporary manual steps used to prepare a GPU slice host before the same checks are moved into MAAS commissioning or infra bootstrap.
This runbook is for lab validation only. Production slice hosts should be marked schedulable only after automated bootstrap verifies the same prerequisites.
For the infra-facing enablement proposal, see
doc/operations/runbooks/GPU_Slice_Infra_Enablement_Proposal_v1.md.
2026-04-24 Remediation Notes
Live drift was confirmed on slice hosts j22u11 and j22u15:
- ovsbr0 existed with 10.100.0.1/24;
- dnsmasq was active and slice reservations existed;
- the iptables NAT rule for 10.100.0.0/24 was missing;
- the default route remained the host uplink on eno8303;
- slice traffic could therefore leave with 10.100.x.x source addresses and hit the firewall directly.
Infra separately confirmed this symptom from firewall logs. For the current
private-nat slice mode, that means guest egress can fail even when the host
looks healthy enough to schedule.
Immediate live remediation applied:
- installed /usr/local/sbin/gpuaas-slice-network-baseline.sh on j22u11 and j22u15;
- installed /etc/systemd/system/gpuaas-slice-network-baseline.service on both nodes;
- verified on both nodes:
  - iptables -t nat -S POSTROUTING contains -A POSTROUTING -s 10.100.0.0/24 -j MASQUERADE
  - tmfifo_net0 has 192.168.100.1/30
  - BF3 peer 192.168.100.2 responds to ping
Important service-model note:
The network baseline script works as a manual one-shot, but when it runs under the
persistent systemd service it must not call systemctl enable --now dnsmasq from inside
the script. The service-managed form should:
- run the script with systemd management disabled; and
- restart dnsmasq.service from ExecStartPost= (see the unit sketch after this list).
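A minimal sketch of that service-managed form, assuming the baseline script honors a
GPUAAS_SKIP_SYSTEMD=1 environment variable; the variable name and unit body below are
illustrative, not the shipped files:

cat <<'EOF' | sudo tee /etc/systemd/system/gpuaas-slice-network-baseline.service
[Unit]
Description=GPUaaS slice host network baseline (repair on boot)
Wants=network-online.target
After=network-online.target openvswitch-switch.service

[Service]
Type=oneshot
# Keep the script from managing systemd units while systemd is running it.
Environment=GPUAAS_SKIP_SYSTEMD=1
ExecStart=/usr/local/sbin/gpuaas-slice-network-baseline.sh
# Restart dnsmasq here instead of enabling it from inside the script.
ExecStartPost=/usr/bin/systemctl restart dnsmasq.service

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now gpuaas-slice-network-baseline.service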
Control-plane follow-up completed in code:
- node-agent slice topology discovery now reports live slice_network evidence;
- for the current private-nat mode, missing NAT is a hard blocker;
- missing IP forwarding is also surfaced as a blocker;
- tmfifo presence and IPv4 are exposed as validation evidence rather than being implicitly assumed from bootstrap.
Verification still required after this remediation:
- force or wait for node-agent topology rediscovery and confirm /admin/nodes no longer reports false-green slice networking;
- verify a slice VM on j22u15 actually reaches the internet with NAT in place;
- inspect the remaining cleanup_blocked slot separately, because NAT repair does not by itself clear stale slot lifecycle residue;
- add BM vs slice-VM benchmark validation so infra and product can confirm the networking/performance profile is correct, not merely bootable.
Automated Provisioning Parity Target
Manual repair is only a temporary way to discover the required host state. MAAS stock-image bootstrap, custom-image bootstrap, and any future node-agent repair task must converge to the same slice host baseline before the node is marked schedulable:
- MAAS machine has the gpuaas-profile-slice-vm intent tag;
- node-agent is enrolled and reporting the current host instance;
- VFIO/GPU passthrough preparation has completed;
- fabric VFs and IPoIB networking have completed;
- /etc/netplan/60-ipoib.yaml exists and an ibp* interface owns the site fabric address;
- a fabric route exists for the configured IPoIB subnet;
- ovsbr0 owns 10.100.0.1/24;
- dnsmasq has the GPU subnet configuration;
- libvirtd is active before launching slice VMs;
- node-agent topology evidence reports no hard slice-network blockers.
The service names must also be normalized. Early manual hosts used
gpuaas-slice-host-net-baseline; newer site-bootstrap hosts use
gpuaas-slice-network-baseline. Treat either name as historical evidence only;
the automated path should converge on one service name and the node-agent
readiness check should verify behavior, not just service existence.
2026-04-26 reimage comparison:
- known-good j22u15 matched the parity target after manual/bootstrap repair;
- custom-image j22u11 had node-agent, Netdata, VFIO, and OVS, but was missing the IPoIB netplan/fabric route and gpuaas-slice-network-baseline;
- bare-metal-intent j22u05 correctly lacked slice services because MAAS tagged it gpuaas-profile-baremetal.
Do not treat the j22u11 state as a successful slice-host image. It is a useful
partial-convergence signal showing that the custom image has the static runtime
pieces, while the site/profile bootstrap is still missing the fabric-network
baseline.
Automated Parity Check
Run the repo-owned parity checker after every MAAS deploy or reimage before trusting slice capacity in the scheduler:
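A local-only first pass might look like the following; the exact local-only flag set is
an assumption, but --host and --user match the API example below:

scripts/ops/gpuaas_slice_host_parity_check.sh \
--host 100.75.6.89 \
--user hpcadmin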
The checker emits one machine-readable line per target condition:
GPUAAS_SLICE_PARITY status=PASS check=node_agent detail=active/running
GPUAAS_SLICE_PARITY status=FAIL check=ipoib_netplan detail=command\ exited\ with\ 1 remediation=render\ /etc/netplan/60-ipoib.yaml\ from\ site/profile\ bootstrap
Use the API topology evidence check when a platform-admin token and GPUaaS node ID are available:
scripts/ops/gpuaas_slice_host_parity_check.sh \
--host 100.75.6.89 \
--user hpcadmin \
--api-base https://api.gpuaas.localhost \
--api-token-file /tmp/platform-admin.token \
--node-id b2d44b62-1efd-4aed-92c1-f56244b2b3d8
Expected result for a schedulable slice host:
- all hard local checks report status=PASS;
- historical_network_service is either PASS or at most WARN;
- api_topology reports completed with no candidate_summary.blockers;
- the final summary line has failures=0.
If only local checks are being run, api_topology reports WARN because the
script cannot prove node-agent topology evidence without an API token and node
ID. Treat that as acceptable for first-pass host diagnosis, but not as final
scheduler approval.
The checker intentionally validates behavior instead of only package presence:
gpuaas-slice-network-baseline, VFIO/KVM/IOMMU, /etc/netplan/60-ipoib.yaml,
ibp* fabric address and route, ovsbr0, dnsmasq, libvirt, NAT, IPv4
forwarding, and optional tmfifo_net0 management evidence.
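As an illustration of a behavior check rather than a package-presence check, a NAT
verification along these lines (sketch only; the subnet and output format follow the
examples above, not the checker's actual source):

# Verify the slice NAT rule exists as live behavior, not just that a unit file is installed.
if iptables -t nat -S POSTROUTING | grep -q -- '-s 10.100.0.0/24 -j MASQUERADE'; then
    echo "GPUAAS_SLICE_PARITY status=PASS check=nat_masquerade detail=rule-present"
else
    echo "GPUAAS_SLICE_PARITY status=FAIL check=nat_masquerade detail=rule-missing"
fi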
2026-04-24 Guest Telemetry and Benchmark Direction
The slice metrics gap should be closed without exposing guest dashboards directly.
Recommended product boundary:
- bare-metal allocations keep host Netdata and Open Netdata;
- slice allocations use guest telemetry collected through node-agent over the controlled 10.100.x.x management path;
- host Netdata remains operator-only for slice-node health and network diagnosis.
Direct guest Netdata proxying is not the preferred surface because it expands tenant VM network exposure and recreates the same proxy/session problems we already had to solve for other apps.
Reference design note:
Repo-owned benchmark capture harness:
scripts/ops/gpuaas_benchmark_capture.sh \
--host 10.177.36.197 \
--user ubuntu \
--label baremetal-j22u15 \
--output dist/benchmarks/baremetal-j22u15.json
scripts/ops/gpuaas_benchmark_capture.sh \
--host 10.100.0.10 \
--user gpuuser \
--ssh-key /var/lib/gpuaas/terminal/id_ed25519 \
--label slice-vm-j22u15-s01 \
--output dist/benchmarks/slice-vm-j22u15-s01.json
Optional probes:
- --fio-target /dev/nvme0n1 for a read-only storage capture on an approved benchmark target;
- --rdma-peer <host> and optional --rdma-device <name> for ib_write_bw output when infra provides the peer path.
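For illustration, a slice-VM capture that adds both optional probes to the invocation
above (the fio target and RDMA peer are placeholders and must be infra-approved first):

scripts/ops/gpuaas_benchmark_capture.sh \
--host 10.100.0.10 \
--user gpuuser \
--ssh-key /var/lib/gpuaas/terminal/id_ed25519 \
--label slice-vm-j22u15-s01 \
--output dist/benchmarks/slice-vm-j22u15-s01.json \
--fio-target /dev/nvme0n1 \
--rdma-peer <approved-peer-host> \
--rdma-device <name>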
Use the same harness for BM and VM so the comparison artifact reflects the same measurement method and command set.
Automation Artifacts
The manual flow is now split into two repo-owned automation entry points:
- scripts/maas/commissioning/50-gpuaas-slice-firmware-preflight.sh runs during MAAS commissioning. It installs Dell RACADM from the signed Dell OpenManage repository when needed, reads the slice-critical BIOS settings, validates KVM/IOMMU/GPU/BF3/SR-IOV visibility, reports NVMe whole-device inventory with stable by-id paths, child partitions, mountpoints, filesystem types, and wipefs signatures, and prints GPUAAS_SLICE_* result lines into the commissioning output. It is read-only by default. Set GPUAAS_SLICE_APPLY_BIOS=1 only after infra approval to stage BIOS changes and queue a RACADM power-cycle job.
- scripts/ops/gpuaas_slice_host_deploy_bootstrap.sh runs on the deployed Ubuntu host through cloud-init, MAAS deploy userdata, or an operator shell. It installs the persistent runtime packages, stages intel_iommu=on iommu=pt, loads VFIO modules, enables libvirt/OVMF, installs the OVS/NAT/RShim repair service from scripts/ops/gpuaas_slice_host_net_baseline.sh, and runs the deployed-host runtime prepare helper from scripts/ops/gpuaas_slice_host_runtime_prepare.sh.
Commissioning should decide whether the node is firmware-capable. Deployed-host bootstrap should make the installed OS slice-runtime capable. Slot approval must still wait for node-agent topology discovery and operator approval; neither script marks GPU/NVMe/IB bundles schedulable.
Initial MAAS upload path after infra review:
maas maas280 node-scripts create \
name=50-gpuaas-slice-firmware-preflight \
script@=scripts/maas/commissioning/50-gpuaas-slice-firmware-preflight.sh
Keep the first run in read-only mode. After the commissioning output is
reviewed for the target hardware class, set GPUAAS_SLICE_APPLY_BIOS=1 through
the MAAS script environment or a site-specific wrapper to let RACADM queue the
BIOS commit and power-cycle job.
Lab Host: j22u05
Observed on 2026-04-16:
- Host: j22u05
- Tailscale IP: 100.127.75.81
- OS: Ubuntu 24.04.4 LTS
- CPU: Intel Xeon Platinum 8570
- Platform blocker: VMX is disabled in firmware.
Evidence:
x86/cpu: VMX (outside TXT) disabled by BIOS
kvm_intel: VMX not enabled (by BIOS) in MSR_IA32_FEAT_CTL
Until firmware VT-x/VMX is enabled, /dev/kvm will remain missing and
node-agent slice VM preflight must fail fast before image clone or disk writes.
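A minimal sketch of that fail-fast preflight (illustrative shell only; the real check
lives in the node-agent):

# Stop before any image clone or disk write if KVM is unavailable.
if [ ! -e /dev/kvm ]; then
    echo "slice VM preflight failed: /dev/kvm missing (firmware VT-x/VMX likely disabled)" >&2
    exit 1
fi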
Post-reboot validation on 2026-04-16 after applying the OS-side changes:
cmdline=... intel_iommu=on iommu=pt
kvm=missing
iommu_groups=0
libvirtd=active
DMAR: IOMMU enabled
iommu: Default domain type: Passthrough (set via kernel command line)
x86/cpu: VMX (outside TXT) disabled by BIOS
Conclusion: the Linux boot configuration is now staged correctly, but the host is still not slice-VM capable. Infra must enable firmware virtualization features, at minimum VT-x/VMX and VT-d/IOMMU, through BIOS/BMC/Redfish/MAAS commissioning before this node can pass node-agent preflight.
Additional discovery on 2026-04-16:
- Hardware: Dell PowerEdge XE9680L.
- BIOS version: Dell 2.7.5.
- iDRAC/BMC address from local IPMI: 10.177.3.241.
- MAAS site: maas280.
- MAAS system ID: a4mkx3.
- GPUaaS node ID: 273666b0-485d-4bf5-a54f-67c625ad3544.
- Platform-control MAAS site record says default power credentials are configured, but the active Vault runtime did not contain kv/maas-sites/9995eff3-1967-4615-ac78-29c9202702b3/power/default.
- Platform-control MAAS CLI initially had only the older lab-maas-a profile. A maas280 profile was added with the operator-provided API key on 2026-04-16.
- No local Dell management tooling was installed on the host initially: racadm, syscfg, omreport, omconfig, and dsu were absent.
- MAAS power parameters for system a4mkx3 show power_type=ipmi, power_address=10.177.3.241, and power_user=maas.
- The MAAS IPMI credentials can query power state but do not authenticate to iDRAC Redfish. Both Basic auth and Redfish SessionService login returned 401.
RACADM validation on 2026-04-16:
- The prototype path openmanage/930/focal is stale for these Ubuntu 24.04 hosts. It returns 404, and the prototype script only simulated BIOS changes.
- The signed Dell OpenManage repository path that exposed RACADM on j22u05 was openmanage/11010/jammy.
- Installed packages: srvadmin-hapi, srvadmin-idracadm7, srvadmin-idracadm8, version 11.0.1.0.
- Local RACADM reads worked as root without Redfish credentials: racadm getversion, racadm get BIOS.ProcSettings, and racadm get BIOS.IntegratedDevices.
- Read-only BIOS values on j22u05: ProcVirtualization=Disabled, LogicalProc=Disabled, SriovGlobalEnable=Disabled.
Current blocker:
- Configure BlueField-3 virtual functions, OVS attachment, and per-vNIC QoS policy through infra bootstrap.
- Reboot and re-run the full slice validation below.
Minimal local RACADM install path used for lab validation:
curl -fsSL https://linux.dell.com/repo/pgp_pubkeys/0x1285491434D8786F.asc \
| sudo gpg --dearmor -o /usr/share/keyrings/dell-openmanage.gpg
echo 'deb [signed-by=/usr/share/keyrings/dell-openmanage.gpg] http://linux.dell.com/repo/community/openmanage/11010/jammy jammy main' \
| sudo tee /etc/apt/sources.list.d/linux.dell.com.sources.list
sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y srvadmin-idracadm8
sudo racadm get BIOS.ProcSettings
sudo racadm get BIOS.IntegratedDevices
Do not use the prototype's --allow-unauthenticated install path in the
commissioning script. Use Dell's signed repo key and a scoped APT keyring.
Minimum BIOS changes applied on j22u05 on 2026-04-16:
sudo racadm set BIOS.ProcSettings.ProcVirtualization Enabled
sudo racadm set BIOS.ProcSettings.LogicalProc Enabled
sudo racadm set BIOS.IntegratedDevices.SriovGlobalEnable Enabled
sudo racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW
RACADM job IDs:
- BIOS commit: JID_763788807801
- Power cycle: RID_763788809560
Post-power-cycle validation:
cmdline=... intel_iommu=on iommu=pt
kvm=present
iommu_groups=510
kvm_intel loaded
vfio, vfio_iommu_type1, vfio_pci loaded
ProcVirtualization=Enabled
LogicalProc=Enabled
SriovGlobalEnable=Enabled
DMAR: Intel(R) Virtualization Technology for Directed I/O
This clears the firmware/KVM blocker. It does not complete the networking bootstrap for GPU slices.
BlueField-3 / OVS Bootstrap Gap
Infra confirmed that slice networking needs BF3 virtual functions wired into OVS, with one vNIC per VM and optional QoS throttling per vNIC. This remains a separate prerequisite from BIOS enablement.
Initial j22u05 observation after BIOS enablement:
0000:bc:00.0 Ethernet controller: Mellanox MT43244 BlueField-3 integrated ConnectX-7
0000:bc:00.1 Ethernet controller: Mellanox MT43244 BlueField-3 integrated ConnectX-7
0000:bc:00.2 DMA controller: Mellanox MT43244 BlueField-3 SoC Management Interface
openvswitch-switch=inactive
The current prototype host does not create BF3 SR-IOV VFs either
(sriov_numvfs=0). It uses a node-local OVS bridge plus libvirt tap ports
(vnet*) for VM management networking, while BF3 is validated separately for
RShim/RDMA.
Latest prototype notes from maas280:/root/node-investigation/GPUaaS:
- The tested VM shape is now documented as 24 vCPUs and 64 GiB RAM for a one-GPU slice.
- Guest boot uses UEFI with secure boot disabled.
- The optional guest driver path installs NVIDIA server driver packages and RDMA/IB packages through cloud-init, then reboots the VM.
- Benchmark results show near-native GPU compute and HBM bandwidth, but lower IB and vLLM performance: FP64 -2.1%, HBM 0%, IB write bandwidth -43%, vLLM OPT-125M -17%.
- The updated prototype install.sh should not be copied directly into automation: its apt-get install command is split across lines without shell continuations. Use the repo-owned bootstrap scripts instead.
Latest local archive review from ~/Downloads/GPUaaS_0416:
- setup_host.sh now force-stops host NVIDIA/Fabric Manager/DCGM services, removes native NVIDIA modules, and binds both GPUs and IB devices to vfio-pci. Production bootstrap should do this only after the node is explicitly placed in slice mode, and should use persistent driverctl overrides generated from approved topology, not hardcoded PCI lists (see the driverctl sketch after this list).
- install.sh stages 1 GiB hugepages with hugepages=512, and app.py launches VMs with --memorybacking=hugepages=yes. Treat this as part of the tuned h200_1g_24c_64g runtime profile. It needs benchmark validation against non-hugepage mode and must be reported as a host capability before scheduling a profile that requires it.
- The prototype disables UEFI Secure Boot for slice VMs. Keep this in the VM runtime profile because guest NVIDIA driver installation can otherwise hang on MOK/Secure Boot prompts.
- The prototype adds host and guest network tuning: net.core.rmem_max, net.core.wmem_max, TCP rmem/wmem, backlog, ib_ipoib queue sizes, and unlimited memlock. These are promising for the observed IB/vLLM delta, but should be introduced as a named tuning profile and benchmarked before becoming the default.
- The prototype unmounts /dev/nvme*p1 share partitions before using parent NVMe devices as VM disks. Do not automate this behavior as-is. GPUaaS should instead require an infra-approved storage profile that declares which disks are tenant slice disks, blocks mounted child partitions, and requires explicit destructive wipe approval before slot reuse.
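A sketch of the persistent driverctl overrides mentioned above (the PCI address is a
placeholder; the real list must come from the approved slice topology profile, not a
hardcoded script):

# Persist a vfio-pci binding across reboots instead of hand-binding at boot.
sudo driverctl set-override 0000:1b:00.0 vfio-pci
# Review what is currently overridden on the host.
sudo driverctl list-overrides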
Follow-up benchmarks to run once infra provides approved raw NVMe slice disks:
- hugepages on/off for the 24 vCPU / 64 GiB profile;
- guest driver profile nvidia-driver-570-server versus any future pre-baked platform image catalog default;
- host and guest network sysctl tuning on/off;
- IB bandwidth before/after IPoIB queue and memlock tuning;
- vLLM throughput with 12, 24, and any proposed higher vCPU profile.
Low-risk lab baseline applied on j22u05:
openvswitch-switch=active
dnsmasq=active
rshim=active
gpuaas-slice-host-net-baseline=active
ovsbr0=10.100.0.1/24
tmfifo_net0=192.168.100.1/30
NAT: POSTROUTING -s 10.100.0.0/24 -j MASQUERADE
BF3 RShim peer 192.168.100.2 ping succeeds
BF3 PF 0000:bc:00.0: sriov_totalvfs=16, sriov_numvfs=0
BF3 PF 0000:bc:00.1: sriov_totalvfs=16, sriov_numvfs=0
Live slice repair applied on j22u05 on 2026-04-17:
- Replaced /opt/gpuaas/node-agent/gpuaas-node-agent with a local build that disables Secure Boot for new slice VMs and installs the guest NVIDIA/RDMA runtime during cloud-init. The previous binary was backed up under /opt/gpuaas/node-agent/gpuaas-node-agent.backup-*.
- Repaired allocation 158ea147-0115-407f-bbd5-1850c32b9517 in place after confirming PCI passthrough was correct but guest drivers were missing.
- Installed guest packages: nvidia-driver-570-server, nvidia-utils-570-server, rdma-core, ibverbs-utils, infiniband-diags, linux-headers-$(uname -r), and linux-modules-extra-$(uname -r).
- The original VM had Secure Boot enabled, so modprobe nvidia failed with "Key was rejected by service". The libvirt domain was redefined to use /usr/share/OVMF/OVMF_CODE_4M.fd and /usr/share/OVMF/OVMF_VARS_4M.fd with secure='no'; the old NVRAM file was backed up before reset.
- Final validation inside the slice showed nvidia-smi reporting an NVIDIA H200 with driver 570.211.01 and ibv_devinfo -l reporting one HCA (ibp6s0).
The reusable network baseline script is scripts/ops/gpuaas_slice_host_net_baseline.sh.
For full deployed-host setup, use scripts/ops/gpuaas_slice_host_deploy_bootstrap.sh.
Set GPUAAS_SLICE_REBOOT_IF_REQUIRED=1 only when the caller is allowed to
reboot the node automatically after GRUB or initramfs changes.
For the current tuned prototype profile, infra can explicitly request 512x 1 GiB hugepages during deployed-host bootstrap:
For host network buffer/backlog tuning from the prototype, enable the performance profile in the network baseline:
Both settings are opt-in because they change host-wide behavior and should be paired with benchmark evidence for the target node class.
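A sketch of what the opt-in network performance profile could stage, based on the sysctl
names from the prototype notes above (the file name and every value below are
placeholders, not benchmarked defaults):

cat <<'EOF' | sudo tee /etc/sysctl.d/90-gpuaas-slice-net-perf.conf
# Placeholder values; benchmark before adopting for a node class.
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
net.ipv4.tcp_rmem = 4096 87380 268435456
net.ipv4.tcp_wmem = 4096 65536 268435456
net.core.netdev_max_backlog = 250000
EOF
sudo sysctl --system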
On j22u05, the network baseline script and service are installed as:
/usr/local/sbin/gpuaas-slice-host-net-baseline
/etc/systemd/system/gpuaas-slice-host-net-baseline.service
The service repairs the OVS bridge IP, dnsmasq config, NAT rule, and
tmfifo_net0 address after reboot. It intentionally does not create BF3 VFs or
configure QoS.
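For context, those post-reboot repairs amount to something like the following
(illustrative commands only; the authoritative logic is in
scripts/ops/gpuaas_slice_host_net_baseline.sh):

# Re-assert the OVS bridge and RShim tmfifo addresses.
ip addr replace 10.100.0.1/24 dev ovsbr0
ip addr replace 192.168.100.1/30 dev tmfifo_net0
# Add the slice NAT rule only if it is not already present.
iptables -t nat -C POSTROUTING -s 10.100.0.0/24 -j MASQUERADE 2>/dev/null \
|| iptables -t nat -A POSTROUTING -s 10.100.0.0/24 -j MASQUERADE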
Infra bootstrap still needs to decide and implement:
- Enable or verify BF3 embedded switch / representor mode for VM networking.
- Create the required SR-IOV VFs for the slice capacity profile.
- Attach VFs or representors to the node-local OVS bridge.
- Apply QoS shaping policy per VM vNIC where required.
- Reconcile the resulting MAC/VF/OVS-port map into control-plane slot inventory before marking slots schedulable.
The dry-run/apply helper for this approval path is:
IDRAC_HOST=10.177.3.241 \
IDRAC_USER=<redacted> \
IDRAC_PASSWORD=<redacted> \
scripts/ops/dell_redfish_slice_firmware.py
Lab Node-Agent Refresh
Manual lab change on j22u05 on 2026-04-17:
- Built cmd/node-agent locally from commit 80d8a259.
- Copied the binary to /tmp/gpuaas-node-agent-80d8a259.
- Backed up the previous running binary under /opt/gpuaas/node-agent/gpuaas-node-agent.backup-<UTC timestamp>.
- Installed the new binary at /opt/gpuaas/node-agent/gpuaas-node-agent.
- Restarted gpuaas-node-agent.
Validation:
previous node-agent: version=5bb47e4 built_at=2026-04-16T05:55:02Z
current node-agent: version=80d8a259 built_at=2026-04-17T11:21:12Z
service state: active
Reason: 80d8a259 adds node-agent-managed dnsmasq host reservations for slice
VM MAC/private-IP pairs and keeps the existing slice VM lifecycle task support.
This was applied manually only to the lab node so the change can be validated
before it becomes part of the normal node-agent lifecycle or provisioning flow.
Additional lab refreshes on 2026-04-17:
- 7123a111 was installed manually after the first provisioning failure showed a cleanup path writing the invalid slot health state unhealthy. The schema accepts unknown, healthy, degraded, and failed; failed cleanup now marks the slot health state as failed.
- c45003f7 was installed manually after storage validation found that an approved slot disk can have mounted child partitions even when the parent device itself is not reported as mounted by findmnt --source <disk>. Node-agent now checks lsblk child mountpoints before clone or wipe (see the sketch after this list).
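A minimal sketch of that child-mountpoint check (the device path is an example; the real
check is in node-agent):

# findmnt --source on the parent device misses mounted children; lsblk lists them.
disk=/dev/nvme0n1
if lsblk -nro MOUNTPOINT "$disk" | grep -q '[^[:space:]]'; then
    echo "refusing clone/wipe: $disk or one of its child partitions is mounted" >&2
fi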
Current validation:
These refreshes were fast-path lab deployments only. They must still move through the normal platform-control deploy path before they are considered part of the managed environment.
First Slice Runtime Validation Attempt
Manual lab validation on j22u05 on 2026-04-17 proved the control-plane and
node-agent task path far enough to expose a real host topology blocker:
- Public allocation API created a one-GPU h200-sxm-slice allocation.
- The scheduler selected an approved slot on node 273666b0-485d-4bf5-a54f-67c625ad3544.
- The node-agent claimed and executed the slice.vm_provision task.
- The libvirt/cloud-init path reached VM launch but failed during runtime validation and cleanup.
Critical storage finding: the NVMe device assigned to the selected slot was host-mounted rather than an approved raw slice disk (see the operational impact notes below). Other local NVMe devices also looked like host data disks rather than approved raw slice disks:
/dev/nvme2n1p1 mounted at /share3
/dev/nvme3n1p1 mounted at /share4
/dev/nvme5n1p1 mounted at /share5
/dev/nvme6n1p1 mounted at /share6
/dev/nvme7n1p1 mounted at /share7
/dev/nvme8n1p1 mounted at /share8
/dev/nvme1n1 had existing partitions and no mountpoint at inspection time
Operational impact:
- j22u05 must not be treated as a schedulable slice target until infra provides or approves raw unmounted NVMe devices for slice use.
- All j22u05 control-plane slots were manually moved to cleanup_blocked/failed so the scheduler will not reuse them.
- The failed test allocation was manually marked failed with a failure reason indicating that the slot NVMe was host-mounted and requires infra remap or cleanup approval.
- During cleanup investigation, wipefs --all --force /dev/nvme0n1 erased disk signatures before the mounted child partition was identified. Infra should review /share2 before rebooting or relying on that mount.
Required infra follow-up before re-enabling this host for slice runtime tests:
- Decide which NVMe devices are tenant-slice disks versus host share disks.
- Remove host mounts from slice-owned disks, or update slot inventory to use only dedicated raw devices.
- Approve destructive wipe or reimage for any disk assigned to a tenant slice.
- Re-run node-agent topology discovery and approve slots from the candidate map only after mounted child partitions and unexpected filesystems are absent.
- Keep mounted child partition detection in node-agent and commissioning checks; parent-device mount checks are not sufficient.
Apply only after infra approval:
IDRAC_HOST=10.177.3.241 \
IDRAC_USER=<redacted> \
IDRAC_PASSWORD=<redacted> \
scripts/ops/dell_redfish_slice_firmware.py --apply
The helper is dry-run by default. It uses Redfish session authentication, discovers current BIOS attributes, prints only non-secret virtualization-related values, and plans these Dell BIOS changes unless overridden:
- ProcVirtualization=Enabled
- SriovGlobalEnable=Enabled
If this BIOS revision uses different attribute names, run the helper with
explicit --attr NAME=VALUE pairs after confirming the names from Redfish.
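An example of that override form, assuming --attr is repeatable (attribute names must be
confirmed from Redfish for the installed BIOS revision first):

IDRAC_HOST=10.177.3.241 \
IDRAC_USER=<redacted> \
IDRAC_PASSWORD=<redacted> \
scripts/ops/dell_redfish_slice_firmware.py --apply \
--attr ProcVirtualization=Enabled \
--attr SriovGlobalEnable=Enabled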
Manual OS-Side Changes Applied
The following changes were applied manually on j22u05:
sudo cp /etc/default/grub /etc/default/grub.gpuaas-backup-<timestamp>
sudo sed -i -E 's|^GRUB_CMDLINE_LINUX=.*|GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"|' /etc/default/grub
printf "vfio\nvfio_iommu_type1\nvfio_pci\n" | sudo tee /etc/modules-load.d/gpuaas-vfio.conf
sudo update-grub
sudo systemctl enable --now libvirtd
virtqemud.service was not present on this host; libvirtd is the active
libvirt daemon.
Required Reboot
The GRUB kernel arguments require a reboot before validation:
After the host returns:
cat /proc/cmdline
test -e /dev/kvm && echo kvm=present || echo kvm=missing
find /sys/kernel/iommu_groups -mindepth 1 -maxdepth 1 | wc -l
lsmod | egrep '^(kvm|kvm_intel|vfio|vfio_pci|vfio_iommu_type1)'
sudo dmesg | egrep -i 'kvm|iommu|dmar|vt-d|vmx|disabled by bios' | tail -120
Expected healthy state:
- /proc/cmdline contains intel_iommu=on iommu=pt.
- /dev/kvm exists.
- IOMMU groups are non-empty.
- kvm_intel and VFIO modules are loaded.
- dmesg no longer reports VMX disabled by BIOS.
If /dev/kvm is still missing and dmesg reports VMX disabled by BIOS, the
next action is firmware/BMC/MAAS-side enablement, not another node-agent change.
Infra Bootstrap Target
Move this into MAAS commissioning or the node bootstrap pipeline:
- Enable or validate firmware virtualization: VT-x/VMX and VT-d/IOMMU.
- Apply kernel args: intel_iommu=on iommu=pt for Intel hosts; AMD hosts need the equivalent IOMMU policy.
- Install slice packages: qemu-kvm, libvirt-daemon-system, libvirt-clients, virtinst, openvswitch-switch, cloud-image-utils, ovmf, rdma-core, and optional host tools such as rshim, infiniband-diags, ibverbs-utils, and driverctl (see the example after this list).
- Configure VFIO modules to load at boot.
- Reboot if kernel args or firmware state changed.
- Validate /dev/kvm, IOMMU groups, VFIO, OVS, libvirt, RDMA, and image cache.
- Only approve node_resource_slots after validation passes.
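For reference, the package-install step in the list above might look like this (sketch
only; the repo-owned deploy bootstrap script is the source of truth):

sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y \
qemu-kvm libvirt-daemon-system libvirt-clients virtinst \
openvswitch-switch cloud-image-utils ovmf rdma-core \
rshim infiniband-diags ibverbs-utils driverctl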
Current ownership split:
- MAAS commissioning script: firmware read/apply, RACADM availability, KVM/IOMMU live-effect validation, H200 GPU inventory, BF3/SR-IOV visibility inventory, and read-only NVMe identity/mount/signature evidence.
- Deployed-host bootstrap script: OS packages, GRUB kernel args, VFIO modules, libvirt, OVS private NAT, RShim reachability repair service, host GPU service quiescing for slice mode, libvirt traversal permission repair, and stale GPUaaS dnsmasq reservation cleanup.
- Explicit slice-mode storage transition: destructive unmount, fstab disable, wipefs, blkdiscard, and manifest recording for infra-approved raw slice devices.
- Infra BF3 profile: embedded switch mode, virtual functions, representor or VF attachment model, and per-vNIC QoS.
- Persistent VFIO binding: apply only after the node is intentionally in slice mode and the approved topology profile is known.
- Node-agent future work: typed topology discovery, slot candidate reporting, drift reconciliation, and VM lifecycle execution.
j22u05 Slice Storage Transition
Manual lab change on j22u05 on 2026-04-17:
- Installed scripts/ops/gpuaas_slice_storage_transition.sh as /usr/local/sbin/gpuaas-slice-storage-transition.
- Converted the eight approved tenant slice NVMe devices from host-share mode to slice-owned raw devices:
sudo /usr/local/sbin/gpuaas-slice-storage-transition --apply \
--devices /dev/nvme0n1,/dev/nvme1n1,/dev/nvme2n1,/dev/nvme3n1,/dev/nvme5n1,/dev/nvme6n1,/dev/nvme7n1,/dev/nvme8n1
The script:
- rejected the OS disk (/dev/nvme4n1) as a slice candidate;
- backed up /etc/fstab to /etc/fstab.gpuaas-slice-storage-backup-20260417152532;
- unmounted /share2 through /share8;
- commented all /share1 through /share8 fstab entries with the gpuaas-slice-storage-disabled marker;
- ran wipefs --all --force on old child partitions and parent devices;
- ran blkdiscard -f on each approved slice NVMe;
- wrote the transition manifest to /var/log/gpuaas/slice-storage-transition-20260417152532.txt.
Post-change host validation:
/dev/nvme0n1 raw, unmounted
/dev/nvme1n1 raw, unmounted
/dev/nvme2n1 raw, unmounted
/dev/nvme3n1 raw, unmounted
/dev/nvme5n1 raw, unmounted
/dev/nvme6n1 raw, unmounted
/dev/nvme7n1 raw, unmounted
/dev/nvme8n1 raw, unmounted
/dev/nvme4n1 remains the mounted OS disk
Control-plane slot approval was updated through the admin
/resource-slots API:
- slot status: available;
- health state: healthy;
- one-GPU slice shape: 24 vCPU and 64 GiB memory;
- capacity_metadata.storage_ownership=slice;
- capacity_metadata.storage_mode=slice;
- capacity_metadata.destructive_wipe_policy=blkdiscard;
- capacity_metadata.storage_transition_manifest points to the manifest above.
Fast-path node-agent refresh:
- The initial post-transition topology discovery exposed that /sys/bus/pci/devices entries are symlinks on this host, so the node-agent skipped GPUs and fabric devices.
- The discovery code was patched to accept PCI sysfs symlink entries.
- The advisory NVMe candidate filter was patched to ignore non-standard names such as nvme4c4n1 and mounted devices such as the OS disk.
- A fast-path lab binary was installed at /opt/gpuaas/node-agent/gpuaas-node-agent and the previous binary was backed up under /opt/gpuaas/node-agent/gpuaas-node-agent.backup-*.
Reboot follow-up:
- A reboot proved that /dev/nvmeXn1 kernel names are not stable on j22u05. The BOSS OS disk moved from /dev/nvme4n1 before reboot to /dev/nvme7n1 after reboot.
- Approved slice slots must therefore use stable disk identity, not volatile kernel paths. The node-agent was patched to report /dev/disk/by-id/nvme-eui.* paths and keep the kernel path only as advisory metadata (see the resolution sketch after this list).
- Slot approval was rewritten to use storage_identity_kind=nvme_wwn_by_id.
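A sketch of mapping a volatile kernel path to its stable by-id alias (the kernel device
below is an example; node-agent does the equivalent internally):

# The by-id symlinks point at the kernel device, so walk the directory and compare targets.
kernel_dev=/dev/nvme0n1
for link in /dev/disk/by-id/nvme-eui.*; do
    if [ "$(readlink -f "$link")" = "$kernel_dev" ]; then
        echo "stable identity: $link"
    fi
done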
Final topology discovery result after the by-id patch:
gpu_devices=8
fabric_devices=12
nvme_devices=9
mounted_nvme_devices=1
candidate_slots=8
candidate slot NVMe map:
0 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a103331
1 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a10dcab
2 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a103489
3 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a10dc41
4 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a10dc98
5 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a10e101
6 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a10e115
7 -> /dev/disk/by-id/nvme-eui.000000000000000100a075244a103483
This host is now in slice mode. Do not remount /share* or use these NVMe
devices for baremetal/share workloads while any slice slots are available,
reserved, provisioning, active, releasing, or cleanup.
j22u05 Slice Runtime Preparation
Manual lab changes applied on 2026-04-17 to get the first VM to boot:
- Installed scripts/ops/gpuaas_slice_host_runtime_prepare.sh as the proposed host runtime preparation helper.
- Stopped nvidia-persistenced.service and nvidia-fabricmanager.service. These services held /dev/nvidia*, NVSwitch, and NVLink devices and blocked libvirt from detaching 0000:1b:00.0 for VFIO passthrough.
- Set /var/lib/libvirt to 0755. It was 0700 root:root, which prevented the libvirt-qemu process from reading its generated domain-*/master-key.aes.
- Removed stale /etc/dnsmasq.d/gpuaas-gpuaas-slice-*.conf reservations left by failed lab attempts and restarted dnsmasq.
- Patched node-agent VM launch to pass --tpm=none because virt-install auto-added a vTPM for Ubuntu 24.04 and this host's swtpm_setup failed.
- Increased the deployed provisioning node-task TTL to 20m and the worker Temporal activity window to 30m for slice VM import/readiness.
Current active smoke evidence:
allocation_id=23bd713b-6073-4ef7-8a51-2897e902365d
status=active
vm=gpuaas-slice-23bd713b60734ef78a512897e902365d
private_ip=10.100.0.10
ssh_port=22
slot=0
shape=24 vCPU / 64 GiB
node-agent readiness: ssh_ready=true
These runtime steps must move into MAAS/deployed-host bootstrap before another node is enabled for slices. In particular, slice-mode hosts should not run host GPU persistence/fabric-manager services unless the approved VFIO/BF profile says they are safe for the passthrough model.
j22u05 Slice Terminal Follow-Up
Manual lab changes applied on 2026-04-17 after the first active slice showed that browser console worked for baremetal but not for slices:
- Root cause: terminal.open always opened a local baremetal UNIX user with user.Lookup(username). Slice users live inside the VM, so the node-agent returned "lookup terminal user: unknown user ..." for slice allocations.
- Platform-control terminal-gateway was fast-deployed with terminal task payloads that include capacity_shape, target_host, and target_port.
- The j22u05 node-agent was fast-deployed with a slice terminal backend. For capacity_shape=gpu_slice, node-agent starts an SSH PTY into the guest using /var/lib/gpuaas/terminal/id_ed25519.
- Slice VM provisioning now creates the node-local terminal relay key if missing and injects its public key into the allocation user's cloud-init ssh_authorized_keys.
- Release cleanup now retries virsh undefine <vm> --nvram when the first undefine fails, because UEFI slice VMs can otherwise remain as shut-off domains and keep raw NVMe disks marked in-use by libvirt.
Live validation evidence:
allocation_id=158ea147-0115-407f-bbd5-1850c32b9517
status=active
vm=gpuaas-slice-158ea1470115407fbbd51850c32b9517
private_ip=10.100.0.10
terminal task params: capacity_shape=gpu_slice target_host=10.100.0.10 target_port=22
node-agent terminal command: ssh -tt -i /var/lib/gpuaas/terminal/id_ed25519 ... u_adda9308def04a2f@10.100.0.10
terminal websocket smoke: session_ready plus command-output marker returned
direct relay-key SSH smoke: slice-terminal-ok, guest hostname returned
Additional lab cleanup performed:
- Released the pre-terminal-key slice allocation 23bd713b-6073-4ef7-8a51-2897e902365d.
- Manually undefined its stale libvirt domain with virsh undefine --nvram after observing that the old release path left the shut-off domain behind.
- Manually reset slots 0 and 1 from cleanup_blocked to available after failed validation retries that did not create running VMs.
- Left the current validation slice 158ea147-0115-407f-bbd5-1850c32b9517 active for UI/browser verification.
Known follow-up before concurrent slices:
- Current slot metadata maps the same fabric_device (0000:1a:00.0) into multiple slice slots. A second concurrent slice failed while the first slice held that device. Platform scheduling now treats non-empty duplicate fabric devices as exclusive shared constraints, so j22u05 will not place another slice on a slot with the same fabric device while the first claim is reserved/provisioning/active/releasing. We still need infra to confirm the final BF/VF model for safe multi-slice fabric sharing.