MAAS H200 Host Image Pipeline Runbook¶
This runbook builds a curated Ubuntu 24.04 H200 host image on platform-control and uploads it to MAAS. The goal is to remove repeated first-boot package installation from the provisioning critical path while keeping per-host hardware configuration in site/bootstrap logic.
Default Artifact¶
- MAAS image name: `custom/gpuaas-h200-host-ubuntu2404`
- MAAS visible boot-resource name after upload: `gpuaas-h200-host-ubuntu2404`
- MAAS title: `GPUaaS H200 Host Ubuntu 24.04`
- Architecture: `amd64/generic`
- File type: `tgz`
- Base image: `ubuntu/noble`
- Build host: `platform-control`
- Build output: `/var/lib/gpuaas/maas-host-images/gpuaas-h200-host-ubuntu2404.tar.gz`
The image is intended for H200 bare-metal hosts and slice-capable H200 hosts. It bakes heavy/static prerequisites and leaves environment-specific setup for cloud-init/site bootstrap.
What Goes In The Image¶
The default package set includes:
- cloud-init and base diagnostics,
- Docker and the compose plugin,
- NVIDIA host driver utilities,
- NVIDIA container runtime configuration for Docker when `nvidia-ctk` is available,
- libvirt, qemu, OVMF, and cloud image tools for slice guests,
- Open vSwitch, dnsmasq, iptables/nftables for slice host networking,
- RDMA/IB tools and perf test utilities,
- the H200 site-bootstrap base package set (`environment-modules`, `ipmitool`, `lldpd`, `nfs-common`, `pdsh`, and related diagnostics),
- optional Mellanox DOCA/OFED packages (`doca-ofed`, `mlnx-fw-updater`) when explicitly enabled from the configured DOCA repository,
- Netdata with bounded dbengine retention and `nvidia-smi` collection enabled,
- GPUaaS slice passthrough kernel args (`intel_iommu=on iommu=pt`) staged in both `/etc/default/grub` and a grub.d drop-in so the first deployed boot has IOMMU groups without a second reboot,
- a bounded `systemd-networkd-wait-online` default so first boot does not wait indefinitely on unused secondary/fabric ports,
- optional NVIDIA container toolkit, DCGM, fabric manager, driverctl, and rshim when available from the configured package repositories.
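The kernel-arg staging in the list above can be sketched as a grub drop-in write. This is a sketch only: the drop-in filename and the `GRUB_DIR` override are illustrative assumptions (the real builder also edits `/etc/default/grub` directly, per the list above):

```sh
# Sketch of staging slice passthrough kernel args as a grub drop-in.
# GRUB_DIR and the 90-gpuaas-slice.cfg name are assumptions for illustration;
# the real target on Ubuntu would be /etc/default/grub.d.
set -eu
GRUB_DIR="${GRUB_DIR:-$(mktemp -d)}"
KARGS="${GPUAAS_MAAS_HOST_IMAGE_SLICE_KERNEL_ARGS:-intel_iommu=on iommu=pt}"
mkdir -p "$GRUB_DIR"
cat > "$GRUB_DIR/90-gpuaas-slice.cfg" <<EOF
GRUB_CMDLINE_LINUX_DEFAULT="\$GRUB_CMDLINE_LINUX_DEFAULT $KARGS"
EOF
echo "staged: $GRUB_DIR/90-gpuaas-slice.cfg"
```

Because the args land in grub config rather than only the live `/proc/cmdline`, the first deployed boot already has IOMMU groups, which is the point of baking them.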
The launcher also copies the current `infra/env/maas/site-bootstrap/h200-slice-vm/bootstrap.sh` into the image at `/usr/local/share/gpuaas/site-bootstrap/h200-slice-vm/bootstrap.sh`.

That bootstrap should still run at first boot for hardware-specific operations such as IOMMU/VFIO configuration, OVS bridge state, SR-IOV VF creation, and runtime validation. For the prebaked image path, run it with package install disabled where possible.
Run From A Developer Workstation¶
Dry-run the remote plan:

```sh
GPUAAS_MAAS_HOST_IMAGE_DRY_RUN=1 \
  scripts/ops/gpuaas_maas_h200_host_image_platform_control_pipeline.sh
```

Build on platform-control and upload to MAAS (build and upload are both enabled by default):

```sh
scripts/ops/gpuaas_maas_h200_host_image_platform_control_pipeline.sh
```

Build only, without uploading:

```sh
GPUAAS_MAAS_HOST_IMAGE_UPLOAD=0 \
  scripts/ops/gpuaas_maas_h200_host_image_platform_control_pipeline.sh
```

Reuse an already-built artifact and only upload it:

```sh
GPUAAS_MAAS_HOST_IMAGE_SKIP_BUILD=1 \
  scripts/ops/gpuaas_maas_h200_host_image_platform_control_pipeline.sh
```
Useful Overrides¶
- `PLATFORM_CONTROL_SSH_HOST`: platform-control SSH target. Defaults to `hpcadmin@100.90.157.34`.
- `GPUAAS_MAAS_HOST_IMAGE_BASE_URL`: source Ubuntu cloud image URL.
- `GPUAAS_MAAS_HOST_IMAGE_SIZE`: expanded image size. Defaults to `80G`.
- `GPUAAS_MAAS_HOST_IMAGE_INSTALL_PACKAGES`: comma-separated required package list installed during image customization.
- `GPUAAS_MAAS_HOST_IMAGE_OPTIONAL_PACKAGES`: comma-separated best-effort package list. Missing packages are logged but do not fail the build.
- `GPUAAS_MAAS_HOST_IMAGE_ENABLE_NVIDIA_REPOS=0`: skip adding NVIDIA package repositories during customization.
- `GPUAAS_MAAS_HOST_IMAGE_ENABLE_DOCA_REPOS=1`: add the Mellanox DOCA repository and try best-effort `doca-ofed`/`mlnx-fw-updater` installation during customization. This is disabled by default because OFED DKMS packages can make MAAS curtin fail kernel/header post-install during deployment; prefer the site-bootstrap bundle for node-specific OFED setup until the image artifact is validated.
- `GPUAAS_MAAS_HOST_IMAGE_DOCA_KEY_URL`: override the DOCA repository signing key URL.
- `GPUAAS_MAAS_HOST_IMAGE_DOCA_REPO_URL`: override the DOCA Ubuntu 24.04 repository URL.
- `GPUAAS_MAAS_HOST_IMAGE_ENABLE_NETDATA=0`: skip writing Netdata config.
- `GPUAAS_MAAS_HOST_IMAGE_NETDATA_RETENTION_MB`: Netdata dbengine disk budget. Defaults to `256`.
- `GPUAAS_MAAS_HOST_IMAGE_NETDATA_BIND`: Netdata bind address. Defaults to `127.0.0.1:19998`. Site bootstrap adds the nginx telemetry edge on `:19999`.
- `GPUAAS_MAAS_HOST_IMAGE_SLICE_KERNEL_ARGS`: kernel args baked into the image for slice passthrough. Defaults to `intel_iommu=on iommu=pt`.
- `GPUAAS_MAAS_HOST_IMAGE_MAAS_PROFILE`: MAAS CLI profile. Defaults to `maas280`.
- `GPUAAS_MAAS_HOST_IMAGE_MAAS_NAME`: MAAS boot resource name.
- `GPUAAS_MAAS_HOST_IMAGE_MAAS_FILETYPE`: MAAS file type. Defaults to `tgz`. Use a `dd*` file type only when MAAS should write a full disk image and bypass its normal target storage layout.
- `GPUAAS_MAAS_HOST_IMAGE_MAAS_BASE_IMAGE`: MAAS base image. Defaults to `ubuntu/noble`.
- `GPUAAS_MAAS_HOST_IMAGE_UPLOAD_STAGING_DIR`: local path on platform-control used for the MAAS CLI upload. Defaults to `$HOME/gpuaas-maas-host-images` because snap-confined MAAS CLI installs may not be able to read `/var/lib/gpuaas` or hidden home paths.
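As a usage sketch, overrides compose as plain environment variables on the launcher invocation. The specific values here (`100G`, `512`) are illustrative examples, not recommendations:

```sh
# Illustrative override composition. The echo placeholder only demonstrates
# that the variables reach the child process; a real run would invoke
# scripts/ops/gpuaas_maas_h200_host_image_platform_control_pipeline.sh instead.
env \
  GPUAAS_MAAS_HOST_IMAGE_SIZE=100G \
  GPUAAS_MAAS_HOST_IMAGE_NETDATA_RETENTION_MB=512 \
  sh -c 'echo "size=$GPUAAS_MAAS_HOST_IMAGE_SIZE retention=$GPUAAS_MAAS_HOST_IMAGE_NETDATA_RETENTION_MB"'
```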
Verification¶
List uploaded custom images:

```sh
maas maas280 boot-resources read | jq -r \
  '.[] | select(.name=="gpuaas-h200-host-ubuntu2404" or .name=="custom/gpuaas-h200-host-ubuntu2404") | [.name,.title,.architecture,.type] | @tsv'
```

Check the built artifact on platform-control:

```sh
ssh hpcadmin@100.90.157.34 \
  'ls -lh /var/lib/gpuaas/maas-host-images/gpuaas-h200-host-ubuntu2404.tar.gz && sha256sum /var/lib/gpuaas/maas-host-images/gpuaas-h200-host-ubuntu2404.tar.gz'
```
The current known-good artifact is recorded in the MAAS boot-resource set after each upload.
Selecting The Image For A Profile¶
The GPUaaS MAAS deploy path passes the profile `distro_series` into MAAS
`machine deploy`. For this custom image, set the target MAAS site profile
`distro_series` to the visible boot-resource name. The execution client
normalizes this value to the MAAS deploy namespace (`custom/<name>`) when
submitting the deploy request.
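The normalization the execution client is described as performing can be sketched as below. The helper name and the pass-through rule for already-namespaced values are assumptions for illustration, not the client's actual code:

```sh
# Hypothetical sketch of distro_series -> MAAS deploy-namespace normalization.
# Assumption: values already containing '/' (e.g. ubuntu/noble, custom/<name>)
# pass through unchanged, and bare names are treated as custom images.
normalize_distro_series() {
  case "$1" in
    */*) printf '%s\n' "$1" ;;
    *)   printf 'custom/%s\n' "$1" ;;
  esac
}

normalize_distro_series gpuaas-h200-host-ubuntu2404
```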
Use the admin profile API or UI to patch the selected profile:
```sh
curl -fsS -X PATCH \
  "${API_BASE_URL}/api/v1/admin/maas-sites/${SITE_ID}/profiles/${PROFILE_ID}" \
  -H "Authorization: Bearer ${ADMIN_TOKEN}" \
  -H "Content-Type: application/json" \
  -H "X-Idempotency-Key: maas-profile-image-$(date +%s)" \
  --data '{"distro_series":"gpuaas-h200-host-ubuntu2404"}'
```
Switch the profile back to the stock image with the same PATCH request, setting `distro_series` back to the stock series (for example `ubuntu/noble`).
If MAAS fails during curtin installing-kernel with linux-generic or DKMS
errors, switch the profile back to ubuntu/noble and keep the
site_bootstrap_bundle_ref enabled. That path deploys a stock base image first
and applies GPUaaS host customizations after MAAS has completed installation.
After deploying a host with the image, validate baked runtime prerequisites:
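A minimal sketch of such checks, derived from the package list in "What Goes In The Image" above; this is not the pipeline's own validator, and the command set should be adjusted per profile:

```sh
# Sketch: confirm baked prerequisites resolve on the deployed host.
# The command list mirrors the image package set described in this runbook.
for cmd in cloud-init docker virsh ovs-vsctl dnsmasq nvidia-smi; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd=present"
  else
    echo "$cmd=MISSING"
  fi
done
```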
Also confirm the user-facing slice host checkpoints separately:
```sh
ssh ubuntu@<host> 'systemctl is-active gpuaas-node-agent && docker compose version'
ssh ubuntu@<host> 'cat /proc/cmdline | tr " " "\n" | grep -E "^(intel_iommu=on|iommu=pt)$"'
ssh ubuntu@<host> 'sudo cat /var/lib/gpuaas/site-bootstrap/h200-slice-vm.validation 2>/dev/null || true'
```
For a bare-metal profile, host `nvidia-smi` should still succeed and report
the H200 devices. For a slice profile, host NVML can be unavailable after
the GPUs are bound for passthrough. Do not use host `nvidia-smi` success as the
slice-readiness gate.
For slice-host profiles, `Deployed` status in MAAS and node-agent enrollment are
not enough. Validate that the host converged to the same runtime baseline as the
manual slice host before marking it schedulable:
```sh
ssh ubuntu@<host> '
set -eu
for svc in \
  gpuaas-node-agent \
  gpuaas-slice-fabric-vfs \
  gpuaas-slice-vfio-devices \
  gpuaas-slice-network-baseline \
  gpuaas-slice-ovs-bridge \
  libvirtd \
  dnsmasq
do
  systemctl is-active --quiet "$svc" && echo "$svc=active" || echo "$svc=NOT_ACTIVE"
done
test -f /etc/netplan/60-ipoib.yaml && echo ipoib_netplan=present
ip -br addr show | grep -E "^ibp.*192\\.168\\." || true
ip route | grep -E "^192\\.168\\.0\\.0/16 .*dev ibp|^192\\.168\\.[0-9]+\\.0/24 .*dev ibp" || true
ip -br addr show ovsbr0 | grep "10.100.0.1/24"
test -f /etc/dnsmasq.d/gpu_subnet.conf && echo dnsmasq_gpu_subnet=present
'
```
Expected slice baseline:
- `gpuaas-slice-fabric-vfs`, `gpuaas-slice-vfio-devices`, `gpuaas-slice-network-baseline`, and `gpuaas-slice-ovs-bridge` are active;
- `/etc/netplan/60-ipoib.yaml` exists and an `ibp*` interface has a `192.168.x.x` fabric address;
- a fabric route exists for the configured IPoIB subnet;
- `ovsbr0` has `10.100.0.1/24`;
- `dnsmasq` is active with `gpu_subnet.conf`;
- `libvirtd` is active before slice VMs are expected to launch;
- Netdata may be present for ops telemetry, but it is not a slice-readiness dependency.
If the MAAS site profile still declares `deploy_user=hpcadmin`, password-based
access may exist for that user, but the MAAS-injected SSH key can still land on
the image's default `ubuntu` user. Until the profile/user-data contract is made
explicit, use `ubuntu` for image validation and treat `hpcadmin` SSH-key
availability as a separate site-profile check.
Collect baseline timings from the current MAAS image before switching
j22u11 to this image. The first comparison should focus on:
- MAAS deploy duration,
- first boot to cloud-init start,
- site bootstrap duration,
- GPUaaS bootstrap duration,
- node-agent enrollment duration,
- final active-state confirmation,
- slice slot discovery/reporting time.
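A sketch of how those timings could be captured as comparable rows; the helper name and the TSV shape are assumptions for illustration, not part of the pipeline:

```sh
# Hypothetical helper: record one provisioning-phase duration as a TSV row
# (phase name, duration in seconds) so runs can be diffed side by side.
record_timing() {
  # $1=phase name, $2=start epoch seconds, $3=end epoch seconds
  printf '%s\t%s\n' "$1" "$(( $3 - $2 ))"
}

record_timing maas_deploy 1700000000 1700000930   # prints: maas_deploy<TAB>930
```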
The 2026-04-21 j22u11 run gives a useful baseline split: MAAS onboarding
completed and the node became active on the custom tgz image at
10.177.36.171, with Docker, Netdata, node-agent, and all 8 H200 GPUs
validated. The remaining wall clock was dominated by first-boot site bootstrap
work, especially package phases that should now be skipped after rebuilding the
image with the site-bootstrap package and DOCA/OFED additions. Treat node
active/enrolled and slice-ready inventory as separate optimization gates.
2026-04-26 Slice Custom-Image Convergence Gap¶
Reimage validation found a difference between the known-good manually converged slice host and the custom-image slice host:
- `j22u15` was deployed from stock Noble with the `gpuaas-profile-slice-vm` tag and a site bootstrap recorded as `maas-site-bootstrap-h200-ib`. It had the IPoIB netplan, fabric route, OVS bridge, dnsmasq, libvirt, VFIO, and slice networking services converged.
- `j22u11` was deployed from the custom `gpuaas-h200-host-ubuntu2404` image with the same `gpuaas-profile-slice-vm` tag and a site bootstrap recorded as `maas-site-bootstrap-h200-slice-vm`. It had node-agent, Netdata, VFIO, and the OVS baseline, but was missing `/etc/netplan/60-ipoib.yaml`, the IPoIB fabric interface/route, and `gpuaas-slice-network-baseline`.
Treat this as an incomplete custom-image convergence issue, not a node-agent
issue. The h200-slice-vm image/bootstrap path must either compose the
h200-ib fabric-network baseline or provide an equivalent IPoIB convergence
step. Do not mark a custom-image slice host schedulable until the slice baseline
check above passes.
The host-image pipeline now carries both site-bootstrap scripts into curated images and the slice bootstrap runs the embedded H200/IB baseline first when needed:
- `/usr/local/share/gpuaas/site-bootstrap/h200-ib/bootstrap.sh`
- `/usr/local/share/gpuaas/site-bootstrap/h200-slice-vm/bootstrap.sh`
This is intentional. Slice-capable H200 hosts still need the H200/IB package,
NVIDIA, OFED/DKMS, Netdata, and fabric convergence layer before the
slice-specific libvirt/OVS/VFIO setup can be considered schedulable. If
`h200-ib.done` is absent after reimage, treat the node as host-baseline
incomplete even if `h200-slice-vm.done` and node-agent enrollment are present.
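That gate can be sketched as a marker-file check. The marker directory below is an assumption based on the `/var/lib/gpuaas/site-bootstrap` validation-file path used earlier in this runbook; adjust it if the bootstrap records its `.done` markers elsewhere:

```sh
# Sketch: host-baseline gate. Both markers must exist before a custom-image
# slice host is treated as baseline-complete. The directory is an assumption.
check_baseline() {
  d="${1:-/var/lib/gpuaas/site-bootstrap}"
  if [ -f "$d/h200-ib.done" ] && [ -f "$d/h200-slice-vm.done" ]; then
    echo baseline=complete
  else
    echo baseline=INCOMPLETE
  fi
}

check_baseline
```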
Operational Notes¶
Netdata is installed at the host image layer because node-level telemetry should cover H200 hosts regardless of whether they are serving bare-metal or slice allocations. A slice VM still needs its own agent only if per-guest process or application telemetry is required. Host Netdata can see host GPU and node health but cannot fully replace guest-local telemetry inside tenant VMs.
For the first image pass, host Netdata should stay bounded:
- dbengine retention defaults to `256` MB,
- the collection interval defaults to 5 seconds,
- NVIDIA GPU metrics use `nvidia-smi` via the Netdata `go.d/nvidia_smi` collector,
- Docker/container/cgroup collectors are enabled for host runtime visibility.
Site bootstrap also enforces Netdata convergence for both H200 profiles:
- bare-metal H200: `infra/env/maas/site-bootstrap/h200-ib/bootstrap.sh`,
- slice-capable H200: `infra/env/maas/site-bootstrap/h200-slice-vm/bootstrap.sh`.
This keeps reimaged nodes consistent even when the MAAS base image does not already contain Netdata. Use the profile-specific overrides only when needed:
- bare-metal: `H200_ENABLE_NETDATA`, `H200_NETDATA_BIND`, `H200_NETDATA_EDGE_LISTEN`, `H200_NETDATA_RETENTION_MB`,
- slice-capable: `GPUAAS_SLICE_ENABLE_NETDATA`, `GPUAAS_SLICE_NETDATA_BIND`, `GPUAAS_SLICE_NETDATA_EDGE_LISTEN`, `GPUAAS_SLICE_NETDATA_RETENTION_MB`.
The default telemetry edge posture is:
- The Netdata backend binds to `127.0.0.1:19998`.
- Node-local nginx listens on `0.0.0.0:19999`.
- nginx proxies only to the local Netdata backend and exposes `/gpuaas/telemetry/health`.
- `/gpuaas/telemetry/netdata/` redirects to the locally detected working Netdata dashboard route, for example `/v3/spaces/<host>/rooms/local/overview`, `/v2/spaces/<host>/rooms/local/overview`, or `/v1/`.
After reimage, verify the host before debugging the platform proxy:
```sh
systemctl is-active gpuaas-node-agent
systemctl is-active netdata
systemctl is-active nginx
ss -ltnp | grep -E '127\.0\.0\.1:19998|:19999'
curl -fsS http://127.0.0.1:19998/api/v1/info | jq -r .version
curl -fsS http://127.0.0.1:19999/gpuaas/telemetry/health
curl -I http://127.0.0.1:19999/gpuaas/telemetry/netdata/
curl -fsS http://<maas-host-ip>:19999/api/v1/info | jq -r .version
```
If `ss` shows Netdata listening on `0.0.0.0:19999`, the host did not converge to
the telemetry edge posture. Re-run the site bootstrap or inspect
`/etc/netdata/netdata.conf` and `/etc/nginx/sites-enabled/gpuaas-netdata-edge`
before relying on the platform proxy.
For existing pre-edge nodes that should not be reimaged yet, converge the host in place with the idempotent ops script. Use `--ssh-option` for temporary known-host overrides during incident recovery.
The script installs or repairs Netdata and nginx, moves Netdata to
`127.0.0.1:19998`, exposes nginx on `0.0.0.0:19999`, and verifies the stable
`/gpuaas/telemetry/*` paths.
The first image pass intentionally keeps hardware mutation out of the image. GPU binding, fabric VF creation, bridge creation, and bootstrap token handoff remain first-boot actions because they depend on the target host and selected site/profile.
MAAS API mutation timeouts during release, commission, and deploy are treated as unknown outcomes. The workflow records the timeout and then polls MAAS for the expected final state. This matters for image experiments because a slow MAAS controller or intermittent network path should not invalidate a provisioning baseline when MAAS actually accepted the operation.
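The timeout-then-poll behavior can be sketched as below. The loop shape is an assumption for illustration (the workflow's actual polling code is not shown here); the commented real-use line assumes the `maas maas280` CLI profile from this runbook:

```sh
# Sketch: after a MAAS mutation timeout, treat the outcome as unknown and
# poll for the expected final state instead of failing immediately. The
# status-reading command is passed as arguments so the loop stays generic.
poll_for_status() {
  expected=$1; attempts=$2; shift 2
  i=0; status=unknown
  while [ "$i" -lt "$attempts" ]; do
    status=$("$@" 2>/dev/null) || status=unknown
    if [ "$status" = "$expected" ]; then
      echo "converged:$status"; return 0
    fi
    i=$((i + 1)); sleep 1
  done
  echo "still_unknown:last=$status"; return 1
}

# Illustrative real use (SYSTEM_ID is a placeholder):
#   poll_for_status Deployed 60 sh -c \
#     "maas maas280 machine read \$SYSTEM_ID | jq -r .status_name"
```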
Troubleshooting¶
If MAAS deploy fails during curtin `block-meta` after downloading and writing
the full disk image, check the installation log for:

```
Did not find any filesystem ... that contained one of ['curtin', 'system-data/var/lib/snapd', 'snaps']
```

For `dd*` custom images, curtin uses those paths as root filesystem markers
after writing the image to disk. The GPUaaS image builder writes the
`/curtin/` marker directory into the root filesystem before conversion so MAAS
can identify the deployed root partition.
Prefer the default `tgz` artifact for GPUaaS host images. The root filesystem
archive lets MAAS own partitioning, ESP creation, and `/boot/efi` mounting.
That matches the GPUaaS `set_storage_layout=flat` provisioning path. A `ddgz`
full-disk image can conflict with MAAS-created storage because curtin may
extract the image root from the disk image while installing GRUB against the
MAAS-created EFI system partition.