
MAAS H200 Host Image Pipeline Runbook

This runbook covers building a curated Ubuntu 24.04 H200 host image on platform-control and uploading it to MAAS. The goal is to remove repeated first-boot package installation from the provisioning critical path while keeping per-host hardware configuration in site/bootstrap logic.

Default Artifact

  • MAAS image name: custom/gpuaas-h200-host-ubuntu2404
  • MAAS visible boot-resource name after upload: gpuaas-h200-host-ubuntu2404
  • MAAS title: GPUaaS H200 Host Ubuntu 24.04
  • Architecture: amd64/generic
  • File type: tgz
  • Base image: ubuntu/noble
  • Build host: platform-control
  • Build output: /var/lib/gpuaas/maas-host-images/gpuaas-h200-host-ubuntu2404.tar.gz

The image is intended for H200 bare-metal hosts and slice-capable H200 hosts. It bakes heavy/static prerequisites and leaves environment-specific setup for cloud-init/site bootstrap.
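
The upload step amounts to a single MAAS CLI boot-resources call. The following is a dry-runnable sketch, not the pipeline's exact code: names, profile, and paths mirror the defaults above, and the assembled command is only printed unless DRY_RUN=0.

```shell
# Sketch of the MAAS CLI upload the pipeline performs; names and paths
# mirror the documented defaults. DRY_RUN=1 (the default here) only prints.
PROFILE="${PROFILE:-maas280}"
ARTIFACT="${ARTIFACT:-/var/lib/gpuaas/maas-host-images/gpuaas-h200-host-ubuntu2404.tar.gz}"
cmd="maas $PROFILE boot-resources create \
  name=custom/gpuaas-h200-host-ubuntu2404 \
  title='GPUaaS H200 Host Ubuntu 24.04' \
  architecture=amd64/generic \
  filetype=tgz \
  base_image=ubuntu/noble \
  content@=$ARTIFACT"
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "$cmd"
else
  eval "$cmd"
fi
```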

What Goes In The Image

The default package set includes:

  • cloud-init and base diagnostics,
  • Docker and compose plugin,
  • NVIDIA host driver utilities,
  • NVIDIA container runtime configuration for Docker when nvidia-ctk is available,
  • libvirt, qemu, OVMF, and cloud image tools for slice guests,
  • Open vSwitch, dnsmasq, iptables/nftables for slice host networking,
  • RDMA/IB tools and perf test utilities,
  • the H200 site-bootstrap base package set (environment-modules, ipmitool, lldpd, nfs-common, pdsh, and related diagnostics),
  • optional Mellanox DOCA/OFED packages (doca-ofed, mlnx-fw-updater) when explicitly enabled from the configured DOCA repository,
  • Netdata with bounded dbengine retention and nvidia-smi collection enabled,
  • GPUaaS slice passthrough kernel args (intel_iommu=on iommu=pt) staged in both /etc/default/grub and a grub.d drop-in so the first deployed boot has IOMMU groups without a second reboot,
  • a bounded systemd-networkd-wait-online default so first boot does not wait indefinitely on unused secondary/fabric ports,
  • optional NVIDIA container toolkit, DCGM, fabric manager, driverctl, and rshim when available from the configured package repositories.
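
The kernel-arg staging above (both /etc/default/grub and a grub.d drop-in) can be sketched as follows. ROOT stands for the mounted image root during customization, and the drop-in filename 99-gpuaas-slice.cfg is an assumption for illustration, not necessarily the pipeline's actual name.

```shell
# Illustrative staging of the slice passthrough kernel args into a mounted
# image root. ROOT and the drop-in filename are assumptions for this sketch.
ROOT="${ROOT:-$(mktemp -d)}"            # mounted image root (scratch dir here)
ARGS="${ARGS:-intel_iommu=on iommu=pt}"

mkdir -p "$ROOT/etc/default/grub.d"
touch "$ROOT/etc/default/grub"

# 1) Append to /etc/default/grub so grub-mkconfig inside the image sees them.
if ! grep -q 'intel_iommu=on' "$ROOT/etc/default/grub"; then
  printf 'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT %s"\n' "$ARGS" \
    >> "$ROOT/etc/default/grub"
fi

# 2) Duplicate as a grub.d drop-in so the args survive even if the main file
#    is rewritten, giving IOMMU groups on the first deployed boot.
printf 'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT %s"\n' "$ARGS" \
  > "$ROOT/etc/default/grub.d/99-gpuaas-slice.cfg"
```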

The launcher also copies the current infra/env/maas/site-bootstrap/h200-slice-vm/bootstrap.sh into the image at:

/usr/local/share/gpuaas/site-bootstrap/h200-slice-vm/bootstrap.sh

That bootstrap should still run at first boot for hardware-specific operations such as IOMMU/VFIO configuration, OVS bridge state, SR-IOV VF creation, and runtime validation. For the prebaked image path, run it with package install disabled where possible:

GPUAAS_SLICE_INSTALL_PACKAGES=0 \
  /usr/local/share/gpuaas/site-bootstrap/h200-slice-vm/bootstrap.sh

Run From A Developer Workstation

Dry-run the remote plan:

GPUAAS_MAAS_HOST_IMAGE_DRY_RUN=1 \
  scripts/ops/gpuaas_maas_h200_host_image_platform_control_pipeline.sh

Build on platform-control and upload to MAAS:

scripts/ops/gpuaas_maas_h200_host_image_platform_control_pipeline.sh

Build only, without uploading:

GPUAAS_MAAS_HOST_IMAGE_UPLOAD=0 \
  scripts/ops/gpuaas_maas_h200_host_image_platform_control_pipeline.sh

Reuse an already-built artifact and only upload it:

GPUAAS_MAAS_HOST_IMAGE_SKIP_BUILD=1 \
  scripts/ops/gpuaas_maas_h200_host_image_platform_control_pipeline.sh

Useful Overrides

  • PLATFORM_CONTROL_SSH_HOST: platform-control SSH target. Defaults to hpcadmin@100.90.157.34.
  • GPUAAS_MAAS_HOST_IMAGE_BASE_URL: source Ubuntu cloud image URL.
  • GPUAAS_MAAS_HOST_IMAGE_SIZE: expanded image size. Defaults to 80G.
  • GPUAAS_MAAS_HOST_IMAGE_INSTALL_PACKAGES: comma-separated required package list installed during image customization.
  • GPUAAS_MAAS_HOST_IMAGE_OPTIONAL_PACKAGES: comma-separated best-effort package list. Missing packages are logged but do not fail the build.
  • GPUAAS_MAAS_HOST_IMAGE_ENABLE_NVIDIA_REPOS=0: skip adding NVIDIA package repositories during customization.
  • GPUAAS_MAAS_HOST_IMAGE_ENABLE_DOCA_REPOS=1: add the Mellanox DOCA repository and try best-effort doca-ofed/mlnx-fw-updater installation during customization. This is disabled by default because OFED DKMS packages can make MAAS curtin fail kernel/header post-install during deployment; prefer the site-bootstrap bundle for node-specific OFED setup until the image artifact is validated.
  • GPUAAS_MAAS_HOST_IMAGE_DOCA_KEY_URL: override the DOCA repository signing key URL.
  • GPUAAS_MAAS_HOST_IMAGE_DOCA_REPO_URL: override the DOCA Ubuntu 24.04 repository URL.
  • GPUAAS_MAAS_HOST_IMAGE_ENABLE_NETDATA=0: skip writing Netdata config.
  • GPUAAS_MAAS_HOST_IMAGE_NETDATA_RETENTION_MB: Netdata dbengine disk budget. Defaults to 256.
  • GPUAAS_MAAS_HOST_IMAGE_NETDATA_BIND: Netdata bind address. Defaults to 127.0.0.1:19998. Site bootstrap adds the nginx telemetry edge on :19999.
  • GPUAAS_MAAS_HOST_IMAGE_SLICE_KERNEL_ARGS: kernel args baked into the image for slice passthrough. Defaults to intel_iommu=on iommu=pt.
  • GPUAAS_MAAS_HOST_IMAGE_MAAS_PROFILE: MAAS CLI profile. Defaults to maas280.
  • GPUAAS_MAAS_HOST_IMAGE_MAAS_NAME: MAAS boot resource name.
  • GPUAAS_MAAS_HOST_IMAGE_MAAS_FILETYPE: MAAS file type. Defaults to tgz. Use a dd* file type only when MAAS should write a full disk image and bypass its normal target storage layout.
  • GPUAAS_MAAS_HOST_IMAGE_MAAS_BASE_IMAGE: MAAS base image. Defaults to ubuntu/noble.
  • GPUAAS_MAAS_HOST_IMAGE_UPLOAD_STAGING_DIR: local path on platform-control used for MAAS CLI upload. Defaults to $HOME/gpuaas-maas-host-images because snap-confined MAAS CLI installs may not be able to read /var/lib/gpuaas or hidden home paths.
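
Overrides compose. For example, a dry run previewing a build with DOCA enabled, a larger image, and more Netdata retention (values illustrative):

```shell
GPUAAS_MAAS_HOST_IMAGE_DRY_RUN=1 \
GPUAAS_MAAS_HOST_IMAGE_ENABLE_DOCA_REPOS=1 \
GPUAAS_MAAS_HOST_IMAGE_SIZE=120G \
GPUAAS_MAAS_HOST_IMAGE_NETDATA_RETENTION_MB=512 \
  scripts/ops/gpuaas_maas_h200_host_image_platform_control_pipeline.sh
```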

Verification

List uploaded custom images:

maas maas280 boot-resources read | jq -r \
  '.[] | select(.name=="gpuaas-h200-host-ubuntu2404" or .name=="custom/gpuaas-h200-host-ubuntu2404") | [.name,.title,.architecture,.type] | @tsv'

Check the built artifact on platform-control:

ssh hpcadmin@100.90.157.34 \
  'ls -lh /var/lib/gpuaas/maas-host-images/gpuaas-h200-host-ubuntu2404.tar.gz && sha256sum /var/lib/gpuaas/maas-host-images/gpuaas-h200-host-ubuntu2404.tar.gz'

Current known-good artifact is recorded in the MAAS boot-resource set after each upload:

MAAS boot-resource id: 11

Selecting The Image For A Profile

The GPUaaS MAAS deploy path passes the profile distro_series into MAAS machine deploy. For this custom image, set the target MAAS site profile's distro_series to the visible boot-resource name; the execution client normalizes that value into MAAS's deploy namespace (custom/<name>) when submitting the deploy request:

gpuaas-h200-host-ubuntu2404

Use the admin profile API or UI to patch the selected profile:

curl -fsS -X PATCH \
  "${API_BASE_URL}/api/v1/admin/maas-sites/${SITE_ID}/profiles/${PROFILE_ID}" \
  -H "Authorization: Bearer ${ADMIN_TOKEN}" \
  -H "Content-Type: application/json" \
  -H "X-Idempotency-Key: maas-profile-image-$(date +%s)" \
  --data '{"distro_series":"gpuaas-h200-host-ubuntu2404"}'

Switch the profile back to the stock image with:

{"distro_series":"ubuntu/noble"}

If MAAS fails during curtin installing-kernel with linux-generic or DKMS errors, switch the profile back to ubuntu/noble and keep the site_bootstrap_bundle_ref enabled. That path deploys a stock base image first and applies GPUaaS host customizations after MAAS has completed installation.

After deploying a host with the image, validate baked runtime prerequisites:

ssh ubuntu@<host> 'sudo /usr/local/sbin/gpuaas-h200-host-image-validate.sh || true'

Also confirm the user-facing slice host checkpoints separately:

ssh ubuntu@<host> 'systemctl is-active gpuaas-node-agent && docker compose version'
ssh ubuntu@<host> 'cat /proc/cmdline | tr " " "\n" | grep -E "^(intel_iommu=on|iommu=pt)$"'
ssh ubuntu@<host> 'sudo cat /var/lib/gpuaas/site-bootstrap/h200-slice-vm.validation 2>/dev/null || true'

For a bare-metal profile, host nvidia-smi should still succeed and report the H200 devices. For a slice profile, host NVML can be unavailable after the GPUs are bound for passthrough. Do not use host nvidia-smi success as the slice-readiness gate.

For slice-host profiles, a Deployed status in MAAS and node-agent enrollment are not enough. Validate that the host has converged to the same runtime baseline as the manually converged slice host before marking it schedulable:

ssh ubuntu@<host> '
  set -eu
  for svc in \
    gpuaas-node-agent \
    gpuaas-slice-fabric-vfs \
    gpuaas-slice-vfio-devices \
    gpuaas-slice-network-baseline \
    gpuaas-slice-ovs-bridge \
    libvirtd \
    dnsmasq
  do
    systemctl is-active --quiet "$svc" && echo "$svc=active" || echo "$svc=NOT_ACTIVE"
  done
  test -f /etc/netplan/60-ipoib.yaml && echo ipoib_netplan=present
  ip -br addr show | grep -E "^ibp.*192\\.168\\." || true
  ip route | grep -E "^192\\.168\\.0\\.0/16 .*dev ibp|^192\\.168\\.[0-9]+\\.0/24 .*dev ibp" || true
  ip -br addr show ovsbr0 | grep "10.100.0.1/24"
  test -f /etc/dnsmasq.d/gpu_subnet.conf && echo dnsmasq_gpu_subnet=present
'

Expected slice baseline:

  1. gpuaas-slice-fabric-vfs, gpuaas-slice-vfio-devices, gpuaas-slice-network-baseline, and gpuaas-slice-ovs-bridge are active;
  2. /etc/netplan/60-ipoib.yaml exists and an ibp* interface has a 192.168.x.x fabric address;
  3. a fabric route exists for the configured IPoIB subnet;
  4. ovsbr0 has 10.100.0.1/24;
  5. dnsmasq is active with gpu_subnet.conf;
  6. libvirtd is active before slice VMs are expected to launch;
  7. Netdata may be present for ops telemetry, but it is not a slice-readiness dependency.

If the MAAS site profile still declares deploy_user=hpcadmin, password-based access may exist for that user, but the MAAS-injected SSH key can still land on the image default ubuntu user. Until the profile/user-data contract is made explicit, use ubuntu for image validation and treat hpcadmin SSH-key availability as a separate site-profile check.

Collect baseline timings from the current MAAS image before switching j22u11 to this image. The first comparison should focus on:

  • MAAS deploy duration,
  • first boot to cloud-init start,
  • site bootstrap duration,
  • GPUaaS bootstrap duration,
  • node-agent enrollment duration,
  • final active-state confirmation,
  • slice slot discovery/reporting time.
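
The first-boot portion of those timings can be pulled from cloud-init and systemd on the deployed host. These commands assume cloud-init is present (it is baked into this image); <host> is the deployed node, as elsewhere in this runbook:

```shell
ssh ubuntu@<host> 'systemd-analyze time'      # kernel/userspace boot wall clock
ssh ubuntu@<host> 'cloud-init analyze show'   # per-stage cloud-init timings
ssh ubuntu@<host> 'cloud-init analyze blame'  # slowest cloud-init steps first
maas maas280 events query hostname=<host> limit=50   # MAAS-side deploy timeline
```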

The 2026-04-21 j22u11 run gives a useful baseline split: MAAS onboarding completed and the node became active on the custom tgz image at 10.177.36.171, with Docker, Netdata, node-agent, and all 8 H200 GPUs validated. The remaining wall clock was dominated by first-boot site bootstrap work, especially package phases that should now be skipped after rebuilding the image with the site-bootstrap package and DOCA/OFED additions. Treat node active/enrolled and slice-ready inventory as separate optimization gates.

2026-04-26 Slice Custom-Image Convergence Gap

Reimage validation found a difference between the known-good manually converged slice host and the custom-image slice host:

  1. j22u15 was deployed from stock Noble with the gpuaas-profile-slice-vm tag and a site bootstrap recorded as maas-site-bootstrap-h200-ib. It had the IPoIB netplan, fabric route, OVS bridge, dnsmasq, libvirt, VFIO, and slice networking services converged.
  2. j22u11 was deployed from the custom gpuaas-h200-host-ubuntu2404 image with the same gpuaas-profile-slice-vm tag and a site bootstrap recorded as maas-site-bootstrap-h200-slice-vm. It had node-agent, Netdata, VFIO, and OVS baseline, but was missing /etc/netplan/60-ipoib.yaml, the IPoIB fabric interface/route, and gpuaas-slice-network-baseline.

Treat this as an incomplete custom-image convergence issue, not a node-agent issue. The h200-slice-vm image/bootstrap path must either compose the h200-ib fabric-network baseline or provide an equivalent IPoIB convergence step. Do not mark a custom-image slice host schedulable until the slice baseline check above passes.

The host-image pipeline now carries both site-bootstrap scripts into curated images and the slice bootstrap runs the embedded H200/IB baseline first when needed:

  • /usr/local/share/gpuaas/site-bootstrap/h200-ib/bootstrap.sh
  • /usr/local/share/gpuaas/site-bootstrap/h200-slice-vm/bootstrap.sh

This is intentional. Slice-capable H200 hosts still need the H200/IB package, NVIDIA, OFED/DKMS, Netdata, and fabric convergence layer before the slice-specific libvirt/OVS/VFIO setup can be considered schedulable. If h200-ib.done is absent after reimage, treat the node as host-baseline incomplete even if h200-slice-vm.done and node-agent enrollment are present.
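
A quick marker check after reimage can be sketched as below. The marker directory is an assumption based on the validation file path used earlier in this runbook (/var/lib/gpuaas/site-bootstrap); override STATE_DIR if the layout differs.

```shell
# Hedged check for the h200-ib.done / h200-slice-vm.done completion markers.
# STATE_DIR is an assumed location, overridable for other layouts.
STATE_DIR="${STATE_DIR:-/var/lib/gpuaas/site-bootstrap}"

check_markers() {
  # Prints marker=present or marker=MISSING for each bootstrap layer.
  local dir="$1" marker
  for marker in h200-ib.done h200-slice-vm.done; do
    if [ -f "$dir/$marker" ]; then
      echo "$marker=present"
    else
      echo "$marker=MISSING"
    fi
  done
}

check_markers "$STATE_DIR"
```

If h200-ib.done is MISSING, treat the node as host-baseline incomplete regardless of the slice marker.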

Operational Notes

Netdata is installed at the host image layer because node-level telemetry should cover H200 hosts regardless of whether they serve bare-metal or slice allocations. A slice VM needs its own agent only when per-guest process or application telemetry is required. Host Netdata can see host GPU and node health but cannot fully replace guest-local telemetry inside tenant VMs.

For the first image pass, host Netdata should stay bounded:

  • dbengine retention defaults to 256 MB,
  • collection interval defaults to 5 seconds,
  • NVIDIA GPU metrics use nvidia-smi via Netdata go.d/nvidia_smi,
  • Docker/container/cgroup collectors are enabled for host runtime visibility.
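
A netdata.conf fragment matching those bounded defaults would look roughly like the following. Exact option names vary by Netdata version, so treat this as a sketch, not the shipped config:

```ini
# Illustrative netdata.conf fragment; option names are version-dependent.
[db]
    mode = dbengine
    update every = 5
    dbengine multihost disk space MB = 256

[web]
    bind to = 127.0.0.1:19998
```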

Site bootstrap also enforces Netdata convergence for both H200 profiles:

  • bare-metal H200: infra/env/maas/site-bootstrap/h200-ib/bootstrap.sh,
  • slice-capable H200: infra/env/maas/site-bootstrap/h200-slice-vm/bootstrap.sh.

This keeps reimaged nodes consistent even when the MAAS base image does not already contain Netdata. Use the profile-specific overrides only when needed:

  • bare-metal: H200_ENABLE_NETDATA, H200_NETDATA_BIND, H200_NETDATA_EDGE_LISTEN, H200_NETDATA_RETENTION_MB,
  • slice-capable: GPUAAS_SLICE_ENABLE_NETDATA, GPUAAS_SLICE_NETDATA_BIND, GPUAAS_SLICE_NETDATA_EDGE_LISTEN, GPUAAS_SLICE_NETDATA_RETENTION_MB.

The default telemetry edge posture is:

  • Netdata backend binds to 127.0.0.1:19998.
  • Node-local nginx listens on 0.0.0.0:19999.
  • nginx proxies only to the local Netdata backend and exposes /gpuaas/telemetry/health.
  • /gpuaas/telemetry/netdata/ redirects to the locally detected working Netdata dashboard route, for example /v3/spaces/<host>/rooms/local/overview, /v2/spaces/<host>/rooms/local/overview, or /v1/.
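
A minimal nginx server block implementing that posture might look like this. It is a sketch under the stated defaults: the health payload is illustrative, and the dashboard-route detection behind /gpuaas/telemetry/netdata/ is omitted.

```nginx
# Hedged sketch of the telemetry edge; not the exact shipped config.
server {
    listen 0.0.0.0:19999;

    # Stable health endpoint for the platform proxy.
    location /gpuaas/telemetry/health {
        return 200 "ok\n";
    }

    # Proxy only to the local Netdata backend on the loopback bind.
    location / {
        proxy_pass http://127.0.0.1:19998;
    }
}
```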

After reimage, verify the host before debugging the platform proxy:

systemctl is-active gpuaas-node-agent
systemctl is-active netdata
systemctl is-active nginx
ss -ltnp | grep -E '127\.0\.0\.1:19998|:19999'
curl -fsS http://127.0.0.1:19998/api/v1/info | jq -r .version
curl -fsS http://127.0.0.1:19999/gpuaas/telemetry/health
curl -I http://127.0.0.1:19999/gpuaas/telemetry/netdata/
curl -fsS http://<maas-host-ip>:19999/api/v1/info | jq -r .version

If ss shows Netdata listening on 0.0.0.0:19999, the host did not converge to the telemetry edge posture. Re-run the site bootstrap or inspect /etc/netdata/netdata.conf and /etc/nginx/sites-enabled/gpuaas-netdata-edge before relying on the platform proxy.

For existing pre-edge nodes that should not be reimaged yet, converge the host in place with the idempotent ops script:

scripts/ops/gpuaas_netdata_edge_converge.sh --host <node-hostname> --user ubuntu

Use --ssh-option for temporary known-host overrides during incident recovery. The script installs or repairs Netdata and nginx, moves Netdata to 127.0.0.1:19998, exposes nginx on 0.0.0.0:19999, and verifies the stable /gpuaas/telemetry/* paths.

The first image pass intentionally keeps hardware mutation out of the image. GPU binding, fabric VF creation, bridge creation, and bootstrap token handoff remain first-boot actions because they depend on the target host and selected site/profile.

MAAS API mutation timeouts during release, commission, and deploy are treated as unknown outcomes. The workflow records the timeout and then polls MAAS for the expected final state. This matters for image experiments because a slow MAAS controller or intermittent network path should not invalidate a provisioning baseline when MAAS actually accepted the operation.
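
The timeout-then-poll pattern can be sketched as a small helper. poll_until and its arguments are illustrative names for this sketch, not the workflow's actual API; the commented example assumes the maas280 profile and jq as used elsewhere in this runbook.

```shell
# Hedged sketch: treat a timed-out MAAS mutation as an unknown outcome and
# poll for the expected final state instead of failing immediately.
poll_until() {
  # poll_until <attempts> <sleep_seconds> <command...>
  # Returns 0 as soon as <command...> succeeds, 1 after <attempts> tries.
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then return 0; fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: after a deploy call times out, poll MAAS for the Deployed state.
# poll_until 60 30 sh -c \
#   'maas maas280 machine read "$SYSTEM_ID" | jq -e ".status_name==\"Deployed\"" >/dev/null'
```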

Troubleshooting

If MAAS deploy fails during curtin block-meta after downloading and writing the full disk image, check the installation log for:

Did not find any filesystem ... that contained one of ['curtin', 'system-data/var/lib/snapd', 'snaps']

For dd* custom images, curtin uses those paths as root filesystem markers after writing the image to disk. The GPUaaS image builder writes the /curtin/ marker directory into the root filesystem before conversion so MAAS can identify the deployed root partition.

Prefer the default tgz artifact for GPUaaS host images. The root filesystem archive lets MAAS own partitioning, ESP creation, and /boot/efi mounting. That matches the GPUaaS set_storage_layout=flat provisioning path. A ddgz full-disk image can conflict with MAAS-created storage because curtin may extract the image root from the disk image while installing GRUB against the MAAS-created EFI system partition.