GPU Slice Image Pipeline Runbook
This runbook builds the H200 GPU slice developer VM image on platform-control and publishes it to slice hosts over the private MAAS/node network. The goal is to avoid transferring multi-GB raw images over the developer VPN path.
Default Artifact
- Catalog slug: ubuntu-24.04-h200-cuda
- Runtime cache path on slice hosts: /var/lib/gpuaas/slice-images/ubuntu-24.04-h200-cuda.raw
- Source of truth: OCI artifact in the platform registry.
- Driver strategy: preinstalled
- Compatible SKU: h200-sxm-slice
- Default publish target: hpcadmin@10.177.36.197
The base image remains ubuntu-24.04-h200-base with cloud-init driver
installation. The CUDA developer image should be active only after this pipeline
successfully publishes the artifact to every schedulable slice host. Add
J22u11 or other slice hosts to GPUAAS_SLICE_IMAGE_TARGETS before making them
schedulable for GPU slice allocations.
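As a minimal sketch of extending the target list, the helper below appends a user@host entry to a space-separated string without creating duplicates. The function name add_slice_target is hypothetical, and hpcadmin@J22u11 is a placeholder: the runbook does not give the SSH user or address for J22u11.

```shell
# Hypothetical helper (not part of the pipeline scripts): append a user@host
# target to a space-separated list, skipping entries that are already present.
add_slice_target() {
  local list="$1"
  local new="$2"
  case " $list " in
    *" $new "*) printf '%s\n' "$list" ;;            # already listed, unchanged
    *) printf '%s\n' "${list:+$list }$new" ;;       # append with a single space
  esac
}
```

The result can then be exported before running the pipeline, for example GPUAAS_SLICE_IMAGE_TARGETS="$(add_slice_target 'hpcadmin@10.177.36.197' 'hpcadmin@J22u11')".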
Run From A Developer Workstation
Dry-run the remote plan:
Build on platform-control, publish an OCI artifact to the platform registry, and upsert the live catalog row:
Update verified node-local cache state for a different host set:
GPUAAS_SLICE_IMAGE_TARGETS='hpcadmin@10.177.36.197' \
scripts/ops/gpuaas_slice_image_platform_control_pipeline.sh
Reuse an already-built platform-control artifact and only republish/catalog it:
The pipeline no longer copies the raw image to nodes by default. To use the
temporary rsync fallback for a lab node, set
GPUAAS_SLICE_IMAGE_PUBLISH_NODE_CACHE=1. The production direction is direct
registry-to-node prewarm with node-local verified cache.
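A possible way to invoke the rsync fallback for a lab node is sketched below. The environment variables and the launcher path come from this runbook; the wrapper function publish_with_node_cache and its optional second argument (an alternate launcher path) are hypothetical conveniences, not part of the pipeline.

```shell
# Hypothetical wrapper: run the launcher with the temporary rsync fallback on.
#   $1  space-separated user@host targets
#   $2  optional launcher path (defaults to the script named in this runbook)
publish_with_node_cache() {
  local targets="$1"
  local launcher="${2:-scripts/ops/gpuaas_slice_image_platform_control_pipeline.sh}"
  # Both variables are set only for this one launcher invocation.
  GPUAAS_SLICE_IMAGE_TARGETS="$targets" \
  GPUAAS_SLICE_IMAGE_PUBLISH_NODE_CACHE=1 \
    "$launcher"
}
```

From the repository root, publish_with_node_cache 'hpcadmin@10.177.36.197' would run the default launcher with the fallback enabled for that host only.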
Runtime Behavior
The local launcher copies
scripts/ops/gpuaas_slice_image_remote_build_publish.sh to platform-control,
then runs it there. The remote script:
- Installs qemu-utils and libguestfs-tools if missing.
- Downloads the Ubuntu Noble cloud image to the platform-control cache.
- Resizes the image and uses NBD plus a host-network chroot to install guest GPU/RDMA packages.
- Converts the customized qcow2 image to sparse raw format.
- Pushes the raw image as an OCI artifact to the platform registry.
- Upserts the os_images row as active, with source_uri=oci://....
- Verifies any existing node-local cache on target hosts and records node_image_cache.status=verified when the digest matches.
GPUAAS_SLICE_IMAGE_CUSTOMIZE_METHOD=guestfs is available as a fallback, but
the default NBD path is preferred because it uses the platform-control host
network and avoids guest DNS failures during package installation.
The direct os_images and node_image_cache upserts are temporary operator
paths because the admin API currently supports list/create/delete but not
update/enable for an existing image or node image-cache state. Once those
endpoints exist, this script should use the API instead of direct SQL.
scripts/seed.sql preserves an already-materialized
ubuntu-24.04-h200-cuda digest/status so release deploys do not silently revert
the active image back to the disabled placeholder row.
Useful Overrides
- PLATFORM_CONTROL_SSH_HOST: platform-control SSH target. Defaults to hpcadmin@100.90.157.34.
- GPUAAS_SLICE_IMAGE_BASE_URL: source cloud image URL.
- GPUAAS_SLICE_IMAGE_INSTALL_PACKAGES: comma-separated package list passed to virt-customize --install.
- GPUAAS_SLICE_IMAGE_TARGETS: space-separated user@host publish targets.
- GPUAAS_SLICE_IMAGE_REGISTRY_HOST: registry host. Defaults to the platform-control registry host. This is the host stored in catalog metadata.
- GPUAAS_SLICE_IMAGE_REGISTRY_PUSH_HOST: optional internal registry endpoint used only for upload. Use this to keep large raw-image pushes off public ingress while preserving the public catalog reference.
- GPUAAS_SLICE_IMAGE_REGISTRY_NAMESPACE: registry namespace. Defaults to slice-guest-images.
- GPUAAS_SLICE_IMAGE_REGISTRY_TAG: registry tag. Defaults to UTC build time.
- GPUAAS_SLICE_IMAGE_REGISTRY_TLS_MODE: verify, insecure, or plain-http; use plain-http only for trusted internal registry endpoints.
- GPUAAS_SLICE_IMAGE_PUBLISH_NODE_CACHE=1: temporary rsync fallback that writes the raw image to target nodes.
- GPUAAS_SLICE_IMAGE_UPDATE_CATALOG=0: publish image without changing catalog.
- DATABASE_URL: catalog database URL on platform-control. Defaults to the local platform-control Postgres URL.
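Before launching a long build, it can help to sanity-check override values locally. The sketch below validates the TLS mode against the three documented values and splits a comma-separated package list for review. Both function names are hypothetical, and the pipeline scripts may perform their own validation differently.

```shell
# Hypothetical pre-flight check: accept only the three documented values of
# GPUAAS_SLICE_IMAGE_REGISTRY_TLS_MODE.
check_tls_mode() {
  case "$1" in
    verify|insecure|plain-http) return 0 ;;
    *) echo "unsupported TLS mode: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print a comma-separated package list one per line,
# which makes a long GPUAAS_SLICE_IMAGE_INSTALL_PACKAGES value easier to review.
split_packages() {
  printf '%s\n' "$1" | tr ',' '\n'
}
```

For example, check_tls_mode "$GPUAAS_SLICE_IMAGE_REGISTRY_TLS_MODE" returns non-zero for any value other than verify, insecure, or plain-http.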
Verification
Check the catalog:
psql "$DATABASE_URL" -c \
"select slug,status,driver_strategy,source_uri,digest_sha256,metadata->>'image_path',metadata->>'artifact_ref' from os_images where slug='ubuntu-24.04-h200-cuda';"
Check node cache state:
psql "$DATABASE_URL" -c \
"select n.hostname,c.image_slug,c.status,c.digest_sha256,c.local_path,c.verified_at from node_image_cache c join nodes n on n.id=c.node_id where c.image_slug='ubuntu-24.04-h200-cuda';"
Check a target host:
ssh hpcadmin@10.177.36.197 \
"ls -lh /var/lib/gpuaas/slice-images/ubuntu-24.04-h200-cuda.raw && sha256sum /var/lib/gpuaas/slice-images/ubuntu-24.04-h200-cuda.raw"
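The host check above prints a digest that still has to be compared by eye with digest_sha256 from the catalog. A minimal sketch of automating that comparison follows. The function name digest_matches is hypothetical, and accepting an optional sha256: prefix is an assumption, since the runbook does not show how the catalog stores the digest.

```shell
# Hypothetical helper: succeed only if the file's SHA-256 equals the expected
# digest. An optional "sha256:" prefix on the expected value is stripped
# (assumption about the catalog format).
digest_matches() {
  local file="$1"
  local expected="${2#sha256:}"
  local actual
  [ -r "$file" ] || return 2                       # missing or unreadable file
  actual="$(sha256sum "$file" | awk '{print $1}')"
  [ "$actual" = "$expected" ]
}
```

Run on a slice host, digest_matches /var/lib/gpuaas/slice-images/ubuntu-24.04-h200-cuda.raw "$CATALOG_DIGEST" returns 0 on a match, where CATALOG_DIGEST is a placeholder for the value read from the catalog query.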
End-to-end validation is a normal h200-sxm-slice allocation. The provisioning
task payload should show driver_strategy=preinstalled and image_path pointing
to the CUDA developer raw image.
For direct API smoke tests against the current slice-dev environment, use
region_code=region-maas-1, capacity_shape=gpu_slice, and
scheduler_type=slice. The API normalizes the scheduler type internally while
preserving the slice capacity shape.
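The three smoke-test parameters can be kept together in a small helper so they are not retyped for each request. This is only a sketch: the function name is hypothetical, and the runbook does not specify the endpoint, whether these values travel as a JSON body or as query parameters, or which other fields a request needs.

```shell
# Hypothetical helper: emit the documented slice-dev smoke-test parameters as
# a JSON object. Only the three fields named in this runbook are included.
slice_smoke_params() {
  local region="${1:-region-maas-1}"
  local shape="${2:-gpu_slice}"
  local scheduler="${3:-slice}"
  printf '{"region_code":"%s","capacity_shape":"%s","scheduler_type":"%s"}\n' \
    "$region" "$shape" "$scheduler"
}
```

Called with no arguments, it prints the slice-dev values listed above; positional arguments override them for other environments.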