GPU Slice Image Pipeline Runbook

This runbook builds the H200 GPU slice developer VM image on platform-control and publishes it to slice hosts over the private MAAS/node network. The goal is to avoid transferring multi-GB raw images over the developer VPN path.

Default Artifact

  • Catalog slug: ubuntu-24.04-h200-cuda
  • Runtime cache path on slice hosts: /var/lib/gpuaas/slice-images/ubuntu-24.04-h200-cuda.raw
  • Source of truth: OCI artifact in the platform registry.
  • Driver strategy: preinstalled
  • Compatible SKU: h200-sxm-slice
  • Default publish target: hpcadmin@10.177.36.197

The base image remains ubuntu-24.04-h200-base with cloud-init driver installation. The CUDA developer image should be active only after this pipeline successfully publishes the artifact to every schedulable slice host. Add J22u11 or other slice hosts to GPUAAS_SLICE_IMAGE_TARGETS before making them schedulable for GPU slice allocations.
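
For example, to publish and verify against J22u11 in addition to the default target (the second address is a placeholder for that host's node-network address):

GPUAAS_SLICE_IMAGE_TARGETS='hpcadmin@10.177.36.197 hpcadmin@<J22u11-address>' \
  scripts/ops/gpuaas_slice_image_platform_control_pipeline.sh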

Run From A Developer Workstation

Dry-run the remote plan:

GPUAAS_SLICE_IMAGE_DRY_RUN=1 \
  scripts/ops/gpuaas_slice_image_platform_control_pipeline.sh

Build on platform-control, publish an OCI artifact to the platform registry, and upsert the live catalog row:

scripts/ops/gpuaas_slice_image_platform_control_pipeline.sh

Update verified node-local cache state for a different host set:

GPUAAS_SLICE_IMAGE_TARGETS='hpcadmin@10.177.36.197' \
  scripts/ops/gpuaas_slice_image_platform_control_pipeline.sh

Reuse an already-built platform-control artifact and only republish/catalog it:

GPUAAS_SLICE_IMAGE_SKIP_BUILD=1 \
  scripts/ops/gpuaas_slice_image_platform_control_pipeline.sh

The pipeline no longer copies the raw image to nodes by default. To use the temporary rsync fallback for a lab node, set GPUAAS_SLICE_IMAGE_PUBLISH_NODE_CACHE=1. The production direction is direct registry-to-node prewarm with node-local verified cache.
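
For example, to copy the raw image to a single lab node through that fallback:

GPUAAS_SLICE_IMAGE_PUBLISH_NODE_CACHE=1 \
GPUAAS_SLICE_IMAGE_TARGETS='hpcadmin@10.177.36.197' \
  scripts/ops/gpuaas_slice_image_platform_control_pipeline.sh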

Runtime Behavior

The local launcher copies scripts/ops/gpuaas_slice_image_remote_build_publish.sh to platform-control, then runs it there. The remote script:

  1. Installs qemu-utils and libguestfs-tools if missing.
  2. Downloads the Ubuntu Noble cloud image to the platform-control cache.
  3. Resizes the image and uses NBD plus a host-network chroot to install guest GPU/RDMA packages.
  4. Converts the customized qcow2 image to sparse raw format.
  5. Pushes the raw image as an OCI artifact to the platform registry.
  6. Upserts the os_images row as active, with source_uri=oci://....
  7. Verifies any existing node-local cache on target hosts and records node_image_cache.status=verified when the digest matches.
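
The remote script owns the exact commands; as a rough sketch of steps 2-5, the flow on platform-control looks like the following. Device paths, sizes, the oras invocation, and the $GUEST_PACKAGES/$REGISTRY_HOST/$TAG placeholders are illustrative, not the script's literal contents:

# Download the Noble cloud image into the local cache.
curl -fLo noble-server-cloudimg-amd64.img "$GPUAAS_SLICE_IMAGE_BASE_URL"

# Grow the disk, expose it over NBD, and install guest packages in a chroot
# that reuses the host's network and DNS configuration.
qemu-img resize noble-server-cloudimg-amd64.img 64G
sudo modprobe nbd max_part=8
sudo qemu-nbd --connect=/dev/nbd0 noble-server-cloudimg-amd64.img
sudo mount /dev/nbd0p1 /mnt/guest
sudo cp --remove-destination /etc/resolv.conf /mnt/guest/etc/resolv.conf
sudo chroot /mnt/guest apt-get update
sudo chroot /mnt/guest apt-get install -y $GUEST_PACKAGES   # list from GPUAAS_SLICE_IMAGE_INSTALL_PACKAGES
sudo umount /mnt/guest && sudo qemu-nbd --disconnect /dev/nbd0

# Convert the customized qcow2 to sparse raw and push it as an OCI artifact.
qemu-img convert -O raw noble-server-cloudimg-amd64.img ubuntu-24.04-h200-cuda.raw
oras push "$REGISTRY_HOST/slice-guest-images/ubuntu-24.04-h200-cuda:$TAG" \
  ubuntu-24.04-h200-cuda.raw:application/octet-stream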

GPUAAS_SLICE_IMAGE_CUSTOMIZE_METHOD=guestfs is available as a fallback, but the default NBD path is preferred because it uses the platform-control host network and avoids guest DNS failures during package installation.
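
To force the guestfs path for a single run:

GPUAAS_SLICE_IMAGE_CUSTOMIZE_METHOD=guestfs \
  scripts/ops/gpuaas_slice_image_platform_control_pipeline.sh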

The direct os_images and node_image_cache upserts are temporary operator paths because the admin API currently supports list/create/delete but not update/enable for an existing image or node image-cache state. Once those endpoints exist, this script should use the API instead of direct SQL. scripts/seed.sql preserves an already-materialized ubuntu-24.04-h200-cuda digest/status so release deploys do not silently revert the active image back to the disabled placeholder row.
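
Until then, the manual catalog upsert looks roughly like the sketch below; the column list is inferred from the verification queries later in this runbook and the conflict key is assumed to be slug, so treat the remote script as authoritative:

psql "$DATABASE_URL" -c \
  "insert into os_images (slug, status, driver_strategy, source_uri, digest_sha256)
   values ('ubuntu-24.04-h200-cuda', 'active', 'preinstalled', 'oci://<artifact-ref>', '<sha256>')
   on conflict (slug) do update
     set status = excluded.status,
         source_uri = excluded.source_uri,
         digest_sha256 = excluded.digest_sha256;"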

Useful Overrides

  • PLATFORM_CONTROL_SSH_HOST: platform-control SSH target. Defaults to hpcadmin@100.90.157.34.
  • GPUAAS_SLICE_IMAGE_BASE_URL: source cloud image URL.
  • GPUAAS_SLICE_IMAGE_INSTALL_PACKAGES: comma-separated package list passed to virt-customize --install.
  • GPUAAS_SLICE_IMAGE_TARGETS: space-separated user@host publish targets.
  • GPUAAS_SLICE_IMAGE_REGISTRY_HOST: registry host. Defaults to the platform-control registry host. This is the host stored in catalog metadata.
  • GPUAAS_SLICE_IMAGE_REGISTRY_PUSH_HOST: optional internal registry endpoint used only for upload. Use this to keep large raw-image pushes off public ingress while preserving the public catalog reference.
  • GPUAAS_SLICE_IMAGE_REGISTRY_NAMESPACE: registry namespace. Defaults to slice-guest-images.
  • GPUAAS_SLICE_IMAGE_REGISTRY_TAG: registry tag. Defaults to UTC build time.
  • GPUAAS_SLICE_IMAGE_REGISTRY_TLS_MODE: verify, insecure, or plain-http; use plain-http only for trusted internal registry endpoints.
  • GPUAAS_SLICE_IMAGE_PUBLISH_NODE_CACHE=1: temporary rsync fallback that writes the raw image to target nodes.
  • GPUAAS_SLICE_IMAGE_UPDATE_CATALOG=0: publish the image without updating the catalog row.
  • DATABASE_URL: catalog database URL on platform-control. Defaults to the local platform-control Postgres URL.
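
A combined example that pushes through an internal registry endpoint and verifies a single target (the registry hostnames are illustrative):

PLATFORM_CONTROL_SSH_HOST='hpcadmin@100.90.157.34' \
GPUAAS_SLICE_IMAGE_REGISTRY_HOST='registry.platform.example' \
GPUAAS_SLICE_IMAGE_REGISTRY_PUSH_HOST='registry-internal.platform.example:5000' \
GPUAAS_SLICE_IMAGE_REGISTRY_TLS_MODE=verify \
GPUAAS_SLICE_IMAGE_TARGETS='hpcadmin@10.177.36.197' \
  scripts/ops/gpuaas_slice_image_platform_control_pipeline.sh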

Verification

Check the catalog:

psql "$DATABASE_URL" -c \
  "select slug,status,driver_strategy,source_uri,digest_sha256,metadata->>'image_path',metadata->>'artifact_ref' from os_images where slug='ubuntu-24.04-h200-cuda';"

Check node cache state:

psql "$DATABASE_URL" -c \
  "select n.hostname,c.image_slug,c.status,c.digest_sha256,c.local_path,c.verified_at from node_image_cache c join nodes n on n.id=c.node_id where c.image_slug='ubuntu-24.04-h200-cuda';"

Check a target host (the sha256sum output should match digest_sha256 from the catalog row):

ssh hpcadmin@10.177.36.197 \
  "ls -lh /var/lib/gpuaas/slice-images/ubuntu-24.04-h200-cuda.raw && sha256sum /var/lib/gpuaas/slice-images/ubuntu-24.04-h200-cuda.raw"

End-to-end validation is a normal h200-sxm-slice allocation. The provisioning task payload should show driver_strategy=preinstalled and image_path pointing to the CUDA developer raw image.

For direct API smoke tests against the current slice-dev environment, use region_code=region-maas-1, capacity_shape=gpu_slice, and scheduler_type=slice. The API normalizes the scheduler type internally while preserving the slice capacity shape.
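
A minimal request sketch; the endpoint path, auth header, and payload shape are assumptions, and only the three parameter values above come from this runbook:

curl -s -X POST "https://<api-host>/v1/allocations" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"region_code":"region-maas-1","capacity_shape":"gpu_slice","scheduler_type":"slice"}'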