MAAS Provisioning Time Optimization v1

Status: draft; first instrumentation, wait-loop improvements, and host image pipeline implemented
Audience: GPUaaS engineering, infra/MAAS operators, platform-control reviewers
Last updated: 2026-04-26

Purpose

Capture the concrete optimization directions for reducing MAAS-backed node provisioning time without weakening the existing lifecycle contract.

This document is intentionally narrower than the full MAAS lifecycle baseline.

It focuses on:

  1. where provisioning time is currently spent,
  2. which work belongs in the image versus first boot versus post-enrollment,
  3. which optimizations preserve control-plane correctness,
  4. what should be measured before and after each change.

It does not redefine:

  1. the MAAS workflow API,
  2. the public endpoint/public IP model,
  3. the embedded UI gateway or workload proxy model.

Problem Statement

Current provisioning paths still spend too much time after MAAS deploy completes. The likely bottlenecks are no longer just MAAS image transfer. They are spread across:

  1. boot-to-cloud-init timing,
  2. bootstrap package fetch and validation,
  3. package install or runtime prerequisite install during first boot,
  4. enrollment token/bootstrap token validation,
  5. node-agent install/start and first enrollment,
  6. post-deploy readiness polling and conservative completion gates.

The product goal is:

  • keep the control path safe and auditable,
  • keep the data path direct,
  • shift repeatable heavy work into the image or controlled site/profile bundles,
  • leave only small environment-specific configuration for first boot.

Guiding Position

Provisioning speed should come primarily from moving work earlier in the lifecycle, not from weakening readiness checks.

The preferred layering is:

  1. MAAS image owns the base OS and heavy/static prerequisites.
  2. Site/profile bootstrap bundle owns site-specific environment wiring.
  3. GPUaaS bootstrap owns node-agent enrollment and final platform handoff.
  4. Post-enrollment typed node tasks own ongoing drift repair or later runtime changes.

Avoid turning first boot into a generic install phase for large packages or for repeated host bootstrap work that could be prebuilt.

Current Execution Boundary

The existing MAAS docs already preserve several behaviors that must remain:

  1. deterministic discovery/adoption,
  2. PXE safety before deploy,
  3. bounded classified deploy retry,
  4. infra-owned site bootstrap before GPUaaS bootstrap,
  5. explicit completion gates before a node is considered active.

Those behaviors should stay. Optimization should change where work happens, not remove those guardrails.

Provisioning Time Buckets

For optimization work, measure at least these buckets separately:

  1. request accepted to MAAS workflow start,
  2. discovery/adoption time,
  3. commission time,
  4. ready-to-deploy wait,
  5. image deploy time,
  6. first boot to cloud-init start,
  7. cloud-init/bootstrap package fetch,
  8. site bootstrap execution,
  9. GPUaaS bootstrap execution,
  10. node-agent enrollment,
  11. post-enrollment runtime prereq checks,
  12. control-plane readiness confirmation.

Without this split, image and first-boot changes will be hard to evaluate.
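
As a minimal sketch of what bucket-level timing could look like, the snippet below computes per-bucket durations from a timestamped stage log; the log format and stage names are illustrative assumptions, not the workflow's real schema.

```bash
#!/usr/bin/env bash
# Illustrative only: compute per-bucket durations from a timestamped stage
# log with lines like "2026-04-21T10:00:00Z deploy_started". The stage
# names and log layout are assumptions, not the workflow's real schema.
set -euo pipefail

log_file="${1:?usage: $0 <stage-log>}"

prev_ts="" prev_stage=""
while read -r ts stage; do
  if [[ -n "$prev_ts" ]]; then
    # Bucket duration = wall-clock seconds between consecutive stage events.
    printf '%-45s %6ss\n' "$prev_stage -> $stage" \
      "$(( $(date -d "$ts" +%s) - $(date -d "$prev_ts" +%s) ))"
  fi
  prev_ts="$ts" prev_stage="$stage"
done < "$log_file"
```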

Current implementation notes:

  1. site bootstrap can receive the bootstrap progress callback environment,
  2. GPUaaS node bootstrap now emits explicit node_bootstrap started, succeeded, and failed callbacks to the control plane when a progress URL is available,
  3. onboarding and decommission wait loops record MAAS INFO events into the workflow timeline for the active stage window,
  4. node enrollment polling now runs on a shorter interval than MAAS deploy polling so an already-enrolled node is not held behind a coarse wait tick,
  5. onboarding and decommission reenrollment loops record node status changes while waiting for active state,
  6. a platform-control hosted MAAS H200 host-image build/upload pipeline now exists for the first prebaked image experiment (see MAAS_H200_Host_Image_Pipeline_Runbook.md),
  7. onboarding and decommission release/deploy submission timeouts are treated as unknown-outcome MAAS mutations: the workflow records the timeout and polls MAAS for the target state before failing.

Optimization Layers

1. Image-level optimization

Move these into the MAAS-deployed image whenever they are stable enough:

  1. base OS packages required on every managed node,
  2. container runtime packages when they are standard for the target profile,
  3. GPU/driver support packages that do not need per-host dynamic install,
  4. large package repositories, apt sources, and trusted keys,
  5. common diagnostics used by first boot and incident workflows,
  6. any static bootstrap helper binaries or launchers.

Initial H200 host image content:

  1. Ubuntu 24.04 Noble cloud-image base expanded for H200 host use,
  2. Docker and compose plugin,
  3. NVIDIA host driver utilities and NVIDIA container toolkit where available,
  4. Docker daemon baseline config with BuildKit enabled and bounded JSON log retention,
  5. optional DCGM/fabric manager packages where available from configured repos,
  6. Netdata with bounded retention plus host nvidia-smi, Docker, cgroup, and systemd collectors,
  7. libvirt/qemu/OVS/dnsmasq/cloud-image tooling required by the slice host runtime,
  8. RDMA/IB diagnostics and perf tooling,
  9. slice passthrough kernel args (intel_iommu=on iommu=pt) and a bounded systemd-networkd-wait-online baseline to avoid waiting on unused secondary/fabric ports,
  10. the current h200-slice-vm site bootstrap bundle staged under /usr/local/share/gpuaas/site-bootstrap/.
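
As one illustration of item 4 above, the Docker daemon baseline might be written at image-build time roughly as follows; the retention limits shown are placeholder values, not the pipeline's actual settings.

```bash
# Illustrative image-build step: Docker daemon baseline with BuildKit on
# and bounded JSON log retention. The max-size/max-file values shown are
# placeholders, not the pipeline's real settings.
install -d -m 0755 /etc/docker
cat > /etc/docker/daemon.json <<'EOF'
{
  "features": { "buildkit": true },
  "log-driver": "json-file",
  "log-opts": { "max-size": "50m", "max-file": "5" }
}
EOF
```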

The image does not pre-run hardware-specific configuration. VFIO binding, SR-IOV VF creation, slice bridge creation, and token/enrollment work remain first-boot responsibilities. IOMMU is the exception: the kernel args should be baked so first boot has IOMMU groups and the slice host does not need a second reboot before topology discovery can pass.
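
A minimal sketch of how the baked kernel args and the bounded wait-online baseline from item 9 could be applied during the image build, assuming standard Ubuntu GRUB drop-in handling; the 30-second timeout is an illustrative value.

```bash
# Bake the slice passthrough kernel args via a GRUB drop-in so first boot
# already has IOMMU groups (no second reboot before topology discovery).
cat > /etc/default/grub.d/90-gpuaas-slice-iommu.cfg <<'EOF'
GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT intel_iommu=on iommu=pt"
EOF
update-grub

# Bound systemd-networkd-wait-online so boot does not stall on unused
# secondary/fabric ports: wait for any one interface, with a timeout.
# The 30s value is illustrative, not the baked baseline's real setting.
mkdir -p /etc/systemd/system/systemd-networkd-wait-online.service.d
cat > /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/lib/systemd/systemd-networkd-wait-online --any --timeout=30
EOF
```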

Benefits:

  1. removes repeated package download/install cost,
  2. reduces exposure to transient mirror/network delays at first boot,
  3. keeps first-boot cloud-init smaller and easier to debug.

Risks:

  1. image rebuild cadence becomes important,
  2. stale images can hide drift until deploy,
  3. multiple site/profile variants can multiply image count if unmanaged.

Recommendation:

  • use a small number of curated base images per host class/profile family,
  • do not let each site invent arbitrary one-off images,
  • version and publish image lineage explicitly.

The first H200 baseline should compare these two readiness moments separately:

  1. node active/enrolled: node-agent is up, authenticated, and reporting host identity,
  2. slice-ready: approved or discoverable slice slots have been reported and can be scheduled.

For slice hosts, slice-ready must be stricter than "node-agent is active" or "VFIO devices are bound." It also requires the fabric and VM runtime baseline:

  1. the slice profile tag in MAAS is gpuaas-profile-slice-vm;
  2. gpuaas-slice-fabric-vfs, gpuaas-slice-vfio-devices, gpuaas-slice-network-baseline, and gpuaas-slice-ovs-bridge are active;
  3. IPoIB netplan exists, an ibp* interface has the site fabric address, and a fabric route is installed;
  4. ovsbr0 has the guest subnet gateway;
  5. dnsmasq and libvirtd are active;
  6. node-agent topology reports the slice network evidence without blockers.

This is intentionally a convergence gate, not just an image-build gate. A prebaked image can remove package-install time, but the first-boot site/profile bootstrap still owns host-specific fabric, bridge, and scheduler readiness.
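
A hedged operator spot-check of the host-local portion of this gate (the MAAS profile tag and topology report are checked control-plane side); the unit names come from the list above, while FABRIC_ROUTE and GUEST_GW are placeholder values.

```bash
#!/usr/bin/env bash
# Spot-check the host-local slice-ready conditions. FABRIC_ROUTE and
# GUEST_GW are placeholders; unit names are from the gate list above.
set -u
FABRIC_ROUTE="${FABRIC_ROUTE:-10.0.0.0/16}"   # placeholder site fabric route
GUEST_GW="${GUEST_GW:-192.168.100.1}"         # placeholder guest subnet gateway

fail=0
for unit in gpuaas-slice-fabric-vfs gpuaas-slice-vfio-devices \
            gpuaas-slice-network-baseline gpuaas-slice-ovs-bridge \
            dnsmasq libvirtd; do
  systemctl is-active --quiet "$unit" || { echo "NOT ACTIVE: $unit"; fail=1; }
done

# IPoIB fabric address on an ibp* interface and an installed fabric route.
ip -br addr show | grep -q '^ibp' || { echo "no ibp* fabric address"; fail=1; }
ip route show "$FABRIC_ROUTE" | grep -q . || { echo "no fabric route"; fail=1; }

# ovsbr0 must carry the guest subnet gateway.
ip -br addr show dev ovsbr0 2>/dev/null | grep -q "$GUEST_GW" \
  || { echo "ovsbr0 missing guest gateway $GUEST_GW"; fail=1; }

exit "$fail"
```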

j22u11 showed that a node can be active while slice slots are still not reported. That makes slot discovery/reporting a distinct performance and correctness gate, not just part of generic onboarding completion.

2. Site/profile bootstrap optimization

The site/profile bootstrap bundle should stay infra-owned or tightly controlled, but it should be minimal.

It should do only what must remain site-specific at first boot:

  1. network/site wiring,
  2. site CA or trust bundle placement,
  3. hardware/site toggles not safe to bake globally,
  4. bootstrap launcher execution with versioned inputs,
  5. small site-specific configuration files.

It should not become the place where:

  1. Docker or large runtime packages are installed every time,
  2. full hardware-sync reseeds run without reason,
  3. ad hoc remote shell logic accumulates.

3. GPUaaS bootstrap optimization

GPUaaS bootstrap should be limited to:

  1. validating rendered bootstrap inputs,
  2. installing or starting node-agent if not already baked,
  3. enrolling with bootstrap/enrollment tokens,
  4. writing minimal control-plane managed config,
  5. confirming the first successful node heartbeat.

It should not own:

  1. large package installation by default,
  2. long-lived SSH-based host mutation,
  3. recurring repair logic after enrollment.

4. Post-enrollment repair split

Anything that is:

  1. repeatable,
  2. bounded,
  3. useful after the node is already under control,

should move to typed node-agent tasks instead of first boot.

Examples:

  1. runtime prerequisite drift repair,
  2. registry trust refresh,
  3. optional package/runtime updates,
  4. diagnostics collection,
  5. controlled service lifecycle actions.

This keeps MAAS first-boot responsibility smaller and faster.

Fastest Safe Path

The best near-term optimization path is:

  1. build a curated MAAS image with the heavy common packages already installed,
  2. keep site/profile bootstrap focused on environment configuration,
  3. keep GPUaaS bootstrap focused on enrollment,
  4. move any recurring post-enrollment setup into typed node tasks,
  5. preflight all bootstrap inputs before MAAS deploy starts.

This is the highest-signal path because it reduces time in the longest repeated steps without changing the public lifecycle model.

Predeploy Hardening

Before deploy starts, validate:

  1. bootstrap endpoint reachability from the MAAS-served node network,
  2. exact bootstrap artifact/package URL fetchability,
  3. CA bundle content is valid and final, not a template placeholder,
  4. bootstrap-token and enrollment-token persistence is internally consistent,
  5. site/profile bootstrap inputs resolve cleanly,
  6. selected image/profile version exists and is approved for the site.

The point is to fail before deploy when the inputs are known-bad instead of waiting for a full deploy plus cloud-init failure.
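
A minimal preflight sketch of checks 1 through 3; BOOTSTRAP_URL, ARTIFACT_URL, and CA_BUNDLE are placeholder inputs, and the real workflow runs these from the MAAS-served node network.

```bash
#!/usr/bin/env bash
# Preflight sketch for checks 1-3 above. BOOTSTRAP_URL, ARTIFACT_URL, and
# CA_BUNDLE are placeholder inputs, not the workflow's real names.
set -euo pipefail
BOOTSTRAP_URL="${BOOTSTRAP_URL:?set to the bootstrap endpoint}"
ARTIFACT_URL="${ARTIFACT_URL:?set to the exact bootstrap package URL}"
CA_BUNDLE="${CA_BUNDLE:?set to the rendered CA bundle path}"

# 1-2: endpoint and exact artifact URL must be fetchable before deploy.
curl -fsS --max-time 10 -o /dev/null "$BOOTSTRAP_URL"
curl -fsSI --max-time 10 -o /dev/null "$ARTIFACT_URL"

# 3: CA bundle must be final (no template placeholders) and parse as a
# currently valid certificate.
if grep -q '{{' "$CA_BUNDLE"; then
  echo "CA bundle still contains template placeholders" >&2
  exit 1
fi
openssl x509 -in "$CA_BUNDLE" -noout -checkend 0

echo "preflight OK"
```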

MAAS Mutation Timeout Rule

Some MAAS mutation calls can time out after MAAS has accepted the operation. Those failures must not be treated as terminal until the workflow verifies the machine's final state.

For long-running state transitions, use this rule:

  1. submit mutation,
  2. if the request succeeds, poll for the target state,
  3. if the request times out, record an unknown-outcome progress event and poll for the target state,
  4. fail only if the follow-up poll reaches a MAAS failure state or expires.
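
A minimal sketch of that rule, assuming the stock maas CLI with an "admin" profile and jq available; the timeout and poll intervals are illustrative values only.

```bash
# Sketch of the submit-then-verify rule. Assumes a "maas" CLI profile named
# "admin" and jq; timeout/poll intervals are illustrative values only.
submit_and_verify() {
  local machine="$1" action="$2" target="$3" state
  local deadline=$((SECONDS + 1800))

  if ! timeout 120 maas admin machine "$action" "$machine" > /dev/null; then
    # The request timed out after MAAS may have accepted it: record an
    # unknown-outcome event and fall through to the state poll.
    echo "unknown-outcome: $action on $machine timed out; polling for $target"
  fi

  while (( SECONDS < deadline )); do
    state=$(maas admin machine read "$machine" | jq -r '.status_name')
    if [[ "$state" == "$target" ]]; then return 0; fi
    if [[ "$state" == Failed* ]]; then return 1; fi  # terminal MAAS failure
    sleep 15
  done
  return 1  # poll window expired without reaching the target state
}
# e.g. submit_and_verify "$SYSTEM_ID" release Ready
```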

Currently covered transitions:

  1. onboarding release deployed machine to Ready,
  2. onboarding retry release to Ready,
  3. decommission release to Ready,
  4. onboarding commission to Ready,
  5. onboarding deploy to Deployed,
  6. decommission reimage deploy to Deployed.

Short configuration mutations such as boot-disk selection, storage layout, and RoCE link assignment still need explicit post-write verification before they can use this timeout rule safely.

Post-Deploy Completion Gates

The node should not transition to active until all are true:

  1. MAAS reports Deployed,
  2. site bootstrap completed successfully,
  3. GPUaaS bootstrap completed successfully,
  4. node-agent started,
  5. first enrollment succeeded,
  6. node row reached the expected active state,
  7. onboarding attempt metadata reflects the currently executing run.

These gates should remain explicit even if image/bootstrap work becomes faster.
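
One way to keep the gate explicit is a single conjunction over named checks, as sketched below; every helper name here is a hypothetical stand-in for the workflow's real checks.

```bash
# All gate helper names below are hypothetical stand-ins for the
# workflow's real checks; the point is the explicit conjunction.
node_may_go_active() {
  local node="$1"
  maas_reports_deployed "$node" &&
    site_bootstrap_succeeded "$node" &&
    gpuaas_bootstrap_succeeded "$node" &&
    node_agent_started "$node" &&
    first_enrollment_succeeded "$node" &&
    node_row_active "$node" &&
    attempt_metadata_current "$node"
}
```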

Image Strategy Options

Option A: Generic Ubuntu + first-boot install

Pros:

  1. lowest image maintenance,
  2. easiest early experimentation.

Cons:

  1. slowest path,
  2. highest dependence on mirrors/network/package repos during deploy,
  3. largest first-boot variance.

This should not be the preferred production direction.

Option B: Curated MAAS base image per host/profile family

Pros:

  1. best balance of speed and control,
  2. predictable boot/runtime environment,
  3. first boot stays small.

Cons:

  1. image pipeline required,
  2. versioning and rollout discipline required.

This is the recommended default.

Option C: Fully site-custom image per site

Pros:

  1. maximum local optimization.

Cons:

  1. image sprawl,
  2. hard-to-debug drift,
  3. poor reproducibility across sites.

Avoid unless a site truly requires it.

Suggested Ownership Split

GPUaaS engineering owns:

  1. image/version policy,
  2. which prerequisites are image-level versus bootstrap-level,
  3. bootstrap token/enrollment flow correctness,
  4. typed post-enrollment repair tasks,
  5. timing instrumentation and read models.

Infra/MAAS operators own:

  1. MAAS image upload/availability process,
  2. site-local bootstrap bundle content where infra-owned,
  3. mirror/package source reliability,
  4. MAAS network/PXE health.

Instrumentation Requirements

Add or preserve timeline evidence for:

  1. image selected,
  2. deploy started,
  3. deploy finished,
  4. first boot observed,
  5. site bootstrap started/finished,
  6. GPUaaS bootstrap started/finished,
  7. node-agent enrollment started/finished,
  8. node became active.

Optimization work should not proceed without timing visibility for these stages.
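
A minimal emitter sketch for these events, pairing with the bucket analysis earlier; the log path and line layout are illustrative, not the control plane's real timeline schema.

```bash
# Illustrative emitter: the log path and line layout are placeholders,
# not the control plane's real timeline schema.
emit_stage_event() {
  local node="$1" stage="$2"   # e.g. deploy_started, site_bootstrap_finished
  printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$node" "$stage" \
    >> /var/log/gpuaas/provisioning-timeline.log
}

emit_stage_event j22u11 deploy_started
```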

Slice 1: Measure current path precisely

Before changing behavior broadly:

  1. capture timing for the buckets listed above on the current MAAS flow,
  2. identify which steps dominate the wall clock,
  3. record whether failures cluster around image deploy, cloud-init, package install, or enrollment.

Slice 2: Prebaked image experiment

Run a controlled comparison between:

  1. current image + full first-boot path,
  2. prebaked MAAS image with common prerequisites already installed.

Compare:

  1. deploy time,
  2. boot-to-enrollment time,
  3. failure rate,
  4. operational complexity.

Implementation entry point:

scripts/ops/gpuaas_maas_h200_host_image_platform_control_pipeline.sh

The pipeline builds on platform-control and uploads custom/gpuaas-h200-host-ubuntu2404 to MAAS. Use the current j22u11 run as the baseline before switching the host/profile to the prebaked image.

2026-04-21 update: the custom image must be a MAAS tgz root filesystem archive, not a ddgz full-disk artifact, for the current flat-storage deploy path. The ddgz experiment reached curtin but failed at GRUB/EFI because the disk image root and MAAS-created EFI partition were not aligned. The tgz artifact lets MAAS own partitioning and completed deployment successfully.
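
For reference, the tgz upload step reduces to a standard MAAS boot-resources call roughly like the one below; the local filename is illustrative and the real pipeline wraps this with its own build and verification steps.

```bash
# Illustrative upload of the tgz root filesystem archive; the local
# filename is a placeholder and the pipeline adds its own checks.
maas admin boot-resources create \
  name=custom/gpuaas-h200-host-ubuntu2404 \
  title="GPUaaS H200 host (Ubuntu 24.04)" \
  architecture=amd64/generic \
  filetype=tgz \
  content@=gpuaas-h200-host-ubuntu2404-rootfs.tgz
```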

The first successful j22u11 custom-image run still spent most time in first-boot site bootstrap. The next image build therefore bakes the H200 site-bootstrap base package set and best-effort DOCA/OFED packages into the image, and the h200-ib site bootstrap skips those apt-heavy phases when /etc/gpuaas/maas-host-image-build.json is present and the expected packages are installed.

2026-04-26 follow-up: a later custom-image reimage of j22u11 proved that node active/enrolled can pass while slice-ready is still false. The host had node-agent, Netdata, VFIO, and OVS, but did not have the IPoIB netplan, fabric route, or gpuaas-slice-network-baseline service present on the known-good j22u15 slice host. The optimization path must therefore preserve or compose the h200-ib fabric-network baseline into the h200-slice-vm path before the custom image can be considered production-ready for slice scheduling.

Slice 3: Bootstrap minimization

Reduce first-boot cloud-init/bootstrap to:

  1. environment wiring,
  2. token materialization,
  3. node-agent start/enroll,
  4. final readiness report.
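
A sketch of what that minimized first-boot path could reduce to; every path, unit name, and variable here is a hypothetical stand-in, not the real bootstrap contract.

```bash
#!/usr/bin/env bash
# Hypothetical minimized first boot: paths, unit names, and variables are
# stand-ins, not the real bootstrap contract.
set -euo pipefail
install -d -m 0755 /etc/gpuaas

# 1. environment wiring: small site-specific config only.
install -m 0644 /usr/local/share/gpuaas/site-bootstrap/site.env /etc/gpuaas/site.env

# 2. token materialization: place the rendered enrollment token.
install -m 0600 /run/gpuaas/enrollment-token /etc/gpuaas/enrollment-token

# 3. node-agent start/enroll: the binary is already baked into the image.
systemctl enable --now gpuaas-node-agent

# 4. final readiness report when a progress URL was rendered.
if [[ -n "${PROGRESS_URL:-}" ]]; then
  curl -fsS -X POST "$PROGRESS_URL" \
    -d '{"stage":"first_boot","status":"succeeded"}'
fi
```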

Slice 4: Post-enrollment drift repair extraction

Move anything still done in first boot but not required for first enrollment into typed node-agent tasks.

Explicit Non-Goals

  1. replacing the MAAS lifecycle contract,
  2. weakening safety checks just to improve timing,
  3. coupling public endpoint/network design into MAAS optimization,
  4. making the MAAS bootstrap path a generic shell execution channel,
  5. turning platform API services into a data-plane proxy.

Summary

The strongest provisioning-time reduction path is not more boot-time scripting. It is:

  1. prebuilt images for heavy/static prerequisites,
  2. smaller site/profile bootstrap,
  3. minimal GPUaaS bootstrap,
  4. typed post-enrollment repair,
  5. better timing instrumentation and predeploy validation.

That path keeps the control-plane model intact while cutting repeated work from the critical path.