
H200 MAAS Fit Analysis v1

Status: comparative analysis only; no scope change is proposed in this document.

1. Goal

This document compares the newly received H200 MAAS automation bundle in:

  • /Volumes/DataExt2TB/dev/h200/maas-automation-TG-main

against the current GPUaaS MAAS design and implementation to answer one question:

  • how well does the current GPUaaS model fit this environment without changing core scope?

This is a fit analysis, not a redesign proposal.

2. Current GPUaaS Intent

The current GPUaaS MAAS direction is:

  • GPUaaS owns MAAS-backed node lifecycle orchestration.
  • GPUaaS does not own MAAS server bootstrap and tuning.
  • GPUaaS is data-model driven, not file/CSV/script driven.
  • Site-specific node personality should be delivered through controlled site/profile inputs and bootstrap artifacts, not copied into core logic.

Relevant current code and design surfaces (for example, packages/services/maas/service.go and the onboarding flow in execution.go) are cited per row in the fit matrix below.

3. What The H200 Bundle Contains

The H200 bundle is not just a node deploy script. It includes four layers:

  1. MAAS server bootstrap and tuning
     • install_maas_3_7.sh
     • deploy-anywhere.sh
     • tune_maas.sh
     • postgres_tuning.sh

  2. Site/network/operator configuration
     • site.env
     • inventory files such as B/bnodes
     • roce_ips.csv

  3. MAAS node lifecycle scripts
     • onboard_node.sh
     • deploy-h200.sh
     • release_node.sh
     • apply_roce_phase2.sh

  4. Vendor and site personality payloads
     • cloudinit
     • commissioning scripts
     • firmware/testing helpers
     • hardware-sync enforcement helpers
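The file-driven inputs in layer 2 are worth contrasting with the GPUaaS data-model approach. As an illustration only (the actual roce_ips.csv column layout is not confirmed here; the hostname,ip shape below is an assumption), a minimal sketch of turning such a CSV into typed records of the kind a DB-backed maas_roce_assignments table would hold:

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class RoceAssignment:
    """One phase-2 RoCE assignment, as a DB-backed row would model it."""
    hostname: str
    roce_ip: str


def parse_roce_csv(text: str) -> list:
    """Parse a hypothetical hostname,ip CSV into typed assignment records."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        if not rec or rec[0].startswith("#"):
            continue  # skip blank lines and comments
        rows.append(RoceAssignment(hostname=rec[0].strip(), roce_ip=rec[1].strip()))
    return rows


# Hypothetical sample content; the real roce_ips.csv columns may differ.
sample = "# node,roce_ip\nh200-01,172.16.0.11\nh200-02,172.16.0.12\n"
assignments = parse_roce_csv(sample)
```

The point of the sketch: once the rows are typed records, they can live behind an API and a schema migration instead of a shell script's positional CSV parsing.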

4. Fit Matrix

Legend:

  • Fits as-is: current GPUaaS model already covers this cleanly
  • Fits via existing extension point: current model supports it, but through a site/profile/bootstrap artifact path rather than direct core logic
  • Partial: concept exists, but current platform surface is thinner than the H200 bundle
  • Out of scope: should remain external to current GPUaaS core scope
| H200 bundle capability | H200 source | GPUaaS original design fit | Current support | Notes |
| --- | --- | --- | --- | --- |
| MAAS server install/bootstrap | install_maas_3_7.sh, deploy-anywhere.sh | Out of scope | No | GPUaaS assumes an existing MAAS service; this should remain infra-owned. |
| MAAS server tuning | tune_maas.sh, postgres_tuning.sh | Out of scope | No | Same as above; the platform consumes MAAS, it does not install/tune it. |
| Site network baseline on MAAS host | site.env, VLAN/DHCP/static pool setup | Out of scope | No | GPUaaS stores site/profile metadata, but does not configure the MAAS host fabric itself. |
| Site credentials and MAAS profile binding | site.env, apikey | Fits as-is | Yes | Covered by Site, SiteProfile, Vault-backed credentials in packages/services/maas/service.go. |
| Resolve machine by hostname/IPMI and create if missing | onboard_node.sh | Fits as-is | Yes | Covered in the onboarding resolve/create flow in execution.go. |
| Accept/commission machine | onboard_node.sh | Fits as-is | Yes | Covered by the onboarding commission stage. |
| PXE preflight and interface safety checks | onboard_node.sh, deploy-h200.sh | Fits as-is | Yes | Covered by PXEPreflight stages in onboarding and decommission/reimage. |
| Boot disk selection and flat storage layout | onboard_node.sh, deploy-h200.sh | Fits as-is | Yes | Covered by the prepare_storage stage and BOSS disk detection. |
| Phase-2 RoCE assignment from site data | apply_roce_phase2.sh, roce_ips.csv | Fits as-is | Yes | GPUaaS already has DB-backed maas_roce_assignments; the current model is stronger than CSV scripts. |
| MAAS release to Ready | release_node.sh | Fits as-is | Yes | Covered by decommission execution. |
| MAAS power off after release | release_node.sh | Fits as-is | Yes | Covered for full_decommission mode. |
| GPUaaS-only node record cleanup after MAAS repurpose | not in H200 bundle, but needed operationally | Fits as-is | Yes | Covered by the new GPUaaS-only detach flow and the decommission remove-node path. |
| Batch onboarding from CSV | source.csv, onboard_node.sh batch mode | Partial | Yes | GPUaaS supports batch onboarding, but the source of truth is API/data driven rather than CSV import tooling. |
| Deploy cloud-init payload | cloudinit, deploy-h200.sh | Fits via existing extension point | Partial | GPUaaS has site bootstrap bundle support, but the packaging/materialization workflow is still thin. |
| Minimal node bootstrap + enrollment into GPUaaS | H200 cloud-init equivalent | Fits as-is | Yes | Covered by the node bootstrap token + enrollment model and MAAS deploy cloud-init generation. |
| Site bootstrap bundle fetched from API/registry | not in H200 scripts; they use files | Fits via existing extension point | Partial | Current GPUaaS has the hook, but it is not yet a polished site-ops publishing workflow. |
| Post-deploy login user/password policy | H200 cloud-init sets hpcadmin and sudo | Fits via existing extension point | Partial | GPUaaS can carry this in a site bootstrap artifact, but it is not a first-class profile contract today. |
| Non-root disk partitioning/mounting as /shareN | H200 cloud-init | Fits via existing extension point | Partial | This is node personality; the current model can host it in site bundle/cloud-init, but does not model it explicitly. |
| Vendor driver and package stack install | Lambda stack, DOCA/OFED, rshim | Fits via existing extension point | Partial | Should be site/bootstrap artifact content, not GPUaaS core logic. The current system can inject it, but does not manage its lifecycle cleanly yet. |
| Hardware-specific low-level host tweaks | ACS disable service | Fits via existing extension point | Partial | Same category as above; should stay artifact/script driven, not embedded in core. |
| Enforce MAAS hardware sync policy | ensure_hw_sync.sh, timers | Fits as-is | Yes | GPUaaS already carries hardware-sync policy fields and completion guards. |
| Install/start maas-agent on deployed node | H200 cloud-init | Partial | Partial | GPUaaS validates hardware sync outcomes, but explicit deployed-node maas-agent bootstrap is still site-bootstrap content rather than first-class managed behavior. |
| Commissioning/testing script registration in MAAS | register_node_scripts.sh, commissioning-scripts/* | Partial | No | This is not represented as a first-class GPUaaS-managed artifact set today. |
| Firmware workflows for deployed nodes | run_deployed_fw_updates.sh | Partial | No | Operationally useful, but not currently in core scope. Could be modeled later as site tooling or MAAS script assets. |
| MAAS event alerting/Telegram/email | monitoring-email/* | Out of scope | No | Useful ops tooling, but not part of the current GPUaaS control-plane scope. |
| File-based operator inventory and environment model | site.env, B/bnodes, CSVs | Out of scope | No | GPUaaS intentionally uses DB/API models instead. This is a design difference, not necessarily a gap. |
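For a quick sanity check, the fit column of the matrix above can be tallied programmatically. This sketch simply encodes each row's fit label in row order and counts the categories; it is bookkeeping over the table's own content, nothing more:

```python
from collections import Counter

# Legend categories from the fit matrix.
FITS = "Fits as-is"
EXT = "Fits via existing extension point"
PARTIAL = "Partial"
OOS = "Out of scope"

# One entry per matrix row, in row order.
matrix_fits = [
    OOS,      # MAAS server install/bootstrap
    OOS,      # MAAS server tuning
    OOS,      # Site network baseline on MAAS host
    FITS,     # Site credentials and MAAS profile binding
    FITS,     # Resolve machine by hostname/IPMI
    FITS,     # Accept/commission machine
    FITS,     # PXE preflight and interface safety checks
    FITS,     # Boot disk selection and flat storage layout
    FITS,     # Phase-2 RoCE assignment from site data
    FITS,     # MAAS release to Ready
    FITS,     # MAAS power off after release
    FITS,     # GPUaaS-only node record cleanup
    PARTIAL,  # Batch onboarding from CSV
    EXT,      # Deploy cloud-init payload
    FITS,     # Minimal node bootstrap + enrollment
    EXT,      # Site bootstrap bundle from API/registry
    EXT,      # Post-deploy login user/password policy
    EXT,      # Non-root disk partitioning as /shareN
    EXT,      # Vendor driver and package stack install
    EXT,      # Hardware-specific low-level host tweaks
    FITS,     # Enforce MAAS hardware sync policy
    PARTIAL,  # maas-agent on deployed node
    PARTIAL,  # Commissioning script registration in MAAS
    PARTIAL,  # Firmware workflows for deployed nodes
    OOS,      # MAAS event alerting/Telegram/email
    OOS,      # File-based operator inventory model
]

tally = Counter(matrix_fits)
```

The tally (11 as-is, 6 via extension point, 4 partial, 5 out of scope) is what the summary in section 5 rests on, and keeping it as data makes it cheap to re-verify as rows change.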

5. Summary By Area

5.1 Strong alignment with original design

These H200 capabilities already fit the original GPUaaS MAAS model very well:

  • resolve or create machine
  • commission machine
  • PXE preflight
  • BOSS/boot disk selection
  • flat storage layout
  • phase-2 RoCE assignment
  • MAAS deploy/release/reimage lifecycle
  • node bootstrap and enrollment
  • hardware-sync enforcement
  • full decommission versus reimage split

This means the orchestration model itself is not far off.
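The aligned lifecycle above can be pictured as an ordered stage pipeline. A minimal sketch, where the stage names are illustrative labels derived from the capabilities in 5.1, not the platform's actual stage identifiers:

```python
from typing import Optional

# Illustrative onboarding stage order, derived from section 5.1.
# These labels are NOT the real GPUaaS stage identifiers.
ONBOARDING_STAGES = [
    "resolve_or_create",    # resolve machine by hostname/IPMI, create if missing
    "commission",           # accept/commission in MAAS
    "pxe_preflight",        # PXE and interface safety checks
    "prepare_storage",      # BOSS/boot disk selection + flat layout
    "deploy",               # MAAS deploy with cloud-init payload
    "roce_phase2",          # phase-2 RoCE assignment from site data
    "hardware_sync_guard",  # hardware-sync policy enforcement
]


def next_stage(completed: list) -> Optional[str]:
    """Return the next pending stage, assuming stages complete strictly in order."""
    for stage in ONBOARDING_STAGES:
        if stage not in completed:
            return stage
    return None  # pipeline finished
```

Modeling the flow this way is only meant to show why the fit is good: the H200 scripts and the GPUaaS stages walk the same ordered sequence, so mapping one onto the other is mostly a naming exercise.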

5.2 The main gap is not lifecycle logic

The biggest difference is the H200 kit's heavy use of site-specific payloads:

  • cloud-init content
  • vendor driver installation
  • host personality shaping
  • firmware/test script bundles

GPUaaS has an intended place for this:

  • site/profile policy
  • site bootstrap bundle
  • node bootstrap package

But the packaging and publication workflow for those artifacts is still thinner than what the H200 shell kit provides.
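One way to make that artifact workflow less thin is to treat the site bundle as a versioned, checksummed manifest rather than loose files on a MAAS host. A hedged sketch of the idea; all field names here are hypothetical, not the current GPUaaS schema:

```python
import hashlib
import json


def build_bundle_manifest(site: str, version: str, artifacts: dict) -> str:
    """Build a JSON manifest for a site bootstrap bundle.

    `artifacts` maps artifact name -> payload bytes; each entry is recorded
    with a sha256 digest so a deployed node can verify what it fetched.
    Field names are hypothetical, for illustration only.
    """
    entries = {
        name: {"sha256": hashlib.sha256(payload).hexdigest(), "bytes": len(payload)}
        for name, payload in artifacts.items()
    }
    return json.dumps(
        {"site": site, "version": version, "artifacts": entries},
        sort_keys=True,
    )


# Illustrative payloads standing in for cloud-init and driver-install content.
manifest = build_bundle_manifest(
    "h200-site-a",
    "2024.1",
    {"cloud-init.yaml": b"#cloud-config\n", "install-drivers.sh": b"#!/bin/sh\n"},
)
```

A manifest like this would let the vendor/driver/personality payloads in the H200 kit be published, versioned, and verified through the existing site bootstrap bundle hook without pulling their content into core logic.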

5.3 Intentional scope differences

The following are not evidence that GPUaaS is behind the design. They are scope differences:

  • MAAS server bootstrap
  • MAAS server tuning
  • Telegram/email alerting for MAAS host
  • env-file and CSV-driven operator workflow

Those remain infra/ops concerns in the original GPUaaS model.

6. Bottom-Line Assessment

Against the original design, the H200 bundle:

  • fits well for lifecycle orchestration,
  • fits partially for site-specific bootstrap and post-deploy personality,
  • should remain external for MAAS server bootstrap/tuning and ad hoc operator shell tooling.

So the platform is not far off in orchestration design.

The main thing still under-realized is:

  • a robust site bootstrap bundle / artifact workflow for vendor- and site-specific node personality.

That is different from saying GPUaaS should absorb the H200 scripts wholesale. It should not.

Before changing any design or implementation, validate the H200 drop against these three questions:

  1. Which parts of cloudinit are truly required for every H200 deployment?
  2. Which parts belong in a site bootstrap bundle versus a future commissioning artifact set?
  3. Which scripts are just MAAS-host ops conveniences and should never move into GPUaaS core?

If those three are answered, the remaining delta can be classified without changing platform scope.