H200 MAAS Fit Analysis v1¶
Status: comparative analysis only; no scope change is proposed in this document.
1. Goal¶
This document compares the newly received H200 MAAS automation bundle in:
/Volumes/DataExt2TB/dev/h200/maas-automation-TG-main
against the current GPUaaS MAAS design and implementation to answer one question:
- how well does the current GPUaaS model fit this environment without changing core scope?
This is a fit analysis, not a redesign proposal.
2. Current GPUaaS Intent¶
The current GPUaaS MAAS direction is:
- GPUaaS owns MAAS-backed node lifecycle orchestration.
- GPUaaS does not own MAAS server bootstrap and tuning.
- GPUaaS is data-model driven, not file/CSV/script driven.
- Site-specific node personality should be delivered through controlled site/profile inputs and bootstrap artifacts, not copied into core logic.
Relevant current code and design surfaces:
- packages/services/maas/service.go
- packages/services/maas/onboarding.go
- packages/services/maas/execution.go
- packages/services/maas/decommission_execution.go
- packages/services/maas/roce.go
- Node_Agent_Spec.md
3. What The H200 Bundle Contains¶
The H200 bundle is not just a node deploy script. It includes four layers:
- MAAS server bootstrap and tuning
  - install_maas_3_7.sh
  - deploy-anywhere.sh
  - tune_maas.sh
  - postgres_tuning.sh
- Site/network/operator configuration
  - site.env
  - inventory files such as B/bnodes
  - roce_ips.csv
- MAAS node lifecycle scripts
  - onboard_node.sh
  - deploy-h200.sh
  - release_node.sh
  - apply_roce_phase2.sh
- Vendor and site personality payloads
  - cloudinit
  - commissioning scripts
  - firmware/testing helpers
  - hardware-sync enforcement helpers
4. Fit Matrix¶
Legend:
- Fits as-is: current GPUaaS model already covers this cleanly
- Fits via existing extension point: current model supports it, but through a site/profile/bootstrap artifact path rather than direct core logic
- Partial: concept exists, but current platform surface is thinner than the H200 bundle
- Out of scope: should remain external to current GPUaaS core scope
| H200 bundle capability | H200 source | GPUaaS original design fit | Current support | Notes |
|---|---|---|---|---|
| MAAS server install/bootstrap | install_maas_3_7.sh, deploy-anywhere.sh | Out of scope | No | GPUaaS assumes an existing MAAS service; this should remain infra-owned. |
| MAAS server tuning | tune_maas.sh, postgres_tuning.sh | Out of scope | No | Same as above; the platform consumes MAAS, it does not install/tune it. |
| Site network baseline on MAAS host | site.env, VLAN/DHCP/static pool setup | Out of scope | No | GPUaaS stores site/profile metadata, but does not configure the MAAS host fabric itself. |
| Site credentials and MAAS profile binding | site.env, apikey | Fits as-is | Yes | Covered by Site, SiteProfile, Vault-backed credentials in packages/services/maas/service.go. |
| Resolve machine by hostname/IPMI and create if missing | onboard_node.sh | Fits as-is | Yes | Covered by the onboarding resolve/create flow in execution.go. |
| Accept/commission machine | onboard_node.sh | Fits as-is | Yes | Covered by the onboarding commission stage. |
| PXE preflight and interface safety checks | onboard_node.sh, deploy-h200.sh | Fits as-is | Yes | Covered by PXEPreflight stages in onboarding and decommission/reimage. |
| Boot disk selection and flat storage layout | onboard_node.sh, deploy-h200.sh | Fits as-is | Yes | Covered by the prepare_storage stage and BOSS disk detection. |
| Phase-2 RoCE assignment from site data | apply_roce_phase2.sh, roce_ips.csv | Fits as-is | Yes | GPUaaS already has DB-backed maas_roce_assignments; the current model is stronger than CSV scripts. |
| MAAS release to Ready | release_node.sh | Fits as-is | Yes | Covered by decommission execution. |
| MAAS power off after release | release_node.sh | Fits as-is | Yes | Covered for full_decommission mode. |
| GPUaaS-only node record cleanup after MAAS repurpose | not in H200 bundle, but needed operationally | Fits as-is | Yes | Covered by the new GPUaaS-only detach flow and the decommission remove-node path. |
| Batch onboarding from CSV | source.csv, onboard_node.sh batch mode | Partial | Yes | GPUaaS supports batch onboarding, but the source of truth is API/data driven rather than CSV import tooling. |
| Deploy cloud-init payload | cloudinit, deploy-h200.sh | Fits via existing extension point | Partial | GPUaaS has site bootstrap bundle support, but the packaging/materialization workflow is still thin. |
| Minimal node bootstrap + enrollment into GPUaaS | H200 cloud-init equivalent | Fits as-is | Yes | Covered by the node bootstrap token + enrollment model and MAAS deploy cloud-init generation. |
| Site bootstrap bundle fetched from API/registry | not in H200 scripts; they use files | Fits via existing extension point | Partial | Current GPUaaS has the hook, but it is not yet a polished site-ops publishing workflow. |
| Post-deploy login user/password policy | H200 cloud-init sets hpcadmin and sudo | Fits via existing extension point | Partial | GPUaaS can carry this in a site bootstrap artifact, but it is not a first-class profile contract today. |
| Non-root disk partitioning/mounting as /shareN | H200 cloud-init | Fits via existing extension point | Partial | This is node personality; the current model can host it in a site bundle/cloud-init, but does not model it explicitly. |
| Vendor driver and package stack install | Lambda stack, DOCA/OFED, rshim | Fits via existing extension point | Partial | Should be site/bootstrap artifact content, not GPUaaS core logic. The current system can inject it, but does not yet manage its lifecycle cleanly. |
| Hardware-specific low-level host tweaks | ACS disable service | Fits via existing extension point | Partial | Same category as above; should stay artifact/script driven, not embedded in core. |
| Enforce MAAS hardware sync policy | ensure_hw_sync.sh, timers | Fits as-is | Yes | GPUaaS already carries hardware-sync policy fields and completion guards. |
| Install/start maas-agent on deployed node | H200 cloud-init | Partial | Partial | GPUaaS validates hardware-sync outcomes, but the explicit deployed-node maas-agent bootstrap behavior is still site-bootstrap content rather than first-class managed behavior. |
| Commissioning/testing script registration in MAAS | register_node_scripts.sh, commissioning-scripts/* | Partial | No | Not represented as a first-class GPUaaS-managed artifact set today. |
| Firmware workflows for deployed nodes | run_deployed_fw_updates.sh | Partial | No | Operationally useful, but not currently in core scope; could later be modeled as site tooling or MAAS script assets. |
| MAAS event alerting/Telegram/email | monitoring-email/* | Out of scope | No | Useful ops tooling, but not part of the current GPUaaS control-plane scope. |
| File-based operator inventory and environment model | site.env, B/bnodes, CSVs | Out of scope | No | GPUaaS intentionally uses DB/API models instead. This is a design difference, not necessarily a gap. |
5. Summary By Area¶
5.1 Strong alignment with original design¶
These H200 capabilities already fit the original GPUaaS MAAS model very well:
- resolve or create machine
- commission machine
- PXE preflight
- BOSS/boot disk selection
- flat storage layout
- phase-2 RoCE assignment
- MAAS deploy/release/reimage lifecycle
- node bootstrap and enrollment
- hardware-sync enforcement
- full decommission versus reimage split
This means the orchestration model itself is not far off.
5.2 The main gap is not lifecycle logic¶
The biggest difference is the H200 kit's heavy use of site-specific payloads:
- cloud-init content
- vendor driver installation
- host personality shaping
- firmware/test script bundles
GPUaaS has an intended place for this:
- site/profile policy
- site bootstrap bundle
- node bootstrap package
But the packaging and publication workflow for those artifacts is still thinner than the H200 shell kit.
5.3 Intentional scope differences¶
The following are not evidence that GPUaaS is behind the design. They are scope differences:
- MAAS server bootstrap
- MAAS server tuning
- Telegram/email alerting for MAAS host
- env-file and CSV-driven operator workflow
Those remain infra/ops concerns in the original GPUaaS model.
6. Bottom-Line Assessment¶
Against the original design, the H200 bundle:
- fits well for lifecycle orchestration,
- fits partially for site-specific bootstrap and post-deploy personality,
- should remain external for MAAS server bootstrap/tuning and ad hoc operator shell tooling.
So the platform is not far off in orchestration design.
The main thing still under-realized is:
- a robust site bootstrap bundle / artifact workflow for vendor- and site-specific node personality.
That is different from saying GPUaaS should absorb the H200 scripts wholesale. It should not.
7. Recommended Next Validation¶
Before changing any design or implementation, validate the H200 drop against these three questions:
- Which parts of cloudinit are truly required for every H200 deployment?
- Which parts belong in a site bootstrap bundle versus a future commissioning artifact set?
- Which scripts are just MAAS-host ops conveniences and should never move into GPUaaS core?
If those three are answered, the remaining delta can be classified without changing platform scope.