
H200 MAAS Fit Analysis v1

Status: comparative analysis only; no scope change is proposed in this document.

1. Goal

This document compares the newly received H200 MAAS automation bundle in:

  • /Volumes/DataExt2TB/dev/h200/maas-automation-TG-main

against the current GPUaaS MAAS design and implementation to answer one question:

  • how well does the current GPUaaS model fit this environment without changing core scope?

This is a fit analysis, not a redesign proposal.

2. Current GPUaaS Intent

The current GPUaaS MAAS direction is:

  • GPUaaS owns MAAS-backed node lifecycle orchestration.
  • GPUaaS does not own MAAS server bootstrap and tuning.
  • GPUaaS is data-model driven, not file/CSV/script driven.
  • Site-specific node personality should be delivered through controlled site/profile inputs and bootstrap artifacts, not copied into core logic.

Relevant current code and design surfaces (for example, packages/services/maas/service.go and the onboarding flow in execution.go) are cited per row in the fit matrix below.

3. What The H200 Bundle Contains

The H200 bundle is not just a node deploy script. It includes four layers:

  1. MAAS server bootstrap and tuning
     • install_maas_3_7.sh
     • deploy-anywhere.sh
     • tune_maas.sh
     • postgres_tuning.sh

  2. Site/network/operator configuration
     • site.env
     • inventory files such as B/bnodes
     • roce_ips.csv

  3. MAAS node lifecycle scripts
     • onboard_node.sh
     • deploy-h200.sh
     • release_node.sh
     • apply_roce_phase2.sh

  4. Vendor and site personality payloads
     • cloudinit
     • commissioning scripts
     • firmware/testing helpers
     • hardware-sync enforcement helpers
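The file-driven inputs in layer 2 are worth contrasting with the GPUaaS data-model approach. As an illustration only (the actual roce_ips.csv column layout is not confirmed here; the hostname,ip shape below is an assumption), a minimal sketch of turning such a CSV into typed records of the kind a DB-backed maas_roce_assignments table would hold:

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class RoceAssignment:
    """One phase-2 RoCE assignment, as a DB-backed row would model it."""
    hostname: str
    roce_ip: str


def parse_roce_csv(text: str) -> list:
    """Parse a hypothetical hostname,ip CSV into typed assignment records."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        if not rec or rec[0].startswith("#"):
            continue  # skip blank lines and comments
        rows.append(RoceAssignment(hostname=rec[0].strip(), roce_ip=rec[1].strip()))
    return rows


# Hypothetical sample content; the real roce_ips.csv columns may differ.
sample = "# node,roce_ip\nh200-01,172.16.0.11\nh200-02,172.16.0.12\n"
assignments = parse_roce_csv(sample)
```

The point of the sketch: once the rows are typed records, they can live behind an API and a schema migration instead of a shell script's positional CSV parsing.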

4. Fit Matrix

Legend:

  • Fits as-is: current GPUaaS model already covers this cleanly
  • Fits via existing extension point: current model supports it, but through a site/profile/bootstrap artifact path rather than direct core logic
  • Partial: concept exists, but current platform surface is thinner than the H200 bundle
  • Out of scope: should remain external to current GPUaaS core scope
| H200 bundle capability | H200 source | GPUaaS original design fit | Current support | Notes |
| --- | --- | --- | --- | --- |
| MAAS server install/bootstrap | install_maas_3_7.sh, deploy-anywhere.sh | Out of scope | No | GPUaaS assumes an existing MAAS service; this should remain infra-owned. |
| MAAS server tuning | tune_maas.sh, postgres_tuning.sh | Out of scope | No | Same as above; the platform consumes MAAS, it does not install/tune it. |
| Site network baseline on MAAS host | site.env, VLAN/DHCP/static pool setup | Out of scope | No | GPUaaS stores site/profile metadata, but does not configure the MAAS host fabric itself. |
| Site credentials and MAAS profile binding | site.env, apikey | Fits as-is | Yes | Covered by Site, SiteProfile, Vault-backed credentials in packages/services/maas/service.go. |
| Resolve machine by hostname/IPMI and create if missing | onboard_node.sh | Fits as-is | Yes | Covered in the onboarding resolve/create flow in execution.go. |
| Accept/commission machine | onboard_node.sh | Fits as-is | Yes | Covered by the onboarding commission stage. |
| PXE preflight and interface safety checks | onboard_node.sh, deploy-h200.sh | Fits as-is | Yes | Covered by PXEPreflight stages in onboarding and decommission/reimage. |
| Boot disk selection and flat storage layout | onboard_node.sh, deploy-h200.sh | Fits as-is | Yes | Covered by the prepare_storage stage and BOSS disk detection. |
| Phase-2 RoCE assignment from site data | apply_roce_phase2.sh, roce_ips.csv | Fits as-is | Yes | GPUaaS already has DB-backed maas_roce_assignments; the current model is stronger than CSV scripts. |
| MAAS release to Ready | release_node.sh | Fits as-is | Yes | Covered by decommission execution. |
| MAAS power off after release | release_node.sh | Fits as-is | Yes | Covered for full_decommission mode. |
| GPUaaS-only node record cleanup after MAAS repurpose | not in H200 bundle, but needed operationally | Fits as-is | Yes | Covered by the new GPUaaS-only detach flow and the decommission remove-node path. |
| Batch onboarding from CSV | source.csv, onboard_node.sh batch mode | Partial | Yes | GPUaaS supports batch onboarding, but the source of truth is API/data driven rather than CSV import tooling. |
| Deploy cloud-init payload | cloudinit, deploy-h200.sh | Fits via existing extension point | Partial | GPUaaS has site bootstrap bundle support, but the packaging/materialization workflow is still thin. |
| Minimal node bootstrap + enrollment into GPUaaS | H200 cloud-init equivalent | Fits as-is | Yes | Covered by the node bootstrap token + enrollment model and MAAS deploy cloud-init generation. |
| Site bootstrap bundle fetched from API/registry | not in H200 scripts; they use files | Fits via existing extension point | Partial | Current GPUaaS has the hook, but it is not yet a polished site-ops publishing workflow. |
| Post-deploy login user/password policy | H200 cloud-init sets hpcadmin and sudo | Fits via existing extension point | Partial | GPUaaS can carry this in a site bootstrap artifact, but it is not a first-class profile contract today. |
| Non-root disk partitioning/mounting as /shareN | H200 cloud-init | Fits via existing extension point | Partial | This is node personality; the current model can host it in site bundle/cloud-init, but does not model it explicitly. |
| Vendor driver and package stack install | Lambda stack, DOCA/OFED, rshim | Fits via existing extension point | Partial | Should be site/bootstrap artifact content, not GPUaaS core logic. The current system can inject it, but does not manage its lifecycle cleanly yet. |
| Hardware-specific low-level host tweaks | ACS disable service | Fits via existing extension point | Partial | Same category as above; should stay artifact/script driven, not embedded in core. |
| Enforce MAAS hardware sync policy | ensure_hw_sync.sh, timers | Fits as-is | Yes | GPUaaS already carries hardware-sync policy fields and completion guards. |
| Install/start maas-agent on deployed node | H200 cloud-init | Partial | Partial | GPUaaS validates hardware sync outcomes, but explicit deployed-node maas-agent bootstrap is still site-bootstrap content rather than first-class managed behavior. |
| Commissioning/testing script registration in MAAS | register_node_scripts.sh, commissioning-scripts/* | Partial | No | This is not represented as a first-class GPUaaS-managed artifact set today. |
| Firmware workflows for deployed nodes | run_deployed_fw_updates.sh | Partial | No | Operationally useful, but not currently in core scope. Could be modeled later as site tooling or MAAS script assets. |
| MAAS event alerting/Telegram/email | monitoring-email/* | Out of scope | No | Useful ops tooling, but not part of the current GPUaaS control-plane scope. |
| File-based operator inventory and environment model | site.env, B/bnodes, CSVs | Out of scope | No | GPUaaS intentionally uses DB/API models instead. This is a design difference, not necessarily a gap. |
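For a quick sanity check, the fit column of the matrix above can be tallied programmatically. This sketch simply encodes each row's fit label in row order and counts the categories; it is bookkeeping over the table's own content, nothing more:

```python
from collections import Counter

# Legend categories from the fit matrix.
FITS = "Fits as-is"
EXT = "Fits via existing extension point"
PARTIAL = "Partial"
OOS = "Out of scope"

# One entry per matrix row, in row order.
matrix_fits = [
    OOS,      # MAAS server install/bootstrap
    OOS,      # MAAS server tuning
    OOS,      # Site network baseline on MAAS host
    FITS,     # Site credentials and MAAS profile binding
    FITS,     # Resolve machine by hostname/IPMI
    FITS,     # Accept/commission machine
    FITS,     # PXE preflight and interface safety checks
    FITS,     # Boot disk selection and flat storage layout
    FITS,     # Phase-2 RoCE assignment from site data
    FITS,     # MAAS release to Ready
    FITS,     # MAAS power off after release
    FITS,     # GPUaaS-only node record cleanup
    PARTIAL,  # Batch onboarding from CSV
    EXT,      # Deploy cloud-init payload
    FITS,     # Minimal node bootstrap + enrollment
    EXT,      # Site bootstrap bundle from API/registry
    EXT,      # Post-deploy login user/password policy
    EXT,      # Non-root disk partitioning as /shareN
    EXT,      # Vendor driver and package stack install
    EXT,      # Hardware-specific low-level host tweaks
    FITS,     # Enforce MAAS hardware sync policy
    PARTIAL,  # maas-agent on deployed node
    PARTIAL,  # Commissioning script registration in MAAS
    PARTIAL,  # Firmware workflows for deployed nodes
    OOS,      # MAAS event alerting/Telegram/email
    OOS,      # File-based operator inventory model
]

tally = Counter(matrix_fits)
```

The tally (11 as-is, 6 via extension point, 4 partial, 5 out of scope) is what the summary in section 5 rests on, and keeping it as data makes it cheap to re-verify as rows change.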

5. Summary By Area

5.1 Strong alignment with original design

These H200 capabilities already fit the original GPUaaS MAAS model very well:

  • resolve or create machine
  • commission machine
  • PXE preflight
  • BOSS/boot disk selection
  • flat storage layout
  • phase-2 RoCE assignment
  • MAAS deploy/release/reimage lifecycle
  • node bootstrap and enrollment
  • hardware-sync enforcement
  • full decommission versus reimage split

This means the orchestration model itself is not far off.
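The aligned lifecycle above can be pictured as an ordered stage pipeline. A minimal sketch, where the stage names are illustrative labels derived from the capabilities in 5.1, not the platform's actual stage identifiers:

```python
from typing import Optional

# Illustrative onboarding stage order, derived from section 5.1.
# These labels are NOT the real GPUaaS stage identifiers.
ONBOARDING_STAGES = [
    "resolve_or_create",    # resolve machine by hostname/IPMI, create if missing
    "commission",           # accept/commission in MAAS
    "pxe_preflight",        # PXE and interface safety checks
    "prepare_storage",      # BOSS/boot disk selection + flat layout
    "deploy",               # MAAS deploy with cloud-init payload
    "roce_phase2",          # phase-2 RoCE assignment from site data
    "hardware_sync_guard",  # hardware-sync policy enforcement
]


def next_stage(completed: list) -> Optional[str]:
    """Return the next pending stage, assuming stages complete strictly in order."""
    for stage in ONBOARDING_STAGES:
        if stage not in completed:
            return stage
    return None  # pipeline finished
```

Modeling the flow this way is only meant to show why the fit is good: the H200 scripts and the GPUaaS stages walk the same ordered sequence, so mapping one onto the other is mostly a naming exercise.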

5.2 The main gap is not lifecycle logic

The biggest difference is the H200 kit's heavy use of site-specific payloads:

  • cloud-init content
  • vendor driver installation
  • host personality shaping
  • firmware/test script bundles

GPUaaS has an intended place for this:

  • site/profile policy
  • site bootstrap bundle
  • node bootstrap package

But the packaging and publication workflow for those artifacts is still thinner than what the H200 shell kit provides.
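One way to make that artifact workflow less thin is to treat the site bundle as a versioned, checksummed manifest rather than loose files on a MAAS host. A hedged sketch of the idea; all field names here are hypothetical, not the current GPUaaS schema:

```python
import hashlib
import json


def build_bundle_manifest(site: str, version: str, artifacts: dict) -> str:
    """Build a JSON manifest for a site bootstrap bundle.

    `artifacts` maps artifact name -> payload bytes; each entry is recorded
    with a sha256 digest so a deployed node can verify what it fetched.
    Field names are hypothetical, for illustration only.
    """
    entries = {
        name: {"sha256": hashlib.sha256(payload).hexdigest(), "bytes": len(payload)}
        for name, payload in artifacts.items()
    }
    return json.dumps(
        {"site": site, "version": version, "artifacts": entries},
        sort_keys=True,
    )


# Illustrative payloads standing in for cloud-init and driver-install content.
manifest = build_bundle_manifest(
    "h200-site-a",
    "2024.1",
    {"cloud-init.yaml": b"#cloud-config\n", "install-drivers.sh": b"#!/bin/sh\n"},
)
```

A manifest like this would let the vendor/driver/personality payloads in the H200 kit be published, versioned, and verified through the existing site bootstrap bundle hook without pulling their content into core logic.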

5.3 Intentional scope differences

The following are not evidence that GPUaaS is behind the design. They are scope differences:

  • MAAS server bootstrap
  • MAAS server tuning
  • Telegram/email alerting for MAAS host
  • env-file and CSV-driven operator workflow

Those remain infra/ops concerns in the original GPUaaS model.

6. Bottom-Line Assessment

Against the original design, the H200 bundle:

  • fits well for lifecycle orchestration,
  • fits partially for site-specific bootstrap and post-deploy personality,
  • should remain external for MAAS server bootstrap/tuning and ad hoc operator shell tooling.

So the platform is not far off in orchestration design.

The main thing still under-realized is:

  • a robust site bootstrap bundle / artifact workflow for vendor- and site-specific node personality.

That is different from saying GPUaaS should absorb the H200 scripts wholesale. It should not.

Before changing any design or implementation, validate the H200 drop against these three questions:

  1. Which parts of cloudinit are truly required for every H200 deployment?
  2. Which parts belong in a site bootstrap bundle versus a future commissioning artifact set?
  3. Which scripts are just MAAS-host ops conveniences and should never move into GPUaaS core?

If those three are answered, the remaining delta can be classified without changing platform scope.