MAAS Hardware Profile Capability Matrix v1
Status:
- comparative analysis only
- no contract change proposed in this document
1. Goal
This document evaluates whether the current GPUaaS MAAS site/profile model is sufficient across likely hardware and fabric environments in the GPUaaS service space.
It is a mapping exercise against the current profile policy surface, not a redesign.
2. Current Profile Policy Surface
Current implemented MAAS site/profile policy fields are:
- site_bootstrap_bundle_ref
- strict_pxe_preflight
- enable_phase2_roce
- require_hw_sync
- hardware_sync_interval
- release_fallback_no_erase
- enable_deploy_retry_on_datasource_failure
- max_deploy_retry_attempts
- auto_claim_single_new_machine
- batch_max_parallel
- enrollment_token_ttl_seconds
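As a rough illustration, the policy surface listed above could be held in one typed record. The field names come from this document; the class name, the types, the default values, and the validation rules are all assumptions made for this sketch and do not describe the implemented schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class MaasSiteProfilePolicy:
    """Hypothetical container for the current policy fields.

    Field names match this document. Types and defaults are assumed.
    """

    site_bootstrap_bundle_ref: Optional[str] = None
    strict_pxe_preflight: bool = True
    enable_phase2_roce: bool = False
    require_hw_sync: bool = True
    hardware_sync_interval: str = "15m"  # assumed duration-string format
    release_fallback_no_erase: bool = False
    enable_deploy_retry_on_datasource_failure: bool = True
    max_deploy_retry_attempts: int = 3
    auto_claim_single_new_machine: bool = False
    batch_max_parallel: int = 4
    enrollment_token_ttl_seconds: int = 3600

    def __post_init__(self) -> None:
        # Assumed sanity rules; the real service may enforce different ones.
        if self.max_deploy_retry_attempts < 0:
            raise ValueError("max_deploy_retry_attempts must be >= 0")
        if self.batch_max_parallel < 1:
            raise ValueError("batch_max_parallel must be >= 1")
        if self.enrollment_token_ttl_seconds < 1:
            raise ValueError("enrollment_token_ttl_seconds must be >= 1")


# A RoCE site differs from the assumed defaults only in the phase-2 flag.
roce_site = MaasSiteProfilePolicy(enable_phase2_roce=True)
print(sorted(asdict(roce_site)))
```

Keeping the record frozen means a site's policy is replaced as a whole rather than mutated field by field, which is one possible way to keep profile changes auditable.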
Reference:
3. Hardware / Site Classes Considered
This matrix uses hardware/site classes that are likely to recur in GPUaaS:
- Standard Ethernet GPU nodes
- RoCE Ethernet GPU nodes
- InfiniBand GPU nodes
- Storage-heavy GPU nodes
- Vendor-custom GPU nodes
- Customer-handoff or restricted-release MAAS sites
4. Capability Matrix
Legend:
- Yes: current model covers this cleanly
- Partial: current model can support it, but only through a site artifact or operational convention
- No: current model does not represent it well enough yet
- Out of scope: should remain external to GPUaaS core
| Capability | Standard Ethernet | RoCE Ethernet | InfiniBand | Storage-heavy | Vendor-custom | Restricted-release site | Notes |
|---|---|---|---|---|---|---|---|
| MAAS site binding and credentials | Yes | Yes | Yes | Yes | Yes | Yes | Already covered by site + Vault-backed credential model. |
| PXE/deploy interface safety | Yes | Yes | Yes | Yes | Yes | Yes | Covered by strict_pxe_preflight. |
| MAAS deploy retry on datasource issues | Yes | Yes | Yes | Yes | Yes | Yes | Covered by retry policy fields. |
| Batch onboarding throttle | Yes | Yes | Yes | Yes | Yes | Yes | Covered by batch_max_parallel. |
| GPUaaS enrollment bootstrap | Yes | Yes | Yes | Yes | Yes | Yes | Covered by deploy cloud-init generation and enrollment token TTL. |
| Site-specific bootstrap payload | Partial | Partial | Partial | Partial | Partial | Partial | Covered conceptually by site_bootstrap_bundle_ref, but artifact publication/materialization is still thin. |
| Phase-2 RoCE IP assignment | No | Yes | No | Partial | Partial | No | Covered by enable_phase2_roce plus DB-backed assignment model; only meaningful for RoCE Ethernet sites. |
| IB / fabric-aware host validation | No | No | Partial | No | Partial | No | Current model can skip RoCE, but has no explicit fabric-type or IB validation contract. |
| Storage layout variants beyond current flat prep | Partial | Partial | Partial | Partial | Partial | Partial | Current storage prep likely covers simple cases, but repeatable named storage layout profiles are not modeled. |
| Vendor driver / runtime package shaping | Partial | Partial | Partial | Partial | Partial | Partial | Should live in site bootstrap artifacts, not core logic. Current model can host it, but not with a rich lifecycle. |
| Commissioning / firmware script bundle | No | No | No | No | No | No | Not represented as first-class managed site/profile assets today. |
| Hardware sync enforcement | Yes | Yes | Yes | Yes | Yes | Yes | Covered by require_hw_sync and hardware_sync_interval. |
| Release behavior with “do not erase if MAAS cannot wipe” | Yes | Yes | Yes | Yes | Yes | Yes | Covered by release_fallback_no_erase. |
| GPUaaS-only detach after external MAAS repurpose | Yes | Yes | Yes | Yes | Yes | Yes | Covered operationally by node detach flow; not a site/profile field. |
| Customer-handoff “do not touch MAAS again” mode | Partial | Partial | Partial | Partial | Partial | Partial | Current profile has only part of this via fallback-no-erase; richer release-mode expression is not modeled. |
| MAAS server install/tune | Out of scope | Out of scope | Out of scope | Out of scope | Out of scope | Out of scope | Correctly remains infra-owned. |
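To make the matrix easier to check against an upcoming environment, it can be encoded as plain data and queried per site class. The sketch below transcribes the ratings from the table above. The helper names `row` and `gaps` are hypothetical and exist only for this example.

```python
CLASSES = (
    "Standard Ethernet", "RoCE Ethernet", "InfiniBand",
    "Storage-heavy", "Vendor-custom", "Restricted-release site",
)

Y, P, N, O = "Yes", "Partial", "No", "Out of scope"


def row(*cells):
    """Map ratings to site classes; a single rating applies to every class."""
    if len(cells) == 1:
        cells = cells * len(CLASSES)
    if len(cells) != len(CLASSES):
        raise ValueError("expected 1 or %d ratings" % len(CLASSES))
    return dict(zip(CLASSES, cells))


MATRIX = {
    "MAAS site binding and credentials": row(Y),
    "PXE/deploy interface safety": row(Y),
    "MAAS deploy retry on datasource issues": row(Y),
    "Batch onboarding throttle": row(Y),
    "GPUaaS enrollment bootstrap": row(Y),
    "Site-specific bootstrap payload": row(P),
    "Phase-2 RoCE IP assignment": row(N, Y, N, P, P, N),
    "IB / fabric-aware host validation": row(N, N, P, N, P, N),
    "Storage layout variants beyond current flat prep": row(P),
    "Vendor driver / runtime package shaping": row(P),
    "Commissioning / firmware script bundle": row(N),
    "Hardware sync enforcement": row(Y),
    "Release behavior with “do not erase if MAAS cannot wipe”": row(Y),
    "GPUaaS-only detach after external MAAS repurpose": row(Y),
    "Customer-handoff “do not touch MAAS again” mode": row(P),
    "MAAS server install/tune": row(O),
}


def gaps(site_class):
    """Return the capabilities rated Partial or No for one site class."""
    return {
        cap: cells[site_class]
        for cap, cells in MATRIX.items()
        if cells[site_class] in (P, N)
    }


for capability, rating in gaps("InfiniBand").items():
    print(f"{rating:8} {capability}")
```

Out-of-scope rows are deliberately excluded from `gaps`, since they are not expected to be covered by the profile model at all.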
5. What The Current Model Already Covers Well
The current model is already strong enough for:
- generic MAAS-backed onboarding and deploy
- safe PXE deployment behavior
- RoCE-specific phase-2 network mutation
- hardware sync enforcement
- datasource retry and deploy robustness
- restricted erase fallback behavior
- API/data-driven site configuration rather than CSV-driven operator workflows
This means the original orchestration design still fits most expected MAAS-backed GPUaaS environments.
6. What Looks Under-Represented
The likely under-represented capabilities are:
6.1 Fabric type intent
Today, enable_phase2_roce=true|false is the only fabric-adjacent knob.
That is enough to say:
- RoCE site: enable the phase-2 step
- non-RoCE site: skip the phase-2 step
But it does not say whether the site is:
- plain Ethernet
- RoCE
- InfiniBand
For first-pass validation, this is acceptable.
For a mature model, a future fabric_mode field would probably be cleaner.
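The information loss can be shown in a few lines. In this sketch the three mode values follow the ethernet, roce and ib options discussed in this document; the `FabricMode` enum and both helper functions are hypothetical, and the rule that only RoCE sites run the phase-2 step is taken from the description above.

```python
from enum import Enum


class FabricMode(Enum):
    ETHERNET = "ethernet"
    ROCE = "roce"
    IB = "ib"


def phase2_roce_enabled(mode: FabricMode) -> bool:
    """Derive today's boolean knob from an explicit fabric mode."""
    return mode is FabricMode.ROCE


def possible_modes(enable_phase2_roce: bool) -> set:
    """Return every fabric mode that is consistent with the boolean knob."""
    return {
        mode for mode in FabricMode
        if phase2_roce_enabled(mode) == enable_phase2_roce
    }


# True pins the site to RoCE, but False leaves two candidates open.
print(possible_modes(True))
print(possible_modes(False))
```

The mapping from mode to boolean is a function, but the reverse is not: `enable_phase2_roce=false` cannot distinguish plain Ethernet from InfiniBand, which is the ambiguity an explicit field would remove.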
6.2 Post-deploy validation beyond generic enrollment
Some environments, especially InfiniBand or vendor-custom nodes, may need explicit host validation after deploy:
- expected HCA presence
- expected fabric port state
- expected site package stack
- expected mount or disk topology
Today this can only live inside the site bootstrap payload or external operator tooling.
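A minimal sketch of what such a check could look like, assuming host facts have already been collected into a dictionary by the bootstrap payload or operator tooling. The `HostExpectation` type, the fact keys (`hcas`, `port_states`, `packages`, `mounts`) and `validate_host` are all hypothetical; none of them exist in the current model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HostExpectation:
    """Hypothetical per-site expectations, one field per check listed above."""

    min_hca_count: int = 0
    port_states: Dict[str, str] = field(default_factory=dict)
    packages: List[str] = field(default_factory=list)
    mounts: List[str] = field(default_factory=list)


def validate_host(expected: HostExpectation, observed: dict) -> List[str]:
    """Compare collected host facts with expectations; return failure messages."""
    failures = []

    hca_count = len(observed.get("hcas", []))
    if hca_count < expected.min_hca_count:
        failures.append(
            f"expected at least {expected.min_hca_count} HCAs, found {hca_count}"
        )

    observed_ports = observed.get("port_states", {})
    for port, state in expected.port_states.items():
        actual = observed_ports.get(port)
        if actual != state:
            failures.append(f"port {port}: expected {state}, got {actual}")

    installed = set(observed.get("packages", []))
    for package in expected.packages:
        if package not in installed:
            failures.append(f"missing package {package}")

    mounted = set(observed.get("mounts", []))
    for mount in expected.mounts:
        if mount not in mounted:
            failures.append(f"missing mount {mount}")

    return failures
```

An empty result means the host matches the site's expectations. Returning messages instead of raising on the first mismatch lets an operator see every deviation from a single run.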
6.3 Commissioning / firmware asset sets
The H200-style MAAS bundles often include:
- commissioning scripts
- hardware tests
- firmware update helpers
These are not represented as first-class site/profile-managed assets today.
That is not necessarily wrong, but it is a gap if the platform wants to own those assets rather than relying on external MAAS host tooling.
6.4 Storage layout intent
Current storage preparation is enough for simple cases.
If GPUaaS needs named storage personalities across hardware families, the model may eventually need an explicit storage layout selector rather than only site bootstrap scripting.
7. Minimal Additional Options That Seem Justified
If the platform eventually needs more profile expressiveness, the most defensible additions appear to be:
- fabric_mode: ethernet | roce | ib
  - makes the intent explicit without encoding vendor specifics
- commissioning_bundle_ref
  - for MAAS commissioning/testing/firmware helper assets
- post_deploy_validation_bundle_ref
  - for hardware-family or fabric-family validation after deploy
- storage_layout_profile
  - only if repeated storage personalities become a real operational requirement
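If these options were ever added, they could be introduced as optional fields so that existing profiles stay valid. The sketch below assumes profiles arrive as key/value mappings. `ProfileExtensions` and `parse_extensions` are hypothetical names, and the cross-check between fabric_mode and enable_phase2_roce is an assumed rule, not an agreed contract.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping, Optional

ALLOWED_FABRIC_MODES = ("ethernet", "roce", "ib")


@dataclass(frozen=True)
class ProfileExtensions:
    """Candidate additions; every field is optional and defaults to unset."""

    fabric_mode: Optional[str] = None
    commissioning_bundle_ref: Optional[str] = None
    post_deploy_validation_bundle_ref: Optional[str] = None
    storage_layout_profile: Optional[str] = None


def parse_extensions(profile: Mapping[str, Any]) -> ProfileExtensions:
    """Read the optional fields from a profile mapping and validate them."""
    ext = ProfileExtensions(
        **{f.name: profile.get(f.name) for f in fields(ProfileExtensions)}
    )
    if ext.fabric_mode is None:
        return ext  # Existing profiles without the new field are accepted as-is.

    if ext.fabric_mode not in ALLOWED_FABRIC_MODES:
        raise ValueError(f"unknown fabric_mode: {ext.fabric_mode!r}")

    # Assumed rule: when both knobs are set, they must not contradict each other.
    roce_flag = profile.get("enable_phase2_roce")
    if roce_flag is not None and roce_flag != (ext.fabric_mode == "roce"):
        raise ValueError("fabric_mode conflicts with enable_phase2_roce")
    return ext


legacy = parse_extensions({"enable_phase2_roce": True})
ib_site = parse_extensions({"enable_phase2_roce": False, "fabric_mode": "ib"})
print(legacy.fabric_mode, ib_site.fabric_mode)
```

Defaulting every new field to unset keeps the change additive: a profile that only carries today's fields parses without modification.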
8. Options That Should Not Move Into Core
These should remain outside current GPUaaS core scope:
- MAAS server install and tuning
- MAAS host monitoring/Telegram/email tooling
- ad hoc inventory CSVs
- operator shell wrappers for MAAS host administration
- hardcoded vendor-specific runtime choices embedded directly in control-plane code
9. Bottom-Line Assessment
The current GPUaaS MAAS site/profile model already covers most of what is needed across likely GPUaaS hardware classes.
The main gap is not lifecycle orchestration.
The main under-realized areas are:
- explicit fabric intent
- richer post-deploy validation
- optional commissioning/firmware artifact handling
- possibly explicit storage layout intent if repeated site patterns require it
So the platform is not far off from the original design.
The likely next step is not to expand core immediately. It is to validate upcoming environments against this matrix and only add new profile options when a concrete site proves the current knobs are insufficient.