MAAS Hardware Profile Capability Matrix v1

Status: comparative analysis only; no contract change is proposed in this document.

1. Goal

This document evaluates whether the current GPUaaS MAAS site/profile model is sufficient across likely hardware and fabric environments in the GPUaaS service space.

It is a mapping exercise against the current profile policy surface, not a redesign.

2. Current Profile Policy Surface

Current implemented MAAS site/profile policy fields are:

  • site_bootstrap_bundle_ref
  • strict_pxe_preflight
  • enable_phase2_roce
  • require_hw_sync
  • hardware_sync_interval
  • release_fallback_no_erase
  • enable_deploy_retry_on_datasource_failure
  • max_deploy_retry_attempts
  • auto_claim_single_new_machine
  • batch_max_parallel
  • enrollment_token_ttl_seconds

Reference:

3. Hardware / Site Classes Considered

This matrix uses hardware/site classes that are likely to recur in GPUaaS:

  1. Standard Ethernet GPU nodes
  2. RoCE Ethernet GPU nodes
  3. InfiniBand GPU nodes
  4. Storage-heavy GPU nodes
  5. Vendor-custom GPU nodes
  6. Customer-handoff or restricted-release MAAS sites

4. Capability Matrix

Legend:

  • Yes: current model covers this cleanly
  • Partial: current model can support it, but only through a site artifact or operational convention
  • No: current model does not represent it well enough yet
  • Out of scope: should remain external to GPUaaS core
| Capability | Standard Ethernet | RoCE Ethernet | InfiniBand | Storage-heavy | Vendor-custom | Restricted-release site | Notes |
|---|---|---|---|---|---|---|---|
| MAAS site binding and credentials | Yes | Yes | Yes | Yes | Yes | Yes | Already covered by site + Vault-backed credential model. |
| PXE/deploy interface safety | Yes | Yes | Yes | Yes | Yes | Yes | Covered by strict_pxe_preflight. |
| MAAS deploy retry on datasource issues | Yes | Yes | Yes | Yes | Yes | Yes | Covered by retry policy fields. |
| Batch onboarding throttle | Yes | Yes | Yes | Yes | Yes | Yes | Covered by batch_max_parallel. |
| GPUaaS enrollment bootstrap | Yes | Yes | Yes | Yes | Yes | Yes | Covered by deploy cloud-init generation and enrollment token TTL. |
| Site-specific bootstrap payload | Partial | Partial | Partial | Partial | Partial | Partial | Covered conceptually by site_bootstrap_bundle_ref, but artifact publication/materialization is still thin. |
| Phase-2 RoCE IP assignment | No | Yes | No | Partial | Partial | No | Covered by enable_phase2_roce plus DB-backed assignment model; only meaningful for RoCE Ethernet sites. |
| IB / fabric-aware host validation | No | No | Partial | No | Partial | No | Current model can skip RoCE, but has no explicit fabric-type or IB validation contract. |
| Storage layout variants beyond current flat prep | Partial | Partial | Partial | Partial | Partial | Partial | Current storage prep likely covers simple cases, but repeatable named storage layout profiles are not modeled. |
| Vendor driver / runtime package shaping | Partial | Partial | Partial | Partial | Partial | Partial | Should live in site bootstrap artifacts, not core logic. Current model can host it, but not with a rich lifecycle. |
| Commissioning / firmware script bundle | No | No | No | No | No | No | Not represented as first-class managed site/profile assets today. |
| Hardware sync enforcement | Yes | Yes | Yes | Yes | Yes | Yes | Covered by require_hw_sync and hardware_sync_interval. |
| Release behavior with “do not erase if MAAS cannot wipe” | Yes | Yes | Yes | Yes | Yes | Yes | Covered by release_fallback_no_erase. |
| GPUaaS-only detach after external MAAS repurpose | Yes | Yes | Yes | Yes | Yes | Yes | Covered operationally by node detach flow; not a site/profile field. |
| Customer-handoff “do not touch MAAS again” mode | Partial | Partial | Partial | Partial | Partial | Partial | Current profile has only part of this via fallback-no-erase; richer release-mode expression is not modeled. |
| MAAS server install/tune | Out of scope | Out of scope | Out of scope | Out of scope | Out of scope | Out of scope | Correctly remains infra-owned. |
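For automation, the matrix can also be held as plain data so gap reports stay in sync with this document. A minimal sketch with only a few rows shown; the capability and class identifiers are invented for illustration.

```python
# Illustrative encoding of part of the capability matrix.
# Status strings follow the legend above: Yes / Partial / No / Out of scope.
CLASSES = ["standard_eth", "roce_eth", "ib",
           "storage_heavy", "vendor_custom", "restricted_release"]

MATRIX = {
    "phase2_roce_ip_assignment":
        dict(zip(CLASSES, ["No", "Yes", "No", "Partial", "Partial", "No"])),
    "commissioning_firmware_bundle":
        dict(zip(CLASSES, ["No"] * 6)),
    "hardware_sync_enforcement":
        dict(zip(CLASSES, ["Yes"] * 6)),
}

def gaps_for(site_class: str) -> list[str]:
    """Capabilities the current model does not represent for this class."""
    return [cap for cap, row in MATRIX.items() if row[site_class] == "No"]

print(gaps_for("ib"))  # InfiniBand sites have two hard gaps in this subset
```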

5. What The Current Model Already Covers Well

The current model is already strong enough for:

  • generic MAAS-backed onboarding and deploy
  • safe PXE deployment behavior
  • RoCE-specific phase-2 network mutation
  • hardware sync enforcement
  • datasource retry and deploy robustness
  • restricted erase fallback behavior
  • API/data-driven site configuration rather than CSV-driven operator workflows

This means the original orchestration design still fits most expected MAAS-backed GPUaaS environments.

6. What Looks Under-Represented

The likely under-represented capabilities are:

6.1 Fabric type intent

Today:

  • enable_phase2_roce=true|false

is the only fabric-adjacent knob.

That is enough to say:

  • RoCE site: enable the phase-2 step
  • non-RoCE site: skip the phase-2 step

But it does not say whether the site is:

  • plain Ethernet
  • RoCE
  • InfiniBand

For first-pass validation, this is acceptable.

For a mature model, a future fabric_mode field would probably be cleaner.
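If a fabric_mode field were introduced, the existing boolean could be derived from it for backward compatibility. A hypothetical sketch: fabric_mode and its values are proposals from this document, not implemented fields.

```python
from enum import Enum

# Hypothetical fabric_mode values proposed later in this document.
class FabricMode(Enum):
    ETHERNET = "ethernet"
    ROCE = "roce"
    IB = "ib"

def enable_phase2_roce(mode: FabricMode) -> bool:
    # Only RoCE Ethernet sites run the phase-2 IP assignment step;
    # plain Ethernet and InfiniBand sites skip it.
    return mode is FabricMode.ROCE

print(enable_phase2_roce(FabricMode.ROCE))  # → True
print(enable_phase2_roce(FabricMode.IB))    # → False
```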

6.2 Post-deploy validation beyond generic enrollment

Some environments, especially InfiniBand or vendor-custom nodes, may need explicit host validation after deploy:

  • expected HCA presence
  • expected fabric port state
  • expected site package stack
  • expected mount or disk topology

Today this can only live inside the site bootstrap payload or external operator tooling.
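Pending a first-class contract, such checks can only be expressed as bootstrap scripting; a hypothetical validation runner might look like the following. All check names, inventory keys, and expectations here are invented for illustration.

```python
# Hypothetical post-deploy validation: each check inspects a host
# inventory dict and returns (check_name, passed).
def check_hca_present(inventory: dict) -> tuple[str, bool]:
    return ("hca_present", inventory.get("hca_count", 0) >= 1)

def check_fabric_ports_active(inventory: dict) -> tuple[str, bool]:
    ports = inventory.get("port_states", [])
    return ("fabric_ports_active", bool(ports) and all(p == "active" for p in ports))

def run_validation(inventory: dict) -> list[str]:
    """Return the names of failed checks; an empty list means the host passed."""
    checks = [check_hca_present, check_fabric_ports_active]
    return [name for name, ok in (c(inventory) for c in checks) if not ok]

failed = run_validation({"hca_count": 2, "port_states": ["active", "down"]})
print(failed)  # one failed check: fabric_ports_active
```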

6.3 Commissioning / firmware asset sets

The H200-style MAAS bundles often include:

  • commissioning scripts
  • hardware tests
  • firmware update helpers

These are not represented as first-class site/profile-managed assets today.

That is not necessarily wrong, but it is a gap if the platform wants to own those assets rather than relying on external MAAS host tooling.

6.4 Storage layout intent

Current storage preparation is enough for simple cases.

If GPUaaS needs named storage personalities across hardware families, the model may eventually need an explicit storage layout selector rather than only site bootstrap scripting.
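Such a selector could map a named storage personality to a concrete layout plan instead of hiding it in bootstrap scripts. The profile names and layout descriptions below are invented examples, not proposed values.

```python
# Hypothetical storage_layout_profile lookup table.
STORAGE_LAYOUTS = {
    "flat": ["single ext4 filesystem on the primary disk"],
    "scratch_raid0": ["os on primary disk",
                      "raid0 scratch volume across NVMe devices"],
}

def resolve_layout(profile_name: str) -> list[str]:
    try:
        return STORAGE_LAYOUTS[profile_name]
    except KeyError:
        # Fail closed: an unknown layout name should block deploy, not guess.
        raise ValueError(f"unknown storage_layout_profile: {profile_name}")

print(resolve_layout("flat"))
```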

7. Minimal Additional Options That Seem Justified

If the platform eventually needs more profile expressiveness, the most defensible additions appear to be:

  1. fabric_mode
     • values: ethernet | roce | ib
     • makes the intent explicit without encoding vendor specifics

  2. commissioning_bundle_ref
     • for MAAS commissioning/testing/firmware helper assets

  3. post_deploy_validation_bundle_ref
     • for hardware-family or fabric-family validation after deploy

  4. storage_layout_profile
     • only if repeated storage personalities become a real operational requirement

8. Options That Should Not Move Into Core

These should remain outside current GPUaaS core scope:

  • MAAS server install and tuning
  • MAAS host monitoring/Telegram/email tooling
  • ad hoc inventory CSVs
  • operator shell wrappers for MAAS host administration
  • hardcoded vendor-specific runtime choices embedded directly in control-plane code

9. Bottom-Line Assessment

The current GPUaaS MAAS site/profile model already covers most of what is needed across likely GPUaaS hardware classes.

The main gap is not lifecycle orchestration.

The main under-realized areas are:

  • explicit fabric intent
  • richer post-deploy validation
  • optional commissioning/firmware artifact handling
  • possibly explicit storage layout intent if repeated site patterns require it

So the platform is not far off from the original design.

The likely next step is not to expand core immediately. It is to validate upcoming environments against this matrix and only add new profile options when a concrete site proves the current knobs are insufficient.