MAAS Hardware Profile Capability Matrix v1

Status: comparative analysis only; no contract change is proposed in this document.

1. Goal

This document evaluates whether the current GPUaaS MAAS site/profile model is sufficient across likely hardware and fabric environments in the GPUaaS service space.

It is a mapping exercise against the current profile policy surface, not a redesign.

2. Current Profile Policy Surface

Current implemented MAAS site/profile policy fields are:

  • site_bootstrap_bundle_ref
  • strict_pxe_preflight
  • enable_phase2_roce
  • require_hw_sync
  • hardware_sync_interval
  • release_fallback_no_erase
  • enable_deploy_retry_on_datasource_failure
  • max_deploy_retry_attempts
  • auto_claim_single_new_machine
  • batch_max_parallel
  • enrollment_token_ttl_seconds

Reference:

3. Hardware / Site Classes Considered

This matrix uses hardware/site classes that are likely to recur in GPUaaS:

  1. Standard Ethernet GPU nodes
  2. RoCE Ethernet GPU nodes
  3. InfiniBand GPU nodes
  4. Storage-heavy GPU nodes
  5. Vendor-custom GPU nodes
  6. Customer-handoff or restricted-release MAAS sites

4. Capability Matrix

Legend:

  • Yes: current model covers this cleanly
  • Partial: current model can support it, but only through a site artifact or operational convention
  • No: current model does not represent it well enough yet
  • Out of scope: should remain external to GPUaaS core
| Capability | Standard Ethernet | RoCE Ethernet | InfiniBand | Storage-heavy | Vendor-custom | Restricted-release site | Notes |
|---|---|---|---|---|---|---|---|
| MAAS site binding and credentials | Yes | Yes | Yes | Yes | Yes | Yes | Already covered by site + Vault-backed credential model. |
| PXE/deploy interface safety | Yes | Yes | Yes | Yes | Yes | Yes | Covered by strict_pxe_preflight. |
| MAAS deploy retry on datasource issues | Yes | Yes | Yes | Yes | Yes | Yes | Covered by retry policy fields. |
| Batch onboarding throttle | Yes | Yes | Yes | Yes | Yes | Yes | Covered by batch_max_parallel. |
| GPUaaS enrollment bootstrap | Yes | Yes | Yes | Yes | Yes | Yes | Covered by deploy cloud-init generation and enrollment token TTL. |
| Site-specific bootstrap payload | Partial | Partial | Partial | Partial | Partial | Partial | Covered conceptually by site_bootstrap_bundle_ref, but artifact publication/materialization is still thin. |
| Phase-2 RoCE IP assignment | No | Yes | No | Partial | Partial | No | Covered by enable_phase2_roce plus DB-backed assignment model; only meaningful for RoCE Ethernet sites. |
| IB / fabric-aware host validation | No | No | Partial | No | Partial | No | Current model can skip RoCE, but has no explicit fabric-type or IB validation contract. |
| Storage layout variants beyond current flat prep | Partial | Partial | Partial | Partial | Partial | Partial | Current storage prep likely covers simple cases, but repeatable named storage layout profiles are not modeled. |
| Vendor driver / runtime package shaping | Partial | Partial | Partial | Partial | Partial | Partial | Should live in site bootstrap artifacts, not core logic. Current model can host it, but not with a rich lifecycle. |
| Commissioning / firmware script bundle | No | No | No | No | No | No | Not represented as first-class managed site/profile assets today. |
| Hardware sync enforcement | Yes | Yes | Yes | Yes | Yes | Yes | Covered by require_hw_sync and hardware_sync_interval. |
| Release behavior with “do not erase if MAAS cannot wipe” | Yes | Yes | Yes | Yes | Yes | Yes | Covered by release_fallback_no_erase. |
| GPUaaS-only detach after external MAAS repurpose | Yes | Yes | Yes | Yes | Yes | Yes | Covered operationally by node detach flow; not a site/profile field. |
| Customer-handoff “do not touch MAAS again” mode | Partial | Partial | Partial | Partial | Partial | Partial | Current profile has only part of this via fallback-no-erase; richer release-mode expression is not modeled. |
| MAAS server install/tune | Out of scope | Out of scope | Out of scope | Out of scope | Out of scope | Out of scope | Correctly remains infra-owned. |
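For automation, the matrix can also be held as plain data so gap reports stay in sync with this document. A minimal sketch with only a few rows shown; the capability and class identifiers are invented for illustration.

```python
# Illustrative encoding of part of the capability matrix.
# Status strings follow the legend above: Yes / Partial / No / Out of scope.
CLASSES = ["standard_eth", "roce_eth", "ib",
           "storage_heavy", "vendor_custom", "restricted_release"]

MATRIX = {
    "phase2_roce_ip_assignment":
        dict(zip(CLASSES, ["No", "Yes", "No", "Partial", "Partial", "No"])),
    "commissioning_firmware_bundle":
        dict(zip(CLASSES, ["No"] * 6)),
    "hardware_sync_enforcement":
        dict(zip(CLASSES, ["Yes"] * 6)),
}

def gaps_for(site_class: str) -> list[str]:
    """Capabilities the current model does not represent for this class."""
    return [cap for cap, row in MATRIX.items() if row[site_class] == "No"]

print(gaps_for("ib"))  # InfiniBand sites have two hard gaps in this subset
```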

5. What The Current Model Already Covers Well

The current model is already strong enough for:

  • generic MAAS-backed onboarding and deploy
  • safe PXE deployment behavior
  • RoCE-specific phase-2 network mutation
  • hardware sync enforcement
  • datasource retry and deploy robustness
  • restricted erase fallback behavior
  • API/data-driven site configuration rather than CSV-driven operator workflows

This means the original orchestration design still fits most expected MAAS-backed GPUaaS environments.

6. What Looks Under-Represented

The likely under-represented capabilities are:

6.1 Fabric type intent

Today:

  • enable_phase2_roce=true|false

is the only fabric-adjacent knob.

That is enough to say:

  • RoCE site: enable the phase-2 step
  • non-RoCE site: skip the phase-2 step

But it does not say whether the site is:

  • plain Ethernet
  • RoCE
  • InfiniBand

For first-pass validation, this is acceptable.

For a mature model, a future fabric_mode field would probably be cleaner.
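If a fabric_mode field were introduced, the existing boolean could be derived from it for backward compatibility. A hypothetical sketch: fabric_mode and its values are proposals from this document, not implemented fields.

```python
from enum import Enum

# Hypothetical fabric_mode values proposed later in this document.
class FabricMode(Enum):
    ETHERNET = "ethernet"
    ROCE = "roce"
    IB = "ib"

def enable_phase2_roce(mode: FabricMode) -> bool:
    # Only RoCE Ethernet sites run the phase-2 IP assignment step;
    # plain Ethernet and InfiniBand sites skip it.
    return mode is FabricMode.ROCE

print(enable_phase2_roce(FabricMode.ROCE))  # → True
print(enable_phase2_roce(FabricMode.IB))    # → False
```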

6.2 Post-deploy validation beyond generic enrollment

Some environments, especially InfiniBand or vendor-custom nodes, may need explicit host validation after deploy:

  • expected HCA presence
  • expected fabric port state
  • expected site package stack
  • expected mount or disk topology

Today this can only live inside the site bootstrap payload or external operator tooling.
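Pending a first-class contract, such checks can only be expressed as bootstrap scripting; a hypothetical validation runner might look like the following. All check names, inventory keys, and expectations here are invented for illustration.

```python
# Hypothetical post-deploy validation: each check inspects a host
# inventory dict and returns (check_name, passed).
def check_hca_present(inventory: dict) -> tuple[str, bool]:
    return ("hca_present", inventory.get("hca_count", 0) >= 1)

def check_fabric_ports_active(inventory: dict) -> tuple[str, bool]:
    ports = inventory.get("port_states", [])
    return ("fabric_ports_active", bool(ports) and all(p == "active" for p in ports))

def run_validation(inventory: dict) -> list[str]:
    """Return the names of failed checks; an empty list means the host passed."""
    checks = [check_hca_present, check_fabric_ports_active]
    return [name for name, ok in (c(inventory) for c in checks) if not ok]

failed = run_validation({"hca_count": 2, "port_states": ["active", "down"]})
print(failed)  # one failed check: fabric_ports_active
```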

6.3 Commissioning / firmware asset sets

The H200-style MAAS bundles often include:

  • commissioning scripts
  • hardware tests
  • firmware update helpers

These are not represented as first-class site/profile-managed assets today.

That is not necessarily wrong, but it is a gap if the platform wants to own those assets rather than relying on external MAAS host tooling.

6.4 Storage layout intent

Current storage preparation is enough for simple cases.

If GPUaaS needs named storage personalities across hardware families, the model may eventually need an explicit storage layout selector rather than only site bootstrap scripting.
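Such a selector could map a named storage personality to a concrete layout plan instead of hiding it in bootstrap scripts. The profile names and layout descriptions below are invented examples, not proposed values.

```python
# Hypothetical storage_layout_profile lookup table.
STORAGE_LAYOUTS = {
    "flat": ["single ext4 filesystem on the primary disk"],
    "scratch_raid0": ["os on primary disk",
                      "raid0 scratch volume across NVMe devices"],
}

def resolve_layout(profile_name: str) -> list[str]:
    try:
        return STORAGE_LAYOUTS[profile_name]
    except KeyError:
        # Fail closed: an unknown layout name should block deploy, not guess.
        raise ValueError(f"unknown storage_layout_profile: {profile_name}")

print(resolve_layout("flat"))
```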

7. Minimal Additional Options That Seem Justified

If the platform eventually needs more profile expressiveness, the most defensible additions appear to be:

  1. fabric_mode
     • values: ethernet | roce | ib
     • makes the intent explicit without encoding vendor specifics

  2. commissioning_bundle_ref
     • for MAAS commissioning/testing/firmware helper assets

  3. post_deploy_validation_bundle_ref
     • for hardware-family or fabric-family validation after deploy

  4. storage_layout_profile
     • only if repeated storage personalities become a real operational requirement

8. Options That Should Not Move Into Core

These should remain outside current GPUaaS core scope:

  • MAAS server install and tuning
  • MAAS host monitoring/Telegram/email tooling
  • ad hoc inventory CSVs
  • operator shell wrappers for MAAS host administration
  • hardcoded vendor-specific runtime choices embedded directly in control-plane code

9. Bottom-Line Assessment

The current GPUaaS MAAS site/profile model already covers most of what is needed across likely GPUaaS hardware classes.

The main gap is not lifecycle orchestration.

The main under-realized areas are:

  • explicit fabric intent
  • richer post-deploy validation
  • optional commissioning/firmware artifact handling
  • possibly explicit storage layout intent if repeated site patterns require it

So the platform is not far off from the original design.

The likely next step is not to expand core immediately. It is to validate upcoming environments against this matrix and only add new profile options when a concrete site proves the current knobs are insufficient.