
Provisioning.BareMetal.MAAS API Boundary v1

1. Purpose

Define the domain boundary for Provisioning.BareMetal.MAAS so implementation, review, and future handoff stay coherent.

This document answers:

  • what the MAAS domain owns
  • which APIs it should expose through the control plane
  • what remains outside the domain
  • how it should interact with other GPUasService domains

Use this together with:

  • MAAS_Bare_Metal_Lifecycle_v1.md
  • MAAS_Node_State_Model_v1.md
  • Domain_Ownership_Map.md

2. Domain Scope

Provisioning.BareMetal.MAAS owns:

  • MAAS site records, credentials, and policy
  • the future MAAS site profile model used to express multiple operational policy bundles per site
  • site bootstrap bundle references used during MAAS onboarding
  • machine onboarding/reimage/full decommission orchestration
  • MAAS reconciliation for MAAS-managed nodes
  • MAAS-specific capability probes and preflight validation
  • MAAS-specific workflow/read models and recovery semantics

It does not own:

  • generic user allocation provisioning
  • catalog browsing UX
  • arbitrary MAAS script payload distribution
  • MAAS host installation/bootstrap/tuning
  • external monitoring bots or one-off operator tools

3. Internal Dependency Map

| Domain | Relationship to Provisioning.BareMetal.MAAS |
| --- | --- |
| IAM | admin authz/audit for site and lifecycle actions |
| Inventory.Catalog | node records, status transitions, SKU binding |
| Provisioning.Allocation | force-release/drain dependencies before reimage/remove |
| Operations | deploy/runbook/validation ownership |
| UX | admin nodes/site pages, CLI/SDK wrappers |

Rules:

  • MAAS integration should not scatter across unrelated packages.
  • MAAS-specific orchestration logic should live behind a clear service boundary, even if still inside the monorepo.

4. API Surface Owned by the Domain

These are control-plane APIs that naturally belong to the MAAS domain.

4.1 Site management

| Endpoint | Purpose |
| --- | --- |
| POST /api/v1/admin/maas-sites | create site metadata |
| GET /api/v1/admin/maas-sites | list sites |
| GET /api/v1/admin/maas-sites/{id} | read one site |
| PATCH /api/v1/admin/maas-sites/{id} | update site config/policy |
| DELETE /api/v1/admin/maas-sites/{id} | soft-delete/disable site (or equivalent PATCH status transition) |
| POST /api/v1/admin/maas-sites/{id}/credentials | write/rotate Vault-backed site credentials |
| POST /api/v1/admin/maas-sites/{id}/probe | validate MAAS connectivity/capabilities |

4.2 Site configuration adjuncts

| Endpoint | Purpose |
| --- | --- |
| GET /api/v1/admin/maas-sites/{id}/profiles | list site profiles |
| POST /api/v1/admin/maas-sites/{id}/profiles | create profile |
| GET /api/v1/admin/maas-sites/{id}/profiles/{profile_id} | read one profile |
| PATCH /api/v1/admin/maas-sites/{id}/profiles/{profile_id} | update profile |
| DELETE /api/v1/admin/maas-sites/{id}/profiles/{profile_id} | disable one profile |
| POST /api/v1/admin/maas-sites/{id}/roce-assignments | bulk upsert typed RoCE assignment records |
| GET /api/v1/admin/maas-sites/{id}/roce-assignments | list RoCE assignment records |
| DELETE /api/v1/admin/maas-sites/{id}/roce-assignments/{assignment_id} | remove one assignment |

Notes:

  • The current implementation stores policy at the site level as the bootstrap/default profile surface.
  • Profile CRUD is the planned next structural step once operators need multiple policy bundles per MAAS site.
  • Site scope should remain focused on connectivity, deploy identity, and site defaults.
  • Profiles should own runtime target selection such as architecture and distro_series, plus optional PXE overrides where a site needs more than one PXE/network shape.
  • RoCE assignments remain site-scoped records keyed by hostname; they are not profile-owned resources.
  • Bulk upsert is the intended operator path for real hosts with multiple interface/IP rows.

Example bulk upsert body:

```json
{
  "items": [
    { "hostname": "c07u31", "interface": "enp28s0np0", "ipv4_cidr": "172.30.9.61/31" },
    { "hostname": "c07u31", "interface": "enp29s0np0", "ipv4_cidr": "172.29.1.219/31" }
  ]
}
```
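A handler for this bulk upsert would validate each row before writing site-scoped assignment records. The sketch below assumes the field names from the example body; the struct and its invariants are illustrative, not the frozen contract.

```go
package main

import (
	"fmt"
	"net"
)

// RoCEAssignment mirrors one row of the bulk-upsert body shown above.
// Field names follow the example JSON and are illustrative.
type RoCEAssignment struct {
	Hostname  string `json:"hostname"`
	Interface string `json:"interface"`
	IPv4CIDR  string `json:"ipv4_cidr"`
}

// Validate enforces the minimal invariants a bulk upsert handler might
// check before persisting: non-empty identity fields and a parseable
// CIDR (note that /31 point-to-point prefixes are valid here).
func (a RoCEAssignment) Validate() error {
	if a.Hostname == "" || a.Interface == "" {
		return fmt.Errorf("hostname and interface are required")
	}
	if _, _, err := net.ParseCIDR(a.IPv4CIDR); err != nil {
		return fmt.Errorf("ipv4_cidr %q is not valid CIDR: %w", a.IPv4CIDR, err)
	}
	return nil
}
```

Because upsert is keyed by (hostname, interface) within a site, repeating the call with the same rows stays idempotent.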

4.3 Workflow execution and recovery

| Endpoint | Purpose |
| --- | --- |
| POST /api/v1/admin/onboardings | start single-node MAAS onboarding |
| POST /api/v1/admin/onboardings/batch | start batch MAAS onboarding |
| GET /api/v1/admin/onboardings/{id} | read one onboarding workflow |
| GET /api/v1/admin/onboardings | filter/search onboarding workflows |
| POST /api/v1/admin/onboardings/{id}/retry | retry current stage |
| POST /api/v1/admin/onboardings/{id}/resume | resume from safe checkpoint |
| POST /api/v1/admin/onboardings/{id}/rerun | rerun workflow from top with adoption |
| POST /api/v1/admin/onboardings/{id}/restart-clean | explicit reset/restart |
| POST /api/v1/admin/onboardings/{id}/cancel | cancel + compensate where possible |
| POST /api/v1/admin/onboardings/{id}/adopt | adopt externally observed MAAS/node state |
| POST /api/v1/admin/onboardings/{id}/mark-manual-intervention | freeze into manual intervention state |

Request-shape assumptions for v1 onboarding:

  • site_id, profile_id, and sku_id are required for single and batch onboarding.
  • hostname and ipmi_ip are required per node for onboarding requests.
  • pxe_mac is not part of the normal operator-facing onboarding contract; if retained in storage, it is an observed/internal fallback field rather than a required input.
  • batch onboarding shares site_id, profile_id, and sku_id at the top level and carries only per-node identity rows.
  • discovery/adoption behavior remains policy-driven rather than request-driven.
  • RoCE phase-2 assignment stays a separate site-scoped resource keyed by site_id + hostname; onboarding consumes it when the selected profile enables phase-2 RoCE.
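The batch contract described above can be sketched as typed request structs with the stated invariants enforced up front. Struct and field names are assumptions derived from the JSON field names in this section, not a reviewed schema.

```go
package main

import "fmt"

// NodeIdentity is the per-node identity row required by the v1 onboarding
// contract: hostname plus IPMI address. pxe_mac is deliberately absent
// from the operator-facing shape.
type NodeIdentity struct {
	Hostname string `json:"hostname"`
	IPMIIP   string `json:"ipmi_ip"`
}

// BatchOnboardingRequest shares site_id, profile_id, and sku_id at the
// top level and carries only identity rows per node.
type BatchOnboardingRequest struct {
	SiteID    string         `json:"site_id"`
	ProfileID string         `json:"profile_id"`
	SKUID     string         `json:"sku_id"`
	Nodes     []NodeIdentity `json:"nodes"`
}

// Validate rejects requests that violate the v1 request-shape assumptions
// before any workflow is started.
func (r BatchOnboardingRequest) Validate() error {
	if r.SiteID == "" || r.ProfileID == "" || r.SKUID == "" {
		return fmt.Errorf("site_id, profile_id, and sku_id are required")
	}
	if len(r.Nodes) == 0 {
		return fmt.Errorf("at least one node identity row is required")
	}
	for i, n := range r.Nodes {
		if n.Hostname == "" || n.IPMIIP == "" {
			return fmt.Errorf("node %d: hostname and ipmi_ip are required", i)
		}
	}
	return nil
}
```

A single-node onboarding can reuse the same shape with one identity row, which keeps the two endpoints contract-compatible.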

4.4 Decommission and reconcile

| Endpoint | Purpose |
| --- | --- |
| POST /api/v1/admin/nodes/{id}/decommission | start reimage/full decommission workflow |
| POST /api/v1/admin/nodes/{id}/storage-cleanup | storage-only cleanup path |
| GET /api/v1/admin/decommissions/{id} | read decommission workflow |
| GET /api/v1/admin/decommissions | filter/search decommission workflows |
| POST /api/v1/admin/decommissions/{id}/retry | retry failed stage |
| POST /api/v1/admin/decommissions/{id}/resume | resume from safe checkpoint |
| POST /api/v1/admin/decommissions/{id}/rerun | rerun decommission workflow from top with state adoption |
| POST /api/v1/admin/decommissions/{id}/restart-clean | explicit reset/restart where safe |
| POST /api/v1/admin/decommissions/{id}/cancel | cancel/best-effort abort |
| POST /api/v1/admin/decommissions/{id}/adopt | adopt externally observed MAAS/node state |
| POST /api/v1/admin/decommissions/{id}/mark-manual-intervention | freeze into manual intervention state |
| GET /api/v1/admin/reconciliation/status | reconciliation summary |
| GET /api/v1/admin/reconciliation/drift | drift list |
| POST /api/v1/admin/reconciliation/run | trigger immediate reconcile |
| POST /api/v1/admin/reconciliation/drift/{node_id}/resolve | acknowledge/resolve drift |
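The drift list served by the reconciliation endpoints is, at its core, a comparison of the control-plane record against MAAS-observed state. The sketch below assumes a simplified per-node status string and invented sentinel values ("missing-in-maas", "not-tracked"); the real drift model lives in the domain's read models.

```go
package main

// Drift records a mismatch between the control-plane record and the
// state observed in MAAS for one node.
type Drift struct {
	Hostname string
	Expected string // control-plane view
	Observed string // MAAS-observed view
}

// ComputeDrift compares the control-plane view against MAAS-observed
// state, keyed by hostname, and returns one drift record per mismatch,
// including nodes present on only one side.
func ComputeDrift(expected, observed map[string]string) []Drift {
	var drifts []Drift
	for host, want := range expected {
		got, ok := observed[host]
		if !ok {
			drifts = append(drifts, Drift{host, want, "missing-in-maas"})
		} else if got != want {
			drifts = append(drifts, Drift{host, want, got})
		}
	}
	for host, got := range observed {
		if _, ok := expected[host]; !ok {
			drifts = append(drifts, Drift{host, "not-tracked", got})
		}
	}
	return drifts
}
```

GET /drift would serve the persisted output of this comparison, and POST /drift/{node_id}/resolve would mark an individual record acknowledged rather than mutate node state.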

5. API Shape Rules

5.1 Canonical transport

  • Control-plane APIs are JSON-only.
  • CSV is not part of the canonical service contract.
  • If operators use CSV for bulk onboarding or RoCE assignments, conversion belongs in CLI/import tooling before the API call.

5.2 Security rules

  • Admin-only surface; privileged mutations must audit.
  • No API for arbitrary MAAS script blob upload.
  • No API for arbitrary remote shell payload execution.
  • Cloud-init content must come from controlled bundle/template inputs, not free-form operator shell blobs.

5.3 Recovery semantics

Recovery endpoints are intentionally distinct:

  • retry
  • resume
  • rerun
  • restart-clean
  • adopt
  • cancel

These are separate control-plane operations, not synonyms.
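One way to keep the verbs distinct in code is a per-state validity matrix that the API layer consults before dispatching to the workflow. The workflow states and the matrix entries below are assumptions for the sketch; the authoritative transitions live in MAAS_Node_State_Model_v1.md.

```go
package main

// RecoveryOp enumerates the distinct recovery verbs. They are separate
// control-plane operations, not synonyms.
type RecoveryOp string

const (
	OpRetry        RecoveryOp = "retry"
	OpResume       RecoveryOp = "resume"
	OpRerun        RecoveryOp = "rerun"
	OpRestartClean RecoveryOp = "restart-clean"
	OpAdopt        RecoveryOp = "adopt"
	OpCancel       RecoveryOp = "cancel"
)

// allowedOps is an illustrative validity matrix: which recovery verbs the
// control plane accepts per workflow state. States are hypothetical here.
var allowedOps = map[string][]RecoveryOp{
	"running":             {OpCancel},
	"failed":              {OpRetry, OpResume, OpRerun, OpRestartClean, OpAdopt, OpCancel},
	"manual_intervention": {OpResume, OpRerun, OpRestartClean, OpAdopt, OpCancel},
	"completed":           {OpRerun},
}

// Allowed reports whether op is valid for a workflow in the given state,
// so an invalid verb is rejected with a 4xx before any workflow signal.
func Allowed(state string, op RecoveryOp) bool {
	for _, o := range allowedOps[state] {
		if o == op {
			return true
		}
	}
	return false
}
```

Encoding the matrix once, rather than per endpoint, keeps the six verbs from drifting into overlapping meanings as handlers evolve.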

6. Runtime Boundary

6.1 What talks to MAAS

Only control-plane MAAS integration code should talk to the MAAS API.

That means:

  • cmd/api may start/query workflows and serve admin APIs
  • worker/activity code executes MAAS API calls
  • node-agent does not talk to MAAS directly
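This boundary can be enforced mechanically by giving MAAS access a single interface seam held only by worker/activity types. The interface, method names, and fake below are assumptions for illustration, not the real MAAS client surface.

```go
package main

// MAASClient is the single seam through which control-plane code talks
// to the MAAS API. Method names are illustrative. Only worker/activity
// code holds a concrete implementation; cmd/api and node-agent never do.
type MAASClient interface {
	Commission(systemID string) error
	Deploy(systemID string) error
	Release(systemID string) error
}

// OnboardingActivities is the worker-side type that receives the client.
// Confining the dependency here keeps MAAS calls out of API handlers.
type OnboardingActivities struct {
	MAAS MAASClient
}

// CommissionNode is a sketch of one activity body: it delegates to the
// client rather than shelling out or embedding transport details.
func (a *OnboardingActivities) CommissionNode(systemID string) error {
	return a.MAAS.Commission(systemID)
}

// fakeMAAS is a test double showing how the seam keeps activity logic
// testable without a live MAAS.
type fakeMAAS struct{ commissioned []string }

func (f *fakeMAAS) Commission(id string) error {
	f.commissioned = append(f.commissioned, id)
	return nil
}
func (f *fakeMAAS) Deploy(id string) error  { return nil }
func (f *fakeMAAS) Release(id string) error { return nil }
```

With the seam in place, "cmd/api may start/query workflows" reduces to a lint-enforceable rule: the concrete client package is imported only by the worker.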

6.2 What talks to the node

For MAAS onboarding:

  • initial post-deploy hardware-sync seed may require SSH before node-agent enrollment
  • once node-agent is running, follow-up host actions should move through typed node tasks

This keeps the exception narrow and explicit.

7. Data Ownership

The domain should own these records:

  • maas_sites
  • maas_site_profiles (planned evolution from site-only policy)
  • maas_power_credential_overrides
  • maas_roce_assignments
  • maas_site_policies
  • node_onboardings
  • node_onboarding_events
  • node_decommissions
  • node_decommission_events
  • node_maas_state

It may coordinate transitions in:

  • nodes
  • allocation lifecycle state

But long-lived MAAS workflow and reconciliation data should stay domain-local.

8. Events

The domain should emit typed events such as:

  • node.onboarding.started
  • node.onboarding.completed
  • node.onboarding.failed
  • node.onboarding.manual_intervention_required
  • node.decommission.started
  • node.decommission.completed
  • node.decommission.failed
  • node.decommission.manual_intervention_required

Exact shapes belong in asyncapi.draft.yaml.

9. Boundary With Infra-Owned MAAS Capabilities

Infra may continue to own, out of band:

  • commissioning/testing script payloads
  • MAAS host tuning/install
  • site-specific operator utilities

GPUasService may:

  • record that a site depends on these capabilities
  • probe/validate their presence/version

GPUasService should not:

  • accept arbitrary script blobs via API
  • become the owner of generic MAAS script distribution in v1

10. Packaging Direction

Near term:

  • keep implementation in the monorepo under a dedicated domain boundary
  • prefer a dedicated package path such as packages/services/maas/
  • allow site-level policy to stand in as the site's implicit default profile until profile CRUD/schema lands

Later, if complexity/ownership warrants:

  • extract Provisioning.BareMetal.MAAS to its own service

This document is meant to make that extraction path possible without changing the core domain contract.

11. Implementation Guardrail: Script Intent Reference

During the first implementation phase, the reviewed MAAS design docs are the architecture baseline. The current working MAAS scripts in ../maas should remain a behavioral intent reference for MAAS-specific quirks that are easy to lose in abstraction. They are not the implementation contract, however, and must not be copied mechanically into the control plane.

Examples:

  • commissioning flags such as skip_bmc_config=1
  • MAAS release recovery sequences such as abort -> mark-fixed -> release
  • datasource/cloud-init failure classification and bounded retry behavior
  • hardware-sync token timing and reseed assumptions
  • PXE/interface safety expectations observed on real hardware

Rules:

  • when implementing Temporal activities or MAAS-facing recovery logic, compare the planned behavior to the current ../maas scripts before merge
  • use the scripts to preserve operator intent and real-environment quirks, not to preserve shell structure or transport details
  • deviations from script behavior should be intentional and documented in the implementation PR/commit notes
  • an infra-coupled reviewer should review MAAS activity logic until the new workflow has been proven on real hardware

This keeps the control-plane implementation aligned with the real MAAS environment while still redesigning the shell prototype into typed workflow logic, explicit recovery semantics, and audited control-plane operations.