# Provisioning.BareMetal.MAAS API Boundary v1

## 1. Purpose
Define the domain boundary for Provisioning.BareMetal.MAAS so implementation, review, and future handoff stay coherent.
This document answers:

- what the MAAS domain owns
- which APIs it should expose through the control plane
- what remains outside the domain
- how it should interact with other GPUasService domains
Use this together with:
- MAAS_Bare_Metal_Lifecycle_v1.md
- MAAS_Node_State_Model_v1.md
- Domain_Ownership_Map.md
## 2. Domain Scope
Provisioning.BareMetal.MAAS owns:
- MAAS site records, credentials, and policy
- the future MAAS site profile model used to express multiple operational policy bundles per site
- site bootstrap bundle references used during MAAS onboarding
- machine onboarding/reimage/full decommission orchestration
- MAAS reconciliation for MAAS-managed nodes
- MAAS-specific capability probes and preflight validation
- MAAS-specific workflow/read models and recovery semantics
It does not own:

- generic user allocation provisioning
- catalog browsing UX
- arbitrary MAAS script payload distribution
- MAAS host installation/bootstrap/tuning
- external monitoring bots or one-off operator tools
## 3. Internal Dependency Map

| Domain | Relationship to Provisioning.BareMetal.MAAS |
|---|---|
| IAM | admin authz/audit for site and lifecycle actions |
| Inventory.Catalog | node records, status transitions, SKU binding |
| Provisioning.Allocation | force-release/drain dependencies before reimage/remove |
| Operations | deploy/runbook/validation ownership |
| UX | admin nodes/site pages, CLI/SDK wrappers |
Rules:

- MAAS integration should not scatter across unrelated packages.
- MAAS-specific orchestration logic should live behind a clear service boundary, even if still inside the monorepo.
## 4. API Surface Owned by the Domain
These are control-plane APIs that naturally belong to the MAAS domain.
### 4.1 Site management
| Endpoint | Purpose |
|---|---|
| POST /api/v1/admin/maas-sites | create site metadata |
| GET /api/v1/admin/maas-sites | list sites |
| GET /api/v1/admin/maas-sites/{id} | read one site |
| PATCH /api/v1/admin/maas-sites/{id} | update site config/policy |
| DELETE /api/v1/admin/maas-sites/{id} | soft-delete/disable site (or equivalent PATCH status transition) |
| POST /api/v1/admin/maas-sites/{id}/credentials | write/rotate Vault-backed site credentials |
| POST /api/v1/admin/maas-sites/{id}/probe | validate MAAS connectivity/capabilities |
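For orientation, a site create body might look like the sketch below. The field names (name, endpoint, description) are illustrative assumptions, not part of the contract defined here; credentials are intentionally absent because they flow through the separate credentials endpoint.

```json
{
  "name": "dc-west-1",
  "endpoint": "https://maas.dc-west-1.example.internal:5240/MAAS",
  "description": "Primary west-region MAAS site"
}
```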
### 4.2 Site configuration adjuncts
| Endpoint | Purpose |
|---|---|
| GET /api/v1/admin/maas-sites/{id}/profiles | list site profiles |
| POST /api/v1/admin/maas-sites/{id}/profiles | create profile |
| GET /api/v1/admin/maas-sites/{id}/profiles/{profile_id} | read one profile |
| PATCH /api/v1/admin/maas-sites/{id}/profiles/{profile_id} | update profile |
| DELETE /api/v1/admin/maas-sites/{id}/profiles/{profile_id} | disable one profile |
| POST /api/v1/admin/maas-sites/{id}/roce-assignments | bulk upsert typed RoCE assignment records |
| GET /api/v1/admin/maas-sites/{id}/roce-assignments | list RoCE assignment records |
| DELETE /api/v1/admin/maas-sites/{id}/roce-assignments/{assignment_id} | remove one assignment |
Notes:
- Current implementation stores policy at the site level as the bootstrap/default profile surface.
- Profile CRUD is the planned next structural step once operators need multiple policy bundles per MAAS site.
- Site scope should remain focused on connectivity, deploy identity, and site defaults.
- Profiles should own runtime target selection such as architecture and distro_series, plus optional PXE overrides where a site needs more than one PXE/network shape.
- RoCE assignments remain site-scoped records keyed by hostname; they are not profile-owned resources.
- Bulk upsert is the intended operator path for real hosts with multiple interface/IP rows.
Example bulk upsert body:

```json
{
  "items": [
    { "hostname": "c07u31", "interface": "enp28s0np0", "ipv4_cidr": "172.30.9.61/31" },
    { "hostname": "c07u31", "interface": "enp29s0np0", "ipv4_cidr": "172.29.1.219/31" }
  ]
}
```
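Because bulk upsert is the intended operator path, client/CLI tooling will likely want to pre-validate payloads before the API call. A minimal sketch of such a check is below; the function name and the exact rules (required keys, duplicate detection) are assumptions for illustration, not the service's validation contract.

```python
import ipaddress


def validate_roce_items(items: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the payload looks acceptable."""
    problems = []
    seen = set()
    for i, item in enumerate(items):
        hostname = item.get("hostname", "")
        interface = item.get("interface", "")
        cidr = item.get("ipv4_cidr", "")
        if not hostname or not interface:
            problems.append(f"item {i}: hostname and interface are required")
        try:
            # /31s are normal for point-to-point RoCE links, so use IPv4Interface,
            # which accepts a host address with any prefix length.
            ipaddress.IPv4Interface(cidr)
        except ValueError:
            problems.append(f"item {i}: ipv4_cidr {cidr!r} is not a valid IPv4 CIDR")
        key = (hostname, interface)
        if key in seen:
            problems.append(f"item {i}: duplicate (hostname, interface) pair {key}")
        seen.add(key)
    return problems
```

Checking duplicates on (hostname, interface) matches the upsert semantics: a real host contributes multiple interface/IP rows, but each interface should appear once per payload.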
### 4.3 Workflow execution and recovery
| Endpoint | Purpose |
|---|---|
| POST /api/v1/admin/onboardings | start single-node MAAS onboarding |
| POST /api/v1/admin/onboardings/batch | start batch MAAS onboarding |
| GET /api/v1/admin/onboardings/{id} | read one onboarding workflow |
| GET /api/v1/admin/onboardings | filter/search onboarding workflows |
| POST /api/v1/admin/onboardings/{id}/retry | retry current stage |
| POST /api/v1/admin/onboardings/{id}/resume | resume from safe checkpoint |
| POST /api/v1/admin/onboardings/{id}/rerun | rerun workflow from top with adoption |
| POST /api/v1/admin/onboardings/{id}/restart-clean | explicit reset/restart |
| POST /api/v1/admin/onboardings/{id}/cancel | cancel + compensate where possible |
| POST /api/v1/admin/onboardings/{id}/adopt | adopt externally observed MAAS/node state |
| POST /api/v1/admin/onboardings/{id}/mark-manual-intervention | freeze into manual intervention state |
Request-shape assumptions for v1 onboarding:
- site_id, profile_id, and sku_id are required for single and batch onboarding.
- hostname and ipmi_ip are required per node for onboarding requests.
- pxe_mac is not part of the normal operator-facing onboarding contract; if retained in storage, it is an observed/internal fallback field rather than a required input.
- batch onboarding shares site_id, profile_id, and sku_id at the top level and carries only per-node identity rows.
- discovery/adoption behavior remains policy-driven rather than request-driven.
- RoCE phase-2 assignment stays a separate site-scoped resource keyed by site_id + hostname; onboarding consumes it when the selected profile enables phase-2 RoCE.
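The assumptions above can be made concrete with hedged example bodies. The shapes below are sketches only; in particular, the nodes array key in the batch form is an assumed name, not a settled contract. Single-node onboarding:

```json
{
  "site_id": "site-123",
  "profile_id": "profile-456",
  "sku_id": "sku-789",
  "hostname": "c07u31",
  "ipmi_ip": "10.20.30.41"
}
```

Batch onboarding, sharing site_id, profile_id, and sku_id at the top level and carrying only per-node identity rows:

```json
{
  "site_id": "site-123",
  "profile_id": "profile-456",
  "sku_id": "sku-789",
  "nodes": [
    { "hostname": "c07u31", "ipmi_ip": "10.20.30.41" },
    { "hostname": "c07u32", "ipmi_ip": "10.20.30.42" }
  ]
}
```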
### 4.4 Decommission and reconcile
| Endpoint | Purpose |
|---|---|
| POST /api/v1/admin/nodes/{id}/decommission | start reimage/full decommission workflow |
| POST /api/v1/admin/nodes/{id}/storage-cleanup | storage-only cleanup path |
| GET /api/v1/admin/decommissions/{id} | read decommission workflow |
| GET /api/v1/admin/decommissions | filter/search decommission workflows |
| POST /api/v1/admin/decommissions/{id}/retry | retry failed stage |
| POST /api/v1/admin/decommissions/{id}/resume | resume from safe checkpoint |
| POST /api/v1/admin/decommissions/{id}/rerun | rerun decommission workflow from top with state adoption |
| POST /api/v1/admin/decommissions/{id}/restart-clean | explicit reset/restart where safe |
| POST /api/v1/admin/decommissions/{id}/cancel | cancel/best-effort abort |
| POST /api/v1/admin/decommissions/{id}/adopt | adopt externally observed MAAS/node state |
| POST /api/v1/admin/decommissions/{id}/mark-manual-intervention | freeze into manual intervention state |
| GET /api/v1/admin/reconciliation/status | reconciliation summary |
| GET /api/v1/admin/reconciliation/drift | drift list |
| POST /api/v1/admin/reconciliation/run | trigger immediate reconcile |
| POST /api/v1/admin/reconciliation/drift/{node_id}/resolve | acknowledge/resolve drift |
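A drift entry returned by the drift list might look like the sketch below; every field name here is an illustrative assumption pending the real read-model schema, and only the general shape (expected vs observed state plus detection metadata) is the point being made.

```json
{
  "node_id": "node-123",
  "site_id": "site-123",
  "expected": { "maas_status": "Deployed" },
  "observed": { "maas_status": "Broken" },
  "detected_at": "2025-01-01T00:00:00Z"
}
```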
## 5. API Shape Rules

### 5.1 Canonical transport
- Control-plane APIs are JSON-only.
- CSV is not part of the canonical service contract.
- If operators use CSV for bulk onboarding or RoCE assignments, conversion belongs in CLI/import tooling before the API call.
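Such CSV-to-JSON conversion in CLI/import tooling could look like the sketch below. The function name and the assumed column headers (hostname, interface, ipv4_cidr, mirroring the bulk upsert body) are illustrative, not a fixed tooling contract.

```python
import csv
import io
import json


def roce_csv_to_bulk_upsert(csv_text: str) -> str:
    """Convert operator CSV (hostname,interface,ipv4_cidr columns) into the
    JSON bulk-upsert body shape used by the RoCE assignments endpoint."""
    reader = csv.DictReader(io.StringIO(csv_text))
    items = [
        {
            "hostname": row["hostname"].strip(),
            "interface": row["interface"].strip(),
            "ipv4_cidr": row["ipv4_cidr"].strip(),
        }
        for row in reader
    ]
    return json.dumps({"items": items}, indent=2)
```

Keeping this in tooling, not the service, preserves the JSON-only canonical contract: the API never sees CSV.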
### 5.2 Security rules
- Admin-only surface; privileged mutations must audit.
- No API for arbitrary MAAS script blob upload.
- No API for arbitrary remote shell payload execution.
- Cloud-init content must come from controlled bundle/template inputs, not free-form operator shell blobs.
### 5.3 Recovery semantics
Recovery endpoints are intentionally distinct:
- retry
- resume
- rerun
- restart-clean
- adopt
- cancel
These are separate control-plane operations, not synonyms.
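To make the distinction concrete, a client-side sketch might model the six operations as an explicit enum rather than overloading one "restart" verb. The enum, its comments, and the URL helper are illustrative assumptions layered on the onboarding endpoints above, not part of the service contract.

```python
from enum import Enum


class RecoveryOp(Enum):
    """Distinct control-plane recovery operations; these are not interchangeable."""

    RETRY = "retry"                  # re-execute the current failed stage only
    RESUME = "resume"                # continue from the last safe checkpoint
    RERUN = "rerun"                  # run from the top, adopting existing external state
    RESTART_CLEAN = "restart-clean"  # explicit reset, discarding prior progress
    ADOPT = "adopt"                  # fold externally observed MAAS/node state into the workflow
    CANCEL = "cancel"                # stop and compensate where possible


def recovery_path(op: RecoveryOp, workflow_id: str) -> str:
    # Hypothetical URL construction mirroring the onboarding endpoint table above.
    return f"/api/v1/admin/onboardings/{workflow_id}/{op.value}"
```

Modeling the verbs as distinct values keeps callers from quietly treating, say, rerun and restart-clean as synonyms.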
## 6. Runtime Boundary

### 6.1 What talks to MAAS
Only control-plane MAAS integration code should talk to the MAAS API.
That means:
- cmd/api may start/query workflows and serve admin APIs
- worker/activity code executes MAAS API calls
- node-agent does not talk to MAAS directly
### 6.2 What talks to the node

For MAAS onboarding:

- initial post-deploy hardware-sync seed may require SSH before node-agent enrollment
- once node-agent is running, follow-up host actions should move through typed node tasks
This keeps the exception narrow and explicit.
## 7. Data Ownership
The domain should own these records:
- maas_sites
- maas_site_profiles (planned evolution from site-only policy)
- maas_power_credential_overrides
- maas_roce_assignments
- maas_site_policies
- node_onboardings
- node_onboarding_events
- node_decommissions
- node_decommission_events
- node_maas_state
It may coordinate transitions in:
- nodes
- allocation lifecycle state
But long-lived MAAS workflow and reconciliation data should stay domain-local.
## 8. Events
The domain should emit typed events such as:
- node.onboarding.started
- node.onboarding.completed
- node.onboarding.failed
- node.onboarding.manual_intervention_required
- node.decommission.started
- node.decommission.completed
- node.decommission.failed
- node.decommission.manual_intervention_required
Exact shapes belong in asyncapi.draft.yaml.
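As a placeholder until those shapes land, an event might carry something like the following; every field name here is an assumption for illustration, and asyncapi.draft.yaml remains the authority.

```json
{
  "type": "node.onboarding.failed",
  "node_id": "node-123",
  "onboarding_id": "onb-456",
  "site_id": "site-123",
  "stage": "deploy",
  "occurred_at": "2025-01-01T00:00:00Z"
}
```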
## 9. Boundary With Infra-Owned MAAS Capabilities

Infra may continue to own, out of band:

- commissioning/testing script payloads
- MAAS host tuning/install
- site-specific operator utilities

GPUasService may:

- record that a site depends on these capabilities
- probe/validate their presence/version

GPUasService should not:

- accept arbitrary script blobs via API
- become the owner of generic MAAS script distribution in v1
## 10. Packaging Direction
Near term:
- keep implementation in the monorepo under a dedicated domain boundary
- prefer a dedicated package path such as packages/services/maas/
- allow site-level policy to stand in as the site's implicit default profile until profile CRUD/schema lands
Later, if complexity/ownership warrants:
- extract Provisioning.BareMetal.MAAS to its own service
This document is meant to make that extraction path possible without changing the core domain contract.
## 11. Implementation Guardrail: Script Intent Reference
During the first implementation phase, the reviewed MAAS design docs are the architecture baseline. The current working MAAS scripts in ../maas
should remain a behavioral intent reference for MAAS-specific quirks that are easy to lose in abstraction, but they are not the implementation
contract and must not be copied mechanically into the control plane.
Examples:
- commissioning flags such as skip_bmc_config=1
- MAAS release recovery sequences such as abort -> mark-fixed -> release
- datasource/cloud-init failure classification and bounded retry behavior
- hardware-sync token timing and reseed assumptions
- PXE/interface safety expectations observed on real hardware
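As one example of the kind of quirk worth preserving, bounded retry on transient datasource/cloud-init failures could be sketched as below. The marker strings, retry bound, and outcome labels are all hypothetical; the real classification rules live in the ../maas scripts and should be ported deliberately, not inferred from this sketch.

```python
# Hypothetical transient-error markers and retry budget for illustration only.
RETRYABLE_MARKERS = ("DataSourceMAAS", "url_helper", "timed out")
MAX_ATTEMPTS = 3


def classify_cloud_init_failure(log_excerpt: str, attempt: int) -> str:
    """Return 'retry' while a transient datasource error is within the retry budget,
    otherwise 'manual-intervention' so the workflow freezes instead of looping."""
    transient = any(marker in log_excerpt for marker in RETRYABLE_MARKERS)
    if transient and attempt < MAX_ATTEMPTS:
        return "retry"
    return "manual-intervention"
```

The important intent is the bound: retries are finite and unknown failures escalate to a human rather than spinning.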
Rule:
- when implementing Temporal activities or MAAS-facing recovery logic, compare the planned behavior to the current ../maas scripts before merge
- use the scripts to preserve operator intent and real-environment quirks, not to preserve shell structure or transport details
- deviations from script behavior should be intentional and documented in the implementation PR/commit notes
- an infra-coupled reviewer should review MAAS activity logic until the new workflow has been proven on real hardware
This keeps the control-plane implementation aligned with the real MAAS environment while still redesigning the shell prototype into typed workflow logic, explicit recovery semantics, and audited control-plane operations.