# Provisioning.BareMetal.MAAS API Boundary v1

## 1. Purpose
Define the domain boundary for Provisioning.BareMetal.MAAS so implementation, review, and future handoff stay coherent.
This document answers:

- what the MAAS domain owns
- which APIs it should expose through the control plane
- what remains outside the domain
- how it should interact with other GPUasService domains
Use this together with:
- MAAS_Bare_Metal_Lifecycle_v1.md
- MAAS_Node_State_Model_v1.md
- Domain_Ownership_Map.md
## 2. Domain Scope
Provisioning.BareMetal.MAAS owns:
- MAAS site records, credentials, and policy
- the future MAAS site profile model used to express multiple operational policy bundles per site
- site bootstrap bundle references used during MAAS onboarding
- machine onboarding/reimage/full decommission orchestration
- MAAS reconciliation for MAAS-managed nodes
- MAAS-specific capability probes and preflight validation
- MAAS-specific workflow/read models and recovery semantics
It does not own:

- generic user allocation provisioning
- catalog browsing UX
- arbitrary MAAS script payload distribution
- MAAS host installation/bootstrap/tuning
- external monitoring bots or one-off operator tools
## 3. Internal Dependency Map

| Domain | Relationship to Provisioning.BareMetal.MAAS |
|---|---|
| IAM | admin authz/audit for site and lifecycle actions |
| Inventory.Catalog | node records, status transitions, SKU binding |
| Provisioning.Allocation | force-release/drain dependencies before reimage/remove |
| Operations | deploy/runbook/validation ownership |
| UX | admin nodes/site pages, CLI/SDK wrappers |
Rules:

- MAAS integration should not scatter across unrelated packages.
- MAAS-specific orchestration logic should live behind a clear service boundary, even if still inside the monorepo.
## 4. API Surface Owned by the Domain
These are control-plane APIs that naturally belong to the MAAS domain.
### 4.1 Site management
| Endpoint | Purpose |
|---|---|
| POST /api/v1/admin/maas-sites | create site metadata |
| GET /api/v1/admin/maas-sites | list sites |
| GET /api/v1/admin/maas-sites/{id} | read one site |
| PATCH /api/v1/admin/maas-sites/{id} | update site config/policy |
| DELETE /api/v1/admin/maas-sites/{id} | soft-delete/disable site (or equivalent PATCH status transition) |
| POST /api/v1/admin/maas-sites/{id}/credentials | write/rotate Vault-backed site credentials |
| POST /api/v1/admin/maas-sites/{id}/probe | validate MAAS connectivity/capabilities |
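For orientation, a site create body might look like the sketch below. The field names (name, endpoint, description) are illustrative assumptions, not part of the contract defined here; credentials are intentionally absent because they flow through the separate credentials endpoint.

```json
{
  "name": "dc-west-1",
  "endpoint": "https://maas.dc-west-1.example.internal:5240/MAAS",
  "description": "Primary west-region MAAS site"
}
```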
### 4.2 Site configuration adjuncts
| Endpoint | Purpose |
|---|---|
| GET /api/v1/admin/maas-sites/{id}/profiles | list site profiles |
| POST /api/v1/admin/maas-sites/{id}/profiles | create profile |
| GET /api/v1/admin/maas-sites/{id}/profiles/{profile_id} | read one profile |
| PATCH /api/v1/admin/maas-sites/{id}/profiles/{profile_id} | update profile |
| DELETE /api/v1/admin/maas-sites/{id}/profiles/{profile_id} | disable one profile |
| POST /api/v1/admin/maas-sites/{id}/roce-assignments | bulk upsert typed RoCE assignment records |
| GET /api/v1/admin/maas-sites/{id}/roce-assignments | list RoCE assignment records |
| DELETE /api/v1/admin/maas-sites/{id}/roce-assignments/{assignment_id} | remove one assignment |
Notes:
- Current implementation stores policy at the site level as the bootstrap/default profile surface.
- Profile CRUD is the planned next structural step once operators need multiple policy bundles per MAAS site.
- Site scope should remain focused on connectivity, deploy identity, and site defaults.
- Profiles should own runtime target selection such as architecture and distro_series, plus optional PXE overrides where a site needs more than one PXE/network shape.
- RoCE assignments remain site-scoped records keyed by hostname; they are not profile-owned resources.
- Bulk upsert is the intended operator path for real hosts with multiple interface/IP rows.
Example bulk upsert body:

```json
{
  "items": [
    { "hostname": "c07u31", "interface": "enp28s0np0", "ipv4_cidr": "172.30.9.61/31" },
    { "hostname": "c07u31", "interface": "enp29s0np0", "ipv4_cidr": "172.29.1.219/31" }
  ]
}
```
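Because bulk upsert is the intended operator path, client/CLI tooling will likely want to pre-validate payloads before the API call. A minimal sketch of such a check is below; the function name and the exact rules (required keys, duplicate detection) are assumptions for illustration, not the service's validation contract.

```python
import ipaddress


def validate_roce_items(items: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the payload looks acceptable."""
    problems = []
    seen = set()
    for i, item in enumerate(items):
        hostname = item.get("hostname", "")
        interface = item.get("interface", "")
        cidr = item.get("ipv4_cidr", "")
        if not hostname or not interface:
            problems.append(f"item {i}: hostname and interface are required")
        try:
            # /31s are normal for point-to-point RoCE links, so use IPv4Interface,
            # which accepts a host address with any prefix length.
            ipaddress.IPv4Interface(cidr)
        except ValueError:
            problems.append(f"item {i}: ipv4_cidr {cidr!r} is not a valid IPv4 CIDR")
        key = (hostname, interface)
        if key in seen:
            problems.append(f"item {i}: duplicate (hostname, interface) pair {key}")
        seen.add(key)
    return problems
```

Checking duplicates on (hostname, interface) matches the upsert semantics: a real host contributes multiple interface/IP rows, but each interface should appear once per payload.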
### 4.3 Workflow execution and recovery
| Endpoint | Purpose |
|---|---|
| POST /api/v1/admin/onboardings | start single-node MAAS onboarding |
| POST /api/v1/admin/onboardings/batch | start batch MAAS onboarding |
| GET /api/v1/admin/onboardings/{id} | read one onboarding workflow |
| GET /api/v1/admin/onboardings | filter/search onboarding workflows |
| POST /api/v1/admin/onboardings/{id}/retry | retry current stage |
| POST /api/v1/admin/onboardings/{id}/resume | resume from safe checkpoint |
| POST /api/v1/admin/onboardings/{id}/rerun | rerun workflow from top with adoption |
| POST /api/v1/admin/onboardings/{id}/restart-clean | explicit reset/restart |
| POST /api/v1/admin/onboardings/{id}/cancel | cancel + compensate where possible |
| POST /api/v1/admin/onboardings/{id}/adopt | adopt externally observed MAAS/node state |
| POST /api/v1/admin/onboardings/{id}/mark-manual-intervention | freeze into manual intervention state |
Request-shape assumptions for v1 onboarding:
- site_id, profile_id, and sku_id are required for single and batch onboarding.
- hostname and ipmi_ip are required per node for onboarding requests.
- pxe_mac is not part of the normal operator-facing onboarding contract; if retained in storage, it is an observed/internal fallback field rather than a required input.
- batch onboarding shares site_id, profile_id, and sku_id at the top level and carries only per-node identity rows.
- discovery/adoption behavior remains policy-driven rather than request-driven.
- RoCE phase-2 assignment stays a separate site-scoped resource keyed by site_id + hostname; onboarding consumes it when the selected profile enables phase-2 RoCE.
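The assumptions above can be made concrete with hedged example bodies. The shapes below are sketches only; in particular, the nodes array key in the batch form is an assumed name, not a settled contract. Single-node onboarding:

```json
{
  "site_id": "site-123",
  "profile_id": "profile-456",
  "sku_id": "sku-789",
  "hostname": "c07u31",
  "ipmi_ip": "10.20.30.41"
}
```

Batch onboarding, sharing site_id, profile_id, and sku_id at the top level and carrying only per-node identity rows:

```json
{
  "site_id": "site-123",
  "profile_id": "profile-456",
  "sku_id": "sku-789",
  "nodes": [
    { "hostname": "c07u31", "ipmi_ip": "10.20.30.41" },
    { "hostname": "c07u32", "ipmi_ip": "10.20.30.42" }
  ]
}
```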
### 4.4 Decommission and reconcile
| Endpoint | Purpose |
|---|---|
| POST /api/v1/admin/nodes/{id}/decommission | start reimage/full decommission workflow |
| POST /api/v1/admin/nodes/{id}/storage-cleanup | storage-only cleanup path |
| GET /api/v1/admin/decommissions/{id} | read decommission workflow |
| GET /api/v1/admin/decommissions | filter/search decommission workflows |
| POST /api/v1/admin/decommissions/{id}/retry | retry failed stage |
| POST /api/v1/admin/decommissions/{id}/resume | resume from safe checkpoint |
| POST /api/v1/admin/decommissions/{id}/rerun | rerun decommission workflow from top with state adoption |
| POST /api/v1/admin/decommissions/{id}/restart-clean | explicit reset/restart where safe |
| POST /api/v1/admin/decommissions/{id}/cancel | cancel/best-effort abort |
| POST /api/v1/admin/decommissions/{id}/adopt | adopt externally observed MAAS/node state |
| POST /api/v1/admin/decommissions/{id}/mark-manual-intervention | freeze into manual intervention state |
| GET /api/v1/admin/reconciliation/status | reconciliation summary |
| GET /api/v1/admin/reconciliation/drift | drift list |
| POST /api/v1/admin/reconciliation/run | trigger immediate reconcile |
| POST /api/v1/admin/reconciliation/drift/{node_id}/resolve | acknowledge/resolve drift |
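A drift entry returned by the drift list might look like the sketch below; every field name here is an illustrative assumption pending the real read-model schema, and only the general shape (expected vs observed state plus detection metadata) is the point being made.

```json
{
  "node_id": "node-123",
  "site_id": "site-123",
  "expected": { "maas_status": "Deployed" },
  "observed": { "maas_status": "Broken" },
  "detected_at": "2025-01-01T00:00:00Z"
}
```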
## 5. API Shape Rules

### 5.1 Canonical transport
- Control-plane APIs are JSON-only.
- CSV is not part of the canonical service contract.
- If operators use CSV for bulk onboarding or RoCE assignments, conversion belongs in CLI/import tooling before the API call.
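Such CSV-to-JSON conversion in CLI/import tooling could look like the sketch below. The function name and the assumed column headers (hostname, interface, ipv4_cidr, mirroring the bulk upsert body) are illustrative, not a fixed tooling contract.

```python
import csv
import io
import json


def roce_csv_to_bulk_upsert(csv_text: str) -> str:
    """Convert operator CSV (hostname,interface,ipv4_cidr columns) into the
    JSON bulk-upsert body shape used by the RoCE assignments endpoint."""
    reader = csv.DictReader(io.StringIO(csv_text))
    items = [
        {
            "hostname": row["hostname"].strip(),
            "interface": row["interface"].strip(),
            "ipv4_cidr": row["ipv4_cidr"].strip(),
        }
        for row in reader
    ]
    return json.dumps({"items": items}, indent=2)
```

Keeping this in tooling, not the service, preserves the JSON-only canonical contract: the API never sees CSV.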
### 5.2 Security rules
- Admin-only surface; privileged mutations must audit.
- No API for arbitrary MAAS script blob upload.
- No API for arbitrary remote shell payload execution.
- Cloud-init content must come from controlled bundle/template inputs, not free-form operator shell blobs.
### 5.3 Recovery semantics
Recovery endpoints are intentionally distinct:
- retry
- resume
- rerun
- restart-clean
- adopt
- cancel
These are separate control-plane operations, not synonyms.
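To make the distinction concrete, a client-side sketch might model the six operations as an explicit enum rather than overloading one "restart" verb. The enum, its comments, and the URL helper are illustrative assumptions layered on the onboarding endpoints above, not part of the service contract.

```python
from enum import Enum


class RecoveryOp(Enum):
    """Distinct control-plane recovery operations; these are not interchangeable."""

    RETRY = "retry"                  # re-execute the current failed stage only
    RESUME = "resume"                # continue from the last safe checkpoint
    RERUN = "rerun"                  # run from the top, adopting existing external state
    RESTART_CLEAN = "restart-clean"  # explicit reset, discarding prior progress
    ADOPT = "adopt"                  # fold externally observed MAAS/node state into the workflow
    CANCEL = "cancel"                # stop and compensate where possible


def recovery_path(op: RecoveryOp, workflow_id: str) -> str:
    # Hypothetical URL construction mirroring the onboarding endpoint table above.
    return f"/api/v1/admin/onboardings/{workflow_id}/{op.value}"
```

Modeling the verbs as distinct values keeps callers from quietly treating, say, rerun and restart-clean as synonyms.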
## 6. Runtime Boundary

### 6.1 What talks to MAAS
Only control-plane MAAS integration code should talk to the MAAS API.
That means:
- cmd/api may start/query workflows and serve admin APIs
- worker/activity code executes MAAS API calls
- node-agent does not talk to MAAS directly
### 6.2 What talks to the node

For MAAS onboarding:

- initial post-deploy hardware-sync seed may require SSH before node-agent enrollment
- once node-agent is running, follow-up host actions should move through typed node tasks
This keeps the exception narrow and explicit.
## 7. Data Ownership
The domain should own these records:
- maas_sites
- maas_site_profiles (planned evolution from site-only policy)
- maas_power_credential_overrides
- maas_roce_assignments
- maas_site_policies
- node_onboardings
- node_onboarding_events
- node_decommissions
- node_decommission_events
- node_maas_state
It may coordinate transitions in:
- nodes
- allocation lifecycle state
But long-lived MAAS workflow and reconciliation data should stay domain-local.
## 8. Events
The domain should emit typed events such as:
- node.onboarding.started
- node.onboarding.completed
- node.onboarding.failed
- node.onboarding.manual_intervention_required
- node.decommission.started
- node.decommission.completed
- node.decommission.failed
- node.decommission.manual_intervention_required
Exact shapes belong in asyncapi.draft.yaml.
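As a placeholder until those shapes land, an event might carry something like the following; every field name here is an assumption for illustration, and asyncapi.draft.yaml remains the authority.

```json
{
  "type": "node.onboarding.failed",
  "node_id": "node-123",
  "onboarding_id": "onb-456",
  "site_id": "site-123",
  "stage": "deploy",
  "occurred_at": "2025-01-01T00:00:00Z"
}
```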
## 9. Boundary With Infra-Owned MAAS Capabilities

Infra may continue to own, out of band:

- commissioning/testing script payloads
- MAAS host tuning/install
- site-specific operator utilities

GPUasService may:

- record that a site depends on these capabilities
- probe/validate their presence/version

GPUasService should not:

- accept arbitrary script blobs via API
- become the owner of generic MAAS script distribution in v1
## 10. Packaging Direction
Near term:
- keep implementation in the monorepo under a dedicated domain boundary
- prefer a dedicated package path such as packages/services/maas/
- allow site-level policy to stand in as the site's implicit default profile until profile CRUD/schema lands
Later, if complexity/ownership warrants:
- extract Provisioning.BareMetal.MAAS to its own service
This document is meant to make that extraction path possible without changing the core domain contract.
## 11. Implementation Guardrail: Script Intent Reference
During the first implementation phase, the reviewed MAAS design docs are the architecture baseline. The current working MAAS scripts in ../maas
should remain a behavioral intent reference for MAAS-specific quirks that are easy to lose in abstraction, but they are not the implementation
contract and must not be copied mechanically into the control plane.
Examples:
- commissioning flags such as skip_bmc_config=1
- MAAS release recovery sequences such as abort -> mark-fixed -> release
- datasource/cloud-init failure classification and bounded retry behavior
- hardware-sync token timing and reseed assumptions
- PXE/interface safety expectations observed on real hardware
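As one example of the kind of quirk worth preserving, bounded retry on transient datasource/cloud-init failures could be sketched as below. The marker strings, retry bound, and outcome labels are all hypothetical; the real classification rules live in the ../maas scripts and should be ported deliberately, not inferred from this sketch.

```python
# Hypothetical transient-error markers and retry budget for illustration only.
RETRYABLE_MARKERS = ("DataSourceMAAS", "url_helper", "timed out")
MAX_ATTEMPTS = 3


def classify_cloud_init_failure(log_excerpt: str, attempt: int) -> str:
    """Return 'retry' while a transient datasource error is within the retry budget,
    otherwise 'manual-intervention' so the workflow freezes instead of looping."""
    transient = any(marker in log_excerpt for marker in RETRYABLE_MARKERS)
    if transient and attempt < MAX_ATTEMPTS:
        return "retry"
    return "manual-intervention"
```

The important intent is the bound: retries are finite and unknown failures escalate to a human rather than spinning.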
Rule:
- when implementing Temporal activities or MAAS-facing recovery logic, compare the planned behavior to the current ../maas scripts before merge
- use the scripts to preserve operator intent and real-environment quirks, not to preserve shell structure or transport details
- deviations from script behavior should be intentional and documented in the implementation PR/commit notes
- an infra-coupled reviewer should review MAAS activity logic until the new workflow has been proven on real hardware
This keeps the control-plane implementation aligned with the real MAAS environment while still redesigning the shell prototype into typed workflow logic, explicit recovery semantics, and audited control-plane operations.