MAAS Bare Metal Lifecycle v1¶
1. Purpose¶
Define the end-to-end bare-metal node lifecycle managed through MAAS integration: automated onboarding, decommissioning modes, site configuration management, and state drift reconciliation between MAAS and GPUasService.
The dedicated state model and related contracts for this lifecycle live in:
- MAAS_Node_State_Model_v1.md
- MAAS_Recovery_Matrix_v1.md
- Provisioning_BareMetal_MAAS_API_Boundary_v1.md
- MAAS_Execution_Readiness_v1.md
This document covers:
- MAAS site entity model and credential lifecycle (Vault-backed)
- automated node onboarding via Temporal workflows (single + batch)
- cloud-init bootstrap injection (agent enrollment without manual intervention)
- decommission modes (soft reset, reimage, full decommission, storage-only cleanup)
- MAAS ↔ GPUasService state drift detection and reconciliation
- failure handling, compensation, and recovery at each workflow stage
- future extensibility for LXC/LXD and Weka storage integration
This document does not cover:
- MAAS installation/bootstrap on a new server
- PostgreSQL tuning for MAAS itself
- external notification integrations such as Telegram/email monitoring
Those are treated as MAAS site operations outside the GPUasService lifecycle contract. GPUasService assumes a working MAAS site already exists.
2. Context¶
2.1 Current state¶
- GPUasService node onboarding supports two modes: `manual` and `maas`. `manual` mode is fully implemented: the admin creates a node via the API, receives a bootstrap `curl` command with a temporary enrollment token, runs it on the node, the agent enrolls, and the node goes active. `maas` mode has a placeholder (`maas.Client.PrepareNodeOnboarding`) that calls an external HTTP endpoint — fire-and-forget with no state tracking.
- A separate repository (`maas-automation-TG`) contains production-tested bash scripts (`onboard_node.sh`, `deploy.sh`, `release_node.sh`, etc.) that drive the full MAAS lifecycle via CLI. These scripts are the working model for the automation logic.
- The bash scripts assume a happy path — no durable state, no retry across stages, no visibility into progress, no compensation on failure.
2.2 Target state¶
- MAAS onboarding is a Temporal workflow with per-stage retry, compensation, and observability.
- Admin triggers onboarding via API (single or batch), monitors progress, and can retry or cancel.
- The GPUasService node agent bootstrap is injected into MAAS cloud-init at deploy time — no manual SSH or script delivery.
- Site-level MAAS configuration is a first-class DB entity with Vault-backed secrets.
- A reconciliation loop detects and resolves state drift between MAAS and GPUasService.
2.3 Prototype boundary from maas-automation-TG¶
The external scripting repo remains the reference for current operational behavior, but not every script belongs in GPUasService:
- In scope for GPUasService lifecycle v1
  - machine discovery/claiming policy
  - commissioning, deploy, release/reimage flows
  - RoCE phase-2 MAAS configuration
  - cloud-init bundle composition
  - MAAS hardware sync seed/reseed/health
  - node cleanup/decommission semantics
- Out of scope for GPUasService lifecycle v1
  - MAAS install/bootstrap on a new host
  - MAAS PostgreSQL tuning
  - external monitoring/notification apps (Telegram/email)
  - one-off MAAS operator utilities that can remain standalone CLI/SDK tools
2.4 Capability ownership boundary¶
GPUasService should own the lifecycle contract and required invariants for MAAS-managed nodes. It should not become a generic remote-script distribution system for MAAS.
GPUasService-managed
- MAAS site records, secrets, and lifecycle policy
- onboarding/reimage/decommission workflows
- required runtime invariants for GPUasService-managed nodes:
  - expected boot image/profile
  - hardware sync policy
  - RoCE phase-2 enablement
  - discovery/claiming policy
  - bounded deploy retry policy
- controlled cloud-init bundle/profile composition
- capability probes and validation of required site prerequisites

Infra-managed initially
- MAAS commissioning/testing script payloads
- MAAS host install/bootstrap/tuning
- optional MAAS operator tooling and monitoring integrations

Security boundary
- GPUasService v1 must not expose an API for arbitrary MAAS script blob upload or arbitrary remote shell payload execution.
- If a MAAS site depends on commissioning/testing scripts, GPUasService may record that dependency and validate presence/version, but the script content itself remains infra-owned until promoted into a controlled platform-managed artifact.
This preserves the stronger GPUasService security model built around typed workflows, audited admin actions, and controlled bootstrap content rather than ad hoc script execution.
3. MAAS Site Configuration¶
3.1 Site entity¶
Each MAAS region controller is represented as a site. A site is the unit of credential management, network topology, and workflow scoping.
For the first implementation slice, site-level policy is stored directly against the site so the control plane can stand up a usable MAAS admin surface quickly. The intended steady-state model is site + profile(s):
- a site owns identity, connectivity, secret paths, and default environment wiring
- a profile owns operational policy bundles applied to onboarding/reimage/decommission flows
That future split matters because one MAAS site may need multiple operational modes over time:
- GPU default onboarding
- stricter reimage/recovery behavior
- alternate cloud-init bundles
- future non-GPU host classes such as LXC/LXD-oriented profiles
Until profile CRUD lands, the current site-level policy row should be treated as the site's implicit default profile.
maas_sites
id uuid PK
name text UNIQUE, NOT NULL -- "dc1-maas"
region_code text NOT NULL -- maps to GPUasService region
api_base_url text NOT NULL -- "http://10.176.46.1:5240/MAAS"
api_token_vault_path text NOT NULL -- "kv/maas-sites/{id}/api-token"
default_power_creds_vault_path text NOT NULL -- "kv/maas-sites/{id}/power/default"
pxe_iface text NOT NULL -- "ens19" (MAAS server side)
pxe_vlan_vid int NOT NULL -- 46
node_pxe_iface text NOT NULL -- "eno8303" (node side)
distro_series text NOT NULL DEFAULT 'ubuntu/noble'
architecture text NOT NULL DEFAULT 'amd64/generic'
upstream_dns text DEFAULT '1.1.1.1 8.8.8.8'
status text NOT NULL DEFAULT 'active' -- active, disabled
created_at timestamptz NOT NULL DEFAULT now()
updated_at timestamptz NOT NULL DEFAULT now()
3.2 Secrets in Vault¶
Secrets are never stored in the database. The DB stores Vault KV v2 paths only.
kv/maas-sites/{site_id}/api-token
→ { "token": "<consumer_key>:<token_key>:<token_secret>" }
kv/maas-sites/{site_id}/power/default
→ { "user": "root", "pass": "..." }
The Vault client (packages/shared/vault) already supports ReadKVV2 and WriteKVV2. Workflow activities read from Vault at execution time — not at startup — so credential rotation is automatic without service restart.
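To make "read at execution time" concrete, here is a minimal Go sketch of unwrapping a KV v2 read and building the site-scoped token path. It assumes a raw HTTP read of the KV v2 envelope for illustration; the real activities would go through the shared client's ReadKVV2.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// kvV2Response mirrors the Vault KV v2 read envelope: the secret
// payload is nested under data.data.
type kvV2Response struct {
	Data struct {
		Data map[string]string `json:"data"`
	} `json:"data"`
}

// parseKVv2Secret unwraps a Vault KV v2 read response body into the
// flat key/value map an activity consumes at execution time.
func parseKVv2Secret(body []byte) (map[string]string, error) {
	var resp kvV2Response
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode kv v2 response: %w", err)
	}
	if resp.Data.Data == nil {
		return nil, fmt.Errorf("kv v2 response has no secret payload")
	}
	return resp.Data.Data, nil
}

// maasTokenPath builds the site-scoped path stored in
// maas_sites.api_token_vault_path.
func maasTokenPath(siteID string) string {
	return fmt.Sprintf("kv/maas-sites/%s/api-token", siteID)
}
```

Because the path is resolved and read inside each activity, rotating the secret behind the same path takes effect on the next activity execution with no restart.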
3.2.1 Optional per-node power credential overrides¶
Site-wide defaults are the normal operating model. Some sites will still have exceptions: old BMC firmware, break-glass credentials for a subset of racks, or staged credential rotation. Model those as explicit overrides, not as ad hoc workflow inputs.
maas_power_credential_overrides
id uuid PK
site_id uuid FK → maas_sites
selector_type text NOT NULL -- hostname | ipmi_ip | pxe_mac
selector_value text NOT NULL
vault_path text NOT NULL -- "kv/maas-sites/{id}/power/overrides/{oid}"
status text NOT NULL DEFAULT 'active'
created_at timestamptz NOT NULL DEFAULT now()
updated_at timestamptz NOT NULL DEFAULT now()
UNIQUE (site_id, selector_type, selector_value)
Credential selection order:
1. Exact active override match by pxe_mac
2. Exact active override match by ipmi_ip
3. Exact active override match by hostname
4. Site default power credentials
This keeps the steady-state model simple while still supporting mixed fleets without baking credential logic into workflow inputs.
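The selection order above can be sketched as a small pure function; the ResolvePowerCredentials activity would then read the returned Vault path. Types and names here are illustrative, not the production schema bindings.

```go
package main

// PowerCredOverride is one active row from maas_power_credential_overrides.
type PowerCredOverride struct {
	SelectorType  string // "hostname" | "ipmi_ip" | "pxe_mac"
	SelectorValue string
	VaultPath     string
}

// resolvePowerCredVaultPath applies the documented selection order:
// exact pxe_mac override, then ipmi_ip, then hostname, then the site
// default path. Empty selector values are skipped.
func resolvePowerCredVaultPath(overrides []PowerCredOverride, hostname, ipmiIP, pxeMAC, siteDefaultPath string) string {
	for _, sel := range []struct{ typ, val string }{
		{"pxe_mac", pxeMAC},
		{"ipmi_ip", ipmiIP},
		{"hostname", hostname},
	} {
		if sel.val == "" {
			continue
		}
		for _, o := range overrides {
			if o.SelectorType == sel.typ && o.SelectorValue == sel.val {
				return o.VaultPath
			}
		}
	}
	return siteDefaultPath
}
```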
3.3 RoCE IP assignments¶
Replace the static roce_ips.csv file with a database table:
maas_roce_assignments
id uuid PK
site_id uuid FK → maas_sites
hostname text NOT NULL
interface text NOT NULL
ipv4_cidr text NOT NULL
created_at timestamptz NOT NULL DEFAULT now()
updated_at timestamptz NOT NULL DEFAULT now()
UNIQUE (site_id, hostname, interface)
Populated via admin API using typed JSON payloads. If operators maintain CSV files, conversion to JSON should happen in CLI/import tooling before calling the API. Queried by the RoCE phase-2 activity during onboarding.
3.4 Admin APIs for site management¶
POST /api/v1/admin/maas-sites -- create site (non-secret fields)
GET /api/v1/admin/maas-sites -- list sites
GET /api/v1/admin/maas-sites/{id} -- get site details
PATCH /api/v1/admin/maas-sites/{id} -- update site config
POST /api/v1/admin/maas-sites/{id}/credentials -- write secrets to Vault
POST /api/v1/admin/maas-sites/{id}/probe -- verify MAAS API reachable + token valid
DELETE /api/v1/admin/maas-sites/{id} -- disable (soft delete)
POST /api/v1/admin/maas-sites/{id}/roce-assignments -- bulk upsert (JSON array)
GET /api/v1/admin/maas-sites/{id}/roce-assignments -- list
DELETE /api/v1/admin/maas-sites/{id}/roce-assignments/{aid} -- remove single
Validation on credential write:
- Call MAAS GET /api/2.0/version/ with the provided token.
- If MAAS is unreachable or token is invalid, reject with 422.
- On success, write to Vault and store path in DB.
Seeding flow (first-time setup):
1. Admin calls POST /api/v1/admin/maas-sites with site config.
2. Admin calls POST /api/v1/admin/maas-sites/{id}/credentials with MAAS token + IPMI creds.
3. System validates MAAS connectivity, writes secrets to Vault, stores paths in DB.
4. Admin optionally uploads RoCE assignments via JSON API or a CLI/import helper that converts CSV into the same JSON shape.
5. Site is ready for onboarding workflows.
Example bulk RoCE assignment payload:
{
"items": [
{ "hostname": "c07u31", "interface": "enp28s0np0", "ipv4_cidr": "172.30.9.61/31" },
{ "hostname": "c07u31", "interface": "enp29s0np0", "ipv4_cidr": "172.29.1.219/31" },
{ "hostname": "c07u31", "interface": "enp62s0np0", "ipv4_cidr": "172.30.9.63/31" }
]
}
Notes:
- the resource is site-scoped and host-keyed by site_id + hostname
- multiple rows per hostname are expected, one per interface assignment
- operators may source this from CSV, but the API contract stays JSON-first
3.5 Credential lifecycle¶
| Operation | What happens |
|---|---|
| Initial seed | Admin calls credentials endpoint, system validates + writes to Vault |
| Rotate MAAS token | Admin calls credentials endpoint again — same Vault path, new value. Next activity execution picks it up. No restart. |
| Rotate default power creds | Same — default site Vault path stays the same, value changes. In-flight workflows use whatever was read at activity start. |
| Rotate node-specific power creds | Update the override Vault entry. Matching future workflows pick it up automatically. |
| Disable site | PATCH /api/v1/admin/maas-sites/{id} { "status": "disabled" }. New onboardings rejected. In-flight workflows continue to completion. |
| Re-enable site | PATCH /api/v1/admin/maas-sites/{id} { "status": "active" }. |
| Test connectivity | POST /api/v1/admin/maas-sites/{id}/probe — hits MAAS API, verifies token, returns version + rack controller info. |
3.6 Site policy knobs¶
The prototype scripts rely on site-specific behavior toggles. Capture them explicitly so workflow behavior is reproducible per site and not hidden in shell defaults.
maas_sites
id uuid PK
name text UNIQUE
region_code text NOT NULL
api_base_url text NOT NULL
api_token_vault_path text NOT NULL
default_power_creds_vault_path text NOT NULL
pxe_iface text NOT NULL
pxe_vlan_vid int NOT NULL
node_pxe_iface text NOT NULL
distro_series text NOT NULL
architecture text NOT NULL
deploy_user text NOT NULL DEFAULT 'hpcadmin'
deploy_password_vault_path text NOT NULL
deploy_ssh_iface text NOT NULL DEFAULT 'eno8303'
upstream_dns_servers text[] NOT NULL
status text NOT NULL DEFAULT 'active'
created_at timestamptz NOT NULL DEFAULT now()
updated_at timestamptz NOT NULL DEFAULT now()
maas_site_policies
site_id uuid PK/FK → maas_sites
strict_pxe_preflight bool NOT NULL DEFAULT true
enable_phase2_roce bool NOT NULL DEFAULT true
require_hw_sync bool NOT NULL DEFAULT true
hardware_sync_interval text NOT NULL DEFAULT '15m'
release_fallback_no_erase bool NOT NULL DEFAULT true
enable_deploy_retry_on_datasource_failure bool NOT NULL DEFAULT true
max_deploy_retry_attempts int NOT NULL DEFAULT 1
auto_claim_single_new_machine bool NOT NULL DEFAULT false
batch_max_parallel int NOT NULL DEFAULT 10
site_bootstrap_bundle_ref text NULL
enrollment_token_ttl_seconds int NOT NULL DEFAULT 7200
created_at timestamptz NOT NULL DEFAULT now()
updated_at timestamptz NOT NULL DEFAULT now()
Notes:
- require_hw_sync should remain true for MAAS-managed nodes in v1.
- hardware_sync_interval comes directly from prototype behavior and governs MAAS-side sync policy.
- deploy_user and deploy_ssh_iface are site-scoped, not profile-scoped. One MAAS site should expose one standard deploy identity.
- architecture and distro_series should be profile-scoped so one MAAS site can target more than one runtime shape.
- GPUaaS custom uploaded host images may be stored in product profile intent without the MAAS namespace prefix, but deploy calls must send the MAAS custom namespace form (custom/<image-name>). The MAAS execution client normalizes gpuaas-* image names at the API boundary so profile authors do not accidentally trigger MAAS's Ubuntu-series lookup path.
- pxe_iface, node_pxe_iface, and pxe_vlan_vid should behave as site defaults with profile-level overrides available when racks or node classes diverge.
- enable_deploy_retry_on_datasource_failure and max_deploy_retry_attempts capture the current script behavior of allowing a bounded redeploy only for datasource/cloud-init class failures.
- site_bootstrap_bundle_ref should replace the older extra_cloud_init_bundle_path concept. The real abstraction is an infra-owned, versioned site bootstrap bundle or script reference delivered at first boot before GPUaaS node bootstrap.
- auto_claim_single_new_machine=false should be the default. Ambiguous discovery must fail closed unless the site explicitly opts in.
- deploy password is still secret material and should live in Vault, not inline in the site row.
3.7 Future evolution: site profiles¶
The site policy model above is acceptable for the first MAAS implementation slice, but it should evolve into an explicit one-to-many profile model:
maas_site_profiles
id uuid PK
site_id uuid FK -> maas_sites
name text NOT NULL
description text NULL
status text NOT NULL DEFAULT 'active'
strict_pxe_preflight bool NOT NULL DEFAULT true
enable_phase2_roce bool NOT NULL DEFAULT true
require_hw_sync bool NOT NULL DEFAULT true
hardware_sync_interval text NOT NULL DEFAULT '15m'
release_fallback_no_erase bool NOT NULL DEFAULT true
enable_deploy_retry_on_datasource_failure bool NOT NULL DEFAULT true
max_deploy_retry_attempts int NOT NULL DEFAULT 1
auto_claim_single_new_machine bool NOT NULL DEFAULT false
batch_max_parallel int NOT NULL DEFAULT 10
site_bootstrap_bundle_ref text NULL
enrollment_token_ttl_seconds int NOT NULL DEFAULT 7200
created_at timestamptz NOT NULL DEFAULT now()
updated_at timestamptz NOT NULL DEFAULT now()
UNIQUE (site_id, name)
Expected transition:
- maas_sites continues to hold site identity and connectivity
- deploy_user, deploy_password_vault_path, and deploy_ssh_iface remain site-scoped
- maas_site_policies is treated as the current implicit default profile
- later, maas_site_profiles becomes the normal policy surface
- onboarding/decommission requests carry both site_id and profile_id
- each site may also declare a default_profile_id for simple operator flows
The profile split is a design direction, not a blocker for the current v1 bootstrap implementation.
4. Node Onboarding Workflow¶
4.1 API input¶
Assumptions for the first MAAS onboarding contract:
- hostname is required and operator-supplied.
- ipmi_ip is required and operator-supplied.
- pxe_mac is not part of the normal operator-facing onboarding request. If retained in storage later, it is observed/internal data for reconciliation or fallback, not a required input.
- site_id, profile_id, and sku_id are required request-level selectors.
- discovery/adoption behavior stays policy-driven; it is not overridden per request.
- RoCE phase-2 assignment is not inline onboarding input. It is a separately managed, pre-created assignment record keyed by site_id + hostname, and onboarding consumes it only when the selected profile enables phase-2 RoCE.
- storage/runtime attach behavior such as Weka is out of scope for initial node onboarding. For now onboarding only covers host bring-up and attached-local storage preparation.
Single node:
POST /api/v1/admin/onboardings
{
"site_id": "uuid",
"profile_id": "uuid",
"sku_id": "mi300x.192g.8gpu",
"ipmi_ip": "10.176.16.128",
"hostname": "c07u43"
}
Response:
{
"onboarding_id": "uuid", // Temporal workflow ID
"status": "pending"
}
Batch:
POST /api/v1/admin/onboardings/batch
{
"site_id": "uuid",
"profile_id": "uuid",
"sku_id": "mi300x.192g.8gpu",
"nodes": [
{ "ipmi_ip": "10.176.16.128", "hostname": "c07u43" },
{ "ipmi_ip": "10.176.16.129", "hostname": "c07u44" },
{ "ipmi_ip": "10.176.16.130", "hostname": "c07u45" }
]
}
Response:
{
"batch_id": "uuid", // parent workflow ID
"onboardings": [
{ "hostname": "c07u43", "onboarding_id": "uuid", "node_id": "uuid" },
{ "hostname": "c07u44", "onboarding_id": "uuid", "node_id": "uuid" },
{ "hostname": "c07u45", "onboarding_id": "uuid", "node_id": "uuid" }
]
}
Batch onboarding uses one shared site_id, profile_id, and sku_id. Per-node rows carry only identity/discovery input. Batch spawns a parent Temporal workflow that fans out child workflows per node with configurable concurrency.
API contract rule:
- The control-plane API is JSON-only for single and batch onboarding.
- CSV, if supported at all, belongs in CLI/import tooling that validates required headers and converts rows into the JSON API shape before submission.
Prototype import intent carried forward from the current scripts:
- Header-driven import should be the only supported CSV mode in tooling (hostname,ipmi_ip).
- Legacy positional CSV parsing is a script convenience, not part of the GPUasService contract.
4.2 Monitoring APIs¶
These query a durable onboarding/decommission read model populated by workflow progress events. Temporal visibility remains useful for workflow debugging and search, but it should not be the only operator-facing source of history or status.
The full operator API surface, including recovery actions, is defined in section 4.15 Operator APIs needed. This section should be read as the read/query surface rather than the complete mutation contract.
4.2.1 Onboarding read model¶
node_onboardings
onboarding_id uuid PK -- external/job id, also Temporal workflow ID
batch_id uuid NULL -- parent batch workflow id when present
node_id uuid NULL -- GPUasService node id once created
site_id uuid NOT NULL
hostname text NOT NULL
ipmi_ip inet NOT NULL
pxe_mac text NULL
maas_system_id text NULL
requested_by_user_id uuid NULL
current_stage text NOT NULL
current_attempt int NOT NULL DEFAULT 0
status text NOT NULL -- pending | running | completed | failed_retryable | failed_manual_intervention | cancelled | compensating | reconciled
error_code text NULL -- internal workflow/stage classification
error_message text NULL
error_details jsonb NOT NULL DEFAULT '{}'::jsonb
workflow_id text NOT NULL
workflow_run_id text NULL
requested_at timestamptz NOT NULL DEFAULT now()
started_at timestamptz NULL
completed_at timestamptz NULL
updated_at timestamptz NOT NULL DEFAULT now()
Optional stage-event detail for operator history:
node_onboarding_events
id uuid PK
onboarding_id uuid FK → node_onboardings
stage text NOT NULL
attempt int NOT NULL
status text NOT NULL -- started | succeeded | failed | compensated | skipped
message text NULL
details jsonb NOT NULL DEFAULT '{}'::jsonb
occurred_at timestamptz NOT NULL DEFAULT now()
API behavior:
- GET /api/v1/admin/onboardings/{id} reads node_onboardings + recent node_onboarding_events
- GET /api/v1/admin/onboardings?batch_id=... groups by batch_id
- filters should use the richer workflow status model, not a collapsed generic failed
- recovery endpoints are defined in section 4.15 Operator APIs needed
4.3 Workflow stages¶
MaasNodeOnboardWorkflow(input)
│
├─ Activity: LoadSiteConfig(site_id)
│ Read site record from DB + secrets from Vault.
│ Validate MAAS reachable.
│ Output: resolved SiteConfig (MAAS URL, token, default power creds, network config).
│
├─ Activity: ResolvePowerCredentials(site_config, hostname, ipmi_ip, pxe_mac)
│ 1. Evaluate active overrides in priority order.
│ 2. Fall back to site default power credentials.
│ 3. Return resolved BMC login for this workflow execution.
│ Output: PowerCredentials(user, pass).
│
├─ Activity: CreateOrFindInMaas(site_config, power_credentials, ipmi_ip, hostname, pxe_mac)
│ 1. Look up by hostname, then by IPMI power_address, then by PXE MAC.
│ 2. If not found, create machine via MAAS API (hostname + IPMI power config).
│ 3. If create doesn't return usable machine, fall back to IPMI PXE boot cycle + discovery poll.
│ 4. During discovery polling:
│ a. prefer a machine discovered during this workflow execution,
│ b. only auto-claim a sole `New` machine if site policy explicitly enables it,
│ c. otherwise fail closed on ambiguity.
│ Output: maas_system_id.
│ Idempotent: yes (lookup-first).
│
├─ Activity: CommissionNode(site_config, system_id)
│ 1. Check current MAAS status.
│ 2. If New, accept machine.
│ 3. Set hostname + IPMI power parameters.
│ 4. If already Ready, skip. If already Commissioning, just wait.
│ 5. Otherwise, start commissioning (enable_ssh=1, skip_bmc_config=1).
│ Output: void (machine is commissioning or ready).
│ Idempotent: yes (status-aware).
│
├─ Activity: WaitForMaasStatus(site_config, system_id, "Ready", timeout)
│ Poll MAAS machine status with heartbeats.
│ Detect failure states (Failed, Broken, Error) → return error immediately.
│ Output: void (machine is Ready).
│ Uses Temporal activity heartbeat to stay alive during long waits.
│
├─ Activity: ConfigureStorage(site_config, system_id)
│ 1. Read block devices from MAAS.
│ 2. Detect BOSS boot disk by model/name/id_path pattern (boss|boot optimized|m.2).
│ 3. Set boot disk.
│ 4. Apply flat storage layout.
│ Output: boss_disk_id.
│ Fails if: no matching disk found (hardware mismatch, needs admin intervention).
│
├─ Activity: ApplyRoCEPhase2(site_config, hostname)
│ 1. Query maas_roce_assignments for this hostname + site.
│ 2. For each row: find interface, create /31 subnet if missing, apply STATIC link.
│ 3. This MAAS-side phase is only half of the prototype contract; the matching guest OS
│ RoCE routing bundle must also be present in cloud-init so node-local policy routing
│ matches the MAAS interface links.
│ 4. Skip if no rows (non-fatal).
│ Output: count of IPs applied.
│ Idempotent: yes (skips if already applied).
│
├─ Activity: EnsurePxeInterfaceAuto(site_config, system_id)
│ 1. Find node PXE interface by name.
│ 2. Ensure it has AUTO or DHCP subnet link.
│ 3. If not, unlink existing and link as AUTO.
│ Output: void.
│ Pre-deploy validation: fail if interface has no valid subnet association.
│
├─ Activity: CreateGPUaaSNodeAndRenderCloudInit(site_config, system_id, host, sku, gpus, region)
│ 1. Create node record in GPUasService DB (status: enrolling, onboarding_mode: maas).
│ 2. Generate enrollment token with MAAS-specific TTL (must exceed expected deploy + first boot time).
│ 3. Render first-boot payload with ordered layers:
│ a. Optional infra-owned site bootstrap bundle reference.
│ b. GPUasService agent bootstrap (curl command with enrollment token).
│ c. MAAS hardware sync credentials + timer.
│ d. Extra disk initialization script.
│ e. RoCE routing policy script + systemd unit.
│ f. Deploy user creation.
│ 4. Get MAAS machine token (consumer_key, token_key, token_secret) for hw-sync.
│ 5. Substitute all template variables into the composed first-boot payload.
│ Output: gpuaas_node_id, cloud_init_b64.
│ Note: the extra disk initialization script is destructive:
│ it partitions/formats all non-root disks and mounts them as /shareN.
│ This behavior must stay explicit in admin UX/runbooks and never be implicit.
│
├─ Activity: DeployViaMaas(site_config, system_id, cloud_init_b64, distro_series, enable_hw_sync)
│ 1. Call MAAS deploy with user_data=cloud_init_b64.
│ Output: void (machine is deploying).
│
├─ Activity: WaitForMaasStatus(site_config, system_id, "Deployed", timeout)
│ Same as above — poll with heartbeats.
│ Output: void (machine is Deployed).
│
├─ Activity: ClassifyDeployFailure(site_config, system_id, hostname)
│ 1. Inspect MAAS machine details + recent MAAS events.
│ 2. Detect datasource/cloud-init-like failures (`no datasource`, cloud-init final stage, etc.).
│ 3. Return failure_class = datasource_like | generic.
│
├─ Activity: RecoverForDatasourceRetry(site_config, system_id, hostname)
│ 1. Abort in-progress deploy/commission/test if needed.
│ 2. If the machine is Deployed or Failed deployment, release it back to Ready.
│ 3. Re-enter deploy flow once, bounded by site policy.
│ Output: void.
│
├─ Activity: EnsureHardwareSyncConfigured(site_config, system_id, hostname)
│ 1. Verify MAAS machine token exists for this system.
│ 2. If token is absent, retry until MAAS emits it.
│ 3. Resolve node SSH address using the site policy preferred interface, then fall back to
│ the first MAAS-reported deployed IP if needed.
│ 4. Push /etc/maas/maas-machine-creds.yml to the deployed host.
│ During initial onboarding, this is the one explicit SSH-based seed step before the
│ GPUasService node-agent is available. After agent enrollment, subsequent reseed/restart
│ operations should move through a typed `node.hw_sync_reseed` task instead of SSH.
│ 5. Install/restart maas-agent + hardware-sync timer.
│ 6. Enforce enable_hw_sync=true for the machine and hardware_sync_interval for the site.
│ Output: void.
│
├─ Activity: WaitForHardwareSyncHealthy(site_config, system_id, timeout)
│ Poll MAAS until:
│ - enable_hw_sync=true
│ - last_sync is populated
│ - next_sync is populated
│ - is_sync_healthy=true (or null on MAAS builds that omit it)
│ Output: void.
│
├─ Activity: WaitForAgentEnrollment(node_id, timeout)
│ Poll GPUasService node status until it transitions to `active`.
│ This happens when the agent calls /internal/v1/nodes/enroll on first boot.
│ Output: void (node is active and ready for allocations).
│
└─ Done: node is active in GPUasService and healthy for ongoing MAAS hardware sync.
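The ClassifyDeployFailure and RecoverForDatasourceRetry stages reduce to a small pure decision: match MAAS event text against a datasource/cloud-init pattern list, then bound the redeploy by site policy. A minimal sketch follows; the patterns and helper names are illustrative, not the exhaustive production list.

```go
package main

import "regexp"

// datasourceFailurePatterns approximates the failure-text matching used
// to classify a failed deploy. Only datasource/cloud-init class failures
// qualify for the bounded site-policy redeploy.
var datasourceFailurePatterns = regexp.MustCompile(
	`(?i)(no datasource|datasourcenotfound|cloud-init.*(final|fail))`)

// classifyDeployFailure inspects MAAS event descriptions for a failed
// deploy and returns "datasource_like" or "generic".
func classifyDeployFailure(maasEvents []string) string {
	for _, ev := range maasEvents {
		if datasourceFailurePatterns.MatchString(ev) {
			return "datasource_like"
		}
	}
	return "generic"
}

// allowRedeploy enforces the bounded retry: only datasource-like
// failures retry, only when the site policy enables it, and only up to
// maas_site_policies.max_deploy_retry_attempts.
func allowRedeploy(failureClass string, attempt, maxAttempts int, enabled bool) bool {
	return enabled && failureClass == "datasource_like" && attempt < maxAttempts
}
```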
4.4 Batch workflow¶
MaasNodeBatchOnboardWorkflow(input)
│
├─ For each node in input.nodes (with concurrency limit):
│ ├─ ChildWorkflow: MaasNodeOnboardWorkflow(per_node_input)
│ └─ Record result (success / failure + error)
│
└─ Return batch summary: { succeeded: N, failed: N, details: [...] }
Temporal's child workflow pattern provides:
- Configurable max parallel onboardings (e.g. 10 at a time).
- Independent retry per node — one failure doesn't block others.
- Parent workflow tracks aggregate status.
- Per-node logs/events analogous to the current onboard_parallel.sh row logs should be preserved in the read model so operators can inspect one failed node without losing batch context.
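The real fan-out is expressed with Temporal child workflows, but the concurrency and aggregation contract is easy to state in plain Go. The sketch below uses a goroutine semaphore only to illustrate that contract (names like fanOut and errDeploy are invented for the example).

```go
package main

import (
	"errors"
	"sync"
)

// errDeploy is a demo sentinel used to exercise the failure path.
var errDeploy = errors.New("simulated deploy failure")

// NodeResult mirrors one row of the batch summary.
type NodeResult struct {
	Hostname string
	Err      error
}

// fanOut runs one onboarding function per node with at most maxParallel
// in flight, records per-node results, and aggregates counts. One node's
// failure never blocks the others.
func fanOut(hostnames []string, maxParallel int, onboard func(string) error) (succeeded, failed int, results []NodeResult) {
	sem := make(chan struct{}, maxParallel)
	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, h := range hostnames {
		wg.Add(1)
		sem <- struct{}{} // blocks when maxParallel workers are in flight
		go func(host string) {
			defer wg.Done()
			defer func() { <-sem }()
			err := onboard(host)
			mu.Lock()
			defer mu.Unlock()
			results = append(results, NodeResult{host, err})
			if err == nil {
				succeeded++
			} else {
				failed++
			}
		}(h)
	}
	wg.Wait()
	return
}
```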
4.5 Persistent state per onboarding¶
Tracked in a durable control-plane read model and mirrored into Temporal workflow state/search attributes:
onboarding_id -- workflow ID (correlation key)
batch_id -- parent workflow ID (null for single)
site_id -- MAAS site
hostname -- target hostname
ipmi_ip -- IPMI BMC address
stage -- current stage name
status -- running, completed, failed, cancelled
maas_system_id -- populated after CreateOrFind (null before)
gpuaas_node_id -- populated after CreateGPUaaSNode (null before)
boss_disk_id -- populated after ConfigureStorage (null before)
cloud_init_rendered -- bool
deploy_ip_addresses -- populated after deploy
error -- last failure reason
retry_count -- per-stage retry tracking
started_at -- workflow start
updated_at -- last stage transition
Temporal search attributes should be optimized for discovery (site_id, hostname, status, current_stage, batch_id), while the SQL read model remains the source for admin APIs, audit joins, and long-lived history.
4.6 Failure handling and compensation¶
| Stage | Failure mode | Retry strategy | Compensation on cancel |
|---|---|---|---|
| LoadSiteConfig | Vault unreachable, site disabled | Retry 3x with backoff. If site disabled, fail immediately. | None |
| CreateOrFindInMaas | IPMI unreachable, MAAS API down, discovery timeout | Retry 3x (create is idempotent). Discovery timeout: fail, admin reviews IPMI/PXE. | None (machine may exist in MAAS — leave it, admin can clean up) |
| CommissionNode | MAAS API error | Retry 3x. | None (commissioning can be re-triggered) |
| WaitForReady | Timeout, enters Failed/Broken | If Failed: log MAAS error details, mark failed_commission. If timeout: mark failed_commission_timeout. | None (machine stays in MAAS as-is) |
| ConfigureStorage | BOSS disk not found | Fail immediately — hardware mismatch, needs admin. No retry. | None |
| ApplyRoCEPhase2 | Interface not found, subnet create fails | Non-fatal by default. Log warning, continue. | None |
| EnsurePxeInterfaceAuto | Validation fails (no subnet link) | Fail if STRICT_PXE_PREFLIGHT. Retry 1x after re-linking. | None |
| CreateGPUaaSNodeAndRenderCloudInit | SKU not found, duplicate node | SKU invalid: fail immediately (input error). Duplicate: recover existing node_id. | Delete GPUasService node record if created. Consume enrollment token. |
| DeployViaMaas | MAAS API error, deploy rejected | Retry 2x. | Release machine back to Ready in MAAS. |
| WaitForDeployed | Timeout, enters Failed | Release back to Ready (compensation), then retry deploy. If the failure class is datasource/cloud-init-like, allow the bounded site-policy deploy retry path. | Release machine back to Ready in MAAS. |
| ClassifyDeployFailure | MAAS event query fails, failure class unknown | Best-effort. If classification fails, treat as generic deploy failure and stop automatic retry. | None |
| RecoverForDatasourceRetry | Abort/release recovery fails | Retry 1x. If recovery still fails, stop and require admin intervention. | None |
| EnsureHardwareSyncConfigured | Token missing, no deploy-reachable SSH IP, creds push fails, maas-agent install fails | Retry until timeout. If MAAS never emits machine token, fail failed_hw_sync_seed. | Do not release from MAAS; leave node deployed for investigation. |
| WaitForHardwareSyncHealthy | enable_hw_sync=false, last_sync absent, unhealthy sync | Retry with periodic re-seed of creds/timer. Fail failed_hw_sync_health on timeout. | Do not mark node active. Alert admin. |
| WaitForAgentEnrollment | Token expired, agent can't reach API | Regenerate enrollment token, wait again. Max 2 attempts. | Node is deployed in MAAS but not active in GPUasService. Mark failed_enrollment. Do NOT release from MAAS (OS is running). |
Key compensation principle: pre-deploy failures are retryable in place (machine stays Ready). Deploy failures compensate by releasing back to Ready. Post-deploy failures (hardware sync seed/health, enrollment) leave MAAS alone and only affect GPUasService state until an admin resolves the node.
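That principle can be encoded directly so compensation is chosen by stage, not ad hoc. A sketch using the stage names from this workflow (the string return values are illustrative labels):

```go
package main

// compensationFor encodes the compensation principle: pre-deploy stages
// are retryable in place (machine stays Ready), the deploy stages
// compensate by releasing the machine back to Ready, and post-deploy
// stages leave MAAS alone and only affect GPUasService state.
func compensationFor(stage string) string {
	switch stage {
	case "DeployViaMaas", "WaitForDeployed":
		return "release_to_ready"
	case "EnsureHardwareSyncConfigured", "WaitForHardwareSyncHealthy", "WaitForAgentEnrollment":
		return "leave_maas_deployed"
	default:
		return "none" // pre-deploy stages: retry in place
	}
}
```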
4.7 Failure taxonomy¶
Recovery decisions must not be driven by one generic failed state. The workflow/read model should classify failures so retry, resume, rerun, and manual intervention are deterministic.
| Failure class | Typical examples | Default handling |
|---|---|---|
| input_config_error | invalid SKU, malformed onboarding input, disabled site | fail immediately, no automatic retry |
| site_capability_missing | required site policy absent, required MAAS capability not available | fail immediately, operator fixes site capability |
| upstream_transient | MAAS API timeout, Vault/network transient | bounded automatic retry |
| bmc_power_failure | IPMI unreachable, power-cycle failure | bounded retry, then manual intervention |
| pxe_discovery_failure | discovery timeout, PXE never appears in MAAS | bounded retry, then manual intervention |
| hardware_mismatch | BOSS disk not found, expected NIC/interface absent | fail and block for manual intervention |
| deploy_cloud_init_failure | failed deployment, datasource/cloud-init class failure | classify and allow bounded recovery path |
| post_deploy_connectivity_failure | no deploy-reachable SSH IP, creds push fails | bounded retry, then manual intervention |
| hardware_sync_failure | token never emitted, sync unhealthy, timer/agent drift | bounded retry/reseed, then manual intervention |
| agent_enrollment_failure | enrollment timeout, API unreachable from node, token expired | bounded retry with token refresh, then manual intervention |
| state_ambiguity | conflicting discovery candidates, out-of-band MAAS change, late success after failure | stop automation and require reconcile/adopt decision |
These classes should be stored in the read model as first-class fields, not only inferred from free-form error text.
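The taxonomy can then be produced by a deterministic classifier instead of string matching scattered across activities. A minimal Python sketch, where `FailureSignal` and `classify_failure` are assumed illustrative names (not the implemented API) and only a few stages are shown:

```python
from dataclasses import dataclass

@dataclass
class FailureSignal:
    stage: str            # workflow stage that observed the failure
    error_text: str       # bounded/sanitized upstream error snippet
    maas_reachable: bool = True

def classify_failure(sig: FailureSignal) -> str:
    """Map observed signals to a first-class failure class (see table above)."""
    if not sig.maas_reachable:
        return "upstream_transient"
    text = sig.error_text.lower()
    if sig.stage == "CommissionNode" and "boss" in text:
        return "hardware_mismatch"
    if sig.stage == "WaitForDeployed" and "cloud-init" in text:
        return "deploy_cloud_init_failure"
    if sig.stage == "WaitForAgentEnrollment":
        return "agent_enrollment_failure"
    # Unknown ownership: stop automation rather than guess a retry path.
    return "state_ambiguity"
```

Storing the returned class as its own column, rather than re-parsing `error_message` later, is what makes retry/resume/rerun decisions deterministic.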
4.8 Workflow recovery state model¶
The onboarding workflow needs its own operational state in addition to coarse node state.
Suggested workflow/job states:
- pending
- running
- completed
- failed_retryable
- failed_manual_intervention
- cancelled
- compensating
- reconciled
The critical distinction is:
- failed_retryable: system believes a safe rerun/retry path exists
- failed_manual_intervention: automation must stop and surface the issue to an operator
nodes.status should stay coarse (enrolling, active, offline, quarantined, retired, removing). Detailed progress and failure semantics belong in the onboarding/decommission read models.
For inventory lifecycle transitions that temporarily use coarse states such as draining or removing, the workflow must still own resumability explicitly. A coarse node state is not enough by itself; the control plane must be able to:
- detect whether the owning lifecycle task still exists and is claimable
- requeue a stale dispatched lifecycle task
- recreate an expired or missing lifecycle task without duplicating a live one
- adopt an already advanced coarse state on rerun/resume instead of failing on a state conflict
4.9 Operator recovery semantics¶
Operator actions must be explicit and auditable. These are not interchangeable:
| Action | Meaning | Default use |
|---|---|---|
| `retry_stage` | rerun only the current failed stage | targeted recovery when the stage is known to be safe/idempotent |
| `resume` | continue from the last safe stage checkpoint | preferred after transient issues or after operator fixes an external prerequisite |
| `rerun` | re-enter the full onboarding workflow from the top, with status-aware stage adoption | default recovery for fresh-node onboarding |
| `restart_clean` | explicitly reset/compensate known progress, then start again | stronger recovery when prior state is suspect |
| `cancel` | stop the workflow and compensate where possible | operator abort |
| `mark_manual_intervention_required` | freeze automation until a human resolves the condition | used for hardware mismatch, ambiguity, repeated bounded failure |
| `adopt_observed_state` | accept externally advanced MAAS/node state and move workflow forward | used after out-of-band operator action or late success |
For fresh-node onboarding, rerun should be the primary operator recovery path. The workflow should be written so top-level re-entry converges on the desired state rather than duplicating work.
For node retirement and removal, resume must also cover lifecycle-task recovery:
- disable_node / drain resume should requeue or recreate node.drain when the node is already draining
- remove resume should requeue or recreate node.uninstall when the node is already removing
- a workflow must never treat draining or removing as a terminal conflict if the intended lifecycle action is already in progress
4.10 Safe re-entry rules¶
Every stage must define whether it can be re-entered safely and under what checks.
| Stage | Safe to rerun from top? | Requires state refresh first? | Requires cleanup first? | Notes |
|---|---|---|---|---|
| `LoadSiteConfig` | yes | no | no | pure read |
| `ResolvePowerCredentials` | yes | no | no | pure read |
| `CreateOrFindInMaas` | yes | yes | no | lookup-first, adopt existing machine if present |
| `CommissionNode` | yes | yes | no | status-aware; skip if already Ready |
| `WaitForReady` | yes | yes | no | must inspect current MAAS state before deciding timeout/failure outcome |
| `ConfigureStorage` | yes | yes | no | safe only while machine is in editable pre-deploy MAAS state |
| `ApplyRoCEPhase2` | yes | yes | no | idempotent link application |
| `EnsurePxeInterfaceAuto` | yes | yes | maybe | only mutate if interface still editable in MAAS |
| `CreateGPUaaSNodeAndRenderCloudInit` | yes | yes | maybe | recover existing node row if already created |
| `DeployViaMaas` | not blindly | yes | sometimes | never issue deploy again without checking if machine already advanced/faulted |
| `WaitForDeployed` | yes | yes | maybe | may trigger classify/recover path instead of pure wait |
| `EnsureHardwareSyncConfigured` | yes | yes | no | safe to reseed repeatedly |
| `WaitForHardwareSyncHealthy` | yes | yes | no | wait/reseed loop |
| `WaitForAgentEnrollment` | yes | yes | maybe | may require fresh token or adoption of late success |
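One way to make this table executable is a per-stage guard consulted before any re-entry. `STAGE_POLICY` and `may_reenter` are assumed illustrative names, and only a few stages are shown:

```python
# (safe_to_rerun, needs_state_refresh, needs_cleanup) per stage; values
# mirror a few rows of the re-entry table above, simplified for illustration.
STAGE_POLICY = {
    "LoadSiteConfig": (True,  False, False),
    "CommissionNode": (True,  True,  False),
    "DeployViaMaas":  (False, True,  True),   # "not blindly": checks required
}

def may_reenter(stage: str, refreshed: bool, cleaned: bool) -> bool:
    """Allow re-entry only after the table's preconditions are satisfied."""
    safe, need_refresh, need_cleanup = STAGE_POLICY[stage]
    if need_refresh and not refreshed:
        return False
    if need_cleanup and not cleaned:
        return False
    # A stage marked unsafe may still proceed once refresh + cleanup are done.
    return safe or (refreshed and cleaned)
```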
4.11 Manual intervention boundary¶
Some failures should intentionally stop automation and surface a blocked state instead of looping retries:
- BOSS disk not found
- conflicting discovery candidates
- wrong PXE interface wiring that cannot be auto-corrected
- machine deleted from MAAS mid-workflow
- repeated datasource/cloud-init failure after bounded retry
- repeated post-deploy SSH/hardware-sync failure after bounded retry
- workflow belief conflicts with observed MAAS/node state
These should land in:
- workflow state: failed_manual_intervention
- node state: remain coarse (enrolling or quarantined, depending on context)
4.12 Rich diagnostics and audit requirements¶
The onboarding/decommission read model should capture more than a message string:
- failure class
- current stage
- last observed MAAS status
- last observed MAAS power state
- last observed MAAS IPs
- stage-specific details (system_id, interface, disk ids, token status, etc.)
- upstream diagnostic snippet/reference (bounded/sanitized)
- recommended next operator action
Every operator recovery action must audit:
- actor identity and role
- reason
- prior workflow state/stage
- requested recovery action
- expected target state
- correlation/workflow ids
4.13 In-flight reconciliation rules¶
The workflow must account for out-of-band and late-arriving changes while it is still running:
- operator changes MAAS state directly
- machine disappears or is re-created
- MAAS reports a later state than expected
- node-agent enrolls after the workflow already marked a failure
Design rules:
- refresh observed MAAS state before every destructive or irreversible step
- define adoption rules for late success (WaitForAgentEnrollment should be able to accept a late enrollment if the node is now healthy)
- define conflict rules for out-of-band changes (move to state_ambiguity / manual intervention when ownership is unclear)
- reconciliation can mark a workflow reconciled when operator/adoption logic successfully realigns observed state with desired state
4.14 Capability preflight and timeout policy¶
Before a workflow starts, run a cheap preflight to fail fast on missing prerequisites:
- site is active
- MAAS reachable
- MAAS token valid
- required site policy present
- required site capability flags satisfied
Timeouts are product behavior and must be policy-driven and documented:
- discovery timeout
- commission timeout
- deploy timeout
- hardware sync seed timeout
- hardware sync health timeout
- agent enrollment timeout
Each timeout should map to:
- a failure class
- a workflow state (failed_retryable vs failed_manual_intervention)
- a recommended operator action
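This mapping can be sketched as a small policy table; the tuples below are illustrative assumptions, not shipped defaults:

```python
# Hypothetical timeout policy: kind -> (failure class, workflow state,
# recommended operator action). Values are examples only.
TIMEOUT_POLICY = {
    "discovery":        ("pxe_discovery_failure",    "failed_retryable",           "retry_stage"),
    "commission":       ("upstream_transient",       "failed_retryable",           "resume"),
    "deploy":           ("deploy_cloud_init_failure","failed_retryable",           "rerun"),
    "hw_sync_seed":     ("hardware_sync_failure",    "failed_manual_intervention", "mark_manual_intervention_required"),
    "hw_sync_health":   ("hardware_sync_failure",    "failed_manual_intervention", "mark_manual_intervention_required"),
    "agent_enrollment": ("agent_enrollment_failure", "failed_retryable",           "retry_stage"),
}

def on_timeout(kind: str) -> dict:
    cls, state, action = TIMEOUT_POLICY[kind]
    return {"failure_class": cls, "workflow_state": state,
            "recommended_action": action}
```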
4.15 Operator APIs needed¶
The admin API surface should support the recovery model explicitly:
- GET /api/v1/admin/onboardings/{id}
- GET /api/v1/admin/onboardings?batch_id=...
- POST /api/v1/admin/onboardings/{id}/retry
- POST /api/v1/admin/onboardings/{id}/resume
- POST /api/v1/admin/onboardings/{id}/rerun
- POST /api/v1/admin/onboardings/{id}/restart-clean
- POST /api/v1/admin/onboardings/{id}/cancel
- POST /api/v1/admin/onboardings/{id}/adopt
- POST /api/v1/admin/onboardings/{id}/mark-manual-intervention
- POST /api/v1/admin/reconciliation/run
The contract can be narrowed later, but the design should assume these are distinct operations with different semantics.
4.16 Discovery ambiguity policy¶
The prototype scripts encode a practical but potentially dangerous machine-claiming strategy. The workflow must make this policy explicit.
Default behavior:
1. Match by hostname.
2. Match by IPMI power_address.
3. Match by PXE MAC if provided.
4. If the workflow created or triggered discovery and exactly one new MAAS machine appears during that run, claim it.
5. If multiple candidates remain, fail with an explicit ambiguity error.
6. Only if site policy auto_claim_single_new_machine=true, allow claiming the sole New machine when no stronger match exists.
This keeps v1 safe-by-default while still allowing the prototype’s convenience behavior in tightly controlled sites.
5. Decommission Modes¶
5.1 Overview¶
A node has two independent lifecycles: MAAS (bare metal state) and GPUasService (allocation/agent state). Decommission means different things depending on what's being cleaned up.
POST /api/v1/admin/nodes/{id}/decommission
{
"mode": "soft_reset | reimage | full_decommission",
"erase": "none | quick | secure", // reimage + full only
"storage_cleanup": "none | local | weka | all", // soft_reset + reimage
"reason": "tenant_transition | security_patch | hardware_removal | ..."
}
Response:
{
"decommission_id": "uuid", // Temporal workflow ID
"mode": "reimage",
"status": "started"
}
5.2 Mode 1: Soft Reset (allocation turnover)¶
Use case: Tenant's allocation released, node stays deployed, ready for next tenant immediately.
Workflow:
SoftResetWorkflow(node_id, storage_cleanup)
├─ Activity: DisableNode(node_id)
│ Mark node unavailable for new allocations.
│
├─ Activity: DrainNode(node_id)
│ Close active terminal sessions.
│ Revoke all allocation users (agent task: allocation.revoke_user).
│ Wait for task completion.
│
├─ Activity: CleanupStorage(node_id, storage_cleanup)
│ If "local": scrub /tmp, /var/tmp, clear local scratch.
│ If "weka": unmount Weka POSIX mounts, remove tenant Weka config.
│ If "all": both.
│ If "none": skip.
│
├─ Activity: ScrubGPU(node_id)
│ Reset GPU memory/VRAM state.
│ Vendor-specific: rocm-smi --resetgpu (AMD) or nvidia-smi --gpu-reset (NVIDIA).
│
├─ Activity: ValidateCleanNode(node_id)
│ Verify: no tenant users exist, no tenant processes running,
│ no tenant mounts, GPU memory clear.
│ If validation fails: quarantine node, alert admin.
│
├─ Activity: EnableNode(node_id)
│ Mark node available for allocations.
│
└─ Done: node is clean and available.
MAAS involvement: None. Node stays Deployed. Agent stays running.
What's missing today: GPU scrub task, Weka cleanup task, post-cleanup validation task.
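To make the sequencing and the quarantine-on-residue rule concrete, here is a toy sketch over a plain state dict. Field names are illustrative, not the agent task contract:

```python
def soft_reset(state: dict, storage_cleanup: str = "none") -> str:
    """Toy model of SoftResetWorkflow; returns the final node status."""
    state["enabled"] = False                              # DisableNode
    state["users"] = []                                   # DrainNode: revoke users
    if storage_cleanup in ("local", "all"):
        state["scratch"] = []                             # scrub local scratch
    if storage_cleanup in ("weka", "all"):
        state["mounts"] = [m for m in state.get("mounts", [])
                           if not m.startswith("/weka")]  # drop tenant Weka mounts
    state["gpu_dirty"] = False                            # ScrubGPU
    residue = state["users"] or state.get("mounts", []) or state["gpu_dirty"]
    if residue:                                           # ValidateCleanNode failed
        state["status"] = "quarantined"                   # alert admin, stay disabled
    else:
        state["enabled"] = True                           # EnableNode
        state["status"] = "available"
    return state["status"]
```

The key design point the sketch preserves: `EnableNode` only runs after validation passes, so a dirty node can never re-enter the allocation pool.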
5.3 Mode 2: Reimage (OS reset, same hardware)¶
Use case: OS drift, security patch cycle, major tenant transition requiring clean OS.
Workflow:
ReimageWorkflow(node_id, site_id, erase, storage_cleanup)
├─ Activity: DisableNode(node_id)
│ Mark node unavailable.
│
├─ Activity: DrainNode(node_id)
│ Close terminals, revoke users, wait.
│
├─ Activity: CleanupStorage(node_id, storage_cleanup)
│ Same as soft reset.
│
├─ Activity: LoadSiteConfig(site_id)
│ Read MAAS site config + Vault secrets.
│
├─ Activity: ReleaseMaasNode(site_config, system_id, erase)
│ Call MAAS release with erase mode (none/quick/secure).
│ If erase fails: fallback to no-erase release (configurable).
│ Wait for machine to reach Ready.
│
├─ Activity: ConfigureStorage(site_config, system_id)
│ Re-detect BOSS disk, set boot disk, flat layout.
│ (Hardware hasn't changed, but MAAS resets storage config on release.)
│
├─ Activity: ApplyRoCEPhase2(site_config, hostname)
│ Re-apply RoCE IPs (MAAS clears interface config on release).
│
├─ Activity: EnsurePxeInterfaceAuto(site_config, system_id)
│ Re-ensure PXE interface AUTO link.
│
├─ Activity: RenderCloudInit(site_config, system_id, node_id)
│ 1. Generate new enrollment token (existing cert may still work,
│ but fresh token ensures recovery if cert was lost during reimage).
│ 2. Render cloud-init with agent bootstrap.
│
├─ Activity: DeployViaMaas(site_config, system_id, cloud_init_b64, distro_series, enable_hw_sync)
│ 1. Deploy via MAAS.
│
├─ Activity: WaitForMaasStatus(site_config, system_id, "Deployed", timeout)
│ 1. Wait for Deployed.
│
├─ Activity: EnsureHardwareSyncConfigured(site_config, system_id, hostname)
│
├─ Activity: WaitForHardwareSyncHealthy(site_config, system_id, timeout)
│
├─ Activity: WaitForAgentEnrollment(node_id, timeout)
│ Agent re-enrolls on boot (or reuses existing cert if still valid).
│
├─ Activity: EnableNode(node_id)
│ Mark node available.
│
└─ Done: node has fresh OS, agent re-enrolled, ready for allocations.
GPUasService node ID stays the same. The physical machine and its control-plane identity are preserved across reimage. Reimage is an OS reset, not a node replacement.
5.4 Mode 3: Full Decommission (remove from fleet)¶
Use case: Hardware returned, failed beyond repair, moved to different cluster.
Workflow:
FullDecommissionWorkflow(node_id, site_id, erase)
├─ Activity: DisableNode(node_id)
│
├─ Activity: ForceReleaseAllocations(node_id)
│ Force-release any active allocation on this node.
│ Triggers billing closure.
│
├─ Activity: DrainNode(node_id)
│ Close terminals, revoke users.
│
├─ Activity: LoadSiteConfig(site_id)
│
├─ Activity: ReleaseMaasNode(site_config, system_id, erase)
│ Release with erase mode. If erase fails and site policy allows it,
│ abort/mark-fixed and retry with forced no-erase release. Wait for Ready.
│
├─ Activity: PowerOffMaasNode(site_config, system_id)
│ Power off via MAAS API.
│
├─ Activity: RetireGPUaaSNode(node_id)
│ Set node status → retired.
│
├─ Activity: RemoveGPUaaSNodeRecord(node_id)
│ Use the existing inventory lifecycle:
│ retired → removing → node.uninstall → delete on success
│ (or back to retired on uninstall failure).
│ Audit history remains in audit_logs + decommission tracking.
│
├─ Activity: CleanupSecrets(node_id)
│ Revoke/delete any node-specific Vault secrets if applicable.
│ Remove enrollment token from Redis if still present.
│
├─ Activity: RemoveMaasRecord(site_config, system_id) // optional
│ Delete machine from MAAS. Or leave for MAAS-side audit.
│ Configurable: default is leave.
│
└─ Done: node is powered off, removed from active GPUasService inventory, agent removed.
Re-onboarding a fully decommissioned machine: Create a new GPUasService node record. Reuse of the old node identity is not allowed after full remove.
5.5 Mode 4: Storage-Only Cleanup¶
Use case: Between tenants on shared storage (Weka), node and OS are fine.
This is a subset of soft reset: just the storage cleanup step. It can be triggered directly and runs only the CleanupStorage and ValidateCleanNode activities.
5.6 Agent task catalog additions needed¶
| Task type | Purpose | Decommission modes |
|---|---|---|
| `node.gpu_scrub` | Reset GPU memory/VRAM, vendor-specific | soft_reset, reimage |
| `node.storage_cleanup` | Scrub local disks, unmount/cleanup Weka | soft_reset, reimage, storage-only |
| `node.validate_clean` | Verify no tenant residue (users, processes, mounts, GPU state) | soft_reset, reimage |
| `node.hw_sync_reseed` | Reinstall/restart MAAS hardware sync credentials + timer after deploy/reimage | onboarding, reimage |
Existing tasks already covering:
- allocation.revoke_user — user cleanup
- terminal.close — terminal session teardown
- node.drain — mark unschedulable
- node.uninstall — agent self-removal
5.7 Decommission monitoring¶
GET /api/v1/admin/decommissions/{id} -- status, stage, error
GET /api/v1/admin/decommissions?node_id=... -- history for a node
GET /api/v1/admin/decommissions?status=failed -- failed decommissions
POST /api/v1/admin/decommissions/{id}/retry -- retry from failed stage
POST /api/v1/admin/decommissions/{id}/cancel -- abort (best-effort)
Back these endpoints with a read model parallel to onboarding:
node_decommissions
decommission_id uuid PK -- external/job id, also Temporal workflow ID
node_id uuid NOT NULL
site_id uuid NULL
maas_system_id text NULL
mode text NOT NULL -- soft_reset | reimage | full_decommission | storage_cleanup
status text NOT NULL -- pending | running | completed | failed | cancelled
current_stage text NOT NULL
current_attempt int NOT NULL DEFAULT 0
requested_by_user_id uuid NULL
error_code text NULL
error_message text NULL
error_details jsonb NOT NULL DEFAULT '{}'::jsonb
workflow_id text NOT NULL
workflow_run_id text NULL
requested_at timestamptz NOT NULL DEFAULT now()
started_at timestamptz NULL
completed_at timestamptz NULL
updated_at timestamptz NOT NULL DEFAULT now()
Optional detailed history:
node_decommission_events
id uuid PK
decommission_id uuid FK → node_decommissions
stage text NOT NULL
attempt int NOT NULL
status text NOT NULL -- started | succeeded | failed | compensated | skipped
message text NULL
details jsonb NOT NULL DEFAULT '{}'::jsonb
occurred_at timestamptz NOT NULL DEFAULT now()
5.8 Decommission failure and recovery model¶
Decommission needs the same rigor as onboarding because it also depends on MAAS, the node-agent, and external node/network state.
| Stage | Failure mode | Retry strategy | Compensation / operator path |
|---|---|---|---|
| `DisableNode` | inventory update fails | retry 3x | stop before external mutation |
| `ForceReleaseAllocations` | allocation force-release fails | bounded retry | move to manual intervention if allocations remain attached |
| `DrainNode` | node-agent unreachable, drain task fails | bounded retry | mark manual intervention; operator may quarantine or continue with explicit override |
| `CleanupStorage` / `ScrubGPU` / `ValidateCleanNode` | host cleanup fails, validation residue remains | bounded retry | manual intervention or quarantine depending on mode |
| `ReleaseMaasNode` | MAAS release fails, erase fails, timeout to Ready | bounded retry; fallback to no-erase if site policy allows | if still not Ready, stop and require operator reconcile |
| `PowerOffMaasNode` | MAAS power-off fails | bounded retry | manual intervention; do not pretend full decommission completed |
| `RetireGPUaaSNode` | inventory transition fails | retry 3x | stop before remove path advances |
| `RemoveGPUaaSNodeRecord` | node.uninstall fails, node remains retired | use existing inventory retry path | explicit operator retry/resume supported; node must not be deleted prematurely |
| `CleanupSecrets` | Vault/Redis cleanup fails | bounded retry | decommission may complete with follow-up cleanup task recorded |
| `RemoveMaasRecord` | MAAS delete fails | best-effort by default | leave MAAS record and mark follow-up action if site policy says retain/delete mismatch |
Manual intervention should be the default outcome when decommission ownership becomes ambiguous, for example:
- node-agent is unreachable but MAAS state is still mutable
- MAAS release/power-off does not converge
- uninstall fails after the node is already retired
- operator changes MAAS state out of band during decommission
6. State Drift: MAAS ↔ GPUasService Reconciliation¶
6.1 Problem¶
MAAS and GPUasService are independent systems with independent state. Drift occurs when:
- An admin acts directly in MAAS (UI/CLI) without going through GPUasService.
- Hardware fails and MAAS detects it before GPUasService does.
- Network issues cause the agent to lose contact.
- A node is reimaged or released outside the workflow.
Without reconciliation, GPUasService may schedule allocations to nodes that no longer exist or are in the wrong state.
6.2 Drift scenarios¶
| Scenario | MAAS state | GPUasService state | Severity | Impact |
|---|---|---|---|---|
| Admin releases node in MAAS UI | Ready | active | CRITICAL | Agent is dead, allocations will fail |
| Admin redeploys from MAAS CLI | Deploying | active (with allocation) | CRITICAL | Tenant session destroyed |
| Node hardware failure | Failed testing | active | HIGH | Allocations to dead node |
| Node powered off externally | Off | active | HIGH | Agent stops, allocations fail |
| MAAS firmware update reboots node | Commissioning | active | MEDIUM | Temporary disruption |
| Node IP changes after redeploy | Deployed (new IP) | active (old IP) | MEDIUM | Agent still works (outbound), but SSH probe fails |
| GPUasService retires node, MAAS not told | Deployed | retired | LOW | Burning power, MAAS still manages |
| Machine deleted from MAAS | (absent) | active | CRITICAL | Node record is orphaned |
6.3 Detection: dual signal approach¶
Primary signal: agent heartbeat.
The agent already polls /internal/v1/nodes/{id}/tasks/wait. This is effectively a heartbeat; track the time of the last successful poll as last_agent_contact_at.
If now() - last_agent_contact_at > threshold (e.g. 5 minutes), mark the node offline. This catches most drift scenarios without talking to MAAS.
The node.heartbeat_check task type already exists in the agent catalog. The control plane can queue periodic heartbeat tasks and track response time.
Secondary signal: periodic MAAS reconciler.
A Temporal cron workflow (or scheduled activity) runs every N minutes per site:
MaasSiteReconcilerWorkflow(site_id) -- cron: every 5 minutes
│
├─ Activity: LoadSiteConfig(site_id)
│
├─ Activity: FetchMaasMachines(site_config)
│ GET /api/2.0/machines/ — all machines for this site.
│ Returns: [{ system_id, hostname, status_name, power_state, ip_addresses }]
│
├─ Activity: FetchGPUaaSNodes(site_id)
│ Query nodes where site_id matches and status NOT IN (retired).
│ Returns: [{ node_id, hostname, maas_system_id, status, host }]
│
├─ Activity: Reconcile(maas_machines, gpuaas_nodes)
│ Apply reconciliation rules (see 6.4).
│ Output: list of drift actions taken + alerts raised.
│
└─ Done
6.4 Reconciliation rules¶
| MAAS state | GPUasService state | Auto-action | Alert |
|---|---|---|---|
| Deployed | active | OK — no action | — |
| Deployed | offline | Agent issue — no MAAS action | WARN: agent not polling |
| Ready / Released | active | Auto-quarantine GPUasService node | CRITICAL: node released outside workflow |
| Failed / Broken | active | Auto-quarantine GPUasService node | CRITICAL: hardware failure detected |
| Commissioning | active | Unexpected recommission | WARN: node being recommissioned |
| Deployed, IP changed | active, old IP | Auto-update node host field | INFO: IP change detected |
| (absent from MAAS) | active | Auto-quarantine GPUasService node | CRITICAL: machine deleted from MAAS |
| Deployed | retired | Issue MAAS release + power off | INFO: cleaning up retired node |
| Any | (no matching GPUasService node) | No action | DEBUG: unmanaged MAAS machine |
Auto-quarantine means:
1. Set GPUasService node status → quarantined.
2. Do NOT force-release active allocations immediately — alert admin first.
3. Admin reviews and decides: force-release + reimage, or investigate further.
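The rule table can live in code as a small lookup keyed by the (MAAS, GPUasService) state pair. This sketch covers only a subset of rows, uses "ABSENT" as a sentinel for machines missing from MAAS, and simplifies the real MAAS status strings:

```python
# Illustrative subset of the reconciliation rules: (maas, gpuaas) -> (action, alert).
RULES = {
    ("Deployed",      "active"):  ("none", None),
    ("Deployed",      "offline"): ("none", "WARN: agent not polling"),
    ("Ready",         "active"):  ("quarantine", "CRITICAL: node released outside workflow"),
    ("Broken",        "active"):  ("quarantine", "CRITICAL: hardware failure detected"),
    ("Commissioning", "active"):  ("none", "WARN: node being recommissioned"),
    ("ABSENT",        "active"):  ("quarantine", "CRITICAL: machine deleted from MAAS"),
    ("Deployed",      "retired"): ("release_and_power_off", "INFO: cleaning up retired node"),
}

def reconcile(maas_status: str, gpuaas_status: str) -> tuple:
    """Return (auto-action, alert) for one node; default is hands-off."""
    return RULES.get((maas_status, gpuaas_status),
                     ("none", "DEBUG: unmanaged or unmapped pair"))
```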
6.5 Reconciliation state tracking¶
node_maas_state
node_id uuid FK → nodes
site_id uuid FK → maas_sites
maas_system_id text NOT NULL
last_maas_status text -- "Deployed", "Ready", "Failed", etc
last_maas_power_state text -- "on", "off", "unknown"
last_maas_ips text[]
last_reconciled_at timestamptz
drift_detected bool DEFAULT false
drift_details jsonb -- { "rule": "...", "maas_status": "...", "expected": "..." }
drift_resolved_at timestamptz -- null until admin resolves or auto-heals
UNIQUE (node_id)
node_maas_state is the authoritative runtime reconciliation record for the current MAAS binding of an active GPUasService node. node_onboardings and node_decommissions are workflow/history records and should remain authoritative for job history, not current runtime truth.
6.6 Reconciliation admin APIs¶
GET /api/v1/admin/reconciliation/status -- summary: nodes OK, drifted, unreconciled
GET /api/v1/admin/reconciliation/drift -- list all nodes with drift_detected=true
POST /api/v1/admin/reconciliation/drift/{node_id}/resolve -- admin acknowledges + takes action
POST /api/v1/admin/reconciliation/run -- trigger immediate reconciliation for a site
7. Node State Machine (Extended)¶
7.1 GPUasService node states¶
┌──────────────────────────────┐
│ │
┌─────────────┐ enroll OK ┌──▼──────┐ probe fail ┌────┴────┐
│ bootstrap ├──────────────►│ active ├───────────────►│ offline │
│ _issued │ │ │◄───────────────┤ │
└──────┬──────┘ └──┬───┬───┘ probe OK └────┬────┘
│ │ │ │
│ maas mode │ │ admin/auto │ admin
▼ │ ▼ │
┌─────────────┐ enroll OK │ ┌────────────┐ │
│ enrolling ├──────────────────┘ │quarantined │◄────────────┘
│ (maas) │ │ │ drift detected
└─────────────┘ └──────┬──────┘
│ admin
▼
┌─────────────┐
│ retired │
└──────┬──────┘
│ remove
▼
┌─────────────┐
│ removing │
└──────┬──────┘
│ uninstall success
▼
┌─────────────┐
│ deleted │
└─────────────┘
7.2 MAAS machine states (reference)¶
New → Commissioning → Ready → Allocated → Deploying → Deployed
▲ │
│ Release │
└────────────────────────────────┘
Side states: Failed, Broken (can appear after Commissioning or Deploying)
7.3 Combined state expectations for MAAS-onboarded nodes¶
| Lifecycle phase | GPUasService status | Expected MAAS status | Agent running? |
|---|---|---|---|
| Onboarding: pre-commission | enrolling | New / Commissioning | No |
| Onboarding: post-commission | enrolling | Ready | No |
| Onboarding: deploying | enrolling | Deploying | No |
| Onboarding: deployed, agent starting | enrolling | Deployed | Starting |
| Operational | active | Deployed | Yes |
| Agent down | offline | Deployed | No |
| Drifted (released outside workflow) | quarantined | Ready | No |
| Reimaging | enrolling (re-enter) | Ready / Deploying | No |
| Decommissioned (workflow in progress) | removing | Ready / Off / Deleted | No |
| Decommissioned (completed) | deleted from active inventory | Ready / Off / Deleted | No |
8. Data Flow: MAAS to GPUasService Field Mapping¶
8.0 Event integration¶
These workflows should emit typed domain events so downstream consumers do not need to poll operator APIs for lifecycle transitions.
Suggested events:
- node.onboarding.started
- node.onboarding.completed
- node.onboarding.failed
- node.onboarding.manual_intervention_required
- node.decommission.started
- node.decommission.completed
- node.decommission.failed
- node.decommission.manual_intervention_required
Exact event shapes and subjects should be specified in doc/api/asyncapi.draft.yaml before implementation.
8.1 Where GPUasService node fields come from¶
| GPUasService node field | Source | When populated |
|---|---|---|
| id | Generated by GPUasService | CreateGPUaaSNode activity |
| host | MAAS deployed IP (from ip_addresses) |
After deploy (updated by reconciler if IP changes) |
| hostname | Admin input (from CSV/API) | CreateGPUaaSNode activity |
| port | Default 22 | CreateGPUaaSNode activity |
| sku | Admin input | CreateGPUaaSNode activity |
| gpus_total | Admin input | CreateGPUaaSNode activity |
| region_code | Site's region_code (or admin override) | CreateGPUaaSNode activity |
| status | State machine | Transitions through workflow |
| ssh_username | "root" (default) | CreateGPUaaSNode activity |
| access_method | "node_agent" | CreateGPUaaSNode activity |
8.2 MAAS-specific data stored per node¶
| Field | Source | Purpose |
|---|---|---|
| maas_system_id | MAAS API (from CreateOrFind) | All subsequent MAAS API calls |
| site_id | Admin input | Links to maas_sites for config |
| ipmi_ip | Admin input | IPMI power management |
| pxe_mac | Observed/internal field | Discovery fallback only if later required |
Stored in node_maas_state or as additional columns on the nodes table (TBD — may prefer a separate table to keep the nodes table provider-agnostic). The runtime MAAS binding should be authoritative in node_maas_state; onboarding rows are historical workflow records.
9. Cloud-Init Template¶
The cloud-init user-data for MAAS-deployed nodes combines two ordered layers:
- infra-owned site bootstrap bundle
- GPUasService bootstrap bundle
Within those layers, the first-boot payload covers:
- deploy-user creation
- GPUasService node-agent bootstrap
- MAAS hardware-sync unit/timer/credentials wiring
- destructive /shareN non-root disk initialization
- guest OS RoCE routing service/script
- optional site-owned bootstrap logic delivered through the site bundle reference
Treat this as a versioned first-boot bundle composition, not an ad hoc inline script blob. The infra-owned site bootstrap bundle should remain separately versioned and testable from the GPUasService-owned bootstrap content so failures can be classified cleanly as:
- site bootstrap failure
- GPUasService bootstrap failure
Longer term, the site bundle should be managed through a control-plane reference and uploaded/versioned like other deployment artifacts rather than only by a local file path.
The template is rendered by the CreateGPUaaSNodeAndRenderCloudInit activity with substitutions for:
- Deploy user/pass
- MAAS machine token (consumer_key, token_key, token_secret)
- MAAS base URL
- MAAS system_id
- GPUasService API URL
- GPUasService enrollment token
- GPUasService agent package URL
- GPUasService CA bundle
- GPUasService task signing public key
The GPUasService bootstrap section uses the existing cloud-init rendering mode from the bootstrap-script endpoint (mode=cloud_init), injected after the infra-owned site bootstrap stage in the combined template.
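As a minimal render sketch, the substitution step could use $-style placeholders. The placeholder names, the bearer-token header, and the template shape below are all assumptions for illustration, not the real combined template:

```python
from string import Template

# Hypothetical combined template fragment; real layering and variable
# names come from the infra-owned site bundle and the bootstrap endpoint.
TEMPLATE = Template("""#cloud-config
# layer 1: infra-owned site bootstrap bundle (separately versioned)
# layer 2: GPUasService bootstrap (mode=cloud_init)
runcmd:
  - curl -fsS $gpuaas_api_url/bootstrap-script?mode=cloud_init \\
      -H "Authorization: Bearer $enrollment_token" | bash
write_files:
  - path: /etc/maas/maas-machine-creds.yml
    content: |
      consumer_key: $consumer_key
      token_key: $token_key
      token_secret: $token_secret
""")

def render_cloud_init(values: dict) -> str:
    """Fail loudly if any placeholder is missing (Template.substitute raises)."""
    return TEMPLATE.substitute(values)
```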
Prototype intent carried forward from maas-automation-TG:
- deploy-user creation and password set
- destructive non-root disk initialization to /shareN
- MAAS hardware-sync credentials + unit/timer wiring
- RoCE routing systemd service/script in the guest OS
If any of these become optional later, that should be an explicit site policy decision, not silent template drift.
9.1 Hardware sync is a required invariant¶
For MAAS-managed nodes, hardware sync is not best-effort. A node is not considered fully onboarded until MAAS can continue to receive hardware sync from the deployed OS.
Required healthy state:
- enable_hw_sync=true
- MAAS machine token exists for the machine
- /etc/maas/maas-machine-creds.yml is present on the node
- maas-agent and the hardware-sync timer are installed/running
- last_sync is populated in MAAS
- next_sync is populated in MAAS
- is_sync_healthy=true (or null on MAAS builds that do not expose it)
What re-sync should do:
1. Re-assert site hardware-sync interval in MAAS.
2. Re-assert enable_hw_sync=true on the machine.
3. Fetch or re-fetch the per-machine MAAS token.
4. Rewrite /etc/maas/maas-machine-creds.yml.
5. Restart/enable maas-agent and the hardware-sync timer.
6. Wait for MAAS to observe a fresh last_sync/next_sync cycle.
This must run:
- after initial deploy
- after every reimage
- whenever reconciliation detects sync drift on a deployed node
10. Future: LXC/LXD Path¶
LXC/LXD onboarding would follow the same patterns but with a different site entity and a much simpler workflow:
lxd_sites
id uuid
name text
region_code text
api_url text -- LXD REST API endpoint
client_cert_vault text -- Vault path for client cert
client_key_vault text -- Vault path for client key
trust_token_vault text -- Vault path for trust token
default_profile text -- LXD profile for GPU passthrough
status text
LXD onboarding workflow (seconds, not minutes):
LxdNodeOnboardWorkflow(input)
├─ Activity: LoadLxdSiteConfig(site_id)
├─ Activity: CreateLxdInstance(site_config, profile, gpu_config)
├─ Activity: WaitForRunning(instance_id) -- seconds
├─ Activity: InjectBootstrap(instance_id, cloud_init)
├─ Activity: WaitForAgentEnrollment(node_id) -- seconds
└─ Done
No IPMI, no PXE, no commissioning, no BOSS disk, no RoCE. The GPUasService node model stays the same — onboarding_mode: "lxd", agent enrolls identically.
LXD decommission is also simpler — stop instance, delete instance. No erase modes needed.
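Stripped of durable-execution machinery, the workflow diagram above reduces to a short sequential sketch. The `activities` object and its method names are hypothetical stand-ins for the activities in the diagram; in a real Temporal workflow each call would be an activity invocation with its own retry policy:

```python
def lxd_node_onboard(activities, site_id, profile, gpu_config):
    """Sequential sketch of LxdNodeOnboardWorkflow (illustrative only)."""
    site = activities.load_lxd_site_config(site_id)
    instance_id = activities.create_lxd_instance(site, profile, gpu_config)
    activities.wait_for_running(instance_id)       # seconds, not minutes
    activities.inject_bootstrap(instance_id)       # cloud-init with enrollment token
    node_id = activities.wait_for_agent_enrollment(instance_id)
    return node_id
```

The contrast with the MAAS path is the point: no commissioning or deploy stages means no intermediate states to compensate, so the workflow is a straight line.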
11. Future: Weka Storage Integration¶
Weka integration would add per-allocation storage provisioning:
Allocation-time mount:
- When an allocation is provisioned on a node, a Weka filesystem mount is set up for the tenant.
- Agent task: storage.weka_mount — installs Weka client, mounts tenant filesystem.
Deallocation-time cleanup:
- Agent task: storage.weka_unmount — unmounts, removes client config, scrubs local cache.
This integrates with the soft reset and reimage decommission modes via the storage_cleanup parameter.
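One way the storage_cleanup parameter could select Weka tasks per decommission mode, as a sketch. The task name storage.weka_unmount comes from this section; the mode names and the dispatch policy itself are assumptions, not the implemented contract:

```python
# Agent task name from the section above; the dispatch policy is illustrative.
WEKA_UNMOUNT = "storage.weka_unmount"

def storage_cleanup_tasks(mode: str, storage_cleanup: bool) -> list[str]:
    """Return the agent storage tasks to run for one decommission mode.

    Sketch policy: soft reset and reimage honor the storage_cleanup flag,
    full decommission always unmounts, and storage-only cleanup is exactly
    the unmount task.
    """
    if mode in ("storage_only", "full_decommission"):
        return [WEKA_UNMOUNT]
    if mode in ("soft_reset", "reimage") and storage_cleanup:
        return [WEKA_UNMOUNT]
    return []
```

Keeping the mapping in one place makes the tenant-isolation guarantee auditable: every path that releases a node either runs the unmount task or explicitly opted out.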
12. Open Questions¶
- Node table extension vs separate table: Should MAAS-specific fields (system_id, ipmi_ip, site_id) live on the nodes table directly, or in a separate node_maas_state table? A separate table keeps nodes provider-agnostic but adds joins.
- Enrollment token TTL validation for MAAS path: enrollment_token_ttl_seconds already exists in site policy. The remaining question is whether the configured default is sufficient for the slowest observed deploy path, or whether token refresh mid-workflow is still required.
- Reconciler frequency vs MAAS API load: Polling all machines every 5 minutes per site could be heavy for large sites (500+ nodes). Consider delta-based polling (MAAS events API) or MAAS webhooks (3.4+) where available.
- Per-node power override lifecycle: If an override becomes stale or a BMC credential rotation finishes, should the reconciler warn on unused overrides, or should this remain manual cleanup?
- Batch concurrency limits: How many nodes to commission/deploy in parallel per MAAS server? MAAS has internal concurrency limits (rack controller PXE capacity, image sync bandwidth). Need tunable per-site limits.
- GPU scrub verification: How to verify GPU memory is actually clear? Vendor tools may not provide a reliable "clean" signal. Need to define acceptable verification criteria per GPU vendor.
- Weka client lifecycle: Is the Weka client installed once at OS level (persistent across allocations) or per-allocation? Persistent is simpler but may leak state between tenants.
- MAAS webhook vs polling: MAAS 3.4+ supports webhooks for machine state changes. Where available, this is more reactive than polling. Worth supporting both modes (webhook-primary, polling-fallback) per site.
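If the webhook-primary/polling-fallback approach is adopted, per-site mode selection could look like the sketch below. The field names (maas_version as a (major, minor) tuple, webhooks_enabled as a site policy flag) are assumptions made up for illustration:

```python
def pick_drift_detection_mode(site: dict) -> str:
    """Choose the drift-detection transport for one site (sketch).

    Webhooks require MAAS 3.4+ and an explicit site opt-in; polling
    remains the fallback everywhere else.
    """
    if site.get("webhooks_enabled") and site.get("maas_version", (0, 0)) >= (3, 4):
        return "webhook-primary"  # a low-frequency polling sweep could still backstop missed events
    return "polling"
```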