MAAS Bare Metal Lifecycle v1

1. Purpose

Define the end-to-end bare-metal node lifecycle managed through MAAS integration: automated onboarding, decommissioning modes, site configuration management, and state drift reconciliation between MAAS and GPUasService.

The dedicated state model for this lifecycle lives in:

  • MAAS_Node_State_Model_v1.md
  • MAAS_Recovery_Matrix_v1.md
  • Provisioning_BareMetal_MAAS_API_Boundary_v1.md
  • MAAS_Execution_Readiness_v1.md

This document covers:

  • MAAS site entity model and credential lifecycle (Vault-backed)
  • automated node onboarding via Temporal workflows (single + batch)
  • cloud-init bootstrap injection (agent enrollment without manual intervention)
  • decommission modes (soft reset, reimage, full decommission, storage-only cleanup)
  • MAAS ↔ GPUasService state drift detection and reconciliation
  • failure handling, compensation, and recovery at each workflow stage
  • future extensibility for LXC/LXD and Weka storage integration

This document does not cover:

  • MAAS installation/bootstrap on a new server
  • PostgreSQL tuning for MAAS itself
  • external notification integrations such as Telegram/email monitoring

Those are treated as MAAS site operations outside the GPUasService lifecycle contract. GPUasService assumes a working MAAS site already exists.

2. Context

2.1 Current state

  • GPUasService node onboarding supports two modes: manual and maas.
  • manual mode is fully implemented: the admin creates a node via the API, receives a bootstrap curl command with a temporary enrollment token, runs it on the node, the agent enrolls, and the node goes active.
  • maas mode has a placeholder (maas.Client.PrepareNodeOnboarding) that calls an external HTTP endpoint — fire-and-forget with no state tracking.
  • A separate repository (maas-automation-TG) contains production-tested bash scripts (onboard_node.sh, deploy.sh, release_node.sh, etc.) that drive the full MAAS lifecycle via CLI. These scripts are the working model for the automation logic.
  • The bash scripts assume a happy path — no durable state, no retry across stages, no visibility into progress, no compensation on failure.

2.2 Target state

  • MAAS onboarding is a Temporal workflow with per-stage retry, compensation, and observability.
  • Admin triggers onboarding via API (single or batch), monitors progress, and can retry or cancel.
  • The GPUasService node agent bootstrap is injected into MAAS cloud-init at deploy time — no manual SSH or script delivery.
  • Site-level MAAS configuration is a first-class DB entity with Vault-backed secrets.
  • A reconciliation loop detects and resolves state drift between MAAS and GPUasService.

2.3 Prototype boundary from maas-automation-TG

The external scripting repo remains the reference for current operational behavior, but not every script belongs in GPUasService:

  • In scope for GPUasService lifecycle v1:
      • machine discovery/claiming policy
      • commissioning, deploy, release/reimage flows
      • RoCE phase-2 MAAS configuration
      • cloud-init bundle composition
      • MAAS hardware sync seed/reseed/health
      • node cleanup/decommission semantics

  • Out of scope for GPUasService lifecycle v1:
      • MAAS install/bootstrap on a new host
      • MAAS PostgreSQL tuning
      • external monitoring/notification apps (Telegram/email)
      • one-off MAAS operator utilities that can remain standalone CLI/SDK tools

2.4 Capability ownership boundary

GPUasService should own the lifecycle contract and required invariants for MAAS-managed nodes. It should not become a generic remote-script distribution system for MAAS.

GPUasService-managed:

  • MAAS site records, secrets, and lifecycle policy
  • onboarding/reimage/decommission workflows
  • required runtime invariants for GPUasService-managed nodes:
      • expected boot image/profile
      • hardware sync policy
      • RoCE phase-2 enablement
      • discovery/claiming policy
      • bounded deploy retry policy
  • controlled cloud-init bundle/profile composition
  • capability probes and validation of required site prerequisites

Infra-managed initially:

  • MAAS commissioning/testing script payloads
  • MAAS host install/bootstrap/tuning
  • optional MAAS operator tooling and monitoring integrations

Security boundary:

  • GPUasService v1 must not expose an API for arbitrary MAAS script blob upload or arbitrary remote shell payload execution.
  • If a MAAS site depends on commissioning/testing scripts, GPUasService may record that dependency and validate presence/version, but the script content itself remains infra-owned until promoted into a controlled platform-managed artifact.

This preserves the stronger GPUasService security model built around typed workflows, audited admin actions, and controlled bootstrap content rather than ad hoc script execution.

3. MAAS Site Configuration

3.1 Site entity

Each MAAS region controller is represented as a site. A site is the unit of credential management, network topology, and workflow scoping.

For the first implementation slice, site-level policy is stored directly against the site so the control plane can stand up a usable MAAS admin surface quickly. The intended steady-state model is site + profile(s):

  • a site owns identity, connectivity, secret paths, and default environment wiring
  • a profile owns operational policy bundles applied to onboarding/reimage/decommission flows

That future split matters because one MAAS site may need multiple operational modes over time:

  • GPU default onboarding
  • stricter reimage/recovery behavior
  • alternate cloud-init bundles
  • future non-GPU host classes such as LXC/LXD-oriented profiles

Until profile CRUD lands, the current site-level policy row should be treated as the site's implicit default profile.

maas_sites
  id                              uuid        PK
  name                            text        UNIQUE, NOT NULL  -- "dc1-maas"
  region_code                     text        NOT NULL          -- maps to GPUasService region
  api_base_url                    text        NOT NULL          -- "http://10.176.46.1:5240/MAAS"
  api_token_vault_path            text        NOT NULL          -- "kv/maas-sites/{id}/api-token"
  default_power_creds_vault_path  text        NOT NULL          -- "kv/maas-sites/{id}/power/default"
  pxe_iface                       text        NOT NULL          -- "ens19" (MAAS server side)
  pxe_vlan_vid                    int         NOT NULL          -- 46
  node_pxe_iface                  text        NOT NULL          -- "eno8303" (node side)
  distro_series                   text        NOT NULL DEFAULT 'ubuntu/noble'
  architecture                    text        NOT NULL DEFAULT 'amd64/generic'
  upstream_dns                    text        DEFAULT '1.1.1.1 8.8.8.8'
  status                          text        NOT NULL DEFAULT 'active'  -- active, disabled
  created_at                      timestamptz NOT NULL DEFAULT now()
  updated_at                      timestamptz NOT NULL DEFAULT now()

3.2 Secrets in Vault

Secrets are never stored in the database. The DB stores Vault KV v2 paths only.

kv/maas-sites/{site_id}/api-token
  → { "token": "<consumer_key>:<token_key>:<token_secret>" }

kv/maas-sites/{site_id}/power/default
  → { "user": "root", "pass": "..." }

The Vault client (packages/shared/vault) already supports ReadKVV2 and WriteKVV2. Workflow activities read from Vault at execution time — not at startup — so credential rotation is automatic without service restart.
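
As a minimal sketch of that pattern (assuming the shared Vault client exposes a ReadKVV2 method roughly like the interface below; the real packages/shared/vault signature may differ), an activity resolves the MAAS token at execution time so rotation needs no restart:

```go
package maasactivities

import (
	"context"
	"fmt"
)

// VaultKV is the subset of the shared Vault client this activity needs.
// Assumed shape; the real packages/shared/vault signature may differ.
type VaultKV interface {
	ReadKVV2(ctx context.Context, path string) (map[string]string, error)
}

type SiteConfigActivities struct {
	Vault VaultKV
}

// ResolveMaasToken reads the MAAS API token from Vault at activity execution
// time, not at worker startup, so a rotated secret is picked up by the next
// activity attempt without restarting the service.
func (a *SiteConfigActivities) ResolveMaasToken(ctx context.Context, tokenVaultPath string) (string, error) {
	kv, err := a.Vault.ReadKVV2(ctx, tokenVaultPath)
	if err != nil {
		return "", fmt.Errorf("read maas token from vault: %w", err)
	}
	token, ok := kv["token"]
	if !ok || token == "" {
		return "", fmt.Errorf("vault entry %s has no token field", tokenVaultPath)
	}
	return token, nil // "<consumer_key>:<token_key>:<token_secret>"
}
```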

3.2.1 Optional per-node power credential overrides

Site-wide defaults are the normal operating model. Some sites will still have exceptions: old BMC firmware, break-glass credentials for a subset of racks, or staged credential rotation. Model those as explicit overrides, not as ad hoc workflow inputs.

maas_power_credential_overrides
  id                  uuid        PK
  site_id             uuid        FK → maas_sites
  selector_type       text        NOT NULL   -- hostname | ipmi_ip | pxe_mac
  selector_value      text        NOT NULL
  vault_path          text        NOT NULL   -- "kv/maas-sites/{id}/power/overrides/{oid}"
  status              text        NOT NULL DEFAULT 'active'
  created_at          timestamptz NOT NULL DEFAULT now()
  updated_at          timestamptz NOT NULL DEFAULT now()

  UNIQUE (site_id, selector_type, selector_value)

Credential selection order:

  1. Exact active override match by pxe_mac
  2. Exact active override match by ipmi_ip
  3. Exact active override match by hostname
  4. Site default power credentials

This keeps the steady-state model simple while still supporting mixed fleets without baking credential logic into workflow inputs.
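
A minimal sketch of that selection order, assuming a repository over maas_power_credential_overrides with the hypothetical FindActiveOverridePath lookup shown here:

```go
package maasactivities

import "context"

// OverrideStore looks up an active per-node override row; selector types
// mirror maas_power_credential_overrides. Assumed repository shape.
type OverrideStore interface {
	FindActiveOverridePath(ctx context.Context, siteID, selectorType, selectorValue string) (path string, found bool, err error)
}

// ResolvePowerCredentialPath applies the documented selection order:
// pxe_mac override, then ipmi_ip, then hostname, then the site default.
// The returned Vault path is read by the activity at execution time.
func ResolvePowerCredentialPath(ctx context.Context, store OverrideStore,
	siteID, hostname, ipmiIP, pxeMAC, siteDefaultPath string) (string, error) {

	selectors := []struct{ typ, val string }{
		{"pxe_mac", pxeMAC},
		{"ipmi_ip", ipmiIP},
		{"hostname", hostname},
	}
	for _, s := range selectors {
		if s.val == "" {
			continue // selector not supplied for this onboarding
		}
		path, found, err := store.FindActiveOverridePath(ctx, siteID, s.typ, s.val)
		if err != nil {
			return "", err
		}
		if found {
			return path, nil
		}
	}
	return siteDefaultPath, nil // fall back to site default power credentials
}
```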

3.3 RoCE IP assignments

Replace the static roce_ips.csv file with a database table:

maas_roce_assignments
  id          uuid        PK
  site_id     uuid        FK → maas_sites
  hostname    text        NOT NULL
  interface   text        NOT NULL
  ipv4_cidr   text        NOT NULL
  created_at  timestamptz NOT NULL DEFAULT now()
  updated_at  timestamptz NOT NULL DEFAULT now()

  UNIQUE (site_id, hostname, interface)

Populated via admin API using typed JSON payloads. If operators maintain CSV files, conversion to JSON should happen in CLI/import tooling before calling the API. Queried by the RoCE phase-2 activity during onboarding.

3.4 Admin APIs for site management

POST   /api/v1/admin/maas-sites                        -- create site (non-secret fields)
GET    /api/v1/admin/maas-sites                        -- list sites
GET    /api/v1/admin/maas-sites/{id}                   -- get site details
PATCH  /api/v1/admin/maas-sites/{id}                   -- update site config
POST   /api/v1/admin/maas-sites/{id}/credentials       -- write secrets to Vault
POST   /api/v1/admin/maas-sites/{id}/probe             -- verify MAAS API reachable + token valid
DELETE /api/v1/admin/maas-sites/{id}                   -- disable (soft delete)

POST   /api/v1/admin/maas-sites/{id}/roce-assignments        -- bulk upsert (JSON array)
GET    /api/v1/admin/maas-sites/{id}/roce-assignments        -- list
DELETE /api/v1/admin/maas-sites/{id}/roce-assignments/{aid}  -- remove single

Validation on credential write:

  • Call MAAS GET /api/2.0/version/ with the provided token.
  • If MAAS is unreachable or the token is invalid, reject with 422.
  • On success, write the secret to Vault and store the path in the DB.
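
A sketch of that validation step, assuming MAAS' standard OAuth 1.0 PLAINTEXT authorization scheme (token format consumer_key:token_key:token_secret, empty consumer secret). Mapping failures to a 422 stays in the admin handler:

```go
package maasclient

import (
	"context"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// ValidateToken performs the credential-write validation described above:
// call GET /api/2.0/version/ with the candidate token and fail on any error.
func ValidateToken(ctx context.Context, apiBaseURL, token string) error {
	parts := strings.Split(token, ":")
	if len(parts) != 3 {
		return fmt.Errorf("token must be consumer_key:token_key:token_secret")
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		strings.TrimRight(apiBaseURL, "/")+"/api/2.0/version/", nil)
	if err != nil {
		return err
	}
	// OAuth 1.0 PLAINTEXT with an empty consumer secret, as MAAS expects.
	req.Header.Set("Authorization", fmt.Sprintf(
		`OAuth oauth_version="1.0", oauth_signature_method="PLAINTEXT", `+
			`oauth_consumer_key="%s", oauth_token="%s", oauth_signature="&%s", `+
			`oauth_nonce="%d", oauth_timestamp="%d"`,
		parts[0], parts[1], parts[2], time.Now().UnixNano(), time.Now().Unix()))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return fmt.Errorf("maas unreachable: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("maas rejected token: status %d", resp.StatusCode)
	}
	return nil
}
```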

Seeding flow (first-time setup):

  1. Admin calls POST /api/v1/admin/maas-sites with site config.
  2. Admin calls POST /api/v1/admin/maas-sites/{id}/credentials with the MAAS token + IPMI creds.
  3. System validates MAAS connectivity, writes secrets to Vault, stores paths in the DB.
  4. Admin optionally uploads RoCE assignments via the JSON API or a CLI/import helper that converts CSV into the same JSON shape.
  5. Site is ready for onboarding workflows.

Example bulk RoCE assignment payload:

{
  "items": [
    { "hostname": "c07u31", "interface": "enp28s0np0", "ipv4_cidr": "172.30.9.61/31" },
    { "hostname": "c07u31", "interface": "enp29s0np0", "ipv4_cidr": "172.29.1.219/31" },
    { "hostname": "c07u31", "interface": "enp62s0np0", "ipv4_cidr": "172.30.9.63/31" }
  ]
}

Notes:

  • the resource is site-scoped and host-keyed by site_id + hostname
  • multiple rows per hostname are expected, one per interface assignment
  • operators may source this from CSV, but the API contract stays JSON-first
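
Illustrative request types and validation for the bulk upsert endpoint. Field names mirror the JSON example above; the Validate helper and its rules are a hypothetical addition:

```go
package adminapi

import (
	"fmt"
	"net/netip"
)

// RoceAssignmentItem is one row of the bulk upsert payload.
type RoceAssignmentItem struct {
	Hostname  string `json:"hostname"`
	Interface string `json:"interface"`
	IPv4CIDR  string `json:"ipv4_cidr"`
}

type BulkRoceAssignmentRequest struct {
	Items []RoceAssignmentItem `json:"items"`
}

// Validate rejects structurally bad rows before the upsert touches the DB.
func (r BulkRoceAssignmentRequest) Validate() error {
	for i, it := range r.Items {
		if it.Hostname == "" || it.Interface == "" {
			return fmt.Errorf("items[%d]: hostname and interface are required", i)
		}
		if _, err := netip.ParsePrefix(it.IPv4CIDR); err != nil {
			return fmt.Errorf("items[%d]: bad ipv4_cidr %q: %w", i, it.IPv4CIDR, err)
		}
	}
	return nil
}
```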

3.5 Credential lifecycle

| Operation | What happens |
|---|---|
| Initial seed | Admin calls credentials endpoint; system validates + writes to Vault. |
| Rotate MAAS token | Admin calls credentials endpoint again — same Vault path, new value. Next activity execution picks it up. No restart. |
| Rotate default power creds | Same — default site Vault path stays the same, value changes. In-flight workflows use whatever was read at activity start. |
| Rotate node-specific power creds | Update the override Vault entry. Matching future workflows pick it up automatically. |
| Disable site | PATCH /api/v1/admin/maas-sites/{id} { "status": "disabled" }. New onboardings rejected. In-flight workflows continue to completion. |
| Re-enable site | PATCH /api/v1/admin/maas-sites/{id} { "status": "active" }. |
| Test connectivity | POST /api/v1/admin/maas-sites/{id}/probe — hits MAAS API, verifies token, returns version + rack controller info. |

3.6 Site policy knobs

The prototype scripts rely on site-specific behavior toggles. Capture them explicitly so workflow behavior is reproducible per site and not hidden in shell defaults.

maas_sites
  id                              uuid        PK
  name                            text        UNIQUE
  region_code                     text        NOT NULL
  api_base_url                    text        NOT NULL
  api_token_vault_path            text        NOT NULL
  default_power_creds_vault_path  text        NOT NULL
  pxe_iface                       text        NOT NULL
  pxe_vlan_vid                    int         NOT NULL
  node_pxe_iface                  text        NOT NULL
  distro_series                   text        NOT NULL
  architecture                    text        NOT NULL
  deploy_user                     text        NOT NULL DEFAULT 'hpcadmin'
  deploy_password_vault_path      text        NOT NULL
  deploy_ssh_iface                text        NOT NULL DEFAULT 'eno8303'
  upstream_dns_servers            text[]      NOT NULL
  status                          text        NOT NULL DEFAULT 'active'
  created_at                      timestamptz NOT NULL DEFAULT now()
  updated_at                      timestamptz NOT NULL DEFAULT now()

maas_site_policies
  site_id                         uuid        PK/FK → maas_sites
  strict_pxe_preflight           bool        NOT NULL DEFAULT true
  enable_phase2_roce             bool        NOT NULL DEFAULT true
  require_hw_sync                bool        NOT NULL DEFAULT true
  hardware_sync_interval         text        NOT NULL DEFAULT '15m'
  release_fallback_no_erase      bool        NOT NULL DEFAULT true
  enable_deploy_retry_on_datasource_failure  bool        NOT NULL DEFAULT true
  max_deploy_retry_attempts      int         NOT NULL DEFAULT 1
  auto_claim_single_new_machine  bool        NOT NULL DEFAULT false
  batch_max_parallel             int         NOT NULL DEFAULT 10
  site_bootstrap_bundle_ref      text        NULL
  enrollment_token_ttl_seconds   int         NOT NULL DEFAULT 7200
  created_at                     timestamptz NOT NULL DEFAULT now()
  updated_at                     timestamptz NOT NULL DEFAULT now()

Notes:

  • require_hw_sync should remain true for MAAS-managed nodes in v1.
  • hardware_sync_interval comes directly from prototype behavior and governs MAAS-side sync policy.
  • deploy_user and deploy_ssh_iface are site-scoped, not profile-scoped. One MAAS site should expose one standard deploy identity.
  • architecture and distro_series should be profile-scoped so one MAAS site can target more than one runtime shape.
  • GPUaaS custom uploaded host images may be stored in product profile intent without the MAAS namespace prefix, but deploy calls must send the MAAS custom namespace form (custom/<image-name>). The MAAS execution client normalizes gpuaas-* image names at the API boundary so profile authors do not accidentally trigger MAAS' Ubuntu-series lookup path.
  • pxe_iface, node_pxe_iface, and pxe_vlan_vid should behave as site defaults with profile-level overrides available when racks or node classes diverge.
  • enable_deploy_retry_on_datasource_failure and max_deploy_retry_attempts capture the current script behavior of allowing a bounded redeploy only for datasource/cloud-init class failures.
  • site_bootstrap_bundle_ref replaces the older extra_cloud_init_bundle_path concept. The real abstraction is an infra-owned, versioned site bootstrap bundle or script reference delivered at first boot before GPUaaS node bootstrap.
  • auto_claim_single_new_machine=false should be the default. Ambiguous discovery must fail closed unless the site explicitly opts in.
  • the deploy password is still secret material and should live in Vault, not inline in the site row.
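
A minimal sketch of the image-name normalization at the MAAS execution-client boundary. The gpuaas- prefix check is the convention described above, not an existing MAAS rule:

```go
package maasclient

import "strings"

// NormalizeImageName converts product-profile image names into the form MAAS
// expects at deploy time. GPUaaS custom host images may be stored without the
// MAAS namespace prefix; deploy calls must use custom/<image-name> or MAAS
// will treat the bare name as an Ubuntu series lookup.
func NormalizeImageName(distroSeries string) string {
	if strings.HasPrefix(distroSeries, "gpuaas-") {
		return "custom/" + distroSeries
	}
	return distroSeries // e.g. ubuntu/noble, or an already-namespaced custom/... name
}
```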

3.7 Future evolution: site profiles

The site policy model above is acceptable for the first MAAS implementation slice, but it should evolve into an explicit one-to-many profile model:

maas_site_profiles
  id                               uuid        PK
  site_id                          uuid        FK -> maas_sites
  name                             text        NOT NULL
  description                      text        NULL
  status                           text        NOT NULL DEFAULT 'active'
  strict_pxe_preflight             bool        NOT NULL DEFAULT true
  enable_phase2_roce               bool        NOT NULL DEFAULT true
  require_hw_sync                  bool        NOT NULL DEFAULT true
  hardware_sync_interval           text        NOT NULL DEFAULT '15m'
  release_fallback_no_erase        bool        NOT NULL DEFAULT true
  enable_deploy_retry_on_datasource_failure  bool NOT NULL DEFAULT true
  max_deploy_retry_attempts        int         NOT NULL DEFAULT 1
  auto_claim_single_new_machine    bool        NOT NULL DEFAULT false
  batch_max_parallel               int         NOT NULL DEFAULT 10
  site_bootstrap_bundle_ref        text        NULL
  enrollment_token_ttl_seconds     int         NOT NULL DEFAULT 7200
  created_at                       timestamptz NOT NULL DEFAULT now()
  updated_at                       timestamptz NOT NULL DEFAULT now()

  UNIQUE (site_id, name)

Expected transition:

  • maas_sites continues to hold site identity and connectivity
  • deploy_user, deploy_password_vault_path, and deploy_ssh_iface remain site-scoped
  • maas_site_policies is treated as the current implicit default profile
  • later, maas_site_profiles becomes the normal policy surface
  • onboarding/decommission requests carry both site_id and profile_id
  • each site may also declare a default_profile_id for simple operator flows

The profile split is a design direction, not a blocker for the current v1 bootstrap implementation.

4. Node Onboarding Workflow

4.1 API input

Assumptions for the first MAAS onboarding contract:

  • hostname is required and operator-supplied.
  • ipmi_ip is required and operator-supplied.
  • pxe_mac is not part of the normal operator-facing onboarding request. If retained in storage later, it is observed/internal data for reconciliation or fallback, not a required input.
  • site_id, profile_id, and sku_id are required request-level selectors.
  • discovery/adoption behavior stays policy-driven; it is not overridden per request.
  • RoCE phase-2 assignment is not inline onboarding input. It is a separately managed, pre-created assignment record keyed by site_id + hostname, and onboarding consumes it only when the selected profile enables phase-2 RoCE.
  • storage/runtime attach behavior such as Weka is out of scope for initial node onboarding. For now, onboarding only covers host bring-up and attached-local storage preparation.

Single node:

POST /api/v1/admin/onboardings
{
  "site_id": "uuid",
  "profile_id": "uuid",
  "sku_id": "mi300x.192g.8gpu",
  "ipmi_ip": "10.176.16.128",
  "hostname": "c07u43"
}

Response:
{
  "onboarding_id": "uuid",              // Temporal workflow ID
  "status": "pending"
}

Batch:

POST /api/v1/admin/onboardings/batch
{
  "site_id": "uuid",
  "profile_id": "uuid",
  "sku_id": "mi300x.192g.8gpu",
  "nodes": [
    { "ipmi_ip": "10.176.16.128", "hostname": "c07u43" },
    { "ipmi_ip": "10.176.16.129", "hostname": "c07u44" },
    { "ipmi_ip": "10.176.16.130", "hostname": "c07u45" }
  ]
}

Response:
{
  "batch_id": "uuid",                   // parent workflow ID
  "onboardings": [
    { "hostname": "c07u43", "onboarding_id": "uuid", "node_id": "uuid" },
    { "hostname": "c07u44", "onboarding_id": "uuid", "node_id": "uuid" },
    { "hostname": "c07u45", "onboarding_id": "uuid", "node_id": "uuid" }
  ]
}

Batch onboarding uses one shared site_id, profile_id, and sku_id. Per-node rows carry only identity/discovery input. Batch spawns a parent Temporal workflow that fans out child workflows per node with configurable concurrency.

API contract rule:

  • The control-plane API is JSON-only for single and batch onboarding.
  • CSV, if supported at all, belongs in CLI/import tooling that validates required headers and converts rows into the JSON API shape before submission.

Prototype import intent carried forward from the current scripts:

  • Header-driven import should be the only supported CSV mode in tooling (hostname,ipmi_ip).
  • Legacy positional CSV parsing is a script convenience, not part of the GPUasService contract.
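
A sketch of the header-driven CSV conversion a CLI/import helper could perform before calling the JSON batch API. Names and the rejection message are illustrative:

```go
package importer

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// NodeRow is one per-node entry in the JSON batch onboarding payload.
type NodeRow struct {
	IPMIIp   string `json:"ipmi_ip"`
	Hostname string `json:"hostname"`
}

// ParseHeaderCSV accepts only the header-driven CSV mode (hostname and
// ipmi_ip headers in any column order), rejects positional files, and emits
// rows ready for the JSON batch API.
func ParseHeaderCSV(r io.Reader) ([]NodeRow, error) {
	cr := csv.NewReader(r)
	header, err := cr.Read()
	if err != nil {
		return nil, fmt.Errorf("read header: %w", err)
	}
	cols := map[string]int{}
	for i, h := range header {
		cols[strings.TrimSpace(h)] = i
	}
	hi, hok := cols["hostname"]
	ii, iok := cols["ipmi_ip"]
	if !hok || !iok {
		return nil, fmt.Errorf("csv must carry hostname and ipmi_ip headers; positional files are not supported")
	}
	var rows []NodeRow
	for {
		rec, err := cr.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		rows = append(rows, NodeRow{Hostname: rec[hi], IPMIIp: rec[ii]})
	}
	return rows, nil
}
```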

4.2 Monitoring APIs

These query a durable onboarding/decommission read model populated by workflow progress events. Temporal visibility remains useful for workflow debugging and search, but it should not be the only operator-facing source of history or status.

The full operator API surface, including recovery actions, is defined in section 4.15 (Operator APIs needed). This section should be read as the read/query surface rather than the complete mutation contract.

4.2.1 Onboarding read model

node_onboardings
  onboarding_id          uuid        PK              -- external/job id, also Temporal workflow ID
  batch_id               uuid        NULL            -- parent batch workflow id when present
  node_id                uuid        NULL            -- GPUasService node id once created
  site_id                uuid        NOT NULL
  hostname               text        NOT NULL
  ipmi_ip                inet        NOT NULL
  pxe_mac                text        NULL
  maas_system_id         text        NULL
  requested_by_user_id   uuid        NULL
  current_stage          text        NOT NULL
  current_attempt        int         NOT NULL DEFAULT 0
  status                 text        NOT NULL        -- pending | running | completed | failed_retryable | failed_manual_intervention | cancelled | compensating | reconciled
  error_code             text        NULL            -- internal workflow/stage classification
  error_message          text        NULL
  error_details          jsonb       NOT NULL DEFAULT '{}'::jsonb
  workflow_id            text        NOT NULL
  workflow_run_id        text        NULL
  requested_at           timestamptz NOT NULL DEFAULT now()
  started_at             timestamptz NULL
  completed_at           timestamptz NULL
  updated_at             timestamptz NOT NULL DEFAULT now()

Optional stage-event detail for operator history:

node_onboarding_events
  id                     uuid        PK
  onboarding_id          uuid        FK → node_onboardings
  stage                  text        NOT NULL
  attempt                int         NOT NULL
  status                 text        NOT NULL        -- started | succeeded | failed | compensated | skipped
  message                text        NULL
  details                jsonb       NOT NULL DEFAULT '{}'::jsonb
  occurred_at            timestamptz NOT NULL DEFAULT now()

API behavior:

  • GET /api/v1/admin/onboardings/{id} reads node_onboardings + recent node_onboarding_events
  • GET /api/v1/admin/onboardings?batch_id=... groups by batch_id
  • filters should use the richer workflow status model, not a collapsed generic failed
  • recovery endpoints are defined in section 4.15 (Operator APIs needed)

4.3 Workflow stages

MaasNodeOnboardWorkflow(input)
  ├─ Activity: LoadSiteConfig(site_id)
  │    Read site record from DB + secrets from Vault.
  │    Validate MAAS reachable.
  │    Output: resolved SiteConfig (MAAS URL, token, default power creds, network config).
  ├─ Activity: ResolvePowerCredentials(site_config, hostname, ipmi_ip, pxe_mac)
  │    1. Evaluate active overrides in priority order.
  │    2. Fall back to site default power credentials.
  │    3. Return resolved BMC login for this workflow execution.
  │    Output: PowerCredentials(user, pass).
  ├─ Activity: CreateOrFindInMaas(site_config, power_credentials, ipmi_ip, hostname, pxe_mac)
  │    1. Look up by hostname, then by IPMI power_address, then by PXE MAC.
  │    2. If not found, create machine via MAAS API (hostname + IPMI power config).
  │    3. If create doesn't return a usable machine, fall back to IPMI PXE boot cycle + discovery poll.
  │    4. During discovery polling:
  │       a. prefer a machine discovered during this workflow execution,
  │       b. only auto-claim a sole `New` machine if site policy explicitly enables it,
  │       c. otherwise fail closed on ambiguity.
  │    Output: maas_system_id.
  │    Idempotent: yes (lookup-first).
  ├─ Activity: CommissionNode(site_config, system_id)
  │    1. Check current MAAS status.
  │    2. If New, accept machine.
  │    3. Set hostname + IPMI power parameters.
  │    4. If already Ready, skip. If already Commissioning, just wait.
  │    5. Otherwise, start commissioning (enable_ssh=1, skip_bmc_config=1).
  │    Output: void (machine is commissioning or ready).
  │    Idempotent: yes (status-aware).
  ├─ Activity: WaitForMaasStatus(site_config, system_id, "Ready", timeout)
  │    Poll MAAS machine status with heartbeats.
  │    Detect failure states (Failed, Broken, Error) → return error immediately.
  │    Output: void (machine is Ready).
  │    Uses Temporal activity heartbeat to stay alive during long waits (see the sketch after this outline).
  ├─ Activity: ConfigureStorage(site_config, system_id)
  │    1. Read block devices from MAAS.
  │    2. Detect BOSS boot disk by model/name/id_path pattern (boss|boot optimized|m.2).
  │    3. Set boot disk.
  │    4. Apply flat storage layout.
  │    Output: boss_disk_id.
  │    Fails if: no matching disk found (hardware mismatch, needs admin intervention).
  ├─ Activity: ApplyRoCEPhase2(site_config, hostname)
  │    1. Query maas_roce_assignments for this hostname + site.
  │    2. For each row: find interface, create /31 subnet if missing, apply STATIC link.
  │    3. This MAAS-side phase is only half of the prototype contract; the matching guest OS
  │       RoCE routing bundle must also be present in cloud-init so node-local policy routing
  │       matches the MAAS interface links.
  │    4. Skip if no rows (non-fatal).
  │    Output: count of IPs applied.
  │    Idempotent: yes (skips if already applied).
  ├─ Activity: EnsurePxeInterfaceAuto(site_config, system_id)
  │    1. Find node PXE interface by name.
  │    2. Ensure it has AUTO or DHCP subnet link.
  │    3. If not, unlink existing and link as AUTO.
  │    Output: void.
  │    Pre-deploy validation: fail if interface has no valid subnet association.
  ├─ Activity: CreateGPUaaSNodeAndRenderCloudInit(site_config, system_id, host, sku, gpus, region)
  │    1. Create node record in GPUasService DB (status: enrolling, onboarding_mode: maas).
  │    2. Generate enrollment token with MAAS-specific TTL (must exceed expected deploy + first boot time).
  │    3. Render first-boot payload with ordered layers:
  │       a. Optional infra-owned site bootstrap bundle reference.
  │       b. GPUasService agent bootstrap (curl command with enrollment token).
  │       c. MAAS hardware sync credentials + timer.
  │       d. Extra disk initialization script.
  │       e. RoCE routing policy script + systemd unit.
  │       f. Deploy user creation.
  │    4. Get MAAS machine token (consumer_key, token_key, token_secret) for hw-sync.
  │    5. Substitute all template variables into the composed first-boot payload.
  │    Output: gpuaas_node_id, cloud_init_b64.
  │    Note: the extra disk initialization script is destructive:
  │          it partitions/formats all non-root disks and mounts them as /shareN.
  │          This behavior must stay explicit in admin UX/runbooks and never be implicit.
  ├─ Activity: DeployViaMaas(site_config, system_id, cloud_init_b64, distro_series, enable_hw_sync)
  │    1. Call MAAS deploy with user_data=cloud_init_b64.
  │    Output: void (machine is deploying).
  ├─ Activity: WaitForMaasStatus(site_config, system_id, "Deployed", timeout)
  │    Same as above — poll with heartbeats.
  │    Output: void (machine is Deployed).
  ├─ Activity: ClassifyDeployFailure(site_config, system_id, hostname)
  │    1. Inspect MAAS machine details + recent MAAS events.
  │    2. Detect datasource/cloud-init-like failures (`no datasource`, cloud-init final stage, etc.).
  │    3. Return failure_class = datasource_like | generic.
  ├─ Activity: RecoverForDatasourceRetry(site_config, system_id, hostname)
  │    1. Abort in-progress deploy/commission/test if needed.
  │    2. If the machine is Deployed or Failed deployment, release it back to Ready.
  │    3. Re-enter deploy flow once, bounded by site policy.
  │    Output: void.
  ├─ Activity: EnsureHardwareSyncConfigured(site_config, system_id, hostname)
  │    1. Verify MAAS machine token exists for this system.
  │    2. If token is absent, retry until MAAS emits it.
  │    3. Resolve node SSH address using the site policy preferred interface, then fall back to
  │       the first MAAS-reported deployed IP if needed.
  │    4. Push /etc/maas/maas-machine-creds.yml to the deployed host.
  │       During initial onboarding, this is the one explicit SSH-based seed step before the
  │       GPUasService node-agent is available. After agent enrollment, subsequent reseed/restart
  │       operations should move through a typed `node.hw_sync_reseed` task instead of SSH.
  │    5. Install/restart maas-agent + hardware-sync timer.
  │    6. Enforce enable_hw_sync=true for the machine and hardware_sync_interval for the site.
  │    Output: void.
  ├─ Activity: WaitForHardwareSyncHealthy(site_config, system_id, timeout)
  │    Poll MAAS until:
  │    - enable_hw_sync=true
  │    - last_sync is populated
  │    - next_sync is populated
  │    - is_sync_healthy=true (or null on MAAS builds that omit it)
  │    Output: void.
  ├─ Activity: WaitForAgentEnrollment(node_id, timeout)
  │    Poll GPUasService node status until it transitions to `active`.
  │    This happens when the agent calls /internal/v1/nodes/enroll on first boot.
  │    Output: void (node is active and ready for allocations).
  └─ Done: node is active in GPUasService and healthy for ongoing MAAS hardware sync.
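
A sketch of the WaitForMaasStatus polling activity used twice above, assuming a thin MAAS client wrapper (MachineReader is hypothetical) and relying on Temporal's activity heartbeat. The retry policy and HeartbeatTimeout come from the workflow's ActivityOptions:

```go
package maasactivities

import (
	"context"
	"fmt"
	"time"

	"go.temporal.io/sdk/activity"
)

// MachineReader is the subset of the MAAS client this activity needs; assumed
// shape wrapping GET /api/2.0/machines/{system_id}/.
type MachineReader interface {
	MachineStatus(ctx context.Context, systemID string) (string, error)
}

// WaitForMaasStatus polls MAAS until the machine reaches target, recording a
// Temporal heartbeat each cycle so long waits survive worker restarts.
// Failure states return immediately instead of burning the timeout.
func WaitForMaasStatus(ctx context.Context, maas MachineReader, systemID, target string) error {
	failure := map[string]bool{
		"Failed": true, "Broken": true, "Error": true,
		"Failed commissioning": true, "Failed deployment": true,
	}
	for {
		status, err := maas.MachineStatus(ctx, systemID)
		if err != nil {
			return err // retried by the activity retry policy
		}
		activity.RecordHeartbeat(ctx, status) // keeps the activity alive during long waits
		if status == target {
			return nil
		}
		if failure[status] {
			return fmt.Errorf("machine %s entered failure state %q while waiting for %q",
				systemID, status, target)
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // activity timeout or cancel
		case <-time.After(15 * time.Second):
		}
	}
}
```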

4.4 Batch workflow

MaasNodeBatchOnboardWorkflow(input)
  ├─ For each node in input.nodes (with concurrency limit):
  │    ├─ ChildWorkflow: MaasNodeOnboardWorkflow(per_node_input)
  │    └─ Record result (success / failure + error)
  └─ Return batch summary: { succeeded: N, failed: N, details: [...] }

Temporal's child workflow pattern provides:

  • Configurable max parallel onboardings (e.g. 10 at a time).
  • Independent retry per node — one failure doesn't block others.
  • Parent workflow tracks aggregate status.
  • Per-node logs/events analogous to the current onboard_parallel.sh row logs should be preserved in the read model so operators can inspect one failed node without losing batch context.
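
A minimal fan-out sketch using the Temporal Go SDK. OnboardInput and the child workflow stub stand in for the real per-node workflow from 4.3; a buffered workflow channel acts as the concurrency limiter, and since workflow coroutines are cooperatively scheduled, mutating the shared result struct is deterministic:

```go
package maasworkflows

import "go.temporal.io/sdk/workflow"

// OnboardInput is the per-node child workflow input (simplified).
type OnboardInput struct {
	SiteID, ProfileID, SKUID, Hostname, IPMIIp string
}

// MaasNodeOnboardWorkflow is the per-node workflow from section 4.3; this
// stub only pins the signature so the sketch is self-contained.
func MaasNodeOnboardWorkflow(ctx workflow.Context, in OnboardInput) error { return nil }

type NodeRef struct{ Hostname, IPMIIp string }

type BatchInput struct {
	SiteID, ProfileID, SKUID string
	Nodes                    []NodeRef
	MaxParallel              int // site policy batch_max_parallel
}

type BatchResult struct {
	Succeeded, Failed int
	Errors            map[string]string // hostname -> failure reason
}

// MaasNodeBatchOnboardWorkflow fans out one child workflow per node with
// bounded parallelism; one node's failure never blocks the others.
func MaasNodeBatchOnboardWorkflow(ctx workflow.Context, in BatchInput) (BatchResult, error) {
	res := BatchResult{Errors: map[string]string{}}
	limit := in.MaxParallel
	if limit <= 0 {
		limit = 10
	}
	tokens := workflow.NewBufferedChannel(ctx, limit) // parallelism slots
	wg := workflow.NewWaitGroup(ctx)

	for _, n := range in.Nodes {
		n := n
		wg.Add(1)
		workflow.Go(ctx, func(ctx workflow.Context) {
			defer wg.Done()
			tokens.Send(ctx, true) // acquire a slot; blocks when the buffer is full
			defer tokens.Receive(ctx, nil)

			err := workflow.ExecuteChildWorkflow(ctx, MaasNodeOnboardWorkflow, OnboardInput{
				SiteID: in.SiteID, ProfileID: in.ProfileID, SKUID: in.SKUID,
				Hostname: n.Hostname, IPMIIp: n.IPMIIp,
			}).Get(ctx, nil)
			if err != nil {
				res.Failed++
				res.Errors[n.Hostname] = err.Error()
				return
			}
			res.Succeeded++
		})
	}
	wg.Wait(ctx)
	return res, nil
}
```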

4.5 Persistent state per onboarding

Tracked in a durable control-plane read model and mirrored into Temporal workflow state/search attributes:

onboarding_id              -- workflow ID (correlation key)
batch_id                   -- parent workflow ID (null for single)
site_id                    -- MAAS site
hostname                   -- target hostname
ipmi_ip                    -- IPMI BMC address
stage                      -- current stage name
status                     -- workflow state per section 4.8, not a collapsed generic failed
maas_system_id             -- populated after CreateOrFind (null before)
gpuaas_node_id             -- populated after CreateGPUaaSNode (null before)
boss_disk_id               -- populated after ConfigureStorage (null before)
cloud_init_rendered        -- bool
deploy_ip_addresses        -- populated after deploy
error                      -- last failure reason
retry_count                -- per-stage retry tracking
started_at                 -- workflow start
updated_at                 -- last stage transition

Temporal search attributes should be optimized for discovery (site_id, hostname, status, current_stage, batch_id), while the SQL read model remains the source for admin APIs, audit joins, and long-lived history.
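
A sketch of mirroring stage transitions into both surfaces, assuming a custom CurrentStage search attribute registered with the Temporal cluster and a hypothetical RecordOnboardingProgress activity that writes the read-model rows; the caller is expected to have ActivityOptions already applied to ctx:

```go
package maasworkflows

import "go.temporal.io/sdk/workflow"

// recordProgress updates both surfaces: the Temporal search attribute for
// discovery, and the durable SQL read model via an activity (assumed to
// UPDATE node_onboardings and INSERT into node_onboarding_events).
func recordProgress(ctx workflow.Context, onboardingID, stage, status string) error {
	if err := workflow.UpsertSearchAttributes(ctx, map[string]interface{}{
		"CurrentStage": stage, // custom attribute; must be registered cluster-side
	}); err != nil {
		return err
	}
	return workflow.ExecuteActivity(ctx, "RecordOnboardingProgress",
		onboardingID, stage, status).Get(ctx, nil)
}
```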

4.6 Failure handling and compensation

| Stage | Failure mode | Retry strategy | Compensation on cancel |
|---|---|---|---|
| LoadSiteConfig | Vault unreachable, site disabled | Retry 3x with backoff. If site disabled, fail immediately. | None |
| CreateOrFindInMaas | IPMI unreachable, MAAS API down, discovery timeout | Retry 3x (create is idempotent). Discovery timeout: fail, admin reviews IPMI/PXE. | None (machine may exist in MAAS — leave it, admin can clean up) |
| CommissionNode | MAAS API error | Retry 3x. | None (commissioning can be re-triggered) |
| WaitForReady | Timeout, enters Failed/Broken | If Failed: log MAAS error details, mark failed_commission. If timeout: mark failed_commission_timeout. | None (machine stays in MAAS as-is) |
| ConfigureStorage | BOSS disk not found | Fail immediately — hardware mismatch, needs admin. No retry. | None |
| ApplyRoCEPhase2 | Interface not found, subnet create fails | Non-fatal by default. Log warning, continue. | None |
| EnsurePxeInterfaceAuto | Validation fails (no subnet link) | Fail if strict_pxe_preflight. Retry 1x after re-linking. | None |
| CreateGPUaaSNodeAndRenderCloudInit | SKU not found, duplicate node | SKU invalid: fail immediately (input error). Duplicate: recover existing node_id. | Delete GPUasService node record if created. Consume enrollment token. |
| DeployViaMaas | MAAS API error, deploy rejected | Retry 2x. | Release machine back to Ready in MAAS. |
| WaitForDeployed | Timeout, enters Failed | Release back to Ready (compensation), then retry deploy. If the failure class is datasource/cloud-init-like, allow the bounded site-policy deploy retry path. | Release machine back to Ready in MAAS. |
| ClassifyDeployFailure | MAAS event query fails, failure class unknown | Best-effort. If classification fails, treat as generic deploy failure and stop automatic retry. | None |
| RecoverForDatasourceRetry | Abort/release recovery fails | Retry 1x. If recovery still fails, stop and require admin intervention. | None |
| EnsureHardwareSyncConfigured | Token missing, no deploy-reachable SSH IP, creds push fails, maas-agent install fails | Retry until timeout. If MAAS never emits machine token, fail failed_hw_sync_seed. | Do not release from MAAS; leave node deployed for investigation. |
| WaitForHardwareSyncHealthy | enable_hw_sync=false, last_sync absent, unhealthy sync | Retry with periodic re-seed of creds/timer. Fail failed_hw_sync_health on timeout. | Do not mark node active. Alert admin. |
| WaitForAgentEnrollment | Token expired, agent can't reach API | Regenerate enrollment token, wait again. Max 2 attempts. | Node is deployed in MAAS but not active in GPUasService. Mark failed_enrollment. Do NOT release from MAAS (OS is running). |

Key compensation principle: pre-deploy failures are retryable in place (machine stays Ready). Deploy failures compensate by releasing back to Ready. Post-deploy failures (hardware sync seed/health, enrollment) leave MAAS alone and only affect GPUasService state until an admin resolves the node.
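
One way to encode that principle in the workflow is a small saga-style compensation stack; this is a sketch, not the required implementation. Per the principle above, the release-to-Ready step is only registered around the deploy phase, and post-deploy stages deliberately register nothing:

```go
package maasworkflows

import "go.temporal.io/sdk/workflow"

// compensations is a saga helper: stages that mutate external state push an
// undo step; when the workflow fails in a phase where rollback is allowed,
// recorded steps run in reverse order, best-effort.
type compensations []func(workflow.Context)

func (c *compensations) push(f func(workflow.Context)) { *c = append(*c, f) }

func (c compensations) run(ctx workflow.Context) {
	for i := len(c) - 1; i >= 0; i-- {
		c[i](ctx) // each step logs its own failure
	}
}

// Usage sketch inside MaasNodeOnboardWorkflow, around the deploy phase:
//
//	var comp compensations
//	comp.push(func(ctx workflow.Context) {
//		_ = workflow.ExecuteActivity(ctx, "ReleaseMaasNode", systemID).Get(ctx, nil)
//	})
//	if err := workflow.ExecuteActivity(ctx, "DeployViaMaas", args).Get(ctx, nil); err != nil {
//		comp.run(ctx) // release machine back to Ready
//		return err
//	}
```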

4.7 Failure taxonomy

Recovery decisions must not be driven by one generic failed state. The workflow/read model should classify failures so retry, resume, rerun, and manual intervention are deterministic.

| Failure class | Typical examples | Default handling |
|---|---|---|
| input_config_error | invalid SKU, malformed onboarding input, disabled site | fail immediately, no automatic retry |
| site_capability_missing | required site policy absent, required MAAS capability not available | fail immediately, operator fixes site capability |
| upstream_transient | MAAS API timeout, Vault/network transient | bounded automatic retry |
| bmc_power_failure | IPMI unreachable, power-cycle failure | bounded retry, then manual intervention |
| pxe_discovery_failure | discovery timeout, PXE never appears in MAAS | bounded retry, then manual intervention |
| hardware_mismatch | BOSS disk not found, expected NIC/interface absent | fail and block for manual intervention |
| deploy_cloud_init_failure | failed deployment, datasource/cloud-init class failure | classify and allow bounded recovery path |
| post_deploy_connectivity_failure | no deploy-reachable SSH IP, creds push fails | bounded retry, then manual intervention |
| hardware_sync_failure | token never emitted, sync unhealthy, timer/agent drift | bounded retry/reseed, then manual intervention |
| agent_enrollment_failure | enrollment timeout, API unreachable from node, token expired | bounded retry with token refresh, then manual intervention |
| state_ambiguity | conflicting discovery candidates, out-of-band MAAS change, late success after failure | stop automation and require reconcile/adopt decision |

These classes should be stored in the read model as first-class fields, not only inferred from free-form error text.

4.8 Workflow recovery state model

The onboarding workflow needs its own operational state in addition to coarse node state.

Suggested workflow/job states:

  • pending
  • running
  • completed
  • failed_retryable
  • failed_manual_intervention
  • cancelled
  • compensating
  • reconciled

The critical distinction is:

  • failed_retryable: the system believes a safe rerun/retry path exists
  • failed_manual_intervention: automation must stop and surface the issue to an operator

nodes.status should stay coarse (enrolling, active, offline, quarantined, retired, removing). Detailed progress and failure semantics belong in the onboarding/decommission read models.

For inventory lifecycle transitions that temporarily use coarse states such as draining or removing, the workflow must still own resumability explicitly. A coarse node state is not enough by itself; the control plane must be able to:

  • detect whether the owning lifecycle task still exists and is claimable
  • requeue a stale dispatched lifecycle task
  • recreate an expired or missing lifecycle task without duplicating a live one
  • adopt an already advanced coarse state on rerun/resume instead of failing on a state conflict

4.9 Operator recovery semantics

Operator actions must be explicit and auditable. These are not interchangeable:

| Action | Meaning | Default use |
|---|---|---|
| retry_stage | rerun only the current failed stage | targeted recovery when the stage is known to be safe/idempotent |
| resume | continue from the last safe stage checkpoint | preferred after transient issues or after operator fixes an external prerequisite |
| rerun | re-enter the full onboarding workflow from the top, with status-aware stage adoption | default recovery for fresh-node onboarding |
| restart_clean | explicitly reset/compensate known progress, then start again | stronger recovery when prior state is suspect |
| cancel | stop the workflow and compensate where possible | operator abort |
| mark_manual_intervention_required | freeze automation until a human resolves the condition | used for hardware mismatch, ambiguity, repeated bounded failure |
| adopt_observed_state | accept externally advanced MAAS/node state and move the workflow forward | used after out-of-band operator action or late success |

For fresh-node onboarding, rerun should be the primary operator recovery path. The workflow should be written so top-level re-entry converges on the desired state rather than duplicating work.

For node retirement and removal, resume must also cover lifecycle-task recovery:

  • disable_node / drain resume should requeue or recreate node.drain when the node is already draining
  • remove resume should requeue or recreate node.uninstall when the node is already removing
  • a workflow must never treat draining or removing as a terminal conflict if the intended lifecycle action is already in progress

4.10 Safe re-entry rules

Every stage must define whether it can be re-entered safely and under what checks.

| Stage | Safe to rerun from top? | Requires state refresh first? | Requires cleanup first? | Notes |
|---|---|---|---|---|
| LoadSiteConfig | yes | no | no | pure read |
| ResolvePowerCredentials | yes | no | no | pure read |
| CreateOrFindInMaas | yes | yes | no | lookup-first, adopt existing machine if present |
| CommissionNode | yes | yes | no | status-aware; skip if already Ready |
| WaitForReady | yes | yes | no | must inspect current MAAS state before deciding timeout/failure outcome |
| ConfigureStorage | yes | yes | no | safe only while machine is in editable pre-deploy MAAS state |
| ApplyRoCEPhase2 | yes | yes | no | idempotent link application |
| EnsurePxeInterfaceAuto | yes | yes | maybe | only mutate if interface still editable in MAAS |
| CreateGPUaaSNodeAndRenderCloudInit | yes | yes | maybe | recover existing node row if already created |
| DeployViaMaas | not blindly | yes | sometimes | never issue deploy again without checking if machine already advanced/faulted |
| WaitForDeployed | yes | yes | maybe | may trigger classify/recover path instead of pure wait |
| EnsureHardwareSyncConfigured | yes | yes | no | safe to reseed repeatedly |
| WaitForHardwareSyncHealthy | yes | yes | no | wait/reseed loop |
| WaitForAgentEnrollment | yes | yes | maybe | may require fresh token or adoption of late success |

4.11 Manual intervention boundary

Some failures should intentionally stop automation and surface a blocked state instead of looping retries:

  • BOSS disk not found
  • conflicting discovery candidates
  • wrong PXE interface wiring that cannot be auto-corrected
  • machine deleted from MAAS mid-workflow
  • repeated datasource/cloud-init failure after bounded retry
  • repeated post-deploy SSH/hardware-sync failure after bounded retry
  • workflow belief conflicts with observed MAAS/node state

These should land in:

  • workflow state: failed_manual_intervention
  • node state: remains coarse (enrolling or quarantined, depending on context)

4.12 Rich diagnostics and audit requirements

The onboarding/decommission read model should capture more than a message string:

  • failure class
  • current stage
  • last observed MAAS status
  • last observed MAAS power state
  • last observed MAAS IPs
  • stage-specific details (system_id, interface, disk ids, token status, etc.)
  • upstream diagnostic snippet/reference (bounded/sanitized)
  • recommended next operator action

Every operator recovery action must audit:

  • actor identity and role
  • reason
  • prior workflow state/stage
  • requested recovery action
  • expected target state
  • correlation/workflow ids

4.13 In-flight reconciliation rules

The workflow must account for out-of-band and late-arriving changes while it is still running:

  • operator changes MAAS state directly
  • machine disappears or is re-created
  • MAAS reports a later state than expected
  • node-agent enrolls after the workflow already marked a failure

Design rules:

  • refresh observed MAAS state before every destructive or irreversible step
  • define adoption rules for late success (WaitForAgentEnrollment should be able to accept a late enrollment if the node is now healthy)
  • define conflict rules for out-of-band changes (move to state_ambiguity / manual intervention when ownership is unclear)
  • reconciliation can mark a workflow reconciled when operator/adoption logic successfully realigns observed state with desired state

4.14 Capability preflight and timeout policy

Before a workflow starts, run a cheap preflight to fail fast on missing prerequisites:

  • site is active
  • MAAS reachable
  • MAAS token valid
  • required site policy present
  • required site capability flags satisfied

Timeouts are product behavior and must be policy-driven and documented:

  • discovery timeout
  • commission timeout
  • deploy timeout
  • hardware sync seed timeout
  • hardware sync health timeout
  • agent enrollment timeout

Each timeout should map to:

  • a failure class
  • a workflow state (failed_retryable vs failed_manual_intervention)
  • a recommended operator action

4.15 Operator APIs needed

The admin API surface should support the recovery model explicitly:

GET  /api/v1/admin/onboardings/{id}
GET  /api/v1/admin/onboardings?batch_id=...
POST /api/v1/admin/onboardings/{id}/retry
POST /api/v1/admin/onboardings/{id}/resume
POST /api/v1/admin/onboardings/{id}/rerun
POST /api/v1/admin/onboardings/{id}/restart-clean
POST /api/v1/admin/onboardings/{id}/cancel
POST /api/v1/admin/onboardings/{id}/adopt
POST /api/v1/admin/onboardings/{id}/mark-manual-intervention
POST /api/v1/admin/reconciliation/run

The contract can be narrowed later, but the design should assume these are distinct operations with different semantics.

4.16 Discovery ambiguity policy

The prototype scripts encode a practical but potentially dangerous machine-claiming strategy. The workflow must make this policy explicit.

Default behavior:

  1. Match by hostname.
  2. Match by IPMI power_address.
  3. Match by PXE MAC if provided.
  4. If the workflow created or triggered discovery and exactly one new MAAS machine appears during that run, claim it.
  5. If multiple candidates remain, fail with an explicit ambiguity error.
  6. Only if site policy auto_claim_single_new_machine=true, allow claiming the sole New machine when no stronger match exists.

This keeps v1 safe-by-default while still allowing the prototype’s convenience behavior in tightly controlled sites.
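
A sketch of the claiming order as a pure selection function; Machine and the DiscoveredThisRun flag are simplified stand-ins for what the discovery activity actually observes:

```go
package maasactivities

import (
	"errors"
	"fmt"
)

// Machine is the subset of a MAAS machine listing used for claiming.
type Machine struct {
	SystemID, Hostname, PowerAddress, PXEMAC, StatusName string
	DiscoveredThisRun                                    bool
}

var ErrAmbiguousDiscovery = errors.New("multiple discovery candidates; manual intervention required")

// SelectMachine implements the default claiming order from this section;
// autoClaimSingleNew mirrors the auto_claim_single_new_machine site policy.
func SelectMachine(machines []Machine, hostname, ipmiIP, pxeMAC string, autoClaimSingleNew bool) (string, error) {
	// 1-3. Strongest-first identity matches.
	for _, match := range []func(Machine) bool{
		func(m Machine) bool { return m.Hostname == hostname },
		func(m Machine) bool { return m.PowerAddress == ipmiIP },
		func(m Machine) bool { return pxeMAC != "" && m.PXEMAC == pxeMAC },
	} {
		for _, m := range machines {
			if match(m) {
				return m.SystemID, nil
			}
		}
	}
	// 4. Prefer a machine that appeared during this workflow's discovery cycle.
	var fresh, news []Machine
	for _, m := range machines {
		if m.DiscoveredThisRun {
			fresh = append(fresh, m)
		}
		if m.StatusName == "New" {
			news = append(news, m)
		}
	}
	if len(fresh) == 1 {
		return fresh[0].SystemID, nil
	}
	if len(fresh) > 1 {
		return "", fmt.Errorf("%w: %d machines discovered this run", ErrAmbiguousDiscovery, len(fresh))
	}
	// 6. Policy-gated convenience path: claim the sole New machine.
	if autoClaimSingleNew && len(news) == 1 {
		return news[0].SystemID, nil
	}
	// 5. Fail closed on ambiguity or absence.
	return "", ErrAmbiguousDiscovery
}
```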

5. Decommission Modes

5.1 Overview

A node has two independent lifecycles: MAAS (bare metal state) and GPUasService (allocation/agent state). Decommission means different things depending on what's being cleaned up.

POST /api/v1/admin/nodes/{id}/decommission
{
  "mode": "soft_reset | reimage | full_decommission",
  "erase": "none | quick | secure",              // reimage + full only
  "storage_cleanup": "none | local | weka | all", // soft_reset + reimage
  "reason": "tenant_transition | security_patch | hardware_removal | ..."
}

Response:
{
  "decommission_id": "uuid",    // Temporal workflow ID
  "mode": "reimage",
  "status": "started"
}

5.2 Mode 1: Soft Reset (allocation turnover)

Use case: Tenant's allocation released, node stays deployed, ready for next tenant immediately.

Workflow:

SoftResetWorkflow(node_id, storage_cleanup)
  ├─ Activity: DisableNode(node_id)
  │    Mark node unavailable for new allocations.
  ├─ Activity: DrainNode(node_id)
  │    Close active terminal sessions.
  │    Revoke all allocation users (agent task: allocation.revoke_user).
  │    Wait for task completion.
  ├─ Activity: CleanupStorage(node_id, storage_cleanup)
  │    If "local": scrub /tmp, /var/tmp, clear local scratch.
  │    If "weka": unmount Weka POSIX mounts, remove tenant Weka config.
  │    If "all": both.
  │    If "none": skip.
  ├─ Activity: ScrubGPU(node_id)
  │    Reset GPU memory/VRAM state.
  │    Vendor-specific: rocm-smi --resetgpu (AMD) or nvidia-smi --gpu-reset (NVIDIA).
  ├─ Activity: ValidateCleanNode(node_id)
  │    Verify: no tenant users exist, no tenant processes running,
  │    no tenant mounts, GPU memory clear.
  │    If validation fails: quarantine node, alert admin.
  ├─ Activity: EnableNode(node_id)
  │    Mark node available for allocations.
  └─ Done: node is clean and available.

MAAS involvement: None. Node stays Deployed. Agent stays running.

What's missing today: GPU scrub task, Weka cleanup task, post-cleanup validation task.

5.3 Mode 2: Reimage (OS reset, same hardware)

Use case: OS drift, security patch cycle, major tenant transition requiring clean OS.

Workflow:

ReimageWorkflow(node_id, site_id, erase, storage_cleanup)
  ├─ Activity: DisableNode(node_id)
  │    Mark node unavailable.
  ├─ Activity: DrainNode(node_id)
  │    Close terminals, revoke users, wait.
  ├─ Activity: CleanupStorage(node_id, storage_cleanup)
  │    Same as soft reset.
  ├─ Activity: LoadSiteConfig(site_id)
  │    Read MAAS site config + Vault secrets.
  ├─ Activity: ReleaseMaasNode(site_config, system_id, erase)
  │    Call MAAS release with erase mode (none/quick/secure).
  │    If erase fails: fallback to no-erase release (configurable).
  │    Wait for machine to reach Ready.
  ├─ Activity: ConfigureStorage(site_config, system_id)
  │    Re-detect BOSS disk, set boot disk, flat layout.
  │    (Hardware hasn't changed, but MAAS resets storage config on release.)
  ├─ Activity: ApplyRoCEPhase2(site_config, hostname)
  │    Re-apply RoCE IPs (MAAS clears interface config on release).
  ├─ Activity: EnsurePxeInterfaceAuto(site_config, system_id)
  │    Re-ensure PXE interface AUTO link.
  ├─ Activity: RenderCloudInit(site_config, system_id, node_id)
  │    1. Generate new enrollment token (existing cert may still work,
  │       but fresh token ensures recovery if cert was lost during reimage).
  │    2. Render cloud-init with agent bootstrap.
  ├─ Activity: DeployViaMaas(site_config, system_id, cloud_init_b64, distro_series, enable_hw_sync)
  │    1. Deploy via MAAS.
  ├─ Activity: WaitForMaasStatus(site_config, system_id, "Deployed", timeout)
  │    1. Wait for Deployed.
  ├─ Activity: EnsureHardwareSyncConfigured(site_config, system_id, hostname)
  ├─ Activity: WaitForHardwareSyncHealthy(site_config, system_id, timeout)
  ├─ Activity: WaitForAgentEnrollment(node_id, timeout)
  │    Agent re-enrolls on boot (or reuses existing cert if still valid).
  ├─ Activity: EnableNode(node_id)
  │    Mark node available.
  └─ Done: node has fresh OS, agent re-enrolled, ready for allocations.

GPUasService node ID stays the same. The physical machine and its control-plane identity are preserved across reimage. Reimage is an OS reset, not a node replacement.

5.4 Mode 3: Full Decommission (remove from fleet)

Use case: Hardware returned, failed beyond repair, moved to different cluster.

Workflow:

FullDecommissionWorkflow(node_id, site_id, erase)
  ├─ Activity: DisableNode(node_id)
  ├─ Activity: ForceReleaseAllocations(node_id)
  │    Force-release any active allocation on this node.
  │    Triggers billing closure.
  ├─ Activity: DrainNode(node_id)
  │    Close terminals, revoke users.
  ├─ Activity: LoadSiteConfig(site_id)
  ├─ Activity: ReleaseMaasNode(site_config, system_id, erase)
  │    Release with erase mode. If erase fails and site policy allows it,
  │    abort/mark-fixed and retry with forced no-erase release. Wait for Ready.
  ├─ Activity: PowerOffMaasNode(site_config, system_id)
  │    Power off via MAAS API.
  ├─ Activity: RetireGPUaaSNode(node_id)
  │    Set node status → retired.
  ├─ Activity: RemoveGPUaaSNodeRecord(node_id)
  │    Use the existing inventory lifecycle:
  │    retired → removing → node.uninstall → delete on success
  │    (or back to retired on uninstall failure).
  │    Audit history remains in audit_logs + decommission tracking.
  ├─ Activity: CleanupSecrets(node_id)
  │    Revoke/delete any node-specific Vault secrets if applicable.
  │    Remove enrollment token from Redis if still present.
  ├─ Activity: RemoveMaasRecord(site_config, system_id)  // optional
  │    Delete machine from MAAS. Or leave for MAAS-side audit.
  │    Configurable: default is leave.
  └─ Done: node is powered off, removed from active GPUasService inventory, agent removed.

Re-onboarding a fully decommissioned machine: Create a new GPUasService node record. Reuse of the old node identity is not allowed after full remove.

5.5 Mode 4: Storage-Only Cleanup

Use case: Between tenants on shared storage (Weka), node and OS are fine.

This is a subset of soft reset — just the storage cleanup step. Can be triggered directly:

POST /api/v1/admin/nodes/{id}/storage-cleanup
{
  "scope": "local | weka | all"
}

Runs CleanupStorage + ValidateCleanNode activities only.

5.6 Agent task catalog additions needed

| Task type | Purpose | Decommission modes |
|---|---|---|
| node.gpu_scrub | Reset GPU memory/VRAM, vendor-specific | soft_reset, reimage |
| node.storage_cleanup | Scrub local disks, unmount/cleanup Weka | soft_reset, reimage, storage-only |
| node.validate_clean | Verify no tenant residue (users, processes, mounts, GPU state) | soft_reset, reimage |
| node.hw_sync_reseed | Reinstall/restart MAAS hardware sync credentials + timer after deploy/reimage | onboarding, reimage |

Existing tasks already covering:

  • allocation.revoke_user — user cleanup
  • terminal.close — terminal session teardown
  • node.drain — mark unschedulable
  • node.uninstall — agent self-removal

5.7 Decommission monitoring

GET  /api/v1/admin/decommissions/{id}              -- status, stage, error
GET  /api/v1/admin/decommissions?node_id=...       -- history for a node
GET  /api/v1/admin/decommissions?status=failed     -- failed decommissions
POST /api/v1/admin/decommissions/{id}/retry        -- retry from failed stage
POST /api/v1/admin/decommissions/{id}/cancel       -- abort (best-effort)

Back these endpoints with a read model parallel to onboarding:

node_decommissions
  decommission_id        uuid        PK              -- external/job id, also Temporal workflow ID
  node_id                uuid        NOT NULL
  site_id                uuid        NULL
  maas_system_id         text        NULL
  mode                   text        NOT NULL        -- soft_reset | reimage | full_decommission | storage_cleanup
  status                 text        NOT NULL        -- pending | running | completed | failed | cancelled
  current_stage          text        NOT NULL
  current_attempt        int         NOT NULL DEFAULT 0
  requested_by_user_id   uuid        NULL
  error_code             text        NULL
  error_message          text        NULL
  error_details          jsonb       NOT NULL DEFAULT '{}'::jsonb
  workflow_id            text        NOT NULL
  workflow_run_id        text        NULL
  requested_at           timestamptz NOT NULL DEFAULT now()
  started_at             timestamptz NULL
  completed_at           timestamptz NULL
  updated_at             timestamptz NOT NULL DEFAULT now()

Optional detailed history:

node_decommission_events
  id                     uuid        PK
  decommission_id        uuid        FK → node_decommissions
  stage                  text        NOT NULL
  attempt                int         NOT NULL
  status                 text        NOT NULL        -- started | succeeded | failed | compensated | skipped
  message                text        NULL
  details                jsonb       NOT NULL DEFAULT '{}'::jsonb
  occurred_at            timestamptz NOT NULL DEFAULT now()

5.8 Decommission failure and recovery model

Decommission needs the same rigor as onboarding because it also depends on MAAS, the node-agent, and external node/network state.

| Stage | Failure mode | Retry strategy | Compensation / operator path |
|---|---|---|---|
| DisableNode | inventory update fails | retry 3x | stop before external mutation |
| ForceReleaseAllocations | allocation force-release fails | bounded retry | move to manual intervention if allocations remain attached |
| DrainNode | node-agent unreachable, drain task fails | bounded retry | mark manual intervention; operator may quarantine or continue with explicit override |
| CleanupStorage / ScrubGPU / ValidateCleanNode | host cleanup fails, validation residue remains | bounded retry | manual intervention or quarantine depending on mode |
| ReleaseMaasNode | MAAS release fails, erase fails, timeout to Ready | bounded retry; fallback to no-erase if site policy allows | if still not Ready, stop and require operator reconcile |
| PowerOffMaasNode | MAAS power-off fails | bounded retry | manual intervention; do not pretend full decommission completed |
| RetireGPUaaSNode | inventory transition fails | retry 3x | stop before remove path advances |
| RemoveGPUaaSNodeRecord | node.uninstall fails, node remains retired | use existing inventory retry path | explicit operator retry/resume supported; node must not be deleted prematurely |
| CleanupSecrets | Vault/Redis cleanup fails | bounded retry | decommission may complete with a follow-up cleanup task recorded |
| RemoveMaasRecord | MAAS delete fails | best-effort by default | leave MAAS record and mark follow-up action if site policy says retain/delete mismatch |

Manual intervention should be the default outcome when decommission ownership becomes ambiguous, for example:

  • node-agent is unreachable but MAAS state is still mutable
  • MAAS release/power-off does not converge
  • uninstall fails after the node is already retired
  • operator changes MAAS state out of band during decommission

6. State Drift: MAAS ↔ GPUasService Reconciliation

6.1 Problem

MAAS and GPUasService are independent systems with independent state. Drift occurs when:

  • An admin acts directly in MAAS (UI/CLI) without going through GPUasService.
  • Hardware fails and MAAS detects it before GPUasService does.
  • Network issues cause the agent to lose contact.
  • A node is reimaged or released outside the workflow.

Without reconciliation, GPUasService may schedule allocations to nodes that no longer exist or are in the wrong state.

6.2 Drift scenarios

| Scenario | MAAS state | GPUasService state | Severity | Impact |
|---|---|---|---|---|
| Admin releases node in MAAS UI | Ready | active | CRITICAL | Agent is dead, allocations will fail |
| Admin redeploys from MAAS CLI | Deploying | active (with allocation) | CRITICAL | Tenant session destroyed |
| Node hardware failure | Failed testing | active | HIGH | Allocations to dead node |
| Node powered off externally | Off | active | HIGH | Agent stops, allocations fail |
| MAAS firmware update reboots node | Commissioning | active | MEDIUM | Temporary disruption |
| Node IP changes after redeploy | Deployed (new IP) | active (old IP) | MEDIUM | Agent still works (outbound), but SSH probe fails |
| GPUasService retires node, MAAS not told | Deployed | retired | LOW | Burning power, MAAS still manages |
| Machine deleted from MAAS | (absent) | active | CRITICAL | Node record is orphaned |

6.3 Detection: dual signal approach

Primary signal: agent heartbeat.

The agent already polls /internal/v1/nodes/{id}/tasks/wait. This is effectively a heartbeat. Track:

nodes
  last_agent_contact_at   timestamptz   -- updated on every task poll

If now() - last_agent_contact_at > threshold (e.g. 5 minutes), mark node offline. This catches most drift scenarios without talking to MAAS.

The node.heartbeat_check task type already exists in the agent catalog. The control plane can queue periodic heartbeat tasks and track response time.
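
A minimal sketch of the staleness sweep, assuming database/sql against the column above; the markNodesOffline name and the exact query are illustrative:

package heartbeat

import (
    "context"
    "database/sql"
    "time"
)

const offlineThreshold = 5 * time.Minute

// markNodesOffline flips active nodes whose agents have stopped polling.
// It deliberately never touches MAAS: this is the agent-side signal only.
func markNodesOffline(ctx context.Context, db *sql.DB) (int64, error) {
    res, err := db.ExecContext(ctx, `
        UPDATE nodes
           SET status = 'offline'
         WHERE status = 'active'
           AND now() - last_agent_contact_at > make_interval(secs => $1)`,
        offlineThreshold.Seconds())
    if err != nil {
        return 0, err
    }
    return res.RowsAffected()
}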

Secondary signal: periodic MAAS reconciler.

A Temporal cron workflow (or scheduled activity) runs every N minutes per site:

MaasSiteReconcilerWorkflow(site_id)  -- cron: every 5 minutes
  ├─ Activity: LoadSiteConfig(site_id)
  ├─ Activity: FetchMaasMachines(site_config)
  │    GET /api/2.0/machines/ — all machines for this site.
  │    Returns: [{ system_id, hostname, status_name, power_state, ip_addresses }]
  ├─ Activity: FetchGPUaaSNodes(site_id)
  │    Query nodes where site_id matches and status NOT IN (retired).
  │    Returns: [{ node_id, hostname, maas_system_id, status, host }]
  ├─ Activity: Reconcile(maas_machines, gpuaas_nodes)
  │    Apply reconciliation rules (see 6.4).
  │    Output: list of drift actions taken + alerts raised.
  └─ Done
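
The same workflow, sketched with the Temporal Go SDK. Activity names follow the tree above; the MaasMachine, GpuaasNode, and SiteConfig shapes are assumptions:

package reconcile

import (
    "time"

    "go.temporal.io/sdk/workflow"
)

type SiteConfig struct{ APIURL string }

type MaasMachine struct {
    SystemID, Hostname, StatusName, PowerState string
    IPAddresses                                []string
}

type GpuaasNode struct {
    NodeID, Hostname, MaasSystemID, Status, Host string
}

// MaasSiteReconcilerWorkflow is started per site with a Temporal
// CronSchedule of "*/5 * * * *" so it re-runs every five minutes.
func MaasSiteReconcilerWorkflow(ctx workflow.Context, siteID string) error {
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 2 * time.Minute,
    })

    var cfg SiteConfig
    if err := workflow.ExecuteActivity(ctx, "LoadSiteConfig", siteID).Get(ctx, &cfg); err != nil {
        return err
    }
    var machines []MaasMachine
    if err := workflow.ExecuteActivity(ctx, "FetchMaasMachines", cfg).Get(ctx, &machines); err != nil {
        return err
    }
    var nodes []GpuaasNode
    if err := workflow.ExecuteActivity(ctx, "FetchGPUaaSNodes", siteID).Get(ctx, &nodes); err != nil {
        return err
    }
    // Reconcile applies the rules in 6.4 and records the actions it took.
    return workflow.ExecuteActivity(ctx, "Reconcile", machines, nodes).Get(ctx, nil)
}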

6.4 Reconciliation rules

| MAAS state | GPUasService state | Auto-action | Alert |
|---|---|---|---|
| Deployed | active | OK — no action | — |
| Deployed | offline | Agent issue — no MAAS action | WARN: agent not polling |
| Ready / Released | active | Auto-quarantine GPUasService node | CRITICAL: node released outside workflow |
| Failed / Broken | active | Auto-quarantine GPUasService node | CRITICAL: hardware failure detected |
| Commissioning | active | Unexpected recommission | WARN: node being recommissioned |
| Deployed, IP changed | active, old IP | Auto-update node host field | INFO: IP change detected |
| (absent from MAAS) | active | Auto-quarantine GPUasService node | CRITICAL: machine deleted from MAAS |
| Deployed | retired | Issue MAAS release + power off | INFO: cleaning up retired node |
| Any | (no matching GPUasService node) | No action | DEBUG: unmanaged MAAS machine |

Auto-quarantine means:

1. Set GPUasService node status → quarantined.
2. Do NOT force-release active allocations immediately — alert admin first.
3. Admin reviews and decides: force-release + reimage, or investigate further.
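
A hypothetical rule-dispatch sketch for the table above; the driftAction enum and classify are illustrative names, and the IP-change rule is omitted because it compares addresses rather than statuses:

package reconcile

// driftAction mirrors the auto-action column of the reconciliation rules.
type driftAction int

const (
    actionNone driftAction = iota
    actionQuarantine
    actionReleaseAndPowerOff
)

// classify maps a (MAAS status, GPUasService status) pair to an auto-action.
// Quarantine never force-releases allocations: alert first, let a human
// decide the rest.
func classify(maasStatus, gpuaasStatus string, presentInMaas bool) driftAction {
    switch {
    case !presentInMaas && gpuaasStatus == "active":
        return actionQuarantine // machine deleted from MAAS
    case (maasStatus == "Ready" || maasStatus == "Released") && gpuaasStatus == "active":
        return actionQuarantine // released outside the workflow
    case (maasStatus == "Failed" || maasStatus == "Broken") && gpuaasStatus == "active":
        return actionQuarantine // hardware failure detected
    case maasStatus == "Deployed" && gpuaasStatus == "retired":
        return actionReleaseAndPowerOff // clean up the retired node
    default:
        return actionNone
    }
}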

6.5 Reconciliation state tracking

node_maas_state
  node_id               uuid        FK → nodes
  site_id               uuid        FK → maas_sites
  maas_system_id        text        NOT NULL
  last_maas_status      text        -- "Deployed", "Ready", "Failed", etc
  last_maas_power_state text        -- "on", "off", "unknown"
  last_maas_ips         text[]
  last_reconciled_at    timestamptz
  drift_detected        bool        DEFAULT false
  drift_details         jsonb       -- { "rule": "...", "maas_status": "...", "expected": "..." }
  drift_resolved_at     timestamptz -- null until admin resolves or auto-heals

  UNIQUE (node_id)

node_maas_state is the authoritative runtime reconciliation record for the current MAAS binding of an active GPUasService node. node_onboardings and node_decommissions are workflow/history records and should remain authoritative for job history, not current runtime truth.
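
A sketch of how the reconciler might stamp this record when a rule fires, assuming database/sql; recordDrift is a hypothetical helper:

package reconcile

import (
    "context"
    "database/sql"
)

// recordDrift flags the node's runtime MAAS binding; resolution later
// clears drift_detected and stamps drift_resolved_at.
func recordDrift(ctx context.Context, db *sql.DB, nodeID, rule, maasStatus string) error {
    _, err := db.ExecContext(ctx, `
        UPDATE node_maas_state
           SET drift_detected     = true,
               drift_details      = jsonb_build_object('rule', $2, 'maas_status', $3),
               drift_resolved_at  = NULL,
               last_reconciled_at = now()
         WHERE node_id = $1`,
        nodeID, rule, maasStatus)
    return err
}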

6.6 Reconciliation admin APIs

GET  /api/v1/admin/reconciliation/status                -- summary: nodes OK, drifted, unreconciled
GET  /api/v1/admin/reconciliation/drift                 -- list all nodes with drift_detected=true
POST /api/v1/admin/reconciliation/drift/{node_id}/resolve   -- admin acknowledges + takes action
POST /api/v1/admin/reconciliation/run                   -- trigger immediate reconciliation for a site

7. Node State Machine (Extended)

7.1 GPUasService node states

                                  ┌────────────────────────────┐
                                  │                            │
  ┌─────────────┐   enroll OK  ┌──▼──────┐   probe fail   ┌────┴────┐
  │ bootstrap   ├─────────────►│  active ├───────────────►│ offline │
  │ _issued     │              │         │◄───────────────┤         │
  └──────┬──────┘              └──┬───┬──┘    probe OK    └────┬────┘
         │                        │   │                        │
         │ maas mode              │   │ admin/auto             │ admin
         ▼                        │   ▼                        │
  ┌─────────────┐  enroll OK      │ ┌─────────────┐          │
  │  enrolling  ├─────────────────┘ │ quarantined │◄────────────┘
  │   (maas)    │                   │             │  drift detected
  └─────────────┘                   └──────┬──────┘
                                           │ admin
                                    ┌──────▼──────┐
                                    │   retired   │
                                    └──────┬──────┘
                                           │ remove
                                    ┌──────▼──────┐
                                    │  removing   │
                                    └──────┬──────┘
                                           │ uninstall success
                                    ┌──────▼──────┐
                                    │   deleted   │
                                    └─────────────┘

7.2 MAAS machine states (reference)

New → Commissioning → Ready → Allocated → Deploying → Deployed
                         ▲                                │
                         │         Release                │
                         └────────────────────────────────┘

Side states: Failed, Broken (can appear after Commissioning or Deploying)

7.3 Combined state expectations for MAAS-onboarded nodes

| Lifecycle phase | GPUasService status | Expected MAAS status | Agent running? |
|---|---|---|---|
| Onboarding: pre-commission | enrolling | New / Commissioning | No |
| Onboarding: post-commission | enrolling | Ready | No |
| Onboarding: deploying | enrolling | Deploying | No |
| Onboarding: deployed, agent starting | enrolling | Deployed | Starting |
| Operational | active | Deployed | Yes |
| Agent down | offline | Deployed | No |
| Drifted (released outside workflow) | quarantined | Ready | No |
| Reimaging | enrolling (re-enter) | Ready / Deploying | No |
| Decommissioned (workflow in progress) | removing | Ready / Off / Deleted | No |
| Decommissioned (completed) | deleted from active inventory | Ready / Off / Deleted | No |

8. Data Flow: MAAS to GPUasService Field Mapping

8.0 Event integration

These workflows should emit typed domain events so downstream consumers do not need to poll operator APIs for lifecycle transitions.

Suggested events:

- node.onboarding.started
- node.onboarding.completed
- node.onboarding.failed
- node.onboarding.manual_intervention_required
- node.decommission.started
- node.decommission.completed
- node.decommission.failed
- node.decommission.manual_intervention_required

Exact event shapes and subjects should be specified in doc/api/asyncapi.draft.yaml before implementation.
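
As a placeholder until the AsyncAPI spec lands, one possible envelope is sketched below; every field name here is an assumption, not the agreed shape:

package events

import "time"

// NodeLifecycleEvent is an illustrative envelope for the subjects listed
// above; doc/api/asyncapi.draft.yaml remains the source of truth.
type NodeLifecycleEvent struct {
    Subject    string    `json:"subject"`     // e.g. "node.onboarding.completed"
    NodeID     string    `json:"node_id"`
    SiteID     string    `json:"site_id"`
    WorkflowID string    `json:"workflow_id"` // Temporal workflow correlation
    Reason     string    `json:"reason,omitempty"` // set on failed / manual_intervention_required
    OccurredAt time.Time `json:"occurred_at"`
}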

8.1 Where GPUasService node fields come from

| GPUasService node field | Source | When populated |
|---|---|---|
| id | Generated by GPUasService | CreateGPUaaSNode activity |
| host | MAAS deployed IP (from ip_addresses) | After deploy (updated by reconciler if IP changes) |
| hostname | Admin input (from CSV/API) | CreateGPUaaSNode activity |
| port | Default 22 | CreateGPUaaSNode activity |
| sku | Admin input | CreateGPUaaSNode activity |
| gpus_total | Admin input | CreateGPUaaSNode activity |
| region_code | Site's region_code (or admin override) | CreateGPUaaSNode activity |
| status | State machine | Transitions through workflow |
| ssh_username | "root" (default) | CreateGPUaaSNode activity |
| access_method | "node_agent" | CreateGPUaaSNode activity |

8.2 MAAS-specific data stored per node

| Field | Source | Purpose |
|---|---|---|
| maas_system_id | MAAS API (from CreateOrFind) | All subsequent MAAS API calls |
| site_id | Admin input | Links to maas_sites for config |
| ipmi_ip | Admin input | IPMI power management |
| pxe_mac | Observed/internal field | Discovery fallback only if later required |

Stored in node_maas_state or as additional columns on the nodes table (TBD — may prefer a separate table to keep the nodes table provider-agnostic). The runtime MAAS binding should be authoritative in node_maas_state; onboarding rows are historical workflow records.

9. Cloud-Init Template

The cloud-init user-data for MAAS-deployed nodes combines two ordered layers:

- infra-owned site bootstrap bundle
- GPUasService bootstrap bundle

Within those layers, the first-boot payload covers:

- deploy-user creation
- GPUasService node-agent bootstrap
- MAAS hardware-sync unit/timer/credentials wiring
- destructive /shareN non-root disk initialization
- guest OS RoCE routing service/script
- optional site-owned bootstrap logic delivered through the site bundle reference

Treat this as a versioned first-boot bundle composition, not an ad hoc inline script blob. The infra-owned site bootstrap bundle should be versioned and testable separately from the GPUasService-owned bootstrap content so failures can be classified cleanly as:

- site bootstrap failure
- GPUasService bootstrap failure

Longer term, the site bundle should be managed through a control-plane reference and uploaded/versioned like other deployment artifacts rather than only by a local file path.

The template is rendered by the CreateGPUaaSNodeAndRenderCloudInit activity with substitutions for:

- Deploy user/pass
- MAAS machine token (consumer_key, token_key, token_secret)
- MAAS base URL
- MAAS system_id
- GPUasService API URL
- GPUasService enrollment token
- GPUasService agent package URL
- GPUasService CA bundle
- GPUasService task signing public key
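
A minimal sketch of that substitution step, assuming Go's text/template; the field names mirror the list above, but the template keys and the two-bundle concatenation are illustrative rather than the actual renderer:

package cloudinit

import (
    "bytes"
    "text/template"
)

// RenderInput carries the substitution values listed above.
type RenderInput struct {
    DeployUser, DeployPass                 string
    MaasConsumerKey, MaasTokenKey          string
    MaasTokenSecret, MaasBaseURL, SystemID string
    APIURL, EnrollmentToken                string
    AgentPackageURL, CABundle, TaskPubKey  string
}

// Render composes the site bundle first, then the GPUasService bundle,
// preserving the two-layer ordering described above.
func Render(siteBundle, gpuaasBundle string, in RenderInput) (string, error) {
    tmpl, err := template.New("user-data").Parse(siteBundle + "\n" + gpuaasBundle)
    if err != nil {
        return "", err
    }
    var buf bytes.Buffer
    if err := tmpl.Execute(&buf, in); err != nil {
        return "", err
    }
    return buf.String(), nil
}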

The GPUasService bootstrap section uses the existing cloud-init rendering mode from the bootstrap-script endpoint (mode=cloud_init), injected after the infra-owned site bootstrap stage in the combined template.

Prototype intent carried forward from maas-automation-TG:

- deploy-user creation and password set
- destructive non-root disk initialization to /shareN
- MAAS hardware-sync credentials + unit/timer wiring
- RoCE routing systemd service/script in the guest OS

If any of these become optional later, that should be an explicit site policy decision, not silent template drift.

9.1 Hardware sync is a required invariant

For MAAS-managed nodes, hardware sync is not best-effort. A node is not considered fully onboarded until MAAS is receiving ongoing hardware sync from the deployed OS.

Required healthy state:

- enable_hw_sync=true
- MAAS machine token exists for the machine
- /etc/maas/maas-machine-creds.yml is present on the node
- maas-agent and the hardware-sync timer are installed/running
- last_sync is populated in MAAS
- next_sync is populated in MAAS
- is_sync_healthy=true (or null on MAAS builds that do not expose it)

What re-sync should do (sketched in code below):

1. Re-assert site hardware-sync interval in MAAS.
2. Re-assert enable_hw_sync=true on the machine.
3. Fetch or re-fetch the per-machine MAAS token.
4. Rewrite /etc/maas/maas-machine-creds.yml.
5. Restart/enable maas-agent and the hardware-sync timer.
6. Wait for MAAS to observe a fresh last_sync/next_sync cycle.

This must run:

- after initial deploy
- after every reimage
- whenever reconciliation detects sync drift on a deployed node
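
A hedged sketch of the six-step sequence as a single retryable unit; every helper it calls is a hypothetical stand-in for the MAAS API call or node-agent task named in the corresponding step:

package hwsync

import "context"

// Each helper below is illustrative; bodies are stubbed because the
// ordering, not the transport, is the point of this sketch.
func assertSiteSyncInterval(ctx context.Context) error                       { return nil } // step 1
func enableHwSync(ctx context.Context, systemID string) error                { return nil } // step 2
func fetchMachineToken(ctx context.Context, systemID string) (string, error) { return "", nil } // step 3
func rewriteCredsFile(ctx context.Context, systemID, token string) error     { return nil } // step 4
func restartSyncUnits(ctx context.Context, systemID string) error            { return nil } // step 5
func waitForFreshSyncCycle(ctx context.Context, systemID string) error       { return nil } // step 6

// ResyncHardwareSync runs the six re-sync steps in order and stops at the
// first failure so the caller can retry the whole sequence idempotently.
func ResyncHardwareSync(ctx context.Context, systemID string) error {
    if err := assertSiteSyncInterval(ctx); err != nil {
        return err
    }
    if err := enableHwSync(ctx, systemID); err != nil {
        return err
    }
    token, err := fetchMachineToken(ctx, systemID)
    if err != nil {
        return err
    }
    if err := rewriteCredsFile(ctx, systemID, token); err != nil {
        return err
    }
    if err := restartSyncUnits(ctx, systemID); err != nil {
        return err
    }
    return waitForFreshSyncCycle(ctx, systemID)
}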

10. Future: LXC/LXD Path

LXC/LXD onboarding would follow the same patterns but with a different site entity and a much simpler workflow:

lxd_sites
  id                    uuid
  name                  text
  region_code           text
  api_url               text          -- LXD REST API endpoint
  client_cert_vault     text          -- Vault path for client cert
  client_key_vault      text          -- Vault path for client key
  trust_token_vault     text          -- Vault path for trust token
  default_profile       text          -- LXD profile for GPU passthrough
  status                text

LXD onboarding workflow (seconds, not minutes):

LxdNodeOnboardWorkflow(input)
  ├─ Activity: LoadLxdSiteConfig(site_id)
  ├─ Activity: CreateLxdInstance(site_config, profile, gpu_config)
  ├─ Activity: WaitForRunning(instance_id)            -- seconds
  ├─ Activity: InjectBootstrap(instance_id, cloud_init)
  ├─ Activity: WaitForAgentEnrollment(node_id)        -- seconds
  └─ Done

No IPMI, no PXE, no commissioning, no BOSS disk, no RoCE. The GPUasService node model stays the same — onboarding_mode: "lxd", agent enrolls identically.

LXD decommission is also simpler — stop instance, delete instance. No erase modes needed.

11. Future: Weka Storage Integration

Weka integration would add per-allocation storage provisioning:

weka_sites
  id                    uuid
  name                  text
  region_code           text
  api_url               text
  auth_vault_path       text
  status                text

Allocation-time mount:

- When an allocation is provisioned on a node, a Weka filesystem mount is set up for the tenant.
- Agent task: storage.weka_mount — installs Weka client, mounts tenant filesystem.

Deallocation-time cleanup:

- Agent task: storage.weka_unmount — unmounts, removes client config, scrubs local cache.

This integrates with the soft reset and reimage decommission modes via the storage_cleanup parameter.

12. Open Questions

  1. Node table extension vs separate table: Should MAAS-specific fields (system_id, ipmi_ip, site_id) live on the nodes table directly, or in a separate node_maas_state table? Separate table keeps nodes provider-agnostic but adds joins.

  2. Enrollment token TTL validation for MAAS path: enrollment_token_ttl_seconds already exists in site policy. The remaining question is whether the configured default is sufficient for the slowest observed deploy path, or whether token refresh mid-workflow is still required.

  3. Reconciler frequency vs MAAS API load: Polling all machines every 5 minutes per site could be heavy for large sites (500+ nodes). Consider delta-based polling (MAAS events API) or MAAS webhooks (3.4+) where available.

  4. Per-node power override lifecycle: If an override becomes stale or a BMC credential rotation finishes, should the reconciler warn on unused overrides, or should this remain manual cleanup?

  5. Batch concurrency limits: How many nodes to commission/deploy in parallel per MAAS server? MAAS has internal concurrency limits (rack controller PXE capacity, image sync bandwidth). Need tunable per-site limits.

  6. GPU scrub verification: How to verify GPU memory is actually clear? Vendor tools may not provide a reliable "clean" signal. Need to define acceptable verification criteria per GPU vendor.

  7. Weka client lifecycle: Is the Weka client installed once at OS level (persistent across allocations) or per-allocation? Persistent is simpler but may leak state between tenants.

  8. MAAS webhook vs polling: MAAS 3.4+ supports webhooks for machine state changes. If available, this is more reactive than polling. Worth supporting both modes (webhook-primary, polling-fallback) per site.