
MAAS Node State Model v1

1. Purpose

Define the state model for MAAS-managed GPU nodes in a way that separates:

- coarse GPUasService node lifecycle state,
- detailed workflow/job state,
- observed MAAS machine state,
- operator actions and recovery transitions.

This document exists because the MAAS lifecycle is no longer simple enough to describe safely with prose alone.

Use this together with:

- MAAS_Bare_Metal_Lifecycle_v1.md
- State_Machines.md

2. Model Layers

There are three distinct state layers:

1. nodes.status
   - coarse lifecycle state of the GPUasService node record
   - stable and operator-facing
2. node_onboardings.status / node_decommissions.status
   - detailed workflow/job execution state
   - expresses retryability, manual intervention, compensation, reconciliation
3. observed MAAS machine state
   - current upstream machine status from MAAS
   - not owned by GPUasService

Do not collapse these into one enum.

3. GPUasService Node Lifecycle

3.1 Canonical coarse states

| State | Meaning |
| --- | --- |
| bootstrap_issued | manual bootstrap bundle/token issued; node not yet enrolled |
| enrolling | MAAS/manual onboarding is in progress; node not ready for scheduling |
| active | node is healthy and schedulable |
| offline | node expected to exist but not currently polling/responding |
| quarantined | node exists but is blocked from scheduling due to drift, cleanup failure, or investigation |
| draining | retire/decommission drain is in progress; the node is unschedulable until the drain lifecycle finishes |
| retired | node intentionally removed from scheduling but not yet being deleted |
| removing | remove/uninstall workflow is in progress |
| deleted | terminal state outside active inventory |

3.2 Node state transitions

```
bootstrap_issued -> enrolling -> active
enrolling -> quarantined
active -> offline
offline -> active
active -> quarantined
offline -> quarantined
quarantined -> active
quarantined -> draining
active -> draining
offline -> draining
draining -> retired
draining -> offline          (drain failed / operator recovery chooses non-retired fallback)
retired -> active            (reactivate/reuse same identity)
retired -> removing
removing -> retired          (uninstall failure / operator rollback)
removing -> deleted
```
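The diagram above can be encoded as data and enforced in one place, so no code path can invent an undocumented transition. A minimal sketch (the names `ALLOWED_TRANSITIONS` and `assert_valid_transition` are illustrative, not the service's actual API):

```python
# Allowed coarse lifecycle transitions for nodes.status, mirroring the
# diagram above. This is a sketch, not the service's real module.
ALLOWED_TRANSITIONS = {
    "bootstrap_issued": {"enrolling"},
    "enrolling": {"active", "quarantined"},
    "active": {"offline", "quarantined", "draining"},
    "offline": {"active", "quarantined", "draining"},
    "quarantined": {"active", "draining"},
    "draining": {"retired", "offline"},
    "retired": {"active", "removing"},
    "removing": {"retired", "deleted"},
    "deleted": set(),  # terminal: no outgoing transitions
}

def assert_valid_transition(current: str, target: str) -> None:
    """Reject any nodes.status change not present in the canonical diagram."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal node transition: {current} -> {target}")
```

Centralizing the table also makes it trivial to diff this document against the implementation when new transitions are proposed.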

3.3 Transition ownership

| Transition | Owner |
| --- | --- |
| bootstrap_issued -> enrolling | admin API / onboarding workflow start |
| enrolling -> active | onboarding workflow after successful agent enrollment and health checks |
| enrolling -> quarantined | onboarding workflow or operator action when onboarding reaches a blocked/unsafe state |
| active -> offline | heartbeat/reconciliation logic |
| offline -> active | heartbeat recovery / reconciliation |
| active/offline -> quarantined | reconciler or cleanup validation |
| quarantined -> active | explicit operator recovery or reconciliation after the node is proven healthy again |
| active/offline/quarantined -> draining | admin lifecycle action that starts a resumable drain operation |
| draining -> retired | drain lifecycle completes successfully |
| draining -> offline | drain lifecycle fails or operator recovery elects to stop short of retirement |
| retired -> active | explicit reactivate/reuse flow, only if the node was paused/retired and not fully removed |
| retired -> removing | remove workflow |
| removing -> retired | uninstall failure path |
| removing -> deleted | uninstall success / final delete |

Guardrails:

- draining and removing are coarse lifecycle states, not self-sufficient execution state.
- Entering either state must also create or resume an owning lifecycle operation/task.
- A node must never be left in draining or removing solely because a short-lived task lease expired.
- retired -> active is valid only for retained node identity reuse.
- Once a full decommission reaches completed removal (removing -> deleted), the old node identity must not be reactivated.
- Re-onboarding after full remove creates a new GPUasService node record.
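The second guardrail is easiest to enforce if the state change and the owning operation are written together. A sketch under assumed names (`Store`, `start_drain`, and the field layout are hypothetical, not the real schema):

```python
from dataclasses import dataclass, field

# Hypothetical persistence view: node coarse states plus their owning
# drain operations, keyed by node id.
@dataclass
class Store:
    node_status: dict = field(default_factory=dict)       # node_id -> coarse state
    drain_operations: dict = field(default_factory=dict)  # node_id -> operation state

def start_drain(store: Store, node_id: str) -> None:
    """Move a node to draining AND ensure a resumable drain operation exists."""
    if store.node_status.get(node_id) not in {"active", "offline", "quarantined"}:
        raise ValueError("drain is only valid from active/offline/quarantined")
    store.node_status[node_id] = "draining"
    # Resume an existing operation if present, otherwise create one, so the
    # node is never left in `draining` without an owning operation.
    store.drain_operations.setdefault(node_id, "running")
```

In a real implementation both writes would sit in one transaction; the point of the sketch is only that entering the coarse state and owning the operation are a single action.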

4. Onboarding Workflow State

4.1 Workflow/job states

These belong in node_onboardings.status, not nodes.status.

| State | Meaning |
| --- | --- |
| pending | accepted but not started |
| running | workflow executing |
| completed | workflow reached intended terminal success |
| failed_retryable | failed, but safe recovery path exists |
| failed_manual_intervention | blocked pending operator action |
| cancelled | operator/system cancelled workflow |
| compensating | rollback/cleanup is in progress |
| reconciled | previously failed/ambiguous workflow has been realigned with observed state |
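Automation mostly cares about which group a job state falls into, not the individual value. One possible grouping of the eight states above (a sketch; the set names are illustrative):

```python
# Grouping of node_onboardings.status / node_decommissions.status values
# by how automation may act on them. An illustrative sketch, not the
# service's actual read model.
IN_PROGRESS = {"pending", "running", "compensating"}
TERMINAL = {"completed", "cancelled", "reconciled"}
AUTO_RETRYABLE = {"failed_retryable"}
NEEDS_OPERATOR = {"failed_manual_intervention"}

def may_auto_retry(status: str) -> bool:
    """Only failed_retryable is safe for automatic rerun/resume."""
    return status in AUTO_RETRYABLE
```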

4.2 Typical onboarding stages

| Stage |
| --- |
| load_site_config |
| resolve_power_credentials |
| create_or_find_in_maas |
| commission_node |
| wait_for_ready |
| configure_storage |
| apply_roce_phase2 |
| ensure_pxe_interface_auto |
| render_cloud_init |
| deploy_via_maas |
| wait_for_deployed |
| classify_deploy_failure |
| recover_for_datasource_retry |
| ensure_hardware_sync_configured |
| wait_for_hardware_sync_healthy |
| wait_for_agent_enrollment |

4.3 Mapping: node state vs onboarding state

| Node state | Onboarding workflow state | Meaning |
| --- | --- | --- |
| enrolling | pending / running | normal onboarding in progress |
| enrolling | failed_retryable | node still not ready; safe rerun/resume available |
| enrolling or quarantined | failed_manual_intervention | workflow blocked pending operator decision |
| active | completed | normal success |
| active | reconciled | workflow succeeded after adopt/reconcile action |

Rule:

- do not create micro-node states such as commissioning, deploying, waiting_for_hw_sync
- those are workflow stages only
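The mapping table above can double as a reconciler check: any node/onboarding state pair outside the documented matrix is worth flagging. A sketch (`EXPECTED_ONBOARDING_PAIRS` is an assumed name, and the encoding is one of several reasonable shapes):

```python
# The section 4.3 mapping table as data. Pairs outside this set indicate
# drift between nodes.status and node_onboardings.status worth flagging.
# An illustrative sketch, not the real read model.
EXPECTED_ONBOARDING_PAIRS = {
    ("enrolling", "pending"),
    ("enrolling", "running"),
    ("enrolling", "failed_retryable"),
    ("enrolling", "failed_manual_intervention"),
    ("quarantined", "failed_manual_intervention"),
    ("active", "completed"),
    ("active", "reconciled"),
}

def onboarding_pair_expected(node_state: str, workflow_state: str) -> bool:
    return (node_state, workflow_state) in EXPECTED_ONBOARDING_PAIRS
```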

5. Decommission Workflow State

5.1 Workflow/job states

node_decommissions.status should use the same job-state model as onboarding:

- pending
- running
- completed
- failed_retryable
- failed_manual_intervention
- cancelled
- compensating
- reconciled

5.2 Typical decommission stages

This is the superset stage catalog across soft_reset, reimage, full_decommission, and storage_cleanup. Not every mode uses every stage.

| Stage |
| --- |
| disable_node |
| force_release_allocations |
| drain_node |
| cleanup_storage |
| scrub_gpu |
| validate_clean_node |
| load_site_config |
| release_maas_node |
| power_off_maas_node |
| retire_gpuaas_node |
| remove_gpuaas_node_record |
| cleanup_secrets |
| remove_maas_record |

5.3 Mapping: node state vs decommission state

| Node state | Decommission workflow state | Meaning |
| --- | --- | --- |
| active / offline / quarantined / draining / retired | running | pre-remove cleanup or reimage in progress |
| removing | running | uninstall/remove path in progress |
| retired | failed_retryable | remove/uninstall failed but identity preserved |
| retired / quarantined | failed_manual_intervention | decommission blocked pending operator action |
| deleted | completed | full remove succeeded |

6. MAAS State Mapping

6.1 Relevant MAAS states

| MAAS state | Interpretation |
| --- | --- |
| New | discovered but not accepted/commissioned |
| Commissioning | MAAS is commissioning the machine |
| Ready | editable, not deployed |
| Allocated | MAAS allocated but not yet fully deployed |
| Deploying | OS deployment in progress |
| Deployed | OS deployed |
| Failed / Broken / Failed deployment | upstream failure state |
| absent | machine missing/deleted in MAAS |

6.2 Expected MAAS state by GPUasService lifecycle

| GPUasService node state | Expected MAAS state |
| --- | --- |
| bootstrap_issued | none required |
| enrolling | New, Commissioning, Ready, Allocated, Deploying, or Deployed depending on stage |
| active | Deployed |
| offline | usually Deployed |
| quarantined | often Ready, Failed, Broken, Deployed, or absent |
| draining | usually Deployed or Ready; decommission workflow owns the transition to retired |
| retired | Deployed, Ready, or powered off depending on mode |
| removing | Ready, powered off, or absent |
| deleted | absent or retained in MAAS by policy |
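Only the strict rows of this table are mechanically checkable; the "usually"/"often" rows should not trip automation. A drift-check sketch under that assumption (`STRICT_EXPECTED_MAAS` and `maas_state_drifted` are illustrative names):

```python
# Drift check against the table above. Only strict expectations are
# encoded; soft rows ("usually", "often") are never flagged in this
# sketch. `None` stands for a machine that is absent in MAAS.
STRICT_EXPECTED_MAAS = {
    "active": {"Deployed"},
}

def maas_state_drifted(node_state, observed_maas_state):
    """True when the observed MAAS state contradicts a strict expectation."""
    expected = STRICT_EXPECTED_MAAS.get(node_state)
    if expected is None:
        return False  # no strict expectation for this lifecycle state
    return observed_maas_state not in expected
```

A drift hit here is exactly the kind of disagreement section 7 routes to failed_manual_intervention when automation cannot safely adopt the observed state.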

7. Manual Intervention Triggers

These should usually move workflow state to failed_manual_intervention:

- conflicting discovery candidates
- BOSS disk not found
- PXE interface not fixable automatically
- machine deleted from MAAS during workflow
- repeated datasource/cloud-init failure after bounded retry
- repeated hardware-sync or SSH seed failure
- workflow state and observed MAAS/node state disagree in a way automation cannot safely adopt

8. Operator Actions

These actions operate on workflow/job state, not directly on coarse node state:

| Action | Effect |
| --- | --- |
| retry_stage | rerun current failed stage |
| resume | continue from last safe stage |
| rerun | re-enter workflow from top with status-aware adoption |
| restart_clean | explicitly compensate/reset then start again |
| cancel | stop workflow and compensate where possible |
| adopt_observed_state | accept externally advanced state and continue/finish |
| mark_manual_intervention_required | freeze workflow until human resolution |

For node inventory lifecycle, resume must be available whenever the node is in an in-progress coarse state whose owning task may have been lost or stalled:

- draining -> resume/reissue/requeue node.drain
- removing -> resume/reissue/requeue node.uninstall

These operator actions are idempotent:

- if a fresh queued task already exists, the action returns success without duplication
- if a live dispatched task still holds the lease, the action returns success without duplication
- if the task expired or the dispatch lease is stale, the action requeues or recreates the lifecycle task
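The three idempotency rules above reduce to a small decision function. A sketch (`Task` and its fields are illustrative, not the real task model):

```python
from dataclasses import dataclass

# Hypothetical lifecycle task record with a dispatch lease deadline.
@dataclass
class Task:
    state: str                 # "queued" or "dispatched"
    lease_deadline: float = 0.0

def resume_lifecycle_task(existing, now: float) -> Task:
    """Idempotent resume: return the existing task when it is still viable,
    otherwise requeue a fresh one. Never duplicates live work."""
    if existing is not None and existing.state == "queued":
        return existing  # fresh queued task already exists: no duplication
    if existing is not None and existing.state == "dispatched" and now < existing.lease_deadline:
        return existing  # live dispatch lease still held: no duplication
    # missing task, expired task, or stale lease: requeue/recreate
    return Task(state="queued")
```

Because every branch returns success, operators can safely mash resume on a stuck draining or removing node without creating duplicate drain/uninstall work.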

9. Views

9.1 Inventory view

Show coarse node state only:

- active
- offline
- quarantined
- retired
- removing

9.2 Lifecycle detail view

Show:

- workflow state
- current stage
- attempt count
- failure class
- last observed MAAS state
- recommended next action

This prevents inventory screens from becoming overloaded with workflow microstates.

10. Rules

  1. nodes.status must remain coarse and stable.
  2. Workflow stages and retryability belong in onboarding/decommission read models.
  3. MAAS state is observed upstream truth, not GPUasService-owned state.
  4. Reconciliation may change workflow/job state without directly inventing new node lifecycle states.
  5. Any new MAAS-specific lifecycle transition should be added here before implementation if it changes operator-facing behavior.