# MAAS Node State Model v1

## 1. Purpose
Define the state model for MAAS-managed GPU nodes in a way that separates:

- coarse GPUasService node lifecycle state
- detailed workflow/job state
- observed MAAS machine state
- operator actions and recovery transitions
This document exists because the MAAS lifecycle is no longer simple enough to describe safely with prose alone.
Use this together with:
- MAAS_Bare_Metal_Lifecycle_v1.md
- State_Machines.md
## 2. Model Layers
There are three distinct state layers:
1. nodes.status - coarse lifecycle state of the GPUasService node record
   - stable and operator-facing
2. node_onboardings.status / node_decommissions.status - detailed workflow/job execution state
   - expresses retryability, manual intervention, compensation, and reconciliation
3. observed MAAS machine state
   - current upstream machine status from MAAS
   - not owned by GPUasService
Do not collapse these into one enum.
## 3. GPUasService Node Lifecycle

### 3.1 Canonical coarse states
| State | Meaning |
|---|---|
| bootstrap_issued | manual bootstrap bundle/token issued; node not yet enrolled |
| enrolling | MAAS/manual onboarding is in progress; node not ready for scheduling |
| active | node is healthy and schedulable |
| offline | node expected to exist but not currently polling/responding |
| quarantined | node exists but is blocked from scheduling due to drift, cleanup failure, or investigation |
| draining | retire/decommission drain is in progress; the node is unschedulable until the drain lifecycle finishes |
| retired | node intentionally removed from scheduling but not yet being deleted |
| removing | remove/uninstall workflow is in progress |
| deleted | terminal state outside active inventory |
### 3.2 Node state transitions
bootstrap_issued -> enrolling -> active
enrolling -> quarantined
active -> offline
offline -> active
active -> quarantined
offline -> quarantined
quarantined -> active
quarantined -> draining
active -> draining
offline -> draining
draining -> retired
draining -> offline (drain failed / operator recovery chooses non-retired fallback)
retired -> active (reactivate/reuse same identity)
retired -> removing
removing -> retired (uninstall failure / operator rollback)
removing -> deleted
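The transition list above can be captured as a simple adjacency table so that every state change is validated against this document. A minimal sketch (the names `ALLOWED_TRANSITIONS` and `can_transition` are illustrative, not the real implementation):

```python
# Coarse lifecycle transitions, transcribed from section 3.2.
ALLOWED_TRANSITIONS = {
    "bootstrap_issued": {"enrolling"},
    "enrolling": {"active", "quarantined"},
    "active": {"offline", "quarantined", "draining"},
    "offline": {"active", "quarantined", "draining"},
    "quarantined": {"active", "draining"},
    "draining": {"retired", "offline"},
    "retired": {"active", "removing"},
    "removing": {"retired", "deleted"},
    "deleted": set(),  # terminal: no transitions out
}

def can_transition(current: str, target: str) -> bool:
    """Return True if the coarse node state change is allowed."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

Keeping this table in one place makes it cheap to reject accidental transitions (for example, reactivating a deleted node) at the API boundary.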
### 3.3 Transition ownership
| Transition | Owner |
|---|---|
| bootstrap_issued -> enrolling | admin API / onboarding workflow start |
| enrolling -> active | onboarding workflow after successful agent enrollment and health checks |
| enrolling -> quarantined | onboarding workflow or operator action when onboarding reaches a blocked/unsafe state |
| active -> offline | heartbeat/reconciliation logic |
| offline -> active | heartbeat recovery / reconciliation |
| active/offline -> quarantined | reconciler or cleanup validation |
| quarantined -> active | explicit operator recovery or reconciliation after the node is proven healthy again |
| active/offline/quarantined -> draining | admin lifecycle action that starts a resumable drain operation |
| draining -> retired | drain lifecycle completes successfully |
| draining -> offline | drain lifecycle fails or operator recovery elects to stop short of retirement |
| retired -> active | explicit reactivate/reuse flow, only if the node was paused/retired and not fully removed |
| retired -> removing | remove workflow |
| removing -> retired | uninstall failure path |
| removing -> deleted | uninstall success / final delete |
Guardrails:
- draining and removing are coarse lifecycle states, not self-sufficient execution states.
- Entering either state must also create or resume an owning lifecycle operation/task.
- A node must never be left in draining or removing solely because a short-lived task lease expired.
- retired -> active is valid only for retained node identity reuse.
- Once a full decommission reaches completed removal (removing -> deleted), the old node identity must not be reactivated.
- Re-onboarding after full remove creates a new GPUasService node record.
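The owning-task guardrail can be enforced with a small check during reconciliation. A sketch, assuming the task names from section 8 (`node.drain`, `node.uninstall`); the function shape and `live_tasks` representation are assumptions:

```python
# Guard for the rule that entering draining/removing must create or resume
# an owning lifecycle task; a node must never sit in these states with no task.
OWNING_TASK_BY_STATE = {
    "draining": "node.drain",
    "removing": "node.uninstall",
}

def ensure_owning_task(node_state: str, live_tasks: set) -> str:
    """Return the task that must be created/resumed, or '' if covered."""
    required = OWNING_TASK_BY_STATE.get(node_state)
    if required is not None and required not in live_tasks:
        return required  # caller must create or resume this task
    return ""
```

A reconciler can run this over all in-progress nodes and requeue any missing owning task, which directly implements the "never stranded by an expired lease" rule.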
## 4. Onboarding Workflow State

### 4.1 Workflow/job states
These belong in node_onboardings.status, not nodes.status.
| State | Meaning |
|---|---|
| pending | accepted but not started |
| running | workflow executing |
| completed | workflow reached intended terminal success |
| failed_retryable | failed, but a safe recovery path exists |
| failed_manual_intervention | blocked pending operator action |
| cancelled | operator/system cancelled the workflow |
| compensating | rollback/cleanup is in progress |
| reconciled | previously failed/ambiguous workflow has been realigned with observed state |
### 4.2 Typical onboarding stages
| Stage |
|---|
| load_site_config |
| resolve_power_credentials |
| create_or_find_in_maas |
| commission_node |
| wait_for_ready |
| configure_storage |
| apply_roce_phase2 |
| ensure_pxe_interface_auto |
| render_cloud_init |
| deploy_via_maas |
| wait_for_deployed |
| classify_deploy_failure |
| recover_for_datasource_retry |
| ensure_hardware_sync_configured |
| wait_for_hardware_sync_healthy |
| wait_for_agent_enrollment |
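Because the operator actions in section 8 require resume from the last safe stage, an ordered stage list with a checkpoint lookup is a useful building block. A minimal sketch, assuming the checkpoint is the name of the last completed stage stored on the workflow record (the function name is illustrative; `classify_deploy_failure` and `recover_for_datasource_retry` are failure-path stages but are kept in catalog order here):

```python
# Stage catalog from section 4.2, in order.
ONBOARDING_STAGES = [
    "load_site_config",
    "resolve_power_credentials",
    "create_or_find_in_maas",
    "commission_node",
    "wait_for_ready",
    "configure_storage",
    "apply_roce_phase2",
    "ensure_pxe_interface_auto",
    "render_cloud_init",
    "deploy_via_maas",
    "wait_for_deployed",
    "classify_deploy_failure",
    "recover_for_datasource_retry",
    "ensure_hardware_sync_configured",
    "wait_for_hardware_sync_healthy",
    "wait_for_agent_enrollment",
]

def remaining_stages(last_completed):
    """Stages still to run when resuming from the last safe checkpoint."""
    if last_completed is None:
        return list(ONBOARDING_STAGES)
    idx = ONBOARDING_STAGES.index(last_completed)
    return ONBOARDING_STAGES[idx + 1:]
```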
### 4.3 Mapping: node state vs onboarding state
| Node state | Onboarding workflow state | Meaning |
|---|---|---|
| enrolling | pending / running | normal onboarding in progress |
| enrolling | failed_retryable | node still not ready; safe rerun/resume available |
| enrolling or quarantined | failed_manual_intervention | workflow blocked pending operator decision |
| active | completed | normal success |
| active | reconciled | workflow succeeded after adopt/reconcile action |
Rule:
- do not create micro-node states such as commissioning, deploying, waiting_for_hw_sync
- those are workflow stages only
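The mapping table in 4.3 can also double as a consistency check during reconciliation: any (node state, workflow state) pair outside the table signals drift. A sketch under that assumption (the set and function names are illustrative):

```python
# Allowed (node state, onboarding workflow state) pairs from section 4.3.
VALID_ONBOARDING_PAIRS = {
    ("enrolling", "pending"),
    ("enrolling", "running"),
    ("enrolling", "failed_retryable"),
    ("enrolling", "failed_manual_intervention"),
    ("quarantined", "failed_manual_intervention"),
    ("active", "completed"),
    ("active", "reconciled"),
}

def onboarding_states_consistent(node_state, workflow_state):
    """True if the coarse node state agrees with the onboarding state."""
    return (node_state, workflow_state) in VALID_ONBOARDING_PAIRS
```

Flagging inconsistent pairs (for example, an active node with a running onboarding workflow) gives the reconciler a concrete trigger for adopt/repair actions without inventing new node states.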
## 5. Decommission Workflow State

### 5.1 Workflow/job states
node_decommissions.status should use the same job-state model as onboarding:
- pending
- running
- completed
- failed_retryable
- failed_manual_intervention
- cancelled
- compensating
- reconciled
### 5.2 Typical decommission stages
This is the superset stage catalog across soft_reset, reimage, full_decommission, and storage_cleanup. Not every mode uses every stage.
| Stage |
|---|
| disable_node |
| force_release_allocations |
| drain_node |
| cleanup_storage |
| scrub_gpu |
| validate_clean_node |
| load_site_config |
| release_maas_node |
| power_off_maas_node |
| retire_gpuaas_node |
| remove_gpuaas_node_record |
| cleanup_secrets |
| remove_maas_record |
### 5.3 Mapping: node state vs decommission state
| Node state | Decommission workflow state | Meaning |
|---|---|---|
| active / offline / quarantined / draining / retired | running | pre-remove cleanup or reimage in progress |
| removing | running | uninstall/remove path in progress |
| retired | failed_retryable | remove/uninstall failed but identity preserved |
| retired / quarantined | failed_manual_intervention | decommission blocked pending operator action |
| deleted | completed | full remove succeeded |
## 6. MAAS State Mapping

### 6.1 Relevant MAAS states
| MAAS state | Interpretation |
|---|---|
| New | discovered but not accepted/commissioned |
| Commissioning | MAAS is commissioning the machine |
| Ready | editable, not deployed |
| Allocated | MAAS allocated but not yet fully deployed |
| Deploying | OS deployment in progress |
| Deployed | OS deployed |
| Failed / Broken / Failed deployment | upstream failure state |
| absent | machine missing/deleted in MAAS |
### 6.2 Expected MAAS state by GPUasService lifecycle
| GPUasService node state | Expected MAAS state |
|---|---|
| bootstrap_issued | none required |
| enrolling | New, Commissioning, Ready, Allocated, Deploying, or Deployed depending on stage |
| active | Deployed |
| offline | usually Deployed |
| quarantined | often Ready, Failed, Broken, Deployed, or absent |
| draining | usually Deployed or Ready; decommission workflow owns the transition to retired |
| retired | Deployed, Ready, or powered off depending on mode |
| removing | Ready, powered off, or absent |
| deleted | absent or retained in MAAS by policy |
## 7. Manual Intervention Triggers
These should usually move workflow state to failed_manual_intervention:
- conflicting discovery candidates
- BOSS disk not found
- PXE interface not fixable automatically
- machine deleted from MAAS during workflow
- repeated datasource/cloud-init failure after bounded retry
- repeated hardware-sync or SSH seed failure
- workflow state and observed MAAS/node state disagree in a way automation cannot safely adopt
## 8. Operator Actions
These actions operate on workflow/job state, not directly on coarse node state:
| Action | Effect |
|---|---|
| retry_stage | rerun the current failed stage |
| resume | continue from the last safe stage |
| rerun | re-enter the workflow from the top with status-aware adoption |
| restart_clean | explicitly compensate/reset, then start again |
| cancel | stop the workflow and compensate where possible |
| adopt_observed_state | accept externally advanced state and continue/finish |
| mark_manual_intervention_required | freeze the workflow until human resolution |
For node inventory lifecycle, resume must be available whenever the node is in an in-progress coarse state whose owning task may have been lost or stalled:
- draining -> resume/reissue/requeue node.drain
- removing -> resume/reissue/requeue node.uninstall
These operator actions are idempotent:

- if a fresh queued task already exists, the action returns success without duplication
- if a live dispatched task still holds the lease, the action returns success without duplication
- if the task expired or the dispatch lease is stale, the action requeues or recreates the lifecycle task
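Those three idempotency rules reduce to a small decision function. A sketch, assuming a task record with a `state` of `"queued"` or `"dispatched"` and a `lease_expires_at` timestamp (the shape and names are assumptions):

```python
import time

def resume_action(task, now=None):
    """Decide what an idempotent resume should do for a lifecycle task.

    `task` is an assumed shape: None if the task was lost, otherwise
    {"state": "queued" | "dispatched", "lease_expires_at": float}.
    Returns "noop" (success without duplication) or "requeue".
    """
    now = time.time() if now is None else now
    if task is None:
        return "requeue"  # task lost entirely: recreate it
    if task["state"] == "queued":
        return "noop"     # a fresh queued task already exists
    if task["state"] == "dispatched" and task["lease_expires_at"] > now:
        return "noop"     # live dispatch lease: do not duplicate
    return "requeue"      # expired task or stale lease: requeue
```

Because every branch is safe to repeat, operators can hammer resume without creating duplicate drain/uninstall work.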
## 9. Recommended UI/Operator Presentation

### 9.1 Inventory view
Show coarse node state only:
- active
- offline
- quarantined
- retired
- removing
### 9.2 Lifecycle detail view
Show:

- workflow state
- current stage
- attempt count
- failure class
- last observed MAAS state
- recommended next action
This prevents inventory screens from becoming overloaded with workflow microstates.
## 10. Rules
- nodes.status must remain coarse and stable.
- Workflow stages and retryability belong in onboarding/decommission read models.
- MAAS state is observed upstream truth, not GPUasService-owned state.
- Reconciliation may change workflow/job state without directly inventing new node lifecycle states.
- Any new MAAS-specific lifecycle transition should be added here before implementation if it changes operator-facing behavior.