# MAAS Node State Model v1

## 1. Purpose
Define the state model for MAAS-managed GPU nodes in a way that separates:

- coarse GPUasService node lifecycle state
- detailed workflow/job state
- observed MAAS machine state
- operator actions and recovery transitions
This document exists because the MAAS lifecycle is no longer simple enough to describe safely with prose alone.
Use this together with:
- MAAS_Bare_Metal_Lifecycle_v1.md
- State_Machines.md
## 2. Model Layers
There are three distinct state layers:
1. nodes.status - coarse lifecycle state of the GPUasService node record
   - stable and operator-facing
2. node_onboardings.status / node_decommissions.status - detailed workflow/job execution state
   - expresses retryability, manual intervention, compensation, and reconciliation
3. observed MAAS machine state
   - current upstream machine status from MAAS
   - not owned by GPUasService
Do not collapse these into one enum.
## 3. GPUasService Node Lifecycle

### 3.1 Canonical coarse states
| State | Meaning |
|---|---|
| bootstrap_issued | manual bootstrap bundle/token issued; node not yet enrolled |
| enrolling | MAAS/manual onboarding is in progress; node not ready for scheduling |
| active | node is healthy and schedulable |
| offline | node expected to exist but not currently polling/responding |
| quarantined | node exists but is blocked from scheduling due to drift, cleanup failure, or investigation |
| draining | retire/decommission drain is in progress; the node is unschedulable until the drain lifecycle finishes |
| retired | node intentionally removed from scheduling but not yet being deleted |
| removing | remove/uninstall workflow is in progress |
| deleted | terminal state outside active inventory |
### 3.2 Node state transitions
bootstrap_issued -> enrolling -> active
enrolling -> quarantined
active -> offline
offline -> active
active -> quarantined
offline -> quarantined
quarantined -> active
quarantined -> draining
active -> draining
offline -> draining
draining -> retired
draining -> offline (drain failed / operator recovery chooses non-retired fallback)
retired -> active (reactivate/reuse same identity)
retired -> removing
removing -> retired (uninstall failure / operator rollback)
removing -> deleted
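The transition list above can be captured as a simple adjacency table so that every state change is validated against this document. A minimal sketch (the names `ALLOWED_TRANSITIONS` and `can_transition` are illustrative, not the real implementation):

```python
# Coarse lifecycle transitions, transcribed from section 3.2.
ALLOWED_TRANSITIONS = {
    "bootstrap_issued": {"enrolling"},
    "enrolling": {"active", "quarantined"},
    "active": {"offline", "quarantined", "draining"},
    "offline": {"active", "quarantined", "draining"},
    "quarantined": {"active", "draining"},
    "draining": {"retired", "offline"},
    "retired": {"active", "removing"},
    "removing": {"retired", "deleted"},
    "deleted": set(),  # terminal: no transitions out
}

def can_transition(current: str, target: str) -> bool:
    """Return True if the coarse node state change is allowed."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

Keeping this table in one place makes it cheap to reject accidental transitions (for example, reactivating a deleted node) at the API boundary.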
### 3.3 Transition ownership
| Transition | Owner |
|---|---|
| bootstrap_issued -> enrolling | admin API / onboarding workflow start |
| enrolling -> active | onboarding workflow after successful agent enrollment and health checks |
| enrolling -> quarantined | onboarding workflow or operator action when onboarding reaches a blocked/unsafe state |
| active -> offline | heartbeat/reconciliation logic |
| offline -> active | heartbeat recovery / reconciliation |
| active/offline -> quarantined | reconciler or cleanup validation |
| quarantined -> active | explicit operator recovery or reconciliation after the node is proven healthy again |
| active/offline/quarantined -> draining | admin lifecycle action that starts a resumable drain operation |
| draining -> retired | drain lifecycle completes successfully |
| draining -> offline | drain lifecycle fails or operator recovery elects to stop short of retirement |
| retired -> active | explicit reactivate/reuse flow, only if the node was paused/retired and not fully removed |
| retired -> removing | remove workflow |
| removing -> retired | uninstall failure path |
| removing -> deleted | uninstall success / final delete |
Guardrails:
- draining and removing are coarse lifecycle states, not self-sufficient execution states.
- Entering either state must also create or resume an owning lifecycle operation/task.
- A node must never be left in draining or removing solely because a short-lived task lease expired.
- retired -> active is valid only for retained node identity reuse.
- Once a full decommission reaches completed removal (removing -> deleted), the old node identity must not be reactivated.
- Re-onboarding after full remove creates a new GPUasService node record.
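The owning-task guardrail can be enforced with a small check during reconciliation. A sketch, assuming the task names from section 8 (`node.drain`, `node.uninstall`); the function shape and `live_tasks` representation are assumptions:

```python
# Guard for the rule that entering draining/removing must create or resume
# an owning lifecycle task; a node must never sit in these states with no task.
OWNING_TASK_BY_STATE = {
    "draining": "node.drain",
    "removing": "node.uninstall",
}

def ensure_owning_task(node_state: str, live_tasks: set) -> str:
    """Return the task that must be created/resumed, or '' if covered."""
    required = OWNING_TASK_BY_STATE.get(node_state)
    if required is not None and required not in live_tasks:
        return required  # caller must create or resume this task
    return ""
```

A reconciler can run this over all in-progress nodes and requeue any missing owning task, which directly implements the "never stranded by an expired lease" rule.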
## 4. Onboarding Workflow State

### 4.1 Workflow/job states
These belong in node_onboardings.status, not nodes.status.
| State | Meaning |
|---|---|
| pending | accepted but not started |
| running | workflow executing |
| completed | workflow reached intended terminal success |
| failed_retryable | failed, but a safe recovery path exists |
| failed_manual_intervention | blocked pending operator action |
| cancelled | operator/system cancelled the workflow |
| compensating | rollback/cleanup is in progress |
| reconciled | previously failed/ambiguous workflow has been realigned with observed state |
### 4.2 Typical onboarding stages
| Stage |
|---|
| load_site_config |
| resolve_power_credentials |
| create_or_find_in_maas |
| commission_node |
| wait_for_ready |
| configure_storage |
| apply_roce_phase2 |
| ensure_pxe_interface_auto |
| render_cloud_init |
| deploy_via_maas |
| wait_for_deployed |
| classify_deploy_failure |
| recover_for_datasource_retry |
| ensure_hardware_sync_configured |
| wait_for_hardware_sync_healthy |
| wait_for_agent_enrollment |
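Because the operator actions in section 8 require resume from the last safe stage, an ordered stage list with a checkpoint lookup is a useful building block. A minimal sketch, assuming the checkpoint is the name of the last completed stage stored on the workflow record (the function name is illustrative; `classify_deploy_failure` and `recover_for_datasource_retry` are failure-path stages but are kept in catalog order here):

```python
# Stage catalog from section 4.2, in order.
ONBOARDING_STAGES = [
    "load_site_config",
    "resolve_power_credentials",
    "create_or_find_in_maas",
    "commission_node",
    "wait_for_ready",
    "configure_storage",
    "apply_roce_phase2",
    "ensure_pxe_interface_auto",
    "render_cloud_init",
    "deploy_via_maas",
    "wait_for_deployed",
    "classify_deploy_failure",
    "recover_for_datasource_retry",
    "ensure_hardware_sync_configured",
    "wait_for_hardware_sync_healthy",
    "wait_for_agent_enrollment",
]

def remaining_stages(last_completed):
    """Stages still to run when resuming from the last safe checkpoint."""
    if last_completed is None:
        return list(ONBOARDING_STAGES)
    idx = ONBOARDING_STAGES.index(last_completed)
    return ONBOARDING_STAGES[idx + 1:]
```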
### 4.3 Mapping: node state vs onboarding state
| Node state | Onboarding workflow state | Meaning |
|---|---|---|
| enrolling | pending / running | normal onboarding in progress |
| enrolling | failed_retryable | node still not ready; safe rerun/resume available |
| enrolling or quarantined | failed_manual_intervention | workflow blocked pending operator decision |
| active | completed | normal success |
| active | reconciled | workflow succeeded after adopt/reconcile action |
Rule:
- do not create micro-node states such as commissioning, deploying, waiting_for_hw_sync
- those are workflow stages only
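The mapping table in 4.3 can also double as a consistency check during reconciliation: any (node state, workflow state) pair outside the table signals drift. A sketch under that assumption (the set and function names are illustrative):

```python
# Allowed (node state, onboarding workflow state) pairs from section 4.3.
VALID_ONBOARDING_PAIRS = {
    ("enrolling", "pending"),
    ("enrolling", "running"),
    ("enrolling", "failed_retryable"),
    ("enrolling", "failed_manual_intervention"),
    ("quarantined", "failed_manual_intervention"),
    ("active", "completed"),
    ("active", "reconciled"),
}

def onboarding_states_consistent(node_state, workflow_state):
    """True if the coarse node state agrees with the onboarding state."""
    return (node_state, workflow_state) in VALID_ONBOARDING_PAIRS
```

Flagging inconsistent pairs (for example, an active node with a running onboarding workflow) gives the reconciler a concrete trigger for adopt/repair actions without inventing new node states.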
## 5. Decommission Workflow State

### 5.1 Workflow/job states
node_decommissions.status should use the same job-state model as onboarding:
- pending
- running
- completed
- failed_retryable
- failed_manual_intervention
- cancelled
- compensating
- reconciled
### 5.2 Typical decommission stages
This is the superset stage catalog across soft_reset, reimage, full_decommission, and storage_cleanup. Not every mode uses every stage.
| Stage |
|---|
| disable_node |
| force_release_allocations |
| drain_node |
| cleanup_storage |
| scrub_gpu |
| validate_clean_node |
| load_site_config |
| release_maas_node |
| power_off_maas_node |
| retire_gpuaas_node |
| remove_gpuaas_node_record |
| cleanup_secrets |
| remove_maas_record |
### 5.3 Mapping: node state vs decommission state
| Node state | Decommission workflow state | Meaning |
|---|---|---|
| active / offline / quarantined / draining / retired | running | pre-remove cleanup or reimage in progress |
| removing | running | uninstall/remove path in progress |
| retired | failed_retryable | remove/uninstall failed but identity preserved |
| retired / quarantined | failed_manual_intervention | decommission blocked pending operator action |
| deleted | completed | full remove succeeded |
## 6. MAAS State Mapping

### 6.1 Relevant MAAS states
| MAAS state | Interpretation |
|---|---|
| New | discovered but not accepted/commissioned |
| Commissioning | MAAS is commissioning the machine |
| Ready | editable, not deployed |
| Allocated | MAAS allocated but not yet fully deployed |
| Deploying | OS deployment in progress |
| Deployed | OS deployed |
| Failed / Broken / Failed deployment | upstream failure state |
| absent | machine missing/deleted in MAAS |
### 6.2 Expected MAAS state by GPUasService lifecycle
| GPUasService node state | Expected MAAS state |
|---|---|
| bootstrap_issued | none required |
| enrolling | New, Commissioning, Ready, Allocated, Deploying, or Deployed depending on stage |
| active | Deployed |
| offline | usually Deployed |
| quarantined | often Ready, Failed, Broken, Deployed, or absent |
| draining | usually Deployed or Ready; decommission workflow owns the transition to retired |
| retired | Deployed, Ready, or powered off depending on mode |
| removing | Ready, powered off, or absent |
| deleted | absent or retained in MAAS by policy |
## 7. Manual Intervention Triggers
These should usually move workflow state to failed_manual_intervention:
- conflicting discovery candidates
- BOSS disk not found
- PXE interface not fixable automatically
- machine deleted from MAAS during workflow
- repeated datasource/cloud-init failure after bounded retry
- repeated hardware-sync or SSH seed failure
- workflow state and observed MAAS/node state disagree in a way automation cannot safely adopt
## 8. Operator Actions
These actions operate on workflow/job state, not directly on coarse node state:
| Action | Effect |
|---|---|
| retry_stage | rerun the current failed stage |
| resume | continue from the last safe stage |
| rerun | re-enter the workflow from the top with status-aware adoption |
| restart_clean | explicitly compensate/reset, then start again |
| cancel | stop the workflow and compensate where possible |
| adopt_observed_state | accept externally advanced state and continue/finish |
| mark_manual_intervention_required | freeze the workflow until human resolution |
For node inventory lifecycle, resume must be available whenever the node is in an in-progress coarse state whose owning task may have been lost or stalled:
- draining -> resume/reissue/requeue node.drain
- removing -> resume/reissue/requeue node.uninstall
These operator actions are idempotent:

- if a fresh queued task already exists, the action returns success without duplication
- if a live dispatched task still holds the lease, the action returns success without duplication
- if the task expired or the dispatch lease is stale, the action requeues or recreates the lifecycle task
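Those three idempotency rules reduce to a small decision function. A sketch, assuming a task record with a `state` of `"queued"` or `"dispatched"` and a `lease_expires_at` timestamp (the shape and names are assumptions):

```python
import time

def resume_action(task, now=None):
    """Decide what an idempotent resume should do for a lifecycle task.

    `task` is an assumed shape: None if the task was lost, otherwise
    {"state": "queued" | "dispatched", "lease_expires_at": float}.
    Returns "noop" (success without duplication) or "requeue".
    """
    now = time.time() if now is None else now
    if task is None:
        return "requeue"  # task lost entirely: recreate it
    if task["state"] == "queued":
        return "noop"     # a fresh queued task already exists
    if task["state"] == "dispatched" and task["lease_expires_at"] > now:
        return "noop"     # live dispatch lease: do not duplicate
    return "requeue"      # expired task or stale lease: requeue
```

Because every branch is safe to repeat, operators can hammer resume without creating duplicate drain/uninstall work.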
## 9. Recommended UI/Operator Presentation

### 9.1 Inventory view
Show coarse node state only:
- active
- offline
- quarantined
- retired
- removing
### 9.2 Lifecycle detail view
Show:

- workflow state
- current stage
- attempt count
- failure class
- last observed MAAS state
- recommended next action
This prevents inventory screens from becoming overloaded with workflow microstates.
## 10. Rules
- nodes.status must remain coarse and stable.
- Workflow stages and retryability belong in onboarding/decommission read models.
- MAAS state is observed upstream truth, not GPUasService-owned state.
- Reconciliation may change workflow/job state without directly inventing new node lifecycle states.
- Any new MAAS-specific lifecycle transition should be added here before implementation if it changes operator-facing behavior.