MAAS Recovery Matrix v1

1. Purpose

Define the stage-by-stage failure and recovery model for MAAS-managed node onboarding and decommission.

This document complements:

- MAAS_Bare_Metal_Lifecycle_v1.md
- MAAS_Node_State_Model_v1.md

Use it as the implementation-facing reference for:

- retry behavior
- safe rerun/resume rules
- compensation
- manual intervention boundaries
- operator API semantics

2. Recovery Actions

| Action | Meaning |
|---|---|
| retry_stage | rerun only the failed stage |
| resume | continue from the last safe stage checkpoint |
| rerun | re-enter workflow from the top with status-aware adoption |
| restart_clean | explicitly compensate/reset progress, then start again |
| cancel | stop workflow and compensate where possible |
| adopt_observed_state | accept externally advanced MAAS/node state and continue |
| mark_manual_intervention_required | freeze automation until operator resolves issue |
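
As a minimal sketch, the action names above could be modeled as a closed enumeration so that an operator API rejects unknown values early. The class name `RecoveryAction`, the helper `parse_action`, and the `COMPENSATING_ACTIONS` grouping are illustrative assumptions, not part of the specified API.

```python
from enum import Enum


class RecoveryAction(str, Enum):
    """Recovery actions, using the exact names from the table above."""

    RETRY_STAGE = "retry_stage"
    RESUME = "resume"
    RERUN = "rerun"
    RESTART_CLEAN = "restart_clean"
    CANCEL = "cancel"
    ADOPT_OBSERVED_STATE = "adopt_observed_state"
    MARK_MANUAL_INTERVENTION_REQUIRED = "mark_manual_intervention_required"


# Assumption: actions whose "Meaning" mentions compensation are grouped here.
COMPENSATING_ACTIONS = frozenset(
    {RecoveryAction.RESTART_CLEAN, RecoveryAction.CANCEL}
)


def parse_action(raw: str) -> RecoveryAction:
    """Parse a request-supplied action name; reject anything not in the table."""
    try:
        return RecoveryAction(raw.strip().lower())
    except ValueError:
        raise ValueError(f"unknown recovery action: {raw!r}") from None
```

Keeping the set closed means a typo in an operator request fails at the API boundary instead of reaching the workflow.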

3. Failure Classes

| Class | Meaning |
|---|---|
| input_config_error | bad request, invalid SKU, disabled site, malformed input |
| site_capability_missing | required site policy/capability absent |
| upstream_transient | MAAS/Vault/network transient |
| bmc_power_failure | IPMI/power-cycle reachability failure |
| pxe_discovery_failure | discovery timeout or no deterministic MAAS match |
| hardware_mismatch | expected boot disk/interface/hardware not present |
| deploy_cloud_init_failure | MAAS deploy or cloud-init/datasource class failure |
| post_deploy_connectivity_failure | SSH seed/connectivity failure after deploy |
| hardware_sync_failure | machine token or sync health failure |
| agent_enrollment_failure | node-agent fails to enroll/renew |
| state_ambiguity | observed MAAS/node state conflicts with workflow expectation |

4. Onboarding Recovery Matrix

| Stage | Typical failure classes | Detect from | Retryable? | Safe recovery default | Compensation | Manual intervention boundary | Resulting workflow state | Resulting node state |
|---|---|---|---|---|---|---|---|---|
| load_site_config | input_config_error, site_capability_missing, upstream_transient | DB/Vault/API probe | yes for transient only | resume after transient, otherwise stop | none | disabled site, invalid token, missing required capability | failed_retryable or failed_manual_intervention | enrolling |
| resolve_power_credentials | input_config_error | override/default lookup | usually no | fix config then resume | none | no valid credential source remains | failed_manual_intervention | enrolling |
| create_or_find_in_maas | upstream_transient, bmc_power_failure, pxe_discovery_failure, state_ambiguity | MAAS/IPMI/discovery poll | yes, bounded | rerun from top is preferred | none | ambiguous candidates, repeated discovery failure | failed_retryable or failed_manual_intervention | enrolling |
| commission_node | upstream_transient, bmc_power_failure, state_ambiguity | MAAS machine status/events | yes, bounded | resume or rerun | none | machine deleted or unexpectedly advanced in MAAS | failed_retryable or failed_manual_intervention | enrolling |
| wait_for_ready | upstream_transient, hardware_mismatch, state_ambiguity | MAAS status/event poll | bounded | resume after transient; rerun if safe | none | Failed/Broken without safe path, machine absent | failed_retryable or failed_manual_intervention | enrolling |
| configure_storage | hardware_mismatch, upstream_transient | MAAS block-device inspection | usually no | fix hardware/site assumptions then rerun | none | BOSS disk not found | failed_manual_intervention | enrolling |
| apply_roce_phase2 | upstream_transient, hardware_mismatch | MAAS interface/subnet ops | yes/non-fatal | resume or skip if policy allows | none | required interface missing and policy says mandatory | running or failed_manual_intervention | enrolling |
| ensure_pxe_interface_auto | hardware_mismatch, upstream_transient | interface read/link validation | bounded | resume | maybe unlink/relink | interface cannot be made deploy-safe | failed_manual_intervention | enrolling |
| render_cloud_init | input_config_error, site_capability_missing | template/render/token generation | yes if transient dependency | resume | delete partial node row if necessary | required bootstrap/bundle input missing | failed_manual_intervention | enrolling |
| deploy_via_maas | upstream_transient, deploy_cloud_init_failure, state_ambiguity | MAAS deploy response | bounded | do not blindly retry; refresh state first | release back to Ready if deploy attempt started | machine already advanced, deploy rejected for non-transient reason | failed_retryable or failed_manual_intervention | enrolling |
| wait_for_deployed | deploy_cloud_init_failure, upstream_transient, state_ambiguity | MAAS status/events | bounded | classify failure, then resume/rerun | release back to Ready | repeated datasource/cloud-init class failure after bounded retry | failed_retryable or failed_manual_intervention | enrolling |
| classify_deploy_failure | upstream_transient | MAAS event query | best-effort | fall back to generic deploy failure | none | none by itself | failed_retryable | enrolling |
| recover_for_datasource_retry | deploy_cloud_init_failure, state_ambiguity | abort/release result | bounded | resume if recovery succeeds | abort/release to clean deployable state | abort/release cannot converge | failed_manual_intervention | enrolling |
| ensure_hardware_sync_configured | post_deploy_connectivity_failure, hardware_sync_failure, upstream_transient | token poll, SSH seed, service start | bounded | resume; later node.hw_sync_reseed after agent exists | none during onboarding | no reachable SSH IP, repeated seed failure | failed_manual_intervention | enrolling or quarantined |
| wait_for_hardware_sync_healthy | hardware_sync_failure, upstream_transient | MAAS sync fields | bounded | resume with reseed | none | repeated unhealthy sync after reseed | failed_manual_intervention | enrolling or quarantined |
| wait_for_agent_enrollment | agent_enrollment_failure, upstream_transient, state_ambiguity | API/node enrollment state | bounded | resume, possibly refresh token; adopt_observed_state on late success | none | repeated timeout or ambiguous late/out-of-band state | failed_retryable or failed_manual_intervention | enrolling or quarantined |
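
One way to make the matrix machine-readable is to encode each row as a small policy record and derive the resulting workflow state from it. The sketch below transcribes only the first two rows. The `StagePolicy` type, the `retry_classes` field (an interpretation of the free-text "Retryable?" column), and the `MAX_ATTEMPTS` bound are all hypothetical.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # hypothetical bound; the matrix only says "bounded"


@dataclass(frozen=True)
class StagePolicy:
    retry_classes: frozenset  # failure classes eligible for automatic retry
    default_action: str       # "Safe recovery default" column, verbatim
    compensation: str         # "Compensation" column, verbatim


# Two rows transcribed from the onboarding matrix; the rest are omitted here.
ONBOARDING_POLICY = {
    "load_site_config": StagePolicy(
        retry_classes=frozenset({"upstream_transient"}),  # "yes for transient only"
        default_action="resume after transient, otherwise stop",
        compensation="none",
    ),
    "resolve_power_credentials": StagePolicy(
        retry_classes=frozenset(),  # "usually no"
        default_action="fix config then resume",
        compensation="none",
    ),
}


def workflow_state_after_failure(stage: str, failure_class: str, attempt: int) -> str:
    """Return the workflow state after the given 1-based attempt has failed."""
    policy = ONBOARDING_POLICY[stage]
    if failure_class in policy.retry_classes and attempt < MAX_ATTEMPTS:
        return "failed_retryable"
    return "failed_manual_intervention"
```

For example, `workflow_state_after_failure("load_site_config", "upstream_transient", 1)` yields `failed_retryable`, while the same stage failing with `input_config_error` goes straight to `failed_manual_intervention`.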

5. Decommission Recovery Matrix

| Stage | Typical failure classes | Detect from | Retryable? | Safe recovery default | Compensation | Manual intervention boundary | Resulting workflow state | Resulting node state |
|---|---|---|---|---|---|---|---|---|
| disable_node | upstream_transient | inventory transition | yes | resume | none | repeated DB/state transition failure | failed_retryable | active or quarantined |
| force_release_allocations | upstream_transient, state_ambiguity | allocation service state | bounded | resume | none | allocations remain attached after retries | failed_manual_intervention | quarantined |
| drain_node | upstream_transient, agent_enrollment_failure | node-task result, agent reachability | bounded | resume | none | node-agent unreachable and operator must choose override path | failed_manual_intervention | quarantined or retired |
| cleanup_storage / scrub_gpu / validate_clean_node | upstream_transient, hardware_mismatch | node-task results | bounded | resume | quarantine on failed validation | repeated residue/cleanup failure | failed_manual_intervention | quarantined |
| load_site_config | input_config_error, site_capability_missing, upstream_transient | DB/Vault/API probe | yes for transient | resume | none | required site capability missing | failed_manual_intervention | current node state unchanged |
| release_maas_node | upstream_transient, deploy_cloud_init_failure, state_ambiguity | MAAS release/status | bounded | resume; fallback no-erase if policy allows | abort/mark-fixed + no-erase release | cannot reach Ready / ownership unclear | failed_manual_intervention | quarantined or retired |
| power_off_maas_node | upstream_transient, bmc_power_failure | MAAS power state | bounded | resume | none | power-off never converges | failed_manual_intervention | retired |
| retire_gpuaas_node | upstream_transient | inventory transition | yes | resume | none | repeated transition failure | failed_retryable | quarantined |
| remove_gpuaas_node_record | agent_enrollment_failure, state_ambiguity | uninstall task result / inventory state | bounded | resume | revert to retired on uninstall failure | repeated uninstall failure or conflicting state | failed_manual_intervention | retired |
| cleanup_secrets | upstream_transient | Vault/Redis result | bounded | resume or record follow-up | none | rarely manual unless secrets remain sensitive | failed_retryable | retired or deleted |
| remove_maas_record | upstream_transient, state_ambiguity | MAAS delete result | best-effort | resume if policy requires delete | none | retain MAAS record by policy or if delete unsafe | failed_retryable or completed with follow-up | record deleted or retired |

6. Timeouts

Timeouts are product behavior, not only implementation details.

| Timeout | Failure class | Default workflow state | Recommended default action |
|---|---|---|---|
| discovery timeout | pxe_discovery_failure | failed_retryable | rerun |
| commission timeout | upstream_transient or hardware_mismatch depending on MAAS status | failed_retryable or failed_manual_intervention | resume if transient; otherwise manual fix |
| deploy timeout | deploy_cloud_init_failure | failed_retryable or failed_manual_intervention | classify then resume/rerun |
| hardware sync seed timeout | hardware_sync_failure | failed_manual_intervention | operator fix then resume |
| hardware sync health timeout | hardware_sync_failure | failed_manual_intervention | reseed or operator intervention |
| agent enrollment timeout | agent_enrollment_failure | failed_retryable then failed_manual_intervention after bounded attempts | refresh token / resume / adopt_observed_state |
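
To illustrate how a timeout can surface as a classified failure rather than a bare exception, here is a small polling sketch. `StageTimeout`, `wait_until`, and `state_after_enrollment_timeout` are hypothetical names, and the durations and attempt bound are placeholders, since this document does not fix any values.

```python
import time
from typing import Callable


class StageTimeout(Exception):
    """Raised when a stage wait expires; carries the failure class from the table."""

    def __init__(self, timeout_name: str, failure_class: str) -> None:
        super().__init__(f"{timeout_name} ({failure_class})")
        self.timeout_name = timeout_name
        self.failure_class = failure_class


def wait_until(
    predicate: Callable[[], bool],
    *,
    timeout_s: float,
    poll_s: float,
    timeout_name: str,
    failure_class: str,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll predicate until it returns True or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return
        if clock() >= deadline:
            raise StageTimeout(timeout_name, failure_class)
        sleep(poll_s)


def state_after_enrollment_timeout(attempt: int, max_attempts: int = 3) -> str:
    """Agent enrollment row: retryable first, manual after bounded attempts."""
    if attempt < max_attempts:
        return "failed_retryable"
    return "failed_manual_intervention"
```

Injecting `clock` and `sleep` keeps the wait testable without real delays; a caller would catch `StageTimeout` and map `failure_class` to the default workflow state from the table.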

7. Late Success and Out-of-Band Changes

These are first-class recovery conditions:

- node-agent enrolls after workflow already marked failure
- operator changes MAAS state directly while workflow is running
- machine disappears and reappears with different upstream state

Rules:

- refresh observed MAAS state before destructive or irreversible steps
- allow adopt_observed_state when observed state now satisfies the intended checkpoint
- classify as state_ambiguity and move to failed_manual_intervention when ownership is unclear
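
The three rules can be sketched as a single decision function that always re-reads state before acting. The function name, its parameters, and the `"proceed"` return label are hypothetical. The status strings in the usage note are the MAAS status names this document already mentions.

```python
from typing import Callable, Collection


def decide_before_irreversible_step(
    fetch_status: Callable[[], str],
    satisfied_by: Collection[str],
    ownership_is_clear: Callable[[str], bool],
) -> str:
    """Apply the rules above, in order, to freshly observed MAAS state."""
    # Rule 1: refresh observed state; never act on a cached status.
    observed = fetch_status()
    # Rule 3: unclear ownership freezes automation for an operator.
    if not ownership_is_clear(observed):
        return "mark_manual_intervention_required"
    # Rule 2: the checkpoint is already satisfied, so adopt it.
    if observed in satisfied_by:
        return "adopt_observed_state"
    # Otherwise the workflow still has to perform the step itself.
    return "proceed"
```

For instance, if a deploy step finds the machine already `Deployed` and ownership is clear, the function returns `adopt_observed_state` instead of issuing a second deploy. Checking ownership before adoption is a deliberate ordering: a state that looks satisfied but belongs to someone else should not be adopted.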

8. Audit Requirements

Every operator recovery action must record:

- actor id and role
- workflow id and run id
- prior workflow state and stage
- requested action
- reason
- expected next state
- correlation id
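
A minimal sketch of an audit record that refuses to be created with any required field missing is shown below. The class and field names are illustrative assumptions derived from the list above; storage and transport are out of scope.

```python
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class RecoveryAuditRecord:
    """One entry per operator recovery action; every field is mandatory."""

    actor_id: str
    actor_role: str
    workflow_id: str
    run_id: str
    prior_workflow_state: str
    prior_stage: str
    requested_action: str
    reason: str
    expected_next_state: str
    correlation_id: str

    def __post_init__(self) -> None:
        # Reject None and blank strings so incomplete records never persist.
        missing = [
            f.name
            for f in fields(self)
            if getattr(self, f.name) is None or not str(getattr(self, f.name)).strip()
        ]
        if missing:
            raise ValueError(f"audit record missing fields: {', '.join(missing)}")

    def to_dict(self) -> dict:
        """Plain-dict form, suitable for structured logging or JSON encoding."""
        return asdict(self)
```

Validating at construction time means the recovery API can build the record first and only then execute the action, so an action without a reason or correlation id is never performed.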

9. Notes

  1. Fresh-node onboarding should optimize for safe full rerun, not for operator micromanagement of every stage.
  2. The workflow should converge on desired state through status-aware adoption, not through blind replay of shell-era behavior.
  3. This matrix is the implementation baseline for MAAS workflow activities and recovery APIs.