MAAS Recovery Matrix v1
1. Purpose
Define the stage-by-stage failure and recovery model for MAAS-managed node onboarding and decommission.
This document complements:
- MAAS_Bare_Metal_Lifecycle_v1.md
- MAAS_Node_State_Model_v1.md
Use it as the implementation-facing reference for:
- retry behavior
- safe rerun/resume rules
- compensation
- manual intervention boundaries
- operator API semantics
2. Recovery Actions
| Action | Meaning |
|---|---|
| retry_stage | rerun only the failed stage |
| resume | continue from the last safe stage checkpoint |
| rerun | re-enter workflow from the top with status-aware adoption |
| restart_clean | explicitly compensate/reset progress, then start again |
| cancel | stop workflow and compensate where possible |
| adopt_observed_state | accept externally advanced MAAS/node state and continue |
| mark_manual_intervention_required | freeze automation until operator resolves issue |
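The actions above form a closed vocabulary, so a recovery API would likely want to reject anything outside it. The following is a minimal sketch of one way to model that; the `RecoveryAction` enum and the `parse_action` helper are hypothetical names used for illustration, not part of any existing API.

```python
from enum import Enum


class RecoveryAction(str, Enum):
    """Recovery actions listed in the table above (hypothetical enum name)."""

    RETRY_STAGE = "retry_stage"
    RESUME = "resume"
    RERUN = "rerun"
    RESTART_CLEAN = "restart_clean"
    CANCEL = "cancel"
    ADOPT_OBSERVED_STATE = "adopt_observed_state"
    MARK_MANUAL_INTERVENTION_REQUIRED = "mark_manual_intervention_required"


def parse_action(raw: str) -> RecoveryAction:
    """Parse an operator-supplied action, rejecting values not in the table."""
    try:
        return RecoveryAction(raw.strip().lower())
    except ValueError:
        raise ValueError(f"unknown recovery action: {raw!r}") from None
```

Validating at the API boundary keeps free-form strings out of the workflow engine, so each stage only ever sees one of the seven documented actions.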
3. Failure Classes
| Class | Meaning |
|---|---|
| input_config_error | bad request, invalid SKU, disabled site, malformed input |
| site_capability_missing | required site policy/capability absent |
| upstream_transient | MAAS/Vault/network transient |
| bmc_power_failure | IPMI/power-cycle reachability failure |
| pxe_discovery_failure | discovery timeout or no deterministic MAAS match |
| hardware_mismatch | expected boot disk/interface/hardware not present |
| deploy_cloud_init_failure | MAAS deploy or cloud-init/datasource class failure |
| post_deploy_connectivity_failure | SSH seed/connectivity failure after deploy |
| hardware_sync_failure | machine token or sync health failure |
| agent_enrollment_failure | node-agent fails to enroll/renew |
| state_ambiguity | observed MAAS/node state conflicts with workflow expectation |
4. Onboarding Recovery Matrix
| Stage | Typical failure classes | Detect from | Retryable? | Safe recovery default | Compensation | Manual intervention boundary | Resulting workflow state | Resulting node state |
|---|---|---|---|---|---|---|---|---|
| load_site_config | input_config_error, site_capability_missing, upstream_transient | DB/Vault/API probe | yes for transient only | resume after transient, otherwise stop | none | disabled site, invalid token, missing required capability | failed_retryable or failed_manual_intervention | enrolling |
| resolve_power_credentials | input_config_error | override/default lookup | usually no | fix config then resume | none | no valid credential source remains | failed_manual_intervention | enrolling |
| create_or_find_in_maas | upstream_transient, bmc_power_failure, pxe_discovery_failure, state_ambiguity | MAAS/IPMI/discovery poll | yes, bounded | rerun from top is preferred | none | ambiguous candidates, repeated discovery failure | failed_retryable or failed_manual_intervention | enrolling |
| commission_node | upstream_transient, bmc_power_failure, state_ambiguity | MAAS machine status/events | yes, bounded | resume or rerun | none | machine deleted or unexpectedly advanced in MAAS | failed_retryable or failed_manual_intervention | enrolling |
| wait_for_ready | upstream_transient, hardware_mismatch, state_ambiguity | MAAS status/event poll | bounded | resume after transient; rerun if safe | none | Failed/Broken without safe path, machine absent | failed_retryable or failed_manual_intervention | enrolling |
| configure_storage | hardware_mismatch, upstream_transient | MAAS block-device inspection | usually no | fix hardware/site assumptions then rerun | none | BOSS disk not found | failed_manual_intervention | enrolling |
| apply_roce_phase2 | upstream_transient, hardware_mismatch | MAAS interface/subnet ops | yes/non-fatal | resume or skip if policy allows | none | required interface missing and policy says mandatory | running or failed_manual_intervention | enrolling |
| ensure_pxe_interface_auto | hardware_mismatch, upstream_transient | interface read/link validation | bounded | resume | maybe unlink/relink | interface cannot be made deploy-safe | failed_manual_intervention | enrolling |
| render_cloud_init | input_config_error, site_capability_missing | template/render/token generation | yes if transient dependency | resume | delete partial node row if necessary | required bootstrap/bundle input missing | failed_manual_intervention | enrolling |
| deploy_via_maas | upstream_transient, deploy_cloud_init_failure, state_ambiguity | MAAS deploy response | bounded | do not blindly retry; refresh state first | release back to Ready if deploy attempt started | machine already advanced, deploy rejected for non-transient reason | failed_retryable or failed_manual_intervention | enrolling |
| wait_for_deployed | deploy_cloud_init_failure, upstream_transient, state_ambiguity | MAAS status/events | bounded | classify failure, then resume/rerun | release back to Ready | repeated datasource/cloud-init class failure after bounded retry | failed_retryable or failed_manual_intervention | enrolling |
| classify_deploy_failure | upstream_transient | MAAS event query | best-effort | fall back to generic deploy failure | none | none by itself | failed_retryable | enrolling |
| recover_for_datasource_retry | deploy_cloud_init_failure, state_ambiguity | abort/release result | bounded | resume if recovery succeeds | abort/release to clean deployable state | abort/release cannot converge | failed_manual_intervention | enrolling |
| ensure_hardware_sync_configured | post_deploy_connectivity_failure, hardware_sync_failure, upstream_transient | token poll, SSH seed, service start | bounded | resume; later node.hw_sync_reseed after agent exists | none during onboarding | no reachable SSH IP, repeated seed failure | failed_manual_intervention | enrolling or quarantined |
| wait_for_hardware_sync_healthy | hardware_sync_failure, upstream_transient | MAAS sync fields | bounded | resume with reseed | none | repeated unhealthy sync after reseed | failed_manual_intervention | enrolling or quarantined |
| wait_for_agent_enrollment | agent_enrollment_failure, upstream_transient, state_ambiguity | API/node enrollment state | bounded | resume, possibly refresh token; adopt_observed_state on late success | none | repeated timeout or ambiguous late/out-of-band state | failed_retryable or failed_manual_intervention | enrolling or quarantined |
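Because the matrix is meant to be an implementation baseline, it may help to carry it as data rather than as branching logic. The sketch below shows one possible shape, with two rows transcribed from the onboarding matrix above; the `MatrixRow` type, the `ONBOARDING_MATRIX` mapping, and the `lookup` helper are hypothetical names, and rejecting an unlisted failure class is an assumed policy rather than something the matrix prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatrixRow:
    """One row of the recovery matrix (hypothetical type)."""

    stage: str
    failure_classes: tuple
    retryable: str
    safe_recovery_default: str
    compensation: str
    workflow_states: tuple
    node_states: tuple


# Two rows copied from the onboarding matrix; a full table would list every stage.
_ROWS = (
    MatrixRow(
        stage="configure_storage",
        failure_classes=("hardware_mismatch", "upstream_transient"),
        retryable="usually no",
        safe_recovery_default="fix hardware/site assumptions then rerun",
        compensation="none",
        workflow_states=("failed_manual_intervention",),
        node_states=("enrolling",),
    ),
    MatrixRow(
        stage="wait_for_deployed",
        failure_classes=(
            "deploy_cloud_init_failure",
            "upstream_transient",
            "state_ambiguity",
        ),
        retryable="bounded",
        safe_recovery_default="classify failure, then resume/rerun",
        compensation="release back to Ready",
        workflow_states=("failed_retryable", "failed_manual_intervention"),
        node_states=("enrolling",),
    ),
)

ONBOARDING_MATRIX = {row.stage: row for row in _ROWS}


def lookup(stage: str, failure_class: str) -> MatrixRow:
    """Return the matrix row for a stage, checking the failure class is listed."""
    row = ONBOARDING_MATRIX.get(stage)
    if row is None:
        raise KeyError(f"stage not in matrix: {stage}")
    if failure_class not in row.failure_classes:
        # Assumed policy: an unlisted class is a caller error, not a silent default.
        raise ValueError(f"{failure_class} is not a typical class for {stage}")
    return row
```

For example, `lookup("wait_for_deployed", "state_ambiguity").compensation` yields the documented compensation for that stage, so recovery handlers and operator tooling read from the same source.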
5. Decommission Recovery Matrix
| Stage | Typical failure classes | Detect from | Retryable? | Safe recovery default | Compensation | Manual intervention boundary | Resulting workflow state | Resulting node state |
|---|---|---|---|---|---|---|---|---|
| disable_node | upstream_transient | inventory transition | yes | resume | none | repeated DB/state transition failure | failed_retryable | active or quarantined |
| force_release_allocations | upstream_transient, state_ambiguity | allocation service state | bounded | resume | none | allocations remain attached after retries | failed_manual_intervention | quarantined |
| drain_node | upstream_transient, agent_enrollment_failure | node-task result, agent reachability | bounded | resume | none | node-agent unreachable and operator must choose override path | failed_manual_intervention | quarantined or retired |
| cleanup_storage / scrub_gpu / validate_clean_node | upstream_transient, hardware_mismatch | node-task results | bounded | resume | quarantine on failed validation | repeated residue/cleanup failure | failed_manual_intervention | quarantined |
| load_site_config | input_config_error, site_capability_missing, upstream_transient | DB/Vault/API probe | yes for transient | resume | none | required site capability missing | failed_manual_intervention | current node state unchanged |
| release_maas_node | upstream_transient, deploy_cloud_init_failure, state_ambiguity | MAAS release/status | bounded | resume; fallback no-erase if policy allows | abort/mark-fixed + no-erase release | cannot reach Ready / ownership unclear | failed_manual_intervention | quarantined or retired |
| power_off_maas_node | upstream_transient, bmc_power_failure | MAAS power state | bounded | resume | none | power-off never converges | failed_manual_intervention | retired |
| retire_gpuaas_node | upstream_transient | inventory transition | yes | resume | none | repeated transition failure | failed_retryable | quarantined |
| remove_gpuaas_node_record | agent_enrollment_failure, state_ambiguity | uninstall task result / inventory state | bounded | resume | revert to retired on uninstall failure | repeated uninstall failure or conflicting state | failed_manual_intervention | retired |
| cleanup_secrets | upstream_transient | Vault/Redis result | bounded | resume or record follow-up | none | rarely manual unless secrets remain sensitive | failed_retryable | retired or deleted |
| remove_maas_record | upstream_transient, state_ambiguity | MAAS delete result | best-effort | resume if policy requires delete | none | retain MAAS record by policy or if delete unsafe | failed_retryable or completed with follow-up record | deleted or retired |
6. Timeouts
Timeouts are product behavior, not only implementation details.
| Timeout | Failure class | Default workflow state | Recommended default action |
|---|---|---|---|
| discovery timeout | pxe_discovery_failure | failed_retryable | rerun |
| commission timeout | upstream_transient or hardware_mismatch depending on MAAS status | failed_retryable or failed_manual_intervention | resume if transient; otherwise manual fix |
| deploy timeout | deploy_cloud_init_failure | failed_retryable or failed_manual_intervention | classify then resume/rerun |
| hardware sync seed timeout | hardware_sync_failure | failed_manual_intervention | operator fix then resume |
| hardware sync health timeout | hardware_sync_failure | failed_manual_intervention | reseed or operator intervention |
| agent enrollment timeout | agent_enrollment_failure | failed_retryable then failed_manual_intervention after bounded attempts | refresh token / resume / adopt_observed_state |
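The agent enrollment row escalates from failed_retryable to failed_manual_intervention once a bounded number of attempts is used up. A minimal sketch of that escalation follows, alongside the timeouts that have a single fixed outcome. The dictionary keys, the function name, and the bound of 3 attempts are all assumptions; the table says "bounded" without giving a number.

```python
# Timeouts with one fixed outcome in the table: key -> (failure class, workflow state).
# The short keys are hypothetical identifiers for the table's timeout names.
FIXED_TIMEOUT_OUTCOMES = {
    "discovery": ("pxe_discovery_failure", "failed_retryable"),
    "hardware_sync_seed": ("hardware_sync_failure", "failed_manual_intervention"),
    "hardware_sync_health": ("hardware_sync_failure", "failed_manual_intervention"),
}

# Assumption: the source does not state the bound, so 3 is a placeholder.
MAX_ENROLLMENT_ATTEMPTS = 3


def agent_enrollment_timeout_state(
    attempts_used: int, max_attempts: int = MAX_ENROLLMENT_ATTEMPTS
) -> str:
    """Workflow state after an agent enrollment timeout on attempt `attempts_used`."""
    if attempts_used < 1 or max_attempts < 1:
        raise ValueError("attempt counts must be at least 1")
    if attempts_used < max_attempts:
        # Attempts remain: stay retryable so resume / token refresh is still allowed.
        return "failed_retryable"
    # Bound exhausted: stop automation and hand over to an operator.
    return "failed_manual_intervention"
```

Keeping the bound as a parameter makes it a per-site or per-product setting, which fits the point that timeouts are product behavior.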
7. Late Success and Out-of-Band Changes
These are first-class recovery conditions:
- node-agent enrolls after workflow already marked failure
- operator changes MAAS state directly while workflow is running
- machine disappears and reappears with different upstream state
Rules:
- refresh observed MAAS state before destructive or irreversible steps
- allow adopt_observed_state when observed state now satisfies the intended checkpoint
- move to state_ambiguity / failed_manual_intervention when ownership is unclear
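The three rules above can be read as a guard that runs before any destructive or irreversible step. The sketch below is one possible interpretation, not a prescribed implementation: the function name, its callback parameters, and the `"proceed"` return value are hypothetical, and how ownership is actually determined is left to the caller.

```python
from typing import Callable, Optional


def decide_before_destructive_step(
    fetch_observed_state: Callable[[], Optional[str]],
    checkpoint_states: frozenset,
    ownership_is_clear: Callable[[Optional[str]], bool],
) -> str:
    """Apply the late-success rules before a destructive or irreversible step.

    fetch_observed_state returns the current MAAS status, or None if the
    machine is absent. checkpoint_states are statuses that already satisfy
    the intended checkpoint.
    """
    # Rule 1: always refresh; never act on a status cached earlier in the run.
    observed = fetch_observed_state()

    # Rule 3: unclear ownership means state_ambiguity, so automation stops.
    if not ownership_is_clear(observed):
        return "failed_manual_intervention"

    # Rule 2: the world already reached the checkpoint, so adopt it.
    if observed in checkpoint_states:
        return "adopt_observed_state"

    # Otherwise the step is still needed and safe to attempt.
    return "proceed"
```

For instance, if a deploy stage is about to release a machine but the refreshed status already satisfies the deploy checkpoint and ownership is clear, the guard returns adopt_observed_state instead of letting the release run.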
8. Audit Requirements
Every operator recovery action must record:
- actor id and role
- workflow id and run id
- prior workflow state and stage
- requested action
- reason
- expected next state
- correlation id
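One way to make the audit requirement hard to skip is to require every field at construction time. The following is a sketch under that assumption; the `RecoveryAuditRecord` type and its field names are hypothetical spellings of the items listed above, and rejecting blank values is an assumed validation rule.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RecoveryAuditRecord:
    """Audit entry for one operator recovery action (hypothetical type)."""

    actor_id: str
    actor_role: str
    workflow_id: str
    run_id: str
    prior_workflow_state: str
    prior_stage: str
    requested_action: str
    reason: str
    expected_next_state: str
    correlation_id: str

    def __post_init__(self) -> None:
        # Assumed rule: every listed item must be present and non-blank.
        blank = [
            f.name
            for f in fields(self)
            if not str(getattr(self, f.name) or "").strip()
        ]
        if blank:
            raise ValueError(f"audit record missing fields: {', '.join(blank)}")
```

Because the dataclass is frozen, a record cannot be altered after it is written, and omitting any field fails with a `TypeError` or `ValueError` before the recovery action is accepted.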
9. Notes
- Fresh-node onboarding should optimize for safe full rerun, not for operator micromanagement of every stage.
- The workflow should converge on desired state through status-aware adoption, not through blind replay of shell-era behavior.
- This matrix is the implementation baseline for MAAS workflow activities and recovery APIs.