2026-03 Provisioning Workflow Recovery Gaps
Summary
Provisioning and release recovery took far too long because several owner layers were only partially implemented or only partially observable. The architecture direction was mostly correct, but the handoff and recovery surfaces were incomplete.
Impact
- incidents stretched from hours to days
- reimage/reinstall cycles destroyed useful host evidence
- operators had to prove the failing owner layer manually instead of isolating it quickly
Symptoms
- `release_failed` allocations could leave nodes fenced with `occupancy=cleanup`
- cleanup and force-release paths were not consistently available through UI/API
- node-agent evidence disappeared after rebuild/reinstall
- wrapper errors hid whether failure was deploy, cert, ingress, API auth, or workflow state
Root Cause
The owner problem was not one bad service. It was a set of recovery and reconciliation
gaps across the provisioning workflow:
- deploy paths did not fully converge runtime prerequisites
- release recovery for release_failed was incomplete
- control-plane truth and UI/operator projections could drift
- observability did not preserve or correlate evidence across node, ingress, API, and workflow layers
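The drift between control-plane truth and operator projections is mechanically checkable. A hypothetical sketch of a drift detector that compares the two views node by node; the field names (`node_id`, state strings like `cleanup`) are assumptions for illustration:

```python
def find_drift(
    control_plane: dict[str, str],
    projection: dict[str, str],
) -> dict[str, tuple[str, str]]:
    """Return node_id -> (truth, projected) for every node where the
    control plane and the operator-facing projection disagree.

    Nodes present in only one view are reported with 'missing' on the
    other side, so silent gaps surface the same way as stale values.
    """
    drift = {}
    for node_id in control_plane.keys() | projection.keys():
        truth = control_plane.get(node_id, "missing")
        shown = projection.get(node_id, "missing")
        if truth != shown:
            drift[node_id] = (truth, shown)
    return drift
```

Run periodically, a check like this turns "the UI could drift" from an unprovable suspicion into an alertable metric.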
Why Detection Was Weak
- no durable node log history after reimage
- no single operator view joining node, ingress, API, workflow, and DB evidence
- correlation identifiers were not consistently available across all layers
- some failures only surfaced as generic wrapper errors
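The missing correlation identifiers are the cheapest gap to close. A minimal sketch of a JSON log line that always carries the shared identifier set, with absent identifiers emitted as `null` so the gap is visible rather than silent. The identifier names come from the follow-ups below; everything else is illustrative:

```python
import json

# Shared identifiers every layer should emit (see Follow-ups).
REQUIRED_IDS = (
    "node_id",
    "node_instance_id",
    "correlation_id",
    "allocation_id",
    "workflow_id",
)


def log_line(message: str, **ids: str) -> str:
    """Build one JSON log line with the full identifier set.

    Identifiers the caller does not have are emitted as null, so a
    missing correlation_id shows up in queries instead of disappearing.
    """
    record = {"message": message}
    for key in REQUIRED_IDS:
        record[key] = ids.get(key)
    return json.dumps(record, sort_keys=True)
```

Joining evidence across node, ingress, API, and workflow layers then becomes a filter on `correlation_id` rather than a manual timestamp hunt.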
Recovery
Recovery during this incident included:

- manual stale allocation cleanup and audit/outbox repair
- node re-enrollment and control-plane reactivation
- multiple logging additions to expose owner-layer evidence
- proving the real node-agent path with end-to-end task claim/result logs
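The manual stale-allocation cleanup is the step most worth automating. A hypothetical sketch of the selection half of that cleanup: find allocations stuck in `release_failed` past a threshold. The record shape, state name, and threshold are assumptions for illustration:

```python
from datetime import datetime, timedelta

# Illustrative threshold; the real value is an operational decision.
STALE_AFTER = timedelta(hours=1)


def find_stale_release_failed(
    allocations: list[dict],
    now: datetime,
) -> list[str]:
    """Return allocation ids stuck in release_failed longer than STALE_AFTER.

    Selection is deliberately separated from the destructive cleanup step,
    so the candidate list can be reviewed (or dry-run) before acting.
    """
    return [
        a["allocation_id"]
        for a in allocations
        if a["state"] == "release_failed" and now - a["updated_at"] > STALE_AFTER
    ]
```

Keeping selection pure and side-effect-free also makes it trivially testable, which the manual procedure was not.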
Follow-ups
- ship durable node-agent logs to Loki, likely via a collector layer
- add shared identifiers across logs: `node_id`, `node_instance_id`, `correlation_id`, `allocation_id`, `workflow_id`
- add admin/operator recovery APIs for stuck `release_failed` and similar lifecycle cleanup
- update the intent/control/reconciliation model with the evidence from this incident
- revive local environment parity so transport and cert bugs can be reproduced without remote deploy loops
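The admin recovery API in the follow-ups can be reduced to a guarded state transition. A hypothetical sketch under assumed state names: only terminal failure states may be force-released, and every transition is appended to an audit trail so the manual-repair evidence from this incident is never lost again:

```python
# States from which an operator may force-release; names are assumptions.
RECOVERABLE = {"release_failed", "cleanup_failed"}


def force_release(node: dict, audit: list) -> dict:
    """Force-release a node stuck in a terminal failure state.

    Refuses any other state (so the API cannot be used to bypass the
    normal lifecycle) and records the transition for audit before
    returning the updated node record. Input is not mutated.
    """
    state = node["state"]
    if state not in RECOVERABLE:
        raise ValueError(f"refusing force-release from state {state!r}")
    audit.append({"node_id": node["node_id"], "from": state, "to": "released"})
    return {**node, "state": "released", "occupancy": None}
```

Exposing exactly this transition through UI and API would have removed the manual DB surgery from the recovery list above.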