2026-03 Provisioning Workflow Recovery Gaps

Summary

Provisioning and release recovery took far too long because several owner layers were only partially implemented or only partially observable. The architecture direction was mostly correct, but the handoff and recovery surfaces were incomplete.

Impact

incidents stretched from hours to days
reimage/reinstall cycles destroyed useful host evidence
operators had to prove the failing owner layer manually instead of isolating it quickly

Symptoms

release_failed allocations could leave nodes fenced with occupancy=cleanup
cleanup and force-release paths were not consistently available through UI/API
node-agent evidence disappeared after rebuild/reinstall
wrapper errors hid whether failure was deploy, cert, ingress, API auth, or workflow state
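One way to keep the failing layer visible through a wrapper is to tag the original error with its owner layer and chain it rather than replace it. The sketch below is illustrative only: `Layer`, `ProvisioningError`, `failing_layer`, `renew_cert`, and `provision_node` are hypothetical names, not part of the existing codebase, and the layer values simply mirror the list above.

```python
from enum import Enum
from typing import Optional


class Layer(Enum):
    """Owner layers named in the symptoms above (values are assumptions)."""
    DEPLOY = "deploy"
    CERT = "cert"
    INGRESS = "ingress"
    API_AUTH = "api_auth"
    WORKFLOW_STATE = "workflow_state"


class ProvisioningError(Exception):
    """Hypothetical error type that records which layer failed."""

    def __init__(self, layer: Layer, message: str):
        super().__init__(f"[{layer.value}] {message}")
        self.layer = layer


def failing_layer(exc: Optional[BaseException]) -> Optional[Layer]:
    """Walk the __cause__ chain and return the innermost tagged layer."""
    found = None
    while exc is not None:
        if isinstance(exc, ProvisioningError):
            found = exc.layer
        exc = exc.__cause__
    return found


def renew_cert() -> None:
    # Stand-in for a real step; it always fails for the sake of the example.
    raise ProvisioningError(Layer.CERT, "certificate expired")


def provision_node() -> None:
    try:
        renew_cert()
    except ProvisioningError as err:
        # "raise ... from err" keeps the tagged error reachable via __cause__,
        # so the generic wrapper no longer hides the owner layer.
        raise RuntimeError("provisioning failed") from err
```

With this shape, an operator-facing handler can call `failing_layer(exc)` on the generic wrapper and still report `Layer.CERT` instead of only "provisioning failed".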

Root Cause

The owner problem was not one bad service. It was a set of recovery and reconciliation gaps across the provisioning workflow:

deploy paths did not fully converge runtime prerequisites
release recovery for release_failed was incomplete
control-plane truth and UI/operator projections could drift
observability did not preserve or correlate evidence across node, ingress, API, and workflow layers
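The release_failed gap can be pictured as a reconciliation pass that finds nodes left fenced with occupancy=cleanup and converges them to a clean state. This is a minimal sketch under stated assumptions: the `Node` and `Allocation` shapes, the helper names, and the state values other than `release_failed` and `cleanup` (for example `"released"` and `"free"`) are hypothetical and do not describe the real data model.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    fenced: bool
    occupancy: str  # assumed values: "free", "allocated", "cleanup"


@dataclass
class Allocation:
    allocation_id: str
    node_id: str
    state: str  # assumed values: "active", "released", "release_failed"


def find_stuck_nodes(nodes, allocations):
    """Return sorted (node_id, allocation_id) pairs that need recovery."""
    failed = {
        a.node_id: a.allocation_id
        for a in allocations
        if a.state == "release_failed"
    }
    return sorted(
        (n.node_id, failed[n.node_id])
        for n in nodes
        if n.node_id in failed and n.fenced and n.occupancy == "cleanup"
    )


def force_release(node, allocation):
    """Converge one stuck pair; repeating the call changes nothing further."""
    allocation.state = "released"
    node.fenced = False
    node.occupancy = "free"
```

Keeping detection (`find_stuck_nodes`) separate from repair (`force_release`) would let the same check feed both an operator view and an admin recovery path, which is the kind of drift between control-plane truth and projections this incident exposed.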

Why Detection Was Weak

no durable node log history after reimage
no single operator view joining node, ingress, API, workflow, and DB evidence
correlation identifiers were not consistently available across all layers
some failures only surfaced as generic wrapper errors

Recovery

Recovery during this incident included:

manual stale allocation cleanup and audit/outbox repair
node re-enrollment and control-plane reactivation
multiple logging additions to expose owner-layer evidence
proving the real node-agent path with end-to-end task claim/result logs

Follow-ups

ship durable node-agent logs to Loki, likely via a collector layer
add shared identifiers across logs:
    node_id
    node_instance_id
    correlation_id
    allocation_id
    workflow_id
add admin/operator recovery APIs for stuck release_failed and similar lifecycle cleanup
update the intent/control/reconciliation model with the evidence from this incident
revive local environment parity so transport and cert bugs can be reproduced without remote deploy loops
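The shared-identifier follow-up could look roughly like the sketch below, which attaches the five identifiers to every log line as JSON using only the Python standard library. It is an assumption-laden illustration: `SHARED_FIELDS`, `JsonFormatter`, `make_logger`, and the sample values are hypothetical, and the actual log format and shipping path to Loki are not decided here.

```python
import io
import json
import logging

# The identifiers listed in the follow-ups above.
SHARED_FIELDS = (
    "node_id",
    "node_instance_id",
    "correlation_id",
    "allocation_id",
    "workflow_id",
)


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record, always with every shared field."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        for field in SHARED_FIELDS:
            # Missing identifiers are written as null rather than omitted,
            # so every layer produces the same set of keys.
            entry[field] = getattr(record, field, None)
        return json.dumps(entry, sort_keys=True)


def make_logger(name, stream, **ids):
    """Return a LoggerAdapter that stamps the given identifiers on each line."""
    unknown = set(ids) - set(SHARED_FIELDS)
    if unknown:
        raise ValueError(f"unknown identifier(s): {sorted(unknown)}")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    base = logging.getLogger(name)
    base.handlers = [handler]
    base.setLevel(logging.INFO)
    base.propagate = False
    return logging.LoggerAdapter(base, ids)


# Usage: a node-agent logger that knows its node and correlation id.
buf = io.StringIO()
log = make_logger("node-agent", buf, node_id="node-1", correlation_id="c-123")
log.info("task claimed")
```

Emitting the same keys from node, ingress, API, and workflow code would make it possible to join their logs on `correlation_id` or `allocation_id` in a single query, which is the operator view that was missing during this incident.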