2026-03 Provisioning Workflow Recovery Gaps

Summary

Provisioning and release recovery took far too long because several owner layers were only partially implemented or only partially observable. The architecture direction was mostly correct, but the handoff and recovery surfaces were incomplete.

Impact

  • incidents stretched from hours to days
  • reimage/reinstall cycles destroyed useful host evidence
  • operators had to prove the failing owner layer manually instead of isolating it quickly

Symptoms

  • release_failed allocations could leave nodes fenced with occupancy=cleanup
  • cleanup and force-release paths were not consistently available through UI/API
  • node-agent evidence disappeared after rebuild/reinstall
  • wrapper errors hid whether the failure was in deploy, cert, ingress, API auth, or workflow state

Root Cause

The owner problem was not one bad service. It was a set of recovery and reconciliation gaps across the provisioning workflow:

  • deploy paths did not fully converge runtime prerequisites
  • release recovery for release_failed was incomplete
  • control-plane truth and UI/operator projections could drift
  • observability did not preserve or correlate evidence across node, ingress, API, and workflow layers

Why Detection Was Weak

  • no durable node log history after reimage
  • no single operator view joining node, ingress, API, workflow, and DB evidence
  • correlation identifiers were not consistently available across all layers
  • some failures only surfaced as generic wrapper errors

Recovery

Recovery during this incident included:

  • manual stale allocation cleanup and audit/outbox repair
  • node re-enrollment and control-plane reactivation
  • multiple logging additions to expose owner-layer evidence
  • proving the real node-agent path with end-to-end task claim/result logs

Follow-ups

  • ship durable node-agent logs to Loki, likely via a collector layer
  • add shared identifiers across logs:
      • node_id
      • node_instance_id
      • correlation_id
      • allocation_id
      • workflow_id
  • add admin/operator recovery APIs for stuck release_failed and similar lifecycle cleanup
  • update the intent/control/reconciliation model with the evidence from this incident
  • revive local environment parity so transport and cert bugs can be reproduced without remote deploy loops
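
As an added illustration of the detection gap described above and of the shared-identifier follow-up (example code, not the project's own), the following Python joins per-layer log records on correlation_id with an allocation_id fallback for layers that dropped the id, names the earliest owner layer that logged an error, and flags release_failed allocations whose node is still fenced with occupancy=cleanup. The layer ordering, field names and sample records are assumptions constructed for the example.

```python
from collections import defaultdict

# Owner layers in the order an allocation passes through them (assumed ordering).
LAYER_ORDER = ["node", "ingress", "api", "workflow", "db"]

# Constructed sample records, as if parsed from JSON log lines. The ingress record
# carries no correlation_id, mirroring the inconsistent-identifier gap.
SAMPLE_LOGS = [
    {"layer": "node", "correlation_id": "c-1", "allocation_id": "a-42", "node_id": "n-7", "level": "info", "msg": "task claimed"},
    {"layer": "ingress", "allocation_id": "a-42", "level": "error", "msg": "tls cert expired"},
    {"layer": "api", "correlation_id": "c-1", "allocation_id": "a-42", "level": "error", "msg": "auth token rejected"},
    {"layer": "workflow", "correlation_id": "c-1", "allocation_id": "a-42", "level": "error", "msg": "release_failed"},
    {"layer": "node", "correlation_id": "c-2", "allocation_id": "a-43", "node_id": "n-8", "level": "info", "msg": "task result posted"},
]
SAMPLE_ALLOCATIONS = [
    {"allocation_id": "a-42", "node_id": "n-7", "state": "release_failed"},
    {"allocation_id": "a-43", "node_id": "n-8", "state": "released"},
]
SAMPLE_NODES = {"n-7": {"fenced": True, "occupancy": "cleanup"},
                "n-8": {"fenced": False, "occupancy": "free"}}


def related_records(logs, correlation_id):
    """Join on correlation_id, then widen to records that only carry an allocation_id
    seen alongside that correlation_id (fallback for layers that drop the id)."""
    direct = [r for r in logs if r.get("correlation_id") == correlation_id]
    allocs = {r["allocation_id"] for r in direct if "allocation_id" in r}
    return [r for r in logs
            if r.get("correlation_id") == correlation_id
            or ("correlation_id" not in r and r.get("allocation_id") in allocs)]


def first_failing_layer(logs, correlation_id):
    """Earliest owner layer with an error for this correlation id, plus its messages.
    This is the answer a generic wrapper error hides from the operator."""
    by_layer = defaultdict(list)
    for r in related_records(logs, correlation_id):
        if r["level"] == "error":
            by_layer[r["layer"]].append(r["msg"])
    for layer in LAYER_ORDER:
        if by_layer.get(layer):
            return layer, by_layer[layer]
    return None, []


def stuck_releases(allocations, nodes):
    """release_failed allocations whose node is still fenced with occupancy=cleanup:
    candidates for an operator force-release rather than a reimage that loses evidence."""
    return [a["allocation_id"] for a in allocations
            if a["state"] == "release_failed"
            and nodes.get(a["node_id"], {}).get("fenced")
            and nodes[a["node_id"]].get("occupancy") == "cleanup"]
```

Run on the constructed records, first_failing_layer isolates the ingress cert error ahead of the downstream API auth and workflow errors, and stuck_releases returns the single fenced allocation that needs cleanup rather than a rebuild.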