App Runtime Recovery Model v1

As of: April 10, 2026

Purpose

Define the generic recovery, intervention, and repair semantics for distributed workloads managed through the GPUaaS app-runtime model.

This document exists to stop each workload adapter from inventing its own partial retry and manual-recovery behavior.

It applies to:

  1. Slurm reference and later HA variants,
  2. self-managed RKE2,
  3. future managed Kubernetes flows,
  4. other multi-member runtimes that rely on controller and worker members.

It does not define workload-specific repair logic.

Scope

In scope:

  1. generic instance and member recovery states,
  2. ownership boundaries between platform runtime and workload adapter,
  3. retry, reconcile, resume, repair, and intervention semantics,
  4. operator-visible recovery actions and their meaning,
  5. minimum audit and progress expectations.

Out of scope:

  1. final Kubernetes implementation,
  2. workload-specific health probes,
  3. final UI polish,
  4. replacement of Temporal or the current task transport.

Problem Statement

The platform already has evidence of recovery gaps:

  1. partial onboarding states,
  2. member operations accepted but not clearly advanced,
  3. late dependency failures after mutation boundaries,
  4. ambiguous operator choices between retry, rerun, and manual fixes.

The lifecycle of a distributed workload makes those gaps normal rather than exceptional.

Without a generic model:

  1. each adapter will invent incompatible states,
  2. workload UIs will expose inconsistent actions,
  3. operators will not know whether a runtime is safe to resume,
  4. Kubernetes work will grow on top of undefined repair semantics.

Decision Summary

The platform runtime must distinguish five different situations:

  1. Retryable execution failure: the platform can safely retry the same step automatically.

  2. Reconcilable drift: the desired shape is known, and the platform can run a bounded reconcile path without operator-authored input.

  3. Repairable runtime: the runtime needs an explicit operator action, but the platform still owns the repair workflow once that action is requested.

  4. Manual intervention required: the operator must supply missing information or complete an external fix before the platform can continue.

  5. Terminal failure: the current workflow attempt is over, and a new lifecycle attempt or delete is required.

The platform must model these situations explicitly at both:

  1. instance level,
  2. member and operation level.
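The five situations can be sketched as a small classifier. This is a minimal illustration, not the platform's implementation: the `FailureFacts` fields, the `classify` function, and the order of the checks are all assumptions introduced here to show how the categories stay mutually exclusive.

```python
from dataclasses import dataclass
from enum import Enum


class Situation(Enum):
    RETRYABLE_FAILURE = "retryable_execution_failure"
    RECONCILABLE_DRIFT = "reconcilable_drift"
    REPAIRABLE_RUNTIME = "repairable_runtime"
    MANUAL_INTERVENTION = "manual_intervention_required"
    TERMINAL_FAILURE = "terminal_failure"


@dataclass(frozen=True)
class FailureFacts:
    """Hypothetical inputs; the field names are illustrative, not a platform schema."""
    step_safe_to_repeat: bool      # idempotent or compensation-safe
    desired_state_known: bool      # the desired shape is unambiguous
    needs_operator_input: bool     # missing information or an external fix
    platform_repair_exists: bool   # a bounded platform-owned repair workflow exists
    attempt_still_valid: bool      # the current workflow attempt can still continue


def classify(facts: FailureFacts) -> Situation:
    # Check the most restrictive situations first so a failure is never
    # reported as more automatic than it really is (assumed ordering).
    if not facts.attempt_still_valid:
        return Situation.TERMINAL_FAILURE
    if facts.needs_operator_input:
        return Situation.MANUAL_INTERVENTION
    if facts.step_safe_to_repeat:
        return Situation.RETRYABLE_FAILURE
    if facts.desired_state_known:
        return Situation.RECONCILABLE_DRIFT
    if facts.platform_repair_exists:
        return Situation.REPAIRABLE_RUNTIME
    # Nothing automatic or platform-owned applies, so an operator must step in.
    return Situation.MANUAL_INTERVENTION
```

Falling back to manual intervention when no other category fits is a conservative choice: the sketch never returns a situation that implies automatic progress without positive evidence for it.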

Object Model

The generic runtime objects remain:

  1. app_instance
  2. app_instance_member
  3. app_instance_member_operation

The platform should continue to treat:

  1. instance state as the user-facing workload contract,
  2. member state as the distributed control-plane inventory,
  3. operation state as the execution/audit record for member changes.

Generic Instance Recovery Semantics

Instance status

Existing instance status remains the high-level lifecycle surface:

  1. requested
  2. deploying
  3. running
  4. upgrading
  5. rolling_back
  6. failed
  7. decommissioning
  8. decommissioned
  9. deleting
  10. deleted

Runtime-state direction

runtime_state is the richer execution surface. It should carry generic fields that all distributed workloads can use.

Directionally:

{
  "phase": "controller_bootstrap",
  "progress": {
    "completed": 2,
    "total": 5,
    "unit": "members"
  },
  "health_status": "degraded",
  "health_detail": {
    "reason": "worker_join_timeout"
  },
  "recovery": {
    "mode": "manual_intervention",
    "reason_code": "allocation_missing",
    "safe_to_retry": false,
    "safe_to_resume": true,
    "operator_action_required": "bind_allocation"
  }
}

The recovery section is generic runtime metadata. It is not workload-specific health payload.
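A reader of runtime_state might consume the recovery section roughly as follows. This is a sketch under stated assumptions: the `Recovery` dataclass and `parse_recovery` helper are hypothetical names, and treating a missing section as mode `none` is an assumption rather than a documented rule. The mode values are the ones listed under Recovery Modes.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Mode names taken from the Recovery Modes section of this document.
RECOVERY_MODES = {
    "none", "retrying", "reconciling",
    "repair_available", "manual_intervention", "terminal",
}


@dataclass(frozen=True)
class Recovery:
    mode: str
    reason_code: Optional[str]
    safe_to_retry: bool
    safe_to_resume: bool
    operator_action_required: Optional[str]


def parse_recovery(runtime_state: dict) -> Recovery:
    """Read the generic recovery section (assumption: absent means mode 'none')."""
    raw = runtime_state.get("recovery") or {}
    mode = raw.get("mode", "none")
    if mode not in RECOVERY_MODES:
        raise ValueError(f"unknown recovery mode: {mode!r}")
    return Recovery(
        mode=mode,
        reason_code=raw.get("reason_code"),
        # Default the safety flags to False so an absent flag never enables an action.
        safe_to_retry=bool(raw.get("safe_to_retry", False)),
        safe_to_resume=bool(raw.get("safe_to_resume", False)),
        operator_action_required=raw.get("operator_action_required"),
    )


example = json.loads("""
{
  "phase": "controller_bootstrap",
  "recovery": {
    "mode": "manual_intervention",
    "reason_code": "allocation_missing",
    "safe_to_retry": false,
    "safe_to_resume": true,
    "operator_action_required": "bind_allocation"
  }
}
""")
recovery = parse_recovery(example)
```

Rejecting unknown modes at read time keeps adapters from quietly extending the mode set.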

Recovery Modes

The runtime should normalize to one of these modes:

  1. none
  2. retrying
  3. reconciling
  4. repair_available
  5. manual_intervention
  6. terminal

Meaning

none

No active repair requirement.

retrying

The platform is already retrying a bounded failure automatically.

Use only when:

  1. the step is idempotent or compensation-safe,
  2. no new operator input is required,
  3. retry count and last error are observable.

reconciling

The platform detected drift and is re-applying desired state.

Examples:

  1. re-running member join after temporary controller restart,
  2. re-issuing a runtime secret,
  3. restoring missing delivered config.

repair_available

The operator can request a bounded platform-owned repair action.

Examples:

  1. re-run member add,
  2. reconcile runtime membership,
  3. regenerate kubeconfig,
  4. re-issue app-owned credential delivery.

manual_intervention

The platform cannot safely continue without an operator-completed fix or new operator-provided input.

Examples:

  1. missing placement decision,
  2. invalid external credential,
  3. exhausted upstream provisioning timeout that changed real-world state,
  4. app-specific bootstrap failed and needs external confirmation.

terminal

The current attempt is over. The runtime has failed in a way that should not silently auto-resume.

This does not necessarily imply deletion. It means the next action must be explicit.

Member Recovery Semantics

Members should expose:

  1. status
  2. runtime_state
  3. health_status
  4. health_detail
  5. endpoint
  6. last controlling operation

Member health meanings

The platform-level normalized member health states should be:

  1. healthy
  2. progressing
  3. degraded
  4. failed
  5. unknown

The adapter may supply richer detail, but the top-level state should normalize to one of these values.
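The normalization step could look like the sketch below. The adapter-side vocabulary in `ADAPTER_HEALTH_MAP` is invented for illustration and does not describe any real adapter; only the five target states come from this document.

```python
from typing import Optional

# The five platform-level states defined above.
MEMBER_HEALTH_STATES = ("healthy", "progressing", "degraded", "failed", "unknown")

# Hypothetical adapter-specific states mapped onto the normalized set.
ADAPTER_HEALTH_MAP = {
    "ready": "healthy",
    "joining": "progressing",
    "draining": "progressing",
    "not_ready": "degraded",
    "down": "failed",
}


def normalize_member_health(adapter_state: Optional[str]) -> str:
    """Return one of the five normalized states for any adapter-reported value."""
    if adapter_state in MEMBER_HEALTH_STATES:
        return adapter_state
    # Anything the map does not recognize is reported as unknown, never guessed.
    return ADAPTER_HEALTH_MAP.get(adapter_state, "unknown")
```

Mapping unrecognized values to `unknown` rather than `failed` avoids turning an adapter vocabulary gap into a false failure signal.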

Member repair rule

A member may be unhealthy while the instance remains recoverable.

The platform must not immediately collapse all member failures into instance failed if:

  1. the controller is still healthy enough to reconcile,
  2. the operation is bounded and retryable,
  3. the platform can express the degraded condition cleanly.

This is especially important for:

  1. Kubernetes agent joins,
  2. Slurm worker add/remove flows,
  3. rolling upgrades.
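The member repair rule can be expressed as a single guard. This sketch assumes the three conditions are meant conjunctively, which the text does not state outright; `MemberFailureContext` and `must_keep_instance_recoverable` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemberFailureContext:
    """Hypothetical facts gathered when a member reports a failure."""
    controller_can_reconcile: bool         # controller is healthy enough to reconcile
    operation_bounded_and_retryable: bool  # the failing operation is bounded and retryable
    degraded_condition_expressible: bool   # the degraded state can be shown cleanly


def must_keep_instance_recoverable(ctx: MemberFailureContext) -> bool:
    """True when the instance must not be collapsed into failed (assumes all three hold)."""
    return (
        ctx.controller_can_reconcile
        and ctx.operation_bounded_and_retryable
        and ctx.degraded_condition_expressible
    )
```

For example, a Slurm worker add that times out while the controller is still healthy would satisfy the guard, so the instance would stay out of failed while the member is retried.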

Operation Recovery Semantics

Member operations should remain the source of truth for:

  1. what change was requested,
  2. whether it started,
  3. whether it completed,
  4. why it failed,
  5. whether the next valid action is retry, resume, or abandon.

Directionally an operation should expose:

  1. status
  2. last_error_code
  3. last_error_message
  4. started_at
  5. completed_at
  6. runtime_state
  7. optional recovery hint fields inside runtime_state

Operation statuses

Current statuses are acceptable as the execution skeleton:

  1. accepted
  2. in_progress
  3. completed
  4. failed
  5. canceled

The gap is not in the status set. It is in the normalized recovery meaning around those statuses.
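One way to derive the next valid action from an operation record is sketched below. It assumes the optional recovery hints live under `runtime_state.recovery` and reuse the `safe_to_retry` and `safe_to_resume` field names from the instance-level example; that placement is an assumption, and `next_valid_action` is a hypothetical helper.

```python
from typing import Optional


def next_valid_action(operation: dict) -> Optional[str]:
    """Return 'retry', 'resume', or 'abandon' for a failed operation, else None."""
    if operation["status"] != "failed":
        # accepted, in_progress, completed, and canceled need no recovery action here.
        return None
    hints = (operation.get("runtime_state") or {}).get("recovery") or {}
    if hints.get("safe_to_retry"):
        return "retry"
    if hints.get("safe_to_resume"):
        return "resume"
    # Without a positive safety hint the only safe choice is to abandon the attempt.
    return "abandon"
```

The statuses stay unchanged; the recovery meaning comes entirely from the hints attached to the failed record.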

Ownership Boundary

Platform runtime owns

  1. workflow and task execution,
  2. retries and bounded reconcile behavior,
  3. member inventory and normalized health,
  4. generic repair actions,
  5. operator-visible recovery state,
  6. audit of privileged recovery actions.

Workload adapter owns

  1. adapter-specific phase names,
  2. adapter-specific health detail,
  3. mapping workload internals to generic runtime semantics,
  4. workload-specific safe-repair implementations.

Platform must not require adapters to invent

  1. their own manual-intervention lifecycle,
  2. their own repair action taxonomy,
  3. their own UI-level meaning for degraded versus failed.

Operator Actions

The runtime should reserve these generic operator actions:

  1. retry
  2. resume
  3. reconcile
  4. repair
  5. mark_intervention_complete
  6. decommission

Meanings

retry

Re-run the same failed step or operation when it is safe to repeat without new operator input.

resume

Continue a paused workflow after manual intervention has been completed.

This is not equivalent to retry. It assumes the workflow state is still meaningful and resumable.

reconcile

Ask the platform to compare desired and observed runtime state and fix drift without changing the intended topology.

repair

Run a bounded platform-owned repair procedure that may include one or more sub-steps and compensation logic.

mark_intervention_complete

The operator asserts that the external or manual fix has been completed and that the platform may continue evaluation.

decommission

Abort the current runtime and move into a controlled teardown.

Safety Rules

Retry safety

Do not expose retry unless:

  1. the underlying action is idempotent or compensated,
  2. the previous attempt did not cross an unsafe mutation boundary without observable state,
  3. the platform has enough state to avoid duplicate member creation or drift.

Resume safety

Do not expose resume unless:

  1. the workflow state is persisted and still valid,
  2. required manual intervention is marked complete,
  3. the runtime has not drifted beyond the saved assumptions.

Reconcile safety

Do not expose reconcile if the desired-state model itself is missing or ambiguous.

Repair safety

Repairs must remain platform-owned and audited. They should not depend on undocumented shell commands.
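The retry, resume, and reconcile rules above can be read as gates on which actions are exposed. The following is a minimal sketch: `SafetyFacts` and `exposed_actions` are hypothetical names, and how each fact is actually determined is left to the platform.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SafetyFacts:
    """Hypothetical inputs, one per condition in the safety rules."""
    action_idempotent_or_compensated: bool
    crossed_unobserved_mutation_boundary: bool
    can_avoid_duplicate_members: bool
    workflow_state_persisted_and_valid: bool
    intervention_marked_complete: bool
    drifted_beyond_saved_assumptions: bool
    desired_state_defined: bool


def exposed_actions(facts: SafetyFacts) -> List[str]:
    """Return only the actions whose safety conditions all hold."""
    actions: List[str] = []
    if (facts.action_idempotent_or_compensated
            and not facts.crossed_unobserved_mutation_boundary
            and facts.can_avoid_duplicate_members):
        actions.append("retry")
    if (facts.workflow_state_persisted_and_valid
            and facts.intervention_marked_complete
            and not facts.drifted_beyond_saved_assumptions):
        actions.append("resume")
    # Reconcile needs an unambiguous desired-state model to compare against.
    if facts.desired_state_defined:
        actions.append("reconcile")
    return actions
```

Repair is omitted on purpose: the rule for repair concerns ownership and audit rather than a precondition that can be checked from these facts.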

Audit Requirements

Every recovery mutation must write an audit record including:

  1. actor identity,
  2. action,
  3. target runtime or member,
  4. result,
  5. correlation ID,
  6. previous recovery mode,
  7. next recovery mode,
  8. any reason code if applicable.
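An audit record carrying these fields might be shaped as follows. The `RecoveryAuditRecord` type, its field names, and the sample values are illustrative assumptions; the document does not prescribe a storage format or a serialization.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RecoveryAuditRecord:
    """Hypothetical record shape with one field per required audit item."""
    actor: str
    action: str
    target: str
    result: str
    correlation_id: str
    previous_mode: str
    next_mode: str
    reason_code: Optional[str] = None  # only present when applicable


# Sample values are made up for illustration.
record = RecoveryAuditRecord(
    actor="operator@example.com",
    action="mark_intervention_complete",
    target="app_instance_member/worker-3",
    result="accepted",
    correlation_id=str(uuid.uuid4()),
    previous_mode="manual_intervention",
    next_mode="none",
    reason_code="allocation_missing",
)

# Serialize with stable key order so records are easy to diff and search.
audit_line = json.dumps(asdict(record), sort_keys=True)
```

Recording both the previous and the next recovery mode makes each record a complete transition, so the recovery history can be rebuilt from the audit trail alone.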

UX Direction

The workload detail shell should surface:

  1. current phase,
  2. progress summary,
  3. health status,
  4. current recovery mode,
  5. next valid actions,
  6. clear operator message when manual intervention is required.

The shell should never make the operator infer recovery state only from logs.

API Direction

This document does not require a brand-new endpoint family before implementation.

What it does require is that existing app-runtime read models directionally support:

  1. top-level normalized recovery metadata on instances and members,
  2. explicit repair and reconcile action endpoints when introduced,
  3. durable operation records for every repair mutation.

Immediate Follow-On Work

  1. expose normalized recovery metadata in app-instance and member read models,
  2. define concrete repair endpoints for runtime/member operations,
  3. make workload detail surfaces show repair_available versus manual_intervention,
  4. keep Kubernetes and Slurm adapters inside this generic recovery contract.

Non-Goals

  1. final adapter-specific repair catalogs,
  2. final Kubernetes UI,
  3. replacing workload-specific logs with generic runtime state,
  4. automatic healing of every failure.

Related Documents

  1. doc/architecture/Kubernetes_Platform_Options_v1.md
  2. doc/product/Navigation_Redesign_App_Platform_v1.md
  3. doc/architecture/Slurm_App_Runtime_Adapter_v1.md
  4. doc/architecture/State_Machines.md