Kubernetes Runtime Reconcile and Repair v1

Purpose

Define safe lifecycle semantics for self-managed Kubernetes/RKE2 app runtimes so the product does not expose a misleading generic restart action.

This document applies to the current self-managed RKE2 validation product and to future Kubernetes-style runtimes that use the app-runtime worker contract.

Decision Summary

  1. Do not expose generic restart as the primary Kubernetes action.
  2. Use reconcile for non-destructive drift correction.
  3. Use repair for targeted service/runtime remediation that may restart or reinstall runtime services under a declared safe scope.
  4. Separate whole_runtime, control_plane, and member scopes.
  5. Every repair is a durable operation with step-level progress and report APIs.
  6. App-owned workers execute runtime-specific repair logic through the public app-runtime contract; platform core owns operation identity, audit, and authorization.

Why Not Restart

restart is ambiguous for Kubernetes:

  • single-node RKE2 server restart can be a local systemctl restart rke2-server,
  • multi-node control-plane restart needs ordering and quorum awareness,
  • agent restart may be safe while server restart is not,
  • whole-cluster restart can look successful even if the API never becomes usable,
  • restarting services does not fix many real drift cases such as bad kubeconfig delivery, stale member records, failed node join, certificate drift, or CNI readiness issues.

Therefore, POST /app-instances/{id}/restart may remain valid for simple runtimes, but Kubernetes/RKE2 runtimes should reject it with a conflict response (HTTP 409) and direct the UI to use reconcile or repair instead.
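The rejection described above can be sketched as a small handler. This is a minimal illustration, not the product's implementation: the runtime kind strings, the `restart_not_supported` error code, the `suggested_actions` field, and the 202 response for simple runtimes are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical runtime kinds for which generic restart is rejected.
RESTART_UNSAFE_RUNTIMES = {"rke2", "kubernetes"}


@dataclass(frozen=True)
class Response:
    status: int
    body: dict


def handle_restart(runtime_kind: str) -> Response:
    """Reject generic restart for Kubernetes-style runtimes with a conflict."""
    if runtime_kind in RESTART_UNSAFE_RUNTIMES:
        return Response(
            status=409,  # HTTP 409 Conflict
            body={
                "error": "restart_not_supported",  # assumed error code
                "suggested_actions": ["reconcile", "repair"],  # assumed field
            },
        )
    # Assumption: simple runtimes accept the restart asynchronously.
    return Response(status=202, body={"accepted": True})
```

Returning the suggested actions in the body lets the UI route the user to reconcile or repair without hard-coding runtime knowledge.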

Action Semantics

Reconcile

reconcile means:

  • inspect current app instance, members, allocations, node task state, and adapter-observed runtime state,
  • correct safe control-plane drift,
  • re-run idempotent status discovery,
  • re-deliver kubeconfig or endpoint metadata when safe,
  • retry failed-but-idempotent join/report steps.

It should avoid destructive runtime service restarts unless the runtime contract declares that step safe for the selected scope.

Repair

repair means:

  • perform reconcile prechecks,
  • execute one or more runtime-specific remediation steps,
  • restart/reinstall runtime service units only when allowed by scope and current cluster state,
  • report step-level progress,
  • leave partially repaired state visible for later retry or manual intervention.

Repair is not a hidden decommission/recreate. If teardown/rebuild is required, the action should fail with a deterministic reason and direct the user/operator to replace or decommission the affected member.

Scope Model

| Scope | RKE2 mapping | Allowed first-slice behavior |
|---|---|---|
| whole_runtime | entire app instance | Reconcile all members; repair only if single-server or adapter says quorum is safe. |
| control_plane | server member | Reconcile server state, kubeconfig, API health; repair server only when single-server or safe. |
| member | one server or agent member | Reconcile/repair the selected member. Agent repair is safer than server repair. |

Rules:

  1. member scope requires target_member_id.
  2. control_plane may use component_key=server when the user has not selected a target member.
  3. whole_runtime must not fan out destructive restarts across all members unless the adapter has proved safe ordering.
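The three rules above can be sketched as request validation. The `RepairTarget` type and both function names are hypothetical. Defaulting `component_key` to `server` is one reading of rule 2, which only says the key may be used.

```python
from dataclasses import dataclass, replace
from typing import Optional

SCOPES = ("whole_runtime", "control_plane", "member")


@dataclass(frozen=True)
class RepairTarget:  # hypothetical request shape
    scope: str
    target_member_id: Optional[str] = None
    component_key: Optional[str] = None


def resolve_target(target: RepairTarget) -> RepairTarget:
    """Apply rules 1 and 2; raise ValueError on an invalid target."""
    if target.scope not in SCOPES:
        raise ValueError(f"unknown scope: {target.scope}")
    # Rule 1: member scope requires an explicit member.
    if target.scope == "member" and not target.target_member_id:
        raise ValueError("member scope requires target_member_id")
    # Rule 2 (assumed default): fall back to component_key=server
    # when no member was selected for control_plane.
    if (
        target.scope == "control_plane"
        and not target.target_member_id
        and target.component_key is None
    ):
        return replace(target, component_key="server")
    return target


def whole_runtime_fan_out_allowed(adapter_proved_safe_ordering: bool) -> bool:
    """Rule 3: destructive fan-out needs adapter-proven safe ordering."""
    return adapter_proved_safe_ordering
```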

Operation Contract

New API surfaces:

  • POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/repair
  • GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/repair-operations
  • GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/repair-operations/{repair_operation_id}
  • POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/repair-operations/{repair_operation_id}/report

The operation is intentionally separate from app-instance member add/drain/remove operations. Add/drain/remove change topology. Repair/reconcile correct drift in the existing topology.

Step Progress

Repair operations carry ordered steps:

  1. precheck
  2. runtime_state_discovery
  3. member_health_check
  4. service_reconcile
  5. certificate_or_kubeconfig_reconcile
  6. endpoint_reconcile
  7. post_repair_health_check
  8. report_status

Adapters may add more specific step names, but these names should be preferred where they fit so V3 can present consistent progress.

Step statuses:

  • pending
  • running
  • succeeded
  • failed
  • skipped

The UI should show step-level progress before showing raw adapter logs.
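As an illustration of the step model, the sketch below builds the baseline ordered step list and picks the step a UI might highlight. The step names and statuses come from the lists above. `RepairStep`, `new_step_list`, and `current_step` are hypothetical names, and "first step that is neither succeeded nor skipped" is an assumed definition of the current step.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StepStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    SKIPPED = "skipped"


BASELINE_STEPS = (
    "precheck",
    "runtime_state_discovery",
    "member_health_check",
    "service_reconcile",
    "certificate_or_kubeconfig_reconcile",
    "endpoint_reconcile",
    "post_repair_health_check",
    "report_status",
)


@dataclass
class RepairStep:
    name: str
    status: StepStatus = StepStatus.PENDING


def new_step_list() -> list[RepairStep]:
    """Create the baseline steps in order, all pending."""
    return [RepairStep(name) for name in BASELINE_STEPS]


def current_step(steps: list[RepairStep]) -> Optional[RepairStep]:
    """Return the first step that is not yet succeeded or skipped."""
    done = (StepStatus.SUCCEEDED, StepStatus.SKIPPED)
    return next((s for s in steps if s.status not in done), None)
```

Because a failed step stays in the list with its own status, partial progress remains visible instead of collapsing into one operation-level error.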

Failure Behavior

Baseline failure rules:

  1. A failed repair operation does not automatically fail the app instance.
  2. If the runtime remains usable with reduced capacity, report instance health as degraded.
  3. If the API/control plane is unusable and no safe repair remains, report instance health as failed with repair_available=false in runtime state.
  4. Partial success is reported in operation steps; do not collapse it to a single generic error.
  5. Retrying the same operation with the same idempotency key returns the same operation. Retrying with a new key creates a new operation that can reuse the previous operation's observed state.
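Rule 5 can be sketched with an in-memory store. This is an assumption-laden illustration: `RepairOperationStore`, the `op-N` identifiers, and copying the most recent operation's `observed_state` into a new operation are hypothetical choices, and a real implementation would need durable storage.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RepairOperation:
    id: str
    idempotency_key: str
    observed_state: dict = field(default_factory=dict)


class RepairOperationStore:
    """In-memory stand-in for durable repair operation storage."""

    def __init__(self) -> None:
        self._by_key: dict[str, RepairOperation] = {}
        self._latest: Optional[RepairOperation] = None
        self._ids = itertools.count(1)

    def create(self, idempotency_key: str) -> RepairOperation:
        # Same key: return the existing operation unchanged.
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            return existing
        # New key: new operation, seeded from the previous observed state.
        seed = dict(self._latest.observed_state) if self._latest else {}
        op = RepairOperation(
            id=f"op-{next(self._ids)}",
            idempotency_key=idempotency_key,
            observed_state=seed,
        )
        self._by_key[idempotency_key] = op
        self._latest = op
        return op
```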

Single-Node RKE2

For the current first slice:

  • whole_runtime/reconcile: safe.
  • control_plane/reconcile: safe.
  • member/reconcile for server: safe.
  • control_plane/repair: may restart rke2-server after prechecks.
  • whole_runtime/repair: equivalent to control_plane/repair only when there is exactly one server and no HA control plane.
  • member/repair for agent: may restart rke2-agent and retry join.

Multi-Node RKE2

Until HA/quorum policy is defined:

  • agent reconcile/repair is allowed,
  • control-plane reconcile is allowed,
  • control-plane repair that restarts a server should be blocked unless the adapter reports safe quorum and ordering,
  • whole-runtime repair should be non-destructive by default.
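The single-node and multi-node rules above can be condensed into one policy check. This is a sketch under stated assumptions: the function name and parameters are hypothetical. Reconcile is treated as never restarting services, which is stricter than the contract (it permits steps the runtime declares safe). Server repair under member scope is treated the same as control_plane repair.

```python
from typing import Optional


def destructive_restart_allowed(
    action: str,
    scope: str,
    server_count: int,
    adapter_reports_safe: bool = False,
    member_role: Optional[str] = None,
) -> bool:
    """Decide whether an operation may restart a runtime service unit."""
    if action == "reconcile":
        return False  # simplification: reconcile stays non-destructive
    if action != "repair":
        raise ValueError(f"unknown action: {action}")
    if scope not in ("whole_runtime", "control_plane", "member"):
        raise ValueError(f"unknown scope: {scope}")
    if scope == "member":
        if member_role not in ("server", "agent"):
            raise ValueError("member scope requires member_role")
        if member_role == "agent":
            return True  # agent restart and join retry are allowed
    # Remaining cases touch a server: allowed for exactly one server,
    # or when the adapter reports safe quorum and ordering.
    return server_count == 1 or adapter_reports_safe
```

For example, `destructive_restart_allowed("repair", "control_plane", server_count=3)` is false until the adapter reports safe quorum and ordering, which matches the multi-node default.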

Audit and Authorization

Repair creation is a privileged runtime mutation and must write audit records:

| Action | Target |
|---|---|
| app_instance.repair.requested | app_instance:{id} |
| app_instance.repair.reported | app_instance_repair_operation:{id} |

Human actors use normal project/app permissions. App-owned workers report only through scoped service-account or shared-runtime operator tokens.
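A minimal sketch of how the two audit actions map to their targets. The `AuditRecord` shape, the `actor` field format, and the function name are hypothetical. Only the action names and target prefixes come from the audit table above.

```python
from dataclasses import dataclass

# Action name -> target kind, as listed in the audit table.
AUDIT_TARGET_KINDS = {
    "app_instance.repair.requested": "app_instance",
    "app_instance.repair.reported": "app_instance_repair_operation",
}


@dataclass(frozen=True)
class AuditRecord:  # hypothetical record shape
    action: str
    target: str
    actor: str


def build_audit_record(action: str, target_id: str, actor: str) -> AuditRecord:
    """Build an audit record; unknown actions raise KeyError."""
    kind = AUDIT_TARGET_KINDS[action]
    return AuditRecord(action=action, target=f"{kind}:{target_id}", actor=actor)
```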

UI Guidance

V3 should present:

  • Reconcile for safe drift correction,
  • Repair server or Repair agent for scoped remediation,
  • Repair runtime only when the adapter reports whole-runtime repair is safe,
  • no generic Restart for self-managed RKE2 unless the specific runtime reports deterministic restart support.

If an operation is running, the UI should show:

  • operation status,
  • current phase,
  • step list,
  • latest deterministic error code/message,
  • next safe action.

Related Documents

  • doc/architecture/Self_Managed_RKE2_First_Slice_v1.md
  • doc/architecture/App_Runtime_External_Worker_Contract_v1.md
  • doc/architecture/Clustered_App_Model_v1.md
  • doc/architecture/Allocation_Group_Model_v1.md