Kubernetes Runtime Reconcile and Repair v1
Purpose
Define safe lifecycle semantics for self-managed Kubernetes/RKE2 app runtimes so
the product does not expose a misleading generic restart action.
This document applies to the current self-managed RKE2 validation product and to future Kubernetes-style runtimes that use the app-runtime worker contract.
Decision Summary
- Do not expose generic restart as the primary Kubernetes action.
- Use `reconcile` for non-destructive drift correction.
- Use `repair` for targeted service/runtime remediation that may restart or reinstall runtime services under a declared safe scope.
- Separate `whole_runtime`, `control_plane`, and `member` scopes.
- Every repair is a durable operation with step-level progress and report APIs.
- App-owned workers execute runtime-specific repair logic through the public app-runtime contract; platform core owns operation identity, audit, and authorization.
Why Not Restart
`restart` is ambiguous for Kubernetes:
- single-node RKE2 server restart can be a local `systemctl restart rke2-server`,
- multi-node control-plane restart needs ordering and quorum awareness,
- agent restart may be safe while server restart is not,
- whole-cluster restart can look successful even if the API never becomes usable,
- restarting services does not fix many real drift cases such as bad kubeconfig delivery, stale member records, failed node join, certificate drift, or CNI readiness issues.
Therefore `POST /app-instances/{id}/restart` may be valid for simple runtimes, but
Kubernetes/RKE2 should reject it with a conflict and tell the UI to use `repair` or
`reconcile`.
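A minimal sketch of that rejection follows. It assumes the "conflict" is an HTTP 409 response and that runtimes are identified by a `runtime_kind` string. The function name, the runtime kind values, the error code, the 202 success status, and the response shape are all hypothetical and are not taken from the platform API.

```python
from dataclasses import dataclass, field

# Hypothetical runtime kinds that must not accept a generic restart.
RESTART_UNSUPPORTED_KINDS = {"rke2", "kubernetes"}


@dataclass
class ActionResponse:
    status: int
    body: dict = field(default_factory=dict)


def handle_restart(runtime_kind: str) -> ActionResponse:
    """Reject generic restart for Kubernetes-style runtimes (assumed HTTP 409)."""
    if runtime_kind in RESTART_UNSUPPORTED_KINDS:
        return ActionResponse(
            status=409,
            body={
                # Hypothetical deterministic error code for the UI to key on.
                "error_code": "restart_not_supported",
                # Tell the UI which actions to offer instead.
                "suggested_actions": ["reconcile", "repair"],
            },
        )
    # Simple runtimes may still accept restart; 202 is an assumption.
    return ActionResponse(status=202, body={"accepted": True})
```

Returning the suggested actions in the body lets the UI redirect the user without hard-coding per-runtime knowledge.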
Action Semantics
Reconcile
`reconcile` means:
- inspect current app instance, members, allocations, node task state, and adapter-observed runtime state,
- correct safe control-plane drift,
- re-run idempotent status discovery,
- re-deliver kubeconfig or endpoint metadata when safe,
- retry failed-but-idempotent join/report steps.
It should avoid destructive runtime service restarts unless the runtime contract declares that step safe for the selected scope.
Repair
`repair` means:
- perform reconcile prechecks,
- execute one or more runtime-specific remediation steps,
- restart/reinstall runtime service units only when allowed by scope and current cluster state,
- report step-level progress,
- leave partially repaired state visible for later retry or manual intervention.
Repair is not a hidden decommission/recreate. If teardown/rebuild is required, the action should fail with a deterministic reason and direct the user/operator to replace or decommission the affected member.
Scope Model
| Scope | RKE2 mapping | Allowed first-slice behavior |
|---|---|---|
| `whole_runtime` | entire app instance | Reconcile all members; repair only if single-server or adapter says quorum is safe. |
| `control_plane` | `server` member | Reconcile server state, kubeconfig, API health; repair server only when single-server or safe. |
| `member` | one server or agent member | Reconcile/repair the selected member. Agent repair is safer than server repair. |
Rules:
- `member` scope requires `target_member_id`.
- `control_plane` may use `component_key=server` when the target member is not selected by the user.
- `whole_runtime` must not fan out destructive restarts across all members unless the adapter has proved safe ordering.
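The first two rules can be sketched as a request validator. This is an illustration only: the `RepairRequest` shape, the `validate_scope` helper, and the error strings are hypothetical. Only the scope names, `target_member_id`, and `component_key=server` come from this document. The fan-out rule for `whole_runtime` depends on adapter state and is not checked here.

```python
from dataclasses import dataclass
from typing import List, Optional

SCOPES = {"whole_runtime", "control_plane", "member"}


@dataclass
class RepairRequest:
    scope: str
    target_member_id: Optional[str] = None
    component_key: Optional[str] = None


def validate_scope(req: RepairRequest) -> List[str]:
    """Return validation errors; an empty list means the request is acceptable."""
    if req.scope not in SCOPES:
        return [f"unknown scope: {req.scope}"]
    errors = []
    if req.scope == "member" and not req.target_member_id:
        errors.append("member scope requires target_member_id")
    if req.scope == "control_plane" and not req.target_member_id:
        # Assumption: without a selected member, only component_key=server
        # (or no component key at all) is meaningful for this scope.
        if req.component_key not in (None, "server"):
            errors.append("control_plane without target_member_id expects component_key=server")
    return errors
```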
Operation Contract
New API surfaces:
- `POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/repair`
- `GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/repair-operations`
- `GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/repair-operations/{repair_operation_id}`
- `POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/repair-operations/{repair_operation_id}/report`
The operation is intentionally separate from app-instance member add/drain/remove operations. Add/drain/remove change topology. Repair/reconcile correct drift in the existing topology.
Step Progress
Repair operations carry ordered steps:
- `precheck`
- `runtime_state_discovery`
- `member_health_check`
- `service_reconcile`
- `certificate_or_kubeconfig_reconcile`
- `endpoint_reconcile`
- `post_repair_health_check`
- `report_status`
Adapters may add more specific step names, but these names should be preferred where they fit so V3 can present consistent progress.
Step statuses:
- `pending`
- `running`
- `succeeded`
- `failed`
- `skipped`
The UI should show step-level progress before showing raw adapter logs.
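One way to model step progress is sketched below. The step names and statuses are the ones listed in this section. The `Step` class and the `new_steps`, `set_status`, and `current_step` helpers are hypothetical, and the rule that the "current" step is the first one that is neither `succeeded` nor `skipped` is an assumption about what the UI would highlight.

```python
from dataclasses import dataclass
from typing import List, Optional

PREFERRED_STEPS = [
    "precheck",
    "runtime_state_discovery",
    "member_health_check",
    "service_reconcile",
    "certificate_or_kubeconfig_reconcile",
    "endpoint_reconcile",
    "post_repair_health_check",
    "report_status",
]
STEP_STATUSES = {"pending", "running", "succeeded", "failed", "skipped"}


@dataclass
class Step:
    name: str
    status: str = "pending"


def new_steps() -> List[Step]:
    """Create the preferred steps in order, all pending."""
    return [Step(name) for name in PREFERRED_STEPS]


def set_status(steps: List[Step], name: str, status: str) -> None:
    """Update one step, rejecting statuses outside the documented set."""
    if status not in STEP_STATUSES:
        raise ValueError(f"unknown step status: {status}")
    for step in steps:
        if step.name == name:
            step.status = status
            return
    raise KeyError(name)


def current_step(steps: List[Step]) -> Optional[Step]:
    """Return the first step that is not finished, or None when all are done."""
    for step in steps:
        if step.status not in ("succeeded", "skipped"):
            return step
    return None
```

Because a `failed` step is never treated as finished, it stays the current step, which keeps partial progress visible for a later retry.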
Failure Behavior
Baseline failure rules:
- A failed repair operation does not automatically fail the app instance.
- If the runtime remains usable with reduced capacity, report instance health as `degraded`.
- If the API/control plane is unusable and no safe repair remains, report instance health as `failed` with `repair_available=false` in runtime state.
- Partial success is reported in operation steps; do not collapse it to a single generic error.
- Retrying the same operation with the same idempotency key returns the same operation. Retrying with a new key creates a new operation that can reuse the previous operation's observed state.
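The idempotency rule could look roughly like the sketch below. It uses an in-memory dictionary as a stand-in for durable storage. The `RepairOperationStore` class, its method names, the key layout, and the choice to copy the previous operation's observed state into the new one are assumptions for illustration; the document only says the new operation can reuse that state.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class RepairOperation:
    operation_id: str
    idempotency_key: str
    observed_state: dict = field(default_factory=dict)


class RepairOperationStore:
    """In-memory stand-in for durable storage of repair operations."""

    def __init__(self):
        self._by_key = {}   # (app_instance_id, idempotency_key) -> operation
        self._latest = {}   # app_instance_id -> most recently created operation

    def create_or_get(self, app_instance_id: str, idempotency_key: str) -> RepairOperation:
        key = (app_instance_id, idempotency_key)
        existing = self._by_key.get(key)
        if existing is not None:
            # Same key: return the same operation instead of creating a new one.
            return existing
        previous = self._latest.get(app_instance_id)
        op = RepairOperation(
            operation_id=str(uuid.uuid4()),
            idempotency_key=idempotency_key,
            # New key: seed from a copy of the previous observed state (assumption).
            observed_state=dict(previous.observed_state) if previous else {},
        )
        self._by_key[key] = op
        self._latest[app_instance_id] = op
        return op
```

Copying rather than sharing the observed state keeps the earlier operation's record unchanged when the new operation updates its own view.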
Single-Node RKE2
For the current first slice:
- `whole_runtime/reconcile`: safe.
- `control_plane/reconcile`: safe.
- `member/reconcile` for server: safe.
- `control_plane/repair`: may restart `rke2-server` after prechecks.
- `whole_runtime/repair`: equivalent to `control_plane/repair` only when there is exactly one server and no HA control plane.
- `member/repair` for agent: may restart `rke2-agent` and retry join.
Multi-Node RKE2
Until HA/quorum policy is defined:
- agent reconcile/repair is allowed,
- control-plane reconcile is allowed,
- control-plane repair that restarts a server should be blocked unless the adapter reports safe quorum and ordering,
- whole-runtime repair should be non-destructive by default.
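The single-node and multi-node rules can be combined into one decision function, sketched here. `ClusterFacts`, `adapter_reports_safe_quorum`, and `is_destructive_repair_allowed` are hypothetical names. Treating a `member`-scoped repair of a server like a `control_plane` repair is an assumption, since the document does not state that case explicitly. Reconcile is not modeled because it is allowed in every case listed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterFacts:
    server_count: int
    # Hypothetical adapter signal: quorum and restart ordering are proven safe.
    adapter_reports_safe_quorum: bool = False


def is_destructive_repair_allowed(
    scope: str, member_role: Optional[str], facts: ClusterFacts
) -> bool:
    """Decide whether a repair may restart runtime services for this scope."""
    if scope == "member" and member_role == "agent":
        # Agent repair is allowed for both single-node and multi-node runtimes.
        return True
    if scope in ("control_plane", "whole_runtime") or member_role == "server":
        # Server restarts need exactly one server or an adapter safety signal.
        return facts.server_count == 1 or facts.adapter_reports_safe_quorum
    return False
```

When the function returns `False`, the operation would still run its non-destructive reconcile steps; only the service restart or reinstall steps are withheld.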
Audit and Authorization
Repair creation is a privileged runtime mutation and must write audit:
| Action | Target |
|---|---|
| `app_instance.repair.requested` | `app_instance:{id}` |
| `app_instance.repair.reported` | `app_instance_repair_operation:{id}` |
Human actors use normal project/app permissions. App-owned workers report only through scoped service-account or shared-runtime operator tokens.
UI Guidance
V3 should present:
- `Reconcile` for safe drift correction,
- `Repair server` or `Repair agent` for scoped remediation,
- `Repair runtime` only when the adapter reports whole-runtime repair is safe,
- no generic `Restart` for self-managed RKE2 unless the specific runtime reports deterministic restart support.
If an operation is running, the UI should show:
- operation status,
- current phase,
- step list,
- latest deterministic error code/message,
- next safe action.
Related Documents
- `doc/architecture/Self_Managed_RKE2_First_Slice_v1.md`
- `doc/architecture/App_Runtime_External_Worker_Contract_v1.md`
- `doc/architecture/Clustered_App_Model_v1.md`
- `doc/architecture/Allocation_Group_Model_v1.md`