Kubernetes Runtime Reconcile and Repair v1

Purpose

Define safe lifecycle semantics for self-managed Kubernetes/RKE2 app runtimes so the product does not expose a misleading generic restart action.

This document applies to the current self-managed RKE2 validation product and to future Kubernetes-style runtimes that use the app-runtime worker contract.

Decision Summary

Do not expose generic restart as the primary Kubernetes action.
Use reconcile for non-destructive drift correction.
Use repair for targeted service/runtime remediation that may restart or reinstall runtime services under a declared safe scope.
Separate whole_runtime, control_plane, and member scopes.
Every repair is a durable operation with step-level progress and report APIs.
App-owned workers execute runtime-specific repair logic through the public app-runtime contract; platform core owns operation identity, audit, and authorization.

Why Not Restart

restart is ambiguous for Kubernetes:

single-node RKE2 server restart can be a local systemctl restart rke2-server,
multi-node control-plane restart needs ordering and quorum awareness,
agent restart may be safe while server restart is not,
whole-cluster restart can look successful even if the API never becomes usable,
restarting services does not fix many real drift cases such as bad kubeconfig delivery, stale member records, failed node join, certificate drift, or CNI readiness issues.

Therefore, POST /app-instances/{id}/restart may be valid for simple runtimes, but Kubernetes/RKE2 runtimes should reject it with a conflict response and direct the UI to use repair or reconcile instead.
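A minimal sketch of that rejection, assuming the conflict is returned as HTTP 409. The runtime kind names, the `restart_not_supported` error code, the `suggested_actions` field, and the 202 response for simple runtimes are all hypothetical and not defined by this document.

```python
from dataclasses import dataclass

# Hypothetical runtime kinds; the document names only Kubernetes/RKE2 explicitly.
KUBERNETES_STYLE_RUNTIMES = {"rke2", "kubernetes"}


@dataclass(frozen=True)
class ApiResponse:
    status: int
    body: dict


def handle_restart(runtime_kind: str) -> ApiResponse:
    """Reject generic restart for Kubernetes-style runtimes, accept it otherwise."""
    if runtime_kind in KUBERNETES_STYLE_RUNTIMES:
        return ApiResponse(
            status=409,  # assumption: "conflict" maps to HTTP 409
            body={
                "error_code": "restart_not_supported",  # hypothetical code
                "suggested_actions": ["reconcile", "repair"],
            },
        )
    # Simple runtimes may still support a plain restart.
    return ApiResponse(status=202, body={"accepted": True})
```

Returning the suggested actions in the response body lets the UI switch to the right control without hard-coding runtime knowledge.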

Action Semantics

Reconcile

reconcile means:

inspect current app instance, members, allocations, node task state, and adapter-observed runtime state,
correct safe control-plane drift,
re-run idempotent status discovery,
re-deliver kubeconfig or endpoint metadata when safe,
retry failed-but-idempotent join/report steps.

It should avoid destructive runtime service restarts unless the runtime contract declares that step safe for the selected scope.

Repair

repair means:

perform reconcile prechecks,
execute one or more runtime-specific remediation steps,
restart/reinstall runtime service units only when allowed by scope and current cluster state,
report step-level progress,
leave partially repaired state visible for later retry or manual intervention.

Repair is not a hidden decommission/recreate. If teardown/rebuild is required, the action should fail with a deterministic reason and direct the user/operator to replace or decommission the affected member.
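One way to keep repair from silently turning into a rebuild is to classify precheck findings before any remediation step runs. The sketch below is illustrative only: the finding names, the `rebuild_required:` reason format, and the `RepairPlan` type are assumptions, not part of the app-runtime contract.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical findings that only a teardown/rebuild could fix.
REBUILD_ONLY_FINDINGS = {"datastore_lost", "node_disk_wiped"}


@dataclass
class RepairPlan:
    allowed: bool
    steps: list = field(default_factory=list)
    failure_reason: Optional[str] = None
    next_action: Optional[str] = None


def plan_repair(findings: set) -> RepairPlan:
    """Refuse repair when any finding can only be fixed by teardown/rebuild."""
    blocking = sorted(findings & REBUILD_ONLY_FINDINGS)
    if blocking:
        # Deterministic reason: always the alphabetically first blocking finding.
        return RepairPlan(
            allowed=False,
            failure_reason=f"rebuild_required:{blocking[0]}",
            next_action="replace_or_decommission_member",
        )
    # Each remaining finding becomes one remediation step, in a stable order.
    return RepairPlan(allowed=True, steps=[f"remediate:{f}" for f in sorted(findings)])
```

Sorting the findings keeps the failure reason stable across retries, so the operator sees the same message each time.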

Scope Model

Scope	RKE2 mapping	Allowed first-slice behavior
`whole_runtime`	entire app instance	Reconcile all members; repair only if single-server or adapter says quorum is safe.
`control_plane`	`server` member	Reconcile server state, kubeconfig, API health; repair server only when single-server or safe.
`member`	one server or agent member	Reconcile/repair the selected member. Agent repair is safer than server repair.

Rules:

member scope requires target_member_id.
control_plane may use component_key=server when the target member is not selected by the user.
whole_runtime must not fan out destructive restarts across all members unless the adapter has proved safe ordering.
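The first two rules can be checked when a request is accepted. The following is a sketch under the assumption that a request carries `scope`, `target_member_id`, and `component_key` fields; the function name and the returned dictionary shape are illustrative.

```python
from typing import Optional

SCOPES = {"whole_runtime", "control_plane", "member"}


def validate_scope(scope: str, target_member_id: Optional[str] = None,
                   component_key: Optional[str] = None) -> dict:
    """Normalize a repair/reconcile target; raise ValueError on an invalid one."""
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope}")
    if scope == "member" and not target_member_id:
        raise ValueError("member scope requires target_member_id")
    if scope == "control_plane" and not target_member_id:
        # No member picked by the user: address the control plane by component.
        component_key = component_key or "server"
    return {"scope": scope, "target_member_id": target_member_id,
            "component_key": component_key}
```

The third rule (no destructive fan-out for `whole_runtime`) depends on adapter-reported state and is therefore left to the adapter rather than to request validation.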

Operation Contract

New API surfaces:

POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/repair
GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/repair-operations
GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/repair-operations/{repair_operation_id}
POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/repair-operations/{repair_operation_id}/report

The operation is intentionally separate from app-instance member add/drain/remove operations. Add/drain/remove change topology. Repair/reconcile correct drift in the existing topology.
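For client code, the four routes listed above can be derived from one base path. This helper is a convenience sketch; the function name and the dictionary keys (`create`, `list`, `get`, `report`) are hypothetical.

```python
BASE = "/api/v1/projects/{project_id}/app-instances/{app_instance_id}"


def repair_routes(project_id: str, app_instance_id: str,
                  repair_operation_id: str) -> dict:
    """Return (method, path) pairs for the four repair API surfaces."""
    base = BASE.format(project_id=project_id, app_instance_id=app_instance_id)
    operation = f"{base}/repair-operations/{repair_operation_id}"
    return {
        "create": ("POST", f"{base}/repair"),
        "list": ("GET", f"{base}/repair-operations"),
        "get": ("GET", operation),
        "report": ("POST", f"{operation}/report"),
    }
```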

Step Progress

Repair operations carry ordered steps:

precheck
runtime_state_discovery
member_health_check
service_reconcile
certificate_or_kubeconfig_reconcile
endpoint_reconcile
post_repair_health_check
report_status

Adapters may add more specific step names, but these names should be preferred where they fit so V3 can present consistent progress.

Step statuses:

pending
running
succeeded
failed
skipped

The UI should show step-level progress before showing raw adapter logs.
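The step names and statuses above map naturally onto a small ordered structure. The sketch below is one possible in-memory shape; the `Step` type and the helper functions are assumptions, and `current_step` uses a simple rule (the first step that is not yet succeeded or skipped) that the document does not prescribe.

```python
from dataclasses import dataclass
from typing import Optional

PREFERRED_STEPS = [
    "precheck",
    "runtime_state_discovery",
    "member_health_check",
    "service_reconcile",
    "certificate_or_kubeconfig_reconcile",
    "endpoint_reconcile",
    "post_repair_health_check",
    "report_status",
]
STEP_STATUSES = {"pending", "running", "succeeded", "failed", "skipped"}


@dataclass
class Step:
    name: str
    status: str = "pending"


def new_steps() -> list:
    """Create the preferred steps in order, all pending."""
    return [Step(name) for name in PREFERRED_STEPS]


def set_status(steps: list, name: str, status: str) -> None:
    """Update one step by name; reject unknown statuses and step names."""
    if status not in STEP_STATUSES:
        raise ValueError(f"unknown step status: {status}")
    for step in steps:
        if step.name == name:
            step.status = status
            return
    raise KeyError(name)


def current_step(steps: list) -> Optional[str]:
    """Return the first step that is not finished, or None when all are done."""
    for step in steps:
        if step.status not in {"succeeded", "skipped"}:
            return step.name
    return None
```

Because a failed step is also "not finished", `current_step` keeps pointing at it, which leaves the failure visible for a later retry.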

Failure Behavior

Baseline failure rules:

A failed repair operation does not automatically fail the app instance.
If the runtime remains usable with reduced capacity, report instance health as degraded.
If the API/control plane is unusable and no safe repair remains, report instance health as failed with repair_available=false in runtime state.
Partial success is reported in operation steps; do not collapse it to a single generic error.
Retrying the same operation with the same idempotency key returns the same operation. Retrying with a new key creates a new operation that can reuse the previous operation's observed state.
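The idempotency and health rules above can be sketched as follows. The in-memory store stands in for durable storage, and the field names are illustrative. Two points are assumptions rather than rules from this document: the `healthy` label, and reporting `failed` with `repair_available` still true when the API is unusable but a safe repair remains.

```python
import uuid
from typing import Optional


class RepairOperationStore:
    """In-memory stand-in for durable storage (assumption for this sketch)."""

    def __init__(self) -> None:
        self._by_key: dict = {}

    def create(self, app_instance_id: str, idempotency_key: str,
               previous: Optional[dict] = None) -> dict:
        key = (app_instance_id, idempotency_key)
        if key in self._by_key:
            # Same key: return the existing operation instead of a new one.
            return self._by_key[key]
        operation = {
            "id": str(uuid.uuid4()),
            "app_instance_id": app_instance_id,
            # A new key may start from what the previous operation observed.
            "observed_state": dict(previous["observed_state"]) if previous else {},
        }
        self._by_key[key] = operation
        return operation


def instance_health(api_usable: bool, reduced_capacity: bool,
                    safe_repair_remaining: bool) -> dict:
    """Derive instance health from runtime observations, not from operation status."""
    if api_usable:
        # "healthy" is a hypothetical label; the document names only degraded and failed.
        return {"health": "degraded" if reduced_capacity else "healthy",
                "repair_available": safe_repair_remaining}
    # An unusable API with a safe repair left is not specified by the document;
    # this sketch reports failed but keeps repair_available true in that case.
    return {"health": "failed", "repair_available": safe_repair_remaining}
```

Note that `instance_health` takes no operation status as input, which reflects the rule that a failed repair operation does not by itself fail the app instance.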

Single-Node RKE2

For the current first slice:

whole_runtime/reconcile: safe.
control_plane/reconcile: safe.
member/reconcile for server: safe.
control_plane/repair: may restart rke2-server after prechecks.
whole_runtime/repair: equivalent to control_plane/repair only when there is exactly one server and no HA control plane.
member/repair for agent: may restart rke2-agent and retry join.

Multi-Node RKE2

Until HA/quorum policy is defined:

agent reconcile/repair is allowed,
control-plane reconcile is allowed,
control-plane repair that restarts a server should be blocked unless the adapter reports safe quorum and ordering,
whole-runtime repair should be non-destructive by default.
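Taken together, the single-node and multi-node rules reduce to one gate on destructive service restarts. The sketch below assumes the adapter exposes a single boolean for "safe quorum and ordering" and that the server count is known; the function and parameter names are hypothetical.

```python
from typing import Optional

SCOPES = {"whole_runtime", "control_plane", "member"}


def may_restart_services(scope: str, server_count: int,
                         adapter_reports_safe: bool,
                         member_role: Optional[str] = None) -> bool:
    """Decide whether a repair may restart runtime services for this scope."""
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope}")
    if scope == "member" and member_role == "agent":
        # Agent repair is allowed in both single-node and multi-node setups.
        return True
    # Server restarts (control_plane, whole_runtime, or a server member) need
    # either exactly one server or an adapter-reported safe quorum and ordering.
    return server_count == 1 or adapter_reports_safe
```

A member with an unknown role is treated like a server here, which errs on the side of blocking the restart.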

Audit and Authorization

Repair creation is a privileged runtime mutation and must write audit records:

Action	Target
`app_instance.repair.requested`	`app_instance:{id}`
`app_instance.repair.reported`	`app_instance_repair_operation:{id}`

Human actors use normal project/app permissions. App-owned workers report only through scoped service-account or shared-runtime operator tokens.
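As an illustration, the two audit actions and their targets could be built by a helper like the one below. Only the action names and target formats come from the table above; the `actor` and `at` fields and the function name are assumptions.

```python
from datetime import datetime, timezone

# Action name -> target template, taken from the audit table.
AUDIT_TARGETS = {
    "app_instance.repair.requested": "app_instance:{id}",
    "app_instance.repair.reported": "app_instance_repair_operation:{id}",
}


def audit_event(action: str, actor: str, target_id: str) -> dict:
    """Build an audit record for a repair action; reject unknown actions."""
    if action not in AUDIT_TARGETS:
        raise ValueError(f"unknown audit action: {action}")
    return {
        "action": action,
        "actor": actor,  # hypothetical field: human user or worker token identity
        "target": AUDIT_TARGETS[action].format(id=target_id),
        "at": datetime.now(timezone.utc).isoformat(),  # hypothetical field
    }
```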

UI Guidance

V3 should present:

Reconcile for safe drift correction,
Repair server or Repair agent for scoped remediation,
Repair runtime only when the adapter reports whole-runtime repair is safe,
no generic Restart for self-managed RKE2 unless the specific runtime reports deterministic restart support.

If an operation is running, the UI should show:

operation status,
current phase,
step list,
latest deterministic error code/message,
next safe action.

doc/architecture/Self_Managed_RKE2_First_Slice_v1.md
doc/architecture/App_Runtime_External_Worker_Contract_v1.md
doc/architecture/Clustered_App_Model_v1.md
doc/architecture/Allocation_Group_Model_v1.md