Self-Managed RKE2 First Slice v1

As of: April 10, 2026

Purpose

Define the first implementation slice for Kubernetes on GPUaaS after the distributed-workload prerequisites are in place.

This is intentionally narrow. It is the validation slice for: 1. app-runtime member topology, 2. ordered bootstrap, 3. kubeconfig delivery, 4. workload-shell integration.

It is not the managed-Kubernetes product.

Current Conclusion

Self-managed RKE2 has reached a first usable validation state.

Working:

  1. deploy a single-server RKE2 cluster into an existing allocation,
  2. expose kubeconfig and API endpoint information through the workload shell,
  3. add an agent allocation through member operations,
  4. drain/remove an agent through member operations,
  5. re-add an agent and verify both nodes with native kubectl get nodes,
  6. stop/start the server service for the simple single-server lifecycle case,
  7. platform-control validation of a stable running workload state for the first single-server path.

Important product caveat:

  1. RKE2 is infrastructure lifecycle exposed through the app shell.
  2. Treating it as an app is useful for validating the app-runtime framework, but it should not be marketed as managed Kubernetes.
  3. The platform is currently hiding infra setup behind an app abstraction; that is acceptable for the validation slice, but the long-term product boundary needs to be explicit.

Product judgment:

  1. RKE2 is usable for platform validation and early tenant-dedicated experiments.
  2. RKE2 still needs substantially more product and infra design than Slurm before it can be called a mature Kubernetes offering.
  3. The next work is not just UI polish; it includes endpoint exposure, storage/CSI, kubeconfig privilege, repair/reconcile semantics, and runtime cleanup.

Scope

In scope:

  1. a self-managed RKE2 app offering,
  2. one server member and zero or more agent members,
  3. platform-owned bootstrap execution,
  4. kubeconfig delivery through the workload shell,
  5. workload detail/member visibility for cluster creation.

Out of scope:

  1. Rancher integration,
  2. managed upgrades,
  3. cluster autoscaling,
  4. embedded dashboard UI,
  5. HA control-plane topologies beyond the first-slice contract,
  6. persistent workload storage and CSI installation,
  7. public/private endpoint exposure beyond the current kubeconfig path,
  8. managed Kubernetes SRE responsibilities.

Product Statement

The first Kubernetes product should be labeled explicitly as:

Self-managed Kubernetes (RKE2)

GPUaaS will promise:

  1. node placement,
  2. ordered cluster bootstrap,
  3. kubeconfig delivery,
  4. basic workload/member status.

GPUaaS will not promise:

  1. managed cluster operations,
  2. SRE ownership of cluster health,
  3. in-place version upgrades,
  4. embedded cluster dashboard,
  5. storage classes or persistent volumes until the storage model is defined,
  6. a stable external endpoint model until infra confirms the exposure boundary.

Runtime Shape

The app instance remains the workload shell anchor.

Members for the first slice: 1. server, 2. agent.

The first server is the control-plane bootstrap anchor. Agent members join only after the server is usable.

Placement Model

Placement intent

The minimal placement schema should be:

{
  "server_allocation_id": "<uuid>",
  "agent_allocation_ids": ["<uuid>", "<uuid>"]
}

Rules

  1. server_allocation_id is required.
  2. agent_allocation_ids is optional.
  3. The server allocation must not also appear in agent_allocation_ids.
  4. All allocations must be:
     1. active,
     2. in the same project,
     3. reachable for platform bootstrap.
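
The rules above can be sketched as a small validation routine. The PlacementIntent shape and the allocation fields (state, project_id) are illustrative assumptions, not the platform's actual schema, and the reachability check is left to platform bootstrap:

```python
from dataclasses import dataclass, field


@dataclass
class PlacementIntent:
    server_allocation_id: str
    agent_allocation_ids: list[str] = field(default_factory=list)


def validate_placement(intent: PlacementIntent, allocations: dict[str, dict]) -> list[str]:
    """Return human-readable rule violations; an empty list means valid.

    Reachability for platform bootstrap is checked later by the bootstrap
    path itself, not here.
    """
    errors: list[str] = []
    if not intent.server_allocation_id:
        errors.append("server_allocation_id is required")
    if intent.server_allocation_id in intent.agent_allocation_ids:
        errors.append("server allocation must not also appear in agent_allocation_ids")
    project_ids = set()
    for alloc_id in [intent.server_allocation_id, *intent.agent_allocation_ids]:
        alloc = allocations.get(alloc_id)
        if alloc is None:
            errors.append(f"allocation {alloc_id} not found")
            continue
        if alloc.get("state") != "active":
            errors.append(f"allocation {alloc_id} is not active")
        project_ids.add(alloc.get("project_id"))
    if len(project_ids) > 1:
        errors.append("all allocations must be in the same project")
    return errors
```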

First-slice simplification

Do not add topology abstractions yet, such as: 1. server pools, 2. mixed worker groups, 3. autoscaling groups, 4. multi-role taint templates.

Those can follow after the single-server path works.

Member Inventory Model

The first slice should use the generic member read model already introduced.

Directionally: 1. one member with component_key = server, 2. zero or more members with component_key = agent.

Each member should surface: 1. node binding, 2. allocation binding, 3. endpoint where relevant, 4. runtime state, 5. health.
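
Directionally, the member read model might carry these surfaces as fields; the names below are illustrative, not the platform's actual schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    """Generic member read model, sketched for the RKE2 first slice."""
    component_key: str        # "server" or "agent" in the first slice
    node_id: str              # node binding
    allocation_id: str        # allocation binding
    endpoint: Optional[str]   # e.g. API endpoint; relevant for the server member
    runtime_state: str        # adapter-reported runtime state
    health: str               # generic member-health value
```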

Bootstrap Model

Ordering

  1. Validate placement and prerequisite runtime inputs.
  2. Create or reconcile the server member.
  3. Bootstrap RKE2 server on the selected server allocation.
  4. Wait for the server to become usable enough for agent joins.
  5. Create or reconcile agent members.
  6. Join each agent to the server.
  7. Deliver kubeconfig.
  8. Mark workload running when the cluster is usable.
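
The ordering above can be sketched against a hypothetical adapter interface; every method name here is illustrative, and the real contract lives in the app-runtime adapter APIs:

```python
def bootstrap(adapter, placement):
    """Ordered RKE2 bootstrap: server first, agents only after join-ready."""
    adapter.validate(placement)                                   # step 1
    server = adapter.ensure_member("server",
                                   placement.server_allocation_id)  # step 2
    adapter.bootstrap_server(server)                              # step 3
    adapter.wait_until_join_ready(server)                         # step 4: gates agent joins
    agents = [adapter.ensure_member("agent", alloc_id)            # step 5
              for alloc_id in placement.agent_allocation_ids]
    for agent in agents:
        adapter.join_agent(agent, server)                         # step 6
    adapter.deliver_kubeconfig(server)                            # step 7
    adapter.mark_running()                                        # step 8
```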

Platform-owned execution

Bootstrap must stay inside platform-owned execution.

Allowed directions: 1. app-runtime worker dispatches node tasks, 2. platform-owned bootstrap artifacts delivered to nodes, 3. app adapter reports phase/progress through app-runtime APIs.

Not allowed as the default: 1. third-party manager SSHing into nodes directly, 2. handing raw node SSH credentials to external systems as the primary path.

Runtime Phases

Suggested first-slice phase vocabulary:

  1. placement_validated
  2. server_bootstrap_requested
  3. server_bootstrapping
  4. server_ready_for_agent_join
  5. agent_join_requested
  6. agent_joining
  7. kubeconfig_delivery_requested
  8. ready

These are adapter-specific phase labels carried through the generic runtime progress model.
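
A minimal sketch of carrying these labels through a generic progress model; the linear progress-fraction mapping is an assumption for illustration, not a platform contract:

```python
# Adapter-specific phase labels, in expected progression order.
RKE2_PHASES = (
    "placement_validated",
    "server_bootstrap_requested",
    "server_bootstrapping",
    "server_ready_for_agent_join",
    "agent_join_requested",
    "agent_joining",
    "kubeconfig_delivery_requested",
    "ready",
)


def phase_progress(phase: str) -> float:
    """Map a phase label to a coarse 0..1 fraction for the workload shell."""
    return (RKE2_PHASES.index(phase) + 1) / len(RKE2_PHASES)
```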

Health Model

Instance health

Use generic top-level health states: 1. progressing, 2. healthy, 3. degraded, 4. failed.

Member health

Server and agent members should normalize to the same generic member-health set.

Examples:

  1. server ready, agents joining -> instance progressing
  2. server ready, one agent join failed but cluster usable -> instance degraded
  3. server bootstrap failed terminally -> instance failed
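
A rollup of those examples might look like the following; this is a hypothetical policy for illustration, and the real mapping lives in the app adapter:

```python
def instance_health(server_health: str, agent_healths: list[str]) -> str:
    """Roll member health up to the generic instance-health set."""
    if server_health == "failed":
        return "failed"          # server bootstrap failed terminally
    if server_health != "healthy":
        return "progressing"     # server still bootstrapping
    if any(h == "failed" for h in agent_healths):
        return "degraded"        # an agent join failed but the cluster is usable
    if any(h != "healthy" for h in agent_healths):
        return "progressing"     # agents still joining
    return "healthy"
```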

Kubeconfig Delivery

The first slice needs a durable access story.

Required outcome

The workload shell must expose an Access surface where the user can retrieve: 1. kubeconfig, 2. API endpoint information, 3. any relevant connection notes.

Delivery direction

Prefer platform-generated or platform-normalized kubeconfig content rather than requiring the user to log into the server node and fetch files manually.

The exact delivery endpoint can be added later, but the first slice should be designed with a platform-owned kubeconfig delivery surface in mind.
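
As a sketch of platform normalization: RKE2 writes its admin kubeconfig to /etc/rancher/rke2/rke2.yaml on the server node, pointed at the loopback address https://127.0.0.1:6443, so the platform would at minimum rewrite that address to the reported API endpoint before delivery. The plain-text replace below is illustrative only; a real implementation would parse the kubeconfig YAML rather than string-substitute:

```python
RKE2_LOOPBACK_SERVER = "https://127.0.0.1:6443"  # default in rke2.yaml


def normalize_kubeconfig(raw: str, api_endpoint: str) -> str:
    """Rewrite the loopback API address to the platform-reported endpoint.

    Illustrative sketch: assumes the default single-cluster kubeconfig
    layout RKE2 generates; does not re-issue or re-scope credentials.
    """
    return raw.replace(RKE2_LOOPBACK_SERVER, api_endpoint)
```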

Workload Shell Requirements

The first slice must integrate with the workload shell added earlier.

Minimum shell expectations:

  1. workload appears under Needs attention while bootstrapping,
  2. workload moves to Active when usable,
  3. workload detail shows:
     1. status,
     2. phase,
     3. progress,
     4. members,
     5. access/kubeconfig area.

No embedded cluster UI is required.

Failure and Recovery

The first slice should follow the generic app-runtime recovery model.

Expected behaviors:

  1. server bootstrap timeout -> manual_intervention or repair_available, depending on what the platform can safely repeat
  2. agent join failure after server ready -> instance may be degraded, not necessarily fully failed
  3. kubeconfig delivery failure -> recoverable if the cluster itself is usable

Do not collapse all failures into full instance failure if the cluster remains repairable.
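
A sketch of that policy as a failure-to-outcome mapping; the failure-mode names and outcome labels here are illustrative, not the adapter's actual vocabulary:

```python
# Hypothetical first-slice recovery policy. Outcomes deliberately avoid
# collapsing every failure into full instance failure.
RECOVERY_POLICY = {
    "server_bootstrap_timeout": "manual_intervention_or_repair",  # depends on safe repeatability
    "agent_join_failed": "instance_degraded",                     # cluster may still be usable
    "kubeconfig_delivery_failed": "recoverable",                  # if the cluster itself is usable
}


def classify_failure(failure: str) -> str:
    """Default unknown failures to manual intervention, not full failure."""
    return RECOVERY_POLICY.get(failure, "manual_intervention")
```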

Open recovery work:

  1. use repair/reconcile operations for server and agent drift as defined in doc/architecture/Kubernetes_Runtime_Reconcile_Repair_v1.md,
  2. keep generic restart unsupported for RKE2 unless the adapter reports a deterministic safe restart capability for the selected scope,
  3. make decommission tear down runtime services and cluster membership rather than relying on metadata cleanup alone,
  4. surface bootstrap logs and recent controller events without requiring SSH or pod logs,
  5. distinguish recoverable agent failures from full cluster failure in the UI.

Security and Trust

The first slice should respect existing platform trust boundaries: 1. node tasks remain platform-owned, 2. no query-string auth tokens, 3. kubeconfig delivery is authorization-checked at the platform boundary, 4. node mutation remains auditable.

Networking and Storage Open Items

The current implementation intentionally accepts the RKE2 default networking model:

  1. CNI: rke2-canal
  2. pod network: 10.42.0.0/16
  3. overlay backend: VXLAN
  4. network policy path: Calico through Canal

This is sufficient for the first bootstrap and member lifecycle validation slice. It is not the final storage data-path decision.
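
One concrete consequence of accepting the default pod network: node or tenant networks carried into the cluster must not collide with 10.42.0.0/16. A minimal overlap check as a sketch:

```python
import ipaddress

# Default rke2-canal pod CIDR accepted by the first slice.
POD_CIDR = ipaddress.ip_network("10.42.0.0/16")


def overlaps_pod_network(node_cidr: str) -> bool:
    """True if a node/tenant network would collide with the pod CIDR."""
    return ipaddress.ip_network(node_cidr).overlaps(POD_CIDR)
```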

Current storage state: 1. GPUaaS does not install a Kubernetes StorageClass for the self-managed RKE2 app. 2. No CSI driver is configured by the RKE2 controller today. 3. Kubeconfig access is available, but persistent workload storage is not part of the first slice.

Open items before exposing Kubernetes storage UX:

  1. confirm the first external storage backend with infra,
  2. confirm whether Weka is the first backend and what multi-attach/RWX guarantees it provides,
  3. define whether storage is exposed only inside the RKE2 cluster through PVCs or also through a direct external access path,
  4. define which network carries high-throughput storage traffic so storage does not depend on pod VXLAN,
  5. define the platform-owned contract for CSI driver installation, StorageClass creation, Secrets, PVC templates, and mount paths,
  6. map GPUaaS storage attachments into Kubernetes PVCs without making users hand-author arbitrary low-level YAML for common cases.

The current design contract for these decisions is doc/architecture/RKE2_External_Storage_Model_v1.md.

Decision until these are answered: 1. keep RKE2 storage as a documented open item, 2. do not expose a storage attach or StorageClass picker for RKE2, 3. keep storage implementation work blocked behind the allocation storage model and infra storage/exposure decisions.

Remaining Work Before Productization

RKE2 should remain labeled as self-managed until these are addressed:

  1. Endpoint exposure model Define whether the API endpoint is private, public, VPN-only, tenant-network-only, or link-out through a platform gateway.

  2. Kubeconfig privilege model Replace or clearly label the current admin-style kubeconfig access with scoped credentials where possible. Add a copy action for kubeconfig in the access page so the current manual path is less error-prone while the privilege model is being hardened.

  3. Storage and CSI model Define the first StorageClass, CSI ownership, backend capabilities, and PVC generation model before adding storage UX. Keep storage/CSI as an explicit open item until infra confirms the backend storage product and ownership model.

  4. Runtime cleanup Decommission should stop/remove RKE2 services, clean agent membership, and report what host state remains.

  5. Repair/reconcile Use the app-runtime repair operation contract for safe drift-repair paths on server and agent members instead of treating every issue as manual intervention.

  6. Version and upgrade policy Keep version selection catalog-driven; do not expose arbitrary freeform version upgrade/rollback until supported.

  7. Support boundary Decide when this remains an app-runtime validation bundle versus when Kubernetes becomes a first-class infrastructure product surface.

  8. External access boundary Define whether access beyond the private/VPN path uses platform proxy, tenant networking, load-balancer integration, or another infra-owned route.

Validation Criteria

The validation slice is successful when all of these are true:

  1. a user can deploy a self-managed RKE2 workload from the app catalog,
  2. the workload creates one server and optional agent members with visible progress,
  3. nodes join in the intended order,
  4. the workload becomes running,
  5. kubeconfig is retrievable from the workload shell,
  6. at least one local-kind-plus-VM validation path proves the real bootstrap on VM-backed nodes.

First Implementation Tasks

  1. catalog entry and app artifact metadata for self-managed RKE2,
  2. placement intent contract and validation,
  3. app adapter bootstrap path for server then agents,
  4. kubeconfig delivery/read-model support,
  5. workload-shell access surface for kubeconfig, including a UI copy action for the reported kubeconfig.

Non-Goals

  1. making Kubernetes look managed before it is,
  2. shipping Rancher in the same slice,
  3. solving all day-2 lifecycle operations,
  4. cluster templates and version matrix depth beyond the first supported path.

References
  1. doc/architecture/Kubernetes_Platform_Options_v1.md
  2. doc/architecture/App_Runtime_Recovery_Model_v1.md
  3. doc/architecture/Embedded_UI_Gateway_Contract_v1.md
  4. doc/product/Navigation_Redesign_App_Platform_v1.md
  5. doc/architecture/RKE2_External_Storage_Model_v1.md