Kubernetes on GPUaaS: Platform Design v2

Purpose:
  • Reframe the Kubernetes design space around the platform's actual control boundaries.
  • Separate what is viable for an early product from what is attractive but premature.
  • Define the missing platform primitives that must exist before Kubernetes work starts.

Inputs:
  • doc/product/Navigation_Redesign_App_Platform_v1.md
  • doc/api/openapi.draft.yaml
  • doc/api/asyncapi.draft.yaml
  • doc/architecture/Node_Agent_Spec.md
  • packages/services/appruntime/

Related:
  • The Slurm reference implementation is the current baseline for app-runtime lifecycle.
  • This document is intentionally opinionated; it is a design decision document, not a catalog of all possible Kubernetes architectures.


1. Executive Summary

The right path is:

  1. Add the missing app-runtime primitives that any distributed control-plane app needs.
  2. Validate those primitives with a small self-managed RKE2 offering.
  3. Offer managed Kubernetes with Rancher only if Rancher integration stays inside platform control boundaries.
  4. Defer Kamaji+CAPI until scale or control-plane economics justify it.

The most important correction from v1 is this:

  • Kubernetes integration must not introduce a side channel where a third-party system receives raw node SSH credentials and mutates nodes outside the platform's task, audit, and lifecycle model.

That means the product decision is not simply "Rancher vs Kamaji vs RKE2". The real decision is:

  • who owns cluster lifecycle,
  • who is allowed to touch nodes,
  • how cluster UIs authenticate,
  • and whether day-2 operations fit the existing platform model.

2. Product Modes

There are three materially different products hiding under the word "Kubernetes":

2.1 Self-Managed Kubernetes

The platform provides nodes and cluster bootstrap automation. The tenant operates the cluster after creation.

Examples:
  • single-server RKE2
  • HA RKE2 with tenant-owned upgrades and troubleshooting

What the platform promises:
  • allocations
  • bootstrap
  • kubeconfig delivery
  • basic lifecycle status

What the platform does not promise:
  • managed upgrades
  • fleet visibility
  • monitoring/logging defaults
  • SRE ownership of downstream cluster health

2.2 Managed Kubernetes

The platform owns cluster lifecycle and provides a stable operating model.

Examples:
  • Rancher-managed downstream RKE2 clusters

What the platform promises:
  • cluster create/scale/upgrade flows
  • cluster health model
  • kubeconfig delivery
  • supported day-2 lifecycle

2.3 Shared-Control-Plane Kubernetes

The platform also optimizes control-plane density and operates a dedicated cluster-management substrate.

Examples:
  • Kamaji + CAPI

This is an efficiency and fleet-scale move, not an MVP move.


3. Hard Constraints From the Current Platform

Any Kubernetes design must respect these existing boundaries:

3.1 Node mutation should stay inside platform-controlled execution

Today the platform already has:
  • node-agent
  • typed tasks
  • app-runtime lifecycle
  • audit expectations
  • node identity and mTLS

Kubernetes must not bypass that by handing raw node SSH credentials to an external system as the primary control path.

Allowed:
  • a platform-owned worker or controller orchestrates node changes
  • the platform emits bootstrap artifacts or tasks
  • the platform audits privileged actions

Not preferred:
  • Rancher or another manager receives raw node SSH access and mutates nodes directly, with no platform-owned execution record

3.2 Workload status must remain platform-readable

The platform cannot reduce instance state to "requested" and "running". Distributed systems need:
  • component health
  • progress reporting
  • degraded states
  • manual intervention states

This is true for:
  • Slurm HA
  • Kubernetes control planes
  • Ray or Spark later
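
As an illustration of the granularity required, a degraded HA instance should be representable as something like the following (a sketch only; the reporting API itself is sketched in section 4.2, and all field values here are illustrative):

{
  "phase": "degraded",
  "members": [
    { "member_id": "controller-0", "healthy": true },
    { "member_id": "controller-1", "healthy": false, "detail": {"reason": "failed to rejoin after restart"} }
  ]
}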

3.3 Embedded UI must follow a real auth model

Embedding Rancher, Grafana, or JupyterHub is not just an iframe problem. The platform must define:
  • login/session ownership
  • cookie behavior
  • WebSocket handling
  • CSRF model
  • logout/session expiry behavior
  • when to fall back to link-out

3.4 A usable cluster needs more than kubelet join

Kubernetes is not "done" when nodes join. The product also needs an opinion on:
  • API endpoint exposure
  • ingress/load-balancing
  • DNS
  • persistent storage classes
  • CNI and network policy
  • GPU operator lifecycle
  • observability defaults
  • upgrade and rollback semantics


4. Missing Platform Primitives

These should be treated as prerequisites for serious Kubernetes work.

4.1 App member inventory and networking

The runtime must be able to surface:
  • member list
  • component key
  • bound node ID
  • resolved node IPs or node-reachable endpoints
  • health
  • last operation

Example read model:

GET /api/v1/app-runtime/instances/{id}/members
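
A sketch of a possible response body, mirroring the list above (all field names and values are illustrative, not a settled schema):

{
  "members": [
    {
      "member_id": "a1b2c3d4-0000-0000-0000-000000000001",
      "component_key": "server",
      "node_id": "a1b2c3d4-0000-0000-0000-0000000000aa",
      "node_ip": "10.0.12.5",
      "healthy": true,
      "last_operation": "bootstrap_server"
    }
  ]
}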

Needed for:
  • RKE2 server/agent join
  • HA bootstrap ordering
  • instance detail UI
  • troubleshooting

4.2 Rich phase and progress reporting

Instance-level phase alone is not enough.

Needed:

POST /api/v1/app-runtime/instances/{id}/report-phase
  { "phase": "bootstrapping", "progress": {"completed": 2, "total": 5} }

POST /api/v1/app-runtime/instances/{id}/members/{member_id}/report-health
  { "healthy": true, "detail": {"role": "server"} }

This should be generic, not Kubernetes-specific.
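
For example, the same member-health shape should carry a non-Kubernetes app unchanged; a degraded Slurm controller might report (values illustrative):

POST /api/v1/app-runtime/instances/{id}/members/{member_id}/report-health
  { "healthy": false, "detail": {"role": "controller", "reason": "slurmctld not responding"} }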

4.3 Embedded UI gateway contract

Before embedding Rancher or any app UI, the platform needs a documented gateway contract:
  • reverse-proxy route shape
  • supported auth strategies
  • WebSocket upgrade behavior
  • session TTL model
  • CSP and frame policy requirements
  • audit and access checks
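
One possible shape for a single contract entry, purely as a sketch (every key name here is illustrative, not an agreed schema):

ui_route:
  path_prefix: /ui/instances/{instance_id}/
  upstream_member: server            # resolved via the member read model in 4.1
  auth_strategy: platform_session    # alternatives: app_native, token_exchange
  websocket_upgrade: allowed
  session_ttl: 8h
  frame_policy: same_origin_embed
  audit_privileged_actions: true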

4.4 App-runtime recovery model

Kubernetes will amplify existing runtime gaps:
  • partial deploy
  • component drift
  • orphaned member operations
  • manual recovery

The app runtime needs explicit semantics for:
  • retry
  • reconcile
  • manual intervention
  • resume/repair
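
A hypothetical surface for this, following the instance API shape above (the routes and action names below are not settled, only illustrative):

POST /api/v1/app-runtime/instances/{id}/recover
  { "action": "reconcile" }

POST /api/v1/app-runtime/instances/{id}/members/{member_id}/recover
  { "action": "retry_last_operation" }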


5. Architecture Options

5.1 Option A: Self-Managed RKE2

What it is

The platform bootstraps RKE2 on allocated nodes. After bootstrap, the tenant operates the cluster.

Why it matters

This is the cheapest way to validate that the app-runtime model can support a second distributed system after Slurm.

Minimal placement shape

placement_schema:
  type: object
  required: [server_allocation_id]
  properties:
    server_allocation_id:
      type: string
      format: uuid
    agent_allocation_ids:
      type: array
      items:
        type: string
        format: uuid
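
An instance request carrying this placement intent could then look like the following (the route and wrapper fields are illustrative; only the placement block is defined by the schema above):

POST /api/v1/app-runtime/instances
  {
    "catalog_item": "rke2-self-managed",
    "placement": {
      "server_allocation_id": "a1b2c3d4-0000-0000-0000-000000000001",
      "agent_allocation_ids": [
        "a1b2c3d4-0000-0000-0000-000000000002",
        "a1b2c3d4-0000-0000-0000-000000000003"
      ]
    }
  }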

Lifecycle

  1. User deploys a self-managed RKE2 app instance.
  2. Platform worker resolves placements and member networking.
  3. Platform-controlled execution bootstraps the first server.
  4. Platform-controlled execution joins agents.
  5. Platform delivers kubeconfig and marks the instance running.
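
Internally, steps 3 and 4 could be expressed as ordered, platform-audited member operations, for example (a sketch; task names are illustrative):

operations:
  - member: server-0
    task: rke2.bootstrap_server    # must complete and report healthy first
  - member: agent-0
    task: rke2.join_agent
    depends_on: [server-0]
  - member: agent-1
    task: rke2.join_agent
    depends_on: [server-0]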

What it validates

  • member networking read model
  • ordered member bootstrap
  • progress reporting
  • kubeconfig delivery
  • workload detail UX

What it does not solve

  • managed upgrades
  • cluster fleet management
  • tenant-friendly UI
  • SRE ownership

Recommendation

Do this first as a validation step, but position it clearly as self-managed.


5.2 Option B: Managed Kubernetes with Rancher

What it is

The platform operates Rancher centrally. Tenants request clusters through GPUaaS. Rancher manages downstream RKE2 clusters.

Why it is attractive

Rancher already provides:
  • cluster lifecycle
  • fleet management
  • RBAC
  • a monitoring/logging ecosystem
  • kubeconfig issuance
  • a real UI

The key design correction

The v1 design assumed a simple GPUaaS node driver could:
  • create allocations
  • wait for them to become active
  • return node IPs and SSH credentials to Rancher

That is too optimistic and is the wrong default control boundary.

The preferred model is:

  1. GPUaaS remains the system of record for allocations and node mutation.
  2. Rancher requests desired cluster actions.
  3. A platform-owned integration layer translates those actions into platform-controlled member operations or bootstrap artifacts.
  4. Rancher observes cluster readiness, but does not become the primary out-of-band node operator.
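
As a sketch of the translation step (the member-add call below is hypothetical, not an existing endpoint): when Rancher wants a new worker in a downstream cluster, the integration layer turns that intent into a platform-owned member operation rather than handing Rancher node credentials:

# Rancher intent observed by the platform-owned integration layer:
#   "cluster tenant-a-prod needs one more worker node"
# Translated into a platform-audited operation (illustrative):
POST /api/v1/app-runtime/instances/{id}/members
  { "component_key": "agent", "allocation_id": "a1b2c3d4-0000-0000-0000-000000000004" }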

Acceptable integration shapes

Preferred:
  • a Rancher integration service calls GPUaaS APIs
  • a GPUaaS-owned worker performs node bootstrap via platform execution

Less preferred:
  • a Rancher-native node driver with tightly scoped node credentials

If the Rancher-native path is used anyway, it must answer:
  • how credentials are scoped and rotated
  • how actions are audited
  • how node lifecycle stays reconcilable from GPUaaS
  • how drift is repaired

Managed Kubernetes product requirements

Rancher only becomes a real product fit if GPUaaS also specifies:
  • cluster endpoint exposure model
  • control-plane vs worker-node billing
  • default storage class
  • default ingress/DNS story
  • GPU operator install/upgrade owner
  • upgrade policy
  • support boundary between GPUaaS and tenant workloads

Recommendation

Rancher is still the best medium-term managed-Kubernetes direction, but only if it is integrated as a platform-managed control plane, not as an SSH side channel.


5.3 Option C: Kamaji + CAPI

What it is

The platform operates a management cluster and treats tenant clusters as declarative infrastructure.

Why it is not first

Kamaji + CAPI is attractive when:
  • cluster count is high
  • control-plane cost matters materially
  • the platform is ready to own a full cluster-management substrate

It does not validate the current app-platform model. It creates a new platform subsystem.

Recommendation

Do not start here. Revisit only after:
  • the managed Kubernetes product exists
  • Rancher or another manager proves real demand
  • dedicated control-plane cost becomes painful


6. Comparison

Dimension                   | Self-Managed RKE2                | Managed K8s via Rancher                  | Kamaji + CAPI
Product promise             | bootstrap only                   | managed cluster lifecycle                | managed cluster lifecycle + control-plane density
Fastest path                | yes                              | no                                       | no
Validates app-runtime model | yes                              | partly                                   | mostly no
Requires embedded UI        | no                               | yes                                      | optional
Day-2 ops burden            | low                              | high                                     | very high
New platform substrate      | low                              | medium                                   | high
Control-boundary risk       | low if platform-owned bootstrap  | medium if Rancher gets raw node control  | high implementation burden
Recommended timing          | first                            | second                                   | later

7. Recommendation

7.1 Product sequence

  1. Phase 0: app-runtime hardening. Build the generic primitives missing for distributed apps.

  2. Phase 1: self-managed RKE2. Ship a narrow, explicit validation offering.

  3. Phase 2: managed Kubernetes with Rancher. Only after the control boundary and embedded-UI contracts are solid.

  4. Phase 3: evaluate Kamaji. Only if scale and economics justify it.

7.2 Why this sequence

  • It matches current platform maturity.
  • It reuses the Slurm lessons instead of skipping them.
  • It avoids building a Kubernetes management plane before the platform can even model distributed-app health and embedded UIs correctly.

8. Implementation Phases

Phase 0: Platform prerequisites

Work                         | Description
App member read model        | Members, roles, node binding, endpoint/network info, health
Rich phase API               | Instance progress and component health reporting
Embedded UI gateway contract | Reverse proxy, auth modes, WebSockets, session model
Runtime recovery model       | Retry, reconcile, intervention, repair flows
Workload UX support          | Running vs attention-needed workload visibility

Phase 1: Self-managed RKE2

Work             | Description
Catalog entry    | Explicitly labeled self-managed Kubernetes
Placement schema | Server + agent placement intent
Worker lifecycle | Ordered bootstrap and kubeconfig delivery
Validation test  | Deploy cluster, verify node join and kubeconfig usability

Phase 2: Managed Kubernetes with Rancher

Work                     | Description
Rancher management plane | HA Rancher operated by the platform
Integration boundary     | Platform-owned execution path between Rancher intent and node mutation
Cluster request UX       | Name, version, topology, SKU/template selection
Instance detail UX       | Overview, Kubeconfig, Nodes, Health, Rancher UI
Day-2 model              | Scale, upgrade, repair, support boundary

Phase 3: Scale evaluation

Work                   | Description
Fleet economics review | Measure dedicated control-plane overhead
Kamaji feasibility     | Only if cluster count and cost justify it
Migration plan         | Keep tenant UX stable if backend changes

9. Open Questions

  1. Should managed Kubernetes control-plane nodes be tenant-billed, platform-billed, or template-dependent?
  2. What is the default cluster endpoint model: public, private, or tenant-network-only?
  3. What is the first supported storage class story for tenant clusters, and is the first backend Weka or another CSI-backed shared filesystem?
  4. Is embedded Rancher required for MVP, or is kubeconfig-first with optional link-out enough?
  5. If Rancher integration needs direct node credentials, is that acceptable at all under the current platform control model?
  6. Should storage be exposed only inside tenant RKE2 clusters as PVCs, or should GPUaaS also expose a direct external storage access path?
  7. What network path should storage traffic use, and what must the platform do to avoid depending on the default RKE2 VXLAN pod overlay for high-throughput storage?

10. Final Decision

The platform should not start by building "managed Kubernetes". It should start by building the missing distributed-app primitives and validating them with self-managed RKE2.

If that succeeds, Rancher is the right managed-Kubernetes direction, but only with a platform-owned execution boundary.

Kamaji remains a later optimization, not the first product.