Kubernetes on GPUaaS: Platform Design v2

Purpose:
  • Reframe the Kubernetes design space around the platform's actual control boundaries.
  • Separate what is viable for an early product from what is attractive but premature.
  • Define the missing platform primitives that must exist before Kubernetes work starts.

Inputs:
  • doc/product/Navigation_Redesign_App_Platform_v1.md
  • doc/api/openapi.draft.yaml
  • doc/api/asyncapi.draft.yaml
  • doc/architecture/Node_Agent_Spec.md
  • packages/services/appruntime/

Related:
  • The Slurm reference implementation is the current baseline for app-runtime lifecycle.
  • This document is intentionally opinionated; it is a design decision document, not a catalog of all possible Kubernetes architectures.


1. Executive Summary

The right path is:

  1. Add the missing app-runtime primitives that any distributed control-plane app needs.
  2. Validate those primitives with a small self-managed RKE2 offering.
  3. Offer managed Kubernetes with Rancher only if Rancher integration stays inside platform control boundaries.
  4. Defer Kamaji+CAPI until scale or control-plane economics justify it.

The most important correction from v1 is this:

  • Kubernetes integration must not introduce a side channel where a third-party system receives raw node SSH credentials and mutates nodes outside the platform's task, audit, and lifecycle model.

That means the product decision is not simply "Rancher vs Kamaji vs RKE2". The real decision is:

  • who owns cluster lifecycle,
  • who is allowed to touch nodes,
  • how cluster UIs authenticate,
  • and whether day-2 operations fit the existing platform model.

2. Product Modes

There are three materially different products hiding under the word "Kubernetes":

2.1 Self-Managed Kubernetes

The platform provides nodes and cluster bootstrap automation. The tenant operates the cluster after creation.

Examples:
  • single-server RKE2
  • HA RKE2 with tenant-owned upgrades and troubleshooting

What the platform promises:
  • allocations
  • bootstrap
  • kubeconfig delivery
  • basic lifecycle status

What the platform does not promise:
  • managed upgrades
  • fleet visibility
  • monitoring/logging defaults
  • SRE ownership of downstream cluster health

2.2 Managed Kubernetes

The platform owns cluster lifecycle and provides a stable operating model.

Examples:
  • Rancher-managed downstream RKE2 clusters

What the platform promises:
  • cluster create/scale/upgrade flows
  • cluster health model
  • kubeconfig delivery
  • supported day-2 lifecycle

2.3 Shared-Control-Plane Kubernetes

The platform also optimizes control-plane density and operates a dedicated cluster-management substrate.

Examples:
  • Kamaji + CAPI

This is an efficiency and fleet-scale move, not an MVP move.


3. Hard Constraints From the Current Platform

Any Kubernetes design must respect these existing boundaries:

3.1 Node mutation should stay inside platform-controlled execution

Today the platform already has:
  • node-agent
  • typed tasks
  • app-runtime lifecycle
  • audit expectations
  • node identity and mTLS

Kubernetes must not bypass that by handing raw node SSH credentials to an external system as the primary control path.

Allowed:
  • a platform-owned worker or controller orchestrates node changes
  • the platform emits bootstrap artifacts or tasks
  • the platform audits privileged actions

Not preferred:
  • Rancher or another manager receives raw node SSH access and mutates nodes directly, with no platform-owned execution record

3.2 Workload status must remain platform-readable

The platform cannot reduce instance state to "requested" and "running". Distributed systems need:
  • component health
  • progress reporting
  • degraded states
  • manual intervention states

This is true for:
  • Slurm HA
  • Kubernetes control planes
  • Ray or Spark later
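
As an illustration of the granularity required, a degraded HA instance should be representable as something like the following (a sketch only; the reporting API itself is sketched in section 4.2, and all field values here are illustrative):

{
  "phase": "degraded",
  "members": [
    { "member_id": "controller-0", "healthy": true },
    { "member_id": "controller-1", "healthy": false, "detail": {"reason": "failed to rejoin after restart"} }
  ]
}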

3.3 Embedded UI must follow a real auth model

Embedding Rancher, Grafana, or JupyterHub is not just an iframe problem. The platform must define:
  • login/session ownership
  • cookie behavior
  • WebSocket handling
  • CSRF model
  • logout/session expiry behavior
  • when to fall back to link-out

3.4 A usable cluster needs more than kubelet join

Kubernetes is not "done" when nodes join. The product also needs an opinion on:
  • API endpoint exposure
  • ingress/load-balancing
  • DNS
  • persistent storage classes
  • CNI and network policy
  • GPU operator lifecycle
  • observability defaults
  • upgrade and rollback semantics


4. Missing Platform Primitives

These should be treated as prerequisites for serious Kubernetes work.

4.1 App member inventory and networking

The runtime must be able to surface:
  • member list
  • component key
  • bound node ID
  • resolved node IPs or node-reachable endpoints
  • health
  • last operation

Example read model:

GET /api/v1/app-runtime/instances/{id}/members
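
A sketch of a possible response body, mirroring the list above (all field names and values are illustrative, not a settled schema):

{
  "members": [
    {
      "member_id": "a1b2c3d4-0000-0000-0000-000000000001",
      "component_key": "server",
      "node_id": "a1b2c3d4-0000-0000-0000-0000000000aa",
      "node_ip": "10.0.12.5",
      "healthy": true,
      "last_operation": "bootstrap_server"
    }
  ]
}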

Needed for:
  • RKE2 server/agent join
  • HA bootstrap ordering
  • instance detail UI
  • troubleshooting

4.2 Rich phase and progress reporting

Instance-level phase alone is not enough.

Needed:

POST /api/v1/app-runtime/instances/{id}/report-phase
  { "phase": "bootstrapping", "progress": {"completed": 2, "total": 5} }

POST /api/v1/app-runtime/instances/{id}/members/{member_id}/report-health
  { "healthy": true, "detail": {"role": "server"} }

This should be generic, not Kubernetes-specific.
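
For example, the same member-health shape should carry a non-Kubernetes app unchanged; a degraded Slurm controller might report (values illustrative):

POST /api/v1/app-runtime/instances/{id}/members/{member_id}/report-health
  { "healthy": false, "detail": {"role": "controller", "reason": "slurmctld not responding"} }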

4.3 Embedded UI gateway contract

Before embedding Rancher or any app UI, the platform needs a documented gateway contract:
  • reverse-proxy route shape
  • supported auth strategies
  • WebSocket upgrade behavior
  • session TTL model
  • CSP and frame policy requirements
  • audit and access checks
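
One possible shape for a single contract entry, purely as a sketch (every key name here is illustrative, not an agreed schema):

ui_route:
  path_prefix: /ui/instances/{instance_id}/
  upstream_member: server            # resolved via the member read model in 4.1
  auth_strategy: platform_session    # alternatives: app_native, token_exchange
  websocket_upgrade: allowed
  session_ttl: 8h
  frame_policy: same_origin_embed
  audit_privileged_actions: true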

4.4 App-runtime recovery model

Kubernetes will amplify existing runtime gaps:
  • partial deploy
  • component drift
  • orphaned member operations
  • manual recovery

The app runtime needs explicit semantics for:
  • retry
  • reconcile
  • manual intervention
  • resume/repair
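
A hypothetical surface for this, following the instance API shape above (the routes and action names below are not settled, only illustrative):

POST /api/v1/app-runtime/instances/{id}/recover
  { "action": "reconcile" }

POST /api/v1/app-runtime/instances/{id}/members/{member_id}/recover
  { "action": "retry_last_operation" }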


5. Architecture Options

5.1 Option A: Self-Managed RKE2

What it is

The platform bootstraps RKE2 on allocated nodes. After bootstrap, the tenant operates the cluster.

Why it matters

This is the cheapest way to validate that the app-runtime model can support a second distributed system after Slurm.

Minimal placement shape

placement_schema:
  type: object
  required: [server_allocation_id]
  properties:
    server_allocation_id:
      type: string
      format: uuid
    agent_allocation_ids:
      type: array
      items:
        type: string
        format: uuid
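
An instance request carrying this placement intent could then look like the following (the route and wrapper fields are illustrative; only the placement block is defined by the schema above):

POST /api/v1/app-runtime/instances
  {
    "catalog_item": "rke2-self-managed",
    "placement": {
      "server_allocation_id": "a1b2c3d4-0000-0000-0000-000000000001",
      "agent_allocation_ids": [
        "a1b2c3d4-0000-0000-0000-000000000002",
        "a1b2c3d4-0000-0000-0000-000000000003"
      ]
    }
  }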

Lifecycle

  1. User deploys a self-managed RKE2 app instance.
  2. Platform worker resolves placements and member networking.
  3. Platform-controlled execution bootstraps the first server.
  4. Platform-controlled execution joins agents.
  5. Platform delivers kubeconfig and marks the instance running.
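
Internally, steps 3 and 4 could be expressed as ordered, platform-audited member operations, for example (a sketch; task names are illustrative):

operations:
  - member: server-0
    task: rke2.bootstrap_server    # must complete and report healthy first
  - member: agent-0
    task: rke2.join_agent
    depends_on: [server-0]
  - member: agent-1
    task: rke2.join_agent
    depends_on: [server-0]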

What it validates

  • member networking read model
  • ordered member bootstrap
  • progress reporting
  • kubeconfig delivery
  • workload detail UX

What it does not solve

  • managed upgrades
  • cluster fleet management
  • tenant-friendly UI
  • SRE ownership

Recommendation

Do this first as a validation step, but position it clearly as self-managed.


5.2 Option B: Managed Kubernetes with Rancher

What it is

The platform operates Rancher centrally. Tenants request clusters through GPUaaS. Rancher manages downstream RKE2 clusters.

Why it is attractive

Rancher already provides:
  • cluster lifecycle
  • fleet management
  • RBAC
  • a monitoring/logging ecosystem
  • kubeconfig issuance
  • a real UI

The key design correction

The v1 design assumed a simple GPUaaS node driver could:
  • create allocations
  • wait for them to become active
  • return node IPs and SSH credentials to Rancher

That is too optimistic and is the wrong default control boundary.

The preferred model is:

  1. GPUaaS remains the system of record for allocations and node mutation.
  2. Rancher requests desired cluster actions.
  3. A platform-owned integration layer translates those actions into platform-controlled member operations or bootstrap artifacts.
  4. Rancher observes cluster readiness, but does not become the primary out-of-band node operator.
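
As a sketch of the translation step (the member-add call below is hypothetical, not an existing endpoint): when Rancher wants a new worker in a downstream cluster, the integration layer turns that intent into a platform-owned member operation rather than handing Rancher node credentials:

# Rancher intent observed by the platform-owned integration layer:
#   "cluster tenant-a-prod needs one more worker node"
# Translated into a platform-audited operation (illustrative):
POST /api/v1/app-runtime/instances/{id}/members
  { "component_key": "agent", "allocation_id": "a1b2c3d4-0000-0000-0000-000000000004" }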

Acceptable integration shapes

Preferred:
  • a Rancher integration service calls GPUaaS APIs
  • a GPUaaS-owned worker performs node bootstrap via platform execution

Less preferred:
  • a Rancher-native node driver with tightly scoped node credentials

If the Rancher-native path is used anyway, it must answer:
  • how credentials are scoped and rotated
  • how actions are audited
  • how node lifecycle stays reconcilable from GPUaaS
  • how drift is repaired

Managed Kubernetes product requirements

Rancher only becomes a real product fit if GPUaaS also specifies:
  • cluster endpoint exposure model
  • control-plane vs worker-node billing
  • default storage class
  • default ingress/DNS story
  • GPU operator install/upgrade owner
  • upgrade policy
  • support boundary between GPUaaS and tenant workloads

Recommendation

Rancher is still the best medium-term managed-Kubernetes direction, but only if it is integrated as a platform-managed control plane, not as an SSH side channel.


5.3 Option C: Kamaji + CAPI

What it is

The platform operates a management cluster and treats tenant clusters as declarative infrastructure.

Why it is not first

Kamaji + CAPI is attractive when:
  • cluster count is high
  • control-plane cost matters materially
  • the platform is ready to own a full cluster-management substrate

It does not validate the current app-platform model. It creates a new platform subsystem.

Recommendation

Do not start here. Revisit only after:
  • the managed Kubernetes product exists
  • Rancher or another manager proves real demand
  • dedicated control-plane cost becomes painful


6. Comparison

Dimension                   | Self-Managed RKE2                | Managed K8s via Rancher                  | Kamaji + CAPI
Product promise             | bootstrap only                   | managed cluster lifecycle                | managed cluster lifecycle + control-plane density
Fastest path                | yes                              | no                                       | no
Validates app-runtime model | yes                              | partly                                   | mostly no
Requires embedded UI        | no                               | yes                                      | optional
Day-2 ops burden            | low                              | high                                     | very high
New platform substrate      | low                              | medium                                   | high
Control-boundary risk       | low if platform-owned bootstrap  | medium if Rancher gets raw node control  | high implementation burden
Recommended timing          | first                            | second                                   | later

7. Recommendation

7.1 Product sequence

  1. Phase 0: app-runtime hardening. Build the generic primitives missing for distributed apps.

  2. Phase 1: self-managed RKE2. Ship a narrow, explicit validation offering.

  3. Phase 2: managed Kubernetes with Rancher. Only after the control boundary and embedded-UI contracts are solid.

  4. Phase 3: evaluate Kamaji. Only if scale and economics justify it.

7.2 Why this sequence

  • It matches current platform maturity.
  • It reuses the Slurm lessons instead of skipping them.
  • It avoids building a Kubernetes management plane before the platform can even model distributed-app health and embedded UIs correctly.

8. Implementation Phases

Phase 0: Platform prerequisites

Work                         | Description
App member read model        | Members, roles, node binding, endpoint/network info, health
Rich phase API               | Instance progress and component health reporting
Embedded UI gateway contract | Reverse proxy, auth modes, WebSockets, session model
Runtime recovery model       | Retry, reconcile, intervention, repair flows
Workload UX support          | Running vs attention-needed workload visibility

Phase 1: Self-managed RKE2

Work             | Description
Catalog entry    | Explicitly labeled self-managed Kubernetes
Placement schema | Server + agent placement intent
Worker lifecycle | Ordered bootstrap and kubeconfig delivery
Validation test  | Deploy cluster, verify node join and kubeconfig usability

Phase 2: Managed Kubernetes with Rancher

Work                     | Description
Rancher management plane | HA Rancher operated by the platform
Integration boundary     | Platform-owned execution path between Rancher intent and node mutation
Cluster request UX       | Name, version, topology, SKU/template selection
Instance detail UX       | Overview, Kubeconfig, Nodes, Health, Rancher UI
Day-2 model              | Scale, upgrade, repair, support boundary

Phase 3: Scale evaluation

Work                   | Description
Fleet economics review | Measure dedicated control-plane overhead
Kamaji feasibility     | Only if cluster count and cost justify it
Migration plan         | Keep tenant UX stable if backend changes

9. Open Questions

  1. Should managed Kubernetes control-plane nodes be tenant-billed, platform-billed, or template-dependent?
  2. What is the default cluster endpoint model: public, private, or tenant-network-only?
  3. What is the first supported storage class story for tenant clusters, and is the first backend Weka or another CSI-backed shared filesystem?
  4. Is embedded Rancher required for MVP, or is kubeconfig-first with optional link-out enough?
  5. If Rancher integration needs direct node credentials, is that acceptable at all under the current platform control model?
  6. Should storage be exposed only inside tenant RKE2 clusters as PVCs, or should GPUaaS also expose a direct external storage access path?
  7. What network path should storage traffic use, and what must the platform do to avoid depending on the default RKE2 VXLAN pod overlay for high-throughput storage?

10. Final Decision

The platform should not start by building "managed Kubernetes". It should start by building the missing distributed-app primitives and validating them with self-managed RKE2.

If that succeeds, Rancher is the right managed-Kubernetes direction, but only with a platform-owned execution boundary.

Kamaji remains a later optimization, not the first product.