Build-vs-buy & option evaluations¶
Status: Decided · Designed
For every meaningful build-or-buy choice, the team ran an explicit option comparison and recorded what came out as requirements. This page consolidates those comparisons.
Overview¶
```mermaid
flowchart LR
    classDef chosen fill:#d1e7dd,stroke:#0a3622
    classDef deferred fill:#fff3cd,stroke:#332701
    classDef rejected fill:#f8d7da,stroke:#42101e
    K[Kubernetes platform options] --> K1[Self-managed RKE2 first]:::chosen
    K --> K2[Managed via Rancher]:::deferred
    K --> K3[Kamaji + CAPI]:::deferred
    W[WEKA storage] --> W1[Dual-protocol POSIX+S3]:::chosen
    W --> W2[S3-only path]:::deferred
    SUI[Slurm UI options] --> SUI1[Native thin UI + extension panels]:::chosen
    SUI --> SUI2[Open OnDemand iframed]:::deferred
    SUI --> SUI3[Slurm-web iframed]:::deferred
    EUI[Embedded UI gateway] --> EUI1[Platform-owned reverse proxy<br/>+ subdomain isolation]:::chosen
    EUI --> EUI2[Pure iframe with shared cookies]:::rejected
    EUI --> EUI3[Pure link-out]:::deferred
    MAAS[H200 MAAS bundle fit] --> MAAS1[Reuse design; site-profile inputs]:::chosen
    MAAS --> MAAS2[Adopt bundle scripts as core]:::rejected
    OC[OpenClaw integration] --> OC1[As an app on the platform]:::chosen
    OC --> OC2[As a platform feature]:::rejected
```
1. Kubernetes platform options¶
```mermaid
flowchart TB
    classDef chosen fill:#d1e7dd,stroke:#0a3622
    classDef defer fill:#fff3cd,stroke:#332701
    Q{Kubernetes for GPUaaS<br/>which product mode?}
    Q --> SM[Self-Managed<br/>nodes + bootstrap automation]:::chosen
    Q --> MG[Managed via Rancher<br/>vendor cluster ops]:::defer
    Q --> KM[Kamaji + CAPI<br/>hosted control planes]:::defer
    SM --> SM1["Validated with cmd/rke2-self-managed-controller (1,334 lines)<br/>Stays inside platform task/audit model"]:::chosen
    MG --> MG1["Adopt only if Rancher integration<br/>stays inside platform control boundaries<br/>(no direct node mutations)"]:::defer
    KM --> KM1["Defer until scale or<br/>control-plane economics justify it"]:::defer
```
The non-negotiable rule¶
Kubernetes integration must not introduce a side channel where a third-party system receives raw node SSH credentials and mutates nodes outside the platform's task, audit, and lifecycle model.
That single rule is what makes the comparison real: it rules out any option that needs to SSH into nodes directly.
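One way to make the boundary concrete is a sketch of a typed-task contract — all names here (`NodeTask`, `submit_node_task`, the task-type strings) are hypothetical illustrations, not the node-agent's actual API. The point is structural: every node mutation is a typed, validated, auditable task, and raw shell-over-SSH has no representation in the model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical task types -- the only verbs allowed to touch a node.
ALLOWED_TASK_TYPES = {"rke2.install-server", "rke2.install-agent", "rke2.rotate-token"}

@dataclass
class NodeTask:
    """A typed node mutation. No free-form shell, no raw SSH."""
    task_type: str
    node_id: str
    params: dict = field(default_factory=dict)
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def submit_node_task(task: NodeTask) -> dict:
    """Validate against the typed contract and emit an audit record."""
    if task.task_type not in ALLOWED_TASK_TYPES:
        raise ValueError(f"untyped node mutation rejected: {task.task_type}")
    return {"task": task.task_type, "node": task.node_id, "audited": True}
```

Under this shape, a third-party system such as Rancher could only act through `submit_node_task`; an out-of-band `ssh node 'curl | sh'` simply has no place in the model, which is exactly what the rule demands.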
What came out as requirements¶
- Self-managed Kubernetes first — proves the app-runtime primitives any distributed control-plane app needs.
- Reference implementation in `cmd/rke2-self-managed-controller`.
- No raw-SSH-into-nodes side channels — all node mutations go through node-agent's typed-task contract.
- App-runtime primitives added first: stateful instance with multiple members, member operations, embedded UI gateway contract, app-runtime recovery, app-runtime metering.
- Multi-node Slurm/Kubernetes apps deferred until slice networking supports clusters; single-allocation apps allowed.
Source: Kubernetes_Platform_Options_v1.md, Self_Managed_RKE2_First_Slice_v1.md.
2. WEKA storage capability assessment¶
Dated 2026-04-27 and based on the WEKA documentation available at that time. Validation against the actual deployment is still required.
```mermaid
flowchart LR
    Q{WEKA fits GPUaaS<br/>storage + IAM model?}
    Q --> POSIX[WEKAFS / POSIX<br/>primary]
    Q --> S3[WEKA S3<br/>secondary]
    POSIX --> P1[Training / notebooks /<br/>Kubernetes PVCs /<br/>apps needing fs semantics]
    S3 --> S1[Bucket / object workflows /<br/>direct external clients /<br/>SDK access]
    classDef chosen fill:#d1e7dd,stroke:#0a3622
    class POSIX,S3 chosen
```
What came out as requirements¶
- Dual-protocol treatment: WEKAFS/POSIX as primary mount path, S3 as secondary for object workflows.
- POSIX path priority — WEKAFS is the high-performance training mount, not S3.
- S3 path for SDK / external clients — distinct from POSIX path; not a substitute.
- Validation required against actual deployment — assessment is documentation-based.
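The dual-protocol split can be sketched as a simple routing rule — the workload-class names below are invented for illustration, not the platform's actual taxonomy:

```python
# Hypothetical routing of workload classes to WEKA access protocols,
# mirroring the dual-protocol decision: POSIX primary, S3 secondary.
POSIX_WORKLOADS = {"training", "notebook", "k8s-pvc"}     # need filesystem semantics
S3_WORKLOADS = {"bucket-sync", "external-client", "sdk"}  # object workflows

def storage_protocol(workload: str) -> str:
    if workload in POSIX_WORKLOADS:
        return "wekafs"   # high-performance POSIX mount, the primary path
    if workload in S3_WORKLOADS:
        return "s3"       # secondary object path, not a substitute mount
    raise ValueError(f"unclassified workload: {workload}")
```

The deliberate asymmetry — an unclassified workload raises rather than defaulting to S3 — reflects the requirement that S3 is a distinct secondary path, never a silent fallback for POSIX.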
Source: Storage_WEKA_Capability_Assessment_v1.md. Related: Storage_Sharing_and_IAM_Model_v1.md, Storage_IAM_User_Flows_v1.md, Storage_Provider_Capability_Model_v1.md.
3. Slurm UI options¶
Slurm has no built-in web UI. The platform compared three approaches:
```mermaid
flowchart TB
    classDef chosen fill:#d1e7dd,stroke:#0a3622
    classDef defer fill:#fff3cd,stroke:#332701
    Q{Slurm UI for the platform}
    Q --> N["Native thin UI<br/>(detail page + extension panels)"]:::chosen
    Q --> OOD["Open OnDemand iframed<br/>(community UI)"]:::defer
    Q --> SW["Slurm-web iframed<br/>(SchedMD UI)"]:::defer
    N --> N1["Slurm Runtime card<br/>Slurm Workers card<br/>Instance Members card<br/>Instance Operations card"]:::chosen
```
Why native first¶
- Management operations (deploy, scale, rotate credentials) are well served by app-runtime panels.
- Visibility (queue state, node utilization, GPU metrics) is the gap; it is deferred until the embedded UI gateway is in place.
- Iframed third-party UIs require the embedded UI gateway contract first.
What came out as requirements¶
- Native thin UI shipped — extension panels in `packages/web/src/lib/apps/slurm-instance-panels.tsx`.
- Embedded UI gateway contract is a hard dependency for any third-party Slurm UI (Open OnDemand, Slurm-web).
- Slurm CLI proxying must route through the node-agent task model, not direct SSH from the platform API.
- Visibility gap (queue, utilization) tracked separately under the app platform gap tracker.
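What "route through the node-agent task model, not direct SSH" could look like for the CLI requirement above — a sketch with invented names (`slurm_cli_task`, the `slurm.cli` task type, the allow-list) rather than the platform's real contract. The platform API translates a CLI request into a structured, allow-listed task payload instead of opening a shell:

```python
import shlex

# Hypothetical allow-list: only specific Slurm commands may be proxied.
SLURM_ALLOWED = {"squeue", "sinfo", "scontrol"}

def slurm_cli_task(instance_id: str, command: str) -> dict:
    """Translate a Slurm CLI request into a typed node-agent task payload."""
    argv = shlex.split(command)
    if not argv or argv[0] not in SLURM_ALLOWED:
        raise PermissionError(f"command not in Slurm proxy allow-list: {command}")
    return {
        "task_type": "slurm.cli",   # a typed task, executed by node-agent
        "instance_id": instance_id,
        "argv": argv,               # structured argv, never a raw shell string
    }
```

Passing `argv` as a list rather than a shell string is the same discipline as the no-raw-SSH rule: the node-agent executes a known command, not arbitrary text.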
Source: Slurm_UI_Options_v1.md.
4. Embedded UI gateway¶
Comparison of how to host third-party / workload UIs inside the platform shell:
```mermaid
flowchart TB
    classDef chosen fill:#d1e7dd,stroke:#0a3622
    classDef rejected fill:#f8d7da,stroke:#42101e
    classDef defer fill:#fff3cd,stroke:#332701
    Q{How to expose app UIs<br/>inside the workload shell?}
    Q --> A[Platform-owned reverse proxy<br/>+ subdomain isolation]:::chosen
    Q --> B[Pure iframe with shared cookies]:::rejected
    Q --> C[Pure link-out]:::defer
    A --> A1[Auth + session + cookie + WS +<br/>CSP all owned by platform]:::chosen
    B --> B1[Rejected: cookie + origin model<br/>incompatible with multi-tenant<br/>security boundary]:::rejected
    C --> C1[Fallback only when<br/>app cannot be safely embedded]:::defer
```
The rule the comparison codified¶
Embedding an app UI is not an iframe styling task. It is an auth, session, cookie, WebSocket, and support-boundary decision.
What came out as requirements¶
- Reverse-proxy route shape owned by platform.
- Auth and session ownership by platform, not by the embedded app.
- Cookie and origin rules that prevent the embedded app from escaping the platform's origin boundary.
- WebSocket behavior standardized.
- CSP and frame policy expectations explicit.
- Explicit link-out fallback criteria — when an app cannot be safely embedded.
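A way to picture those requirements together — with invented domain names and a placeholder cookie value, not the contract's actual configuration: the platform-owned proxy gives each embedded app its own subdomain and sets the session cookie, CSP, and frame policy itself, so the app never owns any of them.

```python
def gateway_headers(app_slug: str, platform_domain: str = "apps.example.internal") -> dict:
    """Response headers a platform-owned reverse proxy might enforce
    for an embedded app UI served on its own isolated subdomain."""
    return {
        # Session cookie scoped to the app's own subdomain only, so one
        # embedded app can never read another app's (or the shell's) session.
        "Set-Cookie": (
            f"app_session=<opaque>; Domain={app_slug}.{platform_domain}; "
            "Secure; HttpOnly; SameSite=Strict; Path=/"
        ),
        # Only the platform shell is allowed to frame the app.
        "Content-Security-Policy": f"frame-ancestors https://{platform_domain}",
    }
```

Note the absence of any header chosen by the embedded app: in this model the app's own `Set-Cookie` or CSP would be stripped or rewritten by the proxy, which is what "auth and session ownership by platform" means in practice.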
Source: Embedded_UI_Gateway_Contract_v1.md.
5. H200 MAAS bundle fit analysis¶
Question asked: how well does the current GPUaaS model fit the H200 MAAS automation bundle without changing core scope?
The bundle contained four layers:
- MAAS server bootstrap and tuning (`install_maas_3_7.sh`, `tune_maas.sh`, `postgres_tuning.sh`)
- Site/network/operator configuration (`site.env`, inventory files, `roce_ips.csv`)
- Node deployment scripts
- ROCE-specific networking setup
```mermaid
flowchart LR
    classDef gpa fill:#e8f5e9,stroke:#2e7d32
    classDef bun fill:#fff3e0,stroke:#e65100
    GP[GPUaaS owns:<br/>MAAS-backed node lifecycle<br/>orchestration]:::gpa
    GP -.does NOT own.-> B1[MAAS server bootstrap]:::bun
    GP -.does NOT own.-> B2[Site/network config files]:::bun
    GP -.does NOT own.-> B3[CSV-driven inventories]:::bun
    GP --> R1[Data-model driven<br/>not file/CSV/script driven]:::gpa
    GP --> R2[Site-specific personality<br/>via controlled site/profile inputs<br/>not copied into core logic]:::gpa
```
What came out as requirements¶
- Boundary kept: GPUaaS owns node lifecycle orchestration; doesn't own MAAS server bootstrap and tuning.
- Site profile pattern: hardware tags + firmware profile tags + per-site cloud-init helper. No CSV files in core logic.
- ROCE support added as a fabric option alongside InfiniBand. Slot metadata carries fabric type explicitly.
- No scope change triggered — bundle informed but did not reshape the model.
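The site-profile pattern can be sketched as a typed input model — every name below (`SiteProfile`, `provisioning_input`, the tag and fabric values) is a hypothetical illustration of the shape, not the platform's real schema. Site personality lives in one controlled object; core node-lifecycle logic consumes that model and never reads a CSV or bundle script directly:

```python
from dataclasses import dataclass

@dataclass
class SiteProfile:
    """Hypothetical site-profile input: the only place site personality lives."""
    site_id: str
    hardware_tags: list          # e.g. ["h200", "roce"]
    firmware_profile: str        # e.g. "h200-fw-2024Q4"
    fabric: str                  # "infiniband" or "roce", carried in slot metadata
    cloud_init_helper: str = ""  # name of a per-site cloud-init snippet

def provisioning_input(profile: SiteProfile, node_id: str) -> dict:
    """Core node-lifecycle logic sees a typed input, not roce_ips.csv."""
    if profile.fabric not in {"infiniband", "roce"}:
        raise ValueError(f"unknown fabric: {profile.fabric}")
    return {
        "node_id": node_id,
        "tags": sorted(set(profile.hardware_tags) | {profile.firmware_profile}),
        "fabric": profile.fabric,
        "cloud_init_helper": profile.cloud_init_helper,
    }
```

Validating `fabric` at the boundary is where "ROCE as a fabric option alongside InfiniBand" shows up: fabric is explicit, typed data, not something inferred from which script a site happened to run.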
Source: H200_MAAS_Fit_Analysis_v1.md.
6. OpenClaw integration¶
Question: should OpenClaw (a private/self-hosted AI assistant) be a platform feature or an app on the platform?
```mermaid
flowchart LR
    classDef chosen fill:#d1e7dd,stroke:#0a3622
    classDef rejected fill:#f8d7da,stroke:#42101e
    Q{OpenClaw integration mode}
    Q --> A[As an app on the platform<br/>uses vLLM backend in an allocation]:::chosen
    Q --> B[As a platform feature<br/>built into the control plane]:::rejected
    A --> A1[Composes with vLLM running as<br/>its own app on a 1-GPU slice<br/>or CPU-only allocation]:::chosen
```
What came out as requirements¶
- OpenClaw as an app — not a platform feature. Reuses the app-runtime primitives.
- vLLM + OpenClaw composability — different allocations possible (vLLM on a 4-GPU slice, OpenClaw on a 1-GPU slice or CPU-only).
- Non-GPU workload support — OpenClaw can run on a CPU-only allocation if supported, without weakening GPU slice isolation or billing. Currently it would need a 1-GPU slice as a workaround.
- CPU-only allocation track stays an open product question, with the platform model designed not to block it.
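The composability requirement reduces to a simple shape — sketched here with invented names and illustrative GPU counts, not real platform types: each app gets its own allocation, billed and isolated independently, and the pair composes without either one becoming a platform feature.

```python
from dataclasses import dataclass

@dataclass
class Allocation:
    """A slice an app runs on; gpus could drop to 0 once CPU-only lands."""
    app: str
    gpus: int

def compose(backend: Allocation, frontend: Allocation) -> dict:
    """OpenClaw (frontend) talks to vLLM (backend) across two allocations.

    Each app is metered on its own slice; the platform's feature surface
    is unchanged by adding either one.
    """
    return {
        "backend": {"app": backend.app, "gpus": backend.gpus},
        "frontend": {"app": frontend.app, "gpus": frontend.gpus},
        "total_gpus": backend.gpus + frontend.gpus,
    }

# vLLM on a 4-GPU slice; OpenClaw on a 1-GPU slice (the current workaround
# until CPU-only allocations exist, at which point gpus=0 would be legal).
stack = compose(Allocation("vllm", 4), Allocation("openclaw", 1))
```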
Source: OpenClaw_App_Integration_and_Platform_Composition_v1.md.
Cross-cutting takeaways¶
```mermaid
mindmap
  root((Build-vs-buy themes))
    Stay in control
      No raw-SSH side channels
      No third-party node mutations
      Auth/session/cookies platform-owned
    Reuse platform primitives
      Apps run on allocations
      Single typed-task contract
      Single ledger
      Single audit
    Layer carefully
      WEKAFS primary, S3 secondary
      Native UI first, embedded later
      Self-managed RKE2 first, managed later
    Fit, don't redesign
      MAAS bundle = site profile input
      Don't copy CSV/script glue into core
      Bundle informs, doesn't reshape
```