Build-vs-buy & option evaluations¶
Status: Decided · Designed
For every meaningful build-or-buy choice, the team ran an explicit option comparison and recorded what came out as requirements. This page consolidates those comparisons.
Overview¶
```mermaid
flowchart LR
    classDef chosen fill:#d1e7dd,stroke:#0a3622
    classDef deferred fill:#fff3cd,stroke:#332701
    classDef rejected fill:#f8d7da,stroke:#42101e
    K[Kubernetes platform options] --> K1[Self-managed RKE2 first]:::chosen
    K --> K2[Managed via Rancher]:::deferred
    K --> K3[Kamaji + CAPI]:::deferred
    W[WEKA storage] --> W1[Dual-protocol POSIX+S3]:::chosen
    W --> W2[S3-only path]:::deferred
    SUI[Slurm UI options] --> SUI1[Native thin UI + extension panels]:::chosen
    SUI --> SUI2[Open OnDemand iframed]:::deferred
    SUI --> SUI3[Slurm-web iframed]:::deferred
    EUI[Embedded UI gateway] --> EUI1[Platform-owned reverse proxy<br/>+ subdomain isolation]:::chosen
    EUI --> EUI2[Pure iframe with shared cookies]:::rejected
    EUI --> EUI3[Pure link-out]:::deferred
    MAAS[H200 MAAS bundle fit] --> MAAS1[Reuse design; site-profile inputs]:::chosen
    MAAS --> MAAS2[Adopt bundle scripts as core]:::rejected
    OC[OpenClaw integration] --> OC1[As an app on the platform]:::chosen
    OC --> OC2[As a platform feature]:::rejected
```
1. Kubernetes platform options¶
```mermaid
flowchart TB
    classDef chosen fill:#d1e7dd,stroke:#0a3622
    classDef defer fill:#fff3cd,stroke:#332701
    Q{Kubernetes for GPUaaS<br/>which product mode?}
    Q --> SM[Self-Managed<br/>nodes + bootstrap automation]:::chosen
    Q --> MG[Managed via Rancher<br/>vendor cluster ops]:::defer
    Q --> KM[Kamaji + CAPI<br/>hosted control planes]:::defer
    SM --> SM1["Validated with cmd/rke2-self-managed-controller (1,334 lines)<br/>Stays inside platform task/audit model"]:::chosen
    MG --> MG1["Adopt only if Rancher integration<br/>stays inside platform control boundaries<br/>(no direct node mutations)"]:::defer
    KM --> KM1["Defer until scale or<br/>control-plane economics justify it"]:::defer
```
The non-negotiable rule¶
Kubernetes integration must not introduce a side channel where a third-party system receives raw node SSH credentials and mutates nodes outside the platform's task, audit, and lifecycle model.
That single rule is what makes the comparison real: it rules out any option that needs to SSH into nodes directly.
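One way to make the boundary concrete is a sketch of a typed-task contract — all names here (`NodeTask`, `submit_node_task`, the task-type strings) are hypothetical illustrations, not the node-agent's actual API. The point is structural: every node mutation is a typed, validated, auditable task, and raw shell-over-SSH has no representation in the model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical task types -- the only verbs allowed to touch a node.
ALLOWED_TASK_TYPES = {"rke2.install-server", "rke2.install-agent", "rke2.rotate-token"}

@dataclass
class NodeTask:
    """A typed node mutation. No free-form shell, no raw SSH."""
    task_type: str
    node_id: str
    params: dict = field(default_factory=dict)
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def submit_node_task(task: NodeTask) -> dict:
    """Validate against the typed contract and emit an audit record."""
    if task.task_type not in ALLOWED_TASK_TYPES:
        raise ValueError(f"untyped node mutation rejected: {task.task_type}")
    return {"task": task.task_type, "node": task.node_id, "audited": True}
```

Under this shape, a third-party system such as Rancher could only act through `submit_node_task`; an out-of-band `ssh node 'curl | sh'` simply has no place in the model, which is exactly what the rule demands.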
What came out as requirements¶
- Self-managed Kubernetes first — proves the app-runtime primitives any distributed control-plane app needs.
- Reference implementation in `cmd/rke2-self-managed-controller`.
- No raw-SSH-into-nodes side channels — all node mutations go through node-agent's typed-task contract.
- App-runtime primitives added first: stateful instance with multiple members, member operations, embedded UI gateway contract, app-runtime recovery, app-runtime metering.
- Multi-node Slurm/Kubernetes apps deferred until slice networking supports clusters; single-allocation apps allowed.
Source: Kubernetes_Platform_Options_v1.md, Self_Managed_RKE2_First_Slice_v1.md.
2. WEKA storage capability assessment¶
Dated 2026-04-27 and based on the WEKA documentation available at that time. Validation against the actual deployment is still required.
```mermaid
flowchart LR
    Q{WEKA fits GPUaaS<br/>storage + IAM model?}
    Q --> POSIX[WEKAFS / POSIX<br/>primary]
    Q --> S3[WEKA S3<br/>secondary]
    POSIX --> P1[Training / notebooks /<br/>Kubernetes PVCs /<br/>apps needing fs semantics]
    S3 --> S1[Bucket / object workflows /<br/>direct external clients /<br/>SDK access]
    classDef chosen fill:#d1e7dd,stroke:#0a3622
    class POSIX,S3 chosen
```
What came out as requirements¶
- Dual-protocol treatment: WEKAFS/POSIX as primary mount path, S3 as secondary for object workflows.
- POSIX path priority — WEKAFS is the high-performance training mount, not S3.
- S3 path for SDK / external clients — distinct from POSIX path; not a substitute.
- Validation required against actual deployment — assessment is documentation-based.
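The dual-protocol split can be sketched as a simple routing rule — the workload-class names below are invented for illustration, not the platform's actual taxonomy:

```python
# Hypothetical routing of workload classes to WEKA access protocols,
# mirroring the dual-protocol decision: POSIX primary, S3 secondary.
POSIX_WORKLOADS = {"training", "notebook", "k8s-pvc"}     # need filesystem semantics
S3_WORKLOADS = {"bucket-sync", "external-client", "sdk"}  # object workflows

def storage_protocol(workload: str) -> str:
    if workload in POSIX_WORKLOADS:
        return "wekafs"   # high-performance POSIX mount, the primary path
    if workload in S3_WORKLOADS:
        return "s3"       # secondary object path, not a substitute mount
    raise ValueError(f"unclassified workload: {workload}")
```

The deliberate asymmetry — an unclassified workload raises rather than defaulting to S3 — reflects the requirement that S3 is a distinct secondary path, never a silent fallback for POSIX.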
Source: Storage_WEKA_Capability_Assessment_v1.md. Related: Storage_Sharing_and_IAM_Model_v1.md, Storage_IAM_User_Flows_v1.md, Storage_Provider_Capability_Model_v1.md.
3. Slurm UI options¶
Slurm has no built-in web UI. The platform compared three approaches:
```mermaid
flowchart TB
    classDef chosen fill:#d1e7dd,stroke:#0a3622
    classDef defer fill:#fff3cd,stroke:#332701
    Q{Slurm UI for the platform}
    Q --> N["Native thin UI<br/>(detail page + extension panels)"]:::chosen
    Q --> OOD["Open OnDemand iframed<br/>(community UI)"]:::defer
    Q --> SW["Slurm-web iframed<br/>(SchedMD UI)"]:::defer
    N --> N1["Slurm Runtime card<br/>Slurm Workers card<br/>Instance Members card<br/>Instance Operations card"]:::chosen
```
Why native first¶
- Management operations (deploy, scale, rotate credentials) are well served by app-runtime panels.
- Visibility (queue state, node utilization, GPU metrics) is the gap; it is deferred until the embedded UI gateway is in place.
- Iframed third-party UIs require the embedded UI gateway contract first.
What came out as requirements¶
- Native thin UI shipped — extension panels in `packages/web/src/lib/apps/slurm-instance-panels.tsx`.
- Embedded UI gateway contract is a hard dependency for any third-party Slurm UI (Open OnDemand, Slurm-web).
- Slurm CLI proxying must route through the node-agent task model, not direct SSH from the platform API.
- Visibility gap (queue, utilization) tracked separately under the app platform gap tracker.
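What "route through the node-agent task model, not direct SSH" could look like for the CLI requirement above — a sketch with invented names (`slurm_cli_task`, the `slurm.cli` task type, the allow-list) rather than the platform's real contract. The platform API translates a CLI request into a structured, allow-listed task payload instead of opening a shell:

```python
import shlex

# Hypothetical allow-list: only specific Slurm commands may be proxied.
SLURM_ALLOWED = {"squeue", "sinfo", "scontrol"}

def slurm_cli_task(instance_id: str, command: str) -> dict:
    """Translate a Slurm CLI request into a typed node-agent task payload."""
    argv = shlex.split(command)
    if not argv or argv[0] not in SLURM_ALLOWED:
        raise PermissionError(f"command not in Slurm proxy allow-list: {command}")
    return {
        "task_type": "slurm.cli",   # a typed task, executed by node-agent
        "instance_id": instance_id,
        "argv": argv,               # structured argv, never a raw shell string
    }
```

Passing `argv` as a list rather than a shell string is the same discipline as the no-raw-SSH rule: the node-agent executes a known command, not arbitrary text.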
Source: Slurm_UI_Options_v1.md.
4. Embedded UI gateway¶
Comparison of how to host third-party / workload UIs inside the platform shell:
```mermaid
flowchart TB
    classDef chosen fill:#d1e7dd,stroke:#0a3622
    classDef rejected fill:#f8d7da,stroke:#42101e
    classDef defer fill:#fff3cd,stroke:#332701
    Q{How to expose app UIs<br/>inside the workload shell?}
    Q --> A[Platform-owned reverse proxy<br/>+ subdomain isolation]:::chosen
    Q --> B[Pure iframe with shared cookies]:::rejected
    Q --> C[Pure link-out]:::defer
    A --> A1[Auth + session + cookie + WS +<br/>CSP all owned by platform]:::chosen
    B --> B1[Rejected: cookie + origin model<br/>incompatible with multi-tenant<br/>security boundary]:::rejected
    C --> C1[Fallback only when<br/>app cannot be safely embedded]:::defer
```
The rule the comparison codified¶
Embedding an app UI is not an iframe styling task. It is an auth, session, cookie, WebSocket, and support-boundary decision.
What came out as requirements¶
- Reverse-proxy route shape owned by platform.
- Auth and session ownership by platform, not by the embedded app.
- Cookie and origin rules that prevent the embedded app from escaping the platform's origin boundary.
- WebSocket behavior standardized.
- CSP and frame policy expectations explicit.
- Explicit link-out fallback criteria — when an app cannot be safely embedded.
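A way to picture those requirements together — with invented domain names and a placeholder cookie value, not the contract's actual configuration: the platform-owned proxy gives each embedded app its own subdomain and sets the session cookie, CSP, and frame policy itself, so the app never owns any of them.

```python
def gateway_headers(app_slug: str, platform_domain: str = "apps.example.internal") -> dict:
    """Response headers a platform-owned reverse proxy might enforce
    for an embedded app UI served on its own isolated subdomain."""
    return {
        # Session cookie scoped to the app's own subdomain only, so one
        # embedded app can never read another app's (or the shell's) session.
        "Set-Cookie": (
            f"app_session=<opaque>; Domain={app_slug}.{platform_domain}; "
            "Secure; HttpOnly; SameSite=Strict; Path=/"
        ),
        # Only the platform shell is allowed to frame the app.
        "Content-Security-Policy": f"frame-ancestors https://{platform_domain}",
    }
```

Note the absence of any header chosen by the embedded app: in this model the app's own `Set-Cookie` or CSP would be stripped or rewritten by the proxy, which is what "auth and session ownership by platform" means in practice.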
Source: Embedded_UI_Gateway_Contract_v1.md.
5. H200 MAAS bundle fit analysis¶
Question asked: how well does the current GPUaaS model fit the H200 MAAS automation bundle without changing core scope?
The bundle contained four layers:
- MAAS server bootstrap and tuning (`install_maas_3_7.sh`, `tune_maas.sh`, `postgres_tuning.sh`)
- Site/network/operator configuration (`site.env`, inventory files, `roce_ips.csv`)
- Node deployment scripts
- ROCE-specific networking setup
```mermaid
flowchart LR
    classDef gpa fill:#e8f5e9,stroke:#2e7d32
    classDef bun fill:#fff3e0,stroke:#e65100
    GP[GPUaaS owns:<br/>MAAS-backed node lifecycle<br/>orchestration]:::gpa
    GP -.does NOT own.-> B1[MAAS server bootstrap]:::bun
    GP -.does NOT own.-> B2[Site/network config files]:::bun
    GP -.does NOT own.-> B3[CSV-driven inventories]:::bun
    GP --> R1[Data-model driven<br/>not file/CSV/script driven]:::gpa
    GP --> R2[Site-specific personality<br/>via controlled site/profile inputs<br/>not copied into core logic]:::gpa
```
What came out as requirements¶
- Boundary kept: GPUaaS owns node lifecycle orchestration; doesn't own MAAS server bootstrap and tuning.
- Site profile pattern: hardware tags + firmware profile tags + per-site cloud-init helper. No CSV files in core logic.
- ROCE support added as a fabric option alongside InfiniBand. Slot metadata carries fabric type explicitly.
- No scope change triggered — bundle informed but did not reshape the model.
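The site-profile pattern can be sketched as a typed input model — every name below (`SiteProfile`, `provisioning_input`, the tag and fabric values) is a hypothetical illustration of the shape, not the platform's real schema. Site personality lives in one controlled object; core node-lifecycle logic consumes that model and never reads a CSV or bundle script directly:

```python
from dataclasses import dataclass

@dataclass
class SiteProfile:
    """Hypothetical site-profile input: the only place site personality lives."""
    site_id: str
    hardware_tags: list          # e.g. ["h200", "roce"]
    firmware_profile: str        # e.g. "h200-fw-2024Q4"
    fabric: str                  # "infiniband" or "roce", carried in slot metadata
    cloud_init_helper: str = ""  # name of a per-site cloud-init snippet

def provisioning_input(profile: SiteProfile, node_id: str) -> dict:
    """Core node-lifecycle logic sees a typed input, not roce_ips.csv."""
    if profile.fabric not in {"infiniband", "roce"}:
        raise ValueError(f"unknown fabric: {profile.fabric}")
    return {
        "node_id": node_id,
        "tags": sorted(set(profile.hardware_tags) | {profile.firmware_profile}),
        "fabric": profile.fabric,
        "cloud_init_helper": profile.cloud_init_helper,
    }
```

Validating `fabric` at the boundary is where "ROCE as a fabric option alongside InfiniBand" shows up: fabric is explicit, typed data, not something inferred from which script a site happened to run.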
Source: H200_MAAS_Fit_Analysis_v1.md.
6. OpenClaw integration¶
Question: should OpenClaw (a private/self-hosted AI assistant) be a platform feature or an app on the platform?
```mermaid
flowchart LR
    classDef chosen fill:#d1e7dd,stroke:#0a3622
    classDef rejected fill:#f8d7da,stroke:#42101e
    Q{OpenClaw integration mode}
    Q --> A[As an app on the platform<br/>uses vLLM backend in an allocation]:::chosen
    Q --> B[As a platform feature<br/>built into the control plane]:::rejected
    A --> A1[Composes with vLLM running as<br/>its own app on a 1-GPU slice<br/>or CPU-only allocation]:::chosen
```
What came out as requirements¶
- OpenClaw as an app — not a platform feature. Reuses the app-runtime primitives.
- vLLM + OpenClaw composability — different allocations possible (vLLM on a 4-GPU slice, OpenClaw on a 1-GPU slice or CPU-only).
- Non-GPU workload support — OpenClaw can run on a CPU-only allocation if supported, without weakening GPU slice isolation or billing. Currently it would need a 1-GPU slice as a workaround.
- CPU-only allocation track stays an open product question, with the platform model designed not to block it.
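The composability requirement reduces to a simple shape — sketched here with invented names and illustrative GPU counts, not real platform types: each app gets its own allocation, billed and isolated independently, and the pair composes without either one becoming a platform feature.

```python
from dataclasses import dataclass

@dataclass
class Allocation:
    """A slice an app runs on; gpus could drop to 0 once CPU-only lands."""
    app: str
    gpus: int

def compose(backend: Allocation, frontend: Allocation) -> dict:
    """OpenClaw (frontend) talks to vLLM (backend) across two allocations.

    Each app is metered on its own slice; the platform's feature surface
    is unchanged by adding either one.
    """
    return {
        "backend": {"app": backend.app, "gpus": backend.gpus},
        "frontend": {"app": frontend.app, "gpus": frontend.gpus},
        "total_gpus": backend.gpus + frontend.gpus,
    }

# vLLM on a 4-GPU slice; OpenClaw on a 1-GPU slice (the current workaround
# until CPU-only allocations exist, at which point gpus=0 would be legal).
stack = compose(Allocation("vllm", 4), Allocation("openclaw", 1))
```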
Source: OpenClaw_App_Integration_and_Platform_Composition_v1.md.
Cross-cutting takeaways¶
```mermaid
mindmap
  root((Build-vs-buy themes))
    Stay in control
      No raw-SSH side channels
      No third-party node mutations
      Auth/session/cookies platform-owned
    Reuse platform primitives
      Apps run on allocations
      Single typed-task contract
      Single ledger
      Single audit
    Layer carefully
      WEKAFS primary, S3 secondary
      Native UI first, embedded later
      Self-managed RKE2 first, managed later
    Fit, don't redesign
      MAAS bundle = site profile input
      Don't copy CSV/script glue into core
      Bundle informs, doesn't reshape
```