Slurm UI Options & Integration Spec v1

Purpose:
- Document the available UI options for Slurm workloads on the platform.
- Define how each option integrates with the platform shell, app extension registry, and embedded UI pattern.
- Provide layout wireframes and implementation guidance for each approach.

Inputs:
- Current Slurm instance detail page (packages/web/app/apps/instances/[instanceId]/page.tsx)
- Current Slurm extension panels (packages/web/src/lib/apps/slurm-instance-panels.tsx)
- Navigation redesign (doc/product/Navigation_Redesign_App_Platform_v1.md)
- Kubernetes platform design (doc/architecture/Kubernetes_Platform_Options_v1.md)
- App extension registry (packages/web/src/lib/apps/extensions.ts)

Hard dependencies (from doc/architecture/Kubernetes_Platform_Options_v1.md §3–4):
- Embedded UI gateway contract (§4.3) must be defined before any embedded UI tab ships.
- Node mutation must stay inside platform-controlled execution (§3.1) — Slurm CLI proxying via the native thin UI must route through the node-agent task model, not direct SSH from the platform API.
- App-runtime recovery model (§4.4) should cover partial monitoring deploys and Grafana/OOD service failures.


1. Current State

Slurm has no built-in web UI. It is entirely CLI-driven (srun, sbatch, squeue, sinfo, sacct).

The platform currently provides a native Slurm instance detail page with:
- Slurm Runtime card — controller status, cluster name, partition, slurmctld/slurmd state, bootstrap credential management
- Slurm Workers card — add workers, drain/remove workers, worker history
- Instance Members card — generic member list with status badges
- Instance Operations card — member operation history

This is management-focused (deploy, scale, credential rotate). There is no visibility into what the Slurm cluster is actually doing — no job queue, no node utilization, no GPU metrics.


2. The Three UI Options

Option 1: Open OnDemand (Full HPC Portal)

Option 2: Grafana + Slurm Exporter (Monitoring Dashboard)

Option 3: Native Thin UI (Platform-Built Job View)

Each option serves a different user need. They are not mutually exclusive and can be combined.


3. Option 1: Open OnDemand

3.1 What It Is

Open OnDemand (OOD) is an open-source web portal for HPC clusters, developed by the Ohio Supercomputer Center. It is the most widely used HPC web interface, deployed at many US national labs and universities. License: MIT.

It provides:
- Job submission and monitoring — submit batch jobs, view queue, cancel jobs
- Interactive applications — launch Jupyter notebooks, VS Code, RStudio, desktop sessions directly on compute nodes
- File browser — upload/download/edit files on the cluster filesystem
- Shell access — browser-based terminal to the cluster
- Cluster status — node utilization, partition overview

3.2 Architecture on the Platform

Open OnDemand runs as an Apache-based web application on a node that can reach the Slurm controller. It talks to Slurm via the standard CLI tools (squeue, sbatch, etc.) and optionally via the Slurm REST API (slurmrestd).

┌─────────────────────────────────────────────────────────────┐
│ Platform                                                     │
│                                                              │
│  App Instance: my-slurm (slug: slurm-reference)              │
│    ├── Controller allocation (slurmctld + slurmdbd)          │
│    ├── Worker allocations (slurmd, GPU nodes)                │
│    └── [OOD allocation or co-hosted on controller]           │
│                                                              │
│  Companion App Instance: my-slurm-ood (slug: open-ondemand) │
│    ├── Runs on controller allocation (or dedicated node)     │
│    ├── Apache + Passenger + OOD portal                       │
│    ├── Auth: OIDC via platform Keycloak                      │
│    └── Talks to slurmctld via Slurm CLI / slurmrestd         │
└─────────────────────────────────────────────────────────────┘

Deployment model: OOD can be deployed two ways:

A. Companion app instance — separate catalog entry (slug: open-ondemand) that references an existing Slurm instance. Deployed as a second app instance in the same project, configured to point at the Slurm controller.

B. Co-hosted on controller — OOD installed on the same allocation as slurmctld. Simpler, but mixes concerns. Suitable for single-node or dev clusters.

3.3 Platform Integration

App Manifest

slug: open-ondemand
display_name: "Open OnDemand"
runtime_backend: bare_metal
versions:
  - version: "3.1"
    placement_schema:
      type: object
      required: [host_allocation_id, slurm_instance_id]
      properties:
        host_allocation_id:
          type: string
          format: uuid
          description: "Allocation to run OOD on (can be Slurm controller)"
        slurm_instance_id:
          type: string
          format: uuid
          description: "Slurm app instance to connect to"
    config_schema:
      type: object
      properties:
        portal_title:
          type: string
          default: "GPU Cloud HPC Portal"
        interactive_apps:
          type: array
          items:
            type: string
            enum: [jupyter, vscode, rstudio, desktop]
          default: [jupyter, vscode]
    ui:
      endpoint:
        type: allocation_port
        component_key: portal
        port: 443
        path: "/"
        protocol: https
      auth:
        strategy: oidc_proxy
      embedding:
        allowed: true
        sandbox: "allow-same-origin allow-scripts allow-forms allow-popups"

Auth

Open OnDemand supports OIDC natively via mod_auth_openidc (Apache module). Configure it to use the platform's Keycloak realm:

OIDCProviderMetadataURL https://keycloak.example.com/realms/gpuaas/.well-known/openid-configuration
OIDCClientID            ood-portal
OIDCClientSecret        <from Keycloak>
OIDCRedirectURI         https://ood.example.com/oidc
OIDCCryptoPassphrase    <random>

When embedded in the platform via reverse proxy, the OIDC session flows through transparently — same Keycloak realm, SSO is automatic.

Prerequisite: The embedded UI gateway contract (see Kubernetes Platform Design v2 §4.3) must be defined before this embedding ships. That contract covers: reverse-proxy route shape, cookie behavior under proxying, WebSocket upgrade handling (OOD uses WS for its shell and interactive apps), CSRF model, session expiry/logout propagation, and when to fall back to link-out. OOD exercises all of these — it is not a simple read-only iframe.

Frontend Extension

const openOnDemandExtension: AppShellExtension = {
  slug: "open-ondemand",
  runtimeBackend: "bare_metal",
  deploy: {
    requiredInputs: {
      controllerAllocations: "single",
    },
    missingInputsMessage: "Open OnDemand requires a host allocation and a Slurm instance reference.",
    summaryMessage: "Deploy Open OnDemand portal connected to an existing Slurm cluster.",
    serviceAccountEmptyStateMessage: "No active service accounts exist in this project yet.",
    serviceAccountHelpText: "Optional machine identity for portal automation.",
    accessCredentialHelpText: "",
    buildPlacementIntent: ({ controllerAllocationIDs }) => ({
      host_allocation_id: controllerAllocationIDs[0] ?? "",
      // slurm_instance_id populated via additional UI field (instance picker)
    }),
    isPlacementComplete: ({ controllerAllocationIDs }) =>
      controllerAllocationIDs.length > 0,
  },
};

3.4 Layout: Instance Detail with Embedded OOD

When OOD is deployed as a companion app, the platform instance detail page gains an embedded UI tab:

┌── Top Bar ────────────────────────────────────────────────────────┐
│ [logo]  Tenant / Project ▾                    [$] [th] [n] [usr] │
├───────────────────────────────────────────────────────────────────┤
│ ┌─ Sidebar ──┐                                                   │
│ │ WORKLOADS  │  ┌── my-slurm-ood (Open OnDemand) ─────────────┐ │
│ │ ▸ my-slurm │  │                                              │ │
│ │ ▸ my-ood ← │  │  Overview │ Portal │ Config │ Logs           │ │
│ │            │  ├──────────────────────────────────────────────┤ │
│ │ INFRA      │  │                                              │ │
│ │ ...        │  │  ┌── Embedded Open OnDemand ──────────────┐  │ │
│ │            │  │  │                                        │  │ │
│ │            │  │  │  [Jobs]  [Files]  [Clusters]  [Apps]   │  │ │
│ │            │  │  │                                        │  │ │
│ │            │  │  │  Active Jobs              Submit Job    │  │ │
│ │            │  │  │  ┌─────────────────────┐  ┌─────────┐  │  │ │
│ │            │  │  │  │ job-001  running  4G │  │ Script  │  │  │ │
│ │            │  │  │  │ job-002  pending  2G │  │ [     ] │  │  │ │
│ │            │  │  │  │ job-003  completed 8 │  │ GPUs: 4 │  │  │ │
│ │            │  │  │  └─────────────────────┘  │ [Submit]│  │  │ │
│ │            │  │  │                           └─────────┘  │  │ │
│ │            │  │  │  Interactive Apps                       │  │ │
│ │            │  │  │  [Jupyter] [VS Code] [Desktop]         │  │ │
│ │            │  │  │                                        │  │ │
│ │            │  │  └────────────────────────────────────────┘  │ │
│ │            │  │                                              │ │
│ └────────────┘  └──────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘

Alternatively, if OOD is linked from the Slurm instance itself (not a separate workload entry), it appears as a tab on the Slurm instance detail:

┌── my-slurm (Slurm) ──────────────────────────────────────────────┐
│                                                                   │
│  Overview │ HPC Portal │ Workers │ Members │ Operations │ Cost    │
│                                                                   │
│  "HPC Portal" tab = embedded Open OnDemand                       │
│  (reverse-proxied, OIDC auth, same Keycloak realm)               │
└───────────────────────────────────────────────────────────────────┘

3.7 Filesystem and Storage Model

OOD is only compelling if users have a coherent filesystem story. The file browser, interactive apps, and job output all depend on where files live.

The question OOD forces: where is the user's home directory, where does job output go, and is any of it persistent across cluster rebuilds?

Storage options on the platform

| Model | What | Persistent across rebuild? | OOD file browser? |
|---|---|---|---|
| Allocation-local | Files live on the controller allocation's local disk. /home/{user} is local. | No — lost when allocation is released or cluster is decommissioned. | Yes, but ephemeral. |
| Shared NFS | NFS server on controller, exported to workers. /shared mounted on all nodes. | No — NFS data is on the controller allocation. | Yes, shared across nodes. |
| Platform storage (S3) | Platform's existing storage service (packages/services/storage/). Mounted via s3fs or goofys. | Yes — survives cluster rebuild. | Possible but poor UX (S3 semantics). |
| External NFS/Lustre | Tenant-provided external storage, mounted at deploy time via config_schema. | Yes — tenant owns lifecycle. | Yes, if mounted correctly. |

Recommendation for Phase 4 (OOD):

  1. MVP: Allocation-local storage only. Home directories and job output live on the controller. OOD file browser works. Users accept that decommissioning the cluster loses local files. The platform warns about this in the decommission confirmation (decision-first UX principle).

  2. Next: Add storage_mount to the Slurm placement intent, allowing tenants to mount their platform storage bucket at a known path (/workspace). The OOD file browser sees it, job output can be directed there, and it survives cluster rebuild. (A type sketch follows this list.)

  3. Later: External NFS/Lustre mount support via config_schema. For tenants with existing HPC storage infrastructure.
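
A minimal sketch of what the storage_mount addition could look like, expressed as TypeScript types mirroring the placement intent. All field names here (storage_mount, bucket, mount_path) are illustrative assumptions, not the final schema; the real shape would live in the Slurm app manifest's placement_schema:

// Hypothetical storage mount for the "Next" phase above.
interface SlurmStorageMount {
  bucket: string;       // platform storage bucket to mount (e.g. via s3fs/goofys)
  mount_path: string;   // known path on all cluster nodes, e.g. "/workspace"
  read_only?: boolean;  // job output needs read-write; datasets may be read-only
}

interface SlurmPlacementIntent {
  controller_allocation_id: string;
  worker_allocation_ids: string[];
  storage_mount?: SlurmStorageMount;  // absent = allocation-local storage only (MVP)
}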

Without a storage answer, the following OOD features are misleading:
- File browser shows ephemeral files that will vanish
- Interactive Jupyter notebooks are not saved persistently
- Job output is unrecoverable after cluster decommission

This is not a blocker for Phase 4, but the decommission UX must make the storage ephemerality visible. Users must understand what they will lose.

3.8 Pros / Cons

| Pros | Cons |
|---|---|
| Full HPC user experience — job submission, interactive apps, file browser | Heavyweight — full Apache + Passenger stack |
| Battle-tested at scale (national labs, universities) | Configuration complexity — OOD cluster config, interactive app templates |
| OIDC native — clean SSO integration | Requires filesystem access to cluster shared storage |
| Active open-source community (MIT license) | OOD upgrades are a separate lifecycle from Slurm upgrades |
| Interactive Jupyter/VS Code sessions on GPU nodes | May overlap with platform's existing terminal feature |

3.9 Best For

Research teams, ML engineers, and data scientists who want a familiar HPC portal experience. Users who need interactive GPU sessions (Jupyter on Slurm) rather than just batch job submission.


4. Option 2: Grafana + Slurm Exporter

4.1 What It Is

A lightweight monitoring stack that exports Slurm metrics to Prometheus and visualizes them in Grafana. No job submission — purely observability.

Components:
- prometheus-slurm-exporter — Go binary that scrapes Slurm CLI output and exposes Prometheus metrics (node states, job counts, GPU utilization, queue wait times)
- Prometheus — metrics storage and alerting
- Grafana — dashboards and visualization

4.2 Architecture on the Platform

┌─────────────────────────────────────────────────────────────┐
│ Slurm App Instance                                           │
│                                                              │
│  Controller allocation                                       │
│    ├── slurmctld                                             │
│    ├── prometheus-slurm-exporter (:9341)                     │
│    ├── Prometheus (:9090)                                    │
│    └── Grafana (:3000)                                       │
│                                                              │
│  Worker allocations                                          │
│    ├── slurmd                                                │
│    └── node-exporter (:9100)  (optional per-node metrics)    │
└─────────────────────────────────────────────────────────────┘

Deployment model: The exporter, Prometheus, and Grafana are co-hosted on the controller allocation. They are deployed as part of the Slurm app lifecycle — the app worker installs them alongside slurmctld.

This is not a separate app instance — it is a built-in monitoring layer for Slurm.

4.3 Metrics Exposed

The prometheus-slurm-exporter provides:

| Metric | Description |
|---|---|
| slurm_nodes_alloc / slurm_nodes_idle / slurm_nodes_down | Node state counts |
| slurm_cpus_alloc / slurm_cpus_idle | CPU allocation |
| slurm_gpus_alloc / slurm_gpus_idle | GPU allocation (with TRES plugin) |
| slurm_queue_pending / slurm_queue_running | Job queue depth |
| slurm_scheduler_queue_size | Scheduler backlog |
| slurm_job_* | Per-job metrics (optional, high cardinality) |

Combined with node-exporter on workers:

| Metric | Description |
|---|---|
| nvidia_gpu_utilization | Per-GPU utilization (via dcgm-exporter or nvidia_gpu_exporter) |
| nvidia_gpu_memory_used | GPU memory usage |
| node_cpu_* / node_memory_* | Standard host metrics |
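
As a concrete illustration, a consumer (platform code or an ad-hoc script) could read any of these series through Prometheus's standard HTTP query API. A hedged sketch: the /api/v1/query endpoint and response shape are Prometheus's documented API; the base URL assumes the co-hosted layout from §4.2 (Prometheus on the controller at :9090):

// Read the current slurm_gpus_alloc value via a Prometheus instant query.
async function queryGpuAllocation(prometheusBase: string): Promise<number> {
  const query = encodeURIComponent("slurm_gpus_alloc");
  const res = await fetch(`${prometheusBase}/api/v1/query?query=${query}`);
  if (!res.ok) throw new Error(`Prometheus query failed: ${res.status}`);
  const body = await res.json();
  // Instant query result shape: data.result[0].value = [timestamp, "value"]
  const sample = body.data?.result?.[0]?.value;
  return sample ? Number(sample[1]) : 0;
}

// Usage: queryGpuAllocation("http://controller:9090").then(console.log)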

4.4 Grafana Dashboards

Pre-built dashboards shipped with the Slurm app:

Dashboard 1: Cluster Overview

┌── Slurm Cluster Overview ────────────────────────────────────┐
│                                                               │
│  ┌─ Nodes ────────┐  ┌─ GPUs ─────────┐  ┌─ Jobs ────────┐  │
│  │ 8 total        │  │ 32 total       │  │ 12 running    │  │
│  │ 6 allocated    │  │ 24 allocated   │  │  4 pending    │  │
│  │ 1 idle         │  │  8 idle        │  │  0 failed     │  │
│  │ 1 down         │  │                │  │               │  │
│  └────────────────┘  └────────────────┘  └───────────────┘  │
│                                                               │
│  ┌─ GPU Utilization (time series) ───────────────────────┐   │
│  │  ▁▂▃▅▇█████████████████████████▇▅▃▂▁▁▂▃▅▇████████   │   │
│  │  0%                                             100%  │   │
│  └───────────────────────────────────────────────────────┘   │
│                                                               │
│  ┌─ Queue Wait Time (histogram) ─────────────────────────┐   │
│  │  ▇                                                     │   │
│  │  █▅                                                    │   │
│  │  ██▃                                                   │   │
│  │  ███▂▁                                                 │   │
│  │  <1m  <5m  <15m  <1h  >1h                              │   │
│  └───────────────────────────────────────────────────────┘   │
└───────────────────────────────────────────────────────────────┘

Dashboard 2: Per-Node GPU Detail

┌── Node GPU Detail ───────────────────────────────────────────┐
│                                                               │
│  Node: worker-03  │  4× A100-80GB  │  Status: allocated      │
│                                                               │
│  GPU 0:  util 95%  mem 72/80 GB  temp 71°C  power 298W      │
│  GPU 1:  util 88%  mem 65/80 GB  temp 68°C  power 285W      │
│  GPU 2:  util 92%  mem 78/80 GB  temp 73°C  power 302W      │
│  GPU 3:  util 0%   mem  0/80 GB  temp 34°C  power  45W      │
│                                                               │
│  ┌─ GPU Utilization Over Time ───────────────────────────┐   │
│  │  GPU0 ████████████████████████████████████████████    │   │
│  │  GPU1 ██████████████████████████████████              │   │
│  │  GPU2 ████████████████████████████████████████████    │   │
│  │  GPU3 ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁    │   │
│  └───────────────────────────────────────────────────────┘   │
└───────────────────────────────────────────────────────────────┘

4.5 Platform Integration

Auth

Grafana supports OIDC natively (auth.generic_oauth configuration). Point at the platform's Keycloak realm. Users see dashboards scoped to their Slurm cluster.

[auth.generic_oauth]
enabled = true
name = GPUaaS
client_id = grafana-slurm
client_secret = <from Keycloak>
scopes = openid profile email
auth_url = https://keycloak.example.com/realms/gpuaas/protocol/openid-connect/auth
token_url = https://keycloak.example.com/realms/gpuaas/protocol/openid-connect/token
api_url = https://keycloak.example.com/realms/gpuaas/protocol/openid-connect/userinfo

Embedding

Prerequisite: The embedded UI gateway contract (Kubernetes Platform Design v2 §4.3) must be defined before this tab ships. Grafana is a simpler case than OOD (mostly read-only dashboards, no WebSocket for basic panels), making it a good first target to validate the gateway contract against. However, the contract must still specify:
- reverse-proxy route resolution from instance placement
- cookie domain and path scoping under proxy
- CSP frame-ancestors coordination with the platform origin
- session TTL alignment (Grafana session vs platform session)
- logout propagation (platform logout should invalidate the Grafana OIDC session)

Grafana supports iframe embedding natively (allow_embedding = true in config). The embedded dashboards render inside the platform shell:

┌── my-slurm (Slurm) ──────────────────────────────────────────┐
│                                                                │
│  Overview │ Monitoring │ Workers │ Members │ Operations         │
│                                                                │
│  ┌── Embedded Grafana ──────────────────────────────────────┐  │
│  │                                                          │  │
│  │  Cluster Overview  │  GPU Detail  │  Job Queue           │  │
│  │  (Grafana dashboard tabs)                                │  │
│  │                                                          │  │
│  │  [time range picker]  [auto-refresh: 30s]                │  │
│  │                                                          │  │
│  │  ... Grafana panels ...                                  │  │
│  │                                                          │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘

No Separate App Instance Needed

Because Grafana + Prometheus are part of the Slurm deployment itself (installed by the app worker alongside slurmctld), they appear as a tab on the Slurm instance detail page, not as a separate workload in the sidebar.

The Slurm extension registers the monitoring endpoint:

// Addition to slurmExtension in extensions.ts
const slurmExtension: AppShellExtension = {
  // ... existing fields ...
  ui: {
    tabs: [
      {
        key: "monitoring",
        label: "Monitoring",
        endpoint: {
          type: "allocation_port",
          component_key: "controller",
          port: 3000,          // Grafana
          path: "/d/slurm-overview",
          protocol: "https",
        },
        auth: { strategy: "oidc_proxy" },
        embedding: { allowed: true },
      },
    ],
  },
};
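
For orientation, a hedged sketch of how the platform shell might resolve this allocation_port endpoint into a URL it can embed. The /proxy route shape and the resolveEmbeddedUrl name are assumptions; pinning down the real route shape is exactly what the gateway contract (§4.3) exists to do:

interface UiEndpoint {
  type: "allocation_port";
  component_key: string;      // which instance member hosts the UI ("controller")
  port: number;               // 3000 for Grafana
  path: string;               // initial dashboard path
  protocol: "http" | "https";
}

// The browser never talks to the allocation directly: the platform gateway
// terminates auth (oidc_proxy) and forwards to the resolved host:port.
function resolveEmbeddedUrl(instanceId: string, ep: UiEndpoint): string {
  return `/proxy/instances/${instanceId}/${ep.component_key}/${ep.port}${ep.path}`;
}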

4.6 Prometheus Storage and Retention

Prometheus on the controller node is fine for dev, but production monitoring requires explicit retention, disk sizing, and failure behavior decisions.

Defaults for per-instance deployment:

| Parameter | Value | Rationale |
|---|---|---|
| Retention period | --storage.tsdb.retention.time=7d | 7 days covers most debugging windows without excessive disk |
| Retention size | --storage.tsdb.retention.size=5GB | Hard cap prevents disk pressure on controller |
| Scrape interval | 30s | Balance between resolution and disk usage; GPU metrics change slowly |
| WAL compression | --storage.tsdb.wal-compression | Reduces WAL size ~50% |

Failure behavior:
- If Prometheus fills its allocated disk: it stops ingesting but Grafana stays up showing stale data. The app worker health check should detect Prometheus disk pressure and report a degraded state via the app-runtime status API (a probe sketch follows below).
- If Prometheus crashes: Grafana shows "No data" panels. The app worker restarts Prometheus via a node-agent task. Historical data is lost if the WAL is corrupted.
- If the controller allocation is released: all monitoring data is lost. This is acceptable for per-instance monitoring — the data's value is bounded by the cluster's lifetime.
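
A sketch of what that health probe could look like, assuming a Node.js app worker. The /-/healthy endpoint is Prometheus's standard liveness endpoint and fs.statfs is standard Node; the data directory path and the free-space threshold are illustrative assumptions:

import { statfs } from "node:fs/promises";

// Liveness: is Prometheus responding at all?
async function prometheusAlive(base: string): Promise<boolean> {
  try {
    const res = await fetch(`${base}/-/healthy`, { signal: AbortSignal.timeout(3000) });
    return res.ok;
  } catch {
    return false;
  }
}

// Disk pressure on the TSDB volume; report "degraded" before the 5GB cap bites.
async function diskPressure(dataDir = "/var/lib/prometheus"): Promise<boolean> {
  const s = await statfs(dataDir);
  return s.bavail * s.bsize < 1 * 1024 ** 3; // <1 GiB free (threshold illustrative)
}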

Why per-instance first, shared later:
- Per-instance is simpler — no cross-tenant isolation concerns, no shared Prometheus infrastructure to operate, no network topology requirements.
- Shared monitoring (platform-managed Prometheus/Grafana) is more efficient but requires tenant-scoped data isolation, platform-operated infrastructure, and cross-allocation network routing for scraping. Defer to a later phase.
- The shared monitoring pattern in §11 is the future direction, but Phase 3 ships per-instance monitoring only.

4.7 Pros / Cons

| Pros | Cons |
|---|---|
| Lightweight — small resource footprint | No job submission (monitoring only) |
| GPU metrics (dcgm-exporter integration) | Users still need CLI for job management |
| Pre-built community dashboards | Grafana customization can be time-consuming |
| OIDC native, iframe-embeddable | Another service to maintain on the controller |
| Alerting built-in (Grafana alerts) | Prometheus storage on controller (disk pressure) |
| No separate app instance needed | |

4.8 Best For

Platform operators and users who need visibility into cluster utilization and GPU metrics but manage jobs via CLI (sbatch, squeue). Also valuable as the monitoring layer regardless of which job management UI is chosen.


5. Option 3: Native Thin UI (Platform-Built)

5.1 What It Is

Extend the existing Slurm instance detail page with a lightweight job queue view that calls Slurm CLI commands via the node-agent task API. No external software needed — everything is built into the platform UI.

5.1.1 Scope Boundary (MVP)

The native thin UI is also a product contract question: how much of Slurm's control surface does the platform own? Without a clear boundary, Phase 1 can sprawl into reimplementing half of scontrol.

Phase 1 MVP — strict scope:

| In scope | Slurm command | Mutating? | Audited? |
|---|---|---|---|
| List job queue | squeue --json | No | No |
| List node status | sinfo --json | No | No |
| Submit batch job | sbatch | Yes | Yes — audit log row |
| Cancel job | scancel {job_id} | Yes | Yes — audit log row |

Explicitly deferred (not Phase 1):

| Deferred | Slurm command | Why deferred |
|---|---|---|
| Node drain/resume | scontrol update NodeName=... State=drain\|resume | Admin action — needs RBAC boundary (who can drain?). Existing worker drain/remove in the Workers card covers the platform-level action. |
| Job output/logs | sacct, file reads from job output directory | Requires filesystem access model (see §3.7). Cannot show job stdout without knowing where output files live. |
| Job detail | scontrol show job {id} | Nice-to-have, not MVP. Queue view shows enough. |
| Partition management | scontrol | Admin-only, low frequency, CLI is fine. |
| srun (interactive) | srun --pty | Requires terminal relay — use existing platform terminal or OOD. |

This boundary means the native thin UI is: read the queue, read the nodes, submit a batch job, cancel a job. Four operations. That is the contract.

5.2 Architecture

Platform UI  →  Platform API  →  Node Agent (on controller)  →  Slurm CLI
    GET    /api/v1/app-runtime/instances/{id}/slurm/queue             squeue
    GET    /api/v1/app-runtime/instances/{id}/slurm/nodes             sinfo
    POST   /api/v1/app-runtime/instances/{id}/slurm/jobs              sbatch
    DELETE /api/v1/app-runtime/instances/{id}/slurm/jobs/{job_id}     scancel

The platform API proxies structured Slurm queries to the node-agent running on the controller allocation. The node-agent executes the Slurm CLI commands and returns parsed JSON output. The platform UI renders the results natively.

Control boundary alignment: This is the only option that fully satisfies the platform's control boundary constraint (Kubernetes Platform Design v2 §3.1). All Slurm CLI execution flows through the node-agent typed-task model — the platform API never SSHes into the controller directly. Slurm query and mutation operations should be new task types (e.g., app.slurm.query, app.slurm.submit) registered in the node-agent task catalog, with audit logging for mutations (job submission and cancellation). A sketch of these task types follows.
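
A hedged sketch of what those task types could look like in the node-agent task catalog. The shape below is illustrative (it mirrors the typed-task model described above, not an existing platform type); an app.slurm.cancel type is added on the assumption that scancel is a distinct mutation:

type SlurmTaskType = "app.slurm.query" | "app.slurm.submit" | "app.slurm.cancel";

interface SlurmTask {
  type: SlurmTaskType;
  instanceId: string;
  // Queries: which read-only CLI to run; the agent returns parsed --json output.
  command?: "squeue" | "sinfo";
  // Mutations: batch script content for submit, job id for cancel.
  script?: string;
  jobId?: number;
  // Mutations must produce an audit-log row; queries do not.
  audited: boolean;
}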

This also means the platform API does not need the embedded UI gateway contract to ship the native thin UI — it uses existing platform auth, existing node-agent execution, and renders natively. No iframe, no reverse proxy, no cookie scoping. This is why it can ship before Options 1 and 2.

5.3 API Surface (MVP — four operations only)

# Addition to openapi.draft.yaml (app-runtime section)
# Scope: queue, nodes, submit, cancel. Nothing else in Phase 1.

/api/v1/app-runtime/instances/{instance_id}/slurm/queue:
  get:
    summary: List Slurm job queue
    description: Returns parsed output of squeue --json
    parameters:
      - name: state
        in: query
        schema: { type: string, enum: [pending, running, completed, failed] }
      - name: user
        in: query
        schema: { type: string }
    responses:
      200:
        content:
          application/json:
            schema:
              type: object
              properties:
                jobs:
                  type: array
                  items:
                    type: object
                    properties:
                      job_id: { type: integer }
                      name: { type: string }
                      user: { type: string }
                      state: { type: string }
                      partition: { type: string }
                      nodes: { type: string }
                      gpus: { type: integer }
                      time_elapsed: { type: string }
                      time_limit: { type: string }
                      submit_time: { type: string }

/api/v1/app-runtime/instances/{instance_id}/slurm/nodes:
  get:
    summary: List Slurm node status
    description: Returns parsed output of sinfo --json
    responses:
      200:
        content:
          application/json:
            schema:
              type: object
              properties:
                nodes:
                  type: array
                  items:
                    type: object
                    properties:
                      hostname: { type: string }
                      state: { type: string }
                      cpus_total: { type: integer }
                      cpus_alloc: { type: integer }
                      gpus_total: { type: integer }
                      gpus_alloc: { type: integer }
                      memory_total_mb: { type: integer }
                      memory_alloc_mb: { type: integer }
                      partitions: { type: array, items: { type: string } }

/api/v1/app-runtime/instances/{instance_id}/slurm/jobs:
  post:
    summary: Submit a Slurm batch job
    requestBody:
      content:
        application/json:
          schema:
            type: object
            required: [script]
            properties:
              script: { type: string, description: "Batch script content" }
              name: { type: string }
              partition: { type: string }
              gpus: { type: integer }
              nodes: { type: integer }
              time_limit: { type: string, description: "e.g. 01:00:00" }
    responses:
      201:
        content:
          application/json:
            schema:
              type: object
              properties:
                job_id: { type: integer }

/api/v1/app-runtime/instances/{instance_id}/slurm/jobs/{job_id}:
  delete:
    summary: Cancel a Slurm job
    description: Calls scancel {job_id}. Audited.
    responses:
      204:
        description: Job cancellation accepted
      404:
        description: Job not found
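
A hedged usage sketch against this surface, from platform web code. The paths match the OpenAPI draft above; auth is the existing platform session (native rendering), so no token handling appears:

// List the queue, optionally filtered by state.
async function listQueue(instanceId: string, state?: string) {
  const qs = state ? `?state=${encodeURIComponent(state)}` : "";
  const res = await fetch(`/api/v1/app-runtime/instances/${instanceId}/slurm/queue${qs}`);
  if (!res.ok) throw new Error(`queue fetch failed: ${res.status}`);
  const { jobs } = await res.json();
  return jobs as Array<{ job_id: number; name: string; state: string; gpus: number }>;
}

// Cancel a job; 204 = accepted, 404 = unknown job id.
async function cancelJob(instanceId: string, jobId: number): Promise<void> {
  const res = await fetch(
    `/api/v1/app-runtime/instances/${instanceId}/slurm/jobs/${jobId}`,
    { method: "DELETE" },
  );
  if (res.status === 404) throw new Error(`job ${jobId} not found`);
  if (res.status !== 204) throw new Error(`cancel failed: ${res.status}`);
}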

5.4 Layout: Enhanced Instance Detail

┌── my-slurm (Slurm) ──────────────────────────────────────────────┐
│                                                                    │
│  Overview │ Jobs │ Nodes │ Workers │ Members │ Operations          │
│                                                                    │
│  ═══ Jobs tab ═══                                                  │
│                                                                    │
│  ┌─ Submit Job ──────────────────────────────────────────────────┐ │
│  │ Name: [training-run-042       ]  Partition: [gpu    ▾]       │ │
│  │ GPUs: [4   ]  Time limit: [04:00:00]  Nodes: [1   ]         │ │
│  │                                                               │ │
│  │ Script:                                                       │ │
│  │ ┌──────────────────────────────────────────────────────────┐  │ │
│  │ │ #!/bin/bash                                              │  │ │
│  │ │ #SBATCH --job-name=training-run-042                      │  │ │
│  │ │ #SBATCH --gres=gpu:4                                     │  │ │
│  │ │ #SBATCH --time=04:00:00                                  │  │ │
│  │ │                                                          │  │ │
│  │ │ python train.py --model llama --epochs 10                │  │ │
│  │ └──────────────────────────────────────────────────────────┘  │ │
│  │                                                    [Submit]   │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                    │
│  ┌─ Job Queue ───────────────────────────────────────────────────┐ │
│  │ Filter: [All ▾]  [Refresh]                    Auto: 10s      │ │
│  │                                                               │ │
│  │  ID      Name              State    GPUs  Node       Elapsed     │ │
│  │  10042   training-run-041  RUNNING  4     worker-03  02:34  [x] │ │
│  │  10043   eval-checkpoint   RUNNING  2     worker-01  00:12  [x] │ │
│  │  10044   training-run-042  PENDING  4     --         --     [x] │ │
│  │  10040   preprocessing     COMPLTD  1     worker-02  00:45      │ │
│  │                                                               │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                    │
│  ═══ Nodes tab ═══                                                 │
│                                                                    │
│  ┌─ Cluster Nodes ───────────────────────────────────────────────┐ │
│  │                                                               │ │
│  │  Hostname    State      CPUs      GPUs      Memory    Part    │ │
│  │  worker-01   alloc      32/64     2/4       128/256   gpu     │ │
│  │  worker-02   idle       0/64      0/4       0/256     gpu     │ │
│  │  worker-03   alloc      64/64     4/4       245/256   gpu     │ │
│  │  worker-04   down*      --        --        --        gpu     │ │
│  │                                                               │ │
│  │  * worker-04: Node not responding (last seen 12m ago)         │ │
│  │    (drain/resume deferred — see §5.1.1 / Workers card)        │ │
│  │                                                               │ │
│  └───────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘

5.5 Pros / Cons

| Pros | Cons |
|---|---|
| Zero external dependencies — fully native | Limited functionality vs OOD (no interactive apps, no file browser) |
| Consistent UX — same design language as rest of platform | API surface to maintain (Slurm CLI parsing, versioning) |
| Fast to build for basic job view | No Jupyter/VS Code integration on compute nodes |
| No auth complexity — uses existing platform auth | Slurm CLI output format can change between versions |
| No iframe/embedding needed | Only useful for Slurm (not reusable for other HPC schedulers) |
| Decision-first UX — show GPU cost before job submission | Platform becomes coupled to Slurm internals |

5.6 Best For

Teams that primarily use sbatch scripts and want quick visibility into queue status from the same UI where they manage their cluster. Good for smaller teams that don't need interactive notebook sessions on compute nodes.


6. Comparison Matrix

| Dimension | Open OnDemand | Grafana + Exporter | Native Thin UI |
|---|---|---|---|
| Job submission | Yes (web form + script) | No | Yes (basic) |
| Job monitoring | Yes (queue, history) | No (metrics only) | Yes (queue view) |
| Interactive apps | Yes (Jupyter, VS Code, desktop) | No | No |
| File browser | Yes | No | No |
| GPU metrics | No (not built-in) | Yes (dcgm-exporter) | No |
| Cluster utilization | Basic (node overview) | Yes (detailed dashboards) | Basic (node table) |
| Alerting | No | Yes (Grafana alerts) | No |
| Auth | OIDC (mod_auth_openidc) | OIDC (Grafana config) | Platform auth (native) |
| Embedding | iframe (OIDC proxy) | iframe (allow_embedding) | Native (no iframe) |
| Deploy complexity | High (Apache + Passenger + OOD config) | Medium (3 binaries + config) | Low (API + UI code) |
| Resource footprint | Medium-high | Low-medium | None (runs on platform) |
| Separate app instance? | Yes (companion) or tab | No (co-hosted on controller) | No (native) |
| App platform validation | Yes — tests companion app + embedded UI pattern | Partially — tests embedded UI only | No — platform-internal feature |
| Build effort | Small (deploy + configure) | Small (deploy + dashboards) | Medium (API + UI + Slurm parsing) |
| Maintenance | OOD upgrades + Slurm version compat | Prometheus storage + dashboard upkeep | Slurm CLI parsing maintenance |
| Requires gateway contract? | Yes — hardest case (WS, file upload, complex cookies) | Yes — simpler case (mostly read-only) | No — native rendering, no proxy |
| Control boundary | OOD runs on allocation, platform proxies — acceptable | Grafana runs on allocation, platform proxies — acceptable | Node-agent task model — cleanest control path |
| Audit coverage | OOD actions not audited by platform (runs independently) | Read-only — no mutations to audit | Full audit — job submit/cancel via node-agent task log |

7. Platform Primitive Dependencies

The Kubernetes Platform Design v2 (§3–4) identifies hard constraints and missing primitives that apply equally to Slurm UI integration. This section maps each option against those dependencies.

7.1 Gateway Contract (v2 §4.3)

Options 1 and 2 embed third-party UIs and therefore require the gateway contract. The contract must define:

| Concern | Grafana (Option 2) | Open OnDemand (Option 1) |
|---|---|---|
| Reverse-proxy route resolution | Simple — Grafana on fixed port on controller | Complex — OOD on Apache, path rewriting |
| Cookie domain scoping | Grafana session cookie under proxy origin | OOD + Apache mod_auth_openidc cookies, potential domain conflict |
| WebSocket upgrade | Optional (Grafana live, can be disabled) | Required (shell access, interactive apps) |
| CSRF model | Grafana has built-in CSRF; proxy must pass tokens | OOD uses Rails CSRF; proxy must preserve headers |
| Session TTL alignment | Grafana OIDC session vs platform session | OOD OIDC session + Apache session vs platform session |
| Logout propagation | Platform logout → revoke Grafana OIDC session | Platform logout → revoke OOD OIDC + Apache sessions |
| Fallback to link-out | Acceptable for Grafana (low-stakes) | Less acceptable (daily-use tool, SSO breakage is disruptive) |

Grafana is the simpler case and should validate the gateway contract first. OOD should only ship after Grafana proves the contract works.
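
To make the contract concrete, a hedged sketch of it as a typed surface. Every field maps to a row in the table above; the names are illustrative, not an existing platform type:

interface EmbeddedUiGatewayContract {
  // Resolve an instance tab to the upstream the proxy should forward to.
  resolveRoute(instanceId: string, tabKey: string): {
    host: string;
    port: number;
    pathPrefix: string;
  };
  authStrategy: "oidc_proxy";              // shared by Grafana and OOD
  cookieScope: { domain: string; pathPrefix: string };
  websocket: { upgradeAllowed: boolean };  // Grafana: optional; OOD: required
  csp: { frameAncestors: string[] };       // must include the platform origin
  session: {
    ttlSeconds: number;                    // align upstream session with platform session
    logoutPropagation: "revoke_upstream" | "none";
  };
  fallbackToLinkOut: boolean;              // per-app criteria from the table above
}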

7.2 Node Mutation Control (v2 §3.1)

| Option | How it touches nodes | Alignment |
|---|---|---|
| Native Thin UI | Node-agent typed tasks (app.slurm.query, app.slurm.submit) | Full compliance — platform owns execution and audit |
| Grafana + Exporter | Installed by app worker during Slurm bootstrap; read-only after | Acceptable — no runtime node mutation from Grafana |
| Open OnDemand | OOD runs sbatch/squeue directly via Slurm CLI on the host | Partial — OOD bypasses the node-agent for Slurm operations. Acceptable because OOD runs on the controller allocation (not arbitrary nodes) and Slurm itself is the authority. But OOD job submissions are not audited by the platform. |

7.3 Recovery Model (v2 §4.4)

Each option introduces failure modes the app-runtime recovery model must handle:

| Option | Failure modes | Recovery needed |
|---|---|---|
| Native Thin UI | Node-agent task timeout, Slurm CLI unavailable | Existing node-agent retry semantics; degrade gracefully in UI |
| Grafana + Exporter | Grafana process crash, Prometheus disk full, exporter crash | App worker health check on controller; restart via node-agent task; alert on monitoring-down (meta-monitoring) |
| Open OnDemand | Apache crash, OOD app error, OIDC session desync | Companion app instance status reporting; decommission/redeploy path; platform detects unhealthy OOD and surfaces it in the workloads sidebar |

7.4 Dependency Summary

                          ┌─────────────────────────┐
                          │ Embedded UI Gateway      │
                          │ Contract (v2 §4.3)       │
                          └──────────┬──────────────┘
                          ┌──────────┴──────────────┐
                          │                         │
                    ┌─────▼──────┐           ┌──────▼──────┐
                    │ Grafana    │           │ Open        │
                    │ (Phase 3)  │           │ OnDemand    │
                    │            │           │ (Phase 4)   │
                    └────────────┘           └─────────────┘

  ┌──────────────┐
  │ Native Thin  │   ← No gateway dependency
  │ UI (Phase 1) │   ← Uses node-agent tasks (existing)
  │              │   ← Uses platform auth (existing)
  └──────────────┘

8. Recommendation: Layer All Three

These options serve different needs and are complementary. The recommended approach layers them:

┌── my-slurm (Slurm) ──────────────────────────────────────────────┐
│                                                                    │
│  Overview │ Jobs │ Monitoring │ HPC Portal │ Workers │ Operations  │
│                                                                    │
│  Overview     = existing Slurm Runtime card + Workers card         │
│  Jobs         = Option 3 (native thin UI — job queue + submit)     │
│  Monitoring   = Option 2 (embedded Grafana — GPU metrics)          │
│  HPC Portal   = Option 1 (embedded OOD — full HPC experience)     │
│  Workers      = existing worker management                         │
│  Operations   = existing operation history                         │
└────────────────────────────────────────────────────────────────────┘

Implementation Order (revised to respect platform primitive dependencies)

The Kubernetes Platform Design v2 (§4) identifies four platform primitives that must exist before embedded UIs can ship. This changes the phase order from the naive "lowest effort first" to "dependency-correct order":

| Phase | What | Prerequisite | Why this order |
|---|---|---|---|
| Phase 1 | Native Thin UI (Jobs tab) | Node-agent task types for Slurm query/submit | No platform primitive dependencies — uses existing auth, existing node-agent, renders natively. Gives job queue visibility immediately. Builds the Slurm proxy API reusable by CLI and SDK. |
| Phase 2 | Embedded UI gateway contract | Phase 1 validated; design work | Define reverse-proxy route shape, auth strategy contract, cookie/session model, WS upgrade, CSP, logout propagation. This is a platform primitive, not a Slurm feature. |
| Phase 3 | Grafana + Slurm Exporter | Gateway contract (Phase 2) | First embedded UI. Grafana is the simplest case (mostly read-only, minimal WS, well-understood OIDC). Validates the gateway contract with low risk. GPU metrics are universally valuable. |
| Phase 4 | Open OnDemand | Gateway contract proven (Phase 3) | Full HPC portal. Exercises the hardest gateway cases (WS for shell, interactive apps, file uploads, complex cookie state). By Phase 4 the gateway contract is validated. |

Why This Order (corrected)

  1. Native Jobs tab first because it has zero dependencies on new platform primitives. It uses the node-agent task model (already exists), platform auth (already exists), and renders natively (no iframe, no proxy). It ships the #1 user request (job queue visibility) while the gateway contract is being designed.

  2. Gateway contract second because both Grafana and OOD depend on it. Designing the contract in parallel with Phase 1 is fine, but no embedded UI tab should ship before the contract is defined and reviewed. The contract covers: reverse-proxy route resolution, supported auth strategies, cookie domain scoping, WebSocket upgrade handling, CSP frame-ancestors, session TTL alignment, logout propagation, and fallback-to-link-out criteria.

  3. Grafana third because it is the simplest embedded UI case — mostly read-only dashboards, minimal WebSocket usage, well-understood OIDC support. It validates the gateway contract with low blast radius. If the gateway contract has gaps, Grafana will expose them cheaply.

  4. OOD fourth because it exercises every hard case in the gateway contract: WebSocket for shell and interactive apps, file upload multipart handling, complex cookie and session state, Apache mod_auth_openidc interaction with the platform's OIDC proxy. By Phase 4, those patterns are proven.

What Each Phase Adds to the Nav

After Phase 1 (native thin UI):

WORKLOADS
  ▸ my-slurm (Slurm)  →  Overview │ Jobs │ Nodes │ Workers │ Operations

After Phase 2 (gateway contract): No visible nav change. Platform primitive is internal.

After Phase 3 (Grafana):

WORKLOADS
  ▸ my-slurm (Slurm)  →  Overview │ Jobs │ Nodes │ Monitoring │ Workers │ Operations

After Phase 4 (Open OnDemand):

WORKLOADS
  ▸ my-slurm (Slurm)  →  Overview │ Jobs │ Nodes │ Monitoring │ HPC Portal │ Workers │ Operations


9. Known Tradeoff: OOD Audit Bypass

This is a conscious product and security decision, not an open question.

When users submit jobs via Open OnDemand, those submissions execute directly on the controller allocation via Slurm CLI. They bypass the platform's node-agent task model and audit trail entirely. The platform sees the Slurm cluster running but has no record of individual job submissions, cancellations, or interactive sessions initiated through OOD.

9.1 What the platform sees vs what OOD does

| Action | Platform audit trail | OOD audit trail |
|---|---|---|
| Cluster deploy | Yes — app instance lifecycle | N/A |
| Worker add/drain/remove | Yes — member operations | N/A |
| Job submit via native UI (Phase 1) | Yes — node-agent task log | N/A |
| Job submit via OOD | No | OOD Apache access log (on the allocation) |
| Job cancel via OOD | No | OOD Apache access log |
| Interactive Jupyter session via OOD | No | OOD session log |
| File upload/download via OOD | No | OOD Apache access log |

9.2 Why this is acceptable (for now)

  1. OOD operates within Slurm's authorization model. OOD authenticates users via OIDC and maps them to OS users on the cluster. Slurm enforces per-user quotas, partition access, and job limits. OOD cannot do more than the user's Slurm permissions allow.

  2. The platform audits the blast radius, not the workload. The platform's audit contract covers infrastructure mutations (allocations, node state, credentials). Individual workload actions (which training job ran, which notebook opened) are analogous to individual commands typed in an SSH session — the platform doesn't audit those either.

  3. OOD logs exist, just not in the platform. Apache access logs and Slurm accounting (sacct) capture everything. They live on the allocation, not in the platform's audit_logs table. If compliance requires centralized audit, those logs can be forwarded to the platform's logging pipeline — but that is a separate integration, not a Phase 4 blocker.

9.3 When this becomes unacceptable

  • If the platform needs to enforce per-job cost attribution (billing per job, not per allocation). OOD-submitted jobs would not have platform-tracked cost.
  • If compliance requires that all user actions on platform-managed infrastructure are centrally audited. OOD's logs-on-allocation model does not satisfy this.
  • If the platform needs to enforce job-level policies (e.g., max GPU-hours per job, job submission rate limits). OOD bypasses the platform API entirely.

If any of these become requirements, the options are:

  1. Proxy OOD through the platform API — heavy OOD customization, breaks the "deploy OOD as-is" model.
  2. Forward OOD/Slurm logs to the platform — lighter, preserves OOD as-is, but audit is eventually-consistent (log shipping delay).
  3. Accept the gap and document it — current recommendation.

9.4 Decision

For Phase 4: accept the audit gap and document it clearly in the product. The instance detail page should show a notice when the HPC Portal tab is active: "Actions taken in the HPC Portal are managed by Slurm, not tracked in platform audit logs." This is honest UX, consistent with the decision-first principle.
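
A minimal sketch of that notice as a React component in the platform web package (the component name and styling hook are illustrative assumptions):

// Rendered when the HPC Portal tab is active on the Slurm instance detail page.
export function HpcPortalAuditNotice() {
  return (
    <div role="note" className="audit-notice">
      Actions taken in the HPC Portal are managed by Slurm, not tracked in
      platform audit logs.
    </div>
  );
}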


11. Shared Monitoring Pattern

The Grafana + Exporter model (Option 2) generalizes beyond Slurm. Every app type benefits from embedded monitoring:

App Exporter Dashboards
Slurm prometheus-slurm-exporter + dcgm-exporter Cluster overview, GPU detail, job queue depth
RKE2 / Kubernetes kube-state-metrics + dcgm-exporter Pod status, GPU scheduling, node health
Ray ray-prometheus-exporter Cluster resources, task queue, object store
MLflow mlflow-prometheus-exporter (custom) Experiment counts, model registry, artifact storage

This suggests the monitoring tab should be a platform-level primitive, not an app-specific feature. The app manifest declares its exporter endpoints, and the platform deploys Prometheus + Grafana as a shared monitoring sidecar:

# In app manifest
monitoring:
  exporters:
    - name: slurm
      port: 9341
      path: /metrics
    - name: dcgm
      port: 9400
      path: /metrics
  grafana_dashboards:
    - source: bundled          # shipped with the app
      path: /dashboards/slurm-overview.json
    - source: bundled
      path: /dashboards/gpu-detail.json

This keeps monitoring consistent across all apps and avoids each app developer reinventing the Prometheus + Grafana deployment.

Deployment model decision: Per-instance monitoring ships first (Phase 3). Shared/platform-managed monitoring is the future direction but requires tenant isolation, cross-allocation networking, and platform-operated Prometheus infrastructure. The shared pattern described above is the target architecture; Phase 3 is per-instance only. The manifest schema should be designed for the shared model from the start (so apps don't need to change when the platform migrates), but the runtime implementation in Phase 3 deploys Prometheus + Grafana on the controller allocation alongside the app.


12. Open Questions

  1. OOD deployment model — companion app instance or co-hosted on controller? Companion is cleaner (separate lifecycle) but adds complexity (inter-instance references). Co-hosted is simpler but mixes concerns.

  2. Slurm REST API (slurmrestd) vs CLI parsing — the native thin UI (Phase 1) can use either. slurmrestd provides structured JSON output but requires additional setup. CLI parsing (squeue --json) is simpler but format varies by Slurm version. Slurm 23.02+ has stable --json output.

  3. Interactive sessions — OOD's killer feature is interactive Jupyter/VS Code on compute nodes. Should this be available without OOD, via the platform's existing terminal feature? The platform already has browser terminal support — extending it to launch Jupyter on a specific Slurm node could cover this use case.

  4. Job cost attribution — the native thin UI could show estimated cost per job (GPU-hours × rate). This requires the platform to know the SKU rate and the job's GPU allocation. Worth building? It would be a differentiating feature vs. OOD.

  5. Gateway contract ownership — the embedded UI gateway contract is a platform primitive, not a Slurm feature. Who owns the design? Should it be a separate architecture doc (doc/architecture/Embedded_UI_Gateway_Spec.md) or a section in the navigation redesign doc? It blocks Phases 3 and 4 of this spec and also blocks the Rancher embedded UI in the Kubernetes design.

  6. Future tab concepts — the current tab model covers Jobs (native), Monitoring (embedded Grafana), and HPC Portal (embedded OOD). Two additional tab concepts may be worth exploring independently of OOD:

     - Access — SSH credentials, terminal launch, kubeconfig-style access tokens for the Slurm cluster. Currently scattered across the Overview and platform terminal. A dedicated tab could consolidate access methods.
     - Files — lightweight file browser for job output directories, without the full OOD stack. Could use the node-agent task model to list/read files on the controller. Simpler than OOD's file browser but covers the "where is my job output?" use case.

     Neither is committed — they are options to evaluate if OOD is deferred or if users need file/access capabilities before Phase 4.