Slurm UI Options & Integration Spec v1

Purpose:
- Document the available UI options for Slurm workloads on the platform.
- Define how each option integrates with the platform shell, app extension registry, and embedded UI pattern.
- Provide layout wireframes and implementation guidance for each approach.

Inputs:
- Current Slurm instance detail page (packages/web/app/apps/instances/[instanceId]/page.tsx)
- Current Slurm extension panels (packages/web/src/lib/apps/slurm-instance-panels.tsx)
- Navigation redesign (doc/product/Navigation_Redesign_App_Platform_v1.md)
- Kubernetes platform design (doc/architecture/Kubernetes_Platform_Options_v1.md)
- App extension registry (packages/web/src/lib/apps/extensions.ts)

Hard dependencies (from doc/architecture/Kubernetes_Platform_Options_v1.md §3–4):
- Embedded UI gateway contract (§4.3) must be defined before any embedded UI tab ships.
- Node mutation must stay inside platform-controlled execution (§3.1) — Slurm CLI proxying via the native thin UI must route through the node-agent task model, not direct SSH from the platform API.
- App-runtime recovery model (§4.4) should cover partial monitoring deploys and Grafana/OOD service failures.


1. Current State

Slurm has no built-in web UI. It is entirely CLI-driven (srun, sbatch, squeue, sinfo, sacct).

The platform currently provides a native Slurm instance detail page with:
- Slurm Runtime card — controller status, cluster name, partition, slurmctld/slurmd state, bootstrap credential management
- Slurm Workers card — add workers, drain/remove workers, worker history
- Instance Members card — generic member list with status badges
- Instance Operations card — member operation history

This is management-focused (deploy, scale, credential rotate). There is no visibility into what the Slurm cluster is actually doing — no job queue, no node utilization, no GPU metrics.


2. The Three UI Options

Option 1: Open OnDemand (Full HPC Portal)

Option 2: Grafana + Slurm Exporter (Monitoring Dashboard)

Option 3: Native Thin UI (Platform-Built Job View)

Each option serves a different user need. They are not mutually exclusive and can be combined.


3. Option 1: Open OnDemand

3.1 What It Is

Open OnDemand (OOD) is an open-source web portal for HPC clusters, developed by the Ohio Supercomputer Center. It is the most widely used HPC web interface, deployed at many US national labs and universities. License: MIT.

It provides:
- Job submission and monitoring — submit batch jobs, view queue, cancel jobs
- Interactive applications — launch Jupyter notebooks, VS Code, RStudio, desktop sessions directly on compute nodes
- File browser — upload/download/edit files on the cluster filesystem
- Shell access — browser-based terminal to the cluster
- Cluster status — node utilization, partition overview

3.2 Architecture on the Platform

Open OnDemand runs as an Apache-based web application on a node that can reach the Slurm controller. It talks to Slurm via the standard CLI tools (squeue, sbatch, etc.) and optionally via the Slurm REST API (slurmrestd).

┌─────────────────────────────────────────────────────────────┐
│ Platform                                                     │
│                                                              │
│  App Instance: my-slurm (slug: slurm-reference)              │
│    ├── Controller allocation (slurmctld + slurmdbd)          │
│    ├── Worker allocations (slurmd, GPU nodes)                │
│    └── [OOD allocation or co-hosted on controller]           │
│                                                              │
│  Companion App Instance: my-slurm-ood (slug: open-ondemand) │
│    ├── Runs on controller allocation (or dedicated node)     │
│    ├── Apache + Passenger + OOD portal                       │
│    ├── Auth: OIDC via platform Keycloak                      │
│    └── Talks to slurmctld via Slurm CLI / slurmrestd         │
└─────────────────────────────────────────────────────────────┘

Deployment model: OOD can be deployed two ways:

A. Companion app instance — separate catalog entry (slug: open-ondemand) that references an existing Slurm instance. Deployed as a second app instance in the same project, configured to point at the Slurm controller.

B. Co-hosted on controller — OOD installed on the same allocation as slurmctld. Simpler, but mixes concerns. Suitable for single-node or dev clusters.

3.3 Platform Integration

App Manifest

slug: open-ondemand
display_name: "Open OnDemand"
runtime_backend: bare_metal
versions:
  - version: "3.1"
    placement_schema:
      type: object
      required: [host_allocation_id, slurm_instance_id]
      properties:
        host_allocation_id:
          type: string
          format: uuid
          description: "Allocation to run OOD on (can be Slurm controller)"
        slurm_instance_id:
          type: string
          format: uuid
          description: "Slurm app instance to connect to"
    config_schema:
      type: object
      properties:
        portal_title:
          type: string
          default: "GPU Cloud HPC Portal"
        interactive_apps:
          type: array
          items:
            type: string
            enum: [jupyter, vscode, rstudio, desktop]
          default: [jupyter, vscode]
    ui:
      endpoint:
        type: allocation_port
        component_key: portal
        port: 443
        path: "/"
        protocol: https
      auth:
        strategy: oidc_proxy
      embedding:
        allowed: true
        sandbox: "allow-same-origin allow-scripts allow-forms allow-popups"

Auth

Open OnDemand supports OIDC natively via mod_auth_openidc (Apache module). Configure it to use the platform's Keycloak realm:

OIDCProviderMetadataURL https://keycloak.example.com/realms/gpuaas/.well-known/openid-configuration
OIDCClientID            ood-portal
OIDCClientSecret        <from Keycloak>
OIDCRedirectURI         https://ood.example.com/oidc
OIDCCryptoPassphrase    <random>

When embedded in the platform via reverse proxy, the OIDC session flows through transparently — same Keycloak realm, SSO is automatic.

Prerequisite: The embedded UI gateway contract (see Kubernetes Platform Design v2 §4.3) must be defined before this embedding ships. That contract covers: reverse-proxy route shape, cookie behavior under proxying, WebSocket upgrade handling (OOD uses WS for its shell and interactive apps), CSRF model, session expiry/logout propagation, and when to fall back to link-out. OOD exercises all of these — it is not a simple read-only iframe.

Frontend Extension

const openOnDemandExtension: AppShellExtension = {
  slug: "open-ondemand",
  runtimeBackend: "bare_metal",
  deploy: {
    requiredInputs: {
      controllerAllocations: "single",
    },
    missingInputsMessage: "Open OnDemand requires a host allocation and a Slurm instance reference.",
    summaryMessage: "Deploy Open OnDemand portal connected to an existing Slurm cluster.",
    serviceAccountEmptyStateMessage: "No active service accounts exist in this project yet.",
    serviceAccountHelpText: "Optional machine identity for portal automation.",
    accessCredentialHelpText: "",
    buildPlacementIntent: ({ controllerAllocationIDs }) => ({
      host_allocation_id: controllerAllocationIDs[0] ?? "",
      // slurm_instance_id populated via additional UI field (instance picker)
    }),
    isPlacementComplete: ({ controllerAllocationIDs }) =>
      controllerAllocationIDs.length > 0,
  },
};

3.4 Layout: Instance Detail with Embedded OOD

When OOD is deployed as a companion app, the platform instance detail page gains an embedded UI tab:

┌── Top Bar ────────────────────────────────────────────────────────┐
│ [logo]  Tenant / Project ▾                    [$] [th] [n] [usr] │
├───────────────────────────────────────────────────────────────────┤
│ ┌─ Sidebar ──┐                                                   │
│ │ WORKLOADS  │  ┌── my-slurm-ood (Open OnDemand) ─────────────┐ │
│ │ ▸ my-slurm │  │                                              │ │
│ │ ▸ my-ood ← │  │  Overview │ Portal │ Config │ Logs           │ │
│ │            │  ├──────────────────────────────────────────────┤ │
│ │ INFRA      │  │                                              │ │
│ │ ...        │  │  ┌── Embedded Open OnDemand ──────────────┐  │ │
│ │            │  │  │                                        │  │ │
│ │            │  │  │  [Jobs]  [Files]  [Clusters]  [Apps]   │  │ │
│ │            │  │  │                                        │  │ │
│ │            │  │  │  Active Jobs              Submit Job    │  │ │
│ │            │  │  │  ┌─────────────────────┐  ┌─────────┐  │  │ │
│ │            │  │  │  │ job-001  running  4G │  │ Script  │  │  │ │
│ │            │  │  │  │ job-002  pending  2G │  │ [     ] │  │  │ │
│ │            │  │  │  │ job-003  completed 8 │  │ GPUs: 4 │  │  │ │
│ │            │  │  │  └─────────────────────┘  │ [Submit]│  │  │ │
│ │            │  │  │                           └─────────┘  │  │ │
│ │            │  │  │  Interactive Apps                       │  │ │
│ │            │  │  │  [Jupyter] [VS Code] [Desktop]         │  │ │
│ │            │  │  │                                        │  │ │
│ │            │  │  └────────────────────────────────────────┘  │ │
│ │            │  │                                              │ │
│ └────────────┘  └──────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘

Alternatively, if OOD is linked from the Slurm instance itself (not a separate workload entry), it appears as a tab on the Slurm instance detail:

┌── my-slurm (Slurm) ──────────────────────────────────────────────┐
│                                                                   │
│  Overview │ HPC Portal │ Workers │ Members │ Operations │ Cost    │
│                                                                   │
│  "HPC Portal" tab = embedded Open OnDemand                       │
│  (reverse-proxied, OIDC auth, same Keycloak realm)               │
└───────────────────────────────────────────────────────────────────┘

3.7 Filesystem and Storage Model

OOD is only compelling if users have a coherent filesystem story. The file browser, interactive apps, and job output all depend on where files live.

The question OOD forces: where is the user's home directory, where does job output go, and is any of it persistent across cluster rebuilds?

Storage options on the platform

| Model | What | Persistent across rebuild? | OOD file browser? |
|---|---|---|---|
| Allocation-local | Files live on the controller allocation's local disk. /home/{user} is local. | No — lost when allocation is released or cluster is decommissioned. | Yes, but ephemeral. |
| Shared NFS | NFS server on controller, exported to workers. /shared mounted on all nodes. | No — NFS data is on the controller allocation. | Yes, shared across nodes. |
| Platform storage (S3) | Platform's existing storage service (packages/services/storage/). Mounted via s3fs or goofys. | Yes — survives cluster rebuild. | Possible but poor UX (S3 semantics). |
| External NFS/Lustre | Tenant-provided external storage, mounted at deploy time via config_schema. | Yes — tenant owns lifecycle. | Yes, if mounted correctly. |

Recommendation for Phase 4 (OOD):

  1. MVP: Allocation-local storage only. Home directories and job output live on the controller. OOD file browser works. Users accept that decommissioning the cluster loses local files. The platform warns about this in the decommission confirmation (decision-first UX principle).

  2. Next: Add storage_mount to the Slurm placement intent, allowing tenants to mount their platform storage bucket at a known path (/workspace). The OOD file browser sees it, job output can be directed there, and it survives cluster rebuild. (A type sketch follows this list.)

  3. Later: External NFS/Lustre mount support via config_schema. For tenants with existing HPC storage infrastructure.
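
A minimal sketch of what the storage_mount addition could look like, expressed as TypeScript types mirroring the placement intent. All field names here (storage_mount, bucket, mount_path) are illustrative assumptions, not the final schema; the real shape would live in the Slurm app manifest's placement_schema:

// Hypothetical storage mount for the "Next" phase above.
interface SlurmStorageMount {
  bucket: string;       // platform storage bucket to mount (e.g. via s3fs/goofys)
  mount_path: string;   // known path on all cluster nodes, e.g. "/workspace"
  read_only?: boolean;  // job output needs read-write; datasets may be read-only
}

interface SlurmPlacementIntent {
  controller_allocation_id: string;
  worker_allocation_ids: string[];
  storage_mount?: SlurmStorageMount;  // absent = allocation-local storage only (MVP)
}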

Without a storage answer, the following OOD features are misleading:
- File browser shows ephemeral files that will vanish
- Interactive Jupyter notebooks are not saved persistently
- Job output is unrecoverable after cluster decommission

This is not a blocker for Phase 4, but the decommission UX must make the storage ephemerality visible. Users must understand what they will lose.

3.8 Pros / Cons

| Pros | Cons |
|---|---|
| Full HPC user experience — job submission, interactive apps, file browser | Heavyweight — full Apache + Passenger stack |
| Battle-tested at scale (national labs, universities) | Configuration complexity — OOD cluster config, interactive app templates |
| OIDC native — clean SSO integration | Requires filesystem access to cluster shared storage |
| Active open-source community (MIT license) | OOD upgrades are a separate lifecycle from Slurm upgrades |
| Interactive Jupyter/VS Code sessions on GPU nodes | May overlap with platform's existing terminal feature |

3.9 Best For

Research teams, ML engineers, and data scientists who want a familiar HPC portal experience. Users who need interactive GPU sessions (Jupyter on Slurm) rather than just batch job submission.


4. Option 2: Grafana + Slurm Exporter

4.1 What It Is

A lightweight monitoring stack that exports Slurm metrics to Prometheus and visualizes them in Grafana. No job submission — purely observability.

Components:
- prometheus-slurm-exporter — Go binary that scrapes Slurm CLI output and exposes Prometheus metrics (node states, job counts, GPU utilization, queue wait times)
- Prometheus — metrics storage and alerting
- Grafana — dashboards and visualization

4.2 Architecture on the Platform

┌─────────────────────────────────────────────────────────────┐
│ Slurm App Instance                                           │
│                                                              │
│  Controller allocation                                       │
│    ├── slurmctld                                             │
│    ├── prometheus-slurm-exporter (:9341)                     │
│    ├── Prometheus (:9090)                                    │
│    └── Grafana (:3000)                                       │
│                                                              │
│  Worker allocations                                          │
│    ├── slurmd                                                │
│    └── node-exporter (:9100)  (optional per-node metrics)    │
└─────────────────────────────────────────────────────────────┘

Deployment model: The exporter, Prometheus, and Grafana are co-hosted on the controller allocation. They are deployed as part of the Slurm app lifecycle — the app worker installs them alongside slurmctld.

This is not a separate app instance — it is a built-in monitoring layer for Slurm.

4.3 Metrics Exposed

The prometheus-slurm-exporter provides:

| Metric | Description |
|---|---|
| slurm_nodes_alloc / slurm_nodes_idle / slurm_nodes_down | Node state counts |
| slurm_cpus_alloc / slurm_cpus_idle | CPU allocation |
| slurm_gpus_alloc / slurm_gpus_idle | GPU allocation (with TRES plugin) |
| slurm_queue_pending / slurm_queue_running | Job queue depth |
| slurm_scheduler_queue_size | Scheduler backlog |
| slurm_job_* | Per-job metrics (optional, high cardinality) |

Combined with node-exporter on workers:

| Metric | Description |
|---|---|
| nvidia_gpu_utilization | Per-GPU utilization (via dcgm-exporter or nvidia_gpu_exporter) |
| nvidia_gpu_memory_used | GPU memory usage |
| node_cpu_* / node_memory_* | Standard host metrics |
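
As a concrete illustration, a consumer (platform code or an ad-hoc script) could read any of these series through Prometheus's standard HTTP query API. A hedged sketch: the /api/v1/query endpoint and response shape are Prometheus's documented API; the base URL assumes the co-hosted layout from §4.2 (Prometheus on the controller at :9090):

// Read the current slurm_gpus_alloc value via a Prometheus instant query.
async function queryGpuAllocation(prometheusBase: string): Promise<number> {
  const query = encodeURIComponent("slurm_gpus_alloc");
  const res = await fetch(`${prometheusBase}/api/v1/query?query=${query}`);
  if (!res.ok) throw new Error(`Prometheus query failed: ${res.status}`);
  const body = await res.json();
  // Instant query result shape: data.result[0].value = [timestamp, "value"]
  const sample = body.data?.result?.[0]?.value;
  return sample ? Number(sample[1]) : 0;
}

// Usage: queryGpuAllocation("http://controller:9090").then(console.log)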

4.4 Grafana Dashboards

Pre-built dashboards shipped with the Slurm app:

Dashboard 1: Cluster Overview

┌── Slurm Cluster Overview ────────────────────────────────────┐
│                                                               │
│  ┌─ Nodes ────────┐  ┌─ GPUs ─────────┐  ┌─ Jobs ────────┐  │
│  │ 8 total        │  │ 32 total       │  │ 12 running    │  │
│  │ 6 allocated    │  │ 24 allocated   │  │  4 pending    │  │
│  │ 1 idle         │  │  8 idle        │  │  0 failed     │  │
│  │ 1 down         │  │                │  │               │  │
│  └────────────────┘  └────────────────┘  └───────────────┘  │
│                                                               │
│  ┌─ GPU Utilization (time series) ───────────────────────┐   │
│  │  ▁▂▃▅▇█████████████████████████▇▅▃▂▁▁▂▃▅▇████████   │   │
│  │  0%                                             100%  │   │
│  └───────────────────────────────────────────────────────┘   │
│                                                               │
│  ┌─ Queue Wait Time (histogram) ─────────────────────────┐   │
│  │  ▇                                                     │   │
│  │  █▅                                                    │   │
│  │  ██▃                                                   │   │
│  │  ███▂▁                                                 │   │
│  │  <1m  <5m  <15m  <1h  >1h                              │   │
│  └───────────────────────────────────────────────────────┘   │
└───────────────────────────────────────────────────────────────┘

Dashboard 2: Per-Node GPU Detail

┌── Node GPU Detail ───────────────────────────────────────────┐
│                                                               │
│  Node: worker-03  │  4× A100-80GB  │  Status: allocated      │
│                                                               │
│  GPU 0:  util 95%  mem 72/80 GB  temp 71°C  power 298W      │
│  GPU 1:  util 88%  mem 65/80 GB  temp 68°C  power 285W      │
│  GPU 2:  util 92%  mem 78/80 GB  temp 73°C  power 302W      │
│  GPU 3:  util 0%   mem  0/80 GB  temp 34°C  power  45W      │
│                                                               │
│  ┌─ GPU Utilization Over Time ───────────────────────────┐   │
│  │  GPU0 ████████████████████████████████████████████    │   │
│  │  GPU1 ██████████████████████████████████              │   │
│  │  GPU2 ████████████████████████████████████████████    │   │
│  │  GPU3 ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁    │   │
│  └───────────────────────────────────────────────────────┘   │
└───────────────────────────────────────────────────────────────┘

4.5 Platform Integration

Auth

Grafana supports OIDC natively (auth.generic_oauth configuration). Point at the platform's Keycloak realm. Users see dashboards scoped to their Slurm cluster.

[auth.generic_oauth]
enabled = true
name = GPUaaS
client_id = grafana-slurm
client_secret = <from Keycloak>
scopes = openid profile email
auth_url = https://keycloak.example.com/realms/gpuaas/protocol/openid-connect/auth
token_url = https://keycloak.example.com/realms/gpuaas/protocol/openid-connect/token
api_url = https://keycloak.example.com/realms/gpuaas/protocol/openid-connect/userinfo

Embedding

Prerequisite: The embedded UI gateway contract (Kubernetes Platform Design v2 §4.3) must be defined before this tab ships. Grafana is a simpler case than OOD (mostly read-only dashboards, no WebSocket for basic panels), making it a good first target to validate the gateway contract against. However, the contract must still specify:
- reverse-proxy route resolution from instance placement
- cookie domain and path scoping under proxy
- CSP frame-ancestors coordination with the platform origin
- session TTL alignment (Grafana session vs platform session)
- logout propagation (platform logout should invalidate the Grafana OIDC session)

Grafana supports iframe embedding natively (allow_embedding = true in config). The embedded dashboards render inside the platform shell:

┌── my-slurm (Slurm) ──────────────────────────────────────────┐
│                                                                │
│  Overview │ Monitoring │ Workers │ Members │ Operations         │
│                                                                │
│  ┌── Embedded Grafana ──────────────────────────────────────┐  │
│  │                                                          │  │
│  │  Cluster Overview  │  GPU Detail  │  Job Queue           │  │
│  │  (Grafana dashboard tabs)                                │  │
│  │                                                          │  │
│  │  [time range picker]  [auto-refresh: 30s]                │  │
│  │                                                          │  │
│  │  ... Grafana panels ...                                  │  │
│  │                                                          │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘

No Separate App Instance Needed

Because Grafana + Prometheus are part of the Slurm deployment itself (installed by the app worker alongside slurmctld), they appear as a tab on the Slurm instance detail page, not as a separate workload in the sidebar.

The Slurm extension registers the monitoring endpoint:

// Addition to slurmExtension in extensions.ts
const slurmExtension: AppShellExtension = {
  // ... existing fields ...
  ui: {
    tabs: [
      {
        key: "monitoring",
        label: "Monitoring",
        endpoint: {
          type: "allocation_port",
          component_key: "controller",
          port: 3000,          // Grafana
          path: "/d/slurm-overview",
          protocol: "https",
        },
        auth: { strategy: "oidc_proxy" },
        embedding: { allowed: true },
      },
    ],
  },
};
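
For orientation, a hedged sketch of how the platform shell might resolve this allocation_port endpoint into a URL it can embed. The /proxy route shape and the resolveEmbeddedUrl name are assumptions; pinning down the real route shape is exactly what the gateway contract (§4.3) exists to do:

interface UiEndpoint {
  type: "allocation_port";
  component_key: string;      // which instance member hosts the UI ("controller")
  port: number;               // 3000 for Grafana
  path: string;               // initial dashboard path
  protocol: "http" | "https";
}

// The browser never talks to the allocation directly: the platform gateway
// terminates auth (oidc_proxy) and forwards to the resolved host:port.
function resolveEmbeddedUrl(instanceId: string, ep: UiEndpoint): string {
  return `/proxy/instances/${instanceId}/${ep.component_key}/${ep.port}${ep.path}`;
}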

4.6 Prometheus Storage and Retention

Prometheus on the controller node is fine for dev, but production monitoring requires explicit retention, disk sizing, and failure behavior decisions.

Defaults for per-instance deployment:

| Parameter | Value | Rationale |
|---|---|---|
| Retention period | --storage.tsdb.retention.time=7d | 7 days covers most debugging windows without excessive disk |
| Retention size | --storage.tsdb.retention.size=5GB | Hard cap prevents disk pressure on controller |
| Scrape interval | 30s | Balance between resolution and disk usage; GPU metrics change slowly |
| WAL compression | --storage.tsdb.wal-compression | Reduces WAL size ~50% |

Failure behavior:
- If Prometheus fills its allocated disk: it stops ingesting but Grafana stays up showing stale data. The app worker health check should detect Prometheus disk pressure and report a degraded state via the app-runtime status API (a probe sketch follows below).
- If Prometheus crashes: Grafana shows "No data" panels. The app worker restarts Prometheus via a node-agent task. Historical data is lost if the WAL is corrupted.
- If the controller allocation is released: all monitoring data is lost. This is acceptable for per-instance monitoring — the data's value is bounded by the cluster's lifetime.
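
A sketch of what that health probe could look like, assuming a Node.js app worker. The /-/healthy endpoint is Prometheus's standard liveness endpoint and fs.statfs is standard Node; the data directory path and the free-space threshold are illustrative assumptions:

import { statfs } from "node:fs/promises";

// Liveness: is Prometheus responding at all?
async function prometheusAlive(base: string): Promise<boolean> {
  try {
    const res = await fetch(`${base}/-/healthy`, { signal: AbortSignal.timeout(3000) });
    return res.ok;
  } catch {
    return false;
  }
}

// Disk pressure on the TSDB volume; report "degraded" before the 5GB cap bites.
async function diskPressure(dataDir = "/var/lib/prometheus"): Promise<boolean> {
  const s = await statfs(dataDir);
  return s.bavail * s.bsize < 1 * 1024 ** 3; // <1 GiB free (threshold illustrative)
}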

Why per-instance first, shared later:
- Per-instance is simpler — no cross-tenant isolation concerns, no shared Prometheus infrastructure to operate, no network topology requirements.
- Shared monitoring (platform-managed Prometheus/Grafana) is more efficient but requires tenant-scoped data isolation, platform-operated infrastructure, and cross-allocation network routing for scraping. Defer to a later phase.
- The shared monitoring pattern in §11 is the future direction, but Phase 3 ships per-instance monitoring only.

4.7 Pros / Cons

| Pros | Cons |
|---|---|
| Lightweight — small resource footprint | No job submission (monitoring only) |
| GPU metrics (dcgm-exporter integration) | Users still need CLI for job management |
| Pre-built community dashboards | Grafana customization can be time-consuming |
| OIDC native, iframe-embeddable | Another service to maintain on the controller |
| Alerting built-in (Grafana alerts) | Prometheus storage on controller (disk pressure) |
| No separate app instance needed | |

4.8 Best For

Platform operators and users who need visibility into cluster utilization and GPU metrics but manage jobs via CLI (sbatch, squeue). Also valuable as the monitoring layer regardless of which job management UI is chosen.


5. Option 3: Native Thin UI (Platform-Built)

5.1 What It Is

Extend the existing Slurm instance detail page with a lightweight job queue view that calls Slurm CLI commands via the node-agent task API. No external software needed — everything is built into the platform UI.

5.1.1 Scope Boundary (MVP)

The native thin UI is also a product contract question: how much of Slurm's control surface does the platform own? Without a clear boundary, Phase 1 can sprawl into reimplementing half of scontrol.

Phase 1 MVP — strict scope:

| In scope | Slurm command | Mutating? | Audited? |
|---|---|---|---|
| List job queue | squeue --json | No | No |
| List node status | sinfo --json | No | No |
| Submit batch job | sbatch | Yes | Yes — audit log row |
| Cancel job | scancel {job_id} | Yes | Yes — audit log row |

Explicitly deferred (not Phase 1):

| Deferred | Slurm command | Why deferred |
|---|---|---|
| Node drain/resume | scontrol update NodeName=... State=drain\|resume | Admin action — needs RBAC boundary (who can drain?). Existing worker drain/remove in the Workers card covers the platform-level action. |
| Job output/logs | sacct, file reads from job output directory | Requires filesystem access model (see §3.7). Cannot show job stdout without knowing where output files live. |
| Job detail | scontrol show job {id} | Nice-to-have, not MVP. Queue view shows enough. |
| Partition management | scontrol | Admin-only, low frequency, CLI is fine. |
| srun (interactive) | srun --pty | Requires terminal relay — use existing platform terminal or OOD. |

This boundary means the native thin UI is: read the queue, read the nodes, submit a batch job, cancel a job. Four operations. That is the contract.

5.2 Architecture

Platform UI  →  Platform API  →  Node Agent (on controller)  →  Slurm CLI
    GET    /api/v1/app-runtime/instances/{id}/slurm/queue             squeue
    GET    /api/v1/app-runtime/instances/{id}/slurm/nodes             sinfo
    POST   /api/v1/app-runtime/instances/{id}/slurm/jobs              sbatch
    DELETE /api/v1/app-runtime/instances/{id}/slurm/jobs/{job_id}     scancel

The platform API proxies structured Slurm queries to the node-agent running on the controller allocation. The node-agent executes the Slurm CLI commands and returns parsed JSON output. The platform UI renders the results natively.

Control boundary alignment: This is the only option that fully satisfies the platform's control boundary constraint (Kubernetes Platform Design v2 §3.1). All Slurm CLI execution flows through the node-agent typed-task model — the platform API never SSHes into the controller directly. Slurm query and mutation operations should be new task types (e.g., app.slurm.query, app.slurm.submit) registered in the node-agent task catalog, with audit logging for mutations (job submission and cancellation). A sketch of these task types follows.
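
A hedged sketch of what those task types could look like in the node-agent task catalog. The shape below is illustrative (it mirrors the typed-task model described above, not an existing platform type); an app.slurm.cancel type is added on the assumption that scancel is a distinct mutation:

type SlurmTaskType = "app.slurm.query" | "app.slurm.submit" | "app.slurm.cancel";

interface SlurmTask {
  type: SlurmTaskType;
  instanceId: string;
  // Queries: which read-only CLI to run; the agent returns parsed --json output.
  command?: "squeue" | "sinfo";
  // Mutations: batch script content for submit, job id for cancel.
  script?: string;
  jobId?: number;
  // Mutations must produce an audit-log row; queries do not.
  audited: boolean;
}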

This also means the platform API does not need the embedded UI gateway contract to ship the native thin UI — it uses existing platform auth, existing node-agent execution, and renders natively. No iframe, no reverse proxy, no cookie scoping. This is why it can ship before Options 1 and 2.

5.3 API Surface (MVP — four operations only)

# Addition to openapi.draft.yaml (app-runtime section)
# Scope: queue, nodes, submit, cancel. Nothing else in Phase 1.

/api/v1/app-runtime/instances/{instance_id}/slurm/queue:
  get:
    summary: List Slurm job queue
    description: Returns parsed output of squeue --json
    parameters:
      - name: state
        in: query
        schema: { type: string, enum: [pending, running, completed, failed] }
      - name: user
        in: query
        schema: { type: string }
    responses:
      200:
        content:
          application/json:
            schema:
              type: object
              properties:
                jobs:
                  type: array
                  items:
                    type: object
                    properties:
                      job_id: { type: integer }
                      name: { type: string }
                      user: { type: string }
                      state: { type: string }
                      partition: { type: string }
                      nodes: { type: string }
                      gpus: { type: integer }
                      time_elapsed: { type: string }
                      time_limit: { type: string }
                      submit_time: { type: string }

/api/v1/app-runtime/instances/{instance_id}/slurm/nodes:
  get:
    summary: List Slurm node status
    description: Returns parsed output of sinfo --json
    responses:
      200:
        content:
          application/json:
            schema:
              type: object
              properties:
                nodes:
                  type: array
                  items:
                    type: object
                    properties:
                      hostname: { type: string }
                      state: { type: string }
                      cpus_total: { type: integer }
                      cpus_alloc: { type: integer }
                      gpus_total: { type: integer }
                      gpus_alloc: { type: integer }
                      memory_total_mb: { type: integer }
                      memory_alloc_mb: { type: integer }
                      partitions: { type: array, items: { type: string } }

/api/v1/app-runtime/instances/{instance_id}/slurm/jobs:
  post:
    summary: Submit a Slurm batch job
    requestBody:
      content:
        application/json:
          schema:
            type: object
            required: [script]
            properties:
              script: { type: string, description: "Batch script content" }
              name: { type: string }
              partition: { type: string }
              gpus: { type: integer }
              nodes: { type: integer }
              time_limit: { type: string, description: "e.g. 01:00:00" }
    responses:
      201:
        content:
          application/json:
            schema:
              type: object
              properties:
                job_id: { type: integer }

/api/v1/app-runtime/instances/{instance_id}/slurm/jobs/{job_id}:
  delete:
    summary: Cancel a Slurm job
    description: Calls scancel {job_id}. Audited.
    responses:
      204:
        description: Job cancellation accepted
      404:
        description: Job not found
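
A hedged usage sketch against this surface, from platform web code. The paths match the OpenAPI draft above; auth is the existing platform session (native rendering), so no token handling appears:

// List the queue, optionally filtered by state.
async function listQueue(instanceId: string, state?: string) {
  const qs = state ? `?state=${encodeURIComponent(state)}` : "";
  const res = await fetch(`/api/v1/app-runtime/instances/${instanceId}/slurm/queue${qs}`);
  if (!res.ok) throw new Error(`queue fetch failed: ${res.status}`);
  const { jobs } = await res.json();
  return jobs as Array<{ job_id: number; name: string; state: string; gpus: number }>;
}

// Cancel a job; 204 = accepted, 404 = unknown job id.
async function cancelJob(instanceId: string, jobId: number): Promise<void> {
  const res = await fetch(
    `/api/v1/app-runtime/instances/${instanceId}/slurm/jobs/${jobId}`,
    { method: "DELETE" },
  );
  if (res.status === 404) throw new Error(`job ${jobId} not found`);
  if (res.status !== 204) throw new Error(`cancel failed: ${res.status}`);
}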

5.4 Layout: Enhanced Instance Detail

┌── my-slurm (Slurm) ──────────────────────────────────────────────┐
│                                                                    │
│  Overview │ Jobs │ Nodes │ Workers │ Members │ Operations          │
│                                                                    │
│  ═══ Jobs tab ═══                                                  │
│                                                                    │
│  ┌─ Submit Job ──────────────────────────────────────────────────┐ │
│  │ Name: [training-run-042       ]  Partition: [gpu    ▾]       │ │
│  │ GPUs: [4   ]  Time limit: [04:00:00]  Nodes: [1   ]         │ │
│  │                                                               │ │
│  │ Script:                                                       │ │
│  │ ┌──────────────────────────────────────────────────────────┐  │ │
│  │ │ #!/bin/bash                                              │  │ │
│  │ │ #SBATCH --job-name=training-run-042                      │  │ │
│  │ │ #SBATCH --gres=gpu:4                                     │  │ │
│  │ │ #SBATCH --time=04:00:00                                  │  │ │
│  │ │                                                          │  │ │
│  │ │ python train.py --model llama --epochs 10                │  │ │
│  │ └──────────────────────────────────────────────────────────┘  │ │
│  │                                                    [Submit]   │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                    │
│  ┌─ Job Queue ───────────────────────────────────────────────────┐ │
│  │ Filter: [All ▾]  [Refresh]                    Auto: 10s      │ │
│  │                                                               │ │
│  │  ID      Name              State    GPUs  Node       Elapsed     │ │
│  │  10042   training-run-041  RUNNING  4     worker-03  02:34  [x] │ │
│  │  10043   eval-checkpoint   RUNNING  2     worker-01  00:12  [x] │ │
│  │  10044   training-run-042  PENDING  4     --         --     [x] │ │
│  │  10040   preprocessing     COMPLTD  1     worker-02  00:45      │ │
│  │                                                               │ │
│  └───────────────────────────────────────────────────────────────┘ │
│                                                                    │
│  ═══ Nodes tab ═══                                                 │
│                                                                    │
│  ┌─ Cluster Nodes ───────────────────────────────────────────────┐ │
│  │                                                               │ │
│  │  Hostname    State      CPUs      GPUs      Memory    Part    │ │
│  │  worker-01   alloc      32/64     2/4       128/256   gpu     │ │
│  │  worker-02   idle       0/64      0/4       0/256     gpu     │ │
│  │  worker-03   alloc      64/64     4/4       245/256   gpu     │ │
│  │  worker-04   down*      --        --        --        gpu     │ │
│  │                                                               │ │
│  │  * worker-04: Node not responding (last seen 12m ago)         │ │
│  │    (drain/resume deferred — see §5.1.1 / Workers card)        │ │
│  │                                                               │ │
│  └───────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘

5.5 Pros / Cons

| Pros | Cons |
|---|---|
| Zero external dependencies — fully native | Limited functionality vs OOD (no interactive apps, no file browser) |
| Consistent UX — same design language as rest of platform | API surface to maintain (Slurm CLI parsing, versioning) |
| Fast to build for basic job view | No Jupyter/VS Code integration on compute nodes |
| No auth complexity — uses existing platform auth | Slurm CLI output format can change between versions |
| No iframe/embedding needed | Only useful for Slurm (not reusable for other HPC schedulers) |
| Decision-first UX — show GPU cost before job submission | Platform becomes coupled to Slurm internals |

5.6 Best For

Teams that primarily use sbatch scripts and want quick visibility into queue status from the same UI where they manage their cluster. Good for smaller teams that don't need interactive notebook sessions on compute nodes.


6. Comparison Matrix

| Dimension | Open OnDemand | Grafana + Exporter | Native Thin UI |
|---|---|---|---|
| Job submission | Yes (web form + script) | No | Yes (basic) |
| Job monitoring | Yes (queue, history) | No (metrics only) | Yes (queue view) |
| Interactive apps | Yes (Jupyter, VS Code, desktop) | No | No |
| File browser | Yes | No | No |
| GPU metrics | No (not built-in) | Yes (dcgm-exporter) | No |
| Cluster utilization | Basic (node overview) | Yes (detailed dashboards) | Basic (node table) |
| Alerting | No | Yes (Grafana alerts) | No |
| Auth | OIDC (mod_auth_openidc) | OIDC (Grafana config) | Platform auth (native) |
| Embedding | iframe (OIDC proxy) | iframe (allow_embedding) | Native (no iframe) |
| Deploy complexity | High (Apache + Passenger + OOD config) | Medium (3 binaries + config) | Low (API + UI code) |
| Resource footprint | Medium-high | Low-medium | None (runs on platform) |
| Separate app instance? | Yes (companion) or tab | No (co-hosted on controller) | No (native) |
| App platform validation | Yes — tests companion app + embedded UI pattern | Partially — tests embedded UI only | No — platform-internal feature |
| Build effort | Small (deploy + configure) | Small (deploy + dashboards) | Medium (API + UI + Slurm parsing) |
| Maintenance | OOD upgrades + Slurm version compat | Prometheus storage + dashboard upkeep | Slurm CLI parsing maintenance |
| Requires gateway contract? | Yes — hardest case (WS, file upload, complex cookies) | Yes — simpler case (mostly read-only) | No — native rendering, no proxy |
| Control boundary | OOD runs on allocation, platform proxies — acceptable | Grafana runs on allocation, platform proxies — acceptable | Node-agent task model — cleanest control path |
| Audit coverage | OOD actions not audited by platform (runs independently) | Read-only — no mutations to audit | Full audit — job submit/cancel via node-agent task log |

7. Platform Primitive Dependencies

The Kubernetes Platform Design v2 (§3–4) identifies hard constraints and missing primitives that apply equally to Slurm UI integration. This section maps each option against those dependencies.

7.1 Gateway Contract (v2 §4.3)

Options 1 and 2 embed third-party UIs and therefore require the gateway contract. The contract must define:

| Concern | Grafana (Option 2) | Open OnDemand (Option 1) |
|---|---|---|
| Reverse-proxy route resolution | Simple — Grafana on fixed port on controller | Complex — OOD on Apache, path rewriting |
| Cookie domain scoping | Grafana session cookie under proxy origin | OOD + Apache mod_auth_openidc cookies, potential domain conflict |
| WebSocket upgrade | Optional (Grafana live, can be disabled) | Required (shell access, interactive apps) |
| CSRF model | Grafana has built-in CSRF; proxy must pass tokens | OOD uses Rails CSRF; proxy must preserve headers |
| Session TTL alignment | Grafana OIDC session vs platform session | OOD OIDC session + Apache session vs platform session |
| Logout propagation | Platform logout → revoke Grafana OIDC session | Platform logout → revoke OOD OIDC + Apache sessions |
| Fallback to link-out | Acceptable for Grafana (low-stakes) | Less acceptable (daily-use tool, SSO breakage is disruptive) |

Grafana is the simpler case and should validate the gateway contract first. OOD should only ship after Grafana proves the contract works.
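
To make the contract concrete, a hedged sketch of it as a typed surface. Every field maps to a row in the table above; the names are illustrative, not an existing platform type:

interface EmbeddedUiGatewayContract {
  // Resolve an instance tab to the upstream the proxy should forward to.
  resolveRoute(instanceId: string, tabKey: string): {
    host: string;
    port: number;
    pathPrefix: string;
  };
  authStrategy: "oidc_proxy";              // shared by Grafana and OOD
  cookieScope: { domain: string; pathPrefix: string };
  websocket: { upgradeAllowed: boolean };  // Grafana: optional; OOD: required
  csp: { frameAncestors: string[] };       // must include the platform origin
  session: {
    ttlSeconds: number;                    // align upstream session with platform session
    logoutPropagation: "revoke_upstream" | "none";
  };
  fallbackToLinkOut: boolean;              // per-app criteria from the table above
}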

7.2 Node Mutation Control (v2 §3.1)

| Option | How it touches nodes | Alignment |
|---|---|---|
| Native Thin UI | Node-agent typed tasks (app.slurm.query, app.slurm.submit) | Full compliance — platform owns execution and audit |
| Grafana + Exporter | Installed by app worker during Slurm bootstrap; read-only after | Acceptable — no runtime node mutation from Grafana |
| Open OnDemand | OOD runs sbatch/squeue directly via Slurm CLI on the host | Partial — OOD bypasses the node-agent for Slurm operations. Acceptable because OOD runs on the controller allocation (not arbitrary nodes) and Slurm itself is the authority. But OOD job submissions are not audited by the platform. |

7.3 Recovery Model (v2 §4.4)

Each option introduces failure modes the app-runtime recovery model must handle:

| Option | Failure modes | Recovery needed |
|---|---|---|
| Native Thin UI | Node-agent task timeout, Slurm CLI unavailable | Existing node-agent retry semantics; degrade gracefully in UI |
| Grafana + Exporter | Grafana process crash, Prometheus disk full, exporter crash | App worker health check on controller; restart via node-agent task; alert on monitoring-down (meta-monitoring) |
| Open OnDemand | Apache crash, OOD app error, OIDC session desync | Companion app instance status reporting; decommission/redeploy path; platform detects unhealthy OOD and surfaces it in the workloads sidebar |

7.4 Dependency Summary

                          ┌─────────────────────────┐
                          │ Embedded UI Gateway      │
                          │ Contract (v2 §4.3)       │
                          └──────────┬──────────────┘
                          ┌──────────┴──────────────┐
                          │                         │
                    ┌─────▼──────┐           ┌──────▼──────┐
                    │ Grafana    │           │ Open        │
                    │ (Phase 3)  │           │ OnDemand    │
                    │            │           │ (Phase 4)   │
                    └────────────┘           └─────────────┘

  ┌──────────────┐
  │ Native Thin  │   ← No gateway dependency
  │ UI (Phase 1) │   ← Uses node-agent tasks (existing)
  │              │   ← Uses platform auth (existing)
  └──────────────┘

8. Recommendation: Layer All Three

These options serve different needs and are complementary. The recommended approach layers them:

┌── my-slurm (Slurm) ──────────────────────────────────────────────┐
│                                                                    │
│  Overview │ Jobs │ Monitoring │ HPC Portal │ Workers │ Operations  │
│                                                                    │
│  Overview     = existing Slurm Runtime card + Workers card         │
│  Jobs         = Option 3 (native thin UI — job queue + submit)     │
│  Monitoring   = Option 2 (embedded Grafana — GPU metrics)          │
│  HPC Portal   = Option 1 (embedded OOD — full HPC experience)     │
│  Workers      = existing worker management                         │
│  Operations   = existing operation history                         │
└────────────────────────────────────────────────────────────────────┘

Implementation Order (revised to respect platform primitive dependencies)

The Kubernetes Platform Design v2 (§4) identifies four platform primitives that must exist before embedded UIs can ship. This changes the phase order from the naive "lowest effort first" to "dependency-correct order":

| Phase | What | Prerequisite | Why this order |
|---|---|---|---|
| Phase 1 | Native Thin UI (Jobs tab) | Node-agent task types for Slurm query/submit | No platform primitive dependencies — uses existing auth, existing node-agent, renders natively. Gives job queue visibility immediately. Builds the Slurm proxy API reusable by CLI and SDK. |
| Phase 2 | Embedded UI gateway contract | Phase 1 validated; design work | Define reverse-proxy route shape, auth strategy contract, cookie/session model, WS upgrade, CSP, logout propagation. This is a platform primitive, not a Slurm feature. |
| Phase 3 | Grafana + Slurm Exporter | Gateway contract (Phase 2) | First embedded UI. Grafana is the simplest case (mostly read-only, minimal WS, well-understood OIDC). Validates the gateway contract with low risk. GPU metrics are universally valuable. |
| Phase 4 | Open OnDemand | Gateway contract proven (Phase 3) | Full HPC portal. Exercises the hardest gateway cases (WS for shell, interactive apps, file uploads, complex cookie state). By Phase 4 the gateway contract is validated. |

Why This Order (corrected)

  1. Native Jobs tab first because it has zero dependencies on new platform primitives. It uses the node-agent task model (already exists), platform auth (already exists), and renders natively (no iframe, no proxy). It ships the #1 user request (job queue visibility) while the gateway contract is being designed.

  2. Gateway contract second because both Grafana and OOD depend on it. Designing the contract in parallel with Phase 1 is fine, but no embedded UI tab should ship before the contract is defined and reviewed. The contract covers: reverse-proxy route resolution, supported auth strategies, cookie domain scoping, WebSocket upgrade handling, CSP frame-ancestors, session TTL alignment, logout propagation, and fallback-to-link-out criteria.

  3. Grafana third because it is the simplest embedded UI case — mostly read-only dashboards, minimal WebSocket usage, well-understood OIDC support. It validates the gateway contract with low blast radius. If the gateway contract has gaps, Grafana will expose them cheaply.

  4. OOD fourth because it exercises every hard case in the gateway contract: WebSocket for shell and interactive apps, file upload multipart handling, complex cookie and session state, Apache mod_auth_openidc interaction with the platform's OIDC proxy. By Phase 4, those patterns are proven.

What Each Phase Adds to the Nav

After Phase 1 (native thin UI):

WORKLOADS
  ▸ my-slurm (Slurm)  →  Overview │ Jobs │ Nodes │ Workers │ Operations

After Phase 2 (gateway contract): No visible nav change. Platform primitive is internal.

After Phase 3 (Grafana):

WORKLOADS
  ▸ my-slurm (Slurm)  →  Overview │ Jobs │ Nodes │ Monitoring │ Workers │ Operations

After Phase 4 (Open OnDemand):

WORKLOADS
  ▸ my-slurm (Slurm)  →  Overview │ Jobs │ Nodes │ Monitoring │ HPC Portal │ Workers │ Operations


9. Known Tradeoff: OOD Audit Bypass

This is a conscious product and security decision, not an open question.

When users submit jobs via Open OnDemand, those submissions execute directly on the controller allocation via Slurm CLI. They bypass the platform's node-agent task model and audit trail entirely. The platform sees the Slurm cluster running but has no record of individual job submissions, cancellations, or interactive sessions initiated through OOD.

9.1 What the platform sees vs what OOD does

| Action | Platform audit trail | OOD audit trail |
|---|---|---|
| Cluster deploy | Yes — app instance lifecycle | N/A |
| Worker add/drain/remove | Yes — member operations | N/A |
| Job submit via native UI (Phase 1) | Yes — node-agent task log | N/A |
| Job submit via OOD | No | OOD Apache access log (on the allocation) |
| Job cancel via OOD | No | OOD Apache access log |
| Interactive Jupyter session via OOD | No | OOD session log |
| File upload/download via OOD | No | OOD Apache access log |

9.2 Why this is acceptable (for now)

  1. OOD operates within Slurm's authorization model. OOD authenticates users via OIDC and maps them to OS users on the cluster. Slurm enforces per-user quotas, partition access, and job limits. OOD cannot do more than the user's Slurm permissions allow.

  2. The platform audits the blast radius, not the workload. The platform's audit contract covers infrastructure mutations (allocations, node state, credentials). Individual workload actions (which training job ran, which notebook opened) are analogous to individual commands typed in an SSH session — the platform doesn't audit those either.

  3. OOD logs exist, just not in the platform. Apache access logs and Slurm accounting (sacct) capture everything. They live on the allocation, not in the platform's audit_logs table. If compliance requires centralized audit, those logs can be forwarded to the platform's logging pipeline — but that is a separate integration, not a Phase 4 blocker.

9.3 When this becomes unacceptable

  • If the platform needs to enforce per-job cost attribution (billing per job, not per allocation). OOD-submitted jobs would not have platform-tracked cost.
  • If compliance requires that all user actions on platform-managed infrastructure are centrally audited. OOD's logs-on-allocation model does not satisfy this.
  • If the platform needs to enforce job-level policies (e.g., max GPU-hours per job, job submission rate limits). OOD bypasses the platform API entirely.

If any of these become requirements, the options are:

  1. Proxy OOD through the platform API — heavy OOD customization, breaks the "deploy OOD as-is" model.
  2. Forward OOD/Slurm logs to the platform — lighter, preserves OOD as-is, but audit is eventually-consistent (log shipping delay).
  3. Accept the gap and document it — current recommendation.

9.4 Decision

For Phase 4: accept the audit gap and document it clearly in the product. The instance detail page should show a notice when the HPC Portal tab is active: "Actions taken in the HPC Portal are managed by Slurm, not tracked in platform audit logs." This is honest UX, consistent with the decision-first principle.
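
A minimal sketch of that notice as a React component in the platform web package (the component name and styling hook are illustrative assumptions):

// Rendered when the HPC Portal tab is active on the Slurm instance detail page.
export function HpcPortalAuditNotice() {
  return (
    <div role="note" className="audit-notice">
      Actions taken in the HPC Portal are managed by Slurm, not tracked in
      platform audit logs.
    </div>
  );
}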


11. Shared Monitoring Pattern

The Grafana + Exporter model (Option 2) generalizes beyond Slurm. Every app type benefits from embedded monitoring:

App Exporter Dashboards
Slurm prometheus-slurm-exporter + dcgm-exporter Cluster overview, GPU detail, job queue depth
RKE2 / Kubernetes kube-state-metrics + dcgm-exporter Pod status, GPU scheduling, node health
Ray ray-prometheus-exporter Cluster resources, task queue, object store
MLflow mlflow-prometheus-exporter (custom) Experiment counts, model registry, artifact storage

This suggests the monitoring tab should be a platform-level primitive, not an app-specific feature. The app manifest declares its exporter endpoints, and the platform deploys Prometheus + Grafana as a shared monitoring sidecar:

# In app manifest
monitoring:
  exporters:
    - name: slurm
      port: 9341
      path: /metrics
    - name: dcgm
      port: 9400
      path: /metrics
  grafana_dashboards:
    - source: bundled          # shipped with the app
      path: /dashboards/slurm-overview.json
    - source: bundled
      path: /dashboards/gpu-detail.json

This keeps monitoring consistent across all apps and avoids each app developer reinventing the Prometheus + Grafana deployment.

Deployment model decision: Per-instance monitoring ships first (Phase 3). Shared/platform-managed monitoring is the future direction but requires tenant isolation, cross-allocation networking, and platform-operated Prometheus infrastructure. The shared pattern described above is the target architecture; Phase 3 is per-instance only. The manifest schema should be designed for the shared model from the start (so apps don't need to change when the platform migrates), but the runtime implementation in Phase 3 deploys Prometheus + Grafana on the controller allocation alongside the app.


12. Open Questions

  1. OOD deployment model — companion app instance or co-hosted on controller? Companion is cleaner (separate lifecycle) but adds complexity (inter-instance references). Co-hosted is simpler but mixes concerns.

  2. Slurm REST API (slurmrestd) vs CLI parsing — the native thin UI (Phase 1) can use either. slurmrestd provides structured JSON output but requires additional setup. CLI parsing (squeue --json) is simpler but format varies by Slurm version. Slurm 23.02+ has stable --json output.

  3. Interactive sessions — OOD's killer feature is interactive Jupyter/VS Code on compute nodes. Should this be available without OOD, via the platform's existing terminal feature? The platform already has browser terminal support — extending it to launch Jupyter on a specific Slurm node could cover this use case.

  4. Job cost attribution — the native thin UI could show estimated cost per job (GPU-hours × rate). This requires the platform to know the SKU rate and the job's GPU allocation. Worth building? It would be a differentiating feature vs. OOD.

  5. Gateway contract ownership — the embedded UI gateway contract is a platform primitive, not a Slurm feature. Who owns the design? Should it be a separate architecture doc (doc/architecture/Embedded_UI_Gateway_Spec.md) or a section in the navigation redesign doc? It blocks Phases 3 and 4 of this spec and also blocks the Rancher embedded UI in the Kubernetes design.

  6. Future tab concepts — the current tab model covers Jobs (native), Monitoring (embedded Grafana), and HPC Portal (embedded OOD). Two additional tab concepts may be worth exploring independently of OOD:

     - Access — SSH credentials, terminal launch, kubeconfig-style access tokens for the Slurm cluster. Currently scattered across the Overview and platform terminal. A dedicated tab could consolidate access methods.
     - Files — lightweight file browser for job output directories, without the full OOD stack. Could use the node-agent task model to list/read files on the controller. Simpler than OOD's file browser but covers the "where is my job output?" use case.

     Neither is committed — they are options to evaluate if OOD is deferred or if users need file/access capabilities before Phase 4.