Slurm UI Options & Integration Spec v1¶
Purpose:
- Document the available UI options for Slurm workloads on the platform.
- Define how each option integrates with the platform shell, app extension registry, and embedded UI pattern.
- Provide layout wireframes and implementation guidance for each approach.
Inputs:
- Current Slurm instance detail page (packages/web/app/apps/instances/[instanceId]/page.tsx)
- Current Slurm extension panels (packages/web/src/lib/apps/slurm-instance-panels.tsx)
- Navigation redesign (doc/product/Navigation_Redesign_App_Platform_v1.md)
- Kubernetes platform design (doc/architecture/Kubernetes_Platform_Options_v1.md)
- App extension registry (packages/web/src/lib/apps/extensions.ts)
Hard dependencies (from doc/architecture/Kubernetes_Platform_Options_v1.md §3–4):
- Embedded UI gateway contract (§4.3) must be defined before any embedded UI tab ships.
- Node mutation must stay inside platform-controlled execution (§3.1) — Slurm CLI
proxying via the native thin UI must route through the node-agent task model, not
direct SSH from the platform API.
- App-runtime recovery model (§4.4) should cover partial monitoring deploy and
Grafana/OOD service failures.
1. Current State¶
Slurm has no built-in web UI. It is entirely CLI-driven (srun, sbatch, squeue,
sinfo, sacct).
The platform currently provides a native Slurm instance detail page with:
- Slurm Runtime card — controller status, cluster name, partition, slurmctld/slurmd state, bootstrap credential management
- Slurm Workers card — add workers, drain/remove workers, worker history
- Instance Members card — generic member list with status badges
- Instance Operations card — member operation history
This is management-focused (deploy, scale, credential rotate). There is no visibility into what the Slurm cluster is actually doing — no job queue, no node utilization, no GPU metrics.
2. The Three UI Options¶
Option 1: Open OnDemand (Full HPC Portal)¶
Option 2: Grafana + Slurm Exporter (Monitoring Dashboard)¶
Option 3: Native Thin UI (Platform-Built Job View)¶
Each option serves a different user need and can be combined. They are not mutually exclusive.
3. Option 1: Open OnDemand¶
3.1 What It Is¶
Open OnDemand (OOD) is an open-source web portal for HPC clusters, developed by the Ohio Supercomputer Center. It is the most widely used HPC web interface (used by most US national labs and universities). License: MIT.
It provides:
- Job submission and monitoring — submit batch jobs, view queue, cancel jobs
- Interactive applications — launch Jupyter notebooks, VS Code, RStudio, and desktop sessions directly on compute nodes
- File browser — upload/download/edit files on the cluster filesystem
- Shell access — browser-based terminal to the cluster
- Cluster status — node utilization, partition overview
3.2 Architecture on the Platform¶
Open OnDemand runs as an Apache-based web application on a node that can reach the
Slurm controller. It talks to Slurm via the standard CLI tools (squeue, sbatch, etc.)
and optionally via the Slurm REST API (slurmrestd).
┌─────────────────────────────────────────────────────────────┐
│ Platform │
│ │
│ App Instance: my-slurm (slug: slurm-reference) │
│ ├── Controller allocation (slurmctld + slurmdbd) │
│ ├── Worker allocations (slurmd, GPU nodes) │
│ └── [OOD allocation or co-hosted on controller] │
│ │
│ Companion App Instance: my-slurm-ood (slug: open-ondemand) │
│ ├── Runs on controller allocation (or dedicated node) │
│ ├── Apache + Passenger + OOD portal │
│ ├── Auth: OIDC via platform Keycloak │
│ └── Talks to slurmctld via Slurm CLI / slurmrestd │
└─────────────────────────────────────────────────────────────┘
Deployment model: OOD can be deployed in two ways:
A. Companion app instance — separate catalog entry (slug: open-ondemand) that
references an existing Slurm instance. Deployed as a second app instance in the same
project, configured to point at the Slurm controller.
B. Co-hosted on controller — OOD installed on the same allocation as slurmctld. Simpler, but mixes concerns. Suitable for single-node or dev clusters.
3.3 Platform Integration¶
App Manifest¶
slug: open-ondemand
display_name: "Open OnDemand"
runtime_backend: bare_metal
versions:
- version: "3.1"
placement_schema:
type: object
required: [host_allocation_id, slurm_instance_id]
properties:
host_allocation_id:
type: string
format: uuid
description: "Allocation to run OOD on (can be Slurm controller)"
slurm_instance_id:
type: string
format: uuid
description: "Slurm app instance to connect to"
config_schema:
type: object
properties:
portal_title:
type: string
default: "GPU Cloud HPC Portal"
interactive_apps:
type: array
items:
type: string
enum: [jupyter, vscode, rstudio, desktop]
default: [jupyter, vscode]
ui:
endpoint:
type: allocation_port
component_key: portal
port: 443
path: "/"
protocol: https
auth:
strategy: oidc_proxy
embedding:
allowed: true
sandbox: "allow-same-origin allow-scripts allow-forms allow-popups"
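The placement_schema above can also be enforced client-side before a deploy request is sent. A minimal sketch, assuming a hypothetical `validateOodPlacement` helper (the helper and error strings are illustrative; the field names come from the manifest):

```typescript
// Hypothetical client-side check mirroring the manifest's placement_schema.
// Both fields are listed under `required` and carry `format: uuid`.
interface OodPlacementIntent {
  host_allocation_id: string;
  slurm_instance_id: string;
}

const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function validateOodPlacement(intent: Partial<OodPlacementIntent>): string[] {
  const errors: string[] = [];
  for (const field of ["host_allocation_id", "slurm_instance_id"] as const) {
    const value = intent[field];
    if (!value) {
      errors.push(`${field} is required`);
    } else if (!UUID_RE.test(value)) {
      errors.push(`${field} must be a UUID`);
    }
  }
  return errors;
}
```

Server-side schema validation remains authoritative; this only gives the deploy form immediate feedback.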
Auth¶
Open OnDemand supports OIDC natively via mod_auth_openidc (Apache module). Configure
it to use the platform's Keycloak realm:
OIDCProviderMetadataURL https://keycloak.example.com/realms/gpuaas/.well-known/openid-configuration
OIDCClientID ood-portal
OIDCClientSecret <from Keycloak>
OIDCRedirectURI https://ood.example.com/oidc
OIDCCryptoPassphrase <random>
When embedded in the platform via reverse proxy, the OIDC session flows through transparently — same Keycloak realm, SSO is automatic.
Prerequisite: The embedded UI gateway contract (see Kubernetes Platform Design v2 §4.3) must be defined before this embedding ships. That contract covers: reverse-proxy route shape, cookie behavior under proxying, WebSocket upgrade handling (OOD uses WS for its shell and interactive apps), CSRF model, session expiry/logout propagation, and when to fall back to link-out. OOD exercises all of these — it is not a simple read-only iframe.
Frontend Extension¶
const openOnDemandExtension: AppShellExtension = {
slug: "open-ondemand",
runtimeBackend: "bare_metal",
deploy: {
requiredInputs: {
controllerAllocations: "single",
},
missingInputsMessage: "Open OnDemand requires a host allocation and a Slurm instance reference.",
summaryMessage: "Deploy Open OnDemand portal connected to an existing Slurm cluster.",
serviceAccountEmptyStateMessage: "No active service accounts exist in this project yet.",
serviceAccountHelpText: "Optional machine identity for portal automation.",
accessCredentialHelpText: "",
buildPlacementIntent: ({ controllerAllocationIDs }) => ({
host_allocation_id: controllerAllocationIDs[0] ?? "",
// slurm_instance_id populated via additional UI field (instance picker)
}),
isPlacementComplete: ({ controllerAllocationIDs }) =>
controllerAllocationIDs.length > 0,
},
};
3.4 Layout: Instance Detail with Embedded OOD¶
When OOD is deployed as a companion app, the platform instance detail page gains an embedded UI tab:
┌── Top Bar ────────────────────────────────────────────────────────┐
│ [logo] Tenant / Project ▾ [$] [th] [n] [usr] │
├───────────────────────────────────────────────────────────────────┤
│ ┌─ Sidebar ──┐ │
│ │ WORKLOADS │ ┌── my-slurm-ood (Open OnDemand) ─────────────┐ │
│ │ ▸ my-slurm │ │ │ │
│ │ ▸ my-ood ← │ │ Overview │ Portal │ Config │ Logs │ │
│ │ │ ├──────────────────────────────────────────────┤ │
│ │ INFRA │ │ │ │
│ │ ... │ │ ┌── Embedded Open OnDemand ──────────────┐ │ │
│ │ │ │ │ │ │ │
│ │ │ │ │ [Jobs] [Files] [Clusters] [Apps] │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ │ Active Jobs Submit Job │ │ │
│ │ │ │ │ ┌─────────────────────┐ ┌─────────┐ │ │ │
│ │ │ │ │ │ job-001 running 4G │ │ Script │ │ │ │
│ │ │ │ │ │ job-002 pending 2G │ │ [ ] │ │ │ │
│ │ │ │ │ │ job-003 completed 8 │ │ GPUs: 4 │ │ │ │
│ │ │ │ │ └─────────────────────┘ │ [Submit]│ │ │ │
│ │ │ │ │ └─────────┘ │ │ │
│ │ │ │ │ Interactive Apps │ │ │
│ │ │ │ │ [Jupyter] [VS Code] [Desktop] │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ └────────────────────────────────────────┘ │ │
│ │ │ │ │ │
│ └────────────┘ └──────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
Alternatively, if OOD is linked from the Slurm instance itself (not a separate workload entry), it appears as a tab on the Slurm instance detail:
┌── my-slurm (Slurm) ──────────────────────────────────────────────┐
│ │
│ Overview │ HPC Portal │ Workers │ Members │ Operations │ Cost │
│ │
│ "HPC Portal" tab = embedded Open OnDemand │
│ (reverse-proxied, OIDC auth, same Keycloak realm) │
└───────────────────────────────────────────────────────────────────┘
3.7 Filesystem and Storage Model¶
OOD is only compelling if users have a coherent filesystem story. The file browser, interactive apps, and job output all depend on where files live.
The question OOD forces: where is the user's home directory, where does job output go, and is any of it persistent across cluster rebuilds?
Storage options on the platform¶
| Model | What | Persistent across rebuild? | OOD file browser? |
|---|---|---|---|
| Allocation-local | Files live on the controller allocation's local disk. /home/{user} is local. | No — lost when allocation is released or cluster is decommissioned. | Yes, but ephemeral. |
| Shared NFS | NFS server on controller, exported to workers. /shared mounted on all nodes. | No — NFS data is on the controller allocation. | Yes, shared across nodes. |
| Platform storage (S3) | Platform's existing storage service (packages/services/storage/). Mounted via s3fs or goofys. | Yes — survives cluster rebuild. | Possible but poor UX (S3 semantics). |
| External NFS/Lustre | Tenant-provided external storage, mounted at deploy time via config_schema. | Yes — tenant owns lifecycle. | Yes, if mounted correctly. |
Recommendation for Phase 4 (OOD):
- MVP: Allocation-local storage only. Home directories and job output live on the controller. OOD file browser works. Users accept that decommissioning the cluster loses local files. The platform warns about this in the decommission confirmation (decision-first UX principle).
- Next: Add storage_mount to the Slurm placement intent, allowing tenants to mount their platform storage bucket at a known path (/workspace). OOD file browser sees it. Job output can be directed there. Survives cluster rebuild.
- Later: External NFS/Lustre mount support via config_schema. For tenants with existing HPC storage infrastructure.
Without a storage answer, the following OOD features are misleading:
- File browser shows ephemeral files that will vanish
- Interactive Jupyter notebooks are not saved persistently
- Job output is unrecoverable after cluster decommission
This is not a blocker for Phase 4, but the decommission UX must make the storage ephemerality visible. Users must understand what they will lose.
3.8 Pros / Cons¶
| Pros | Cons |
|---|---|
| Full HPC user experience — job submission, interactive apps, file browser | Heavyweight — full Apache + Passenger stack |
| Battle-tested at scale (national labs, universities) | Configuration complexity — OOD cluster config, interactive app templates |
| OIDC native — clean SSO integration | Requires filesystem access to cluster shared storage |
| Active open-source community (MIT license) | OOD upgrades are a separate lifecycle from Slurm upgrades |
| Interactive Jupyter/VS Code sessions on GPU nodes | May overlap with platform's existing terminal feature |
3.9 Best For¶
Research teams, ML engineers, and data scientists who want a familiar HPC portal experience. Users who need interactive GPU sessions (Jupyter on Slurm) rather than just batch job submission.
4. Option 2: Grafana + Slurm Exporter¶
4.1 What It Is¶
A lightweight monitoring stack that exports Slurm metrics to Prometheus and visualizes them in Grafana. No job submission — purely observability.
Components:
- prometheus-slurm-exporter — Go binary that scrapes Slurm CLI output and exposes Prometheus metrics (node states, job counts, GPU utilization, queue wait times)
- Prometheus — metrics storage and alerting
- Grafana — dashboards and visualization
4.2 Architecture on the Platform¶
┌─────────────────────────────────────────────────────────────┐
│ Slurm App Instance │
│ │
│ Controller allocation │
│ ├── slurmctld │
│ ├── prometheus-slurm-exporter (:9341) │
│ ├── Prometheus (:9090) │
│ └── Grafana (:3000) │
│ │
│ Worker allocations │
│ ├── slurmd │
│ └── node-exporter (:9100) (optional per-node metrics) │
└─────────────────────────────────────────────────────────────┘
Deployment model: The exporter, Prometheus, and Grafana are co-hosted on the controller allocation. They are deployed as part of the Slurm app lifecycle — the app worker installs them alongside slurmctld.
This is not a separate app instance — it is a built-in monitoring layer for Slurm.
4.3 Metrics Exposed¶
The prometheus-slurm-exporter provides:
| Metric | Description |
|---|---|
| slurm_nodes_alloc / slurm_nodes_idle / slurm_nodes_down | Node state counts |
| slurm_cpus_alloc / slurm_cpus_idle | CPU allocation |
| slurm_gpus_alloc / slurm_gpus_idle | GPU allocation (with TRES plugin) |
| slurm_queue_pending / slurm_queue_running | Job queue depth |
| slurm_scheduler_queue_size | Scheduler backlog |
| slurm_job_* | Per-job metrics (optional, high cardinality) |
Combined with node-exporter on workers:
| Metric | Description |
|---|---|
| nvidia_gpu_utilization | Per-GPU utilization (via dcgm-exporter or nvidia_gpu_exporter) |
| nvidia_gpu_memory_used | GPU memory usage |
| node_cpu_* / node_memory_* | Standard host metrics |
4.4 Grafana Dashboards¶
Pre-built dashboards shipped with the Slurm app:
Dashboard 1: Cluster Overview
┌── Slurm Cluster Overview ────────────────────────────────────┐
│ │
│ ┌─ Nodes ────────┐ ┌─ GPUs ─────────┐ ┌─ Jobs ────────┐ │
│ │ 8 total │ │ 32 total │ │ 12 running │ │
│ │ 6 allocated │ │ 24 allocated │ │ 4 pending │ │
│ │ 1 idle │ │ 8 idle │ │ 0 failed │ │
│ │ 1 down │ │ │ │ │ │
│ └────────────────┘ └────────────────┘ └───────────────┘ │
│ │
│ ┌─ GPU Utilization (time series) ───────────────────────┐ │
│ │ ▁▂▃▅▇█████████████████████████▇▅▃▂▁▁▂▃▅▇████████ │ │
│ │ 0% 100% │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ ┌─ Queue Wait Time (histogram) ─────────────────────────┐ │
│ │ ▇ │ │
│ │ █▅ │ │
│ │ ██▃ │ │
│ │ ███▂▁ │ │
│ │ <1m <5m <15m <1h >1h │ │
│ └───────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
Dashboard 2: Per-Node GPU Detail
┌── Node GPU Detail ───────────────────────────────────────────┐
│ │
│ Node: worker-03 │ 4× A100-80GB │ Status: allocated │
│ │
│ GPU 0: util 95% mem 72/80 GB temp 71°C power 298W │
│ GPU 1: util 88% mem 65/80 GB temp 68°C power 285W │
│ GPU 2: util 92% mem 78/80 GB temp 73°C power 302W │
│ GPU 3: util 0% mem 0/80 GB temp 34°C power 45W │
│ │
│ ┌─ GPU Utilization Over Time ───────────────────────────┐ │
│ │ GPU0 ████████████████████████████████████████████ │ │
│ │ GPU1 ██████████████████████████████████ │ │
│ │ GPU2 ████████████████████████████████████████████ │ │
│ │ GPU3 ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ │ │
│ └───────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
4.5 Platform Integration¶
Auth¶
Grafana supports OIDC natively (auth.generic_oauth configuration). Point at the
platform's Keycloak realm. Users see dashboards scoped to their Slurm cluster.
[auth.generic_oauth]
enabled = true
name = GPUaaS
client_id = grafana-slurm
client_secret = <from Keycloak>
scopes = openid profile email
auth_url = https://keycloak.example.com/realms/gpuaas/protocol/openid-connect/auth
token_url = https://keycloak.example.com/realms/gpuaas/protocol/openid-connect/token
api_url = https://keycloak.example.com/realms/gpuaas/protocol/openid-connect/userinfo
Embedding¶
Prerequisite: The embedded UI gateway contract (Kubernetes Platform Design v2 §4.3)
must be defined before this tab ships. Grafana is a simpler case than OOD (mostly
read-only dashboards, no WebSocket for basic panels), making it a good first target
to validate the gateway contract against. However, the contract must still specify:
- reverse-proxy route resolution from instance placement
- cookie domain and path scoping under proxy
- CSP frame-ancestors coordination with the platform origin
- session TTL alignment (Grafana session vs platform session)
- logout propagation (platform logout should invalidate Grafana OIDC session)
Grafana supports iframe embedding natively (allow_embedding = true in config).
The embedded dashboards render inside the platform shell:
┌── my-slurm (Slurm) ──────────────────────────────────────────┐
│ │
│ Overview │ Monitoring │ Workers │ Members │ Operations │
│ │
│ ┌── Embedded Grafana ──────────────────────────────────────┐ │
│ │ │ │
│ │ Cluster Overview │ GPU Detail │ Job Queue │ │
│ │ (Grafana dashboard tabs) │ │
│ │ │ │
│ │ [time range picker] [auto-refresh: 30s] │ │
│ │ │ │
│ │ ... Grafana panels ... │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
No Separate App Instance Needed¶
Because Grafana + Prometheus are part of the Slurm deployment itself (installed by the app worker alongside slurmctld), they appear as a tab on the Slurm instance detail page, not as a separate workload in the sidebar.
The Slurm extension registers the monitoring endpoint:
// Addition to slurmExtension in extensions.ts
const slurmExtension: AppShellExtension = {
// ... existing fields ...
ui: {
tabs: [
{
key: "monitoring",
label: "Monitoring",
endpoint: {
type: "allocation_port",
component_key: "controller",
port: 3000, // Grafana
path: "/d/slurm-overview",
protocol: "https",
},
auth: { strategy: "oidc_proxy" },
embedding: { allowed: true },
},
],
},
};
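A minimal sketch of how the shell might resolve such an endpoint descriptor into an iframe src. The /embedded/... route shape is an assumption, not a decided contract; the real shape belongs to the embedded UI gateway contract (§4.3):

```typescript
// Mirrors the `endpoint` object registered by the Slurm extension above.
interface UiEndpoint {
  type: "allocation_port";
  component_key: string;
  port: number;
  path: string;
  protocol: "http" | "https";
}

// Route all embedded traffic through the platform origin so cookies and
// CSP frame-ancestors stay under platform control. The route shape here
// is illustrative only.
function buildEmbeddedSrc(instanceId: string, ep: UiEndpoint): string {
  const base = `/embedded/instances/${instanceId}/${ep.component_key}/${ep.port}`;
  return base + ep.path;
}
```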
4.6 Prometheus Storage and Retention¶
Prometheus on the controller node is fine for dev, but production monitoring requires explicit retention, disk sizing, and failure behavior decisions.
Defaults for per-instance deployment:
| Parameter | Value | Rationale |
|---|---|---|
| Retention period | --storage.tsdb.retention.time=7d | 7 days covers most debugging windows without excessive disk |
| Retention size | --storage.tsdb.retention.size=5GB | Hard cap prevents disk pressure on controller |
| Scrape interval | 30s | Balance between resolution and disk usage; GPU metrics change slowly |
| WAL compression | --storage.tsdb.wal-compression | Reduces WAL size ~50% |
Failure behavior:
- If Prometheus fills its allocated disk: it stops ingesting but Grafana stays up showing stale data. The app worker health check should detect Prometheus disk pressure and report a degraded state via the app-runtime status API.
- If Prometheus crashes: Grafana shows "No data" panels. The app worker restarts Prometheus via node-agent task. Historical data is lost if the WAL is corrupted.
- If the controller allocation is released: all monitoring data is lost. This is acceptable for per-instance monitoring — the data's value is bounded by the cluster's lifetime.
Why per-instance first, shared later:
- Per-instance is simpler — no cross-tenant isolation concerns, no shared Prometheus infrastructure to operate, no network topology requirements.
- Shared monitoring (platform-managed Prometheus/Grafana) is more efficient but requires: tenant-scoped data isolation, platform-operated infrastructure, cross-allocation network routing for scraping. Defer to a later phase.
- The shared monitoring pattern in §9 is the future direction, but Phase 3 ships per-instance monitoring only.
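The disk-pressure health check described above could be a small pure evaluation inside the app worker. A sketch under stated assumptions: the threshold, status names, and stats shape are all illustrative, not a platform API:

```typescript
// Input: current TSDB size vs the --storage.tsdb.retention.size cap.
interface PromDiskStats {
  tsdbBytes: number;      // current on-disk TSDB size on the controller
  retentionBytes: number; // configured cap (5 GB default above)
}

type MonitoringHealth = "ok" | "degraded" | "down";

function monitoringHealth(stats: PromDiskStats | null): MonitoringHealth {
  if (stats === null) return "down"; // Prometheus not responding at all
  // Near the cap, Prometheus starts dropping blocks and may stop ingesting;
  // surface a degraded state before users see stale dashboards.
  return stats.tsdbBytes >= stats.retentionBytes * 0.9 ? "degraded" : "ok";
}
```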
4.7 Pros / Cons¶
| Pros | Cons |
|---|---|
| Lightweight — small resource footprint | No job submission (monitoring only) |
| GPU metrics (dcgm-exporter integration) | Users still need CLI for job management |
| Pre-built community dashboards | Grafana customization can be time-consuming |
| OIDC native, iframe-embeddable | Another service to maintain on the controller |
| Alerting built-in (Grafana alerts) | Prometheus storage on controller (disk pressure) |
| No separate app instance needed | |
4.8 Best For¶
Platform operators and users who need visibility into cluster utilization and GPU
metrics but manage jobs via CLI (sbatch, squeue). Also valuable as the monitoring
layer regardless of which job management UI is chosen.
5. Option 3: Native Thin UI (Platform-Built)¶
5.1 What It Is¶
Extend the existing Slurm instance detail page with a lightweight job queue view that calls Slurm CLI commands via the node-agent task API. No external software needed — everything is built into the platform UI.
5.1.1 Scope Boundary (MVP)¶
The native thin UI is also a product contract question: how much of Slurm's control
surface does the platform own? Without a clear boundary, Phase 1 can sprawl into
reimplementing half of scontrol.
Phase 1 MVP — strict scope:
| In scope | Slurm command | Mutating? | Audited? |
|---|---|---|---|
| List job queue | squeue --json | No | No |
| List node status | sinfo --json | No | No |
| Submit batch job | sbatch | Yes | Yes — audit log row |
| Cancel job | scancel {job_id} | Yes | Yes — audit log row |
Explicitly deferred (not Phase 1):
| Deferred | Slurm command | Why deferred |
|---|---|---|
| Node drain/resume | scontrol update NodeName=... State=drain\|resume | Admin action — needs RBAC boundary (who can drain?). Existing worker drain/remove in the Workers card covers the platform-level action. |
| Job output/logs | sacct, file reads from job output directory | Requires filesystem access model (see §3.7). Cannot show job stdout without knowing where output files live. |
| Job detail | scontrol show job {id} | Nice-to-have, not MVP. Queue view shows enough. |
| Partition management | scontrol | Admin-only, low frequency, CLI is fine. |
| srun (interactive) | srun --pty | Requires terminal relay — use existing platform terminal or OOD. |
This boundary means the native thin UI is: read the queue, read the nodes, submit a batch job, cancel a job. Four operations. That is the contract.
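That contract can be encoded directly so it is hard to sprawl past. A sketch, assuming a hypothetical closed union type (names are illustrative):

```typescript
// The Phase 1 contract, as a closed union: queue, nodes, submit, cancel.
// Adding a fifth operation requires touching this type, which is the point.
type SlurmUiOperation =
  | { kind: "queue.list" }                  // squeue --json (read-only)
  | { kind: "nodes.list" }                  // sinfo --json (read-only)
  | { kind: "job.submit"; script: string }  // sbatch (mutating, audited)
  | { kind: "job.cancel"; jobId: number };  // scancel (mutating, audited)

// Mutating operations must produce an audit log row (per the table above).
function isMutating(op: SlurmUiOperation): boolean {
  return op.kind === "job.submit" || op.kind === "job.cancel";
}
```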
5.2 Architecture¶
Platform UI → Platform API → Node Agent (on controller) → Slurm CLI

GET    /api/v1/app-runtime/instances/{id}/slurm/queue            →  squeue
GET    /api/v1/app-runtime/instances/{id}/slurm/nodes            →  sinfo
POST   /api/v1/app-runtime/instances/{id}/slurm/jobs             →  sbatch
DELETE /api/v1/app-runtime/instances/{id}/slurm/jobs/{job_id}    →  scancel
The platform API proxies structured Slurm queries to the node-agent running on the controller allocation. The node-agent executes the Slurm CLI commands and returns parsed JSON output. The platform UI renders the results natively.
Control boundary alignment: This is the only option that fully satisfies the
platform's control boundary constraint (Kubernetes Platform Design v2 §3.1). All
Slurm CLI execution flows through the node-agent typed-task model — the platform API
never SSHes into the controller directly. The node-agent task for Slurm query/submit
should be a new task type (e.g., app.slurm.query, app.slurm.submit) registered
in the node-agent task catalog, with audit logging for mutations (job submission,
node drain/resume).
This also means the platform API does not need the embedded UI gateway contract to ship the native thin UI — it uses existing platform auth, existing node-agent execution, and renders natively. No iframe, no reverse proxy, no cookie scoping. This is why it can ship before Options 1 and 2.
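The platform-API side of the queue read can be sketched as a typed-task builder. The `NodeAgentTask` shape and `buildQueueTask` helper are illustrative; the task type names come from the catalog proposed above (app.slurm.query / app.slurm.submit):

```typescript
// Typed task handed to the node-agent on the controller allocation.
interface NodeAgentTask {
  type: "app.slurm.query" | "app.slurm.submit";
  allocationId: string; // always the controller allocation, never arbitrary nodes
  payload: Record<string, unknown>;
}

// Read-only queue query: squeue --json, executed by the node-agent.
// The platform API never SSHes into the controller directly.
function buildQueueTask(controllerAllocationId: string): NodeAgentTask {
  return {
    type: "app.slurm.query",
    allocationId: controllerAllocationId,
    payload: { command: "squeue", args: ["--json"] },
  };
}
```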
5.3 API Surface (MVP — four operations only)¶
# Addition to openapi.draft.yaml (app-runtime section)
# Scope: queue, nodes, submit, cancel. Nothing else in Phase 1.
/api/v1/app-runtime/instances/{instance_id}/slurm/queue:
get:
summary: List Slurm job queue
description: Returns parsed output of squeue --json
parameters:
- name: state
in: query
schema: { type: string, enum: [pending, running, completed, failed] }
- name: user
in: query
schema: { type: string }
responses:
200:
content:
application/json:
schema:
type: object
properties:
jobs:
type: array
items:
type: object
properties:
job_id: { type: integer }
name: { type: string }
user: { type: string }
state: { type: string }
partition: { type: string }
nodes: { type: string }
gpus: { type: integer }
time_elapsed: { type: string }
time_limit: { type: string }
submit_time: { type: string }
/api/v1/app-runtime/instances/{instance_id}/slurm/nodes:
get:
summary: List Slurm node status
description: Returns parsed output of sinfo --json
responses:
200:
content:
application/json:
schema:
type: object
properties:
nodes:
type: array
items:
type: object
properties:
hostname: { type: string }
state: { type: string }
cpus_total: { type: integer }
cpus_alloc: { type: integer }
gpus_total: { type: integer }
gpus_alloc: { type: integer }
memory_total_mb: { type: integer }
memory_alloc_mb: { type: integer }
partitions: { type: array, items: { type: string } }
/api/v1/app-runtime/instances/{instance_id}/slurm/jobs:
post:
summary: Submit a Slurm batch job
requestBody:
content:
application/json:
schema:
type: object
required: [script]
properties:
script: { type: string, description: "Batch script content" }
name: { type: string }
partition: { type: string }
gpus: { type: integer }
nodes: { type: integer }
time_limit: { type: string, description: "e.g. 01:00:00" }
responses:
201:
content:
application/json:
schema:
type: object
properties:
job_id: { type: integer }
/api/v1/app-runtime/instances/{instance_id}/slurm/jobs/{job_id}:
delete:
summary: Cancel a Slurm job
description: Calls scancel {job_id}. Audited.
responses:
204:
description: Job cancellation accepted
404:
description: Job not found
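From the platform web UI, the mutating half of this surface would be used roughly as follows. The paths match the draft spec above; the function names and same-origin auth assumption are illustrative:

```typescript
// Build the jobs collection / single-job URL from the draft spec.
function slurmJobsUrl(instanceId: string, jobId?: number): string {
  const base = `/api/v1/app-runtime/instances/${instanceId}/slurm/jobs`;
  return jobId === undefined ? base : `${base}/${jobId}`;
}

// POST /slurm/jobs — submit a batch script; 201 returns { job_id }.
async function submitJob(
  instanceId: string,
  script: string,
  opts: { name?: string; partition?: string; gpus?: number } = {},
): Promise<number> {
  const res = await fetch(slurmJobsUrl(instanceId), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ script, ...opts }),
  });
  if (res.status !== 201) throw new Error(`submit failed: ${res.status}`);
  return (await res.json()).job_id;
}

// DELETE /slurm/jobs/{job_id} — cancel; 204 on success.
async function cancelJob(instanceId: string, jobId: number): Promise<void> {
  const res = await fetch(slurmJobsUrl(instanceId, jobId), { method: "DELETE" });
  if (res.status !== 204) throw new Error(`cancel failed: ${res.status}`);
}
```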
5.4 Layout: Enhanced Instance Detail¶
┌── my-slurm (Slurm) ──────────────────────────────────────────────┐
│ │
│ Overview │ Jobs │ Nodes │ Workers │ Members │ Operations │
│ │
│ ═══ Jobs tab ═══ │
│ │
│ ┌─ Submit Job ──────────────────────────────────────────────────┐ │
│ │ Name: [training-run-042 ] Partition: [gpu ▾] │ │
│ │ GPUs: [4 ] Time limit: [04:00:00] Nodes: [1 ] │ │
│ │ │ │
│ │ Script: │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ #!/bin/bash │ │ │
│ │ │ #SBATCH --job-name=training-run-042 │ │ │
│ │ │ #SBATCH --gres=gpu:4 │ │ │
│ │ │ #SBATCH --time=04:00:00 │ │ │
│ │ │ │ │ │
│ │ │ python train.py --model llama --epochs 10 │ │ │
│ │ └──────────────────────────────────────────────────────────┘ │ │
│ │ [Submit] │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─ Job Queue ───────────────────────────────────────────────────┐ │
│ │ Filter: [All ▾] [Refresh] Auto: 10s │ │
│ │ │ │
│ │ ID Name State GPUs Node Elapsed │ │
│ │ 10042 training-run-041 RUNNING 4 worker-03 02:34 [x] │ │
│ │ 10043 eval-checkpoint RUNNING 2 worker-01 00:12 [x] │ │
│ │ 10044 training-run-042 PENDING 4 -- -- [x] │ │
│ │ 10040 preprocessing COMPLTD 1 worker-02 00:45 │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
│ ═══ Nodes tab ═══ │
│ │
│ ┌─ Cluster Nodes ───────────────────────────────────────────────┐ │
│ │ │ │
│ │ Hostname State CPUs GPUs Memory Part │ │
│ │ worker-01 alloc 32/64 2/4 128/256 gpu │ │
│ │ worker-02 idle 0/64 0/4 0/256 gpu │ │
│ │ worker-03 alloc 64/64 4/4 245/256 gpu │ │
│ │ worker-04 down* -- -- -- gpu │ │
│ │ │ │
│ │ * worker-04: Node not responding (last seen 12m ago) │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
5.5 Pros / Cons¶
| Pros | Cons |
|---|---|
| Zero external dependencies — fully native | Limited functionality vs OOD (no interactive apps, no file browser) |
| Consistent UX — same design language as rest of platform | API surface to maintain (Slurm CLI parsing, versioning) |
| Fast to build for basic job view | No Jupyter/VS Code integration on compute nodes |
| No auth complexity — uses existing platform auth | Slurm CLI output format can change between versions |
| No iframe/embedding needed | Only useful for Slurm (not reusable for other HPC schedulers) |
| Decision-first UX — show GPU cost before job submission | Platform becomes coupled to Slurm internals |
5.6 Best For¶
Teams that primarily use sbatch scripts and want quick visibility into queue status
from the same UI where they manage their cluster. Good for smaller teams that don't
need interactive notebook sessions on compute nodes.
6. Comparison Matrix¶
| Dimension | Open OnDemand | Grafana + Exporter | Native Thin UI |
|---|---|---|---|
| Job submission | Yes (web form + script) | No | Yes (basic) |
| Job monitoring | Yes (queue, history) | No (metrics only) | Yes (queue view) |
| Interactive apps | Yes (Jupyter, VS Code, desktop) | No | No |
| File browser | Yes | No | No |
| GPU metrics | No (not built-in) | Yes (dcgm-exporter) | No |
| Cluster utilization | Basic (node overview) | Yes (detailed dashboards) | Basic (node table) |
| Alerting | No | Yes (Grafana alerts) | No |
| Auth | OIDC (mod_auth_openidc) | OIDC (Grafana config) | Platform auth (native) |
| Embedding | iframe (OIDC proxy) | iframe (allow_embedding) | Native (no iframe) |
| Deploy complexity | High (Apache + Passenger + OOD config) | Medium (3 binaries + config) | Low (API + UI code) |
| Resource footprint | Medium-high | Low-medium | None (runs on platform) |
| Separate app instance? | Yes (companion) or tab | No (co-hosted on controller) | No (native) |
| App platform validation | Yes — tests companion app + embedded UI pattern | Partially — tests embedded UI only | No — platform-internal feature |
| Build effort | Small (deploy + configure) | Small (deploy + dashboards) | Medium (API + UI + Slurm parsing) |
| Maintenance | OOD upgrades + Slurm version compat | Prometheus storage + dashboard upkeep | Slurm CLI parsing maintenance |
| Requires gateway contract? | Yes — hardest case (WS, file upload, complex cookies) | Yes — simpler case (mostly read-only) | No — native rendering, no proxy |
| Control boundary | OOD runs on allocation, platform proxies — acceptable | Grafana runs on allocation, platform proxies — acceptable | Node-agent task model — cleanest control path |
| Audit coverage | OOD actions not audited by platform (runs independently) | Read-only — no mutations to audit | Full audit — job submit/cancel via node-agent task log |
7. Platform Primitive Dependencies¶
The Kubernetes Platform Design v2 (§3–4) identifies hard constraints and missing primitives that apply equally to Slurm UI integration. This section maps each option against those dependencies.
7.1 Gateway Contract (v2 §4.3)¶
Options 1 and 2 embed third-party UIs and therefore require the gateway contract. The contract must define:
| Concern | Grafana (Option 2) | Open OnDemand (Option 1) |
|---|---|---|
| Reverse-proxy route resolution | Simple — Grafana on fixed port on controller | Complex — OOD on Apache, path rewriting |
| Cookie domain scoping | Grafana session cookie under proxy origin | OOD + Apache mod_auth_openidc cookies, potential domain conflict |
| WebSocket upgrade | Optional (Grafana live, can be disabled) | Required (shell access, interactive apps) |
| CSRF model | Grafana has built-in CSRF; proxy must pass tokens | OOD uses Rails CSRF; proxy must preserve headers |
| Session TTL alignment | Grafana OIDC session vs platform session | OOD OIDC session + Apache session vs platform session |
| Logout propagation | Platform logout → revoke Grafana OIDC session | Platform logout → revoke OOD OIDC + Apache sessions |
| Fallback to link-out | Acceptable for Grafana (low-stakes) | Less acceptable (daily-use tool, SSO breakage is disruptive) |
Grafana is the simpler case and should validate the gateway contract first. OOD should only ship after Grafana proves the contract works.
7.2 Node Mutation Control (v2 §3.1)¶
| Option | How it touches nodes | Alignment |
|---|---|---|
| Native Thin UI | Node-agent typed tasks (`app.slurm.query`, `app.slurm.submit`) | Full compliance — platform owns execution and audit |
| Grafana + Exporter | Installed by app worker during Slurm bootstrap; read-only after | Acceptable — no runtime node mutation from Grafana |
| Open OnDemand | OOD runs `sbatch`/`squeue` directly via Slurm CLI on the host | Partial — OOD bypasses node-agent for Slurm operations. Acceptable because OOD runs on the controller allocation (not arbitrary nodes) and Slurm itself is the authority. But OOD job submissions are not audited by the platform. |
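The spec names the task types `app.slurm.query` and `app.slurm.submit`; the payload shapes below are illustrative assumptions showing why the node-agent path yields a full audit trail while OOD submissions do not.

```typescript
// Sketch of typed node-agent task payloads for the native thin UI.
// Task type names come from the spec; payload fields are assumed.

interface SlurmQueryTask {
  type: "app.slurm.query";
  command: "squeue" | "sinfo" | "sacct";
  json: true; // always request --json output (stable on Slurm 23.02+)
}

interface SlurmSubmitTask {
  type: "app.slurm.submit";
  script: string;      // sbatch script body
  partition?: string;
  submittedBy: string; // platform user id — what makes submissions auditable
}

// Every task lands in the node-agent task log, giving the audit trail
// that OOD-submitted jobs lack (§9).
function auditRecord(task: SlurmQueryTask | SlurmSubmitTask): string {
  return task.type === "app.slurm.submit"
    ? `${task.type} by ${task.submittedBy}`
    : `${task.type}: ${task.command}`;
}
```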
7.3 Recovery Model (v2 §4.4)¶
Each option introduces failure modes the app-runtime recovery model must handle:
| Option | Failure modes | Recovery needed |
|---|---|---|
| Native Thin UI | Node-agent task timeout, Slurm CLI unavailable | Existing node-agent retry semantics; degrade gracefully in UI |
| Grafana + Exporter | Grafana process crash, Prometheus disk full, exporter crash | App worker health check on controller; restart via node-agent task; alert on monitoring-down (meta-monitoring) |
| Open OnDemand | Apache crash, OOD app error, OIDC session desync | Companion app instance status reporting; decommission/redeploy path; platform detects unhealthy OOD and surfaces in workloads sidebar |
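The monitoring-stack row above has one subtlety worth spelling out: the monitoring stack cannot report on its own failure, so full-stack-down needs an external (meta-monitoring) alert. A small sketch of the recovery decision logic — service names and action labels are assumptions, not the actual app-worker API:

```typescript
// Sketch of recovery planning for the Option 2 monitoring stack.
// Names are illustrative; real recovery lives in the app-runtime model (v2 §4.4).

type Status = {
  service: "grafana" | "prometheus" | "slurm-exporter";
  healthy: boolean;
};
type Action =
  | { service: string; action: "restart-via-node-agent-task" }
  | { action: "alert-monitoring-down" };

function planRecovery(statuses: Status[]): Action[] {
  const actions: Action[] = statuses
    .filter((s) => !s.healthy)
    .map((s) => ({ service: s.service, action: "restart-via-node-agent-task" as const }));
  // Meta-monitoring: if the whole stack is down, the stack cannot alert
  // on itself — the platform must raise the alarm externally.
  if (statuses.length > 0 && statuses.every((s) => !s.healthy)) {
    actions.push({ action: "alert-monitoring-down" });
  }
  return actions;
}
```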
7.4 Dependency Summary¶
┌─────────────────────────┐
│ Embedded UI Gateway │
│ Contract (v2 §4.3) │
└──────────┬──────────────┘
│
┌──────────┴──────────────┐
│ │
┌─────▼──────┐ ┌──────▼──────┐
│ Grafana │ │ Open │
│ (Phase 3) │ │ OnDemand │
│ │ │ (Phase 4) │
└────────────┘ └─────────────┘
┌──────────────┐
│ Native Thin │ ← No gateway dependency
│ UI (Phase 1) │ ← Uses node-agent tasks (existing)
│ │ ← Uses platform auth (existing)
└──────────────┘
8. Recommendation: Layer All Three¶
These options serve different needs and are complementary. The recommended approach layers them:
┌── my-slurm (Slurm) ──────────────────────────────────────────────┐
│ │
│ Overview │ Jobs │ Monitoring │ HPC Portal │ Workers │ Operations │
│ │
│ Overview = existing Slurm Runtime card + Workers card │
│ Jobs = Option 3 (native thin UI — job queue + submit) │
│ Monitoring = Option 2 (embedded Grafana — GPU metrics) │
│ HPC Portal = Option 1 (embedded OOD — full HPC experience) │
│ Workers = existing worker management │
│ Operations = existing operation history │
└────────────────────────────────────────────────────────────────────┘
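The layered tabs above would plug into the app extension registry (packages/web/src/lib/apps/extensions.ts). The registration shape below is a hypothetical sketch — the registry's actual API is not shown in this spec — but it captures the key structural fact: only the embedded tabs depend on the gateway contract.

```typescript
// Hypothetical tab registration for the layered Slurm UI.
// The InstanceTab shape is assumed, not the actual registry interface.

type TabKind = "native" | "embedded";

interface InstanceTab {
  id: string;
  label: string;
  kind: TabKind;
  phase: 1 | 2 | 3 | 4;     // implementation phase from §8
  gatewayRequired: boolean; // embedded tabs block on the gateway contract
}

const slurmTabs: InstanceTab[] = [
  { id: "jobs", label: "Jobs", kind: "native", phase: 1, gatewayRequired: false },
  { id: "monitoring", label: "Monitoring", kind: "embedded", phase: 3, gatewayRequired: true },
  { id: "hpc-portal", label: "HPC Portal", kind: "embedded", phase: 4, gatewayRequired: true },
];

// The native Jobs tab ships independently; everything embedded waits on Phase 2.
const needsGateway = slurmTabs.filter((t) => t.gatewayRequired).map((t) => t.id);
```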
Implementation Order (revised to respect platform primitive dependencies)¶
The Kubernetes Platform Design v2 (§4) identifies four platform primitives that must exist before embedded UIs can ship. This changes the phase order from the naive "lowest effort first" to "dependency-correct order":
| Phase | What | Prerequisite | Why this order |
|---|---|---|---|
| Phase 1 | Native Thin UI (Jobs tab) | Node-agent task types for Slurm query/submit | No platform primitive dependencies — uses existing auth, existing node-agent, renders natively. Gives job queue visibility immediately. Builds the Slurm proxy API reusable by CLI and SDK. |
| Phase 2 | Embedded UI gateway contract | Phase 1 validated; design work | Define reverse-proxy route shape, auth strategy contract, cookie/session model, WS upgrade, CSP, logout propagation. This is a platform primitive, not a Slurm feature. |
| Phase 3 | Grafana + Slurm Exporter | Gateway contract (Phase 2) | First embedded UI. Grafana is the simplest case (mostly read-only, minimal WS, well-understood OIDC). Validates the gateway contract with low risk. GPU metrics are universally valuable. |
| Phase 4 | Open OnDemand | Gateway contract proven (Phase 3) | Full HPC portal. Exercises the hardest gateway cases (WS for shell, interactive apps, file uploads, complex cookie state). By Phase 4 the gateway contract is validated. |
Why This Order (corrected)¶
- Native Jobs tab first because it has zero dependencies on new platform primitives. It uses the node-agent task model (already exists), platform auth (already exists), and renders natively (no iframe, no proxy). It ships the #1 user request (job queue visibility) while the gateway contract is being designed.
- Gateway contract second because both Grafana and OOD depend on it. Designing the contract in parallel with Phase 1 is fine, but no embedded UI tab should ship before the contract is defined and reviewed. The contract covers: reverse-proxy route resolution, supported auth strategies, cookie domain scoping, WebSocket upgrade handling, CSP frame-ancestors, session TTL alignment, logout propagation, and fallback-to-link-out criteria.
- Grafana third because it is the simplest embedded UI case — mostly read-only dashboards, minimal WebSocket usage, well-understood OIDC support. It validates the gateway contract with low blast radius. If the gateway contract has gaps, Grafana will expose them cheaply.
- OOD fourth because it exercises every hard case in the gateway contract: WebSocket for shell and interactive apps, file upload multipart handling, complex cookie and session state, Apache mod_auth_openidc interaction with the platform's OIDC proxy. By Phase 4, those patterns are proven.
What Each Phase Adds to the Nav¶
After Phase 1 (native thin UI):
WORKLOADS
▸ my-slurm (Slurm) → Overview │ Jobs │ Nodes │ Workers │ Operations
After Phase 2 (gateway contract): No visible nav change. Platform primitive is internal.
After Phase 3 (Grafana):
WORKLOADS
▸ my-slurm (Slurm) → Overview │ Jobs │ Nodes │ Monitoring │ Workers │ Operations
After Phase 4 (Open OnDemand):
WORKLOADS
▸ my-slurm (Slurm) → Overview │ Jobs │ Nodes │ Monitoring │ HPC Portal │ Workers │ Operations
9. Known Tradeoff: OOD Audit Bypass¶
This is a conscious product and security decision, not an open question.
When users submit jobs via Open OnDemand, those submissions execute directly on the controller allocation via Slurm CLI. They bypass the platform's node-agent task model and audit trail entirely. The platform sees the Slurm cluster running but has no record of individual job submissions, cancellations, or interactive sessions initiated through OOD.
9.1 What the platform sees vs what OOD does¶
| Action | Platform audit trail | OOD audit trail |
|---|---|---|
| Cluster deploy | Yes — app instance lifecycle | N/A |
| Worker add/drain/remove | Yes — member operations | N/A |
| Job submit via native UI (Phase 1) | Yes — node-agent task log | N/A |
| Job submit via OOD | No | OOD Apache access log (on the allocation) |
| Job cancel via OOD | No | OOD Apache access log |
| Interactive Jupyter session via OOD | No | OOD session log |
| File upload/download via OOD | No | OOD Apache access log |
9.2 Why this is acceptable (for now)¶
- OOD operates within Slurm's authorization model. OOD authenticates users via OIDC and maps them to OS users on the cluster. Slurm enforces per-user quotas, partition access, and job limits. OOD cannot do more than the user's Slurm permissions allow.
- The platform audits the blast radius, not the workload. The platform's audit contract covers infrastructure mutations (allocations, node state, credentials). Individual workload actions (which training job ran, which notebook opened) are analogous to individual commands typed in an SSH session — the platform doesn't audit those either.
- OOD logs exist, just not in the platform. Apache access logs and Slurm accounting (`sacct`) capture everything. They live on the allocation, not in the platform's `audit_logs` table. If compliance requires centralized audit, those logs can be forwarded to the platform's logging pipeline — but that is a separate integration, not a Phase 4 blocker.
9.3 When this becomes unacceptable¶
- If the platform needs to enforce per-job cost attribution (billing per job, not per allocation). OOD-submitted jobs would not have platform-tracked cost.
- If compliance requires that all user actions on platform-managed infrastructure are centrally audited. OOD's logs-on-allocation model does not satisfy this.
- If the platform needs to enforce job-level policies (e.g., max GPU-hours per job, job submission rate limits). OOD bypasses the platform API entirely.
If any of these become requirements, the options are:
1. Proxy OOD through the platform API — heavy OOD customization, breaks the "deploy OOD as-is" model.
2. Forward OOD/Slurm logs to the platform — lighter, preserves OOD as-is, but audit is eventually-consistent (log shipping delay).
3. Accept the gap and document it — current recommendation.
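Option 2 (forward logs) is worth sizing: Slurm's own accounting already has every job, so the forwarder is mostly parsing. A sketch of parsing `sacct --parsable2` pipe-delimited output — the field selection and the idea of shipping records to the platform's logging pipeline are assumptions about how such an integration could look:

```typescript
// Sketch of the log-forwarding option: parse `sacct --parsable2` output
// (pipe-delimited, header row first) into records the platform's logging
// pipeline could ingest. Field list is an illustrative subset.

interface SacctRecord {
  jobId: string;
  user: string;
  state: string;
  elapsed: string;
}

function parseSacct(output: string): SacctRecord[] {
  const [header, ...rows] = output.trim().split("\n");
  const cols = header.split("|");
  return rows.map((row) => {
    const fields = row.split("|");
    const get = (name: string) => fields[cols.indexOf(name)] ?? "";
    return {
      jobId: get("JobID"),
      user: get("User"),
      state: get("State"),
      elapsed: get("Elapsed"),
    };
  });
}
```

Note the eventual-consistency caveat from option 2 still applies: records arrive only as fast as the shipping interval, so this is a compliance trail, not a real-time control point.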
9.4 Decision¶
For Phase 4: accept the audit gap and document it clearly in the product. The instance detail page should show a notice when the HPC Portal tab is active: "Actions taken in the HPC Portal are managed by Slurm, not tracked in platform audit logs." This is honest UX, consistent with the decision-first principle.
11. Shared Monitoring Pattern¶
The Grafana + Exporter model (Option 2) generalizes beyond Slurm. Every app type benefits from embedded monitoring:
| App | Exporter | Dashboards |
|---|---|---|
| Slurm | prometheus-slurm-exporter + dcgm-exporter | Cluster overview, GPU detail, job queue depth |
| RKE2 / Kubernetes | kube-state-metrics + dcgm-exporter | Pod status, GPU scheduling, node health |
| Ray | ray-prometheus-exporter | Cluster resources, task queue, object store |
| MLflow | mlflow-prometheus-exporter (custom) | Experiment counts, model registry, artifact storage |
This suggests the monitoring tab should be a platform-level primitive, not an app-specific feature. The app manifest declares its exporter endpoints, and the platform deploys Prometheus + Grafana as a shared monitoring sidecar:
```yaml
# In app manifest
monitoring:
  exporters:
    - name: slurm
      port: 9341
      path: /metrics
    - name: dcgm
      port: 9400
      path: /metrics
  grafana_dashboards:
    - source: bundled   # shipped with the app
      path: /dashboards/slurm-overview.json
    - source: bundled
      path: /dashboards/gpu-detail.json
```
This keeps monitoring consistent across all apps and avoids each app developer reinventing the Prometheus + Grafana deployment.
Deployment model decision: Per-instance monitoring ships first (Phase 3). Shared/platform-managed monitoring is the future direction but requires tenant isolation, cross-allocation networking, and platform-operated Prometheus infrastructure. The shared pattern described above is the target architecture; Phase 3 is per-instance only. The manifest schema should be designed for the shared model from the start (so apps don't need to change when the platform migrates), but the runtime implementation in Phase 3 deploys Prometheus + Grafana on the controller allocation alongside the app.
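Designing the manifest schema for the shared model from the start means the platform should validate it the same way in both deployment modes. A sketch of the schema as TypeScript types with a validation helper — the types mirror the YAML manifest above; the helper and its rules are illustrative:

```typescript
// Monitoring manifest schema from the YAML manifest, as TypeScript types.
// The validation rules are illustrative, not the platform's actual checks.

interface ExporterSpec {
  name: string;
  port: number;
  path: string;
}

interface DashboardSpec {
  source: "bundled"; // other sources (e.g. remote) could be added later
  path: string;
}

interface MonitoringManifest {
  exporters: ExporterSpec[];
  grafana_dashboards: DashboardSpec[];
}

function validateMonitoring(m: MonitoringManifest): string[] {
  const errors: string[] = [];
  for (const e of m.exporters) {
    if (e.port < 1 || e.port > 65535) errors.push(`exporter ${e.name}: bad port ${e.port}`);
    if (!e.path.startsWith("/")) errors.push(`exporter ${e.name}: path must be absolute`);
  }
  for (const d of m.grafana_dashboards) {
    if (!d.path.endsWith(".json")) errors.push(`dashboard ${d.path}: expected .json`);
  }
  return errors;
}
```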
12. Open Questions¶
- OOD deployment model — companion app instance or co-hosted on controller? Companion is cleaner (separate lifecycle) but adds complexity (inter-instance references). Co-hosted is simpler but mixes concerns.
- Slurm REST API (slurmrestd) vs CLI parsing — the native thin UI (Phase 1) can use either. `slurmrestd` provides structured JSON output but requires additional setup. CLI parsing (`squeue --json`) is simpler but format varies by Slurm version. Slurm 23.02+ has stable `--json` output.
- Interactive sessions — OOD's killer feature is interactive Jupyter/VS Code on compute nodes. Should this be available without OOD, via the platform's existing terminal feature? The platform already has browser terminal support — extending it to launch Jupyter on a specific Slurm node could cover this use case.
- Job cost attribution — the native thin UI could show estimated cost per job (GPU-hours × rate). This requires the platform to know the SKU rate and the job's GPU allocation. Worth building? It would be a differentiating feature vs. OOD.
- Gateway contract ownership — the embedded UI gateway contract is a platform primitive, not a Slurm feature. Who owns the design? Should it be a separate architecture doc (`doc/architecture/Embedded_UI_Gateway_Spec.md`) or a section in the navigation redesign doc? It blocks Phases 3 and 4 of this spec and also blocks the Rancher embedded UI in the Kubernetes design.
- Future tab concepts — the current tab model covers Jobs (native), Monitoring (embedded Grafana), and HPC Portal (embedded OOD). Two additional tab concepts may be worth exploring independently of OOD:
  - Access — SSH credentials, terminal launch, kubeconfig-style access tokens for the Slurm cluster. Currently scattered across the Overview and platform terminal. A dedicated tab could consolidate access methods.
  - Files — lightweight file browser for job output directories, without the full OOD stack. Could use the node-agent task model to list/read files on the controller. Simpler than OOD's file browser but covers the "where is my job output?" use case.

These are not committed — they are options to evaluate if OOD is deferred or if users need file/access capabilities before Phase 4.
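On the slurmrestd-vs-CLI question: if the CLI-parsing path is chosen for Phase 1, the consuming code stays small. A sketch of parsing `squeue --json` output — the field subset is an assumption (the real schema is much larger), and the array-vs-string handling for `job_state` reflects the version-variance caveat raised above:

```typescript
// Sketch of the CLI-parsing path for the native Jobs tab:
// parse `squeue --json` output (stable on Slurm 23.02+).
// Field subset is an illustrative assumption.

interface SqueueJob {
  job_id: number;
  name: string;
  job_state: string | string[]; // some Slurm versions emit an array
  partition: string;
}

function parseSqueue(raw: string): SqueueJob[] {
  const doc = JSON.parse(raw) as { jobs?: SqueueJob[] };
  return doc.jobs ?? [];
}

// Normalize across Slurm versions — exactly the kind of shim slurmrestd
// would make unnecessary.
function stateOf(job: SqueueJob): string {
  return Array.isArray(job.job_state) ? job.job_state[0] : job.job_state;
}
```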