
Slurm Product Workflow And Gap Assessment v1

Purpose

Record the actual product workflow that GPUaaS and the Slurm app must support, compare it to the current proof path, and identify the missing pieces that still block honest app-developer readiness.

This document exists because the current Slurm path proves feasibility, but still contains shortcut behavior. It should be used as the review gate before declaring the app platform ready for independent app teams.

Reading order:

  1. External_App_Team_Integration_Guide_v1.md
  2. Slurm_First_Slice_Platform_App_Split_v1.md
  3. Slurm_First_Slice_Adapter_Contract_v1.md
  4. Slurm_Product_Workflow_And_Gap_Assessment_v1.md
  5. Slurm_Tenant_Scope_Semantics_v1.md

Current Conclusion

The current Slurm slice is:

  • in a first usable state for local and platform-control validation,
  • good enough as a proof of viability,
  • not yet ready as a fully hardened product workflow for external app developers.

Why:

  • the platform boundary is mostly proven,
  • the app/controller deployment model is proven,
  • single-node and two-node worker lifecycle now work through the app runtime APIs,
  • but final operator workflow, product cleanup, and broader multi-node semantics are still incomplete.

Current Usable State As Of April 2026

Slurm has reached the first usable validation slice.

Working:

  • deploy through the app catalog into an existing allocation,
  • bootstrap controller and worker on the same allocation,
  • add a worker on a second allocation through member operations,
  • remove and re-add a worker through member operations,
  • surface controller/worker members, events, runtime state, access notes, and lifecycle controls,
  • use Slurm natively after bootstrap with sinfo and srun,
  • recover bootstrap-failed worker records through supported remove/tombstone behavior,
  • platform-control validation now keeps the reference Slurm instance in a stable running/healthy state after bootstrap completion; the controller no longer reports slurm_bootstrap_completed as progressing.

Still open:

  • make decommission perform honest runtime teardown, not just metadata cleanup,
  • harden batch-job/accounting behavior so sbatch does not show confusing transient InvalidAccount states,
  • clean up or document PMIx startup warnings from the distro Slurm package,
  • keep package/catalog version labeling honest, because the GPUaaS app version may differ from the distro Slurm daemon version,
  • add automated platform-control smoke tests for deploy, add worker, remove worker, stop/start/restart, access, health classification, and decommission,
  • improve the Slurm access bundle with exact paths, common commands, and scheduler health examples,
  • define whether broader multi-node placement remains one-worker-at-a-time or grows a bulk worker placement flow.
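The smoke-test item above can be made concrete. The following is a minimal sketch of what an automated platform-control smoke test could assert for the day-2 operations listed; every name here (FakePlatformClient, deploy, add_worker, and so on) is a hypothetical stand-in, since a real test would drive the public app-instance and member-operation APIs rather than an in-memory fake.

```python
# Hedged sketch: a day-2 smoke test shape. All names are hypothetical
# placeholders for the real project-scoped public API client.

class FakePlatformClient:
    """In-memory stand-in for the public app-instance API."""

    def __init__(self):
        self.state = "absent"
        self.workers = set()

    def deploy(self, controller_allocation_id):
        # Real flow: create app instance + initial controller member.
        self.state = "running"
        return {"app_instance": "slurm-ref", "state": self.state}

    def add_worker(self, allocation_id):
        self.workers.add(allocation_id)

    def remove_worker(self, allocation_id):
        self.workers.discard(allocation_id)

    def decommission(self):
        # Honest teardown should clear runtime members, not just metadata.
        self.workers.clear()
        self.state = "decommissioned"


def run_smoke(client):
    """Exercise deploy -> add worker -> remove worker -> decommission."""
    client.deploy("alloc-ctl")
    assert client.state == "running"
    client.add_worker("alloc-w1")
    assert "alloc-w1" in client.workers
    client.remove_worker("alloc-w1")
    assert "alloc-w1" not in client.workers
    client.decommission()
    assert client.state == "decommissioned" and not client.workers
    return "smoke ok"
```

A real version of this would replace the fake with API calls and add the stop/start/restart, access, and health-classification checks named above.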

Product judgment:

  • Slurm is a credible app-runtime reference implementation now.
  • The next work is hardening and operator UX, not proving the architecture from scratch.

Target Product Workflow

This is the workflow the product must support, not the shortcut path used during proof.

Tenant-admin / operator workflow

  1. Go to App Catalog.
  2. Choose an entitled app.
  3. Click Deploy.
  4. Provide the inputs that the app actually requires.
  5. Submit the deployment.
  6. Watch clear lifecycle progress until the app reaches a stable state.
  7. Later perform app-specific lifecycle actions such as:
     • add worker,
     • drain worker,
     • remove worker,
     • decommission app instance.

App-controller workflow

  1. Run as an independently deployed app-owned controller.
  2. Authenticate with a project-scoped service account.
  3. Read app instances, members, and member operations through public APIs only.
  4. Acquire platform-supported machine access to the target allocation.
  5. Install or configure the runtime on the selected allocation host.
  6. Report progress, status, and failures back through public APIs.
  7. Continue reconciling until the desired app state is reached.
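The "continue reconciling" step above is the core of the app-controller model. A minimal sketch of that loop follows; the callables (read_desired, observe, apply_op, report) are hypothetical placeholders for reads and writes against the public member and member-operation APIs, not a real SDK.

```python
# Hedged sketch of a controller reconcile pass. All function names are
# illustrative assumptions, not part of any published platform contract.

def reconcile_once(desired_members, observed_members):
    """Return the member operations needed to converge observed -> desired."""
    ops = []
    for member_id in sorted(desired_members - observed_members):
        ops.append(("bootstrap", member_id))  # install/configure the runtime
    for member_id in sorted(observed_members - desired_members):
        ops.append(("teardown", member_id))   # honest runtime removal
    return ops


def reconcile_loop(read_desired, observe, apply_op, report):
    """One pass: read intent via public APIs, act, report status back."""
    ops = reconcile_once(read_desired(), observe())
    for op in ops:
        apply_op(op)
    report("converged" if not ops else "progressing")
    return ops
```

The design point this illustrates: the controller never mutates platform state directly; it only computes a diff and reports progress, which is what lets the platform stay app-agnostic.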

Platform workflow

The platform must provide:

  • app-instance lifecycle envelopes,
  • member and member-operation envelopes,
  • IAM, audit, and correlation,
  • explicit placement or allocation-intent surfaces,
  • app-compatible machine access primitives,
  • operator-readable status and failure evidence.

For placement, the preferred platform surface is the existing project-scoped allocation read model. The product should select from allocations already visible in scope rather than inventing a Slurm-only node-picker contract.

The platform must not provide:

  • Slurm-specific install logic,
  • Slurm-specific health logic,
  • Slurm-specific drain/remove logic,
  • Slurm-specific node role semantics.
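One way to picture the boundary the two lists above draw: the platform-owned envelope carries only generic fields plus an allocation reference, and nothing Slurm-shaped. The field names below (placement_intent, allocation_id) are assumptions drawn from this document, not a published schema.

```python
# Hedged illustration of an app-agnostic platform envelope. Field names
# are assumed from this document, not from a real API specification.

GENERIC_FIELDS = {"app_instance_id", "member_id", "operation", "placement_intent"}

deploy_intent = {
    "app_instance_id": "slurm-ref-01",
    "member_id": "controller-0",
    "operation": "create",
    "placement_intent": {
        # Select from allocations already visible in project scope,
        # rather than a Slurm-only node-picker contract.
        "allocation_id": "alloc-1234",
    },
}


def platform_owns_only_generic_fields(envelope):
    """The envelope must stay app-agnostic: no Slurm-specific keys."""
    return set(envelope) <= GENERIC_FIELDS and not any(
        key.startswith("slurm_") for key in envelope
    )
```

Anything Slurm-specific (node roles, drain semantics, install steps) would then live entirely in the app controller's interpretation of the envelope, not in the envelope itself.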

Target Single-Node Slurm Workflow

For the first single-node Slurm product path:

  1. Tenant admin deploys Slurm Reference.
  2. Deploy UI collects the real required inputs for single-node Slurm.
  3. Platform creates the app instance and initial controller member.
  4. Slurm controller sees that intent through public APIs.
  5. Slurm controller acquires machine access through a supported app access path.
  6. Slurm controller bootstraps the same allocation host with:
     • munge
     • slurmctld
     • slurmd
  7. App runtime seeds initial worker add operations from the selected worker_allocation_ids.
  8. Slurm controller reports:
     • controller member ready
     • app instance running
     • runtime detail for operator visibility
  9. Operator sees the instance as stable in the UI.
  10. Worker lifecycle actions use explicit allocation-based placement or explicit single-node semantics, not hidden inference.
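The bootstrap step in the workflow above has a strict ordering: munge must be healthy before slurmctld, and slurmctld before slurmd on the same host. A minimal sketch of that sequencing follows; the start/is_healthy callables are placeholders for real service management (for example, systemd units installed by the distro Slurm package), not an actual implementation.

```python
# Hedged sketch of single-node bootstrap ordering. The start/is_healthy
# callables are illustrative stand-ins for real service management.

BOOTSTRAP_ORDER = ["munge", "slurmctld", "slurmd"]


def bootstrap_single_node(start, is_healthy):
    """Start each service in order; stop and report on the first failure.

    Returning a non-terminal failure (rather than raising) is what lets the
    controller surface a bootstrap-failed member record that the operator
    can recover through supported remove/retry actions.
    """
    for service in BOOTSTRAP_ORDER:
        start(service)
        if not is_healthy(service):
            return ("bootstrap_failed", service)
    return ("slurm_bootstrap_completed", None)
```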

Proof Path We Used

The current proof path demonstrated that the architecture is viable, but it is not the final intended workflow.

What was proven:

  • generic app-instance member and member-operation primitives can support an independently deployed app controller,
  • a project-scoped service account can drive the app controller,
  • the Slurm app can bootstrap a real node and report status through public APIs,
  • the app controller can run continuously in kind parity,
  • the UI can expose app status and app-specific worker actions.

Shortcut Behavior In The Current Proof

These are the places where the current implementation is still shortcut-oriented.

1. Multi-node product workflow is not complete

Current behavior:

  • deploy is explicit for the supported single-node reference path,
  • Add worker is explicit for one selected allocation at a time,
  • the controller discovers app instances by app slug in project scope.

Why this is not enough:

  • multi-node Slurm still needs clear operator semantics for selecting one or many worker allocations,
  • tenant-scoped or multi-project Slurm is still undefined product behavior.

2. Deploy flow is still single-node-first

Current behavior:

  • deploy asks for:
     • controller allocation
     • same-node vs separate initial worker placement
     • initial worker allocation list when separate worker placement is chosen
     • bootstrap SSH access credential
     • optional operator service account
  • the current UI now supports both:
     • same-node initial placement, and
     • explicit separate initial worker placement

Why this is not enough:

  • this is now correct for the supported project-scoped single-node or split-initial-placement reference path,
  • but broader multi-node lifecycle semantics and product language are still not finalized.

3. Operator workflow is still more explicit in docs than in UX

Current behavior:

  • the platform and app now support:
     • project-scoped access-credential custody and delivery
     • first-class deploy placement_intent
     • allocation-based worker placement
  • but the UI still needs more generalized reuse and polish around those supported primitives.

Why this matters: external app developers need a clean reference path, not just a working Slurm-specific page.

4. Historical proof residue still influences the UX model

Current proof behavior:

  • the UI now separates current state from history better than before,
  • but some of the page structure was driven by cleaning up proof residue rather than by a final operator workflow model.

Why this matters: product UX should reflect intended day-1 and day-2 workflows, not just what was needed to survive proof iterations.

What The Product Must Support Before App-Developer Readiness

A. Correct app machine-access workflow

The product needs a supported story for:

  • how an app controller gets machine access,
  • how credential custody is handled,
  • how access is granted and revoked,
  • how external app teams rely on it without ad hoc secret injection.

Minimum outcome: an app developer can understand and implement machine access without hidden local-environment steps.

Current status for the project-scoped Slurm path:

  • project-scoped access-credential lifecycle under the public API,
  • Vault-backed secret write on create and rotate,
  • secure delivery back to the app controller,
  • no plaintext secret reveal in the normal control-plane response.
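The custody property described above can be modeled in a few lines: create/rotate writes the secret to Vault-backed storage, and the normal control-plane response carries only a reference, never the plaintext. Everything below (CredentialService, the vault:// reference format, the delivery method) is an illustrative assumption, not the actual API surface.

```python
# Hedged model of credential custody. Class and path names are
# illustrative; the real path is the public access-credential API.

import secrets


class CredentialService:
    def __init__(self):
        self._vault = {}  # stand-in for Vault-backed secret storage

    def create_access_credential(self, project_id):
        plaintext = secrets.token_hex(16)
        ref = f"vault://projects/{project_id}/creds/bootstrap-ssh"
        self._vault[ref] = plaintext
        # Normal control-plane response: reference only, no secret material.
        return {"credential_ref": ref}

    def deliver_to_controller(self, ref):
        # Secure delivery path reserved for the app controller itself.
        return self._vault[ref]
```

The invariant worth testing in the real system is the same one shown here: no response on the normal control-plane path ever contains the plaintext secret.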

B. Explicit deploy-time targeting and placement

The product needs a clear answer for:

  • whether deploy chooses a target node or allocation,
  • whether the platform chooses it,
  • whether the app chooses it,
  • what fields are required in the contract.

Minimum outcome: Deploy and Add worker do not depend on inferred host reuse as the primary model.

Preferred product shape:

  • Deploy and Add worker choose from existing allocations in scope,
  • the app carries explicit placement_intent or allocation_intent,
  • the controller reconciles against that explicit target.

Current status: this now exists for the single-node reference path, initial worker seeding, and the worker-add flow.

C. Explicit single-node vs multi-node semantics

The product needs to say:

  • what single-node Slurm means,
  • what actions are valid in single-node mode,
  • what changes in multi-node mode.

Minimum outcome: the app UI and API contract do not pretend the same workflow covers all modes when it does not.

D. Operator-visible progress and escape paths

The product needs:

  • clear progress states,
  • clear error states,
  • retry/decommission/cleanup paths for non-terminal failures,
  • no dependence on manual database cleanup.

Minimum outcome: an operator can recover through supported UI/API actions.
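The escape-path requirement above can be stated as a checkable invariant: every non-terminal failure state maps to at least one supported recovery action, so no state leaves the operator dependent on manual database cleanup. The state and action names below are illustrative, loosely based on the states this document mentions (progressing, running, bootstrap-failed, decommissioned), not a defined product contract.

```python
# Hedged sketch of the escape-path invariant. State and action names are
# illustrative assumptions, not the product's actual state machine.

RECOVERY_ACTIONS = {
    "progressing":      [],                                   # wait
    "running":          ["stop", "decommission"],
    "bootstrap_failed": ["retry", "remove", "decommission"],  # failure state
    "degraded":         ["restart", "decommission"],          # failure state
    "decommissioned":   [],                                   # terminal
}

TERMINAL = {"decommissioned"}
HEALTHY = {"progressing", "running"}


def has_supported_escape(state):
    """Non-terminal failure states must offer a supported UI/API action."""
    if state in TERMINAL or state in HEALTHY:
        return True
    return bool(RECOVERY_ACTIONS.get(state))
```

A smoke suite could iterate every failure state the product defines and assert this invariant holds, which is exactly the "no manual database cleanup" bar set above.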

Gap Assessment

Ready enough today

  • generic member and member-operation platform primitives
  • app-owned controller deployment model
  • service-account identity for the app controller
  • project-scoped access-credential custody and delivery
  • first-class app-instance placement_intent
  • explicit allocation-based worker placement
  • public status reporting back into the platform
  • real-node bootstrap proof path
  • stable workload health classification after completed bootstrap

Not ready enough today

  • generalized multi-node placement UX and contract
  • tenant-scope or multi-project Slurm workflow
  • app-developer-facing machine access contract
  • polished operator workflow for all non-happy-paths
  • automated smoke coverage for platform-control Slurm day-2 operations

Readiness Judgment

If the question is "Can the current slice prove that the architecture works?", the answer is yes.

If the question is "Can we hand this to independent app developers as the clean supported product workflow?", the answer is not yet, though it is much closer for the single-node project-scoped example-app path.

Next Required Decisions

Before further productization, decide these explicitly:

  1. How does an app controller obtain machine access in the supported product model?
  2. What exact inputs must Deploy collect for single-node Slurm?
  3. What exact inputs must Add worker collect for multi-node Slurm?
  4. What parts of placement are platform-owned versus app-owned?
  5. What is the supported operator recovery path for stuck app and member operations?

Immediate Outcome

The next step should not be “add more incidental fixes.”

The next step should be:

  • use this document to define the correct product workflow,
  • close the machine-access and placement gaps,
  • then update the UI, API, and app-controller behavior to match that intended workflow.