Slurm First Slice Local Validation

Purpose

Define the first real local validation path for the Slurm adapter boundary using an already allocated development node.

This document is intentionally practical.

It does not claim full Slurm productization. It exists to prove that:

  1. the new GPUaaS core member envelopes are usable
  2. the Slurm app can run on top without hidden platform shortcuts
  3. a real node can be driven through the first adapter flow locally

Current Local Assumption

Local proving target:

  - user: dev-admin
  - allocated node: compute-vm

This should be treated as the first real-node validation target for the Slurm app-side slice.

What This Validation Must Prove

Platform side

  1. app instance can be created through public API
  2. member records can be listed and inspected through public API
  3. member-operation requests can be created through public API
  4. service-account-compatible authz path works
  5. audit and correlation_id evidence exists for app/member mutations

Slurm app side

  1. adapter can interpret component_key
  2. adapter can execute controller or worker bring-up on a real node
  3. adapter can update member and operation state back into GPUaaS
  4. adapter can expose useful adapter_detail
  5. no hidden direct SQL or internal-only platform path is required
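The first app-side requirement, interpreting component_key, can be sketched as a pure dispatch from key to an ordered bring-up plan. The key values and step names below are illustrative assumptions, not the real Slurm app vocabulary:

```python
# Minimal sketch of adapter-side component_key dispatch.
# The component_key values and step names here are illustrative
# assumptions; the real vocabulary belongs to the Slurm app.

def plan_bring_up(component_key: str) -> list[str]:
    """Map a member's component_key to an ordered list of bring-up steps."""
    plans = {
        "controller": ["stage_config", "start_slurmctld", "verify_controller"],
        "worker": ["stage_config", "start_slurmd", "join_cluster"],
    }
    if component_key not in plans:
        # Unknown keys must fail loudly rather than guess a plan.
        raise ValueError(f"unknown component_key: {component_key}")
    return plans[component_key]
```

Keeping this mapping pure and app-owned is what keeps Slurm semantics out of the core platform.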

Initial Local Shape

For the first real-node proof, keep the runtime shape as small as possible.

Recommended first shape:

  1. one app instance
  2. one controller member or one worker member, depending on which path is easier to prove first
  3. one real node target: compute-vm

Practical recommendation:

  - prove controller bootstrap first on the real node
  - then prove worker add/join using the same boundary model

Phase 1: contract sanity

  1. create app instance via public API
  2. confirm app-instance state is visible
  3. create or inspect member records via public API
  4. create one member operation via public API
  5. confirm operation status is visible and auditable

This proves the core envelope independently of Slurm runtime behavior.
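The five phase-1 steps can be sketched against an in-memory stand-in for the public API. All endpoint names and record fields below (app_instance_id, correlation_id, op_type, and so on) are assumptions for illustration; the real surface is defined by GPUaaS:

```python
# In-memory stand-in for the public API, used to sketch the phase-1
# contract checks. Endpoint and field names are illustrative assumptions.
import uuid


class FakeGpuaasApi:
    def __init__(self):
        self.app_instances = {}
        self.members = {}
        self.operations = {}
        self.audit_log = []

    def _audit(self, action, subject_id):
        # Every mutation leaves an audit row carrying a correlation_id.
        self.audit_log.append({"action": action, "subject_id": subject_id,
                               "correlation_id": str(uuid.uuid4())})

    def create_app_instance(self, app_type):
        rec = {"id": str(uuid.uuid4()), "app_type": app_type, "state": "created"}
        self.app_instances[rec["id"]] = rec
        self._audit("app_instance.create", rec["id"])
        return rec

    def create_member(self, app_instance_id, component_key):
        rec = {"id": str(uuid.uuid4()), "app_instance_id": app_instance_id,
               "component_key": component_key, "state": "pending"}
        self.members[rec["id"]] = rec
        self._audit("member.create", rec["id"])
        return rec

    def create_member_operation(self, member_id, op_type):
        rec = {"id": str(uuid.uuid4()), "member_id": member_id,
               "op_type": op_type, "status": "requested"}
        self.operations[rec["id"]] = rec
        self._audit("member_operation.create", rec["id"])
        return rec


def phase1_contract_sanity(api):
    """Run the five phase-1 steps and return the records for inspection."""
    app = api.create_app_instance("slurm")                      # step 1
    assert api.app_instances[app["id"]]["state"] == "created"   # step 2
    member = api.create_member(app["id"], "controller")         # step 3
    op = api.create_member_operation(member["id"], "bring_up")  # step 4
    # step 5: operation status is visible and every mutation is auditable
    assert api.operations[op["id"]]["status"] == "requested"
    assert all("correlation_id" in e for e in api.audit_log)
    return app, member, op
```

The same sequence run against the real public API, with a service-account token, is the actual phase-1 check.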

Phase 2: adapter execution on compute-vm

  1. use platform-ready access path to compute-vm
  2. stage Slurm artifacts/config through the app-owned logic
  3. execute first real Slurm bring-up step on the node
  4. report status back through member and operation read models

This proves the app/adapter boundary on a real machine.
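The phase-2 loop can be sketched as: for each bring-up step, execute it on the node through an injected access path and report progress back through the read models. The step names and status values are assumptions; injecting the node runner as a callable is one way to make explicit that the adapter uses only the platform-provided access path, never a hidden shortcut:

```python
# Sketch of the phase-2 adapter loop: stage, execute, report.
# Step names and status values are illustrative assumptions.

def execute_bring_up(node, steps, run_on_node, report_status):
    """Run each bring-up step on the node, reporting progress back
    through the member/operation read models after every step."""
    for step in steps:
        report_status(node, step, "running")
        ok = run_on_node(node, step)  # platform-provided access path only
        report_status(node, step, "succeeded" if ok else "failed")
        if not ok:
            # Stop at the first failure; the read models already show
            # exactly which step failed.
            return "failed"
    return "succeeded"
```

For the first local run, `node` would be compute-vm and `steps` the controller bring-up plan.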

Phase 3: debuggability check

For every failure, the operator should be able to answer:

  1. which app instance is affected
  2. which member is affected
  3. which operation is in progress or failed
  4. what correlation_id to use for log/trace lookup
  5. what adapter_detail says about the runtime state

If this cannot be answered without DB inspection or hidden internals, the boundary is still incomplete.
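The five-question triage can be expressed as a function over public read-model records alone; any answer it cannot produce is, by definition, a boundary gap. The record shape below is an assumption for illustration:

```python
# Sketch of the five-question failure triage over public read-model
# data only. The record field names are illustrative assumptions.

def triage(operation: dict, member: dict) -> dict:
    """Answer the five operator questions from API-visible state alone."""
    answers = {
        "app_instance": member.get("app_instance_id"),
        "member": member.get("id"),
        "operation": f'{operation.get("id")} ({operation.get("status")})',
        "correlation_id": operation.get("correlation_id"),
        "adapter_detail": operation.get("adapter_detail"),
    }
    missing = [k for k, v in answers.items() if v is None]
    if missing:
        # Any unanswerable question means the boundary is incomplete.
        raise RuntimeError(f"boundary gap, cannot answer: {missing}")
    return answers
```

Running this mentally against every observed failure is the whole of the phase-3 check.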

What The Platform Should Provide During Local Validation

GPUaaS is responsible for:

  1. the app-instance and member-operation public APIs
  2. service-account and admin auth flows
  3. node/bootstrap/access substrate
  4. audit and correlation behavior
  5. reusable node/allocation/artifact/secret primitives

GPUaaS is not responsible for:

  1. Slurm-specific install steps
  2. Slurm config rendering
  3. Slurm runtime health rules
  4. Slurm join/drain/remove semantics

What The Slurm App Must Provide During Local Validation

The Slurm app is responsible for:

  1. component_key meaning
  2. controller/worker runtime actions
  3. runtime reconciliation
  4. mapping runtime progress back into generic member/operation states
  5. adapter_detail fields that make failures understandable
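Responsibility 4, mapping runtime progress into the generic member states, can be sketched as a translation table owned entirely by the app. Both the runtime phases and the generic state names below are assumptions; the real vocabularies belong to the app and the platform respectively:

```python
# Sketch of translating app-internal runtime phases into the generic
# member state vocabulary. Both vocabularies shown are illustrative
# assumptions.

RUNTIME_TO_MEMBER_STATE = {
    "config_staged":  "provisioning",
    "daemon_started": "provisioning",
    "registered":     "ready",
    "daemon_crashed": "error",
}


def member_state_for(runtime_phase: str) -> str:
    """Translate an app-internal runtime phase into a generic member state."""
    # Unknown phases surface as an error state (with the raw phase left
    # for adapter_detail), never silently as a healthy state.
    return RUNTIME_TO_MEMBER_STATE.get(runtime_phase, "error")
```

The platform never sees the left-hand column; only the generic states cross the boundary, with the raw phase preserved in adapter_detail.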

Minimal Success Criteria

The first local validation is successful only if all are true:

  1. compute-vm can be used without adding new Slurm-specific core platform APIs
  2. member operation requests are sufficient to drive the first adapter action
  3. the adapter can run its first real action on the node and report back through public read models
  4. failures are diagnosable through API state, audit, and correlation-first evidence
  5. the exercise reveals only reusable platform gaps, not accidental hidden coupling

What To Capture As Gap Evidence

If something fails or feels awkward, record it as one of:

  1. missing reusable platform primitive
  2. missing public read-model surface
  3. missing service-account/IAM capability
  4. Slurm-specific app implementation problem
  5. local environment/setup problem rather than a platform gap

Immediate Outcome

The current conclusion is:

  - compute-vm should be the first real-node proof target
  - the next valuable work is to run the Slurm adapter against that node through the new member envelope
  - any further platform growth should be driven by what this local validation actually reveals