# Slurm First Slice Local Validation
## Purpose
Define the first real local validation path for the Slurm adapter boundary using an already allocated development node.
This document is intentionally practical.
It does not claim full Slurm productization. It exists to prove that:

1. the new GPUaaS core member envelopes are usable
2. the Slurm app can run on top without hidden platform shortcuts
3. a real node can be driven through the first adapter flow locally
## Current Local Assumption
Local proving target:
- user: `dev-admin`
- allocated node: `compute-vm`
This should be treated as the first real-node validation target for the Slurm app-side slice.
## What This Validation Must Prove
### Platform side
- app instance can be created through public API
- member records can be listed and inspected through public API
- member-operation requests can be created through public API
- service-account-compatible authz path works
- audit and `correlation_id` evidence exists for app/member mutations
### Slurm app side
- adapter can interpret `component_key`
- adapter can execute controller or worker bring-up on a real node
- adapter can update member and operation state back into GPUaaS
- adapter can expose useful `adapter_detail`
- no hidden direct SQL or internal-only platform path is required
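As a sketch of what a useful `adapter_detail` payload could look like: the field names below are illustrative assumptions, not a platform contract; the only real requirement is that the adapter exposes enough structured runtime state to make failures understandable.

```python
from datetime import datetime, timezone
from typing import Optional


def build_adapter_detail(phase: str, last_action: str,
                         error: Optional[str] = None) -> dict:
    """Assemble a hypothetical adapter_detail payload for a Slurm member.

    All keys here are illustrative; the shape is whatever the Slurm app
    decides, as long as it is readable through the public member model.
    """
    return {
        "slurm_phase": phase,        # e.g. "controller_bootstrap", "worker_join"
        "last_action": last_action,  # last runtime step the adapter attempted
        "error": error,              # None when the last action succeeded
        "reported_at": datetime.now(timezone.utc).isoformat(),
    }


detail = build_adapter_detail("controller_bootstrap", "render_slurm_conf")
```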
## Initial Local Shape
For the first real-node proof, keep the runtime shape as small as possible.
Recommended first shape:
1. one app instance
2. one controller member or one worker member, depending on which path is easier to prove first
3. one real node target: compute-vm
Practical recommendation:

- prove controller bootstrap first on the real node
- then prove worker add/join using the same boundary model
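The minimal shape above can be written down as a tiny desired-state description. The names and structure here are illustrative, not a real platform schema:

```python
# Hypothetical desired-state description for the first real-node proof:
# one app instance, one controller member, one real node target.
first_slice = {
    "app_instance": "slurm-dev-1",  # instance name is made up for illustration
    "members": [
        # prove controller bootstrap first; a worker member comes later
        {"component_key": "controller", "node": "compute-vm"},
    ],
}
```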
## Recommended Validation Sequence
### Phase 1: contract sanity
- create app instance via public API
- confirm app-instance state is visible
- create or inspect member records via public API
- create one member operation via public API
- confirm operation status is visible and auditable
This proves the core envelope independently of Slurm runtime behavior.
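Phase 1 can be treated as an ordered list of public-API calls. The endpoint paths below are placeholders (the real GPUaaS routes are not specified in this document); the point is that every step goes through the public surface and nothing else:

```python
# Ordered contract-sanity steps, each expressed as (method, path).
# Paths are hypothetical placeholders, not the real GPUaaS API routes.
PHASE_1_CALLS = [
    ("POST", "/api/app-instances"),                   # create app instance
    ("GET",  "/api/app-instances/{app_id}"),          # confirm instance state is visible
    ("GET",  "/api/app-instances/{app_id}/members"),  # list/inspect member records
    ("POST", "/api/members/{member_id}/operations"),  # create one member operation
    ("GET",  "/api/operations/{op_id}"),              # confirm status is visible/auditable
]

# Sanity check: every step is a plain public HTTP call, no hidden path.
assert all(method in {"GET", "POST"} for method, _ in PHASE_1_CALLS)
```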
### Phase 2: adapter execution on compute-vm
- use a platform-ready access path to `compute-vm`
- stage Slurm artifacts/config through the app-owned logic
- execute first real Slurm bring-up step on the node
- report status back through member and operation read models
This proves the app/adapter boundary on a real machine.
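A dry-run sketch of how the adapter might drive one bring-up step over a platform-provided access path. The `ssh` transport and the specific Slurm commands are assumptions for illustration; the real access mechanism is whatever the platform substrate provides:

```python
from typing import List


def bringup_command(node: str, step: str) -> List[str]:
    """Build (but do not execute) the remote command for one bring-up step.

    Using ssh is an assumption; the platform's access substrate may use a
    different transport. Keeping this a pure builder means each step can be
    logged against the operation's correlation_id before it runs.
    """
    steps = {
        # Commands are illustrative only, not a validated install recipe.
        "install_controller": "sudo apt-get install -y slurmctld",
        "start_controller": "sudo systemctl start slurmctld",
    }
    return ["ssh", node, steps[step]]


cmd = bringup_command("compute-vm", "start_controller")
```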
### Phase 3: debugability check
For every failure, the operator should be able to answer:
1. which app instance is affected
2. which member is affected
3. which operation is in progress or failed
4. what `correlation_id` to use for log/trace lookup
5. what `adapter_detail` says about the runtime state
If this cannot be answered without DB inspection or hidden internals, the boundary is still incomplete.
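The five questions above can be captured as one structured record that an operator should be able to assemble from public API state alone. The field names are illustrative:

```python
from dataclasses import dataclass


@dataclass
class FailureContext:
    """Everything an operator needs to debug a failure without DB access."""
    app_instance_id: str  # which app instance is affected
    member_id: str        # which member is affected
    operation_id: str     # which operation is in progress or failed
    correlation_id: str   # key for log/trace lookup
    adapter_detail: dict  # adapter-reported runtime state

    def is_complete(self) -> bool:
        # If any field cannot be filled from API state alone,
        # the boundary is still incomplete.
        return all([self.app_instance_id, self.member_id,
                    self.operation_id, self.correlation_id]) \
            and bool(self.adapter_detail)


ctx = FailureContext("app-1", "member-1", "op-1", "corr-123",
                     {"slurm_phase": "controller_bootstrap"})
```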
## What The Platform Should Provide During Local Validation
GPUaaS is responsible for:

1. the app-instance and member-operation public APIs
2. service-account and admin auth flows
3. node/bootstrap/access substrate
4. audit and correlation behavior
5. reusable node/allocation/artifact/secret primitives
GPUaaS is not responsible for:

1. Slurm-specific install steps
2. Slurm config rendering
3. Slurm runtime health rules
4. Slurm join/drain/remove semantics
## What The Slurm App Must Provide During Local Validation
The Slurm app is responsible for:
1. `component_key` meaning
2. controller/worker runtime actions
3. runtime reconciliation
4. mapping runtime progress back into generic member/operation states
5. `adapter_detail` fields that make failures understandable
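Responsibility 4, mapping runtime progress into generic member/operation states, can be sketched as a small pure function. Both the Slurm-side phase names and the generic state names here are assumptions, not the platform's actual vocabulary:

```python
# Hypothetical mapping from Slurm-adapter runtime phases to generic
# member states; both vocabularies are illustrative assumptions.
PHASE_TO_MEMBER_STATE = {
    "controller_bootstrap": "provisioning",
    "worker_join":          "provisioning",
    "healthy":              "ready",
    "drain_requested":      "degrading",
    "removed":              "deleted",
}


def to_member_state(slurm_phase: str) -> str:
    # Unknown phases map to a generic error state instead of raising,
    # so the platform always sees a valid member state.
    return PHASE_TO_MEMBER_STATE.get(slurm_phase, "error")
```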
## Minimal Success Criteria
The first local validation is successful only if all are true:
- `compute-vm` can be used without adding new Slurm-specific core platform APIs
- member-operation requests are sufficient to drive the first adapter action
- the adapter can run its first real action on the node and report back through public read models
- failures are diagnosable through API state, audit, and correlation-first evidence
- the exercise reveals only reusable platform gaps, not accidental hidden coupling
## What To Capture As Gap Evidence
If something fails or feels awkward, record it as one of:
- missing reusable platform primitive
- missing public read-model surface
- missing service-account/IAM capability
- Slurm-specific app implementation problem
- local environment/setup problem rather than a platform gap
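Keeping the category list fixed makes gap findings comparable across validation runs. A minimal sketch of a gap record, with category names taken from the list above:

```python
from enum import Enum


class GapCategory(Enum):
    # Category values mirror the gap-evidence list in this document.
    MISSING_PLATFORM_PRIMITIVE = "missing reusable platform primitive"
    MISSING_READ_MODEL = "missing public read-model surface"
    MISSING_IAM_CAPABILITY = "missing service-account/IAM capability"
    SLURM_APP_PROBLEM = "Slurm-specific app implementation problem"
    LOCAL_ENV_PROBLEM = "local environment/setup problem rather than a platform gap"


def record_gap(category: GapCategory, note: str) -> dict:
    # A gap record is deliberately tiny: a fixed category plus free-form evidence.
    return {"category": category.value, "note": note}


gap = record_gap(GapCategory.MISSING_READ_MODEL,
                 "operation history not listable through the public API")
```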
## Immediate Outcome
The current conclusion is:
- compute-vm should be the first real-node proof target
- the next valuable work is to run the Slurm adapter against that node through the new member envelope
- any further platform growth should be driven by what this local validation actually reveals