# Slurm First Slice Adapter Contract v1

## Purpose
Define the app-owned contract for the first Slurm slice on top of the GPUaaS core member read and member-operation envelopes.
This document is intentionally adapter-specific.
It answers three questions:

1. How does Slurm use the new generic platform surfaces?
2. Which Slurm-specific meanings stay outside GPUaaS core?
3. What should the first app-side implementation support before broader Slurm work continues?
Reading order:
1. App_Platform_Core_For_First_Slurm_Slice_v1.md
2. Slurm_First_Slice_Platform_App_Split_v1.md
3. Slurm_First_Slice_Adapter_Contract_v1.md
Use this with:
- doc/api/openapi.draft.yaml
- doc/architecture/Slurm_App_Runtime_Adapter_v1.md
- doc/architecture/Node_Operations_and_Agent_Lifecycle_v1.md
## Decision Summary
- The first Slurm slice uses the generic app-instance member APIs exactly as platform envelopes, not as runtime semantics.
- Slurm owns all meaning behind:
  - `component_key`
  - member status interpretation
  - add/drain/remove/replace execution
  - adapter detail payloads
- The first supported Slurm adapter shape is intentionally narrow:
  - one controller
  - zero or more workers
  - no HA controller
  - no shared multi-tenant Slurm service
- The first goal is to prove:
  - controller bootstrap
  - worker join
  - worker drain/remove
  - correlation-first status/debugging
- The first slice should not attempt a generalized scheduler framework.
## Slurm Component Keys

For the first slice, the Slurm adapter uses two adapter-owned `component_key` values:

- `controller`
- `worker`

Important rule: these are Slurm adapter values, not platform enums.
GPUaaS core should treat them as opaque strings.
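As a concrete illustration, a member record might carry the adapter-owned key as an opaque string. This is a minimal sketch; the field names are assumptions, not the platform schema:

```python
# Hypothetical member record; field names are illustrative only,
# not the actual platform schema.
member = {
    "member_id": "m-123",
    "component_key": "worker",  # adapter-owned, opaque to GPUaaS core
    "status": "requested",
}

# GPUaaS core would validate only that component_key is a non-empty
# string; it never interprets the value.
def platform_accepts_component_key(value):
    return isinstance(value, str) and len(value) > 0
```

Under this posture, adding a third adapter-owned key later requires no platform change at all.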
## Supported Member Operations In Slice One
The generic platform operation envelope supports:
- add
- drain
- remove
- replace
The first Slurm slice should use them as follows.
### add
Supported for:
- worker
Meaning in Slurm:

- request capacity for one or more worker members
- bootstrap those workers
- render required Slurm config
- join them to the controller
- report success when the worker is registered and reaches the adapter’s ready condition
Not used in slice one for:
- controller
Reason: the first slice assumes the single controller is created by the app-instance deploy path, not by later member scale-out.
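A worker `add` operation envelope might look like the following sketch. All field names here are assumptions for illustration, not the OpenAPI contract:

```python
# Hypothetical generic member-operation envelope for a worker "add".
add_operation = {
    "operation": "add",
    "component_key": "worker",
    "count": 2,                   # request capacity for two worker members
    "correlation_id": "op-7f3a",  # threads status and debugging end to end
}

# Slice one: "add" is worker-only; the controller comes from the
# app-instance deploy path instead.
SUPPORTED_ADD_KEYS = {"worker"}

def add_supported(op):
    return op["operation"] == "add" and op["component_key"] in SUPPORTED_ADD_KEYS
```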
### drain
Supported for:
- worker
Meaning in Slurm:

- mark the worker as draining in Slurm
- wait until the adapter’s safe-drain condition is met
- update generic member status and adapter detail as the drain progresses
Not supported in slice one for:
- controller
Reason: controller drain semantics are not part of the first slice.
### remove
Supported for:
- worker
Meaning in Slurm:

- require the target worker to already be safely drained, or ensure the adapter drains it first according to its own policy
- remove the worker from Slurm membership
- continue any node lifecycle cleanup required by the platform composition rule
Not supported in slice one for:
- controller
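The drain-before-remove precondition can be sketched as a small gate. The `safe_to_remove` flag and the callables are hypothetical names, not part of any contract:

```python
# Slice-one remove gate, sketched: a worker must already be safely
# drained, or the adapter drains it first per its own policy.
def remove_worker(member, drain_fn, remove_fn):
    if member["component_key"] != "worker":
        raise ValueError("slice one: remove is worker-only")
    if not member.get("safe_to_remove", False):
        drain_fn(member)               # adapter-owned safe-drain policy
        member["safe_to_remove"] = True
    remove_fn(member)                  # remove from Slurm membership
```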
### replace
Supported for:
- worker
Meaning in Slurm: equivalent to the following sequence:

- provision replacement worker
- join replacement
- drain/remove failed or retiring worker
Implementation note:
- the adapter may internally realize replace as a composed workflow rather than as one direct node action
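A composed realization of `replace` might look like this sketch; the adapter method names are hypothetical:

```python
# Replace realized as a composed workflow rather than one direct node
# action, per the implementation note above.
def replace_worker(old_member, adapter):
    new_member = adapter.provision_worker()  # provision replacement worker
    adapter.join(new_member)                 # join replacement to controller
    adapter.drain(old_member)                # drain failed or retiring worker
    adapter.remove(old_member)               # remove it from membership
    return new_member
```

Provision-before-drain keeps capacity stable during the swap; an adapter could equally choose drain-first under scarcity, since the composition is adapter-owned.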
## App-Instance Deploy Path

The first Slurm app-instance deploy path should create:

- one `controller` member
- zero or more `worker` members, depending on adapter-owned initial config

Recommended first-step posture:

- deploy the controller first
- reach the controller healthy baseline
- then reconcile any initial worker set
The platform app-instance status should remain generic:
- requested
- deploying
- running
- failed
Slurm-specific rollout detail belongs in member records and adapter_detail.
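The recommended posture above can be sketched as an ordered sequence; the adapter method names are hypothetical:

```python
# Deploy posture sketch: controller first, healthy baseline, then the
# initial worker set.
def deploy_app_instance(adapter, initial_worker_count):
    adapter.deploy_controller()
    adapter.wait_controller_healthy()   # reach controller healthy baseline
    for _ in range(initial_worker_count):
        adapter.reconcile_worker()      # reconcile the initial worker set
```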
## Generic Member Status Mapping
The platform member status values are generic. The Slurm adapter interprets them as follows:
### requested

- member intent exists but no meaningful adapter execution has started yet

### reconciling

- adapter is actively bootstrapping, joining, draining, or reconciling the member

### ready

- member satisfies the Slurm adapter’s ready condition for its component key

### draining

- adapter has placed the member into a drain path and is waiting for safe completion

### deleting

- adapter is actively removing the member from runtime and infrastructure state

### failed

- adapter reached a terminal failure for the current member lifecycle attempt

### deleted

- member has been fully removed from the active runtime set
Important rule: the human/runtime meaning behind readiness or drain safety remains Slurm-owned.
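One way the adapter might own that mapping is a small translation table. The Slurm state names and the mapping policy here are assumptions for illustration, not part of any contract:

```python
# Illustrative adapter-owned mapping from an observed Slurm worker state
# to the generic platform member status.
GENERIC_FROM_SLURMD = {
    "idle": "ready",
    "allocated": "ready",
    "booting": "reconciling",
    "down": "failed",
}

def worker_member_status(slurmd_state, drain_requested):
    if drain_requested:
        return "draining"  # drain path takes precedence over runtime state
    # Unknown states default to reconciling rather than failed.
    return GENERIC_FROM_SLURMD.get(slurmd_state, "reconciling")
```

Because the table lives in the adapter, renaming or adding a Slurm state never touches the platform enum.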
## `adapter_detail` Contract For Slice One
The first slice should expose enough adapter detail for operator debugging without leaking Slurm semantics into generic top-level fields.
Recommended initial fields:
### Common fields

- `role`
- `phase`
- `last_adapter_error`
- `last_runtime_observed_at`

### Controller fields

- `controller_hostname`
- `slurmctld_state`
- `config_version`
- `cluster_name`

### Worker fields

- `worker_hostname`
- `slurmd_state`
- `drain_reason`
- `registered_in_controller`
- `config_version`
Rule:

- these are adapter-owned fields and may evolve with the Slurm adapter
- they should not be promoted into the generic platform contract unless multiple runtimes prove the same need
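For example, a draining worker's detail payload might look like this sketch. Every field and value is adapter-owned and illustrative:

```python
# Example adapter_detail payload for a draining worker; values are
# made up for illustration and may evolve with the adapter.
worker_detail = {
    "role": "worker",
    "phase": "draining",
    "last_adapter_error": None,
    "last_runtime_observed_at": "2024-01-01T00:00:00Z",
    "worker_hostname": "gpu-worker-01",
    "slurmd_state": "drained",
    "drain_reason": "planned-replacement",
    "registered_in_controller": True,
    "config_version": "v3",
}

# The common fields every component key is expected to carry.
COMMON_FIELDS = {"role", "phase", "last_adapter_error", "last_runtime_observed_at"}
```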
## Node/Allocation Composition
The Slurm adapter should use platform primitives in this order:
- request or reconcile required capacity through app-owned logic using platform allocation intent
- use platform node/bootstrap primitives to prepare the target host
- use platform artifact and secret delivery primitives
- execute Slurm-specific host configuration and service bring-up
- update member and operation state back through the platform surfaces
Boundary rule:

- no hidden direct platform DB writes from the Slurm app
- no bypass around the platform audit/correlation model
- no assumption that the platform understands Slurm semantics
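The composition order above can be sketched as one reconcile pass; every step goes through a platform surface or the adapter itself, never a hidden side channel. All method names are hypothetical:

```python
# Ordered member reconcile sketch following the composition rule above.
def reconcile_member(platform, adapter, member):
    platform.ensure_allocation(member)     # 1. allocation intent
    platform.bootstrap_node(member)        # 2. node/bootstrap primitives
    platform.deliver_artifacts(member)     # 3. artifact and secret delivery
    adapter.bring_up(member)               # 4. Slurm config and service start
    platform.update_member_state(member)   # 5. report back via platform surfaces
```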
## Service Account Usage
The Slurm adapter should run under a project-scoped or tenant-scoped service account aligned with the app instance scope.
The adapter should use that identity for:

- reading app instance and member state
- creating or updating member operations if needed by its own workflows
- consuming platform primitives
- reporting lifecycle progress through public contracts
The adapter should not require:

- platform superadmin privileges
- direct human tokens
- undocumented internal-only endpoints
## First Slice Acceptance Criteria
The first Slurm adapter slice is acceptable only if all are true:
- `controller` and `worker` remain adapter-owned `component_key` values, not platform enums.
- The Slurm app implements worker `add`, `drain`, `remove`, and `replace` without adding Slurm-specific fields to the generic platform status model.
- Member and operation status are sufficient for an operator to diagnose progress with `correlation_id` and `adapter_detail`.
- The adapter uses platform node/allocation/artifact/secret primitives rather than hidden side channels.
- A future non-Slurm app could reuse the same platform member envelopes without inheriting Slurm behavior.
## Explicit Non-Goals
Do not include in the first Slurm slice:
- multi-controller HA Slurm
- generic scheduler abstraction
- shared tenant-wide or platform-wide Slurm service
- controller drain/remove semantics
- Slurm-specific event taxonomy in core contracts
- Slurm-specific top-level API resources in GPUaaS core
## Immediate Outcome

The current conclusion is:

- the GPUaaS core now provides the minimal envelope
- the Slurm app can define its first concrete meanings on top
- the next app-side work is implementation of controller and worker reconciliation, not more core API design