Slurm App Runtime Adapter v1¶
Purpose¶
Define the first concrete runtime_backend=slurm path under the App Platform model so Slurm support lands as an adapter behind existing app-instance contracts rather than as a special-case control-plane feature.
Use this together with:
- Scheduler_as_Platform_App_v1.md
- App_Control_Plane_v1.md
- App_Runtime_Operating_Modes_v1.md
- Clustered_App_Model_v1.md
- App_Platform_Core_For_First_Slurm_Slice_v1.md
- Slurm_First_Slice_Platform_App_Split_v1.md
- Slurm_First_Slice_Adapter_Contract_v1.md
- Node_Operations_and_Agent_Lifecycle_v1.md
- ../operations/Slurm_First_Slice_Local_Validation.md
1. Decision Summary¶
- Slurm support is delivered as an App Platform runtime adapter, not as a new core allocation API.
- The control plane continues to own:
- app catalog
- app versions
- entitlements
- app instances
- audit
- events
- policy and service-account boundaries
- Slurm-specific behavior lives behind the adapter boundary:
- config rendering
- controller/worker bootstrap
- node join/drain/remove translation
- status mapping back into app-instance lifecycle
- implementation of generic app-instance member-operation envelopes
- The first production path is tenant_dedicated.
- The first execution target should stay narrow: one real reference adapter, not a generic scheduler framework.
2. What Remains Generic App Platform¶
The following stay runtime-neutral and must not grow Slurm-specific handler branches:
- app catalog and version publication
- project entitlement enable/disable
- app artifact registration and trust/promotion
- app instance CRUD and lifecycle states
- service account issuance and project scoping
- policy overlays for allowed regions/SKUs/artifact sources
- canonical error envelope and audit contract
- apps.instance.* lifecycle events
Rule: if Slurm requires changes in generic app-instance handlers beyond adapter selection and lifecycle orchestration, treat that as a platform design defect.
3. What the Slurm Adapter Owns¶
The Slurm adapter is responsible for translating one app instance into concrete runtime actions.
3.1 Control-plane responsibilities inside the adapter¶
- resolve the Slurm app version manifest
- resolve required controller/worker artifacts and templates
- mint task-bound secrets/credentials
- choose the target scope:
  - operating_mode=tenant_dedicated
  - control_plane_scope=project|tenant
  - runtime_backend=slurm
- create adapter-owned runtime state records for controller and workers
- map adapter progress/failure into app-instance lifecycle
3.2 Host/data-plane responsibilities inside the adapter¶
Via typed node operations or typed node-agent tasks:
- install slurmd / slurmctld dependencies
- install/configure munge
- render slurm.conf and supporting files
- stage trusted artifacts from platform-controlled sources
- start/enable systemd services
- verify Slurm node/controller readiness
- translate node lifecycle events into Slurm drain/remove behavior
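The host/data-plane steps above can be sketched as an ordered plan of typed operations. This is a minimal illustration only; the names (NodeOp, plan_worker_bootstrap) and operation kinds are assumptions, not an existing platform API.

```python
# Hypothetical sketch: composing the worker bootstrap steps as typed node
# operations rather than raw shell. All names here are illustrative.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NodeOp:
    kind: str                       # typed operation, never a raw shell command
    params: dict = field(default_factory=dict)


def plan_worker_bootstrap(instance_id: str, config_version: str) -> list[NodeOp]:
    """Return the ordered typed operations that bootstrap one Slurm worker."""
    return [
        NodeOp("artifact.stage", {"artifact": "slurmd", "instance": instance_id}),
        NodeOp("artifact.stage", {"artifact": "munge", "instance": instance_id}),
        NodeOp("config.render", {"template": "slurm.conf", "version": config_version}),
        NodeOp("service.enable", {"units": ["munge", "slurmd"]}),
        NodeOp("health.verify", {"check": "slurmd_registered"}),
    ]


plan = plan_worker_bootstrap("app-inst-123", "cfg-v1")
```

Keeping the plan as data rather than imperative calls makes each step auditable and replayable by the node agent.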
4. Initial Runtime Topology¶
The Slurm adapter is the first proof of the generic clustered-app model, not a one-off exception to it.
4.1 Initial supported mode¶
The first supported operating shape should be:
- operating_mode = tenant_dedicated
- runtime_backend = slurm
- control_plane_scope = project or tenant
This keeps isolation, rollback, and support boundaries manageable.
4.2 Initial topology assumption¶
For v1, assume:
- one Slurm control-plane deployment per app instance scope
- one or more worker nodes joining that Slurm control plane
- no HA controller requirement yet
- no shared platform-managed Slurm service yet
This is enough to prove:
- adapter boundary correctness
- node operation composition
- lifecycle/event mapping
- auditability and operator triage
5. Contract Shape for the First Slurm App Version¶
The app version manifest for Slurm should be explicit about the pieces the adapter needs.
Minimum conceptual inputs:
- controller artifact reference(s)
- worker artifact reference(s)
- config template set/version
- munge key delivery model
- required service units
- health checks
- expected ports
These belong in the app version manifest and trusted artifact model, not in ad hoc node shell inputs.
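The minimum conceptual inputs above could be captured as a typed manifest record. The field names and values below are a sketch only, not a published schema.

```python
# Illustrative-only shape for the Slurm app version manifest inputs.
# Field names and example values are assumptions, not a real schema.
from dataclasses import dataclass


@dataclass(frozen=True)
class SlurmAppVersionManifest:
    controller_artifacts: tuple[str, ...]
    worker_artifacts: tuple[str, ...]
    config_template_version: str
    munge_key_delivery: str          # e.g. "vault"; never inline key material
    service_units: tuple[str, ...]
    health_checks: tuple[str, ...]
    expected_ports: tuple[int, ...]


manifest = SlurmAppVersionManifest(
    controller_artifacts=("slurmctld:24.05",),
    worker_artifacts=("slurmd:24.05",),
    config_template_version="slurm-conf-v1",
    munge_key_delivery="vault",
    service_units=("munge", "slurmctld"),
    health_checks=("sinfo_responds",),
    expected_ports=(6817, 6818),
)
```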
6. Runtime State Model¶
The app instance remains the user/operator-facing resource.
The Slurm adapter may keep adapter-specific runtime state such as:
- controller hostname/resource
- worker membership set
- current rendered config version
- last observed Slurm cluster health
- last sync/drain/remove result
But this state must remain adapter-owned and surfaced through app-instance status/detail views rather than leaking directly into generic app contracts.
7. Lifecycle Mapping¶
7.1 App instance lifecycle¶
Generic lifecycle remains:
- requested
- deploying
- running
- failed
- decommissioning
- decommissioned
7.2 Slurm adapter interpretation¶
Suggested mapping:
- requested
- app instance accepted; adapter orchestration not started
- deploying
- controller/worker operations in progress
- running
- controller healthy and required worker baseline satisfied
- failed
- adapter hit unrecoverable error after bounded retries
- decommissioning
- Slurm runtime teardown in progress
- decommissioned
- controller and worker teardown complete
Do not introduce Slurm-specific generic app-instance statuses like controller_bootstrapping or draining_worker.
Those belong in adapter detail state, not in the canonical app lifecycle enum.
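The rule above can be expressed as a one-way collapse from adapter-internal phases onto the canonical enum. The phase names on the left are illustrative adapter detail states, not part of any contract.

```python
# Hypothetical sketch: adapter-internal phases (keys) collapse onto the
# canonical app-instance lifecycle enum (values). Phase names are illustrative.
CANONICAL = {
    "requested", "deploying", "running",
    "failed", "decommissioning", "decommissioned",
}

_PHASE_TO_LIFECYCLE = {
    "accepted": "requested",
    "controller_bootstrapping": "deploying",  # detail state, never a new enum value
    "worker_joining": "deploying",
    "healthy": "running",
    "draining_worker": "running",             # membership churn is still 'running'
    "unrecoverable": "failed",
    "teardown": "decommissioning",
    "torn_down": "decommissioned",
}


def lifecycle_for(phase: str) -> str:
    """Map an adapter detail phase to a canonical lifecycle status."""
    status = _PHASE_TO_LIFECYCLE[phase]
    assert status in CANONICAL
    return status
```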
8. Node Lifecycle Integration¶
Slurm is where the earlier node runtime work becomes useful.
The adapter should build on:
- trusted artifact delivery
- runtime secret delivery
- generic app-instance member-operation envelopes
- typed host operation primitives
- node drain/remove lifecycle
Expected interactions:
- onboarding a Slurm worker node:
  - stage artifacts
  - render config
  - join/register worker
- retiring/removing a Slurm worker node:
  - drain via Slurm-aware operation
  - wait for scheduler-safe state
  - then continue node retire/remove lifecycle
This means Slurm should consume the node lifecycle primitives, not redefine them.
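The retire ordering above (Slurm-aware drain, then the generic node lifecycle) can be sketched as follows. The drain/node_state/retire callables are stand-ins for the real typed operations, which this document does not define.

```python
# Sketch of the worker-retire ordering: drain first, wait for a
# scheduler-safe state, only then continue the node lifecycle.
# The injected callables are hypothetical stand-ins for typed operations.
def retire_worker(node: str, drain, node_state, retire) -> list[str]:
    steps = []
    drain(node)                            # Slurm-aware drain operation
    steps.append("drain_requested")
    while node_state(node) != "drained":   # wait for scheduler-safe state
        steps.append("waiting")
    retire(node)                           # then continue node retire/remove
    steps.append("node_retired")
    return steps


# Usage with fake operations: the node reports "draining" once, then "drained".
states = iter(["draining", "drained"])
log = retire_worker(
    "w1",
    drain=lambda n: None,
    node_state=lambda n: next(states),
    retire=lambda n: None,
)
```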
9. Service Account and Trust Model¶
The Slurm adapter should run under a project-scoped or tenant-scoped service account owned through the existing App Platform model.
Requirements:
- no static long-lived infrastructure credentials in app-instance config
- no raw SSH orchestration from the adapter
- all secrets delivered through the existing control-plane/Vault-backed mechanism
- all artifact pulls and config rendering remain policy-governed
10. Observability and Events¶
The adapter must reuse the generic app-instance event surface:
- apps.instance.requested
- apps.instance.running
- apps.instance.failed
- apps.instance.deleting
- apps.instance.deleted
Additionally, the adapter should expose structured runtime details sufficient for:
- controller health
- worker membership and drain state
- last adapter error
- correlation-first triage
If the generic event surface proves insufficient, add new app-runtime events intentionally rather than emitting Slurm-only one-offs ad hoc.
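Reusing the generic surface amounts to mapping canonical lifecycle transitions onto the existing event names and emitting nothing Slurm-specific. A minimal sketch, with the publish callable as an assumed stand-in for the real event bus:

```python
# Sketch: canonical lifecycle statuses reuse the generic apps.instance.*
# event names from this document; nothing Slurm-specific is emitted.
_LIFECYCLE_EVENT = {
    "requested": "apps.instance.requested",
    "running": "apps.instance.running",
    "failed": "apps.instance.failed",
    "decommissioning": "apps.instance.deleting",
    "decommissioned": "apps.instance.deleted",
}


def emit_for(status: str, publish):
    event = _LIFECYCLE_EVENT.get(status)
    if event is None:
        return None          # e.g. 'deploying' has no event in the list above
    publish(event)
    return event


captured = []
emit_for("running", captured.append)
```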
11. First Implementation Slices¶
Slice 1: contract and manifest boundary¶
- define the Slurm adapter manifest shape and app-instance expectations
- no real host execution yet
Slice 2: controller bootstrap path¶
- prove one real Slurm control-plane deployment path through node operations
Slice 3: worker join/drain/remove path¶
- join workers
- map drain/remove to node lifecycle
Slice 4: status and teardown¶
- surface runtime state in app-instance detail
- implement decommission path
This staged path is preferable to a single broad “Slurm support” task because it keeps the adapter boundary honest and reviewable.
12. Non-Goals for v1¶
Do not include in the first Slurm adapter slice:
- HA/multi-controller Slurm
- shared platform-managed Slurm service
- generalized scheduler abstraction beyond what the current App Platform already models
- Slurm-specific branches in core allocation handlers
- direct CLI/SSH-driven host orchestration in product code
13. Review Standard¶
When implementing the adapter, evaluate every proposed change against this rule:
Does this belong in:
- generic App Platform contract/lifecycle/policy, or
- Slurm adapter-owned runtime execution/state?
If the answer is unclear, bias toward keeping generic contracts smaller and pushing Slurm-specific behavior behind the adapter.