Slurm App Runtime Adapter v1

Purpose

Define the first concrete runtime_backend=slurm path under the App Platform model so Slurm support lands as an adapter behind existing app-instance contracts rather than as a special-case control-plane feature.

Use this together with:

  • Scheduler_as_Platform_App_v1.md
  • App_Control_Plane_v1.md
  • App_Runtime_Operating_Modes_v1.md
  • Clustered_App_Model_v1.md
  • App_Platform_Core_For_First_Slurm_Slice_v1.md
  • Slurm_First_Slice_Platform_App_Split_v1.md
  • Slurm_First_Slice_Adapter_Contract_v1.md
  • Node_Operations_and_Agent_Lifecycle_v1.md
  • ../operations/Slurm_First_Slice_Local_Validation.md

1. Decision Summary

  1. Slurm support is delivered as an App Platform runtime adapter, not as a new core allocation API.
  2. The control plane continues to own:
     • app catalog
     • app versions
     • entitlements
     • app instances
     • audit
     • events
     • policy and service-account boundaries
  3. Slurm-specific behavior lives behind the adapter boundary:
     • config rendering
     • controller/worker bootstrap
     • node join/drain/remove translation
     • status mapping back into app-instance lifecycle
     • implementation of generic app-instance member-operation envelopes
  4. The first production path is tenant_dedicated.
  5. The first execution target should stay narrow: one real reference adapter, not a generic scheduler framework.

2. What Remains Generic App Platform

The following stay runtime-neutral and must not grow Slurm-specific handler branches:

  • app catalog and version publication
  • project entitlement enable/disable
  • app artifact registration and trust/promotion
  • app instance CRUD and lifecycle states
  • service account issuance and project scoping
  • policy overlays for allowed regions/SKUs/artifact sources
  • canonical error envelope and audit contract
  • apps.instance.* lifecycle events

Rule: if Slurm requires changes in generic app-instance handlers beyond adapter selection and lifecycle orchestration, treat that as a platform design defect.

3. What the Slurm Adapter Owns

The Slurm adapter is responsible for translating one app instance into concrete runtime actions.

3.1 Control-plane responsibilities inside the adapter

  • resolve the Slurm app version manifest
  • resolve required controller/worker artifacts and templates
  • mint task-bound secrets/credentials
  • choose the target scope:
    • operating_mode=tenant_dedicated
    • control_plane_scope=project|tenant
    • runtime_backend=slurm
  • create adapter-owned runtime state records for controller and workers
  • map adapter progress/failure into app-instance lifecycle
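The adapter-selection boundary described above can be sketched in a few lines. This is an illustrative sketch only; the names (`AppInstance`, `RuntimeAdapter`, `select_adapter`) are assumptions, not the real platform API.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class AppInstance:
    # Hypothetical instance record; field names mirror the mode fields above.
    instance_id: str
    operating_mode: str        # e.g. "tenant_dedicated"
    control_plane_scope: str   # "project" or "tenant"
    runtime_backend: str       # "slurm"

class RuntimeAdapter(Protocol):
    """Generic adapter surface: the only contract generic handlers see."""
    def deploy(self, instance: AppInstance) -> None: ...
    def status(self, instance: AppInstance) -> str: ...
    def decommission(self, instance: AppInstance) -> None: ...

def select_adapter(instance: AppInstance, registry: dict) -> RuntimeAdapter:
    # Adapter selection is the one runtime_backend branch allowed in generic
    # handlers; everything Slurm-specific stays behind the returned object.
    return registry[instance.runtime_backend]
```

The point of the `Protocol` is that generic lifecycle code never imports anything Slurm-specific; it only dispatches through the registry.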

3.2 Host/data-plane responsibilities inside the adapter

Via typed node operations or typed node-agent tasks:

  • install slurmd / slurmctld dependencies
  • install/configure munge
  • render slurm.conf and supporting files
  • stage trusted artifacts from platform-controlled sources
  • start/enable systemd services
  • verify Slurm node/controller readiness
  • translate node lifecycle events into Slurm drain/remove behavior
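The host-side steps above compose into typed operations rather than ad hoc shell. A minimal sketch, assuming a hypothetical `NodeOperation` envelope and plan builder (the field and operation names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class NodeOperation:
    # Hypothetical typed envelope for a host operation.
    op_type: str   # e.g. "install_packages", "render_config"
    node_id: str
    params: dict = field(default_factory=dict)

def worker_bootstrap_plan(node_id: str, config_version: str) -> list:
    """Compose the worker bootstrap steps as an ordered list of typed
    operations, mirroring the host-side responsibilities above."""
    return [
        NodeOperation("install_packages", node_id,
                      {"packages": ["slurmd", "munge"]}),
        NodeOperation("render_config", node_id,
                      {"template": "slurm.conf", "version": config_version}),
        NodeOperation("enable_services", node_id,
                      {"units": ["munge.service", "slurmd.service"]}),
        NodeOperation("verify_ready", node_id,
                      {"check": "slurmd_registered"}),
    ]
```

Because each step is data, the plan is auditable and replayable, which is exactly what raw SSH orchestration would lose.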

4. Initial Runtime Topology

The Slurm adapter is the first proof of the generic clustered-app model, not a one-off exception to it.

4.1 Initial supported mode

The first supported operating shape should be:

  • operating_mode = tenant_dedicated
  • runtime_backend = slurm
  • control_plane_scope = project or tenant

This keeps isolation, rollback, and support boundaries manageable.
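The narrow v1 shape can be enforced with a small guard at instance admission time. A sketch, assuming the instance spec arrives as a plain dict; the constant and function names are hypothetical:

```python
# Allowed values for the first supported operating shape (v1 only).
ALLOWED_V1 = {
    "operating_mode": {"tenant_dedicated"},
    "runtime_backend": {"slurm"},
    "control_plane_scope": {"project", "tenant"},
}

def is_supported_v1(spec: dict) -> bool:
    """Reject any instance spec outside the narrow first mode."""
    return all(spec.get(key) in allowed for key, allowed in ALLOWED_V1.items())
```

Rejecting unsupported shapes up front keeps isolation, rollback, and support boundaries testable rather than aspirational.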

4.2 Initial topology assumption

For v1, assume:

  • one Slurm control-plane deployment per app instance scope
  • one or more worker nodes joining that Slurm control plane
  • no HA controller requirement yet
  • no shared platform-managed Slurm service yet

This is enough to prove:

  • adapter boundary correctness
  • node operation composition
  • lifecycle/event mapping
  • auditability and operator triage

5. Contract Shape for the First Slurm App Version

The app version manifest for Slurm should be explicit about the pieces the adapter needs.

Minimum conceptual inputs:

  • controller artifact reference(s)
  • worker artifact reference(s)
  • config template set/version
  • munge key delivery model
  • required service units
  • health checks
  • expected ports

These belong in the app version manifest and trusted artifact model, not in ad hoc node shell inputs.
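To make the manifest expectations concrete, here is an illustrative shape with a completeness check. Every key name and value is an assumption sketched for discussion, not a finalized schema; the ports shown are Slurm's conventional defaults (6817 for slurmctld, 6818 for slurmd).

```python
# Hypothetical first Slurm app version manifest, mirroring the
# minimum conceptual inputs listed above.
slurm_app_version_manifest = {
    "app": "slurm",
    "version": "1.0.0",
    "controller_artifacts": ["artifact://slurm/slurmctld"],
    "worker_artifacts": ["artifact://slurm/slurmd"],
    "config_templates": {"set": "slurm-default", "version": "v1"},
    "munge_key_delivery": "control_plane_secret",
    "service_units": ["munge.service", "slurmctld.service", "slurmd.service"],
    "health_checks": ["sinfo_responds", "node_registered"],
    "expected_ports": {"slurmctld": 6817, "slurmd": 6818},
}

REQUIRED_KEYS = {
    "controller_artifacts", "worker_artifacts", "config_templates",
    "munge_key_delivery", "service_units", "health_checks", "expected_ports",
}

def missing_manifest_keys(manifest: dict) -> set:
    """Return the required inputs a manifest fails to declare."""
    return REQUIRED_KEYS - manifest.keys()
```

A check like this keeps the adapter's inputs explicit in the version manifest instead of leaking into ad hoc node shell inputs.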

6. Runtime State Model

The app instance remains the user/operator-facing resource.

The Slurm adapter may keep adapter-specific runtime state such as:

  • controller hostname/resource
  • worker membership set
  • current rendered config version
  • last observed Slurm cluster health
  • last sync/drain/remove result

But this state must remain adapter-owned and surfaced through app-instance status/detail views rather than leaking directly into generic app contracts.
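A sketch of that separation, assuming a hypothetical `SlurmRuntimeState` record whose only outward surface is a read-only detail projection:

```python
from dataclasses import dataclass, field

@dataclass
class SlurmRuntimeState:
    # Adapter-owned state; field names are illustrative assumptions.
    controller_host: str = None
    worker_members: set = field(default_factory=set)
    rendered_config_version: str = None
    last_cluster_health: str = "unknown"
    last_member_op_result: str = None

    def detail_view(self) -> dict:
        """Projection for app-instance status/detail views. Generic app
        contracts consume only this dict, never the state object itself."""
        return {
            "controller": self.controller_host,
            "workers": sorted(self.worker_members),
            "config_version": self.rendered_config_version,
            "health": self.last_cluster_health,
        }
```

The one-way projection is the point: adapter state can grow Slurm-specific fields freely without those fields becoming generic contract surface.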

7. Lifecycle Mapping

7.1 App instance lifecycle

Generic lifecycle remains:

  • requested
  • deploying
  • running
  • failed
  • decommissioning
  • decommissioned

7.2 Slurm adapter interpretation

Suggested mapping:

  • requested: app instance accepted; adapter orchestration not started
  • deploying: controller/worker operations in progress
  • running: controller healthy and required worker baseline satisfied
  • failed: adapter hit unrecoverable error after bounded retries
  • decommissioning: Slurm runtime teardown in progress
  • decommissioned: controller and worker teardown complete

Do not introduce Slurm-specific generic app-instance statuses like controller_bootstrapping or draining_worker. Those belong in adapter detail state, not in the canonical app lifecycle enum.
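The rule above reduces to a one-way translation table at the adapter boundary. A sketch, where the adapter-internal phase names on the left are illustrative assumptions:

```python
# Adapter-internal phases (hypothetical names) map onto the canonical
# lifecycle enum; only the right-hand values ever cross the boundary.
ADAPTER_PHASE_TO_LIFECYCLE = {
    "accepted": "requested",
    "controller_bootstrapping": "deploying",
    "workers_joining": "deploying",
    "cluster_healthy": "running",
    "unrecoverable_error": "failed",
    "draining_workers": "decommissioning",
    "teardown_complete": "decommissioned",
}

def lifecycle_status(adapter_phase: str) -> str:
    """Translate an adapter phase into the generic app-instance status.
    Phases like controller_bootstrapping stay adapter detail only."""
    return ADAPTER_PHASE_TO_LIFECYCLE[adapter_phase]
```

Note that several internal phases collapse into one canonical status; that lossiness is intentional, since the fine-grained phase belongs in adapter detail state.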

8. Node Lifecycle Integration

Slurm is where the earlier node runtime work becomes useful.

The adapter should build on:

  • trusted artifact delivery
  • runtime secret delivery
  • generic app-instance member-operation envelopes
  • typed host operation primitives
  • node drain/remove lifecycle

Expected interactions:

  • onboarding a Slurm worker node:
    • stage artifacts
    • render config
    • join/register worker
  • retiring/removing a Slurm worker node:
    • drain via Slurm-aware operation
    • wait for scheduler-safe state
    • then continue node retire/remove lifecycle

This means Slurm should consume the node lifecycle primitives, not redefine them.
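The worker-retirement interaction above can be sketched as a small sequence. The helper callables (`submit_node_operation`, `slurm_node_state`) and operation names are hypothetical stand-ins for the real node lifecycle primitives; `DRAINED`/`DOWN` are the scheduler-safe Slurm node states being waited on:

```python
def retire_worker(node_id: str, submit_node_operation, slurm_node_state,
                  poll_limit: int = 10) -> bool:
    """Drain a worker via a Slurm-aware operation, wait for a
    scheduler-safe state, then hand off to the generic node
    retire/remove lifecycle. Returns False if the safe state
    is not reached within poll_limit checks."""
    submit_node_operation("slurm_drain", node_id)
    for _ in range(poll_limit):
        if slurm_node_state(node_id) in {"DRAINED", "DOWN"}:
            # Scheduler-safe: continue the generic node lifecycle.
            submit_node_operation("node_retire", node_id)
            return True
    return False
```

The Slurm-aware part is only the drain-and-wait prefix; the retire itself is the unchanged generic primitive, which is what "consume, not redefine" means in practice.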

9. Service Account and Trust Model

The Slurm adapter should run under a project-scoped or tenant-scoped service account owned through the existing App Platform model.

Requirements:

  • no static long-lived infrastructure credentials in app-instance config
  • no raw SSH orchestration from the adapter
  • all secrets delivered through the existing control-plane/Vault-backed mechanism
  • all artifact pulls and config rendering remain policy-governed

10. Observability and Events

The adapter must reuse the generic app-instance event surface:

  • apps.instance.requested
  • apps.instance.running
  • apps.instance.failed
  • apps.instance.deleting
  • apps.instance.deleted

Additionally, the adapter should expose structured runtime details sufficient for:

  • controller health
  • worker membership and drain state
  • last adapter error
  • correlation-first triage

If the generic event surface proves insufficient, add new app-runtime events intentionally rather than emitting Slurm-only one-offs ad hoc.
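A minimal sketch of enforcing that reuse at the emit site, assuming a hypothetical `emit` helper; it rejects any topic outside the generic set and carries a correlation id on every event for correlation-first triage:

```python
# The only topics the Slurm adapter may emit (from the list above).
GENERIC_EVENTS = {
    "apps.instance.requested",
    "apps.instance.running",
    "apps.instance.failed",
    "apps.instance.deleting",
    "apps.instance.deleted",
}

def emit(event_log: list, topic: str, instance_id: str,
         correlation_id: str, detail: dict = None) -> None:
    """Append an event, refusing Slurm-only one-off topics.
    Slurm-specific context travels in the structured detail payload."""
    if topic not in GENERIC_EVENTS:
        raise ValueError("non-generic event topic: " + topic)
    event_log.append({
        "topic": topic,
        "instance": instance_id,
        "correlation_id": correlation_id,
        "detail": detail or {},
    })
```

Slurm-specific facts (controller health, drain state) ride in `detail`, so the event surface stays generic while triage stays rich.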

11. First Implementation Slices

Slice 1: contract and manifest boundary

  • define the Slurm adapter manifest shape and app-instance expectations
  • no real host execution yet

Slice 2: controller bootstrap path

  • prove one real Slurm control-plane deployment path through node operations

Slice 3: worker join/drain/remove path

  • join workers
  • map drain/remove to node lifecycle

Slice 4: status and teardown

  • surface runtime state in app-instance detail
  • implement decommission path

This staged path is preferable to a single broad “Slurm support” task because it keeps the adapter boundary honest and reviewable.

12. Non-Goals for v1

Do not include in the first Slurm adapter slice:

  • HA/multi-controller Slurm
  • shared platform-managed Slurm service
  • generalized scheduler abstraction beyond what current App Platform already models
  • Slurm-specific branches in core allocation handlers
  • direct CLI/SSH-driven host orchestration in product code

13. Review Standard

When implementing the adapter, evaluate every proposed change against this rule:

Does this belong in:

  • generic App Platform contract/lifecycle/policy, or
  • Slurm adapter-owned runtime execution/state?

If the answer is unclear, bias toward keeping generic contracts smaller and pushing Slurm-specific behavior behind the adapter.