
Slurm First Slice Adapter Contract v1

Purpose

Define the app-owned contract for the first Slurm slice on top of the GPUaaS core member read and member-operation envelopes.

This document is intentionally adapter-specific.

It answers:

  1. how Slurm uses the new generic platform surfaces
  2. what Slurm-specific meanings stay outside GPUaaS core
  3. what the first app-side implementation should support before broader Slurm work continues

Reading order:

  1. App_Platform_Core_For_First_Slurm_Slice_v1.md
  2. Slurm_First_Slice_Platform_App_Split_v1.md
  3. Slurm_First_Slice_Adapter_Contract_v1.md

Use this with:

  • doc/api/openapi.draft.yaml
  • doc/architecture/Slurm_App_Runtime_Adapter_v1.md
  • doc/architecture/Node_Operations_and_Agent_Lifecycle_v1.md

Decision Summary

  1. The first Slurm slice uses the generic app-instance member APIs exactly as platform envelopes, not as runtime semantics.
  2. Slurm owns all meaning behind:
     • component_key
     • member status interpretation
     • add/drain/remove/replace execution
     • adapter detail payloads
  3. The first supported Slurm adapter shape is intentionally narrow:
     • one controller
     • zero or more workers
     • no HA controller
     • no shared multi-tenant Slurm service
  4. The first goal is to prove:
     • controller bootstrap
     • worker join
     • worker drain/remove
     • correlation-first status/debugging
  5. The first slice should not attempt a generalized scheduler framework.

Slurm Component Keys

For the first slice, the Slurm adapter uses two adapter-owned component_key values:

  1. controller
  2. worker

Important rule: - these are Slurm adapter values, not platform enums

GPUaaS core should treat them as opaque strings.
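The opacity rule can be illustrated with a minimal Python sketch of an adapter-side check. The constant and function names are hypothetical; GPUaaS core itself performs no such validation and simply stores the strings.

```python
# Adapter-owned vocabulary for slice one; the platform treats these as opaque strings.
SLURM_COMPONENT_KEYS = frozenset({"controller", "worker"})

def is_slurm_component(component_key):
    # Validation like this lives in the Slurm adapter, never in GPUaaS core.
    return component_key in SLURM_COMPONENT_KEYS
```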

Supported Member Operations In Slice One

The generic platform operation envelope supports:

  • add
  • drain
  • remove
  • replace

The first Slurm slice should use them as follows.

add

Supported for: - worker

Meaning in Slurm:

  • request capacity for one or more worker members
  • bootstrap those workers
  • render required Slurm config
  • join them to the controller
  • report success when the worker is registered and reaches the adapter’s ready condition

Not used in slice one for: - controller

Reason: - the first slice assumes one controller created by the app-instance deploy path, not by later member scale-out
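The add meaning above can be sketched as an ordered plan. Function and step names are illustrative, not part of the contract:

```python
def add_worker_plan(count):
    # Hypothetical ordering of the adapter's worker-add workflow.
    plan = []
    for i in range(count):
        plan += [
            ("request_capacity", i),     # app-owned allocation intent
            ("bootstrap_host", i),       # platform node/bootstrap primitives
            ("render_slurm_config", i),
            ("join_controller", i),
            ("await_ready", i),          # adapter's ready condition
        ]
    return plan
```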

drain

Supported for: - worker

Meaning in Slurm:

  • mark the worker as draining in Slurm
  • wait until the adapter’s safe-drain condition is met
  • update generic member status and adapter detail as the drain progresses

Not supported in slice one for: - controller

Reason: - controller drain semantics are not part of the first slice
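A minimal sketch of the safe-drain condition, assuming the adapter observes a slurmd node state and a running-job count (both inputs and the state string are illustrative assumptions):

```python
def safe_drain_met(slurmd_state, running_jobs):
    # Hypothetical condition: node reports drained in Slurm and runs no jobs.
    return slurmd_state == "drained" and running_jobs == 0
```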

remove

Supported for: - worker

Meaning in Slurm:

  • require the target worker to already be safely drained, or ensure the adapter drains it first according to its own policy
  • remove the worker from Slurm membership
  • continue any node lifecycle cleanup required by the platform composition rule

Not supported in slice one for: - controller
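The remove precondition can be sketched as a plan that drains first only when needed; step names are hypothetical:

```python
def remove_worker_plan(worker, safely_drained):
    # Drain first per adapter policy when the worker is not already safely drained.
    plan = [] if safely_drained else [("drain", worker)]
    plan += [
        ("remove_from_slurm", worker),      # drop Slurm membership
        ("node_lifecycle_cleanup", worker), # platform composition rule
    ]
    return plan
```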

replace

Supported for: - worker

Meaning in Slurm: equivalent to the following sequence:

  1. provision replacement worker
  2. join replacement
  3. drain/remove failed or retiring worker

Implementation note: - the adapter may internally realize replace as a composed workflow rather than as one direct node action
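A sketch of replace realized as a composed workflow of the other operations, per the equivalence above; names are illustrative:

```python
def replace_worker_plan(retiring, replacement):
    # Provision and join the replacement before retiring the old worker.
    return [
        ("add", replacement),
        ("await_ready", replacement),
        ("drain", retiring),
        ("remove", retiring),
    ]
```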

App-Instance Deploy Path

The first Slurm app-instance deploy path should create:

  1. one controller member
  2. zero or more worker members depending on adapter-owned initial config

Recommended first-step posture:

  1. deploy controller first
  2. reach controller healthy baseline
  3. then reconcile any initial worker set

The platform app-instance status should remain generic:

  • requested
  • deploying
  • running
  • failed

Slurm-specific rollout detail belongs in member records and adapter_detail.
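A sketch of projecting adapter progress into the generic app-instance statuses, keeping Slurm detail out of the top-level value; the boolean inputs are hypothetical adapter observations:

```python
def app_instance_status(deploy_started, controller_ready, deploy_failed):
    # Only the generic values appear here; Slurm-specific rollout detail
    # belongs in member records and adapter_detail.
    if deploy_failed:
        return "failed"
    if not deploy_started:
        return "requested"
    return "running" if controller_ready else "deploying"
```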

Generic Member Status Mapping

The platform member status values are generic. The Slurm adapter interprets them as follows:

requested

  • member intent exists but no meaningful adapter execution has started yet

reconciling

  • adapter is actively bootstrapping, joining, draining, or reconciling the member

ready

  • member satisfies the Slurm adapter’s ready condition for its component key

draining

  • adapter has placed the member into a drain path and is waiting for safe completion

deleting

  • adapter is actively removing the member from runtime and infrastructure state

failed

  • adapter reached a terminal failure for the current member lifecycle attempt

deleted

  • member has been fully removed from the active runtime set

Important rule: - the human/runtime meaning behind readiness or drain safety remains Slurm-owned
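The mapping above can be sketched as a projection from adapter-observed state onto the generic member statuses. The lifecycle and slurmd state strings here are illustrative assumptions, not contract values:

```python
def member_status(lifecycle, slurmd_state):
    # lifecycle reflects the adapter's own bookkeeping; slurmd_state is the
    # observed Slurm runtime state (None before any meaningful execution starts).
    if lifecycle == "removed":
        return "deleted"
    if lifecycle == "removing":
        return "deleting"
    if lifecycle == "failed_attempt":
        return "failed"
    if slurmd_state is None:
        return "requested"
    if slurmd_state in ("drain", "draining", "drained"):
        return "draining"
    if slurmd_state == "idle":
        return "ready"
    return "reconciling"
```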

adapter_detail Contract For Slice One

The first slice should expose enough adapter detail for operator debugging without leaking Slurm semantics into generic top-level fields.

Recommended initial fields:

Common fields

  • role
  • phase
  • last_adapter_error
  • last_runtime_observed_at

Controller fields

  • controller_hostname
  • slurmctld_state
  • config_version
  • cluster_name

Worker fields

  • worker_hostname
  • slurmd_state
  • drain_reason
  • registered_in_controller
  • config_version

Rule:

  • these are adapter-owned fields and may evolve with the Slurm adapter
  • they should not be promoted into the generic platform contract unless multiple runtimes prove the same need
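An illustrative worker adapter_detail payload using the fields listed above; all concrete values are made up for the example:

```python
# Hypothetical adapter_detail for a worker member mid-join; values illustrative.
worker_adapter_detail = {
    "role": "worker",
    "phase": "joining",
    "last_adapter_error": None,
    "last_runtime_observed_at": "2025-01-01T00:00:00Z",  # illustrative timestamp
    "worker_hostname": "worker-01",                      # illustrative hostname
    "slurmd_state": "idle",
    "drain_reason": None,
    "registered_in_controller": True,
    "config_version": "cfg-0001",                        # illustrative version
}
```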

Node/Allocation Composition

The Slurm adapter should use platform primitives in this order:

  1. request or reconcile required capacity through app-owned logic using platform allocation intent
  2. use platform node/bootstrap primitives to prepare the target host
  3. use platform artifact and secret delivery primitives
  4. execute Slurm-specific host configuration and service bring-up
  5. update member and operation state back through the platform surfaces

Boundary rule:

  • no hidden direct platform DB writes from the Slurm app
  • no bypass around the platform audit/correlation model
  • no assumption that the platform understands Slurm semantics

Service Account Usage

The Slurm adapter should run under a project-scoped or tenant-scoped service account aligned with the app instance scope.

The adapter should use that identity for:

  • reading app instance and member state
  • creating or updating member operations if needed by its own workflows
  • consuming platform primitives
  • reporting lifecycle progress through public contracts

The adapter should not require:

  • platform superadmin privileges
  • direct human tokens
  • undocumented internal-only endpoints

First Slice Acceptance Criteria

The first Slurm adapter slice is acceptable only if all are true:

  1. controller and worker remain adapter-owned component_key values, not platform enums.
  2. The Slurm app implements worker add, drain, remove, and replace without adding Slurm-specific fields to the generic platform status model.
  3. Member and operation status are sufficient for an operator to diagnose progress with correlation_id and adapter_detail.
  4. The adapter uses platform node/allocation/artifact/secret primitives rather than hidden side channels.
  5. A future non-Slurm app could reuse the same platform member envelopes without inheriting Slurm behavior.

Explicit Non-Goals

Do not include in the first Slurm slice:

  1. multi-controller HA Slurm
  2. generic scheduler abstraction
  3. shared tenant-wide or platform-wide Slurm service
  4. controller drain/remove semantics
  5. Slurm-specific event taxonomy in core contracts
  6. Slurm-specific top-level API resources in GPUaaS core

Immediate Outcome

The current conclusion is:

  • the GPUaaS core now provides the minimal envelope
  • the Slurm app can define its first concrete meanings on top
  • the next app-side work is implementation of controller and worker reconciliation, not more core API design