
Slurm First Slice Platform App Split v1

Purpose

Separate the first Slurm slice into:

  • platform-owned work
  • app-owned work
  • the contract dependencies between them

This document exists because the Slurm adapter is expected to be developed and deployed independently from the GPUaaS core platform. If ownership is not explicit, the first slice will accumulate hidden coupling and the platform will become harder to grow.

Reading order:

  1. Clustered_App_Model_v1.md
  2. App_Platform_Primitive_Boundary_v1.md
  3. App_Platform_Core_For_First_Slurm_Slice_v1.md
  4. Slurm_First_Slice_Platform_App_Split_v1.md

Use this with:

  • doc/api/openapi.draft.yaml
  • doc/architecture/App_Control_Plane_v1.md
  • doc/architecture/Slurm_App_Runtime_Adapter_v1.md
  • doc/architecture/Slurm_First_Slice_Adapter_Contract_v1.md
  • doc/architecture/Slurm_Product_Workflow_And_Gap_Assessment_v1.md
  • doc/architecture/CLI_Agent_Operable_Control_Plane_v2.md
  • doc/architecture/External_App_Team_Integration_Guide_v1.md

Decision Summary

  1. GPUaaS platform and the Slurm app must be able to evolve independently.
  2. The platform owns reusable control-plane contracts and execution primitives.
  3. The Slurm app owns runtime-specific implementation and reconciliation.
  4. The contract between them must be explicit enough that the Slurm app does not need hidden internal platform knowledge.
  5. Any requirement that only exists because of Slurm should stay outside the core platform unless another app class proves the same need.

Platform-Owned Work

The platform must provide these capabilities before the first Slurm slice can be built without relying on hidden platform internals.

Public contract surfaces

  • project-scoped app instance lifecycle APIs
  • project-scoped app artifact lifecycle APIs
  • project-scoped runtime secret issuance
  • generic app-instance member read APIs
  • generic app-instance member-operation APIs
  • canonical error envelope with correlation_id
  • idempotent mutation behavior for member-operation requests
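The last two items above can be illustrated with a minimal sketch. It is not the platform's implementation: only `correlation_id` is named in this document, so the `code` and `message` fields, the `idempotency_key` parameter, the `"accepted"` status, and the in-memory `MemberOperationStore` are all assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ErrorEnvelope:
    # Only correlation_id is named in this document; code and message are assumed.
    code: str
    message: str
    correlation_id: str


@dataclass
class MemberOperationStore:
    """Hypothetical in-memory stand-in for a member-operation endpoint."""

    _seen: dict = field(default_factory=dict)

    def submit(self, idempotency_key: str, request: dict):
        prior = self._seen.get(idempotency_key)
        if prior is not None:
            if prior["request"] != request:
                # Same key, different payload: refuse rather than guess intent.
                return ErrorEnvelope(
                    code="idempotency_conflict",
                    message="key was already used with a different request",
                    correlation_id=prior["response"]["correlation_id"],
                )
            # Replay of the same request: return the original response unchanged.
            return prior["response"]

        response = {
            "operation_id": str(uuid.uuid4()),
            "correlation_id": str(uuid.uuid4()),
            "status": "accepted",  # assumed status name
        }
        self._seen[idempotency_key] = {"request": dict(request), "response": response}
        return response


store = MemberOperationStore()
first = store.submit("key-1", {"member_id": "m-1", "operation": "drain"})
replay = store.submit("key-1", {"member_id": "m-1", "operation": "drain"})
conflict = store.submit("key-1", {"member_id": "m-1", "operation": "remove"})
```

Here `replay` is the same operation record as `first`, so a retried request does not create a second operation, while `conflict` is an `ErrorEnvelope` that carries the original `correlation_id` for tracing.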

Identity, policy, and audit

  • service-account-compatible auth for app automation
  • explicit authz for:
    • tenant_admin
    • project_admin
    • project-scoped service_account
  • audit rows for privileged app and member lifecycle mutations
  • project and tenant boundary enforcement
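A rough sketch of how the three roles and the boundary rule might combine into one check. The policy itself is an assumption, not something this document specifies: here `tenant_admin` may act on any project inside its own tenant, while `project_admin` and `service_account` are limited to a single project. `Principal` and `may_mutate_member` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Principal:
    role: str                         # "tenant_admin", "project_admin", or "service_account"
    tenant_id: str
    project_id: Optional[str] = None  # unset for tenant-wide roles


PROJECT_SCOPED_ROLES = {"project_admin", "service_account"}


def may_mutate_member(principal: Principal, tenant_id: str, project_id: str) -> bool:
    """Return True if the principal may run a member operation in the given project."""
    # Tenant boundary first: nothing crosses tenants, whatever the role.
    if principal.tenant_id != tenant_id:
        return False
    if principal.role == "tenant_admin":
        return True  # assumed: tenant admins cover every project in their tenant
    if principal.role in PROJECT_SCOPED_ROLES:
        # Project boundary: the principal must be bound to this exact project.
        return principal.project_id == project_id
    # Unknown roles are denied by default.
    return False
```

Denying unknown roles by default keeps the authz explicit: a new role gains access only when it is added to the check deliberately.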

Execution primitives

  • allocation realization primitives
  • node lifecycle primitives the adapter can compose with
  • node-agent task execution substrate on enrolled nodes
  • artifact delivery primitives
  • runtime secret custody and delivery primitives

Read-model and async behavior

  • public member read surface with:
    • member_id
    • component_key
    • generic status
    • node binding
    • last operation or correlation references
  • public member-operation status surface
  • async operation and failure reporting suitable for humans, automation, and agents
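As a sketch of what the member read surface could look like as a data shape: `member_id`, `component_key`, and `adapter_detail` are names used in this document, while `status`, `node_id`, `last_operation_id`, `correlation_id`, and the helper `to_public_json` are hypothetical field and function names chosen for illustration. The authoritative shape belongs in doc/api/openapi.draft.yaml.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class MemberRead:
    member_id: str
    component_key: str
    status: str                              # generic status; value set is assumed
    node_id: Optional[str] = None            # node binding (assumed field name)
    last_operation_id: Optional[str] = None  # assumed field name
    correlation_id: Optional[str] = None
    # Opaque to the platform; only the owning app interprets its contents.
    adapter_detail: dict = field(default_factory=dict)


def to_public_json(member: MemberRead) -> str:
    """Serialize a member for the public read surface with stable key order."""
    return json.dumps(asdict(member), sort_keys=True)


example = MemberRead(
    member_id="m-1",
    component_key="worker",
    status="healthy",
    node_id="node-42",
    last_operation_id="op-7",
    correlation_id="corr-123",
    adapter_detail={"note": "runtime-specific detail goes here"},
)
payload = to_public_json(example)
```

Keeping `adapter_detail` as an untyped dictionary is one way to let the app surface debugging detail without the generic model learning any Slurm vocabulary.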

Non-goals for the platform in the first slice

The platform should not own:

  • Slurm controller or worker semantics
  • Slurm config rendering
  • Slurm health rules
  • Slurm drain/remove safety logic
  • Slurm join/bootstrap implementation details

Slurm App-Owned Work

The Slurm app owns the runtime-specific implementation built on top of the platform contract.

Adapter/controller responsibilities

  • consume app-instance and member-operation intent
  • interpret component_key values for Slurm roles
  • implement add/drain/remove/replace semantics for Slurm members
  • decide when runtime state is healthy, degraded, drained, or failed
  • translate generic platform operations into Slurm-specific execution steps
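One way to picture the translation responsibility is a lookup from a (`component_key`, operation) pair to an ordered list of app-side steps. Everything concrete below is an assumption: the `component_key` values `"controller"` and `"worker"`, the step names, and the choice to model replace as remove followed by add. The real contract lives in the adapter contract document.

```python
# Hypothetical step names; each would be implemented by the Slurm app, not the platform.
WORKER_ADD = ["deliver_artifacts", "deliver_secrets", "start_worker_daemon", "verify_registered"]
WORKER_DRAIN = ["mark_draining", "wait_for_running_jobs"]
# Removal drains first so running jobs are not cut off (assumed safety rule).
WORKER_REMOVE = WORKER_DRAIN + ["stop_worker_daemon", "drop_from_config"]

STEPS = {
    ("worker", "add"): WORKER_ADD,
    ("worker", "drain"): WORKER_DRAIN,
    ("worker", "remove"): WORKER_REMOVE,
    # Assumed: replace is modelled as remove followed by add.
    ("worker", "replace"): WORKER_REMOVE + WORKER_ADD,
    ("controller", "add"): ["deliver_artifacts", "deliver_secrets", "start_controller_daemon"],
}


def plan_steps(component_key: str, operation: str) -> list:
    """Translate a generic member operation into an ordered list of app-side steps."""
    try:
        # Return a copy so callers cannot mutate the shared plan table.
        return list(STEPS[(component_key, operation)])
    except KeyError:
        raise ValueError(
            f"unsupported operation {operation!r} for component_key {component_key!r}"
        ) from None
```

The platform only ever sees the generic operation name; the table and the steps behind it stay inside the Slurm app, which is what lets the app change them without a platform release.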

Runtime-specific responsibilities

  • choose and validate Slurm manifest/config structure
  • render slurm.conf, munge, and related artifacts
  • define controller bootstrap steps
  • define worker join steps
  • define worker drain/remove steps
  • perform runtime-specific reconciliation after infrastructure actions complete
  • populate adapter_detail for operator debugging without leaking Slurm semantics into the generic platform model

Deployment and release responsibilities

  • package the Slurm adapter/controller
  • define its rollout/deploy process independently from GPUaaS core
  • keep runtime-specific change cadence separate from platform contract changes

Contract Dependencies

This is the minimum stable interface between GPUaaS platform and the Slurm app.

The Slurm app depends on the platform for

  • authenticated project-scoped API access
  • service-account-compatible lifecycle operations
  • artifact and secret access
  • app-instance read/write lifecycle
  • member read and member-operation envelopes
  • node/allocation execution primitives
  • correlation-first failure and audit evidence

The platform depends on the Slurm app for

  • correct runtime implementation of member operations
  • mapping generic operation status to runtime-specific progress
  • runtime-specific health and reconciliation behavior
  • adapter-specific detail surfaced for debugging
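A small sketch of the status-mapping dependency, under stated assumptions: the generic status names reuse the healthy, degraded, drained, and failed vocabulary from the adapter responsibilities above, the Slurm node state strings are an illustrative subset treated as plain text, and `to_generic_status` is a hypothetical helper. Real Slurm state output can carry extra flags or suffixes that this sketch does not parse.

```python
HEALTHY_STATES = {"IDLE", "ALLOCATED", "MIXED"}
FAILED_STATES = {"DOWN", "FAIL"}


def to_generic_status(slurm_node_state: str) -> dict:
    """Map a runtime-specific node state to a generic status plus adapter detail."""
    state = slurm_node_state.strip().upper()
    if state in HEALTHY_STATES:
        status = "healthy"
    elif state == "DRAINED":
        status = "drained"
    elif state in FAILED_STATES:
        status = "failed"
    else:
        # Anything unrecognized is reported as degraded rather than guessed.
        status = "degraded"
    # The raw runtime state stays in adapter_detail, outside the generic model.
    return {"status": status, "adapter_detail": {"slurm_node_state": state}}
```

The generic `status` is all the platform needs to reason about, while the raw state travels in `adapter_detail` for operators who know Slurm.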

What must stay stable enough for independent development

  • endpoint shapes and auth rules
  • async operation envelope and statuses
  • generic member status/read shape
  • request/response idempotency expectations
  • audit and correlation behavior
  • adapter boundary around runtime-specific logic

Platform Deliverables Needed Before Slurm Implementation Starts

These are the platform deliverables that should be considered prerequisites for the first Slurm app implementation branch:

  1. additive OpenAPI support for:
    • app-instance members
    • app-instance member operations
  2. explicit RBAC/authz policy for member operations
  3. documented node/allocation composition rule for app adapters
  4. documented service-account usage model for app automation
Slurm Deliverables Built On Top Of That Contract

These are app-side deliverables that should stay outside the GPUaaS core platform:

  1. Slurm component-key contract
  2. controller bootstrap implementation
  3. worker join implementation
  4. worker drain/remove implementation
  5. Slurm-specific runtime detail mapping
  6. Slurm adapter deploy/runbook material

The first concrete app-side contract for those deliverables lives in doc/architecture/Slurm_First_Slice_Adapter_Contract_v1.md.

Boundary Tests

Before the first slice is considered healthy, we should be able to answer yes to all of these:

  1. Can the Slurm app be developed without changing the GPUaaS core for every runtime-specific behavior change?
  2. Can a project-scoped service account drive the public contract without hidden internal access?
  3. Can operators debug member operations through public read models and correlations?
  4. Can the Slurm app change its runtime logic without forcing a redesign of the core platform API?
  5. Can a future non-Slurm app reuse the same platform primitives without inheriting Slurm semantics?

Immediate Outcome

The current conclusion is:

  • the first Slurm slice should be split into a platform lane and an app lane
  • only the contract boundary belongs in the shared core
  • runtime implementation belongs in the independently deployable Slurm app