Slurm First Slice Platform App Split v1¶
Purpose¶
Separate the first Slurm slice into:
- platform-owned work,
- app-owned work,
- and the contract dependencies between them.
This document exists because the Slurm adapter is expected to be developed and deployed independently from the GPUaaS core platform. If ownership is not explicit, the first slice will accumulate hidden coupling and the platform will become harder to grow.
Reading order:
1. Clustered_App_Model_v1.md
2. App_Platform_Primitive_Boundary_v1.md
3. App_Platform_Core_For_First_Slurm_Slice_v1.md
4. Slurm_First_Slice_Platform_App_Split_v1.md
Use this with:
- doc/api/openapi.draft.yaml
- doc/architecture/App_Control_Plane_v1.md
- doc/architecture/Slurm_App_Runtime_Adapter_v1.md
- doc/architecture/Slurm_First_Slice_Adapter_Contract_v1.md
- doc/architecture/Slurm_Product_Workflow_And_Gap_Assessment_v1.md
- doc/architecture/CLI_Agent_Operable_Control_Plane_v2.md
- doc/architecture/External_App_Team_Integration_Guide_v1.md
Decision Summary¶
- GPUaaS platform and the Slurm app must be able to evolve independently.
- The platform owns reusable control-plane contracts and execution primitives.
- The Slurm app owns runtime-specific implementation and reconciliation.
- The contract between them must be explicit enough that the Slurm app does not need hidden internal platform knowledge.
- Any requirement that only exists because of Slurm should stay outside the core platform unless another app class proves the same need.
Platform-Owned Work¶
The platform must provide these capabilities before the first Slurm slice can be built honestly.
Public contract surfaces¶
- project-scoped app instance lifecycle APIs
- project-scoped app artifact lifecycle APIs
- project-scoped runtime secret issuance
- generic app-instance member read APIs
- generic app-instance member-operation APIs
- canonical error envelope with correlation_id
- idempotent mutation behavior for member-operation requests
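The idempotency and error-envelope expectations above can be illustrated with a minimal in-memory sketch. Everything named here is an assumption for illustration and not part of the published contract: the `MemberOperationStore` class, the idea of a client-supplied idempotency key, the field names, and the `idempotency_key_conflict` error code. The real shapes belong in doc/api/openapi.draft.yaml.

```python
import uuid
from dataclasses import dataclass, field


def error_envelope(code: str, message: str) -> dict:
    """Build an error body that always carries a correlation_id (hypothetical shape)."""
    return {
        "error": {
            "code": code,
            "message": message,
            "correlation_id": str(uuid.uuid4()),
        }
    }


@dataclass
class MemberOperationStore:
    """In-memory stand-in for the platform's operation table (hypothetical)."""

    _by_key: dict = field(default_factory=dict)

    def submit(self, project_id: str, idempotency_key: str, request: dict) -> dict:
        # Keys are scoped per project so two projects cannot collide.
        key = (project_id, idempotency_key)
        existing = self._by_key.get(key)
        if existing is not None:
            if existing["request"] != request:
                # Same key with a different payload is rejected, not silently merged.
                return error_envelope(
                    "idempotency_key_conflict",
                    "key was already used with a different request body",
                )
            # Replay of the same request: return the original operation unchanged.
            return existing["response"]
        response = {
            "operation_id": str(uuid.uuid4()),
            "status": "accepted",
            "correlation_id": str(uuid.uuid4()),
        }
        self._by_key[key] = {"request": dict(request), "response": response}
        return response
```

In this sketch a retried request returns the same `operation_id`, so an adapter or agent can safely retry after a timeout without creating a second drain or remove operation.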
Identity, policy, and audit¶
- service-account-compatible auth for app automation
- explicit authz for:
  - tenant_admin
  - project_admin
  - project-scoped service_account
- audit rows for privileged app and member lifecycle mutations
- project and tenant boundary enforcement
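As a rough illustration of how the three roles and the boundary rules could combine, the sketch below checks a member mutation and records an audit row. The role names come from the list above. The `Principal` shape, the function names, the audit row fields, and the choice to audit denied attempts as well as allowed ones are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Principal:
    subject: str
    role: str  # "tenant_admin", "project_admin", or "service_account"
    tenant_id: str
    project_id: Optional[str] = None  # set for project-scoped principals


def is_allowed(principal: Principal, tenant_id: str, project_id: str) -> bool:
    # Tenant boundary first: nothing crosses tenants.
    if principal.tenant_id != tenant_id:
        return False
    if principal.role == "tenant_admin":
        return True
    # Project admins and service accounts only act inside their own project.
    if principal.role in ("project_admin", "service_account"):
        return principal.project_id == project_id
    return False


def authorize_member_mutation(
    principal: Principal,
    tenant_id: str,
    project_id: str,
    action: str,
    audit_log: List[dict],
) -> bool:
    allowed = is_allowed(principal, tenant_id, project_id)
    # Assumption: both allowed and denied attempts produce an audit row.
    audit_log.append(
        {
            "subject": principal.subject,
            "role": principal.role,
            "tenant_id": tenant_id,
            "project_id": project_id,
            "action": action,
            "allowed": allowed,
            "at": datetime.now(timezone.utc).isoformat(),
        }
    )
    return allowed
```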
Execution primitives¶
- allocation realization primitives
- node lifecycle primitives the adapter can compose with
- node-agent task execution substrate on enrolled nodes
- artifact delivery primitives
- runtime secret custody and delivery primitives
Read-model and async behavior¶
- public member read surface with:
  - member_id
  - component_key
  - generic status
  - node binding
  - last operation or correlation references
- public member-operation status surface
- async operation and failure reporting suitable for humans, automation, and agents
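One possible shape for these read models is sketched below. The `member_id` and `component_key` names and the `adapter_detail` field appear in this document; every other field name, the status strings, and the `to_json` helper are placeholders chosen for the example, not the agreed API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class MemberRead:
    member_id: str
    component_key: str
    status: str  # generic status; the values used here are placeholders
    node_id: Optional[str] = None  # node binding
    last_operation_id: Optional[str] = None
    correlation_id: Optional[str] = None
    # Opaque to the platform: the adapter fills it in for operator debugging.
    adapter_detail: dict = field(default_factory=dict)


@dataclass
class MemberOperationStatus:
    operation_id: str
    member_id: str
    operation: str  # e.g. "add", "drain", "remove", "replace"
    status: str  # placeholder async states such as "accepted" or "failed"
    correlation_id: str
    failure_reason: Optional[str] = None


def to_json(model) -> str:
    """Serialize either read model to a stable JSON document."""
    return json.dumps(asdict(model), sort_keys=True)
```

Keeping `adapter_detail` as an untyped dictionary is one way to let the Slurm app expose runtime detail without adding Slurm-specific fields to the generic model.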
Non-goals for the platform in the first slice¶
The platform should not own:
- Slurm controller or worker semantics
- Slurm config rendering
- Slurm health rules
- Slurm drain/remove safety logic
- Slurm join/bootstrap implementation details
Slurm App-Owned Work¶
The Slurm app owns the runtime-specific implementation built on top of the platform contract.
Adapter/controller responsibilities¶
- consume app-instance and member-operation intent
- interpret component_key values for Slurm roles
- implement add/drain/remove/replace semantics for Slurm members
- decide when runtime state is healthy, degraded, drained, or failed
- translate generic platform operations into Slurm-specific execution steps
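The health decision in the list above could look roughly like the following. This is a sketch under assumptions: the `controller` and `worker` component keys are hypothetical values, the input is assumed to be a long-form Slurm node state name such as IDLE, MIXED, DRAINED, or DOWN, and the grouping of states into healthy, degraded, drained, and failed is one plausible policy, not the adapter's agreed rules. A trailing `*` is treated as "not responding", which is how `sinfo` flags such nodes.

```python
import re

# Hypothetical component_key values mapped to the Slurm daemon each one runs.
COMPONENT_ROLES = {"controller": "slurmctld", "worker": "slurmd"}

# Assumed grouping of long-form Slurm node states into generic statuses.
_HEALTHY = {"IDLE", "ALLOCATED", "MIXED", "COMPLETING"}
_DEGRADED = {"DRAINING", "FAILING"}
_DRAINED = {"DRAINED"}
_FAILED = {"DOWN", "FAIL"}


def generic_status(slurm_state: str) -> str:
    """Map a Slurm node state string to healthy, degraded, drained, or failed."""
    raw = slurm_state.strip()
    match = re.match(r"[A-Za-z]+", raw)
    if match is None:
        # Empty or unparseable state: report degraded rather than guess.
        return "degraded"
    base = match.group(0).upper()
    not_responding = raw.endswith("*")
    if base in _FAILED:
        return "failed"
    if base in _DRAINED:
        return "drained"
    if base in _DEGRADED or not_responding:
        return "degraded"
    if base in _HEALTHY:
        return "healthy"
    # Unknown states are surfaced as degraded so operators look at them.
    return "degraded"
```

The point of the sketch is placement: this mapping lives in the Slurm app, and the platform only ever sees the four generic strings.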
Runtime-specific responsibilities¶
- choose and validate Slurm manifest/config structure
- render slurm.conf, munge, and related artifacts
- define controller bootstrap steps
- define worker join steps
- define worker drain/remove steps
- perform runtime-specific reconciliation after infrastructure actions complete
- populate adapter_detail for operator debugging without leaking Slurm semantics into the generic platform model
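To make the config-rendering responsibility concrete, here is a deliberately small rendering sketch. The function name, its inputs, and the default partition name are assumptions for illustration. It emits only a handful of standard slurm.conf parameters (ClusterName, SlurmctldHost, NodeName, PartitionName) and is not a complete or production-ready configuration; a real adapter would also need authentication, state, and logging settings.

```python
from typing import Dict, List


def render_slurm_conf(
    cluster_name: str,
    controller_host: str,
    workers: List[Dict],
    partition: str = "main",
) -> str:
    """Render a minimal slurm.conf fragment from member data.

    Each worker is a dict with "hostname" and "cpus" keys (hypothetical input shape).
    """
    if not workers:
        raise ValueError("at least one worker is required to define a partition")
    lines = [
        f"ClusterName={cluster_name}",
        f"SlurmctldHost={controller_host}",
    ]
    for worker in workers:
        lines.append(
            f"NodeName={worker['hostname']} CPUs={worker['cpus']} State=UNKNOWN"
        )
    node_list = ",".join(worker["hostname"] for worker in workers)
    lines.append(
        f"PartitionName={partition} Nodes={node_list} "
        "Default=YES MaxTime=INFINITE State=UP"
    )
    return "\n".join(lines) + "\n"
```

Because rendering is a pure function of member data in this sketch, the adapter can re-render after every add or remove and hand the result to the platform's artifact delivery primitive without the platform parsing it.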
Deployment and release responsibilities¶
- package the Slurm adapter/controller
- define its rollout/deploy process independently from GPUaaS core
- keep runtime-specific change cadence separate from platform contract changes
Contract Dependencies¶
This is the minimum stable interface between GPUaaS platform and the Slurm app.
The Slurm app depends on the platform for¶
- authenticated project-scoped API access
- service-account-compatible lifecycle operations
- artifact and secret access
- app-instance read/write lifecycle
- member read and member-operation envelopes
- node/allocation execution primitives
- correlation-first failure and audit evidence
The platform depends on the Slurm app for¶
- correct runtime implementation of member operations
- mapping generic operation status to runtime-specific progress
- runtime-specific health and reconciliation behavior
- adapter-specific detail surfaced for debugging
What must stay stable enough for independent development¶
- endpoint shapes and auth rules
- async operation envelope and statuses
- generic member status/read shape
- request/response idempotency expectations
- audit and correlation behavior
- adapter boundary around runtime-specific logic
Platform Deliverables Needed Before Slurm Implementation Starts¶
These platform deliverables are prerequisites for the first Slurm app implementation branch:
- additive OpenAPI support for:
  - app-instance members
  - app-instance member operations
- explicit RBAC/authz policy for member operations
- documented node/allocation composition rule for app adapters
- documented service-account usage model for app automation
Slurm Deliverables Built On Top Of That Contract¶
These are app-side deliverables that should stay outside the GPUaaS core platform:
- Slurm component-key contract
- controller bootstrap implementation
- worker join implementation
- worker drain/remove implementation
- Slurm-specific runtime detail mapping
- Slurm adapter deploy/runbook material
The first concrete app-side contract for those deliverables lives in:
- doc/architecture/Slurm_First_Slice_Adapter_Contract_v1.md
Boundary Tests¶
Before the first slice is considered healthy, we should be able to answer yes to all of these:
- Can the Slurm app be developed without changing the GPUaaS core for every runtime-specific behavior change?
- Can a project-scoped service account drive the public contract without hidden internal access?
- Can operators debug member operations through public read models and correlations?
- Can the Slurm app change its runtime logic without forcing a redesign of the core platform API?
- Can a future non-Slurm app reuse the same platform primitives without inheriting Slurm semantics?
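The last question can be partly exercised in code. The sketch below shows a platform-side dispatch function that knows nothing about Slurm, with a Slurm adapter and a second, purely hypothetical adapter plugged in behind the same interface. The `AppAdapter` protocol, both adapter classes, the `dispatch` function, and every step name are invented for this example.

```python
from typing import Dict, List, Protocol


class AppAdapter(Protocol):
    """What the platform needs from any app adapter (hypothetical interface)."""

    def plan(self, operation: str, component_key: str) -> List[str]: ...


class SlurmAdapter:
    # Illustrative step names only; the real steps belong to the Slurm app.
    _PLANS = {
        ("add", "worker"): ["deliver_config", "deliver_munge_key", "start_slurmd"],
        ("drain", "worker"): ["mark_node_drain", "wait_for_running_jobs"],
        ("remove", "worker"): ["stop_slurmd", "remove_node_from_config"],
    }

    def plan(self, operation: str, component_key: str) -> List[str]:
        try:
            return list(self._PLANS[(operation, component_key)])
        except KeyError:
            raise ValueError(
                f"unsupported operation {operation!r} for {component_key!r}"
            ) from None


class NotebookAdapter:
    """A made-up non-Slurm app that reuses the same interface."""

    def plan(self, operation: str, component_key: str) -> List[str]:
        if operation not in ("add", "remove"):
            raise ValueError(f"unsupported operation {operation!r}")
        return [f"{operation}_container:{component_key}"]


def dispatch(
    adapters: Dict[str, AppAdapter],
    app_type: str,
    operation: str,
    component_key: str,
) -> List[str]:
    # Platform side: no Slurm vocabulary appears here.
    return adapters[app_type].plan(operation, component_key)
```

If adding the second adapter requires no change to `dispatch`, that is weak but useful evidence that the boundary holds.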
Immediate Outcome¶
The current conclusion is:
- the first Slurm slice should be split into platform lane and app lane
- only the contract boundary belongs in the shared core
- runtime implementation belongs in the independently deployable Slurm app