# Clustered App Model v1

## Purpose
Define the generic app-platform model required for example clustered apps such as Slurm, Ray, future K8s-style control stacks, and other direct multi-node runtimes.
This document exists to keep two boundaries explicit before implementation starts:

1. platform capability vs example-app behavior
2. app-developer contract vs platform-owned infrastructure realization
Reading order:
1. Clustered_App_Model_v1.md - long-term model and boundary goals
2. App_Platform_Primitive_Boundary_v1.md - guardrail against premature platform expansion
3. App_Platform_Clustered_App_Gap_Table_v1.md - concrete classification checklist before implementation
4. App_Platform_Core_For_First_Slurm_Slice_v1.md - minimum core required before the first Slurm slice
This is the design baseline that should be reviewed before implementing the first Slurm adapter slice.
Use this together with:
- doc/architecture/App_Control_Plane_v1.md
- doc/architecture/Scheduler_as_Platform_App_v1.md
- doc/architecture/App_Runtime_Operating_Modes_v1.md
- doc/architecture/Slurm_App_Runtime_Adapter_v1.md
- doc/architecture/App_Platform_Primitive_Boundary_v1.md
- doc/architecture/App_Platform_Clustered_App_Gap_Table_v1.md
- doc/architecture/App_Platform_Core_For_First_Slurm_Slice_v1.md
- doc/architecture/Build_an_App_for_GPUaaS_v1.md
## Decision Summary
- Example scheduler/control apps exist to validate the public App Platform API and SDK, not to justify special-case internal control paths.
- The core platform must support clustered apps as a generic concept, not only single-unit app instances.
- Tenant admins or project admins choose the desired logical app topology; the platform chooses the physical nodes and execution path.
- Project-scoped service accounts must be first-class callers for app lifecycle automation in the same model used by human operators.
- Core app contracts must stay language-neutral and usable by non-Go developers; SDKs are convenience layers over the public API, not the source of truth.
- The platform must resist promoting adapter-specific topology semantics into the generic API until a reusable primitive need is proven.
## Why This Exists

The current app-platform baseline is strong on:

- artifact lifecycle
- publish/promotion/trust
- runtime secret issuance
- service-account ownership
- app-instance lifecycle

But the current contract is still too single-instance shaped for clustered example apps. Slurm, Ray, and future direct multi-node runtimes all require:

- logical topology declaration
- mutable member lifecycle
- operator/service-account driven automation
- app-specific clustered behavior behind a generic platform boundary
Without this document, the first example app risks becoming an internal demo that proves only special-case backend knowledge rather than proving the platform is usable by real app developers.
## Platform vs App Boundary

### Platform owns
- app catalog, versions, artifacts, trust, and promotion
- app-instance ownership, tenancy, and IAM evaluation
- service-account issuance and project-scoped automation identity
- runtime secret issuance and custody
- allocation realization and infrastructure orchestration
- lifecycle orchestration primitives
- audit, observability, and billing hooks
- safe add/drain/remove execution on infrastructure
### Example app / adapter owns
- app-specific topology semantics
- app-specific config rendering
- role-specific join/leave/bootstrap logic
- app-specific health checks and readiness thresholds
- app-specific safety rules for scale, drain, remove, and upgrade ordering
- app-specific runtime metadata surfaced through generic platform read models
Rule:

- if a new requirement is scheduler/database/runtime-specific, it belongs in the adapter unless multiple app classes need the same primitive
- if multiple app classes need it, it is a platform capability and should be modeled generically
## App-Developer vs Platform-Operator Boundary
The example app is meant to validate the experience for app developers using the public platform surface.
### App developer should be able to do
- publish verified artifacts
- choose a supported logical topology
- create and manage app instances
- use a project-scoped service account for automation
- request scale/add/remove actions through public APIs
- observe state and failures through public read models and events
### App developer should not need to do
- set internal host-role labels
- understand `/etc/hosts` or bootstrap networking internals
- choose exact nodes by control-plane implementation detail
- use hidden internal-only lifecycle endpoints
- read backend Go code to understand the runtime contract
### Platform operator should do
- map logical topology to physical nodes
- enforce safe lifecycle sequencing
- choose the bootstrap and secret-delivery path
- preserve audit, policy, and supportability invariants
## Actor Model

### Human actors

- `tenant_admin`
- `project_admin`

### Automation actor

- project-scoped `service_account`

### Baseline rules

- `tenant_admin` may manage app instances and app topology within the tenant boundary.
- `project_admin` may manage project-scoped app instances and topology inside that project.
- `service_account` may perform app lifecycle actions only inside its own project and only through explicitly allowlisted permissions.
- Example apps must not rely on internal platform-admin-only paths for normal lifecycle automation.
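The baseline rules above can be sketched as a single authorization check. This is an illustrative sketch, not the platform's real IAM engine; the actor names come from the actor model, but the `Allowlist` field, action strings, and function signature are assumptions.

```go
package main

import "fmt"

// ActorKind values mirror the actor model above.
type ActorKind string

const (
	TenantAdmin    ActorKind = "tenant_admin"
	ProjectAdmin   ActorKind = "project_admin"
	ServiceAccount ActorKind = "service_account"
)

type Actor struct {
	Kind      ActorKind
	TenantID  string
	ProjectID string          // empty for tenant_admin
	Allowlist map[string]bool // explicit permissions; service accounts only
}

// CanManageInstance applies the baseline rules: tenant admins act within
// their tenant, project admins within their project, and service accounts
// only inside their own project and only for allowlisted actions.
func CanManageInstance(a Actor, tenantID, projectID, action string) bool {
	switch a.Kind {
	case TenantAdmin:
		return a.TenantID == tenantID
	case ProjectAdmin:
		return a.TenantID == tenantID && a.ProjectID == projectID
	case ServiceAccount:
		return a.TenantID == tenantID && a.ProjectID == projectID && a.Allowlist[action]
	}
	return false
}

func main() {
	sa := Actor{Kind: ServiceAccount, TenantID: "t1", ProjectID: "p1",
		Allowlist: map[string]bool{"app_instance.scale": true}}
	fmt.Println(CanManageInstance(sa, "t1", "p1", "app_instance.scale"))  // true
	fmt.Println(CanManageInstance(sa, "t1", "p2", "app_instance.scale"))  // false: wrong project
	fmt.Println(CanManageInstance(sa, "t1", "p1", "app_instance.delete")) // false: not allowlisted
}
```

Note that the service-account path is deliberately the narrowest: project match plus an explicit allowlist entry, never an implicit admin inheritance.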
## Generic Clustered App Model

The platform should model a clustered app as one app instance with one or more logical component roles.

### Generic concepts
- `topology_mode`
  - examples for first adapters may include `single_node` and `split`
  - this is currently an adapter-level concept, not a committed generic platform enum
  - future promotion into the core contract should happen only if multiple real runtimes prove the same need
- `component_role`
  - an adapter-owned named logical role within the app topology
  - examples:
    - Slurm: `controller`, `worker`
    - Ray: `head`, `worker`
    - K8s: `control_plane`, `worker`
    - Kafka: `broker`, optionally `controller`
- `member`
  - one realized runtime member for a component role
  - bound to a physical node or runtime target by the platform
  - a generic member status/read surface may become a platform primitive using a stable adapter-supplied identifier such as `component_key`
  - the human/runtime meaning of the component that a member belongs to remains adapter-owned unless proven reusable
- `desired_membership`
  - target count and coarse allocation intent for each role
- `runtime_state`
  - adapter-owned health and status details per role/member
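The generic concepts above can be pictured as a read model. The following Go types are a sketch only: field and type names are assumptions for illustration, not the committed OpenAPI contract.

```go
package main

import "fmt"

// ComponentRole summarizes one adapter-owned logical role plus its
// desired_membership target.
type ComponentRole struct {
	Name         string // adapter-owned, e.g. "controller" or "head"
	DesiredCount int    // desired_membership target for this role
	CurrentCount int
}

// Member is one realized runtime member of a component role.
type Member struct {
	ID           string
	Role         string // which component_role this member realizes
	ComponentKey string // stable adapter-supplied identifier
	BoundNode    string // chosen by the platform, never by the caller
	Status       string // adapter-owned runtime_state summary
}

// AppInstance stays one instance; clustering appears as roles and members.
type AppInstance struct {
	ID           string
	TopologyMode string // adapter-level today, e.g. "single_node" or "split"
	Roles        []ComponentRole
	Members      []Member
}

func main() {
	inst := AppInstance{
		ID:           "app-123",
		TopologyMode: "split",
		Roles: []ComponentRole{
			{Name: "controller", DesiredCount: 1, CurrentCount: 1},
			{Name: "worker", DesiredCount: 3, CurrentCount: 2},
		},
	}
	for _, r := range inst.Roles {
		fmt.Printf("%s: %d/%d\n", r.Name, r.CurrentCount, r.DesiredCount)
	}
}
```

The key property of the shape is that nothing in it is Slurm-specific: role names and statuses are opaque strings the adapter owns.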
### Key principle
The generic model should support multi-role distributed apps. It must not be limited to one controller plus many workers, even though that is a valid first adapter pattern.
## Topology Ownership

### Tenant/project admin chooses

- `topology_mode`
- intended logical roles and desired counts within the supported adapter contract
- allocation intent at the app level
### Platform chooses
- exact physical nodes
- join/bootstrap execution path
- secret delivery path
- safe lifecycle sequencing
That means topology is app-level and user-visible, while physical node realization remains platform-owned.
## Minimum Generic Lifecycle Operations
For clustered apps, the generic platform model should be able to express:
- create app instance
- read app instance and runtime status
- upgrade app instance
- rollback app instance
- decommission app instance
- add members to a role
- drain a member
- remove a member
- replace a failed member
The platform owns the operation lifecycle and audit trail; the adapter owns the app-specific safety semantics.
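That split can be made concrete with a small seam: the platform records and drives the operation, but consults an adapter-owned policy before acting. Everything here is an illustrative sketch; the interface, method names, and the `slurmPolicy` rule are assumptions, not the real adapter contract.

```go
package main

import (
	"errors"
	"fmt"
)

type MemberAction string

const (
	ActionAdd     MemberAction = "add"
	ActionDrain   MemberAction = "drain"
	ActionRemove  MemberAction = "remove"
	ActionReplace MemberAction = "replace"
)

// SafetyPolicy is the adapter-owned seam: app-specific rules such as
// "never remove the last controller" live here, not in the platform.
type SafetyPolicy interface {
	Allow(action MemberAction, role string, currentCount int) error
}

// Operation is the platform-owned durable record with audit identity.
type Operation struct {
	ID     string
	Action MemberAction
	Role   string
	Status string
}

// Execute shows the split of responsibilities: the platform owns the
// record and the audit trail but defers safety to the adapter.
func Execute(p SafetyPolicy, op *Operation, currentCount int) {
	if err := p.Allow(op.Action, op.Role, currentCount); err != nil {
		op.Status = "rejected: " + err.Error()
		return
	}
	op.Status = "accepted" // real execution would continue asynchronously
}

// slurmPolicy is a hypothetical rule set for illustration only.
type slurmPolicy struct{}

func (slurmPolicy) Allow(a MemberAction, role string, count int) error {
	if a == ActionRemove && role == "controller" && count <= 1 {
		return errors.New("cannot remove the last controller")
	}
	return nil
}

func main() {
	op := Operation{ID: "op-1", Action: ActionRemove, Role: "controller"}
	Execute(slurmPolicy{}, &op, 1)
	fmt.Println(op.Status) // rejected: cannot remove the last controller
}
```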
### Failure and compensation expectation
Clustered member operations must not assume happy-path completion only.
Baseline rule:

- the platform owns operation identity, audit, correlation, and durable status reporting
- the adapter owns runtime-specific compensation and recovery semantics for partial failures
Examples:
- add member accepted but join fails
- drain member starts but the runtime never reaches a safe drained state
- remove member succeeds in infrastructure terms but leaves runtime reconciliation work incomplete
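The add-member example can be expressed as a small durable status machine: infrastructure success followed by a runtime join failure must route into adapter compensation rather than being reported as success. The status names below are assumptions for illustration.

```go
package main

import "fmt"

// OpStatus sketches durable statuses the platform could record for a
// partially failed "add member" operation. Names are illustrative.
type OpStatus string

const (
	Accepted          OpStatus = "accepted"
	InfraComplete     OpStatus = "infrastructure_complete"
	RuntimeJoinFailed OpStatus = "runtime_join_failed" // infra ok, join failed
	Compensating      OpStatus = "compensating"        // adapter-owned recovery running
	Succeeded         OpStatus = "succeeded"
)

// Next advances the add-member operation. The important transition is
// that a failed join after successful infrastructure work surfaces as
// a distinct status and then enters compensation.
func Next(s OpStatus, joinOK bool) OpStatus {
	switch s {
	case Accepted:
		return InfraComplete
	case InfraComplete:
		if joinOK {
			return Succeeded
		}
		return RuntimeJoinFailed
	case RuntimeJoinFailed:
		return Compensating
	}
	return s
}

func main() {
	s := Accepted
	for _, joinOK := range []bool{false, false, false} {
		s = Next(s, joinOK)
	}
	fmt.Println(s) // compensating
}
```

The same pattern covers the drain and remove examples: each gets its own "infrastructure done, runtime incomplete" status instead of a silent happy-path terminal state.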
## Runtime Access Model
Some example apps may need operator access after deployment for integration, repair, or ecosystem-specific workflows.
Current reusable primitive:

- node/allocation SSH key synchronization already exists and can update access material after a runtime is active
Design rules:

- if multiple app classes need post-bootstrap runtime access material, it should become a generic app-platform capability rather than being borrowed implicitly from allocation semantics
- until that is proven, runtime access should be treated as "prove with the first example app", not as an automatic core primitive

This should stay separate from:

- node bootstrap trust delivery
- environment DNS/host preparation
- raw host-specific shell assumptions
## API Shape Review Against Current Baseline

### What the current API already does well

The current app-platform API supports:

- app catalog and version metadata
- app artifacts and trust/promotion lifecycle
- project entitlements
- project-scoped app instances
- project-scoped operator service account selection
- app runtime secret issuance
- async lifecycle transitions for deploy/upgrade/rollback/decommission
### Current gaps in AppInstance

The current AppInstance contract in `doc/api/openapi.draft.yaml` is still effectively single-resource shaped:
- one instance
- one status
- no first-class topology
- no first-class component roles
- no first-class member set
This is sufficient for:

- single runtime deployment
- basic async lifecycle

It is not sufficient for:

- controller/worker membership
- scale-out and scale-in semantics
- per-role runtime state
- member-level drain/remove/replace
### Current gaps in routes

Existing routes cover:

- create/list/get app instance
- upgrade/rollback/decommission
- issue runtime secrets

Missing generic clustered-app surfaces:

- role/member read model
- role/member lifecycle actions
- topology declaration beyond free-form config
- operation-specific status surfaces for role/member changes
### Current gaps in runtime secret contract

`IssueAppInstanceRuntimeSecretRequest` currently only supports:

- `purpose = artifact_pull`
That is too narrow if direct clustered example apps need other runtime/operator materials through the same platform path.
The platform should decide intentionally whether future purposes belong here, for example:

- role join credentials
- runtime operator access material
- cluster bootstrap material
No expansion should be added until the boundary is explicit, but the current limitation is a known constraint.
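One way to keep that constraint explicit in code is a single validation point where the supported purposes live, with the candidates visible but disabled. This is a sketch under that assumption; the candidate purpose strings are illustrative, not decided contract values.

```go
package main

import "fmt"

// allowedPurposes mirrors the current contract: only artifact_pull is
// supported. The commented entries are the candidate purposes from the
// list above, intentionally NOT enabled until the boundary is decided.
var allowedPurposes = map[string]bool{
	"artifact_pull": true,
	// "role_join_credentials"      — candidate, not yet decided
	// "runtime_operator_access"    — candidate, not yet decided
	// "cluster_bootstrap_material" — candidate, not yet decided
}

// ValidatePurpose rejects anything outside the committed contract so
// that expansion requires an explicit code and contract change.
func ValidatePurpose(p string) error {
	if !allowedPurposes[p] {
		return fmt.Errorf("unsupported runtime secret purpose %q", p)
	}
	return nil
}

func main() {
	fmt.Println(ValidatePurpose("artifact_pull"))         // <nil>
	fmt.Println(ValidatePurpose("role_join_credentials")) // error until decided
}
```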
### Current gaps in IAM documentation

Existing app-platform docs correctly emphasize service accounts, but the permission baseline for clustered app operations is still underspecified for:

- tenant admin scaling actions
- project admin scaling actions
- service-account-driven role/member lifecycle
That needs to be made explicit before implementation.
## Recommended Contract Direction
The next contract iteration should keep the generic app-instance lifecycle while adding a generic clustered-app layer.
### Generic additions the platform likely needs
- app-instance topology fields
  - effective topology mode
  - adapter-owned role set summary
- component-role read model
  - role name
  - desired count
  - current count
  - health summary
- member read model
  - member id
  - role name
  - status
  - bound node/resource
  - last operation and correlation id
- role/member lifecycle actions
  - add
  - drain
  - remove
  - replace
### What should remain adapter-specific
- Slurm config file details
- Ray bootstrap specifics
- K8s cluster join specifics
- Kafka or MongoDB quorum logic
- exact health probes and reconciliation behavior
## How Slurm Should Use This
Slurm is the first proving adapter, not the end state.
Slurm should validate that the platform can support:

- app-level topology
- controller + mutable worker membership
- service-account automation
- operator-safe worker drain/remove
- language-neutral lifecycle use through public APIs
If a capability is required by Slurm and is also plausibly needed by Ray, K8s, or another direct clustered app, that capability should be modeled generically in the platform.
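The seam Slurm exercises can be sketched as an adapter interface the platform calls but never looks inside. Everything here is a toy illustration: the interface, its method names, and the drain rule are assumptions, not the Slurm adapter contract defined in `Slurm_App_Runtime_Adapter_v1.md`.

```go
package main

import (
	"errors"
	"fmt"
)

// ClusterAdapter is a hypothetical seam between the generic platform
// and an app-specific adapter such as Slurm's.
type ClusterAdapter interface {
	// RenderConfig turns generic topology into app-specific config
	// (for Slurm, slurm.conf contents); the platform treats the result
	// as opaque and never parses it.
	RenderConfig(topologyMode string, desired map[string]int) (string, error)
	// SafeToScaleIn encodes app-specific ordering rules, e.g. a worker
	// must be drained before removal.
	SafeToScaleIn(role string, drained bool) error
}

// slurmAdapter is a toy stand-in, not the real adapter.
type slurmAdapter struct{}

func (slurmAdapter) RenderConfig(mode string, desired map[string]int) (string, error) {
	// Real rendering would emit slurm.conf; this stub only shows the shape.
	return fmt.Sprintf("# mode=%s workers=%d", mode, desired["worker"]), nil
}

func (slurmAdapter) SafeToScaleIn(role string, drained bool) error {
	if role == "worker" && !drained {
		return errors.New("worker must be drained before removal")
	}
	return nil
}

func main() {
	var a ClusterAdapter = slurmAdapter{}
	cfg, _ := a.RenderConfig("split", map[string]int{"controller": 1, "worker": 3})
	fmt.Println(cfg)
	fmt.Println(a.SafeToScaleIn("worker", false))
}
```

If Ray or a K8s-style adapter can implement the same interface without changes, that is evidence the seam belongs in the platform; if not, it stays adapter-side per the rule above.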
## How Future Direct Clustered Apps Fit

This model should be sufficient for direct app-platform runtimes such as:

- Slurm
- Ray
- K8s
- future direct Kafka or MongoDB-style adapters

It does not require the core platform to know application-specific distributed-system rules. Instead:

- the platform models clustered topology and lifecycle primitives
- the adapter enforces app-specific safety and runtime semantics
## Immediate Follow-On Work
Before starting Slurm runtime implementation:
- review the public API against this clustered-app model
- define the first additive OpenAPI changes needed for topology and component/member visibility
- define the IAM matrix for tenant admin, project admin, and service account actions
- define whether runtime access material beyond `artifact_pull` belongs in the generic runtime secret model
- update the Slurm adapter contract to consume the clustered-app model explicitly
## Related Docs

- doc/architecture/App_Control_Plane_v1.md
- doc/architecture/Scheduler_as_Platform_App_v1.md
- doc/architecture/App_Runtime_Operating_Modes_v1.md
- doc/architecture/Slurm_App_Runtime_Adapter_v1.md
- doc/architecture/App_Platform_Quickstart_v1.md
- doc/architecture/Build_an_App_for_GPUaaS_v1.md