
Clustered App Model v1

Purpose

Define the generic app-platform model required for example clustered apps such as Slurm, Ray, future K8s-style control stacks, and other direct multi-node runtimes.

This document exists to keep two boundaries explicit before implementation starts:

  1. platform capability vs example-app behavior
  2. app-developer contract vs platform-owned infrastructure realization

Reading order:

  1. Clustered_App_Model_v1.md - long-term model and boundary goals
  2. App_Platform_Primitive_Boundary_v1.md - guardrail against premature platform expansion
  3. App_Platform_Clustered_App_Gap_Table_v1.md - concrete classification checklist before implementation
  4. App_Platform_Core_For_First_Slurm_Slice_v1.md - minimum core required before the first Slurm slice

This is the design baseline that should be reviewed before implementing the first Slurm adapter slice.

Use this together with:

  • doc/architecture/App_Control_Plane_v1.md
  • doc/architecture/Scheduler_as_Platform_App_v1.md
  • doc/architecture/App_Runtime_Operating_Modes_v1.md
  • doc/architecture/Slurm_App_Runtime_Adapter_v1.md
  • doc/architecture/App_Platform_Primitive_Boundary_v1.md
  • doc/architecture/App_Platform_Clustered_App_Gap_Table_v1.md
  • doc/architecture/App_Platform_Core_For_First_Slurm_Slice_v1.md
  • doc/architecture/Build_an_App_for_GPUaaS_v1.md

Decision Summary

  1. Example scheduler/control apps exist to validate the public App Platform API and SDK, not to justify special-case internal control paths.
  2. The core platform must support clustered apps as a generic concept, not only single-unit app instances.
  3. Tenant admins or project admins choose the desired logical app topology; the platform chooses the physical nodes and execution path.
  4. Project-scoped service accounts must be first-class callers for app lifecycle automation in the same model used by human operators.
  5. Core app contracts must stay language-neutral and usable by non-Go developers; SDKs are convenience layers over the public API, not the source of truth.
  6. The platform must resist promoting adapter-specific topology semantics into the generic API until a reusable primitive need is proven.

Why This Exists

The current app-platform baseline is strong on:

  • artifact lifecycle
  • publish/promotion/trust
  • runtime secret issuance
  • service-account ownership
  • app-instance lifecycle

But the current contract is still too single-instance shaped for clustered example apps. Slurm, Ray, and future direct multi-node runtimes all require:

  • logical topology declaration
  • mutable member lifecycle
  • operator/service-account driven automation
  • app-specific clustered behavior behind a generic platform boundary

Without this document, the first example app risks becoming an internal demo that proves only special-case backend knowledge rather than proving the platform is usable by real app developers.

Platform vs App Boundary

Platform owns

  • app catalog, versions, artifacts, trust, and promotion
  • app-instance ownership, tenancy, and IAM evaluation
  • service-account issuance and project-scoped automation identity
  • runtime secret issuance and custody
  • allocation realization and infrastructure orchestration
  • lifecycle orchestration primitives
  • audit, observability, and billing hooks
  • safe add/drain/remove execution on infrastructure

Example app / adapter owns

  • app-specific topology semantics
  • app-specific config rendering
  • role-specific join/leave/bootstrap logic
  • app-specific health checks and readiness thresholds
  • app-specific safety rules for scale, drain, remove, and upgrade ordering
  • app-specific runtime metadata surfaced through generic platform read models

Rule:

  • if a new requirement is scheduler/database/runtime-specific, it belongs in the adapter unless multiple app classes need the same primitive
  • if multiple app classes need it, it is a platform capability and should be modeled generically

App-Developer vs Platform-Operator Boundary

The example app is meant to validate the experience for app developers using the public platform surface.

App developer should be able to do

  • publish verified artifacts
  • choose a supported logical topology
  • create and manage app instances
  • use a project-scoped service account for automation
  • request scale/add/remove actions through public APIs
  • observe state and failures through public read models and events

App developer should not need to do

  • set internal host-role labels
  • understand /etc/hosts or bootstrap networking internals
  • choose exact nodes by control-plane implementation detail
  • use hidden internal-only lifecycle endpoints
  • read backend Go code to understand the runtime contract

Platform operator should do

  • map logical topology to physical nodes
  • enforce safe lifecycle sequencing
  • choose the bootstrap and secret-delivery path
  • preserve audit, policy, and supportability invariants

Actor Model

Human actors

  • tenant_admin
  • project_admin

Automation actor

  • project-scoped service_account

Baseline rules

  1. tenant_admin may manage app instances and app topology within the tenant boundary.
  2. project_admin may manage project-scoped app instances and topology inside that project.
  3. service_account may perform app lifecycle actions only inside its own project and only through explicit allowlisted permissions.
  4. Example apps must not rely on internal platform-admin-only paths for normal lifecycle automation.
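
As a hedged illustration of rule 3, a project-scoped allowlist could look like the sketch below. Every permission name and identifier here is an illustrative assumption, not a committed IAM identifier:

```yaml
# Hypothetical IAM allowlist sketch; permission names are illustrative
# assumptions, not committed platform identifiers.
service_account: sa-ml-automation
project: ml-research
allowlisted_permissions:
  - app_instance.read
  - app_instance.upgrade
  - app_member.add        # add members to a role in this project only
  - app_member.drain      # request a safe drain of one member
  - app_member.remove     # remove a drained member
```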

Generic Clustered App Model

The platform should model a clustered app as one app instance with one or more logical component roles.

Generic concepts

  1. topology_mode

    • examples for first adapters may include:
      • single_node
      • split
    • this is currently an adapter-level concept, not a committed generic platform enum
    • future promotion into the core contract should happen only if multiple real runtimes prove the same need

  2. component_role

    • an adapter-owned named logical role within the app topology
    • examples:
      • Slurm: controller, worker
      • Ray: head, worker
      • K8s: control_plane, worker
      • Kafka: broker, optionally controller

  3. member

    • one realized runtime member for a component role
    • bound to a physical node or runtime target by the platform
    • a generic member status/read surface may become a platform primitive using a stable adapter-supplied identifier such as component_key
    • the human/runtime meaning of the component that a member belongs to remains adapter-owned unless proven reusable

  4. desired_membership

    • target count and coarse allocation intent for each role

  5. runtime_state

    • adapter-owned health and status details per role/member
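
To make these concepts concrete, a desired-topology declaration for a Ray-style app could look like the sketch below. The field names (topology_mode, roles, desired_count, component_key) mirror the concepts above but are assumptions, not a committed schema:

```yaml
# Hypothetical desired-topology declaration; field names mirror the
# generic concepts above and are not a committed platform schema.
app_instance: ray-dev-01
topology_mode: split            # adapter-level concept for now
roles:                          # adapter-owned component_role set
  - name: head
    desired_count: 1
  - name: worker
    desired_count: 4
members:                        # realized by the platform, read-only
  - component_key: worker-0002  # stable adapter-supplied identifier
    role: worker
    status: ready
```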

Key principle

The generic model should support multi-role distributed apps. It must not be limited to one controller plus many workers, even though that is a valid first adapter pattern.

Topology Ownership

Tenant/project admin chooses

  • topology_mode
  • intended logical roles and desired counts within the supported adapter contract
  • allocation intent at the app level

Platform chooses

  • exact physical nodes
  • join/bootstrap execution path
  • secret delivery path
  • safe lifecycle sequencing

That means topology is app-level and user-visible, while physical node realization remains platform-owned.
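
One way to picture that split, sketched with assumed field names: the admin-authored spec carries only logical intent, while node bindings appear only in platform-owned status:

```yaml
# Hypothetical spec/status split; all field names are assumptions.
spec:                         # authored by tenant/project admin
  topology_mode: split
  roles:
    - name: worker
      desired_count: 4
  allocation_intent:
    resource_class: gpu-large # coarse intent only, never node names
status:                       # owned and reported by the platform
  members:
    - component_key: worker-0001
      bound_node: node-17c    # platform-selected physical node
      state: joining
```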

Minimum Generic Lifecycle Operations

For clustered apps, the generic platform model should be able to express:

  1. create app instance
  2. read app instance and runtime status
  3. upgrade app instance
  4. rollback app instance
  5. decommission app instance
  6. add members to a role
  7. drain a member
  8. remove a member
  9. replace a failed member

The platform owns the operation lifecycle and audit trail; the adapter owns the app-specific safety semantics.
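
To make the member-level operations concrete, they could surface as additive routes like the sketch below. These paths are discussion assumptions and do not exist in doc/api/openapi.draft.yaml:

```yaml
# Hypothetical additive routes; not present in openapi.draft.yaml.
paths:
  /app-instances/{instanceId}/roles/{roleName}/members:
    post:        # add members to a role; returns an async operation id
      summary: Add members to a role
  /app-instances/{instanceId}/members/{componentKey}:drain:
    post:        # begin a safe drain; adapter owns the safety semantics
      summary: Drain a member
  /app-instances/{instanceId}/members/{componentKey}:
    delete:      # remove a member after a successful drain
      summary: Remove a member
  /app-instances/{instanceId}/members/{componentKey}:replace:
    post:        # replace a failed member with a fresh realization
      summary: Replace a failed member
```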

Failure and compensation expectation

Clustered member operations must not assume happy-path-only completion.

Baseline rule:

  • the platform owns operation identity, audit, correlation, and durable status reporting
  • the adapter owns runtime-specific compensation and recovery semantics for partial failures

Examples:

  • add member accepted but join fails
  • drain member starts but the runtime never reaches a safe drained state
  • remove member succeeds in infrastructure terms but leaves runtime reconciliation work incomplete
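
A sketch of a durable operation record covering the drain example above; the status values and field names are illustrative assumptions:

```yaml
# Hypothetical durable operation record for a partially failed drain.
operation_id: op-7f3a
correlation_id: corr-91d2
kind: member.drain
target:
  app_instance: slurm-prod-01
  component_key: worker-0007
status: failed_needs_compensation  # platform-owned durable status
adapter_detail:                    # adapter-owned recovery semantics
  reason: runtime never reached a safe drained state
  suggested_action: retry_drain_or_replace
```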

Runtime Access Model

Some example apps may need operator access after deployment for integration, repair, or ecosystem-specific workflows.

Current reusable primitive: node/allocation SSH key synchronization already exists and can update access material after a runtime is active.

Design rule:

  • if multiple app classes need post-bootstrap runtime access material, it should become a generic app-platform capability rather than being borrowed implicitly from allocation semantics
  • until that is proven, runtime access should be treated as "prove with first example app", not as an automatic core primitive

This should stay separate from:

  • node bootstrap trust delivery
  • environment DNS/host preparation
  • raw host-specific shell assumptions

API Shape Review Against Current Baseline

What the current API already does well

The current app-platform API supports:

  • app catalog and version metadata
  • app artifacts and trust/promotion lifecycle
  • project entitlements
  • project-scoped app instances
  • project-scoped operator service account selection
  • app runtime secret issuance
  • async lifecycle transitions for deploy/upgrade/rollback/decommission

Current gaps in AppInstance

The current AppInstance contract in doc/api/openapi.draft.yaml is still effectively single-resource shaped:

  • one instance
  • one status
  • no first-class topology
  • no first-class component roles
  • no first-class member set

This is sufficient for:

  • single runtime deployment
  • basic async lifecycle

It is not sufficient for:

  • controller/worker membership
  • scale-out and scale-in semantics
  • per-role runtime state
  • member-level drain/remove/replace

Current gaps in routes

Existing routes cover:

  • create/list/get app instance
  • upgrade/rollback/decommission
  • issue runtime secrets

Missing generic clustered-app surfaces:

  • role/member read model
  • role/member lifecycle actions
  • topology declaration beyond free-form config
  • operation-specific status surfaces for role/member changes

Current gaps in runtime secret contract

IssueAppInstanceRuntimeSecretRequest currently supports only purpose = artifact_pull.

That is too narrow if direct clustered example apps need other runtime/operator materials through the same platform path.

The platform should decide intentionally whether future purposes belong here, for example:

  • role join credentials
  • runtime operator access material
  • cluster bootstrap material

No expansion should be added until the boundary is explicit, but the current limitation is a known constraint.
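
For orientation, the current request shape and the candidate purposes can be sketched as follows; the commented purposes are explicitly uncommitted:

```yaml
# Current contract: artifact_pull is the only supported purpose.
IssueAppInstanceRuntimeSecretRequest:
  purpose: artifact_pull
# Candidate future purposes under discussion, NOT part of the contract:
#   purpose: role_join_credentials
#   purpose: runtime_operator_access
#   purpose: cluster_bootstrap_material
```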

Current gaps in IAM documentation

Existing app-platform docs correctly emphasize service accounts, but the permission baseline for clustered app operations is still underspecified for:

  • tenant admin scaling actions
  • project admin scaling actions
  • service-account-driven role/member lifecycle

That needs to be made explicit before implementation.

The next contract iteration should keep the generic app-instance lifecycle while adding a generic clustered-app layer.

Generic additions the platform likely needs

  1. app-instance topology fields

    • effective topology mode
    • adapter-owned role set summary

  2. component-role read model

    • role name
    • desired count
    • current count
    • health summary

  3. member read model

    • member id
    • role name
    • status
    • bound node/resource
    • last operation and correlation id

  4. role/member lifecycle actions

    • add
    • drain
    • remove
    • replace
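
A hedged OpenAPI-style sketch of these additions; every schema and field name below is an assumption to be validated against doc/api/openapi.draft.yaml, not a committed contract:

```yaml
# Hypothetical additive schemas; names are assumptions, not committed.
components:
  schemas:
    AppInstanceTopology:
      type: object
      properties:
        topology_mode: { type: string }     # effective mode
        roles:
          type: array
          items: { $ref: '#/components/schemas/ComponentRoleSummary' }
    ComponentRoleSummary:
      type: object
      properties:
        name: { type: string }
        desired_count: { type: integer }
        current_count: { type: integer }
        health: { type: string }            # coarse summary only
    AppInstanceMember:
      type: object
      properties:
        component_key: { type: string }     # stable adapter-supplied id
        role: { type: string }
        status: { type: string }
        bound_resource: { type: string }    # platform-owned binding
        last_operation_id: { type: string } # audit correlation
```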

What should remain adapter-specific

  • Slurm config file details
  • Ray bootstrap specifics
  • K8s cluster join specifics
  • Kafka or MongoDB quorum logic
  • exact health probes and reconciliation behavior

How Slurm Should Use This

Slurm is the first proving adapter, not the end state.

Slurm should validate that the platform can support:

  • app-level topology
  • controller + mutable worker membership
  • service-account automation
  • operator-safe worker drain/remove
  • language-neutral lifecycle use through public APIs

If a capability is required by Slurm and is also plausibly needed by Ray, K8s, or another direct clustered app, that capability should be modeled generically in the platform.
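
Under the generic model, the first Slurm slice could be declared as below; all values and field names are illustrative:

```yaml
# Hypothetical first-slice Slurm declaration using the generic model.
app: slurm
topology_mode: split
roles:
  - name: controller
    desired_count: 1              # single controller for the first slice
  - name: worker
    desired_count: 8              # mutable membership via add/drain/remove
automation:
  service_account: sa-slurm-ops   # project-scoped automation caller
```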

How Future Direct Clustered Apps Fit

This model should be sufficient for direct app-platform runtimes such as:

  • Slurm
  • Ray
  • K8s
  • future direct Kafka or MongoDB-style adapters

It does not require the core platform to know application-specific distributed-system rules. Instead:

  • the platform models clustered topology and lifecycle primitives
  • the adapter enforces app-specific safety and runtime semantics

Immediate Follow-On Work

Before starting Slurm runtime implementation:

  1. review the public API against this clustered-app model
  2. define the first additive OpenAPI changes needed for topology and component/member visibility
  3. define the IAM matrix for tenant admin, project admin, and service account actions
  4. define whether runtime access material beyond artifact_pull belongs in the generic runtime secret model
  5. update the Slurm adapter contract to consume the clustered-app model explicitly

Related documents

  1. doc/architecture/App_Control_Plane_v1.md
  2. doc/architecture/Scheduler_as_Platform_App_v1.md
  3. doc/architecture/App_Runtime_Operating_Modes_v1.md
  4. doc/architecture/Slurm_App_Runtime_Adapter_v1.md
  5. doc/architecture/App_Platform_Quickstart_v1.md
  6. doc/architecture/Build_an_App_for_GPUaaS_v1.md