
Clustered App Model v1

Purpose

Define the generic app-platform model required for example clustered apps such as Slurm, Ray, future K8s-style control stacks, and other direct multi-node runtimes.

This document exists to keep two boundaries explicit before implementation starts:

  1. platform capability vs example-app behavior
  2. app-developer contract vs platform-owned infrastructure realization

Reading order:

  1. Clustered_App_Model_v1.md - long-term model and boundary goals
  2. App_Platform_Primitive_Boundary_v1.md - guardrail against premature platform expansion
  3. App_Platform_Clustered_App_Gap_Table_v1.md - concrete classification checklist before implementation
  4. App_Platform_Core_For_First_Slurm_Slice_v1.md - minimum core required before the first Slurm slice

This is the design baseline that should be reviewed before implementing the first Slurm adapter slice.

Use this together with:

  • doc/architecture/App_Control_Plane_v1.md
  • doc/architecture/Scheduler_as_Platform_App_v1.md
  • doc/architecture/App_Runtime_Operating_Modes_v1.md
  • doc/architecture/Slurm_App_Runtime_Adapter_v1.md
  • doc/architecture/App_Platform_Primitive_Boundary_v1.md
  • doc/architecture/App_Platform_Clustered_App_Gap_Table_v1.md
  • doc/architecture/App_Platform_Core_For_First_Slurm_Slice_v1.md
  • doc/architecture/Build_an_App_for_GPUaaS_v1.md

Decision Summary

  1. Example scheduler/control apps exist to validate the public App Platform API and SDK, not to justify special-case internal control paths.
  2. The core platform must support clustered apps as a generic concept, not only single-unit app instances.
  3. Tenant admins or project admins choose the desired logical app topology; the platform chooses the physical nodes and execution path.
  4. Project-scoped service accounts must be first-class callers for app lifecycle automation in the same model used by human operators.
  5. Core app contracts must stay language-neutral and usable by non-Go developers; SDKs are convenience layers over the public API, not the source of truth.
  6. The platform must resist promoting adapter-specific topology semantics into the generic API until a reusable primitive need is proven.

Why This Exists

The current app-platform baseline is strong on:

  • artifact lifecycle
  • publish/promotion/trust
  • runtime secret issuance
  • service-account ownership
  • app-instance lifecycle

But the current contract is still too single-instance shaped for clustered example apps. Slurm, Ray, and future direct multi-node runtimes all require:

  • logical topology declaration
  • mutable member lifecycle
  • operator/service-account driven automation
  • app-specific clustered behavior behind a generic platform boundary

Without this document, the first example app risks becoming an internal demo that proves only special-case backend knowledge rather than proving the platform is usable by real app developers.

Platform vs App Boundary

Platform owns

  • app catalog, versions, artifacts, trust, and promotion
  • app-instance ownership, tenancy, and IAM evaluation
  • service-account issuance and project-scoped automation identity
  • runtime secret issuance and custody
  • allocation realization and infrastructure orchestration
  • lifecycle orchestration primitives
  • audit, observability, and billing hooks
  • safe add/drain/remove execution on infrastructure

Example app / adapter owns

  • app-specific topology semantics
  • app-specific config rendering
  • role-specific join/leave/bootstrap logic
  • app-specific health checks and readiness thresholds
  • app-specific safety rules for scale, drain, remove, and upgrade ordering
  • app-specific runtime metadata surfaced through generic platform read models

Rule:

  • if a new requirement is scheduler/database/runtime-specific, it belongs in the adapter unless multiple app classes need the same primitive
  • if multiple app classes need it, it is a platform capability and should be modeled generically

App-Developer vs Platform-Operator Boundary

The example app is meant to validate the experience for app developers using the public platform surface.

App developer should be able to do

  • publish verified artifacts
  • choose a supported logical topology
  • create and manage app instances
  • use a project-scoped service account for automation
  • request scale/add/remove actions through public APIs
  • observe state and failures through public read models and events

App developer should not need to do

  • set internal host-role labels
  • understand /etc/hosts or bootstrap networking internals
  • choose exact nodes by control-plane implementation detail
  • use hidden internal-only lifecycle endpoints
  • read backend Go code to understand the runtime contract

Platform operator should do

  • map logical topology to physical nodes
  • enforce safe lifecycle sequencing
  • choose the bootstrap and secret-delivery path
  • preserve audit, policy, and supportability invariants

Actor Model

Human actors

  • tenant_admin
  • project_admin

Automation actor

  • project-scoped service_account

Baseline rules

  1. tenant_admin may manage app instances and app topology within the tenant boundary.
  2. project_admin may manage project-scoped app instances and topology inside that project.
  3. service_account may perform app lifecycle actions only inside its own project and only through explicit allowlisted permissions.
  4. Example apps must not rely on internal platform-admin-only paths for normal lifecycle automation.
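
As a hedged illustration of rule 3, a project-scoped allowlist could look like the sketch below. Every permission name and identifier here is an illustrative assumption, not a committed IAM identifier:

```yaml
# Hypothetical IAM allowlist sketch; permission names are illustrative
# assumptions, not committed platform identifiers.
service_account: sa-ml-automation
project: ml-research
allowlisted_permissions:
  - app_instance.read
  - app_instance.upgrade
  - app_member.add        # add members to a role in this project only
  - app_member.drain      # request a safe drain of one member
  - app_member.remove     # remove a drained member
```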

Generic Clustered App Model

The platform should model a clustered app as one app instance with one or more logical component roles.

Generic concepts

  1. topology_mode

    • examples for first adapters may include:
      • single_node
      • split
    • this is currently an adapter-level concept, not a committed generic platform enum
    • future promotion into the core contract should happen only if multiple real runtimes prove the same need

  2. component_role

    • an adapter-owned named logical role within the app topology
    • examples:
      • Slurm: controller, worker
      • Ray: head, worker
      • K8s: control_plane, worker
      • Kafka: broker, optionally controller

  3. member

    • one realized runtime member for a component role
    • bound to a physical node or runtime target by the platform
    • a generic member status/read surface may become a platform primitive using a stable adapter-supplied identifier such as component_key
    • the human/runtime meaning of the component that a member belongs to remains adapter-owned unless proven reusable

  4. desired_membership

    • target count and coarse allocation intent for each role

  5. runtime_state

    • adapter-owned health and status details per role/member
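
To make these concepts concrete, a desired-topology declaration for a Ray-style app could look like the sketch below. The field names (topology_mode, roles, desired_count, component_key) mirror the concepts above but are assumptions, not a committed schema:

```yaml
# Hypothetical desired-topology declaration; field names mirror the
# generic concepts above and are not a committed platform schema.
app_instance: ray-dev-01
topology_mode: split            # adapter-level concept for now
roles:                          # adapter-owned component_role set
  - name: head
    desired_count: 1
  - name: worker
    desired_count: 4
members:                        # realized by the platform, read-only
  - component_key: worker-0002  # stable adapter-supplied identifier
    role: worker
    status: ready
```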

Key principle

The generic model should support multi-role distributed apps. It must not be limited to one controller plus many workers, even though that is a valid first adapter pattern.

Topology Ownership

Tenant/project admin chooses

  • topology_mode
  • intended logical roles and desired counts within the supported adapter contract
  • allocation intent at the app level

Platform chooses

  • exact physical nodes
  • join/bootstrap execution path
  • secret delivery path
  • safe lifecycle sequencing

That means topology is app-level and user-visible, while physical node realization remains platform-owned.
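
One way to picture that split, sketched with assumed field names: the admin-authored spec carries only logical intent, while node bindings appear only in platform-owned status:

```yaml
# Hypothetical spec/status split; all field names are assumptions.
spec:                         # authored by tenant/project admin
  topology_mode: split
  roles:
    - name: worker
      desired_count: 4
  allocation_intent:
    resource_class: gpu-large # coarse intent only, never node names
status:                       # owned and reported by the platform
  members:
    - component_key: worker-0001
      bound_node: node-17c    # platform-selected physical node
      state: joining
```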

Minimum Generic Lifecycle Operations

For clustered apps, the generic platform model should be able to express:

  1. create app instance
  2. read app instance and runtime status
  3. upgrade app instance
  4. rollback app instance
  5. decommission app instance
  6. add members to a role
  7. drain a member
  8. remove a member
  9. replace a failed member

The platform owns the operation lifecycle and audit trail; the adapter owns the app-specific safety semantics.
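
To make the member-level operations concrete, they could surface as additive routes like the sketch below. These paths are discussion assumptions and do not exist in doc/api/openapi.draft.yaml:

```yaml
# Hypothetical additive routes; not present in openapi.draft.yaml.
paths:
  /app-instances/{instanceId}/roles/{roleName}/members:
    post:        # add members to a role; returns an async operation id
      summary: Add members to a role
  /app-instances/{instanceId}/members/{componentKey}:drain:
    post:        # begin a safe drain; adapter owns the safety semantics
      summary: Drain a member
  /app-instances/{instanceId}/members/{componentKey}:
    delete:      # remove a member after a successful drain
      summary: Remove a member
  /app-instances/{instanceId}/members/{componentKey}:replace:
    post:        # replace a failed member with a fresh realization
      summary: Replace a failed member
```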

Failure and compensation expectation

Clustered member operations must not assume happy-path-only completion.

Baseline rule:

  • the platform owns operation identity, audit, correlation, and durable status reporting
  • the adapter owns runtime-specific compensation and recovery semantics for partial failures

Examples:

  • add member accepted but join fails
  • drain member starts but the runtime never reaches a safe drained state
  • remove member succeeds in infrastructure terms but leaves runtime reconciliation work incomplete
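
A sketch of a durable operation record covering the drain example above; the status values and field names are illustrative assumptions:

```yaml
# Hypothetical durable operation record for a partially failed drain.
operation_id: op-7f3a
correlation_id: corr-91d2
kind: member.drain
target:
  app_instance: slurm-prod-01
  component_key: worker-0007
status: failed_needs_compensation  # platform-owned durable status
adapter_detail:                    # adapter-owned recovery semantics
  reason: runtime never reached a safe drained state
  suggested_action: retry_drain_or_replace
```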

Runtime Access Model

Some example apps may need operator access after deployment for integration, repair, or ecosystem-specific workflows.

Current reusable primitive: node/allocation SSH key synchronization already exists and can update access material after a runtime is active.

Design rule:

  • if multiple app classes need post-bootstrap runtime access material, it should become a generic app-platform capability rather than being borrowed implicitly from allocation semantics
  • until that is proven, runtime access should be treated as "prove with first example app", not as an automatic core primitive

This should stay separate from:

  • node bootstrap trust delivery
  • environment DNS/host preparation
  • raw host-specific shell assumptions

API Shape Review Against Current Baseline

What the current API already does well

The current app-platform API supports:

  • app catalog and version metadata
  • app artifacts and trust/promotion lifecycle
  • project entitlements
  • project-scoped app instances
  • project-scoped operator service account selection
  • app runtime secret issuance
  • async lifecycle transitions for deploy/upgrade/rollback/decommission

Current gaps in AppInstance

The current AppInstance contract in doc/api/openapi.draft.yaml is still effectively single-resource shaped:

  • one instance
  • one status
  • no first-class topology
  • no first-class component roles
  • no first-class member set

This is sufficient for:

  • single runtime deployment
  • basic async lifecycle

It is not sufficient for:

  • controller/worker membership
  • scale-out and scale-in semantics
  • per-role runtime state
  • member-level drain/remove/replace

Current gaps in routes

Existing routes cover:

  • create/list/get app instance
  • upgrade/rollback/decommission
  • issue runtime secrets

Missing generic clustered-app surfaces:

  • role/member read model
  • role/member lifecycle actions
  • topology declaration beyond free-form config
  • operation-specific status surfaces for role/member changes

Current gaps in runtime secret contract

IssueAppInstanceRuntimeSecretRequest currently supports only purpose = artifact_pull.

That is too narrow if direct clustered example apps need other runtime/operator materials through the same platform path.

The platform should decide intentionally whether future purposes belong here, for example:

  • role join credentials
  • runtime operator access material
  • cluster bootstrap material

No expansion should be added until the boundary is explicit, but the current limitation is a known constraint.
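
For orientation, the current request shape and the candidate purposes can be sketched as follows; the commented purposes are explicitly uncommitted:

```yaml
# Current contract: artifact_pull is the only supported purpose.
IssueAppInstanceRuntimeSecretRequest:
  purpose: artifact_pull
# Candidate future purposes under discussion, NOT part of the contract:
#   purpose: role_join_credentials
#   purpose: runtime_operator_access
#   purpose: cluster_bootstrap_material
```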

Current gaps in IAM documentation

Existing app-platform docs correctly emphasize service accounts, but the permission baseline for clustered app operations is still underspecified for:

  • tenant admin scaling actions
  • project admin scaling actions
  • service-account-driven role/member lifecycle

That needs to be made explicit before implementation.

The next contract iteration should keep the generic app-instance lifecycle while adding a generic clustered-app layer.

Generic additions the platform likely needs

  1. app-instance topology fields

    • effective topology mode
    • adapter-owned role set summary

  2. component-role read model

    • role name
    • desired count
    • current count
    • health summary

  3. member read model

    • member id
    • role name
    • status
    • bound node/resource
    • last operation and correlation id

  4. role/member lifecycle actions

    • add
    • drain
    • remove
    • replace
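
A hedged OpenAPI-style sketch of these additions; every schema and field name below is an assumption to be validated against doc/api/openapi.draft.yaml, not a committed contract:

```yaml
# Hypothetical additive schemas; names are assumptions, not committed.
components:
  schemas:
    AppInstanceTopology:
      type: object
      properties:
        topology_mode: { type: string }     # effective mode
        roles:
          type: array
          items: { $ref: '#/components/schemas/ComponentRoleSummary' }
    ComponentRoleSummary:
      type: object
      properties:
        name: { type: string }
        desired_count: { type: integer }
        current_count: { type: integer }
        health: { type: string }            # coarse summary only
    AppInstanceMember:
      type: object
      properties:
        component_key: { type: string }     # stable adapter-supplied id
        role: { type: string }
        status: { type: string }
        bound_resource: { type: string }    # platform-owned binding
        last_operation_id: { type: string } # audit correlation
```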

What should remain adapter-specific

  • Slurm config file details
  • Ray bootstrap specifics
  • K8s cluster join specifics
  • Kafka or MongoDB quorum logic
  • exact health probes and reconciliation behavior

How Slurm Should Use This

Slurm is the first proving adapter, not the end state.

Slurm should validate that the platform can support:

  • app-level topology
  • controller + mutable worker membership
  • service-account automation
  • operator-safe worker drain/remove
  • language-neutral lifecycle use through public APIs

If a capability is required by Slurm and is also plausibly needed by Ray, K8s, or another direct clustered app, that capability should be modeled generically in the platform.
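
Under the generic model, the first Slurm slice could be declared as below; all values and field names are illustrative:

```yaml
# Hypothetical first-slice Slurm declaration using the generic model.
app: slurm
topology_mode: split
roles:
  - name: controller
    desired_count: 1              # single controller for the first slice
  - name: worker
    desired_count: 8              # mutable membership via add/drain/remove
automation:
  service_account: sa-slurm-ops   # project-scoped automation caller
```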

How Future Direct Clustered Apps Fit

This model should be sufficient for direct app-platform runtimes such as:

  • Slurm
  • Ray
  • K8s
  • future direct Kafka or MongoDB-style adapters

It does not require the core platform to know application-specific distributed-system rules. Instead:

  • the platform models clustered topology and lifecycle primitives
  • the adapter enforces app-specific safety and runtime semantics

Immediate Follow-On Work

Before starting Slurm runtime implementation:

  1. review the public API against this clustered-app model
  2. define the first additive OpenAPI changes needed for topology and component/member visibility
  3. define the IAM matrix for tenant admin, project admin, and service account actions
  4. define whether runtime access material beyond artifact_pull belongs in the generic runtime secret model
  5. update the Slurm adapter contract to consume the clustered-app model explicitly

Related documents

  1. doc/architecture/App_Control_Plane_v1.md
  2. doc/architecture/Scheduler_as_Platform_App_v1.md
  3. doc/architecture/App_Runtime_Operating_Modes_v1.md
  4. doc/architecture/Slurm_App_Runtime_Adapter_v1.md
  5. doc/architecture/App_Platform_Quickstart_v1.md
  6. doc/architecture/Build_an_App_for_GPUaaS_v1.md