App Platform Clustered App Gap Table v1

Purpose

Turn the clustered-app and primitive-boundary discussions into a concrete review table before any Slurm implementation or additive OpenAPI change starts.

This table is intentionally not an implementation plan. It is a classification aid:

  - what the platform already has
  - what primitive is missing
  - what should stay adapter-owned
  - what should be deferred until there is stronger evidence

Reading order:

  1. Clustered_App_Model_v1.md
  2. App_Platform_Primitive_Boundary_v1.md
  3. App_Platform_Clustered_App_Gap_Table_v1.md
  4. App_Platform_Core_For_First_Slurm_Slice_v1.md

Use this with:

  - doc/architecture/Clustered_App_Model_v1.md
  - doc/architecture/App_Platform_Primitive_Boundary_v1.md
  - doc/architecture/App_Platform_Core_For_First_Slurm_Slice_v1.md
  - doc/architecture/App_Control_Plane_v1.md
  - doc/architecture/Slurm_App_Runtime_Adapter_v1.md

Review Rule

Every item below must be classified as one of:

  1. core primitive needed
  2. adapter-owned concern
  3. prove with first example app
  4. deferred until proven

If a proposed change cannot be justified as a reusable platform primitive, it should not enter the core API yet.
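The review rule can be expressed as a small check. The following is a minimal sketch, not part of any existing platform code: the names `Classification`, `classify`, and `may_enter_core_api` are hypothetical and exist only to illustrate that every row takes exactly one of the four labels and that only one label admits a change into the core API.

```python
from enum import Enum


class Classification(Enum):
    """The four review outcomes allowed by the review rule (hypothetical type)."""

    CORE_PRIMITIVE_NEEDED = "core primitive needed"
    ADAPTER_OWNED_CONCERN = "adapter-owned concern"
    PROVE_WITH_FIRST_EXAMPLE_APP = "prove with first example app"
    DEFERRED_UNTIL_PROVEN = "deferred until proven"


def classify(label: str) -> Classification:
    """Map a table label to its classification.

    Whitespace and case are normalized; any label outside the four
    allowed values raises ValueError.
    """
    normalized = " ".join(label.strip().lower().split())
    return Classification(normalized)


def may_enter_core_api(classification: Classification) -> bool:
    """Only changes justified as reusable platform primitives enter the core API."""
    return classification is Classification.CORE_PRIMITIVE_NEEDED
```

Under this sketch, a row labeled "adapter-owned concern" parses successfully but is rejected by `may_enter_core_api`, while a row with an unlisted label fails at `classify` and has to be re-reviewed.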

Gap Table

| Area | Current capability | Gap / pressure | Classification | Notes |
| --- | --- | --- | --- | --- |
| Identity | User auth and project-scoped service-account auth exist | Need explicit clustered-app authz rules for tenant admin, project admin, and service account actions | core primitive needed | This is platform IAM, not adapter logic |
| Audit | App mutations already audit with correlation context | Role/member lifecycle actions for clustered apps are not yet defined, so audit coverage for them is also undefined | core primitive needed | Required before any mutable clustered lifecycle exists |
| App instance lifecycle | Create/read/upgrade/rollback/decommission exists | Current lifecycle is single-instance shaped; no primitive for member-level operations | core primitive needed | Keep generic lifecycle small, but it must support additional operation types |
| App instance schema | AppInstance exposes app-level metadata and status | No first-class surface for clustered runtime detail beyond opaque config/runtime state | core primitive needed | Likely additive detail/read-model only, not a universal topology DSL |
| Topology semantics | App create can carry request hints and config | No core platform need yet to define universal controller/worker/head/broker semantics | adapter-owned concern | Slurm should define these in adapter manifest/config first |
| Topology modes | Current docs discuss operating mode and control-plane scope | single_node vs split is useful, but not yet proven as a core enum across multiple runtimes | deferred until proven | Keep adapter-owned initially |
| Role taxonomy | Slurm/Ray/K8s can all name logical roles | No evidence yet that role names should be standardized by the platform | adapter-owned concern | Avoid hardcoding role names in core OpenAPI |
| Desired membership | Apps will need counts/intent for member sets | Platform may need a generic way to receive mutation intent, but not necessarily a generic topology schema | prove with first example app | Review after Slurm contract slice |
| Member read model | No public per-member app runtime model today | Operators and app developers will need to inspect realized members, health, and last operation | core primitive needed | Keep it generic: member id, component_key, status, node binding, correlation |
| Member lifecycle actions | No add/drain/remove/replace app-member operations today | Clustered apps need mutable membership | core primitive needed | These should be generic operation envelopes owned by the platform interface, not scheduler-specific semantics |
| Node lifecycle integration | Platform already has bootstrap/drain/remove/replace style node primitives | Need a safe adapter-to-node-lifecycle composition model for clustered apps | core primitive needed | The platform should expose reusable hooks, not app-specific workflows |
| Allocation intent | App layer is expected to work through allocations rather than direct node choice | Need explicit boundary between app-level capacity intent and platform-level node realization; coarse hints like AZ may be added later only if proven reusable | core primitive needed | Keep app-facing semantics at allocation level, not raw node placement |
| Location hints | Future runtimes may want coarse placement constraints such as AZ/region | No evidence yet that location hints belong in the first clustered-app primitive set | deferred until proven | Treat as a separate concern from generic allocation intent |
| Runtime secrets | App runtime secret issuance exists for artifact_pull | Clustered apps may need additional runtime/operator materials | prove with first example app | Do not widen secret purposes until a real reusable need is proven |
| SSH/runtime access | Allocation-side SSH key sync exists after runtime activation | Clustered app direct-access semantics are not defined for app platform | prove with first example app | If Slurm and dstack both need it, promote later as a generic capability |
| Artifact model | OCI and blob artifact lifecycles exist | Good enough for first example app | adapter-owned concern | No immediate platform gap; adapter just consumes the existing artifact model |
| Platform support services | Vault/registry primitives already exist conceptually as shared platform capabilities | Need to preserve the boundary that Vault/registry are platform support services, not app-runtime topology components | core primitive needed | App developers consume support-service primitives, but should not manage support-service placement |
| Billing | Usage-record primitives exist for app_runtime | App-specific billing semantics remain runtime-owned | adapter-owned concern | Platform provides attribution hooks, not per-app business logic |
| Observability | Correlation-first logging, traces, and app lifecycle evidence exist | Need to prove whether clustered apps require per-member health/readiness and alerting surfaces beyond current app-instance evidence | prove with first example app | Add only the minimum reusable observability surface the first example app demonstrates |
| Events | App lifecycle events exist | Clustered apps may later require finer-grained events such as member added, member drained, member removed, or member failed | prove with first example app | Name the likely event classes, but do not add a broad taxonomy upfront |
| Upgrade / rollback semantics | App instance upgrade and rollback flows exist | Clustered upgrade ordering, quorum safety, and rolling behavior differ per runtime | adapter-owned concern | Keep runtime-specific upgrade semantics in the adapter unless reusable primitives emerge |
| Example-app contract | Slurm adapter doc exists | Need to prove first example app uses only public contracts and allowed primitives | core primitive needed | This is the acceptance test for the platform boundary itself |
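To make the "Member read model" row concrete, here is a minimal sketch of what a generic per-member record could look like. It is an illustration under stated assumptions, not a proposed schema: the field names follow the row's notes (member id, component_key, status, node binding, correlation), while `AppMember`, `GENERIC_STATUSES`, `to_read_model`, the `last_operation` field, and the specific status values are all hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical generic status set; the real values are not defined in this document.
GENERIC_STATUSES = frozenset({"pending", "ready", "draining", "failed", "removed"})


@dataclass(frozen=True)
class AppMember:
    """One realized member of a clustered app, as the platform would report it."""

    member_id: str
    component_key: str            # opaque, adapter-defined label; core does not interpret it
    status: str                   # must be one of GENERIC_STATUSES
    node_binding: Optional[str]   # node the member is realized on, if any
    correlation_id: str           # ties the record to audit and trace context
    last_operation: Optional[str] = None


def to_read_model(member: AppMember) -> dict:
    """Serialize a member, rejecting runtime-specific status values."""
    if member.status not in GENERIC_STATUSES:
        raise ValueError(f"non-generic member status: {member.status!r}")
    return asdict(member)
```

The point of the sketch is that `component_key` carries whatever role name the adapter uses without the platform standardizing it, while `status` stays within a small platform-owned vocabulary.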

What Looks Reusable Now

These are the likely platform primitives worth tightening first:

  1. clustered-app IAM rules for tenant admin, project admin, and project service account actors
  2. generic app-operation lifecycle/read model for member-level operations
  3. generic member status/read surface
  4. adapter-to-node-lifecycle composition rules
  5. explicit boundary between app-level allocation intent and platform-level node realization
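As an illustration of items 1 and 2, a generic member-level operation envelope might be sketched as follows. Everything here is an assumption made for the example: `MemberOperation`, its field names, and the `adapter_params` pass-through are hypothetical. Only the operation types (add, drain, remove, replace) and the three actor kinds come from the gap table and the list above; which actor may perform which operation is deliberately left out because it is not yet defined.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple

# Operation types named in the gap table's "Member lifecycle actions" row.
OPERATION_TYPES = frozenset({"add", "drain", "remove", "replace"})
# Actor kinds named in the clustered-app IAM item.
ACTOR_KINDS = frozenset({"tenant_admin", "project_admin", "project_service_account"})


@dataclass(frozen=True)
class MemberOperation:
    """Generic envelope: the platform owns the shape, the adapter owns the semantics."""

    app_instance_id: str
    operation_type: str
    member_ids: Tuple[str, ...]
    actor_kind: str
    correlation_id: str
    # Opaque to the platform; forwarded to the adapter without interpretation.
    adapter_params: Dict[str, Any] = field(default_factory=dict)
    operation_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self) -> None:
        if self.operation_type not in OPERATION_TYPES:
            raise ValueError(f"unknown operation type: {self.operation_type!r}")
        if self.actor_kind not in ACTOR_KINDS:
            raise ValueError(f"unknown actor kind: {self.actor_kind!r}")
```

A design note on this sketch: scheduler-specific details such as drain timeouts would travel inside `adapter_params`, so the envelope itself never needs a Slurm-specific field.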

What Should Stay Adapter-Owned For Now

These should not become core API semantics in the first slice:

  1. Slurm topology schema
  2. Slurm role names
  3. Slurm controller/worker health rules
  4. Slurm join/drain/remove sequencing details
  5. runtime-native config files and bootstrap specifics

The same logic applies later to Ray, K8s, Kafka, MongoDB, or other direct clustered apps.
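One way to picture this boundary is a core request that accepts only a fixed set of generic fields and passes everything adapter-specific through untouched. The sketch below is hypothetical throughout: `CORE_FIELDS`, `split_request`, and the request keys are invented for illustration and do not describe the current API.

```python
from typing import Any, Dict, Tuple

# Hypothetical generic fields the core contract would recognize.
CORE_FIELDS = frozenset({"app_instance_id", "adapter", "adapter_config"})


def split_request(request: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate core fields from the opaque adapter config.

    Any top-level key outside CORE_FIELDS is rejected, which forces topology,
    role names, and similar runtime detail to live under adapter_config.
    """
    unknown = set(request) - CORE_FIELDS
    if unknown:
        raise ValueError(
            f"not core fields, move into adapter_config: {sorted(unknown)}"
        )
    core = {k: v for k, v in request.items() if k != "adapter_config"}
    adapter_config = dict(request.get("adapter_config", {}))
    return core, adapter_config
```

With this shape, a request that places `roles` at the top level fails validation, while the same data nested under `adapter_config` is forwarded to the adapter without the platform inspecting it.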

What Should Be Deferred

Do not commit to these until at least one example app proves the need is truly generic:

  1. universal topology DSL
  2. universal component-role taxonomy
  3. generic platform enums for clustered role names
  4. broadened runtime secret purpose model beyond what a real example app demonstrates
  5. large event taxonomy for member lifecycle beyond what operations actually need

Before changing doc/api/openapi.draft.yaml, the next review should answer only these questions:

  1. Which of the rows classified as "core primitive needed" are required for the first example app slice?
  2. Can each proposed additive API shape stay generic and language-neutral?
  3. Is any proposed field actually adapter-owned and therefore better left in app manifest/config?
  4. Are we adding the minimum primitive, or accidentally inventing a universal clustered-app control plane?

Immediate Outcome

The current conclusion is:

  - do not start with a large generic topology API
  - first tighten the primitive layer
  - let Slurm prove which missing pieces are truly platform-owned
  - keep adapter semantics out of the core contract unless they are clearly reusable