# App Platform Clustered App Gap Table v1

## Purpose
Turn the clustered-app and primitive-boundary discussions into a concrete review table before any Slurm implementation or additive OpenAPI change starts.
This table is intentionally not an implementation plan. It is a classification aid:

- what the platform already has,
- what primitive is missing,
- what should stay adapter-owned,
- what should be deferred until there is stronger evidence.
Reading order:
1. Clustered_App_Model_v1.md
2. App_Platform_Primitive_Boundary_v1.md
3. App_Platform_Clustered_App_Gap_Table_v1.md
4. App_Platform_Core_For_First_Slurm_Slice_v1.md
Use this with:
- doc/architecture/Clustered_App_Model_v1.md
- doc/architecture/App_Platform_Primitive_Boundary_v1.md
- doc/architecture/App_Platform_Core_For_First_Slurm_Slice_v1.md
- doc/architecture/App_Control_Plane_v1.md
- doc/architecture/Slurm_App_Runtime_Adapter_v1.md
## Review Rule
Every item below must be classified as one of:
1. core primitive needed
2. adapter-owned concern
3. prove with first example app
4. deferred until proven
If a proposed change cannot be justified as a reusable platform primitive, it should not enter the core API yet.
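As a minimal sketch of the review rule (all names here are illustrative, not part of any real contract), the four classifications can be modeled as an enum plus a gate that only admits reusable platform primitives into the core API review queue:

```python
from enum import Enum


class GapClassification(Enum):
    """The four allowed review outcomes for each gap-table row."""
    CORE_PRIMITIVE_NEEDED = "core primitive needed"
    ADAPTER_OWNED = "adapter-owned concern"
    PROVE_WITH_FIRST_EXAMPLE_APP = "prove with first example app"
    DEFERRED_UNTIL_PROVEN = "deferred until proven"


def eligible_for_core_api(classification: GapClassification) -> bool:
    """Only items justified as reusable platform primitives may enter the core API."""
    return classification is GapClassification.CORE_PRIMITIVE_NEEDED
```

Everything else stays out of the core OpenAPI until it is either proven by the first example app or promoted from adapter-owned territory.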
## Gap Table
| Area | Current capability | Gap / pressure | Classification | Notes |
|---|---|---|---|---|
| Identity | User auth and project-scoped service-account auth exist | Need explicit clustered-app authz rules for tenant admin, project admin, and service account actions | core primitive needed | This is platform IAM, not adapter logic |
| Audit | App mutations already audit with correlation context | Role/member lifecycle actions for clustered apps are not yet defined, so audit coverage for them is also undefined | core primitive needed | Required before any mutable clustered lifecycle exists |
| App instance lifecycle | Create/read/upgrade/rollback/decommission exists | Current lifecycle is single-instance shaped; no primitive for member-level operations | core primitive needed | Keep the generic lifecycle small, but it must support additional operation types |
| App instance schema | AppInstance exposes app-level metadata and status | No first-class surface for clustered runtime detail beyond opaque config/runtime state | core primitive needed | Likely an additive detail/read-model only, not a universal topology DSL |
| Topology semantics | App create can carry request hints and config | No core platform need yet to define universal controller/worker/head/broker semantics | adapter-owned concern | Slurm should define these in adapter manifest/config first |
| Topology modes | Current docs discuss operating mode and control-plane scope | single_node vs split is useful, but not yet proven as a core enum across multiple runtimes | deferred until proven | Keep adapter-owned initially |
| Role taxonomy | Slurm/Ray/K8s can all name logical roles | No evidence yet that role names should be standardized by the platform | adapter-owned concern | Avoid hardcoding role names in core OpenAPI |
| Desired membership | Apps will need counts/intent for member sets | Platform may need a generic way to receive mutation intent, but not necessarily a generic topology schema | prove with first example app | Review after the Slurm contract slice |
| Member read model | No public per-member app runtime model today | Operators and app developers will need to inspect realized members, health, and last operation | core primitive needed | Keep it generic: member id, component_key, status, node binding, correlation |
| Member lifecycle actions | No add/drain/remove/replace app-member operations today | Clustered apps need mutable membership | core primitive needed | These should be generic operation envelopes owned by the platform interface, not scheduler-specific semantics |
| Node lifecycle integration | Platform already has bootstrap/drain/remove/replace style node primitives | Need a safe adapter-to-node-lifecycle composition model for clustered apps | core primitive needed | The platform should expose reusable hooks, not app-specific workflows |
| Allocation intent | App layer is expected to work through allocations rather than direct node choice | Need an explicit boundary between app-level capacity intent and platform-level node realization; coarse hints like AZ may be added later only if proven reusable | core primitive needed | Keep app-facing semantics at allocation level, not raw node placement |
| Location hints | Future runtimes may want coarse placement constraints such as AZ/region | No evidence yet that location hints belong in the first clustered-app primitive set | deferred until proven | Treat as a separate concern from generic allocation intent |
| Runtime secrets | App runtime secret issuance exists for artifact_pull | Clustered apps may need additional runtime/operator materials | prove with first example app | Do not widen secret purposes until a real reusable need is proven |
| SSH/runtime access | Allocation-side SSH key sync exists after runtime activation | Clustered app direct-access semantics are not defined for the app platform | prove with first example app | If Slurm and dstack both need it, promote it later as a generic capability |
| Artifact model | OCI and blob artifact lifecycles exist | Good enough for the first example app | adapter-owned concern | No immediate platform gap; the adapter just consumes the existing artifact model |
| Platform support services | Vault/registry primitives already exist conceptually as shared platform capabilities | Need to preserve the boundary that Vault/registry are platform support services, not app-runtime topology components | core primitive needed | App developers consume support-service primitives, but should not manage support-service placement |
| Billing | Usage-record primitives exist for app_runtime | App-specific billing semantics remain runtime-owned | adapter-owned concern | Platform provides attribution hooks, not per-app business logic |
| Observability | Correlation-first logging, traces, and app lifecycle evidence exist | Need to prove whether clustered apps require per-member health/readiness and alerting surfaces beyond current app-instance evidence | prove with first example app | Add only the minimum reusable observability surface the first example app demonstrates |
| Events | App lifecycle events exist | Clustered apps may later require finer-grained events such as member added, member drained, member removed, or member failed | prove with first example app | Name the likely event classes, but do not add a broad taxonomy upfront |
| Upgrade / rollback semantics | App instance upgrade and rollback flows exist | Clustered upgrade ordering, quorum safety, and rolling behavior differ per runtime | adapter-owned concern | Keep runtime-specific upgrade semantics in the adapter unless reusable primitives emerge |
| Example-app contract | Slurm adapter doc exists | Need to prove the first example app uses only public contracts and allowed primitives | core primitive needed | This is the acceptance test for the platform boundary itself |
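The "Member read model" row names the only fields the platform should own: member id, component_key, status, node binding, and correlation. A minimal sketch of that read model might look like the following (every type and field name here is hypothetical, chosen to show the shape, not a committed schema):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MemberStatus(Enum):
    # Illustrative status values; the real set would be fixed during API review.
    PENDING = "pending"
    ACTIVE = "active"
    DRAINING = "draining"
    REMOVED = "removed"
    FAILED = "failed"


@dataclass(frozen=True)
class AppMember:
    """Generic per-member read surface: only fields the gap table calls reusable."""
    member_id: str        # stable platform-assigned identifier
    component_key: str    # adapter-defined logical role key, opaque to the platform
    status: MemberStatus
    allocation_id: Optional[str]                   # node binding via allocation, never a raw node
    last_operation_correlation_id: Optional[str]   # ties the member to audit/trace context
```

Keeping `component_key` opaque is what lets Slurm, Ray, or K8s name their own roles without the platform standardizing a role taxonomy.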
## What Looks Reusable Now
These are the likely platform primitives worth tightening first:
- clustered-app IAM rules for tenant admin, project admin, and project service account actors
- generic app-operation lifecycle/read model for member-level operations
- generic member status/read surface
- adapter-to-node-lifecycle composition rules
- explicit boundary between app-level allocation intent and platform-level node realization
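The "generic app-operation lifecycle" item above could be sketched as a platform-owned operation envelope: generic verbs plus actor and correlation context, with every scheduler-specific detail left to the adapter. All names below are assumptions for illustration, not a proposed contract:

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class MemberOperationType(Enum):
    # Generic membership verbs; runtime-specific sequencing stays in the adapter.
    ADD = "add"
    DRAIN = "drain"
    REMOVE = "remove"
    REPLACE = "replace"


@dataclass(frozen=True)
class MemberOperationEnvelope:
    """Platform-owned envelope: who requested what, with correlation for audit."""
    app_instance_id: str
    operation: MemberOperationType
    requested_by: str                # actor identity (admin or project service account)
    member_id: Optional[str] = None  # present for drain/remove/replace, absent for add
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
```

Because the envelope carries a correlation id from the start, the audit gap in the table above is covered by construction rather than bolted on later.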
## What Should Stay Adapter-Owned For Now
These should not become core API semantics in the first slice:
- Slurm topology schema
- Slurm role names
- Slurm controller/worker health rules
- Slurm join/drain/remove sequencing details
- runtime-native config files and bootstrap specifics
The same logic applies later to Ray, K8s, Kafka, MongoDB, or other direct clustered apps.
## What Should Be Deferred
Do not commit to these until at least one example app proves the need is truly generic:
- universal topology DSL
- universal component-role taxonomy
- generic platform enums for clustered role names
- broadened runtime secret purpose model beyond what a real example app demonstrates
- large event taxonomy for member lifecycle beyond what operations actually need
## Recommended Next Review
Before changing doc/api/openapi.draft.yaml, the next review should answer only these questions:
- Which of the `core primitive needed` rows are required for the first example app slice?
- Can each proposed additive API shape stay generic and language-neutral?
- Is any proposed field actually adapter-owned and therefore better left in app manifest/config?
- Are we adding the minimum primitive, or accidentally inventing a universal clustered-app control plane?
## Immediate Outcome
The current conclusion is:

- do not start with a large generic topology API
- first tighten the primitive layer
- let Slurm prove which missing pieces are truly platform-owned
- keep adapter semantics out of the core contract unless they are clearly reusable