App Platform Primitive Boundary v1

Purpose

Record the design boundary agreed during clustered-app planning so the platform does not grow a prematurely generic topology API that tries to model every distributed application directly.

This note exists because the first example app work, starting with Slurm, can easily push the platform toward an oversized API surface if the platform/app boundary is not kept explicit.

Reading order:

  1. Clustered_App_Model_v1.md
  2. App_Platform_Primitive_Boundary_v1.md
  3. App_Platform_Clustered_App_Gap_Table_v1.md
  4. App_Platform_Core_For_First_Slurm_Slice_v1.md

Use this with:

  • doc/architecture/Clustered_App_Model_v1.md
  • doc/architecture/App_Platform_Clustered_App_Gap_Table_v1.md
  • doc/architecture/App_Platform_Core_For_First_Slurm_Slice_v1.md
  • doc/architecture/App_Control_Plane_v1.md
  • doc/architecture/Scheduler_as_Platform_App_v1.md
  • doc/architecture/Slurm_App_Runtime_Adapter_v1.md

Decision Summary

  1. The platform should provide reusable primitives, not a fully generic distributed-systems topology language.
  2. Logical topology semantics belong primarily to the app developer or runtime adapter, not to the core platform.
  3. The platform should expose only the minimum generic lifecycle, identity, security, allocation-intent, and observability primitives needed for app developers to build on top.
  4. Slurm is a proving example app for the API and SDK, not a justification to hardcode scheduler-shaped concepts into the core platform.
  5. If a capability appears necessary for multiple app classes, it can be promoted into a generic platform primitive later. Until then, bias toward keeping it adapter-owned.

Why This Exists

The platform already needs to support:

  • artifacts and trust
  • service accounts
  • IAM and audit
  • runtime secrets
  • allocation and node lifecycle
  • observability and billing hooks

That does not automatically mean the core API should own:

  • controller/worker schemas
  • role taxonomies for every distributed system
  • universal scale/drain/remove semantics for every app
  • a generic topology DSL

Without an explicit boundary, the first clustered example app can accidentally turn the platform into a large, hard-to-maintain control API that tries to satisfy every future runtime before the common primitives are proven.

Core Principle

The platform should be:

  • primitive-rich
  • contract-first
  • language-neutral
  • secure by default

It should not be:

  • a universal topology authoring system
  • a scheduler-specific API
  • a database-orchestration control plane
  • a place where adapter-specific semantics leak into generic contracts too early

What The Platform Should Provide

These are the reusable building blocks the platform should own generically.

Identity and security

  • user and service-account authentication
  • project and tenant scoping
  • IAM enforcement
  • audit logging
  • canonical error envelopes

Artifact and secret primitives

  • app artifact registration and trust/promotion
  • registry and blob delivery primitives
  • runtime secret issuance
  • short-lived credential delivery

Platform support services

  • platform-owned registry
  • platform-owned Vault or equivalent secret-custody service
  • isolation of support-service capacity from core control-plane capacity where practical

These services are platform infrastructure, not app-runtime topology. App developers consume their primitives through the API; they should not need to care where those services run.

Compatibility rule: future clustered-app primitives should be additive-first and must not force app developers to reason about support-service placement or support-service topology changes.

Infrastructure and lifecycle primitives

  • allocation intent primitives
      • project/tenant-scoped capacity requests
      • coarse location hints such as AZ or region if the platform later proves they are needed generically
  • node lifecycle primitives
      • bootstrap
      • drain
      • remove
      • replace
  • async operation model
  • operation/event correlation
  • runtime status and health surfaces
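
As a rough illustration of how the node lifecycle verbs, the async operation model, and operation/event correlation could fit together, here is a minimal Python sketch. Every name in it (NodeAction, Operation, Event, request_node_action, events_for) and the operation states are assumptions made for illustration; none of them come from the actual platform API.

```python
from dataclasses import dataclass, field
from enum import Enum
import uuid


class NodeAction(Enum):
    """Generic node lifecycle verbs from the list above; no app roles appear here."""
    BOOTSTRAP = "bootstrap"
    DRAIN = "drain"
    REMOVE = "remove"
    REPLACE = "replace"


class OperationState(Enum):
    """Hypothetical states for an async operation."""
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Operation:
    """Hypothetical record returned as soon as a lifecycle request is accepted."""
    action: NodeAction
    node_id: str
    correlation_id: str  # shared by the operation and every event it produces
    operation_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    state: OperationState = OperationState.PENDING


@dataclass
class Event:
    correlation_id: str
    message: str


def request_node_action(action, node_id, correlation_id):
    """Accept the request and return immediately; the work finishes later."""
    return Operation(action=action, node_id=node_id, correlation_id=correlation_id)


def events_for(operation, events):
    """Select the events that belong to one operation via its correlation id."""
    return [e for e in events if e.correlation_id == operation.correlation_id]
```

Nothing in this sketch mentions controllers, workers, or any other app role, which is the property the boundary is meant to preserve.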

Platform-wide support primitives

  • usage attribution hooks
  • billing record primitives
  • observability requirements
  • runbook and incident correlation expectations

What Should Remain App/Adapter-Owned

These concerns should stay in the example app or adapter unless at least two real app classes prove they need a shared platform primitive.

Topology semantics

  • logical topology modes
  • role naming
  • role-specific desired counts
  • app-specific co-location rules
  • app-specific health and readiness semantics

Examples:

  • Slurm: controller, worker
  • Ray: head, worker
  • K8s: control_plane, worker
  • Kafka: broker, optional controller

These are valid app semantics, but they should not automatically become core platform enums.
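
One way to keep role names out of core enums is for the platform to carry them only as opaque labels while each adapter validates its own vocabulary. The Python sketch below assumes a hypothetical NodeRequest shape and a hypothetical ADAPTER_ROLES table; neither is part of the real platform contract.

```python
from dataclasses import dataclass


@dataclass
class NodeRequest:
    """Hypothetical platform-side request: labels are opaque strings to the platform."""
    project: str
    count: int
    labels: dict


# Adapter-owned role vocabularies, taken from the examples above.
ADAPTER_ROLES = {
    "slurm": {"controller", "worker"},
    "ray": {"head", "worker"},
}


def build_request(app, role, project, count):
    """Adapter-side validation; the platform never interprets the role value."""
    if role not in ADAPTER_ROLES[app]:
        raise ValueError(f"{role!r} is not a known role for {app!r}")
    return NodeRequest(project=project, count=count,
                       labels={"app": app, "role": role})
```

Adding a new runtime then means adding an entry on the adapter side, with no change to the platform-side request type.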

App-specific lifecycle logic

  • how a role joins a cluster
  • how drain/remove is interpreted by the app
  • upgrade ordering rules
  • quorum or replication safety checks
  • rendered config semantics
  • role/member-specific health checks
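
The split above can be pictured as an adapter interface that the app implements and a generic platform primitive that it gates. This is a minimal sketch only: RuntimeAdapter, ToySlurmAdapter, and the rules inside it (never drain the last controller, upgrade controllers first) are illustrative assumptions, not the real Slurm adapter.

```python
from typing import Callable, Dict, List, Protocol


class RuntimeAdapter(Protocol):
    """Hypothetical adapter-owned hooks; the platform only calls them."""

    def safe_to_drain(self, member: str) -> bool: ...

    def upgrade_order(self, members: List[str]) -> List[str]: ...


class ToySlurmAdapter:
    """Illustrative adapter holding app-specific rules the platform should not know."""

    def __init__(self, roles: Dict[str, str]):
        self.roles = roles  # member name -> role name

    def safe_to_drain(self, member: str) -> bool:
        # Assumed rule for illustration: never drain the last controller.
        controllers = [m for m, r in self.roles.items() if r == "controller"]
        return not (self.roles[member] == "controller" and len(controllers) == 1)

    def upgrade_order(self, members: List[str]) -> List[str]:
        # Assumed rule for illustration: controllers first, then workers by name.
        return sorted(members, key=lambda m: (self.roles[m] != "controller", m))


def drain(adapter: RuntimeAdapter, member: str,
          platform_drain: Callable[[str], None]) -> bool:
    """Invoke the generic platform drain only after the adapter's safety check."""
    if not adapter.safe_to_drain(member):
        return False
    platform_drain(member)
    return True
```

For example, with roles {"c1": "controller", "w1": "worker"}, draining "w1" reaches the platform primitive, while draining "c1" is refused by the adapter before the platform is called.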

What The Platform Should Not Do Yet

Do not add a large generic OpenAPI model for:

  • universal topology declaration
  • universal role taxonomy
  • universal component/member semantics for every distributed runtime

Why:

  • it is premature
  • it will be shaped too heavily by the first example app
  • it creates a larger long-term compatibility burden
  • it risks forcing future runtimes into the wrong abstraction

When a new requirement appears, classify it as one of:

  1. core primitive needed
      • multiple app classes are likely to need it
      • it belongs in the platform contract
  2. adapter-owned concern
      • it is runtime-specific
      • it should stay inside the app adapter or its manifest/config model
  3. prove with first example app
      • not enough evidence yet
      • keep it out of the core API until the example app demonstrates a real reusable need
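
The three-way classification can be expressed as a small decision function. The sketch below is one possible reading, not an agreed procedure: the inputs (which app classes need the capability, and whether it is runtime-specific) and the function name are assumptions, and the threshold of two app classes is borrowed from the promotion rule stated earlier in this note.

```python
from enum import Enum


class Classification(Enum):
    CORE_PRIMITIVE = "core primitive needed"
    ADAPTER_OWNED = "adapter-owned concern"
    PROVE_FIRST = "prove with first example app"


def classify(app_classes_needing_it, runtime_specific):
    """Classify a requirement; inputs are judgment calls made during review."""
    if runtime_specific:
        # Runtime-specific behavior stays in the adapter or its manifest/config model.
        return Classification.ADAPTER_OWNED
    if len(set(app_classes_needing_it)) >= 2:
        # Needed by at least two distinct app classes: candidate for the platform contract.
        return Classification.CORE_PRIMITIVE
    # Not enough evidence yet: keep it out of the core API for now.
    return Classification.PROVE_FIRST
```

For instance, classify(["slurm"], runtime_specific=False) falls through to the "prove with first example app" bucket, because a single app class is not yet evidence of a reusable need.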

What Slurm Should Prove

Slurm should be used to validate:

  • the platform has enough primitives for a real clustered app
  • the API is usable by app developers and automation
  • service accounts and IAM are sufficient
  • node lifecycle and secret/artifact primitives compose cleanly

Slurm should not be used to justify:

  • hardcoding scheduler semantics into the core platform
  • inventing a universal topology language before the primitive layer is proven

Practical Boundary For The Next Slice

Before implementation, the next review should produce three lists:

Core primitives needed now

Examples of likely candidates:

  • better async operation/read-model support
  • clearer runtime secret purposes
  • service-account authz for app automation
  • node lifecycle hooks that adapters can invoke safely
  • explicit boundary between app-level allocation intent and platform-level node realization

Adapter-owned concerns

Examples:

  • Slurm topology schema
  • Slurm role names
  • Slurm controller/worker health semantics
  • Slurm-specific join/drain behavior

Deferred until proven

Examples:

  • generic topology DSL
  • generic role taxonomy
  • generic multi-component app contract in core OpenAPI

Current working classification lives in doc/architecture/App_Platform_Clustered_App_Gap_Table_v1.md.

Language Neutrality Requirement

Because app developers may use Go, Python, or other languages:

  • the public API must remain the source of truth
  • SDKs must be convenience layers
  • platform behavior must not depend on internal Go-only abstractions

This further argues against a premature, overly rich platform control surface. A smaller primitive layer is easier to explain, bind, and support across languages.
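
To make "SDKs must be convenience layers" concrete, the sketch below shows a helper that does nothing beyond building and sending a plain JSON payload, so a client in any language could produce the same bytes without the SDK. The payload fields and function names are hypothetical and do not describe the real API.

```python
import json


def drain_request_payload(node_id, correlation_id):
    """Hypothetical wire payload; the JSON shape is the contract, not this helper."""
    return {
        "action": "drain",
        "node_id": node_id,
        "correlation_id": correlation_id,
    }


def sdk_drain(node_id, correlation_id, send):
    """Convenience wrapper: serializes the payload and hands it to a transport.

    `send` is any callable that takes the serialized request body; keeping it
    injectable means the wrapper adds no behavior the public API does not define.
    """
    body = json.dumps(drain_request_payload(node_id, correlation_id), sort_keys=True)
    return send(body)
```

If a helper like this ever needed logic that could not be expressed in the payload itself, that would be a sign that behavior is leaking out of the public contract and into one language's SDK.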

Relationship To Clustered App Model

Clustered_App_Model_v1.md defines the broader long-term model the platform must be able to support.

This document adds an implementation guardrail:

  • do not force that long-term model directly into the core API until the reusable primitive layer is proven
  • use the example app to validate which parts are truly generic

The two documents are complementary:

  • Clustered_App_Model_v1.md says what kinds of apps the platform should eventually support
  • App_Platform_Primitive_Boundary_v1.md says how cautious we should be about promoting app semantics into the platform contract

Next Step

The next design review should produce a gap table, not an implementation sprint. The table should cover:

  • current capability
  • missing primitive
  • adapter-owned concern
  • deferred concern

That review should happen before any additive OpenAPI change for Slurm implementation.