Example App Developer Reference Workflow v1

Purpose

Define the reference workflow for an independently deployed app on GPUaaS.

This document exists to answer:

  1. what the product workflow should look like for an operator,
  2. what an app developer is expected to build,
  3. how app UI integrates with platform UI,
  4. how SDK/client usage should look,
  5. whether the current platform is ready for external app developers.

This is not a Slurm-only document. Slurm is the first proving example app for this workflow.

Read This First

Use these as the companion references:

  1. doc/api/openapi.draft.yaml
  2. doc/architecture/App_Control_Plane_v1.md
  3. doc/architecture/Build_an_App_for_GPUaaS_v1.md
  4. doc/architecture/App_Platform_Quickstart_v1.md
  5. doc/architecture/External_App_Team_Integration_Guide_v1.md
  6. doc/architecture/Slurm_First_Slice_Platform_App_Split_v1.md
  7. doc/architecture/Slurm_Product_Workflow_And_Gap_Assessment_v1.md
  8. doc/architecture/Slurm_Tenant_Scope_Semantics_v1.md
  9. doc/architecture/CLI_Agent_Operable_Control_Plane_v2.md
  10. doc/architecture/CLI_PythonSDK_v1_Plan.md

Decision Summary

  1. GPUaaS owns the platform shell, common contracts, IAM, audit, and reusable primitives.
  2. App teams own runtime-specific controller logic, runtime-specific UI extensions, and any app-owned state they need.
  3. The first-class integration model is still API-first.
  4. SDKs and UI libraries are convenience layers over the public contract, not the source of truth.
  5. Slurm should be treated as the first example app for this workflow, not a special-case internal path.

Reference Product Workflow

Operator workflow

The operator workflow should be:

  1. Go to App Catalog.
  2. Find an entitled app.
  3. Click Deploy.
  4. Fill the app-specific deploy form fields that are actually required.
  5. Submit the deployment.
  6. Observe lifecycle progress in the platform shell.
  7. Use runtime-specific day-2 controls from the instance page.

Current platform-shell behavior:

  • the catalog UI shows only apps enabled for the active project,
  • entitlement management remains a separate tenant/project admin surface,
  • the instance page can host app-specific bootstrap credential controls where the app extension needs them.

App-controller workflow

The app-controller workflow should be:

  1. Run independently from GPUaaS core.
  2. Authenticate with a project-scoped service account.
  3. Read app instances, members, and operations through public APIs only.
  4. Acquire machine access through a supported platform path.
  5. Install or configure runtime software on the selected nodes.
  6. Report progress and status back through public APIs.
  7. Continue reconciling until the desired runtime state is reached.
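The loop above can be sketched in a few lines. This is a minimal illustration, not the real controller: `fetch_instance`, `apply_runtime_change`, and `report_status` stand in for public-API calls made under the app's project-scoped service-account identity, and the `desired_state`/`observed_state` field names are assumptions rather than the actual contract.

```python
import time

# Hypothetical sketch of the app-controller reconcile loop. The injected
# callables stand in for public-API calls under service-account identity;
# field names are illustrative, not the real contract.

def reconcile_once(fetch_instance, apply_runtime_change, report_status):
    """Run one reconcile pass; return True once desired state is reached."""
    instance = fetch_instance()
    desired = instance["desired_state"]
    observed = instance["observed_state"]
    if observed == desired:
        report_status(phase="ready", detail="runtime matches desired state")
        return True
    apply_runtime_change(desired)  # install/configure runtime software
    report_status(phase="reconciling", detail=f"moving to {desired}")
    return False

def reconcile_until_ready(fetch_instance, apply_runtime_change, report_status,
                          interval_s=0.0, max_passes=10):
    """Keep reconciling until the runtime converges or passes run out."""
    for _ in range(max_passes):
        if reconcile_once(fetch_instance, apply_runtime_change, report_status):
            return True
        time.sleep(interval_s)
    return False
```

The key property is that every pass reads state and reports status through the public API only, so the controller can crash and resume without platform-side special casing.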

Slurm As The Reference Example

Slurm is the current example of how an app should consume GPUaaS building blocks without turning those blocks into Slurm-specific platform behavior.

The important split is:

Platform owns

  • app catalog and entitlement policy
  • project app instance record
  • project service-account identity
  • project-scoped access-credential custody and delivery
  • allocation and placement read models
  • generic member and member-operation envelopes
  • audit, correlation, and lifecycle read surfaces

Slurm app owns

  • controller reconcile logic
  • Slurm package install/configure logic
  • runtime-native slurmctld / slurmd health checks
  • worker topology decisions within the explicit placement chosen by the operator
  • any Slurm-specific runtime panels and day-2 controls

What this proves for other app teams

If another app needs:

  • a project-scoped machine identity,
  • one or more selected allocations,
  • a delivered bootstrap credential,
  • status reporting back into the app instance,

then it should be able to follow the same pattern without asking the platform to absorb runtime-specific SME logic.

Platform UI vs App UI

The product should expose two UI layers.

Platform UI owns

  • app catalog
  • app entitlement management
  • generic deploy entrypoint
  • app instance inventory
  • generic app lifecycle status
  • generic member and operation history
  • service account and access-credential selection if those are platform primitives
  • project allocation selection surfaces when placement is allocation-based

App UI owns

  • app-specific deploy fields
  • app-specific runtime summary panels
  • app-specific topology and node-selection fields
  • app-specific worker or role actions
  • app-specific debugging and day-2 operations

Integration model

The intended model is:

  1. platform shell hosts the common navigation and instance pages,
  2. app contributes structured UI extensions inside that shell,
  3. app-specific UI is rendered in defined extension points rather than by replacing the full platform UI.

For example:

  • the deploy flow starts in the platform catalog,
  • the app contributes deploy form extensions,
  • the instance detail page shows generic cards plus app-specific panels.

Current shell extension points

The platform shell now has a first real extension seam in the web app:

  • catalog deploy modal asks the app extension for deploy-specific fields,
  • catalog deploy orchestration can reason about required platform-owned inputs from extension metadata,
  • catalog deploy can already render metadata-driven single-allocation and multi-allocation inputs for simpler apps without custom deploy components,
  • catalog deploy can also render metadata-driven service-account and access-credential inputs for simpler apps without custom deploy components,
  • instance detail page asks the app extension for runtime and action panels,
  • instance detail page can also host app-specific bootstrap credential lifecycle actions when the app uses platform access credentials,
  • generic lifecycle, members, and operation history stay platform-owned,
  • common platform-owned picker primitives can be reused inside app extensions for:
      • service-account selection,
      • access-credential selection,
      • single-allocation selection,
      • multi-allocation selection.

For the current Slurm example this means:

  • the platform shell owns the page route, generic lifecycle actions, and generic history cards,
  • the Slurm extension provides the deploy fields and the Slurm runtime/worker panels,
  • the registry maps app identity (slug / runtime_backend) to those extension components.

This is still the first implementation cut, not the final long-term plugin system, but it is now a real structured boundary rather than ad hoc page-local branching.
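To make the metadata-driven deploy seam concrete, here is a hedged sketch of what the declarative deploy metadata and the shell's required-input check could look like. The field names (`requires_service_account`, `allocation_cardinality`, and so on) are invented for illustration and are not the real extension contract.

```python
# Hypothetical shape of the declarative deploy metadata an app extension
# registers with the shell. All field names here are illustrative.

SLURM_DEPLOY_METADATA = {
    "app_slug": "slurm",
    "requires_service_account": True,
    "requires_access_credential": True,
    "allocation_cardinality": "multi",  # "single" or "multi"
}

def missing_platform_inputs(metadata, form):
    """Return the platform-owned inputs the deploy form still needs."""
    missing = []
    if metadata["requires_service_account"] and not form.get("service_account_id"):
        missing.append("service_account_id")
    if metadata["requires_access_credential"] and not form.get("access_credential_id"):
        missing.append("access_credential_id")
    allocations = form.get("allocation_ids") or []
    if metadata["allocation_cardinality"] == "single" and len(allocations) != 1:
        missing.append("exactly one allocation")
    if metadata["allocation_cardinality"] == "multi" and not allocations:
        missing.append("at least one allocation")
    return missing
```

The point of the sketch is the division of labor: the app declares what it needs, and the shell renders and validates the platform-owned pickers without app-specific deploy components.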

Deploy Form Expectations

The deploy form should collect the real inputs the app needs.

For a clustered scheduler-style app, likely operator inputs are:

  • instance name
  • version
  • operating mode
  • controller allocation
  • worker allocation or worker allocation list
  • whether controller and worker can share a node
  • access credential selection
  • service account selection or creation

For the current Slurm reference path specifically:

  • the operator chooses one controller allocation,
  • the operator either reuses that same allocation for the initial worker or explicitly selects one or more separate worker allocations for the initial worker set.

If the operator must know or choose it, it should be explicit in the product workflow. If the app can safely default it, keep it as an app-side default.

Placement source of truth

The platform does not need a second app-specific node picker contract if the existing allocation read model already provides the needed placement data.

For the current model, app deploy and worker-add flows should treat GET /api/v1/projects/{project_id}/allocations and GET /api/v1/projects/{project_id}/allocations/{allocation_id} as the placement source of truth for the selected project.

That means the app workflow should select from allocated nodes already visible in scope, then carry explicit allocation_intent into app member operations.

Deploy-time placement should likewise be carried as first-class placement_intent on the app instance contract rather than hidden inside opaque app config.

The product should not rely on inferred host reuse once multiple candidate allocations exist.

Current example-app proof:

  • deploy uses first-class placement_intent,
  • the app runtime seeds initial worker-add member operations from placement_intent.worker_allocation_ids,
  • worker add uses explicit allocation_intent.allocation_id,
  • the app controller resolves host and username details from the selected allocation through public APIs.
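The placement_intent-to-member-operation step can be sketched as a pure transformation. The payload field names below mirror the ones named in this section but the exact envelope shape is an assumption, not the real contract.

```python
# Sketch: seed one worker-add member operation per worker allocation
# selected at deploy time. Envelope field names are illustrative.

def seed_worker_operations(placement_intent):
    """Build explicit worker-add operations from deploy-time placement."""
    return [
        {
            "operation": "worker_add",
            "allocation_intent": {"allocation_id": allocation_id},
        }
        for allocation_id in placement_intent["worker_allocation_ids"]
    ]
```

Because every operation carries an explicit allocation_id, nothing downstream has to infer host reuse once multiple candidate allocations exist.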

App-Owned State

App developers may need their own state store, including a database.

This is allowed and expected for serious managed apps.

Platform remains source of truth for

  • app instances
  • members
  • member operations
  • IAM and audit
  • platform-owned lifecycle state

App may own state for

  • runtime reconciliation bookkeeping
  • runtime-native metadata
  • project-scope vs tenant-scope app mappings
  • app policy and runtime config mappings
  • scheduler/runtime object mappings
  • app-specific health and recovery state

Rule: app-owned state must not replace platform-owned control-plane truth.

Service Account Model For Apps

Each independently deployed app controller should use a project-scoped service account.

The workflow should be explicit:

  1. operator chooses or creates the service account for the app,
  2. app controller uses that service account to mint short-lived tokens,
  3. all app/platform interaction happens under that machine identity.

The product should not hide the existence of that service account if it is required for the app to run.
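The mint-and-refresh pattern in step 2 can be sketched as a small token source. `mint_token` stands in for the real token-minting call under the service account; the return shape and the refresh margin are assumptions for illustration.

```python
import time

# Sketch of the short-lived service-account token pattern: mint a bearer
# token, cache it, and refresh it shortly before expiry. mint_token is a
# stand-in for the real token-minting call.

class TokenSource:
    def __init__(self, mint_token, refresh_margin_s=60):
        self._mint = mint_token          # returns (token, expires_at_epoch)
        self._margin = refresh_margin_s
        self._token = None
        self._expires_at = 0.0

    def bearer(self, now=None):
        """Return a token that is still valid for at least the margin."""
        now = time.time() if now is None else now
        if self._token is None or now >= self._expires_at - self._margin:
            self._token, self._expires_at = self._mint()
        return self._token
```

Every public-API call the controller makes then goes out under this machine identity, which keeps audit attribution clean.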

When a service account is required

Use a project-scoped service account when the app has a continuously running controller, reconciler, or operator process that must:

  • read app instances or shared runtime state,
  • read allocations selected by the operator,
  • retrieve delivered credentials,
  • write lifecycle/member/member-operation status back to the platform.

Do not require a service account only for the human operator to click Deploy.

Practical rule:

  • human user identity is for setup, approval, and day-2 operator actions,
  • service-account identity is for long-running app automation.

For Slurm today:

  • the operator selects the service account at deploy time,
  • the Slurm controller mints short-lived bearer tokens from that service account,
  • the controller uses those tokens for all subsequent public API calls.

Machine Access Model

Apps need a supported node access path for bootstrap and runtime management.

The intended product model is:

  • the platform provides the secure machine access primitive,
  • the app uses that primitive,
  • the app installs and manages runtime software itself.

Current important direction:

  • do not expand the node-agent into a universal app-runtime executor,
  • do provide correct credential custody, access delivery, and scoped retrieval.

Minimum core expectation

For app developers, the minimum supported platform path should be:

  1. project-scoped access credential metadata and custody under the public API,
  2. Vault-backed secret write on create and rotate,
  3. scoped delivery back to the app controller without plaintext reveal in the normal API response,
  4. app-managed bootstrap trust reconcile onto the selected allocation user when the app needs host bootstrap SSH,
  5. audit of credential lifecycle and delivery actions,
  6. service-account-compatible retrieval for the app controller.

This core slice now exists for project-scoped apps. Remaining work is mainly:

  • tenant-scoped or multi-project variants,
  • broader operator UX polish beyond the first bootstrap credential lifecycle controls,
  • polishing the reference app around that supported path.
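The "no plaintext reveal in the normal API response" expectation can be sketched as a two-step retrieval: a metadata read that must never contain secret material, followed by a dedicated delivery call. The endpoint paths and field names below are assumptions for illustration, not the published contract.

```python
# Sketch of scoped credential delivery: the normal read returns metadata
# only; secret material comes back only through a separate delivery call
# made under service-account identity. Paths and fields are illustrative.

def get_bootstrap_key(api_get, project_id, credential_id):
    """Fetch credential metadata, then the delivered private key."""
    base = f"/api/v1/projects/{project_id}/access-credentials/{credential_id}"
    meta = api_get(base)
    assert "private_key" not in meta, "normal read must not reveal plaintext"
    delivery = api_get(f"{base}/delivery")
    return meta["public_key"], delivery["private_key"]
```

Splitting the read this way keeps routine metadata listing auditable and low-risk while the delivery path stays narrow and service-account gated.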

Bootstrap SSH with Vault-backed custody

The intended flow is:

  1. operator creates or selects a project-scoped SSH access credential,
  2. the platform stores the private key material in Vault-backed custody,
  3. the credential is bound to the app instance,
  4. the app controller retrieves the credential through the public delivery path under service-account identity,
  5. the app controller reconciles only its own bootstrap public key onto the selected allocation user,
  6. the app controller uses the delivered private key to bootstrap the host.

Important ownership rule:

  • the app may manage only the bootstrap trust it owns,
  • it must not rewrite unrelated operator SSH keys on the allocation.

This is exactly the gap the Slurm reference flow exposed and then closed.
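One common way to honor that ownership rule is a marker-delimited block in authorized_keys that the controller reconciles idempotently, leaving every other line alone. The marker text below is an assumption; only the technique (own-block-only rewrites) reflects the rule above.

```python
# Sketch: reconcile only the app-owned bootstrap key, delimited by
# markers, and leave all other authorized_keys lines untouched.

BEGIN = "# BEGIN gpuaas-app-bootstrap"
END = "# END gpuaas-app-bootstrap"

def reconcile_authorized_keys(existing_text, bootstrap_public_key):
    """Replace (or append) only the app-owned marker block."""
    lines = existing_text.splitlines()
    if BEGIN in lines and END in lines:
        start, stop = lines.index(BEGIN), lines.index(END)
        lines = lines[:start] + lines[stop + 1:]   # drop only our own block
    lines += [BEGIN, bootstrap_public_key, END]
    return "\n".join(lines) + "\n"
```

Running the reconcile twice converges to the same file, and operator-managed keys outside the markers are never rewritten.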

Allocation, Provisioning, And App Responsibility

App teams should distinguish three separate stages:

1. Node provisioning

This is platform-owned.

Examples:

  • MAAS reimage or deploy
  • node bootstrap script delivery
  • node-agent enrollment
  • allocation runtime user creation

An app should not need to provision the node OS itself.

2. Allocation selection or acquisition

This can vary by app:

  • some apps assume the operator already has active allocations and only select from them,
  • some apps may later orchestrate allocation acquisition through platform APIs,
  • even then, the app is requesting allocations, not provisioning nodes.

The clean boundary is:

  • the platform provisions and owns the node lifecycle,
  • the app may select or request allocations as inputs to runtime placement.

3. App bootstrap after allocation is active

This is app-owned.

Examples:

  • install runtime packages
  • write runtime-native config
  • start runtime services
  • validate runtime health
  • add workers or other app-native topology members

For Slurm:

  • the allocation must already exist,
  • the Slurm controller then bootstraps the selected host after the allocation is active,
  • worker contributions are app operations layered on top of explicit allocation placement.

Artifact And Registry Example

The Slurm reference flow currently proves the controller/bootstrap path more than the registry path, but app developers should still understand the intended artifact model.

The platform artifact flow is:

  1. platform admin publishes a catalog version,
  2. app team publishes immutable artifacts for that version,
  3. the control plane records immutable artifact metadata,
  4. runtime controllers consume version/artifact metadata through the public API.

For OCI-based apps:

  • use publish-intent APIs,
  • push directly to the platform-owned registry,
  • register the pushed digest with the control plane,
  • deploy by digest-backed metadata, not mutable tags.

For non-OCI apps:

  • use the non-OCI artifact direction when that contract is productized,
  • do not smuggle artifact sources through opaque runtime config.

Even if Slurm itself is not yet leaning on the registry path, app teams should design around immutable artifact metadata because that is the reusable platform contract.
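The "deploy by digest, not mutable tags" rule can be enforced mechanically at registration time. The record shape below is an assumption, not the real publish-intent contract; the digest format check itself follows the standard OCI `sha256:<64 hex>` form.

```python
import re

# Sketch: build a digest-pinned artifact registration record and reject
# mutable tag references outright. The record shape is illustrative.

DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")

def artifact_registration(app_slug, version, image_ref):
    """Accept only `repository@sha256:<digest>` references."""
    if "@" not in image_ref:
        raise ValueError("artifact must be referenced by digest, not a mutable tag")
    repository, digest = image_ref.rsplit("@", 1)
    if not DIGEST_RE.match(digest):
        raise ValueError(f"not an OCI sha256 digest: {digest}")
    return {
        "app_slug": app_slug,
        "version": version,
        "repository": repository,
        "digest": digest,
    }
```

Rejecting tags at the boundary means the control plane only ever records immutable, content-addressed artifact metadata.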

Billing Model For App Teams

Billing is part of the app integration boundary, not a later add-on.

App teams need to decide:

  • what runtime signals are billable,
  • what resource or usage unit those signals map to,
  • how those signals preserve project and app-instance attribution.

The platform owns:

  • ledger behavior,
  • tenant/project attribution,
  • cost-control policy,
  • usage record normalization and charging.

The app owns:

  • deciding when the runtime is actually consuming billable capacity,
  • producing or reconciling those usage signals into the platform contract.

For scheduler-style apps such as Slurm, likely billable signals are tied more to:

  • active worker capacity,
  • reservation or contribution windows,
  • runtime-owned capacity overlays,

than to the controller process itself.
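A usage signal that preserves both attributions might look like the sketch below. Every field name and the `gpu_hours` unit are assumptions for illustration; the real usage record contract is platform-owned.

```python
# Sketch: a worker-capacity usage signal that carries both project and
# app-instance attribution. All field names and units are illustrative.

def worker_capacity_signal(project_id, app_instance_id, worker_count,
                           gpu_per_worker, window_start_s, window_end_s):
    """Map an active-worker window to a normalized usage quantity."""
    hours = (window_end_s - window_start_s) / 3600.0
    return {
        "project_id": project_id,            # platform attribution
        "app_instance_id": app_instance_id,  # app-instance attribution
        "unit": "gpu_hours",
        "quantity": worker_count * gpu_per_worker * hours,
        "window": {"start": window_start_s, "end": window_end_s},
    }
```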

End-to-End Integration Checklist

For a real app team, the minimum end-to-end path should be:

  1. publish and entitle the app,
  2. create/select the operator service account,
  3. create/select the bootstrap credential if bootstrap SSH is needed,
  4. select existing allocations or request allocations through the supported platform path,
  5. create the app instance with explicit placement intent,
  6. let the app controller reconcile bootstrap trust and install runtime software,
  7. report runtime/member/member-operation status back through public APIs,
  8. preserve audit, correlation, and billing attribution throughout.

SDK And Client Model

The public API remains authoritative.

Current state

Today the example app is using the API directly.

That is acceptable and expected while the app workflow is being defined.

Intended SDK model

SDKs should provide:

  • typed clients generated from the public contract,
  • auth helpers,
  • polling helpers,
  • ergonomic wrappers around public operations,
  • no hidden private control paths.

For app developers:

  • API-first is the source of truth,
  • the SDK is a convenience layer,
  • app-specific helper libraries may be added later if multiple apps prove the same need.
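A polling helper is a good example of what "convenience layer, not control path" means in practice. The sketch below assumes an injected `get_operation` callable standing in for a generated typed-client call, and the `state` values are illustrative.

```python
import time

# Sketch of an SDK-style polling helper layered over the public contract.
# get_operation stands in for a generated typed-client call; state names
# are illustrative.

def wait_for_operation(get_operation, poll_interval_s=0.0, timeout_s=60.0):
    """Poll an operation until it reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        op = get_operation()
        if op["state"] in ("succeeded", "failed"):
            return op
        time.sleep(poll_interval_s)
    raise TimeoutError("operation did not reach a terminal state")
```

Nothing here bypasses the public API: the helper only removes boilerplate around calls an app could already make directly.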

UI library direction

It is reasonable to provide shared app-platform UI helpers later, for example:

  • deploy form extension contracts,
  • instance detail panel contracts,
  • shared cards for generic member/operation state,
  • shared service-account and access-credential picker components.

But those should come after the workflow is stable.

The current codebase now has the beginning of this model:

  • an extension registry for app-shell matching,
  • a deploy-fields extension point,
  • declarative deploy metadata for required platform-owned inputs and placement intent,
  • deploy metadata that can express allocation cardinality, such as single-allocation vs multi-allocation flows,
  • an instance-panels extension point,
  • reusable generic instance cards owned by the platform shell,
  • reusable picker primitives for platform-owned inputs inside app extensions.

Current Implementation Reality

Today:

  • the public API is ahead of the UI and SDK integration model,
  • Slurm proves the controller pattern with project-scope reconcile by app slug,
  • deploy and worker actions now use explicit placement primitives,
  • UI integration is real for the single-node Slurm path but not yet generalized,
  • SDK usage for apps is still effectively API-direct.

So the correct message to app developers today is:

  • build against the public API,
  • expect the app UI extension and SDK convenience model to improve,
  • do not assume the current Slurm proof path is already the final polished product workflow.

Readiness Test

We should consider the example-app workflow ready for external app developers only when:

  1. deploy collects the real required app inputs,
  2. app machine access is productized,
  3. service-account usage is explicit and documented,
  4. app UI integration points are documented,
  5. app-owned state expectations are documented,
  6. operator recovery paths are explicit,
  7. direct API usage and SDK expectations are both clear.

Immediate Outcome

The platform should now be judged against this workflow:

  • if something is only working because of a proof shortcut, it is not yet app-developer-ready,
  • if the workflow is clear and implementable without hidden knowledge, it is ready to hand to app teams.