Example App Developer Reference Workflow v1

Purpose

Define the reference workflow for an independently deployed app on GPUaaS.

This document exists to answer:

  1. what the product workflow should look like for an operator,
  2. what an app developer is expected to build,
  3. how app UI integrates with platform UI,
  4. how SDK/client usage should look,
  5. whether the current platform is ready for external app developers.

This is not a Slurm-only document. Slurm is the first proving example app for this workflow.

Read This First

Use these as the companion references:

  1. doc/api/openapi.draft.yaml
  2. doc/architecture/App_Control_Plane_v1.md
  3. doc/architecture/Build_an_App_for_GPUaaS_v1.md
  4. doc/architecture/App_Platform_Quickstart_v1.md
  5. doc/architecture/External_App_Team_Integration_Guide_v1.md
  6. doc/architecture/Slurm_First_Slice_Platform_App_Split_v1.md
  7. doc/architecture/Slurm_Product_Workflow_And_Gap_Assessment_v1.md
  8. doc/architecture/Slurm_Tenant_Scope_Semantics_v1.md
  9. doc/architecture/CLI_Agent_Operable_Control_Plane_v2.md
  10. doc/architecture/CLI_PythonSDK_v1_Plan.md

Decision Summary

  1. GPUaaS owns the platform shell, common contracts, IAM, audit, and reusable primitives.
  2. App teams own runtime-specific controller logic, runtime-specific UI extensions, and any app-owned state they need.
  3. The first-class integration model is still API-first.
  4. SDKs and UI libraries are convenience layers over the public contract, not the source of truth.
  5. Slurm should be treated as the first example app for this workflow, not a special-case internal path.

Reference Product Workflow

Operator workflow

The operator workflow should be:

  1. Go to App Catalog.
  2. Find an entitled app.
  3. Click Deploy.
  4. Fill the app-specific deploy form fields that are actually required.
  5. Submit the deployment.
  6. Observe lifecycle progress in the platform shell.
  7. Use runtime-specific day-2 controls from the instance page.

Current platform-shell behavior:

  • the catalog UI shows only apps enabled for the active project,
  • entitlement management remains a separate tenant/project admin surface,
  • the instance page can host app-specific bootstrap credential controls where the app extension needs them.

App-controller workflow

The app-controller workflow should be:

  1. Run independently from GPUaaS core.
  2. Authenticate with a project-scoped service account.
  3. Read app instances, members, and operations through public APIs only.
  4. Acquire machine access through a supported platform path.
  5. Install or configure runtime software on the selected nodes.
  6. Report progress and status back through public APIs.
  7. Continue reconciling until the desired runtime state is reached.
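The loop above can be sketched in a few lines. This is a minimal illustration, not the real controller: `fetch_instance`, `apply_runtime_change`, and `report_status` stand in for public-API calls made under the app's project-scoped service-account identity, and the `desired_state`/`observed_state` field names are assumptions rather than the actual contract.

```python
import time

# Hypothetical sketch of the app-controller reconcile loop. The injected
# callables stand in for public-API calls under service-account identity;
# field names are illustrative, not the real contract.

def reconcile_once(fetch_instance, apply_runtime_change, report_status):
    """Run one reconcile pass; return True once desired state is reached."""
    instance = fetch_instance()
    desired = instance["desired_state"]
    observed = instance["observed_state"]
    if observed == desired:
        report_status(phase="ready", detail="runtime matches desired state")
        return True
    apply_runtime_change(desired)  # install/configure runtime software
    report_status(phase="reconciling", detail=f"moving to {desired}")
    return False

def reconcile_until_ready(fetch_instance, apply_runtime_change, report_status,
                          interval_s=0.0, max_passes=10):
    """Keep reconciling until the runtime converges or passes run out."""
    for _ in range(max_passes):
        if reconcile_once(fetch_instance, apply_runtime_change, report_status):
            return True
        time.sleep(interval_s)
    return False
```

The key property is that every pass reads state and reports status through the public API only, so the controller can crash and resume without platform-side special casing.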

Slurm As The Reference Example

Slurm is the current example of how an app should consume GPUaaS building blocks without turning those blocks into Slurm-specific platform behavior.

The important split is:

Platform owns

  • app catalog and entitlement policy
  • project app instance record
  • project service-account identity
  • project-scoped access-credential custody and delivery
  • allocation and placement read models
  • generic member and member-operation envelopes
  • audit, correlation, and lifecycle read surfaces

Slurm app owns

  • controller reconcile logic
  • Slurm package install/configure logic
  • runtime-native slurmctld / slurmd health checks
  • worker topology decisions within the explicit placement chosen by the operator
  • any Slurm-specific runtime panels and day-2 controls

What this proves for other app teams

If another app needs:

  • a project-scoped machine identity,
  • one or more selected allocations,
  • a delivered bootstrap credential,
  • status reporting back into the app instance,

then it should be able to follow the same pattern without asking the platform to absorb runtime-specific SME logic.

Platform UI vs App UI

The product should expose two UI layers.

Platform UI owns

  • app catalog
  • app entitlement management
  • generic deploy entrypoint
  • app instance inventory
  • generic app lifecycle status
  • generic member and operation history
  • service account and access-credential selection if those are platform primitives
  • project allocation selection surfaces when placement is allocation-based

App UI owns

  • app-specific deploy fields
  • app-specific runtime summary panels
  • app-specific topology and node-selection fields
  • app-specific worker or role actions
  • app-specific debugging and day-2 operations

Integration model

The intended model is:

  1. platform shell hosts the common navigation and instance pages,
  2. app contributes structured UI extensions inside that shell,
  3. app-specific UI is rendered in defined extension points rather than by replacing the full platform UI.

For example:

  • the deploy flow starts in the platform catalog,
  • the app contributes deploy form extensions,
  • the instance detail page shows generic cards plus app-specific panels.

Current shell extension points

The platform shell now has a first real extension seam in the web app:

  • catalog deploy modal asks the app extension for deploy-specific fields,
  • catalog deploy orchestration can reason about required platform-owned inputs from extension metadata,
  • catalog deploy can already render metadata-driven single-allocation and multi-allocation inputs for simpler apps without custom deploy components,
  • catalog deploy can also render metadata-driven service-account and access-credential inputs for simpler apps without custom deploy components,
  • instance detail page asks the app extension for runtime and action panels,
  • instance detail page can also host app-specific bootstrap credential lifecycle actions when the app uses platform access credentials,
  • generic lifecycle, members, and operation history stay platform-owned,
  • common platform-owned picker primitives can be reused inside app extensions for:
      • service-account selection,
      • access-credential selection,
      • single-allocation selection,
      • multi-allocation selection.

For the current Slurm example this means:

  • the platform shell owns the page route, generic lifecycle actions, and generic history cards,
  • the Slurm extension provides the deploy fields and the Slurm runtime/worker panels,
  • the registry maps app identity (slug / runtime_backend) to those extension components.

This is still the first implementation cut, not the final long-term plugin system, but it is now a real structured boundary rather than ad hoc page-local branching.
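To make the metadata-driven deploy seam concrete, here is a hedged sketch of what the declarative deploy metadata and the shell's required-input check could look like. The field names (`requires_service_account`, `allocation_cardinality`, and so on) are invented for illustration and are not the real extension contract.

```python
# Hypothetical shape of the declarative deploy metadata an app extension
# registers with the shell. All field names here are illustrative.

SLURM_DEPLOY_METADATA = {
    "app_slug": "slurm",
    "requires_service_account": True,
    "requires_access_credential": True,
    "allocation_cardinality": "multi",  # "single" or "multi"
}

def missing_platform_inputs(metadata, form):
    """Return the platform-owned inputs the deploy form still needs."""
    missing = []
    if metadata["requires_service_account"] and not form.get("service_account_id"):
        missing.append("service_account_id")
    if metadata["requires_access_credential"] and not form.get("access_credential_id"):
        missing.append("access_credential_id")
    allocations = form.get("allocation_ids") or []
    if metadata["allocation_cardinality"] == "single" and len(allocations) != 1:
        missing.append("exactly one allocation")
    if metadata["allocation_cardinality"] == "multi" and not allocations:
        missing.append("at least one allocation")
    return missing
```

The point of the sketch is the division of labor: the app declares what it needs, and the shell renders and validates the platform-owned pickers without app-specific deploy components.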

Deploy Form Expectations

The deploy form should collect the real inputs the app needs.

For a clustered scheduler-style app, likely operator inputs are:

  • instance name
  • version
  • operating mode
  • controller allocation
  • worker allocation or worker allocation list
  • whether controller and worker can share a node
  • access credential selection
  • service account selection or creation

For the current Slurm reference path specifically:

  • the operator chooses one controller allocation,
  • the operator either reuses that same allocation for the initial worker or explicitly selects one or more separate worker allocations for the initial worker set.

If the operator must know or choose it, it should be explicit in the product workflow. If the app can safely default it, keep it as an app-side default.

Placement source of truth

The platform does not need a second app-specific node picker contract if the existing allocation read model already provides the needed placement data.

For the current model, app deploy and worker-add flows should treat GET /api/v1/projects/{project_id}/allocations and GET /api/v1/projects/{project_id}/allocations/{allocation_id} as the placement source of truth for the selected project.

That means the app workflow should select from allocated nodes already visible in scope, then carry explicit allocation_intent into app member operations.

Deploy-time placement should likewise be carried as first-class placement_intent on the app instance contract rather than hidden inside opaque app config.

The product should not rely on inferred host reuse once multiple candidate allocations exist.

Current example-app proof:

  • deploy uses first-class placement_intent,
  • the app runtime seeds initial worker-add member operations from placement_intent.worker_allocation_ids,
  • worker add uses explicit allocation_intent.allocation_id,
  • the app controller resolves host and username details from the selected allocation through public APIs.
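The placement_intent-to-member-operation step can be sketched as a pure transformation. The payload field names below mirror the ones named in this section but the exact envelope shape is an assumption, not the real contract.

```python
# Sketch: seed one worker-add member operation per worker allocation
# selected at deploy time. Envelope field names are illustrative.

def seed_worker_operations(placement_intent):
    """Build explicit worker-add operations from deploy-time placement."""
    return [
        {
            "operation": "worker_add",
            "allocation_intent": {"allocation_id": allocation_id},
        }
        for allocation_id in placement_intent["worker_allocation_ids"]
    ]
```

Because every operation carries an explicit allocation_id, nothing downstream has to infer host reuse once multiple candidate allocations exist.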

App-Owned State

App developers may need their own state store, including a database.

This is allowed and expected for serious managed apps.

Platform remains source of truth for

  • app instances
  • members
  • member operations
  • IAM and audit
  • platform-owned lifecycle state

App may own state for

  • runtime reconciliation bookkeeping
  • runtime-native metadata
  • project-scope vs tenant-scope app mappings
  • app policy and runtime config mappings
  • scheduler/runtime object mappings
  • app-specific health and recovery state

Rule: app-owned state must not replace platform-owned control-plane truth.

Service Account Model For Apps

Each independently deployed app controller should use a project-scoped service account.

The workflow should be explicit:

  1. operator chooses or creates the service account for the app,
  2. app controller uses that service account to mint short-lived tokens,
  3. all app/platform interaction happens under that machine identity.

The product should not hide the existence of that service account if it is required for the app to run.
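The mint-and-refresh pattern in step 2 can be sketched as a small token source. `mint_token` stands in for the real token-minting call under the service account; the return shape and the refresh margin are assumptions for illustration.

```python
import time

# Sketch of the short-lived service-account token pattern: mint a bearer
# token, cache it, and refresh it shortly before expiry. mint_token is a
# stand-in for the real token-minting call.

class TokenSource:
    def __init__(self, mint_token, refresh_margin_s=60):
        self._mint = mint_token          # returns (token, expires_at_epoch)
        self._margin = refresh_margin_s
        self._token = None
        self._expires_at = 0.0

    def bearer(self, now=None):
        """Return a token that is still valid for at least the margin."""
        now = time.time() if now is None else now
        if self._token is None or now >= self._expires_at - self._margin:
            self._token, self._expires_at = self._mint()
        return self._token
```

Every public-API call the controller makes then goes out under this machine identity, which keeps audit attribution clean.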

When a service account is required

Use a project-scoped service account when the app has a continuously running controller, reconciler, or operator process that must:

  • read app instances or shared runtime state,
  • read allocations selected by the operator,
  • retrieve delivered credentials,
  • write lifecycle/member/member-operation status back to the platform.

Do not require a service account only for the human operator to click Deploy.

Practical rule:

  • human user identity is for setup, approval, and day-2 operator actions,
  • service-account identity is for long-running app automation.

For Slurm today:

  • the operator selects the service account at deploy time,
  • the Slurm controller mints short-lived bearer tokens from that service account,
  • the controller uses those tokens for all subsequent public API calls.

Machine Access Model

Apps need a supported node access path for bootstrap and runtime management.

The intended product model is:

  • the platform provides the secure machine access primitive,
  • the app uses that primitive,
  • the app installs and manages runtime software itself.

Current important direction:

  • do not expand the node-agent into a universal app-runtime executor,
  • do provide correct credential custody, access delivery, and scoped retrieval.

Minimum core expectation

For app developers, the minimum supported platform path should be:

  1. project-scoped access credential metadata and custody under the public API,
  2. Vault-backed secret write on create and rotate,
  3. scoped delivery back to the app controller without plaintext reveal in the normal API response,
  4. app-managed bootstrap trust reconcile onto the selected allocation user when the app needs host bootstrap SSH,
  5. audit of credential lifecycle and delivery actions,
  6. service-account-compatible retrieval for the app controller.

This core slice now exists for project-scoped apps. Remaining work is mainly:

  • tenant-scoped or multi-project variants,
  • broader operator UX polish beyond the first bootstrap credential lifecycle controls,
  • polishing the reference app around that supported path.
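The "no plaintext reveal in the normal API response" expectation can be sketched as a two-step retrieval: a metadata read that must never contain secret material, followed by a dedicated delivery call. The endpoint paths and field names below are assumptions for illustration, not the published contract.

```python
# Sketch of scoped credential delivery: the normal read returns metadata
# only; secret material comes back only through a separate delivery call
# made under service-account identity. Paths and fields are illustrative.

def get_bootstrap_key(api_get, project_id, credential_id):
    """Fetch credential metadata, then the delivered private key."""
    base = f"/api/v1/projects/{project_id}/access-credentials/{credential_id}"
    meta = api_get(base)
    assert "private_key" not in meta, "normal read must not reveal plaintext"
    delivery = api_get(f"{base}/delivery")
    return meta["public_key"], delivery["private_key"]
```

Splitting the read this way keeps routine metadata listing auditable and low-risk while the delivery path stays narrow and service-account gated.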

Bootstrap SSH with Vault-backed custody

The intended flow is:

  1. operator creates or selects a project-scoped SSH access credential,
  2. the platform stores the private key material in Vault-backed custody,
  3. the credential is bound to the app instance,
  4. the app controller retrieves the credential through the public delivery path under service-account identity,
  5. the app controller reconciles only its own bootstrap public key onto the selected allocation user,
  6. the app controller uses the delivered private key to bootstrap the host.

Important ownership rule:

  • the app may manage only the bootstrap trust it owns,
  • it must not rewrite unrelated operator SSH keys on the allocation.

This is exactly the gap the Slurm reference flow exposed and then closed.
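One common way to honor that ownership rule is a marker-delimited block in authorized_keys that the controller reconciles idempotently, leaving every other line alone. The marker text below is an assumption; only the technique (own-block-only rewrites) reflects the rule above.

```python
# Sketch: reconcile only the app-owned bootstrap key, delimited by
# markers, and leave all other authorized_keys lines untouched.

BEGIN = "# BEGIN gpuaas-app-bootstrap"
END = "# END gpuaas-app-bootstrap"

def reconcile_authorized_keys(existing_text, bootstrap_public_key):
    """Replace (or append) only the app-owned marker block."""
    lines = existing_text.splitlines()
    if BEGIN in lines and END in lines:
        start, stop = lines.index(BEGIN), lines.index(END)
        lines = lines[:start] + lines[stop + 1:]   # drop only our own block
    lines += [BEGIN, bootstrap_public_key, END]
    return "\n".join(lines) + "\n"
```

Running the reconcile twice converges to the same file, and operator-managed keys outside the markers are never rewritten.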

Allocation, Provisioning, And App Responsibility

App teams should distinguish three separate stages:

1. Node provisioning

This is platform-owned.

Examples:

  • MAAS reimage or deploy
  • node bootstrap script delivery
  • node-agent enrollment
  • allocation runtime user creation

An app should not need to provision the node OS itself.

2. Allocation selection or acquisition

This can vary by app:

  • some apps assume the operator already has active allocations and only select from them,
  • some apps may later orchestrate allocation acquisition through platform APIs,
  • even then, the app is requesting allocations, not provisioning nodes.

The clean boundary is:

  • the platform provisions and owns the node lifecycle,
  • the app may select or request allocations as inputs to runtime placement.

3. App bootstrap after allocation is active

This is app-owned.

Examples:

  • install runtime packages
  • write runtime-native config
  • start runtime services
  • validate runtime health
  • add workers or other app-native topology members

For Slurm:

  • the allocation must already exist,
  • the Slurm controller then bootstraps the selected host after the allocation is active,
  • worker contributions are app operations layered on top of explicit allocation placement.

Artifact And Registry Example

The Slurm reference flow currently proves the controller/bootstrap path more than the registry path, but app developers should still understand the intended artifact model.

The platform artifact flow is:

  1. platform admin publishes a catalog version,
  2. app team publishes immutable artifacts for that version,
  3. the control plane records immutable artifact metadata,
  4. runtime controllers consume version/artifact metadata through the public API.

For OCI-based apps:

  • use publish-intent APIs,
  • push directly to the platform-owned registry,
  • register the pushed digest with the control plane,
  • deploy by digest-backed metadata, not mutable tags.

For non-OCI apps:

  • use the non-OCI artifact direction when that contract is productized,
  • do not smuggle artifact sources through opaque runtime config.

Even if Slurm itself is not yet leaning on the registry path, app teams should design around immutable artifact metadata because that is the reusable platform contract.
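The "deploy by digest, not mutable tags" rule can be enforced mechanically at registration time. The record shape below is an assumption, not the real publish-intent contract; the digest format check itself follows the standard OCI `sha256:<64 hex>` form.

```python
import re

# Sketch: build a digest-pinned artifact registration record and reject
# mutable tag references outright. The record shape is illustrative.

DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")

def artifact_registration(app_slug, version, image_ref):
    """Accept only `repository@sha256:<digest>` references."""
    if "@" not in image_ref:
        raise ValueError("artifact must be referenced by digest, not a mutable tag")
    repository, digest = image_ref.rsplit("@", 1)
    if not DIGEST_RE.match(digest):
        raise ValueError(f"not an OCI sha256 digest: {digest}")
    return {
        "app_slug": app_slug,
        "version": version,
        "repository": repository,
        "digest": digest,
    }
```

Rejecting tags at the boundary means the control plane only ever records immutable, content-addressed artifact metadata.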

Billing Model For App Teams

Billing is part of the app integration boundary, not a later add-on.

App teams need to decide:

  • what runtime signals are billable,
  • what resource or usage unit those signals map to,
  • how those signals preserve project and app-instance attribution.

The platform owns:

  • ledger behavior,
  • tenant/project attribution,
  • cost-control policy,
  • usage record normalization and charging.

The app owns:

  • deciding when the runtime is actually consuming billable capacity,
  • producing or reconciling those usage signals into the platform contract.

For scheduler-style apps such as Slurm, likely billable signals are tied more to:

  • active worker capacity,
  • reservation or contribution windows,
  • runtime-owned capacity overlays,

than to the controller process itself.
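A usage signal that preserves both attributions might look like the sketch below. Every field name and the `gpu_hours` unit are assumptions for illustration; the real usage record contract is platform-owned.

```python
# Sketch: a worker-capacity usage signal that carries both project and
# app-instance attribution. All field names and units are illustrative.

def worker_capacity_signal(project_id, app_instance_id, worker_count,
                           gpu_per_worker, window_start_s, window_end_s):
    """Map an active-worker window to a normalized usage quantity."""
    hours = (window_end_s - window_start_s) / 3600.0
    return {
        "project_id": project_id,            # platform attribution
        "app_instance_id": app_instance_id,  # app-instance attribution
        "unit": "gpu_hours",
        "quantity": worker_count * gpu_per_worker * hours,
        "window": {"start": window_start_s, "end": window_end_s},
    }
```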

End-to-End Integration Checklist

For a real app team, the minimum end-to-end path should be:

  1. publish and entitle the app,
  2. create/select the operator service account,
  3. create/select the bootstrap credential if bootstrap SSH is needed,
  4. select existing allocations or request allocations through the supported platform path,
  5. create the app instance with explicit placement intent,
  6. let the app controller reconcile bootstrap trust and install runtime software,
  7. report runtime/member/member-operation status back through public APIs,
  8. preserve audit, correlation, and billing attribution throughout.

SDK And Client Model

The public API remains authoritative.

Current state

Today the example app is using the API directly.

That is acceptable and expected while the app workflow is being defined.

Intended SDK model

SDKs should provide:

  • typed clients generated from the public contract,
  • auth helpers,
  • polling helpers,
  • ergonomic wrappers around public operations,
  • no hidden private control paths.

For app developers:

  • API-first is the source of truth,
  • the SDK is a convenience layer,
  • app-specific helper libraries may be added later if multiple apps prove the same need.
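A polling helper is a good example of what "convenience layer, not control path" means in practice. The sketch below assumes an injected `get_operation` callable standing in for a generated typed-client call, and the `state` values are illustrative.

```python
import time

# Sketch of an SDK-style polling helper layered over the public contract.
# get_operation stands in for a generated typed-client call; state names
# are illustrative.

def wait_for_operation(get_operation, poll_interval_s=0.0, timeout_s=60.0):
    """Poll an operation until it reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        op = get_operation()
        if op["state"] in ("succeeded", "failed"):
            return op
        time.sleep(poll_interval_s)
    raise TimeoutError("operation did not reach a terminal state")
```

Nothing here bypasses the public API: the helper only removes boilerplate around calls an app could already make directly.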

UI library direction

It is reasonable to provide shared app-platform UI helpers later, for example:

  • deploy form extension contracts,
  • instance detail panel contracts,
  • shared cards for generic member/operation state,
  • shared service-account and access-credential picker components.

But those should come after the workflow is stable.

The current codebase now has the beginning of this model:

  • an extension registry for app-shell matching,
  • a deploy-fields extension point,
  • declarative deploy metadata for required platform-owned inputs and placement intent,
  • deploy metadata that can express allocation cardinality, such as single-allocation vs multi-allocation flows,
  • an instance-panels extension point,
  • reusable generic instance cards owned by the platform shell,
  • reusable picker primitives for platform-owned inputs inside app extensions.

Current Implementation Reality

Today:

  • the public API is ahead of the UI and SDK integration model,
  • Slurm proves the controller pattern with project-scope reconcile by app slug,
  • deploy and worker actions now use explicit placement primitives,
  • UI integration is real for the single-node Slurm path but not yet generalized,
  • SDK usage for apps is still effectively API-direct.

So the correct message to app developers today is:

  • build against the public API,
  • expect the app UI extension and SDK convenience model to improve,
  • do not assume the current Slurm proof path is already the final polished product workflow.

Readiness Test

We should consider the example-app workflow ready for external app developers only when:

  1. deploy collects the real required app inputs,
  2. app machine access is productized,
  3. service-account usage is explicit and documented,
  4. app UI integration points are documented,
  5. app-owned state expectations are documented,
  6. operator recovery paths are explicit,
  7. direct API usage and SDK expectations are both clear.

Immediate Outcome

The platform should now be judged against this workflow:

  • if something is only working because of a proof shortcut, it is not yet app-developer-ready,
  • if the workflow is clear and implementable without hidden knowledge, it is ready to hand to app teams.