External App Team Integration Guide v1

Purpose

Give an external app team a clear integration boundary for building on GPUaaS without depending on hidden platform internals.

This guide is written for teams that want to:

  • integrate their own control software with GPUaaS
  • deploy and operate their own application logic independently
  • understand what GPUaaS provides versus what their team must implement

This is an integration guide, not a runtime-specific guide. It is intended to help teams such as BYOM teams, scheduler teams, or other platform-app teams understand the public contract and the ownership boundary before implementation starts.

Read This First

Canonical references:

  1. REST contract: doc/api/openapi.draft.yaml
  2. Event contract: doc/api/asyncapi.draft.yaml
  3. Build path: doc/architecture/Build_an_App_for_GPUaaS_v1.md
  4. App quickstart: doc/architecture/App_Platform_Quickstart_v1.md
  5. App control plane model: doc/architecture/App_Control_Plane_v1.md
  6. Service accounts: doc/architecture/Service_Account_Model.md
  7. Role and policy model: doc/architecture/Role_and_Policy_Lifecycle_Model.md
  8. Core/app split for first clustered app: doc/architecture/Slurm_First_Slice_Platform_App_Split_v1.md
  9. Node operations baseline: doc/architecture/Node_Operations_and_Agent_Lifecycle_v1.md
  10. Node bootstrap trust delivery: doc/architecture/Node_Bootstrap_Trust_Delivery_v1.md
  11. Product workflow and readiness gap assessment: doc/architecture/Slurm_Product_Workflow_And_Gap_Assessment_v1.md
  12. App UI extension contract: doc/architecture/App_UI_Extension_Model_v1.md
  13. Tenant-scoped Slurm semantics: doc/architecture/Slurm_Tenant_Scope_Semantics_v1.md
  14. Tenant-shared attachment model: doc/architecture/App_Tenant_Shared_Attachment_Model_v1.md
  15. External worker contract direction: doc/architecture/App_Runtime_External_Worker_Contract_v1.md
  16. Starter pack: doc/architecture/App_Developer_Starter_Pack_v1.md
  17. Manifest registration guide: doc/architecture/App_Manifest_Registration_Guide_v1.md
  18. Launchable OCI workload profile contract: doc/architecture/Launchable_OCI_Workload_Profile_Contract_v1.md
  19. vLLM OpenAI Compose first slice: doc/product/VLLM_OpenAI_Compose_First_Slice_v1.md

The Boundary In One Page

GPUaaS provides the platform substrate. Your team provides the application logic.

GPUaaS owns

  • tenant, project, and service-account identity model
  • policy and IAM enforcement
  • app catalog and app-instance resource model
  • artifact and runtime secret primitives
  • allocation and node lifecycle primitives
  • node onboarding/bootstrap path
  • audit, correlation, and public read-model surfaces
  • optional secure post-bootstrap access mechanisms such as SSH-key-based access and Vault-backed secret delivery

Your app team owns

  • app-specific controller or operator logic
  • runtime-specific install/configure steps after platform onboarding
  • runtime-specific health, reconciliation, and recovery behavior
  • app-specific control channels after bootstrap
  • runtime-specific runbooks and SME logic

Key design rule

If a behavior requires runtime SME knowledge, it belongs to the app team unless multiple app classes prove the same reusable platform need.

What GPUaaS Exposes As A Stable Public Surface

1. API contract

GPUaaS is contract-first.

Use:

  • doc/api/openapi.draft.yaml
  • doc/api/asyncapi.draft.yaml

These are the source of truth for:

  • endpoints
  • request/response schemas
  • auth expectations
  • error envelopes
  • event payloads

Do not build against:

  • undocumented internal routes
  • direct database access
  • backend implementation details
  • direct Keycloak behavior unless explicitly documented as supported platform contract

2. Scope model

GPUaaS uses three scope levels:

  • tenant (org)
  • project
  • user or service account

Ownership shape:

Tenant (org)
  -> Project
    -> User memberships
    -> Service accounts
    -> App instances

Practical rules:

  1. app instances are project-owned
  2. service accounts are project-scoped
  3. cross-project and cross-tenant access is denied by default
  4. project and tenant boundaries are enforced by the platform, not by app-team convention
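The deny-by-default rule can be sketched as a tiny authorization check. This is illustrative only: the claim names ("tenant_id", "project_id") are assumptions for the example, not the documented GPUaaS token schema — the service-account and role/policy docs define the real claim semantics.

```python
# Hypothetical sketch of deny-by-default project scoping.
# Claim names are illustrative, not the real GPUaaS token schema.

def is_request_authorized(token_claims: dict, tenant_id: str, project_id: str) -> bool:
    """Allow only when the token's tenant AND project match the target.

    Anything else — missing claims, another project, another tenant — is
    denied, mirroring the cross-project/cross-tenant rule above.
    """
    return (
        token_claims.get("tenant_id") == tenant_id
        and token_claims.get("project_id") == project_id
    )


claims = {"tenant_id": "org-a", "project_id": "proj-1"}
assert is_request_authorized(claims, "org-a", "proj-1")       # same project: allowed
assert not is_request_authorized(claims, "org-a", "proj-2")   # cross-project: denied
assert not is_request_authorized(claims, "org-b", "proj-1")   # cross-tenant: denied
```

The platform enforces this server-side; the point of the sketch is that app code should never assume a token works outside its own project scope.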

Use these docs for details:

  • doc/architecture/Role_and_Policy_Lifecycle_Model.md
  • doc/architecture/Service_Account_Model.md
  • doc/architecture/ERD.md

3. Authentication and machine access

For backend-to-backend automation, use project-scoped service accounts.

Do not assume:

  • long-lived bearer tokens
  • direct user tokens for automation
  • direct IdP-specific integration as the primary contract

Current intended pattern:

  1. a human admin creates a project-scoped service account
  2. the service account mints short-lived access tokens
  3. automation calls public GPUaaS APIs with that token

Use this model when your app has a controller or operator process that keeps running after the initial human deploy action. That controller should not keep reusing the human operator token.
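A minimal controller-side sketch of that pattern: cache a short-lived token and re-mint it before expiry instead of reusing the human operator's token. The `mint_token` callable below stands in for whatever token endpoint the service-account docs define; its `(token, expires_at)` shape is an assumption for illustration.

```python
import time
from typing import Callable

# Hedged sketch: short-lived service-account tokens with early refresh.
# The mint_token signature is an assumption, not the documented API.

class ServiceAccountTokenSource:
    def __init__(self, mint_token: Callable[[], tuple], skew: float = 30.0):
        self._mint = mint_token   # returns (access_token, expires_at_epoch_seconds)
        self._skew = skew         # refresh this many seconds before expiry
        self._token = None
        self._expires_at = 0.0

    def token(self) -> str:
        # Re-mint when the cached token is missing or close to expiring;
        # never keep a token past its lifetime.
        if self._token is None or time.time() >= self._expires_at - self._skew:
            self._token, self._expires_at = self._mint()
        return self._token


# Usage with a fake minting function (no network involved):
calls = {"n": 0}

def fake_mint():
    calls["n"] += 1
    return (f"tok-{calls['n']}", time.time() + 300)

source = ServiceAccountTokenSource(fake_mint)
assert source.token() == "tok-1"
assert source.token() == "tok-1"   # cached: no second mint within the lifetime
assert calls["n"] == 1
```

The long-running controller process then attaches `source.token()` to each public API call rather than storing any credential long-term.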

What external teams should rely on:

  • the service-account token flow
  • project scoping in the token and authorization path
  • canonical error responses

What external teams should not need to rely on:

  • raw internal Keycloak topology
  • undocumented JWT claim conventions beyond documented service-account semantics

JWT claim details can be useful as reference, but they should not be the primary integration contract. The primary contract is:

  • how to authenticate
  • which endpoints are allowlisted
  • which project scope is authorized
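One way to make "which endpoints are allowlisted" explicit in automation is to treat the allowlist as configuration rather than inferring anything from JWT internals. The endpoint patterns below are invented examples, not the real GPUaaS endpoint list.

```python
from fnmatch import fnmatchcase

# Illustrative only: invented endpoint patterns, not the real GPUaaS allowlist.
# The point: automation should hold an explicit "what am I allowed to call"
# model instead of reverse-engineering it from token claims.

ALLOWED = [
    ("GET", "/v1/projects/*/app-instances"),
    ("POST", "/v1/projects/*/app-instances/*/operations"),
]

def is_allowlisted(method: str, path: str) -> bool:
    return any(method == m and fnmatchcase(path, pattern) for m, pattern in ALLOWED)


assert is_allowlisted("GET", "/v1/projects/proj-1/app-instances")
assert not is_allowlisted("DELETE", "/v1/projects/proj-1/app-instances")
assert not is_allowlisted("GET", "/internal/db")
```

Failing fast in the client when a call falls outside the agreed surface keeps accidental coupling to internal routes from creeping in.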

4. App control plane

Use the app control plane when your software is modeled as an app instance managed through the GPUaaS lifecycle.

This includes:

  • the app catalog
  • app entitlements
  • app instances
  • app artifacts
  • runtime secrets
  • generic member read and member-operation envelopes for clustered apps

Use:

  • doc/architecture/Build_an_App_for_GPUaaS_v1.md
  • doc/architecture/App_Platform_Quickstart_v1.md
  • doc/architecture/Slurm_First_Slice_Platform_App_Split_v1.md
  • doc/architecture/App_Runtime_External_Worker_Contract_v1.md

Important clarification:

  • the long-term contract for an external app team is a narrow runtime API plus scoped machine identity, not "modify platform-core runtime switch logic"
  • transport choices such as NATS delivery or app-team-owned Temporal workflows are secondary to that runtime API boundary

5. Node onboarding and lifecycle substrate

GPUaaS already has a platform-owned direction for node onboarding/bootstrap and node lifecycle.

This is the important boundary:

  • GPUaaS should get a node into a platform-ready state
  • your team should install and manage app-specific software after that

GPUaaS should not become the place where every external app team pushes arbitrary runtime logic into the node agent.

The preferred model is:

  1. platform onboarding/bootstrap prepares the node
  2. controlled post-bootstrap access is available when required
  3. your software installs and manages its own runtime components

This keeps the node agent narrow and supportable.

6. Post-bootstrap access

Some app teams need host access after onboarding.

GPUaaS can support that through:

  • controlled SSH-key-based access
  • Vault-backed custody/delivery of private key material through project-scoped access-credential APIs

This is preferable to expanding the node agent into a universal remote execution surface for every app team.

Boundary rule:

  • the platform owns the secure access mechanism
  • the app team owns what it chooses to install and manage through that access

Current proven example:

  1. a project-scoped app controller securely delivers a project SSH access credential through public APIs under service-account identity,
  2. then reconciles its own bootstrap public key onto the selected allocation user,
  3. then uses the delivered private key to bootstrap the selected allocation host.

Important distinction:

  • the app may manage only its own bootstrap trust record
  • it must not rewrite shared allocation SSH-key attachments that belong to operators or other workflows
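The "manage only your own bootstrap trust record" rule can be sketched as an authorized_keys reconcile that only ever touches lines it owns. The marker-comment convention below is an assumption for illustration, not a platform contract.

```python
# Sketch of app-owned bootstrap trust reconciliation: ensure exactly one
# entry we own (identified by a marker comment) is present, and never touch
# entries owned by operators or other workflows. The marker convention is
# a hypothetical illustration.

MARKER = "gpuaas-app-bootstrap"

def reconcile_authorized_keys(existing: str, our_public_key: str) -> str:
    desired = f"{our_public_key} {MARKER}"
    # Drop only stale entries carrying our marker; keep everyone else's lines.
    lines = [line for line in existing.splitlines() if MARKER not in line]
    lines.append(desired)
    return "\n".join(lines) + "\n"


existing = (
    "ssh-ed25519 AAAopkey operator@ops\n"
    "ssh-ed25519 AAAold gpuaas-app-bootstrap\n"
)
result = reconcile_authorized_keys(existing, "ssh-ed25519 AAAnew")
assert "operator@ops" in result          # operator entry untouched
assert result.count(MARKER) == 1         # exactly one app-owned entry
assert "AAAold" not in result            # our stale key replaced
```

An idempotent reconcile like this can run on every controller loop without clobbering shared allocation SSH-key attachments.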

6a. Allocation versus provisioning boundary

External teams should not conflate allocation with provisioning.

GPUaaS owns:

  • node lifecycle
  • OS deploy/reimage
  • bootstrap script delivery
  • node-agent enrollment
  • allocation runtime user creation

Your app may:

  • select from active allocations
  • later request allocations through public control-plane APIs if the product exposes that flow for the app
  • bootstrap runtime software only after allocation is active

Your app should not:

  • own MAAS/node provisioning directly
  • treat node provisioning as app-specific logic
  • depend on direct SQL or direct MAAS API access as its integration path

7. Lifecycle signals

App teams should expect lifecycle evidence through public surfaces such as:

  • app-instance status
  • member status for clustered app cases
  • canonical errors with correlation_id
  • audit records for privileged operations
  • event contracts where defined

If you need repeated direct database checks to understand the system, the platform is missing a read-model surface and that should be treated as a product gap.
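Correlation-first debugging in practice means surfacing the correlation_id from a canonical error envelope so the app team can hand it to the platform team instead of reaching for the database. The envelope field names below are illustrative assumptions; the real shape is defined in doc/api/openapi.draft.yaml.

```python
import json

# Hedged sketch: extract a human-readable summary plus correlation_id from a
# canonical error envelope. Field names are illustrative, not the real schema.

def describe_error(body: str) -> str:
    err = json.loads(body).get("error", {})
    cid = err.get("correlation_id", "<none>")
    return f"{err.get('code', 'unknown')}: {err.get('message', '')} (correlation_id={cid})"


body = '{"error": {"code": "forbidden", "message": "project scope denied", "correlation_id": "abc-123"}}'
assert describe_error(body) == "forbidden: project scope denied (correlation_id=abc-123)"
```

Logging that single line on every failed call is usually enough evidence for a platform-side trace, without any direct database access.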

8. Launchable OCI and curated Compose apps

The preferred first integration path for self-contained apps is now a launchable OCI workload profile.

This is the path used by the current JupyterLab and vLLM reference apps:

  1. the app team provides immutable OCI image variants
  2. the platform stores digest-pinned app artifact records
  3. the app version manifest declares parameters, artifacts, resources, endpoints, storage, and execution behavior
  4. the UI renders a launch form from the manifest
  5. app-runtime renders a typed node-agent task
  6. the node-agent starts either one container or a curated Docker Compose project on the selected allocation
  7. app-runtime reconciles lifecycle, endpoints, access, events, and decommission state back into the app-instance model

For Compose-backed apps, the external contract is not raw Compose YAML. The manifest declares a platform-owned renderer and logical topology. The platform renders concrete Compose from approved artifacts, launch values, storage, and endpoint policy.

Use single-container OCI when:

  1. one process and one endpoint are enough
  2. the app does not need sidecars
  3. a simple launch/control/remove lifecycle maps cleanly to one container

Use curated Compose when:

  1. the app is single-node but needs multiple logical services
  2. sidecars are likely
  3. the UI should show service topology
  4. the platform can own a safe renderer for the app class

Current constraints:

  1. app teams should not depend on arbitrary user-authored Compose YAML
  2. all deployable images must resolve to immutable digests
  3. host paths, privileged containers, and registry credentials are not user inputs
  4. private endpoint access is supported through host-local binding and SSH tunnel instructions
  5. platform proxy exposure is available for endpoints that declare exposure_modes: ["platform_proxy"]; interactive apps that generate absolute paths should also declare endpoint proxy.base_url_mode: platform_path so the platform launches them with the reserved route as their external base URL
  6. persistent storage, private model credentials, and derived reusable package environments remain follow-on platform work
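The digest-pinning constraint is easy to enforce client-side before submitting an artifact record. A minimal sketch: accept only references of the form `name@sha256:<64 hex chars>` and reject mutable tags. This mirrors the standard OCI digest reference shape; it is a pre-flight check, not the platform's own validation.

```python
import re

# Sketch of the "all deployable images must resolve to immutable digests"
# constraint: a reference must be pinned as name@sha256:<64 hex chars>,
# never a mutable tag like :latest.

_DIGEST_REF = re.compile(r"^[^\s@]+@sha256:[0-9a-f]{64}$")

def is_digest_pinned(image_ref: str) -> bool:
    return bool(_DIGEST_REF.match(image_ref))


pinned = "registry.example/team/app@sha256:" + "a" * 64
assert is_digest_pinned(pinned)
assert not is_digest_pinned("registry.example/team/app:latest")
assert not is_digest_pinned("registry.example/team/app")
```

Running this check in the app team's release pipeline catches tag drift before the platform rejects the artifact.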

What External Teams Most Commonly Need

When an app team starts integration, the usual asks are:

  1. the OpenAPI spec
  2. the tenant/project/user/service-account hierarchy
  3. the service-account auth flow
  4. endpoint allowlist and project-scope rules
  5. the node onboarding/bootstrap model
  6. the node reclaim/offline lifecycle model
  7. dev or stage environment details

Most of these are documentation and integration packaging problems, not a sign that the core platform API must grow immediately.

What External Teams Should Build Against

Build against these stable ideas:

  • project-scoped service accounts
  • public REST/event contracts
  • public app-instance and member read models
  • public async operation envelopes
  • platform-provided credential and allocation primitives
  • the correlation-first debugging model

Do not build against:

  • hidden admin shortcuts
  • raw host-role assumptions
  • direct SQL
  • internal-only node-agent behavior
  • implementation details of Keycloak or backend Go packages
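Consuming a public async operation envelope usually reduces to polling until a terminal state. This sketch assumes a `status` field with `succeeded`/`failed` terminal values; the real envelope shape comes from the public REST contract, not this code.

```python
import time

# Hedged sketch: poll an async operation envelope until terminal or timeout.
# The "status" field and its terminal values are assumptions for illustration.

TERMINAL = {"succeeded", "failed"}

def wait_for_operation(get_operation, timeout_s=60.0, poll_s=1.0, sleep=time.sleep):
    """get_operation is any callable returning the current envelope dict."""
    deadline = time.monotonic() + timeout_s
    while True:
        op = get_operation()
        if op.get("status") in TERMINAL:
            return op
        if time.monotonic() >= deadline:
            raise TimeoutError("operation did not reach a terminal state")
        sleep(poll_s)


# Usage with a fake envelope source (no network involved):
states = iter([{"status": "pending"}, {"status": "running"}, {"status": "succeeded"}])
op = wait_for_operation(lambda: next(states), timeout_s=5, poll_s=0, sleep=lambda s: None)
assert op["status"] == "succeeded"
```

Injecting the getter and sleep function keeps the polling logic testable without touching real endpoints.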

Questions To Ask Before Requesting A New Core Primitive

Before asking GPUaaS to add something to core, ask:

  1. Is this truly reusable for more than one app class?
  2. Is this missing from the platform contract, or just undocumented today?
  3. Is this really a platform primitive, or is it runtime-specific SME logic?
  4. Can this be solved by platform-ready node access plus app-owned runtime installation?
  5. Would adding this to core make the platform harder to maintain for unrelated app teams?

Recommended Integration Sequence

  1. Read the public API and app-platform docs first.
  2. Model your app against project-scoped service-account automation.
  3. Decide whether your software uses:
       • app-instance lifecycle only, or
       • app-instance plus clustered member envelopes.
  4. Use project-scoped allocation reads as the placement source of truth.
  5. Use platform onboarding/bootstrap and access primitives for nodes.
  6. Decide whether your app needs a bootstrap SSH credential and, if so, use the Vault-backed access-credential path plus app-managed bootstrap trust reconcile.
  7. Use immutable artifact metadata and the registry/publish-intent flow when your app runtime is artifact-backed.
  8. For self-contained apps, prefer a launchable OCI profile before requesting a bespoke runtime adapter.
  9. Use curated Compose only through a platform-owned renderer and manifest topology, not as arbitrary YAML pasted into app config.
  10. Keep your runtime/controller logic outside GPUaaS core.
  11. Escalate only reusable platform gaps, not app-specific behavior.

Current Most Likely Gap

The biggest risk for external teams today is not necessarily missing core API.

The bigger risk is that the current platform capabilities are not yet packaged clearly enough as an external developer story.

That means the most valuable near-term work is:

  • better developer/integration documentation
  • clearer examples for service-account-based automation
  • clearer node onboarding and post-bootstrap access guidance
  • a clearer statement of what belongs to GPUaaS versus what belongs to the app team
  • a clearer definition of the public app-worker runtime contract versus platform-internal implementation details
  • productized documentation for OCI image variants, profile manifests, curated Compose renderers, artifact promotion, and smoke-test expectations

Immediate Outcome

The current conclusion is:

  • GPUaaS should stay a substrate and contract provider
  • external app teams should implement their own runtime logic on top
  • the platform should grow only where reusable primitives are genuinely missing
  • better integration documentation is now part of the platform boundary itself
  • the next architectural step is to make the external worker contract explicit enough that a reference app can eventually move out of platform-core code paths