Build an App for GPUaaS v1

As of: April 14, 2026

Purpose

This is the developer and agent guide for building on the GPUaaS App Platform.

It is written for:

  1. internal product teams,
  2. external platform-app teams,
  3. agents or automation acting on behalf of those teams.

The goal is to show the actual build path for an app team:

  1. what GPUaaS owns,
  2. what the app team owns,
  3. how IAM, billing, and lifecycle fit together,
  4. what is implemented now versus still missing.

Read This First

Use these as the canonical companion docs:

  1. REST contract: doc/api/openapi.draft.yaml
  2. Event contract: doc/api/asyncapi.draft.yaml
  3. App model: doc/architecture/App_Control_Plane_v1.md
  4. App lifecycle: doc/architecture/App_Runtime_Instance_Lifecycle_v1.md
  5. Operating modes: doc/architecture/App_Runtime_Operating_Modes_v1.md
  6. Billing baseline: doc/architecture/App_Runtime_Billing_Model_v1.md
  7. Service accounts: doc/architecture/Service_Account_Model.md
  8. IAM model: doc/architecture/Role_and_Policy_Lifecycle_Model.md
  9. Scheduler reference: doc/architecture/Scheduler_as_Platform_App_v1.md
  10. External integration boundary: doc/architecture/External_App_Team_Integration_Guide_v1.md
  11. Example app workflow reference: doc/architecture/Example_App_Developer_Reference_Workflow_v1.md
  12. App UI extension contract: doc/architecture/App_UI_Extension_Model_v1.md
  13. External worker contract direction: doc/architecture/App_Runtime_External_Worker_Contract_v1.md
  14. Starter pack: doc/architecture/App_Developer_Starter_Pack_v1.md
  15. Manifest registration guide: doc/architecture/App_Manifest_Registration_Guide_v1.md
  16. Launchable OCI workload profile contract: doc/architecture/Launchable_OCI_Workload_Profile_Contract_v1.md
  17. vLLM OpenAI Compose first slice: doc/product/VLLM_OpenAI_Compose_First_Slice_v1.md
  18. Jupyter package-install first slice: doc/product/Jupyter_Package_Install_v1.md

What GPUaaS Provides

GPUaaS is the control plane. It provides primitives, not app-specific business logic.

Platform-owned responsibilities:

  1. identity and authentication,
  2. tenant/project IAM and policy evaluation,
  3. service-account lifecycle and short-lived token issuance,
  4. app catalog and project entitlements,
  5. app-instance lifecycle API and async events,
  6. billing ownership and usage-record shape,
  7. audit logs, correlation IDs, traces, and observability surfaces.

App-team-owned responsibilities:

  1. app-specific runtime/operator logic,
  2. runtime-native configuration and packaging,
  3. mapping runtime lifecycle to apps.instance.*,
  4. mapping billable runtime signals into the billing contract,
  5. app-specific operational runbooks beyond platform invariants.

Platform Invariants

These are non-negotiable.

  1. Policy/IAM is first-class.
  2. Runtime contracts must stay runtime-neutral.
  3. Lifecycle is event-driven and correlation-preserving.
  4. Internal reference apps must use the same contracts as third-party teams.
  5. App implementations must tolerate both:
     - tenant_dedicated
     - future platform_managed
  6. Project ownership of an app instance does not imply project-scoped runtime control.

Current Platform State

This matters because app teams need to know what is real versus only reserved in the model.

Implemented baseline:

  1. app catalog endpoints,
  2. project app entitlements,
  3. app instance create/read/upgrade/rollback/decommission endpoints,
  4. operator service-account support for app-instance endpoints,
  5. app lifecycle events,
  6. app-runtime metering primitives in billing usage records,
  7. operating_mode, control_plane_scope, runtime_backend, tenant_boundary_mode in the instance contract,
  8. first-class placement_intent on app-instance create/read for deploy placement,
  9. project-scoped access-credential custody and secure delivery,
  10. project-scoped allocation read APIs for app controllers and app UI placement flows,
  11. app-managed bootstrap SSH trust reconcile for app-instance-bound machine bootstrap flows,
  12. platform catalog UI filtering by project entitlements for the active project context,
  13. allocation-local launchable OCI workload profiles for curated apps such as JupyterLab,
  14. digest-pinned OCI artifact selection from verified published or promoted app artifacts,
  15. node-agent OCI container lifecycle for launch, status/control, and remove,
  16. node-agent Docker Compose lifecycle for curated Compose-backed profiles,
  17. private host-local endpoint access instructions for launchable OCI workloads,
  18. configurable launch-time resource inputs for GPU count, CPU cores, memory GiB, and host port,
  19. first-slice package install support for JupyterLab through validated pip package specifiers,
  20. node-agent lifecycle self-update and bootstrap-owned Docker, Docker Compose, and NVIDIA Container Toolkit prerequisites.

Reserved but not fully productized:

  1. generic third-party app manifest submission, review, and release workflow,
  2. arbitrary user-supplied Docker Compose YAML,
  3. multi-node or distributed Compose topology,
  4. full scheduler/model-serving managed-service implementations,
  5. non-OCI artifact runtime delivery for large model weights and private blobs,
  6. final shared-service (platform_managed) rollout,
  7. richer runtime-specific billing/rating logic,
  8. proxy/public endpoint exposure beyond private host-local access,
  9. persistent storage and derived reusable package environments,
  10. full fleet lifecycle reconciliation for host prerequisites beyond the first self-update slice.

For new self-contained apps, start with a launchable OCI workload profile unless the app clearly needs a long-running external controller.

This path is now the most concrete app-developer route because it has shipped through local kind and platform-control for JupyterLab, and through platform-control for the first vLLM OpenAI-compatible Compose app.

What the app team provides

  1. one or more immutable OCI images,
  2. a Dockerfile or upstream image reference with reproducible build inputs,
  3. expected command, exposed service ports, and health/readiness behavior,
  4. resource defaults for GPU count, CPU cores, and memory GiB,
  5. launch parameters and validation rules,
  6. access instructions for private mode and any future proxy mode,
  7. smoke-test commands for the public endpoint,
  8. release notes describing CPU, NVIDIA, or AMD image variants,
  9. credential requirements such as private model tokens or package indexes.

What GPUaaS provides

  1. app catalog and entitlement surfaces,
  2. versioned workload profile manifest storage in app_versions.manifest,
  3. verified published or promoted app artifact records,
  4. digest-pinned artifact selection in the deploy form,
  5. launch form rendering from profile schema and UI hints,
  6. placement on an existing active allocation,
  7. node-agent execution through typed task families,
  8. lifecycle, events, access, config, and decommission surfaces,
  9. audit, correlation, and project attribution.

What happens at deploy time

  1. the user selects an entitled app version and target allocation,
  2. the UI collects validated profile parameters,
  3. the control plane resolves the selected digest-pinned artifact,
  4. app-runtime renders a node-agent task from the manifest and launch values,
  5. node-agent pulls the image with node-bound registry credentials,
  6. node-agent starts either one container or a curated Compose project,
  7. app-runtime reconciles task output into app-instance status and access data,
  8. decommission stops and removes only the workload-owned container or Compose project.

The browser never sends raw Docker Compose YAML, image tags, registry credentials, arbitrary host paths, or runtime secrets.
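Because the browser must never carry raw Compose YAML, image tags, or credentials, the control plane has to validate launch inputs server-side. This is a minimal sketch of that allow-list pattern; the parameter names mirror the configurable launch-time inputs named in this guide, but the exact keys, bounds, and forbidden-key list are illustrative assumptions, not the platform contract.

```python
# Server-side allow-list validation for user-facing launch inputs.
# ALLOWED_LAUNCH_KEYS mirrors the documented inputs (GPU count, CPU cores,
# memory GiB, host port); FORBIDDEN_KEYS models fields clients must never send.
ALLOWED_LAUNCH_KEYS = {"gpu_count", "cpu_cores", "memory_gib", "host_port"}
FORBIDDEN_KEYS = {"compose_yaml", "image", "registry_credentials", "host_path"}

def validate_launch_inputs(params: dict) -> dict:
    forbidden = FORBIDDEN_KEYS & params.keys()
    if forbidden:
        # Reject outright: these are platform-owned, never client-supplied.
        raise ValueError(f"client may not supply: {sorted(forbidden)}")
    unknown = params.keys() - ALLOWED_LAUNCH_KEYS
    if unknown:
        raise ValueError(f"unknown launch parameters: {sorted(unknown)}")
    if not (1 <= params.get("gpu_count", 1) <= 8):
        raise ValueError("gpu_count out of range")
    if not (1024 <= params.get("host_port", 8888) <= 65535):
        raise ValueError("host_port must be a non-privileged port")
    return params
```

In a real deployment the bounds would come from the workload profile schema rather than constants.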

Single Image Versus Docker Compose

Use a single OCI image when:

  1. the app has one long-running service,
  2. the runtime can be started with one container command,
  3. endpoint and health reporting map to one process,
  4. JupyterLab-style notebook or simple HTTP API behavior is enough.

Use the curated Docker Compose path when:

  1. the app is still single-node but needs a topology abstraction,
  2. sidecars or supporting services are likely,
  3. the app benefits from explicit service/endpoint/mount topology in the UI,
  4. the control plane can own rendering from a known safe renderer.

Important v1 constraint:

  - app developers do not submit arbitrary Compose YAML as the product contract,
  - the app manifest declares a platform-owned renderer such as vllm_openai and logical topology metadata,
  - app-runtime renders the concrete Compose file from approved artifacts, resource overrides, endpoint policy, and storage decisions.

This keeps the API stable while still proving multi-service and sidecar-ready runtime execution.

App Team Mental Model

The control-plane resource is the app instance.

The app instance is:

  1. project-owned for IAM, audit, and billing attribution,
  2. optionally backed by a runtime control plane scoped at:
     - project
     - tenant
     - platform

The two fields that matter most are:

  1. operating_mode
     - tenant_dedicated
     - platform_managed
  2. control_plane_scope
     - project
     - tenant
     - platform

Recommended starting assumption for new apps:

  1. operating_mode = tenant_dedicated
  2. control_plane_scope = project

That is the cleanest path for dev/test/stage/prod style project boundaries.

Product-facing shorthand:

  1. project-scoped mode usually maps to tenant_dedicated + project
  2. tenant-owned shared mode is the target term for tenant_dedicated + tenant
  3. platform-managed shared mode maps to platform_managed + platform

Current limitation:

  - the app-instance control-plane contract is still project-owned,
  - so any future tenant-owned shared mode needs an explicit attached-project model instead of being inferred from today’s project-owned instance shape,
  - see: doc/architecture/App_Tenant_Shared_Attachment_Model_v1.md

Step-by-Step Build Path

Step 1: Decide What Kind of App You Are Building

Choose:

  1. runtime backend:
     - k8s
     - slurm
     - ray
     - bare_metal
     - launchable_oci
     - docker_compose
  2. expected control-plane scope:
     - project
     - tenant
  3. whether your first release is:
     - tenant-dedicated only,
     - or designed to evolve to platform-managed later.

Questions to answer up front:

  1. does the app need dedicated control nodes,
  2. does the app need worker/compute nodes,
  3. does the app use OCI only, or other artifact sources,
  4. what runtime-native lifecycle signals exist,
  5. what usage signals are potentially billable.

For launchable OCI and Compose-backed apps, also answer:

  1. which image variants are required (cpu, nvidia-h200, amd-rocm, or other SKU-specific variants),
  2. whether a single container is enough or a curated Compose renderer is needed,
  3. which endpoint ports are declared and whether private access is sufficient,
  4. which launch parameters are user-facing versus platform-owned,
  5. whether any credential should be modeled as a runtime secret instead of raw app config.

Step 2: Define the App Contract First

Before implementation, define:

  1. catalog metadata,
  2. version metadata,
  3. instance config shape,
  4. expected lifecycle states,
  5. expected failure reasons,
  6. required observability fields.

Do not start from runtime internals. Start from the control-plane contract.

Minimum API surfaces to understand:

  1. GET /api/v1/apps/catalog
  2. GET /api/v1/apps/catalog/{app_slug}/versions
  3. GET /api/v1/projects/{project_id}/apps/entitlements
  4. PUT /api/v1/projects/{project_id}/apps/entitlements/{app_slug}
  5. GET /api/v1/projects/{project_id}/app-instances
  6. POST /api/v1/projects/{project_id}/app-instances
  7. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}
  8. POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/upgrade
  9. POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/rollback
  10. POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/decommission
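The endpoint paths above can be captured in a small path-builder so operator tooling does not hand-assemble URLs. This is a sketch: the base URL is a hypothetical host, and only the path templates come from the contract in this guide.

```python
# Path-builder sketch for the app-instance API surfaces listed above.
from typing import Optional

BASE_URL = "https://gpuaas.example.com"  # hypothetical host, not a real endpoint

def catalog_url() -> str:
    return f"{BASE_URL}/api/v1/apps/catalog"

def entitlements_url(project_id: str) -> str:
    return f"{BASE_URL}/api/v1/projects/{project_id}/apps/entitlements"

def instance_url(project_id: str,
                 app_instance_id: Optional[str] = None,
                 action: Optional[str] = None) -> str:
    # Collection URL, single-instance URL, or lifecycle-action URL
    # (upgrade / rollback / decommission) depending on the arguments.
    url = f"{BASE_URL}/api/v1/projects/{project_id}/app-instances"
    if app_instance_id:
        url += f"/{app_instance_id}"
    if action:
        url += f"/{action}"
    return url
```

An operator would pair these with an HTTP client carrying a short-lived service-account token (Step 3).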

Step 3: Decide the IAM Model

Do not use user tokens for long-running automation.

For app operators:

  1. create a project-scoped service account,
  2. grant the minimum app-instance permissions,
  3. mint short-lived service-account tokens,
  4. run automation with that service account only.

Relevant endpoints:

  1. POST /api/v1/projects/{project_id}/service-accounts
  2. GET /api/v1/projects/{project_id}/service-accounts
  3. POST /api/v1/projects/{project_id}/service-accounts/{service_account_id}/disable
  4. POST /api/v1/projects/{project_id}/service-accounts/{service_account_id}/rotate-key
  5. DELETE /api/v1/projects/{project_id}/service-accounts/{service_account_id}
  6. POST /api/v1/auth/service-account/token

Required permission keys for the app operator path:

  1. apps.instance.read
  2. apps.instance.create
  3. apps.instance.delete

In practice, if your operator also needs upgrade/rollback/decommission flows, make sure the assigned role covers those instance lifecycle actions too.
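Automation that mints short-lived tokens needs a refresh-before-expiry wrapper so it never falls back to long-lived credentials. A minimal sketch follows; the mint callback abstracts POST /api/v1/auth/service-account/token, and the response field names (access_token, expires_in) are illustrative assumptions about that endpoint's shape.

```python
# Short-lived service-account token handling for operator automation.
import time
from typing import Callable

class TokenCache:
    """Re-mints a short-lived token shortly before expiry instead of
    storing a long-lived credential."""

    def __init__(self, mint: Callable[[], dict], skew_seconds: int = 60):
        self._mint = mint          # performs the real token-endpoint call
        self._skew = skew_seconds  # refresh this many seconds before expiry
        self._token = None
        self._expires_at = 0.0

    def token(self) -> str:
        now = time.time()
        if self._token is None or now >= self._expires_at - self._skew:
            resp = self._mint()
            self._token = resp["access_token"]
            self._expires_at = now + resp["expires_in"]
        return self._token
```

In use, `mint` would be a function that POSTs the service-account key material to the token endpoint and returns the parsed response.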

Step 4: Publish and Entitle the App

There are three actors:

  1. platform admin: publishes or deprecates catalog versions,
  2. tenant/project admin: enables or narrows entitlements,
  3. app operator: creates and manages instances inside the allowed project scope.

Admin catalog endpoints:

  1. POST /api/v1/admin/apps/catalog/{app_slug}/versions/{version}/publish
  2. POST /api/v1/admin/apps/catalog/{app_slug}/versions/{version}/deprecate

Entitlement endpoints:

  1. GET /api/v1/projects/{project_id}/apps/entitlements
  2. PUT /api/v1/projects/{project_id}/apps/entitlements/{app_slug}

Current platform UI behavior:

  1. tenant/project admins manage entitlements from the entitlement surface,
  2. the catalog UI shows only apps enabled for the active project,
  3. the raw catalog API is still a platform catalog surface, so app teams should not assume GET /api/v1/apps/catalog is itself entitlement-filtered unless the contract changes.

Step 5: Create the App Instance

The app instance is the project-owned resource your operator works against.

At create time, the operator should provide only request hints, not assume final topology.

Important request/output concepts:

  1. request hint:
     - operating_mode
  2. effective output fields:
     - operating_mode
     - control_plane_scope
     - runtime_backend
     - tenant_boundary_mode

Rule:

  1. the server decides the effective values from policy and backend rules,
  2. the app team must read those effective values back and honor them.

If your app requires explicit deploy placement, carry it as first-class placement_intent on the create request rather than burying it in opaque runtime config.

If your app requires machine bootstrap access, use the project-scoped access-credential APIs. Do not rely on manually mounted secrets as the primary integration contract.

If your app needs bootstrap SSH trust on a selected allocation user, use an app-managed flow:

  1. create or rotate the bootstrap credential through the access-credential APIs,
  2. let the app controller reconcile its own bootstrap public key onto the target node user,
  3. do not overwrite shared allocation/user SSH-key sets that belong to operators.
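The hint-then-read-back rule from this step can be sketched in two small helpers. Only operating_mode, placement_intent, and the four effective output fields come from this guide; the remaining request keys are illustrative assumptions about the create payload.

```python
# Create-then-read-back sketch for Step 5.

def build_create_request(app_slug: str, version: str, allocation_id: str) -> dict:
    return {
        "app_slug": app_slug,
        "version": version,
        "operating_mode": "tenant_dedicated",  # request hint only, not final
        "placement_intent": {                  # first-class, not buried in config
            "allocation_id": allocation_id,
        },
    }

def effective_settings(instance: dict) -> dict:
    # The server decides these from policy and backend rules;
    # the operator must read them back and honor them.
    return {
        key: instance[key]
        for key in ("operating_mode", "control_plane_scope",
                    "runtime_backend", "tenant_boundary_mode")
    }
```

The point of `effective_settings` is that automation branches on the server's answer, never on the hint it sent.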

Step 6: Implement the Async Lifecycle Correctly

Treat lifecycle as async from the start.

Expected event types:

  1. apps.instance.requested
  2. apps.instance.deploying
  3. apps.instance.running
  4. apps.instance.upgrade_requested
  5. apps.instance.upgraded
  6. apps.instance.rollback_requested
  7. apps.instance.rolled_back
  8. apps.instance.decommission_requested
  9. apps.instance.decommissioned
  10. apps.instance.failed

Rules:

  1. consume events idempotently,
  2. correlate by correlation_id and event_id,
  3. use deterministic failure reasons,
  4. never couple to direct DB reads as the lifecycle source of truth.
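The idempotency and correlation rules above can be sketched as a small consumer. The event field names match this guide's contract language (event_id, correlation_id, apps.instance.* types), but the in-memory seen-set is an illustrative stand-in for whatever durable dedupe store a real operator would use.

```python
# Idempotent, correlation-preserving apps.instance.* event consumer sketch.

class LifecycleConsumer:
    def __init__(self):
        self._seen_event_ids = set()   # stand-in for a durable dedupe store
        self.state_by_instance = {}

    def handle(self, event: dict) -> bool:
        """Apply one event; return False if it was a duplicate delivery."""
        event_id = event["event_id"]
        if event_id in self._seen_event_ids:
            return False  # idempotent: duplicates are acknowledged, not reapplied
        self._seen_event_ids.add(event_id)
        # apps.instance.<state> maps onto the instance state machine
        state = event["type"].rsplit(".", 1)[-1]
        self.state_by_instance[event["app_instance_id"]] = {
            "state": state,
            "correlation_id": event["correlation_id"],  # preserved for triage
        }
        return True
```

Note the consumer's state is derived only from events, never from direct DB reads.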

Step 7: Implement Billing and Metering

Billing is not optional app-team cleanup. It is part of the app contract.

Every billable app-runtime signal must preserve:

  1. org_id
  2. project_id
  3. app_instance_id
  4. app_slug
  5. operating_mode
  6. control_plane_scope
  7. runtime_backend
  8. correlation_id

Current billing direction:

  1. usage_source = app_runtime
  2. usage_unit is runtime-specific
  3. app-runtime usage must still reconcile through the platform usage-record and ledger model

The app team owns:

  1. deciding which runtime-native signals are billable,
  2. mapping those signals into the platform usage-record shape,
  3. preserving project and app-instance attribution.

The platform owns:

  1. immutable ledger behavior,
  2. tenant/project ownership enforcement,
  3. policy-driven cost controls,
  4. usage-record to ledger integration.
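The app team's mapping duty above can be sketched as one function. Only usage_source = "app_runtime" and the eight attribution keys come from this guide; the record envelope and the incoming signal shape are illustrative assumptions.

```python
# Mapping a runtime-native billable signal into a platform usage record.

REQUIRED_ATTRIBUTION = (
    "org_id", "project_id", "app_instance_id", "app_slug",
    "operating_mode", "control_plane_scope", "runtime_backend", "correlation_id",
)

def to_usage_record(signal: dict, attribution: dict) -> dict:
    missing = [k for k in REQUIRED_ATTRIBUTION if k not in attribution]
    if missing:
        # Refuse to emit unattributable usage rather than drop fields silently.
        raise ValueError(f"usage record missing attribution: {missing}")
    return {
        "usage_source": "app_runtime",
        "usage_unit": signal["unit"],     # runtime-specific per the contract
        "quantity": signal["quantity"],
        **{k: attribution[k] for k in REQUIRED_ATTRIBUTION},
    }
```

Failing closed on missing attribution keeps every emitted record reconcilable through the platform ledger model.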

Step 8: Preserve Audit and Observability

Every app path must preserve:

  1. canonical error envelope with correlation_id,
  2. trace continuity across API, workers, relays, and operators,
  3. tenant/project/app-instance attribution.

Required logging fields:

  1. correlation_id
  2. trace_id
  3. org_id
  4. project_id
  5. app_instance_id
  6. operating_mode
  7. control_plane_scope
  8. runtime_backend

Do not leak:

  1. bearer tokens,
  2. private keys,
  3. app secrets,
  4. any PII blocked by platform sanitize rules.
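A log-emit helper can enforce both lists above: require the mandatory fields and redact obvious secret-bearing keys. The required field names come from this guide; the key-substring redaction heuristic is an illustrative assumption, and the platform's real sanitize rules remain authoritative.

```python
# Log-entry builder that enforces required fields and redacts secret-like keys.

REQUIRED_LOG_FIELDS = (
    "correlation_id", "trace_id", "org_id", "project_id",
    "app_instance_id", "operating_mode", "control_plane_scope", "runtime_backend",
)
SECRET_KEY_HINTS = ("token", "secret", "private_key", "password")

def build_log_entry(fields: dict) -> dict:
    missing = [k for k in REQUIRED_LOG_FIELDS if k not in fields]
    if missing:
        raise ValueError(f"log entry missing required fields: {missing}")
    return {
        k: ("[REDACTED]" if any(h in k.lower() for h in SECRET_KEY_HINTS) else v)
        for k, v in fields.items()
    }
```

Redaction at the emit boundary is a last line of defense; secrets should ideally never reach the logging call at all.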

Step 9: Support Humans and Agents the Same Way

Your app team may operate manually, through CI, or through another agent.

That agent must use:

  1. the same public app-control endpoints,
  2. the same service-account token path,
  3. the same IAM and audit rules,
  4. the same error envelopes and events.

Do not build a hidden “operator shortcut” path for internal automation. If the agent needs it, the platform contract should expose it properly.

Definition of Ready

Before building the operator/runtime:

  1. runtime backend chosen,
  2. operating-mode assumptions documented,
  3. IAM mapping documented,
  4. service-account scope defined,
  5. lifecycle events mapped,
  6. billing signal candidates identified,
  7. observability fields and triage path agreed,
  8. OpenAPI/AsyncAPI deltas proposed if current contracts are insufficient.

Definition of Done

Before saying the app is integrated:

  1. catalog/version path works,
  2. entitlement path works,
  3. service-account path works,
  4. create -> running path works,
  5. upgrade/rollback/decommission behaviors are defined and tested,
  6. app-runtime usage records are attributable,
  7. audit rows exist for privileged mutations,
  8. runbook includes correlation-first incident triage,
  9. no bypass of IAM, billing, or observability contracts exists.

Current Gaps App Teams Must Plan Around

These are the places where the platform model is ahead of full implementation.

  1. Launchable OCI and curated Compose are proven first slices, but generic app-team manifest submission and review are not yet productized.
  2. Non-OCI artifact runtime delivery for model weights, tarballs, and private blobs is not complete.
  3. platform_managed is modeled but should not be assumed as a shipped default.
  4. Runtime-specific billing/rating models are still baseline-only.
  5. Scheduler/model-serving reference stacks prove the platform direction, not full managed-service maturity.
  6. The long-term external worker contract is now documented, but reference apps still prove some of that boundary from inside the platform repo rather than as fully extracted app-team-owned workers.
  7. Arbitrary user-authored Compose, privileged containers, host mounts, and mutable image tags remain intentionally unsupported.
  8. Private endpoint access is supported; platform proxy exposure remains a follow-on.
  9. Persistent project storage and reusable derived package environments remain follow-on work.
  10. Node-agent lifecycle can update the agent binary and bootstrap prerequisites for current slices, but full fleet reconciliation and read-model telemetry hardening remain backlog.

If your app depends on one of these, treat it as an explicit dependency and track it as platform work, not app-team-local glue.

Artifact-specific foundations now exist in:

  1. doc/architecture/App_Artifact_Trust_and_Promotion_v1.md
  2. doc/architecture/App_Non_OCI_Artifact_Lifecycle_v1.md
  3. doc/architecture/Launchable_OCI_Workload_Profile_Contract_v1.md

Anti-Patterns

Do not do these.

  1. Do not call the database directly from an app operator.
  2. Do not use long-lived user bearer tokens in automation.
  3. Do not hardcode tenant/project bypasses for internal apps.
  4. Do not invent a separate billing store.
  5. Do not emit app-specific ad-hoc error envelopes.
  6. Do not make runtime topology assumptions without reading effective instance metadata.
  7. Do not couple to one environment shape such as local Docker only or one fixed cluster host.

Minimal Example Path

For a new scheduler or serving app, the practical first path is:

  1. choose tenant_dedicated,
  2. choose control_plane_scope = project,
  3. create a project service account,
  4. publish the app version,
  5. enable project entitlement,
  6. mint a service-account token,
  7. create the app instance,
  8. drive the runtime to running,
  9. emit and consume apps.instance.*,
  10. map runtime usage into usage_source = app_runtime.

That is the expected v1 path for both humans and agents.

For a launchable OCI app, the practical first path is:

  1. build and publish immutable OCI image variants,
  2. register digest-pinned artifacts on the app version,
  3. define the workload profile manifest and parameter schema,
  4. define single-container execution or a curated Compose renderer,
  5. validate local kind with CPU-safe inputs where possible,
  6. promote artifacts and profile metadata to platform-control,
  7. deploy on an existing allocation,
  8. validate the private endpoint through an SSH tunnel or node-local smoke,
  9. decommission and verify workload-owned containers, Compose projects, scratch paths, and stale config are removed.

Related Documents

  1. doc/architecture/App_Control_Plane_v1.md
  2. doc/architecture/App_Runtime_Instance_Lifecycle_v1.md
  3. doc/architecture/App_Runtime_Operating_Modes_v1.md
  4. doc/architecture/App_Runtime_Billing_Model_v1.md
  5. doc/architecture/App_Runtime_Metering_v1.md
  6. doc/architecture/Service_Account_Model.md
  7. doc/architecture/Scheduler_as_Platform_App_v1.md
  8. doc/architecture/CLI_Agent_Operable_Control_Plane_v2.md
  9. doc/architecture/App_Runtime_External_Worker_Contract_v1.md
  10. doc/architecture/Launchable_OCI_Workload_Profile_Contract_v1.md
  11. doc/product/VLLM_OpenAI_Compose_First_Slice_v1.md