App SDK: building apps on GPUaaS
Implemented Designed Contract
doc/architecture/Build_an_App_for_GPUaaS_v1.md · App_Developer_Starter_Pack_v1.md · App_Manifest_Registration_Guide_v1.md · Launchable_OCI_Workload_Profile_Contract_v1.md · App_Control_Plane_v1.md · App_Platform_Quickstart_v1.md · External_App_Team_Integration_Guide_v1.md
This page is for developers building apps that run on GPUaaS. It is distinct from the CLI and SDK pages, which are for clients calling the platform. An app team owns the OCI artifact, the manifest declaring compatibility, the runtime behavior, and optionally a custom UI extension.
Where the boundary sits
flowchart LR
classDef p fill:#e3f2fd,stroke:#1565c0
classDef a fill:#fff3e0,stroke:#e65100
classDef shared fill:#e8f5e9,stroke:#2e7d32
subgraph PLAT[GPUaaS owns]
direction TB
ALLOC[Allocation lifecycle<br/>baremetal or gpu_slice]:::p
AR[app-runtime-worker]:::p
CAT[App catalog + promotion]:::p
REG[OCI registry baseline]:::p
IAM[Auth tenant project IAM]:::p
BILL[Allocation billing + ledger]:::p
AUD[Audit logs]:::p
end
subgraph APP[App team owns]
direction TB
MAN[App manifest<br/>compatibility, profile, runtime]:::a
OCI[OCI artifact<br/>image or controller]:::a
BS[Bootstrap script<br/>cloud-init or controller logic]:::a
UI[Optional UI extension]:::a
MET[App-runtime metering<br/>DESIGNED]:::a
end
subgraph SHARED[Joint contract]
direction TB
LP[Launchable OCI workload<br/>profile contract]:::shared
LC[Lifecycle hooks<br/>created -> running -> stopped]:::shared
BC[Billing alignment<br/>per-app metering]:::shared
end
What an app team does
flowchart TB
S1[1. Declare manifest<br/>requires_capacity_shape,<br/>min_gpu_count, runtime_profile] --> S2[2. Build + sign<br/>OCI artifact]
S2 --> S3[3. Register manifest<br/>POST /admin/apps/manifests]
S3 --> S4[4. Promote artifact<br/>draft → staged → canary → active]
S4 --> S5[5. Tenant launches app<br/>into their allocation]
S5 --> S6[6. App runtime executes<br/>via app-runtime-worker]
S6 --> S7[7. App emits lifecycle events<br/>+ optional metering]
S7 --> S8[8. Tenant stops or<br/>allocation released]
S8 --> S9[9. Honest decommission<br/>teardown not just metadata]
classDef done fill:#d1e7dd,stroke:#0a3622
class S1,S2,S3,S4,S5,S6 done
1. App manifest
The manifest is the gate between an app and the catalog. Sample:
{
  "slug": "jupyter-cuda-dev",
  "display_name": "Jupyter (CUDA dev)",
  "version": "1.4.2",
  "artifact_kind": "oci",
  "artifact_ref": "registry.gpuaas.example.com/apps/jupyter-cuda-dev@sha256:abc...",
  "compatibility": {
    "requires_capacity_shape": ["baremetal", "gpu_slice"],
    "min_gpu_count": 1,
    "requires_exclusive_node": false,
    "supported_accelerators": ["nvidia"]
  },
  "runtime_profile": {
    "kind": "launchable_oci",
    "image_ref": "registry.gpuaas.example.com/apps/jupyter-cuda-dev@sha256:abc...",
    "command": ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888"],
    "env": [{"name": "GRANT_SUDO", "value": "no"}],
    "ports": [{"container": 8888, "expose": "embedded_ui_gateway"}],
    "resources": {"min_memory_mib": 4096, "min_vcpu": 4}
  },
  "ui": {
    "extension_kind": "embedded",
    "path": "/lab"
  },
  "metering": {
    "kind": "session_seconds"
  }
}
Compatibility decides which allocations can run the app. The orchestrator filters incompatible targets out of the user's launch wizard automatically.
→ Source: App_Manifest_Registration_Guide_v1.md
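The compatibility filter described above can be sketched roughly as follows. This is a minimal illustration, not the orchestrator's actual code: the `Allocation` class, its field names, and the `is_compatible` / `launch_targets` helpers are hypothetical, and only the keys of the `compatibility` block come from the sample manifest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    """Hypothetical view of a tenant allocation; field names are assumptions."""
    id: str
    capacity_shape: str   # "baremetal" or "gpu_slice"
    gpu_count: int
    accelerator: str      # e.g. "nvidia"
    exclusive_node: bool


def is_compatible(compat: dict, alloc: Allocation) -> bool:
    """Return True if the allocation satisfies the manifest's compatibility block."""
    if alloc.capacity_shape not in compat.get("requires_capacity_shape", []):
        return False
    if alloc.gpu_count < compat.get("min_gpu_count", 0):
        return False
    if compat.get("requires_exclusive_node", False) and not alloc.exclusive_node:
        return False
    accelerators = compat.get("supported_accelerators")
    if accelerators is not None and alloc.accelerator not in accelerators:
        return False
    return True


def launch_targets(compat: dict, allocations: list) -> list:
    """Keep only the allocations an app could be launched into."""
    return [a for a in allocations if is_compatible(compat, a)]
```

An app team could run a check like this locally to see which allocation shapes their manifest would admit before registering it.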
2. OCI artifact: trust + promotion
stateDiagram-v2
[*] --> draft: developer publishes digest
draft --> staged: passes lint + smoke
staged --> canary: small % rollout enabled
canary --> active: confidence gained
canary --> staged: rollback
active --> deprecated: superseded version active
deprecated --> retired: grace period elapsed
retired --> [*]: no instance references it
note right of active
Tenants can launch this version.
Manifest pinned to a digest, not a tag.
end note
Promotion rules:
- Every state transition is audit-logged (actor, before, after, reason).
- An active manifest cannot reference a non-active artifact.
- A deprecated manifest cannot be assigned to a new instance — existing instances continue.
- Retiring is gated on "no active instances using this digest".
→ Source: App_Artifact_Trust_and_Promotion_v1.md, App_Platform_OCI_Registry_Baseline_v1.md
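The state diagram and promotion rules above can be read as a small transition table. The sketch below is one way to encode it, under stated assumptions: `ALLOWED`, `PromotionError`, and `promote` are hypothetical names, the audit entry is a plain dict, and the platform's real enforcement may differ.

```python
# Allowed transitions, copied from the state diagram above.
ALLOWED = {
    "draft": {"staged"},
    "staged": {"canary"},
    "canary": {"active", "staged"},   # canary -> staged is the rollback path
    "active": {"deprecated"},
    "deprecated": {"retired"},
    "retired": set(),
}


class PromotionError(ValueError):
    """Raised when a requested transition violates the promotion rules."""


def promote(current, target, *, actor, reason, active_instances=0, audit_log=None):
    """Validate a transition and record an audit entry (actor, before, after, reason)."""
    if target not in ALLOWED.get(current, set()):
        raise PromotionError(f"transition {current} -> {target} is not allowed")
    # Retiring is gated on no active instances using the digest.
    if target == "retired" and active_instances > 0:
        raise PromotionError(f"{active_instances} active instance(s) still use this digest")
    if audit_log is not None:
        audit_log.append(
            {"actor": actor, "before": current, "after": target, "reason": reason}
        )
    return target
```

Keeping the table as data makes it easy to compare against the diagram when either one changes.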
3. Launchable OCI workload profile (data model)
erDiagram
app_manifests ||--o{ app_instances : "spawns"
app_instances ||--o{ app_instance_members : "expands into"
app_instances ||--o{ app_instance_events : "produces timeline"
allocations ||--o{ app_instance_members : "bound to"
sku_catalog ||--o{ allocations : "billed against"
app_manifests {
text slug PK
text version
text artifact_kind "oci|launchable|controller"
text artifact_ref "registry/repo@sha256:digest"
jsonb compatibility
jsonb runtime_profile
jsonb ui
jsonb metering
text status "draft|staged|canary|active|deprecated|retired"
}
app_instances {
uuid id PK
text app_slug FK
uuid tenant_id
uuid project_id
text status "created|starting|running|stopping|stopped|failed"
uuid target_allocation_id
timestamp created_at
timestamp activated_at
}
app_instance_members {
uuid id PK
uuid instance_id FK
uuid allocation_id FK
text role "controller|worker|head|node"
text status
}
app_instance_events {
uuid id PK
uuid instance_id FK
text event_type "start_completed|healthy|member_added|member_failed"
jsonb payload
timestamp occurred_at
}
→ Source: Launchable_OCI_Workload_Profile_Contract_v1.md. Schema JSON: doc/architecture/schemas/launchable_oci_workload_profile.v1.schema.json
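To make the "produces timeline" relationship concrete, here is a small sketch that mirrors the `app_instance_events` columns as a dataclass and orders one instance's events by `occurred_at`. The class and the `timeline` helper are illustrative only and are not part of the schema or any SDK.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import UUID, uuid4


@dataclass
class AppInstanceEvent:
    """Mirrors the app_instance_events columns shown above."""
    instance_id: UUID
    event_type: str          # e.g. "start_completed", "healthy", "member_added"
    occurred_at: datetime
    payload: dict = field(default_factory=dict)
    id: UUID = field(default_factory=uuid4)


def timeline(events, instance_id):
    """Return the events of one instance, oldest first."""
    own = (e for e in events if e.instance_id == instance_id)
    return sorted(own, key=lambda e: e.occurred_at)
```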
4. Lifecycle contract (what the platform expects)
sequenceDiagram
autonumber
participant U as Tenant
participant API as cmd/api
participant ARW as app-runtime-worker
participant ART as app artifact / controller
participant ALLOC as tenant allocation
U->>API: POST /apps/instances {slug, target_allocation_id}
API->>ARW: enqueue create
ARW->>ART: pull OCI by digest + verify
ARW->>ALLOC: dispatch typed task to node-agent<br/>(via provisioning-worker path)
ALLOC->>ALLOC: bootstrap (cloud-init or controller)
ALLOC-->>ARW: readiness reported
ARW-->>API: status=running
API-->>U: WS push
loop while running
ART-->>ARW: app_instance_events (heartbeat, member changes)
ART-->>ARW: optional metering events
end
U->>API: POST /apps/instances/{id}/stop
API->>ARW: enqueue stop
ARW->>ART: graceful shutdown signal
ART->>ALLOC: tear down processes
ART-->>ARW: stopped
Note over ARW: honest teardown — not just metadata
ARW-->>API: status=stopped
Required hooks an app must implement:
| Hook | When | Required to emit |
|---|---|---|
| start | On instance creation | Reach running status; report readiness on a known port or via heartbeat |
| health | Continuously | Heartbeat + (optionally) member-add / member-remove / member-failed events |
| stop | On stop request or allocation release | Real teardown of processes; clean exit; idempotent if called twice |
| metering (optional) | While running | Per-app usage events (session-seconds, tokens, requests, etc.) |
→ Sources: App_Runtime_Instance_Lifecycle_v1.md, App_Runtime_Operating_Modes_v1.md, App_Runtime_Recovery_Model_v1.md
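The hook requirements above, in particular the idempotent `stop`, might look like this in an app-side wrapper. This is a sketch under assumptions: `AppHooks` and the `emit` callback are hypothetical, the `"stopped"` event name is taken from the sequence diagram rather than the event-type list, and real process teardown is left as a comment.

```python
class AppHooks:
    """Hypothetical app-side wrapper around the start / health / stop hooks."""

    def __init__(self, emit):
        # emit(event_type, payload) is an assumed callback that forwards
        # events to the platform; the real transport is not specified here.
        self._emit = emit
        self._running = False
        self._stopped = False

    def start(self):
        """Bring the app up and report readiness."""
        self._running = True
        self._emit("start_completed", {})

    def health(self):
        """Emit a heartbeat while the app is running; do nothing otherwise."""
        if self._running:
            self._emit("healthy", {})

    def stop(self):
        """Tear down once. A second call is a no-op and returns False."""
        if self._stopped:
            return False
        self._stopped = True
        self._running = False
        # Real teardown of processes would happen here, before reporting.
        self._emit("stopped", {})
        return True
```

Guarding `stop` with a flag means a stop request and an allocation release arriving close together do not produce two teardown attempts.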
5. App UI extension (optional)
If the app has a UI, embed it via the gateway pattern:
flowchart LR
U[Tenant browser] --> SHELL[GPUaaS web shell<br/>/workloads/instance/:id]
SHELL --> GW[Embedded UI gateway<br/>platform-owned reverse proxy]
GW --> APP[App UI<br/>running in allocation]
APP -.cookies / WS / CSP / origin.-> GW
Note[GW owns:<br/>auth + session +<br/>cookie + WS + CSP +<br/>frame policy]
GW -.delegates rendering only.-> APP
classDef plat fill:#e3f2fd,stroke:#1565c0
classDef app fill:#fff3e0,stroke:#e65100
class SHELL,GW plat
class APP app
The gateway pattern means the app never issues its own user session — the platform's session is forwarded with constrained scope. Apps that need their own UI declare ui.extension_kind = "embedded" and rely on the gateway.
→ Source: Embedded_UI_Gateway_Contract_v1.md, App_UI_Extension_Model_v1.md
6. App-runtime billing alignment
Status: Designed. The metering producer side is the gap the external review flagged.
sequenceDiagram
autonumber
participant ALLOC as allocation (SKU)
participant APP as app instance
participant AM as app metering producer (DESIGNED)
participant BW as billing-worker
participant LED as ledger
ALLOC->>BW: GPU-hour usage_records
BW->>LED: ledger debit (allocation cost)
APP->>AM: app-specific metric<br/>(session_seconds | tokens | requests)
AM->>BW: app_usage_records<br/>(non-double-charge dimension)
BW->>LED: ledger debit (app-level)
Note over LED: tenant sees combined cost —<br/>operators see breakdown
The contract is defined; producer wiring is in progress. App teams should emit metering events even before the producer is fully wired — the events buffer cleanly.
→ Sources: App_Runtime_Billing_Model_v1.md, App_Runtime_Metering_v1.md, App_Runtime_External_Worker_Contract_v1.md
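One way an app could emit metering events before the producer is wired is to hold them in a bounded local buffer and flush when delivery succeeds. The source does not say where buffering happens, so this app-side approach is an assumption, as are the `MeteringBuffer` class, the event field names, and the `send` callback that returns `True` on acceptance.

```python
from collections import deque


class MeteringBuffer:
    """Hypothetical app-side buffer for usage events awaiting delivery."""

    def __init__(self, send, max_events=1000):
        # send(event) -> bool is an assumed delivery callback.
        self._send = send
        # A bounded deque drops the oldest event once max_events is reached.
        self._pending = deque(maxlen=max_events)

    def __len__(self):
        return len(self._pending)

    def record(self, kind, quantity, instance_id):
        """Queue one usage event, e.g. kind='session_seconds'."""
        self._pending.append(
            {"kind": kind, "quantity": quantity, "instance_id": instance_id}
        )

    def flush(self):
        """Deliver events in order; stop at the first refusal. Returns the count sent."""
        delivered = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break
            self._pending.popleft()
            delivered += 1
        return delivered
```

Events stay queued in order until `send` accepts them, so an unavailable producer delays delivery rather than losing usage data, up to the buffer bound.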
7. Tenant-scoped service accounts
Apps run with a tenant-scoped service account identity, not the user's identity. The platform issues a short-lived token (TTL set by auth.service_account_token_ttl_seconds, default 900 seconds) that the app uses to call back into the platform API.
sequenceDiagram
autonumber
participant ARW as app-runtime-worker
participant SA as tenant-scoped SA
participant APP as app inside allocation
participant API as cmd/api
ARW->>SA: provision SA for this instance (scope=project, role=app)
SA-->>ARW: {sa_id, signing_key} (one-time)
ARW->>APP: inject sa_id + signing_key + base_url via env
loop every <900s
APP->>API: POST /auth/sa/token (signed assertion)
API-->>APP: access_token, exp = now+900s
end
APP->>API: API calls with Bearer (e.g. storage, metering)
API->>API: verify SA + enforce scope
→ Source: Tenant_Scoped_App_Machine_Identity_v1.md, Service_Account_Model.md
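The refresh loop in the diagram above can be approximated with a small cache that fetches a new token shortly before the old one expires. This is a sketch, not SDK code: `TokenCache`, the `fetch` callback (which would wrap the signed `POST /auth/sa/token` call), and the 60-second refresh margin are all assumptions; only the 900-second default TTL comes from the text.

```python
import time


class TokenCache:
    """Hypothetical cache that refreshes a short-lived access token before expiry."""

    def __init__(self, fetch, ttl_seconds=900, refresh_margin=60, clock=time.monotonic):
        # fetch() -> str is an assumed callable that exchanges the signed
        # assertion for an access token.
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._margin = refresh_margin
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a token that is still valid for at least refresh_margin seconds."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token
```

Injecting the clock keeps the refresh logic testable without sleeping; in an app, the default `time.monotonic` avoids surprises from wall-clock adjustments.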
8. Starter pack
The starter pack pulls everything an app team needs into one place:
mindmap
root((Starter pack))
Templates
OCI image Dockerfile template
App manifest template
cloud-init bootstrap template
Controller template - Go
Reference apps
Jupyter CUDA dev
vLLM OpenAI compose first slice
Slurm reference controller
RKE2 self-managed
Docs
Build an App for GPUaaS
External App Team Integration Guide
Example App Developer Reference Workflow
SDKs
Python SDK gpuaas-sdk
Go client snippets
CI
Manifest schema validator
Smoke test harness
OCI signing pipeline
→ Source: App_Developer_Starter_Pack_v1.md
9. End-to-end build path
flowchart TB
A[Clone starter pack] --> B[Adjust manifest template<br/>+ runtime_profile]
B --> C[Implement app behavior<br/>+ readiness signal]
C --> D[Build OCI image<br/>+ sign + push by digest]
D --> E[Smoke test locally<br/>via app_platform_quickstart]
E --> F[Submit manifest to platform admin<br/>POST /admin/apps/manifests]
F --> G[Platform promotes draft → staged → canary]
G --> H[Canary tenants launch the app]
H --> I[App active in production catalog]
I --> J[Version bump → repeat from B]
→ Source: App_Platform_Quickstart_v1.md, Example_App_Developer_Reference_Workflow_v1.md
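Because manifests are pinned to a digest rather than a tag, a quick local check before submitting a manifest can catch tag-style references early. The sketch below is illustrative only: `check_digest_pinning` and the regular expression are assumptions, not the starter pack's schema validator, and the elided `sha256:abc...` digest in the sample manifest would not pass it.

```python
import re

# A reference counts as pinned if it ends in @sha256:<64 lowercase hex chars>.
DIGEST_REF = re.compile(r"[^\s@]+@sha256:[0-9a-f]{64}")


def check_digest_pinning(manifest: dict) -> list:
    """Return a list of problems; an empty list means every present ref is pinned."""
    refs = {
        "artifact_ref": manifest.get("artifact_ref"),
        "runtime_profile.image_ref": manifest.get("runtime_profile", {}).get("image_ref"),
    }
    problems = []
    for name, ref in refs.items():
        if ref is None:
            continue  # missing fields are left to the schema validator
        if not DIGEST_REF.fullmatch(ref):
            problems.append(f"{name} is not pinned to a sha256 digest: {ref!r}")
    return problems
```

A check like this could run in CI next to the manifest schema validator, failing the build when the returned list is non-empty.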
What ships today vs what's still designed
| Capability | Status |
|---|---|
| Manifest registration + schema | Implemented |
| OCI registry baseline | Implemented |
| Artifact trust + promotion | Implemented |
| Launchable OCI workload profile contract | Implemented |
| App-runtime lifecycle | Implemented |
| Member operations (add/remove/recover) | Implemented (Slurm) |
| Embedded UI gateway contract | Designed |
| App-runtime metering producer | Designed |
| Tenant-shared runtime API direction | Designed |
| Multi-allocation cluster apps (cross-network) | Designed (blocked on slice networking) |
External team integration
Read this before kicking off an external-team build: External_App_Team_Integration_Guide_v1.md. It covers the joint operating model, what the platform owns versus what the app team owns, the security review process, and the readiness checklist.
Runbooks (apps side)
| Runbook | When |
|---|---|
| App Artifact Lifecycle Incident | OCI artifact promote / trust / digest issues |
| App Catalog Incident | Manifest registration / catalog page failures |
| App Platform Operator Incident | Platform-operator system app incidents |
| App Runtime Billing Incident | Per-app usage record alignment |
| App Runtime Lifecycle Incident | Instance stuck in create / start / stop / release |
Where to look next
- Apps trail — full app-platform path with diagrams per step
- CLI — operator surface
- Python SDK — programmatic surface
- Direct REST API — when the SDK / CLI doesn't expose what you need
- End-to-end quick start — Hello-World app from build to launch
- Source: Build_an_App_for_GPUaaS_v1.md