App SDK — building apps on GPUaaS

Status: Implemented · Designed · Contract

Sources: doc/architecture/Build_an_App_for_GPUaaS_v1.md · App_Developer_Starter_Pack_v1.md · App_Manifest_Registration_Guide_v1.md · Launchable_OCI_Workload_Profile_Contract_v1.md · App_Control_Plane_v1.md · App_Platform_Quickstart_v1.md · External_App_Team_Integration_Guide_v1.md

This page is for developers building apps that run on GPUaaS. It is distinct from the CLI and SDK, which are for clients calling the platform. An app team owns the OCI artifact, the manifest declaring compatibility, the runtime behavior, and optionally a custom UI extension.

Where the boundary sits

flowchart LR
    classDef p fill:#e3f2fd,stroke:#1565c0
    classDef a fill:#fff3e0,stroke:#e65100
    classDef shared fill:#e8f5e9,stroke:#2e7d32

    subgraph PLAT[GPUaaS owns]
        direction TB
        ALLOC[Allocation lifecycle<br/>baremetal or gpu_slice]:::p
        AR[app-runtime-worker]:::p
        CAT[App catalog + promotion]:::p
        REG[OCI registry baseline]:::p
        IAM[Auth tenant project IAM]:::p
        BILL[Allocation billing + ledger]:::p
        AUD[Audit logs]:::p
    end
    subgraph APP[App team owns]
        direction TB
        MAN[App manifest<br/>compatibility, profile, runtime]:::a
        OCI[OCI artifact<br/>image or controller]:::a
        BS[Bootstrap script<br/>cloud-init or controller logic]:::a
        UI[Optional UI extension]:::a
        MET[App-runtime metering<br/>DESIGNED]:::a
    end
    subgraph SHARED[Joint contract]
        direction TB
        LP[Launchable OCI workload<br/>profile contract]:::shared
        LC[Lifecycle hooks<br/>created -> running -> stopped]:::shared
        BC[Billing alignment<br/>per-app metering]:::shared
    end

What an app team does

flowchart TB
    S1[1. Declare manifest<br/>requires_capacity_shape,<br/>min_gpu_count, runtime_profile] --> S2[2. Build + sign<br/>OCI artifact]
    S2 --> S3[3. Register manifest<br/>POST /admin/apps/manifests]
    S3 --> S4[4. Promote artifact<br/>draft → staged → canary → active]
    S4 --> S5[5. Tenant launches app<br/>into their allocation]
    S5 --> S6[6. App runtime executes<br/>via app-runtime-worker]
    S6 --> S7[7. App emits lifecycle events<br/>+ optional metering]
    S7 --> S8[8. Tenant stops or<br/>allocation released]
    S8 --> S9[9. Honest decommission<br/>teardown not just metadata]

    classDef done fill:#d1e7dd,stroke:#0a3622
    class S1,S2,S3,S4,S5,S6 done

1. App manifest

The manifest is the gate between an app and the catalog. Sample:

{
  "slug": "jupyter-cuda-dev",
  "display_name": "Jupyter (CUDA dev)",
  "version": "1.4.2",
  "artifact_kind": "oci",
  "artifact_ref": "registry.gpuaas.example.com/apps/jupyter-cuda-dev@sha256:abc...",
  "compatibility": {
    "requires_capacity_shape": ["baremetal", "gpu_slice"],
    "min_gpu_count": 1,
    "requires_exclusive_node": false,
    "supported_accelerators": ["nvidia"]
  },
  "runtime_profile": {
    "kind": "launchable_oci",
    "image_ref": "registry.gpuaas.example.com/apps/jupyter-cuda-dev@sha256:abc...",
    "command": ["jupyter", "lab", "--ip=0.0.0.0", "--port=8888"],
    "env": [{"name": "GRANT_SUDO", "value": "no"}],
    "ports": [{"container": 8888, "expose": "embedded_ui_gateway"}],
    "resources": {"min_memory_mib": 4096, "min_vcpu": 4}
  },
  "ui": {
    "extension_kind": "embedded",
    "path": "/lab"
  },
  "metering": {
    "kind": "session_seconds"
  }
}

Compatibility decides which allocations can run the app. The orchestrator filters incompatible targets out of the user's launch wizard automatically.

→ Source: App_Manifest_Registration_Guide_v1.md
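The compatibility check can be pictured as a simple filter over candidate allocations. The following is a minimal sketch, not the orchestrator's implementation: the `Allocation` class and its field names are hypothetical, and only the keys of the `compatibility` block come from the sample manifest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    """Hypothetical view of a tenant allocation; field names are assumptions."""
    id: str
    capacity_shape: str   # "baremetal" or "gpu_slice"
    gpu_count: int
    accelerator: str      # e.g. "nvidia"
    exclusive_node: bool


def is_compatible(compat: dict, alloc: Allocation) -> bool:
    """Return True if the allocation satisfies the manifest's compatibility block."""
    if alloc.capacity_shape not in compat["requires_capacity_shape"]:
        return False
    if alloc.gpu_count < compat["min_gpu_count"]:
        return False
    if compat.get("requires_exclusive_node", False) and not alloc.exclusive_node:
        return False
    if alloc.accelerator not in compat["supported_accelerators"]:
        return False
    return True


def launchable_targets(compat: dict, allocations: list) -> list:
    """Keep only the allocations the app could be launched into."""
    return [a for a in allocations if is_compatible(compat, a)]


# The compatibility block from the sample manifest.
compat = {
    "requires_capacity_shape": ["baremetal", "gpu_slice"],
    "min_gpu_count": 1,
    "requires_exclusive_node": False,
    "supported_accelerators": ["nvidia"],
}

# Illustrative allocations (made-up data).
allocations = [
    Allocation("alloc-a", "gpu_slice", 1, "nvidia", False),
    Allocation("alloc-b", "baremetal", 8, "amd", True),     # unsupported accelerator
    Allocation("alloc-c", "gpu_slice", 0, "nvidia", False), # too few GPUs
]

print([a.id for a in launchable_targets(compat, allocations)])
```

With the made-up data above, only `alloc-a` passes every check, so it is the only target a launch wizard built on this filter would offer.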

2. OCI artifact: trust + promotion

stateDiagram-v2
    [*] --> draft: developer publishes digest
    draft --> staged: passes lint + smoke
    staged --> canary: small % rollout enabled
    canary --> active: confidence gained
    canary --> staged: rollback
    active --> deprecated: superseded version active
    deprecated --> retired: grace period elapsed
    retired --> [*]: no instance references it

    note right of active
      Tenants can launch this version.
      Manifest pinned to a digest, not a tag.
    end note

Promotion rules:

  • Every state transition is audit-logged (actor, before, after, reason).
  • An active manifest cannot reference a non-active artifact.
  • A deprecated manifest cannot be assigned to a new instance — existing instances continue.
  • Retiring is gated on "no active instances using this digest".

→ Sources: App_Artifact_Trust_and_Promotion_v1.md, App_Platform_OCI_Registry_Baseline_v1.md
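The state diagram and promotion rules can be expressed as a small transition table. This is an illustrative sketch only: the function name `promote`, the `PromotionError` type, and the in-memory `audit_log` list are assumptions, not the platform's API. The transitions, the audit fields (actor, before, after, reason), and the retirement gate are taken from the rules above.

```python
# Allowed transitions, transcribed from the promotion state diagram.
ALLOWED_TRANSITIONS = {
    "draft": {"staged"},
    "staged": {"canary"},
    "canary": {"active", "staged"},  # canary -> staged is the rollback path
    "active": {"deprecated"},
    "deprecated": {"retired"},
    "retired": set(),
}


class PromotionError(Exception):
    """Raised when a requested transition violates the promotion rules."""


def promote(current, target, *, actor, reason, audit_log, active_instances=0):
    """Validate a transition, record it in the audit log, and return the new state."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise PromotionError(f"{current} -> {target} is not an allowed transition")
    # Retiring is gated on no active instances using the digest.
    if target == "retired" and active_instances > 0:
        raise PromotionError("cannot retire: active instances still use this digest")
    audit_log.append(
        {"actor": actor, "before": current, "after": target, "reason": reason}
    )
    return target


# Walk one artifact from draft to active (hypothetical actor and reason).
audit_log = []
state = "draft"
for nxt in ("staged", "canary", "active"):
    state = promote(state, nxt, actor="release-bot", reason="checks passed",
                    audit_log=audit_log)

print(state, len(audit_log))
```

Keeping the transitions in one table makes the rollback path and the terminal state explicit, and a skipped stage such as `draft` to `active` is rejected before anything is written to the audit log.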

3. Launchable OCI workload profile (data model)

erDiagram
    app_manifests ||--o{ app_instances : "spawns"
    app_instances ||--o{ app_instance_members : "expands into"
    app_instances ||--o{ app_instance_events : "produces timeline"
    allocations ||--o{ app_instance_members : "bound to"
    sku_catalog ||--o{ allocations : "billed against"

    app_manifests {
        text slug PK
        text version
        text artifact_kind "oci|launchable|controller"
        text artifact_ref "registry/repo@digest"
        jsonb compatibility
        jsonb runtime_profile
        jsonb ui
        jsonb metering
        text status "draft|staged|canary|active|deprecated|retired"
    }
    app_instances {
        uuid id PK
        text app_slug FK
        uuid tenant_id
        uuid project_id
        text status "created|starting|running|stopping|stopped|failed"
        uuid target_allocation_id
        timestamp created_at
        timestamp activated_at
    }
    app_instance_members {
        uuid id PK
        uuid instance_id FK
        uuid allocation_id FK
        text role "controller|worker|head|node"
        text status
    }
    app_instance_events {
        uuid id PK
        uuid instance_id FK
        text event_type "start_completed|healthy|member_added|member_failed"
        jsonb payload
        timestamp occurred_at
    }

→ Source: Launchable_OCI_Workload_Profile_Contract_v1.md. Schema JSON: doc/architecture/schemas/launchable_oci_workload_profile.v1.schema.json
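As a rough illustration of the `app_instance_events` entity, the sketch below mirrors its columns in a Python dataclass and orders events into a timeline. The class name, the validation, and the `timeline` helper are assumptions made for illustration; only the column names and the listed `event_type` values come from the diagram.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Event types listed on the app_instance_events entity.
EVENT_TYPES = {"start_completed", "healthy", "member_added", "member_failed"}


@dataclass
class AppInstanceEvent:
    """In-memory mirror of one app_instance_events row (illustrative only)."""
    instance_id: uuid.UUID
    event_type: str
    payload: dict = field(default_factory=dict)
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self):
        # Reject event types the diagram does not list.
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event_type: {self.event_type!r}")


def timeline(events):
    """Return events ordered by occurrence time, oldest first."""
    return sorted(events, key=lambda e: e.occurred_at)


# Made-up events for one instance, deliberately out of order.
instance_id = uuid.uuid4()
t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 1, 12, 0, 30, tzinfo=timezone.utc)
events = [
    AppInstanceEvent(instance_id, "healthy", occurred_at=t1),
    AppInstanceEvent(instance_id, "start_completed", occurred_at=t0),
]

print([e.event_type for e in timeline(events)])
```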

4. Lifecycle contract (what the platform expects)

sequenceDiagram
    autonumber
    participant U as Tenant
    participant API as cmd/api
    participant ARW as app-runtime-worker
    participant ART as app artifact / controller
    participant ALLOC as tenant allocation

    U->>API: POST /apps/instances {slug, target_allocation_id}
    API->>ARW: enqueue create
    ARW->>ART: pull OCI by digest + verify
    ARW->>ALLOC: dispatch typed task to node-agent<br/>(via provisioning-worker path)
    ALLOC->>ALLOC: bootstrap (cloud-init or controller)
    ALLOC-->>ARW: readiness reported
    ARW-->>API: status=running
    API-->>U: WS push

    loop while running
        ART-->>ARW: app_instance_events (heartbeat, member changes)
        ART-->>ARW: optional metering events
    end

    U->>API: POST /apps/instances/{id}/stop
    API->>ARW: enqueue stop
    ARW->>ART: graceful shutdown signal
    ART->>ALLOC: tear down processes
    ART-->>ARW: stopped
    Note over ARW: honest teardown — not just metadata
    ARW-->>API: status=stopped

Required hooks an app must implement:

| Hook | When | Required to emit |
| --- | --- | --- |
| start | On instance creation | Reach running status; report readiness on a known port or via heartbeat |
| health | Continuously | Heartbeat + (optionally) member-add / member-remove / member-failed events |
| stop | On stop request OR allocation releasing | Real teardown of processes; clean exit; idempotent if called twice |
| metering (optional) | While running | Per-app usage events (session-seconds, tokens, requests, etc.) |

→ Sources: App_Runtime_Instance_Lifecycle_v1.md, App_Runtime_Operating_Modes_v1.md, App_Runtime_Recovery_Model_v1.md
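The hooks above can be sketched as a small wrapper around the app's process. This is a minimal sketch under stated assumptions: the `AppRuntime` class, its method names, and the event strings it records are hypothetical and do not represent the platform's hook interface. It illustrates the two properties the table asks for in `stop`: real process teardown, and idempotence when called twice.

```python
import subprocess
import sys
import threading


class AppRuntime:
    """Hypothetical wrapper showing start / health / stop behavior."""

    def __init__(self, command):
        self._command = command
        self._proc = None
        self._lock = threading.Lock()
        self._stopped = False
        self.events = []  # stand-in for events reported to the platform

    def start(self):
        """Launch the workload process and record that start completed."""
        self._proc = subprocess.Popen(self._command)
        self.events.append("start_completed")

    def health(self):
        """Return True and record a heartbeat while the process is alive."""
        alive = self._proc is not None and self._proc.poll() is None
        if alive:
            self.events.append("healthy")
        return alive

    def stop(self, timeout=5.0):
        """Tear the process down for real; a second call is a no-op."""
        with self._lock:
            if self._stopped:
                return
            if self._proc is not None and self._proc.poll() is None:
                self._proc.terminate()           # graceful shutdown signal
                try:
                    self._proc.wait(timeout=timeout)
                except subprocess.TimeoutExpired:
                    self._proc.kill()            # escalate if it does not exit
                    self._proc.wait()
            self._stopped = True
            self.events.append("stopped")


# Demo with a placeholder workload that just sleeps.
rt = AppRuntime([sys.executable, "-c", "import time; time.sleep(60)"])
rt.start()
was_healthy = rt.health()
rt.stop()
rt.stop()  # second call must not fail or emit a duplicate event
print(rt.events)
```

Guarding `stop` with a lock and a flag means a stop request and an allocation release arriving close together produce a single teardown and a single `stopped` report.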

5. App UI extension (optional)

If the app has a UI, embed it via the gateway pattern:

flowchart LR
    U[Tenant browser] --> SHELL[GPUaaS web shell<br/>/workloads/instance/:id]
    SHELL --> GW[Embedded UI gateway<br/>platform-owned reverse proxy]
    GW --> APP[App UI<br/>running in allocation]

    APP -.cookies / WS / CSP / origin.-> GW
    Note[GW owns:<br/>auth + session +<br/>cookie + WS + CSP +<br/>frame policy]
    GW -.delegates rendering only.-> APP

    classDef plat fill:#e3f2fd,stroke:#1565c0
    classDef app fill:#fff3e0,stroke:#e65100
    class SHELL,GW plat
    class APP app

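With the gateway pattern, the app never issues its own user session; the platform's session is forwarded with constrained scope. Apps that need their own UI declare ui.extension_kind = "embedded" and rely on the gateway. Per the status table on this page, the embedded UI gateway contract is currently Designed rather than Implemented.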

→ Sources: Embedded_UI_Gateway_Contract_v1.md, App_UI_Extension_Model_v1.md

6. App-runtime billing alignment

Designed — the metering producer side is the gap the external review flagged.

sequenceDiagram
    autonumber
    participant ALLOC as allocation (SKU)
    participant APP as app instance
    participant AM as app metering producer (DESIGNED)
    participant BW as billing-worker
    participant LED as ledger

    ALLOC->>BW: GPU-hour usage_records
    BW->>LED: ledger debit (allocation cost)
    APP->>AM: app-specific metric<br/>(session_seconds | tokens | requests)
    AM->>BW: app_usage_records<br/>(non-double-charge dimension)
    BW->>LED: ledger debit (app-level)
    Note over LED: tenant sees combined cost —<br/>operators see breakdown

The contract is defined, but the producer wiring is still in progress. App teams should emit metering events now, before the producer is fully wired; the events buffer cleanly in the meantime.

→ Sources: App_Runtime_Billing_Model_v1.md, App_Runtime_Metering_v1.md, App_Runtime_External_Worker_Contract_v1.md
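One way an app could emit `session_seconds` events while the producer is not yet reachable is to queue them locally and flush in order once delivery succeeds. The sketch below assumes an app-side buffer; the source does not specify where buffering happens. The `SessionMeter` class, the event fields, and the `seq` counter are all hypothetical, and `send` stands in for whatever transport the metering contract defines.

```python
from collections import deque


class SessionMeter:
    """Hypothetical app-side buffer for session_seconds metering events."""

    def __init__(self, instance_id, send):
        self._instance_id = instance_id
        self._send = send        # callable that delivers one event; may raise
        self._buffer = deque()
        self._seq = 0

    def record(self, seconds):
        """Queue one usage event; nothing is sent yet."""
        self._seq += 1
        self._buffer.append({
            "instance_id": self._instance_id,
            "kind": "session_seconds",
            "quantity": seconds,
            "seq": self._seq,    # assumed field so a consumer can spot retries
        })

    def flush(self):
        """Send queued events in order; stop at the first delivery failure."""
        sent = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break            # keep the event queued for the next flush
            self._buffer.popleft()
            sent += 1
        return sent

    @property
    def pending(self):
        return len(self._buffer)


# Demo: the producer is unreachable at first, then becomes available.
delivered = []
producer_up = [False]


def send(event):
    if not producer_up[0]:
        raise ConnectionError("metering producer not reachable")
    delivered.append(event)


meter = SessionMeter("inst-123", send)
meter.record(30)
meter.record(30)
sent_while_down = meter.flush()
producer_up[0] = True
sent_after_up = meter.flush()
print(sent_while_down, sent_after_up, meter.pending)
```

An event is removed from the queue only after `send` returns, so a failed delivery never drops usage data. A production version would also need a bound or persistence for the queue, which this sketch leaves out.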

7. Tenant-scoped service accounts

Apps run under a tenant-scoped service account identity, not the user's identity. The platform issues a short-lived token (TTL set by auth.service_account_token_ttl_seconds, default 900 seconds) that the app uses to call back into the platform API.

sequenceDiagram
    autonumber
    participant ARW as app-runtime-worker
    participant SA as tenant-scoped SA
    participant APP as app inside allocation
    participant API as cmd/api

    ARW->>SA: provision SA for this instance (scope=project, role=app)
    SA-->>ARW: {sa_id, signing_key} (one-time)
    ARW->>APP: inject sa_id + signing_key + base_url via env
    loop every <900s
        APP->>API: POST /auth/sa/token (signed assertion)
        API-->>APP: access_token, exp = now+900s
    end
    APP->>API: API calls with Bearer (e.g. storage, metering)
    API->>API: verify SA + enforce scope

→ Sources: Tenant_Scoped_App_Machine_Identity_v1.md, Service_Account_Model.md
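Because the token expires after the configured TTL, the app needs to refresh it before expiry rather than on every call. The following sketch shows one possible caching approach. The `TokenCache` class and the 60-second refresh margin are assumptions, and `fetch` is a hypothetical stand-in for the signed-assertion request to POST /auth/sa/token, whose request format is not described here.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Caches an access token and refreshes it shortly before it expires."""

    def __init__(self, fetch: Callable[[], Tuple[str, float]],
                 refresh_margin_s: float = 60.0,
                 clock: Callable[[], float] = time.time):
        self._fetch = fetch              # returns (access_token, expiry_epoch_s)
        self._margin = refresh_margin_s  # assumed safety margin before expiry
        self._clock = clock
        self._token: Optional[str] = None
        self._exp = 0.0

    def get(self) -> str:
        """Return a valid token, fetching a new one when close to expiry."""
        if self._token is None or self._clock() >= self._exp - self._margin:
            self._token, self._exp = self._fetch()
        return self._token


# Demo with a fake clock and a fake token endpoint (no network involved).
now = [1000.0]
calls = []


def fake_fetch():
    calls.append(now[0])
    return f"token-{len(calls)}", now[0] + 900  # 900 s TTL, as in the default


cache = TokenCache(fake_fetch, clock=lambda: now[0])
first = cache.get()    # initial fetch
now[0] += 600          # well inside the TTL: served from cache
second = cache.get()
now[0] += 300          # at the expiry time: inside the margin, so refresh
third = cache.get()
print(first, second, third)
```

Injecting the clock and the fetch function keeps the refresh logic testable without a running platform API.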

8. Starter pack

The starter pack pulls everything an app team needs into one place:

mindmap
  root((Starter pack))
    Templates
      OCI image Dockerfile template
      App manifest template
      cloud-init bootstrap template
      Controller template - Go
    Reference apps
      Jupyter CUDA dev
      vLLM OpenAI compose first slice
      Slurm reference controller
      RKE2 self-managed
    Docs
      Build an App for GPUaaS
      External App Team Integration Guide
      Example App Developer Reference Workflow
    SDKs
      Python SDK gpuaas-sdk
      Go client snippets
    CI
      Manifest schema validator
      Smoke test harness
      OCI signing pipeline

→ Source: App_Developer_Starter_Pack_v1.md

9. End-to-end build path

flowchart TB
    A[Clone starter pack] --> B[Adjust manifest template<br/>+ runtime_profile]
    B --> C[Implement app behavior<br/>+ readiness signal]
    C --> D[Build OCI image<br/>+ sign + push by digest]
    D --> E[Smoke test locally<br/>via app_platform_quickstart]
    E --> F[Submit manifest to platform admin<br/>POST /admin/apps/manifests]
    F --> G[Platform promotes draft → staged → canary]
    G --> H[Canary tenants launch the app]
    H --> I[App active in production catalog]
    I --> J[Version bump → repeat from B]

→ Sources: App_Platform_Quickstart_v1.md, Example_App_Developer_Reference_Workflow_v1.md
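Before submitting a manifest, a team might run a quick local check that it is pinned to a digest rather than a tag. The sketch below is a hypothetical pre-submit check and not the starter pack's schema validator: the list of required fields and the regular expression are assumptions, and the digest in the demo is made-up data.

```python
import re
from typing import List

# Matches "<registry>/<repo>@sha256:<64 hex chars>"; tags such as ":latest" fail.
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")

# Assumed minimum set of top-level fields, based on the sample manifest.
REQUIRED_FIELDS = (
    "slug", "version", "artifact_kind", "artifact_ref",
    "compatibility", "runtime_profile",
)


def check_manifest(manifest: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = [f"missing field: {k}" for k in REQUIRED_FIELDS if k not in manifest]

    ref = manifest.get("artifact_ref")
    if ref is not None and not DIGEST_REF.match(ref):
        problems.append("artifact_ref must be pinned to a sha256 digest, not a tag")

    image_ref = manifest.get("runtime_profile", {}).get("image_ref")
    if image_ref is not None and not DIGEST_REF.match(image_ref):
        problems.append("runtime_profile.image_ref must be pinned to a sha256 digest")

    return problems


# Demo with a made-up digest.
pinned = "registry.gpuaas.example.com/apps/jupyter-cuda-dev@sha256:" + "a" * 64
good = {
    "slug": "jupyter-cuda-dev",
    "version": "1.4.2",
    "artifact_kind": "oci",
    "artifact_ref": pinned,
    "compatibility": {"min_gpu_count": 1},
    "runtime_profile": {"kind": "launchable_oci", "image_ref": pinned},
}
bad = dict(good, artifact_ref="registry.gpuaas.example.com/apps/jupyter-cuda-dev:latest")

print(check_manifest(good))
print(check_manifest(bad))
```

Running a check like this in CI before the POST to /admin/apps/manifests catches tag-based references early, which matters because promotion pins manifests to digests.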

What ships today vs what's still designed

| Capability | Status |
| --- | --- |
| Manifest registration + schema | Implemented |
| OCI registry baseline | Implemented |
| Artifact trust + promotion | Implemented |
| Launchable OCI workload profile contract | Implemented |
| App-runtime lifecycle | Implemented |
| Member operations (add/remove/recover) | Implemented (Slurm) |
| Embedded UI gateway contract | Designed |
| App-runtime metering producer | Designed |
| Tenant-shared runtime API direction | Designed |
| Multi-allocation cluster apps (cross-network) | Designed (blocked on slice networking) |

External team integration

Read External_App_Team_Integration_Guide_v1.md before kicking off an external-team build. It covers the joint operating model, what the platform owns versus what the app team owns, the security review process, and the readiness checklist.

Runbooks (apps side)

| Runbook | When |
| --- | --- |
| App Artifact Lifecycle Incident | OCI artifact promote / trust / digest issues |
| App Catalog Incident | Manifest registration / catalog page failures |
| App Platform Operator Incident | Platform-operator system app incidents |
| App Runtime Billing Incident | Per-app usage record alignment |
| App Runtime Lifecycle Incident | Instance stuck in create / start / stop / release |

Where to look next