Launchable OCI Workload Profile Contract v1

As of: April 14, 2026

Purpose

Define the first implementable contract for launchable OCI workload images and composition profiles.

This document turns the product direction in doc/product/Launchable_OCI_Workload_Images_v1.md into a control-plane contract that can drive:

  1. catalog display,
  2. generated launch forms,
  3. launch-on-existing-allocation validation,
  4. execution-engine adapter rendering,
  5. lifecycle/status reporting,
  6. future JupyterLab, vLLM, and composed AI service stacks.

Decision Summary

Launchable OCI workloads should be modeled as a distinct app class:

raw allocation -> optional managed runtime bundle -> launchable workload profile

The first slice should:

  1. use OCI images from the platform registry as the artifact source,
  2. start with launch-on-existing-allocation,
  3. store a typed workload profile manifest on the app version,
  4. render UI fields from JSON Schema plus UI hints,
  5. produce a typed launch values object,
  6. hand the values object to an execution-engine adapter,
  7. keep Docker Compose, Helm, Kubernetes, dstack, and node-agent execution behind adapter boundaries.

This is the app-model lane that follows Slurm and RKE2, but it is narrower than the full Helm/CUE/dstack meta-language problem. It defines the product-facing profile contract first.

Relationship To Existing App Model

The profile should reuse existing app control-plane objects where possible.

Current mapping:

  1. app_catalog: app family and product card
  2. app_versions.manifest: versioned workload profile manifest
  3. app artifact records: OCI digest, promotion, trust, and provenance metadata
  4. app instance: running workload instance and lifecycle record
  5. app members or runtime state: adapter-reported placement, endpoints, health, and events

Do not create a separate launchable-workload ownership system for v1. The workload is an app instance with a stricter manifest shape and a different runtime adapter.
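
Read against that mapping, a launched workload is simply an app instance whose manifest is a workload profile. The record below is a hedged sketch of that reuse; the field names are illustrative, not the persistence schema:

app_instance:
  app_slug: jupyterlab                      # app_catalog: family and product card
  app_version: 2026.04-preview              # app_versions.manifest: carries the workload profile
  artifact:
    digest: sha256:<digest>                 # app artifact record: promotion, trust, provenance
  placement_intent:
    target_allocation_id: uuid
  lifecycle_state: running                  # app instance: lifecycle record
  members:
    - role: workload_container
      reported_by: node_agent               # app members/runtime state: adapter-reported
      endpoints: [web]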

Profile Manifest Shape

The app version manifest should reserve this shape:

Canonical schema file:

  • doc/architecture/schemas/launchable_oci_workload_profile.v1.schema.json

profile:
  kind: gpuaas.launchable_oci_workload
  schema_version: v1
  slug: jupyterlab
  display_name: JupyterLab
  description: Browser notebook workspace on an existing allocation.
  support_level: platform_curated
  launch_mode: existing_allocation

artifacts:
  primary_image:
    source: platform_registry
    artifact_name: runtime-cpu
    digest_required: true
    media_type: application/vnd.oci.image.manifest.v1+json
  nvidia_h200_image:
    source: platform_registry
    artifact_name: runtime-nvidia-h200
    digest_required: true
    media_type: application/vnd.oci.image.manifest.v1+json
  amd_rocm_image:
    source: platform_registry
    artifact_name: runtime-amd-rocm
    digest_required: true
    media_type: application/vnd.oci.image.manifest.v1+json

parameters:
  schema:
    type: object
    required: [workspace_mount, exposure_mode]
    properties:
      workspace_mount:
        type: string
        enum: [scratch, project_storage]
        default: scratch
      exposure_mode:
        type: string
        enum: [private]
        default: private
      host_port:
        type: integer
        minimum: 1024
        maximum: 65535
        default: 8888
      gpu_count:
        type: integer
        minimum: 0
        default: 1
      cpu_cores:
        type: integer
        minimum: 1
        default: 2
      memory_gib:
        type: integer
        minimum: 1
        default: 4
      python_packages:
        type: string
        default: ""
        maxLength: 4096
  ui:
    order: [workspace_mount, exposure_mode, host_port, gpu_count, cpu_cores, memory_gib, python_packages]
    groups:
      - id: workspace
        title: Workspace
        fields: [workspace_mount]
      - id: access
        title: Access
        fields: [exposure_mode, host_port]
      - id: resources
        title: Resources
        fields: [gpu_count, cpu_cores, memory_gib]
      - id: packages
        title: Packages
        fields: [python_packages]
    launch_wizard:
      steps:
        - id: target
          title: Target
          description: Choose the active allocation that will host this workload.
          components:
            - kind: platform.target_allocation
              required: true
        - id: runtime
          title: Runtime
          description: Choose the digest-pinned OCI runtime image and runtime inputs.
          components:
            - kind: platform.artifact_selector
              required: true
            - kind: field
              name: python_packages
              required: false
        - id: resources
          title: Resources
          description: Size the workload for the selected allocation.
          components:
            - kind: field
              name: gpu_count
              required: true
            - kind: field
              name: cpu_cores
              required: true
            - kind: field
              name: memory_gib
              required: true
        - id: storage
          title: Storage
          components:
            - kind: field
              name: workspace_mount
              required: true
        - id: access
          title: Access
          components:
            - kind: field
              name: exposure_mode
              required: true
            - kind: field
              name: host_port
              required: true
        - id: review
          title: Review
          components:
            - kind: platform.review
              required: true

resources:
  gpu:
    min_count: 0
    default_count: 1
    placement: allocation_local
  cpu:
    min_cores: 2
  memory:
    min_gib: 4

storage:
  mounts:
    - name: workspace
      path: /workspace
      required: true
      modes: [scratch, project_storage]

network:
  endpoints:
    - name: web
      port: 8888
      protocol: http
      exposure_modes: [private, platform_proxy]

execution:
  default_engine: node_agent
  container_env:
    HOME: /workspace
    JUPYTER_RUNTIME_DIR: /workspace/.jupyter/runtime
    JUPYTER_CONFIG_DIR: /workspace/.jupyter/config
    JUPYTER_DATA_DIR: /workspace/.jupyter/data
    JUPYTER_PREFER_ENV_PATH: "0"
    JUPYTERLAB_SETTINGS_DIR: /workspace/.jupyter/lab/user-settings
    JUPYTERLAB_WORKSPACES_DIR: /workspace/.jupyter/lab/workspaces
  supported_engines:
    - engine: node_agent
      adapter: oci_container
      status: active
    - engine: docker_compose
      adapter: compose_v2
      status: future
    - engine: helm
      adapter: helm_values
      status: future

outputs:
  endpoints:
    - name: web
      source: network.endpoints.web
  commands:
    - name: inspect
      command: gpuaas apps instances describe

validation:
  readiness:
    - name: web_http_ready
      type: http
      endpoint: web
      path: /
  smoke:
    - name: container_running
      type: adapter_status
      status: running

This is the target shape, not a claim that the current API validates every field today.

For launchable profiles, parameters.ui.launch_wizard is the deploy UX source of truth. The platform owns the wizard shell, validation gates, and reserved platform components; the app manifest owns step names, descriptions, order, and which declared fields appear in each step.

Reserved platform wizard component kinds:

  • platform.target_allocation: single allocation picker for profile.launch_mode: existing_allocation; output is placement_intent.target_allocation_id.
  • platform.artifact_selector: verified published/promoted OCI artifact picker filtered by selected allocation and requested GPU count.
  • platform.access_credential: project SSH/bootstrap credential picker when a profile requires host bootstrap access.
  • platform.operator_service_account: project service-account picker when a profile requires app-owned automation identity.
  • platform.review: final read-only review of target, artifact digest, resource overrides, and config payload.
  • field: renders a field from parameters.schema.properties; name must reference a declared parameter.

For Docker Compose-backed profiles, the manifest may also carry a platform-owned compose renderer and logical service topology. This is metadata and adapter selection, not user-authored Compose YAML:

execution:
  default_engine: docker_compose
  compose:
    renderer: vllm_openai
    topology:
      description: Single-node OpenAI-compatible inference endpoint.
      services:
        - name: vllm
          role: openai_server
          endpoints: [openai]
          mounts: [workspace]

The control plane remains responsible for rendering the actual Compose file from the curated renderer, selected digest-pinned artifact, resource overrides, endpoint policy, and storage mounts.
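
For orientation, a render from the curated vllm_openai renderer might come out roughly like the sketch below. The registry host, image name, server arguments, and port are assumptions used for illustration; the committed renderer owns the real output:

services:
  vllm:
    image: <registry-host>/runtime-vllm-nvidia@sha256:<digest>   # digest-pinned platform artifact (placeholder name)
    command: ["--model", "/models/selected", "--port", "8000"]   # assumed OpenAI-compatible server arguments
    ports:
      - "127.0.0.1:8000:8000"                                    # private host-local exposure
    volumes:
      - <allocation-user-home>/workloads/<instance-id>/workspace:/workspace
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]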

Launch Request Values

The UI should not submit backend YAML.

It should submit:

{
  "app_slug": "jupyterlab",
  "app_version": "2026.04-preview",
  "display_name": "My JupyterLab",
  "placement_intent": {
    "target_allocation_id": "uuid"
  },
  "config": {
    "workspace_mount": "project_storage",
    "exposure_mode": "private",
    "host_port": 8888,
    "python_packages": "numpy==2.4.4 pandas>=2.2"
  },
  "resource_overrides": {
    "gpu_count": 0,
    "cpu_cores": 2,
    "memory_gib": 4
  }
}

The app instance placement key for this workload class is target_allocation_id. Do not overload Slurm controller_allocation_id or RKE2 server_allocation_id for allocation-local OCI workloads.

Server-side launch validation should check:

  1. the allocation is active,
  2. the allocation belongs to the project,
  3. the app version is entitled and published,
  4. the selected OCI artifact is promoted and not retired,
  5. the digest is immutable and trusted according to current policy,
  6. workspace/storage choices are allowed by the profile,
  7. requested exposure mode is supported by the environment,
  8. requested GPU/CPU/memory shape fits the target allocation,
  9. the execution adapter is enabled for the environment.
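
A rejected launch should come back as a typed, field-level validation result rather than an adapter failure. The payload below is a hypothetical illustration of that shape, not a committed API response:

launch_rejection:
  code: launch_validation_failed               # hypothetical error code
  errors:
    - check: allocation_active                 # check 1 above
      message: Target allocation is not active.
    - check: resource_fit                      # check 8 above
      field: resource_overrides.gpu_count
      message: Requested 2 GPUs; the target allocation exposes 1.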

Execution Engine Adapter Boundary

Every execution engine adapter receives the same high-level input:

  1. profile manifest,
  2. validated launch values,
  3. project and app instance identity,
  4. allocation placement,
  5. resolved artifact digests and pull credentials,
  6. resolved storage mount decisions,
  7. resolved endpoint exposure decision.
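
Collapsed into a single envelope, that input can be pictured as the sketch below; the field names are illustrative placeholders keyed to the numbered items above, not a fixed wire format:

adapter_input:
  profile_manifest: app_versions/<version-id>/manifest        # 1. profile manifest
  launch_values:                                               # 2. validated launch values
    config: {workspace_mount: scratch, exposure_mode: private, host_port: 8888}
    resource_overrides: {gpu_count: 1, cpu_cores: 2, memory_gib: 4}
  identity:                                                    # 3. project and app instance identity
    project_id: uuid
    app_instance_id: uuid
  placement:                                                   # 4. allocation placement
    target_allocation_id: uuid
  artifacts:                                                   # 5. resolved digests and pull credentials
    primary_image:
      digest: sha256:<digest>
      pull_credential: wrapped-reference
  storage:                                                     # 6. resolved storage mount decisions
    workspace: {mode: scratch, path: /workspace}
  endpoints:                                                   # 7. resolved endpoint exposure decision
    web: {exposure: private, host_port: 8888}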

Every adapter must return the same high-level output:

  1. lifecycle state,
  2. adapter-specific status summary,
  3. endpoints,
  4. logs or log references,
  5. metrics references where available,
  6. events and failure reasons,
  7. cleanup/decommission result.
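
And the matching output envelope, again a hedged shape rather than a committed schema:

adapter_output:
  lifecycle_state: running                       # 1. lifecycle state
  status_summary: container running and healthy  # 2. adapter-specific status summary
  endpoints:                                     # 3. endpoints
    - name: web
      address: 127.0.0.1:8888
  logs:                                          # 4. logs or log references
    - ref: node_task/<task-id>/output
  metrics: []                                    # 5. metrics references where available
  events:                                        # 6. events and failure reasons
    - type: readiness
      message: web_http_ready passed
  cleanup:                                       # 7. cleanup/decommission result
    state: not_requested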

Adapter examples:

  1. node_agent: launch one OCI container or local process on an allocation through typed node-agent tasks
  2. docker_compose: render Compose YAML for allocation-local multi-container execution
  3. kubernetes: render Kubernetes resources directly
  4. helm: render curated Helm values and apply a chart bundle
  5. dstack: render .dstack.yml or submit through the dstack API

The adapter owns YAML/run configuration. The product contract owns values, constraints, entitlements, billing attribution, and lifecycle mapping.
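
For the node_agent adapter, that run configuration lands in a workload.oci_launch node task. The payload below is an assumed sketch of such a rendering, consistent with the constraints described later (digest-pinned image, deterministic container name, private host-local port binding, bounded limits); it is not the actual task schema:

workload.oci_launch:
  container_name: gpuaas-jupyterlab-<instance-id>          # deterministic platform container name (assumed prefix)
  image: <registry-host>/runtime-cpu@sha256:<digest>       # digest-pinned platform artifact only
  env:
    HOME: /workspace
    JUPYTER_RUNTIME_DIR: /workspace/.jupyter/runtime
  ports:
    - 127.0.0.1:8888:8888/tcp                              # private host-local publish binding
  mounts:
    - host: <allocation-user-home>/workloads/<instance-id>/workspace
      container: /workspace                                # workload-scoped host mount
  limits:
    cpus: 2
    memory_gib: 4
    gpus: 1                                                # bounded device request
  readiness:
    type: container_state_or_health                        # bounded launch readiness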

First Slice Recommendation

The first implementable slice should be:

  1. one curated jupyterlab profile and one single-node vllm-openai profile,
  2. launch on an existing allocation,
  3. scratch workspace first, project storage once allocation storage is ready,
  4. private or platform-proxied HTTP endpoint depending on the endpoint work,
  5. node-agent OCI container adapter first if container execution is available on the node,
  6. otherwise Docker Compose adapter as the first allocation-local backend,
  7. status/events surfaced through the existing app instance shell.

vllm should be the second curated profile, validated only after the profile/adapter contract is proven with JupyterLab, because it introduces model path, GPU placement, tensor parallelism, and endpoint health concerns that JupyterLab does not.

Current Implementation Status

As of the April 14, 2026 local and platform-control implementation:

  1. the profile schema exists at doc/architecture/schemas/launchable_oci_workload_profile.v1.schema.json,
  2. jupyterlab is seeded as an active, entitled, launchable OCI catalog app,
  3. the catalog UI can open the launch form only when a verified published or promoted matching OCI artifact is available,
  4. the catalog UI can fetch app version manifests and preview the workload profile, execution engine summary, endpoints, resources, and raw manifest,
  5. the general app-instance create path still rejects launch requests without the required target allocation, profile config, and digest-pinned artifact,
  6. the node-agent task catalog recognizes workload.oci_runtime_status, workload.oci_launch, workload.oci_control, workload.oci_remove, workload.compose_launch, workload.compose_control, and workload.compose_remove,
  7. workload.oci_runtime_status performs a non-mutating approved-runtime probe for docker, podman, and nerdctl,
  8. node-agent launch/control/remove handlers are implemented as a bounded first adapter slice: digest-pinned images only, deterministic platform container names, workload-scoped host mounts under the allocation user's home, no raw pull credentials in task logs, approved local runtimes only, and bounded launch readiness based on container state/health,
  9. the app-runtime worker defaults to fail-closed for gpuaas.launchable_oci_workload; when explicitly enabled with APP_RUNTIME_LAUNCHABLE_OCI_NODE_TASKS_ENABLED=true, it renders the selected artifact/profile/config into a queued workload.oci_launch node task and marks the app instance deploying,
  10. while that flag is enabled, app-runtime also polls terminal workload.oci_* and workload.compose_* node task results and reconciles app/member state for launch, stop, start, restart, and decommission,
  11. the app-instance create path reserves placement_intent.target_allocation_id for allocation-local launchable OCI placement and performs minimal manifest-aware value validation before accepting a launch request,
  12. profiles with artifacts.primary_image.digest_required: true require a verified published or promoted app artifact to be selected before the instance is accepted,
  13. the JupyterLab deploy form exposes GPU count, CPU cores, memory GiB, and a private host-local port as launch inputs; the node-agent adapter maps the host port to a 127.0.0.1:<host_port>:8888/tcp Docker publish binding and maps CPU and memory overrides to bounded Docker runtime limits,
  14. the JupyterLab deploy form accepts an optional Python package list; the node-agent adapter validates package specifiers and runs python -m pip install --no-cache-dir ... inside the container before readiness completes. Pip version specifiers are allowed as single tokens, for example numpy==2.4.4, pandas>=2.2, or torch~=2.9.
  15. curated first-slice JupyterLab runtime images now exist for runtime-cpu, runtime-nvidia-h200, and runtime-amd-rocm,
  16. platform-control validation has published and launched the runtime-nvidia-h200 image on an H200 allocation with gpu_count: 1; the resulting Docker device request is bounded to Count: 1, and nvidia-smi inside the container reports one visible NVIDIA H200,
  17. node-agent bootstrap owns the host container-runtime prerequisite for this slice: fresh bootstrap/rebootstrap installs Docker when no approved runtime is present and auto-configures nvidia-container-toolkit when nvidia-smi is present,
  18. the platform-control JupyterLab validation instance reached stable running state through the public app-instance API after the updated control plane and node-agent bootstrap were deployed,
  19. vllm-openai is seeded as the first active Docker Compose-backed curated launchable profile; app-runtime selects the manifest-declared execution.compose.renderer, emits a manifest-owned topology block in the node task, and node-agent returns that topology in runtime output so the UI can show the service/endpoint/mount shape without exposing arbitrary Compose YAML,
  20. platform-control validation launched the vllm-openai NVIDIA H200 profile with Mistral Small 3.2 on one H200 and validated both /v1/models and /v1/chat/completions through a private SSH tunnel,
  21. node-agent lifecycle self-update now exists for agent binary delivery, and fresh bootstrap/rebootstrap owns Docker, Docker Compose, NVIDIA Container Toolkit, registry trust, and H200 site-bootstrap prerequisites for the current slices.

Keep APP_RUNTIME_LAUNCHABLE_OCI_NODE_TASKS_ENABLED enabled only in environments where node-agent has the OCI task handlers and target nodes have an approved OCI runtime. Local-kind and platform-control currently enable this path for the validation slice; production promotion still needs explicit runtime/preflight sign-off.

Adapter constraint observed and addressed during April 2026 validation:

  1. some existing worker allocations did not expose docker, podman, or nerdctl in the default PATH,
  2. Docker was not an active service on those nodes,
  3. node-agent bootstrap now owns the first-slice container-runtime prerequisite: fresh bootstrap/rebootstrap checks for docker, podman, or nerdctl and installs the configured package (docker.io by default) when missing,
  4. on NVIDIA nodes, node-agent bootstrap auto-detects nvidia-smi, installs nvidia-container-toolkit, runs nvidia-ctk runtime configure --runtime=docker, and restarts Docker,
  5. node-agent bootstrap configures Docker registry host trust and registry login from a node-bound bootstrap credential endpoint when registry pull credentials are configured,
  6. node_agent is the first adapter slice; deploy still depends on artifact verification and target-node runtime readiness.

This is intentionally a node bootstrap behavior, not an app deploy behavior. Existing nodes may still require rebootstrap or an infra-owned runtime install when host prerequisites drift outside the current lifecycle scope. The remaining platform gap is full fleet reconciliation and read-model telemetry, not the basic node-agent binary update path.

Storage Boundary

The profile may define storage requirements, but it must not invent its own storage product.

For v1:

  1. scratch mount is allowed,
  2. project persistent storage is a declared dependency on the allocation storage model,
  3. arbitrary host path entry is not allowed in the user-facing launch form,
  4. Kubernetes PVC generation is blocked until the RKE2 storage/CSI decision is made.

Endpoint Boundary

Launchable OCI workloads commonly need browser or HTTP access.

For v1:

  1. every endpoint must be declared in the profile,
  2. direct public exposure must be disabled until infra defines the exposure boundary,
  3. platform-proxied exposure should integrate with the workload UI gateway model,
  4. endpoint credentials or tokens must be generated by the platform or app adapter, never pasted into raw YAML.

Security Boundary

The first slice must preserve these rules:

  1. deploy by digest, not mutable tag,
  2. use platform registry artifacts unless explicitly approved otherwise,
  3. use wrapped or short-lived pull credentials,
  4. do not expose raw registry credentials to the browser,
  5. do not allow arbitrary image names in the user-facing launcher,
  6. do not let a profile request host mounts or privileged containers unless the profile is platform-curated and explicitly approved.

UI Implications

The launch UI should be generated from:

  1. profile metadata,
  2. JSON Schema parameter schema,
  3. UI hints,
  4. environment capability checks,
  5. entitlement and placement checks.

The UI should show:

  1. image/profile name and version,
  2. support level,
  3. target allocation,
  4. workspace/storage mode,
  5. endpoint exposure mode,
  6. resource summary,
  7. cost/resource warnings for optional add-ons,
  8. generated endpoints after launch.

The UI should not show:

  1. arbitrary Docker Compose YAML,
  2. arbitrary Kubernetes YAML,
  3. raw registry credentials,
  4. raw Helm chart values beyond an advanced/debug view.

Open Questions

  1. Should node-agent lifecycle upgrades automatically deliver host prerequisites such as Docker and NVIDIA Container Toolkit to existing nodes, or should this remain partially bootstrap/remediation-owned until infra packages own full fleet prerequisite reconciliation?
  2. Should profile manifests live only inside app_versions.manifest, or also as OCI artifacts with annotations?
  3. How should profile versions map to image versions when a profile uses multiple images?
  4. What is the first approved endpoint exposure mode beyond private host-local access in platform-control?
  5. Which storage mode is available before project persistent storage ships?
  6. Should jupyterlab be allocation-scoped only, or also visible as a project workload under /workloads?
  7. Should user package installs remain ephemeral per container launch, or should the next version add derived per-project images and persistent package caches?
  8. What is the release process for publishing/promoting the curated CPU, NVIDIA/CUDA, and AMD/ROCm runtime image family?
  9. How should package-install UX represent versions, constraints, failure reporting, and eventual environment reuse?

Next Implementation Steps

Completed local baseline:

  1. Add a schema file for the workload profile manifest.
  2. Add seed/catalog metadata for an internal jupyterlab launchable profile.
  3. Expose a read-only manifest/profile preview through the existing app catalog version API and catalog UI.
  4. Add the bounded node-agent OCI task handlers for runtime status, launch, control, and remove.
  5. Add the feature-flagged app-runtime path that queues workload.oci_launch and reconciles terminal node task results back into app state.
  6. Add app-runtime lifecycle mapping for workload.oci_control and workload.oci_remove, including stop/start/restart/decommission reconcile.

Remaining implementation:

  1. Use workload.oci_runtime_status in a local smoke flow to decide whether the target node has an approved reachable runtime before enqueueing launch.
  2. Expand the automated local-kind and platform-control smoke validation:
     • scripts/ops/app_runtime_first_slice_smoke.sh now provides the first observe/deploy/access/decommission harness,
     • CI wiring and stable environment-specific placement/artifact inputs remain to be added.
  3. Package-install v1 design is captured in doc/product/Jupyter_Package_Install_v1.md; node-agent now reports a bounded pip install excerpt in task output and the workload Events tab surfaces it. Remaining follow-up is the derived image workflow design and release/channel policy for reusable project environments.
  4. Endpoint-access guidance in the workload Access tab covers private SSH tunnel access now and states that platform-proxy routing is not wired yet.
  5. Convert the Jupyter/vLLM scripts and manifests into a documented app-developer release flow covering image variants, schema validation, artifact promotion, smoke tests, and decommission cleanup.