App Platform Gap Tracker v1

As of: April 14, 2026

Purpose

Turn the app-platform builder and quickstart docs into explicit platform gaps that can be implemented in sequence.

This is the gap tracker for: 1. external app teams, 2. internal reference apps, 3. agents or automation building against the App Control Plane.

Current Baseline

What exists now:

  1. app catalog and version metadata
  2. project entitlements
  3. project-scoped app instances
  4. operator service accounts
  5. app lifecycle events
  6. billing usage-record primitives for app_runtime
  7. operating-mode and control-plane-scope metadata in the instance contract
  8. live OCI registry on platform_control
  9. Vault-backed wrapped publish credentials for OCI publish intents
  10. project-storage-backed blob upload intents and registration verification
  11. launchable OCI workload profile schema and manifest-backed deploy forms
  12. JupyterLab launchable OCI profile with CPU, NVIDIA H200, and AMD ROCm image families
  13. configurable launch-time GPU count, CPU cores, memory GiB, host port, and first-slice Jupyter pip package installs
  14. node-agent OCI container launch/control/remove task family
  15. curated Docker Compose launch/control/remove task family
  16. vllm-openai single-node Compose profile with manifest-owned renderer and topology output
  17. node-agent bootstrap prerequisites for Docker, Docker Compose, and NVIDIA Container Toolkit
  18. node-agent lifecycle self-update for agent binary delivery

IAM/operator baseline clarification

The v1 IAM baseline for app-platform operator use is now: 1. project-scoped service accounts can authenticate as first-class callers for app-instance create, read, upgrade, rollback, decommission, and runtime-secret issuance, 2. app-instance contracts, lifecycle events, and audit rows preserve whether the requester was a user or a service account, 3. project entitlement mutation remains a human tenant-admin/platform-admin workflow, 4. service-account administration remains a human project-admin workflow.

What remains intentionally out of scope for this baseline: 1. service-account-driven entitlement changes, 2. service-account self-management flows, 3. broad IAM information-architecture/UX completion beyond the backend contract and authz path.

Implementation status: 1. entitlement read/write endpoints reject service-account actors and stay human/admin-only, 2. app-instance create, read, lifecycle, member operation, and runtime-secret paths preserve user vs service-account requester identity, 3. app-artifact publish intent, registration, and promotion paths preserve service-account requester identity in audit rows, 4. app-artifact verify, revoke, deprecate, and retire remain human/admin control-plane actions.
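The requester-identity rules above can be sketched as a small guard layer. This is a hedged illustration only: the type and field names (`Requester`, `is_service_account`, `subject_kind`) are assumptions for the sketch, not the real control-plane contract.

```python
from dataclasses import dataclass


# Illustrative requester model; field names are assumptions, not the real contract.
@dataclass(frozen=True)
class Requester:
    subject_id: str
    is_service_account: bool


class EntitlementPolicyError(PermissionError):
    pass


def authorize_entitlement_mutation(requester: Requester) -> None:
    """Entitlement read/write stays human/admin-only: reject service accounts."""
    if requester.is_service_account:
        raise EntitlementPolicyError(
            "service accounts may not mutate project entitlements"
        )


def audit_row(requester: Requester, action: str) -> dict:
    """Preserve user vs service-account identity on every audit row."""
    return {
        "action": action,
        "subject_id": requester.subject_id,
        "subject_kind": "service_account" if requester.is_service_account else "user",
    }
```

The point of the sketch is the asymmetry: service accounts are first-class callers on app-instance and artifact paths, but the entitlement path rejects them outright while audit rows record which kind of principal acted.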

What does not exist yet as a full platform capability:

  1. tenant/project private artifact tenancy
  2. productized runtime operator stacks
  3. final managed-service operating mode rollout
  4. generalized secret delivery for app/operator runtime consumption
  5. generic app-team manifest submission, review, and release workflow
  6. arbitrary user-authored Compose YAML as a safe product surface
  7. multi-node/distributed Compose topology
  8. platform proxy exposure for launchable OCI endpoints
  9. persistent storage and derived reusable package environments
  10. full fleet lifecycle reconciliation and telemetry hardening for host prerequisites and agent read models

Gap List

GAP-001: Platform OCI Registry Baseline

Problem: App teams need one platform-owned place to publish and pull runtime artifacts. The control plane and artifact model now exist, and platform_control now runs a first-class dev OCI registry endpoint. The remaining gap is turning that into a credentialed, policy-enforced platform capability rather than just a reachable service.

Needed: 1. platform registry endpoint and ownership model, 2. digest-pinned publish/pull contract, 3. project or tenant extension model for private repos/namespaces, 4. credential lifecycle for publish and pull, 5. lifecycle for publish, promote, deprecate, retire.
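The digest-pinned contract in item 2 can be sketched as a validation rule: deploy references must carry a sha256 digest, never a mutable tag. The regex below is a deliberate simplification of the OCI reference grammar for illustration, not the platform's real validator.

```python
import re

# Simplified OCI reference check: repository path followed by "@sha256:<64 hex>".
# This is narrower than the full OCI distribution grammar (no ports, no tags).
_DIGEST_REF = re.compile(
    r"^(?P<repo>[a-z0-9]+(?:[._\-/][a-z0-9]+)*)@sha256:(?P<digest>[a-f0-9]{64})$"
)


def require_digest_pinned(ref: str) -> str:
    """Accept only digest-pinned references; return the digest on success."""
    m = _DIGEST_REF.match(ref)
    if not m:
        raise ValueError(f"deploy ref must be digest-pinned: {ref!r}")
    return m.group("digest")
```

A tag-based reference like `apps/jupyter:latest` fails this check by design: tags stay useful for discovery, but anything that reaches a deploy path must already be resolved to a digest.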

Why this matters: Without this, app teams will build hidden artifact flows outside the platform contract.

Current state after this implementation step: 1. ownership and lifecycle contract is defined, 2. public API and artifact persistence are implemented, 3. platform_control now exposes a live OCI registry endpoint, 4. OCI publish intents now return Vault-wrapped publisher credentials, 5. trust-state, promotion enforcement, and signature/provenance verification are implemented in the control plane, 6. generalized pull credential delivery remains follow-on work, 7. the control plane now exposes a platform release catalog and download API for developer-facing release artifacts, 8. the developer downloads UI still needs to consume that API instead of placeholder install snippets.

This gap should be closed with a layered model rather than a single endpoint shape.

Foundational primitive: 1. generalized pull-intent issuance for registry-backed pulls, 2. Vault-wrapped short-lived pull credentials, 3. canonical repository/ref/digest metadata in the response.

Why: 1. app artifacts can be large and should keep registry-native distribution, 2. CI and automation should be able to use ORAS or equivalent clients directly, 3. the control plane should own policy, audit, and credential delivery, not raw blob transport for all artifact classes.
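The pull-intent primitive above could take roughly the following response shape. Everything here is an assumption for illustration: in the real flow the wrap token would come from Vault response wrapping rather than local generation, and the field names are not the actual API.

```python
import secrets
import time
from dataclasses import dataclass


# Hypothetical pull-intent response shape; field names are illustrative.
@dataclass(frozen=True)
class PullIntent:
    repository: str   # canonical repository
    reference: str    # tag or digest as requested by the caller
    digest: str       # resolved sha256 digest the client must verify against
    wrap_token: str   # single-use token the client unwraps against Vault
    expires_at: float # TTL boundary for the wrapped credential


def issue_pull_intent(repository: str, reference: str, digest: str,
                      ttl_seconds: int = 300) -> PullIntent:
    """Sketch: resolve the ref, mint a short-lived wrapped pull credential."""
    return PullIntent(
        repository=repository,
        reference=reference,
        digest=digest,
        wrap_token=secrets.token_urlsafe(24),  # stand-in for a Vault wrap token
        expires_at=time.time() + ttl_seconds,
    )
```

The design choice this illustrates: the control plane returns canonical repository/ref/digest metadata plus a short-lived credential, and ORAS or any registry-native client does the actual blob transport.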

Convenience surface:

  1. a control-plane download proxy is still appropriate for small platform release artifacts such as the CLI, the Go SDK, and the Python SDK
  2. this should be a convenience layer on top of the same release metadata/pull ownership model, not a separate artifact source of truth

Current backend state:

  1. platform release catalog/discovery endpoints are implemented
  2. platform release download endpoints for developer-facing small artifacts are implemented
  3. generalized pull-intent endpoints remain follow-on for platform release artifacts and project app artifacts

GAP-002: Vault/KMS-Backed Secrets and Key Custody

Problem: App teams and platform operators need one controlled place for: 1. registry robot credentials, 2. signing keys, 3. future app/operator secrets, 4. service-account private material hardening, 5. node task-signing private keys, 6. bootstrap CA trust distribution material.

Needed: 1. Vault-first secret custody baseline, 2. clear boundary between CI secrets, workload runtime secrets, and operator credentials, 3. secret injection model that does not depend on ad hoc env files, 4. automated signer key lifecycle (generate, store, rotate, verifier distribution), 5. automated bootstrap trust delivery so node agents do not rely on manual CA copy, 6. path to KMS-backed evolution later if required.
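The boundary in item 2 can be sketched as a path policy: each secret class maps to a distinct mount prefix, and a caller role may only read its own classes. Mount prefixes and role names below are illustrative assumptions, not the deployed Vault layout.

```python
# Hypothetical mapping of secret classes to Vault path prefixes.
SECRET_CLASS_PREFIX = {
    "ci": "secret/ci/",
    "workload_runtime": "secret/runtime/",
    "operator": "secret/operator/",
    "node_bootstrap": "secret/node-bootstrap/",
}

# Hypothetical caller roles and the classes each may read.
ROLE_ALLOWED_CLASSES = {
    "ci_pipeline": {"ci"},
    "app_runtime": {"workload_runtime"},
    "platform_operator": {"operator", "node_bootstrap"},
}


def authorize_secret_read(role: str, path: str) -> bool:
    """A role may read a path only if it falls under one of its own classes."""
    allowed = ROLE_ALLOWED_CLASSES.get(role, set())
    return any(path.startswith(SECRET_CLASS_PREFIX[c]) for c in allowed)
```

The value of the split is that a CI robot credential leaking can never become a runtime-secret read, because the path policy, not the consumer's discipline, enforces the boundary.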

Why this matters: Registry, signing, and app runtime secrets should not become host-local sprawl.

Current state after this baseline task: 1. Vault-first custody model is now defined, 2. platform_control now runs a live Vault baseline in gpuaas-infra, 3. OCI publish intents use Vault-wrapped credential delivery, 4. CI, runtime, operator, and node bootstrap secret classes are separated, 5. broader workload secret delivery beyond artifact pull, signer custody, and rotation remain follow-on work.

GAP-003: DNS, TLS, and Control-Plane Endpoint Model

Problem: Environment creation must not depend on hardcoded node IPs or one-off hostnames.

Needed: 1. stable environment DNS model, 2. cert-manager or equivalent TLS automation for k8s-first control-plane surfaces, 3. stable public control-plane endpoint, 4. ingress/load-balancer model that survives adding platform_control-2 later.

Why this matters: Environment automation becomes fragile if every new environment requires hostname and cert rewrites.

Current state after this baseline task: 1. stable endpoint and DNS/TLS model is now defined, 2. cert-manager is the preferred k8s-native direction, 3. concrete provider/LB implementation remains follow-on work.

GAP-004: Artifact Trust and Promotion

Problem: Even with a registry, app artifacts need trust and promotion rules.

Needed: 1. signing model, 2. digest-only deployment rules, 3. promotion flow between environments, 4. verification policy in runtime/operator paths.

Why this matters: Multiple app teams publishing artifacts without a trust model creates an unreviewable supply chain.

Current state after this baseline task: 1. trust state and promotion-channel model is now defined, 2. digest-only deployment and explicit promotion are enforced by the control plane, 3. registry manifest existence is checked against the live registry before verification, 4. signature and provenance are verified against live registry manifests and persisted on the artifact record, 5. generalized pull/runtime credential delivery remains follow-on work.
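The enforced rules above (digest-only deployment plus explicit promotion) can be sketched as a single guard. Channel names and the trust-state enum are illustrative assumptions, not the control plane's exact values.

```python
# Illustrative promotion channels in order; real channel names may differ.
CHANNELS = ["dev", "staging", "prod"]


def authorize_promotion(trust_state: str, from_channel: str, to_channel: str) -> None:
    """Promote only verified artifacts, and only one channel step at a time."""
    if trust_state != "verified":
        raise PermissionError(
            f"artifact trust state {trust_state!r} blocks promotion"
        )
    try:
        i, j = CHANNELS.index(from_channel), CHANNELS.index(to_channel)
    except ValueError:
        raise ValueError("unknown promotion channel")
    if j != i + 1:
        raise PermissionError(
            f"promotion must be explicit and stepwise: {from_channel} -> {to_channel}"
        )
```

Stepwise promotion is the property that keeps the supply chain reviewable: an artifact cannot appear in prod without having passed through, and been verified in, every intermediate channel.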

GAP-005: Non-OCI Artifact Lifecycle

Problem: The node/operations model already reserves artifact.pull_blob, but app-platform teams do not yet have an end-to-end supported path for model weights, tarballs, or large private blobs.

Needed: 1. canonical blob-source contract, 2. resumable transfer semantics, 3. digest verification rules, 4. allowlist and policy enforcement, 5. credential delivery model for those sources.
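The digest-verification rule in item 3 pairs naturally with resumable, chunked transfer: hash the blob chunk by chunk as it arrives and compare against the declared digest before registration succeeds. A minimal sketch, with an illustrative function name:

```python
import hashlib


def verify_blob_digest(chunks, declared_sha256: str) -> bool:
    """Incrementally hash transferred chunks and compare to the declared digest.

    Works with any iterable of byte chunks, so the same check applies whether
    the transfer completed in one pass or was resumed across attempts.
    """
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest() == declared_sha256
```

Because the hash is incremental, a resumed transfer only needs to replay (or re-hash from storage) bytes already received; the final comparison is the same either way.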

Why this matters: OCI is not enough for many ML/runtime use cases.

Current state after this baseline task: 1. non-OCI artifacts now share the same app-artifact control-plane model, 2. project-storage-backed blob upload intents are implemented, 3. registration and verification check that the referenced project artifact object exists, 4. signed delivery/runtime transfer implementation remains follow-on work.

GAP-006: Productized Runtime Adapters

Problem: The control-plane contract exists, and the first launchable OCI and curated Compose slices now work end to end. The remaining gap is converting those reference slices into a repeatable platformization framework for more app teams.

Needed: 1. product-grade adapter for at least one reference class, 2. upgrade/rollback/decommission behavior proven, 3. app-team operator guidance tied to a real runtime backend, 4. observability and billing evidence attached to that adapter.

Why this matters: The platform is only proven when at least one real app class uses it end to end.

Current state after this implementation step: 1. JupyterLab proves the single-container launchable OCI adapter through local kind and platform-control. 2. vLLM OpenAI proves the first single-node Docker Compose adapter on platform-control with an H200-backed Mistral Small smoke. 3. The app-runtime and node-agent boundary uses typed task families rather than browser-submitted YAML. 4. The UI can show manifest-derived resources, endpoints, access instructions, events, and Compose topology. 5. The next gap is packaging this as a documented app-developer release flow: image variants, manifest schema, artifact promotion, smoke tests, and decommission verification.

GAP-007: Launchable OCI Developer Release Flow

Problem: The platform can launch curated OCI and Compose-backed apps, but new app teams still need a productized handoff for contributing image variants, manifests, artifacts, promotion evidence, and smoke tests.

Needed: 1. app developer checklist for image variants and supported SKUs, 2. manifest review and schema validation workflow, 3. artifact publish/promote scripts that are environment-parameterized, 4. required smoke-test contract for local kind and platform-control, 5. release notes and rollback/decommission verification expectations, 6. clear ownership for private model/package credentials.
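The manifest review and schema validation step (item 2) could start as a simple gate before a manifest enters the release flow. The required fields and SKU identifiers below are assumptions for illustration; the real schema lives in the launchable OCI workload profile contract.

```python
# Hypothetical required manifest fields and supported SKU identifiers.
REQUIRED_FIELDS = {"app_id", "version", "image_digest", "supported_skus", "smoke_tests"}
KNOWN_SKUS = {"cpu", "nvidia-h200", "amd-rocm"}


def validate_manifest(manifest: dict) -> list:
    """Return a list of review errors; an empty list means the gate passes."""
    errors = []
    missing = REQUIRED_FIELDS - manifest.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    for sku in manifest.get("supported_skus", []):
        if sku not in KNOWN_SKUS:
            errors.append(f"unknown SKU: {sku}")
    if not manifest.get("smoke_tests"):
        errors.append("at least one smoke test is required")
    return errors
```

Making smoke tests a required manifest field, rather than a convention, is what turns the checklist into an enforceable contract for new app teams.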

Why this matters: Without this, every new app repeats the Jupyter/vLLM integration work manually.

GAP-008: Compose Topology Generalization

Problem: The first Compose path proves a single vllm service through a curated renderer. More complex apps need the same platform shape without allowing arbitrary YAML or unsafe runtime knobs.

Needed: 1. explicit manifest fields for service dependencies and health checks, 2. service-level endpoint and access output, 3. sidecar classification and allowed capability model, 4. lifecycle semantics for partial service failure, 5. decommission guarantees for networks, volumes, scratch paths, and stale Compose projects, 6. UI topology rendering that stays manifest-derived.
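Explicit dependency fields (item 1) make service start order computable instead of implied. The sketch below assumes a manifest shape mapping each service to the services it depends on, and rejects cycles; the field shape is an assumption about the generalized manifest, not its final schema.

```python
from graphlib import CycleError, TopologicalSorter


def start_order(services: dict) -> list:
    """Compute a service start order from explicit depends_on edges.

    `services` maps service name -> list of dependency names. Dependencies
    appear before their dependents; a cycle is a manifest validation error.
    """
    ts = TopologicalSorter({name: set(deps) for name, deps in services.items()})
    try:
        return list(ts.static_order())
    except CycleError as e:
        raise ValueError(f"dependency cycle in service graph: {e.args[1]}")
```

The same graph drives the lifecycle semantics in item 4: partial failure of a service invalidates exactly its dependents, which are readable straight off the edges rather than inferred from Compose internals.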

Why this matters: This is the bridge from single-service inference to sidecar-rich app stacks.

GAP-009: Fleet Lifecycle and Read-Model Hardening

Problem: Node-agent lifecycle self-update and bootstrap prerequisite installation now exist, but the platform still needs full fleet reconciliation, so that rebootstrap is not the normal path for existing nodes and so that admin read models accurately report the active agent/runtime state.

Needed: 1. fleet-wide desired agent version and host prerequisite policy, 2. automatic drift detection for agent binary, Docker, Compose, NVIDIA runtime, and registry trust, 3. safe staged rollout and rollback, 4. read-model updates for agent_version, agent_connected_at, runtime capability, and lifecycle status, 5. stale node task/config cleanup checks after upgrade and decommission.
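Drift detection (item 2) reduces to diffing each node's reported read-model state against the fleet-wide desired policy. The read-model field names below are illustrative assumptions based on the fields listed above, not the actual schema.

```python
def detect_drift(desired: dict, nodes: list) -> dict:
    """Return {node_id: [drift issues]} for every node that deviates.

    `desired` carries the fleet-wide agent version and prerequisite list;
    each node dict carries its reported agent_version and prerequisites.
    """
    drift = {}
    for node in nodes:
        issues = []
        if node.get("agent_version") != desired["agent_version"]:
            issues.append("agent_version")
        for prereq in desired["prerequisites"]:
            if prereq not in node.get("prerequisites", set()):
                issues.append(f"missing:{prereq}")
        if issues:
            drift[node["node_id"]] = issues
    return drift
```

A staged rollout (item 3) would then operate on the drift set rather than the whole fleet, upgrading a bounded batch and re-running detection before widening.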

Why this matters: App deploy quality depends on node runtime readiness, and operators should not have to infer readiness from SSH-only checks.

Implementation Sequence

  1. platform OCI registry baseline and implementation
  2. Vault-backed platform secrets baseline and implementation
  3. DNS/TLS/control-plane endpoint automation
  4. artifact trust and promotion enforcement
  5. non-OCI blob backend implementation
  6. productized runtime adapter
  7. launchable OCI developer release flow
  8. Compose topology generalization
  9. fleet lifecycle and read-model hardening

Immediate Decision

Initial direction should be: 1. platform-owned OCI registry first, 2. later tenant/project private registry extension, 3. Vault-backed credentials, 4. cert-manager and stable DNS/load-balancer model for k8s control-plane environments.

Source Documents

  1. doc/architecture/Build_an_App_for_GPUaaS_v1.md
  2. doc/architecture/App_Platform_Quickstart_v1.md
  3. doc/architecture/App_Control_Plane_v1.md
  4. doc/architecture/Node_Operations_and_Agent_Lifecycle_v1.md
  5. doc/operations/Control_Plane_K8s_Migration_v1.md
  6. doc/architecture/Launchable_OCI_Workload_Profile_Contract_v1.md
  7. doc/product/VLLM_OpenAI_Compose_First_Slice_v1.md
  8. doc/product/Jupyter_Package_Install_v1.md