App Control Plane v1 (Extensibility Baseline)

Goal

Enable product teams to add platform applications (for example model serving, inference, schedulers) without changing core allocation APIs per app.

Core principle:

  • Core platform exposes primitives.
  • App teams integrate through a consistent app control plane contract.

Companion baseline for scheduler apps:

  • doc/architecture/Scheduler_as_Platform_App_v1.md
  • doc/architecture/App_Runtime_Operating_Modes_v1.md
  • doc/architecture/Clustered_App_Model_v1.md
  • doc/architecture/App_Platform_Primitive_Boundary_v1.md
  • doc/architecture/Slurm_First_Slice_Platform_App_Split_v1.md
  • doc/architecture/App_Platform_OCI_Registry_Baseline_v1.md

Platform Invariants (Non-Negotiable)

These invariants must hold for every app-platform feature, including internal reference apps.

  1. Policy/IAM is first-class, not optional.
     • Every app action is tenant/project scoped and evaluated through the same role/policy path as core resources.
     • No internal-app bypass paths are allowed for authz.
     • Privileged app mutations must produce audit logs and canonical error envelopes with correlation_id.

  2. Artifact and runtime neutrality is mandatory.
     • The control plane contract stays runtime-agnostic (k8s|slurm|ray|bare_metal adapters behind one API model).
     • Artifact sources are policy-governed allowlists; no hardcoded single-vendor/source coupling in API semantics.
     • OCI registry integration and non-OCI artifact source workflows are foundation requirements and are not yet fully implemented in runtime.

  3. Lifecycle is event-driven and loosely coupled.
     • App instance lifecycle transitions must emit typed domain events (apps.instance.*).
     • Integrations consume contracts/events, not database internals.
     • Async state changes are observable by correlation and trace context across services.

  4. Internal reference apps must prove platform generality.
     • Any first-party app added to validate the platform must use the same public contracts and operator model as third-party teams.
     • If a feature only works for an internal app via special-case code, it is considered a platform defect.

Scope (v1)

  1. App catalog (platform-owned metadata).
  2. Project app entitlements (tenant/project scoping + policy overlays).
  3. App instances (request/deploy/run/fail/delete lifecycle).
  4. Async lifecycle events for operators and observability.

Out of scope for v1:

  • App runtime implementation details (k8s/slurm/ray operator internals).
  • UI workflows beyond existing shell.
  • Final pricing engine and runtime-specific metering implementation.

Ownership Model

  • App instance ownership: project.
  • Attribution: requested_by_user_id.
  • Tenant boundary: inferred from project -> org.
  • Service accounts remain project-scoped and are used by app operators.

Important distinction:

  1. App instance ownership remains project-scoped.
  2. The runtime control plane may be project, tenant, or platform scoped, depending on operating mode.

This is required so the platform can start with tenant-dedicated app backends and later introduce platform-managed services without changing ownership semantics.

Current limitation:

  • A tenant-scoped runtime control plane is not yet the same thing as a tenant-owned shared runtime contract.
  • Tenant-owned shared mode needs an explicit attached-project model instead of overloading one project-owned app instance to mean tenant ownership.
  • See:
     • doc/architecture/App_Tenant_Shared_Attachment_Model_v1.md
     • doc/architecture/App_Tenant_Shared_Runtime_API_Direction_v1.md

Operating Mode Baseline

See doc/architecture/App_Runtime_Operating_Modes_v1.md.

Initial direction:

  1. Production default is tenant_dedicated.
  2. Future shared offerings use platform_managed.
  3. Both modes use the same app catalog, entitlement, app instance, IAM, audit, and observability paths.

Billing Attribution Baseline

See doc/architecture/App_Runtime_Billing_Model_v1.md.

Baseline rules:

  1. Project remains the primary app-runtime billing anchor, even when the effective control plane is tenant or platform scoped.
  2. App-runtime billing must preserve:
     • org_id
     • project_id
     • app_instance_id
     • app_slug
     • operating_mode
     • control_plane_scope
     • runtime_backend
     • correlation_id
  3. tenant_dedicated + project is the clean default for dev/test/stage/prod style environment boundaries.
  4. Tenant-scoped shared control planes are supported, but any cross-project cost distribution must be explicit and policy-driven.
  5. Platform-managed shared-service costs must still reconcile back to tenant and project usage records.
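As an illustration of rule 2, the attribution fields could be carried in a single immutable record that refuses to be built with anything missing. This is a minimal sketch: the class name AppRuntimeUsageRecord, the helper billing_anchor, and the use of plain strings for IDs are assumptions for this example, not part of the billing model document.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AppRuntimeUsageRecord:
    """Hypothetical record holding the attribution fields billing must preserve."""

    org_id: str
    project_id: str
    app_instance_id: str
    app_slug: str
    operating_mode: str       # tenant_dedicated | platform_managed
    control_plane_scope: str  # project | tenant | platform
    runtime_backend: str
    correlation_id: str

    def __post_init__(self) -> None:
        # Reject records that would lose attribution: every field must be non-empty.
        missing = [f.name for f in fields(self) if not getattr(self, f.name)]
        if missing:
            raise ValueError(f"missing attribution fields: {missing}")


def billing_anchor(record: AppRuntimeUsageRecord) -> tuple:
    """Project stays the billing anchor regardless of control_plane_scope."""
    return (record.org_id, record.project_id)
```

Even for a platform-scoped control plane, billing_anchor still resolves to the owning org and project, which is what rule 5 relies on for reconciliation.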

API Surface (contract-first)

See doc/api/openapi.draft.yaml.

Added surfaces:

  1. GET /api/v1/apps/catalog
  2. GET /api/v1/apps/catalog/{app_slug}/versions
  3. POST /api/v1/admin/apps/catalog/{app_slug}/versions/{version}/publish
  4. POST /api/v1/admin/apps/catalog/{app_slug}/versions/{version}/deprecate
  5. GET /api/v1/projects/{project_id}/apps/entitlements
  6. PUT /api/v1/projects/{project_id}/apps/entitlements/{app_slug}
  7. GET /api/v1/projects/{project_id}/app-instances
  8. POST /api/v1/projects/{project_id}/app-instances
  9. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}
  10. DELETE /api/v1/projects/{project_id}/app-instances/{app_instance_id}
  11. GET /api/v1/apps/registry
  12. GET /api/v1/projects/{project_id}/app-artifacts
  13. POST /api/v1/projects/{project_id}/app-artifacts/publish-intents
  14. POST /api/v1/projects/{project_id}/app-artifacts
  15. POST /api/v1/projects/{project_id}/app-artifacts/{artifact_id}/promote
  16. POST /api/v1/projects/{project_id}/app-artifacts/{artifact_id}/deprecate
  17. POST /api/v1/projects/{project_id}/app-artifacts/{artifact_id}/retire
  18. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/members
  19. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/members/{member_id}
  20. POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/member-operations
  21. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/member-operations/{operation_id}

Behavioral intent:

  • Catalog is read-only for tenant users.
  • Entitlements are project-scoped controls (enable/disable + policy overrides).
  • Instance create/delete are async (202 Accepted); state transitions are tracked via events.
  • Member lifecycle requests are async (202 Accepted) and remain generic platform operation envelopes; runtime-specific implementation stays in the adapter.
  • Artifact upload bytes flow directly to the registry; the API owns publish intent, digest registration, lifecycle, and audit.
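The async create behavior can be sketched without any web framework: the handler records the instance as requested, queues a lifecycle event, and returns 202 immediately. The function name, the in-memory instances and events containers, and the response body shape are assumptions for illustration; the actual contract lives in doc/api/openapi.draft.yaml.

```python
import uuid


def create_app_instance(project_id: str, app_slug: str, instances: dict, events: list) -> tuple:
    """Accept a create request: persist 'requested', queue an event, return 202."""
    app_instance_id = str(uuid.uuid4())
    correlation_id = str(uuid.uuid4())
    instances[app_instance_id] = {
        "project_id": project_id,
        "app_slug": app_slug,
        "status": "requested",
    }
    # The operator reacts to this event asynchronously; the API does not wait
    # for the deployment to finish before responding.
    events.append({
        "event_type": "apps.instance.requested",
        "correlation_id": correlation_id,
        "payload": {"project_id": project_id, "app_instance_id": app_instance_id},
    })
    body = {
        "app_instance_id": app_instance_id,
        "status": "requested",
        "correlation_id": correlation_id,
    }
    return 202, body
```

A caller would then follow progress through GET on the instance or through the apps.instance.* events, matching on correlation_id.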

Event Surface

See doc/api/asyncapi.draft.yaml.

Added lifecycle events:

  1. apps.entitlement.updated
  2. apps.instance.requested
  3. apps.instance.running
  4. apps.instance.failed
  5. apps.instance.deleting
  6. apps.instance.deleted
  7. apps.artifact.registered
  8. apps.artifact.promoted
  9. apps.artifact.deprecated
  10. apps.artifact.retired

Envelope remains standard:

  • event_id
  • event_type
  • occurred_at
  • version
  • correlation_id
  • payload
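A small builder makes the envelope concrete. This is a sketch under assumptions: the helper name make_event, the string type of version, UUID event IDs, and ISO 8601 UTC timestamps are illustrative choices; doc/api/asyncapi.draft.yaml remains the authority on field types.

```python
import uuid
from datetime import datetime, timezone

ENVELOPE_FIELDS = ("event_id", "event_type", "occurred_at", "version", "correlation_id", "payload")

# Event types listed in this section.
KNOWN_EVENT_TYPES = frozenset({
    "apps.entitlement.updated",
    "apps.instance.requested",
    "apps.instance.running",
    "apps.instance.failed",
    "apps.instance.deleting",
    "apps.instance.deleted",
    "apps.artifact.registered",
    "apps.artifact.promoted",
    "apps.artifact.deprecated",
    "apps.artifact.retired",
})


def make_event(event_type: str, correlation_id: str, payload: dict, version: str = "1") -> dict:
    """Wrap a payload in the standard envelope; unknown event types are rejected."""
    if event_type not in KNOWN_EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "version": version,
        "correlation_id": correlation_id,
        "payload": payload,
    }
```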

Security and Isolation

  1. Project context (X-Project-ID or project path) is authoritative for all project-owned app operations.
  2. Cross-project/cross-tenant operations are denied by default.
  3. App operators authenticate via project-scoped service accounts only.
  4. No raw command execution surface is exposed via app APIs.
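Rules 1 to 3 can be sketched as two small checks. The function names and the exception type are hypothetical, and rejecting a request whose X-Project-ID header disagrees with the project path is an assumption of this sketch; the document only states that project context is authoritative.

```python
from typing import Optional


class ProjectContextError(Exception):
    """Raised when project context is missing or ambiguous (hypothetical type)."""


def resolve_project_id(path_project_id: Optional[str], header_project_id: Optional[str]) -> str:
    """Return the authoritative project id from the path or the X-Project-ID header."""
    if path_project_id and header_project_id and path_project_id != header_project_id:
        # Assumption: conflicting sources are rejected rather than silently resolved.
        raise ProjectContextError("X-Project-ID does not match project path")
    project_id = path_project_id or header_project_id
    if not project_id:
        raise ProjectContextError("project context is required")
    return project_id


def operator_may_act(service_account_project_id: str, target_project_id: str) -> bool:
    """Deny by default: a service account may only act inside its own project."""
    return service_account_project_id == target_project_id
```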

RBAC Action Matrix (v1 Baseline)

Project-scoped app actions

| Action | platform_superadmin | platform_ops | tenant_owner | tenant_admin | project_owner | project_admin | project_member | project_viewer | service_account |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| apps.catalog.read | allow | allow | allow | allow | allow | allow | allow | allow | deny |
| apps.versions.read | allow | allow | allow | allow | allow | allow | allow | allow | deny |
| apps.entitlement.read | allow | allow | allow | allow | allow | allow | deny | deny | deny |
| apps.entitlement.write | allow | allow | allow | allow | allow | allow | deny | deny | deny |
| apps.instance.read | allow | allow | allow | allow | allow | allow | allow | allow | allow (same project only) |
| apps.instance.create | allow | allow | allow | allow | allow | allow | allow | deny | allow (same project only) |
| apps.instance.delete | allow | allow | allow | allow | allow | allow | deny | deny | deny |
| apps.instance.member.read | allow | allow | allow | allow | allow | allow | allow | allow | allow (same project only) |
| apps.instance.member.operate | allow | allow | allow | allow | allow | allow | deny | deny | allow (same project only, explicit allowlist only) |

Rules:

  1. platform_superadmin and platform_ops are break-glass/platform operations roles; they bypass tenant-level membership checks on explicit admin endpoints only.
  2. service_account permissions are constrained to same-project resources and an explicitly allowlisted endpoint set.
  3. tenant_owner and tenant_admin can manage project entitlements and app instances inside their tenant.
  4. Project-scoped role evaluation follows the project -> tenant -> platform decision chain from the role lifecycle baseline.
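The matrix above can be encoded as a deny-by-default lookup. This sketch only mirrors the table cells: the function name is_allowed and its same_project and allowlisted parameters are assumptions, and it deliberately leaves out the tenant membership checks and the project -> tenant -> platform decision chain described in the rules.

```python
ROLES = (
    "platform_superadmin", "platform_ops", "tenant_owner", "tenant_admin",
    "project_owner", "project_admin", "project_member", "project_viewer",
    "service_account",
)

_ADMINS = frozenset(ROLES[:6])             # first six columns of the matrix
_MEMBERS = _ADMINS | {"project_member"}
_READERS = _MEMBERS | {"project_viewer"}

ALLOWED = {
    "apps.catalog.read": _READERS,
    "apps.versions.read": _READERS,
    "apps.entitlement.read": _ADMINS,
    "apps.entitlement.write": _ADMINS,
    "apps.instance.read": _READERS | {"service_account"},
    "apps.instance.create": _MEMBERS | {"service_account"},
    "apps.instance.delete": _ADMINS,
    "apps.instance.member.read": _READERS | {"service_account"},
    "apps.instance.member.operate": _ADMINS | {"service_account"},
}


def is_allowed(role: str, action: str, same_project: bool = True, allowlisted: bool = False) -> bool:
    """Evaluate one matrix cell; unknown actions and unlisted roles are denied."""
    if role not in ALLOWED.get(action, frozenset()):
        return False
    if role == "service_account":
        if not same_project:
            return False  # "same project only"
        if action == "apps.instance.member.operate" and not allowlisted:
            return False  # "explicit allowlist only"
    return True
```

Keeping the matrix as data rather than branching code makes it straightforward to diff against this document when either one changes.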

Policy Overlay Direction

Overlay resolution order (future implementation):

  1. Global defaults.
  2. Tenant overrides.
  3. Project app entitlement overrides.

Most-specific scope wins. Global hard-deny remains non-overridable.

Overlay schema direction (v1)

project_app_entitlements.policy_overrides supports:

  1. allowed_regions: array of region codes.
  2. allowed_skus: array of catalog sku codes.
  3. max_instances_per_project: integer.
  4. max_gpus_per_instance: integer.
  5. artifact_source_allowlist: array of host patterns.
  6. allowed_operating_modes: array of tenant_dedicated | platform_managed.
  7. allowed_control_plane_scopes: array of project | tenant | platform.

Restrictions:

  1. Project override can only narrow scope versus tenant/global policy.
  2. Project override cannot enable an app disabled by tenant/global hard-deny.
  3. Conflicts resolve by most-specific, then most-restrictive.
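One possible reading of the resolution order and the narrowing restriction is sketched below. It makes several assumptions that the document does not settle: tenant overrides replace global defaults key by key, a key absent from the parent policy means "unrestricted", host patterns in artifact_source_allowlist are compared as exact strings, and global hard-deny is not modeled. The function names are hypothetical.

```python
LIST_KEYS = (
    "allowed_regions", "allowed_skus", "artifact_source_allowlist",
    "allowed_operating_modes", "allowed_control_plane_scopes",
)
INT_KEYS = ("max_instances_per_project", "max_gpus_per_instance")


def narrow(parent: dict, override: dict) -> dict:
    """Apply a project override on top of parent policy; the result is never wider."""
    result = dict(parent)
    for key in LIST_KEYS:
        if key in override:
            if key in parent:
                # Keep only values the parent already allows, in parent order.
                result[key] = [v for v in parent[key] if v in override[key]]
            else:
                result[key] = list(override[key])
    for key in INT_KEYS:
        if key in override:
            # Most-restrictive wins: an override cannot raise a parent limit.
            result[key] = min(parent.get(key, override[key]), override[key])
    return result


def effective_policy(global_defaults: dict, tenant_overrides: dict, project_overrides: dict) -> dict:
    """Resolve global -> tenant -> project; the project layer can only narrow."""
    tenant_policy = {**global_defaults, **tenant_overrides}  # most-specific scope wins
    return narrow(tenant_policy, project_overrides)
```

For example, a project that lists a region the tenant does not allow simply loses that region, and a project asking for a higher max_gpus_per_instance than its parent keeps the parent's limit.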

Observability Requirements

Every app-instance mutation and event should carry:

  • correlation_id
  • org_id
  • project_id
  • app_slug
  • app_instance_id

Target triage path:

  • API error envelope -> correlation id
  • Loki lookup by correlation id
  • Tempo trace by trace_id
  • App lifecycle event timeline from async stream
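To make a lookup by correlation id practical, log lines for app-instance mutations can carry the required fields as structured keys. The sketch below uses Python's standard logging module; the formatter class, the helper instance_logger, and the JSON line layout are assumptions for illustration, not an existing platform logging contract.

```python
import json
import logging

REQUIRED_FIELDS = ("correlation_id", "org_id", "project_id", "app_slug", "app_instance_id")


class JsonContextFormatter(logging.Formatter):
    """Render each record as one JSON line that includes the required fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"message": record.getMessage(), "level": record.levelname}
        for name in REQUIRED_FIELDS:
            entry[name] = getattr(record, name, None)
        return json.dumps(entry, sort_keys=True)


def instance_logger(base: logging.Logger, **context: str) -> logging.LoggerAdapter:
    """Bind the required context to a logger; refuse to log without it."""
    missing = [name for name in REQUIRED_FIELDS if not context.get(name)]
    if missing:
        raise ValueError(f"missing observability fields: {missing}")
    # LoggerAdapter attaches `context` to every record as extra attributes.
    return logging.LoggerAdapter(base, context)
```

A handler configured with JsonContextFormatter then emits lines that a log store can index by correlation_id.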

Follow-ups (next iterations)

  1. Runtime-specific metering implementation and usage-record to ledger pipeline.
  2. Operator onboarding guide (reference implementation for one app, e.g. model serving).
  3. Admin catalog version disable endpoint and retirement workflow.
  4. Registry credential delivery through Vault-backed publish/pull secret paths.
  5. Generic clustered-app topology and component-role contract for multi-member example apps.
  6. Tenant-shared runtime ownership and attachment contract for apps that support tenant-owned shared mode.

DB Schema Proposal (v1 Draft)

SQL companion: doc/architecture/db_schema_app_control_plane_phase1_draft.sql

Tables

  1. app_catalog
     • id uuid pk
     • slug text unique not null
     • display_name text not null
     • category text not null
     • publisher text not null
     • status text not null check (status in ('active','deprecated','disabled'))
     • created_at timestamptz not null default now()
     • updated_at timestamptz not null default now()

  2. app_versions
     • id uuid pk
     • app_id uuid not null references app_catalog(id) on delete cascade
     • version text not null
     • runtime_backend text not null check (runtime_backend in ('k8s','rke2','slurm','ray','bare_metal'))
     • manifest jsonb not null
     • status text not null check (status in ('active','deprecated','disabled'))
     • created_at timestamptz not null default now()
     • unique (app_id, version)

  3. project_app_entitlements
     • id uuid pk
     • org_id uuid not null references organizations(id) on delete cascade
     • project_id uuid not null references projects(id) on delete cascade
     • app_id uuid not null references app_catalog(id) on delete cascade
     • enabled boolean not null default true
     • policy_overrides jsonb not null default '{}'::jsonb
     • updated_by_user_id uuid null references users(id)
     • created_at timestamptz not null default now()
     • updated_at timestamptz not null default now()
     • unique (project_id, app_id)

  4. app_instances
     • id uuid pk
     • resource_name text not null unique
     • org_id uuid not null references organizations(id) on delete cascade
     • project_id uuid not null references projects(id) on delete cascade
     • app_id uuid not null references app_catalog(id) on delete restrict
     • app_version_id uuid not null references app_versions(id) on delete restrict
     • display_name text not null
     • operating_mode text not null default 'tenant_dedicated' check (operating_mode in ('tenant_dedicated','platform_managed'))
     • control_plane_scope text not null default 'project' check (control_plane_scope in ('project','tenant','platform'))
     • tenant_boundary_mode text not null default 'tenant_isolated' check (tenant_boundary_mode in ('tenant_isolated','shared_service'))
     • status text not null check (status in ('requested','deploying','running','failed','deleting','deleted'))
     • requested_by_user_id uuid not null references users(id)
     • operator_service_account_id uuid null references service_accounts(id)
     • config jsonb not null default '{}'::jsonb
     • runtime_state jsonb not null default '{}'::jsonb
     • failure_reason text null
     • created_at timestamptz not null default now()
     • updated_at timestamptz not null default now()
     • deleted_at timestamptz null
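The app_instances.status check constraint lists the permitted values but not the permitted moves between them. The sketch below shows one way a service could guard transitions; the transition table itself is an assumption made for illustration and is not specified anywhere in this document.

```python
# Status values come from the app_instances.status check constraint.
# The allowed moves between them are ASSUMED for this sketch.
ALLOWED_TRANSITIONS = {
    "requested": {"deploying", "failed", "deleting"},
    "deploying": {"running", "failed", "deleting"},
    "running": {"failed", "deleting"},
    "failed": {"deleting"},
    "deleting": {"deleted", "failed"},
    "deleted": set(),  # terminal
}


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not in the table."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown status: {current}")
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```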

Required indexes

  1. ix_app_instances_project_status_created on (project_id, status, created_at desc)
  2. ix_app_instances_org_status on (org_id, status)
  3. ix_project_app_entitlements_project on (project_id)
  4. ix_app_versions_app_status on (app_id, status, created_at desc)

Required integrity constraints

  1. app_instances.project_id must belong to app_instances.org_id.
  2. project_app_entitlements.project_id must belong to project_app_entitlements.org_id.
  3. app_instances.operator_service_account_id (if set) must belong to same project_id and org_id.
  4. app_instances.app_version_id must reference same app_id as app_instances.app_id.
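Until these constraints are enforced in the database, an application-level check along the following lines could catch violations of constraints 1, 3, and 4 before a write. This is a sketch: rows are plain dicts, the function name is hypothetical, and the presence of org_id on projects and of project_id and org_id on service_accounts is assumed, since those tables are not defined in this document.

```python
from typing import Optional


def integrity_violations(instance: dict, project: dict, app_version: dict,
                         service_account: Optional[dict] = None) -> list:
    """Return the constraint violations for an app_instances row."""
    violations = []
    # Constraint 1: the project must be the referenced one and sit in the same org.
    if project["id"] != instance["project_id"] or project["org_id"] != instance["org_id"]:
        violations.append("project_id must belong to org_id")
    # Constraint 3: an operator service account, if set, must match project and org.
    if service_account is not None and (
        service_account["project_id"] != instance["project_id"]
        or service_account["org_id"] != instance["org_id"]
    ):
        violations.append("operator_service_account_id must belong to same project_id and org_id")
    # Constraint 4: the pinned version must belong to the referenced app.
    if app_version["app_id"] != instance["app_id"]:
        violations.append("app_version_id must reference same app_id")
    return violations
```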

Migration/cutover strategy

  1. Add tables and indexes without touching existing allocation/storage paths.
  2. Launch catalog read APIs first.
  3. Gate entitlement and instance mutations behind feature flag.
  4. Introduce one reference operator integration before broad app onboarding.
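Steps 2 and 3 of the cutover can be sketched as a request gate that always admits reads and admits entitlement and instance mutations only when a flag is on. The flag name app_control_plane_mutations, the function name, and the path-matching approach are hypothetical. Other mutating surfaces (admin catalog, artifacts) are left ungated here because the strategy above does not mention them.

```python
MUTATION_FLAG = "app_control_plane_mutations"  # hypothetical flag name
GATED_SEGMENTS = ("/apps/entitlements", "/app-instances")


def is_request_enabled(method: str, path: str, flags: dict) -> bool:
    """Reads are always on; entitlement and instance mutations need the flag."""
    if method.upper() == "GET":
        return True
    if any(segment in path for segment in GATED_SEGMENTS):
        return bool(flags.get(MUTATION_FLAG, False))
    return True
```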