App Control Plane v1 (Extensibility Baseline)
Goal
Enable product teams to add platform applications (for example model serving, inference, schedulers) without changing core allocation APIs per app.
Core principle:
- Core platform exposes primitives.
- App teams integrate through a consistent app control plane contract.
Companion baseline for scheduler apps:
- doc/architecture/Scheduler_as_Platform_App_v1.md
- doc/architecture/App_Runtime_Operating_Modes_v1.md
- doc/architecture/Clustered_App_Model_v1.md
- doc/architecture/App_Platform_Primitive_Boundary_v1.md
- doc/architecture/Slurm_First_Slice_Platform_App_Split_v1.md
- doc/architecture/App_Platform_OCI_Registry_Baseline_v1.md
Platform Invariants (Non-Negotiable)
These invariants must hold for every app-platform feature, including internal reference apps.
- Policy/IAM is first-class, not optional.
  - Every app action is tenant/project scoped and evaluated through the same role/policy path as core resources.
  - No internal-app bypass paths are allowed for authz.
  - Privileged app mutations must produce audit logs and canonical error envelopes with correlation_id.
- Artifact and runtime neutrality is mandatory.
  - The control plane contract stays runtime-agnostic (k8s|slurm|ray|bare_metal adapters behind one API model).
  - Artifact sources are policy-governed allowlists; no hardcoded single-vendor/source coupling in API semantics.
  - OCI registry integration and non-OCI artifact source workflows are foundation requirements and are not yet fully implemented in runtime.
- Lifecycle is event-driven and loosely coupled.
  - App instance lifecycle transitions must emit typed domain events (apps.instance.*).
  - Integrations consume contracts/events, not database internals.
  - Async state changes are observable by correlation and trace context across services.
- Internal reference apps must prove platform generality.
  - Any first-party app added to validate the platform must use the same public contracts and operator model as third-party teams.
  - If a feature only works for an internal app via special-case code, it is considered a platform defect.
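The runtime-neutrality invariant can be illustrated with a minimal sketch. The RuntimeAdapter protocol, its deploy method, the EchoAdapter stand-in, and the ADAPTERS registry are hypothetical names used for illustration only; just the backend identifiers come from this document.

```python
from typing import Protocol


class RuntimeAdapter(Protocol):
    """Hypothetical adapter interface; the real contract is defined elsewhere."""

    def deploy(self, app_slug: str, config: dict) -> str: ...


class EchoAdapter:
    """Stand-in adapter that only reports which backend was asked to deploy."""

    def __init__(self, backend: str) -> None:
        self.backend = backend

    def deploy(self, app_slug: str, config: dict) -> str:
        return f"{self.backend}:{app_slug}"


# Backend identifiers taken from the invariant above.
ADAPTERS: dict[str, RuntimeAdapter] = {
    name: EchoAdapter(name) for name in ("k8s", "slurm", "ray", "bare_metal")
}


def deploy_instance(runtime_backend: str, app_slug: str, config: dict) -> str:
    """Dispatch through one API model; callers never branch on the backend."""
    try:
        adapter = ADAPTERS[runtime_backend]
    except KeyError:
        raise ValueError(f"unsupported runtime_backend: {runtime_backend}") from None
    return adapter.deploy(app_slug, config)
```

Under this shape, adding a backend means registering one more adapter rather than changing the calling code.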
Scope (v1)
- App catalog (platform-owned metadata).
- Project app entitlements (tenant/project scoping + policy overlays).
- App instances (request/deploy/run/fail/delete lifecycle).
- Async lifecycle events for operators and observability.
Out of scope for v1:
- App runtime implementation details (k8s/slurm/ray operator internals).
- UI workflows beyond existing shell.
- Final pricing engine and runtime-specific metering implementation.
Ownership Model
- App instance ownership: project.
- Attribution: requested_by_user_id.
- Tenant boundary: inferred from project -> org.
- Service accounts remain project-scoped and are used by app operators.
Important distinction:
1. app instance ownership remains project-scoped
2. runtime control plane may be project, tenant, or platform scoped depending on operating mode
This is required so the platform can start with tenant-dedicated app backends and later introduce platform-managed services without changing ownership semantics.
Current limitation:
- a tenant-scoped runtime control plane is not yet the same thing as a tenant-owned shared runtime contract,
- tenant-owned shared mode needs an explicit attached-project model instead of overloading one project-owned app instance to mean tenant ownership,
- see:
  - doc/architecture/App_Tenant_Shared_Attachment_Model_v1.md
  - doc/architecture/App_Tenant_Shared_Runtime_API_Direction_v1.md
Operating Mode Baseline
See doc/architecture/App_Runtime_Operating_Modes_v1.md.
Initial direction:
1. production default is tenant_dedicated
2. future shared offerings use platform_managed
3. both modes use the same app catalog, entitlement, app instance, IAM, audit, and observability paths
Billing Attribution Baseline
See doc/architecture/App_Runtime_Billing_Model_v1.md.
Baseline rules:
1. project remains the primary app-runtime billing anchor, even when the effective control plane is tenant or platform scoped
2. app-runtime billing must preserve:
- org_id
- project_id
- app_instance_id
- app_slug
- operating_mode
- control_plane_scope
- runtime_backend
- correlation_id
3. tenant_dedicated + project is the clean default for dev/test/stage/prod style environment boundaries
4. tenant-scoped shared control planes are supported, but any cross-project cost distribution must be explicit and policy-driven
5. platform-managed shared-service costs must still reconcile back to tenant and project usage records
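As a minimal sketch of rule 2, the attribution fields can be carried in one immutable record that refuses to be built with a missing value. The class name AppRuntimeBillingRecord and the use of plain strings for every field are assumptions; the authoritative shape lives in doc/architecture/App_Runtime_Billing_Model_v1.md.

```python
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class AppRuntimeBillingRecord:
    """Hypothetical record holding the attribution fields listed above."""

    org_id: str
    project_id: str
    app_instance_id: str
    app_slug: str
    operating_mode: str
    control_plane_scope: str
    runtime_backend: str
    correlation_id: str

    def __post_init__(self) -> None:
        # Every attribution field must be present and non-empty.
        missing = [f.name for f in fields(self) if not getattr(self, f.name)]
        if missing:
            raise ValueError(f"missing billing attribution fields: {missing}")


record = AppRuntimeBillingRecord(
    org_id="org-1",
    project_id="proj-1",
    app_instance_id="inst-1",
    app_slug="model-serving",
    operating_mode="tenant_dedicated",
    control_plane_scope="project",
    runtime_backend="k8s",
    correlation_id="corr-1",
)
```

The identifier values shown are placeholders, not real IDs.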
API Surface (contract-first)
See doc/api/openapi.draft.yaml.
Added surfaces:
1. GET /api/v1/apps/catalog
2. GET /api/v1/apps/catalog/{app_slug}/versions
3. POST /api/v1/admin/apps/catalog/{app_slug}/versions/{version}/publish
4. POST /api/v1/admin/apps/catalog/{app_slug}/versions/{version}/deprecate
5. GET /api/v1/projects/{project_id}/apps/entitlements
6. PUT /api/v1/projects/{project_id}/apps/entitlements/{app_slug}
7. GET /api/v1/projects/{project_id}/app-instances
8. POST /api/v1/projects/{project_id}/app-instances
9. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}
10. DELETE /api/v1/projects/{project_id}/app-instances/{app_instance_id}
11. GET /api/v1/apps/registry
12. GET /api/v1/projects/{project_id}/app-artifacts
13. POST /api/v1/projects/{project_id}/app-artifacts/publish-intents
14. POST /api/v1/projects/{project_id}/app-artifacts
15. POST /api/v1/projects/{project_id}/app-artifacts/{artifact_id}/promote
16. POST /api/v1/projects/{project_id}/app-artifacts/{artifact_id}/deprecate
17. POST /api/v1/projects/{project_id}/app-artifacts/{artifact_id}/retire
18. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/members
19. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/members/{member_id}
20. POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/member-operations
21. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/member-operations/{operation_id}
Behavioral intent:
- Catalog is read-only for tenant users.
- Entitlements are project-scoped controls (enable/disable + policy overrides).
- Instance create/delete are async (202 Accepted), state transitions tracked via events.
- Member lifecycle requests are async (202 Accepted) and remain generic platform operation envelopes; runtime-specific implementation stays in the adapter.
- Artifact upload bytes flow directly to the registry; the API owns publish intent, digest registration, lifecycle, and audit.
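The split between registry upload and API-side digest registration can be sketched as follows. It assumes the registered digest uses the OCI sha256:<hex> form; the helper names oci_digest and verify_digest are hypothetical, and the actual request fields are defined in doc/api/openapi.draft.yaml rather than here.

```python
import hashlib


def oci_digest(data: bytes) -> str:
    """Return an OCI-style content digest (algorithm prefix plus hex)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def verify_digest(data: bytes, expected: str) -> bool:
    """Check uploaded bytes against the digest a client intends to register."""
    return oci_digest(data) == expected
```

A client would compute the digest locally, push the bytes to the registry, and then send only the digest and metadata to the API, so the control plane never handles artifact payloads.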
Event Surface
See doc/api/asyncapi.draft.yaml.
Added lifecycle events:
1. apps.entitlement.updated
2. apps.instance.requested
3. apps.instance.running
4. apps.instance.failed
5. apps.instance.deleting
6. apps.instance.deleted
7. apps.artifact.registered
8. apps.artifact.promoted
9. apps.artifact.deprecated
10. apps.artifact.retired
Envelope remains standard:
- event_id
- event_type
- occurred_at
- version
- correlation_id
- payload
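A minimal sketch of the envelope, plus a consumer that derives instance status from events instead of reading database internals. The function names, the ISO 8601 timestamp format, the version value "1", and the UUID event_id are assumptions; the event types and envelope keys are the ones listed above, and the authoritative schema is doc/api/asyncapi.draft.yaml.

```python
import uuid
from datetime import datetime, timezone

# Event type -> resulting instance status, using names from this document.
STATUS_BY_EVENT = {
    "apps.instance.requested": "requested",
    "apps.instance.running": "running",
    "apps.instance.failed": "failed",
    "apps.instance.deleting": "deleting",
    "apps.instance.deleted": "deleted",
}


def make_event(event_type: str, correlation_id: str, payload: dict) -> dict:
    """Build a standard envelope around a domain payload."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "version": "1",  # assumed format
        "correlation_id": correlation_id,
        "payload": payload,
    }


def project_status(events: list) -> "str | None":
    """Fold events (assumed to be in delivery order) into the latest status."""
    status = None
    for event in events:
        # Non-instance events, such as entitlement updates, leave status as is.
        status = STATUS_BY_EVENT.get(event["event_type"], status)
    return status
```

Because every event carries correlation_id, the same stream can also be grouped per request when reconstructing a timeline.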
Security and Isolation
- Project context (X-Project-ID or project path) is authoritative for all project-owned app operations.
- Cross-project/cross-tenant operations are denied by default.
- App operators authenticate via project-scoped service accounts only.
- No raw command execution surface is exposed via app APIs.
RBAC Action Matrix (v1 Baseline)
Project-scoped app actions
| Action | platform_superadmin | platform_ops | tenant_owner | tenant_admin | project_owner | project_admin | project_member | project_viewer | service_account |
|---|---|---|---|---|---|---|---|---|---|
| apps.catalog.read | allow | allow | allow | allow | allow | allow | allow | allow | deny |
| apps.versions.read | allow | allow | allow | allow | allow | allow | allow | allow | deny |
| apps.entitlement.read | allow | allow | allow | allow | allow | allow | deny | deny | deny |
| apps.entitlement.write | allow | allow | allow | allow | allow | allow | deny | deny | deny |
| apps.instance.read | allow | allow | allow | allow | allow | allow | allow | allow | allow (same project only) |
| apps.instance.create | allow | allow | allow | allow | allow | allow | allow | deny | allow (same project only) |
| apps.instance.delete | allow | allow | allow | allow | allow | allow | deny | deny | deny |
| apps.instance.member.read | allow | allow | allow | allow | allow | allow | allow | allow | allow (same project only) |
| apps.instance.member.operate | allow | allow | allow | allow | allow | allow | deny | deny | allow (same project only, explicit allowlist only) |
Rules:
1. platform_superadmin and platform_ops are break-glass/platform-operations roles; they bypass tenant-level membership checks on explicit admin endpoints only.
2. service_account permissions are constrained to same-project resources and an explicitly allowlisted endpoint set.
3. tenant_owner and tenant_admin can manage project entitlements and app instances inside their tenant.
4. Project-scoped role evaluation follows the project -> tenant -> platform decision chain from the role lifecycle baseline.
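The matrix can be encoded as data so that one function answers every check. This is a sketch, not the platform's policy engine: the is_allowed function and its same_project and allowlisted parameters are hypothetical, and it deliberately ignores tenant membership resolution and the project -> tenant -> platform decision chain.

```python
ADMIN_ROLES = frozenset({
    "platform_superadmin", "platform_ops", "tenant_owner",
    "tenant_admin", "project_owner", "project_admin",
})
ALL_HUMAN_ROLES = ADMIN_ROLES | {"project_member", "project_viewer"}

# Roles with an "allow" cell in the matrix above, per action.
ALLOWED_ROLES = {
    "apps.catalog.read": ALL_HUMAN_ROLES,
    "apps.versions.read": ALL_HUMAN_ROLES,
    "apps.entitlement.read": ADMIN_ROLES,
    "apps.entitlement.write": ADMIN_ROLES,
    "apps.instance.read": ALL_HUMAN_ROLES | {"service_account"},
    "apps.instance.create": ADMIN_ROLES | {"project_member", "service_account"},
    "apps.instance.delete": ADMIN_ROLES,
    "apps.instance.member.read": ALL_HUMAN_ROLES | {"service_account"},
    "apps.instance.member.operate": ADMIN_ROLES | {"service_account"},
}

# Actions where the service_account cell also requires an explicit allowlist.
ALLOWLIST_ONLY = frozenset({"apps.instance.member.operate"})


def is_allowed(role: str, action: str, *, same_project: bool = True,
               allowlisted: bool = False) -> bool:
    """Return True when the matrix grants the action; unknown actions deny."""
    if role not in ALLOWED_ROLES.get(action, frozenset()):
        return False
    if role == "service_account":
        if not same_project:
            return False
        if action in ALLOWLIST_ONLY and not allowlisted:
            return False
    return True
```

Keeping the matrix as a table of sets makes a diff against the documented baseline straightforward to review.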
Policy Overlay Direction
Overlay resolution order (future implementation):
1. global defaults
2. tenant overrides
3. project app entitlement overrides
Most-specific scope wins. Global hard-deny remains non-overridable.
Overlay schema direction (v1)
project_app_entitlements.policy_overrides supports:
1. allowed_regions: array of region codes.
2. allowed_skus: array of catalog sku codes.
3. max_instances_per_project: integer.
4. max_gpus_per_instance: integer.
5. artifact_source_allowlist: array of host patterns.
6. allowed_operating_modes: array of tenant_dedicated | platform_managed.
7. allowed_control_plane_scopes: array of project | tenant | platform.
Restrictions:
1. Project override can only narrow scope versus tenant/global policy.
2. Project override cannot enable an app disabled by tenant/global hard-deny.
3. Conflicts resolve by most-specific then most-restrictive.
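One possible reading of these rules, as a sketch: tenant overrides replace global defaults per key, and the project override can then only narrow the result (list intersection, integer minimum). The function names and the hard_denied flag are hypothetical, and exact-match intersection is a simplification for artifact_source_allowlist, which holds host patterns.

```python
LIST_KEYS = (
    "allowed_regions", "allowed_skus", "artifact_source_allowlist",
    "allowed_operating_modes", "allowed_control_plane_scopes",
)
INT_KEYS = ("max_instances_per_project", "max_gpus_per_instance")


def narrow(parent: dict, override: dict) -> dict:
    """Apply an override that may only restrict the parent policy."""
    effective = dict(parent)
    for key in LIST_KEYS:
        if key in override:
            if key in parent:
                # Keep only values the parent already allows.
                effective[key] = [v for v in parent[key] if v in override[key]]
            else:
                effective[key] = list(override[key])
    for key in INT_KEYS:
        if key in override:
            # Most-restrictive wins: never raise a limit above the parent's.
            effective[key] = min(parent.get(key, override[key]), override[key])
    return effective


def resolve(global_defaults: dict, tenant: dict, project: dict,
            hard_denied: bool = False) -> "dict | None":
    """Return the effective policy, or None when a hard-deny applies."""
    if hard_denied:
        return None  # no override can re-enable the app
    base = {**global_defaults, **tenant}
    return narrow(base, project)
```

For example, a project that requests a higher max_gpus_per_instance than its tenant allows simply keeps the tenant's limit.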
Observability Requirements
Every app-instance mutation and event should carry:
- correlation_id
- org_id
- project_id
- app_slug
- app_instance_id
Target triage path:
- API error envelope -> correlation id
- Loki lookup by correlation id
- Tempo trace by trace_id
- App lifecycle event timeline from async stream
Follow-ups (next iterations)
- Runtime-specific metering implementation and usage-record to ledger pipeline.
- Operator onboarding guide (reference implementation for one app, e.g. model serving).
- Admin catalog version disable endpoint and retirement workflow.
- Registry credential delivery through Vault-backed publish/pull secret paths.
- Generic clustered-app topology and component-role contract for multi-member example apps.
- Tenant-shared runtime ownership and attachment contract for apps that support tenant-owned shared mode.
DB Schema Proposal (v1 Draft)
SQL companion:
- doc/architecture/db_schema_app_control_plane_phase1_draft.sql
Tables
- app_catalog
  - id uuid pk
  - slug text unique not null
  - display_name text not null
  - category text not null
  - publisher text not null
  - status text not null check (status in ('active','deprecated','disabled'))
  - created_at timestamptz not null default now()
  - updated_at timestamptz not null default now()
- app_versions
  - id uuid pk
  - app_id uuid not null references app_catalog(id) on delete cascade
  - version text not null
  - runtime_backend text not null check (runtime_backend in ('k8s','rke2','slurm','ray','bare_metal'))
  - manifest jsonb not null
  - status text not null check (status in ('active','deprecated','disabled'))
  - created_at timestamptz not null default now()
  - unique (app_id, version)
- project_app_entitlements
  - id uuid pk
  - org_id uuid not null references organizations(id) on delete cascade
  - project_id uuid not null references projects(id) on delete cascade
  - app_id uuid not null references app_catalog(id) on delete cascade
  - enabled boolean not null default true
  - policy_overrides jsonb not null default '{}'::jsonb
  - updated_by_user_id uuid null references users(id)
  - created_at timestamptz not null default now()
  - updated_at timestamptz not null default now()
  - unique (project_id, app_id)
- app_instances
  - id uuid pk
  - resource_name text not null unique
  - org_id uuid not null references organizations(id) on delete cascade
  - project_id uuid not null references projects(id) on delete cascade
  - app_id uuid not null references app_catalog(id) on delete restrict
  - app_version_id uuid not null references app_versions(id) on delete restrict
  - display_name text not null
  - operating_mode text not null default 'tenant_dedicated' check (operating_mode in ('tenant_dedicated','platform_managed'))
  - control_plane_scope text not null default 'project' check (control_plane_scope in ('project','tenant','platform'))
  - tenant_boundary_mode text not null default 'tenant_isolated' check (tenant_boundary_mode in ('tenant_isolated','shared_service'))
  - status text not null check (status in ('requested','deploying','running','failed','deleting','deleted'))
  - requested_by_user_id uuid not null references users(id)
  - operator_service_account_id uuid null references service_accounts(id)
  - config jsonb not null default '{}'::jsonb
  - runtime_state jsonb not null default '{}'::jsonb
  - failure_reason text null
  - created_at timestamptz not null default now()
  - updated_at timestamptz not null default now()
  - deleted_at timestamptz null
Required indexes
- ix_app_instances_project_status_created on (project_id, status, created_at desc)
- ix_app_instances_org_status on (org_id, status)
- ix_project_app_entitlements_project on (project_id)
- ix_app_versions_app_status on (app_id, status, created_at desc)
Required integrity constraints
- app_instances.project_id must belong to app_instances.org_id.
- project_app_entitlements.project_id must belong to project_app_entitlements.org_id.
- app_instances.operator_service_account_id (if set) must belong to the same project_id and org_id.
- app_instances.app_version_id must reference the same app_id as app_instances.app_id.
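These cross-table rules cannot be expressed by the single-column foreign keys listed under Tables, so they need an extra check somewhere. A minimal application-level sketch follows, using in-memory dicts keyed by id; the function name is hypothetical, and the project_id and org_id fields on service accounts are assumed, since that table is not defined in this document.

```python
def check_app_instance(instance: dict, projects: dict,
                       service_accounts: dict, app_versions: dict) -> list:
    """Return the integrity rules violated by one app_instances row."""
    violations = []

    # The project must live in the same org as the instance.
    project = projects[instance["project_id"]]
    if project["org_id"] != instance["org_id"]:
        violations.append("project_id must belong to org_id")

    # The operator service account is optional, but if set it must match.
    sa_id = instance.get("operator_service_account_id")
    if sa_id is not None:
        sa = service_accounts[sa_id]  # assumed to carry project_id and org_id
        if (sa["project_id"], sa["org_id"]) != (
            instance["project_id"], instance["org_id"]
        ):
            violations.append(
                "operator_service_account_id must belong to same project and org"
            )

    # The pinned version must be a version of the same catalog app.
    version = app_versions[instance["app_version_id"]]
    if version["app_id"] != instance["app_id"]:
        violations.append("app_version_id must reference the same app_id")

    return violations
```

In the database itself, the same guarantees would more likely come from composite foreign keys or triggers; that choice is left to the SQL companion draft.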
Migration/cutover strategy
- Add tables and indexes without touching existing allocation/storage paths.
- Launch catalog read APIs first.
- Gate entitlement and instance mutations behind feature flag.
- Introduce one reference operator integration before broad app onboarding.