# App Runtime Operating Modes v1

## Purpose

Define the operating modes for platform apps so GPUaaS can start with tenant-dedicated deployments and evolve to platform-managed services later without rewriting the control-plane contract.

This document is the missing bridge between:

1. app instance lifecycle
2. scheduler/model-serving app design
3. tenant/project IAM
4. future managed-service operation
## Decision Summary

GPUaaS supports two operating modes for platform apps:

- `tenant_dedicated`
  - the app runtime/control plane is deployed for one tenant only
  - worker/compute nodes assigned to that app are tenant-bounded
  - this is the default production starting mode
- `platform_managed`
  - the app runtime/control plane is operated as a shared platform service
  - tenants/projects consume the service through the same control-plane contracts
  - this is a later operating mode, not the initial production assumption
## Why This Exists

Without an explicit operating-mode model, the API and app lifecycle drift toward one of two accidental assumptions:

- every app instance is a project-scoped isolated deployment, or
- every app is a shared managed service.

Neither assumption is stable enough for the platform roadmap.

The actual intended path is:

1. phase 1: tenant-dedicated app backends
2. phase 2: selective platform-managed offerings
3. same app catalog, entitlement, app instance, IAM, audit, and observability contracts across both
## Core Invariants
These rules must hold in both operating modes.
- App instances remain project-owned control-plane resources.
- Authentication and authorization remain tenant/project scoped and policy-driven.
- Internal and external app teams use the same contracts.
- Operating mode changes deployment topology, not the authz or audit model.
- Node and compute isolation must remain explicit; no implicit cross-tenant worker sharing.
## Operating Modes

### 1. `tenant_dedicated`

Use when:

1. the tenant needs strong isolation
2. the scheduler/runtime is not mature enough for shared multi-tenant operation
3. support or compliance requires tenant-bounded control components

Typical shape:

1. tenant gets a dedicated control-plane deployment for the app
2. app workers join from tenant-bounded nodes or tenant-bounded allocations
3. app instance still belongs to a project, but the runtime control plane may be project- or tenant-scoped

Examples:

1. per-tenant Slurm controller + tenant compute nodes
2. per-tenant Ray head + tenant workers
3. per-tenant model-serving control deployment

Operational consequences:

1. easier blast-radius control
2. simpler initial production isolation
3. clearer tenant-specific upgrades and rollback
4. higher footprint cost per tenant
### 2. `platform_managed`

Use when:

1. the app runtime is sufficiently mature and supportable as a shared service
2. tenancy boundaries can be enforced without per-tenant control-plane duplication
3. billing and noisy-neighbor controls are credible

Typical shape:

1. platform operates a shared control-plane deployment
2. tenant/project resources consume managed service capacity
3. project-scoped app instance becomes a tenancy and lifecycle object against a shared backend

Examples:

1. managed model serving
2. shared inference gateway
3. future shared scheduler service tiers where justified

Operational consequences:

1. lower footprint cost
2. more efficient upgrades
3. harder isolation and support requirements
4. stricter need for quota, fairness, and billing evidence
## Scope Model

Two fields matter and must stay distinct:

- app instance ownership scope
  - always project-scoped in control-plane APIs
  - drives authz, audit, entitlement, and billing attribution
- runtime control plane scope
  - `project` | `tenant` | `platform`

This distinction is required because:

1. a project-owned app instance may attach to a tenant-scoped control plane in `tenant_dedicated` mode
2. a project-owned app instance may attach to a platform-scoped control plane in `platform_managed` mode
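The two distinct scope axes can be sketched in code. This is a minimal illustration, assuming a Python-style contract model; the type and field names (`AppInstance`, `OperatingMode`, `ControlPlaneScope`) are hypothetical, while the string values come from this document.

```python
from dataclasses import dataclass
from enum import Enum

class OperatingMode(str, Enum):
    TENANT_DEDICATED = "tenant_dedicated"
    PLATFORM_MANAGED = "platform_managed"

class ControlPlaneScope(str, Enum):
    PROJECT = "project"
    TENANT = "tenant"
    PLATFORM = "platform"

@dataclass(frozen=True)
class AppInstance:
    # Ownership scope: always the project; drives authz, audit,
    # entitlement, and billing attribution.
    project_id: str
    # Runtime control-plane scope: a separate axis that may be wider
    # than the ownership scope.
    operating_mode: OperatingMode
    control_plane_scope: ControlPlaneScope

# A project-owned instance attached to a tenant-scoped control plane
# (the tenant_dedicated case described above):
inst = AppInstance("proj-a", OperatingMode.TENANT_DEDICATED,
                   ControlPlaneScope.TENANT)
```

Keeping the two fields as separate enums makes it impossible for code to conflate ownership with runtime topology.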
### Why project scope matters

Project scope is not an edge case. It is a normal environment and isolation boundary.
Use `control_plane_scope = project` when:
1. a tenant uses projects as dev, test, stage, or prod environment boundaries
2. one project needs isolated control-plane upgrades or experiments
3. internal chargeback or quota policy needs clean per-project runtime attribution
4. the org wants to prevent one project's runtime state from affecting another project in the same tenant
Use `control_plane_scope = tenant` when:
1. multiple projects inside the tenant are expected to share one scheduler/runtime control plane
2. tenant admins, rather than individual project teams, own the runtime lifecycle
Use `control_plane_scope = platform` only for selective later `platform_managed` services.
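The scope-selection rules above can be expressed as a small decision helper. This is a hypothetical sketch, not a real API; the function name and boolean inputs are illustrative stand-ins for the policy signals described in this section.

```python
def pick_control_plane_scope(env_isolated: bool,
                             tenant_admin_owned: bool,
                             platform_managed_service: bool) -> str:
    """Hypothetical helper applying the control_plane_scope decision rules."""
    if platform_managed_service:
        # Only for selective later platform_managed services.
        return "platform"
    if env_isolated:
        # Dev/test/stage/prod boundaries, isolated upgrades,
        # per-project attribution, or cross-project blast-radius control.
        return "project"
    if tenant_admin_owned:
        # Multiple projects share one runtime owned by tenant admins.
        return "tenant"
    # When no signal applies, default to the stricter boundary.
    return "project"
```

Ordering matters here: the project-isolation signals win over tenant sharing, matching the document's stance that project scope is a normal boundary, not an edge case.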
## Product-Facing Ownership Modes

The API fields above are the implementation contract, but product-facing app modes should be described more plainly:

- project-scoped mode
  - maps to: `operating_mode = tenant_dedicated`, `control_plane_scope = project`
  - best for isolated per-project apps such as project-local databases or environment-specific runtimes
- tenant-owned shared mode
  - target mapping: `operating_mode = tenant_dedicated`, `control_plane_scope = tenant`
  - best for shared tenant control planes such as Slurm
  - projects attach as consumers by explicit policy
- platform-managed shared mode
  - maps to: `operating_mode = platform_managed`, `control_plane_scope = platform`
  - reserved for future shared platform-operated offerings
Important current limitation:

- the v1 control-plane contract still models app instances as project-owned resources
- so tenant-owned shared mode is a product target, not yet a fully modeled ownership contract
- implementing it cleanly requires an explicit attached-project / tenant-shared runtime model rather than pretending one project-owned instance is the same as a tenant-owned scheduler
- see: doc/architecture/App_Tenant_Shared_Attachment_Model_v1.md
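The three product-facing mappings can be captured as a simple lookup table. This is a sketch only; the product-mode key strings are hypothetical display names, while the field values are the implementation-contract values defined in this document.

```python
# Sketch: product-facing mode names -> implementation-contract fields.
# Keys are hypothetical product names, not API values.
PRODUCT_MODE_MAP = {
    "project-scoped": {
        "operating_mode": "tenant_dedicated",
        "control_plane_scope": "project",
    },
    # Product target: not yet a fully modeled ownership contract in v1.
    "tenant-owned shared": {
        "operating_mode": "tenant_dedicated",
        "control_plane_scope": "tenant",
    },
    "platform-managed shared": {
        "operating_mode": "platform_managed",
        "control_plane_scope": "platform",
    },
}
```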
## API Contract Direction

The app-instance contract should reserve the following concepts now:

- `operating_mode`
  - `tenant_dedicated` | `platform_managed`
- `control_plane_scope`
  - `project` | `tenant` | `platform`
- `runtime_backend`
  - effective runtime backing the instance (`k8s` | `slurm` | `ray` | `bare_metal`)
- `tenant_boundary_mode`
  - whether the instance is expected to run only on tenant-bounded resources versus a managed shared substrate
  - initial direction: `tenant_isolated` | `shared_service`
Implementation note:
The initial API can make these fields optional outputs and accept only `operating_mode` as a request hint; server policy decides the effective values.
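The request-hint pattern can be sketched as follows. This is a minimal illustration under the assumption of a Python-style contract; the request type, function name, and the policy defaults (`k8s` backend, tenant scope) are hypothetical choices, while the field names and value sets come from the contract list above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CreateAppInstanceRequest:
    project_id: str
    app_id: str
    # Optional request hint; all other mode fields are server-decided outputs.
    operating_mode: Optional[str] = None

def resolve_effective_fields(req: CreateAppInstanceRequest) -> dict:
    """Hypothetical server policy: default to the tenant-isolated posture."""
    mode = req.operating_mode or "tenant_dedicated"
    return {
        "operating_mode": mode,
        "control_plane_scope": "platform" if mode == "platform_managed" else "tenant",
        "runtime_backend": "k8s",  # k8s | slurm | ray | bare_metal
        "tenant_boundary_mode": ("shared_service" if mode == "platform_managed"
                                 else "tenant_isolated"),
    }
```

Keeping the hint optional lets the server tighten or relax defaults later without breaking existing clients.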
## Authorization Implications

Operating mode must not change core authz principles.

Rules:

1. project-scoped actors still manage project-owned app instances
2. tenant admins may manage tenant-dedicated app control surfaces only within their tenant
3. platform-admin/platform-ops are explicit break-glass actors for managed-service operations
4. service accounts remain project-scoped, even when the runtime control plane is tenant- or platform-scoped
## Node and Compute Model

In initial production, tenant-dedicated mode is the default.

That implies:

1. scheduler/control nodes may be tenant-dedicated
2. worker/compute nodes for those apps are tenant-bounded
3. MaaS and placement policies must be able to express tenant-bounded compute pools
This does not require every app instance to own a node. It requires the placement and runtime model to preserve tenant boundaries explicitly.
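The explicit-boundary requirement can be sketched as a placement guard. This is a hypothetical sketch, assuming `tenant_boundary_mode` values from the API contract section; the function name and the convention that platform-pooled nodes carry no tenant binding are illustration choices.

```python
from typing import Optional

def placement_allowed(node_tenant_id: Optional[str], instance_tenant_id: str,
                      tenant_boundary_mode: str) -> bool:
    """Hypothetical placement guard keeping tenant boundaries explicit."""
    if tenant_boundary_mode == "tenant_isolated":
        # Workers may only land on nodes bound to the same tenant.
        return node_tenant_id == instance_tenant_id
    if tenant_boundary_mode == "shared_service":
        # Managed substrate: nodes are platform-pooled (no tenant binding).
        return node_tenant_id is None
    # Unknown modes fail closed: no implicit cross-tenant worker sharing.
    return False
```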
## Billing Implications

Operating mode affects cost attribution.

### `tenant_dedicated`

Potential billable components:

1. dedicated control-plane footprint
2. tenant-bounded worker capacity
3. runtime-specific usage signals

### `platform_managed`

Potential billable components:

1. shared managed-service capacity consumption
2. per-request/per-job/per-runtime usage
3. quota or tier-based control-plane overhead allocation

Requirement: billing contracts must not assume one mode only.
## Observability Implications

Every app lifecycle and runtime incident should be explainable with:

1. `correlation_id`
2. `org_id`
3. `project_id`
4. `app_instance_id`
5. `operating_mode`
6. `control_plane_scope`
7. `runtime_backend`

This is necessary because incident ownership differs by mode:

1. tenant-dedicated incidents often route to tenant-bounded runtime/operator investigation
2. platform-managed incidents often route to shared-service operations and SRE paths
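A structured log line carrying the seven fields, plus a mode-dependent routing hint, might look like the sketch below. The function name and the `route` values are hypothetical; the seven field names are the ones listed above.

```python
import json

def incident_record(correlation_id: str, org_id: str, project_id: str,
                    app_instance_id: str, operating_mode: str,
                    control_plane_scope: str, runtime_backend: str) -> str:
    """Hypothetical structured log line for incident explainability."""
    return json.dumps({
        "correlation_id": correlation_id,
        "org_id": org_id,
        "project_id": project_id,
        "app_instance_id": app_instance_id,
        "operating_mode": operating_mode,
        "control_plane_scope": control_plane_scope,
        "runtime_backend": runtime_backend,
        # Ownership differs by mode: tenant-dedicated incidents go to
        # tenant-bounded operators, platform-managed to shared-service SRE.
        "route": ("tenant_runtime_ops" if operating_mode == "tenant_dedicated"
                  else "platform_sre"),
    })

line = incident_record("c-1", "org-1", "proj-a", "app-9",
                       "tenant_dedicated", "tenant", "slurm")
```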
## Recommended Production Rollout

- start with `tenant_dedicated`
- validate reference apps there first
- add `platform_managed` selectively after:
  - IAM and policy model are proven
  - billing confidence is strong
  - support/runbooks are mature
  - runtime isolation evidence is credible
## Examples

### Slurm

- first mode: `tenant_dedicated`
- control plane scope may be `project` for environment-isolated deployments or `tenant` for shared tenant schedulers
- project app instance may consume a project-scoped or tenant-scoped Slurm control plane depending on policy

### Ray

- first mode: `tenant_dedicated`
- control plane scope may be `project` or `tenant`
- future possible managed mode after maturity

### Model Serving

- first mode may be `tenant_dedicated` for enterprise/private models
- later selective `platform_managed` tiers for shared inference
## Required Follow-On Adjustments
- OpenAPI app-instance schemas should reserve operating-mode fields now.
- AsyncAPI app lifecycle payloads should carry operating-mode fields where useful.
- App control-plane docs should distinguish project ownership from runtime control-plane scope.
- Builder/operator guidance should explicitly support both modes.
## Related Docs

- doc/architecture/App_Control_Plane_v1.md
- doc/architecture/App_Runtime_Instance_Lifecycle_v1.md
- doc/architecture/Build_an_App_for_GPUaaS_v1.md
- doc/architecture/Scheduler_as_Platform_App_v1.md
- doc/architecture/Allocation_Node_Placement_v1.md
- doc/architecture/App_Tenant_Shared_Attachment_Model_v1.md