App Runtime Operating Modes v1

Purpose

Define the operating modes for platform apps so GPUaaS can start with tenant-dedicated deployments and evolve to platform-managed services later without rewriting the control-plane contract.

This document is the missing bridge between:

  1. app instance lifecycle
  2. scheduler/model-serving app design
  3. tenant/project IAM
  4. future managed-service operation

Decision Summary

GPUaaS supports two operating modes for platform apps:

  1. tenant_dedicated
    • the app runtime/control plane is deployed for one tenant only
    • worker/compute nodes assigned to that app are tenant-bounded
    • this is the default production starting mode

  2. platform_managed
    • the app runtime/control plane is operated as a shared platform service
    • tenants/projects consume the service through the same control-plane contracts
    • this is a later operating mode, not the initial production assumption
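The two modes can be pinned down as a small enum; a sketch only, where the class and constant names are assumptions, but the mode strings and the default follow this document:

```python
from enum import Enum

class OperatingMode(str, Enum):
    """The two operating modes for platform apps."""
    TENANT_DEDICATED = "tenant_dedicated"   # per-tenant runtime/control plane
    PLATFORM_MANAGED = "platform_managed"   # shared platform service (later mode)

# tenant_dedicated is the default production starting mode.
DEFAULT_OPERATING_MODE = OperatingMode.TENANT_DEDICATED
```

Using a `str`-mixin enum keeps the wire values (`"tenant_dedicated"`, `"platform_managed"`) identical to the contract strings.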

Why This Exists

Without an explicit operating-mode model, the API and app lifecycle drift toward an accidental assumption:

  1. every app instance is a project-scoped isolated deployment, or
  2. every app is a shared managed service.

Neither assumption is stable enough for the platform roadmap.

The actual intended path is:

  1. phase 1: tenant-dedicated app backends
  2. phase 2: selective platform-managed offerings
  3. same app catalog, entitlement, app instance, IAM, audit, and observability contracts across both

Core Invariants

These rules must hold in both operating modes.

  1. App instances remain project-owned control-plane resources.
  2. Authentication and authorization remain tenant/project scoped and policy-driven.
  3. Internal and external app teams use the same contracts.
  4. Operating mode changes deployment topology, not the authz or audit model.
  5. Node and compute isolation must remain explicit; no implicit cross-tenant worker sharing.

Operating Modes

1. tenant_dedicated

Use when:

  1. the tenant needs strong isolation
  2. the scheduler/runtime is not mature enough for shared multi-tenant operation
  3. support or compliance requires tenant-bounded control components

Typical shape:

  1. tenant gets a dedicated control-plane deployment for the app
  2. app workers join from tenant-bounded nodes or tenant-bounded allocations
  3. app instance still belongs to a project, but the runtime control plane may be project- or tenant-scoped

Examples:

  1. per-tenant Slurm controller + tenant compute nodes
  2. per-tenant Ray head + tenant workers
  3. per-tenant model-serving control deployment

Operational consequences:

  1. easier blast-radius control
  2. simpler initial production isolation
  3. clearer tenant-specific upgrades and rollback
  4. higher footprint cost per tenant

2. platform_managed

Use when:

  1. the app runtime is sufficiently mature and supportable as a shared service
  2. tenancy boundaries can be enforced without per-tenant control-plane duplication
  3. billing and noisy-neighbor controls are credible

Typical shape:

  1. platform operates a shared control-plane deployment
  2. tenant/project resources consume managed service capacity
  3. project-scoped app instance becomes a tenancy and lifecycle object against a shared backend

Examples:

  1. managed model serving
  2. shared inference gateway
  3. future shared scheduler service tiers where justified

Operational consequences:

  1. lower footprint cost
  2. more efficient upgrades
  3. harder isolation and support requirements
  4. stricter need for quota, fairness, and billing evidence

Scope Model

Two fields matter and must stay distinct:

  1. app instance ownership scope
    • always project-scoped in control-plane APIs
    • drives authz, audit, entitlement, and billing attribution

  2. runtime control plane scope
    • project
    • tenant
    • platform
This distinction is required because:

  1. a project-owned app instance may attach to a tenant-scoped control plane in tenant_dedicated mode
  2. a project-owned app instance may attach to a platform-scoped control plane in platform_managed mode

Why project scope matters

project scope is not an edge case. It is a normal environment and isolation boundary.

Use control_plane_scope = project when:

  1. a tenant uses projects as dev, test, stage, or prod environment boundaries
  2. one project needs isolated control-plane upgrades or experiments
  3. internal chargeback or quota policy needs clean per-project runtime attribution
  4. the org wants to prevent one project's runtime state from affecting another project in the same tenant

Use control_plane_scope = tenant when:

  1. multiple projects inside the tenant are expected to share one scheduler/runtime control plane
  2. tenant admins, rather than individual project teams, own the runtime lifecycle

Use control_plane_scope = platform only for selective later platform_managed services.
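The scope-selection guidance above reduces to a simple decision rule. The sketch below is illustrative only (the function name and boolean inputs are assumptions, not contract fields); the scope values follow this document:

```python
from enum import Enum

class ControlPlaneScope(str, Enum):
    PROJECT = "project"
    TENANT = "tenant"
    PLATFORM = "platform"

def choose_control_plane_scope(
    *,
    projects_share_runtime: bool,
    platform_managed_service: bool,
) -> ControlPlaneScope:
    """Decision rule following the guidance above."""
    # platform scope only for selective later platform_managed services
    if platform_managed_service:
        return ControlPlaneScope.PLATFORM
    # tenant scope when multiple projects share one runtime control plane
    if projects_share_runtime:
        return ControlPlaneScope.TENANT
    # project scope otherwise: environment boundaries, isolated upgrades,
    # clean per-project runtime attribution
    return ControlPlaneScope.PROJECT
```

In a real control plane this decision would be policy-driven rather than hard-coded, but the precedence (platform over tenant over project) matches the narrative above.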

Product-Facing Ownership Modes

The API fields above are the implementation contract, but product-facing app modes should be described more plainly:

  1. project-scoped mode
    • maps to:
      • operating_mode = tenant_dedicated
      • control_plane_scope = project
    • best for isolated per-project apps such as project-local databases or environment-specific runtimes

  2. tenant-owned shared mode
    • target mapping:
      • operating_mode = tenant_dedicated
      • control_plane_scope = tenant
    • best for shared tenant control planes such as Slurm
    • projects attach as consumers by explicit policy

  3. platform-managed shared mode
    • maps to:
      • operating_mode = platform_managed
      • control_plane_scope = platform
    • reserved for future shared platform-operated offerings

Important current limitation:

  1. the v1 control-plane contract still models app instances as project-owned resources
  2. so tenant-owned shared mode is a product target, not yet a fully modeled ownership contract
  3. implementing it cleanly requires an explicit attached-project / tenant-shared runtime model rather than pretending one project-owned instance is the same as a tenant-owned scheduler
  4. see: doc/architecture/App_Tenant_Shared_Attachment_Model_v1.md
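The three product-facing modes are a fixed lookup into the implementation contract. A minimal sketch, assuming hyphenated product-mode keys (the key spellings are illustrative; the field values come from this document):

```python
# Product-facing mode -> implementation contract fields.
PRODUCT_MODE_MAPPING = {
    "project-scoped": {
        "operating_mode": "tenant_dedicated",
        "control_plane_scope": "project",
    },
    "tenant-owned-shared": {
        "operating_mode": "tenant_dedicated",
        "control_plane_scope": "tenant",
    },
    "platform-managed-shared": {
        "operating_mode": "platform_managed",
        "control_plane_scope": "platform",
    },
}
```

Keeping this mapping in one table makes it easy to validate that product docs and the API contract never drift apart.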

API Contract Direction

The app-instance contract should reserve the following concepts now:

  1. operating_mode
    • tenant_dedicated | platform_managed

  2. control_plane_scope
    • project | tenant | platform

  3. runtime_backend
    • effective runtime backing the instance (k8s | slurm | ray | bare_metal)

  4. tenant_boundary_mode
    • whether the instance is expected to run only on tenant-bounded resources versus a managed shared substrate
    • initial direction:
      • tenant_isolated
      • shared_service

Implementation note: The initial API can make these output fields optional and accept only operating_mode as a request hint. Server policy decides the effective values.
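The implementation note can be sketched as a resolver: the client may send operating_mode as a hint, and the server fills in the effective fields. The concrete fallback policy below (default to tenant_dedicated, pick tenant scope for dedicated runtimes) is an illustrative placeholder, not the real policy engine:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EffectiveRuntimeFields:
    operating_mode: str
    control_plane_scope: str
    runtime_backend: str
    tenant_boundary_mode: str

def resolve_effective_fields(
    requested_operating_mode: Optional[str],
    default_backend: str,
) -> EffectiveRuntimeFields:
    """operating_mode is only a request hint; server policy decides
    the effective values."""
    if requested_operating_mode == "platform_managed":
        return EffectiveRuntimeFields(
            "platform_managed", "platform", default_backend, "shared_service"
        )
    # Anything else (including no hint) resolves to the default mode.
    return EffectiveRuntimeFields(
        "tenant_dedicated", "tenant", default_backend, "tenant_isolated"
    )
```

Because these are server-resolved output fields, the API can ship them as optional now and tighten the schema later without breaking clients.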

Authorization Implications

Operating mode must not change core authz principles.

Rules:

  1. project-scoped actors still manage project-owned app instances
  2. tenant-admins may manage tenant-dedicated app control surfaces only within their tenant
  3. platform-admin/platform-ops are explicit break-glass actors for managed-service operations
  4. service accounts remain project-scoped, even when the runtime control plane is tenant- or platform-scoped
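These rules can be sketched as a single check. Role names and the function shape are assumptions for illustration; the precedence follows the rules above:

```python
from typing import Optional

def can_manage_app_instance(
    actor_role: str,                  # "project-member" | "tenant-admin" | "platform-ops"
    actor_tenant: str,
    actor_project: Optional[str],
    instance_tenant: str,
    instance_project: str,
) -> bool:
    # platform-ops is an explicit break-glass actor for managed-service
    # operations; every use should be audited.
    if actor_role == "platform-ops":
        return True
    # tenant admins act only within their own tenant.
    if actor_tenant != instance_tenant:
        return False
    if actor_role == "tenant-admin":
        return True
    # project-scoped actors manage only their own project's instances.
    return actor_project == instance_project
```

A production authz layer would evaluate policy rather than hard-code roles, but the tenant and project boundaries enforced here must hold in both operating modes.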

Node and Compute Model

In initial production, tenant-dedicated mode is the default.

That implies:

  1. scheduler/control nodes may be tenant-dedicated
  2. worker/compute nodes for those apps are tenant-bounded
  3. MaaS and placement policies must be able to express tenant-bounded compute pools

This does not require every app instance to own a node. It requires the placement and runtime model to preserve tenant boundaries explicitly.
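Preserving tenant boundaries explicitly can be expressed as a placement filter. A minimal sketch, assuming a `tenant_id` of `None` marks shared platform substrate (the `Node` shape and field names are illustrative):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Node:
    node_id: str
    tenant_id: Optional[str]  # None = shared platform substrate

def eligible_nodes(
    nodes: List[Node], tenant_id: str, tenant_boundary_mode: str
) -> List[Node]:
    # tenant_isolated: only nodes bounded to this tenant are eligible;
    # no implicit cross-tenant worker sharing.
    if tenant_boundary_mode == "tenant_isolated":
        return [n for n in nodes if n.tenant_id == tenant_id]
    # shared_service: tenant-bounded nodes plus shared substrate capacity.
    return [n for n in nodes if n.tenant_id in (tenant_id, None)]
```

Note that in neither mode can a node bounded to a different tenant become eligible; that is the explicit boundary the invariants require.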

Billing Implications

Operating mode affects cost attribution.

tenant_dedicated

Potential billable components:

  1. dedicated control-plane footprint
  2. tenant-bounded worker capacity
  3. runtime-specific usage signals

platform_managed

Potential billable components:

  1. shared managed-service capacity consumption
  2. per-request/per-job/per-runtime usage
  3. quota or tier-based control-plane overhead allocation

Requirement: Billing contracts must not assume one mode only.
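One way to keep billing contracts mode-agnostic is to derive the candidate line items from the operating mode. The component identifiers below are illustrative slugs for the lists above, not defined billing meters:

```python
from typing import List

def billable_components(operating_mode: str) -> List[str]:
    """Candidate billable components per operating mode; a real billing
    contract would attach meters and prices to each entry."""
    if operating_mode == "tenant_dedicated":
        return [
            "dedicated_control_plane_footprint",
            "tenant_bounded_worker_capacity",
            "runtime_usage_signals",
        ]
    if operating_mode == "platform_managed":
        return [
            "shared_capacity_consumption",
            "per_request_or_per_job_usage",
            "control_plane_overhead_allocation",
        ]
    raise ValueError(f"unknown operating_mode: {operating_mode}")
```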

Observability Implications

Every app lifecycle and runtime incident should be explainable with:

  1. correlation_id
  2. org_id
  3. project_id
  4. app_instance_id
  5. operating_mode
  6. control_plane_scope
  7. runtime_backend

This is necessary because incident ownership differs by mode:

  1. tenant-dedicated incidents often route to tenant-bounded runtime/operator investigation
  2. platform-managed incidents often route to shared-service operations and SRE paths
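The required fields and the mode-dependent routing can be sketched as an event shape plus a routing rule. The class name and queue names are hypothetical; the field list matches the one above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AppRuntimeIncidentEvent:
    """Minimal event carrying every field an incident explanation needs."""
    correlation_id: str
    org_id: str
    project_id: str
    app_instance_id: str
    operating_mode: str
    control_plane_scope: str
    runtime_backend: str

def incident_route(event: AppRuntimeIncidentEvent) -> str:
    # Routing rule of thumb from this document; destination names are
    # illustrative, not real queue identifiers.
    if event.operating_mode == "platform_managed":
        return "shared-service-sre"
    return "tenant-runtime-operations"
```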

Rollout Direction

  1. start with tenant_dedicated
  2. validate reference apps there first
  3. add platform_managed selectively after:
    • IAM and policy model are proven
    • billing confidence is strong
    • support/runbooks are mature
    • runtime isolation evidence is credible

Examples

Slurm

  1. first mode: tenant_dedicated
  2. control plane scope may be project for environment-isolated deployments or tenant for shared tenant schedulers
  3. project app instance may consume a project-scoped or tenant-scoped Slurm control plane depending on policy

Ray

  1. first mode: tenant_dedicated
  2. control plane scope may be project or tenant
  3. future possible managed mode after maturity

Model Serving

  1. first mode may be tenant_dedicated for enterprise/private models
  2. later selective platform_managed tiers for shared inference

Required Follow-On Adjustments

  1. OpenAPI app-instance schemas should reserve operating-mode fields now.
  2. AsyncAPI app lifecycle payloads should carry operating-mode fields where useful.
  3. App control-plane docs should distinguish project ownership from runtime control-plane scope.
  4. Builder/operator guidance should explicitly support both modes.

Related Documents

  1. doc/architecture/App_Control_Plane_v1.md
  2. doc/architecture/App_Runtime_Instance_Lifecycle_v1.md
  3. doc/architecture/Build_an_App_for_GPUaaS_v1.md
  4. doc/architecture/Scheduler_as_Platform_App_v1.md
  5. doc/architecture/Allocation_Node_Placement_v1.md
  6. doc/architecture/App_Tenant_Shared_Attachment_Model_v1.md