App Runtime Operating Modes v1

Purpose

Define the operating modes for platform apps so GPUaaS can start with tenant-dedicated deployments and evolve to platform-managed services later without rewriting the control-plane contract.

This document is the missing bridge between:

  1. app instance lifecycle
  2. scheduler/model-serving app design
  3. tenant/project IAM
  4. future managed-service operation

Decision Summary

GPUaaS supports two operating modes for platform apps:

  1. tenant_dedicated
    • the app runtime/control plane is deployed for one tenant only
    • worker/compute nodes assigned to that app are tenant-bounded
    • this is the default production starting mode

  2. platform_managed
    • the app runtime/control plane is operated as a shared platform service
    • tenants/projects consume the service through the same control-plane contracts
    • this is a later operating mode, not the initial production assumption
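The two modes can be pinned down as a small enum; a sketch only, where the class and constant names are assumptions, but the mode strings and the default follow this document:

```python
from enum import Enum

class OperatingMode(str, Enum):
    """The two operating modes for platform apps."""
    TENANT_DEDICATED = "tenant_dedicated"   # per-tenant runtime/control plane
    PLATFORM_MANAGED = "platform_managed"   # shared platform service (later mode)

# tenant_dedicated is the default production starting mode.
DEFAULT_OPERATING_MODE = OperatingMode.TENANT_DEDICATED
```

Using a `str`-mixin enum keeps the wire values (`"tenant_dedicated"`, `"platform_managed"`) identical to the contract strings.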

Why This Exists

Without an explicit operating-mode model, the API and app lifecycle drift toward an accidental assumption:

  1. every app instance is a project-scoped isolated deployment, or
  2. every app is a shared managed service.

Neither assumption is stable enough for the platform roadmap.

The actual intended path is:

  1. phase 1: tenant-dedicated app backends
  2. phase 2: selective platform-managed offerings
  3. same app catalog, entitlement, app instance, IAM, audit, and observability contracts across both

Core Invariants

These rules must hold in both operating modes.

  1. App instances remain project-owned control-plane resources.
  2. Authentication and authorization remain tenant/project scoped and policy-driven.
  3. Internal and external app teams use the same contracts.
  4. Operating mode changes deployment topology, not the authz or audit model.
  5. Node and compute isolation must remain explicit; no implicit cross-tenant worker sharing.

Operating Modes

1. tenant_dedicated

Use when:

  1. the tenant needs strong isolation
  2. the scheduler/runtime is not mature enough for shared multi-tenant operation
  3. support or compliance requires tenant-bounded control components

Typical shape:

  1. tenant gets a dedicated control-plane deployment for the app
  2. app workers join from tenant-bounded nodes or tenant-bounded allocations
  3. app instance still belongs to a project, but the runtime control plane may be project- or tenant-scoped

Examples:

  1. per-tenant Slurm controller + tenant compute nodes
  2. per-tenant Ray head + tenant workers
  3. per-tenant model-serving control deployment

Operational consequences:

  1. easier blast-radius control
  2. simpler initial production isolation
  3. clearer tenant-specific upgrades and rollback
  4. higher footprint cost per tenant

2. platform_managed

Use when:

  1. the app runtime is sufficiently mature and supportable as a shared service
  2. tenancy boundaries can be enforced without per-tenant control-plane duplication
  3. billing and noisy-neighbor controls are credible

Typical shape:

  1. platform operates a shared control-plane deployment
  2. tenant/project resources consume managed service capacity
  3. project-scoped app instance becomes a tenancy and lifecycle object against a shared backend

Examples:

  1. managed model serving
  2. shared inference gateway
  3. future shared scheduler service tiers where justified

Operational consequences:

  1. lower footprint cost
  2. more efficient upgrades
  3. harder isolation and support requirements
  4. stricter need for quota, fairness, and billing evidence

Scope Model

Two fields matter and must stay distinct:

  1. app instance ownership scope
    • always project-scoped in control-plane APIs
    • drives authz, audit, entitlement, and billing attribution

  2. runtime control plane scope
    • project
    • tenant
    • platform
This distinction is required because:

  1. a project-owned app instance may attach to a tenant-scoped control plane in tenant_dedicated mode
  2. a project-owned app instance may attach to a platform-scoped control plane in platform_managed mode

Why project scope matters

project scope is not an edge case. It is a normal environment and isolation boundary.

Use control_plane_scope = project when:

  1. a tenant uses projects as dev, test, stage, or prod environment boundaries
  2. one project needs isolated control-plane upgrades or experiments
  3. internal chargeback or quota policy needs clean per-project runtime attribution
  4. the org wants to prevent one project's runtime state from affecting another project in the same tenant

Use control_plane_scope = tenant when:

  1. multiple projects inside the tenant are expected to share one scheduler/runtime control plane
  2. tenant admins, rather than individual project teams, own the runtime lifecycle

Use control_plane_scope = platform only for selective later platform_managed services.
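The scope-selection guidance above reduces to a simple decision rule. The sketch below is illustrative only (the function name and boolean inputs are assumptions, not contract fields); the scope values follow this document:

```python
from enum import Enum

class ControlPlaneScope(str, Enum):
    PROJECT = "project"
    TENANT = "tenant"
    PLATFORM = "platform"

def choose_control_plane_scope(
    *,
    projects_share_runtime: bool,
    platform_managed_service: bool,
) -> ControlPlaneScope:
    """Decision rule following the guidance above."""
    # platform scope only for selective later platform_managed services
    if platform_managed_service:
        return ControlPlaneScope.PLATFORM
    # tenant scope when multiple projects share one runtime control plane
    if projects_share_runtime:
        return ControlPlaneScope.TENANT
    # project scope otherwise: environment boundaries, isolated upgrades,
    # clean per-project runtime attribution
    return ControlPlaneScope.PROJECT
```

In a real control plane this decision would be policy-driven rather than hard-coded, but the precedence (platform over tenant over project) matches the narrative above.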

Product-Facing Ownership Modes

The API fields above are the implementation contract, but product-facing app modes should be described more plainly:

  1. project-scoped mode
    • maps to:
      • operating_mode = tenant_dedicated
      • control_plane_scope = project
    • best for isolated per-project apps such as project-local databases or environment-specific runtimes

  2. tenant-owned shared mode
    • target mapping:
      • operating_mode = tenant_dedicated
      • control_plane_scope = tenant
    • best for shared tenant control planes such as Slurm
    • projects attach as consumers by explicit policy

  3. platform-managed shared mode
    • maps to:
      • operating_mode = platform_managed
      • control_plane_scope = platform
    • reserved for future shared platform-operated offerings

Important current limitation:

  1. the v1 control-plane contract still models app instances as project-owned resources
  2. so tenant-owned shared mode is a product target, not yet a fully modeled ownership contract
  3. implementing it cleanly requires an explicit attached-project / tenant-shared runtime model rather than pretending one project-owned instance is the same as a tenant-owned scheduler
  4. see: doc/architecture/App_Tenant_Shared_Attachment_Model_v1.md
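The three product-facing modes are a fixed lookup into the implementation contract. A minimal sketch, assuming hyphenated product-mode keys (the key spellings are illustrative; the field values come from this document):

```python
# Product-facing mode -> implementation contract fields.
PRODUCT_MODE_MAPPING = {
    "project-scoped": {
        "operating_mode": "tenant_dedicated",
        "control_plane_scope": "project",
    },
    "tenant-owned-shared": {
        "operating_mode": "tenant_dedicated",
        "control_plane_scope": "tenant",
    },
    "platform-managed-shared": {
        "operating_mode": "platform_managed",
        "control_plane_scope": "platform",
    },
}
```

Keeping this mapping in one table makes it easy to validate that product docs and the API contract never drift apart.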

API Contract Direction

The app-instance contract should reserve the following concepts now:

  1. operating_mode
    • tenant_dedicated | platform_managed

  2. control_plane_scope
    • project | tenant | platform

  3. runtime_backend
    • effective runtime backing the instance (k8s | slurm | ray | bare_metal)

  4. tenant_boundary_mode
    • whether the instance is expected to run only on tenant-bounded resources versus a managed shared substrate
    • initial direction:
      • tenant_isolated
      • shared_service

Implementation note: The initial API can make these output fields optional and accept only operating_mode as a request hint. Server policy decides the effective values.
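The implementation note can be sketched as a resolver: the client may send operating_mode as a hint, and the server fills in the effective fields. The concrete fallback policy below (default to tenant_dedicated, pick tenant scope for dedicated runtimes) is an illustrative placeholder, not the real policy engine:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EffectiveRuntimeFields:
    operating_mode: str
    control_plane_scope: str
    runtime_backend: str
    tenant_boundary_mode: str

def resolve_effective_fields(
    requested_operating_mode: Optional[str],
    default_backend: str,
) -> EffectiveRuntimeFields:
    """operating_mode is only a request hint; server policy decides
    the effective values."""
    if requested_operating_mode == "platform_managed":
        return EffectiveRuntimeFields(
            "platform_managed", "platform", default_backend, "shared_service"
        )
    # Anything else (including no hint) resolves to the default mode.
    return EffectiveRuntimeFields(
        "tenant_dedicated", "tenant", default_backend, "tenant_isolated"
    )
```

Because these are server-resolved output fields, the API can ship them as optional now and tighten the schema later without breaking clients.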

Authorization Implications

Operating mode must not change core authz principles.

Rules:

  1. project-scoped actors still manage project-owned app instances
  2. tenant-admins may manage tenant-dedicated app control surfaces only within their tenant
  3. platform-admin/platform-ops are explicit break-glass actors for managed-service operations
  4. service accounts remain project-scoped, even when the runtime control plane is tenant- or platform-scoped
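These rules can be sketched as a single check. Role names and the function shape are assumptions for illustration; the precedence follows the rules above:

```python
from typing import Optional

def can_manage_app_instance(
    actor_role: str,                  # "project-member" | "tenant-admin" | "platform-ops"
    actor_tenant: str,
    actor_project: Optional[str],
    instance_tenant: str,
    instance_project: str,
) -> bool:
    # platform-ops is an explicit break-glass actor for managed-service
    # operations; every use should be audited.
    if actor_role == "platform-ops":
        return True
    # tenant admins act only within their own tenant.
    if actor_tenant != instance_tenant:
        return False
    if actor_role == "tenant-admin":
        return True
    # project-scoped actors manage only their own project's instances.
    return actor_project == instance_project
```

A production authz layer would evaluate policy rather than hard-code roles, but the tenant and project boundaries enforced here must hold in both operating modes.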

Node and Compute Model

In initial production, tenant-dedicated mode is the default.

That implies:

  1. scheduler/control nodes may be tenant-dedicated
  2. worker/compute nodes for those apps are tenant-bounded
  3. MaaS and placement policies must be able to express tenant-bounded compute pools

This does not require every app instance to own a node. It requires the placement and runtime model to preserve tenant boundaries explicitly.
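Preserving tenant boundaries explicitly can be expressed as a placement filter. A minimal sketch, assuming a `tenant_id` of `None` marks shared platform substrate (the `Node` shape and field names are illustrative):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Node:
    node_id: str
    tenant_id: Optional[str]  # None = shared platform substrate

def eligible_nodes(
    nodes: List[Node], tenant_id: str, tenant_boundary_mode: str
) -> List[Node]:
    # tenant_isolated: only nodes bounded to this tenant are eligible;
    # no implicit cross-tenant worker sharing.
    if tenant_boundary_mode == "tenant_isolated":
        return [n for n in nodes if n.tenant_id == tenant_id]
    # shared_service: tenant-bounded nodes plus shared substrate capacity.
    return [n for n in nodes if n.tenant_id in (tenant_id, None)]
```

Note that in neither mode can a node bounded to a different tenant become eligible; that is the explicit boundary the invariants require.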

Billing Implications

Operating mode affects cost attribution.

tenant_dedicated

Potential billable components:

  1. dedicated control-plane footprint
  2. tenant-bounded worker capacity
  3. runtime-specific usage signals

platform_managed

Potential billable components:

  1. shared managed-service capacity consumption
  2. per-request/per-job/per-runtime usage
  3. quota or tier-based control-plane overhead allocation

Requirement: Billing contracts must not assume one mode only.
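One way to keep billing contracts mode-agnostic is to derive the candidate line items from the operating mode. The component identifiers below are illustrative slugs for the lists above, not defined billing meters:

```python
from typing import List

def billable_components(operating_mode: str) -> List[str]:
    """Candidate billable components per operating mode; a real billing
    contract would attach meters and prices to each entry."""
    if operating_mode == "tenant_dedicated":
        return [
            "dedicated_control_plane_footprint",
            "tenant_bounded_worker_capacity",
            "runtime_usage_signals",
        ]
    if operating_mode == "platform_managed":
        return [
            "shared_capacity_consumption",
            "per_request_or_per_job_usage",
            "control_plane_overhead_allocation",
        ]
    raise ValueError(f"unknown operating_mode: {operating_mode}")
```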

Observability Implications

Every app lifecycle and runtime incident should be explainable with:

  1. correlation_id
  2. org_id
  3. project_id
  4. app_instance_id
  5. operating_mode
  6. control_plane_scope
  7. runtime_backend

This is necessary because incident ownership differs by mode:

  1. tenant-dedicated incidents often route to tenant-bounded runtime/operator investigation
  2. platform-managed incidents often route to shared-service operations and SRE paths
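The required fields and the mode-dependent routing can be sketched as an event shape plus a routing rule. The class name and queue names are hypothetical; the field list matches the one above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AppRuntimeIncidentEvent:
    """Minimal event carrying every field an incident explanation needs."""
    correlation_id: str
    org_id: str
    project_id: str
    app_instance_id: str
    operating_mode: str
    control_plane_scope: str
    runtime_backend: str

def incident_route(event: AppRuntimeIncidentEvent) -> str:
    # Routing rule of thumb from this document; destination names are
    # illustrative, not real queue identifiers.
    if event.operating_mode == "platform_managed":
        return "shared-service-sre"
    return "tenant-runtime-operations"
```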

Rollout Direction

  1. start with tenant_dedicated
  2. validate reference apps there first
  3. add platform_managed selectively after:
    • IAM and policy model are proven
    • billing confidence is strong
    • support/runbooks are mature
    • runtime isolation evidence is credible

Examples

Slurm

  1. first mode: tenant_dedicated
  2. control plane scope may be project for environment-isolated deployments or tenant for shared tenant schedulers
  3. project app instance may consume a project-scoped or tenant-scoped Slurm control plane depending on policy

Ray

  1. first mode: tenant_dedicated
  2. control plane scope may be project or tenant
  3. future possible managed mode after maturity

Model Serving

  1. first mode may be tenant_dedicated for enterprise/private models
  2. later selective platform_managed tiers for shared inference

Required Follow-On Adjustments

  1. OpenAPI app-instance schemas should reserve operating-mode fields now.
  2. AsyncAPI app lifecycle payloads should carry operating-mode fields where useful.
  3. App control-plane docs should distinguish project ownership from runtime control-plane scope.
  4. Builder/operator guidance should explicitly support both modes.

Related Documents

  1. doc/architecture/App_Control_Plane_v1.md
  2. doc/architecture/App_Runtime_Instance_Lifecycle_v1.md
  3. doc/architecture/Build_an_App_for_GPUaaS_v1.md
  4. doc/architecture/Scheduler_as_Platform_App_v1.md
  5. doc/architecture/Allocation_Node_Placement_v1.md
  6. doc/architecture/App_Tenant_Shared_Attachment_Model_v1.md