App Runtime Instance Lifecycle v1

Purpose

Define the contract-first lifecycle for project-scoped app instances so teams can deploy and operate apps without bypassing IAM, policy, audit, or observability controls.

Scope

Includes:
  1. app instance lifecycle states and transitions
  2. REST endpoint surface (v1 target)
  3. domain event surface (apps.instance.*)
  4. authorization and audit requirements
  5. observability and incident triage requirements
  6. effective runtime operating-mode metadata

Excludes:
  1. provider-specific runtime implementation details (K8s/Slurm/Ray specifics)
  2. MaaS internals
  3. app-specific custom workflows

Companion operating-mode model:
  - doc/architecture/App_Runtime_Operating_Modes_v1.md

Lifecycle State Model

Canonical states:
  1. requested
  2. deploying
  3. running
  4. upgrading
  5. rolling_back
  6. decommissioning
  7. decommissioned
  8. failed

Terminal states:
  1. decommissioned
  2. failed (no automatic recovery; an instance leaves this state only via an explicit retry/redeploy action or via decommission, per the transition rules)

Transition Rules

  1. requested -> deploying -> running
  2. running -> upgrading -> running
  3. running -> rolling_back -> running
  4. running|failed -> decommissioning -> decommissioned
  5. any in-flight state may transition to failed on irrecoverable error
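
The rules above can be read as a transition table. The following is a minimal sketch, not the platform implementation: the names State, ALLOWED, IN_FLIGHT, and can_transition are hypothetical, and the set of "in-flight" states is an assumption because the rules do not enumerate it. The explicit retry/redeploy path out of failed is left out, since its target state is not specified here.

```python
from enum import Enum


class State(str, Enum):
    REQUESTED = "requested"
    DEPLOYING = "deploying"
    RUNNING = "running"
    UPGRADING = "upgrading"
    ROLLING_BACK = "rolling_back"
    DECOMMISSIONING = "decommissioning"
    DECOMMISSIONED = "decommissioned"
    FAILED = "failed"


# Transitions taken directly from rules 1-4.
ALLOWED = {
    State.REQUESTED: {State.DEPLOYING},
    State.DEPLOYING: {State.RUNNING},
    State.RUNNING: {State.UPGRADING, State.ROLLING_BACK, State.DECOMMISSIONING},
    State.UPGRADING: {State.RUNNING},
    State.ROLLING_BACK: {State.RUNNING},
    State.FAILED: {State.DECOMMISSIONING},
    State.DECOMMISSIONING: {State.DECOMMISSIONED},
    State.DECOMMISSIONED: set(),
}

# Assumption: "in-flight" (rule 5) means these transitional states.
IN_FLIGHT = {
    State.DEPLOYING,
    State.UPGRADING,
    State.ROLLING_BACK,
    State.DECOMMISSIONING,
}
for state in IN_FLIGHT:
    ALLOWED[state] = ALLOWED[state] | {State.FAILED}


def can_transition(current: State, target: State) -> bool:
    """Return True when the table permits moving from current to target."""
    return target in ALLOWED.get(current, set())
```

A server-side guard would typically call can_transition before persisting a state change and reject the request with the canonical error envelope otherwise.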

Operating-Mode Metadata

App lifecycle does not imply a single deployment topology.

Effective instance metadata should include:
  1. operating_mode (tenant_dedicated | platform_managed)
  2. control_plane_scope (project | tenant | platform)
  3. runtime_backend (k8s | slurm | ray | bare_metal)
  4. tenant_boundary_mode (tenant_isolated | shared_service)

Rules:
  1. app instance ownership remains project-scoped
  2. runtime control plane may be project-, tenant-, or platform-scoped
  3. server computes effective mode/scope from app policy and backend rules
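
As an illustration of the metadata shape, the sketch below validates the four fields against the value sets listed above. The class name EffectiveRuntimeMetadata and the validation approach are assumptions for illustration; how the server derives these values from app policy and backend rules is not specified here and is not modeled.

```python
from dataclasses import dataclass

# Allowed values copied from the metadata list above.
OPERATING_MODES = {"tenant_dedicated", "platform_managed"}
CONTROL_PLANE_SCOPES = {"project", "tenant", "platform"}
RUNTIME_BACKENDS = {"k8s", "slurm", "ray", "bare_metal"}
TENANT_BOUNDARY_MODES = {"tenant_isolated", "shared_service"}


@dataclass(frozen=True)
class EffectiveRuntimeMetadata:
    """Hypothetical container for server-computed operating-mode metadata."""

    operating_mode: str
    control_plane_scope: str
    runtime_backend: str
    tenant_boundary_mode: str

    def __post_init__(self) -> None:
        checks = {
            "operating_mode": (self.operating_mode, OPERATING_MODES),
            "control_plane_scope": (self.control_plane_scope, CONTROL_PLANE_SCOPES),
            "runtime_backend": (self.runtime_backend, RUNTIME_BACKENDS),
            "tenant_boundary_mode": (self.tenant_boundary_mode, TENANT_BOUNDARY_MODES),
        }
        for field_name, (value, allowed) in checks.items():
            if value not in allowed:
                raise ValueError(f"{field_name}={value!r} is not one of {sorted(allowed)}")
```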

API Surface (v1 target)

Project-scoped operations:
  1. GET /api/v1/projects/{project_id}/apps/instances
  2. POST /api/v1/projects/{project_id}/apps/instances
  3. GET /api/v1/projects/{project_id}/apps/instances/{instance_id}
  4. POST /api/v1/projects/{project_id}/apps/instances/{instance_id}/upgrade
  5. POST /api/v1/projects/{project_id}/apps/instances/{instance_id}/rollback
  6. POST /api/v1/projects/{project_id}/apps/instances/{instance_id}/decommission

Admin/operator read-only surface:
  1. GET /api/v1/admin/apps/instances
  2. GET /api/v1/admin/apps/instances/{instance_id}

Contract requirements:
  1. canonical error envelope for all failures (code, message, correlation_id, details)
  2. idempotency-key semantics on mutation endpoints
  3. explicit X-Project-ID context enforcement where applicable
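
The three contract requirements can be sketched in a framework-neutral way as follows. Everything beyond the four envelope field names and the X-Project-ID header is an assumption: the error codes, the helper names, the choice to require the header to match the path project_id, and the in-memory replay cache (a real service would persist idempotency keys).

```python
from typing import Any, Callable, Dict, Optional, Tuple


def error_envelope(code: str, message: str, correlation_id: str,
                   details: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build the four-field error envelope named in the contract requirements."""
    return {
        "code": code,
        "message": message,
        "correlation_id": correlation_id,
        "details": details if details is not None else {},
    }


def check_project_context(path_project_id: str, header_project_id: Optional[str],
                          correlation_id: str) -> Optional[Dict[str, Any]]:
    """Return an error envelope if X-Project-ID is missing or disagrees with the path."""
    if header_project_id is None:
        # Hypothetical error code.
        return error_envelope("project_context_missing",
                              "X-Project-ID header is required", correlation_id)
    if header_project_id != path_project_id:
        # Hypothetical error code.
        return error_envelope(
            "project_context_mismatch",
            "X-Project-ID does not match the project in the request path",
            correlation_id,
            {"path_project_id": path_project_id, "header_project_id": header_project_id},
        )
    return None


class IdempotencyStore:
    """In-memory replay cache keyed by (project_id, idempotency key)."""

    def __init__(self) -> None:
        self._responses: Dict[Tuple[str, str], Any] = {}

    def run_once(self, project_id: str, key: str, handler: Callable[[], Any]) -> Any:
        """Run handler on first use of the key; replay the stored response afterwards."""
        cache_key = (project_id, key)
        if cache_key not in self._responses:
            self._responses[cache_key] = handler()
        return self._responses[cache_key]
```

Scoping the cache key by project keeps one project's keys from colliding with another's, which fits the project-scoped ownership rule.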

Event Surface (v1 target)

All events use the canonical envelope defined in doc/api/asyncapi.draft.yaml.

Event types:
  1. apps.instance.requested
  2. apps.instance.deploying
  3. apps.instance.running
  4. apps.instance.upgrade_requested
  5. apps.instance.upgraded
  6. apps.instance.rollback_requested
  7. apps.instance.rolled_back
  8. apps.instance.decommission_requested
  9. apps.instance.decommissioned
  10. apps.instance.failed

Outbox requirement:
  1. each lifecycle state change and its outbox row must commit in the same DB transaction
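
A minimal sketch of the outbox requirement using the Python standard-library sqlite3 module. The table names (app_instances, outbox), column layout, and payload shape are hypothetical; the real event envelope is defined in doc/api/asyncapi.draft.yaml and is not reproduced here. The point shown is only that the state update and the outbox insert share one transaction, so a failure in either rolls back both.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical instance and outbox tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE app_instances (id TEXT PRIMARY KEY, state TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def change_state(conn: sqlite3.Connection, instance_id: str,
                 new_state: str, event_type: str) -> None:
    """Update the instance state and write the outbox row atomically."""
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        cur = conn.execute(
            "UPDATE app_instances SET state = ? WHERE id = ?",
            (new_state, instance_id),
        )
        if cur.rowcount != 1:
            raise LookupError(f"unknown app instance: {instance_id}")
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            (event_type, json.dumps({"instance_id": instance_id, "state": new_state})),
        )
```

A separate relay process would then read the outbox table and publish each row as an apps.instance.* event.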

Authorization Baseline

  1. all instance mutations are project-scoped
  2. the actor must satisfy the project role permissions defined in the role-policy model
  3. platform-admin paths are explicit and auditable; no hidden bypass
  4. action authorization is evaluated against the canonical decision interface (actor, tenant, project, action, resource)
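
The decision interface in item 4 can be sketched as a single function over the five named inputs. The role names, the binding table, and the default-deny behavior are assumptions for illustration; the actual role-policy model is defined elsewhere, and tenant and resource checks are carried in the request but not evaluated in this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRequest:
    """The five inputs of the canonical decision interface."""

    actor: str
    tenant: str
    project: str
    action: str
    resource: str


# Hypothetical role-policy data; real roles come from the role-policy model.
ROLE_PERMISSIONS = {
    "project_admin": {"deploy", "upgrade", "rollback", "decommission"},
    "project_viewer": set(),
}
ROLE_BINDINGS = {
    ("alice", "proj-1"): "project_admin",
    ("bob", "proj-1"): "project_viewer",
}


def authorize(request: DecisionRequest) -> bool:
    """Default-deny: allow only when the actor's project role grants the action."""
    role = ROLE_BINDINGS.get((request.actor, request.project))
    if role is None:
        return False
    return request.action in ROLE_PERMISSIONS.get(role, set())
```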

Audit Baseline

Every privileged mutation writes audit_logs with:
  1. actor identity/role
  2. target app_instance_id
  3. action (deploy, upgrade, rollback, decommission)
  4. result
  5. correlation_id
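
One possible shape for an audit_logs row, sketched as a plain dictionary builder. The function name and the split of "actor identity/role" into actor_id and actor_role are assumptions; the allowed values for result are not specified in this document, so the sketch accepts any string.

```python
from typing import Dict

# Actions listed in the audit baseline.
AUDITED_ACTIONS = {"deploy", "upgrade", "rollback", "decommission"}


def build_audit_record(actor_id: str, actor_role: str, app_instance_id: str,
                       action: str, result: str, correlation_id: str) -> Dict[str, str]:
    """Assemble the fields required for one audit_logs entry."""
    if action not in AUDITED_ACTIONS:
        raise ValueError(f"unsupported audit action: {action!r}")
    return {
        "actor_id": actor_id,
        "actor_role": actor_role,
        "app_instance_id": app_instance_id,
        "action": action,
        "result": result,
        "correlation_id": correlation_id,
    }
```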

Observability Baseline

Required logging fields:
  1. correlation_id
  2. trace_id
  3. org_id
  4. project_id
  5. resource_name (when resolved)

Triage baseline:
  1. start from UI/API correlation_id
  2. pivot to logs by correlation_id
  3. pivot to trace by trace_id
  4. reconcile with apps.instance.* event timeline
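
To make the log pivot in the triage steps workable, each log line needs the required fields in a machine-readable form. The sketch below uses the standard-library logging module with a JSON formatter; the logger name, the JSON layout, and the choice to emit null for a missing required field are assumptions, not a prescribed format.

```python
import io
import json
import logging

# Fields that must appear on every line; resource_name is added only when resolved.
REQUIRED_FIELDS = ("correlation_id", "trace_id", "org_id", "project_id")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the required fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        for name in REQUIRED_FIELDS:
            entry[name] = getattr(record, name, None)
        resource_name = getattr(record, "resource_name", None)
        if resource_name is not None:
            entry["resource_name"] = resource_name
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    """Return a logger (hypothetical name) that writes JSON lines to stream."""
    logger = logging.getLogger("apps.instance")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Usage: context fields are passed through the standard "extra" argument.
buffer = io.StringIO()
log = build_logger(buffer)
log.info(
    "instance upgrade started",
    extra={"correlation_id": "c-1", "trace_id": "t-1", "org_id": "o-1", "project_id": "p-1"},
)
```

Filtering these lines by correlation_id gives the log view for step 2, and the trace_id on the same lines provides the pivot for step 3.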

v1 Deliverables

  1. OpenAPI contract updates for endpoints/schemas
  2. AsyncAPI contract updates for lifecycle events
  3. queue task split for A/B/C implementation and ops runbook coverage
  4. local smoke path covering create -> running -> decommission lifecycle