Runbook: App Platform Operator Incident

Trigger

An app team reports deploy/create/upgrade/decommission failures while using the GPUaaS app platform.
A project-scoped operator service account can browse app instances but cannot mutate them.
Scheduler-style reference apps (for example Slurm) fail at the control-plane boundary even when the backend runtime is healthy.

Required Context

correlation_id from API/UI/CLI/SDK error envelope.
trace_id if present.
Scope and identity: org_id, project_id, app_slug, app_instance_id (if one exists), operator_service_account_id (when automation is involved).
Effective app-runtime metadata: operating_mode, control_plane_scope, runtime_backend, tenant_boundary_mode.

Correlation-First Triage

API logs by correlation_id:
{service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"
Outbox relay logs by correlation_id:
{service="gpuaas-outbox-relay"} | json | correlation_id="<CORRELATION_ID>"
App runtime worker logs by correlation_id:
{service="gpuaas-app-runtime-worker"} | json | correlation_id="<CORRELATION_ID>"
If a scheduler/operator adapter is involved, pivot by app_instance_id, app_slug, and operator_service_account_id.
Open Tempo trace by trace_id and verify lineage:
expected baseline: gpuaas-api -> gpuaas-outbox-relay -> gpuaas-app-runtime-worker
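
The three log queries above differ only in the service label, so they can be generated from a single correlation ID. A minimal sketch, assuming the service label values shown above; `build_triage_queries` is a hypothetical helper name and not part of any GPUaaS tooling.

```python
# Service labels copied from the queries above, in the expected trace order.
TRIAGE_SERVICES = (
    "gpuaas-api",
    "gpuaas-outbox-relay",
    "gpuaas-app-runtime-worker",
)


def build_triage_queries(correlation_id: str) -> dict[str, str]:
    """Return one LogQL query per service, keyed by service name (hypothetical helper)."""
    # Reject values that would break out of the quoted LogQL string.
    if not correlation_id or '"' in correlation_id:
        raise ValueError("correlation_id must be non-empty and must not contain quotes")
    return {
        service: f'{{service="{service}"}} | json | correlation_id="{correlation_id}"'
        for service in TRIAGE_SERVICES
    }


if __name__ == "__main__":
    for service, query in build_triage_queries("<CORRELATION_ID>").items():
        print(f"{service}: {query}")
```

Running the queries in lineage order makes it easier to see at which hop the correlation_id stops appearing.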

Boundary Validation

Confirm the app instance is project-owned and the project_id in the request matches the instance record.
Confirm the operator service account belongs to the same project.
Confirm the requested action is permitted for the actor and the service account allowlist.
Confirm the effective runtime metadata is coherent:
operating_mode=tenant_dedicated with control_plane_scope=project or control_plane_scope=tenant
operating_mode=platform_managed with control_plane_scope=platform
For scheduler reference apps, confirm no scheduler-specific request was routed through core allocation handlers.
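
The project-match and metadata-coherence checks above can be expressed as a small validation function. This is a sketch only: it assumes the coherence pairs above combine operating_mode with control_plane_scope, and the function name and parameters are hypothetical.

```python
# Allowed control_plane_scope values per operating_mode.
# Assumption: the pairs listed above are (operating_mode, control_plane_scope).
COHERENT_SCOPES = {
    "tenant_dedicated": {"project", "tenant"},
    "platform_managed": {"platform"},
}


def boundary_violations(
    request_project_id: str,
    instance_project_id: str,
    operating_mode: str,
    control_plane_scope: str,
) -> list[str]:
    """Return human-readable boundary violations; an empty list means the checks pass."""
    violations = []
    if request_project_id != instance_project_id:
        violations.append("request project_id does not match the instance record")
    # Unknown operating modes have no allowed scopes, so they are always reported.
    if control_plane_scope not in COHERENT_SCOPES.get(operating_mode, set()):
        violations.append("incoherent operating_mode/control_plane_scope pair")
    return violations
```

Returning a list rather than a boolean keeps every failed check visible in the incident notes.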

Common Failure Classes

insufficient_permissions: actor lacks project-scoped permission or service-account allowlist entry.
invalid_request: missing project context, invalid operating_mode, invalid lifecycle action, or invalid operator service account reference.
service_unavailable: app runtime worker, outbox relay, or backend runtime path unavailable.
internal_error: unexpected defect in API/runtime translation path.
proxied UI bootstrap failure: route is present but browser bootstrap fails due to browser-session, asset path, or launcher behavior. Use the Proxied App UI Incident Runbook before debugging the upstream app itself.

Service Account Specific Checks

Verify operator_service_account_id is active and belongs to the target project.
Confirm the service account token was minted recently and has not expired.
Confirm the action is on the explicit same-project service-account allowlist.
Do not debug this as a user-login incident unless the actor is a user token rather than a service account.
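
The service-account checks above can be run as one pass over a snapshot of the account. A minimal sketch: the `ServiceAccountView` fields and the `service_account_findings` helper are hypothetical names chosen for illustration, not the platform's actual data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ServiceAccountView:
    """Hypothetical snapshot of the values an operator would look up by hand."""
    project_id: str
    active: bool
    token_expires_at: datetime  # timezone-aware
    allowed_actions: frozenset = field(default_factory=frozenset)


def service_account_findings(
    sa: ServiceAccountView,
    target_project_id: str,
    action: str,
    now: Optional[datetime] = None,
) -> list[str]:
    """Return the failed checks in the order the runbook lists them."""
    now = now or datetime.now(timezone.utc)
    findings = []
    if not sa.active:
        findings.append("service account is not active")
    if sa.project_id != target_project_id:
        findings.append("service account belongs to a different project")
    # Only expiry is checked here; how recent a mint must be is left to the operator.
    if sa.token_expires_at <= now:
        findings.append("service account token has expired")
    if action not in sa.allowed_actions:
        findings.append("action is not on the service-account allowlist")
    return findings
```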

Operating-Mode Specific Checks

`tenant_dedicated`

Verify tenant/project boundary remains explicit in logs and event payloads.
Verify support evidence does not assume a shared control plane.
If control_plane_scope=project, treat project as an environment boundary candidate (dev|test|stage|prod).

`platform_managed`

Verify the incident is not caused by cross-tenant/shared-service saturation.
Capture service-level evidence, not only project-local evidence.
Escalate to managed-service owner if multiple tenants/projects show the same failure shape.
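
The escalation condition above amounts to checking whether one failure shape spans more than one project. A minimal sketch, assuming a failure shape can be approximated by its failure class code and that two distinct projects is enough to escalate; both are assumptions made for illustration, and `shared_failure_shapes` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Iterable


def shared_failure_shapes(
    failures: Iterable[tuple[str, str]],
    min_projects: int = 2,  # assumed threshold, not a documented platform value
) -> dict[str, list[str]]:
    """Group (project_id, failure_class) pairs and keep shapes seen in several projects."""
    projects_by_shape = defaultdict(set)
    for project_id, failure_class in failures:
        projects_by_shape[failure_class].add(project_id)
    return {
        shape: sorted(projects)
        for shape, projects in projects_by_shape.items()
        if len(projects) >= min_projects
    }
```

A non-empty result suggests collecting service-level evidence and contacting the managed-service owner rather than continuing project-local debugging.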

Recovery Guidance

Permission or allowlist issue: correct role/service-account scope and retry.
App runtime async issue: restore queue/outbox/worker health first, then reissue the lifecycle action.
Invalid mode/scope request: retry with server-supported app/runtime defaults.
Repeat scheduler/operator failure: capture evidence and file a platform primitive gap if the issue requires core special-casing.

Evidence to Capture

correlation_id, trace_id, and exact action attempted.
app_slug, app_instance_id, operator_service_account_id.
Effective runtime metadata: operating_mode, control_plane_scope, runtime_backend.
API, outbox, and app-runtime worker log evidence.
Whether the issue is IAM/allowlist, lifecycle state, async delivery, runtime/backend adapter, or a platform primitive gap.
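
The evidence items above fit naturally into one structured record, which makes gaps obvious before a ticket is filed. A sketch only: the `IncidentEvidence` and `IssueType` names are hypothetical, and treating trace_id, app_instance_id, and operator_service_account_id as optional follows the "if present", "if one exists", and "when automation is involved" qualifiers in Required Context.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class IssueType(Enum):
    IAM_ALLOWLIST = "IAM/allowlist"
    LIFECYCLE_STATE = "lifecycle state"
    ASYNC_DELIVERY = "async delivery"
    RUNTIME_BACKEND_ADAPTER = "runtime/backend adapter"
    PLATFORM_PRIMITIVE_GAP = "platform primitive gap"


@dataclass
class IncidentEvidence:
    """Hypothetical container for the evidence this runbook asks for."""
    correlation_id: str
    action: str  # exact action attempted
    app_slug: str
    operating_mode: str
    control_plane_scope: str
    runtime_backend: str
    issue_type: IssueType
    trace_id: Optional[str] = None
    app_instance_id: Optional[str] = None
    operator_service_account_id: Optional[str] = None
    log_excerpts: dict[str, str] = field(default_factory=dict)  # keyed by service

    def missing_required(self) -> list[str]:
        """Names of required string fields that are still empty."""
        required = (
            "correlation_id", "action", "app_slug",
            "operating_mode", "control_plane_scope", "runtime_backend",
        )
        return [name for name in required if not getattr(self, name)]
```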

Escalation Rule

If the only way to make the app work would be scheduler-specific or app-specific branching in core handlers, stop and raise a platform defect. Do not work around it in the runbook as a normal operating step.