Runbook: App Platform Operator Incident
Trigger
- An app team reports deploy/create/upgrade/decommission failures while using the GPUaaS app platform.
- A project-scoped operator service account can browse app instances but cannot mutate them.
- Scheduler-style reference apps (for example Slurm) fail at the control-plane boundary even when the backend runtime is healthy.
Required Context
- correlation_id from API/UI/CLI/SDK error envelope.
- trace_id if present.
- Scope and identity:
  - org_id
  - project_id
  - app_slug
  - app_instance_id if one exists
  - operator_service_account_id when automation is involved
- Effective app-runtime metadata:
  - operating_mode
  - control_plane_scope
  - runtime_backend
  - tenant_boundary_mode
Correlation-First Triage
- API logs by correlation_id: {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"
- Outbox relay logs by correlation_id: {service="gpuaas-outbox-relay"} | json | correlation_id="<CORRELATION_ID>"
- App runtime worker logs by correlation_id: {service="gpuaas-app-runtime-worker"} | json | correlation_id="<CORRELATION_ID>"
- If a scheduler/operator adapter is involved, pivot by app_instance_id, app_slug, and operator_service_account_id.
- Open Tempo trace by trace_id and verify lineage:
  - expected baseline: gpuaas-api -> gpuaas-outbox-relay -> gpuaas-app-runtime-worker
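The three per-service log queries above differ only in the service label, so they can be generated from one template. The following is a minimal sketch in Python; the function name build_logql_queries is hypothetical, and the query shape and service names are copied from the list above rather than from any platform tooling.

```python
# Service label values as they appear in the triage queries above,
# in the order of the expected trace lineage.
SERVICES = ("gpuaas-api", "gpuaas-outbox-relay", "gpuaas-app-runtime-worker")


def build_logql_queries(correlation_id: str) -> dict[str, str]:
    """Return one LogQL query per service, filtered by correlation_id."""
    if not correlation_id or '"' in correlation_id:
        # Refuse empty or quote-bearing IDs so the generated query stays well-formed.
        raise ValueError("correlation_id must be non-empty and contain no double quotes")
    return {
        service: f'{{service="{service}"}} | json | correlation_id="{correlation_id}"'
        for service in SERVICES
    }


if __name__ == "__main__":
    for service, query in build_logql_queries("abc-123").items():
        print(f"{service}: {query}")
```

Running the queries in lineage order makes it easier to see at which hop the correlation_id stops appearing.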
Boundary Validation
- Confirm the app instance is project-owned and the project_id in the request matches the instance record.
- Confirm the operator service account belongs to the same project.
- Confirm the requested action is permitted for the actor and the service account allowlist.
- Confirm the effective runtime metadata is coherent:
  - tenant_dedicated + project|tenant
  - platform_managed + platform
- For scheduler reference apps, confirm no scheduler-specific request was routed through core allocation handlers.
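The coherence rule for operating_mode and control_plane_scope can be expressed as a small lookup. This is an illustrative sketch only: the names COHERENT_SCOPES and is_coherent are hypothetical, and the allowed pairings are taken directly from the list above.

```python
# Coherent pairings as listed above: operating_mode -> allowed control_plane_scope values.
COHERENT_SCOPES = {
    "tenant_dedicated": {"project", "tenant"},
    "platform_managed": {"platform"},
}


def is_coherent(operating_mode: str, control_plane_scope: str) -> bool:
    """True when the scope is one of the values allowed for the mode.

    Unknown operating modes are treated as incoherent.
    """
    return control_plane_scope in COHERENT_SCOPES.get(operating_mode, set())


if __name__ == "__main__":
    print(is_coherent("tenant_dedicated", "project"))   # coherent pairing
    print(is_coherent("platform_managed", "project"))   # incoherent pairing
```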
Common Failure Classes
- insufficient_permissions
  - actor lacks project-scoped permission or service-account allowlist entry.
- invalid_request
  - missing project context, invalid operating_mode, invalid lifecycle action, or invalid operator service account reference.
- service_unavailable
  - app runtime worker, outbox relay, or backend runtime path unavailable.
- internal_error
  - unexpected defect in API/runtime translation path.
- proxied UI bootstrap failure
  - route is present but browser bootstrap fails due to browser-session, asset path, or launcher behavior
  - use Proxied App UI Incident Runbook before debugging the upstream app itself
Service Account Specific Checks
- Verify operator_service_account_id is active and belongs to the target project.
- Confirm the service account token was minted recently and has not expired.
- Confirm the action is on the explicit same-project service-account allowlist.
- Do not debug this as a user-login incident unless the actor is a user token rather than a service account.
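The service-account checks above can be run as one pass that reports every failed check instead of stopping at the first. The sketch below assumes a hypothetical ServiceAccount record shape; none of its field names come from the platform API. It checks token expiry only, so "minted recently" still needs a manual look at the token issue time.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ServiceAccount:  # hypothetical record shape, for illustration only
    account_id: str
    project_id: str
    active: bool
    token_expires_at: datetime  # timezone-aware expiry timestamp
    allowed_actions: set[str] = field(default_factory=set)


def service_account_findings(sa, target_project_id, action, now=None):
    """Return a list of failed checks; an empty list means all checks passed."""
    now = now or datetime.now(timezone.utc)
    findings = []
    if not sa.active:
        findings.append("service account is not active")
    if sa.project_id != target_project_id:
        findings.append("service account belongs to a different project")
    if sa.token_expires_at <= now:
        findings.append("service account token has expired")
    if action not in sa.allowed_actions:
        findings.append(f"action {action!r} is not on the allowlist")
    return findings
```

Collecting all findings in one pass avoids fixing an allowlist entry only to discover afterwards that the account sits in the wrong project.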
Operating-Mode Specific Checks
tenant_dedicated
- Verify tenant/project boundary remains explicit in logs and event payloads.
- Verify support evidence does not assume a shared control plane.
- If control_plane_scope=project, treat project as an environment boundary candidate (dev|test|stage|prod).
platform_managed
- Verify the incident is not caused by cross-tenant/shared-service saturation.
- Capture service-level evidence and not only project-local evidence.
- Escalate to managed-service owner if multiple tenants/projects show the same failure shape.
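The escalation condition for platform_managed ("multiple tenants/projects show the same failure shape") can be checked mechanically once failures are tabulated. A rough sketch follows, assuming each failure has been reduced by hand to a (project_id, failure_shape) pair. Both the pair format and the shape strings are hypothetical, as is the default threshold of two projects.

```python
from collections import defaultdict


def shared_failure_shapes(failures, min_projects=2):
    """Group failures by shape and keep shapes seen in at least min_projects projects.

    failures: iterable of (project_id, failure_shape) pairs.
    Returns {failure_shape: sorted list of affected project_ids}.
    """
    projects_by_shape = defaultdict(set)
    for project_id, shape in failures:
        projects_by_shape[shape].add(project_id)
    return {
        shape: sorted(projects)
        for shape, projects in projects_by_shape.items()
        if len(projects) >= min_projects
    }


if __name__ == "__main__":
    observed = [
        ("proj-a", "service_unavailable:app-runtime-worker"),
        ("proj-b", "service_unavailable:app-runtime-worker"),
        ("proj-a", "insufficient_permissions"),
    ]
    # Shapes returned here are candidates for escalation to the managed-service owner.
    print(shared_failure_shapes(observed))
```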
Recovery Guidance
- Permission or allowlist issue:
  - correct role/service-account scope and retry.
- App runtime async issue:
  - restore queue/outbox/worker health first, then reissue the lifecycle action.
- Invalid mode/scope request:
  - retry with server-supported app/runtime defaults.
- Repeat scheduler/operator failure:
  - capture evidence and file a platform primitive gap if the issue requires core special-casing.
Evidence to Capture
- correlation_id, trace_id, and exact action attempted.
- app_slug, app_instance_id, operator_service_account_id.
- Effective runtime metadata: operating_mode, control_plane_scope, runtime_backend.
- API, outbox, and app-runtime worker log evidence.
- Whether the issue is:
  - IAM/allowlist
  - lifecycle state
  - async delivery
  - runtime/backend adapter
  - platform primitive gap
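One way to keep the evidence list complete is to collect it in a single record that reports which items are still missing. The sketch below is an assumption-laden illustration: the IncidentEvidence class, the log_excerpts field, and the snake_case issue-class labels are hypothetical names chosen to mirror the list above, not an existing schema.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

# Hypothetical labels mirroring the five issue classes listed above.
ISSUE_CLASSES = (
    "iam_allowlist",
    "lifecycle_state",
    "async_delivery",
    "runtime_backend_adapter",
    "platform_primitive_gap",
)


@dataclass
class IncidentEvidence:
    correlation_id: str
    action_attempted: str
    app_slug: str
    issue_class: str
    trace_id: Optional[str] = None
    app_instance_id: Optional[str] = None
    operator_service_account_id: Optional[str] = None
    operating_mode: Optional[str] = None
    control_plane_scope: Optional[str] = None
    runtime_backend: Optional[str] = None
    # Log evidence keyed by source, e.g. "api", "outbox", "app_runtime_worker".
    log_excerpts: dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        if self.issue_class not in ISSUE_CLASSES:
            raise ValueError(f"unknown issue_class: {self.issue_class!r}")

    def missing(self) -> list[str]:
        """Names of fields that are still empty (None, empty string, or empty dict)."""
        return [name for name, value in asdict(self).items() if not value]
```

Some items can legitimately stay empty, for example app_instance_id when no instance exists yet, so the missing() result is a prompt for review rather than a hard gate.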
Escalation Rule
If the only way to make the app work would be scheduler-specific or app-specific branching in core handlers, stop and raise a platform defect. Do not work around it in the runbook as a normal operating step.