Runbook: App Platform Operator Incident

Trigger

  1. An app team reports deploy/create/upgrade/decommission failures while using the GPUaaS app platform.
  2. A project-scoped operator service account can browse app instances but cannot mutate them.
  3. Scheduler-style reference apps (for example Slurm) fail at the control-plane boundary even when the backend runtime is healthy.

Required Context

  1. correlation_id from the API/UI/CLI/SDK error envelope.
  2. trace_id if present.
  3. Scope and identity:
     - org_id
     - project_id
     - app_slug
     - app_instance_id if one exists
     - operator_service_account_id when automation is involved
  4. Effective app-runtime metadata:
     - operating_mode
     - control_plane_scope
     - runtime_backend
     - tenant_boundary_mode
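
The context list above can be checked mechanically before triage starts. The following is a minimal sketch, assuming the context arrives as a flat dictionary keyed by the field names above; the helper name missing_context and the automation_involved flag are hypothetical and not part of the platform.

```python
from typing import Dict, List, Optional

# Fields the runbook lists without a condition. trace_id and app_instance_id
# are conditional ("if present", "if one exists") and are not enforced here.
ALWAYS_REQUIRED = (
    "correlation_id",
    "org_id",
    "project_id",
    "app_slug",
    "operating_mode",
    "control_plane_scope",
    "runtime_backend",
    "tenant_boundary_mode",
)


def missing_context(ctx: Dict[str, Optional[str]],
                    automation_involved: bool = False) -> List[str]:
    """Return the required field names that are absent or empty in ctx."""
    required = list(ALWAYS_REQUIRED)
    if automation_involved:
        # Only required when a service account performed the action.
        required.append("operator_service_account_id")
    return [name for name in required if not ctx.get(name)]
```

An empty result means the incident record carries enough context to begin correlation-first triage.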

Correlation-First Triage

  1. API logs by correlation_id:
       {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"
  2. Outbox relay logs by correlation_id:
       {service="gpuaas-outbox-relay"} | json | correlation_id="<CORRELATION_ID>"
  3. App runtime worker logs by correlation_id:
       {service="gpuaas-app-runtime-worker"} | json | correlation_id="<CORRELATION_ID>"
  4. If a scheduler/operator adapter is involved, pivot by app_instance_id, app_slug, and operator_service_account_id.
  5. Open the Tempo trace by trace_id and verify lineage:
     - expected baseline: gpuaas-api -> gpuaas-outbox-relay -> gpuaas-app-runtime-worker
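
The three per-service queries above differ only in the service label, so they can be generated from one correlation_id. The following is a minimal sketch; the service names and query shape are taken from the steps above, while the helper name correlation_queries is hypothetical.

```python
from typing import List

# Service names as listed in the triage steps, in expected lineage order.
TRIAGE_SERVICES = (
    "gpuaas-api",
    "gpuaas-outbox-relay",
    "gpuaas-app-runtime-worker",
)


def correlation_queries(correlation_id: str) -> List[str]:
    """Build one log query per service for the given correlation_id."""
    if not correlation_id or '"' in correlation_id:
        # A double quote would break out of the quoted filter value.
        raise ValueError("correlation_id must be non-empty and contain no double quotes")
    return [
        f'{{service="{service}"}} | json | correlation_id="{correlation_id}"'
        for service in TRIAGE_SERVICES
    ]
```

Running the queries in this order mirrors the expected trace lineage, so the first service with no matching log lines is a reasonable place to look for the break.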

Boundary Validation

  1. Confirm the app instance is project-owned and the project_id in the request matches the instance record.
  2. Confirm the operator service account belongs to the same project.
  3. Confirm the requested action is permitted for the actor and the service account allowlist.
  4. Confirm the effective runtime metadata is coherent:
     - tenant_dedicated + project|tenant
     - platform_managed + platform
  5. For scheduler reference apps, confirm no scheduler-specific request was routed through core allocation handlers.
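
The ownership and coherence checks above can be expressed as a single function that returns every violation it finds. The following is a minimal sketch. It assumes that the "project|tenant" and "platform" values in the coherence list are control_plane_scope values paired with operating_mode; the BoundaryInput type and boundary_violations helper are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# ASSUMPTION: the second value in each coherent pair is control_plane_scope.
COHERENT_SCOPES = {
    "tenant_dedicated": {"project", "tenant"},
    "platform_managed": {"platform"},
}


@dataclass(frozen=True)
class BoundaryInput:
    request_project_id: str
    instance_project_id: str
    service_account_project_id: Optional[str]  # None when no automation is involved
    operating_mode: str
    control_plane_scope: str


def boundary_violations(b: BoundaryInput) -> List[str]:
    """Return a human-readable entry for each failed boundary check."""
    problems: List[str] = []
    if b.request_project_id != b.instance_project_id:
        problems.append("request project_id does not match instance record")
    if (b.service_account_project_id is not None
            and b.service_account_project_id != b.request_project_id):
        problems.append("operator service account belongs to a different project")
    allowed = COHERENT_SCOPES.get(b.operating_mode)
    if allowed is None:
        problems.append(f"unknown operating_mode: {b.operating_mode}")
    elif b.control_plane_scope not in allowed:
        problems.append(
            f"{b.operating_mode} is not coherent with control_plane_scope={b.control_plane_scope}"
        )
    return problems
```

Returning all violations rather than stopping at the first keeps the evidence complete when more than one boundary is wrong at once.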

Common Failure Classes

  1. insufficient_permissions
     - actor lacks project-scoped permission or service-account allowlist entry.
  2. invalid_request
     - missing project context, invalid operating_mode, invalid lifecycle action, or invalid operator service account reference.
  3. service_unavailable
     - app runtime worker, outbox relay, or backend runtime path unavailable.
  4. internal_error
     - unexpected defect in API/runtime translation path.
  5. proxied UI bootstrap failure
     - route is present but browser bootstrap fails due to browser-session, asset path, or launcher behavior.
     - use the Proxied App UI Incident Runbook before debugging the upstream app itself.

Service Account Specific Checks

  1. Verify operator_service_account_id is active and belongs to the target project.
  2. Confirm the service account token was minted recently and has not expired.
  3. Confirm the action is on the explicit same-project service-account allowlist.
  4. Do not debug this as a user-login incident unless the actor is a user token rather than a service account.
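
The service account checks above can be run together against a snapshot of the account. The following is a minimal sketch, assuming the account record exposes an active flag, a project, a token expiry time, and an action allowlist; the ServiceAccount type and its field names are hypothetical and do not describe the platform's actual schema. It does not attempt the "minted recently" check, which depends on a threshold the runbook does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class ServiceAccount:
    id: str
    project_id: str
    active: bool
    token_expires_at: datetime  # expected to be timezone-aware
    allowed_actions: FrozenSet[str]


def service_account_findings(sa: ServiceAccount,
                             target_project_id: str,
                             action: str,
                             now: Optional[datetime] = None) -> List[str]:
    """Return one finding per failed check; an empty list means all passed."""
    now = now or datetime.now(timezone.utc)
    findings: List[str] = []
    if not sa.active:
        findings.append("service account is not active")
    if sa.project_id != target_project_id:
        findings.append("service account does not belong to the target project")
    if sa.token_expires_at <= now:
        findings.append("service account token has expired")
    if action not in sa.allowed_actions:
        findings.append(f"action '{action}' is not on the allowlist")
    return findings
```

Passing now explicitly makes the expiry check reproducible when replaying an incident after the fact.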

Operating-Mode Specific Checks

tenant_dedicated

  1. Verify tenant/project boundary remains explicit in logs and event payloads.
  2. Verify support evidence does not assume a shared control plane.
  3. If control_plane_scope=project, treat project as an environment boundary candidate (dev|test|stage|prod).

platform_managed

  1. Verify the incident is not caused by cross-tenant/shared-service saturation.
  2. Capture service-level evidence, not only project-local evidence.
  3. Escalate to managed-service owner if multiple tenants/projects show the same failure shape.
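
The escalation condition above (multiple tenants/projects showing the same failure shape) can be detected by grouping observed failures. The following is a minimal sketch, assuming each failure has already been reduced to a (project_id, failure_shape) pair; the shape string format and the helper name shared_failure_shapes are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def shared_failure_shapes(failures: Iterable[Tuple[str, str]],
                          min_projects: int = 2) -> Dict[str, List[str]]:
    """Map each failure shape seen in at least min_projects distinct
    projects to the sorted list of affected project IDs."""
    projects_by_shape: Dict[str, Set[str]] = defaultdict(set)
    for project_id, shape in failures:
        projects_by_shape[shape].add(project_id)
    return {
        shape: sorted(projects)
        for shape, projects in projects_by_shape.items()
        if len(projects) >= min_projects
    }
```

A non-empty result suggests shared-service saturation or a managed-service defect rather than a project-local problem, which is the case the runbook routes to the managed-service owner.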

Recovery Guidance

  1. Permission or allowlist issue:
     - correct role/service-account scope and retry.
  2. App runtime async issue:
     - restore queue/outbox/worker health first, then reissue the lifecycle action.
  3. Invalid mode/scope request:
     - retry with server-supported app/runtime defaults.
  4. Repeat scheduler/operator failure:
     - capture evidence and file a platform primitive gap if the issue requires core special-casing.
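
The recovery steps above can be looked up from the error code in the error envelope. The following is a minimal sketch. The recovery text is taken from this runbook, but pairing each error code from Common Failure Classes with a recovery step is an assumption of this sketch, as are the helper name recovery_step and the fallback message for codes with no listed recovery.

```python
from typing import Dict

# ASSUMPTION: each error code maps to exactly one recovery step.
RECOVERY_BY_ERROR: Dict[str, str] = {
    "insufficient_permissions": "correct role/service-account scope and retry",
    "service_unavailable": (
        "restore queue/outbox/worker health first, "
        "then reissue the lifecycle action"
    ),
    "invalid_request": "retry with server-supported app/runtime defaults",
}

ESCALATE = (
    "capture evidence and file a platform primitive gap "
    "if the issue requires core special-casing"
)

# Hypothetical fallback for codes the runbook gives no recovery step for.
NO_LISTED_STEP = "no recovery step listed; capture evidence before retrying"


def recovery_step(error_code: str, repeat_scheduler_failure: bool = False) -> str:
    """Return the suggested recovery step for an error code."""
    if repeat_scheduler_failure:
        # Repeat scheduler/operator failures take priority over the code.
        return ESCALATE
    return RECOVERY_BY_ERROR.get(error_code, NO_LISTED_STEP)
```

Codes such as internal_error fall through to the fallback, since the runbook lists no direct recovery for them.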

Evidence to Capture

  1. correlation_id, trace_id, and exact action attempted.
  2. app_slug, app_instance_id, operator_service_account_id.
  3. Effective runtime metadata: operating_mode, control_plane_scope, runtime_backend.
  4. API, outbox, and app-runtime worker log evidence.
  5. Whether the issue is:
     - IAM/allowlist
     - lifecycle state
     - async delivery
     - runtime/backend adapter
     - platform primitive gap

Escalation Rule

If the only way to make the app work would be scheduler-specific or app-specific branching in core handlers, stop and raise a platform defect. Do not work around it in the runbook as a normal operating step.