Runbook: App Runtime Lifecycle Incident

Trigger

  1. App instance deploy/upgrade/rollback/decommission action fails in UI/API/CLI/SDK.
  2. App instance remains stuck in a transitional state: requested, upgrading, rolling_back, or decommissioning.
  3. Support ticket includes correlation_id for app lifecycle failure.

Required Context

  1. correlation_id from error envelope.
  2. trace_id if present.
  3. Scope identifiers:
     - org_id
     - project_id
     - app_instance_id
     - app_slug

Correlation-First Triage

  1. API logs by correlation_id:
     {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"
  2. Outbox relay logs by correlation_id:
     {service="gpuaas-outbox-relay"} | json | correlation_id="<CORRELATION_ID>"
  3. App runtime worker logs by correlation_id:
     {service="gpuaas-app-runtime-worker"} | json | correlation_id="<CORRELATION_ID>"
  4. Open Tempo trace by trace_id and validate lineage:
     expected services: gpuaas-api -> gpuaas-outbox-relay -> gpuaas-app-runtime-worker
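
The three log queries above differ only in the service label, so they can be generated from a single correlation_id. The following is a minimal Python sketch, not part of any existing tooling: the function name build_triage_queries is hypothetical, and the service names and query shape are copied from the steps above.

```python
# Services named in the triage steps, in the expected lineage order.
SERVICES = ["gpuaas-api", "gpuaas-outbox-relay", "gpuaas-app-runtime-worker"]


def build_triage_queries(correlation_id: str) -> dict[str, str]:
    """Return one log query per service, filtered by correlation_id.

    Hypothetical helper: it only formats the query strings shown in the
    runbook and does not talk to any log backend.
    """
    # Reject values that would break out of the quoted filter string.
    if not correlation_id or '"' in correlation_id:
        raise ValueError("correlation_id must be non-empty and contain no double quotes")
    return {
        service: f'{{service="{service}"}} | json | correlation_id="{correlation_id}"'
        for service in SERVICES
    }


if __name__ == "__main__":
    # Replace the placeholder with the correlation_id from the error envelope.
    for service, query in build_triage_queries("<CORRELATION_ID>").items():
        print(f"{service}: {query}")
```

Running the queries in lineage order helps locate the first service where the correlation_id stops appearing.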

Event Sequence Expectations

  1. Deploy path:
     apps.instance.requested -> apps.instance.running (or apps.instance.failed)
  2. Upgrade path:
     apps.instance.upgrade_requested -> apps.instance.running (or apps.instance.failed)
  3. Rollback path:
     apps.instance.rollback_requested -> apps.instance.running (or apps.instance.failed)
  4. Decommission path:
     apps.instance.decommission_requested -> apps.instance.deleted
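
The expected sequences above can be checked mechanically against the event subjects found in logs or traces. The sketch below assumes the observed events are available as an ordered list of subject strings. The function name check_event_sequence and the result labels are hypothetical; the event names come from the list above.

```python
# Request event and allowed terminal events per lifecycle path,
# taken from the expected sequences in this runbook.
RUNNING_OR_FAILED = {"apps.instance.running", "apps.instance.failed"}
EXPECTED_PATHS = {
    "deploy": ("apps.instance.requested", RUNNING_OR_FAILED),
    "upgrade": ("apps.instance.upgrade_requested", RUNNING_OR_FAILED),
    "rollback": ("apps.instance.rollback_requested", RUNNING_OR_FAILED),
    "decommission": ("apps.instance.decommission_requested", {"apps.instance.deleted"}),
}


def check_event_sequence(path: str, observed: list[str]) -> str:
    """Classify an ordered list of observed event subjects.

    Returns "missing_request" if the request event never appeared,
    "complete" if a terminal event follows the request event,
    and "no_terminal_event" otherwise (a candidate stuck instance).
    """
    request_event, terminal_events = EXPECTED_PATHS[path]
    if request_event not in observed:
        return "missing_request"
    # Only events after the first request event count as its outcome.
    after_request = observed[observed.index(request_event) + 1:]
    if any(event in terminal_events for event in after_request):
        return "complete"
    return "no_terminal_event"
```

A "no_terminal_event" result corresponds to the stuck-state trigger, while "missing_request" suggests the request never left the API or the outbox.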

Data Validation

  1. Confirm app instance scope:
     app_instances.org_id and app_instances.project_id match request scope.
  2. Confirm state transition is valid for requested operation.
  3. Confirm outbox progression:
     pending rows drain to published without growing failed backlog.
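
The outbox progression check can be expressed as a comparison of two row-count snapshots taken some time apart. This is a rough sketch under stated assumptions: the snapshot shape (a dict of counts keyed by the statuses pending, published, and failed) and the function name outbox_is_draining are hypothetical, and how the counts are obtained depends on the actual outbox schema.

```python
def outbox_is_draining(before: dict[str, int], after: dict[str, int]) -> bool:
    """Return True if pending rows drained and the failed backlog did not grow.

    `before` and `after` are hypothetical snapshots of outbox row counts
    keyed by status, for example {"pending": 12, "published": 340, "failed": 0}.
    """
    pending_before = before.get("pending", 0)
    pending_after = after.get("pending", 0)
    # Pending is healthy if it is empty or strictly smaller than before.
    pending_ok = pending_after == 0 or pending_after < pending_before
    # Any increase in failed rows counts as a growing failed backlog.
    failed_ok = after.get("failed", 0) <= before.get("failed", 0)
    return pending_ok and failed_ok
```

Under steady incoming traffic the pending count may hover rather than fall, so a single False result is a prompt to take another sample, not proof of relay failure.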

Common Failure Classes

  1. insufficient_permissions:
     actor lacks required tenant/project role for lifecycle mutation.
  2. invalid_request:
     missing project context, invalid version, or invalid lifecycle action for current state.
  3. service_unavailable|upstream_error:
     NATS/DB/worker path degradation.
  4. internal_error:
     unexpected runtime processing failure in API/outbox/worker.

Mitigation

  1. Permission/scope failures:
     correct actor role or project context and retry.
  2. Queue/outbox failures:
     follow Queue_Backlog_Runbook.md and restore relay/consumer health first.
  3. Stuck lifecycle state:
     inspect worker handling outcome and run state-corrective action only with audit trail.
  4. Repeat failures:
     capture payload + envelope evidence and escalate to runtime owner.

Recovery Criteria

  1. Lifecycle request completes to expected terminal state.
  2. No active outbox relay degradation (outbox_relay_ok=true).
  3. No growing DLQ for apps.instance.* subjects.
  4. Trace path spans all expected services for a fresh test request.
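
The four criteria above can be evaluated together once the observations from a fresh test request are collected. The following is a minimal sketch, assuming the observations have already been gathered by hand or by other tooling. The RecoveryObservation type, its field names, and unmet_recovery_criteria are hypothetical; only the service names and outbox_relay_ok come from this runbook.

```python
from dataclasses import dataclass

# Services the trace for a fresh test request is expected to span.
EXPECTED_SERVICES = ["gpuaas-api", "gpuaas-outbox-relay", "gpuaas-app-runtime-worker"]


@dataclass
class RecoveryObservation:
    """Hypothetical container for values collected after a test request."""
    final_state: str              # state of the app instance after the request
    expected_state: str           # terminal state the operation should reach
    outbox_relay_ok: bool         # value of the outbox_relay_ok health signal
    dlq_depth_samples: list[int]  # DLQ depth for apps.instance.* subjects, oldest first
    trace_services: list[str]     # service names seen in the fresh trace


def unmet_recovery_criteria(obs: RecoveryObservation) -> list[str]:
    """Return a description of each criterion not yet met (empty if recovered)."""
    unmet = []
    if obs.final_state != obs.expected_state:
        unmet.append("lifecycle request did not reach expected terminal state")
    if not obs.outbox_relay_ok:
        unmet.append("outbox relay degraded")
    samples = obs.dlq_depth_samples
    # Treat the DLQ as growing if the newest sample exceeds the oldest.
    if len(samples) >= 2 and samples[-1] > samples[0]:
        unmet.append("DLQ growing for apps.instance.* subjects")
    missing = [s for s in EXPECTED_SERVICES if s not in obs.trace_services]
    if missing:
        unmet.append("trace missing services: " + ", ".join(missing))
    return unmet
```

An empty result means all four criteria hold for the sampled request. The returned list can be pasted into the incident record alongside the evidence below.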

Evidence to Capture

  1. correlation_id, trace_id, and exact API action attempted.
  2. App instance before/after state snapshot.
  3. Log evidence from API, outbox relay, and app-runtime worker.
  4. Follow-up action: bug/task ID and owner.