Runbook: App Runtime Lifecycle Incident
Trigger
- App instance deploy/upgrade/rollback/decommission action fails in UI/API/CLI/SDK.
- App instance remains stuck in requested|upgrading|rolling_back|decommissioning.
- Support ticket includes a correlation_id for an app lifecycle failure.
Required Context
- correlation_id from error envelope.
- trace_id if present.
- Scope identifiers:
  - org_id
  - project_id
  - app_instance_id
  - app_slug
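The required context can be collected in one structure before triage starts. The following is a minimal sketch, not part of any existing tooling: the `IncidentContext` class and its `missing_fields` helper are hypothetical names introduced for illustration, and only the field names come from the list above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class IncidentContext:
    """Hypothetical container for the identifiers this runbook asks for."""

    correlation_id: str
    org_id: str
    project_id: str
    app_instance_id: str
    app_slug: str
    trace_id: Optional[str] = None  # optional: the runbook says "if present"

    def missing_fields(self) -> list[str]:
        """Return names of required fields that are empty (trace_id is exempt)."""
        return [
            f.name
            for f in fields(self)
            if f.name != "trace_id" and not getattr(self, f.name)
        ]
```

A ticket with an empty `project_id` would report `["project_id"]` from `missing_fields()`, which is a prompt to go back to the reporter before querying logs.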
Correlation-First Triage
- API logs by correlation_id:
  {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"
- Outbox relay logs by correlation_id:
  {service="gpuaas-outbox-relay"} | json | correlation_id="<CORRELATION_ID>"
- App runtime worker logs by correlation_id:
  {service="gpuaas-app-runtime-worker"} | json | correlation_id="<CORRELATION_ID>"
- Open Tempo trace by trace_id and validate lineage:
  - expected services: gpuaas-api -> gpuaas-outbox-relay -> gpuaas-app-runtime-worker
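The three log queries above differ only in the service label, so they can be generated from a single correlation_id. This is a sketch under that assumption; `build_triage_queries` is a hypothetical helper name, while the service names and query shape are taken from this section.

```python
# Services on the expected lifecycle path, in lineage order (from this runbook).
LIFECYCLE_SERVICES = (
    "gpuaas-api",
    "gpuaas-outbox-relay",
    "gpuaas-app-runtime-worker",
)


def build_triage_queries(correlation_id: str) -> dict[str, str]:
    """Return one log query per service, keyed by service name."""
    # Reject values that would break out of the quoted filter string.
    if not correlation_id or '"' in correlation_id:
        raise ValueError("correlation_id must be non-empty and contain no double quotes")
    return {
        service: f'{{service="{service}"}} | json | correlation_id="{correlation_id}"'
        for service in LIFECYCLE_SERVICES
    }
```

Running the queries in lineage order makes it easier to see where the request stopped: a hit in the API logs with no match in the relay logs points at the outbox hop.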
Event Sequence Expectations
- Deploy path: apps.instance.requested -> apps.instance.running (or apps.instance.failed)
- Upgrade path: apps.instance.upgrade_requested -> apps.instance.running (or apps.instance.failed)
- Rollback path: apps.instance.rollback_requested -> apps.instance.running (or apps.instance.failed)
- Decommission path: apps.instance.decommission_requested -> apps.instance.deleted
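The expected paths above can be checked against the events actually observed for an instance. The sketch below assumes the observed events are already available as an ordered list of subject names; how they are fetched is not covered here, and `check_event_sequence` and its return labels are hypothetical.

```python
RUNNING_OR_FAILED = frozenset({"apps.instance.running", "apps.instance.failed"})

# operation -> (request event, acceptable terminal events), per this runbook.
EXPECTED_PATHS = {
    "deploy": ("apps.instance.requested", RUNNING_OR_FAILED),
    "upgrade": ("apps.instance.upgrade_requested", RUNNING_OR_FAILED),
    "rollback": ("apps.instance.rollback_requested", RUNNING_OR_FAILED),
    "decommission": (
        "apps.instance.decommission_requested",
        frozenset({"apps.instance.deleted"}),
    ),
}


def check_event_sequence(operation: str, observed: list[str]) -> str:
    """Classify an ordered event list as complete, in_progress, or unexpected."""
    request_event, terminal_events = EXPECTED_PATHS[operation]
    if not observed or observed[0] != request_event:
        return "unexpected"  # the request event is missing or out of order
    if any(event in terminal_events for event in observed[1:]):
        return "complete"
    return "in_progress"  # requested, but no terminal event seen yet
```

An `in_progress` result that persists is the event-level view of a stuck lifecycle state; `unexpected` suggests the request event was never published.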
Data Validation
- Confirm app instance scope: app_instances.org_id and app_instances.project_id match request scope.
- Confirm state transition is valid for requested operation.
- Confirm outbox progression:
  - pending rows drain to published without growing failed backlog.
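The outbox progression check is a trend judgment over at least two observations. As an illustration only, the sketch below assumes the pending and failed row counts have been sampled over time by some means not specified in this runbook; `OutboxSample` and `outbox_is_draining` are hypothetical names.

```python
from typing import NamedTuple, Sequence


class OutboxSample(NamedTuple):
    """One observation of outbox row counts (how they are queried is assumed)."""

    pending: int
    failed: int


def outbox_is_draining(samples: Sequence[OutboxSample]) -> bool:
    """True if pending rows are draining and the failed backlog is not growing."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to judge a trend")
    first, last = samples[0], samples[-1]
    pending_drained = last.pending == 0 or last.pending < first.pending
    failed_stable = last.failed <= first.failed
    return pending_drained and failed_stable
```

Comparing only the first and last sample keeps the check simple; a stricter version could require every intermediate sample to be non-increasing.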
Common Failure Classes
- insufficient_permissions:
  - actor lacks required tenant/project role for lifecycle mutation.
- invalid_request:
  - missing project context, invalid version, or invalid lifecycle action for current state.
- service_unavailable|upstream_error:
  - NATS/DB/worker path degradation.
- internal_error:
  - unexpected runtime processing failure in API/outbox/worker.
Mitigation
- Permission/scope failures:
  - correct actor role or project context and retry.
- Queue/outbox failures:
  - follow Queue_Backlog_Runbook.md and restore relay/consumer health first.
- Stuck lifecycle state:
  - inspect worker handling outcome and run state-corrective action only with audit trail.
- Repeat failures:
  - capture payload + envelope evidence and escalate to runtime owner.
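For quick routing, the error code from the envelope can be mapped to a first mitigation step. The pairing below is one reading of the failure classes and mitigations in this runbook, not a documented mapping: in particular, sending invalid_request to the scope fix and internal_error straight to escalation are assumptions, and `suggest_mitigation` is a hypothetical helper.

```python
SCOPE_FIX = "Correct actor role or project context and retry."
QUEUE_FIX = "Follow Queue_Backlog_Runbook.md and restore relay/consumer health first."
ESCALATE = "Capture payload + envelope evidence and escalate to runtime owner."

# Assumed pairing of failure class -> first mitigation step.
MITIGATION_BY_ERROR_CODE = {
    "insufficient_permissions": SCOPE_FIX,
    "invalid_request": SCOPE_FIX,  # assumption: covers the missing-project-context case
    "service_unavailable": QUEUE_FIX,
    "upstream_error": QUEUE_FIX,
    "internal_error": ESCALATE,  # assumption: no self-service fix is listed
}


def suggest_mitigation(error_code: str) -> str:
    """Return the first mitigation step; unknown codes fall back to escalation."""
    return MITIGATION_BY_ERROR_CODE.get(error_code, ESCALATE)
```

An invalid_request caused by an invalid version or an invalid lifecycle action for the current state needs the request itself corrected, which this simple lookup does not distinguish.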
Recovery Criteria
- Lifecycle request completes to expected terminal state.
- No active outbox relay degradation (outbox_relay_ok=true).
- No growing DLQ for apps.instance.* subjects.
- Trace path spans all expected services for a fresh test request.
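The four recovery criteria can be evaluated together so that an incident is only closed when none remain unmet. This is a sketch assuming the inputs have already been gathered by hand or by other tooling; `RecoverySnapshot` and `unmet_recovery_criteria` are hypothetical names, and only the service names and the outbox_relay_ok signal come from this runbook.

```python
from dataclasses import dataclass
from typing import Sequence

EXPECTED_TRACE_SERVICES = (
    "gpuaas-api",
    "gpuaas-outbox-relay",
    "gpuaas-app-runtime-worker",
)


@dataclass(frozen=True)
class RecoverySnapshot:
    """Hypothetical bundle of observations taken after mitigation."""

    reached_expected_terminal_state: bool
    outbox_relay_ok: bool
    dlq_depth_samples: Sequence[int]  # DLQ depth for apps.instance.* over time
    trace_services: Sequence[str]  # services seen in a fresh test request trace


def unmet_recovery_criteria(snap: RecoverySnapshot) -> list[str]:
    """Return a description of each criterion that is not yet satisfied."""
    unmet = []
    if not snap.reached_expected_terminal_state:
        unmet.append("lifecycle request has not reached its expected terminal state")
    if not snap.outbox_relay_ok:
        unmet.append("outbox relay is degraded")
    depths = list(snap.dlq_depth_samples)
    if len(depths) >= 2 and depths[-1] > depths[0]:
        unmet.append("DLQ is growing")
    missing = [s for s in EXPECTED_TRACE_SERVICES if s not in snap.trace_services]
    if missing:
        unmet.append("trace is missing services: " + ", ".join(missing))
    return unmet
```

An empty list means all four criteria hold for the snapshot; anything else names what still needs attention.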
Evidence to Capture
- correlation_id, trace_id, and exact API action attempted.
- App instance before/after state snapshot.
- Log evidence from API, outbox relay, and app-runtime worker.
- Follow-up action: bug/task ID and owner.