Runbook: CLI Incident and Support Triage

Trigger

  • A gpuaas CLI command fails on the auth, project context, allocation, terminal, or billing path.
  • The user reports a non-zero exit code and provides a correlation_id.

Required Incident Inputs

  1. CLI command invoked (with sensitive values redacted).
  2. CLI stderr output with:
       • code
       • message
       • correlation_id
  3. Active project context used by CLI (--project-id or active default).
  4. Timestamp and user identity.
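
The input list above can be checked mechanically before triage starts. The following is a minimal sketch, not part of the gpuaas tooling: the IncidentInputs class, its field names, and missing_inputs are hypothetical names chosen for illustration, and only the list of required items comes from this runbook.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class IncidentInputs:
    """Required incident inputs; field names are illustrative assumptions."""
    command: Optional[str] = None          # CLI command, sensitive values redacted
    code: Optional[str] = None             # error code from CLI stderr
    message: Optional[str] = None          # error message from CLI stderr
    correlation_id: Optional[str] = None   # correlation_id from CLI stderr
    project_context: Optional[str] = None  # --project-id value or active default
    timestamp: Optional[str] = None        # when the failure occurred
    user_identity: Optional[str] = None    # who ran the command


def missing_inputs(inputs: IncidentInputs) -> List[str]:
    """Return the names of required inputs that are absent or blank."""
    return [
        f.name
        for f in fields(inputs)
        if not (getattr(inputs, f.name) or "").strip()
    ]
```

An operator could call missing_inputs on a partially filled report and ask the user for whatever it returns before opening the log queries.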

Correlation-First Triage

  1. API log lookup:
       {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"
  2. If terminal command path involved:
       {service=~"gpuaas-(terminal-gateway|api)"} | json | correlation_id="<CORRELATION_ID>"
  3. If allocation provisioning path involved:
       {service=~"gpuaas-(provisioning-worker|api)"} | json | correlation_id="<CORRELATION_ID>"
  4. If billing path involved:
       {service=~"gpuaas-(billing-worker|webhook-worker|api)"} | json | correlation_id="<CORRELATION_ID>"
  5. Pivot to Tempo using trace_id from matching log line when present.
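
The queries above differ only in the service selector, so they can be generated from the failing path and the correlation_id. This is a sketch under stated assumptions: the path keys ("api", "terminal", "provisioning", "billing"), the build_logql helper, and the allowed-character check on the correlation_id are hypothetical. The selectors themselves are copied from the steps above.

```python
import re

# Service selectors copied verbatim from the triage steps.
# The dictionary keys are illustrative labels, not gpuaas identifiers.
SELECTORS = {
    "api": '{service="gpuaas-api"}',
    "terminal": '{service=~"gpuaas-(terminal-gateway|api)"}',
    "provisioning": '{service=~"gpuaas-(provisioning-worker|api)"}',
    "billing": '{service=~"gpuaas-(billing-worker|webhook-worker|api)"}',
}

# Assumption: correlation IDs use only these characters. Rejecting anything
# else keeps a pasted value from breaking out of the quoted filter.
_SAFE_ID = re.compile(r"^[A-Za-z0-9._:-]+$")


def build_logql(path: str, correlation_id: str) -> str:
    """Build the log query for one failure path and correlation_id."""
    if path not in SELECTORS:
        raise ValueError(f"unknown path: {path!r}")
    if not _SAFE_ID.match(correlation_id):
        raise ValueError("correlation_id contains unexpected characters")
    return f'{SELECTORS[path]} | json | correlation_id="{correlation_id}"'
```

Generating the query rather than hand-editing the placeholder avoids the most common copy error, a leftover <CORRELATION_ID> or stray angle bracket in the filter.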

Common Failure Classes and Routing

  • token_missing|token_invalid|token_expired
      • route: auth/session owner.
  • invalid_request with project-context message
      • route: tenant/project authz runbook.
  • insufficient_permissions|admin_required
      • route: IAM membership/role assignment runbook.
  • allocation_*|sku_unavailable|node_*
      • route: provisioning/inventory owner.
  • service_unavailable|upstream_error|internal_error
      • route: API degradation runbook.
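
The routing table above can be expressed as a small lookup. The sketch below is illustrative only: route_for and ROUTES are hypothetical names, and detecting a "project-context message" by looking for the word "project" is an assumption, since this runbook does not specify the message text.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Error-code patterns and routes taken from the list above.
# "*" matches any suffix, as in allocation_* and node_*.
ROUTES = [
    (("token_missing", "token_invalid", "token_expired"),
     "auth/session owner"),
    (("insufficient_permissions", "admin_required"),
     "IAM membership/role assignment runbook"),
    (("allocation_*", "sku_unavailable", "node_*"),
     "provisioning/inventory owner"),
    (("service_unavailable", "upstream_error", "internal_error"),
     "API degradation runbook"),
]


def route_for(code: str, message: str = "") -> Optional[str]:
    """Return the route for an error code, or None if no class matches."""
    # invalid_request is routed only when the message concerns project
    # context. Assumption: such messages contain the word "project".
    if code == "invalid_request":
        if "project" in message.lower():
            return "tenant/project authz runbook"
        return None
    for patterns, route in ROUTES:
        if any(fnmatchcase(code, pattern) for pattern in patterns):
            return route
    return None
```

A None result means the code falls outside the classes listed here and needs manual routing.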

Operator Response Baseline

  1. Do not request DB access from user.
  2. Use correlation_id as primary key for all investigation notes.
  3. Capture canonical resource_name when present and include it in incident handoff.
  4. Provide user-facing remediation step tied to the specific error code class.

Escalation Runbooks

  • doc/operations/runbooks/API_Degradation_Runbook.md
  • doc/operations/runbooks/Tenant_Project_Authorization_Runbook.md
  • doc/operations/runbooks/IAM_Role_Assignment_and_Membership_Incident_Runbook.md
  • doc/operations/runbooks/Provisioning_Workflow_Stuck_Runbook.md
  • doc/operations/runbooks/Terminal_Gateway_Incident_Runbook.md
  • doc/operations/runbooks/Billing_Worker_Failure_Runbook.md