
Runbook: Python SDK Incident and Observability Triage

Trigger

  • A Python SDK call fails and raises an exception carrying a correlation_id.
  • SDK smoke or integration example fails in CI/dev.
  • Tenant-shared runtime or shared-worker control-plane calls fail from Python automation.

Required Support Capture

  1. SDK method called and parameters (redact secrets/tokens).
  2. Exception fields:
       • SDK exception class/type
       • API error_code
       • correlation_id
  3. Project context (X-Project-ID) and actor identity.
  4. Approximate timestamp.
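
The capture fields above can be gathered into a single record before opening a ticket. The following is a minimal sketch, not the SDK's own tooling: the helper names, the `SENSITIVE_KEYS` pattern, and the assumption that SDK exceptions expose `error_code` and `correlation_id` as attributes are all hypothetical and should be checked against the actual exception classes.

```python
from datetime import datetime, timezone

# Assumed naming pattern for sensitive parameters; adjust to the real SDK.
SENSITIVE_KEYS = ("token", "secret", "password", "api_key")


def redact(params):
    """Replace values whose key looks sensitive with a placeholder."""
    return {
        key: "<REDACTED>" if any(s in key.lower() for s in SENSITIVE_KEYS) else value
        for key, value in params.items()
    }


def build_support_capture(method, params, exc, project_id, actor):
    """Collect the required support capture fields into one dict."""
    return {
        "sdk_method": method,
        "parameters": redact(params),
        "exception_class": type(exc).__name__,
        # Attribute names below are assumptions about the SDK exception shape.
        "error_code": getattr(exc, "error_code", None),
        "correlation_id": getattr(exc, "correlation_id", None),
        "project_id": project_id,  # value sent as X-Project-ID
        "actor": actor,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Using `getattr` with a default keeps the helper usable even when an exception lacks one of the fields, so a missing correlation_id shows up as `None` in the record instead of raising a second error.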

Exception -> API Error Mapping Baseline

  • auth/session exceptions -> token_*
  • permission exceptions -> insufficient_permissions|admin_required
  • context/validation exceptions -> invalid_request|validation_error
  • allocation/provisioning exceptions -> allocation_*|sku_unavailable|node_*
  • shared-runtime control-plane exceptions -> shared_runtime_*
  • upstream/platform exceptions -> service_unavailable|upstream_error|internal_error
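
To make the baseline easier to apply during triage, the mapping can be inverted so that an API error_code resolves to its exception class. This is a rough sketch: the function and table names are hypothetical, the wildcard entries (`token_*`, `allocation_*`, `node_*`, `shared_runtime_*`) are treated as plain prefixes, and any code not in the baseline is reported as `unknown`.

```python
# Exact codes taken from the mapping baseline.
EXACT_RULES = {
    "insufficient_permissions": "permission",
    "admin_required": "permission",
    "invalid_request": "context/validation",
    "validation_error": "context/validation",
    "sku_unavailable": "allocation/provisioning",
    "service_unavailable": "upstream/platform",
    "upstream_error": "upstream/platform",
    "internal_error": "upstream/platform",
}

# Wildcard entries from the baseline, interpreted as prefixes (assumption).
PREFIX_RULES = [
    ("token_", "auth/session"),
    ("allocation_", "allocation/provisioning"),
    ("node_", "allocation/provisioning"),
    ("shared_runtime_", "shared-runtime control-plane"),
]


def classify_error_code(error_code):
    """Return the baseline exception class for an API error_code."""
    if error_code in EXACT_RULES:
        return EXACT_RULES[error_code]
    for prefix, error_class in PREFIX_RULES:
        if error_code.startswith(prefix):
            return error_class
    return "unknown"
```

An `unknown` result is itself a useful signal: it suggests either a new error code that the baseline has not absorbed yet or a malformed response.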

Triage Workflow

  1. Start in Loki with correlation_id:
       {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"
  2. Identify route and owning service by path and code.
  3. For async-backed workflows, inspect worker logs by the same correlation_id:
       {service=~"gpuaas-(provisioning-worker|billing-worker|notification-relay|webhook-worker)"} | json | correlation_id="<CORRELATION_ID>"
  4. Pivot to Tempo via trace_id if present.
  5. Confirm tenant boundary fields (org_id, project_id) match caller context.
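
The two Loki queries in this workflow differ only in the service selector, so they can be generated from a correlation_id instead of being edited by hand. The sketch below only builds the query strings; the function names are hypothetical, and the escaping step assumes that backslash-escaping is sufficient for a double-quoted LogQL string literal.

```python
# Worker services named in the triage workflow.
WORKER_SERVICES = (
    "provisioning-worker",
    "billing-worker",
    "notification-relay",
    "webhook-worker",
)


def _escape(value):
    """Escape backslashes and double quotes for a quoted LogQL string."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def api_query(correlation_id):
    """LogQL query for the API service logs."""
    return '{service="gpuaas-api"} | json | correlation_id="%s"' % _escape(correlation_id)


def worker_query(correlation_id):
    """LogQL query across the async worker services."""
    selector = "gpuaas-(" + "|".join(WORKER_SERVICES) + ")"
    return '{service=~"%s"} | json | correlation_id="%s"' % (
        selector,
        _escape(correlation_id),
    )
```

For example, `print(api_query("abc-123"))` yields a query that can be pasted directly into the Loki explore view.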

Support Triage Decision Tree

  1. SDK misuse (missing project context, invalid parameter shape):
       • Provide corrected SDK usage example.
  2. Auth/session failure:
       • Rotate token/session and retry.
  3. API-side deterministic business error:
       • Return specific remediation by code class.
  4. Platform/runtime failure:
       • Open incident with correlation_id, trace_id, and route ownership.
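
The decision tree can also be written down as a small lookup, which is convenient when triage notes are generated by a script. This is an illustrative sketch only: the `Branch` enum, `next_step` function, and the shape of the returned dict are hypothetical, and deciding which branch applies remains a human judgment.

```python
from enum import Enum


class Branch(Enum):
    SDK_MISUSE = "sdk_misuse"
    AUTH_SESSION = "auth_session"
    BUSINESS_ERROR = "business_error"
    PLATFORM_FAILURE = "platform_failure"


# Actions copied from the decision tree.
ACTIONS = {
    Branch.SDK_MISUSE: "Provide corrected SDK usage example.",
    Branch.AUTH_SESSION: "Rotate token/session and retry.",
    Branch.BUSINESS_ERROR: "Return specific remediation by code class.",
    Branch.PLATFORM_FAILURE: "Open incident.",
}


def next_step(branch, correlation_id, trace_id=None, route_owner=None):
    """Return the action for a branch, plus incident details when needed."""
    step = {"branch": branch.value, "action": ACTIONS[branch]}
    if branch is Branch.PLATFORM_FAILURE:
        # The incident must carry these three identifiers.
        step["incident"] = {
            "correlation_id": correlation_id,
            "trace_id": trace_id,
            "route_owner": route_owner,
        }
    return step
```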

Current Python SDK Coverage Notes

The current SDK baseline covers:

  • catalog
  • allocations
  • terminal token minting
  • billing
  • shared runtimes
  • shared runtime attachments
  • shared runtime workers
  • shared runtime worker operations

Support should therefore first verify whether the failing path is one of the supported public methods before classifying the issue as caller misuse or a missing-client-surface gap.
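
As a rough illustration of that first check, the coverage areas can be kept in a set and consulted before any further classification. The area names come from the coverage notes in this section; the function name, the string normalization, and the verdict wording are assumptions, and the check works on coverage areas, not on individual SDK method names.

```python
# Coverage areas from the current SDK baseline.
COVERED_AREAS = frozenset({
    "catalog",
    "allocations",
    "terminal token minting",
    "billing",
    "shared runtimes",
    "shared runtime attachments",
    "shared runtime workers",
    "shared runtime worker operations",
})


def coverage_verdict(area):
    """Say whether a failing path falls inside the covered SDK surface."""
    normalized = " ".join(area.lower().split())
    if normalized in COVERED_AREAS:
        return "covered: continue triage as possible caller misuse or API error"
    return "not covered: treat as missing-client-surface gap"
```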

Escalation Runbooks

  • doc/operations/runbooks/API_Degradation_Runbook.md
  • doc/operations/runbooks/Tenant_Project_Authorization_Runbook.md
  • doc/operations/runbooks/Provisioning_Workflow_Stuck_Runbook.md
  • doc/operations/runbooks/Billing_Worker_Failure_Runbook.md
  • doc/operations/runbooks/Terminal_Gateway_Incident_Runbook.md