Runbook: Python SDK Incident and Observability Triage
Trigger
- Python SDK call fails and raises exception with correlation_id.
- SDK smoke or integration example fails in CI/dev.
- Tenant-shared runtime or shared-worker control-plane calls fail from Python automation.
Required Support Capture
- SDK method called and parameters (redact secrets/tokens).
- Exception fields:
  - SDK exception class/type
  - API error_code
  - correlation_id
- Project context (X-Project-ID) and actor identity.
- Approximate timestamp.
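The capture fields above can be gathered in one place at the point where an SDK call fails. The following is a minimal sketch, not the SDK's own API: it assumes the SDK exception exposes error_code and correlation_id as attributes, and the SENSITIVE_KEYS set, redact, and build_support_capture are hypothetical names introduced for illustration.

```python
from datetime import datetime, timezone

# Hypothetical parameter names to mask; adjust to the real SDK call signature.
SENSITIVE_KEYS = {"token", "api_key", "password", "secret", "authorization"}


def redact(params):
    """Return a copy of params with sensitive values masked."""
    return {
        key: "<REDACTED>" if key.lower() in SENSITIVE_KEYS else value
        for key, value in params.items()
    }


def build_support_capture(exc, method, params, project_id, actor):
    """Collect the fields support needs from a failed SDK call."""
    return {
        "sdk_method": method,
        "params": redact(params),
        "exception_type": type(exc).__name__,
        # Attribute names are assumed; getattr falls back to None if absent.
        "error_code": getattr(exc, "error_code", None),
        "correlation_id": getattr(exc, "correlation_id", None),
        "project_id": project_id,  # the value sent as X-Project-ID
        "actor": actor,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }
```

Calling build_support_capture inside the except block that wraps the SDK call keeps the timestamp close to the failure and ensures secrets are masked before the record is pasted into a ticket.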
Exception -> API Error Mapping Baseline
- auth/session exceptions -> token_*
- permission exceptions -> insufficient_permissions|admin_required
- context/validation exceptions -> invalid_request|validation_error
- allocation/provisioning exceptions -> allocation_*|sku_unavailable|node_*
- shared-runtime control-plane exceptions -> shared_runtime_*
- upstream/platform exceptions -> service_unavailable|upstream_error|internal_error
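The mapping baseline above can be applied mechanically when only the API error_code is known. This is a rough sketch: the patterns are copied from the list above, while the category labels, ERROR_CODE_CLASSES, and classify_error_code are illustrative names, not part of the SDK.

```python
from fnmatch import fnmatchcase

# Patterns taken from the mapping baseline; "*" matches any suffix.
ERROR_CODE_CLASSES = [
    ("auth/session", ["token_*"]),
    ("permission", ["insufficient_permissions", "admin_required"]),
    ("context/validation", ["invalid_request", "validation_error"]),
    ("allocation/provisioning", ["allocation_*", "sku_unavailable", "node_*"]),
    ("shared-runtime control-plane", ["shared_runtime_*"]),
    ("upstream/platform", ["service_unavailable", "upstream_error", "internal_error"]),
]


def classify_error_code(error_code):
    """Return the exception class for an API error_code, or "unknown"."""
    for category, patterns in ERROR_CODE_CLASSES:
        if any(fnmatchcase(error_code, pattern) for pattern in patterns):
            return category
    return "unknown"
```

An "unknown" result suggests the code is not covered by the baseline and the mapping may need to be extended before the issue is classified.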
Triage Workflow
- Start in Loki with correlation_id:
  {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"
- Identify route and owning service by path and code.
- For async-backed workflows, inspect worker logs by same correlation_id:
  {service=~"gpuaas-(provisioning-worker|billing-worker|notification-relay|webhook-worker)"} | json | correlation_id="<CORRELATION_ID>"
- Pivot to Tempo via trace_id if present.
- Confirm tenant boundary fields (org_id, project_id) match caller context.
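To avoid hand-editing the two Loki queries from the workflow above, they can be generated from a captured correlation_id. A small sketch under stated assumptions: the service names come from the queries above, the function names are hypothetical, and json.dumps is used as an approximation of LogQL double-quoted string escaping.

```python
import json

# Worker services named in the async-workflow query above.
WORKER_SERVICES = (
    "provisioning-worker",
    "billing-worker",
    "notification-relay",
    "webhook-worker",
)


def _quote(value):
    """Wrap value in double quotes, escaping embedded quotes and backslashes."""
    return json.dumps(value)


def api_query(correlation_id):
    """LogQL query for API logs carrying the given correlation_id."""
    return '{service="gpuaas-api"} | json | correlation_id=' + _quote(correlation_id)


def worker_query(correlation_id):
    """LogQL query for worker logs carrying the given correlation_id."""
    services = "|".join(WORKER_SERVICES)
    selector = '{service=~"gpuaas-(' + services + ')"}'
    return selector + " | json | correlation_id=" + _quote(correlation_id)


if __name__ == "__main__":
    print(api_query("<CORRELATION_ID>"))
    print(worker_query("<CORRELATION_ID>"))
```

Quoting the value rather than interpolating it raw keeps a malformed or hostile correlation_id from altering the query.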
Support Triage Decision Tree
- SDK misuse (missing project context, invalid parameter shape):
  - provide corrected SDK usage example.
- Auth/session failure:
  - rotate token/session and retry.
- API-side deterministic business error:
  - return specific remediation by code class.
- Platform/runtime failure:
  - open incident with correlation_id, trace_id, and route ownership.
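The decision tree above could be encoded as a small lookup so support tooling returns a consistent next step. This is only a sketch: the branch keys, ACTIONS, and triage_decision are hypothetical names, and the action strings are copied from the tree.

```python
# Branch keys are illustrative labels for the four branches of the tree.
ACTIONS = {
    "sdk_misuse": "provide corrected SDK usage example",
    "auth_session": "rotate token/session and retry",
    "business_error": "return specific remediation by code class",
    "platform_failure": "open incident",
}


def triage_decision(branch, correlation_id=None, trace_id=None, route_owner=None):
    """Return the next step for a triage branch as a plain dict."""
    if branch not in ACTIONS:
        raise ValueError(f"unknown triage branch: {branch!r}")
    decision = {"branch": branch, "action": ACTIONS[branch]}
    if branch == "platform_failure":
        # An incident needs the identifiers and the owning route attached.
        decision["incident"] = {
            "correlation_id": correlation_id,
            "trace_id": trace_id,
            "route_owner": route_owner,
        }
    return decision
```

Only the platform/runtime branch attaches incident fields, mirroring the tree: the other three branches are resolved with the caller rather than escalated.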
Current Python SDK Coverage Notes
The current SDK baseline covers:
- catalog
- allocations
- terminal token minting
- billing
- shared runtimes
- shared runtime attachments
- shared runtime workers
- shared runtime worker operations
Support should therefore first verify whether the failing path is one of the supported public methods before classifying the issue as caller misuse or a missing-client-surface gap.
Escalation Runbooks
- doc/operations/runbooks/API_Degradation_Runbook.md
- doc/operations/runbooks/Tenant_Project_Authorization_Runbook.md
- doc/operations/runbooks/Provisioning_Workflow_Stuck_Runbook.md
- doc/operations/runbooks/Billing_Worker_Failure_Runbook.md
- doc/operations/runbooks/Terminal_Gateway_Incident_Runbook.md