Runbook: Fleet Telemetry Incident (CPU/GPU/Memory/Storage)

Trigger

  • /admin/telemetry shows missing or stale charts.
  • CPU/GPU/Memory/Storage rollups show unexpected zeros or sharp anomalies.
  • Fleet telemetry API returns an error envelope with a correlation_id.

Primary Surfaces

  • UI: /admin/telemetry
  • API: GET /api/v1/admin/telemetry/fleet?range=<window>&points=<n> (example request after this list)
  • Metrics/logs/traces: Prometheus, Loki, Tempo
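
A hedged example request against the fleet endpoint (the range/points values are illustrative, the port matches the healthz check in triage, and the admin bearer token header is an assumption about this deployment's auth):

    curl -s "http://localhost:8081/api/v1/admin/telemetry/fleet?range=24h&points=96" \
      -H "Authorization: Bearer <ADMIN_TOKEN>"   # <ADMIN_TOKEN> is a placeholder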

Immediate Triage

  1. Capture (a correlation_id extraction sketch follows this list):
       • correlation_id
       • selected range
       • selected telemetry dimension/tab (CPU, GPU, Memory, Storage)
  2. Verify API endpoint health:
       curl -sf http://localhost:8081/api/v1/healthz
  3. Validate observability stack readiness:
       make ops-observability-stack-smoke
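
If triage hits the error envelope instead of chart data, pull the correlation_id straight off the response. A minimal sketch, assuming a top-level correlation_id field (the field name and the sample query values are assumptions):

    # Assumed envelope shape; adjust the jq path if the field is nested.
    curl -s "http://localhost:8081/api/v1/admin/telemetry/fleet?range=24h&points=96" | jq -r '.correlation_id'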

Correlation-First Query Flow

  1. API log lookup in Loki (a logcli sketch follows this list):
       {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"
  2. Trace pivot: open the trace_id from the matching log line in Tempo.
  3. Metrics sanity:
       sum(rate(http_server_requests_total{status=~"5.."}[5m])) by (service)
  4. Verify API scrape targets and freshness in Prometheus.
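
The log lookup can also be scripted from a terminal. A sketch using Grafana's logcli, assuming it is pointed at the fleet's Loki instance (the address is an assumed local default):

    # Query the last hour of API logs for the captured correlation_id.
    export LOKI_ADDR=http://localhost:3100   # assumed Loki address; set to your deployment's
    logcli query --since=1h '{service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"'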

Expected Failure Patterns

  • invalid_request: unsupported range/points values.
  • insufficient_permissions / admin_required: non-admin actor calling admin telemetry endpoint.
  • service_unavailable: metrics backend not reachable.
  • internal_error: server-side aggregation failure.
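
For reference when grepping logs or writing checks, a plausible envelope shape; the exact field layout is an assumption, and only the error codes above and the presence of correlation_id are confirmed by this runbook:

    {
      "error": {
        "code": "service_unavailable",
        "message": "metrics backend not reachable"
      },
      "correlation_id": "<CORRELATION_ID>"
    }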

Alert Mapping

  • API telemetry endpoint 5xx spike -> ops.api.degradation (candidate expression after this list).
  • Observability backend unavailable -> stack smoke/runbook escalation.
  • Persistent tab-specific anomalies (CPU/GPU/Memory/Storage) with a healthy endpoint: treat as a data quality/collector issue and escalate to the platform observability owner.
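
A hedged sketch of the 5xx-spike condition, derived from the metrics-sanity query above (the 0.1 req/s threshold is an assumption to be tuned per deployment):

    # Candidate expression behind ops.api.degradation; threshold is an assumption.
    sum(rate(http_server_requests_total{service="gpuaas-api", status=~"5.."}[5m])) > 0.1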

Escalation

  • API degradation: doc/operations/runbooks/API_Degradation_Runbook.md
  • Queue/worker side effects: doc/operations/runbooks/Queue_Backlog_Runbook.md
  • Incident comms: doc/operations/runbooks/Incident_Communication_Runbook.md