Runbook: Fleet Telemetry Incident (CPU/GPU/Memory/Storage)

Trigger

  • /admin/telemetry shows missing or stale charts.
  • CPU/GPU/Memory/Storage rollups show unexpected zeros or sharp anomalies.
  • Fleet telemetry API returns an error envelope with a correlation_id.

Primary Surfaces

  • UI: /admin/telemetry
  • API: GET /api/v1/admin/telemetry/fleet?range=<window>&points=<n> (example request after this list)
  • Metrics/logs/traces: Prometheus, Loki, Tempo
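
A hedged example request against the fleet endpoint (the range/points values are illustrative, the port matches the healthz check in triage, and the admin bearer token header is an assumption about this deployment's auth):

    curl -s "http://localhost:8081/api/v1/admin/telemetry/fleet?range=24h&points=96" \
      -H "Authorization: Bearer <ADMIN_TOKEN>"   # <ADMIN_TOKEN> is a placeholder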

Immediate Triage

  1. Capture (a correlation_id extraction sketch follows this list):
       • correlation_id
       • selected range
       • selected telemetry dimension/tab (CPU, GPU, Memory, Storage)
  2. Verify API endpoint health:
       curl -sf http://localhost:8081/api/v1/healthz
  3. Validate observability stack readiness:
       make ops-observability-stack-smoke
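
If triage hits the error envelope instead of chart data, pull the correlation_id straight off the response. A minimal sketch, assuming a top-level correlation_id field (the field name and the sample query values are assumptions):

    # Assumed envelope shape; adjust the jq path if the field is nested.
    curl -s "http://localhost:8081/api/v1/admin/telemetry/fleet?range=24h&points=96" | jq -r '.correlation_id'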

Correlation-First Query Flow

  1. API log lookup in Loki (a logcli sketch follows this list):
       {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"
  2. Trace pivot: open the trace_id from the matching log line in Tempo.
  3. Metrics sanity:
       sum(rate(http_server_requests_total{status=~"5.."}[5m])) by (service)
  4. Verify API scrape targets and freshness in Prometheus.
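
The log lookup can also be scripted from a terminal. A sketch using Grafana's logcli, assuming it is pointed at the fleet's Loki instance (the address is an assumed local default):

    # Query the last hour of API logs for the captured correlation_id.
    export LOKI_ADDR=http://localhost:3100   # assumed Loki address; set to your deployment's
    logcli query --since=1h '{service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"'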

Expected Failure Patterns

  • invalid_request: unsupported range/points values.
  • insufficient_permissions / admin_required: non-admin actor calling admin telemetry endpoint.
  • service_unavailable: metrics backend not reachable.
  • internal_error: server-side aggregation failure.
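
For reference when grepping logs or writing checks, a plausible envelope shape; the exact field layout is an assumption, and only the error codes above and the presence of correlation_id are confirmed by this runbook:

    {
      "error": {
        "code": "service_unavailable",
        "message": "metrics backend not reachable"
      },
      "correlation_id": "<CORRELATION_ID>"
    }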

Alert Mapping

  • API telemetry endpoint 5xx spike -> ops.api.degradation (candidate expression after this list).
  • Observability backend unavailable -> stack smoke/runbook escalation.
  • Persistent tab-specific anomalies (CPU/GPU/Memory/Storage) with a healthy endpoint: treat as a data quality/collector issue and escalate to the platform observability owner.
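
A hedged sketch of the 5xx-spike condition, derived from the metrics-sanity query above (the 0.1 req/s threshold is an assumption to be tuned per deployment):

    # Candidate expression behind ops.api.degradation; threshold is an assumption.
    sum(rate(http_server_requests_total{service="gpuaas-api", status=~"5.."}[5m])) > 0.1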

Escalation

  • API degradation: doc/operations/runbooks/API_Degradation_Runbook.md
  • Queue/worker side effects: doc/operations/runbooks/Queue_Backlog_Runbook.md
  • Incident comms: doc/operations/runbooks/Incident_Communication_Runbook.md