Skip to content

Runbook: Admin Ops Dashboard Usage

Purpose

  • Define how operators should interpret Admin Ops signals and what action to take.

Default /admin/ops order: 1. Decision Header 2. Action Required 3. Health Summary 4. Investigation Tools 5. Fleet and Sample Detail

Key Semantics

  • outbox_relay_ok:
  • Live relay health signal.
  • true means no recent outbox publish failures in the rolling health window.
  • false means relay is currently degraded and needs immediate investigation.
  • dlq_pending:
  • Historical failed outbox backlog not yet requeued.
  • Can be non-zero even when relay is healthy.

Decision Matrix

  1. Action Required is empty
  2. State: Healthy or recovering without current intervention.
  3. Action: Confirm freshness in the Decision Header, then use Health Summary only if a caller reports a symptom.
  4. Action Required contains one or more incident cards
  5. State: Active degradation or unresolved risk.
  6. Action: Start from the highest-severity card, open the linked runbook, and only then use Investigation Tools for deeper diagnosis.

Standard Operator Workflow

  1. Check the Decision Header:
  2. freshness
  3. incident count
  4. highest-severity summary
  5. Work Action Required first:
  6. outbox relay degraded
  7. DLQ backlog
  8. API 5xx elevated
  9. worker failures
  10. node metrics degraded
  11. Use the runbook linked from the incident card as the primary workflow.
  12. Use Investigation Tools for:
  13. correlation_id
  14. trace_id
  15. Loki/Tempo/Grafana pivots
  16. Use Fleet and Sample Detail as supporting evidence after the incident class is known.

Signal Routing

  1. If outbox relay is degraded:
  2. route to ops.outbox.relay
  3. If DLQ backlog is present:
  4. route to ops.queue.backlog
  5. If API 5xx is elevated:
  6. route to ops.api.degradation
  7. If billing/provisioning worker failures are elevated:
  8. route to the owning worker runbook first
  9. If node metrics are degraded or stale:
  10. route to ops.fleet.telemetry
  11. If the problem is auth/login related and shows as WARN/401:
  12. route to auth-focused saved queries and the relevant onboarding/federation/IAM runbook rather than waiting for 5xx panels

Escalation

  • If outbox_relay_ok=false for more than one alert window, escalate using:
  • doc/operations/runbooks/Incident_Communication_Runbook.md
  • doc/operations/runbooks/Queue_Backlog_Runbook.md