Runbook: Admin Ops Dashboard Usage
Purpose
- Define how operators should interpret Admin Ops signals and what action to take.
Default /admin/ops order:
1. Decision Header
2. Action Required
3. Health Summary
4. Investigation Tools
5. Fleet and Sample Detail
Key Semantics
- outbox_relay_ok:
  - Live relay health signal.
  - true means no recent outbox publish failures in the rolling health window.
  - false means relay is currently degraded and needs immediate investigation.
- dlq_pending:
  - Historical failed outbox backlog not yet requeued.
  - Can be non-zero even when relay is healthy.
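Because the two signals are independent, it can help to read them together. The following is a minimal sketch, assuming a hypothetical status payload that exposes outbox_relay_ok as a boolean and dlq_pending as a non-negative integer; the function name and the returned strings are illustrative and are not part of the dashboard.

```python
def interpret_outbox_signals(outbox_relay_ok: bool, dlq_pending: int) -> str:
    """Combine the live relay signal with the historical DLQ backlog count.

    Hypothetical helper: the wording of the returned strings is illustrative.
    """
    if dlq_pending < 0:
        raise ValueError("dlq_pending must be non-negative")
    if not outbox_relay_ok:
        # Live degradation takes priority regardless of backlog size.
        return "relay degraded: investigate immediately"
    if dlq_pending > 0:
        # Relay is healthy now, but earlier failures were never requeued.
        return "relay healthy: historical DLQ backlog not yet requeued"
    return "relay healthy: no backlog"
```

The middle branch is the case that is easy to misread: a non-zero dlq_pending alongside outbox_relay_ok=true describes past failures, not a current relay problem.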
Decision Matrix
- Action Required is empty
  - State: Healthy or recovering without current intervention.
  - Action: Confirm freshness in the Decision Header, then use Health Summary only if a caller reports a symptom.
- Action Required contains one or more incident cards
  - State: Active degradation or unresolved risk.
  - Action: Start from the highest-severity card, open the linked runbook, and only then use Investigation Tools for deeper diagnosis.
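The matrix above can be sketched as a small function. This is an illustration only: the IncidentCard fields, the severity labels, and their ranking are assumptions, since this runbook does not define the card schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical severity ranking; the real dashboard may use different labels.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass
class IncidentCard:
    title: str
    severity: str  # assumed to be one of the SEVERITY_RANK keys
    runbook: str   # assumed identifier of the linked runbook


def next_action(cards: List[IncidentCard]) -> str:
    """Return the operator's next step for the current Action Required list."""
    if not cards:
        # Empty list: healthy or recovering without current intervention.
        return "Confirm freshness in the Decision Header"
    # Non-empty list: start from the highest-severity card.
    top = max(cards, key=lambda card: SEVERITY_RANK[card.severity])
    return f"Open runbook {top.runbook} for '{top.title}'"
```

For example, given a medium "DLQ backlog" card and a critical "outbox relay degraded" card, the function points to the runbook of the critical card first.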
Standard Operator Workflow
- Check the Decision Header:
  - freshness
  - incident count
  - highest-severity summary
- Work Action Required first:
  - outbox relay degraded
  - DLQ backlog
  - API 5xx elevated
  - worker failures
  - node metrics degraded
- Use the runbook linked from the incident card as the primary workflow.
- Use Investigation Tools for:
  - correlation_id
  - trace_id
  - Loki/Tempo/Grafana pivots
- Use Fleet and Sample Detail as supporting evidence after the incident class is known.
Signal Routing
- If outbox relay is degraded:
  - route to ops.outbox.relay
- If DLQ backlog is present:
  - route to ops.queue.backlog
- If API 5xx is elevated:
  - route to ops.api.degradation
- If billing/provisioning worker failures are elevated:
  - route to the owning worker runbook first
- If node metrics are degraded or stale:
  - route to ops.fleet.telemetry
- If the problem is auth/login related and shows as WARN/401:
  - route to auth-focused saved queries and the relevant onboarding/federation/IAM runbook rather than waiting for 5xx panels
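The routing rules above amount to a lookup table. The sketch below assumes hypothetical signal keys (such as "outbox_relay_degraded"); only the ops.* route names come from this runbook, and the two cases without a named route return a plain description instead.

```python
# Routes named in this runbook, keyed by hypothetical signal identifiers.
ROUTES = {
    "outbox_relay_degraded": "ops.outbox.relay",
    "dlq_backlog": "ops.queue.backlog",
    "api_5xx_elevated": "ops.api.degradation",
    "node_metrics_degraded": "ops.fleet.telemetry",
}

# Signals that route to a runbook chosen by the operator, not a fixed key.
DESCRIPTIVE_ROUTES = {
    "worker_failures": "owning worker runbook",
    "auth_warn_401": (
        "auth-focused saved queries and the relevant "
        "onboarding/federation/IAM runbook"
    ),
}


def route_signal(signal: str) -> str:
    """Map a signal identifier to its route; raise on unknown signals."""
    if signal in ROUTES:
        return ROUTES[signal]
    if signal in DESCRIPTIVE_ROUTES:
        return DESCRIPTIVE_ROUTES[signal]
    raise KeyError(f"no route defined for signal: {signal}")
```

Raising on an unknown signal, rather than returning a default, keeps an unrouted incident class visible instead of silently sending it to the wrong runbook.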
Escalation
- If outbox_relay_ok=false for more than one alert window, escalate using:
  - doc/operations/runbooks/Incident_Communication_Runbook.md
  - doc/operations/runbooks/Queue_Backlog_Runbook.md
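The escalation condition can be expressed as a duration check. This is a minimal sketch under stated assumptions: the runbook does not give the length of an alert window, so alert_window is a caller-supplied parameter, and degraded_since is a hypothetical timestamp for when outbox_relay_ok first turned false.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_escalate(
    degraded_since: Optional[datetime],
    now: datetime,
    alert_window: timedelta,
) -> bool:
    """Return True once the relay has been degraded for more than one window.

    degraded_since is None while outbox_relay_ok is true.
    """
    if degraded_since is None:
        return False
    # Strictly greater than: exactly one full window does not yet escalate.
    return now - degraded_since > alert_window
```

With an assumed five-minute window, a relay that has been degraded for six minutes would escalate, while one degraded for four minutes would not.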