Runbook: Queue Backlog / Worker Saturation

Trigger

  • Queue lag/backlog exceeds alert threshold.
  • Admin Ops shows dlq_pending > 0 and/or outbox_relay_ok=false.

Immediate Actions

  1. Identify affected topics/consumers.
  2. Verify consumer health and error rates.
  3. Pause non-critical producers if necessary.

Diagnostics

  • Check failing message patterns.
  • Check downstream dependency health.
  • Check for signs of a retry storm or poison messages.
  • Use metric semantics:
    • outbox_relay_ok=false means active relay degradation now.
    • dlq_pending>0 with outbox_relay_ok=true means historical backlog after recovery.
  • Distinguish the two backlog sources before taking action:
    • GET /api/v1/admin/dlq/messages lists live NATS DLQ stream messages.
    • GET /api/v1/admin/outbox/failed lists historical failed outbox rows; Admin Ops reports these as queue_depth.dlq_pending.
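
The metric semantics above can be sketched as a small decision function. This is a minimal illustration, not part of the Admin Ops API: the OpsSnapshot type, the classify_backlog name, and the returned state labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpsSnapshot:
    """Hypothetical container for the two Admin Ops signals."""
    outbox_relay_ok: bool
    dlq_pending: int


def classify_backlog(snapshot: OpsSnapshot) -> str:
    """Map the two signals to a state label (labels are illustrative)."""
    if not snapshot.outbox_relay_ok:
        # The relay is degraded right now, regardless of dlq_pending.
        return "active_relay_degradation"
    if snapshot.dlq_pending > 0:
        # The relay has recovered, but failed rows remain from earlier.
        return "historical_backlog"
    return "healthy"
```

For example, classify_backlog(OpsSnapshot(outbox_relay_ok=True, dlq_pending=4)) returns "historical_backlog", which points at the failed outbox rows rather than at a live relay problem.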

Mitigation

  • Scale worker replicas for affected consumers.
  • Route poison messages to DLQ.
  • Apply backpressure controls on producers.
  • Use admin backlog controls:
    • Outbox backlog:
      • GET /api/v1/admin/outbox/failed
      • POST /api/v1/admin/outbox/{event_id}/requeue
      • POST /api/v1/admin/outbox/{event_id}/discard
    • DLQ stream backlog:
      • GET /api/v1/admin/dlq/messages?subject_prefix=dlq.<domain>.
      • POST /api/v1/admin/dlq/messages/{stream_seq}/requeue
      • POST /api/v1/admin/dlq/messages/{stream_seq}/discard
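
When scripting against the admin backlog controls, it can help to build the paths in one place so the outbox and DLQ identifiers are not mixed up. The sketch below only constructs path strings from the endpoints listed above; the helper names and the ALLOWED_ACTIONS guard are hypothetical, and authentication and the HTTP call itself are left out.

```python
from urllib.parse import quote, urlencode

ADMIN_PREFIX = "/api/v1/admin"
ALLOWED_ACTIONS = ("requeue", "discard")


def _check_action(action: str) -> None:
    # Reject anything other than the two actions the runbook lists.
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")


def outbox_action_path(event_id: str, action: str) -> str:
    """Path for acting on one failed outbox row, keyed by event_id."""
    _check_action(action)
    return f"{ADMIN_PREFIX}/outbox/{quote(event_id, safe='')}/{action}"


def dlq_action_path(stream_seq: int, action: str) -> str:
    """Path for acting on one DLQ stream message, keyed by stream_seq."""
    _check_action(action)
    return f"{ADMIN_PREFIX}/dlq/messages/{int(stream_seq)}/{action}"


def dlq_list_path(domain: str) -> str:
    """Path for listing DLQ messages under the dlq.<domain>. prefix."""
    query = urlencode({"subject_prefix": f"dlq.{domain}."})
    return f"{ADMIN_PREFIX}/dlq/messages?{query}"
```

With a hypothetical domain named billing, dlq_list_path("billing") yields /api/v1/admin/dlq/messages?subject_prefix=dlq.billing. Keeping the two identifier types in separate functions makes it harder to send a stream_seq to an outbox endpoint by mistake.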

Recovery Criteria

  • Backlog is trending down toward its normal range.
  • Consumer error rates normalized.
  • Admin Ops:
    • outbox_relay_ok=true
    • dlq_pending stable at zero or actively draining
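
The Admin Ops criteria above could be checked over a short series of samples, as in the sketch below. The function name and the definition of "actively draining" (never increasing, and lower at the end than at the start) are assumptions made for illustration; the runbook does not define a specific window or slope.

```python
from typing import Sequence


def admin_ops_recovered(outbox_relay_ok: bool, dlq_samples: Sequence[int]) -> bool:
    """Return True if the Admin Ops recovery criteria appear to hold.

    dlq_samples holds dlq_pending readings, oldest first.
    """
    if not outbox_relay_ok or not dlq_samples:
        return False
    if all(value == 0 for value in dlq_samples):
        # Stable at zero.
        return True
    # Assumed meaning of "actively draining": no reading is higher than
    # the one before it, and the latest is below the earliest.
    non_increasing = all(
        later <= earlier for earlier, later in zip(dlq_samples, dlq_samples[1:])
    )
    return non_increasing and dlq_samples[-1] < dlq_samples[0]
```

A flat non-zero series such as [5, 5, 5] is deliberately treated as not recovered, since the backlog is neither at zero nor draining.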

Historical Failed Outbox Cleanup

Use discard only after confirming the events are stale and the current desired state already exists. Common demo/staging examples are old app artifact lifecycle events whose artifacts were republished or superseded.

Checklist:

  1. Confirm live DLQ stream is empty or unrelated:
    • GET /api/v1/admin/dlq/messages?page_size=50
  2. Inspect failed outbox rows:
    • GET /api/v1/admin/outbox/failed?page_size=50
  3. If rows are stale, discard each row through the admin API with an idempotency key:
    • POST /api/v1/admin/outbox/{event_id}/discard
  4. Verify Admin Ops:
    • GET /api/v1/admin/ops/overview
    • Expect queue_depth.dlq_pending=0 and service_health.outbox_relay_ok=true.
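
The discard step can be prepared as a list of request descriptors before anything is sent, which makes the batch easy to review. This is a sketch under stated assumptions: the runbook does not say how the idempotency key is transmitted, so the Idempotency-Key header name is an assumption, and the build_discard_requests helper and the key derivation are hypothetical.

```python
import uuid
from typing import Dict, Iterable, List


def build_discard_requests(event_ids: Iterable[str]) -> List[Dict[str, object]]:
    """Describe one discard call per stale outbox row (nothing is sent)."""
    requests = []
    for event_id in event_ids:
        # Derive the key from the event id so a retried run reuses the
        # same key for the same row (a design choice of this sketch).
        key = uuid.uuid5(uuid.NAMESPACE_URL, f"outbox-discard:{event_id}")
        requests.append(
            {
                "method": "POST",
                "path": f"/api/v1/admin/outbox/{event_id}/discard",
                # Assumed header name; confirm against the admin API.
                "headers": {"Idempotency-Key": str(key)},
            }
        )
    return requests
```

Deriving the key deterministically means that re-running the script after a partial failure repeats the same keys instead of minting new ones, which is the property an idempotency key is meant to provide.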

Post-Incident

  • Add/adjust consumer idempotency and retry policies.
  • Audit requeue/discard actions and classify poison-message patterns for future guardrails.