Runbook: Queue Backlog / Worker Saturation
Trigger
- Queue lag/backlog exceeds alert threshold.
- Admin Ops shows dlq_pending > 0 and/or outbox_relay_ok=false.
Immediate Actions
- Identify affected topics/consumers.
- Verify consumer health and error rates.
- Pause non-critical producers if necessary.
Diagnostics
- Check failing message patterns.
- Check downstream dependency health.
- Check for signs of a retry storm or poison messages.
- Use metric semantics:
  - outbox_relay_ok=false means active relay degradation now.
  - dlq_pending>0 with outbox_relay_ok=true means historical backlog after recovery.
- Distinguish the two backlog sources before taking action:
  - GET /api/v1/admin/dlq/messages lists live NATS DLQ stream messages.
  - GET /api/v1/admin/outbox/failed lists historical failed outbox rows; Admin Ops reports these as queue_depth.dlq_pending.
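The metric semantics above can be sketched as a small decision helper. This is an illustrative sketch only: `OpsSnapshot`, `classify_backlog`, and the returned labels are hypothetical names, and only the two metric names come from this runbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpsSnapshot:
    """Hypothetical container for the two Admin Ops metrics named in this runbook."""
    outbox_relay_ok: bool
    dlq_pending: int


def classify_backlog(snapshot: OpsSnapshot) -> str:
    """Map the two metrics to the situations described in Diagnostics."""
    if not snapshot.outbox_relay_ok:
        # The relay is degraded right now; treat as an active incident.
        return "active_relay_degradation"
    if snapshot.dlq_pending > 0:
        # The relay has recovered, but failed rows from earlier remain.
        return "historical_backlog"
    return "healthy"
```

A result of "historical_backlog" points toward the cleanup steps for failed outbox rows, while "active_relay_degradation" points toward the mitigation steps first.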
Mitigation
- Scale worker replicas for affected consumers.
- Route poison messages to DLQ.
- Apply backpressure controls on producers.
- Use admin backlog controls:
  - Outbox backlog:
    - GET /api/v1/admin/outbox/failed
    - POST /api/v1/admin/outbox/{event_id}/requeue
    - POST /api/v1/admin/outbox/{event_id}/discard
  - DLQ stream backlog:
    - GET /api/v1/admin/dlq/messages?subject_prefix=dlq.<domain>.
    - POST /api/v1/admin/dlq/messages/{stream_seq}/requeue
    - POST /api/v1/admin/dlq/messages/{stream_seq}/discard
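As a rough illustration of how the admin backlog endpoints above fit together, the sketch below only builds the method and path for each call and sends nothing. The helper names are hypothetical, and treating stream_seq as an integer is an assumption; the paths themselves are the ones listed in this runbook.

```python
from urllib.parse import quote, urlencode

API_PREFIX = "/api/v1/admin"
ALLOWED_ACTIONS = ("requeue", "discard")


def outbox_action(event_id: str, action: str) -> tuple[str, str]:
    """Return (method, path) for requeueing or discarding one failed outbox row."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    # Percent-encode the id so an unexpected character cannot alter the path.
    return ("POST", f"{API_PREFIX}/outbox/{quote(event_id, safe='')}/{action}")


def dlq_action(stream_seq: int, action: str) -> tuple[str, str]:
    """Return (method, path) for one DLQ stream message (integer seq is assumed)."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ("POST", f"{API_PREFIX}/dlq/messages/{int(stream_seq)}/{action}")


def dlq_list(domain: str) -> tuple[str, str]:
    """Return (method, path) listing DLQ messages under the dlq.<domain>. prefix."""
    query = urlencode({"subject_prefix": f"dlq.{domain}."})
    return ("GET", f"{API_PREFIX}/dlq/messages?{query}")
```

Keeping path construction separate from the HTTP client makes it easy to log or review the exact requests before any requeue or discard is sent.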
Recovery Criteria
- Backlog is trending down toward its normal range.
- Consumer error rates normalized.
- Admin Ops:
  - outbox_relay_ok=true
  - dlq_pending stable at zero or actively draining
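One possible way to turn the Admin Ops criteria above into a check over periodic samples is sketched below. The thresholds are assumptions, not part of this runbook: "stable at zero" is taken to mean the last `stable_samples` readings are all zero, and "actively draining" to mean the readings never increase and the latest is lower than the first.

```python
from typing import Sequence


def recovery_met(
    outbox_relay_ok: bool,
    dlq_pending_samples: Sequence[int],
    stable_samples: int = 3,
) -> bool:
    """Evaluate the Admin Ops recovery criteria over samples ordered oldest first."""
    if not outbox_relay_ok or not dlq_pending_samples:
        return False

    samples = list(dlq_pending_samples)
    recent = samples[-stable_samples:]
    # Stable at zero: enough samples collected, and every recent one is zero.
    stable_at_zero = len(recent) == stable_samples and all(v == 0 for v in recent)

    # Actively draining: no sample rises above its predecessor and the
    # latest value is strictly below the first one.
    non_increasing = all(b <= a for a, b in zip(samples, samples[1:]))
    draining = len(samples) >= 2 and non_increasing and samples[-1] < samples[0]

    return stable_at_zero or draining
```

For example, samples of 40, 25, 10 with a healthy relay count as draining, while 5, 5, 5 do not, because a flat non-zero backlog is neither stable at zero nor shrinking.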
Historical Failed Outbox Cleanup
Use discard only after confirming the events are stale and the current desired state already exists. Common demo/staging examples are old app artifact lifecycle events whose artifacts were republished or superseded.
Checklist:
1. Confirm live DLQ stream is empty or unrelated:
   - GET /api/v1/admin/dlq/messages?page_size=50
2. Inspect failed outbox rows:
   - GET /api/v1/admin/outbox/failed?page_size=50
3. If rows are stale, discard each row through the admin API with an idempotency key:
   - POST /api/v1/admin/outbox/{event_id}/discard
4. Verify Admin Ops:
   - GET /api/v1/admin/ops/overview
   - Expect queue_depth.dlq_pending=0 and service_health.outbox_relay_ok=true.
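The checklist above could be scripted roughly as follows. This is a sketch under several assumptions that this runbook does not confirm: list responses carry rows under an "items" key, each failed row has an "event_id" field, the idempotency key travels in an "Idempotency-Key" header, and the overview response nests its fields as queue_depth.dlq_pending and service_health.outbox_relay_ok. The HTTP transport and the staleness check are supplied by the caller, so nothing here talks to a real service.

```python
from typing import Callable

# call(method, path, headers) -> parsed JSON body; supplied by the caller.
Call = Callable[[str, str, dict], dict]


def cleanup_stale_outbox(call: Call, is_stale: Callable[[dict], bool],
                         page_size: int = 50) -> dict:
    """Walk the four checklist steps; discard only rows the caller marks stale."""
    # Step 1: stop if the live DLQ stream is not empty. Judging whether its
    # messages are "unrelated" is left to the operator.
    dlq = call("GET", f"/api/v1/admin/dlq/messages?page_size={page_size}", {})
    if dlq.get("items"):
        return {"discarded": [], "skipped": [], "verified": False,
                "aborted": "live DLQ stream is not empty"}

    # Step 2: inspect failed outbox rows.
    failed = call("GET", f"/api/v1/admin/outbox/failed?page_size={page_size}", {})

    # Step 3: discard stale rows. The key is derived from the event id so a
    # retried discard reuses the same key (header name is an assumption).
    discarded, skipped = [], []
    for row in failed.get("items", []):
        event_id = row["event_id"]
        if not is_stale(row):
            skipped.append(event_id)
            continue
        headers = {"Idempotency-Key": f"runbook-discard-{event_id}"}
        call("POST", f"/api/v1/admin/outbox/{event_id}/discard", headers)
        discarded.append(event_id)

    # Step 4: verify Admin Ops.
    overview = call("GET", "/api/v1/admin/ops/overview", {})
    verified = (overview["queue_depth"]["dlq_pending"] == 0
                and overview["service_health"]["outbox_relay_ok"] is True)
    return {"discarded": discarded, "skipped": skipped,
            "verified": verified, "aborted": None}
```

The `is_stale` callback is where the human judgment from this section belongs: it should return true only once the current desired state is confirmed to exist already, for example because the artifact was republished or superseded.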
Post-Incident
- Add/adjust consumer idempotency and retry policies.
- Audit requeue/discard actions and classify poison-message patterns for future guardrails.