Database Latency or Failover Runbook
Trigger
- Elevated DB query latency, saturation, or connection errors.
- Planned/unplanned failover event.
Impact
- API/worker degradation across provisioning, billing, and auth flows.
- Queue backlog growth and potential timeout cascades.
- Confirm DB primary availability and connection pool pressure.
- Reduce non-essential load (batch jobs/heavy admin queries).
- If failover in progress, hold writes where possible until primary stabilizes.
Diagnosis
- Check DB health metrics (CPU, IOPS, locks, replication lag).
- Inspect slow query logs and top wait events.
- Verify app pool settings and retry behavior.
- Confirm outbox relay and workers recover after DB connectivity returns.
Recovery
- Complete failover or restore primary service.
- Re-enable paused workloads in controlled order.
- Validate critical write/read paths (allocations, ledger, outbox, auth).
- Run backup/restore integrity checks if data risk suspected.
Post-Incident
- Capture failover timeline and achieved RTO/RPO.
- Add query/index/pool tuning actions as follow-ups.