Skip to content

Database Latency or Failover Runbook

Trigger

  • Elevated DB query latency, saturation, or connection errors.
  • Planned/unplanned failover event.

Impact

  • API/worker degradation across provisioning, billing, and auth flows.
  • Queue backlog growth and potential timeout cascades.

Immediate Mitigation

  1. Confirm DB primary availability and connection pool pressure.
  2. Reduce non-essential load (batch jobs/heavy admin queries).
  3. If failover in progress, hold writes where possible until primary stabilizes.

Diagnosis

  1. Check DB health metrics (CPU, IOPS, locks, replication lag).
  2. Inspect slow query logs and top wait events.
  3. Verify app pool settings and retry behavior.
  4. Confirm outbox relay and workers recover after DB connectivity returns.

Recovery

  1. Complete failover or restore primary service.
  2. Re-enable paused workloads in controlled order.
  3. Validate critical write/read paths (allocations, ledger, outbox, auth).
  4. Run backup/restore integrity checks if data risk suspected.

Post-Incident

  • Capture failover timeline and achieved RTO/RPO.
  • Add query/index/pool tuning actions as follow-ups.