Skip to content

Runbook: API Degradation / High Error Rate

Trigger

  • API error rate breaches threshold or p95 latency SLO burn alert fires.

Immediate Actions

  1. Confirm blast radius (all endpoints vs specific domain).
  2. Check recent deploys and rollback candidate.
  3. Enable traffic protection/rate-limit if abuse suspected.

Diagnostics

  • Check service health dashboards and traces.
  • Identify top failing endpoints.
  • Inspect dependency health (DB/Redis/queue).

Mitigation

  • Roll back latest release if regression confirmed.
  • Scale API replicas if saturation.
  • Isolate failing domain behind feature flag if possible.

Recovery Criteria

  • Error rate and latency return to SLO range.
  • No active queue backlog amplification.

Post-Incident

  • Open RCA and capture preventive actions.