Runbook: API Degradation / High Error Rate
Trigger
- API error rate breaches threshold or p95 latency SLO burn alert fires.
- Confirm blast radius (all endpoints vs specific domain).
- Check recent deploys and rollback candidate.
- Enable traffic protection/rate-limit if abuse suspected.
Diagnostics
- Check service health dashboards and traces.
- Identify top failing endpoints.
- Inspect dependency health (DB/Redis/queue).
Mitigation
- Roll back latest release if regression confirmed.
- Scale API replicas if saturation.
- Isolate failing domain behind feature flag if possible.
Recovery Criteria
- Error rate and latency return to SLO range.
- No active queue backlog amplification.
Post-Incident
- Open RCA and capture preventive actions.