Observability Standards¶
Purpose: - Define mandatory telemetry standards before observability implementation starts. - Ensure consistent, secure, low-noise operational signals across services.
1. Required Telemetry Outputs¶
Every runtime component (cmd/api, all workers) must emit:
- structured logs
- OpenTelemetry traces
- Prometheus-compatible metrics
All signals must include correlation context where applicable.
2. Required Fields¶
Required log fields:
- timestamp
- level
- service
- message
- correlation_id
Required span attributes:
- service.name
- deployment.environment
- correlation_id
Required metric dimensions:
- Keep labels bounded and low-cardinality.
- Allowed: service, route_template, status_code_class, operation.
- Forbidden: raw UUIDs, session IDs, token IDs, email addresses.
3. Naming and Units¶
Rules:
- Counters use _total.
- Durations in seconds.
- Monetary values in minor units (integers).
- Retry/failure metrics must be explicit (*_failures_total, *_retries_total).
4. Redaction and Data Safety¶
Never emit into logs/spans/metrics labels: - access/refresh/id tokens - passwords and password hashes - SSH private key material - payment card data - full webhook payloads containing PII
Reference:
- doc/governance/Coding_Standards.md sanitize/redaction rules are mandatory.
5. Error and Correlation Requirements¶
Requirements:
- Every error response must include canonical ErrorResponse with correlation_id.
- X-Correlation-ID must be returned on every HTTP response.
- Async handlers must carry correlation IDs from event envelope to logs/spans.
6. Dashboards and Alerts (Minimum v1)¶
Must-have dashboards: - API SLO dashboard (latency/error/availability) - Billing + payment processing dashboard - Provisioning workflow dashboard - Queue/relay health dashboard
Must-have alerts: - API error-rate burn - high p95 latency - outbox relay publish failures/backlog growth - webhook processing failures - worker heartbeat loss
7. CI and Review Gates¶
Before merge for observability-impacting changes: - unit tests for new metrics/log fields (where practical) - integration smoke for telemetry endpoints where available - documentation updated for added/changed metrics - no new high-cardinality labels introduced
Review checklist: - [ ] Correlation preserved end-to-end - [ ] PII-safe telemetry - [ ] Dashboard/alert impact documented - [ ] Contract docs updated before code