Skip to content

Observability Standards

Purpose: - Define mandatory telemetry standards before observability implementation starts. - Ensure consistent, secure, low-noise operational signals across services.

1. Required Telemetry Outputs

Every runtime component (cmd/api, all workers) must emit: - structured logs - OpenTelemetry traces - Prometheus-compatible metrics

All signals must include correlation context where applicable.

2. Required Fields

Required log fields: - timestamp - level - service - message - correlation_id

Required span attributes: - service.name - deployment.environment - correlation_id

Required metric dimensions: - Keep labels bounded and low-cardinality. - Allowed: service, route_template, status_code_class, operation. - Forbidden: raw UUIDs, session IDs, token IDs, email addresses.

3. Naming and Units

Rules: - Counters use _total. - Durations in seconds. - Monetary values in minor units (integers). - Retry/failure metrics must be explicit (*_failures_total, *_retries_total).

4. Redaction and Data Safety

Never emit into logs/spans/metrics labels: - access/refresh/id tokens - passwords and password hashes - SSH private key material - payment card data - full webhook payloads containing PII

Reference: - doc/governance/Coding_Standards.md sanitize/redaction rules are mandatory.

5. Error and Correlation Requirements

Requirements: - Every error response must include canonical ErrorResponse with correlation_id. - X-Correlation-ID must be returned on every HTTP response. - Async handlers must carry correlation IDs from event envelope to logs/spans.

6. Dashboards and Alerts (Minimum v1)

Must-have dashboards: - API SLO dashboard (latency/error/availability) - Billing + payment processing dashboard - Provisioning workflow dashboard - Queue/relay health dashboard

Must-have alerts: - API error-rate burn - high p95 latency - outbox relay publish failures/backlog growth - webhook processing failures - worker heartbeat loss

7. CI and Review Gates

Before merge for observability-impacting changes: - unit tests for new metrics/log fields (where practical) - integration smoke for telemetry endpoints where available - documentation updated for added/changed metrics - no new high-cardinality labels introduced

Review checklist: - [ ] Correlation preserved end-to-end - [ ] PII-safe telemetry - [ ] Dashboard/alert impact documented - [ ] Contract docs updated before code