Observability Architecture v1
Purpose:
- Define the telemetry backend and data flow before implementing Ops UI and operational workflows.
- Keep observability implementation contract-driven and consistent across API and workers.
1. Backend Decision (v1)
Selected stack:
- OpenTelemetry SDKs in services/workers.
- OpenTelemetry Collector as the single telemetry pipeline.
- Prometheus for metrics scrape and alert rule evaluation.
- Tempo for trace storage/query.
- Loki for log storage/query.
- Grafana for dashboards and alert operations.
Deferred:
- Vector log pipeline agent (defer unless multi-sink routing or heavy log transforms are required).
Rationale:
- One collector pipeline reduces per-service telemetry complexity.
- Prometheus/Tempo/Loki/Grafana gives metrics+traces+logs with a single operational surface.
- Matches current local stack direction and production platform baseline.
2. Topology
Local Development
- cmd/api, workers -> OTLP (gRPC/HTTP) -> OTel Collector.
- Collector exports:
  - metrics -> Prometheus.
  - traces -> Tempo.
  - logs -> Loki.
- Grafana reads Prometheus/Tempo/Loki datasources.
Production
- Edge/API/worker telemetry -> OTel Collector deployment (HA).
- Collector exports to managed or self-hosted backends:
  - metrics: Prometheus + long-term store (Mimir/Thanos) when scale requires.
  - traces: Tempo.
  - logs: Loki.
- Alerting through Grafana + Prometheus Alertmanager integration.
3. Telemetry Contract
Required resource attributes on all services:
- service.name
- service.version
- deployment.environment
- service.instance.id
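The required resource attributes above could be checked at service startup or in CI. The following is a minimal sketch in Python using only the standard library; the function name `missing_resource_attributes` is hypothetical, and the check treats empty strings as missing, which is an assumption rather than part of the contract.

```python
# The four attribute keys come from the contract above.
REQUIRED_RESOURCE_ATTRIBUTES = (
    "service.name",
    "service.version",
    "deployment.environment",
    "service.instance.id",
)


def missing_resource_attributes(attributes: dict) -> list:
    """Return required keys that are absent or blank, in contract order.

    Hypothetical helper: treating blank values as missing is an assumption.
    """
    return [
        key
        for key in REQUIRED_RESOURCE_ATTRIBUTES
        if not str(attributes.get(key) or "").strip()
    ]
```

A service could refuse to start, or a CI check could fail, when the returned list is non-empty.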
Required request/event correlation fields:
- correlation_id (log field and span attribute)
- user_id where available (redaction policy applies)
- org_id where available
- event_id for async events
Metrics contract:
- Use stable metric names and units.
- For counters use _total suffix.
- For durations use seconds.
- Avoid high-cardinality labels (no raw UUIDs/session IDs in labels).
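The metrics rules above lend themselves to a small lint step. The sketch below is illustrative only: `check_metric` and its `kind` values are hypothetical, the `_seconds` name suffix follows the common Prometheus naming convention rather than anything stated in this contract, and the UUID pattern is just one heuristic for spotting high-cardinality label values.

```python
import re

# Heuristic: a canonical UUID anywhere in a label value suggests high cardinality.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)
# Character set allowed in Prometheus metric names.
NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def check_metric(name: str, kind: str, labels: dict) -> list:
    """Return a list of contract violations for one metric (hypothetical helper)."""
    problems = []
    if not NAME_RE.match(name):
        problems.append("invalid metric name")
    if kind == "counter" and not name.endswith("_total"):
        problems.append("counter must end with _total")
    # Assumption: durations signal their unit with a _seconds suffix.
    if kind == "duration" and not name.endswith("_seconds"):
        problems.append("duration must be expressed in seconds")
    for key, value in labels.items():
        if UUID_RE.search(str(value)):
            problems.append(f"label {key} looks like a raw UUID")
    return problems
```

For example, `check_metric("http_requests", "counter", {})` reports the missing `_total` suffix, while `check_metric("http_requests_total", "counter", {"method": "GET"})` returns an empty list.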
Tracing contract:
- Every incoming HTTP request has a root span.
- NATS publish/consume creates spans linked by correlation context.
- Stripe and SSH operations use child spans with failure status tags.
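To illustrate how publish and consume spans might be linked, the sketch below carries trace context in message headers using the W3C `traceparent` format (`00-<trace_id>-<span_id>-<flags>`). It is a stdlib-only illustration under stated assumptions: the helper names `inject` and `extract` and the `correlation_id` header key are hypothetical, and in a real service the OpenTelemetry propagation API would normally do this work instead of hand-written code.

```python
import re
import secrets

# W3C Trace Context version 00: 32 hex trace id, 16 hex span id, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    """Generate a random 16-hex-character span id."""
    return secrets.token_hex(8)


def inject(headers: dict, trace_id: str, span_id: str, correlation_id: str) -> dict:
    """Publisher side: return a copy of headers with trace and correlation context."""
    out = dict(headers)
    out["traceparent"] = f"00-{trace_id}-{span_id}-01"  # 01 = sampled flag
    out["correlation_id"] = correlation_id  # header key is an assumption
    return out


def extract(headers: dict):
    """Consumer side: return (trace_id, parent_span_id, correlation_id) or None."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    return match.group(1), match.group(2), headers.get("correlation_id")
```

A consumer that gets a tuple back from `extract` would start its span with the extracted span id as parent; when `extract` returns `None`, it would start a new trace.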
Logging contract:
- Structured JSON only.
- Include timestamp, level, service, message, correlation_id.
- Redaction rules follow doc/governance/Coding_Standards.md.
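As one possible shape for the logging contract, the sketch below emits the five required fields as a single JSON object per record using Python's standard `logging` module. The class name `JsonFormatter`, the ISO 8601 UTC timestamp format, and the lowercase level names are assumptions; the contract itself only names the fields.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the contract's fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # Assumption: ISO 8601 in UTC; the contract does not fix a format.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
            # Set by callers via logger.info(..., extra={"correlation_id": ...}).
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)
```

Attaching the formatter to a handler (`handler.setFormatter(JsonFormatter("api"))`) and logging with `extra={"correlation_id": "req-42"}` would yield one JSON line per event.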
4. Security and Retention
Security requirements:
- Telemetry transport must use TLS in production.
- Access to logs/traces/dashboards restricted by role (admin in v1, ops role in v2).
- No secrets/tokens/private keys in logs or span attributes.
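One way to enforce the "no secrets in logs or span attributes" rule is to scrub attribute maps before they are recorded. The sketch below is a simple key-based denylist and is purely illustrative: `scrub_attributes`, the key fragments, and the `[REDACTED]` placeholder are all assumptions, and the authoritative redaction rules remain those in doc/governance/Coding_Standards.md.

```python
# Assumption: key fragments that suggest a sensitive value; not from the source policy.
SENSITIVE_KEY_PARTS = ("secret", "token", "password", "private_key", "authorization")


def scrub_attributes(attributes: dict) -> dict:
    """Return a copy with values of sensitive-looking keys replaced."""
    cleaned = {}
    for key, value in attributes.items():
        lowered = key.lower()
        if any(part in lowered for part in SENSITIVE_KEY_PARTS):
            cleaned[key] = "[REDACTED]"
        else:
            cleaned[key] = value
    return cleaned
```

A key-based filter does not catch secrets embedded in free-text values, so it would complement, not replace, care at the call site.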
Retention baseline:
- Metrics: 30 days minimum (longer with remote storage after scale trigger).
- Traces: 7 to 14 days baseline.
- Logs: 30 days baseline for operational logs, longer for security/audit streams per policy.
5. Ops UI Integration (Admin v1)
Initial UI route:
- /admin/ops (admin role only in v1).
Panel sources:
- Service health and internal stats endpoints.
- Aggregated telemetry overview endpoint (to be added to OpenAPI before coding UI panel data fetch).
- Deep links to Grafana dashboards for detailed troubleshooting.
Rule:
- Do not query Prometheus/Loki/Tempo directly from the browser.
- Browser calls API/BFF only; backend enforces authz and returns sanitized operational summaries.
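The rule above can be pictured as a backend function that checks the caller's role and copies only allowlisted fields into the response. This is a sketch under loud assumptions: the real response shape is whatever the OpenAPI contract ends up defining, and `ops_summary`, `ServiceSummary`, `Forbidden`, and the three fields shown are hypothetical placeholders.

```python
from dataclasses import asdict, dataclass


class Forbidden(Exception):
    """Raised when the caller lacks the required role."""


@dataclass(frozen=True)
class ServiceSummary:
    # Hypothetical fields; the real shape belongs in the OpenAPI contract.
    service: str
    healthy: bool
    error_rate: float


def ops_summary(role: str, raw_services: list) -> list:
    """Enforce authz, then return only allowlisted, normalized fields."""
    if role != "admin":  # admin-only in v1, per the route rule above
        raise Forbidden("admin role required")
    return [
        asdict(
            ServiceSummary(
                service=item["service"],
                healthy=bool(item["healthy"]),
                error_rate=round(float(item["error_rate"]), 4),
            )
        )
        for item in raw_services
    ]
```

Because the summary is built from an explicit allowlist, any extra fields in the raw backend data (internal hostnames, query strings, and so on) never reach the browser.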
6. Pre-Implementation Gates
Before coding observability features:
- OpenAPI includes admin ops summary endpoint contract(s).
- AsyncAPI references any new operational events (if introduced).
- UX mock exists for /admin/ops with state matrix.
- Governance standards for telemetry are approved.