Observability Architecture v1

Purpose:
- Define the telemetry backend and data flow before implementing Ops UI and operational workflows.
- Keep observability implementation contract-driven and consistent across API and workers.

1. Backend Decision (v1)

Selected stack:
- OpenTelemetry SDKs in services/workers.
- OpenTelemetry Collector as the single telemetry pipeline.
- Prometheus for metrics scrape and alert rule evaluation.
- Tempo for trace storage/query.
- Loki for log storage/query.
- Grafana for dashboards and alert operations.

Deferred:
- Vector log pipeline agent (defer unless multi-sink routing or heavy log transforms are required).

Rationale:
- One collector pipeline reduces per-service telemetry complexity.
- Prometheus/Tempo/Loki/Grafana gives metrics+traces+logs with a single operational surface.
- Matches current local stack direction and production platform baseline.

2. Topology

Local Development

- cmd/api, workers -> OTLP (gRPC/HTTP) -> OTel Collector.
- Collector exports:
  - metrics -> Prometheus.
  - traces -> Tempo.
  - logs -> Loki.
- Grafana reads Prometheus/Tempo/Loki datasources.

Production

- Edge/API/worker telemetry -> OTel Collector deployment (HA).
- Collector exports to managed or self-hosted backends:
  - metrics: Prometheus + long-term store (Mimir/Thanos) when scale requires.
  - traces: Tempo.
  - logs: Loki.
- Alerting through Grafana + Prometheus Alertmanager integration.

3. Telemetry Contract

Required resource attributes on all services:
- service.name
- service.version
- deployment.environment
- service.instance.id
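
The resource attribute requirement above could be enforced with a small startup or CI check. The following is a minimal sketch, assuming attributes are available as a plain mapping; the function name `missing_resource_attributes` is hypothetical and not part of any SDK.

```python
from __future__ import annotations

# Required keys, copied from the contract above.
REQUIRED_RESOURCE_ATTRIBUTES = (
    "service.name",
    "service.version",
    "deployment.environment",
    "service.instance.id",
)


def missing_resource_attributes(attrs: dict) -> list[str]:
    """Return the required keys that are absent or empty, in contract order."""
    return [key for key in REQUIRED_RESOURCE_ATTRIBUTES if not attrs.get(key)]
```

A service could call this at startup and refuse to boot, or only log a warning, when the returned list is non-empty; which behavior is appropriate is a policy choice this sketch does not make.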

Required request/event correlation fields:
- correlation_id (log field and span attribute)
- user_id where available (redaction policy applies)
- org_id where available
- event_id for async events

Metrics contract:
- Use stable metric names and units.
- For counters use _total suffix.
- For durations use seconds.
- Avoid high-cardinality labels (no raw UUIDs/session IDs in labels).
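
The metrics rules above lend themselves to an automated lint. The sketch below is one possible shape, not an agreed implementation: the `lint_metric` function, the `FORBIDDEN_LABELS` set, and the list of non-second unit tokens are all assumptions made for illustration.

```python
from __future__ import annotations

import re

# Hypothetical label keys treated as high-cardinality; adjust to the real policy.
FORBIDDEN_LABELS = {"session_id", "user_id", "correlation_id", "event_id"}
# Name tokens that suggest a duration is not reported in seconds (assumed list).
NON_SECOND_UNITS = {"ms", "millis", "milliseconds", "microseconds", "nanoseconds"}
NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)


def lint_metric(name: str, kind: str, labels: dict) -> list[str]:
    """Return contract violations for one metric.

    `labels` maps each label name to a sample value seen for that label.
    """
    problems = []
    if not NAME_RE.match(name):
        problems.append("invalid metric name")
    if kind == "counter" and not name.endswith("_total"):
        problems.append("counter name must end with _total")
    if NON_SECOND_UNITS & set(name.split("_")):
        problems.append("durations must be reported in seconds")
    for key, value in labels.items():
        if key in FORBIDDEN_LABELS:
            problems.append(f"label '{key}' is high-cardinality")
        elif UUID_RE.match(str(value)):
            problems.append(f"label '{key}' carries a raw UUID value")
    return problems
```

For example, `lint_metric("http_requests", "counter", {})` reports the missing `_total` suffix, while `lint_metric("http_requests_total", "counter", {"method": "GET"})` returns an empty list.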

Tracing contract:
- Every incoming HTTP request has a root span.
- NATS publish/consume creates spans linked by correlation context.
- Stripe and SSH operations use child spans with failure status tags.
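
To illustrate how publish and consume spans can stay linked, the sketch below carries context in message headers using the W3C `traceparent` format (`00-<trace-id>-<span-id>-<flags>`). It is a hand-rolled stand-in for illustration only: real services would normally rely on the OpenTelemetry propagation API, and the helper names and the `correlation_id` header key are assumptions.

```python
from __future__ import annotations

import re
import secrets

# W3C Trace Context, version 00: 32-hex trace ID, 16-hex span ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new sampled trace (what a root HTTP span would do)."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def inject(headers: dict, traceparent: str, correlation_id: str) -> dict:
    """Publisher side: copy the headers and attach trace and correlation context."""
    out = dict(headers)
    out["traceparent"] = traceparent
    out["correlation_id"] = correlation_id  # hypothetical header name
    return out


def extract_child(headers: dict) -> dict:
    """Consumer side: keep the trace ID and mint a new span ID under the publisher's span."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match:
        trace_id, parent_span_id, _flags = match.groups()
    else:
        # Missing or malformed context: start a fresh trace with no parent.
        trace_id, parent_span_id = secrets.token_hex(16), None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "span_id": secrets.token_hex(8),
        "correlation_id": headers.get("correlation_id"),
    }
```

The consumer's span shares the publisher's trace ID, so a trace backend can show both sides of the message hop in one trace, and the `correlation_id` travels alongside for log correlation.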

Logging contract:
- Structured JSON only.
- Include timestamp, level, service, message, correlation_id.
- Redaction rules follow doc/governance/Coding_Standards.md.
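
A log line satisfying the required fields above might be produced as follows. This is a minimal sketch using Python's standard `logging` module; the `JsonFormatter` class name, the lowercase level values, and the ISO 8601 UTC timestamp format are assumptions, since the contract does not fix them.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the contract's required fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
            # Set by callers via logger.info(..., extra={"correlation_id": "..."}).
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)
```

A handler would be configured with `handler.setFormatter(JsonFormatter("api"))`, and each call site passes `extra={"correlation_id": ...}`. Redaction is deliberately left out here because the rules live in doc/governance/Coding_Standards.md.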

4. Security and Retention

Security requirements:
- Telemetry transport must use TLS in production.
- Access to logs/traces/dashboards is restricted by role (admin in v1, ops role in v2).
- No secrets/tokens/private keys in logs or span attributes.
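
One way to back the "no secrets in logs or span attributes" rule is to scrub attribute maps before they are emitted. The sketch below redacts by key name only; the pattern list and the `redact` helper are hypothetical, and the authoritative rules remain those in the governance standards.

```python
import re

# Hypothetical key patterns; the real list should come from the redaction policy.
SENSITIVE_KEY_RE = re.compile(
    r"(secret|token|password|private_key|authorization|api_key)", re.I
)
REDACTED = "[REDACTED]"


def redact(attrs: dict) -> dict:
    """Return a copy of attrs with values under sensitive-looking keys masked."""
    clean = {}
    for key, value in attrs.items():
        if SENSITIVE_KEY_RE.search(key):
            clean[key] = REDACTED
        elif isinstance(value, dict):
            clean[key] = redact(value)  # recurse into nested attribute maps
        else:
            clean[key] = value
    return clean
```

Key-based matching does not catch secrets embedded in free-text values (for example a token pasted into a message string), so it complements, rather than replaces, review of what call sites log.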

Retention baseline:
- Metrics: 30 days minimum (longer with remote storage after scale trigger).
- Traces: 7 to 14 days baseline.
- Logs: 30 days baseline for operational logs, longer for security/audit streams per policy.

5. Ops UI Integration (Admin v1)

Initial UI route:
- /admin/ops (admin role only in v1).

Panel sources:
- Service health and internal stats endpoints.
- Aggregated telemetry overview endpoint (to be added to OpenAPI before coding the UI panel data fetch).
- Deep links to Grafana dashboards for detailed troubleshooting.

Rule:
- Do not query Prometheus/Loki/Tempo directly from the browser.
- The browser calls the API/BFF only; the backend enforces authz and returns sanitized operational summaries.
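
The rule above implies a backend function that checks the caller's role and returns only an allow-listed view of the telemetry. The following is a rough sketch under assumed names: `build_ops_summary`, the field allow-list, and the shape of the raw per-service records are all hypothetical until the OpenAPI contract defines them.

```python
from __future__ import annotations

# Hypothetical allow-list; the real fields belong in the OpenAPI contract.
ALLOWED_SUMMARY_FIELDS = ("service", "status", "error_rate", "p95_latency_seconds")


def build_ops_summary(role: str, raw_services: list[dict]) -> list[dict]:
    """Enforce the v1 admin-only rule, then drop every field not on the allow-list."""
    if role != "admin":
        raise PermissionError("ops summary requires the admin role")
    return [
        {field: svc.get(field) for field in ALLOWED_SUMMARY_FIELDS}
        for svc in raw_services
    ]
```

An allow-list is used rather than a deny-list so that new internal fields added to the raw telemetry stay hidden from the browser until someone explicitly exposes them.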

6. Pre-Implementation Gates

Before coding observability features:
- OpenAPI includes admin ops summary endpoint contract(s).
- AsyncAPI references any new operational events (if introduced).
- UX mock exists for /admin/ops with state matrix.
- Governance standards for telemetry are approved.