Observability stack
Implemented
doc/architecture/Observability_Architecture.md · doc/governance/Observability_Standards.md · doc/operations/Observability_Baseline.md
Stack
flowchart LR
classDef src fill:#e8f5e9,stroke:#2e7d32
classDef col fill:#fff3e0,stroke:#e65100
classDef bk fill:#e3f2fd,stroke:#1565c0
classDef ui fill:#f3e5f5,stroke:#6a1b9a
subgraph SRC[Sources]
A[cmd/api]:::src
B[cmd/provisioning-worker]:::src
C[cmd/billing-worker]:::src
D[cmd/node-agent]:::src
E[other workers]:::src
F[web]:::src
end
subgraph COL[Collectors]
OC[OTel Collector]:::col
LP[Log shippers /<br/>node-log-gateway]:::col
end
subgraph BK[Backends]
PROM[(Prometheus<br/>metrics)]:::bk
TEMPO[(Tempo<br/>traces)]:::bk
LOKI[(Loki<br/>logs)]:::bk
end
subgraph UI[UI]
GRAF[Grafana<br/>dashboards + alerts]:::ui
OPS[/admin/ops/<br/>in-product ops panel/]:::ui
end
SRC --> OC
SRC --> LP
OC --> PROM
OC --> TEMPO
LP --> LOKI
PROM --> GRAF
TEMPO --> GRAF
LOKI --> GRAF
PROM --> OPS
Mandatory implementation rules
From Coding_Standards.md §Traceability-First Implementation Rules:
- Every runtime binary under cmd/ (except documented edge agents) initializes OTel via middleware.SetupOTel(...).
- Every HTTP server binary wraps routers with middleware.Tracing("<service-name>")(middleware.CorrelationID(...)).
- Async consumers create a processing span per message with correlation_id, event.type, event.id, subject.
- Mutation handlers create child spans for: project/tenant resolution, domain service call, audit/outbox write.
- Error paths set span error status + catalog-aligned error_code.
- New services must have OTEL_EXPORTER_OTLP_ENDPOINT wired.
CI gate: scripts/ci/observability_trace_gate.sh, also exposed as make ops-observability-trace-gate.
Sanitize-first
Sensitive fields never appear in logs or trace attributes:
- password, password_hash
- access_token, refresh_token, id_token
- ssh_private_key, ssh_private_key_enc
- stripe_customer_id, payment_reference
- High-volume PII (email, username as identifier)
- Anything from access_secret_enc or scheduler_metadata that may contain credentials
Redaction format: replace the value with [REDACTED] rather than omitting the field, which preserves the record's structure for debugging.
A CI gate ensures middleware.Sanitize is invoked at the service boundary.
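A minimal sketch of the replace-don't-omit behavior, assuming log fields arrive as a nested map. The Redact function and the sensitiveKeys table are hypothetical and written for this example; they are not the project's middleware.Sanitize. As a conservative simplification, access_secret_enc and scheduler_metadata are redacted whole rather than inspected for credentials.

```go
package main

import "fmt"

// redacted is the placeholder that replaces every sensitive value.
const redacted = "[REDACTED]"

// sensitiveKeys mirrors the field names listed in this section.
// Hypothetical table for illustration; the real list lives in the project.
var sensitiveKeys = map[string]bool{
	"password": true, "password_hash": true,
	"access_token": true, "refresh_token": true, "id_token": true,
	"ssh_private_key": true, "ssh_private_key_enc": true,
	"stripe_customer_id": true, "payment_reference": true,
	"email": true, "username": true,
	"access_secret_enc": true, "scheduler_metadata": true,
}

// Redact returns a copy of fields in which sensitive values are replaced by
// the placeholder. Keys are kept, so the record's shape survives redaction.
// Nested maps are walked recursively; the input is not modified.
func Redact(fields map[string]any) map[string]any {
	out := make(map[string]any, len(fields))
	for k, v := range fields {
		if sensitiveKeys[k] {
			out[k] = redacted
			continue
		}
		if nested, ok := v.(map[string]any); ok {
			out[k] = Redact(nested)
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	in := map[string]any{
		"user_id":  "u-1",
		"password": "hunter2",
		"auth":     map[string]any{"access_token": "tok", "scheme": "bearer"},
	}
	fmt.Println(Redact(in))
}
```

Because the key stays present, a reader of the log can still tell that a password or token was supplied without seeing its value.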
The audit log metadata jsonb column has an allowlist: reason, policy_key, old_value, new_value, status_from, status_to, error_code, request_scope, idempotency_key_hash, provider_ref, allocation_id, node_id. Unknown keys are rejected.
SLOs (from PRD non-functional requirements)
| Surface | SLO target |
|---|---|
| Public API p95 | < 300 ms |
| Public API p99 | < 1 s |
| Authorization membership-resolution p95 | ≤ 20 ms |
| Authorization membership-resolution p99 | ≤ 50 ms |
| Allocation create → provisioning event emitted | < 1 s |
| Webhook signature verification → ledger credit | < 5 s |
Source: PRD.md §10 Non-Functional, Testing_Standards.md §Authorization SLO.
Tenant isolation (OTel)
→ Read source: OTEL_Collector_Tenant_Isolation.md.
The OTel collector enforces tenant isolation on trace/log labels so a noisy tenant cannot drown out neighbors' telemetry.
In-product ops panel
/admin/ops/overview is a role-gated, aggregated, sanitized read of platform telemetry:
- Health rollup
- Queue depth (NATS + node_tasks)
- Throughput
- Error rates
- Recent incidents
Per PRD.md §FR-12: aggregated, no raw infra secrets/tokens, admin-only.
Watchlist
→ Read source: Scalability_Security_Watchlist.md, Watchlist_Phase_Schedule.md.
The watchlist tracks emerging scalability or security signals before they become incidents.