
Observability stack

Implemented

Source: doc/architecture/Observability_Architecture.md · doc/governance/Observability_Standards.md · doc/operations/Observability_Baseline.md

Stack

flowchart LR
    classDef src fill:#e8f5e9,stroke:#2e7d32
    classDef col fill:#fff3e0,stroke:#e65100
    classDef bk  fill:#e3f2fd,stroke:#1565c0
    classDef ui  fill:#f3e5f5,stroke:#6a1b9a

    subgraph SRC[Sources]
      A[cmd/api]:::src
      B[cmd/provisioning-worker]:::src
      C[cmd/billing-worker]:::src
      D[cmd/node-agent]:::src
      E[other workers]:::src
      F[web]:::src
    end

    subgraph COL[Collectors]
      OC[OTel Collector]:::col
      LP[Log shippers /<br/>node-log-gateway]:::col
    end

    subgraph BK[Backends]
      PROM[(Prometheus<br/>metrics)]:::bk
      TEMPO[(Tempo<br/>traces)]:::bk
      LOKI[(Loki<br/>logs)]:::bk
    end

    subgraph UI[UI]
      GRAF[Grafana<br/>dashboards + alerts]:::ui
      OPS[/admin/ops/<br/>in-product ops panel/]:::ui
    end

    SRC --> OC
    SRC --> LP
    OC --> PROM
    OC --> TEMPO
    LP --> LOKI
    PROM --> GRAF
    TEMPO --> GRAF
    LOKI --> GRAF
    PROM --> OPS

Mandatory implementation rules

From Coding_Standards.md §Traceability-First Implementation Rules:

  1. Every runtime binary under cmd/ (except documented edge agents) initializes OTel via middleware.SetupOTel(...).
  2. Every HTTP server binary wraps routers with middleware.Tracing("<service-name>")(middleware.CorrelationID(...)).
  3. Async consumers create a processing span per message with correlation_id, event.type, event.id, subject.
  4. Mutation handlers create child spans for: project/tenant resolution, domain service call, audit/outbox write.
  5. Error paths set span error status + catalog-aligned error_code.
  6. New services must have OTEL_EXPORTER_OTLP_ENDPOINT wired.

CI gate: scripts/ci/observability_trace_gate.sh, also exposed as make ops-observability-trace-gate.
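The composition in rule 2, middleware.Tracing(...)(middleware.CorrelationID(...)), suggests the usual Go decorator pattern of func(http.Handler) http.Handler. The sketch below is a standard-library stand-in for the correlation ID half only, not the project's middleware package: the header name X-Correlation-ID, the resolveID helper, and CorrelationIDFrom are all assumptions made for illustration.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"net/http"
)

// correlationKey is an unexported context key type, so other packages
// cannot collide with it.
type correlationKey struct{}

// headerName is an assumption; this page does not name the real header.
const headerName = "X-Correlation-ID"

// resolveID reuses a non-empty inbound ID, otherwise mints a random
// 16-byte identifier rendered as 32 hex characters.
func resolveID(inbound string) string {
	if inbound != "" {
		return inbound
	}
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "unknown"
	}
	return hex.EncodeToString(b)
}

// CorrelationID (hypothetical stand-in) stores the resolved ID in the
// request context and echoes it on the response before calling next.
func CorrelationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := resolveID(r.Header.Get(headerName))
		w.Header().Set(headerName, id)
		ctx := context.WithValue(r.Context(), correlationKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// CorrelationIDFrom returns the ID stored by CorrelationID, or "" if the
// context carries none.
func CorrelationIDFrom(ctx context.Context) string {
	id, _ := ctx.Value(correlationKey{}).(string)
	return id
}
```

A tracing middleware wrapped around this one can then read CorrelationIDFrom(r.Context()) inside downstream handlers and attach the value to spans as correlation_id.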

Sanitize-first

Sensitive fields never appear in logs or trace attributes:

  • password, password_hash
  • access_token, refresh_token, id_token
  • ssh_private_key, ssh_private_key_enc
  • stripe_customer_id, payment_reference
  • High-volume PII (email, username as identifier)
  • Anything from access_secret_enc or scheduler_metadata that may contain credentials

Redaction format: replace the value with [REDACTED]. Never omit the field, so the record keeps its structure for debugging.

A CI gate ensures middleware.Sanitize is invoked at the service boundary.
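To illustrate the replace-not-omit rule, here is a minimal sketch of a recursive redactor over decoded JSON-like fields. It is not the implementation of middleware.Sanitize, whose signature this page does not show. The Redact function, the case-insensitive key match, and the choice to redact scheduler_metadata and access_secret_enc wholesale are assumptions.

```go
package main

import "strings"

// sensitiveKeys mirrors the field names listed above. Treating
// scheduler_metadata and access_secret_enc as fully sensitive is a
// conservative simplification for this sketch.
var sensitiveKeys = map[string]bool{
	"password": true, "password_hash": true,
	"access_token": true, "refresh_token": true, "id_token": true,
	"ssh_private_key": true, "ssh_private_key_enc": true,
	"stripe_customer_id": true, "payment_reference": true,
	"email": true, "username": true,
	"access_secret_enc": true, "scheduler_metadata": true,
}

const redacted = "[REDACTED]"

// Redact returns a copy of fields in which every sensitive value is
// replaced by the placeholder. Keys are never dropped, so the output has
// the same shape as the input.
func Redact(fields map[string]any) map[string]any {
	out := make(map[string]any, len(fields))
	for k, v := range fields {
		if sensitiveKeys[strings.ToLower(k)] {
			out[k] = redacted
		} else {
			out[k] = redactValue(v)
		}
	}
	return out
}

// redactValue descends into nested maps and slices; scalars pass through.
func redactValue(v any) any {
	switch t := v.(type) {
	case map[string]any:
		return Redact(t)
	case []any:
		cp := make([]any, len(t))
		for i, e := range t {
			cp[i] = redactValue(e)
		}
		return cp
	default:
		return v
	}
}
```

Because the input map is copied rather than mutated, the caller can keep using the original values for business logic while only the redacted copy reaches log and span attributes.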

The audit log metadata jsonb accepts only allowlisted keys: reason, policy_key, old_value, new_value, status_from, status_to, error_code, request_scope, idempotency_key_hash, provider_ref, allocation_id, node_id. Unknown keys are rejected.
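The allowlist check can be sketched as a small validator that runs before the audit row is written. The function name ValidateAuditMetadata and the error text are hypothetical; only the key list comes from this page.

```go
package main

import (
	"fmt"
	"sort"
)

// auditMetadataAllowlist holds the permitted metadata keys listed above.
var auditMetadataAllowlist = map[string]struct{}{
	"reason": {}, "policy_key": {}, "old_value": {}, "new_value": {},
	"status_from": {}, "status_to": {}, "error_code": {},
	"request_scope": {}, "idempotency_key_hash": {}, "provider_ref": {},
	"allocation_id": {}, "node_id": {},
}

// ValidateAuditMetadata returns nil when every key is allowlisted, and
// otherwise an error naming all unknown keys in sorted order.
func ValidateAuditMetadata(meta map[string]any) error {
	var unknown []string
	for k := range meta {
		if _, ok := auditMetadataAllowlist[k]; !ok {
			unknown = append(unknown, k)
		}
	}
	if len(unknown) == 0 {
		return nil
	}
	sort.Strings(unknown) // map iteration is unordered; sort for stable errors
	return fmt.Errorf("audit metadata: unknown keys %v", unknown)
}
```

Rejecting the write outright, rather than silently dropping unknown keys, keeps the failure visible to the caller and matches the "unknown keys rejected" wording above.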

SLOs (from PRD non-functional requirements)

| Surface | SLO target |
|---|---|
| Public API p95 | < 300 ms |
| Public API p99 | < 1 s |
| Authorization membership-resolution p95 | ≤ 20 ms |
| Authorization membership-resolution p99 | ≤ 50 ms |
| Allocation create → provisioning event emitted | < 1 s |
| Webhook signature verification → ledger credit | < 5 s |

Source: PRD.md §10 Non-Functional, Testing_Standards.md §Authorization SLO.
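As a worked illustration of checking the Public API targets against raw latency samples, the sketch below uses the nearest-rank percentile definition. The function names, the choice of nearest-rank over interpolation, and the decision to treat an empty sample set as failing are assumptions. In production these values would more likely come from Prometheus histograms than from in-process samples.

```go
package main

import (
	"math"
	"sort"
)

// Percentile returns the nearest-rank p-th percentile (0 < p <= 100) of
// latency samples given in milliseconds. It returns 0 for empty input and
// does not modify the caller's slice.
func Percentile(samplesMs []float64, p float64) float64 {
	n := len(samplesMs)
	if n == 0 {
		return 0
	}
	sorted := append([]float64(nil), samplesMs...)
	sort.Float64s(sorted)
	// Nearest rank: the smallest rank r such that r/n >= p/100.
	rank := int(math.Ceil(p * float64(n) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > n {
		rank = n
	}
	return sorted[rank-1]
}

// MeetsPublicAPISLO reports whether samples satisfy p95 < 300 ms and
// p99 < 1 s. No data is treated as a failure (an assumption), so a silent
// service cannot pass vacuously.
func MeetsPublicAPISLO(samplesMs []float64) bool {
	if len(samplesMs) == 0 {
		return false
	}
	return Percentile(samplesMs, 95) < 300 && Percentile(samplesMs, 99) < 1000
}
```

With 100 samples, the p99 value is the 99th smallest sample, so a single slow outlier does not breach the target but two do.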

Tenant isolation (OTel)

→ Read source: OTEL_Collector_Tenant_Isolation.md.

The OTel collector enforces tenant isolation on trace/log labels so a noisy tenant cannot drown out neighbors' telemetry.

In-product ops panel

/admin/ops/overview is a role-gated, aggregated, sanitized read of platform telemetry:

  • Health rollup
  • Queue depth (NATS + node_tasks)
  • Throughput
  • Error rates
  • Recent incidents

Per PRD.md §FR-12, the panel shows aggregated data only, exposes no raw infra secrets or tokens, and is admin-only.

Watchlist

→ Read source: Scalability_Security_Watchlist.md, Watchlist_Phase_Schedule.md.

The watchlist tracks emerging scalability or security signals before they become incidents.

Where to look next