OTel Collector Tenant Isolation and Local Wiring Guide

Purpose

  • Wire the existing local observability stack so telemetry is collected end-to-end.
  • Enforce tenant-safe telemetry practices early (before full multi-tenant rollout).
  • Keep local disk growth bounded with explicit retention limits.

Scope

  • Environment: local docker/dev stack (doc/operations/local-dev).
  • Signals: traces, metrics, logs.
  • Components: API/workers/gateway -> OTel Collector -> Prometheus/Tempo/Loki/Grafana.

Current Baseline (already present)

  • OTel Collector service exists in doc/operations/local-dev/docker-compose.observability.yaml.
  • Apps already send OTLP to collector via OTEL_EXPORTER_OTLP_ENDPOINT.
  • Collector exports:
      • traces -> Tempo
      • metrics -> Prometheus exporter
      • logs -> debug only (not persisted/searchable yet)

Implemented Local Wiring (current)

  1. Keep OTLP as the single ingest path for app telemetry.
  2. Export traces to Tempo, metrics to Prometheus, logs to Loki.
  3. Require tenant context attributes for tenant-scoped signals where context exists:
       • org_id (tenant boundary)
       • project_id
       • correlation_id
       • resource_name (when resolvable)
       • error_code (on failures)
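As a concrete reference point, a tenant-scoped structured log record carrying the required attributes could look like the following (all values are hypothetical; field names match the list above):

```json
{
  "timestamp": "2024-01-01T12:00:00Z",
  "severity": "error",
  "message": "instance provisioning failed",
  "org_id": "org_123",
  "project_id": "proj_456",
  "correlation_id": "4f2a9c1e-7b3d-4e8a-9f10-2c5d6e7f8a9b",
  "resource_name": "gpu-node-a1",
  "error_code": "PROVISION_TIMEOUT"
}
```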

Collector Pipeline (current)

  • doc/operations/local-dev/observability/otel-collector.yaml now includes:
      • memory_limiter + batch processors for all pipelines.
      • traces -> Tempo (otlp/tempo)
      • metrics -> Prometheus exporter
      • logs -> Loki OTLP HTTP ingest (otlphttp/loki), plus a debug exporter for local troubleshooting.
  • Next hardening step (optional): add transform processors that enforce normalized label sets for tenant-scoped logs.
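The pipeline shape described above can be sketched as follows. This is a minimal illustration, not a copy of the repo file: exporter names (otlp/tempo, otlphttp/loki) come from this doc, while endpoints and limits are assumptions for a typical local compose network.

```yaml
# Sketch of the pipeline shape in otel-collector.yaml.
# Endpoints and limit values are illustrative; the repo file is authoritative.
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 256
  batch: {}

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true          # local dev only
  prometheus:
    endpoint: 0.0.0.0:8889
  otlphttp/loki:
    endpoint: http://loki:3100/otlp
  debug: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki, debug]
```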

Current local behavior:

  • The OTLP-to-Loki path is active for services emitting OTLP logs.
  • Promtail is enabled in the local observability overlay to ingest container stdout logs into Loki.
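For the Promtail leg, a minimal scrape setup for container stdout could look like the sketch below. The listen port matches the readiness URL used in the verification checklist; the Loki URL and Docker socket path are assumptions for the local overlay.

```yaml
# Sketch of a minimal promtail config for container stdout ingestion.
# Loki URL and socket path are assumptions; the repo overlay is authoritative.
server:
  http_listen_port: 9081

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'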

Tenant Isolation Rules (MVP practical)

  1. Platform/admin dashboards may read cross-tenant data.
  2. Tenant-facing dashboards and queries must always filter by org_id.
  3. Any log/trace without established tenant context is treated as platform-scope, not tenant-scope.
  4. Never rely on client-supplied tenant labels; use server-resolved context only.
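Rule 2 in practice: every tenant-facing Loki query carries an explicit org_id constraint. A hedged example, assuming logs are stored as JSON with the attribute names from this doc and a hypothetical tenant value:

```logql
# Tenant-facing query: always pinned to a single org_id (value is hypothetical).
{service_name="gpuaas-api"} | json | org_id = `org_123`
```

Records that fall out of this filter because they lack org_id are platform-scope per rule 3 and should only appear on platform/admin dashboards.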

Verification Checklist (local)

Run after make dev-up-observability:

  1. Collector health:
       • curl -sf http://localhost:13133/ (if the health endpoint is exposed), or check container health in compose.
  2. Prometheus targets:
       • open http://localhost:9090/targets
       • verify gpuaas-api, gpuaas-webhook-worker, and otel-collector are UP.
  3. Trace path:
       • generate API traffic and errors.
       • open Grafana Explore (Tempo) and confirm traces for gpuaas-api.
  4. Metrics path:
       • in Prometheus, query API counters (for example api_idempotency_replays_served_total).
  5. Logs path:
       • confirm app logs appear in Loki Explore and include correlation_id, org_id, and project_id when available.
  6. Promtail path:
       • open http://localhost:9081/ready and confirm readiness.
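The endpoint probes in the checklist can be scripted as a best-effort smoke pass. This is a sketch, not the existing smoke target: ports come from the checklist above, and a failed probe is reported rather than aborting, so it stays usable while parts of the stack are down.

```shell
#!/bin/sh
# Best-effort probes for the local checklist endpoints.
# A failed probe prints a warning; the script itself always exits 0.
CHECKS=0

probe() {
  url="$1"; name="$2"
  CHECKS=$((CHECKS + 1))
  if curl -sf --max-time 3 "$url" > /dev/null 2>&1; then
    echo "ok:   $name ($url)"
  else
    echo "warn: $name not reachable ($url)"
  fi
}

probe http://localhost:13133/        "otel-collector health"
probe http://localhost:9090/-/ready  "prometheus readiness"
probe http://localhost:9081/ready    "promtail readiness"

echo "$CHECKS endpoints probed"
```

Trace, metric, and log content still need the manual Grafana/Prometheus checks above; this only covers reachability.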

Disk Growth Controls (local defaults applied)

These defaults are now set to avoid excessive local disk use:

  1. Prometheus
       • doc/operations/local-dev/docker-compose.observability.yaml:
           • --storage.tsdb.retention.time=72h
           • --storage.tsdb.retention.size=2GB
  2. Tempo
       • doc/operations/local-dev/observability/tempo/tempo.yaml:
           • compactor.compaction.block_retention: 24h
  3. Loki
       • doc/operations/local-dev/observability/loki/loki-config.yaml:
           • compactor enabled (retention_enabled: true)
           • retention period: 72h
           • ingestion and query entry limits enabled to cap local growth.
  4. Docker volumes cleanup
       • Periodically prune unused images/volumes in dev:
           • docker system df
           • docker volume ls
           • use the existing reset targets (make dev-down-observability, make dev-reset-full) when needed.
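The Loki retention knobs listed above map to config roughly as follows. A minimal sketch, assuming Loki 3.x with filesystem storage; the specific limit values are illustrative, only retention_period: 72h is taken from this doc.

```yaml
# Sketch of the retention-related sections of loki-config.yaml.
# Limit values are illustrative; the repo file is authoritative.
compactor:
  retention_enabled: true
  delete_request_store: filesystem   # required when retention is enabled

limits_config:
  retention_period: 72h
  ingestion_rate_mb: 4               # caps local ingest volume
  max_entries_limit_per_query: 5000  # caps per-query result size
```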

Shared/Kubernetes Environment Notes

  • A full sidecar model is not required for the GPUaaS baseline.
  • Prefer a gateway collector deployment first; add daemonsets/sidecars only for strict isolation or special routing needs.
  • For true tenant-level isolation in a shared env:
      • Loki multi-tenancy (X-Scope-OrgID) and query RBAC.
      • Tempo tenant-aware isolation or a strict query policy keyed on tenant labels.
      • Grafana datasource/folder RBAC split:
          • platform admin (cross-tenant)
          • tenant admin (tenant-scoped only).
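One way the Loki multi-tenancy split could surface in Grafana is a per-tenant datasource that pins X-Scope-OrgID, so tenant-admin dashboards physically cannot query other tenants. A provisioning sketch; the datasource name and tenant value are hypothetical:

```yaml
# Sketch: tenant-scoped Grafana Loki datasource pinned to one tenant
# via the X-Scope-OrgID header. Name and tenant value are hypothetical.
apiVersion: 1
datasources:
  - name: Loki (tenant org_123)
    type: loki
    url: http://loki:3100
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: org_123
```

The platform-admin datasource would omit the header (or use a cross-tenant tenant ID) and live in an admin-only Grafana folder.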

Rollout Sequence

  1. Local: enable Loki log export and retention limits.
  2. Local: validate end-to-end checklist above.
  3. Shared env: deploy collector gateway and RBAC-scoped Grafana views.
  4. Shared env: enforce tenant query boundaries and audit cross-tenant access.

Definition of Done for this hardening slice

  • Collector exports traces+metrics+logs to persistent backends in local dev.
  • Required traceability fields are visible in at least one sample per signal type.
  • Retention limits are set for Prometheus/Loki/Tempo and documented.
  • make ops-observability-stack-smoke remains passing.