OTel Collector Tenant Isolation and Local Wiring Guide
Purpose
- Wire the existing local observability stack so telemetry is collected end-to-end.
- Enforce tenant-safe telemetry practices early (before full multi-tenant rollout).
- Keep local disk growth bounded with explicit retention limits.
Scope
- Environment: local docker/dev stack (doc/operations/local-dev).
- Signals: traces, metrics, logs.
- Components: API/workers/gateway -> OTel Collector -> Prometheus/Tempo/Loki/Grafana.
Current Baseline (already present)
- OTel Collector service exists in doc/operations/local-dev/docker-compose.observability.yaml.
- Apps already send OTLP to the collector via OTEL_EXPORTER_OTLP_ENDPOINT.
- Collector exports:
  - traces -> Tempo
  - metrics -> Prometheus exporter
  - logs -> debug only (not persisted/searchable yet)
Implemented Local Wiring (current)
- Keep OTLP as the single ingest path for app telemetry.
- Export traces to Tempo, metrics to Prometheus, logs to Loki.
- Require tenant context attributes for tenant-scoped signals where context exists:
  - org_id (tenant boundary)
  - project_id
  - correlation_id
  - resource_name (when resolvable)
  - error_code (on failures)
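The attribute rules above can be sketched as a small helper. This is a minimal illustration, not the codebase's implementation: `RequestContext` and `tenant_attributes` are hypothetical names, and the sketch assumes the service already holds a server-resolved context object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestContext:
    """Hypothetical server-resolved context; not a type from the codebase."""
    org_id: Optional[str] = None
    project_id: Optional[str] = None
    correlation_id: Optional[str] = None
    resource_name: Optional[str] = None


def tenant_attributes(ctx: RequestContext, error_code: Optional[str] = None) -> dict:
    """Build telemetry attributes, skipping fields that are not resolved."""
    candidates = {
        "org_id": ctx.org_id,
        "project_id": ctx.project_id,
        "correlation_id": ctx.correlation_id,
        "resource_name": ctx.resource_name,  # only when resolvable
        "error_code": error_code,            # only on failures
    }
    # Emit only attributes that have a value, so signals without tenant
    # context carry no tenant keys at all.
    return {key: value for key, value in candidates.items() if value}
```

The resulting dict could then be attached to a span, metric, or log record through whichever OpenTelemetry SDK call the service already uses.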
Collector Pipeline (current)
doc/operations/local-dev/observability/otel-collector.yaml now includes:
- memory_limiter + batch processors for all pipelines.
- traces -> Tempo (otlp/tempo)
- metrics -> Prometheus exporter
- logs -> Loki OTLP HTTP ingest (otlphttp/loki), plus debug exporter for local troubleshooting.
- Next hardening step (optional): add transform processors that enforce normalized label sets for tenant-scoped logs.
Current local behavior:
- OTLP-to-Loki path is active for services emitting OTLP logs.
- Promtail is enabled in the local observability overlay for container stdout log ingestion into Loki.
Tenant Isolation Rules (MVP practical)
- Platform/admin dashboards may read cross-tenant data.
- Tenant-facing dashboards and queries must always filter by org_id.
- Any log/trace without established tenant context is treated as platform-scope, not tenant-scope.
- Never rely on client-supplied tenant labels; use server-resolved context only.
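The last two rules can be illustrated with a short sketch. The function names and the choice of `org_id` and `project_id` as the protected keys are assumptions made for this example; the actual enforcement point (middleware, collector processor, or both) is not specified here.

```python
# Keys that must only ever come from server-resolved context (assumed set).
TENANT_KEYS = ("org_id", "project_id")


def apply_server_context(client_attrs: dict, server_ctx: dict) -> dict:
    """Drop client-supplied tenant keys, then apply server-resolved ones."""
    attrs = {k: v for k, v in client_attrs.items() if k not in TENANT_KEYS}
    for key in TENANT_KEYS:
        if server_ctx.get(key):
            attrs[key] = server_ctx[key]
    return attrs


def scope_of(attrs: dict) -> str:
    """Signals without an org_id count as platform-scope."""
    return "tenant" if attrs.get("org_id") else "platform"
```

With this shape, a request that arrives carrying a spoofed `org_id` but no resolved tenant ends up with no tenant keys and is classified as platform-scope, so it cannot surface in another tenant's view.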
Verification Checklist (local)
Run after make dev-up-observability:
- Collector health: curl -sf http://localhost:13133/ (if health endpoint exposed), or container health in compose.
- Prometheus targets:
  - open http://localhost:9090/targets
  - verify gpuaas-api, gpuaas-webhook-worker, otel-collector are UP.
- Trace path:
  - generate API traffic and errors.
  - open Grafana Explore (Tempo) and confirm traces for gpuaas-api.
- Metrics path:
  - in Prometheus, query API counters (for example api_idempotency_replays_served_total).
- Logs path:
  - confirm app logs appear in Loki Explore and include correlation_id, org_id, project_id when available.
- Promtail path:
  - open http://localhost:9081/ready and confirm readiness.
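The HTTP-checkable items in this checklist could be scripted roughly as follows. This is a sketch under assumptions: it uses the Prometheus HTTP API endpoint /api/v1/targets instead of the /targets web page, and it assumes the three names above are Prometheus job labels. The function names are illustrative, and the trace and log checks in Grafana remain manual.

```python
import json
import urllib.request

# Assumed to match the Prometheus job labels in the local scrape config.
EXPECTED_JOBS = ("gpuaas-api", "gpuaas-webhook-worker", "otel-collector")


def jobs_not_up(targets_payload: dict, expected=EXPECTED_JOBS) -> list:
    """Return expected jobs with no active target reporting health "up"."""
    active = targets_payload.get("data", {}).get("activeTargets", [])
    up = {t.get("labels", {}).get("job") for t in active if t.get("health") == "up"}
    return [job for job in expected if job not in up]


def is_ready(url: str, timeout: float = 3.0) -> bool:
    """True when the endpoint answers with HTTP 2xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers connection errors and non-2xx HTTPError
        return False


def check_local_stack() -> dict:
    """Run the HTTP-checkable items; needs the local stack to be running."""
    with urllib.request.urlopen("http://localhost:9090/api/v1/targets", timeout=3.0) as resp:
        targets = json.load(resp)
    return {
        "collector_health": is_ready("http://localhost:13133/"),
        "promtail_ready": is_ready("http://localhost:9081/ready"),
        "prometheus_jobs_not_up": jobs_not_up(targets),
    }
```

Calling `check_local_stack()` after the stack is up should return both booleans as true and an empty list of jobs; anything else points at the checklist item to inspect by hand.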
Disk Growth Controls (local defaults applied)
These defaults are now set to avoid excessive local disk use:
- Prometheus (doc/operations/local-dev/docker-compose.observability.yaml):
  - --storage.tsdb.retention.time=72h
  - --storage.tsdb.retention.size=2GB
- Tempo (doc/operations/local-dev/observability/tempo/tempo.yaml):
  - compactor.compaction.block_retention: 24h
- Loki (doc/operations/local-dev/observability/loki/loki-config.yaml):
  - compactor enabled (retention_enabled: true)
  - retention period: 72h
  - ingestion and query entry limits enabled to cap local growth.
- Docker volumes cleanup
  - Periodically prune unused images/volumes in dev:
    - docker system df
    - docker volume ls
    - use existing reset targets (make dev-down-observability, make dev-reset-full) when needed.
Shared/Kubernetes Environment Notes
- Full sidecar model is not required for GPUaaS baseline.
- Prefer gateway collector deployment first; add daemonset/sidecars only for strict isolation or special routing.
- For true tenant-level isolation in shared env:
  - Loki multi-tenancy (X-Scope-OrgID) and query RBAC.
  - Tempo tenant-aware isolation or strict query policy by tenant labels.
  - Grafana datasource/folder RBAC split:
    - platform admin (cross-tenant)
    - tenant admin (tenant-scoped only).
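To make the Loki multi-tenancy item concrete, here is a minimal sketch of building a tenant-scoped query request. It assumes Loki runs with multi-tenancy enabled, that the Loki tenant ID is the same value as `org_id`, and that a trusted backend component (not the client) builds the request. The function name, base URL, and example label are hypothetical.

```python
from urllib.parse import urlencode


def loki_tenant_query(base_url: str, org_id: str, logql: str, limit: int = 100):
    """Return (url, headers) for a Loki range query scoped to one tenant."""
    if not org_id:
        # Refuse to build an unscoped request on a tenant-facing path.
        raise ValueError("org_id must come from server-resolved context")
    params = urlencode({"query": logql, "limit": limit})
    url = f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"
    # X-Scope-OrgID selects the Loki tenant whose data the query may read.
    return url, {"X-Scope-OrgID": org_id}
```

For example, `loki_tenant_query("http://localhost:3100", "org-1", '{service_name="gpuaas-api"}')` yields a URL and header pair that any HTTP client can send. Because the header is set server-side, a tenant cannot widen the scope by editing the query text.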
Rollout Sequence
- Local: enable Loki log export and retention limits.
- Local: validate end-to-end checklist above.
- Shared env: deploy collector gateway and RBAC-scoped Grafana views.
- Shared env: enforce tenant query boundaries and audit cross-tenant access.
Definition of Done for this hardening slice
- Collector exports traces+metrics+logs to persistent backends in local dev.
- Required traceability fields are visible in at least one sample per signal type.
- Retention limits are set for Prometheus/Loki/Tempo and documented.
- make ops-observability-stack-smoke remains passing.