Observability Baseline

Backend Stack Decision (v1)

  • OpenTelemetry Collector as the single telemetry pipeline.
  • Prometheus for metrics.
  • Tempo for traces.
  • Loki for logs.
  • Grafana for dashboards and alerting views.
  • Vector is deferred by default in the central pipeline (add only if advanced multi-sink transforms are required); worker hosts may still run a host-local Vector log collector (see collector-backed node-agent logs below).

Reference:
- doc/architecture/Observability_Architecture.md
- doc/governance/Observability_Standards.md
- doc/operations/OTEL_Collector_Tenant_Isolation.md

Logging

  • Structured JSON logs with fields:
      • timestamp
      • level
      • service
      • correlation_id
      • trace_id / span_id (when span context exists)
      • org_id / project_id (when available)
      • error_code for failed requests/operations (catalog-aligned)
      • resource_name when the affected resource can be resolved

Runtime structured-log field contract (API/gateway/workers):
- correlation_id
- error_code
- resource_name
- org_id (tenant boundary)
- project_id (project boundary)

Three-host lab host-role field contract (required for lab evidence and triage):
- host_role, one of:
  - platform_control
  - app_control
  - worker_compute
- host_name, one of:
  - dev-control-1
  - dev-lab-1
  - dev-gpu-1
- lab_stack when a platform-app control stack is involved (for example slurm-reference)
- node_id when the real GPU worker host is involved

Field omission is allowed only when context is not yet established (for example, startup/bootstrap logs before a request scope exists).
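A minimal sketch of the contract in Go, assuming services log through log/slog's JSON handler (the wiring and field values here are illustrative, not the actual logging setup):

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON handler: every record is one structured log line. slog emits
	// "time" and "level" by default; renaming "time" to the contract's
	// "timestamp" key can be done with HandlerOptions.ReplaceAttr.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Request-scoped logger carrying the contract fields once request
	// context is established. Values are placeholders.
	reqLog := logger.With(
		"service", "gpuaas-api",
		"correlation_id", "corr-0000",
		"trace_id", "4bf92f3577b34da6a3ce929d0e0e4736",
		"span_id", "00f067aa0ba902b7",
		"org_id", "org-demo",
		"project_id", "proj-demo",
	)

	// Failed operations add error_code (catalog-aligned) and, when the
	// affected resource resolves, resource_name.
	reqLog.Error("allocation create failed",
		"error_code", "upstream_error",
		"resource_name", "alloc-42",
	)
}
```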

Tracing

  • OpenTelemetry tracing enabled for:
      • API requests
      • async worker jobs
      • external integrations (Stripe, node SSH operations)

Metrics

Required core metrics:
- API request rate, latency, error rate
- Queue depth and consumer lag
- Workflow success/failure counts
- Billing debit/credit event counts
- Webhook processing latency and failure rate

Provisioning control-loop required metrics:
- provisioning_queue_depth (gauge): backlog of provisioning dispatch work.
- provisioning_dispatch_latency_seconds (histogram): event enqueue-to-dispatch delay.
- provisioning_timeouts_total (counter): provisioning task/workflow timeout outcomes.
- provisioning_failures_total (counter): non-timeout provisioning failures.
- nats_consumer_lag{stream="PROVISIONING"} (gauge): JetStream lag for the provisioning stream.
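A hedged sketch of how these series could be registered with prometheus/client_golang (names and help strings follow the contract above; the package layout is illustrative):

```go
package provmetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Gauge: backlog of provisioning dispatch work.
	QueueDepth = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "provisioning_queue_depth",
		Help: "Backlog of provisioning dispatch work.",
	})

	// Histogram: event enqueue-to-dispatch delay, in seconds.
	DispatchLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "provisioning_dispatch_latency_seconds",
		Help:    "Event enqueue-to-dispatch delay.",
		Buckets: prometheus.DefBuckets,
	})

	// Counters: timeout vs non-timeout outcomes kept as separate series.
	TimeoutsTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "provisioning_timeouts_total",
		Help: "Provisioning task/workflow timeout outcomes.",
	})
	FailuresTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "provisioning_failures_total",
		Help: "Non-timeout provisioning failures.",
	})

	// GaugeVec: JetStream consumer lag, labeled by stream.
	NATSConsumerLag = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "nats_consumer_lag",
		Help: "JetStream consumer lag.",
	}, []string{"stream"})
)
```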

Provisioning control-loop alert objectives:
- Queue depth sustained above threshold triggers backlog alert.
- Dispatch p95 latency sustained above threshold triggers dispatch-delay alert.
- Timeout rate above threshold triggers timeout alert.
- Failure burst above threshold triggers failure-rate alert.
- All provisioning control-loop alerts must include runbook_id: ops.provisioning.stuck.
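These objectives could translate into Prometheus rules along these lines (a sketch only: thresholds and windows are illustrative and must be tuned against observed baselines, not taken from this document):

```yaml
groups:
  - name: gpuaas-provisioning-control-loop
    rules:
      - alert: GPUAASProvisioningBacklog
        expr: provisioning_queue_depth > 100        # illustrative threshold
        for: 10m
        annotations:
          runbook_id: ops.provisioning.stuck
      - alert: GPUAASProvisioningDispatchSlow
        expr: >
          histogram_quantile(0.95,
            sum by (le) (rate(provisioning_dispatch_latency_seconds_bucket[5m]))
          ) > 30
        for: 10m
        annotations:
          runbook_id: ops.provisioning.stuck
      - alert: GPUAASProvisioningTimeouts
        expr: rate(provisioning_timeouts_total[5m]) > 0.1
        for: 5m
        annotations:
          runbook_id: ops.provisioning.stuck
```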

Webhook worker baseline counters (via GET /metrics on cmd/webhook-worker):
- webhook_events_received_total
- webhook_signature_failures_total
- webhook_invalid_payload_total
- webhook_persist_failures_total
- webhook_processed_success_total
- webhook_processing_failures_total
- payments_reconcile_failed_total

API baseline counters/gauges (via GET /metrics on cmd/api):
- api_ratelimit_fail_open_total
- api_idempotency_persisted_body_json_total
- api_idempotency_skipped_body_empty_total
- api_idempotency_skipped_body_non_json_total
- api_idempotency_replays_served_total
- terminal_token_consumed_ok_total
- terminal_token_replay_rejected_total
- ws_notifications_active_connections
- ws_notifications_forwarded_messages_total
- ws_notifications_write_errors_total
- api_platform_role_list_requests_total
- api_platform_role_bind_requests_total
- api_platform_role_revoke_requests_total
- api_platform_role_mutation_success_total
- api_platform_role_mutation_failure_total
- api_platform_role_admin_denied_total
- api_platform_role_service_unavailable_total

Note:
- Platform-role counters are expected when the API runtime includes role-binding management telemetry.
- In mixed-version local environments, smoke checks warn (not fail) until the API runtime is refreshed.

Terminal stream relay observability baseline:
- session lifecycle counters (open/close/error) with close-reason labels.
- relay write error rate and drop counters at the gateway/runtime boundary.
- token replay/consume counters correlated with session failures.
- alert annotations must map to ops.terminal.gateway.

SSH key management observability baseline:
- key mutation counters (create/delete/default-switch/allocation-keyset-update).
- authorization-denied and validation-error rate for SSH key APIs.
- audit-log completeness checks for key-management mutations.
- alert annotations must map to ops.node.onboarding.

Baseline validation command:
- make ops-observability-smoke
- Script: scripts/ops/observability_smoke.sh
- Latest local evidence: doc/operations/evidence/observability_local_smoke_report.md

Correlation-first validation checks (required):
- API error envelope includes:
  - code, message, correlation_id
  - machine-readable details with at least service, and, when available, trace_id and span_id
- Terminal gateway error envelope includes:
  - the same fields as above, plus route/method metadata in details
- NATS event path preserves context (see the sketch after this list):
  - x-correlation-id in message headers
  - traceparent/tracestate propagation when trace context is present
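A sketch of the NATS publish side in Go, assuming the services use nats.go with JetStream and the OpenTelemetry SDK (the helper name and wiring are hypothetical):

```go
package events

import (
	"context"

	"github.com/nats-io/nats.go"
	"go.opentelemetry.io/otel/propagation"
)

// publishWithContext shows the header contract: x-correlation-id always
// travels with the message; traceparent/tracestate are injected only when
// ctx carries a valid span context (TraceContext.Inject is a no-op otherwise).
func publishWithContext(ctx context.Context, js nats.JetStreamContext,
	subject string, payload []byte, correlationID string) error {

	msg := nats.NewMsg(subject)
	msg.Data = payload
	msg.Header.Set("x-correlation-id", correlationID)

	// nats.Header and http.Header share an underlying type, so the
	// standard HeaderCarrier can inject the W3C trace headers directly.
	propagation.TraceContext{}.Inject(ctx, propagation.HeaderCarrier(msg.Header))

	_, err := js.PublishMsg(msg)
	return err
}
```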

Local overlay bring-up:
- make dev-up-observability
- Compose overlay: doc/operations/local-dev/docker-compose.observability.yaml
- Stack readiness check: make ops-observability-stack-smoke
- OTLP export is enabled for all core runtime services in observability mode:
  - gpuaas-api
  - gpuaas-terminal-gateway
  - gpuaas-billing-worker
  - gpuaas-provisioning-worker
  - gpuaas-webhook-worker
  - gpuaas-notification-relay
  - gpuaas-outbox-relay
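In compose terms, enabling OTLP export typically reduces to the standard OpenTelemetry SDK environment variables per service. A hypothetical excerpt (the collector hostname and port are assumptions, not the overlay's actual values):

```yaml
services:
  gpuaas-api:
    environment:
      OTEL_SERVICE_NAME: gpuaas-api
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
```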

platform_control deployed baseline:
- namespace: gpuaas-observability
- components:
  - otel-collector
  - prometheus
  - loki
  - tempo
  - grafana
  - promtail
- current public endpoints on dev-control-1:
  - Grafana: http://100.90.157.34:3001
  - Prometheus: http://100.90.157.34:9090
  - Loki: http://100.90.157.34:3100
  - Tempo: http://100.90.157.34:3200
- current proof points:
  - Prometheus scrapes gpuaas-core services
  - Tempo stores API request traces from gpuaas-api
  - Loki stores API warning/error logs from gpuaas-core

Three-host lab observability baseline:
- dev-control-1 must remain identifiable as host_role=platform_control.
- dev-lab-1 must remain identifiable as host_role=app_control.
- dev-gpu-1 must remain identifiable as host_role=worker_compute.
- dashboards, logs, and runbooks must preserve correlation_id across those boundaries.
- real GPU incidents must stay distinguishable from scheduler/platform-app control-stack incidents.
- platform-control Kubernetes logs must be queried with the live Kubernetes label model (namespace, job, host_role) plus JSON payload fields such as service, correlation_id, and trace_id; do not rely on the retired local-dev compose_service label.
- platform_control observability is automation-owned through:
  - infra/k8s/base/observability/
  - infra/ansible/roles/platform_control_k8s_observability/

collector-backed node-agent logs:
- gpuaas-node-agent and gpuaas-metrics-helper emit normal structured stdout/journald/file logs.
- Worker nodes run a host-local Vector collector when GPUAAS_NODE_LOG_COLLECTOR_ENABLED=1.
- The collector tails:
  - gpuaas-node-agent.service
  - gpuaas-metrics-helper.service
  - gpuaas-metrics-helper.timer
  - /var/log/gpuaas-node-agent*.log
- The collector forwards to gpuaas-node-log-gateway through the node-facing ingress with bounded disk buffering; forwarding never happens from service code directly, and never directly to raw Loki in production.
- Configure Vector's Loki sink endpoint as the gateway base path (https://node-api.<env>/internal/v1/node-logs). Vector appends /loki/api/v1/push itself; using the full push URL as the endpoint causes a doubled path and 404s. (A config sketch follows this list.)
- gpuaas-node-log-gateway validates the node bearer token, caps request size, forwards only Loki push batches to in-cluster Loki, and exposes node_log_gateway_* Prometheus counters.
- The collector must validate the gateway TLS chain with the node bootstrap CA bundle (GPUAAS_NODE_LOG_COLLECTOR_CA_FILE, default /etc/gpuaas/ca-bundle.crt). Do not disable certificate or hostname verification to work around node trust drift.
- Required Loki labels:
  - service=gpuaas-node-agent|gpuaas-metrics-helper
  - component=node-agent|metrics-helper
  - source=journald|self-update-finalizer
  - systemd_unit
  - host_role=worker_compute
  - host_name
  - node_id
- High-cardinality values such as task_id, allocation_id, and correlation_id stay as JSON fields and are queried with | json.
- Bootstrap ownership:
  - build/node-agent-bootstrap/observability/vector-node-logs.toml.tmpl
  - build/node-agent-bootstrap/systemd/gpuaas-node-log-collector.service.tmpl
  - build/node-agent-bootstrap/systemd/gpuaas-metrics-helper.service.tmpl
  - build/node-agent-bootstrap/systemd/gpuaas-metrics-helper.timer.tmpl
- Worker-host automation ownership:
  - infra/ansible/roles/worker_compute/
- Smoke validation:
  - make ops-node-agent-loki-smoke
  - set LOKI_BASE_URL and optionally NODE_ID.
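A trimmed sketch in the spirit of vector-node-logs.toml.tmpl, showing the base-path endpoint rule, CA pinning, and bounded disk buffering (unit names and label values follow the contract above; everything else, including the token variable, is illustrative):

```toml
[sources.node_agent_journald]
type          = "journald"
include_units = ["gpuaas-node-agent.service", "gpuaas-metrics-helper.service"]

[sinks.node_log_gateway]
type     = "loki"
inputs   = ["node_agent_journald"]
# Base path only: Vector appends /loki/api/v1/push itself. Configuring the
# full push URL doubles the path and yields 404s.
endpoint = "https://node-api.<env>/internal/v1/node-logs"
encoding.codec = "json"

  [sinks.node_log_gateway.labels]
  # Remaining required labels (component, source, systemd_unit, host_name,
  # node_id) follow the contract above.
  service   = "gpuaas-node-agent"
  host_role = "worker_compute"

  [sinks.node_log_gateway.auth]
  strategy = "bearer"
  token    = "${GPUAAS_NODE_TOKEN}"   # hypothetical variable name

  [sinks.node_log_gateway.tls]
  # Validate the gateway chain against the node bootstrap CA bundle;
  # never disable certificate or hostname verification.
  ca_file = "/etc/gpuaas/ca-bundle.crt"

  [sinks.node_log_gateway.buffer]
  type     = "disk"
  max_size = 268435488   # bytes; Vector's documented disk-buffer minimum
```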

Node-local Netdata telemetry edge:
- Worker bootstrap owns the stable Netdata edge used by platform-proxy, not the API route layer.
- Netdata listens on 127.0.0.1:19998.
- nginx listens on 0.0.0.0:19999.
- /gpuaas/telemetry/health returns ok.
- /gpuaas/telemetry/netdata/ redirects to the locally detected Netdata UI version path.
- Bootstrap ownership:
  - build/node-agent-bootstrap/nginx/gpuaas-netdata-edge.conf.tmpl
  - build/node-agent-bootstrap/install-node-agent.sh
- Worker-host automation ownership:
  - infra/ansible/roles/worker_compute/
- Existing-node repair/convergence helper:
  - scripts/ops/gpuaas_netdata_edge_converge.sh

Ops metrics query pack (backend mode):
- Use backend mode for durable totals displayed on Admin Ops views.
- The query pack must include canonical mappings (illustrated after this list) for:
  - request/error totals by service and status class
  - websocket/terminal session outcomes
  - queue/backlog and worker failure aggregates
- Query failures in backend mode must emit an explicit operator-facing degradation reason with fallback instructions.
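The canonical mappings might look like the following PromQL (a sketch only: the metric and label names here, such as api_requests_total and status_class, are assumptions, not the pack's actual series):

```promql
# Request/error totals by service and status class over the review window.
sum by (service, status_class) (increase(api_requests_total[24h]))

# Worker failure aggregates by scraped job.
sum by (job) (increase(provisioning_failures_total[24h]))
```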

Alerts

  • Alert on SLO burn for API latency/error budgets.
  • Alert on queue backlog thresholds.
  • Alert on webhook failures and billing worker failures.
  • Alert on repeated provisioning failures.
  • Alert on provisioning dispatch latency and timeout rates.
  • Alert on terminal stream relay degradation (session drop/error spikes).
  • Alert on SSH key management anomaly spikes (mutation failures/denials).
  • Alert rules should carry runbook_id annotations mapped to doc/operations/runbooks/runbooks.catalog.json.

Current local Prometheus rule pack (doc/operations/local-dev/observability/prometheus-alerts.yaml):
- GPUAASWebhookProcessingFailuresSpike -> runbook_id: ops.webhook.outage
- GPUAASTerminalTokenReplaySpike -> runbook_id: ops.terminal.gateway
- GPUAASNotificationWriteErrorsSpike -> runbook_id: ops.terminal.gateway
- GPUAASRateLimitFailOpenDetected -> runbook_id: ops.api.degradation

Alert drill command (synthetic test vectors):
- make ops-observability-alert-drill
- validates rule syntax and firing behavior via promtool test rules
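The drill consumes promtool's unit-test format. A hedged sketch (the synthetic series and expectations below are illustrative, not the drill's actual vectors):

```yaml
rule_files:
  - prometheus-alerts.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Steadily rising webhook processing failures.
      - series: 'webhook_processing_failures_total{job="gpuaas-webhook-worker"}'
        values: '0+20x15'
    alert_rule_test:
      - eval_time: 15m
        alertname: GPUAASWebhookProcessingFailuresSpike
        exp_alerts:
          - exp_annotations:
              runbook_id: ops.webhook.outage
```

Run with promtool test rules <test-file>.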

Grafana alert routing baseline (local provisioning):
- Contact points:
  - gpuaas-default (fallback)
  - gpuaas-platform
  - gpuaas-payments
- Notification policy routes by:
  - owner_team label
  - runbook_id label (explicit runbook mapping)
- Message templates:
  - gpuaas.alert.title
  - gpuaas.alert.body
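In Grafana's alerting file-provisioning terms this could look roughly as follows (a sketch only: matcher values and route order are illustrative, not the local pack's actual policy):

```yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: gpuaas-default          # fallback contact point
    routes:
      - receiver: gpuaas-payments
        object_matchers:
          - ["owner_team", "=", "payments"]
      - receiver: gpuaas-platform
        object_matchers:
          - ["runbook_id", "=~", "ops\\..+"]
```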

Dashboards

  • Service health overview
  • Provisioning workflow dashboard
  • Terminal gateway/session reliability dashboard
  • Billing and payments dashboard
  • Node fleet health dashboard
  • Security/authentication anomalies dashboard

Local Grafana pack (current auto-provisioned baseline):
- GPUaaS Control-Plane Overview (API/control-plane health + error logs)
- GPUaaS Billing & Payments (webhook and reconcile path)
- GPUaaS Terminal & Notifications (terminal token + websocket reliability)
- GPUaaS Runtime Health (process/runtime saturation by scraped job)
- GPUaaS Incident Correlation (correlation_id/trace_id pivots in Loki)
- GPUaaS Local Overview (legacy starter dashboard; retained as compatibility view)
- GPUaaS Fleet Telemetry (CPU/GPU/Memory/Storage rollups for /admin/telemetry)

Initial Grafana dashboard set ownership:
- API/control-plane reliability: Platform/API owner.
- Provisioning pipeline and worker lag: Provisioning owner.
- Terminal session and token path reliability: Terminal owner.
- Billing/payment reconciliation path: Billing owner.
- Node fleet enrollment and health posture: Platform/Inventory owner.

Three-host lab dashboard/query direction:
- platform_control:
  - GitLab, registry, control-plane stack, and observability stack health
- app_control:
  - platform-app control stacks such as slurm-reference
- worker_compute:
  - node-agent, terminal path, allocation runtime, and GPU validation
- any host-role alert should carry runbook_id: ops.lab.three-host when the first failing boundary is not yet obvious

Admin Ops decision-first observability mapping:
- Decision Header is the scan point for freshness and incident count.
- Action Required is the default entry point for degraded signals needing action now.
- Health Summary is for compact state confirmation, not primary diagnosis.
- Investigation Tools is where correlation, trace, and saved-query pivots live after the incident class is selected.
- Fleet and Sample Detail is supporting evidence only.
- Auth/login failures must stay visible as WARN/401-class incidents and must not rely on 5xx-only dashboards.

Saved query cookbook (incident-ready defaults):
- API 5xx burst by correlation:
  - Loki saved query: api_error_by_correlation_id
  - {service="gpuaas-api"} | json | status=~"5.." | correlation_id!=""
- Terminal incident join by resource_name:
  - Loki saved query: terminal_resource_name_join
  - {service=~"gpuaas-(terminal-gateway|api|notification-relay)"} | json | resource_name="<RESOURCE_NAME>"
- Provisioning timeout/failure sweep:
  - Loki saved query: provisioning_timeout_failure_window
  - {service="gpuaas-provisioning-worker"} | json | event_type=~"provisioning\\.(failed|release_failed)"
- Billing/webhook reconciliation failures:
  - Loki saved query: billing_webhook_reconcile_failures
  - {service=~"gpuaas-(billing-worker|webhook-worker)"} | json | code=~"upstream_error|service_unavailable|internal_error"
- App runtime billing reconciliation failures:
  - Loki saved query: app_runtime_billing_reconciliation
  - {service=~"gpuaas-(api|billing-worker|app-runtime-worker)"} | json | correlation_id!="" | (app_instance_id!="" or usage_source="app_runtime")
- Fleet telemetry endpoint failures:
  - Loki saved query: fleet_telemetry_api_error
  - {service="gpuaas-api"} | json | path="/api/v1/admin/telemetry/fleet" | status=~"4..|5.."
- App operator/service-account failures:
  - Loki saved query: app_operator_service_account_failure
  - {service="gpuaas-api"} | json | correlation_id!="" | operator_service_account_id!=""
- Enterprise federation failures:
  - Loki saved query: enterprise_federation_auth_failure
  - {service="gpuaas-api"} | json | correlation_id!="" |~ "(oidc|saml|federation|state)"
- Three-host lab control-plane failures:
  - Loki saved query: lab_control_plane_failure
  - {host_role="platform_control"} | json | correlation_id!=""
- Three-host GPU worker failures:
  - Loki saved query: lab_gpu_worker_failure
  - {host_role="worker_compute"} | json | correlation_id!=""
- Three-host app-control host failures:
  - Loki saved query: lab_control_host_failure
  - {host_role="app_control"} | json | correlation_id!=""
- Trace pivot helper:
  - Tempo/Grafana saved query: trace_from_correlation_id
  - Start from the API error envelope's details.trace_id, then inspect cross-service spans.
  - When details.trace_id is absent:
    1. use Loki with correlation_id to find the API log line,
    2. extract trace_id,
    3. open the trace in Tempo by ID,
    4. verify downstream spans from workers (billing, provisioning, notification, outbox) are present for async flows.

SLO/SLI shortlist (operations review baseline):
- API availability SLI: non-5xx request ratio over rolling 30d.
- API latency SLI: p95 request latency on authenticated API endpoints.
- Provisioning workflow SLI: requested->active success ratio within the SLO window.
- Terminal session SLI: successful websocket open + stable session duration ratio.
- Billing/reconcile SLI: successful webhook processing + reconcile completion ratio.
- Queue health SLI: outbox/NATS backlog age below threshold.
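For example, the availability SLI could be expressed in PromQL roughly as follows (the metric name api_requests_total and its status label are assumptions):

```promql
# Fraction of non-5xx API requests over a rolling 30d window.
1 - (
    sum(increase(api_requests_total{status=~"5.."}[30d]))
  /
    sum(increase(api_requests_total[30d]))
)
```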

Operator interpretation reference:
- doc/operations/runbooks/Admin_Ops_Dashboard_Usage_Runbook.md
- doc/operations/Ops_Runbook_Architecture.md
- doc/operations/runbooks/Three_Host_Lab_Incident_Runbook.md