Observability Baseline

Backend Stack Decision (v1)

  • OpenTelemetry Collector as the single telemetry pipeline.
  • Prometheus for metrics.
  • Tempo for traces.
  • Loki for logs.
  • Grafana for dashboards and alerting views.
  • Vector is deferred by default in the central pipeline (add only if advanced multi-sink transforms are required); worker hosts may still run a host-local Vector log collector (see collector-backed node-agent logs below).

Reference:
- doc/architecture/Observability_Architecture.md
- doc/governance/Observability_Standards.md
- doc/operations/OTEL_Collector_Tenant_Isolation.md

Logging

  • Structured JSON logs with fields:
      • timestamp
      • level
      • service
      • correlation_id
      • trace_id / span_id (when span context exists)
      • org_id / project_id (when available)
      • error_code for failed requests/operations (catalog-aligned)
      • resource_name when the affected resource can be resolved

Runtime structured-log field contract (API/gateway/workers):
- correlation_id
- error_code
- resource_name
- org_id (tenant boundary)
- project_id (project boundary)

Three-host lab host-role field contract (required for lab evidence and triage):
- host_role, one of:
  - platform_control
  - app_control
  - worker_compute
- host_name, one of:
  - dev-control-1
  - dev-lab-1
  - dev-gpu-1
- lab_stack when a platform-app control stack is involved (for example slurm-reference)
- node_id when the real GPU worker host is involved

Field omission is allowed only when context is not yet established (for example, startup/bootstrap logs before a request scope exists).
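A minimal sketch of the contract in Go, assuming services log through log/slog's JSON handler (the wiring and field values here are illustrative, not the actual logging setup):

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON handler: every record is one structured log line. slog emits
	// "time" and "level" by default; renaming "time" to the contract's
	// "timestamp" key can be done with HandlerOptions.ReplaceAttr.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Request-scoped logger carrying the contract fields once request
	// context is established. Values are placeholders.
	reqLog := logger.With(
		"service", "gpuaas-api",
		"correlation_id", "corr-0000",
		"trace_id", "4bf92f3577b34da6a3ce929d0e0e4736",
		"span_id", "00f067aa0ba902b7",
		"org_id", "org-demo",
		"project_id", "proj-demo",
	)

	// Failed operations add error_code (catalog-aligned) and, when the
	// affected resource resolves, resource_name.
	reqLog.Error("allocation create failed",
		"error_code", "upstream_error",
		"resource_name", "alloc-42",
	)
}
```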

Tracing

  • OpenTelemetry tracing enabled for:
      • API requests
      • async worker jobs
      • external integrations (Stripe, node SSH operations)

Metrics

Required core metrics:
- API request rate, latency, error rate
- Queue depth and consumer lag
- Workflow success/failure counts
- Billing debit/credit event counts
- Webhook processing latency and failure rate

Provisioning control-loop required metrics:
- provisioning_queue_depth (gauge): backlog of provisioning dispatch work.
- provisioning_dispatch_latency_seconds (histogram): event enqueue-to-dispatch delay.
- provisioning_timeouts_total (counter): provisioning task/workflow timeout outcomes.
- provisioning_failures_total (counter): non-timeout provisioning failures.
- nats_consumer_lag{stream="PROVISIONING"} (gauge): JetStream lag for the provisioning stream.
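A hedged sketch of how these series could be registered with prometheus/client_golang (names and help strings follow the contract above; the package layout is illustrative):

```go
package provmetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Gauge: backlog of provisioning dispatch work.
	QueueDepth = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "provisioning_queue_depth",
		Help: "Backlog of provisioning dispatch work.",
	})

	// Histogram: event enqueue-to-dispatch delay, in seconds.
	DispatchLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "provisioning_dispatch_latency_seconds",
		Help:    "Event enqueue-to-dispatch delay.",
		Buckets: prometheus.DefBuckets,
	})

	// Counters: timeout vs non-timeout outcomes kept as separate series.
	TimeoutsTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "provisioning_timeouts_total",
		Help: "Provisioning task/workflow timeout outcomes.",
	})
	FailuresTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "provisioning_failures_total",
		Help: "Non-timeout provisioning failures.",
	})

	// GaugeVec: JetStream consumer lag, labeled by stream.
	NATSConsumerLag = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "nats_consumer_lag",
		Help: "JetStream consumer lag.",
	}, []string{"stream"})
)
```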

Provisioning control-loop alert objectives:
- Queue depth sustained above threshold triggers backlog alert.
- Dispatch p95 latency sustained above threshold triggers dispatch-delay alert.
- Timeout rate above threshold triggers timeout alert.
- Failure burst above threshold triggers failure-rate alert.
- All provisioning control-loop alerts must include runbook_id: ops.provisioning.stuck.
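These objectives could translate into Prometheus rules along these lines (a sketch only: thresholds and windows are illustrative and must be tuned against observed baselines, not taken from this document):

```yaml
groups:
  - name: gpuaas-provisioning-control-loop
    rules:
      - alert: GPUAASProvisioningBacklog
        expr: provisioning_queue_depth > 100        # illustrative threshold
        for: 10m
        annotations:
          runbook_id: ops.provisioning.stuck
      - alert: GPUAASProvisioningDispatchSlow
        expr: >
          histogram_quantile(0.95,
            sum by (le) (rate(provisioning_dispatch_latency_seconds_bucket[5m]))
          ) > 30
        for: 10m
        annotations:
          runbook_id: ops.provisioning.stuck
      - alert: GPUAASProvisioningTimeouts
        expr: rate(provisioning_timeouts_total[5m]) > 0.1
        for: 5m
        annotations:
          runbook_id: ops.provisioning.stuck
```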

Webhook worker baseline counters (via GET /metrics on cmd/webhook-worker):
- webhook_events_received_total
- webhook_signature_failures_total
- webhook_invalid_payload_total
- webhook_persist_failures_total
- webhook_processed_success_total
- webhook_processing_failures_total
- payments_reconcile_failed_total

API baseline counters/gauges (via GET /metrics on cmd/api):
- api_ratelimit_fail_open_total
- api_idempotency_persisted_body_json_total
- api_idempotency_skipped_body_empty_total
- api_idempotency_skipped_body_non_json_total
- api_idempotency_replays_served_total
- terminal_token_consumed_ok_total
- terminal_token_replay_rejected_total
- ws_notifications_active_connections
- ws_notifications_forwarded_messages_total
- ws_notifications_write_errors_total
- api_platform_role_list_requests_total
- api_platform_role_bind_requests_total
- api_platform_role_revoke_requests_total
- api_platform_role_mutation_success_total
- api_platform_role_mutation_failure_total
- api_platform_role_admin_denied_total
- api_platform_role_service_unavailable_total

Note:
- Platform-role counters are expected when the API runtime includes role-binding management telemetry.
- In mixed-version local environments, smoke checks warn (not fail) until the API runtime is refreshed.

Terminal stream relay observability baseline:
- session lifecycle counters (open/close/error) with close-reason labels.
- relay write error rate and drop counters at the gateway/runtime boundary.
- token replay/consume counters correlated with session failures.
- alert annotations must map to ops.terminal.gateway.

SSH key management observability baseline:
- key mutation counters (create/delete/default-switch/allocation-keyset-update).
- authorization-denied and validation-error rate for SSH key APIs.
- audit-log completeness checks for key-management mutations.
- alert annotations must map to ops.node.onboarding.

Baseline validation command:
- make ops-observability-smoke
- Script: scripts/ops/observability_smoke.sh
- Latest local evidence: doc/operations/evidence/observability_local_smoke_report.md

Correlation-first validation checks (required):
- API error envelope includes:
  - code, message, correlation_id
  - machine-readable details with at least service, and, when available, trace_id and span_id
- Terminal gateway error envelope includes:
  - the same fields as above, plus route/method metadata in details
- NATS event path preserves context (see the sketch after this list):
  - x-correlation-id in message headers
  - traceparent/tracestate propagation when trace context is present
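A sketch of the NATS publish side in Go, assuming the services use nats.go with JetStream and the OpenTelemetry SDK (the helper name and wiring are hypothetical):

```go
package events

import (
	"context"

	"github.com/nats-io/nats.go"
	"go.opentelemetry.io/otel/propagation"
)

// publishWithContext shows the header contract: x-correlation-id always
// travels with the message; traceparent/tracestate are injected only when
// ctx carries a valid span context (TraceContext.Inject is a no-op otherwise).
func publishWithContext(ctx context.Context, js nats.JetStreamContext,
	subject string, payload []byte, correlationID string) error {

	msg := nats.NewMsg(subject)
	msg.Data = payload
	msg.Header.Set("x-correlation-id", correlationID)

	// nats.Header and http.Header share an underlying type, so the
	// standard HeaderCarrier can inject the W3C trace headers directly.
	propagation.TraceContext{}.Inject(ctx, propagation.HeaderCarrier(msg.Header))

	_, err := js.PublishMsg(msg)
	return err
}
```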

Local overlay bring-up:
- make dev-up-observability
- Compose overlay: doc/operations/local-dev/docker-compose.observability.yaml
- Stack readiness check: make ops-observability-stack-smoke
- OTLP export is enabled for all core runtime services in observability mode:
  - gpuaas-api
  - gpuaas-terminal-gateway
  - gpuaas-billing-worker
  - gpuaas-provisioning-worker
  - gpuaas-webhook-worker
  - gpuaas-notification-relay
  - gpuaas-outbox-relay
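In compose terms, enabling OTLP export typically reduces to the standard OpenTelemetry SDK environment variables per service. A hypothetical excerpt (the collector hostname and port are assumptions, not the overlay's actual values):

```yaml
services:
  gpuaas-api:
    environment:
      OTEL_SERVICE_NAME: gpuaas-api
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
```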

platform_control deployed baseline:
- namespace: gpuaas-observability
- components:
  - otel-collector
  - prometheus
  - loki
  - tempo
  - grafana
  - promtail
- current public endpoints on dev-control-1:
  - Grafana: http://100.90.157.34:3001
  - Prometheus: http://100.90.157.34:9090
  - Loki: http://100.90.157.34:3100
  - Tempo: http://100.90.157.34:3200
- current proof points:
  - Prometheus scrapes gpuaas-core services
  - Tempo stores API request traces from gpuaas-api
  - Loki stores API warning/error logs from gpuaas-core

Three-host lab observability baseline:
- dev-control-1 must remain identifiable as host_role=platform_control.
- dev-lab-1 must remain identifiable as host_role=app_control.
- dev-gpu-1 must remain identifiable as host_role=worker_compute.
- dashboards, logs, and runbooks must preserve correlation_id across those boundaries.
- real GPU incidents must stay distinguishable from scheduler/platform-app control-stack incidents.
- platform-control Kubernetes logs must be queried with the live Kubernetes label model (namespace, job, host_role) plus JSON payload fields such as service, correlation_id, and trace_id; do not rely on the retired local-dev compose_service label.
- platform_control observability is automation-owned through:
  - infra/k8s/base/observability/
  - infra/ansible/roles/platform_control_k8s_observability/

collector-backed node-agent logs:
- gpuaas-node-agent and gpuaas-metrics-helper emit normal structured stdout/journald/file logs.
- Worker nodes run a host-local Vector collector when GPUAAS_NODE_LOG_COLLECTOR_ENABLED=1.
- The collector tails:
  - gpuaas-node-agent.service
  - gpuaas-metrics-helper.service
  - gpuaas-metrics-helper.timer
  - /var/log/gpuaas-node-agent*.log
- The collector forwards to gpuaas-node-log-gateway through the node-facing ingress with bounded disk buffering; forwarding never happens from service code directly, and never directly to raw Loki in production.
- Configure Vector's Loki sink endpoint as the gateway base path (https://node-api.<env>/internal/v1/node-logs). Vector appends /loki/api/v1/push itself; using the full push URL as the endpoint causes a doubled path and 404s. (A config sketch follows this list.)
- gpuaas-node-log-gateway validates the node bearer token, caps request size, forwards only Loki push batches to in-cluster Loki, and exposes node_log_gateway_* Prometheus counters.
- The collector must validate the gateway TLS chain with the node bootstrap CA bundle (GPUAAS_NODE_LOG_COLLECTOR_CA_FILE, default /etc/gpuaas/ca-bundle.crt). Do not disable certificate or hostname verification to work around node trust drift.
- Required Loki labels:
  - service=gpuaas-node-agent|gpuaas-metrics-helper
  - component=node-agent|metrics-helper
  - source=journald|self-update-finalizer
  - systemd_unit
  - host_role=worker_compute
  - host_name
  - node_id
- High-cardinality values such as task_id, allocation_id, and correlation_id stay as JSON fields and are queried with | json.
- Bootstrap ownership:
  - build/node-agent-bootstrap/observability/vector-node-logs.toml.tmpl
  - build/node-agent-bootstrap/systemd/gpuaas-node-log-collector.service.tmpl
  - build/node-agent-bootstrap/systemd/gpuaas-metrics-helper.service.tmpl
  - build/node-agent-bootstrap/systemd/gpuaas-metrics-helper.timer.tmpl
- Worker-host automation ownership:
  - infra/ansible/roles/worker_compute/
- Smoke validation:
  - make ops-node-agent-loki-smoke
  - set LOKI_BASE_URL and optionally NODE_ID.
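A trimmed sketch in the spirit of vector-node-logs.toml.tmpl, showing the base-path endpoint rule, CA pinning, and bounded disk buffering (unit names and label values follow the contract above; everything else, including the token variable, is illustrative):

```toml
[sources.node_agent_journald]
type          = "journald"
include_units = ["gpuaas-node-agent.service", "gpuaas-metrics-helper.service"]

[sinks.node_log_gateway]
type     = "loki"
inputs   = ["node_agent_journald"]
# Base path only: Vector appends /loki/api/v1/push itself. Configuring the
# full push URL doubles the path and yields 404s.
endpoint = "https://node-api.<env>/internal/v1/node-logs"
encoding.codec = "json"

  [sinks.node_log_gateway.labels]
  # Remaining required labels (component, source, systemd_unit, host_name,
  # node_id) follow the contract above.
  service   = "gpuaas-node-agent"
  host_role = "worker_compute"

  [sinks.node_log_gateway.auth]
  strategy = "bearer"
  token    = "${GPUAAS_NODE_TOKEN}"   # hypothetical variable name

  [sinks.node_log_gateway.tls]
  # Validate the gateway chain against the node bootstrap CA bundle;
  # never disable certificate or hostname verification.
  ca_file = "/etc/gpuaas/ca-bundle.crt"

  [sinks.node_log_gateway.buffer]
  type     = "disk"
  max_size = 268435488   # bytes; Vector's documented disk-buffer minimum
```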

Node-local Netdata telemetry edge:
- Worker bootstrap owns the stable Netdata edge used by platform-proxy, not the API route layer.
- Netdata listens on 127.0.0.1:19998.
- nginx listens on 0.0.0.0:19999.
- /gpuaas/telemetry/health returns ok.
- /gpuaas/telemetry/netdata/ redirects to the locally detected Netdata UI version path.
- Bootstrap ownership:
  - build/node-agent-bootstrap/nginx/gpuaas-netdata-edge.conf.tmpl
  - build/node-agent-bootstrap/install-node-agent.sh
- Worker-host automation ownership:
  - infra/ansible/roles/worker_compute/
- Existing-node repair/convergence helper:
  - scripts/ops/gpuaas_netdata_edge_converge.sh

Ops metrics query pack (backend mode):
- Use backend mode for durable totals displayed on Admin Ops views.
- The query pack must include canonical mappings (illustrated after this list) for:
  - request/error totals by service and status class
  - websocket/terminal session outcomes
  - queue/backlog and worker failure aggregates
- Query failures in backend mode must emit an explicit operator-facing degradation reason with fallback instructions.
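The canonical mappings might look like the following PromQL (a sketch only: the metric and label names here, such as api_requests_total and status_class, are assumptions, not the pack's actual series):

```promql
# Request/error totals by service and status class over the review window.
sum by (service, status_class) (increase(api_requests_total[24h]))

# Worker failure aggregates by scraped job.
sum by (job) (increase(provisioning_failures_total[24h]))
```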

Alerts

  • Alert on SLO burn for API latency/error budgets.
  • Alert on queue backlog thresholds.
  • Alert on webhook failures and billing worker failures.
  • Alert on repeated provisioning failures.
  • Alert on provisioning dispatch latency and timeout rates.
  • Alert on terminal stream relay degradation (session drop/error spikes).
  • Alert on SSH key management anomaly spikes (mutation failures/denials).
  • Alert rules should carry runbook_id annotations mapped to doc/operations/runbooks/runbooks.catalog.json.

Current local Prometheus rule pack (doc/operations/local-dev/observability/prometheus-alerts.yaml):
- GPUAASWebhookProcessingFailuresSpike -> runbook_id: ops.webhook.outage
- GPUAASTerminalTokenReplaySpike -> runbook_id: ops.terminal.gateway
- GPUAASNotificationWriteErrorsSpike -> runbook_id: ops.terminal.gateway
- GPUAASRateLimitFailOpenDetected -> runbook_id: ops.api.degradation

Alert drill command (synthetic test vectors):
- make ops-observability-alert-drill
- validates rule syntax and firing behavior via promtool test rules
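The drill consumes promtool's unit-test format. A hedged sketch (the synthetic series and expectations below are illustrative, not the drill's actual vectors):

```yaml
rule_files:
  - prometheus-alerts.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Steadily rising webhook processing failures.
      - series: 'webhook_processing_failures_total{job="gpuaas-webhook-worker"}'
        values: '0+20x15'
    alert_rule_test:
      - eval_time: 15m
        alertname: GPUAASWebhookProcessingFailuresSpike
        exp_alerts:
          - exp_annotations:
              runbook_id: ops.webhook.outage
```

Run with promtool test rules <test-file>.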

Grafana alert routing baseline (local provisioning):
- Contact points:
  - gpuaas-default (fallback)
  - gpuaas-platform
  - gpuaas-payments
- Notification policy routes by:
  - owner_team label
  - runbook_id label (explicit runbook mapping)
- Message templates:
  - gpuaas.alert.title
  - gpuaas.alert.body
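In Grafana's alerting file-provisioning terms this could look roughly as follows (a sketch only: matcher values and route order are illustrative, not the local pack's actual policy):

```yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: gpuaas-default          # fallback contact point
    routes:
      - receiver: gpuaas-payments
        object_matchers:
          - ["owner_team", "=", "payments"]
      - receiver: gpuaas-platform
        object_matchers:
          - ["runbook_id", "=~", "ops\\..+"]
```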

Dashboards

  • Service health overview
  • Provisioning workflow dashboard
  • Terminal gateway/session reliability dashboard
  • Billing and payments dashboard
  • Node fleet health dashboard
  • Security/authentication anomalies dashboard

Local Grafana pack (current auto-provisioned baseline):
- GPUaaS Control-Plane Overview (API/control-plane health + error logs)
- GPUaaS Billing & Payments (webhook and reconcile path)
- GPUaaS Terminal & Notifications (terminal token + websocket reliability)
- GPUaaS Runtime Health (process/runtime saturation by scraped job)
- GPUaaS Incident Correlation (correlation_id/trace_id pivots in Loki)
- GPUaaS Local Overview (legacy starter dashboard; retained as compatibility view)
- GPUaaS Fleet Telemetry (CPU/GPU/Memory/Storage rollups for /admin/telemetry)

Initial Grafana dashboard set ownership:
- API/control-plane reliability: Platform/API owner.
- Provisioning pipeline and worker lag: Provisioning owner.
- Terminal session and token path reliability: Terminal owner.
- Billing/payment reconciliation path: Billing owner.
- Node fleet enrollment and health posture: Platform/Inventory owner.

Three-host lab dashboard/query direction:
- platform_control:
  - GitLab, registry, control-plane stack, and observability stack health
- app_control:
  - platform-app control stacks such as slurm-reference
- worker_compute:
  - node-agent, terminal path, allocation runtime, and GPU validation
- any host-role alert should carry runbook_id: ops.lab.three-host when the first failing boundary is not yet obvious

Admin Ops decision-first observability mapping:
- Decision Header is the scan point for freshness and incident count.
- Action Required is the default entry point for degraded signals needing action now.
- Health Summary is for compact state confirmation, not primary diagnosis.
- Investigation Tools is where correlation, trace, and saved-query pivots live after the incident class is selected.
- Fleet and Sample Detail is supporting evidence only.
- Auth/login failures must stay visible as WARN/401-class incidents and must not rely on 5xx-only dashboards.

Saved query cookbook (incident-ready defaults):
- API 5xx burst by correlation:
  - Loki saved query: api_error_by_correlation_id
  - {service="gpuaas-api"} | json | status=~"5.." | correlation_id!=""
- Terminal incident join by resource_name:
  - Loki saved query: terminal_resource_name_join
  - {service=~"gpuaas-(terminal-gateway|api|notification-relay)"} | json | resource_name="<RESOURCE_NAME>"
- Provisioning timeout/failure sweep:
  - Loki saved query: provisioning_timeout_failure_window
  - {service="gpuaas-provisioning-worker"} | json | event_type=~"provisioning\\.(failed|release_failed)"
- Billing/webhook reconciliation failures:
  - Loki saved query: billing_webhook_reconcile_failures
  - {service=~"gpuaas-(billing-worker|webhook-worker)"} | json | code=~"upstream_error|service_unavailable|internal_error"
- App runtime billing reconciliation failures:
  - Loki saved query: app_runtime_billing_reconciliation
  - {service=~"gpuaas-(api|billing-worker|app-runtime-worker)"} | json | correlation_id!="" | (app_instance_id!="" or usage_source="app_runtime")
- Fleet telemetry endpoint failures:
  - Loki saved query: fleet_telemetry_api_error
  - {service="gpuaas-api"} | json | path="/api/v1/admin/telemetry/fleet" | status=~"4..|5.."
- App operator/service-account failures:
  - Loki saved query: app_operator_service_account_failure
  - {service="gpuaas-api"} | json | correlation_id!="" | operator_service_account_id!=""
- Enterprise federation failures:
  - Loki saved query: enterprise_federation_auth_failure
  - {service="gpuaas-api"} | json | correlation_id!="" |~ "(oidc|saml|federation|state)"
- Three-host lab control-plane failures:
  - Loki saved query: lab_control_plane_failure
  - {host_role="platform_control"} | json | correlation_id!=""
- Three-host GPU worker failures:
  - Loki saved query: lab_gpu_worker_failure
  - {host_role="worker_compute"} | json | correlation_id!=""
- Three-host app-control host failures:
  - Loki saved query: lab_control_host_failure
  - {host_role="app_control"} | json | correlation_id!=""
- Trace pivot helper:
  - Tempo/Grafana saved query: trace_from_correlation_id
  - Start from the API error envelope's details.trace_id, then inspect cross-service spans.
  - When details.trace_id is absent:
    1. use Loki with correlation_id to find the API log line,
    2. extract trace_id,
    3. open the trace in Tempo by ID,
    4. verify downstream spans from workers (billing, provisioning, notification, outbox) are present for async flows.

SLO/SLI shortlist (operations review baseline):
- API availability SLI: non-5xx request ratio over rolling 30d.
- API latency SLI: p95 request latency on authenticated API endpoints.
- Provisioning workflow SLI: requested->active success ratio within the SLO window.
- Terminal session SLI: successful websocket open + stable session duration ratio.
- Billing/reconcile SLI: successful webhook processing + reconcile completion ratio.
- Queue health SLI: outbox/NATS backlog age below threshold.
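For example, the availability SLI could be expressed in PromQL roughly as follows (the metric name api_requests_total and its status label are assumptions):

```promql
# Fraction of non-5xx API requests over a rolling 30d window.
1 - (
    sum(increase(api_requests_total{status=~"5.."}[30d]))
  /
    sum(increase(api_requests_total[30d]))
)
```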

Operator interpretation reference:
- doc/operations/runbooks/Admin_Ops_Dashboard_Usage_Runbook.md
- doc/operations/Ops_Runbook_Architecture.md
- doc/operations/runbooks/Three_Host_Lab_Incident_Runbook.md