Observability Baseline¶
Backend Stack Decision (v1)¶
- OpenTelemetry Collector as the single telemetry pipeline.
- Prometheus for metrics.
- Tempo for traces.
- Loki for logs.
- Grafana for dashboards and alerting views.
- Vector is deferred by default (add only if advanced multi-sink transforms are required).
References:
- doc/architecture/Observability_Architecture.md
- doc/governance/Observability_Standards.md
- doc/operations/OTEL_Collector_Tenant_Isolation.md
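A minimal collector pipeline sketch consistent with this decision is below. Endpoints, ports, and the Loki exporter choice are illustrative assumptions, not the deployed configuration (the contrib loki exporter is one option; newer collector builds may prefer OTLP ingest into Loki).

```yaml
# Sketch: one OTLP pipeline fanning out to the v1 backends.
# All endpoints/ports are assumptions, not the deployed config.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  prometheus:          # metrics endpoint scraped by Prometheus
    endpoint: 0.0.0.0:8889
  otlp/tempo:          # traces to Tempo over OTLP
    endpoint: tempo:4317
    tls:
      insecure: true   # local overlay only; production should verify TLS
  loki:                # logs to Loki (contrib exporter; an assumption)
    endpoint: http://loki:3100/loki/api/v1/push
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      exporters: [loki]
```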
Logging¶
- Structured JSON logs with fields:
- timestamp
- level
- service
- correlation_id
- trace_id / span_id (when span context exists)
- org_id/project_id when available
- error_code for failed requests/operations (catalog-aligned)
- resource_name when the affected resource can be resolved
Runtime structured-log field contract (API/gateway/workers):
- correlation_id
- error_code
- resource_name
- org_id (tenant boundary)
- project_id (project boundary)
Three-host lab host-role field contract (required for lab evidence and triage):
- host_role:
- platform_control
- app_control
- worker_compute
- host_name:
- dev-control-1
- dev-lab-1
- dev-gpu-1
- lab_stack when a platform-app control stack is involved (for example slurm-reference)
- node_id when the real GPU worker host is involved
Field omission is allowed only when context is not yet established (for example, startup/bootstrap logs before a request scope exists).
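A hypothetical record satisfying the contracts above (every value is a placeholder; the lab host-role fields appear only on lab hosts):

```yaml
# Hypothetical structured log record; emitted as JSON in practice, shown here
# field-by-field. Every value is a placeholder.
timestamp: "2025-01-15T10:22:31.401Z"
level: error
service: gpuaas-api
correlation_id: 8c2f5a1e-0b7d-4f7a-9a44-2f3f9a1c6d2e
trace_id: 4bf92f3577b34da6a3ce929d0e0e4736
span_id: 00f067aa0ba902b7
org_id: org_123            # tenant boundary
project_id: proj_456       # project boundary
error_code: upstream_error
resource_name: alloc-demo-01
host_role: worker_compute  # lab field-contract fields, when on a lab host
host_name: dev-gpu-1
node_id: node-01
```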
Tracing¶
- OpenTelemetry tracing enabled for:
- API requests
- async worker jobs
- external integrations (Stripe, node SSH operations)
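A minimal sketch of enabling OTLP trace export for one service via the standard OpenTelemetry SDK environment variables (the collector address is an assumption):

```yaml
# Compose-overlay-style sketch using standard OTel SDK env vars.
# The collector address is an assumption.
services:
  gpuaas-api:
    environment:
      OTEL_SERVICE_NAME: gpuaas-api
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      OTEL_TRACES_SAMPLER: parentbased_always_on
```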
Metrics¶
Required core metrics:
- API request rate, latency, and error rate
- Queue depth and consumer lag
- Workflow success/failure counts
- Billing debit/credit event counts
- Webhook processing latency and failure rate
Provisioning control-loop required metrics:
- provisioning_queue_depth (gauge): backlog of provisioning dispatch work.
- provisioning_dispatch_latency_seconds (histogram): event enqueue-to-dispatch delay.
- provisioning_timeouts_total (counter): provisioning task/workflow timeout outcomes.
- provisioning_failures_total (counter): non-timeout provisioning failures.
- nats_consumer_lag{stream="PROVISIONING"} (gauge): JetStream lag for provisioning stream.
Provisioning control-loop alert objectives:
- Queue depth sustained above threshold triggers backlog alert.
- Dispatch p95 latency sustained above threshold triggers dispatch-delay alert.
- Timeout rate above threshold triggers timeout alert.
- Failure burst above threshold triggers failure-rate alert.
- All provisioning control-loop alerts must include runbook_id: ops.provisioning.stuck.
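A rule sketch for the first objective, showing the required runbook_id annotation (the alert name, threshold, and window are placeholders, not agreed SLO values):

```yaml
# Sketch of one provisioning control-loop alert. Alert name, threshold, and
# window are placeholders; runbook_id is the required annotation.
groups:
  - name: gpuaas-provisioning
    rules:
      - alert: GPUAASProvisioningBacklog
        expr: provisioning_queue_depth > 100
        for: 10m
        labels:
          owner_team: provisioning
        annotations:
          runbook_id: ops.provisioning.stuck
          summary: "Provisioning queue depth sustained above threshold"
```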
Webhook worker baseline counters (via GET /metrics on cmd/webhook-worker):
- webhook_events_received_total
- webhook_signature_failures_total
- webhook_invalid_payload_total
- webhook_persist_failures_total
- webhook_processed_success_total
- webhook_processing_failures_total
- payments_reconcile_failed_total
API baseline counters/gauges (via GET /metrics on cmd/api):
- api_ratelimit_fail_open_total
- api_idempotency_persisted_body_json_total
- api_idempotency_skipped_body_empty_total
- api_idempotency_skipped_body_non_json_total
- api_idempotency_replays_served_total
- terminal_token_consumed_ok_total
- terminal_token_replay_rejected_total
- ws_notifications_active_connections
- ws_notifications_forwarded_messages_total
- ws_notifications_write_errors_total
- api_platform_role_list_requests_total
- api_platform_role_bind_requests_total
- api_platform_role_revoke_requests_total
- api_platform_role_mutation_success_total
- api_platform_role_mutation_failure_total
- api_platform_role_admin_denied_total
- api_platform_role_service_unavailable_total
Note:
- Platform-role counters are expected when the API runtime includes role-binding management telemetry.
- In mixed-version local environments, smoke checks warn (not fail) until the API runtime is refreshed.
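A minimal scrape sketch for these /metrics endpoints (job names and target ports are assumptions):

```yaml
# Prometheus scrape sketch for the counters above; ports are assumptions.
scrape_configs:
  - job_name: gpuaas-api
    static_configs:
      - targets: ["gpuaas-api:8080"]
  - job_name: gpuaas-webhook-worker
    static_configs:
      - targets: ["gpuaas-webhook-worker:9091"]
```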
Terminal stream relay observability baseline:
- session lifecycle counters (open/close/error) with close-reason labels.
- relay write error rate and drop counters at gateway/runtime boundary.
- token replay/consume counters correlated with session failures.
- alert annotations must map to ops.terminal.gateway.
SSH key management observability baseline:
- key mutation counters (create/delete/default-switch/allocation-keyset-update).
- authorization-denied and validation-error rate for SSH key APIs.
- audit-log completeness checks for key-management mutations.
- alert annotations must map to ops.node.onboarding.
Baseline validation command:
- make ops-observability-smoke
- Script: scripts/ops/observability_smoke.sh
- Latest local evidence: doc/operations/evidence/observability_local_smoke_report.md
Correlation-first validation checks (required):
- API error envelope includes:
- code, message, correlation_id
- machine-readable details with at least service and, when available, trace_id and span_id
- Terminal gateway error envelope includes:
- same fields above, plus route/method metadata in details
- NATS event path preserves context:
- x-correlation-id in message headers
- traceparent/tracestate propagation when trace context is present
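A hypothetical envelope shape that passes these checks (all values are placeholders):

```yaml
# Hypothetical API error envelope; emitted as JSON in practice.
code: internal_error
message: "provisioning dispatch failed"
correlation_id: 8c2f5a1e-0b7d-4f7a-9a44-2f3f9a1c6d2e
details:
  service: gpuaas-api
  trace_id: 4bf92f3577b34da6a3ce929d0e0e4736
  span_id: 00f067aa0ba902b7
```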
Local overlay bring-up:
- make dev-up-observability
- Compose overlay: doc/operations/local-dev/docker-compose.observability.yaml
- Stack readiness check: make ops-observability-stack-smoke
- OTLP export is enabled for all core runtime services in observability mode:
- gpuaas-api
- gpuaas-terminal-gateway
- gpuaas-billing-worker
- gpuaas-provisioning-worker
- gpuaas-webhook-worker
- gpuaas-notification-relay
- gpuaas-outbox-relay
platform_control deployed baseline:
- namespace: gpuaas-observability
- components:
- otel-collector
- prometheus
- loki
- tempo
- grafana
- promtail
- current public endpoints on dev-control-1:
- Grafana: http://100.90.157.34:3001
- Prometheus: http://100.90.157.34:9090
- Loki: http://100.90.157.34:3100
- Tempo: http://100.90.157.34:3200
- current proof points:
- Prometheus scrapes gpuaas-core services
- Tempo stores API request traces from gpuaas-api
- Loki stores API warning/error logs from gpuaas-core
Three-host lab observability baseline:
- dev-control-1 must remain identifiable as host_role=platform_control.
- dev-lab-1 must remain identifiable as host_role=app_control.
- dev-gpu-1 must remain identifiable as host_role=worker_compute.
- dashboards, logs, and runbooks must preserve correlation_id across those boundaries.
- real GPU incidents must stay distinguishable from scheduler/platform-app control-stack incidents.
- platform-control Kubernetes logs must be queried with the live Kubernetes label model (namespace, job, host_role) plus JSON payload fields such as service, correlation_id, and trace_id; do not rely on the retired local-dev compose_service label (an example query shape follows this list).
- platform_control observability is automation-owned through:
- infra/k8s/base/observability/
- infra/ansible/roles/platform_control_k8s_observability/
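- Example query shape under the live label model (the namespace value is an assumption):
  - {namespace="gpuaas-core", host_role="platform_control"} | json | service="gpuaas-api" | correlation_id!=""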
Collector-backed node-agent logs:
- gpuaas-node-agent and gpuaas-metrics-helper emit normal structured stdout/journald/file logs.
- Worker nodes run a host-local Vector collector when
GPUAAS_NODE_LOG_COLLECTOR_ENABLED=1.
- The collector tails:
- gpuaas-node-agent.service
- gpuaas-metrics-helper.service
- gpuaas-metrics-helper.timer
- /var/log/gpuaas-node-agent*.log
- The collector forwards to gpuaas-node-log-gateway through the node-facing
  ingress with bounded disk buffering; services never ship logs directly from
  their own code, and nothing writes straight to raw Loki in production.
- Configure Vector's Loki sink endpoint as the gateway base path
  (https://node-api.<env>/internal/v1/node-logs). Vector appends
  /loki/api/v1/push itself; using the full push URL as the endpoint doubles
  the path and produces 404s. A config sketch follows at the end of this
  section.
- gpuaas-node-log-gateway validates the node bearer token, caps request size,
forwards only Loki push batches to in-cluster Loki, and exposes
node_log_gateway_* Prometheus counters.
- The collector must validate the gateway TLS chain with the node bootstrap CA
bundle (GPUAAS_NODE_LOG_COLLECTOR_CA_FILE, default
/etc/gpuaas/ca-bundle.crt). Do not disable certificate or hostname
verification to work around node trust drift.
- Required Loki labels:
- service=gpuaas-node-agent|gpuaas-metrics-helper
- component=node-agent|metrics-helper
- source=journald|self-update-finalizer
- systemd_unit
- host_role=worker_compute
- host_name
- node_id
- High-cardinality values such as task_id, allocation_id, and
correlation_id stay as JSON fields and are queried with | json.
- Bootstrap ownership:
- build/node-agent-bootstrap/observability/vector-node-logs.toml.tmpl
- build/node-agent-bootstrap/systemd/gpuaas-node-log-collector.service.tmpl
- build/node-agent-bootstrap/systemd/gpuaas-metrics-helper.service.tmpl
- build/node-agent-bootstrap/systemd/gpuaas-metrics-helper.timer.tmpl
- Worker-host automation ownership:
- infra/ansible/roles/worker_compute/
- Smoke validation:
- make ops-node-agent-loki-smoke
- set LOKI_BASE_URL and optionally NODE_ID.
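A minimal Vector sink sketch matching the endpoint, TLS, and buffering rules above. Vector also accepts YAML config (the shipped bootstrap template is TOML); the source/sink names, buffer size, and label values here are assumptions:

```yaml
# Vector sketch: journald tail -> gpuaas-node-log-gateway. The file source for
# /var/log/gpuaas-node-agent*.log is omitted for brevity. Names and sizes are
# assumptions; the endpoint/TLS rules mirror the requirements above.
sources:
  node_journald:
    type: journald
    include_units:
      - gpuaas-node-agent.service
      - gpuaas-metrics-helper.service
sinks:
  gpuaas_node_logs:
    type: loki
    inputs: [node_journald]
    # Gateway base path only -- Vector appends /loki/api/v1/push itself.
    endpoint: https://node-api.<env>/internal/v1/node-logs
    encoding:
      codec: json
    tls:
      ca_file: /etc/gpuaas/ca-bundle.crt  # never disable verification
    buffer:
      type: disk
      max_size: 268435488   # bounded disk buffer; size is an assumption
    labels:
      service: gpuaas-node-agent
      component: node-agent
      host_role: worker_compute
```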
Node-local Netdata telemetry edge:
- Worker bootstrap owns the stable Netdata edge used by platform-proxy, not the
API route layer.
- Netdata listens on 127.0.0.1:19998.
- nginx listens on 0.0.0.0:19999.
- /gpuaas/telemetry/health returns ok.
- /gpuaas/telemetry/netdata/ redirects to the locally detected Netdata UI
version path.
- Bootstrap ownership:
- build/node-agent-bootstrap/nginx/gpuaas-netdata-edge.conf.tmpl
- build/node-agent-bootstrap/install-node-agent.sh
- Worker-host automation ownership:
- infra/ansible/roles/worker_compute/
- Existing-node repair/convergence helper:
- scripts/ops/gpuaas_netdata_edge_converge.sh
Ops metrics query pack (backend mode):
- Use backend mode for durable totals displayed on Admin Ops views.
- Query pack must include canonical mappings for:
- request/error totals by service and status class
- websocket/terminal session outcomes
- queue/backlog and worker failure aggregates
- Query failures in backend mode must emit an explicit operator-facing
  degradation reason with fallback instructions.
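A hypothetical query-pack entry shape for backend mode; the mapping format, the http_requests_total metric, and the label names are all assumptions, not the shipped pack:

```yaml
# Hypothetical backend-mode query-pack entry. The format, metric, and label
# names are assumptions.
request_error_totals_by_service:
  backend: prometheus
  query: sum by (service, status_class) (increase(http_requests_total[24h]))
  degraded_reason: >-
    Prometheus query failed; fall back to live in-process counters and flag
    reduced accuracy to the operator.
```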
Alerts¶
- Alert on SLO burn for API latency/error budgets.
- Alert on queue backlog thresholds.
- Alert on webhook failures and billing worker failures.
- Alert on repeated provisioning failures.
- Alert on provisioning dispatch latency and timeout rates.
- Alert on terminal stream relay degradation (session drop/error spikes).
- Alert on SSH key management anomaly spikes (mutation failures/denials).
- Alert rules should carry runbook_id annotations mapped to
  doc/operations/runbooks/runbooks.catalog.json.
Current local Prometheus rule pack (doc/operations/local-dev/observability/prometheus-alerts.yaml):
- GPUAASWebhookProcessingFailuresSpike -> runbook_id: ops.webhook.outage
- GPUAASTerminalTokenReplaySpike -> runbook_id: ops.terminal.gateway
- GPUAASNotificationWriteErrorsSpike -> runbook_id: ops.terminal.gateway
- GPUAASRateLimitFailOpenDetected -> runbook_id: ops.api.degradation
Alert drill command (synthetic test vectors):
- make ops-observability-alert-drill
- validates rule syntax and firing behavior via promtool test rules
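A minimal unit-test sketch of the kind the drill runs; the synthetic series, timings, and expected labels/annotations are placeholders that would need to match the real rule:

```yaml
# `promtool test rules` sketch with a synthetic failure burst. Series values,
# timings, and expected labels/annotations are placeholders.
rule_files:
  - prometheus-alerts.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'webhook_processing_failures_total{job="gpuaas-webhook-worker"}'
        values: '0+5x15'   # steady synthetic failure burst
    alert_rule_test:
      - eval_time: 15m
        alertname: GPUAASWebhookProcessingFailuresSpike
        exp_alerts:
          - exp_labels:
              job: gpuaas-webhook-worker
            exp_annotations:
              runbook_id: ops.webhook.outage
```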
Grafana alert routing baseline (local provisioning):
- Contact points:
- gpuaas-default (fallback)
- gpuaas-platform
- gpuaas-payments
- Notification policy routes by:
- owner_team label
- runbook_id label (explicit runbook mapping)
- Message templates:
- gpuaas.alert.title
- gpuaas.alert.body
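A hedged provisioning-file sketch for this routing baseline (receiver types, settings, and matcher values are assumptions):

```yaml
# Grafana alerting provisioning sketch. Receiver type/settings and matcher
# values are assumptions; names mirror the baseline above.
apiVersion: 1
contactPoints:
  - orgId: 1
    name: gpuaas-payments
    receivers:
      - uid: gpuaas-payments
        type: webhook
        settings:
          url: https://example.internal/hooks/payments-oncall
policies:
  - orgId: 1
    receiver: gpuaas-default   # fallback
    routes:
      - receiver: gpuaas-payments
        object_matchers:
          - ["owner_team", "=", "payments"]
      - receiver: gpuaas-platform
        object_matchers:
          - ["runbook_id", "=", "ops.provisioning.stuck"]
```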
Dashboards¶
- Service health overview
- Provisioning workflow dashboard
- Terminal gateway/session reliability dashboard
- Billing and payments dashboard
- Node fleet health dashboard
- Security/authentication anomalies dashboard
Local Grafana pack (current auto-provisioned baseline):
- GPUaaS Control-Plane Overview (API/control-plane health + error logs)
- GPUaaS Billing & Payments (webhook and reconcile path)
- GPUaaS Terminal & Notifications (terminal token + websocket reliability)
- GPUaaS Runtime Health (process/runtime saturation by scraped job)
- GPUaaS Incident Correlation (correlation_id/trace_id pivots in Loki)
- GPUaaS Local Overview (legacy starter dashboard; retained as compatibility view)
- GPUaaS Fleet Telemetry (CPU/GPU/Memory/Storage rollups for /admin/telemetry)
Initial Grafana dashboard set ownership:
- API/control-plane reliability: Platform/API owner.
- Provisioning pipeline and worker lag: Provisioning owner.
- Terminal session and token path reliability: Terminal owner.
- Billing/payment reconciliation path: Billing owner.
- Node fleet enrollment and health posture: Platform/Inventory owner.
Three-host lab dashboard/query direction:
- platform_control:
- GitLab, registry, control-plane stack, and observability stack health
- app_control:
- platform-app control stacks such as slurm-reference
- worker_compute:
- node-agent, terminal path, allocation runtime, and GPU validation
- any host-role alert should carry runbook_id: ops.lab.three-host when the first failing boundary is not yet obvious
Admin Ops decision-first observability mapping:
- Decision Header is the scan point for freshness and incident count.
- Action Required is the default entry point for degraded signals needing action now.
- Health Summary is for compact state confirmation, not primary diagnosis.
- Investigation Tools is where correlation, trace, and saved-query pivots live after the incident class is selected.
- Fleet and Sample Detail is supporting evidence only.
- Auth/login failures must stay visible as WARN/401-class incidents and must not rely on 5xx-only dashboards.
Saved query cookbook (incident-ready defaults):
- API 5xx burst by correlation:
- Loki saved query: api_error_by_correlation_id
- {service="gpuaas-api"} | json | status=~"5.." | correlation_id!=""
- Terminal incident join by resource_name:
- Loki saved query: terminal_resource_name_join
- {service=~"gpuaas-(terminal-gateway|api|notification-relay)"} | json | resource_name="<RESOURCE_NAME>"
- Provisioning timeout/failure sweep:
- Loki saved query: provisioning_timeout_failure_window
- {service="gpuaas-provisioning-worker"} | json | event_type=~"provisioning\\.(failed|release_failed)"
- Billing/webhook reconciliation failures:
- Loki saved query: billing_webhook_reconcile_failures
- {service=~"gpuaas-(billing-worker|webhook-worker)"} | json | code=~"upstream_error|service_unavailable|internal_error"
- App runtime billing reconciliation failures:
- Loki saved query: app_runtime_billing_reconciliation
- {service=~"gpuaas-(api|billing-worker|app-runtime-worker)"} | json | correlation_id!="" | (app_instance_id!="" or usage_source="app_runtime")
- Fleet telemetry endpoint failures:
- Loki saved query: fleet_telemetry_api_error
- {service="gpuaas-api"} | json | path="/api/v1/admin/telemetry/fleet" | status=~"4..|5.."
- App operator/service-account failures:
- Loki saved query: app_operator_service_account_failure
- {service="gpuaas-api"} | json | correlation_id!="" | operator_service_account_id!=""
- Enterprise federation failures:
- Loki saved query: enterprise_federation_auth_failure
- {service="gpuaas-api"} | json | correlation_id!="" |~ "(oidc|saml|federation|state)"
- Three-host lab control-plane failures:
- Loki saved query: lab_control_plane_failure
- {host_role="platform_control"} | json | correlation_id!=""
- Three-host GPU worker failures:
- Loki saved query: lab_gpu_worker_failure
- {host_role="worker_compute"} | json | correlation_id!=""
- Three-host app-control host failures:
- Loki saved query: lab_control_host_failure
- {host_role="app_control"} | json | correlation_id!=""
- Trace pivot helper:
- Tempo/Grafana saved query: trace_from_correlation_id
- start from API error envelope details.trace_id, then inspect cross-service spans.
- when details.trace_id is absent:
1. use Loki with correlation_id to find the API log line,
2. extract trace_id,
3. open the trace in Tempo by ID,
4. verify downstream spans from workers (billing, provisioning, notification, outbox) are present for async flows.
SLO/SLI shortlist (operations review baseline):
- API availability SLI: non-5xx request ratio over a rolling 30-day window.
- API latency SLI: p95 request latency on authenticated API endpoints.
- Provisioning workflow SLI: requested->active success ratio within the SLO window.
- Terminal session SLI: successful websocket open plus stable session duration ratio.
- Billing/reconcile SLI: successful webhook processing plus reconcile completion ratio.
- Queue health SLI: outbox/NATS backlog age below threshold.
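One hedged recording-rule sketch for the availability SLI (the http_requests_total metric and its labels are assumptions about what the API exposes):

```yaml
# Recording-rule sketch for the API availability SLI. Metric and label names
# are assumptions.
groups:
  - name: gpuaas-slo
    rules:
      - record: sli:api_availability:ratio_rate30d
        expr: |
          1 - (
            sum(rate(http_requests_total{service="gpuaas-api",status=~"5.."}[30d]))
            /
            sum(rate(http_requests_total{service="gpuaas-api"}[30d]))
          )
```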
Operator interpretation reference:
- doc/operations/runbooks/Admin_Ops_Dashboard_Usage_Runbook.md
- doc/operations/Ops_Runbook_Architecture.md
- doc/operations/runbooks/Three_Host_Lab_Incident_Runbook.md