
Node Agent Log Collection to Loki v1

As of: May 3, 2026

Purpose

Node-agent recovery evidence must not depend on SSH access to GPU nodes. This design adds a scalable collector path for node-agent logs while preserving the node-agent control-plane trust model.

Goals

  1. Collect gpuaas-node-agent.service journald logs and /var/log/gpuaas-node-agent*.log finalizer logs.
  2. Survive node/control-plane network interruptions with bounded local buffering.
  3. Authenticate every log stream as a specific node_id.
  4. Keep Loki labels low-cardinality and queryable by incident workflows.
  5. Scale horizontally in platform-control and future production environments.

Non-Goals

  1. General host log collection for every system service.
  2. Direct node access to Loki, Grafana, object storage, Redis, Postgres, NATS, or Temporal.
  3. Replacing node task result output or audit logs.

Architecture

GPU node
  journald + /var/log/gpuaas-node-agent*.log
        │
        ▼
  node-local collector (Vector)
        │  bearer token + node labels + bounded disk buffer
        ▼
  gpuaas-node-log-gateway replicas
        │
        ▼
  Loki distributor / gateway (horizontal)
        │
        ▼
  Loki write path
        │
        ▼
  Grafana / ops UI / incident runbooks

The node-local collector sends to the same node-facing authority as the node agent: POST /internal/v1/node-logs/loki/api/v1/push. That route terminates at gpuaas-node-log-gateway, not raw Loki. The gateway validates the node bearer token, caps request size, forwards only Loki push batches to in-cluster Loki, and exposes Prometheus counters for accepted/rejected/forward-failed batches.

The log gateway is a scaling split, not a second trust root. It must remain behind the node-facing ingress and the same node mTLS/client-cert posture as /internal/v1/nodes/*. Loki itself is not node-facing.

Node Collector

Recommended first implementation: Vector. The bootstrap package and worker Ansible role install the config/unit templates disabled by default; enabling requires the node-log-gateway URL and node bearer token material.

Reasons:

  1. Mature journald and file tail sources.
  2. Disk buffering and backpressure controls.
  3. Loki and HTTP sinks.
  4. Small operational footprint on bare-metal nodes.

Required sources:

  1. journald filtered to _SYSTEMD_UNIT=gpuaas-node-agent.service.
  2. journald filtered to gpuaas-metrics-helper.service and gpuaas-metrics-helper.timer.
  3. file for /var/log/gpuaas-node-agent-self-update.log.
  4. Optional file glob for future /var/log/gpuaas-node-agent*.log.

Required transforms:

  1. Add node_id from GPUAAS_NODE_ID.
  2. Add component=node-agent.
  3. Add source=journald|self-update-finalizer.
  4. Drop or redact known secret fields before egress: GPUAAS_ENROLLMENT_TOKEN, Authorization, access_token, refresh_token, private keys, and SSH private keys.

Required buffering:

  1. Disk buffer enabled.
  2. Bounded maximum size with a drop-oldest policy; alert before drops begin.
  3. Exponential backoff retry to the ingest endpoint.
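A minimal sketch of the Vector config implied by the requirements above. The gateway URL placeholder, source/transform names, redaction regex, and buffer size are illustrative, not shipped defaults; note that Vector's bounded disk buffer supports drop-newest rather than drop-oldest, so the alert-before-drop requirement has to be enforced by sizing and alerting around it:

```toml
# Sketch only: endpoint host, env var names, and sizes are assumptions.

[sources.node_agent_journald]
type = "journald"
include_units = ["gpuaas-node-agent.service", "gpuaas-metrics-helper.service"]

[sources.self_update_log]
type = "file"
include = ["/var/log/gpuaas-node-agent-self-update.log"]

[transforms.label_and_redact]
type = "remap"
inputs = ["node_agent_journald", "self_update_log"]
source = '''
.node_id = get_env_var!("GPUAAS_NODE_ID")
.component = "node-agent"
# File events carry a .file field; journald events do not.
.source = if exists(.file) { "self-update-finalizer" } else { "journald" }
# Crude secret redaction before egress; real pattern list is broader.
.message = replace(string!(.message),
  r'(GPUAAS_ENROLLMENT_TOKEN|Authorization|access_token|refresh_token)\S*',
  "[REDACTED]")
'''

[sinks.node_log_gateway]
type = "loki"
inputs = ["label_and_redact"]
# The Loki sink appends /loki/api/v1/push to this base endpoint.
endpoint = "https://<node-facing-ingress>/internal/v1/node-logs"
encoding.codec = "json"
labels.service = "gpuaas-node-agent"
labels.node_id = "{{ node_id }}"
auth.strategy = "bearer"
auth.token = "${GPUAAS_NODE_BEARER_TOKEN}"

[sinks.node_log_gateway.buffer]
type = "disk"
max_size = 536870912        # 512 MiB bounded local buffer
when_full = "drop_newest"   # Vector drops newest, not oldest; alert first
```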

Loki Labels

Use low-cardinality labels only:

  Label      Source
  service    constant: gpuaas-node-agent
  node_id    trusted from the mTLS path or collector env, verified by ingest
  region     control-plane node record lookup, not node-supplied
  cluster    deployment environment
  component  constant: node-agent
  source     journald or self-update-finalizer

Do not label by task_id, allocation_id, user_id, error string, host path, or high-cardinality message content. Those remain structured log fields.
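A sketch of how incident queries stay on the low-cardinality labels while high-cardinality values are filtered from the log body. Label names come from the table above; the node_id value and the assumption that node-agent lines are JSON-parseable are illustrative:

```logql
# All node-agent logs for one node: labels only.
{service="gpuaas-node-agent", node_id="<node-id>"}

# Self-update finalizer lines in one environment.
{service="gpuaas-node-agent", source="self-update-finalizer", cluster="platform-control"}

# High-cardinality values stay in the body: parse, then filter.
{service="gpuaas-node-agent", node_id="<node-id>"} | json | task_id="<task-id>"
```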

Ingest Semantics

The ingest endpoint should:

  1. Require the node auth token on every request, and additionally node mTLS when INTERNAL_NODE_MTLS_STRICT=true.
  2. Validate path node_id matches the mTLS node identity.
  3. Reject nodes in removed/deleted states.
  4. Enforce per-node request size and rate limits.
  5. Stamp trusted labels from the control-plane read model.
  6. Sanitize log payloads before forwarding.
  7. Return 202 Accepted after durable handoff to collector/gateway, not necessarily after Loki persistence.
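The gateway itself lives in cmd/node-log-gateway; the ordering of the checks above can be sketched as follows. Status codes, the per-request byte cap, and the node-state set here are illustrative, not the shipped values:

```python
from dataclasses import dataclass

MAX_BATCH_BYTES = 4 * 1024 * 1024   # illustrative per-request cap


@dataclass
class Node:
    """Control-plane read-model view of a node (fields assumed)."""
    node_id: str
    state: str        # e.g. "active", "removed", "deleted"
    region: str
    cluster: str


def authorize_push(path_node_id, mtls_node_id, token_valid,
                   node, body_len, under_rate_limit):
    """Mirror the ingest checklist; returns (http_status, trusted_labels)."""
    # 1. Token always; mTLS identity required here (strict mode assumed).
    if not token_valid or mtls_node_id is None:
        return 401, None
    # 2. Path node_id must match the mTLS identity.
    if path_node_id != mtls_node_id:
        return 403, None
    # 3. Reject nodes in removed/deleted states.
    if node is None or node.state in {"removed", "deleted"}:
        return 403, None
    # 4. Per-node size and rate limits.
    if body_len > MAX_BATCH_BYTES:
        return 413, None
    if not under_rate_limit:
        return 429, None
    # 5. Labels stamped from the control-plane read model, not the payload.
    labels = {
        "service": "gpuaas-node-agent",
        "node_id": node.node_id,
        "region": node.region,
        "cluster": node.cluster,
    }
    # 6./7. Sanitization and durable handoff happen next; 202 is returned
    # on queueing, not on Loki persistence.
    return 202, labels
```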

Scaling Model

Start with API-owned ingest for local-kind/dev-control. For platform-control and production scale:

  1. Put a stateless log-gateway deployment behind the node-facing ingress.
  2. Horizontally scale log-gateway on requests/sec and queue depth.
  3. Use Loki distributor replicas and object-store-backed chunks.
  4. Keep node-local disk buffers large enough for control-plane maintenance windows.
  5. Alert on gateway 5xx, per-node throttling, dropped collector events, and Loki distributor backpressure.
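Point 4 above reduces to simple arithmetic. The log rate, outage window, and headroom factor in this sketch are assumptions for illustration, not measured values:

```python
def min_buffer_bytes(bytes_per_sec: float, outage_secs: float,
                     safety: float = 2.0) -> int:
    """Smallest node-local disk buffer that rides out an outage of the
    given length, with a headroom multiplier for retry amplification."""
    return int(bytes_per_sec * outage_secs * safety)


# Assumed: ~50 KiB/s of node-agent log volume, a 2-hour maintenance
# window, 2x headroom.
needed = min_buffer_bytes(50 * 1024, 2 * 3600)
print(needed)  # 737280000 bytes, roughly 703 MiB
```

Under these assumptions a buffer under 1 GiB covers a 2-hour control-plane window with margin.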

Rollout Plan

  1. Add node-local Vector config to bootstrap package but keep it disabled by default. Done.
  2. Add a platform log-gateway in front of Loki when scale or node-auth boundaries require it. Done: cmd/node-log-gateway and infra/k8s/base/core/node-log-gateway.yaml.
  3. Enable in local-kind VM nodes and verify logs for cert recovery, self-update, and task result retry.
  4. Enable in platform-control nodes with a small disk buffer.
  5. Add Grafana panels and runbook queries:
     - by node_id
     - by correlation_id
     - by task_id
     - by component=node-agent source=self-update-finalizer
  6. Make collector enablement part of node bootstrap/image once validated. The bootstrap package now carries the disabled-by-default Vector unit/config so MAAS reimages do not lose the collector wiring.

Bootstrap Adjacency: Netdata Edge

The node-local nginx Netdata edge is part of the same worker observability baseline, even though it is not a Loki path:

  1. Netdata is configured as a node-local backend on 127.0.0.1:19998.
  2. nginx exposes the stable platform-proxy target on 0.0.0.0:19999.
  3. /gpuaas/telemetry/health is the node-local health check.
  4. /gpuaas/telemetry/netdata/ redirects to the detected Netdata dashboard version path.
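A sketch of the nginx edge config these four points describe. The ports and paths come from the list above; the /v3/ redirect target is illustrative since the dashboard version path is detected at converge time:

```nginx
# Sketch only: the versioned dashboard path is detected, not hard-coded.
server {
    listen 0.0.0.0:19999;           # stable platform-proxy target
    default_type text/plain;

    location = /gpuaas/telemetry/health {
        return 200 "ok\n";          # node-local health check
    }

    location /gpuaas/telemetry/netdata/ {
        return 302 /v3/;            # detected Netdata dashboard version path
    }

    location / {
        proxy_pass http://127.0.0.1:19998;   # node-local Netdata backend
    }
}
```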

This belongs in bootstrap or the host image, not only in scripts/ops/gpuaas_netdata_edge_converge.sh, so reimaged nodes do not regress to direct Netdata exposure or require API-side probing across the /v1, /v2, and /v3 dashboard route versions.

Open Decisions

  1. Whether the gateway should evolve from bearer-token validation to strict per-node mTLS identity validation before production.
  2. Whether node-agent should expose a local health signal for collector liveness.
  3. Retention tier for node-agent logs in platform-control versus production.