
Node Agent Log Collection to Loki v1

As of: May 3, 2026

Purpose

Node-agent recovery evidence must not depend on SSH access to GPU nodes. This design adds a scalable collector path for node-agent logs while preserving the node-agent control-plane trust model.

Goals

  1. Collect gpuaas-node-agent.service journald logs and /var/log/gpuaas-node-agent*.log finalizer logs.
  2. Survive node/control-plane network interruptions with bounded local buffering.
  3. Authenticate every log stream as a specific node_id.
  4. Keep Loki labels low-cardinality and queryable by incident workflows.
  5. Scale horizontally in platform-control and future production environments.

Non-Goals

  1. General host log collection for every system service.
  2. Direct node access to Loki, Grafana, object storage, Redis, Postgres, NATS, or Temporal.
  3. Replacing node task result output or audit logs.

Architecture

GPU node
  journald + /var/log/gpuaas-node-agent*.log
        │
        ▼
  node-local collector (Vector)
        │  bearer token + node labels + bounded disk buffer
        ▼
  gpuaas-node-log-gateway replicas
        │
        ▼
  Loki distributor / gateway (horizontal)
        │
        ▼
  Loki write path
        │
        ▼
  Grafana / ops UI / incident runbooks

The node-local collector sends to the same node-facing authority as the node agent: POST /internal/v1/node-logs/loki/api/v1/push. That route terminates at gpuaas-node-log-gateway, not raw Loki. The gateway validates the node bearer token, caps request size, forwards only Loki push batches to in-cluster Loki, and exposes Prometheus counters for accepted/rejected/forward-failed batches.

The log gateway is a scaling split, not a second trust root. It must remain behind the node-facing ingress and the same node mTLS/client-cert posture as /internal/v1/nodes/*. Loki itself is not node-facing.

Node Collector

Recommended first implementation: Vector. The bootstrap package and worker Ansible role install the config/unit templates disabled by default; enabling requires the node-log-gateway URL and node bearer token material.

Reasons:

  1. Mature journald and file tail sources.
  2. Disk buffering and backpressure controls.
  3. Loki and HTTP sinks.
  4. Small operational footprint on bare-metal nodes.

Required sources:

  1. journald filtered to _SYSTEMD_UNIT=gpuaas-node-agent.service.
  2. journald filtered to gpuaas-metrics-helper.service and gpuaas-metrics-helper.timer.
  3. file for /var/log/gpuaas-node-agent-self-update.log.
  4. Optional file glob for future /var/log/gpuaas-node-agent*.log.

Required transforms:

  1. Add node_id from GPUAAS_NODE_ID.
  2. Add component=node-agent.
  3. Add source=journald|self-update-finalizer.
  4. Drop or redact known secret fields before egress: GPUAAS_ENROLLMENT_TOKEN, Authorization, access_token, refresh_token, private keys, and SSH private keys.

Required buffering:

  1. Disk buffer enabled.
  2. Bounded maximum size with a drop-oldest policy; alert before drops begin.
  3. Exponential backoff retry to the ingest endpoint.
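A minimal sketch of the Vector config implied by the requirements above. The gateway URL placeholder, source/transform names, redaction regex, and buffer size are illustrative, not shipped defaults; note that Vector's bounded disk buffer supports drop-newest rather than drop-oldest, so the alert-before-drop requirement has to be enforced by sizing and alerting around it:

```toml
# Sketch only: endpoint host, env var names, and sizes are assumptions.

[sources.node_agent_journald]
type = "journald"
include_units = ["gpuaas-node-agent.service", "gpuaas-metrics-helper.service"]

[sources.self_update_log]
type = "file"
include = ["/var/log/gpuaas-node-agent-self-update.log"]

[transforms.label_and_redact]
type = "remap"
inputs = ["node_agent_journald", "self_update_log"]
source = '''
.node_id = get_env_var!("GPUAAS_NODE_ID")
.component = "node-agent"
# File events carry a .file field; journald events do not.
.source = if exists(.file) { "self-update-finalizer" } else { "journald" }
# Crude secret redaction before egress; real pattern list is broader.
.message = replace(string!(.message),
  r'(GPUAAS_ENROLLMENT_TOKEN|Authorization|access_token|refresh_token)\S*',
  "[REDACTED]")
'''

[sinks.node_log_gateway]
type = "loki"
inputs = ["label_and_redact"]
# The Loki sink appends /loki/api/v1/push to this base endpoint.
endpoint = "https://<node-facing-ingress>/internal/v1/node-logs"
encoding.codec = "json"
labels.service = "gpuaas-node-agent"
labels.node_id = "{{ node_id }}"
auth.strategy = "bearer"
auth.token = "${GPUAAS_NODE_BEARER_TOKEN}"

[sinks.node_log_gateway.buffer]
type = "disk"
max_size = 536870912        # 512 MiB bounded local buffer
when_full = "drop_newest"   # Vector drops newest, not oldest; alert first
```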

Loki Labels

Use low-cardinality labels only:

  Label      Source
  service    constant: gpuaas-node-agent
  node_id    trusted from the mTLS path or collector env, verified by ingest
  region     control-plane node record lookup, not node-supplied
  cluster    deployment environment
  component  constant: node-agent
  source     journald or self-update-finalizer

Do not label by task_id, allocation_id, user_id, error string, host path, or high-cardinality message content. Those remain structured log fields.
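A sketch of how incident queries stay on the low-cardinality labels while high-cardinality values are filtered from the log body. Label names come from the table above; the node_id value and the assumption that node-agent lines are JSON-parseable are illustrative:

```logql
# All node-agent logs for one node: labels only.
{service="gpuaas-node-agent", node_id="<node-id>"}

# Self-update finalizer lines in one environment.
{service="gpuaas-node-agent", source="self-update-finalizer", cluster="platform-control"}

# High-cardinality values stay in the body: parse, then filter.
{service="gpuaas-node-agent", node_id="<node-id>"} | json | task_id="<task-id>"
```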

Ingest Semantics

The ingest endpoint should:

  1. Require the node auth token on every request, and additionally node mTLS when INTERNAL_NODE_MTLS_STRICT=true.
  2. Validate path node_id matches the mTLS node identity.
  3. Reject nodes in removed/deleted states.
  4. Enforce per-node request size and rate limits.
  5. Stamp trusted labels from the control-plane read model.
  6. Sanitize log payloads before forwarding.
  7. Return 202 Accepted after durable handoff to collector/gateway, not necessarily after Loki persistence.
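The gateway itself lives in cmd/node-log-gateway; the ordering of the checks above can be sketched as follows. Status codes, the per-request byte cap, and the node-state set here are illustrative, not the shipped values:

```python
from dataclasses import dataclass

MAX_BATCH_BYTES = 4 * 1024 * 1024   # illustrative per-request cap


@dataclass
class Node:
    """Control-plane read-model view of a node (fields assumed)."""
    node_id: str
    state: str        # e.g. "active", "removed", "deleted"
    region: str
    cluster: str


def authorize_push(path_node_id, mtls_node_id, token_valid,
                   node, body_len, under_rate_limit):
    """Mirror the ingest checklist; returns (http_status, trusted_labels)."""
    # 1. Token always; mTLS identity required here (strict mode assumed).
    if not token_valid or mtls_node_id is None:
        return 401, None
    # 2. Path node_id must match the mTLS identity.
    if path_node_id != mtls_node_id:
        return 403, None
    # 3. Reject nodes in removed/deleted states.
    if node is None or node.state in {"removed", "deleted"}:
        return 403, None
    # 4. Per-node size and rate limits.
    if body_len > MAX_BATCH_BYTES:
        return 413, None
    if not under_rate_limit:
        return 429, None
    # 5. Labels stamped from the control-plane read model, not the payload.
    labels = {
        "service": "gpuaas-node-agent",
        "node_id": node.node_id,
        "region": node.region,
        "cluster": node.cluster,
    }
    # 6./7. Sanitization and durable handoff happen next; 202 is returned
    # on queueing, not on Loki persistence.
    return 202, labels
```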

Scaling Model

Start with API-owned ingest for local-kind/dev-control. For platform-control and production scale:

  1. Put a stateless log-gateway deployment behind the node-facing ingress.
  2. Horizontally scale log-gateway on requests/sec and queue depth.
  3. Use Loki distributor replicas and object-store-backed chunks.
  4. Keep node-local disk buffers large enough for control-plane maintenance windows.
  5. Alert on gateway 5xx, per-node throttling, dropped collector events, and Loki distributor backpressure.
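Point 4 above reduces to simple arithmetic. The log rate, outage window, and headroom factor in this sketch are assumptions for illustration, not measured values:

```python
def min_buffer_bytes(bytes_per_sec: float, outage_secs: float,
                     safety: float = 2.0) -> int:
    """Smallest node-local disk buffer that rides out an outage of the
    given length, with a headroom multiplier for retry amplification."""
    return int(bytes_per_sec * outage_secs * safety)


# Assumed: ~50 KiB/s of node-agent log volume, a 2-hour maintenance
# window, 2x headroom.
needed = min_buffer_bytes(50 * 1024, 2 * 3600)
print(needed)  # 737280000 bytes, roughly 703 MiB
```

Under these assumptions a buffer under 1 GiB covers a 2-hour control-plane window with margin.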

Rollout Plan

  1. Add node-local Vector config to bootstrap package but keep it disabled by default. Done.
  2. Add a platform log-gateway in front of Loki when scale or node-auth boundaries require it. Done: cmd/node-log-gateway and infra/k8s/base/core/node-log-gateway.yaml.
  3. Enable in local-kind VM nodes and verify logs for cert recovery, self-update, and task result retry.
  4. Enable in platform-control nodes with a small disk buffer.
  5. Add Grafana panels and runbook queries:
     - by node_id
     - by correlation_id
     - by task_id
     - by component=node-agent source=self-update-finalizer
  6. Make collector enablement part of node bootstrap/image once validated. The bootstrap package now carries the disabled-by-default Vector unit/config so MAAS reimages do not lose the collector wiring.

Bootstrap Adjacency: Netdata Edge

The node-local nginx Netdata edge is part of the same worker observability baseline, even though it is not a Loki path:

  1. Netdata is configured as a node-local backend on 127.0.0.1:19998.
  2. nginx exposes the stable platform-proxy target on 0.0.0.0:19999.
  3. /gpuaas/telemetry/health is the node-local health check.
  4. /gpuaas/telemetry/netdata/ redirects to the detected Netdata dashboard version path.
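A sketch of the nginx edge config these four points describe. The ports and paths come from the list above; the /v3/ redirect target is illustrative since the dashboard version path is detected at converge time:

```nginx
# Sketch only: the versioned dashboard path is detected, not hard-coded.
server {
    listen 0.0.0.0:19999;           # stable platform-proxy target
    default_type text/plain;

    location = /gpuaas/telemetry/health {
        return 200 "ok\n";          # node-local health check
    }

    location /gpuaas/telemetry/netdata/ {
        return 302 /v3/;            # detected Netdata dashboard version path
    }

    location / {
        proxy_pass http://127.0.0.1:19998;   # node-local Netdata backend
    }
}
```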

This belongs in bootstrap or the host image, not only in scripts/ops/gpuaas_netdata_edge_converge.sh, so reimaged nodes do not regress to direct Netdata exposure or require API-side probing across the /v1, /v2, and /v3 dashboard route versions.

Open Decisions

  1. Whether the gateway should evolve from bearer-token validation to strict per-node mTLS identity validation before production.
  2. Whether node-agent should expose a local health signal for collector liveness.
  3. Retention tier for node-agent logs in platform-control versus production.