Node Agent Log Collection to Loki v1
As of: May 3, 2026
Purpose
Node-agent recovery evidence must not depend on SSH access to GPU nodes. This design adds a scalable collector path for node-agent logs while preserving the node-agent control-plane trust model.
Goals
- Collect gpuaas-node-agent.service journald logs and /var/log/gpuaas-node-agent*.log finalizer logs.
- Survive node/control-plane network interruptions with bounded local buffering.
- Authenticate every log stream as a specific node_id.
- Keep Loki labels low-cardinality and queryable by incident workflows.
- Scale horizontally in platform-control and future production environments.
Non-Goals
- General host log collection for every system service.
- Direct node access to Loki, Grafana, object storage, Redis, Postgres, NATS, or Temporal.
- Replacing node task result output or audit logs.
Architecture
GPU node
journald + /var/log/gpuaas-node-agent*.log
│
▼
node-local collector (Vector)
│ bearer token + node labels + bounded disk buffer
▼
gpuaas-node-log-gateway replicas
│
▼
distributor / gateway (horizontal)
│
▼
Loki write path
│
▼
Grafana / ops UI / incident runbooks
The node-local collector sends to the same node-facing authority as the node
agent:
POST /internal/v1/node-logs/loki/api/v1/push. That route terminates at
gpuaas-node-log-gateway, not raw Loki. The gateway validates the node bearer
token, caps request size, forwards only Loki push batches to in-cluster Loki,
and exposes Prometheus counters for accepted/rejected/forward-failed batches.
The log gateway is a scaling split, not a second trust root. It must remain
behind the node-facing ingress and the same node mTLS/client-cert posture as
/internal/v1/nodes/*. Loki itself is not node-facing.
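The gateway behavior described above can be sketched as a small decision function. This is a minimal illustration, not the code in cmd/node-log-gateway: the class name, the in-memory token store, the 1 MiB cap, and every status code other than 202 are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field
import hmac

# Hypothetical cap; the design does not fix a value.
MAX_BODY_BYTES = 1 * 1024 * 1024


@dataclass
class GatewaySketch:
    """Toy model of the gateway's accept/reject/forward decision."""

    tokens: dict  # node_id -> expected bearer token (hypothetical store)
    counters: Counter = field(default_factory=Counter)

    def handle_push(self, node_id, authorization, body, forward):
        expected = self.tokens.get(node_id)
        if expected is None or not authorization.startswith("Bearer "):
            self.counters["rejected"] += 1
            return 401
        presented = authorization[len("Bearer "):]
        # Constant-time comparison so timing does not leak token contents.
        if not hmac.compare_digest(expected.encode(), presented.encode()):
            self.counters["rejected"] += 1
            return 401
        if len(body) > MAX_BODY_BYTES:
            self.counters["rejected"] += 1
            return 413
        try:
            forward(body)  # stands in for the HTTP call to in-cluster Loki
        except OSError:
            self.counters["forward_failed"] += 1
            return 502
        self.counters["accepted"] += 1
        return 202
```

The three counter keys mirror the accepted/rejected/forward-failed outcomes named above; a real gateway would export them through a Prometheus client rather than a Counter.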
Node Collector
Recommended first implementation: Vector. The bootstrap package and worker Ansible role install the config/unit templates disabled by default; enabling requires the node-log-gateway URL and node bearer token material.
Reasons:
1. Mature journald and file tail sources.
2. Disk buffering and backpressure controls.
3. Loki and HTTP sinks.
4. Small operational footprint on bare-metal nodes.
Required sources:
1. journald filtered to _SYSTEMD_UNIT=gpuaas-node-agent.service.
2. journald filtered to gpuaas-metrics-helper.service and
gpuaas-metrics-helper.timer.
3. file for /var/log/gpuaas-node-agent-self-update.log.
4. Optional file glob for future /var/log/gpuaas-node-agent*.log.
Required transforms:
1. Add node_id from GPUAAS_NODE_ID.
2. Add component=node-agent.
3. Add source=journald|self-update-finalizer.
4. Drop or redact known secret fields before egress:
GPUAAS_ENROLLMENT_TOKEN, Authorization, access_token, refresh_token,
private keys, SSH private keys.
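As an illustration of the redaction transform, the sketch below masks the listed secret fields and PEM-style private key blocks in a single log line. It is written in Python only to make the rule concrete; on the node the equivalent logic would live in the collector's own transform. The regular expressions and the [REDACTED] placeholder are assumptions, not the shipped rules.

```python
import re

SECRET_KEYS = (
    "GPUAAS_ENROLLMENT_TOKEN",
    "Authorization",
    "access_token",
    "refresh_token",
)

# Matches key=value, key: value, and "key": "value" forms, case-insensitively.
_KV = re.compile(
    r'(?i)("?\b(?:' + "|".join(map(re.escape, SECRET_KEYS)) + r')\b"?\s*[=:]\s*)'
    r'("[^"]*"|(?:Bearer\s+)?[^\s,;]+)'
)

# Matches PEM blocks such as RSA, EC, and OPENSSH private keys.
_PEM = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
    re.DOTALL,
)


def _mask(match):
    # Keep quoting intact so JSON-shaped lines stay parseable.
    quoted = match.group(2).startswith('"')
    return match.group(1) + ('"[REDACTED]"' if quoted else "[REDACTED]")


def redact(line: str) -> str:
    """Return the line with known secret values replaced by a placeholder."""
    line = _PEM.sub("[REDACTED PRIVATE KEY]", line)
    return _KV.sub(_mask, line)


print(redact("Authorization: Bearer abc.def"))
```

Pattern-based redaction only catches the shapes it knows about, so it complements, rather than replaces, keeping secrets out of node-agent log statements in the first place.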
Required buffering:
1. Disk buffer enabled.
2. Bounded max size with oldest-drop policy after alerting.
3. Exponential retry to ingest endpoint.
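The buffering requirements can be modeled in a few lines. The sketch below uses an in-memory deque as a stand-in for the disk buffer and a generator for the retry schedule; the class and function names, the byte limit, and the base, factor, and cap values are all hypothetical.

```python
from collections import deque


class BoundedBuffer:
    """Byte-bounded FIFO that drops the oldest events when full."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.size = 0
        self.dropped = 0  # exported as a metric so drops can raise an alert
        self._events = deque()

    def push(self, event: bytes):
        self._events.append(event)
        self.size += len(event)
        # Evict from the head until the buffer fits within the bound again.
        while self.size > self.max_bytes and self._events:
            oldest = self._events.popleft()
            self.size -= len(oldest)
            self.dropped += 1

    def pending(self):
        return list(self._events)


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=8):
    """Yield capped exponential retry delays in seconds."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


buf = BoundedBuffer(max_bytes=10)
for chunk in (b"aaaa", b"bbbb", b"cccc"):
    buf.push(chunk)
print(buf.pending(), buf.dropped)
print(list(backoff_delays()))
```

Dropping from the head favors the most recent evidence, which is usually what a recovery investigation needs; the dropped counter is what the alerting requirement would hang off.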
Loki Labels
Use low-cardinality labels only:
| Label | Source |
|---|---|
| service | constant gpuaas-node-agent |
| node_id | trusted from mTLS path or collector env, verified by ingest |
| region | control-plane node record lookup, not node-supplied |
| cluster | deployment environment |
| component | node-agent |
| source | journald or self-update-finalizer |
Do not label by task_id, allocation_id, user_id, error string, host path, or
high-cardinality message content. Those remain structured log fields.
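To make the label/field split concrete, the sketch below builds a Loki push body in the JSON shape accepted by /loki/api/v1/push: a streams array whose entries carry a stream label map and values as [nanosecond-timestamp-string, line] pairs. The helper name and the choice to encode structured fields as a JSON log line are assumptions for illustration.

```python
import json
import time

ALLOWED_LABELS = {"service", "node_id", "region", "cluster", "component", "source"}


def build_push(labels, fields, message, ts_ns=None):
    """Build one Loki push payload; high-cardinality data stays in the line."""
    extra = set(labels) - ALLOWED_LABELS
    if extra:
        raise ValueError(f"disallowed labels: {sorted(extra)}")
    ts_ns = time.time_ns() if ts_ns is None else ts_ns
    # task_id, allocation_id, and similar values travel inside the log line.
    line = json.dumps({"msg": message, **fields}, sort_keys=True)
    return {"streams": [{"stream": dict(labels), "values": [[str(ts_ns), line]]}]}


payload = build_push(
    {"service": "gpuaas-node-agent", "node_id": "node-a", "source": "journald"},
    {"task_id": "t-1"},
    "task result retry scheduled",
)
print(json.dumps(payload, indent=2))
```

Rejecting unknown label names at build time keeps an accidental task_id label from creating a new stream per task.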
Ingest Semantics
The ingest endpoint should:
- Require node auth token and node mTLS when INTERNAL_NODE_MTLS_STRICT=true.
- Validate that the path node_id matches the mTLS node identity.
- Reject nodes in removed/deleted states.
- Enforce per-node request size and rate limits.
- Stamp trusted labels from the control-plane read model.
- Sanitize log payloads before forwarding.
- Return 202 Accepted after durable handoff to collector/gateway, not necessarily after Loki persistence.
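The identity and label-stamping checks above might look like the following. This is a sketch under stated assumptions: NodeRecord is a hypothetical view of the control-plane read model, the 400 and 403 status codes are illustrative choices, and rate limiting and payload sanitization are left out for brevity.

```python
from dataclasses import dataclass

REJECTED_STATES = {"removed", "deleted"}
ALLOWED_SOURCES = {"journald", "self-update-finalizer"}


@dataclass(frozen=True)
class NodeRecord:
    """Hypothetical control-plane view of a node."""

    node_id: str
    state: str
    region: str
    cluster: str


def admit(path_node_id, mtls_node_id, record, supplied_labels):
    """Return (status, trusted_labels); labels are None on rejection."""
    if path_node_id != mtls_node_id:
        return 403, None
    if record is None or record.state in REJECTED_STATES:
        return 403, None
    source = supplied_labels.get("source", "journald")
    if source not in ALLOWED_SOURCES:
        return 400, None
    # Everything except source comes from the control plane, never the node.
    labels = {
        "service": "gpuaas-node-agent",
        "component": "node-agent",
        "source": source,
        "node_id": record.node_id,
        "region": record.region,
        "cluster": record.cluster,
    }
    return 202, labels


record = NodeRecord("node-a", "active", "us-east", "platform-control")
print(admit("node-a", "node-a", record, {"source": "journald", "region": "spoofed"}))
```

In the usage line, the node-supplied region is ignored and the record's region is stamped instead, which is the point of deriving labels from the read model.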
Scaling Model
Start with API-owned ingest for local-kind/dev-control. For platform-control and production scale:
- Put a stateless log-gateway deployment behind the node-facing ingress.
- Horizontally scale log-gateway on requests/sec and queue depth.
- Use Loki distributor replicas and object-store-backed chunks.
- Keep node-local disk buffers large enough for control-plane maintenance windows.
- Alert on gateway 5xx, per-node throttling, dropped collector events, and Loki distributor backpressure.
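Sizing the node-local buffer for a maintenance window is simple arithmetic: log rate times average line size times outage duration, with some headroom. The sketch below works one example; every number in it (20 lines per second, 300 bytes per line, a two-hour window, 1.5x headroom) is a hypothetical input, not a measured value.

```python
def required_buffer_bytes(lines_per_sec, avg_line_bytes, outage_seconds, headroom=1.5):
    """Estimate the disk buffer needed to ride out an ingest outage."""
    return int(lines_per_sec * avg_line_bytes * outage_seconds * headroom)


# 20 lines/s * 300 B * 7200 s * 1.5 = 64,800,000 bytes
size = required_buffer_bytes(lines_per_sec=20, avg_line_bytes=300, outage_seconds=2 * 3600)
print(f"{size} bytes, about {size / 2**20:.1f} MiB")
```

Real inputs should come from observed per-node log volume during the noisiest workflows, such as self-update or cert recovery, rather than steady-state averages.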
Rollout Plan
- Add node-local Vector config to bootstrap package but keep it disabled by default. Done.
- Add a platform log-gateway in front of Loki when scale or node-auth boundaries require it. Done: cmd/node-log-gateway and infra/k8s/base/core/node-log-gateway.yaml.
- Enable in local-kind VM nodes and verify logs for cert recovery, self-update, and task result retry.
- Enable in platform-control nodes with a small disk buffer.
- Add Grafana panels and runbook queries:
  - by node_id
  - by correlation_id
  - by task_id
  - by component=node-agent source=self-update-finalizer
- Make collector enablement part of node bootstrap/image once validated. The bootstrap package now carries the disabled-by-default Vector unit/config so MAAS reimages do not lose the collector wiring.
Bootstrap Adjacency: Netdata Edge
The node-local nginx Netdata edge is part of the same worker observability baseline, even though it is not a Loki path:
- Netdata is configured as a node-local backend on 127.0.0.1:19998.
- nginx exposes the stable platform-proxy target on 0.0.0.0:19999.
- /gpuaas/telemetry/health is the node-local health check.
- /gpuaas/telemetry/netdata/ redirects to the detected Netdata dashboard version path.
This belongs in bootstrap or the host image, not only in
scripts/ops/gpuaas_netdata_edge_converge.sh, so reimaged nodes do not regress
to direct Netdata exposure or require API-side /v1, /v2, or /v3 route probing.
Open Decisions
- Whether the gateway should evolve from bearer-token validation to strict per-node mTLS identity validation before production.
- Whether node-agent should expose a local health signal for collector liveness.
- Retention tier for node-agent logs in platform-control versus production.