
Node-to-Control-Plane Communication Security Audit v1

As of: May 3, 2026

Purpose

Every node-to-control-plane channel must be node-initiated, authenticated, bounded, observable, and recoverable. Adding node-agent log collection must not open raw observability backends or new unauthenticated ports to worker nodes.

Current Node-Initiated Channels

| Channel | Node target | Control-plane owner | Auth boundary | Payload bound | Audit/evidence |
|---|---|---|---|---|---|
| Enrollment | /internal/v1/nodes/enroll | gpuaas-api | bootstrap/enrollment token, node mTLS when strict | API request limit | enrollment audit/read model |
| Cert renew | /internal/v1/nodes/{node_id}/cert/renew | gpuaas-api | node bearer token, node mTLS when strict | API request limit | node lifecycle state |
| Task wait/result | /internal/v1/nodes/{node_id}/tasks/* | gpuaas-api | node bearer token, task signature verification, node mTLS when strict | API request limit, stale-task sweeper | task result rows, lifecycle run evidence |
| Terminal stream | /internal/ws/terminal/{session_id} | gpuaas-terminal-gateway | node mTLS/internal session binding | WebSocket/session TTL | terminal session binding + gateway metrics |
| Node logs | /internal/v1/node-logs/loki/api/v1/push | gpuaas-node-log-gateway | node bearer token, node-facing ingress mTLS posture | NODE_LOG_GATEWAY_MAX_BODY_BYTES, Vector disk buffer | gateway metrics + Loki records |
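The payload bound on the node-logs channel can be illustrated with a small size check. This is a minimal sketch, not the gateway's actual code: the function names, the 1 MiB fallback, and the choice to return a bare status code are assumptions made for illustration. Only the NODE_LOG_GATEWAY_MAX_BODY_BYTES variable name comes from the table above.

```python
import os

# Hypothetical fallback; the audit names the variable but not its value.
DEFAULT_MAX_BODY_BYTES = 1_048_576


def max_body_bytes(env=os.environ):
    """Read the body limit from NODE_LOG_GATEWAY_MAX_BODY_BYTES, with a fallback."""
    raw = env.get("NODE_LOG_GATEWAY_MAX_BODY_BYTES", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_MAX_BODY_BYTES
    # Treat zero or negative values as misconfiguration and use the fallback.
    return value if value > 0 else DEFAULT_MAX_BODY_BYTES


def check_body_size(content_length, body, limit):
    """Return 413 if the push batch exceeds the limit, otherwise 200."""
    # Reject early on the declared length, then verify the bytes actually
    # read, because a client can under-report Content-Length.
    if content_length is not None and content_length > limit:
        return 413
    if len(body) > limit:
        return 413
    return 200
```

Checking both the declared and the actual size keeps the bound meaningful even when a node sends a misleading Content-Length header.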

Required Security Posture

  • Nodes must not connect directly to Postgres, Redis, NATS, Temporal, Grafana, Prometheus, Tempo, or raw Loki.
  • Nodes must not receive long-lived Loki credentials.
  • gpuaas-node-log-gateway is the only node-facing logs path and forwards only Loki push batches to in-cluster Loki.
  • Loki remains cluster-internal for node log writes. Public/admin Loki/Grafana access is for operators, not node agents.
  • Node-facing routes must remain behind the node API ingress, client-cert pass-through middleware, and the existing node API token posture.
  • Logs must be redacted at the node-local collector before egress; gateway logs must not echo request bodies or bearer tokens.
  • Every node-facing service must expose counters for accepted, rejected, and backend-failed requests so operations can audit abuse, drift, and outage behavior without SSH.
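Two of the requirements above, redaction before egress and per-outcome counters, can be sketched as follows. Everything here is illustrative: the regular expression, the `redact` helper, the `OutcomeCounters` class, and the status-to-outcome mapping are assumptions, not the collector's or the gateway's actual implementation.

```python
import re
from collections import Counter

# Illustrative pattern only: masks the credential in "Bearer <token>" strings.
BEARER_RE = re.compile(r"(?i)\b(bearer)\s+[A-Za-z0-9._~+/=-]+")


def redact(line):
    """Mask bearer tokens in a log line before it leaves the node."""
    return BEARER_RE.sub(r"\1 [REDACTED]", line)


class OutcomeCounters:
    """Counts accepted, rejected, and backend-failed requests for one service."""

    OUTCOMES = ("accepted", "rejected", "backend_failed")

    def __init__(self):
        self.counts = Counter({name: 0 for name in self.OUTCOMES})

    def record(self, status):
        # Map an HTTP status to one of the three audited outcomes. Anything
        # that is neither 2xx nor 4xx is counted as a backend failure in
        # this sketch.
        if 200 <= status < 300:
            outcome = "accepted"
        elif 400 <= status < 500:
            outcome = "rejected"
        else:
            outcome = "backend_failed"
        self.counts[outcome] += 1
        return outcome
```

Recording only the outcome, never the request body or the Authorization header, keeps the counters consistent with the rule that gateway logs must not echo payloads or tokens.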

Follow-Up Hardening

The current log gateway validates the shared node bearer token, matching the existing internal node API posture. Before broad production rollout:

  1. Promote the gateway to strict per-node mTLS identity validation.
  2. Stamp trusted node_id, region, and environment labels from the control-plane node read model instead of trusting node-supplied labels.
  3. Add per-node rate limits and drop counters.
  4. Add an operator read model for node log shipping posture: enabled, last_batch_at, last_reject_reason, dropped_events, and gateway_backend_errors.
  5. Add alert rules for gateway 401 spikes, 413 spikes, 5xx forwarding failures, and collector buffer pressure.
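Item 2 above, stamping trusted labels, could look roughly like the sketch below. It assumes the batch arrives in Loki's JSON push format (a top-level `streams` list whose entries carry a `stream` label map and `values`); Loki also accepts snappy-compressed protobuf, which this sketch does not handle. `NODE_READ_MODEL`, `stamp_trusted_labels`, and the sample node values are hypothetical stand-ins for the control-plane node read model.

```python
import json

# Hypothetical stand-in for the control-plane node read model.
NODE_READ_MODEL = {
    "node-123": {"node_id": "node-123", "region": "us-east", "environment": "prod"},
}

TRUSTED_KEYS = ("node_id", "region", "environment")


def stamp_trusted_labels(push_body, authenticated_node_id, read_model=NODE_READ_MODEL):
    """Overwrite trusted labels on every stream in a Loki JSON push batch."""
    # Raises KeyError for a node the read model does not know about.
    trusted = read_model[authenticated_node_id]
    batch = json.loads(push_body)
    for stream in batch.get("streams", []):
        labels = stream.setdefault("stream", {})
        for key in TRUSTED_KEYS:
            # Control-plane values always win over node-supplied ones.
            labels[key] = trusted[key]
    return json.dumps(batch).encode("utf-8")
```

The node identity passed in should come from the authenticated channel (for example, the validated client certificate once item 1 lands), never from the request body, so a node cannot write logs under another node's labels.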