Node to Control Plane Communication Security Audit v1
As of: May 3, 2026
Purpose
Every node-to-control-plane channel must be node-initiated, authenticated, bounded, observable, and recoverable. Adding node-agent log collection must not open raw observability backends or new unauthenticated ports to worker nodes.
Current Node-Initiated Channels
| Channel | Node target | Control-plane owner | Auth boundary | Payload bound | Audit/evidence |
|---|---|---|---|---|---|
| Enrollment | `/internal/v1/nodes/enroll` | `gpuaas-api` | bootstrap/enrollment token, node mTLS when strict | API request limit | enrollment audit/read model |
| Cert renew | `/internal/v1/nodes/{node_id}/cert/renew` | `gpuaas-api` | node bearer token, node mTLS when strict | API request limit | node lifecycle state |
| Task wait/result | `/internal/v1/nodes/{node_id}/tasks/*` | `gpuaas-api` | node bearer token, task signature verification, node mTLS when strict | API request limit, stale-task sweeper | task result rows, lifecycle run evidence |
| Terminal stream | `/internal/ws/terminal/{session_id}` | `gpuaas-terminal-gateway` | node mTLS/internal session binding | WebSocket/session TTL | terminal session binding + gateway metrics |
| Node logs | `/internal/v1/node-logs/loki/api/v1/push` | `gpuaas-node-log-gateway` | node bearer token, node-facing ingress mTLS posture | `NODE_LOG_GATEWAY_MAX_BODY_BYTES`, Vector disk buffer | gateway metrics + Loki records |
Required Security Posture
- Nodes must not connect directly to Postgres, Redis, NATS, Temporal, Grafana, Prometheus, Tempo, or raw Loki.
- Nodes must not receive long-lived Loki credentials.
- `gpuaas-node-log-gateway` is the only node-facing logs path and forwards only Loki push batches to in-cluster Loki.
- Loki remains cluster-internal for node log writes. Public/admin Loki/Grafana access is for operators, not node agents.
- Node-facing routes must remain behind the node API ingress, client-cert pass-through middleware, and the existing node API token posture.
- Logs must be redacted at the node-local collector before egress; gateway logs must not echo request bodies or bearer tokens.
- Every node-facing service must expose counters for accepted, rejected, and backend-failed requests so operations can audit abuse, drift, and outage behavior without SSH.
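
The admission and counting requirements above can be illustrated with a minimal sketch. It assumes a single shared bearer token and a byte limit taken from `NODE_LOG_GATEWAY_MAX_BODY_BYTES`; the class name, counter names, default limit, and status codes are hypothetical and are not taken from the actual gateway implementation.

```python
import hmac
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical default; a real deployment would read NODE_LOG_GATEWAY_MAX_BODY_BYTES.
DEFAULT_MAX_BODY_BYTES = 1_048_576


@dataclass
class GatewayAdmission:
    """Decides whether a node log push is admitted and counts every outcome."""

    expected_token: str
    max_body_bytes: int = DEFAULT_MAX_BODY_BYTES
    counters: Counter = field(default_factory=Counter)

    def admit(self, authorization: str, body: bytes) -> int:
        """Return an HTTP status for the push. Never logs the body or the token."""
        scheme, _, token = authorization.partition(" ")
        # Constant-time comparison avoids leaking token contents through timing.
        token_ok = hmac.compare_digest(token.encode(), self.expected_token.encode())
        if scheme != "Bearer" or not token_ok:
            self.counters["rejected_unauthorized"] += 1
            return 401
        if len(body) > self.max_body_bytes:
            self.counters["rejected_too_large"] += 1
            return 413
        self.counters["accepted"] += 1
        return 204

    def record_backend_failure(self) -> None:
        """Call when forwarding an admitted batch to in-cluster Loki fails."""
        self.counters["backend_failed"] += 1
```

Keeping the counters separate per rejection reason lets operators tell credential drift (401) apart from oversized batches (413) and backend outages without reading request bodies.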
Follow-Up Hardening
The current log gateway validates the shared node bearer token, matching the existing internal node API posture. Before broad production rollout:
- Promote the gateway to strict per-node mTLS identity validation.
- Stamp trusted `node_id`, region, and environment labels from the control-plane node read model instead of trusting node-supplied labels.
- Add per-node rate limits and drop counters.
- Add an operator read model for node log shipping posture: `enabled`, `last_batch_at`, `last_reject_reason`, `dropped_events`, and `gateway_backend_errors`.
- Add alert rules for gateway 401 spikes, 413 spikes, 5xx forwarding failures, and collector buffer pressure.
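
Two of the hardening items above, trusted label stamping and per-node rate limits with drop counters, might look roughly like the following sketch. It assumes push batches arrive in Loki's JSON push shape (`streams` → `stream` labels plus `values`) rather than the protobuf encoding, and that the trusted values have already been looked up from the node read model. The function name, class name, and token-bucket parameters are hypothetical.

```python
import time
from collections import Counter

# Labels owned by the gateway; node-supplied values for these are overwritten.
TRUSTED_LABELS = ("node_id", "region", "environment")


def stamp_trusted_labels(push: dict, trusted: dict) -> dict:
    """Return a copy of a Loki JSON push payload with trusted labels overwritten."""
    stamped = []
    for stream in push.get("streams", []):
        labels = dict(stream.get("stream", {}))
        for name in TRUSTED_LABELS:
            labels[name] = trusted[name]
        stamped.append({"stream": labels, "values": list(stream.get("values", []))})
    return {"streams": stamped}


class NodeRateLimiter:
    """Token bucket keyed by node_id that counts dropped batches per node."""

    def __init__(self, rate_per_sec: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self._buckets = {}  # node_id -> (tokens, last_refill_time)
        self.dropped = Counter()

    def allow(self, node_id: str) -> bool:
        """Consume one token for node_id, or record a drop if none is available."""
        now = self.clock()
        tokens, last = self._buckets.get(node_id, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1.0:
            self._buckets[node_id] = (tokens, now)
            self.dropped[node_id] += 1
            return False
        self._buckets[node_id] = (tokens - 1.0, now)
        return True
```

Overwriting rather than merging the trusted labels means a node cannot impersonate another node or region in Loki, and the per-node drop counter could feed the `dropped_events` field of the operator read model.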