2026-03 Node API mTLS Identity Handoff
Summary
External node-agent task polling was broken by a chain of control-plane defects:
- wrong node-reachable URL shape
- wrong CA material in Traefik client-auth config
- stale node certificates after CA correction
- missing client-cert identity forwarding from Traefik to API

The end result was a long sequence of 404, EOF, unknown certificate authority,
and invalid node identity failures before real task polling was restored.
Impact
- node-agent could not claim tasks from the control plane
- bootstrap/reinstall attempts appeared to succeed while runtime polling still failed
- recovery time increased because transport, TLS, auth, and lifecycle state failures masked each other
Symptoms
- node-agent `tasks/wait` returned 404 against raw-IP routing
- hostname route then failed with TLS EOF before a server certificate was presented
- after router fixes, node-agent hit `remote error: tls: unknown certificate authority`
- after CA alignment, API returned `401 invalid node identity`
- later API logging showed `mtls_present=false` because forwarded client-cert headers were missing at the API boundary
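
Because each symptom above pointed at a different layer, a small lookup like the following could help triage a node-agent or API log line. It is only a sketch: the layer descriptions are a reading of this incident, and the names `SYMPTOM_LAYERS` and `likely_layer` are hypothetical, not part of any existing tool.

```python
# Ordered (substring, likely layer) pairs taken from this incident's symptoms.
# The mapping is an assumption for illustration, not a general diagnostic rule.
SYMPTOM_LAYERS = [
    ("unknown certificate authority",
     "CA trust: client cert issuer not trusted by Traefik"),
    ("invalid node identity",
     "identity handoff: API could not establish the caller's node identity"),
    ("mtls_present=false",
     "identity handoff: forwarded client-cert headers missing at API"),
    ("EOF",
     "TLS handshake: failed before a server certificate was presented"),
    ("404",
     "routing: request did not match the node API route"),
]


def likely_layer(log_line: str) -> str:
    """Return the first matching layer hint, or 'unclassified'."""
    for needle, layer in SYMPTOM_LAYERS:
        if needle in log_line:
            return layer
    return "unclassified"


if __name__ == "__main__":
    print(likely_layer("remote error: tls: unknown certificate authority"))
```

A match is only a starting point. In this incident the failures masked each other, so fixing one layer exposed the next symptom in the list.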
Root Cause
The owner layer was the node API ingress and identity handoff path, not the node-agent.
Concrete defects:
- bootstrap/runtime used a node-reachable URL path that bypassed the intended hostname
router behavior
- deploy path populated gpuaas-node-mtls-ca with the wrong certificate content
- node certificates were issued from a transient CA while Traefik trusted a different CA
- Traefik was not forwarding usable client-cert identity to API on the node API route
- API initially only understood one forwarded cert format, while Traefik emitted another
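
The last defect can be illustrated with a parser that accepts more than one forwarded-certificate format. This is a minimal sketch, not the API's actual code. It assumes the two formats were the ones Traefik's PassTLSClientCert middleware can emit (`X-Forwarded-Tls-Client-Cert` with URL-escaped base64 certificate data, and `X-Forwarded-Tls-Client-Cert-Info` with URL-escaped `Subject="..."` fields); the source does not say which formats were involved. It also assumes the node identity can be read from the subject CN. The function name and return shape are hypothetical.

```python
import base64
import hashlib
import re
from typing import Mapping, Optional
from urllib.parse import unquote

PEM_HEADER = "X-Forwarded-Tls-Client-Cert"
INFO_HEADER = "X-Forwarded-Tls-Client-Cert-Info"

_INFO_FIELD = re.compile(r'(\w+)="([^"]*)"')
_COMMON_NAME = re.compile(r"(?:^|,)CN=([^,]+)")


def identity_from_headers(headers: Mapping[str, str]) -> Optional[dict]:
    """Extract a caller identity hint from either forwarded-cert header.

    `headers` is assumed to be a plain mapping with exact-case keys;
    a real web framework would normalise header names first.
    """
    info = headers.get(INFO_HEADER)
    if info:
        fields = {}
        # Keep the first occurrence of each field (the first certificate).
        for key, value in _INFO_FIELD.findall(unquote(info)):
            fields.setdefault(key, value)
        match = _COMMON_NAME.search(fields.get("Subject", ""))
        if match:
            return {"source": "info", "common_name": match.group(1)}

    pem = headers.get(PEM_HEADER)
    if pem:
        # Several certificates are comma-separated; use the first one.
        first = unquote(pem).split(",")[0]
        # Drop PEM delimiter lines if present, keep the base64 body.
        body = "".join(
            line.strip() for line in first.splitlines() if "-----" not in line
        )
        der = base64.b64decode(body)
        return {"source": "pem", "sha256": hashlib.sha256(der).hexdigest()}

    return None  # no forwarded client-cert identity at the API boundary
```

Returning `None` when neither header is present keeps the missing-header case distinct from a header that is present but cannot be mapped to a node.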
Why Detection Was Weak
- node-agent logs only exposed the current transport/auth symptom, not the failing owner
- API rejection logs lacked caller mTLS fields until added during the incident
- Traefik logs were not centrally available in Loki
- error strings like `invalid node identity` hid whether a TLS peer identity was actually present at the API
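
The logging gap above can be sketched as a helper that attaches caller mTLS fields to every rejection. Only the `mtls_present` field name comes from the incident. The other field names, the function names, and the header names (taken from Traefik's PassTLSClientCert middleware) are assumptions for illustration.

```python
from typing import Mapping

# Header names assumed from Traefik's PassTLSClientCert middleware.
FORWARDED_CERT_HEADERS = (
    "X-Forwarded-Tls-Client-Cert",
    "X-Forwarded-Tls-Client-Cert-Info",
)


def rejection_log_fields(headers: Mapping[str, str], reason: str) -> dict:
    """Build log fields that show whether any client-cert header arrived."""
    present = {name.lower() for name in headers}
    seen = [h for h in FORWARDED_CERT_HEADERS if h.lower() in present]
    return {
        "reason": reason,
        "mtls_present": bool(seen),
        "forwarded_cert_headers": seen,
    }


def format_fields(fields: Mapping[str, object]) -> str:
    """Render fields as a single key=value log line."""
    parts = []
    for key, value in fields.items():
        if isinstance(value, bool):
            text = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            text = ",".join(value) or "-"
        else:
            text = f'"{value}"'
        parts.append(f"{key}={text}")
    return " ".join(parts)


if __name__ == "__main__":
    print(format_fields(rejection_log_fields({}, "invalid node identity")))
```

With fields like these, a `401` caused by missing headers reads differently from one caused by a certificate that was forwarded but did not map to a known node.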
Recovery
Recovery required all of the following:
- restore hostname-based node API URLs with /etc/hosts override to the private control-plane IP
- persist and reuse a node CA shared by API issuance and Traefik client-auth trust
- patch Traefik CA secret content and shape
- force node re-enrollment so stale certificates were replaced
- convert node API routing to an explicit Traefik route where mTLS middleware applied
- teach API to parse Traefik forwarded client-cert headers
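
One way to catch the CA mismatch behind several of these steps is to compare the CA certificate the API issues from with the one Traefik trusts. The sketch below uses only the Python standard library and compares SHA-256 fingerprints of the DER bytes. It assumes each input is a single PEM certificate, and how the two PEM texts are obtained (for example from the `gpuaas-node-mtls-ca` secret) is left out. The function names are hypothetical.

```python
import hashlib
import ssl


def ca_fingerprint(pem_text: str) -> str:
    """SHA-256 hex digest of the DER bytes of a single PEM certificate.

    Raises ValueError if the text is not framed by the usual
    BEGIN/END CERTIFICATE lines, which also catches a secret that
    holds the wrong kind of content.
    """
    der = ssl.PEM_cert_to_DER_cert(pem_text.strip())
    return hashlib.sha256(der).hexdigest()


def same_ca(issuing_ca_pem: str, trusted_ca_pem: str) -> bool:
    """True when both PEM texts encode the same certificate bytes."""
    return ca_fingerprint(issuing_ca_pem) == ca_fingerprint(trusted_ca_pem)
```

This only shows that both sides hold the same CA certificate. It does not verify that a given node certificate chains to that CA, so a stale node certificate would still need a re-enrollment or a separate chain check such as `openssl verify`.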
Follow-ups
- ship node-agent logs and Traefik logs to Loki with stable labels
- keep rejection-path caller identity fields in API logs
- add an integration test for external node-agent task polling through Traefik
- document the node bootstrap, cert, and `tasks/wait` recovery sequence in the runbook
- investigate and remove the anonymous internal caller still polling `/internal/v1/nodes/.../tasks/wait`