2026-03 Node API mTLS Identity Handoff

Summary

External node-agent task polling was broken by a chain of control-plane defects:

  • wrong node-reachable URL shape
  • wrong CA material in Traefik client-auth config
  • stale node certificates after CA correction
  • missing client-cert identity forwarding from Traefik to API

The end result was a long sequence of 404, EOF, unknown certificate authority, and invalid node identity failures before real task polling was restored.

Impact

  • node-agent could not claim tasks from the control plane
  • bootstrap/reinstall attempts appeared to succeed while runtime polling still failed
  • recovery time increased because transport, TLS, auth, and lifecycle state failures masked each other

Symptoms

  • node-agent tasks/wait returned 404 against raw-IP routing
  • hostname route then failed with TLS EOF before a server certificate was presented
  • after router fixes, node-agent hit remote error: tls: unknown certificate authority
  • after CA alignment, API returned 401 invalid node identity
  • later API logging showed mtls_present=false because forwarded client-cert headers were missing at the API boundary
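
Because each symptom above pointed at a different layer, a small triage table can shorten the next diagnosis. The sketch below maps the error strings from this incident to the layer that turned out to be responsible. The function name, the layer descriptions, and the substring matching are illustrative assumptions, not part of node-agent or the API.

```python
# Hypothetical triage table built from the symptom sequence in this incident.
# Order matters: more specific strings are checked before short ones like "404".
SYMPTOM_LAYERS = [
    ("unknown certificate authority",
     "CA trust: Traefik client-auth CA does not match the node certificate issuer"),
    ("invalid node identity",
     "identity handoff: client-cert identity did not reach the API"),
    ("EOF",
     "TLS routing: hostname route closed before presenting a server certificate"),
    ("404",
     "HTTP routing: request did not match the node API route (raw-IP URL)"),
]


def suspect_layer(error_text: str) -> str:
    """Return the layer to inspect first for a node-agent polling error."""
    for needle, layer in SYMPTOM_LAYERS:
        if needle in error_text:
            return layer
    return "unclassified"


if __name__ == "__main__":
    print(suspect_layer("remote error: tls: unknown certificate authority"))
```

The table only records what this incident showed. A new failure mode would return "unclassified" rather than a misleading guess.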

Root Cause

The owner layer was the node API ingress and identity handoff path, not the node-agent.

Concrete defects:

  • bootstrap/runtime used a node-reachable URL path that bypassed the intended hostname router behavior
  • deploy path populated gpuaas-node-mtls-ca with the wrong certificate content
  • node certificates were issued from a transient CA while Traefik trusted a different CA
  • Traefik was not forwarding usable client-cert identity to API on the node API route
  • API initially only understood one forwarded cert format, while Traefik emitted another
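
The last defect, a format mismatch in the forwarded certificate, can be illustrated with a tolerant decoder. This report does not say which two formats were involved. The sketch below assumes they were a percent-encoded full PEM block and a bare base64 certificate body, and that a comma separates multiple certificates with the leaf first. The function name is hypothetical and this is not the API's actual parser.

```python
import base64
from urllib.parse import unquote

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"
PEM_END = "-----END CERTIFICATE-----"


def forwarded_cert_to_der(header_value: str) -> bytes:
    """Decode the first certificate in a forwarded client-cert header to DER.

    Accepts either a percent-encoded PEM block or a bare base64 body
    (both formats are assumptions for this sketch).
    """
    text = unquote(header_value)       # leaves unescaped base64 unchanged
    first = text.split(",", 1)[0]      # assumption: leaf certificate comes first
    first = first.replace(PEM_BEGIN, "").replace(PEM_END, "")
    body = "".join(first.split())      # drop newlines and other whitespace
    if not body:
        raise ValueError("empty forwarded client certificate")
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(body, validate=True)
```

Normalizing both shapes to DER before any identity check means the API's node lookup does not depend on the proxy's encoding. The result still has to be parsed and matched against the enrolled node identity, which this sketch does not cover.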

Why Detection Was Weak

  • node-agent logs only exposed the current transport/auth symptom, not the failing owner
  • API rejection logs lacked caller mTLS fields until added during the incident
  • Traefik logs were not centrally available in Loki
  • bare error strings such as invalid node identity did not reveal whether a TLS peer identity had actually reached the API
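
A rejection log that records whether a forwarded certificate arrived at all would have separated "no identity reached the API" from "identity reached the API but did not match". The sketch below shows one possible shape for those fields. Only mtls_present comes from this incident. The header name follows Traefik's PassTLSClientCert convention, and the other field names and the function itself are assumptions.

```python
import json

# Assumption: the proxy forwards the client certificate in this header.
FORWARDED_CERT_HEADER = "x-forwarded-tls-client-cert"


def rejection_fields(headers: dict, reason: str, remote_addr: str) -> dict:
    """Build structured log fields for a rejected node request."""
    normalized = {name.lower(): value for name, value in headers.items()}
    cert = normalized.get(FORWARDED_CERT_HEADER, "")
    return {
        "event": "node_auth_rejected",      # hypothetical event name
        "reason": reason,
        "mtls_present": bool(cert.strip()),  # False when the header never arrived
        "forwarded_cert_len": len(cert),     # length only; never log the cert itself
        "remote_addr": remote_addr,
    }


if __name__ == "__main__":
    fields = rejection_fields({}, "invalid node identity", "10.0.0.5")
    print(json.dumps(fields, sort_keys=True))
```

With these fields, a 401 with mtls_present=false points at the proxy's forwarding configuration, while a 401 with mtls_present=true points at certificate parsing or node enrollment state.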

Recovery

Recovery required all of the following:

  • restore hostname-based node API URLs with /etc/hosts override to the private control-plane IP
  • persist and reuse a node CA shared by API issuance and Traefik client-auth trust
  • patch Traefik CA secret content and shape
  • force node re-enrollment so stale certificates were replaced
  • convert node API routing to an explicit Traefik route where mTLS middleware applied
  • teach API to parse Traefik forwarded client-cert headers
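
The first recovery step depends on the node resolving the node API hostname to the private control-plane IP. A preflight check along the following lines could confirm the /etc/hosts override before node-agent starts polling. The hostname and IP in the usage comment are placeholders, and the function is a sketch rather than part of node-agent.

```python
import socket


def resolves_to(hostname: str, expected_ip: str, resolver=socket.getaddrinfo) -> bool:
    """Return True only if every address the hostname resolves to is expected_ip.

    The resolver is injectable so the check can be exercised without DNS.
    """
    infos = resolver(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = {info[4][0] for info in infos}
    return addresses == {expected_ip}


# Example call on a node (placeholder values):
#   resolves_to("node-api.example.internal", "10.0.0.10")
```

The comparison is strict on purpose: if the hostname also resolves to a public address, requests may still take the route that failed during the incident.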

Follow-ups

  • ship node-agent logs and Traefik logs to Loki with stable labels
  • keep rejection-path caller identity fields in API logs
  • add an integration test for external node-agent task polling through Traefik
  • document the node bootstrap, cert, and tasks/wait recovery sequence in the runbook
  • investigate and remove the anonymous internal caller still polling /internal/v1/nodes/.../tasks/wait