Node Agent Control Plane Recovery 2026-03¶
Purpose¶
This document captures the two-day recovery sequence for node-agent polling, terminal access, certificate alignment, and workflow cleanup issues observed in March 2026.
Use it as:
- a recovery runbook for similar incidents
- an evidence map for the related RCAs in doc/rca
Incident Areas¶
This recovery involved three overlapping problem sets:
- node API mTLS and identity handoff
- terminal stream transport behavior
- provisioning/release workflow cleanup and observability gaps
Symptom To Owner Mapping¶
404 on /internal/v1/nodes/.../tasks/wait¶
Likely owner:
- wrong node-reachable route or wrong hostname/IP path
Checks:
- inspect /etc/gpuaas/node-agent.env
- inspect /etc/hosts
- verify node API URL points at hostname route, not raw IP route
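The hostname-versus-IP check can be scripted. The following is a minimal sketch, not part of the shipped tooling: the helper names url_host_is_ipv4 and report_ip_urls are hypothetical, and because this runbook does not name the env variable that holds the node API URL, the sketch checks every URL-valued entry in the file.

```shell
# Hypothetical helper: exit 0 when the URL's host is a raw IPv4 address,
# exit 1 when it is a hostname.
url_host_is_ipv4() {
  _host="${1#*://}"      # drop the scheme
  _host="${_host%%/*}"   # drop the path
  _host="${_host%%:*}"   # drop the port
  printf '%s\n' "$_host" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
}

# Hypothetical helper: print every KEY=http(s)://... entry in an env file
# whose URL targets a raw IP. Variable names in the file are not assumed.
report_ip_urls() {
  grep -E "^[A-Za-z_][A-Za-z0-9_]*=[\"']?https?://" "$1" |
  while IFS='=' read -r _key _value; do
    _value=$(printf '%s' "$_value" | tr -d "\"'")   # strip optional quotes
    if url_host_is_ipv4 "$_value"; then
      printf 'raw IP route: %s=%s\n' "$_key" "$_value"
    fi
  done
}

# Usage on a node:
#   report_ip_urls /etc/gpuaas/node-agent.env
```

No output means no URL-valued entry points at a raw IP. A hostname such as node-api.100-90-157-34.sslip.io is treated as a hostname even though it embeds an address.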
EOF before server certificate on node API hostname¶
Likely owner:
- Traefik router/TLS config for node API hostname
Checks:
- curl -vk --cacert ... https://node-api.../internal/v1/nodes/.../tasks/wait
- openssl s_client -connect <private_ip>:443 -servername node-api...
- Traefik logs for router build/TLS errors
remote error: tls: unknown certificate authority¶
Likely owner:
- Traefik client-auth CA does not match the node certificate issuer
Checks:
- inspect gpuaas-node-mtls-ca
- inspect node cert fingerprint and validity
- re-enroll node if control-plane CA changed after the current cert was issued
- compare the node cert issuer to /etc/gpuaas/node-cert-ca-bundle.crt; a locally valid
cert can still be stale if it was issued by the previous node CA
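The validity and issuer checks above can be combined into one pass. This is a sketch under assumptions: check_node_cert is a hypothetical helper name, and it relies only on the standard openssl x509 -checkend and openssl verify -CAfile subcommands. It does not inspect what Traefik itself trusts.

```shell
# Hypothetical helper: report whether a client cert is unexpired and chains
# to the given CA bundle. Prints one of: ok, expired-or-unreadable,
# issuer-mismatch.
check_node_cert() {
  _cert="$1"
  _ca_bundle="$2"
  # -checkend 0 fails when the certificate has already expired
  # (it also fails if the file cannot be read or parsed).
  if ! openssl x509 -in "$_cert" -noout -checkend 0 >/dev/null 2>&1; then
    echo "expired-or-unreadable"
    return 1
  fi
  # verify fails when the cert does not chain to a CA in the bundle, which
  # is the stale-issuer case after a node CA rotation.
  if ! openssl verify -CAfile "$_ca_bundle" "$_cert" >/dev/null 2>&1; then
    echo "issuer-mismatch"
    return 1
  fi
  echo "ok"
}

# Usage on a node (run in a root shell so the files are readable):
#   check_node_cert /etc/gpuaas/cert.pem /etc/gpuaas/node-cert-ca-bundle.crt
```

An issuer-mismatch result here points at re-enrollment rather than at Traefik, provided the local bundle reflects the current control-plane CA.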
401 invalid node identity¶
Likely owner:
- API cannot see or parse forwarded client-cert identity
- or node is logically fenced by control-plane state
- or a direct terminal stream path is bypassing ingress mTLS handoff without a matching direct node identity model
Checks:
- inspect API rejection logs for caller mTLS fields
- inspect node occupancy/allocation state
- verify stale release_failed allocations are not still pinning the node
- verify whether the failing path is using ingress or a direct node API endpoint
Terminal shows Connecting... or [session ready] without prompt¶
Likely owner:
- terminal stream transport/gateway path, not SSH user creation
Checks:
- node-agent logs for:
  - terminal pty started
  - terminal pty produced first output
- terminal-gateway logs for:
  - websocket upgraded
  - downstream subscription ready
  - session ready sent
- if PTY output exists but UI still shows no prompt, inspect transport buffering behavior
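Checking for the log markers above is easy to get wrong by eye. The sketch below reads log text from stdin and reports which markers are present. check_markers is a hypothetical helper, and it assumes the marker strings appear verbatim in the log lines.

```shell
# Hypothetical helper: read log text on stdin and report, for each marker
# passed as an argument, whether it appears anywhere in the text.
# Returns 0 only when every marker was found.
check_markers() {
  _log=$(cat)
  _missing=0
  for _marker in "$@"; do
    if printf '%s\n' "$_log" | grep -Fq -- "$_marker"; then
      printf 'found:   %s\n' "$_marker"
    else
      printf 'missing: %s\n' "$_marker"
      _missing=1
    fi
  done
  return "$_missing"
}

# Usage on a node:
#   sudo journalctl -u gpuaas-node-agent -o cat --no-pager -n 500 |
#     check_markers "terminal pty started" "terminal pty produced first output"
# Usage on the control plane:
#   sudo /usr/local/bin/k3s kubectl -n gpuaas-core logs deploy/gpuaas-terminal-gateway --since=30m |
#     check_markers "websocket upgraded" "downstream subscription ready" "session ready sent"
```

If all node-agent and gateway markers are found but the UI still shows no prompt, the transport buffering check in the last bullet is the next step.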
Commands¶
Node side¶
cat /etc/gpuaas/node-agent.env
grep -n 'node-api\|api\|loki' /etc/hosts
sudo journalctl -u gpuaas-node-agent -o cat --no-pager -n 50
sudo ss -tpn | rg 'gpuaas-node-age|:443'
sudo openssl x509 -in /etc/gpuaas/cert.pem -noout -subject -issuer -dates -fingerprint -sha256
sudo openssl x509 -in /etc/gpuaas/node-cert-ca-bundle.crt -noout -subject -issuer -fingerprint -sha256
Control plane¶
sudo /usr/local/bin/k3s kubectl -n gpuaas-core get secret gpuaas-node-mtls-ca -o yaml
sudo /usr/local/bin/k3s kubectl -n gpuaas-core logs deploy/gpuaas-api --since=30m
sudo /usr/local/bin/k3s kubectl -n gpuaas-core logs deploy/gpuaas-terminal-gateway --since=30m
sudo /usr/local/bin/k3s kubectl -n kube-system logs deploy/traefik --since=30m
Direct HTTP probes¶
curl -vk --cacert /etc/gpuaas/ca-bundle.crt \
https://node-api.100-90-157-34.sslip.io/internal/v1/nodes/<node_id>/tasks/wait \
-H 'Authorization: Bearer node-dev-token' # gitleaks:allow
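Probe output can be mapped back to the owners in Symptom To Owner Mapping. The sketch below is a rough heuristic only: classify_probe and its result labels are hypothetical, and the match patterns are assumptions about how curl and the node-agent render these failures, so an unclassified result still needs manual reading.

```shell
# Hypothetical helper: read probe or agent log output on stdin and print a
# coarse owner label based on the symptoms listed in this runbook.
classify_probe() {
  _out=$(cat)
  if printf '%s\n' "$_out" | grep -Eiq 'unknown certificate authority|alert unknown ca'; then
    echo "traefik-client-auth-ca"      # Traefik client-auth CA vs node cert issuer
  elif printf '%s\n' "$_out" | grep -Eiq 'invalid node identity|HTTP/[0-9.]+ 401'; then
    echo "api-identity-or-fencing"     # forwarded identity, fencing, or direct path
  elif printf '%s\n' "$_out" | grep -Eiq 'HTTP/[0-9.]+ 404'; then
    echo "route-or-hostname"           # wrong node-reachable route or hostname/IP path
  elif printf '%s\n' "$_out" | grep -Eiq 'EOF'; then
    echo "traefik-router-tls"          # connection dropped before server certificate
  else
    echo "unclassified"
  fi
}

# Usage: append "2>&1 | classify_probe" to a probe command in this section.
```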
sudo curl -vk \
--cacert /etc/gpuaas/ca-bundle.crt \
--cert /etc/gpuaas/cert.pem \
--key /etc/gpuaas/key.pem \
"https://node-api.100-90-157-34.sslip.io/internal/v1/nodes/<node_id>/tasks/wait" \
-H 'Authorization: Bearer node-dev-token' # gitleaks:allow
Recovery Lessons¶
Bootstrap may preserve stale client certs¶
Manual bootstrap/rebootstrap refreshes node-agent binaries and env files, but older
installed agents may keep /etc/gpuaas/cert.pem and /etc/gpuaas/key.pem if the cert is
locally valid. If the control-plane node CA rotated, the cert can pass local identity checks
and still be rejected by node-api mTLS.
Recovery:
- verify STEP_CA_DEV_CA_CERT_PEM in gpuaas-core-secrets matches gpuaas-node-mtls-ca
- verify the API deployment has restarted after any secret patch
- verify the node cert issuer matches /etc/gpuaas/node-cert-ca-bundle.crt
- if it does not match and the node has a fresh GPUAAS_ENROLLMENT_TOKEN, move aside
/etc/gpuaas/cert.pem, /etc/gpuaas/key.pem, and /etc/gpuaas/node-cert-ca-bundle.crt,
then restart gpuaas-node-agent
- validate with scripts/ops/terminal_remote_smoke.sh
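The move-aside step can be done so that the old material stays recoverable. This is a minimal sketch: move_aside_identity is a hypothetical helper and the .stale-<timestamp> suffix is an illustrative convention, not something the agent looks for.

```shell
# Hypothetical helper: rename the node identity files in a directory with a
# timestamp suffix so they can be restored if re-enrollment fails.
move_aside_identity() {
  _dir="$1"
  _stamp=$(date +%Y%m%d%H%M%S)
  for _name in cert.pem key.pem node-cert-ca-bundle.crt; do
    if [ -e "$_dir/$_name" ]; then
      mv -- "$_dir/$_name" "$_dir/$_name.stale-$_stamp"
      printf 'moved %s\n' "$_dir/$_name"
    fi
  done
}

# Usage in a root shell on the node, only when a fresh
# GPUAAS_ENROLLMENT_TOKEN is present in the env file:
#   move_aside_identity /etc/gpuaas && systemctl restart gpuaas-node-agent
```

Renaming instead of deleting keeps the previous cert available for fingerprint comparison in the RCA.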
Endpoint drift or missing local env material¶
If the node's local env file is missing, points at the wrong node API endpoint, or the installed recovery token cannot be trusted, use the operator re-enrollment workflow instead of editing values by hand:
- From the control plane, issue POST /api/v1/admin/nodes/{node_id}/reissue-enrollment.
- Copy the returned recovery_bundle.restart_command to the node, or use the UI/CLI wrapper for the same admin API.
- The command updates GPUAAS_ENROLLMENT_TOKEN, removes stale cert/key material, and restarts gpuaas-node-agent.
- If the env file is too damaged for in-place recovery, issue POST /api/v1/admin/nodes/{node_id}/bootstrap-script and reinstall from the generated bootstrap script/package.
- Validate the node reaches task wait and terminal smoke before marking recovery done.
This path is audited as nodes.reissue_enrollment and is allowed for registered,
bootstrap-issued, enrolling, active, offline, quarantined, and retired nodes. It is
blocked while lifecycle removal is actively draining/removing the node.
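The reissue call can be sketched from the command line as follows. Several details are assumptions that this runbook does not confirm: the API base URL, bearer-token admin auth through a hypothetical ADMIN_TOKEN variable, and a JSON response in which recovery_bundle.restart_command is a nested field. The sketch also assumes jq is installed.

```shell
# Hypothetical helper: build the admin endpoint URL from a base URL and a
# node ID. A single trailing slash on the base URL is tolerated.
reissue_url() {
  printf '%s/api/v1/admin/nodes/%s/reissue-enrollment\n' "${1%/}" "$2"
}

# Hypothetical helper: POST to the endpoint and print the restart command.
# ASSUMPTIONS: bearer auth via $ADMIN_TOKEN; response body shaped like
# {"recovery_bundle": {"restart_command": "..."}}.
reissue_enrollment() {
  curl -fsS -X POST \
    -H "Authorization: Bearer ${ADMIN_TOKEN:?set ADMIN_TOKEN first}" \
    "$(reissue_url "$1" "$2")" |
  jq -r '.recovery_bundle.restart_command'
}

# Usage (base URL and node ID are placeholders):
#   reissue_enrollment "https://<api_base_url>" "<node_id>"
```

Treat the printed command as sensitive, since it carries the new enrollment token.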
2026-05-03 hardening:
- node-agent task polling checks local certificate expiry before opening the mTLS task
wait request. If the certificate is expired and the current env has a valid
GPUAAS_ENROLLMENT_TOKEN, the agent attempts recovery enrollment immediately instead
of hammering task wait with an expired certificate.
- first enrollment remains one-time, but successful enrollment now promotes the token to
a node-bound recovery token so future expired-cert recovery can work after the initial
Redis node_enroll_token:* is consumed.
- node-agent self-update now rotates that recovery token, refreshes
GPUAAS_ENROLLMENT_TOKEN in the installed env, carries the configured container
runtime package into the installer, and writes a durable finalizer log path when
possible.
- task result POST is idempotent after terminal persistence. If the API commits the
result and the response is lost, a retry returns OK instead of turning a successful
node task into a false agent-side failure.
- node-agent retries task result POST with bounded backoff before returning to the poll
loop, reducing stale dispatched rows caused by transient network loss after local
task completion.
- provisioning-worker runs a conservative stale-dispatch reclaimer for allowlisted
idempotent node tasks. It requeues safe stale dispatched leases and republishes the
Redis node-task wakeup; unsafe long-running/destructive task types remain explicit
operator resume workflows.
- node-agent supports active/staged/grace Ed25519 verifier sets through
GPUAAS_TASK_SIGNING_PUBKEYS. API bootstrap publication uses
NODE_BOOTSTRAP_TASK_SIGNING_PUBKEYS.
- poll and maintenance retry backoffs saturate instead of overflowing after very large
failure counters.
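The saturating backoff in the last item can be illustrated with a small sketch. It shows the behavior only and is not the agent's implementation, which this document does not include. The function name and the base and cap parameters are hypothetical.

```shell
# Illustrative only: exponential backoff that saturates at a cap instead of
# overflowing. Arguments: failure count, base delay (seconds, must be > 0),
# cap (seconds).
backoff_seconds() {
  _failures="$1"
  _delay="$2"
  _cap="$3"
  _i=0
  # Double once per failure, but stop as soon as the cap is reached, so a
  # very large failure counter never overflows or loops for long.
  while [ "$_i" -lt "$_failures" ] && [ "$_delay" -lt "$_cap" ]; do
    _delay=$((_delay * 2))
    _i=$((_i + 1))
  done
  if [ "$_delay" -gt "$_cap" ]; then
    _delay="$_cap"
  fi
  printf '%s\n' "$_delay"
}

# Usage:
#   backoff_seconds 3 2 300        # three failures, 2s base, 300s cap
#   backoff_seconds 1000000 2 300  # a huge counter saturates at the cap
```

The loop exits on whichever bound is hit first, so the number of iterations is limited by the cap rather than by the failure counter.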
Remaining gaps:
- implement the Loki collector design in
doc/architecture/Node_Agent_Log_Collection_Loki_v1.md so recovery evidence does not
depend on SSH access to /var/log/gpuaas-node-agent*.log.
- finish task-signing custody beyond env/configmap delivery: rotation audit and
Vault/KMS custody.
Control-plane deploy is not enough¶
Already-running nodes may still hold:
- stale runtime config
- stale certificates
- stale state backoff windows
After control-plane cert or routing fixes, node re-enrollment or reinstall may still be required.
Node mTLS has three separate checks¶
All three must be true:
- node cert issuance works
- Traefik trusts the issuing CA for client auth
- API receives usable forwarded client-cert identity
Terminal stream transport is sensitive¶
The current terminal relay shape is brittle over HTTP/2 and, more broadly, brittle when transport and identity are coupled to ingress behavior. The incident findings showed:
- ingress path is reachable but buffers live duplex terminal traffic
- direct path fixes buffering but needs its own explicit node identity model
Follow-up Work¶
- 2026-03 Node API mTLS Identity Handoff
- 2026-03 Terminal Stream HTTP2 Buffering
- 2026-03 Provisioning Workflow Recovery Gaps
Backlog themes:
- durable node-agent logs via collector-backed Loki ingestion
- terminal transport and identity redesign, likely typed bidirectional streaming
- admin/operator cleanup APIs for stuck workflow states
- local environment parity for cert and terminal reproduction