Node Agent Control Plane Recovery 2026-03

Purpose

This document captures the two-day recovery sequence for node-agent polling, terminal access, certificate alignment, and workflow cleanup issues observed in March 2026.

Use it as:
- a recovery runbook for similar incidents
- an evidence map for the related RCAs in doc/rca

Incident Areas

This recovery involved three overlapping problem sets:
- node API mTLS and identity handoff
- terminal stream transport behavior
- provisioning/release workflow cleanup and observability gaps

Symptom To Owner Mapping

404 on /internal/v1/nodes/.../tasks/wait

Likely owner:
- wrong node-reachable route or wrong hostname/IP path

Checks:
- inspect /etc/gpuaas/node-agent.env
- inspect /etc/hosts
- verify the node API URL points at the hostname route, not the raw IP route
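The hostname-versus-raw-IP check above can be scripted. This is a minimal sketch, not the project's tooling: it only classifies the host part of a URL as a raw IPv4 address or a hostname. The env key name GPUAAS_NODE_API_URL in the usage comment is hypothetical; substitute whatever key the installed env file actually uses.

```shell
# Sketch: classify the host portion of a URL as a raw IPv4 address or a hostname.
# Assumption: IPv4 only; bracketed IPv6 literals are not handled.

url_host() {
  # Strip the scheme, then the path, then any :port suffix.
  rest="${1#*://}"
  rest="${rest%%/*}"
  printf '%s\n' "${rest%%:*}"
}

host_is_raw_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
}

check_node_api_url() {
  host="$(url_host "$1")"
  if host_is_raw_ipv4 "$host"; then
    echo "raw-ip:$host"
  else
    echo "hostname:$host"
  fi
}

# Usage on a node (GPUAAS_NODE_API_URL is a hypothetical key name):
#   check_node_api_url "$(grep -E '^GPUAAS_NODE_API_URL=' /etc/gpuaas/node-agent.env | cut -d= -f2-)"
```

A "raw-ip" result suggests the agent is bypassing the hostname route; a "hostname" result still needs the /etc/hosts entry to resolve to the intended address.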

EOF before server certificate on node API hostname

Likely owner:
- Traefik router/TLS config for node API hostname

Checks:
- curl -vk --cacert ... https://node-api.../internal/v1/nodes/.../tasks/wait
- openssl s_client -connect <private_ip>:443 -servername node-api...
- Traefik logs for router build/TLS errors

remote error: tls: unknown certificate authority

Likely owner:
- Traefik client-auth CA does not match the node certificate issuer

Checks:
- inspect gpuaas-node-mtls-ca
- inspect node cert fingerprint and validity
- re-enroll node if control-plane CA changed after the current cert was issued
- compare the node cert issuer to /etc/gpuaas/node-cert-ca-bundle.crt; a locally valid cert can still be stale if it was issued by the previous node CA
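The issuer comparison above can be done with openssl instead of by eye. The sketch below is illustrative, not an existing project script: it uses `openssl x509 -checkend` to detect expiry and `openssl verify -CAfile` to test whether the cert chains to a given bundle. The function names are invented for this example. Note that `openssl verify` may also consult the system trust store, which normally does not matter for a private node CA.

```shell
# Sketch: distinguish an expired node cert from one issued by a CA that is
# not in the given bundle. Function names are illustrative only.

# Usage: cert_is_unexpired <cert.pem>
cert_is_unexpired() {
  # -checkend 0 exits non-zero if the certificate has already expired.
  openssl x509 -in "$1" -noout -checkend 0 >/dev/null 2>&1
}

# Usage: cert_chains_to_bundle <cert.pem> <ca-bundle.crt>
cert_chains_to_bundle() {
  openssl verify -CAfile "$2" "$1" >/dev/null 2>&1
}

# Usage: diagnose_node_cert <cert.pem> <ca-bundle.crt>
diagnose_node_cert() {
  cert="$1"
  bundle="$2"
  if ! cert_is_unexpired "$cert"; then
    echo "expired"
  elif ! cert_chains_to_bundle "$cert" "$bundle"; then
    echo "stale-issuer"
  else
    echo "ok"
  fi
}

# On a node (as root):
#   diagnose_node_cert /etc/gpuaas/cert.pem /etc/gpuaas/node-cert-ca-bundle.crt
```

A "stale-issuer" result against the current node CA bundle is the case where re-enrollment is needed even though the cert still looks valid locally.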

401 invalid node identity

Likely owner:
- API cannot see or parse forwarded client-cert identity
- or node is logically fenced by control-plane state
- or a direct terminal stream path is bypassing ingress mTLS handoff without a matching direct node identity model

Checks:
- inspect API rejection logs for caller mTLS fields
- inspect node occupancy/allocation state
- verify stale release_failed allocations are not still pinning the node
- verify whether the failing path is using ingress or a direct node API endpoint

Terminal shows Connecting... or [session ready] without prompt

Likely owner:
- terminal stream transport/gateway path, not SSH user creation

Checks:
- node-agent logs for:
  - terminal pty started
  - terminal pty produced first output
- terminal-gateway logs for:
  - websocket upgraded
  - downstream subscription ready
  - session ready sent
- if PTY output exists but UI still shows no prompt, inspect transport buffering behavior
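To see quickly which of the log markers above are present, a small filter can report each one. This is a sketch under one assumption: that the log lines contain these phrases verbatim. The exact log format is not shown in this document, so adjust the strings if the real messages differ. It makes no claim about the order in which the events occur.

```shell
# Sketch: read log text on stdin and report which expected terminal-session
# markers appear. Assumes the phrases occur verbatim in the log lines.
terminal_marker_report() {
  logs="$(cat)"
  for marker in \
    "terminal pty started" \
    "terminal pty produced first output" \
    "websocket upgraded" \
    "downstream subscription ready" \
    "session ready sent"
  do
    if printf '%s\n' "$logs" | grep -Fq -- "$marker"; then
      echo "present: $marker"
    else
      echo "MISSING: $marker"
    fi
  done
}

# Node side:
#   sudo journalctl -u gpuaas-node-agent -o cat --no-pager --since "30 min ago" | terminal_marker_report
# Control plane:
#   sudo /usr/local/bin/k3s kubectl -n gpuaas-core logs deploy/gpuaas-terminal-gateway --since=30m | terminal_marker_report
```

Each host only produces its own component's markers, so expect the other component's lines to show as MISSING when the filter is run against a single log source.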

Commands

Node side

cat /etc/gpuaas/node-agent.env
grep -nE 'node-api|api|loki' /etc/hosts
sudo journalctl -u gpuaas-node-agent -o cat --no-pager -n 50
sudo ss -tpn | rg 'gpuaas-node-age|:443'
sudo openssl x509 -in /etc/gpuaas/cert.pem -noout -subject -issuer -dates -fingerprint -sha256
# openssl x509 reads only the first certificate in a bundle file
sudo openssl x509 -in /etc/gpuaas/node-cert-ca-bundle.crt -noout -subject -issuer -fingerprint -sha256

Control plane

sudo /usr/local/bin/k3s kubectl -n gpuaas-core get secret gpuaas-node-mtls-ca -o yaml
sudo /usr/local/bin/k3s kubectl -n gpuaas-core logs deploy/gpuaas-api --since=30m
sudo /usr/local/bin/k3s kubectl -n gpuaas-core logs deploy/gpuaas-terminal-gateway --since=30m
sudo /usr/local/bin/k3s kubectl -n kube-system logs deploy/traefik --since=30m

Direct HTTP probes

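# -k skips server certificate verification, so --cacert has no effect; drop -k to test the CA bundle
curl -vk --cacert /etc/gpuaas/ca-bundle.crt \
  "https://node-api.100-90-157-34.sslip.io/internal/v1/nodes/<node_id>/tasks/wait" \
  -H 'Authorization: Bearer node-dev-token' # gitleaks:allow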

sudo curl -vk \
  --cacert /etc/gpuaas/ca-bundle.crt \
  --cert /etc/gpuaas/cert.pem \
  --key /etc/gpuaas/key.pem \
  "https://node-api.100-90-157-34.sslip.io/internal/v1/nodes/<node_id>/tasks/wait" \
  -H 'Authorization: Bearer node-dev-token' # gitleaks:allow

Recovery Lessons

Bootstrap may preserve stale client certs

Manual bootstrap/rebootstrap refreshes node-agent binaries and env files, but older installed agents may keep /etc/gpuaas/cert.pem and /etc/gpuaas/key.pem if the cert is locally valid. If the control-plane node CA rotated, the cert can pass local identity checks and still be rejected by node-api mTLS.

Recovery:
- verify STEP_CA_DEV_CA_CERT_PEM in gpuaas-core-secrets matches gpuaas-node-mtls-ca
- verify the API deployment has restarted after any secret patch
- verify the node cert issuer matches /etc/gpuaas/node-cert-ca-bundle.crt
- if it does not match and the node has a fresh GPUAAS_ENROLLMENT_TOKEN, move aside /etc/gpuaas/cert.pem, /etc/gpuaas/key.pem, and /etc/gpuaas/node-cert-ca-bundle.crt, then restart gpuaas-node-agent
- validate with scripts/ops/terminal_remote_smoke.sh
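The "move aside" step above can be done so that it stays reversible. A minimal sketch, assuming a timestamped rename is acceptable; the `.stale-<timestamp>` suffix and the function name are choices made for this example, not an existing convention in the repo.

```shell
# Sketch: rename stale node cert material instead of deleting it, so the
# step can be undone. The suffix format is an illustrative choice.
# Usage: move_aside_node_certs [dir]   (dir defaults to /etc/gpuaas)
move_aside_node_certs() {
  dir="${1:-/etc/gpuaas}"
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  for f in cert.pem key.pem node-cert-ca-bundle.crt; do
    if [ -e "$dir/$f" ]; then
      mv -- "$dir/$f" "$dir/$f.stale-$stamp"
      echo "moved $f"
    fi
  done
}

# On a node (as root), only after confirming a fresh GPUAAS_ENROLLMENT_TOKEN
# is present in /etc/gpuaas/node-agent.env:
#   move_aside_node_certs && systemctl restart gpuaas-node-agent
```

Keeping the renamed files lets an operator compare the old and new issuers afterwards, which is useful evidence for the RCA.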

Endpoint drift or missing local env material

If the node's local env file is missing, points at the wrong node API endpoint, or the installed recovery token cannot be trusted, use the operator re-enrollment workflow instead of editing values by hand:

  1. From the control plane, issue POST /api/v1/admin/nodes/{node_id}/reissue-enrollment.
  2. Copy the returned recovery_bundle.restart_command to the node, or use the UI/CLI wrapper for the same admin API.
  3. The command updates GPUAAS_ENROLLMENT_TOKEN, removes stale cert/key material, and restarts gpuaas-node-agent.
  4. If the env file is too damaged for in-place recovery, issue POST /api/v1/admin/nodes/{node_id}/bootstrap-script and reinstall from the generated bootstrap script/package.
  5. Validate the node reaches task wait and terminal smoke before marking recovery done.

This path is audited as nodes.reissue_enrollment and is allowed for registered, bootstrap-issued, enrolling, active, offline, quarantined, and retired nodes. It is blocked while lifecycle removal is actively draining/removing the node.
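Step 1 of the workflow above could be driven from a shell roughly as follows. Treat this as a sketch under stated assumptions: CONTROL_PLANE_URL, NODE_ID, and ADMIN_TOKEN are placeholder names, bearer-token auth on the admin API is assumed rather than confirmed here, and the response is assumed to be JSON carrying the recovery_bundle.restart_command field mentioned in step 2.

```shell
# Sketch: request a reissued enrollment for one node.
# Assumptions: bearer-token admin auth; ADMIN_TOKEN is a placeholder env var.

# Usage: reissue_enrollment_url <control_plane_base_url> <node_id>
reissue_enrollment_url() {
  # ${1%/} drops a single trailing slash so the path is not doubled.
  printf '%s/api/v1/admin/nodes/%s/reissue-enrollment\n' "${1%/}" "$2"
}

# Usage: reissue_enrollment <control_plane_base_url> <node_id>
reissue_enrollment() {
  curl -fsS -X POST \
    -H "Authorization: Bearer ${ADMIN_TOKEN:?set ADMIN_TOKEN}" \
    "$(reissue_enrollment_url "$1" "$2")"
}

# Example (requires jq; response shape is an assumption):
#   reissue_enrollment "$CONTROL_PLANE_URL" "$NODE_ID" | jq -r '.recovery_bundle.restart_command'
```

Review the returned command before running it on the node, since it replaces the enrollment token and removes cert/key material.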

2026-05-03 hardening:
- node-agent task polling checks local certificate expiry before opening the mTLS task wait request. If the certificate is expired and the current env has a valid GPUAAS_ENROLLMENT_TOKEN, the agent attempts recovery enrollment immediately instead of hammering task wait with an expired certificate.
- first enrollment remains one-time, but successful enrollment now promotes the token to a node-bound recovery token so future expired-cert recovery can work after the initial Redis node_enroll_token:* is consumed.
- node-agent self-update now rotates that recovery token, refreshes GPUAAS_ENROLLMENT_TOKEN in the installed env, carries the configured container runtime package into the installer, and writes a durable finalizer log path when possible.
- task result POST is idempotent after terminal persistence. If the API commits the result and the response is lost, a retry returns OK instead of turning a successful node task into a false agent-side failure.
- node-agent retries task result POST with bounded backoff before returning to the poll loop, reducing stale dispatched rows caused by transient network loss after local task completion.
- provisioning-worker runs a conservative stale-dispatch reclaimer for allowlisted idempotent node tasks. It requeues safe stale dispatched leases and republishes the Redis node-task wakeup; unsafe long-running/destructive task types remain explicit operator resume workflows.
- node-agent supports active/staged/grace Ed25519 verifier sets through GPUAAS_TASK_SIGNING_PUBKEYS. API bootstrap publication uses NODE_BOOTSTRAP_TASK_SIGNING_PUBKEYS.
- poll and maintenance retry backoffs saturate instead of overflowing after very large failure counters.
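The saturating-backoff item above is easier to reason about with a concrete shape. The following is an illustration only: node-agent's real implementation is not shown in this document, and the base and cap values are made up. It doubles the delay per failure and stops at a cap, so a very large failure counter cannot overflow the integer or produce a nonsensical delay.

```shell
# Sketch: exponential backoff that saturates at a cap.
# Illustrative defaults (base 1s, cap 300s); not node-agent's actual values.
# Usage: backoff_seconds <failure_count> [base_seconds] [cap_seconds]
backoff_seconds() {
  failures="$1"
  base="${2:-1}"
  cap="${3:-300}"
  delay="$base"
  i=0
  # Stop doubling once the cap is reached, however large the counter is.
  while [ "$i" -lt "$failures" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$cap" ]; then
    delay="$cap"
  fi
  echo "$delay"
}
```

Because the loop exits as soon as the delay reaches the cap, the cost of computing the delay is bounded by the cap, not by the failure count.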

Remaining gaps:
- implement the Loki collector design in doc/architecture/Node_Agent_Log_Collection_Loki_v1.md so recovery evidence does not depend on SSH access to /var/log/gpuaas-node-agent*.log.
- finish task-signing custody beyond env/configmap delivery: rotation audit and Vault/KMS custody.

Control-plane deploy is not enough

Already-running nodes may still hold:
- stale runtime config
- stale certificates
- stale state backoff windows

After control-plane cert or routing fixes, node re-enrollment or reinstall may still be required.

Node mTLS has three separate checks

All three must be true:
- node cert issuance works
- Traefik trusts the issuing CA for client auth
- API receives usable forwarded client-cert identity

Terminal stream transport is sensitive

Current terminal relay shape is brittle over HTTP/2 and, more broadly, brittle when transport and identity are coupled to ingress behavior. The incident findings showed:
- ingress path is reachable but buffers live duplex terminal traffic
- direct path fixes buffering but needs its own explicit node identity model

Follow-up Work

Backlog themes:
- durable node-agent logs via collector-backed Loki ingestion
- terminal transport and identity redesign, likely typed bidirectional streaming
- admin/operator cleanup APIs for stuck workflow states
- local environment parity for cert and terminal reproduction