Terminal Node Transport Redesign v1¶
Status¶
Decided — superseded by Terminal WebSocket Bridge Architecture v1
This document remains the option-analysis and decision trail.
Decision outcome:
- this document initially recommended Option C (gRPC bidirectional streaming)
- after further first-principles review, the chosen v1 architecture is Option B: a dedicated WebSocket bridge with a node-facing mTLS listener
- the active build specification is Terminal WebSocket Bridge Architecture v1
Problem¶
The current node-agent terminal path mixes two concerns that should be designed together:
- live bidirectional terminal transport
- node identity and authorization for that transport
Production findings from 2026-03:
- the ingress path (https://node-api...) is reachable from nodes
- but the NDJSON duplex-over-HTTP relay buffers live terminal traffic badly enough that prompt/input behavior is effectively delayed until disconnect or session unwind
- a direct node-reachable path removes that streaming stall
- but the current direct path fails with "401 invalid node identity" because the present identity model depends on ingress/mTLS handoff behavior
So the current system does not have a stable terminal node/control-plane transport.
Goals¶
- terminal must be truly bidirectional during the live session
- transport behavior must not depend on proxy buffering quirks
- node identity must be explicit on the chosen transport
- browser contract stays stable:
  - POST /api/v1/allocations/{id}/terminal-token
  - WS /ws/terminal/{allocation_id}
- terminal transport must remain separable from node task polling/provisioning
- reconnection, close, resize, and readiness should be typed protocol events
Non-Goals¶
- changing the browser-facing websocket contract immediately
- coupling terminal redesign to task polling/provisioning transport changes
- continuing incremental incident patches to the current NDJSON duplex relay
Options¶
Option A: Keep NDJSON-over-HTTP and add more bypass rules¶
Shape:
- browser -> terminal-gateway websocket
- terminal-gateway/api -> node-agent via HTTP request/response streams
Pros:
- smallest code churn
- keeps existing OpenAPI-shaped internals
Cons:
- already disproven in production as a robust solution
- still proxy-sensitive
- still requires transport-specific auth exceptions
- harder to test and reason about full duplex behavior
Decision:
- reject as long-term direction
Option B: Direct websocket from gateway to node-facing endpoint¶
Shape:
- browser -> terminal-gateway websocket
- terminal-gateway -> node-agent websocket or websocket-like direct stream
Pros:
- true duplex transport
- simpler browser/gateway mental model
Cons:
- introduces a second public-ish node-facing runtime surface
- harder to keep API as the control-plane authority
- more awkward to integrate with existing node mTLS/task identity model
Decision:
- not preferred at the time of analysis (later chosen for v1, see Status)
Option C: gRPC bidirectional stream between node-agent and control plane¶
Shape:
- browser -> terminal-gateway websocket
- terminal-gateway/API broker -> node-agent via gRPC bidi stream on a dedicated node-facing control-plane endpoint
Pros:
- transport matches the problem: true duplex typed streaming
- explicit stream lifecycle and backpressure
- typed protocol for ready, data, resize, close, error
- avoids dependence on ingress request/response buffering semantics
- identity can be designed explicitly for this stream instead of inherited from the ingress header-forwarding path
Cons:
- larger change than another HTTP patch
- needs a new node-facing service boundary and tests
Decision:
- recommended at the time of analysis (later superseded, see Status)
Recommendation At Time Of Analysis¶
At the time this document was written, Option C was the recommended direction:
- keep browser edge as websocket through cmd/terminal-gateway
- move node/control-plane terminal transport to a dedicated gRPC bidirectional stream
- expose that stream on a node-reachable control-plane endpoint separate from the current ingress-buffered HTTP terminal relay
That recommendation was later superseded by the WebSocket bridge design after weighing:
- operational simplicity
- fit for byte-stream relay semantics
- reuse of existing websocket runtime and operational model
The active design is now:
- Terminal WebSocket Bridge Architecture v1
Recommended Runtime Topology¶
browser
-> terminal-gateway websocket
-> terminal session broker in control plane
-> gRPC bidi stream
-> node-agent PTY
Suggested ownership split:
- terminal-gateway
  - browser websocket termination
  - browser token/session validation
  - browser resize/input/output relay
- cmd/api or a dedicated terminal broker service
  - session authority
  - allocation/user/node binding validation
  - audit/session lifecycle events
  - issues short-lived node stream credentials
- cmd/node-agent
  - opens one gRPC bidi stream per terminal session
  - runs PTY as allocation user
  - relays typed frames
Identity Model¶
This part must be explicit. The direct path should not depend on ingress header forwarding.
Recommended model:
- Node remains enrolled and authenticated by its existing node certificate for lifecycle/task APIs.
- Terminal stream transport gets a dedicated auth model:
  - mTLS at transport level on the node-facing gRPC endpoint using the node cert, or
  - short-lived API-issued node stream token bound to:
    - node_id
    - session_id
    - allocation_id
    - exp
- Server verifies both:
  - the node identity
  - the terminal session binding
- Terminal stream authorization does not depend on Traefik forwarding X-Forwarded-Tls-Client-Cert* headers.
Preferred variant:
- mTLS on the dedicated gRPC endpoint plus signed stream/session claims
Reason:
- transport-level peer identity remains strong
- session-level authorization remains explicit and auditable
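The two-layer check in the preferred variant can be sketched as follows. This is a minimal illustration, not the actual broker code: the type names, field names, and error values are hypothetical, the peer node ID is assumed to have been extracted from an already-verified mTLS client certificate, and the claims are assumed to come from a token whose signature has already been checked.

```go
package main

import (
	"errors"
	"fmt"
)

// StreamClaims mirrors the token bindings listed above (node_id, session_id,
// allocation_id, exp). Field names are illustrative assumptions.
type StreamClaims struct {
	NodeID       string
	SessionID    string
	AllocationID string
	Exp          int64 // expiry as Unix seconds
}

// Session is a hypothetical broker-side record of an issued terminal session.
type Session struct {
	ID           string
	AllocationID string
	NodeID       string
}

var (
	ErrExpired         = errors.New("stream claims expired")
	ErrNodeMismatch    = errors.New("peer node identity does not match stream claims")
	ErrSessionMismatch = errors.New("stream claims do not match session binding")
)

// AuthorizeStream verifies both the node identity and the session binding.
// peerNodeID is assumed to come from the verified mTLS peer certificate,
// never from a forwarded header.
func AuthorizeStream(peerNodeID string, c StreamClaims, s Session, nowUnix int64) error {
	if nowUnix >= c.Exp {
		return ErrExpired
	}
	if peerNodeID == "" || peerNodeID != c.NodeID || peerNodeID != s.NodeID {
		return ErrNodeMismatch
	}
	if c.SessionID != s.ID || c.AllocationID != s.AllocationID {
		return ErrSessionMismatch
	}
	return nil
}

func main() {
	s := Session{ID: "sess-1", AllocationID: "alloc-1", NodeID: "node-a"}
	c := StreamClaims{NodeID: "node-a", SessionID: "sess-1", AllocationID: "alloc-1", Exp: 2000}

	fmt.Println(AuthorizeStream("node-a", c, s, 1000)) // <nil>
	fmt.Println(AuthorizeStream("node-b", c, s, 1000)) // peer node identity does not match stream claims
}
```

Keeping the transport identity and the session claims as separate inputs makes each rejection reason distinct, which fits the goal of session-level authorization being explicit and auditable.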
Protocol Shape¶
Use a typed stream instead of free-form NDJSON frames.
Suggested messages:
- TerminalOpen
- TerminalReady
- TerminalData
- TerminalResize
- TerminalClose
- TerminalError
- TerminalHeartbeat
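As a rough illustration of what "typed" buys over free-form NDJSON, the message set could be modeled as a tagged envelope with per-kind validation. In a gRPC design this role would normally be played by a protobuf oneof; the Go struct below is only a stand-in, and its field names and validation rules are assumptions rather than the real contract.

```go
package main

import (
	"errors"
	"fmt"
)

// FrameKind enumerates the suggested message types.
type FrameKind int

const (
	TerminalOpen FrameKind = iota + 1
	TerminalReady
	TerminalData
	TerminalResize
	TerminalClose
	TerminalError
	TerminalHeartbeat
)

// Frame is a hypothetical envelope for one message on the stream.
type Frame struct {
	Kind          FrameKind
	SessionID     string
	CorrelationID string
	Data          []byte // TerminalData only
	Cols, Rows    uint16 // TerminalResize only
	Reason        string // TerminalClose and TerminalError only
}

// Validate rejects frames whose payload does not fit their kind.
func (f Frame) Validate() error {
	if f.SessionID == "" {
		return errors.New("frame is missing a session id")
	}
	switch f.Kind {
	case TerminalOpen, TerminalReady, TerminalHeartbeat:
		return nil
	case TerminalData:
		if len(f.Data) == 0 {
			return errors.New("data frame has an empty payload")
		}
	case TerminalResize:
		if f.Cols == 0 || f.Rows == 0 {
			return errors.New("resize frame needs non-zero cols and rows")
		}
	case TerminalClose, TerminalError:
		if f.Reason == "" {
			return errors.New("close/error frame needs an explicit reason")
		}
	default:
		return fmt.Errorf("unknown frame kind %d", f.Kind)
	}
	return nil
}

func main() {
	resize := Frame{Kind: TerminalResize, SessionID: "sess-1", CorrelationID: "req-42", Cols: 120, Rows: 40}
	fmt.Println(resize.Validate()) // <nil>

	badClose := Frame{Kind: TerminalClose, SessionID: "sess-1"}
	fmt.Println(badClose.Validate()) // close/error frame needs an explicit reason
}
```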
Required invariants:
- exactly one active terminal session per allocation unless future multiplexing is explicitly added
- server-enforced TTL
- explicit close reasons
- correlation IDs carried end to end
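The first two invariants (one active session per allocation, server-enforced TTL) could be enforced at session-open time along these lines. This is a sketch under assumptions: the Registry type and its methods are hypothetical, and the in-memory map is only a stand-in for whatever broker-owned state the real design uses.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrSessionActive = errors.New("allocation already has an active terminal session")

type entry struct {
	sessionID string
	expiresAt int64 // Unix seconds; the server-enforced TTL deadline
}

// Registry tracks at most one live session per allocation.
type Registry struct {
	mu      sync.Mutex
	byAlloc map[string]entry
}

func NewRegistry() *Registry {
	return &Registry{byAlloc: make(map[string]entry)}
}

// Open registers a session unless a non-expired one already exists for the
// allocation. An expired entry is silently replaced.
func (r *Registry) Open(allocationID, sessionID string, nowUnix, ttlSeconds int64) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if e, ok := r.byAlloc[allocationID]; ok && nowUnix < e.expiresAt {
		return ErrSessionActive
	}
	r.byAlloc[allocationID] = entry{sessionID: sessionID, expiresAt: nowUnix + ttlSeconds}
	return nil
}

// Close removes the session only if it is still the active one for the
// allocation, so a stale session id cannot tear down its replacement.
func (r *Registry) Close(allocationID, sessionID string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if e, ok := r.byAlloc[allocationID]; ok && e.sessionID == sessionID {
		delete(r.byAlloc, allocationID)
		return true
	}
	return false
}

func main() {
	r := NewRegistry()
	fmt.Println(r.Open("alloc-1", "sess-1", 100, 60)) // <nil>
	fmt.Println(r.Open("alloc-1", "sess-2", 120, 60)) // allocation already has an active terminal session
	fmt.Println(r.Open("alloc-1", "sess-3", 160, 60)) // <nil>
}
```

Passing the clock in as an argument keeps the TTL decision on the server side and makes the expiry boundary easy to test.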
Restart / Reconnect Model¶
The design should support restartability explicitly.
Requirements:
- if terminal-gateway restarts:
  - browser reconnects with a new websocket
  - broker resumes or cleanly reopens the node stream for the existing live session
- if node-agent restarts:
  - session is closed explicitly with a typed reason
  - UI gets a deterministic reconnectable/closed state
- if control-plane broker restarts:
  - session registry rebuild is deterministic or active sessions are explicitly expired
Recommendation:
- keep session registry in broker-owned durable/replicated state
- support reopen-by-session semantics only if the PTY is still valid
- otherwise close cleanly and require a fresh terminal.open
Do not infer reconnect state from transport side effects.
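One way to keep reconnect handling explicit is a pure decision function over reported state, so that the outcome never depends on how a transport happened to fail. The sketch below is hypothetical: the state fields, action names, and close-reason strings are assumptions chosen for illustration.

```go
package main

import "fmt"

// CloseReason is a typed reason carried on an explicit close.
type CloseReason string

const (
	ReasonNone           CloseReason = ""
	ReasonSessionExpired CloseReason = "session_expired"
	ReasonNodeRestarted  CloseReason = "node_agent_restarted"
	ReasonPTYGone        CloseReason = "pty_gone"
)

// Action is what the broker does with a reconnect attempt.
type Action int

const (
	ActionReopen Action = iota // reopen-by-session on the existing PTY
	ActionClose                // close cleanly; client must open a fresh session
)

// SessionState holds facts that are reported explicitly (registry lookup,
// node-agent status), not inferred from connection errors.
type SessionState struct {
	Known         bool // session exists in the broker registry
	Expired       bool // server-enforced TTL has passed
	NodeRestarted bool // node-agent reports a restart since the session opened
	PTYAlive      bool // node-agent reports the PTY is still valid
}

// DecideReconnect allows reopen only when the PTY is still valid.
func DecideReconnect(s SessionState) (Action, CloseReason) {
	switch {
	case !s.Known || s.Expired:
		return ActionClose, ReasonSessionExpired
	case s.NodeRestarted:
		return ActionClose, ReasonNodeRestarted
	case !s.PTYAlive:
		return ActionClose, ReasonPTYGone
	default:
		return ActionReopen, ReasonNone
	}
}

func main() {
	// Gateway restarted, node side untouched: the session can be reopened.
	action, reason := DecideReconnect(SessionState{Known: true, PTYAlive: true})
	fmt.Println(action == ActionReopen, reason == ReasonNone) // true true

	// Node-agent restarted: close with a typed reason.
	action, reason = DecideReconnect(SessionState{Known: true, NodeRestarted: true})
	fmt.Println(action == ActionClose, reason) // true node_agent_restarted
}
```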
Migration Plan¶
Phase 1: Design and contract¶
- define protobuf for node/control-plane terminal stream
- define auth model for the node-facing gRPC endpoint
- document session lifecycle and close semantics
Phase 2: Broker implementation¶
- implement gRPC stream server in API or dedicated terminal broker
- keep existing browser websocket contract unchanged
- add typed session/state logging
Phase 3: Node-agent implementation¶
- add gRPC terminal client in node-agent
- keep old HTTP path behind a temporary feature flag for rollback only
Phase 4: Cutover¶
- route a dev-control environment to the gRPC path
- validate:
  - prompt delivery
  - keystroke echo
  - resize
  - disconnect
  - restart behavior
- remove old NDJSON duplex relay after soak
Test Requirements¶
Minimum required before calling the redesign done:
- unit tests for stream auth/session validation
- integration test for node-agent <-> broker bidi stream
- deployed-environment smoke that verifies:
  - prompt appears
  - typed key reaches PTY
  - shell echoes before disconnect
- post-reimage terminal smoke
- restart tests:
  - gateway restart
  - node-agent restart
  - broker restart
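The "shell echoes before disconnect" check is the one that targets the buffering failure seen in production, so its ordering logic is worth pinning down. The sketch below checks a recorded session transcript; the Event type, its kind labels, and the function name are hypothetical, and a real smoke test would build the transcript from the live websocket.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Event is one recorded step of a terminal session.
// Kind is "output", "input", or "close" (illustrative labels).
type Event struct {
	Kind string
	Data string
}

// CheckEchoBeforeClose passes only if the prompt is seen, then the key is
// sent, then the key is echoed back, all before any close event.
func CheckEchoBeforeClose(events []Event, prompt, key string) error {
	const (
		wantPrompt = iota
		wantInput
		wantEcho
		done
	)
	stage := wantPrompt
	for _, e := range events {
		switch {
		case e.Kind == "close":
			if stage != done {
				return errors.New("session closed before echo was observed")
			}
			return nil
		case stage == wantPrompt && e.Kind == "output" && strings.Contains(e.Data, prompt):
			stage = wantInput
		case stage == wantInput && e.Kind == "input" && e.Data == key:
			stage = wantEcho
		case stage == wantEcho && e.Kind == "output" && strings.Contains(e.Data, key):
			stage = done
		}
	}
	if stage != done {
		return errors.New("transcript ended before echo was observed")
	}
	return nil
}

func main() {
	live := []Event{{"output", "user@node:~$ "}, {"input", "x"}, {"output", "x"}, {"close", ""}}
	fmt.Println(CheckEchoBeforeClose(live, "$ ", "x")) // <nil>

	// Buffered relay: output only shows up after the session is closed.
	stalled := []Event{{"input", "x"}, {"close", ""}, {"output", "user@node:~$ x"}}
	fmt.Println(CheckEchoBeforeClose(stalled, "$ ", "x")) // session closed before echo was observed
}
```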
Decision¶
Do not continue incident-style fixes on the current NDJSON duplex relay.
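At the time of analysis, the decision was to build a dedicated node-facing gRPC terminal stream with an explicit identity model, while preserving the browser websocket contract. As recorded under Status, the chosen v1 architecture is instead the dedicated WebSocket bridge (Option B) with a node-facing mTLS listener.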