Terminal Node Transport Redesign v1

Status

Decided — superseded by Terminal WebSocket Bridge Architecture v1

This document remains the option-analysis and decision trail.

Decision outcome:

  • this document initially recommended Option C (gRPC bidirectional streaming)
  • after further first-principles review, the chosen v1 architecture is Option B: a dedicated WebSocket bridge with a node-facing mTLS listener
  • the active build specification is Terminal WebSocket Bridge Architecture v1

Problem

The current node-agent terminal path mixes two concerns that should be designed together:

  • live bidirectional terminal transport
  • node identity and authorization for that transport

Production findings from 2026-03:

  • the ingress path (https://node-api...) is reachable from nodes
  • but the NDJSON duplex-over-HTTP relay buffers live terminal traffic badly enough that prompt/input behavior is effectively delayed until disconnect or session unwind
  • a direct node-reachable path removes that streaming stall
  • but the current direct path fails with 401 invalid node identity, because the present identity model depends on ingress/mTLS handoff behavior

So the current system does not have a stable terminal node/control-plane transport.

Goals

  • terminal must be truly bidirectional during the live session
  • transport behavior must not depend on proxy buffering quirks
  • node identity must be explicit on the chosen transport
  • browser contract stays stable:
    • POST /api/v1/allocations/{id}/terminal-token
    • WS /ws/terminal/{allocation_id}
  • terminal transport must remain separable from node task polling/provisioning
  • reconnection, close, resize, and readiness should be typed protocol events

Non-Goals

  • changing the browser-facing websocket contract immediately
  • coupling terminal redesign to task polling/provisioning transport changes
  • continuing incremental incident patches to the current NDJSON duplex relay

Options

Option A: Keep NDJSON-over-HTTP and add more bypass rules

Shape:

  • browser -> terminal-gateway websocket
  • terminal-gateway/api -> node-agent via HTTP request/response streams

Pros:

  • smallest code churn
  • keeps existing OpenAPI-shaped internals

Cons:

  • already disproven in production as a robust solution
  • still proxy-sensitive
  • still requires transport-specific auth exceptions
  • harder to test and reason about full duplex behavior

Decision:

  • reject as long-term direction

Option B: Direct websocket from gateway to node-facing endpoint

Shape:

  • browser -> terminal-gateway websocket
  • terminal-gateway -> node-agent websocket or websocket-like direct stream

Pros:

  • true duplex transport
  • simpler browser/gateway mental model

Cons:

  • introduces a second public-ish node-facing runtime surface
  • harder to keep API as the control-plane authority
  • more awkward to integrate with existing node mTLS/task identity model

Decision:

  • not preferred at the time of analysis (later chosen for v1; see Status)

Option C: gRPC bidirectional stream between node-agent and control plane

Shape:

  • browser -> terminal-gateway websocket
  • terminal-gateway/API broker -> node-agent via gRPC bidi stream on a dedicated node-facing control-plane endpoint

Pros:

  • transport matches the problem: true duplex typed streaming
  • explicit stream lifecycle and backpressure
  • typed protocol for ready, data, resize, close, error
  • avoids dependence on ingress request/response buffering semantics
  • identity can be designed explicitly for this stream instead of inherited from the ingress header-forwarding path

Cons:

  • larger change than another HTTP patch
  • needs a new node-facing service boundary and tests

Decision:

  • recommended at the time of analysis (later superseded by Option B; see Status)

Recommendation At Time Of Analysis

At the time this document was written, Option C was the recommended direction:

  • keep browser edge as websocket through cmd/terminal-gateway
  • move node/control-plane terminal transport to a dedicated gRPC bidirectional stream
  • expose that stream on a node-reachable control-plane endpoint separate from the current ingress-buffered HTTP terminal relay

That recommendation was later superseded by the WebSocket bridge design after weighing:

  • operational simplicity
  • fit for byte-stream relay semantics
  • reuse of existing websocket runtime and operational model

The active design is now Terminal WebSocket Bridge Architecture v1.

For reference, the Option C topology recommended at the time of analysis:

browser
  -> terminal-gateway websocket
  -> terminal session broker in control plane
  -> gRPC bidi stream
  -> node-agent PTY

Suggested ownership split:

  • terminal-gateway
    • browser websocket termination
    • browser token/session validation
    • browser resize/input/output relay
  • cmd/api or a dedicated terminal broker service
    • session authority
    • allocation/user/node binding validation
    • audit/session lifecycle events
    • issues short-lived node stream credentials
  • cmd/node-agent
    • opens one gRPC bidi stream per terminal session
    • runs PTY as allocation user
    • relays typed frames

Identity Model

This part must be explicit. The direct path should not depend on ingress header forwarding.

Recommended model:

  1. Node remains enrolled and authenticated by its existing node certificate for lifecycle/task APIs.

  2. Terminal stream transport gets a dedicated auth model:

    • mTLS at transport level on the node-facing gRPC endpoint using the node cert, or
    • short-lived API-issued node stream token bound to node_id, session_id, allocation_id, and exp

  3. Server verifies both:

    • the node identity
    • the terminal session binding

  4. Terminal stream authorization does not depend on Traefik forwarding X-Forwarded-Tls-Client-Cert* headers.

Preferred variant:

  • mTLS on the dedicated gRPC endpoint plus signed stream/session claims

Reason:

  • transport-level peer identity remains strong
  • session-level authorization remains explicit and auditable
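As a rough illustration of the two-part check (node identity plus session binding), the sketch below signs and verifies stream claims with HMAC-SHA256. Everything here is an assumption made for illustration: the `StreamClaims` layout, the `SignClaims` and `VerifyStream` names, the token encoding, and the use of a shared HMAC key rather than whatever signing scheme the real broker would use. The `peerNodeID` argument stands in for the identity taken from the verified mTLS peer certificate.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"strings"
)

// StreamClaims carries the session binding fields named in the identity model.
// The struct and JSON layout are illustrative assumptions.
type StreamClaims struct {
	NodeID       string `json:"node_id"`
	SessionID    string `json:"session_id"`
	AllocationID string `json:"allocation_id"`
	Exp          int64  `json:"exp"` // expiry, Unix seconds
}

// sign computes HMAC-SHA256 over the JSON payload.
func sign(key, payload []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write(payload)
	return mac.Sum(nil)
}

// SignClaims returns "<base64url(claims JSON)>.<base64url(HMAC)>".
func SignClaims(key []byte, c StreamClaims) (string, error) {
	payload, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	enc := base64.RawURLEncoding
	return enc.EncodeToString(payload) + "." + enc.EncodeToString(sign(key, payload)), nil
}

// VerifyStream checks the signature and expiry, then verifies both the node
// identity (peerNodeID, assumed to come from the mTLS peer certificate) and
// the terminal session binding the caller is trying to open.
func VerifyStream(key []byte, token, peerNodeID, sessionID, allocationID string, nowUnix int64) (StreamClaims, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 2 {
		return StreamClaims{}, errors.New("malformed token")
	}
	enc := base64.RawURLEncoding
	payload, err := enc.DecodeString(parts[0])
	if err != nil {
		return StreamClaims{}, errors.New("malformed claims encoding")
	}
	sig, err := enc.DecodeString(parts[1])
	if err != nil {
		return StreamClaims{}, errors.New("malformed signature encoding")
	}
	// hmac.Equal compares in constant time.
	if !hmac.Equal(sig, sign(key, payload)) {
		return StreamClaims{}, errors.New("bad signature")
	}
	var c StreamClaims
	if err := json.Unmarshal(payload, &c); err != nil {
		return StreamClaims{}, errors.New("malformed claims")
	}
	switch {
	case nowUnix >= c.Exp:
		return StreamClaims{}, errors.New("token expired")
	case c.NodeID != peerNodeID:
		return StreamClaims{}, errors.New("node identity mismatch")
	case c.SessionID != sessionID || c.AllocationID != allocationID:
		return StreamClaims{}, errors.New("session binding mismatch")
	}
	return c, nil
}
```

Passing the current time as Unix seconds keeps the check deterministic in tests. A token that verifies cryptographically is still rejected when the mTLS peer differs from the `node_id` claim, so neither factor alone authorizes a stream.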

Protocol Shape

Use a typed stream instead of free-form NDJSON frames.

Suggested messages:

  • TerminalOpen
  • TerminalReady
  • TerminalData
  • TerminalResize
  • TerminalClose
  • TerminalError
  • TerminalHeartbeat
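To make the contrast with free-form NDJSON concrete, here is one possible in-memory shape for the suggested messages, written as a single tagged frame type with per-kind validation. The field names, the `FrameKind` constants, and the validation rules are assumptions for illustration; the actual wire contract would be defined by the protocol specification, not by this sketch.

```go
package main

import "errors"

// FrameKind enumerates the suggested message types.
type FrameKind int

const (
	KindOpen FrameKind = iota + 1
	KindReady
	KindData
	KindResize
	KindClose
	KindError
	KindHeartbeat
)

// Frame is a hypothetical typed envelope. Only the fields relevant to a
// given Kind are expected to be set.
type Frame struct {
	Kind          FrameKind
	SessionID     string
	CorrelationID string
	Data          []byte // KindData
	Cols, Rows    uint16 // KindResize
	Reason        string // KindClose, KindError
}

// Validate rejects frames that a free-form NDJSON relay would silently pass.
func (f Frame) Validate() error {
	if f.SessionID == "" || f.CorrelationID == "" {
		return errors.New("frame missing session or correlation ID")
	}
	switch f.Kind {
	case KindOpen, KindReady, KindHeartbeat:
		return nil
	case KindData:
		if len(f.Data) == 0 {
			return errors.New("data frame with empty payload")
		}
	case KindResize:
		if f.Cols == 0 || f.Rows == 0 {
			return errors.New("resize frame with zero dimension")
		}
	case KindClose, KindError:
		if f.Reason == "" {
			return errors.New("close/error frame without a reason")
		}
	default:
		return errors.New("unknown frame kind")
	}
	return nil
}
```

For example, `Frame{Kind: KindResize, SessionID: "s-1", CorrelationID: "c-1", Cols: 120, Rows: 40}.Validate()` returns nil, while the same frame with `Rows: 0` is rejected before it reaches the PTY.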

Required invariants:

  • exactly one active terminal session per allocation unless future multiplexing is explicitly added
  • server-enforced TTL
  • explicit close reasons
  • correlation IDs carried end to end
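The first three invariants can be sketched as a small in-memory registry. This is a minimal illustration under stated assumptions: the `Registry` type, the `CloseReason` values, and the method names are hypothetical, and a real broker would keep this state in durable or replicated storage rather than a map.

```go
package main

import (
	"errors"
	"sync"
)

// CloseReason makes every session end explicit. Values are illustrative.
type CloseReason string

const (
	ReasonClientClosed CloseReason = "client_closed"
	ReasonTTLExpired   CloseReason = "ttl_expired"
)

// Closed records which session ended and why.
type Closed struct {
	SessionID string
	Reason    CloseReason
}

// ErrSessionActive is returned when an allocation already has a live session.
var ErrSessionActive = errors.New("allocation already has an active terminal session")

type entry struct {
	sessionID string
	expiresAt int64 // Unix seconds
}

// Registry enforces one active session per allocation with a server-side TTL.
type Registry struct {
	mu     sync.Mutex
	ttl    int64
	active map[string]entry // keyed by allocation ID
}

func NewRegistry(ttlSeconds int64) *Registry {
	return &Registry{ttl: ttlSeconds, active: make(map[string]entry)}
}

// Open registers a session. If the previous session for the allocation has
// outlived its TTL, it is evicted and reported with an explicit reason.
func (r *Registry) Open(allocationID, sessionID string, now int64) (*Closed, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	var evicted *Closed
	if e, ok := r.active[allocationID]; ok {
		if now < e.expiresAt {
			return nil, ErrSessionActive
		}
		evicted = &Closed{SessionID: e.sessionID, Reason: ReasonTTLExpired}
	}
	r.active[allocationID] = entry{sessionID: sessionID, expiresAt: now + r.ttl}
	return evicted, nil
}

// Close ends the named session if it is the active one for the allocation.
func (r *Registry) Close(allocationID, sessionID string, reason CloseReason) (Closed, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.active[allocationID]
	if !ok || e.sessionID != sessionID {
		return Closed{}, false
	}
	delete(r.active, allocationID)
	return Closed{SessionID: sessionID, Reason: reason}, true
}
```

With a 60-second TTL, a second `Open` for the same allocation at t=30 fails with `ErrSessionActive`, while one at t=60 succeeds and reports the old session as closed with `ttl_expired`.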

Restart / Reconnect Model

The design should support restartability explicitly.

Requirements:

  • if terminal-gateway restarts:
    • browser reconnects with a new websocket
    • broker resumes or cleanly reopens the node stream for the existing live session
  • if node-agent restarts:
    • session is closed explicitly with a typed reason
    • UI gets a deterministic reconnectable/closed state
  • if control-plane broker restarts:
    • session registry rebuild is deterministic or active sessions are explicitly expired

Recommendation:

  • keep session registry in broker-owned durable/replicated state
  • support reopen-by-session semantics only if the PTY is still valid
  • otherwise close cleanly and require a fresh terminal.open

Do not infer reconnect state from transport side effects.
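One way to read the requirements above is as a pure decision function over explicit session state, so that the outcome never depends on how a transport happened to fail. The sketch below is an interpretation, not the broker's actual logic: the `Component`, `SessionState`, and `Decision` types and the reason strings are all hypothetical.

```go
package main

// Component identifies which process restarted.
type Component int

const (
	Gateway Component = iota
	NodeAgent
	Broker
)

// Action is what the broker does with the session after a restart.
type Action string

const (
	ActionResume Action = "resume" // reattach to the existing PTY
	ActionClose  Action = "close"  // close with a typed reason; client must open again
)

// SessionState is the explicit state the decision is based on.
type SessionState struct {
	InRegistry bool // broker-owned registry still has the session
	Expired    bool // server-enforced TTL has passed
	PTYAlive   bool // node-agent reports the PTY as still valid
}

// Decision pairs the action with a typed reason for the UI and audit log.
type Decision struct {
	Action Action
	Reason string
}

// OnRestart decides the outcome from recorded state only.
func OnRestart(which Component, s SessionState) Decision {
	switch {
	case !s.InRegistry:
		return Decision{Action: ActionClose, Reason: "session_unknown"}
	case s.Expired:
		return Decision{Action: ActionClose, Reason: "ttl_expired"}
	case which == NodeAgent:
		// Per the requirements, a node-agent restart always ends the session.
		return Decision{Action: ActionClose, Reason: "node_agent_restarted"}
	case !s.PTYAlive:
		return Decision{Action: ActionClose, Reason: "pty_gone"}
	default:
		return Decision{Action: ActionResume}
	}
}
```

A gateway restart with a registered, unexpired session and a live PTY yields `resume`; every other path yields `close` with a reason the UI can render as a deterministic state.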

Migration Plan

Phase 1: Design and contract

  • define protobuf for node/control-plane terminal stream
  • define auth model for the node-facing gRPC endpoint
  • document session lifecycle and close semantics

Phase 2: Broker implementation

  • implement gRPC stream server in API or dedicated terminal broker
  • keep existing browser websocket contract unchanged
  • add typed session/state logging

Phase 3: Node-agent implementation

  • add gRPC terminal client in node-agent
  • keep old HTTP path behind a temporary feature flag for rollback only

Phase 4: Cutover

  • route a dev-control environment to the gRPC path
  • validate:
    • prompt delivery
    • keystroke echo
    • resize
    • disconnect
    • restart behavior
  • remove old NDJSON duplex relay after soak

Test Requirements

Minimum required before calling the redesign done:

  • unit tests for stream auth/session validation
  • integration test for node-agent <-> broker bidi stream
  • deployed-environment smoke that verifies:
    • prompt appears
    • typed key reaches PTY
    • shell echoes before disconnect
  • post-reimage terminal smoke
  • restart tests:
    • gateway restart
    • node-agent restart
    • broker restart
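The deployed-environment smoke is essentially an ordering assertion: prompt, then input, then echo, all before disconnect. A minimal sketch of that check over a recorded event transcript follows. The `Event` type, the kind strings, and `CheckSmoke` are assumptions for illustration; how events are captured from a real session is out of scope here.

```go
package main

import (
	"errors"
	"strings"
)

// Event is one observed item in a recorded terminal session transcript.
// Kind is "output", "input", or "disconnect" in this sketch.
type Event struct {
	Kind    string
	Payload string
}

// CheckSmoke verifies that the prompt appears, the typed key is sent, and
// the shell echoes it, in that order and strictly before any disconnect.
func CheckSmoke(events []Event, prompt, key string) error {
	stages := []string{"prompt", "typed key", "echo"}
	step := 0 // index into stages; len(stages) means all stages passed
	for _, e := range events {
		if e.Kind == "disconnect" && step < len(stages) {
			return errors.New("disconnected before " + stages[step])
		}
		switch step {
		case 0:
			if e.Kind == "output" && strings.Contains(e.Payload, prompt) {
				step++
			}
		case 1:
			if e.Kind == "input" && e.Payload == key {
				step++
			}
		case 2:
			if e.Kind == "output" && strings.Contains(e.Payload, key) {
				step++
			}
		}
	}
	if step < len(stages) {
		return errors.New("transcript ended before " + stages[step])
	}
	return nil
}
```

A transcript in which the prompt only shows up after the disconnect event, which is the buffering symptom described in the Problem section, fails with "disconnected before prompt".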

Decision

Do not continue incident-style fixes on the current NDJSON duplex relay.

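At the time of analysis, the decision was to build a dedicated node-facing gRPC terminal stream with an explicit identity model, while preserving the browser websocket contract. The transport choice was later superseded by the WebSocket bridge (see Status).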