Terminal WebSocket Bridge Architecture v1¶
Status¶
Proposed v1 design target
Purpose¶
Define a first-principles terminal architecture for GPUaaS that:
- provides true full duplex terminal I/O
- does not depend on HTTP request/response streaming behavior through proxies
- uses explicit node identity verification
- scales to hundreds of simultaneous terminal sessions
- can serve as the primary access path in restricted or sovereign environments where direct SSH may be unavailable or undesirable
This proposal is intended to replace incident-driven patching of the current NDJSON terminal relay path.
Problem Statement¶
The existing terminal runtime mixes three separate concerns into a single ingress path:
- full duplex terminal transport
- node identity/authentication
- session authorization and control-plane ownership
That coupling created two production failure modes:
- the ingress path was reachable, but it buffered live duplex traffic so heavily that the prompt and input echo were delayed until disconnect or session unwind
- the direct path restored live stream behavior, but it failed node identity checks because the current identity model depends on ingress/mTLS handoff assumptions
The new design must separate these concerns.
First-Principles Requirements¶
Functional¶
- terminal input and output must flow in both directions independently and immediately
- browser contract should remain stable for users
- node-agent remains the terminal execution owner on the node
- terminal can be the primary remote shell surface when SSH is not the main access path
Security¶
- node identity must be verified without forwarded header dependence
- user/session authorization must remain control-plane owned
- transport must be encrypted in transit
- session open/close/error activity must remain auditable
Operational¶
- no correctness dependence on proxy buffering quirks
- support hundreds of simultaneous terminal sessions
- explicit connection caps and backpressure
- clear behavior on browser restart, gateway restart, and node-agent restart
Design Decision¶
Adopt a dual-WebSocket bridge design:
- browser ↔ terminal-gateway: WebSocket
- node-agent ↔ terminal-gateway: dedicated internal WebSocket over mTLS
The terminal-gateway becomes the terminal data-plane bridge.
The control plane (cmd/api) remains the session authority:
- mint terminal token
- validate allocation ownership
- create session binding
- enqueue terminal.open
- record audit trail
The API is not the terminal byte relay.
Why WebSocket Bridge¶
Why not HTTP streamed request/response¶
- request/response body streaming is not a reliable model for long-lived full duplex I/O
- proxy and ingress behavior can buffer or delay one or both directions
- production already proved this failure mode
Why WebSocket¶
- true duplex after upgrade
- widely supported operationally
- already used on the browser side of the system
- simpler than introducing a gRPC runtime and protobuf for a byte-stream bridge
- easy to carry binary data plus a few control messages
Why not Redis pubsub in the frame path¶
- per-frame broker hops add latency and failure modes
- terminal data plane should remain in-process once both sockets are connected
- Redis remains acceptable for session binding and token state, not live byte transport
Topology¶
browser
-> websocket
terminal-gateway
-> session authority calls
cmd/api
-> node_tasks terminal.open
node-agent
-> websocket over mTLS
terminal-gateway
Live relay path:
browser <-> terminal-gateway <-> node-agent
No API byte relay in the steady-state data path.
Session Flow¶
Phase 1: Browser authorization¶
- Browser calls:
  POST /api/v1/allocations/{id}/terminal-token
- API validates:
  - user owns allocation
  - allocation is active
  - rate limits
- API stores single-use short-lived token in Redis.
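The single-use, short-lived token behavior can be sketched as follows. This is a minimal illustration in Python, not the project's implementation: the dict stands in for Redis, and the function names and the 30-second TTL are assumptions that do not appear in the design.

```python
import secrets
import time

# In-memory stand-in for Redis. A real store would need an atomic
# read-and-delete (for example the Redis GETDEL command) so that two
# concurrent connects cannot both consume the same token.
_tokens = {}

TOKEN_TTL_SECONDS = 30  # assumed value; the design only says "short-lived"


def mint_terminal_token(allocation_id, user_id, now=None):
    """Store a random token bound to one allocation and one user."""
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(32)
    _tokens[token] = {
        "allocation_id": allocation_id,
        "user_id": user_id,
        "expires_at": now + TOKEN_TTL_SECONDS,
    }
    return token


def consume_terminal_token(token, allocation_id, now=None):
    """Return the bound claims at most once; None on any failure."""
    now = time.time() if now is None else now
    claims = _tokens.pop(token, None)  # pop() is what makes the token single-use
    if claims is None:
        return None
    if claims["expires_at"] <= now or claims["allocation_id"] != allocation_id:
        return None
    return claims
```

Note that a failed check still consumes the token, so a replay with a corrected allocation ID cannot succeed.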
Phase 2: Browser terminal connect¶
- Browser opens:
  WS /ws/terminal/{allocation_id}
- Browser sends terminal token via Sec-WebSocket-Protocol.
- Terminal-gateway validates token and creates session binding:
  - session_id
  - allocation_id
  - user_id
  - node_id
  - username
  - expiry / TTL
- Terminal-gateway requests terminal.open task dispatch through the control plane.
Phase 3: Node-agent connect¶
- Node-agent claims terminal.open.
- Node-agent starts PTY for the allocation user.
- Node-agent opens a dedicated internal WebSocket to terminal-gateway:
  - node-facing listener only
  - mTLS required
  - session ID included in path or signed header/token
- Terminal-gateway validates:
  - peer cert is a valid node cert
  - node identity matches session binding
  - session is still active
- Terminal-gateway bridges browser socket and node socket.
Phase 4: Live session¶
- browser input frames -> gateway -> node PTY stdin
- node PTY stdout/stderr -> gateway -> browser output frames
- resize/close/heartbeat remain typed control messages
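The key property of the live phase is that each direction is pumped by its own task, so output is never gated on input or the reverse. A minimal sketch of that shape, assuming asyncio queues as stand-ins for the two sockets and a None sentinel for end-of-stream (both are illustrative choices, not part of the design):

```python
import asyncio


async def pump(source, sink):
    """Copy frames from source to sink until a None sentinel arrives."""
    while True:
        frame = await source.get()
        await sink.put(frame)
        if frame is None:  # propagate end-of-stream, then stop
            return


async def bridge(browser_in, node_out, node_in, browser_out):
    """Run both directions concurrently so neither waits on the other."""
    await asyncio.gather(
        pump(browser_in, node_out),  # browser input -> node PTY stdin
        pump(node_in, browser_out),  # node PTY output -> browser
    )


async def demo():
    browser_in, node_out = asyncio.Queue(), asyncio.Queue()
    node_in, browser_out = asyncio.Queue(), asyncio.Queue()
    # The prompt is queued before any input exists and is still delivered.
    for frame in (b"$ ", None):
        node_in.put_nowait(frame)
    for frame in (b"ls\n", None):
        browser_in.put_nowait(frame)
    await bridge(browser_in, node_out, node_in, browser_out)
    return browser_out.get_nowait(), node_out.get_nowait()
```

The queues here are unbounded for brevity; a real bridge would bound them as described under backpressure.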
Phase 5: Close and cleanup¶
- either side can close
- gateway records close reason and audit
- session binding is removed
- browser receives explicit close reason
Security Model¶
Security has two separate layers.
1. Transport identity: node mTLS¶
The node-facing internal WebSocket listener requires:
- TLS
- client certificate verification against the node CA
- peer identity extraction directly from the TLS layer
This must not depend on Traefik forwarding client-cert headers.
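As a rough illustration of verifying the client certificate in the gateway process itself, the sketch below uses Python's standard `ssl` module. The function names are hypothetical, and reading the node ID from the certificate's subject commonName is an assumption; the design does not say where the node identity is encoded in the certificate.

```python
import ssl


def build_node_listener_context(ca_file, cert_file, key_file):
    """TLS server context that rejects peers without a node-CA client cert."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    ctx.load_verify_locations(cafile=ca_file)  # trust only the node CA
    ctx.verify_mode = ssl.CERT_REQUIRED  # handshake fails without a valid client cert
    return ctx


def node_id_from_peer_cert(peer_cert):
    """Extract the node ID from the dict returned by SSLSocket.getpeercert().

    Assumption: the node ID is carried in the subject commonName.
    """
    for rdn in peer_cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None
```

Because the identity comes from the verified TLS session, no forwarded header is consulted at any point.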
2. Session authorization¶
Session authorization remains control-plane owned.
The gateway accepts the node socket only if:
- the session binding exists
- the bound node_id matches the node identity proven by the client cert
- the session is not expired or closed
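The three acceptance conditions can be written as one small check. This is a sketch only: the `SessionBinding` shape, the status values, and the rejection reason strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SessionBinding:
    session_id: str
    node_id: str
    expires_at: float
    status: str = "active"  # assumed values: "active" or "closed"


def check_node_attach(binding, peer_node_id, now):
    """Return None if the node socket may attach, else a rejection reason."""
    if binding is None:
        return "binding_not_found"
    if binding.node_id != peer_node_id:
        return "node_id_mismatch"
    if binding.status != "active":
        return "session_closed"
    if binding.expires_at <= now:
        return "session_expired"
    return None
```

Returning a distinct reason per failure also feeds the "auth failures by reason" metric directly.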
Optional strengthening:
- require a short-lived signed session claim from API in addition to the cert
- bind that claim to:
  - session_id
  - node_id
  - allocation_id
  - exp
That is recommended but not strictly required for v1 if the binding store and mTLS checks are already strong.
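If the optional claim is adopted, one possible shape is a compact payload with an HMAC-SHA256 tag. The sketch below assumes a secret shared between API and gateway and a `payload.tag` token layout; both are illustrative choices, and an asymmetric signature would be the alternative if the gateway should not be able to mint claims itself.

```python
import base64
import hashlib
import hmac
import json


def sign_claim(secret, session_id, node_id, allocation_id, exp):
    """Serialize the claim and append an HMAC-SHA256 tag."""
    payload = json.dumps(
        {"session_id": session_id, "node_id": node_id,
         "allocation_id": allocation_id, "exp": exp},
        sort_keys=True, separators=(",", ":"),
    ).encode()
    tag = hmac.new(secret, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "."
            + base64.urlsafe_b64encode(tag).decode())


def verify_claim(secret, token, now):
    """Return the claim dict if the tag is valid and unexpired, else None."""
    try:
        payload_b64, tag_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        tag = base64.urlsafe_b64decode(tag_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):  # constant-time comparison
        return None
    claim = json.loads(payload)
    return claim if claim["exp"] > now else None
```

The gateway would still compare the claim's `node_id` against the identity proven by the client cert; the claim supplements the cert rather than replacing it.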
Session Directory And Redis Schema¶
Redis remains the session-control store, but not the frame relay path.
Recommended v1 keys:
- terminal_session:{session_id}
  - value: allocation_id, user_id, node_id, username, expires_at, gateway_instance_id, status
- terminal_allocation_active:{allocation_id}
  - value: session_id
- terminal_gateway_sessions:{gateway_instance_id}
  - set of owned session_id
Required semantics:
- session binding is created before the node socket attaches
- gateway_instance_id is written when the gateway takes ownership
- session ownership is removed on close or expiry
- session TTL remains enforced by control-plane policy
This extends the existing session-binding model rather than inventing a parallel session directory with unrelated ownership rules.
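The key layout above maps naturally onto a few helper functions. In this sketch the key names come from the schema; the JSON encoding of the session value and the "pending" initial status are assumptions (a Redis hash would work equally well).

```python
import json


def session_key(session_id):
    return f"terminal_session:{session_id}"


def allocation_active_key(allocation_id):
    return f"terminal_allocation_active:{allocation_id}"


def gateway_sessions_key(gateway_instance_id):
    return f"terminal_gateway_sessions:{gateway_instance_id}"


def session_value(allocation_id, user_id, node_id, username,
                  expires_at, gateway_instance_id=None, status="pending"):
    """JSON value for terminal_session:{session_id}.

    gateway_instance_id stays None until a gateway takes ownership,
    matching the "written when the gateway takes ownership" rule.
    """
    return json.dumps({
        "allocation_id": allocation_id,
        "user_id": user_id,
        "node_id": node_id,
        "username": username,
        "expires_at": expires_at,
        "gateway_instance_id": gateway_instance_id,
        "status": status,
    }, sort_keys=True)
```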
Protocol¶
Browser-side protocol¶
Use JSON text frames with typed control/data messages.
Examples:
- ready
- data
- resize
- close
- error
- heartbeat
Payload bytes are base64 encoded for browser-side structured messaging.
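A minimal sketch of the browser-side framing, assuming `type`, `data`, `cols`, and `rows` as field names (the design names the message types but not their fields):

```python
import base64
import json


def encode_data_message(pty_bytes):
    """Wrap raw PTY bytes in a JSON text frame with a base64 payload."""
    return json.dumps({
        "type": "data",
        "data": base64.b64encode(pty_bytes).decode("ascii"),
    })


def encode_resize_message(cols, rows):
    return json.dumps({"type": "resize", "cols": cols, "rows": rows})


def decode_message(text):
    """Parse a browser text frame; decode the payload of data messages."""
    msg = json.loads(text)
    if msg["type"] == "data":
        msg["data"] = base64.b64decode(msg["data"])
    return msg
```

Base64 matters because PTY output is not guaranteed to be valid UTF-8, so it cannot be placed in a JSON string directly.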
Node-side protocol¶
Use:
- binary frames for PTY byte data
- text frames for control messages (resize, close, heartbeat, error)
This keeps the hot path efficient while preserving structured control behavior.
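The gateway therefore translates between the two framings. The sketch below represents a binary frame as `bytes` and a text frame as `str`; it assumes node-side control frames are JSON with the same `type` field as the browser side, which the design does not specify.

```python
import base64
import json

CONTROL_TYPES = {"resize", "close", "heartbeat", "error"}


def to_node_frame(browser_text):
    """Translate one browser JSON message into a node-side frame.

    Returns bytes for a binary data frame, str for a text control frame.
    """
    msg = json.loads(browser_text)
    if msg["type"] == "data":
        return base64.b64decode(msg["data"])  # hot path: raw bytes, no JSON
    if msg["type"] in CONTROL_TYPES:
        return json.dumps(msg)
    raise ValueError(f"unsupported message type: {msg['type']}")


def to_browser_message(node_frame):
    """Translate one node-side frame into a browser JSON message."""
    if isinstance(node_frame, bytes):
        return json.dumps({
            "type": "data",
            "data": base64.b64encode(node_frame).decode("ascii"),
        })
    return node_frame  # control frames pass through as JSON text
```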
Scale Model¶
This architecture must handle hundreds of simultaneous sessions.
Scaling assumptions¶
- many nodes may be connected concurrently
- a restricted environment may prefer terminal over SSH
- terminal sessions are long-lived and bursty
Required scale properties¶
Gateway horizontal scaling¶
Terminal-gateway must scale horizontally.
Each active session is owned by one gateway instance.
Session directory requirements:
- session_id -> gateway_instance_id
- session_id -> allocation_id, user_id, node_id, expiry
This registry can live in Redis or another shared store because it is control state, not the frame path.
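Single ownership means the claim must be a set-if-absent operation rather than a plain write. A minimal sketch, with a dict standing in for the shared store and hypothetical function names; against Redis the equivalent primitive would be `SET` with the `NX` option.

```python
def claim_session(directory, session_id, gateway_instance_id):
    """Record ownership only if no gateway owns the session yet."""
    owner = directory.setdefault(session_id, gateway_instance_id)
    return owner == gateway_instance_id


def release_session(directory, session_id, gateway_instance_id):
    """Remove ownership, but only if this gateway is the current owner."""
    if directory.get(session_id) == gateway_instance_id:
        del directory[session_id]
        return True
    return False
```

The ownership check on release keeps a late or restarted gateway from deleting an entry that another instance has since claimed.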
No per-frame external dependency¶
There must be:
- no Redis publish/subscribe for live frame movement
- no database writes per frame
- no control-plane API hop per keypress
Backpressure and buffering¶
Each session must have bounded buffers.
Required policies:
- max outbound queue per browser socket
- max outbound queue per node socket
- close session or shed data predictably if peer is too slow
V1 default policy:
- do not silently drop terminal output bytes
- if a bounded queue is exceeded:
  - close the session explicitly
  - emit an explicit close reason such as:
    - output_backpressure_exceeded
    - input_backpressure_exceeded
  - surface that reason to logs and to the browser
Terminal should prefer correctness and bounded memory over unbounded buffering.
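The v1 default policy can be illustrated with a byte-bounded queue that marks the session closed on overflow instead of discarding frames. The class name and the choice to bound by bytes rather than frame count are assumptions made for this sketch.

```python
from collections import deque


class BoundedFrameQueue:
    """Byte-bounded FIFO that closes the session instead of dropping data."""

    def __init__(self, max_bytes, overflow_reason):
        self.max_bytes = max_bytes
        self.overflow_reason = overflow_reason
        self.frames = deque()
        self.queued_bytes = 0
        self.close_reason = None  # set once, when the bound is exceeded

    def push(self, frame):
        """Queue a frame; return False if the session is (now) closed."""
        if self.close_reason is not None:
            return False
        if self.queued_bytes + len(frame) > self.max_bytes:
            # Never drop silently: record why the session must end.
            self.close_reason = self.overflow_reason
            return False
        self.frames.append(frame)
        self.queued_bytes += len(frame)
        return True

    def pop(self):
        """Remove and return the oldest queued frame."""
        frame = self.frames.popleft()
        self.queued_bytes -= len(frame)
        return frame
```

A caller that sees `push` return False would read `close_reason`, log it, and send it to the browser in the close message.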
Connection caps¶
Enforce:
- per-user active session cap
- per-allocation active session cap
- per-gateway active session cap
Expose saturation metrics so operators know when to scale out.
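A sketch of admission against the three caps, returning the name of the cap that was hit so it can be reported as a saturation metric. The class and its counters are hypothetical and local to one gateway process; per-user and per-allocation caps that must hold across several gateways would need counters in the shared store instead.

```python
from collections import Counter


class SessionCaps:
    """Tracks active sessions and enforces user, allocation, and gateway caps."""

    def __init__(self, per_user, per_allocation, per_gateway):
        self.limits = {
            "user": per_user,
            "allocation": per_allocation,
            "gateway": per_gateway,
        }
        self.active = Counter()

    @staticmethod
    def _keys(user_id, allocation_id):
        return {
            "user": ("user", user_id),
            "allocation": ("allocation", allocation_id),
            "gateway": ("gateway", None),
        }

    def try_open(self, user_id, allocation_id):
        """Admit the session (None) or return the name of the cap that was hit."""
        keys = self._keys(user_id, allocation_id)
        for scope, key in keys.items():
            if self.active[key] >= self.limits[scope]:
                return scope
        for key in keys.values():
            self.active[key] += 1
        return None

    def close(self, user_id, allocation_id):
        """Release the counters taken by a previously admitted session."""
        for key in self._keys(user_id, allocation_id).values():
            self.active[key] -= 1
```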
Restart And Reconnect Model¶
Restart and reconnect behavior must be explicit, not accidental.
V1 recommendation¶
Non-resumable sessions.
Rules:
- browser disconnect:
  - close session after short grace period or immediately, depending on product choice
- gateway restart:
  - all sessions close with explicit reason gateway_restart
- node-agent restart:
  - session closes with reason node_restart
- control-plane/API restart:
  - existing gateway/node sockets continue if gateway remains alive; no byte relay through API
Why this is recommended for v1:
- simpler correctness model
- easier to test
- avoids hidden half-open session semantics
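The non-resumable rules above reduce to a small lookup from lifecycle event to outcome. In this sketch `gateway_restart` and `node_restart` are the reasons named in the rules; the event names, the `browser_disconnect` reason, and the grace-period parameter are hypothetical.

```python
def close_decision(event, browser_grace_seconds=0):
    """Map a lifecycle event to (close_reason, delay_seconds).

    Returns None when the session should stay open.
    """
    if event == "gateway_restart":
        return ("gateway_restart", 0)
    if event == "node_agent_restart":
        return ("node_restart", 0)
    if event == "browser_disconnect":
        # Immediate close or short grace period is a product choice.
        return ("browser_disconnect", browser_grace_seconds)
    if event == "api_restart":
        return None  # live sockets do not pass through the API
    raise ValueError(f"unknown lifecycle event: {event}")
```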
V2 option¶
Add resumable sessions only after v1 is stable.
That requires:
- durable session directory
- attach/re-attach semantics
- PTY survivability rules
Deployment Model¶
Node-facing listener¶
Expose a dedicated node-facing WebSocket listener outside the current HTTP ingress path.
Requirements:
- node-reachable address
- websocket-safe/L4-safe load balancing
- native mTLS at the gateway process
Do not place the critical node stream path behind the same ingress assumptions that caused the current buffering issue.
Concrete v1 deployment shape:
- cmd/terminal-gateway serves two listeners:
  - existing browser-facing websocket listener
  - new node-facing internal websocket listener on a dedicated port
- Kubernetes exposes the node-facing listener through a dedicated Service:
  - separate from the current browser-facing route
  - separate from the current ingress-buffered node-api path
- preferred v1 exposure:
  - dedicated LoadBalancer Service for the node-facing websocket port
  - reachable from worker nodes on the node network
  - TLS terminates in cmd/terminal-gateway, not at Traefik
Explicit non-goal:
- do not reuse the current Traefik/ingress terminal or node-api route for the live node stream path
Browser-facing listener¶
Keep the existing browser-facing websocket endpoint and routing model, provided it remains websocket-safe.
Observability Requirements¶
Required logs:
- browser websocket accepted
- session binding created
- node websocket accepted
- first downstream frame forwarded
- first upstream frame forwarded
- close reason and initiator
- auth failure reason
Required metrics:
- active sessions
- sessions opened / closed
- auth failures by reason
- average session duration
- buffer saturation / dropped-session counts
- gateway instance session counts
Test Plan¶
The redesign is not complete until it passes tests in a deployed environment.
Must-have tests¶
- prompt appears in deployed environment
- typed key is echoed before disconnect
- resize changes PTY size
- session closes with explicit reason
- post-reimage terminal smoke
Failure-mode tests¶
- browser disconnect
- gateway restart
- node-agent restart
- expired session binding
- wrong node cert / wrong node_id
Scale tests¶
- N concurrent sessions across multiple gateways
- sustained typing/output across many sessions
- no per-session memory runaway
Migration Plan¶
Phase 1¶
- add dedicated node-facing internal WebSocket listener to terminal-gateway
- add mTLS verification at the listener
- add feature flag for node transport selection
Phase 2¶
- add node-agent internal WebSocket client for terminal sessions
- keep existing browser contract unchanged
Phase 3¶
- run dev-control soak
- validate duplex, auth, restart behavior, and post-reimage flow
Phase 4¶
- remove NDJSON HTTP relay path
- remove frame-path Redis pubsub coupling
- simplify terminal runbooks and alerts around the new architecture
Decision¶
The recommended v1 terminal redesign is:
- browser-side WebSocket unchanged
- node-side direct internal WebSocket over mTLS
- terminal-gateway as the only live byte bridge
- API as session authority, not byte relay
- no dependence on ingress behavior for node terminal streaming