Terminal WebSocket Bridge Architecture v1

Status

Proposed v1 design target

Purpose

Define a first-principles terminal architecture for GPUaaS that:

provides true full duplex terminal I/O
does not depend on HTTP request/response streaming behavior through proxies
uses explicit node identity verification
scales to hundreds of simultaneous terminal sessions
can serve as the primary access path in restricted or sovereign environments where direct SSH may be unavailable or undesirable

This proposal is intended to replace incident-driven patching of the current NDJSON terminal relay path.

Problem Statement

The existing terminal runtime mixes three separate concerns into a single ingress path:

full duplex terminal transport
node identity/authentication
session authorization and control-plane ownership

That coupling created two production failure modes:

the ingress path was reachable, but it buffered live duplex traffic badly enough that the prompt and typed input did not appear until disconnect or session unwind
the direct path restored live streaming but failed node identity checks, because the current identity model assumes an ingress/mTLS handoff

The new design must separate these concerns.

First-Principles Requirements

Functional

terminal input and output must flow in both directions independently and immediately
browser contract should remain stable for users
node-agent remains the terminal execution owner on the node
terminal can be the primary remote shell surface when SSH is not the main access path

Security

node identity must be verified without forwarded header dependence
user/session authorization must remain control-plane owned
transport must be encrypted in transit
session open/close/error activity must remain auditable

Operational

no correctness dependence on proxy buffering quirks
support hundreds of simultaneous terminal sessions
explicit connection caps and backpressure
clear behavior on browser restart, gateway restart, and node-agent restart

Design Decision

Adopt a dual-WebSocket bridge design:

browser ↔ terminal-gateway: WebSocket
node-agent ↔ terminal-gateway: dedicated internal WebSocket over mTLS

The terminal-gateway becomes the terminal data-plane bridge.

The control plane (cmd/api) remains the session authority:

mint terminal token
validate allocation ownership
create session binding
enqueue terminal.open
record audit trail

The API is not the terminal byte relay.

Why WebSocket Bridge

Why not HTTP streamed request/response

request/response body streaming is not a reliable model for long-lived full duplex I/O
proxy and ingress behavior can buffer or delay one or both directions
production has already exhibited this failure mode

Why WebSocket

true duplex after upgrade
widely supported operationally
already used on the browser side of the system
simpler than introducing gRPC runtime and protobuf for a byte-stream bridge
easy to carry binary data plus a few control messages

Why not Redis pubsub in the frame path

per-frame broker hops add latency and failure modes
terminal data plane should remain in-process once both sockets are connected
Redis remains acceptable for session binding and token state, not live byte transport

Topology

browser
  -> websocket
terminal-gateway
  -> session authority calls
cmd/api
  -> node_tasks terminal.open
node-agent
  -> websocket over mTLS
terminal-gateway

Live relay path:

browser websocket <-> terminal-gateway <-> node websocket <-> PTY

No API byte relay in the steady-state data path.

Session Flow

Phase 1: Browser authorization

Browser calls:
- POST /api/v1/allocations/{id}/terminal-token
API validates:
- user owns allocation
- allocation is active
- rate limits
API stores single-use short-lived token in Redis.
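
The single-use, short-lived token semantics above can be sketched as follows. This is a minimal in-memory illustration, not the API's implementation: `TokenStore`, `mint`, `consume`, and the 30-second TTL are hypothetical names and values, and the dict stands in for Redis, where an atomic read-and-delete (for example `GETDEL`) would play the role of `pop`.

```python
import secrets
import time
from typing import Dict, Optional, Tuple


class TokenStore:
    """In-memory stand-in for the Redis token state (hypothetical)."""

    def __init__(self, ttl_seconds: float = 30.0) -> None:
        self._ttl = ttl_seconds  # assumed TTL; the design does not fix a value
        self._tokens: Dict[str, Tuple[str, str, float]] = {}

    def mint(self, user_id: str, allocation_id: str,
             now: Optional[float] = None) -> str:
        """Issue a random token bound to one user and one allocation."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (user_id, allocation_id, now + self._ttl)
        return token

    def consume(self, token: str, allocation_id: str,
                now: Optional[float] = None) -> Optional[str]:
        """Return the user_id on first valid use; None on any other use."""
        now = time.time() if now is None else now
        entry = self._tokens.pop(token, None)  # pop makes the token single-use
        if entry is None:
            return None
        user_id, bound_allocation, expires_at = entry
        if now >= expires_at or bound_allocation != allocation_id:
            return None
        return user_id
```

Removing the token before checking expiry means a replayed or expired token is never left behind for a second attempt.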

Phase 2: Browser terminal connect

Browser opens:
- WS /ws/terminal/{allocation_id}
Browser sends terminal token via Sec-WebSocket-Protocol.
Terminal-gateway validates token and creates session binding:
- session_id
- allocation_id
- user_id
- node_id
- username
- expiry / TTL
Terminal-gateway triggers terminal.open task dispatch through the control plane.

Phase 3: Node-agent connect

Node-agent claims terminal.open.
Node-agent starts PTY for the allocation user.
Node-agent opens a dedicated internal WebSocket to terminal-gateway:
- node-facing listener only
- mTLS required
- session ID included in path or signed header/token
Terminal-gateway validates:
- peer cert is a valid node cert
- node identity matches session binding
- session is still active
Terminal-gateway bridges browser socket and node socket.

Phase 4: Live session

browser input frames -> gateway -> node PTY stdin
node PTY stdout/stderr -> gateway -> browser output frames
resize/close/heartbeat remain typed control messages
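
The independent two-direction relay can be sketched with one pump task per direction. This is an illustration only: `asyncio.Queue` objects stand in for the browser and node sockets, and `pump`, `bridge`, `demo`, and the `CLOSE` sentinel are hypothetical names. A real bridge would also tear down both directions as soon as either side closes.

```python
import asyncio

CLOSE = None  # sentinel frame meaning "this side closed" (assumption)


async def pump(src: asyncio.Queue, dst: asyncio.Queue) -> None:
    """Forward frames one way until the source side closes."""
    while True:
        frame = await src.get()
        await dst.put(frame)  # a bounded dst queue applies backpressure here
        if frame is CLOSE:
            return


async def bridge(browser_in: asyncio.Queue, node_out: asyncio.Queue,
                 node_in: asyncio.Queue, browser_out: asyncio.Queue) -> None:
    """Run both directions concurrently so neither waits on the other."""
    await asyncio.gather(
        pump(browser_in, node_out),   # browser input -> node PTY stdin
        pump(node_in, browser_out),   # node PTY output -> browser
    )


async def demo():
    """Push one frame each way and return what arrived on the far side."""
    browser_in, node_out = asyncio.Queue(maxsize=8), asyncio.Queue(maxsize=8)
    node_in, browser_out = asyncio.Queue(maxsize=8), asyncio.Queue(maxsize=8)
    for frame in (b"ls\n", CLOSE):
        await browser_in.put(frame)
    for frame in (b"file.txt\n", CLOSE):
        await node_in.put(frame)
    await bridge(browser_in, node_out, node_in, browser_out)
    return node_out.get_nowait(), browser_out.get_nowait()
```

Running `asyncio.run(demo())` relays one input frame and one output frame entirely in process, with no broker or API hop in between.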

Phase 5: Close and cleanup

either side can close
gateway records the close reason and an audit entry
session binding is removed
browser receives explicit close reason

Security Model

Security has two separate layers.

1. Transport identity: node mTLS

The node-facing internal WebSocket listener requires:

TLS
client certificate verification against the node CA
peer identity extraction directly from the TLS layer

This must not depend on Traefik forwarding client-cert headers.
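
A listener that enforces these requirements at the TLS layer might be configured as in the sketch below, using Python's standard `ssl` module. The function names and file-path parameters are hypothetical, and reading the node ID from the certificate subject `commonName` is an assumption; the design does not say which certificate field carries node identity.

```python
import ssl
from typing import Any, Dict, Optional


def build_node_listener_context(ca_file: str, cert_file: str,
                                key_file: str) -> ssl.SSLContext:
    """Server context that rejects peers lacking a cert signed by the node CA."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED  # handshake fails without a client cert
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    ctx.load_verify_locations(cafile=ca_file)  # trust only the node CA
    return ctx


def node_id_from_peer_cert(peer_cert: Optional[Dict[str, Any]]) -> Optional[str]:
    """Extract the node ID from the dict returned by SSLSocket.getpeercert().

    Assumption: the node ID is the subject commonName.
    """
    if not peer_cert:
        return None
    for rdn in peer_cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None
```

Because the identity comes from the verified peer certificate on the accepted connection, no forwarded header is consulted at any point.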

2. Session authorization

Session authorization remains control-plane owned.

The gateway accepts the node socket only if:

the session binding exists
the bound node_id matches the node identity proven by the client cert
the session is not expired or closed
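
These three acceptance conditions can be expressed as a single check. The sketch below is illustrative: `SessionBinding`, `check_node_attach`, the status values, and the rejection reason strings are hypothetical names, not part of the existing codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SessionBinding:
    session_id: str
    node_id: str
    expires_at: float  # epoch seconds
    status: str        # assumed values: "pending", "active", "closed"


def check_node_attach(binding: Optional[SessionBinding],
                      cert_node_id: str, now: float) -> Optional[str]:
    """Return None to accept the node socket, else a rejection reason."""
    if binding is None:
        return "session_not_found"
    if binding.node_id != cert_node_id:
        return "node_identity_mismatch"  # cert proves a different node
    if binding.status == "closed":
        return "session_closed"
    if now >= binding.expires_at:
        return "session_expired"
    return None
```

Returning a reason string rather than a boolean lets the gateway log and count auth failures by reason.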

Optional strengthening:

require a short-lived signed session claim from API in addition to the cert
bind that claim to:
- session_id
- node_id
- allocation_id
- exp

This strengthening is recommended but not strictly required for v1, provided the binding store and mTLS checks are already strong.
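
One way to realize such a claim is an HMAC over a canonical JSON payload, sketched below. The token layout, the function names, and the use of a secret shared between API and gateway are all assumptions; an asymmetric signature would fit equally well if the gateway should not hold a signing key.

```python
import base64
import hashlib
import hmac
import json
from typing import Any, Dict, Optional


def sign_claim(claim: Dict[str, Any], key: bytes) -> str:
    """Serialize the claim canonically and append an HMAC-SHA256 signature."""
    payload = json.dumps(claim, sort_keys=True, separators=(",", ":")).encode()
    sig = hmac.new(key, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "."
            + base64.urlsafe_b64encode(sig).decode())


def verify_claim(token: str, key: bytes, now: float) -> Optional[Dict[str, Any]]:
    """Return the claim if the signature is valid and exp is in the future."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong shape or bad base64
        return None
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):  # constant-time comparison
        return None
    claim = json.loads(payload)
    if now >= claim["exp"]:
        return None
    return claim
```

The gateway would still compare the claim's `session_id` and `node_id` against the session binding and the certificate identity; the signature only proves the API issued the claim.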

Session Directory And Redis Schema

Redis remains the session-control store, but not the frame relay path.

Recommended v1 keys:

terminal_session:{session_id}
  value:
  - allocation_id
  - user_id
  - node_id
  - username
  - expires_at
  - gateway_instance_id
  - status
terminal_allocation_active:{allocation_id}
  value:
  - session_id
terminal_gateway_sessions:{gateway_instance_id}
  value: set of owned session_id values

Required semantics:

session binding is created before the node socket attaches
gateway_instance_id is written when the gateway takes ownership
session ownership is removed on close or expiry
session TTL remains enforced by control-plane policy

This extends the existing session-binding model rather than inventing a parallel session directory with unrelated ownership rules.
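
The key layout above can be captured in a few helpers, as in the sketch below. The key prefixes and field names come from the schema; the helper names, the `"pending"` initial status, and the string-valued hash encoding are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Dict


def session_key(session_id: str) -> str:
    return f"terminal_session:{session_id}"


def allocation_active_key(allocation_id: str) -> str:
    return f"terminal_allocation_active:{allocation_id}"


def gateway_sessions_key(gateway_instance_id: str) -> str:
    return f"terminal_gateway_sessions:{gateway_instance_id}"


@dataclass
class TerminalSession:
    allocation_id: str
    user_id: str
    node_id: str
    username: str
    expires_at: int                # epoch seconds
    gateway_instance_id: str = ""  # empty until a gateway takes ownership
    status: str = "pending"        # assumed initial status

    def to_hash(self) -> Dict[str, str]:
        """Flatten to string fields, the shape a Redis hash would hold."""
        return {name: str(value) for name, value in asdict(self).items()}
```

Keeping `gateway_instance_id` empty at creation matches the rule that the binding exists before any gateway takes ownership of the session.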

Protocol

Browser-side protocol

Use JSON text frames with typed control/data messages.

Examples:

ready
data
resize
close
error
heartbeat

Payload bytes are base64 encoded for browser-side structured messaging.
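
A minimal sketch of this framing follows. The message types are the ones listed above; the field names (`type`, `data`, `cols`, `rows`) and the function names are assumptions, since the existing browser contract is not reproduced in this document.

```python
import base64
import json
from typing import Any, Dict

MESSAGE_TYPES = {"ready", "data", "resize", "close", "error", "heartbeat"}


def encode_data(payload: bytes) -> str:
    """Wrap raw PTY bytes in a JSON text frame with a base64 body."""
    body = base64.b64encode(payload).decode("ascii")
    return json.dumps({"type": "data", "data": body})


def encode_resize(cols: int, rows: int) -> str:
    return json.dumps({"type": "resize", "cols": cols, "rows": rows})


def decode(text: str) -> Dict[str, Any]:
    """Parse a text frame, rejecting unknown types and decoding data bytes."""
    msg = json.loads(text)
    if not isinstance(msg, dict) or msg.get("type") not in MESSAGE_TYPES:
        raise ValueError("unknown or malformed terminal message")
    if msg["type"] == "data":
        msg["data"] = base64.b64decode(msg["data"])
    return msg
```

Base64 keeps arbitrary PTY output, including escape sequences and non-UTF-8 bytes, safe inside a JSON string.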

Node-side protocol

Use:

binary frames for PTY byte data
text frames for control messages (resize, close, heartbeat, error)

This keeps the hot path efficient while preserving structured control behavior.
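
On the gateway, the split between binary and text frames reduces to a type check on each received frame. In the sketch below, `route_node_frame` is a hypothetical name, and encoding control messages as JSON objects with a `type` field is an assumption; the design only states that they travel in text frames.

```python
import json
from typing import Any, Tuple, Union

CONTROL_TYPES = {"resize", "close", "heartbeat", "error"}


def route_node_frame(frame: Union[bytes, str]) -> Tuple[str, Any]:
    """Classify a node-socket frame as PTY data or a control message."""
    if isinstance(frame, (bytes, bytearray)):
        # Hot path: raw PTY bytes pass through with no parsing.
        return ("data", bytes(frame))
    msg = json.loads(frame)
    if not isinstance(msg, dict) or msg.get("type") not in CONTROL_TYPES:
        raise ValueError("unexpected control message on node socket")
    return ("control", msg)
```

Only control messages pay the JSON parsing cost, so bulk terminal output avoids both base64 and JSON on the node leg.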

Scale Model

This architecture must handle hundreds of simultaneous sessions.

Scaling assumptions

many nodes may be connected concurrently
a restricted environment may prefer terminal over SSH
terminal sessions are long-lived and bursty

Required scale properties

Gateway horizontal scaling

Terminal-gateway must scale horizontally.

Each active session is owned by one gateway instance.

Session directory requirements:

session_id -> gateway_instance_id
session_id -> allocation_id, user_id, node_id, expiry

This registry can live in Redis or another shared store because it is control state, not the frame path.

No per-frame external dependency

There must be:

no Redis publish/subscribe for live frame movement
no database writes per frame
no control-plane API hop per keypress

Backpressure and buffering

Each session must have bounded buffers.

Required policies:

max outbound queue per browser socket
max outbound queue per node socket
close session or shed data predictably if peer is too slow

V1 default policy:

do not silently drop terminal output bytes
if a bounded queue is exceeded:
- close the session explicitly
- emit an explicit close reason such as:
  - output_backpressure_exceeded
  - input_backpressure_exceeded
- surface that reason to logs and to the browser

Terminal should prefer correctness and bounded memory over unbounded buffering.
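
The close-on-overflow policy can be sketched as a queue with a fixed byte budget. `BoundedOutbound` and its methods are hypothetical names, and measuring the bound in bytes rather than frames is an assumption; the close reason strings are the ones listed above.

```python
from collections import deque
from typing import Deque, Optional


class BoundedOutbound:
    """Outbound queue with a byte budget; overflow closes instead of dropping."""

    def __init__(self, max_bytes: int, overflow_reason: str) -> None:
        self._max_bytes = max_bytes
        self._overflow_reason = overflow_reason
        self._frames: Deque[bytes] = deque()
        self._queued_bytes = 0
        self.close_reason: Optional[str] = None

    def enqueue(self, frame: bytes) -> bool:
        """Queue a frame; on overflow, mark the session closed and refuse."""
        if self.close_reason is not None:
            return False  # already closed; nothing further is accepted
        if self._queued_bytes + len(frame) > self._max_bytes:
            self.close_reason = self._overflow_reason
            return False
        self._frames.append(frame)
        self._queued_bytes += len(frame)
        return True

    def dequeue(self) -> Optional[bytes]:
        """Pop the oldest frame for the socket writer, or None if empty."""
        if not self._frames:
            return None
        frame = self._frames.popleft()
        self._queued_bytes -= len(frame)
        return frame
```

The caller checks `close_reason` after a refused `enqueue` and closes both sockets with that reason, so memory stays bounded and no output is silently lost mid-stream.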

Connection caps

Enforce:

per-user active session cap
per-allocation active session cap
per-gateway active session cap

Expose saturation metrics so operators know when to scale out.
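
An admission check for the three caps might look like the sketch below. `SessionCaps`, `admission_denial`, and the denial reason strings are hypothetical, and the limits used in the usage note are placeholders rather than recommended values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SessionCaps:
    per_user: int
    per_allocation: int
    per_gateway: int


def admission_denial(caps: SessionCaps, user_active: int,
                     allocation_active: int,
                     gateway_active: int) -> Optional[str]:
    """Return None to admit a new session, else the cap that blocked it."""
    if gateway_active >= caps.per_gateway:
        return "gateway_session_cap_reached"  # signal to scale out
    if user_active >= caps.per_user:
        return "user_session_cap_reached"
    if allocation_active >= caps.per_allocation:
        return "allocation_session_cap_reached"
    return None
```

For example, with `SessionCaps(per_user=2, per_allocation=1, per_gateway=100)`, a user who already has two active sessions is refused with `user_session_cap_reached`. Counting denials by reason gives operators the saturation signal directly.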

Restart And Reconnect Model

Restart and reconnect behavior must be explicit, not accidental.

V1 recommendation

Non-resumable sessions.

Rules:

browser disconnect:
- close session after short grace period or immediately, depending on product choice
gateway restart:
- all sessions close with explicit reason gateway_restart
node-agent restart:
- session closes with reason node_restart
control-plane/API restart:
- existing gateway/node sockets continue if gateway remains alive; no byte relay through API

Why this is recommended for v1:

simpler correctness model
easier to test
avoids hidden half-open session semantics

V2 option

Add resumable sessions only after v1 is stable.

That requires:

durable session directory
attach/re-attach semantics
PTY survivability rules

Deployment Model

Node-facing listener

Expose a dedicated node-facing WebSocket listener outside the current HTTP ingress path.

Requirements:

node-reachable address
websocket-safe/L4-safe load balancing
native mTLS at the gateway process

Do not place the critical node stream path behind the same ingress assumptions that caused the current buffering issue.

Concrete v1 deployment shape:

cmd/terminal-gateway serves two listeners:
- existing browser-facing websocket listener
- new node-facing internal websocket listener on a dedicated port
Kubernetes exposes the node-facing listener through a dedicated Service:
- separate from the current browser-facing route
- separate from the current ingress-buffered node-api path
preferred v1 exposure:
- dedicated LoadBalancer Service for the node-facing websocket port
- reachable from worker nodes on the node network
- TLS terminates in cmd/terminal-gateway, not at Traefik

Explicit non-goal:
- do not reuse the current Traefik/ingress terminal or node-api route for the live node stream path

Browser-facing listener

Keep the existing browser-facing websocket endpoint and routing model, provided it remains websocket-safe.

Observability Requirements

Required logs:

browser websocket accepted
session binding created
node websocket accepted
first downstream frame forwarded
first upstream frame forwarded
close reason and initiator
auth failure reason

Required metrics:

active sessions
sessions opened / closed
auth failures by reason
average session duration
buffer saturation / dropped-session counts
gateway instance session counts

Test Plan

The redesign is not complete until it passes tests in a deployed environment.

Must-have tests

prompt appears in deployed environment
typed key is echoed before disconnect
resize changes PTY size
session closes with explicit reason
post-reimage terminal smoke

Failure-mode tests

browser disconnect
gateway restart
node-agent restart
expired session binding
wrong node cert / wrong node_id

Scale tests

N concurrent sessions across multiple gateways
sustained typing/output across many sessions
no per-session memory runaway

Migration Plan

Phase 1

add dedicated node-facing internal WebSocket listener to terminal-gateway
add mTLS verification at the listener
add feature flag for node transport selection

Phase 2

add node-agent internal WebSocket client for terminal sessions
keep existing browser contract unchanged

Phase 3

run dev-control soak
validate duplex, auth, restart behavior, and post-reimage flow

Phase 4

remove NDJSON HTTP relay path
remove frame-path Redis pubsub coupling
simplify terminal runbooks and alerts around the new architecture

Decision

The recommended v1 terminal redesign is:

browser-side WebSocket unchanged
node-side direct internal WebSocket over mTLS
terminal-gateway as the only live byte bridge
API as session authority, not byte relay
no dependence on ingress behavior for node terminal streaming