Terminal WebSocket Bridge Architecture v1

Status

Proposed v1 design target

Purpose

Define a first-principles terminal architecture for GPUaaS that:

  • provides true full duplex terminal I/O
  • does not depend on HTTP request/response streaming behavior through proxies
  • uses explicit node identity verification
  • scales to hundreds of simultaneous terminal sessions
  • can serve as the primary access path in restricted or sovereign environments where direct SSH may be unavailable or undesirable

This proposal is intended to replace incident-driven patching of the current NDJSON terminal relay path.

Problem Statement

The existing terminal runtime mixes three separate concerns into a single ingress path:

  1. full duplex terminal transport
  2. node identity/authentication
  3. session authorization and control-plane ownership

That coupling created two production failure modes:

  • the ingress path was reachable, but it buffered live duplex traffic so heavily that prompt output and input handling were delayed until disconnect or session unwind
  • the direct path restored live stream behavior but failed node identity checks, because the current identity model depends on ingress/mTLS handoff assumptions

The new design must separate these concerns.

First-Principles Requirements

Functional

  • terminal input and output must flow in both directions independently and immediately
  • browser contract should remain stable for users
  • node-agent remains the terminal execution owner on the node
  • terminal can be the primary remote shell surface when SSH is not the main access path

Security

  • node identity must be verified without forwarded header dependence
  • user/session authorization must remain control-plane owned
  • transport must be encrypted in transit
  • session open/close/error activity must remain auditable

Operational

  • no correctness dependence on proxy buffering quirks
  • support hundreds of simultaneous terminal sessions
  • explicit connection caps and backpressure
  • clear behavior on browser restart, gateway restart, and node-agent restart

Design Decision

Adopt a dual-WebSocket bridge design:

  • browser ↔ terminal-gateway: WebSocket
  • node-agent ↔ terminal-gateway: dedicated internal WebSocket over mTLS

The terminal-gateway becomes the terminal data-plane bridge.

The control plane (cmd/api) remains the session authority:

  • mint terminal token
  • validate allocation ownership
  • create session binding
  • enqueue terminal.open
  • record audit trail

The API is not the terminal byte relay.

Why WebSocket Bridge

Why not HTTP streamed request/response

  • request/response body streaming is not a reliable model for long-lived full duplex I/O
  • proxy and ingress behavior can buffer or delay one or both directions
  • production already proved this failure mode

Why WebSocket

  • true duplex after upgrade
  • widely supported operationally
  • already used on the browser side of the system
  • simpler than introducing gRPC runtime and protobuf for a byte-stream bridge
  • easy to carry binary data plus a few control messages

Why not Redis pubsub in the frame path

  • per-frame broker hops add latency and failure modes
  • terminal data plane should remain in-process once both sockets are connected
  • Redis remains acceptable for session binding and token state, not live byte transport

Topology

browser
  -> websocket
terminal-gateway
  -> session authority calls
cmd/api
  -> node_tasks terminal.open
node-agent
  -> websocket over mTLS
terminal-gateway

Live relay path:

browser websocket <-> terminal-gateway <-> node websocket <-> PTY

No API byte relay in the steady-state data path.

Session Flow

Phase 1: Browser authorization

  1. Browser calls POST /api/v1/allocations/{id}/terminal-token.
  2. API validates:
    • user owns allocation
    • allocation is active
    • rate limits
  3. API stores single-use short-lived token in Redis.
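
The single-use, short-lived token property can be sketched as follows. This is a minimal in-memory stand-in, not the Redis implementation: the class name TokenStore, its method names, and the 30-second TTL are assumptions for illustration. Against Redis, the consume step would need an atomic get-and-delete (for example GETDEL on a key written with an expiry) so that two concurrent connects cannot both redeem the same token.

```python
import secrets
import time


class TokenStore:
    """In-memory stand-in for the Redis token store (hypothetical names)."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._tokens = {}  # token -> (allocation_id, user_id, expires_at)

    def mint(self, allocation_id, user_id):
        """Create a random token bound to one allocation and one user."""
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (allocation_id, user_id, self._clock() + self._ttl)
        return token

    def consume(self, token, allocation_id):
        """Return user_id if the token is valid, else None.

        The token is removed on every lookup, so it can be redeemed at most once.
        """
        entry = self._tokens.pop(token, None)
        if entry is None:
            return None
        bound_allocation, user_id, expires_at = entry
        if self._clock() >= expires_at or bound_allocation != allocation_id:
            return None
        return user_id
```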

Phase 2: Browser terminal connect

  1. Browser opens WS /ws/terminal/{allocation_id}.
  2. Browser sends terminal token via Sec-WebSocket-Protocol.
  3. Terminal-gateway validates token and creates session binding:
    • session_id
    • allocation_id
    • user_id
    • node_id
    • username
    • expiry / TTL
  4. Terminal-gateway triggers terminal.open task dispatch through the API.
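
One way the gateway could pull the token out of the Sec-WebSocket-Protocol header is sketched below. The header carries a comma-separated list of subprotocol names; the "terminal-token." prefix convention and the function name are assumptions, since this document does not define how the token is encoded in that header.

```python
def extract_terminal_token(header_value, prefix="terminal-token."):
    """Return the token carried in a Sec-WebSocket-Protocol value, or None.

    Assumes (hypothetically) that the browser offers one subprotocol entry
    of the form "<prefix><token>" alongside any real subprotocol names.
    """
    for item in header_value.split(","):
        candidate = item.strip()
        if candidate.startswith(prefix) and len(candidate) > len(prefix):
            return candidate[len(prefix):]
    return None
```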

Phase 3: Node-agent connect

  1. Node-agent claims terminal.open.
  2. Node-agent starts PTY for the allocation user.
  3. Node-agent opens a dedicated internal WebSocket to terminal-gateway:
    • node-facing listener only
    • mTLS required
    • session ID included in path or signed header/token
  4. Terminal-gateway validates:
    • peer cert is a valid node cert
    • node identity matches session binding
    • session is still active
  5. Terminal-gateway bridges browser socket and node socket.
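
The gateway-side checks before bridging can be sketched as a single function. This assumes the TLS layer has already verified the peer cert against the node CA and produced peer_node_id; the SessionBinding field layout follows the binding listed in Phase 2, while the status values and rejection reason strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SessionBinding:
    session_id: str
    allocation_id: str
    user_id: str
    node_id: str
    username: str
    expires_at: float
    status: str = "pending"  # hypothetical values: pending, active, closed


def check_node_attach(binding, peer_node_id, now):
    """Return None if the node socket may attach, else a rejection reason."""
    if binding is None:
        return "session_not_found"
    if binding.status == "closed":
        return "session_closed"
    if now >= binding.expires_at:
        return "session_expired"
    if peer_node_id != binding.node_id:
        return "node_identity_mismatch"
    return None
```

Returning a reason string rather than a boolean keeps the "auth failure reason" log and metric requirements easy to satisfy.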

Phase 4: Live session

  • browser input frames -> gateway -> node PTY stdin
  • node PTY stdout/stderr -> gateway -> browser output frames
  • resize/close/heartbeat remain typed control messages
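
The independence of the two directions can be illustrated with two concurrent pump tasks. This is a simplified sketch that uses asyncio queues in place of real WebSocket connections; the function names and the None end-of-stream sentinel are assumptions.

```python
import asyncio


async def pump(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    """Forward frames from source to sink until a None sentinel arrives."""
    while True:
        frame = await source.get()
        await sink.put(frame)
        if frame is None:  # sentinel is forwarded so the peer sees the close
            return


async def bridge(browser_in, node_out, node_in, browser_out) -> None:
    """Run both directions as separate tasks so neither waits on the other."""
    await asyncio.gather(
        pump(browser_in, node_out),   # browser input -> node PTY stdin
        pump(node_in, browser_out),   # node PTY output -> browser
    )
```

Because each direction has its own task, a quiet PTY never delays keystrokes and an idle user never delays output.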

Phase 5: Close and cleanup

  • either side can close
  • gateway records close reason and audit
  • session binding is removed
  • browser receives explicit close reason

Security Model

Security has two separate layers.

1. Transport identity: node mTLS

The node-facing internal WebSocket listener requires:

  • TLS
  • client certificate verification against the node CA
  • peer identity extraction directly from the TLS layer

This must not depend on Traefik forwarding client-cert headers.
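
A minimal sketch of a listener-side TLS configuration that enforces these three points, using Python's standard ssl module. The file path parameters are placeholders, and carrying the node ID in the certificate's commonName is an assumption; the actual node cert profile may use a SAN or another field.

```python
import ssl


def build_node_listener_context(node_ca_path, server_cert_path, server_key_path):
    """Server-side TLS context that rejects clients without a node-CA cert."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED             # client cert is mandatory
    ctx.load_verify_locations(cafile=node_ca_path)  # trust only the node CA
    ctx.load_cert_chain(certfile=server_cert_path, keyfile=server_key_path)
    return ctx


def node_id_from_peer_cert(peer_cert):
    """Extract commonName from the dict returned by SSLSocket.getpeercert().

    Assumption: the node ID is the subject commonName.
    """
    for rdn in peer_cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None
```

The identity comes from the verified handshake itself, so no forwarded header is consulted at any point.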

2. Session authorization

Session authorization remains control-plane owned.

The gateway accepts the node socket only if:

  • the session binding exists
  • the bound node_id matches the node identity proven by the client cert
  • the session is not expired or closed

Optional strengthening:

  • require a short-lived signed session claim from API in addition to the cert
  • bind that claim to:
    • session_id
    • node_id
    • allocation_id
    • exp

That is recommended but not strictly required for v1 if the binding store and mTLS checks are already strong.
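
For illustration, a signed session claim bound to those four fields could look like the sketch below. It uses HMAC-SHA256 with a secret shared between API and gateway, which is an assumption made to keep the example stdlib-only; an asymmetric signature would avoid giving the gateway minting capability. The token layout and function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def sign_claim(secret, session_id, node_id, allocation_id, exp):
    """Return "<payload_b64>.<signature_b64>" for the given claim fields."""
    payload = json.dumps(
        {"session_id": session_id, "node_id": node_id,
         "allocation_id": allocation_id, "exp": exp},
        sort_keys=True, separators=(",", ":"),
    ).encode("utf-8")
    sig = hmac.new(secret, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode("ascii") + "."
            + base64.urlsafe_b64encode(sig).decode("ascii"))


def verify_claim(secret, token, now):
    """Return the claim dict if the signature and expiry are valid, else None."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong part count or bad base64
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None
    claim = json.loads(payload)
    if now >= claim["exp"]:
        return None
    return claim
```

The gateway would still compare the claim's node_id and session_id against the mTLS identity and the session binding; the claim adds a check, it does not replace one.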

Session Directory And Redis Schema

Redis remains the session-control store, but not the frame relay path.

Recommended v1 keys:

  • terminal_session:{session_id}
    • value:
      • allocation_id
      • user_id
      • node_id
      • username
      • expires_at
      • gateway_instance_id
      • status
  • terminal_allocation_active:{allocation_id}
    • value:
      • session_id
  • terminal_gateway_sessions:{gateway_instance_id}
    • set of owned session_id

Required semantics:

  • session binding is created before the node socket attaches
  • gateway_instance_id is written when the gateway takes ownership
  • session ownership is removed on close or expiry
  • session TTL remains enforced by control-plane policy

This extends the existing session-binding model rather than inventing a parallel session directory with unrelated ownership rules.
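
The key layout and ownership semantics above can be sketched with plain dictionaries standing in for Redis hashes, strings, and sets. The class name, method names, and the status values "pending" and "active" are assumptions; TTL handling is omitted.

```python
def session_key(session_id):
    return f"terminal_session:{session_id}"


def allocation_active_key(allocation_id):
    return f"terminal_allocation_active:{allocation_id}"


def gateway_sessions_key(gateway_instance_id):
    return f"terminal_gateway_sessions:{gateway_instance_id}"


class SessionDirectory:
    """Dict-backed stand-in for the Redis session-control store."""

    def __init__(self):
        self.hashes, self.strings, self.sets = {}, {}, {}

    def create_binding(self, session_id, allocation_id, user_id, node_id,
                       username, expires_at):
        # Created before the node socket attaches; no gateway owner yet.
        self.hashes[session_key(session_id)] = {
            "allocation_id": allocation_id, "user_id": user_id,
            "node_id": node_id, "username": username,
            "expires_at": expires_at, "gateway_instance_id": "",
            "status": "pending",
        }
        self.strings[allocation_active_key(allocation_id)] = session_id

    def take_ownership(self, session_id, gateway_instance_id):
        record = self.hashes[session_key(session_id)]
        record["gateway_instance_id"] = gateway_instance_id
        record["status"] = "active"
        owned = self.sets.setdefault(gateway_sessions_key(gateway_instance_id), set())
        owned.add(session_id)

    def remove(self, session_id):
        # Called on close or expiry; clears all three keys for the session.
        record = self.hashes.pop(session_key(session_id), None)
        if record is None:
            return
        self.strings.pop(allocation_active_key(record["allocation_id"]), None)
        owned = self.sets.get(gateway_sessions_key(record["gateway_instance_id"]))
        if owned is not None:
            owned.discard(session_id)
```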

Protocol

Browser-side protocol

Use JSON text frames with typed control/data messages.

Examples:

  • ready
  • data
  • resize
  • close
  • error
  • heartbeat

Payload bytes are base64 encoded for browser-side structured messaging.
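
A minimal sketch of the browser-side framing, assuming a "type" discriminator and field names "data", "cols", and "rows". Those field names are not specified in this document and are illustrative only.

```python
import base64
import json


def encode_data_frame(payload: bytes) -> str:
    """Wrap raw PTY bytes in a JSON text frame with a base64 payload."""
    return json.dumps(
        {"type": "data", "data": base64.b64encode(payload).decode("ascii")}
    )


def encode_resize_frame(cols: int, rows: int) -> str:
    return json.dumps({"type": "resize", "cols": cols, "rows": rows})


def decode_frame(text: str) -> dict:
    """Parse a JSON text frame; data payloads are returned as bytes."""
    msg = json.loads(text)
    if msg.get("type") == "data":
        msg["data"] = base64.b64decode(msg["data"])
    return msg
```

Base64 is needed because PTY output can contain bytes that are not valid UTF-8, which a JSON string cannot carry directly.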

Node-side protocol

Use:

  • binary frames for PTY byte data
  • text frames for control messages (resize, close, heartbeat, error)

This keeps the hot path efficient while preserving structured control behavior.
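
The gateway then translates between the two framings. The sketch below shows the node-to-browser direction only, assuming that binary frames arrive as bytes, that control frames arrive as JSON text, and that control messages are passed through unchanged; the function name and the browser-side field names are hypothetical.

```python
import base64
import json


def node_frame_to_browser_text(frame):
    """Convert one node-side frame into a browser-side JSON text frame."""
    if isinstance(frame, (bytes, bytearray)):
        # Hot path: raw PTY bytes, wrapped once at the browser edge.
        encoded = base64.b64encode(bytes(frame)).decode("ascii")
        return json.dumps({"type": "data", "data": encoded})
    # Control path: node text frames are already structured JSON.
    control = json.loads(frame)
    return json.dumps(control)
```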

Scale Model

This architecture must handle hundreds of simultaneous sessions.

Scaling assumptions

  • many nodes may be connected concurrently
  • a restricted environment may prefer terminal over SSH
  • terminal sessions are long-lived and bursty

Required scale properties

Gateway horizontal scaling

Terminal-gateway must scale horizontally.

Each active session is owned by one gateway instance.

Session directory requirements:

  • session_id -> gateway_instance_id
  • session_id -> allocation_id, user_id, node_id, expiry

This registry can live in Redis or another shared store because it is control state, not the frame path.

No per-frame external dependency

There must be:

  • no Redis publish/subscribe for live frame movement
  • no database writes per frame
  • no control-plane API hop per keypress

Backpressure and buffering

Each session must have bounded buffers.

Required policies:

  • max outbound queue per browser socket
  • max outbound queue per node socket
  • close session or shed data predictably if peer is too slow

V1 default policy:

  • do not silently drop terminal output bytes
  • if a bounded queue is exceeded:
    • close the session explicitly
    • emit an explicit close reason such as:
      • output_backpressure_exceeded
      • input_backpressure_exceeded
    • surface that reason to logs and to the browser

Terminal should prefer correctness and bounded memory over unbounded buffering.
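
The v1 policy can be sketched as a byte-bounded queue that refuses to grow rather than dropping data. The class names and the byte-based limit are assumptions; a frame-count limit would work the same way.

```python
from collections import deque


class BackpressureExceeded(Exception):
    """Raised when a push would exceed the queue bound."""

    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


class BoundedFrameQueue:
    """Per-socket outbound queue with a hard byte limit (hypothetical)."""

    def __init__(self, max_bytes, close_reason):
        self._max_bytes = max_bytes
        self._close_reason = close_reason
        self._frames = deque()
        self._queued_bytes = 0

    def push(self, frame: bytes) -> None:
        # Never drop silently: signal the caller to close the session.
        if self._queued_bytes + len(frame) > self._max_bytes:
            raise BackpressureExceeded(self._close_reason)
        self._frames.append(frame)
        self._queued_bytes += len(frame)

    def pop(self) -> bytes:
        frame = self._frames.popleft()
        self._queued_bytes -= len(frame)
        return frame
```

The caller would catch BackpressureExceeded, close both sockets, and report the carried reason to logs and to the browser.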

Connection caps

Enforce:

  • per-user active session cap
  • per-allocation active session cap
  • per-gateway active session cap

Expose saturation metrics so operators know when to scale out.
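
A sketch of the admission check for the three caps. The limits shown are placeholders, not recommendations, and the counters here are local to one gateway process; enforcing per-user and per-allocation caps across several gateways would need the shared session store. Class, method, and reason names are hypothetical.

```python
from collections import Counter


class SessionCaps:
    """Local admission control for one gateway instance."""

    def __init__(self, per_user=4, per_allocation=2, per_gateway=500):
        self.per_user = per_user
        self.per_allocation = per_allocation
        self.per_gateway = per_gateway
        self._by_user = Counter()
        self._by_allocation = Counter()
        self._total = 0

    def try_admit(self, user_id, allocation_id):
        """Count the session and return None, or return a rejection reason."""
        if self._total >= self.per_gateway:
            return "gateway_session_cap"
        if self._by_user[user_id] >= self.per_user:
            return "user_session_cap"
        if self._by_allocation[allocation_id] >= self.per_allocation:
            return "allocation_session_cap"
        self._by_user[user_id] += 1
        self._by_allocation[allocation_id] += 1
        self._total += 1
        return None

    def release(self, user_id, allocation_id):
        self._by_user[user_id] -= 1
        self._by_allocation[allocation_id] -= 1
        self._total -= 1

    def gateway_saturation(self):
        """Fraction of the per-gateway cap in use; suitable as a gauge metric."""
        return self._total / self.per_gateway
```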

Restart And Reconnect Model

This must be explicit, not accidental.

V1 recommendation

Non-resumable sessions.

Rules:

  • browser disconnect:
    • close session after short grace period or immediately, depending on product choice
  • gateway restart:
    • all sessions close with explicit reason gateway_restart
  • node-agent restart:
    • session closes with reason node_restart
  • control-plane/API restart:
    • existing gateway/node sockets continue if gateway remains alive; no byte relay through API

Why this is recommended for v1:

  • simpler correctness model
  • easier to test
  • avoids hidden half-open session semantics

V2 option

Add resumable sessions only after v1 is stable.

That requires:

  • durable session directory
  • attach/re-attach semantics
  • PTY survivability rules

Deployment Model

Node-facing listener

Expose a dedicated node-facing WebSocket listener outside the current HTTP ingress path.

Requirements:

  • node-reachable address
  • websocket-safe/L4-safe load balancing
  • native mTLS at the gateway process

Do not place the critical node stream path behind the same ingress assumptions that caused the current buffering issue.

Concrete v1 deployment shape:

  • cmd/terminal-gateway serves two listeners:
    • existing browser-facing websocket listener
    • new node-facing internal websocket listener on a dedicated port
  • Kubernetes exposes the node-facing listener through a dedicated Service:
    • separate from the current browser-facing route
    • separate from the current ingress-buffered node-api path
  • preferred v1 exposure:
    • dedicated LoadBalancer Service for the node-facing websocket port
    • reachable from worker nodes on the node network
    • TLS terminates in cmd/terminal-gateway, not at Traefik

Explicit non-goal:

  • do not reuse the current Traefik/ingress terminal or node-api route for the live node stream path

Browser-facing listener

Keep the existing browser-facing websocket endpoint and routing model, provided it remains websocket-safe.

Observability Requirements

Required logs:

  • browser websocket accepted
  • session binding created
  • node websocket accepted
  • first downstream frame forwarded
  • first upstream frame forwarded
  • close reason and initiator
  • auth failure reason

Required metrics:

  • active sessions
  • sessions opened / closed
  • auth failures by reason
  • average session duration
  • buffer saturation / dropped-session counts
  • gateway instance session counts

Test Plan

The redesign is not complete until it passes tests in a deployed environment.

Must-have tests

  • prompt appears in deployed environment
  • typed key is echoed before disconnect
  • resize changes PTY size
  • session closes with explicit reason
  • post-reimage terminal smoke

Failure-mode tests

  • browser disconnect
  • gateway restart
  • node-agent restart
  • expired session binding
  • wrong node cert / wrong node_id

Scale tests

  • N concurrent sessions across multiple gateways
  • sustained typing/output across many sessions
  • no per-session memory runaway

Migration Plan

Phase 1

  • add dedicated node-facing internal WebSocket listener to terminal-gateway
  • add mTLS verification at the listener
  • add feature flag for node transport selection

Phase 2

  • add node-agent internal WebSocket client for terminal sessions
  • keep existing browser contract unchanged

Phase 3

  • run dev-control soak
  • validate duplex, auth, restart behavior, and post-reimage flow

Phase 4

  • remove NDJSON HTTP relay path
  • remove frame-path Redis pubsub coupling
  • simplify terminal runbooks and alerts around the new architecture

Decision

The recommended v1 terminal redesign is:

  • browser-side WebSocket unchanged
  • node-side direct internal WebSocket over mTLS
  • terminal-gateway as the only live byte bridge
  • API as session authority, not byte relay
  • no dependence on ingress behavior for node terminal streaming