Terminal WebSocket Bridge Implementation Plan v1

Status

Proposed

Build plan for: Terminal WebSocket Bridge Architecture v1

Purpose

Break the terminal redesign into implementation slices that can be delivered, tested, and rolled out without falling back into incident-mode changes.

Governing Decisions

  • browser-facing terminal contract remains unchanged
  • terminal-gateway becomes the live terminal byte bridge
  • API remains session authority, not byte relay
  • node-facing terminal path is a dedicated internal WebSocket over mTLS
  • no per-frame Redis or DB in the hot data path
  • v1 sessions are non-resumable

New ADR Required

Before or during slice 1, add a new ADR under doc/architecture/adrs/:

  • ADR-011-terminal-node-websocket-bridge.md

That ADR should:

  • supersede the current internal HTTP relay assumption for the terminal data plane
  • reference:
      ◦ ADR-005-terminal-gateway-isolation.md
      ◦ ADR-007-terminal-access-auth-model.md
      ◦ Terminal_WebSocket_Bridge_Architecture_v1.md

Slice Order

Slice 1: Session Authority And Broker Contract Cleanup

Owner: API + terminal service

Goal: make session binding a stable broker-owned control-state model before changing transport

In scope:

  • normalize Redis session schema:
      ◦ terminal_session:{session_id}
      ◦ terminal_allocation_active:{allocation_id}
      ◦ terminal_gateway_sessions:{gateway_instance_id}
  • ensure terminal.open task payload carries everything node-agent needs for bridge connect
  • define explicit close reasons and session states

Out of scope:

  • node-facing internal websocket listener
  • browser UI changes

Files likely touched:

  • packages/services/terminal/service.go
  • cmd/api/routes.go
  • doc/api/openapi.draft.yaml if contract text/fields change

Acceptance:

  • unit tests for token consume + session binding creation/cleanup
  • integration test for single active session per allocation
  • audit/session state logs remain correct
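The slice 1 key schema and the single-active-session rule can be sketched as follows. This is a minimal illustration, not the broker implementation: the three key prefixes come from the plan, while MemoryBroker, Bind, Release, and ErrAllocationBusy are hypothetical names, and the in-memory map only stands in for Redis (a Redis-backed version would likely claim the allocation key with a set-if-absent write such as SET with NX).

```go
package main

import (
	"errors"
	"sync"
)

// Key builders for the three Redis keys listed in the slice 1 schema.
func SessionKey(sessionID string) string        { return "terminal_session:" + sessionID }
func AllocationActiveKey(allocID string) string { return "terminal_allocation_active:" + allocID }
func GatewaySessionsKey(gatewayID string) string {
	return "terminal_gateway_sessions:" + gatewayID
}

// ErrAllocationBusy is a hypothetical error name, not taken from the plan.
var ErrAllocationBusy = errors.New("allocation already has an active terminal session")

// MemoryBroker is an in-memory stand-in for the Redis-backed broker (assumption).
type MemoryBroker struct {
	mu   sync.Mutex
	keys map[string]string
}

func NewMemoryBroker() *MemoryBroker {
	return &MemoryBroker{keys: make(map[string]string)}
}

// Bind claims the allocation for sessionID, mimicking a set-if-absent write.
// It fails if another session already holds the allocation.
func (b *MemoryBroker) Bind(allocID, sessionID string) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	active := AllocationActiveKey(allocID)
	if _, taken := b.keys[active]; taken {
		return ErrAllocationBusy
	}
	b.keys[active] = sessionID
	b.keys[SessionKey(sessionID)] = allocID
	return nil
}

// Release deletes the session key, and deletes the allocation key only if
// sessionID still owns it, so a stale cleanup cannot evict a newer session.
func (b *MemoryBroker) Release(allocID, sessionID string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	active := AllocationActiveKey(allocID)
	if b.keys[active] == sessionID {
		delete(b.keys, active)
	}
	delete(b.keys, SessionKey(sessionID))
}
```

Under these assumptions, a second Bind for the same allocation returns ErrAllocationBusy until the first session is released, which is the behavior the integration test in the acceptance list would exercise.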

Slice 2: Terminal-Gateway Internal Node Listener

Owner: terminal-gateway

Goal: add dedicated node-facing internal WebSocket listener with native mTLS verification

In scope:

  • second listener/port in cmd/terminal-gateway
  • TLS client auth using node CA
  • session lookup and node identity validation
  • in-process browser socket <-> node socket bridge
  • first upstream/downstream frame logs

Out of scope: node-agent switching to the new path

Files likely touched:

  • cmd/terminal-gateway/main.go
  • cmd/terminal-gateway/routes.go
  • packages/services/terminal/service.go

Acceptance:

  • local/integration test that gateway accepts node mTLS websocket
  • session ownership is registered on connect and cleared on close
  • no Redis pubsub used for live frame relay on this new path
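The "TLS client auth using node CA" and "node identity validation" items for slice 2 could look roughly like the sketch below, using Go's standard crypto/tls package. The function and error names are hypothetical. The sketch also assumes the node_id is carried in the client certificate's Subject Common Name, which the plan does not specify; a SAN-based identity would need a different extraction step.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
)

// Hypothetical error values for the two rejection cases.
var (
	ErrNoVerifiedCert = errors.New("no verified client certificate")
	ErrNodeMismatch   = errors.New("certificate identity does not match session node_id")
)

// NodeListenerTLS builds a server-side TLS config that refuses any client
// that does not present a certificate chaining to the node CA pool.
func NodeListenerTLS(serverCerts []tls.Certificate, nodeCA *x509.CertPool) *tls.Config {
	return &tls.Config{
		Certificates: serverCerts,
		ClientCAs:    nodeCA,
		ClientAuth:   tls.RequireAndVerifyClientCert,
		MinVersion:   tls.VersionTLS12,
	}
}

// PeerNodeID reads the node identity from the verified chain of a connection.
// ASSUMPTION: the identity is the leaf certificate's Subject Common Name.
func PeerNodeID(state *tls.ConnectionState) (string, error) {
	if state == nil || len(state.VerifiedChains) == 0 || len(state.VerifiedChains[0]) == 0 {
		return "", ErrNoVerifiedCert
	}
	return state.VerifiedChains[0][0].Subject.CommonName, nil
}

// AuthorizeNode rejects the connection unless the certificate identity equals
// the node_id that the broker bound to the session.
func AuthorizeNode(certNodeID, sessionNodeID string) error {
	if certNodeID == "" || certNodeID != sessionNodeID {
		return ErrNodeMismatch
	}
	return nil
}
```

In an HTTP upgrade handler the connection state is available as the request's TLS field, so the listener can validate identity itself without relying on forwarded headers.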

Slice 3: Node-Agent Internal WebSocket Client

Owner: node-agent

Goal: replace HTTP terminal relay client with node-facing internal WebSocket client

In scope:

  • on terminal.open, node-agent opens internal websocket to gateway
  • PTY byte relay over binary frames
  • resize/close/heartbeat over typed control frames
  • explicit non-resumable close behavior

Out of scope: lifecycle/task polling transport changes

Files likely touched:

  • cmd/node-agent/terminal_stream.go
  • cmd/node-agent/config.go

Acceptance:

  • node-agent unit tests for:
      ◦ connect success
      ◦ wrong node identity rejection
      ◦ close reason propagation
  • PTY prompt appears and typed key echoes against a fake gateway
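One possible shape for the typed control frames in slice 3 is sketched below. The plan names the three frame kinds (resize, close, heartbeat) but not their encoding, so the JSON layout, field names, and function names here are all assumptions made for illustration. PTY bytes would travel separately as binary frames and are not modeled.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Control frame kinds named in the plan; the string values are assumptions.
const (
	FrameResize    = "resize"
	FrameClose     = "close"
	FrameHeartbeat = "heartbeat"
)

// ControlFrame is a hypothetical JSON envelope for non-PTY messages.
type ControlFrame struct {
	Type   string `json:"type"`
	Cols   int    `json:"cols,omitempty"`
	Rows   int    `json:"rows,omitempty"`
	Reason string `json:"reason,omitempty"`
}

// Validate enforces the fields each frame kind needs. A close frame without
// a reason is rejected so that close reasons stay explicit.
func (f ControlFrame) Validate() error {
	switch f.Type {
	case FrameResize:
		if f.Cols <= 0 || f.Rows <= 0 {
			return errors.New("resize frame needs positive cols and rows")
		}
	case FrameClose:
		if f.Reason == "" {
			return errors.New("close frame needs an explicit reason")
		}
	case FrameHeartbeat:
		// No payload required.
	default:
		return fmt.Errorf("unknown control frame type %q", f.Type)
	}
	return nil
}

// EncodeControl validates and serializes a frame for sending.
func EncodeControl(f ControlFrame) ([]byte, error) {
	if err := f.Validate(); err != nil {
		return nil, err
	}
	return json.Marshal(f)
}

// DecodeControl parses and validates a received frame.
func DecodeControl(payload []byte) (ControlFrame, error) {
	var f ControlFrame
	if err := json.Unmarshal(payload, &f); err != nil {
		return ControlFrame{}, err
	}
	if err := f.Validate(); err != nil {
		return ControlFrame{}, err
	}
	return f, nil
}
```

Validating on both encode and decode means a malformed frame fails at the boundary where it is produced or received, which makes the close reason propagation test easier to write against a fake gateway.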

Slice 4: Kubernetes Exposure And Node-Reachable Routing

Owner: infra / platform-control deploy path

Goal: expose node-facing terminal listener on a worker-node-routable path

In scope:

  • dedicated Service / port for internal node terminal websocket
  • no Traefik in the critical node stream path
  • bootstrap/runtime config for node-agent internal terminal endpoint

Out of scope: browser ingress route changes

Files likely touched:

  • infra/k8s/base/core/*
  • infra/k8s/overlays/dev-control/*
  • deploy scripts under scripts/ci/

Acceptance:

  • worker node can reach the terminal internal endpoint
  • mTLS handshake succeeds from node network
  • route does not depend on X-Forwarded-* identity propagation

Slice 5: Browser/Gateway Integration And UI State

Owner: terminal-gateway + web

Goal: keep browser contract stable while adapting gateway to the new bridge internals

In scope:

  • preserve:
      ◦ POST /api/v1/allocations/{id}/terminal-token
      ◦ WS /ws/terminal/{allocation_id}
  • browser receives typed ready, data, close, error
  • explicit close reason rendering

Out of scope: browser contract redesign

Files likely touched:

  • packages/web/src/components/terminal/TerminalPanel.tsx
  • cmd/terminal-gateway/routes.go

Acceptance:

  • browser prompt appears
  • typed key leaves browser and echoes
  • resize works
  • explicit close reason visible

Slice 6: Deployed-Environment Smoke And Failure Tests

Owner: cross-cutting

Goal: prove the redesign in the environment that exposed the failure

Required tests:

  • deployed terminal smoke:
      ◦ open terminal
      ◦ prompt appears
      ◦ type echo hi
      ◦ verify echo/output before disconnect
  • post-reimage terminal smoke
  • browser disconnect test
  • gateway restart test
  • node-agent restart test
  • wrong node cert / wrong node_id test

Files likely touched:

  • packages/web/e2e/terminal-input.spec.ts
  • CI/deploy validation scripts under scripts/ci/

Acceptance: this slice must pass before the redesign is declared done

Ordering Rules

  • do not build slice 3 before slice 2 contracts are stable
  • do not roll out slice 4 before slices 2 and 3 can be tested together in a lower-risk environment
  • do not remove the old HTTP relay path until slice 6 is passing in the deployed environment

Temporary Compatibility Strategy

Use a feature flag during migration:

  • TERMINAL_NODE_TRANSPORT=legacy_http|internal_ws

This flag is temporary and must be removed after a successful soak.

Compatibility rules:

  • browser contract must remain unchanged during migration
  • task polling/provisioning transport must remain untouched
  • do not mix old and new node stream behavior within the same live session

Default Backpressure Behavior

V1 rule:

  • bounded buffers only
  • no silent output dropping
  • if queue exceeds limit:
      ◦ close session
      ◦ emit explicit close reason
      ◦ log saturation event
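The V1 backpressure rule maps naturally onto a bounded Go channel with a non-blocking send. The sketch below is illustrative only: OutputQueue, its methods, and the close reason string are hypothetical, and the actual limit and log format are left open by the plan.

```go
package main

import (
	"log"
	"sync"
)

// CloseReasonSaturated is a hypothetical close reason string.
const CloseReasonSaturated = "output_queue_saturated"

// OutputQueue is a bounded per-session frame buffer.
type OutputQueue struct {
	frames chan []byte
	closed chan struct{}
	once   sync.Once
	reason string
}

// NewOutputQueue creates a queue that holds at most limit frames.
func NewOutputQueue(limit int) *OutputQueue {
	return &OutputQueue{
		frames: make(chan []byte, limit),
		closed: make(chan struct{}),
	}
}

// Enqueue never blocks. If the buffer is full, the session is closed with an
// explicit reason instead of the frame being dropped silently.
func (q *OutputQueue) Enqueue(frame []byte) bool {
	select {
	case <-q.closed:
		return false // session already closed
	default:
	}
	select {
	case q.frames <- frame:
		return true
	default:
		q.closeWith(CloseReasonSaturated)
		return false
	}
}

// closeWith records the reason once, logs the event, and signals closure.
func (q *OutputQueue) closeWith(reason string) {
	q.once.Do(func() {
		q.reason = reason
		log.Printf("terminal session closed: reason=%s queued=%d", reason, len(q.frames))
		close(q.closed)
	})
}

// CloseReason returns the reason if the queue has closed, or "" otherwise.
func (q *OutputQueue) CloseReason() string {
	select {
	case <-q.closed:
		return q.reason
	default:
		return ""
	}
}
```

The frame that triggers the overflow is not delivered, but the session ends with a visible reason, so the loss is never silent. A writer goroutine would drain frames and forward the close reason to both sockets.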

Risks

  • mTLS setup for the dedicated node-facing listener may expose CA/config drift again
  • gateway session ownership bugs could create split-brain live sessions
  • node-facing listener exposure may require infra changes on worker-reachable networking

Success Criteria

The redesign is successful when all of these are true in the deployed environment:

  • prompt appears without disconnect tricks
  • a typed key echoes before disconnect
  • no terminal byte path depends on ingress request/response buffering
  • node identity is validated directly by the node-facing listener
  • browser contract is unchanged
  • terminal survives normal load at the target concurrency envelope

Not Allowed

  • more incremental fixes to the old NDJSON duplex-over-HTTP relay
  • relying on Traefik forwarded client-cert headers for the node-facing terminal path
  • per-frame Redis pubsub as the steady-state bridge