Terminal WebSocket Bridge Implementation Plan v1
Status
Proposed
Build plan for:
- Terminal WebSocket Bridge Architecture v1
Purpose
Break the terminal redesign into implementation slices that can be delivered, tested, and rolled out without falling back into incident-mode changes.
Governing Decisions
- browser-facing terminal contract remains unchanged
- terminal-gateway becomes the live terminal byte bridge
- API remains session authority, not byte relay
- node-facing terminal path is a dedicated internal WebSocket over mTLS
- no per-frame Redis or DB in the hot data path
- v1 sessions are non-resumable
New ADR Required
Before or during slice 1, add a new ADR under doc/architecture/adrs/:
ADR-011-terminal-node-websocket-bridge.md
That ADR should:
- supersede the current internal HTTP relay assumption for terminal data plane
- reference:
    - ADR-005-terminal-gateway-isolation.md
    - ADR-007-terminal-access-auth-model.md
    - Terminal_WebSocket_Bridge_Architecture_v1.md
Slice Order
Slice 1: Session Authority And Broker Contract Cleanup
Owner:
- API + terminal service
Goal:
- make session binding a stable broker-owned control-state model before changing transport
In scope:
- normalize Redis session schema:
    - terminal_session:{session_id}
    - terminal_allocation_active:{allocation_id}
    - terminal_gateway_sessions:{gateway_instance_id}
- ensure terminal.open task payload carries everything node-agent needs for bridge connect
- define explicit close reasons and session states
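A minimal sketch of the slice 1 control-state model, assuming Go as in the services listed below. The three key patterns come from this plan; the state names, close reasons, and the `canTransition` helper are hypothetical placeholders until slice 1 defines the real set.

```go
package main

import "fmt"

// Key builders for the three Redis key patterns named in this plan.
func sessionKey(sessionID string) string {
	return "terminal_session:" + sessionID
}

func allocationActiveKey(allocationID string) string {
	return "terminal_allocation_active:" + allocationID
}

func gatewaySessionsKey(gatewayInstanceID string) string {
	return "terminal_gateway_sessions:" + gatewayInstanceID
}

// SessionState values are assumptions for illustration only.
type SessionState string

const (
	StatePending SessionState = "pending" // token consumed, node not yet connected
	StateActive  SessionState = "active"  // browser and node sockets are bridged
	StateClosed  SessionState = "closed"  // terminal state: v1 sessions are non-resumable
)

// CloseReason values are assumptions for illustration only.
type CloseReason string

const (
	CloseBrowserDisconnect CloseReason = "browser_disconnect"
	CloseNodeDisconnect    CloseReason = "node_disconnect"
	CloseGatewayShutdown   CloseReason = "gateway_shutdown"
)

// canTransition encodes the non-resumable rule: nothing leaves StateClosed.
func canTransition(from, to SessionState) bool {
	switch from {
	case StatePending:
		return to == StateActive || to == StateClosed
	case StateActive:
		return to == StateClosed
	default:
		return false
	}
}

func main() {
	fmt.Println(sessionKey("s-123"))
	fmt.Println(allocationActiveKey("alloc-9"))
	fmt.Println(gatewaySessionsKey("gw-1"))
	fmt.Println(canTransition(StateClosed, StateActive))
}
```

Keeping the key builders in one place makes the single-active-session-per-allocation test easier to write against a fake Redis.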
Out of scope:
- node-facing internal websocket listener
- browser UI changes
Files likely touched:
- packages/services/terminal/service.go
- cmd/api/routes.go
- doc/api/openapi.draft.yaml if contract text/fields change
Acceptance:
- unit tests for token consume + session binding creation/cleanup
- integration test for single active session per allocation
- audit/session state logs remain correct
Slice 2: Terminal-Gateway Internal Node Listener
Owner:
- terminal-gateway
Goal:
- add dedicated node-facing internal WebSocket listener with native mTLS verification
In scope:
- second listener/port in cmd/terminal-gateway
- TLS client auth using node CA
- session lookup and node identity validation
- in-process browser socket <-> node socket bridge
- first upstream/downstream frame logs
Out of scope:
- node-agent switching to the new path
Files likely touched:
- cmd/terminal-gateway/main.go
- cmd/terminal-gateway/routes.go
- packages/services/terminal/service.go
Acceptance:
- local/integration test that gateway accepts node mTLS websocket
- session ownership is registered on connect and cleared on close
- no Redis pubsub used for live frame relay on this new path
Slice 3: Node-Agent Internal WebSocket Client
Owner:
- node-agent
Goal:
- replace HTTP terminal relay client with node-facing internal WebSocket client
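A hedged sketch of the slice 2 listener-side checks, using only the Go standard library. `tls.RequireAndVerifyClientCert` and `ClientCAs` are real `crypto/tls` settings. The assumption that the node ID is carried in the client certificate's Common Name is for illustration only, and the helper names are hypothetical.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
)

// nodeListenerTLSConfig requires every client to present a certificate
// that chains to the node CA pool. The handshake fails otherwise.
func nodeListenerTLSConfig(serverCert tls.Certificate, nodeCAs *x509.CertPool) *tls.Config {
	return &tls.Config{
		Certificates: []tls.Certificate{serverCert},
		ClientCAs:    nodeCAs,
		ClientAuth:   tls.RequireAndVerifyClientCert,
		MinVersion:   tls.VersionTLS12,
	}
}

// validateNodeIdentity compares the verified leaf certificate against the
// node ID bound to the session. ASSUMPTION: node ID lives in the Common Name.
func validateNodeIdentity(peerCerts []*x509.Certificate, expectedNodeID string) error {
	if len(peerCerts) == 0 {
		return errors.New("no client certificate presented")
	}
	if got := peerCerts[0].Subject.CommonName; got != expectedNodeID {
		return fmt.Errorf("node identity mismatch: cert %q, session expects %q", got, expectedNodeID)
	}
	return nil
}

// fakePeer builds a stand-in peer chain for the demo. In a real handler the
// chain would come from the request's TLS connection state.
func fakePeer(commonName string) []*x509.Certificate {
	return []*x509.Certificate{{Subject: pkix.Name{CommonName: commonName}}}
}

func main() {
	cfg := nodeListenerTLSConfig(tls.Certificate{}, x509.NewCertPool())
	fmt.Println("client cert required:", cfg.ClientAuth == tls.RequireAndVerifyClientCert)
	fmt.Println("matching node:", validateNodeIdentity(fakePeer("node-a"), "node-a"))
	fmt.Println("wrong node:", validateNodeIdentity(fakePeer("node-b"), "node-a"))
}
```

Because identity is read from the verified peer chain on the listener itself, this path does not depend on forwarded client-cert headers.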
In scope:
- on terminal.open, node-agent opens internal websocket to gateway
- PTY byte relay over binary frames
- resize/close/heartbeat over typed control frames
- explicit non-resumable close behavior
Out of scope:
- lifecycle/task polling transport changes
Files likely touched:
- cmd/node-agent/terminal_stream.go
- cmd/node-agent/config.go
Acceptance:
- node-agent unit tests for:
    - connect success
    - wrong node identity rejection
    - close reason propagation
- PTY prompt appears and typed key echoes against a fake gateway
Slice 4: Kubernetes Exposure And Node-Reachable Routing
Owner:
- infra / platform-control deploy path
Goal:
- expose node-facing terminal listener on a worker-node-routable path
In scope:
- dedicated Service / port for internal node terminal websocket
- no Traefik in the critical node stream path
- bootstrap/runtime config for node-agent internal terminal endpoint
Out of scope:
- browser ingress route changes
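One possible shape for the slice 3 typed control frames, sketched with the Go standard library. The plan does not fix a wire format, so the JSON encoding, the field names, and the `decodeControl` helper are all assumptions. PTY bytes would travel separately as binary frames.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ControlFrame is a hypothetical envelope for resize/close/heartbeat.
// Field names are illustrative, not a confirmed contract.
type ControlFrame struct {
	Type   string `json:"type"`
	Cols   int    `json:"cols,omitempty"`
	Rows   int    `json:"rows,omitempty"`
	Reason string `json:"reason,omitempty"`
}

// decodeControl parses one control frame and rejects unknown or
// incomplete frames instead of passing them through.
func decodeControl(raw []byte) (ControlFrame, error) {
	var f ControlFrame
	if err := json.Unmarshal(raw, &f); err != nil {
		return f, fmt.Errorf("decode control frame: %w", err)
	}
	switch f.Type {
	case "resize":
		if f.Cols <= 0 || f.Rows <= 0 {
			return f, fmt.Errorf("resize needs positive cols and rows")
		}
	case "close", "heartbeat":
		// No extra required fields in this sketch.
	default:
		return f, fmt.Errorf("unknown control frame type %q", f.Type)
	}
	return f, nil
}

func main() {
	f, err := decodeControl([]byte(`{"type":"resize","cols":120,"rows":40}`))
	fmt.Println(f.Type, f.Cols, f.Rows, err)

	_, err = decodeControl([]byte(`{"type":"reattach"}`))
	fmt.Println(err)
}
```

Rejecting unknown types keeps the non-resumable rule explicit: a frame such as a reattach request fails loudly rather than being ignored.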
Files likely touched:
- infra/k8s/base/core/*
- infra/k8s/overlays/dev-control/*
- deploy scripts under scripts/ci/
Acceptance:
- worker node can reach the terminal internal endpoint
- mTLS handshake succeeds from node network
- route does not depend on X-Forwarded-* identity propagation
Slice 5: Browser/Gateway Integration And UI State
Owner:
- terminal-gateway + web
Goal:
- keep browser contract stable while adapting gateway to the new bridge internals
In scope:
- preserve:
    - POST /api/v1/allocations/{id}/terminal-token
    - WS /ws/terminal/{allocation_id}
- browser receives typed ready, data, close, error
- explicit close reason rendering
Out of scope:
- browser contract redesign
Files likely touched:
- packages/web/src/components/terminal/TerminalPanel.tsx
- cmd/terminal-gateway/routes.go
Acceptance:
- browser prompt appears
- typed key leaves browser and echoes
- resize works
- explicit close reason visible
Slice 6: Deployed-Environment Smoke And Failure Tests
Owner:
- cross-cutting
Goal:
- prove the redesign in the environment that exposed the failure
Required tests:
- deployed terminal smoke:
    - open terminal
    - prompt appears
    - type echo hi
    - verify echo/output before disconnect
- post-reimage terminal smoke
- browser disconnect test
- gateway restart test
- node-agent restart test
- wrong node cert / wrong node_id test
Files likely touched:
- packages/web/e2e/terminal-input.spec.ts
- CI/deploy validation scripts under scripts/ci/
Acceptance:
- this slice is required before declaring the redesign done
Ordering Rules
- do not build slice 3 before slice 2 contracts are stable
- do not roll out slice 4 before slices 2 and 3 can be tested together in a lower-risk environment
- do not remove the old HTTP relay path until slice 6 is passing in the deployed environment
Temporary Compatibility Strategy
Use a feature flag during migration:
TERMINAL_NODE_TRANSPORT=legacy_http|internal_ws
This flag is temporary and must be removed after successful soak.
Compatibility rules:
- browser contract must remain unchanged during migration
- task polling/provisioning transport must remain untouched
- do not mix old and new node stream behavior within the same live session
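A small sketch of how the flag and the no-mixing rule could be enforced in Go. The flag name and its two values come from this plan. Defaulting to `legacy_http` when the variable is unset, and the `session` and `openSession` names, are assumptions.

```go
package main

import (
	"fmt"
	"os"
)

// NodeTransport is the value of TERMINAL_NODE_TRANSPORT.
type NodeTransport string

const (
	TransportLegacyHTTP NodeTransport = "legacy_http"
	TransportInternalWS NodeTransport = "internal_ws"
)

// parseNodeTransport validates the flag value.
// ASSUMPTION: an unset flag falls back to the legacy path during migration.
func parseNodeTransport(v string) (NodeTransport, error) {
	switch NodeTransport(v) {
	case "":
		return TransportLegacyHTTP, nil
	case TransportLegacyHTTP, TransportInternalWS:
		return NodeTransport(v), nil
	default:
		return "", fmt.Errorf("invalid TERMINAL_NODE_TRANSPORT %q", v)
	}
}

// session pins the transport at open time so a flag change cannot mix
// old and new stream behavior inside one live session.
type session struct {
	ID        string
	Transport NodeTransport
}

func openSession(id string, current NodeTransport) session {
	return session{ID: id, Transport: current}
}

func main() {
	transport, err := parseNodeTransport(os.Getenv("TERMINAL_NODE_TRANSPORT"))
	if err != nil {
		fmt.Println("refusing to start:", err)
		return
	}
	s := openSession("s-123", transport)
	fmt.Println(s.ID, "uses", s.Transport)
}
```

Failing on an unrecognized value, rather than silently choosing a path, keeps a typo in deploy config from routing sessions over the wrong transport.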
Default Backpressure Behavior
V1 rule:
- bounded buffers only
- no silent output dropping
- if queue exceeds limit:
    - close session
    - emit explicit close reason
    - log saturation event
Risks
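The V1 rule can be sketched with a buffered Go channel and a non-blocking send. The `boundedQueue` type, the limit of 2, and the `backpressure_saturated` reason string are illustrative assumptions, not settled values.

```go
package main

import (
	"errors"
	"fmt"
)

// errSaturated signals that the bounded queue is full.
var errSaturated = errors.New("output queue saturated")

// boundedQueue holds at most `limit` pending frames for one session.
type boundedQueue struct {
	frames chan []byte
}

func newBoundedQueue(limit int) *boundedQueue {
	return &boundedQueue{frames: make(chan []byte, limit)}
}

// enqueue never blocks and never drops silently: when the buffer is full
// it returns errSaturated so the caller can close the session.
func (q *boundedQueue) enqueue(frame []byte) error {
	select {
	case q.frames <- frame:
		return nil
	default:
		return errSaturated
	}
}

func main() {
	q := newBoundedQueue(2) // hypothetical limit for the demo
	for i := 0; i < 3; i++ {
		if err := q.enqueue([]byte("pty output")); err != nil {
			// In the gateway this is where the session would be closed,
			// the explicit close reason emitted, and saturation logged.
			fmt.Println("close session, reason=backpressure_saturated:", err)
			return
		}
		fmt.Println("queued frame", i)
	}
}
```

Returning an error instead of blocking keeps a slow browser from stalling the PTY reader, and the caller decides how the close reason is surfaced.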
- mTLS setup for the dedicated node-facing listener may expose CA/config drift again
- gateway session ownership bugs could create split-brain live sessions
- node-facing listener exposure may require infra changes on worker-reachable networking
Success Criteria
The redesign is successful when all of these are true in the deployed environment:
- prompt appears without disconnect tricks
- a typed key echoes before disconnect
- no terminal byte path depends on ingress request/response buffering
- node identity is validated directly by the node-facing listener
- browser contract is unchanged
- terminal survives normal load at the target concurrency envelope
Not Allowed
- more incremental fixes to the old NDJSON duplex-over-HTTP relay
- relying on Traefik forwarded client-cert headers for the node-facing terminal path
- per-frame Redis pubsub as the steady-state bridge