Trail: Terminal & Sessions
Browser terminal + SSH access to allocations: token mint, WebSocket bridge, SSH key release, session limits, transport redesign, and the RCA that drove the gateway split.
Trail map
flowchart TB
classDef impl fill:#d1e7dd,stroke:#0a3622,color:#0a3622
classDef des fill:#fff3cd,stroke:#332701,color:#332701
classDef run fill:#e9d6ff,stroke:#1e1530,color:#1e1530
classDef rca fill:#f8d7da,stroke:#42101e,color:#42101e
T1[1. Terminal model]:::impl --> T2[2. Token mint]:::impl
T2 --> T3[3. WS bridge]:::impl
T3 --> T4[4. SSH key release]:::impl
T4 --> T5[5. Session limits]:::impl
T5 --> T6[6. Transport redesign]:::des
T6 --> T7[7. RCA: HTTP/2 buffering]:::rca
1. Terminal model
Implemented
flowchart LR
classDef edge fill:#fff3e0,stroke:#e65100
classDef cp fill:#e3f2fd,stroke:#1565c0
classDef host fill:#fce4ec,stroke:#c2185b
B[Browser<br/>xterm.js]:::edge
WAF[WAF / Gateway]:::edge
API[cmd/api]:::cp
TG[cmd/terminal-gateway]:::cp
REDIS[(Redis<br/>terminal_token:*)]:::cp
NA[cmd/node-agent]:::host
VM[VM / Bare metal<br/>PTY]:::host
B --> WAF
WAF -->|HTTPS POST /terminal-token| API
API <-->|mint/validate| REDIS
B -.WSS Sec-WebSocket-Protocol: token.-> WAF
WAF -.WSS upgrade.-> TG
TG <-->|validate token via API| API
TG <-->|stream relay HTTPS mTLS| NA
NA <-->|SSH with host terminal key| VM
A terminal session is never a direct TCP connection from the browser to the host. Traffic always passes through the gateway, which enforces the single-use token contract and brokers the SSH session using a key that resides on the host.
2. Token mint
Implemented
sequenceDiagram
autonumber
participant U as User
participant API as cmd/api
participant RL as rate limiter
participant AZ as authz
participant R as Redis
participant AUD as audit_logs
U->>API: POST /api/v1/allocations/{id}/terminal-token
API->>RL: enforce rate_limit.terminal_token_requests_per_minute (10)
alt over limit
RL-->>U: 429 rate_limit_exceeded
end
API->>AZ: user owns allocation OR is admin?
alt not allowed
AZ-->>U: 403 ownership_required
end
API->>API: generate 256-bit random token
API->>R: SETEX terminal_token:{token} 300<br/>{user_id, allocation_id, expiry}
API->>AUD: INSERT audit_logs<br/>(actor, action=terminal.token.mint, target=allocation, result=success)
API-->>U: {token, ws_url, expires_in: 300}
Token shape and rules:
| Property | Value |
|---|---|
| Format | Opaque 256-bit random base64url |
| Storage | Redis, key terminal_token:{token} |
| TTL | 300 seconds |
| Single-use | Deleted on first validation |
| Rate limit | rate_limit.terminal_token_requests_per_minute (default 10) |
| Audited | Every mint, every validation |
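Token mint is the only explicitly non-idempotent mutation in the platform (the exception noted in Coding_Standards.md §2). Every request must return a fresh token, because returning the same token for a replayed request would defeat the single-use semantics.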
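The mint and single-use rules above can be sketched in a few lines. This is a minimal illustration, not the platform's code: `InMemoryTokenStore` and `FixedWindowLimiter` are hypothetical stand-ins for Redis (`SETEX` / `GETDEL`) and the rate limiter, the fixed-window policy is an assumption, and the ownership check and audit write are omitted.

```python
import secrets
import time

TOKEN_TTL_SECONDS = 300      # TTL from the token table
MINT_LIMIT_PER_MINUTE = 10   # default of rate_limit.terminal_token_requests_per_minute


class InMemoryTokenStore:
    """Hypothetical stand-in for Redis SETEX / GETDEL on terminal_token:{token}."""

    def __init__(self):
        self._entries = {}  # key -> (payload, absolute expiry time)

    def setex(self, key, ttl, payload, now):
        self._entries[key] = (payload, now + ttl)

    def getdel(self, key, now):
        # Pop first, so the key is gone after any lookup, valid or expired.
        entry = self._entries.pop(key, None)
        if entry is None:
            return None
        payload, expires_at = entry
        return payload if now < expires_at else None


class FixedWindowLimiter:
    """Counts mints per user per 60-second window (an assumed policy).

    Old windows are never evicted here; a real limiter would expire them.
    """

    def __init__(self, limit):
        self.limit = limit
        self._counts = {}  # (user_id, window index) -> request count

    def allow(self, user_id, now):
        window = (user_id, int(now // 60))
        self._counts[window] = self._counts.get(window, 0) + 1
        return self._counts[window] <= self.limit


def mint_token(store, limiter, user_id, allocation_id, now=None):
    """Rate-limit, then mint and store a fresh single-use token."""
    now = time.time() if now is None else now
    if not limiter.allow(user_id, now):
        return {"status": 429, "error": "rate_limit_exceeded"}
    token = secrets.token_urlsafe(32)  # 32 random bytes = 256 bits, base64url
    store.setex(
        f"terminal_token:{token}",
        TOKEN_TTL_SECONDS,
        {"user_id": user_id, "allocation_id": allocation_id},
        now,
    )
    return {"status": 200, "token": token, "expires_in": TOKEN_TTL_SECONDS}
```

Because `getdel` removes the key on the first lookup, a second validation of the same token finds nothing, which is the replay case the gateway rejects.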
3. WS bridge
Implemented
cmd/terminal-gateway is a dedicated binary on a dedicated port. The reason it exists is RCA 2026-03-terminal-stream-http2-buffering (step 7).
sequenceDiagram
autonumber
participant B as Browser
participant TG as cmd/terminal-gateway
participant API as cmd/api
participant R as Redis
participant NA as cmd/node-agent
participant VM as VM/PTY
B->>TG: WSS /ws/terminal/{allocation_id}<br/>Sec-WebSocket-Protocol: <token>
Note over B,TG: NO ?token= in URL.<br/>Coding_Standards.md §8
TG->>API: validate token
API->>R: GETDEL terminal_token:{token}
R-->>API: {user_id, allocation_id, expiry}
alt not found / expired
API-->>TG: 401
TG-->>B: close 1008
end
API-->>TG: validation OK + host/private_ip + default_user
TG->>NA: open stream relay (mTLS)
NA->>VM: SSH with host terminal key<br/>(/var/lib/gpuaas/terminal/id_ed25519)
loop terminal session
B-->>TG: input bytes (WS frame)
TG-->>NA: bytes (HTTPS streaming)
NA-->>VM: stdin
VM-->>NA: stdout/stderr
NA-->>TG: bytes
TG-->>B: WS frame
end
Note over B,TG: terminal.session_max_ttl_seconds<br/>(default 14400 = 4h)<br/>enforced by gateway + node-agent
TG-->>B: close on TTL or user disconnect
Authentication on the WebSocket uses the Sec-WebSocket-Protocol header (the documented exception to header-only transport). A ?token= query parameter is not allowed anywhere.
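A minimal sketch of the gateway-side check implied by this rule. The function name `extract_terminal_token` is hypothetical, and it assumes the header carries exactly one value (the token) and that header names arrive already normalized to this casing.

```python
from urllib.parse import parse_qs, urlsplit


def extract_terminal_token(request_target, headers):
    """Return the token from Sec-WebSocket-Protocol, or None to reject."""
    query = parse_qs(urlsplit(request_target).query, keep_blank_values=True)
    if "token" in query:
        # Tokens in URLs are rejected outright, even if a header is also set.
        return None
    raw = headers.get("Sec-WebSocket-Protocol", "")
    offered = [part.strip() for part in raw.split(",") if part.strip()]
    if len(offered) != 1:
        # Assumption: exactly one subprotocol value is sent, and it is the token.
        return None
    return offered[0]
```

Under this sketch, a `None` result would map to the `close 1008` path in the diagram, the same as a token that fails validation.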
4. SSH key release
Implemented
For non-browser access, users register public keys on their allocation. Node-agent installs them into the OS user's authorized_keys during allocation.provision_user.
erDiagram
users ||--o{ ssh_public_keys : "owns"
projects ||--o{ ssh_public_keys : "scoped to (optional)"
allocations ||--o{ allocation_ssh_public_keys : "uses"
ssh_public_keys ||--o{ allocation_ssh_public_keys : "granted to"
ssh_public_keys {
uuid id PK
uuid user_id FK
uuid project_id "nullable, project scope"
text title
text key_type "ed25519|rsa|ecdsa"
text key_body "public material only"
text fingerprint
}
allocation_ssh_public_keys {
uuid allocation_id FK
uuid ssh_public_key_id FK
timestamp granted_at
text granted_by
}
Production never stores user SSH private keys server-side. Acceptance test AT-033 explicitly retires the legacy private-key download endpoint.
The host terminal SSH key (/var/lib/gpuaas/terminal/id_ed25519) is different: it is per-host, generated on the host, and used only to broker gateway terminal sessions. It is never copied off the host.
→ Sources: Allocation_Project_SSH_Access_v1.md, Allocation_Project_SSH_Access_Grants_v1.md
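As an illustration of the registration step in this section, the sketch below turns one OpenSSH public-key line into the `key_type`, `key_body`, and `fingerprint` fields of `ssh_public_keys`, and refuses anything that looks like private key material. The function name, the algorithm-to-`key_type` mapping, and the choice to store the base64 blob as `key_body` are assumptions. The fingerprint is computed the way OpenSSH displays it (SHA-256 of the decoded key blob, base64 without padding).

```python
import base64
import hashlib

# OpenSSH algorithm names mapped to the key_type values in the ER diagram.
ALLOWED_KEY_TYPES = {
    "ssh-ed25519": "ed25519",
    "ssh-rsa": "rsa",
    "ecdsa-sha2-nistp256": "ecdsa",
    "ecdsa-sha2-nistp384": "ecdsa",
    "ecdsa-sha2-nistp521": "ecdsa",
}


def parse_public_key(line):
    """Validate one authorized_keys-style line and derive its fingerprint."""
    if "PRIVATE KEY" in line:
        raise ValueError("private key material must never be uploaded")
    parts = line.strip().split(None, 2)
    if len(parts) < 2:
        raise ValueError("expected '<algorithm> <base64> [comment]'")
    algorithm, body = parts[0], parts[1]
    key_type = ALLOWED_KEY_TYPES.get(algorithm)
    if key_type is None:
        raise ValueError(f"unsupported key algorithm: {algorithm}")
    # binascii.Error (a ValueError subclass) is raised on malformed base64.
    blob = base64.b64decode(body, validate=True)
    # The blob starts with a uint32 length followed by the algorithm name.
    name_len = int.from_bytes(blob[:4], "big")
    if blob[4:4 + name_len] != algorithm.encode():
        raise ValueError("algorithm prefix does not match key blob")
    digest = hashlib.sha256(blob).digest()
    fingerprint = "SHA256:" + base64.b64encode(digest).decode().rstrip("=")
    return {"key_type": key_type, "key_body": body, "fingerprint": fingerprint}
```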
5. Session limits
Implemented
stateDiagram-v2
[*] --> connecting: WS upgrade requested
connecting --> validating: token presented
validating --> active: token valid + GETDEL
validating --> rejected: invalid/expired/replay
rejected --> [*]: WS close 1008
active --> idle_warn: idle warning timer
active --> shutting_down: user disconnect
active --> shutting_down: max_ttl reached
idle_warn --> active: input received
idle_warn --> shutting_down: idle timeout
shutting_down --> [*]: cleanup PTY + close WS
| Policy key | Default | Effect |
|---|---|---|
| terminal.session_max_ttl_seconds | 14400 (4h) | Hard cap on active session length |
| rate_limit.terminal_token_requests_per_minute | 10 | Token mint per user per minute |
The TTL is enforced at both the gateway (which drops the WS) and the node-agent (which closes the underlying SSH connection). Enforcement is independent, so a misbehaving gateway cannot extend sessions.
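The active, idle-warning, and shutdown transitions in the state diagram can be sketched as a small clock object. The class name `SessionClock` and both idle thresholds are hypothetical (the source names no idle policy keys); only the 14400-second cap comes from the policy table.

```python
import enum


class State(enum.Enum):
    ACTIVE = "active"
    IDLE_WARN = "idle_warn"
    SHUTTING_DOWN = "shutting_down"


class SessionClock:
    """Tracks one session's deadlines; times are seconds on any monotonic scale."""

    def __init__(self, started_at, max_ttl=14400, idle_warn_after=600, idle_timeout=900):
        self.started_at = started_at
        self.last_input_at = started_at
        self.max_ttl = max_ttl                  # terminal.session_max_ttl_seconds
        self.idle_warn_after = idle_warn_after  # hypothetical value
        self.idle_timeout = idle_timeout        # hypothetical value
        self._closed = False

    def on_input(self, now):
        # Input resets the idle timer, but cannot revive a closed session
        # and never moves the hard TTL deadline.
        if not self._closed:
            self.last_input_at = now

    def state(self, now):
        if self._closed or now - self.started_at >= self.max_ttl:
            self._closed = True
            return State.SHUTTING_DOWN
        idle = now - self.last_input_at
        if idle >= self.idle_timeout:
            self._closed = True
            return State.SHUTTING_DOWN
        if idle >= self.idle_warn_after:
            return State.IDLE_WARN
        return State.ACTIVE
```

One way to get the independent enforcement described above is for the gateway and the node-agent to each hold their own instance, started from their own clocks, so neither relies on the other to end the session.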
6. Transport redesign
Designed
Future state for the terminal transport: more granular flow control, multi-pane support, and session recording (a compliance ask from an external review).
flowchart TB
NOW[v1 today]
NOW --> N1[Single-pane WS<br/>byte-stream relay<br/>no recording]
NOW --> N2[xterm.js client]
NEXT[Designed evolution]
NEXT --> X1[Multi-pane / split sessions]
NEXT --> X2[Session recording<br/>compliance gate]
NEXT --> X3[Granular backpressure]
NEXT --> X4[Reconnect with state replay]
classDef now fill:#d1e7dd,stroke:#0a3622
classDef next fill:#fff3cd,stroke:#332701
class N1,N2 now
class X1,X2,X3,X4 next
→ Sources: Terminal_WebSocket_Bridge_Architecture_v1.md, Terminal_WebSocket_Bridge_Implementation_Plan_v1.md, Terminal_Node_Transport_Redesign_v1.md
7. RCA: HTTP/2 buffering
RCA
sequenceDiagram
autonumber
participant B as Browser
participant LB as Reverse proxy<br/>(HTTP/2 enabled)
participant API as cmd/api (combined)
participant N as node-agent
Note over B,LB: combined cmd/api served everything including WS
B->>LB: WSS /api/v1/allocations/{id}/terminal-stream
LB->>API: HTTP/2 upgrade
API->>N: stream relay
N-->>API: byte: "h"
API-->>LB: byte: "h"
LB-->>LB: HTTP/2 small writes BUFFERED
Note over LB: customers see "terminal hangs after<br/>a few seconds of typing"
LB-->>B: (silence)
Root cause: the reverse proxy was using HTTP/2 and buffering small writes. The single combined API+terminal binary made bypassing it awkward.
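The symptom is easy to reproduce in miniature. The sketch below is only an analogy using Python's own `io.BufferedWriter`, not a model of the proxy or of HTTP/2: it shows that single-byte writes placed behind a buffering layer do not reach the other side until something flushes them. `RecordingSink` and `relay` are hypothetical names.

```python
import io


class RecordingSink(io.RawIOBase):
    """Records each chunk that actually reaches the 'wire'."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def writable(self):
        return True

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)


def relay(keystrokes, flush_each):
    """Write keystrokes through a buffer; return what arrived before the final flush."""
    sink = RecordingSink()
    out = io.BufferedWriter(sink, buffer_size=4096)
    for key in keystrokes:
        out.write(key)
        if flush_each:
            out.flush()  # deliver immediately, as an unbuffered relay would
    delivered = list(sink.chunks)
    out.flush()
    return delivered


print(relay([b"h", b"i"], flush_each=False))  # → []
print(relay([b"h", b"i"], flush_each=True))   # → [b'h', b'i']
```

An interactive terminal produces exactly this kind of traffic (many tiny writes), which is why a buffering intermediary shows up as a hang rather than as lower throughput.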
The fix:
sequenceDiagram
autonumber
participant B as Browser
participant LB as Reverse proxy
participant TG as cmd/terminal-gateway<br/>(dedicated binary, dedicated port)
participant API as cmd/api
participant N as node-agent
B->>LB: WSS /ws/terminal/{allocation_id}
LB->>TG: HTTP/1.1 keep-alive, no buffering
TG->>API: validate token (one-time)
TG->>N: stream relay (mTLS)
N-->>TG: bytes
TG-->>LB: bytes (HTTP/1.1, immediate)
LB-->>B: bytes (no buffering)
Permanent rule documented in Coding_Standards.md:
WebSocket endpoints live behind a dedicated process; any new WS surface goes through the gateway pattern with HTTP/1.1 keep-alive and no buffering at intermediaries.
→ Source: 2026-03-terminal-stream-http2-buffering.md. Runbook: Terminal Gateway Incident
Recap
sequenceDiagram
autonumber
participant U as User
participant API as cmd/api
participant TG as cmd/terminal-gateway
participant R as Redis
participant NA as cmd/node-agent
participant VM as allocation
U->>API: POST /allocations/{id}/terminal-token
API->>R: SETEX terminal_token:* 300
API-->>U: token + ws_url
U->>TG: WSS Sec-WebSocket-Protocol: token
TG->>API: validate
API->>R: GETDEL
R-->>API: payload
API-->>TG: ok + host info
TG->>NA: stream relay
NA->>VM: SSH (host terminal key)
Note over U,VM: live PTY relay
U-->>TG: disconnect / TTL hit
TG-->>NA: tear down
NA-->>VM: close SSH