Trail: Terminal & Sessions

Browser terminal + SSH access to allocations: token mint, WebSocket bridge, SSH key release, session limits, transport redesign, and the RCA that drove the gateway split.

Trail map

flowchart TB
    classDef impl fill:#d1e7dd,stroke:#0a3622,color:#0a3622
    classDef des  fill:#fff3cd,stroke:#332701,color:#332701
    classDef run  fill:#e9d6ff,stroke:#1e1530,color:#1e1530
    classDef rca  fill:#f8d7da,stroke:#42101e,color:#42101e

    T1[1. Terminal model]:::impl --> T2[2. Token mint]:::impl
    T2 --> T3[3. WS bridge]:::impl
    T3 --> T4[4. SSH key release]:::impl
    T4 --> T5[5. Session limits]:::impl
    T5 --> T6[6. Transport redesign]:::des
    T6 --> T7[7. RCA: HTTP/2 buffering]:::rca

1. Terminal model

Implemented

flowchart LR
    classDef edge fill:#fff3e0,stroke:#e65100
    classDef cp fill:#e3f2fd,stroke:#1565c0
    classDef host fill:#fce4ec,stroke:#c2185b

    B[Browser<br/>xterm.js]:::edge
    WAF[WAF / Gateway]:::edge
    API[cmd/api]:::cp
    TG[cmd/terminal-gateway]:::cp
    REDIS[(Redis<br/>terminal_token:*)]:::cp
    NA[cmd/node-agent]:::host
    VM[VM / Bare metal<br/>PTY]:::host

    B --> WAF
    WAF -->|HTTPS POST /terminal-token| API
    API <-->|mint/validate| REDIS
    B -.WSS Sec-WebSocket-Protocol: token.-> WAF
    WAF -.WSS upgrade.-> TG
    TG <-->|validate token via API| API
    TG <-->|stream relay HTTPS mTLS| NA
    NA <-->|SSH with host terminal key| VM

A terminal session is never a direct TCP connection from the browser to the host. Traffic always passes through the gateway, which enforces the single-use token contract, and then through node-agent, which opens the SSH session using a host-resident key.

2. Token mint

Implemented

sequenceDiagram
    autonumber
    participant U as User
    participant API as cmd/api
    participant RL as rate limiter
    participant AZ as authz
    participant R as Redis
    participant AUD as audit_logs

    U->>API: POST /api/v1/allocations/{id}/terminal-token
    API->>RL: enforce rate_limit.terminal_token_requests_per_minute (10)
    alt over limit
        RL-->>U: 429 rate_limit_exceeded
    end
    API->>AZ: user owns allocation OR is admin?
    alt not allowed
        AZ-->>U: 403 ownership_required
    end
    API->>API: generate 256-bit random token
    API->>R: SETEX terminal_token:{token} 300<br/>{user_id, allocation_id, expiry}
    API->>AUD: INSERT audit_logs<br/>(actor, action=terminal.token.mint, target=allocation, result=success)
    API-->>U: {token, ws_url, expires_in: 300}

Token shape and rules:

| Property | Value |
|---|---|
| Format | Opaque 256-bit random value, base64url-encoded |
| Storage | Redis, key `terminal_token:{token}` |
| TTL | 300 seconds |
| Single-use | Deleted on first validation |
| Rate limit | `rate_limit.terminal_token_requests_per_minute` (default 10) |
| Audited | Every mint, every validation |

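Token mint is the only explicitly non-idempotent mutation in the platform (per the Coding_Standards.md §2 exception). Every call must return a fresh token, because replaying an earlier response would hand out the same token twice and defeat single-use semantics.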
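The mint and single-use rules can be sketched in Go as follows. This is a minimal illustration, not the platform's implementation: `tokenStore` is a hypothetical in-memory stand-in for Redis, and the names `mint`, `consume`, and `tokenPayload` are assumptions. `mint` plays the role of `SETEX` with a 300-second TTL, and `consume` plays the role of `GETDEL`.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
	"sync"
	"time"
)

// tokenTTLSeconds mirrors the 300-second TTL from the table above.
const tokenTTLSeconds = 300

// tokenPayload is the value stored under terminal_token:{token}.
type tokenPayload struct {
	UserID       string
	AllocationID string
	ExpiryUnix   int64
}

// tokenStore is a hypothetical in-memory stand-in for Redis.
type tokenStore struct {
	mu   sync.Mutex
	data map[string]tokenPayload
}

func newTokenStore() *tokenStore {
	return &tokenStore{data: make(map[string]tokenPayload)}
}

// mint draws 32 random bytes (256 bits), encodes them as unpadded
// base64url, and stores the payload with its expiry (the SETEX step).
func (s *tokenStore) mint(userID, allocationID string, nowUnix int64) (string, error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	token := base64.RawURLEncoding.EncodeToString(raw)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data["terminal_token:"+token] = tokenPayload{
		UserID:       userID,
		AllocationID: allocationID,
		ExpiryUnix:   nowUnix + tokenTTLSeconds,
	}
	return token, nil
}

// consume reads and deletes in one step (the GETDEL step), so a second
// call with the same token always fails.
func (s *tokenStore) consume(token string, nowUnix int64) (tokenPayload, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	key := "terminal_token:" + token
	p, ok := s.data[key]
	delete(s.data, key)
	if !ok || nowUnix >= p.ExpiryUnix {
		return tokenPayload{}, false
	}
	return p, true
}

func main() {
	store := newTokenStore()
	now := time.Now().Unix()
	token, err := store.mint("user-1", "alloc-1", now)
	if err != nil {
		panic(err)
	}
	_, first := store.consume(token, now)
	_, second := store.consume(token, now)
	fmt.Println(len(token), first, second) // 43 true false
}
```

Deleting the key before checking expiry means that even an expired token is removed on its first presentation, which keeps the replay path and the expiry path identical from the caller's point of view.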

3. WS bridge

Implemented

cmd/terminal-gateway is a dedicated binary listening on a dedicated port. It was split out of the combined cmd/api as a result of RCA 2026-03-terminal-stream-http2-buffering (step 7).

sequenceDiagram
    autonumber
    participant B as Browser
    participant TG as cmd/terminal-gateway
    participant API as cmd/api
    participant R as Redis
    participant NA as cmd/node-agent
    participant VM as VM/PTY

    B->>TG: WSS /ws/terminal/{allocation_id}<br/>Sec-WebSocket-Protocol: <token>
    Note over B,TG: NO ?token= in URL.<br/>Coding_Standards.md §8
    TG->>API: validate token
    API->>R: GETDEL terminal_token:{token}
    R-->>API: {user_id, allocation_id, expiry}
    alt not found / expired
        API-->>TG: 401
        TG-->>B: close 1008
    end
    API-->>TG: validation OK + host/private_ip + default_user
    TG->>NA: open stream relay (mTLS)
    NA->>VM: SSH with host terminal key<br/>(/var/lib/gpuaas/terminal/id_ed25519)
    loop terminal session
        B-->>TG: input bytes (WS frame)
        TG-->>NA: bytes (HTTPS streaming)
        NA-->>VM: stdin
        VM-->>NA: stdout/stderr
        NA-->>TG: bytes
        TG-->>B: WS frame
    end
    Note over B,TG: terminal.session_max_ttl_seconds<br/>(default 14400 = 4h)<br/>enforced by gateway + node-agent
    TG-->>B: close on TTL or user disconnect

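WebSocket auth carries the token in the Sec-WebSocket-Protocol header, the documented exception to header-only transport (the browser WebSocket API cannot set custom request headers such as Authorization). A ?token= query parameter is not allowed anywhere.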
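A gateway-side check for this rule might look like the sketch below. It is an assumption-laden illustration: `tokenFromUpgrade`, `newUpgradeRequest`, and the two error values are hypothetical names, and the sketch assumes the token is the first (or only) entry in the Sec-WebSocket-Protocol list. How the gateway maps these errors to a close code is not shown.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

var (
	errTokenInURL = errors.New("token must not appear in the URL")
	errNoToken    = errors.New("missing token in Sec-WebSocket-Protocol")
)

// tokenFromUpgrade extracts the terminal token from a WebSocket upgrade
// request and rejects any request that carries ?token= in the URL.
func tokenFromUpgrade(r *http.Request) (string, error) {
	if r.URL.Query().Has("token") {
		return "", errTokenInURL
	}
	// The header may hold a comma-separated list; this sketch assumes
	// the token is the first entry.
	first, _, _ := strings.Cut(r.Header.Get("Sec-WebSocket-Protocol"), ",")
	token := strings.TrimSpace(first)
	if token == "" {
		return "", errNoToken
	}
	return token, nil
}

// newUpgradeRequest builds a request for demonstration purposes only.
func newUpgradeRequest(target, protocol string) *http.Request {
	r := httptest.NewRequest(http.MethodGet, target, nil)
	if protocol != "" {
		r.Header.Set("Sec-WebSocket-Protocol", protocol)
	}
	return r
}

func main() {
	token, err := tokenFromUpgrade(newUpgradeRequest("/ws/terminal/alloc-1", "abc123"))
	fmt.Println(token, err)

	_, err = tokenFromUpgrade(newUpgradeRequest("/ws/terminal/alloc-1?token=abc123", ""))
	fmt.Println(err)
}
```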

4. SSH key release

Implemented

For non-browser access, users register public keys on their allocation. Node-agent installs them into the OS user's authorized_keys during allocation.provision_user.

erDiagram
    users ||--o{ ssh_public_keys : "owns"
    projects ||--o{ ssh_public_keys : "scoped to (optional)"
    allocations ||--o{ allocation_ssh_public_keys : "uses"
    ssh_public_keys ||--o{ allocation_ssh_public_keys : "granted to"

    ssh_public_keys {
        uuid id PK
        uuid user_id FK
        uuid project_id "nullable, project scope"
        text title
        text key_type "ed25519|rsa|ecdsa"
        text key_body "public material only"
        text fingerprint
    }
    allocation_ssh_public_keys {
        uuid allocation_id FK
        uuid ssh_public_key_id FK
        timestamp granted_at
        text granted_by
    }

Production never stores user SSH private keys server-side. Acceptance test AT-033 explicitly retires the legacy private-key download endpoint.
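Because only public material is stored, the `key_type`, `key_body`, and `fingerprint` columns can all be derived from the line a user pastes. The sketch below is one way to do that with the Go standard library, assuming input in the OpenSSH `authorized_keys` form (`<type> <base64> [comment]`) and an OpenSSH-style `SHA256:` fingerprint. `parsePublicKey` and `publicKey` are hypothetical names, and mapping the wire name (for example `ssh-ed25519`) to the short `key_type` value is left out.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"encoding/binary"
	"errors"
	"fmt"
	"strings"
)

// publicKey holds the fields derived from one authorized_keys-style line.
type publicKey struct {
	KeyType     string // wire name, e.g. "ssh-ed25519"
	KeyBody     string // base64 public blob, never private material
	Fingerprint string // "SHA256:" + unpadded base64 of the blob digest
}

func parsePublicKey(line string) (publicKey, error) {
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return publicKey{}, errors.New(`expected "<type> <base64> [comment]"`)
	}
	blob, err := base64.StdEncoding.DecodeString(fields[1])
	if err != nil {
		return publicKey{}, fmt.Errorf("key body is not valid base64: %w", err)
	}
	// The blob starts with a length-prefixed algorithm name, which must
	// match the type declared at the start of the line.
	if len(blob) < 4 {
		return publicKey{}, errors.New("key blob too short")
	}
	n := binary.BigEndian.Uint32(blob[:4])
	if uint64(n) > uint64(len(blob)-4) || string(blob[4:4+n]) != fields[0] {
		return publicKey{}, errors.New("declared key type does not match key blob")
	}
	sum := sha256.Sum256(blob)
	return publicKey{
		KeyType:     fields[0],
		KeyBody:     fields[1],
		Fingerprint: "SHA256:" + base64.RawStdEncoding.EncodeToString(sum[:]),
	}, nil
}

func main() {
	// Truncated illustrative blob (algorithm name only), not a usable key.
	key, err := parsePublicKey("ssh-ed25519 AAAAC3NzaC1lZDI1NTE5 laptop")
	if err != nil {
		panic(err)
	}
	fmt.Println(key.KeyType, key.Fingerprint)
}
```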

The host terminal SSH key (/var/lib/gpuaas/terminal/id_ed25519) is different: it is per-host, generated on the host, and used only by node-agent to open the terminal sessions that the gateway brokers. It is never copied off the host.

→ Sources: Allocation_Project_SSH_Access_v1.md, Allocation_Project_SSH_Access_Grants_v1.md

5. Session limits

Implemented

stateDiagram-v2
    [*] --> connecting: WS upgrade requested
    connecting --> validating: token presented
    validating --> active: token valid + GETDEL
    validating --> rejected: invalid/expired/replay
    rejected --> [*]: WS close 1008
    active --> idle_warn: idle warning timer
    active --> shutting_down: user disconnect
    active --> shutting_down: max_ttl reached
    idle_warn --> active: input received
    idle_warn --> shutting_down: idle timeout
    shutting_down --> [*]: cleanup PTY + close WS
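The state diagram above can be encoded as a transition table so that illegal moves are rejected rather than silently ignored. This is a sketch only: the event names are hypothetical labels for the diagram's edge text, and `closed` stands in for the terminal `[*]` node.

```go
package main

import "fmt"

type state string
type event string

const (
	connecting   state = "connecting"
	validating   state = "validating"
	rejected     state = "rejected"
	active       state = "active"
	idleWarn     state = "idle_warn"
	shuttingDown state = "shutting_down"
	closed       state = "closed" // stands in for the diagram's terminal [*]
)

// Event names are assumptions that label the edges of the diagram.
const (
	tokenPresented event = "token_presented"
	tokenValid     event = "token_valid"
	tokenInvalid   event = "token_invalid" // invalid, expired, or replayed
	wsClosed       event = "ws_closed"
	idleWarning    event = "idle_warning"
	inputReceived  event = "input_received"
	idleTimeout    event = "idle_timeout"
	userDisconnect event = "user_disconnect"
	maxTTLReached  event = "max_ttl_reached"
	cleanupDone    event = "cleanup_done"
)

var transitions = map[state]map[event]state{
	connecting:   {tokenPresented: validating},
	validating:   {tokenValid: active, tokenInvalid: rejected},
	rejected:     {wsClosed: closed},
	active:       {idleWarning: idleWarn, userDisconnect: shuttingDown, maxTTLReached: shuttingDown},
	idleWarn:     {inputReceived: active, idleTimeout: shuttingDown},
	shuttingDown: {cleanupDone: closed},
}

// next returns the state reached from s on event e, or an error when
// the diagram has no such edge.
func next(s state, e event) (state, error) {
	if to, ok := transitions[s][e]; ok {
		return to, nil
	}
	return s, fmt.Errorf("event %q not allowed in state %q", e, s)
}

func main() {
	s := connecting
	path := []event{tokenPresented, tokenValid, idleWarning, inputReceived, maxTTLReached, cleanupDone}
	for _, e := range path {
		var err error
		if s, err = next(s, e); err != nil {
			panic(err)
		}
	}
	fmt.Println(s) // closed
}
```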

| Policy key | Default | Effect |
|---|---|---|
| `terminal.session_max_ttl_seconds` | 14400 (4h) | Hard cap on active session length |
| `rate_limit.terminal_token_requests_per_minute` | 10 | Token mint per user per minute |
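The per-user mint limit could be enforced with something as small as the sketch below. The source does not say which algorithm the rate limiter uses, so the fixed one-minute window here is an assumption, as are the names `mintLimiter` and `allow`. A production limiter shared across API replicas would need shared state rather than a process-local map.

```go
package main

import (
	"fmt"
	"sync"
)

// window counts mints for one user within one calendar minute.
type window struct {
	minute int64
	count  int
}

// mintLimiter is a hypothetical fixed-window limiter keyed by user ID.
type mintLimiter struct {
	mu        sync.Mutex
	perMinute int
	seen      map[string]window
}

func newMintLimiter(perMinute int) *mintLimiter {
	return &mintLimiter{perMinute: perMinute, seen: make(map[string]window)}
}

// allow reports whether userID may mint another token at nowUnix
// (seconds since the epoch). A false result corresponds to the
// 429 rate_limit_exceeded response.
func (l *mintLimiter) allow(userID string, nowUnix int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	minute := nowUnix / 60
	w := l.seen[userID]
	if w.minute != minute {
		w = window{minute: minute} // a new minute resets the counter
	}
	if w.count >= l.perMinute {
		return false
	}
	w.count++
	l.seen[userID] = w
	return true
}

func main() {
	l := newMintLimiter(10) // default from the policy table
	allowed := 0
	for i := 0; i < 12; i++ {
		if l.allow("user-1", 120) {
			allowed++
		}
	}
	fmt.Println(allowed) // 10
}
```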

The TTL is enforced both at the gateway (which drops the WS) and at the node-agent (which closes the underlying SSH session). The two checks are independent, so a misbehaving gateway cannot extend a session.
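The value of enforcing the cap twice shows up when one side is wrong. In the sketch below, each component holds its own copy of the limit and decides on its own. The `enforcer` type and `firstToClose` helper are hypothetical, and the misconfigured gateway value is invented purely for illustration.

```go
package main

import "fmt"

// enforcer is one component that applies the session cap on its own.
type enforcer struct {
	name          string
	maxTTLSeconds int64
}

// expired reports whether this enforcer would end a session that
// started at startUnix when the clock reads nowUnix.
func (e enforcer) expired(startUnix, nowUnix int64) bool {
	return nowUnix-startUnix >= e.maxTTLSeconds
}

// firstToClose returns the name of the first enforcer that would end
// the session at nowUnix, or "" if every enforcer still allows it.
func firstToClose(startUnix, nowUnix int64, enforcers ...enforcer) string {
	for _, e := range enforcers {
		if e.expired(startUnix, nowUnix) {
			return e.name
		}
	}
	return ""
}

func main() {
	// Illustrative fault: the gateway believes the cap is 8h.
	gateway := enforcer{name: "gateway", maxTTLSeconds: 8 * 3600}
	agent := enforcer{name: "node-agent", maxTTLSeconds: 14400}

	fmt.Println(firstToClose(0, 14400, gateway, agent)) // node-agent
}
```

Even with the gateway's limit wrong, the session still ends at the 4-hour mark because the node-agent closes the SSH session underneath it.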

6. Transport redesign

Designed

Future state of the terminal transport: more granular flow control, multi-pane support, session recording (a compliance request from an external review), and reconnect with state replay.

flowchart TB
    NOW[v1 today]
    NOW --> N1[Single-pane WS<br/>byte-stream relay<br/>no recording]
    NOW --> N2[xterm.js client]

    NEXT[Designed evolution]
    NEXT --> X1[Multi-pane / split sessions]
    NEXT --> X2[Session recording<br/>compliance gate]
    NEXT --> X3[Granular backpressure]
    NEXT --> X4[Reconnect with state replay]

    classDef now fill:#d1e7dd,stroke:#0a3622
    classDef next fill:#fff3cd,stroke:#332701
    class N1,N2 now
    class X1,X2,X3,X4 next

→ Sources: Terminal_WebSocket_Bridge_Architecture_v1.md, Terminal_WebSocket_Bridge_Implementation_Plan_v1.md, Terminal_Node_Transport_Redesign_v1.md

7. RCA: HTTP/2 buffering

RCA

sequenceDiagram
    autonumber
    participant B as Browser
    participant LB as Reverse proxy<br/>(HTTP/2 enabled)
    participant API as cmd/api (combined)
    participant N as node-agent

    Note over B,LB: combined cmd/api served everything including WS
    B->>LB: WSS /api/v1/allocations/{id}/terminal-stream
    LB->>API: HTTP/2 upgrade
    API->>N: stream relay
    N-->>API: byte: "h"
    API-->>LB: byte: "h"
    LB-->>LB: HTTP/2 small writes BUFFERED
    Note over LB: customers see "terminal hangs after<br/>a few seconds of typing"
    LB-->>B: (silence)

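Root cause: the reverse proxy was using HTTP/2 and buffering small writes, so single-byte terminal output stalled at the proxy instead of reaching the browser. Because API and terminal traffic were served by one combined binary, routing terminal traffic around that proxy path was awkward.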

The fix:

sequenceDiagram
    autonumber
    participant B as Browser
    participant LB as Reverse proxy
    participant TG as cmd/terminal-gateway<br/>(dedicated binary, dedicated port)
    participant API as cmd/api
    participant N as node-agent

    B->>LB: WSS /ws/terminal/{allocation_id}
    LB->>TG: HTTP/1.1 keep-alive, no buffering
    TG->>API: validate token (one-time)
    TG->>N: stream relay (mTLS)
    N-->>TG: bytes
    TG-->>LB: bytes (HTTP/1.1, immediate)
    LB-->>B: bytes (no buffering)

Permanent rule documented in Coding_Standards.md:

WebSocket endpoints live behind a dedicated process; any new WS surface goes through the gateway pattern with HTTP/1.1 keep-alive and no buffering at intermediaries.
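The "no buffering" half of this rule also applies to the process's own writes on any streaming HTTP leg of the relay. One way to avoid holding back single keystrokes in Go is to flush after every write, as in the sketch below. `flushWriter`, `newFlushWriter`, and `relayOnce` are hypothetical names, and the sketch covers only the application side: buffering inside the reverse proxy has to be disabled in the proxy's own configuration, which is not shown.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// flushWriter pushes every write to the client immediately instead of
// letting small writes sit in a buffer.
type flushWriter struct {
	w io.Writer
	f http.Flusher
}

func (fw flushWriter) Write(p []byte) (int, error) {
	n, err := fw.w.Write(p)
	if err == nil {
		fw.f.Flush()
	}
	return n, err
}

// newFlushWriter wraps a ResponseWriter, failing if it cannot flush.
func newFlushWriter(w http.ResponseWriter) (io.Writer, error) {
	f, ok := w.(http.Flusher)
	if !ok {
		return nil, errors.New("response writer does not support flushing")
	}
	return flushWriter{w: w, f: f}, nil
}

// relayOnce writes data through a flushWriter into a test recorder and
// returns what was written and whether a flush happened.
func relayOnce(data string) (string, bool) {
	rec := httptest.NewRecorder()
	fw, err := newFlushWriter(rec)
	if err != nil {
		return "", false
	}
	if _, err := io.WriteString(fw, data); err != nil {
		return "", false
	}
	return rec.Body.String(), rec.Flushed
}

func main() {
	body, flushed := relayOnce("h")
	fmt.Println(body, flushed) // h true
}
```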

→ Source: 2026-03-terminal-stream-http2-buffering.md. Runbook: Terminal Gateway Incident

Recap

sequenceDiagram
    autonumber
    participant U as User
    participant API as cmd/api
    participant TG as cmd/terminal-gateway
    participant R as Redis
    participant NA as cmd/node-agent
    participant VM as allocation

    U->>API: POST /allocations/{id}/terminal-token
    API->>R: SETEX terminal_token:* 300
    API-->>U: token + ws_url
    U->>TG: WSS Sec-WebSocket-Protocol: token
    TG->>API: validate
    API->>R: GETDEL
    R-->>API: payload
    API-->>TG: ok + host info
    TG->>NA: stream relay
    NA->>VM: SSH (host terminal key)
    Note over U,VM: live PTY relay
    U-->>TG: disconnect / TTL hit
    TG-->>NA: tear down
    NA-->>VM: close SSH