Terminal Gateway Incident Runbook

Trigger

Terminal websocket sessions fail on the terminal-gateway runtime (/ws/terminal/{allocation_id}).
Spike in terminal token replay rejections or websocket write/upgrade failures.
Terminal gateway health checks fail, or an ingress route switch produces elevated 5xx responses or timeouts.
Terminal stream relay degradation:
  - sustained stream setup failures
  - elevated relay write/drop errors
  - abnormal session churn (rapid connect/disconnect)

Impact

Users cannot establish terminal sessions to active allocations.
Support load increases and admin operations may require gateway config rollback/redeploy.

Immediate Mitigation

Confirm the ingress route target for /ws/terminal/* points to cmd/terminal-gateway.
If customer impact is ongoing, execute the gateway rollback path:
  - revert recent gateway deployment/config changes
  - keep the /ws/terminal/* contract and gateway route unchanged
Freeze further terminal-gateway config changes until the error rate stabilizes.

Diagnosis

Check terminal-gateway process health and restart events.
Inspect gateway and ingress logs for websocket upgrade failures.
Validate terminal token consume/replay behavior and redis connectivity.
Verify network policy allows required gateway ingress/egress paths.
Confirm alert annotations map to this runbook in alert manifest/catalog.
Review terminal stream relay counters and error trends:
  - ws_notifications_write_errors_total (relay/write failure proxy)
  - terminal_token_replay_rejected_total (session control anomaly)
  - terminal stream relay service-specific counters, if enabled
Perform correlation-id-first tracing:
  - capture correlation_id from the error envelope, log, or event first
  - pivot logs, traces, and alerts on that correlation value across gateway, API, and worker paths
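
The first tracing step can be sketched as a small shell helper. The envelope shape in the example is an assumption (only the correlation_id field name comes from this runbook), and extract_correlation_id is a hypothetical helper name, not an existing script in the repository.

```bash
# Print the first "correlation_id" string value found in JSON read from stdin.
# Uses sed so it works on hosts without jq; it expects the key and its value
# to sit on the same line, which holds for compact and most pretty-printed JSON.
extract_correlation_id() {
  sed -n 's/.*"correlation_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}
```

For example, piping a saved error response through it (`extract_correlation_id < error_response.json`) yields the value to paste into the log queries. If jq is installed, a jq filter against the real envelope path is more robust than this pattern match.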

Recovery

Restore known-good ingress route and policy set.
Re-run terminal websocket smoke checks.
Confirm token mint/consume path success for new sessions.
Re-enable full terminal-gateway traffic incrementally (canary/percentage) after stabilization.
Validate terminal stream relay recovery over a soak window before full traffic restore.
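
A soak window can be approximated with a simple polling loop. This is a minimal sketch: soak_check is a hypothetical helper, the attempt count and interval are placeholders to tune, and a passing health probe is only a proxy, so the relay counters listed under Diagnosis remain the real recovery signal.

```bash
# soak_check PROBE_CMD ATTEMPTS INTERVAL_SECONDS
# Runs PROBE_CMD repeatedly, prints the number of failed attempts,
# and returns success only if every attempt passed.
soak_check() {
  local probe="$1" attempts="$2" interval="$3" i=0 failures=0
  while [ "$i" -lt "$attempts" ]; do
    # PROBE_CMD is intentionally unquoted so a command with arguments works.
    if ! $probe >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
  done
  echo "$failures"
  [ "$failures" -eq 0 ]
}
```

For example, `soak_check "curl -fsS https://gpuaas-dev-term.tailfe39f5.ts.net/healthz" 30 10` polls the dev terminal edge for roughly five minutes before traffic is fully restored.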

Post-Incident

Record cutover/rollback timestamps and impacted session counts.
Capture root cause and permanent corrective action.
Update rollout evidence: doc/operations/evidence/terminal_gateway_rollout_plan.md.
Add terminal stream relay incident notes and metric snapshots to on-call evidence log.

Correlation Lookup Workflow

Start from user-visible/API failure and extract correlation_id from the returned error envelope.
Resolve the canonical resource_name for the impacted allocation/session context.
Search terminal-gateway logs by the same correlation_id.
Pivot to exact resource_name matches to connect API, gateway, and worker evidence deterministically.
Correlate with API logs and alert fire timeline.
Confirm final incident record includes one canonical correlation_id trail and resource_name.

Canonical resource_name format:
- core42:aicloud:{region}:{tenant_id}:{project_id}:gpuaas/allocation:{allocation_id}
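
To avoid typos when pivoting on exact matches, the format above can be assembled with a helper like the following. The function name resource_name and its argument order are illustrative assumptions; only the format string comes from this runbook.

```bash
# Build the canonical resource_name for an allocation.
# Args: REGION TENANT_ID PROJECT_ID ALLOCATION_ID
resource_name() {
  if [ "$#" -ne 4 ]; then
    echo "usage: resource_name REGION TENANT_ID PROJECT_ID ALLOCATION_ID" >&2
    return 2
  fi
  printf 'core42:aicloud:%s:%s:%s:gpuaas/allocation:%s\n' "$1" "$2" "$3" "$4"
}
```

The printed string can be pasted verbatim into log searches so that API, gateway, and worker evidence is matched on the same value.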

Public Funnel / Browser WS Checks

Use this path when terminals work from a Tailscale-connected machine but fail from a non-Tailscale browser with repeated websocket errors.

Symptoms:
- Browser console shows websocket failures for /ws/terminal/{allocation_id}.
- https://gpuaas-dev-term.tailfe39f5.ts.net/healthz may be healthy.
- https://gpuaas-kind-term.tailfe39f5.ts.net/healthz may be healthy for kind demo environments.
- A direct websocket probe against gpuaas-dev-term.tailfe39f5.ts.net returns 101 Switching Protocols with a fresh terminal token.
- The deployed web bundle still references an internal/private websocket host such as wss://term.100-90-157-34.sslip.io.

Checks:
1. Verify terminal Funnel health:
   - curl -fsS https://gpuaas-dev-term.tailfe39f5.ts.net/healthz
   - curl -fsS https://gpuaas-kind-term.tailfe39f5.ts.net/healthz
2. Verify the deployed configmap has browser-facing public WS bases:
   - sudo k3s kubectl -n gpuaas-core get configmap gpuaas-core-config -o jsonpath='{.data.NEXT_PUBLIC_WS_BASE_URL}{"\n"}{.data.NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL}{"\n"}'
   - Expected for platform-control demo: wss://gpuaas-dev-term.tailfe39f5.ts.net and wss://gpuaas-dev-api.tailfe39f5.ts.net.
   - Expected for kind demo: wss://gpuaas-kind-term.tailfe39f5.ts.net and wss://gpuaas-kind-api.tailfe39f5.ts.net.
3. Verify the served frontend bundle matches the deployed config:
   - PLATFORM_CONTROL_WEB_URL=https://gpuaas-dev-app.tailfe39f5.ts.net scripts/ci/platform_control_web_runtime_assert.sh
   - For kind, rebuild/redeploy the web image with bash scripts/ops/build_kind_public_funnel_web.sh; this bakes NEXT_PUBLIC_WS_BASE_URL into the Next.js bundle.
4. If a live allocation is available, mint a terminal token and probe the public gateway. Browser-compatible token transport uses only the token value in Sec-WebSocket-Protocol; do not send ?token= and do not use query-string auth.
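
The probe in check 4 can be sketched with curl sending a manual websocket upgrade request. This is an illustration under assumptions: ws_probe and ws_probe_url are hypothetical helper names, the terminal token is assumed to have been minted already through the normal API flow, and the gateway may enforce additional headers (for example Origin) that are not shown here.

```bash
# Convert a wss:// (or ws://) base into the https:// (or http://) URL that
# curl expects, then append the terminal websocket path from this runbook.
ws_probe_url() {
  local base="$1" allocation_id="$2"
  case "$base" in
    wss://*) base="https://${base#wss://}" ;;
    ws://*)  base="http://${base#ws://}" ;;
  esac
  printf '%s/ws/terminal/%s\n' "${base%/}" "$allocation_id"
}

# Send the upgrade request and print only the HTTP status line.
# Args: WS_BASE ALLOCATION_ID TERMINAL_TOKEN
# The token travels in Sec-WebSocket-Protocol, never in the query string.
ws_probe() {
  local url key
  url="$(ws_probe_url "$1" "$2")"
  # Random 16-byte nonce, base64-encoded, as the handshake requires.
  key="$(head -c 16 /dev/urandom | base64)"
  curl -sS -i -N --http1.1 --max-time 10 \
    -H 'Connection: Upgrade' \
    -H 'Upgrade: websocket' \
    -H 'Sec-WebSocket-Version: 13' \
    -H "Sec-WebSocket-Key: ${key}" \
    -H "Sec-WebSocket-Protocol: $3" \
    "$url" | head -n 1
}
```

Running `ws_probe wss://gpuaas-dev-term.tailfe39f5.ts.net "$ALLOCATION_ID" "$TERMINAL_TOKEN"` from a non-Tailscale machine should print a 101 Switching Protocols status line when the public edge is healthy. curl may report a timeout or write error afterward because the upgraded connection stays open; only the status line matters for this check.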

Fix:
1. Update platform-control web build defaults and dev-control configmap so NEXT_PUBLIC_WS_BASE_URL points to wss://gpuaas-dev-term.tailfe39f5.ts.net.
2. Update NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL to wss://gpuaas-dev-api.tailfe39f5.ts.net.
3. Promote release/platform-control and deploy a rebuilt web runtime. A configmap-only rollout is not sufficient because Next.js public env values are baked into the browser bundle at build time.
4. Confirm non-Tailscale browsers no longer resolve terminal websocket traffic to 100.x or *.sslip.io private hosts.
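
The final confirmation can be scripted as a coarse host classifier applied to whatever websocket base the browser or bundle reports. This is a sketch: is_private_ws_host is a hypothetical helper, and the patterns simply mirror the "100.x or *.sslip.io" wording above rather than a full private-address check.

```bash
# Return success if the URL's host looks like a private terminal host
# (a 100.x address or any *.sslip.io name), failure otherwise.
is_private_ws_host() {
  local host="${1#*://}"   # drop the scheme
  host="${host%%/*}"       # drop any path
  host="${host%%:*}"       # drop any port
  case "$host" in
    100.*|*.sslip.io) return 0 ;;   # coarse match, mirrors the runbook wording
    *) return 1 ;;
  esac
}
```

For example, `is_private_ws_host wss://term.100-90-157-34.sslip.io` succeeds, which indicates the bundle still needs a rebuild, while the public wss://gpuaas-dev-term.tailfe39f5.ts.net base does not match.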

Environment helpers:
- Platform-control: bash scripts/ops/platform_control_tailscale_funnel_edges.sh start-term && bash scripts/ops/platform_control_tailscale_funnel_edges.sh verify
- Kind: bash scripts/ops/kind_tailscale_funnel_edges.sh start-term && bash scripts/ops/kind_tailscale_funnel_edges.sh verify

Important: app/api/auth public edges are not enough for browser terminals. The terminal gateway is a separate public edge because browsers must reach the websocket gateway directly.

Operator Query Cheatsheet

Use these exact queries during terminal incidents:

Terminal gateway errors by correlation:
{service="gpuaas-terminal-gateway"} | json | correlation_id="<CORRELATION_ID>" | level=~"ERROR|WARN"
API-side token/session failures by correlation:
{service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>" | code=~"token_.*|service_unavailable|internal_error"
Node-agent terminal stream events by session:
{service="gpuaas-node-agent"} | json | session_id="<SESSION_ID>"
Terminal websocket outcomes (5m):
sum(rate(terminal_gateway_ws_events_total[5m])) by (outcome, reason)
Token replay anomalies (5m):
sum(rate(terminal_token_replay_rejected_total[5m]))
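
Filling the placeholders by hand is error-prone under pressure, so the first query above can be generated with a helper such as the one below. gateway_logql is a hypothetical name, and the input check is a conservative assumption about what a correlation ID may contain.

```bash
# Print the terminal-gateway log query for one correlation ID.
# Rejects empty input and values containing quotes or backslashes,
# which would break out of the quoted label filter.
gateway_logql() {
  case "$1" in
    ''|*\"*|*\\*)
      echo "gateway_logql: invalid correlation id" >&2
      return 2
      ;;
  esac
  printf '{service="gpuaas-terminal-gateway"} | json | correlation_id="%s" | level=~"ERROR|WARN"\n' "$1"
}
```

The output of `gateway_logql "$CORRELATION_ID"` can be pasted into the log explorer as-is; the API and node-agent queries follow the same pattern with their own label filters.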

Evidence capture minimum:
- correlation_id
- trace_id (if present in details.trace_id or response X-Trace-ID)
- session_id (terminal incidents)
- final error code and mitigation action