Terminal Gateway Incident Runbook
Trigger
- Terminal websocket sessions fail on the terminal-gateway runtime (/ws/terminal/{allocation_id}).
- Spike in terminal token replay rejects or websocket write/upgrade failures.
- Terminal gateway health checks fail or ingress route switch produces elevated 5xx/timeout.
- Terminal stream relay degradation:
  - sustained stream setup failures
  - elevated relay write/drop errors
  - abnormal session churn (rapid connect/disconnect)
Impact
- Users cannot establish terminal sessions to active allocations.
- Support load increases and admin operations may require gateway config rollback/redeploy.
Immediate Mitigation
- Confirm ingress route target for /ws/terminal/* points to cmd/terminal-gateway.
- If customer impact is ongoing, execute gateway rollback path:
  - revert recent gateway deployment/config changes.
  - keep /ws/terminal/* contract and gateway route unchanged.
- Freeze further terminal-gateway config changes until error rate stabilizes.
Diagnosis
- Check terminal-gateway process health and restart events.
- Inspect gateway and ingress logs for websocket upgrade failures.
- Validate terminal token consume/replay behavior and redis connectivity.
- Verify network policy allows required gateway ingress/egress paths.
- Confirm alert annotations map to this runbook in alert manifest/catalog.
- Review terminal stream relay counters and error trends:
  - ws_notifications_write_errors_total (relay/write failure proxy)
  - terminal_token_replay_rejected_total (session control anomaly)
  - terminal stream relay service-specific counters if enabled
- Perform correlation-id-first tracing:
  - capture correlation_id from error envelope/log/event first
  - pivot logs/traces/alerts using that correlation value across gateway, API, and worker paths
Recovery
- Restore known-good ingress route and policy set.
- Re-run terminal websocket smoke checks.
- Confirm token mint/consume path success for new sessions.
- Re-enable full terminal-gateway traffic incrementally (canary/percentage) after stabilization.
- Validate terminal stream relay recovery over a soak window before full traffic restore.
Post-Incident
- Record cutover/rollback timestamps and impacted session counts.
- Capture root cause and permanent corrective action.
- Update rollout evidence: doc/operations/evidence/terminal_gateway_rollout_plan.md.
- Add terminal stream relay incident notes and metric snapshots to on-call evidence log.
Correlation Lookup Workflow
- Start from user-visible/API failure and extract correlation_id from the returned error envelope.
- Resolve the canonical resource_name for the impacted allocation/session context.
- Search terminal-gateway logs by the same correlation_id.
- Pivot to exact resource_name matches to connect API, gateway, and worker evidence deterministically.
- Correlate with API logs and alert fire timeline.
- Confirm final incident record includes one canonical correlation_id trail and resource_name.
Canonical resource_name format:
- core42:aicloud:{region}:{tenant_id}:{project_id}:gpuaas/allocation:{allocation_id}
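Exact-match pivots depend on the resource_name being assembled the same way every time. The following is a minimal sketch of building and parsing the canonical format quoted above. The function names are hypothetical, and it assumes that none of the individual fields contains a colon.

```python
import re

# Mirrors the canonical format quoted in this runbook.
# Assumption: individual fields never contain ":".
_RESOURCE_NAME_RE = re.compile(
    r"^core42:aicloud:(?P<region>[^:]+):(?P<tenant_id>[^:]+):"
    r"(?P<project_id>[^:]+):gpuaas/allocation:(?P<allocation_id>[^:]+)$"
)


def build_resource_name(region, tenant_id, project_id, allocation_id):
    """Assemble the canonical resource_name for an allocation."""
    name = (
        f"core42:aicloud:{region}:{tenant_id}:{project_id}"
        f":gpuaas/allocation:{allocation_id}"
    )
    # Reject inputs that would not survive a round trip (empty or colon-bearing).
    parse_resource_name(name)
    return name


def parse_resource_name(name):
    """Split a canonical resource_name into its fields, or raise ValueError."""
    match = _RESOURCE_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a canonical allocation resource_name: {name!r}")
    return match.groupdict()
```

Parsing before searching lets an operator catch a truncated or hand-edited value before it produces an empty log query.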
Public Funnel / Browser WS Checks
Use this path when terminals work from a Tailscale-connected machine but fail from a non-Tailscale browser with repeated websocket errors.
Symptoms:
- Browser console shows websocket failures for /ws/terminal/{allocation_id}.
- https://gpuaas-dev-term.tailfe39f5.ts.net/healthz may be healthy.
- https://gpuaas-kind-term.tailfe39f5.ts.net/healthz may be healthy for kind demo environments.
- A direct websocket probe against gpuaas-dev-term.tailfe39f5.ts.net returns 101 Switching Protocols with a fresh terminal token.
- The deployed web bundle still references an internal/private websocket host such as wss://term.100-90-157-34.sslip.io.
Checks:
1. Verify terminal Funnel health:
   - curl -fsS https://gpuaas-dev-term.tailfe39f5.ts.net/healthz
   - curl -fsS https://gpuaas-kind-term.tailfe39f5.ts.net/healthz
2. Verify the deployed configmap has browser-facing public WS bases:
   - sudo k3s kubectl -n gpuaas-core get configmap gpuaas-core-config -o jsonpath='{.data.NEXT_PUBLIC_WS_BASE_URL}{"\n"}{.data.NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL}{"\n"}'
   - Expected for platform-control demo: wss://gpuaas-dev-term.tailfe39f5.ts.net and wss://gpuaas-dev-api.tailfe39f5.ts.net.
   - Expected for kind demo: wss://gpuaas-kind-term.tailfe39f5.ts.net and wss://gpuaas-kind-api.tailfe39f5.ts.net.
3. Verify the served frontend bundle matches the deployed config:
   - PLATFORM_CONTROL_WEB_URL=https://gpuaas-dev-app.tailfe39f5.ts.net scripts/ci/platform_control_web_runtime_assert.sh
   - For kind, rebuild/redeploy the web image with bash scripts/ops/build_kind_public_funnel_web.sh; this bakes NEXT_PUBLIC_WS_BASE_URL into the Next.js bundle.
4. If a live allocation is available, mint a terminal token and probe the public gateway. Browser-compatible token transport uses only the token value in Sec-WebSocket-Protocol; do not send ?token= and do not use query-string auth.
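The probe in step 4 can be sketched with the Python standard library alone. This is an illustration under assumptions, not the gateway's documented client: the function names are hypothetical, the token is assumed to be a single header-safe value, and the gateway is assumed not to require an Origin header. The token travels only in Sec-WebSocket-Protocol and never in the URL.

```python
import base64
import os
import socket
import ssl


def build_upgrade_request(host, allocation_id, token):
    """Build a websocket upgrade request with the token in
    Sec-WebSocket-Protocol only (no ?token= query string)."""
    if not token or any(c in token for c in " ,\r\n\t"):
        raise ValueError("token must be a single header-safe value")
    # A random 16-byte nonce, base64-encoded, as the handshake requires.
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    lines = [
        f"GET /ws/terminal/{allocation_id} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        f"Sec-WebSocket-Protocol: {token}",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def probe(host, allocation_id, token, timeout=10.0):
    """Send the upgrade request over TLS and return the HTTP status line."""
    request = build_upgrade_request(host, allocation_id, token)
    context = ssl.create_default_context()
    with socket.create_connection((host, 443), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(request)
            data = b""
            while b"\r\n" not in data:
                chunk = tls.recv(4096)
                if not chunk:
                    break
                data += chunk
    return data.split(b"\r\n", 1)[0].decode("latin-1")
```

A healthy public edge is expected to answer probe("gpuaas-dev-term.tailfe39f5.ts.net", allocation_id, token) with a 101 Switching Protocols status line. Terminal tokens are replay-protected, so mint a fresh token for each probe.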
Fix:
1. Update platform-control web build defaults and dev-control configmap so NEXT_PUBLIC_WS_BASE_URL points to wss://gpuaas-dev-term.tailfe39f5.ts.net.
2. Update NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL to wss://gpuaas-dev-api.tailfe39f5.ts.net.
3. Promote release/platform-control and deploy a rebuilt web runtime. A configmap-only rollout is not sufficient because Next.js public env values are baked into the browser bundle at build time.
4. Confirm non-Tailscale browsers no longer resolve terminal websocket traffic to 100.x or *.sslip.io private hosts.
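The check in step 4 can be automated against whatever websocket base URL is found in the served bundle. The sketch below is a hedged example with a hypothetical function name. It assumes "100.x" refers to the 100.64.0.0/10 range that Tailscale assigns addresses from, and it also flags other private or loopback IP literals.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: "100.x" in this runbook means the 100.64.0.0/10 shared
# address range used for Tailscale node addresses.
_TAILNET_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_private_ws_base(url):
    """Return True if the websocket base URL points at a host that a
    non-Tailscale browser is not expected to reach."""
    host = urlsplit(url).hostname or ""
    # sslip.io names embed an IP address; the runbook treats them as private.
    if host == "sslip.io" or host.endswith(".sslip.io"):
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so treat it as an ordinary public DNS name.
        return False
    if addr.version == 4 and addr in _TAILNET_RANGE:
        return True
    return addr.is_private or addr.is_loopback
```

For example, is_private_ws_base("wss://term.100-90-157-34.sslip.io") is True, while the public Funnel base wss://gpuaas-dev-term.tailfe39f5.ts.net is not flagged.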
Environment helpers:
- Platform-control: bash scripts/ops/platform_control_tailscale_funnel_edges.sh start-term && bash scripts/ops/platform_control_tailscale_funnel_edges.sh verify
- Kind: bash scripts/ops/kind_tailscale_funnel_edges.sh start-term && bash scripts/ops/kind_tailscale_funnel_edges.sh verify
Important: app/api/auth public edges are not enough for browser terminals. The terminal gateway is a separate public edge because browsers must reach the websocket gateway directly.
Operator Query Cheatsheet
Use these exact queries during terminal incidents:
- Terminal gateway errors by correlation:
  {service="gpuaas-terminal-gateway"} | json | correlation_id="<CORRELATION_ID>" | level=~"ERROR|WARN"
- API-side token/session failures by correlation:
  {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>" | code=~"token_.*|service_unavailable|internal_error"
- Node-agent terminal stream events by session:
  {service="gpuaas-node-agent"} | json | session_id="<SESSION_ID>"
- Terminal websocket outcomes (5m):
  sum(rate(terminal_gateway_ws_events_total[5m])) by (outcome, reason)
- Token replay anomalies (5m):
  sum(rate(terminal_token_replay_rejected_total[5m]))
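Filling in the placeholders by hand is error-prone under incident pressure. The sketch below renders the queries above from a correlation or session ID. The dictionary keys and the function name are hypothetical, and the query text is copied verbatim from this cheatsheet.

```python
# Query templates copied verbatim from the cheatsheet; keys are illustrative.
QUERIES = {
    "gateway_errors": (
        '{service="gpuaas-terminal-gateway"} | json | '
        'correlation_id="<CORRELATION_ID>" | level=~"ERROR|WARN"'
    ),
    "api_failures": (
        '{service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>" | '
        'code=~"token_.*|service_unavailable|internal_error"'
    ),
    "node_agent_session": (
        '{service="gpuaas-node-agent"} | json | session_id="<SESSION_ID>"'
    ),
    "ws_outcomes_5m": (
        "sum(rate(terminal_gateway_ws_events_total[5m])) by (outcome, reason)"
    ),
    "token_replay_5m": "sum(rate(terminal_token_replay_rejected_total[5m]))",
}


def render_query(name, correlation_id="", session_id=""):
    """Substitute IDs into a cheatsheet query, rejecting unsafe values."""
    for value in (correlation_id, session_id):
        # Quotes or backslashes would break out of the string literal.
        if '"' in value or "\\" in value:
            raise ValueError(f"unsafe identifier: {value!r}")
    query = QUERIES[name]
    if "<CORRELATION_ID>" in query and not correlation_id:
        raise ValueError("correlation_id is required for this query")
    if "<SESSION_ID>" in query and not session_id:
        raise ValueError("session_id is required for this query")
    return (
        query.replace("<CORRELATION_ID>", correlation_id)
        .replace("<SESSION_ID>", session_id)
    )
```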
Evidence capture minimum:
- correlation_id
- trace_id (if present in details.trace_id or response X-Trace-ID)
- session_id (terminal incidents)
- final error code and mitigation action
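A small completeness check can keep incident records from being filed without the minimum evidence. This is a sketch only: the record is assumed to be a plain dictionary, and the key names final_error_code and mitigation_action are hypothetical labels for the last item in the list above. trace_id is treated as optional because it may not be present.

```python
# Required keys for a terminal incident record; trace_id stays optional.
REQUIRED_EVIDENCE = (
    "correlation_id",
    "session_id",
    "final_error_code",
    "mitigation_action",
)


def missing_evidence(record):
    """Return the required evidence keys that are absent or blank."""
    return [
        key
        for key in REQUIRED_EVIDENCE
        if not str(record.get(key) or "").strip()
    ]
```

An empty return value means the record meets the minimum; any listed key should be filled in before the incident is closed.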