Terminal Gateway Incident Runbook

Trigger

  • Terminal websocket sessions fail on the terminal-gateway runtime (/ws/terminal/{allocation_id}).
  • Spike in terminal token replay rejects or websocket write/upgrade failures.
  • Terminal gateway health checks fail, or an ingress route switch produces elevated 5xx responses or timeouts.
  • Terminal stream relay degradation:
      • sustained stream setup failures
      • elevated relay write/drop errors
      • abnormal session churn (rapid connect/disconnect)

Impact

  • Users cannot establish terminal sessions to active allocations.
  • Support load increases, and admin operations may require a gateway config rollback or redeploy.

Immediate Mitigation

  1. Confirm that the ingress route target for /ws/terminal/* points to cmd/terminal-gateway.
  2. If customer impact is ongoing, execute the gateway rollback path:
      • Revert recent gateway deployment/config changes.
      • Keep the /ws/terminal/* contract and gateway route unchanged.
  3. Freeze further terminal-gateway config changes until the error rate stabilizes.

Diagnosis

  1. Check terminal-gateway process health and restart events.
  2. Inspect gateway and ingress logs for websocket upgrade failures.
  3. Validate terminal token consume/replay behavior and redis connectivity.
  4. Verify network policy allows required gateway ingress/egress paths.
  5. Confirm alert annotations map to this runbook in alert manifest/catalog.
  6. Review terminal stream relay counters and error trends:
      • ws_notifications_write_errors_total (relay/write failure proxy)
      • terminal_token_replay_rejected_total (session control anomaly)
      • terminal stream relay service-specific counters, if enabled
  7. Perform correlation-id-first tracing:
      • Capture correlation_id from the error envelope/log/event first.
      • Pivot logs/traces/alerts using that correlation value across gateway, API, and worker paths.

Recovery

  1. Restore known-good ingress route and policy set.
  2. Re-run terminal websocket smoke checks.
  3. Confirm token mint/consume path success for new sessions.
  4. Re-enable full terminal-gateway traffic incrementally (canary/percentage) after stabilization.
  5. Validate terminal stream relay recovery over a soak window before full traffic restore.

Post-Incident

  • Record cutover/rollback timestamps and impacted session counts.
  • Capture root cause and permanent corrective action.
  • Update rollout evidence: doc/operations/evidence/terminal_gateway_rollout_plan.md.
  • Add terminal stream relay incident notes and metric snapshots to on-call evidence log.

Correlation Lookup Workflow

  1. Start from user-visible/API failure and extract correlation_id from the returned error envelope.
  2. Resolve the canonical resource_name for the impacted allocation/session context.
  3. Search terminal-gateway logs by the same correlation_id.
  4. Pivot to exact resource_name matches to connect API, gateway, and worker evidence deterministically.
  5. Correlate with API logs and alert fire timeline.
  6. Confirm final incident record includes one canonical correlation_id trail and resource_name.

Canonical resource_name format:

  • core42:aicloud:{region}:{tenant_id}:{project_id}:gpuaas/allocation:{allocation_id}
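
To make exact resource_name matching less error-prone during an incident, the canonical format can be built and validated programmatically. The following is a minimal sketch, not part of the platform code: the class and function names are hypothetical, and it assumes that none of the ID fields contain a colon.

```python
from dataclasses import dataclass

# Fixed segments taken from the canonical format in this runbook.
PREFIX = ("core42", "aicloud")
RESOURCE_TYPE = "gpuaas/allocation"


@dataclass(frozen=True)
class AllocationResourceName:
    """Hypothetical helper type; field names mirror the format placeholders."""
    region: str
    tenant_id: str
    project_id: str
    allocation_id: str

    def __str__(self) -> str:
        return ":".join(
            [*PREFIX, self.region, self.tenant_id, self.project_id,
             RESOURCE_TYPE, self.allocation_id]
        )


def parse_resource_name(value: str) -> AllocationResourceName:
    """Split a resource_name into its fields, rejecting anything off-format.

    Assumption: IDs never contain ':' so a plain split is unambiguous.
    """
    parts = value.split(":")
    if len(parts) != 7:
        raise ValueError(f"expected 7 colon-separated fields, got {len(parts)}")
    if tuple(parts[:2]) != PREFIX or parts[5] != RESOURCE_TYPE:
        raise ValueError("not a gpuaas allocation resource_name")
    region, tenant_id, project_id, allocation_id = (
        parts[2], parts[3], parts[4], parts[6]
    )
    if not all((region, tenant_id, project_id, allocation_id)):
        raise ValueError("empty field in resource_name")
    return AllocationResourceName(region, tenant_id, project_id, allocation_id)
```

Parsing the value before pasting it into a log search catches truncated or mistyped names early, which keeps the exact-match pivot in step 4 deterministic.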

Public Funnel / Browser WS Checks

Use this path when terminals work from a Tailscale-connected machine but fail from a non-Tailscale browser with repeated websocket errors.

Symptoms:

  • Browser console shows websocket failures for /ws/terminal/{allocation_id}.
  • https://gpuaas-dev-term.tailfe39f5.ts.net/healthz may be healthy.
  • https://gpuaas-kind-term.tailfe39f5.ts.net/healthz may be healthy for kind demo environments.
  • A direct websocket probe against gpuaas-dev-term.tailfe39f5.ts.net returns 101 Switching Protocols with a fresh terminal token.
  • The deployed web bundle still references an internal/private websocket host such as wss://term.100-90-157-34.sslip.io.

Checks:

  1. Verify terminal Funnel health:
      • curl -fsS https://gpuaas-dev-term.tailfe39f5.ts.net/healthz
      • curl -fsS https://gpuaas-kind-term.tailfe39f5.ts.net/healthz
  2. Verify the deployed configmap has browser-facing public WS bases:
      • sudo k3s kubectl -n gpuaas-core get configmap gpuaas-core-config -o jsonpath='{.data.NEXT_PUBLIC_WS_BASE_URL}{"\n"}{.data.NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL}{"\n"}'
      • Expected for platform-control demo: wss://gpuaas-dev-term.tailfe39f5.ts.net and wss://gpuaas-dev-api.tailfe39f5.ts.net.
      • Expected for kind demo: wss://gpuaas-kind-term.tailfe39f5.ts.net and wss://gpuaas-kind-api.tailfe39f5.ts.net.
  3. Verify the served frontend bundle matches the deployed config:
      • PLATFORM_CONTROL_WEB_URL=https://gpuaas-dev-app.tailfe39f5.ts.net scripts/ci/platform_control_web_runtime_assert.sh
      • For kind, rebuild/redeploy the web image with bash scripts/ops/build_kind_public_funnel_web.sh; this bakes NEXT_PUBLIC_WS_BASE_URL into the Next.js bundle.
  4. If a live allocation is available, mint a terminal token and probe the public gateway. Browser-compatible token transport uses only the token value in Sec-WebSocket-Protocol; do not send ?token= and do not use query-string auth.
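
The probe in check 4 can be approximated with a raw websocket opening handshake that carries the token only in Sec-WebSocket-Protocol. This is a sketch under stated assumptions, not the project's own tooling: the function names are hypothetical, the token is assumed to be already minted by whatever path your environment uses, and the gateway is assumed not to require an Origin header.

```python
import base64
import os
import socket
import ssl


def build_handshake_request(host: str, allocation_id: str, token: str) -> bytes:
    """Build an HTTP/1.1 websocket upgrade request for /ws/terminal/{allocation_id}.

    The token travels only in Sec-WebSocket-Protocol; no query string is used.
    """
    if not token or "," in token or any(ch.isspace() for ch in token):
        raise ValueError("token must be a single subprotocol value")
    key = base64.b64encode(os.urandom(16)).decode("ascii")  # random nonce
    lines = [
        f"GET /ws/terminal/{allocation_id} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        f"Sec-WebSocket-Protocol: {token}",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_code(response_head: bytes) -> int:
    """Return the numeric status from the first line of an HTTP response."""
    status_line = response_head.split(b"\r\n", 1)[0].decode("ascii", "replace")
    parts = status_line.split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        raise ValueError(f"unexpected status line: {status_line!r}")
    return int(parts[1])


def probe(host: str, allocation_id: str, token: str, timeout: float = 10.0) -> int:
    """Open a TLS connection on port 443, send the upgrade, return the status code.

    Only the first chunk of the response is read, which is enough for the
    status line in practice.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, 443), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(build_handshake_request(host, allocation_id, token))
            return parse_status_code(tls.recv(4096))
```

A call such as probe("gpuaas-dev-term.tailfe39f5.ts.net", allocation_id, token) returning 101 matches the symptom described above: the public gateway itself is healthy, so the fault is more likely in the websocket base URL baked into the web bundle.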

Fix:

  1. Update platform-control web build defaults and dev-control configmap so NEXT_PUBLIC_WS_BASE_URL points to wss://gpuaas-dev-term.tailfe39f5.ts.net.
  2. Update NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL to wss://gpuaas-dev-api.tailfe39f5.ts.net.
  3. Promote release/platform-control and deploy a rebuilt web runtime. A configmap-only rollout is not sufficient because Next.js public env values are baked into the browser bundle at build time.
  4. Confirm non-Tailscale browsers no longer resolve terminal websocket traffic to 100.x or *.sslip.io private hosts.
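
The confirmation in fix step 4 can be partly automated by classifying a websocket base URL as private or public. The sketch below is illustrative only: the function name is hypothetical, and it interprets "100.x" as the 100.64.0.0/10 shared address range that Tailscale assigns tailnet IPs from. It does not decode IPs embedded in sslip.io hostnames; any sslip.io host is simply flagged.

```python
import ipaddress
from urllib.parse import urlsplit

# Shared address space (CGNAT) used for Tailscale tailnet IPs.
# Assumption: this is what the runbook means by "100.x".
TAILNET_CGNAT = ipaddress.ip_network("100.64.0.0/10")


def is_private_ws_base(url: str) -> bool:
    """Return True if the URL host is one a non-Tailscale browser cannot reach."""
    host = (urlsplit(url).hostname or "").lower()
    # Any sslip.io name is treated as an internal-only edge.
    if host == "sslip.io" or host.endswith(".sslip.io"):
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; a regular DNS name is assumed to be public.
        return False
    return addr.is_private or addr in TAILNET_CGNAT
```

Running this against the values of NEXT_PUBLIC_WS_BASE_URL and NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL found in the served bundle gives a quick pass/fail signal before asking a user to retest from a non-Tailscale browser.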

Environment helpers:

  • Platform-control: bash scripts/ops/platform_control_tailscale_funnel_edges.sh start-term && bash scripts/ops/platform_control_tailscale_funnel_edges.sh verify
  • Kind: bash scripts/ops/kind_tailscale_funnel_edges.sh start-term && bash scripts/ops/kind_tailscale_funnel_edges.sh verify

Important: app/api/auth public edges are not enough for browser terminals. The terminal gateway is a separate public edge because browsers must reach the websocket gateway directly.

Operator Query Cheatsheet

Use these exact queries during terminal incidents:

  1. Terminal gateway errors by correlation:
      {service="gpuaas-terminal-gateway"} | json | correlation_id="<CORRELATION_ID>" | level=~"ERROR|WARN"
  2. API-side token/session failures by correlation:
      {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>" | code=~"token_.*|service_unavailable|internal_error"
  3. Node-agent terminal stream events by session:
      {service="gpuaas-node-agent"} | json | session_id="<SESSION_ID>"
  4. Terminal websocket outcomes (5m):
      sum(rate(terminal_gateway_ws_events_total[5m])) by (outcome, reason)
  5. Token replay anomalies (5m):
      sum(rate(terminal_token_replay_rejected_total[5m]))
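
Filling the placeholders by hand is easy to get wrong under pressure, so the log queries above can be rendered from templates. This is a small sketch with hypothetical names (TEMPLATES, render_query); the ID character set it accepts is an assumption chosen to keep quotes and pipes out of the query, not a documented format.

```python
import re

# Conservative allow-list for IDs (assumption; adjust to the real ID format).
SAFE_ID = re.compile(r"[A-Za-z0-9._:-]+")

# Query text copied from the cheatsheet, with its placeholders left intact.
TEMPLATES = {
    "gateway_errors": (
        '{service="gpuaas-terminal-gateway"} | json '
        '| correlation_id="<CORRELATION_ID>" | level=~"ERROR|WARN"'
    ),
    "api_token_failures": (
        '{service="gpuaas-api"} | json '
        '| correlation_id="<CORRELATION_ID>" '
        '| code=~"token_.*|service_unavailable|internal_error"'
    ),
    "node_agent_session": (
        '{service="gpuaas-node-agent"} | json | session_id="<SESSION_ID>"'
    ),
}


def render_query(name: str, **ids: str) -> str:
    """Substitute <CORRELATION_ID> / <SESSION_ID> style placeholders.

    Keyword names map to placeholders by upper-casing, for example
    correlation_id -> <CORRELATION_ID>.
    """
    query = TEMPLATES[name]
    for key, value in ids.items():
        if not SAFE_ID.fullmatch(value):
            raise ValueError(f"unsafe value for {key}: {value!r}")
        query = query.replace(f"<{key.upper()}>", value)
    if re.search(r"<[A-Z_]+>", query):
        raise ValueError(f"unfilled placeholder in query {name!r}")
    return query
```

For example, render_query("node_agent_session", session_id=sid) yields a string that can be pasted directly into the log explorer, and a missing or malformed ID fails loudly instead of producing a query that silently matches nothing.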

Evidence capture minimum:

  • correlation_id
  • trace_id (if present in details.trace_id or response X-Trace-ID)
  • session_id (terminal incidents)
  • final error code and mitigation action
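
The evidence minimum above can be assembled into a single record so that nothing is forgotten when the incident is written up. The sketch below is hedged: the function name is hypothetical, and the error envelope shape (top-level correlation_id and code, plus an optional details object) is an assumption inferred from the field names used in this runbook, not a confirmed API contract.

```python
from typing import Any, Dict, Mapping, Optional


def build_evidence(
    envelope: Mapping[str, Any],
    response_headers: Mapping[str, str],
    session_id: str,
    mitigation: str,
) -> Dict[str, Optional[str]]:
    """Collect the minimum evidence fields for a terminal incident.

    trace_id is optional: details.trace_id is preferred, with the
    X-Trace-ID response header as a fallback.
    """
    details = envelope.get("details") or {}
    # HTTP header names are case-insensitive, so normalize before lookup.
    headers = {name.lower(): value for name, value in response_headers.items()}
    record = {
        "correlation_id": envelope.get("correlation_id"),  # assumed field name
        "trace_id": details.get("trace_id") or headers.get("x-trace-id"),
        "session_id": session_id,
        "error_code": envelope.get("code"),  # assumed field name
        "mitigation": mitigation,
    }
    required = ("correlation_id", "session_id", "error_code", "mitigation")
    missing = [field for field in required if not record[field]]
    if missing:
        raise ValueError(f"missing evidence fields: {', '.join(missing)}")
    return record
```

Raising on missing required fields is a deliberate choice here: an incomplete record is easier to fix while the incident is live than during the post-incident review.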