Runbook: Proxied App UI Incident

Trigger

  1. A proxied workload UI such as JupyterLab opens to a blank page.
  2. A proxied platform UI such as Redoc or Grafana intermittently fails until the browser is refreshed or a private window is used.
  3. The Open ... action succeeds (the tab opens), but the origin page still shows a generic launcher error.
  4. A workload or platform proxy route works in kind but fails in platform-control.

Scope

This runbook covers:
  - app proxy routes under /backend/w/... and /w/...
  - platform proxy routes under /backend/p/... and /p/...
  - browser-session minting
  - proxy launcher behavior
  - proxied HTML/bootstrap asset failures
  - node-agent drift that causes workload launch/runtime divergence

Work through this runbook before treating the problem as an upstream app bug.

Required Context

  1. correlation_id from the visible error, if one exists
  2. project_id
  3. app_instance_id or proxied service name
  4. exact browser URL the user opened
  5. whether the failure is:
     - 404
     - 401 token_missing
     - blank page
     - false launcher error after a successful tab open

Fast Classification

A. Raw path misuse

Symptoms:
  - the app host is opened with /p/... or /w/... instead of /backend/p/... or /backend/w/...
  - the API returns token_missing

Interpretation:
  - the user bypassed the intended browser-session/public prefix path

Action:
  1. verify the canonical URL form:
     - app host: /backend/p/... and /backend/w/...
     - api host: /p/... and /w/...
  2. prefer the UI Open ... button over pasted raw protected paths
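The canonical-form rule above can be checked mechanically. The following is a minimal sketch, not part of the platform tooling: the function name and the "app"/"api" labels are made up for illustration, and it only encodes the prefix rule stated in this section.

```shell
# canonical_proxy_path HOST_KIND PATH
# HOST_KIND is "app" or "api" (labels chosen for this sketch only).
# Prints the canonical path for that host kind, per the rule above.
canonical_proxy_path() {
  kind=$1
  path=$2
  # Strip an existing /backend prefix so both input forms normalize alike.
  case $path in
    /backend/*) bare=${path#/backend} ;;
    *)          bare=$path ;;
  esac
  # Only /p/... and /w/... are proxy routes.
  case $bare in
    /p/*|/w/*) ;;
    *) echo "not a proxy route: $path" >&2; return 1 ;;
  esac
  case $kind in
    app) printf '%s\n' "/backend$bare" ;;
    api) printf '%s\n' "$bare" ;;
    *)   echo "unknown host kind: $kind" >&2; return 2 ;;
  esac
}
```

For example, `canonical_proxy_path app /w/<instance>/web/` prints the /backend/w/... form, which is the path a user should have opened on the app host.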

B. HTML loads but page is blank

Symptoms:
  - title loads correctly
  - blank page after initial render
  - browser console or network tab shows one or more JS assets failing

Interpretation:
  - routing/base path is close to correct
  - one or more bootstrap assets are not loading

Common causes:
  1. browser-session cookie is only valid on /backend/... while the app emits /w/... or /p/... asset paths
  2. stale browser state from an earlier broken proxy/config state
  3. app emits runtime-relative asset paths not covered by the current proxy cookie scope
  4. HTML rewrite changed an inline bootstrap script but the CSP hash was not recomputed
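Causes 1 and 3 both come down to cookie path scope. As a rough offline check, the sketch below applies the standard cookie path-match rule (RFC 6265, section 5.1.4) to a Netscape-format cookie jar such as one written by `curl -c`. The function names are hypothetical, and the check looks only at the path column, ignoring domain, expiry, and the Secure flag.

```shell
# cookie_path_matches COOKIE_PATH REQUEST_PATH
# Returns 0 when a cookie scoped to COOKIE_PATH would be sent for REQUEST_PATH.
cookie_path_matches() {
  c=$1
  r=$2
  [ "$c" = "$r" ] && return 0
  case $r in
    "$c"*)
      # A prefix match only counts on a "/" boundary.
      case $c in */) return 0 ;; esac
      rest=${r#"$c"}
      case $rest in /*) return 0 ;; esac
      ;;
  esac
  return 1
}

# jar_covers_path COOKIE_JAR REQUEST_PATH
# Returns 0 when at least one cookie in the jar path-matches REQUEST_PATH.
jar_covers_path() {
  jar=$1
  rpath=$2
  # Column 3 of each tab-separated entry is the cookie path; header
  # comment lines have fewer than 7 fields and are skipped.
  awk -F '\t' 'NF >= 7 { print $3 }' "$jar" | (
    covered=1
    while IFS= read -r cpath; do
      if cookie_path_matches "$cpath" "$rpath"; then covered=0; fi
    done
    exit "$covered"
  )
}
```

If `jar_covers_path cookies.txt /backend/w/<instance>/web` succeeds but the same call for an emitted /w/<instance>/web/static/... asset path fails, the cookie scope is the likely cause of the blank page.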

C. Workload is healthy but UI shows deploying or failed

Symptoms:
  - container or process is already running
  - control plane still shows deploying or failed

Interpretation:
  - likely an interrupted node task completion after the side effect already happened

Action:
  1. inspect recent status probe evidence
  2. verify app-runtime reconciled the workload back to observed truth

D. Static platform tool fails before opening

Symptoms:
  - Proxy launch failed
  - A project must be selected to access this resource
  - the route is Grafana, Temporal UI, Swagger, or Redoc

Interpretation:
  - adapter scope strategy is wrong for an admin-global / org-level tool

Action:
  1. verify the proxied service is configured as org_only
  2. verify browser-session mint succeeds without X-Project-ID
  3. do not debug this as an upstream app problem first

Primary Commands

Platform service proxy smoke

make ops-platform-proxy-smoke PROXY_PATH=/backend/p/grafana/ SERVICE=grafana EXPECT_TITLE=Grafana
make ops-platform-proxy-smoke PROXY_PATH=/backend/p/redoc SERVICE=redoc EXPECT_TITLE=ReDoc

Workload app proxy smoke

APP_PUBLIC_URL=https://gpuaas-dev-app.tailfe39f5.ts.net \
AUTH_PUBLIC_URL=https://gpuaas-dev-auth.tailfe39f5.ts.net \
CONTROL_HOST=hpcadmin@100.90.157.34 \
bash scripts/ops/smoke_app_proxy.sh \
  --instance <app-instance-id> \
  --endpoint web \
  --expect-title JupyterLab

Or via make:

make ops-app-proxy-smoke INSTANCE=<app-instance-id> ENDPOINT=web EXPECT_TITLE=JupyterLab

What success means:
  - http_code=200
  - expected title present
  - upstream_route_status=200

If KEEP_ARTIFACTS=true is set, inspect:
  - body.html
  - cookies.txt
  - headers.txt
  - any referenced JS asset path
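To find the referenced JS asset paths in a kept body.html without opening it in a browser, something like the following may help. It is a minimal sketch: the function name is hypothetical, it only recognizes double-quoted src/href attributes, and it will miss assets injected at runtime.

```shell
# list_js_assets HTML_FILE
# Prints each distinct src/href value that ends in .js
# (optionally followed by a query string).
list_js_assets() {
  grep -oE '(src|href)="[^"]+\.js(\?[^"]*)?"' "$1" \
    | sed -E 's/^(src|href)="//; s/"$//' \
    | sort -u
}

# Possible use against the smoke artifacts (host placeholder left as-is):
#   list_js_assets body.html | while IFS= read -r p; do
#     curl -sk -b cookies.txt -o /dev/null -w "%{http_code} $p\n" "https://<app-host>$p"
#   done
```

Any path that returns 401 or 404 while the HTML itself returned 200 is a candidate for the blank-page classification above.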

Correlation-First Checks

API logs

Search by correlation_id and app instance:
  - browser-session mint
  - app proxy request path
  - auth miss / auth success
  - upstream response status

Useful markers:
  - proxy debug app mint request
  - proxy debug app mint issued
  - app proxy upstream request failed
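If the API logs are available as a plain-text file, the correlation-first search can be sketched as a two-stage filter. The log location and line format are assumptions here; only the three marker strings come from this runbook, and the function name is hypothetical.

```shell
# proxy_markers_for CORRELATION_ID LOG_FILE
# Prints log lines that mention the correlation id AND one of the
# proxy markers listed above. Matching is literal (grep -F).
proxy_markers_for() {
  grep -F -- "$1" "$2" | grep -F \
    -e 'proxy debug app mint request' \
    -e 'proxy debug app mint issued' \
    -e 'app proxy upstream request failed'
}
```

A mint request with no matching mint issued line for the same correlation_id points at the browser-session layer rather than the upstream app.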

App-runtime worker

Use when control-plane lifecycle state and runtime truth differ:
  - look for stalled deploying
  - look for ambiguous recent launch_failed
  - confirm status probes were enqueued and applied

Node-agent

Use when the app launch command may be wrong or missing required base-path args:
  - verify the reported build/version
  - verify the actual launched container/process arguments
  - confirm the node is not still on agent=dev / commit=unknown

JupyterLab-Specific Checks

  1. Verify the proxied HTML contains:
     - baseUrl: "/w/<instance>/web/"
     - asset references under /w/<instance>/web/static/...
  2. Verify the browser-session cookie is valid on both:
     - /backend/w/<instance>/web
     - /w/<instance>/web
  3. Verify Jupyter was launched with:
     - --ServerApp.base_url=/w/<instance>/web/
     - --ServerApp.trust_xheaders=True
  4. Verify a proxied asset request succeeds:

curl -sk -b cookies.txt \
  "https://<app-host>/w/<instance>/web/static/lab/main....js" -I

If HTML returns 200 but the JS asset returns 401, the user will see a blank page.

If launcher mint succeeds but the resolved open_url lands on the public API host instead of the app host:
  1. verify whether both app-host and api-host public bases are configured
  2. verify request-aware public base selection prefers https://<app-host>/backend/w/<instance>/web/lab
  3. treat https://<api-host>/w/<instance>/web/lab as a platform defect in browser-session URL selection
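The open_url rule above can be expressed as a small check on the URL string. This is a sketch under the assumption that the app host name is known to the operator; the function name, output wording, and return codes are made up for illustration.

```shell
# check_open_url OPEN_URL APP_HOST
# Returns 0 when OPEN_URL is on the app host under /backend/w/ or /backend/p/,
# 1 when it is on the app host without the /backend prefix,
# 2 when it is on any other host (for example the public API host).
check_open_url() {
  url=$1
  app_host=$2
  case $url in
    "https://$app_host/backend/w/"*|"https://$app_host/backend/p/"*)
      echo "ok: app-host URL with /backend prefix" ;;
    "https://$app_host/"*)
      echo "suspect: app host without /backend prefix"; return 1 ;;
    *)
      echo "defect: open_url is not on the app host"; return 2 ;;
  esac
}
```

A return code of 2 corresponds to the platform-defect case in step 3; it should be raised against browser-session URL selection rather than worked around in the browser.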

Netdata-Specific Checks

Modern Netdata agents use the V3 dashboard entrypoint even when the product version still reports 2.x.

Checks:
  1. verify the node is not still on an old distro-packaged Netdata build
  2. verify the browser-session open URL uses /backend/p/netdata/.../v3/ for modern agents
  3. do not assume URL path generation matches product semver

If kind and platform-control disagree on Netdata behavior, verify the Netdata package/channel version on the underlying nodes before changing proxy logic.

Temporal-Specific Checks

If Temporal UI loads HTML but stays blank:
  1. inspect the browser console for CSP violations
  2. verify the proxied HTML CSP hash matches the rewritten inline bootstrap script
  3. verify one referenced /backend/p/temporal/_app/... asset returns 200
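For step 2, a CSP hash source is the base64-encoded SHA-256 digest of the exact inline script text, so it can be recomputed by hand. The sketch below assumes openssl is installed and that the operator has already saved the inline script body, byte for byte, to a file. Both function names are hypothetical.

```shell
# csp_sha256
# Reads an inline script body on stdin and prints the matching
# CSP source expression, e.g. 'sha256-<base64 digest>'.
csp_sha256() {
  printf "'sha256-%s'\n" "$(openssl dgst -sha256 -binary | openssl base64 -A)"
}

# csp_header_has SOURCE_EXPRESSION HEADER_VALUE
# Returns 0 when the header value contains the source expression.
csp_header_has() {
  case $2 in
    *"$1"*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Possible use (file names are placeholders):
#   expected=$(csp_sha256 < inline-bootstrap.js)
#   csp_header_has "$expected" "$(grep -i '^content-security-policy:' headers.txt)"
```

Whitespace matters: a single changed byte from the HTML rewrite produces a different hash, which is exactly the failure mode this check is meant to surface.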

Stale Browser State Recovery

The preferred model is automatic recovery, not operator instructions to use incognito.

Current launcher behavior:
  1. mint browser-session
  2. fetch proxied HTML
  3. fetch one referenced JS asset
  4. if bootstrap fails, mint once more and retry
  5. redirect only after bootstrap succeeds
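The control flow of those five steps can be sketched as follows. This is an illustration of the sequence, not the launcher's actual code: mint_session, fetch_html, and fetch_asset are hypothetical hooks the caller must define, and the sketch retries once after a failure in any step, which is a simplification.

```shell
# launch_with_retry
# Runs mint -> fetch HTML -> fetch one JS asset, at most twice.
# Prints "redirect" and returns 0 only after a full bootstrap succeeds.
# Requires the caller to define mint_session, fetch_html and fetch_asset
# (hypothetical hooks), each returning 0 on success.
launch_with_retry() {
  attempt=1
  while [ "$attempt" -le 2 ]; do
    if mint_session && fetch_html && fetch_asset; then
      echo "redirect"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  echo "launch_failed"
  return 1
}
```

The key property is that the redirect happens only after the asset fetch has succeeded, so a stale session is replaced by the second mint before the user ever sees a blank page.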

Operator guidance:
  - if the shared launcher already recovers, do not ask users to clear cookies first
  - only fall back to a private window or site-data clearing when validating whether stale browser state was the cause

Node Drift Checks

In /admin/nodes, treat these as drift:
  - Unknown build
  - Outdated
  - Config drift
  - agent=dev
  - commit=unknown
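When node status is available as text (for example copied from the admin view or an export), the drift markers above can be matched with a simple pattern check. The input format is an assumption and the function name is hypothetical; only the five marker strings come from this runbook.

```shell
# node_has_drift STATUS_TEXT
# Returns 0 when the text contains any of the drift markers listed above.
node_has_drift() {
  case $1 in
    *"Unknown build"*|*Outdated*|*"Config drift"*|*agent=dev*|*commit=unknown*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```

A node that matches should be brought to a real build before any proxy-level debugging continues on workloads scheduled there.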

If a proxied workload behaves differently across environments or nodes, verify the target node reports a real commit/build before debugging the proxy further.

Recovery Paths

Blank page due to asset auth/path mismatch

  1. verify smoke succeeds on HTML and route
  2. verify a referenced JS asset path
  3. if asset path escapes the browser-visible prefix, extend cookie scope or route handling for the runtime prefix
  4. redeploy API/web as needed

Stuck deploying or recent ambiguous failed

  1. confirm runtime truth on the node
  2. confirm app-runtime status probe healed the instance
  3. if not, investigate reconcile selection and node-task completion evidence

False launcher error in the origin page

  1. verify the popup/new tab actually opened and the proxied app loaded
  2. fix the launcher/origin-tab flow; do not treat this as an upstream app failure

Escalation Rule

If the only way to make the app work is:
  - app-specific ad hoc cookie behavior outside the adapter model,
  - route-shape hacks only for one environment,
  - or manual browser-state instructions as the normal fix,

stop and raise a platform defect. The fix belongs in the shared proxy adapter, browser-session layer, or launcher model.