Runbook: Proxied App UI Incident
Trigger
- A proxied workload UI such as JupyterLab opens to a blank page.
- A proxied platform UI such as Redoc or Grafana intermittently fails until the browser is refreshed or a private window is used.
- Open ... succeeds but the origin page still shows a generic launcher error.
- A workload or platform proxy route works in kind but fails in platform-control.
Scope
This runbook covers:
- app proxy routes under /backend/w/... and /w/...,
- platform proxy routes under /backend/p/... and /p/...,
- browser-session minting,
- proxy launcher behavior,
- proxied HTML/bootstrap asset failures,
- node-agent drift that causes workload launch/runtime divergence.
Use this before treating the problem as an upstream app bug.
Required Context
- correlation_id from the visible error, if one exists
- project_id
- app_instance_id or proxied service name
- exact browser URL the user opened
- whether the failure is:
  - 404
  - 401 token_missing
  - blank page
  - false launcher error after a successful tab open
Fast Classification
A. Raw path misuse
Symptoms:
- app host is opened with /p/... or /w/... instead of /backend/p/... or /backend/w/...
- API returns token_missing
Interpretation:
- the user bypassed the intended browser-session/public prefix path
Action:
1. verify the canonical URL form:
- app host: /backend/p/... and /backend/w/...
- api host: /p/... and /w/...
2. prefer the UI Open ... button over pasted raw protected paths
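The canonical URL forms above can be sketched as a small helper for checking a pasted path. This is a minimal sketch: the function name canonical_proxy_path and the "app"/"api" host labels are hypothetical and not part of the platform tooling.

```shell
# Hypothetical helper: map a proxy path to the canonical form for a host kind.
# host kind "app" expects /backend/p/... or /backend/w/...
# host kind "api" expects /p/... or /w/...
canonical_proxy_path() {
  local host_kind="$1" path="$2" bare
  # Strip an existing /backend prefix so both input shapes normalize alike.
  bare=${path#/backend}
  case "$bare" in
    /p/*|/w/*) ;;
    *) echo "not a proxy path: $path" >&2; return 1 ;;
  esac
  case "$host_kind" in
    app) printf '/backend%s\n' "$bare" ;;
    api) printf '%s\n' "$bare" ;;
    *) echo "unknown host kind: $host_kind" >&2; return 1 ;;
  esac
}
```

If the path the user opened differs from the helper's output for that host, the raw path misuse classification likely applies.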
B. HTML loads but page is blank
Symptoms:
- title loads correctly
- blank page after initial render
- browser console or network shows one or more JS assets failing
Interpretation:
- routing/base path is close to correct
- one or more bootstrap assets are not loading
Common causes:
1. browser-session cookie only valid on /backend/... while the app emits /w/... or /p/... asset paths
2. stale browser state from an earlier broken proxy/config state
3. app emits runtime-relative asset paths not covered by current proxy cookie scope
4. HTML rewrite changed an inline bootstrap script but the CSP hash was not recomputed
C. Workload is healthy but UI shows deploying or failed
Symptoms:
- container or process is already running
- control plane still shows deploying or failed
Interpretation:
- likely interrupted node task completion after the side effect already happened
Action:
1. inspect recent status probe evidence
2. verify app-runtime reconciled the workload back to observed truth
D. Static platform tool fails before opening
Symptoms:
- Proxy launch failed
- A project must be selected to access this resource
- the route is Grafana, Temporal UI, Swagger, or Redoc
Interpretation:
- adapter scope strategy is wrong for an admin-global / org-level tool
Action:
1. verify the proxied service is configured as org_only
2. verify browser-session mint succeeds without X-Project-ID
3. do not debug this as an upstream app problem first
Primary Commands
Platform service proxy smoke
make ops-platform-proxy-smoke PROXY_PATH=/backend/p/grafana/ SERVICE=grafana EXPECT_TITLE=Grafana
make ops-platform-proxy-smoke PROXY_PATH=/backend/p/redoc SERVICE=redoc EXPECT_TITLE=ReDoc
Workload app proxy smoke
APP_PUBLIC_URL=https://gpuaas-dev-app.tailfe39f5.ts.net \
AUTH_PUBLIC_URL=https://gpuaas-dev-auth.tailfe39f5.ts.net \
CONTROL_HOST=hpcadmin@100.90.157.34 \
bash scripts/ops/smoke_app_proxy.sh \
--instance <app-instance-id> \
--endpoint web \
--expect-title JupyterLab
Or via make:
What success means:
- http_code=200
- expected title present
- upstream_route_status=200
If KEEP_ARTIFACTS=true is set, inspect:
- body.html
- cookies.txt
- headers.txt
- any referenced JS asset path
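To find the referenced JS asset paths in a kept body.html, a sketch like the following may help. The function name list_js_assets is hypothetical, and the pattern assumes double-quoted src attributes on script tags, which may not hold for every app.

```shell
# Hypothetical helper: list script src paths referenced by a saved HTML body.
# Assumes src attributes are double-quoted; inline scripts are skipped.
list_js_assets() {
  local html_file="$1"
  grep -o '<script[^>]*src="[^"]*"' "$html_file" \
    | sed 's/.*src="\([^"]*\)".*/\1/'
}
```

Each printed path can then be requested with the cookies from cookies.txt to confirm whether the asset loads under the same browser session.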
Correlation-First Checks
API logs
Search by correlation_id and app instance:
- browser-session mint
- app proxy request path
- auth miss / auth success
- upstream response status
Useful markers:
- proxy debug app mint request
- proxy debug app mint issued
- app proxy upstream request failed
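As an illustration, the markers above can be filtered from exported log text for a single correlation_id. This sketch assumes the correlation_id appears verbatim on each relevant line; the function name filter_proxy_logs and the log line shape are assumptions, not the platform's actual log schema.

```shell
# Hypothetical helper: read log text on stdin and print only lines that
# mention the given correlation_id and carry one of the known proxy markers.
filter_proxy_logs() {
  local correlation_id="$1"
  grep -F -- "$correlation_id" \
    | grep -E 'proxy debug app mint (request|issued)|app proxy upstream request failed'
}
```

A mint request without a matching mint issued line for the same correlation_id points at the browser-session layer rather than the upstream app.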
App-runtime worker
Use when control-plane lifecycle state and runtime truth differ:
- look for stalled deploying
- look for ambiguous recent launch_failed
- confirm status probes were enqueued and applied
Node-agent
Use when the app launch command may be wrong or missing required base-path args:
- verify reported build/version
- verify the actual launched container/process arguments
- confirm the node is not still on agent=dev / commit=unknown
JupyterLab-Specific Checks
- Verify the proxied HTML contains:
  - baseUrl: "/w/<instance>/web/"
  - asset references under /w/<instance>/web/static/...
- Verify the browser-session cookie is valid on both:
  - /backend/w/<instance>/web
  - /w/<instance>/web
- Verify Jupyter was launched with:
  - --ServerApp.base_url=/w/<instance>/web/
  - --ServerApp.trust_xheaders=True
- Verify a proxied asset request succeeds:
If HTML returns 200 but the JS asset returns 401, the user will see a blank page.
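The cookie scope check can be sketched against a curl cookie jar such as the cookies.txt kept by the smoke script. This is a minimal sketch assuming the jar is in curl's tab-separated Netscape format with the path in the third field; the function names are hypothetical, and the matching follows the RFC 6265 path-match rule rather than any platform-specific logic.

```shell
# RFC 6265 path-match: does cookie_path cover request_path?
cookie_path_matches() {
  local cookie_path="$1" request_path="$2" rest
  [ "$cookie_path" = "$request_path" ] && return 0
  case "$request_path" in
    "$cookie_path"*)
      # A prefix match only counts at a "/" boundary.
      case "$cookie_path" in */) return 0 ;; esac
      rest=${request_path#"$cookie_path"}
      case "$rest" in /*) return 0 ;; esac
      ;;
  esac
  return 1
}

# Succeeds if any cookie in the jar has a path that covers request_path.
jar_covers_path() {
  local jar="$1" request_path="$2" line path prefix='#HttpOnly_'
  while IFS= read -r line || [ -n "$line" ]; do
    line=${line#"$prefix"}          # curl marks HttpOnly cookies this way
    case "$line" in ''|'#'*) continue ;; esac
    path=$(printf '%s\n' "$line" | cut -f3)
    if cookie_path_matches "$path" "$request_path"; then return 0; fi
  done < "$jar"
  return 1
}
```

A jar that covers /backend/w/<instance>/web but not /w/<instance>/web is consistent with the HTML 200, asset 401 pattern.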
If launcher mint succeeds but the resolved open_url lands on the public API
host instead of the app host:
1. verify whether both app-host and api-host public bases are configured
2. verify request-aware public base selection prefers:
- https://<app-host>/backend/w/<instance>/web/lab
3. treat:
- https://<api-host>/w/<instance>/web/lab
as a platform defect in browser-session URL selection
Netdata-Specific Checks
Modern Netdata agents use the V3 dashboard entrypoint even when the product
version still reports 2.x.
Checks:
1. verify the node is not still on an old distro-packaged Netdata build
2. verify browser-session open URL uses /backend/p/netdata/.../v3/ for modern agents
3. do not assume URL path generation matches product semver
If kind and platform-control disagree on Netdata behavior, verify the Netdata package/channel version on the underlying nodes before changing proxy logic.
Temporal-Specific Checks
If Temporal UI loads HTML but stays blank:
1. inspect browser console for CSP violations
2. verify the proxied HTML CSP hash matches the rewritten inline bootstrap script
3. verify one referenced /backend/p/temporal/_app/... asset returns 200
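To compare against the CSP header by hand, the hash source for an inline script can be recomputed as sketched below. This assumes the policy uses sha256 hash sources and that the exact bytes between the script tags are piped in; the function names are hypothetical, and the sketch relies on sha256sum and base64 from coreutils.

```shell
# Convert a hex string to raw bytes using portable octal printf escapes.
hex_to_bytes() {
  local hex="$1" byte
  while [ -n "$hex" ]; do
    byte=${hex%"${hex#??}"}   # first two hex characters
    hex=${hex#??}
    printf "\\$(printf '%03o' "0x$byte")"
  done
}

# Read an inline script body on stdin and print its CSP hash source.
# Any whitespace difference changes the hash, so pass the rewritten
# script exactly as the proxy emits it.
csp_sha256_source() {
  local hex
  hex=$(sha256sum | cut -d' ' -f1)
  printf "'sha256-%s'\n" "$(hex_to_bytes "$hex" | base64)"
}
```

If the printed value is absent from the script-src directive of the proxied response, the browser will block the rewritten bootstrap script and the page stays blank.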
Stale Browser State Recovery
The preferred model is automatic recovery, not operator instructions to use incognito.
Current launcher behavior:
1. mint browser-session
2. fetch proxied HTML
3. fetch one referenced JS asset
4. if bootstrap fails, mint once more and retry
5. redirect only after bootstrap succeeds
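The launcher steps can be sketched as a control flow for reasoning about where a failure sits. This is an illustration only, not the launcher's actual code: mint_session, fetch_html, and fetch_asset are hypothetical hooks the caller defines, and treating a mint failure as terminal is an assumption the steps do not state.

```shell
# Sketch of the bootstrap flow: mint, fetch HTML, fetch one JS asset,
# re-mint and retry once on bootstrap failure, redirect only on success.
# mint_session, fetch_html and fetch_asset are hypothetical hooks that
# return 0 on success.
launch_with_recovery() {
  local attempt=1 max_attempts=2
  while [ "$attempt" -le "$max_attempts" ]; do
    # Assumption: a failed mint is not retried.
    mint_session || break
    if fetch_html && fetch_asset; then
      echo "redirect"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  echo "launch_failed"
  return 1
}
```

The second mint is what lets the flow recover from stale browser state without operator involvement.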
Operator guidance:
- if the shared launcher already recovers, do not ask users to clear cookies first
- only fall back to a private window or site-data clearing when validating whether stale browser state was the cause
Node Drift Checks
In /admin/nodes, treat these as drift:
- Unknown build
- Outdated
- Config drift
- agent=dev
- commit=unknown
If a proxied workload behaves differently across environments or nodes, verify the target node reports a real commit/build before debugging the proxy further.
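When checking many nodes, the drift markers above can be matched against whatever status text is copied out of /admin/nodes. This is a rough sketch: the function name node_is_drifted is hypothetical, the input is assumed to be a free-form status string, and the substring match is deliberately loose (agent=dev would also match a longer value that starts with dev).

```shell
# Hypothetical helper: succeed (exit 0) if a node status string contains
# any of the drift markers listed in this runbook.
node_is_drifted() {
  case "$1" in
    *"Unknown build"*|*Outdated*|*"Config drift"*|*agent=dev*|*commit=unknown*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```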
Recovery Paths
Blank page due to asset auth/path mismatch
- verify smoke succeeds on HTML and route
- verify a referenced JS asset path
- if asset path escapes the browser-visible prefix, extend cookie scope or route handling for the runtime prefix
- redeploy API/web as needed
Stuck deploying or recent ambiguous failed
- confirm runtime truth on the node
- confirm app-runtime status probe healed the instance
- if not, investigate reconcile selection and node-task completion evidence
False launcher error in the origin page
- verify the popup/new tab actually opened and the proxied app loaded
- fix the launcher/origin-tab flow; do not treat this as an upstream app failure
Escalation Rule
If the only way to make the app work is:
- app-specific ad hoc cookie behavior outside the adapter model,
- route-shape hacks only for one environment,
- or manual browser-state instructions as the normal fix,
stop and raise a platform defect. The fix belongs in the shared proxy adapter, browser-session layer, or launcher model.