Release Smoke Checklist

Purpose: fast, repeatable validation for the user-critical path after deploys and before MVP demos.

Preconditions

Stack is healthy (make dev-up or target environment equivalent).
At least one node is active with occupancy=available.
Test user can sign in and has enough balance to provision.

Environment Recreation Baseline

When recreating platform-control or rebuilding the dev-control runtime images, re-apply these public/runtime settings first. If they drift, proxy, browser, and auth failures after a restart can look random.

Public URL baseline for dev-control Funnel

app: https://gpuaas-dev-app.tailfe39f5.ts.net
api: https://gpuaas-dev-api.tailfe39f5.ts.net
auth: https://gpuaas-dev-auth.tailfe39f5.ts.net
term: https://gpuaas-dev-term.tailfe39f5.ts.net

Required config keys

APP_BASE_URL=https://gpuaas-dev-app.tailfe39f5.ts.net
KEYCLOAK_PUBLIC_ISSUER_URL=https://gpuaas-dev-auth.tailfe39f5.ts.net/realms/gpuaas
KEYCLOAK_TOKEN_BASE_URL=https://gpuaas-dev-auth.tailfe39f5.ts.net
NEXT_PUBLIC_API_BASE_URL=/backend
NEXT_PUBLIC_GRAFANA_BASE_URL=https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana
NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL=wss://gpuaas-dev-api.tailfe39f5.ts.net
NEXT_PUBLIC_WS_BASE_URL=wss://gpuaas-dev-term.tailfe39f5.ts.net
GF_SERVER_ROOT_URL=https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana/
GF_SERVER_SERVE_FROM_SUB_PATH=true
rate_limit.platform_proxy_requests_per_minute=1200
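
One way to catch drift before a restart is to diff the live settings against the baseline above. The following is a minimal sketch, assuming the settings can be loaded into a flat key/value mapping (how they are loaded depends on the environment and is not shown). `find_drift` is a hypothetical helper, not part of the repository.

```python
from __future__ import annotations

# Baseline values copied from the "Required config keys" list.
EXPECTED = {
    "APP_BASE_URL": "https://gpuaas-dev-app.tailfe39f5.ts.net",
    "KEYCLOAK_PUBLIC_ISSUER_URL": "https://gpuaas-dev-auth.tailfe39f5.ts.net/realms/gpuaas",
    "KEYCLOAK_TOKEN_BASE_URL": "https://gpuaas-dev-auth.tailfe39f5.ts.net",
    "NEXT_PUBLIC_API_BASE_URL": "/backend",
    "NEXT_PUBLIC_GRAFANA_BASE_URL": "https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana",
    "NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL": "wss://gpuaas-dev-api.tailfe39f5.ts.net",
    "NEXT_PUBLIC_WS_BASE_URL": "wss://gpuaas-dev-term.tailfe39f5.ts.net",
    "GF_SERVER_ROOT_URL": "https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana/",
    "GF_SERVER_SERVE_FROM_SUB_PATH": "true",
    "rate_limit.platform_proxy_requests_per_minute": "1200",
}


def find_drift(actual: dict[str, str]) -> dict[str, tuple[str, str | None]]:
    """Return {key: (expected, actual)} for every key that is missing or differs.

    `actual` is whatever the environment currently has; a missing key is
    reported with None as its actual value.
    """
    return {
        key: (want, actual.get(key))
        for key, want in EXPECTED.items()
        if actual.get(key) != want
    }
```

An empty result means the baseline holds. Any entry names a key to fix before debugging proxy or auth symptoms.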

Catalog/runtime defaults that must exist

Jupyter launchable OCI default exposure mode: platform_proxy
vLLM launchable OCI default exposure mode: platform_proxy

Re-apply command

bash scripts/ops/configure_platform_control_dev_auth_urls.sh

Post-recreate verification

make ops-platform-proxy-smoke PROXY_PATH=/backend/p/grafana/ SERVICE=grafana EXPECT_TITLE=Grafana
make ops-platform-proxy-smoke PROXY_PATH=/backend/p/redoc SERVICE=redoc EXPECT_TITLE=ReDoc
make ops-platform-proxy-smoke PROXY_PATH=/backend/p/temporal/namespaces/default/workflows/test SERVICE=temporal EXPECT_TITLE=Temporal
make ops-app-proxy-smoke INSTANCE=<jupyter-instance-id> ENDPOINT=web EXPECT_TITLE=JupyterLab
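
The internals of these make targets are not shown in this checklist. As an illustration only, an EXPECT_TITLE-style check could compare the HTML title of the proxied page against the expected string. The sketch below assumes that behavior; `extract_title` and `title_matches` are hypothetical helpers rather than the repository's implementation, and fetching the page is left out.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a document."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title>, e.g. inside inline SVG

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Return the stripped page title, or an empty string if there is none."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def title_matches(html: str, expected: str) -> bool:
    """Case-insensitive substring match, so 'Home - Grafana' matches 'Grafana'."""
    return expected.lower() in extract_title(html).lower()
```

A substring match is a deliberate choice in this sketch: proxied apps often prefix or suffix the product name in the title, while an error page from the proxy would not contain it at all.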

User-Critical Smoke Path

Sign in as a regular user.
Open Marketplace, submit one allocation request, and wait for it to become active.
Open My Allocations.
Click Metrics on the allocation row.
Verify allocation detail opens with panel=metrics.
Verify Live Metrics renders:
  CPU, GPU, and GPU memory cards have values.
  Trend charts render for the last 15 minutes.
  Open Netdata link is visible and clickable.
Click Console.
Click Connect.
Verify terminal status becomes connected and a prompt appears.
Click Release.
Confirm release in modal.
Verify allocation transitions to releasing and then released.
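
The steps that wait on a status (active after the request, then releasing and released) are easiest to automate with a bounded poll. A minimal sketch follows. `wait_for_status`, the `fetch_status` callable, and the default timeout and interval are assumptions for illustration, not part of the platform's API.

```python
import time
from typing import Callable, List


def wait_for_status(
    fetch_status: Callable[[], str],
    wanted: str,
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Poll fetch_status() until it returns `wanted`.

    Returns the distinct statuses observed, in order, so the caller can
    assert on the whole transition (for example releasing before released).
    Raises TimeoutError if `wanted` is not seen before the deadline.
    """
    seen: List[str] = []
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if not seen or seen[-1] != status:
            seen.append(status)  # record each change once, not every poll
        if status == wanted:
            return seen
        if clock() >= deadline:
            raise TimeoutError(f"status never reached {wanted!r}; saw {seen}")
        sleep(interval_s)
```

Polling can miss a short-lived intermediate state, so a run that observes active followed directly by released is not necessarily a failure of the release flow; shorten the interval if the releasing state must be observed.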

API/Control-Plane Validation

Correlation IDs are present on all error responses.
GET /api/v1/allocations/{id}/metrics returns a non-empty snapshot.
GET /api/v1/allocations/{id}/metrics/timeseries returns arrays for cpu, gpu, gpu_memory.
Terminal token mint endpoint returns 200 for active allocations only.
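
The response schemas are not given in this checklist, so the sketch below makes assumptions: the snapshot is a JSON object, and the timeseries response is a JSON object whose `cpu`, `gpu`, and `gpu_memory` keys hold arrays. The helper names are hypothetical. It shows how the payload and token-mint checks above could be expressed as assertions on already-fetched responses.

```python
from __future__ import annotations

import json

REQUIRED_SERIES = ("cpu", "gpu", "gpu_memory")


def check_snapshot(body: str) -> list[str]:
    """Return a list of problems; empty means the snapshot looks non-empty."""
    data = json.loads(body)
    if not isinstance(data, dict) or not data:
        return ["snapshot is empty or not a JSON object"]
    return []


def check_timeseries(body: str) -> list[str]:
    """Return a list of problems for the timeseries payload (assumed shape)."""
    data = json.loads(body)
    if not isinstance(data, dict):
        return ["timeseries payload is not a JSON object"]
    return [
        f"{name} is missing or not an array"
        for name in REQUIRED_SERIES
        if not isinstance(data.get(name), list)
    ]


def token_mint_ok(allocation_status: str, http_status: int) -> bool:
    """True when 200 is returned for active allocations and only for them."""
    return (http_status == 200) == (allocation_status == "active")
```

Returning a list of problems instead of raising on the first one keeps a single smoke run informative when several series are missing at once.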

Node/Telemetry Validation

Node remains active while allocation is active.
Node occupancy transitions:
  available -> assigned -> available across allocate/release.
Netdata reachable in admin ops:
  Node metrics summary shows netdata_reachable_nodes >= 1 for active nodes.
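
The occupancy and Netdata checks above can be sketched as two small predicates. This assumes occupancy is sampled repeatedly during the run and that the node metrics summary is available as a mapping containing `netdata_reachable_nodes`; the function names and the sampling approach are hypothetical.

```python
from itertools import groupby
from typing import Iterable, Mapping

EXPECTED_OCCUPANCY = ["available", "assigned", "available"]


def occupancy_ok(samples: Iterable[str]) -> bool:
    """True if the sampled occupancy values, with consecutive repeats
    collapsed, are exactly available -> assigned -> available."""
    collapsed = [value for value, _ in groupby(samples)]
    return collapsed == EXPECTED_OCCUPANCY


def netdata_ok(summary: Mapping[str, int], active_nodes: int) -> bool:
    """True if at least one node is Netdata-reachable whenever nodes are active."""
    if active_nodes == 0:
        return True  # nothing to reach; the precondition check covers this case
    return summary.get("netdata_reachable_nodes", 0) >= 1
```

For example, samples of `available, assigned, assigned, available` pass, while a run that ends on `assigned` fails because the node never returned to the pool.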

Automated Checks (Local)

Frontend smoke e2e:
  pnpm --dir packages/web e2e -- packages/web/e2e/allocation-smoke.spec.ts
Backend API tests:
  go test ./cmd/api
Web type safety:
  pnpm --dir packages/web typecheck
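
To run the three local checks in one pass and get a single summary, a small wrapper is enough. This is a sketch only, assuming it is invoked from the repository root with pnpm and go on the PATH; `run_checks` is a hypothetical helper, and the commands are copied from the list above.

```python
import subprocess

# (label, argv) pairs taken from the checklist above.
CHECKS = [
    ("Frontend smoke e2e",
     ["pnpm", "--dir", "packages/web", "e2e", "--",
      "packages/web/e2e/allocation-smoke.spec.ts"]),
    ("Backend API tests", ["go", "test", "./cmd/api"]),
    ("Web type safety", ["pnpm", "--dir", "packages/web", "typecheck"]),
]


def run_checks(checks=CHECKS, run=subprocess.run):
    """Run every check in order and return the labels of those that failed.

    All checks run even if an earlier one fails, so one pass reports
    everything that is broken. `run` is injectable for testing.
    """
    failed = []
    for label, argv in checks:
        result = run(argv)
        if result.returncode != 0:
            failed.append(label)
    return failed
```

Calling `run_checks()` and exiting non-zero when the returned list is non-empty makes the wrapper usable as a pre-demo gate.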

Exit Criteria

All steps above pass without manual DB fixes.
No "Request failed" banners in the happy path.
No silent terminal failures (the terminal must clearly show a connected or closed state).