Release Smoke Checklist
Purpose: fast, repeatable validation for the user-critical path after deploys and before MVP demos.
Preconditions
- Stack is healthy (make dev-up or target environment equivalent).
- At least one node is active and occupancy=available.
- Test user can sign in and has balance to provision.
Environment Recreation Baseline
When recreating platform-control or rebuilding the dev-control runtime images,
re-apply these public/runtime settings first. If these drift, proxy/browser/auth
failures can look random after restart.
Public URL baseline for dev-control Funnel
- app: https://gpuaas-dev-app.tailfe39f5.ts.net
- api: https://gpuaas-dev-api.tailfe39f5.ts.net
- auth: https://gpuaas-dev-auth.tailfe39f5.ts.net
- term: https://gpuaas-dev-term.tailfe39f5.ts.net
Required config keys
- APP_BASE_URL=https://gpuaas-dev-app.tailfe39f5.ts.net
- KEYCLOAK_PUBLIC_ISSUER_URL=https://gpuaas-dev-auth.tailfe39f5.ts.net/realms/gpuaas
- KEYCLOAK_TOKEN_BASE_URL=https://gpuaas-dev-auth.tailfe39f5.ts.net
- NEXT_PUBLIC_API_BASE_URL=/backend
- NEXT_PUBLIC_GRAFANA_BASE_URL=https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana
- NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL=wss://gpuaas-dev-api.tailfe39f5.ts.net
- NEXT_PUBLIC_WS_BASE_URL=wss://gpuaas-dev-term.tailfe39f5.ts.net
- GF_SERVER_ROOT_URL=https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana/
- GF_SERVER_SERVE_FROM_SUB_PATH=true
- rate_limit.platform_proxy_requests_per_minute=1200
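The drift check implied by the keys above can be sketched as follows. This is a minimal illustration, assuming the effective settings have already been loaded into a plain dictionary of strings (how they are loaded is deployment-specific and not shown). The names EXPECTED and find_config_drift are hypothetical and not part of the project.

```python
from typing import Dict, Optional, Tuple

# Illustrative only: EXPECTED mirrors the required keys listed in this checklist.
EXPECTED: Dict[str, str] = {
    "APP_BASE_URL": "https://gpuaas-dev-app.tailfe39f5.ts.net",
    "KEYCLOAK_PUBLIC_ISSUER_URL": "https://gpuaas-dev-auth.tailfe39f5.ts.net/realms/gpuaas",
    "KEYCLOAK_TOKEN_BASE_URL": "https://gpuaas-dev-auth.tailfe39f5.ts.net",
    "NEXT_PUBLIC_API_BASE_URL": "/backend",
    "NEXT_PUBLIC_GRAFANA_BASE_URL": "https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana",
    "NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL": "wss://gpuaas-dev-api.tailfe39f5.ts.net",
    "NEXT_PUBLIC_WS_BASE_URL": "wss://gpuaas-dev-term.tailfe39f5.ts.net",
    "GF_SERVER_ROOT_URL": "https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana/",
    "GF_SERVER_SERVE_FROM_SUB_PATH": "true",
    "rate_limit.platform_proxy_requests_per_minute": "1200",
}


def find_config_drift(actual: Dict[str, str]) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {key: (actual_value, expected_value)} for every missing or mismatched key.

    Hypothetical helper: a missing key is reported with an actual value of None.
    """
    drift: Dict[str, Tuple[Optional[str], str]] = {}
    for key, expected in EXPECTED.items():
        value = actual.get(key)
        if value != expected:
            drift[key] = (value, expected)
    return drift
```

An empty result means the loaded settings match the baseline. Any entry in the result names a key to re-apply before debugging proxy, browser, or auth failures.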
Catalog/runtime defaults that must exist
- Jupyter launchable OCI default exposure mode: platform_proxy
- vLLM launchable OCI default exposure mode: platform_proxy
Re-apply command
Post-recreate verification
make ops-platform-proxy-smoke PROXY_PATH=/backend/p/grafana/ SERVICE=grafana EXPECT_TITLE=Grafana
make ops-platform-proxy-smoke PROXY_PATH=/backend/p/redoc SERVICE=redoc EXPECT_TITLE=ReDoc
make ops-platform-proxy-smoke PROXY_PATH=/backend/p/temporal/namespaces/default/workflows/test SERVICE=temporal EXPECT_TITLE=Temporal
make ops-app-proxy-smoke INSTANCE=<jupyter-instance-id> ENDPOINT=web EXPECT_TITLE=JupyterLab
User-Critical Smoke Path
- Sign in as a regular user.
- Open Marketplace, submit one allocation request, and wait for active.
- Open My Allocations.
- Click Metrics on the allocation row.
- Verify allocation detail opens with panel=metrics.
- Verify Live Metrics renders:
  - CPU/GPU/GPU memory cards have values.
  - Trend charts render for last 15 minutes.
  - Open Netdata link is visible and clickable.
- Click Console.
- Click Connect.
- Verify terminal status becomes connected and prompt appears.
- Click Release.
- Confirm release in modal.
- Verify allocation transitions to releasing and then released.
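If the final status check is scripted rather than watched in the UI, the "releasing and then released" transition can be polled roughly as below. This is a sketch under stated assumptions: fetch_status is a caller-supplied function that returns the allocation's current status string (how it talks to the API is not shown), and wait_for_statuses is a hypothetical helper name.

```python
import time
from typing import Callable, List, Sequence


def wait_for_statuses(
    fetch_status: Callable[[], str],
    expected: Sequence[str],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Poll until every status in `expected` has been observed, in order.

    Returns the distinct consecutive statuses seen. Raises TimeoutError if the
    sequence is not completed before the deadline. Hypothetical helper.
    """
    remaining = list(expected)
    seen: List[str] = []
    if not remaining:
        return seen
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        # Record only changes, so repeated polls of the same state collapse.
        if not seen or seen[-1] != status:
            seen.append(status)
        if status == remaining[0]:
            remaining.pop(0)
            if not remaining:
                return seen
        if clock() >= deadline:
            raise TimeoutError(f"still waiting for {remaining}, saw {seen}")
        sleep(interval_s)
```

A polling check like this can miss a short-lived releasing state if the transition completes between two polls, so a small interval_s (or waiting for released alone) may be the more robust choice.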
API/Control-Plane Validation
- Correlation IDs present on any error responses.
- GET /api/v1/allocations/{id}/metrics returns non-empty snapshot.
- GET /api/v1/allocations/{id}/metrics/timeseries returns arrays for cpu, gpu, gpu_memory.
- Terminal token mint endpoint returns 200 for active allocations only.
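The timeseries check above can be expressed as a small shape validation. The sketch below assumes the response body is a JSON object with cpu, gpu, and gpu_memory as top-level keys. The checklist only says the endpoint returns arrays for those names, so the exact nesting is an assumption, and missing_series is a hypothetical helper.

```python
from typing import Any, Dict, List

# Series names taken from the checklist; their top-level placement is assumed.
REQUIRED_SERIES = ("cpu", "gpu", "gpu_memory")


def missing_series(payload: Dict[str, Any]) -> List[str]:
    """Return the required series names that are absent or not JSON arrays."""
    return [
        name
        for name in REQUIRED_SERIES
        if not isinstance(payload.get(name), list)
    ]
```

An empty list means the shape check passes. Empty arrays are accepted here, so a stricter smoke run may also want to assert that each series has at least one point.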
Node/Telemetry Validation
- Node remains active while allocation is active.
- Node occupancy transitions:
  - available -> assigned -> available across allocate/release.
- Netdata reachable in admin ops:
  - Node metrics summary shows netdata_reachable_nodes >= 1 for active nodes.
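The occupancy transition above can be verified against a list of sampled values, for example occupancy readings collected while the smoke path runs. This is a minimal sketch: how the samples are collected is not shown, and collapse_repeats and occupancy_cycle_ok are hypothetical names.

```python
from typing import Iterable, List

# Expected cycle across allocate/release, as stated in the checklist.
EXPECTED_OCCUPANCY = ["available", "assigned", "available"]


def collapse_repeats(values: Iterable[str]) -> List[str]:
    """Drop consecutive duplicates so repeated samples of one state count once."""
    collapsed: List[str] = []
    for value in values:
        if not collapsed or collapsed[-1] != value:
            collapsed.append(value)
    return collapsed


def occupancy_cycle_ok(samples: Iterable[str]) -> bool:
    """True if the samples show exactly available -> assigned -> available."""
    return collapse_repeats(samples) == EXPECTED_OCCUPANCY
```

The check is strict by design: any extra state between the three expected ones, or a run that never returns to available, fails it.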
Automated Checks (Local)
- Frontend smoke e2e:
  pnpm --dir packages/web e2e -- packages/web/e2e/allocation-smoke.spec.ts
- Backend API tests:
  go test ./cmd/api
- Web type safety:
  pnpm --dir packages/web typecheck
Exit Criteria
- All steps above pass without manual DB fixes.
- No Request failed banners in the happy path.
- No silent terminal failures (the terminal must clearly show a connected or closed state).