Release Smoke Checklist

Purpose: fast, repeatable validation for the user-critical path after deploys and before MVP demos.

Preconditions

  • Stack is healthy (make dev-up, or the target environment's equivalent).
  • At least one node is active and occupancy=available.
  • Test user can sign in and has balance to provision.

Environment Recreation Baseline

When recreating platform-control or rebuilding the dev-control runtime images, re-apply these public/runtime settings first. If they drift, proxy, browser, and auth failures after a restart can look random.

Public URL baseline for dev-control Funnel

  • app: https://gpuaas-dev-app.tailfe39f5.ts.net
  • api: https://gpuaas-dev-api.tailfe39f5.ts.net
  • auth: https://gpuaas-dev-auth.tailfe39f5.ts.net
  • term: https://gpuaas-dev-term.tailfe39f5.ts.net

Required config keys

  • APP_BASE_URL=https://gpuaas-dev-app.tailfe39f5.ts.net
  • KEYCLOAK_PUBLIC_ISSUER_URL=https://gpuaas-dev-auth.tailfe39f5.ts.net/realms/gpuaas
  • KEYCLOAK_TOKEN_BASE_URL=https://gpuaas-dev-auth.tailfe39f5.ts.net
  • NEXT_PUBLIC_API_BASE_URL=/backend
  • NEXT_PUBLIC_GRAFANA_BASE_URL=https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana
  • NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL=wss://gpuaas-dev-api.tailfe39f5.ts.net
  • NEXT_PUBLIC_WS_BASE_URL=wss://gpuaas-dev-term.tailfe39f5.ts.net
  • GF_SERVER_ROOT_URL=https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana/
  • GF_SERVER_SERVE_FROM_SUB_PATH=true
  • rate_limit.platform_proxy_requests_per_minute=1200
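
Because drift in these keys is described as hard to diagnose, a small comparison helper can make it visible before restart. The following is a minimal sketch, not part of the repository: the function name find_config_drift is hypothetical, and how the actual values are loaded (environment, config file, admin API) is left to the caller as an assumption.

```python
from typing import Dict, Mapping, Optional, Tuple

# Expected values copied from the "Required config keys" list above.
EXPECTED: Dict[str, str] = {
    "APP_BASE_URL": "https://gpuaas-dev-app.tailfe39f5.ts.net",
    "KEYCLOAK_PUBLIC_ISSUER_URL": "https://gpuaas-dev-auth.tailfe39f5.ts.net/realms/gpuaas",
    "KEYCLOAK_TOKEN_BASE_URL": "https://gpuaas-dev-auth.tailfe39f5.ts.net",
    "NEXT_PUBLIC_API_BASE_URL": "/backend",
    "NEXT_PUBLIC_GRAFANA_BASE_URL": "https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana",
    "NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL": "wss://gpuaas-dev-api.tailfe39f5.ts.net",
    "NEXT_PUBLIC_WS_BASE_URL": "wss://gpuaas-dev-term.tailfe39f5.ts.net",
    "GF_SERVER_ROOT_URL": "https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana/",
    "GF_SERVER_SERVE_FROM_SUB_PATH": "true",
    "rate_limit.platform_proxy_requests_per_minute": "1200",
}


def find_config_drift(
    actual: Mapping[str, str],
) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {key: (actual_value, expected_value)} for each missing or differing key.

    A missing key is reported with an actual value of None.
    """
    drift: Dict[str, Tuple[Optional[str], str]] = {}
    for key, expected in EXPECTED.items():
        value = actual.get(key)
        if value != expected:
            drift[key] = (value, expected)
    return drift
```

An empty result means the supplied values match the baseline. Comparison is exact string equality, so a trailing slash on GF_SERVER_ROOT_URL counts as drift.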

Catalog/runtime defaults that must exist

  • Jupyter launchable OCI default exposure mode: platform_proxy
  • vLLM launchable OCI default exposure mode: platform_proxy

Re-apply command

bash scripts/ops/configure_platform_control_dev_auth_urls.sh

Post-recreate verification

make ops-platform-proxy-smoke PROXY_PATH=/backend/p/grafana/ SERVICE=grafana EXPECT_TITLE=Grafana
make ops-platform-proxy-smoke PROXY_PATH=/backend/p/redoc SERVICE=redoc EXPECT_TITLE=ReDoc
make ops-platform-proxy-smoke PROXY_PATH=/backend/p/temporal/namespaces/default/workflows/test SERVICE=temporal EXPECT_TITLE=Temporal
make ops-app-proxy-smoke INSTANCE=<jupyter-instance-id> ENDPOINT=web EXPECT_TITLE=JupyterLab

User-Critical Smoke Path

  1. Sign in as a regular user.
  2. Open Marketplace, submit one allocation request, and wait for the allocation to become active.
  3. Open My Allocations.
  4. Click Metrics on the allocation row.
  5. Verify allocation detail opens with panel=metrics.
  6. Verify Live Metrics renders:
       • CPU/GPU/GPU memory cards have values.
       • Trend charts render for the last 15 minutes.
       • Open Netdata link is visible and clickable.
  7. Click Console.
  8. Click Connect.
  9. Verify terminal status becomes connected and prompt appears.
  10. Click Release.
  11. Confirm release in modal.
  12. Verify allocation transitions to releasing and then released.
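
The "wait for active" and "releasing and then released" steps both amount to polling a status until it reaches a target. The helper below is a minimal sketch of that loop, assuming the caller supplies a fetch_status callable. How the status is read (API call, UI scrape) is not specified here, and the function name, the timeout defaults, and the idea of "failed" terminal statuses are illustrative assumptions.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    fetch_status: Callable[[], str],
    wanted: str,
    failed: Iterable[str] = (),
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll fetch_status() until it returns `wanted`.

    Raises RuntimeError if a status listed in `failed` is seen, and
    TimeoutError if `wanted` is not reached within timeout_s seconds.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    failed_set = set(failed)
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status == wanted:
            return status
        if status in failed_set:
            raise RuntimeError(f"reached terminal status {status!r}, wanted {wanted!r}")
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s}s, wanted {wanted!r}")
        sleep(interval_s)
```

For the release step, the same helper can be called twice in sequence, once with wanted="releasing" and once with wanted="released". A fast release may skip past the first state between polls, in which case a shorter interval_s helps.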

API/Control-Plane Validation

  • Correlation IDs are present on all error responses.
  • GET /api/v1/allocations/{id}/metrics returns non-empty snapshot.
  • GET /api/v1/allocations/{id}/metrics/timeseries returns arrays for cpu, gpu, gpu_memory.
  • Terminal token mint endpoint returns 200 for active allocations only.
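
The timeseries check above can be expressed as a small payload validator. This is a sketch only: it assumes the decoded JSON response is an object with cpu, gpu, and gpu_memory as top-level keys, which this checklist does not confirm, and the function name missing_series is hypothetical.

```python
from typing import Any, List, Mapping

# Series names taken from the timeseries bullet above.
REQUIRED_SERIES = ("cpu", "gpu", "gpu_memory")


def missing_series(payload: Mapping[str, Any]) -> List[str]:
    """Return the required series names that are absent or not JSON arrays.

    Assumes the series sit at the top level of the decoded response body.
    Empty arrays are accepted, since the checklist only requires arrays.
    """
    return [
        name
        for name in REQUIRED_SERIES
        if not isinstance(payload.get(name), list)
    ]
```

An empty list means the response has the expected shape. If the real payload nests the series under another key, pass that nested object instead.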

Node/Telemetry Validation

  • Node remains active while allocation is active.
  • Node occupancy transitions:
      • available -> assigned -> available across allocate/release.
  • Netdata reachable in admin ops:
      • Node metrics summary shows netdata_reachable_nodes >= 1 for active nodes.
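
If node occupancy is sampled periodically during the smoke run, the expected available -> assigned -> available path can be checked mechanically. The sketch below assumes the samples are collected elsewhere as a list of occupancy strings in time order. The function names are hypothetical.

```python
from itertools import groupby
from typing import Iterable, List

# Expected path across one allocate/release cycle, from the bullet above.
EXPECTED_OCCUPANCY = ["available", "assigned", "available"]


def occupancy_path(samples: Iterable[str]) -> List[str]:
    """Collapse consecutive duplicate samples into the sequence of distinct states."""
    return [state for state, _ in groupby(samples)]


def occupancy_ok(samples: Iterable[str]) -> bool:
    """True if the samples trace exactly available -> assigned -> available."""
    return occupancy_path(samples) == EXPECTED_OCCUPANCY
```

Sampling must start before the allocation request and continue until after release. Otherwise the first or last available state is missed and the check fails even on a healthy node.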

Automated Checks (Local)

  • Frontend smoke e2e:
      • pnpm --dir packages/web e2e -- packages/web/e2e/allocation-smoke.spec.ts
  • Backend API tests:
      • go test ./cmd/api
  • Web type safety:
      • pnpm --dir packages/web typecheck

Exit Criteria

  • All steps above pass without manual DB fixes.
  • No "Request failed" banners appear in the happy path.
  • No silent terminal failures (the terminal must clearly show a connected or closed state).