Execution Progress

Tracks implementation progress against doc/Implementation_Roadmap.md so we keep an auditable "done / next" list while coding unattended.

Document Role

  • Purpose: live implementation ledger mapped to roadmap phases.
  • Scope: completed commits, partial progress, remaining items, and active next queue.
  • This does not replace phase definitions (doc/Implementation_Roadmap.md) or readiness gates (doc/Phase_Readiness_Tracker.md).
  • Treat this file as append-only execution history plus progress notes; the current task source of truth is doc/governance/Agent_Work_Queue.yaml.

Last updated: 2026-04-14 (launchable OCI, Compose, and node-agent lifecycle docs refreshed)

2026-04-14 Progress Note

  • Node-agent lifecycle:
      • commit 97697190 delivered node-agent lifecycle self-update and passed platform-control CI.
      • platform-control and local kind nodes were refreshed to the current agent generation; host bootstrap now owns Docker, Docker Compose, NVIDIA Container Toolkit, registry trust, and H200 site-bootstrap prerequisites for the current launchable OCI and Compose slices.
      • remaining lifecycle backlog is full fleet reconciliation and read-model telemetry hardening, especially desired version/prerequisite drift, agent_version, agent_connected_at, runtime capability, and stale task/config cleanup.
  • Launchable OCI and Docker Compose:
      • commit 593a1690 platformized the manifest-owned Docker Compose topology path.
      • JupyterLab remains the proven single-container launchable OCI reference app.
      • vllm-openai is now the proven single-node Compose reference app; platform-control H200 validation launched Mistral Small 3.2 on one H200 and validated /v1/models plus /v1/chat/completions through private access.
      • the current Compose product contract is curated renderer plus manifest topology, not arbitrary user-authored Compose YAML.
  • App-developer documentation:
      • refreshed doc/architecture/Build_an_App_for_GPUaaS_v1.md, doc/architecture/External_App_Team_Integration_Guide_v1.md, doc/architecture/Launchable_OCI_Workload_Profile_Contract_v1.md, and doc/architecture/App_Platform_Gap_Tracker_v1.md.
      • the active backlog is now app-developer release packaging, Compose topology generalization, private credential/runtime-secret binding, platform-proxy exposure, persistent storage/derived environments, and fleet lifecycle reconciliation.

Tracking Rule

  • Every implementation commit must be mapped to a roadmap phase here.
  • Keep entries additive and append-only.
  • Do not mark phase done unless tests pass and wiring is runnable.

Latest Verification Sweep (2026-02-23)

Status: Passed
Evidence:
- go test ./... green (all packages)
- go vet ./... green
- make test-integration-selected green
- make ci-local-dry-run green (contract + reviewguard + codegen + build/test gates)
- scripts/demo_smoke.sh green (health/auth/catalog/nodes/create allocation/poll/release + terminal token + ws smoke)

Notes:
- AsyncAPI validator reports informational recommendation to move to AsyncAPI 3.1.0; non-blocking in current pipeline.
- CI dry-run integration-smoke stage is skipped when DATABASE_URL is unset in shell context; targeted integration suite executed successfully via make test-integration-selected.

Roadmap Mapping

Phase 1A — shared package tests

Status: Completed
Evidence:
- e1685c5 baseline shared tests (errors/events/middleware/policy)
- go test ./... green

Phase 1B — cmd/api + outbox relay wiring

Status: Completed (MVP scaffold)
Evidence:
- 1b2aef0 cmd/api health/config/server scaffold
- 70a7916 outbox relay runtime (Postgres claim + NATS publish)
- a142a74 outbox retry/backoff hardening + integration test (-tags integration)
- API local relay parity hardening:
  - replaced cmd/api outbox ticker placeholder with shared packages/shared/outbox relay loop
  - wired API process to outbox.NewPostgresStore + JetStream publisher with event-type allowlist
  - added cmd/api/outbox_test.go coverage for supported/unsupported event-type gating

Notes:
- Dedicated relay (cmd/outbox-relay) is in place and compose-wired.
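
The event-type gating and retry backoff noted for Phase 1B can be illustrated with a minimal sketch. The function names, the allowlist contents, and the doubling-with-cap backoff policy are assumptions for illustration; the ledger does not state the relay's actual policy or API.

```go
package main

import "time"

// allowedEventTypes is a hypothetical allowlist. The event names appear in
// this ledger, but the real allowlist contents are not documented here.
var allowedEventTypes = map[string]bool{
	"provisioning.requested":           true,
	"provisioning.releasing.requested": true,
}

// ShouldPublish reports whether the relay may publish the given event type.
// Unknown types are rejected rather than forwarded.
func ShouldPublish(eventType string) bool {
	return allowedEventTypes[eventType]
}

// BackoffDelay returns base doubled once per prior attempt, capped at max.
// attempt is zero-based: attempt 0 yields base.
func BackoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// BackoffSeconds wraps BackoffDelay for callers that work in whole seconds.
func BackoffSeconds(attempt, baseSec, maxSec int) int {
	d := BackoffDelay(attempt, time.Duration(baseSec)*time.Second, time.Duration(maxSec)*time.Second)
	return int(d / time.Second)
}
```

With a 1s base and a 30s cap, attempts 0, 3, and 10 would wait 1s, 8s, and 30s respectively under this assumed policy.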

Phase 2 — Auth + Users

Status: In Progress
Completed:
- e13afa2 auth deny-list support (admin token revocation check)
- 996792d OIDC/public auth endpoint implementation scaffold:
  - GET /api/v1/auth/oidc/authorize
  - POST /api/v1/auth/oidc/exchange
  - POST /api/v1/auth/personal/login (personal path)
  - POST /api/v1/auth/token/refresh
  - POST /api/v1/auth/logout
  - GET /api/v1/users/me
  - route split for public vs protected API paths in cmd/api
- 043470a auth provider-path tests:
  - token endpoint success decode
  - non-2xx provider response handling
- 3c1185f auth provider failure-mode coverage:
  - invalid token endpoint JSON decode path
  - missing token fields validation path
  - empty refresh token guard
  - missing OIDC exchange field guard
  - provider transport error assertion
- 16b9a7f logout revocation semantics:
  - POST /api/v1/auth/logout now revokes bearer token at OIDC provider revoke endpoint
  - writes current admin token jti to Redis deny-list with bounded TTL
  - auth/middleware deny-list supports explicit revoke operation + tests
- b242705 OIDC verification hardening:
  - OIDC access token claims now come from JWKS-verified JWT parsing (signature + issuer + expiry validation), not raw payload decode
  - JWKS verification retries once with forced refresh to cover key-rotation races
  - direct JWKS HTTP fetch fallback added for cache bootstrap/refresh failure paths
  - expanded auth tests for signed-token success, invalid signature rejection, and wrong-issuer rejection
- 5896e78 logout refresh-token revoke extension:
  - POST /api/v1/auth/logout accepts optional refresh_token and revokes it at OIDC provider when supplied
  - access-token revoke behavior preserved; deny-list revocation path unchanged
  - route and auth service tests added for token-type hint correctness and invalid request body handling

Remaining:
- none for MVP scope
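
The "deny-list with bounded TTL" idea from the logout entry can be sketched as follows. This is an in-memory stand-in for the Redis deny-list, and the type and function names are hypothetical. It assumes "bounded" means the entry lives for the token's remaining lifetime, capped at a maximum; the ledger does not spell out the actual bound.

```go
package main

import "sync"

// BoundedTTLSeconds returns how long a revoked jti should stay deny-listed:
// the token's remaining lifetime, capped at maxTTL. Zero means the token has
// already expired and needs no entry.
func BoundedTTLSeconds(expUnix, nowUnix, maxTTL int64) int64 {
	remaining := expUnix - nowUnix
	if remaining <= 0 {
		return 0
	}
	if remaining > maxTTL {
		return maxTTL
	}
	return remaining
}

// DenyList is an in-memory stand-in for a Redis key-with-expiry deny-list.
type DenyList struct {
	mu    sync.Mutex
	until map[string]int64 // jti -> unix second at which the entry lapses
}

// NewDenyList returns an empty deny-list.
func NewDenyList() *DenyList {
	return &DenyList{until: make(map[string]int64)}
}

// Revoke records the jti until the bounded TTL elapses.
func (d *DenyList) Revoke(jti string, expUnix, nowUnix, maxTTL int64) {
	ttl := BoundedTTLSeconds(expUnix, nowUnix, maxTTL)
	if ttl == 0 {
		return
	}
	d.mu.Lock()
	defer d.mu.Unlock()
	d.until[jti] = nowUnix + ttl
}

// IsRevoked reports whether the jti is still deny-listed at nowUnix.
func (d *DenyList) IsRevoked(jti string, nowUnix int64) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	until, ok := d.until[jti]
	return ok && nowUnix < until
}
```

Under this assumption the cap should be at least the longest admin token lifetime; a smaller cap would let a revoked token become usable again before it expires.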

Phase 3 — Inventory

Status: Completed (MVP scope)
Completed:
- 99fa11b inventory service + API handlers:
  - GET /api/v1/skus (active catalog with cursor/page_size)
  - GET /api/v1/nodes (user-facing node summaries with status/region filters)
- caeff96 admin inventory extension:
  - GET /api/v1/admin/nodes
  - node occupancy projection in inventory service
  - admin-role guard in API handler
- 8ec890b admin node operations:
  - POST /api/v1/admin/nodes (SKU validation, duplicate detection, optional probe)
  - POST /api/v1/admin/nodes/{node_id}/probe (status update online/offline)
  - DELETE /api/v1/admin/nodes/{node_id} (soft-disable to offline, conflict if in use)
  - route-level tests for admin auth and conflict/not-found behavior
- 16b9a7f node mutation audit logging:
  - transactional audit_logs writes for create/probe/disable node operations
  - metadata constrained to allowed keys (node_id, status_from, status_to)

Remaining:
- none for MVP scope
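
Constraining audit metadata to the allowed keys listed above (node_id, status_from, status_to) could look like the sketch below. The function name is hypothetical, and silently dropping unknown keys is an assumption; the real service may reject them instead.

```go
package main

// allowedAuditKeys mirrors the keys named in this ledger for node mutations.
var allowedAuditKeys = map[string]bool{
	"node_id":     true,
	"status_from": true,
	"status_to":   true,
}

// FilterAuditMetadata returns a copy of in that keeps only allowed keys.
// Assumption: unknown keys are dropped, not treated as an error.
func FilterAuditMetadata(in map[string]string) map[string]string {
	out := make(map[string]string, len(allowedAuditKeys))
	for k, v := range in {
		if allowedAuditKeys[k] {
			out[k] = v
		}
	}
	return out
}
```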

Phase 4 — Provisioning lifecycle

Status: In Progress
Completed:
- dc6e0e4 provisioning worker pull-consumer runtime scaffold
- a142a74 create-allocation orchestration start (requested + outbox event)
- user allocation API completion for demo path:
  - GET /api/v1/allocations (status filter + pagination envelope)
  - GET /api/v1/allocations/{allocation_id}
  - POST /api/v1/allocations/{allocation_id}/release (202 + outbox provisioning.releasing.requested)
  - user-managed SSH key model introduced via /api/v1/ssh-keys endpoints (private-key download endpoint retired)
  - orchestrator service exposes list/get/release methods with ownership checks.
- c2eee14 provisioning worker state-transition handlers:
  - handle provisioning.requested -> allocation state update (requested/provisioning to provisioning/active)
  - handle provisioning.releasing.requested -> released + emit provisioning.releasing.completed
  - handle provisioning.force_release_requested -> releasing + emit provisioning.releasing.requested
- 735ba4d Temporal registration path:
  - provisioning worker boots Temporal worker with dedicated task queue
  - workflow registered (ProvisioningEventWorkflow) with activity handlers for requested/releasing/force-release paths
  - configurable TEMPORAL_TASK_QUEUE with default provisioning-workflows
- provisioning state-machine hardening (current commit):
  - HandleProvisionRequested now performs explicit runtime step + success/failure branching:
    - success path transitions allocation to active, stores SSH material fields, emits provisioning.active
    - failure path transitions allocation to failed, records failure_reason, emits provisioning.failed
  - HandleReleasingRequested now performs explicit runtime release step + success/failure branching:
    - success path transitions allocation to released, emits provisioning.releasing.completed
    - failure path transitions allocation to release_failed, records release_failed_reason, emits provisioning.release_failed
  - runtime abstraction added (NewServiceWithRuntime) so real SSH executor can be plugged in without handler changes
  - unit tests expanded for SSH material generation helper and payload shape expectations
- provisioning private-key envelope alignment:
  - generated SSH private key material is now serialized and encrypted via packages/shared/crypto before writing allocations.ssh_private_key_enc
- provisioning runtime activation:
  - provisioning worker runtime selection is now limited to agent or noop.
  - legacy SSH provisioning runtime has been retired and removed from the active codebase.
  - node lifecycle execution is expected to flow through the node-agent task runtime when not in noop.

Remaining:
- harden production key-management integration at the app/runtime boundary without reintroducing provisioning-worker SSH private-key handling
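
The success/failure branching recorded for the provisioning handlers can be summarized as a small transition function. The states and event names come from this ledger; the Outcome type and Transition function are hypothetical and leave out persistence, failure reasons, and SSH material handling.

```go
package main

// Outcome is the allocation state to write and the event to emit.
type Outcome struct {
	NextState string
	Event     string
}

// Transition maps an incoming provisioning event and the result of the
// runtime step to the resulting state and emitted event, following the
// branches listed in this ledger. The bool is false for unknown events.
func Transition(event string, runtimeOK bool) (Outcome, bool) {
	switch event {
	case "provisioning.requested":
		if runtimeOK {
			return Outcome{NextState: "active", Event: "provisioning.active"}, true
		}
		return Outcome{NextState: "failed", Event: "provisioning.failed"}, true
	case "provisioning.releasing.requested":
		if runtimeOK {
			return Outcome{NextState: "released", Event: "provisioning.releasing.completed"}, true
		}
		return Outcome{NextState: "release_failed", Event: "provisioning.release_failed"}, true
	case "provisioning.force_release_requested":
		// Force release has no runtime step of its own here: it moves the
		// allocation to releasing and hands off to the normal release path.
		return Outcome{NextState: "releasing", Event: "provisioning.releasing.requested"}, true
	}
	return Outcome{}, false
}
```

Keeping the mapping in one pure function is a design choice of this sketch; it makes each branch testable without a runtime or database.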

Phase 5 — Billing + ledger

Status: Completed (MVP scope)
Completed:
- 1a4e1c8 billing worker pull-consumer runtime scaffold
- 16b9a7f billing read path service + API handlers:
  - GET /api/v1/billing/balance
  - GET /api/v1/billing/usage (cursor/page_size + filters)
  - GET /api/v1/billing/usage/csv
  - handler + service unit tests for ordering/filter parsing/CSV formatting
- bd4e6df billing worker event handling implementation:
  - provisioning.active creates idempotent open usage_records row
  - provisioning.releasing.completed / provisioning.release_failed close usage window
  - closeout computes usage debit and posts ledger_entries row (usage_debit) transactionally
  - unit tests cover dispatch + usage-cost calculation
- 66f3078 billing closeout integration coverage:
  - -tags integration test validates usage close, ledger debit creation, and idempotent repeat close
  - integrated into make test-integration-selected
- billing low-balance/depletion policy logic:
  - billing closeout now resolves policy-driven thresholds via packages/shared/policy.
  - emits outbox events for billing.low_balance_warning, billing.auto_release_pending, and billing.balance_depleted.
  - emits provisioning.force_release_requested events for all user active allocations once balance is depleted.
  - unit tests added for policy fallback/override and projected depletion calculations.

Remaining:
- webhook reconciliation observability metrics + alert thresholds tuning in ops layer
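
The closeout behavior (compute a usage debit once, and make a repeat close a no-op) can be illustrated with a sketch. The pricing formula is an assumption: it bills per second at an hourly rate in integer cents and rounds up. The actual usage-cost calculation and schema are not described in this ledger, and UsageRecord here is a hypothetical in-memory type.

```go
package main

// UsageRecord is a hypothetical in-memory view of an open usage window.
type UsageRecord struct {
	StartUnix       int64
	EndUnix         int64 // 0 while the window is still open
	HourlyRateCents int64
}

// UsageDebitCents bills elapsed seconds at an hourly rate, rounding up to
// the next whole cent. Integer math avoids floating-point drift in money.
func UsageDebitCents(startUnix, endUnix, hourlyRateCents int64) int64 {
	seconds := endUnix - startUnix
	if seconds <= 0 {
		return 0
	}
	return (seconds*hourlyRateCents + 3599) / 3600
}

// Close ends an open window and returns the debit to post. A second call
// returns (0, false) so that a replayed event cannot post a second debit.
func (r *UsageRecord) Close(endUnix int64) (int64, bool) {
	if r.EndUnix != 0 {
		return 0, false
	}
	r.EndUnix = endUnix
	return UsageDebitCents(r.StartUnix, endUnix, r.HourlyRateCents), true
}
```

In a real worker the close and the ledger insert would share one database transaction, as the ledger entry above notes; the in-memory guard only shows the idempotency idea.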

Phase 6 — Notifications

Status: Partially Started
Completed:
- 3989917 notification relay NATS->Redis bridge + payload transforms
- 6700275 notification relay event coverage expansion:
  - added end-to-end processMessage tests for all supported notification subjects.
  - added malformed-envelope and Redis publish failure-path tests.
- notification WS hub baseline:
  - packages/services/notification/ws.go implements Redis Pub/Sub -> WebSocket fanout.
  - GET /ws/notifications route wired in cmd/api with JWT auth from Sec-WebSocket-Protocol (browser) or Authorization (non-browser).
  - route enforces deny-list checks for admin tokens.
  - tests added for route auth behavior and WS fanout path.
- notification reliability/test hardening:
  - added route-level end-to-end test for /ws/notifications with real Redis pub/sub fanout and token parsing path.
  - added lightweight WS hub counters (active_connections, forwarded_messages, write_errors) exposed via WSService.Snapshot().
- API observability export:
  - cmd/api now exposes GET /api/v1/internal/stats (guarded by INTERNAL_STATS_TOKEN; disabled with 404 when unset).
  - GET /metrics now exports Prometheus counters/gauges for:
    - rate-limiter fail-open count
    - idempotency persistence/replay counters
    - terminal token consume/replay counters
    - notification websocket connection/forward/error stats
  - route tests added for internal stats auth/enablement and metrics payload baseline.

Remaining:
- persistence/audit policy for notifications (if promoted beyond best-effort)
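
The two token sources for /ws/notifications (Sec-WebSocket-Protocol for browsers, Authorization otherwise) can be sketched as a pure header-parsing helper. The function name is hypothetical. The "bearer, <access_token>" subprotocol layout follows the frontend note elsewhere in this ledger; giving the subprotocol header precedence is an assumption.

```go
package main

import "strings"

// TokenFromHeaders extracts a bearer token from the raw header values of a
// WebSocket upgrade request. wsProtocol is the Sec-WebSocket-Protocol value
// and authorization is the Authorization value; either may be empty.
func TokenFromHeaders(wsProtocol, authorization string) (string, bool) {
	// Browser path: "bearer, <access_token>" offered as two subprotocols.
	parts := strings.Split(wsProtocol, ",")
	if len(parts) == 2 && strings.EqualFold(strings.TrimSpace(parts[0]), "bearer") {
		if tok := strings.TrimSpace(parts[1]); tok != "" {
			return tok, true
		}
	}
	// Non-browser path: standard "Bearer <token>" Authorization header.
	const prefix = "bearer "
	if len(authorization) > len(prefix) && strings.EqualFold(authorization[:len(prefix)], prefix) {
		if tok := strings.TrimSpace(authorization[len(prefix):]); tok != "" {
			return tok, true
		}
	}
	return "", false
}
```

The returned string is only a candidate token; signature, issuer, expiry, and deny-list checks would still run on it afterwards.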

Storage service (MVP local backend)

Status: Completed (MVP scope)
Completed:
- local filesystem-backed storage service implemented in packages/services/storage/service.go
- traversal-safe namespace resolution integrated via packages/shared/storagepath
- API handlers wired in cmd/api/routes.go:
  - GET /api/v1/storage/list
  - GET /api/v1/storage/download
  - PUT /api/v1/storage/upload
  - POST /api/v1/storage/mkdir
  - POST /api/v1/storage/rename
  - DELETE /api/v1/storage/delete
- API process wiring in cmd/api/main.go + config support (STORAGE_ROOT_DIR)
- tests:
  - packages/services/storage/service_test.go
  - cmd/api/routes_test.go storage route coverage

Remaining:
- swap local backend with S3-compatible implementation for production target
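
Traversal-safe namespace resolution, as referenced for the storage service, is commonly done by joining the user path onto a per-user base and checking that the cleaned result still sits under that base. The sketch below shows that general technique; it is not the packages/shared/storagepath implementation, and the function name and per-user directory layout are assumptions.

```go
package main

import (
	"errors"
	"path/filepath"
	"strings"
)

// ErrPathEscapes is returned when a requested path leaves the namespace.
var ErrPathEscapes = errors.New("path escapes storage namespace")

// ResolveInNamespace maps a user-supplied relative path to an absolute path
// under root/userID. userID is assumed to be a trusted, validated identifier.
// Symlinks are not resolved here and would need separate handling.
func ResolveInNamespace(root, userID, rel string) (string, error) {
	base := filepath.Join(root, userID)
	target := filepath.Join(base, rel) // Join also cleans "." and ".." segments

	relToBase, err := filepath.Rel(base, target)
	if err != nil {
		return "", ErrPathEscapes
	}
	if relToBase == ".." || strings.HasPrefix(relToBase, ".."+string(filepath.Separator)) {
		return "", ErrPathEscapes
	}
	return target, nil
}
```

For example, with root "/data" and user "u1", the path "docs/a.txt" resolves inside the namespace while "../u2/secret" is rejected.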

Phase 7 — Payments/Webhooks

Status: In Progress
Completed:
- d4d4f91 webhook ingest service scaffold + stripe event persistence + outbox enqueue path
- 02d97e0 webhook hardening + finalize path:
  - Stripe-style HMAC signature verification with timestamp tolerance
  - checkout.session.completed payment session finalize flow
  - ledger stripe_credit entry creation
  - payment session status transition to credited
  - payments.balance_credited outbox emission
- f479ace reconciliation and provider-edge handling:
  - amount/currency mismatch path marks payment_sessions.status = failed_reconcile
  - writes payments.reconcile_failed audit log for ops visibility
  - provider lifecycle mapping for checkout.session.expired and checkout.session.async_payment_failed
  - unit tests for provider event-state mapping
- 17adfa7 webhook integration coverage:
  - integration test (-tags integration) verifies mismatch path updates session to failed_reconcile
  - asserts no ledger credit is created on mismatch
  - asserts payments.reconcile_failed audit log row is written
- a45c553 reconciliation alert routing:
  - emits payments.reconcile_failed outbox event on mismatch path
  - notification-relay consumes payments.reconcile_failed and publishes user notification
  - AsyncAPI/API surface/event taxonomy/NATS consumer docs updated for the new contract
- webhook observability baseline:
  - webhook worker now tracks in-memory counters for:
    - received events
    - signature failures
    - invalid payload failures
    - persistence failures
    - successful processing
    - reconcile-failed outcomes
  - internal stats endpoint added: GET /api/v1/internal/stats
  - internal stats endpoint now requires INTERNAL_STATS_TOKEN bearer auth and returns 404 when disabled.
  - internal stats token comparison uses constant-time equality check.
  - prometheus counter endpoint added: GET /metrics with webhook + reconcile failure counters.
  - unit tests added for stats endpoint baseline and signature-failure counter behavior.

Remaining:
- alert thresholds tuning in ops layer
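
Stripe-style HMAC verification with timestamp tolerance generally works as sketched below: the header carries a timestamp and one or more v1 signatures, and the signed payload is the timestamp, a dot, and the raw body. This follows Stripe's publicly documented scheme rather than the webhook worker's actual code; the function names, error values, and tolerance handling are assumptions.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"strconv"
	"strings"
)

var (
	ErrBadHeader = errors.New("malformed signature header")
	ErrStale     = errors.New("timestamp outside tolerance")
	ErrMismatch  = errors.New("signature mismatch")
)

// Sign returns the hex HMAC-SHA256 of "<timestamp>.<payload>".
func Sign(secret string, ts int64, payload []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(strconv.FormatInt(ts, 10)))
	mac.Write([]byte("."))
	mac.Write(payload)
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify checks a "t=<unix>,v1=<hex>[,v1=<hex>...]" header against payload.
func Verify(secret, header string, payload []byte, nowUnix, toleranceSec int64) error {
	var ts int64
	var haveTS bool
	var sigs []string
	for _, part := range strings.Split(header, ",") {
		k, v, ok := strings.Cut(strings.TrimSpace(part), "=")
		if !ok {
			return ErrBadHeader
		}
		switch k {
		case "t":
			n, err := strconv.ParseInt(v, 10, 64)
			if err != nil {
				return ErrBadHeader
			}
			ts, haveTS = n, true
		case "v1":
			sigs = append(sigs, v)
		}
	}
	if !haveTS || len(sigs) == 0 {
		return ErrBadHeader
	}
	delta := nowUnix - ts
	if delta < 0 {
		delta = -delta
	}
	if delta > toleranceSec {
		return ErrStale // rejects replays of old, validly signed events
	}
	want := []byte(Sign(secret, ts, payload))
	for _, s := range sigs {
		if hmac.Equal([]byte(s), want) { // constant-time comparison
			return nil
		}
	}
	return ErrMismatch
}
```

Verification must run over the raw request body bytes; re-serializing parsed JSON before hashing would change the bytes and break the signature.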

Phase 9 — Terminal service

Status: Completed (MVP scope)
Completed:
- terminal token service baseline:
  - packages/services/terminal/service.go implements token mint + Redis storage (terminal_token:{token}).
  - token consume path uses atomic Redis GETDEL for single-use enforcement.
  - allocation ownership + active-state checks are enforced before token mint.
- API wiring:
  - POST /api/v1/allocations/{allocation_id}/terminal-token implemented in cmd/api/routes.go.
  - config support for TTL via TERMINAL_TOKEN_TTL_SECONDS (default 300).
- tests:
  - packages/services/terminal/service_test.go covers mint/consume, single-use, ownership/state validation, allocation mismatch.
  - cmd/api/routes_test.go covers success and inactive-allocation conflict mapping.
- websocket terminal proxy:
  - /ws/terminal/{allocation_id} is served by the terminal gateway after token auth/consume.
  - gateway creates a node stream binding, enqueues node-agent terminal open, and proxies stream frames.
  - terminal control frames are emitted for session_ready, session_error, and session_closed.
- terminal integration coverage:
  - added concurrent single-use consume race test to verify only one terminal token consume succeeds under contention.
  - terminal service exposes snapshot counters for successful consumes and replay-rejected consumes.

Remaining:
- none for MVP scope
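
The single-use guarantee that GETDEL provides (read and delete in one atomic step, so only one consumer can win) can be modeled in memory. The sketch below replaces Redis with a mutex-guarded map; TokenStore and its methods are hypothetical names, and burning the token on an allocation mismatch is an assumption about behavior the ledger does not specify.

```go
package main

import "sync"

// TokenStore is an in-memory stand-in for terminal_token:{token} keys.
type TokenStore struct {
	mu     sync.Mutex
	tokens map[string]string // token -> allocation ID
}

// NewTokenStore returns an empty store.
func NewTokenStore() *TokenStore {
	return &TokenStore{tokens: make(map[string]string)}
}

// Mint stores a token bound to an allocation.
func (s *TokenStore) Mint(token, allocationID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.tokens[token] = allocationID
}

// Consume reads and deletes the token in one critical section, mirroring
// GETDEL. The token is removed even when the allocation does not match.
func (s *TokenStore) Consume(token, allocationID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	got, ok := s.tokens[token]
	if !ok {
		return false
	}
	delete(s.tokens, token)
	return got == allocationID
}

// ConcurrentConsumes races n goroutines on one token and counts the winners.
func ConcurrentConsumes(s *TokenStore, token, allocationID string, n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	wins := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if s.Consume(token, allocationID) {
				mu.Lock()
				wins++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return wins
}
```

A separate GET followed by DEL would leave a window in which two connections both read the token; collapsing the two into one atomic operation is what closes that race.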

Phase 11 — Admin service

Status: Completed (MVP scope)
Completed:
- GET /api/v1/admin/overview
- GET /api/v1/admin/users and GET /api/v1/admin/users/{user_id}
- POST /api/v1/admin/users
- POST /api/v1/admin/users/{user_id}/balance
- POST /api/v1/admin/users/{user_id}/refunds
- GET /api/v1/admin/allocations
- POST /api/v1/admin/allocations/{allocation_id}/force-release
- GET /api/v1/admin/nodes, POST /api/v1/admin/nodes, POST /api/v1/admin/nodes/{node_id}/probe, DELETE /api/v1/admin/nodes/{node_id}
- GET /api/v1/admin/audit-logs, GET /api/v1/admin/audit-logs/export
- GET /api/v1/admin/payments/sessions

Remaining:
- none for MVP scope

CI / Governance enforcement during build

Status: In Progress
Completed:
- e05784f CI scripts enforce go vet and token-query contract guard
- f25c00d integration smoke gate:
  - scripts/ci/integration_smoke.sh runs selected -tags integration suites
  - make test-integration-selected target added for local + CI reuse
  - .gitlab-ci.yml includes integration_smoke stage job
- 66f3078 integration smoke scope expanded to include billing-worker integration suite
- integration smoke scope expanded to include API integration suite:
  - make test-integration-selected now includes ./cmd/api (//go:build integration route tests).
  - local/CI integration smoke now continuously validates admin DB-backed route behavior.
- bf6f2d9 contract lint runtime enforcement:
  - contracts_validate.sh now runs Spectral/AsyncAPI via native CLI or npx fallback
  - CI enforces contract-lint tool presence (REQUIRE_CONTRACT_LINT_TOOLS=1)
- b242705 strict AsyncAPI validation enabled:
  - AsyncAPI contract updated to validator-clean 2.6 structure (removed unsupported address keys, added metadata/message IDs)
  - contracts_validate.sh now defaults ALLOW_ASYNCAPI_VALIDATE_FAILURE to 0 (blocking mode)
  - .gitlab-ci.yml enforces blocking AsyncAPI validation (ALLOW_ASYNCAPI_VALIDATE_FAILURE=0)
- OpenAPI lint signal cleanup:
  - Spectral policy updated to match project conventions (global security model and external behavior docs).
  - OpenAPI metadata/errors/idempotency coverage updated on flagged endpoints.
  - OpenAPI lint now reports zero warnings/errors under doc/governance/openapi.spectral.yaml.
- 8b75ff0 local GitLab parity tooling:
  - added scripts/ci/gitlab_local_dry_run.sh to execute CI gates in .gitlab-ci.yml order.
  - added local GitLab compose/runbook (operations/local-dev/docker-compose.gitlab.yaml, operations/local-dev/GitLab_Local_Setup.md).
  - make ci-local-dry-run target added; dry run verified green.
- generated artifact cleanliness enforcement:
  - scripts/ci/sdk_codegen_smoke.sh now fails when generated trees are dirty in strict mode.
  - .gitlab-ci.yml sets CODEGEN_ENFORCE_CLEAN=1 for hosted CI parity.
- ops evidence gate:
  - added scripts/ops/parallel_ops_evidence_check.sh and make ops-parallel-evidence-check.
  - added scripts/ci/ops_evidence_gate.sh and wired it into gitlab_local_dry_run.sh.
- asyncapi validation noise reduction:
  - contracts_validate.sh now exports SUPPRESS_NO_CONFIG_WARNING=1 for asyncapi validate calls.
- integration smoke default DB wiring:
  - integration_smoke.sh now defaults to local dev DATABASE_URL and runs selected integration suites when DB is reachable.
  - skip behavior is now limited to unreachable DB instead of unset env.
- security scan gate hardening:
  - security_scans.sh now runs tool-aware scans when available (govulncheck, trivy, gitleaks), with explicit warnings when missing.
- contract breaking-change guard hardening:
  - contracts_breaking_change.sh now detects when API contract files changed but diff tools are missing.
  - optional strict mode added via REQUIRE_BREAKING_DIFF_TOOLS=1.
- dockerized web runtime guard:
  - added scripts/ops/web_container_smoke.sh and wired it into gitlab_local_dry_run.sh (when Docker is available).
  - local web compose now isolates .next as container volume to prevent missing-chunk dev runtime errors.

Remaining:
- none (continue keeping CI strict-mode flags enabled)

Demo Readiness

Status: In Progress
Completed:
- scripts/demo_smoke.sh added for local same-day walkthrough (health/auth/skus/nodes/create/get/list/release allocation).
- operations/local-dev/Demo_Runbook.md added with quickstart and realtime validation pointers.
- local compose/runtime hardening:
  - exposed API host port in operations/local-dev/docker-compose.app.yaml (${API_PORT}:8080).
  - exposed webhook worker host port in operations/local-dev/docker-compose.app.yaml (${WEBHOOK_PORT}:8082).
  - aligned webhook worker signature-secret env var with runtime config (STRIPE_WEBHOOK_SECRET).
  - removed obsolete compose version key warning from app compose file.
  - fixed JWKS cache init options in middleware/auth service to avoid startup crash.
- demo smoke resiliency:
  - added API readiness wait loop.
  - auto-applies schema/seed on fresh local DB (AUTO_INIT_DB=1).
  - default output switched to compact list summaries to keep unattended logs readable; full payload mode available via DEMO_VERBOSE=1.
  - local token flow now pins Keycloak host header (KC_HOST_HEADER) so token issuer matches API JWKS issuer in compose network.
  - local OIDC subject bootstrap now ensures a matching users.id record exists for protected endpoint walkthrough.
  - release step now conditionally skips when allocation is not yet active (expected in async flow).
- demo path now provisions through active-state locally:
  - smoke script bootstraps a demo admin node through admin API and creates allocations with explicit node_id.
  - provisioning worker local mode supports PROVISIONING_RUNTIME_MODE=noop to exercise async lifecycle without real SSH side effects.
  - smoke run verifies requested -> active, terminal-token mint, and release request (active -> releasing).
  - smoke run now includes terminal websocket upgrade/control-frame check via scripts/ws_terminal_smoke.go.

Frontend UX Foundation (Slice 0)

Status: In Progress
Completed:
- 0f015ca UX planning hardening:
  - added doc/product/UX_Execution_Plan.md
  - hardened Slice 1/2 mocks and execution order.
- current scaffold:
  - added packages/web/src/components/system/* shared UI states and modal primitives.
  - added packages/web/src/components/a11y/* helpers (FocusTrap, LiveRegion, keyboard shortcut hook).
  - added packages/web/src/lib/api/rateLimit.ts (Retry-After parsing/countdown) and packages/web/src/lib/api/errors.ts.
  - added package docs and barrel exports (packages/web/README.md, packages/web/src/index.ts).
- frontend test harness baseline:
  - added vitest + jsdom + @testing-library/react in packages/web.
  - added vitest.config.ts and test setup bootstrap.
  - added initial tests for:
    - src/lib/api/rateLimit.ts
    - src/lib/api/errors.ts
    - restricted-state rendering for /notifications and /storage.
  - added component tests for:
    - RateLimitedState countdown and retry enabled behavior.
    - FocusTrap initial focus and tab-cycle wrapping behavior.
  - added page-level error-path tests:
    - /storage renders mapped ErrConflict and ErrNotFound error states.
    - /notifications renders websocket error state fallback.
  - added route-level 429 countdown tests:
    - /auth/login reads and renders Retry-After header as countdown.
    - /allocations/{allocation_id} now uses API Retry-After when available and falls back to 30s.
  - added admin mutation-path tests:
    - /admin/allocations force-release requires reason before submit.
    - /admin/nodes delete action disabled for in-use nodes.
  - added admin user mutation error handling + tests:
    - /admin/users/{user_id} balance adjust/refund actions now surface mapped API errors.
    - tests cover ErrConflict and ErrNotFound mutation error rendering.
  - added remaining admin mutation error handling + tests:
    - /admin/nodes create/probe/delete actions now surface mapped API errors.
    - /admin/allocations force-release action now surfaces mapped API errors.
    - tests cover conflict/not-found mutation failures on both routes.
  - added billing/admin-payments edge-path tests:
    - /billing Stripe return query handling (?canceled=true) and CSV export action behavior.
    - /admin/payments/sessions status-filter call shaping (all -> credited).
    - /admin/audit-logs action-filter and cursor pagination call shaping.
  - added payment action edge-path hardening:
    - /billing checkout/customer-portal actions now surface mapped API errors.
    - tests cover action call dispatch and checkout failure UX rendering.
  - added terminal panel route-level tests:
    - validates no-session error path.
    - validates mint failure error path.
    - validates websocket control-frame rendering (session_ready).
    - validates terminal input send and explicit disconnect behavior.

Remaining:
- Keep frontend smoke gate in sync as web package test/build commands evolve.

Frontend Slice 1 (Auth + Marketplace)

Status: In Progress
Completed:
- Next.js app shell baseline implemented in packages/web:
  - app/layout.tsx, app/globals.css, app/page.tsx
  - package/runtime config (package.json, tsconfig.json, next.config.mjs)
- Auth route baseline:
  - /auth/login with OIDC authorize-start request + PKCE verifier staging + error/rate-limit states.
  - /auth/callback wired to POST /api/v1/auth/oidc/exchange, session bootstrap, and redirect.
- Marketplace route baseline:
  - /marketplace authenticated typed fetch for SKUs/nodes using browser session token.
  - sold-out SKU rendering + estimate-first provision modal UX.
  - real POST /api/v1/allocations wiring from provision modal with async acceptance and route handoff:
    - single accepted request -> /allocations/{allocation_id}
    - multi accepted requests -> /allocations
  - added allocations route scaffolds:
    - /allocations
    - /allocations/{allocation_id}
- Added additional UX parity routes:
  - /schedulers route as the current scheduler surface baseline (detailed managed scheduler UX remains future work).
  - /settings/profile route with GET /api/v1/users/me.
- Validation:
  - pnpm typecheck (web) green.
  - pnpm build (web) green.

Remaining:
- Add web test runner and component-level tests.
- Replace allocation list/detail scaffolds with Slice 2 lifecycle polling + terminal/release controls.

Frontend Slice 2 (Allocations + Terminal Baseline)

Status: In Progress
Completed:
- Implemented /allocations page with authenticated allocation list fetch (GET /api/v1/allocations).
- Implemented /allocations/{allocation_id} page with polling detail fetch (GET /api/v1/allocations/{allocation_id}).
- Wired release action (POST /api/v1/allocations/{allocation_id}/release) through confirmation modal.
- Wired terminal-token action (POST /api/v1/allocations/{allocation_id}/terminal-token) into the terminal UX path.
- Replaced the earlier token display-only step with the terminal panel component:
  - mints single-use token and connects browser WS via Sec-WebSocket-Protocol.
  - renders terminal control frames (session_ready, session_error, session_closed).
  - supports reconnect/remint and interactive input send.
- Added connection info + copyable SSH command rendering when allocation has connection metadata.
- Added full prototype-intent action row on allocation detail:
  - Metrics (config-driven deep link via NEXT_PUBLIC_METRICS_BASE_URL)
  - Console (terminal panel)
  - Key (routes to user key-management/help flow)
  - Release (confirmed async release)
- Added user key-management guidance copy on allocation detail.
- allocations list operator controls:
  - server-driven status filter (listAllocations query parameter) for active/released/etc views.
  - local search on allocation id/sku/node id for quick narrowing without extra round-trips.
  - visible "showing x of y" counter and edge tests for status-query + local search behavior.
- Validation:
  - pnpm typecheck (web) green.
  - pnpm build (web) green.

Remaining:
- continue lifecycle UX refinement for richer per-step progress and action disablement patterns.

Frontend Slice 3 (Billing + Notifications Baseline)

Status: In Progress
Completed:
- Implemented /billing page with contract-backed flows:
  - GET /api/v1/billing/balance
  - GET /api/v1/billing/usage
  - GET /api/v1/billing/usage/csv (export trigger)
  - POST /api/v1/payments/checkout-session
  - POST /api/v1/payments/customer-portal-session
- Added Stripe return-param handling on billing route (session_id / canceled).
- Added top-nav notification bell/panel:
  - websocket subscribe to /ws/notifications
  - browser auth via Sec-WebSocket-Protocol (bearer, <access_token>)
  - in-app notification list with deep-link action support.
- d8d172f dedicated notifications center route:
  - added /notifications page for full list management.
  - introduced shared websocket notification store hook (useNotifications) reused by bell + route.
  - added dismiss/clear behavior and local persistence for retained notifications.
- Added global persistent low-balance banner in app layout:
  - periodic balance polling on authenticated sessions
  - warning/depleted variants with billing CTA.
- Validation:
  - pnpm typecheck (web) green.
  - pnpm build (web) green.

Remaining:
- Refine read/unread persistence semantics once notification retention policy is finalized.

Frontend Slice 5 (Storage Baseline)

Status: In Progress
Completed:
- 0546ef3 storage UI baseline:
  - added /storage page with contract-backed flows:
    - GET /api/v1/storage/list
    - GET /api/v1/storage/download
    - PUT /api/v1/storage/upload
    - POST /api/v1/storage/mkdir
    - POST /api/v1/storage/rename
    - DELETE /api/v1/storage/delete
  - added storage API client methods in packages/web/src/lib/api/client.ts.
  - added top-nav Storage route link.
- storage operator controls:
  - local type filter (all/file/dir) and name search added.
  - visible "showing x of y" counter for directory triage.
  - tests cover filter behavior without additional API calls.
- Validation:
  - pnpm build (web) green.
  - pnpm typecheck (web) green.

Remaining:
- None for MVP scope beyond backend-level S3 swap.

Frontend Slice 4 (Admin Surfaces Baseline)

Status: In Progress
Completed:
- Implemented /admin/overview:
  - GET /api/v1/admin/overview
  - auto-refresh every 5s with pause/resume and last-updated indicator.
- Implemented /admin/allocations:
  - GET /api/v1/admin/allocations with status filter.
  - POST /api/v1/admin/allocations/{allocation_id}/force-release with required reason via confirmation modal.
- Implemented /admin/nodes:
  - GET /api/v1/admin/nodes
  - POST /api/v1/admin/nodes
  - POST /api/v1/admin/nodes/{node_id}/probe
  - DELETE /api/v1/admin/nodes/{node_id} with occupancy guard in UI.
- admin nodes operator controls:
  - local status filter (all/online/offline/maintenance) and search (host/node id/sku) added.
  - visible "showing x of y" counter for quick triage on large node sets.
  - tests cover local filter behavior in addition to existing mutation guards.
- Added admin navigation link in app layout.
- admin users list hardening:
  - create-user mutation failures now render mapped API errors with correlation id.
  - local operator controls added for quick filtering without extra API calls:
    - search by username/user id
    - role filter (all/user/admin)
    - visible "showing x of y" counter
  - edge tests cover cursor pagination, create-user error mapping, and local filter behavior.
- admin user action-history context:
  - admin user detail now includes direct context links to:
    - /admin/allocations?user_id=<id>
    - /admin/payments/sessions?user_id=<id>
    - /admin/audit-logs?actor_user_id=<id>
  - tests validate link targets for consistent operator navigation.
  - target admin pages now consume seeded query filters at route entry:
    - allocations: user_id, status
    - payment sessions: user_id, status
    - audit logs: actor_user_id, action
  - edge tests cover query-seeded API call shaping.
- Validation:
  - pnpm typecheck (web) green.
  - pnpm build (web) green.

Remaining:
- Keep per-screen filter ergonomics aligned as new routes are added.

Parallel Ops Evidence

Status: In Progress

Completed:

  • backup/restore drill baseline:
    • added scripts/ops/backup_restore_smoke.sh for repeatable local Postgres dump+restore validation.
    • added make ops-backup-restore-smoke convenience target.
    • moved Parallel_Ops_Track item 3 (Backup/Restore/DR) to in_progress with concrete evidence link.
    • captured local rehearsal report artifact with observed restore duration/table checks.
    • latest local run refreshed evidence (restore_smoke_1771895409, 23 tables, 1s, success).
  • SLO/alert artifact baseline:
    • added baseline alert rule manifest at doc/operations/evidence/alert_rule_manifest_baseline.yaml.
    • added API hardening alerts for fail-open rate limiting, terminal token replay spikes, and notification websocket write errors.
    • added simulation playbook/report artifacts for repeatable validation.
  • outbox payload minimization guard baseline:
    • added scripts/ci/outbox_payload_guard.sh and wired it into scripts/ci/contracts_validate.sh.
  • runbooks/on-call artifact baseline:
    • added explicit on-call roster + escalation artifact.
    • added incident drill calendar/report template artifact.
    • completed baseline runbook coverage for provisioning, webhook, database failover, and incident communications.
  • east/west security artifact baseline:
    • added baseline network policy manifest for default-deny + explicit allow-list flows.
    • added TLS cert expiry check script + make target for repeatable verification.
  • key-rotation runbook baseline:
    • added unified runbook for planned rotation and compromise response across JWKS, terminal, control keys, and envelope keys.
  • data-growth guard baseline:
    • added scripts/ops/data_growth_check.sh with row/size threshold checks for high-growth tables.
    • added make ops-data-growth-check and evidence doc wiring.
    • latest local run refreshed row/size evidence for usage_records, ledger_entries, and audit_logs.
  • evidence-gate baseline:
    • added scripts/ops/parallel_ops_evidence_check.sh to validate required evidence artifacts for Parallel Ops items 1-5.
    • added make ops-parallel-evidence-check for local repeatable verification.
    • wired CI dry-run gate via scripts/ci/ops_evidence_gate.sh and scripts/ci/gitlab_local_dry_run.sh.
  • observability smoke baseline:
    • added scripts/ops/observability_smoke.sh and make ops-observability-smoke.
    • validates API/webhook metrics endpoint reachability + required metric names.
    • added local evidence report doc/operations/evidence/observability_local_smoke_report.md.

Security foundation (shared package)

Status: In Progress

Completed:

  • encryption envelope baseline (current commit):
    • doc/architecture/Encryption_Envelope_Spec.md defines canonical _enc JSON shape and AES-256-GCM baseline.
    • packages/shared/crypto/envelope.go implements envelope encrypt/decrypt + marshal helpers.
    • packages/shared/crypto/envelope_test.go adds round-trip and guardrail coverage.
  • storage path-safety baseline:
    • packages/shared/storagepath/path.go defines namespace-rooted path resolution (filepath.Clean + namespace escape checks).
    • packages/shared/storagepath/path_test.go adds traversal/absolute-path rejection tests.
  • envelope material stability:
    • provisioning worker now uses config-driven/stable envelope key material (ENVELOPE_KEY_B64, ENVELOPE_KEY_ID) instead of per-process random keys.
    • deterministic local-dev fallback keeps decrypt paths consistent across processes.
    • shared key loader moved to packages/shared/crypto/key_material.go and reused by provisioning + terminal.
  • node-probe SSRF guardrails:
    • inventory node create/probe flows now validate target host resolution and block unsafe targets before network dial.
    • deny set includes loopback/link-local/multicast/unspecified and metadata endpoint 169.254.169.254.
    • optional CIDR allowlist support via NODE_PROBE_ALLOWED_CIDRS for controlled admin probe scope.
    • API route tests added for denied-target responses (400) on admin node create/probe endpoints.
  • KMS key-source command hardening:
    • legacy provisioning-worker SSH command paths were removed with retirement of the SSH runtime.
    • remaining key-fetch command surfaces should be reviewed at the active terminal/app boundary instead of reintroducing worker-side private-key handling.
  • rate-limit fail-open observability baseline:
    • RateLimiter now records fail-open occurrences when Redis eval path errors.
    • unit test added to assert fail-open request path and counter increment behavior.
  • policy cache invalidation baseline:
    • PostgresClient now supports Invalidate(key) and InvalidateAll() for in-process policy cache eviction.
    • API process subscribes to Redis policy.invalidate.* and evicts local cache entries immediately.
    • protected API chain now mounts policy-backed RateLimiter middleware.
  • idempotency response-body sanitization baseline:
    • idempotency middleware now sanitizes JSON response bodies before storing replay payloads.
    • invalid/non-JSON response payloads are not persisted to idempotency_keys.response_body (fail-safe default).
    • tests added for sensitive-key redaction and bearer-token string scrubbing.
    • snapshot counters added for persisted/skipped/replay-served idempotency behavior (IdempotencySnapshot()).
  • JWKS break-glass baseline:
    • added JWKSAuth.ForceRefresh(ctx) for immediate cache refresh during incidents.
    • added doc/operations/runbooks/JWKS_Compromise_Breakglass_Runbook.md and linked it into secrets/key ops evidence.
    • API internal incident endpoint added: POST /internal/auth/jwks/refresh guarded by INTERNAL_JWKS_REFRESH_TOKEN.
  • scheduler metadata encryption baseline:
    • allocation create path now envelope-encrypts scheduler_request into allocations.scheduler_metadata.
    • unit test validates decryptable round-trip using shared envelope key material.
    • ERD/schema docs now enforce envelope encryption expectation for credential-bearing scheduler metadata.
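For readers unfamiliar with the idempotency response-body sanitization item, the following is a minimal sketch of the idea, not the middleware's actual code. The function names, the `sensitiveKeys` set, the redaction marker, and the bearer-token pattern are all assumptions made for illustration; only the behaviour (redact sensitive keys, scrub bearer-token strings, refuse to persist non-JSON payloads) comes from the note above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// sensitiveKeys is a hypothetical redaction list; the real key set is not
// given in this ledger.
var sensitiveKeys = map[string]bool{
	"password": true, "token": true, "access_token": true, "secret": true,
}

// bearerRe matches "Bearer <token>" substrings inside string values (assumed pattern).
var bearerRe = regexp.MustCompile(`(?i)bearer\s+[A-Za-z0-9._~+/=-]+`)

const redacted = "[REDACTED]"

// redact walks a decoded JSON value, replacing sensitive keys and scrubbing
// bearer tokens found in string values.
func redact(v any) any {
	switch t := v.(type) {
	case map[string]any:
		for k, val := range t {
			if sensitiveKeys[strings.ToLower(k)] {
				t[k] = redacted
			} else {
				t[k] = redact(val)
			}
		}
		return t
	case []any:
		for i := range t {
			t[i] = redact(t[i])
		}
		return t
	case string:
		return bearerRe.ReplaceAllString(t, "Bearer "+redacted)
	default:
		return v
	}
}

// sanitizeForReplay returns the sanitized body and true, or nil and false when
// the payload is not valid JSON and therefore should not be persisted.
func sanitizeForReplay(body []byte) ([]byte, bool) {
	var v any
	if err := json.Unmarshal(body, &v); err != nil {
		return nil, false
	}
	out, err := json.Marshal(redact(v))
	if err != nil {
		return nil, false
	}
	return out, true
}

func main() {
	out, ok := sanitizeForReplay([]byte(`{"user":"a","token":"t0p","note":"sent Bearer abc.def"}`))
	fmt.Println(string(out), ok)

	_, ok = sanitizeForReplay([]byte("<html>oops</html>"))
	fmt.Println("non-JSON persisted:", ok)
}
```

Returning a boolean instead of an error keeps the fail-safe default explicit at the call site: a caller that ignores the flag has nothing to store.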

Remaining:

  • wire shared crypto helper into storage runtime secret paths if storage metadata starts carrying credential material.

API slice hardening (admin + payments)

Status: In Progress

Completed:

  • admin route wiring (DB-backed):
    • registered and implemented GET /api/v1/admin/overview.
    • registered and implemented GET/POST /api/v1/admin/users.
    • registered and implemented GET /api/v1/admin/users/{user_id}.
    • registered and implemented POST /api/v1/admin/users/{user_id}/balance.
    • registered and implemented POST /api/v1/admin/users/{user_id}/refunds.
    • registered and implemented GET /api/v1/admin/payments/sessions.
    • registered and implemented GET /api/v1/admin/audit-logs.
    • registered and implemented GET /api/v1/admin/audit-logs/export.
  • auth/role guard ordering:
    • admin handlers now enforce auth/role checks before dependency health checks, so non-admin requests deterministically return 403 instead of surfacing infra state.
  • route guardrail tests:
    • added table-driven coverage for all new admin endpoints:
      • non-admin claims -> 403.
      • admin claims with missing DB dependency -> 503.
  • integration route coverage (DB-backed):
    • added cmd/api/routes_integration_test.go (//go:build integration) to exercise admin HTTP handlers against real Postgres schema:
      • GET /api/v1/admin/overview
      • POST /api/v1/admin/users
      • GET /api/v1/admin/users/{user_id}
        • includes malformed-id (not-a-uuid) 400 contract assertion
      • GET /api/v1/admin/users (pagination envelope shape)
      • GET /api/v1/admin/payments/sessions (filter by user_id, status)
      • GET /api/v1/admin/audit-logs (filter by actor_user_id, action)
      • GET /api/v1/admin/audit-logs/export (CSV header + row presence)
    • validated with:
      • DATABASE_URL=... go test -tags integration ./cmd/api -count=1
  • admin mutation error classification hardening:
    • adminAdjustBalanceHandler / adminCreateRefundHandler now classify DB failures by PG error code:
      • 23503 -> 404 (user not found)
      • 22P02 -> 400 (invalid user id)
      • other DB errors -> 500
  • admin path-parameter validation hardening:
    • adminGetUserHandler, adminAdjustBalanceHandler, and adminCreateRefundHandler now validate user_id as UUID before dependency/DB access.
    • malformed user_id now returns deterministic 400 invalid user id.
    • route tests added for invalid-UUID cases on balance/refund admin mutations.
    • OpenAPI updated to declare 400 BadRequest for:
      • GET /api/v1/admin/users/{user_id}
      • POST /api/v1/admin/users/{user_id}/balance
  • local CI contract parity:
    • ran scripts/ci/gitlab_local_dry_run.sh end-to-end after route changes and codegen sync; all gates pass.
    • added optional Playwright stage to local dry-run (RUN_WEB_E2E=1) with automatic stack up/down via scripts/ci/frontend_e2e.sh.
    • validated full parity path with RUN_WEB_E2E=1 RUN_WEB_CONTAINER_SMOKE=1 bash scripts/ci/gitlab_local_dry_run.sh (green).
  • raised e2e gate strictness:
    • scripts/ci/gitlab_local_dry_run.sh now runs frontend e2e by default (RUN_WEB_E2E=1).
    • .gitlab-ci.yml now includes dedicated frontend_e2e job in build_test stage.
  • codegen drift fix:
    • synced generated OpenAPI artifacts:
      • packages/shared/gen/openapi_types.gen.go
      • packages/web/src/lib/gen/openapi.types.ts
    • revalidated with go test ./... and pnpm --dir packages/web typecheck.
  • authenticated UI flow regression coverage:
    • expanded Playwright OIDC coverage in packages/web/e2e/auth-login.spec.ts:
      • post-login route access checks for /allocations, /storage, /notifications.
      • post-login admin route access check for /admin/users.
      • post-login admin overview route access check for /admin/overview.
      • mock billing checkout redirect path remains covered.
    • marked OIDC suite serial to avoid shared-session race flakes against a single local Keycloak/API stack.
  • added persona click-through smoke suite packages/web/e2e/persona-smoke.spec.ts, covering user/admin nav journeys:
    • user journey nav walk: marketplace -> allocations -> billing -> storage -> notifications -> schedulers -> settings
    • admin journey nav walk: overview -> users -> nodes -> allocations -> audit logs -> payment sessions
    • explicit guard assertion that generic fallback errors (Request failed / Something went wrong) do not appear.
  • allocations page-shell consistency:
    • packages/web/app/allocations/page.tsx now preserves the page header/shell (My Allocations) across loading/restricted/error/empty states.
    • added unit coverage in packages/web/app/allocations/page.test.tsx for empty-state and restricted-state heading visibility.
  • billing page-shell consistency:
    • packages/web/app/billing/page.tsx now preserves the page header/shell (Billing) across loading/restricted/error states.
    • added unit coverage in packages/web/app/billing/page.edge.test.tsx for restricted-state heading visibility.
  • storage page-shell consistency:
    • packages/web/app/storage/page.tsx now preserves the page header/shell (My Storage) across restricted/loading/error states.
    • added unit coverage in packages/web/app/storage/page.test.tsx to assert heading visibility when unauthenticated.
  • notifications page-shell consistency:
    • packages/web/app/notifications/page.tsx now preserves the page header/shell (Notifications) across restricted/error states.
    • added unit coverage in packages/web/app/notifications/page.test.tsx and packages/web/app/notifications/page.error-state.test.tsx for heading visibility.
  • admin overview page test coverage:
    • added packages/web/app/admin/overview/page.edge.test.tsx with coverage for:
      • successful metric rendering
      • non-admin restricted behavior
      • mapped API-error rendering with correlation ID
      • pause/resume auto-refresh toggle behavior
      • backend 401 response path now enforces restricted state (session-expiry regression guard)
  • API error mapping hardening:
    • packages/web/src/lib/api/errors.ts now supports both nested (error.code) and canonical flat (code) error envelopes.
    • added coverage for flat middleware auth payload (token_missing) to prevent generic "Request failed" UX fallback.
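The admin mutation error classification and path-parameter validation items above amount to a small mapping, sketched here under assumptions. `isUUID` and `statusForPGCode` are hypothetical helpers, and the sketch takes the SQLSTATE code as a plain string rather than assuming a particular Postgres driver's error type.

```go
package main

import (
	"fmt"
	"net/http"
	"regexp"
)

// uuidRe accepts the canonical 8-4-4-4-12 hex form (assumed validation rule).
var uuidRe = regexp.MustCompile(`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// isUUID reports whether s looks like a UUID, so malformed ids can be
// rejected with 400 before any dependency or DB access.
func isUUID(s string) bool {
	return uuidRe.MatchString(s)
}

// statusForPGCode maps a PostgreSQL SQLSTATE code to the HTTP status
// listed in the ledger entry.
func statusForPGCode(code string) int {
	switch code {
	case "23503": // foreign_key_violation: treated as user not found
		return http.StatusNotFound
	case "22P02": // invalid_text_representation: treated as invalid user id
		return http.StatusBadRequest
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	fmt.Println(isUUID("not-a-uuid"))
	fmt.Println(isUUID("123e4567-e89b-12d3-a456-426614174000"))
	for _, code := range []string{"23503", "22P02", "40001"} {
		fmt.Println(code, "->", statusForPGCode(code))
	}
}
```

Checking the UUID before touching the database is what makes the 400 deterministic: without it, the same malformed id could surface as 22P02 from Postgres or as 503 when the dependency is down.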

Current Next Queue (unattended)

  1. Execute and attach concrete staging evidence for doc/operations/Parallel_Ops_Track.md items 1-5 (launch-critical), then flip each from in_progress to done.
  2. Add policy invalidation publisher on admin policy mutation endpoints once policy-management APIs land.
  3. Keep CI parity strict and re-run make ci-local-dry-run on every service slice merge.
  4. Track MAAS/cloud-init provisioning visibility as a release-candidate gap under A-MAAS-BOOTSTRAP-PROGRESS-VISIBILITY-001:
     • emit explicit phase markers for site bootstrap, node bootstrap, and node-agent enrollment
     • surface those markers through MAAS-visible status/events when reachable
     • also expose the same progress in GPUaaS onboarding detail/events so operators can distinguish slow package installs from real bootstrap or enrollment failure without SSHing to the node
     • short term on the MAAS side, poll curated INFO events during wait-loop ticks and filter them to the current workflow/stage window
     • long term, move MAAS progress collection to a shared site-level refresher if reimage/onboarding concurrency grows
     • separate the remaining long-running site_bootstrap interior gap under A-MAAS-SITE-BOOTSTRAP-MIDPHASE-PROGRESS-001 so the missing package-install / Lambda-stack / DOCA-OFED callbacks do not get lost in the broader visibility thread
     • track OCI site-bootstrap publish/profile-reference discipline separately under A-MAAS-SITE-BOOTSTRAP-BUNDLE-RELEASE-DISCIPLINE-001
     • track a cleaner admin lifecycle/progress read surface separately under A-MAAS-ADMIN-LIFECYCLE-PROGRESS-SURFACE-001
  5. Track MAAS workflow UI ordering cleanup under A-MAAS-WORKFLOW-TIMELINE-ORDERING-001:
     • make Stage Progress use explicit workflow stage order instead of first-seen timestamp order
     • make Current Attempt / timeline rows deterministic when multiple events land in the same second
     • keep GPUaaS workflow detail ordering aligned with Temporal execution order during incidents

2026-04-13 Progress Note

  • App-runtime validation reached a first stable platform-control state across:
  • Slurm Reference,
  • Self-managed Kubernetes (RKE2),
  • JupyterLab launchable OCI workload.
  • Launchable OCI / JupyterLab:
  • commit 149a244f added curated JupyterLab runtime image definitions for CPU, NVIDIA H200/CUDA, and AMD ROCm variants.
  • node-agent bootstrap now installs Docker when no approved OCI runtime is present and auto-configures nvidia-container-toolkit when nvidia-smi is available.
  • app-runtime now sends bounded GPU requests (gpu_request.kind=count) and node-agent maps them to Docker --gpus N.
  • platform-control pipeline 593 deployed the slice successfully.
  • H200 validation launched JupyterLab with gpu_count=1; Docker inspect reported device request Count: 1, and nvidia-smi inside the container showed one visible NVIDIA H200.
  • Slurm Reference:
  • commit 5491774f fixed stale workload health by reporting completed Slurm bootstrap as healthy instead of progressing.
  • platform-control pipeline 595 deployed the fix successfully.
  • existing Slurm instance was corrected through the public report API and now reports status=running, health_status=healthy, phase=slurm_bootstrap_completed.
  • RKE2:
  • platform-control validation is stable for the first single-server path.
  • remaining product decisions are storage/CSI, external exposure, kubeconfig privilege/copy UX, repair/reconcile, and cleanup semantics.
  • Follow-up work after first stable validation:
  • added scripts/ops/app_runtime_first_slice_smoke.sh as a configurable local-kind/platform-control smoke entry point. It can observe existing running Slurm/RKE2/Jupyter instances, or create/wait/access-check/decommission instances when placement/artifact inputs are supplied by the operator.
  • added doc/product/Jupyter_Package_Install_v1.md for package version constraints, ephemeral install semantics, failure reporting, and the derived image follow-up.
  • node-agent now returns a bounded pip install log excerpt for launchable OCI workloads and the workload Events tab surfaces it with the installed package count.
  • added RKE2 kubeconfig copy support in the workload Access tab.
  • confirmed launchable OCI Access already documents private SSH tunnel access and the future platform-proxy mode; no extra item remains there for the first slice.
  • strengthened local-kind controller refresh handling by stamping Slurm and RKE2 controller pod templates with build metadata and by deploying the RKE2 controller from the parity validation flow, reducing stale same-tag image and stale runtime-config drift.
  • Remaining next work:
  • validate the new single-node vllm-openai Docker Compose app profile in local kind with CPU-safe inputs, then promote to platform-control and run an H200/GPU smoke with a production-sized model selection,
  • finish node-agent lifecycle upgrade delivery so existing nodes do not need manual rebootstrap for host prerequisites and binary updates,
  • keep RKE2 storage/CSI and external exposure as explicit infra/product open items until infra confirms ownership and the supported route.
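The bounded GPU request mapping described in this note (gpu_request.kind=count mapped to Docker --gpus N) might look roughly like the sketch below. The struct shape, the `count` field name, the `nodeGPUs` bound, and the function name are assumptions for illustration; only the kind=count to --gpus N mapping comes from the note. CPU-only launches are assumed to omit the request entirely.

```go
package main

import (
	"fmt"
	"strconv"
)

// gpuRequest is a hypothetical shape for the app-runtime GPU request.
type gpuRequest struct {
	Kind  string `json:"kind"`
	Count int    `json:"count"` // assumed field name
}

// dockerGPUArgs turns a bounded count request into Docker CLI arguments.
// nodeGPUs is the number of GPUs on the node, used here as the upper bound.
func dockerGPUArgs(req gpuRequest, nodeGPUs int) ([]string, error) {
	if req.Kind != "count" {
		return nil, fmt.Errorf("unsupported gpu_request.kind %q", req.Kind)
	}
	if req.Count < 1 || req.Count > nodeGPUs {
		return nil, fmt.Errorf("gpu count %d outside 1..%d", req.Count, nodeGPUs)
	}
	return []string{"--gpus", strconv.Itoa(req.Count)}, nil
}

func main() {
	args, err := dockerGPUArgs(gpuRequest{Kind: "count", Count: 1}, 8)
	fmt.Println(args, err)

	_, err = dockerGPUArgs(gpuRequest{Kind: "count", Count: 9}, 8)
	fmt.Println(err)
}
```

Rejecting anything other than kind=count keeps an unbounded value such as `all` from reaching the Docker command line.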

2026-04-04 Progress Note

  • MAAS drift reconciliation first slice implemented locally:
  • provisioning-worker now ensures a Temporal schedule that runs the periodic MAAS reconciliation workflow/activity
  • MAAS service scan refreshes node_maas_state for MAAS-managed nodes and classifies drift such as ip_drift, machine_missing, released_outside_workflow, agent_offline, retired_node_still_deployed, and MAAS read failures
  • current periodic scope is intentionally limited to MAAS-managed nodes already active; non-active states should be checked by the owning lifecycle workflows/events rather than the broad fleet scan
  • current scan intentionally records report-only/candidate follow-up actions; automatic lifecycle execution remains blocked on the owning node-agent update path task
  • Separate MAAS discovery follow-up tracked:
  • phase 1 should be an explicit admin Discover action that produces an operator-reviewed MAAS candidate list with first-pass SKU inference, so admins can choose candidates from MAAS inventory instead of typing every onboarding input manually
  • long term, that candidate list can be refreshed by Temporal on a schedule once the discovery model is trusted, but adoption should remain operator-reviewed by default
  • MAAS operating-model boundary now clarified:
  • periodic Temporal reconciliation is for known MAAS-managed nodes already active
  • non-active MAAS states should be checked by their owning lifecycle workflows/events
  • unknown MAAS machines belong to discovery/adoption flow, not the steady-state reconciler
  • MAAS provisioning visibility gap is now tracked as a first-class backend task:
  • A-MAAS-BOOTSTRAP-PROGRESS-VISIBILITY-001
  • this should be treated as part of the current MAAS release-candidate quality bar, not as a later observability extra
  • Local frontend CI cleanup also ready to batch with the next push:
  • tenant-shared scheduler test now waits for the async-enabled request button before clicking