Execution Progress
Tracks implementation progress against doc/Implementation_Roadmap.md so we keep an auditable "done / next" list while coding unattended.
Document Role
- Purpose: live implementation ledger mapped to roadmap phases.
- Scope: completed commits, partial progress, remaining items, and active next queue.
- This does not replace phase definitions (doc/Implementation_Roadmap.md) or readiness gates (doc/Phase_Readiness_Tracker.md).
- Treat this file as append-only execution history plus progress notes; the current task source of truth is doc/governance/Agent_Work_Queue.yaml.
Last updated: 2026-04-14 (launchable OCI, Compose, and node-agent lifecycle docs refreshed)
2026-04-14 Progress Note
- Node-agent lifecycle:
- commit 97697190 delivered node-agent lifecycle self-update and passed platform-control CI.
- platform-control and local kind nodes were refreshed to the current agent generation; host bootstrap now owns Docker, Docker Compose, NVIDIA Container Toolkit, registry trust, and H200 site-bootstrap prerequisites for the current launchable OCI and Compose slices.
- remaining lifecycle backlog is full fleet reconciliation and read-model telemetry hardening, especially desired version/prerequisite drift, agent_version, agent_connected_at, runtime capability, and stale task/config cleanup.
- Launchable OCI and Docker Compose:
- commit 593a1690 platformized the manifest-owned Docker Compose topology path.
- JupyterLab remains the proven single-container launchable OCI reference app.
- vllm-openai is now the proven single-node Compose reference app; platform-control H200 validation launched Mistral Small 3.2 on one H200 and validated /v1/models plus /v1/chat/completions through private access.
- the current Compose product contract is curated renderer plus manifest topology, not arbitrary user-authored Compose YAML.
- App-developer documentation:
- refreshed doc/architecture/Build_an_App_for_GPUaaS_v1.md, doc/architecture/External_App_Team_Integration_Guide_v1.md, doc/architecture/Launchable_OCI_Workload_Profile_Contract_v1.md, and doc/architecture/App_Platform_Gap_Tracker_v1.md.
- the active backlog is now app-developer release packaging, Compose topology generalization, private credential/runtime-secret binding, platform-proxy exposure, persistent storage/derived environments, and fleet lifecycle reconciliation.
Tracking Rule
- Every implementation commit must be mapped to a roadmap phase here.
- Keep entries additive and append-only.
- Do not mark a phase done unless its tests pass and the wiring is runnable.
Latest Verification Sweep (2026-02-23)
Status: Passed
Evidence:
- go test ./... green (all packages)
- go vet ./... green
- make test-integration-selected green
- make ci-local-dry-run green (contract + reviewguard + codegen + build/test gates)
- scripts/demo_smoke.sh green (health/auth/catalog/nodes/create allocation/poll/release + terminal token + ws smoke)
Notes:
- AsyncAPI validator reports an informational recommendation to move to AsyncAPI 3.1.0; this is non-blocking in the current pipeline.
- CI dry-run integration-smoke stage is skipped when DATABASE_URL is unset in shell context; targeted integration suite executed successfully via make test-integration-selected.
Roadmap Mapping
Phase 1A — shared package tests
Status: Completed
Evidence:
- e1685c5 baseline shared tests (errors/events/middleware/policy)
- go test ./... green
Phase 1B — cmd/api + outbox relay wiring
Status: Completed (MVP scaffold)
Evidence:
- 1b2aef0 cmd/api health/config/server scaffold
- 70a7916 outbox relay runtime (Postgres claim + NATS publish)
- a142a74 outbox retry/backoff hardening + integration test (-tags integration)
- API local relay parity hardening:
- replaced cmd/api outbox ticker placeholder with shared packages/shared/outbox relay loop
- wired API process to outbox.NewPostgresStore + JetStream publisher with event-type allowlist
- added cmd/api/outbox_test.go coverage for supported/unsupported event-type gating
Notes:
- Dedicated relay (cmd/outbox-relay) is in place and compose-wired.
Phase 2 — Auth + Users
Status: In Progress
Completed:
- e13afa2 auth deny-list support (admin token revocation check)
- 996792d OIDC/public auth endpoint implementation scaffold:
- GET /api/v1/auth/oidc/authorize
- POST /api/v1/auth/oidc/exchange
- POST /api/v1/auth/personal/login (personal path)
- POST /api/v1/auth/token/refresh
- POST /api/v1/auth/logout
- GET /api/v1/users/me
- route split for public vs protected API paths in cmd/api
- 043470a auth provider-path tests:
- token endpoint success decode
- non-2xx provider response handling
- 3c1185f auth provider failure-mode coverage:
- invalid token endpoint JSON decode path
- missing token fields validation path
- empty refresh token guard
- missing OIDC exchange field guard
- provider transport error assertion
- 16b9a7f logout revocation semantics:
- POST /api/v1/auth/logout now revokes bearer token at OIDC provider revoke endpoint
- writes current admin token jti to Redis deny-list with bounded TTL
- auth/middleware deny-list supports explicit revoke operation + tests
- b242705 OIDC verification hardening:
- OIDC access token claims now come from JWKS-verified JWT parsing (signature + issuer + expiry validation), not raw payload decode
- JWKS verification retries once with forced refresh to cover key-rotation races
- direct JWKS HTTP fetch fallback added for cache bootstrap/refresh failure paths
- expanded auth tests for signed-token success, invalid signature rejection, and wrong-issuer rejection
- 5896e78 logout refresh-token revoke extension:
- POST /api/v1/auth/logout accepts optional refresh_token and revokes it at OIDC provider when supplied
- access-token revoke behavior preserved; deny-list revocation path unchanged
- route and auth service tests added for token-type hint correctness and invalid request body handling
Remaining:
- none for MVP scope
Phase 3 — Inventory
Status: Completed (MVP scope)
Completed:
- 99fa11b inventory service + API handlers:
- GET /api/v1/skus (active catalog with cursor/page_size)
- GET /api/v1/nodes (user-facing node summaries with status/region filters)
- caeff96 admin inventory extension:
- GET /api/v1/admin/nodes
- node occupancy projection in inventory service
- admin-role guard in API handler
- 8ec890b admin node operations:
- POST /api/v1/admin/nodes (SKU validation, duplicate detection, optional probe)
- POST /api/v1/admin/nodes/{node_id}/probe (status update online/offline)
- DELETE /api/v1/admin/nodes/{node_id} (soft-disable to offline, conflict if in use)
- route-level tests for admin auth and conflict/not-found behavior
- 16b9a7f node mutation audit logging:
- transactional audit_logs writes for create/probe/disable node operations
- metadata constrained to allowed keys (node_id, status_from, status_to)
Remaining:
- none for MVP scope
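Illustrative sketch of constraining audit metadata to the allowed keys listed under 16b9a7f (node_id, status_from, status_to). The function name and the choice to drop unknown keys silently, rather than reject the write, are assumptions:

```go
package main

import "fmt"

// allowedAuditKeys mirrors the key list recorded in this ledger.
var allowedAuditKeys = map[string]struct{}{
	"node_id":     {},
	"status_from": {},
	"status_to":   {},
}

// filterAuditMetadata returns a copy of in that contains only allowed keys.
// Hypothetical helper: the real service may reject unknown keys instead.
func filterAuditMetadata(in map[string]string) map[string]string {
	out := make(map[string]string, len(allowedAuditKeys))
	for k, v := range in {
		if _, ok := allowedAuditKeys[k]; ok {
			out[k] = v
		}
	}
	return out
}

func main() {
	meta := filterAuditMetadata(map[string]string{
		"node_id":     "node-1",
		"status_from": "online",
		"status_to":   "offline",
		"ssh_host":    "10.0.0.5", // not on the allow-list, so it is dropped
	})
	fmt.Println(len(meta), meta)
}
```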
Phase 4 — Provisioning lifecycle
Status: In Progress
Completed:
- dc6e0e4 provisioning worker pull-consumer runtime scaffold
- a142a74 create-allocation orchestration start (requested + outbox event)
- user allocation API completion for demo path:
- GET /api/v1/allocations (status filter + pagination envelope)
- GET /api/v1/allocations/{allocation_id}
- POST /api/v1/allocations/{allocation_id}/release (202 + outbox provisioning.releasing.requested)
- user-managed SSH key model introduced via /api/v1/ssh-keys endpoints (private-key download endpoint retired)
- orchestrator service exposes list/get/release methods with ownership checks.
- c2eee14 provisioning worker state-transition handlers:
- handle provisioning.requested -> allocation state update (requested/provisioning to provisioning/active)
- handle provisioning.releasing.requested -> released + emit provisioning.releasing.completed
- handle provisioning.force_release_requested -> releasing + emit provisioning.releasing.requested
- 735ba4d Temporal registration path:
- provisioning worker boots Temporal worker with dedicated task queue
- workflow registered (ProvisioningEventWorkflow) with activity handlers for requested/releasing/force-release paths
- configurable TEMPORAL_TASK_QUEUE with default provisioning-workflows
- provisioning state-machine hardening (current commit):
- HandleProvisionRequested now performs explicit runtime step + success/failure branching:
- success path transitions allocation to active, stores SSH material fields, emits provisioning.active
- failure path transitions allocation to failed, records failure_reason, emits provisioning.failed
- HandleReleasingRequested now performs explicit runtime release step + success/failure branching:
- success path transitions allocation to released, emits provisioning.releasing.completed
- failure path transitions allocation to release_failed, records release_failed_reason, emits provisioning.release_failed
- runtime abstraction added (NewServiceWithRuntime) so real SSH executor can be plugged in without handler changes
- unit tests expanded for SSH material generation helper and payload shape expectations
- provisioning private-key envelope alignment:
- generated SSH private key material is now serialized and encrypted via packages/shared/crypto before writing allocations.ssh_private_key_enc
- provisioning runtime activation:
- provisioning worker runtime selection is now limited to agent or noop.
- legacy SSH provisioning runtime has been retired and removed from the active codebase.
- node lifecycle execution is expected to flow through the node-agent task runtime when not in noop.
Remaining:
- harden production key-management integration at the app/runtime boundary without reintroducing provisioning-worker SSH private-key handling
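Illustrative summary of the success/failure branching described in the state-machine notes above, written as a lookup from handler and runtime result to next state and emitted event. The state and event names come from this ledger; the handler labels ("provision", "release", "force_release") and the outcome type are hypothetical:

```go
package main

import "fmt"

// outcome pairs the next allocation state with the outbox event to emit.
type outcome struct {
	state string
	event string
}

// transition maps a handler label and its runtime result to an outcome.
// The force-release path has no runtime step here, so ok is ignored for it.
func transition(handler string, ok bool) outcome {
	switch {
	case handler == "provision" && ok:
		return outcome{"active", "provisioning.active"}
	case handler == "provision":
		return outcome{"failed", "provisioning.failed"}
	case handler == "release" && ok:
		return outcome{"released", "provisioning.releasing.completed"}
	case handler == "release":
		return outcome{"release_failed", "provisioning.release_failed"}
	case handler == "force_release":
		return outcome{"releasing", "provisioning.releasing.requested"}
	}
	return outcome{} // unknown handler: no transition
}

func main() {
	for _, h := range []string{"provision", "release", "force_release"} {
		fmt.Println(h, transition(h, true), transition(h, false))
	}
}
```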
Phase 5 — Billing + ledger
Status: Completed (MVP scope)
Completed:
- 1a4e1c8 billing worker pull-consumer runtime scaffold
- 16b9a7f billing read path service + API handlers:
- GET /api/v1/billing/balance
- GET /api/v1/billing/usage (cursor/page_size + filters)
- GET /api/v1/billing/usage/csv
- handler + service unit tests for ordering/filter parsing/CSV formatting
- bd4e6df billing worker event handling implementation:
- provisioning.active creates idempotent open usage_records row
- provisioning.releasing.completed / provisioning.release_failed close usage window
- closeout computes usage debit and posts ledger_entries row (usage_debit) transactionally
- unit tests cover dispatch + usage-cost calculation
- 66f3078 billing closeout integration coverage:
- -tags integration test validates usage close, ledger debit creation, and idempotent repeat close
- integrated into make test-integration-selected
- billing low-balance/depletion policy logic:
- billing closeout now resolves policy-driven thresholds via packages/shared/policy.
- emits outbox events for billing.low_balance_warning, billing.auto_release_pending, and billing.balance_depleted.
- emits provisioning.force_release_requested events for all user active allocations once balance is depleted.
- unit tests added for policy fallback/override and projected depletion calculations.
Remaining:
- webhook reconciliation observability metrics + alert thresholds tuning in ops layer
Phase 6 — Notifications
Status: Partially Started
Completed:
- 3989917 notification relay NATS->Redis bridge + payload transforms
- 6700275 notification relay event coverage expansion:
- added end-to-end processMessage tests for all supported notification subjects.
- added malformed-envelope and Redis publish failure-path tests.
- notification WS hub baseline:
- packages/services/notification/ws.go implements Redis Pub/Sub -> WebSocket fanout.
- GET /ws/notifications route wired in cmd/api with JWT auth from Sec-WebSocket-Protocol (browser) or Authorization (non-browser).
- route enforces deny-list checks for admin tokens.
- tests added for route auth behavior and WS fanout path.
- notification reliability/test hardening:
- added route-level end-to-end test for /ws/notifications with real Redis pub/sub fanout and token parsing path.
- added lightweight WS hub counters (active_connections, forwarded_messages, write_errors) exposed via WSService.Snapshot().
- API observability export:
- cmd/api now exposes GET /api/v1/internal/stats (guarded by INTERNAL_STATS_TOKEN; disabled with 404 when unset).
- GET /metrics now exports Prometheus counters/gauges for:
- rate-limiter fail-open count
- idempotency persistence/replay counters
- terminal token consume/replay counters
- notification websocket connection/forward/error stats
- route tests added for internal stats auth/enablement and metrics payload baseline.
Remaining:
- persistence/audit policy for notifications (if promoted beyond best-effort)
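Illustrative sketch of the internal stats guard described above: the endpoint answers 404 when INTERNAL_STATS_TOKEN is unset and otherwise requires a matching bearer token. The middleware name, the 401 status for a wrong token, and the probe helper are assumptions. The constant-time comparison follows the Phase 7 note on the same endpoint:

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// internalStatsGuard wraps next with the token check. An empty token
// disables the endpoint entirely (404), so its existence is not revealed.
func internalStatsGuard(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if token == "" {
			http.NotFound(w, r)
			return
		}
		auth := r.Header.Get("Authorization")
		got := strings.TrimPrefix(auth, "Bearer ")
		// Constant-time compare avoids leaking how many bytes matched.
		if !strings.HasPrefix(auth, "Bearer ") ||
			subtle.ConstantTimeCompare([]byte(got), []byte(token)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// probe issues one in-process request and returns the HTTP status code.
func probe(token, authHeader string) int {
	h := internalStatsGuard(token, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, `{"ok":true}`)
	}))
	req := httptest.NewRequest(http.MethodGet, "/api/v1/internal/stats", nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(probe("", "Bearer anything"))     // endpoint disabled
	fmt.Println(probe("s3cret", "Bearer wrong"))  // wrong token
	fmt.Println(probe("s3cret", "Bearer s3cret")) // authorized
}
```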
Storage service (MVP local backend)
Status: Completed (MVP scope)
Completed:
- local filesystem-backed storage service implemented in packages/services/storage/service.go
- traversal-safe namespace resolution integrated via packages/shared/storagepath
- API handlers wired in cmd/api/routes.go:
- GET /api/v1/storage/list
- GET /api/v1/storage/download
- PUT /api/v1/storage/upload
- POST /api/v1/storage/mkdir
- POST /api/v1/storage/rename
- DELETE /api/v1/storage/delete
- API process wiring in cmd/api/main.go + config support (STORAGE_ROOT_DIR)
- tests:
- packages/services/storage/service_test.go
- cmd/api/routes_test.go storage route coverage
Remaining:
- swap local backend with S3-compatible implementation for production target
Phase 7 — Payments/Webhooks
Status: In Progress
Completed:
- d4d4f91 webhook ingest service scaffold + stripe event persistence + outbox enqueue path
- 02d97e0 webhook hardening + finalize path:
- Stripe-style HMAC signature verification with timestamp tolerance
- checkout.session.completed payment session finalize flow
- ledger stripe_credit entry creation
- payment session status transition to credited
- payments.balance_credited outbox emission
- f479ace reconciliation and provider-edge handling:
- amount/currency mismatch path marks payment_sessions.status = failed_reconcile
- writes payments.reconcile_failed audit log for ops visibility
- provider lifecycle mapping for checkout.session.expired and checkout.session.async_payment_failed
- unit tests for provider event-state mapping
- 17adfa7 webhook integration coverage:
- integration test (-tags integration) verifies mismatch path updates session to failed_reconcile
- asserts no ledger credit is created on mismatch
- asserts payments.reconcile_failed audit log row is written
- a45c553 reconciliation alert routing:
- emits payments.reconcile_failed outbox event on mismatch path
- notification-relay consumes payments.reconcile_failed and publishes user notification
- AsyncAPI/API surface/event taxonomy/NATS consumer docs updated for the new contract
- webhook observability baseline:
- webhook worker now tracks in-memory counters for:
- received events
- signature failures
- invalid payload failures
- persistence failures
- successful processing
- reconcile-failed outcomes
- internal stats endpoint added: GET /api/v1/internal/stats
- internal stats endpoint now requires INTERNAL_STATS_TOKEN bearer auth and returns 404 when disabled.
- internal stats token comparison uses constant-time equality check.
- prometheus counter endpoint added: GET /metrics with webhook + reconcile failure counters.
- unit tests added for stats endpoint baseline and signature-failure counter behavior.
Remaining:
- alert thresholds tuning in ops layer
Phase 9 — Terminal service
Status: Completed (MVP scope)
Completed:
- terminal token service baseline:
- packages/services/terminal/service.go implements token mint + Redis storage (terminal_token:{token}).
- token consume path uses atomic Redis GETDEL for single-use enforcement.
- allocation ownership + active-state checks are enforced before token mint.
- API wiring:
- POST /api/v1/allocations/{allocation_id}/terminal-token implemented in cmd/api/routes.go.
- config support for TTL via TERMINAL_TOKEN_TTL_SECONDS (default 300).
- tests:
- packages/services/terminal/service_test.go covers mint/consume, single-use, ownership/state validation, allocation mismatch.
- cmd/api/routes_test.go covers success and inactive-allocation conflict mapping.
- websocket terminal proxy:
- /ws/terminal/{allocation_id} is served by the terminal gateway after token auth/consume.
- gateway creates a node stream binding, enqueues node-agent terminal open, and proxies stream frames.
- terminal control frames are emitted for session_ready, session_error, and session_closed.
- terminal integration coverage:
- added concurrent single-use consume race test to verify only one terminal token consume succeeds under contention.
- terminal service exposes snapshot counters for successful consumes and replay-rejected consumes.
Remaining:
- none for MVP scope
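Illustrative sketch of single-use token consumption under contention. To stay self-contained it swaps Redis for a mutex-guarded map whose getDel method mimics the atomic read-and-delete semantics of GETDEL. The tokenStore type and the consume and raceConsume helpers are hypothetical, and this sketch burns the token even when the allocation does not match:

```go
package main

import (
	"fmt"
	"sync"
)

// tokenStore is an in-memory stand-in for Redis.
type tokenStore struct {
	mu sync.Mutex
	m  map[string]string // "terminal_token:{token}" -> allocation ID
}

func (s *tokenStore) set(token, allocationID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m["terminal_token:"+token] = allocationID
}

// getDel reads and deletes in one critical section, like Redis GETDEL.
func (s *tokenStore) getDel(token string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	key := "terminal_token:" + token
	v, ok := s.m[key]
	if ok {
		delete(s.m, key)
	}
	return v, ok
}

// consume succeeds only for the first caller and only if the token was
// minted for the requested allocation.
func consume(s *tokenStore, token, allocationID string) bool {
	got, ok := s.getDel(token)
	return ok && got == allocationID
}

// raceConsume mints one token and lets n goroutines race to consume it,
// returning how many succeeded.
func raceConsume(n int) int {
	s := &tokenStore{m: map[string]string{}}
	s.set("tok", "alloc-1")
	var wg sync.WaitGroup
	var mu sync.Mutex
	wins := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if consume(s, "tok", "alloc-1") {
				mu.Lock()
				wins++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return wins
}

func main() {
	fmt.Println("successful consumes:", raceConsume(50))
}
```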
Phase 11 — Admin service
Status: Completed (MVP scope)
Completed:
- GET /api/v1/admin/overview
- GET /api/v1/admin/users and GET /api/v1/admin/users/{user_id}
- POST /api/v1/admin/users
- POST /api/v1/admin/users/{user_id}/balance
- POST /api/v1/admin/users/{user_id}/refunds
- GET /api/v1/admin/allocations
- POST /api/v1/admin/allocations/{allocation_id}/force-release
- GET /api/v1/admin/nodes, POST /api/v1/admin/nodes, POST /api/v1/admin/nodes/{node_id}/probe, DELETE /api/v1/admin/nodes/{node_id}
- GET /api/v1/admin/audit-logs, GET /api/v1/admin/audit-logs/export
- GET /api/v1/admin/payments/sessions
Remaining:
- none for MVP scope
CI / Governance enforcement during build
Status: In Progress
Completed:
- e05784f CI scripts enforce go vet and token-query contract guard
- f25c00d integration smoke gate:
- scripts/ci/integration_smoke.sh runs selected -tags integration suites
- make test-integration-selected target added for local + CI reuse
- .gitlab-ci.yml includes integration_smoke stage job
- 66f3078 integration smoke scope expanded to include billing-worker integration suite
- integration smoke scope expanded to include API integration suite:
- make test-integration-selected now includes ./cmd/api (//go:build integration route tests).
- local/CI integration smoke now continuously validates admin DB-backed route behavior.
- bf6f2d9 contract lint runtime enforcement:
- contracts_validate.sh now runs Spectral/AsyncAPI via native CLI or npx fallback
- CI enforces contract-lint tool presence (REQUIRE_CONTRACT_LINT_TOOLS=1)
- b242705 strict AsyncAPI validation enabled:
- AsyncAPI contract updated to validator-clean 2.6 structure (removed unsupported address keys, added metadata/message IDs)
- contracts_validate.sh now defaults ALLOW_ASYNCAPI_VALIDATE_FAILURE to 0 (blocking mode)
- .gitlab-ci.yml enforces blocking AsyncAPI validation (ALLOW_ASYNCAPI_VALIDATE_FAILURE=0)
- OpenAPI lint signal cleanup:
- Spectral policy updated to match project conventions (global security model and external behavior docs).
- OpenAPI metadata/errors/idempotency coverage updated on flagged endpoints.
- OpenAPI lint now reports zero warnings/errors under doc/governance/openapi.spectral.yaml.
- 8b75ff0 local GitLab parity tooling:
- added scripts/ci/gitlab_local_dry_run.sh to execute CI gates in .gitlab-ci.yml order.
- added local GitLab compose/runbook (operations/local-dev/docker-compose.gitlab.yaml, operations/local-dev/GitLab_Local_Setup.md).
- make ci-local-dry-run target added; dry run verified green.
- generated artifact cleanliness enforcement:
- scripts/ci/sdk_codegen_smoke.sh now fails when generated trees are dirty in strict mode.
- .gitlab-ci.yml sets CODEGEN_ENFORCE_CLEAN=1 for hosted CI parity.
- ops evidence gate:
- added scripts/ops/parallel_ops_evidence_check.sh and make ops-parallel-evidence-check.
- added scripts/ci/ops_evidence_gate.sh and wired it into gitlab_local_dry_run.sh.
- asyncapi validation noise reduction:
- contracts_validate.sh now exports SUPPRESS_NO_CONFIG_WARNING=1 for asyncapi validate calls.
- integration smoke default DB wiring:
- integration_smoke.sh now defaults to local dev DATABASE_URL and runs selected integration suites when DB is reachable.
- skip behavior is now limited to unreachable DB instead of unset env.
- security scan gate hardening:
- security_scans.sh now runs tool-aware scans when available (govulncheck, trivy, gitleaks), with explicit warnings when missing.
- contract breaking-change guard hardening:
- contracts_breaking_change.sh now detects when API contract files changed but diff tools are missing.
- optional strict mode added via REQUIRE_BREAKING_DIFF_TOOLS=1.
- dockerized web runtime guard:
- added scripts/ops/web_container_smoke.sh and wired it into gitlab_local_dry_run.sh (when Docker is available).
- local web compose now isolates .next as container volume to prevent missing-chunk dev runtime errors.
Remaining:
- none (continue keeping CI strict-mode flags enabled)
Demo Readiness
Status: In Progress
Completed:
- scripts/demo_smoke.sh added for local same-day walkthrough (health/auth/skus/nodes/create/get/list/release allocation).
- operations/local-dev/Demo_Runbook.md added with quickstart and realtime validation pointers.
- local compose/runtime hardening:
- exposed API host port in operations/local-dev/docker-compose.app.yaml (${API_PORT}:8080).
- exposed webhook worker host port in operations/local-dev/docker-compose.app.yaml (${WEBHOOK_PORT}:8082).
- aligned webhook worker signature-secret env var with runtime config (STRIPE_WEBHOOK_SECRET).
- removed obsolete compose version key warning from app compose file.
- fixed JWKS cache init options in middleware/auth service to avoid startup crash.
- demo smoke resiliency:
- added API readiness wait loop.
- auto-applies schema/seed on fresh local DB (AUTO_INIT_DB=1).
- default output switched to compact list summaries to keep unattended logs readable; full payload mode available via DEMO_VERBOSE=1.
- local token flow now pins Keycloak host header (KC_HOST_HEADER) so token issuer matches API JWKS issuer in compose network.
- local OIDC subject bootstrap now ensures a matching users.id record exists for protected endpoint walkthrough.
- release step now conditionally skips when allocation is not yet active (expected in async flow).
- demo path now provisions through active-state locally:
- smoke script bootstraps a demo admin node through admin API and creates allocations with explicit node_id.
- provisioning worker local mode supports PROVISIONING_RUNTIME_MODE=noop to exercise async lifecycle without real SSH side effects.
- smoke run verifies requested -> active, terminal-token mint, and release request (active -> releasing).
- smoke run now includes terminal websocket upgrade/control-frame check via scripts/ws_terminal_smoke.go.
Frontend UX Foundation (Slice 0)
Status: In Progress
Completed:
- 0f015ca UX planning hardening:
- added doc/product/UX_Execution_Plan.md
- hardened Slice 1/2 mocks and execution order.
- current scaffold:
- added packages/web/src/components/system/* shared UI states and modal primitives.
- added packages/web/src/components/a11y/* helpers (FocusTrap, LiveRegion, keyboard shortcut hook).
- added packages/web/src/lib/api/rateLimit.ts (Retry-After parsing/countdown) and packages/web/src/lib/api/errors.ts.
- added package docs and barrel exports (packages/web/README.md, packages/web/src/index.ts).
- frontend test harness baseline:
- added vitest + jsdom + @testing-library/react in packages/web.
- added vitest.config.ts and test setup bootstrap.
- added initial tests for:
- src/lib/api/rateLimit.ts
- src/lib/api/errors.ts
- restricted-state rendering for /notifications and /storage.
- added component tests for:
- RateLimitedState countdown and retry enabled behavior.
- FocusTrap initial focus and tab-cycle wrapping behavior.
- added page-level error-path tests:
- /storage renders mapped ErrConflict and ErrNotFound error states.
- /notifications renders websocket error state fallback.
- added route-level 429 countdown tests:
- /auth/login reads and renders Retry-After header as countdown.
- /allocations/{allocation_id} now uses API Retry-After when available and falls back to 30s.
- added admin mutation-path tests:
- /admin/allocations force-release requires reason before submit.
- /admin/nodes delete action disabled for in-use nodes.
- added admin user mutation error handling + tests:
- /admin/users/{user_id} balance adjust/refund actions now surface mapped API errors.
- tests cover ErrConflict and ErrNotFound mutation error rendering.
- added remaining admin mutation error handling + tests:
- /admin/nodes create/probe/delete actions now surface mapped API errors.
- /admin/allocations force-release action now surfaces mapped API errors.
- tests cover conflict/not-found mutation failures on both routes.
- added billing/admin-payments edge-path tests:
- /billing Stripe return query handling (?canceled=true) and CSV export action behavior.
- /admin/payments/sessions status-filter call shaping (all -> credited).
- /admin/audit-logs action-filter and cursor pagination call shaping.
- added payment action edge-path hardening:
- /billing checkout/customer-portal actions now surface mapped API errors.
- tests cover action call dispatch and checkout failure UX rendering.
- added terminal panel route-level tests:
- validates no-session error path.
- validates mint failure error path.
- validates websocket control-frame rendering (session_ready).
- validates terminal input send and explicit disconnect behavior.
Remaining:
- Keep frontend smoke gate in sync as web package test/build commands evolve.
Frontend Slice 1 (Auth + Marketplace)
Status: In Progress
Completed:
- Next.js app shell baseline implemented in packages/web:
- app/layout.tsx, app/globals.css, app/page.tsx
- package/runtime config (package.json, tsconfig.json, next.config.mjs)
- Auth route baseline:
- /auth/login with OIDC authorize-start request + PKCE verifier staging + error/rate-limit states.
- /auth/callback wired to POST /api/v1/auth/oidc/exchange, session bootstrap, and redirect.
- Marketplace route baseline:
- /marketplace authenticated typed fetch for SKUs/nodes using browser session token.
- sold-out SKU rendering + estimate-first provision modal UX.
- real POST /api/v1/allocations wiring from provision modal with async acceptance and route handoff:
- single accepted request -> /allocations/{allocation_id}
- multi accepted requests -> /allocations
- added allocations route scaffolds:
- /allocations
- /allocations/{allocation_id}
- Added additional UX parity routes:
- /schedulers route as the current scheduler surface baseline (detailed managed scheduler UX remains future work).
- /settings/profile route with GET /api/v1/users/me.
- Validation:
- pnpm typecheck (web) green.
- pnpm build (web) green.
Remaining:
- Add web test runner and component-level tests.
- Replace allocation list/detail scaffolds with Slice 2 lifecycle polling + terminal/release controls.
Frontend Slice 2 (Allocations + Terminal Baseline)
Status: In Progress
Completed:
- Implemented /allocations page with authenticated allocation list fetch (GET /api/v1/allocations).
- Implemented /allocations/{allocation_id} page with polling detail fetch (GET /api/v1/allocations/{allocation_id}).
- Wired release action (POST /api/v1/allocations/{allocation_id}/release) through confirmation modal.
- Wired terminal-token action (POST /api/v1/allocations/{allocation_id}/terminal-token) into the terminal UX path.
- Replaced the earlier token display-only step with the terminal panel component:
- mints single-use token and connects browser WS via Sec-WebSocket-Protocol.
- renders terminal control frames (session_ready, session_error, session_closed).
- supports reconnect/remint and interactive input send.
- Added connection info + copyable SSH command rendering when allocation has connection metadata.
- Added full prototype-intent action row on allocation detail:
- Metrics (config-driven deep link via NEXT_PUBLIC_METRICS_BASE_URL)
- Console (terminal panel)
- Key (routes to user key-management/help flow)
- Release (confirmed async release)
- Added user key-management guidance copy on allocation detail.
- allocations list operator controls:
- server-driven status filter (listAllocations query parameter) for active/released/etc views.
- local search on allocation id/sku/node id for quick narrowing without extra round-trips.
- visible "showing x of y" counter and edge tests for status-query + local search behavior.
- Validation:
- pnpm typecheck (web) green.
- pnpm build (web) green.
Remaining:
- continue lifecycle UX refinement for richer per-step progress and action disablement patterns.
Frontend Slice 3 (Billing + Notifications Baseline)
Status: In Progress
Completed:
- Implemented /billing page with contract-backed flows:
- GET /api/v1/billing/balance
- GET /api/v1/billing/usage
- GET /api/v1/billing/usage/csv (export trigger)
- POST /api/v1/payments/checkout-session
- POST /api/v1/payments/customer-portal-session
- Added Stripe return-param handling on billing route (session_id / canceled).
- Added top-nav notification bell/panel:
- websocket subscribe to /ws/notifications
- browser auth via Sec-WebSocket-Protocol (bearer, <access_token>)
- in-app notification list with deep-link action support.
- d8d172f dedicated notifications center route:
- added /notifications page for full list management.
- introduced shared websocket notification store hook (useNotifications) reused by bell + route.
- added dismiss/clear behavior and local persistence for retained notifications.
- Added global persistent low-balance banner in app layout:
- periodic balance polling on authenticated sessions
- warning/depleted variants with billing CTA.
- Validation:
- pnpm typecheck (web) green.
- pnpm build (web) green.
Remaining:
- Refine read/unread persistence semantics once notification retention policy is finalized.
Frontend Slice 5 (Storage Baseline)
Status: In Progress
Completed:
- 0546ef3 storage UI baseline:
- added /storage page with contract-backed flows:
- GET /api/v1/storage/list
- GET /api/v1/storage/download
- PUT /api/v1/storage/upload
- POST /api/v1/storage/mkdir
- POST /api/v1/storage/rename
- DELETE /api/v1/storage/delete
- added storage API client methods in packages/web/src/lib/api/client.ts.
- added top-nav Storage route link.
- storage operator controls:
- local type filter (all/file/dir) and name search added.
- visible "showing x of y" counter for directory triage.
- tests cover filter behavior without additional API calls.
- Validation:
- pnpm build (web) green.
- pnpm typecheck (web) green.
Remaining:
- None for MVP scope beyond backend-level S3 swap.
Frontend Slice 4 (Admin Surfaces Baseline)
Status: In Progress
Completed:
- Implemented /admin/overview:
- GET /api/v1/admin/overview
- auto-refresh every 5s with pause/resume and last-updated indicator.
- Implemented /admin/allocations:
- GET /api/v1/admin/allocations with status filter.
- POST /api/v1/admin/allocations/{allocation_id}/force-release with required reason via confirmation modal.
- Implemented /admin/nodes:
- GET /api/v1/admin/nodes
- POST /api/v1/admin/nodes
- POST /api/v1/admin/nodes/{node_id}/probe
- DELETE /api/v1/admin/nodes/{node_id} with occupancy guard in UI.
- admin nodes operator controls:
- local status filter (all/online/offline/maintenance) and search (host/node id/sku) added.
- visible "showing x of y" counter for quick triage on large node sets.
- tests cover local filter behavior in addition to existing mutation guards.
- Added admin navigation link in app layout.
- admin users list hardening:
- create-user mutation failures now render mapped API errors with correlation id.
- local operator controls added for quick filtering without extra API calls:
- search by username/user id
- role filter (all/user/admin)
- visible "showing x of y" counter
- edge tests cover cursor pagination, create-user error mapping, and local filter behavior.
- admin user action-history context:
- admin user detail now includes direct context links to:
- /admin/allocations?user_id=<id>
- /admin/payments/sessions?user_id=<id>
- /admin/audit-logs?actor_user_id=<id>
- tests validate link targets for consistent operator navigation.
- target admin pages now consume seeded query filters at route entry:
- allocations: user_id, status
- payment sessions: user_id, status
- audit logs: actor_user_id, action
- edge tests cover query-seeded API call shaping.
- Validation:
- pnpm typecheck (web) green.
- pnpm build (web) green.
Remaining:
- Keep per-screen filter ergonomics aligned as new routes are added.
Parallel Ops Evidence¶
Status: In Progress
Completed:
- backup/restore drill baseline:
- added scripts/ops/backup_restore_smoke.sh for repeatable local Postgres dump+restore validation.
- added make ops-backup-restore-smoke convenience target.
- moved Parallel_Ops_Track item 3 (Backup/Restore/DR) to in_progress with concrete evidence link.
- captured local rehearsal report artifact with observed restore duration/table checks.
- latest local run refreshed evidence (restore_smoke_1771895409, 23 tables, 1s, success).
- SLO/alert artifact baseline:
- added baseline alert rule manifest at doc/operations/evidence/alert_rule_manifest_baseline.yaml.
- added API hardening alerts for fail-open rate limiting, terminal token replay spikes, and notification websocket write errors.
- added simulation playbook/report artifacts for repeatable validation.
- outbox payload minimization guard baseline:
- added scripts/ci/outbox_payload_guard.sh and wired it into scripts/ci/contracts_validate.sh.
- runbooks/on-call artifact baseline:
- added explicit on-call roster + escalation artifact.
- added incident drill calendar/report template artifact.
- completed baseline runbook coverage for provisioning, webhook, database failover, and incident communications.
- east/west security artifact baseline:
- added baseline network policy manifest for default-deny + explicit allow-list flows.
- added TLS cert expiry check script + make target for repeatable verification.
- key-rotation runbook baseline:
- added unified runbook for planned rotation and compromise response across JWKS, terminal, control keys, and envelope keys.
- data-growth guard baseline:
- added scripts/ops/data_growth_check.sh with row/size threshold checks for high-growth tables.
- added make ops-data-growth-check and evidence doc wiring.
- latest local run refreshed row/size evidence for usage_records, ledger_entries, and audit_logs.
- evidence-gate baseline:
- added scripts/ops/parallel_ops_evidence_check.sh to validate required evidence artifacts for Parallel Ops items 1-5.
- added make ops-parallel-evidence-check for local repeatable verification.
- wired CI dry-run gate via scripts/ci/ops_evidence_gate.sh and scripts/ci/gitlab_local_dry_run.sh.
- observability smoke baseline:
- added scripts/ops/observability_smoke.sh and make ops-observability-smoke.
- validates API/webhook metrics endpoint reachability + required metric names.
- added local evidence report doc/operations/evidence/observability_local_smoke_report.md.
Security foundation (shared package)¶
Status: In Progress
Completed:
- encryption envelope baseline (current commit):
- doc/architecture/Encryption_Envelope_Spec.md defines canonical _enc JSON shape and AES-256-GCM baseline.
- packages/shared/crypto/envelope.go implements envelope encrypt/decrypt + marshal helpers.
- packages/shared/crypto/envelope_test.go adds round-trip and guardrail coverage.
- storage path-safety baseline:
- packages/shared/storagepath/path.go defines namespace-rooted path resolution (filepath.Clean + namespace escape checks).
- packages/shared/storagepath/path_test.go adds traversal/absolute-path rejection tests.
- envelope material stability:
- provisioning worker now uses config-driven/stable envelope key material (ENVELOPE_KEY_B64, ENVELOPE_KEY_ID) instead of per-process random keys.
- deterministic local-dev fallback keeps decrypt paths consistent across processes.
- shared key loader moved to packages/shared/crypto/key_material.go and reused by provisioning + terminal.
- node-probe SSRF guardrails:
- inventory node create/probe flows now validate target host resolution and block unsafe targets before network dial.
- deny set includes loopback/link-local/multicast/unspecified and metadata endpoint 169.254.169.254.
- optional CIDR allowlist support via NODE_PROBE_ALLOWED_CIDRS for controlled admin probe scope.
- API route tests added for denied-target responses (400) on admin node create/probe endpoints.
- KMS key-source command hardening:
- legacy provisioning-worker SSH command paths were removed with retirement of the SSH runtime.
- remaining key-fetch command surfaces should be reviewed at the active terminal/app boundary instead of reintroducing worker-side private-key handling.
- rate-limit fail-open observability baseline:
- RateLimiter now records fail-open occurrences when Redis eval path errors.
- unit test added to assert fail-open request path and counter increment behavior.
- policy cache invalidation baseline:
- PostgresClient now supports Invalidate(key) and InvalidateAll() for in-process policy cache eviction.
- API process subscribes to Redis policy.invalidate.* and evicts local cache entries immediately.
- protected API chain now mounts policy-backed RateLimiter middleware.
- idempotency response-body sanitization baseline:
- idempotency middleware now sanitizes JSON response bodies before storing replay payloads.
- invalid/non-JSON response payloads are not persisted to idempotency_keys.response_body (fail-safe default).
- tests added for sensitive-key redaction and bearer-token string scrubbing.
- snapshot counters added for persisted/skipped/replay-served idempotency behavior (IdempotencySnapshot()).
- JWKS break-glass baseline:
- added JWKSAuth.ForceRefresh(ctx) for immediate cache refresh during incidents.
- added doc/operations/runbooks/JWKS_Compromise_Breakglass_Runbook.md and linked it into secrets/key ops evidence.
- API internal incident endpoint added: POST /internal/auth/jwks/refresh guarded by INTERNAL_JWKS_REFRESH_TOKEN.
- scheduler metadata encryption baseline:
- allocation create path now envelope-encrypts scheduler_request into allocations.scheduler_metadata.
- unit test validates decryptable round-trip using shared envelope key material.
- ERD/schema docs now enforce envelope encryption expectation for credential-bearing scheduler metadata.
Remaining:
- wire shared crypto helper into storage runtime secret paths if storage metadata starts carrying credential material.
API slice hardening (admin + payments)¶
Status: In Progress
Completed:
- admin route wiring (DB-backed):
- registered and implemented GET /api/v1/admin/overview.
- registered and implemented GET/POST /api/v1/admin/users.
- registered and implemented GET /api/v1/admin/users/{user_id}.
- registered and implemented POST /api/v1/admin/users/{user_id}/balance.
- registered and implemented POST /api/v1/admin/users/{user_id}/refunds.
- registered and implemented GET /api/v1/admin/payments/sessions.
- registered and implemented GET /api/v1/admin/audit-logs.
- registered and implemented GET /api/v1/admin/audit-logs/export.
- auth/role guard ordering:
- admin handlers now enforce auth/role checks before dependency health checks, so non-admin requests deterministically return 403 instead of surfacing infra state.
- route guardrail tests:
- added table-driven coverage for all new admin endpoints:
- non-admin claims -> 403.
- admin claims with missing DB dependency -> 503.
- integration route coverage (DB-backed):
- added cmd/api/routes_integration_test.go (//go:build integration) to exercise admin HTTP handlers against real Postgres schema:
- GET /api/v1/admin/overview
- POST /api/v1/admin/users
- GET /api/v1/admin/users/{user_id}
- includes malformed-id (not-a-uuid) 400 contract assertion
- GET /api/v1/admin/users (pagination envelope shape)
- GET /api/v1/admin/payments/sessions (filter by user_id, status)
- GET /api/v1/admin/audit-logs (filter by actor_user_id, action)
- GET /api/v1/admin/audit-logs/export (CSV header + row presence)
- validated with:
- DATABASE_URL=... go test -tags integration ./cmd/api -count=1
- admin mutation error classification hardening:
- adminAdjustBalanceHandler / adminCreateRefundHandler now classify DB failures by PG error code:
- 23503 -> 404 (user not found)
- 22P02 -> 400 (invalid user id)
- other DB errors -> 500
- admin path-parameter validation hardening:
- adminGetUserHandler, adminAdjustBalanceHandler, and adminCreateRefundHandler now validate user_id as UUID before dependency/DB access.
- malformed user_id now returns deterministic 400 invalid user id.
- route tests added for invalid-UUID cases on balance/refund admin mutations.
- OpenAPI updated to declare 400 BadRequest for:
- GET /api/v1/admin/users/{user_id}
- POST /api/v1/admin/users/{user_id}/balance
- local CI contract parity:
- ran scripts/ci/gitlab_local_dry_run.sh end-to-end after route changes and codegen sync; all gates pass.
- added optional Playwright stage to local dry-run (RUN_WEB_E2E=1) with automatic stack up/down via scripts/ci/frontend_e2e.sh.
- validated full parity path with RUN_WEB_E2E=1 RUN_WEB_CONTAINER_SMOKE=1 bash scripts/ci/gitlab_local_dry_run.sh (green).
- raised e2e gate strictness:
- scripts/ci/gitlab_local_dry_run.sh now runs frontend e2e by default (RUN_WEB_E2E=1).
- .gitlab-ci.yml now includes dedicated frontend_e2e job in build_test stage.
- codegen drift fix:
- synced generated OpenAPI artifacts:
- packages/shared/gen/openapi_types.gen.go
- packages/web/src/lib/gen/openapi.types.ts
- revalidated with go test ./... and pnpm --dir packages/web typecheck.
- authenticated UI flow regression coverage:
- expanded Playwright OIDC coverage in packages/web/e2e/auth-login.spec.ts:
- post-login route access checks for /allocations, /storage, /notifications.
- post-login admin route access check for /admin/users.
- post-login admin overview route access check for /admin/overview.
- mock billing checkout redirect path remains covered.
- marked OIDC suite serial to avoid shared-session race flakes against a single local Keycloak/API stack.
- added persona click-through smoke suite packages/web/e2e/persona-smoke.spec.ts:
- user journey nav walk: marketplace -> allocations -> billing -> storage -> notifications -> schedulers -> settings
- admin journey nav walk: overview -> users -> nodes -> allocations -> audit logs -> payment sessions
- explicit guard assertion that generic fallback errors (Request failed / Something went wrong) do not appear.
- allocations page-shell consistency:
- packages/web/app/allocations/page.tsx now preserves the page header/shell (My Allocations) across loading/restricted/error/empty states.
- added unit coverage in packages/web/app/allocations/page.test.tsx for empty-state and restricted-state heading visibility.
- billing page-shell consistency:
- packages/web/app/billing/page.tsx now preserves the page header/shell (Billing) across loading/restricted/error states.
- added unit coverage in packages/web/app/billing/page.edge.test.tsx for restricted-state heading visibility.
- storage page-shell consistency:
- packages/web/app/storage/page.tsx now preserves the page header/shell (My Storage) across restricted/loading/error states.
- added unit coverage in packages/web/app/storage/page.test.tsx to assert heading visibility when unauthenticated.
- notifications page-shell consistency:
- packages/web/app/notifications/page.tsx now preserves the page header/shell (Notifications) across restricted/error states.
- added unit coverage in packages/web/app/notifications/page.test.tsx and packages/web/app/notifications/page.error-state.test.tsx for heading visibility.
- admin overview page test coverage:
- added packages/web/app/admin/overview/page.edge.test.tsx with coverage for:
- successful metric rendering
- non-admin restricted behavior
- mapped API-error rendering with correlation ID
- pause/resume auto-refresh toggle behavior
- backend 401 response path now enforces restricted state (session-expiry regression guard)
- API error mapping hardening:
- packages/web/src/lib/api/errors.ts now supports both nested (error.code) and canonical flat (code) error envelopes.
- added coverage for flat middleware auth payload (token_missing) to prevent generic "Request failed" UX fallback.
Current Next Queue (unattended)¶
- Execute and attach concrete staging evidence for doc/operations/Parallel_Ops_Track.md items 1-5 (launch-critical), then flip each from in_progress to done.
- Add policy invalidation publisher on admin policy mutation endpoints once policy-management APIs land.
- Keep CI parity strict and re-run make ci-local-dry-run on every service slice merge.
- Track MAAS/cloud-init provisioning visibility as a release-candidate gap under A-MAAS-BOOTSTRAP-PROGRESS-VISIBILITY-001:
- emit explicit phase markers for site bootstrap, node bootstrap, and node-agent enrollment
- surface those markers through MAAS-visible status/events when reachable
- also expose the same progress in GPUaaS onboarding detail/events so operators can distinguish slow package installs from real bootstrap or enrollment failure without SSHing to the node
- short term on the MAAS side, poll curated INFO events during wait-loop ticks and filter them to the current workflow/stage window
- long term, move MAAS progress collection to a shared site-level refresher if reimage/onboarding concurrency grows
- separate the remaining long-running site_bootstrap interior gap under A-MAAS-SITE-BOOTSTRAP-MIDPHASE-PROGRESS-001 so the missing package-install / Lambda-stack / DOCA-OFED callbacks do not get lost in the broader visibility thread
- track OCI site-bootstrap publish/profile-reference discipline separately under A-MAAS-SITE-BOOTSTRAP-BUNDLE-RELEASE-DISCIPLINE-001
- track a cleaner admin lifecycle/progress read surface separately under A-MAAS-ADMIN-LIFECYCLE-PROGRESS-SURFACE-001
- Track MAAS workflow UI ordering cleanup under A-MAAS-WORKFLOW-TIMELINE-ORDERING-001:
- make Stage Progress use explicit workflow stage order instead of first-seen timestamp order
- make Current Attempt / timeline rows deterministic when multiple events land in the same second
- keep GPUaaS workflow detail ordering aligned with Temporal execution order during incidents
2026-04-13 Progress Note¶
- App-runtime validation reached a first stable platform-control state across:
- Slurm Reference,
- Self-managed Kubernetes (RKE2),
- JupyterLab launchable OCI workload.
- Launchable OCI / JupyterLab:
- commit 149a244f added curated JupyterLab runtime image definitions for CPU, NVIDIA H200/CUDA, and AMD ROCm variants.
- node-agent bootstrap now installs Docker when no approved OCI runtime is present and auto-configures nvidia-container-toolkit when nvidia-smi is available.
- app-runtime now sends bounded GPU requests (gpu_request.kind=count) and node-agent maps them to Docker --gpus N.
- platform-control pipeline 593 deployed the slice successfully.
- H200 validation launched JupyterLab with gpu_count=1; Docker inspect reported device request Count: 1, and nvidia-smi inside the container showed one visible NVIDIA H200.
- Slurm Reference:
- commit 5491774f fixed stale workload health by reporting completed Slurm bootstrap as healthy instead of progressing.
- platform-control pipeline 595 deployed the fix successfully.
- existing Slurm instance was corrected through the public report API and now reports status=running, health_status=healthy, phase=slurm_bootstrap_completed.
- RKE2:
- platform-control validation is stable for the first single-server path.
- remaining product decisions are storage/CSI, external exposure, kubeconfig privilege/copy UX, repair/reconcile, and cleanup semantics.
- Follow-up work after first stable validation:
- added scripts/ops/app_runtime_first_slice_smoke.sh as a configurable local-kind/platform-control smoke entry point. It can observe existing running Slurm/RKE2/Jupyter instances, or create/wait/access-check/decommission instances when placement/artifact inputs are supplied by the operator.
- added doc/product/Jupyter_Package_Install_v1.md for package version constraints, ephemeral install semantics, failure reporting, and the derived image follow-up.
- node-agent now returns a bounded pip install log excerpt for launchable OCI workloads and the workload Events tab surfaces it with the installed package count.
- added RKE2 kubeconfig copy support in the workload Access tab.
- confirmed launchable OCI Access already documents private SSH tunnel access and the future platform-proxy mode; no extra item remains there for the first slice.
- strengthened local-kind controller refresh handling by stamping Slurm and RKE2 controller pod templates with build metadata and by deploying the RKE2 controller from the parity validation flow, reducing stale same-tag image and stale runtime-config drift.
- Remaining next work:
- validate the new single-node vllm-openai Docker Compose app profile in local kind with CPU-safe inputs, then promote to platform-control and run an H200/GPU smoke with a production-sized model selection,
- finish node-agent lifecycle upgrade delivery so existing nodes do not need manual rebootstrap for host prerequisites and binary updates,
- keep RKE2 storage/CSI and external exposure as explicit infra/product open items until infra confirms ownership and the supported route.
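The mapping from a bounded GPU request (gpu_request.kind=count) to Docker's --gpus N flag, mentioned in the Launchable OCI / JupyterLab notes above, might look roughly like the following Go sketch. The GPURequest type, the dockerGPUArgs helper, and the maxGPUs bound are hypothetical; the note does not say how the bound is defined, so a per-node maximum is assumed here.

```go
package main

import (
	"fmt"
	"strconv"
)

// GPURequest is a hypothetical mirror of the gpu_request payload.
type GPURequest struct {
	Kind  string // "count" is the only kind named in the progress note
	Count int
}

// dockerGPUArgs (hypothetical helper) turns a request into docker run arguments.
// A nil request means a CPU-only launch and yields no GPU flags.
func dockerGPUArgs(req *GPURequest, maxGPUs int) ([]string, error) {
	if req == nil {
		return nil, nil
	}
	if req.Kind != "count" {
		return nil, fmt.Errorf("unsupported gpu_request.kind %q", req.Kind)
	}
	if req.Count < 1 || req.Count > maxGPUs {
		return nil, fmt.Errorf("gpu count %d outside allowed range 1..%d", req.Count, maxGPUs)
	}
	return []string{"--gpus", strconv.Itoa(req.Count)}, nil
}

func main() {
	args, err := dockerGPUArgs(&GPURequest{Kind: "count", Count: 1}, 8)
	fmt.Println(args, err)
}
```

Rejecting unknown kinds and out-of-range counts before building the command line keeps free-form input from the request out of the docker invocation.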
2026-04-04 Progress Note¶
- MAAS drift reconciliation first slice implemented locally:
- provisioning-worker now ensures a Temporal schedule that runs the periodic MAAS reconciliation workflow/activity
- MAAS service scan refreshes node_maas_state for MAAS-managed nodes and classifies drift such as ip_drift, machine_missing, released_outside_workflow, agent_offline, retired_node_still_deployed, and MAAS read failures
- current periodic scope is intentionally limited to MAAS-managed nodes already active; non-active states should be checked by the owning lifecycle workflows/events rather than the broad fleet scan
- current scan intentionally records report-only/candidate follow-up actions; automatic lifecycle execution remains blocked on the owning node-agent update path task
- Separate MAAS discovery follow-up tracked:
- phase 1 should be an explicit admin Discover action that produces an operator-reviewed MAAS candidate list with first-pass SKU inference, so admins can choose candidates from MAAS inventory instead of typing every onboarding input manually
- long term, that candidate list can be refreshed by Temporal on a schedule once the discovery model is trusted, but adoption should remain operator-reviewed by default
- MAAS operating-model boundary now clarified:
- periodic Temporal reconciliation is for known MAAS-managed nodes already active
- non-active MAAS states should be checked by their owning lifecycle workflows/events
- unknown MAAS machines belong to discovery/adoption flow, not the steady-state reconciler
- MAAS provisioning visibility gap is now tracked as a first-class backend task: A-MAAS-BOOTSTRAP-PROGRESS-VISIBILITY-001
- this should be treated as part of the current MAAS release-candidate quality bar, not as a later observability extra
- Local frontend CI cleanup also ready to batch with the next push:
- tenant-shared scheduler test now waits for the async-enabled request button before clicking