Implementation Roadmap
Ordered coding breakdown for GPUaaS v1. Each phase lists its hard prerequisites, the exact files to create, the API endpoints to implement, and the tests to write.
Document Role
- Purpose: source-of-truth plan for what to build and in what order.
- Scope: phase definitions, prerequisites, deliverables, and done criteria.
- Does not track daily progress; use doc/Execution_Progress.md for commit-level status and doc/Phase_Readiness_Tracker.md for readiness gating.
- Historical note: early phase sections preserve the original execution plan language for sequencing context. Current implementation truth for completed phases lives in the repo, doc/Execution_Progress.md, and queue state in doc/governance/Agent_Work_Queue.yaml.
How to use this document
- Work phases in order. Phases within a group that share no dependencies may be
parallelised across agents (noted where applicable).
- Before starting any task: read AGENTS.md for architecture rules, then read
the files listed under "Read first" for that phase.
- Follow doc/governance/Coding_Standards.md §Go Implementation Patterns
for every handler, service function, and test file.
- After completing a phase: tick all boxes and verify make test passes before
moving on.
CI portability rule (host-agnostic execution)
- Keep gate logic in scripts/ci/*.sh and call these scripts from pipeline YAML.
- .gitlab-ci.yml and GitHub Actions workflows should stay as orchestration wrappers only.
- If host changes (GitLab -> GitHub or reverse), reuse scripts/ci unchanged and only adapt runner/secret wiring.
Pre-Phase UX — UX completion before feature coding
Prerequisite: Docs baseline complete (PRD + OpenAPI + AsyncAPI + Architecture).
Blocking rule: Feature coding does not start until the UX completion checklist is signed off.
Read first:
- doc/product/UX_Journeys.md
- doc/product/UX_Implementation_Spec.md
- doc/api/openapi.draft.yaml
- doc/api/asyncapi.draft.yaml
Deliverables:
- Screen inventory (user + admin) with route ownership and contract links.
- State matrix per screen (loading, empty, error, success, restricted, rate_limited).
- Async UX patterns for allocation lifecycle (requested/provisioning/active/releasing/released/failed/release_failed).
- Terminal UX flow (mint token -> connect WS -> disconnected/retry states).
- Accessibility baseline (keyboard, focus trap, aria labels, contrast checks).
Done when:
- [ ] Every user action maps to an OpenAPI endpoint or AsyncAPI channel.
- [ ] No production flow depends on prototype-only behavior.
- [ ] UX spec includes explicit handling for 401, 403, 404, 409, 429.
- [ ] UX signoff recorded in doc/Phase_Readiness_Tracker.md.
Pre-Phase Tooling — Contract Codegen
Purpose: avoid losing codegen setup while implementation starts.
Deliverables:
- Add scripts/codegen.sh with deterministic OpenAPI-driven generation steps.
- Wire make codegen to this script and verify it runs locally.
- Add sdk_codegen_smoke CI step to execute (not just echo) once toolchain is installed.
Done when:
- [x] scripts/codegen.sh exists and is executable.
- [x] make codegen updates generated artifacts without manual edits.
- [x] AGENTS.md repo layout note is updated to include scripts/codegen.sh again.
Pre-Phase Platform — Production Baseline (DevOps Parallel)
Purpose: allow DevOps/security platform work to run in parallel with app feature coding.
Read first:
- doc/operations/Production_Platform_Baseline.md
- doc/operations/Parallel_Ops_Track.md
- doc/operations/Environment_Promotion_Policy.md
- doc/governance/Security_Control_Verification.md
Deliverables:
- Managed edge gateway + WAF configured for API + websocket routes.
- TLS and cert rotation in place for public endpoints.
- East/west default-deny network policy + explicit allow-list flows implemented.
- Internal mTLS (or equivalent) with certificate issuance/rotation/revocation SOP implemented.
- Centralized logs/metrics/traces with alert rules for MVP SLOs.
- Secret manager/KMS wiring for runtime secrets.
- Backup/restore rehearsal completed for Postgres.
Done when:
- [ ] All “Required for Public MVP” controls in Production_Platform_Baseline.md are implemented in staging.
- [ ] Parallel operations items in Parallel_Ops_Track.md have owners and status updates.
- [ ] Evidence links are recorded in doc/Phase_Readiness_Tracker.md.
Pre-Phase Observability — Contract and UX Gate
Purpose: lock observability backend and Ops UI contracts before implementation.
Read first:
- doc/architecture/Observability_Architecture.md
- doc/governance/Observability_Standards.md
- doc/governance/UX_Contract_Gate.md
- doc/operations/Observability_Baseline.md
- doc/operations/Ops_Runbook_Architecture.md
- doc/product/ux-mocks/admin-ops.md
Deliverables:
- Observability backend decision finalized (OTel Collector + Prometheus + Tempo + Loki + Grafana).
- OpenAPI additions for admin ops aggregated endpoints (required before UI coding).
- OpenAPI additions for runbook metadata endpoints (/api/v1/admin/runbooks*) before runbook panel UI coding.
- Runbook metadata architecture (manifest, stable IDs, alert mapping) approved.
- Ops UI mock (/admin/ops) reviewed and approved.
- Alert and dashboard minimum set mapped to SLOs.
Done when:
- [ ] Observability architecture/standards docs are approved.
- [ ] Ops UI route contract is added to OpenAPI.
- [ ] Runbook metadata route contracts are added to OpenAPI.
- [ ] Ops UI mock maps every panel interaction to a contract endpoint.
- [ ] Degraded panel states have deterministic runbook mappings by runbook_id.
- [ ] UX/contract gate checklist is satisfied for /admin/ops before feature implementation.
Pre-Phase Node Agent — Secure Node Communication (Blocking for Phase 7)
Purpose: replace raw SSH provisioning with a pull-based, typed-task node agent protected by mTLS and task-signing. All node-side operations are performed by compiled, audited handlers — no arbitrary command execution.
Blocking rule: Phase 7 (Provisioning Worker) does not start until every box below is checked.
Read first:
- doc/architecture/PKI_Spec.md
- doc/architecture/Node_Agent_Spec.md
- doc/architecture/db_schema_v1.sql
- doc/api/openapi.draft.yaml §Internal
Deliverables:
Specification
- doc/architecture/PKI_Spec.md written and reviewed (CA hierarchy, enrollment flow,
renewal, revocation, task signing, Vault migration path). ✅
- doc/architecture/Node_Agent_Spec.md written and reviewed (task catalog, protocol,
privilege model, parameter validation). ✅
DB Schema
- node_tasks table added to doc/architecture/db_schema_v1.sql and applied to dev DB.
OpenAPI Contract
- /internal/v1/nodes/* endpoints added to doc/api/openapi.draft.yaml before any
implementation of handlers or node agent code.
CA Infrastructure
- step-ca deployed in Kubernetes internal namespace (pki-ca.internal:9000).
- CA ceremony completed (Root CA offline, Intermediate CA in KMS, fingerprint recorded).
- Root CA cert fingerprint added to doc/Phase_Readiness_Tracker.md.
MAAS Integration (optional, parallel track — gates full-reimage isolation model)
- packages/services/maas/ — MAASClient interface + implementation (OAuth 1.0 auth,
DeployMachine, ReleaseMachine, GetMachineStatus, ListMachines).
- nodes table: maas_system_id TEXT column added to db_schema_v1.sql.
- POST /internal/v1/maas/machine-commissioned internal webhook endpoint added to
openapi.draft.yaml — MAAS calls this on commissioning complete to auto-register
nodes and generate enrollment tokens.
- MAAS_URL, MAAS_API_KEY added to cmd/api/config.go (only required when
MAAS_ENABLED=true; config validates conditionally).
- Policy keys maas.enabled and allocation.isolation_model seeded in scripts/seed.sql.
- Isolation model read from PolicyClient in provisioning worker — never hardcoded.
Go Scaffold (historical bootstrap requirement)
- packages/shared/pki/client.go — CAClient interface + StepCAClient implementation.
- cmd/node-agent/ directory scaffold compiles (go build ./cmd/node-agent).
- catalog/catalog.go dispatch function with full task type registry (historical bootstrap note:
initial handler stubs were acceptable at scaffold time; current implementation should not rely on this guidance).
- validate/params.go — parameter validators for all task types.
- signing/verify.go — Ed25519 signature verification.
- Unit tests pass: catalog dispatch (known type dispatches, unknown type rejects),
parameter validation (valid params pass, invalid params reject), replay protection.
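The signature and replay checks listed for signing/verify.go can be sketched with the standard library alone. This is a minimal illustration, not the contents of PKI_Spec.md: the SignedTask shape, the signed-message layout (task ID, newline, payload), and the in-memory replay set are all assumptions made for the example.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// SignedTask is a hypothetical wire shape: payload bytes, a unique task ID
// used for replay protection, and a detached Ed25519 signature over both.
type SignedTask struct {
	TaskID    string
	Payload   []byte
	Signature []byte
}

// message returns the exact bytes that get signed (assumed layout).
func message(taskID string, payload []byte) []byte {
	return append([]byte(taskID+"\n"), payload...)
}

// Verifier checks signatures and rejects task IDs it has already accepted.
type Verifier struct {
	pub  ed25519.PublicKey
	seen map[string]struct{}
}

func NewVerifier(pub ed25519.PublicKey) *Verifier {
	return &Verifier{pub: pub, seen: make(map[string]struct{})}
}

// Verify returns nil only for a correctly signed task that was not seen before.
func (v *Verifier) Verify(t SignedTask) error {
	if !ed25519.Verify(v.pub, message(t.TaskID, t.Payload), t.Signature) {
		return fmt.Errorf("task %s: invalid signature", t.TaskID)
	}
	if _, dup := v.seen[t.TaskID]; dup {
		return fmt.Errorf("task %s: replay rejected", t.TaskID)
	}
	v.seen[t.TaskID] = struct{}{}
	return nil
}

// demo signs one task, then verifies it, replays it, and tampers with it.
func demo() (first, replay, tampered error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	v := NewVerifier(pub)
	task := SignedTask{TaskID: "task-1", Payload: []byte(`{"type":"allocation.provision_user"}`)}
	task.Signature = ed25519.Sign(priv, message(task.TaskID, task.Payload))

	first = v.Verify(task)
	replay = v.Verify(task)
	bad := task
	bad.TaskID = "task-2" // signature no longer covers these bytes
	tampered = v.Verify(bad)
	return first, replay, tampered
}

func main() {
	first, replay, tampered := demo()
	fmt.Println("first:", first)
	fmt.Println("replay:", replay)
	fmt.Println("tampered:", tampered)
}
```

Checking the signature before recording the task ID keeps forged tasks from polluting the replay set. A real agent would likely bound that set, for example by task expiry, rather than let it grow without limit.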
Done when:
- [ ] doc/architecture/PKI_Spec.md signed off in doc/Phase_Readiness_Tracker.md
- [ ] doc/architecture/Node_Agent_Spec.md signed off in doc/Phase_Readiness_Tracker.md
- [ ] node_tasks table in schema and applied
- [ ] /internal/v1/nodes/* endpoints in openapi.draft.yaml
- [ ] step-ca running in staging; CA ceremony complete
- [ ] go build ./cmd/node-agent passes
- [ ] go build ./packages/shared/pki/... passes
- [ ] make test passes (node agent unit tests)
Pre-Phase Security — Encryption Envelope Baseline (Blocking)
Purpose: prevent ad-hoc encryption implementations in provisioning, storage, and scheduler metadata paths.
Read first:
- doc/operations/Scalability_Security_Watchlist.md (SEC-3, E-3)
- doc/architecture/db_schema_v1.sql (*_enc fields, scheduler_metadata)
Deliverables:
- doc/architecture/Encryption_Envelope_Spec.md defining:
- envelope format/version fields
- key identifiers and KMS key source conventions
- rotation and re-encryption strategy
- decrypt failure handling and audit expectations
- packages/shared/crypto/ scaffold with:
- envelope encode/decode interfaces
- KMS adapter abstraction
- deterministic test fixtures and redaction-safe logging behavior
Done when:
- [ ] Encryption envelope spec exists and is referenced by provisioning/storage implementation phases.
- [ ] Shared crypto package compiles and is usable by provisioning worker code.
- [ ] Security owner signoff recorded in doc/Phase_Readiness_Tracker.md.
Pre-Phase Tenant Ownership — Tenant/Project Enforcement Baseline (Blocking)
Purpose: lock ownership semantics before further feature coding so access control,
billing scope, and policy evaluation stay coherent.
Prerequisite gate: update and approve doc/governance/Testing_Standards.md tenant/project
authz coverage expectations before implementation tasks in this phase start.
Read first:
- doc/architecture/Tenant_Project_Ownership_Baseline.md
- doc/architecture/adrs/ADR-008-tenant-project-ownership-baseline.md
- doc/architecture/ERD.md
- doc/architecture/db_schema_v1.sql
- doc/api/openapi.draft.yaml
Deliverables:
- Ownership semantics adopted in docs:
- tenant(org) as ownership root,
- project as resource scope,
- user as actor attribution (not owner-of-record).
- Baseline schema tightened for ownership invariants (reset-baseline, no data migration):
- allocations.org_id non-null
- allocations.project_id non-null
- Membership baseline added now to lock authz query shape:
- tenant_memberships (MVP-enforced single-tenant via UNIQUE(user_id))
- project_memberships (UNIQUE(project_id, user_id))
- Hybrid auth context baseline documented:
- tenant claim remains in JWT/session for MVP boundary enforcement
- active project remains request-scoped and membership-validated
- API contract updated for explicit project context on project-owned mutations.
- Authorization rules updated to tenant/project checks for resource list/read/mutate paths.
- Billing scope plan documented for tenant-owned customer/balance model.
- Policy scope plan documented for both project cap and tenant cap concurrency limits.
Done when:
- [ ] ADR-008 is accepted and linked from architecture index.
- [ ] Ownership baseline doc is approved.
- [ ] doc/governance/Testing_Standards.md tenant/project authz coverage section is approved before phase implementation starts.
- [ ] ERD and db_schema_v1.sql reflect non-null allocation ownership fields.
- [ ] ERD and db_schema_v1.sql include membership baseline tables and constraints.
- [ ] OpenAPI reflects project-context requirements on project-owned mutations.
Pre-Phase Service Accounts — Machine Identity Baseline (Blocking for App Integrations)
Purpose: define machine identity contracts and controls before app-team integration work.
Read first:
- doc/architecture/Service_Account_Model.md
- doc/architecture/adrs/ADR-004-identity-authz-model.md
- doc/architecture/Tenant_Project_Ownership_Baseline.md
- doc/governance/Security_Control_Verification.md
- doc/api/openapi.draft.yaml
Deliverables:
- Service-account ownership baseline defined:
- service account belongs to one project and one tenant,
- machine auth is project-scoped, tenant-bounded.
- Planned schema objects documented:
- service_accounts
- service_account_credentials
- Planned token model documented:
- actor_type=service_account
- short-lived token TTL, audience + scope claims.
- Planned API surface documented for lifecycle + token issuance.
- Security control set documented for key storage, rotation, revocation, and audit.
Done when:
- [ ] Service account baseline doc is approved.
- [ ] OpenAPI change list for service-account endpoints/tokens is defined.
- [ ] ERD/db schema change list includes service-account tables and constraints.
- [ ] Authz matrix includes actor_type=service_account.
- [ ] Security controls for service accounts are added to verification checklist.
Pre-Phase Resource Naming — Canonical Identifier Baseline (Blocking)
Purpose: establish one machine-readable identifier shape across API/events/audit before broad feature expansion.
Read first:
- doc/architecture/Resource_Identifier_Spec.md
- doc/architecture/adrs/ADR-009-canonical-resource-identifier-format.md
- doc/architecture/Tenant_Project_Ownership_Baseline.md
Deliverables:
- Canonical format adopted:
- core42:aicloud:{region}:{tenant_id}:{project_id}:{resource_type}:{resource_id}
- Resource type registry baseline documented for MVP domains.
- Shared parser/formatter implementation plan captured (single package, no per-service drift).
- API/event/audit adoption targets listed for initial rollout.
Done when:
- [ ] ADR-009 is accepted and linked from architecture index.
- [ ] Resource identifier spec is approved.
- [ ] Architecture docs reference canonical identifier usage.
- [ ] Initial implementation backlog includes parser + boundary emission tasks.
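As a rough illustration of the shared parser/formatter for the canonical format, the sketch below round-trips an identifier through a single type. The type and function names are hypothetical, and it assumes every segment is non-empty and contains no colon; Resource_Identifier_Spec.md may define different rules (for example for tenant-level resources).

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const prefix = "core42:aicloud"

// ErrInvalidResourceID is a hypothetical sentinel for malformed identifiers.
var ErrInvalidResourceID = errors.New("invalid resource identifier")

// ResourceID holds the five variable segments of the canonical format.
type ResourceID struct {
	Region, TenantID, ProjectID, ResourceType, ID string
}

// String formats the identifier in canonical form.
func (r ResourceID) String() string {
	return strings.Join([]string{prefix, r.Region, r.TenantID, r.ProjectID, r.ResourceType, r.ID}, ":")
}

// Parse splits a canonical identifier and rejects wrong prefixes,
// wrong segment counts, and empty segments (an assumption of this sketch).
func Parse(s string) (ResourceID, error) {
	parts := strings.Split(s, ":")
	if len(parts) != 7 || parts[0]+":"+parts[1] != prefix {
		return ResourceID{}, fmt.Errorf("%w: %q", ErrInvalidResourceID, s)
	}
	for _, p := range parts[2:] {
		if p == "" {
			return ResourceID{}, fmt.Errorf("%w: empty segment in %q", ErrInvalidResourceID, s)
		}
	}
	return ResourceID{
		Region:       parts[2],
		TenantID:     parts[3],
		ProjectID:    parts[4],
		ResourceType: parts[5],
		ID:           parts[6],
	}, nil
}

func main() {
	// All segment values here are made up for the example.
	id := ResourceID{Region: "r1", TenantID: "t1", ProjectID: "p1", ResourceType: "allocation", ID: "a1"}
	fmt.Println(id)

	parsed, err := Parse(id.String())
	fmt.Println(parsed.ResourceType, err)

	_, err = Parse("core42:aicloud:r1:t1")
	fmt.Println(err)
}
```

Keeping the formatter and parser on one type is what lets API, event, and audit emitters share one definition instead of each assembling strings on its own.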
Pre-Phase Frontend — UX Foundation Packages
Purpose: establish shared UX platform primitives before feature slices, so UI work stays consistent and API-first.
Deliverables:
- packages/web/src/lib/api/:
- contract client wrapper (typed calls, auth header injection, refresh flow handling, correlation-id propagation)
- common error mapper from ErrorResponse to UX-safe message model
- packages/web/src/lib/query/:
- cache/query conventions (keys, stale times, retry defaults)
- packages/web/src/lib/session/:
- session/user/role state + protected-route helpers
- packages/web/src/components/system/:
- shared async states (LoadingState, EmptyState, ErrorState, RestrictedState, RateLimitedState)
- pagination/table primitives bound to cursor model
- confirm modal + destructive action pattern
- packages/web/src/components/a11y/:
- focus trap, keyboard shortcuts helper, aria-live announcement helper
- packages/web/src/styles/:
- design tokens (color/spacing/type/radius/elevation) and theme contract
Done when:
- [ ] Frontend can call protected API via shared client with automatic token refresh path.
- [ ] All list screens can reuse common cursor pagination primitives.
- [ ] Shared UX state components are used by at least one screen each.
- [ ] A11y helpers are integrated in modal + notification flows.
Phase 0 — Foundation ✅ DONE
| Artifact | Status |
|---|---|
| go.mod + directory scaffold | ✅ |
| packages/shared/errors | ✅ |
| packages/shared/events | ✅ |
| packages/shared/middleware | ✅ |
| packages/shared/policy | ✅ |
| packages/shared/db + rdb | ✅ |
| Initial cmd/* scaffolds | ✅ |
| doc/governance/Coding_Standards.md | ✅ |
| doc/governance/Testing_Standards.md | ✅ |
Phase 1 — Test harness + cmd/api wiring
Prerequisite: Phase 0. Parallel: 1A (test harness) and 1B (cmd/api wiring) can run concurrently.
1A — Unit tests for packages/shared
Read first: doc/governance/Testing_Standards.md §Go Test Patterns
Files to create:
packages/shared/errors/errors_test.go
packages/shared/middleware/sanitize_test.go
packages/shared/middleware/correlation_test.go
packages/shared/middleware/auth_test.go # httptest + fake JWKS server
packages/shared/middleware/ratelimit_test.go # stub policy + stub Redis via interface
packages/shared/middleware/idempotency_test.go # stub pgxpool
packages/shared/events/types_test.go
packages/shared/policy/policy_test.go # stub DB via interface
Tests to write:
- errors: New(), WithDetails(), all ErrCode constants compile
- sanitize: redacts each blocked field, redacts ssh_private_key* prefix, recurses into nested maps, leaves unblocked fields untouched
- correlation: generates UUID when header absent, echoes existing header, stores in context
- auth: valid JWT passes, expired JWT → 401 ErrTokenExpired, missing Bearer → 401 ErrTokenMissing, bad signature → 401 ErrTokenInvalid, RequireAdmin passes admin role, RequireAdmin blocks user role
- ratelimit: under limit passes, at limit+1 returns 429, X-RateLimit-* headers present, fails open on Redis error
- idempotency: no header → passes through, same key+body → replays cached response, same key+different body → 422, in-flight key → 409
- events/types: all Subject constants are non-empty and unique, all payload structs have json tags
- policy: GetInt / GetBool / GetString return correct types, cache hit skips DB call, cache miss queries DB
Done when: make test passes with zero failures.
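To make the sanitize test cases concrete, here is a minimal sketch of the behavior they describe: redact blocked fields, redact any key with the ssh_private_key prefix, recurse into nested maps, and leave everything else untouched. The blocked-field list and the redaction marker are placeholders; the real ones live in packages/shared/middleware.

```go
package main

import (
	"fmt"
	"strings"
)

// blocked is a hypothetical deny-list used only for this example.
var blocked = map[string]bool{"password": true, "token": true}

const redacted = "[REDACTED]"

// Sanitize returns a redacted copy of in and does not mutate its argument.
func Sanitize(in map[string]any) map[string]any {
	out := make(map[string]any, len(in))
	for k, v := range in {
		switch {
		case blocked[k] || strings.HasPrefix(k, "ssh_private_key"):
			out[k] = redacted
		default:
			if nested, ok := v.(map[string]any); ok {
				out[k] = Sanitize(nested) // recurse into nested maps
			} else {
				out[k] = v
			}
		}
	}
	return out
}

func main() {
	payload := map[string]any{
		"user":                "alice",
		"password":            "hunter2",
		"ssh_private_key_pem": "-----BEGIN-----",
		"node":                map[string]any{"token": "abc", "region": "r1"},
	}
	fmt.Println(Sanitize(payload))
}
```

Returning a copy rather than redacting in place keeps the original request data usable by the handler while only the logged view is scrubbed.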
1B — Wire cmd/api + outbox relay
Read first: doc/architecture/Inter_Service_Communication.md
Files to create / replace:
cmd/api/main.go # full wiring (replaces stub)
cmd/api/config.go # env-var config struct with validation
cmd/api/server.go # http.Server setup, graceful shutdown
cmd/api/routes.go # route mounting (historical bootstrap note; current handlers are implemented)
cmd/api/outbox.go # outbox relay loop
cmd/outbox-relay/main.go # dedicated outbox relay process (scalable option)
packages/shared/outbox/relay.go # shared relay logic (used by billing-worker too)
config.go — reads and validates:
DATABASE_URL (required)
REDIS_URL (required)
NATS_URL (required, default nats://localhost:4222)
KEYCLOAK_ISSUER_URL (required)
PORT (default 8080)
OTEL_EXPORTER_OTLP_ENDPOINT (optional)
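One way to structure the validation is to load the variables through an injectable lookup function, so config_test.go can exercise missing-variable cases without touching the process environment. The Config field names and the LoadConfig signature below are assumptions for illustration; since NATS_URL has a default, the sketch never reports it as missing.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Config mirrors the variables listed above (field names are hypothetical).
type Config struct {
	DatabaseURL       string
	RedisURL          string
	NatsURL           string
	KeycloakIssuerURL string
	Port              string
	OTLPEndpoint      string // optional
}

func orDefault(v, def string) string {
	if v == "" {
		return def
	}
	return v
}

// LoadConfig reads settings via getenv (os.Getenv in production) and
// reports every missing required variable in a single error.
func LoadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		DatabaseURL:       getenv("DATABASE_URL"),
		RedisURL:          getenv("REDIS_URL"),
		NatsURL:           orDefault(getenv("NATS_URL"), "nats://localhost:4222"),
		KeycloakIssuerURL: getenv("KEYCLOAK_ISSUER_URL"),
		Port:              orDefault(getenv("PORT"), "8080"),
		OTLPEndpoint:      getenv("OTEL_EXPORTER_OTLP_ENDPOINT"),
	}
	required := []struct{ name, value string }{
		{"DATABASE_URL", cfg.DatabaseURL},
		{"REDIS_URL", cfg.RedisURL},
		{"KEYCLOAK_ISSUER_URL", cfg.KeycloakIssuerURL},
	}
	var missing []string
	for _, r := range required {
		if r.value == "" {
			missing = append(missing, r.name)
		}
	}
	if len(missing) > 0 {
		return Config{}, fmt.Errorf("missing required env vars: %s", strings.Join(missing, ", "))
	}
	return cfg, nil
}

func main() {
	cfg, err := LoadConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("configured port:", cfg.Port)
}
```

Collecting all missing names before returning means an operator sees the full list on the first failed start instead of fixing variables one at a time.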
main.go wiring order:
1. Parse config
2. middleware.SetupOTel(ctx, "gpuaas-api", version)
3. db.Connect(ctx, cfg.DatabaseURL)
4. rdb.Connect(ctx, cfg.RedisURL)
5. events.Connect(cfg.NatsURL) → events.InitStreams(js)
6. middleware.NewJWKSAuth(ctx, cfg.KeycloakIssuerURL)
7. policy.NewPostgresClient(pool)
8. middleware.NewRateLimiter(rdb, policyClient)
9. Mount middleware chain: Tracing → CorrelationID → Auth → RateLimit
10. Start outbox relay goroutine
11. server.ListenAndServe
12. On SIGTERM: drain NATS → close pool → shutdown HTTP server
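Step 9 depends on the middleware running in a fixed order. The sketch below shows one common way to compose net/http middleware so that the first one listed is outermost; the Chain helper and the placeholder middleware are hypothetical and stand in for the real packages/shared/middleware constructors.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Middleware wraps a handler with additional behavior.
type Middleware func(http.Handler) http.Handler

// Chain applies middleware so that the first argument runs first.
func Chain(h http.Handler, mws ...Middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// named returns a placeholder middleware that records when it runs.
func named(name string, trace *[]string) Middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			*trace = append(*trace, name)
			next.ServeHTTP(w, r)
		})
	}
}

// runChain sends one request through the chain and returns the execution order.
func runChain() []string {
	var trace []string
	final := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		trace = append(trace, "handler")
	})
	h := Chain(final,
		named("Tracing", &trace),
		named("CorrelationID", &trace),
		named("Auth", &trace),
		named("RateLimit", &trace),
	)
	req := httptest.NewRequest(http.MethodGet, "/api/v1/users/me", nil)
	h.ServeHTTP(httptest.NewRecorder(), req)
	return trace
}

func main() {
	fmt.Println(runChain())
}
```

Putting Tracing and CorrelationID ahead of Auth means rejected requests still carry a trace and a correlation ID, which matters when debugging 401 and 429 responses.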
outbox relay — claims rows with:
- SELECT ... FROM outbox_events WHERE status = 'pending' ORDER BY occurred_at LIMIT 50 FOR UPDATE SKIP LOCKED
- then publishes and updates status in the same worker transaction boundary.
- Publish each row via events.PublishTyped
- On success: UPDATE outbox_events SET status = 'published', published_at = now()
- On failure: UPDATE outbox_events SET retry_count = retry_count + 1, last_attempted_at = now(); after 10 retries set status = 'failed'
- Runs every 2 s; jitter ±200 ms to avoid thundering herd on multi-instance deploy
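The timing and retry rules above reduce to two small pure functions, sketched here under stated assumptions: the function names are hypothetical, and "after 10 retries" is read as "once retry_count reaches 10". The database claim and publish steps are deliberately left out.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	baseInterval = 2 * time.Second
	maxJitter    = 200 * time.Millisecond
	maxRetries   = 10
)

// nextInterval returns the poll interval with uniform jitter in
// [-maxJitter, +maxJitter], so multiple relay instances drift apart.
func nextInterval() time.Duration {
	offset := time.Duration(rand.Int63n(int64(2*maxJitter)+1)) - maxJitter
	return baseInterval + offset
}

// statusAfterFailure returns the retry count and status to persist after a
// failed publish. Reaching maxRetries marks the row failed (assumed reading).
func statusAfterFailure(retryCount int) (int, string) {
	retryCount++
	if retryCount >= maxRetries {
		return retryCount, "failed"
	}
	return retryCount, "pending"
}

func main() {
	fmt.Println("next poll in", nextInterval())

	count, status := 0, "pending"
	for status == "pending" {
		count, status = statusAfterFailure(count)
	}
	fmt.Println("row marked", status, "after", count, "failed attempts")
}
```

Keeping these rules free of I/O makes the "failed after 10 retries" case in relay_test.go cheap to cover without a NATS or Postgres stub.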
Endpoints to implement:
- GET /api/v1/healthz → checks DB ping + Redis ping + NATS connection; returns 200 or 503
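A health handler of this shape can take its dependency probes as plain functions, which keeps routes_test.go free of real connections. In the sketch below the Check type, the handler name, and the 503 body are assumptions; only the 200/503 split and the {"status":"ok"} body come from this plan.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Check is a hypothetical dependency probe (DB ping, Redis ping, NATS status).
type Check func(ctx context.Context) error

var errDown = errors.New("dependency down")

// HealthzHandler returns 200 when every check passes and 503 otherwise.
func HealthzHandler(checks map[string]Check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		status, code := "ok", http.StatusOK
		for _, check := range checks {
			if err := check(r.Context()); err != nil {
				status, code = "unavailable", http.StatusServiceUnavailable
				break
			}
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(map[string]string{"status": status})
	}
}

// probe runs the handler once with a stubbed DB result and returns the status code.
func probe(dbErr error) int {
	h := HealthzHandler(map[string]Check{
		"db":    func(context.Context) error { return dbErr },
		"redis": func(context.Context) error { return nil },
		"nats":  func(context.Context) error { return nil },
	})
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/api/v1/healthz", nil))
	return rec.Code
}

func main() {
	fmt.Println("all up:", probe(nil))
	fmt.Println("db down:", probe(errDown))
}
```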
Tests to write:
cmd/api/config_test.go # missing required env → error
cmd/api/routes_test.go # GET /healthz returns 200 with all deps up; 503 with DB down
packages/shared/outbox/relay_test.go # pending rows published; retry incremented on NATS error; failed after 10 retries
Done when:
- [ ] make dev-infra && make dev-api starts without error
- [ ] curl localhost:8080/api/v1/healthz returns {"status":"ok"}
- [ ] make test passes
Phase 2 — Auth + Users service
Prerequisite: Phase 1 complete.
Read first: doc/api/openapi.draft.yaml §Auth §Users, doc/architecture/Inter_Service_Communication.md §JWT
Files to create:
packages/services/auth/service.go
packages/services/auth/handler.go
packages/services/auth/handler_test.go
packages/services/auth/service_test.go
packages/services/auth/models.go
Endpoints to implement:
| Method | Path | Notes |
|---|---|---|
| GET | /api/v1/auth/oidc/authorize | Redirect to Keycloak authorize URL with PKCE params |
| POST | /api/v1/auth/oidc/exchange | Exchange code for tokens; upsert user in users table |
| POST | /api/v1/auth/personal/login | Personal account login (feature-flag controlled) |
| POST | /api/v1/auth/token/refresh | Forward refresh token to Keycloak |
| POST | /api/v1/auth/logout | Revoke refresh token at Keycloak |
| GET | /api/v1/users/me | Returns current user from users table by JWT sub |
Service logic:
- UpsertUserFromClaims(ctx, claims) — INSERT … ON CONFLICT (oidc_issuer, oidc_subject) DO UPDATE SET …; maps realm_access.roles claim to users.role
- GetUserByOIDCSub(ctx, issuer, subject) — lookup by (oidc_issuer, oidc_subject) unique index
DB tables touched: users
Tests to write:
- UpsertUserFromClaims: new user created, existing user updated, role mapped correctly
- GetUserByOIDCSub: found, not-found → ErrUserNotFound
- GET /users/me: valid token returns user; missing token → 401; unknown sub → 404
Done when:
- [ ] POST /auth/oidc/exchange with dev Keycloak token upserts a user row
- [ ] GET /users/me returns the user
- [ ] All unit tests pass
Phase 3 — Inventory service (Catalog + Nodes)
Prerequisite: Phase 1 complete. (Parallel with Phase 2.)
Read first: doc/api/openapi.draft.yaml §Catalog §Nodes §AdminNodes, doc/architecture/db_schema_v1.sql
Files to create:
packages/services/inventory/service.go
packages/services/inventory/handler.go
packages/services/inventory/handler_test.go
packages/services/inventory/service_test.go
packages/services/inventory/models.go
Endpoints to implement:
| Method | Path | Auth | Notes |
|---|---|---|---|
| GET | /api/v1/skus | user | List active SKUs from sku_catalog |
| GET | /api/v1/nodes | user | List node lifecycle + occupancy projection (no SSH secrets) |
| GET | /api/v1/admin/nodes | admin | List all nodes with onboarding mode and occupancy context |
| POST | /api/v1/admin/nodes | admin | Insert node; validate SKU exists; choose onboarding mode (manual or maas) |
| POST | /api/v1/admin/nodes/{node_id}/probe | admin | Reachability probe; update lifecycle status (active or offline) |
| DELETE | /api/v1/admin/nodes/{node_id} | admin | Soft-retire node (status = 'retired') |
Service logic:
- ListSKUs(ctx) — SELECT … FROM sku_catalog WHERE active = true ORDER BY sku
- ListAvailableNodes(ctx) — schedulable nodes are lifecycle active and occupancy available
- CreateNode(ctx, req) — validate SKU, insert; write audit log
- ProbeNode(ctx, nodeID) — SSH dial with 10 s timeout; set active/offline; write audit log
- DisableNode(ctx, nodeID) — set retired; write audit log; fail if node has active allocation
DB tables touched: sku_catalog, nodes, audit_logs
Tests:
- ListAvailableNodes: only lifecycle active + occupancy available nodes returned
- DisableNode: fails when node has active allocation (ErrNodeInUse)
- POST /admin/nodes without admin token → 403
Done when:
- [ ] GET /api/v1/skus returns seeded SKUs
- [ ] Admin can register and probe a node
Phase 4 — Billing service (read path)
Prerequisite: Phase 2 (user identity).
Read first: doc/architecture/State_Machines.md §3-4, doc/architecture/db_schema_v1.sql §ledger_entries §usage_records
Files to create:
packages/services/billing/service.go
packages/services/billing/handler.go
packages/services/billing/handler_test.go
packages/services/billing/service_test.go
packages/services/billing/models.go
Endpoints to implement:
| Method | Path | Notes |
|---|---|---|
| GET | /api/v1/billing/balance | Sum of ledger_entries.amount_minor for user; never read from a stored balance column |
| GET | /api/v1/billing/usage | Paginated usage_records for user |
| GET | /api/v1/billing/usage/csv | CSV export of usage records |
Service logic:
- GetBalance(ctx, userID) — SELECT COALESCE(SUM(amount_minor),0) FROM ledger_entries WHERE user_id = $1; never a balance column
- GetUsage(ctx, userID, filter) — paginated query on usage_records
- GetLedger(ctx, userID, filter) — paginated query on ledger_entries
- CreditLedger(ctx, tx, userID, amount, entryType, refID, corrID) — shared helper used by payments and admin; inserts a ledger row inside an existing transaction
Tests:
- Balance: credits and debits sum correctly; empty → 0
- Balance: no balance column in schema — test verifies query uses SUM
Done when:
- [ ] GET /billing/balance returns correct sum after seeded ledger entry
- [ ] CSV export streams correctly
Phase 5 — Payments service
Prerequisite: Phase 4 (billing ledger credit helper).
Read first: doc/api/openapi.draft.yaml §Payments §AdminPayments, doc/architecture/State_Machines.md §5
Files to create:
packages/services/payments/service.go
packages/services/payments/handler.go
packages/services/payments/webhook.go
packages/services/payments/handler_test.go
packages/services/payments/service_test.go
packages/services/payments/models.go
cmd/webhook-worker/main.go # implemented (historically replaced stub)
Endpoints to implement:
| Method | Path | Notes |
|---|---|---|
| POST | /api/v1/payments/checkout-session | Create Stripe session; insert payment_sessions row; idempotent via X-Idempotency-Key |
| POST | /api/v1/payments/customer-portal-session | Stripe billing portal URL |
| POST | /api/v1/payments/webhook | Stripe webhook — buffer the raw body before any JSON parse |
| GET | /api/v1/admin/payments/sessions | List stuck/failed sessions for reconciliation |
Service logic:
- CreateCheckoutSession(ctx, userID, req) — validate amount against policy min/max; create Stripe session; insert payment_sessions with status = 'initiated'; idempotency via ix_payment_sessions_idempotency
- HandleWebhook(ctx, rawBody, sigHeader) — stripe.ConstructEvent; on checkout.session.completed: update payment_sessions to checkout_completed, then in one transaction: post ledger credit + update to credited + write payments.balance_credited to outbox
- Amount mismatch → failed_reconcile
Critical: POST /payments/webhook must read and buffer r.Body as raw bytes BEFORE calling any JSON decoder. The Stripe signature is computed over the exact raw bytes.
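The ordering requirement can be illustrated with a standard-library handler that reads the body once, verifies a signature over those exact bytes, and only then decodes JSON. This is a simplified stand-in: the X-Signature header and the plain hex HMAC are assumptions for the example and are not Stripe's header format, which the real handler would check through stripe.ConstructEvent.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

const maxBodyBytes = 1 << 20 // hypothetical 1 MiB cap on webhook payloads

// sign returns the hex HMAC-SHA256 of body (illustrative scheme only).
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

func WebhookHandler(secret []byte) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// 1. Buffer the exact bytes that were sent.
		raw, err := io.ReadAll(http.MaxBytesReader(w, r.Body, maxBodyBytes))
		if err != nil {
			http.Error(w, "unreadable body", http.StatusBadRequest)
			return
		}
		// 2. Verify the signature over the raw bytes, in constant time.
		if !hmac.Equal([]byte(sign(secret, raw)), []byte(r.Header.Get("X-Signature"))) {
			http.Error(w, "invalid signature", http.StatusBadRequest)
			return
		}
		// 3. Only now parse the payload.
		var evt struct {
			Type string `json:"type"`
		}
		if err := json.Unmarshal(raw, &evt); err != nil {
			http.Error(w, "malformed event", http.StatusBadRequest)
			return
		}
		fmt.Fprintf(w, "accepted %s", evt.Type)
	}
}

// deliver signs signedBody but sends sentBody, and returns the response code.
func deliver(secret []byte, signedBody, sentBody string) int {
	req := httptest.NewRequest(http.MethodPost, "/api/v1/payments/webhook", strings.NewReader(sentBody))
	req.Header.Set("X-Signature", sign(secret, []byte(signedBody)))
	rec := httptest.NewRecorder()
	WebhookHandler(secret)(rec, req)
	return rec.Code
}

func main() {
	secret := []byte("whsec_example")
	body := `{"type":"checkout.session.completed"}`
	fmt.Println("untouched body:", deliver(secret, body, body))
	fmt.Println("mutated body:", deliver(secret, body, body+" "))
}
```

Even a single trailing space changes the MAC, which is why decoding and re-encoding the JSON before verification breaks otherwise valid webhooks.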
Tests:
- Checkout session created and payment_sessions row inserted
- Webhook with valid signature credits balance + transitions session state
- Duplicate webhook (same stripe_event_id) does not double-credit (idempotent via stripe_events PK)
- Webhook with mutated body → 400 (signature invalid)
- Amount below minimum → 400
Done when:
- [ ] POST /payments/checkout-session returns a Stripe URL
- [ ] Webhook handler verifies signature and credits balance
- [ ] Duplicate webhook rejected
Phase 6 — Provisioning orchestrator
Prerequisite: Phases 2, 3, 4 (auth, inventory, billing balance check) and Pre-Phase Security (Encryption Envelope Baseline).
Read first: doc/architecture/State_Machines.md §1, doc/architecture/Sequence_Flows.md, doc/api/openapi.draft.yaml §Allocations
Files to create:
packages/services/provisioning/orchestrator/service.go
packages/services/provisioning/orchestrator/handler.go
packages/services/provisioning/orchestrator/handler_test.go
packages/services/provisioning/orchestrator/service_test.go
packages/services/provisioning/orchestrator/models.go
packages/services/provisioning/orchestrator/statemachine.go
Endpoints to implement:
| Method | Path | Notes |
|---|---|---|
| POST | /api/v1/allocations | Create allocation; check balance + concurrency limit via policy; insert requested; write outbox provisioning.requested; write usage_records row |
| GET | /api/v1/allocations | Paginated list for current user |
| GET | /api/v1/allocations/{id} | Single allocation; ownership check |
| POST | /api/v1/allocations/{id}/release | Transition to releasing; write outbox provisioning.releasing.requested; also accepts release_failed → releasing |
| GET | /api/v1/ssh-keys | List current user's registered SSH public keys |
| POST | /api/v1/ssh-keys | Register SSH public key for runtime access |
| DELETE | /api/v1/ssh-keys/{key_id} | Revoke/remove SSH public key |
| GET | /api/v1/admin/allocations | Admin list with status filter |
| POST | /api/v1/admin/allocations/{id}/force-release | Transition release_failed → releasing; write outbox provisioning.force_release_requested; write audit log |
State machine transitions (see doc/architecture/State_Machines.md §1):
requested → provisioning (on provisioning.requested consumed by worker)
provisioning → active (on provisioning.active event)
provisioning → failed (on provisioning.failed event)
active → releasing (user/admin release request)
releasing → released (on provisioning.releasing.completed event)
releasing → release_failed (on provisioning.release_failed event)
release_failed → releasing (user retry or admin force-release)
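The transitions listed above fit naturally into a lookup table, which is one possible shape for statemachine.go. The type and function names are hypothetical; the edges are exactly the seven listed, and doc/architecture/State_Machines.md remains the authority.

```go
package main

import "fmt"

// Status is an allocation lifecycle state.
type Status string

const (
	Requested     Status = "requested"
	Provisioning  Status = "provisioning"
	Active        Status = "active"
	Failed        Status = "failed"
	Releasing     Status = "releasing"
	Released      Status = "released"
	ReleaseFailed Status = "release_failed"
)

// transitions maps each state to the states it may move to.
// States without an entry (failed, released) are terminal.
var transitions = map[Status][]Status{
	Requested:     {Provisioning},
	Provisioning:  {Active, Failed},
	Active:        {Releasing},
	Releasing:     {Released, ReleaseFailed},
	ReleaseFailed: {Releasing},
}

// Transition returns the new state, or an error if the edge is not allowed.
func Transition(from, to Status) (Status, error) {
	for _, next := range transitions[from] {
		if next == to {
			return to, nil
		}
	}
	return from, fmt.Errorf("illegal transition %s -> %s", from, to)
}

func main() {
	s, err := Transition(Active, Releasing)
	fmt.Println(s, err)

	s, err = Transition(Released, Active)
	fmt.Println(s, err)
}
```

A handler would typically map the illegal-transition error to 409, matching the "already releasing → 409" test case.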
Concurrency check: SELECT COUNT(*) FROM allocations WHERE user_id = $1 AND status IN ('requested','provisioning','active','releasing') — compare against policy.KeyAllocationMaxConcurrentPerUser.
Balance check: billing.GetBalance(ctx, userID) > 0 before creating allocation.
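Taken together, the two checks form a small admission gate in front of allocation creation. The sketch below is illustrative only: the function and error names are hypothetical, checking balance before concurrency is an assumed order, and "compare against" is read as rejecting once the live count has reached the policy maximum. The 402 and 429 codes come from the test list for this phase.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var (
	ErrInsufficientBalance = errors.New("insufficient balance")
	ErrConcurrencyLimit    = errors.New("concurrent allocation limit reached")
)

// CheckAdmission applies the balance and concurrency rules.
// liveAllocations is the COUNT(*) over requested/provisioning/active/releasing.
func CheckAdmission(balanceMinor int64, liveAllocations, maxConcurrent int) error {
	if balanceMinor <= 0 {
		return ErrInsufficientBalance
	}
	if liveAllocations >= maxConcurrent {
		return ErrConcurrencyLimit
	}
	return nil
}

// HTTPStatus maps an admission error to a response code.
// It is meant to be called only with a non-nil error.
func HTTPStatus(err error) int {
	switch {
	case errors.Is(err, ErrInsufficientBalance):
		return http.StatusPaymentRequired // 402
	case errors.Is(err, ErrConcurrencyLimit):
		return http.StatusTooManyRequests // 429
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	for _, c := range []struct {
		balance   int64
		live, max int
	}{{500, 0, 2}, {0, 0, 2}, {500, 2, 2}} {
		if err := CheckAdmission(c.balance, c.live, c.max); err != nil {
			fmt.Println(HTTPStatus(err), err)
			continue
		}
		fmt.Println("admitted")
	}
}
```

Because the count and the insert happen in separate statements, a real implementation would run both inside one transaction (or use a lock) so two concurrent requests cannot both pass the limit.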
DB tables touched: allocations, usage_records, outbox_events, audit_logs
Tests:
- Create allocation: happy path, insufficient balance → 402, concurrency limit → 429, SKU unavailable → 409
- Release: active → releasing transition; already releasing → 409; wrong owner → 403
- Force-release: only for release_failed status; requires admin role
- GET /allocations/{id}: wrong owner → 403, not found → 404
Done when:
- [ ] Full create → release cycle transitions correctly in DB
- [ ] All state machine edge cases tested
Phase 7 — Provisioning worker
Prerequisite: Phase 6 (allocation state machine in DB), Pre-Phase Security (Encryption Envelope Baseline), and Pre-Phase Node Agent (node agent scaffold + step-ca running).
Read first: doc/architecture/NATS_Stream_Config.md, doc/architecture/Inter_Service_Communication.md §GPU nodes
Files to create:
packages/services/provisioning/worker/workflow.go # Temporal workflow definitions
packages/services/provisioning/worker/activities.go # Node agent task activities
packages/services/provisioning/worker/ssh.go # SSH dial + exec helpers (admin probe only)
packages/services/provisioning/worker/consumer.go # NATS consumer setup
packages/services/provisioning/worker/workflow_test.go
packages/services/provisioning/worker/activities_test.go
cmd/provisioning-worker/main.go # implemented (historically replaced stub)
Note: provisioning activities use the node agent task API, not raw SSH.
See doc/architecture/Node_Agent_Spec.md §12 for the activity pattern.
ssh.go is retained only for the admin probe endpoint (POST /admin/nodes/{id}/probe).
Temporal workflows:
- ProvisionNodeWorkflow(allocationID) — activity sequence: AllocateNode → ProvisionUser (via node agent allocation.provision_user task) → UpdateAllocationActive → EmitOutboxActive
- ReleaseNodeWorkflow(allocationID) — activity sequence: RevokeUser (via node agent allocation.revoke_user task) → UpdateAllocationReleased → EmitOutboxReleasingCompleted; on max retries → UpdateAllocationReleaseFailed + EmitOutboxReleaseFailed
Node agent activities:
- ProvisionUserActivity — inserts node_tasks row (allocation.provision_user); polls until succeeded or timeout; applies user public key set and avoids persistent storage of user private keys in control-plane DB.
- RevokeUserActivity — inserts node_tasks row (allocation.revoke_user); polls until succeeded or timeout; sets release_failed_reason on task failed or timeout
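Both activities share the same "enqueue, then poll until succeeded or timeout" core. The sketch below shows that polling loop with the standard library only; it does not use the Temporal SDK, and the StatusFunc type, the status names other than succeeded and failed, and the error values are assumptions. Node_Agent_Spec.md §12 describes the actual activity pattern.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type TaskStatus string

const (
	Queued    TaskStatus = "queued"
	Running   TaskStatus = "running"
	Succeeded TaskStatus = "succeeded"
	Failed    TaskStatus = "failed"
)

var (
	ErrTaskFailed  = errors.New("node task failed")
	ErrTaskTimeout = errors.New("node task timed out")
)

// StatusFunc is a hypothetical reader for a node_tasks row's status.
type StatusFunc func(ctx context.Context, taskID string) (TaskStatus, error)

// WaitForTask polls until the task succeeds, fails, or the timeout elapses.
func WaitForTask(ctx context.Context, taskID string, get StatusFunc, interval, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		status, err := get(ctx, taskID)
		if err != nil {
			return err
		}
		switch status {
		case Succeeded:
			return nil
		case Failed:
			return ErrTaskFailed
		}
		select {
		case <-ctx.Done():
			if errors.Is(ctx.Err(), context.DeadlineExceeded) {
				return ErrTaskTimeout
			}
			return ctx.Err() // caller cancelled
		case <-ticker.C:
		}
	}
}

// simulate replays a fixed status sequence, repeating the last one forever.
func simulate(statuses ...TaskStatus) error {
	i := 0
	get := func(context.Context, string) (TaskStatus, error) {
		s := statuses[i]
		if i < len(statuses)-1 {
			i++
		}
		return s, nil
	}
	return WaitForTask(context.Background(), "task-1", get, time.Millisecond, 50*time.Millisecond)
}

func main() {
	fmt.Println(simulate(Queued, Running, Succeeded))
	fmt.Println(simulate(Queued, Failed))
	fmt.Println(simulate(Queued))
}
```

Returning distinct errors for failure and timeout lets the caller set release_failed_reason precisely and keeps enqueueing a task from being mistaken for terminal success.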
NATS consumers to register on startup (durable names from NATS_Stream_Config.md):
- provisioning_worker_provision_requested → start ProvisionNodeWorkflow
- provisioning_worker_releasing_requested → start ReleaseNodeWorkflow
- provisioning_worker_force_release → start ReleaseNodeWorkflow
Tests:
- ProvisionUserActivity: node task queued; task failed → activity returns error; task succeeded → allocation updated
- RevokeUserActivity: task timeout → activity returns error with reason
- ProvisionNodeWorkflow: activity failures retried; exhausted retries → release_failed state
Done when:
- [ ] make dev-infra && make dev-worker-provisioning starts
- [ ] Creating an allocation (Phase 6) triggers the workflow and transitions to active
- [ ] Agent runtime waits for node_tasks completion and maps timeout/failure to failed/release_failed deterministically
- [ ] Internal node endpoints enforce node identity authorization with mTLS certificate binding
- [ ] Temporal workflow path includes retry + compensation behavior for duplicate events and task timeouts
- [ ] Provisioning resumes from the provisioning state when node assignment arrives (no stranded rows)
- [ ] step-ca integration + KMS signing key lifecycle verified in staging
- [ ] Integration tests cover full node-agent flow (requested -> active, releasing -> released, release_failed retry)
- [ ] Provisioning metrics include task queue depth, dispatch latency, timeout count, and failure reasons
- [ ] Private-key handling cutover complete: no persistent server-side user SSH private-key storage before public launch
Remaining backend provisioning checklist (active):
- [ ] Agent runtime completion semantics (node_tasks enqueue is not terminal success).
- [ ] Internal node auth hardening (mTLS identity -> node_id binding).
- [ ] Workflow robustness for retries/compensation and duplicate event handling.
- [ ] Scheduler/assignment re-trigger when allocation is waiting for node assignment.
- [ ] PKI production path completion (step-ca + rotation + KMS key lifecycle).
- [ ] Provisioning integration/e2e coverage for success/failure/retry/replay.
- [ ] Operational metrics + alerts for provisioning control loop health.
- [ ] Remove persistent private-key storage dependency from provisioning/terminal path via pre-launch cutover (one-time delivery and/or user-managed key model).
Phase 8 — Billing worker¶
Prerequisite: Phase 4 (ledger credit helper), Phase 6 (usage_records).
Read first: doc/architecture/State_Machines.md §3-4, doc/architecture/NATS_Stream_Config.md
Files to create:
packages/services/billing/accrual.go # billing loop logic
packages/services/billing/accrual_test.go
packages/services/billing/consumer.go # NATS consumer setup
cmd/billing-worker/main.go # implemented (historically replaced stub)
Worker responsibilities:
1. Accrual loop — every policy.KeyBillingWindowSeconds:
- Query usage_records WHERE end_time IS NULL (active usage)
- For each: compute elapsed_hours * gpu_hourly_price_minor * gpus_total_snapshot (elapsed time expressed in hours, because the price is hourly)
- Insert ledger_entries debit row; update usage_records.last_billed_at + accrued_cost_minor
- Idempotency key: (usage_record_id, window_start) in idempotency_keys
2. Low balance check — after each accrual cycle:
- GetBalance(ctx, userID) for every user with active usage
- If balance ≤ policy.KeyBillingLowBalanceThresholdMinor and last_low_balance_notified_at is nil or > 24 h ago: write outbox billing.low_balance_warning; update users.last_low_balance_notified_at
3. Balance depleted — if balance ≤ 0: write outbox billing.balance_depleted; write outbox provisioning.force_release_requested for each active allocation
NATS consumers (durable names from NATS_Stream_Config.md):
- billing_worker_provision_active → open usage_records row
- billing_worker_releasing_completed → close usage_records row (end_time = now())
- billing_worker_release_failed → close usage_records row (billing stops)
- billing_worker_balance_credited → check if any paused allocations can resume
Tests:
- Accrual: correct cost for GPU-hours elapsed; idempotent on replay
- Low balance: warning emitted once per transition; not re-emitted while still low
- Depleted: force-release outbox row written for each active allocation
Done when:
- [ ] make dev-worker-billing starts
- [ ] Creating and holding an allocation for 2 billing cycles accrues expected cost
- [ ] Depleting balance triggers force-release flow
Phase 9 — Terminal service¶
Prerequisite: Phase 6 (allocation active + runtime access credential model).
Read first: doc/architecture/Inter_Service_Communication.md §Terminal tokens
Files to create:
packages/services/terminal/service.go
packages/services/terminal/handler.go
packages/services/terminal/proxy.go # WebSocket → SSH proxy
packages/services/terminal/handler_test.go
packages/services/terminal/service_test.go
Endpoints to implement:
| Method | Path | Notes |
|---|---|---|
| POST | /api/v1/allocations/{id}/terminal-token | Mint single-use 256-bit token; store in Redis with 300 s TTL; key: terminal_token:{token} → {user_id, allocation_id} |
| WS | /ws/terminal/{allocation_id} | WebSocket upgrade; validate terminal token via Sec-WebSocket-Protocol for browser clients or Authorization header for non-browser clients (never query string); open SSH shell; stream bidirectionally |
Token storage: Redis key terminal_token:{token} → {user_id, allocation_id}, with a 300 s TTL.
Token validation: GETDEL from Redis (atomic single-use consume).
WebSocket proxy:
- Dial SSH using non-persistent runtime credentials
- Bidirectional copy: WS frames ↔ SSH stdin/stdout/stderr
- On SSH disconnect: close WS with normal close code
Tests:
- Token minting: stored in Redis with correct TTL
- Token is single-use: second validation returns not-found
- Token for wrong user → 403
- Terminal token must be in header, not query string
Done when:
- [ ] POST /api/v1/allocations/{id}/terminal-token returns an opaque token
- [ ] WS connection with token proxies to SSH
Phase 9C — Hardened Terminal Gateway Extraction (Option C)¶
Prerequisite: Phase 9 complete and stable in staging.
Read first:
- doc/architecture/Inter_Service_Communication.md §5.1
- doc/operations/Production_Platform_Baseline.md
- doc/operations/Parallel_Ops_Track.md
Goal:
- Move terminal streaming from embedded API runtime (Option B) to a dedicated hardened service (Option C) without changing public contracts.
Files to create/update:
cmd/terminal-gateway/main.go
cmd/terminal-gateway/config.go
cmd/terminal-gateway/server.go
packages/services/terminal/gateway_service.go
packages/services/terminal/gateway_service_test.go
doc/operations/runbooks/Terminal_Gateway_Incident_Runbook.md
doc/operations/local-dev/docker-compose.yaml
Scope:
1. Route ownership split:
- cmd/api: terminal token mint endpoint only.
- cmd/terminal-gateway: WS /ws/terminal/{allocation_id} only.
2. Security hardening:
- gateway can consume terminal tokens atomically via Redis GETDEL.
- strict origin/protocol validation for browser WS.
- deny direct access to non-terminal API surfaces.
3. Network policy:
- edge routes /ws/terminal/* to gateway service.
- gateway egress limited to Redis and node SSH targets.
4. Observability:
- connection success/failure counters, replay rejects, active session gauges.
- runbook links for degraded/error states in admin ops.
Tests:
- Gateway accepts valid token and rejects replay/expired/mismatched allocation token.
- Multi-instance gateway can serve concurrent sessions without sticky routing failures.
- API terminal-token endpoint remains unchanged and interoperates with gateway consumer.
- Integration: route switch from API WS handler to gateway with zero contract changes.
Done when:
- [ ] cmd/terminal-gateway runs in local dev and staging.
- [ ] /ws/terminal/* ingress points to gateway, not cmd/api.
- [ ] Terminal contracts in OpenAPI/AsyncAPI remain unchanged.
- [ ] Ops dashboard includes gateway health and replay/anomaly signals.
- [ ] Rollback procedure (route switch back to API WS handler) documented and tested.
Phase 10 — Storage service¶
Prerequisite: Phase 2 (user identity).
Read first: doc/api/openapi.draft.yaml §Storage
Files to create:
packages/services/storage/service.go
packages/services/storage/handler.go
packages/services/storage/pathsafety.go # path traversal prevention
packages/services/storage/handler_test.go
packages/services/storage/service_test.go
Endpoints to implement:
| Method | Path | Notes |
|---|---|---|
| GET | /api/v1/storage/list | List storage_objects for active project under given path prefix |
| GET | /api/v1/storage/download | Stream object bytes from S3 |
| PUT | /api/v1/storage/upload | Stream to S3; insert/update storage_objects; quota check |
| POST | /api/v1/storage/mkdir | Insert dir type row in storage_objects |
| POST | /api/v1/storage/rename | Update path column; check no traversal |
| DELETE | /api/v1/storage/delete | Delete from S3 + storage_objects |
Path safety rules:
- Resolved path must remain under /{org_id}/{project_id}/ prefix
- Reject .. components: return 400 ErrStoragePathTraversal
- Quota: SUM of size_bytes for project must stay under policy.KeyStorageQuotaBytes (add this policy key)
Tests:
- ../ traversal → 400 storage_path_traversal
- Quota exceeded → 400 storage_quota_exceeded
- Download non-existent object → 404 storage_object_not_found
Done when:
- [ ] Upload → list → download round-trip works
- [ ] Path traversal and quota tests pass
Phase 11 — Admin service¶
Prerequisite: Phases 2–6 complete (all domain entities exist).
Read first: doc/api/openapi.draft.yaml §AdminUsers §AdminAllocations §AdminAudit §AdminPayments
Files to create:
packages/services/admin/users_handler.go
packages/services/admin/nodes_handler.go
packages/services/admin/allocations_handler.go
packages/services/admin/audit_handler.go
packages/services/admin/payments_handler.go
packages/services/admin/service.go
packages/services/admin/handler_test.go
packages/services/admin/service_test.go
Endpoints to implement:
| Method | Path | Notes |
|---|---|---|
| GET | /api/v1/admin/users | Paginated user list |
| POST | /api/v1/admin/users | Create user directly (bypass OIDC) |
| GET | /api/v1/admin/users/{id} | User + balance + active allocations |
| POST | /api/v1/admin/users/{id}/balance | Admin ledger credit/adjustment; write audit log |
| POST | /api/v1/admin/users/{id}/refunds | Refund within policy window; write refund_requests; write audit log |
| GET | /api/v1/admin/allocations | Paginated with status filter (esp. release_failed) |
| POST | /api/v1/admin/allocations/{id}/force-release | Already in Phase 6 |
| GET | /api/v1/admin/audit-logs | Filterable by actor, target, action, date range |
| GET | /api/v1/admin/audit-logs/export | CSV export |
| GET | /api/v1/admin/payments/sessions | Sessions in initiated/failed_reconcile state |
All admin endpoints must:
- Be gated by middleware.RequireAdmin
- Write an audit_logs row for every mutation
Refund logic:
- Check policy.KeyAllocationRefundWindowDays from credit posting date
- Beyond window → 422 refund_window_exceeded
- Within window: call Stripe refund API OR post internal ledger credit; update refund_requests
Tests:
- Non-admin token → 403 on every admin endpoint
- Admin balance credit creates ledger entry and audit log
- Refund beyond window → 422
Done when:
- [ ] All admin CRUD operations work via Keycloak dev-admin token
- [ ] Audit log populated for every mutation
Phase 12 — Notification service¶
Prerequisite: Phase 8 (billing events), Phase 7 (provisioning events).
Read first: doc/architecture/NATS_Stream_Config.md §BILLING §PROVISIONING
Files to create:
packages/services/notification/service.go
packages/services/notification/consumer.go # NATS consumer setup
packages/services/notification/email.go # email adapter (stub → SES/SMTP)
packages/services/notification/ws.go # WebSocket broadcast (user-facing)
packages/services/notification/service_test.go
NATS consumers (durable names from NATS_Stream_Config.md):
- notification_relay_low_balance → publish user-scoped Redis notification + optional email
- notification_relay_auto_release_pending → publish user-scoped Redis notification
- notification_relay_balance_depleted → publish user-scoped Redis notification + optional email
- notification_relay_provision_active → publish user-scoped Redis notification
- notification_relay_provision_failed → publish user-scoped Redis notification + optional email
- notification_relay_releasing_completed → publish user-scoped Redis notification
- notification_relay_release_failed → publish user-scoped Redis notification + optional email
Enable/disable controlled by policy.KeyNotificationLowBalanceEnabled and policy.KeyNotificationBalanceDepletedEnabled.
Tests:
- Low balance with feature flag disabled → no notification dispatched
- Provisioning failed → email adapter called with correct user ID and reason
Done when:
- [ ] Low-balance event triggers log entry (email adapter stubbed)
- [ ] Policy flag disables notification correctly
Phase 13 — Integration test harness¶
Prerequisite: Phase 1B (dev-infra running), any service under test.
Read first: doc/governance/Testing_Standards.md §Integration test setup
Files to create:
packages/testhelpers/db.go # DB(t) — pool + t.Cleanup
packages/testhelpers/redis.go # Redis(t) — client + t.Cleanup
packages/testhelpers/nats.go # NATS(t) — connection + t.Cleanup
packages/testhelpers/truncate.go # TruncateTables(t, pool, tables...)
packages/testhelpers/jwt.go # MintToken(t, userID, roles) using test JWKS
packages/testhelpers/jwks.go # NewFakeJWKSServer(t) — httptest JWKS + token signer
MintToken generates an RS256 JWT signed by the fake JWKS server's private key,
used to test auth-protected endpoints in integration tests without Keycloak.
Add integration tests for:
packages/services/auth/service_integration_test.go
packages/services/billing/accrual_integration_test.go
packages/services/payments/webhook_integration_test.go
packages/services/provisioning/orchestrator/service_integration_test.go
Done when:
- [ ] make test-integration passes with make dev-infra running
- [ ] TruncateTables isolates each test
Phase 14 — E2E acceptance tests¶
Prerequisite: All phases complete. make e2e-up running.
Read first: doc/governance/Testing_Standards.md §Acceptance Matrix
Files to create:
tests/e2e/auth_test.go # AT-001, AT-002, AT-003
tests/e2e/marketplace_test.go # AT-010
tests/e2e/provisioning_test.go # AT-020, AT-023, AT-030 – AT-032
tests/e2e/billing_test.go # AT-040 – AT-042
tests/e2e/payments_test.go # AT-050 – AT-053
tests/e2e/storage_test.go # AT-060, AT-061
tests/e2e/ratelimit_test.go # AT-070 – AT-072
tests/e2e/audit_test.go # AT-080 – AT-083
All E2E tests use //go:build e2e build tag.
Done when:
- [ ] All AT-xxx cases from Testing_Standards.md pass against full stack
Dependency graph summary¶
Pre-Phase Node Agent (step-ca + node-agent scaffold + schema + OpenAPI)
│
└──────────────────────────────────────── Phase 7 (provisioning worker) [BLOCKS]
Phase 0 (done)
└── Phase 1 (test harness + cmd/api)
    ├── Phase 2 (auth) ─┐
    ├── Phase 3 (inventory) ├── Phase 6 (provisioning orchestrator)
    └── Phase 4 (billing) ─┘ └── Phase 7 (provisioning worker) ← also requires Pre-Phase Node Agent
        └── Phase 5 (payments)
            └── Phase 8 (billing worker)
                └── Phase 12 (notification)
Phase 6 ──────────────────────────────────────────────── Phase 9 (terminal)
Phase 2 ──────────────────────────────────────────────── Phase 10 (storage)
Phases 2–6 ───────────────────────────────────────────── Phase 11 (admin)
Phase 1 ──────────────────────────────────────────────── Phase 13 (integration harness)
All phases ───────────────────────────────────────────── Phase 14 (E2E)
Phases 2, 3, 4 have no inter-dependencies and can run in parallel after Phase 1. Phases 9, 10, 11, 12 can run in parallel after their prerequisites are met.
UX + API Vertical Slice Strategy¶
To reduce integration risk, deliver backend and frontend together per slice after pre-phases:
- Slice A — Auth + Profile
  - API: auth/session + GET /users/me
  - UX: login redirect/exchange/logout, protected layout, session expiry handling
- Slice B — Marketplace + Allocations Read
  - API: GET /skus, GET /nodes, GET /allocations, GET /allocations/{id}
  - UX: capacity cards, allocation list/detail, async state rendering
- Slice C — Provision/Release + Terminal
  - API: create/release allocation, terminal token endpoint, terminal WS path
  - UX: request/provisioning/releasing lifecycle states, terminal connect/reconnect flow
- Slice D — Billing + Payments
  - API: balance/usage/csv, checkout session, portal session
  - UX: balance cards, usage table/export, payment redirect outcomes
- Slice E — Admin Ops
  - API: admin users/nodes/allocations/audit endpoints
  - UX: admin tables, filters, force-release and refund workflows, audit export
- Slice F — Storage
  - API: list/upload/download/mkdir/rename/delete
  - UX: file explorer interactions with path-safety errors and confirmations
Rule:
- If a UX flow needs multiple services, ship partial sections with explicit loading/degraded states rather than blocking full-screen delivery.