External Architectural & Product Review — GPUaaS¶
Date: 2026-04-24
Reviewer voice: External technical reviewer (15+ years on similar platforms), no internal context primed.
Scope: Full-repo architectural and product evaluation. Code first, docs second, hyped explanations last.
Method: Five parallel layer audits (toolchain, control plane, compute, app platform, ops/billing) cross-checked against doc/architecture/, doc/governance/, and doc/operations/. No charity — gaps named.
Comparison set: CoreWeave, Lambda, RunPod, internal hyperscaler GPU control planes, large-bank tenant-shared-runtime platforms.
TL;DR¶
| Layer | Grade | One-line |
|---|---|---|
| Toolchain & Governance | B | Governance is A-tier; enforcement is B-tier. 10 custom invariant guards in CI; lint, queue-git-check, breaking-change tools, and security scans are missing or report-only. (Revised after CI gate-coverage audit, see §3.1.) |
| Contract & Control plane | B− | Strong contracts; cross-cutting middleware enforcement is patchy. |
| Compute (Provisioning / Node-agent / MAAS / PKI) | B− | Sophisticated design; scattered state-machine enforcement. |
| App Platform | B+ | Most original work in the repo. Design A-tier, implementation 60–70%. |
| Billing / Ops / SRE | B− | Solid bones; missing email, DLQ recovery, numbered migrations. |
| GPU Slice Scheduling (sub-component of Compute) | B | VM-based whole-GPU slicing is real and well-engineered; MIG/vGPU/MPS not built; multi-tenant network isolation thin. |
| Console / Terminal | B+ | Most evolutionary maturity in the repo. Atomic single-use tokens, mTLS bridge, recent legacy-path removal. Missing session recording + mint rate-limit. |
| Overall | B / B+ | Above-average for stage. Top decile on design, mid-pack on enforcement. |
One-paragraph synthesis. This team has built ~70% of a serious GPU provisioning + tenant-operated app platform that, if completed, would compete in the on-prem-cloud-replacement and white-label-AI-platform segments. The architectural choices are above average for the stage. The biggest risks are (a) cross-cutting middleware enforcement gaps that will cause correctness incidents in production, and (b) a marketing position that under-sells the most original work (the app platform) and could be confused for a managed AI cloud (which it isn't and shouldn't be). 6–9 months of focused execution on the action list in §5 — particularly idempotency wiring, optimistic locking on allocations, DLQ recovery, app-runtime metering producer, and the tenant-shared runtime API — would move this from B-grade implementation to A-grade. The design work is largely done.
1. As-built layer map¶
This is what the code says you've built, not what the docs aspire to.
┌─────────────────────────────────────┐
│ Web Console (Next.js) │
│ CLI (gpuaas-cli, ops + user) │
USER / OPERATOR ───►│ Python SDK (packages/python-sdk) │
│ Browser terminal (xterm.js + WS) │
└─────────────┬───────────────────────┘
│ REST + WS (contract-first)
│ Auth: Keycloak OIDC + JWT
┌─────────────▼───────────────────────────────────┐
│ EDGE / GATEWAY │
│ Auth → Denylist → SA-allowlist → RateLimit → │
│ CorrelationID → OTel → Routes (~150 endpoints) │
│ cmd/api (BFF, monolithic 919KB routes.go) │
│ cmd/terminal-gateway (WS for /ws/terminal/{id}) │
└─────────────┬───────────────────────────────────┘
│
┌──────────────────────────────┼──────────────────────────────┐
│ │ │
┌──────▼──────┐ ┌────────────────────▼────────────┐ ┌──────────────▼─────────┐
│ IDENTITY │ │ CONTROL DOMAIN │ │ APP PLATFORM │
│ Keycloak │ │ - Allocation orchestrator │ │ - appruntime svc │
│ + RBAC │ │ - Inventory / SKUs / Marketpl. │ │ - OCI manifest framework│
│ + Policy │ │ - Storage (S3-style) │ │ - Shared runtime model │
│ override │ │ - Admin ops │ │ - UI extension registry │
│ engine │ │ - Audit log (immutable) │ │ - Trust ⊥ promotion │
└─────────────┘ └────────────────┬────────────────┘ └────────────┬────────────┘
│ │
│ Outbox (txn'al) │
▼ ▼
┌─────────────────────────────────────────────┐
│ EVENT BUS — NATS JetStream │
│ Streams: BILLING, PAYMENTS, PROVISIONING, │
│ APPS, DLQ │
│ Outbox-relay (claim/publish/retry) │
└─────────────┬───────────────────────────────┘
│
┌──────────────────────────────┼──────────────────────────────┐
│ │ │
┌──────▼─────────┐ ┌─────────────────▼───────────┐ ┌───────────────▼───────────┐
│ BILLING WORKER │ │ PROVISIONING WORKER │ │ NOTIFICATION RELAY │
│ Periodic │ │ Temporal workflows + acts │ │ NATS → Redis Pub/Sub → WS │
│ accrual, │ │ Async completion via task │ │ (no email yet) │
│ low-balance, │ │ token; DLQ on max retries │ │ │
│ auto-release │ └─────────────┬───────────────┘ │ WEBHOOK WORKER │
│ │ │ │ Stripe HMAC, raw-body │
│ LEDGER append- │ │ │ first, /metrics endpoint │
│ only (no UPD) │ │ └───────────────────────────┘
└────────────────┘ │
│ mTLS + signed tasks (Ed25519)
│ pull-based, Postgres-first
┌─────────▼───────────────────────────────┐
│ NODE LAYER │
│ cmd/node-agent: typed task catalog │
│ provision_user, terminal.open, │
│ slice.vm_provision, oci_launch, │
│ node.drain, node.self_update │
│ │
│ Reference adapters: │
│ cmd/slurm-reference-controller │
│ cmd/rke2-self-managed-controller │
│ │
│ MAAS API client (deploy/release/power) │
│ step-ca enrollment via API proxy │
└──────────────────────────────────────────┘
Cross-cutting: Correlation-ID + OTel tracecontext + sanitized logs threaded
through every layer (mostly enforced — see §3.2 gaps).
The shape that matters: three planes (identity, control/orchestration, app platform) sharing one event bus and one ledger. That's a hyperscaler-shaped architecture, not a startup-shaped one. The discipline cost is high; the long-term maintainability dividend is real if the patterns get enforced.
The most original part is the seam between the control plane (allocations, billing, identity) and the app platform (OCI manifests, shared runtimes, UI extensions). Most platforms collapse these. Keeping them separate but co-located in one repo is the reason JupyterLab/vLLM/Slurm/RKE2 are all "stock apps" instead of bespoke integrations.
2. Position statement¶
GPUaaS is a production-shaped GPU provisioning + billing control plane with a partially-built tenant-operated app platform layered on top. It is not a managed AI cloud, not a Kubernetes-as-a-service, and not a pure IaaS. The closest analog isn't a single competitor — it's "what you'd build if you wanted to run multiple workload-shaped products (Slurm batch, K8s clusters, OCI launchers, model servers) on the same allocation/billing/identity substrate."
The footer of the existing capability matrix already captures this. The body of the matrix doesn't quite — the app-platform layer is invisible because it's split implicitly across runtime cells.
3. Layer-by-layer review¶
3.1 Toolchain & Governance — Grade: B¶
Revised after a second-pass CI gate-coverage audit. Original review materially under-credited the invariant guards and over-credited a few gates that turned out to be local-only or best-effort. Corrections inline below.
Genuinely strong (some of this is rarer than I first realized):
- 10 custom invariant guards wired into CI via scripts/ci/backend_build_and_tests.sh. These enforce codebase-specific architectural rules that no off-the-shelf linter can:
- audit_mandatory_guard.sh + audit_presence_guard.sh — every privileged mutation must write an audit_logs row.
- outbox_tx_guard.sh — domain change + outbox row in the same Postgres transaction; no direct NATS publish from handlers.
- policy_literal_guard.sh — flags hardcoded policy thresholds; forces policy.Client use.
- canonical_error_guard.sh — error responses must use the catalog codes.
- log_correlation_guard.sh + log_code_guard.sh + trace_header_guard.sh — correlation_id, error code, and trace context are present and propagated.
- ci_script_smoke.sh — gates the gates themselves.
- This is the "custom linters for the codebase's invariants" pattern most teams never build. It's the single most original part of the toolchain layer.
- Codegen drift IS enforced. scripts/ci/sdk_codegen_smoke.sh runs with CODEGEN_ENFORCE_CLEAN=1 and does git diff --exit-code -- packages/shared/gen packages/web/src/lib/gen after make codegen. (My original review claimed this was missing — wrong.)
- Platform-control release branch guard (platform_control_release_branch_guard.sh) is wired and enforcing — exits 1 on divergence from master unless explicit override.
- Migration validation (migration_validation.sh) is a CI gate.
- Observability instrumentation gate (observability_trace_gate.sh) — pattern-matches that workers call SetupOTel(), fails CI on regression. Real, but with a caveat (see weak list).
- Local dev parity is excellent (full Postgres + Redis + NATS + Temporal + Keycloak + observability overlay via docker-compose).
- Agent_Work_Queue.yaml + multi-agent operating model is genuinely novel — but see the gate-vs-tool note below.
Weak (the real gaps, with the right diagnoses this time):
- golangci-lint does not run in CI at all. It's a manual make lint Makefile target. No .golangci.yml config exists. CI's only Go static analysis is go vet (called inside backend_build_and_tests.sh). The invariant guards above partially compensate — they catch architectural rule violations — but they don't catch errcheck, staticcheck, unused, gosec, etc.
- No frontend ESLint/Prettier config in CI.
- Pre-commit hooks are absent. No husky/lefthook/pre-commit. Discipline is documented; nothing enforces it before push.
- All security scans run report-only. security_sast_summary.sh:65 literally has "report_only": True; security_govulncheck_report.sh:35 same. Gitleaks runs but its findings don't fail CI. The .gitlab-ci.yml itself acknowledges (line 189–191) that "signal calibration" is in progress. SAST findings print and exit 0.
- contracts_breaking_change.sh is best-effort. Script requires optional tooling (openapi-diff, asyncapi CLI). If those aren't installed in the runner, the script logs "best-effort baseline" and exits 0. Only fails when REQUIRE_BREAKING_DIFF_TOOLS=1 is set, which isn't the default. (My original review credited this as a real gate — that was overstated.)
- make queue-git-check is NOT a CI gate. The script agent_queue_git_consistency.sh exists but is invoked only from gitlab_local_dry_run.sh, a local utility. CI doesn't call it. So the multi-agent queue's "every done task has a commit reachable from origin/master" property is enforced by developer discipline + local make targets, not automated CI.
- observability_trace_gate.sh is skipped on release profiles. backend_build_and_tests.sh (which invokes the gate) is conditionally skipped for slice-dev, runtime-fast, and web-fast release modes (.gitlab-ci.yml:244-247). Fast-path releases bypass the OTel gate. Probably intentional (these ship pre-validated artifacts) but worth confirming.
- Frontend e2e is path-conditional. frontend_e2e.sh runs only when packages/web/**, docs, or CI scripts change (.gitlab-ci.yml:281-291). Backend-only changes that affect the contract surface bypass it.
- 47+ scripts in scripts/ci/ are not invoked by .gitlab-ci.yml. Some are deploy helpers and local utilities (intentional). Some are likely inventory drift. There's no labeling convention to tell the two apart at a glance.
- No Dependabot/Renovate. No SBOM generation. govulncheck runs report-only.
CI gate coverage matrix — what actually enforces:
| Gate | Enforcing? | Trigger | Notes |
|---|---|---|---|
| contracts_validate.sh (incl. reviewguard, outbox payload, ux gates) | ✓ Yes | every push/PR | calls 3 sub-gates |
| contracts_breaking_change.sh | ⚠ Best-effort | every push/PR | needs openapi-diff/asyncapi installed |
| sdk_codegen_smoke.sh (codegen drift) | ✓ Yes | every push/PR (excl. fast-release profiles) | CODEGEN_ENFORCE_CLEAN=1 |
| backend_build_and_tests.sh (incl. 10 invariant guards) | ✓ Yes | every push/PR (excl. fast-release profiles) | the heart of the gate |
| frontend_build_and_tests.sh | ✓ Yes | every push/PR | |
| frontend_e2e.sh | ⚠ Conditional | only when packages/web/** changes | backend-only changes skip |
| migration_validation.sh | ✓ Yes | every push/PR | |
| platform_control_release_branch_guard.sh | ✓ Yes | release-branch pushes | divergence = fail |
| runtime_uid_parity_guard.sh | ⚠ Conditional | only on infra/k8s changes | |
| security_sast_* (gosec, semgrep, gitleaks) | ✗ Report-only | every push/PR | findings don't fail CI |
| security_govulncheck_report.sh | ✗ Report-only | every push/PR | |
| security_image_scan_report.sh | ✗ Report-only | release mode only | |
| security_dast_report.sh | ✗ Report-only | release mode only | |
| golangci-lint | ✗ Not in CI | — | manual Makefile only |
| queue-git-check | ✗ Not in CI | — | local dry-run only |
| observability_trace_gate.sh | ⚠ Profile-skipped | every push/PR (excl. fast-release profiles) | indirectly via backend gate |
Outside read (revised): The CI layer is better than I gave credit for in one critical dimension (custom invariant guards) and weaker than I implied in two (linting absent, security/breaking-change/queue gates not actually enforcing). The invariant-guard suite is the part most worth holding onto and expanding — a make queue-git-check-style gate could be moved into CI alongside the others; a golangci-lint job could be added in a day. The security report-only posture is honest (the comment in .gitlab-ci.yml calls it "signal calibration in progress"), but it should have a clear graduation date. Net: governance is A-tier; enforcement is B-tier (up from C-tier in my original read), with three concrete fixes that would make it A-tier in weeks, not months.
3.2 Contract-first & Control plane — Grade: B−¶
Strong:
- OpenAPI spec at 11,524 lines, 30+ resource tags, ~150 endpoints. Real contract, not a sketch. Bearer auth, X-Idempotency-Key, X-Project-ID consistently declared.
- AsyncAPI at 2,085 lines covering core flows.
- Error catalog enforced at the type level (packages/shared/errors); 50+ canonical codes; correlation_id required on every error response.
- Policy override engine (packages/shared/policy) with global → org → project precedence + 60s cache + NATS invalidation (precedence resolution sketched after this list).
- Outbox is real: FOR UPDATE SKIP LOCKED claim, exponential backoff, OTel context propagation.
- JWT JWKS caching (5 min refresh) so auth doesn't fall over when Keycloak hiccups.
- Rate limiting is Redis-backed and policy-driven (default/financial/admin/terminal lanes).
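To make the precedence model concrete (the resolution order and the 60-second cache the policy client encodes), here is a minimal sketch. The names (Resolver, Resolve, the scope-key layout) are illustrative, not the actual packages/shared/policy API.

```go
// Hypothetical sketch of global → org → project policy resolution with a short-lived
// cache; the real packages/shared/policy client's types and names likely differ.
package policysketch

import (
	"sync"
	"time"
)

type cached struct {
	val string
	ok  bool
	exp time.Time
}

// Resolver checks the most specific scope first: project, then org, then global.
type Resolver struct {
	mu     sync.RWMutex
	cache  map[string]cached
	ttl    time.Duration
	lookup func(scope, key string) (string, bool) // backing store (DB / config)
}

func NewResolver(lookup func(scope, key string) (string, bool)) *Resolver {
	return &Resolver{cache: map[string]cached{}, ttl: 60 * time.Second, lookup: lookup}
}

// Resolve returns the effective value for key, honoring project > org > global precedence.
func (r *Resolver) Resolve(orgID, projectID, key string) (string, bool) {
	for _, scope := range []string{"project:" + projectID, "org:" + orgID, "global"} {
		if v, ok := r.get(scope, key); ok {
			return v, true
		}
	}
	return "", false
}

func (r *Resolver) get(scope, key string) (string, bool) {
	ck := scope + "/" + key
	r.mu.RLock()
	c, hit := r.cache[ck]
	r.mu.RUnlock()
	if hit && time.Now().Before(c.exp) {
		return c.val, c.ok
	}
	val, ok := r.lookup(scope, key)
	r.mu.Lock()
	r.cache[ck] = cached{val: val, ok: ok, exp: time.Now().Add(r.ttl)}
	r.mu.Unlock()
	return val, ok
}

// Invalidate is what the NATS invalidation subscriber would call on a cache-bust message.
func (r *Resolver) Invalidate(scope, key string) {
	r.mu.Lock()
	delete(r.cache, scope+"/"+key)
	r.mu.Unlock()
}
```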
Weak (these are the high-priority finds):
- Idempotency middleware exists but is not wired into routes. packages/shared/middleware/idempotency.go is a 274-line, fully-implemented DB-backed replay cache. Zero handlers wrap with middleware.Idempotency(...). Six financial handlers manually check r.Header.Get("Idempotency-Key") and return 400 if missing. Most mutating endpoints (createAllocation, createProject, createAppInstance, releaseAllocation) don't enforce idempotency at all. This contradicts AGENTS.md rule #2 directly. Single biggest correctness gap.
- Audit logging is scattered. Found in maas/service.go and auth/projects.go but no systematic wrapper. Compliance posture depends on every author remembering to insert.
- Sanitize is per-callsite, not middleware. The function exists; no global span/log wrapper enforces it.
- No generated SDK (only types). Internal services hand-code HTTP calls.
- AsyncAPI is incomplete. Notification + app-instance event payloads are underspecified.
- Tenant header (X-Project-ID) is validated in handlers, not at middleware. Path-scoped endpoints are safe; header-scoped endpoints depend on individual handlers calling resolveProjectPathScope() correctly.
Outside read: The contract surface is one of the most rigorous I've seen at this stage. The cross-cutting middleware story is one of the least rigorous parts of the same codebase, which is the disconnect to fix first.
3.3 Compute (provisioning, node-agent, MAAS, PKI) — Grade: B−¶
Architecturally sound:
- Pull-based node agent with mTLS + Ed25519 task signing + replay protection (task-id seen-set + expiry) is the right shape. No SSH from control plane. (Verification path sketched after this list.)
- Two-key model: mTLS certs (24h, X5C renewal) for transport, Ed25519 task-signing key separate. Compromise of one doesn't unlock the other.
- Enrollment via API proxy to step-ca: node-agent only ever talks to api.internal. Single egress destination is good for firewall rules and audit.
- Async Temporal completion via task token (no activity polling). When the API receives a node task result, it calls temporalClient.CompleteActivity(token, result) directly. Avoids polling, scales.
- Compensation matrix is documented (Compensation_Matrix.md) with failure→compensation pairs.
- MAAS client (packages/services/maas/maas_api_client.go) is a real HTTP client with deploy/release/power-control, not a shim.
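For readers unfamiliar with the pattern, here is a minimal sketch of the verify-before-execute path the node-agent's design implies: an Ed25519 signature check plus a seen-set with expiry for replay protection. The envelope fields and type names are assumptions for illustration; the actual task catalog types differ.

```go
// Minimal sketch of signed-task verification with replay protection.
// Field names and the envelope shape are illustrative, not the node-agent's actual types.
package tasksketch

import (
	"crypto/ed25519"
	"encoding/json"
	"errors"
	"sync"
	"time"
)

type SignedTask struct {
	ID        string          `json:"id"`
	Type      string          `json:"type"`
	Params    json.RawMessage `json:"params"`
	ExpiresAt time.Time       `json:"expires_at"`
	Signature []byte          `json:"signature"` // over the canonical payload bytes
}

type Verifier struct {
	pub  ed25519.PublicKey
	mu   sync.Mutex
	seen map[string]time.Time // task ID -> expiry, pruned lazily
}

func NewVerifier(pub ed25519.PublicKey) *Verifier {
	return &Verifier{pub: pub, seen: map[string]time.Time{}}
}

// Verify rejects expired tasks, bad signatures, and task IDs it has already seen.
func (v *Verifier) Verify(t SignedTask, payload []byte) error {
	if time.Now().After(t.ExpiresAt) {
		return errors.New("task expired")
	}
	if !ed25519.Verify(v.pub, payload, t.Signature) {
		return errors.New("bad task signature")
	}
	v.mu.Lock()
	defer v.mu.Unlock()
	for id, exp := range v.seen { // prune expired entries so the seen-set stays bounded
		if time.Now().After(exp) {
			delete(v.seen, id)
		}
	}
	if _, dup := v.seen[t.ID]; dup {
		return errors.New("replayed task id")
	}
	v.seen[t.ID] = t.ExpiresAt
	return nil
}
```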
Boundary leaks / fragility:
1. State machine enforcement is scattered. requested → provisioning → active → releasing → released is documented; no central enforcer object validates transitions. Both orchestrator and worker mutate the allocations table directly. No optimistic lock or version column on allocations — concurrent admin force-release + user release can race.
2. Isolation model is documented but unbranched. Policy key allocation.isolation_model exists with values user-revoke vs full-reimage. Code path enqueues allocation.revoke_user unconditionally. The full-reimage branch (which would call MAAS to redeploy) doesn't exist. The flag is currently a no-op for one of two values.
3. MAAS↔allocation handoff is implicit. No documented sequencing: when does MAAS machine state transition relative to allocation state? If deploy succeeds but the next activity fails, you can drift.
4. DLQ has no auto-recovery. Failed outbox messages sit at status='failed'. Recovery is a manual SQL replay. No tooling, no metrics on DLQ depth.
5. Force-release loop is unbounded. If admin retries force-release while the underlying issue persists, no max retry. Temporal retry policy is initialized but its bounds aren't visible in grep.
6. App-instance ↔ allocation coupling is ambiguous. app_instance_members exists; no FK from allocations is visible. If app instances scale, the relationship to allocations isn't enforced at the schema level.
Outside read: Sophisticated, not naive — the design ideas (pull-based + signed + dual-key + Temporal-token-completion) are above average. The execution leaves enough state-mutation seams open that a well-timed concurrent admin action could corrupt allocation state.
3.4 App Platform & Workload — Grade: B+ (highest in the repo)¶
This is the layer with the most original thinking. Design is A-tier, implementation is currently 60–70%.
The genuinely novel work:
- Operating-mode × control-plane-scope as orthogonal dimensions. App_Runtime_Operating_Modes_v1.md and App_Tenant_Shared_Attachment_Model_v1.md separate "who owns the instance" (always project-scoped) from "what scope does the runtime control plane operate at" (project | tenant | platform). A project-owned app instance can attach to a tenant-shared Slurm controller without overloading the project_id field. Most platforms conflate these. This separation is the reason the platform can host both managed shared services and tenant-isolated launchers.
- Trust ⊥ promotion. App_Artifact_Trust_and_Promotion_v1.md treats artifact lifecycle (published → promoted → deprecated → retired) and trust state (unverified → verified → failed_verification → revoked) as independent state machines. Promotion is always explicit and auditable; trust can fail without forcing a lifecycle state change. Mutable tags are forbidden; digest-only deployment is mandatory. Rare in v1.
- OCI manifest framework. The gpuaas.launchable_oci_workload profile schema → renderLaunchableOCINodeTaskParams() → typed node tasks. JupyterLab, vLLM, Ollama are slugs against this generic profile. Adding a new stock app is "publish a manifest", not "write an adapter". This is the reusable platform primitive the capability matrix under-credits.
- Platform proxy for app-owned UI extensions. packages/web/src/lib/proxy-launch.ts — a BroadcastChannel-coordinated proxy lets app-owned UI bundles call platform APIs (with scoped tokens) and vice versa. There's a kind: "platform" vs kind: "app" split. Most IaaS consoles either lock down extensions entirely or expose them with no auth scoping.
- Recovery logic exists for stalled OCI launches. Probe every minute, fail at 3-min stall, re-enqueue within the 15-min recovery window. Discoverable only by reading service_integration_test.go:733-959 — production-shaped behavior buried in tests.
Where design exceeds implementation:
- Manifest registration as a public flow is not done. Current onboarding is admin-assisted seed SQL. App_Manifest_Registration_Guide_v1.md:48-56 admits this.
- Tenant-shared runtime APIs are reserved in OpenAPI but the routes are not implemented. The schema (shared_app_runtimes, shared_app_runtime_attachments) is built; the REST surface is only half there.
- UI extension is internal-only. No external manifest model, no SDK package for third-party UI bundles. Apps extend the shell only by submitting code to the platform repo.
- App-runtime billing/metering: schema fields (app_instance_id, control_plane_component, operating_mode, correlation_id) exist in usage records and ledger — but the producer that emits app-runtime usage rows is missing. Billing service can query app-attributed cost; nothing populates it yet.
- External worker contract is documented but the scoped-machine-identity it depends on is "blocked on the follow-on identity contract."
Where implementation exceeds documentation (positive surprises):
- Shared-runtime lifecycle handler is more complete than the API docs suggest — full transactional state machine with FOR UPDATE locking, cascade detach, correlation-IDed events.
- Access credential lifecycle (legacy resource_type fields and new access_credential_bindings rows handled in parallel) is buried in releaseAppInstanceAccessCredentialsTx.
- 2,370-line integration test suite covering 13+ scenarios.
Outside-reviewer concerns:
- 8 of the 20 App_*.md docs contain explicit "not yet implemented / future / directional" admissions. A reader has to do code archaeology to find the gap between design and reality.
- No multi-tenant blast-radius test: nothing proves that tenant A's failed app instance doesn't degrade tenant B.
- Platform proxy is browser-only — no documented path for app server-side code to call platform APIs.
- Reference apps (JupyterLab/vLLM/Slurm/Ollama) are seeded in SQL. There's no example of a tenant-published app, so the "anyone can publish OCI/Compose apps" claim is unproven empirically.
The trust-vs-promotion decoupling is the kind of thing you'd expect from a 5-year-old platform team that has been bitten by mutable tags and "promote = trust" conflation. Putting that in v1 is unusual. Operating mode × control-plane scope is the right primitive for hosting Slurm, K8s, vLLM, Jupyter under one roof. Most teams collapse it and have to add it back painfully two years later.
3.5 Billing, Payments, Observability, SRE — Grade: B−¶
Production-grade:
- Immutable ledger is enforced — balance is SUM(amount_minor) from ledger_entries, no UPDATE path exists in code.
- Stripe webhook signature verification is correct: raw-body-first, HMAC-SHA256, 5-min timestamp tolerance. Unit-tested. (Scheme sketched after this list.)
- NATS InitStreams() creates BILLING/PAYMENTS/PROVISIONING/APPS/DLQ correctly per NATS_Stream_Config.md. Producer/consumer pairing checked end-to-end.
- Periodic accrual loop is policy-driven (billing.window_seconds, default 60s, clamped 10–3600s).
- Low-balance + depletion are real: low-balance crossing detection with dedup via last_low_balance_notified_at; depletion auto-release emits force-release events for all active allocations.
- OTel + correlation_id threaded through async paths. Spans annotated with messaging.system, event.type, correlation_id, org/project/user/allocation IDs. Per-message instrumentation in billing-worker and notification-relay.
- 32 detailed runbooks with correlation-first diagnostics — Billing_Worker_Failure_Runbook.md is professional.
- Operator CLI (gpuaas-cli ops fleet-health, node-metrics, runbooks) with --output json. Agent-readable output is implemented.
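The webhook verification property called out above (raw-body-first, HMAC-SHA256, timestamp tolerance) is easy to get subtly wrong. A minimal sketch of the scheme, assuming the standard Stripe-Signature header format; the repo may well delegate to stripe-go's webhook helper instead:

```go
// Illustrative Stripe-style webhook verification: read the raw body first, compute
// HMAC-SHA256 over "<timestamp>.<body>", constant-time compare against the v1
// signature, and reject stale timestamps. Not the repo's actual implementation.
package webhooksketch

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"strconv"
	"strings"
	"time"
)

func VerifyStripeSignature(rawBody []byte, sigHeader, secret string, tolerance time.Duration) error {
	var ts string
	var candidates []string
	for _, part := range strings.Split(sigHeader, ",") {
		kv := strings.SplitN(strings.TrimSpace(part), "=", 2)
		if len(kv) != 2 {
			continue
		}
		switch kv[0] {
		case "t":
			ts = kv[1]
		case "v1":
			candidates = append(candidates, kv[1])
		}
	}
	tsInt, err := strconv.ParseInt(ts, 10, 64)
	if err != nil {
		return errors.New("missing or malformed timestamp")
	}
	if d := time.Since(time.Unix(tsInt, 0)); d > tolerance || d < -tolerance {
		return errors.New("timestamp outside tolerance") // the review cites a 5-minute tolerance
	}
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(ts + "."))
	mac.Write(rawBody)
	expected := hex.EncodeToString(mac.Sum(nil))
	for _, c := range candidates {
		if hmac.Equal([]byte(expected), []byte(c)) {
			return nil
		}
	}
	return errors.New("no matching v1 signature")
}
```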
Looks-right-but-fragile:
- DLQ has no auto-publish. When retry_count >= maxRetries, message goes status='failed' in outbox_events and stays there. No DLQ NATS stream re-publishing, no monitoring tool, no alert. Recovery requires manual SQL replay.
- Email notifications missing. Notification-relay only does NATS → Redis → WebSocket. Low-balance alerts never reach a user who isn't actively logged in.
- No refund operation in payments service. Stripe reconciliation drift can't be auto-resolved.
- Migration story is single-file. db_schema_v1.sql (72 tables) — no numbered migrations, no rollback path. Fine for active dev; will be a problem at first production schema change.
- Prometheus metrics only on webhook-worker. Billing-worker, provisioning-worker, API rely on OTel + logs. Fine if your observability stack is Tempo+Loki, less fine if you also need Prometheus alerts.
- No /readyz — only /healthz. Orchestrators can't distinguish "warming up" from "serving."
- Coverage targets in Testing_Standards.md (errors 100%, middleware 90%, billing 85%, provisioning 80%) are not CI-enforced. They will silently degrade.
Test maturity (honest grade C+): 116 Go test files, 17 integration tests, 17 Playwright E2E specs. Unit tests are table-driven and use httptest — good patterns, shallow critical-path coverage. The billing service has tests for usageOrderBy() and time-format helpers, but no integration test for the accrual transaction or ledger balance computation. Outbox tests use fake stores; no real Postgres integration test. Volume looks right; depth on the highest-stakes flows is light.
3.6 GPU Slice Scheduling — Grade: B (sub-component of §3.3 Compute)¶
Subsystem summary: VM-based whole-GPU slicing — not fractional. Each "slice" is a Linux VM (libvirt + KVM, virt-install) with VFIO-passthrough for whole GPUs, typically H200. Cloud-init bootstraps the OS, installs NVIDIA drivers, registers a guest telemetry agent. Slicing is the way you carve up a node into 1/2/4-GPU sub-allocations. MIG, vGPU, MPS, and time-sliced products are intentionally out of v1.
End-to-end flow (verified by tracing through code):
1. User requests SKU h200-sxm-slice (scripts/seed.sql:18, capacity_shape: gpu_slice).
2. Orchestrator calls reserveCapacityForAllocation() (packages/services/provisioning/orchestrator/service.go:463), picks a gpu_slice node, locks 1–4 node_resource_slots rows; allocation row created with placement_status='reserved'.
3. Node-agent receives slice.vm_provision task. runSliceVMProvision() (cmd/node-agent/slice_vm.go:478) progresses through documented phases: lease acquire → host dependency check → vfio bind → image download → cloud-init seed → DHCP reservation → image-write to NVMe → virt-install → readiness probe.
4. virt-install builds a VM with --cpu=host-passthrough, NVMe raw block device, OVS bridge interface, GPU + InfiniBand --host-device, NUMA affinity.
5. Guest telemetry token allocated; cloud-init drops a probe script + systemd timer; guest pushes metrics to /internal/v1/guest-telemetry on node-agent every 30s. Control plane reads via node-agent (cmd/node-agent/telemetry.go, commit 740df955).
6. Allocation transitions to active.
Genuinely well-engineered:
- NVMe raw passthrough as VM disk — no qcow2, no copy-on-write overhead, near-baremetal I/O.
- Pre-flight VFIO bind verification — driver_override + sysfs bind path executed before virt-install (slice_vm.go:1012-1060). libvirt would silently fail otherwise on devices already bound to nvidia/nouveau. (Sequence sketched after this list.)
- Slot-leasing decoupled from reservation — /var/lib/gpuaas/node-scheduler/leases/{slot-id}.json written during provisioning, not at reserve time. Survives node crashes; reconciled on next discovery.
- NUMA-local fabric+GPU pairing — selectSingleNUMAGroup() (orchestrator/service.go:880) keeps multi-GPU slice slots on one NUMA node and pairs them with co-located fabric devices. Performance-critical and not obvious until you've debugged a poorly-placed RDMA workload.
- Push-model guest telemetry via node-agent — guests push to node-agent's internal endpoint, control plane reads from node-agent. Avoids direct guest-to-control-plane network paths and simplifies firewall posture.
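To show what the pre-flight VFIO check is doing for readers who haven't fought this before, here is a simplified sketch of the sysfs driver_override + bind sequence. Paths are the standard Linux sysfs layout; the actual slice_vm.go handles IOMMU-group peers, retries, and error classification that this omits.

```go
// Simplified sketch of the driver_override + bind dance; error handling and the
// unbind-from-current-driver step are reduced to the minimum.
package vfiosketch

import (
	"fmt"
	"os"
	"path/filepath"
)

// BindToVFIO points a PCI device (e.g. "0000:3b:00.0") at vfio-pci and binds it.
func BindToVFIO(pciAddr string) error {
	dev := filepath.Join("/sys/bus/pci/devices", pciAddr)

	// If the device is currently claimed (nvidia, nouveau, ...), release it first.
	if cur, err := os.Readlink(filepath.Join(dev, "driver")); err == nil {
		unbind := filepath.Join(dev, "driver", "unbind")
		if err := os.WriteFile(unbind, []byte(pciAddr), 0200); err != nil {
			return fmt.Errorf("unbind from %s: %w", filepath.Base(cur), err)
		}
	}

	// driver_override makes the next probe attach vfio-pci regardless of ID tables.
	if err := os.WriteFile(filepath.Join(dev, "driver_override"), []byte("vfio-pci"), 0200); err != nil {
		return fmt.Errorf("set driver_override: %w", err)
	}
	if err := os.WriteFile("/sys/bus/pci/drivers/vfio-pci/bind", []byte(pciAddr), 0200); err != nil {
		return fmt.Errorf("bind to vfio-pci: %w", err)
	}

	// Verify: the device's driver symlink should now resolve to vfio-pci.
	target, err := os.Readlink(filepath.Join(dev, "driver"))
	if err != nil || filepath.Base(target) != "vfio-pci" {
		return fmt.Errorf("device %s not bound to vfio-pci after bind", pciAddr)
	}
	return nil
}
```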
Specifically not built (clarifying the matrix's "Partial" label):
- MIG / vGPU / MPS / time-slicing — explicitly out per openapi_types.gen.go ("design-only"). DB schema has sharing_model and profile_name columns reserved but unused. The "Partial" label means VM-slicing works, fractional sharing doesn't. Worth surfacing explicitly — "we slice nodes by whole-GPU VMs" is more precise than "GPU slicing — partial".
- Live migration between nodes — none. Node down = slices lost.
- Persistent volumes — slices are stateless; NVMe is wiped on release.
- Slice resizing — immutable once active.
- Slice snapshot / clone — not supported.
- Out-of-band console — virt-install runs with --graphics=none and serial console isn't plumbed. SSH-only; if guest networking breaks, slice is unrecoverable without node-side intervention.
- Image versioning / per-allocation image selection — one base image per node.
- Fabric/network isolation between slices on the same node — all slices share OVS bridge + IP subnet; no VLAN, no netfilter rules. Hypervisor isolates, network does not. Worth flagging for hostile-multi-tenant scenarios.
Production-fragility worth tracking:
- Cloud-init apt-get install nvidia-driver-… is best-effort (slice_vm.go:1389). Repo down or no internet → driver fails to install but VM still transitions to active (readiness probe checks SSH responds, not GPU visibility). A user gets an allocated slice with no working drivers and gets billed for it.
- VFIO unbind on release is brittle. If unbind fails, GPU stays bound to vfio-pci; the next slice allocation on that slot fails the bind check. No retry logic.
- Graceful shutdown has no escalation — if VM ignores virsh shutdown past timeout, code falls back to virsh destroy, but a hung guest can stall the release workflow without alerting.
- In-memory guest telemetry — node restart loses history. No persistent journal.
- Slot reservation isn't atomic with leasing — small race window where DB says "reserved" but node hasn't acquired the lease yet. Concurrent provisions can fail at the lease step.
Outside read: This is the most hardware-aware part of the codebase and shows the deepest VFIO/libvirt/NUMA expertise. The end-to-end flow is real and the engineering quality is above the line. The gaps are mostly absent features (live migration, persistent volumes) or edge-case fragility (driver-install best-effort, no slice-to-slice network isolation), not core correctness bugs. Ship-ready for a single-tenant or trusted-tenant deployment; not yet for hostile multi-tenant.
Files cited: cmd/node-agent/slice_vm.go (2,526 LOC), cmd/node-agent/slice_topology.go (707 LOC), cmd/node-agent/telemetry.go, packages/services/provisioning/orchestrator/service.go:463-730, db_schema_v1.sql:827-923, scripts/seed.sql:16-22, doc/architecture/Allocation_Capacity_Shapes_and_GPU_Slices_v1.md.
3.7 Console / Terminal Architecture — Grade: B+ (highest in the repo, tied with App Platform)¶
Subsystem summary: Three-process data path — browser (xterm.js + WS) → cmd/terminal-gateway (separate process) → cmd/node-agent (PTY for baremetal allocations, SSH client for slice VMs). Auth: opaque 256-bit tokens minted by API, stored in Redis with 300s TTL, single-use via atomic GetDel. Token transport: Sec-WebSocket-Protocol header, never query string. Recent commits show the team migrated from a Redis pub/sub transport to an internal mTLS WS bridge and removed the legacy path (08fe9c4d) — disciplined cleanup, not bit-rot.
Browser xterm.js ── WS + Sec-WebSocket-Protocol: <token> ──► terminal-gateway
│
│ ConsumeToken (Redis GetDel atomic)
│ CreateStreamSessionBinding
│ Enqueue terminal.open node-task
│
wss:// internal (mTLS, peer CN check)
│
▼
cmd/node-agent
│
┌──────────────────────┼──────────────────────┐
▼ ▼
baremetal: /bin/bash with gpu_slice: ssh user@slice-vm
syscall.Credential UID/GID (key from cloud-init)
End-to-end (browser click → user types ls):
1. TerminalPanel.tsx calls mintTerminalToken() → POST /api/v1/allocations/{id}/terminal-token.
2. API validates ownership, calls terminal.CreateToken() (packages/services/terminal/service.go:154); 32-byte random hex token written to Redis terminal_token:{token} TTL 300s.
3. Browser opens WS, passes token as Sec-WebSocket-Protocol. Gateway extracts (cmd/terminal-gateway/routes.go:1080), validates format with regex ^[a-f0-9]{64}$.
4. Gateway calls ConsumeToken() — atomic Redis GetDel (service.go:202). Token deleted on first use; replay returns rejection + bumps replayRejected metric. (Sketched after this flow.)
5. CreateStreamSessionBindingForAllocation() claims terminal_stream_active:{allocation_id} in Redis with NX. Same-user reconnect can take over an orphaned session; different-user is rejected.
6. Gateway enqueues a signed terminal.open node_tasks row (HMAC-SHA256 with node signing key, 5-min TTL).
7. Node-agent picks up task, dials gateway internal endpoint wss://{gw}/internal/ws/terminal/{session_id} over mTLS. Gateway validates peer cert CN matches node-{nodeID}.
8. Bridge hub (cmd/terminal-gateway/bridge.go) matches browser session and node session by sessionID. Pre-allocates session struct so node and browser can arrive in any order.
9. Node-agent spawns PTY: /bin/bash -i -l for baremetal (uses syscall.Credential to set UID/GID from DB-resolved POSIX identity), or ssh user@slice-vm for gpu_slice allocations. Slice path reuses the SSH key embedded in the slice's cloud-init.
10. PTY ↔ WS bidirectional relay via three goroutines. Resize is a JSON text frame; data is binary. Browser sends ping every 30s to keep proxies happy.
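Step 4's single-use guarantee hinges on GetDel being a single atomic round trip, so validate-and-invalidate cannot race. A minimal sketch of that consume path, with the key naming taken from the description above and everything else assumed:

```go
// Illustrative single-use token consume; the real packages/services/terminal
// service stores a richer binding payload and emits metrics on replay.
package terminalsketch

import (
	"context"
	"errors"

	"github.com/redis/go-redis/v9"
)

type TokenStore struct {
	rdb *redis.Client
}

// ConsumeToken returns whatever was stored at mint time and deletes it atomically.
// A second call with the same token gets redis.Nil and is rejected as a replay.
func (s *TokenStore) ConsumeToken(ctx context.Context, token string) (string, error) {
	val, err := s.rdb.GetDel(ctx, "terminal_token:"+token).Result()
	if errors.Is(err, redis.Nil) {
		return "", errors.New("unknown or already-consumed token") // replay or expiry
	}
	if err != nil {
		return "", err
	}
	return val, nil
}
```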
Genuinely well-designed:
- Single-use token enforcement is atomic — GetDel, not Get + Del. No race. Replay attempts increment a counter that's metricable.
- Sec-WebSocket-Protocol for auth — RFC-compliant, no token in URLs/logs/referrer headers. Direct compliance with AGENTS.md rule #8.
- Separate gateway process — terminal traffic doesn't share fate with API. Gateway holds no persisted secrets; everything transient via Redis.
- Bridge hub with session pre-allocation — node-agent and browser can connect in any order; sessionID matches them. Per-session write mutexes prevent interleaved frames.
- Slice terminals are SSH-tunneled, not PTY-on-host — slice users get isolation via the VM boundary; node-agent doesn't run their shell. SSH key is provisioned via slice cloud-init, no separate key-distribution path needed.
- Safe reconnect — same user + same allocation can reclaim an orphaned session; different user is rejected. Logs the takeover.
- Recent stability work — b1ae9c5c ("Redesign terminal transport over internal websockets"), 41392a55 ("Harden terminal bridge session writes" with mutex additions), 08fe9c4d ("Remove legacy terminal transport path"). Active hardening and disciplined cleanup. This subsystem shows the most evolutionary maturity in the repo.
Gaps (real ones, ranked):
- No session recording / audit transcript. Today, terminal sessions write nothing durable beyond log lines. For an enterprise GPU platform, this is a compliance gap (SOC2, HIPAA, internal audit). Add async PTY-output recording → S3 with session metadata in DB. Tracked as action item #14.
- No backpressure on PTY → WS frames. PTY read buffer is 4 KB; if a user runs cat /dev/urandom, frames pile up in the WS write buffer. No explicit drop or rate cap. Gateway OOM risk under abuse.
- No rate limit on the token-mint endpoint. AGENTS.md has a rate_limit.terminal_token_requests_per_minute = 10 policy key, but the code path on POST /allocations/{id}/terminal-token doesn't appear to enforce it. Verify and wire. Tracked as action item #11.
- mTLS peer validation is CN-only, no SAN check. Gateway validates that the peer CN matches node-{uuid} (routes.go:957); it doesn't verify SAN/DNS. Lower-impact in practice (cert issuance is internal), but should be tightened.
- App-instance terminals don't exist. Terminal works for allocations only. JupyterLab / vLLM / Slurm controllers running as app instances have no consistent terminal-into-app path; today users SSH to a sidecar or use the app's own UI. The platform proxy could carry this; it doesn't.
- One-session-per-allocation is silent. Opening a second tab takes over the first without warning. Either document it or warn the user.
- Session TTL is gateway-enforced; node-agent has no independent timer. If gateway crashes mid-session, the PTY may run until allocation release. Node-agent should enforce its own TTL.
- No metrics dashboard. Active sessions, p99 connect latency, mint→connect time, replayRejected rate — all instrumented, none surfaced on a panel.
Security review (concise): Token replay impossible (atomic GetDel ✓). Cross-tenant token reuse prevented at consume + bind (✓). Token never appears in URL or response body (✓). PTY runs with the right UID/GID (✓). Main residual concerns: no session recording (compliance), no rate limit on mint (DDoS surface), CN-only mTLS validation (tightenable). A pentester would file the rate-limit gap as the most actionable finding.
Outside read: The architecture is one of the better takes on browser-terminal-into-allocation I've seen at this stage. You can read commit history and see the team move from Redis-pubsub to internal-WS, harden races, then delete the legacy path. The big production gaps — session recording (compliance) and abuse rate-limiting (DDoS) — are addressable in days, not quarters. B+ is genuine; closest thing to A in the repo.
Files cited: cmd/terminal-gateway/{main,routes,bridge}.go, packages/services/terminal/service.go, cmd/node-agent/{terminal_stream,catalog}.go, packages/web/src/components/terminal/TerminalPanel.tsx. ~3,780 LOC subsystem-wide.
4. Five things you've genuinely earned credit for¶
These are the things where a sharp outsider would say "they thought about this before they had to."
- Outbox + immutable ledger + correlation-ID threading. The boring thing done right at v1. Most teams skip this and pay for it 18 months later when reconciliation gets ugly.
- Pull-based node agent with two-key auth model + step-ca proxy. The "node-agent only ever talks to api.internal" rule is something most teams discover the hard way after a firewall incident.
- App platform's operating-mode × control-plane-scope orthogonality, plus trust ⊥ promotion. The kind of design that pays off in year 3, not year 1.
- Multi-agent worktree governance + queue-git-check. Whether or not this stays long-term, the discipline of "every done task has a commit reachable from origin/master" prevents a class of "but it works on my machine" problems.
- Contract-first that's actually enforced. 11K-line OpenAPI + 2K-line AsyncAPI + bidirectional codegen + breaking-change detector + observability gate. This level of contract investment is rare outside bigco platform teams.
5. Action items — five things to fix before claiming "production-ready"¶
Ranked by risk × ease-of-fix. Track progress via the checkboxes; this list is the basis for the next-week re-review.
[ ] #1 — Wire the idempotency middleware (HIGH risk, LOW effort)¶
The single biggest correctness gap.
- packages/shared/middleware/idempotency.go is built and unused.
- Wrap every mutating route (POST/PUT/PATCH) with middleware.Idempotency(pool, "<scope>").
- Remove the 6 hand-rolled handler checks (cmd/api/routes.go:9388-9389, 9479, 9611, 9660, 18095, 18138).
- Add an integration test confirming replay returns the cached response and a fresh request creates a new state change.
- This contradicts AGENTS.md rule #2 today.
Re-review check: grep for middleware.Idempotency( in cmd/api/routes.go — should appear on every mutating route. Hand-rolled r.Header.Get("Idempotency-Key") checks should be zero in handler bodies.
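A sketch of what the wiring could look like once this lands. The middleware.Idempotency signature and the stand-in body below are assumptions; take the real one from packages/shared/middleware/idempotency.go. The point is that the replay check lives in one wrapper, not in handler bodies.

```go
// Illustrative route wiring; middlewareIdempotency stands in for the real
// middleware.Idempotency from packages/shared and is only a placeholder here.
package apisketch

import (
	"net/http"

	"github.com/jackc/pgx/v5/pgxpool"
)

// idempotent wraps a mutating handler with the shared DB-backed replay cache.
func idempotent(pool *pgxpool.Pool, scope string, h http.HandlerFunc) http.Handler {
	return middlewareIdempotency(pool, scope)(h)
}

// middlewareIdempotency mimics the assumed shape: func(http.Handler) http.Handler.
func middlewareIdempotency(pool *pgxpool.Pool, scope string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			if r.Header.Get("X-Idempotency-Key") == "" { // header name per the OpenAPI spec
				http.Error(w, "missing X-Idempotency-Key", http.StatusBadRequest)
				return
			}
			// Real middleware: look up (scope, key) in Postgres, replay the cached
			// response if present, otherwise record the response after next runs.
			next.ServeHTTP(w, r)
		})
	}
}

func registerRoutes(mux *http.ServeMux, pool *pgxpool.Pool, createAllocation http.HandlerFunc) {
	mux.Handle("POST /api/v1/allocations", idempotent(pool, "allocations.create", createAllocation))
}
```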
[ ] #2 — Add optimistic locking on allocations + central state-machine enforcer (HIGH risk, MEDIUM effort)¶
- Add a version column to allocations (or use a WHERE status = $previous_status predicate in every UPDATE).
- Centralize transitions in an AllocationStateMachine enforcer object inside the orchestrator package; orchestrator and worker both go through it.
- Add a regression test: concurrent admin force-release + user release; only one should succeed, the other should return a conflict error.
Re-review check: there should be exactly one place state transitions are validated. Concurrent-mutation test should exist and pass.
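A minimal sketch of the compare-and-swap transition guard this item asks for, assuming a version column; the table and column names follow this review's description of the schema, not a verified migration:

```go
// Illustrative single-point transition guard for allocations.
package allocsketch

import (
	"context"
	"errors"
	"fmt"

	"github.com/jackc/pgx/v5/pgxpool"
)

// allowed encodes the documented lifecycle; extend as the real state machine requires.
var allowed = map[string][]string{
	"requested":    {"provisioning"},
	"provisioning": {"active", "released"},
	"active":       {"releasing"},
	"releasing":    {"released"},
}

var ErrConflict = errors.New("allocation modified concurrently or transition not allowed")

// Transition is the single place both orchestrator and worker should go through.
func Transition(ctx context.Context, pool *pgxpool.Pool, id, from, to string, version int64) error {
	ok := false
	for _, next := range allowed[from] {
		if next == to {
			ok = true
			break
		}
	}
	if !ok {
		return fmt.Errorf("illegal transition %s -> %s", from, to)
	}
	tag, err := pool.Exec(ctx, `
		UPDATE allocations
		   SET status = $1, version = version + 1, updated_at = now()
		 WHERE id = $2 AND status = $3 AND version = $4`,
		to, id, from, version)
	if err != nil {
		return err
	}
	if tag.RowsAffected() == 0 {
		return ErrConflict // someone else moved the row first; caller reloads and retries or fails
	}
	return nil
}
```

Callers that receive ErrConflict should reload the row and either retry or surface the canonical conflict error from the catalog.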
[ ] #3 — Implement DLQ recovery + email delivery (MEDIUM risk, MEDIUM effort)¶
- Auto-republish failed outbox messages to a DLQ NATS subject (don't just let them rot at status='failed').
- Add a Prometheus metric for DLQ depth + an alert rule.
- Add an email-worker (or wire SES/SendGrid into notification-relay) so low-balance and depletion notifications reach offline users. Today, an offline user whose balance depletes loses their allocations with no notification they can see when they next log in.
Re-review check: outbox_events WHERE status='failed' should have an automated recovery path. A user who is logged out should receive an email on billing.balance_depleted.
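A sketch of the recovery loop's shape, assuming the outbox_events columns named in this review and a DLQ.<event_type> subject convention; metric emission and batching policy are left out:

```go
// Illustrative DLQ recovery sweep: re-publish failed outbox rows to a DLQ subject
// and mark them so the sweep doesn't repeat them. Column and subject names are assumptions.
package dlqsketch

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/nats-io/nats.go"
)

type failedEvent struct {
	ID        string
	EventType string
	Payload   []byte
}

func RecoverFailed(ctx context.Context, pool *pgxpool.Pool, js nats.JetStreamContext) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			rows, err := pool.Query(ctx, `
				SELECT id, event_type, payload
				  FROM outbox_events
				 WHERE status = 'failed'
				 ORDER BY created_at
				 LIMIT 100`)
			if err != nil {
				log.Printf("dlq sweep query: %v", err)
				continue
			}
			var batch []failedEvent
			for rows.Next() {
				var e failedEvent
				if err := rows.Scan(&e.ID, &e.EventType, &e.Payload); err == nil {
					batch = append(batch, e)
				}
			}
			rows.Close()

			for _, e := range batch {
				if _, err := js.Publish("DLQ."+e.EventType, e.Payload); err != nil {
					log.Printf("dlq publish %s: %v", e.ID, err)
					continue // leave status='failed' so the next sweep retries it
				}
				if _, err := pool.Exec(ctx,
					`UPDATE outbox_events SET status = 'dead_lettered' WHERE id = $1`, e.ID); err != nil {
					log.Printf("dlq mark %s: %v", e.ID, err)
				}
			}
		}
	}
}
```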
[ ] #4 — Promote security and lint gates from report-only to enforcing (MEDIUM risk, LOW effort)¶
(Revised after gate-coverage audit. Codegen drift is already enforced via sdk_codegen_smoke.sh with CODEGEN_ENFORCE_CLEAN=1 — that part is done; striking it.)
- Add a .golangci.yml config with at minimum: errcheck, gosec, govet, staticcheck, unused. Wire a lint job into .gitlab-ci.yml that fails on findings. Today make lint is manual-only.
- Add a frontend ESLint config and a Prettier check; wire them into the frontend_build_and_tests.sh job.
- Promote gitleaks and govulncheck from report-only to failing (high severity only). Set a graduation date on the "signal calibration in progress" comment in .gitlab-ci.yml:189-191.
- Add a pre-commit hook (lefthook is lightest) that runs the new lint locally.
Re-review check: cat .golangci.yml should show real rules; CI log for a recent PR should show a lint job enforcing failure; security tool_status should map "high severity finding" → fail, not "report-only keeps CI green".
[ ] #5 — Honest implementation-status headers on App_*.md docs (LOW risk, LOW effort)¶
You said it yourself: docs may be out of sync with code. The 21 App_*.md docs are some of the most thoughtful design writing in the repo, but 8+ contain "not yet implemented" admissions buried mid-document.
- Add YAML frontmatter to each: status: design | partial | shipped | deprecated, plus a last_verified_against_code: 2026-MM-DD field.
- Run a quarterly verification pass.
Re-review check: every doc/architecture/App_*.md should carry a status header. Status should match what grep says about the code.
Bonus action items (lower priority but worth tracking)¶
[ ] #6 — Generate a Go client SDK from OpenAPI¶
Internal services hand-code HTTP calls to API. Generate a typed client package and migrate workers + CLI to use it. Eliminates a class of contract drift bugs.
[ ] #7 — Surface the App Platform as its own row on the capability matrix¶
Today the matrix splits app-platform work across runtime cells. Add a row:
| App Platform | OCI manifest framework — Implemented | Trust + promotion — Implemented | App billing attribution — Schema-ready, producer pending | Tenant-shared runtimes — Schema, API partial | External app SDK — Reserved |
[ ] #8 — Numbered migration system¶
Replace the single db_schema_v1.sql file with a numbered migration tool (Goose / Flyway / sqlx-migrate). Required for first production schema change.
[ ] #9 — Add /readyz¶
Distinct from /healthz. Orchestrators (k8s, MAAS) need to know "warming up" vs "ready to serve."
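A minimal sketch of the split, assuming pgx and NATS are the dependencies worth gating on (swap in whatever each service actually needs):

```go
// Illustrative liveness/readiness split: /healthz stays a cheap static 200,
// /readyz checks real dependencies before the service takes traffic.
package readysketch

import (
	"context"
	"net/http"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/nats-io/nats.go"
)

func Register(mux *http.ServeMux, pool *pgxpool.Pool, nc *nats.Conn) {
	// Liveness: the process is up. Orchestrators use this to restart, not to route.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: dependencies are reachable. Orchestrators use this to gate traffic.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := pool.Ping(ctx); err != nil {
			http.Error(w, "postgres not ready: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		if !nc.IsConnected() {
			http.Error(w, "nats not connected", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}
```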
[ ] #10 — Multi-tenant blast-radius test¶
Add an integration test where tenant A's app instance crashes / hangs / saturates a node, and assert tenant B's app instance is unaffected. Critical for the "we host shared runtimes" sales claim.
[ ] #11 — Wire terminal token-mint rate limit¶
AGENTS.md has rate_limit.terminal_token_requests_per_minute = 10 policy key. Verify whether POST /api/v1/allocations/{id}/terminal-token enforces it. If not, wire a sliding-window limiter keyed by user_id (and consider also IP).
Re-review check: hammer the token-mint endpoint from a single user; expect 429 after the policy threshold.
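A sketch of the smallest thing that would close this: a per-user fixed-window counter in Redis, with the threshold read from the rate_limit.terminal_token_requests_per_minute policy key. The key naming and the fixed-window choice are assumptions; a sliding window is nicer but not required to stop abuse.

```go
// Illustrative per-user fixed-window limiter for the token-mint path.
package ratelimitsketch

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// AllowTerminalTokenMint returns false once userID exceeds limit mints in the current minute.
func AllowTerminalTokenMint(ctx context.Context, rdb *redis.Client, userID string, limit int64) (bool, error) {
	key := fmt.Sprintf("rl:terminal_token:%s:%d", userID, time.Now().Unix()/60)
	n, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		return false, err
	}
	if n == 1 {
		// First hit in this window: set the expiry so stale windows clean themselves up.
		if err := rdb.Expire(ctx, key, 2*time.Minute).Err(); err != nil {
			return false, err
		}
	}
	return n <= limit, nil
}
```

The mint handler returns 429 when AllowTerminalTokenMint reports false, with the limit pulled from the policy client rather than hardcoded.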
[ ] #12 — Slice cloud-init driver-install verification¶
Today the readiness probe checks SSH responds, not whether NVIDIA driver/CUDA actually loaded. A user can be billed for a slice with broken drivers. Add a post-install verification step inside cloud-init that fails cleanly if nvidia-smi doesn't return device info, and a node-side check that runs over SSH before flipping allocation to active.
Re-review check: simulate apt mirror outage; request a slice; allocation should fail or roll back, not transition to active with broken drivers.
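A sketch of the node-side check, assuming the existing SSH path into the slice VM can be reused; the command and the device-count comparison are the whole idea:

```go
// Illustrative post-boot GPU verification over SSH; host and key plumbing are assumed,
// not taken from the node-agent's actual SSH setup.
package slicechecksketch

import (
	"bytes"
	"fmt"

	"golang.org/x/crypto/ssh"
)

// VerifyGuestGPUs returns an error unless nvidia-smi inside the guest reports
// exactly the number of GPUs this slice was provisioned with.
func VerifyGuestGPUs(client *ssh.Client, wantGPUs int) error {
	session, err := client.NewSession()
	if err != nil {
		return fmt.Errorf("ssh session: %w", err)
	}
	defer session.Close()

	out, err := session.Output("nvidia-smi --query-gpu=uuid --format=csv,noheader")
	if err != nil {
		return fmt.Errorf("nvidia-smi failed (driver missing or not loaded?): %w", err)
	}
	got := len(bytes.Fields(out))
	if got != wantGPUs {
		return fmt.Errorf("expected %d GPUs visible in guest, nvidia-smi reports %d", wantGPUs, got)
	}
	return nil
}
```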
[ ] #13 — Slice-to-slice network isolation¶
Slices on the same node share an OVS bridge with no policy. Add VLAN tags or netfilter rules so slices can't ARP-spoof or sniff each other. Required before claiming hostile-multi-tenant readiness.
Re-review check: from slice A, attempt to reach slice B's IP via ICMP/ARP — should be blocked.
[ ] #14 — Terminal session recording¶
Async PTY-output recording → S3 with session metadata in DB. Required for any compliance posture (SOC2, HIPAA, internal audit). Without this, "enterprise-ready" is not a defensible claim.
Re-review check: an active terminal session should produce a retrievable artifact post-session-close; admin should be able to query "show me all sessions for allocation X".
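A sketch of the recording shape (tee PTY output to a local spool while relaying, upload on close), with the Uploader interface standing in for whatever object-store client the platform settles on:

```go
// Illustrative session recording relay; the interactive path writes only to a local
// spool file, and the upload happens once, after the session closes.
package recsketch

import (
	"io"
	"os"
)

// Uploader is whatever object-store client the platform standardizes on (S3, MinIO, ...).
type Uploader interface {
	Upload(key string, r io.Reader) error
}

// RecordingRelay copies PTY output to the browser connection and to a spool file,
// then uploads the spool under the session ID when the PTY closes.
func RecordingRelay(sessionID string, pty io.Reader, browser io.Writer, up Uploader) error {
	spool, err := os.CreateTemp("", "term-"+sessionID+"-*.cast")
	if err != nil {
		return err
	}
	defer os.Remove(spool.Name())
	defer spool.Close()

	// Everything written to the browser is also written to the spool.
	if _, err := io.Copy(io.MultiWriter(browser, spool), pty); err != nil {
		return err
	}
	if _, err := spool.Seek(0, io.SeekStart); err != nil {
		return err
	}
	return up.Upload("terminal-sessions/"+sessionID+".cast", spool)
}
```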
[ ] #15 — Position the slicing story honestly (or build the missing fractional path)¶
Current capability matrix says "GPU slicing — Partial". The actual state: VM-based whole-GPU slicing works; MIG/vGPU/MPS/time-slicing are not built and the OpenAPI types comment explicitly says "design-only". Either:
- (a) build at least one fractional path (most likely MIG for H100/H200 customers who want < 1 GPU), or
- (b) update the matrix to read "GPU slicing — VM-based whole-GPU slicing implemented; fractional/MIG out-of-scope" so external readers don't expect what isn't there.
Re-review check: capability matrix should not say "GPU slicing — Partial" without a footnote that explains what the partial covers and excludes.
[ ] #16 — Wire queue-git-check into CI (or accept that it's a local-only convention)¶
scripts/ci/agent_queue_git_consistency.sh is real and runs from make queue-git-check, but it's only invoked by gitlab_local_dry_run.sh, not by .gitlab-ci.yml. The multi-agent queue's "every done task has a commit reachable from origin/master" property currently depends on developer discipline. Either wire it into CI (low effort) or document explicitly that it's a local convention so future reviewers don't assume it's an automated gate.
Re-review check: either .gitlab-ci.yml has a queue_consistency job, or Multi_Agent_Lane_Worktrees_v1.md clearly states "queue-git-check is a local discipline, not a CI gate."
[ ] #17 — Make breaking-change detection actually fail when tools are missing¶
scripts/ci/contracts_breaking_change.sh falls back to "best-effort baseline" and exits 0 when openapi-diff or asyncapi CLI is missing. Default-on REQUIRE_BREAKING_DIFF_TOOLS=1 and bake the tools into the CI image so the script can't silently skip. A breaking change to openapi.draft.yaml shouldn't be able to land just because the runner happened to lack the diff tool.
Re-review check: intentionally introduce a breaking change in a feature branch; CI should fail the contracts stage, not pass with "best-effort" warnings.
[ ] #18 — Decide on the fast-release-profile gate-skip (intentional or gap?)¶
backend_build_and_tests.sh is conditionally skipped for slice-dev, runtime-fast, and web-fast release modes (.gitlab-ci.yml:244-247), which means the observability_trace_gate and the 10 invariant guards all skip on those profiles. If the fast-paths exist because they ship pre-validated artifacts from a master pipeline that already ran the gates, document that explicitly. If it's a gap, plumb a "validated-against-master-SHA" check so fast-paths can't ship gates-untested artifacts.
Re-review check: Platform_Control_Release_Promotion_Policy.md should explicitly describe how fast-release profiles relate to the master-pipeline gate run that validated their artifacts.
[ ] #19 — Label or prune the 47 unused scripts/ci/*.sh¶
There are 47+ scripts in scripts/ci/ that .gitlab-ci.yml does not invoke. Some are intentional (deployment helpers, local dev utilities, common-libs sourced by other scripts); some are inventory drift (legacy/superseded). Without a labeling convention, a reviewer can't tell at a glance whether a script is a dead gate or a live tool.
Suggested fix: split into scripts/ci/gates/ (CI-invoked, one per file) vs scripts/ci/tools/ (deploy/local) vs scripts/ci/lib/ (common-sourced helpers). Or add a header comment # ROLE: gate | tool | lib enforced by a meta-gate.
Re-review check: every .sh under scripts/ci/ has a clear classification, and cross-referencing the gates directory against .gitlab-ci.yml turns up no gate that CI never invokes.
6. Product evaluation¶
Position is mostly right but under-marketed¶
The footer of the capability matrix says the right thing: "What this is: a GPU provisioning and billing control plane with workload launchers. What it's not yet: a full AI cloud (no managed inference, model registry, training pipelines, multi-region, or enterprise billing)."
The body of the matrix doesn't. The app platform is the most differentiated thing built and it's invisible — split across runtime cells. Recommendation: add an "App Platform" row (action item #7).
Target user the code implies (vs what marketing usually says)¶
Reading the code, the buyer profile is:
- Mid-market AI/ML infra teams running mixed workloads (Slurm batch + K8s + interactive Jupyter + vLLM serving) on the same fleet.
- Enterprise-friendly auth (Keycloak OIDC, RBAC, audit logs, immutable ledger, idempotency-aware contracts) — implies B2B/regulated buyers, not consumer.
- Operator-first (32 runbooks, ops CLI with JSON output, correlation-IDed everything) — implies you expect customers to operate this themselves, or have a "white-label internal cloud" angle.
Where the code disagrees with a "managed AI cloud" framing: tenant-operated app primitives are built, managed inference is not. Don't market the latter.
Strategic gaps¶
- Multi-region story doesn't exist. Region columns in the schema, no routing logic. Fine for on-prem / single-DC, problematic for cloud-replacement.
- No spot/preempt/scheduling sophistication. Flat allocation model. For training-pipeline customers, this is a gap.
- Tenant-shared runtime is the differentiator but the API isn't shipped. Until it is, "we host shared Slurm and K8s for you" is a roadmap claim.
- External app SDK is the platform play but it's blocked on the identity contract. Until external developers can ship apps without committing to your repo, the app platform is a private platform.
Board-level summary¶
"This team has built ~70% of a serious GPU provisioning + app platform that, if completed, would compete in the on-prem-cloud-replacement and white-label-AI-platform segments. The architectural choices are above-average for the stage. The biggest risks are (a) cross-cutting middleware enforcement gaps that will cause correctness incidents in production, and (b) a marketing position that under-sells the most original work (app platform) and could be confused for a managed AI cloud (which it isn't and shouldn't be). 6–9 months of focused execution on the action list above would move this from B-grade implementation to A-grade. The design work is largely done."
7. Next-week re-review checklist¶
When this doc is re-reviewed, the reviewer should:
- Grade movement — has overall grade moved from B/B+ to B+/A-?
- Action item checkboxes — count completed; for each, run the "Re-review check" specified in §5.
- Spot-check action item #1 specifically — grep -n "middleware.Idempotency(" cmd/api/routes.go should now have ≥20 matches; manual handler-level header checks should be zero.
- Spot-check action item #2 — open packages/services/provisioning/orchestrator/ and cmd/provisioning-worker/; both should call into a single state-machine enforcer; a concurrent-release test should exist.
- Re-run test counts — Go test files, integration tests, E2E specs. Should not have regressed; ideally grown along the highest-stakes flows (ledger accrual, idempotency replay, outbox DLQ).
- Re-run the App_*.md status header check — every doc should now have YAML frontmatter with a status field.
- New gaps surfaced by the work — what wasn't in this review that the work surfaced? Document them as the next round.
- Slice subsystem (§3.6) — driver-install verification path should fail closed; slice-to-slice network policy should exist; capability matrix wording on slicing should be precise.
- Terminal subsystem (§3.7) — token-mint rate limit should be enforced; session recording should produce retrievable artifacts; metrics dashboard for active sessions / mint→connect latency should exist.
- CI gate coverage (§3.1 revised) — re-run the scripts/ci/*.sh ↔ .gitlab-ci.yml cross-reference. Items to confirm: (a) a golangci-lint job exists and is enforcing; (b) gitleaks/govulncheck graduated from report-only; (c) contracts_breaking_change.sh no longer falls back to "best-effort" silently; (d) queue-git-check is either wired into CI or documented as local-only; (e) the fast-release-profile gate-skip is documented or removed.
Appendix: methodology¶
This review was produced by five parallel layer audits:
- Toolchain & CI/CD: Makefile, scripts/ci/, .gitlab-ci.yml, .github/workflows/, lint/security/codegen, governance docs.
- Contract & Control Plane: doc/api/openapi.draft.yaml, doc/api/asyncapi.draft.yaml, cmd/api/, packages/shared/{errors,middleware,policy,outbox}, audit + idempotency patterns.
- Compute Layer: cmd/provisioning-worker/, cmd/node-agent/, packages/services/{provisioning,maas}/, state machines, Temporal, PKI, isolation models.
- App Platform & Workload: 21 App_*.md docs, packages/services/appruntime/, cmd/terminal-gateway/, packages/web/, packages/python-sdk/, cmd/gpuaas-cli/.
- Ops, Billing, Observability, Tests: cmd/billing-worker/, cmd/webhook-worker/, cmd/notification-relay/, cmd/outbox-relay/, runbooks, NATS streams, OTel, test counts.
Each audit reported strengths, gaps between rule and reality, and a layer grade independently before synthesis.
— Reviewer: external-perspective evaluation, point-in-time. Re-verify against current code at each subsequent review.