
External architectural & product review

Decided

Source: doc/governance/External_Architectural_Review_2026-04.md · Reviewer: external technical reviewer with 15+ years on similar platforms · No internal context primed

A single end-to-end external review of the GPUaaS codebase and design. Five parallel layer audits cross-checked against doc/architecture/, doc/governance/, and doc/operations/. Comparison set: CoreWeave, Lambda, RunPod, internal hyperscaler GPU control planes, large-bank tenant-shared-runtime platforms.

This is the most consolidated outside view of the platform and was useful precisely because it was deliberately uncharitable.

TL;DR table (verbatim from review, 2026-04-24)

| Layer | Grade | One-line |
|---|---|---|
| Toolchain & Governance | B | Governance is A-tier; enforcement is B-tier. 10 custom invariant guards in CI; lint, queue-git-check, breaking-change tools, and security scans are missing or report-only. |
| Contract & Control plane | B− | Strong contracts; cross-cutting middleware enforcement is patchy. |
| Compute (Provisioning / Node-agent / MAAS / PKI) | B− | Sophisticated design; scattered state-machine enforcement. |
| App Platform | B+ | Most original work in the repo. Design A-tier, implementation 60–70%. |
| Billing / Ops / SRE | B− | Solid bones; missing email, DLQ recovery, numbered migrations. |
| GPU Slice Scheduling | B | VM-based whole-GPU slicing is real and well-engineered; MIG/vGPU/MPS not built; multi-tenant network isolation thin. |
| Console / Terminal | B+ | Most evolutionary maturity in the repo. Atomic single-use tokens, mTLS bridge, recent legacy-path removal. Missing session recording + mint rate-limit. |
| Overall | B / B+ | Above-average for stage. Top decile on design, mid-pack on enforcement. |

Grade rollup

```mermaid
flowchart LR
    classDef ap fill:#d1e7dd,stroke:#0a3622
    classDef bm fill:#fff3cd,stroke:#332701
    classDef bp fill:#cfe2ff,stroke:#0e2240

    L1["Toolchain & Governance: B"]:::bm
    L2["Contract & Control plane: B−"]:::bm
    L3["Compute: B−"]:::bm
    L4["App Platform: B+ (design A-tier, impl 60–70%)"]:::ap
    L5["Billing / Ops / SRE: B−"]:::bm
    L6["GPU Slice Scheduling: B"]:::bm
    L7["Console / Terminal: B+"]:::bp
```

One-paragraph synthesis (verbatim)

This team has built ~70% of a serious GPU provisioning + tenant-operated app platform that, if completed, would compete in the on-prem-cloud-replacement and white-label-AI-platform segments. The architectural choices are above average for the stage. The biggest risks are (a) cross-cutting middleware enforcement gaps that will cause correctness incidents in production, and (b) a marketing position that under-sells the most original work (the app platform) and could be confused for a managed AI cloud (which it isn't and shouldn't be). 6–9 months of focused execution on the action list in §5 — particularly idempotency wiring, optimistic locking on allocations, DLQ recovery, app-runtime metering producer, and the tenant-shared runtime API — would move this from B-grade implementation to A-grade. The design work is largely done.

Where the review pushed

```mermaid
flowchart TB
    classDef strong fill:#d1e7dd,stroke:#0a3622
    classDef weak   fill:#fff3cd,stroke:#332701
    classDef risk   fill:#f8d7da,stroke:#42101e

    subgraph S[Confirmed strengths]
      direction TB
      S1[Governance discipline]:::strong
      S2[App platform originality]:::strong
      S3[Terminal evolution]:::strong
      S4[Contract-first authority]:::strong
      S5[Slice VM isolation]:::strong
    end

    subgraph W[Confirmed weaknesses]
      direction TB
      W1[Cross-cutting middleware<br/>enforcement gaps]:::weak
      W2[Idempotency wiring incomplete]:::weak
      W3[Optimistic locking on allocations]:::weak
      W4[DLQ recovery gaps]:::weak
      W5[App-runtime metering producer<br/>missing]:::weak
      W6[Tenant-shared runtime API direction]:::weak
      W7[Network isolation thin for multi-tenant]:::weak
      W8[No MIG/vGPU/MPS]:::weak
    end

    subgraph R[Strategic risks]
      direction TB
      R1["Marketing under-sells<br/>the app platform"]:::risk
      R2["Could be confused for<br/>a managed AI cloud<br/>which it isn't"]:::risk
    end
```

What the review confirmed about strengths

  • Governance is A-tier. 10 custom invariant guards in CI; contract-first authority is real.
  • App platform is the most original work in the repo. Design is at A-tier; implementation is at 60–70%.
  • Console / Terminal has the most evolutionary maturity — atomic single-use tokens, mTLS bridge, legacy-path removal.
  • Slice VM whole-GPU passthrough is "real and well-engineered" — the per-slot dedicated NVMe + IB VF + VFIO combination is unusual and defensible.
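The "atomic single-use tokens" pattern the review credits can be sketched as a single conditional UPDATE, so the used-check and the claim happen in one statement rather than a read-then-write pair. This is an illustrative sketch against a throwaway SQLite table; the table and column names (`terminal_tokens`, `used`) are assumptions, not the repo's actual schema.

```python
import sqlite3

def claim_token(db: sqlite3.Connection, token: str) -> bool:
    """Atomically mark a terminal token as used.

    The WHERE clause makes check-and-set a single statement, so two
    concurrent claimers cannot both succeed: exactly one UPDATE
    matches the unused row, and the loser sees rowcount == 0.
    """
    cur = db.execute(
        "UPDATE terminal_tokens SET used = 1 WHERE token = ? AND used = 0",
        (token,),
    )
    db.commit()
    return cur.rowcount == 1  # True only for the first claimer

# Demo with an in-memory table (hypothetical schema).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE terminal_tokens (token TEXT PRIMARY KEY, used INTEGER DEFAULT 0)"
)
db.execute("INSERT INTO terminal_tokens (token) VALUES ('tok-123')")

print(claim_token(db, "tok-123"))  # True: first use succeeds
print(claim_token(db, "tok-123"))  # False: replay is rejected
```

The same shape works on any backing store with an atomic compare-and-set; the point is that the token is consumed and validated in one operation.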

Action list highlights

The review's §5 named priorities to move B → A:

  1. Idempotency wiring across mutation paths.
  2. Optimistic locking on allocations — close the small race window where two writers could conflict.
  3. DLQ recovery flows — explicit operator path for poison messages.
  4. App-runtime metering producer — the missing piece between app-runtime and the ledger.
  5. Tenant-shared runtime API direction — codify or kill.
  6. Email channel for notifications (currently WebSocket-only on the implemented side).
  7. Numbered migrations with explicit forward + rollback tested per migration.
  8. Session recording on terminal (compliance ask).
  9. Mint rate-limit on terminal tokens to slow brute-force enumeration attempts.
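Item 2 (optimistic locking on allocations) is the most mechanical of these to close. A minimal sketch of the pattern, assuming a `version` column on the allocations table: each writer reads the current version, and the state transition only commits if no other writer bumped it in the meantime. Table and column names here are illustrative, not the repo's schema.

```python
import sqlite3

def transition_allocation(db, alloc_id, expected_version, new_state):
    """Compare-and-swap an allocation row via a version column.

    The UPDATE only matches if no other writer bumped the version
    since we read it. rowcount == 0 means we lost the race: the
    caller should re-read and retry, or surface a conflict error.
    """
    cur = db.execute(
        "UPDATE allocations SET state = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_state, alloc_id, expected_version),
    )
    db.commit()
    return cur.rowcount == 1

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE allocations (id TEXT PRIMARY KEY, state TEXT, version INTEGER)")
db.execute("INSERT INTO allocations VALUES ('alloc-1', 'pending', 0)")

# Two writers both read version 0; only the first transition wins.
print(transition_allocation(db, "alloc-1", 0, "active"))    # True
print(transition_allocation(db, "alloc-1", 0, "released"))  # False: stale version
```

This closes the two-writer race window the review flags without pessimistic row locks; the losing writer gets an explicit signal instead of silently overwriting.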

What came out as requirements

Each action item above maps to a specific follow-up doc or ticket.

What the review explicitly did not say

  • It did not endorse adding MIG/vGPU/MPS — only observed that they don't exist.
  • It did not call for a multi-region active/active rewrite — explicit non-goal per PRD.
  • It did not call for migration to Kubernetes-everywhere — confirms the deliberate "platform below K8s" framing.

How to use the review

```mermaid
flowchart LR
    A[New PR or design] --> B{Touches a B− area?}
    B -- yes --> C[Reference the review action item<br/>that the PR closes or improves]
    B -- no --> D[Normal flow]
    C --> E[Track progress against<br/>action list in §5]
```

The review is not a runtime check — it's a periodic external calibration. Re-running it (or commissioning a second external review) is the validation that the action list is closing real risks rather than busy-work.

Where to look next