External architectural & product review
Decided
doc/governance/External_Architectural_Review_2026-04.md · Reviewer: external technical reviewer with 15+ years on similar platforms · No internal context primed
A single end-to-end external review of the GPUaaS codebase and design. Five parallel layer audits cross-checked against doc/architecture/, doc/governance/, and doc/operations/. Comparison set: CoreWeave, Lambda, RunPod, internal hyperscaler GPU control planes, large-bank tenant-shared-runtime platforms.
This is the most consolidated outside view of the platform and was useful precisely because it was deliberately uncharitable.
TL;DR table (verbatim from review, 2026-04-24)
| Layer | Grade | One-line |
|---|---|---|
| Toolchain & Governance | B | Governance is A-tier; enforcement is B-tier. 10 custom invariant guards in CI; lint, queue-git-check, breaking-change tools, and security scans are missing or report-only. |
| Contract & Control plane | B− | Strong contracts; cross-cutting middleware enforcement is patchy. |
| Compute (Provisioning / Node-agent / MAAS / PKI) | B− | Sophisticated design; scattered state-machine enforcement. |
| App Platform | B+ | Most original work in the repo. Design A-tier, implementation 60–70%. |
| Billing / Ops / SRE | B− | Solid bones; missing email, DLQ recovery, numbered migrations. |
| GPU Slice Scheduling | B | VM-based whole-GPU slicing is real and well-engineered; MIG/vGPU/MPS not built; multi-tenant network isolation thin. |
| Console / Terminal | B+ | Most evolutionary maturity in the repo. Atomic single-use tokens, mTLS bridge, recent legacy-path removal. Missing session recording + mint rate-limit. |
| Overall | B / B+ | Above-average for stage. Top decile on design, mid-pack on enforcement. |
Grade rollup
```mermaid
flowchart LR
classDef ap fill:#d1e7dd,stroke:#0a3622
classDef bm fill:#fff3cd,stroke:#332701
classDef bp fill:#cfe2ff,stroke:#0e2240
L1["Toolchain & Governance: B"]:::bm
L2["Contract & Control plane: B−"]:::bm
L3["Compute: B−"]:::bm
L4["App Platform: B+ (design A-tier, impl 60–70%)"]:::ap
L5["Billing / Ops / SRE: B−"]:::bm
L6["GPU Slice Scheduling: B"]:::bm
L7["Console / Terminal: B+"]:::bp
```
One-paragraph synthesis (verbatim)
This team has built ~70% of a serious GPU provisioning + tenant-operated app platform that, if completed, would compete in the on-prem-cloud-replacement and white-label-AI-platform segments. The architectural choices are above average for the stage. The biggest risks are (a) cross-cutting middleware enforcement gaps that will cause correctness incidents in production, and (b) a marketing position that under-sells the most original work (the app platform) and could be confused for a managed AI cloud (which it isn't and shouldn't be). 6–9 months of focused execution on the action list in §5 — particularly idempotency wiring, optimistic locking on allocations, DLQ recovery, app-runtime metering producer, and the tenant-shared runtime API — would move this from B-grade implementation to A-grade. The design work is largely done.
Where the review pushed
```mermaid
flowchart TB
classDef strong fill:#d1e7dd,stroke:#0a3622
classDef weak fill:#fff3cd,stroke:#332701
classDef risk fill:#f8d7da,stroke:#42101e
subgraph S[Confirmed strengths]
direction TB
S1[Governance discipline]:::strong
S2[App platform originality]:::strong
S3[Terminal evolution]:::strong
S4[Contract-first authority]:::strong
S5[Slice VM isolation]:::strong
end
subgraph W[Confirmed weaknesses]
direction TB
W1[Cross-cutting middleware<br/>enforcement gaps]:::weak
W2[Idempotency wiring incomplete]:::weak
W3[Optimistic locking on allocations]:::weak
W4[DLQ recovery gaps]:::weak
W5[App-runtime metering producer<br/>missing]:::weak
W6[Tenant-shared runtime API direction]:::weak
W7[Network isolation thin for multi-tenant]:::weak
W8[No MIG/vGPU/MPS]:::weak
end
subgraph R[Strategic risks]
direction TB
R1["Marketing under-sells<br/>the app platform"]:::risk
R2["Could be confused for<br/>a managed AI cloud<br/>which it isn't"]:::risk
end
```
What the review confirmed about strengths
- Governance is A-tier. 10 custom invariant guards in CI; contract-first authority is real.
- App platform is the most original work in the repo. Design is at A-tier; implementation is at 60–70%.
- Console / Terminal has the most evolutionary maturity — atomic single-use tokens, mTLS bridge, legacy-path removal.
- Slice VM whole-GPU passthrough is "real and well-engineered" — the per-slot dedicated NVMe + IB VF + VFIO combination is unusual and defensible.
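To make "atomic single-use tokens" concrete, the following is a minimal sketch of the general technique, not the platform's implementation. The class name `SingleUseTokenStore`, the 30-second TTL, and the in-memory dictionary are all assumptions made for illustration. The point it shows is that lookup and removal happen in one critical section, so a token cannot be redeemed twice even under concurrent requests.

```python
import secrets
import threading
import time
from typing import Dict, Optional


class SingleUseTokenStore:
    """Hypothetical in-memory token store; a real service would use shared storage."""

    def __init__(self, ttl_seconds: float = 30.0) -> None:
        self._ttl = ttl_seconds  # assumed TTL, not taken from the review
        self._lock = threading.Lock()
        self._expiry_by_token: Dict[str, float] = {}

    def mint(self, now: Optional[float] = None) -> str:
        """Create an unguessable token and record when it expires."""
        token = secrets.token_urlsafe(32)
        issued_at = time.monotonic() if now is None else now
        with self._lock:
            self._expiry_by_token[token] = issued_at + self._ttl
        return token

    def consume(self, token: str, now: Optional[float] = None) -> bool:
        """Return True at most once per token, and only before it expires."""
        current = time.monotonic() if now is None else now
        with self._lock:
            # pop() looks up and deletes in the same critical section, so two
            # concurrent callers cannot both see the token as present.
            expiry = self._expiry_by_token.pop(token, None)
        return expiry is not None and current <= expiry
```

A second `consume` call with the same token returns `False`, as does a first call made after the TTL has passed. The `now` parameter exists only to make the sketch testable without sleeping.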
Action list highlights
The review's §5 named priorities to move B → A:
- Idempotency wiring across mutation paths.
- Optimistic locking on allocations — close the small race window where two writers could conflict.
- DLQ recovery flows — explicit operator path for poison messages.
- App-runtime metering producer — the missing piece between app-runtime and the ledger.
- Tenant-shared runtime API direction — codify or kill.
- Email channel for notifications (the implemented notification path is currently WebSocket-only).
- Numbered migrations with explicit forward + rollback tested per migration.
- Session recording on terminal (compliance ask).
- Mint rate-limit on terminal tokens to slow brute-force enumeration attempts.
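The optimistic-locking item in the list above can be illustrated with a version-checked update. This is a sketch under stated assumptions: `Allocation`, `AllocationStore`, `StaleVersionError`, and the `owner` field are hypothetical names, and the in-memory dictionary stands in for whatever store the platform actually uses. Each write must present the version it read; if another writer got there first, the write is rejected instead of silently overwriting.

```python
import threading
from dataclasses import dataclass
from typing import Dict


class StaleVersionError(Exception):
    """Raised when the caller's expected version no longer matches the row."""


@dataclass(frozen=True)
class Allocation:
    allocation_id: str
    owner: str      # hypothetical field, used only to have something to update
    version: int    # incremented on every successful write


class AllocationStore:
    """Hypothetical in-memory store demonstrating compare-and-swap on a version."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._rows: Dict[str, Allocation] = {}

    def create(self, allocation_id: str, owner: str) -> Allocation:
        with self._lock:
            row = Allocation(allocation_id, owner, version=1)
            self._rows[allocation_id] = row
            return row

    def get(self, allocation_id: str) -> Allocation:
        with self._lock:
            return self._rows[allocation_id]

    def update_owner(self, allocation_id: str, new_owner: str,
                     expected_version: int) -> Allocation:
        with self._lock:
            current = self._rows[allocation_id]
            # The check and the write happen together; a writer holding an
            # outdated version fails here rather than clobbering newer data.
            if current.version != expected_version:
                raise StaleVersionError(
                    f"expected v{expected_version}, found v{current.version}")
            updated = Allocation(allocation_id, new_owner, current.version + 1)
            self._rows[allocation_id] = updated
            return updated
```

In a SQL-backed store the same check is commonly written as a conditional `UPDATE` whose `WHERE` clause includes the expected version, with the caller treating "zero rows affected" as the stale-version case and re-reading before retrying.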
What came out as requirements
The action list maps to specific subsequent docs / tickets:
- Idempotency + optimistic locking → tracked in Fallback_Tech_Debt_Register.md.
- DLQ recovery → in Queue_Backlog_Runbook.md.
- App-runtime metering → App_Runtime_Metering_v1.md + App_Runtime_Billing_Model_v1.md.
- Tenant-shared runtime API direction → App_Tenant_Shared_Runtime_API_Direction_v1.md.
- Schema migration plan → Schema_Migration_Plan.md.
- Terminal mint rate-limit → policy key rate_limit.terminal_token_requests_per_minute (default 10) is implemented.
- Network isolation thinness → reserved in slice networking model as Phase 2+ project-network work.
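For the terminal mint rate-limit, a per-minute limit driven by a policy key could look roughly like the sketch below. Only the key name `rate_limit.terminal_token_requests_per_minute` and its default of 10 come from this document. The sliding-window algorithm, the `MintRateLimiter` class, the plain-dict policy source, and the choice to key the limit on a `principal` string are assumptions, since the scope of the limit (per user, per allocation, per IP) is not stated here.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Mapping, Optional

POLICY_KEY = "rate_limit.terminal_token_requests_per_minute"
DEFAULT_LIMIT = 10  # default value as stated in this document


class MintRateLimiter:
    """Hypothetical sliding-window limiter; single-threaded for brevity."""

    def __init__(self, policy: Optional[Mapping[str, int]] = None,
                 window_seconds: float = 60.0) -> None:
        policy = policy or {}
        self._limit = int(policy.get(POLICY_KEY, DEFAULT_LIMIT))
        self._window = window_seconds
        # Timestamps of accepted mint requests, oldest first, per principal.
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, principal: str, now: float) -> bool:
        """Return True if this principal may mint another token at time `now`."""
        hits = self._hits[principal]
        # Drop requests that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False
        hits.append(now)
        return True
```

With the default policy, ten requests from one principal inside a minute are accepted and the eleventh is refused until the oldest request ages out. Rejected requests are not recorded, so a blocked caller does not extend its own lockout.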
What the review explicitly did not say
- It did not endorse adding MIG/vGPU/MPS — only observed that they don't exist.
- It did not call for a multi-region active/active rewrite — explicit non-goal per PRD.
- It did not call for migration to Kubernetes-everywhere — confirms the deliberate "platform below K8s" framing.
How to use the review
```mermaid
flowchart LR
A[New PR or design] --> B{Touches a B− area?}
B -- yes --> C[Reference the review action item<br/>that the PR closes or improves]
B -- no --> D[Normal flow]
C --> E[Track progress against<br/>action list in §5]
```
The review is not a runtime check; it is a periodic external calibration. Re-running it (or commissioning a second external review) is how to validate that the action list is closing real risks rather than generating busy-work.
Where to look next
- External clouds & products comparison — what the comparison set looked like
- Gap analyses — the internal gap analyses the review confirmed
- Position vs other clouds — public posture
- Source: External_Architectural_Review_2026-04.md, Fallback_Tech_Debt_Register.md