
External architectural & product review

Decided

Source: doc/governance/External_Architectural_Review_2026-04.md · Reviewer: external technical reviewer with 15+ years on similar platforms · No internal context primed

A single end-to-end external review of the GPUaaS codebase and design. Five parallel layer audits cross-checked against doc/architecture/, doc/governance/, and doc/operations/. Comparison set: CoreWeave, Lambda, RunPod, internal hyperscaler GPU control planes, large-bank tenant-shared-runtime platforms.

This is the most consolidated outside view of the platform and was useful precisely because it was deliberately uncharitable.

TL;DR table (verbatim from review, 2026-04-24)

| Layer | Grade | One-line |
|---|---|---|
| Toolchain & Governance | B | Governance is A-tier; enforcement is B-tier. 10 custom invariant guards in CI; lint, queue-git-check, breaking-change tools, and security scans are missing or report-only. |
| Contract & Control plane | B− | Strong contracts; cross-cutting middleware enforcement is patchy. |
| Compute (Provisioning / Node-agent / MAAS / PKI) | B− | Sophisticated design; scattered state-machine enforcement. |
| App Platform | B+ | Most original work in the repo. Design A-tier, implementation 60–70%. |
| Billing / Ops / SRE | B− | Solid bones; missing email, DLQ recovery, numbered migrations. |
| GPU Slice Scheduling | B | VM-based whole-GPU slicing is real and well-engineered; MIG/vGPU/MPS not built; multi-tenant network isolation thin. |
| Console / Terminal | B+ | Most evolutionary maturity in the repo. Atomic single-use tokens, mTLS bridge, recent legacy-path removal. Missing session recording + mint rate-limit. |
| Overall | B / B+ | Above-average for stage. Top decile on design, mid-pack on enforcement. |

Grade rollup

```mermaid
flowchart LR
    classDef ap fill:#d1e7dd,stroke:#0a3622
    classDef bm fill:#fff3cd,stroke:#332701
    classDef bp fill:#cfe2ff,stroke:#0e2240

    L1["Toolchain & Governance: B"]:::bm
    L2["Contract & Control plane: B−"]:::bm
    L3["Compute: B−"]:::bm
    L4["App Platform: B+ (design A-tier, impl 60–70%)"]:::ap
    L5["Billing / Ops / SRE: B−"]:::bm
    L6["GPU Slice Scheduling: B"]:::bm
    L7["Console / Terminal: B+"]:::bp
```

One-paragraph synthesis (verbatim)

This team has built ~70% of a serious GPU provisioning + tenant-operated app platform that, if completed, would compete in the on-prem-cloud-replacement and white-label-AI-platform segments. The architectural choices are above average for the stage. The biggest risks are (a) cross-cutting middleware enforcement gaps that will cause correctness incidents in production, and (b) a marketing position that under-sells the most original work (the app platform) and could be confused for a managed AI cloud (which it isn't and shouldn't be). 6–9 months of focused execution on the action list in §5 — particularly idempotency wiring, optimistic locking on allocations, DLQ recovery, app-runtime metering producer, and the tenant-shared runtime API — would move this from B-grade implementation to A-grade. The design work is largely done.

Where the review pushed

```mermaid
flowchart TB
    classDef strong fill:#d1e7dd,stroke:#0a3622
    classDef weak   fill:#fff3cd,stroke:#332701
    classDef risk   fill:#f8d7da,stroke:#42101e

    subgraph S[Confirmed strengths]
      direction TB
      S1[Governance discipline]:::strong
      S2[App platform originality]:::strong
      S3[Terminal evolution]:::strong
      S4[Contract-first authority]:::strong
      S5[Slice VM isolation]:::strong
    end

    subgraph W[Confirmed weaknesses]
      direction TB
      W1[Cross-cutting middleware<br/>enforcement gaps]:::weak
      W2[Idempotency wiring incomplete]:::weak
      W3[Optimistic locking on allocations]:::weak
      W4[DLQ recovery gaps]:::weak
      W5[App-runtime metering producer<br/>missing]:::weak
      W6[Tenant-shared runtime API direction]:::weak
      W7[Network isolation thin for multi-tenant]:::weak
      W8[No MIG/vGPU/MPS]:::weak
    end

    subgraph R[Strategic risks]
      direction TB
      R1["Marketing under-sells<br/>the app platform"]:::risk
      R2["Could be confused for<br/>a managed AI cloud<br/>which it isn't"]:::risk
    end
```

What the review confirmed about strengths

  • Governance is A-tier. 10 custom invariant guards in CI; contract-first authority is real.
  • App platform is the most original work in the repo. Design is at A-tier; implementation is at 60–70%.
  • Console / Terminal has the most evolutionary maturity — atomic single-use tokens, mTLS bridge, legacy-path removal.
  • Slice VM whole-GPU passthrough is "real and well-engineered" — the per-slot dedicated NVMe + IB VF + VFIO combination is unusual and defensible.
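The "atomic single-use tokens" pattern the review credits can be sketched as a single conditional UPDATE, so the used-check and the claim happen in one statement rather than a read-then-write pair. This is an illustrative sketch against a throwaway SQLite table; the table and column names (`terminal_tokens`, `used`) are assumptions, not the repo's actual schema.

```python
import sqlite3

def claim_token(db: sqlite3.Connection, token: str) -> bool:
    """Atomically mark a terminal token as used.

    The WHERE clause makes check-and-set a single statement, so two
    concurrent claimers cannot both succeed: exactly one UPDATE
    matches the unused row, and the loser sees rowcount == 0.
    """
    cur = db.execute(
        "UPDATE terminal_tokens SET used = 1 WHERE token = ? AND used = 0",
        (token,),
    )
    db.commit()
    return cur.rowcount == 1  # True only for the first claimer

# Demo with an in-memory table (hypothetical schema).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE terminal_tokens (token TEXT PRIMARY KEY, used INTEGER DEFAULT 0)"
)
db.execute("INSERT INTO terminal_tokens (token) VALUES ('tok-123')")

print(claim_token(db, "tok-123"))  # True: first use succeeds
print(claim_token(db, "tok-123"))  # False: replay is rejected
```

The same shape works on any backing store with an atomic compare-and-set; the point is that the token is consumed and validated in one operation.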

Action list highlights

The review's §5 named priorities to move B → A:

  1. Idempotency wiring across mutation paths.
  2. Optimistic locking on allocations — close the small race window where two writers could conflict.
  3. DLQ recovery flows — explicit operator path for poison messages.
  4. App-runtime metering producer — the missing piece between app-runtime and the ledger.
  5. Tenant-shared runtime API direction — codify or kill.
  6. Email channel for notifications (currently WebSocket-only on the implemented side).
  7. Numbered migrations with explicit forward + rollback tested per migration.
  8. Session recording on terminal (compliance ask).
  9. Mint rate-limit on terminal tokens to slow brute-force enumeration attempts.
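Item 2 (optimistic locking on allocations) is the most mechanical of these to close. A minimal sketch of the pattern, assuming a `version` column on the allocations table: each writer reads the current version, and the state transition only commits if no other writer bumped it in the meantime. Table and column names here are illustrative, not the repo's schema.

```python
import sqlite3

def transition_allocation(db, alloc_id, expected_version, new_state):
    """Compare-and-swap an allocation row via a version column.

    The UPDATE only matches if no other writer bumped the version
    since we read it. rowcount == 0 means we lost the race: the
    caller should re-read and retry, or surface a conflict error.
    """
    cur = db.execute(
        "UPDATE allocations SET state = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_state, alloc_id, expected_version),
    )
    db.commit()
    return cur.rowcount == 1

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE allocations (id TEXT PRIMARY KEY, state TEXT, version INTEGER)")
db.execute("INSERT INTO allocations VALUES ('alloc-1', 'pending', 0)")

# Two writers both read version 0; only the first transition wins.
print(transition_allocation(db, "alloc-1", 0, "active"))    # True
print(transition_allocation(db, "alloc-1", 0, "released"))  # False: stale version
```

This closes the two-writer race window the review flags without pessimistic row locks; the losing writer gets an explicit signal instead of silently overwriting.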

What came out as requirements

Each action item above maps to a specific follow-up doc or ticket.

What the review explicitly did not say

  • It did not endorse adding MIG/vGPU/MPS — only observed that they don't exist.
  • It did not call for a multi-region active/active rewrite — explicit non-goal per PRD.
  • It did not call for migration to Kubernetes-everywhere — confirms the deliberate "platform below K8s" framing.

How to use the review

```mermaid
flowchart LR
    A[New PR or design] --> B{Touches a B− area?}
    B -- yes --> C[Reference the review action item<br/>that the PR closes or improves]
    B -- no --> D[Normal flow]
    C --> E[Track progress against<br/>action list in §5]
```

The review is not a runtime check — it's a periodic external calibration. Re-running it (or commissioning a second external review) is the validation that the action list is closing real risks rather than busy-work.

Where to look next