Position vs other GPU clouds¶
Implemented
Factual comparison of GPUaaS today vs. publicly-documented hyperscaler and boutique-cloud GPU offerings.
This page describes where the GPUaaS slice and bare-metal product currently sits in the broader GPU-cloud landscape. No roadmap, no proposed improvements — only factual claims about what each platform does.
Positioning quadrant¶
```mermaid
quadrantChart
    title GPU cloud isolation vs operational simplicity
    x-axis "Operationally simple" --> "Operationally rich"
    y-axis "Weaker isolation" --> "Stronger isolation"
    quadrant-1 "Hyperscaler HPC"
    quadrant-2 "GPUaaS sweet spot"
    quadrant-3 "Boutique container clouds"
    quadrant-4 "Bare metal direct"
    "RunPod / Vast / Together": [0.25, 0.25]
    "TensorDock / FluidStack": [0.30, 0.30]
    "Lambda 1-Click": [0.45, 0.55]
    "CoreWeave (K8s)": [0.70, 0.55]
    "DGX Cloud": [0.80, 0.65]
    "AWS / Azure / GCP HPC": [0.85, 0.80]
    "GPUaaS slice": [0.55, 0.78]
```
GPUaaS deliberately occupies a "VM with passthrough" position between the hyperscalers and the boutique container clouds:
- Stronger isolation than container clouds (RunPod, Vast, Together) because of VFIO + per-slot NVMe + per-slot IB VF.
- Operationally simpler than CoreWeave because there is no K8s + GPU Operator + Multus stack to maintain.
- Less feature-rich than AWS/Azure/GCP HPC — no MIG, no confidential compute, no multi-node clusters, no managed driver pipelines.
Side-by-side comparison¶
| Dimension | GPUaaS (today) | CoreWeave / Lambda | AWS / Azure / GCP HPC | RunPod / Vast / TensorDock | DGX Cloud |
|---|---|---|---|---|---|
| Tenancy unit | Per-slot VM with passthrough | Bare-metal node or K8s pod | Whole VM, often whole node | Containers on shared hosts | K8s + Slurm pods |
| Sub-GPU partitioning | None (1 GPU = 1 slot) | MIG via K8s on some SKUs | MIG-backed instance shapes | Sometimes MIG | MIG natively |
| Isolation strength | Strong (VFIO + per-slot NVMe + per-slot IB VF) | Bare metal: full; K8s: cgroups + GPU operator | VM-level + Nitro/Hyper-V offload | Container-level (weaker) | Pod-level |
| Scheduler | Postgres slot table + Temporal | K8s scheduler | Internal placement | K8s / custom | K8s / Slurm |
| Network plane | OVS + iptables NAT + dnsmasq | CNI (Multus, Calico) + SR-IOV | Nitro / Andromeda / Azure SDN | CNI | CNI + GPUDirect |
| East-west fabric | IPoIB w/ per-slot SR-IOV VF | RDMA via SR-IOV | EFA / GPUDirect / NDR | Often none | NVLink + NDR |
| Topology awareness | NUMA only | NUMA + NVLink + rack | NUMA + NVLink + rack/spine | None | Full topology |
| Image pipeline | qemu-img convert + cloud-init | Container image (instant) | AMI / VHD baked | Docker image | Container |
| Confidential compute | No (loader_secure=no, no TPM) | Optional on some SKUs | Yes (Nitro Enclaves, CVM, Confidential GKE) | No | H100 CC available |
| Source of truth | Postgres | etcd | Internal databases | Mixed | etcd |
| Onboarding | Operator-approved slot map | Auto via K8s join | Fully automated | Auto | Operator + auto |
| Multi-node | Not supported | K8s pod groups, Slurm | UltraClusters / placement groups | Not typical | First-class |
Strengths grounded in code¶
```mermaid
mindmap
  root((Strengths))
    Isolation
      VFIO GPU passthrough
      Per-slot dedicated NVMe
      Per-slot SR-IOV IB VF
      Wipe policy required on slot
    Reliability
      Transactional reservation
      Outbox + Temporal
      Operator-approved slot map
      Per-slot file leases
    Operability
      Single Postgres source of truth
      Deterministic MAC/IP layout
      Phase-timed task output
      Privacy-respecting telemetry
    Contract discipline
      Contract-first APIs
      Audit on every privileged mutation
      Immutable ledger
```
Verifiable strength claims¶
- Strong physical isolation per slot. Per-slot dedicated NVMe + dedicated IB VF + VFIO GPU. Verified by: `node_resource_slots` required `capacity_metadata` keys (`storage_ownership=slice`, `fabric_claim_mode=per_slot_vf`, non-empty `fabric_vf_pci_address`); see Capacity shapes.
- Reservation is transactional. Slot UPDATE + allocation INSERT + outbox row in one Postgres transaction. Verified by: `service.go:1499-1660`. A minimal sketch of the pattern follows this list.
- Outbox + Temporal beats imperative kubectl. Verified by: `packages/shared/outbox/` + `cmd/provisioning-worker/temporal.go`.
- Operator approval gate on slot inventory. Topology discovery returns `approval_required: true`; only operator action creates `node_resource_slots` rows.
- Deterministic MAC/IP/lease layout. MAC = `52:54:` + sha256(node_id:slot)[:4]; IP = `10.100.0.{10+slot_index}`; leases at `/var/lib/gpuaas/node-scheduler/leases/{slot_id}.json`. See the second sketch after this list.
- Privacy-respecting telemetry. Per-allocation token to a host-only sink at `10.100.0.1:9110`. No tenant access to host Netdata. Verified by: `cmd/node-agent/telemetry.go` and the design doc `Slice_Guest_Telemetry_and_Benchmark_v1.md`.
- Wipe policy required at slot approval time. `destructive_wipe_policy` must be non-empty for the slot to be schedulable.
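The transactional-reservation claim maps onto a familiar pattern: claim the slot, record the allocation, and enqueue the outbox event inside one Postgres transaction, so a crash leaves either all three writes or none. The Go sketch below is a minimal illustration under assumed table and column names (only `node_resource_slots` is documented above); the authoritative statements live in `service.go:1499-1660`.

```go
package reservation

import (
	"context"
	"database/sql"
	"fmt"
)

// reserveSlot sketches the single-transaction reservation described above.
// The allocations/outbox columns are assumptions for illustration only.
func reserveSlot(ctx context.Context, db *sql.DB, slotID, allocID, tenantID string, event []byte) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit has succeeded

	// 1. Claim the slot only if it is still free; the WHERE clause makes the
	//    claim atomic under concurrent reservation attempts.
	res, err := tx.ExecContext(ctx,
		`UPDATE node_resource_slots SET state = 'reserved' WHERE id = $1 AND state = 'available'`, slotID)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return fmt.Errorf("slot %s is not available", slotID)
	}

	// 2. Record the allocation in the same transaction (hypothetical columns).
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO allocations (id, slot_id, tenant_id) VALUES ($1, $2, $3)`,
		allocID, slotID, tenantID); err != nil {
		return err
	}

	// 3. Enqueue the provisioning event in the outbox so the Temporal worker
	//    only ever observes committed reservations.
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO outbox (aggregate_id, payload) VALUES ($1, $2)`,
		allocID, event); err != nil {
		return err
	}

	return tx.Commit()
}
```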
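The deterministic layout is small enough to reproduce directly. The sketch below follows the formulas stated above and assumes `[:4]` means the first four bytes of the SHA-256 digest (four MAC octets); the exact truncation and formatting are owned by the node-scheduler code.

```go
package layout

import (
	"crypto/sha256"
	"fmt"
)

// SlotMAC derives the per-slot MAC: the fixed 52:54: prefix followed by the
// first four bytes of sha256("<node_id>:<slot>"). Reading "[:4]" as four
// digest bytes is an assumption of this sketch.
func SlotMAC(nodeID string, slot int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", nodeID, slot)))
	return fmt.Sprintf("52:54:%02x:%02x:%02x:%02x", sum[0], sum[1], sum[2], sum[3])
}

// SlotIP follows the documented 10.100.0.{10+slot_index} layout.
func SlotIP(slotIndex int) string {
	return fmt.Sprintf("10.100.0.%d", 10+slotIndex)
}

// SlotLeasePath points at the per-slot lease file used by the node scheduler.
func SlotLeasePath(slotID string) string {
	return fmt.Sprintf("/var/lib/gpuaas/node-scheduler/leases/%s.json", slotID)
}
```

For example, `SlotIP(2)` yields `10.100.0.12` regardless of which host or GPU model backs the slot, which is what makes the layout auditable from the slot index alone.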
Where the model is intentionally narrower¶
These are not "gaps" — they're explicit choices that match the current product scope.
| Item | Stance | Source |
|---|---|---|
| Sub-GPU partitioning (MIG/vGPU/MPS) | Out of scope for v1 | Allocation_Capacity_Shapes_and_GPU_Slices_v1.md §Non-Goals |
| Multi-node slice clusters | Non-goal for v1 | Same |
| Live migration | Non-goal (PCI passthrough) | Same |
| Confidential compute | Not configured (loader_secure=no) | slice_vm.go:1240 virt-install args |
| Cross-tenant east-west networking | Denied by default | Slice_Networking_Architecture_v1.md |
What's specific about how GPUaaS does this¶
A few choices that are unusual relative to either hyperscalers or boutique clouds:
- Per-slot file leases (not DB rows or a distributed lock). Host-local mutex via JSON files under `/var/lib/gpuaas/node-scheduler/leases/`. Trades a little cross-host visibility for simpler crash semantics and fast reconciliation; a sketch follows this list.
- Slot reservation is a separate concern from VM lifecycle. The Postgres slot table is the durable scheduler state; the node-agent only validates a plan and executes it, never inventing placement.
- One BFF binary (`cmd/api`) for everything. Most GPU clouds split BFF/admin/internal into separate services; GPUaaS imports all domain packages directly. The eventual split is documented but not done.
- Contract-first OpenAPI is authoritative. 33k lines of OpenAPI + 2.3k lines of AsyncAPI define what the platform exposes; code generation enforces it.
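To make the file-lease trade-off concrete, here is a minimal Go sketch of a host-local lease taken by exclusive file creation under the documented lease directory. The JSON fields and function names are illustrative assumptions; the real record format and recovery logic belong to the node-scheduler.

```go
package lease

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

const leaseDir = "/var/lib/gpuaas/node-scheduler/leases"

// Lease is an illustrative record; the real schema is defined by the
// node-scheduler, not by this sketch.
type Lease struct {
	SlotID       string    `json:"slot_id"`
	AllocationID string    `json:"allocation_id"`
	AcquiredAt   time.Time `json:"acquired_at"`
}

// Acquire creates {slot_id}.json with O_EXCL, so only one local process can
// hold the slot at a time. Crash recovery stays simple: a stale lease is just
// a file that reconciliation can inspect or delete.
func Acquire(slotID, allocationID string) error {
	path := filepath.Join(leaseDir, slotID+".json")
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o600)
	if err != nil {
		return fmt.Errorf("slot %s already leased: %w", slotID, err)
	}
	defer f.Close()
	return json.NewEncoder(f).Encode(Lease{
		SlotID:       slotID,
		AllocationID: allocationID,
		AcquiredAt:   time.Now().UTC(),
	})
}

// Release removes the lease file; a missing file counts as already released.
func Release(slotID string) error {
	err := os.Remove(filepath.Join(leaseDir, slotID+".json"))
	if os.IsNotExist(err) {
		return nil
	}
	return err
}
```

The cost of this choice is exactly what the bullet states: another host cannot see the lease without asking the node, which is acceptable because placement truth lives in Postgres, not in the lease files.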
Reading suggestions¶
| If you're comparing to … | Read |
|---|---|
| AWS GPU HPC instances (P5, P5e, EC2 UltraCluster) | GPU slice as-built for isolation parity; note no UltraCluster equivalent |
| Azure ND H100 v5 | Same |
| GCP A3 / A3 Mega | Same; note no GPUDirect equivalent yet |
| CoreWeave K8s | Domain ownership — note no K8s on the runtime side |
| RunPod / Vast | GPU slice as-built — note dedicated NVMe per slot, real VM isolation |
| DGX Cloud / Slurm | cmd/slurm-reference-controller; note single-allocation only |
Where to look next¶
- GPU slice as-built — the full code-grounded write-up
- Capacity shapes & SKUs — what's actually sellable today
- System context — process map