
Position vs other GPU clouds

Implemented

Factual comparison of GPUaaS today vs. publicly documented hyperscaler and boutique-cloud GPU offerings.

This page describes where the GPUaaS slice and bare-metal product currently sits in the broader GPU-cloud landscape. No roadmap, no proposed improvements — only factual claims about what each platform does.

Positioning quadrant

quadrantChart
    title GPU cloud isolation vs operational simplicity
    x-axis "Operationally simple" --> "Operationally rich"
    y-axis "Weaker isolation" --> "Stronger isolation"
    quadrant-1 "Hyperscaler HPC"
    quadrant-2 "GPUaaS sweet spot"
    quadrant-3 "Boutique container clouds"
    quadrant-4 "Bare metal direct"
    "RunPod / Vast / Together": [0.25, 0.25]
    "TensorDock / FluidStack":   [0.30, 0.30]
    "Lambda 1-Click":            [0.45, 0.55]
    "CoreWeave (K8s)":           [0.70, 0.55]
    "DGX Cloud":                 [0.80, 0.65]
    "AWS / Azure / GCP HPC":     [0.85, 0.80]
    "GPUaaS slice":              [0.55, 0.78]

GPUaaS deliberately sits in a VM-with-passthrough position between the hyperscalers and the boutique container clouds:

  • Stronger isolation than container clouds (RunPod, Vast, Together) because of VFIO + per-slot NVMe + per-slot IB VF.
  • Operationally simpler than CoreWeave because there is no K8s + GPU Operator + Multus stack to maintain.
  • Less feature-rich than AWS/Azure/GCP HPC — no MIG, no confidential compute, no multi-node clusters, no managed driver pipelines.

Side-by-side comparison

| Dimension | GPUaaS (today) | CoreWeave / Lambda | AWS / Azure / GCP HPC | RunPod / Vast / TensorDock | DGX Cloud |
| --- | --- | --- | --- | --- | --- |
| Tenancy unit | Per-slot VM with passthrough | Bare-metal node or K8s pod | Whole VM, often whole node | Containers on shared hosts | K8s + Slurm pods |
| Sub-GPU partitioning | None (1 GPU = 1 slot) | MIG via K8s on some SKUs | MIG-backed instance shapes | Sometimes MIG | MIG natively |
| Isolation strength | Strong (VFIO + per-slot NVMe + per-slot IB VF) | Bare metal: full; K8s: cgroups + GPU operator | VM-level + Nitro/Hyper-V offload | Container-level (weaker) | Pod-level |
| Scheduler | Postgres slot table + Temporal | K8s scheduler | Internal placement | K8s / custom | K8s / Slurm |
| Network plane | OVS + iptables NAT + dnsmasq | CNI (Multus, Calico) + SR-IOV | Nitro / Andromeda / Azure SDN | CNI | CNI + GPUDirect |
| East-west fabric | IPoIB w/ per-slot SR-IOV VF | RDMA via SR-IOV | EFA / GPUDirect / NDR | Often none | NVLink + NDR |
| Topology awareness | NUMA only | NUMA + NVLink + rack | NUMA + NVLink + rack/spine | None | Full topology |
| Image pipeline | qemu-img convert + cloud-init | Container image (instant) | AMI / VHD baked | Docker image | Container |
| Confidential compute | No (loader_secure=no, no TPM) | Optional on some SKUs | Yes (Nitro Enclaves, CVM, Confidential GKE) | No | H100 CC available |
| Source of truth | Postgres | etcd | Internal databases | Mixed | etcd |
| Onboarding | Operator-approved slot map | Auto via K8s join | Fully automated | Auto | Operator + auto |
| Multi-node | Not supported | K8s pod groups, Slurm | UltraClusters / placement groups | Not typical | First-class |

Strengths grounded in code

mindmap
  root((Strengths))
    Isolation
      VFIO GPU passthrough
      Per-slot dedicated NVMe
      Per-slot SR-IOV IB VF
      Wipe policy required on slot
    Reliability
      Transactional reservation
      Outbox + Temporal
      Operator-approved slot map
      Per-slot file leases
    Operability
      Single Postgres source of truth
      Deterministic MAC/IP layout
      Phase-timed task output
      Privacy-respecting telemetry
    Contract discipline
      Contract-first APIs
      Audit on every privileged mutation
      Immutable ledger

Verifiable strength claims

  1. Strong physical isolation per slot. Per-slot dedicated NVMe + dedicated IB VF + VFIO GPU. Verified by: node_resource_slots required capacity_metadata keys (storage_ownership=slice, fabric_claim_mode=per_slot_vf, non-empty fabric_vf_pci_address) — see Capacity shapes.
  2. Reservation is transactional. The slot UPDATE, allocation INSERT, and outbox row are written in one Postgres transaction (see the first sketch after this list). Verified by: service.go:1499-1660.
  3. Outbox + Temporal instead of imperative orchestration. Provisioning work is recorded as outbox rows and executed by Temporal workflows rather than fired imperatively, kubectl-style. Verified by: packages/shared/outbox/ + cmd/provisioning-worker/temporal.go.
  4. Operator approval gate on slot inventory. Topology discovery returns approval_required: true; only operator action creates node_resource_slots rows.
  5. Deterministic MAC/IP/lease layout. MAC = 52:54: + sha256(node_id:slot)[:4]; IP = 10.100.0.{10+slot_index}; leases at /var/lib/gpuaas/node-scheduler/leases/{slot_id}.json (see the second sketch after this list).
  6. Privacy-respecting telemetry. Per-allocation token to host-only sink at 10.100.0.1:9110. No tenant access to host Netdata. Verified by: cmd/node-agent/telemetry.go and design doc Slice_Guest_Telemetry_and_Benchmark_v1.md.
  7. Wipe-policy required at slot approval time. destructive_wipe_policy must be non-empty for the slot to be schedulable.
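
To make claim 2 concrete, here is a minimal sketch of the single-transaction reservation pattern in Go. The table and column names below are illustrative assumptions, not the actual schema; the authoritative implementation is the service.go range cited above.

```go
package reservation

import (
	"context"
	"database/sql"
	"fmt"
)

// reserveSlot sketches the pattern from claim 2: slot UPDATE, allocation
// INSERT, and outbox INSERT all commit or roll back together.
// Table and column names are illustrative only.
func reserveSlot(ctx context.Context, db *sql.DB, slotID, allocationID, tenantID string) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	// 1. Claim the slot only if it is still free.
	res, err := tx.ExecContext(ctx,
		`UPDATE node_resource_slots SET state = 'reserved' WHERE id = $1 AND state = 'available'`,
		slotID)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return fmt.Errorf("slot %s is not available", slotID)
	}

	// 2. Record the allocation against the slot.
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO allocations (id, slot_id, tenant_id) VALUES ($1, $2, $3)`,
		allocationID, slotID, tenantID); err != nil {
		return err
	}

	// 3. Emit the outbox row that the Temporal worker later consumes.
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO outbox (topic, payload) VALUES ('allocation.reserved', $1)`,
		allocationID); err != nil {
		return err
	}

	return tx.Commit()
}
```

Because all three writes share one transaction, a crash at any point leaves either a fully reserved slot with its outbox row or no trace at all.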
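
And a sketch of the deterministic layout from claim 5, assuming the [:4] in the MAC formula means the first four bytes of the SHA-256 digest; the helper names are hypothetical, only the formulas and the lease path come from the claim.

```go
package layout

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// SlotMAC derives the per-slot MAC: the fixed 52:54: prefix followed by the
// first four bytes of sha256("<node_id>:<slot>") as colon-separated octets.
func SlotMAC(nodeID string, slotIndex int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", nodeID, slotIndex)))
	h := hex.EncodeToString(sum[:4]) // 8 hex chars = 4 octets
	return fmt.Sprintf("52:54:%s:%s:%s:%s", h[0:2], h[2:4], h[4:6], h[6:8])
}

// SlotIP places the slot at 10.100.0.{10+slot_index} on the host network.
func SlotIP(slotIndex int) string {
	return fmt.Sprintf("10.100.0.%d", 10+slotIndex)
}

// LeasePath is the host-local lease file for the slot.
func LeasePath(slotID string) string {
	return fmt.Sprintf("/var/lib/gpuaas/node-scheduler/leases/%s.json", slotID)
}
```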

Where the model is intentionally narrower

These are not "gaps" — they're explicit choices that match the current product scope.

| Item | Stance | Source |
| --- | --- | --- |
| Sub-GPU partitioning (MIG/vGPU/MPS) | Out of scope for v1 | Allocation_Capacity_Shapes_and_GPU_Slices_v1.md §Non-Goals |
| Multi-node slice clusters | Non-goal for v1 | Same |
| Live migration | Non-goal (PCI passthrough) | Same |
| Confidential compute | Not configured (loader_secure=no) | slice_vm.go:1240 virt-install args |
| Cross-tenant east-west networking | Denied by default | Slice_Networking_Architecture_v1.md |

What's specific about how GPUaaS does this

A few choices that are unusual relative to either hyperscalers or boutique clouds:

  • Per-slot file leases (not DB rows or a distributed lock). Host-local mutual exclusion via JSON files under /var/lib/gpuaas/node-scheduler/leases/. This trades a little cross-host visibility for simpler crash semantics and fast reconciliation (see the sketch after this list).
  • Slot reservation is a separate concern from VM lifecycle. The Postgres slot table is the durable scheduler state; the node-agent only validates a plan and executes — it does not invent placement.
  • One BFF binary (cmd/api) for everything. Most GPU clouds split BFF/admin/internal into separate services. GPUaaS imports all domain packages directly. The eventual split is documented but not done.
  • Contract-first OpenAPI is authoritative. 33k lines of OpenAPI + 2.3k of AsyncAPI define what the platform exposes; code generation enforces it.
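
A minimal sketch of the host-local lease pattern from the first bullet, assuming a lease is taken by exclusively creating the slot's JSON file. The payload fields and function names are hypothetical; only the path layout comes from the text above.

```go
package lease

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

const leaseDir = "/var/lib/gpuaas/node-scheduler/leases"

// Lease is an illustrative payload; the real file format may differ.
type Lease struct {
	SlotID     string    `json:"slot_id"`
	AcquiredAt time.Time `json:"acquired_at"`
	Owner      string    `json:"owner"` // e.g. node-agent process identity
}

// Acquire takes the per-slot lease by creating the JSON file exclusively.
// If the file already exists, another actor on this host holds the slot.
func Acquire(slotID, owner string) error {
	path := filepath.Join(leaseDir, slotID+".json")
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if err != nil {
		return fmt.Errorf("lease already held or unwritable: %w", err)
	}
	defer f.Close()
	return json.NewEncoder(f).Encode(Lease{SlotID: slotID, AcquiredAt: time.Now().UTC(), Owner: owner})
}

// Release drops the lease; a missing file counts as already released,
// which keeps crash recovery and reconciliation simple.
func Release(slotID string) error {
	err := os.Remove(filepath.Join(leaseDir, slotID+".json"))
	if os.IsNotExist(err) {
		return nil
	}
	return err
}
```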

Reading suggestions

| If you're comparing to … | Read |
| --- | --- |
| AWS GPU HPC instances (P5, P5e, EC2 UltraCluster) | GPU slice as-built for isolation parity; note no UltraCluster equivalent |
| Azure ND H100 v5 | Same |
| GCP A3 / A3 Mega | Same; note no GPUDirect equivalent yet |
| CoreWeave K8s | Domain ownership; note no K8s on the runtime side |
| RunPod / Vast | GPU slice as-built; note dedicated NVMe per slot and real VM isolation |
| DGX Cloud / Slurm | cmd/slurm-reference-controller; note single-allocation only |

Where to look next