
Position vs other GPU clouds

Implemented

Factual comparison of GPUaaS today vs. publicly documented hyperscaler and boutique-cloud GPU offerings.

This page describes where the GPUaaS slice and bare-metal product currently sits in the broader GPU-cloud landscape. No roadmap, no proposed improvements — only factual claims about what each platform does.

Positioning quadrant

quadrantChart
    title GPU cloud isolation vs operational simplicity
    x-axis "Operationally simple" --> "Operationally rich"
    y-axis "Weaker isolation" --> "Stronger isolation"
    quadrant-1 "Hyperscaler HPC"
    quadrant-2 "GPUaaS sweet spot"
    quadrant-3 "Boutique container clouds"
    quadrant-4 "Bare metal direct"
    "RunPod / Vast / Together": [0.25, 0.25]
    "TensorDock / FluidStack":   [0.30, 0.30]
    "Lambda 1-Click":            [0.45, 0.55]
    "CoreWeave (K8s)":           [0.70, 0.55]
    "DGX Cloud":                 [0.80, 0.65]
    "AWS / Azure / GCP HPC":     [0.85, 0.80]
    "GPUaaS slice":              [0.55, 0.78]

GPUaaS deliberately sits in a VM-with-passthrough position between the hyperscalers and the boutique container clouds:

  • Stronger isolation than container clouds (RunPod, Vast, Together) because of VFIO + per-slot NVMe + per-slot IB VF.
  • Operationally simpler than CoreWeave because there is no K8s + GPU Operator + Multus stack to maintain.
  • Less feature-rich than AWS/Azure/GCP HPC — no MIG, no confidential compute, no multi-node clusters, no managed driver pipelines.

Side-by-side comparison

| Dimension | GPUaaS (today) | CoreWeave / Lambda | AWS / Azure / GCP HPC | RunPod / Vast / TensorDock | DGX Cloud |
| --- | --- | --- | --- | --- | --- |
| Tenancy unit | Per-slot VM with passthrough | Bare-metal node or K8s pod | Whole VM, often whole node | Containers on shared hosts | K8s + Slurm pods |
| Sub-GPU partitioning | None (1 GPU = 1 slot) | MIG via K8s on some SKUs | MIG-backed instance shapes | Sometimes MIG | MIG natively |
| Isolation strength | Strong (VFIO + per-slot NVMe + per-slot IB VF) | Bare metal: full; K8s: cgroups + GPU operator | VM-level + Nitro/Hyper-V offload | Container-level (weaker) | Pod-level |
| Scheduler | Postgres slot table + Temporal | K8s scheduler | Internal placement | K8s / custom | K8s / Slurm |
| Network plane | OVS + iptables NAT + dnsmasq | CNI (Multus, Calico) + SR-IOV | Nitro / Andromeda / Azure SDN | CNI | CNI + GPUDirect |
| East-west fabric | IPoIB w/ per-slot SR-IOV VF | RDMA via SR-IOV | EFA / GPUDirect / NDR | Often none | NVLink + NDR |
| Topology awareness | NUMA only | NUMA + NVLink + rack | NUMA + NVLink + rack/spine | None | Full topology |
| Image pipeline | qemu-img convert + cloud-init | Container image (instant) | AMI / VHD baked | Docker image | Container |
| Confidential compute | No (loader_secure=no, no TPM) | Optional on some SKUs | Yes (Nitro Enclaves, CVM, Confidential GKE) | No | H100 CC available |
| Source of truth | Postgres | etcd | Internal databases | Mixed | etcd |
| Onboarding | Operator-approved slot map | Auto via K8s join | Fully automated | Auto | Operator + auto |
| Multi-node | Not supported | K8s pod groups, Slurm | UltraClusters / placement groups | Not typical | First-class |

Strengths grounded in code

mindmap
  root((Strengths))
    Isolation
      VFIO GPU passthrough
      Per-slot dedicated NVMe
      Per-slot SR-IOV IB VF
      Wipe policy required on slot
    Reliability
      Transactional reservation
      Outbox + Temporal
      Operator-approved slot map
      Per-slot file leases
    Operability
      Single Postgres source of truth
      Deterministic MAC/IP layout
      Phase-timed task output
      Privacy-respecting telemetry
    Contract discipline
      Contract-first APIs
      Audit on every privileged mutation
      Immutable ledger

Verifiable strength claims

  1. Strong physical isolation per slot. Per-slot dedicated NVMe + dedicated IB VF + VFIO GPU. Verified by: node_resource_slots required capacity_metadata keys (storage_ownership=slice, fabric_claim_mode=per_slot_vf, non-empty fabric_vf_pci_address) — see Capacity shapes.
  2. Reservation is transactional. The slot UPDATE, allocation INSERT, and outbox row are written in one Postgres transaction (see the first sketch after this list). Verified by: service.go:1499-1660.
  3. Outbox + Temporal instead of imperative orchestration. Provisioning work is recorded as outbox rows and executed by Temporal workflows rather than fired imperatively, kubectl-style. Verified by: packages/shared/outbox/ + cmd/provisioning-worker/temporal.go.
  4. Operator approval gate on slot inventory. Topology discovery returns approval_required: true; only operator action creates node_resource_slots rows.
  5. Deterministic MAC/IP/lease layout. MAC = 52:54: + sha256(node_id:slot)[:4]; IP = 10.100.0.{10+slot_index}; leases at /var/lib/gpuaas/node-scheduler/leases/{slot_id}.json (see the second sketch after this list).
  6. Privacy-respecting telemetry. Per-allocation token to host-only sink at 10.100.0.1:9110. No tenant access to host Netdata. Verified by: cmd/node-agent/telemetry.go and design doc Slice_Guest_Telemetry_and_Benchmark_v1.md.
  7. Wipe-policy required at slot approval time. destructive_wipe_policy must be non-empty for the slot to be schedulable.
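
To make claim 2 concrete, here is a minimal sketch of the single-transaction reservation pattern in Go. The table and column names below are illustrative assumptions, not the actual schema; the authoritative implementation is the service.go range cited above.

```go
package reservation

import (
	"context"
	"database/sql"
	"fmt"
)

// reserveSlot sketches the pattern from claim 2: slot UPDATE, allocation
// INSERT, and outbox INSERT all commit or roll back together.
// Table and column names are illustrative only.
func reserveSlot(ctx context.Context, db *sql.DB, slotID, allocationID, tenantID string) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	// 1. Claim the slot only if it is still free.
	res, err := tx.ExecContext(ctx,
		`UPDATE node_resource_slots SET state = 'reserved' WHERE id = $1 AND state = 'available'`,
		slotID)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return fmt.Errorf("slot %s is not available", slotID)
	}

	// 2. Record the allocation against the slot.
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO allocations (id, slot_id, tenant_id) VALUES ($1, $2, $3)`,
		allocationID, slotID, tenantID); err != nil {
		return err
	}

	// 3. Emit the outbox row that the Temporal worker later consumes.
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO outbox (topic, payload) VALUES ('allocation.reserved', $1)`,
		allocationID); err != nil {
		return err
	}

	return tx.Commit()
}
```

Because all three writes share one transaction, a crash at any point leaves either a fully reserved slot with its outbox row or no trace at all.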
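
And a sketch of the deterministic layout from claim 5, assuming the [:4] in the MAC formula means the first four bytes of the SHA-256 digest; the helper names are hypothetical, only the formulas and the lease path come from the claim.

```go
package layout

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// SlotMAC derives the per-slot MAC: the fixed 52:54: prefix followed by the
// first four bytes of sha256("<node_id>:<slot>") as colon-separated octets.
func SlotMAC(nodeID string, slotIndex int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", nodeID, slotIndex)))
	h := hex.EncodeToString(sum[:4]) // 8 hex chars = 4 octets
	return fmt.Sprintf("52:54:%s:%s:%s:%s", h[0:2], h[2:4], h[4:6], h[6:8])
}

// SlotIP places the slot at 10.100.0.{10+slot_index} on the host network.
func SlotIP(slotIndex int) string {
	return fmt.Sprintf("10.100.0.%d", 10+slotIndex)
}

// LeasePath is the host-local lease file for the slot.
func LeasePath(slotID string) string {
	return fmt.Sprintf("/var/lib/gpuaas/node-scheduler/leases/%s.json", slotID)
}
```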

Where the model is intentionally narrower

These are not "gaps" — they're explicit choices that match the current product scope.

| Item | Stance | Source |
| --- | --- | --- |
| Sub-GPU partitioning (MIG/vGPU/MPS) | Out of scope for v1 | Allocation_Capacity_Shapes_and_GPU_Slices_v1.md §Non-Goals |
| Multi-node slice clusters | Non-goal for v1 | Same |
| Live migration | Non-goal (PCI passthrough) | Same |
| Confidential compute | Not configured (loader_secure=no) | slice_vm.go:1240 virt-install args |
| Cross-tenant east-west networking | Denied by default | Slice_Networking_Architecture_v1.md |

What's specific about how GPUaaS does this

A few choices that are unusual relative to either hyperscalers or boutique clouds:

  • Per-slot file leases (not DB rows or a distributed lock). Host-local mutual exclusion via JSON files under /var/lib/gpuaas/node-scheduler/leases/. This trades a little cross-host visibility for simpler crash semantics and fast reconciliation (see the sketch after this list).
  • Slot reservation is a separate concern from VM lifecycle. The Postgres slot table is the durable scheduler state; the node-agent only validates a plan and executes — it does not invent placement.
  • One BFF binary (cmd/api) for everything. Most GPU clouds split BFF/admin/internal into separate services. GPUaaS imports all domain packages directly. The eventual split is documented but not done.
  • Contract-first OpenAPI is authoritative. 33k lines of OpenAPI + 2.3k of AsyncAPI define what the platform exposes; code generation enforces it.
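
A minimal sketch of the host-local lease pattern from the first bullet, assuming a lease is taken by exclusively creating the slot's JSON file. The payload fields and function names are hypothetical; only the path layout comes from the text above.

```go
package lease

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

const leaseDir = "/var/lib/gpuaas/node-scheduler/leases"

// Lease is an illustrative payload; the real file format may differ.
type Lease struct {
	SlotID     string    `json:"slot_id"`
	AcquiredAt time.Time `json:"acquired_at"`
	Owner      string    `json:"owner"` // e.g. node-agent process identity
}

// Acquire takes the per-slot lease by creating the JSON file exclusively.
// If the file already exists, another actor on this host holds the slot.
func Acquire(slotID, owner string) error {
	path := filepath.Join(leaseDir, slotID+".json")
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if err != nil {
		return fmt.Errorf("lease already held or unwritable: %w", err)
	}
	defer f.Close()
	return json.NewEncoder(f).Encode(Lease{SlotID: slotID, AcquiredAt: time.Now().UTC(), Owner: owner})
}

// Release drops the lease; a missing file counts as already released,
// which keeps crash recovery and reconciliation simple.
func Release(slotID string) error {
	err := os.Remove(filepath.Join(leaseDir, slotID+".json"))
	if os.IsNotExist(err) {
		return nil
	}
	return err
}
```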

Reading suggestions

| If you're comparing to … | Read |
| --- | --- |
| AWS GPU HPC instances (P5, P5e, EC2 UltraCluster) | GPU slice as-built for isolation parity; note no UltraCluster equivalent |
| Azure ND H100 v5 | Same |
| GCP A3 / A3 Mega | Same; note no GPUDirect equivalent yet |
| CoreWeave K8s | Domain ownership; note no K8s on the runtime side |
| RunPod / Vast | GPU slice as-built; note dedicated NVMe per slot and real VM isolation |
| DGX Cloud / Slurm | cmd/slurm-reference-controller; note single-allocation only |

Where to look next