# External clouds & products

Status: Decided

Source: doc/architecture/Cloud_Hierarchy_Comparison.md · doc/architecture/Allocation_Capacity_Shapes_and_GPU_Slices_v1.md · doc/product/GPUaaS_vs_Armada_Bridge_Gap_Matrix.md · GPU-cloud landscape research

## 1. Cloud hierarchy (AWS / GCP / Azure / Nebius)

```mermaid
flowchart TB
    classDef gpa fill:#e8f5e9,stroke:#2e7d32
    classDef ext fill:#e3f2fd,stroke:#1565c0
    G[GPUaaS<br/>tenant → project → resource]:::gpa
    AWS[AWS<br/>Org / OU / Account / Resource]:::ext
    GCP[GCP<br/>Org / Folder / Project / Resource]:::ext
    AZ[Azure<br/>Mgmt Group / Sub / RG / Resource]:::ext
    NB[Nebius<br/>Tenant / Project / Resource]:::ext
    NB ---|closest semantic match| G
    GCP ---|two-level → three-level mapping| G
    AWS ---|account = our project| G
    AZ ---|subscription = our project| G
```

### Mapping table

| Concept | AWS | GCP | Azure | Nebius | GPUaaS |
|---|---|---|---|---|---|
| Ownership root | Org / Management Account | Organization | Tenant + Mgmt Group root | Tenant | Tenant (organizations) |
| Operational scope | Account | Project | Subscription / RG | Project | Project |
| Membership | IAM principals | IAM principals | Entra principals + RBAC | Tenant/project groups | tenant_memberships + project_memberships |
| Policy model | IAM + SCP | IAM + org policy | RBAC + policy initiatives | Group + role | global → tenant → project chain |
| Billing anchor | Account / payer | Billing account + org/project | Subscription | Tenant + quotas | Tenant (organizations.stripe_customer_id) |
| User as owner-of-record? | No | No | No | No | No (requested_by_user_id attribution only) |
### What came out as requirements

- Tenant as ownership root, not user. Resources survive user churn.
- Project as operational scope inside tenant, mirroring GCP/Nebius.
- Membership-table-driven access, not user-as-owner.
- Scoped policy chain `global → tenant → project` with most-specific-wins semantics, modeled on GCP IAM inheritance and Azure RBAC scope (see the sketch after this list).
- Canonical resource identifier `core42:aicloud:{region}:{tenant_id}:{project_id}:{resource_type}:{resource_id}`, directly inspired by ARN / GCP resource-name / Azure resource-ID patterns.
- MVP constraint: `UNIQUE(user_id)` on `tenant_memberships` (one tenant per user); multi-tenant membership is deferred but designed for.
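
A minimal sketch of the two mechanisms above. The names (`resolve_policy`, `ResourceId`) and example values (region, tenant, policy keys) are illustrative assumptions, not the shipped API: the sketch shows most-specific-wins resolution over the `global → tenant → project` chain, plus formatting and parsing of the canonical resource identifier.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch: scopes ordered least- to most-specific.
SCOPE_CHAIN = ("global", "tenant", "project")

def resolve_policy(key: str, policies: dict[str, dict[str, str]]) -> Optional[str]:
    """Walk global -> tenant -> project; the most specific scope that
    defines `key` wins (mirrors GCP IAM inheritance / Azure RBAC scope)."""
    value = None
    for scope in SCOPE_CHAIN:
        scoped = policies.get(scope, {})
        if key in scoped:
            value = scoped[key]  # a more specific scope overrides
    return value

@dataclass(frozen=True)
class ResourceId:
    """core42:aicloud:{region}:{tenant_id}:{project_id}:{resource_type}:{resource_id}"""
    region: str
    tenant_id: str
    project_id: str
    resource_type: str
    resource_id: str

    def __str__(self) -> str:
        return ":".join(("core42", "aicloud", self.region, self.tenant_id,
                         self.project_id, self.resource_type, self.resource_id))

    @classmethod
    def parse(cls, raw: str) -> "ResourceId":
        parts = raw.split(":")
        if len(parts) != 7 or parts[:2] != ["core42", "aicloud"]:
            raise ValueError(f"not a canonical resource id: {raw!r}")
        return cls(*parts[2:])

# Example: a project-level setting overrides the tenant default.
policies = {
    "global": {"wipe_policy": "zero-on-release"},
    "tenant": {"max_slots": "8"},
    "project": {"max_slots": "2"},
}
assert resolve_policy("max_slots", policies) == "2"
assert resolve_policy("wipe_policy", policies) == "zero-on-release"
rid = ResourceId.parse("core42:aicloud:me-central-1:t-42:p-7:gpu_slice:s-001")
assert rid.project_id == "p-7"
```
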
Source: Cloud_Hierarchy_Comparison.md.
## 2. GPU-cloud landscape

```mermaid
quadrantChart
    title GPU cloud isolation vs operational simplicity
    x-axis "Operationally simple" --> "Operationally rich"
    y-axis "Weaker isolation" --> "Stronger isolation"
    quadrant-1 "Hyperscaler HPC"
    quadrant-2 "GPUaaS sweet spot"
    quadrant-3 "Boutique container clouds"
    quadrant-4 "Bare metal direct"
    "RunPod / Vast / Together": [0.25, 0.25]
    "TensorDock / FluidStack": [0.30, 0.30]
    "Lambda 1-Click": [0.45, 0.55]
    "CoreWeave (K8s)": [0.70, 0.55]
    "DGX Cloud": [0.80, 0.65]
    "AWS / Azure / GCP HPC": [0.85, 0.80]
    "GPUaaS slice": [0.55, 0.78]
```

### Side-by-side

| Dimension | GPUaaS (today) | CoreWeave / Lambda | AWS / Azure / GCP HPC | RunPod / Vast / TensorDock | DGX Cloud |
|---|---|---|---|---|---|
| Tenancy unit | Per-slot VM with passthrough | Bare-metal node or K8s pod | Whole VM, often whole node | Containers on shared hosts | K8s + Slurm pods |
| Sub-GPU partitioning | None (1 GPU = 1 slot) | MIG via K8s on some SKUs | MIG-backed instance shapes | Sometimes MIG | MIG natively |
| Isolation strength | Strong (VFIO + per-slot NVMe + per-slot IB VF) | Bare metal: full; K8s: cgroups + GPU operator | VM-level + Nitro/Hyper-V offload | Container-level (weaker) | Pod-level |
| Scheduler | Postgres slot table + Temporal | K8s scheduler | Internal placement | K8s / custom | K8s / Slurm |
| Network plane | OVS + iptables NAT + dnsmasq | CNI (Multus, Calico) + SR-IOV | Nitro / Andromeda / Azure SDN | CNI | CNI + GPUDirect |
| East-west fabric | IPoIB w/ per-slot SR-IOV VF | RDMA via SR-IOV | EFA / GPUDirect / NDR | Often none | NVLink + NDR |
| Topology awareness | NUMA only | NUMA + NVLink + rack | NUMA + NVLink + rack/spine | None | Full topology |
| Image pipeline | qemu-img convert + cloud-init | Container image (instant) | AMI / VHD baked | Docker image | Container |
| Confidential compute | No (loader_secure=no, no TPM) | Optional on some SKUs | Yes (Nitro Enclaves, CVM) | No | H100 CC available |
| Source of truth | Postgres | etcd | Internal databases | Mixed | etcd |
| Onboarding | Operator-approved slot map | Auto via K8s join | Fully automated | Auto | Operator + auto |
| Multi-node | Not supported | K8s pod groups, Slurm | UltraClusters / placement groups | Not typical | First-class |
### What came out as requirements

- VM-with-passthrough position deliberately chosen — stronger isolation than container clouds (RunPod, Vast); simpler ops than CoreWeave; less feature-rich than hyperscalers.
- Per-slot dedicated NVMe + IB VF + VFIO GPU — unusual combination, justified by the slice product's strong-isolation promise.
- Operator-approved slot map as a safety property — most boutique clouds skip this and pay later.
- Wipe policy required on each slot: codifies an erase contract that most clouds bury in an SLA doc (see the slot-map sketch below).
Read the Position vs other clouds page for the verifiable strength claims.
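
A minimal sketch of the slot-map contract, assuming hypothetical names (`Slot`, `allocate`, `release`); the real source of truth is a Postgres slot table driven by Temporal workflows. It shows how operator approval gates scheduling and how the wipe policy sits between release and reuse.

```python
from dataclasses import dataclass
from enum import Enum

class SlotState(Enum):
    FREE = "free"
    ALLOCATED = "allocated"
    NEEDS_WIPE = "needs_wipe"

class WipePolicy(Enum):
    ZERO_ON_RELEASE = "zero_on_release"
    CRYPTO_ERASE = "crypto_erase"

@dataclass
class Slot:
    """Hypothetical in-memory stand-in for one row of the slot table."""
    slot_id: str
    gpu_pci_addr: str        # VFIO passthrough target
    nvme_device: str         # per-slot dedicated NVMe
    ib_vf: str               # per-slot SR-IOV IB virtual function
    wipe_policy: WipePolicy  # erase contract is mandatory on every slot
    operator_approved: bool = False  # safety property: no approval, no scheduling
    state: SlotState = SlotState.FREE
    tenant_id: str | None = None

def allocate(slots: list[Slot], tenant_id: str) -> Slot:
    """Pick a free, operator-approved slot. In Postgres this would be a
    SELECT ... FOR UPDATE SKIP LOCKED step inside a Temporal activity."""
    for s in slots:
        if s.state is SlotState.FREE and s.operator_approved:
            s.state, s.tenant_id = SlotState.ALLOCATED, tenant_id
            return s
    raise RuntimeError("no approved free slot")

def release(slot: Slot) -> None:
    """Release never returns a slot to FREE directly: a wipe workflow
    honoring slot.wipe_policy must run before the slot is reusable."""
    slot.state, slot.tenant_id = SlotState.NEEDS_WIPE, None

# Example: one approved slot cycles through allocate -> release -> wipe-pending.
slots = [Slot("s-01", "0000:3b:00.0", "/dev/nvme1n1", "ib0v1",
              WipePolicy.ZERO_ON_RELEASE, operator_approved=True)]
s = allocate(slots, "t-42")
release(s)
assert s.state is SlotState.NEEDS_WIPE
```
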
## 3. Armada Bridge gap matrix

Captured from the Armada Bridge product page as of March 6, 2026.
### What Armada claims

| Use case | Bridge description |
|---|---|
| AI Factory Orchestration | Deploy + scale large AI workloads with dynamic optimization across GPU clusters |
| GPU-as-a-Service | Provision resources, monetize capacity, deliver self-service cloud for end users |
| Platform-as-a-Service | Run AI models via APIs and dashboards with customizable compliance controls |
Five capability pillars Armada markets:
- Hard isolation for multi-tenant environments
- Elastic resource allocation across clusters
- GPU monetization and revenue generation
- Unified billing and real-time observability
- Air-gapped security, data sovereignty, regulatory compliance
NVIDIA certifications claimed: NCP, Spectrum-X, Quantum-2, Base Command Manager.
### Gap matrix (GPUaaS vs Bridge)

```mermaid
flowchart LR
    subgraph BRIDGE[Armada Bridge marketing]
        P1[Hard multi-tenant isolation]
        P2[Elastic cross-cluster allocation]
        P3[GPU monetization]
        P4[Unified billing + observability]
        P5[Air-gapped + data sovereignty]
    end
    subgraph GPUAAS[GPUaaS status]
        G1[VFIO per-slot isolation<br/>IMPLEMENTED]
        G2[Single-region; cross-cluster<br/>DESIGNED]
        G3[SKU + ledger + Stripe<br/>IMPLEMENTED]
        G4[Ledger + OTel + Grafana +<br/>admin ops panel<br/>IMPLEMENTED]
        G5[Per-region resource ids<br/>tenant isolation<br/>IMPLEMENTED]
    end
    P1 --- G1
    P2 --- G2
    P3 --- G3
    P4 --- G4
    P5 --- G5
```

### What came out as requirements

- Self-service GPU cloud is in-scope and table-stakes per the comparison.
- GPU monetization alignment: GPUaaS provides marketplace + admin SKU control as functional equivalents.
- Unified billing + observability is a shipped deliverable (ledger + OTel + Grafana + admin ops panel); a minimal metering sketch follows this list.
- Hard multi-tenant isolation — Bridge's marketing pillar maps to GPUaaS's per-slot VFIO model, considered a stronger position than container-based competitors.
- Air-gapped / sovereign is a tracked workstream — the slice path supports private NAT and on-prem deployment; the broader regulatory compliance posture remains an enterprise expansion topic.
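
A minimal metering sketch using the OpenTelemetry Python API; the metric and attribute names (`gpu.slot.seconds`, `tenant_id`, `sku`) are illustrative assumptions, not the shipped schema. The point is the pattern: one counter with shared attributes can feed both the billing ledger and the Grafana dashboards.

```python
from opentelemetry import metrics

# Hypothetical metric/attribute names; the OTel API calls are standard.
meter = metrics.get_meter("gpuaas.billing")
gpu_seconds = meter.create_counter(
    "gpu.slot.seconds",
    unit="s",
    description="Billable GPU slot time, attributed per tenant and SKU",
)

def record_usage(tenant_id: str, sku: str, seconds: int) -> None:
    # One increment per metering tick; ledger and Grafana aggregate
    # over the same attribute dimensions.
    gpu_seconds.add(seconds, {"tenant_id": tenant_id, "sku": sku})

record_usage("t-42", "h100-slice-1g", 60)
```
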
Source: GPUaaS_vs_Armada_Bridge_Gap_Matrix.md.
## Cross-cutting takeaways

```mermaid
mindmap
  root((External comparisons<br/>→ requirements))
    Hierarchy
      Tenant ownership root
      Project operational scope
      Scoped policy chain
      Canonical resource ids
    GPU cloud position
      VM-with-passthrough
      Per-slot dedicated resources
      Operator-approved slot map
      Strong isolation as differentiator
    Competitive parity
      Self-service catalog
      Marketplace + monetization
      Unified billing + observability
      Multi-tenant isolation
```