
External clouds & products

Decided

Source: doc/architecture/Cloud_Hierarchy_Comparison.md · doc/architecture/Allocation_Capacity_Shapes_and_GPU_Slices_v1.md · doc/product/GPUaaS_vs_Armada_Bridge_Gap_Matrix.md · GPU-cloud landscape research

1. Cloud hierarchy (AWS / GCP / Azure / Nebius)

flowchart TB
    classDef gpa fill:#e8f5e9,stroke:#2e7d32
    classDef ext fill:#e3f2fd,stroke:#1565c0

    G[GPUaaS<br/>tenant → project → resource]:::gpa
    AWS[AWS<br/>Org / OU / Account / Resource]:::ext
    GCP[GCP<br/>Org / Folder / Project / Resource]:::ext
    AZ[Azure<br/>Mgmt Group / Sub / RG / Resource]:::ext
    NB[Nebius<br/>Tenant / Project / Resource]:::ext

    NB ---|closest semantic match| G
    GCP ---|two-level→three-level mapping| G
    AWS ---|account = our project| G
    AZ ---|subscription = our project| G

Mapping table

| Concept | AWS | GCP | Azure | Nebius | GPUaaS |
|---|---|---|---|---|---|
| Ownership root | Org / Management Account | Organization | Tenant + Mgmt Group root | Tenant | Tenant (organizations) |
| Operational scope | Account | Project | Subscription / RG | Project | Project |
| Membership | IAM principals | IAM principals | Entra principals + RBAC | Tenant/project groups | tenant_memberships + project_memberships |
| Policy model | IAM + SCP | IAM + org policy | RBAC + policy initiatives | Group + role | global → tenant → project chain |
| Billing anchor | Account / payer | Billing account + org/project | Subscription | Tenant + quotas | Tenant (organizations.stripe_customer_id) |
| User as owner-of-record? | No | No | No | No | No (requested_by_user_id attribution only) |

What came out as requirements

  • Tenant as ownership root, not user. Resources survive user churn.
  • Project as operational scope inside tenant, mirroring GCP/Nebius.
  • Membership-table-driven access, not user-as-owner.
  • Scoped policy chain global → tenant → project with most-specific-wins, modeled on GCP IAM inheritance and Azure RBAC scope.
  • Canonical resource identifier core42:aicloud:{region}:{tenant_id}:{project_id}:{resource_type}:{resource_id} — directly inspired by ARN / GCP resource-names / Azure resource-id patterns (a parsing sketch follows this list).
  • MVP constraint UNIQUE(user_id) on tenant_memberships (a single tenant per user); multi-tenant deferred but designed for.
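
A minimal Python sketch of how two of these requirements could fit together: parsing the canonical identifier and resolving a policy with most-specific-wins over the global → tenant → project chain. This is an illustration, not the shipped implementation; the names ResourceId and resolve_policy and the scope-key format ("project:<id>", "tenant:<id>", "global") are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceId:
    """core42:aicloud:{region}:{tenant_id}:{project_id}:{resource_type}:{resource_id}"""
    region: str
    tenant_id: str
    project_id: str
    resource_type: str
    resource_id: str

    PREFIX = ("core42", "aicloud")  # constant; no annotation, so not a dataclass field

    @classmethod
    def parse(cls, rid: str) -> "ResourceId":
        parts = rid.split(":")
        if len(parts) != 7 or tuple(parts[:2]) != cls.PREFIX:
            raise ValueError(f"not a canonical resource id: {rid}")
        return cls(*parts[2:])

    def __str__(self) -> str:
        return ":".join((*self.PREFIX, self.region, self.tenant_id,
                         self.project_id, self.resource_type, self.resource_id))


def resolve_policy(rid: ResourceId, policies: dict) -> Optional[dict]:
    """Most-specific-wins: a project-scoped policy beats a tenant-scoped one,
    which beats a global one."""
    for scope in (f"project:{rid.project_id}", f"tenant:{rid.tenant_id}", "global"):
        if scope in policies:
            return policies[scope]
    return None
```

parse() and str() round-trip, so one representation can serve both policy lookups and audit attribution; any concrete region or id values used with this sketch would be invented.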

Source: Cloud_Hierarchy_Comparison.md.


2. GPU-cloud landscape

quadrantChart
    title GPU cloud isolation vs operational simplicity
    x-axis "Operationally simple" --> "Operationally rich"
    y-axis "Weaker isolation" --> "Stronger isolation"
    quadrant-1 "Hyperscaler HPC"
    quadrant-2 "GPUaaS sweet spot"
    quadrant-3 "Boutique container clouds"
    quadrant-4 "Bare metal direct"
    "RunPod / Vast / Together": [0.25, 0.25]
    "TensorDock / FluidStack":   [0.30, 0.30]
    "Lambda 1-Click":            [0.45, 0.55]
    "CoreWeave (K8s)":           [0.70, 0.55]
    "DGX Cloud":                 [0.80, 0.65]
    "AWS / Azure / GCP HPC":     [0.85, 0.80]
    "GPUaaS slice":              [0.55, 0.78]

Side-by-side

| Dimension | GPUaaS (today) | CoreWeave / Lambda | AWS / Azure / GCP HPC | RunPod / Vast / TensorDock | DGX Cloud |
|---|---|---|---|---|---|
| Tenancy unit | Per-slot VM with passthrough | Bare-metal node or K8s pod | Whole VM, often whole node | Containers on shared hosts | K8s + Slurm pods |
| Sub-GPU partitioning | None (1 GPU = 1 slot) | MIG via K8s on some SKUs | MIG-backed instance shapes | Sometimes MIG | MIG natively |
| Isolation strength | Strong (VFIO + per-slot NVMe + per-slot IB VF) | Bare metal: full; K8s: cgroups + GPU operator | VM-level + Nitro/Hyper-V offload | Container-level (weaker) | Pod-level |
| Scheduler | Postgres slot table + Temporal | K8s scheduler | Internal placement | K8s / custom | K8s / Slurm |
| Network plane | OVS + iptables NAT + dnsmasq | CNI (Multus, Calico) + SR-IOV | Nitro / Andromeda / Azure SDN | CNI | CNI + GPUDirect |
| East-west fabric | IPoIB w/ per-slot SR-IOV VF | RDMA via SR-IOV | EFA / GPUDirect / NDR | Often none | NVLink + NDR |
| Topology awareness | NUMA only | NUMA + NVLink + rack | NUMA + NVLink + rack/spine | None | Full topology |
| Image pipeline | qemu-img convert + cloud-init | Container image (instant) | AMI / VHD baked | Docker image | Container |
| Confidential compute | No (loader_secure=no, no TPM) | Optional on some SKUs | Yes (Nitro Enclaves, CVM) | No | H100 CC available |
| Source of truth | Postgres | etcd | Internal databases | Mixed | etcd |
| Onboarding | Operator-approved slot map | Auto via K8s join | Fully automated | Auto | Operator + auto |
| Multi-node | Not supported | K8s pod groups, Slurm | UltraClusters / placement groups | Not typical | First-class |

What came out as requirements

  • VM-with-passthrough position deliberately chosen — stronger isolation than container clouds (RunPod, Vast); simpler ops than CoreWeave; less feature-rich than hyperscalers.
  • Per-slot dedicated NVMe + IB VF + VFIO GPU — unusual combination, justified by the slice product's strong-isolation promise.
  • Operator-approved slot map as a safety property — most boutique clouds skip this and pay later.
  • Wipe-policy required on slot — codifies an erase contract that most clouds bury in an SLA doc (see the sketch after this list).
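
As a sketch of the two safety properties above, the snippet below shows what a per-slot record and its validation could look like. It is illustrative only: the real slot map lives in a Postgres table, and the field names here (gpu_pci_addr, nvme_device, ib_vf, wipe_policy, approved_by) are assumptions, not the shipped schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    slot_id: str
    gpu_pci_addr: str                   # GPU handed to the guest via VFIO passthrough
    nvme_device: str                    # dedicated per-slot NVMe device
    ib_vf: str                          # per-slot InfiniBand SR-IOV virtual function
    wipe_policy: str                    # erase contract applied when the slot is released
    approved_by: Optional[str] = None   # operator who approved this slot-map entry


def validate_slot(slot: Slot) -> None:
    """Refuse to place a workload on a slot that violates either safety property."""
    if not slot.approved_by:
        raise ValueError(f"{slot.slot_id}: slot-map entry is not operator-approved")
    if not slot.wipe_policy:
        raise ValueError(f"{slot.slot_id}: a wipe policy is required on every slot")
```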

Read the Position vs other clouds page for the verifiable isolation-strength claims.


3. Armada Bridge gap matrix

Captured as of March 6, 2026 from the Armada Bridge product page.

What Armada claims

| Use case | Bridge description |
|---|---|
| AI Factory Orchestration | Deploy + scale large AI workloads with dynamic optimization across GPU clusters |
| GPU-as-a-Service | Provision resources, monetize capacity, deliver self-service cloud for end users |
| Platform-as-a-Service | Run AI models via APIs and dashboards with customizable compliance controls |

Five capability pillars Armada markets:

  1. Hard isolation for multi-tenant environments
  2. Elastic resource allocation across clusters
  3. GPU monetization and revenue generation
  4. Unified billing and real-time observability
  5. Air-gapped security, data sovereignty, regulatory compliance

NVIDIA certifications claimed: NCP, Spectrum-X, Quantum-2, Base Command Manager.

Gap matrix (GPUaaS vs Bridge)

flowchart LR
    subgraph BRIDGE[Armada Bridge marketing]
      P1[Hard multi-tenant isolation]
      P2[Elastic cross-cluster allocation]
      P3[GPU monetization]
      P4[Unified billing + observability]
      P5[Air-gapped + data sovereignty]
    end
    subgraph GPUAAS[GPUaaS status]
      G1[VFIO per-slot isolation<br/>IMPLEMENTED]
      G2[Single-region; cross-cluster<br/>DESIGNED]
      G3[SKU + ledger + Stripe<br/>IMPLEMENTED]
      G4[Ledger + OTel + Grafana +<br/>admin ops panel<br/>IMPLEMENTED]
      G5[Per-region resource ids<br/>tenant isolation<br/>IMPLEMENTED]
    end
    P1 --- G1
    P2 --- G2
    P3 --- G3
    P4 --- G4
    P5 --- G5

What came out as requirements

  • Self-service GPU cloud is in scope and table stakes per the comparison.
  • GPU monetization alignment: GPUaaS provides marketplace + admin SKU control as functional equivalents.
  • Unified billing + observability is a deliverable — already shipped (ledger + OTel + Grafana + admin ops panel).
  • Hard multi-tenant isolation — Bridge's marketing pillar maps to GPUaaS's per-slot VFIO model, considered a stronger position than container-based competitors.
  • Air-gapped / sovereign is a tracked workstream — the slice path supports private NAT and on-prem deployment; the broader regulatory compliance posture remains an enterprise expansion topic.

Source: GPUaaS_vs_Armada_Bridge_Gap_Matrix.md.


Cross-cutting takeaways

mindmap
  root((External comparisons<br/>→ requirements))
    Hierarchy
      Tenant ownership root
      Project operational scope
      Scoped policy chain
      Canonical resource ids
    GPU cloud position
      VM-with-passthrough
      Per-slot dedicated resources
      Operator-approved slot map
      Strong isolation as differentiator
    Competitive parity
      Self-service catalog
      Marketplace + monetization
      Unified billing + observability
      Multi-tenant isolation

Where to look next