
External clouds & products

Decided

Source: doc/architecture/Cloud_Hierarchy_Comparison.md · doc/architecture/Allocation_Capacity_Shapes_and_GPU_Slices_v1.md · doc/product/GPUaaS_vs_Armada_Bridge_Gap_Matrix.md · GPU-cloud landscape research

1. Cloud hierarchy (AWS / GCP / Azure / Nebius)

flowchart TB
    classDef gpa fill:#e8f5e9,stroke:#2e7d32
    classDef ext fill:#e3f2fd,stroke:#1565c0

    G[GPUaaS<br/>tenant → project → resource]:::gpa
    AWS[AWS<br/>Org / OU / Account / Resource]:::ext
    GCP[GCP<br/>Org / Folder / Project / Resource]:::ext
    AZ[Azure<br/>Mgmt Group / Sub / RG / Resource]:::ext
    NB[Nebius<br/>Tenant / Project / Resource]:::ext

    NB ---|closest semantic match| G
    GCP ---|two-level→three-level mapping| G
    AWS ---|account = our project| G
    AZ ---|subscription = our project| G

Mapping table

| Concept | AWS | GCP | Azure | Nebius | GPUaaS |
|---|---|---|---|---|---|
| Ownership root | Org / Management Account | Organization | Tenant + Mgmt Group root | Tenant | Tenant (organizations) |
| Operational scope | Account | Project | Subscription / RG | Project | Project |
| Membership | IAM principals | IAM principals | Entra principals + RBAC | Tenant/project groups | tenant_memberships + project_memberships |
| Policy model | IAM + SCP | IAM + org policy | RBAC + policy initiatives | Group + role | global → tenant → project chain |
| Billing anchor | Account / payer | Billing account + org/project | Subscription | Tenant + quotas | Tenant (organizations.stripe_customer_id) |
| User as owner-of-record? | No | No | No | No | No (requested_by_user_id attribution only) |

What came out as requirements

  • Tenant as ownership root, not user. Resources survive user churn.
  • Project as operational scope inside tenant, mirroring GCP/Nebius.
  • Membership-table-driven access, not user-as-owner.
  • Scoped policy chain global → tenant → project with most-specific-wins, modeled on GCP IAM inheritance and Azure RBAC scope.
  • Canonical resource identifier core42:aicloud:{region}:{tenant_id}:{project_id}:{resource_type}:{resource_id} — directly inspired by ARN / GCP resource-names / Azure resource-id patterns (a parsing sketch follows this list).
  • MVP constraint UNIQUE(user_id) on tenant_memberships (a single tenant per user); multi-tenant deferred but designed for.
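
A minimal Python sketch of how two of these requirements could fit together: parsing the canonical identifier and resolving a policy with most-specific-wins over the global → tenant → project chain. This is an illustration, not the shipped implementation; the names ResourceId and resolve_policy and the scope-key format ("project:<id>", "tenant:<id>", "global") are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceId:
    """core42:aicloud:{region}:{tenant_id}:{project_id}:{resource_type}:{resource_id}"""
    region: str
    tenant_id: str
    project_id: str
    resource_type: str
    resource_id: str

    PREFIX = ("core42", "aicloud")  # constant; no annotation, so not a dataclass field

    @classmethod
    def parse(cls, rid: str) -> "ResourceId":
        parts = rid.split(":")
        if len(parts) != 7 or tuple(parts[:2]) != cls.PREFIX:
            raise ValueError(f"not a canonical resource id: {rid}")
        return cls(*parts[2:])

    def __str__(self) -> str:
        return ":".join((*self.PREFIX, self.region, self.tenant_id,
                         self.project_id, self.resource_type, self.resource_id))


def resolve_policy(rid: ResourceId, policies: dict) -> Optional[dict]:
    """Most-specific-wins: a project-scoped policy beats a tenant-scoped one,
    which beats a global one."""
    for scope in (f"project:{rid.project_id}", f"tenant:{rid.tenant_id}", "global"):
        if scope in policies:
            return policies[scope]
    return None
```

parse() and str() round-trip, so one representation can serve both policy lookups and audit attribution; any concrete region or id values used with this sketch would be invented.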

Source: Cloud_Hierarchy_Comparison.md.


2. GPU-cloud landscape

quadrantChart
    title GPU cloud isolation vs operational simplicity
    x-axis "Operationally simple" --> "Operationally rich"
    y-axis "Weaker isolation" --> "Stronger isolation"
    quadrant-1 "Hyperscaler HPC"
    quadrant-2 "GPUaaS sweet spot"
    quadrant-3 "Boutique container clouds"
    quadrant-4 "Bare metal direct"
    "RunPod / Vast / Together": [0.25, 0.25]
    "TensorDock / FluidStack":   [0.30, 0.30]
    "Lambda 1-Click":            [0.45, 0.55]
    "CoreWeave (K8s)":           [0.70, 0.55]
    "DGX Cloud":                 [0.80, 0.65]
    "AWS / Azure / GCP HPC":     [0.85, 0.80]
    "GPUaaS slice":              [0.55, 0.78]

Side-by-side

| Dimension | GPUaaS (today) | CoreWeave / Lambda | AWS / Azure / GCP HPC | RunPod / Vast / TensorDock | DGX Cloud |
|---|---|---|---|---|---|
| Tenancy unit | Per-slot VM with passthrough | Bare-metal node or K8s pod | Whole VM, often whole node | Containers on shared hosts | K8s + Slurm pods |
| Sub-GPU partitioning | None (1 GPU = 1 slot) | MIG via K8s on some SKUs | MIG-backed instance shapes | Sometimes MIG | MIG natively |
| Isolation strength | Strong (VFIO + per-slot NVMe + per-slot IB VF) | Bare metal: full; K8s: cgroups + GPU operator | VM-level + Nitro/Hyper-V offload | Container-level (weaker) | Pod-level |
| Scheduler | Postgres slot table + Temporal | K8s scheduler | Internal placement | K8s / custom | K8s / Slurm |
| Network plane | OVS + iptables NAT + dnsmasq | CNI (Multus, Calico) + SR-IOV | Nitro / Andromeda / Azure SDN | CNI | CNI + GPUDirect |
| East-west fabric | IPoIB w/ per-slot SR-IOV VF | RDMA via SR-IOV | EFA / GPUDirect / NDR | Often none | NVLink + NDR |
| Topology awareness | NUMA only | NUMA + NVLink + rack | NUMA + NVLink + rack/spine | None | Full topology |
| Image pipeline | qemu-img convert + cloud-init | Container image (instant) | AMI / VHD baked | Docker image | Container |
| Confidential compute | No (loader_secure=no, no TPM) | Optional on some SKUs | Yes (Nitro Enclaves, CVM) | No | H100 CC available |
| Source of truth | Postgres | etcd | Internal databases | Mixed | etcd |
| Onboarding | Operator-approved slot map | Auto via K8s join | Fully automated | Auto | Operator + auto |
| Multi-node | Not supported | K8s pod groups, Slurm | UltraClusters / placement groups | Not typical | First-class |

What came out as requirements

  • VM-with-passthrough position deliberately chosen — stronger isolation than container clouds (RunPod, Vast); simpler ops than CoreWeave; less feature-rich than hyperscalers.
  • Per-slot dedicated NVMe + IB VF + VFIO GPU — unusual combination, justified by the slice product's strong-isolation promise.
  • Operator-approved slot map as a safety property — most boutique clouds skip this and pay later.
  • Wipe-policy required on slot — codifies an erase contract that most clouds bury in an SLA doc (see the sketch after this list).
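
As a sketch of the two safety properties above, the snippet below shows what a per-slot record and its validation could look like. It is illustrative only: the real slot map lives in a Postgres table, and the field names here (gpu_pci_addr, nvme_device, ib_vf, wipe_policy, approved_by) are assumptions, not the shipped schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    slot_id: str
    gpu_pci_addr: str                   # GPU handed to the guest via VFIO passthrough
    nvme_device: str                    # dedicated per-slot NVMe device
    ib_vf: str                          # per-slot InfiniBand SR-IOV virtual function
    wipe_policy: str                    # erase contract applied when the slot is released
    approved_by: Optional[str] = None   # operator who approved this slot-map entry


def validate_slot(slot: Slot) -> None:
    """Refuse to place a workload on a slot that violates either safety property."""
    if not slot.approved_by:
        raise ValueError(f"{slot.slot_id}: slot-map entry is not operator-approved")
    if not slot.wipe_policy:
        raise ValueError(f"{slot.slot_id}: a wipe policy is required on every slot")
```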

Read the Position vs other clouds page for the verifiable isolation-strength claims.


3. Armada Bridge gap matrix

Captured as of March 6, 2026 from the Armada Bridge product page.

What Armada claims

| Use case | Bridge description |
|---|---|
| AI Factory Orchestration | Deploy + scale large AI workloads with dynamic optimization across GPU clusters |
| GPU-as-a-Service | Provision resources, monetize capacity, deliver self-service cloud for end users |
| Platform-as-a-Service | Run AI models via APIs and dashboards with customizable compliance controls |

Five capability pillars Armada markets:

  1. Hard isolation for multi-tenant environments
  2. Elastic resource allocation across clusters
  3. GPU monetization and revenue generation
  4. Unified billing and real-time observability
  5. Air-gapped security, data sovereignty, regulatory compliance

NVIDIA certifications claimed: NCP, Spectrum-X, Quantum-2, Base Command Manager.

Gap matrix (GPUaaS vs Bridge)

flowchart LR
    subgraph BRIDGE[Armada Bridge marketing]
      P1[Hard multi-tenant isolation]
      P2[Elastic cross-cluster allocation]
      P3[GPU monetization]
      P4[Unified billing + observability]
      P5[Air-gapped + data sovereignty]
    end
    subgraph GPUAAS[GPUaaS status]
      G1[VFIO per-slot isolation<br/>IMPLEMENTED]
      G2[Single-region; cross-cluster<br/>DESIGNED]
      G3[SKU + ledger + Stripe<br/>IMPLEMENTED]
      G4[Ledger + OTel + Grafana +<br/>admin ops panel<br/>IMPLEMENTED]
      G5[Per-region resource ids<br/>tenant isolation<br/>IMPLEMENTED]
    end
    P1 --- G1
    P2 --- G2
    P3 --- G3
    P4 --- G4
    P5 --- G5

What came out as requirements

  • Self-service GPU cloud is in scope and table stakes per the comparison.
  • GPU monetization alignment: GPUaaS provides marketplace + admin SKU control as functional equivalents.
  • Unified billing + observability is a deliverable — already shipped (ledger + OTel + Grafana + admin ops panel).
  • Hard multi-tenant isolation — Bridge's marketing pillar maps to GPUaaS's per-slot VFIO model, considered a stronger position than container-based competitors.
  • Air-gapped / sovereign is a tracked workstream — the slice path supports private NAT and on-prem deployment; the broader regulatory compliance posture remains an enterprise expansion topic.

Source: GPUaaS_vs_Armada_Bridge_Gap_Matrix.md.


Cross-cutting takeaways

mindmap
  root((External comparisons<br/>→ requirements))
    Hierarchy
      Tenant ownership root
      Project operational scope
      Scoped policy chain
      Canonical resource ids
    GPU cloud position
      VM-with-passthrough
      Per-slot dedicated resources
      Operator-approved slot map
      Strong isolation as differentiator
    Competitive parity
      Self-service catalog
      Marketplace + monetization
      Unified billing + observability
      Multi-tenant isolation

Where to look next