Slice Networking Architecture v1¶
Purpose¶
Define the first networking model for GPUaaS `gpu_slice` allocations.
This document is a companion to
doc/architecture/Allocation_Capacity_Shapes_and_GPU_Slices_v1.md. The capacity
shape document owns placement, slot claims, SKU direction, and billing
semantics. This document owns the slice VM network planes, node-local IPAM,
public ingress, and console exposure rules.
This is a design proposal. It does not change API, schema, node-agent task, or networking implementation contracts until those contracts are updated.
Context¶
The H200 slice prototype in ~/Downloads/GPUaaS uses:
- one VM per slice;
- a management VM network connected to an OVS bridge;
- local DHCP on a private subnet such as `10.100.0.0/24`;
- host NAT for egress;
- direct GPU and IB PCI passthrough;
- VNC for manual console access.
Infrastructure feedback adds the production hardware direction:
- BF3 virtual functions can be split so each VM receives one management vNIC.
- OVS can attach those vNICs and enforce optional QoS per VM.
- The IB card is separate from the management network and does not require an IP address for RDMA.
- Public access requires NAT/firewall mapping to the VM management vNIC, or a controlled overlay such as Tailscale/Funnel while firewall ownership is being resolved.
Network Planes¶
Slice VMs need two distinct network planes.
Management/Public Plane¶
The management plane is the per-VM management vNIC: a BF3 virtual function (or equivalent attachment) connected to the node OVS bridge. It is used for:
- SSH and browser terminal access;
- app endpoint access;
- package install and image bootstrap;
- guest health checks;
- node-agent lifecycle coordination;
- public ingress when enabled.
Workload Fabric Plane¶
The workload fabric plane is the passthrough IB device (or, for future products, a RoCE-capable NIC/VF). It is used for high-performance workload traffic. It should not be treated as an internet or management access path.
For H200 v1, the IB device does not need an IP address. IPoIB can be added later for workloads that require it, but it is not a default dependency. AMD/RoCE slice products should model RoCE-capable NIC/VF resources explicitly because the validation and traffic path differ from IB.
Initial Modes¶
| Mode | Behavior | v1 stance |
|---|---|---|
| `private-nat` | VM has private management IP; node NAT handles egress; GPUaaS proxies inbound | default |
| `private-routed` | VM has routed private management IP reachable from platform network | optional site feature |
| `public-nat` | public IP/firewall DNAT maps to VM management vNIC | supported when firewall/IPAM control exists |
| `tailscale-overlay` | VM or node-side proxy exposes access through Tailscale/Funnel | dev/lab workaround, not the only production answer |
| `fabric-only` | workload fabric only, no management/public plane | not valid for normal GPUaaS slices |
Per-slice public IP is feasible, but it is an IPAM/firewall/NAT feature on the management vNIC. It is not an IB feature.
Slot Metadata Requirements¶
The scheduler and slot model should track both management and fabric resources:
- `network_device`: BF3 VF or equivalent management/public attachment identity;
- `mac_address`: stable per-slice VM MAC address or reservation;
- `private_ip`: intended private management IP when node-local IPAM is used;
- `network_metadata`: OVS bridge, QoS, VLAN, subnet, DHCP, or overlay details;
- `fabric_device_id`: IB/RDMA or RoCE workload fabric identity;
- `fabric_pci_addr`: passthrough PCI address when applicable;
- `fabric_metadata`: fabric type, validation evidence, and topology metadata.
Placement must validate the correct fabric for the SKU or app profile. It must not schedule a workload that expects IB semantics onto a RoCE-only or ethernet-only slot unless the product explicitly allows that.
MAC addresses must be globally unique within the managed L2 domain. The prototype's fixed `52:54:00:11:00:XX` pattern is not safe across multiple nodes. Use either:
- a platform-managed MAC pool; or
- a deterministic QEMU-local MAC such as `52:54:00` plus three bytes derived from `hash(node_id, slot_index)`.
The selected scheme must prevent cross-node collisions and remain stable for a slot unless the slot is reprovisioned by an admin topology update.
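The deterministic option above can be sketched as follows. This is illustrative, not a chosen scheme; note that three hash bytes give only 2^24 values, so collision checking across the managed L2 domain is still required:

```python
import hashlib

def slot_mac(node_id: str, slot_index: int) -> str:
    """Derive a stable QEMU-local MAC: the 52:54:00 locally-administered
    prefix plus three bytes hashed from (node_id, slot_index).

    Sketch only; a platform-managed MAC pool is the alternative, and any
    scheme must remain stable for a slot unless it is reprovisioned.
    """
    digest = hashlib.sha256(f"{node_id}/{slot_index}".encode()).digest()
    suffix = ":".join(f"{b:02x}" for b in digest[:3])
    return f"52:54:00:{suffix}"

# Same inputs always yield the same MAC, so the reservation is stable.
assert slot_mac("node-a", 0) == slot_mac("node-a", 0)
```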
Node-Local IPAM¶
For dev/lab parity with the prototype, a node may run a private subnet behind an OVS bridge and a local DHCP service.
The prototype uses dnsmasq leases as the source for actual IP discovery. GPUaaS should instead make the control plane own the intended reservation:
- reserve MAC/IP during slot approval or allocation provisioning;
- pass the intended network identity to node-agent;
- configure OVS/DHCP/NAT from typed node-agent tasks;
- read lease files or equivalent local state only as reconciliation evidence;
- report actual assigned IP and drift back to the control plane.
Node-agent should report {mac, expected_ip, actual_ip, lease_state} during VM
boot and reconciliation. If the lease does not match the expected MAC/IP within
the readiness timeout, the allocation should remain provisioning or fail with a
structured networking reason, and the slot should be marked degraded until
reconciled.
Default dev/lab ranges should be tight enough for the slot count. For an 8-slot H200 node, a private subnet can be larger, but the DHCP reservation range should normally reserve exactly the approved slot addresses rather than a broad unowned pool.
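A tight reservation range for an 8-slot node can be computed directly from the slot count rather than handing DHCP a broad pool. The offset and layout below are illustrative assumptions, not a decided addressing plan:

```python
import ipaddress

def slot_ips(subnet: str, slot_count: int, first_host_offset: int = 10) -> list[str]:
    """Reserve exactly one address per approved slot.

    Sketch only: the subnet may be larger (e.g. a /24), but the DHCP
    reservation range covers just the approved slot addresses.
    """
    net = ipaddress.ip_network(subnet)
    return [str(net.network_address + first_host_offset + i)
            for i in range(slot_count)]

# An 8-slot H200 node reserves eight addresses, not the whole /24.
assert slot_ips("10.100.0.0/24", 8)[0] == "10.100.0.10"
```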
Node-local IPAM state should be lifecycle-managed with the allocation. Release must remove stale DHCP, DNS, NAT, firewall, and overlay state before the slot is marked reusable.
For the current private-nat mode, control-plane health must validate live host
networking state, not only bootstrap artifacts. At minimum, node discovery
should surface:
- whether the expected OVS bridge is present;
- whether IP forwarding is enabled;
- whether the `10.100.0.0/24` MASQUERADE rule exists;
- which host uplink is the default route;
- whether BF3 `tmfifo` control connectivity is present.
Missing NAT must be treated as a hard scheduling blocker for private-nat
slots, because guests otherwise leak 10.100.x.x source addresses toward the
host uplink and can be dropped by upstream firewalls.
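Evaluating that discovery evidence can be sketched as a pure check over the reported host state. The report keys here are assumptions, since the node discovery payload is not yet a contract:

```python
def nat_blockers(discovery: dict) -> list[str]:
    """Evaluate a node discovery report for private-nat prerequisites.

    Illustrative keys. Missing NAT is a hard scheduling blocker: without
    the MASQUERADE rule, guests leak 10.100.x.x source addresses toward
    the host uplink and can be dropped by upstream firewalls.
    """
    blockers = []
    if not discovery.get("ovs_bridge_present"):
        blockers.append("missing_ovs_bridge")
    if not discovery.get("ip_forwarding"):
        blockers.append("ip_forwarding_disabled")
    if not discovery.get("masquerade_rule"):
        blockers.append("missing_masquerade_rule")  # hard blocker
    return blockers

# A fully healthy report produces no blockers.
assert nat_blockers({"ovs_bridge_present": True, "ip_forwarding": True,
                     "masquerade_rule": True}) == []
```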
Public Ingress¶
Public ingress must always terminate on or map to the management plane. It must not use the IB/RDMA fabric plane.
Supported implementation directions:
- public IP/firewall DNAT to the VM management vNIC;
- load balancer or reverse proxy to the private VM endpoint;
- Tailscale/Funnel or equivalent overlay for dev/lab environments;
- future routed private network integration when site networking supports it.
Ingress setup and teardown must be allocation-scoped and audited:
- reserve IP, DNS name, or overlay identity;
- apply NAT, firewall, proxy, or Funnel rule;
- publish endpoint metadata to the allocation/app read model;
- remove all exposure state on stop/release;
- reconcile stale mappings as drift.
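The "reconcile stale mappings as drift" step can be sketched as a set difference between allocation-owned exposure state and what is actually programmed on the node. The mapping identifier format is a made-up illustration:

```python
def reconcile_exposure(desired: set[str], actual: set[str]) -> dict[str, set[str]]:
    """Diff allocation-scoped exposure state against node-programmed state.

    Identifiers are illustrative, e.g. 'dnat:203.0.113.5->10.100.0.11'.
    'stale' entries are drift and must be removed; 'to_apply' entries are
    mappings still to program.
    """
    return {
        "to_apply": desired - actual,
        "stale": actual - desired,   # e.g. left over after stop/release
    }
```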
Future OVS Expansion¶
OVS should be treated as the extensibility point for future slice network segregation and controlled inter-slice communication. The v1 default remains simple private NAT with no cross-tenant east-west access by default, but the model should not block richer network products later.
Potential future capabilities:
- per-tenant or per-project private networks;
- inter-slice communication for slices in the same project or app topology;
- explicit deny-by-default isolation between unrelated tenants/projects;
- security-group style ingress/egress rules enforced at OVS or node firewall boundaries;
- per-vNIC QoS and rate limiting;
- routable private networks when the site network can carry project subnets;
- app-scoped service networks where only declared app endpoints are reachable.
Directional isolation model:
| Stage | Behavior | Notes |
|---|---|---|
| v1 private NAT | slices get node-local private IPs, no user-controlled east-west network | simplest safe default |
| project-local L2/L3 | slices in the same project can communicate on an isolated network | requires project network identity and OVS segmentation |
| security groups | platform applies declarative ingress/egress policy per allocation/app | requires API contract and rule reconciliation |
| routed private networks | project network is reachable across nodes through routed fabric or overlay | requires infra-owned routing/IPAM design |
| service network | apps declare named endpoints and the platform programs only required flows | aligns with app manifests and least privilege |
Implementation options for OVS segmentation include VLAN tags, isolated OVS bridges, OpenFlow rules, or an overlay controlled by a network agent. The choice should be made with infra based on BF3 capabilities, operational visibility, debuggability, and how public/firewall routing will be owned.
Initial product recommendation:
- do not expose user-managed networks in the first slice release;
- reserve data model room for `network_id`, `network_policy_id`, and `project_network_id` in future allocation/network read models;
- keep all cross-slice traffic denied unless a project/app network explicitly allows it;
- make OVS state reconciled and observable before adding self-service network features.
Console Access¶
The prototype exposes VNC on 0.0.0.0 with a predictable port derived from the
slice id. GPUaaS must not expose raw VNC directly from nodes.
If console access is needed, it should be admin-only for v1:
- console session is created through an authenticated gateway;
- authorization is short-lived and scoped to one VM/allocation;
- every start, connect, disconnect, and failure is audited;
- raw VNC ports are not reachable from customer networks;
- console access is separate from customer terminal access.
Customer terminal access should continue through the existing terminal gateway and allocation access model where possible.
Node-Agent Networking Tasks¶
Networking should be implemented as bounded typed tasks, not shell escape hatches.
Likely task responsibilities:
- verify OVS/BF3/VF/IP forwarding/NAT prerequisites;
- attach or validate the per-slice management vNIC;
- create or reconcile OVS port state;
- reserve or apply MAC/IP lease state;
- apply NAT/firewall/proxy/overlay exposure;
- report actual VM IP and endpoint state;
- poll SSH or image-specific management readiness after VM boot;
- remove exposure and lease state on release;
- emit drift evidence for admin node detail.
Task outputs should include structured state such as selected OVS port, MAC, IP, subnet, ingress mapping, and any drift reason. They should not depend on parsing ad hoc command output in the control plane.
Drift Signals¶
The node-agent should reconcile:
- VM vNIC exists and is attached to the expected OVS bridge or VF;
- MAC address matches the approved slot/allocation reservation;
- private IP lease matches intended state;
- NAT/firewall/proxy/overlay mappings match allocation exposure state;
- fabric device is attached to the expected VM;
- SSH or image-specific management readiness reached before allocation active;
- app endpoints are reachable according to exposure mode.
Networking drift should block only the affected slot or exposure unless the evidence indicates node-wide OVS/BF3/fabric failure.
Performance Tuning Backlog¶
The latest prototype benchmark shows acceptable GPU compute/HBM behavior but a large fabric delta: around 363 Gb/s IB write bandwidth on bare metal versus around 206 Gb/s from a one-GPU VM slice. Treat this as a networking performance workstream to investigate with infra after the basic slice lifecycle is working.
Areas to validate:
- MTU and link settings on the BF3/ConnectX management and fabric interfaces;
- RDMA device mode, queue count, queue depth, and completion queue sizing;
- VM NIC multi-queue settings and virt-install/libvirt XML for queue exposure;
- IRQ affinity and NUMA locality for passthrough GPU, fabric device, VM vCPUs, and memory;
- OVS datapath/offload mode and whether BF3 representor/VF offload is active;
- per-vNIC QoS policy and whether it throttles below expected product bandwidth;
- guest driver versions for OFED/RDMA and NVIDIA, including fabric manager expectations;
- benchmark command parity between bare metal and VM, including device, GID index, message size, and CPU binding.
Do not expose bandwidth guarantees for gpu_slice until the tuning profile and
acceptance threshold are documented. The scheduler can still use fabric
capability as a placement constraint, but product-level performance promises
should wait for repeatable benchmark evidence.
Open Questions¶
- Which node-local IPAM implementation should back private slice networking: dnsmasq, libvirt network DHCP, central DHCP/IPAM, or a small node-agent-owned allocator with reconciliation?
- Should BF3 VF allocation be static per slot, or can VFs be dynamically assigned during placement?
- What QoS policy is required per vNIC for the first H200 slice product?
- Which public ingress mode is production-default: firewall DNAT, load balancer, reverse proxy, or routed private endpoint?
- What is the admin console gateway implementation: noVNC, SPICE proxy, SSH to guest, serial console, or another controlled path?
- How should Tailscale/Funnel identities be named and retired for dev/lab allocations?
- Do we allocate MAC addresses from a central pool or use deterministic node-id/slot-index derivation?
- Is SSH reachability always the management readiness signal, or should the network task support image-specific readiness ports?
- Which OVS segmentation primitive should infra standardize on for future project networks: VLAN, isolated bridge, OpenFlow rules, or overlay?
- Should inter-slice communication be project-scoped, app-scoped, or both?
- Where should security-group style rules be enforced: OVS, nftables/iptables, upstream firewall, or a combination?
Non-Goals For v1¶
- Treating IB/RDMA as the public internet path.
- Exposing raw VNC ports directly from nodes.
- Cross-node slice networking.
- Live migration of PCI-passthrough slice VMs.
- Building a generic customer-managed virtual network product.
- Self-service tenant networking or user-authored security groups.