
Hierarchical Placement and Node Scheduler v1

Purpose

GPUaaS placement is moving from a single control-plane scheduler to a hierarchical placement model. This is not a separate product scheduler. It is the internal control boundary needed to place GPU, fabric, storage, CPU, memory, network, and runtime resources safely at the layer that has enough information to make each decision.

The platform direction is:

Region scheduler
  -> Cluster scheduler
    -> Node scheduler
      -> Runtime executor

Each layer narrows intent without leaking lower-level implementation details upward.

The GPU slice decisions that currently bind this model are recorded in GPU_Slice_End_to_End_Readiness_Decisions_v1.md. In particular, v1 whole-GPU VM slices use per-slot VF-backed fabric and same-node multi-GPU placement.

Layer Responsibilities

Region Scheduler

The region scheduler chooses the broad placement region for an allocation or app runtime.

It owns:

  1. user-visible region choice;
  2. project and tenant policy;
  3. region availability;
  4. price and product availability by region;
  5. future latency, data-residency, and quota constraints.

It must not know PCI addresses, NVMe devices, NUMA topology, or host-local driver state.

Cluster Scheduler

The cluster scheduler chooses a capacity pool or cluster inside a region.

Examples:

  1. H200 bare-metal pool;
  2. H200 GPU VM slice pool;
  3. app-runtime pool;
  4. MAAS-managed bare-metal pool;
  5. future scheduler-backed pools.

It owns:

  1. pool eligibility;
  2. node class and SKU compatibility;
  3. capacity-shape compatibility such as baremetal or gpu_slice;
  4. broad packing policy across nodes;
  5. preserving bare-metal optionality when slice pools and bare-metal pools share physical nodes.

It must not decide the exact GPU, IB/RDMA, NVMe, MAC/IP, or CPU-set tuple for a slice VM.
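A minimal sketch of the eligibility boundary this layer owns, assuming hypothetical Pool and PlacementRequest shapes; the field names are illustrative and not the real control-plane schema:

# Hypothetical sketch of cluster-level pool eligibility. Field names are
# assumptions, not the actual control-plane schema.
from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    capacity_shape: str   # e.g. "baremetal" or "gpu_slice"
    node_class: str       # e.g. "h200"

@dataclass
class PlacementRequest:
    capacity_shape: str
    node_class: str

def eligible_pools(pools, request):
    """Filter pools by capacity shape and node class; broad packing policy
    and bare-metal optionality are applied after this eligibility pass."""
    return [
        pool for pool in pools
        if pool.capacity_shape == request.capacity_shape
        and pool.node_class == request.node_class
    ]

The sketch deliberately stops at pool eligibility: device-level selection belongs to the node scheduler below.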

Node Scheduler

The node scheduler is the host-local placement authority. It runs in, or behind, the node-agent boundary because only the node can reliably validate current host-local topology and runtime state.

It owns:

  1. building schedulable bundles from local inventory;
  2. selecting GPU PCI devices;
  3. selecting fabric devices such as IB/RDMA PCI functions;
  4. selecting raw NVMe devices approved for slice use;
  5. selecting CPU and memory quantities or CPU sets by runtime profile;
  6. selecting or validating MAC/IP identities;
  7. validating NUMA and topology policy;
  8. validating host driver state, IOMMU groups, mounted storage state, and passthrough readiness;
  9. taking a short host-local lease so concurrent node tasks cannot select the same physical resource;
  10. returning the selected bundle set to the control plane for durable claim recording.

The node scheduler must not decide user quota, billing policy, tenant policy, or cross-region placement.

The node scheduler must take a short host-local lease before returning a plan. The lease is not the durable allocation record; it is a same-node race guard until the controller records allocation_resource_claims. Lease records should live under /var/lib/gpuaas/node-scheduler/leases and include allocation ID, task ID, slot IDs, device identities, and expiry.
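A minimal sketch of that host-local race guard, assuming one JSON lease file per slot under the directory above; the file naming and record fields are assumptions, not a fixed format:

# Hypothetical per-slot lease guard. O_CREAT | O_EXCL creation is atomic, so
# two concurrent node tasks cannot lease the same slot. Expired-lease cleanup
# is omitted from this sketch.
import json
import os
import time

LEASE_DIR = "/var/lib/gpuaas/node-scheduler/leases"

def take_slot_leases(allocation_id, task_id, slots, ttl_seconds=120):
    """Write one lease file per slot; roll back partial leases on contention."""
    os.makedirs(LEASE_DIR, exist_ok=True)
    taken = []
    try:
        for slot in slots:  # slot: dict with slot_id and device identities
            path = os.path.join(LEASE_DIR, f"slot-{slot['slot_id']}.json")
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
            with os.fdopen(fd, "w") as f:
                json.dump({
                    "allocation_id": allocation_id,
                    "task_id": task_id,
                    "slot": slot,
                    "expires_at": time.time() + ttl_seconds,
                }, f, indent=2)
            taken.append(path)
    except FileExistsError:
        for path in taken:  # another task already holds a slot: undo ours
            os.unlink(path)
        raise
    return taken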

Runtime Executor

The runtime executor is the bounded node-agent task implementation that turns an approved node-scheduler plan into a running workload.

For gpu_slice, this includes:

  1. cloning or preparing the slice VM image;
  2. passing through selected GPU and fabric devices;
  3. attaching selected raw disk devices;
  4. applying cloud-init and SSH access;
  5. waiting for readiness gates such as SSH, cloud-init completion, and nvidia-smi;
  6. releasing and wiping slot resources before reuse.

The executor must consume a concrete node-scheduler plan. It should not invent a new placement decision after the plan is accepted.
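A minimal readiness-gate sketch for the gpu_slice case, assuming SSH access to the guest; the gate commands and the default SSH user are assumptions, not the executor's actual wiring:

# Hypothetical readiness polling for a slice VM: SSH reachability, cloud-init
# completion, and GPU visibility via nvidia-smi, checked until a deadline.
import subprocess
import time

def wait_for_gates(vm_ip, ssh_user="ubuntu", timeout_s=900, interval_s=15):
    """Return True once all readiness gates pass, False on timeout."""
    gates = [
        ["true"],                            # SSH reachable and usable
        ["cloud-init", "status", "--wait"],  # cloud-init finished
        ["nvidia-smi", "-L"],                # passed-through GPUs visible
    ]
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if all(_ssh_ok(vm_ip, ssh_user, cmd) for cmd in gates):
            return True
        time.sleep(interval_s)
    return False

def _ssh_ok(vm_ip, ssh_user, cmd):
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
         f"{ssh_user}@{vm_ip}", *cmd],
        capture_output=True,
    )
    return result.returncode == 0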

GPU Slice Bundle Model

For H200 GPU VM slices, one slice unit is a complete host-local bundle:

1 GPU + 1 IB/RDMA device + 1 raw NVMe device + CPU + memory + MAC/IP

A request for multiple GPUs claims multiple bundles on the same node. For a 2-GPU VM, the node scheduler should select two compatible bundles and the runtime executor should create one VM with both sets of devices attached.

Example 2-GPU plan:

{
  "capacity_shape": "gpu_slice",
  "gpu_count": 2,
  "bundles": [
    {
      "slot_index": 1,
      "gpu_pci": "0000:3c:00.0",
      "fabric_parent_pci": "0000:3a:00.0",
      "fabric_vf_pci": "0000:3a:00.2",
      "nvme_device": "/dev/disk/by-id/...",
      "numa_node": 0,
      "vcpu_count": 24,
      "memory_mib": 65536,
      "mac_address": "52:54:...",
      "private_ip": "10.100.0.11"
    },
    {
      "slot_index": 2,
      "gpu_pci": "0000:4b:00.0",
      "fabric_parent_pci": "0000:4d:00.0",
      "fabric_vf_pci": "0000:4d:00.2",
      "nvme_device": "/dev/disk/by-id/...",
      "numa_node": 0,
      "vcpu_count": 24,
      "memory_mib": 65536,
      "mac_address": "52:54:...",
      "private_ip": "10.100.0.12"
    }
  ]
}

The bundle model is intentionally stricter than raw GPU counting. A node with free GPUs is not schedulable for a slice request unless it can produce enough complete, topology-compatible bundles.

For v1, fabric_vf_pci or equivalent isolated fabric attachment is required for each bundle. A parent BF/IB device without an assigned VF is not sufficient to approve concurrent GPU slice capacity.
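A minimal feasibility sketch of this rule, assuming slot records shaped like the plan example above; the field names are assumptions about the inventory shape, not a schema:

# Hypothetical feasibility check: count complete, v1-compatible bundles on a
# node rather than counting free GPUs.
def complete_bundles(slots):
    """Return the slots that form complete bundles under the v1 rules."""
    usable = []
    for slot in slots:
        required = ("gpu_pci", "fabric_vf_pci", "nvme_device",
                    "mac_address", "private_ip")
        if any(not slot.get(key) for key in required):
            continue  # any missing element means the bundle is incomplete
        if slot.get("fabric_claim_mode") != "per_slot_vf":
            continue  # a parent BF/IB device alone does not approve concurrency
        usable.append(slot)
    return usable

def schedulable(slots, gpu_count):
    """A slice request fits only if enough complete bundles exist on one node."""
    return len(complete_bundles(slots)) >= gpu_count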

Control-Plane Claim Model

The control plane remains the durable source of truth for allocations and claims. The node scheduler is the host-local authority for plan validity.

The intended flow is:

  1. API receives an allocation request.
  2. Region scheduler selects the region or validates the requested region.
  3. Cluster scheduler selects candidate pools and nodes.
  4. Control plane creates a placement intent for one candidate node and requested GPU count.
  5. Node scheduler plans and leases concrete bundles on that node.
  6. Control plane records the returned bundle claims in allocation_resource_claims and marks selected slot rows reserved.
  7. Runtime executor creates the VM from the accepted plan.
  8. Node-agent reports readiness.
  9. Control plane transitions the allocation to active.

If node scheduling fails because the node-local state changed, the control plane may retry another node or return sku_unavailable with a topology-specific reason.
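A minimal orchestration sketch of this flow; every layer is passed in as a callable, and all names, signatures, and the NodeStateChanged error are placeholders rather than the real control-plane API:

# Hypothetical end-to-end flow for one allocation attempt, following steps 1-9.
class NodeStateChanged(Exception):
    """Raised when node-local state no longer matches the placement intent."""

def place_gpu_slice(request, region_sched, cluster_sched, control_plane,
                    node_sched, executor):
    region = region_sched.select_region(request)                        # step 2
    for node in cluster_sched.select_candidate_nodes(region, request):  # step 3
        intent = control_plane.create_intent(node, request)             # step 4
        try:
            plan = node_sched.plan_and_lease(node, intent)              # step 5
        except NodeStateChanged:
            continue                       # node-local state changed: try another node
        control_plane.record_claims(plan)                               # step 6
        executor.create_vm(node, plan)                                  # step 7
        control_plane.await_readiness(plan)                             # step 8
        control_plane.activate(plan)                                    # step 9
        return plan
    return control_plane.reject(request, reason="sku_unavailable")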

Transitional Implementation

The current implementation stores approved slice bundles in node_resource_slots and has the control plane select slot IDs directly. This is acceptable as a bootstrap path, but it should be treated as transitional.

Near-term changes:

  1. keep node-agent topology discovery conservative: it may report parent fabric devices for operator context, but schedulable candidates require per-slot VF identity;
  2. add a node-agent slice.plan task, or equivalent planning phase inside slice.vm_create, that validates and returns the concrete bundle set before VM creation;
  3. add a host-local lease directory such as /var/lib/gpuaas/node-scheduler/leases to prevent same-node races;
  4. keep node_resource_slots as the approved operator inventory and durable claim target, but let node-agent validate that the selected rows still match real host state;
  5. update marketplace and admin capacity surfaces to distinguish raw available slots from node-scheduler-compatible capacity.

Until that planning layer exists, the control plane must remain conservative for fabric placement. GPU VM slices are schedulable only when approved slot metadata sets fabric_claim_mode=per_slot_vf and provides a non-empty fabric_vf_pci_address. fabric_device can remain the parent BF/fabric device for operator context, but duplicated parent values across slots do not, on their own, approve concurrent slice capacity.

Historical j22u05 Dev Profile

j22u05 was used for early GPU slice development and is now being returned. Do not use it as the active slice validation target. Keep this section as historical topology evidence only.

Its dev profile uses one H200 GPU, one IB/RDMA device, and one raw NVMe device per slice slot.

The intended static bundle profile is:

Slot   GPU PCI        Fabric PCI
0      0000:1b:00.0   0000:1a:00.0
1      0000:3c:00.0   0000:3a:00.0
2      0000:4b:00.0   0000:4d:00.0
3      0000:5c:00.0   0000:5d:00.0
4      0000:9a:00.0   0000:9b:00.0
5      0000:bb:00.0   0000:ba:00.0
6      0000:cd:00.0   0000:ca:00.0
7      0000:dc:00.0   0000:db:00.0

This profile should not be blindly copied to the current test nodes. Use it as a shape example only; current slot approval must come from each node's topology discovery, MAAS profile evidence, and per-slot VF metadata.

If j22u15 or j22u11 metadata maps the same parent fabric device across slots and no per-slot VF identity is present, treat the node as not schedulable for GPU VM slices. Do not advertise eight concurrent slices until the node-local fabric model is corrected and every slot has fabric_claim_mode=per_slot_vf plus a unique fabric_vf_pci_address.
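A minimal sketch of that guard, assuming slot metadata rows carrying fabric_claim_mode and fabric_vf_pci_address fields; the field names follow this document, but the row shape is an assumption:

# Hypothetical node-level guard: every approved slot must use per-slot VF mode
# and carry a unique, non-empty VF address before slices are advertised.
def node_slice_schedulable(slots):
    vf_addresses = []
    for slot in slots:
        if slot.get("fabric_claim_mode") != "per_slot_vf":
            return False          # parent-only fabric mapping: not schedulable
        vf = slot.get("fabric_vf_pci_address")
        if not vf:
            return False          # missing per-slot VF identity
        vf_addresses.append(vf)
    # Duplicate parent devices are fine for operator context, but duplicate VF
    # addresses would mean slots contend for the same fabric function.
    return len(vf_addresses) > 0 and len(vf_addresses) == len(set(vf_addresses))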

Non-Goals

This document does not introduce a generic workload scheduler framework. It only defines placement responsibility boundaries for GPUaaS capacity allocation.

This document does not make node-agent a generic remote shell. Node scheduling and runtime execution must remain bounded typed-task behavior with explicit inputs, outputs, validation, and auditability.