GPU Slice Implementation Checklist v1
Purpose
Turn the reviewed GPU slice architecture into an implementation sequence.
This checklist coordinates:
- doc/architecture/Allocation_Capacity_Shapes_and_GPU_Slices_v1.md;
- doc/architecture/Slice_Networking_Architecture_v1.md;
- doc/product/GPU_Slice_UX_Product_Reference_Notes_v1.md;
- doc/api/openapi.draft.yaml;
- doc/architecture/db_schema_v1.sql;
- node-agent, scheduler, admin inventory, and app-runtime follow-up work.
The work should stay contract-first. Any API/event shape change starts in OpenAPI/AsyncAPI before service implementation.
Current Decision Record
Use doc/architecture/GPU_Slice_End_to_End_Readiness_Decisions_v1.md as the
current source of truth for the decisions that close the architecture/product
discussion before end-to-end testing:
- v1 fabric model is per-slot VF-backed, not host-owned shared fabric;
- multi-GPU slice allocations must fit on one node;
- controller scheduler chooses region/pool/candidate node while node scheduler validates host-local bundles and leases them;
- MAAS tags express profile intent, while readiness evidence and approved slots decide schedulability;
- allocation timeline pagination is enough for immediate testing, but the product target is a first-class task read model;
- launch UX is one H200 wizard that branches between baremetal and GPU slice.
Control-Plane Start Gate
Control-plane implementation can start now. Earlier validation used j22u05
for firmware readiness, KVM/IOMMU, OVS private NAT, RShim, node-agent task
execution, and admin-visible slot/topology read models. j22u05 is no longer
part of the active slice test pool. Current end-to-end slice validation should
use j22u15 first, then j22u11.
Non-blocking infra items to keep visible while control-plane work proceeds:
- MAAS tag assignment and firmware profile operating mode: MAAS standard commissioning stays in place; `gpuaas-profile-slice-vm` and `gpuaas-profile-baremetal` express intent; deploy-time cloud-init runs the one-time firmware profile check/apply helper only for the selected profile.
- BF3 networking profile: v1 is per-slot VF-backed. Keep duplicate parent fabric devices conservative until every schedulable slot has unique `fabric_vf_pci_address` metadata.
- Node-local IPAM/public access stance: private NAT by default, public NAT or overlay only after firewall/IPAM ownership is clear.
- Host bootstrap handoff: MAAS deploy userdata, cloud-init, or operator-run bootstrap for `scripts/ops/gpuaas_slice_host_deploy_bootstrap.sh`.
- Slice VM image source, digest verification authority, and initial `H200-ubuntu-24.04-gpuaas-slice` image naming.
- Tenant isolation wipe policy: `blkdiscard`, `nvme format`, or site-specific secure erase path before any released slot can be reused.
- Final slot profile values, starting from the tested `1 GPU = 24 vCPU / 64 GiB` profile but keeping it SKU/profile-driven.
- Fractional/shared GPU products from the Cisco-style direction are design-only for now. The backend model should leave room for them, but schedulable v1 slices remain exclusive whole-GPU VM slots until a specific mechanism is selected and validated.
- Raw NVMe slot ownership for `j22u05`: the first runtime validation found mounted host share partitions under devices that were approved as slice disks. Treat this as a convertible baremetal/share-to-slice storage pool: keep slots `cleanup_blocked` until infra drains baremetal/share use, unmounts or remaps the devices, approves destructive slice ownership, and reruns topology discovery.
- Tuned slice profile validation from `GPUaaS_0416`: hugepages, persistent VFIO binding, Secure Boot disabled, host/guest network tuning, and fast raw image clone are the current prototype performance model, but storage unmounting must become an infra-approved disk ownership profile rather than an automated blind unmount. The current comparison record is `doc/operations/evidence/GPUaaS_0416_Tuned_Profile_Comparison_2026-04-18.md`.
- Baremetal preservation: scheduler packing should prefer already-sliced nodes for new slice requests so clean nodes remain available for baremetal. Freeing a sliced node for baremetal is v1 drain/natural release or later stop-and-recreate evacuation, not transparent live migration.
These items block schedulable production slices, not Phase 1 through Phase 4 control-plane groundwork. Slice scheduling must remain disabled until approved slots exist and runtime validation is wired.
Current Non-Infra Work Queue
These tasks can move while waiting for infra tag/profile answers and end-to-end MAAS validation:
- Implement the node-scheduler boundary behind node-agent: accept approved slot
IDs from the control plane, validate current host state, take host-local
leases, and return a concrete bundle plan before VM create. The required
lease/plan semantics are defined in
`GPU_Slice_End_to_End_Readiness_Decisions_v1.md` (a boundary sketch follows this list).
- Productize slice images: keep a small Ubuntu base image and add a separate NVIDIA/CUDA developer image with driver/tooling readiness checks. Do not install heavy GPU toolchains on every base image.
- Consolidate catalog and launch UX: one H200 launch path should branch inside the wizard into baremetal versus GPU VM slice, with region, SSH, storage, and network choices staying in-flow.
- Add the region selector now even when only one region is enabled; the UI should show disabled/unavailable future regions without requiring a later navigation redesign.
- Quantify platform-control deploy time and split the release paths into fast dev deploy, targeted component deploy, and full verification. The target for normal dev iteration is 2-3 minutes.
- Keep improving provisioning task read models beyond allocation timeline so MAAS onboarding, slice placement, image clone, VM boot, SSH readiness, and cleanup states are visible without direct DB inspection.
- Keep `gpuaas-api-docs-redoc` non-blocking for launch unless CI proves it breaks the developer docs path; if it remains flaky, move it to a cached or static artifact path.
- Follow up on SDK codegen failure `742` as a release-quality issue, but do not let it block slice runtime work unless generated API artifacts are stale.
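The node-scheduler boundary can be pictured as a small host-local planning interface. The Go sketch below is illustrative only: the type and method names (`NodeScheduler`, `SlotLease`, `BundlePlan`, `PlanBundle`) are assumptions for this checklist, and the authoritative lease/plan semantics remain in `GPU_Slice_End_to_End_Readiness_Decisions_v1.md`.

```go
// Hypothetical node-scheduler boundary behind node-agent. All names here are
// illustrative; the real lease/plan contract is defined in
// GPU_Slice_End_to_End_Readiness_Decisions_v1.md.
package nodescheduler

import (
	"context"
	"time"
)

// SlotLease is a host-local, time-bounded hold on one approved slot.
type SlotLease struct {
	SlotID    string
	ExpiresAt time.Time
}

// BundlePlan is the concrete per-allocation device plan returned before VM create.
type BundlePlan struct {
	AllocationID string
	Leases       []SlotLease
	GPUPCIAddrs  []string // VFIO passthrough targets
	NVMeDevices  []string // raw slice disks
	FabricVFPCI  []string // per-slot VF-backed fabric devices
	MgmtMAC      string
	MgmtIP       string
}

// NodeScheduler validates control-plane slot choices against live host state.
type NodeScheduler interface {
	// PlanBundle accepts approved slot IDs from the controller scheduler,
	// re-validates current host state, takes host-local leases, and returns
	// a concrete bundle plan; on error nothing durable is leased.
	PlanBundle(ctx context.Context, allocationID string, slotIDs []string) (*BundlePlan, error)

	// ReleaseLeases frees host-local leases when provisioning is aborted.
	ReleaseLeases(ctx context.Context, allocationID string) error
}
```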
Phase 1: Contract And Schema Groundwork
Goal: make the data model capable of representing baremetal and slice placement without changing the current baremetal runtime path.
Tasks:
- Add `capacity_shape` and placement/read-model fields to allocation contracts (an illustrative type sketch follows this list).
- Add `node_resource_slots` for approved schedulable host-local bundles.
- Add `allocation_resource_claims` for durable allocation-to-node/slot binding.
- Keep `allocations.node_id` as a compatibility/read-model field.
- Relax node-active uniqueness so it applies to baremetal allocations only.
- Add admin/public read-model fields for aggregate slot availability without exposing raw PCI details to users.
- Define slots as generic schedulable accelerator/resource bundles so future child slots can represent fractional/shared GPU resources without changing the allocation/claim invariant.
- Update ERD and seed/spec docs.
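A minimal Go sketch of how the additive fields could look to service code. The struct and field names are assumptions for illustration; the authoritative shapes belong in `doc/api/openapi.draft.yaml` and `doc/architecture/db_schema_v1.sql`.

```go
// Illustrative Go shapes for the additive Phase 1 fields; the authoritative
// definitions belong in doc/api/openapi.draft.yaml and db_schema_v1.sql.
package contracts

// CapacityShape distinguishes placement models without changing the baremetal path.
type CapacityShape string

const (
	CapacityShapeBaremetal CapacityShape = "baremetal" // default, preserves current behavior
	CapacityShapeGPUSlice  CapacityShape = "gpu_slice" // not schedulable until later phases
)

// NodeResourceSlot is an approved, schedulable host-local bundle.
type NodeResourceSlot struct {
	ID               string
	NodeID           string
	Status           string         // e.g. available, reserved, disabled, cleanup_blocked
	CapacityMetadata map[string]any // GPU PCI, NVMe device, fabric VF, MAC/IP reservation
}

// AllocationResourceClaim durably binds an allocation to a node and slot.
type AllocationResourceClaim struct {
	AllocationID string
	NodeID       string
	SlotID       *string // nil for node_exclusive baremetal claims
	ClaimType    string  // node_exclusive or slot
}

// Allocation keeps node_id as a compatibility/read-model field only; placement
// correctness lives in claims.
type Allocation struct {
	ID            string
	CapacityShape CapacityShape
	NodeID        *string
}
```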
Definition of done:
- Existing baremetal allocation behavior remains valid with default `capacity_shape=baremetal`.
- No slice placement is enabled by default.
- Schema is additive except for the active-node uniqueness predicate required for future slice coexistence.
- Fractional/shared GPU capacity shapes are not exposed in OpenAPI or enabled in scheduling until a later phase selects the runtime mechanism.
Phase 2: Admin Inventory And Topology Approval
Goal: discover candidate slice bundles and let operators approve schedulable slots.
Tasks:
- Add node-agent topology discovery task output.
- Discover GPU PCI/sysfs metadata.
- Discover NVMe by approved disk identity, not hardcoded size filters.
- Discover IB/RDMA from `/sys/class/infiniband` and PCI metadata.
- Discover management-network identity: BF3 VF, OVS port, MAC/IP reservation. Until node-local IPAM is authoritative, discovery emits deterministic suggested MAC/private-IP/OVS values for admin review; these suggestions are advisory and do not become schedulable until approved into slots.
- Store candidate topology separately from approved `node_resource_slots`. The initial control-plane bridge is the admin `POST/GET /api/v1/admin/nodes/{node_id}/slice-topology/discovery` API, which queues `slice.topology_discover` and reads the latest advisory `node_tasks.output` (an illustrative output shape follows this list). Candidate maps stay non-schedulable until explicitly approved through the resource-slots API.
- Add admin APIs/UI for approving, disabling, repairing, and viewing slots.
- Report bootstrap prerequisites and `reboot_required` state.
- Whole-GPU slots can only become `available` after the runtime-critical bindings are present: GPU PCI address, raw NVMe device, management MAC, and private IP reservation. Incomplete candidates must remain `disabled`.
- Slot approval must reject any NVMe candidate with mounted child partitions or unexpected filesystems unless an explicit infra wipe/remap workflow has completed and recorded proof.
- Slot approval must require per-slot VF-backed fabric metadata. Operators may only make a whole-GPU slice slot `available` after `capacity_metadata.fabric_claim_mode=per_slot_vf` and a non-empty `fabric_vf_pci_address` are present for that slot.
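As a reading aid, here is a hedged Go sketch of what the advisory discovery output might carry per candidate slot. Field names and JSON tags are assumptions, not the `node_tasks.output` contract.

```go
// Illustrative shape for slice.topology_discover output stored in
// node_tasks.output; field names and JSON tags are assumptions for review.
package nodeagent

// CandidateSlot is advisory discovery output; nothing here is schedulable
// until an operator approves it into node_resource_slots.
type CandidateSlot struct {
	GPUPCIAddress      string `json:"gpu_pci_address"`
	GPUNUMANode        int    `json:"gpu_numa_node"`
	NVMeDevice         string `json:"nvme_device"`           // matched by approved disk identity, not size
	NVMeMounted        bool   `json:"nvme_mounted"`          // mounted devices can never be auto-approved
	IBDevice           string `json:"ib_device"`             // from /sys/class/infiniband
	FabricVFPCIAddress string `json:"fabric_vf_pci_address"` // required before a slot may become available
	SuggestedMAC       string `json:"suggested_mac"`         // advisory until approved
	SuggestedPrivateIP string `json:"suggested_private_ip"`  // advisory until approved
	OVSPort            string `json:"ovs_port"`
}

// TopologyDiscovery is the full advisory candidate map for one node.
type TopologyDiscovery struct {
	NodeID         string          `json:"node_id"`
	RebootRequired bool            `json:"reboot_required"`
	CandidateSlots []CandidateSlot `json:"candidate_slots"`
}
```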
Definition of done:
- Operators can see candidate and approved topology for a node. Candidate topology is exposed through the slice-topology discovery API; approved topology is exposed through the resource-slots API.
- Invalid or incomplete topology cannot become schedulable.
- Slot status and drift are visible in Admin Nodes.
- Reused BF/fabric metadata cannot silently sell concurrent slices; concurrency requires explicit per-slot VF metadata.
Phase 3: Claim-Aware Baremetal Scheduler
Goal: move correctness from allocations.node_id to claims while keeping the
existing user-visible baremetal product.
Tasks:
- On baremetal allocation create, insert a `node_exclusive` claim (a predicate sketch follows this list).
- Replace active allocation node checks with claim-aware predicates.
- Backfill current active/releasing baremetal allocations into claim rows.
- Update release/force-release to release claims.
- Keep billing sourced from allocation-level usage records.
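A sketch of one such claim-aware predicate, assuming the Phase 1 table name and an assumed `released_at` column to mean "active claim"; the real predicate belongs inside the scheduler's transaction alongside the claim insert.

```go
// Claim-aware availability predicate sketch. The table name follows the
// Phase 1 sketch; the released_at column is an assumption.
package scheduler

import (
	"context"
	"database/sql"
)

// nodeIsFreeForBaremetal replaces the old allocations.node_id uniqueness
// check: a node can take a node_exclusive baremetal claim only when no active
// claim of any type (node_exclusive or slot) currently holds it.
func nodeIsFreeForBaremetal(ctx context.Context, tx *sql.Tx, nodeID string) (bool, error) {
	const q = `
		SELECT NOT EXISTS (
			SELECT 1
			FROM allocation_resource_claims c
			WHERE c.node_id = $1
			  AND c.released_at IS NULL
		)`
	var free bool
	if err := tx.QueryRowContext(ctx, q, nodeID).Scan(&free); err != nil {
		return false, err
	}
	return free, nil
}
```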
Definition of done:
- Baremetal scheduling behavior is unchanged for users.
- Claims are the durable source of placement correctness.
- Admin allocation detail shows placement shape and claims.
Phase 4: Slice Scheduler
Goal: place gpu_slice allocations on same-node approved slot groups.
Implementation note: the initial scheduler uses SQL to fetch compatible
candidate slots, ranks candidate nodes in Go with deterministic topology-aware
best-fit scoring, then locks the selected node's available slots with
FOR UPDATE SKIP LOCKED before reserving them. This keeps the fragmentation
policy testable outside SQL while preserving transaction-level concurrency
safety.
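A condensed Go sketch of that flow follows. The scoring here is plain slot-count best-fit with a stable tie-break and deliberately omits the topology/NUMA terms; table and column names follow the Phase 1 sketch and are assumptions.

```go
// Condensed Phase 4 flow sketch: score candidate nodes in Go, then lock the
// chosen node's available slots with FOR UPDATE SKIP LOCKED.
package scheduler

import (
	"context"
	"database/sql"
	"sort"
)

type candidateNode struct {
	NodeID    string
	FreeSlots int // approved slots currently available on this node
}

// scoreNodes is deterministic best-fit packing: prefer nodes that are already
// partially sliced (fewest free slots that still fit the request), with NodeID
// as a stable tie-breaker so repeated runs produce the same placement.
func scoreNodes(nodes []candidateNode, want int) []candidateNode {
	var fit []candidateNode
	for _, n := range nodes {
		if n.FreeSlots >= want {
			fit = append(fit, n)
		}
	}
	sort.Slice(fit, func(i, j int) bool {
		if fit[i].FreeSlots != fit[j].FreeSlots {
			return fit[i].FreeSlots < fit[j].FreeSlots // tightest fit first
		}
		return fit[i].NodeID < fit[j].NodeID // deterministic tie-break
	})
	return fit
}

// lockSlots reserves `want` available slots on the chosen node inside the
// caller's transaction; SKIP LOCKED keeps concurrent schedulers from blocking
// on each other's candidate rows.
func lockSlots(ctx context.Context, tx *sql.Tx, nodeID string, want int) ([]string, error) {
	const q = `
		SELECT id
		FROM node_resource_slots
		WHERE node_id = $1 AND status = 'available'
		ORDER BY id
		LIMIT $2
		FOR UPDATE SKIP LOCKED`
	rows, err := tx.QueryContext(ctx, q, nodeID, want)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}
```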
Tasks:
- Add `h200-sxm-slice` SKU/profile with allowed GPU counts.
- Add same-node slot selection and `FOR UPDATE SKIP LOCKED` locking.
- Implement deterministic topology-aware best-fit scoring in Go.
- Reserve selected slots and insert one claim per slot atomically.
- Expose unavailable reasons when aggregate GPUs exist but topology-safe slots do not.
- Scheduler predicates must independently ignore incomplete `available` rows so direct DB/bootstrap mistakes cannot bypass the approval validator.
Definition of done:
- Slice requests cannot span nodes.
- Fragmentation-aware scoring is deterministic.
- Claims, slot states, allocation state, and outbox write commit together.
Phase 5: Node-Agent Slice Runtime
Goal: make node-agent execute bounded VM lifecycle tasks for slices.
Tasks:
- Prereq check: IOMMU, VFIO, SR-IOV, OVS, libvirt, UEFI, cloud-init tooling,
RDMA tools, image cache, reboot-required status.
Manual lab bootstrap notes are captured in `doc/operations/runbooks/GPU_Slice_Node_Manual_Bootstrap_Runbook.md`. Firmware/preflight automation starts in `scripts/maas/commissioning/50-gpuaas-slice-firmware-preflight.sh`; deployed Ubuntu host setup starts in `scripts/ops/gpuaas_slice_host_deploy_bootstrap.sh`.
- Image clone: verify digest, validate destination disk, `qemu-img convert -O raw`, structured progress/error output. Node-agent must validate that each approved `nvme_device` resolves to an actual unmounted block device before clone or wipe, not just match an allowed path pattern.
- Cloud-init: generate per-allocation user-data/meta-data and attach seed ISO.
- VM create/start: libvirt/QEMU, VFIO GPU/fabric, raw NVMe, OVS vNIC, NUMA tuning. Node-agent must verify the selected OVS bridge exists before launching the VM. Tuned runtime profiles may additionally require hugepage-backed memory and Secure Boot disabled; node-agent should fail fast if a selected profile requires hugepages and the host has not reported enough free configured hugepages.
- Readiness: wait for expected MAC/IP lease and SSH or image-specific
management readiness. The initial node-agent implementation treats the boot
slot management MAC/private IP and SSH readiness as a hard provisioning gate;
a VM that is created but not reachable must fail provisioning rather than be
marked active.
Slice slots attached to a failed provisioning attempt move to `cleanup_blocked`, not `available`, because clone/start/readiness failures may leave tenant data or VM state on the raw NVMe device. Successful provision/release paths write slot `health_state` and `health_detail` from readiness and cleanup proof so admin inventory has a lifecycle audit trail for the slot (a readiness-gate sketch follows this list).
- Stop/release: graceful shutdown, timeout, hard-stop fallback.
- Cleanup: explicit raw NVMe wipe/blkdiscard/reimage and signature verification; mounted host devices must be refused before destructive cleanup starts.
- Reconcile: VM/domain, PCI binding, disk, DHCP/IPAM, OVS, ingress, guest health.
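A sketch of the hard readiness gate, using hypothetical callbacks for the lease probe, SSH probe, and slot bookkeeping so the control flow stays visible without inventing node-agent internals; the timeout value is also an assumption.

```go
// Readiness-gate sketch. The probe and bookkeeping callbacks are hypothetical
// stand-ins for node-agent internals; only the control flow is the point.
package nodeagent

import (
	"context"
	"fmt"
	"time"
)

// provisionReadiness treats the boot slot management MAC/IP lease plus SSH as
// a hard gate: a VM that is created but not reachable fails provisioning
// instead of being marked active, and its slots move to cleanup_blocked
// because the raw NVMe device may already hold tenant data or VM state.
func provisionReadiness(
	ctx context.Context,
	allocationID string,
	expectedMAC string,
	waitForLease func(ctx context.Context, mac string) (ip string, err error),
	waitForSSH func(ctx context.Context, ip string) error,
	recordSlotOutcome func(allocationID, slotStatus, healthDetail string),
) error {
	// Timeout value is illustrative, not a product decision.
	ctx, cancel := context.WithTimeout(ctx, 15*time.Minute)
	defer cancel()

	ip, err := waitForLease(ctx, expectedMAC)
	if err == nil {
		err = waitForSSH(ctx, ip)
	}
	if err != nil {
		recordSlotOutcome(allocationID, "cleanup_blocked",
			fmt.Sprintf("readiness failed: %v", err))
		return fmt.Errorf("provisioning failed for allocation %s: %w", allocationID, err)
	}
	// Success: keep the slots held by the allocation and record readiness
	// proof as health_state/health_detail for the admin audit trail.
	recordSlotOutcome(allocationID, "reserved", "management lease and SSH readiness verified")
	return nil
}
```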
Definition of done:
- Raw NVMe wipe verification blocks slot reuse.
- Raw VNC is not exposed; any console access is gatewayed and audited.
- GPU/fabric devices remain bound to `vfio-pci` while the node is in slice mode unless drained or repaired.
Phase 6: Slice Networking
Goal: deliver the v1 private NAT model while preserving OVS extensibility.
Tasks:
- Use BF3 VF -> OVS -> VM vNIC for management/public plane.
- Keep IB/RDMA passthrough separate from management networking.
- Add platform-owned MAC/IP reservation and node-agent lease reporting.
- Implement private NAT by default.
- Add lifecycle-managed public ingress only when firewall/IPAM ownership exists.
- Keep cross-slice traffic denied by default.
Definition of done:
- Node-agent reports `{mac, expected_ip, actual_ip, lease_state}` (see the sketch after this list).
- Stale DHCP/NAT/overlay mappings are reconciled and cleaned on release.
- Future OVS project networks remain possible but are not exposed in v1.
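A minimal Go shape for that report; the JSON tags mirror the tuple above and are otherwise assumptions.

```go
// Minimal lease-report sketch; JSON tags mirror the {mac, expected_ip,
// actual_ip, lease_state} tuple and are otherwise assumptions.
package nodeagent

// LeaseReport is what node-agent reports per slice management interface so the
// control plane can reconcile DHCP/NAT state and clean it up on release.
type LeaseReport struct {
	MAC        string `json:"mac"`
	ExpectedIP string `json:"expected_ip"`
	ActualIP   string `json:"actual_ip"`
	LeaseState string `json:"lease_state"` // e.g. pending, bound, mismatch, expired
}
```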
Phase 7: App Compatibility
Goal: enable apps only where their runtime contract fits the allocation shape.
Tasks:
- Add app manifest compatibility fields.
- Enable launchable OCI, Jupyter, and vLLM on slices first.
- Allow single-node Slurm/Kubernetes on slices only when the full control plane and worker runtime stay inside one allocation.
- Keep multi-node Slurm/Kubernetes cluster profiles baremetal-only until slice networking supports clusters.
- Evaluate internal `system_vm` placements for infra-managed controllers. A compatibility-check sketch follows this list.
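A hedged sketch of the compatibility check; the manifest field names are assumptions, and the real adapter contract lives with the app-runtime follow-up work.

```go
// Compatibility-check sketch; manifest field names are assumptions and the
// real adapter contract lives with the app-runtime follow-up work.
package apps

type CapacityShape string

// AppManifest carries the compatibility fields this phase adds.
type AppManifest struct {
	Name              string
	CompatibleShapes  []CapacityShape // e.g. baremetal, gpu_slice
	RequiresMultiNode bool            // multi-node cluster profiles stay baremetal-only
}

// compatible returns whether the profile can run on the allocation shape and,
// when it cannot, a reason the picker can show instead of hiding the option.
func compatible(m AppManifest, shape CapacityShape) (bool, string) {
	if m.RequiresMultiNode && shape == "gpu_slice" {
		return false, "multi-node cluster profiles are baremetal-only until slice networking supports clusters"
	}
	for _, s := range m.CompatibleShapes {
		if s == shape {
			return true, ""
		}
	}
	return false, "app manifest does not list this allocation shape as compatible"
}
```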
Definition of done:
- App target picker only shows compatible allocations.
- Incompatible profiles explain why the selected allocation cannot run them.
- No app can bypass the allocation/claim/slot compatibility checks.
Deferred Phase 8: Fractional And Shared GPU Readiness
Goal: keep the model extensible for future fractional/shared GPU products without implementing or exposing them in the first slice release.
Non-goal for current implementation:
- Do not schedule or bill fractional/shared GPU products.
- Do not expand public `CapacityShape` enum values until the control plane, scheduler, node-agent, billing, and app compatibility rules are ready.
- Do not treat `gpu_slice` as an alias for MIG, vGPU, MPS, time slicing, or software multiplexing.
Design tasks:
- Reserve explicit future shapes such as `gpu_partition` or `gpu_shared` (see the design-only sketch after this list).
- Model fractional resources as approved child slots or equivalent child resources under a parent physical GPU.
- Capture the sharing mechanism explicitly: `exclusive_device`, `mig_partition`, `mdev_vgpu`, `time_sliced`, `mps`, or `software_shared`.
- Track billable capacity beyond GPU count, such as GPU memory MiB, compute-share units, named partition profile, or SKU-defined accelerator units.
- Require node-agent reconciliation for parent/child GPU state, reset behavior, and cleanup before any child slot can be reused.
- Make app compatibility adapter-specific; whole-GPU slice support must not automatically imply fractional GPU support.
- Keep overcommit disabled unless a future product explicitly defines fairness, performance, billing, and tenant-isolation policy.
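A design-only Go sketch of the reserved values; nothing here is exposed in OpenAPI or schedulable, and the names simply restate the shapes and mechanisms listed above.

```go
// Design-only sketch of reserved future values; nothing here is exposed in
// OpenAPI or schedulable in v1.
package contracts

// Reserved future capacity shapes (not public enum values yet).
const (
	ShapeGPUPartition = "gpu_partition"
	ShapeGPUShared    = "gpu_shared"
)

// SharingMechanism must be captured explicitly on any future fractional slot;
// gpu_slice never aliases any of these.
type SharingMechanism string

const (
	SharingExclusiveDevice SharingMechanism = "exclusive_device"
	SharingMIGPartition    SharingMechanism = "mig_partition"
	SharingMdevVGPU        SharingMechanism = "mdev_vgpu"
	SharingTimeSliced      SharingMechanism = "time_sliced"
	SharingMPS             SharingMechanism = "mps"
	SharingSoftwareShared  SharingMechanism = "software_shared"
)
```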
Definition of ready for a future implementation:
- Infra identifies the first supported mechanism and hardware generation.
- SKU metadata defines profiles, isolation guarantees, and billable units.
- Scheduler can atomically claim child resources without exceeding parent GPU capacity or mixing incompatible profiles.
- Node-agent can provision, reconcile, reset, and release the selected fractional mechanism.
- Billing exports explain fractional units from immutable claim snapshots.