
GPU Slice Implementation Checklist v1

Purpose

Turn the reviewed GPU slice architecture into an implementation sequence.

This checklist coordinates:

  1. doc/architecture/Allocation_Capacity_Shapes_and_GPU_Slices_v1.md;
  2. doc/architecture/Slice_Networking_Architecture_v1.md;
  3. doc/product/GPU_Slice_UX_Product_Reference_Notes_v1.md;
  4. doc/api/openapi.draft.yaml;
  5. doc/architecture/db_schema_v1.sql;
  6. node-agent, scheduler, admin inventory, and app-runtime follow-up work.

The work should stay contract-first. Any API/event shape change starts in OpenAPI/AsyncAPI before service implementation.

Current Decision Record

Use doc/architecture/GPU_Slice_End_to_End_Readiness_Decisions_v1.md as the current source of truth for the decisions that close the architecture/product discussion before end-to-end testing:

  1. v1 fabric model is per-slot VF-backed, not host-owned shared fabric;
  2. multi-GPU slice allocations must fit on one node;
  3. controller scheduler chooses region/pool/candidate node while node scheduler validates host-local bundles and leases them;
  4. MAAS tags express profile intent, while readiness evidence and approved slots decide schedulability;
  5. allocation timeline pagination is enough for immediate testing, but the product target is a first-class task read model;
  6. launch UX is one H200 wizard that branches between baremetal and GPU slice.

Control-Plane Start Gate

Control-plane implementation can start now. Earlier validation used j22u05 for firmware readiness, KVM/IOMMU, OVS private NAT, RShim, node-agent task execution, and admin-visible slot/topology read models. j22u05 is no longer part of the active slice test pool. Current end-to-end slice validation should use j22u15 first, then j22u11.

Non-blocking infra items to keep visible while control-plane work proceeds:

  1. MAAS tag assignment and firmware profile operating mode: MAAS standard commissioning stays in place; gpuaas-profile-slice-vm and gpuaas-profile-baremetal express intent; deploy-time cloud-init runs the one-time firmware profile check/apply helper only for the selected profile.
  2. BF3 networking profile: v1 is per-slot VF-backed. Treat duplicate parent fabric devices conservatively until every schedulable slot has unique fabric_vf_pci_address metadata.
  3. Node-local IPAM/public access stance: private NAT by default, public NAT or overlay only after firewall/IPAM ownership is clear.
  4. Host bootstrap handoff: MAAS deploy userdata, cloud-init, or operator-run bootstrap for scripts/ops/gpuaas_slice_host_deploy_bootstrap.sh.
  5. Slice VM image source, digest verification authority, and initial H200-ubuntu-24.04-gpuaas-slice image naming.
  6. Tenant isolation wipe policy: blkdiscard, nvme format, or site-specific secure erase path before any released slot can be reused.
  7. Final slot profile values, starting from the tested 1 GPU = 24 vCPU / 64 GiB profile but keeping it SKU/profile-driven.
  8. Fractional/shared GPU products from the Cisco-style direction are design-only for now. The backend model should leave room for them, but schedulable v1 slices remain exclusive whole-GPU VM slots until a specific mechanism is selected and validated.
  9. Raw NVMe slot ownership for j22u05: the first runtime validation found mounted host share partitions under devices that were approved as slice disks. Treat this as a convertible baremetal/share-to-slice storage pool: keep slots cleanup_blocked until infra drains baremetal/share use, unmounts or remaps the devices, approves destructive slice ownership, and reruns topology discovery.
  10. Tuned slice profile validation from GPUaaS_0416: hugepages, persistent VFIO binding, Secure Boot disabled, host/guest network tuning, and fast raw image clone are the current prototype performance model, but storage unmounting must become an infra-approved disk ownership profile rather than an automated blind unmount. The current comparison record is doc/operations/evidence/GPUaaS_0416_Tuned_Profile_Comparison_2026-04-18.md.
  11. Baremetal preservation: scheduler packing should prefer already-sliced nodes for new slice requests so clean nodes remain available for baremetal. Freeing a sliced node for baremetal is v1 drain/natural release or later stop-and-recreate evacuation, not transparent live migration.

These items block schedulable production slices, not Phase 1 through Phase 4 control-plane groundwork. Slice scheduling must remain disabled until approved slots exist and runtime validation is wired.

Current Non-Infra Work Queue

These tasks can move while waiting for infra tag/profile answers and end-to-end MAAS validation:

  1. Implement the node-scheduler boundary behind node-agent: accept approved slot IDs from the control plane, validate current host state, take host-local leases, and return a concrete bundle plan before VM create. The required lease/plan semantics are defined in GPU_Slice_End_to_End_Readiness_Decisions_v1.md; a minimal interface sketch follows this list.
  2. Productize slice images: keep a small Ubuntu base image and add a separate NVIDIA/CUDA developer image with driver/tooling readiness checks. Do not install heavy GPU toolchains on every base image.
  3. Consolidate catalog and launch UX: one H200 launch path should branch inside the wizard into baremetal versus GPU VM slice, with region, SSH, storage, and network choices staying in-flow.
  4. Add the region selector now even when only one region is enabled; the UI should show disabled/unavailable future regions without requiring a later navigation redesign.
  5. Quantify platform-control deploy time and split the release paths into fast dev deploy, targeted component deploy, and full verification. The target for normal dev iteration is 2-3 minutes.
  6. Keep improving provisioning task read models beyond the allocation timeline so MAAS onboarding, slice placement, image clone, VM boot, SSH readiness, and cleanup states are visible without direct DB inspection.
  7. Keep gpuaas-api-docs-redoc non-blocking for launch unless CI proves it breaks the developer docs path; if it remains flaky, move it to a cached or static artifact path.
  8. Follow up on SDK codegen failure 742 as a release-quality issue, but do not let it block slice runtime work unless generated API artifacts are stale.
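
As a forward reference from item 1, here is a minimal Go sketch of the node-scheduler boundary. All type and method names are illustrative assumptions; the authoritative lease/plan semantics remain in GPU_Slice_End_to_End_Readiness_Decisions_v1.md.

```go
// Hypothetical sketch of the node-scheduler boundary behind node-agent.
// Names are illustrative, not the shipped interface.
package nodescheduler

import (
	"context"
	"time"
)

// BundleRequest carries the approved slot IDs chosen by the control plane.
type BundleRequest struct {
	AllocationID string
	SlotIDs      []string // approved node_resource_slots IDs, all on this node
}

// BundlePlan is the concrete host-local plan returned before VM create.
type BundlePlan struct {
	LeaseID      string
	LeaseExpires time.Time
	GPUPCIAddrs  []string
	NVMeDevices  []string
	MgmtMAC      string
	MgmtIP       string
}

// NodeScheduler validates current host state, takes host-local leases,
// and returns a concrete bundle plan, or an error that holds nothing.
type NodeScheduler interface {
	PlanBundle(ctx context.Context, req BundleRequest) (*BundlePlan, error)
	ReleaseLease(ctx context.Context, leaseID string) error
}
```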

Phase 1: Contract And Schema Groundwork

Goal: make the data model capable of representing baremetal and slice placement without changing the current baremetal runtime path.

Tasks:

  1. Add capacity_shape and placement/read-model fields to allocation contracts.
  2. Add node_resource_slots for approved schedulable host-local bundles.
  3. Add allocation_resource_claims for durable allocation-to-node/slot binding.
  4. Keep allocations.node_id as a compatibility/read-model field.
  5. Relax node-active uniqueness so it applies to baremetal allocations only (see the DDL sketch after this list).
  6. Add admin/public read-model fields for aggregate slot availability without exposing raw PCI details to users.
  7. Define slots as generic schedulable accelerator/resource bundles so future child slots can represent fractional/shared GPU resources without changing the allocation/claim invariant.
  8. Update ERD and seed/spec docs.
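
A minimal DDL sketch of tasks 1 through 5, assuming table/column names that follow the terms used in this checklist; doc/architecture/db_schema_v1.sql stays authoritative, and the status values are placeholders.

```go
// Hypothetical Phase 1 DDL, kept as an embedded constant so a migration
// tool can apply it. The allocation status values are assumptions.
package migrations

const phase1GroundworkSQL = `
-- Task 1: the default keeps existing baremetal behavior valid.
ALTER TABLE allocations
    ADD COLUMN capacity_shape text NOT NULL DEFAULT 'baremetal';

-- Task 2: approved schedulable host-local bundles.
CREATE TABLE node_resource_slots (
    id                uuid PRIMARY KEY,
    node_id           uuid NOT NULL REFERENCES nodes (id),
    status            text NOT NULL, -- candidate|available|disabled|cleanup_blocked
    capacity_metadata jsonb NOT NULL DEFAULT '{}'
);

-- Task 3: durable allocation-to-node/slot binding.
CREATE TABLE allocation_resource_claims (
    id            uuid PRIMARY KEY,
    allocation_id uuid NOT NULL REFERENCES allocations (id),
    node_id       uuid NOT NULL REFERENCES nodes (id),
    slot_id       uuid REFERENCES node_resource_slots (id), -- NULL for node_exclusive
    claim_kind    text NOT NULL -- node_exclusive|slot
);

-- Task 5: active-node uniqueness applies to baremetal only, so slice
-- allocations can later coexist on one node.
DROP INDEX IF EXISTS allocations_active_node_uniq;
CREATE UNIQUE INDEX allocations_active_node_uniq
    ON allocations (node_id)
    WHERE status = 'active' AND capacity_shape = 'baremetal';
`
```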

Definition of done:

  1. Existing baremetal allocation behavior remains valid with default capacity_shape=baremetal.
  2. No slice placement is enabled by default.
  3. Schema is additive except for the active-node uniqueness predicate required for future slice coexistence.
  4. Fractional/shared GPU capacity shapes are not exposed in OpenAPI or enabled in scheduling until a later phase selects the runtime mechanism.

Phase 2: Admin Inventory And Topology Approval

Goal: discover candidate slice bundles and let operators approve schedulable slots.

Tasks:

  1. Add node-agent topology discovery task output.
  2. Discover GPU PCI/sysfs metadata.
  3. Discover NVMe by approved disk identity, not hardcoded size filters.
  4. Discover IB/RDMA from /sys/class/infiniband and PCI metadata.
  5. Discover management-network identity: BF3 VF, OVS port, MAC/IP reservation. Until node-local IPAM is authoritative, discovery emits deterministic suggested MAC/private-IP/OVS values for admin review; these suggestions are advisory and do not become schedulable until approved into slots.
  6. Store candidate topology separately from approved node_resource_slots. The initial control-plane bridge is the admin POST/GET /api/v1/admin/nodes/{node_id}/slice-topology/discovery API, which queues slice.topology_discover and reads the latest advisory node_tasks.output. Candidate maps stay non-schedulable until explicitly approved through the resource-slots API.
  7. Add admin APIs/UI for approving, disabling, repairing, and viewing slots.
  8. Report bootstrap prerequisites and reboot_required state.
  9. Whole-GPU slots can become available only after the runtime-critical bindings are present: GPU PCI address, raw NVMe device, management MAC, and private IP reservation. Incomplete candidates must remain disabled.
  10. Slot approval must reject any NVMe candidate with mounted child partitions or unexpected filesystems unless an explicit infra wipe/remap workflow has completed and recorded proof.
  11. Slot approval must require per-slot VF-backed fabric metadata. Operators may make a whole-GPU slice slot available only after capacity_metadata.fabric_claim_mode=per_slot_vf and a non-empty fabric_vf_pci_address are present for that slot (see the validation sketch after this list).
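
A minimal validation sketch for tasks 9 through 11. The fabric metadata keys (fabric_claim_mode, fabric_vf_pci_address) come from this checklist; the Go names and the wipe-proof flag are assumptions.

```go
// Hypothetical approval validator: incomplete or unsafe candidates must
// stay disabled rather than become schedulable.
package slots

import "errors"

type CandidateSlot struct {
	GPUPCIAddress      string
	NVMeDevice         string
	NVMeHasMounts      bool // any mounted child partition or filesystem
	WipeProofRecorded  bool // explicit infra wipe/remap workflow completed
	MgmtMAC            string
	PrivateIP          string
	FabricClaimMode    string // capacity_metadata.fabric_claim_mode
	FabricVFPCIAddress string // capacity_metadata.fabric_vf_pci_address
}

// ValidateForApproval enforces the runtime-critical bindings before a
// whole-GPU candidate may be approved as available.
func ValidateForApproval(c CandidateSlot) error {
	if c.GPUPCIAddress == "" || c.NVMeDevice == "" || c.MgmtMAC == "" || c.PrivateIP == "" {
		return errors.New("incomplete candidate: missing runtime-critical binding")
	}
	if c.NVMeHasMounts && !c.WipeProofRecorded {
		return errors.New("nvme has mounted partitions and no recorded wipe/remap proof")
	}
	if c.FabricClaimMode != "per_slot_vf" || c.FabricVFPCIAddress == "" {
		return errors.New("per-slot VF fabric metadata required before slot can be available")
	}
	return nil
}
```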

Definition of done:

  1. Operators can see candidate and approved topology for a node. Candidate topology is exposed through the slice-topology discovery API; approved topology is exposed through the resource-slots API.
  2. Invalid or incomplete topology cannot become schedulable.
  3. Slot status and drift are visible in Admin Nodes.
  4. Reused BF/fabric metadata cannot silently sell concurrent slices; concurrency requires explicit per-slot VF metadata.

Phase 3: Claim-Aware Baremetal Scheduler

Goal: move correctness from allocations.node_id to claims while keeping the existing user-visible baremetal product.

Tasks:

  1. On baremetal allocation create, insert a node_exclusive claim.
  2. Replace active allocation node checks with claim-aware predicates (see the sketch after this list).
  3. Backfill current active/releasing baremetal allocations into claim rows.
  4. Update release/force-release to release claims.
  5. Keep billing sourced from allocation-level usage records.
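
A minimal sketch of tasks 1 and 2, assuming the Phase 1 table names; the node-row lock and status values are illustrative, not the shipped predicate.

```go
// Hypothetical claim-aware baremetal create: availability is decided by
// claims rather than allocations.node_id. Table names follow the Phase 1
// sketch; adapt to db_schema_v1.sql.
package scheduler

import (
	"context"
	"database/sql"
	"errors"
)

func ClaimBaremetalNode(ctx context.Context, tx *sql.Tx, allocationID, nodeID string) error {
	// Serialize concurrent creates against the same node.
	if _, err := tx.ExecContext(ctx,
		`SELECT 1 FROM nodes WHERE id = $1 FOR UPDATE`, nodeID); err != nil {
		return err
	}
	// Claim-aware predicate: the node is free only if no claim of any kind
	// (node_exclusive or slot) is held by an active/releasing allocation.
	var busy bool
	err := tx.QueryRowContext(ctx, `
        SELECT EXISTS (
            SELECT 1 FROM allocation_resource_claims c
            JOIN allocations a ON a.id = c.allocation_id
            WHERE c.node_id = $1 AND a.status IN ('active', 'releasing')
        )`, nodeID).Scan(&busy)
	if err != nil {
		return err
	}
	if busy {
		return errors.New("node already claimed")
	}
	// Task 1: insert the node_exclusive claim in the same transaction.
	_, err = tx.ExecContext(ctx, `
        INSERT INTO allocation_resource_claims (id, allocation_id, node_id, claim_kind)
        VALUES (gen_random_uuid(), $1, $2, 'node_exclusive')`, allocationID, nodeID)
	return err
}
```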

Definition of done:

  1. Baremetal scheduling behavior is unchanged for users.
  2. Claims are the durable source of placement correctness.
  3. Admin allocation detail shows placement shape and claims.

Phase 4: Slice Scheduler

Goal: place gpu_slice allocations on same-node approved slot groups.

Implementation note: the initial scheduler uses SQL to fetch compatible candidate slots, ranks candidate nodes in Go with deterministic topology-aware best-fit scoring, then locks the selected node's available slots with FOR UPDATE SKIP LOCKED before reserving them. This keeps the fragmentation policy testable outside SQL while preserving transaction-level concurrency safety.
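
A minimal sketch of that flow, with illustrative scoring and SQL; the real fragmentation policy and column names may differ.

```go
// Hypothetical shape of the slice scheduler flow described above: SQL
// fetches compatible candidates, Go ranks nodes deterministically, and the
// chosen node's slots are locked with FOR UPDATE SKIP LOCKED.
package scheduler

import (
	"context"
	"database/sql"
	"sort"
)

type nodeCandidate struct {
	NodeID    string
	FreeSlots int
}

// rankNodes implements deterministic best-fit: prefer already-sliced nodes
// (fewest free slots that still fit) so clean nodes stay available for
// baremetal; NodeID breaks ties so the ordering is stable.
func rankNodes(cands []nodeCandidate, want int) []nodeCandidate {
	fit := cands[:0]
	for _, c := range cands {
		if c.FreeSlots >= want {
			fit = append(fit, c)
		}
	}
	sort.Slice(fit, func(i, j int) bool {
		if fit[i].FreeSlots != fit[j].FreeSlots {
			return fit[i].FreeSlots < fit[j].FreeSlots // best fit first
		}
		return fit[i].NodeID < fit[j].NodeID // deterministic tie-break
	})
	return fit
}

// lockSlots reserves up to want available slots on the chosen node; a
// concurrent scheduler skips rows we hold instead of blocking on them.
func lockSlots(ctx context.Context, tx *sql.Tx, nodeID string, want int) (*sql.Rows, error) {
	return tx.QueryContext(ctx, `
        SELECT id FROM node_resource_slots
        WHERE node_id = $1 AND status = 'available'
        ORDER BY id
        LIMIT $2
        FOR UPDATE SKIP LOCKED`, nodeID, want)
}
```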

Tasks:

  1. Add h200-sxm-slice SKU/profile with allowed GPU counts.
  2. Add same-node slot selection and FOR UPDATE SKIP LOCKED locking.
  3. Implement deterministic topology-aware best-fit scoring in Go.
  4. Reserve selected slots and insert one claim per slot atomically.
  5. Expose unavailable reasons when aggregate GPUs exist but topology-safe slots do not.
  6. Scheduler predicates must independently re-check and skip incomplete available rows so that direct-DB or bootstrap mistakes cannot bypass the approval validator.

Definition of done:

  1. Slice requests cannot span nodes.
  2. Fragmentation-aware scoring is deterministic.
  3. Claims, slot states, allocation state, and outbox write commit together.

Phase 5: Node-Agent Slice Runtime

Goal: make node-agent execute bounded VM lifecycle tasks for slices.

Tasks:

  1. Prereq check: IOMMU, VFIO, SR-IOV, OVS, libvirt, UEFI, cloud-init tooling, RDMA tools, image cache, reboot-required status. Manual lab bootstrap notes are captured in doc/operations/runbooks/GPU_Slice_Node_Manual_Bootstrap_Runbook.md. Firmware/preflight automation starts in scripts/maas/commissioning/50-gpuaas-slice-firmware-preflight.sh; deployed Ubuntu host setup starts in scripts/ops/gpuaas_slice_host_deploy_bootstrap.sh.
  2. Image clone: verify digest, validate destination disk, qemu-img convert -O raw, structured progress/error output. Node-agent must validate that each approved nvme_device resolves to an actual unmounted block device before clone or wipe, not just match an allowed path pattern (see the device guard sketch after this list).
  3. Cloud-init: generate per-allocation user-data/meta-data and attach seed ISO.
  4. VM create/start: libvirt/QEMU, VFIO GPU/fabric, raw NVMe, OVS vNIC, NUMA tuning. Node-agent must verify the selected OVS bridge exists before launching the VM. Tuned runtime profiles may additionally require hugepage-backed memory and Secure Boot disabled; node-agent should fail fast if a selected profile requires hugepages and the host has not reported enough free configured hugepages.
  5. Readiness: wait for expected MAC/IP lease and SSH or image-specific management readiness. The initial node-agent implementation treats the boot slot management MAC/private IP and SSH readiness as a hard provisioning gate; a VM that is created but not reachable must fail provisioning rather than be marked active. Slice slots attached to a failed provisioning attempt move to cleanup_blocked, not available, because clone/start/readiness failures may leave tenant data or VM state on the raw NVMe device. Successful provision/release paths write slot health_state and health_detail from readiness and cleanup proof so admin inventory has a lifecycle audit trail for the slot.
  6. Stop/release: graceful shutdown, timeout, hard-stop fallback.
  7. Cleanup: explicit raw NVMe wipe/blkdiscard/reimage and signature verification; mounted host devices must be refused before destructive cleanup starts.
  8. Reconcile: VM/domain, PCI binding, disk, DHCP/IPAM, OVS, ingress, guest health.
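
A minimal sketch of the device guard required by tasks 2 and 7, assuming a /proc/self/mounts scan; a production check should compare device identity by major:minor rather than path prefix.

```go
// Hypothetical pre-clone/pre-wipe guard: the approved device must resolve
// to a real device node with no mounted children before any destructive
// step runs.
package agent

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// RefuseIfMounted refuses destructive work if the device, or anything that
// looks like a child partition of it, appears in /proc/self/mounts.
// NOTE: prefix matching is a sketch-level shortcut and can false-positive
// (e.g. /dev/nvme0n1 vs /dev/nvme0n10); match on major:minor in production.
func RefuseIfMounted(device string) error {
	info, err := os.Stat(device)
	if err != nil {
		return fmt.Errorf("device %s does not resolve: %w", device, err)
	}
	if info.Mode()&os.ModeDevice == 0 {
		return fmt.Errorf("%s is not a device node", device)
	}
	f, err := os.Open("/proc/self/mounts")
	if err != nil {
		return err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && strings.HasPrefix(fields[0], device) {
			return fmt.Errorf("%s (or a child partition) is mounted at %s", device, fields[1])
		}
	}
	return sc.Err()
}
```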

Definition of done:

  1. Raw NVMe wipe verification blocks slot reuse.
  2. Raw VNC is not exposed; any console access is gatewayed and audited.
  3. GPU/fabric devices remain bound to vfio-pci while the node is in slice mode unless drained or repaired.

Phase 6: Slice Networking

Goal: deliver the v1 private NAT model while preserving OVS extensibility.

Tasks:

  1. Use BF3 VF -> OVS -> VM vNIC for management/public plane.
  2. Keep IB/RDMA passthrough separate from management networking.
  3. Add platform-owned MAC/IP reservation and node-agent lease reporting.
  4. Implement private NAT by default.
  5. Add lifecycle-managed public ingress only when firewall/IPAM ownership exists.
  6. Keep cross-slice traffic denied by default.

Definition of done:

  1. Node-agent reports {mac, expected_ip, actual_ip, lease_state} (see the struct sketch after this list).
  2. Stale DHCP/NAT/overlay mappings are reconciled and cleaned on release.
  3. Future OVS project networks remain possible but are not exposed in v1.
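
A minimal sketch of the lease report from item 1; only the four key names come from this document, the rest is assumed.

```go
// Hypothetical lease report shape; state values are placeholders.
package agent

// LeaseReport is emitted per slice vNIC so the control plane can compare
// the platform-owned reservation against the observed DHCP lease.
type LeaseReport struct {
	MAC        string `json:"mac"`
	ExpectedIP string `json:"expected_ip"`
	ActualIP   string `json:"actual_ip"`
	LeaseState string `json:"lease_state"` // e.g. pending|leased|mismatch|expired
}
```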

Phase 7: App Compatibility

Goal: enable apps only where their runtime contract fits the allocation shape.

Tasks:

  1. Add app manifest compatibility fields (see the field sketch after this list).
  2. Enable launchable OCI, Jupyter, and vLLM on slices first.
  3. Allow single-node Slurm/Kubernetes on slices only when the full control plane and worker runtime stay inside one allocation.
  4. Keep multi-node Slurm/Kubernetes cluster profiles baremetal-only until slice networking supports clusters.
  5. Evaluate internal system_vm placements for infra-managed controllers.
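
A minimal sketch of the compatibility fields from task 1; all names are illustrative, not the shipped manifest schema.

```go
// Hypothetical manifest compatibility fields gating where an app may run.
package apps

type CapacityShape string

const (
	ShapeBaremetal CapacityShape = "baremetal"
	ShapeGPUSlice  CapacityShape = "gpu_slice"
)

// Compatibility gates the app target picker. Multi-node cluster profiles
// list only baremetal until slice networking supports clusters.
type Compatibility struct {
	Shapes          []CapacityShape // allocation shapes the runtime contract fits
	SingleNodeOnly  bool            // control plane and workers in one allocation
	IncompatibleMsg string          // shown when a selected allocation cannot run it
}
```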

Definition of done:

  1. App target picker only shows compatible allocations.
  2. Incompatible profiles explain why the selected allocation cannot run them.
  3. No app can bypass the allocation/claim/slot compatibility checks.

Deferred Phase 8: Fractional And Shared GPU Readiness

Goal: keep the model extensible for future fractional/shared GPU products without implementing or exposing them in the first slice release.

Non-goal for current implementation:

  1. Do not schedule or bill fractional/shared GPU products.
  2. Do not expand public CapacityShape enum values until the control plane, scheduler, node-agent, billing, and app compatibility rules are ready.
  3. Do not treat gpu_slice as an alias for MIG, vGPU, MPS, time slicing, or software multiplexing.

Design tasks:

  1. Reserve explicit future shapes such as gpu_partition or gpu_shared.
  2. Model fractional resources as approved child slots or equivalent child resources under a parent physical GPU.
  3. Capture the sharing mechanism explicitly: exclusive_device, mig_partition, mdev_vgpu, time_sliced, mps, or software_shared (see the design sketch after this list).
  4. Track billable capacity beyond GPU count, such as GPU memory MiB, compute-share units, named partition profile, or SKU-defined accelerator units.
  5. Require node-agent reconciliation for parent/child GPU state, reset behavior, and cleanup before any child slot can be reused.
  6. Make app compatibility adapter-specific; whole-GPU slice support must not automatically imply fractional GPU support.
  7. Keep overcommit disabled unless a future product explicitly defines fairness, performance, billing, and tenant-isolation policy.
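
A design-only sketch tying tasks 3 and 4 together; nothing here is exposed in v1, and all names are placeholders.

```go
// Hypothetical fractional-capacity model for the deferred phase.
package capacity

// SharingMechanism values come from design task 3.
type SharingMechanism string

const (
	ExclusiveDevice SharingMechanism = "exclusive_device"
	MIGPartition    SharingMechanism = "mig_partition"
	MdevVGPU        SharingMechanism = "mdev_vgpu"
	TimeSliced      SharingMechanism = "time_sliced"
	MPS             SharingMechanism = "mps"
	SoftwareShared  SharingMechanism = "software_shared"
)

// ChildSlotCapacity models a fractional child slot under a parent physical
// GPU so billing can explain units beyond whole-GPU count (design task 4).
type ChildSlotCapacity struct {
	ParentSlotID     string
	Mechanism        SharingMechanism
	GPUMemoryMiB     int64  // billable GPU memory
	ComputeShareMil  int32  // compute share in thousandths of the parent GPU
	PartitionProfile string // named partition profile, if any
}
```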

Definition of ready for a future implementation:

  1. Infra identifies the first supported mechanism and hardware generation.
  2. SKU metadata defines profiles, isolation guarantees, and billable units.
  3. Scheduler can atomically claim child resources without exceeding parent GPU capacity or mixing incompatible profiles.
  4. Node-agent can provision, reconcile, reset, and release the selected fractional mechanism.
  5. Billing exports explain fractional units from immutable claim snapshots.