# Lab topology

**Status:** Implemented

Source: `doc/operations/Three_Host_Dev_CI_MaaS_Lab_Plan.md` · `doc/operations/MAAS_j22u11_Execution_Plan_2026-04-20.md`

## Three-host model

```mermaid
flowchart LR
    classDef dev fill:#fff8e1,stroke:#f57f17
    classDef ci  fill:#e3f2fd,stroke:#1565c0
    classDef gpu fill:#fce4ec,stroke:#c2185b

    DEV[Dev host<br/>local-dev compose<br/>+ developer iteration]:::dev
    CI[CI host<br/>full integration<br/>+ e2e]:::ci

    subgraph MAAS_LAB[GPU lab via MAAS]
      J11[j22u11<br/>10.177.36.100<br/>secondary slice test]:::gpu
      J15[j22u15<br/>10.177.36.197<br/>primary slice test]:::gpu
      J27[j27u15<br/>10.177.36.198<br/>baremetal / temp slice]:::gpu
    end

    DEV --> CI
    CI --> MAAS_LAB
```

## Test node scope

Current slice validation nodes (from `GPU_Slice_End_to_End_Readiness_Decisions_v1.md`):

| Node | MAAS IP | Use |
| --- | --- | --- |
| j22u15 | 10.177.36.197 | Primary slice test host |
| j22u11 | 10.177.36.100 | Secondary slice test host |
| j27u15 | 10.177.36.198 | Baremetal, or temporary slice host if retagged |

j22u05 was used for early slice development; it is being returned and is not part of the active slice test pool.

## Why three hosts

```mermaid
flowchart LR
    A[Three roles] --> R1[Dev:<br/>fast loop, no GPU needed]
    A --> R2[CI:<br/>real infra, real Postgres,<br/>real NATS, integration tests]
    A --> R3[MAAS lab:<br/>real GPU hosts,<br/>slice + baremetal validation]
```

This split treats GPU host time as scarce and valuable: developers iterate without consuming GPUs, CI runs against real infrastructure but not GPUs, and only the MAAS lab carries actual H200 hardware.
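To make the split concrete, here is a minimal sketch of how tests could be gated by tier. The `LAB_TIER` variable and marker names are illustrative assumptions, not part of the source documents.

```python
# Hypothetical sketch: gate tests by lab tier via pytest markers.
# The LAB_TIER variable and marker names are illustrative assumptions.
import os

import pytest

LAB_TIER = os.environ.get("LAB_TIER", "dev")  # dev | ci | gpu (assumed values)

requires_infra = pytest.mark.skipif(
    LAB_TIER == "dev",
    reason="needs real Postgres/NATS (CI host or above)",
)
requires_gpu = pytest.mark.skipif(
    LAB_TIER != "gpu",
    reason="needs an H200 host in the MAAS lab",
)


def test_unit_logic():
    # Runs everywhere, including the fast dev-host loop.
    assert 1 + 1 == 2


@requires_infra
def test_event_roundtrip():
    # Integration test against real infra; skipped on dev hosts.
    ...


@requires_gpu
def test_slice_allocation():
    # End-to-end slice validation; only runs on MAAS lab nodes.
    ...
```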

## Onboarding a new GPU host

→ Read source: `GPU_Slice_Node_Manual_Bootstrap_Runbook.md`.

```mermaid
flowchart LR
    A[1. MAAS commission] --> B[2. Tag<br/>gpuaas-profile-slice-vm<br/>+ hardware tags]
    B --> C[3. Deploy<br/>cloud-init applies<br/>firmware profile]
    C --> D[4. node-agent<br/>enrolls via bootstrap script]
    D --> E["5. POST<br/>/admin/nodes/&lcub;id&rcub;/<br/>slice-topology/discovery"]
    E --> F[6. Operator reviews<br/>candidate slot map]
    F --> G[7. Approve →<br/>node_resource_slots]
    G --> H[8. Node active for<br/>slice scheduling]
```
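As a sketch of steps 5 through 7, the snippet below triggers discovery and prints the candidate slot map for operator review. Only the discovery endpoint path comes from the flow above; the base URL, bearer token, response field names, and the commented-out approval endpoint are assumptions.

```python
# Hypothetical sketch of steps 5-7: trigger slice-topology discovery,
# review the candidate slot map, then approve it. Base URL, auth token,
# and JSON field names are illustrative assumptions.
import json
import urllib.request

BASE = "http://controlplane.example:8080"  # assumed control-plane address
TOKEN = "..."                              # assumed admin bearer token
NODE_ID = "j22u15"                         # node being onboarded


def admin_post(path: str) -> dict:
    req = urllib.request.Request(
        f"{BASE}{path}",
        method="POST",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Step 5: ask the control plane to build a candidate slot map.
candidates = admin_post(f"/admin/nodes/{NODE_ID}/slice-topology/discovery")

# Step 6: operator reviews the candidate slots before anything is persisted.
for slot in candidates.get("slots", []):
    print(slot)

# Step 7 (assumed endpoint): approval writes rows into node_resource_slots.
# admin_post(f"/admin/nodes/{NODE_ID}/slice-topology/approve")
```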

## Lab parity locally

→ Read source: `Local_Parity_Mode_Inventory_v1.md`.

`make kind-parity-up` brings up a kind-cluster-based stack, so the production deployment topology can be exercised on a single developer host.
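A quick sanity check after bringing the stack up might look like the following. The cluster name containing `parity` is an assumption; the `kind` and `kubectl` invocations themselves are standard commands.

```python
# Hypothetical sanity check after `make kind-parity-up`. The cluster name
# is an assumption; the kind/kubectl invocations are standard commands.
import subprocess


def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# Confirm the parity cluster exists.
clusters = run(["kind", "get", "clusters"]).split()
assert any("parity" in name for name in clusters), clusters  # assumed name

# Confirm the stack's pods are scheduled and running.
print(run(["kubectl", "get", "pods", "--all-namespaces"]))
```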

## Where to look next