Lab topology¶
Implemented
Source:
doc/operations/Three_Host_Dev_CI_MaaS_Lab_Plan.md · doc/operations/MAAS_j22u11_Execution_Plan_2026-04-20.md
Three-host model¶
```mermaid
flowchart LR
classDef dev fill:#fff8e1,stroke:#f57f17
classDef ci fill:#e3f2fd,stroke:#1565c0
classDef gpu fill:#fce4ec,stroke:#c2185b
DEV[Dev host<br/>local-dev compose<br/>+ developer iteration]:::dev
CI[CI host<br/>full integration<br/>+ e2e]:::ci
subgraph MAAS_LAB[GPU lab via MAAS]
J11[j22u11<br/>10.177.36.100<br/>secondary slice test]:::gpu
J15[j22u15<br/>10.177.36.197<br/>primary slice test]:::gpu
J27[j27u15<br/>10.177.36.198<br/>baremetal / temp slice]:::gpu
end
DEV --> CI
CI --> MAAS_LAB
```
Test node scope¶
Current slice validation nodes (from GPU_Slice_End_to_End_Readiness_Decisions_v1.md):
| Node | MAAS IP | Use |
|---|---|---|
| j22u15 | 10.177.36.197 | Primary slice test host |
| j22u11 | 10.177.36.100 | Secondary slice test host |
| j27u15 | 10.177.36.198 | Baremetal or temporary slice if retagged |
j22u05 was used for early slice development; it is being returned and is not part of the active slice test pool.
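The role-to-node mapping above can be encoded as a small helper for scripts that target the lab. This is a hypothetical sketch, not part of the runbooks; the function name `slice_test_host` is an assumption, while the hostnames and MAAS IPs come from the table.

```shell
# Hypothetical helper: resolve a slice-test role to its lab node.
# Hostnames and IPs are from the test-node table above.
slice_test_host() {
  case "$1" in
    primary)   echo "j22u15 10.177.36.197" ;;
    secondary) echo "j22u11 10.177.36.100" ;;
    baremetal) echo "j27u15 10.177.36.198" ;;
    *)         echo "unknown role: $1" >&2; return 1 ;;
  esac
}

slice_test_host primary   # → j22u15 10.177.36.197
```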
Why three hosts¶
```mermaid
flowchart LR
A[Three roles] --> R1[Dev:<br/>fast loop, no GPU needed]
A --> R2[CI:<br/>real infra, real Postgres,<br/>real NATS, integration tests]
A --> R3[MAAS lab:<br/>real GPU hosts,<br/>slice + baremetal validation]
```
This split keeps scarce GPU host time valuable: developers iterate without consuming GPUs; CI runs against real infrastructure but no GPUs; only the MAAS lab carries actual H200 hardware.
Onboarding a new GPU host¶
→ Read source: GPU_Slice_Node_Manual_Bootstrap_Runbook.md.
```mermaid
flowchart LR
A[1. MAAS commission] --> B[2. Tag<br/>gpuaas-profile-slice-vm<br/>+ hardware tags]
B --> C[3. Deploy<br/>cloud-init applies<br/>firmware profile]
C --> D[4. node-agent<br/>enrolls via bootstrap script]
D --> E["5. POST<br/>/admin/nodes/{id}/<br/>slice-topology/discovery"]
E --> F[6. Operator reviews<br/>candidate slot map]
F --> G[7. Approve →<br/>node_resource_slots]
G --> H[8. Node active for<br/>slice scheduling]
```
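Step 5 can be sketched as a shell snippet that builds the discovery endpoint and posts to it. Only the `/admin/nodes/{id}/slice-topology/discovery` path comes from the flow above; `API_BASE`, the controller address, and the example node id are assumptions, and the actual `curl` call is left commented out so the sketch is safe to paste.

```shell
# Hypothetical sketch of step 5: trigger slot-map discovery for an
# enrolled node. API_BASE and NODE_ID are placeholders, not documented
# values; only the URL path is taken from the onboarding flow.
API_BASE="${API_BASE:-http://controller.example:8080}"
NODE_ID="${NODE_ID:-j22u15}"
DISCOVERY_URL="$API_BASE/admin/nodes/$NODE_ID/slice-topology/discovery"

# curl -fsS -X POST "$DISCOVERY_URL"   # uncomment to actually call the API
echo "$DISCOVERY_URL"
```

The operator then reviews the candidate slot map (step 6) before approval writes it into `node_resource_slots`.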
Lab parity locally¶
→ Read source: Local_Parity_Mode_Inventory_v1.md.
`make kind-parity-up` brings up a kind-cluster-based stack so the production deployment topology can be exercised on a single developer host.
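A typical parity session might look like the following. This is a hedged sketch: only the `kind-parity-up` make target comes from the source; the `kind` and `kubectl` follow-up checks are assumptions about how one would verify the stack came up.

```shell
# Hypothetical parity-mode session on a single developer host.
# Only `make kind-parity-up` is documented; the verification commands
# below assume a standard kind + kubectl toolchain is installed.
make kind-parity-up

kind get clusters          # the parity cluster should be listed
kubectl get nodes -o wide  # production-like topology, local nodes
```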