# Lab topology

**Status:** Implemented

Source: `doc/operations/Three_Host_Dev_CI_MaaS_Lab_Plan.md` · `doc/operations/MAAS_j22u11_Execution_Plan_2026-04-20.md`

## Three-host model

```mermaid
flowchart LR
    classDef dev fill:#fff8e1,stroke:#f57f17
    classDef ci  fill:#e3f2fd,stroke:#1565c0
    classDef gpu fill:#fce4ec,stroke:#c2185b

    DEV[Dev host<br/>local-dev compose<br/>+ developer iteration]:::dev
    CI[CI host<br/>full integration<br/>+ e2e]:::ci

    subgraph MAAS_LAB[GPU lab via MAAS]
      J11[j22u11<br/>10.177.36.100<br/>secondary slice test]:::gpu
      J15[j22u15<br/>10.177.36.197<br/>primary slice test]:::gpu
      J27[j27u15<br/>10.177.36.198<br/>baremetal / temp slice]:::gpu
    end

    DEV --> CI
    CI --> MAAS_LAB
```

## Test node scope

Current slice validation nodes (from `GPU_Slice_End_to_End_Readiness_Decisions_v1.md`):

| Node | MAAS IP | Use |
| --- | --- | --- |
| j22u15 | 10.177.36.197 | Primary slice test host |
| j22u11 | 10.177.36.100 | Secondary slice test host |
| j27u15 | 10.177.36.198 | Baremetal, or temporary slice host if retagged |

j22u05 was used for early slice development; it is being returned and is not part of the active slice test pool.

## Why three hosts

```mermaid
flowchart LR
    A[Three roles] --> R1[Dev:<br/>fast loop, no GPU needed]
    A --> R2[CI:<br/>real infra, real Postgres,<br/>real NATS, integration tests]
    A --> R3[MAAS lab:<br/>real GPU hosts,<br/>slice + baremetal validation]
```

This split treats GPU host time as scarce and valuable: developers iterate without consuming GPUs, CI runs against real infrastructure but not GPUs, and only the MAAS lab carries actual H200 hardware.
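To make the split concrete, here is a minimal sketch of how tests could be gated by tier. The `LAB_TIER` variable and marker names are illustrative assumptions, not part of the source documents.

```python
# Hypothetical sketch: gate tests by lab tier via pytest markers.
# The LAB_TIER variable and marker names are illustrative assumptions.
import os

import pytest

LAB_TIER = os.environ.get("LAB_TIER", "dev")  # dev | ci | gpu (assumed values)

requires_infra = pytest.mark.skipif(
    LAB_TIER == "dev",
    reason="needs real Postgres/NATS (CI host or above)",
)
requires_gpu = pytest.mark.skipif(
    LAB_TIER != "gpu",
    reason="needs an H200 host in the MAAS lab",
)


def test_unit_logic():
    # Runs everywhere, including the fast dev-host loop.
    assert 1 + 1 == 2


@requires_infra
def test_event_roundtrip():
    # Integration test against real infra; skipped on dev hosts.
    ...


@requires_gpu
def test_slice_allocation():
    # End-to-end slice validation; only runs on MAAS lab nodes.
    ...
```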

## Onboarding a new GPU host

→ Read source: `GPU_Slice_Node_Manual_Bootstrap_Runbook.md`.

```mermaid
flowchart LR
    A[1. MAAS commission] --> B[2. Tag<br/>gpuaas-profile-slice-vm<br/>+ hardware tags]
    B --> C[3. Deploy<br/>cloud-init applies<br/>firmware profile]
    C --> D[4. node-agent<br/>enrolls via bootstrap script]
    D --> E["5. POST<br/>/admin/nodes/&lcub;id&rcub;/<br/>slice-topology/discovery"]
    E --> F[6. Operator reviews<br/>candidate slot map]
    F --> G[7. Approve →<br/>node_resource_slots]
    G --> H[8. Node active for<br/>slice scheduling]
```
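As a sketch of steps 5 through 7, the snippet below triggers discovery and prints the candidate slot map for operator review. Only the discovery endpoint path comes from the flow above; the base URL, bearer token, response field names, and the commented-out approval endpoint are assumptions.

```python
# Hypothetical sketch of steps 5-7: trigger slice-topology discovery,
# review the candidate slot map, then approve it. Base URL, auth token,
# and JSON field names are illustrative assumptions.
import json
import urllib.request

BASE = "http://controlplane.example:8080"  # assumed control-plane address
TOKEN = "..."                              # assumed admin bearer token
NODE_ID = "j22u15"                         # node being onboarded


def admin_post(path: str) -> dict:
    req = urllib.request.Request(
        f"{BASE}{path}",
        method="POST",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Step 5: ask the control plane to build a candidate slot map.
candidates = admin_post(f"/admin/nodes/{NODE_ID}/slice-topology/discovery")

# Step 6: operator reviews the candidate slots before anything is persisted.
for slot in candidates.get("slots", []):
    print(slot)

# Step 7 (assumed endpoint): approval writes rows into node_resource_slots.
# admin_post(f"/admin/nodes/{NODE_ID}/slice-topology/approve")
```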

## Lab parity locally

→ Read source: `Local_Parity_Mode_Inventory_v1.md`.

`make kind-parity-up` brings up a kind-cluster-based stack, so the production deployment topology can be exercised on a single developer host.
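A quick sanity check after bringing the stack up might look like the following. The cluster name containing `parity` is an assumption; the `kind` and `kubectl` invocations themselves are standard commands.

```python
# Hypothetical sanity check after `make kind-parity-up`. The cluster name
# is an assumption; the kind/kubectl invocations are standard commands.
import subprocess


def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# Confirm the parity cluster exists.
clusters = run(["kind", "get", "clusters"]).split()
assert any("parity" in name for name in clusters), clusters  # assumed name

# Confirm the stack's pods are scheduled and running.
print(run(["kubectl", "get", "pods", "--all-namespaces"]))
```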

## Where to look next