Three Host Dev CI MaaS Lab Plan

Goal

Define a repeatable dev and integration lab that separates:

  1. control-plane and CI concerns
  2. integration and platform-app control-plane testing
  3. real GPU worker validation

This is the baseline operating model for the current hardware:

  1. 100.90.157.34
  2. 100.69.173.30
  3. 100.88.227.60

Host Roles

Production-shaped logical roles:

  1. platform_control
  2. app_control
  3. worker_compute

Current hostnames stay the same for now:

  1. dev-control-1
  2. dev-lab-1
  3. dev-gpu-1

Hostnames are environment instances. The logical role names above are the architectural source of truth.

100.90.157.34 - dev-control-1

Logical role: platform_control

Primary responsibilities:

  1. GitLab
  2. GitLab Runner
  3. container registry
  4. dev control-plane stack
  5. central observability

This host must not be used as a general test worker for destructive hardware scenarios.

Control-plane deployment direction:

  1. local-dev-style manual compose is not the target operating model for this host
  2. the next phase is Kubernetes (k3s) for the GPUaaS platform core
  3. platform-app control stacks remain off this host
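
As a hedged sketch of that direction, the k3s baseline could be converged from a dedicated Ansible role (named control_plane_k3s under Recommended roles below). The script path and task layout here are illustrative assumptions, not a settled implementation; only the upstream get.k3s.io install script and its INSTALL_K3S_EXEC variable are standard:

```yaml
# Sketch of a control_plane_k3s role task file; paths and names are assumptions.
- name: Download the upstream k3s install script
  ansible.builtin.get_url:
    url: https://get.k3s.io
    dest: /usr/local/bin/k3s-install.sh
    mode: "0755"

- name: Install k3s in server mode (skipped once the binary exists)
  ansible.builtin.command: /usr/local/bin/k3s-install.sh
  environment:
    INSTALL_K3S_EXEC: server
  args:
    creates: /usr/local/bin/k3s
```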

100.69.173.30 - dev-lab-1

Logical role: app_control

Primary responsibilities:

  1. node-shim and synthetic integration harness
  2. scheduler/app control-plane test host
  3. Slurm controller, Ray head, or lightweight K8s control-plane experiments
  4. destructive integration jobs that should not touch the real GPU worker

This host is the shared lab control host for platform apps.

100.88.227.60 - dev-gpu-1

Logical role: worker_compute

Primary responsibilities:

  1. real node-agent validation
  2. real GPU runtime checks
  3. real allocation and terminal validation
  4. MaaS lifecycle evidence target
  5. future GPU worker participation in tenant-dedicated app backends

This host should remain as close as possible to worker semantics.

Design Rules

  1. Do not mix CI/control-plane services with real GPU workload validation on the same host.
  2. Do not overload the GPU host with scheduler or app control-plane daemons.
  3. Platform-app control components belong on dev-lab-1, not on GPU workers.
  4. Automation must converge hosts from configuration; do not rely on manual shell history as state.
  5. Real GPU jobs must run through tagged and serialized pipelines only.

Control-Plane and Platform-App Testing Model

dev-lab-1 is the place to validate platform apps such as:

  1. Slurm controller
  2. Ray head
  3. lightweight K8s control plane
  4. app runtime adapters

dev-gpu-1 joins or interacts with those control planes as a worker node or execution target.

This mirrors the intended long-term platform model:

  1. GPUaaS provides primitives
  2. platform apps own their own runtime/control logic
  3. node-agent remains focused on bounded host/node responsibilities

CI Runner Model

Use distinct runner tags:

  1. platform-control
  2. sim
  3. app-control
  4. worker-compute

Guidance:

  1. platform-control for contract, build, docs, and control-plane deploy jobs
  2. sim for node-shim and fast integration
  3. app-control for scheduler/app-control-plane integration tests
  4. worker-compute for real hardware validation only

GPU-tagged jobs should be serialized and initially manual or protected.
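
A minimal .gitlab-ci.yml sketch of this tag model, assuming hypothetical job names and scripts. resource_group is GitLab's built-in mechanism for serializing jobs that share a resource, which fits the one-real-GPU-host constraint:

```yaml
# Sketch only: job names, stage, and scripts are placeholders.
sim-integration:
  stage: test
  tags: [sim]                  # fast node-shim integration on dev-lab-1
  script:
    - ./scripts/run-node-shim-suite.sh   # hypothetical harness entry point

gpu-smoke:
  stage: test
  tags: [worker-compute]       # routes the job to the dev-gpu-1 runner only
  resource_group: dev-gpu-1    # serializes all jobs that touch the real GPU host
  when: manual                 # keep real-hardware jobs manual until proven
  script:
    - nvidia-smi               # placeholder hardware check
```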

Automation Layout

Recommended repo structure:

  1. infra/ansible/
  2. infra/ansible/inventory/environments/
  3. infra/env/dev-control/
  4. infra/env/dev-lab/
  5. infra/env/dev-gpu/
  6. infra/systemd/
  7. build/ci-image/
  8. infra/k8s/

Environment creation rule:

  1. the three-host lab is the first instance of a reusable environment blueprint
  2. future test, staging, or production environments should be created by changing environment inventory/config, not by writing new deployment logic
  3. control-plane growth from dev-control-1 to dev-control-1 + dev-control-2 must be a Kubernetes/control-plane expansion path, not a platform redesign

Recommended roles:

  1. common
  2. platform_control
  3. control_plane_k3s
  4. gitlab
  5. runner
  6. app_control
  7. worker_compute
  8. observability_agent
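
A sketch of the dev inventory grouped by logical role, shown in Ansible's YAML inventory format for illustration (the converge commands below reference an INI hosts.ini; the structure is the same):

```yaml
# YAML equivalent of infra/ansible/inventory/environments/dev/hosts.ini
all:
  children:
    platform_control:
      hosts:
        dev-control-1:
          ansible_host: 100.90.157.34
    app_control:
      hosts:
        dev-lab-1:
          ansible_host: 100.69.173.30
    worker_compute:
      hosts:
        dev-gpu-1:
          ansible_host: 100.88.227.60
```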

Converge Path

The baseline converge path is owned by infra/ansible/.

Local operator flow:

  1. validate inventory and playbook syntax
  2. ping the three hosts
  3. converge all hosts or a specific host group

Reference commands:

  1. GPUAAS_ENVIRONMENT=dev ansible-playbook -i infra/ansible/inventory/environments/dev/hosts.ini infra/ansible/playbooks/site.yml --syntax-check
  2. GPUAAS_ENVIRONMENT=dev ansible all -i infra/ansible/inventory/environments/dev/hosts.ini -m ping
  3. GPUAAS_ENVIRONMENT=dev ansible-playbook -i infra/ansible/inventory/environments/dev/hosts.ini infra/ansible/playbooks/site.yml
  4. GPUAAS_ENVIRONMENT=dev ansible-playbook -i infra/ansible/inventory/environments/dev/hosts.ini infra/ansible/playbooks/site.yml --limit worker_compute

The initial baseline covers:

  1. common packages and Docker runtime
  2. base directories under /opt/gpuaas, /var/lib/gpuaas, /var/log/gpuaas, /etc/gpuaas
  3. journald retention settings
  4. explicit host-role markers for platform_control, app_control, and worker_compute
  5. k3s baseline on the platform-control host
  6. baseline Kubernetes namespaces for the platform core
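
A sketch of the common-role tasks behind items 2-4, assuming a gpuaas_host_role group variable, a /etc/gpuaas/host-role marker path, and a 1G retention cap; all three are illustrative assumptions, not settled conventions:

```yaml
# Sketch of common-role tasks; gpuaas_host_role and the marker path are assumptions.
- name: Create the gpuaas base directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
    mode: "0755"
  loop:
    - /opt/gpuaas
    - /var/lib/gpuaas
    - /var/log/gpuaas
    - /etc/gpuaas

- name: Write an explicit host-role marker
  ansible.builtin.copy:
    content: "{{ gpuaas_host_role }}\n"   # platform_control, app_control, or worker_compute
    dest: /etc/gpuaas/host-role
    mode: "0644"

- name: Ensure the journald drop-in directory exists
  ansible.builtin.file:
    path: /etc/systemd/journald.conf.d
    state: directory
    mode: "0755"

- name: Cap journald disk usage (journald must be restarted to apply)
  ansible.builtin.copy:
    content: |
      [Journal]
      SystemMaxUse=1G
    dest: /etc/systemd/journald.conf.d/gpuaas.conf
    mode: "0644"
```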

The initial baseline does not yet deploy:

  1. GitLab
  2. scheduler/runtime stacks
  3. MaaS

Day 0-2 Tasks

  1. Baseline all three hosts:
     - hostname
     - package baseline
     - Docker/Compose
     - journald retention
     - metrics/log shipping
  2. Stand up GitLab + runner + registry on dev-control-1
  3. Define runner tags and scheduling rules
  4. Install node-shim and platform-app lab wrappers on dev-lab-1
  5. Install real node-agent service wrapper and GPU validation tools on dev-gpu-1

Day 3-5 Tasks

  1. Move CI setup drift into a reusable base image
  2. Converge the dev control-plane stack from automation
  3. Extend observability from platform_control to app_control and worker_compute
  4. Add Slurm controller or reference platform-app stack to dev-lab-1
  5. Validate dev-gpu-1 as worker target through the existing control-plane contracts

Current reference baseline:

  1. infra/env/dev-lab/platform-apps/slurm-reference/
  2. infra/env/dev-gpu/slurm-reference/
  3. doc/operations/Slurm_Reference_Lab_Stack.md

CI base image baseline location:

  1. build/ci-image/Dockerfile
  2. build/ci-image/build.sh

Runner adoption model:

  1. publish from dev-control-1 to the control-plane registry
  2. grant tagged runners pull access
  3. switch the pipeline default image to the published digest once proven
  4. retain job-level fallback install logic only for bootstrap and recovery
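
Once the image is proven, step 3 could look like the following pipeline fragment; the registry host is a placeholder and the digest stays elided until publication:

```yaml
# Sketch: registry host and digest are placeholders for the published CI image.
default:
  image: registry.example.internal/gpuaas/ci-image@sha256:<digest>
```

Pinning by digest rather than by tag keeps runner adoption reproducible and makes rollback a one-line revert.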

Day 6-7 Tasks

  1. Run node-shim integration lane on dev-lab-1
  2. Run real GPU smoke lane on dev-gpu-1
  3. Capture MaaS baseline evidence when available
  4. Document reset and teardown procedures per host role
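
For item 4, a hypothetical per-role teardown sketch for the worker host; the unit name and state-path handling are assumptions to be replaced by the documented procedures:

```yaml
# Hypothetical reset playbook for worker_compute; the unit name is an assumption.
- name: Reset worker_compute host to baseline
  hosts: worker_compute
  become: true
  tasks:
    - name: Stop the node-agent service if present
      ansible.builtin.systemd:
        name: gpuaas-node-agent
        state: stopped
      failed_when: false   # tolerate a host where the unit was never installed

    - name: Clear transient lab state
      ansible.builtin.file:
        path: /var/lib/gpuaas
        state: absent

    - name: Recreate the empty state directory
      ansible.builtin.file:
        path: /var/lib/gpuaas
        state: directory
        mode: "0755"
```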

Follow-On Execution Seeds

This baseline should immediately drive implementation tasks rather than remain documentation-only.

  1. A-LAB-AUTOMATION-001
     - create infra/ansible/ baseline
     - define host inventory and common roles
     - converge package/runtime baseline on all three hosts

  2. A-CI-BASE-IMAGE-001
     - build and publish a reusable CI image
     - move toolchain drift out of job scripts
     - consume the image by digest in pipeline jobs

  3. A-LAB-PLATFORM-APP-STACK-001
     - stand up the first reference platform-app control stack on dev-lab-1
     - initial target: Slurm controller or equivalent scheduler reference stack
     - prove dev-gpu-1 can join as a worker/execution target through existing contracts

  4. C-LAB-OBSERVABILITY-001
     - centralize metrics/log shipping from all three hosts
     - label dashboards and alerts by host role (control, lab, gpu)
     - preserve correlation-first triage across host boundaries

  5. A-CONTROL-PLANE-K8S-BASELINE-001
     - install k3s on dev-control-1
     - define the namespace/config/ingress baseline for the GPUaaS platform core
     - keep stateful infra outside the cluster for the first cut

  6. A-CONTROL-PLANE-K8S-CORE-DEPLOY-001
     - deploy GPUaaS core services to dev-control-1 Kubernetes
     - prove health, rollout, and scaling behavior
     - keep platform-app control stacks on dev-lab-1

  7. A-CONTROL-PLANE-K8S-CD-001
     - automate image build, publish, nightly deploy, and rollback path for dev-control-1

  8. A-CONTROL-PLANE-K8S-INFRA-MIGRATION-001
     - migrate selected infra services into the control-plane cluster one service at a time
     - initial order: Redis, NATS, Temporal, Keycloak, Postgres last
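
For A-CONTROL-PLANE-K8S-BASELINE-001, the namespace baseline could start as plain manifests under infra/k8s/; the namespace names are assumptions for illustration:

```yaml
# Sketch of a namespace baseline under infra/k8s/; names are assumptions.
apiVersion: v1
kind: Namespace
metadata:
  name: gpuaas-core            # GPUaaS platform core services
---
apiVersion: v1
kind: Namespace
metadata:
  name: gpuaas-observability   # central observability components
```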

Production Direction

This lab should mirror the initial production operating model:

  1. platform control services on dedicated control infrastructure
  2. tenant-dedicated or project-dedicated platform-app control planes on non-worker control nodes
  3. worker/GPU nodes dedicated to execution and lifecycle management

For early production, each tenant may receive dedicated control and compute nodes for platform apps until platform-managed offerings are mature enough.

Safety Rules

  1. No arbitrary CI jobs on dev-gpu-1
  2. No direct human debugging with production-like secrets without an audit trail
  3. No scheduler-specific control-plane state on the GPU worker host
  4. Every host must be rebuildable from automation

Acceptance Criteria

  1. host responsibilities are documented and non-overlapping
  2. CI tag model is explicit
  3. automation layout is defined
  4. platform-app lab testing path is separated from real GPU worker validation
  5. MaaS expansion path is documented, not ad hoc