# Three-Host Dev CI MaaS Lab Plan
## Goal

Define a repeatable dev and integration lab that separates:

1. control-plane and CI concerns
2. integration and platform-app control-plane testing
3. real GPU worker validation
This is the baseline operating model for the current hardware:
1. 100.90.157.34
2. 100.69.173.30
3. 100.88.227.60
## Host Roles
Production-shaped logical roles:
1. platform_control
2. app_control
3. worker_compute
Current hostnames stay the same for now:
1. dev-control-1
2. dev-lab-1
3. dev-gpu-1
Hostnames are environment instances. The logical role names above are the architectural source of truth.
### 100.90.157.34 - dev-control-1
Logical role: platform_control
Primary responsibilities:

1. GitLab
2. GitLab Runner
3. container registry
4. dev control-plane stack
5. central observability
Must not be used as a general test worker for destructive hardware scenarios.
Control-plane deployment direction:
1. local-dev-style manual compose is not the target operating model for this host
2. the next phase is Kubernetes (k3s) for the GPUaaS platform core (see the install sketch below)
3. platform-app control stacks remain off this host
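A minimal sketch of that direction, assuming the official `get.k3s.io` installer is acceptable for the dev baseline (playbook shape is illustrative, not settled contract):

```yaml
# Sketch: converge k3s onto the platform_control host.
# Assumes the official get.k3s.io installer; idempotent via `creates`.
- hosts: platform_control
  become: true
  tasks:
    - name: Install k3s if the binary is absent
      ansible.builtin.shell: curl -sfL https://get.k3s.io | sh -
      args:
        creates: /usr/local/bin/k3s
```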
### 100.69.173.30 - dev-lab-1
Logical role: app_control
Primary responsibilities:

1. node-shim and synthetic integration harness
2. scheduler/app control-plane test host
3. Slurm controller, Ray head, or lightweight K8s control-plane experiments
4. destructive integration jobs that should not touch the real GPU worker
This host is the shared lab control host for platform apps.
### 100.88.227.60 - dev-gpu-1
Logical role: worker_compute
Primary responsibilities:

1. real node-agent validation
2. real GPU runtime checks
3. real allocation and terminal validation
4. MaaS lifecycle evidence target
5. future GPU worker participation in tenant-dedicated app backends
This host should remain as close as possible to worker semantics.
## Design Rules
- Do not mix CI/control-plane services with real GPU workload validation on the same host.
- Do not overload the GPU host with scheduler or app control-plane daemons.
- Platform-app control components belong on `dev-lab-1`, not on GPU workers.
- Automation must converge hosts from configuration; do not rely on manual shell history as state.
- Real GPU jobs must run through tagged and serialized pipelines only.
## Control-Plane and Platform-App Testing Model
dev-lab-1 is the place to validate platform apps such as:
1. Slurm controller
2. Ray head
3. lightweight K8s control plane
4. app runtime adapters
dev-gpu-1 joins or interacts with those control planes as a worker node or execution target.
This mirrors the intended long-term platform model:

1. GPUaaS provides primitives
2. platform apps own their own runtime/control logic
3. node-agent remains focused on bounded host/node responsibilities
## CI Runner Model
Use distinct runner tags:
1. platform-control
2. sim
3. app-control
4. worker-compute
Guidance:
1. platform-control for contract, build, docs, and control-plane deploy jobs
2. sim for node-shim and fast integration
3. app-control for scheduler/app-control-plane integration tests
4. worker-compute for real hardware validation only
GPU-tagged jobs should be serialized and initially manual or protected.
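A hedged sketch of that tag model in `.gitlab-ci.yml` (job names and scripts are illustrative, not the real pipeline; `resource_group` serializes jobs against the GPU host, and `when: manual` keeps the lane human-gated):

```yaml
# Illustrative jobs only; the tags are the real contract from this section.
contract-tests:
  tags: [platform-control]
  script:
    - make test-contracts        # placeholder target

shim-integration:
  tags: [sim]
  script:
    - make test-shim             # placeholder target

gpu-smoke:
  tags: [worker-compute]
  resource_group: dev-gpu-1      # serializes all jobs against the GPU host
  when: manual                   # human-gated until the lane is proven
  script:
    - make gpu-smoke             # placeholder target
```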
## Automation Layout
Recommended repo structure:
1. infra/ansible/
2. infra/ansible/inventory/environments/
3. infra/env/dev-control/
4. infra/env/dev-lab/
5. infra/env/dev-gpu/
6. infra/systemd/
7. build/ci-image/
8. infra/k8s/
Environment creation rule:
1. the three-host lab is the first instance of a reusable environment blueprint
2. future test, staging, or production environments should be created by changing environment inventory/config, not by writing new deployment logic (see the inventory sketch after this list)
3. control-plane growth from dev-control-1 to dev-control-1 + dev-control-2 must be a Kubernetes/control-plane expansion path, not a platform redesign
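For illustration, the dev instance of that blueprint is just inventory. A sketch of the referenced `hosts.ini`, grouped by the logical role names (exact file contents are an assumption):

```ini
# infra/ansible/inventory/environments/dev/hosts.ini (illustrative shape)
[platform_control]
dev-control-1 ansible_host=100.90.157.34

[app_control]
dev-lab-1 ansible_host=100.69.173.30

[worker_compute]
dev-gpu-1 ansible_host=100.88.227.60
```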
Recommended roles (mapped to host groups in the site.yml sketch after the list):
1. common
2. platform_control
3. control_plane_k3s
4. gitlab
5. runner
6. app_control
7. worker_compute
8. observability_agent
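A sketch of how `infra/ansible/playbooks/site.yml` might apply those roles to the host groups (the exact assignment is a judgment call, not settled contract):

```yaml
# Illustrative role-to-group mapping; every host converges common first.
- hosts: all
  become: true
  roles:
    - common

- hosts: platform_control
  become: true
  roles:
    - platform_control
    - control_plane_k3s
    - gitlab
    - runner
    - observability_agent

- hosts: app_control
  become: true
  roles:
    - app_control
    - observability_agent

- hosts: worker_compute
  become: true
  roles:
    - worker_compute
    - observability_agent
```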
## Converge Path
The baseline converge path is owned by infra/ansible/.
Local operator flow:

1. validate inventory and playbook syntax
2. ping the three hosts
3. converge all hosts or a specific host group
Reference commands:

```sh
# 1. Validate inventory and playbook syntax
GPUAAS_ENVIRONMENT=dev ansible-playbook -i infra/ansible/inventory/environments/dev/hosts.ini infra/ansible/playbooks/site.yml --syntax-check

# 2. Ping the three hosts
GPUAAS_ENVIRONMENT=dev ansible all -i infra/ansible/inventory/environments/dev/hosts.ini -m ping

# 3. Converge all hosts
GPUAAS_ENVIRONMENT=dev ansible-playbook -i infra/ansible/inventory/environments/dev/hosts.ini infra/ansible/playbooks/site.yml

# 4. Converge a specific host group
GPUAAS_ENVIRONMENT=dev ansible-playbook -i infra/ansible/inventory/environments/dev/hosts.ini infra/ansible/playbooks/site.yml --limit worker_compute
```
The initial baseline covers:
1. common packages and Docker runtime
2. base directories under /opt/gpuaas, /var/lib/gpuaas, /var/log/gpuaas, /etc/gpuaas
3. journald retention settings
4. explicit host-role markers for platform_control, app_control, and worker_compute
5. k3s baseline on the platform-control host
6. baseline Kubernetes namespaces for the platform core
The initial baseline does not yet deploy:

1. GitLab
2. scheduler/runtime stacks
3. MaaS
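Two of the covered items (host-role markers, baseline namespaces) can be sketched as Ansible tasks; the marker path, variable name, and namespace names below are assumptions, not settled contract:

```yaml
# Sketch: explicit host-role marker plus baseline platform namespaces.
- name: Write explicit host-role marker
  ansible.builtin.copy:
    content: "{{ gpuaas_host_role }}\n"   # platform_control | app_control | worker_compute
    dest: /etc/gpuaas/host-role
    mode: "0644"

- name: Ensure baseline platform namespaces exist
  ansible.builtin.command: k3s kubectl create namespace {{ item }}
  register: ns_create
  changed_when: ns_create.rc == 0
  failed_when: ns_create.rc != 0 and 'AlreadyExists' not in ns_create.stderr
  loop:
    - gpuaas-core                          # assumed namespace names
    - gpuaas-observability
  when: "'platform_control' in group_names"
```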
## Day 0-2 Tasks
- Baseline all three hosts:
    - hostname
    - package baseline
    - Docker/Compose
    - journald retention (see the sketch after this list)
    - metrics/log shipping
- Stand up GitLab + runner + registry on `dev-control-1`
- Define runner tags and scheduling rules
- Install node-shim and platform-app lab wrappers on `dev-lab-1`
- Install the real node-agent service wrapper and GPU validation tools on `dev-gpu-1`
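As one example, the journald retention item reduces to a small task pair; the `SystemMaxUse` cap of 2G is an assumption to tune per host:

```yaml
# Sketch: journald retention baseline; restart only when the config changed.
- name: Cap persistent journal size
  community.general.ini_file:
    path: /etc/systemd/journald.conf
    section: Journal
    option: SystemMaxUse
    value: "2G"
  register: journald_conf

- name: Restart journald to apply retention settings
  ansible.builtin.systemd:
    name: systemd-journald
    state: restarted
  when: journald_conf.changed
```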
## Day 3-5 Tasks
- Move CI setup drift into a reusable base image
- Converge the dev control-plane stack from automation
- Extend observability from `platform_control` to `app_control` and `worker_compute`
- Add a Slurm controller or reference platform-app stack to `dev-lab-1`
- Validate `dev-gpu-1` as a worker target through the existing control-plane contracts
Current reference baseline:
1. infra/env/dev-lab/platform-apps/slurm-reference/
2. infra/env/dev-gpu/slurm-reference/
3. doc/operations/Slurm_Reference_Lab_Stack.md
CI base image baseline location:
1. build/ci-image/Dockerfile
2. build/ci-image/build.sh
Runner adoption model:

1. publish from dev-control-1 to the control-plane registry
2. grant tagged runners pull access
3. switch the pipeline default image to the published digest once proven (see the sketch below)
4. retain job-level fallback install logic only for bootstrap and recovery
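Step 3 then reduces to a one-line pipeline change; the registry path and digest below are placeholders, not published values:

```yaml
# Placeholder image reference; substitute the real registry path and digest.
default:
  image: registry.example.internal/gpuaas/ci-base@sha256:<digest>
```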
## Day 6-7 Tasks
- Run the node-shim integration lane on `dev-lab-1`
- Run the real GPU smoke lane on `dev-gpu-1`
- Capture MaaS baseline evidence when available
- Document reset and teardown procedures per host role
## Follow-On Execution Seeds
This baseline should immediately drive implementation tasks rather than remain documentation-only.
- `A-LAB-AUTOMATION-001`
    - create the `infra/ansible/` baseline
    - define host inventory and common roles
    - converge the package/runtime baseline on all three hosts
- `A-CI-BASE-IMAGE-001`
    - build and publish a reusable CI image
    - move toolchain drift out of job scripts
    - consume the image by digest in pipeline jobs
- `A-LAB-PLATFORM-APP-STACK-001`
    - stand up the first reference platform-app control stack on `dev-lab-1`
    - initial target: Slurm controller or an equivalent scheduler reference stack
    - prove `dev-gpu-1` can join as a worker/execution target through existing contracts
- `C-LAB-OBSERVABILITY-001`
    - centralize metrics/log shipping from all three hosts
    - label dashboards and alerts by host role (`control`, `lab`, `gpu`)
    - preserve correlation-first triage across host boundaries
- `A-CONTROL-PLANE-K8S-BASELINE-001`
    - install k3s on `dev-control-1`
    - define the namespace/config/ingress baseline for the GPUaaS platform core
    - keep stateful infra outside the cluster for the first cut
- `A-CONTROL-PLANE-K8S-CORE-DEPLOY-001`
    - deploy GPUaaS core services to `dev-control-1` Kubernetes
    - prove health, rollout, and scaling behavior
    - keep platform-app control stacks on `dev-lab-1`
- `A-CONTROL-PLANE-K8S-CD-001`
    - automate the image build, publish, nightly deploy, and rollback path for `dev-control-1`
- `A-CONTROL-PLANE-K8S-INFRA-MIGRATION-001`
    - migrate selected infra services into the control-plane cluster one service at a time
    - initial order: Redis, NATS, Temporal, Keycloak, Postgres last
## Production Direction
This lab should mirror the initial production operating model:

1. platform control services on dedicated control infrastructure
2. tenant-dedicated or project-dedicated platform-app control planes on non-worker control nodes
3. worker/GPU nodes dedicated to execution and lifecycle management
For early production, each tenant may receive dedicated control and compute nodes for platform apps until platform-managed offerings are mature enough.
## Safety Rules
- No arbitrary CI jobs on `dev-gpu-1`
- No direct human debugging on production-like secrets without an audit trail
- No scheduler-specific control-plane state on the GPU worker host
- Every host must be rebuildable from automation
## Acceptance Criteria
- host responsibilities are documented and non-overlapping
- CI tag model is explicit
- automation layout is defined
- platform-app lab testing path is separated from real GPU worker validation
- MaaS expansion path is documented, not ad hoc