Three Host Dev CI MaaS Lab Plan

Goal

Define a repeatable dev and integration lab that separates:

  1. control-plane and CI concerns
  2. integration and platform-app control-plane testing
  3. real GPU worker validation

This is the baseline operating model for the current hardware:

  1. 100.90.157.34
  2. 100.69.173.30
  3. 100.88.227.60

Host Roles

Production-shaped logical roles:

  1. platform_control
  2. app_control
  3. worker_compute

Current hostnames stay the same for now:

  1. dev-control-1
  2. dev-lab-1
  3. dev-gpu-1

Hostnames are environment instances. The logical role names above are the architectural source of truth.

100.90.157.34 - dev-control-1

Logical role: platform_control

Primary responsibilities:

  1. GitLab
  2. GitLab Runner
  3. container registry
  4. dev control-plane stack
  5. central observability

This host must not be used as a general test worker for destructive hardware scenarios.

Control-plane deployment direction:

  1. local-dev-style manual compose is not the target operating model for this host
  2. the next phase is Kubernetes (k3s) for the GPUaaS platform core
  3. platform-app control stacks remain off this host
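
As a hedged sketch of that direction, the k3s baseline could be converged from a dedicated Ansible role (named control_plane_k3s under Recommended roles below). The script path and task layout here are illustrative assumptions, not a settled implementation; only the upstream get.k3s.io install script and its INSTALL_K3S_EXEC variable are standard:

```yaml
# Sketch of a control_plane_k3s role task file; paths and names are assumptions.
- name: Download the upstream k3s install script
  ansible.builtin.get_url:
    url: https://get.k3s.io
    dest: /usr/local/bin/k3s-install.sh
    mode: "0755"

- name: Install k3s in server mode (skipped once the binary exists)
  ansible.builtin.command: /usr/local/bin/k3s-install.sh
  environment:
    INSTALL_K3S_EXEC: server
  args:
    creates: /usr/local/bin/k3s
```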

100.69.173.30 - dev-lab-1

Logical role: app_control

Primary responsibilities:

  1. node-shim and synthetic integration harness
  2. scheduler/app control-plane test host
  3. Slurm controller, Ray head, or lightweight K8s control-plane experiments
  4. destructive integration jobs that should not touch the real GPU worker

This host is the shared lab control host for platform apps.

100.88.227.60 - dev-gpu-1

Logical role: worker_compute

Primary responsibilities:

  1. real node-agent validation
  2. real GPU runtime checks
  3. real allocation and terminal validation
  4. MaaS lifecycle evidence target
  5. future GPU worker participation in tenant-dedicated app backends

This host should remain as close as possible to worker semantics.

Design Rules

  1. Do not mix CI/control-plane services with real GPU workload validation on the same host.
  2. Do not overload the GPU host with scheduler or app control-plane daemons.
  3. Platform-app control components belong on dev-lab-1, not on GPU workers.
  4. Automation must converge hosts from configuration; do not rely on manual shell history as state.
  5. Real GPU jobs must run through tagged and serialized pipelines only.

Control-Plane and Platform-App Testing Model

dev-lab-1 is the place to validate platform apps such as:

  1. Slurm controller
  2. Ray head
  3. lightweight K8s control plane
  4. app runtime adapters

dev-gpu-1 joins or interacts with those control planes as a worker node or execution target.

This mirrors the intended long-term platform model:

  1. GPUaaS provides primitives
  2. platform apps own their own runtime/control logic
  3. node-agent remains focused on bounded host/node responsibilities

CI Runner Model

Use distinct runner tags:

  1. platform-control
  2. sim
  3. app-control
  4. worker-compute

Guidance:

  1. platform-control for contract, build, docs, and control-plane deploy jobs
  2. sim for node-shim and fast integration
  3. app-control for scheduler/app-control-plane integration tests
  4. worker-compute for real hardware validation only

GPU-tagged jobs should be serialized and initially manual or protected.
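
A minimal .gitlab-ci.yml sketch of this tag model, assuming hypothetical job names and scripts. resource_group is GitLab's built-in mechanism for serializing jobs that share a resource, which fits the one-real-GPU-host constraint:

```yaml
# Sketch only: job names, stage, and scripts are placeholders.
sim-integration:
  stage: test
  tags: [sim]                  # fast node-shim integration on dev-lab-1
  script:
    - ./scripts/run-node-shim-suite.sh   # hypothetical harness entry point

gpu-smoke:
  stage: test
  tags: [worker-compute]       # routes the job to the dev-gpu-1 runner only
  resource_group: dev-gpu-1    # serializes all jobs that touch the real GPU host
  when: manual                 # keep real-hardware jobs manual until proven
  script:
    - nvidia-smi               # placeholder hardware check
```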

Automation Layout

Recommended repo structure:

  1. infra/ansible/
  2. infra/ansible/inventory/environments/
  3. infra/env/dev-control/
  4. infra/env/dev-lab/
  5. infra/env/dev-gpu/
  6. infra/systemd/
  7. build/ci-image/
  8. infra/k8s/

Environment creation rule:

  1. the three-host lab is the first instance of a reusable environment blueprint
  2. future test, staging, or production environments should be created by changing environment inventory/config, not by writing new deployment logic
  3. control-plane growth from dev-control-1 to dev-control-1 + dev-control-2 must be a Kubernetes/control-plane expansion path, not a platform redesign

Recommended roles:

  1. common
  2. platform_control
  3. control_plane_k3s
  4. gitlab
  5. runner
  6. app_control
  7. worker_compute
  8. observability_agent
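
A sketch of the dev inventory grouped by logical role, shown in Ansible's YAML inventory format for illustration (the converge commands below reference an INI hosts.ini; the structure is the same):

```yaml
# YAML equivalent of infra/ansible/inventory/environments/dev/hosts.ini
all:
  children:
    platform_control:
      hosts:
        dev-control-1:
          ansible_host: 100.90.157.34
    app_control:
      hosts:
        dev-lab-1:
          ansible_host: 100.69.173.30
    worker_compute:
      hosts:
        dev-gpu-1:
          ansible_host: 100.88.227.60
```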

Converge Path

The baseline converge path is owned by infra/ansible/.

Local operator flow:

  1. validate inventory and playbook syntax
  2. ping the three hosts
  3. converge all hosts or a specific host group

Reference commands:

  1. GPUAAS_ENVIRONMENT=dev ansible-playbook -i infra/ansible/inventory/environments/dev/hosts.ini infra/ansible/playbooks/site.yml --syntax-check
  2. GPUAAS_ENVIRONMENT=dev ansible all -i infra/ansible/inventory/environments/dev/hosts.ini -m ping
  3. GPUAAS_ENVIRONMENT=dev ansible-playbook -i infra/ansible/inventory/environments/dev/hosts.ini infra/ansible/playbooks/site.yml
  4. GPUAAS_ENVIRONMENT=dev ansible-playbook -i infra/ansible/inventory/environments/dev/hosts.ini infra/ansible/playbooks/site.yml --limit worker_compute

The initial baseline covers:

  1. common packages and Docker runtime
  2. base directories under /opt/gpuaas, /var/lib/gpuaas, /var/log/gpuaas, /etc/gpuaas
  3. journald retention settings
  4. explicit host-role markers for platform_control, app_control, and worker_compute
  5. k3s baseline on the platform-control host
  6. baseline Kubernetes namespaces for the platform core
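
A sketch of the common-role tasks behind items 2-4, assuming a gpuaas_host_role group variable, a /etc/gpuaas/host-role marker path, and a 1G retention cap; all three are illustrative assumptions, not settled conventions:

```yaml
# Sketch of common-role tasks; gpuaas_host_role and the marker path are assumptions.
- name: Create the gpuaas base directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
    mode: "0755"
  loop:
    - /opt/gpuaas
    - /var/lib/gpuaas
    - /var/log/gpuaas
    - /etc/gpuaas

- name: Write an explicit host-role marker
  ansible.builtin.copy:
    content: "{{ gpuaas_host_role }}\n"   # platform_control, app_control, or worker_compute
    dest: /etc/gpuaas/host-role
    mode: "0644"

- name: Ensure the journald drop-in directory exists
  ansible.builtin.file:
    path: /etc/systemd/journald.conf.d
    state: directory
    mode: "0755"

- name: Cap journald disk usage (journald must be restarted to apply)
  ansible.builtin.copy:
    content: |
      [Journal]
      SystemMaxUse=1G
    dest: /etc/systemd/journald.conf.d/gpuaas.conf
    mode: "0644"
```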

The initial baseline does not yet deploy:

  1. GitLab
  2. scheduler/runtime stacks
  3. MaaS

Day 0-2 Tasks

  1. Baseline all three hosts:
     - hostname
     - package baseline
     - Docker/Compose
     - journald retention
     - metrics/log shipping
  2. Stand up GitLab + runner + registry on dev-control-1
  3. Define runner tags and scheduling rules
  4. Install node-shim and platform-app lab wrappers on dev-lab-1
  5. Install real node-agent service wrapper and GPU validation tools on dev-gpu-1

Day 3-5 Tasks

  1. Move CI setup drift into a reusable base image
  2. Converge the dev control-plane stack from automation
  3. Extend observability from platform_control to app_control and worker_compute
  4. Add Slurm controller or reference platform-app stack to dev-lab-1
  5. Validate dev-gpu-1 as worker target through the existing control-plane contracts

Current reference baseline:

  1. infra/env/dev-lab/platform-apps/slurm-reference/
  2. infra/env/dev-gpu/slurm-reference/
  3. doc/operations/Slurm_Reference_Lab_Stack.md

CI base image baseline location:

  1. build/ci-image/Dockerfile
  2. build/ci-image/build.sh

Runner adoption model:

  1. publish from dev-control-1 to the control-plane registry
  2. grant tagged runners pull access
  3. switch the pipeline default image to the published digest once proven
  4. retain job-level fallback install logic only for bootstrap and recovery
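
Once the image is proven, step 3 could look like the following pipeline fragment; the registry host is a placeholder and the digest stays elided until publication:

```yaml
# Sketch: registry host and digest are placeholders for the published CI image.
default:
  image: registry.example.internal/gpuaas/ci-image@sha256:<digest>
```

Pinning by digest rather than by tag keeps runner adoption reproducible and makes rollback a one-line revert.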

Day 6-7 Tasks

  1. Run node-shim integration lane on dev-lab-1
  2. Run real GPU smoke lane on dev-gpu-1
  3. Capture MaaS baseline evidence when available
  4. Document reset and teardown procedures per host role
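
For item 4, a hypothetical per-role teardown sketch for the worker host; the unit name and state-path handling are assumptions to be replaced by the documented procedures:

```yaml
# Hypothetical reset playbook for worker_compute; the unit name is an assumption.
- name: Reset worker_compute host to baseline
  hosts: worker_compute
  become: true
  tasks:
    - name: Stop the node-agent service if present
      ansible.builtin.systemd:
        name: gpuaas-node-agent
        state: stopped
      failed_when: false   # tolerate a host where the unit was never installed

    - name: Clear transient lab state
      ansible.builtin.file:
        path: /var/lib/gpuaas
        state: absent

    - name: Recreate the empty state directory
      ansible.builtin.file:
        path: /var/lib/gpuaas
        state: directory
        mode: "0755"
```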

Follow-On Execution Seeds

This baseline should immediately drive implementation tasks rather than remain documentation-only.

  1. A-LAB-AUTOMATION-001
     - create infra/ansible/ baseline
     - define host inventory and common roles
     - converge package/runtime baseline on all three hosts

  2. A-CI-BASE-IMAGE-001
     - build and publish a reusable CI image
     - move toolchain drift out of job scripts
     - consume the image by digest in pipeline jobs

  3. A-LAB-PLATFORM-APP-STACK-001
     - stand up the first reference platform-app control stack on dev-lab-1
     - initial target: Slurm controller or equivalent scheduler reference stack
     - prove dev-gpu-1 can join as a worker/execution target through existing contracts

  4. C-LAB-OBSERVABILITY-001
     - centralize metrics/log shipping from all three hosts
     - label dashboards and alerts by host role (control, lab, gpu)
     - preserve correlation-first triage across host boundaries

  5. A-CONTROL-PLANE-K8S-BASELINE-001
     - install k3s on dev-control-1
     - define the namespace/config/ingress baseline for the GPUaaS platform core
     - keep stateful infra outside the cluster for the first cut

  6. A-CONTROL-PLANE-K8S-CORE-DEPLOY-001
     - deploy GPUaaS core services to dev-control-1 Kubernetes
     - prove health, rollout, and scaling behavior
     - keep platform-app control stacks on dev-lab-1

  7. A-CONTROL-PLANE-K8S-CD-001
     - automate image build, publish, nightly deploy, and rollback path for dev-control-1

  8. A-CONTROL-PLANE-K8S-INFRA-MIGRATION-001
     - migrate selected infra services into the control-plane cluster one service at a time
     - initial order: Redis, NATS, Temporal, Keycloak, Postgres last
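
For A-CONTROL-PLANE-K8S-BASELINE-001, the namespace baseline could start as plain manifests under infra/k8s/; the namespace names are assumptions for illustration:

```yaml
# Sketch of a namespace baseline under infra/k8s/; names are assumptions.
apiVersion: v1
kind: Namespace
metadata:
  name: gpuaas-core            # GPUaaS platform core services
---
apiVersion: v1
kind: Namespace
metadata:
  name: gpuaas-observability   # central observability components
```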

Production Direction

This lab should mirror the initial production operating model:

  1. platform control services on dedicated control infrastructure
  2. tenant-dedicated or project-dedicated platform-app control planes on non-worker control nodes
  3. worker/GPU nodes dedicated to execution and lifecycle management

For early production, each tenant may receive dedicated control and compute nodes for platform apps until platform-managed offerings are mature enough.

Safety Rules

  1. No arbitrary CI jobs on dev-gpu-1
  2. No direct human debugging with production-like secrets without an audit trail
  3. No scheduler-specific control-plane state on the GPU worker host
  4. Every host must be rebuildable from automation

Acceptance Criteria

  1. host responsibilities are documented and non-overlapping
  2. CI tag model is explicit
  3. automation layout is defined
  4. platform-app lab testing path is separated from real GPU worker validation
  5. MaaS expansion path is documented, not ad hoc