# Runbook: Three-Host Lab Incident

## Trigger

- The three-host lab shows an incident that could belong to `dev-control-1`, `dev-lab-1`, or `dev-gpu-1`.
- Platform-app validation fails and the owning host is unclear.
- MaaS, node-agent, or scheduler-reference testing shows cross-host symptoms.
- Support has a `correlation_id` but cannot yet tell whether the failure is on:
  - the platform control host
  - the app control host
  - the real worker compute host
## Required Context

- `correlation_id` from the user/API/CLI/SDK error envelope; `trace_id` if present.
- Host-role context:
  - `host_role`
  - `host_name`
  - `node_id` when a GPU worker is involved
  - `app_slug` or lab stack name when a platform-app control stack is involved
- Effective runtime boundary:
  - `operating_mode`
  - `control_plane_scope`
  - `runtime_backend`
  - `tenant_boundary_mode`
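A minimal sketch of pulling the first two identifiers out of an error envelope before triage starts. The envelope shape shown here (top-level `correlation_id` and optional `trace_id`) is an assumption, not a documented contract:

```python
import json

# Assumed envelope shape; the real error envelope fields may differ.
raw = '{"error": "validation failed", "correlation_id": "c-123", "trace_id": "t-456"}'

envelope = json.loads(raw)
correlation_id = envelope.get("correlation_id")  # required for triage
trace_id = envelope.get("trace_id")              # optional; enables the Tempo lookup below

if correlation_id is None:
    raise ValueError("error envelope is missing correlation_id; capture it before triage")
```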
## Host-Role Model
Treat these as different incident domains:
- `platform_control` (`dev-control-1`): GitLab, runners, registry, control-plane stack, observability services
- `app_control` (`dev-lab-1`): node-shim, scheduler reference stacks, Slurm controller, Ray head, lightweight K8s control-plane experiments
- `worker_compute` (`dev-gpu-1`): node-agent, allocation runtime, terminal path, GPU validation, future MaaS worker validation
Do not collapse them into one generic "dev environment" incident.
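When triage is scripted, a small lookup table keeps the three domains explicit. The host names and roles come from this runbook; the helper itself is an illustrative sketch:

```python
# Host-role map for the three-host lab; names come from this runbook.
HOST_ROLES = {
    "dev-control-1": "platform_control",
    "dev-lab-1": "app_control",
    "dev-gpu-1": "worker_compute",
}

def role_for(host_name: str) -> str:
    """Resolve a lab host to its incident domain; unknown hosts are an error, not a default."""
    try:
        return HOST_ROLES[host_name]
    except KeyError:
        raise ValueError(f"{host_name} is not a known lab host; do not guess a role") from None
```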
## Correlation-First Triage
- API/control-plane logs by
correlation_id: {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"- Lab host stack logs by
correlation_idor stack label: {host_role="app_control"} | json | correlation_id="<CORRELATION_ID>"- GPU worker logs by
correlation_idornode_id: {host_role="worker_compute"} | json | correlation_id="<CORRELATION_ID>"- Control host support services:
{host_role="platform_control"} | json | correlation_id="<CORRELATION_ID>"- If
trace_idexists, open Tempo and verify span ownership across: gpuaas-apigpuaas-outbox-relaygpuaas-app-runtime-worker- terminal or node-agent-adjacent services when relevant
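A sketch of running all four role-scoped queries from one place, using Loki's standard `/loki/api/v1/query_range` HTTP endpoint. The Loki base URL is an assumption about the lab's observability stack; the queries are the ones listed above:

```python
import time
import requests

LOKI_URL = "http://loki.observability.internal:3100"  # assumed lab Loki endpoint

# The four triage queries from this runbook, keyed by host role / scope.
QUERIES = {
    "api": '{service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"',
    "app_control": '{host_role="app_control"} | json | correlation_id="<CORRELATION_ID>"',
    "worker_compute": '{host_role="worker_compute"} | json | correlation_id="<CORRELATION_ID>"',
    "platform_control": '{host_role="platform_control"} | json | correlation_id="<CORRELATION_ID>"',
}

def triage(correlation_id: str, lookback_s: int = 3600) -> dict:
    """Run each role-scoped query and report how many log streams mention the correlation_id."""
    now = int(time.time())
    hits = {}
    for name, template in QUERIES.items():
        query = template.replace("<CORRELATION_ID>", correlation_id)
        resp = requests.get(
            f"{LOKI_URL}/loki/api/v1/query_range",
            # Loki expects start/end as Unix epoch nanoseconds.
            params={"query": query, "start": (now - lookback_s) * 10**9,
                    "end": now * 10**9, "limit": 100},
            timeout=10,
        )
        resp.raise_for_status()
        hits[name] = len(resp.json()["data"]["result"])
    return hits
```

A role whose query returns zero streams is unlikely to own the first failing boundary, which narrows the separation step that follows.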
## Fast Separation Rules
- If GitLab, registry, dashboard, or central control-plane deploy is failing first: classify as `platform_control`.
- If the issue starts at Slurm/Ray/K8s reference stack boundaries: classify as `app_control`.
- If node provisioning, terminal, GPU validation, or worker execution fails: classify as `worker_compute`.
- If symptoms cross roles: keep the first failing boundary and capture downstream impact separately (see the sketch after this list).
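These rules reduce to an ordered first-match check. The component keywords below are illustrative shorthand for the rule text, not an exhaustive inventory:

```python
# Ordered rules: the FIRST matching boundary wins, mirroring "keep the first failing boundary".
SEPARATION_RULES = [
    ("platform_control", {"gitlab", "registry", "dashboard", "control-plane-deploy"}),
    ("app_control", {"slurm", "ray", "k8s-reference", "node-shim"}),
    ("worker_compute", {"node-provisioning", "terminal", "gpu-validation", "worker-execution"}),
]

def classify(first_failing_component: str) -> str:
    """Map the first failing component to a host_role; downstream impact is recorded separately."""
    for role, components in SEPARATION_RULES:
        if first_failing_component in components:
            return role
    raise ValueError(f"unclassified component {first_failing_component!r}; find the boundary first")
```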
## Common Failure Classes

### Platform Control
- CI image pull/build drift
- control-plane stack unavailable
- observability stack degraded
- registry or runner failure
### App Control
- scheduler reference control stack misconfigured
- platform-app control component unavailable
- node-shim or adapter path failure
- stack-local port/state collision
### Worker Compute
- node-agent offline or cert/runtime issue
- allocation runtime or SSH/terminal failure
- GPU runtime validation failure
- worker cannot join or communicate with a lab-host control plane
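If incidents are tagged programmatically, the classes above reduce to a reverse lookup from failure tag to owning `host_role`. The tag strings are illustrative shorthand for the lists above:

```python
# Failure-class tags grouped by incident domain; labels are shorthand for the lists above.
FAILURE_CLASSES = {
    "platform_control": ["ci-image-drift", "control-plane-down", "observability-degraded", "registry-or-runner"],
    "app_control": ["scheduler-stack-misconfig", "app-control-unavailable", "node-shim-path", "port-state-collision"],
    "worker_compute": ["node-agent-offline", "allocation-or-terminal", "gpu-validation", "control-plane-join"],
}

def role_for_failure(tag: str) -> str:
    """Reverse lookup: which host_role owns a given failure-class tag."""
    for role, tags in FAILURE_CLASSES.items():
        if tag in tags:
            return role
    raise KeyError(tag)
```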
## Boundary Validation
- Confirm platform-app control components are not running on the GPU worker host.
- Confirm arbitrary CI workloads are not running on the GPU worker host.
- Confirm the lab host is not being treated as the platform control-plane host.
- Confirm `host_role` is present in the incident evidence before escalation.
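A sketch of the first two checks against a per-host service inventory. The inventory source (e.g. `docker ps` output collected from each host) and all service names here are assumptions:

```python
# Hypothetical per-host service inventory, e.g. collected from `docker ps` on each host.
inventory = {
    "dev-gpu-1": {"node-agent", "allocation-runtime"},
    "dev-lab-1": {"node-shim", "slurm-controller"},
    "dev-control-1": {"gitlab", "registry", "grafana"},
}

APP_CONTROL_SERVICES = {"node-shim", "slurm-controller", "ray-head"}  # illustrative names
CI_SERVICES = {"gitlab-runner"}                                       # illustrative names

def validate_boundaries(inv: dict) -> list[str]:
    """Return boundary violations: control or CI components found on the GPU worker host."""
    violations = []
    gpu_services = inv.get("dev-gpu-1", set())
    if gpu_services & APP_CONTROL_SERVICES:
        violations.append("platform-app control components running on dev-gpu-1")
    if gpu_services & CI_SERVICES:
        violations.append("CI workloads running on dev-gpu-1")
    return violations

print(validate_boundaries(inventory))  # expect [] in a healthy lab
```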
## Recovery Guidance

- `platform_control` incident: restore the GitLab/registry/control-plane/observability baseline first.
- `app_control` incident: repair or restart only the affected platform-app control stack; do not "fix" it by moving the control-plane function onto the GPU host.
- `worker_compute` incident: restore node-agent/runtime/worker readiness first; do not debug it as a scheduler-controller incident unless the join path proves that boundary.
## Evidence to Capture

- `correlation_id`, `trace_id`, and the exact failing action
- `host_role` and `host_name`
- `node_id` if the GPU host is involved
- `app_slug` or stack identifier if a lab-host platform-app stack is involved
- First failing boundary:
  - `platform_control`
  - `app_control`
  - `worker_compute`
- Whether the issue is single-host or cross-host
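One way to keep evidence complete is a typed record whose fields mirror this list; a minimal sketch, assuming incidents are filed from a script. The `escalation_ready` check matches the escalation rule below:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncidentEvidence:
    correlation_id: str
    host_role: str                   # platform_control | app_control | worker_compute
    host_name: str
    failing_action: str              # the exact failing action
    first_failing_boundary: str
    cross_host: bool                 # single-host vs cross-host
    trace_id: Optional[str] = None   # when tracing captured the request
    node_id: Optional[str] = None    # when the GPU worker is involved
    app_slug: Optional[str] = None   # when a platform-app control stack is involved

    def escalation_ready(self) -> bool:
        """Per the escalation rule: host_role and correlation_id must be preserved."""
        return bool(self.correlation_id and self.host_role)
```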
## Escalation Rule
If incident evidence does not preserve `host_role` and `correlation_id`, treat that as an observability gap and file follow-up work. Do not close the incident as "environment flaky" without a boundary-specific diagnosis.