Runbook: Three-Host Lab Incident

Trigger

  1. The three-host lab shows an incident that could belong to dev-control-1, dev-lab-1, or dev-gpu-1.
  2. Platform-app validation fails and the owning host is unclear.
  3. MaaS, node-agent, or scheduler-reference testing shows cross-host symptoms.
  4. Support has a correlation_id but cannot yet tell whether the failure is on:
     - the platform control host
     - the app control host
     - the real worker compute host

Required Context

  1. correlation_id from the user/API/CLI/SDK error envelope (see the extraction sketch after this list).
  2. trace_id if present.
  3. Host-role context:
     - host_role
     - host_name
     - node_id when a GPU worker is involved
     - app_slug or lab stack name when a platform-app control stack is involved
  4. Effective runtime boundary:
     - operating_mode
     - control_plane_scope
     - runtime_backend
     - tenant_boundary_mode
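
A minimal extraction sketch for the two primary identifiers, assuming the error envelope is saved as JSON and that the file name and field names shown here match what the API/CLI/SDK actually returns (both are assumptions, not a confirmed schema):

    # Pull the two primary identifiers out of a saved error envelope.
    # error-envelope.json and the field names are assumed, not confirmed.
    jq -r '.correlation_id // "missing"' error-envelope.json
    jq -r '.trace_id // "absent"' error-envelope.json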

Host-Role Model

Treat these as different incident domains:

  1. platform_control (dev-control-1)
     - GitLab, runners, registry, control-plane stack, observability services
  2. app_control (dev-lab-1)
     - node-shim, scheduler reference stacks, Slurm controller, Ray head, lightweight K8s control-plane experiments
  3. worker_compute (dev-gpu-1)
     - node-agent, allocation runtime, terminal path, GPU validation, future MaaS worker validation

Do not collapse them into one generic "dev environment" incident.

Correlation-First Triage

  1. API/control-plane logs by correlation_id (a runnable sketch follows this list):
     {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"
  2. Lab host stack logs by correlation_id or stack label:
     {host_role="app_control"} | json | correlation_id="<CORRELATION_ID>"
  3. GPU worker logs by correlation_id or node_id:
     {host_role="worker_compute"} | json | correlation_id="<CORRELATION_ID>"
  4. Control host support services:
     {host_role="platform_control"} | json | correlation_id="<CORRELATION_ID>"
  5. If trace_id exists, open Tempo and verify span ownership across:
     - gpuaas-api
     - gpuaas-outbox-relay
     - gpuaas-app-runtime-worker
     - terminal or node-agent-adjacent services when relevant
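
A runnable sketch of the lookups above, assuming Loki is queried through logcli and Tempo exposes its standard trace-by-ID endpoint; LOKI_ADDR, TEMPO_ADDR, and the one-hour window are placeholders:

    CID="<CORRELATION_ID>"

    # Same query shape across the API service and all three host roles.
    for sel in 'service="gpuaas-api"' 'host_role="app_control"' \
               'host_role="worker_compute"' 'host_role="platform_control"'; do
      echo "== {$sel} =="
      logcli query --addr="$LOKI_ADDR" --since=1h --limit=50 \
        "{$sel} | json | correlation_id=\"$CID\""
    done

    # If a trace_id exists, pull the trace and inspect span ownership.
    curl -s "$TEMPO_ADDR/api/traces/<TRACE_ID>" | jq .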

Fast Separation Rules

  1. If GitLab, registry, dashboard, or a central control-plane deploy is failing first, classify as platform_control.
  2. If the issue starts at Slurm/Ray/K8s reference stack boundaries, classify as app_control.
  3. If node provisioning, terminal, GPU validation, or worker execution fails, classify as worker_compute.
  4. If symptoms cross roles, keep the first failing boundary and capture downstream impact separately (a decision sketch follows this list).
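
A decision sketch of the separation rules; the component names matched below are illustrative examples, and the mapping from first failing boundary to incident domain is the point, not the exact strings:

    # classify <first-failing-component> -> incident domain
    classify() {
      case "$1" in
        gitlab|registry|dashboard|control-plane*) echo platform_control ;;
        slurm*|ray*|k8s*|node-shim)               echo app_control ;;
        node-agent|terminal|gpu-*|provisioning)   echo worker_compute ;;
        *) echo "unknown: gather host_role evidence first" ;;
      esac
    }

    classify gitlab       # -> platform_control
    classify node-agent   # -> worker_compute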

Common Failure Classes

Platform Control

  1. CI image pull/build drift
  2. control-plane stack unavailable
  3. observability stack degraded
  4. registry or runner failure

App Control

  1. scheduler reference control stack misconfigured
  2. platform-app control component unavailable
  3. node-shim or adapter path failure
  4. stack-local port/state collision

Worker Compute

  1. node-agent offline or cert/runtime issue
  2. allocation runtime or SSH/terminal failure
  3. GPU runtime validation failure
  4. worker cannot join or communicate with a lab-host control plane (a health-sweep sketch follows this list)
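
A minimal health sweep for these worker_compute failure classes, assuming node-agent runs as a systemd unit of that name and GPU validation goes through nvidia-smi (both assumptions; substitute the real unit and tooling):

    # On dev-gpu-1: check the usual failure points in order.
    systemctl status node-agent --no-pager      # offline? cert/runtime errors?
    journalctl -u node-agent --since "1 hour ago" | tail -n 50
    nvidia-smi                                  # GPU runtime validation
    ss -tlnp | grep -E 'sshd?|terminal'         # SSH/terminal path listening?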

Boundary Validation

  1. Confirm platform-app control components are not running on the GPU worker host.
  2. Confirm arbitrary CI workloads are not running on the GPU worker host.
  3. Confirm the lab host is not being treated as the platform control-plane host.
  4. Confirm host_role is present in the incident evidence before escalation (a check sketch follows this list).
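
A sketch of these checks, assuming the stacks run as Docker containers whose names reflect their stack and that evidence is collected as JSON (all assumptions; adapt the filters and file name to the real layout):

    # On dev-gpu-1: nothing here should look like a control stack or CI runner.
    docker ps --format '{{.Names}}' | grep -Ei 'slurm|ray|k8s|gitlab|runner' \
      && echo "BOUNDARY VIOLATION: control/CI workload on worker_compute host"

    # Refuse to escalate evidence that lacks host_role.
    jq -e '.host_role' incident-evidence.json >/dev/null \
      || echo "missing host_role in evidence"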

Recovery Guidance

  1. platform_control incident:
     - restore the GitLab/registry/control-plane/observability baseline first
  2. app_control incident:
     - repair or restart only the affected platform-app control stack (a restart sketch follows this list)
     - do not "fix" it by moving the control-plane function onto the GPU host
  3. worker_compute incident:
     - restore node-agent/runtime/worker readiness first
     - do not debug it as a scheduler-controller incident unless the join path proves that boundary
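
A scoped-recovery sketch for the app_control case, assuming the affected platform-app stack is a Docker Compose project on dev-lab-1; the project name and compose path are placeholders:

    # Restart only the affected stack -- never by moving control-plane
    # functions onto the GPU host.
    docker compose -p "<APP_SLUG>" -f /opt/stacks/<APP_SLUG>/compose.yaml restart
    docker compose -p "<APP_SLUG>" ps   # verify the stack came back cleanly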

Evidence to Capture

  1. correlation_id, trace_id, and the exact failing action
  2. host_role and host_name
  3. node_id if the GPU host is involved
  4. app_slug or stack identifier if a lab-host platform-app stack is involved
  5. First failing boundary, one of:
     - platform_control
     - app_control
     - worker_compute
  6. Whether the issue is single-host or cross-host (a capture template follows this list)
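
A capture template matching the list above; the file name and exact field names are a suggested shape, not a mandated schema:

    cat > incident-evidence.json <<'EOF'
    {
      "correlation_id": "<CORRELATION_ID>",
      "trace_id": "<TRACE_ID or null>",
      "failing_action": "<exact failing API/CLI/SDK action>",
      "host_role": "<platform_control | app_control | worker_compute>",
      "host_name": "<dev-control-1 | dev-lab-1 | dev-gpu-1>",
      "node_id": "<NODE_ID, if the GPU host is involved>",
      "app_slug": "<APP_SLUG, if a lab-host stack is involved>",
      "first_failing_boundary": "<platform_control | app_control | worker_compute>",
      "cross_host": false
    }
    EOF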

Escalation Rule

If incident evidence does not preserve host_role and correlation_id, treat that as an observability gap and file follow-up work. Do not close the incident as "environment flaky" without a boundary-specific diagnosis.