Runbook: Three-Host Lab Incident

Trigger

  1. The three-host lab shows an incident that could belong to dev-control-1, dev-lab-1, or dev-gpu-1.
  2. Platform-app validation fails and the owning host is unclear.
  3. MaaS, node-agent, or scheduler-reference testing shows cross-host symptoms.
  4. Support has a correlation_id but cannot yet tell whether the failure is on:
     - the platform control host
     - the app control host
     - the real worker compute host

Required Context

  1. correlation_id from the user/API/CLI/SDK error envelope (see the extraction sketch after this list).
  2. trace_id if present.
  3. Host-role context:
     - host_role
     - host_name
     - node_id when a GPU worker is involved
     - app_slug or lab stack name when a platform-app control stack is involved
  4. Effective runtime boundary:
     - operating_mode
     - control_plane_scope
     - runtime_backend
     - tenant_boundary_mode
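
A minimal extraction sketch for the two primary identifiers, assuming the error envelope is saved as JSON and that the file name and field names shown here match what the API/CLI/SDK actually returns (both are assumptions, not a confirmed schema):

    # Pull the two primary identifiers out of a saved error envelope.
    # error-envelope.json and the field names are assumed, not confirmed.
    jq -r '.correlation_id // "missing"' error-envelope.json
    jq -r '.trace_id // "absent"' error-envelope.json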

Host-Role Model

Treat these as different incident domains:

  1. platform_control (dev-control-1)
     - GitLab, runners, registry, control-plane stack, observability services
  2. app_control (dev-lab-1)
     - node-shim, scheduler reference stacks, Slurm controller, Ray head, lightweight K8s control-plane experiments
  3. worker_compute (dev-gpu-1)
     - node-agent, allocation runtime, terminal path, GPU validation, future MaaS worker validation

Do not collapse them into one generic "dev environment" incident.

Correlation-First Triage

  1. API/control-plane logs by correlation_id (a runnable sketch follows this list):
     {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"
  2. Lab host stack logs by correlation_id or stack label:
     {host_role="app_control"} | json | correlation_id="<CORRELATION_ID>"
  3. GPU worker logs by correlation_id or node_id:
     {host_role="worker_compute"} | json | correlation_id="<CORRELATION_ID>"
  4. Control host support services:
     {host_role="platform_control"} | json | correlation_id="<CORRELATION_ID>"
  5. If trace_id exists, open Tempo and verify span ownership across:
     - gpuaas-api
     - gpuaas-outbox-relay
     - gpuaas-app-runtime-worker
     - terminal or node-agent-adjacent services when relevant
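
A runnable sketch of the lookups above, assuming Loki is queried through logcli and Tempo exposes its standard trace-by-ID endpoint; LOKI_ADDR, TEMPO_ADDR, and the one-hour window are placeholders:

    CID="<CORRELATION_ID>"

    # Same query shape across the API service and all three host roles.
    for sel in 'service="gpuaas-api"' 'host_role="app_control"' \
               'host_role="worker_compute"' 'host_role="platform_control"'; do
      echo "== {$sel} =="
      logcli query --addr="$LOKI_ADDR" --since=1h --limit=50 \
        "{$sel} | json | correlation_id=\"$CID\""
    done

    # If a trace_id exists, pull the trace and inspect span ownership.
    curl -s "$TEMPO_ADDR/api/traces/<TRACE_ID>" | jq .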

Fast Separation Rules

  1. If GitLab, registry, dashboard, or a central control-plane deploy is failing first, classify as platform_control.
  2. If the issue starts at Slurm/Ray/K8s reference stack boundaries, classify as app_control.
  3. If node provisioning, terminal, GPU validation, or worker execution fails, classify as worker_compute.
  4. If symptoms cross roles, keep the first failing boundary and capture downstream impact separately (a decision sketch follows this list).
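
A decision sketch of the separation rules; the component names matched below are illustrative examples, and the mapping from first failing boundary to incident domain is the point, not the exact strings:

    # classify <first-failing-component> -> incident domain
    classify() {
      case "$1" in
        gitlab|registry|dashboard|control-plane*) echo platform_control ;;
        slurm*|ray*|k8s*|node-shim)               echo app_control ;;
        node-agent|terminal|gpu-*|provisioning)   echo worker_compute ;;
        *) echo "unknown: gather host_role evidence first" ;;
      esac
    }

    classify gitlab       # -> platform_control
    classify node-agent   # -> worker_compute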

Common Failure Classes

Platform Control

  1. CI image pull/build drift
  2. control-plane stack unavailable
  3. observability stack degraded
  4. registry or runner failure

App Control

  1. scheduler reference control stack misconfigured
  2. platform-app control component unavailable
  3. node-shim or adapter path failure
  4. stack-local port/state collision

Worker Compute

  1. node-agent offline or cert/runtime issue
  2. allocation runtime or SSH/terminal failure
  3. GPU runtime validation failure
  4. worker cannot join or communicate with a lab-host control plane (a health-sweep sketch follows this list)
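
A minimal health sweep for these worker_compute failure classes, assuming node-agent runs as a systemd unit of that name and GPU validation goes through nvidia-smi (both assumptions; substitute the real unit and tooling):

    # On dev-gpu-1: check the usual failure points in order.
    systemctl status node-agent --no-pager      # offline? cert/runtime errors?
    journalctl -u node-agent --since "1 hour ago" | tail -n 50
    nvidia-smi                                  # GPU runtime validation
    ss -tlnp | grep -E 'sshd?|terminal'         # SSH/terminal path listening?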

Boundary Validation

  1. Confirm platform-app control components are not running on the GPU worker host.
  2. Confirm arbitrary CI workloads are not running on the GPU worker host.
  3. Confirm the lab host is not being treated as the platform control-plane host.
  4. Confirm host_role is present in the incident evidence before escalation (a check sketch follows this list).
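
A sketch of these checks, assuming the stacks run as Docker containers whose names reflect their stack and that evidence is collected as JSON (all assumptions; adapt the filters and file name to the real layout):

    # On dev-gpu-1: nothing here should look like a control stack or CI runner.
    docker ps --format '{{.Names}}' | grep -Ei 'slurm|ray|k8s|gitlab|runner' \
      && echo "BOUNDARY VIOLATION: control/CI workload on worker_compute host"

    # Refuse to escalate evidence that lacks host_role.
    jq -e '.host_role' incident-evidence.json >/dev/null \
      || echo "missing host_role in evidence"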

Recovery Guidance

  1. platform_control incident:
     - restore the GitLab/registry/control-plane/observability baseline first
  2. app_control incident:
     - repair or restart only the affected platform-app control stack (a restart sketch follows this list)
     - do not "fix" it by moving the control-plane function onto the GPU host
  3. worker_compute incident:
     - restore node-agent/runtime/worker readiness first
     - do not debug it as a scheduler-controller incident unless the join path proves that boundary
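
A scoped-recovery sketch for the app_control case, assuming the affected platform-app stack is a Docker Compose project on dev-lab-1; the project name and compose path are placeholders:

    # Restart only the affected stack -- never by moving control-plane
    # functions onto the GPU host.
    docker compose -p "<APP_SLUG>" -f /opt/stacks/<APP_SLUG>/compose.yaml restart
    docker compose -p "<APP_SLUG>" ps   # verify the stack came back cleanly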

Evidence to Capture

  1. correlation_id, trace_id, and the exact failing action
  2. host_role and host_name
  3. node_id if the GPU host is involved
  4. app_slug or stack identifier if a lab-host platform-app stack is involved
  5. First failing boundary, one of:
     - platform_control
     - app_control
     - worker_compute
  6. Whether the issue is single-host or cross-host (a capture template follows this list)
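
A capture template matching the list above; the file name and exact field names are a suggested shape, not a mandated schema:

    cat > incident-evidence.json <<'EOF'
    {
      "correlation_id": "<CORRELATION_ID>",
      "trace_id": "<TRACE_ID or null>",
      "failing_action": "<exact failing API/CLI/SDK action>",
      "host_role": "<platform_control | app_control | worker_compute>",
      "host_name": "<dev-control-1 | dev-lab-1 | dev-gpu-1>",
      "node_id": "<NODE_ID, if the GPU host is involved>",
      "app_slug": "<APP_SLUG, if a lab-host stack is involved>",
      "first_failing_boundary": "<platform_control | app_control | worker_compute>",
      "cross_host": false
    }
    EOF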

Escalation Rule

If incident evidence does not preserve host_role and correlation_id, treat that as an observability gap and file follow-up work. Do not close the incident as "environment flaky" without a boundary-specific diagnosis.