
Runbook: Slurm Reference Instance Stuck in deploying

Trigger

  1. A slurm-reference app instance stays in deploying after the controller allocation is already active.
  2. The app-instance detail page shows a controller member in reconciling or a worker add operation in accepted/failed.
  3. The first Slurm instance in an environment never moves past initial bootstrap.

Impact

  1. The scheduler app never reaches a stable running state.
  2. App-instance bootstrap SSH trust can expire or churn while the runtime is stalled.
  3. Operators may misclassify the incident as an allocation or MAAS problem even though the owning failure is in the app-runtime adapter layer.

Required Context

  1. project_id
  2. app_instance_id
  3. app_slug
  4. Controller allocation id and bound node id
  5. Correlation id from the app-instance create/deploy request if available

Primary Checks

  1. Confirm the app-instance is really stalled in the app adapter layer (see the curl sketch after this list):
  2. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}
  3. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/members
  4. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/member-operations
  5. If the controller allocation is already active and bound, treat this as a Slurm adapter/controller incident, not an allocation incident.
  6. Verify whether a slurm-reference-controller is actually running in the target environment.
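
A minimal triage sketch, assuming a bearer-token API reachable with curl and jq from the operator workstation; $BASE, $TOKEN, $PROJECT_ID, and $APP_INSTANCE_ID are illustrative stand-ins for the Required Context values, and the auth scheme and base URL are assumptions:

    # Illustrative variables; auth scheme and base URL are assumptions.
    BASE="https://api.100-90-157-34.sslip.io"
    AUTH="Authorization: Bearer $TOKEN"

    # status=deploying despite an active, bound controller allocation classifies
    # this as an adapter/controller incident rather than an allocation incident.
    curl -s -H "$AUTH" "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID" | jq -r '.status'

    # Member state and adapter detail for the controller
    curl -s -H "$AUTH" "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID/members" | jq .

    # Member operations: worker adds stuck in accepted/failed
    curl -s -H "$AUTH" "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID/member-operations" | jq .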

Known First-Slice Failure Modes

  1. No slurm-reference-controller is deployed in the environment.
  2. The controller uses connection.hostname for SSH transport instead of connection.host.
  3. The bound node cannot resolve or reach node-api.100-90-157-34.sslip.io.
  4. The node-agent local certificate is expired.
  5. The node-agent enrollment token in /etc/gpuaas/node-agent.env is stale.
  6. App bootstrap SSH reconcile requests create node_tasks, but they stay queued until node-agent connectivity is restored.

Deep Diagnosis

1. Confirm the app-runtime state

  1. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}
  2. If status=deploying and the controller allocation is already bound, continue with this runbook.
  3. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/members
  4. Inspect controller adapter_detail.host.
  5. Inspect controller adapter_detail.hostname.
  6. Inspect worker/member failures.
  7. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/member-operations
  8. Look for worker add operations stuck in accepted or failed with SSH/bootstrap errors (see the sketch after this list).
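
A sketch for pulling the SSH-relevant fields, reusing the $BASE/$AUTH variables from Primary Checks; the array response shape and the role/status field names are assumptions, while adapter_detail.host and adapter_detail.hostname are the fields named above:

    # Controller/member SSH fields (response assumed to be a JSON array)
    curl -s -H "$AUTH" "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID/members" \
      | jq '.[] | {role, status, host: .adapter_detail.host, hostname: .adapter_detail.hostname}'

    # Worker add operations stuck in accepted or failed
    curl -s -H "$AUTH" "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID/member-operations" \
      | jq '.[] | select(.status == "accepted" or .status == "failed")'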

2. Validate the controller deployment path

  1. Confirm whether the environment has a long-running slurm-reference-controller.
  2. If not, the first owning fix is to bring up the controller, not to retry app-instance create repeatedly.

3. Validate SSH target semantics

  1. For SSH transport, host must be the reachable address.
  2. hostname is display/identity data and may be unresolvable from the controller environment.
  3. If the controller or worker operation is trying to SSH to a MAAS hostname such as j27u15, fix the adapter to prefer host (see the sketch below).
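
A minimal sketch of the intended selection logic, with illustrative variable names and an assumed login user; the point is only that transport should bind to host and fall back to hostname, never the reverse:

    # MEMBER_HOST / MEMBER_HOSTNAME are illustrative; populate from adapter_detail.
    SSH_TARGET="${MEMBER_HOST:-$MEMBER_HOSTNAME}"   # prefer host; hostname is identity/display only
    ssh -o BatchMode=yes "ubuntu@$SSH_TARGET" true \
      || echo "SSH transport target unreachable: $SSH_TARGET"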

4. Validate node-agent health on the bound node

  1. SSH to the controller allocation host.
  2. Check:
  3. systemctl status gpuaas-node-agent
  4. journalctl -u gpuaas-node-agent -n 200 --no-pager
  5. Common evidence (a one-shot sweep for these strings follows this list):
  6. no route to host against node-api.100-90-157-34.sslip.io
  7. local certificate expired
  8. 401 token_invalid
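
A one-shot sweep on the bound node for the evidence above; the grep patterns mirror the listed strings, and exact journal phrasing may vary by agent version:

    systemctl is-active gpuaas-node-agent
    journalctl -u gpuaas-node-agent -n 200 --no-pager \
      | grep -Ei 'no route to host|certificate expired|token_invalid'
    # Confirm the node can resolve the node API name at all
    getent hosts node-api.100-90-157-34.sslip.io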

5. Validate bootstrap SSH reconcile

  1. POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/bootstrap-ssh/reconcile
  2. If reconcile requests are accepted but never applied, inspect the resulting node task lifecycle and node-agent logs (see the sketch below).
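
A reconcile trigger sketch, reusing the earlier variables; note that an accepted response alone is not success, since the resulting node task must also be dispatched and completed:

    curl -s -X POST -H "$AUTH" \
      "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID/bootstrap-ssh/reconcile"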

Recovery

A. Restore node-agent reachability

  1. Ensure the bound node can resolve the public API names to a reachable control-plane IP (see the /etc/hosts sketch after this list).
  2. For the platform-control incident on 2026-04-08, the working mapping was:
  3. api.100-90-157-34.sslip.io -> 10.176.46.104
  4. node-api.100-90-157-34.sslip.io -> 10.176.46.104
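
If DNS cannot serve the sslip.io names from the node, a static mapping unblocks the agent; a sketch using the 2026-04-08 incident IP, which must be re-verified before reuse:

    # Run on the bound node; verify 10.176.46.104 is still the control-plane IP first.
    echo '10.176.46.104 api.100-90-157-34.sslip.io'      | sudo tee -a /etc/hosts
    echo '10.176.46.104 node-api.100-90-157-34.sslip.io' | sudo tee -a /etc/hosts
    getent hosts node-api.100-90-157-34.sslip.io         # confirm resolution now succeeds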

B. Recover node-agent identity

  1. If the local cert is expired, move the stale cert/key aside and restart the agent.
  2. If the restart then fails with token_invalid, mint a fresh enrollment token:
  3. POST /api/v1/admin/nodes/{node_id}/enrollment-token
  4. Update /etc/gpuaas/node-agent.env with the new token and restart gpuaas-node-agent.
  5. Confirm journal evidence (see the combined sketch after this list):
  6. node certificate enrolled
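
A combined recovery sketch; the cert/key paths under /etc/gpuaas are assumptions inferred from the node-agent.env location, and the env key that carries the token varies by agent build:

    # On the bound node: move stale identity material aside and restart.
    sudo mv /etc/gpuaas/node-agent.crt /etc/gpuaas/node-agent.crt.bak   # assumed path
    sudo mv /etc/gpuaas/node-agent.key /etc/gpuaas/node-agent.key.bak   # assumed path
    sudo systemctl restart gpuaas-node-agent

    # From an admin workstation, if the agent now logs 401 token_invalid:
    curl -s -X POST -H "$AUTH" "$BASE/api/v1/admin/nodes/$NODE_ID/enrollment-token"

    # Back on the node: install the new token, restart, confirm enrollment.
    sudo vi /etc/gpuaas/node-agent.env   # update the token entry with the minted value
    sudo systemctl restart gpuaas-node-agent
    sudo journalctl -u gpuaas-node-agent -n 50 --no-pager | grep -i 'node certificate enrolled'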

C. Reconcile bootstrap SSH trust

  1. Reissue the reconcile request:
  2. POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/bootstrap-ssh/reconcile
  3. Confirm the resulting allocation.reconcile_managed_authorized_key node task is dispatched and completed.
  4. Confirm the managed key appears in the allocation user’s authorized_keys (see the sketch below).
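
A verification sketch; $ALLOC_USER is an illustrative name for the allocation user, and the managed key's comment/marker is not specified here, so inspect the file directly:

    # From the operator workstation: reissue the reconcile.
    curl -s -X POST -H "$AUTH" \
      "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID/bootstrap-ssh/reconcile"

    # On the bound node: the managed key should now be present.
    sudo cat "/home/$ALLOC_USER/.ssh/authorized_keys"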

D. Rerun the Slurm controller

  1. Use a controller build that prefers connection.host over connection.hostname for SSH transport.
  2. Rerun reconcile for the target app instance.
  3. Confirm the app instance transitions to running.

Success Criteria

  1. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id} returns status=running.
  2. The controller member is ready.
  3. The controller member adapter_detail reports:
  4. phase=slurm_ready
  5. slurmctld_state=up
  6. slurmd_state=idle
  7. New bootstrap SSH reconcile requests complete through the node-agent path (verification sketch below).
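
A verification sketch against the criteria above; the member-array shape and role field are assumptions, while status, phase, slurmctld_state, and slurmd_state are the fields named in this runbook:

    curl -s -H "$AUTH" "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID" | jq -r '.status'
    curl -s -H "$AUTH" "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID/members" \
      | jq '.[] | select(.role == "controller") | .adapter_detail | {phase, slurmctld_state, slurmd_state}'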

Post-Incident Follow-Up

  1. Do not leave this as a manual-only recovery.
  2. Track:
  3. durable platform-control deployment wiring for slurm-reference-controller
  4. SSH target semantics: host for transport, hostname for identity/display
  5. node-api host mapping/bootstrap expectations for MAAS-managed nodes
  6. node-agent certificate/token recovery documentation