Runbook: Slurm Reference Instance Stuck in deploying
Trigger
- A slurm-reference app instance stays in deploying after the controller allocation is already active.
- The app-instance detail page shows a controller member in reconciling or a worker add operation in accepted/failed.
- The first Slurm instance in an environment never moves past initial bootstrap.
Impact
- The scheduler app never reaches a stable running state.
- App-instance bootstrap SSH trust can expire or churn while the runtime is stalled.
- Operators may misclassify the incident as an allocation or MAAS problem even though the owning failure is in the app-runtime adapter layer.
Required Context
- project_id
- app_instance_id
- app_slug
- Controller allocation id and bound node id
- Correlation id from the app-instance create/deploy request if available
Primary Checks
- Confirm the app-instance is really stalled in the app adapter layer (a curl sketch follows this list):
GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}
GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/members
GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/member-operations
- If the controller allocation is already active and bound, treat this as a Slurm adapter/controller incident, not an allocation incident.
- Verify whether a slurm-reference-controller is actually running in the target environment.
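A minimal curl sketch of these checks, assuming a bearer-token auth scheme, PROJECT/INSTANCE/TOKEN exported in the shell, and api.100-90-157-34.sslip.io as the base URL (all assumptions); jq selections are illustrative:
# Assumptions: TOKEN, PROJECT, INSTANCE are set; auth scheme and base URL are illustrative
BASE="https://api.100-90-157-34.sslip.io/api/v1/projects/$PROJECT/app-instances/$INSTANCE"
curl -s -H "Authorization: Bearer $TOKEN" "$BASE" | jq .status
curl -s -H "Authorization: Bearer $TOKEN" "$BASE/members" | jq .
curl -s -H "Authorization: Bearer $TOKEN" "$BASE/member-operations" | jq .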
Known First-Slice Failure Modes
- No slurm-reference-controller is deployed in the environment.
- The controller uses connection.hostname for SSH transport instead of connection.host.
- The bound node cannot resolve or reach node-api.100-90-157-34.sslip.io.
- The node-agent local certificate is expired.
- The node-agent enrollment token in /etc/gpuaas/node-agent.env is stale.
- App bootstrap SSH reconcile requests create node_tasks, but they stay queued until node-agent connectivity is restored.
Deep Diagnosis
1. Confirm the app-runtime state
GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}
- If status=deploying and the controller allocation is already bound, continue with this runbook.
GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/members
- Inspect the controller's adapter_detail.host and adapter_detail.hostname.
- Inspect worker/member failures.
GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/member-operations
- Look for worker add operations stuck in accepted or failed with SSH/bootstrap errors (see the jq sketch below).
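Under the same TOKEN/BASE assumptions as the Primary Checks sketch, a jq pass that pulls the fields this step cares about; the exact payload paths are an assumption to adjust:
# Compare adapter_detail.host vs adapter_detail.hostname per member (paths assumed)
curl -s -H "Authorization: Bearer $TOKEN" "$BASE/members" \
  | jq '.[] | {host: .adapter_detail.host, hostname: .adapter_detail.hostname, status}'
# Surface stuck worker add operations (status values taken from this runbook)
curl -s -H "Authorization: Bearer $TOKEN" "$BASE/member-operations" \
  | jq '.[] | select(.status == "accepted" or .status == "failed")'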
2. Validate the controller deployment path
- Confirm whether the environment has a long-running slurm-reference-controller (one way to check is sketched below).
- If not, the first owning fix is to bring up the controller, not to retry app-instance create repeatedly.
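One way to check, assuming the project's app-instances collection is listable and returns app_slug (both assumptions):
# List app instances in the project and look for a controller (endpoint shape assumed)
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://api.100-90-157-34.sslip.io/api/v1/projects/$PROJECT/app-instances" \
  | jq '.[] | select(.app_slug == "slurm-reference-controller") | {app_instance_id, status}'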
3. Validate SSH target semantics
- For SSH transport, host must be the reachable address; hostname is display/identity data and may be unresolvable from the controller environment.
- If the controller or worker operation is trying to SSH to a MAAS hostname like j27u15, fix the adapter to prefer host (sketched below).
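A sketch of the intended semantics, with hypothetical CONNECTION_HOST / CONNECTION_HOSTNAME variables standing in for connection.host and connection.hostname:
# Transport prefers host; hostname is identity/display only (variables are hypothetical)
SSH_TARGET="${CONNECTION_HOST:-$CONNECTION_HOSTNAME}"   # fall back only if host is unset
ssh -o ConnectTimeout=5 "$SSH_TARGET" true              # quick reachability probe; user/key flags omitted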
4. Validate node-agent health on the bound node
- SSH to the controller allocation host.
- Check:
systemctl status gpuaas-node-agent
journalctl -u gpuaas-node-agent -n 200 --no-pager
- Common evidence (a one-pass grep follows):
no route to host against node-api.100-90-157-34.sslip.io
local certificate expired
401 token_invalid
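The signatures above can be pulled out of the recent journal in one pass:
# Grep the recent journal for the known failure signatures listed above
journalctl -u gpuaas-node-agent -n 500 --no-pager \
  | grep -Ei 'no route to host|certificate|token_invalid'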
5. Validate bootstrap SSH reconcile
POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/bootstrap-ssh/reconcile
- If reconcile requests are accepted but never applied, inspect the resulting node task lifecycle and node-agent logs.
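A curl sketch of the reconcile call under the same TOKEN/BASE assumptions; whether the endpoint expects a request body is an assumption to verify:
# Trigger bootstrap SSH reconcile (empty body assumed)
curl -s -X POST -H "Authorization: Bearer $TOKEN" "$BASE/bootstrap-ssh/reconcile"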
Recovery
A. Restore node-agent reachability
- Ensure the bound node can resolve the public API names to a reachable control-plane IP.
- For the platform-control incident on 2026-04-08, the working mapping was:
api.100-90-157-34.sslip.io -> 10.176.46.104
node-api.100-90-157-34.sslip.io -> 10.176.46.104
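If DNS cannot be corrected immediately, one stopgap (an operational judgment, not the only option) is pinning the names on the bound node:
# /etc/hosts entries on the bound node, per the 2026-04-08 mapping (stopgap only)
10.176.46.104 api.100-90-157-34.sslip.io
10.176.46.104 node-api.100-90-157-34.sslip.io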
B. Recover node-agent identity
- If the local cert is expired, move aside the stale cert/key and restart the agent.
- If the restart then fails with token_invalid, mint a fresh enrollment token:
POST /api/v1/admin/nodes/{node_id}/enrollment-token
- Update /etc/gpuaas/node-agent.env with the new token and restart gpuaas-node-agent.
- Confirm journal evidence: node certificate enrolled
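A sketch of the full sequence, assuming the agent keeps its cert/key under a hypothetical /var/lib/gpuaas/node-agent directory and that the admin endpoint takes the same bearer auth; adjust paths and the env variable name to the actual install:
# 1. Move aside the stale identity (paths hypothetical) and restart
mv /var/lib/gpuaas/node-agent/client.crt{,.bak} 2>/dev/null
mv /var/lib/gpuaas/node-agent/client.key{,.bak} 2>/dev/null
systemctl restart gpuaas-node-agent
# 2. If the journal now shows token_invalid, mint a fresh enrollment token
curl -s -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
  "https://api.100-90-157-34.sslip.io/api/v1/admin/nodes/$NODE_ID/enrollment-token"
# 3. Put the new token into /etc/gpuaas/node-agent.env, then restart and verify
systemctl restart gpuaas-node-agent
journalctl -u gpuaas-node-agent -n 50 --no-pager | grep -i 'certificate enrolled'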
C. Reconcile bootstrap SSH trust
- Reissue:
POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/bootstrap-ssh/reconcile
- Confirm the resulting allocation.reconcile_managed_authorized_key node task is dispatched and completed.
- Confirm the managed key appears in the allocation user's authorized_keys.
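A verification sketch; the allocation user and the managed-key marker are assumptions, so adjust the grep to however the platform tags its key:
# Reissue the reconcile (same TOKEN/BASE assumptions)
curl -s -X POST -H "Authorization: Bearer $TOKEN" "$BASE/bootstrap-ssh/reconcile"
# Then, on the allocation host, confirm the managed key landed (pattern assumed)
grep -c 'gpuaas' "$HOME/.ssh/authorized_keys"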
D. Rerun the Slurm controller
- Use a controller build that prefers connection.host over connection.hostname for SSH transport.
- Rerun reconcile for the target app instance.
- Confirm the app instance transitions to running.
Success Criteria
- GET /api/v1/projects/{project_id}/app-instances/{app_instance_id} returns status=running.
- The controller member is ready.
- The controller member adapter_detail reports:
phase=slurm_ready
slurmctld_state=up
slurmd_state=idle
- New bootstrap SSH reconcile requests complete through the node-agent path.
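The criteria can be checked in one pass under the same TOKEN/BASE assumptions; the jq paths mirror the fields above and are otherwise an assumption:
# Expect "running"
curl -s -H "Authorization: Bearer $TOKEN" "$BASE" | jq .status
# Expect phase=slurm_ready, slurmctld_state=up, slurmd_state=idle on the controller member
curl -s -H "Authorization: Bearer $TOKEN" "$BASE/members" \
  | jq '.[] | {status, phase: .adapter_detail.phase, slurmctld_state: .adapter_detail.slurmctld_state, slurmd_state: .adapter_detail.slurmd_state}'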
Post-Incident Follow-Up
- Do not leave this as a manual-only recovery.
- Track:
- durable platform-control deployment wiring for slurm-reference-controller
- SSH target semantics: host for transport, hostname for identity/display
- node-api host mapping/bootstrap expectations for MAAS-managed nodes
- node-agent certificate/token recovery documentation