
Runbook: Slurm Reference Instance Stuck in deploying

Trigger

  1. A slurm-reference app instance stays in deploying after the controller allocation is already active.
  2. The app-instance detail page shows a controller member in reconciling or a worker add operation in accepted/failed.
  3. The first Slurm instance in an environment never moves past initial bootstrap.

Impact

  1. The scheduler app never reaches a stable running state.
  2. App-instance bootstrap SSH trust can expire or churn while the runtime is stalled.
  3. Operators may misclassify the incident as an allocation or MAAS problem even though the owning failure is in the app-runtime adapter layer.

Required Context

  1. project_id
  2. app_instance_id
  3. app_slug
  4. Controller allocation id and bound node id
  5. Correlation id from the app-instance create/deploy request if available

Primary Checks

  1. Confirm the app-instance is really stalled in the app adapter layer (see the curl sketch after this list):
  2. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}
  3. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/members
  4. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/member-operations
  5. If the controller allocation is already active and bound, treat this as a Slurm adapter/controller incident, not an allocation incident.
  6. Verify whether a slurm-reference-controller is actually running in the target environment.
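
A minimal triage sketch, assuming a bearer-token API reachable with curl and jq from the operator workstation; $BASE, $TOKEN, $PROJECT_ID, and $APP_INSTANCE_ID are illustrative stand-ins for the Required Context values, and the auth scheme and base URL are assumptions:

    # Illustrative variables; auth scheme and base URL are assumptions.
    BASE="https://api.100-90-157-34.sslip.io"
    AUTH="Authorization: Bearer $TOKEN"

    # status=deploying despite an active, bound controller allocation classifies
    # this as an adapter/controller incident rather than an allocation incident.
    curl -s -H "$AUTH" "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID" | jq -r '.status'

    # Member state and adapter detail for the controller
    curl -s -H "$AUTH" "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID/members" | jq .

    # Member operations: worker adds stuck in accepted/failed
    curl -s -H "$AUTH" "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID/member-operations" | jq .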

Known First-Slice Failure Modes

  1. No slurm-reference-controller is deployed in the environment.
  2. The controller uses connection.hostname for SSH transport instead of connection.host.
  3. The bound node cannot resolve or reach node-api.100-90-157-34.sslip.io.
  4. The node-agent local certificate is expired.
  5. The node-agent enrollment token in /etc/gpuaas/node-agent.env is stale.
  6. App bootstrap SSH reconcile requests create node_tasks, but they stay queued until node-agent connectivity is restored.

Deep Diagnosis

1. Confirm the app-runtime state

  1. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}
  2. If status=deploying and the controller allocation is already bound, continue with this runbook.
  3. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/members
  4. Inspect controller adapter_detail.host.
  5. Inspect controller adapter_detail.hostname.
  6. Inspect worker/member failures.
  7. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/member-operations
  8. Look for worker add operations stuck in accepted or failed with SSH/bootstrap errors (see the sketch after this list).
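
A sketch for pulling the SSH-relevant fields, reusing the $BASE/$AUTH variables from Primary Checks; the array response shape and the role/status field names are assumptions, while adapter_detail.host and adapter_detail.hostname are the fields named above:

    # Controller/member SSH fields (response assumed to be a JSON array)
    curl -s -H "$AUTH" "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID/members" \
      | jq '.[] | {role, status, host: .adapter_detail.host, hostname: .adapter_detail.hostname}'

    # Worker add operations stuck in accepted or failed
    curl -s -H "$AUTH" "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID/member-operations" \
      | jq '.[] | select(.status == "accepted" or .status == "failed")'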

2. Validate the controller deployment path

  1. Confirm whether the environment has a long-running slurm-reference-controller.
  2. If not, the first owning fix is to bring up the controller, not to retry app-instance create repeatedly.

3. Validate SSH target semantics

  1. For SSH transport, host must be the reachable address.
  2. hostname is display/identity data and may be unresolvable from the controller environment.
  3. If the controller or worker operation is trying to SSH to a MAAS hostname such as j27u15, fix the adapter to prefer host (see the sketch below).
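
A minimal sketch of the intended selection logic, with illustrative variable names and an assumed login user; the point is only that transport should bind to host and fall back to hostname, never the reverse:

    # MEMBER_HOST / MEMBER_HOSTNAME are illustrative; populate from adapter_detail.
    SSH_TARGET="${MEMBER_HOST:-$MEMBER_HOSTNAME}"   # prefer host; hostname is identity/display only
    ssh -o BatchMode=yes "ubuntu@$SSH_TARGET" true \
      || echo "SSH transport target unreachable: $SSH_TARGET"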

4. Validate node-agent health on the bound node

  1. SSH to the controller allocation host.
  2. Check:
  3. systemctl status gpuaas-node-agent
  4. journalctl -u gpuaas-node-agent -n 200 --no-pager
  5. Common evidence (a one-shot sweep for these strings follows this list):
  6. no route to host against node-api.100-90-157-34.sslip.io
  7. local certificate expired
  8. 401 token_invalid
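
A one-shot sweep on the bound node for the evidence above; the grep patterns mirror the listed strings, and exact journal phrasing may vary by agent version:

    systemctl is-active gpuaas-node-agent
    journalctl -u gpuaas-node-agent -n 200 --no-pager \
      | grep -Ei 'no route to host|certificate expired|token_invalid'
    # Confirm the node can resolve the node API name at all
    getent hosts node-api.100-90-157-34.sslip.io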

5. Validate bootstrap SSH reconcile

  1. POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/bootstrap-ssh/reconcile
  2. If reconcile requests are accepted but never applied, inspect the resulting node task lifecycle and node-agent logs (see the sketch below).
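
A reconcile trigger sketch, reusing the earlier variables; note that an accepted response alone is not success, since the resulting node task must also be dispatched and completed:

    curl -s -X POST -H "$AUTH" \
      "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID/bootstrap-ssh/reconcile"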

Recovery

A. Restore node-agent reachability

  1. Ensure the bound node can resolve the public API names to a reachable control-plane IP (see the /etc/hosts sketch after this list).
  2. For the platform-control incident on 2026-04-08, the working mapping was:
  3. api.100-90-157-34.sslip.io -> 10.176.46.104
  4. node-api.100-90-157-34.sslip.io -> 10.176.46.104
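
If DNS cannot serve the sslip.io names from the node, a static mapping unblocks the agent; a sketch using the 2026-04-08 incident IP, which must be re-verified before reuse:

    # Run on the bound node; verify 10.176.46.104 is still the control-plane IP first.
    echo '10.176.46.104 api.100-90-157-34.sslip.io'      | sudo tee -a /etc/hosts
    echo '10.176.46.104 node-api.100-90-157-34.sslip.io' | sudo tee -a /etc/hosts
    getent hosts node-api.100-90-157-34.sslip.io         # confirm resolution now succeeds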

B. Recover node-agent identity

  1. If the local cert is expired, move the stale cert/key aside and restart the agent.
  2. If the restart then fails with token_invalid, mint a fresh enrollment token:
  3. POST /api/v1/admin/nodes/{node_id}/enrollment-token
  4. Update /etc/gpuaas/node-agent.env with the new token and restart gpuaas-node-agent.
  5. Confirm journal evidence (see the combined sketch after this list):
  6. node certificate enrolled
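
A combined recovery sketch; the cert/key paths under /etc/gpuaas are assumptions inferred from the node-agent.env location, and the env key that carries the token varies by agent build:

    # On the bound node: move stale identity material aside and restart.
    sudo mv /etc/gpuaas/node-agent.crt /etc/gpuaas/node-agent.crt.bak   # assumed path
    sudo mv /etc/gpuaas/node-agent.key /etc/gpuaas/node-agent.key.bak   # assumed path
    sudo systemctl restart gpuaas-node-agent

    # From an admin workstation, if the agent now logs 401 token_invalid:
    curl -s -X POST -H "$AUTH" "$BASE/api/v1/admin/nodes/$NODE_ID/enrollment-token"

    # Back on the node: install the new token, restart, confirm enrollment.
    sudo vi /etc/gpuaas/node-agent.env   # update the token entry with the minted value
    sudo systemctl restart gpuaas-node-agent
    sudo journalctl -u gpuaas-node-agent -n 50 --no-pager | grep -i 'node certificate enrolled'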

C. Reconcile bootstrap SSH trust

  1. Reissue the reconcile request:
  2. POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/bootstrap-ssh/reconcile
  3. Confirm the resulting allocation.reconcile_managed_authorized_key node task is dispatched and completed.
  4. Confirm the managed key appears in the allocation user’s authorized_keys (see the sketch below).
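
A verification sketch; $ALLOC_USER is an illustrative name for the allocation user, and the managed key's comment/marker is not specified here, so inspect the file directly:

    # From the operator workstation: reissue the reconcile.
    curl -s -X POST -H "$AUTH" \
      "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID/bootstrap-ssh/reconcile"

    # On the bound node: the managed key should now be present.
    sudo cat "/home/$ALLOC_USER/.ssh/authorized_keys"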

D. Rerun the Slurm controller

  1. Use a controller build that prefers connection.host over connection.hostname for SSH transport.
  2. Rerun reconcile for the target app instance.
  3. Confirm the app instance transitions to running.

Success Criteria

  1. GET /api/v1/projects/{project_id}/app-instances/{app_instance_id} returns status=running.
  2. The controller member is ready.
  3. The controller member adapter_detail reports:
  4. phase=slurm_ready
  5. slurmctld_state=up
  6. slurmd_state=idle
  7. New bootstrap SSH reconcile requests complete through the node-agent path (verification sketch below).
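
A verification sketch against the criteria above; the member-array shape and role field are assumptions, while status, phase, slurmctld_state, and slurmd_state are the fields named in this runbook:

    curl -s -H "$AUTH" "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID" | jq -r '.status'
    curl -s -H "$AUTH" "$BASE/api/v1/projects/$PROJECT_ID/app-instances/$APP_INSTANCE_ID/members" \
      | jq '.[] | select(.role == "controller") | .adapter_detail | {phase, slurmctld_state, slurmd_state}'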

Post-Incident Follow-Up

  1. Do not leave this as a manual-only recovery.
  2. Track:
  3. durable platform-control deployment wiring for slurm-reference-controller
  4. SSH target semantics: host for transport, hostname for identity/display
  5. node-api host mapping/bootstrap expectations for MAAS-managed nodes
  6. node-agent certificate/token recovery documentation