Platform-Control k3s Recovery Runbook
Use this when platform-control root disk pressure has been cleared but the k3s environment does not recover cleanly.
Symptoms
- df -h / previously showed / at or near 100%.
- GitLab, CI, or platform app routes were unavailable during the disk-full event.
- After freeing disk, k3s is active but GPUaaS deployments are still degraded.
- kubectl get pods -A shows many stale pods in:
  - ContainerStatusUnknown
  - ImagePullBackOff
  - ErrImagePull
  - Completed
  - Error
- New GPUaaS ReplicaSets try to pull unqualified images such as:
  - gpuaas-api:dev-control-local
  - gpuaas-web:dev-control-local
  - gpuaas-terminal-gateway:dev-control-local
- Events show k3s resolving these unqualified images through Docker Hub:
  - docker.io/library/gpuaas-api:dev-control-local
  - pull access denied
  - insufficient_scope
Immediate Checks
ssh hpcadmin@100.90.157.34
df -h / /ai-cloud-data
systemctl is-active k3s containerd docker
sudo k3s kubectl get nodes -o wide
sudo k3s kubectl get deploy -A \
  -o 'custom-columns=NS:.metadata.namespace,NAME:.metadata.name,READY:.status.readyReplicas,DESIRED:.spec.replicas,IMAGE:.spec.template.spec.containers[*].image'
sudo k3s kubectl get pods -A
sudo k3s kubectl get events -A --sort-by=.lastTimestamp | tail -80
Healthy baseline:
- / below 80%.
- /ai-cloud-data below 80%.
- k3s, containerd, and docker active.
- dev-control-1 Ready.
- Every deployment in gpuaas-core, gpuaas-infra, gpuaas-observability, and kube-system has READY == DESIRED.
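The READY == DESIRED comparison can be scripted rather than read by eye. The following is a minimal sketch and not part of the original runbook: the function name not_ready_deployments is hypothetical, and it assumes the deployment listing is produced with --no-headers and without the IMAGE column.

```shell
# Hypothetical helper: list deployments whose READY count differs from DESIRED.
# stdin: rows of "NS NAME READY DESIRED", for example from:
#   sudo k3s kubectl get deploy -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,READY:.status.readyReplicas,DESIRED:.spec.replicas \
#     | not_ready_deployments
# kubectl prints <none> when .status.readyReplicas is unset (no ready pods),
# so <none> is counted as 0 here.
not_ready_deployments() {
  awk '{
    ready = ($3 == "<none>") ? 0 : $3
    desired = ($4 == "<none>") ? 0 : $4
    if (ready + 0 != desired + 0) {
      print $1 "/" $2 " ready=" ready " desired=" desired
    }
  }'
}
```

Empty output means every listed deployment matches its desired replica count.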
Diagnose Image Reference Drift
After a failed local deploy, the deployment specs may point at local Docker tags that k3s/containerd cannot pull. Confirm the deployment image refs and local image inventory:
sudo k3s kubectl get deploy -n gpuaas-core \
  -o 'custom-columns=NAME:.metadata.name,READY:.status.readyReplicas,DESIRED:.spec.replicas,IMAGE:.spec.template.spec.containers[*].image'
sudo k3s ctr images list | grep -E 'gpuaas|dev-control|registry' || true
sudo docker images --format '{{.Repository}}:{{.Tag}} {{.ID}} {{.Size}}' \
| grep -E 'gpuaas|dev-control|registry' || true
If a deployment image is gpuaas-*:dev-control-local, k3s treats it as an
external image and tries to pull it, unless the image has been imported into
k3s containerd under exactly the same reference and the pod's imagePullPolicy
allows using the local copy. Prefer registry-qualified images for
platform-control recovery.
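The Docker Hub lookups seen in the events follow from how an unqualified image reference is expanded to a fully qualified name. The sketch below approximates the usual Docker-style normalization rules in shell. It is an illustration only: normalize_image_ref is a hypothetical helper, it does not handle digests, and it does not default a missing tag to latest.

```shell
# Hypothetical helper: approximate how an image reference is expanded to a
# fully qualified name. Simplified: digests and missing tags are not handled.
normalize_image_ref() (
  ref="$1"
  first="${ref%%/*}"   # text before the first slash (whole ref if no slash)
  rest="${ref#*/}"     # text after the first slash (whole ref if no slash)
  if [ "$first" = "$ref" ]; then
    # No slash at all: a bare name such as gpuaas-api:dev-control-local.
    printf 'docker.io/library/%s\n' "$ref"
  else
    case "$first" in
      docker.io)
        case "$rest" in
          */*) printf '%s\n' "$ref" ;;
          *)   printf 'docker.io/library/%s\n' "$rest" ;;
        esac ;;
      *.*|*:*|localhost)
        # First component looks like a registry host (dot, port, or localhost).
        printf '%s\n' "$ref" ;;
      *)
        # First component is a Docker Hub namespace, not a host.
        printf 'docker.io/%s\n' "$ref" ;;
    esac
  fi
)
```

For example, normalize_image_ref gpuaas-api:dev-control-local prints docker.io/library/gpuaas-api:dev-control-local, while a reference that starts with a registry host is printed unchanged.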
Recovery: Restore Deployments to a Known Registry Tag
This is a recovery action, not a new platform-control release. Use it only when the registry-qualified images already exist on the host or in the local registry.
Pick a known complete runtime tag from the output of sudo docker images, for example:
REG=registry.100-90-157-34.sslip.io/gpuaas-service/runtime
TAG=3e16936657bb0ed941a76b1be0f16c00e63318f4
NS=gpuaas-core
Restore all core deployments to the same tag:
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-api \
api="$REG/gpuaas-api:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-web \
web="$REG/gpuaas-web:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-app-runtime-worker \
app-runtime-worker="$REG/gpuaas-app-runtime-worker:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-billing-worker \
billing-worker="$REG/gpuaas-billing-worker:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-notification-relay \
notification-relay="$REG/gpuaas-notification-relay:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-outbox-relay \
outbox-relay="$REG/gpuaas-outbox-relay:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-provisioning-worker \
provisioning-worker="$REG/gpuaas-provisioning-worker:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-rke2-self-managed-controller \
rke2-self-managed-controller="$REG/gpuaas-rke2-self-managed-controller:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-slurm-reference-controller \
slurm-reference-controller="$REG/gpuaas-slurm-reference-controller:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-terminal-gateway \
terminal-gateway="$REG/gpuaas-terminal-gateway:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-webhook-worker \
webhook-worker="$REG/gpuaas-webhook-worker:$TAG"
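Every set image command in this step follows one pattern: the container name is the deployment name without the gpuaas- prefix. Under that assumption, the commands can be generated instead of typed. The sketch below only prints the commands for review and does not run them; print_set_image_cmds is a hypothetical helper name.

```shell
# Hypothetical helper: print (not run) one "kubectl set image" command per
# deployment. Assumes the container name equals the deployment name with the
# gpuaas- prefix removed, as in the commands listed in this runbook.
# Arguments: <registry-prefix> <tag> <namespace> <deployment>...
print_set_image_cmds() (
  reg="$1"; tag="$2"; ns="$3"
  shift 3
  for deploy in "$@"; do
    container="${deploy#gpuaas-}"
    printf 'sudo k3s kubectl set image -n %s deploy/%s %s=%s/%s:%s\n' \
      "$ns" "$deploy" "$container" "$reg" "$deploy" "$tag"
  done
)

# Example (prints two commands for inspection):
#   print_set_image_cmds "$REG" "$TAG" "$NS" gpuaas-api gpuaas-web
```

Printing first keeps this a reviewable recovery action. After checking the output, the same lines can be pasted into the shell or piped to sh.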
Then wait for the rollouts:
for d in gpuaas-api gpuaas-web gpuaas-app-runtime-worker gpuaas-billing-worker \
gpuaas-notification-relay gpuaas-outbox-relay gpuaas-provisioning-worker \
gpuaas-rke2-self-managed-controller gpuaas-slurm-reference-controller \
gpuaas-terminal-gateway gpuaas-webhook-worker; do
sudo k3s kubectl rollout status "deploy/$d" -n gpuaas-core --timeout=90s
done
Cleanup Stale Pods
Only do this after deployments are healthy. This removes dead pods left behind by the disk-full incident without changing desired deployment state.
for ns in gpuaas-core gpuaas-infra gpuaas-observability; do
sudo k3s kubectl get pods -n "$ns" --field-selector=status.phase!=Running -o name \
| xargs -r sudo k3s kubectl delete -n "$ns" --wait=false
done
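Before deleting, it may help to preview which pods count as stale. The sketch below filters a pod listing by the STATUS column and keeps only ContainerStatusUnknown, Completed, and Error pods. It is a narrower filter than the phase-based selector above, because it leaves Pending pods (for example ImagePullBackOff) untouched. stale_pods is a hypothetical helper name, and the input is assumed to come from kubectl get pods with --no-headers.

```shell
# Hypothetical helper: print names of pods whose STATUS column is one of the
# dead states left behind by the disk-full incident.
# stdin: "kubectl get pods --no-headers" rows (NAME READY STATUS RESTARTS AGE),
# for example:
#   sudo k3s kubectl get pods -n gpuaas-core --no-headers | stale_pods
stale_pods() {
  awk '$3 == "ContainerStatusUnknown" || $3 == "Completed" || $3 == "Error" {
    print $1
  }'
}
```

The printed names can be reviewed first and then passed to kubectl delete pod in the same namespace.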
Verification
sudo k3s kubectl get deploy -A \
-o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,READY:.status.readyReplicas,DESIRED:.spec.replicas,AVAILABLE:.status.availableReplicas
sudo k3s kubectl get pods -A --field-selector=status.phase!=Running
df -h / /ai-cloud-data
Expected result:
- No non-running pods.
- Every deployment in the GPUaaS namespaces is fully available.
- Root filesystem has meaningful free space.
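The free-space expectation can be made concrete with a threshold check. The sketch below assumes the 80% limit from the healthy baseline also applies here and reads POSIX-format df output. over_threshold is a hypothetical helper name, and mount points containing spaces are not handled.

```shell
# Hypothetical helper: print mount points at or above a usage threshold.
# $1: threshold in percent; stdin: "df -P" output (header row is skipped).
# Example:
#   df -P / /ai-cloud-data | over_threshold 80
over_threshold() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)          # strip the percent sign from the Capacity column
    if (use + 0 >= limit + 0) {
      print $6 " " $5
    }
  }'
}
```

Empty output means every listed filesystem is below the threshold.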
April 30, 2026 Incident Notes
After root disk was recovered on vm-104, k3s was running but several
gpuaas-core deployments were stuck because the desired ReplicaSets referenced
dev-control-local images. k3s attempted to pull these from Docker Hub. Recovery
was to patch the deployments back to a complete registry-qualified runtime tag
already present on the host.
After patching image refs, all rollouts completed and stale
ContainerStatusUnknown / Completed / Error pods were deleted.
Follow-Up
- Keep CI and frontend E2E workspaces on /ai-cloud-data, not /.
- Keep Docker log rotation and the daily cleanup timer from Platform_Control_Disk_Cleanup_Runbook.md.
- Prefer registry-qualified deployment images for platform-control; unqualified gpuaas-*:dev-control-local tags are fragile unless explicitly imported into k3s containerd.
- Add a CI/release guard that rejects platform-control manifests containing unqualified gpuaas-*:dev-control-local image refs.
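One possible shape for the CI/release guard is a grep-based check over the manifest files. This is a sketch under stated assumptions, not an existing script: reject_unqualified_images is a hypothetical name, and the pattern assumes image references appear on lines of the form image: <ref> in YAML manifests.

```shell
# Hypothetical CI guard: fail when any manifest has an image field whose value
# starts directly with gpuaas-<name>:dev-control-local (no registry prefix).
# Arguments: one or more manifest files.
reject_unqualified_images() {
  pattern="image:[[:space:]]*[\"']?gpuaas-[A-Za-z0-9._-]+:dev-control-local"
  # grep exits 0 on a match, 1 on no match, and >1 on errors.
  grep -HnE "$pattern" "$@" && rc=0 || rc=$?
  case "$rc" in
    0) echo "unqualified dev-control-local image refs found" >&2; return 1 ;;
    1) return 0 ;;   # no matches: manifests pass
    *) echo "grep failed with status $rc" >&2; return "$rc" ;;
  esac
}
```

In a pipeline this could run as reject_unqualified_images manifests/*.yaml, where the manifests/ path is only a placeholder. Matching lines are printed with file name and line number so the offending reference is easy to find.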