Platform-Control k3s Recovery Runbook

Use this when platform-control root disk pressure has been cleared but the k3s environment does not recover cleanly.

Symptoms

  • df -h / previously showed / at or near 100%.
  • GitLab, CI, or platform app routes were unavailable during the disk-full event.
  • After freeing disk, k3s is active but GPUaaS deployments are still degraded.
  • kubectl get pods -A shows many stale pods in:
      • ContainerStatusUnknown
      • ImagePullBackOff
      • ErrImagePull
      • Completed
      • Error
  • New GPUaaS ReplicaSets try to pull unqualified images such as:
      • gpuaas-api:dev-control-local
      • gpuaas-web:dev-control-local
      • gpuaas-terminal-gateway:dev-control-local
  • Events show k3s resolving these unqualified images through Docker Hub:
      • docker.io/library/gpuaas-api:dev-control-local
      • pull access denied
      • insufficient_scope

Immediate Checks

ssh hpcadmin@100.90.157.34

df -h / /ai-cloud-data
systemctl is-active k3s containerd docker
sudo k3s kubectl get nodes -o wide
sudo k3s kubectl get deploy -A \
  -o 'custom-columns=NS:.metadata.namespace,NAME:.metadata.name,READY:.status.readyReplicas,DESIRED:.spec.replicas,IMAGE:.spec.template.spec.containers[*].image'
sudo k3s kubectl get pods -A
sudo k3s kubectl get events -A --sort-by=.lastTimestamp | tail -80

Healthy baseline:

  • / below 80%.
  • /ai-cloud-data below 80%.
  • k3s, containerd, and docker active.
  • dev-control-1 Ready.
  • Every deployment in gpuaas-core, gpuaas-infra, gpuaas-observability, and kube-system has READY == DESIRED.
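
The baseline above can be checked mechanically. The following is a minimal sketch, not part of the platform tooling: usage_ok and deploys_not_ready are hypothetical helper names, the 80% limit is taken from the list above, and the commented usage assumes GNU df (for --output=pcent).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the healthy-baseline checks (illustrative only).

# usage_ok PCT [LIMIT]: succeed when a usage figure such as "73%" or "73"
# is below LIMIT (default 80).
usage_ok() {
  local pct="${1%\%}" limit="${2:-80}"
  [ "$pct" -lt "$limit" ]
}

# deploys_not_ready: read "NS NAME READY DESIRED" rows on stdin and print the
# rows where READY differs from DESIRED. kubectl prints <none> when
# readyReplicas is unset, so that value is treated as 0.
deploys_not_ready() {
  awk '{
    ready = ($3 == "<none>") ? 0 : $3
    if (ready + 0 != $4 + 0) print $1 "/" $2 " ready=" ready " desired=" $4
  }'
}

# Usage on the host (not executed here; assumes GNU df):
#   for fs in / /ai-cloud-data; do
#     pct="$(df --output=pcent "$fs" | tail -1 | tr -d ' ')"
#     usage_ok "$pct" || echo "WARN: $fs is at $pct"
#   done
#   sudo k3s kubectl get deploy -A --no-headers \
#     -o 'custom-columns=NS:.metadata.namespace,NAME:.metadata.name,READY:.status.readyReplicas,DESIRED:.spec.replicas' \
#     | deploys_not_ready
```

No output from deploys_not_ready means every listed deployment has READY == DESIRED.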

Diagnose Image Reference Drift

After a failed local deploy, the deployment specs may point at local Docker tags that k3s/containerd cannot pull. Confirm the deployment image refs and local image inventory:

sudo k3s kubectl get deploy -n gpuaas-core \
  -o 'custom-columns=NAME:.metadata.name,READY:.status.readyReplicas,DESIRED:.spec.replicas,IMAGE:.spec.template.spec.containers[*].image'

sudo k3s ctr images list | grep -E 'gpuaas|dev-control|registry' || true
sudo docker images --format '{{.Repository}}:{{.Tag}} {{.ID}} {{.Size}}' \
  | grep -E 'gpuaas|dev-control|registry' || true

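If a deployment image is gpuaas-*:dev-control-local, the reference has no registry host, so k3s/containerd resolves it as docker.io/library/gpuaas-*:dev-control-local. The pod starts only if an image with the matching reference has already been imported into k3s containerd and the container's imagePullPolicy allows the local copy to be used (IfNotPresent or Never). Otherwise the pull goes to Docker Hub and fails. Prefer registry-qualified images for platform-control recovery.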
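
To see how an image reference will be resolved, the Docker-style normalization rule can be approximated in a few lines. This is a simplified sketch under stated assumptions: normalize_image_ref is a hypothetical helper, it does not validate the reference, and it does not append the default :latest tag or handle digests.

```shell
#!/usr/bin/env bash
# normalize_image_ref REF: print the fully qualified name that Docker-style
# reference normalization would give REF. Hypothetical helper, simplified.
normalize_image_ref() {
  local ref="$1" first
  case "$ref" in
    */*) first="${ref%%/*}" ;;
    # No "/" at all: an official-image style name on Docker Hub.
    *)   echo "docker.io/library/$ref"; return ;;
  esac
  # The first path component counts as a registry host only if it contains
  # "." or ":" or is exactly "localhost"; otherwise Docker Hub is assumed.
  case "$first" in
    *.*|*:*|localhost) echo "$ref" ;;
    *)                 echo "docker.io/$ref" ;;
  esac
}
```

For example, normalize_image_ref gpuaas-api:dev-control-local prints docker.io/library/gpuaas-api:dev-control-local, which matches the reference seen in the pull-failure events, while a reference that starts with registry.100-90-157-34.sslip.io/ is printed unchanged.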

Recovery: Restore Deployments to a Known Registry Tag

This is a recovery action, not a new platform-control release. Use it only when the registry-qualified images already exist on the host or in the local registry.

Pick a known complete runtime tag, meaning one under which every service image listed below exists, from the output of sudo docker images. For example:

REG=registry.100-90-157-34.sslip.io/gpuaas-service/runtime
TAG=3e16936657bb0ed941a76b1be0f16c00e63318f4
NS=gpuaas-core
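
Before applying a tag, it can help to confirm that it is complete. The sketch below assumes the inventory is supplied as repository:tag lines; missing_for_tag is a hypothetical helper name, and the service list in the commented usage is only a sample.

```shell
#!/usr/bin/env bash
# missing_for_tag REG TAG SERVICE...: read "repository:tag" lines on stdin and
# print each SERVICE that has no "$REG/$SERVICE:$TAG" line. Hypothetical helper.
missing_for_tag() {
  local reg="$1" tag="$2" inventory svc
  shift 2
  inventory="$(cat)"
  for svc in "$@"; do
    # -x matches the whole line, -F treats the reference as a fixed string.
    grep -qxF -- "$reg/$svc:$tag" <<<"$inventory" || echo "$svc"
  done
}

# Usage on the host (not executed here):
#   sudo docker images --format '{{.Repository}}:{{.Tag}}' \
#     | missing_for_tag "$REG" "$TAG" gpuaas-api gpuaas-web gpuaas-terminal-gateway
```

An empty result means every named service has an image under the chosen tag. Any service name that is printed indicates the tag is incomplete and should not be used for recovery.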

Restore all core deployments to the same tag:

sudo k3s kubectl set image -n "$NS" deploy/gpuaas-api \
  api="$REG/gpuaas-api:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-web \
  web="$REG/gpuaas-web:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-app-runtime-worker \
  app-runtime-worker="$REG/gpuaas-app-runtime-worker:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-billing-worker \
  billing-worker="$REG/gpuaas-billing-worker:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-notification-relay \
  notification-relay="$REG/gpuaas-notification-relay:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-outbox-relay \
  outbox-relay="$REG/gpuaas-outbox-relay:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-provisioning-worker \
  provisioning-worker="$REG/gpuaas-provisioning-worker:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-rke2-self-managed-controller \
  rke2-self-managed-controller="$REG/gpuaas-rke2-self-managed-controller:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-slurm-reference-controller \
  slurm-reference-controller="$REG/gpuaas-slurm-reference-controller:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-terminal-gateway \
  terminal-gateway="$REG/gpuaas-terminal-gateway:$TAG"
sudo k3s kubectl set image -n "$NS" deploy/gpuaas-webhook-worker \
  webhook-worker="$REG/gpuaas-webhook-worker:$TAG"
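
The commands above all follow one pattern: the container name is the deployment name with the gpuaas- prefix removed. Under that assumption they can be generated by a loop. This is a sketch rather than the canonical procedure; restore_images and the RUN hook are hypothetical names introduced here.

```shell
#!/usr/bin/env bash
# Deployments taken from the commands above. Each container is assumed to be
# named after its deployment with the "gpuaas-" prefix removed.
DEPLOYS=(gpuaas-api gpuaas-web gpuaas-app-runtime-worker gpuaas-billing-worker
  gpuaas-notification-relay gpuaas-outbox-relay gpuaas-provisioning-worker
  gpuaas-rke2-self-managed-controller gpuaas-slurm-reference-controller
  gpuaas-terminal-gateway gpuaas-webhook-worker)

# restore_images REG TAG NS: run one "set image" per deployment.
# RUN is a hypothetical hook: leave it unset to execute the commands, or set
# RUN=echo to print them for review first.
restore_images() {
  local reg="$1" tag="$2" ns="$3" d
  for d in "${DEPLOYS[@]}"; do
    ${RUN:-} sudo k3s kubectl set image -n "$ns" "deploy/$d" \
      "${d#gpuaas-}=$reg/$d:$tag"
  done
}

# Review, then apply (not executed here):
#   RUN=echo restore_images "$REG" "$TAG" "$NS"
#   restore_images "$REG" "$TAG" "$NS"
```

Printing the commands first makes it easy to compare them against the explicit list before anything changes in the cluster.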

Then wait for the rollouts:

for d in gpuaas-api gpuaas-web gpuaas-app-runtime-worker gpuaas-billing-worker \
  gpuaas-notification-relay gpuaas-outbox-relay gpuaas-provisioning-worker \
  gpuaas-rke2-self-managed-controller gpuaas-slurm-reference-controller \
  gpuaas-terminal-gateway gpuaas-webhook-worker; do
  sudo k3s kubectl rollout status "deploy/$d" -n gpuaas-core --timeout=90s
done

Cleanup Stale Pods

Only do this after deployments are healthy. This removes dead pods left behind by the disk-full incident without changing desired deployment state.

for ns in gpuaas-core gpuaas-infra gpuaas-observability; do
  sudo k3s kubectl get pods -n "$ns" --field-selector=status.phase!=Running -o name \
    | xargs -r sudo k3s kubectl delete -n "$ns" --wait=false
done

Verification

sudo k3s kubectl get deploy -A \
  -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,READY:.status.readyReplicas,DESIRED:.spec.replicas,AVAILABLE:.status.availableReplicas
sudo k3s kubectl get pods -A --field-selector=status.phase!=Running
df -h / /ai-cloud-data

Expected result:

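  • No non-running pods in the GPUaaS namespaces. Completed one-shot Job pods in other namespaces (for example, k3s helm-install pods in kube-system) may still be listed and are not a failure.
  • Every deployment in the GPUaaS namespaces has READY and AVAILABLE equal to DESIRED.
  • / and /ai-cloud-data are below 80%, matching the healthy baseline.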

April 30, 2026 Incident Notes

After root disk was recovered on vm-104, k3s was running but several gpuaas-core deployments were stuck because the desired ReplicaSets referenced dev-control-local images. k3s attempted to pull these from Docker Hub. Recovery was to patch the deployments back to a complete registry-qualified runtime tag already present on the host:

registry.100-90-157-34.sslip.io/gpuaas-service/runtime/*:3e16936657bb0ed941a76b1be0f16c00e63318f4

After patching image refs, all rollouts completed and stale ContainerStatusUnknown / Completed / Error pods were deleted.

Follow-Up

  • Keep CI and frontend E2E workspaces on /ai-cloud-data, not /.
  • Keep the Docker log rotation and the daily cleanup timer described in Platform_Control_Disk_Cleanup_Runbook.md enabled.
  • Prefer registry-qualified deployment images for platform-control; unqualified gpuaas-*:dev-control-local tags are fragile unless explicitly imported into k3s containerd.
  • Add a CI/release guard that rejects platform-control manifests containing unqualified gpuaas-*:dev-control-local image refs.
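
The guard in the last item could start as a simple text check over rendered manifests. The sketch below is one possible shape, not an existing pipeline step: check_manifest_images is a hypothetical name, and it only inspects literal image: lines, so templated values would need to be rendered before the check runs.

```shell
#!/usr/bin/env bash
# check_manifest_images FILE...: fail if any "image:" line uses an unqualified
# gpuaas-*:dev-control-local reference. Hypothetical guard for CI.
check_manifest_images() {
  # Matches "image:" followed directly by gpuaas-<name>:dev-control-local,
  # optionally quoted. A registry-qualified reference does not match because
  # the registry host comes before "gpuaas-".
  local pattern='image:[[:space:]]*["'\'']?gpuaas-[A-Za-z0-9._-]*:dev-control-local'
  if grep -HnE -- "$pattern" "$@"; then
    echo "ERROR: unqualified dev-control-local image refs found" >&2
    return 1
  fi
}

# Usage in a pipeline step (paths are placeholders, not executed here):
#   check_manifest_images rendered/*.yaml
```

Offending lines are printed with file name and line number, and the non-zero return status fails the job.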