Platform-Control Disk Cleanup Runbook

Use this runbook when platform-control GitLab, CI, or Docker-backed services fail with "No space left on device".

Symptoms

Git push to platform-control GitLab fails with:
  Failed to write to log
  No space left on device
  Internal API unreachable
CI jobs fail to start, or the GitLab runner cannot create temporary directories.
df -h / shows the root filesystem at or near 100%.

Initial Checks

ssh hpcadmin@100.90.157.34
df -h / /ai-cloud-data
sudo du -xh -d1 / 2>/dev/null | sort -h | tail -30
sudo du -xh -d1 /var 2>/dev/null | sort -h | tail -30
sudo du -xh -d1 /ai-cloud-data 2>/dev/null | sort -h | tail -30
sudo docker system df

Expected layout on vm-104:

/ is the OS/root filesystem and must stay below 80% used.
/ai-cloud-data is Docker's data root and can grow quickly from CI images, GitLab runner cache volumes, and container logs.
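The 80% limit on / can be checked mechanically rather than read off df by eye. The following is a minimal sketch, assuming GNU coreutils df (for --output=pcent); the function names usage_pct and classify_usage are hypothetical and not part of any installed tooling on vm-104.

```shell
# Hypothetical helpers; not part of the installed tooling on vm-104.

# Print the used percentage of the filesystem holding $1 as a bare integer.
# Assumes GNU coreutils df, which supports --output=pcent.
usage_pct() {
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

# Print "ok" when the used percentage ($1) is below the limit ($2),
# otherwise print "over". Both arguments are integers.
classify_usage() {
  if [ "$1" -lt "$2" ]; then echo ok; else echo over; fi
}

# Example (not run here): apply the 80% limit from this runbook to /.
#   classify_usage "$(usage_pct /)" 80
```

A check like this could feed an alert, but the threshold itself comes from the layout described above; nothing here changes disk state.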

Known Root Causes

Frontend E2E staged workspaces

Frontend E2E runs stage GitLab /builds/... workspaces onto a Docker-host path. The harness now prefers /ai-cloud-data/frontend-e2e and removes the current job's staged workspace on exit. It also prunes stale staged workspaces older than 24 hours. Older runs used /tmp/.frontend-e2e; in the April 30, 2026 incident this consumed about 180 GB and filled /.
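The stage, remove-on-exit, and prune-stale behaviour described above can be sketched roughly as follows. This is an illustration only, not the harness source: the function names, the STAGE_ROOT variable, and the flat one-directory-per-job layout are assumptions, and the real harness may organise its directories differently.

```shell
# Hypothetical sketch; the real harness may differ. STAGE_ROOT defaults to the
# path named in this runbook; the per-job directory layout is an assumption.
STAGE_ROOT="${STAGE_ROOT:-/ai-cloud-data/frontend-e2e}"

# Create a private staged workspace for one job and print its path.
stage_workspace() {
  mkdir -p "$STAGE_ROOT" && mktemp -d "$STAGE_ROOT/job.XXXXXX"
}

# Delete workspaces directly under STAGE_ROOT that were last modified
# at least 24 hours ago (find's -mtime +0).
prune_stale_workspaces() {
  find "$STAGE_ROOT" -mindepth 1 -maxdepth 1 -type d -mtime +0 \
    -exec rm -rf {} + 2>/dev/null || true
}

# Typical use inside a job script (not executed here):
#   ws="$(stage_workspace)"
#   trap 'rm -rf "$ws"' EXIT   # remove this job's workspace on any exit
#   prune_stale_workspaces
```

The EXIT trap covers normal completion and most failures; the age-based prune is the backstop for jobs that were killed before the trap could run.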

Safe cleanup:

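# Remove the legacy staging directory used by older harness versions.
sudo rm -rf /tmp/.frontend-e2e
# Remove staged workspaces (exactly three levels down) not modified in the last 24 hours.
sudo find /ai-cloud-data/frontend-e2e -mindepth 3 -maxdepth 3 -type d -mtime +0 -exec rm -rf {} \; 2>/dev/null || true
# Remove any directories left empty by the previous step.
sudo find /ai-cloud-data/frontend-e2e -mindepth 1 -type d -empty -delete 2>/dev/null || true
# Clear the apt package cache and cap the systemd journal at 300 MB.
sudo apt-get clean
sudo journalctl --vacuum-size=300M
df -h / /ai-cloud-data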

Docker JSON logs

Long-running containers can accumulate multi-GB JSON logs under Docker's data root. In the same incident the largest logs were from GitLab and GPUaaS workers.

Inspect:

sudo find /ai-cloud-data/docker/containers -name '*-json.log' -size +1G \
  -printf '%s %p\n' | sort -n | tail -30

Safe cleanup:

sudo find /ai-cloud-data/docker/containers -name '*-json.log' -size +1G \
  -exec truncate -s 0 {} \;
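Before truncating, it can be useful to know which container owns each large log. Docker's json-file driver stores logs as <container-id>/<container-id>-json.log, so the ID can be recovered from the path. The sketch below relies on that layout; the function names are hypothetical, and stat -c assumes GNU coreutils.

```shell
# Hypothetical helper names; the <id>/<id>-json.log layout is the json-file
# driver's convention under Docker's data root.

# Extract the container ID from a JSON log path.
log_container_id() {
  basename "$1" -json.log
}

# Print "<size-bytes> <container-name>" for each log above 1 GiB.
# docker inspect prints the name with a leading slash.
# Needs root (or equivalent access) to read the data root and query Docker.
report_large_logs() {
  find "${1:-/ai-cloud-data/docker/containers}" -name '*-json.log' -size +1G |
    while IFS= read -r log; do
      id="$(log_container_id "$log")"
      name="$(docker inspect --format '{{.Name}}' "$id" 2>/dev/null || echo unknown)"
      echo "$(stat -c %s "$log") $name"
    done
}

# Example (not run here):
#   report_large_logs | sort -n
```

A log whose container no longer exists is reported as "unknown", which usually means the file is safe to truncate or will disappear with the next container prune.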

Old Docker images and runner volumes

CI builds can leave hundreds of old runtime images and unused runner cache volumes.

Conservative cleanup:

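sudo docker container prune -f --filter 'until=24h'
# On Docker 23.0 and later this removes only anonymous volumes; add --all to
# include unused named volumes as well.
sudo docker volume prune -f
sudo docker image prune -af --filter 'until=168h'
sudo docker builder prune -af --filter 'until=168h'
sudo docker system df
df -h /ai-cloud-data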

Do not use docker system prune --volumes without checking active services and runner activity.
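One way to check runner activity and see what a volume prune would touch is sketched below. The function names are hypothetical, and the example name pattern for runner job containers is an assumption about how the GitLab runner names them on this host; adjust it to what docker ps actually shows.

```shell
# Hypothetical pre-check; run as root or a user in the docker group.

# List volumes not referenced by any container (candidates for pruning).
dangling_volumes() {
  docker volume ls --quiet --filter dangling=true
}

# Print names of running containers whose name matches the regex in $1.
running_matching() {
  docker ps --format '{{.Names}}' | grep -E -- "$1" || true
}

# Print "busy" when any matching container is running, otherwise "idle".
runner_state() {
  if [ -n "$(running_matching "$1")" ]; then echo busy; else echo idle; fi
}

# Example (not run here; the pattern is an assumption):
#   runner_state '^runner-.*-concurrent-'
#   dangling_volumes
```

If runner_state reports busy, waiting for the jobs to finish before pruning avoids removing caches or volumes a job is about to reuse.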

Installed Guardrail

vm-104 has a daily systemd cleanup timer:

Script: /usr/local/sbin/gpuaas-docker-cleanup.sh
Service: /etc/systemd/system/gpuaas-docker-cleanup.service
Timer: /etc/systemd/system/gpuaas-docker-cleanup.timer
Schedule: daily around 03:20 UTC with a randomized delay

Each run of the service:

truncates Docker JSON logs larger than 1 GB
prunes stopped containers older than 24 hours
prunes unused volumes
prunes unused images older than 7 days
prunes build cache older than 7 days
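The five steps listed above map onto the manual commands from earlier in this runbook. The sketch below is a reconstruction from that list, not a copy of /usr/local/sbin/gpuaas-docker-cleanup.sh, which may differ; the DRY_RUN switch, the run helper, and the DOCKER_ROOT variable are hypothetical additions for illustration.

```shell
# Hypothetical reconstruction from the documented steps; the installed script
# may differ. Must run as root when DRY_RUN is not set.
DOCKER_ROOT="${DOCKER_ROOT:-/ai-cloud-data/docker}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

docker_cleanup() {
  # Truncate JSON logs larger than 1 GiB in place; containers keep writing.
  run find "$DOCKER_ROOT/containers" -name '*-json.log' -size +1G \
    -exec truncate -s 0 {} +
  run docker container prune -f --filter until=24h   # stopped for over 24 hours
  run docker volume prune -f                         # unused volumes
  run docker image prune -af --filter until=168h     # unused images over 7 days
  run docker builder prune -af --filter until=168h   # build cache over 7 days
}

# Example (not run here): preview the commands without changing anything.
#   DRY_RUN=1 docker_cleanup
```

Printing the commands first is a cheap way to confirm the filters before letting a timer run them unattended.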

Check status:

systemctl list-timers gpuaas-docker-cleanup.timer --no-pager
sudo systemctl status gpuaas-docker-cleanup.timer --no-pager
sudo journalctl -u gpuaas-docker-cleanup.service -n 200 --no-pager

Run manually:

sudo systemctl start gpuaas-docker-cleanup.service

Docker Log Rotation

Docker daemon config on vm-104 includes log rotation:

{
  "data-root": "/ai-cloud-data/docker",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5"
  }
}

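Changes to the daemon config take effect only after the Docker daemon is restarted, and the log options apply only to containers created afterwards. Existing containers keep the logging config they were created with, so restarting a container is not enough; it must be recreated (redeployed). With the values above, each container retains at most five 100 MB log files, about 500 MB in total.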
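Because a container can still be running with an older logging config, it helps to check what each long-running container actually uses. A minimal sketch follows; the function names are hypothetical, and the container name in the example is only an illustration.

```shell
# Hypothetical helpers; run as root or a user in the docker group.

# Print the logging config a container was created with, as JSON.
container_log_config() {
  docker inspect --format '{{json .HostConfig.LogConfig}}' "$1"
}

# Print "rotated" when the JSON in $1 sets a max-size option,
# otherwise print "unrotated".
rotation_state() {
  case "$1" in
    *'"max-size"'*) echo rotated ;;
    *) echo unrotated ;;
  esac
}

# Example (not run here; the container name is illustrative):
#   rotation_state "$(container_log_config gpuaas-gitlab)"
```

Containers reported as unrotated are the ones whose JSON logs can still grow without bound until they are recreated.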

Verification

After cleanup:

df -h / /ai-cloud-data
sudo docker ps --format '{{.Names}} {{.Status}}' | grep -E 'gpuaas-gitlab|gpuaas-gitlab-runner'
git ls-remote platform-control-gitlab:root/GPUasService.git refs/heads/master

If GitLab was blocked by root disk exhaustion, retry the failed push after / has free space.