Platform-Control Disk Cleanup Runbook
Use this runbook when platform-control GitLab, CI, or Docker-backed services fail with
"no space left on device".
Symptoms
- Git push to platform-control GitLab fails with:
  - Failed to write to log
  - No space left on device
  - Internal API unreachable
- CI jobs fail to start or GitLab runner cannot create temporary directories.
- df -h / shows the root filesystem at or near 100%.
Initial Checks
ssh hpcadmin@100.90.157.34
df -h / /ai-cloud-data
sudo du -xh -d1 / 2>/dev/null | sort -h | tail -30
sudo du -xh -d1 /var 2>/dev/null | sort -h | tail -30
sudo du -xh -d1 /ai-cloud-data 2>/dev/null | sort -h | tail -30
sudo docker system df
Expected layout on vm-104:
- / is the OS/root filesystem and must stay below 80%.
- /ai-cloud-data is Docker's data root and can grow quickly from CI images, GitLab runner cache volumes, and container logs.
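The 80% limit on / can be checked mechanically. The following is a minimal sketch, assuming GNU df (for `--output=pcent`); the function names `fs_use_pct` and `check_fs_below` are illustrative and not part of any installed tooling on vm-104.

```shell
# Sketch: warn when a filesystem is at or above a usage limit.
# Assumes GNU df; function names are hypothetical.

# Print the integer use% of the filesystem that holds the given path.
fs_use_pct() {
    df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

# Return 0 if usage is below the limit (default 80), 1 if at or above it,
# 2 if the usage could not be read.
check_fs_below() {
    path=$1
    limit=${2:-80}
    pct=$(fs_use_pct "$path")
    [ -n "$pct" ] || return 2
    if [ "$pct" -ge "$limit" ]; then
        echo "WARN: $path at ${pct}% (limit ${limit}%)"
        return 1
    fi
    echo "OK: $path at ${pct}% (limit ${limit}%)"
}

# Usage:
#   check_fs_below / 80
#   check_fs_below /ai-cloud-data 90   # 90 is an example value, not a documented limit
```

Only the 80% figure for / comes from this runbook; any limit passed for /ai-cloud-data is an assumption to adjust locally.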
Known Root Causes
Frontend E2E staged workspaces
Frontend E2E runs stage GitLab /builds/... workspaces onto a Docker-host path.
The harness now prefers /ai-cloud-data/frontend-e2e and removes the current
job's staged workspace on exit. It also prunes stale staged workspaces older
than 24 hours. Older runs used /tmp/.frontend-e2e; in the April 30, 2026
incident this consumed about 180 GB and filled /.
Safe cleanup:
sudo rm -rf /tmp/.frontend-e2e
sudo find /ai-cloud-data/frontend-e2e -mindepth 3 -maxdepth 3 -type d -mtime +0 -exec rm -rf {} \; 2>/dev/null || true
sudo find /ai-cloud-data/frontend-e2e -mindepth 1 -type d -empty -delete 2>/dev/null || true
sudo apt-get clean
sudo journalctl --vacuum-size=300M
df -h / /ai-cloud-data
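Before deleting, it can help to preview what the age-based find would remove. This sketch reuses the same depth and age filters as the cleanup command and only prints matches; `list_stale_workspaces` is an illustrative name, not an existing script.

```shell
# Sketch: list staged workspaces that the cleanup find would delete,
# without removing anything. Same filters as the cleanup command;
# -print replaces -exec rm -rf. The function name is hypothetical.
list_stale_workspaces() {
    root=${1:-/ai-cloud-data/frontend-e2e}
    # -mtime +0 matches directories last modified at least 24 hours ago.
    sudo find "$root" -mindepth 3 -maxdepth 3 -type d -mtime +0 -print 2>/dev/null
}

# Usage:
#   list_stale_workspaces
#   list_stale_workspaces /ai-cloud-data/frontend-e2e | wc -l
```

If the list contains a workspace that belongs to a job still running, wait for that job to finish before running the cleanup.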
Docker JSON logs
Long-running containers can accumulate multi-GB JSON logs under Docker's data root. In the same incident the largest logs were from GitLab and GPUaaS workers.
Inspect:
sudo find /ai-cloud-data/docker/containers -name '*-json.log' -size +1G \
-printf '%s %p\n' | sort -n | tail -30
Safe cleanup:
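One way to reclaim this space, assuming the same approach the daily cleanup timer takes (truncating logs larger than 1 GB), is to truncate the files in place rather than delete them. The Docker daemon keeps each log file open, so a deleted file would not release its space until the container stops. `truncate_large_logs` is an illustrative name, not an existing script.

```shell
# Sketch: truncate oversized Docker JSON logs in place.
# Assumes GNU find and coreutils truncate; the function name is hypothetical.
truncate_large_logs() {
    dir=${1:-/ai-cloud-data/docker/containers}
    min_size=${2:-+1G}   # find -size syntax: +1G means larger than 1 GiB
    sudo find "$dir" -name '*-json.log' -size "$min_size" \
        -exec truncate -s 0 {} +
}

# Usage:
#   truncate_large_logs
#   df -h /ai-cloud-data
```

Truncation discards the log history for the affected containers, so copy out anything needed for the incident review first.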
Old Docker images and runner volumes
CI builds can leave hundreds of old runtime images and unused runner cache volumes.
Conservative cleanup:
sudo docker container prune -f --filter 'until=24h'
sudo docker volume prune -f
sudo docker image prune -af --filter 'until=168h'
sudo docker builder prune -af --filter 'until=168h'
sudo docker system df
df -h /ai-cloud-data
Do not use docker system prune --volumes without checking active services and
runner activity.
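A quick way to do that check is to list what is running next to the volumes that no container currently references. This is a sketch only; `preview_volume_prune` is an illustrative name.

```shell
# Sketch: show running containers and unreferenced volumes before any
# volume prune. The function name is hypothetical.
preview_volume_prune() {
    echo "Running containers:"
    sudo docker ps --format '{{.Names}} {{.Status}}'
    echo "Volumes not referenced by any container:"
    sudo docker volume ls -q --filter dangling=true
}

# Usage:
#   preview_volume_prune
```

If a GitLab or runner container is stopped at the time of the check, its volumes appear in the second list, so confirm that the expected services are up before pruning. Depending on the Docker version, `docker volume prune` without `--all` may remove only anonymous volumes, in which case this list is a superset of what would be deleted.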
Installed Guardrail
vm-104 has a daily systemd cleanup timer:
- Script: /usr/local/sbin/gpuaas-docker-cleanup.sh
- Service: /etc/systemd/system/gpuaas-docker-cleanup.service
- Timer: /etc/systemd/system/gpuaas-docker-cleanup.timer
- Schedule: daily around 03:20 UTC with a randomized delay
The timer:
- truncates Docker JSON logs larger than 1 GB
- prunes stopped containers older than 24 hours
- prunes unused volumes
- prunes unused images older than 7 days
- prunes build cache older than 7 days
Check status:
systemctl list-timers gpuaas-docker-cleanup.timer --no-pager
sudo systemctl status gpuaas-docker-cleanup.timer --no-pager
sudo journalctl -u gpuaas-docker-cleanup.service -n 200 --no-pager
Run manually:
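Assuming the service unit listed above can be started directly (typical for a unit driven by a timer), a manual run can be sketched as follows. `run_cleanup_now` is an illustrative wrapper, not an installed command.

```shell
# Sketch: trigger the cleanup service once and show its recent log output.
# Assumes the unit can be started on demand; the function name is hypothetical.
run_cleanup_now() {
    unit=gpuaas-docker-cleanup.service
    sudo systemctl start "$unit" &&
        sudo journalctl -u "$unit" -n 50 --no-pager
}

# Usage:
#   run_cleanup_now
#   df -h / /ai-cloud-data
```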
Docker Log Rotation
Docker daemon config on vm-104 includes log rotation:
{
"data-root": "/ai-cloud-data/docker",
"log-driver": "json-file",
"log-opts": {
"max-size": "100m",
"max-file": "5"
}
}
The log setting applies when containers are recreated. Existing containers may continue with their previous logging config until restart/redeploy.
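To see whether a given container has picked up the rotation settings, its effective log configuration can be read with docker inspect. This is a minimal sketch; `show_log_config` is an illustrative name, and the container names in the usage line are the ones used in the verification step of this runbook.

```shell
# Sketch: print the log driver and log options a container was created with.
# The function name is hypothetical; pass one or more container names or IDs.
show_log_config() {
    sudo docker inspect \
        --format '{{.Name}} {{.HostConfig.LogConfig.Type}} {{json .HostConfig.LogConfig.Config}}' \
        "$@"
}

# Usage:
#   show_log_config gpuaas-gitlab gpuaas-gitlab-runner
```

A container created before the daemon config change may show no max-size or max-file options, which indicates it needs to be recreated to get rotation.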
Verification
After cleanup:
df -h / /ai-cloud-data
sudo docker ps --format '{{.Names}} {{.Status}}' | grep -E 'gpuaas-gitlab|gpuaas-gitlab-runner'
git ls-remote platform-control-gitlab:root/GPUasService.git refs/heads/master
If GitLab was blocked by root disk exhaustion, retry the failed push after /
has free space.