Skip to content

Vault Bootstrap and Root Token Runbook

Purpose: - Define the one-time bootstrap, unseal, and immediate post-bootstrap handling for platform-control Vault.

Scope

  1. First initialization of persistent platform-control Vault
  2. Unseal key and root token handling
  3. Transition from bootstrap/root access to intended operational access
  4. Recovery checks after restart or host migration

Required Policy

  • The generated Vault root token is break-glass material, not a steady-state runtime credential.
  • The unseal key must be escrowed outside the cluster in an operator-controlled location.
  • The root token must be recorded as bootstrap evidence, then rotated or replaced by a limited-scope operational token.
  • Runtime services should consume only the minimum token/policy required for their Vault paths.

Bootstrap Procedure

  1. Confirm the persistent Vault pod is healthy and using persistent storage.
  2. Initialize Vault once.
  3. Escrow:
  4. unseal key
  5. initial root token
  6. Unseal Vault.
  7. Enable required mounts such as kv/.
  8. Seed required baseline secrets for the environment.
  9. Create or issue the intended limited-scope operational token/policy.
  10. Patch runtime secrets/config to use the operational token, not the root token.
  11. Verify restart behavior:
  12. pod restart
  13. unseal procedure
  14. mount availability
  15. secret readability

Recovery Checks

  • vault status shows Initialized=true
  • vault status shows Sealed=false after unseal
  • expected mounts exist (kv/ at minimum)
  • platform-control registry secrets readable
  • MAAS site secret paths readable if configured

Deploy Preflight Behavior

scripts/ci/platform_control_deploy.sh performs a Vault readiness preflight before Vault-backed registry secrets are reconciled. The deploy may continue only when vault status reports Initialized=true and Sealed=false.

After registry secrets are reconciled, the deploy verifies the publisher and puller Vault paths are readable. The preflight logs only safe status strings such as Vault readiness: Initialized=true Sealed=false; it must never print unseal keys, root tokens, operational tokens, registry passwords, or secret values.

Platform-Control Deploy Failure: Vault Sealed

Use this when a platform-control release or deploy fails while reconciling Vault-backed secrets and the deploy trace contains Vault is sealed.

Typical symptoms: - platform_control_deploy fails before Kubernetes manifests are applied or validated. - The failing step is near reconcile platform-control Vault registry secrets. - vault status reports Initialized=true and Sealed=true.

Safe status check:

ssh hpcadmin@100.90.157.34 \
  'sudo k3s kubectl -n gpuaas-infra exec vault-0 -c vault -- sh -c "export VAULT_ADDR=http://127.0.0.1:8200; vault status -format=json"'

Break-glass unseal from the persistent Vault init material:

ssh hpcadmin@100.90.157.34 'sudo k3s kubectl -n gpuaas-infra exec -i vault-0 -c vault -- sh -eu' <<'EOF'
export VAULT_ADDR=http://127.0.0.1:8200
init_file=/vault/data/init.json
if [ ! -s "$init_file" ]; then
  echo "init file missing" >&2
  exit 1
fi
unseal_key=$(awk -F'"' '/"unseal_keys_b64"/ {getline; print $2; exit}' "$init_file")
if [ -z "$unseal_key" ]; then
  echo "unseal key parse failed" >&2
  exit 1
fi
vault operator unseal "$unseal_key" >/dev/null
vault status -format=json
EOF

After unseal: 1. Confirm Sealed=false. 2. Retry the failed GitLab deploy job instead of hand-applying manifests. 3. Confirm the release pipeline reaches remote validation. 4. Confirm platform-control pods are healthy.

Security notes: - Do not print, copy, or paste the unseal key or root token into chat, CI logs, shell history, or documentation. - /vault/data/init.json is sensitive break-glass material. It is not a substitute for operator-controlled escrow outside the cluster. - The gpuaas-infra-secrets Kubernetes secret may contain bootstrap/root-token material, but it must not be treated as the long-term unseal-key authority.

Follow-up Required After Any Manual Bootstrap

  1. Record the event in doc/operations/evidence/secrets_key_ops.md.
  2. Replace root-token runtime usage with the intended operational token.
  3. Confirm no cluster secret or runtime config still depends on the root token unless explicitly documented as break-glass.
  4. If this was triggered by a deployment, add evidence to the related release notes and verify the deploy job was retried from CI rather than manually bypassed.