Node Onboarding Runbook

Trigger

  • New GPU node is registered and needs onboarding into control-plane trust.
  • Node onboarding must be retried due to expired token, failed enrollment, or bootstrap drift.
  • End-to-end provisioning validation is needed on a real node before release.

Objective

  • Deliver a valid bootstrap bundle per node.
  • Complete enrollment using a single-use enrollment token.
  • Ensure onboarding mode (manual or maas) follows the same security controls.
  • Validate full runtime path: enroll -> task wait -> task result -> allocation lifecycle.

Preconditions

  1. Node exists in admin inventory with correct node metadata.
  2. Operator selects onboarding mode:
     • manual: operator installs the bootstrap payload on node.
     • maas: provider automation injects bootstrap payload.
  3. Node can reach only api.internal for onboarding.
  4. For VPN-restricted test nodes, SSH access from control-plane host is available.

End-to-End Validation Checklist (Manual Mode)

A. Control plane: register node and capture IDs

# Fetch an admin access token from the local dev OIDC token endpoint.
# Quotes inside $(...) must not be backslash-escaped, or curl receives literal quote characters.
ADMIN_TOKEN="$(curl -s -X POST "http://localhost:8080/realms/gpuaas/protocol/openid-connect/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=password" \
  -d "client_id=gpuaas-api" \
  -d "client_secret=dev-client-secret" \
  -d "username=dev-admin" \
  -d "password=admin123" | jq -r .access_token)"

curl -s -X POST "http://localhost:8081/api/v1/admin/nodes" \
  -H "Authorization: Bearer ${ADMIN_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "host":"<NODE_IP>",
    "port":22,
    "ssh_username":"hpcadmin",
    "ssh_public_key":"ssh-ed25519 AAAA...",
    "sku":"h100",
    "gpus_total":8,
    "region_code":"region-1",
    "onboarding_mode":"manual",
    "probe":true
  }' | tee /tmp/node-create.json

NODE_ID="$(jq -r .id /tmp/node-create.json)"
echo "NODE_ID=${NODE_ID}"

B. Control plane: obtain bootstrap token material

Required model:

  • Use per-node bootstrap bundle delivery with a single-use enrollment token.
  • Do not use shared/static internal enrollment tokens in onboarding.

C. Control plane: use the bootstrap bundle as the source of truth

The enrollment-token response must include bootstrap_bundle. Use that bundle directly; do not reconstruct:

  • GPUAAS_API_URL
  • GPUAAS_TASK_SIGNING_PUBKEYS
  • CA trust paths
  • cert/key paths

D. Node: install and run node-agent

Local-dev-only prerequisite for bridged VMs or external test nodes:

  1. Install hostname mapping for the environment-reachable bootstrap hosts (for example api.gpuaas.test, registry.gpuaas.test).
  2. Install the local dev CA into the node trust store before the first bootstrap fetch.
  3. Verify trust with curl https://api.gpuaas.test/api/v1/healthz.
  4. Only then continue with the bootstrap script or bundle fetch.
  5. For local-dev external nodes, keep the bootstrap fetch URL on https://api.gpuaas.test but set the installed node-agent runtime URL (GPUAAS_API_URL) to http://<host-ip>:8081.

If the node was already enrolled and kind parity config changed later, use the checked-in reconcile helper instead of editing /etc/hosts or re-running docker login by hand:

make kind-parity-reconcile-external-node TARGET=hpcadmin@<node-host-or-ip>

This trust-bootstrap pre-step applies only to local-dev/private-CA environments. Do not require it in environments that present publicly trusted certificates.

Example local-dev CA install from the Mac host:

# On the Mac host
mkcert -CAROOT
scp "$(mkcert -CAROOT)/rootCA.pem" hpcadmin@<node-host-or-ip>:~/mkcert-rootCA.pem

# On the node
sudo cp ~/mkcert-rootCA.pem /usr/local/share/ca-certificates/mkcert-rootCA.crt
sudo update-ca-certificates
curl -v https://api.gpuaas.test/api/v1/healthz

export GPUAAS_NODE_ID="<NODE_ID>"
export GPUAAS_API_URL="<BOOTSTRAP_BUNDLE_API_URL>"
export GPUAAS_ENROLLMENT_TOKEN="<SINGLE_USE_ENROLLMENT_TOKEN>"
export GPUAAS_TASK_SIGNING_PUBKEYS="<BOOTSTRAP_BUNDLE_TASK_SIGNING_PUBKEYS>"
export GPUAAS_CA_BUNDLE_PATH="<BOOTSTRAP_BUNDLE_CA_BUNDLE_PATH>"
export GPUAAS_NODE_CERT_CA_BUNDLE_PATH="<BOOTSTRAP_BUNDLE_NODE_CERT_CA_BUNDLE_PATH>"
export GPUAAS_CERT_PATH="<BOOTSTRAP_BUNDLE_CERT_PATH>"
export GPUAAS_KEY_PATH="<BOOTSTRAP_BUNDLE_KEY_PATH>"
export GPUAAS_NODE_AGENT_INSTALL_ROOT="<BOOTSTRAP_PACKAGE_INSTALL_ROOT>"
export GPUAAS_NODE_AGENT_ENV_FILE="<BOOTSTRAP_PACKAGE_ENV_FILE_PATH>"
export GPUAAS_NODE_AGENT_SYSTEMD_UNIT="<BOOTSTRAP_PACKAGE_SYSTEMD_UNIT_NAME>"
mkdir -p "$HOME/.gpuaas"
cat <<'EOF' > "$GPUAAS_CA_BUNDLE_PATH"
<BOOTSTRAP_BUNDLE_CA_PEM>
EOF

# Pull the bootstrap OCI package by digest using the short-lived pull credential
# delivered in the bootstrap bundle, then run:
sudo /bundle/install-node-agent.sh

Expected logs:

  • startup line
  • enrollment success (or explicit enrollment failure reason)
  • recurring task wait loop without auth errors
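
Before running the install script in step D, it can help to confirm that every exported variable was actually filled in from the bundle. The sketch below is a hypothetical pre-flight helper, not part of the node-agent tooling: check_bootstrap_env is an invented name, and treating any value of the form <...> as an unreplaced placeholder is an assumption based on the template above.

```shell
# Hypothetical pre-flight helper (not shipped with node-agent).
# Prints each listed variable that is unset, empty, or still holds a
# "<PLACEHOLDER>" value; returns non-zero if any were found.
check_bootstrap_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name-}"
    case "$value" in
      ''|'<'*'>')
        echo "not ready: $name"
        missing=1
        ;;
    esac
  done
  return "$missing"
}

# Usage on the node, after the exports above:
# check_bootstrap_env GPUAAS_NODE_ID GPUAAS_API_URL GPUAAS_ENROLLMENT_TOKEN \
#   GPUAAS_TASK_SIGNING_PUBKEYS GPUAAS_CA_BUNDLE_PATH \
#   GPUAAS_NODE_CERT_CA_BUNDLE_PATH GPUAAS_CERT_PATH GPUAAS_KEY_PATH \
#   && echo "environment looks complete"
```

The helper only checks that values are present; it does not validate that they match the bundle.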

E. Control plane: verify lifecycle and provisioning path

  1. Node appears as active in admin nodes list.
  2. Create allocation for a test user.
  3. Verify progression:
     • requested -> provisioning -> active
  4. Release allocation and verify:
     • active -> releasing -> released
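
The two progressions above can be checked mechanically against a recorded sequence of allocation states. This is a minimal sketch under assumptions: how the states are observed (for example by polling an admin API) is not covered by this runbook, and both function names are hypothetical. Only the transitions listed in step E are encoded.

```shell
# Returns 0 when "$1 -> $2" is one of the allocation transitions listed in step E.
is_expected_allocation_step() {
  case "$1->$2" in
    'requested->provisioning'|'provisioning->active') return 0 ;;
    'active->releasing'|'releasing->released') return 0 ;;
    *) return 1 ;;
  esac
}

# Reads observed states from stdin, one per line, and fails on the first
# step that is not an expected transition.
check_allocation_trace() {
  prev=''
  while IFS= read -r state; do
    if [ -n "$prev" ] && ! is_expected_allocation_step "$prev" "$state"; then
      echo "unexpected transition: $prev -> $state"
      return 1
    fi
    prev="$state"
  done
  return 0
}

# Usage:
# printf '%s\n' requested provisioning active releasing released | check_allocation_trace
```

If the states come from repeated polling, collapse consecutive duplicates first (for example with uniq), since a repeated state is not a transition.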

F. Negative validation

Run all before signoff:

  1. Replay the same enrollment token for a first enrollment after success -> rejected, unless the request is a recovery re-enrollment of an already-onboarded node.
  2. Expired token -> rejected.
  3. Token for wrong node_id -> rejected.
  4. Node without valid cert on wait/result endpoints -> rejected.
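
Each negative check above reduces to "send the request, confirm it was rejected". The sketch below records that verdict from an HTTP status code. It assumes the control plane signals rejection with a 4xx status; this runbook does not state the exact codes, and assert_rejected is a hypothetical helper name. A rejection at the TLS layer (as may happen for the missing-cert case) would surface in curl as status 000 and needs separate handling.

```shell
# Hypothetical helper: classifies an HTTP status code for a negative check.
# Assumption: rejection is signalled with a 4xx status.
assert_rejected() {
  label="$1"
  status="$2"
  case "$status" in
    4[0-9][0-9]) echo "PASS: $label rejected ($status)"; return 0 ;;
    *)           echo "FAIL: $label returned $status"; return 1 ;;
  esac
}

# Usage (the request itself depends on the endpoint under test, which is
# left as a placeholder here):
# status="$(curl -s -o /dev/null -w '%{http_code}' <REQUEST_UNDER_TEST>)"
# assert_rejected "expired token" "$status"
```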

Bootstrap Procedure

  1. Generate a fresh bootstrap bundle from control plane.
  2. Verify bundle contains:
     • GPUAAS_NODE_ID
     • GPUAAS_ENROLLMENT_TOKEN
     • GPUAAS_API_URL
     • GPUAAS_TASK_SIGNING_PUBKEYS
     • CA bundle PEM + fingerprint
     • GPUAAS_CA_BUNDLE_PATH
     • GPUAAS_NODE_CERT_CA_BUNDLE_PATH
     • GPUAAS_CERT_PATH
     • GPUAAS_KEY_PATH
     • package.oci_ref
     • package.digest
     • package.pull_delivery_mode
     • package.install_root
     • package.env_file_path
     • package.systemd_unit_name
  3. Confirm token policy before delivery:
     • node-bound
     • TTL-limited
     • consume-once
  4. Deliver bundle via selected onboarding mode.
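
The "verify bundle contains" step lends itself to a scripted completeness check. The sketch below takes the key names present in a bundle on stdin and reports which required names are absent. It is illustrative only: the bundle's serialization is not specified in this runbook, so extracting the key names is left to the caller, and missing_bundle_keys is a hypothetical helper. The CA bundle PEM and fingerprint are omitted because their field names are not given here.

```shell
# Required names copied from the bundle checklist in the Bootstrap Procedure.
REQUIRED_BUNDLE_KEYS='GPUAAS_NODE_ID GPUAAS_ENROLLMENT_TOKEN GPUAAS_API_URL
GPUAAS_TASK_SIGNING_PUBKEYS GPUAAS_CA_BUNDLE_PATH
GPUAAS_NODE_CERT_CA_BUNDLE_PATH GPUAAS_CERT_PATH GPUAAS_KEY_PATH
package.oci_ref package.digest package.pull_delivery_mode
package.install_root package.env_file_path package.systemd_unit_name'

# Reads the key names present in a bundle (one per line) from stdin and prints
# each required name that is absent; returns non-zero if any are missing.
missing_bundle_keys() {
  present="$(cat)"
  rc=0
  for key in $REQUIRED_BUNDLE_KEYS; do
    # -x matches whole lines, -F treats the dotted names as fixed strings.
    if ! printf '%s\n' "$present" | grep -qxF -- "$key"; then
      echo "missing: $key"
      rc=1
    fi
  done
  return "$rc"
}

# Usage: pipe in whatever lists the bundle's keys, one per line.
# <LIST_BUNDLE_KEYS_COMMAND> | missing_bundle_keys
```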

Mode-Specific Steps

Manual Mode

  1. Deliver bundle to operator through approved secure channel.
  2. Operator installs bundle on target node only.
  3. Start/restart node agent and watch enrollment status transition.

MaaS Mode

  1. Attach bundle to provider automation/cloud-init payload.
  2. Validate target node identity in provider workflow.
  3. Boot node and observe first enrollment attempt.

MaaS Adapter Validation Checklist (real instance)

Use this checklist before marking MaaS onboarding production-ready:

  1. Confirm API config:
     • MAAS_API_BASE_URL points to real MaaS endpoint.
     • MAAS_API_TOKEN is present and not logged.
     • timeout is configured (MAAS_API_TIMEOUT_MS) and reasonable for environment.
  2. Submit node create with:
     • onboarding_mode: "maas"
     • probe: true where networking allows.
  3. Verify expected outcomes:
     • success path: node create returns 201 and onboarding proceeds.
     • unavailable path: adapter missing/unreachable returns canonical service_unavailable.
     • provider failure path: prepare hook failure returns canonical service_unavailable.
  4. Capture and store incident/debug evidence:
     • API error envelope code, message, correlation_id.
     • API log line by correlation_id.
  5. Validate fallback:
     • same node can be created/onboarded with manual mode when MaaS path is unavailable.
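
Step 1 of the checklist above can be partly automated. The sketch below assumes the three MaaS settings are exposed as environment variables in the shell where the check runs, which this runbook does not confirm; check_maas_config is a hypothetical helper. It never prints the token value, in keeping with the "not logged" requirement, and it does not judge whether the timeout is reasonable for the environment.

```shell
# Hypothetical helper: checks the three MaaS settings named in the checklist
# without echoing the token value. Returns non-zero if any check fails.
check_maas_config() {
  rc=0
  case "${MAAS_API_BASE_URL-}" in
    http://*|https://*) ;;
    *) echo "MAAS_API_BASE_URL is unset or not an http(s) URL"; rc=1 ;;
  esac
  if [ -z "${MAAS_API_TOKEN-}" ]; then
    echo "MAAS_API_TOKEN is unset"
    rc=1
  fi
  case "${MAAS_API_TIMEOUT_MS-}" in
    # Empty, or containing any non-digit character.
    ''|*[!0-9]*) echo "MAAS_API_TIMEOUT_MS is unset or not a whole number"; rc=1 ;;
  esac
  return "$rc"
}

# Usage:
# check_maas_config && echo "MaaS config present"
```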

Node Lifecycle Transitions (Retire/Reactivate/Remove)

Purpose

  • Define safe operator procedures for retire, reactivate, and remove.
  • Keep lifecycle actions consistent with node-agent identity/cert behavior.

Retire (reversible)

  1. Preconditions:
     • node has no active/releasing/release_failed allocation ownership conflicts.
  2. Action:
     • call admin lifecycle API to transition node to retired state.
  3. Expected runtime behavior:
     • node-agent may still be online and polling, but control-plane should reject runtime assignment/terminal use for retired node.
     • cert renew can continue for identity continuity unless node is removed.
  4. Rollback path:
     • use reactivate if retire was premature.

Reactivate (reversible)

  1. Preconditions:
     • node is currently retired.
  2. Action:
     • call admin lifecycle API to reactivate same node_id.
  3. Expected runtime behavior:
     • node returns to schedulable/assignable state if health checks pass.
     • existing node-agent identity remains valid.
  4. Validation:
     • verify node appears active in admin nodes list.
     • verify new allocation can target/select node.

Remove (terminal/irreversible)

  1. Preconditions:
     • node is retired.
     • no allocations bound to node.
  2. Action:
     • call admin lifecycle API to permanently remove node.
  3. Expected runtime behavior:
     • node-agent polling and cert-renew paths are denied for removed identity.
     • node must be re-added as a new node identity if reintroduced later.
  4. Rollback:
     • no direct rollback; re-onboard as a new node with new node_id and token bundle.
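
Taken together, the retire, reactivate, and remove procedures describe a small state machine. The sketch below encodes only the state preconditions stated above; allocation checks and health checks are out of scope. The function name and the state label "removed" are illustrative, not the API's own vocabulary.

```shell
# Prints the node state that results from applying an action, or fails when
# the action is not allowed from the current state.
# $1 = current state (active | retired | removed), $2 = action.
next_node_state() {
  case "$1:$2" in
    active:retire)      echo retired ;;
    retired:reactivate) echo active ;;
    retired:remove)     echo removed ;;
    *) echo "denied: cannot $2 a node in state $1" >&2; return 1 ;;
  esac
}

# Usage:
# next_node_state retired reactivate
```

Because no action is defined from the removed state, the sketch reflects that remove is terminal: a reintroduced host must be onboarded as a new node.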

Observability Signatures for Lifecycle Incidents

  1. Capture API error envelope fields:
     • code, message, correlation_id
  2. Query API logs by correlation:
     • {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"
  3. If agent-side behavior mismatch is suspected, query node-agent logs:
     • filter on node_id, lifecycle action, and cert renew/poll errors.
  4. Expected signal examples:
     • retire denied due to in-use node/allocation state (node_in_use class)
     • remove denied due to lifecycle precondition (invalid_request class)
     • agent polling rejected after remove (identity no longer valid)
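
To avoid hand-editing the log query during an incident, the correlation ID can be substituted by a small helper. This sketch only builds the query string shown above; api_log_query is a hypothetical name. The query syntax resembles Grafana Loki's LogQL, but which log backend and CLI are in use is an assumption this runbook does not confirm.

```shell
# Builds the API-log query from this runbook for a given correlation ID.
# Rejects empty values and values containing a double quote so the quoted
# selector stays well-formed.
api_log_query() {
  case "$1" in
    ''|*'"'*) echo "invalid correlation_id" >&2; return 1 ;;
  esac
  printf '{service="gpuaas-api"} | json | correlation_id="%s"\n' "$1"
}

# Usage, assuming logs are in Loki and logcli is installed and configured:
# logcli query "$(api_log_query "<CORRELATION_ID>")"
```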

Post-Retire Decision Path

  1. If host is healthy and intended for reuse:
     • reactivate.
  2. If host is being decommissioned/reimaged/transferred:
     • remove.
  3. If host should remain retained but unschedulable:
     • keep retired and document ownership/intent in ops ticket.

Expiry/Reissue Path

  1. If token is expired or already consumed, do not reuse prior bundle.
  2. Reissue a new bootstrap bundle for the same node.
  3. Invalidate previous onboarding artifact and audit the reissue event.
  4. Repeat onboarding from bootstrap procedure.
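
The reissue decision in step 1 can be expressed as a simple predicate. The sketch below assumes the operator already knows the token's expiry time (as Unix epoch seconds) and whether it has been consumed; how the control plane exposes that metadata is not described in this runbook, and token_needs_reissue is a hypothetical helper.

```shell
# Decides whether a bundle's token can still be used.
#   $1  expiry as Unix epoch seconds
#   $2  "true" if the token was already consumed
#   $3  current time as epoch seconds (optional; defaults to now)
# Returns 0 (and prints the reason) when a new bundle must be issued.
token_needs_reissue() {
  expires_at="$1"
  consumed="$2"
  now="${3:-$(date +%s)}"
  if [ "$consumed" = "true" ]; then
    echo "reissue: token already consumed"
    return 0
  fi
  if [ "$now" -ge "$expires_at" ]; then
    echo "reissue: token expired"
    return 0
  fi
  echo "usable"
  return 1
}

# Usage:
# token_needs_reissue "<EXPIRES_AT_EPOCH>" false && echo "issue a new bundle"
```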

Expired Cert Recovery

Use this when a node certificate is present but already expired at node-agent startup.

  1. Confirm node has a valid GPUAAS_ENROLLMENT_TOKEN in its current bootstrap bundle. On enrolled nodes this should be a node-bound recovery token, not only the consumed first-boot token.
  2. Restart node-agent and verify logs show the expired cert recovery enrollment path. Current agents also trigger this path from task polling before sending an mTLS request with an expired local certificate.
  3. If no token is available, reissue a node-bound single-use token and restart node-agent.
  4. Validate node returns to active polling path after fresh cert issuance.

Security Controls

  1. Never use shared/static internal enrollment credentials.
  2. Enforce single-use enrollment token semantics on every first enrollment.
  3. Do not expose CA private endpoints (step-ca) to nodes.
  4. Limit node onboarding network path to api.internal.
  5. Record onboarding, failure, and reissue actions in audit trail.

SSH key management operational controls

  1. SSH key mutations must be auditable (actor, target, result, correlation_id).
  2. Default-key and allocation key-set overrides require ownership/admin authorization.
  3. Key-set rollback path:
     • restore prior default key assignment and prior allocation key-set mapping.
     • verify next provisioning/reconnect uses restored key set.
  4. Rotation guidance:
     • add new key, validate access, then demote/remove old key.
     • avoid destructive key removal before validation window completes.
  5. Alert/runbook linkage:
     • SSH key management anomalies map to this runbook and ops.node.onboarding.

SKU catalog governance and node-registration controls

  1. Node registration must reference an active SKU catalog entry; free-text SKU values are not permitted.
  2. Validate catalog-to-node consistency before onboarding:
     • node sku_id resolves to active SKU metadata.
     • node GPU capacity/region attributes match SKU catalog constraints.
  3. Invalid SKU reference response workflow:
     • reject registration/update request and capture correlation_id.
     • file ops incident note with failing node_id, sku_id, and API error envelope.
     • route to SKU catalog owner for correction before onboarding retry.
  4. Ownership and approval policy:
     • SKU catalog lifecycle changes require dual approval from Platform Ops + Inventory owner.
     • deactivation of an in-use SKU requires impact review and rollback plan.
  5. Alert/runbook linkage:
     • invalid SKU reference alerts map to this runbook and SKU catalog incident workflow.

Terminal least-privilege and sudoers controls

  1. Enforce terminal least-privilege boundary:
     • terminal session process runs as allocation user only.
     • terminal path must not execute lifecycle/provisioning privileged operations.
  2. Validate sudo policy on node:
     • use explicit command allowlist in sudoers.
     • do not allow blanket NOPASSWD: ALL policy for runtime terminal operations.
  3. During onboarding verification, record sudoers policy checksum/location and reviewer.
  4. On policy drift or privilege escalation findings, quarantine node until corrected.
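
The blanket-policy check in step 2 can be approximated with a pattern match over a sudoers file. The sketch below is a heuristic and not a substitute for review: it flags only uncommented rules that end in NOPASSWD: ALL, so it will miss rules where ALL appears in a longer command list or policies split across included files. has_blanket_nopasswd is a hypothetical helper, and the policy file location is left as a placeholder because it is not specified here.

```shell
# Returns 0 if the given sudoers file contains an uncommented rule that ends
# in "NOPASSWD: ALL", the blanket policy prohibited by the controls above.
has_blanket_nopasswd() {
  grep -Eq '^[[:space:]]*[^#[:space:]].*NOPASSWD:[[:space:]]*ALL[[:space:]]*$' "$1"
}

# Usage during onboarding verification:
# if has_blanket_nopasswd "<SUDOERS_POLICY_FILE>"; then
#   echo "blanket NOPASSWD rule found: quarantine node until corrected"
# fi
# sha256sum "<SUDOERS_POLICY_FILE>"   # record checksum, location, and reviewer
```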

Validation

  1. Node status moves from onboarding state to enrolled/active.
  2. Enrollment token cannot be replayed after success.
  3. Reissued token works only for intended node.
  4. Allocation lifecycle transitions are workflow-driven and complete end-to-end on the onboarded node.