Node Onboarding Runbook¶
Trigger¶
- New GPU node is registered and needs onboarding into control-plane trust.
- Node onboarding must be retried due to expired token, failed enrollment, or bootstrap drift.
- End-to-end provisioning validation is needed on a real node before release.
Objective¶
- Deliver a valid bootstrap bundle per node.
- Complete enrollment using a single-use enrollment token.
- Ensure onboarding mode (
manualormaas) follows the same security controls. - Validate full runtime path: enroll -> task wait -> task result -> allocation lifecycle.
Preconditions¶
- Node exists in admin inventory with correct node metadata.
- Operator selects onboarding mode:
manual: operator installs the bootstrap payload on node.maas: provider automation injects bootstrap payload.- Node can reach only
api.internalfor onboarding. - For VPN-restricted test nodes, SSH access from control-plane host is available.
End-to-End Validation Checklist (Manual Mode)¶
A. Control plane: register node and capture IDs¶
ADMIN_TOKEN="$(curl -s -X POST \"http://localhost:8080/realms/gpuaas/protocol/openid-connect/token\" \
-H \"Content-Type: application/x-www-form-urlencoded\" \
-d \"grant_type=password\" \
-d \"client_id=gpuaas-api\" \
-d \"client_secret=dev-client-secret\" \
-d \"username=dev-admin\" \
-d \"password=admin123\" | jq -r .access_token)"
curl -s -X POST "http://localhost:8081/api/v1/admin/nodes" \
-H "Authorization: Bearer ${ADMIN_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"host":"<NODE_IP>",
"port":22,
"ssh_username":"hpcadmin",
"ssh_public_key":"ssh-ed25519 AAAA...",
"sku":"h100",
"gpus_total":8,
"region_code":"region-1",
"onboarding_mode":"manual",
"probe":true
}' | tee /tmp/node-create.json
NODE_ID="$(jq -r .id /tmp/node-create.json)"
echo "NODE_ID=${NODE_ID}"
B. Control plane: obtain bootstrap token material¶
Required model: - use per-node bootstrap bundle delivery with single-use enrollment token. - do not use shared/static internal enrollment tokens in onboarding.
C. Control plane: use the bootstrap bundle as the source of truth¶
The enrollment-token response must include bootstrap_bundle. Use that bundle directly; do not reconstruct:
- GPUAAS_API_URL
- GPUAAS_TASK_SIGNING_PUBKEYS
- CA trust paths
- cert/key paths
D. Node: install and run node-agent¶
Local-dev-only prerequisite for bridged VMs or external test nodes:
1. install hostname mapping for the environment-reachable bootstrap hosts (for example api.gpuaas.test, registry.gpuaas.test),
2. install the local dev CA into the node trust store before the first bootstrap fetch,
3. verify trust with curl https://api.gpuaas.test/api/v1/healthz,
4. only then continue with the bootstrap script or bundle fetch.
5. for local-dev external nodes, keep the bootstrap fetch URL on https://api.gpuaas.test but set the installed node-agent runtime URL (GPUAAS_API_URL) to http://<host-ip>:8081.
If the node was already enrolled and kind parity config changed later, use the
checked-in reconcile helper instead of editing /etc/hosts or re-running
docker login by hand:
This trust-bootstrap pre-step applies only to local-dev/private-CA environments. Do not require it in environments that present publicly trusted certificates.
Example local-dev CA install from the Mac host:
# On the Mac host
mkcert -CAROOT
scp "$(mkcert -CAROOT)/rootCA.pem" hpcadmin@<node-host-or-ip>:~/mkcert-rootCA.pem
# On the node
sudo cp ~/mkcert-rootCA.pem /usr/local/share/ca-certificates/mkcert-rootCA.crt
sudo update-ca-certificates
curl -v https://api.gpuaas.test/api/v1/healthz
export GPUAAS_NODE_ID="<NODE_ID>"
export GPUAAS_API_URL="<BOOTSTRAP_BUNDLE_API_URL>"
export GPUAAS_ENROLLMENT_TOKEN="<SINGLE_USE_ENROLLMENT_TOKEN>"
export GPUAAS_TASK_SIGNING_PUBKEYS="<BOOTSTRAP_BUNDLE_TASK_SIGNING_PUBKEYS>"
export GPUAAS_CA_BUNDLE_PATH="<BOOTSTRAP_BUNDLE_CA_BUNDLE_PATH>"
export GPUAAS_NODE_CERT_CA_BUNDLE_PATH="<BOOTSTRAP_BUNDLE_NODE_CERT_CA_BUNDLE_PATH>"
export GPUAAS_CERT_PATH="<BOOTSTRAP_BUNDLE_CERT_PATH>"
export GPUAAS_KEY_PATH="<BOOTSTRAP_BUNDLE_KEY_PATH>"
export GPUAAS_NODE_AGENT_INSTALL_ROOT="<BOOTSTRAP_PACKAGE_INSTALL_ROOT>"
export GPUAAS_NODE_AGENT_ENV_FILE="<BOOTSTRAP_PACKAGE_ENV_FILE_PATH>"
export GPUAAS_NODE_AGENT_SYSTEMD_UNIT="<BOOTSTRAP_PACKAGE_SYSTEMD_UNIT_NAME>"
mkdir -p "$HOME/.gpuaas"
cat <<'EOF' > "$GPUAAS_CA_BUNDLE_PATH"
<BOOTSTRAP_BUNDLE_CA_PEM>
EOF
# Pull the bootstrap OCI package by digest using the short-lived pull credential
# delivered in the bootstrap bundle, then run:
sudo /bundle/install-node-agent.sh
Expected logs: - startup line - enrollment success (or explicit enrollment failure reason) - recurring task wait loop without auth errors.
E. Control plane: verify lifecycle and provisioning path¶
- Node appears as
activein admin nodes list. - Create allocation for a test user.
- Verify progression:
requested -> provisioning -> active- Release allocation and verify:
active -> releasing -> released
F. Negative validation¶
Run all before signoff: 1. Replay same enrollment token for first enrollment after success -> rejected unless the request is an already-onboarded node recovery re-enrollment. 2. Expired token -> rejected. 3. Token for wrong node_id -> rejected. 4. Node without valid cert on wait/result endpoints -> rejected.
Bootstrap Procedure¶
- Generate a fresh bootstrap bundle from control plane.
- Verify bundle contains:
GPUAAS_NODE_IDGPUAAS_ENROLLMENT_TOKENGPUAAS_API_URLGPUAAS_TASK_SIGNING_PUBKEYS- CA bundle PEM + fingerprint
GPUAAS_CA_BUNDLE_PATHGPUAAS_NODE_CERT_CA_BUNDLE_PATHGPUAAS_CERT_PATHGPUAAS_KEY_PATHpackage.oci_refpackage.digestpackage.pull_delivery_modepackage.install_rootpackage.env_file_pathpackage.systemd_unit_name- Confirm token policy before delivery:
- node-bound
- TTL-limited
- consume-once
- Deliver bundle via selected onboarding mode.
Mode-Specific Steps¶
Manual Mode¶
- Deliver bundle to operator through approved secure channel.
- Operator installs bundle on target node only.
- Start/restart node agent and watch enrollment status transition.
MaaS Mode¶
- Attach bundle to provider automation/cloud-init payload.
- Validate target node identity in provider workflow.
- Boot node and observe first enrollment attempt.
MaaS Adapter Validation Checklist (real instance)¶
Use this checklist before marking MaaS onboarding production-ready:
1. Confirm API config:
- MAAS_API_BASE_URL points to real MaaS endpoint.
- MAAS_API_TOKEN is present and not logged.
- timeout configured (MAAS_API_TIMEOUT_MS) and reasonable for environment.
2. Submit node create with:
- onboarding_mode: "maas"
- probe: true where networking allows.
3. Verify expected outcomes:
- success path: node create returns 201 and onboarding proceeds.
- unavailable path: adapter missing/unreachable returns canonical service_unavailable.
- provider failure path: prepare hook failure returns canonical service_unavailable.
4. Capture and store incident/debug evidence:
- API error envelope code, message, correlation_id.
- API log line by correlation_id.
5. Validate fallback:
- same node can be created/onboarded with manual mode when MaaS path is unavailable.
Node Lifecycle Transitions (Retire/Reactivate/Remove)¶
Purpose¶
- Define safe operator procedures for
retire,reactivate, andremove. - Keep lifecycle actions consistent with node-agent identity/cert behavior.
Retire (reversible)¶
- Preconditions:
- node has no active/releasing/release_failed allocation ownership conflicts.
- Action:
- call admin lifecycle API to transition node to retired state.
- Expected runtime behavior:
- node-agent may still be online and polling, but control-plane should reject runtime assignment/terminal use for retired node.
- cert renew can continue for identity continuity unless node is removed.
- Rollback path:
- use
reactivateif retire was premature.
Reactivate (reversible)¶
- Preconditions:
- node is currently
retired. - Action:
- call admin lifecycle API to reactivate same
node_id. - Expected runtime behavior:
- node returns to schedulable/assignable state if health checks pass.
- existing node-agent identity remains valid.
- Validation:
- verify node appears active in admin nodes list.
- verify new allocation can target/select node.
Remove (terminal/irreversible)¶
- Preconditions:
- node is
retired. - no allocations bound to node.
- Action:
- call admin lifecycle API to permanently remove node.
- Expected runtime behavior:
- node-agent polling and cert-renew paths are denied for removed identity.
- node must be re-added as a new node identity if reintroduced later.
- Rollback:
- no direct rollback; re-onboard as a new node with new
node_idand token bundle.
Observability Signatures for Lifecycle Incidents¶
- Capture API error envelope fields:
code,message,correlation_id- Query API logs by correlation:
{service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"- If agent-side behavior mismatch is suspected, query node-agent logs:
- filter on
node_id, lifecycle action, and cert renew/poll errors. - Expected signal examples:
- retire denied due to in-use node/allocation state (
node_in_useclass) - remove denied due to lifecycle precondition (
invalid_requestclass) - agent polling rejected after remove (identity no longer valid)
Post-Retire Decision Path¶
- If host is healthy and intended for reuse:
reactivate.- If host is being decommissioned/reimaged/transferred:
remove.- If host should remain retained but unschedulable:
- keep
retiredand document ownership/intent in ops ticket.
Expiry/Reissue Path¶
- If token is expired or already consumed, do not reuse prior bundle.
- Reissue a new bootstrap bundle for the same node.
- Invalidate previous onboarding artifact and audit the reissue event.
- Repeat onboarding from bootstrap procedure.
Expired Cert Recovery¶
Use this when a node certificate is present but already expired at node-agent startup.
1. Confirm node has a valid GPUAAS_ENROLLMENT_TOKEN in its current bootstrap bundle. On enrolled nodes this should be a node-bound recovery token, not only the consumed first-boot token.
2. Restart node-agent and verify logs show the expired cert recovery enrollment path.
Current agents also trigger this path from task polling before sending an mTLS
request with an expired local certificate.
3. If no token is available, reissue a node-bound single-use token and restart node-agent.
4. Validate node returns to active polling path after fresh cert issuance.
Security Controls¶
- Never use shared/static internal enrollment credentials.
- Enforce single-use enrollment token semantics on every first enrollment.
- Do not expose CA private endpoints (
step-ca) to nodes. - Limit node onboarding network path to
api.internal. - Record onboarding, failure, and reissue actions in audit trail.
SSH key management operational controls¶
- SSH key mutations must be auditable (actor, target, result, correlation_id).
- Default-key and allocation key-set overrides require ownership/admin authorization.
- Key-set rollback path:
- restore prior default key assignment and prior allocation key-set mapping.
- verify next provisioning/reconnect uses restored key set.
- Rotation guidance:
- add new key, validate access, then demote/remove old key.
- avoid destructive key removal before validation window completes.
- Alert/runbook linkage:
- SSH key management anomalies map to this runbook and
ops.node.onboarding.
SKU catalog governance and node-registration controls¶
- Node registration must reference an active SKU catalog entry; free-text SKU values are not permitted.
- Validate catalog-to-node consistency before onboarding:
- node
sku_idresolves to active SKU metadata. - node GPU capacity/region attributes match SKU catalog constraints.
- Invalid SKU reference response workflow:
- reject registration/update request and capture
correlation_id. - file ops incident note with failing
node_id,sku_id, and API error envelope. - route to SKU catalog owner for correction before onboarding retry.
- Ownership and approval policy:
- SKU catalog lifecycle changes require dual approval from Platform Ops + Inventory owner.
- deactivation of an in-use SKU requires impact review and rollback plan.
- Alert/runbook linkage:
- invalid SKU reference alerts map to this runbook and SKU catalog incident workflow.
Terminal least-privilege and sudoers controls¶
- Enforce terminal least-privilege boundary:
- terminal session process runs as allocation user only.
- terminal path must not execute lifecycle/provisioning privileged operations.
- Validate sudo policy on node:
- use explicit command allowlist in
sudoers. - do not allow blanket
NOPASSWD: ALLpolicy for runtime terminal operations. - During onboarding verification, record sudoers policy checksum/location and reviewer.
- On policy drift or privilege escalation findings, quarantine node until corrected.
Validation¶
- Node status moves from onboarding state to enrolled/active.
- Enrollment token cannot be replayed after success.
- Reissued token works only for intended node.
- Allocation lifecycle transitions are workflow-driven and complete end-to-end on the onboarded node.