Node Agent OCI Distribution v1
As of: March 10, 2026
Purpose
Define the next executable delivery model for node-agent and related operator-facing runtime artifacts.
The goal is to stop treating node-agent delivery as a manual scp path and instead align it with the same platform-controlled artifact model used for app artifacts.
Problem Statement
The current node path is operationally workable but still relies on:
1. manual binary copy,
2. manual startup script management,
3. manual service installation,
4. manual operator reconstruction when new nodes are added.
That is not acceptable as the long-term bootstrap model for:
1. node-agent itself,
2. future runtime/operator bundles,
3. app rollout that depends on node-side operator work.
Core Model
The platform should publish a node bootstrap package as an OCI artifact.
That package should contain:
1. gpuaas-node-agent binary,
2. systemd unit template,
3. environment file template,
4. current control-plane CA bundle,
5. current node task-signing verifier set,
6. install/update script,
7. package metadata version.
The control plane should then issue a bootstrap response that includes:
1. OCI artifact reference,
2. digest,
3. short-lived pull credential delivery reference,
4. trust bundle metadata,
5. node-specific runtime values.
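The five response fields above could be modeled as a small typed record. This is a minimal sketch only: the field names other than `oci_ref` and `digest` are hypothetical, and the real wire format of the bootstrap response is not specified here.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical field names; only oci_ref and digest appear in this document.
REQUIRED_FIELDS = (
    "oci_ref",
    "digest",
    "pull_credential_ref",
    "trust_bundle_metadata",
    "node_runtime_values",
)


@dataclass(frozen=True)
class BootstrapResponse:
    oci_ref: str                           # OCI artifact reference
    digest: str                            # expected digest, e.g. "sha256:<hex>"
    pull_credential_ref: str               # where to obtain the short-lived pull credential
    trust_bundle_metadata: Dict[str, str]  # describes the trust bundle in use
    node_runtime_values: Dict[str, str]    # node-specific runtime values

    def pinned_reference(self) -> str:
        """Return the reference with the digest attached, so pulls are digest-pinned."""
        return f"{self.oci_ref}@{self.digest}"


def parse_bootstrap_response(payload: dict) -> BootstrapResponse:
    """Reject responses that omit any of the five required fields."""
    missing = [key for key in REQUIRED_FIELDS if key not in payload]
    if missing:
        raise ValueError(f"bootstrap response missing fields: {missing}")
    return BootstrapResponse(**{key: payload[key] for key in REQUIRED_FIELDS})
```

Failing closed on a missing field keeps a partially populated response from being treated as installable.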
Trust Model
OCI is the transport, not the root of trust.
The initial trust anchor still comes from the control plane bootstrap bundle. The node must not need direct Vault access to trust the bootstrap artifact.
Required properties:
1. bootstrap bundle remains authoritative for first trust,
2. bootstrap package is digest-pinned,
3. pull credentials are short-lived and platform-scoped,
4. install script verifies the expected digest before use.
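The digest check in the last property could look roughly like the sketch below. It assumes the package has already been pulled to a single local file and that the expected digest uses the `sha256:<hex>` form; the function names are illustrative and not part of the actual install script.

```python
import hashlib
import hmac


def sha256_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks and return an OCI-style 'sha256:<hex>' digest string."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return "sha256:" + hasher.hexdigest()


def verify_digest(path: str, expected: str) -> None:
    """Raise ValueError unless the file's digest matches the pinned value."""
    actual = sha256_digest(path)
    if not hmac.compare_digest(actual, expected):
        raise ValueError(f"digest mismatch: expected {expected}, got {actual}")
```

The expected value must come from the control-plane bootstrap response, not from the registry, so that the registry stays a transport rather than a trust anchor.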
Execution Phases
1. OCI bootstrap package
Make the control plane return a first-class node bootstrap package reference rather than only env values.
Implemented baseline:
1. enrollment-token response now carries package metadata (oci_ref, digest, pull delivery mode, install paths),
2. bootstrap OCI package build assets live under build/node-agent-bootstrap/,
3. make build-node-agent-bootstrap builds the local package image.
2. Release/update path
Move node-agent release/install/update to:
1. package pull,
2. verified install,
3. systemd-managed service lifecycle.
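One way to picture the "verified install" step is a staged write followed by an atomic rename, so a failed or tampered update never leaves a half-written binary in place. This is a sketch under assumptions: the package payload is already in memory, the function name is hypothetical, and the systemd restart that would follow is deliberately left out.

```python
import hashlib
import os
import tempfile


def verified_install(payload: bytes, expected_digest: str, dest: str, mode: int = 0o755) -> None:
    """Verify the payload digest, stage it next to dest, then atomically swap it in."""
    actual = "sha256:" + hashlib.sha256(payload).hexdigest()
    if actual != expected_digest:
        raise ValueError(f"digest mismatch: expected {expected_digest}, got {actual}")

    # Stage in the destination directory so os.replace stays on one filesystem.
    dest_dir = os.path.dirname(os.path.abspath(dest))
    fd, staged = tempfile.mkstemp(dir=dest_dir, prefix=".staged-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        os.chmod(staged, mode)
        os.replace(staged, dest)  # atomic on POSIX when source and target share a filesystem
    except BaseException:
        if os.path.exists(staged):
            os.unlink(staged)
        raise
```

Because verification happens before anything is written, the previously installed binary stays untouched on a digest mismatch.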
Execution modes for this delivery model:
1. reimage
- current proven automated path via MAAS/cloud-init/bootstrap
2. manual_install
- operator downloads verified package/bundle and installs it manually
3. rebootstrap
- control plane performs in-place repair or upgrade over SSH using a platform-managed access credential reference
The OCI package is the common artifact building block across all three modes. The modes differ in transport and execution path, not in package identity.
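To illustrate that the mode changes only the execution path, the sketch below maps each of the three modes to a transport description while carrying the same package reference through unchanged. The `DeliveryPlan` type and `plan_delivery` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class DeliveryMode(Enum):
    REIMAGE = "reimage"
    MANUAL_INSTALL = "manual_install"
    REBOOTSTRAP = "rebootstrap"


# Transport descriptions paraphrase the mode definitions in this document.
TRANSPORT = {
    DeliveryMode.REIMAGE: "MAAS/cloud-init/bootstrap",
    DeliveryMode.MANUAL_INSTALL: "operator-run manual install",
    DeliveryMode.REBOOTSTRAP: "SSH with platform-managed access credential reference",
}


@dataclass(frozen=True)
class DeliveryPlan:
    mode: DeliveryMode
    package_ref: str  # identical digest-pinned reference in every mode
    transport: str


def plan_delivery(mode: str, package_ref: str) -> DeliveryPlan:
    """Build a plan for a known mode; DeliveryMode(mode) raises ValueError for unknown modes."""
    selected = DeliveryMode(mode)
    return DeliveryPlan(selected, package_ref, TRANSPORT[selected])
```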
3. Runtime operator primitives
Once bootstrap and service lifecycle are stable, expand node-agent with bounded operator primitives for app/runtime rollout:
1. artifact fetch,
2. staged install/update,
3. service unit materialization,
4. typed status reporting.
Implemented first primitives:
1. runtime.write_env_file
2. runtime.install_service_unit
3. runtime.service_control
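A typed dispatch table is one way to keep these primitives bounded: only registered names are accepted and each handler validates its own parameters. The sketch below is illustrative only. The parameter names (`path`, `values`, `unit_name`, `action`), the allowed actions, and the task shape are assumptions, and the handlers validate and describe the work without touching the system.

```python
from typing import Callable, Dict

Handler = Callable[[dict], dict]
_HANDLERS: Dict[str, Handler] = {}


def primitive(name: str) -> Callable[[Handler], Handler]:
    """Register a handler under a fixed primitive name."""
    def register(fn: Handler) -> Handler:
        _HANDLERS[name] = fn
        return fn
    return register


@primitive("runtime.write_env_file")
def write_env_file(params: dict) -> dict:
    path, values = params["path"], params["values"]
    if not isinstance(path, str) or not isinstance(values, dict):
        raise TypeError("path must be str and values must be dict")
    content = "".join(f"{key}={value}\n" for key, value in sorted(values.items()))
    return {"primitive": "runtime.write_env_file", "path": path, "content": content}


@primitive("runtime.install_service_unit")
def install_service_unit(params: dict) -> dict:
    unit_name = params["unit_name"]
    if not unit_name.endswith(".service"):
        raise ValueError("unit_name must end with .service")
    return {"primitive": "runtime.install_service_unit", "unit_name": unit_name}


ALLOWED_ACTIONS = frozenset({"start", "stop", "restart"})  # assumed action set


@primitive("runtime.service_control")
def service_control(params: dict) -> dict:
    action = params["action"]
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return {"primitive": "runtime.service_control",
            "unit_name": params["unit_name"], "action": action}


def dispatch(task: dict) -> dict:
    """Run a registered primitive; anything else is rejected rather than executed."""
    handler = _HANDLERS.get(task.get("type"))
    if handler is None:
        raise ValueError(f"unsupported primitive: {task.get('type')!r}")
    return handler(task.get("params", {}))
```

Rejecting unregistered task types at the dispatcher is what keeps this from degrading into a generic command runner.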
4. Runtime adapter
Only after the above should scheduler/runtime adapters like Slurm depend on the node path.
Non-Negotiable Invariants
- node-agent is not a generic remote shell,
- node-side operator work remains typed and auditable,
- first trust is still brokered by the control plane,
- runtime/operator bundles must use digest-pinned artifact references,
- Slurm and later adapters must consume the same node-side install/update model, not invent their own.
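The digest-pinning invariant can be enforced mechanically by refusing any reference that does not end in a digest. The check below is a simplified sketch: it accepts only `sha256` digests (the OCI specifications permit other algorithms) and does not validate the repository-name grammar. The function name is hypothetical.

```python
import re
from typing import Tuple

# name[:tag]@sha256:<64 lowercase hex chars>; tag-only references do not match.
_PINNED_REF = re.compile(r"^(?P<name>[^@\s]+)@(?P<digest>sha256:[0-9a-f]{64})$")


def require_digest_pinned(ref: str) -> Tuple[str, str]:
    """Return (name, digest) for a digest-pinned reference, else raise ValueError."""
    match = _PINNED_REF.match(ref)
    if match is None:
        raise ValueError(f"artifact reference is not digest-pinned: {ref!r}")
    return match["name"], match["digest"]
```

A tag such as `:latest` can be repointed after the fact, so a tag alone does not satisfy the invariant.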
Immediate Execution Order
1. A-NODE-AGENT-OCI-BOOTSTRAP-001
2. A-NODE-AGENT-RELEASE-UPDATE-PATH-001
3. A-NODE-AGENT-RUNTIME-OPERATOR-PRIMITIVES-001
4. A-APP-RUNTIME-ADAPTER-SLURM-001
Related Docs
- doc/architecture/Node_Bootstrap_Trust_Delivery_v1.md
- doc/architecture/Platform_Signing_and_Bootstrap_Trust_v1.md
- doc/architecture/App_Platform_OCI_Registry_Baseline_v1.md
- doc/architecture/Node_Agent_Spec.md
- doc/architecture/Node_Operations_and_Agent_Lifecycle_v1.md