Node Agent OCI Distribution v1

As of: March 10, 2026

Purpose

Define the next executable delivery model for node-agent and related operator-facing runtime artifacts.

The goal is to stop treating node-agent delivery as a manual scp path and instead align it with the same platform-controlled artifact model used for app artifacts.

Problem Statement

The current node path is operationally workable but still relies on:

  1. manual binary copy,
  2. manual startup script management,
  3. manual service installation,
  4. manual operator reconstruction when new nodes are added.

That is not acceptable as the long-term bootstrap model for:

  1. node-agent itself,
  2. future runtime/operator bundles,
  3. app rollout that depends on node-side operator work.

Core Model

The platform should publish a node bootstrap package as an OCI artifact.

That package should contain:

  1. gpuaas-node-agent binary,
  2. systemd unit template,
  3. environment file template,
  4. current control-plane CA bundle,
  5. current node task-signing verifier set,
  6. install/update script,
  7. package metadata version.

The control plane should then issue a bootstrap response that includes:

  1. OCI artifact reference,
  2. digest,
  3. short-lived pull credential delivery reference,
  4. trust bundle metadata,
  5. node-specific runtime values.
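
The bootstrap response fields listed above could be modeled roughly as follows. This is a minimal sketch, not the actual wire format: the class name, field names, and the `validate` helper are hypothetical, and the digest is assumed to use the common `sha256:<hex>` form.

```python
from dataclasses import dataclass, field

_HEX = set("0123456789abcdef")


@dataclass(frozen=True)
class BootstrapResponse:
    """Hypothetical shape of the control-plane bootstrap response."""

    oci_ref: str                 # OCI artifact reference
    digest: str                  # expected digest, assumed "sha256:<64 hex chars>"
    pull_credential_ref: str     # reference to a short-lived pull credential
    trust_bundle: dict = field(default_factory=dict)    # trust bundle metadata
    runtime_values: dict = field(default_factory=dict)  # node-specific runtime values

    def validate(self) -> None:
        """Raise ValueError if a required field is missing or the digest is malformed."""
        if not self.oci_ref:
            raise ValueError("oci_ref is required")
        if not self.pull_credential_ref:
            raise ValueError("pull_credential_ref is required")
        algo, _, hexpart = self.digest.partition(":")
        if algo != "sha256" or len(hexpart) != 64 or not set(hexpart) <= _HEX:
            raise ValueError("digest must look like sha256:<64 lowercase hex chars>")
```

Validating the response before any pull is attempted keeps a malformed or partial bootstrap response from reaching the install step.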

Trust Model

OCI is the transport, not the root of trust.

The initial trust anchor still comes from the control plane bootstrap bundle. The node must not need direct Vault access to trust the bootstrap artifact.

Required properties:

  1. bootstrap bundle remains authoritative for first trust,
  2. bootstrap package is digest-pinned,
  3. pull credentials are short-lived and platform-scoped,
  4. install script verifies the expected digest before use.
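
The "verifies the expected digest before use" property can be illustrated with a short check. This is a sketch only: the function name `verify_digest` is hypothetical, SHA-256 is assumed as the digest algorithm, and the package is assumed to be available as bytes after the pull.

```python
import hashlib


def verify_digest(blob: bytes, expected: str) -> None:
    """Raise ValueError unless sha256(blob) matches the pinned digest.

    `expected` is assumed to be of the form "sha256:<hex>".
    """
    algo, _, want = expected.partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algo!r}")
    got = hashlib.sha256(blob).hexdigest()
    if got != want.lower():
        raise ValueError(f"digest mismatch: expected {want}, got {got}")
```

The expected digest should come from the control-plane bootstrap response, not from the registry that served the blob; otherwise the check only confirms that the registry agrees with itself.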

Execution Phases

1. OCI bootstrap package

Make the control plane return a first-class node bootstrap package reference rather than only env values.

Implemented baseline:

  1. enrollment-token response now carries package metadata (oci_ref, digest, pull delivery mode, install paths),
  2. bootstrap OCI package build assets live under build/node-agent-bootstrap/,
  3. make build-node-agent-bootstrap builds the local package image.

2. Release/update path

Move node-agent release/install/update to:

  1. package pull,
  2. verified install,
  3. systemd-managed service lifecycle.
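
One way the "verified install" step could work is to check the digest first and then swap the binary into place atomically, so a failed or tampered update never leaves a half-written file. The sketch below is an illustration under assumptions: `install_verified` and the `.staged-` prefix are hypothetical, and the package pull and the systemd restart are left out.

```python
import hashlib
import os
import tempfile


def install_verified(blob: bytes, expected_sha256_hex: str, dest: str) -> None:
    """Write blob to dest only if its SHA-256 matches; the final swap is atomic."""
    if hashlib.sha256(blob).hexdigest() != expected_sha256_hex:
        raise ValueError("digest mismatch; refusing to install")
    dest_dir = os.path.dirname(os.path.abspath(dest))
    # Stage inside the destination directory so os.replace stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=dest_dir, prefix=".staged-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(blob)
            f.flush()
            os.fsync(f.fileno())
        os.chmod(tmp, 0o755)
        os.replace(tmp, dest)  # atomically replaces any previous file at dest
    except BaseException:
        # Remove the staged file if anything failed before the swap.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Because verification happens before staging, a digest mismatch leaves the previously installed binary untouched.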

Execution modes for this delivery model:

  1. reimage - current proven automated path via MAAS/cloud-init/bootstrap,
  2. manual_install - operator downloads verified package/bundle and installs it manually,
  3. rebootstrap - control plane performs in-place repair or upgrade over SSH using a platform-managed access credential reference.

The OCI package is a common artifact building block across all three modes. The difference is the transport/execution path, not the package identity.
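
The point that only the transport differs between modes can be sketched as follows. The enum values mirror the mode names above; `plan_delivery` and the transport labels are hypothetical and exist only for illustration.

```python
from enum import Enum


class ExecutionMode(Enum):
    REIMAGE = "reimage"
    MANUAL_INSTALL = "manual_install"
    REBOOTSTRAP = "rebootstrap"


# Hypothetical transport labels; only these vary per mode.
_TRANSPORTS = {
    ExecutionMode.REIMAGE: "maas/cloud-init/bootstrap",
    ExecutionMode.MANUAL_INSTALL: "operator download",
    ExecutionMode.REBOOTSTRAP: "control-plane ssh",
}


def plan_delivery(mode: ExecutionMode, oci_ref: str, digest: str) -> dict:
    """Return a delivery plan; the package identity is the same for every mode."""
    return {"package": f"{oci_ref}@{digest}", "transport": _TRANSPORTS[mode]}
```

Keeping the package identity out of the per-mode table makes it harder for one mode to drift onto a different artifact than the others.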

3. Runtime operator primitives

Once bootstrap and service lifecycle are stable, expand node-agent with bounded operator primitives for app/runtime rollout:

  1. artifact fetch,
  2. staged install/update,
  3. service unit materialization,
  4. typed status reporting.

Implemented first primitives:

  1. runtime.write_env_file,
  2. runtime.install_service_unit,
  3. runtime.service_control.
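
A bounded, typed primitive set might be dispatched along these lines. The three task type names come from the list above; everything else is assumed for illustration, including the parameter names (`path`, `vars`, `unit`, `action`), the allowed actions, and the `TaskResult` shape. The handlers only describe what they would do and do not touch the system.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical typed status report for one task."""
    task_type: str
    ok: bool
    detail: str


def write_env_file(params: dict) -> str:
    return f"would write {len(params['vars'])} vars to {params['path']}"


def install_service_unit(params: dict) -> str:
    return f"would install unit {params['unit']}"


def service_control(params: dict) -> str:
    action = params["action"]
    if action not in {"start", "stop", "restart"}:  # assumed action set
        raise ValueError(f"unsupported action {action!r}")
    return f"would {action} {params['unit']}"


# Closed allowlist: anything not named here is rejected, never executed.
HANDLERS: Dict[str, Callable[[dict], str]] = {
    "runtime.write_env_file": write_env_file,
    "runtime.install_service_unit": install_service_unit,
    "runtime.service_control": service_control,
}


def dispatch(task_type: str, params: dict) -> TaskResult:
    handler = HANDLERS.get(task_type)
    if handler is None:
        return TaskResult(task_type, False, "unknown task type")
    try:
        return TaskResult(task_type, True, handler(params))
    except (KeyError, ValueError) as exc:
        return TaskResult(task_type, False, f"invalid params: {exc}")
```

With this shape, a request such as `dispatch("shell.exec", {"cmd": "ls"})` returns a failed `TaskResult` instead of running anything, since there is no handler for arbitrary commands.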

4. Runtime adapter

Only after the above should scheduler/runtime adapters like Slurm depend on the node path.

Non-Negotiable Invariants

  1. node-agent is not a generic remote shell,
  2. node-side operator work remains typed and auditable,
  3. first trust is still brokered by the control plane,
  4. runtime/operator bundles must use digest-pinned artifact references,
  5. Slurm and later adapters must consume the same node-side install/update model, not invent their own.
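
The digest-pinning invariant can be enforced mechanically by rejecting any artifact reference that is not of the form `name@sha256:<hex>`. A minimal sketch, assuming SHA-256 digests and a hypothetical helper name `require_digest_pinned`:

```python
import re

# A pinned reference ends in "@sha256:" followed by 64 lowercase hex characters.
_PINNED = re.compile(r"(?P<name>[^@\s]+)@sha256:(?P<hex>[0-9a-f]{64})")


def require_digest_pinned(ref: str) -> str:
    """Return the hex digest of a pinned reference, or raise ValueError."""
    m = _PINNED.fullmatch(ref)
    if m is None:
        raise ValueError(f"reference is not digest-pinned: {ref!r}")
    return m.group("hex")
```

A tag-only reference such as `registry.example/bundle:latest` is rejected, because a tag can be moved to different content after the reference was issued.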

Immediate Execution Order

  1. A-NODE-AGENT-OCI-BOOTSTRAP-001
  2. A-NODE-AGENT-RELEASE-UPDATE-PATH-001
  3. A-NODE-AGENT-RUNTIME-OPERATOR-PRIMITIVES-001
  4. A-APP-RUNTIME-ADAPTER-SLURM-001

Related Documents

  1. doc/architecture/Node_Bootstrap_Trust_Delivery_v1.md
  2. doc/architecture/Platform_Signing_and_Bootstrap_Trust_v1.md
  3. doc/architecture/App_Platform_OCI_Registry_Baseline_v1.md
  4. doc/architecture/Node_Agent_Spec.md
  5. doc/architecture/Node_Operations_and_Agent_Lifecycle_v1.md