Node Operations and Agent Lifecycle v1¶
1. Purpose¶
Define a generic, extensible operations framework for node bootstrap and day-2 lifecycle actions without adding one-off APIs per platform/service.
This document covers:
- configuration-driven operation composition from reusable building blocks,
- bootstrap profiles for K8s/Slurm/Ray and BYOM-style runtime setup,
- integration with existing provisioning/release flows already in production paths,
- explicit deprovision/reclaim operations,
- tenant/project-scoped execution lifecycle,
- node-agent lifecycle (upgrade/rollback/capability gating).
2. Design Principles¶
- One generic operations API, many operation types.
- No raw shell passthrough from control plane.
- Operations are declarative and versioned (operation_type@version).
- Control plane decides; node data plane executes.
- Artifact supply chain is digest-pinned and signed.
- Policy scope chain: global -> tenant -> project (most specific wins, global hard-deny always applies).
- Every mutation and execution path is auditable and traceable (correlation_id, resource_name, trace_id).
- Building blocks are configuration-driven and reusable; operation profiles compose blocks rather than embedding bespoke scripts.
- Hardware/vendor differences are resolved in agent-side step resolver configuration, not in public API contracts.
3. Control vs Data Plane¶
- Control plane responsibilities:
  - register/deprecate operation definitions,
  - enforce authz/policy,
  - schedule tasks/events,
  - mint short-lived task-bound credentials.
- Data plane (node agent) responsibilities:
  - execute assigned tasks only,
  - pull/verify artifacts,
  - run allowlisted steps,
  - report step status and final results.
This split is intentional and does not violate queue-first execution. Credential minting is allowed only as a task-bound sub-step.
Task authenticity and transport trust are mandatory and inherited from:
- doc/architecture/Node_Agent_Spec.md (signed task catalog and node pull model),
- doc/architecture/PKI_Spec.md (mTLS node identity and renewal lifecycle).
3.1 Lifecycle vs terminal execution seam¶
The node agent currently hosts two different execution domains:
- Lifecycle/task execution
  - enroll, renew, poll tasks, verify signatures, execute bounded handlers, report typed results
  - request/response and queue-oriented
  - deterministic, retryable, auditable
  - host-mutation surface exists but is strictly allowlisted
- Interactive terminal execution
  - session bootstrap via terminal.open
  - long-lived bidirectional stream relay after bootstrap
  - PTY/session lifecycle, close semantics, stream transport failures
  - no host-mutation surface beyond user impersonation into an already-provisioned allocation user
These two domains are intentionally kept in one binary today for bootstrap simplicity, but they are not the same operational concern and must remain separable.
Current architectural rule:
- lifecycle execution remains the authority for node mutation and typed task completion,
- terminal execution remains a distinct interactive session path with its own transport, metrics, and failure handling.
Future split allowance: if operational pressure justifies it, terminal execution may move into a separate process/component without changing:
- signed task bootstrap semantics,
- node identity/mTLS model,
- task poll/request-reply lifecycle contracts.
This document treats that as a supported split seam, not an implementation accident.
4. Generic Operation Model¶
4.1 Operation Definition (platform-owned)¶
- Identity: operation_type, version, backend_type.
- Input schema: JSON schema for validated inputs.
- Step plan: ordered list of building blocks.
- Artifact constraints: allowed repo prefixes, digest/signature requirements.
- Capability requirements: minimum agent capabilities.
4.2 Operation Execution (tenant/project-scoped)¶
- Execution instance fields: operation_id, operation_type, version, org_id, project_id, requested_by_user_id, target_selector (node/allocation/group), idempotency_key, correlation_id, status (queued|running|succeeded|failed|canceled).
Target authorization rule (required):
- target_selector must resolve to resources owned by the same org_id and project_id as the operation request context.
- Server-side validation is required before task dispatch; client hints are never trusted.
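Illustrative execution record shape for these fields (all values are hypothetical placeholders, not a binding schema):

operation_id: 4f0c2e1a-0000-0000-0000-000000000000   # hypothetical UUID
operation_type: gpu.host.bootstrap
version: v1
org_id: org-example
project_id: proj-example
requested_by_user_id: user-example
target_selector:
  kind: node
  node_id: node-017        # must resolve to the same org/project as the request context
idempotency_key: gpu-host-bootstrap-node-017-001
correlation_id: corr-example-0001
status: queued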
4.3 Building Blocks (reusable step kinds)¶
Required v1 step kinds:
1. artifact.pull_oci (OCI image/artifact, digest required)
2. artifact.pull_blob (S3/GCS/HF/blob source with resumable transfer)
3. artifact.verify (digest and optional signature/attestation)
4. credential.mint (typed short-lived credentials)
5. package.install (allowlisted OS packages)
6. python.env.create (venv/conda profile with pinned lock input and index/channel allowlist)
7. file.render (templated config with owner/mode)
8. service.control (start|stop|restart|enable|disable)
9. command.exec_allowlisted (fixed command profile, validated args)
10. gpu.runtime.check (driver/CUDA/ROCm/toolkit compatibility checks)
11. network.fabric.verify (IB/RoCE/NCCL/RCCL readiness checks)
12. health.check (structured checks)
13. cluster.join (backend-specific adapter step)
14. cluster.verify (backend-specific readiness check)
No free-form command.exec is allowed.
python.env.create source constraints (required):
- step input must declare package sources (PyPI index/extra indexes or conda channels),
- sources must be allowlisted by policy (allowed_artifact_sources / tenant-project overlays),
- lockfile/hash pinning is mandatory for reproducible environments.
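Illustrative python.env.create step input satisfying these constraints (index URL, lock artifact, and paths are hypothetical):

step_kind: python.env.create
inputs:
  env_kind: venv                      # venv/conda profile
  python_version: "3.11"
  lock_input:                         # pinned lock input (mandatory)
    artifact: registry.example/locks/train-env@sha256:aaaa...   # digest-pinned; hypothetical repo
    format: pip-requirements-lock
  sources:                            # must appear in allowed_artifact_sources policy
    pypi_index: https://pypi.internal.example/simple
    extra_indexes: []
  target_path: /opt/envs/train-env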
4.4 command.exec_allowlisted formal contract¶
command.exec_allowlisted is high-risk and must follow these rules:
1. Command profiles are platform-defined in operation definitions (not tenant-defined).
2. Profile registration/update is platform-admin only.
3. A profile contains:
- fixed executable path,
- fixed base args,
- typed arg schema for allowed dynamic fields,
- explicit env allowlist,
- execution user (root or named service user),
- timeout and retry limits.
4. Dynamic args may only come from validated operation inputs and must be schema-constrained (no shell interpolation).
5. Agent executes with execve argv form; shell invocation (sh -c) is forbidden for this step kind.
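Illustrative profile shape under this contract (the profile id, executable, and exact key spellings are hypothetical; field meanings follow rules 3-5 above):

id: node_label_apply                   # hypothetical platform-registered profile
command: /usr/local/bin/node-label     # fixed executable path (hypothetical tool)
base_args: [--apply]                   # fixed base args
arg_schema:                            # typed schema for allowed dynamic fields
  type: object
  properties:
    label_key:   { type: string, pattern: "^[a-z0-9./-]+$" }
    label_value: { type: string, maxLength: 63 }
  required: [label_key, label_value]
  additionalProperties: false          # no shell interpolation, no extra fields
env_allowlist: []                      # explicit env allowlist
run_as: root                           # execution user
timeout_seconds: 60
retry_limit: 1
# executed via execve argv form; sh -c is forbidden for this step kind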
4.5 credential.mint formal contract¶
credential.mint is high-risk and must follow these rules:
1. Mint requests are task-bound and include node_id, task_id, operation_id, step_id, credential_type, and scoped intent.
2. Control plane validates node-task binding and operation scope before issuing any secret material.
3. Returned secret material is short-lived and must be consumed only by subsequent typed steps in the same task execution context.
4. Secret propagation is forbidden via raw command arguments; pass by structured in-memory handle or explicitly allowlisted env mapping only.
5. Agent must wipe credential material after step/task completion or failure.
6. Consumed-by-use credentials (for example cluster_join_token) satisfy wipe requirements by successful one-time consumption; durable tokens/keys require explicit wipe.
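Illustrative task-bound mint exchange following rules 1-3 (field values hypothetical; the wire format is not prescribed here):

request:
  node_id: node-017
  task_id: task-0042
  operation_id: op-0007
  step_id: step-03
  credential_type: s3_read
  scoped_intent:
    bucket: models-staging            # hypothetical scope
    prefix: checkpoints/
response:
  credential_type: s3_read
  expires_at: "2025-01-01T00:15:00Z"  # short-lived
  delivery: in_memory_handle          # never raw command arguments
  # material is wiped after step/task completion or failure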
4.6 Hardware-aware abstract step kinds¶
These remain hardware-agnostic at API/definition level and are resolved by agent configuration:
- gpu.driver.validate
- gpu.runtime.check
- gpu.fabric.verify
- gpu.container_runtime.check
4.7 Step resolver layer (agent-side)¶
Resolution path:
1. Control plane dispatches abstract step kind (for example gpu.runtime.check).
2. Agent loads local resolver config.
3. Agent matches hardware capability rules (first match wins, fallback required).
4. Agent resolves to a concrete command profile and executes.
This keeps control-plane contracts stable across NVIDIA/AMD/Intel or other runtime variants.
Resolver config baseline (illustrative):
hardware_profiles:
  - match: { gpu.vendor: nvidia }
    overrides:
      gpu.runtime.check: nvidia_smi_health
      gpu.driver.validate: nvidia_driver_validate
      gpu.fabric.verify: nccl_allreduce_test
  - match: { gpu.vendor: amd }
    overrides:
      gpu.runtime.check: rocm_smi_health
      gpu.driver.validate: rocm_driver_validate
      gpu.fabric.verify: rccl_allreduce_test
  - match: {}
    overrides:
      gpu.runtime.check: generic_gpu_runtime_check
Command profile schema (resolved target):
id: nvidia_smi_health
command: /usr/bin/nvidia-smi
allowed_args:
  - --query-gpu=utilization.gpu,memory.used,memory.total
  - --format=csv,noheader
arg_schema: null
env_passthrough: []
expected_exit_codes: [0]
timeout_seconds: 30
capture_stdout: true
capture_stderr: true
5. Backend Types¶
backend_type in operation definitions:
- node_agent (host/bootstrap operations),
- k8s_adapter (cluster workload operations),
- slurm_adapter,
- ray_adapter.
Node agent is required for bootstrap/join and host-level lifecycle. Workload lifecycle can later run through native backend adapters.
6. Bootstrap Profile Composition (config-driven)¶
Profiles are assembled from building blocks; each profile has small validated inputs.
6.1 k8s.node.bootstrap@v1¶
- install runtime prereqs
- render kubelet/container runtime config
- join cluster
- verify node readiness and required labels/taints
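Illustrative composition of this profile from the v1 building blocks in 4.3 (step inputs elided; the exact step list is definition-owned and this ordering is an assumption):

operation_type: k8s.node.bootstrap
version: v1
backend_type: node_agent
steps:
  - package.install            # runtime prereqs (allowlisted)
  - file.render                # kubelet/container runtime config
  - credential.mint            # cluster_join_token, consumed-by-use
  - cluster.join               # backend-specific adapter step
  - cluster.verify             # node readiness + required labels/taints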
6.2 slurm.node.bootstrap@v1¶
- install slurmd + auth deps
- render slurm.conf + auth material paths
- join/register node
- verify node state in scheduler
6.3 ray.node.bootstrap@v1¶
- install ray runtime deps
- render ray node config
- join head/control plane
- verify node visibility via ray API
6.4 byom.runtime.prepare@v1¶
This profile is split into three operation profiles to avoid mixed failure domains:
1. gpu.host.bootstrap@v1
   - install driver/runtime prerequisites
   - run gpu.runtime.check
2. model.artifact.stage@v1
   - mint source credentials
   - pull/stage artifacts with resume support
   - verify digest/signature
3. runtime.readiness.verify@v1
   - run workload-specific readiness probes
   - record benchmark/smoke outputs as artifacts
6.5 Existing allocation provisioning compatibility (required)¶
This model must cover today’s allocation lifecycle without introducing a parallel path:
- Provisioning transition (requested -> provisioning -> active) dispatches operation profiles
using the same typed task framework.
- Release transition (active -> releasing -> released) dispatches deprovision/reclaim profiles.
- Existing orchestrator state machine remains source of truth; operation runner is the execution substrate.
6.6 Deprovision/reclaim profiles (required)¶
allocation.runtime.deprovision@v1
- stop workload services/processes,
- unmount/revoke runtime resources,
- remove ephemeral credentials/tokens/secrets.

allocation.identity.revoke@v1
- remove/revoke allocation user access (keys/session),
- enforce post-release user isolation policy.

node.reclaim.verify@v1
- verify node is safe for reassignment,
- run hygiene checks (process/file/socket residue checks),
- mark reclaim status for scheduler.
If reclaim verification fails, node transitions to quarantine/cleanup path and is not re-assigned.
Required sequencing constraint:
1. allocation.runtime.deprovision@v1 must complete successfully before allocation.identity.revoke@v1.
2. allocation.identity.revoke@v1 must complete before node.reclaim.verify@v1.
3. Orchestrator must enforce this as a composite chained workflow; independent unordered dispatch is not allowed.
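Illustrative composite chain encoding this constraint (workflow id and field names are hypothetical; the profile names are normative above):

workflow: allocation.release.chain          # hypothetical composite id
steps:
  - operation: allocation.runtime.deprovision@v1
    on_failure: halt                        # later stages must not run
  - operation: allocation.identity.revoke@v1
    depends_on: allocation.runtime.deprovision@v1
  - operation: node.reclaim.verify@v1
    depends_on: allocation.identity.revoke@v1
    on_failure: quarantine                  # node is not re-assigned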
7. Artifact and Credential Model¶
7.1 Artifact source¶
- OCI registry (Harbor-compatible) is canonical.
- Pull by immutable digest only: registry/repo@sha256:....
- Optional OCI artifacts for bundles (oras flow).
- Non-OCI large artifacts are supported via artifact.pull_blob:
  - source types: s3, gcs, huggingface, https_signed.
  - resumable transfer is required for large files.
  - transfer must support mid-stream integrity checks and final digest verification.
App-platform companion baseline:
- doc/architecture/App_Platform_OCI_Registry_Baseline_v1.md
- The app control plane owns publish intent, digest registration, and artifact lifecycle metadata.
- Direct upload bytes still go to the registry, not through the public API.
7.2 Credentials¶
- Task-bound endpoint: POST /internal/v1/nodes/{node_id}/tasks/{task_id}/registry-credential.
- Control plane verifies node-task binding and allowed artifact scope.
- Credentials are short-lived, pull-only, scoped to repo/digest.
- Agent must wipe credentials after task completion/failure.
- Exception: consumed-by-use credentials (for example one-time cluster join tokens) satisfy wipe by successful consumption; explicit wipe still applies on failure paths.
7.3 Generalized credential minting¶
For non-OCI and cluster/bootstrap secrets, use:
- POST /internal/v1/nodes/{node_id}/tasks/{task_id}/credentials/mint
- request includes credential_type and scoped intent.
- response returns typed short-lived secret material.
Supported credential_type baseline:
- oci_pull,
- s3_read,
- gcs_read,
- hf_token,
- cluster_join_token.
Secrets must never be embedded in operation request payloads.
8. Authorization and Ownership¶
8.1 Platform admin¶
- Manage operation definitions/versions.
- Manage global allow/deny and artifact trust.
- Manage agent rollout channels.
8.2 Tenant admin¶
- Manage tenant/project operation policy overlays.
- Trigger/cancel/retry operations within tenant/project scope.
- Cannot register raw commands or bypass platform hard-deny.
8.3 Project roles¶
- project_admin: submit/cancel/retry/view operations.
- project_member: submit/view allowed operations; cancel own scope per policy.
- project_viewer: read-only.
9. Node Agent Lifecycle¶
9.1 Version model¶
- node fields: agent_channel, desired_agent_version, reported_agent_version, capabilities.
- channels: stable, candidate, pinned.
Required v1 state split:
- desired_agent_version: version or release ref the control plane wants on the node.
- reported_agent_version: version the running node agent reports after successful startup.
- desired_delivery_mode: reimage|manual_install|rebootstrap.
- last_delivery_attempt_at, last_delivery_result, last_delivery_correlation_id.
- drift_state: in_sync|version_drift|config_drift|certificate_only_repair_needed|unknown.
This split exists because a node can be:
- healthy but behind the desired version,
- healthy on the right binary but holding stale enrollment or cert material,
- unreachable or broken enough that only reimage/manual repair is realistic.
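Illustrative lifecycle state for the first case, a healthy node behind the desired version (values hypothetical):

desired_agent_version: 1.8.2
reported_agent_version: 1.7.5
desired_delivery_mode: rebootstrap
last_delivery_attempt_at: "2025-01-01T09:30:00Z"
last_delivery_result: succeeded
last_delivery_correlation_id: corr-example-7c1d
drift_state: version_drift        # healthy but behind desired version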
Current operator-facing presentation:
- /admin/nodes surfaces summary cards for:
  - Agent current
  - Agent drift
  - Unknown build
- each node row shows a drift badge derived from lifecycle state:
  - Current
  - Outdated
  - Config drift
  - Cert repair
  - Unknown build
commit=unknown or agent=dev must be treated as drift even if the node is
otherwise active.
Capability taxonomy baseline (required keys):
- gpu.vendor: nvidia|amd|intel|other
- gpu.architecture: vendor-specific generation identifier
- gpu.driver_series: driver major/minor series
- compute.runtime: cuda|rocm|level_zero|none
- fabric.type: infiniband|roce|nvlink|ethernet|none
- container.runtime: containerd|docker|other
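Illustrative capabilities report using the required keys (the specific value combination is hypothetical):

capabilities:
  gpu.vendor: nvidia
  gpu.architecture: hopper        # vendor-specific generation identifier
  gpu.driver_series: "550"
  compute.runtime: cuda
  fabric.type: infiniband
  container.runtime: containerd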
9.2 Upgrade tasks¶
- agent.upgrade.stage
- agent.upgrade.activate
- agent.upgrade.rollback
9.2a Supported execution modes¶
The platform must support explicit node-agent lifecycle modes:
reimage
- uses MAAS/cloud-init/bootstrap path
- strongest reset path
- no stored SSH credential required
- highest disruption, slowest recovery

manual_install
- control plane provides versioned package/install bundle metadata and verification material
- operator performs the install outside GPUaaS
- no stored SSH credential required
- result still must be verified back through node-agent health/version reporting

rebootstrap
- control plane reuses the existing bootstrap trust/delivery path for in-place recovery
- preferred before falling back to manual operator steps or full reimage
Current live reality:
- reimage remains the strongest automated reset path for nodes that cannot execute
signed node tasks.
- manual_install is operationally possible but not yet modeled as a first-class workflow.
- rebootstrap supports automated in-place node.self_update for healthy enrolled nodes
in in_place_upgrade, repair_reinstall, and drift_reconcile scenarios. It mints a
short-lived bootstrap-package URL, computes the tarball SHA256 through the API package
endpoint, enqueues a signed node task, waits for task completion, and records the
scheduled target as the reported lifecycle version. bootstrap_install and
certificate_repair still use the brokered manual/reimage paths because they cannot
assume a healthy enrolled agent with valid credentials.
9.2b Repair and upgrade scenarios¶
The lifecycle model must distinguish at least these cases:
- bootstrap install on a new or reimaged node
- in-place upgrade for a healthy enrolled node
- repair reinstall when binary/env/systemd state is bad
- certificate-only or enrollment-only repair when the binary is still valid
- drift reconciliation when the node is healthy but no longer matches desired version or config
Do not collapse all of these into reimage. Reimage remains a valid operator choice,
but the control plane must model the scenario explicitly so observability, policy,
and future automation behave correctly.
9.2c Interrupted task recovery¶
Node-agent restarts or crashes can happen after a side effect occurred but before task completion was reported back to the control plane.
Required control-plane behavior:
- detect ambiguous recent launch failures or stalled launches,
- enqueue a bounded status probe,
- compare desired state to observed runtime truth,
- promote the workload back to running when the probe proves the workload is healthy.
Current live behavior:
- app-runtime status probes can heal recent deploying launches that stalled after the workload was actually created,
- app-runtime status probes can heal recent ambiguous launch_failed states when the workload is verifiably healthy,
- this recovery is intentionally bounded to recent launch ambiguity and must not auto-heal arbitrary old failures.
Still required later:
- node-agent startup reconciliation for claimed/in-flight tasks,
- explicit operator actions such as Force probe or Recover workload,
- background drift/probe cadence reporting in the Node operations hub.
9.2d Upgrade safety rules¶
The lifecycle contract must declare whether a node may be upgraded:
- only when idle,
- after drain,
- or under explicit force/maintenance override.
This rule must be visible in workflow state, not left to operator memory.
9.3 Rollout policy¶
- canary then waves by region/pool/tenant boundaries,
- auto-pause on failure threshold,
- automatic rollback when health checks fail.
Required rollout verification:
- verify installed version on-node after delivery (reported_agent_version)
- verify service health after activation
- verify enrollment/renew path still works after activation
- verify the node returns to schedulable truth before marking rollout success
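Illustrative rollout policy encoding these rules (selectors, thresholds, and key names are hypothetical):

channel: stable
waves:
  - name: canary
    selector: { region: region-a, pool: canary-pool }
  - name: wave-1
    selector: { region: region-a }
  - name: wave-2
    selector: { region: region-b }
auto_pause:
  failure_rate_threshold: 0.05       # auto-pause on failure threshold
auto_rollback:
  on_failed_health_checks: true      # automatic rollback when health checks fail
per_node_verification:               # required rollout verification above
  - reported_agent_version_matches_desired
  - service_healthy_after_activation
  - enrollment_renew_ok
  - node_schedulable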
9.4 Compatibility gate¶
- Definitions declare required capabilities.
- Scheduler must not dispatch incompatible tasks to a node agent.
9.5 Step retry and idempotency¶
Operation idempotency (idempotency_key) is request-level only; step execution must also be tracked:
- operation_steps must include attempt_number.
- Retry semantics:
  - retry resumes from first failed step by default,
  - successful completed steps are not re-run unless profile marks them non-reusable.
- Long-transfer steps (artifact.pull_blob) must support checkpointed resume.
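Illustrative operation_steps trace after one retry (step ids hypothetical):

- step_id: step-01
  step_kind: artifact.pull_oci
  attempt_number: 1
  status: succeeded            # completed on first attempt; not re-run on retry
- step_id: step-02
  step_kind: artifact.pull_blob
  attempt_number: 2            # retry resumed here from the transfer checkpoint
  status: succeeded
- step_id: step-03
  step_kind: artifact.verify
  attempt_number: 1
  status: succeeded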
9.6 Resolver config distribution¶
Resolver/command-profile configuration is distributed as a signed control-plane task (preferred model):
- no out-of-band manual mutation on node,
- auditable updates with correlation_id,
- rollback supported through task-based config versioning.
- local fallback profile set origin: provisioned at enrollment time via signed bootstrap/config artifact.
- compiled fallback in agent binary is limited to minimal safe defaults only (enrollment/recovery), not vendor-specific runtime logic.
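Illustrative signed config-distribution task payload (the task type name and field spellings are hypothetical):

task_type: agent.config.update        # hypothetical task type
payload:
  config_kind: step_resolver
  config_version: 14
  artifact: registry.example/agent/resolver-config@sha256:bbbb...   # digest-pinned
  require_signature: true
correlation_id: corr-example-91e0     # auditable update
rollback_to_version: 13               # task-based config versioning enables rollback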
10. API Surface (planned)¶
10.1 Platform definition lifecycle¶
- POST /api/v1/admin/operation-definitions
- PATCH /api/v1/admin/operation-definitions/{id}
- POST /api/v1/admin/operation-definitions/{id}/deprecate
- POST /api/v1/admin/operation-definitions/{id}/retire
10.2 Tenant/project policy lifecycle¶
- GET|PUT /api/v1/tenants/{tenant_id}/operation-policies/{operation_type}
- GET|PUT /api/v1/projects/{project_id}/operation-policies/{operation_type}
10.3 Execution lifecycle¶
- POST /api/v1/projects/{project_id}/operations
- GET /api/v1/projects/{project_id}/operations/{operation_id}
- POST /api/v1/projects/{project_id}/operations/{operation_id}/cancel
- POST /api/v1/projects/{project_id}/operations/{operation_id}/retry
- GET /api/v1/projects/{project_id}/operations/{operation_id}/logs
10.3a Admin node-agent lifecycle surface¶
This document also requires an admin-scoped node-agent lifecycle API surface so the three execution modes become explicit control-plane behavior instead of operator folklore.
Minimum v1 surface:
- GET /api/v1/admin/nodes/{node_id}/agent-lifecycle
  - returns current desired/reported version state, drift state, last delivery result, and latest lifecycle run if one exists
- POST /api/v1/admin/nodes/{node_id}/agent-lifecycle
  - submits a lifecycle run with:
    - mode: reimage|manual_install|rebootstrap
    - scenario: bootstrap_install|in_place_upgrade|repair_reinstall|certificate_repair|drift_reconcile
    - target_version
    - safety_policy
Required semantics:
- manual_install must still produce a lifecycle run so operator-driven work is visible
- rebootstrap must produce a lifecycle run; supported in-place scenarios enqueue
node.self_update, while unsupported scenarios return explicit brokered-bootstrap
next-action metadata
- reimage remains a first-class lifecycle mode, not an out-of-band fallback
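Illustrative POST body for an in-place upgrade run (values hypothetical; safety_policy follows the 9.2d choices, with the exact enum spelling not fixed here):

mode: rebootstrap
scenario: in_place_upgrade
target_version: 1.8.2
safety_policy: drain_first        # e.g. idle_only | drain_first | force_maintenance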
10.4 Logs endpoint semantics¶
GET /api/v1/projects/{project_id}/operations/{operation_id}/logs supports:
- paginated batch mode (default): cursor, page_size, optional step_id,
- follow mode: follow=true for near-real-time streaming (SSE).
Both modes are scoped and authorized by project ownership.
10.5 Provisioning/release operation bindings (required)¶
Internal control-plane bindings (not tenant-facing API surface) must map:
- provisioning start -> operation submit (allocation.runtime.provision@* profile family),
- release start -> operation submit (allocation.runtime.deprovision@* + reclaim verify),
- force-release/admin-retry -> operation retry with audit linkage.
11. Data Model (planned)¶
- operation_definitions
- operation_definition_versions
- operation_policy_overrides
- operations
- operation_steps
- operation_artifacts
- agent_rollouts (optional v1.1 if rollout orchestration is persisted centrally)
Recommended additional fields:
- operations.trigger_source (allocation_provision|allocation_release|admin|tenant_request)
- operations.linked_allocation_id (nullable UUID for allocation-bound operations)
- operations.linked_node_id (nullable UUID for node-bound operations)
operation_policy_overrides schema baseline:
- keys:
  - enabled (bool; cannot enable if globally hard-denied),
  - max_concurrency (int),
  - timeout_seconds (int),
  - allowed_artifact_sources (array),
  - allowed_target_kinds (array),
  - require_signature_verification (bool),
  - max_artifact_size_bytes (int).
- scopes: global, tenant, project.
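Illustrative project-scope overlay under this schema (values hypothetical):

scope: project
operation_type: model.artifact.stage
enabled: true                           # cannot enable if globally hard-denied
max_concurrency: 4
timeout_seconds: 3600
allowed_artifact_sources:
  - registry.example/models             # hypothetical allowlist entries
  - s3://models-staging
allowed_target_kinds: [node, allocation]
require_signature_verification: true
max_artifact_size_bytes: 549755813888   # 512 GiB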
12. Events (planned)¶
- operation.requested
- operation.progress
- operation.completed
- operation.failed
- operation.canceled
- agent.upgrade.started
- agent.upgrade.completed
- agent.upgrade.failed
- agent.rollback.completed
operation.progress payload minimum:
- operation_id
- correlation_id
- current_step_index
- total_steps
- current_step_name
- step_status
- step_started_at
- optional step_percent
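Illustrative operation.progress payload using the minimum fields above (values hypothetical):

operation_id: op-0007
correlation_id: corr-example-0001
current_step_index: 3
total_steps: 6
current_step_name: artifact.verify
step_status: running
step_started_at: "2025-01-01T09:31:07Z"
step_percent: 40        # optional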
13. Observability Requirements¶
For every operation and step:
- required fields: correlation_id, operation_id, operation_type, org_id, project_id, resource_name, error_code.
- trace continuity: API -> outbox -> worker -> node task execution.
- dashboards and alerts:
- operation failure spike,
- bootstrap verify failures,
- agent upgrade failure/rollback spike,
- credential mint denial anomalies.
Step-level audit/log requirements:
- step_kind (abstract operation-defined step),
- resolved_profile (agent-resolved command profile id),
- hardware_match (resolver rule identifier),
- attempt_number,
- status, error_code, duration_ms.
14. Security Requirements¶
- No arbitrary shell command execution from API payloads.
- All artifacts must be digest-pinned; signature verification policy-enforced.
- Credentials are short-lived, task-bound, least privilege.
- Every lifecycle mutation writes audit_logs with actor/scope/result.
- Tenant overlays cannot override global hard-deny.
- Task authenticity verification is mandatory (mTLS + signed task model per Node_Agent_Spec/PKI_Spec).
15. Phased Delivery¶
Phase A (MVP extension)¶
- Add generic operation definition + execution contracts.
- Implement node-agent step runner with small building block set.
- Implement gpu.host.bootstrap@v1, model.artifact.stage@v1, runtime.readiness.verify@v1.
- Implement tenant policy overlays and execution APIs.
- Bind existing allocation provision/release orchestration paths to operation profiles.
- Implement deprovision/reclaim profiles for release path.
Phase B¶
- Add K8s/Slurm/Ray bootstrap profiles.
- Add agent rollout channel + upgrade/rollback tasks.
- Add adapter stubs for non-node workload lifecycle.
Phase C¶
- Add backend adapters for k8s/slurm/ray workload execution.
- Expand policy controls with OPA integration (existing decision interface).
16. Open Decisions¶
- Exact signature trust root for OCI artifacts (cosign keyless vs managed keys).
- Whether agent_rollouts persistence is required in v1 or can start in-memory.
- Per-tenant private artifact repository support in v1 vs v1.1.
- Command profile distribution model (selected baseline: signed control-plane task updates with local fallback profile set).
Per-tenant private registry support is reserved in the contract from day 1:
- policy schema already includes source allowlists,
- credential minting already supports scoped source intents,
- implementation rollout may be phased, but API/schema shape must not assume a single global registry.