
Trail: Node & MAAS

Physical-host lifecycle, from bootstrap trust through PKI/mTLS, the MAAS bare-metal lifecycle, the node-agent task contract, and the log gateway, to the RCA that hardened it all.

Trail map

flowchart TB
    classDef impl fill:#d1e7dd,stroke:#0a3622,color:#0a3622
    classDef des  fill:#fff3cd,stroke:#332701,color:#332701
    classDef run  fill:#e9d6ff,stroke:#1e1530,color:#1e1530
    classDef rca  fill:#f8d7da,stroke:#42101e,color:#42101e

    N1[1. Node lifecycle overview]:::impl --> N2[2. Bootstrap trust]:::impl
    N2 --> N3[3. PKI / mTLS]:::impl
    N3 --> N4[4. Node-agent OCI distribution]:::impl
    N4 --> N5[5. Task signing lifecycle]:::impl
    N5 --> N6[6. MAAS bare-metal lifecycle]:::des
    N6 --> N7[7. Hardware profile matrix]:::des
    N7 --> N8[8. Node-agent log gateway]:::impl
    N8 --> N9[9. Runbooks]:::run
    N9 --> N10[10. RCA: mTLS identity]:::rca

1. Node lifecycle overview

Implemented

stateDiagram-v2
    [*] --> commissioned: MAAS commission scripts pass
    commissioned --> ready: tagged + firmware-profile applied
    ready --> deploying: MAAS deploy with cloud-init
    deploying --> enrolled: node-agent bootstrap script runs<br/>+ enrolls mTLS cert
    enrolled --> active: nodes.status='active'<br/>schedulable
    active --> drained: admin drain<br/>(no new placements)
    drained --> active: re-add to pool
    drained --> deploying: re-image
    active --> unavailable: health failure
    unavailable --> active: repaired
    enrolled --> commissioned: decommission MAAS path

State derives from:

- nodes.status column (active, drained, unavailable)
- node-agent enrollment cert validity
- MAAS-side state (when MAAS enabled)
- Aggregate of node_resource_slots for slice nodes (whether any are cleanup_blocked)
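
As a rough illustration, the edges of the state diagram above can be written as a lookup table and checked before any status change. This is a minimal sketch: the names `TRANSITIONS` and `can_transition` are hypothetical, and only the edges themselves come from the diagram.

```python
# Allowed transitions, transcribed from the node lifecycle state diagram.
# TRANSITIONS and can_transition are illustrative names, not project APIs.
TRANSITIONS = {
    "commissioned": {"ready"},
    "ready": {"deploying"},
    "deploying": {"enrolled"},
    "enrolled": {"active", "commissioned"},
    "active": {"drained", "unavailable"},
    "drained": {"active", "deploying"},
    "unavailable": {"active"},
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram has an edge from current to target."""
    return target in TRANSITIONS.get(current, set())
```

For example, `can_transition("drained", "deploying")` covers the re-image path, while `can_transition("active", "deploying")` is rejected because an active node has to be drained first.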

→ Sources: Node_Operations_and_Agent_Lifecycle_v1.md, State_Machines.md


2. Bootstrap trust

Implemented

sequenceDiagram
    autonumber
    participant OPS as Operator
    participant API as cmd/api
    participant N as New host
    participant SCA as step-ca

    OPS->>API: POST /admin/nodes (mint enrollment token)
    API->>API: token TTL ~15 min, single use
    API-->>OPS: {bootstrap_token, bootstrap_url}

    Note over OPS,N: bootstrap_token delivered out-of-band<br/>(MAAS cloud-init or operator paste)

    N->>API: GET /internal/v1/bootstrap?token=<...>
    API->>API: validate token (not expired, not used)
    API->>SCA: request initial cert (X5C with token claim)
    SCA-->>API: enrollment cert + CA bundle
    API->>API: mark token used
    API-->>N: cert + bundle + agent config
    N->>N: install /etc/gpuaas/agent.{crt,key,ca}
    N->>N: start cmd/node-agent
    N->>API: GET /internal/v1/nodes/{id}/tasks/wait (mTLS)
    Note over API,N: from now on: mTLS only,<br/>token discarded

The bootstrap token is the only short-lived, single-use secret in the flow. Once the enrollment cert is installed, the node-agent never sends the token again and authenticates with mTLS only.
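
The token checks in the sequence above (not expired, not used, then marked used) could look roughly like the following. This is a sketch under stated assumptions: the `BootstrapTokens` class, its in-memory storage, and the exact 15-minute constant are hypothetical, and it marks the token used at redemption time rather than after cert issuance as the diagram does.

```python
import secrets
import time
from dataclasses import dataclass

# The diagram says "~15 min"; the exact value here is an assumption.
TOKEN_TTL_SECONDS = 15 * 60


@dataclass
class TokenRecord:
    expires_at: float
    used: bool = False


class BootstrapTokens:
    """Hypothetical in-memory token store; a real service would persist this."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._records = {}

    def mint(self) -> str:
        """Create a random token that expires after TOKEN_TTL_SECONDS."""
        token = secrets.token_urlsafe(32)
        self._records[token] = TokenRecord(self._clock() + TOKEN_TTL_SECONDS)
        return token

    def redeem(self, token: str) -> bool:
        """Accept a token once, and only before it expires."""
        record = self._records.get(token)
        if record is None or record.used or self._clock() >= record.expires_at:
            return False
        record.used = True
        return True
```

Injecting the clock keeps the expiry behaviour testable without sleeping.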

→ Sources: Node_Bootstrap_Trust_Delivery_v1.md, Node_Bootstrap_Script_and_Token_v1.md, Platform_Signing_and_Bootstrap_Trust_v1.md


3. PKI / mTLS

Implemented

Smallstep step-ca issues node certs (24h TTL, X5C renewal). A Vault PKI migration path exists through the packages/shared/pki.CAClient interface.

flowchart TB
    subgraph PKI[step-ca PKI today]
        ROOT[Root CA<br/>offline]:::root
        INT[Intermediate CA<br/>online, signing]:::int
        ROOT --> INT
    end
    subgraph NODES[Node certs]
        N1[node-agent on host A<br/>cert TTL 24h]:::cert
        N2[node-agent on host B<br/>cert TTL 24h]:::cert
    end
    INT --> N1
    INT --> N2

    N1 -->|every <24h| X5C[X5C renewal:<br/>present old cert,<br/>get new one with same identity]
    X5C --> N1

    subgraph FUTURE[Vault PKI migration path]
        CACL[packages/shared/pki.CAClient interface]:::iface
        VAULT[(Vault PKI)]:::iface
    end
    CACL -.swap implementation.-> INT

    classDef root fill:#fff3e0,stroke:#e65100
    classDef int fill:#e3f2fd,stroke:#1565c0
    classDef cert fill:#e8f5e9,stroke:#2e7d32
    classDef iface fill:#fff3cd,stroke:#332701

Why 24h TTL: aggressive rotation limits the blast radius if a cert leaks. X5C lets the node renew without a new bootstrap token, because the existing cert serves as the credential.
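
To make the "renew every <24h" loop concrete, here is one way an agent might decide when to renew. It is only a sketch: the source does not say at what point in the lifetime renewal happens, so the two-thirds fraction and the names `renew_at` and `should_renew` are assumptions.

```python
from datetime import datetime, timedelta, timezone

CERT_TTL = timedelta(hours=24)  # TTL stated in this section
RENEW_FRACTION = 2 / 3          # assumption: renew two thirds into the lifetime


def renew_at(not_before: datetime, not_after: datetime,
             fraction: float = RENEW_FRACTION) -> datetime:
    """Return the instant at which renewal should be attempted."""
    return not_before + (not_after - not_before) * fraction


def should_renew(now: datetime, not_before: datetime, not_after: datetime) -> bool:
    """True once the cert has passed its renewal point."""
    return now >= renew_at(not_before, not_after)
```

Renewing well before expiry leaves room for retries, so a brief control-plane outage does not force the node back through bootstrap.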

→ Sources: PKI_Spec.md, Node_Control_Plane_Communication_Security_Audit_v1.md


4. Node-agent OCI distribution

Implemented

Node-agent itself ships as an OCI artifact; nodes pull updates by digest.

sequenceDiagram
    autonumber
    participant CI as CI
    participant REG as OCI registry
    participant API as cmd/api
    participant N as Host

    CI->>REG: push node-agent:<git sha> (signed digest)
    CI->>API: POST /admin/node-agent-releases<br/>{digest, version, channel}
    API->>API: promote stable channel pointer

    N->>API: GET /internal/v1/node-agent/release (mTLS)
    API-->>N: {digest, signed_url, channel}
    N->>REG: pull layer by digest
    REG-->>N: layers
    N->>N: verify digest + signature
    N->>N: switch binary, restart agent
    N->>API: announce new agent version

Digest-based pinning makes rollback boring: change the channel pointer to the previous digest, and nodes pull it and switch.
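
A small sketch of the two ideas above: checking pulled bytes against a pinned digest, and rolling back by moving a channel pointer. The `verify_digest` helper and `ReleaseChannel` class are hypothetical, the `sha256:<hex>` digest form is assumed, and signature verification is left out.

```python
import hashlib


def verify_digest(blob: bytes, expected: str) -> bool:
    """Check pulled bytes against a pinned 'sha256:<hex>' digest."""
    algo, _, hexdigest = expected.partition(":")
    if algo != "sha256":
        return False
    return hashlib.sha256(blob).hexdigest() == hexdigest


class ReleaseChannel:
    """Hypothetical channel pointer that remembers earlier digests."""

    def __init__(self):
        self._history = []

    def promote(self, digest: str) -> None:
        """Point the channel at a new digest."""
        self._history.append(digest)

    @property
    def current(self) -> str:
        return self._history[-1]

    def rollback(self) -> str:
        """Move the pointer back to the previous digest and return it."""
        if len(self._history) < 2:
            raise ValueError("no previous digest to roll back to")
        self._history.pop()
        return self.current
```

Because nodes only ever follow the pointer and verify what they pull, a rollback takes the same code path as an upgrade.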

→ Source: Node_Agent_OCI_Distribution_v1.md


5. Task signing lifecycle

Implemented

Every typed task sent to node-agent carries a signature; node-agent verifies before executing.

sequenceDiagram
    autonumber
    participant WK as Worker
    participant API as cmd/api
    participant KEY as Task signing key
    participant DB as Postgres
    participant N as node-agent

    WK->>API: enqueueTask(node_id, type, params)
    API->>DB: INSERT node_tasks status='queued'
    API->>KEY: sign(task_id, params, expires_at)
    KEY-->>API: signature
    API->>DB: store signature on row

    N->>API: GET /internal/v1/nodes/{id}/tasks/wait (mTLS)
    API->>DB: CTE: claim queued task → dispatched<br/>WHERE node_id matches AND expires_at > now
    API-->>N: {task_id, type, params, signature, expires_at, signing_key_id}

    N->>N: verify signature against pinned public key
    N->>N: check expires_at > now
    N->>N: dispatch typed handler

    N->>API: POST /tasks/{task_id}/result (mTLS)
    API->>DB: UPDATE node_tasks status='completed'<br/>WHERE node_identity_fp matches claim

This protects three properties:

| Property | Mechanism |
| --- | --- |
| Authenticity | Signature over task_id, params, and expires_at |
| Freshness | Task rejected if expires_at is in the past |
| Identity-bound result | Result accepted only from the node whose cert fingerprint matched the claim |

The third property is the lesson from RCA 2026-03-node-api-mtls-identity-handoff (section 10 below).
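
The first two properties, authenticity and freshness, can be sketched as a sign and verify pair. One substitution to note: the flow above verifies against a pinned public key, which implies an asymmetric signature, whereas this sketch uses HMAC-SHA256 from the standard library purely to stay self-contained. The function names and the JSON canonicalisation are assumptions as well.

```python
import hashlib
import hmac
import json
import time


def canonical_payload(task_id: str, params: dict, expires_at: float) -> bytes:
    """Serialise the signed fields deterministically (sorted keys, no spaces)."""
    body = {"task_id": task_id, "params": params, "expires_at": expires_at}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def sign_task(key: bytes, task_id: str, params: dict, expires_at: float) -> str:
    """Stand-in for the signing step; HMAC replaces the real signature scheme."""
    payload = canonical_payload(task_id, params, expires_at)
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_task(key: bytes, task: dict, now=None) -> bool:
    """Check the signature first, then reject tasks whose expires_at has passed."""
    now = time.time() if now is None else now
    expected = sign_task(key, task["task_id"], task["params"], task["expires_at"])
    if not hmac.compare_digest(expected, task["signature"]):
        return False
    return task["expires_at"] > now
```

Including `expires_at` in the signed payload matters: otherwise anyone able to alter the task in transit could extend its lifetime without invalidating the signature.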

→ Source: Node_Task_Signing_Lifecycle_v1.md


6. MAAS bare-metal lifecycle

Designed

Gated by the maas.enabled policy key. When enabled, MAAS owns physical bare-metal provisioning (PXE boot, OS deploy); GPUaaS orchestrates the allocation lifecycle on top.

stateDiagram-v2
    [*] --> commissioning: power on, PXE boot
    commissioning --> ready: hardware enumerated, tagged
    ready --> deploying: gpuaas requests deploy with cloud-init
    deploying --> deployed: OS installed, node-agent enrolled
    deployed --> releasing: allocation released
    releasing --> ready: cleanup, optional re-image
    deployed --> failed: deploy error
    failed --> ready: operator reset
    ready --> [*]: decommission

    note right of deploying
      cloud-init applies firmware profile
      (gpuaas-profile-slice-vm or
       gpuaas-profile-baremetal) and
      installs node-agent bootstrap script
    end note

| Profile tag | Applies |
| --- | --- |
| gpuaas-profile-slice-vm | KVM/IOMMU/VFIO, OVS, libvirt (slice-ready host) |
| gpuaas-profile-baremetal | Bare-metal-only firmware tune |

A host should never carry both profile tags in automation; the cloud-init helper blocks conflicts.
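
The conflict rule can be illustrated with a small check of the kind a cloud-init helper might run. This is a sketch, not the helper itself: `select_profile` is a hypothetical name, and raising `ValueError` on conflict is an assumed behaviour.

```python
# The two profile tags named in this section.
PROFILE_TAGS = frozenset({"gpuaas-profile-slice-vm", "gpuaas-profile-baremetal"})


def select_profile(host_tags):
    """Return the host's single firmware profile tag, or None if it has none.

    Raises ValueError when the host carries more than one profile tag.
    """
    found = sorted(PROFILE_TAGS.intersection(host_tags))
    if len(found) > 1:
        raise ValueError(f"conflicting profile tags: {', '.join(found)}")
    return found[0] if found else None
```

Failing loudly on a conflict, rather than picking one tag, keeps a mis-tagged host from being deployed with the wrong firmware settings.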

→ Sources: MAAS_Bare_Metal_Lifecycle_v1.md, MAAS_Execution_Readiness_v1.md, MAAS_Node_State_Model_v1.md, MAAS_Provisioning_Time_Optimization_v1.md, MAAS_Recovery_Matrix_v1.md, Provisioning_BareMetal_MAAS_API_Boundary_v1.md


7. Hardware profile matrix

Designed

Tags compose to derive SKU placement and firmware policy. There is no single SKU-specific tag.

flowchart LR
    HW[Host hardware] --> HT[Hardware tags<br/>gpu-nvidia-h200,<br/>server-dell-xe9680l,<br/>fabric-bf3]
    HW --> FP[Firmware profile<br/>gpuaas-profile-slice-vm]
    HT & FP --> RDY[Readiness evidence<br/>per host]
    RDY --> SLOTS[Approved<br/>node_resource_slots]
    SLOTS --> SKU{Derive SKU<br/>at scheduling time}
    SKU --> SLICE[h200-sxm-slice]
    SKU --> BM[h200-sxm-baremetal-8g]
    classDef tag fill:#e3f2fd,stroke:#1565c0
    classDef derived fill:#d1e7dd,stroke:#0a3622
    class HT,FP tag
    class SLICE,BM derived

This composition lets the same model extend to H100, B100, AMD, and so on: add the right hardware tag and firmware profile, and the rest follows.
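
A sketch of what deriving a SKU from composed tags might look like. The tag and SKU names come from the diagram above, but the pairing of each firmware profile with a SKU is inferred from the names, and `SKU_RULES` and `derive_sku` are hypothetical. The readiness evidence and approved slots that the diagram places before SKU derivation are omitted here.

```python
# Assumed rule table: (GPU hardware tag, firmware profile) -> SKU.
# The pairings are inferred from the names in the diagram, not confirmed.
SKU_RULES = {
    ("gpu-nvidia-h200", "gpuaas-profile-slice-vm"): "h200-sxm-slice",
    ("gpu-nvidia-h200", "gpuaas-profile-baremetal"): "h200-sxm-baremetal-8g",
}


def derive_sku(hardware_tags, firmware_profile):
    """Return the SKU for a host's tags and profile, or None if no rule matches."""
    for (gpu_tag, profile), sku in SKU_RULES.items():
        if gpu_tag in hardware_tags and profile == firmware_profile:
            return sku
    return None
```

Supporting a new GPU family then means adding rows to the rule table rather than introducing a new per-SKU tag.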

→ Sources: MAAS_Hardware_Profile_Capability_Matrix_v1.md, H200_MAAS_Fit_Analysis_v1.md


8. Node-agent log gateway

Implemented

cmd/node-log-gateway streams node-agent logs back to operators. The long-term path is to ship these logs into Loki.

flowchart LR
    NA[node-agent on host] -->|structured logs| FILE[(host-local log file)]
    FILE --> NLG[cmd/node-log-gateway<br/>HTTPS pull]
    NLG -->|read with mTLS| API[cmd/api admin endpoint]
    API --> ADMIN[Admin UI / runbook tooling]

    NLG -.future.-> LOKI[(Loki)]
    LOKI -.via Grafana.-> SRE[SRE dashboards]

    classDef now fill:#d1e7dd,stroke:#0a3622
    classDef future fill:#fff3cd,stroke:#332701
    class NA,NLG,API,ADMIN now
    class LOKI,SRE future

→ Source: Node_Agent_Log_Collection_Loki_v1.md


9. Runbooks

Runbook

flowchart LR
    INC[Node incident or onboarding] --> Q{Symptom}
    Q -- new host won't enroll --> R1[Node_Onboarding_Runbook]
    Q -- mTLS / cert / task pull issues --> R2[Node_Agent_Control_Plane_Recovery_2026-03]
    Q -- MAAS image pipeline for H200 --> R3[MAAS_H200_Host_Image_Pipeline_Runbook]
    Q -- three-host lab issues --> R4[Three_Host_Lab_Incident_Runbook]
    Q -- fleet telemetry pipeline failure --> R5[Fleet_Telemetry_Incident_Runbook]
    classDef rb fill:#e9d6ff,stroke:#1e1530
    class R1,R2,R3,R4,R5 rb

| Runbook | When |
| --- | --- |
| Node Onboarding | New host end-to-end enrollment |
| Node Agent Control Plane Recovery 2026-03 | mTLS / cert / task pull issues |
| MAAS H200 Host Image Pipeline | Image pipeline for H200 hosts |
| Three Host Lab Incident | Dev/CI/MAAS lab issues |
| Fleet Telemetry Incident | Host telemetry pipeline failures |

10. RCA: mTLS identity handoff

RCA

This is the most consequential node-side RCA. The lesson: a task claim must be bound to the enrollment cert fingerprint, not just the node ID.

sequenceDiagram
    autonumber
    participant W as Worker
    participant API as cmd/api
    participant DB as Postgres
    participant N1 as node-agent A<br/>(stale cert, briefly online)
    participant N2 as node-agent B<br/>(replacement)

    W->>API: enqueueTask(node_id=X, type=provision_user)
    API->>DB: INSERT node_tasks status='queued', node_id=X

    Note over N1: brief flap — old node-agent process<br/>still has valid cert
    N1->>API: GET /tasks/wait (mTLS, cert fp = OLD)
    API->>DB: claim WHERE node_id=X<br/>(no identity binding ❌)
    API-->>N1: task
    N1->>API: POST /tasks/{id}/result ack success
    Note over N1: but N1 never actually executed it<br/>(it was being replaced)

    N2->>API: GET /tasks/wait (mTLS, cert fp = NEW)
    API->>DB: nothing queued for X
    API-->>N2: empty

    Note over W,DB: incident: DB says completed,<br/>host has no user

The fix:

sequenceDiagram
    autonumber
    participant W as Worker
    participant API as cmd/api
    participant DB as Postgres
    participant N as node-agent

    W->>API: enqueueTask(node_id, type, params)
    API->>DB: INSERT node_tasks (no identity binding yet)
    N->>API: GET /tasks/wait (mTLS, cert fp = FP)
    API->>DB: CTE: claim AND store claim_fingerprint=FP
    API-->>N: task
    N->>API: POST /tasks/{id}/result (mTLS, cert fp = FP)
    API->>DB: UPDATE ... WHERE claim_fingerprint=FP
    Note over API,DB: ack rejected if FP doesn't match.<br/>Only the claimer can complete.

→ Source: 2026-03-node-api-mtls-identity-handoff.md. Permanent rule documented in Node_Task_Signing_Lifecycle_v1.md.


Recap

sequenceDiagram
    autonumber
    participant OPS as Operator
    participant MAAS as MAAS
    participant API as cmd/api
    participant SCA as step-ca
    participant N as Host
    participant W as workers

    OPS->>MAAS: commission + tag
    MAAS->>N: PXE boot + cloud-init
    N->>N: firmware profile applied
    N->>API: GET bootstrap (one-time token)
    API->>SCA: mint enrollment cert
    SCA-->>API: cert
    API-->>N: cert + CA bundle
    N->>N: start node-agent
    N->>API: tasks/wait (mTLS)
    loop continuous
        W->>API: enqueue typed task (signed)
        N->>API: claim (identity-bound)
        N->>N: verify signature + execute
        N->>API: result (identity-bound ack)
    end
    N-->>API: X5C renewal every <24h