Trail: Node & MAAS
Physical-host lifecycle, end to end: bootstrap trust, PKI/mTLS, node-agent OCI distribution, the node-agent task contract, the MAAS bare-metal lifecycle, the log gateway, and the RCA that hardened it all.
Trail map
flowchart TB
classDef impl fill:#d1e7dd,stroke:#0a3622,color:#0a3622
classDef des fill:#fff3cd,stroke:#332701,color:#332701
classDef run fill:#e9d6ff,stroke:#1e1530,color:#1e1530
classDef rca fill:#f8d7da,stroke:#42101e,color:#42101e
N1[1. Node lifecycle overview]:::impl --> N2[2. Bootstrap trust]:::impl
N2 --> N3[3. PKI / mTLS]:::impl
N3 --> N4[4. Node-agent OCI distribution]:::impl
N4 --> N5[5. Task signing lifecycle]:::impl
N5 --> N6[6. MAAS bare-metal lifecycle]:::des
N6 --> N7[7. Hardware profile matrix]:::des
N7 --> N8[8. Node-agent log gateway]:::impl
N8 --> N9[9. Runbooks]:::run
N9 --> N10[10. RCA: mTLS identity]:::rca
1. Node lifecycle overview
Implemented
stateDiagram-v2
[*] --> commissioned: MAAS commission scripts pass
commissioned --> ready: tagged + firmware-profile applied
ready --> deploying: MAAS deploy with cloud-init
deploying --> enrolled: node-agent bootstrap script runs<br/>+ enrolls mTLS cert
enrolled --> active: nodes.status='active'<br/>schedulable
active --> drained: admin drain<br/>(no new placements)
drained --> active: re-add to pool
drained --> deploying: re-image
active --> unavailable: health failure
unavailable --> active: repaired
enrolled --> commissioned: decommission MAAS path
State derives from:
- nodes.status column (active, drained, unavailable)
- node-agent enrollment cert validity
- MAAS-side state (when MAAS enabled)
- Aggregate of node_resource_slots for slice nodes (whether any are cleanup_blocked)
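The state diagram and the inputs listed above can be sketched as a small transition table plus a derived "schedulable" check. This is a minimal illustration in Python, not the platform's implementation: the function names are hypothetical, and the rule that an invalid cert or a `cleanup_blocked` slot makes a node unschedulable is an assumption about how the inputs combine.

```python
# Allowed transitions, copied from the state diagram above.
TRANSITIONS = {
    ("commissioned", "ready"),
    ("ready", "deploying"),
    ("deploying", "enrolled"),
    ("enrolled", "active"),
    ("active", "drained"),
    ("drained", "active"),
    ("drained", "deploying"),
    ("active", "unavailable"),
    ("unavailable", "active"),
    ("enrolled", "commissioned"),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram has an edge from current to target."""
    return (current, target) in TRANSITIONS


def is_schedulable(status: str, cert_valid: bool, any_slot_cleanup_blocked: bool) -> bool:
    """Hypothetical combination rule (assumption): a node takes new placements
    only when nodes.status is 'active', its enrollment cert is valid, and no
    slot is cleanup_blocked."""
    return status == "active" and cert_valid and not any_slot_cleanup_blocked
```

Keeping the edges in one set makes it cheap to reject an illegal transition (for example `commissioned` straight to `active`) before touching the database.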
→ Sources: Node_Operations_and_Agent_Lifecycle_v1.md, State_Machines.md
2. Bootstrap trust
Implemented
sequenceDiagram
autonumber
participant OPS as Operator
participant API as cmd/api
participant N as New host
participant SCA as step-ca
OPS->>API: POST /admin/nodes (mint enrollment token)
API->>API: token TTL ~15 min, single use
API-->>OPS: {bootstrap_token, bootstrap_url}
Note over OPS,N: bootstrap_token delivered out-of-band<br/>(MAAS cloud-init or operator paste)
N->>API: GET /internal/v1/bootstrap?token=<...>
API->>API: validate token (not expired, not used)
API->>SCA: request initial cert (X5C with token claim)
SCA-->>API: enrollment cert + CA bundle
API->>API: mark token used
API-->>N: cert + bundle + agent config
N->>N: install /etc/gpuaas/agent.{crt,key,ca}
N->>N: start cmd/node-agent
N->>API: GET /internal/v1/nodes/{id}/tasks/wait (mTLS)
Note over API,N: from now on: mTLS only,<br/>token discarded
The bootstrap token is the only secret delivered out-of-band, and it is short-lived: single use, with a TTL of about 15 minutes. Once the enrollment cert is installed, the node-agent never sends the token again and authenticates with mTLS only.
→ Sources: Node_Bootstrap_Trust_Delivery_v1.md, Node_Bootstrap_Script_and_Token_v1.md, Platform_Signing_and_Bootstrap_Trust_v1.md
3. PKI / mTLS
Implemented
Smallstep step-ca issues node certs (24 h TTL, X5C renewal). Vault PKI migration path exists via packages/shared/pki.CAClient interface.
flowchart TB
subgraph PKI[step-ca PKI today]
ROOT[Root CA<br/>offline]:::root
INT[Intermediate CA<br/>online, signing]:::int
ROOT --> INT
end
subgraph NODES[Node certs]
N1[node-agent on host A<br/>cert TTL 24h]:::cert
N2[node-agent on host B<br/>cert TTL 24h]:::cert
end
INT --> N1
INT --> N2
N1 -->|every <24h| X5C[X5C renewal:<br/>present old cert,<br/>get new one with same identity]
X5C --> N1
subgraph FUTURE[Vault PKI migration path]
CACL[packages/shared/pki.CAClient interface]:::iface
VAULT[(Vault PKI)]:::iface
end
CACL -.swap implementation.-> INT
classDef root fill:#fff3e0,stroke:#e65100
classDef int fill:#e3f2fd,stroke:#1565c0
classDef cert fill:#e8f5e9,stroke:#2e7d32
classDef iface fill:#fff3cd,stroke:#332701
Why a 24 h TTL: aggressive rotation limits the blast radius if a cert leaks. X5C lets the node renew without a new bootstrap token, because the existing cert serves as the credential.
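A node-side renewal decision might look like the sketch below. The 24 h TTL comes from the text; the two-thirds renewal threshold and the `next_action` helper are assumptions made for illustration, as is the rule that an expired cert forces a fresh bootstrap.

```python
from datetime import datetime, timedelta, timezone

CERT_TTL = timedelta(hours=24)  # node cert lifetime stated above
RENEW_FRACTION = 2 / 3          # assumption: renew once two thirds of the lifetime has elapsed


def next_action(not_before: datetime, not_after: datetime, now: datetime) -> str:
    """Return 'wait', 'renew', or 'rebootstrap' for a cert valid in [not_before, not_after)."""
    if now >= not_after:
        # Assumption: an expired cert can no longer authenticate a renewal,
        # so the node needs a new bootstrap token.
        return "rebootstrap"
    renew_at = not_before + (not_after - not_before) * RENEW_FRACTION
    return "renew" if now >= renew_at else "wait"
```

Renewing well before expiry leaves room for retries if the CA is briefly unreachable.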
→ Sources: PKI_Spec.md, Node_Control_Plane_Communication_Security_Audit_v1.md
4. Node-agent OCI distribution
Implemented
Node-agent itself ships as an OCI artifact; nodes pull updates by digest.
sequenceDiagram
autonumber
participant CI as CI
participant REG as OCI registry
participant API as cmd/api
participant N as Host
CI->>REG: push node-agent:<git sha> (signed digest)
CI->>API: POST /admin/node-agent-releases<br/>{digest, version, channel}
API->>API: promote stable channel pointer
N->>API: GET /internal/v1/node-agent/release (mTLS)
API-->>N: {digest, signed_url, channel}
N->>REG: pull layer by digest
REG-->>N: layers
N->>N: verify digest + signature
N->>N: switch binary, restart agent
N->>API: announce new agent version
Digest-based pinning makes rollback boring: point the channel at the previous digest, and nodes pull it and switch.
→ Source: Node_Agent_OCI_Distribution_v1.md
5. Task signing lifecycle
Implemented
Every typed task sent to node-agent carries a signature; node-agent verifies before executing.
sequenceDiagram
autonumber
participant WK as Worker
participant API as cmd/api
participant KEY as Task signing key
participant DB as Postgres
participant N as node-agent
WK->>API: enqueueTask(node_id, type, params)
API->>DB: INSERT node_tasks status='queued'
API->>KEY: sign(task_id, params, expires_at)
KEY-->>API: signature
API->>DB: store signature on row
N->>API: GET /internal/v1/nodes/{id}/tasks/wait (mTLS)
API->>DB: CTE: claim queued task → dispatched<br/>WHERE node_id matches AND expires_at > now
API-->>N: {task_id, type, params, signature, expires_at, signing_key_id}
N->>N: verify signature against pinned public key
N->>N: check expires_at > now
N->>N: dispatch typed handler
N->>API: POST /tasks/{task_id}/result (mTLS)
API->>DB: UPDATE node_tasks status='completed'<br/>WHERE node_identity_fp matches claim
Three properties this protects:
| Property | Mechanism |
|---|---|
| Authenticity | Signature over task_id, params, and expires_at, verified against the pinned public key |
| Freshness | Task rejected if expires_at is in the past |
| Identity-bound result | Result accepted only from the node whose cert fingerprint matched the claim |
The third property is the lesson from RCA 2026-03-node-api-mtls-identity-handoff (step 10 below).
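The first two properties (authenticity and freshness) can be illustrated with the standard library alone. Note the substitution: the flow above verifies against a pinned public key, which implies an asymmetric signature; this sketch uses HMAC-SHA256 with a shared key purely as a stand-in so that it runs without third-party packages. The payload encoding and function names are assumptions.

```python
import hashlib
import hmac
import json


def canonical_payload(task_id: str, params: dict, expires_at: float) -> bytes:
    """Deterministic encoding of the signed fields (encoding is an assumption)."""
    body = {"task_id": task_id, "params": params, "expires_at": expires_at}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def sign_task(key: bytes, task_id: str, params: dict, expires_at: float) -> str:
    """Stand-in signer: HMAC-SHA256 instead of the real asymmetric signature."""
    payload = canonical_payload(task_id, params, expires_at)
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_task(key: bytes, task: dict, now: float) -> bool:
    """Node-side check: signature first, then freshness."""
    expected = sign_task(key, task["task_id"], task["params"], task["expires_at"])
    if not hmac.compare_digest(expected, task["signature"]):
        return False
    return task["expires_at"] > now
```

Signing `expires_at` together with the params matters: otherwise a captured task could be replayed with a later expiry.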
→ Source: Node_Task_Signing_Lifecycle_v1.md
6. MAAS bare-metal lifecycle
Designed
Gated by the maas.enabled policy key. When enabled, MAAS owns physical bare-metal provisioning (PXE boot, OS deploy); GPUaaS orchestrates the allocation lifecycle on top.
stateDiagram-v2
[*] --> commissioning: power on, PXE boot
commissioning --> ready: hardware enumerated, tagged
ready --> deploying: gpuaas requests deploy with cloud-init
deploying --> deployed: OS installed, node-agent enrolled
deployed --> releasing: allocation released
releasing --> ready: cleanup, optional re-image
deployed --> failed: deploy error
failed --> ready: operator reset
ready --> [*]: decommission
note right of deploying
cloud-init applies firmware profile
(gpuaas-profile-slice-vm or
gpuaas-profile-baremetal) and
installs node-agent bootstrap script
end note
| Profile tag | Applies |
|---|---|
| gpuaas-profile-slice-vm | KVM/IOMMU/VFIO, OVS, libvirt (slice-ready host) |
| gpuaas-profile-baremetal | Bare-metal-only firmware tune |
A host should never carry both profile tags in automation; the cloud-init helper blocks conflicts.
→ Sources: MAAS_Bare_Metal_Lifecycle_v1.md, MAAS_Execution_Readiness_v1.md, MAAS_Node_State_Model_v1.md, MAAS_Provisioning_Time_Optimization_v1.md, MAAS_Recovery_Matrix_v1.md, Provisioning_BareMetal_MAAS_API_Boundary_v1.md
7. Hardware profile matrix
Designed
Tags compose to derive SKU placement and firmware policy. There is no single SKU-specific tag.
flowchart LR
HW[Host hardware] --> HT[Hardware tags<br/>gpu-nvidia-h200,<br/>server-dell-xe9680l,<br/>fabric-bf3]
HW --> FP[Firmware profile<br/>gpuaas-profile-slice-vm]
HT & FP --> RDY[Readiness evidence<br/>per host]
RDY --> SLOTS[Approved<br/>node_resource_slots]
SLOTS --> SKU{Derive SKU<br/>at scheduling time}
SKU --> SLICE[h200-sxm-slice]
SKU --> BM[h200-sxm-baremetal-8g]
classDef tag fill:#e3f2fd,stroke:#1565c0
classDef derived fill:#d1e7dd,stroke:#0a3622
class HT,FP tag
class SLICE,BM derived
This composition lets the same model extend to H100, B100, AMD, and other hardware: add the right hardware tag and firmware profile, and the rest follows.
→ Sources: MAAS_Hardware_Profile_Capability_Matrix_v1.md, H200_MAAS_Fit_Analysis_v1.md
8. Node-agent log gateway
Implemented
cmd/node-log-gateway streams node-agent logs back to operators. The long-term path is to ship logs into Loki.
flowchart LR
NA[node-agent on host] -->|structured logs| FILE[(host-local log file)]
FILE --> NLG[cmd/node-log-gateway<br/>HTTPS pull]
NLG -->|read with mTLS| API[cmd/api admin endpoint]
API --> ADMIN[Admin UI / runbook tooling]
NLG -.future.-> LOKI[(Loki)]
LOKI -.via Grafana.-> SRE[SRE dashboards]
classDef now fill:#d1e7dd,stroke:#0a3622
classDef future fill:#fff3cd,stroke:#332701
class NA,NLG,API,ADMIN now
class LOKI,SRE future
→ Source: Node_Agent_Log_Collection_Loki_v1.md
9. Runbooks
Runbook
flowchart LR
INC[Node incident or onboarding] --> Q{Symptom}
Q -- new host won't enroll --> R1[Node_Onboarding_Runbook]
Q -- mTLS / cert / task pull issues --> R2[Node_Agent_Control_Plane_Recovery_2026-03]
Q -- MAAS image pipeline for H200 --> R3[MAAS_H200_Host_Image_Pipeline_Runbook]
Q -- three-host lab issues --> R4[Three_Host_Lab_Incident_Runbook]
Q -- fleet telemetry pipeline fail --> R5[Fleet_Telemetry_Incident_Runbook]
classDef rb fill:#e9d6ff,stroke:#1e1530
class R1,R2,R3,R4,R5 rb
| Runbook | When |
|---|---|
| Node Onboarding | New host end-to-end enrollment |
| Node Agent Control Plane Recovery 2026-03 | mTLS / cert / task pull issues |
| MAAS H200 Host Image Pipeline | Image pipeline for H200 hosts |
| Three Host Lab Incident | Dev/CI/MAAS lab issues |
| Fleet Telemetry Incident | Host telemetry pipeline failures |
10. RCA: mTLS identity handoff
RCA
The most consequential node-side RCA. Lesson: task claim must be bound to the enrollment cert fingerprint, not just the node id.
sequenceDiagram
autonumber
participant W as Worker
participant API as cmd/api
participant DB as Postgres
participant N1 as node-agent A<br/>(stale cert, briefly online)
participant N2 as node-agent B<br/>(replacement)
W->>API: enqueueTask(node_id=X, type=provision_user)
API->>DB: INSERT node_tasks status='queued', node_id=X
Note over N1: brief flap — old node-agent process<br/>still has valid cert
N1->>API: GET /tasks/wait (mTLS, cert fp = OLD)
API->>DB: claim WHERE node_id=X<br/>(no identity binding ❌)
API-->>N1: task
N1->>API: POST /tasks/{id}/result ack success
Note over N1: but N1 never actually executed it<br/>(it was being replaced)
N2->>API: GET /tasks/wait (mTLS, cert fp = NEW)
API->>DB: nothing queued for X
API-->>N2: empty
Note over W,DB: incident: DB says completed,<br/>host has no user
The fix:
sequenceDiagram
autonumber
participant W as Worker
participant API as cmd/api
participant DB as Postgres
participant N as node-agent
W->>API: enqueueTask(node_id, type, params)
API->>DB: INSERT node_tasks (no identity binding yet)
N->>API: GET /tasks/wait (mTLS, cert fp = FP)
API->>DB: CTE: claim AND store claim_fingerprint=FP
API-->>N: task
N->>API: POST /tasks/{id}/result (mTLS, cert fp = FP)
API->>DB: UPDATE ... WHERE claim_fingerprint=FP
Note over API,DB: ack rejected if FP doesn't match.<br/>Only the claimer can complete.
→ Source: 2026-03-node-api-mtls-identity-handoff.md. Permanent rule documented in Node_Task_Signing_Lifecycle_v1.md.
Recap
sequenceDiagram
autonumber
participant OPS as Operator
participant MAAS as MAAS
participant API as cmd/api
participant SCA as step-ca
participant N as Host
participant W as workers
OPS->>MAAS: commission + tag
MAAS->>N: PXE boot + cloud-init
N->>N: firmware profile applied
N->>API: GET bootstrap (one-time token)
API->>SCA: mint enrollment cert
SCA-->>API: cert
API-->>N: cert + CA bundle
N->>N: start node-agent
N->>API: tasks/wait (mTLS)
loop continuous
W->>API: enqueue typed task (signed)
N->>API: claim (identity-bound)
N->>N: verify signature + execute
N->>API: result (identity-bound ack)
end
N-->>API: X5C renewal every <24h