RCAs on record
doc/rca/ · 3 documented incidents · each drove a permanent governance rule
Root-cause analyses are written when a single failure crosses multiple layers, when detection was weak, or when the same owner-layer breaks more than once. Every RCA is expected to produce a follow-up: a test, an observability change, or design hardening.
Inventory
| RCA | Date | Domain | Permanent rule it produced |
|---|---|---|---|
| Node API mTLS identity handoff | 2026-03 | Node-agent ↔ API | Task pull is idempotent and identity-bound; node identity is the enrollment cert, not the IP/hostname |
| Terminal stream HTTP/2 buffering | 2026-03 | Terminal gateway | Terminal gateway extracted to its own binary to avoid intermediary HTTP/2 buffering on streaming responses |
| Provisioning workflow recovery gaps | 2026-03 | Provisioning worker | Long-running worker transactions must use clock_timestamp(), not now(), for terminal-state timestamps |
Why these matter for reviewers
Every RCA in this list closed a real customer-visible failure and produced a permanent rule. Reading them is the fastest way to understand why those rules exist in the codebase.
Node API mTLS identity handoff
Symptom: allocation `39b27711-8153-40d4-a088-ebe6ade685a2` showed the `provision_user` task as `completed` in Postgres, but the user never appeared on the host. SSH access failed.
Root cause: the task-claim CTE in `cmd/api` matched by host identity attributes that could be reused; a second node-agent process briefly claimed the task and ack'd success without executing.
Fix: the claim is now bound to the node-agent's enrollment certificate fingerprint, and the task pull is idempotent: the same task can be re-claimed only by the same identity.
Permanent rule: every typed-task contract carries a signed identity-claim header; an ack is rejected if it doesn't match the original claim.
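
The claim and ack rules above can be illustrated with a minimal sketch. It is not the code in `cmd/api`: the in-memory `claimStore`, the `Claim` and `Ack` names, and the use of SHA-256 over the certificate bytes are all assumptions made for illustration, and header signing is left out entirely.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

// ErrIdentityMismatch is returned when a claim or ack does not come from the
// identity that holds the claim (including acks for tasks nobody has claimed).
var ErrIdentityMismatch = errors.New("task is not bound to this node identity")

// fingerprint derives a stable identity from the DER bytes of an enrollment
// certificate. SHA-256 is an assumption of this sketch.
func fingerprint(certDER []byte) string {
	sum := sha256.Sum256(certDER)
	return hex.EncodeToString(sum[:])
}

// claimStore is a hypothetical in-memory stand-in for the task-claim table.
type claimStore struct {
	mu     sync.Mutex
	owners map[string]string // task ID -> fingerprint of the claiming identity
}

func newClaimStore() *claimStore {
	return &claimStore{owners: make(map[string]string)}
}

// Claim binds a task to the caller's identity. Repeating the call with the
// same identity succeeds (idempotent pull); any other identity is rejected.
func (s *claimStore) Claim(taskID, fp string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	owner, claimed := s.owners[taskID]
	if claimed && owner != fp {
		return ErrIdentityMismatch
	}
	s.owners[taskID] = fp
	return nil
}

// Ack accepts a completion report only from the identity that holds the claim.
func (s *claimStore) Ack(taskID, fp string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	owner, claimed := s.owners[taskID]
	if !claimed || owner != fp {
		return ErrIdentityMismatch
	}
	return nil
}

func main() {
	store := newClaimStore()
	// Placeholder bytes, not real certificates.
	nodeA := fingerprint([]byte("enrollment-cert-A"))
	nodeB := fingerprint([]byte("enrollment-cert-B"))

	fmt.Println(store.Claim("task-1", nodeA)) // first claim
	fmt.Println(store.Claim("task-1", nodeA)) // same identity re-claims
	fmt.Println(store.Claim("task-1", nodeB)) // different identity
	fmt.Println(store.Ack("task-1", nodeB))   // ack from the wrong identity
}
```

Keying the claim on the certificate fingerprint rather than on IP or hostname means a second process that reuses the host's attributes cannot take over, or acknowledge, a task it never claimed.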
Terminal stream HTTP/2 buffering
Symptom: customers reported "terminal hangs after a few seconds of typing." Server logs showed bytes being written into the WebSocket stream but the browser saw nothing for tens of seconds.
Root cause: a reverse proxy in the path was using HTTP/2 and buffering small writes. The single combined API+terminal binary made bypassing it awkward.
Fix: extracted `cmd/terminal-gateway` to its own binary on a dedicated port. The reverse proxy routes `/ws/terminal/*` to the gateway with HTTP/1.1 keep-alive and no buffering.
Permanent rule: WS endpoints live behind a dedicated process; any new WS surface goes through the gateway pattern.
Provisioning workflow recovery gaps
Symptom: a 4-minute slice provisioning task showed `completed_at` equal to `started_at`. The allocation timeline appeared instant when it really took minutes.
Root cause: the provisioning worker held a Postgres transaction open while waiting for node-agent task completion. Inside that transaction, `now()` returned the transaction-start time, not the current wall-clock time.
Fix: use `clock_timestamp()` for any terminal-state timestamp or outbox `occurred_at` written after a long wait.
Permanent rule: documented in `Allocation_Provisioning_Task_Timeline_v1.md`. Also drove the allocation timeline read model so customers see real progress, not just coarse state.
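
The difference between the two functions can be shown with a toy model. This is not the worker's code and it does not talk to Postgres: `modelTx`, `fakeClock`, and `runTask` are hypothetical names that mimic the documented semantics, where `now()` is frozen at transaction start and `clock_timestamp()` reads the wall clock on every call.

```go
package main

import (
	"fmt"
	"time"
)

// demoWait mirrors the 4-minute task from the incident.
const demoWait = 4 * time.Minute

// fakeClock is a manually advanced wall clock so the example is deterministic.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

// modelTx mimics the two Postgres time functions inside one transaction.
type modelTx struct {
	begunAt time.Time
	clock   *fakeClock
}

// begin opens a model transaction and records its start time.
func begin(c *fakeClock) *modelTx {
	return &modelTx{begunAt: c.Now(), clock: c}
}

// Now behaves like now(): frozen at transaction start.
func (tx *modelTx) Now() time.Time { return tx.begunAt }

// ClockTimestamp behaves like clock_timestamp(): reads the wall clock each call.
func (tx *modelTx) ClockTimestamp() time.Time { return tx.clock.Now() }

// runTask simulates a worker that waits inside an open transaction. It returns
// started_at plus both candidate values for completed_at.
func runTask(wait time.Duration) (startedAt, viaNow, viaClock time.Time) {
	// The start date is arbitrary.
	clock := &fakeClock{t: time.Date(2026, 3, 1, 12, 0, 0, 0, time.UTC)}
	tx := begin(clock)
	startedAt = tx.Now()
	clock.Advance(wait) // the node-agent works while the transaction stays open
	return startedAt, tx.Now(), tx.ClockTimestamp()
}

func main() {
	startedAt, viaNow, viaClock := runTask(demoWait)
	fmt.Println("duration via now():            ", viaNow.Sub(startedAt))
	fmt.Println("duration via clock_timestamp():", viaClock.Sub(startedAt))
}
```

In the model, the `now()` path reports a zero-length task no matter how long the wait was, which is the shape of the symptom described above.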
What an RCA contains (format)
Every RCA in doc/rca/ follows this shape (enforced by review):
flowchart TB
A["Summary<br/>impact + root cause<br/>1-2 lines"] --> B[Impact<br/>what failed, who, duration]
B --> C[Symptoms<br/>exact operator-visible evidence]
C --> D[Root Cause<br/>owning layer + concrete defect]
D --> E[Why Detection Was Weak<br/>missing/misleading evidence]
E --> F[Recovery<br/>what changed to restore service]
F --> G[Follow-ups<br/>tests, logging, design,<br/>backlog work]
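
The section order in the flowchart could also be checked mechanically. The sketch below is hypothetical: the source says the shape is enforced by review, not by a tool, and `checkShape` assumes each section is a Markdown `#` heading whose text matches the flowchart label.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredSections lists the RCA headings in the order given by the flowchart.
var requiredSections = []string{
	"Summary", "Impact", "Symptoms", "Root Cause",
	"Why Detection Was Weak", "Recovery", "Follow-ups",
}

// checkShape returns the required sections that are missing or out of order
// in doc. An empty result means the document follows the expected shape.
func checkShape(doc string) []string {
	// Collect heading texts, stripping the leading '#' markers.
	var headings []string
	for _, line := range strings.Split(doc, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, "#") {
			headings = append(headings, strings.TrimSpace(strings.TrimLeft(trimmed, "#")))
		}
	}

	var problems []string
	next := 0 // where the search for the next required section starts
	for _, want := range requiredSections {
		found := false
		for i := next; i < len(headings); i++ {
			if strings.EqualFold(headings[i], want) {
				next = i + 1
				found = true
				break
			}
		}
		if !found {
			problems = append(problems, want)
		}
	}
	return problems
}

func main() {
	// A draft that skips the detection section.
	draft := "# Summary\n# Impact\n# Symptoms\n# Root Cause\n# Recovery\n# Follow-ups\n"
	fmt.Println(checkShape(draft))
}
```

A section that appears before its predecessor is reported the same way as a missing one, since the search only moves forward through the headings.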
How RCAs feed governance
flowchart LR
RCA[RCA published] --> R1[Coding standard rule]
RCA --> R2[Test pattern]
RCA --> R3[Runbook update]
RCA --> R4[Observability gate]
RCA --> R5[Architecture decision]
R1 & R2 & R3 & R4 & R5 --> CI[CI gate enforces]
Examples from these three RCAs:
- `scripts/ci/audit_mandatory_guard.sh`: catches mutations missing audit logs (similar pattern to the mTLS RCA's identity tracking failure).
- Worker-tx timestamp rule: codified in `Coding_Standards.md` and checked in code review.
- Terminal gateway split: referenced in `Coding_Standards.md` §"WebSocket endpoints".
Where to look next
- Threat model — how these failure classes map to STRIDE
- Incident severity model — when an incident becomes RCA-worthy
- Coding patterns — the rules that came from these RCAs