RCAs on record

RCA

Source: doc/rca/ · 3 documented incidents · each drove a permanent governance rule

Root-cause analyses are written when a single failure crosses multiple layers, when detection was weak, or when the same owning layer breaks more than once. Every RCA is expected to produce at least one follow-up: a test, an observability change, or design hardening.

Inventory

| RCA | Date | Domain | Permanent rule it produced |
| --- | --- | --- | --- |
| Node API mTLS identity handoff | 2026-03 | Node-agent ↔ API | Task pull is idempotent and identity-bound; node identity is the enrollment cert, not the IP/hostname |
| Terminal stream HTTP/2 buffering | 2026-03 | Terminal gateway | Terminal gateway extracted to its own binary to avoid intermediary HTTP/2 buffering on streaming responses |
| Provisioning workflow recovery gaps | 2026-03 | Provisioning worker | Long-running worker transactions must use clock_timestamp(), not now(), for terminal-state timestamps |

Why these matter for reviewers

Every RCA in this list closed a real customer-visible failure and left behind a permanent rule. Reading them is the fastest way to understand why those rules exist in the codebase.

Node API mTLS identity handoff

Symptom: allocation 39b27711-8153-40d4-a088-ebe6ade685a2 showed its provision_user task as completed in Postgres, but the user never appeared on the host. SSH access failed.

Root cause: the task-claim CTE in cmd/api matched on host identity attributes that could be reused, so a second node-agent process briefly claimed the task and acked success without executing it.

Fix: the claim is now bound to the node-agent's enrollment certificate fingerprint, and the task pull is idempotent: the same task can be re-claimed only by the same identity.

Permanent rule: every typed-task contract carries a signed identity-claim header; an ack is rejected if it does not match the original claim.
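
The claim and ack behavior described above can be sketched in a few lines. This is a minimal in-memory model, not the CTE in cmd/api: `TaskStore`, `ClaimRejected`, and `cert_fingerprint` are hypothetical names, and the fingerprint is assumed to be a SHA-256 digest of the DER-encoded enrollment certificate.

```python
import hashlib


class ClaimRejected(Exception):
    """Raised when a claim or ack comes from a different identity."""


def cert_fingerprint(cert_der: bytes) -> str:
    # Assumption: fingerprint = hex SHA-256 over the DER-encoded certificate.
    return hashlib.sha256(cert_der).hexdigest()


class TaskStore:
    """Hypothetical in-memory stand-in for the task table."""

    def __init__(self) -> None:
        self.claims = {}        # task_id -> fingerprint of the claiming identity
        self.completed = set()  # task_ids that were acked by their claimant

    def claim(self, task_id: str, fingerprint: str) -> str:
        # Idempotent: the first claimant wins, and the same identity may
        # re-claim the task any number of times with the same result.
        owner = self.claims.setdefault(task_id, fingerprint)
        if owner != fingerprint:
            raise ClaimRejected(f"{task_id} is already claimed by another identity")
        return owner

    def ack(self, task_id: str, fingerprint: str) -> None:
        # An ack is accepted only from the identity that made the original claim.
        if self.claims.get(task_id) != fingerprint:
            raise ClaimRejected(f"ack for {task_id} does not match the original claim")
        self.completed.add(task_id)
```

In this model a second node-agent that shares an IP or hostname with the first still presents a different certificate, so both its claim and its ack are rejected.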

Terminal stream HTTP/2 buffering

Symptom: customers reported "terminal hangs after a few seconds of typing." Server logs showed bytes being written into the WebSocket stream but the browser saw nothing for tens of seconds.

Root cause: a reverse proxy in the path was using HTTP/2 and buffering small writes. The single combined API+terminal binary made bypassing it awkward.

Fix: extracted cmd/terminal-gateway to its own binary on a dedicated port. Reverse proxy routes /ws/terminal/* to the gateway with HTTP/1.1 keep-alive and no buffering.

Permanent rule: WS endpoints live behind a dedicated process; any new WS surface goes through the gateway pattern.

Provisioning workflow recovery gaps

Symptom: a 4-minute slice provisioning task showed completed_at equal to started_at. The allocation timeline made the task look instantaneous when it actually took minutes.

Root cause: the provisioning worker held a Postgres transaction open while waiting for node-agent task completion. Inside that transaction, now() returned the transaction-start time, not the wall-clock time at which the terminal-state timestamp was written.

Fix: use clock_timestamp() for any terminal-state timestamp or outbox occurred_at written after a long wait.

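Permanent rule: long-running worker transactions must use clock_timestamp(), not now(), for terminal-state timestamps. Documented in Allocation_Provisioning_Task_Timeline_v1.md. This RCA also drove the allocation timeline read model so customers see real progress, not just coarse state.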
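
The difference between the two functions can be illustrated with a toy model. This sketch does not talk to Postgres; `FakeClock` and `Transaction` are hypothetical classes that only mimic the documented behavior (now() is frozen at transaction start, clock_timestamp() advances with the wall clock), and the start date is an arbitrary value chosen for the example.

```python
from datetime import datetime, timedelta, timezone


class FakeClock:
    """Hypothetical controllable wall clock, so the example is deterministic."""

    def __init__(self, start: datetime) -> None:
        self.current = start

    def advance(self, delta: timedelta) -> None:
        self.current += delta


class Transaction:
    """Toy model of the two Postgres time functions inside one transaction."""

    def __init__(self, clock: FakeClock) -> None:
        self._clock = clock
        self._started = clock.current  # captured once, at transaction start

    def now(self) -> datetime:
        # Mimics now(): the transaction start time on every call.
        return self._started

    def clock_timestamp(self) -> datetime:
        # Mimics clock_timestamp(): the actual current time at each call.
        return self._clock.current


clock = FakeClock(datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc))  # arbitrary date
tx = Transaction(clock)

started_at = tx.now()
clock.advance(timedelta(minutes=4))        # worker waits on the node-agent task
buggy_completed_at = tx.now()              # still the transaction start time
fixed_completed_at = tx.clock_timestamp()  # reflects the 4-minute wait
```

With the buggy call, completed_at equals started_at, which is the symptom reported above; with the fixed call, the two differ by the time actually spent waiting.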

What an RCA contains (format)

Every RCA in doc/rca/ follows this shape (enforced by review):

flowchart TB
    A["Summary<br/>impact + root cause<br/>1-2 lines"] --> B["Impact<br/>what failed, who, duration"]
    B --> C["Symptoms<br/>exact operator-visible evidence"]
    C --> D["Root Cause<br/>owning layer + concrete defect"]
    D --> E["Why Detection Was Weak<br/>missing/misleading evidence"]
    E --> F["Recovery<br/>what changed to restore service"]
    F --> G["Follow-ups<br/>tests, logging, design,<br/>backlog work"]
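
The shape is enforced by review, but reviewers may find a quick mechanical pre-check useful. The sketch below is one possible approach, not an existing script in the repository: `check_rca_shape` is a hypothetical helper, and it assumes each RCA is a Markdown file whose section titles appear as `#`-style headings with exactly the names shown in the flowchart.

```python
# Section names taken from the flowchart above, in the required order.
REQUIRED_SECTIONS = [
    "Summary",
    "Impact",
    "Symptoms",
    "Root Cause",
    "Why Detection Was Weak",
    "Recovery",
    "Follow-ups",
]


def check_rca_shape(text: str) -> list:
    """Return a list of problems; an empty list means the shape matches."""
    # Assumption: headings are Markdown lines starting with one or more '#'.
    headings = [
        line.lstrip("#").strip()
        for line in text.splitlines()
        if line.startswith("#")
    ]
    problems = []
    positions = []
    for name in REQUIRED_SECTIONS:
        if name in headings:
            positions.append(headings.index(name))
        else:
            problems.append(f"missing section: {name}")
    # Sections that are present must appear in the required order.
    if positions != sorted(positions):
        problems.append("sections out of order")
    return problems
```

A check like this only confirms that the headings exist in order; whether each section contains concrete evidence still needs a human reviewer.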

How RCAs feed governance

flowchart LR
    RCA[RCA published] --> R1[Coding standard rule]
    RCA --> R2[Test pattern]
    RCA --> R3[Runbook update]
    RCA --> R4[Observability gate]
    RCA --> R5[Architecture decision]
    R1 & R2 & R3 & R4 & R5 --> CI[CI gate enforces]

Examples from these three RCAs:

  • scripts/ci/audit_mandatory_guard.sh — catches mutations missing audit logs (similar pattern to the mTLS RCA's identity tracking failure).
  • Worker-tx timestamp rule — codified in Coding_Standards.md and checked in code review.
  • Terminal gateway split — referenced in Coding_Standards.md §"WebSocket endpoints".

Where to look next