Evidence-first execution
Implemented
doc/governance/Evidence_First_Change_Protocol.md · doc/governance/Coding_Standards.md §Evidence-First · doc/governance/Testing_Standards.md §Evidence-First Verification Rules · doc/governance/Design_Baseline_Gate.md
Evidence-first is the execution discipline that ties governance, orchestration, and the queue together: every change requires baseline → prediction → verification. "Looks right" is never accepted without proof.
The protocol
flowchart TB
classDef step fill:#e3f2fd,stroke:#1565c0
classDef gate fill:#fff3cd,stroke:#332701
S1[1. Baseline<br/>establish relevant baseline<br/>BEFORE changing behavior]:::step
S1 --> S2[2. Smallest verifiable unit<br/>prefer the smallest change<br/>that demonstrates the intent]:::step
S2 --> S3[3. Predict outcome<br/>state the expected result<br/>BEFORE running verification]:::step
S3 --> S4[4. Apply change]:::step
S4 --> S5[5. Re-run scoped checks<br/>same ones as baseline]:::step
S5 --> S6[6. Compare results]:::step
S6 --> G{Direct proof of<br/>intended behavior?}
G -- no --> S1
G -- yes --> S7[7. Mark complete<br/>record evidence in commit/PR/audit]:::gate
Hard rule (Coding_Standards.md):
Do not mark work complete without direct proof of the intended behavior change. Do not treat "compiles" or "looks right" as sufficient evidence.
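The steps above can be sketched as a small harness. This is a minimal illustration, not the project's tooling: `Evidence`, `run_protocol`, and the toy checks are hypothetical names, and "direct proof" is approximated here as "the observed results match the stated prediction exactly".

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical types for illustration; none of these names come from the governance docs.
CheckResults = Dict[str, bool]  # check name -> did it pass?


@dataclass(frozen=True)
class Evidence:
    baseline: CheckResults    # results captured before the change
    prediction: CheckResults  # results stated before verification ran
    observed: CheckResults    # results from re-running the same checks

    @property
    def proven(self) -> bool:
        # Assumption: proof means the observed results match the prediction exactly.
        return self.observed == self.prediction


def run_protocol(
    run_checks: Callable[[], CheckResults],
    predict: Callable[[CheckResults], CheckResults],
    apply_change: Callable[[], None],
) -> Evidence:
    baseline = run_checks()         # step 1: baseline before changing behavior
    prediction = predict(baseline)  # step 3: predict before verifying
    apply_change()                  # step 4: apply the change
    observed = run_checks()         # step 5: re-run the same scoped checks
    return Evidence(baseline, prediction, observed)  # step 6: compare via .proven


# Usage with a toy "system" whose bug is fixed by flipping a flag.
system = {"fixed": False}


def run_checks() -> CheckResults:
    return {
        "missing_user_returns_404": system["fixed"],
        "known_user_returns_200": True,
    }


def predict(baseline: CheckResults) -> CheckResults:
    # Expect only the targeted check to flip; everything else stays as it was.
    return {**baseline, "missing_user_returns_404": True}


def apply_fix() -> None:
    system["fixed"] = True


evidence = run_protocol(run_checks, predict, apply_fix)
print(evidence.proven)  # prints True
```

If `proven` is false, the flowchart sends the work back to step 1 instead of marking it complete. The `Evidence` record is the kind of artifact that could be attached to a commit, PR, or audit entry.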
Why this exists
flowchart LR
classDef good fill:#d1e7dd,stroke:#0a3622
classDef bad fill:#f8d7da,stroke:#42101e
BAD1[Change ships<br/>without baseline]:::bad
BAD1 --> SYM[Symptom-only fix<br/>masks real problem]:::bad
BAD2[Change ships<br/>with 'tests pass']:::bad
BAD2 --> FALSE[Tests pass but don't<br/>cover the changed behavior]:::bad
BAD3[Change ships<br/>after looking at it]:::bad
BAD3 --> REGR[Regression appears<br/>2 weeks later]:::bad
GOOD[Evidence-first]:::good --> PROOF[Baseline + prediction + verification<br/>provides direct proof]:::good
PROOF --> RGRS[Regressions caught immediately:<br/>if a previously-passing check fails<br/>after the change → regression]:::good
The cost is small: roughly one extra step per change. The payoff is large: neither agents nor humans can ship a correctness gap on the strength of "well, it compiled."
Verification scoping rule
Verification must be reported relative to a baseline, not as an isolated pass/fail claim.
flowchart TB
CHANGE[Change to behavior X] --> SCOPE{What's the<br/>relevant scope?}
SCOPE -- single handler --> S1[Run handler unit + httptest]
SCOPE -- service function --> S2[Run service unit + integration<br/>that touches it]
SCOPE -- cross-domain --> S3[Run integration suite<br/>+ relevant e2e]
SCOPE -- contract --> S4[Run contract validate + schemathesis]
SCOPE -- security --> S5[Run security gates +<br/>relevant pen-test scenarios]
S1 & S2 & S3 & S4 & S5 --> BASE[Capture baseline:<br/>which checks were green<br/>and which were red before the change]
BASE --> PRED[Predict outcome:<br/>which will flip, which stay]
PRED --> RUN[Apply change + re-run]
RUN --> COMP[Compare to prediction]
COMP --> EVI[Direct proof captured]
classDef step fill:#e3f2fd,stroke:#1565c0
class S1,S2,S3,S4,S5,BASE,PRED,RUN,COMP,EVI step
Targeted scope beats ritual full-suite runs. The point is direct proof, not maximum surface.
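A rough sketch of the scoping rule, assuming check results are plain name-to-boolean maps. The scope keys and check labels mirror the diagram above, but the identifiers (`SCOPE_CHECKS`, `relative_report`) are hypothetical and do not come from the standards documents.

```python
from typing import Dict, List

# Scope names and check labels paraphrase the diagram; identifiers are hypothetical.
SCOPE_CHECKS: Dict[str, List[str]] = {
    "single_handler": ["handler unit", "httptest"],
    "service_function": ["service unit", "integration touching the function"],
    "cross_domain": ["integration suite", "relevant e2e"],
    "contract": ["contract validate", "schemathesis"],
    "security": ["security gates", "relevant pen-test scenarios"],
}


def relative_report(baseline: Dict[str, bool], observed: Dict[str, bool]) -> List[str]:
    """Describe each check as a before -> after transition, not a bare pass/fail.

    Assumes both maps contain the same check names (the same scoped checks
    were run before and after the change).
    """
    label = {True: "green", False: "red"}
    return [
        f"{name}: {label[baseline[name]]} -> {label[observed[name]]}"
        for name in sorted(baseline)
    ]


report = relative_report(
    {"handler unit": False, "httptest": True},
    {"handler unit": True, "httptest": True},
)
print("\n".join(report))
```

Reporting transitions keeps the baseline visible: "handler unit: red -> green" says more than "handler unit passed".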
Regression-as-evidence rule
If a previously-passing scoped check fails after the change, treat that as a regression until disproven.
flowchart LR
BASE[Baseline: check K was passing] --> CHG[Change applied]
CHG --> POST[Post-change: check K fails]
POST --> Q{Default assumption}
Q --> R[Regression caused by this change]
R --> ACT[Investigate before merging]
ACT --> Q2{Disproven?}
Q2 -- yes --> NOTE[Document why the test<br/>was wrong or environmental]
Q2 -- no --> FIX[Fix in this change<br/>or block merge]
classDef warn fill:#fff3cd,stroke:#332701
classDef ok fill:#d1e7dd,stroke:#0a3622
class R,ACT,Q2 warn
class NOTE,FIX ok
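The default-to-regression rule can be sketched as two small functions. Both names are hypothetical. One assumption goes beyond the text: a check that disappears after the change is treated here as failing.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, bool], observed: Dict[str, bool]) -> List[str]:
    """Checks that passed at baseline and do not pass after the change."""
    # Assumption: a check missing from the post-change run counts as failing.
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not observed.get(name, False)
    )


def merge_blockers(regressions: List[str], disproven: Dict[str, str]) -> List[str]:
    """Regressions that still block merge: those without a documented disproof."""
    # A regression stops blocking only when a non-empty written reason exists.
    return [name for name in regressions if not disproven.get(name, "").strip()]


baseline = {"check_k": True, "check_j": True, "check_m": False}
observed = {"check_k": False, "check_j": True, "check_m": False}

regressions = find_regressions(baseline, observed)
print(regressions)                      # ['check_k']
print(merge_blockers(regressions, {}))  # ['check_k']
print(merge_blockers(regressions, {"check_k": "environmental: fails identically without the change"}))  # []
```

`check_m` was already red at baseline, so it is not counted as a regression. That distinction is only possible because the baseline was captured first.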
Failures as evidence about ownership
flowchart TB
F[Unexpected failure<br/>in a check unrelated to the change]
F --> RECORD[Recorded as evidence<br/>about dependencies or boundaries]
RECORD --> Q{What does it tell us?}
Q --> O1[Test was load-bearing on<br/>something we didn't change]
Q --> O2[Hidden coupling between modules]
Q --> O3[Environment / fixture issue]
Q --> O4[Race / flake exposed by timing]
O1 & O2 & O3 & O4 --> FOLLOWUP[Open follow-up task<br/>NEVER wave away]
Unexpected failures are evidence about dependencies or ownership boundaries and must be recorded, not waved away.
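One way to make "recorded, not waved away" concrete is to turn every out-of-scope failure into a follow-up record. The sketch below is hypothetical: `Cause`, `FollowUp`, and `follow_ups_for` are illustrative names, with the four causes taken from the diagram above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional, Set


class Cause(Enum):
    # The four interpretations listed in the diagram.
    LOAD_BEARING_TEST = "test was load-bearing on something we didn't change"
    HIDDEN_COUPLING = "hidden coupling between modules"
    ENVIRONMENT_OR_FIXTURE = "environment / fixture issue"
    RACE_OR_FLAKE = "race / flake exposed by timing"


@dataclass(frozen=True)
class FollowUp:
    check: str
    cause: Optional[Cause] = None  # None until someone has investigated


def follow_ups_for(failed: Iterable[str], in_scope: Set[str]) -> List[FollowUp]:
    """Open one follow-up per failing check that lies outside the change's scope."""
    return [FollowUp(check=name) for name in sorted(set(failed) - in_scope)]


tasks = follow_ups_for(
    failed=["billing_export_e2e", "user_handler_unit"],
    in_scope={"user_handler_unit"},
)
print(tasks)
```

The in-scope failure is handled by the regression rule. The out-of-scope one gets its own record, with the cause left unset until it has been investigated.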
How this lands in CI
The protocol couples to several CI gates:
flowchart LR
EVI[Evidence-first protocol] -.couples to.-> G1[backend_build_and_tests.sh]
EVI -.couples to.-> G2[frontend_e2e*.sh]
EVI -.couples to.-> G3[contracts_schemathesis_report.sh]
EVI -.couples to.-> G4[audit_presence_guard.sh]
G1 -.proves.-> P1[Service + handler behavior]
G2 -.proves.-> P2[User-visible flow]
G3 -.proves.-> P3[Contract still holds]
G4 -.proves.-> P4[Privileged mutation wrote audit]
classDef gate fill:#fff3e0,stroke:#e65100
classDef proof fill:#d1e7dd,stroke:#0a3622
class G1,G2,G3,G4 gate
class P1,P2,P3,P4 proof
Coupling to other standards
mindmap
root((Evidence-first<br/>relates to))
Root-cause-first remediation
No symptom-only fixes
Mark blocked if upstream
5xx classification
Upstream vs local defect
Add regression test
Sanitize-first
Verification logs must be safe
Idempotent mutations
Re-run verification is safe
Task Authoring Standard
acceptance_checks is the proof
Pre-read forces baseline understanding
Multi-agent review
D-arch verifies architectural baseline
E-governance verifies process baseline
Design baseline gate
For larger changes, a baseline document is required before code lands.
→ Source: Design_Baseline_Gate.md
flowchart LR
P[Proposal for a major change] --> DB{Design Baseline Gate}
DB --> Q1[Baseline doc:<br/>current behavior + invariants]
DB --> Q2[Decision doc:<br/>chosen approach + trade-offs]
DB --> Q3[Acceptance evidence:<br/>what proves done]
Q1 & Q2 & Q3 --> APPROVE[Architecture + security review]
APPROVE --> CODE[Coding can start]
classDef gate fill:#fff3cd,stroke:#332701
class DB,APPROVE gate
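The gate in the diagram reduces to a simple precondition check. This is a sketch under stated assumptions: the artifact names paraphrase the diagram, the function names are hypothetical, and Design_Baseline_Gate.md may define the gate differently.

```python
from typing import Iterable, List

# Artifact names paraphrase the diagram; the identifiers are hypothetical.
REQUIRED_ARTIFACTS = ("baseline_doc", "decision_doc", "acceptance_evidence")


def missing_artifacts(provided: Iterable[str]) -> List[str]:
    """Required artifacts that have not been provided yet."""
    have = set(provided)
    return [name for name in REQUIRED_ARTIFACTS if name not in have]


def coding_can_start(provided: Iterable[str], review_approved: bool) -> bool:
    # Both conditions must hold: all three artifacts exist and review has approved.
    return not missing_artifacts(provided) and review_approved


print(missing_artifacts(["baseline_doc"]))              # ['decision_doc', 'acceptance_evidence']
print(coding_can_start(REQUIRED_ARTIFACTS, True))       # True
print(coding_can_start(REQUIRED_ARTIFACTS, False))      # False
```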
How this protects multi-agent execution
flowchart LR
AG[Multiple agents working in parallel] --> RISK[Risk:<br/>one agent's change breaks<br/>another's invariants silently]
RISK -.evidence-first protocol.-> MIT[Mitigation:<br/>every agent must baseline first<br/>+ predict + verify]
MIT --> Q[Cross-lane regressions surface<br/>at PR time, not after merge]
classDef good fill:#d1e7dd,stroke:#0a3622
class MIT,Q good
This is what makes multi-agent execution safe at scale. Without evidence-first, two agents shipping in parallel could each leave silent regressions for the other. With it, both must prove their change against a baseline, and the baseline checks include the cross-lane integration suite.
Where to look next
- Governance model — where this protocol sits in the rule stack
- Multi-agent orchestration — how the protocol scales to multiple agents
- Testing patterns (Developers section) — verification specifics
- Coding patterns (Developers section) — root-cause + 5xx classification rules
- Source: Evidence_First_Change_Protocol.md