Evidence-first execution

Implemented

Source: doc/governance/Evidence_First_Change_Protocol.md · doc/governance/Coding_Standards.md §Evidence-First · doc/governance/Testing_Standards.md §Evidence-First Verification Rules · doc/governance/Design_Baseline_Gate.md

The execution discipline that ties governance, orchestration, and queue together: every change requires a baseline, then a prediction, then verification. "Looks right" does not count without proof.

The protocol

flowchart TB
    classDef step fill:#e3f2fd,stroke:#1565c0
    classDef gate fill:#fff3cd,stroke:#332701

    S1[1. Baseline<br/>establish relevant baseline<br/>BEFORE changing behavior]:::step
    S1 --> S2[2. Smallest verifiable unit<br/>prefer the smallest change<br/>that demonstrates the intent]:::step
    S2 --> S3[3. Predict outcome<br/>state the expected result<br/>BEFORE running verification]:::step
    S3 --> S4[4. Apply change]:::step
    S4 --> S5[5. Re-run scoped checks<br/>same ones as baseline]:::step
    S5 --> S6[6. Compare results]:::step
    S6 --> G{Direct proof of<br/>intended behavior?}
    G -- no --> S1
    G -- yes --> S7[7. Mark complete<br/>record evidence in commit/PR/audit]:::gate

Hard rule (Coding_Standards.md):

Do not mark work complete without direct proof of the intended behavior change. Do not treat "compiles" or "looks right" as sufficient evidence.
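
The seven steps above can be sketched as a small loop. This is a minimal illustration, not the project's tooling: `run_checks`, `Evidence`, `evidence_first`, and the toy rounding-mode checks are all hypothetical names, and the definition of "direct proof" used in `proven` is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Results = Dict[str, bool]  # check name -> passed?


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Results:
    """Run every scoped check and record pass/fail by name."""
    return {name: check() for name, check in checks.items()}


@dataclass
class Evidence:
    baseline: Results
    predicted: Results
    observed: Results

    @property
    def proven(self) -> bool:
        # Assumption: "direct proof" means the observed results match the
        # prediction exactly AND at least one check flipped from the baseline.
        return self.observed == self.predicted and self.observed != self.baseline


def evidence_first(checks, predict, apply_change) -> Evidence:
    baseline = run_checks(checks)   # step 1: baseline before changing behavior
    predicted = predict(baseline)   # step 3: prediction stated before verification
    apply_change()                  # step 4: apply the change
    observed = run_checks(checks)   # step 5: re-run the same scoped checks
    return Evidence(baseline, predicted, observed)  # step 6: compare via .proven


# Toy system under change (hypothetical): a rounding mode.
state = {"rounding": "floor"}
checks = {
    "rounds_half_up": lambda: state["rounding"] == "half_up",
    "mode_is_set": lambda: "rounding" in state,
}

evidence = evidence_first(
    checks,
    predict=lambda baseline: {**baseline, "rounds_half_up": True},
    apply_change=lambda: state.update(rounding="half_up"),
)
print(evidence.proven)
```

Only when `evidence.proven` holds would step 7 apply; the `Evidence` record is what would be attached to the commit, PR, or audit entry.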

Why this exists

flowchart LR
    classDef good fill:#d1e7dd,stroke:#0a3622
    classDef bad fill:#f8d7da,stroke:#42101e

    BAD1[Change ships<br/>without baseline]:::bad
    BAD1 --> SYM[Symptom-only fix<br/>masks real problem]:::bad

    BAD2[Change ships<br/>with 'tests pass']:::bad
    BAD2 --> FALSE[Tests pass but don't<br/>cover the changed behavior]:::bad

    BAD3[Change ships<br/>after looking at it]:::bad
    BAD3 --> REGR[Regression appears<br/>2 weeks later]:::bad

    GOOD[Evidence-first]:::good --> PROOF[Baseline + prediction + verification<br/>provides direct proof]:::good
    PROOF --> RGRS[Regressions caught immediately:<br/>if a previously-passing check fails<br/>after the change → regression]:::good

The cost is small (one extra step per change). The payoff is large: neither agents nor humans can ship a correctness gap on the strength of "well, it compiled."

Verification scoping rule

Verification must be reported relative to a baseline, not as an isolated pass/fail claim.

flowchart TB
    CHANGE[Change to behavior X] --> SCOPE{What's the<br/>relevant scope?}
    SCOPE -- single handler --> S1[Run handler unit + httptest]
    SCOPE -- service function --> S2[Run service unit + integration<br/>that touches it]
    SCOPE -- cross-domain --> S3[Run integration suite<br/>+ relevant e2e]
    SCOPE -- contract --> S4[Run contract validate + schemathesis]
    SCOPE -- security --> S5[Run security gates +<br/>relevant pen-test scenarios]

    S1 & S2 & S3 & S4 & S5 --> BASE[Capture baseline:<br/>which checks were green<br/>and which were red]
    BASE --> PRED[Predict outcome:<br/>which will flip, which stay]
    PRED --> RUN[Apply change + re-run]
    RUN --> COMP[Compare to prediction]
    COMP --> EVI[Direct proof captured]

    classDef step fill:#e3f2fd,stroke:#1565c0
    class S1,S2,S3,S4,S5,BASE,PRED,RUN,COMP,EVI step

Targeted scope beats ritual full-suite runs. The point is direct proof, not maximum surface.
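
One way to picture the scoping rule is a lookup from scope to checks, plus a report phrased relative to the baseline. The sketch below is illustrative only: the check labels are copied from the diagram, while `Scope`, `checks_for`, and `relative_report` are hypothetical names, not part of the referenced standards.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Scope(Enum):
    SINGLE_HANDLER = "single handler"
    SERVICE_FUNCTION = "service function"
    CROSS_DOMAIN = "cross-domain"
    CONTRACT = "contract"
    SECURITY = "security"


# Labels taken from the diagram above; they are descriptions, not commands.
SCOPED_CHECKS: Dict[Scope, List[str]] = {
    Scope.SINGLE_HANDLER: ["handler unit", "httptest"],
    Scope.SERVICE_FUNCTION: ["service unit", "integration that touches it"],
    Scope.CROSS_DOMAIN: ["integration suite", "relevant e2e"],
    Scope.CONTRACT: ["contract validate", "schemathesis"],
    Scope.SECURITY: ["security gates", "relevant pen-test scenarios"],
}


def checks_for(scopes: Iterable[Scope]) -> List[str]:
    """Union of checks for every scope a change touches, order preserved."""
    selected: List[str] = []
    for scope in scopes:
        for check in SCOPED_CHECKS[scope]:
            if check not in selected:
                selected.append(check)
    return selected


def relative_report(baseline: Dict[str, bool], observed: Dict[str, bool]) -> List[str]:
    """Report each check as before -> after, never as an isolated pass/fail.

    Assumes both runs cover the same check names.
    """
    lines = []
    for name in sorted(baseline):
        before = "green" if baseline[name] else "red"
        after = "green" if observed[name] else "red"
        marker = "flipped" if before != after else "unchanged"
        lines.append(f"{name}: {before} -> {after} ({marker})")
    return lines
```

For a handler-only change, `checks_for([Scope.SINGLE_HANDLER])` selects just the two handler checks, and `relative_report` makes the flip from red to green (or the lack of one) explicit.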

Regression-as-evidence rule

If a previously-passing scoped check fails after the change, treat that as a regression until disproven.

flowchart LR
    BASE[Baseline: check K was passing] --> CHG[Change applied]
    CHG --> POST[Post-change: check K fails]
    POST --> Q{Default assumption}
    Q --> R[Regression caused by this change]
    R --> ACT[Investigate before merging]
    ACT --> Q2{Disproven?}
    Q2 -- yes --> NOTE[Document why the test<br/>was wrong or environmental]
    Q2 -- no --> FIX[Fix in this change<br/>or block merge]

    classDef warn fill:#fff3cd,stroke:#332701
    classDef ok fill:#d1e7dd,stroke:#0a3622
    class R,ACT,Q2 warn
    class NOTE,FIX ok
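
The default-assumption flow above could be sketched as follows. All names here (`Regression`, `find_regressions`, `may_merge`) are hypothetical, and treating a check that is missing from the post-change run as a failure is an assumption of this example, not something the rule states.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Regression:
    check: str
    # Filled in only once the regression is disproven: the documented reason
    # the test was wrong or the failure was environmental.
    disproof: Optional[str] = None


def find_regressions(baseline: Dict[str, bool], observed: Dict[str, bool]) -> List[Regression]:
    """Every check that passed in the baseline and does not pass now."""
    return [
        Regression(name)
        for name, passed_before in sorted(baseline.items())
        # Assumption: a check absent from the post-change run counts as failing.
        if passed_before and not observed.get(name, False)
    ]


def may_merge(regressions: List[Regression]) -> bool:
    """Merge is blocked until each regression is fixed or disproven in writing."""
    return all(r.disproof for r in regressions)
```

A check that was already red in the baseline is not reported, which is why the baseline has to be captured before the change rather than reconstructed afterwards.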

Failures as evidence about ownership

flowchart TB
    F[Unexpected failure<br/>in a check unrelated to the change]
    F --> RECORD[Recorded as evidence<br/>about dependencies or boundaries]
    RECORD --> Q{What does it tell us?}
    Q --> O1[Test was load-bearing on<br/>something we didn't change]
    Q --> O2[Hidden coupling between modules]
    Q --> O3[Environment / fixture issue]
    Q --> O4[Race / flake exposed by timing]

    O1 & O2 & O3 & O4 --> FOLLOWUP[Open follow-up task<br/>NEVER wave away]

Unexpected failures are evidence about dependencies or ownership boundaries and must be recorded, not waved away.
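
A minimal sketch of "recorded, not waved away": every failure outside the change's scope becomes a follow-up record. The four classified findings mirror the diagram; `Finding.UNCLASSIFIED`, `FollowUp`, and `record_unexpected` are hypothetical additions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Set


class Finding(Enum):
    LOAD_BEARING_TEST = "test was load-bearing on something we didn't change"
    HIDDEN_COUPLING = "hidden coupling between modules"
    ENVIRONMENT_OR_FIXTURE = "environment / fixture issue"
    RACE_OR_FLAKE = "race / flake exposed by timing"
    UNCLASSIFIED = "not yet investigated"  # hypothetical placeholder state


@dataclass
class FollowUp:
    check: str
    finding: Finding = Finding.UNCLASSIFIED


def record_unexpected(failed: Iterable[str], in_scope: Set[str]) -> List[FollowUp]:
    """Open one follow-up per failing check that the change was not expected to touch."""
    return [FollowUp(name) for name in sorted(failed) if name not in in_scope]
```

The function deliberately has no option to drop a failure: an out-of-scope failure can be reclassified later, but it always produces a record.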

How this lands in CI

The protocol couples to several CI gates:

flowchart LR
    EVI[Evidence-first protocol] -.couples to.-> G1[backend_build_and_tests.sh]
    EVI -.couples to.-> G2[frontend_e2e*.sh]
    EVI -.couples to.-> G3[contracts_schemathesis_report.sh]
    EVI -.couples to.-> G4[audit_presence_guard.sh]

    G1 -.proves.-> P1[Service + handler behavior]
    G2 -.proves.-> P2[User-visible flow]
    G3 -.proves.-> P3[Contract still holds]
    G4 -.proves.-> P4[Privileged mutation wrote audit]

    classDef gate fill:#fff3e0,stroke:#e65100
    classDef proof fill:#d1e7dd,stroke:#0a3622
    class G1,G2,G3,G4 gate
    class P1,P2,P3,P4 proof

Coupling to other standards

mindmap
  root((Evidence-first<br/>relates to))
    Root-cause-first remediation
      No symptom-only fixes
      Mark blocked if upstream
    5xx classification
      Upstream vs local defect
      Add regression test
    Sanitize-first
      Verification logs must be safe
    Idempotent mutations
      Re-run verification is safe
    Task Authoring Standard
      acceptance_checks is the proof
      Pre-read forces baseline understanding
    Multi-agent review
      D-arch verifies architectural baseline
      E-governance verifies process baseline

Design baseline gate

For larger changes, a baseline document is required before code lands.

→ Source: Design_Baseline_Gate.md

flowchart LR
    P[Proposal for a major change] --> DB{Design Baseline Gate}
    DB --> Q1[Baseline doc:<br/>current behavior + invariants]
    DB --> Q2[Decision doc:<br/>chosen approach + trade-offs]
    DB --> Q3[Acceptance evidence:<br/>what proves done]
    Q1 & Q2 & Q3 --> APPROVE[Architecture + security review]
    APPROVE --> CODE[Coding can start]

    classDef gate fill:#fff3cd,stroke:#332701
    class DB,APPROVE gate
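
The gate can be read as a checklist that must be complete before coding starts. The sketch below assumes the "Architecture + security review" step consists of two separate approvals; that split, and the names `Proposal`, `missing_for_gate`, and `coding_can_start`, are assumptions for illustration rather than the contents of Design_Baseline_Gate.md.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Proposal:
    baseline_doc: Optional[str] = None         # current behavior + invariants
    decision_doc: Optional[str] = None         # chosen approach + trade-offs
    acceptance_evidence: Optional[str] = None  # what proves done
    architecture_approved: bool = False        # assumed to be a separate sign-off
    security_approved: bool = False            # assumed to be a separate sign-off


def missing_for_gate(p: Proposal) -> List[str]:
    """List everything still required before the gate opens."""
    missing = []
    if not p.baseline_doc:
        missing.append("baseline doc")
    if not p.decision_doc:
        missing.append("decision doc")
    if not p.acceptance_evidence:
        missing.append("acceptance evidence")
    if not p.architecture_approved:
        missing.append("architecture review")
    if not p.security_approved:
        missing.append("security review")
    return missing


def coding_can_start(p: Proposal) -> bool:
    return not missing_for_gate(p)
```

Returning the list of missing items, rather than a bare boolean, keeps the gate's answer actionable for whoever authored the proposal.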

How this protects multi-agent execution

flowchart LR
    AG[Multiple agents working in parallel] --> RISK[Risk:<br/>one agent's change breaks<br/>another's invariants silently]
    RISK -.evidence-first protocol.-> MIT[Mitigation:<br/>every agent must baseline first<br/>+ predict + verify]
    MIT --> Q[Cross-lane regressions surface<br/>at PR time, not after merge]

    classDef good fill:#d1e7dd,stroke:#0a3622
    class MIT,Q good

This is what makes multi-agent execution safe at scale. Without evidence-first, two agents shipping in parallel could each leave silent regressions for the other. With it, each agent has to prove its change against the baseline, and the baseline checks include the cross-lane integration suite.

Where to look next