# vyre-conform execution plan
Handoff document for the agent owning vyre-conform. Read top to bottom once
before starting. Every section has concrete deliverables, no optional work, no
"nice to have." If something is unclear, stop and ask — the cost of a
misunderstanding discovered at Phase 5 is a week of rework.
---
## 0. Mission and bar
vyre-conform is the discipline that prevents vyre from degrading as it grows.
Every new op, every new backend, every new contribution must pass through
vyre-conform's gates. If the gates are strong, vyre can accept contributions
from local LLMs at volume without quality loss. If the gates are weak, vyre
becomes slop regardless of who contributes.
The bar: **a malicious LLM explicitly trying to sneak a wrong op into vyre
must fail at the conform gate, not at human review.** If you can't visualize
how the gate catches adversarial output, the gate is too weak and you rebuild
it.
Not a test suite. A compiler for contributions.
---
## 1. Architectural decisions (locked, do not relitigate)
- **vyre = IR + lowering + ops + bytecode.** Nothing else. The thing being
verified.
- **vyre-conform = everything about correctness.** Spec, archetypes,
mutations, oracles, harnesses, templates, generator, minimizer, reference
interpreter, proof engine, layer enforcement. One crate, one concern.
- **vyre-agent-stream = runtime orchestration.** Separate crate. Imports
vyre-conform. Not in scope for this plan.
- **Tests are not committed.** Generator materializes them into
`vyre-conform/tests_generated/` (gitignored) from spec on demand. Source of
truth is ~100K LOC; materialized tests can be GB.
- **Committed exceptions:** regression corpus (human-minimized real bugs) and
`agent_accepted/` (tests that passed gate + human approval) live in the
repo with provenance sidecars.
- **Category A or C, never B.** Compositional or hardware-intrinsic. Runtime
abstraction is forbidden. Not an enum variant you can add.
- **Floats are in, deterministic via restriction.** Strict IEEE 754 path.
Three conformance tracks (integer, float, approximate). Tracks never mix in
a certificate entry.
- **`#[non_exhaustive]` on every public enum.** Adding variants is allowed.
Changing or removing is not. Breaking userspace is not an option (I13).
- **Laws are proven at build time, not claimed.** Claiming a law without a
Verification provenance is a compile error.
- **One execution path:** `source → ir::Program → lower → target shader →
dispatch`. The bytecode VM interpreter path is dead. Bytecode is a wire
format only.
---
## 2. Directory layout
Build this tree. Existing directories marked EXISTING, new ones marked NEW.
```
vyre-conform/
├── Cargo.toml
├── schema.md NEW — single-page architecture doc
├── PLAN.md THIS FILE
├── CONTRIBUTING.md EXISTING — update per Phase 6
├── TEMPLATE_OP.rs EXISTING — update per Phase 2
├── regression/ NEW — human-curated minimized repros
│ └── README.md
├── agent_accepted/ NEW — gate-passed, human-approved agent tests
│ └── README.md
├── corpus/ NEW — shared fuzzing inputs
│ ├── seeds/
│ └── fuzz_targets/
├── tests_generated/ NEW — gitignored, materialized on demand
│ └── .gitkeep
└── src/
├── lib.rs EXISTING — add new pub mod lines
├── spec/ EXISTING — extend
│ ├── mod.rs
│ ├── ops/ NEW — one file per op OpSpec
│ │ └── add.rs NEW — canonical example
│ ├── tables/ NEW — SpecRow tables per op
│ │ └── add.rs
│ ├── invariants.rs NEW — I1..I15 encoded
│ ├── laws.rs NEW — Law enum + DeclaredLaw
│ ├── categories.rs NEW — Category A/C enum
│ ├── versions.rs NEW — Version + since_version
│ └── registry.rs NEW — compile-time coverage check
├── archetypes/ NEW
│ ├── mod.rs — Archetype trait + ArchetypeRef
│ ├── arithmetic.rs — A1..A7
│ ├── structural.rs — S1..S15
│ ├── composition.rs — C1..C7
│ └── backend.rs — X1..X4
├── mutations/ NEW
│ ├── mod.rs — Mutation enum, MutationClass, catalog
│ ├── arithmetic.rs
│ ├── comparison.rs
│ ├── bitwise.rs
│ ├── control.rs
│ ├── buffer.rs
│ ├── ir_specific.rs
│ └── law.rs
├── oracles/ NEW
│ ├── mod.rs — Oracle trait
│ ├── hierarchy.rs — decision tree
│ ├── law.rs
│ ├── spec_table.rs
│ ├── reference.rs
│ ├── cpu_ref.rs
│ └── composition.rs
├── harnesses/ NEW
│ ├── mod.rs
│ ├── backend.rs — H1
│ ├── oom.rs — H2
│ ├── fault.rs — H3
│ ├── swap.rs — H4
│ ├── determinism.rs — H5
│ ├── mutation.rs — H6 (THE GATE)
│ ├── reference_diff.rs — H7
│ ├── law_probe.rs — H8
│ ├── coverage.rs — H9
│ └── composition_proof.rs — H10
├── templates/ NEW — include_str!'d at compile time
│ ├── mod.rs — template loader + interpolator
│ ├── preamble.tmpl
│ ├── op_correctness.tmpl
│ ├── law.tmpl
│ ├── validation.tmpl
│ ├── backend_equiv.tmpl
│ ├── archetype.tmpl
│ └── mutation_kill.tmpl
├── minimize/ NEW — extends existing minimizer.rs
│ ├── mod.rs — Minimizer trait + driver
│ ├── arithmetic.rs
│ ├── structural.rs
│ ├── composition.rs
│ └── type_shrink.rs
├── generator/ NEW
│ ├── mod.rs
│ ├── cross_product.rs — (op × archetype × oracle) iteration
│ ├── emit.rs — Rust code writer
│ ├── naming.rs — deterministic test names
│ ├── seed.rs — reproducible random inputs
│ └── provenance.rs — header comment on every generated file
├── reference/ NEW — Layer 3 reference interpreter
│ ├── mod.rs
│ ├── interp.rs — obviously correct IR evaluator
│ ├── eval_expr.rs
│ ├── eval_node.rs
│ ├── atomics.rs
│ ├── workgroup.rs
│ └── ieee754.rs — strict float path (stubs until floats land)
├── layers/ NEW — enforcement wiring
│ ├── mod.rs
│ ├── layer1_executable_spec.rs
│ ├── layer2_law_inference.rs
│ ├── layer3_reference_interp.rs — thin wrapper over ::reference
│ ├── layer4_mutation_gate.rs — thin wrapper over harnesses::mutation
│ ├── layer5_adversarial.rs — implementor/prosecutor/defender
│ ├── layer6_stability.rs — non_exhaustive + version audit
│ ├── layer7_composition_proof.rs — build-time composition checks
│ └── layer8_feedback_loop.rs — <10s iteration infrastructure
├── algebra/ EXISTING — checker, gpu_checker, composition, minimizer, inference
├── certify/ EXISTING — formal certificate generator
├── streaming/ EXISTING — 100M-scale runner
├── specs/engine/ EXISTING — DFA, eval, scatter engine conformance
└── bin/
└── vyre-gen-tests.rs NEW — generator binary entry point
```
Top-level `xtask/` crate at the workspace root (not under vyre-conform) hosts
the workflow commands.
---
## 3. Build order (phases with dependencies)
Six phases. Each phase ends with a gate: run the phase's acceptance test
before starting the next. Do not skip gates.
- **Phase 1 — Foundation** (schema doc, scaffolding, spec types, invariants, laws)
- **Phase 2 — Proof core** (canonical add OpSpec, reference interpreter, Layer 2 law inference wiring)
- **Phase 3 — Catalogs** (archetypes, mutations, oracles, templates)
- **Phase 4 — Harnesses** (H1–H10, with H6 the critical one)
- **Phase 5 — Generator + xtask + layer enforcement** (1, 4, 5, 6, 7, 8)
- **Phase 6 — Calibration** (first end-to-end on BinOp::Add, audit output, strengthen, repeat)
Phases 3 and 4 can overlap heavily since archetypes/mutations/oracles/templates
are independent of harnesses except H9 (coverage) which needs the spec
registry from Phase 1. Parallelize aggressively.
---
## 4. Phase 1 — Foundation
**Goal:** every type, trait, and invariant that other phases reference exists
and type-checks. No implementation bodies beyond what's needed to compile.
### 4.1 schema.md
Single page. Sections: purpose, directory map, committed-vs-generated rule,
provenance format, the spec-of-the-spec pointer (registry.rs), contribution
flow overview. One page. No more. Anyone reading it should know where every
kind of artifact goes in under two minutes.
### 4.2 Directory scaffolding
Create every directory listed in §2. Every `mod.rs` with `pub mod`
declarations. Every file at least empty with a doc comment stating its
purpose. `cargo check` must pass on the empty skeleton before moving on.
Add `tests_generated/` to `.gitignore`. Create `.gitkeep` so the directory
exists in the repo.
### 4.3 Spec schema types
File: `src/spec/mod.rs`, extending existing spec module.
```rust
#[non_exhaustive]
pub struct OpSpec {
pub name: &'static str,
pub category: Category,
pub signature: OpSignature,
pub reference_fn: fn(&[Value]) -> Value,
pub laws: &'static [DeclaredLaw],
pub spec_table: &'static [SpecRow],
pub archetypes: &'static [ArchetypeRef],
pub mutation_sensitivity: &'static [MutationClass],
pub oracle_override: Option<OracleKind>,
pub since_version: Version,
pub category_c_fallback: Option<fn(&[Value]) -> Value>,
pub docs_path: &'static str,
}
#[non_exhaustive]
pub struct SpecRow {
pub inputs: &'static [Value],
pub expected: Value,
pub rationale: &'static str,
pub source: SpecSource,
}
#[non_exhaustive]
pub struct DeclaredLaw {
pub law: Law,
pub verified_by: Verification,
}
#[non_exhaustive]
pub enum Verification {
ExhaustiveU8,
ExhaustiveU16,
WitnessedU32 { seed: u64, count: u64 },
ExhaustiveFloat { typ: FloatType },
}
#[non_exhaustive]
pub enum Category {
A, // compositional, inlines at lowering
C { fallback_available: bool }, // hardware intrinsic with fallback
}
#[non_exhaustive]
pub enum SpecSource {
HandWritten,
DerivedFromLaw(LawId),
FromCorpus(&'static str),
FromRegression(&'static str), // filename in regression/
}
#[non_exhaustive]
pub enum OracleKind {
Law,
SpecTable,
ReferenceInterpreter,
CpuReference,
Composition,
External,
Property,
}
pub struct Version { pub major: u32, pub minor: u32, pub patch: u32 }
pub type ArchetypeRef = &'static dyn Archetype;
```
Every struct has doc comments that explain the field's purpose and what
checks exist on it.
### 4.4 Registry with coverage checker
File: `src/spec/registry.rs`.
- `pub const OP_REGISTRY: &[&'static OpSpec] = &[...];` populated via a
declarative macro that every op's `ops/*.rs` file invokes.
- `pub fn verify_coverage() -> Result<(), Vec<CoverageError>>` — iterates the
registry and checks every OpSpec has required fields populated. Missing
reference_fn → error. Missing laws field → error (empty slice is allowed
and explicit, but the field must be set). Missing archetypes → error.
Missing mutation_sensitivity → error.
- `build.rs` calls `verify_coverage()` and fails the build if any error is
returned.
The key property: you cannot add an op to the registry without declaring
everything. Omission is a build error, not a runtime warning.
### 4.5 Invariants I1–I15
File: `src/spec/invariants.rs`.
```rust
pub struct Invariant {
pub id: InvariantId,
pub name: &'static str,
pub description: &'static str,
pub category: InvariantCategory,
pub test_family: fn() -> Vec<TestDescriptor>,
}
pub const INVARIANTS: &[Invariant] = &[
Invariant { id: I1, name: "Determinism", ... },
// ... all 15
];
```
Full list (copy the descriptions verbatim into doc comments):
Execution invariants:
- **I1 (Determinism):** Same ir::Program + same inputs → byte-identical
outputs, every run, every backend, every device. No exceptions.
- **I2 (Composition commutativity with lowering):** `lower(compose(a, b)) ≡
lower(a) ; lower(b)` semantically.
- **I3 (Backend equivalence):** For every Program P and every pair of
conformant backends (B1, B2), `B1.run(P) == B2.run(P)`. Bit-exact.
- **I4 (Bytecode round-trip):** `to_ir(to_bytecode(P)) == P` for every valid P.
- **I5 (Validation soundness):** If `validate(P)` returns empty, `lower(P)`
must not panic, allocate unboundedly, or produce UB.
- **I6 (Validation completeness — partial):** For every V001-V020 rule,
there exists a test input that triggers exactly that rule and no others.
Algebra invariants:
- **I7 (Law monotonicity):** If op A declares law L and op B is defined as
`compose(A, ...)` where the composition preserves L per composition.rs
theorems, then B automatically has law L. The inference engine must never
lose a law that the theorems prove.
- **I8 (Reference agreement):** The reference interpreter result equals the
CPU reference_fn result for every op, every input. Zero tolerance.
- **I9 (Law falsifiability):** For every declared law on every op, removing
that law from the declaration must cause at least one test to fail. No
decorative laws.
Resource invariants:
- **I10 (Bounded allocation):** No operation allocates more than
`program.buffers.total_size + program.workgroup_mem + O(nodes)`.
- **I11 (No panic):** No lowered program can panic the runtime regardless
of input data.
- **I12 (No UB):** No lowered shader can produce undefined behavior on any
conformant backend.
Extensibility invariants:
- **I13 (Userspace stability):** A Program valid under vyre v1.x is valid
and produces identical results under every v1.y where y≥x.
- **I14 (Non-exhaustive discipline):** Adding a new DataType, BinOp,
AtomicOp, or Expr variant never breaks existing Programs.
- **I15 (Certificate stability):** A conformance certificate issued at v1.x
remains valid at v1.y if the backend hasn't changed.
Each invariant's `test_family` is a function pointer the generator calls to
produce the concrete tests for that invariant. Bodies can be `todo!()`
initially but the signatures must exist so the generator can reference them.
### 4.6 Laws enum
File: `src/spec/laws.rs`. Extends existing algebra types.
```rust
#[non_exhaustive]
pub enum Law {
Commutative,
Associative,
Identity(Value),
Absorbing(Value),
SelfInverse(Value),
Idempotent,
Involutive,
Distributive { over: BinaryOp },
Bounded { min: Value, max: Value },
ZeroProduct,
Monotonic { direction: MonotonicDirection },
}
```
Wire into existing `algebra/checker.rs` so each Law variant maps to a
checker function. Claiming a Law without a matching checker is a compile
error via exhaustive match.
### 4.7 Categories and versions
Files: `src/spec/categories.rs` (already the Category enum — just placeholders
for Category-level invariants) and `src/spec/versions.rs` (Version type,
`CURRENT_VERSION: Version`, helpers for `since_version` checks that enforce
I13/I15).
### 4.8 Phase 1 gate
- `cargo check --workspace` passes.
- `build.rs` runs `verify_coverage()` successfully on an empty registry (no
ops yet).
- `schema.md` exists and is readable.
- Every new file has a doc comment.
---
## 5. Phase 2 — Proof core
**Goal:** the reference interpreter works for current integer ops, one
OpSpec (add) is fully populated and proven, and law declarations go through
the inference engine mechanically.
### 5.1 Canonical OpSpec for BinOp::Add
Files: `src/spec/ops/add.rs`, `src/spec/tables/add.rs`.
This is the template every other op copies from. Spend disproportionate care
here.
- `name: "add"`
- `category: Category::A`
- `signature: (u32, u32) -> u32`
- `reference_fn: |args| Value::U32(args[0].as_u32().wrapping_add(args[1].as_u32()))`
- `laws`: Commutative, Associative, Identity(0), all ExhaustiveU8
- `spec_table`: ~20 hand-written rows covering:
- (0, 0) = 0 — identity
- (0, 1) = 1
- (u32::MAX, 1) = 0 — overflow wrap
- (u32::MAX, u32::MAX) = u32::MAX - 1 — double overflow
- (2^31, 2^31) = 0 — sign boundary
- (0x55555555, 0xAAAAAAAA) = 0xFFFFFFFF — alternating pattern fill
- ...each row has a `rationale` explaining why it's in the table
- `archetypes: &[A1, A2, A3, A5]` — references into the archetype registry
- `mutation_sensitivity: &[ArithmeticMutations, ConstantMutations]`
- `oracle_override: None`
- `since_version: Version::V1_0`
- `category_c_fallback: None`
- `docs_path: "docs/ops/primitive/add.md"`
Heavy doc comments throughout. This file is read by every contributor adding
a new op. Make it didactic.
Acceptance test: `cargo test -p vyre-conform spec::ops::add` — verifies the
OpSpec entry is well-formed, the reference_fn produces the expected value for
every SpecRow, and `verify_coverage()` approves it.
### 5.2 Reference interpreter
Files: `src/reference/*.rs`.
The oracle for I3 (backend equivalence) and the definition of what every op
means. This must be:
- **Pure Rust**, no wgpu, no unsafe, no FFI.
- **Obviously correct** — one function per Expr variant, one function per
Node variant, small and readable.
- **Slow** — readability trumps performance by orders of magnitude.
- **Complete** — handles every current Expr and Node variant, every BinOp,
every UnOp, every AtomicOp, every BufferAccess kind including Workgroup.
- **Uses vyre's cpu.rs reference_fn for primitive ops** — do not duplicate op
semantics, import them.
Structure:
- `interp.rs` — `pub fn run(program: &Program, inputs: &[Value]) -> Result<Vec<Value>, InterpError>`
- `eval_expr.rs` — dispatches on Expr variants
- `eval_node.rs` — dispatches on Node variants, handles control flow
- `atomics.rs` — sequential semantics (single-threaded reference is easier
and equivalent for correctness)
- `workgroup.rs` — simulates workgroup memory and barrier semantics
- `ieee754.rs` — strict float path stubs; bodies are `todo!()` until floats
land in vyre itself
Acceptance test: `cargo test -p vyre-conform reference::smoke` — runs a
hand-written Program using `BinOp::Add` through the reference interpreter,
verifies result matches `add.rs::reference_fn` bit-exactly. Then runs every
op currently in `vyre/src/ops/primitive/` and verifies reference interpreter
agrees with CPU reference_fn for a witnessed-u32 sample of 10,000 random
inputs per op.
If the reference interpreter disagrees with any cpu.rs for any input: the
test fails and you stop. One of them is wrong. Figure out which, fix it,
rerun. This is a Phase 2 blocker — do not proceed to Phase 3 with a broken
oracle.
### 5.3 Layer 2 wiring — law inference mandatory
File: `src/layers/layer2_law_inference.rs`.
- Existing `algebra/inference.rs` already tries every law automatically.
- Modify the op registration path so hand-declared laws are *verified*, not
trusted. If a contributor submits `laws: &[Commutative]` on an op, the
inference engine runs the commutativity checker at build time; if it
disagrees, the build fails.
- Make inference the *default* contribution path. TEMPLATE_OP.rs drops the
"declare your laws" slot and replaces it with "submit only the
reference_fn; laws are inferred."
- Provide an escape hatch for the rare case where inference is too slow:
`DeclaredLaw` can be provided explicitly only if accompanied by a
`#[verified_manually = "reason"]` attribute. This marks it as needing
human review every time it's touched.
Acceptance test: `cargo test -p vyre-conform layers::layer2::inference_rejects_false_claims`
— attempts to register a fake op claiming commutativity with a
non-commutative reference_fn; verifies the build-time check catches it.
### 5.4 Phase 2 gate
- Reference interpreter passes smoke test.
- Reference interpreter agrees with cpu.rs for every existing primitive op
on 10K witnessed inputs each.
- `add.rs` OpSpec verifies through `verify_coverage()`.
- Inference engine rejects a deliberately-false commutativity claim at build
time.
---
## 6. Phase 3 — Catalogs (archetypes, mutations, oracles, templates)
**Goal:** the four human-designed catalogs exist and are indexable from an
OpSpec. Each catalog is a registry; entries are referenced by id from
OpSpecs.
Can run in parallel with Phase 4.
### 6.1 Archetype library
Files: `src/archetypes/{arithmetic,structural,composition,backend}.rs` + `mod.rs`.
Trait:
```rust
pub trait Archetype: Sync {
fn id(&self) -> &'static str;
fn name(&self) -> &'static str;
fn description(&self) -> &'static str;
fn applies_to(&self, signature: &OpSignature) -> bool;
fn instantiate(&self, op: &OpSpec) -> Vec<TestInput>;
}
```
Implement all archetypes:
**Arithmetic:**
- A1 Identity pair — `(0, 0)`, `(1, 1)`, `(x, x)` for every `x`
- A2 Overflow pair — `(u32::MAX, 1)`, `(u32::MAX, u32::MAX)`, `(i32::MIN, -1)` for signed ops
- A3 Power-of-two boundary — `(2^31 - 1, 2^31)`, every `2^k ± 1` for k in 0..32
- A4 Shift saturation — `(x, 31)`, `(x, 32)`, `(x, 33)`, `(x, 63)`, `(x, u32::MAX)`
- A5 Bit pattern alternation — `0x55555555`, `0xAAAAAAAA`, `0xF0F0F0F0`, `0xDEADBEEF`, `0xCAFEBABE`
- A6 Division zero — `(x, 0)` for every divisible op
- A7 Self-inverse trigger — `(x, x)` for XOR, `(x, -x)` for add
**Structural (for IR):**
- S1 Minimum program — single buffer, single op, no control flow
- S2 Maximum nesting — loops inside conditionals inside loops, up to configured limit
- S3 Node-count boundary — `max_nodes - 1`, `max_nodes`, `max_nodes + 1`
- S4 Dead code — computes never used; lowering must not emit them
- S5 Diamond dataflow — `x → (a, b) → merge`
- S6 Long dependency chain — N sequential ops, N = 1000
- S7 Wide fanout — one value consumed by N ops, N = 1000
- S8 Workgroup memory contention — every thread writes same workgroup index
- S9 Workgroup memory uniqueness — every thread writes own index
- S10 Atomic contention — all threads atomic-add to same slot
- S11 Off-by-one buffer — access at `index = len - 1`, `len`, `len + 1`
- S12 Shadowing attempt — variable redeclaration (must be rejected)
- S13 Barrier under divergent control — must be rejected (V-rule)
- S14 Type confusion — passing u32 where f32 expected
- S15 Empty buffer — buffer declared with `count = 0`
**Composition:**
- C1 Identity composition — `f(x)` vs `compose(f, identity)(x)`
- C2 Associativity triple — `(a · b) · c` vs `a · (b · c)`
- C3 Commutativity swap — `f(a, b)` vs `f(b, a)`
- C4 Involution pair — `f(f(x))` vs `x`
- C5 Idempotent collapse — `f(f(x))` vs `f(x)`
- C6 Absorbing short-circuit — `f(absorbing_elem, x)` vs `absorbing_elem`
- C7 Distributivity cross — `a · (b + c)` vs `(a · b) + (a · c)`
**Backend:**
- X1 Single op, every backend
- X2 Random program, every backend
- X3 Numerical worst case
- X4 Resource saturation — max threads × max buffer × max iterations
Each archetype's `instantiate` is pure: given an OpSpec, produce the concrete
test input list. No I/O, no randomness without a seed.
The global registry:
```rust
pub const ARCHETYPES: &[&'static dyn Archetype] = &[
&arithmetic::A1_IdentityPair, &arithmetic::A2_OverflowPair, ...
];
```
### 6.2 Mutation catalog
Files: `src/mutations/*.rs` + `mod.rs`.
```rust
#[non_exhaustive]
pub enum Mutation {
ArithOpSwap { from: BinOp, to: BinOp },
ConstantIncrement { by: i64 },
CompareInvert,
ControlFlowDelete { branch: Branch },
BufferIndexShift { by: i32 },
AtomicOrderingWeaken,
IrDataTypeSwap { from: DataType, to: DataType },
LawFalselyClaim { law: Law, op: OpId },
LawIdentityCorrupt { op: OpId, wrong: Value },
LowerRemoveBoundsCheck,
LowerRemoveShiftMask,
BytecodeConverterSwap { from: Opcode, to: Opcode },
// ... complete set
}
#[non_exhaustive]
pub enum MutationClass {
ArithmeticMutations,
ComparisonMutations,
BitwiseMutations,
ControlFlowMutations,
BufferAccessMutations,
OrderingMutations,
IrStructuralMutations,
LawMutations,
LoweringMutations,
ConstantMutations,
}
pub fn class_of(m: Mutation) -> MutationClass;
pub fn apply(source: &str, mutation: &Mutation) -> String;
pub fn mutations_for(class: MutationClass) -> Vec<Mutation>;
pub const MUTATION_CATALOG: &[Mutation] = &[...];
```
Full list:
**Arithmetic mutations:**
- `+` ↔ `-`, `*` ↔ `/`, `*` ↔ `+`
- `wrapping_add` → `saturating_add` / `checked_add`
- `shl` ↔ `shr`
- Constants: `0` → `1`, `1` → `0`, `N` → `N ± 1`, `u32::MAX` → `0`
**Comparison mutations:**
- `<` ↔ `<=`, `>` ↔ `>=`, `==` ↔ `!=`
- `<` ↔ `>`, invert boolean results
**Bitwise mutations:**
- `&` ↔ `|`, `^` ↔ `&`, `^` ↔ `|`
- `!x` → `x`, `x & mask` → `x`
**Control flow mutations:**
- Delete `if` branch (take only else)
- Delete `else` branch (take only if)
- Invert condition (`!cond`)
- Loop `0..n` → `0..n+1`, `0..n-1`, `1..n`
- `break` → `continue`, `continue` → (removed)
**Buffer access mutations:**
- `buffer[i]` → `buffer[i-1]`, `buffer[i+1]`, `buffer[0]`, `buffer[len-1]`
- Read vs write swap
- Ordering: `Release` → `Relaxed`, `Acquire` → `Relaxed`, `SeqCst` → `Acquire`
**IR-specific mutations:**
- `BinOp::Add` → `BinOp::Sub` in primitive op definitions
- `DataType::U32` → `DataType::I32` in signatures
- Delete a law from the OpSpec declaration
- Swap `reference_fn` for the wrong op's reference_fn
- `BufferAccess::Storage` → `BufferAccess::Workgroup`
- Remove a validation rule entry
- Swap two opcodes in the bytecode → IR converter
- Emit the wrong WGSL op in the lowering table
- Change `workgroup_size` computation
- Remove bounds check in lowering
- Remove shift mask in lowering
**Law-level mutations (non-negotiable):**
- Falsely claim commutativity on a non-commutative op
- Falsely claim associativity
- Claim wrong identity element
- Claim wrong absorbing element
`apply` modifies Rust source. Can use `syn` for AST-level mutations or regex
for simple textual ones. AST-level is more robust; prefer it for the
IR-specific mutations.
Without `LawFalselyClaim` and `LawIdentityCorrupt` mutations, the whole proof
engine is a fiction. Tests must kill these mutations for every declared law
on every op.
### 6.3 Oracle hierarchy
Files: `src/oracles/*.rs` + `mod.rs`.
```rust
pub trait Oracle: Sync {
fn kind(&self) -> OracleKind;
fn applicable_to(&self, op: &OpSpec, property: &Property) -> bool;
fn verify(&self, op: &OpSpec, input: &[Value], observed: &Value) -> Verdict;
}
pub fn resolve(op: &OpSpec, property: &Property) -> &'static dyn Oracle;
```
`resolve` implements the decision tree: Law > SpecTable > ReferenceInterpreter
> CpuReference > Composition > External > Property. First applicable in this
order wins.
Decision rule (encoded in resolve):
```
Given (op, input_class, property_to_check):
1. If a declared law covers (op, property) → use the law checker as oracle
2. Elif (input, expected) is in the spec table → use the spec table
3. Elif the op has a CPU reference_fn → use reference_fn as oracle
4. Elif this is a composition → use the composition theorem
5. Else → property test (weakest), flag for human review
```
Every generated test declares which oracle it uses. If a test uses "property"
as oracle when a law exists, the generator is wrong.
**The critical rule:** tests never derive their expected output from the
implementation they're testing. The oracle must be independent —
hand-written, law-derived, reference-interpreter-derived, or external.
The generator uses `resolve` at generation time to select the oracle for each
test. Selecting a weaker oracle when a stronger one is applicable is
*impossible* because the generator never makes the choice — the hierarchy
does.
Acceptance test: unit tests that feed concrete (op, property) pairs through
`resolve` and verify the expected oracle comes out.
### 6.4 Templates
Files: `src/templates/*.tmpl` (text) + `src/templates/mod.rs` (loader).
Templates are plain text, included at compile time via `include_str!`. Loader
provides:
```rust
pub fn render(template: &str, context: &TemplateContext) -> Result<String, TemplateError>;
```
Interpolation syntax: `{field}` for scalar, `{list:name}\n...\n{/list}` for
loops, `{if:cond}...{/if}` for conditionals. Keep it tight — Mustache-subset.
**preamble.tmpl:**
```
You are writing ONE test. Not 1000. Not 100. One.
You have unlimited time. There is no deadline.
DO NOT batch-generate similar tests. DO NOT optimize for speed.
Your output will be mutation-tested. If a single-character change to the
source code makes your test pass, the test is rejected and you will be
asked to strengthen it. Your only metric is: does this test catch bugs?
```
**op_correctness.tmpl:**
```
{include:preamble}
TARGET: {op_name} from vyre/src/ops/primitive/{file}.rs
PROPERTY: The op must return {expected} when given inputs {inputs}.
ORACLE: Spec table row from {spec_path}:{line}
CATEGORY: {category}
REQUIRED:
- The test MUST call the op via ir::Program, not via the cpu.rs direct function.
- The test MUST lower the Program and run it through the reference interpreter.
- The test MUST assert byte-equality with {expected}.
- The test MUST be named `test_{op}_{input_desc}_spec_table`.
FORBIDDEN:
- DO NOT derive the expected value from calling the function.
- DO NOT use proptest, quickcheck, or random inputs. This is a specific-input test.
- DO NOT add helper functions. Inline everything.
- DO NOT test multiple properties in one test.
Your test will be mutation-graded. If I delete the `{op}` case from the
primitive/{file}.rs dispatch and your test still passes, the test is rejected.
Output: ONE #[test] function. Nothing else.
```
**law.tmpl:**
```
{include:preamble}
TARGET: {op_name}
PROPERTY: {law_name} ({law_description})
ORACLE: Algebraic law — the checker has verified this law holds exhaustively on u8.
INPUT CLASS: {input_class}
Write a test that asserts {law_formula} for inputs {specific_inputs}.
The test must fail if:
- The law declaration is removed from the OpSpec for {op}
- The identity/absorbing element in the law declaration is corrupted
- The op's reference_fn is swapped with a non-{law}-satisfying function
REQUIRED:
- Use only values from the input class.
- Use the reference interpreter as execution environment.
- Assert the law formula directly, not by computing expected values.
```
**validation.tmpl:**
```
{include:preamble}
TARGET: Validation rule {V_NNN}
PROPERTY: must_reject({description})
ORACLE: V{NNN} rule definition in vyre/src/ir/validate.rs
Construct an ir::Program that violates exactly {V_NNN} and no other rule.
Assert that validate(&program) contains a ValidationError for {V_NNN}.
REQUIRED:
- The program must be otherwise well-formed.
- The error list must contain exactly one error, with rule == V{NNN}.
- If you cannot construct a program that violates ONLY this rule, output:
FAIL: rule V{NNN} is not independently triggerable.
This is a finding — the validation rule set is not separable.
```
**backend_equiv.tmpl:**
```
{include:preamble}
TARGET: op {op_name} on every conformant backend
PROPERTY: I3 (backend equivalence)
ORACLE: Reference interpreter
Build a Program using {op} with inputs {input_class}.
Run it on every registered backend.
Assert every result is byte-identical to the reference interpreter.
REQUIRED:
- If only one backend is registered, skip with reason "I3 needs ≥2 backends."
- Iterate backends via the registry, not hardcoded names.
- Fail with a diff showing which backends disagreed.
```
**archetype.tmpl:**
```
{include:preamble}
TARGET: {op_name}
ARCHETYPE: {archetype_id} ({archetype_name})
PROPERTY: {property}
ORACLE: {oracle_chosen_from_hierarchy}
The archetype is: {archetype_description}
Instantiate the archetype for {op_name}. Use inputs that match the archetype shape.
Assert the expected property using the chosen oracle.
```
**mutation_kill.tmpl:**
```
{include:preamble}
TARGET: {op_name}
PROPERTY: kill mutation {mutation_description}
ORACLE: {oracle_chosen_from_hierarchy}
The mutation gate reports that the following mutation survives all current tests:
{mutation_details}
Write a test that fails when the above mutation is applied to {op_name}'s source.
```
Every template starts with `{include:preamble}` which expands to the
"ONE test, not batched, mutation-graded" warning.
Every template declares REQUIRED and FORBIDDEN sections. The REQUIRED and
FORBIDDEN text is passed to the model verbatim — they are contracts the
model is told will be checked.
Acceptance test: render every template with a mock context; verify no
interpolation errors; verify the output contains every REQUIRED/FORBIDDEN
literal string.
---
## 7. Phase 4 — Harnesses
**Goal:** every harness primitive exists and is testable in isolation. H6
(mutation gate) is the critical piece.
Can run in parallel with Phase 3 except H9 (coverage) which depends on the
Phase 1 registry.
### 7.1 H6 — mutation gate (build first)
File: `src/harnesses/mutation.rs`.
This is the single most load-bearing piece of the whole conform system.
Build it first among the harnesses.
```rust
pub struct GateReport {
pub test_name: String,
pub source_file: PathBuf,
pub mutations_attempted: Vec<Mutation>,
pub mutations_killed: Vec<Mutation>,
pub mutations_survived: Vec<Mutation>,
pub duration: Duration,
pub feedback: Vec<StructuredFeedback>,
}
pub struct StructuredFeedback {
pub mutation: Mutation,
pub hint: String, // "Your test passed when I changed X to Y. Add assertion that distinguishes these."
}
pub fn mutation_probe(
source_file: &Path,
test_fn_name: &str,
mutation_classes: &[MutationClass],
) -> GateReport;
```
Mechanism:
1. For each mutation in the catalog whose class is in `mutation_classes`:
2. Snapshot `source_file`.
3. Apply the mutation.
4. Run `cargo test -- <test_fn_name>` in a workdir with the mutated source.
5. If the test FAILED → mutation killed (good).
6. If the test PASSED → mutation survived (bad, finding).
7. Restore the snapshot.
8. Continue until all applicable mutations are tried.
Parallelism: can run multiple mutations concurrently using copy-on-write
source trees. Use `cargo` with separate target dirs per worker. Skip this
optimization for Phase 4; add in Phase 5 if needed.
Also exposed as a standalone binary: `vyre-conform mutation-gate --source
<path> --test <name> --classes <...>`. Used by CI and by the agent stream.
Acceptance test:
- Feed a hand-written test that asserts `add(2, 3) == 5`.
- Apply the `ArithOpSwap { from: Add, to: Sub }` mutation to the add op source.
- Verify `mutation_probe` reports the mutation as killed.
- Apply a mutation the test shouldn't catch (say, changing an unrelated function).
- Verify `mutation_probe` reports survival and produces structured feedback
explaining what the test failed to distinguish.
- If either case is wrong, **stop**. The gate is the floor; if it's broken,
nothing else matters.
### 7.2 H1–H5 and H7–H10
Build in parallel after H6.
**H1 `with_every_backend(program, |backend, result|)`** — iterates the
backend registry, runs the program on each, yields results. Initially only
one backend (wgpu); add the reference interpreter as a "backend" so I3 tests
can run with a single real backend.
**H2 `with_oom_injection(program, fail_at)`** — requires a test-only
allocator crate feature. Fails the Nth allocation; verifies graceful
handling. Mark with `#[cfg(feature = "oom-injection")]` so production builds
don't include it.
**H3 `with_fault_at_dispatch(program, dispatch_index)`** — simulates device
loss. In the wgpu backend, hook the dispatch call to return an error after
N calls. Test-only.
**H4 `with_backend_swap(program, swap_after)`** — runs the first N ops on
backend A, the next ops on backend B, verifies state transfer is legal or
rejected correctly. Tests the backend-independence claim.
**H5 `run_determinism_suite(program, iterations, threads)`** — runs the
program `iterations` times across `threads` threads, asserts all results
byte-identical. Tests I1.
**H7 `reference_diff(program, inputs)`** — runs the program on the reference
interpreter AND every registered backend, diffs the results, returns a
structured report of any divergence with minimized input via the minimizer.
**H8 `law_probe(op, law)`** — runs the law checker on `(op, law)`, returns
`(verified, counterexample?, minimized_counterexample?)`. Thin wrapper over
existing `algebra/checker.rs`.
**H9 `spec_coverage_check()`** — iterates the op registry, verifies every op
has: reference_fn, at least one declared law (or explicit "no laws" marker),
spec table entries, category assignment. Fails if any op is missing any
required field. Depends on Phase 1 registry.
**H10 `composition_proof_check(program)`** — runs the composition theorem
checker on the program, returns the proof token or a list of theorems that
don't hold. Tests assert the token is present.
Each harness is a standalone function with a clear contract. Write unit
tests for each that feed known-good and known-bad inputs and verify the
harness detects each case.
### 7.3 Phase 4 gate
- H6 passes its acceptance test (can catch ArithOpSwap mutation, produces
structured feedback for survivors).
- All H1-H10 harnesses have unit tests that pass.
- H9 rejects a registry entry missing `reference_fn`.
---
## 8. Phase 5 — Generator, xtask, layer enforcement
**Goal:** the pipeline runs end-to-end. Spec → generator → materialized
tests → cargo test → mutation gate → pass/fail. Layer enforcement wires the
discipline into the build.
### 8.1 Generator
Files: `src/generator/*.rs` + `src/bin/vyre-gen-tests.rs`.
```rust
pub struct GenerationPlan {
pub ops: Vec<&'static OpSpec>,
pub archetypes: Vec<&'static dyn Archetype>,
pub templates: Vec<TemplateKind>,
pub seed: u64,
}
pub fn generate(plan: &GenerationPlan, out_dir: &Path) -> Result<GenerationReport, GenError>;
```
`cross_product.rs` — iterates `(op × archetype × oracle)` tuples from the
plan, filters by `Archetype::applies_to(op.signature)`, routes each tuple to
the appropriate template, emits one test per tuple.
`emit.rs` — Rust code writer. Takes rendered template output + test name +
provenance header, writes a syntactically valid `#[test]` function to a file
in `tests_generated/`. Uses `syn` to parse the rendered output and verify
it's valid Rust before writing; if rendering produced invalid syntax, the
generator fails fast with a clear error.
`naming.rs` — deterministic test names: `test_{op}_{archetype}_{oracle}_{seed_hash}`.
Same inputs always produce the same name. No collisions.
`seed.rs` — RNG wrapper that takes a master seed and derives per-test seeds
deterministically. Same master seed → same test set. Reproducible.
`provenance.rs` — every generated file starts with a comment header:
```
// GENERATED by vyre-gen-tests v{gen_version}
// spec version: {spec_version}
// source tuple: op={op}, archetype={archetype}, oracle={oracle}
// master seed: {seed}
// generated at: {timestamp}
// DO NOT EDIT. Regenerate via `cargo xtask generate-tests`.
```
Binary `vyre-gen-tests`:
```
vyre-gen-tests --op add --out tests_generated/
vyre-gen-tests --all --out tests_generated/
vyre-gen-tests --plan plan.toml --out tests_generated/
```
Acceptance test: running `vyre-gen-tests --op add --out /tmp/gen` produces a
directory of test files, all syntactically valid, with provenance headers,
all oracles chosen via hierarchy (no "property" oracles when a stronger one
exists).
### 8.2 xtask commands
Create top-level `xtask/` crate at the workspace root.
Commands:
- `cargo xtask generate-tests [--op NAME | --all]` — wraps generator binary
- `cargo xtask mutation-gate [--tests PATH]` — runs H6 across all agent-written
tests, emits report
- `cargo xtask coverage-check` — runs H9 + registry verification
- `cargo xtask conform-verify` — runs the whole pipeline end-to-end:
generate, mutation-gate, coverage, `cargo test`. Exit non-zero on any
failure. This is what CI runs.
Each command has `--help`. `conform-verify` is the "is vyre-conform healthy"
check.
### 8.3 Layer enforcement
Files: `src/layers/*.rs`.
**Layer 1 — Executable spec** (`layer1_executable_spec.rs`)
- Hook `verify_coverage()` into `build.rs`.
- Add a compile-time assertion that `OP_REGISTRY.len() > 0` in release builds.
- Integration test that builds a fake vyre-conform consumer and verifies it
can iterate OP_REGISTRY.
**Layer 2 — Law inference mandatory** (already done in Phase 2)
**Layer 3 — Reference interpreter** (`layer3_reference_interp.rs`)
- Thin wrapper that re-exports `::reference` functions.
- Provides integration hook used by H7 and I3 tests.
**Layer 4 — Mutation gate** (`layer4_mutation_gate.rs`)
- Thin wrapper over `harnesses::mutation`.
- Integration hook for CI: `layer4::enforce_on(test_set) -> Result<(), Vec<SurvivingMutation>>`.
- CI fails if any surviving mutation is found in the agent-accepted test set.
**Layer 5 — Adversarial roles** (`layer5_adversarial.rs`)
- Implementor / Prosecutor / Defender three-role contribution flow.
- Implementor submits op + reference_fn.
- Prosecutor writes tests graded by mutation kill count.
- Defender writes intentionally-wrong reference_fn; if Prosecutor's tests
don't catch Defender's wrong impl, Prosecutor is rejected with structured
feedback.
- Mechanical adjudication: no human judgment in the loop.
- Integrates with the agent stream as three distinct task kinds.
**Layer 6 — Stability** (`layer6_stability.rs`)
- `build.rs` check: audit all public enums in vyre and vyre-conform for
`#[non_exhaustive]`. Fail build on any public enum without it.
- Version lock enforcement: when a new OpSpec is added, `since_version` must
equal `CURRENT_VERSION`. Cannot add ops claiming older version.
- Audit script that diffs public API between two versions, reports any
removed or changed items as breaking.
**Layer 7 — Composition proof** (`layer7_composition_proof.rs`)
- Extend `ir::Program` construction to compute a proof token when composed.
- Proof token encodes: "this composition preserves laws L1, L2 because
composition theorems T1, T2 apply."
- `lower()` refuses to lower programs whose proofs don't check out.
- Proof check is fast (O(nodes)); no runtime overhead during actual
execution, only at construction/lowering.
**Layer 8 — Feedback loop** (`layer8_feedback_loop.rs`)
- Target: <10s contribution feedback.
- Incremental mutation runs: cache mutation-probe results keyed by
`(source_hash, test_hash, mutation)`. Only re-run mutations whose cache
entry is invalidated.
- Warm reference interpreter: keep it in-process across test runs, not
spawned fresh per test.
- `cargo xtask quick-check --op NAME` runs the minimal verification path
for one op in <10s.
### 8.4 Phase 5 gate
- `cargo xtask conform-verify` runs end-to-end and exits cleanly.
- `cargo xtask quick-check --op add` runs in <10s.
- Layer 6 audit passes (all public enums `#[non_exhaustive]`).
- Layer 7 rejects a deliberately-broken composition.
---
## 9. Phase 6 — Calibration
**Goal:** the pipeline produces output that meets the bar on a single op,
before being turned loose on the whole registry.
### 9.1 Generate all tests for BinOp::Add
`cargo xtask generate-tests --op add --out tests_generated/`
Verify:
- Every (archetype × oracle) combination that applies to add produces a test.
- Every test has a provenance header.
- `cargo test` on the generated tree passes (or fails with findings in the
actual implementation, which are real bugs to fix).
- `cargo xtask mutation-gate --tests tests_generated/add/` reports every
mutation in the ArithmeticMutations and ConstantMutations classes as
killed. If any survive, generate more tests until they don't.
### 9.2 Audit 10 random generated tests by hand
Read them. Check:
- Is the assertion meaningful, or could a single-character change in the
implementation bypass it?
- Is the oracle the strongest applicable?
- Is the test self-contained (no helpers, no indirection)?
- Is the rationale for why this test exists clear from reading it?
If any test fails this audit, the generator or the templates are wrong. Fix
them. Regenerate. Audit 10 more. Repeat until audit is clean.
### 9.3 Strengthen templates and gate
Based on audit findings, strengthen templates. Common fixes:
- Add more FORBIDDEN patterns ("DO NOT use `assert!(result.is_ok())`").
- Narrow the input class description.
- Add explicit assertion patterns the model must use.
Strengthen the mutation gate with any mutations the audit revealed aren't
covered.
### 9.4 Compare against existing hand-written add tests
`vyre/tests/primitive_programs.rs` has hand-written tests for add. Run
mutation gate against both the hand-written tests and the generated tests.
The generated set must kill a strict superset of mutations. If the
hand-written set catches something the generated set doesn't, add the
missing archetype/template/mutation and regenerate.
Do not delete the hand-written tests until the generated set is strictly
stronger. When it is, the hand-written tests are replaced by the generated
set and Git records the migration.
### 9.5 Phase 6 gate
- Generated test set for BinOp::Add is strictly stronger than existing
hand-written tests (per mutation gate comparison).
- Hand audit of 10 random generated tests finds zero slop.
- `cargo xtask conform-verify` runs cleanly end-to-end on the add-only path.
When this gate passes, scaling to all 32 primitive ops is mechanical: run
the same pipeline with `--all` instead of `--op add`. If Phase 6 passes for
add, it passes for everything structurally similar.
---
## 10. What NOT to do
- **Do not commit generated tests.** `tests_generated/` is in `.gitignore`.
If you find yourself adding test files to git outside of `regression/` or
`agent_accepted/`, stop and reconsider.
- **Do not weaken a template to make a test pass.** Templates are contracts.
If a test the template produces is weak, the template is wrong and must be
strengthened, not the other way around.
- **Do not let tests derive expected values from the implementation.** Every
oracle must be independent of the code being tested. A test that computes
`let expected = add(a, b); assert_eq!(add(a, b), expected)` is slop and
the gate must reject it.
- **Do not skip the audit in Phase 6.** Running the pipeline on all 32 ops
without verifying quality on one op first is the path to generating 32x
slop instead of 1x slop.
- **Do not add a Category B variant.** Runtime abstraction is forbidden.
Every op is A or C. If something seems like it wants to be B, reconsider
whether it's really an op or whether it's a composition of real ops.
- **Do not let a declared law exist without a Verification provenance.** A
law the checker hasn't verified is a lie. Compile error.
- **Do not accept a PR that touches `agent_accepted/` without a provenance
sidecar.** Every file in that directory has a `.meta` file recording who
(which agent id), when, what mutations were killed, who approved. Missing
metadata = rejected PR.
- **Do not modify existing hand-written tests to match broken implementation.**
If a hand-written test fails, the implementation is probably wrong, not
the test. Investigate before touching the test.
---
## 11. Parallelism and dispatch
Each phase has a list of artifacts, most of which are independent. Parallel
agent dispatch strategy:
**Phase 1** (5 parallel)
- schema.md → Codex Spark or Gemini
- Directory scaffolding → Kimi (fast, mechanical)
- Spec schema types → Codex 5.4 (load-bearing)
- Invariants → Gemini
- Laws enum → Gemini (different account from invariants)
**Phase 2** (3 tracks, reference interpreter is largest)
- Canonical add OpSpec → Codex 5.4 (sets the quality bar)
- Reference interpreter → Codex 5.4 or sequential Kimi passes
- Layer 2 wiring → Gemini
**Phase 3** (4 parallel)
- Archetype library → Codex Spark
- Mutation catalog → Gemini
- Oracle hierarchy → Gemini (different account)
- Templates → Kimi
**Phase 4** (H6 first, then 3 parallel)
- H6 mutation gate → Codex 5.4 (the critical gate)
- H1-H5 → Kimi
- H7-H10 → Kimi (second instance)
**Phase 5** (3 parallel)
- Generator → Codex Spark
- xtask commands → Kimi
- Layer enforcement 1/4/5/6/7/8 → Codex 5.4
**Phase 6** (serial, mostly human judgment)
- Run generator on add → mechanical
- Audit 10 tests by hand → HUMAN ONLY
- Strengthen and regenerate → mechanical
- Compare to hand-written → mechanical
**Rate-limit consideration:** Codex 5.4 is the bottleneck (tasks appear in
every phase). Pre-dispatch a Kimi fallback for each Codex task in case of
rate limits. Gemini tasks respect the "3 per account" rule — never switch
accounts until the current batch of 3 is done.
---
## 12. The load-bearing pieces
These five must be right or everything downstream is corrupted. Allocate
disproportionate review attention to these:
1. **Spec schema types** (Phase 1) — the data shape every other component
reads from.
2. **Canonical add OpSpec** (Phase 2) — the template every other op copies.
Didactic quality matters.
3. **Reference interpreter** (Phase 2) — the oracle. Bugs here silently
corrupt everything.
4. **Mutation gate H6** (Phase 4) — the quality floor. A weak gate lets slop
through.
5. **Layer enforcement wiring** (Phase 5) — the discipline that makes the
gates actually enforce.
For each of these, after the agent produces the code:
- Read it line by line.
- Run the acceptance test manually.
- Check that the acceptance test is meaningful (does it actually verify the
thing it claims to verify?).
- If in doubt, redispatch with corrections.
---
## 13. Daily calibration (forever)
Once the pipeline is running and producing tests:
**Every day, read 10 random committed agent tests and delete any that don't
meet the standard.**
This is the calibration loop. If the gate is letting slop through, deletions
force you to strengthen the gate. If nothing needs deleting, you're done
calibrating and can increase batch size. Ten a day, every day, forever.
This is how SQLite stays at 590x ratio *and* high quality — they don't
trust the machinery, they audit it.
**Every bug found in production adds a new mutation to the catalog.** If a
bug class isn't in the catalog, the gate can't catch it, and the next
instance of that bug will ship. Post-mortem rule: every bug fix adds one
mutation operator.
---
## 14. Contact points
- Schema questions → re-read `schema.md`, which is the single source of
truth for the directory layout and provenance rules.
- Spec type questions → `src/spec/mod.rs` doc comments.
- Invariant questions → `src/spec/invariants.rs`, each has a description
field.
- "Should this be a primitive or a composition?" → Category A+C rule. If it
composes from existing ops, it's a composition. If it requires new
hardware, it's Category C. Never a new primitive otherwise.
- "Why is this law declared without ExhaustiveU8?" → the op's value type is
u32 or larger. Use WitnessedU32 with a seed and a count of at least 100K.
- "Where does this generated test go?" → `tests_generated/{op}/{archetype}_{oracle}.rs`.
Not committed.
- "Where does this human-minimized bug go?" → `regression/YYYY-MM-DD-description.rs`.
Committed. Permanent.
- "Where does this agent-approved test go?" → `agent_accepted/{op}/test_NNNN.rs`
with matching `.meta` sidecar. Committed.
---
END OF PLAN