vyre-conform 0.1.0

Conformance suite for vyre backends — proves byte-identical output to CPU reference
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
# vyre-conform execution plan

Handoff document for the agent owning vyre-conform. Read top to bottom once
before starting. Every section has concrete deliverables, no optional work, no
"nice to have." If something is unclear, stop and ask — the cost of a
misunderstanding discovered at Phase 5 is a week of rework.

---

## 0. Mission and bar

vyre-conform is the discipline that prevents vyre from degrading as it grows.
Every new op, every new backend, every new contribution must pass through
vyre-conform's gates. If the gates are strong, vyre can accept contributions
from local LLMs at volume without quality loss. If the gates are weak, vyre
becomes slop regardless of who contributes.

The bar: **a malicious LLM explicitly trying to sneak a wrong op into vyre
must fail at the conform gate, not at human review.** If you can't visualize
how the gate catches adversarial output, the gate is too weak and you rebuild
it.

Not a test suite. A compiler for contributions.

---

## 1. Architectural decisions (locked, do not relitigate)

- **vyre = IR + lowering + ops + bytecode.** Nothing else. The thing being
  verified.
- **vyre-conform = everything about correctness.** Spec, archetypes,
  mutations, oracles, harnesses, templates, generator, minimizer, reference
  interpreter, proof engine, layer enforcement. One crate, one concern.
- **vyre-agent-stream = runtime orchestration.** Separate crate. Imports
  vyre-conform. Not in scope for this plan.
- **Tests are not committed.** Generator materializes them into
  `vyre-conform/tests_generated/` (gitignored) from spec on demand. Source of
  truth is ~100K LOC; materialized tests can be GB.
- **Committed exceptions:** regression corpus (human-minimized real bugs) and
  `agent_accepted/` (tests that passed gate + human approval) live in the
  repo with provenance sidecars.
- **Category A or C, never B.** Compositional or hardware-intrinsic. Runtime
  abstraction is forbidden. Not an enum variant you can add.
- **Floats are in, deterministic via restriction.** Strict IEEE 754 path.
  Three conformance tracks (integer, float, approximate). Tracks never mix in
  a certificate entry.
- **`#[non_exhaustive]` on every public enum.** Adding variants is allowed.
  Changing or removing is not. Breaking userspace is not an option (I13).
- **Laws are proven at build time, not claimed.** Claiming a law without a
  Verification provenance is a compile error.
- **One execution path:** `source → ir::Program → lower → target shader →
  dispatch`. The bytecode VM interpreter path is dead. Bytecode is a wire
  format only.

---

## 2. Directory layout

Build this tree. Existing directories marked EXISTING, new ones marked NEW.

```
vyre-conform/
├── Cargo.toml
├── schema.md                     NEW — single-page architecture doc
├── PLAN.md                       THIS FILE
├── CONTRIBUTING.md               EXISTING — update per Phase 6
├── TEMPLATE_OP.rs                EXISTING — update per Phase 2
├── regression/                   NEW — human-curated minimized repros
│   └── README.md
├── agent_accepted/               NEW — gate-passed, human-approved agent tests
│   └── README.md
├── corpus/                       NEW — shared fuzzing inputs
│   ├── seeds/
│   └── fuzz_targets/
├── tests_generated/              NEW — gitignored, materialized on demand
│   └── .gitkeep
└── src/
    ├── lib.rs                    EXISTING — add new pub mod lines
    ├── spec/                     EXISTING — extend
    │   ├── mod.rs
    │   ├── ops/                  NEW — one file per op OpSpec
    │   │   └── add.rs            NEW — canonical example
    │   ├── tables/               NEW — SpecRow tables per op
    │   │   └── add.rs
    │   ├── invariants.rs         NEW — I1..I15 encoded
    │   ├── laws.rs               NEW — Law enum + DeclaredLaw
    │   ├── categories.rs         NEW — Category A/C enum
    │   ├── versions.rs           NEW — Version + since_version
    │   └── registry.rs           NEW — compile-time coverage check
    ├── archetypes/               NEW
    │   ├── mod.rs                — Archetype trait + ArchetypeRef
    │   ├── arithmetic.rs         — A1..A7
    │   ├── structural.rs         — S1..S15
    │   ├── composition.rs        — C1..C7
    │   └── backend.rs            — X1..X4
    ├── mutations/                NEW
    │   ├── mod.rs                — Mutation enum, MutationClass, catalog
    │   ├── arithmetic.rs
    │   ├── comparison.rs
    │   ├── bitwise.rs
    │   ├── control.rs
    │   ├── buffer.rs
    │   ├── ir_specific.rs
    │   └── law.rs
    ├── oracles/                  NEW
    │   ├── mod.rs                — Oracle trait
    │   ├── hierarchy.rs          — decision tree
    │   ├── law.rs
    │   ├── spec_table.rs
    │   ├── reference.rs
    │   ├── cpu_ref.rs
    │   └── composition.rs
    ├── harnesses/                NEW
    │   ├── mod.rs
    │   ├── backend.rs            — H1
    │   ├── oom.rs                — H2
    │   ├── fault.rs              — H3
    │   ├── swap.rs               — H4
    │   ├── determinism.rs        — H5
    │   ├── mutation.rs           — H6 (THE GATE)
    │   ├── reference_diff.rs     — H7
    │   ├── law_probe.rs          — H8
    │   ├── coverage.rs           — H9
    │   └── composition_proof.rs  — H10
    ├── templates/                NEW — include_str!'d at compile time
    │   ├── mod.rs                — template loader + interpolator
    │   ├── preamble.tmpl
    │   ├── op_correctness.tmpl
    │   ├── law.tmpl
    │   ├── validation.tmpl
    │   ├── backend_equiv.tmpl
    │   ├── archetype.tmpl
    │   └── mutation_kill.tmpl
    ├── minimize/                 NEW — extends existing minimizer.rs
    │   ├── mod.rs                — Minimizer trait + driver
    │   ├── arithmetic.rs
    │   ├── structural.rs
    │   ├── composition.rs
    │   └── type_shrink.rs
    ├── generator/                NEW
    │   ├── mod.rs
    │   ├── cross_product.rs      — (op × archetype × oracle) iteration
    │   ├── emit.rs               — Rust code writer
    │   ├── naming.rs             — deterministic test names
    │   ├── seed.rs               — reproducible random inputs
    │   └── provenance.rs         — header comment on every generated file
    ├── reference/                NEW — Layer 3 reference interpreter
    │   ├── mod.rs
    │   ├── interp.rs             — obviously correct IR evaluator
    │   ├── eval_expr.rs
    │   ├── eval_node.rs
    │   ├── atomics.rs
    │   ├── workgroup.rs
    │   └── ieee754.rs             — strict float path (stubs until floats land)
    ├── layers/                   NEW — enforcement wiring
    │   ├── mod.rs
    │   ├── layer1_executable_spec.rs
    │   ├── layer2_law_inference.rs
    │   ├── layer3_reference_interp.rs  — thin wrapper over ::reference
    │   ├── layer4_mutation_gate.rs     — thin wrapper over harnesses::mutation
    │   ├── layer5_adversarial.rs       — implementor/prosecutor/defender
    │   ├── layer6_stability.rs         — non_exhaustive + version audit
    │   ├── layer7_composition_proof.rs — build-time composition checks
    │   └── layer8_feedback_loop.rs     — <10s iteration infrastructure
    ├── algebra/                  EXISTING — checker, gpu_checker, composition, minimizer, inference
    ├── certify/                  EXISTING — formal certificate generator
    ├── streaming/                EXISTING — 100M-scale runner
    ├── specs/engine/             EXISTING — DFA, eval, scatter engine conformance
    └── bin/
        └── vyre-gen-tests.rs     NEW — generator binary entry point
```

Top-level `xtask/` crate at the workspace root (not under vyre-conform) hosts
the workflow commands.

---

## 3. Build order (phases with dependencies)

Six phases. Each phase ends with a gate: run the phase's acceptance test
before starting the next. Do not skip gates.

- **Phase 1 — Foundation** (schema doc, scaffolding, spec types, invariants, laws)
- **Phase 2 — Proof core** (canonical add OpSpec, reference interpreter, Layer 2 law inference wiring)
- **Phase 3 — Catalogs** (archetypes, mutations, oracles, templates)
- **Phase 4 — Harnesses** (H1–H10, with H6 the critical one)
- **Phase 5 — Generator + xtask + layer enforcement** (1, 4, 5, 6, 7, 8)
- **Phase 6 — Calibration** (first end-to-end on BinOp::Add, audit output, strengthen, repeat)

Phases 3 and 4 can overlap heavily since archetypes/mutations/oracles/templates
are independent of harnesses except H9 (coverage) which needs the spec
registry from Phase 1. Parallelize aggressively.

---

## 4. Phase 1 — Foundation

**Goal:** every type, trait, and invariant that other phases reference exists
and type-checks. No implementation bodies beyond what's needed to compile.

### 4.1 schema.md

Single page. Sections: purpose, directory map, committed-vs-generated rule,
provenance format, the spec-of-the-spec pointer (registry.rs), contribution
flow overview. One page. No more. Anyone reading it should know where every
kind of artifact goes in under two minutes.

### 4.2 Directory scaffolding

Create every directory listed in §2. Every `mod.rs` with `pub mod`
declarations. Every file at least empty with a doc comment stating its
purpose. `cargo check` must pass on the empty skeleton before moving on.

Add `tests_generated/` to `.gitignore`. Create `.gitkeep` so the directory
exists in the repo.

### 4.3 Spec schema types

File: `src/spec/mod.rs`, extending existing spec module.

```rust
#[non_exhaustive]
pub struct OpSpec {
    pub name: &'static str,
    pub category: Category,
    pub signature: OpSignature,
    pub reference_fn: fn(&[Value]) -> Value,
    pub laws: &'static [DeclaredLaw],
    pub spec_table: &'static [SpecRow],
    pub archetypes: &'static [ArchetypeRef],
    pub mutation_sensitivity: &'static [MutationClass],
    pub oracle_override: Option<OracleKind>,
    pub since_version: Version,
    pub category_c_fallback: Option<fn(&[Value]) -> Value>,
    pub docs_path: &'static str,
}

#[non_exhaustive]
pub struct SpecRow {
    pub inputs: &'static [Value],
    pub expected: Value,
    pub rationale: &'static str,
    pub source: SpecSource,
}

#[non_exhaustive]
pub struct DeclaredLaw {
    pub law: Law,
    pub verified_by: Verification,
}

#[non_exhaustive]
pub enum Verification {
    ExhaustiveU8,
    ExhaustiveU16,
    WitnessedU32 { seed: u64, count: u64 },
    ExhaustiveFloat { typ: FloatType },
}

#[non_exhaustive]
pub enum Category {
    A,  // compositional, inlines at lowering
    C { fallback_available: bool },  // hardware intrinsic with fallback
}

#[non_exhaustive]
pub enum SpecSource {
    HandWritten,
    DerivedFromLaw(LawId),
    FromCorpus(&'static str),
    FromRegression(&'static str),  // filename in regression/
}

#[non_exhaustive]
pub enum OracleKind {
    Law,
    SpecTable,
    ReferenceInterpreter,
    CpuReference,
    Composition,
    External,
    Property,
}

pub struct Version { pub major: u32, pub minor: u32, pub patch: u32 }

pub type ArchetypeRef = &'static dyn Archetype;
```

Every struct has doc comments that explain the field's purpose and what
checks exist on it.

### 4.4 Registry with coverage checker

File: `src/spec/registry.rs`.

- `pub const OP_REGISTRY: &[&'static OpSpec] = &[...];` populated via a
  declarative macro that every op's `ops/*.rs` file invokes.
- `pub fn verify_coverage() -> Result<(), Vec<CoverageError>>` — iterates the
  registry and checks every OpSpec has required fields populated. Missing
  reference_fn → error. Missing laws field → error (empty slice is allowed
  and explicit, but the field must be set). Missing archetypes → error.
  Missing mutation_sensitivity → error.
- `build.rs` calls `verify_coverage()` and fails the build if any error is
  returned.

The key property: you cannot add an op to the registry without declaring
everything. Omission is a build error, not a runtime warning.

### 4.5 Invariants I1–I15

File: `src/spec/invariants.rs`.

```rust
pub struct Invariant {
    pub id: InvariantId,
    pub name: &'static str,
    pub description: &'static str,
    pub category: InvariantCategory,
    pub test_family: fn() -> Vec<TestDescriptor>,
}

pub const INVARIANTS: &[Invariant] = &[
    Invariant { id: I1, name: "Determinism", ... },
    // ... all 15
];
```

Full list (copy the descriptions verbatim into doc comments):

Execution invariants:
- **I1 (Determinism):** Same ir::Program + same inputs → byte-identical
  outputs, every run, every backend, every device. No exceptions.
- **I2 (Composition commutativity with lowering):** `lower(compose(a, b)) ≡
  lower(a) ; lower(b)` semantically.
- **I3 (Backend equivalence):** For every Program P and every pair of
  conformant backends (B1, B2), `B1.run(P) == B2.run(P)`. Bit-exact.
- **I4 (Bytecode round-trip):** `to_ir(to_bytecode(P)) == P` for every valid P.
- **I5 (Validation soundness):** If `validate(P)` returns empty, `lower(P)`
  must not panic, allocate unboundedly, or produce UB.
- **I6 (Validation completeness — partial):** For every V001-V020 rule,
  there exists a test input that triggers exactly that rule and no others.

Algebra invariants:
- **I7 (Law monotonicity):** If op A declares law L and op B is defined as
  `compose(A, ...)` where the composition preserves L per composition.rs
  theorems, then B automatically has law L. The inference engine must never
  lose a law that the theorems prove.
- **I8 (Reference agreement):** The reference interpreter result equals the
  CPU reference_fn result for every op, every input. Zero tolerance.
- **I9 (Law falsifiability):** For every declared law on every op, removing
  that law from the declaration must cause at least one test to fail. No
  decorative laws.

Resource invariants:
- **I10 (Bounded allocation):** No operation allocates more than
  `program.buffers.total_size + program.workgroup_mem + O(nodes)`.
- **I11 (No panic):** No lowered program can panic the runtime regardless
  of input data.
- **I12 (No UB):** No lowered shader can produce undefined behavior on any
  conformant backend.

Extensibility invariants:
- **I13 (Userspace stability):** A Program valid under vyre v1.x is valid
  and produces identical results under every v1.y where y≥x.
- **I14 (Non-exhaustive discipline):** Adding a new DataType, BinOp,
  AtomicOp, or Expr variant never breaks existing Programs.
- **I15 (Certificate stability):** A conformance certificate issued at v1.x
  remains valid at v1.y if the backend hasn't changed.

Each invariant's `test_family` is a function pointer the generator calls to
produce the concrete tests for that invariant. Bodies can be `todo!()`
initially but the signatures must exist so the generator can reference them.

### 4.6 Laws enum

File: `src/spec/laws.rs`. Extends existing algebra types.

```rust
#[non_exhaustive]
pub enum Law {
    Commutative,
    Associative,
    Identity(Value),
    Absorbing(Value),
    SelfInverse(Value),
    Idempotent,
    Involutive,
    Distributive { over: BinaryOp },
    Bounded { min: Value, max: Value },
    ZeroProduct,
    Monotonic { direction: MonotonicDirection },
}
```

Wire into existing `algebra/checker.rs` so each Law variant maps to a
checker function. Claiming a Law without a matching checker is a compile
error via exhaustive match.

### 4.7 Categories and versions

Files: `src/spec/categories.rs` (already the Category enum — just placeholders
for Category-level invariants) and `src/spec/versions.rs` (Version type,
`CURRENT_VERSION: Version`, helpers for `since_version` checks that enforce
I13/I15).

### 4.8 Phase 1 gate

- `cargo check --workspace` passes.
- `build.rs` runs `verify_coverage()` successfully on an empty registry (no
  ops yet).
- `schema.md` exists and is readable.
- Every new file has a doc comment.

---

## 5. Phase 2 — Proof core

**Goal:** the reference interpreter works for current integer ops, one
OpSpec (add) is fully populated and proven, and law declarations go through
the inference engine mechanically.

### 5.1 Canonical OpSpec for BinOp::Add

Files: `src/spec/ops/add.rs`, `src/spec/tables/add.rs`.

This is the template every other op copies from. Spend disproportionate care
here.

- `name: "add"`
- `category: Category::A`
- `signature: (u32, u32) -> u32`
- `reference_fn: |args| Value::U32(args[0].as_u32().wrapping_add(args[1].as_u32()))`
- `laws`: Commutative, Associative, Identity(0), all ExhaustiveU8
- `spec_table`: ~20 hand-written rows covering:
  - (0, 0) = 0 — identity
  - (0, 1) = 1
  - (u32::MAX, 1) = 0 — overflow wrap
  - (u32::MAX, u32::MAX) = u32::MAX - 1 — double overflow
  - (2^31, 2^31) = 0 — sign boundary
  - (0x55555555, 0xAAAAAAAA) = 0xFFFFFFFF — alternating pattern fill
  - ...each row has a `rationale` explaining why it's in the table
- `archetypes: &[A1, A2, A3, A5]` — references into the archetype registry
- `mutation_sensitivity: &[ArithmeticMutations, ConstantMutations]`
- `oracle_override: None`
- `since_version: Version::V1_0`
- `category_c_fallback: None`
- `docs_path: "docs/ops/primitive/add.md"`

Heavy doc comments throughout. This file is read by every contributor adding
a new op. Make it didactic.

Acceptance test: `cargo test -p vyre-conform spec::ops::add` — verifies the
OpSpec entry is well-formed, the reference_fn produces the expected value for
every SpecRow, and `verify_coverage()` approves it.

### 5.2 Reference interpreter

Files: `src/reference/*.rs`.

The oracle for I3 (backend equivalence) and the definition of what every op
means. This must be:

- **Pure Rust**, no wgpu, no unsafe, no FFI.
- **Obviously correct** — one function per Expr variant, one function per
  Node variant, small and readable.
- **Slow** — readability trumps performance by orders of magnitude.
- **Complete** — handles every current Expr and Node variant, every BinOp,
  every UnOp, every AtomicOp, every BufferAccess kind including Workgroup.
- **Uses vyre's cpu.rs reference_fn for primitive ops** — do not duplicate op
  semantics, import them.

Structure:
- `interp.rs``pub fn run(program: &Program, inputs: &[Value]) -> Result<Vec<Value>, InterpError>`
- `eval_expr.rs` — dispatches on Expr variants
- `eval_node.rs` — dispatches on Node variants, handles control flow
- `atomics.rs` — sequential semantics (single-threaded reference is easier
  and equivalent for correctness)
- `workgroup.rs` — simulates workgroup memory and barrier semantics
- `ieee754.rs` — strict float path stubs; bodies are `todo!()` until floats
  land in vyre itself

Acceptance test: `cargo test -p vyre-conform reference::smoke` — runs a
hand-written Program using `BinOp::Add` through the reference interpreter,
verifies result matches `add.rs::reference_fn` bit-exactly. Then runs every
op currently in `vyre/src/ops/primitive/` and verifies reference interpreter
agrees with CPU reference_fn for a witnessed-u32 sample of 10,000 random
inputs per op.

If the reference interpreter disagrees with any cpu.rs for any input: the
test fails and you stop. One of them is wrong. Figure out which, fix it,
rerun. This is a Phase 2 blocker — do not proceed to Phase 3 with a broken
oracle.

### 5.3 Layer 2 wiring — law inference mandatory

File: `src/layers/layer2_law_inference.rs`.

- Existing `algebra/inference.rs` already tries every law automatically.
- Modify the op registration path so hand-declared laws are *verified*, not
  trusted. If a contributor submits `laws: &[Commutative]` on an op, the
  inference engine runs the commutativity checker at build time; if it
  disagrees, the build fails.
- Make inference the *default* contribution path. TEMPLATE_OP.rs drops the
  "declare your laws" slot and replaces it with "submit only the
  reference_fn; laws are inferred."
- Provide an escape hatch for the rare case where inference is too slow:
  `DeclaredLaw` can be provided explicitly only if accompanied by a
  `#[verified_manually = "reason"]` attribute. This marks it as needing
  human review every time it's touched.

Acceptance test: `cargo test -p vyre-conform layers::layer2::inference_rejects_false_claims`
— attempts to register a fake op claiming commutativity with a
non-commutative reference_fn; verifies the build-time check catches it.

### 5.4 Phase 2 gate

- Reference interpreter passes smoke test.
- Reference interpreter agrees with cpu.rs for every existing primitive op
  on 10K witnessed inputs each.
- `add.rs` OpSpec verifies through `verify_coverage()`.
- Inference engine rejects a deliberately-false commutativity claim at build
  time.

---

## 6. Phase 3 — Catalogs (archetypes, mutations, oracles, templates)

**Goal:** the four human-designed catalogs exist and are indexable from an
OpSpec. Each catalog is a registry; entries are referenced by id from
OpSpecs.

Can run in parallel with Phase 4.

### 6.1 Archetype library

Files: `src/archetypes/{arithmetic,structural,composition,backend}.rs` + `mod.rs`.

Trait:
```rust
pub trait Archetype: Sync {
    fn id(&self) -> &'static str;
    fn name(&self) -> &'static str;
    fn description(&self) -> &'static str;
    fn applies_to(&self, signature: &OpSignature) -> bool;
    fn instantiate(&self, op: &OpSpec) -> Vec<TestInput>;
}
```

Implement all archetypes:

**Arithmetic:**
- A1 Identity pair — `(0, 0)`, `(1, 1)`, `(x, x)` for every `x`
- A2 Overflow pair — `(u32::MAX, 1)`, `(u32::MAX, u32::MAX)`, `(i32::MIN, -1)` for signed ops
- A3 Power-of-two boundary — `(2^31 - 1, 2^31)`, every `2^k ± 1` for k in 0..32
- A4 Shift saturation — `(x, 31)`, `(x, 32)`, `(x, 33)`, `(x, 63)`, `(x, u32::MAX)`
- A5 Bit pattern alternation — `0x55555555`, `0xAAAAAAAA`, `0xF0F0F0F0`, `0xDEADBEEF`, `0xCAFEBABE`
- A6 Division zero — `(x, 0)` for every divisible op
- A7 Self-inverse trigger — `(x, x)` for XOR, `(x, -x)` for add

**Structural (for IR):**
- S1 Minimum program — single buffer, single op, no control flow
- S2 Maximum nesting — loops inside conditionals inside loops, up to configured limit
- S3 Node-count boundary — `max_nodes - 1`, `max_nodes`, `max_nodes + 1`
- S4 Dead code — computes never used; lowering must not emit them
- S5 Diamond dataflow — `x → (a, b) → merge`
- S6 Long dependency chain — N sequential ops, N = 1000
- S7 Wide fanout — one value consumed by N ops, N = 1000
- S8 Workgroup memory contention — every thread writes same workgroup index
- S9 Workgroup memory uniqueness — every thread writes own index
- S10 Atomic contention — all threads atomic-add to same slot
- S11 Off-by-one buffer — access at `index = len - 1`, `len`, `len + 1`
- S12 Shadowing attempt — variable redeclaration (must be rejected)
- S13 Barrier under divergent control — must be rejected (V-rule)
- S14 Type confusion — passing u32 where f32 expected
- S15 Empty buffer — buffer declared with `count = 0`

**Composition:**
- C1 Identity composition — `f(x)` vs `compose(f, identity)(x)`
- C2 Associativity triple — `(a · b) · c` vs `a · (b · c)`
- C3 Commutativity swap — `f(a, b)` vs `f(b, a)`
- C4 Involution pair — `f(f(x))` vs `x`
- C5 Idempotent collapse — `f(f(x))` vs `f(x)`
- C6 Absorbing short-circuit — `f(absorbing_elem, x)` vs `absorbing_elem`
- C7 Distributivity cross — `a · (b + c)` vs `(a · b) + (a · c)`

**Backend:**
- X1 Single op, every backend
- X2 Random program, every backend
- X3 Numerical worst case
- X4 Resource saturation — max threads × max buffer × max iterations

Each archetype's `instantiate` is pure: given an OpSpec, produce the concrete
test input list. No I/O, no randomness without a seed.

The global registry:
```rust
pub const ARCHETYPES: &[&'static dyn Archetype] = &[
    &arithmetic::A1_IdentityPair, &arithmetic::A2_OverflowPair, ...
];
```

### 6.2 Mutation catalog

Files: `src/mutations/*.rs` + `mod.rs`.

```rust
#[non_exhaustive]
pub enum Mutation {
    ArithOpSwap { from: BinOp, to: BinOp },
    ConstantIncrement { by: i64 },
    CompareInvert,
    ControlFlowDelete { branch: Branch },
    BufferIndexShift { by: i32 },
    AtomicOrderingWeaken,
    IrDataTypeSwap { from: DataType, to: DataType },
    LawFalselyClaim { law: Law, op: OpId },
    LawIdentityCorrupt { op: OpId, wrong: Value },
    LowerRemoveBoundsCheck,
    LowerRemoveShiftMask,
    BytecodeConverterSwap { from: Opcode, to: Opcode },
    // ... complete set
}

#[non_exhaustive]
pub enum MutationClass {
    ArithmeticMutations,
    ComparisonMutations,
    BitwiseMutations,
    ControlFlowMutations,
    BufferAccessMutations,
    OrderingMutations,
    IrStructuralMutations,
    LawMutations,
    LoweringMutations,
    ConstantMutations,
}

pub fn class_of(m: Mutation) -> MutationClass;
pub fn apply(source: &str, mutation: &Mutation) -> String;
pub fn mutations_for(class: MutationClass) -> Vec<Mutation>;
pub const MUTATION_CATALOG: &[Mutation] = &[...];
```

Full list:

**Arithmetic mutations:**
- `+``-`, `*``/`, `*``+`
- `wrapping_add``saturating_add` / `checked_add`
- `shl``shr`
- Constants: `0``1`, `1``0`, `N``N ± 1`, `u32::MAX``0`

**Comparison mutations:**
- `<``<=`, `>``>=`, `==``!=`
- `<``>`, invert boolean results

**Bitwise mutations:**
- `&``|`, `^``&`, `^``|`
- `!x``x`, `x & mask``x`

**Control flow mutations:**
- Delete `if` branch (take only else)
- Delete `else` branch (take only if)
- Invert condition (`!cond`)
- Loop `0..n``0..n+1`, `0..n-1`, `1..n`
- `break``continue`, `continue` → (removed)

**Buffer access mutations:**
- `buffer[i]``buffer[i-1]`, `buffer[i+1]`, `buffer[0]`, `buffer[len-1]`
- Read vs write swap
- Ordering: `Release``Relaxed`, `Acquire``Relaxed`, `SeqCst``Acquire`

**IR-specific mutations:**
- `BinOp::Add``BinOp::Sub` in primitive op definitions
- `DataType::U32``DataType::I32` in signatures
- Delete a law from the OpSpec declaration
- Swap `reference_fn` for the wrong op's reference_fn
- `BufferAccess::Storage``BufferAccess::Workgroup`
- Remove a validation rule entry
- Swap two opcodes in the bytecode → IR converter
- Emit the wrong WGSL op in the lowering table
- Change `workgroup_size` computation
- Remove bounds check in lowering
- Remove shift mask in lowering

**Law-level mutations (non-negotiable):**
- Falsely claim commutativity on a non-commutative op
- Falsely claim associativity
- Claim wrong identity element
- Claim wrong absorbing element

`apply` modifies Rust source. Can use `syn` for AST-level mutations or regex
for simple textual ones. AST-level is more robust; prefer it for the
IR-specific mutations.

Without `LawFalselyClaim` and `LawIdentityCorrupt` mutations, the whole proof
engine is a fiction. Tests must kill these mutations for every declared law
on every op.

### 6.3 Oracle hierarchy

Files: `src/oracles/*.rs` + `mod.rs`.

```rust
pub trait Oracle: Sync {
    fn kind(&self) -> OracleKind;
    fn applicable_to(&self, op: &OpSpec, property: &Property) -> bool;
    fn verify(&self, op: &OpSpec, input: &[Value], observed: &Value) -> Verdict;
}

pub fn resolve(op: &OpSpec, property: &Property) -> &'static dyn Oracle;
```

`resolve` implements the decision tree: Law > SpecTable > ReferenceInterpreter
> CpuReference > Composition > External > Property. First applicable in this
order wins.

Decision rule (encoded in resolve):
```
Given (op, input_class, property_to_check):
1. If a declared law covers (op, property) → use the law checker as oracle
2. Elif (input, expected) is in the spec table → use the spec table
3. Elif the op has a CPU reference_fn → use reference_fn as oracle
4. Elif this is a composition → use the composition theorem
5. Else → property test (weakest), flag for human review
```

Every generated test declares which oracle it uses. If a test uses "property"
as oracle when a law exists, the generator is wrong.

**The critical rule:** tests never derive their expected output from the
implementation they're testing. The oracle must be independent —
hand-written, law-derived, reference-interpreter-derived, or external.

The generator uses `resolve` at generation time to select the oracle for each
test. Selecting a weaker oracle when a stronger one is applicable is
*impossible* because the generator never makes the choice — the hierarchy
does.

Acceptance test: unit tests that feed concrete (op, property) pairs through
`resolve` and verify the expected oracle comes out.

### 6.4 Templates

Files: `src/templates/*.tmpl` (text) + `src/templates/mod.rs` (loader).

Templates are plain text, included at compile time via `include_str!`. Loader
provides:

```rust
pub fn render(template: &str, context: &TemplateContext) -> Result<String, TemplateError>;
```

Interpolation syntax: `{field}` for scalar, `{list:name}\n...\n{/list}` for
loops, `{if:cond}...{/if}` for conditionals. Keep it tight — Mustache-subset.

**preamble.tmpl:**
```
You are writing ONE test. Not 1000. Not 100. One.

You have unlimited time. There is no deadline.
DO NOT batch-generate similar tests. DO NOT optimize for speed.

Your output will be mutation-tested. If a single-character change to the
source code makes your test pass, the test is rejected and you will be
asked to strengthen it. Your only metric is: does this test catch bugs?
```

**op_correctness.tmpl:**
```
{include:preamble}

TARGET: {op_name} from vyre/src/ops/primitive/{file}.rs
PROPERTY: The op must return {expected} when given inputs {inputs}.
ORACLE: Spec table row from {spec_path}:{line}
CATEGORY: {category}

REQUIRED:
- The test MUST call the op via ir::Program, not via the cpu.rs direct function.
- The test MUST lower the Program and run it through the reference interpreter.
- The test MUST assert byte-equality with {expected}.
- The test MUST be named `test_{op}_{input_desc}_spec_table`.

FORBIDDEN:
- DO NOT derive the expected value from calling the function.
- DO NOT use proptest, quickcheck, or random inputs. This is a specific-input test.
- DO NOT add helper functions. Inline everything.
- DO NOT test multiple properties in one test.

Your test will be mutation-graded. If I delete the `{op}` case from the
primitive/{file}.rs dispatch and your test still passes, the test is rejected.

Output: ONE #[test] function. Nothing else.
```

**law.tmpl:**
```
{include:preamble}

TARGET: {op_name}
PROPERTY: {law_name} ({law_description})
ORACLE: Algebraic law — the checker has verified this law holds exhaustively on u8.
INPUT CLASS: {input_class}

Write a test that asserts {law_formula} for inputs {specific_inputs}.

The test must fail if:
- The law declaration is removed from the OpSpec for {op}
- The identity/absorbing element in the law declaration is corrupted
- The op's reference_fn is swapped with a non-{law}-satisfying function

REQUIRED:
- Use only values from the input class.
- Use the reference interpreter as execution environment.
- Assert the law formula directly, not by computing expected values.
```

**validation.tmpl:**
```
{include:preamble}

TARGET: Validation rule {V_NNN}
PROPERTY: must_reject({description})
ORACLE: V{NNN} rule definition in vyre/src/ir/validate.rs

Construct an ir::Program that violates exactly {V_NNN} and no other rule.
Assert that validate(&program) contains a ValidationError for {V_NNN}.

REQUIRED:
- The program must be otherwise well-formed.
- The error list must contain exactly one error, with rule == V{NNN}.
- If you cannot construct a program that violates ONLY this rule, output:
  FAIL: rule V{NNN} is not independently triggerable.
  This is a finding — the validation rule set is not separable.
```

**backend_equiv.tmpl:**
```
{include:preamble}

TARGET: op {op_name} on every conformant backend
PROPERTY: I3 (backend equivalence)
ORACLE: Reference interpreter

Build a Program using {op} with inputs {input_class}.
Run it on every registered backend.
Assert every result is byte-identical to the reference interpreter.

REQUIRED:
- If only one backend is registered, skip with reason "I3 needs ≥2 backends."
- Iterate backends via the registry, not hardcoded names.
- Fail with a diff showing which backends disagreed.
```

**archetype.tmpl:**
```
{include:preamble}

TARGET: {op_name}
ARCHETYPE: {archetype_id} ({archetype_name})
PROPERTY: {property}
ORACLE: {oracle_chosen_from_hierarchy}

The archetype is: {archetype_description}

Instantiate the archetype for {op_name}. Use inputs that match the archetype shape.
Assert the expected property using the chosen oracle.
```

**mutation_kill.tmpl:**
```
{include:preamble}

TARGET: {op_name}
PROPERTY: kill mutation {mutation_description}
ORACLE: {oracle_chosen_from_hierarchy}

The mutation gate reports that the following mutation survives all current tests:
{mutation_details}

Write a test that fails when the above mutation is applied to {op_name}'s source.
```

Every template starts with `{include:preamble}` which expands to the
"ONE test, not batched, mutation-graded" warning.

Every template declares REQUIRED and FORBIDDEN sections. The REQUIRED and
FORBIDDEN text is passed to the model verbatim — they are contracts the
model is told will be checked.

Acceptance test: render every template with a mock context; verify no
interpolation errors; verify the output contains every REQUIRED/FORBIDDEN
literal string.

---

## 7. Phase 4 — Harnesses

**Goal:** every harness primitive exists and is testable in isolation. H6
(mutation gate) is the critical piece.

Can run in parallel with Phase 3 except H9 (coverage) which depends on the
Phase 1 registry.

### 7.1 H6 — mutation gate (build first)

File: `src/harnesses/mutation.rs`.

This is the single most load-bearing piece of the whole conform system.
Build it first among the harnesses.

```rust
pub struct GateReport {
    pub test_name: String,
    pub source_file: PathBuf,
    pub mutations_attempted: Vec<Mutation>,
    pub mutations_killed: Vec<Mutation>,
    pub mutations_survived: Vec<Mutation>,
    pub duration: Duration,
    pub feedback: Vec<StructuredFeedback>,
}

pub struct StructuredFeedback {
    pub mutation: Mutation,
    pub hint: String,  // "Your test passed when I changed X to Y. Add assertion that distinguishes these."
}

pub fn mutation_probe(
    source_file: &Path,
    test_fn_name: &str,
    mutation_classes: &[MutationClass],
) -> GateReport;
```

Mechanism:
1. For each mutation in the catalog whose class is in `mutation_classes`:
2. Snapshot `source_file`.
3. Apply the mutation.
4. Run `cargo test -- <test_fn_name>` in a workdir with the mutated source.
5. If the test FAILED → mutation killed (good).
6. If the test PASSED → mutation survived (bad, finding).
7. Restore the snapshot.
8. Continue until all applicable mutations are tried.

Parallelism: can run multiple mutations concurrently using copy-on-write
source trees. Use `cargo` with separate target dirs per worker. Skip this
optimization for Phase 4; add in Phase 5 if needed.

Also exposed as a standalone binary: `vyre-conform mutation-gate --source
<path> --test <name> --classes <...>`. Used by CI and by the agent stream.

Acceptance test:
- Feed a hand-written test that asserts `add(2, 3) == 5`.
- Apply the `ArithOpSwap { from: Add, to: Sub }` mutation to the add op source.
- Verify `mutation_probe` reports the mutation as killed.
- Apply a mutation the test shouldn't catch (say, changing an unrelated function).
- Verify `mutation_probe` reports survival and produces structured feedback
  explaining what the test failed to distinguish.
- If either case is wrong, **stop**. The gate is the floor; if it's broken,
  nothing else matters.

### 7.2 H1–H5 and H7–H10

Build in parallel after H6.

**H1 `with_every_backend(program, |backend, result|)`** — iterates the
backend registry, runs the program on each, yields results. Initially only
one backend (wgpu); add the reference interpreter as a "backend" so I3 tests
can run with a single real backend.

**H2 `with_oom_injection(program, fail_at)`** — requires a test-only
allocator crate feature. Fails the Nth allocation; verifies graceful
handling. Mark with `#[cfg(feature = "oom-injection")]` so production builds
don't include it.

**H3 `with_fault_at_dispatch(program, dispatch_index)`** — simulates device
loss. In the wgpu backend, hook the dispatch call to return an error after
N calls. Test-only.

**H4 `with_backend_swap(program, swap_after)`** — runs the first N ops on
backend A, the next ops on backend B, verifies state transfer is legal or
rejected correctly. Tests the backend-independence claim.

**H5 `run_determinism_suite(program, iterations, threads)`** — runs the
program `iterations` times across `threads` threads, asserts all results
byte-identical. Tests I1.

**H7 `reference_diff(program, inputs)`** — runs the program on the reference
interpreter AND every registered backend, diffs the results, returns a
structured report of any divergence with minimized input via the minimizer.

**H8 `law_probe(op, law)`** — runs the law checker on `(op, law)`, returns
`(verified, counterexample?, minimized_counterexample?)`. Thin wrapper over
existing `algebra/checker.rs`.

**H9 `spec_coverage_check()`** — iterates the op registry, verifies every op
has: reference_fn, at least one declared law (or explicit "no laws" marker),
spec table entries, category assignment. Fails if any op is missing any
required field. Depends on Phase 1 registry.

**H10 `composition_proof_check(program)`** — runs the composition theorem
checker on the program, returns the proof token or a list of theorems that
don't hold. Tests assert the token is present.

Each harness is a standalone function with a clear contract. Write unit
tests for each that feed known-good and known-bad inputs and verify the
harness detects each case.

### 7.3 Phase 4 gate

- H6 passes its acceptance test (can catch ArithOpSwap mutation, produces
  structured feedback for survivors).
- All H1-H10 harnesses have unit tests that pass.
- H9 rejects a registry entry missing `reference_fn`.

---

## 8. Phase 5 — Generator, xtask, layer enforcement

**Goal:** the pipeline runs end-to-end. Spec → generator → materialized
tests → cargo test → mutation gate → pass/fail. Layer enforcement wires the
discipline into the build.

### 8.1 Generator

Files: `src/generator/*.rs` + `src/bin/vyre-gen-tests.rs`.

```rust
pub struct GenerationPlan {
    pub ops: Vec<&'static OpSpec>,
    pub archetypes: Vec<&'static dyn Archetype>,
    pub templates: Vec<TemplateKind>,
    pub seed: u64,
}

pub fn generate(plan: &GenerationPlan, out_dir: &Path) -> Result<GenerationReport, GenError>;
```

`cross_product.rs` — iterates `(op × archetype × oracle)` tuples from the
plan, filters by `Archetype::applies_to(op.signature)`, routes each tuple to
the appropriate template, emits one test per tuple.

`emit.rs` — Rust code writer. Takes rendered template output + test name +
provenance header, writes a syntactically valid `#[test]` function to a file
in `tests_generated/`. Uses `syn` to parse the rendered output and verify
it's valid Rust before writing; if rendering produced invalid syntax, the
generator fails fast with a clear error.

`naming.rs` — deterministic test names: `test_{op}_{archetype}_{oracle}_{seed_hash}`.
Same inputs always produce the same name. No collisions.

`seed.rs` — RNG wrapper that takes a master seed and derives per-test seeds
deterministically. Same master seed → same test set. Reproducible.

`provenance.rs` — every generated file starts with a comment header:
```
// GENERATED by vyre-gen-tests v{gen_version}
// spec version: {spec_version}
// source tuple: op={op}, archetype={archetype}, oracle={oracle}
// master seed: {seed}
// generated at: {timestamp}
// DO NOT EDIT. Regenerate via `cargo xtask generate-tests`.
```

Binary `vyre-gen-tests`:
```
vyre-gen-tests --op add --out tests_generated/
vyre-gen-tests --all --out tests_generated/
vyre-gen-tests --plan plan.toml --out tests_generated/
```

Acceptance test: running `vyre-gen-tests --op add --out /tmp/gen` produces a
directory of test files, all syntactically valid, with provenance headers,
all oracles chosen via hierarchy (no "property" oracles when a stronger one
exists).

### 8.2 xtask commands

Create top-level `xtask/` crate at the workspace root.

Commands:
- `cargo xtask generate-tests [--op NAME | --all]` — wraps generator binary
- `cargo xtask mutation-gate [--tests PATH]` — runs H6 across all agent-written
  tests, emits report
- `cargo xtask coverage-check` — runs H9 + registry verification
- `cargo xtask conform-verify` — runs the whole pipeline end-to-end:
  generate, mutation-gate, coverage, `cargo test`. Exit non-zero on any
  failure. This is what CI runs.

Each command has `--help`. `conform-verify` is the "is vyre-conform healthy"
check.

### 8.3 Layer enforcement

Files: `src/layers/*.rs`.

**Layer 1 — Executable spec** (`layer1_executable_spec.rs`)
- Hook `verify_coverage()` into `build.rs`.
- Add a compile-time assertion that `OP_REGISTRY.len() > 0` in release builds.
- Integration test that builds a fake vyre-conform consumer and verifies it
  can iterate OP_REGISTRY.

**Layer 2 — Law inference mandatory** (already done in Phase 2)

**Layer 3 — Reference interpreter** (`layer3_reference_interp.rs`)
- Thin wrapper that re-exports `::reference` functions.
- Provides integration hook used by H7 and I3 tests.

**Layer 4 — Mutation gate** (`layer4_mutation_gate.rs`)
- Thin wrapper over `harnesses::mutation`.
- Integration hook for CI: `layer4::enforce_on(test_set) -> Result<(), Vec<SurvivingMutation>>`.
- CI fails if any surviving mutation is found in the agent-accepted test set.

**Layer 5 — Adversarial roles** (`layer5_adversarial.rs`)
- Implementor / Prosecutor / Defender three-role contribution flow.
- Implementor submits op + reference_fn.
- Prosecutor writes tests graded by mutation kill count.
- Defender writes intentionally-wrong reference_fn; if Prosecutor's tests
  don't catch Defender's wrong impl, Prosecutor is rejected with structured
  feedback.
- Mechanical adjudication: no human judgment in the loop.
- Integrates with the agent stream as three distinct task kinds.

**Layer 6 — Stability** (`layer6_stability.rs`)
- `build.rs` check: audit all public enums in vyre and vyre-conform for
  `#[non_exhaustive]`. Fail build on any public enum without it.
- Version lock enforcement: when a new OpSpec is added, `since_version` must
  equal `CURRENT_VERSION`. Cannot add ops claiming older version.
- Audit script that diffs public API between two versions, reports any
  removed or changed items as breaking.

**Layer 7 — Composition proof** (`layer7_composition_proof.rs`)
- Extend `ir::Program` construction to compute a proof token when composed.
- Proof token encodes: "this composition preserves laws L1, L2 because
  composition theorems T1, T2 apply."
- `lower()` refuses to lower programs whose proofs don't check out.
- Proof check is fast (O(nodes)); no runtime overhead during actual
  execution, only at construction/lowering.

**Layer 8 — Feedback loop** (`layer8_feedback_loop.rs`)
- Target: <10s contribution feedback.
- Incremental mutation runs: cache mutation-probe results keyed by
  `(source_hash, test_hash, mutation)`. Only re-run mutations whose cache
  entry is invalidated.
- Warm reference interpreter: keep it in-process across test runs, not
  spawned fresh per test.
- `cargo xtask quick-check --op NAME` runs the minimal verification path
  for one op in <10s.

### 8.4 Phase 5 gate

- `cargo xtask conform-verify` runs end-to-end and exits cleanly.
- `cargo xtask quick-check --op add` runs in <10s.
- Layer 6 audit passes (all public enums `#[non_exhaustive]`).
- Layer 7 rejects a deliberately-broken composition.

---

## 9. Phase 6 — Calibration

**Goal:** the pipeline produces output that meets the bar on a single op,
before being turned loose on the whole registry.

### 9.1 Generate all tests for BinOp::Add

`cargo xtask generate-tests --op add --out tests_generated/`

Verify:
- Every (archetype × oracle) combination that applies to add produces a test.
- Every test has a provenance header.
- `cargo test` on the generated tree passes (or fails with findings in the
  actual implementation, which are real bugs to fix).
- `cargo xtask mutation-gate --tests tests_generated/add/` reports every
  mutation in the ArithmeticMutations and ConstantMutations classes as
  killed. If any survive, generate more tests until they don't.

### 9.2 Audit 10 random generated tests by hand

Read them. Check:
- Is the assertion meaningful, or could a single-character change in the
  implementation bypass it?
- Is the oracle the strongest applicable?
- Is the test self-contained (no helpers, no indirection)?
- Is the rationale for why this test exists clear from reading it?

If any test fails this audit, the generator or the templates are wrong. Fix
them. Regenerate. Audit 10 more. Repeat until audit is clean.

### 9.3 Strengthen templates and gate

Based on audit findings, strengthen templates. Common fixes:
- Add more FORBIDDEN patterns ("DO NOT use `assert!(result.is_ok())`").
- Narrow the input class description.
- Add explicit assertion patterns the model must use.

Strengthen the mutation gate with any mutations the audit revealed aren't
covered.

### 9.4 Compare against existing hand-written add tests

`vyre/tests/primitive_programs.rs` has hand-written tests for add. Run
mutation gate against both the hand-written tests and the generated tests.
The generated set must kill a strict superset of mutations. If the
hand-written set catches something the generated set doesn't, add the
missing archetype/template/mutation and regenerate.

Do not delete the hand-written tests until the generated set is strictly
stronger. When it is, the hand-written tests are replaced by the generated
set and Git records the migration.

### 9.5 Phase 6 gate

- Generated test set for BinOp::Add is strictly stronger than existing
  hand-written tests (per mutation gate comparison).
- Hand audit of 10 random generated tests finds zero slop.
- `cargo xtask conform-verify` runs cleanly end-to-end on the add-only path.

When this gate passes, scaling to all 32 primitive ops is mechanical: run
the same pipeline with `--all` instead of `--op add`. If Phase 6 passes for
add, it passes for everything structurally similar.

---

## 10. What NOT to do

- **Do not commit generated tests.** `tests_generated/` is in `.gitignore`.
  If you find yourself adding test files to git outside of `regression/` or
  `agent_accepted/`, stop and reconsider.
- **Do not weaken a template to make a test pass.** Templates are contracts.
  If a test the template produces is weak, the template is wrong and must be
  strengthened, not the other way around.
- **Do not let tests derive expected values from the implementation.** Every
  oracle must be independent of the code being tested. A test that computes
  `let expected = add(a, b); assert_eq!(add(a, b), expected)` is slop and
  the gate must reject it.
- **Do not skip the audit in Phase 6.** Running the pipeline on all 32 ops
  without verifying quality on one op first is the path to generating 32x
  slop instead of 1x slop.
- **Do not add a Category B variant.** Runtime abstraction is forbidden.
  Every op is A or C. If something seems like it wants to be B, reconsider
  whether it's really an op or whether it's a composition of real ops.
- **Do not let a declared law exist without a Verification provenance.** A
  law the checker hasn't verified is a lie. Compile error.
- **Do not accept a PR that touches `agent_accepted/` without a provenance
  sidecar.** Every file in that directory has a `.meta` file recording who
  (which agent id), when, what mutations were killed, who approved. Missing
  metadata = rejected PR.
- **Do not modify existing hand-written tests to match broken implementation.**
  If a hand-written test fails, the implementation is probably wrong, not
  the test. Investigate before touching the test.

---

## 11. Parallelism and dispatch

Each phase has a list of artifacts, most of which are independent. Parallel
agent dispatch strategy:

**Phase 1** (5 parallel)
- schema.md → Codex Spark or Gemini
- Directory scaffolding → Kimi (fast, mechanical)
- Spec schema types → Codex 5.4 (load-bearing)
- Invariants → Gemini
- Laws enum → Gemini (different account from invariants)

**Phase 2** (3 tracks, reference interpreter is largest)
- Canonical add OpSpec → Codex 5.4 (sets the quality bar)
- Reference interpreter → Codex 5.4 or sequential Kimi passes
- Layer 2 wiring → Gemini

**Phase 3** (4 parallel)
- Archetype library → Codex Spark
- Mutation catalog → Gemini
- Oracle hierarchy → Gemini (different account)
- Templates → Kimi

**Phase 4** (H6 first, then 3 parallel)
- H6 mutation gate → Codex 5.4 (the critical gate)
- H1-H5 → Kimi
- H7-H10 → Kimi (second instance)

**Phase 5** (3 parallel)
- Generator → Codex Spark
- xtask commands → Kimi
- Layer enforcement 1/4/5/6/7/8 → Codex 5.4

**Phase 6** (serial, mostly human judgment)
- Run generator on add → mechanical
- Audit 10 tests by hand → HUMAN ONLY
- Strengthen and regenerate → mechanical
- Compare to hand-written → mechanical

**Rate-limit consideration:** Codex 5.4 is the bottleneck (tasks appear in
every phase). Pre-dispatch a Kimi fallback for each Codex task in case of
rate limits. Gemini tasks respect the "3 per account" rule — never switch
accounts until the current batch of 3 is done.

---

## 12. The load-bearing pieces

These five must be right or everything downstream is corrupted. Allocate
disproportionate review attention to these:

1. **Spec schema types** (Phase 1) — the data shape every other component
   reads from.
2. **Canonical add OpSpec** (Phase 2) — the template every other op copies.
   Didactic quality matters.
3. **Reference interpreter** (Phase 2) — the oracle. Bugs here silently
   corrupt everything.
4. **Mutation gate H6** (Phase 4) — the quality floor. A weak gate lets slop
   through.
5. **Layer enforcement wiring** (Phase 5) — the discipline that makes the
   gates actually enforce.

For each of these, after the agent produces the code:
- Read it line by line.
- Run the acceptance test manually.
- Check that the acceptance test is meaningful (does it actually verify the
  thing it claims to verify?).
- If in doubt, redispatch with corrections.

---

## 13. Daily calibration (forever)

Once the pipeline is running and producing tests:

**Every day, read 10 random committed agent tests and delete any that don't
meet the standard.**

This is the calibration loop. If the gate is letting slop through, deletions
force you to strengthen the gate. If nothing needs deleting, you're done
calibrating and can increase batch size. Ten a day, every day, forever.

This is how SQLite stays at 590x ratio *and* high quality — they don't
trust the machinery, they audit it.

**Every bug found in production adds a new mutation to the catalog.** If a
bug class isn't in the catalog, the gate can't catch it, and the next
instance of that bug will ship. Post-mortem rule: every bug fix adds one
mutation operator.

---

## 14. Contact points

- Schema questions → re-read `schema.md`, which is the single source of
  truth for the directory layout and provenance rules.
- Spec type questions → `src/spec/mod.rs` doc comments.
- Invariant questions → `src/spec/invariants.rs`, each has a description
  field.
- "Should this be a primitive or a composition?" → Category A+C rule. If it
  composes from existing ops, it's a composition. If it requires new
  hardware, it's Category C. Never a new primitive otherwise.
- "Why is this law declared without ExhaustiveU8?" → the op's value type is
  u32 or larger. Use WitnessedU32 with a seed and a count of at least 100K.
- "Where does this generated test go?" → `tests_generated/{op}/{archetype}_{oracle}.rs`.
  Not committed.
- "Where does this human-minimized bug go?" → `regression/YYYY-MM-DD-description.rs`.
  Committed. Permanent.
- "Where does this agent-approved test go?" → `agent_accepted/{op}/test_NNNN.rs`
  with matching `.meta` sidecar. Committed.

---

END OF PLAN