vyre-conform 0.1.0

Conformance suite for vyre backends — proves byte-identical output to CPU reference
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
# vyre-conform

## Vision

### The Abstraction Thesis

Every complex system in computing history was built bottom-up through forced
abstraction layers. The CPU path: transistor → logic gate → ALU → microcode →
instruction set → assembler → compiler → operating system → programming
language → application. Each layer was forced by physics or necessity. The
result: every intermediate layer is simple, complete, tested, and invisible.

The critical property: **abstractions do not decrease performance.** C++
templates compile to the same machine code as hand-written C. Rust's iterators
compile to the same loops as manual indexing. The abstraction exists at design
time. It vanishes at execution time.

GPU compute has not followed this path. Researchers see thousands of parallel
cores and jump directly from GPU threads to compilers, databases, and neural
network frameworks — skipping the intermediate layers because they already know
what the end looks like. This fails every time. Not because GPUs are harder
than CPUs — because the abstraction layers between "GPU core" and "GPU
application" do not exist yet.

**The thesis:** Complex systems emerge from simple layers. Each layer must be
built, tested, and trusted before the next one begins. Skipping a layer produces
a system that appears to work but collapses under weight.

vyre builds those layers. vyre-conform proves they are correct.

### What vyre-conform must become

vyre-conform is a **proof system**, not a test suite.

A test suite says "I ran 1000 inputs and they matched." Confidence is
proportional to sample count. It never reaches certainty.

A proof system says "this implementation satisfies these algebraic laws, which
uniquely characterize the operation. Verified exhaustively on u8 and witnessed
on 10^6 random u32 inputs." Confidence is mathematical.

Rust's borrow checker does not test memory safety — it proves it via type rules.
vyre-conform does not test GPU correctness — it proves it via algebraic laws,
exhaustive verification, and compositional reasoning.

### What vyre-conform must do

1. Define every operation by its algebraic identity (laws, not examples).
2. Verify those laws exhaustively on small domains and statistically on full domains.
3. Produce formal conformance certificates with coverage metrics.
4. Scale to 100M+ test cases without OOM.
5. Support engine-level invariant checking beyond individual op correctness.
6. Enable any backend (wgpu, CUDA, Metal, TPU, FPGA) to prove conformance.

### The Category A+C rule (permanent)

Every operation in vyre belongs to exactly one of two categories:

- **Category A — Compositional, zero-overhead.** The op is expressed as a
  composition of other ops. When lowered, it inlines to hand-written-
  equivalent shader code. The abstraction exists at IR construction time
  and vanishes at shader emission time. 95%+ of vyre ops are Category A.

- **Category C — Hardware intrinsic.** The op maps 1:1 to a specific
  hardware instruction that no software composition can match in
  performance (10×-1000× faster). Tensor cores, warp shuffles, sampler-
  filtered reads, async copy engines, ray-triangle intersection. A
  Category C op declares `{ hardware, backend_availability }` — the
  target hardware identifier and a predicate that says which backends
  support the intrinsic. **There is no software fallback composition.**
  A backend that does not support a Category C op returns
  `Error::UnsupportedByBackend { op, backend }` — a first-class outcome,
  not a silent degraded path. Downstream consumers who need cross-backend
  portability for an op either (a) constrain their target backend set to
  backends that support the op natively, or (b) author an explicit
  Category A composition over primitives that expresses the same
  computation. The choice is visible in code, not hidden in a runtime
  fallback. **Performance is the product**; a slow path that silently
  degrades is a footgun users write unaware and blame vyre for being
  slower than hand-written code.

**Category B — runtime abstraction overhead — is FORBIDDEN.** No virtual
dispatch. No interpreter loops. No JIT. No boxing. If an op cannot lower
to either inlined composition (A) or direct hardware intrinsic (C), it
does not belong in vyre.

This discipline is non-negotiable. It is what makes vyre composable at
arbitrary depth without accumulating cost. A Program with 20 ops composed
from 200 sub-ops lowers to ONE optimized shader, not an interpreter
dispatching through a registry.

The test for every proposed op is mechanical: does lowering produce
overhead-free code? If yes, via inlining → Category A. If yes, via
hardware mapping → Category C. If no → rejected.

### Extensibility — how vyre adapts to new algorithms

A common concern: "can vyre represent new neural architectures when they
are invented? New algorithms, new models, new techniques?"

Yes, because **vyre is an IR, not a fixed algorithm library.** New
algorithms are compositions of existing primitives. New primitives are
rare — they require new hardware. The substrate is stable; the algorithms
that run on it are not.

Examples:

- **Mamba / State Space Models** — matmul + element-wise + scan. All
  existing Category A ops. Composition.
- **Flash Attention** — matmul + softmax + masked loads. Softmax is a
  Layer 2 composition (exp + reduce + div). Masked loads are Category A
  element-wise ops. Composition.
- **Diffusion models (U-Net)** — convolutions + attention + activations.
  Convolution is a Layer 2 composition (matmul with im2col or direct).
  Attention is already a composition. Activations are element-wise.
  Composition.
- **New quantization schemes (NF4, FP8 variants)** — new DataType variants
  with declared semantics. The IR grows one enum; existing code does not
  break.
- **Sparse attention variants** — masked matmul + index gather. Index
  gather is an existing op. Masked matmul is a new compound built from
  matmul + conditional store. Composition.
- **Mixture of Experts routing** — top-k selection + indexed dispatch +
  weighted sum. Top-k is a Layer 2 op (sort + slice). Indexed dispatch is
  a scatter operation. Weighted sum is matmul. Composition.

The principle: **vyre evolves by adding primitives slowly (as hardware
changes) and by growing the composition library quickly (as researchers
invent algorithms).** New hardware is rare. New algorithms are constant.
The substrate accommodates the constant change because algorithms are
compositions, not new primitives.

What would require a new primitive:

- New hardware unit in a GPU generation (e.g., a new sparse tensor core
  not derivable from existing ops). → Add a new Category C op that
  declares `{ hardware, backend_availability }`. Backends without the
  unit return `UnsupportedByBackend`; consumers that need the op
  constrain their target backend set or compose equivalent behavior
  from Category A primitives.
- A new DataType (e.g., FP4 when it becomes mainstream). → Add an enum
  variant with strict semantics.
- A fundamentally new dispatch model (e.g., if quantum GPU compute
  becomes real). → Add a new Program variant.

Everything else — every new model architecture, every new training
algorithm, every new inference technique — is a composition expressible
in the IR as it stands. Researchers publish a paper, implementors compose
vyre ops to match the paper, conformance verifies correctness, the
algorithm runs.

## Architecture

```
vyre-conform/src/
│
├── spec/                     THE SPECIFICATION. Zero deps. Pure definitions.
│   ├── types.rs              DataType, OpSignature, Convention (mirrors ir/types.rs)
│   ├── op.rs                 OpSpec: id + signature + cpu_fn + laws + equivalence classes + boundaries
│   ├── law.rs                AlgebraicLaw: Commutative, Associative, Identity, SelfInverse, Idempotent, Absorbing, Involution, DeMorgan, Monotone, Bounded, Complement, DistributiveOver
│   ├── engine.rs             EngineSpec: engine-level specifications
│   └── invariant.rs          EngineInvariant: Deterministic, NoOutputLost, NoDuplicate, Ordered, Bounded, Termination, AtomicLinearizable, WorkgroupInvariant
│
├── algebra/                  THE PROOF ENGINE.
│   ├── checker.rs            Verify a law holds: exhaustive on u8, witnessed on u32
│   ├── laws/                 Individual law checkers
│   │   ├── commutative.rs    f(a,b) = f(b,a)
│   │   ├── associative.rs    f(f(a,b),c) = f(a,f(b,c))
│   │   ├── identity.rs       f(a,e) = a
│   │   ├── self_inverse.rs   f(a,a) = r
│   │   ├── idempotent.rs     f(a,a) = a
│   │   ├── absorbing.rs      f(a,z) = z
│   │   ├── involution.rs     f(f(a)) = a
│   │   ├── demorgan.rs       f(g(a,b)) = h(f(a),f(b))
│   │   ├── monotone.rs       a ≤ b → f(a) ≤ f(b)
│   │   ├── bounded.rs        lo ≤ f(a) ≤ hi
│   │   ├── complement.rs     f(a) + g(a) = u
│   │   └── distributive.rs   f(a, g(b,c)) = g(f(a,b), f(a,c))
│   ├── witness.rs            Minimal counterexample finder (binary search / delta debug)
│   └── coverage.rs           Coverage report: equivalence classes, boundaries, law results
│
├── generate/                 INPUT GENERATION. Driven by coverage analysis.
│   ├── exhaustive.rs         All values in domain (u8, u16)
│   ├── equivalence.rs        One representative per equivalence class
│   ├── boundary.rs           All boundary values from OpSpec
│   ├── random.rs             Uniform random with deterministic seed
│   ├── adversarial.rs        GPU failure modes: shift-by-32, div-by-0, u32::MAX
│   └── shrink.rs             Counterexample minimization
│
├── verify/                   VERIFICATION STRATEGIES. Each independent.
│   ├── parity.rs             CPU vs GPU byte-for-byte
│   ├── algebraic.rs          Verify all declared laws hold
│   ├── compositional.rs      Prove A∘B correct from A and B's laws
│   ├── invariant.rs          Engine invariant checking
│   ├── determinism.rs        Same input N times → same output N times
│   ├── regression.rs         Persistent failure replay
│   └── performance.rs        Throughput baselines
│
├── backend/                  BACKEND ABSTRACTION.
│   ├── traits.rs             VyreBackend: dispatch_op + dispatch_engine
│   └── wrap.rs               Shader wrapping for conformance dispatch
│
├── certify/                  CONFORMANCE CERTIFICATION.
│   ├── certificate.rs        Formal certificate: ops, laws, coverage, engines
│   ├── level.rs              L1 parity → L2 algebraic → L3 engine → L4 compositional
│   └── console.rs            Human-readable report
│
├── registry/                 ALL REGISTERED SPECS.
│   ├── primitive/            One file per op: cpu_fn + laws + equivalence classes
│   ├── decode/
│   ├── hash/
│   ├── string/
│   ├── collection/
│   ├── graph/
│   ├── match_ops/
│   └── engine/               Engine-level specs with invariants
│
├── suite.rs                  Orchestration: run full proof suite, produce certificate
└── lib.rs
```

### Conformance Levels

Conformance is tracked in two parallel tracks — integer and float —
because the proof strategies differ.

**Integer track (bit-exact):**

| Level | Name | Method | What it proves |
|-------|------|--------|----------------|
| L1 | Parity | CPU vs GPU byte match on generated inputs | Implementation matches reference on tested inputs. |
| L2 | Algebraic | All declared laws verified (exhaustive u8 + witnessed u32) | Implementation satisfies the mathematical identity of each operation. |
| L3 | Engine | All engine invariants verified across generated programs | Composed engines maintain correctness guarantees. |
| L4 | Full | L2 + L3 + compositional proofs + performance baselines | Production-ready: correct, composed, and fast. |

**Float track (bit-exact IEEE 754):**

| Level | Name | Method | What it proves |
|-------|------|--------|----------------|
| L1f | Float Parity | GPU matches CR-Math / strict IEEE reference bit-for-bit | Every float op satisfies strict IEEE 754 semantics. |
| L2f | Float Algebraic | Declared float laws verified (commutativity of FAdd/FMul, negation involution, abs idempotent, etc.) | Operations hold their algebraic properties under rounding. |
| L3f | Tensor | Matrix/reduce/elementwise tensor ops match reference impl bit-for-bit | Composed tensor operations are correct. |
| L4f | ML Conformance | L2f + L3f + composed engines (matmul, attention, softmax, layer norm) match reference | Backend can run real neural networks with bit-exact determinism. |

**Approximate track (for high-performance paths):**

| Level | Name | Method | What it proves |
|-------|------|--------|----------------|
| L1a | Approximate Parity | GPU matches reference within declared ULP tolerance | Implementation is "close enough" for stated use case. |

Approximate operations (e.g. `FReduceApprox`, vendor fast-math paths) are
explicitly separate from strict operations. An op labeled approximate is
tested against its tolerance, not against bit-exact. Approximate conformance
is a real thing but it is NEVER mixed with strict conformance in the same
certificate. A backend is strict OR approximate per operation, by explicit
declaration.

### Scale requirements

The harness must handle 100M+ test cases:
- **Streaming execution:** generate → dispatch → compare one at a time. Never
  hold all inputs in memory.
- **Parallel dispatch:** batch GPU dispatches for throughput.
- **Incremental reporting:** stream results, don't collect.
- **Sharded execution:** run subset by partition ID for CI parallelism.
- **Deterministic resumption:** skip N tests to resume from a failure point.
- **Regression persistence:** every failing input saved, replayed forever.

### Admission Gates

New operation specs pass six admission gates before entering the registry.
Gates 1-3 are category, signature, and zero-overhead checks. Gates 4-6 are:

4. **Deterministic CPU semantics.** The declared `cpu_fn` must return identical
   bytes for repeated calls on the same witness input. A nondeterministic
   reference cannot define conformance.
5. **Observable CPU reference.** The declared `cpu_fn` must produce at least
   one output byte for a valid witness input. Empty references are rejected
   because they cannot prove parity or preserve failure context.
6. **Conformance rules present.** Every spec must declare at least one law,
   equivalence class, boundary value, or alternate WGSL implementation. A spec
   with no rule material is untestable and is rejected.

---

## Ground Truth — IR Semantics

Everything below this line is immutable. Entries are added, never modified or
removed. If a definition is wrong, a new version is added and the old one is
preserved with a deprecation note.

### Data Types

| Type      | Size     | WGSL mapping       | Semantics |
|-----------|----------|---------------------|-----------|
| `U32`     | 4 bytes  | `u32`               | Unsigned 32-bit. Wrapping arithmetic. |
| `I32`     | 4 bytes  | `i32`               | Signed 32-bit. Two's complement wrapping. |
| `U64`     | 8 bytes  | `vec2<u32>`         | Unsigned 64-bit, emulated as (low, high). |
| `U128`    | 16 bytes | `vec4<u32>`         | Unsigned 128-bit, emulated as 4×u32. |
| `Vec2U32` | 8 bytes  | `vec2<u32>`         | Two-component u32 vector. |
| `Vec4U32` | 16 bytes | `vec4<u32>`         | Four-component u32 vector. |
| `Bool`    | 4 bytes  | `u32` (0/1)         | Boolean stored as u32. |
| `Bytes`   | 0+       | `array<u32>`        | Variable-length byte buffer. |
| `F16`     | 2 bytes  | `f16`               | **Strict** IEEE 754 binary16. See float rules below. |
| `BF16`    | 2 bytes  | `u16` + ops         | **Strict** bfloat16 (1-8-7). See float rules below. |
| `F32`     | 4 bytes  | `f32`               | **Strict** IEEE 754 binary32. See float rules below. |
| `Tensor`  | 0+       | `array<T>` + strides | Multi-dimensional tensor with declared element type and shape. |

### Float semantics — strict IEEE 754, deterministic across backends

**Rationale.** GPU float has historically been nondeterministic because
backends were *permitted* to fuse, reorder, flush, and approximate. None of
these are fundamental to floating-point. Each is a specific permission granted
to the compiler. vyre **forbids every such permission**. The result is a
deterministic float subset that is strict IEEE 754 on every conforming backend.

This is the same pattern Rust uses for memory safety. The hardware isn't
safer in Rust than in C — safe Rust forbids the patterns that cause unsafety,
so the compiled code is safe by construction. vyre forbids the float patterns
that cause nondeterminism, so vyre float code is bit-exact across vendors by
construction.

**Forbidden patterns (permanent):**

1. **FMA fusion.** `FMul(a, b) + FAdd(·, c)` is two roundings. It is NOT
   lowered to a hardware FMA instruction. If the contributor wants one
   rounding, they explicitly write `FMulAdd(a, b, c)`. The choice is in the IR,
   never in the backend.

2. **Reduction reordering.** There is no `FReduceUnordered`. Two reduction
   ops exist: `FReduceStrict(op, array)` (left-to-right sequential) and
   `FReduceTreeBinary(op, array)` (balanced binary tree in a canonical shape).
   Both produce bit-exact results across backends. Parallel-tree reductions use
   the canonical tree — same shape on every GPU.

3. **Subnormal flushing.** Subnormals are always preserved. Backends must
   configure the GPU to disable flush-to-zero mode. Slower on GPUs that default
   to FTZ, but deterministic.

4. **Vendor transcendentals.** `FSin`, `FCos`, `FTan`, `FExp`, `FLog`, `FPow`,
   `FExp2`, `FLog2` must be **correctly-rounded** per IEEE 754-2019 CR-Math.
   Backends must use CR-Math implementations (e.g. CORE-MATH), not vendor
   fast-math libraries. Slower than cuBLAS `sinf`, but bit-exact everywhere.

5. **Division and square root approximations.** `FDiv` and `FSqrt` must be
   correctly-rounded IEEE 754. Hardware `rcp` and `rsqrt` approximations are
   forbidden.

6. **Tensor core precision downgrade.** `MatMul { accumulator: F32 }` MUST use
   F32 accumulators. If the backend uses tensor cores, it must use cores that
   correctly implement the declared precision. Silent TF32 substitution is
   forbidden. Backends without sufficient tensor core precision must fall back
   to software F32 matmul.

**Required operations (strict IEEE 754):**

| Op            | Semantics | Notes |
|---------------|-----------|-------|
| `FAdd(a,b)`   | IEEE 754 add, round-to-nearest-even | Commutative. NOT associative. |
| `FSub(a,b)`   | IEEE 754 sub, round-to-nearest-even | |
| `FMul(a,b)`   | IEEE 754 mul, round-to-nearest-even | Commutative. NOT associative. |
| `FDiv(a,b)`   | IEEE 754 div, round-to-nearest-even | `FDiv(x, 0.0) = +Inf/-Inf/NaN per 754`. |
| `FMulAdd(a,b,c)` | IEEE 754 fused multiply-add (ONE rounding) | Explicit fusion only. |
| `FSqrt(a)`    | IEEE 754 sqrt, round-to-nearest-even | `FSqrt(negative) = NaN`. |
| `FMin(a,b)`   | IEEE 754-2019 `minimumNumber` | Defined NaN handling. |
| `FMax(a,b)`   | IEEE 754-2019 `maximumNumber` | |
| `FNeg(a)`     | Sign bit flip | Bit-exact. Even on NaN. |
| `FAbs(a)`     | Clear sign bit | Bit-exact. |
| `FEq, FLt, FGt, FLe, FGe` | IEEE 754 comparisons | NaN comparisons are false. |
| `FIsNaN(a)`   | `a != a` | |
| `FIsInf(a)`   | Exponent all 1, mantissa 0 | |
| `FSin, FCos, FTan, FASin, FACos, FATan, FATan2` | CR-Math correctly rounded | Slower than vendor. Deterministic. |
| `FExp, FLog, FExp2, FLog2, FExp10, FLog10` | CR-Math correctly rounded | |
| `FPow(a,b)`   | CR-Math correctly rounded | |
| `FFloor, FCeil, FRound, FTrunc` | IEEE 754 round-to-X | |

**Conversion operations:**

| From → To | Semantics |
|-----------|-----------|
| `F32 → F16` | IEEE 754 round-to-nearest-even. Overflow → ±Inf. |
| `F16 → F32` | Exact (F16 is a strict subset). |
| `F32 → BF16` | IEEE 754 round-to-nearest-even. |
| `BF16 → F32` | Exact. |
| `F32 → I32` | Round per declared mode (truncate/nearest/floor/ceil). Overflow → saturation to I32::MIN/MAX, not UB. |
| `I32 → F32` | Exact for `|x| < 2^24`, round-to-nearest-even otherwise. |
| `U32 → F32` | Same. |

**Reduction operations:**

| Op | Semantics |
|----|-----------|
| `FReduceStrict(op, array)` | Left-to-right sequential. `fold(op, array[0], array[1..])`. Slow but trivially deterministic. |
| `FReduceTreeBinary(op, array)` | Balanced binary tree in canonical shape. `reduce(reduce(a..mid), reduce(mid..n))`. Faster. Bit-exact across backends that follow the canonical shape. |

There is no unordered reduction. Contributors who want speed at the cost of
precision use `FReduceApprox(op, array, tolerance_ulps)` which is EXPLICITLY
labeled approximate and carries a declared ULP tolerance. Conformance testing
uses the tolerance, not bit-exact comparison. Approximate reductions are a
separate conformance class, never mixed with strict operations.

### Tensor semantics

A `Tensor` is a multi-dimensional view over a flat buffer with explicit
strides and element type. Indexing is linearized per the declared strides.
Out-of-bounds access follows the same rules as buffer access (load returns
zero of the element type, store is a no-op).

**Declaration:**

```
Tensor {
    element: DataType,       // F16, F32, BF16, U32, I32, etc.
    shape: Vec<u32>,         // [batch, seq, heads, dim] for example
    strides: Vec<u32>,       // Canonical (row-major) or explicit
    backing_buffer: String,  // Name of the underlying storage buffer
}
```

The tensor shape may be dynamic — declared with `DynamicDim` placeholders
that are resolved at dispatch time from a uniform buffer. This is how LLM
inference expresses variable sequence length.

**Core tensor operations:**

| Op | Semantics |
|----|-----------|
| `MatMul { a, b, accumulator_type }` | Explicit accumulator type. NO silent precision loss. |
| `Transpose { tensor, axes }` | View operation, no data copy. |
| `Reshape { tensor, new_shape }` | View operation when strides permit. Copy when they don't. |
| `Slice { tensor, ranges }` | View operation. |
| `Broadcast { tensor, target_shape }` | Stride-0 for broadcasted dims. View operation. |
| `Reduce { tensor, axis, op }` | Reduction along axis. Uses `FReduceStrict` or `FReduceTreeBinary` based on caller choice. |
| `ElementWise { op, tensors }` | Map op over tensors with matching or broadcastable shapes. |

### Binary Operations (integer)

All take two u32, produce one u32. Wrapping semantics (mod 2^32).

| Op       | Result | Notes |
|----------|--------|-------|
| `Add`    | `(a + b) mod 2^32` | |
| `Sub`    | `(a - b) mod 2^32` | |
| `Mul`    | `(a * b) mod 2^32` | Low 32 bits. |
| `Div`    | `a / b` | Truncating. `b=0 → 0`. |
| `Mod`    | `a % b` | Truncating. `b=0 → 0`. |
| `BitAnd` | `a & b` | |
| `BitOr`  | `a \| b` | |
| `BitXor` | `a ^ b` | |
| `Shl`    | `a << (b & 31)` | Masked shift. |
| `Shr`    | `a >> (b & 31)` | Logical. Masked. |
| `Eq`     | `a == b ? 1 : 0` | |
| `Ne`     | `a != b ? 1 : 0` | |
| `Lt`     | `a < b ? 1 : 0` | Unsigned. |
| `Gt`     | `a > b ? 1 : 0` | Unsigned. |
| `Le`     | `a <= b ? 1 : 0` | Unsigned. |
| `Ge`     | `a >= b ? 1 : 0` | Unsigned. |
| `And`    | `(a!=0) && (b!=0) ? 1 : 0` | Logical. |
| `Or`     | `(a!=0) \|\| (b!=0) ? 1 : 0` | Logical. |

Div/Mod by zero produce `0`. Not undefined behavior. Permanent.
Shift amounts masked to `0..31`. `a << 32` = `a << 0` = `a`. Permanent.

### Binary Operations

All take two u32, produce one u32. Wrapping semantics (mod 2^32).

| Op       | Result | Notes |
|----------|--------|-------|
| `Add`    | `(a + b) mod 2^32` | |
| `Sub`    | `(a - b) mod 2^32` | |
| `Mul`    | `(a * b) mod 2^32` | Low 32 bits. |
| `Div`    | `a / b` | Truncating. `b=0 → 0`. |
| `Mod`    | `a % b` | Truncating. `b=0 → 0`. |
| `BitAnd` | `a & b` | |
| `BitOr`  | `a \| b` | |
| `BitXor` | `a ^ b` | |
| `Shl`    | `a << (b & 31)` | Masked shift. |
| `Shr`    | `a >> (b & 31)` | Logical. Masked. |
| `Eq`     | `a == b ? 1 : 0` | |
| `Ne`     | `a != b ? 1 : 0` | |
| `Lt`     | `a < b ? 1 : 0` | Unsigned. |
| `Gt`     | `a > b ? 1 : 0` | Unsigned. |
| `Le`     | `a <= b ? 1 : 0` | Unsigned. |
| `Ge`     | `a >= b ? 1 : 0` | Unsigned. |
| `And`    | `(a!=0) && (b!=0) ? 1 : 0` | Logical. |
| `Or`     | `(a!=0) \|\| (b!=0) ? 1 : 0` | Logical. |

Div/Mod by zero produce `0`. Not undefined behavior. Permanent.
Shift amounts masked to `0..31`. `a << 32` = `a << 0` = `a`. Permanent.

### Unary Operations

| Op            | Result | Notes |
|---------------|--------|-------|
| `Negate`      | `(~a) + 1` | Two's complement. |
| `BitNot`      | `~a` | |
| `LogicalNot`  | `a == 0 ? 1 : 0` | |
| `Popcount`    | Set bit count. | 0..32. |
| `Clz`         | Leading zeros. | `Clz(0) = 32`. |
| `Ctz`         | Trailing zeros. | `Ctz(0) = 32`. |
| `ReverseBits` | Bit reversal. | `ReverseBits(1) = 0x80000000`. |

### Atomic Operations

Return value before the operation. Sequentially consistent.

| Op               | Effect |
|------------------|--------|
| `Add`            | `buf[i] += value` |
| `Or`             | `buf[i] \|= value` |
| `And`            | `buf[i] &= value` |
| `Xor`            | `buf[i] ^= value` |
| `Min`            | `buf[i] = min(buf[i], value)` |
| `Max`            | `buf[i] = max(buf[i], value)` |
| `Exchange`       | `buf[i] = value` |
| `CompareExchange`| `if buf[i] == expected { buf[i] = value }` |

Sequential consistency is checked by concrete observable properties:
`CompareExchange` may succeed at most once for a single expected value;
`Add` returns a consecutive permutation of pre-values and the final value is
the initial value plus every operand; bitwise `Or`, `And`, and `Xor` pre-values
must correspond to a valid sequential ordering; `Min` and `Max` pre-values must
move monotonically toward the final value; `Exchange` pre-values must form a
valid chain ending in the final stored value.

Out-of-bounds atomic behavior follows buffer OOB behavior. Atomic loads return
zero of the element type. Atomic stores and read-modify-write operations are
no-ops and do not modify in-bounds memory.

### Program Structure

```
Program { buffers: Vec<BufferDecl>, workgroup_size: [u32;3], entry: Vec<Node> }
BufferDecl { name, binding, access: ReadOnly|ReadWrite|Uniform|Workgroup, element: DataType, count: u32 }
```

Validation: no duplicate names/bindings, all refs resolve, stores only to
ReadWrite, vars in scope, axes 0..2, workgroup_size >= 1.

### Reference Interpreter Limits

The reference interpreter is intentionally bounded. A single call input may not
exceed 64 MiB, and a workgroup memory allocation may not exceed 64 MiB. These
limits are conformance limits, not implementation accidents: programs requiring
larger inputs or workgroup storage must tile the computation into smaller
dispatches.

### Statement Nodes

`Let`, `Assign`, `Store`, `If`, `Loop` (bounded only), `Return`, `Block`,
`Barrier`. Scoping: bindings visible until end of enclosing block.

### Barrier Semantics

`Barrier` is a workgroup-local synchronization point with four required
properties:

1. **Post-barrier visibility.** Writes to workgroup memory before the barrier
   are visible to every live invocation in the same workgroup after the barrier.
2. **Pre-barrier isolation.** Reads before the barrier cannot observe writes
   that happen only after the barrier.
3. **Workgroup scope.** A barrier does not synchronize distinct workgroups and
   must not create cross-workgroup visibility.
4. **Uniform control flow.** If a barrier appears under conditional control
   flow, the condition must evaluate identically for every live invocation in
   the workgroup. Divergent barriers are invalid programs.

### Expression Nodes

`LitU32`, `LitI32`, `LitBool`, `Var`, `Load`, `BufLen`, `InvocationId`,
`WorkgroupId`, `LocalId`, `BinOp`, `UnOp`, `Call`, `Select`, `Cast`, `Atomic`.

### Wire Format Equivalence (I4)

`Program::to_wire` followed by `Program::from_wire` must be structurally
lossless and semantically equivalent. A decoded program must produce identical
backend output to the original program for the same input, dispatch config, and
buffer initialization policy.

### Data-Race Detection

A conforming backend must not rely on data races, uninitialized reads, or
workgroup-order accidents. The data-race detector repeatedly dispatches
adversarial programs and compares byte outputs across runs. Any byte divergence
or repeated-dispatch error is an engine invariant failure.

### Calling Conventions

`V1`: input + output + params. `V2`: V1 + lookup. Extensible, never breaking.

---

## Ground Truth — Algebraic Laws

Declared law sets are characterization material, not decoration. Within a
registry and arity class, an operation's law set must be distinctive enough to
separate it from peer operations. Identical law sets and strict law supersets
are reported as characterization collisions unless another declared rule
(equivalence class, boundary value, alternate implementation, or custom law)
distinguishes the operation.

### Binary Laws

`f: (u32, u32) → u32` is the CPU reference.

| Law | Definition ||
|-----|-----------|---|
| Commutative | `f(a,b) = f(b,a)` | a,b ∈ u32 |
| Associative | `f(f(a,b),c) = f(a,f(b,c))` | a,b,c ∈ u32 |
| Identity(e) | `f(a,e) = a ∧ f(e,a) = a` | a ∈ u32 |
| SelfInverse(r) | `f(a,a) = r` | a ∈ u32 |
| Idempotent | `f(a,a) = a` | a ∈ u32 |
| Absorbing(z) | `f(a,z) = z ∧ f(z,a) = z` | a ∈ u32 |
| DistributiveOver(g) | `f(a,g(b,c)) = g(f(a,b),f(a,c))` | a,b,c ∈ u32 |

### Unary Laws

`f: u32 → u32` is the CPU reference.

| Law | Definition ||
|-----|-----------|---|
| Involution | `f(f(a)) = a` | a ∈ u32 |
| Monotone | `a ≤ b → f(a) ≤ f(b)` | a,b ∈ u32 |
| Bounded(lo,hi) | `lo ≤ f(a) ≤ hi` | a ∈ u32 |

### Cross-Operation Laws

| Law | Definition ||
|-----|-----------|---|
| DeMorgan(inner,dual) | `f(inner(a,b)) = dual(f(a),f(b))` | a,b ∈ u32 |
| Complement(g,u) | `f(a) + g(a) = u` | a ∈ u32 |

### Verification

| Method | Domain | Confidence |
|--------|--------|------------|
| Exhaustive | u8 (256 unary, 65536 binary) | Proof within domain. |
| Witnessed | u32, N ≥ 10^6 random | Statistical. |
| Combined | Exhaustive u8 + Witnessed u32 | Practical proof. |

A law is verified when it passes Combined verification.

Generated tests must be independent of the implementation under test. A test
oracle must not call the target `cpu_fn`, reuse the target WGSL, or derive the
expected value from the op being verified. The independence gate rejects
tautological tests before they enter the suite.