vyre-conform 0.1.0

# vyre-conform

## Vision

### The Abstraction Thesis

Every complex system in computing history was built bottom-up through forced
abstraction layers. The CPU path: transistor → logic gate → ALU → microcode →
instruction set → assembler → compiler → operating system → programming
language → application. Each layer was forced by physics or necessity. The
result: every intermediate layer is simple, complete, tested, and invisible.

The critical property: **abstractions do not decrease performance.** C++
templates compile to the same machine code as hand-written C. Rust's iterators
compile to the same loops as manual indexing. The abstraction exists at design
time. It vanishes at execution time.

GPU compute has not followed this path. Researchers see thousands of parallel
cores and jump directly from GPU threads to compilers, databases, and neural
network frameworks — skipping the intermediate layers because they already know
what the end looks like. This fails every time. Not because GPUs are harder
than CPUs — because the abstraction layers between "GPU core" and "GPU
application" do not exist yet.

**The thesis:** Complex systems emerge from simple layers. Each layer must be
built, tested, and trusted before the next one begins. Skipping a layer produces
a system that appears to work but collapses under weight.

vyre builds those layers. vyre-conform proves they are correct.

### What vyre-conform must become

vyre-conform is a **proof system**, not a test suite.

A test suite says "I ran 1000 inputs and they matched." Confidence is
proportional to sample count. It never reaches certainty.

A proof system says "this implementation satisfies these algebraic laws, which
uniquely characterize the operation. Verified exhaustively on u8 and witnessed
on 10^6 random u32 inputs." Confidence is mathematical.

Rust's borrow checker does not test memory safety — it proves it via type rules.
vyre-conform does not test GPU correctness — it proves it via algebraic laws,
exhaustive verification, and compositional reasoning.

### What vyre-conform must do

1. Define every operation by its algebraic identity (laws, not examples).
2. Verify those laws exhaustively on small domains and statistically on full domains.
3. Produce formal conformance certificates with coverage metrics.
4. Scale to 100M+ test cases without OOM.
5. Support engine-level invariant checking beyond individual op correctness.
6. Enable any backend (wgpu, CUDA, Metal, TPU, FPGA) to prove conformance.

### The Category A+C rule (permanent)

Every operation in vyre belongs to exactly one of two categories:

- **Category A — Compositional, zero-overhead.** The op is expressed as a
  composition of other ops. When lowered, it inlines to hand-written-
  equivalent shader code. The abstraction exists at IR construction time
  and vanishes at shader emission time. 95%+ of vyre ops are Category A.

- **Category C — Hardware intrinsic.** The op maps 1:1 to a specific
  hardware instruction that no software composition can match in
  performance (10×-1000× faster). Tensor cores, warp shuffles, sampler-
  filtered reads, async copy engines, ray-triangle intersection. A
  Category C op declares `{ hardware, backend_availability }` — the
  target hardware identifier and a predicate that says which backends
  support the intrinsic. **There is no software fallback composition.**
  A backend that does not support a Category C op returns
  `Error::UnsupportedByBackend { op, backend }` — a first-class outcome,
  not a silent degraded path. Downstream consumers who need cross-backend
  portability for an op either (a) constrain their target backend set to
  backends that support the op natively, or (b) author an explicit
  Category A composition over primitives that expresses the same
  computation. The choice is visible in code, not hidden in a runtime
  fallback. **Performance is the product**; a slow path that silently
  degrades is a footgun users write unaware and blame vyre for being
  slower than hand-written code.

**Category B — runtime abstraction overhead — is FORBIDDEN.** No virtual
dispatch. No interpreter loops. No JIT. No boxing. If an op cannot lower
to either inlined composition (A) or direct hardware intrinsic (C), it
does not belong in vyre.

This discipline is non-negotiable. It is what makes vyre composable at
arbitrary depth without accumulating cost. A Program with 20 ops composed
from 200 sub-ops lowers to ONE optimized shader, not an interpreter
dispatching through a registry.

The test for every proposed op is mechanical: does lowering produce
overhead-free code? If yes, via inlining → Category A. If yes, via
hardware mapping → Category C. If no → rejected.

### Extensibility — how vyre adapts to new algorithms

A common concern: "can vyre represent new neural architectures when they
are invented? New algorithms, new models, new techniques?"

Yes, because **vyre is an IR, not a fixed algorithm library.** New
algorithms are compositions of existing primitives. New primitives are
rare — they require new hardware. The substrate is stable; the algorithms
that run on it are not.

Examples:

- **Mamba / State Space Models** — matmul + element-wise + scan. All
  existing Category A ops. Composition.
- **Flash Attention** — matmul + softmax + masked loads. Softmax is a
  Layer 2 composition (exp + reduce + div). Masked loads are Category A
  element-wise ops. Composition.
- **Diffusion models (U-Net)** — convolutions + attention + activations.
  Convolution is a Layer 2 composition (matmul with im2col or direct).
  Attention is already a composition. Activations are element-wise.
  Composition.
- **New quantization schemes (NF4, FP8 variants)** — new DataType variants
  with declared semantics. The IR grows one enum; existing code does not
  break.
- **Sparse attention variants** — masked matmul + index gather. Index
  gather is an existing op. Masked matmul is a new compound built from
  matmul + conditional store. Composition.
- **Mixture of Experts routing** — top-k selection + indexed dispatch +
  weighted sum. Top-k is a Layer 2 op (sort + slice). Indexed dispatch is
  a scatter operation. Weighted sum is matmul. Composition.

The principle: **vyre evolves by adding primitives slowly (as hardware
changes) and by growing the composition library quickly (as researchers
invent algorithms).** New hardware is rare. New algorithms are constant.
The substrate accommodates the constant change because algorithms are
compositions, not new primitives.

What would require a new primitive:

- New hardware unit in a GPU generation (e.g., a new sparse tensor core
  not derivable from existing ops). → Add a new Category C op that
  declares `{ hardware, backend_availability }`. Backends without the
  unit return `UnsupportedByBackend`; consumers that need the op
  constrain their target backend set or compose equivalent behavior
  from Category A primitives.
- A new DataType (e.g., FP4 when it becomes mainstream). → Add an enum
  variant with strict semantics.
- A fundamentally new dispatch model (e.g., if quantum GPU compute
  becomes real). → Add a new Program variant.

Everything else — every new model architecture, every new training
algorithm, every new inference technique — is a composition expressible
in the IR as it stands. Researchers publish a paper, implementors compose
vyre ops to match the paper, conformance verifies correctness, the
algorithm runs.

## Architecture

```
vyre-conform/src/
│
├── spec/                     THE SPECIFICATION. Zero deps. Pure definitions.
│   ├── types.rs              DataType, OpSignature, Convention (mirrors ir/types.rs)
│   ├── op.rs                 OpSpec: id + signature + cpu_fn + laws + equivalence classes + boundaries
│   ├── law.rs                AlgebraicLaw: Commutative, Associative, Identity, SelfInverse, Idempotent, Absorbing, Involution, DeMorgan, Monotone, Bounded, Complement, DistributiveOver
│   ├── engine.rs             EngineSpec: engine-level specifications
│   └── invariant.rs          EngineInvariant: Deterministic, NoOutputLost, NoDuplicate, Ordered, Bounded, Termination, AtomicLinearizable, WorkgroupInvariant
│
├── algebra/                  THE PROOF ENGINE.
│   ├── checker.rs            Verify a law holds: exhaustive on u8, witnessed on u32
│   ├── laws/                 Individual law checkers
│   │   ├── commutative.rs    f(a,b) = f(b,a)
│   │   ├── associative.rs    f(f(a,b),c) = f(a,f(b,c))
│   │   ├── identity.rs       f(a,e) = a
│   │   ├── self_inverse.rs   f(a,a) = r
│   │   ├── idempotent.rs     f(a,a) = a
│   │   ├── absorbing.rs      f(a,z) = z
│   │   ├── involution.rs     f(f(a)) = a
│   │   ├── demorgan.rs       f(g(a,b)) = h(f(a),f(b))
│   │   ├── monotone.rs       a ≤ b → f(a) ≤ f(b)
│   │   ├── bounded.rs        lo ≤ f(a) ≤ hi
│   │   ├── complement.rs     f(a) + g(a) = u
│   │   └── distributive.rs   f(a, g(b,c)) = g(f(a,b), f(a,c))
│   ├── witness.rs            Minimal counterexample finder (binary search / delta debug)
│   └── coverage.rs           Coverage report: equivalence classes, boundaries, law results
│
├── generate/                 INPUT GENERATION. Driven by coverage analysis.
│   ├── exhaustive.rs         All values in domain (u8, u16)
│   ├── equivalence.rs        One representative per equivalence class
│   ├── boundary.rs           All boundary values from OpSpec
│   ├── random.rs             Uniform random with deterministic seed
│   ├── adversarial.rs        GPU failure modes: shift-by-32, div-by-0, u32::MAX
│   └── shrink.rs             Counterexample minimization
│
├── verify/                   VERIFICATION STRATEGIES. Each independent.
│   ├── parity.rs             CPU vs GPU byte-for-byte
│   ├── algebraic.rs          Verify all declared laws hold
│   ├── compositional.rs      Prove A∘B correct from A and B's laws
│   ├── invariant.rs          Engine invariant checking
│   ├── determinism.rs        Same input N times → same output N times
│   ├── regression.rs         Persistent failure replay
│   └── performance.rs        Throughput baselines
│
├── backend/                  BACKEND ABSTRACTION.
│   ├── traits.rs             VyreBackend: dispatch_op + dispatch_engine
│   └── wrap.rs               Shader wrapping for conformance dispatch
│
├── certify/                  CONFORMANCE CERTIFICATION.
│   ├── certificate.rs        Formal certificate: ops, laws, coverage, engines
│   ├── level.rs              L1 parity → L2 algebraic → L3 engine → L4 compositional
│   └── console.rs            Human-readable report
│
├── registry/                 ALL REGISTERED SPECS.
│   ├── primitive/            One file per op: cpu_fn + laws + equivalence classes
│   ├── decode/
│   ├── hash/
│   ├── string/
│   ├── collection/
│   ├── graph/
│   ├── match_ops/
│   └── engine/               Engine-level specs with invariants
│
├── suite.rs                  Orchestration: run full proof suite, produce certificate
└── lib.rs
```

### Conformance Levels

Conformance is tracked in two parallel tracks — integer and float —
because the proof strategies differ.

**Integer track (bit-exact):**

| Level | Name | Method | What it proves |
|-------|------|--------|----------------|
| L1 | Parity | CPU vs GPU byte match on generated inputs | Implementation matches reference on tested inputs. |
| L2 | Algebraic | All declared laws verified (exhaustive u8 + witnessed u32) | Implementation satisfies the mathematical identity of each operation. |
| L3 | Engine | All engine invariants verified across generated programs | Composed engines maintain correctness guarantees. |
| L4 | Full | L2 + L3 + compositional proofs + performance baselines | Production-ready: correct, composed, and fast. |

**Float track (bit-exact IEEE 754):**

| Level | Name | Method | What it proves |
|-------|------|--------|----------------|
| L1f | Float Parity | GPU matches CR-Math / strict IEEE reference bit-for-bit | Every float op satisfies strict IEEE 754 semantics. |
| L2f | Float Algebraic | Declared float laws verified (commutativity of FAdd/FMul, negation involution, abs idempotent, etc.) | Operations hold their algebraic properties under rounding. |
| L3f | Tensor | Matrix/reduce/elementwise tensor ops match reference impl bit-for-bit | Composed tensor operations are correct. |
| L4f | ML Conformance | L2f + L3f + composed engines (matmul, attention, softmax, layer norm) match reference | Backend can run real neural networks with bit-exact determinism. |

**Approximate track (for high-performance paths):**

| Level | Name | Method | What it proves |
|-------|------|--------|----------------|
| L1a | Approximate Parity | GPU matches reference within declared ULP tolerance | Implementation is "close enough" for stated use case. |

Approximate operations (e.g. `FReduceApprox`, vendor fast-math paths) are
explicitly separate from strict operations. An op labeled approximate is
tested against its tolerance, not against bit-exact. Approximate conformance
is a real thing but it is NEVER mixed with strict conformance in the same
certificate. A backend is strict OR approximate per operation, by explicit
declaration.

### Scale requirements

The harness must handle 100M+ test cases:
- **Streaming execution:** generate → dispatch → compare one at a time. Never
  hold all inputs in memory.
- **Parallel dispatch:** batch GPU dispatches for throughput.
- **Incremental reporting:** stream results, don't collect.
- **Sharded execution:** run subset by partition ID for CI parallelism.
- **Deterministic resumption:** skip N tests to resume from a failure point.
- **Regression persistence:** every failing input saved, replayed forever.

### Admission Gates

New operation specs pass six admission gates before entering the registry.
Gates 1-3 are category, signature, and zero-overhead checks. Gates 4-6 are:

4. **Deterministic CPU semantics.** The declared `cpu_fn` must return identical
   bytes for repeated calls on the same witness input. A nondeterministic
   reference cannot define conformance.
5. **Observable CPU reference.** The declared `cpu_fn` must produce at least
   one output byte for a valid witness input. Empty references are rejected
   because they cannot prove parity or preserve failure context.
6. **Conformance rules present.** Every spec must declare at least one law,
   equivalence class, boundary value, or alternate WGSL implementation. A spec
   with no rule material is untestable and is rejected.

---

## Ground Truth — IR Semantics

Everything below this line is immutable. Entries are added, never modified or
removed. If a definition is wrong, a new version is added and the old one is
preserved with a deprecation note.

### Data Types

| Type      | Size     | WGSL mapping       | Semantics |
|-----------|----------|---------------------|-----------|
| `U32`     | 4 bytes  | `u32`               | Unsigned 32-bit. Wrapping arithmetic. |
| `I32`     | 4 bytes  | `i32`               | Signed 32-bit. Two's complement wrapping. |
| `U64`     | 8 bytes  | `vec2<u32>`         | Unsigned 64-bit, emulated as (low, high). |
| `U128`    | 16 bytes | `vec4<u32>`         | Unsigned 128-bit, emulated as 4×u32. |
| `Vec2U32` | 8 bytes  | `vec2<u32>`         | Two-component u32 vector. |
| `Vec4U32` | 16 bytes | `vec4<u32>`         | Four-component u32 vector. |
| `Bool`    | 4 bytes  | `u32` (0/1)         | Boolean stored as u32. |
| `Bytes`   | 0+       | `array<u32>`        | Variable-length byte buffer. |
| `F16`     | 2 bytes  | `f16`               | **Strict** IEEE 754 binary16. See float rules below. |
| `BF16`    | 2 bytes  | `u16` + ops         | **Strict** bfloat16 (1-8-7). See float rules below. |
| `F32`     | 4 bytes  | `f32`               | **Strict** IEEE 754 binary32. See float rules below. |
| `Tensor`  | 0+       | `array<T>` + strides | Multi-dimensional tensor with declared element type and shape. |

### Float semantics — strict IEEE 754, deterministic across backends

**Rationale.** GPU float has historically been nondeterministic because
backends were *permitted* to fuse, reorder, flush, and approximate. None of
these are fundamental to floating-point. Each is a specific permission granted
to the compiler. vyre **forbids every such permission**. The result is a
deterministic float subset that is strict IEEE 754 on every conforming backend.

This is the same pattern Rust uses for memory safety. The hardware isn't
safer in Rust than in C — safe Rust forbids the patterns that cause unsafety,
so the compiled code is safe by construction. vyre forbids the float patterns
that cause nondeterminism, so vyre float code is bit-exact across vendors by
construction.

**Forbidden patterns (permanent):**

1. **FMA fusion.** `FMul(a, b) + FAdd(·, c)` is two roundings. It is NOT
   lowered to a hardware FMA instruction. If the contributor wants one
   rounding, they explicitly write `FMulAdd(a, b, c)`. The choice is in the IR,
   never in the backend.

2. **Reduction reordering.** There is no `FReduceUnordered`. Two reduction
   ops exist: `FReduceStrict(op, array)` (left-to-right sequential) and
   `FReduceTreeBinary(op, array)` (balanced binary tree in a canonical shape).
   Both produce bit-exact results across backends. Parallel-tree reductions use
   the canonical tree — same shape on every GPU.

3. **Subnormal flushing.** Subnormals are always preserved. Backends must
   configure the GPU to disable flush-to-zero mode. Slower on GPUs that default
   to FTZ, but deterministic.

4. **Vendor transcendentals.** `FSin`, `FCos`, `FTan`, `FExp`, `FLog`, `FPow`,
   `FExp2`, `FLog2` must be **correctly-rounded** per IEEE 754-2019 CR-Math.
   Backends must use CR-Math implementations (e.g. CORE-MATH), not vendor
   fast-math libraries. Slower than cuBLAS `sinf`, but bit-exact everywhere.

5. **Division and square root approximations.** `FDiv` and `FSqrt` must be
   correctly-rounded IEEE 754. Hardware `rcp` and `rsqrt` approximations are
   forbidden.

6. **Tensor core precision downgrade.** `MatMul { accumulator: F32 }` MUST use
   F32 accumulators. If the backend uses tensor cores, it must use cores that
   correctly implement the declared precision. Silent TF32 substitution is
   forbidden. Backends without sufficient tensor core precision must fall back
   to software F32 matmul.

**Required operations (strict IEEE 754):**

| Op            | Semantics | Notes |
|---------------|-----------|-------|
| `FAdd(a,b)`   | IEEE 754 add, round-to-nearest-even | Commutative. NOT associative. |
| `FSub(a,b)`   | IEEE 754 sub, round-to-nearest-even | |
| `FMul(a,b)`   | IEEE 754 mul, round-to-nearest-even | Commutative. NOT associative. |
| `FDiv(a,b)`   | IEEE 754 div, round-to-nearest-even | `FDiv(x, 0.0) = +Inf/-Inf/NaN per 754`. |
| `FMulAdd(a,b,c)` | IEEE 754 fused multiply-add (ONE rounding) | Explicit fusion only. |
| `FSqrt(a)`    | IEEE 754 sqrt, round-to-nearest-even | `FSqrt(negative) = NaN`. |
| `FMin(a,b)`   | IEEE 754-2019 `minimumNumber` | Defined NaN handling. |
| `FMax(a,b)`   | IEEE 754-2019 `maximumNumber` | |
| `FNeg(a)`     | Sign bit flip | Bit-exact. Even on NaN. |
| `FAbs(a)`     | Clear sign bit | Bit-exact. |
| `FEq, FLt, FGt, FLe, FGe` | IEEE 754 comparisons | NaN comparisons are false. |
| `FIsNaN(a)`   | `a != a` | |
| `FIsInf(a)`   | Exponent all 1, mantissa 0 | |
| `FSin, FCos, FTan, FASin, FACos, FATan, FATan2` | CR-Math correctly rounded | Slower than vendor. Deterministic. |
| `FExp, FLog, FExp2, FLog2, FExp10, FLog10` | CR-Math correctly rounded | |
| `FPow(a,b)`   | CR-Math correctly rounded | |
| `FFloor, FCeil, FRound, FTrunc` | IEEE 754 round-to-X | |

**Conversion operations:**

| From → To | Semantics |
|-----------|-----------|
| `F32 → F16` | IEEE 754 round-to-nearest-even. Overflow → ±Inf. |
| `F16 → F32` | Exact (F16 is a strict subset). |
| `F32 → BF16` | IEEE 754 round-to-nearest-even. |
| `BF16 → F32` | Exact. |
| `F32 → I32` | Round per declared mode (truncate/nearest/floor/ceil). Overflow → saturation to I32::MIN/MAX, not UB. |
| `I32 → F32` | Exact for `|x| < 2^24`, round-to-nearest-even otherwise. |
| `U32 → F32` | Same. |

**Reduction operations:**

| Op | Semantics |
|----|-----------|
| `FReduceStrict(op, array)` | Left-to-right sequential. `fold(op, array[0], array[1..])`. Slow but trivially deterministic. |
| `FReduceTreeBinary(op, array)` | Balanced binary tree in canonical shape. `reduce(reduce(a..mid), reduce(mid..n))`. Faster. Bit-exact across backends that follow the canonical shape. |

There is no unordered reduction. Contributors who want speed at the cost of
precision use `FReduceApprox(op, array, tolerance_ulps)` which is EXPLICITLY
labeled approximate and carries a declared ULP tolerance. Conformance testing
uses the tolerance, not bit-exact comparison. Approximate reductions are a
separate conformance class, never mixed with strict operations.

### Tensor semantics

A `Tensor` is a multi-dimensional view over a flat buffer with explicit
strides and element type. Indexing is linearized per the declared strides.
Out-of-bounds access follows the same rules as buffer access (load returns
zero of the element type, store is a no-op).

**Declaration:**

```
Tensor {
    element: DataType,       // F16, F32, BF16, U32, I32, etc.
    shape: Vec<u32>,         // [batch, seq, heads, dim] for example
    strides: Vec<u32>,       // Canonical (row-major) or explicit
    backing_buffer: String,  // Name of the underlying storage buffer
}
```

The tensor shape may be dynamic — declared with `DynamicDim` placeholders
that are resolved at dispatch time from a uniform buffer. This is how LLM
inference expresses variable sequence length.

**Core tensor operations:**

| Op | Semantics |
|----|-----------|
| `MatMul { a, b, accumulator_type }` | Explicit accumulator type. NO silent precision loss. |
| `Transpose { tensor, axes }` | View operation, no data copy. |
| `Reshape { tensor, new_shape }` | View operation when strides permit. Copy when they don't. |
| `Slice { tensor, ranges }` | View operation. |
| `Broadcast { tensor, target_shape }` | Stride-0 for broadcasted dims. View operation. |
| `Reduce { tensor, axis, op }` | Reduction along axis. Uses `FReduceStrict` or `FReduceTreeBinary` based on caller choice. |
| `ElementWise { op, tensors }` | Map op over tensors with matching or broadcastable shapes. |

### Binary Operations (integer)

All take two u32, produce one u32. Wrapping semantics (mod 2^32).

| Op       | Result | Notes |
|----------|--------|-------|
| `Add`    | `(a + b) mod 2^32` | |
| `Sub`    | `(a - b) mod 2^32` | |
| `Mul`    | `(a * b) mod 2^32` | Low 32 bits. |
| `Div`    | `a / b` | Truncating. `b=0 → 0`. |
| `Mod`    | `a % b` | Truncating. `b=0 → 0`. |
| `BitAnd` | `a & b` | |
| `BitOr`  | `a \| b` | |
| `BitXor` | `a ^ b` | |
| `Shl`    | `a << (b & 31)` | Masked shift. |
| `Shr`    | `a >> (b & 31)` | Logical. Masked. |
| `Eq`     | `a == b ? 1 : 0` | |
| `Ne`     | `a != b ? 1 : 0` | |
| `Lt`     | `a < b ? 1 : 0` | Unsigned. |
| `Gt`     | `a > b ? 1 : 0` | Unsigned. |
| `Le`     | `a <= b ? 1 : 0` | Unsigned. |
| `Ge`     | `a >= b ? 1 : 0` | Unsigned. |
| `And`    | `(a!=0) && (b!=0) ? 1 : 0` | Logical. |
| `Or`     | `(a!=0) \|\| (b!=0) ? 1 : 0` | Logical. |

Div/Mod by zero produce `0`. Not undefined behavior. Permanent.
Shift amounts masked to `0..31`. `a << 32` = `a << 0` = `a`. Permanent.

### Binary Operations

All take two u32, produce one u32. Wrapping semantics (mod 2^32).

| Op       | Result | Notes |
|----------|--------|-------|
| `Add`    | `(a + b) mod 2^32` | |
| `Sub`    | `(a - b) mod 2^32` | |
| `Mul`    | `(a * b) mod 2^32` | Low 32 bits. |
| `Div`    | `a / b` | Truncating. `b=0 → 0`. |
| `Mod`    | `a % b` | Truncating. `b=0 → 0`. |
| `BitAnd` | `a & b` | |
| `BitOr`  | `a \| b` | |
| `BitXor` | `a ^ b` | |
| `Shl`    | `a << (b & 31)` | Masked shift. |
| `Shr`    | `a >> (b & 31)` | Logical. Masked. |
| `Eq`     | `a == b ? 1 : 0` | |
| `Ne`     | `a != b ? 1 : 0` | |
| `Lt`     | `a < b ? 1 : 0` | Unsigned. |
| `Gt`     | `a > b ? 1 : 0` | Unsigned. |
| `Le`     | `a <= b ? 1 : 0` | Unsigned. |
| `Ge`     | `a >= b ? 1 : 0` | Unsigned. |
| `And`    | `(a!=0) && (b!=0) ? 1 : 0` | Logical. |
| `Or`     | `(a!=0) \|\| (b!=0) ? 1 : 0` | Logical. |

Div/Mod by zero produce `0`. Not undefined behavior. Permanent.
Shift amounts masked to `0..31`. `a << 32` = `a << 0` = `a`. Permanent.

### Unary Operations

| Op            | Result | Notes |
|---------------|--------|-------|
| `Negate`      | `(~a) + 1` | Two's complement. |
| `BitNot`      | `~a` | |
| `LogicalNot`  | `a == 0 ? 1 : 0` | |
| `Popcount`    | Set bit count. | 0..32. |
| `Clz`         | Leading zeros. | `Clz(0) = 32`. |
| `Ctz`         | Trailing zeros. | `Ctz(0) = 32`. |
| `ReverseBits` | Bit reversal. | `ReverseBits(1) = 0x80000000`. |

### Atomic Operations

Return value before the operation. Sequentially consistent.

| Op               | Effect |
|------------------|--------|
| `Add`            | `buf[i] += value` |
| `Or`             | `buf[i] \|= value` |
| `And`            | `buf[i] &= value` |
| `Xor`            | `buf[i] ^= value` |
| `Min`            | `buf[i] = min(buf[i], value)` |
| `Max`            | `buf[i] = max(buf[i], value)` |
| `Exchange`       | `buf[i] = value` |
| `CompareExchange`| `if buf[i] == expected { buf[i] = value }` |

Sequential consistency is checked by concrete observable properties:
`CompareExchange` may succeed at most once for a single expected value;
`Add` returns a consecutive permutation of pre-values and the final value is
the initial value plus every operand; bitwise `Or`, `And`, and `Xor` pre-values
must correspond to a valid sequential ordering; `Min` and `Max` pre-values must
move monotonically toward the final value; `Exchange` pre-values must form a
valid chain ending in the final stored value.

Out-of-bounds atomic behavior follows buffer OOB behavior. Atomic loads return
zero of the element type. Atomic stores and read-modify-write operations are
no-ops and do not modify in-bounds memory.

### Program Structure

```
Program { buffers: Vec<BufferDecl>, workgroup_size: [u32;3], entry: Vec<Node> }
BufferDecl { name, binding, access: ReadOnly|ReadWrite|Uniform|Workgroup, element: DataType, count: u32 }
```

Validation: no duplicate names/bindings, all refs resolve, stores only to
ReadWrite, vars in scope, axes 0..2, workgroup_size >= 1.

### Reference Interpreter Limits

The reference interpreter is intentionally bounded. A single call input may not
exceed 64 MiB, and a workgroup memory allocation may not exceed 64 MiB. These
limits are conformance limits, not implementation accidents: programs requiring
larger inputs or workgroup storage must tile the computation into smaller
dispatches.

### Statement Nodes

`Let`, `Assign`, `Store`, `If`, `Loop` (bounded only), `Return`, `Block`,
`Barrier`. Scoping: bindings visible until end of enclosing block.

### Barrier Semantics

`Barrier` is a workgroup-local synchronization point with four required
properties:

1. **Post-barrier visibility.** Writes to workgroup memory before the barrier
   are visible to every live invocation in the same workgroup after the barrier.
2. **Pre-barrier isolation.** Reads before the barrier cannot observe writes
   that happen only after the barrier.
3. **Workgroup scope.** A barrier does not synchronize distinct workgroups and
   must not create cross-workgroup visibility.
4. **Uniform control flow.** If a barrier appears under conditional control
   flow, the condition must evaluate identically for every live invocation in
   the workgroup. Divergent barriers are invalid programs.

### Expression Nodes

`LitU32`, `LitI32`, `LitBool`, `Var`, `Load`, `BufLen`, `InvocationId`,
`WorkgroupId`, `LocalId`, `BinOp`, `UnOp`, `Call`, `Select`, `Cast`, `Atomic`.

### Wire Format Equivalence (I4)

`Program::to_wire` followed by `Program::from_wire` must be structurally
lossless and semantically equivalent. A decoded program must produce identical
backend output to the original program for the same input, dispatch config, and
buffer initialization policy.

### Data-Race Detection

A conforming backend must not rely on data races, uninitialized reads, or
workgroup-order accidents. The data-race detector repeatedly dispatches
adversarial programs and compares byte outputs across runs. Any byte divergence
or repeated-dispatch error is an engine invariant failure.

### Calling Conventions

`V1`: input + output + params. `V2`: V1 + lookup. Extensible, never breaking.

---

## Ground Truth — Algebraic Laws

Declared law sets are characterization material, not decoration. Within a
registry and arity class, an operation's law set must be distinctive enough to
separate it from peer operations. Identical law sets and strict law supersets
are reported as characterization collisions unless another declared rule
(equivalence class, boundary value, alternate implementation, or custom law)
distinguishes the operation.

### Binary Laws

`f: (u32, u32) → u32` is the CPU reference.

| Law | Definition | ∀ |
|-----|-----------|---|
| Commutative | `f(a,b) = f(b,a)` | a,b ∈ u32 |
| Associative | `f(f(a,b),c) = f(a,f(b,c))` | a,b,c ∈ u32 |
| Identity(e) | `f(a,e) = a ∧ f(e,a) = a` | a ∈ u32 |
| SelfInverse(r) | `f(a,a) = r` | a ∈ u32 |
| Idempotent | `f(a,a) = a` | a ∈ u32 |
| Absorbing(z) | `f(a,z) = z ∧ f(z,a) = z` | a ∈ u32 |
| DistributiveOver(g) | `f(a,g(b,c)) = g(f(a,b),f(a,c))` | a,b,c ∈ u32 |

### Unary Laws

`f: u32 → u32` is the CPU reference.

| Law | Definition | ∀ |
|-----|-----------|---|
| Involution | `f(f(a)) = a` | a ∈ u32 |
| Monotone | `a ≤ b → f(a) ≤ f(b)` | a,b ∈ u32 |
| Bounded(lo,hi) | `lo ≤ f(a) ≤ hi` | a ∈ u32 |

### Cross-Operation Laws

| Law | Definition | ∀ |
|-----|-----------|---|
| DeMorgan(inner,dual) | `f(inner(a,b)) = dual(f(a),f(b))` | a,b ∈ u32 |
| Complement(g,u) | `f(a) + g(a) = u` | a ∈ u32 |

### Verification

| Method | Domain | Confidence |
|--------|--------|------------|
| Exhaustive | u8 (256 unary, 65536 binary) | Proof within domain. |
| Witnessed | u32, N ≥ 10^6 random | Statistical. |
| Combined | Exhaustive u8 + Witnessed u32 | Practical proof. |

A law is verified when it passes Combined verification.

Generated tests must be independent of the implementation under test. A test
oracle must not call the target `cpu_fn`, reuse the target WGSL, or derive the
expected value from the op being verified. The independence gate rejects
tautological tests before they enter the suite.