vyre 0.4.0 - Docs.rs

# vyre

## Vision

vyre is a GPU compute intermediate representation. The same kind of thing as
LLVM IR, WebAssembly, or SQL — but for GPU compute.

Any workload compiles to vyre IR. Any GPU backend executes it. The IR is the
contract. Implementations come and go. The specification is forever.

This crate is the reference implementation using wgpu/WGSL. It is one of many
possible implementations. The ground truth specification — IR semantics,
algebraic laws, conformance levels — lives in `vyre-conform/SPEC.md`.

## IR and the wire format — two representations, one spec

vyre has **one** semantic model: the IR. `ir::Program` is the authoritative representation of a computation — what types exist, what operations execute, what invariants hold, what a conforming backend must reproduce bit-exactly.

The **IR wire format** is the compact, versioned, binary serialization of that IR. It is not a separate semantic model, not a second program format, not a VM or opcode stream with its own rules. It is a wire-format encoding of the same `ir::Program` — the way `.wasm` is to `.wat`, or the way a serialized protobuf is to its in-memory message.

> **Terminology note — the word "bytecode" is retired from vyre.** Earlier versions of vyre shipped a `bytecode` module that was a separate stack-machine VM with its own opcode semantics. That module has been deleted. vyre has no VM, no opcode interpreter, no execution path that is not a lowered IR program on a backend. The only binary representation of a vyre program is the IR wire format, which is a lossless serialization of the IR.

- You **author** programs as IR (via `Program::builder()` or a frontend that emits IR).
- You **store or transport** programs as wire-format bytes (`program.to_wire()` / `Program::from_wire()`).
- You **execute** programs by lowering IR to a backend (WGSL, CUDA, SPIR-V, or whatever future vendors contribute). The wire-format bytes are decoded back to IR before lowering — there is no wire-format interpreter, no opcode VM, no execution path that bypasses IR.
- **Round-trip is lossless:** `from_wire(to_wire(p)) == p` for every valid `p`. This is invariant I4. A codec that loses information is a codec bug, not an accepted limitation.

When the spec says `Div(5, 0) = 0`, that is true for `ir::Program` and for the wire-format bytes that serialize it. The codec cannot alter semantics; it can only encode and decode them.
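
The round-trip invariant and the codec's neutrality can be sketched with a toy model. Everything in the block is a hypothetical stand-in: the `Expr` type, the one-byte tags, and the `to_wire` / `from_wire` / `eval` functions are invented for illustration and do not reflect vyre's actual IR or wire layout. Only the two rules it exercises, `from_wire(to_wire(p)) == p` and `Div(5, 0) = 0`, come from the text above.

```rust
/// Toy expression IR. A hypothetical stand-in, not vyre's `ir::Program`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Lit(u32),
    Div(Box<Expr>, Box<Expr>),
}

/// Evaluate with the spec rule quoted above: division by zero yields 0.
fn eval(e: &Expr) -> u32 {
    match e {
        Expr::Lit(v) => *v,
        Expr::Div(a, b) => eval(a).checked_div(eval(b)).unwrap_or(0),
    }
}

/// Encode in prefix order. The tag bytes are invented for this sketch.
fn to_wire(e: &Expr, out: &mut Vec<u8>) {
    match e {
        Expr::Lit(v) => {
            out.push(0x00);
            out.extend_from_slice(&v.to_le_bytes());
        }
        Expr::Div(a, b) => {
            out.push(0x01);
            to_wire(a, out);
            to_wire(b, out);
        }
    }
}

/// Decode one expression and return it with the unread remainder.
/// Returns `None` on truncated input or an unknown tag.
fn from_wire(bytes: &[u8]) -> Option<(Expr, &[u8])> {
    let (&tag, rest) = bytes.split_first()?;
    match tag {
        0x00 => {
            if rest.len() < 4 {
                return None;
            }
            let (raw, rest) = rest.split_at(4);
            let v = u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]);
            Some((Expr::Lit(v), rest))
        }
        0x01 => {
            let (a, rest) = from_wire(rest)?;
            let (b, rest) = from_wire(rest)?;
            Some((Expr::Div(Box::new(a), Box::new(b)), rest))
        }
        _ => None,
    }
}
```

Evaluating the decoded value gives the same answer as evaluating the original because the codec never touches `eval`: it only moves the tree in and out of bytes.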

---



## Architecture

```
vyre/src/
│
├── ir/                  Zero deps. Zero feature gates. THE CONSTITUTION.
│   ├── types.rs         DataType, OpSignature, Convention, BinOp, UnOp, AtomicOp, BufferAccess
│   ├── expr.rs          Expression nodes: Literal, Var, Load, BufLen, InvocationId, BinOp, UnOp, Call, Select, Cast, Atomic
│   ├── node.rs          Statement nodes: Let, Assign, Store, If, Loop, Return, Block, Barrier
│   ├── program.rs       Program { buffers, workgroup_size, entry }, BufferDecl
│   ├── validate.rs      Type-check a Program: buffer refs, scoping, access modes
│   └── visit.rs         IR tree walker for passes and analysis
│
├── ops/                 Standard operation library. Depends ONLY on ir/.
│   ├── mod.rs           Op trait: id() + program() + signature() + version()
│   ├── primitive/       L1 — atoms: bitwise, arithmetic, comparison
│   ├── decode/          L2 — base64, hex, url, unicode, xor_sweep
│   ├── hash/            L2 — fnv1a, crc32, rolling, entropy
│   ├── string/          L2 — search, compare, normalize, tokenize
│   ├── collection/      L2 — sort, filter, reduce, prefix_sum, scatter, gather
│   ├── graph/           L2 — bfs, reachability, components, pagerank, fixpoint
│   └── match_ops/       L2 — proximity, count, scope, sequential, dfa_scan
│
├── lower/               IR → target code. Depends ONLY on ir/.
│   └── wgsl.rs          Program → WGSL string. The reference lowering.
│
├── optimize/            IR → IR. Depends ONLY on ir/.
│   ├── constant_fold.rs Compile-time evaluable expressions → literals
│   ├── dead_code.rs     Unreachable branches, unused bindings
│   └── fuse.rs          Adjacent ops → merged body, shared buffer reads
│
├── serialize/           Binary .vyre format. Depends ONLY on ir/.
│   ├── encode.rs        Program → bytes
│   └── decode.rs        bytes → Program
│
├── runtime/             GPU execution. feature="gpu". Depends on lower/ + wgpu.
│   ├── device.rs        Shared device singleton
│   ├── buffer.rs        Buffer pool, reusable allocations
│   ├── shader.rs        Pipeline cache
│   └── dispatch.rs      Program → lower → compile → dispatch → readback
│
├── engine/              L3 composed programs. feature="gpu". Depends on ops/ + runtime/.
│   ├── dfa.rs           DFA pattern matching
│   ├── eval.rs          IR wire format condition evaluation
│   ├── scatter.rs       Match-to-rule scatter
│   ├── dataflow.rs      Graph fixpoint iteration
│   ├── decode.rs        Bulk decode pipeline
│   ├── prefix.rs        Parallel prefix sum
│   └── tokenize.rs      Tokenization pipeline
│
├── ir/wire/             Wire format for compact Program serialization.
│   │                    NOT an execution model. Depends on ir/.
│   ├── instruction.rs   Wire-format instruction encodings
│   ├── program.rs       wire::Program, ProgramBuilder
│   └── to_ir.rs         wire::Program → ir::Program (the only path)
│
├── error.rs
└── lib.rs
```

### Dependency graph

```
ir/  ←── ops/
 ↑        ↑
 ├── lower/
 ├── optimize/
 ├── serialize/
 ├── ir/wire/
 │
 └── runtime/  ←── engine/
```

Nothing points up. Nothing crosses layers. `ir/` is a leaf with zero outgoing
edges: it depends on nothing. Every other module depends on `ir/` and nothing
else at its layer.

### Feature gates

```toml
default = []              # ir + ops + lower + optimize + serialize. NO GPU.
gpu = ["wgpu", ...]       # + runtime + engine
```

`cargo add vyre` gives the IR, all ops, WGSL lowering, optimization, and
serialization — without pulling in wgpu. Only the final executor needs `gpu`.

### Layers

**Layer 0 — Runtime.** GPU device, queue, buffer pool, shader cache, dispatch.
One shared device, created once. Buffer pool recycles allocations. Shader cache
prevents recompilation.

**Layer 1 — Primitive ops.** The atoms. Bitwise, arithmetic, comparison. Each
is one `OpSpec` that emits one `ir::Program`. Domain-agnostic — a database, a
compiler, and a scanner all use the same primitives.

**Layer 2 — Compound ops.** Composed from L1. Decode, hash, string, collection,
graph, match. Each is one `OpSpec` whose `Program` contains `Call` nodes to L1 ops.

**Layer 3 — Engines.** Complete compute pipelines. DFA, eval, scatter, dataflow.
An engine is now treated as an op-compatible boundary: it has a stable id,
signature, CPU reference, typed wire format, and GPU workflow. Stage 1 keeps the
legacy `EngineSpec` API intact and adapts registered engines through
`conform::enforce::engine_composition::EngineOpSpec`.
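
Layer 0's "shader cache prevents recompilation" can be illustrated with a small memoizing map. This is a sketch under stated assumptions: `Pipeline`, `ShaderCache`, and `get_or_compile` are hypothetical names, the cache key (the full shader source) is a guess, and the "compile" step is a counter rather than a real wgpu call.

```rust
use std::collections::HashMap;

/// Stand-in for a compiled GPU pipeline. Hypothetical: a real runtime
/// would hold a backend pipeline handle here.
#[derive(Debug, Clone, PartialEq)]
struct Pipeline {
    source_len: usize,
}

/// Cache keyed by shader source text, so each distinct shader compiles once.
struct ShaderCache {
    pipelines: HashMap<String, Pipeline>,
    compile_count: usize,
}

impl ShaderCache {
    fn new() -> Self {
        ShaderCache {
            pipelines: HashMap::new(),
            compile_count: 0,
        }
    }

    /// Return the cached pipeline, compiling only on the first request.
    fn get_or_compile(&mut self, wgsl: &str) -> &Pipeline {
        if !self.pipelines.contains_key(wgsl) {
            // Stands in for the expensive shader compilation.
            self.compile_count += 1;
            let pipeline = Pipeline { source_len: wgsl.len() };
            self.pipelines.insert(wgsl.to_string(), pipeline);
        }
        &self.pipelines[wgsl]
    }
}
```

Requesting the same source twice leaves `compile_count` at 1; a different source raises it to 2.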

### The operation model

```rust
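struct OpSpec {
    /// Stable identifier for the op.
    id: &'static str,
    /// Declared signature of the op.
    signature: OpSignature,
    /// CPU reference implementation over raw bytes.
    cpu_fn: fn(&[u8]) -> Vec<u8>,
    /// Builds the IR `Program` for ops that provide one (Category A).
    ir_program: Option<fn() -> Program>,
}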
```

The old runtime `trait Op` / `emit_wgsl()` split is gone. Public operations are
declarative specs. Category A ops provide a `Program`; Category C ops declare the
required intrinsic and backend availability.
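
A declarative spec of this shape might be filled in as follows. The `OpSpec` fields are the ones shown above; the `DataType`, `OpSignature`, and `Program` definitions are minimal stand-ins so the sketch compiles alone, and the op id and program body are hypothetical.

```rust
/// Stand-ins for vyre's types (hypothetical shapes, not the real ones).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Bytes,
}

#[derive(Debug, Clone, PartialEq)]
struct OpSignature {
    inputs: Vec<DataType>,
    output: DataType,
}

#[derive(Debug, Clone, PartialEq)]
struct Program {
    body: String,
}

struct OpSpec {
    id: &'static str,
    signature: OpSignature,
    cpu_fn: fn(&[u8]) -> Vec<u8>,
    ir_program: Option<fn() -> Program>,
}

/// CPU reference: bitwise NOT of every input byte.
fn not_cpu(input: &[u8]) -> Vec<u8> {
    input.iter().map(|b| !*b).collect()
}

/// Placeholder IR body; a real op would build IR nodes, not a string.
fn not_program() -> Program {
    Program {
        body: "out[i] = ~in[i]".to_string(),
    }
}

/// A Category A style spec: plain data plus two function pointers.
fn bitwise_not_spec() -> OpSpec {
    OpSpec {
        id: "primitive.bitwise.not", // hypothetical id
        signature: OpSignature {
            inputs: vec![DataType::Bytes],
            output: DataType::Bytes,
        },
        cpu_fn: not_cpu,
        ir_program: Some(not_program as fn() -> Program),
    }
}
```

Because the spec is plain data, a conformance harness can call `cpu_fn` directly and compare its bytes against whatever a backend produces for the same `Program`.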

Engines cannot be blindly collapsed into this struct yet because current engine
inputs are serialized workflow envelopes while current op signatures only
distinguish broad types such as `Bytes`. The minimum bridge trait is
`ComposableOp`: `id`, `signature`, `input_wire_format`, `output_wire_format`,
and `cpu_reference`. Type compatibility is checked first with `DataType`, then
with the semantic wire format. The bridge rejects unsafe pairs with actionable
`Fix:` diagnostics instead of dispatching corrupt byte streams.
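
The two-stage check can be sketched as below. The trait member names come from the paragraph above; their exact signatures, the `WireFormat` variants, the single-input `OpSignature`, the `StubOp` helper, and the diagnostic wording are all assumptions made for illustration.

```rust
/// Stand-in types so the sketch compiles alone; vyre's definitions may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Bytes,
    U32,
}

/// Semantic meaning of a byte stream. Variants are invented for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum WireFormat {
    RawBytes,
    MatchList,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct OpSignature {
    input: DataType,
    output: DataType,
}

/// The five members named in the text; the signatures are assumptions.
trait ComposableOp {
    fn id(&self) -> &'static str;
    fn signature(&self) -> OpSignature;
    fn input_wire_format(&self) -> WireFormat;
    fn output_wire_format(&self) -> WireFormat;
    fn cpu_reference(&self, input: &[u8]) -> Vec<u8>;
}

/// Check `DataType` first, then the semantic wire format.
fn check_compose<P: ComposableOp, C: ComposableOp>(
    producer: &P,
    consumer: &C,
) -> Result<(), String> {
    let (out_ty, in_ty) = (producer.signature().output, consumer.signature().input);
    if out_ty != in_ty {
        return Err(format!(
            "`{}` outputs {:?} but `{}` takes {:?}. Fix: insert a conversion op between them.",
            producer.id(), out_ty, consumer.id(), in_ty
        ));
    }
    let (out_fmt, in_fmt) = (producer.output_wire_format(), consumer.input_wire_format());
    if out_fmt != in_fmt {
        return Err(format!(
            "`{}` emits {:?} but `{}` reads {:?}. Fix: re-encode the stream or choose a matching op.",
            producer.id(), out_fmt, consumer.id(), in_fmt
        ));
    }
    Ok(())
}

/// Minimal op used to exercise the check; it echoes its input.
struct StubOp {
    id: &'static str,
    signature: OpSignature,
    reads: WireFormat,
    emits: WireFormat,
}

impl ComposableOp for StubOp {
    fn id(&self) -> &'static str { self.id }
    fn signature(&self) -> OpSignature { self.signature }
    fn input_wire_format(&self) -> WireFormat { self.reads }
    fn output_wire_format(&self) -> WireFormat { self.emits }
    fn cpu_reference(&self, input: &[u8]) -> Vec<u8> { input.to_vec() }
}
```

Two ops that both carry `Bytes` still fail the second stage when one emits `MatchList` and the other reads `RawBytes`. That is the case a `DataType`-only check would let through.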

### Lowering pipeline

```
Program (IR)
    ├──▶ optimize/ (fusion, constant folding, dead code)
    ├──▶ lower/wgsl.rs → WGSL string (reference)
    ├──▶ lower/spirv.rs → SPIR-V (future)
    ├──▶ lower/ptx.rs → CUDA PTX (future)
    └──▶ lower/msl.rs → Metal MSL (future)
```
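
The first two stages of the pipeline, an IR-to-IR pass followed by lowering to a WGSL string, can be sketched on a toy expression type. The `Expr` enum and the `fold` and `lower_wgsl` functions are hypothetical and far smaller than vyre's IR; wrapping `u32` arithmetic in the folder is an assumption of this sketch, not a statement about the spec.

```rust
/// Toy expression IR (hypothetical, much smaller than vyre's `ir::Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Lit(u32),
    Var(&'static str),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Constant folding: collapse operators whose operands are both literals.
/// Operands are folded first, so nested constants collapse bottom-up.
fn fold(e: Expr) -> Expr {
    match e {
        Expr::Add(a, b) => match (fold(*a), fold(*b)) {
            (Expr::Lit(x), Expr::Lit(y)) => Expr::Lit(x.wrapping_add(y)),
            (a, b) => Expr::Add(Box::new(a), Box::new(b)),
        },
        Expr::Mul(a, b) => match (fold(*a), fold(*b)) {
            (Expr::Lit(x), Expr::Lit(y)) => Expr::Lit(x.wrapping_mul(y)),
            (a, b) => Expr::Mul(Box::new(a), Box::new(b)),
        },
        other => other,
    }
}

/// Lower to a WGSL expression string. The `u` suffix marks a u32 literal.
fn lower_wgsl(e: &Expr) -> String {
    match e {
        Expr::Lit(v) => format!("{}u", v),
        Expr::Var(name) => (*name).to_string(),
        Expr::Add(a, b) => format!("({} + {})", lower_wgsl(a), lower_wgsl(b)),
        Expr::Mul(a, b) => format!("({} * {})", lower_wgsl(a), lower_wgsl(b)),
    }
}
```

Lowering `x + 2 * 3` directly yields `(x + (2u * 3u))`; folding first yields `(x + 6u)`. Both passes consume and produce the same IR type, which is what lets additional backends reuse the optimizer unchanged.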

### Execution model — ONE path

vyre has exactly one execution path:

```
source → ir::Program → lower → target shader → GPU dispatch → result
```

There is no "interpreted" path. There is no VM that executes the IR wire format.
Every Program — regardless of how it was produced — goes through the
same lowering and produces one specialized shader that runs once per
dispatch.

### The IR wire format is a serialization, not an execution model

The `ir/wire/` module is a compact serialization format for `ir::Program`s.
It is historically the original format vyre supported, from before the
structured IR existed. It stays because:

1. It is compact — 8 bytes per instruction vs larger IR serialization.
2. Existing frontends (surgec legacy path) emit IR wire format directly.
3. It is convenient for network transport and disk storage.

But it is NOT an execution model. When a wire-format program arrives at
the runtime, it is immediately converted to `ir::Program` by the converter in
`ir/wire/to_ir.rs`, then lowered and dispatched via the standard path.
Each of the 102 wire-format opcodes maps to an IR subtree, so the converter is
a mechanical translator.

The legacy `eval_shader` (the hand-written WGSL switch statement that
interpreted IR wire format at runtime) is being removed. Its per-opcode cases
become per-op entries in the standard library. The result: no runtime
interpretation, one execution path, full shader specialization per
workload (only ops actually used appear in the generated code).

Frontends have two choices for emitting programs:
- **Direct IR**: construct `ir::Program` in memory, serialize via
  `serialize/`. Larger on disk but easier to inspect, optimize, and
  extend.
- **IR wire format**: emit compact IR wire format for storage/transport, converted
  to IR at load time. Good for high-volume rule distribution.

Both paths terminate in the same execution pipeline.

### The Category A+C rule — every op is provably zero-overhead

Every op in vyre is either:

- **Category A — Compositional.** Derived from other ops. Inlines
  completely during lowering. Produces hand-written-equivalent shader
  code. 95%+ of ops.
- **Category C — Hardware intrinsic.** Maps 1:1 to a specific hardware
  instruction that software cannot match in performance. Tensor core
  MMA, warp shuffles, sampler reads, async copies, ray tracing cores.
  If hardware is missing on a backend, the op is unavailable and the implementation must return `Error::UnsupportedByBackend { op, backend }`.
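
The availability rule for Category C can be sketched as a lookup that either succeeds or returns the error named above. The variant `Error::UnsupportedByBackend { op, backend }` is from the text; the field types, the `Category` and `Backend` types, and the intrinsic names are assumptions for illustration.

```rust
/// Error variant named in the text; the field types are assumptions.
#[derive(Debug, PartialEq)]
enum Error {
    UnsupportedByBackend {
        op: &'static str,
        backend: &'static str,
    },
}

/// Hypothetical classification of an op.
enum Category {
    /// Compositional: inlines during lowering, available on every backend.
    A,
    /// Hardware intrinsic: needs the named instruction on the backend.
    C { intrinsic: &'static str },
}

/// Hypothetical backend description listing the intrinsics it exposes.
struct Backend {
    name: &'static str,
    intrinsics: &'static [&'static str],
}

/// Category A is always available. Category C is available only if the
/// backend exposes the required intrinsic. There is no software fallback.
fn check_available(
    op: &'static str,
    category: &Category,
    backend: &Backend,
) -> Result<(), Error> {
    match category {
        Category::A => Ok(()),
        Category::C { intrinsic } => {
            if backend.intrinsics.contains(intrinsic) {
                Ok(())
            } else {
                Err(Error::UnsupportedByBackend {
                    op,
                    backend: backend.name,
                })
            }
        }
    }
}
```

Returning an error instead of emulating the instruction is what keeps a slow fallback from silently becoming Category B.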

**Category B — runtime abstraction overhead — is forbidden.** No virtual
dispatch, no interpretation, no boxing, no JIT. If an op cannot lower
to either inlined composition or direct hardware mapping, it is not a
vyre op. This is non-negotiable. It preserves the property that vyre
abstractions compose to arbitrary depth without accumulating cost.
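
"Inlines completely" can be made concrete with a toy inliner that replaces every call with the callee's body. This is a minimal sketch, not vyre's lowering: the `Expr` type, the single-argument calling convention, and the op ids are hypothetical, and the code assumes an acyclic op library in which every id resolves (a missing id panics).

```rust
use std::collections::HashMap;

/// Toy IR with a single-argument call node (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// The enclosing op's single input.
    Arg,
    Lit(u32),
    Xor(Box<Expr>, Box<Expr>),
    /// Call another op by id with one argument.
    Call(&'static str, Box<Expr>),
}

/// Replace every `Arg` in `body` with `arg`.
fn substitute(body: &Expr, arg: &Expr) -> Expr {
    match body {
        Expr::Arg => arg.clone(),
        Expr::Lit(v) => Expr::Lit(*v),
        Expr::Xor(a, b) => Expr::Xor(
            Box::new(substitute(a, arg)),
            Box::new(substitute(b, arg)),
        ),
        Expr::Call(id, a) => Expr::Call(*id, Box::new(substitute(a, arg))),
    }
}

/// Inline all calls, bottom-up. Assumes an acyclic library where every
/// id resolves; indexing panics on an unknown id.
fn inline(e: &Expr, ops: &HashMap<&'static str, Expr>) -> Expr {
    match e {
        Expr::Arg => Expr::Arg,
        Expr::Lit(v) => Expr::Lit(*v),
        Expr::Xor(a, b) => Expr::Xor(
            Box::new(inline(a, ops)),
            Box::new(inline(b, ops)),
        ),
        Expr::Call(id, a) => {
            let arg = inline(a, ops);
            let body = inline(&ops[id], ops);
            substitute(&body, &arg)
        }
    }
}
```

With `mask(x) = x ^ 0xFF` and `double_mask(x) = mask(mask(x))`, inlining `double_mask(1)` produces `(1 ^ 0xFF) ^ 0xFF` with no `Call` node left. The two levels of abstraction vanish before any shader text is generated.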

See `vyre-conform/SPEC.md` for the full conformance rules governing
Category A and Category C operations.

### Extensibility — adapting to new algorithms and architectures

New neural network architectures, new algorithms, new training
techniques, new inference strategies — all appear as compositions of
existing vyre ops. vyre does not need an update when Mamba replaces
Transformers or when a new attention variant is published. The
primitives stay the same; the compositions change.

vyre needs updates in three narrow cases:

1. **New hardware instructions.** When a GPU generation introduces a
   new dedicated unit (e.g., a new sparse tensor core), it becomes a
   new Category C op. If a backend lacks the required hardware, the op is unavailable and implementations must return `Error::UnsupportedByBackend { op, backend }`.
2. **New data types.** When a new numeric format becomes mainstream
   (e.g., FP4, FP6, log-number systems), it becomes a new DataType
   variant with declared semantics.
3. **New dispatch models.** Rare. Would be required if fundamentally
   new compute paradigms emerge (quantum, optical).

Everything else — every algorithm, every model, every technique — is
expressible as a Program composed from ops that already exist. This is
why vyre is a substrate, not a library: the substrate outlasts the
algorithms built on it.