vyre-libs 0.6.3

# vyre-libs::nn SKILL

Neural-network primitives: activations, linear layers, normalization,
and attention. Every op is a Cat-A composition over `vyre-ops`
primitives and lower-level `vyre-libs::math` functions.

## Coverage targets

- Activations: `relu`. Future: `gelu`, `silu`, `tanh`, `sigmoid`.
- Linear: `linear` (requires the `math-linalg` feature).
- Normalization: `layer_norm`. Future: `rms_norm`, `batch_norm`,
  `group_norm`.
- Attention: `softmax`, `attention`. Future: `flash_attention_v2`
  (post-0.6 LLM template crate R-3).

## Witness sources

- `relu`: trivial; the identity for non-negative u32.
- `layer_norm`: PyTorch's `torch.nn.LayerNorm` reference with
  `eps=1e-5`, plus a corpus of edge cases (constant input, zero
  variance, large variance).
- `softmax`: output probabilities must sum to 1 ± 1e-6 (the tolerance
  covers `f32` rounding).
- `attention`: reference pulled from `scaled_dot_product_attention`
  in PyTorch.
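
A witness for `softmax` and `layer_norm` could be generated from a scalar CPU reference along the following lines. This is a minimal sketch, not the crate's implementation: the names `softmax_ref` and `layer_norm_ref` are hypothetical, and the layer norm omits the affine weight and bias (which matches a freshly constructed `torch.nn.LayerNorm`, where weight is 1 and bias is 0).

```rust
/// Hypothetical scalar reference: numerically stable softmax.
/// Subtracting the maximum before `exp` avoids overflow on large inputs.
fn softmax_ref(input: &[f32]) -> Vec<f32> {
    let max = input.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = input.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Hypothetical scalar reference: (x - mean) / sqrt(var + eps), using the
/// biased (population) variance, as `torch.nn.LayerNorm` does.
fn layer_norm_ref(input: &[f32], eps: f32) -> Vec<f32> {
    let n = input.len() as f32;
    let mean = input.iter().sum::<f32>() / n;
    let var = input.iter().map(|&x| (x - mean) * (x - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    input.iter().map(|&x| (x - mean) * inv_std).collect()
}
```

The constant-input and zero-variance edge cases coincide here: with `eps = 1e-5` the variance term is zero, so every output element is exactly `0.0` rather than NaN.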

## Benchmark targets (criterion)

- `softmax` on 4096 F32 elements: ≤ 500 µs sequential, ≤ 20 µs with
  the workgroup-shared variant once `DataType::Shared` lands.
- `layer_norm` on 4096 F32 elements: ≤ 500 µs sequential.
- `attention` at seq_len=128, head_dim=64: ≤ 5 ms sequential; the
  FlashAttention-v2 variant (R-3, post-0.6) targets ≤ 200 µs on a
  3090.

## Backend parity contract

- F32 ops must be bit-identical across backends on inputs whose
  reduction tree is associativity-safe. For non-associative float
  reductions, document an explicit tolerance ≤ `f32::EPSILON * n`.
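
To illustrate why the tolerance is needed, the sketch below sums the same values in two orders, left-to-right and as a pairwise tree, and compares the results against `f32::EPSILON * n`. The function names are hypothetical, and treating the bound as an absolute difference is an assumption; the contract above does not say whether it is absolute or relative.

```rust
/// Left-to-right sum, the order a sequential backend would likely use.
fn sum_sequential(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0, |acc, &x| acc + x)
}

/// Pairwise (tree) sum, the order a parallel reduction would likely use.
fn sum_pairwise(xs: &[f32]) -> f32 {
    match xs.len() {
        0 => 0.0,
        1 => xs[0],
        n => {
            let (lo, hi) = xs.split_at(n / 2);
            sum_pairwise(lo) + sum_pairwise(hi)
        }
    }
}

/// Assumption: the documented tolerance is an absolute bound on |a - b|.
fn within_parity_tolerance(a: f32, b: f32, n: usize) -> bool {
    (a - b).abs() <= f32::EPSILON * n as f32
}
```

For non-negative inputs that sum to roughly 1, such as softmax outputs, the two orders stay within this bound, but they are not guaranteed to be bit-identical. Inputs with cancellation behave far worse: in `f32`, `(1e8 + -1e8) + 1.0` is `1.0` while `1e8 + (-1e8 + 1.0)` is `0.0`.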

## Shape contract

- `softmax(input, output, n)`: `input` and `output` are both 1-D F32 of
  length `n`.
- `layer_norm(input, output, n, eps)`: `input` and `output` are both
  1-D F32 of length `n`.
- `attention(q, k, v, out, s, d)`: `q`, `k`, `v`, and `out` are all
  2-D F32 of shape `[s, d]`.
- All builders route through `check_tensors` for collision, dtype,
  and overflow checks; no op-specific shape logic lives outside the
  builder.
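
The `[s, d]` contract for `attention` can be made concrete with a scalar reference over row-major flat slices. This is only a sketch under stated assumptions: `attention_ref` is a hypothetical name, the row-major layout is assumed, the scale `1/sqrt(d)` follows PyTorch's `scaled_dot_product_attention` default, and the length and overflow assertions stand in for whatever `check_tensors` actually does.

```rust
/// Hypothetical scalar reference: softmax(q * k^T / sqrt(d)) * v,
/// with q, k, v, and the result all row-major `[s, d]`.
fn attention_ref(q: &[f32], k: &[f32], v: &[f32], s: usize, d: usize) -> Vec<f32> {
    // Stand-in for the overflow and shape checks described above.
    let len = s.checked_mul(d).expect("s * d overflows usize");
    assert!(
        q.len() == len && k.len() == len && v.len() == len,
        "q, k, v must each hold s * d elements"
    );

    let scale = 1.0 / (d as f32).sqrt();
    let mut out = vec![0.0f32; len];
    for i in 0..s {
        // scores[j] = dot(q[i, :], k[j, :]) * scale
        let scores: Vec<f32> = (0..s)
            .map(|j| (0..d).map(|c| q[i * d + c] * k[j * d + c]).sum::<f32>() * scale)
            .collect();
        // Stable softmax over the row of scores.
        let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|&x| (x - max).exp()).collect();
        let denom: f32 = exps.iter().sum();
        // out[i, :] = sum_j softmax(scores)[j] * v[j, :]
        for j in 0..s {
            let w = exps[j] / denom;
            for c in 0..d {
                out[i * d + c] += w * v[j * d + c];
            }
        }
    }
    out
}
```

With an all-zero `q`, every score is zero, the softmax is uniform, and each output row is the mean of the rows of `v`, which makes a convenient sanity check.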