zilla-muf 0.1.1

# zilla-muf

**Zilla Mathematical Unification Framework** — shared structured-matrix and numerical primitives for sparse attention and state space models (SSMs).

A small, dependency-light, tensor-framework-agnostic Rust library that gives you the math both architectures sit on top of — as plain functions over slices, generic over `f32`/`f64`.

---

## Why this exists

Sparse attention and SSMs converge on the *same* computation: multiplying by a **structured matrix** (semiseparable / low-displacement-rank) instead of a dense one. That's the thesis behind Mamba-2's Structured State Space Duality — linear attention and SSMs are two views of one chunked recurrence.

So this isn't "math utils." It's the structured-matrix engine that an attention crate and an SSM crate can both wrap with their own tensor types (`candle`, `burn`, raw `Vec<f32>` — your choice).

---

## What's inside

| Module | Key functions | Purpose |
|---|---|---|
| `scan` | `sequential_scan`, `chunked_scan` | The core recurrence `h_t = a_t · h_{t-1} + b_t` that both architectures reduce to. `chunked_scan` is the SSD-style blocked algorithm — the load-bearing piece. |
| `stable_ops` | `softmax`, `log_sum_exp`, `segsum`, `sigmoid`, `silu` | Numerically stable nonlinearities plus `segsum` (segment-sum) used in the SSD chunked algorithm. |
| `structured` | `semiseparable_matvec`, `toeplitz_matvec`, `cauchy_matvec`, `vandermonde_matvec` | Structured matrix–vector products: the unifying object (Mamba-2 diagonal, convolutional, and S4/S4D kernel views). |
| `discretize` | `zoh_discretize` | Continuous → discrete conversion (zero-order hold) for SSMs. |
| `complex_ops` | `complex_exp`, `diag_complex_matrix_exp`, `conjugate_pair_output` | Complex helpers for S4-style diagonalized state matrices. |
| `fft_conv` *(feature)* | `fft_conv` | O(n log n) long convolution — the FFT view of an SSM kernel. |

---

## Installation

Requires **Rust 1.85+** (edition 2021).

```toml
[dependencies]
zilla-muf = { git = "https://github.com/dvbnrg/zilla-muf" }
```

Import as `zilla_muf` (the crate name's hyphen becomes an underscore).

---

## Feature flags

Everything off by default — you opt in to extra dependencies only when you need them.

| Feature | Pulls in | Effect |
|---|---|---|
| *(default)* | `num-traits`, `num-complex` | Pure-CPU reference implementations. |
| `parallel` | `rayon` | Runs the independent phases of `chunked_scan` (per-chunk compute + per-position correction) in parallel. The carry pass stays sequential — that dependency is structural. |
| `fft` | `rustfft` | Enables the `fft_conv` module. |

```toml
zilla-muf = { git = "https://github.com/dvbnrg/zilla-muf", features = ["parallel", "fft"] }
```

---

## Quick start

**The core scan** — the recurrence both SSMs and linear attention collapse to:

```rust
use zilla_muf::scan::chunked_scan;

// h_t = a_t * h_{t-1} + b_t
let a = [0.9, 0.9, 0.9, 0.9]; // decay per step
let b = [1.0, 2.0, 3.0, 4.0]; // inputs
let h0 = 0.0;
let chunk_size = 2;

let h = chunked_scan(&a, &b, h0, chunk_size);
// matches sequential_scan to within float rounding, but blocked for speed
```

**The duality in one call** — apply an implicit semiseparable matrix to a vector (the "attention = SSM" object):

```rust
use zilla_muf::structured::semiseparable_matvec;

let n = 4;
let rank = 1;                  // SSM state dimension
let a = [0.9, 0.9, 0.9, 0.9];  // scalar decay per timestep
let b = [1.0, 1.0, 1.0, 1.0];  // B_j vectors, length n * rank, row-major
let c = [1.0, 1.0, 1.0, 1.0];  // C_i vectors, length n * rank, row-major
let x = [1.0, 2.0, 3.0, 4.0];  // input sequence

let y = semiseparable_matvec(&a, &b, &c, &x, rank, /* chunk_size */ 2);
```

**Discretize a continuous system** before running SSM math:

```rust
use zilla_muf::discretize::zoh_discretize;

let (a_bar, b_bar) = zoh_discretize(-1.0_f64, 1.0, 0.1); // (A, B, delta)
```

---

## Design principles

- **Slices in, `Vec` out — never a tensor type.** The library stays framework-agnostic; your attention and SSM crates wrap these calls with whatever tensor layer they use.
- **Generic over `Float` from the start.** Use `f32` for speed, `f64` for numerical testing — same code path.
- **Correctness-first.** Every primitive is tested against an *independent* oracle (a dense/naive reference built a different way), not a restatement of its own implementation. The chunked scan is checked against the sequential one; FFT conv against the Toeplitz matvec; Cauchy/Vandermonde against dense and `powu` references.
- **CPU reference, GPU later.** This crate is the correctness reference and CPU fallback. GPU kernels (e.g. via `cudarc`) belong in a separate crate so this one stays simple and dependency-light.

---

## Testing & benchmarks

```bash
cargo test                              # default feature set
cargo test --features parallel          # exercises the rayon scan path
cargo test --features fft               # exercises fft_conv
cargo test --features parallel,fft      # everything
cargo test --all-features               # full matrix (mirrors CI)
cargo test --doc --all-features         # doc-test examples

cargo bench                             # criterion: sequential vs chunked across sizes
cargo bench --bench structured_bench    # criterion: toeplitz, semiseparable, fft_conv
```

CI runs all four feature combinations on every push and PR. Note: the parallel scan path is **only** covered under `--features parallel`, so run that combo when touching `chunked_scan`.

---

## Roadmap

Implemented today: sequential + chunked scan, stable ops, the four structured matvecs, ZOH discretization, complex exp, and FFT conv. Still planned:

- **Parallel/associative (Blelloch) scan** — for when `num_chunks` gets large enough that hierarchical blocking beats a flat carry pass.
- **More discretization** — bilinear / Tustin transform alongside ZOH.
- ~~**More complex helpers**~~ — `diag_complex_matrix_exp` and `conjugate_pair_output` now implemented.

---

## License

BSD 3-Clause © 2026 Dave Banerjee. See [LICENSE](./LICENSE).