# zilla-muf
**Zilla Mathematical Unification Framework** — shared structured-matrix and numerical primitives for sparse attention and state space models (SSMs).
A small, dependency-light, tensor-framework-agnostic Rust library that gives you the math both architectures sit on top of — as plain functions over slices, generic over `f32`/`f64`.
---
## Why this exists
Sparse attention and SSMs converge on the *same* computation: multiplying by a **structured matrix** (semiseparable / low-displacement-rank) instead of a dense one. That's the thesis behind Mamba-2's Structured State Space Duality — linear attention and SSMs are two views of one chunked recurrence.
So this isn't "math utils." It's the structured-matrix engine that an attention crate and an SSM crate can both wrap with their own tensor types (`candle`, `burn`, raw `Vec<f32>` — your choice).
---
## What's inside
| `scan` | `sequential_scan`, `chunked_scan` | The core recurrence `h_t = a_t · h_{t-1} + b_t` that both architectures reduce to. `chunked_scan` is the SSD-style blocked algorithm — the load-bearing piece. |
| `stable_ops` | `softmax`, `log_sum_exp`, `segsum`, `sigmoid`, `silu` | Numerically stable nonlinearities plus `segsum` (segment-sum) used in the SSD chunked algorithm. |
| `structured` | `semiseparable_matvec`, `toeplitz_matvec`, `cauchy_matvec`, `vandermonde_matvec` | Structured matrix–vector products: the unifying object (Mamba-2 diagonal, convolutional, and S4/S4D kernel views). |
| `discretize` | `zoh_discretize` | Continuous → discrete conversion (zero-order hold) for SSMs. |
| `complex_ops` | `complex_exp`, `diag_complex_matrix_exp`, `conjugate_pair_output` | Complex helpers for S4-style diagonalized state matrices. |
| `fft_conv` *(feature)* | `fft_conv` | O(n log n) long convolution — the FFT view of an SSM kernel. |
---
## Installation
Requires **Rust 1.85+** (edition 2021).
```toml
[dependencies]
zilla-muf = { git = "https://github.com/dvbnrg/zilla-muf" }
```
Import as `zilla_muf` (the crate name's hyphen becomes an underscore).
---
## Feature flags
Everything off by default — you opt in to extra dependencies only when you need them.
| *(default)* | `num-traits`, `num-complex` | Pure-CPU reference implementations. |
| `parallel` | `rayon` | Runs the independent phases of `chunked_scan` (per-chunk compute + per-position correction) in parallel. The carry pass stays sequential — that dependency is structural. |
| `fft` | `rustfft` | Enables the `fft_conv` module. |
```toml
zilla-muf = { git = "https://github.com/dvbnrg/zilla-muf", features = ["parallel", "fft"] }
```
---
## Quick start
**The core scan** — the recurrence both SSMs and linear attention collapse to:
```rust
use zilla_muf::scan::chunked_scan;
// h_t = a_t * h_{t-1} + b_t
let a = [0.9, 0.9, 0.9, 0.9]; // decay per step
let b = [1.0, 2.0, 3.0, 4.0]; // inputs
let h0 = 0.0;
let chunk_size = 2;
let h = chunked_scan(&a, &b, h0, chunk_size);
// matches sequential_scan to within float rounding, but blocked for speed
```
**The duality in one call** — apply an implicit semiseparable matrix to a vector (the "attention = SSM" object):
```rust
use zilla_muf::structured::semiseparable_matvec;
let n = 4;
let rank = 1; // SSM state dimension
let a = [0.9, 0.9, 0.9, 0.9]; // scalar decay per timestep
let b = [1.0, 1.0, 1.0, 1.0]; // B_j vectors, length n * rank, row-major
let c = [1.0, 1.0, 1.0, 1.0]; // C_i vectors, length n * rank, row-major
let x = [1.0, 2.0, 3.0, 4.0]; // input sequence
let y = semiseparable_matvec(&a, &b, &c, &x, rank, /* chunk_size */ 2);
```
**Discretize a continuous system** before running SSM math:
```rust
use zilla_muf::discretize::zoh_discretize;
let (a_bar, b_bar) = zoh_discretize(-1.0_f64, 1.0, 0.1); // (A, B, delta)
```
---
## Design principles
- **Slices in, `Vec` out — never a tensor type.** The library stays framework-agnostic; your attention and SSM crates wrap these calls with whatever tensor layer they use.
- **Generic over `Float` from the start.** Use `f32` for speed, `f64` for numerical testing — same code path.
- **Correctness-first.** Every primitive is tested against an *independent* oracle (a dense/naive reference built a different way), not a restatement of its own implementation. The chunked scan is checked against the sequential one; FFT conv against the Toeplitz matvec; Cauchy/Vandermonde against dense and `powu` references.
- **CPU reference, GPU later.** This crate is the correctness reference and CPU fallback. GPU kernels (e.g. via `cudarc`) belong in a separate crate so this one stays simple and dependency-light.
---
## Testing & benchmarks
```bash
cargo test # default feature set
cargo test --features parallel # exercises the rayon scan path
cargo test --features fft # exercises fft_conv
cargo test --features parallel,fft # everything
cargo test --all-features # full matrix (mirrors CI)
cargo test --doc --all-features # doc-test examples
cargo bench # criterion: sequential vs chunked across sizes
cargo bench --bench structured_bench # criterion: toeplitz, semiseparable, fft_conv
```
CI runs all four feature combinations on every push and PR. Note: the parallel scan path is **only** covered under `--features parallel`, so run that combo when touching `chunked_scan`.
---
## Roadmap
Implemented today: sequential + chunked scan, stable ops, the four structured matvecs, ZOH discretization, complex exp, and FFT conv. Still planned:
- **Parallel/associative (Blelloch) scan** — for when `num_chunks` gets large enough that hierarchical blocking beats a flat carry pass.
- **More discretization** — bilinear / Tustin transform alongside ZOH.
- ~~**More complex helpers**~~ — `diag_complex_matrix_exp` and `conjugate_pair_output` now implemented.
---
## License
BSD 3-Clause © 2026 Dave Banerjee. See [LICENSE](./LICENSE).