zilla-muf
Zilla Mathematical Unification Framework — shared structured-matrix and numerical primitives for sparse attention and state space models (SSMs).
A small, dependency-light, tensor-framework-agnostic Rust library that gives you the math both architectures sit on top of — as plain functions over slices, generic over f32/f64.
Why this exists
Sparse attention and SSMs converge on the same computation: multiplying by a structured matrix (semiseparable / low-displacement-rank) instead of a dense one. That's the thesis behind Mamba-2's Structured State Space Duality — linear attention and SSMs are two views of one chunked recurrence.
So this isn't "math utils." It's the structured-matrix engine that an attention crate and an SSM crate can both wrap with their own tensor types (candle, burn, raw Vec<f32> — your choice).
What's inside
| Module | Key functions | Purpose |
|---|---|---|
| `scan` | `sequential_scan`, `chunked_scan` | The core recurrence `h_t = a_t · h_{t-1} + b_t` that both architectures reduce to. `chunked_scan` is the SSD-style blocked algorithm, the load-bearing piece. |
| `stable_ops` | `softmax`, `log_sum_exp`, `segsum`, `sigmoid`, `silu` | Numerically stable nonlinearities plus `segsum` (segment-sum) used in the SSD chunked algorithm. |
| `structured` | `semiseparable_matvec`, `toeplitz_matvec`, `cauchy_matvec`, `vandermonde_matvec` | Structured matrix–vector products: the unifying object (Mamba-2 diagonal, convolutional, and S4/S4D kernel views). |
| `discretize` | `zoh_discretize` | Continuous → discrete conversion (zero-order hold) for SSMs. |
| `complex_ops` | `complex_exp`, `diag_complex_matrix_exp`, `conjugate_pair_output` | Complex helpers for S4-style diagonalized state matrices. |
| `fft_conv` (feature) | `fft_conv` | O(n log n) long convolution: the FFT view of an SSM kernel. |
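To give a feel for what the `stable_ops` row covers, here is a minimal sketch of three of those operations over `f64` slices. It is not the crate's code: the signatures, the row-major `n * n` output layout for `segsum`, and the `f64`-only typing are assumptions made for illustration. The `segsum` convention used here (sum of `x[j+1..=i]` below the diagonal, `-inf` above) follows the usual SSD formulation.

```rust
/// Stable log(sum(exp(x_i))): shift by the maximum before exponentiating
/// so large inputs do not overflow.
fn log_sum_exp(xs: &[f64]) -> f64 {
    let max = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    if max == f64::NEG_INFINITY {
        return f64::NEG_INFINITY; // empty input, or every entry is -inf
    }
    let sum = xs.iter().map(|&x| (x - max).exp()).sum::<f64>();
    max + sum.ln()
}

/// Softmax expressed through log_sum_exp, so it inherits the same shift.
fn softmax(xs: &[f64]) -> Vec<f64> {
    let lse = log_sum_exp(xs);
    xs.iter().map(|&x| (x - lse).exp()).collect()
}

/// Segment sum (assumed layout: row-major n * n).
/// out[i * n + j] = x[j+1] + ... + x[i] for j <= i, and -inf for j > i.
fn segsum(xs: &[f64]) -> Vec<f64> {
    let n = xs.len();
    let mut out = vec![f64::NEG_INFINITY; n * n];
    for j in 0..n {
        let mut acc = 0.0;
        out[j * n + j] = 0.0; // empty sum on the diagonal
        for i in (j + 1)..n {
            acc += xs[i];
            out[i * n + j] = acc;
        }
    }
    out
}
```

Exponentiating `segsum` of log-decays yields the lower-triangular matrix of decay products, which is why it shows up in the chunked algorithm.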
Installation
Requires Rust 1.85+ (edition 2021).
```toml
[dependencies]
zilla-muf = { git = "https://github.com/dvbnrg/zilla-muf" }
```
Import as zilla_muf (the crate name's hyphen becomes an underscore).
Feature flags
Everything off by default — you opt in to extra dependencies only when you need them.
| Feature | Pulls in | Effect |
|---|---|---|
| (default) | `num-traits`, `num-complex` | Pure-CPU reference implementations. |
| `parallel` | `rayon` | Runs the independent phases of `chunked_scan` (per-chunk compute + per-position correction) in parallel. The carry pass stays sequential; that dependency is structural. |
| `fft` | `rustfft` | Enables the `fft_conv` module. |
```toml
zilla-muf = { git = "https://github.com/dvbnrg/zilla-muf", features = ["parallel", "fft"] }
```
Quick start
The core scan — the recurrence both SSMs and linear attention collapse to:
```rust
use chunked_scan;
// h_t = a_t * h_{t-1} + b_t
let a = ; // decay per step
let b = ; // inputs
let h0 = 0.0;
let chunk_size = 2;
let h = chunked_scan;
// matches sequential_scan to within float rounding, but blocked for speed
```
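To make the blocking concrete, here is a minimal self-contained sketch of the recurrence and a three-phase chunked evaluation of it. The names `sequential_scan_ref` and `chunked_scan_ref` are hypothetical, the code is `f64`-only, and it is not taken from the crate; the real `chunked_scan` signature and internals may differ.

```rust
/// Reference recurrence: h_t = a_t * h_{t-1} + b_t, starting from h0.
fn sequential_scan_ref(a: &[f64], b: &[f64], h0: f64) -> Vec<f64> {
    assert_eq!(a.len(), b.len());
    let mut h = h0;
    let mut out = Vec::with_capacity(a.len());
    for t in 0..a.len() {
        h = a[t] * h + b[t];
        out.push(h);
    }
    out
}

/// Chunked evaluation of the same recurrence (illustrative, not the crate's code).
fn chunked_scan_ref(a: &[f64], b: &[f64], h0: f64, chunk_size: usize) -> Vec<f64> {
    assert_eq!(a.len(), b.len());
    assert!(chunk_size > 0);
    let n = a.len();

    // Phase 1 (independent per chunk): scan each chunk from a zero state and
    // record the running product of decays since the chunk start.
    let mut local = vec![0.0f64; n];
    let mut decay = vec![1.0f64; n];
    for start in (0..n).step_by(chunk_size) {
        let end = (start + chunk_size).min(n);
        let (mut h, mut p) = (0.0f64, 1.0f64);
        for t in start..end {
            h = a[t] * h + b[t];
            p *= a[t];
            local[t] = h;
            decay[t] = p;
        }
    }

    // Phase 2 (sequential): carry the true state across chunk boundaries.
    let mut carries = Vec::new();
    let mut carry = h0;
    for start in (0..n).step_by(chunk_size) {
        let end = (start + chunk_size).min(n);
        carries.push(carry);
        carry = decay[end - 1] * carry + local[end - 1];
    }

    // Phase 3 (independent per position): add the decayed incoming carry.
    (0..n)
        .map(|t| local[t] + decay[t] * carries[t / chunk_size])
        .collect()
}
```

Phases 1 and 3 touch each chunk or position independently, which is what leaves room for parallelism; only the short carry loop in phase 2 has a serial dependency.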
The duality in one call — apply an implicit semiseparable matrix to a vector (the "attention = SSM" object):
```rust
use semiseparable_matvec;
let n = 4;
let rank = 1; // SSM state dimension
let a = ; // scalar decay per timestep
let b = ; // B_j vectors, length n * rank, row-major
let c = ; // C_i vectors, length n * rank, row-major
let x = ; // input sequence
let y = semiseparable_matvec;
```
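The "attention = SSM" claim can be checked numerically with a small sketch. Under the scalar-decay assumption, the implicit matrix is `M[i][j] = (C_i · B_j) * a_{j+1} * ... * a_i` for `j <= i`. Multiplying by it densely (the attention view) should match running the state recurrence (the SSM view). Both functions below are hypothetical illustrations using the row-major `n * rank` layout described above, not the crate's `semiseparable_matvec`.

```rust
/// Attention view: form each entry of the lower-triangular matrix and accumulate.
fn semiseparable_dense(a: &[f64], b: &[f64], c: &[f64], x: &[f64], rank: usize) -> Vec<f64> {
    let n = x.len();
    let mut y = vec![0.0f64; n];
    for i in 0..n {
        for j in 0..=i {
            // Product of decays a_{j+1} ..= a_i (empty product is 1 when j == i).
            let decay = a[j + 1..i + 1].iter().product::<f64>();
            let cb = (0..rank)
                .map(|r| c[i * rank + r] * b[j * rank + r])
                .sum::<f64>();
            y[i] += cb * decay * x[j];
        }
    }
    y
}

/// SSM view: h_t = a_t * h_{t-1} + B_t * x_t, then y_t = C_t . h_t.
fn semiseparable_recurrent(a: &[f64], b: &[f64], c: &[f64], x: &[f64], rank: usize) -> Vec<f64> {
    let mut h = vec![0.0f64; rank];
    let mut y = Vec::with_capacity(x.len());
    for t in 0..x.len() {
        for r in 0..rank {
            h[r] = a[t] * h[r] + b[t * rank + r] * x[t];
        }
        let y_t = (0..rank).map(|r| c[t * rank + r] * h[r]).sum::<f64>();
        y.push(y_t);
    }
    y
}
```

The dense form costs O(n²) and the recurrent form O(n · rank); agreeing outputs are the duality in miniature.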
Discretize a continuous system before running SSM math:
```rust
use zoh_discretize;
let = zoh_discretize; // (A, B, delta)
```
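For reference, zero-order hold on a diagonal continuous system has a closed form per state entry: `Ā = exp(δ·A)` and `B̄ = (exp(δ·A) − 1) / A · B`, with `B̄ → δ·B` as `A → 0`. The sketch below assumes a diagonal `A` stored as a slice and uses the hypothetical name `zoh_diag_sketch`; the crate's `zoh_discretize` may take different arguments or handle non-diagonal systems.

```rust
/// Zero-order-hold discretization for a diagonal system (illustrative only).
/// `a` holds the diagonal of the continuous state matrix, `b` the input weights.
fn zoh_diag_sketch(a: &[f64], b: &[f64], delta: f64) -> (Vec<f64>, Vec<f64>) {
    assert_eq!(a.len(), b.len());
    let a_bar: Vec<f64> = a.iter().map(|&l| (delta * l).exp()).collect();
    let b_bar: Vec<f64> = a
        .iter()
        .zip(b.iter())
        .map(|(&l, &b_i)| {
            if l.abs() < 1e-12 {
                // Limit of (exp(delta * l) - 1) / l as l -> 0.
                delta * b_i
            } else {
                // exp_m1 avoids cancellation when delta * l is small.
                (delta * l).exp_m1() / l * b_i
            }
        })
        .collect();
    (a_bar, b_bar)
}
```

The resulting `a_bar` and `b_bar` play the roles of `a_t` and the input scaling in the discrete recurrence `h_t = a_t · h_{t-1} + b_t`.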
Design principles
- Slices in, `Vec` out, never a tensor type. The library stays framework-agnostic; your attention and SSM crates wrap these calls with whatever tensor layer they use.
- Generic over `Float` from the start. Use `f32` for speed, `f64` for numerical testing; the code path is the same.
- Correctness-first. Every primitive is tested against an independent oracle (a dense/naive reference built a different way), not a restatement of its own implementation. The chunked scan is checked against the sequential one; FFT conv against the Toeplitz matvec; Cauchy/Vandermonde against dense and `powu` references.
- CPU reference, GPU later. This crate is the correctness reference and CPU fallback. GPU kernels (e.g. via `cudarc`) belong in a separate crate so this one stays simple and dependency-light.
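As an illustration of the independent-oracle idea, the sketch below computes a causal convolution two structurally different ways: a direct sum, and an explicitly materialised lower-triangular Toeplitz matrix followed by a dense matvec. The function names are hypothetical and this is not the crate's test suite; it only shows the shape such a check can take. It assumes the kernel `k` is at least as long as `x`.

```rust
/// Direct causal convolution: y[i] = sum over j <= i of k[i - j] * x[j].
fn causal_conv(k: &[f64], x: &[f64]) -> Vec<f64> {
    (0..x.len())
        .map(|i| (0..=i).map(|j| k[i - j] * x[j]).sum::<f64>())
        .collect()
}

/// Oracle: build T[i][j] = k[i - j] for j <= i (zero elsewhere), then do a
/// plain dense matrix-vector product with no knowledge of the structure.
fn toeplitz_dense_matvec(k: &[f64], x: &[f64]) -> Vec<f64> {
    let n = x.len();
    let mut t = vec![vec![0.0f64; n]; n];
    for i in 0..n {
        for j in 0..=i {
            t[i][j] = k[i - j];
        }
    }
    t.iter()
        .map(|row| row.iter().zip(x.iter()).map(|(t_ij, x_j)| t_ij * x_j).sum::<f64>())
        .collect()
}
```

Because the oracle goes through a materialised matrix rather than index arithmetic on the kernel, an off-by-one in one path is unlikely to be mirrored in the other.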
Testing & benchmarks
CI runs all four feature combinations on every push and PR. Note: the parallel scan path is only covered under --features parallel, so run that combo when touching chunked_scan.
Roadmap
Implemented today: sequential + chunked scan, stable ops, the four structured matvecs, ZOH discretization, complex exp, and FFT conv. Still planned:
- Parallel/associative (Blelloch) scan, for when `num_chunks` gets large enough that hierarchical blocking beats a flat carry pass.
- More discretization: bilinear / Tustin transform alongside ZOH.
- More complex helpers: `diag_complex_matrix_exp` and `conjugate_pair_output` are now implemented.
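The planned associative scan rests on one observation: each recurrence step `h -> a·h + b` is an affine map, and composing affine maps is associative, so prefixes can be combined in any grouping. The sketch below shows that combine operator with a simple doubling (Hillis-Steele style) prefix scan. It is not Blelloch's work-efficient up-sweep/down-sweep, it runs serially, and every name in it is hypothetical; it only demonstrates that the operator yields the same states as the sequential loop.

```rust
/// One recurrence step h -> a * h + b, stored as the pair (a, b).
type Step = (f64, f64);

/// Compose two steps: apply `first`, then `second`. This operator is associative.
fn combine(first: Step, second: Step) -> Step {
    (second.0 * first.0, second.0 * first.1 + second.1)
}

/// Inclusive prefix scan by repeated doubling. Each pass reads only the
/// previous pass, so its inner loop could run in parallel.
fn doubling_scan(steps: &[Step]) -> Vec<Step> {
    let mut cur = steps.to_vec();
    let mut stride = 1;
    while stride < cur.len() {
        let prev = cur.clone();
        for i in stride..cur.len() {
            cur[i] = combine(prev[i - stride], prev[i]);
        }
        stride *= 2;
    }
    cur
}

/// Recover every h_t by applying the t-th prefix map to h0.
fn scan_via_prefix(a: &[f64], b: &[f64], h0: f64) -> Vec<f64> {
    assert_eq!(a.len(), b.len());
    let steps: Vec<Step> = a.iter().cloned().zip(b.iter().cloned()).collect();
    doubling_scan(&steps)
        .iter()
        .map(|&(p, q)| p * h0 + q)
        .collect()
}
```

A work-efficient variant would replace `doubling_scan` while keeping `combine` unchanged, which is the part the recurrence actually dictates.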
License
BSD 3-Clause © 2026 Dave Banerjee. See LICENSE.