zilla-muf

Zilla Mathematical Unification Framework — shared structured-matrix and numerical primitives for sparse attention and state space models (SSMs).

A small, dependency-light, tensor-framework-agnostic Rust library that gives you the math both architectures sit on top of — as plain functions over slices, generic over f32/f64.

Why this exists

Sparse attention and SSMs converge on the same computation: multiplying by a structured matrix (semiseparable / low-displacement-rank) instead of a dense one. That's the thesis behind Mamba-2's Structured State Space Duality — linear attention and SSMs are two views of one chunked recurrence.

So this isn't "math utils." It's the structured-matrix engine that an attention crate and an SSM crate can both wrap with their own tensor types (candle, burn, raw Vec<f32> — your choice).

What's inside

Module	Key functions	Purpose
`scan`	`sequential_scan`, `chunked_scan`	The core recurrence `h_t = a_t · h_{t-1} + b_t` that both architectures reduce to. `chunked_scan` is the SSD-style blocked algorithm — the load-bearing piece.
`stable_ops`	`softmax`, `log_sum_exp`, `segsum`, `sigmoid`, `silu`	Numerically stable nonlinearities plus `segsum` (segment-sum) used in the SSD chunked algorithm.
`structured`	`semiseparable_matvec`, `toeplitz_matvec`, `cauchy_matvec`, `vandermonde_matvec`	Structured matrix–vector products: the unifying object (Mamba-2 diagonal, convolutional, and S4/S4D kernel views).
`discretize`	`zoh_discretize`	Continuous → discrete conversion (zero-order hold) for SSMs.
`complex_ops`	`complex_exp`, `diag_complex_matrix_exp`, `conjugate_pair_output`	Complex helpers for S4-style diagonalized state matrices.
`fft_conv` (requires the `fft` feature)	`fft_conv`	O(n log n) long convolution: the FFT view of an SSM kernel.
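To make the `segsum` row concrete, here is a minimal standalone sketch of a segment-sum and of the decay mask it produces, following the definition used in the Mamba-2 SSD write-ups. It does not call zilla-muf: the function names, the `f64`-only signatures, and the row-major `n * n` output layout are assumptions of this sketch, and the crate's own `segsum` may differ. It uses the simple difference-of-prefix-sums form, which can lose precision on long sequences.

```rust
/// Segment-sum: out[i][j] = x[j+1] + ... + x[i] for j <= i, and -inf for j > i.
/// Returned as a row-major n * n Vec (layout is an assumption of this sketch).
fn segsum(x: &[f64]) -> Vec<f64> {
    let n = x.len();
    // Inclusive prefix sums: cum[i] = x[0] + ... + x[i].
    let mut cum = Vec::with_capacity(n);
    let mut acc = 0.0;
    for &v in x {
        acc += v;
        cum.push(acc);
    }
    let mut out = vec![f64::NEG_INFINITY; n * n];
    for i in 0..n {
        for j in 0..=i {
            // A difference of prefix sums gives the sum over the range (j, i].
            out[i * n + j] = cum[i] - cum[j];
        }
    }
    out
}

/// Decay mask L[i][j] = a[j+1] * ... * a[i] on and below the diagonal, 0 above it,
/// computed in log space as exp(segsum(ln a)). Assumes every a[k] > 0.
fn decay_mask(a: &[f64]) -> Vec<f64> {
    let log_a: Vec<f64> = a.iter().map(|v| v.ln()).collect();
    segsum(&log_a).into_iter().map(f64::exp).collect()
}
```

The `-inf` entries above the diagonal become exact zeros after `exp`, which is what makes the mask causal without a separate masking step.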

Installation

Requires Rust 1.85+ (edition 2024).

[dependencies]
zilla-muf = { git = "https://github.com/dvbnrg/zilla-muf" }

Import as zilla_muf (the crate name's hyphen becomes an underscore).

Feature flags

Optional features are off by default; you opt in to extra dependencies only when you need them.

Feature	Pulls in	Effect
(default)	`num-traits`, `num-complex`	Pure-CPU reference implementations.
`parallel`	`rayon`	Runs the independent phases of `chunked_scan` (per-chunk compute + per-position correction) in parallel. The carry pass stays sequential — that dependency is structural.
`fft`	`rustfft`	Enables the `fft_conv` module.

zilla-muf = { git = "https://github.com/dvbnrg/zilla-muf", features = ["parallel", "fft"] }

Quick start

The core scan — the recurrence both SSMs and linear attention collapse to:

use zilla_muf::scan::chunked_scan;

// h_t = a_t * h_{t-1} + b_t
let a = [0.9, 0.9, 0.9, 0.9]; // decay per step
let b = [1.0, 2.0, 3.0, 4.0]; // inputs
let h0 = 0.0;
let chunk_size = 2;

let h = chunked_scan(&a, &b, h0, chunk_size);
// matches sequential_scan to within float rounding, but blocked for speed
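For readers who want to see why blocking preserves the result, the sketch below re-implements the recurrence both ways in plain Rust. It is not the crate's source: the `f64`-only signatures and the internal buffers are assumptions chosen for readability. The three phases mirror the description in the feature table (per-chunk compute, sequential carry, per-position correction).

```rust
/// Reference recurrence: h_t = a_t * h_{t-1} + b_t, starting from h0.
fn sequential_scan(a: &[f64], b: &[f64], h0: f64) -> Vec<f64> {
    let mut h = h0;
    a.iter()
        .zip(b)
        .map(|(&at, &bt)| {
            h = at * h + bt;
            h
        })
        .collect()
}

/// Blocked version of the same recurrence in three phases.
fn chunked_scan(a: &[f64], b: &[f64], h0: f64, chunk_size: usize) -> Vec<f64> {
    assert_eq!(a.len(), b.len());
    assert!(chunk_size > 0);
    let n = a.len();

    // Phase 1 (independent per chunk): scan each chunk from a zero state and
    // record the running product of `a` since the start of the chunk.
    let mut local = vec![0.0; n];
    let mut decay = vec![1.0; n];
    for start in (0..n).step_by(chunk_size) {
        let end = (start + chunk_size).min(n);
        let (mut h, mut p) = (0.0, 1.0);
        for t in start..end {
            h = a[t] * h + b[t];
            p *= a[t];
            local[t] = h;
            decay[t] = p;
        }
    }

    // Phase 2 (sequential): compute the state entering each chunk.
    let mut carries = Vec::new();
    let mut carry = h0;
    for start in (0..n).step_by(chunk_size) {
        carries.push(carry);
        let last = (start + chunk_size).min(n) - 1;
        carry = decay[last] * carry + local[last];
    }

    // Phase 3 (independent per position): add the carried-in contribution.
    (0..n)
        .map(|t| local[t] + decay[t] * carries[t / chunk_size])
        .collect()
}
```

Only phase 2 has a dependency between chunks, and it touches one value per chunk, which is why phases 1 and 3 are the natural candidates for parallel execution.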

The duality in one call — apply an implicit semiseparable matrix to a vector (the "attention = SSM" object):

use zilla_muf::structured::semiseparable_matvec;

let n = 4;
let rank = 1;                  // SSM state dimension
let a = [0.9, 0.9, 0.9, 0.9];  // scalar decay per timestep
let b = [1.0, 1.0, 1.0, 1.0];  // B_j vectors, length n * rank, row-major
let c = [1.0, 1.0, 1.0, 1.0];  // C_i vectors, length n * rank, row-major
let x = [1.0, 2.0, 3.0, 4.0];  // input sequence

let y = semiseparable_matvec(&a, &b, &c, &x, rank, /* chunk_size */ 2);
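The "attention = SSM" claim can be checked by hand with two small reference functions: one materializes the semiseparable matrix and multiplies (the attention-like view), the other runs the state recurrence (the SSM view). This is a sketch under stated assumptions, not the crate's implementation: it assumes the convention h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t with a zero initial state, and both function names are hypothetical.

```rust
/// Matrix view: M[i][j] = (C_i . B_j) * a[j+1] * ... * a[i] for j <= i, else 0.
/// O(n^2 * rank) work; intended as a checking oracle only.
fn semiseparable_dense(a: &[f64], b: &[f64], c: &[f64], x: &[f64], rank: usize) -> Vec<f64> {
    let n = x.len();
    let mut y = vec![0.0; n];
    for i in 0..n {
        for j in 0..=i {
            // Inner product of row i of C with row j of B (both row-major).
            let cb: f64 = (0..rank).map(|r| c[i * rank + r] * b[j * rank + r]).sum();
            // Product of the decays strictly after step j, up to and including step i.
            let decay: f64 = a[j + 1..i + 1].iter().product();
            y[i] += cb * decay * x[j];
        }
    }
    y
}

/// Recurrent view: carry a rank-sized state forward one step at a time.
fn semiseparable_recurrent(a: &[f64], b: &[f64], c: &[f64], x: &[f64], rank: usize) -> Vec<f64> {
    let mut h = vec![0.0; rank];
    let mut y = Vec::with_capacity(x.len());
    for t in 0..x.len() {
        let mut yt = 0.0;
        for r in 0..rank {
            h[r] = a[t] * h[r] + b[t * rank + r] * x[t];
            yt += c[t * rank + r] * h[r];
        }
        y.push(yt);
    }
    y
}
```

Under this convention the two functions agree up to rounding for any rank; the recurrent form costs O(n * rank) instead of O(n^2 * rank).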

Discretize a continuous system before running SSM math:

use zilla_muf::discretize::zoh_discretize;

let (a_bar, b_bar) = zoh_discretize(-1.0_f64, 1.0, 0.1); // (A, B, delta)
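For a scalar system x' = A x + B u, the textbook zero-order hold is A_bar = exp(delta * A) and B_bar = (A_bar - 1) / A * B. The sketch below implements that formula directly; whether `zoh_discretize` uses exactly this form (and how it handles A = 0) is an assumption, and `zoh_scalar` is a hypothetical name.

```rust
/// Scalar zero-order hold: a_bar = exp(delta * a), b_bar = (a_bar - 1) / a * b.
/// When a is exactly zero, uses the limit a_bar = 1, b_bar = delta * b.
fn zoh_scalar(a: f64, b: f64, delta: f64) -> (f64, f64) {
    let a_bar = (delta * a).exp();
    let b_bar = if a == 0.0 {
        delta * b
    } else {
        // exp_m1 computes exp(z) - 1 accurately when z is close to zero.
        (delta * a).exp_m1() / a * b
    };
    (a_bar, b_bar)
}
```

ZOH is exact when the input is held constant across each step: with A = -1, B = 1, delta = 0.1 and a constant input of 1, iterating x = a_bar * x + b_bar ten times from x = 0 reproduces the continuous solution 1 - exp(-1) up to rounding.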

Design principles

Slices in, Vec out — never a tensor type. The library stays framework-agnostic; your attention and SSM crates wrap these calls with whatever tensor layer they use.
Generic over Float from the start. Use f32 for speed, f64 for numerical testing — same code path.
Correctness-first. Every primitive is tested against an independent oracle (a dense/naive reference built a different way), not a restatement of its own implementation. The chunked scan is checked against the sequential one; FFT conv against the Toeplitz matvec; Cauchy/Vandermonde against dense and powu references.
CPU reference, GPU later. This crate is the correctness reference and CPU fallback. GPU kernels (e.g. via cudarc) belong in a separate crate so this one stays simple and dependency-light.
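As an illustration of the "independent oracle" principle, the sketch below checks a causal Toeplitz product against a dense matrix that is built a different way (diagonal by diagonal) and then multiplied naively. All three function names are hypothetical and the code does not use zilla-muf; the crate's `toeplitz_matvec` may take different arguments.

```rust
/// Path under test: causal (lower-triangular) Toeplitz product,
/// y[i] = sum over j <= i of kernel[i - j] * x[j], without forming the matrix.
fn causal_toeplitz_matvec(kernel: &[f64], x: &[f64]) -> Vec<f64> {
    assert!(kernel.len() >= x.len());
    (0..x.len())
        .map(|i| (0..=i).map(|j| kernel[i - j] * x[j]).sum::<f64>())
        .collect()
}

/// Oracle: materialize the dense n x n matrix, then do a plain matvec.
fn dense_oracle(kernel: &[f64], x: &[f64]) -> Vec<f64> {
    let n = x.len();
    let mut m = vec![vec![0.0; n]; n];
    for (d, &k) in kernel.iter().enumerate().take(n) {
        // Fill the d-th subdiagonal with kernel[d].
        for i in d..n {
            m[i][i - d] = k;
        }
    }
    m.iter()
        .map(|row| row.iter().zip(x).map(|(mij, xj)| mij * xj).sum::<f64>())
        .collect()
}

/// Largest absolute elementwise difference between two equal-length slices.
fn max_abs_diff(u: &[f64], v: &[f64]) -> f64 {
    u.iter().zip(v).map(|(p, q)| (p - q).abs()).fold(0.0, f64::max)
}
```

A test then asserts that `max_abs_diff(&causal_toeplitz_matvec(k, x), &dense_oracle(k, x))` stays below a small tolerance. Because the oracle indexes the matrix by diagonal rather than by `i - j`, an off-by-one in the fast path is unlikely to be mirrored in the reference.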

Testing & benchmarks

cargo test                              # default feature set
cargo test --features parallel          # exercises the rayon scan path
cargo test --features fft               # exercises fft_conv
cargo test --features parallel,fft      # everything
cargo test --all-features               # all features in one build; CI runs each of the four combinations
cargo test --doc --all-features         # doc-test examples

cargo bench                             # criterion: sequential vs chunked across sizes
cargo bench --bench structured_bench    # criterion: toeplitz, semiseparable, fft_conv

CI runs all four feature combinations on every push and PR. Note: the parallel scan path is compiled only when the parallel feature is enabled, so run a combination that includes --features parallel when touching chunked_scan.

Roadmap

Implemented today: sequential + chunked scan, stable ops, the four structured matvecs, ZOH discretization, complex exp, and FFT conv. Still planned:

Parallel/associative (Blelloch) scan — for when num_chunks gets large enough that hierarchical blocking beats a flat carry pass.
More discretization — bilinear / Tustin transform alongside ZOH.
~~More complex helpers~~ — diag_complex_matrix_exp and conjugate_pair_output now implemented.
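The planned associative scan rests on one observation: each step h -> a_t * h + b_t is an affine map, and composing affine maps is associative. The sketch below shows that combine operator and uses it in a simple recursive-doubling (Hillis-Steele) sweep, which is easier to read than Blelloch's work-efficient up/down sweep but does more total work. It is an illustration only, not a preview of the crate's eventual API; `Affine`, `combine`, and `affine_scan` are hypothetical names.

```rust
/// (a, b) stands for the affine map h -> a * h + b.
type Affine = (f64, f64);

/// Compose two steps: apply `first`, then `second`. Associative, not commutative.
fn combine(first: Affine, second: Affine) -> Affine {
    (second.0 * first.0, second.0 * first.1 + second.1)
}

/// Inclusive scan by recursive doubling. Within one round every update reads
/// only the previous round's values, so the updates are mutually independent.
fn affine_scan(a: &[f64], b: &[f64], h0: f64) -> Vec<f64> {
    let mut maps: Vec<Affine> = a.iter().copied().zip(b.iter().copied()).collect();
    let mut stride = 1;
    while stride < maps.len() {
        let prev = maps.clone();
        for i in stride..maps.len() {
            // prev[i - stride] covers the earlier steps, prev[i] the later ones.
            maps[i] = combine(prev[i - stride], prev[i]);
        }
        stride *= 2;
    }
    // maps[t] is now the composition of steps 0..=t; apply it to h0.
    maps.iter().map(|&(p, q)| p * h0 + q).collect()
}
```

The number of rounds grows with log2(n) rather than n, which is the property that makes a tree-structured scan attractive once the number of chunks is large.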

zilla-muf 0.1.1