zilla-muf 0.1.1

zilla-muf

Zilla Mathematical Unification Framework — shared structured-matrix and numerical primitives for sparse attention and state space models (SSMs).

A small, dependency-light, tensor-framework-agnostic Rust library that gives you the math both architectures sit on top of — as plain functions over slices, generic over f32/f64.


Why this exists

Sparse attention and SSMs converge on the same computation: multiplying by a structured matrix (semiseparable / low-displacement-rank) instead of a dense one. That's the thesis behind Mamba-2's Structured State Space Duality — linear attention and SSMs are two views of one chunked recurrence.

So this isn't "math utils." It's the structured-matrix engine that an attention crate and an SSM crate can both wrap with their own tensor types (candle, burn, raw Vec<f32> — your choice).


What's inside

| Module | Key functions | Purpose |
| --- | --- | --- |
| scan | sequential_scan, chunked_scan | The core recurrence h_t = a_t · h_{t-1} + b_t that both architectures reduce to. chunked_scan is the SSD-style blocked algorithm and the load-bearing piece. |
| stable_ops | softmax, log_sum_exp, segsum, sigmoid, silu | Numerically stable nonlinearities plus segsum (segment-sum) used in the SSD chunked algorithm. |
| structured | semiseparable_matvec, toeplitz_matvec, cauchy_matvec, vandermonde_matvec | Structured matrix–vector products: the unifying object (Mamba-2 diagonal, convolutional, and S4/S4D kernel views). |
| discretize | zoh_discretize | Continuous → discrete conversion (zero-order hold) for SSMs. |
| complex_ops | complex_exp, diag_complex_matrix_exp, conjugate_pair_output | Complex helpers for S4-style diagonalized state matrices. |
| fft_conv (feature) | fft_conv | O(n log n) long convolution: the FFT view of an SSM kernel. |
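
To make the stable_ops row concrete, here is a minimal std-only sketch of the max-shift trick behind log_sum_exp and softmax, plus segsum as it is usually defined for the SSD algorithm (entry (i, j) is x_{j+1} + ... + x_i on and below the diagonal, -inf above it). The function names mirror the table, but the signatures, the f64-only types, and the row-major n*n return layout are assumptions for illustration, not the crate's actual API.

```rust
/// log(sum_i exp(x_i)), computed by factoring out the maximum so exp cannot overflow.
fn log_sum_exp(xs: &[f64]) -> f64 {
    let m = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    if !m.is_finite() {
        // Empty input or all entries -inf give -inf; a +inf entry gives +inf.
        return m;
    }
    let s: f64 = xs.iter().map(|&x| (x - m).exp()).sum();
    m + s.ln()
}

/// Softmax expressed through the same shift: exp(x_i - log_sum_exp(x)).
fn softmax(xs: &[f64]) -> Vec<f64> {
    let lse = log_sum_exp(xs);
    xs.iter().map(|&x| (x - lse).exp()).collect()
}

/// Segment sum, row-major n*n (assumed layout):
/// out[i*n + j] = x[j+1] + ... + x[i] for i >= j, and -inf for i < j.
fn segsum(xs: &[f64]) -> Vec<f64> {
    let n = xs.len();
    // prefix[k] = x[0] + ... + x[k-1]
    let mut prefix = vec![0.0; n + 1];
    for k in 0..n {
        prefix[k + 1] = prefix[k] + xs[k];
    }
    let mut out = vec![f64::NEG_INFINITY; n * n];
    for i in 0..n {
        for j in 0..=i {
            out[i * n + j] = prefix[i + 1] - prefix[j + 1];
        }
    }
    out
}
```

Exponentiating a segsum of log-decays gives the lower-triangular matrix of decay products, with exp(-inf) = 0 supplying the causal mask. This sketch takes differences of prefix sums for brevity; a production version may prefer a formulation that avoids cancellation.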

Installation

Requires Rust 1.85+ (edition 2021).

[dependencies]
zilla-muf = { git = "https://github.com/dvbnrg/zilla-muf" }

Import as zilla_muf (the crate name's hyphen becomes an underscore).


Feature flags

All optional features are off by default; you opt in to extra dependencies only when you need them.

| Feature | Pulls in | Effect |
| --- | --- | --- |
| (default) | num-traits, num-complex | Pure-CPU reference implementations. |
| parallel | rayon | Runs the independent phases of chunked_scan (per-chunk compute + per-position correction) in parallel. The carry pass stays sequential because that dependency is structural. |
| fft | rustfft | Enables the fft_conv module. |

zilla-muf = { git = "https://github.com/dvbnrg/zilla-muf", features = ["parallel", "fft"] }

Quick start

The core scan — the recurrence both SSMs and linear attention collapse to:

use zilla_muf::scan::chunked_scan;

// h_t = a_t * h_{t-1} + b_t
let a = [0.9, 0.9, 0.9, 0.9]; // decay per step
let b = [1.0, 2.0, 3.0, 4.0]; // inputs
let h0 = 0.0;
let chunk_size = 2;

let h = chunked_scan(&a, &b, h0, chunk_size);
// matches sequential_scan to within float rounding, but blocked for speed
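
For readers who want to see what "blocked" means here, the following is a std-only sketch of one way a chunked scan can be organized into the three phases the feature table describes: independent per-chunk scans, a sequential carry pass, and an independent per-position correction. The names sequential_scan_ref and chunked_scan_ref are hypothetical and the code is not taken from the crate; it only illustrates the technique for scalar f64 inputs.

```rust
/// Reference recurrence: h_t = a_t * h_{t-1} + b_t, returning every h_t.
fn sequential_scan_ref(a: &[f64], b: &[f64], h0: f64) -> Vec<f64> {
    let mut h = h0;
    a.iter()
        .zip(b)
        .map(|(&at, &bt)| {
            h = at * h + bt;
            h
        })
        .collect()
}

/// Same result, organized in three phases (hypothetical sketch).
fn chunked_scan_ref(a: &[f64], b: &[f64], h0: f64, chunk_size: usize) -> Vec<f64> {
    assert_eq!(a.len(), b.len());
    assert!(chunk_size > 0);
    let n = a.len();

    // Phase 1 (independent per chunk): scan each chunk from a zero state and
    // record the running product of a inside the chunk.
    let mut local = vec![0.0; n];
    let mut decay = vec![1.0; n];
    for start in (0..n).step_by(chunk_size) {
        let end = (start + chunk_size).min(n);
        let (mut h, mut p) = (0.0, 1.0);
        for t in start..end {
            h = a[t] * h + b[t];
            p *= a[t];
            local[t] = h;
            decay[t] = p;
        }
    }

    // Phase 2 (sequential): carry the true state across chunk boundaries.
    let mut carries: Vec<f64> = Vec::new();
    let mut carry = h0;
    for start in (0..n).step_by(chunk_size) {
        let end = (start + chunk_size).min(n);
        carries.push(carry);
        carry = decay[end - 1] * carry + local[end - 1];
    }

    // Phase 3 (independent per position): add the decayed incoming carry.
    (0..n)
        .map(|t| local[t] + decay[t] * carries[t / chunk_size])
        .collect()
}
```

Phases 1 and 3 touch each chunk or position independently, which is why they are the natural candidates for parallel execution; phase 2 does one multiply-add per chunk and has to run in order.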

The duality in one call — apply an implicit semiseparable matrix to a vector (the "attention = SSM" object):

use zilla_muf::structured::semiseparable_matvec;

let n = 4;
let rank = 1;                  // SSM state dimension
let a = [0.9, 0.9, 0.9, 0.9];  // scalar decay per timestep
let b = [1.0, 1.0, 1.0, 1.0];  // B_j vectors, length n * rank, row-major
let c = [1.0, 1.0, 1.0, 1.0];  // C_i vectors, length n * rank, row-major
let x = [1.0, 2.0, 3.0, 4.0];  // input sequence

let y = semiseparable_matvec(&a, &b, &c, &x, rank, /* chunk_size */ 2);
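
As a mental model for the call above, the sketch below spells out the matrix being applied under the usual Mamba-2 convention: h_t = a_t · h_{t-1} + B_t x_t and y_t = C_t · h_t, so entry (i, j) of the implicit matrix is (C_i · B_j) times the product a_{j+1} ... a_i for j <= i, and zero otherwise. Both function names are hypothetical, and whether the crate uses exactly this convention (zero initial state, a_0 unused by the dense form) is an assumption.

```rust
/// Dense O(n^2) reference: y_i = sum_{j<=i} (C_i . B_j) * (a_{j+1} * ... * a_i) * x_j.
fn semiseparable_matvec_dense(
    a: &[f64], b: &[f64], c: &[f64], x: &[f64], rank: usize,
) -> Vec<f64> {
    let n = x.len();
    assert_eq!(a.len(), n);
    assert_eq!(b.len(), n * rank);
    assert_eq!(c.len(), n * rank);
    let mut y = vec![0.0; n];
    for i in 0..n {
        for j in 0..=i {
            // Empty product (j == i) is 1.0.
            let decay: f64 = a[j + 1..i + 1].iter().product();
            let cb: f64 = (0..rank).map(|r| c[i * rank + r] * b[j * rank + r]).sum();
            y[i] += cb * decay * x[j];
        }
    }
    y
}

/// The same product computed as a recurrence over a state of size `rank`.
fn semiseparable_matvec_recurrent(
    a: &[f64], b: &[f64], c: &[f64], x: &[f64], rank: usize,
) -> Vec<f64> {
    let n = x.len();
    let mut h = vec![0.0; rank];
    let mut y: Vec<f64> = Vec::with_capacity(n);
    for t in 0..n {
        for r in 0..rank {
            h[r] = a[t] * h[r] + b[t * rank + r] * x[t];
        }
        let yt: f64 = (0..rank).map(|r| c[t * rank + r] * h[r]).sum();
        y.push(yt);
    }
    y
}
```

The dense form is the "attention" view (a masked, decayed score matrix applied to x) and the recurrent form is the "SSM" view; under the stated convention they agree to within rounding.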

Discretize a continuous system before running SSM math:

use zilla_muf::discretize::zoh_discretize;

let (a_bar, b_bar) = zoh_discretize(-1.0_f64, 1.0, 0.1); // (A, B, delta)
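
For reference, the textbook zero-order-hold formulas for a scalar (or diagonal) system x' = A x + B u with step delta are a_bar = exp(delta · A) and b_bar = (exp(delta · A) - 1) / A · B. The sketch below implements exactly that; zoh_scalar is a hypothetical name, and it is an assumption that zoh_discretize computes the same quantities in the same order.

```rust
/// Scalar zero-order hold (hypothetical helper, not the crate's function).
/// Returns (a_bar, b_bar) with a_bar = exp(delta*a), b_bar = (a_bar - 1) / a * b.
fn zoh_scalar(a: f64, b: f64, delta: f64) -> (f64, f64) {
    let z = delta * a;
    let a_bar = z.exp();
    // phi = (exp(z) - 1) / z, using exp_m1 for accuracy near z = 0
    // and the limit value 1 at z = 0 exactly.
    let phi = if z == 0.0 { 1.0 } else { z.exp_m1() / z };
    (a_bar, phi * delta * b)
}
```

With the quick-start inputs (A = -1, B = 1, delta = 0.1) these formulas give a_bar ≈ 0.9048 and b_bar ≈ 0.0952. Some SSM implementations simplify the input term, so treat this as the standard ZOH definition rather than a statement about the crate's internals.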

Design principles

  • Slices in, Vec out — never a tensor type. The library stays framework-agnostic; your attention and SSM crates wrap these calls with whatever tensor layer they use.
  • Generic over Float from the start. Use f32 for speed, f64 for numerical testing — same code path.
  • Correctness-first. Every primitive is tested against an independent oracle (a dense/naive reference built a different way), not a restatement of its own implementation. The chunked scan is checked against the sequential one; FFT conv against the Toeplitz matvec; Cauchy/Vandermonde against dense and powu references.
  • CPU reference, GPU later. This crate is the correctness reference and CPU fallback. GPU kernels (e.g. via cudarc) belong in a separate crate so this one stays simple and dependency-light.
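
The "independent oracle" principle can be illustrated with a Vandermonde product, assuming the convention y_i = sum_j v_i^j · x_j. One implementation evaluates each row by Horner's rule and the other builds every power explicitly with powi, so a shared bug is unlikely. Both function names and the convention are assumptions for illustration, written in the slices-in, Vec-out style described above.

```rust
/// Fast path: Horner evaluation of the polynomial with coefficients x at each node.
fn vandermonde_matvec_horner(nodes: &[f64], x: &[f64]) -> Vec<f64> {
    nodes
        .iter()
        .map(|&v| x.iter().rev().fold(0.0, |acc, &xj| acc * v + xj))
        .collect()
}

/// Oracle: form each matrix entry v^j explicitly, then sum.
fn vandermonde_matvec_dense(nodes: &[f64], x: &[f64]) -> Vec<f64> {
    nodes
        .iter()
        .map(|&v| {
            x.iter()
                .enumerate()
                .map(|(j, &xj)| v.powi(j as i32) * xj)
                .sum::<f64>()
        })
        .collect()
}
```

A test in this style asserts that the two results agree within a tolerance on several inputs, rather than comparing the fast path against a copy of itself.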

Testing & benchmarks

cargo test                              # default feature set
cargo test --features parallel          # exercises the rayon scan path
cargo test --features fft               # exercises fft_conv
cargo test --features parallel,fft      # everything
cargo test --all-features               # full matrix (mirrors CI)
cargo test --doc --all-features         # doc-test examples

cargo bench                             # criterion: sequential vs chunked across sizes
cargo bench --bench structured_bench    # criterion: toeplitz, semiseparable, fft_conv

CI runs all four feature combinations on every push and PR. Note: the parallel scan path is only covered under --features parallel, so run that combo when touching chunked_scan.


Roadmap

Implemented today: sequential + chunked scan, stable ops, the four structured matvecs, ZOH discretization, complex exp, and FFT conv. Still planned:

  • Parallel/associative (Blelloch) scan — for when num_chunks gets large enough that hierarchical blocking beats a flat carry pass.
  • More discretization — bilinear / Tustin transform alongside ZOH.
  • More complex helpers: diag_complex_matrix_exp and conjugate_pair_output are now implemented.
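
The planned associative scan rests on one observation: each step h -> a · h + b is an affine map, and composing affine maps is associative. The sketch below shows that combine operator and a divide-and-conquer reduction built on it. It is not a Blelloch scan and not part of the crate; Step, combine, and reduce are hypothetical names used only to illustrate why the recurrence can be parallelized.

```rust
/// One affine step h -> a * h + b, stored as (a, b).
type Step = (f64, f64);

/// Compose `first` then `second`:
/// second(first(h)) = a2 * (a1 * h + b1) + b2 = (a1 * a2) * h + (a2 * b1 + b2).
fn combine(first: Step, second: Step) -> Step {
    (first.0 * second.0, second.0 * first.1 + second.1)
}

/// Tree reduction over a slice of steps. The two halves do not depend on each
/// other, which is the property a parallel scan exploits.
fn reduce(steps: &[Step]) -> Step {
    match steps.len() {
        0 => (1.0, 0.0), // identity map
        1 => steps[0],
        n => {
            let (left, right) = steps.split_at(n / 2);
            combine(reduce(left), reduce(right))
        }
    }
}
```

Applying the reduced pair (a, b) to an initial state h0 as a · h0 + b yields the final state of the sequential recurrence; a full scan additionally keeps the intermediate prefixes.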

License

BSD 3-Clause © 2026 Dave Banerjee. See LICENSE.