zilla-muf 0.1.1

# The Real Question First

Before we design folder structure, it's worth naming *why* these two architectures share math at all — this should drive what goes in the library, not just convenience.

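Attention and SSMs (state space models) converge on the same underlying computation when the attention is linear (or otherwise uses a structured mask): both can be expressed as multiplying by a **structured matrix** (semiseparable, or low-displacement-rank families such as Toeplitz and Cauchy) instead of a dense one. This is the thesis behind Mamba-2's "SSD" (structured state space duality) paper: linear attention and SSMs are two views of the same semiseparable matrix transformation, and the chunked algorithm is how you compute it efficiently. Softmax-based sparse attention shares less of this (mostly the stable softmax and masking utilities). So your shared library isn't just "math utils," it's the structured-matrix engine that both architectures sit on top of.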

That reframes the design into six concrete modules.

---

## 1. Scan Operations (the core)
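This is the highest-leverage piece. Linear attention and SSMs both reduce to a first-order linear recurrence `h_t = A_t * h_{t-1} + B_t`, where `A_t` is the per-step decay/transition and `B_t` is the input's contribution at step `t`.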

- Sequential scan (reference/correctness baseline)
- Parallel/associative (Blelloch) scan
- **Chunked scan** — split sequence into blocks, do intra-chunk dense compute + inter-chunk recurrence. This is the one function both attention and SSMs actually call.
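To make the three variants concrete, here is a minimal sketch for the scalar case with `f64`, so it stays dependency-free. The function names and the three-phase layout are assumptions for illustration, not a fixed API. Inside each chunk it uses a plain loop; an SSD-style kernel would replace that with a dense matmul.

```rust
/// Sequential reference scan for h_t = a_t * h_{t-1} + b_t, starting from h = 0.
pub fn sequential_scan(a: &[f64], b: &[f64]) -> Vec<f64> {
    assert_eq!(a.len(), b.len(), "a and b must have the same length");
    let mut h = 0.0;
    a.iter()
        .zip(b)
        .map(|(&a_t, &b_t)| {
            h = a_t * h + b_t;
            h
        })
        .collect()
}

/// Associative combine for affine maps h -> a*h + b.
/// Applying `left` and then `right` gives h -> (a_r*a_l)*h + (a_r*b_l + b_r).
/// This is the operator a Blelloch-style parallel scan would use.
pub fn combine(left: (f64, f64), right: (f64, f64)) -> (f64, f64) {
    (right.0 * left.0, right.0 * left.1 + right.1)
}

/// Chunked scan in three phases:
/// 1. reduce every chunk to one affine map (independent per chunk),
/// 2. run a short sequential recurrence over those summaries,
/// 3. rescan every chunk from the state carried into it.
pub fn chunked_scan(a: &[f64], b: &[f64], chunk_size: usize) -> Vec<f64> {
    assert_eq!(a.len(), b.len(), "a and b must have the same length");
    assert!(chunk_size > 0, "chunk_size must be positive");

    // Phase 1: one (a, b) summary per chunk; (1, 0) is the identity map.
    let summaries: Vec<(f64, f64)> = a
        .chunks(chunk_size)
        .zip(b.chunks(chunk_size))
        .map(|(ac, bc)| {
            ac.iter()
                .zip(bc)
                .fold((1.0, 0.0), |acc, (&a_t, &b_t)| combine(acc, (a_t, b_t)))
        })
        .collect();

    // Phase 2: state entering each chunk.
    let mut carry_in = Vec::with_capacity(summaries.len());
    let mut h = 0.0;
    for &(a_c, b_c) in &summaries {
        carry_in.push(h);
        h = a_c * h + b_c;
    }

    // Phase 3: intra-chunk scan seeded with the carried-in state.
    let mut out = Vec::with_capacity(a.len());
    for ((ac, bc), &h0) in a.chunks(chunk_size).zip(b.chunks(chunk_size)).zip(&carry_in) {
        let mut h = h0;
        for (&a_t, &b_t) in ac.iter().zip(bc) {
            h = a_t * h + b_t;
            out.push(h);
        }
    }
    out
}
```

Phases 1 and 3 touch each chunk independently, which is where `rayon` or a GPU kernel would slot in. Only phase 2 is inherently sequential, and it runs over `len / chunk_size` elements.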

## 2. Numerically Stable Elementwise Ops
Shared nonlinearities and stability tricks that show up identically in both:
- Softmax + log-sum-exp (attention, and gating in some SSM variants)
- `segsum` (segment sum: the sums over every segment `j+1..=i` of a sequence, used to build the decay mask in the SSD chunked algorithm)
- SiLU/Swish, sigmoid (Mamba's gating)
- RMSNorm
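As a rough sketch of what the stability tricks look like in practice, here are log-sum-exp, softmax, and a segment-sum table in plain `f64`. The row-major `T x T` layout for `segsum` and the use of `-inf` above the diagonal are assumptions chosen so that `exp` of the result gives a lower-triangular decay mask; your actual layout may differ.

```rust
/// log(sum_i exp(x_i)), with the maximum factored out so exp never overflows.
pub fn log_sum_exp(x: &[f64]) -> f64 {
    let max = x.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    if !max.is_finite() {
        // Empty or all -inf input gives -inf; a +inf entry gives +inf.
        return max;
    }
    max + x.iter().map(|&v| (v - max).exp()).sum::<f64>().ln()
}

/// Softmax computed as exp(x_i - logsumexp(x)), so large inputs are safe.
/// Assumes at least one finite entry.
pub fn softmax(x: &[f64]) -> Vec<f64> {
    let lse = log_sum_exp(x);
    x.iter().map(|&v| (v - lse).exp()).collect()
}

/// Segment sums as a row-major T x T table:
/// out[i*T + j] = x[j+1] + ... + x[i] for j <= i (0 on the diagonal),
/// and -inf for j > i.
/// Each column is built by running addition instead of subtracting two
/// prefix sums, which avoids cancellation when the sums are large.
pub fn segsum(x: &[f64]) -> Vec<f64> {
    let t = x.len();
    let mut out = vec![f64::NEG_INFINITY; t * t];
    for j in 0..t {
        let mut acc = 0.0;
        out[j * t + j] = 0.0;
        for i in (j + 1)..t {
            acc += x[i];
            out[i * t + j] = acc;
        }
    }
    out
}
```

If `x` holds log-decays, then `exp(segsum(x))` is the matrix of cumulative decays between every pair of positions, with zeros above the diagonal.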

## 3. Structured Matrix Primitives
- Semiseparable matrix-vector product (the unifying object)
- Toeplitz matvec (convolutional view of SSMs)
- Cauchy matvec (S4's DPLR kernel computation) and Vandermonde matvec (the diagonal S4D kernel)
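To show why the semiseparable matvec is "the unifying object," here is a small sketch for the simplest (scalar, 1-semiseparable) case. It builds the dense lower-triangular matrix `M[i][j] = a[j+1] * ... * a[i]` and checks it against the linear-time recurrence. All names are hypothetical, and a real implementation would handle higher-rank state.

```rust
/// Dense 1-semiseparable matrix: M[i][j] = a[j+1] * ... * a[i] for j <= i
/// (1 on the diagonal), and 0 above the diagonal. Note a[0] never appears.
pub fn semiseparable_dense(a: &[f64]) -> Vec<Vec<f64>> {
    let t = a.len();
    let mut m = vec![vec![0.0; t]; t];
    for j in 0..t {
        let mut prod = 1.0;
        m[j][j] = 1.0;
        for i in (j + 1)..t {
            prod *= a[i];
            m[i][j] = prod;
        }
    }
    m
}

/// Plain O(T^2) dense matrix-vector product (the "attention" view).
pub fn dense_matvec(m: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(&m_ij, &x_j)| m_ij * x_j).sum::<f64>())
        .collect()
}

/// O(T) product with the same matrix, never materializing it (the "SSM" view):
/// y_i = a_i * y_{i-1} + x_i.
pub fn semiseparable_matvec(a: &[f64], x: &[f64]) -> Vec<f64> {
    assert_eq!(a.len(), x.len(), "a and x must have the same length");
    let mut y = Vec::with_capacity(x.len());
    let mut state = 0.0;
    for (&a_i, &x_i) in a.iter().zip(x) {
        state = a_i * state + x_i;
        y.push(state);
    }
    y
}
```

The two functions compute the same `y`. The dense form is what a masked linear-attention kernel multiplies by, and the recurrent form is what an SSM steps through.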

## 4. Discretization
SSMs need continuous → discrete conversion:
- Zero-order hold (ZOH)
- Bilinear/Tustin transform
- These are pure math, no model-specific state — perfect library candidates.
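For a diagonal state matrix each state channel is a scalar system `x' = a*x + b*u`, so both rules reduce to a couple of lines. This is a sketch under that scalar assumption; the function names and the `(a_bar, b_bar)` return convention are mine, not an established API.

```rust
/// Zero-order hold for the scalar system x' = a*x + b*u with step `dt`:
///   a_bar = exp(dt*a)
///   b_bar = (exp(dt*a) - 1) / a * b
pub fn zoh(a: f64, b: f64, dt: f64) -> (f64, f64) {
    let z = dt * a;
    let a_bar = z.exp();
    // (exp(z) - 1) / z tends to 1 as z -> 0; exp_m1 keeps precision for small z.
    let phi = if z.abs() < 1e-12 { 1.0 } else { z.exp_m1() / z };
    (a_bar, phi * dt * b)
}

/// Bilinear (Tustin) transform for the same scalar system:
///   a_bar = (1 + dt*a/2) / (1 - dt*a/2)
///   b_bar = dt*b / (1 - dt*a/2)
/// Assumes dt*a != 2, which always holds for stable systems (a < 0).
pub fn bilinear(a: f64, b: f64, dt: f64) -> (f64, f64) {
    let half = 0.5 * dt * a;
    let denom = 1.0 - half;
    ((1.0 + half) / denom, dt * b / denom)
}
```

Both return the coefficients of the discrete recurrence `x_k = a_bar * x_{k-1} + b_bar * u_k`, which is the form the scan module consumes.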

## 5. Complex Arithmetic Helpers
S4-style models diagonalize state matrices into complex eigenvalues. Thin wrappers over `num-complex` for: complex exp, complex matrix exponential, conjugate-pair handling.
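Conjugate-pair handling is the least obvious of these, so here is a small illustration. To stay dependency-free it uses a hand-rolled `C64` struct as a stand-in; in the actual crate you would presumably wrap `num_complex::Complex` instead. The kernel form `K[k] = sum_n 2 * Re(c_n * lambda_n^k)` assumes a diagonal state matrix whose eigenvalues come in conjugate pairs, with only one member of each pair stored.

```rust
/// Minimal stand-in for a complex number (hypothetical; a real crate would
/// use num_complex::Complex<f64>).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct C64 {
    pub re: f64,
    pub im: f64,
}

impl C64 {
    /// Complex product.
    pub fn times(self, o: C64) -> C64 {
        C64 {
            re: self.re * o.re - self.im * o.im,
            im: self.re * o.im + self.im * o.re,
        }
    }

    /// Complex exponential via Euler's formula: e^(a+bi) = e^a (cos b + i sin b).
    pub fn exp(self) -> C64 {
        let r = self.re.exp();
        C64 { re: r * self.im.cos(), im: r * self.im.sin() }
    }
}

/// Real-valued kernel from one member of each conjugate pair:
///   K[k] = sum_n 2 * Re(c_n * lambda_n^k)
/// The factor 2 accounts for the conjugate partner, whose contribution has the
/// same real part and an imaginary part that cancels.
pub fn conj_pair_kernel(lambda: &[C64], c: &[C64], len: usize) -> Vec<f64> {
    assert_eq!(lambda.len(), c.len(), "one coefficient per eigenvalue");
    // power[n] holds lambda_n^k for the current k, starting at k = 0.
    let mut power = vec![C64 { re: 1.0, im: 0.0 }; lambda.len()];
    let mut out = Vec::with_capacity(len);
    for _ in 0..len {
        let k_val: f64 = power
            .iter()
            .zip(c)
            .map(|(&p, &c_n)| 2.0 * c_n.times(p).re)
            .sum();
        out.push(k_val);
        for (p, &l) in power.iter_mut().zip(lambda) {
            *p = (*p).times(l);
        }
    }
    out
}
```

Storing half of each pair halves both memory and multiplies, which is the main reason to centralize this helper.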

## 6. FFT-based Convolution
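S4 computes long convolutions via FFT. On the attention side, FFT shows up in Fourier token-mixing models such as FNet, which replace attention with a Fourier transform rather than sparsify it, so what is shared there is the FFT itself rather than the convolution. One shared `fft_conv` module avoids duplicating it.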
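Here is a sketch of what `fft_conv` computes: a causal convolution `y[t] = sum_s k[s] * u[t-s]`, done by zero-padding, transforming, multiplying pointwise, and transforming back. The hand-rolled radix-2 FFT is only there to keep the example self-contained; the real module would delegate the transforms to `rustfft`. The function names and the choice to truncate the output to `u.len()` are assumptions.

```rust
use std::f64::consts::PI;

/// (re, im) pair; a stand-in for a proper complex type.
type C = (f64, f64);

fn c_mul(a: C, b: C) -> C {
    (a.0 * b.0 - a.1 * b.1, a.0 * b.1 + a.1 * b.0)
}

/// Recursive radix-2 Cooley-Tukey FFT. `x.len()` must be a power of two.
/// With `inverse = true` the twiddle sign flips; the caller divides by n.
fn fft(x: &[C], inverse: bool) -> Vec<C> {
    let n = x.len();
    if n == 1 {
        return x.to_vec();
    }
    let even: Vec<C> = x.iter().step_by(2).cloned().collect();
    let odd: Vec<C> = x.iter().skip(1).step_by(2).cloned().collect();
    let (e, o) = (fft(&even, inverse), fft(&odd, inverse));
    let sign = if inverse { 1.0 } else { -1.0 };
    let mut out = vec![(0.0, 0.0); n];
    for k in 0..n / 2 {
        let ang = sign * 2.0 * PI * (k as f64) / (n as f64);
        let t = c_mul((ang.cos(), ang.sin()), o[k]);
        out[k] = (e[k].0 + t.0, e[k].1 + t.1);
        out[k + n / 2] = (e[k].0 - t.0, e[k].1 - t.1);
    }
    out
}

/// Causal convolution via FFT, truncated to the input length.
pub fn fft_conv(u: &[f64], k: &[f64]) -> Vec<f64> {
    if u.is_empty() {
        return Vec::new();
    }
    // Pad past the full linear-convolution length so nothing wraps around.
    let n = (u.len() + k.len()).next_power_of_two();
    let pad = |v: &[f64]| -> Vec<C> {
        let mut p: Vec<C> = v.iter().map(|&r| (r, 0.0)).collect();
        p.resize(n, (0.0, 0.0));
        p
    };
    let (fu, fk) = (fft(&pad(u), false), fft(&pad(k), false));
    let prod: Vec<C> = fu.iter().zip(&fk).map(|(&a, &b)| c_mul(a, b)).collect();
    fft(&prod, true)
        .iter()
        .take(u.len())
        .map(|c| c.0 / n as f64)
        .collect()
}

/// O(L*K) reference implementation to test the FFT path against.
pub fn direct_conv(u: &[f64], k: &[f64]) -> Vec<f64> {
    (0..u.len())
        .map(|t| {
            k.iter()
                .take(t + 1)
                .enumerate()
                .map(|(s, &k_s)| k_s * u[t - s])
                .sum::<f64>()
        })
        .collect()
}
```

Keeping `direct_conv` around as a test oracle costs almost nothing and catches padding and normalization mistakes early.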

---

## Crate Architecture

```
ssm-attn-math/
├── Cargo.toml
├── src/
│   ├── lib.rs
│   ├── scan/
│   │   ├── sequential.rs
│   │   ├── parallel.rs
│   │   └── chunked.rs
│   ├── stable_ops/
│   │   ├── softmax.rs
│   │   ├── segsum.rs
│   │   └── activations.rs
│   ├── structured/
│   │   ├── semiseparable.rs
│   │   ├── toeplitz.rs
│   │   └── cauchy_vandermonde.rs
│   ├── discretize.rs
│   ├── complex_ops.rs
│   └── fft_conv.rs
└── benches/
```

**Dependency choices:**
| Need | Crate |
|---|---|
| Generic over f32/f64 | `num-traits` |
| Complex numbers | `num-complex` |
| FFT | `rustfft` |
| CPU parallelism | `rayon` |
| Tensor-ish layout (optional) | `ndarray` |
| Benchmarking | `criterion` |

Keep the core (scans, elementwise ops, discretization) `no_std`-friendly where possible if you ever want WASM or embedded. `rayon` needs `std`, so put it behind a Cargo feature. For a v1, don't over-engineer this.

## API Design Pattern

Generic over float type from the start, since you'll want f32 for speed and f64 for numerical testing:

```rust
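use num_traits::Float;

/// Associative combine step for a scan over (coefficient, value) pairs.
pub trait ScanOp<T: Float> {
	fn combine(&self, a: (T, T), b: (T, T)) -> (T, T);
}

/// Chunked scan of h_t = a_t * h_{t-1} + b_t; returns every h_t.
pub fn chunked_scan<T: Float>(
	a: &[T],       // decay/transition coefficients
	b: &[T],       // inputs
	chunk_size: usize,
) -> Vec<T> {
	todo!("body elided in this API sketch")
}

/// Segment sums over `x` (used by the SSD chunked algorithm).
pub fn segsum<T: Float>(x: &[T]) -> Vec<T> {
	todo!("body elided in this API sketch")
}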
```

Design rule: **functions take slices, not your model's tensor type.** Keep this library tensor-framework-agnostic — your attention and SSM crates each wrap these calls with their own tensor types (whether that's `candle`, `burn`, or raw `Vec<f32>`).
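A sketch of how that wrapping might look from the model side. `BatchSeq` is a hypothetical row-major batch type standing in for whatever `candle`, `burn`, or your own code provides, and `scan_slice` stands in for a library function; neither name comes from an existing crate.

```rust
/// Library side: knows nothing about tensors, only slices.
pub fn scan_slice(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "a and b must have the same length");
    let mut h = 0.0f32;
    a.iter()
        .zip(b)
        .map(|(&a_t, &b_t)| {
            h = a_t * h + b_t;
            h
        })
        .collect()
}

/// Model side (hypothetical): a batch of sequences stored row-major.
pub struct BatchSeq {
    pub data: Vec<f32>,
    pub batch: usize,
    pub len: usize,
}

impl BatchSeq {
    /// Borrow one sequence of the batch as a plain slice.
    pub fn row(&self, i: usize) -> &[f32] {
        &self.data[i * self.len..(i + 1) * self.len]
    }
}

/// Thin adapter: slice each row out of the tensor type, call the library,
/// and pack the results back into the tensor type.
pub fn scan_batch(a: &BatchSeq, b: &BatchSeq) -> BatchSeq {
    assert_eq!((a.batch, a.len), (b.batch, b.len), "shape mismatch");
    let mut data = Vec::with_capacity(a.data.len());
    for i in 0..a.batch {
        data.extend(scan_slice(a.row(i), b.row(i)));
    }
    BatchSeq { data, batch: a.batch, len: a.len }
}
```

The adapter is the only code that changes if you swap tensor frameworks; the math in `scan_slice` and its tests stay untouched.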

---

## Build Order (don't boil the ocean)

1. **Sequential scan + segsum** — correctness baseline, easiest to test against NumPy/PyTorch reference values
2. **Chunked scan** — the actual payoff; benchmark against #1
3. **Discretization (ZOH)** — needed before any real SSM math works
4. **Stable softmax/activations** — needed for attention side
5. **Structured matvecs** — once you know which SSM variant (S4 vs Mamba) you're targeting, since DPLR vs diagonal parameterization changes what you need here
6. **FFT conv** — only if you're doing S4-style long convolutions; Mamba doesn't need it

Want me to scaffold the actual `Cargo.toml` + module stubs as a starting repo, or work through the chunked-scan implementation first since that's the load-bearing piece?