# The Real Question First
Before we design folder structure, it's worth naming *why* these two architectures share math at all — this should drive what goes in the library, not just convenience.
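Attention and SSMs (state space models) can share an underlying computation: both can be expressed as multiplying the sequence by a **structured matrix** (semiseparable, in the SSM case) instead of a dense one. This is the thesis of Mamba-2's SSD (structured state space duality) paper: linear attention with a causal, decay-structured mask and a selective SSM are two views of the same computation, one written as a matrix product and one as a recurrence. The duality is stated for linear (kernel) attention rather than softmax attention, so a softmax-based sparse-attention crate will mostly share the stable ops and chunking machinery, not the recurrence itself. So your shared library isn't just "math utils"; it's the structured-matrix engine that both architectures sit on top of.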
That reframes the design nicely into 6 concrete modules.
---
## 1. Scan Operations (the core)
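This is the highest-leverage piece. Both architectures (attention in its linear, causally masked form) reduce to a first-order recurrence `h_t = A_t * h_{t-1} + B_t`, where `A_t` is the per-step decay/transition and `B_t` is the input contribution at step `t`.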
- Sequential scan (reference/correctness baseline)
- Parallel/associative (Blelloch) scan
- **Chunked scan** — split sequence into blocks, do intra-chunk dense compute + inter-chunk recurrence. This is the one function both attention and SSMs actually call.
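To make the chunked variant concrete, here is a minimal sketch of the sequential baseline and a two-pass chunked scan for the scalar recurrence above. It assumes plain `f64` and a zero initial state so it stays dependency-free; the function names mirror the module layout, but the bodies are illustrative and are not the SSD kernel itself (which does the intra-chunk work as dense matmuls).

```rust
/// Reference scan: h_t = a[t] * h_{t-1} + b[t], with a zero initial state.
pub fn sequential_scan(a: &[f64], b: &[f64]) -> Vec<f64> {
    assert_eq!(a.len(), b.len(), "a and b must have equal length");
    let mut h = 0.0;
    a.iter()
        .zip(b)
        .map(|(&a_t, &b_t)| {
            h = a_t * h + b_t;
            h
        })
        .collect()
}

/// Same result as `sequential_scan`, computed in two passes over chunks.
pub fn chunked_scan(a: &[f64], b: &[f64], chunk_size: usize) -> Vec<f64> {
    assert_eq!(a.len(), b.len(), "a and b must have equal length");
    assert!(chunk_size > 0, "chunk_size must be positive");

    // Pass 1 (independent per chunk): scan each chunk from a zero state and
    // record the cumulative decay from the chunk start to each position.
    let mut out = Vec::with_capacity(a.len());
    let mut decay = Vec::with_capacity(a.len());
    for (a_c, b_c) in a.chunks(chunk_size).zip(b.chunks(chunk_size)) {
        let (mut h, mut p) = (0.0, 1.0);
        for (&a_t, &b_t) in a_c.iter().zip(b_c) {
            h = a_t * h + b_t;
            p *= a_t;
            out.push(h);
            decay.push(p);
        }
    }

    // Pass 2: add the state carried in from the previous chunk, scaled by the
    // decay accumulated since the chunk boundary.
    let mut carry = 0.0;
    for (out_c, decay_c) in out.chunks_mut(chunk_size).zip(decay.chunks(chunk_size)) {
        for (h, &p) in out_c.iter_mut().zip(decay_c) {
            *h += p * carry;
        }
        if let Some(&last) = out_c.last() {
            carry = last;
        }
    }
    out
}
```

Pass 1 touches each chunk independently, so it is the natural place for `rayon` or a batched matmul; only the carry across chunk boundaries is inherently sequential.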
## 2. Numerically Stable Elementwise Ops
Shared nonlinearities and stability tricks that show up identically in both:
- Softmax + log-sum-exp (attention, and gating in some SSM variants)
- `segsum` (segment-sum, used in the SSD chunked algorithm)
- SiLU/Swish, sigmoid (Mamba's gating)
- RMSNorm
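As a rough illustration of the stability tricks in this list, the sketch below shows a max-shifted log-sum-exp, a softmax built on it, and a `segsum` that accumulates segment sums directly instead of subtracting cumulative sums. It assumes plain `f64` and a row-major `T x T` output for `segsum`; both choices are assumptions made for the example, not something the design above fixes.

```rust
/// log(sum_i exp(x_i)), computed by factoring out the maximum so that
/// `exp` never overflows. Empty or all -inf input yields -inf.
pub fn log_sum_exp(x: &[f64]) -> f64 {
    let max = x.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    if !max.is_finite() {
        return max;
    }
    max + x.iter().map(|&v| (v - max).exp()).sum::<f64>().ln()
}

/// Softmax as exp(x_i - logsumexp(x)); assumes at least one finite entry.
pub fn softmax(x: &[f64]) -> Vec<f64> {
    let lse = log_sum_exp(x);
    x.iter().map(|&v| (v - lse).exp()).collect()
}

/// Segment sums as a row-major T x T matrix:
/// entry (i, j) = x[j+1] + ... + x[i] for i >= j (0 on the diagonal),
/// and -inf above the diagonal so that exp() of it gives a causal mask.
pub fn segsum(x: &[f64]) -> Vec<f64> {
    let t = x.len();
    let mut out = vec![f64::NEG_INFINITY; t * t];
    for j in 0..t {
        let mut acc = 0.0;
        out[j * t + j] = 0.0;
        for i in (j + 1)..t {
            acc += x[i];
            out[i * t + j] = acc;
        }
    }
    out
}
```

Applying `exp` elementwise to `segsum` of log-decays gives the lower-triangular matrix of decay products that the chunked algorithm uses inside each chunk.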
## 3. Structured Matrix Primitives
- Semiseparable matrix-vector product (the unifying object)
- Toeplitz matvec (convolutional view of SSMs)
- Cauchy matvec (S4's DPLR kernel computation) and Vandermonde matvec (the diagonal S4D kernel)
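To show why the semiseparable matvec is the unifying object, here is a small sketch of the scalar (1-semiseparable) case computed two ways: by materializing the lower-triangular matrix entries, and by the equivalent linear-time recurrence. The function names and the `f64`-only signatures are hypothetical, chosen for the example.

```rust
/// Dense O(T^2) reference: y = L x, where
/// L[i][j] = a[j+1] * ... * a[i] for i >= j (1 on the diagonal), 0 otherwise.
pub fn semiseparable_matvec_dense(a: &[f64], x: &[f64]) -> Vec<f64> {
    assert_eq!(a.len(), x.len(), "a and x must have equal length");
    let t = a.len();
    let mut y = vec![0.0; t];
    for i in 0..t {
        let mut l_ij = 1.0; // L[i][i] is the empty product
        for j in (0..=i).rev() {
            y[i] += l_ij * x[j];
            l_ij *= a[j]; // L[i][j-1] = L[i][j] * a[j]
        }
    }
    y
}

/// O(T) route to the same product: y_i = a[i] * y_{i-1} + x[i], y_{-1} = 0.
pub fn semiseparable_matvec_linear(a: &[f64], x: &[f64]) -> Vec<f64> {
    assert_eq!(a.len(), x.len(), "a and x must have equal length");
    let mut y_prev = 0.0;
    a.iter()
        .zip(x)
        .map(|(&a_i, &x_i)| {
            y_prev = a_i * y_prev + x_i;
            y_prev
        })
        .collect()
}
```

The dense form is what the attention-style view computes; the linear form is the SSM-style view. Having both in the library gives a built-in cross-check.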
## 4. Discretization
SSMs need continuous → discrete conversion:
- Zero-order hold (ZOH)
- Bilinear/Tustin transform
- These are pure math, no model-specific state — perfect library candidates.
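For a diagonal state matrix, both rules reduce to scalar formulas per state dimension. The sketch below assumes a scalar continuous system `x' = a*x + b*u` with step size `dt`; the function names and tuple return type are assumptions for illustration.

```rust
/// Zero-order hold: a_bar = exp(dt*a), b_bar = (exp(dt*a) - 1) / a * b.
pub fn zoh(a: f64, b: f64, dt: f64) -> (f64, f64) {
    let z = dt * a;
    let a_bar = z.exp();
    // (exp(z) - 1) / a == dt * (exp(z) - 1) / z. `exp_m1` keeps precision for
    // small |z|, and the z -> 0 limit of the ratio is 1, giving b_bar = dt * b.
    let b_bar = if z == 0.0 {
        dt * b
    } else {
        dt * z.exp_m1() / z * b
    };
    (a_bar, b_bar)
}

/// Bilinear (Tustin): a_bar = (1 + dt*a/2) / (1 - dt*a/2),
/// b_bar = dt * b / (1 - dt*a/2).
pub fn bilinear(a: f64, b: f64, dt: f64) -> (f64, f64) {
    let half = 0.5 * dt * a;
    let denom = 1.0 - half;
    ((1.0 + half) / denom, dt * b / denom)
}
```

For example, with `a = -1`, `b = 1`, `dt = 0.1`, the two rules give nearly the same `a_bar` (about 0.9048 for ZOH, 0.95/1.05 for bilinear); they diverge as `dt * a` grows in magnitude.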
## 5. Complex Arithmetic Helpers
S4-style models diagonalize state matrices into complex eigenvalues. Thin wrappers over `num-complex` for: complex exp, the matrix exponential (an elementwise exp once the matrix is diagonal), and conjugate-pair handling.
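The helpers involved are small. The sketch below uses a bare `(re, im)` tuple so it compiles without dependencies; in the actual crate these would presumably wrap `num_complex::Complex` instead, and all names here are hypothetical.

```rust
/// Minimal complex number as (re, im), standing in for a real complex type.
pub type C = (f64, f64);

/// Complex multiplication.
pub fn c_mul(x: C, y: C) -> C {
    (x.0 * y.0 - x.1 * y.1, x.0 * y.1 + x.1 * y.0)
}

/// Complex exponential: exp(re + i*im) = exp(re) * (cos(im) + i*sin(im)).
pub fn c_exp(z: C) -> C {
    let r = z.0.exp();
    (r * z.1.cos(), r * z.1.sin())
}

/// exp(dt * diag(lambda)) for a diagonal state matrix: elementwise exp.
pub fn diag_exp(lambda: &[C], dt: f64) -> Vec<C> {
    lambda
        .iter()
        .map(|&(re, im)| c_exp((re * dt, im * dt)))
        .collect()
}

/// Real contribution of a conjugate pair of modes at time t:
/// c*exp(lambda*t) + conj(c)*exp(conj(lambda)*t) = 2 * Re(c * exp(lambda*t)).
/// Storing one eigenvalue per pair halves the state that must be kept.
pub fn conj_pair_term(c: C, lambda: C, t: f64) -> f64 {
    2.0 * c_mul(c, c_exp((lambda.0 * t, lambda.1 * t))).0
}
```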
## 6. FFT-based Convolution
S4 computes long convolutions via FFT. FNet-style token mixing, which replaces attention with Fourier transforms, relies on FFTs too. One shared `fft_conv` module avoids implementing this twice.
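As a sketch of what `fft_conv` has to do, the example below zero-pads both signals, multiplies their spectra, and inverts. To stay dependency-free it uses a small textbook recursive radix-2 FFT written for this example; the real module would presumably call `rustfft` instead, and the signature shown is an assumption.

```rust
use std::f64::consts::PI;

type C = (f64, f64); // (re, im)

fn c_mul(x: C, y: C) -> C {
    (x.0 * y.0 - x.1 * y.1, x.0 * y.1 + x.1 * y.0)
}

/// Recursive radix-2 Cooley-Tukey FFT; `x.len()` must be a power of two.
/// `sign` is -1.0 for the forward transform, +1.0 for the unscaled inverse.
fn fft(x: &[C], sign: f64) -> Vec<C> {
    let n = x.len();
    if n == 1 {
        return x.to_vec();
    }
    let even: Vec<C> = x.iter().step_by(2).copied().collect();
    let odd: Vec<C> = x.iter().skip(1).step_by(2).copied().collect();
    let (e, o) = (fft(&even, sign), fft(&odd, sign));
    let mut out = vec![(0.0, 0.0); n];
    for k in 0..n / 2 {
        let ang = sign * 2.0 * PI * (k as f64) / (n as f64);
        let t = c_mul((ang.cos(), ang.sin()), o[k]);
        out[k] = (e[k].0 + t.0, e[k].1 + t.1);
        out[k + n / 2] = (e[k].0 - t.0, e[k].1 - t.1);
    }
    out
}

/// Causal convolution y[t] = sum over s <= t of k[s] * u[t - s],
/// truncated to the length of `u`.
pub fn fft_conv(u: &[f64], k: &[f64]) -> Vec<f64> {
    let n = u.len();
    if n == 0 {
        return Vec::new();
    }
    // Zero-pad so that circular convolution equals linear convolution.
    let m = (u.len() + k.len()).next_power_of_two();
    let pad = |v: &[f64]| -> Vec<C> {
        let mut p: Vec<C> = v.iter().map(|&r| (r, 0.0)).collect();
        p.resize(m, (0.0, 0.0));
        p
    };
    let (fu, fk) = (fft(&pad(u), -1.0), fft(&pad(k), -1.0));
    let prod: Vec<C> = fu.iter().zip(&fk).map(|(&a, &b)| c_mul(a, b)).collect();
    // The inverse is unscaled, so divide by m; keep the first n outputs.
    fft(&prod, 1.0)
        .iter()
        .take(n)
        .map(|c| c.0 / m as f64)
        .collect()
}
```

With a kernel `k[s] = a^s`, this reproduces the constant-decay scan, which makes the sequential scan a convenient test oracle for the FFT path.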
---
## Crate Architecture
```
ssm-attn-math/
├── Cargo.toml
├── src/
│ ├── lib.rs
│ ├── scan/
│ │ ├── sequential.rs
│ │ ├── parallel.rs
│ │ └── chunked.rs
│ ├── stable_ops/
│ │ ├── softmax.rs
│ │ ├── segsum.rs
│ │ └── activations.rs
│ ├── structured/
│ │ ├── semiseparable.rs
│ │ ├── toeplitz.rs
│ │ └── cauchy_vandermonde.rs
│ ├── discretize.rs
│ ├── complex_ops.rs
│ └── fft_conv.rs
└── benches/
```
**Dependency choices:**
| Need | Crate |
|---|---|
| Generic over f32/f64 | `num-traits` |
| Complex numbers | `num-complex` |
| FFT | `rustfft` |
| CPU parallelism | `rayon` |
| Tensor-ish layout (optional) | `ndarray` |
| Benchmarking | `criterion` |
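Keep it `no_std`-friendly where possible if you ever want WASM or embedded, but for a v1, don't over-engineer this. In practice that mostly means keeping `rayon`, which needs threads and `std`, behind an optional Cargo feature.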
## API Design Pattern
Generic over float type from the start, since you'll want f32 for speed and f64 for numerical testing:
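```rust
use num_traits::Float;

/// Associative operator for a parallel scan over (a, b) pairs.
pub trait ScanOp<T: Float> {
    fn combine(&self, a: (T, T), b: (T, T)) -> (T, T);
}

/// Computes h_t = a[t] * h_{t-1} + b[t] chunk by chunk.
pub fn chunked_scan<T: Float>(
    a: &[T],           // decay/transition coefficients
    b: &[T],           // inputs
    chunk_size: usize, // block length for the intra-/inter-chunk split
) -> Vec<T> {
    todo!("intra-chunk compute + inter-chunk recurrence")
}

/// Segment sums used by the SSD chunked algorithm.
pub fn segsum<T: Float>(x: &[T]) -> Vec<T> { todo!() }
```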
Design rule: **functions take slices, not your model's tensor type.** Keep this library tensor-framework-agnostic — your attention and SSM crates each wrap these calls with their own tensor types (whether that's `candle`, `burn`, or raw `Vec<f32>`).
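One plausible reading of the `combine` signature above is composition of affine maps: each step `h -> a*h + b` is stored as the pair `(a, b)`, and combining two pairs yields the pair for "apply the first, then the second". The sketch below assumes that reading and uses plain `f64` slices, in line with the slices-only rule; it is an illustration, not the trait implementation.

```rust
/// One recurrence step h -> a*h + b, stored as (a, b).
pub type Affine = (f64, f64);

/// Compose "apply `first`, then `second`":
/// a2 * (a1*h + b1) + b2 = (a2*a1)*h + (a2*b1 + b2).
/// This operator is associative, which is what a parallel scan requires.
pub fn combine(first: Affine, second: Affine) -> Affine {
    (second.0 * first.0, second.0 * first.1 + second.1)
}

/// Inclusive scan under `combine`, done sequentially here for clarity.
/// With a zero initial state, the `b` component of each prefix is h_t.
pub fn scan_with_combine(a: &[f64], b: &[f64]) -> Vec<f64> {
    assert_eq!(a.len(), b.len(), "a and b must have equal length");
    let mut acc: Affine = (1.0, 0.0); // identity map h -> h
    a.iter()
        .zip(b)
        .map(|(&a_t, &b_t)| {
            acc = combine(acc, (a_t, b_t));
            acc.1
        })
        .collect()
}
```

Because `combine` is associative, the prefixes can be grouped in any order, which is exactly the freedom a Blelloch-style or chunked scan exploits.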
---
## Build Order (don't boil the ocean)
1. **Sequential scan + segsum** — correctness baseline, easiest to test against NumPy/PyTorch reference values
2. **Chunked scan** — the actual payoff; benchmark against #1
3. **Discretization (ZOH)** — needed before any real SSM math works
4. **Stable softmax/activations** — needed for attention side
5. **Structured matvecs** — once you know which SSM variant (S4 vs Mamba) you're targeting, since DPLR vs diagonal parameterization changes what you need here
6. **FFT conv** — only if you're doing S4-style long convolutions; Mamba doesn't need it
Want me to scaffold the actual `Cargo.toml` + module stubs as a starting repo, or work through the chunked-scan implementation first since that's the load-bearing piece?