numeris 0.5.0

Pure-Rust numerical algorithms library — high performance with SIMD support while also supporting no-std for embedded and WASM targets.
Documentation
# Design

Key architectural decisions in numeris.

## Column-Major Storage

`Matrix<T, M, N>` stores data as `[[T; M]; N]` — N columns of M rows. Element `(row, col)` is at `data[col][row]`.

This is **column-major** (Fortran / LAPACK order). The reasons:

1. **LAPACK compatibility**: all standard numerical algorithms (LU, QR, Cholesky, etc.) are formulated and optimized for column-major storage.
2. **SIMD inner loops**: Householder reflections, LU elimination, and BLAS-1 AXPY (`y += α·x`) operate on a single column — a contiguous block of M floats. `col_as_slice()` returns this directly, enabling SIMD dispatch with no gathering.
3. **Matmul**: the j-k-i loop order accesses A in column-major (contiguous i-access) and B by row (B[k,j] strides), matching the memory layout for the SIMD-vectorized i-loop.

`Matrix::new()` accepts **row-major** input (as written on paper) and transposes internally, so the API is natural. `DynMatrix::from_rows()` does the same.

## Stack Allocation and Const Generics

Fixed-size matrices are fully stack-allocated. There is no heap usage, no `Box`, no `Vec` — just `[[T; M]; N]` on the stack. This enables:

- **No-std / embedded**: works on targets with no heap allocator.
- **Zero-cost abstractions**: the compiler can see all dimensions at monomorphization time. Dead branches are eliminated. SIMD kernel dispatch via `TypeId` is a compile-time decision.
- **LLVM optimization**: the compiler knows the loop bounds and can unroll, vectorize, and schedule optimally.

Avoids `[T; M*N]` flat storage (which would require unstable `generic_const_exprs` for expressions like `const LEN: usize = M * N`).

## MatrixRef / MatrixMut Traits

All decomposition free functions are written against two simple accessor traits:

```rust
pub trait MatrixRef<T> {
    fn nrows(&self) -> usize;
    fn ncols(&self) -> usize;
    fn get(&self, row: usize, col: usize) -> T;
    fn col_as_slice(&self, col: usize) -> &[T];
}

pub trait MatrixMut<T>: MatrixRef<T> {
    fn get_mut(&mut self, row: usize, col: usize) -> &mut T;
    fn col_as_mut_slice(&mut self, col: usize) -> &mut [T];
}
```

Both `Matrix<T, M, N>` and `DynMatrix<T>` implement these traits. With the `nalgebra` feature, `nalgebra::SMatrix` and `nalgebra::DMatrix` also implement them. The consequence: **the same LU, Cholesky, QR, SVD, Eigen, and Schur code handles fixed, dynamic, and nalgebra matrices**. There are no separate implementations — just one set of free functions.

`col_as_slice()` / `col_as_mut_slice()` are the key methods: they return contiguous `&[T]` slices of individual columns, enabling SIMD dispatch to operate on contiguous data without gathering.

## LinalgScalar Hierarchy

Three element traits form a hierarchy:

```
Scalar          Copy + PartialEq + Debug + Zero + One + Num
  └─ FloatScalar    Scalar + Float + LinalgScalar<Real=Self>
  └─ LinalgScalar   Scalar + modulus + conj + re + lsqrt + lln + from_real
```

| Trait | Used by | Examples |
|---|---|---|
| `Scalar` | Matrix ops, arithmetic, iteration | `i32`, `u64`, `f32`, `f64`, `Complex<f64>` |
| `FloatScalar` | Quaternion, ODE, optim, estimate, ordered comparisons | `f32`, `f64` |
| `LinalgScalar` | Decompositions, norms | `f32`, `f64`, `Complex<f32>`, `Complex<f64>` |

**Integer matrices work**: `Matrix<i32, 3, 3>` supports arithmetic, indexing, iteration, and transpose. It does not support `det()`, norms, or decompositions (those require `LinalgScalar`).

**Complex is additive**: `LinalgScalar` methods on `f64` — `conj()`, `re()`, `modulus()` — are `#[inline]` identity functions, fully erased by the compiler. Complex support adds zero code to real-valued paths.

## In-Place Algorithms

Decompositions operate in-place on `&mut impl MatrixMut<T>`. This avoids:

- Allocator/storage traits (nalgebra's approach)
- Cloning the input (scipy's approach)
- Extra indirection through trait objects

The pattern:

```rust
pub fn cholesky_in_place<T: LinalgScalar>(
    a: &mut impl MatrixMut<T>,
) -> Result<(), LinalgError> {
    // modifies a in-place: a → L (lower triangular factor)
}
```

Wrapper structs (`CholeskyDecomposition`, etc.) own the modified matrix and provide `solve()`, `inverse()`, `det()` as convenience methods.

## SIMD Dispatch

SIMD is selected at monomorphization time via `TypeId`:

```rust
pub fn dot_dispatch<T: Scalar>(a: &[T], b: &[T]) -> T {
    use core::any::TypeId;
    if TypeId::of::<T>() == TypeId::of::<f64>() {
        // SAFETY: we just verified T = f64
        unsafe { f64_dot(transmute(a), transmute(b)) as T }
    } else if TypeId::of::<T>() == TypeId::of::<f32>() {
        unsafe { f32_dot(transmute(a), transmute(b)) as T }
    } else {
        scalar_dot(a, b)  // generic fallback for integers, complex, etc.
    }
}
```

The `TypeId` check is a compile-time constant — dead branches are eliminated by LLVM at monomorphization. For integer `T`, the entire SIMD path compiles away; for `f64`, only the f64 path remains.

The `Scalar` trait has a `'static` bound (required by `TypeId`). This is backwards-compatible — all scalar types are `'static`.

## DynMatrix Storage

`DynMatrix<T>` uses `Vec<T>` in column-major order: element `(row, col)` is at index `col * nrows + row`. Implementing `MatrixRef`/`MatrixMut` means all decomposition free functions work automatically, with no duplicate code.

`DynVector` is a newtype wrapper around `DynMatrix` that enforces the single-column invariant and provides single-index access.

## Avoiding Unstable Features

numeris uses only stable Rust (MSRV 1.70). Constraints:

- `generic_const_exprs` is unstable → no `[T; M*N]` flat storage
- `min_const_generics` (stable since 1.51) → `[[T; M]; N]` two-level storage works
- `core::arch` intrinsics for SIMD are stable on aarch64 and x86_64

## num-traits Integration

numeris uses `num-traits` with `default-features = false` for generic numeric bounds. This avoids pulling in `std` through `num-traits` in no-std mode. Key traits used:

- `Zero`, `One` — additive/multiplicative identity
- `Num` — basic arithmetic (Scalar bound)
- `Float` — transcendentals, `sqrt`, `sin`, etc. (FloatScalar bound)

When `std` is enabled, `Float` delegates to the system's hardware libm. Without `std`, it uses `libm`'s software implementations via the `libm` feature.