numeris 0.5.13

Pure-Rust numerical algorithms library — high performance with SIMD support while also supporting no-std for embedded and WASM targets.
Documentation
# Changelog

## 0.5.13

- **Review follow-ups: one less image pass and ~190 lines of duplication
  removed** — internal-only; results are arithmetically identical.
  - `imageproc`: `local_variance` (and `local_stddev` / `adaptive_threshold`
    built on it) no longer materializes a full squared copy of the image before
    building its second summed-area table; a fused `integral_image_with_squares`
    builds both tables in one pass over the source.
  - `interp`: the cubic Hermite basis evaluation and the natural-spline Thomas
    solve + Horner segment evaluation were duplicated between the fixed-size
    and `Dyn` variants; both now share module-level helpers
    (`hermite_segment[_with_derivative]`, `natural_spline_coeffs`,
    `cubic_eval[_with_derivative]`).
  - `stats`: the identical `sample_array` method was copy-pasted across all ten
    distributions; it is now generated by a small `impl_sample_array!` macro.
- **stats: parameter-only constants cached at construction** — internal-only;
  results are arithmetically identical. Distribution structs now precompute the
  parts of their (log-)density that depend only on the immutable parameters,
  instead of recomputing them on every `pdf` / `ln_pdf` / `pmf` call (which
  multiplies inside the Newton-based `quantile` solvers): `Gamma` caches
  `α·ln β − lgamma(α)`, `ChiSquared` caches `−(k/2)·ln 2 − lgamma(k/2)`,
  `StudentT` caches `lgamma((ν+1)/2) − lgamma(ν/2) − ½·ln(νπ)`, `Beta` caches
  `lbeta(α, β)`, `Binomial` caches `lgamma(n+1)`, `ln p`, `ln(1−p)`, `Poisson`
  caches `ln λ`, and `Normal` caches `σ·√(2π)` and `−ln σ − ½·ln(2π)`.
- **special: deduplicated incomplete-gamma core**`erf`/`erfc` previously
  carried a ~85-line copy of the `incgamma` series/continued-fraction
  implementation (Option-returning instead of Result-returning). The core is
  now a single `pub(crate)` Option-based routine in `incgamma.rs`; the public
  `gamma_inc` / `gamma_inc_upper` wrap it with domain checks, and `erf`/`erfc`
  call it directly.
- **Fixed: `special`/`stats` now actually build under no-std**  `gamma_fn.rs` and the Box-Muller sampler called inherent `f64` math methods
  (`TAU.sqrt()`, `.ln()`, `.cos()`), which only exist with `std`, so
  `--no-default-features --features special,stats` failed to compile. The
  derived constants √(2π) and ½·ln(2π) are now precomputed literals (also
  removing a per-call recomputation in `gamma`/`lgamma`), and the sampler uses
  fully-qualified `num_traits::Float` calls (libm-backed in no-std, identical
  inherent methods with `std`).
- **SIMD conjugated dot in linalg Householder loops** — internal-only. A new
  private `simd::dotc_dispatch` computes `Σ conj(aᵢ)·bᵢ`, forwarding to the
  SIMD `dot_dispatch` for `f32`/`f64` (where `conj` is the identity) with a
  scalar conjugating loop for complex elements. The column-oriented Householder
  reflection loops in QR (`qr_in_place`, `qr_col_pivot_in_place`, `q()`,
  `solve`), SVD bidiagonalization, and Hessenberg reduction previously computed
  `vᴴ·A[:,j]` with a scalar loop right next to a SIMD-dispatched AXPY; they now
  use the dispatched dot, and the redundant second `split_two_col_slices` call
  per column was removed by reordering the diagonal-element update after the
  AXPY. Slices shorter than 8 elements keep the inlined scalar loop — the
  out-of-line SIMD kernel call costs more than it saves there (a 4×4 QR
  regressed ~33% without the cutoff; with it, benchmarks show ~2× on 80×80
  dynamic QR and parity at 4×4/6×6).
- **Hot-loop performance cleanups** — internal-only; no API or behavior changes.
  - `estimate`: `Ukf`, `SrUkf`, and `Ckf` `update_gated` previously ran the full
    sigma-point measurement transform (Cholesky of P, sigma points, measurement
    transform, innovation covariance, Cholesky of S) to compute the NIS, then
    called `update`, which recomputed all of it from scratch. The transform is
    now factored into a shared private `measurement_transform` /
    `apply_update` pair used by both paths, roughly halving the cost of every
    accepted gated update (and deleting ~60 duplicated lines per filter).
    `predict` also reuses the new shared sigma-point generation.
  - `optim`: `least_squares_lm` computed `j.transpose()` twice per iteration
    (for `JᵀJ` and `Jᵀr`); it is now computed once.
    `finite_difference_jacobian_dyn` / `finite_difference_gradient_dyn` cloned
    the full state vector once per column; they now clone once and
    perturb/restore in place.
  - `ode`: the PID step-size controller coefficients (β₁, β₂, β₃) were rebuilt
    from `T::from(...)` conversions and divisions on every step in both the
    adaptive RK loop and the Rosenbrock loop; they are now hoisted above the
    step loop (along with `GAMMA_DIAG` in the Rosenbrock loop).
  - `imageproc`: `morphology_gradient`, `top_hat`, and `black_hat` replaced
    their per-element bounds-checked subtraction loops with the SIMD-dispatched
    slice subtraction already used by `DynMatrix` operators; `unsharp_mask`
    reuses the blurred buffer and iterates flat slices instead of per-element
    indexing (one less full-image allocation).
  - `dynmatrix`: by-value `Neg` negates in place instead of allocating a new
    backing vector.

## 0.5.12

- **`rayon` feature (opt-in parallelism)** — a new `rayon` Cargo feature adds
  multi-threaded parallelism on runtime-sized paths without disturbing the
  no-std-first baseline. It implies `std` and is purely additive: builds without
  it are byte-for-byte unchanged, and enabling it never alters an existing
  signature. Dispatch is hidden behind a private `par` module (mirroring `simd`).
  - First slice: new parallel finite-difference routines
    `finite_difference_jacobian_dyn_par` and `finite_difference_gradient_dyn_par`
    compute their columns across threads (each an independent evaluation of `f`),
    writing into disjoint slices so the result is identical regardless of thread
    count and equal to the sequential version. They require `Fn + Sync + Send`.
    Kept as *separate* functions (not a feature-gated change to the sequential
    `finite_difference_jacobian_dyn` / `_gradient_dyn`, which keep their `FnMut`
    signatures) so that enabling `rayon` — possibly transitively, via another
    crate — can never break a caller passing a `FnMut` closure. Parallelism
    engages only above a column threshold, biased toward the regime it helps (an
    expensive `f`); see `bench/fd_jacobian` and `bench/results.md` for the
    measured crossover (≈4× at n=256 with a moderately expensive evaluation).
  - Second slice: separable convolution (`convolve2d_separable`, and thus
    `gaussian_blur` / `box_blur` and everything built on them — `unsharp_mask`,
    `laplacian_of_gaussian`, `canny`, `harris_corners`, `shi_tomasi_corners`,
    `difference_of_gaussians`, `gaussian_pyramid`) computes its output columns in
    parallel. Each output column is a disjoint write, so results are unchanged.
    The fan-out decision is gated on per-pass *work* (`nrows · ncols · klen`),
    not column count, so it accounts for image height and kernel size — a small
    image stays sequential, a large one parallelizes (~2.6× at 512² for a small
    Gaussian; see `bench/convolve`).
  - Third slice: more `imageproc` per-column kernels — the rank / median filters
    (`rank_filter`, `percentile_filter`, `median_filter` and its 3×3/5×5 fast
    paths), `resize_bilinear`, and the local-statistics queries (`local_mean`,
    `local_variance`, `local_stddev`, and `adaptive_threshold` / `median_pool_upsampled`
    built on them). The median quickselect is the most expensive per-pixel work
    in imageproc — ~3.4–3.7× at 256² (see `bench/rank`). Fan-out is gated on
    per-pass work scaled by window area, and a shared `par::work_col_threshold`
    helper now backs all the imageproc gates. That helper also scales the work
    budget by `rayon::current_num_threads()` (normalized to the 8-core tuning
    machine), so the fan-out crossover adapts to the host core count — smaller
    inputs parallelize on a 2-core laptop, larger ones are required on a
    many-core server. The summed-area-table build under the local-stats queries
    stays sequential (it is a prefix-sum scan, a separate decomposition).
  - Fourth slice: morphology (`max_filter`/`min_filter`, `dilate`/`erode`,
    `opening`/`closing`, `morphology_gradient`, `top_hat`, `black_hat`). Under
    `rayon` the separable Van Herk filter runs the horizontal direction as a
    second *vertical* pass in transposed space (`out = (V(T(V(src))))ᵀ`), so both
    filter passes and both transposes parallelize over output columns while
    keeping O(1)-per-pixel cost — ~3.9–4.1× at 512²–1024² (`bench/morphology`).
    The change is `cfg`-split: the no-`rayon` build keeps the original lean
    sequential pass (two buffers + row scratch, no transposes), so the embedded
    baseline is unchanged; the two extra full-image allocations are paid only
    when `rayon` is enabled.
  - The `Send + Sync` requirement these parallel paths place on the element type
    is expressed through a hidden `MaybeSync` marker bound that is empty (a
    blanket impl for all types) unless `rayon` is enabled, so non-`rayon`
    signatures are unconstrained and `f32`/`f64` satisfy it automatically when it
    applies.
  - Fixed-size `Matrix` Jacobians and other small stack-allocated paths stay
    sequential by design.

- **MSRV corrected to 1.80.** The declared `rust-version` was `1.77`, but core
  matrix code uses `[T]::as_flattened` / `as_flattened_mut` (stabilized in Rust
  1.80), so the crate never actually built on 1.77 with a fresh lockfile. The
  declaration now reflects reality. (rayon also requires 1.80, so the feature
  does not raise the floor.)

- **Clippy clean + CI gate.** The codebase is now `clippy -D warnings` clean and
  a clippy job was added to CI. Incidental additions from the cleanup:
  `Vector::is_empty`, a `Default` impl for `BatchLsq`, and a `SmoothedStates<T, N>`
  type alias for the `rts_smooth` return type. `Pid::with_derivative_filter` now
  rejects a NaN time constant (the precondition assert was `!(tau < 0)`, which
  let NaN through; it is now `tau >= 0`).

## 0.5.11

- **Dynamic-dimension `optim` routines**`minimize_bfgs_dyn`,
  `least_squares_gn_dyn`, `least_squares_lm_dyn`,
  `finite_difference_gradient_dyn`, and `finite_difference_jacobian_dyn` mirror
  the fixed-size API but accept `DynVector<T>` / `DynMatrix<T>`, picking the
  parameter and residual dimensions at runtime. Settings structs
  (`BfgsSettings`, `GaussNewtonSettings`, `LmSettings`) and `OptimError` are
  shared with the fixed-size routines; results come back as
  `MinimizeResultDyn<T>` / `LeastSquaresResultDyn<T>`. The dynamic variants
  require the `alloc` feature; fixed-size routines remain no-alloc.

## 0.5.10

- **`imageproc::connected_components`** — connected-components labeling via two-pass SAUF
  (Scan + Array-based Union-Find), with path compression and union by rank. Operates on any
  `MatrixRef<T>` with foreground defined as `elem != background`; accepts 4- or 8-connectivity.
  Each `Component` reports `area`, inclusive bounding box, centroid, and central second moments
  (`mu20`, `mu02`, `mu11`). Two labels-image variants:
  - `connected_components_labeled` returns a `DynMatrix<u32>` (column-major, matches the rest
    of the imageproc API) for downstream masking.
  - `connected_components_with_label_buffer` returns a row-major flat `Vec<u32>` for
    cache-friendly downstream iteration when sweeping per-component bounding boxes in
    scan order.

## 0.5.9

- **`imageproc` feature** — 2D image processing on `DynMatrix<T>` buffers, column-major,
  `BorderMode`-aware (Zero / Constant / Replicate / Reflect), no-std with `alloc`. Convolution
  inner loops dispatch through the existing SIMD dot/AXPY kernels on contiguous column slices.
  - **Filtering**: `convolve2d` (any `MatrixRef` kernel), `convolve2d_separable`,
    `gaussian_blur`, `box_blur`, `unsharp_mask`, `laplacian`, `laplacian_of_gaussian`,
    `sobel_gradients`, `scharr_gradients`, `gradient_magnitude`.
  - **Order statistics**: `rank_filter`, `percentile_filter`, `median_filter` (quickselect
    with 3×3 and 5×5 stack-array fast paths), `median_filter_u16` (Huang sliding histogram,
    bit-exact vs. quickselect), `median_pool`, `median_pool_upsampled`.
  - **Morphology** (Van Herk - Gil-Werman, ~3 compares per pixel regardless of radius):
    `max_filter`, `min_filter`, `dilate`, `erode`, `opening`, `closing`, `morphology_gradient`,
    `top_hat`, `black_hat`.
  - **Local statistics** via integral image (O(1) per pixel): `integral_image`,
    `integral_rect_sum`, `local_mean`, `local_variance`, `local_stddev`.
  - **Multi-scale**: `difference_of_gaussians`, `gaussian_pyramid`.
  - **Thresholding**: `threshold`, `threshold_otsu` (256-bin between-class variance),
    `adaptive_threshold` (local mean + offset).
  - **Edges & corners**: `canny` (Gaussian → Sobel → NMS → hysteresis),
    `harris_corners`, `shi_tomasi_corners`.
  - **Geometric**: `flip_horizontal`, `flip_vertical`, `rotate_90` / `180` / `270`,
    `pad` (BorderMode-aware), `crop`, `resize_nearest`, `resize_bilinear` (precomputed
    per-axis tables, column-contiguous inner loop).
  - Kernel generators: `gaussian_kernel_1d`, `box_kernel_1d`, `sobel_x_3x3` / `sobel_y_3x3`,
    `scharr_x_3x3` / `scharr_y_3x3`, `laplacian_3x3` / `laplacian_3x3_diag`.
- **`DynMatrix::into_vec`** — zero-copy extraction of the underlying column-major `Vec<T>`
  by moving ownership out. Useful for recovering an owned pixel buffer after chaining
  through numeric routines.

## 0.5.7

- **Binary search interpolation** — dense output `interpolate` now uses `partition_point`
  for O(log n) step lookup instead of O(n) linear scan.
- **`interpolate_batch`** — new method for interpolating at multiple sorted time points
  in a single O(n+m) pass, avoiding repeated binary searches.

## 0.5.6

- **`Quaternion::rotation_between(a, b)`** — shortest-arc unit quaternion that rotates
  vector `a` to vector `b`. Normalizes inputs, handles parallel and anti-parallel cases.

## 0.5.5

- **`serde` feature** — optional serialization/deserialization for all types:
  - `Matrix<T, M, N>` serializes as row-major `[[1,2],[3,4]]` (matches `Matrix::new()` input).
  - `Vector<T, N>` serializes as flat `[1,2,3]`.
  - `Quaternion<T>` serializes as `{"w":1.0,"x":0.0,"y":0.0,"z":0.0}`.
  - `DynMatrix<T>` serializes as `{"nrows":2,"ncols":3,"data":[[1,2,3],[4,5,6]]}`.
  - `DynVector<T>` serializes as flat `[1,2,3]`.
  - `Solution` and `DenseOutput` also serializable.
- **Improved `Display` for `Matrix`**:
  - Precision formatting: `format!("{:.2}", m)` works.
  - Vectors display as compact `[1, 2, 3]` instead of columnar.
  - Padding inside `` borders for readability.

## 0.5.4

- **`From<[T; N]>` for `Vector`**`let v: Vector<f64, 3> = [1.0, 2.0, 3.0].into()`.
- **`Eq` and `Hash` for `Matrix`** — enables integer matrices as HashMap keys.
- **`Default` for `Matrix`** — returns the zero matrix.
- **`AsRef<[T]>` / `AsMut<[T]>` for `Matrix`** — view as flat column-major slice.
- **`solve_matrix` on `CholeskyDecomposition` and `QrDecomposition`** — multi-RHS solve.
- **`Quaternion::from_axis_angle` normalizes axis** — no longer requires pre-normalized input.

## 0.5.3

- **`Solution` and `DenseOutput` derive `Debug`**.
- **`DynVector` iterators**: `iter()`, `iter_mut()`, `IntoIterator` impls.
- **`DynVector::from_fn(n, |i| ...)`** constructor.
- **`Matrix::t()` / `DynMatrix::t()`** shorthand for `transpose()`.
- **`LuDecomposition::solve_matrix(&b)`** — multi-RHS solve (`AX = B`).
- **`PartialOrd` / `Ord` / `Eq` for `Vector`** — lexicographic ordering.

## 0.5.2

- **`DynVector` is now N×1** (column vector), matching `Vector<T, N>` = `Matrix<T, N, 1>`.
  Previously stored as 1×N internally — `nrows()` and `ncols()` now return the expected values.
- **`DynVector::zeros`** no longer takes a dummy parameter: `DynVector::<f64>::zeros(n)`.
- **`vecmul` removed** — use `a * v` (standard `Mul` trait) instead. `Vector` is now N×1 so
  `Matrix<T,M,N> * Vector<T,N>` works directly via matrix multiplication.
- **`vector!` macro** doc corrected: creates a column vector (N×1), not a row vector.

## 0.5.1

- **`DynMatrix::zeros`** no longer takes a dummy type-inference parameter.
  Use `DynMatrix::<f64>::zeros(m, n)` instead of `DynMatrix::zeros(m, n, 0.0_f64)`.
- **`DynMatrix::eye`** same change: `DynMatrix::<f64>::eye(n)` instead of `DynMatrix::eye(n, 0.0_f64)`.

## 0.5.0

### Breaking Changes

- **Vector is now a column vector**: `Vector<T, N>` = `Matrix<T, N, 1>` (was `Matrix<T, 1, N>`).
  This matches nalgebra's convention. Single-index access `v[i]` still works.
- **`ColumnVector` removed**: Use `Vector` everywhere. `from_column()` replaced by `from_array()`.
- **`ColumnVector1``ColumnVector6` aliases removed**: Use `Vector1``Vector6`.
- **ODE `Solution` type**: Now `Solution<T, M, N>` (was `Solution<T, S>`), parameterized by
  matrix dimensions instead of a single size.
- **MSRV bumped to 1.77** (from 1.70).

### New Features

- **nalgebra interop** (`nalgebra` cargo feature):
  - `From`/`Into` conversions between `Matrix<T,M,N>` and `nalgebra::SMatrix<T,M,N>` (owned + ref).
  - `From`/`Into` for `DynMatrix<T>` / `DynVector<T>` and `nalgebra::DMatrix<T>` / `DVector<T>`.
  - `MatrixRef` / `MatrixMut` implemented for `nalgebra::SMatrix` and `DMatrix`, so numeris
    linalg free functions (`lu_in_place`, `cholesky_in_place`, `qr_in_place`, etc.) work
    directly on nalgebra matrices.
- **Matrix ODE integration**: `rk4`, `rk4_step`, and all `RKAdaptive` solvers now accept
  `Matrix<T, M, N>` as state (not just vectors), enabling matrix ODE integration
  (e.g., state transition matrices, matrix Riccati equations). Rosenbrock solvers remain
  vector-only due to Jacobian requirements.
- **`scaled_norm` generalized**: Moved from `Vector<T, N>` to `Matrix<T, M, N>`, using
  `frobenius_norm() / sqrt(M * N)`.

### Migration Guide

Replace `ColumnVector` with `Vector` and `from_column` with `from_array`:

```rust
// Before (0.4)
use numeris::ColumnVector;
let x = ColumnVector::from_column([1.0, 2.0]);
let val = x[(0, 0)];

// After (0.5)
use numeris::Vector;
let x = Vector::from_array([1.0, 2.0]);
let val = x[0];
```

ODE solution types gain a second dimension parameter:

```rust
// Before: Solution<T, S>
// After:  Solution<T, M, N>  (vectors are Solution<T, S, 1>)
```

## 0.4.0

- Numerical stability improvements and robustness guards.
- Spline diagonal check, matrix exponential overflow clamp, Cholesky SIMD scaling.

## 0.3.0

- Added `quad` module (Gauss-Legendre, adaptive Simpson, composite rules).
- Added `stats` sampling via xoshiro256++ RNG.
- Added lead/lag compensators, PID tuning (Ziegler-Nichols, Cohen-Coon, SIMC).
- Added `matrix!`/`vector!` macros, `prelude` module.

## 0.2.0

- Initial public release on crates.io.
- Matrix, linalg, quaternion, ODE, dynmatrix, interp, optim, estimate, control, special, stats modules.
- SIMD acceleration (NEON, SSE2, AVX, AVX-512).