# numeris

Pure-Rust numerical algorithms library, no-std compatible. Similar in scope to SciPy: usable on embedded targets (no heap allocation, no FPU assumptions) while remaining fast on desktop/server hardware via SIMD intrinsics.

> [!NOTE]
> Alpha software — APIs are unstable and may change without notice.
## Features
- **Fixed-size matrices** — stack-allocated, const-generic `Matrix<T, M, N>`
- **Dynamic matrices** — heap-allocated `DynMatrix<T>` with runtime dimensions (optional `alloc` feature)
- **Linear algebra** — LU, Cholesky, QR, SVD decompositions; symmetric eigendecomposition; real Schur decomposition
- **ODE integration** — fixed-step RK4, 7 adaptive Runge-Kutta solvers with dense output, RODAS4 stiff solver
- **Optimization** — root finding (Brent, Newton), BFGS minimization, Gauss-Newton and Levenberg-Marquardt least squares
- **Complex number support** — all decompositions work with `Complex<f32>` / `Complex<f64>` (optional feature)
- **State estimation** — EKF, UKF, Square-Root UKF, Cubature Kalman Filter, RTS smoother, batch least-squares
- **Interpolation** — linear, Hermite, barycentric Lagrange, natural cubic spline, bilinear (2D)
- **Special functions** — gamma, lgamma, digamma, beta, incomplete gamma/beta, erf/erfc
- **Statistical distributions** — Normal, Uniform, Exponential, Gamma, Beta, Chi-squared, Student's t, Bernoulli, Binomial, Poisson
- **Quaternions** — unit quaternion rotations, SLERP, Euler angles, rotation matrices
- **Norms** — L1, L2, Frobenius, infinity, and one norms
- **SIMD acceleration** — NEON (aarch64), SSE2/AVX/AVX-512 (x86_64) intrinsics for matmul, dot products, and element-wise ops; zero-cost scalar fallback for integers and unsupported architectures
- **No-std / embedded** — runs without `std` or heap; float math falls back to software `libm`
## Quick start

```toml
[dependencies]
numeris = "0.3"
```

```rust
use numeris::{Matrix2, Matrix3, Matrix3x2, Vector3};

// Solve a linear system Ax = b
let a = Matrix3::new([[2.0, 1.0, 1.0], [1.0, 3.0, 2.0], [1.0, 0.0, 0.0]]);
let b = Vector3::from_array([6.0, 9.0, 2.0]);
let x = a.solve(&b).unwrap(); // x = [2, 3, -1]

// Cholesky decomposition of a symmetric positive-definite matrix
let spd = Matrix3::new([[4.0, 2.0, 0.0], [2.0, 3.0, 1.0], [0.0, 1.0, 2.0]]);
let chol = spd.cholesky().unwrap();
let inv = chol.inverse();

// QR decomposition and least-squares
let a = Matrix3x2::new([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]);
let b = Vector3::from_array([1.0, 2.0, 3.0]);
let x = a.qr().unwrap().solve(&b); // least-squares fit

// Symmetric eigendecomposition
let sym = Matrix3::new([[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]);
let eig = sym.eig_symmetric().unwrap();
let eigenvalues = eig.eigenvalues; // sorted ascending
let eigenvectors = eig.eigenvectors; // columns = eigenvectors

// General eigenvalues via Schur decomposition
let a = Matrix2::new([[0.0, -1.0], [1.0, 0.0]]); // 90° rotation
let eigs = a.eigenvalues().unwrap(); // eigenvalues ±i

// Quaternion rotation
use numeris::Quaternion;
let q = Quaternion::from_axis_angle(Vector3::from_array([0.0, 0.0, 1.0]), core::f64::consts::FRAC_PI_2);
let v = Vector3::from_array([1.0, 0.0, 0.0]);
let rotated = q * v; // [0, 1, 0]
```
## No-std / embedded

numeris works on bare-metal targets with no allocator:

```toml
[dependencies]
numeris = { version = "0.3", default-features = false, features = ["libm"] }
```

```rust
use numeris::{Matrix3, Vector3};

// All fixed-size operations work without std or alloc
let a = Matrix3::new([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]);
let b = Vector3::from_array([1.0, 2.0, 3.0]);
let x = a.solve(&b).unwrap();

// Eigendecomposition on a microcontroller
let sym = Matrix3::new([[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]);
let eig = sym.eig_symmetric().unwrap();
```

Add `alloc` for `DynMatrix` on targets with a heap but no OS:

```toml
numeris = { version = "0.3", default-features = false, features = ["libm", "alloc"] }
```
## Dynamic matrices

When dimensions aren't known at compile time, use `DynMatrix` (requires `alloc`, included with the default `std` feature):

```rust
use numeris::{DynMatrixf64, DynVectorf64};

// Runtime-sized matrix
let a = DynMatrixf64::from_rows(2, 2, &[2.0, 1.0, 1.0, 3.0]);
let b = DynVectorf64::from_slice(&[3.0, 5.0]);
let x = a.solve(&b).unwrap();

// Convert between fixed and dynamic
use numeris::Matrix2;
let fixed = Matrix2::new([[1.0, 2.0], [3.0, 4.0]]);
let dynamic: DynMatrixf64 = fixed.into();
let back: Matrix2<f64> = (&dynamic).try_into().unwrap();

// Mixed arithmetic
let result = fixed * &dynamic; // Matrix * DynMatrix → DynMatrix
```

Type aliases: `DynMatrixf64`, `DynMatrixf32`, `DynVectorf64`, `DynVectorf32`, `DynMatrixz64` (complex, requires the `complex` feature), etc.
## ODE integration

Fixed-step RK4 and 7 adaptive Runge-Kutta solvers with embedded error estimation and dense output:

```rust
use numeris::ode::{RKAdaptive, RKTS54};
use numeris::Vector;
use core::f64::consts::PI;

// Harmonic oscillator y'' = -y as the first-order system [y, y']
let y0 = Vector::from_array([1.0, 0.0]);
let tau = 2.0 * PI;
let sol = RKTS54::integrate(0.0, tau, &y0, |_t, y| Vector::from_array([y[1], -y[0]])).unwrap();
assert!((sol.y[0] - 1.0).abs() < 1e-6); // cos(2π) ≈ 1
```
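For reference, the classic fixed-step RK4 update that the adaptive solvers generalize can be sketched in plain Rust, standalone and independent of the crate's API:

```rust
// Classic RK4 for y' = f(t, y), sketched for a 2-state system.
// State is [y, y'] for the harmonic oscillator y'' = -y.
fn f(_t: f64, y: [f64; 2]) -> [f64; 2] {
    [y[1], -y[0]]
}

fn rk4_step(t: f64, y: [f64; 2], h: f64) -> [f64; 2] {
    let add = |a: [f64; 2], b: [f64; 2], s: f64| [a[0] + s * b[0], a[1] + s * b[1]];
    let k1 = f(t, y);
    let k2 = f(t + 0.5 * h, add(y, k1, 0.5 * h));
    let k3 = f(t + 0.5 * h, add(y, k2, 0.5 * h));
    let k4 = f(t + h, add(y, k3, h));
    [
        y[0] + h / 6.0 * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]),
        y[1] + h / 6.0 * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]),
    ]
}

fn integrate(mut y: [f64; 2], t0: f64, t1: f64, n: usize) -> [f64; 2] {
    let h = (t1 - t0) / n as f64;
    let mut t = t0;
    for _ in 0..n {
        y = rk4_step(t, y, h);
        t += h;
    }
    y
}

fn main() {
    let tau = 2.0 * std::f64::consts::PI;
    // One full period of the oscillator returns to the initial state
    let y = integrate([1.0, 0.0], 0.0, tau, 1000);
    assert!((y[0] - 1.0).abs() < 1e-6);
    assert!(y[1].abs() < 1e-6);
}
```

The adaptive solvers add an embedded lower-order estimate to this stepping scheme and adjust `h` to keep the local error within tolerance.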
### Explicit solvers

| Solver | Stages | Order | FSAL | Interpolant |
|---|---|---|---|---|
| `RKF45` | 6 | 5(4) | no | — |
| `RKTS54` | 7 | 5(4) | yes | 4th degree |
| `RKV65` | 10 | 6(5) | no | 6th degree |
| `RKV87` | 13+4 | 8(7) | no | 7th degree |
| `RKV98` | 16+5 | 9(8) | no | 8th degree |
| `RKV98NoInterp` | 16 | 9(8) | no | — |
| `RKV98Efficient` | 16+10 | 9(8) | no | 9th degree |
### Stiff solvers (Rosenbrock)

For stiff ODEs (chemical kinetics, circuit simulation, orbital mechanics with drag), use RODAS4 — a linearly-implicit Rosenbrock method that solves linear systems involving the Jacobian instead of running nonlinear Newton iterations:

```rust
use numeris::ode::{Rosenbrock, RODAS4};
use numeris::{Matrix1, Vector1};

// Stiff decay: y' = -1000*y, y(0) = 1
let y0 = Vector1::from_array([1.0]);
let sol = RODAS4::integrate(
    0.0,
    0.1,
    &y0,
    |_t, y| Vector1::from_array([-1000.0 * y[0]]),
    |_t, _y| Matrix1::new([[-1000.0]]),
).unwrap();
// Or use integrate_auto for an automatic finite-difference Jacobian
```
| Solver | Stages | Order | L-stable |
|---|---|---|---|
| `RODAS4` | 6 | 4(3) | yes |
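To see why y' = -1000y counts as stiff: an explicit method's step size is limited by stability rather than accuracy. A standalone sketch with explicit Euler (update y ← y(1 + hλ)) shows the blow-up once |1 + hλ| exceeds 1, which an L-stable method like RODAS4 avoids:

```rust
// Explicit Euler on the stiff decay y' = -1000*y, y(0) = 1.
// The update is y <- y * (1 + h*lambda); it is stable only when
// |1 + h*lambda| <= 1, i.e. h <= 2/1000 here, regardless of accuracy needs.
fn euler_final(h: f64, steps: usize) -> f64 {
    let lambda = -1000.0;
    let mut y = 1.0_f64;
    for _ in 0..steps {
        y += h * lambda * y;
    }
    y
}

fn main() {
    // h = 0.0005: |1 + h*lambda| = 0.5 -> decays toward 0, like the true solution
    assert!(euler_final(0.0005, 200).abs() < 1e-10);

    // h = 0.003: |1 + h*lambda| = 2 -> each step doubles the magnitude
    assert!(euler_final(0.003, 200).abs() > 1e10);
}
```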
## Optimization

Root finding, unconstrained minimization, and nonlinear least squares (requires the `optim` feature):

```toml
[dependencies]
numeris = { version = "0.3", features = ["optim"] }
```

```rust
use numeris::optim::{brent, least_squares_lm, minimize_bfgs};
use numeris::{Vector2, Vector4};

// Root finding: solve x² - 2 = 0
let root = brent(|x| x * x - 2.0, 0.0, 2.0).unwrap();
assert!((root - 2.0_f64.sqrt()).abs() < 1e-12);

// BFGS minimization: minimize (x-1)² + (y-2)²
let min = minimize_bfgs(
    |x: &Vector2<f64>| (x[0] - 1.0).powi(2) + (x[1] - 2.0).powi(2),
    &Vector2::from_array([0.0, 0.0]),
).unwrap();
assert!((min.x[0] - 1.0).abs() < 1e-6);

// Levenberg-Marquardt: fit y = a * exp(b * x)
let t = [0.0, 1.0, 2.0, 3.0];
let y = [2.0, 2.7, 3.7, 5.0];
let fit = least_squares_lm(
    |p: &Vector2<f64>| Vector4::from_array([
        p[0] * (p[1] * t[0]).exp() - y[0],
        p[0] * (p[1] * t[1]).exp() - y[1],
        p[0] * (p[1] * t[2]).exp() - y[2],
        p[0] * (p[1] * t[3]).exp() - y[3],
    ]),
    &Vector2::from_array([1.0, 0.1]),
).unwrap();
assert!((fit.x[0] - 2.0).abs() < 0.1);
```
| Algorithm | Function | Use case |
|---|---|---|
| Brent's method | `brent` | Bracketed scalar root finding |
| Newton's method | `newton_1d` | Scalar root finding with derivative |
| BFGS | `minimize_bfgs` | Unconstrained minimization |
| Gauss-Newton | `least_squares_gn` | Nonlinear least squares (QR-based) |
| Levenberg-Marquardt | `least_squares_lm` | Nonlinear least squares (damped) |
Finite-difference utilities: `finite_difference_gradient` and `finite_difference_jacobian` for when analytical derivatives aren't available.
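The central-difference scheme behind such utilities is straightforward; a standalone sketch (the function name here is illustrative, not the crate's API):

```rust
// Central-difference gradient: df/dx_i ≈ (f(x + h·e_i) - f(x - h·e_i)) / (2h).
// O(h²) accurate; h ~ cbrt(machine epsilon) balances truncation vs. rounding error.
fn fd_gradient<const N: usize>(f: impl Fn(&[f64; N]) -> f64, x: &[f64; N]) -> [f64; N] {
    let h = f64::EPSILON.cbrt(); // ≈ 6e-6
    let mut g = [0.0; N];
    for i in 0..N {
        let (mut xp, mut xm) = (*x, *x);
        xp[i] += h;
        xm[i] -= h;
        g[i] = (f(&xp) - f(&xm)) / (2.0 * h);
    }
    g
}

fn main() {
    // f(x, y) = (x-1)² + (y-2)², analytic gradient (2(x-1), 2(y-2))
    let f = |x: &[f64; 2]| (x[0] - 1.0).powi(2) + (x[1] - 2.0).powi(2);
    let g = fd_gradient(f, &[3.0, 0.0]);
    assert!((g[0] - 4.0).abs() < 1e-6); // exact value 4
    assert!((g[1] + 4.0).abs() < 1e-6); // exact value -4
}
```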
## Digital IIR filters

Biquad cascade IIR filters with Butterworth and Chebyshev Type I design (requires the `control` feature):

```toml
[dependencies]
numeris = { version = "0.3", features = ["control"] }
```

```rust
use numeris::control::{butterworth_lowpass, chebyshev1_lowpass, BiquadCascade};

// 4th-order Butterworth lowpass at 1 kHz, 8 kHz sample rate (N=2 biquad sections)
let mut lpf: BiquadCascade<f64, 2> = butterworth_lowpass(1000.0, 8000.0).unwrap();
let y = lpf.tick(1.0); // filter one sample

// Bulk processing
let input = [0.5; 64];
let mut output = [0.0; 64];
lpf.reset();
lpf.process(&input, &mut output);

// Chebyshev Type I: steeper rolloff with 1 dB passband ripple
let mut cheb: BiquadCascade<f64, 2> = chebyshev1_lowpass(1000.0, 8000.0, 1.0).unwrap();
```
| Design | Lowpass | Highpass |
|---|---|---|
| Butterworth | `butterworth_lowpass` | `butterworth_highpass` |
| Chebyshev Type I | `chebyshev1_lowpass` | `chebyshev1_highpass` |
All filters are no-std compatible, use no complex number arithmetic, and work with both `f32` and `f64`.
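The bilinear transform that underpins these designs maps an analog prototype H(s) to H(z) via s = (2/T)(1 − z⁻¹)/(1 + z⁻¹). A standalone sketch of the simplest case, a one-pole lowpass ωc/(s + ωc), shows the resulting difference equation (illustrative only, not the crate's internals):

```rust
// One-pole analog lowpass H(s) = wc/(s + wc), discretized with the bilinear
// transform s = (2/T)(1 - z^-1)/(1 + z^-1). The result is
//   y[n] = b0*(x[n] + x[n-1]) - a1*y[n-1]
// with b0 = wc/(c + wc), a1 = (wc - c)/(c + wc), c = 2/T; DC gain is exactly 1.
struct OnePole {
    b0: f64,
    a1: f64,
    x1: f64,
    y1: f64,
}

impl OnePole {
    fn new(cutoff_hz: f64, fs: f64) -> Self {
        let wc = 2.0 * std::f64::consts::PI * cutoff_hz;
        let c = 2.0 * fs; // 2/T
        OnePole { b0: wc / (c + wc), a1: (wc - c) / (c + wc), x1: 0.0, y1: 0.0 }
    }

    fn tick(&mut self, x: f64) -> f64 {
        let y = self.b0 * (x + self.x1) - self.a1 * self.y1;
        self.x1 = x;
        self.y1 = y;
        y
    }
}

fn main() {
    let mut lpf = OnePole::new(1000.0, 8000.0);
    // Feed a constant: the output settles at the DC gain of 1
    let mut y = 0.0;
    for _ in 0..1000 {
        y = lpf.tick(1.0);
    }
    assert!((y - 1.0).abs() < 1e-9);
}
```

Butterworth and Chebyshev designs apply the same transform (with frequency prewarping) to second-order analog sections, yielding the biquad cascade.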
## PID controller

Discrete-time PID controller with trapezoidal integration, derivative-on-measurement (no derivative kick), an optional derivative low-pass filter, and anti-windup via back-calculation:

```rust
use numeris::control::Pid;

// PID controller at 1 kHz with output clamping
let mut pid = Pid::new(2.0, 1.0, 0.05, 0.001)
    .with_output_limits(-10.0, 10.0)
    .with_derivative_filter(0.1); // first-order LPF on derivative

// Control loop against a toy first-order plant
let setpoint = 1.0;
let mut measurement = 0.0;
for _ in 0..1000 {
    let u = pid.update(setpoint, measurement);
    measurement += 0.001 * (u - measurement);
}
assert!((measurement - setpoint).abs() < 0.1);
```
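For reference, the update law described above (trapezoidal integrator, derivative taken on the measurement rather than the error) can be sketched standalone; all names here are illustrative, not the crate's API:

```rust
// Discrete PID: u = Kp*e + Ki*∫e dt + Kd*d(-y)/dt, with trapezoidal
// integration and the derivative on the measurement y, so a setpoint
// step produces no derivative kick.
struct PidSketch {
    kp: f64, ki: f64, kd: f64, dt: f64,
    integral: f64, prev_err: f64, prev_meas: f64,
}

impl PidSketch {
    fn update(&mut self, setpoint: f64, meas: f64) -> f64 {
        let err = setpoint - meas;
        // Trapezoidal rule: average of current and previous error
        self.integral += 0.5 * (err + self.prev_err) * self.dt;
        // Derivative on measurement: -dy/dt instead of de/dt
        let deriv = -(meas - self.prev_meas) / self.dt;
        self.prev_err = err;
        self.prev_meas = meas;
        self.kp * err + self.ki * self.integral + self.kd * deriv
    }
}

fn main() {
    let mut pid = PidSketch { kp: 2.0, ki: 1.0, kd: 0.05, dt: 0.001,
                              integral: 0.0, prev_err: 0.0, prev_meas: 0.0 };
    // Drive a first-order plant dy/dt = u - y toward setpoint 1.0
    let mut y = 0.0;
    for _ in 0..40_000 {
        let u = pid.update(1.0, y);
        y += 0.001 * (u - y);
    }
    assert!((y - 1.0).abs() < 1e-3); // integral action removes steady-state error
}
```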
## State estimation

Six estimators for nonlinear state estimation and offline batch processing (requires the `estimate` feature):

```toml
[dependencies]
numeris = { version = "0.3", features = ["estimate"] }
```

```rust
use numeris::estimate::{BatchLsq, Ckf, Ekf, Ukf};
use numeris::{ColumnVector1, ColumnVector2, Matrix1, Matrix1x2, Matrix2};

// EKF: 2-state constant-velocity, 1 measurement (position)
let x0 = ColumnVector2::from_column([0.0, 0.0]); // [pos, vel]
let p0 = Matrix2::new([[1.0, 0.0], [0.0, 1.0]]);
let mut ekf = Ekf::new(x0, p0);
let dt = 0.1;
let q = Matrix2::new([[1e-4, 0.0], [0.0, 1e-3]]);
let r = Matrix1::new([[0.01]]);

let f = |x: &ColumnVector2<f64>| ColumnVector2::from_column([x[0] + dt * x[1], x[1]]);
let f_jac = |_x: &ColumnVector2<f64>| Matrix2::new([[1.0, dt], [0.0, 1.0]]);
let h = |x: &ColumnVector2<f64>| ColumnVector1::from_column([x[0]]);
let h_jac = |_x: &ColumnVector2<f64>| Matrix1x2::new([[1.0, 0.0]]);
let z = ColumnVector1::from_column([1.2]);

// Predict with dynamics model + Jacobian
ekf.predict(f, f_jac, Some(&q));
// Update with position measurement
ekf.update(&z, h, h_jac, &r).unwrap();

// UKF: no Jacobians needed — uses sigma points
let mut ukf = Ukf::new(x0, p0, Default::default());
ukf.predict(f, Some(&q)).unwrap();
ukf.update(&z, h, &r).unwrap();

// CKF: tuning-free sigma-point filter (2N points, equal weights)
let mut ckf = Ckf::new(x0, p0);
ckf.predict(f, Some(&q)).unwrap();

// Batch least-squares: accumulate linear observations
let mut lsq = BatchLsq::new();
let h_mat = Matrix1x2::new([[1.0, 0.0]]);
let r_lsq = Matrix1::new([[0.01]]);
lsq.add_observation(&z, &h_mat, &r_lsq).unwrap();
lsq.add_observation(&ColumnVector1::from_column([1.3]), &h_mat, &r_lsq).unwrap();
let (x_hat, p_hat) = lsq.solve().unwrap();
```
| Filter | Struct | Jacobians | No-std |
|---|---|---|---|
| Extended Kalman | `Ekf<T, N, M>` | User-supplied or finite-difference | Yes |
| Unscented Kalman | `Ukf<T, N, M>` | Not needed (sigma points) | No (`alloc`) |
| Square-Root UKF | `SrUkf<T, N, M>` | Not needed (propagates Cholesky factor) | No (`alloc`) |
| Cubature Kalman | `Ckf<T, N, M>` | Not needed (2N cubature points) | No (`alloc`) |
| RTS Smoother | `rts_smooth()` | Stored from EKF forward pass | No (`alloc`) |
| Batch Least-Squares | `BatchLsq<T, N>` | Linear H matrix | Yes |
All support `f32` and `f64`. Process noise `Q` is optional (pass `None` for zero noise).
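The predict/update cycle shared by all of these filters is easiest to see in the scalar (1-state, 1-measurement) case; a standalone sketch:

```rust
// Scalar Kalman filter: state x with variance p, process noise q,
// measurement z = x + v with noise variance r.
struct ScalarKf {
    x: f64,
    p: f64,
}

impl ScalarKf {
    fn predict(&mut self, f: f64, q: f64) {
        self.x = f * self.x;         // propagate state through the model
        self.p = f * self.p * f + q; // P <- F P Fᵀ + Q
    }

    fn update(&mut self, z: f64, r: f64) {
        let k = self.p / (self.p + r); // gain K = P Hᵀ (H P Hᵀ + R)⁻¹ with H = 1
        self.x += k * (z - self.x);    // correct with the innovation
        self.p *= 1.0 - k;             // P <- (I - K H) P
    }
}

fn main() {
    let mut kf = ScalarKf { x: 0.0, p: 1.0 };
    // Static truth x = 1.0 observed repeatedly
    for _ in 0..50 {
        kf.predict(1.0, 1e-6);
        kf.update(1.0, 0.01);
    }
    assert!((kf.x - 1.0).abs() < 1e-3);
    assert!(kf.p < 0.01); // variance shrinks as evidence accumulates
}
```

The EKF linearizes `f` and `h` with Jacobians at each step; the sigma-point filters (UKF, SR-UKF, CKF) instead propagate a deterministic sample set through the nonlinear models.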
## Complex matrices

Enable the `complex` feature to use decompositions with complex elements:

```toml
[dependencies]
numeris = { version = "0.3", features = ["complex"] }
```

```rust
use num_complex::Complex;
use numeris::{Matrix2, Vector2};

type C = Complex<f64>;

let a = Matrix2::new([[C::new(2.0, 0.0), C::new(0.0, 1.0)], [C::new(0.0, -1.0), C::new(3.0, 0.0)]]);
let b = Vector2::from_array([C::new(1.0, 0.0), C::new(0.0, 1.0)]);
let x = a.solve(&b).unwrap();

// Hermitian positive-definite Cholesky (A = L * L^H)
let hpd = Matrix2::new([[C::new(4.0, 0.0), C::new(1.0, 1.0)], [C::new(1.0, -1.0), C::new(3.0, 0.0)]]);
let chol = hpd.cholesky().unwrap();
```

Complex support adds zero overhead to real-valued code paths. The `LinalgScalar` trait methods (`modulus`, `conj`, etc.) are `#[inline]` identity functions for `f32`/`f64`, fully erased by the compiler.
## Cargo features

| Feature | Default | Description |
|---|---|---|
| `std` | yes | Implies `alloc`. Uses hardware FPU via system libm. |
| `alloc` | via `std` | Enables `DynMatrix` / `DynVector` (heap-allocated, runtime-sized). |
| `ode` | yes | ODE integration (RK4, adaptive solvers). |
| `optim` | no | Optimization (root finding, BFGS, Gauss-Newton, LM). |
| `control` | no | Digital IIR filters, PID controller, lead/lag compensators, PID tuning. |
| `estimate` | no | State estimation (EKF, UKF, SR-UKF, CKF, RTS, batch LSQ). Implies `alloc`. |
| `interp` | no | Interpolation (linear, Hermite, barycentric Lagrange, cubic spline, bilinear). |
| `special` | no | Special functions (gamma, beta, erf, incomplete gamma/beta, digamma). |
| `stats` | no | Statistical distributions (Normal, Gamma, Beta, etc.). Implies `special`. |
| `libm` | baseline | Pure-Rust software float math. Always available as fallback. |
| `complex` | no | Adds `Complex<f32>` / `Complex<f64>` support via num-complex. |
| `all` | no | All features: `std` + `ode` + `optim` + `control` + `estimate` + `interp` + `special` + `stats` + `complex`. |
```toml
# Default (std + ode)
numeris = "0.3"

# All features
numeris = { version = "0.3", features = ["all"] }

# No-std for embedded
numeris = { version = "0.3", default-features = false, features = ["libm"] }

# No-std with dynamic matrices
numeris = { version = "0.3", default-features = false, features = ["libm", "alloc"] }

# With optimization and complex support
numeris = { version = "0.3", features = ["optim", "complex"] }
```
## Benchmarks

Measured on Apple Silicon (aarch64, NEON) with Criterion, compared against nalgebra and faer. Run with `cd bench && cargo bench`.
| Benchmark | numeris | nalgebra | faer |
|---|---|---|---|
| matmul 4x4 | 4.9 ns | 4.9 ns | 58 ns |
| matmul 6x6 | 13.4 ns | 20.0 ns | 87 ns |
| matmul 50x50 (dyn) | 5.76 us | 6.63 us | 6.3 us |
| matmul 200x200 (dyn) | 369 us | 361 us | 193 us |
| dot 100 (dyn) | 11.6 ns | 14.5 ns | — |
| LU 4x4 | 33.2 ns | 28.2 ns | 203 ns |
| LU 6x6 | 84.7 ns | 82.1 ns | 292 ns |
| QR 4x4 | 46.4 ns | 90.6 ns | 303 ns |
| QR 6x6 | 85.5 ns | 207.9 ns | 445 ns |
| SVD 4x4 | 299 ns | 461 ns | 1278 ns |
| Cholesky 4x4 | 25.2 ns | 11.8 ns | 139 ns |
| Eigen sym 4x4 | 165 ns | 201 ns | 578 ns |
| Eigen sym 6x6 | 287 ns | 528 ns | 1088 ns |
numeris is competitive with nalgebra at small fixed sizes and significantly faster for QR, SVD, and eigendecomposition. faer excels at large dynamic matrices (200x200+) thanks to cache-aware packing of the A and B matrix panels.
## Module overview

### matrix

`Matrix<T, M, N>` with `[[T; M]; N]` column-major storage. `Matrix::new()` accepts row-major input and transposes internally.

- Arithmetic: `+`, `-`, `*` (matrix and scalar), negation, element-wise multiply/divide
- Indexing: `m[(i, j)]`, row/column access, block extraction and insertion
- Square: `trace`, `det`, `diag`, `from_diag`, `pow`, `is_symmetric`, `eye`
- Norms: `frobenius_norm`, `norm_inf`, `norm_one`
- Utilities: `transpose`, `from_fn`, `map`, `swap_rows`, `swap_cols`, `sum`, `abs`
- Iteration: `iter()`, `iter_mut()`, `as_slice()`, `IntoIterator`

Vectors are type aliases: `Vector<T, N>` = `Matrix<T, 1, N>` (row vector), `ColumnVector<T, N>` = `Matrix<T, N, 1>`.

Vector-specific: `dot`, `cross`, `outer`, `norm`, `norm_l1`, `normalize`.
### Size aliases

Convenience aliases are provided for common sizes (all re-exported from the crate root):

| Square | Rectangular (examples) | Vectors |
|---|---|---|
| `Matrix1` .. `Matrix6` | `Matrix2x3`, `Matrix3x4`, `Matrix4x6`, ... | `Vector1` .. `Vector6` |
| | All M×N combinations for M,N ∈ 1..=6, M≠N | `ColumnVector1` .. `ColumnVector6` |

```rust
use numeris::{ColumnVector3, Matrix3, Matrix4x3, Vector3};

let rotation: Matrix3<f64> = Matrix3::eye();
let points: Matrix4x3<f64> = Matrix4x3::zeros(); // 4 rows, 3 cols
let v: Vector3<f64> = Vector3::from_array([1.0, 2.0, 3.0]);
```
### dynmatrix

`DynMatrix<T>` with `Vec<T>` column-major storage and runtime dimensions. `from_rows()` accepts row-major data (transposes internally); `from_slice()` accepts column-major data directly.

- Same arithmetic, norms, block ops, and utilities as fixed `Matrix`
- Mixed ops: `Matrix * DynMatrix`, `DynMatrix + Matrix`, etc. → `DynMatrix`
- `DynVector<T>` newtype with single-index access and `dot`
- Conversions: `From<Matrix>` → `DynMatrix`, `TryFrom<&DynMatrix>` → `Matrix`
- Full linalg: `DynLu`, `DynCholesky`, `DynQr`, `DynSvd`, `DynSymmetricEigen`, `DynSchur` wrappers + convenience methods

Type aliases: `DynMatrixf64`, `DynMatrixf32`, `DynVectorf64`, `DynVectorf32`, `DynMatrixi32`, `DynMatrixi64`, `DynMatrixu32`, `DynMatrixu64`, `DynMatrixz64`, `DynMatrixz32` (complex).
### linalg

All decompositions provide both free functions (operating on `&mut impl MatrixMut<T>`) and wrapper structs with `solve()`, `inverse()`, `det()`.

| Decomposition | Free function | Fixed struct | Dynamic struct | Notes |
|---|---|---|---|---|
| LU | `lu_in_place` | `LuDecomposition` | `DynLu` | Partial pivoting |
| Cholesky | `cholesky_in_place` | `CholeskyDecomposition` | `DynCholesky` | A = LL^H (Hermitian) |
| QR | `qr_in_place` | `QrDecomposition` | `DynQr` | Householder reflections, least-squares |
| SVD | `bidiagonalize` + `bidiagonal_qr` | `SvdDecomposition` | `DynSvd` | Rectangular M×N, singular values, rank |
| Symmetric Eigen | `tridiagonalize` + `tridiagonal_qr_*` | `SymmetricEigen` | `DynSymmetricEigen` | Real eigenvalues, orthogonal eigenvectors |
| Schur | `hessenberg` + `francis_qr` | `SchurDecomposition` | `DynSchur` | Quasi-upper-triangular, general eigenvalues |

Convenience methods on `Matrix`: `a.lu()`, `a.cholesky()`, `a.qr()`, `a.svd()`, `a.solve(&b)`, `a.inverse()`, `a.det()`, `a.eig_symmetric()`, `a.eigenvalues_symmetric()`, `a.schur()`, `a.eigenvalues()`. The same convenience methods are available on `DynMatrix`.
### ode

Fixed-step `rk4` / `rk4_step` and 7 adaptive Runge-Kutta solvers via the `RKAdaptive` trait. PI step-size controller (Söderlind & Wang 2006). Dense output / interpolation is available for most solvers (gated behind `std`). For stiff systems, `RODAS4` provides an L-stable Rosenbrock method via the `Rosenbrock` trait, accepting user-supplied or automatic finite-difference Jacobians.
### optim

- Root finding: `brent` (bracketed, superlinear convergence), `newton_1d` (with derivative)
- Minimization: `minimize_bfgs` (BFGS quasi-Newton with Armijo line search)
- Least squares: `least_squares_gn` (Gauss-Newton via QR), `least_squares_lm` (Levenberg-Marquardt via damped normal equations)
- Finite differences: `finite_difference_gradient`, `finite_difference_jacobian` for numerical derivatives
- All algorithms use the `FloatScalar` bound (real-valued), configurable via settings structs with `Default` impls for `f32` and `f64`
### estimate

Six estimators with const-generic state (`N`) and measurement (`M`) dimensions. Closure-based dynamics and measurement models.

- `Ekf` — Extended Kalman Filter with `predict`/`update` (explicit Jacobian) and `predict_fd`/`update_fd` (finite-difference). Joseph-form covariance update. Fully no-std.
- `Ukf` — Unscented Kalman Filter with Merwe-scaled sigma points (`alpha`, `beta`, `kappa`). Requires `alloc`.
- `SrUkf` — Square-Root UKF. Propagates the Cholesky factor `S` (where `P = SSᵀ`) for guaranteed positive-definiteness. Requires `alloc`.
- `Ckf` — Cubature Kalman Filter. Third-degree spherical-radial cubature rule with `2N` equally-weighted points. No tuning parameters. Requires `alloc`.
- `rts_smooth` — Rauch–Tung–Striebel fixed-interval smoother. Takes a forward EKF pass (`EkfStep` records) and returns smoothed estimates. Requires `alloc`.
- `BatchLsq` — Linear batch least-squares via information accumulation (Λ = Σ HᵀR⁻¹H). Supports mixed measurement dimensions via a method-level const generic. Fully no-std.
- Process noise `Q` is optional in the predict step (`Some(&q)` or `None`)
- `FloatScalar` bound (real-valued); works with `f32` and `f64`
### control

Biquad cascade filters designed via the bilinear transform, PID controller, lead/lag compensator design, and PID tuning rules. No `complex` feature dependency.

- `Biquad<T>` — single second-order section, Direct Form II Transposed
- `BiquadCascade<T, N>` — cascade of `N` biquad sections (filter order ≤ 2N)
- Design functions: `butterworth_lowpass`, `butterworth_highpass`, `chebyshev1_lowpass`, `chebyshev1_highpass`
- Supports arbitrary even/odd filter orders; odd orders use a degenerate first-order last section
- `tick`, `process`, `process_inplace` for sample-by-sample or bulk filtering
- `Pid<T>` — PID controller with derivative filtering, output clamping, anti-windup back-calculation
- `lead_compensator` / `lag_compensator` — continuous-to-discrete compensator design via the bilinear transform
- `FopdtModel` — FOPDT process model with `ziegler_nichols`, `cohen_coon`, `simc` tuning methods
- `ziegler_nichols_ultimate` — closed-loop ultimate-gain PID tuning
### interp

Fixed-size (const `N`, stack-allocated, no-std) and dynamic variants (`Dyn*`, requires `alloc`). Out-of-bounds evaluations extrapolate.

| Method | Fixed | Dynamic | Notes |
|---|---|---|---|
| Linear | `LinearInterp<T, N>` | `DynLinearInterp<T>` | Piecewise linear |
| Hermite | `HermiteInterp<T, N>` | `DynHermiteInterp<T>` | User-supplied derivatives |
| Lagrange | `LagrangeInterp<T, N>` | `DynLagrangeInterp<T>` | Barycentric, O(N) eval |
| Cubic spline | `CubicSpline<T, N>` | `DynCubicSpline<T>` | Natural BCs, Thomas algorithm |
| Bilinear | `BilinearInterp<T, NX, NY>` | `DynBilinearInterp<T>` | 2D rectangular grid |
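The barycentric form is what makes Lagrange evaluation O(N): precompute weights w_j = 1/∏(x_j − x_k) once, then evaluate p(x) = Σ (w_j/(x − x_j)) y_j / Σ (w_j/(x − x_j)). A standalone sketch:

```rust
// Barycentric Lagrange interpolation: O(N²) once for the weights,
// then O(N) per evaluation.
fn bary_weights(xs: &[f64]) -> Vec<f64> {
    let n = xs.len();
    let mut w = vec![1.0; n];
    for j in 0..n {
        for k in 0..n {
            if k != j {
                w[j] /= xs[j] - xs[k];
            }
        }
    }
    w
}

fn bary_eval(xs: &[f64], ys: &[f64], w: &[f64], x: f64) -> f64 {
    let (mut num, mut den) = (0.0, 0.0);
    for j in 0..xs.len() {
        let d = x - xs[j];
        if d == 0.0 {
            return ys[j]; // exactly at a node
        }
        let t = w[j] / d;
        num += t * ys[j];
        den += t;
    }
    num / den
}

fn main() {
    // Interpolating x² through 3 nodes reproduces it exactly (degree ≤ 2)
    let xs = [0.0, 1.0, 3.0];
    let ys = [0.0, 1.0, 9.0];
    let w = bary_weights(&xs);
    assert!((bary_eval(&xs, &ys, &w, 2.0) - 4.0).abs() < 1e-12);
}
```

The same form is also numerically stable, unlike the naive Lagrange product formula.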
### special

Fully no-std, generic over `FloatScalar` (`f32`/`f64`).

| Function | Description |
|---|---|
| `gamma(x)`, `lgamma(x)` | Gamma function and log-gamma (Lanczos approximation) |
| `digamma(x)` | Digamma / psi function (recurrence + asymptotic series) |
| `beta(a,b)`, `lbeta(a,b)` | Beta function and log-beta |
| `gamma_inc(a,x)` | Regularized lower incomplete gamma P(a,x) |
| `gamma_inc_upper(a,x)` | Regularized upper incomplete gamma Q(a,x) |
| `betainc(a,b,x)` | Regularized incomplete beta I_x(a,b) |
| `erf(x)`, `erfc(x)` | Error function and complementary error function |
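For intuition about what these functions compute, here is a compact standalone erf approximation (Abramowitz & Stegun 7.1.26, max absolute error about 1.5e-7); the crate's own implementation may use a different, more accurate scheme:

```rust
// erf(x) via the Abramowitz & Stegun 7.1.26 rational approximation.
// Max absolute error ~1.5e-7 -- fine for illustration, not a reference
// implementation.
fn erf_approx(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 }; // erf is odd
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let poly = t * (0.254829592
        + t * (-0.284496736
        + t * (1.421413741
        + t * (-1.453152027
        + t * 1.061405429))));
    sign * (1.0 - poly * (-x * x).exp())
}

fn main() {
    // erf(1) ≈ 0.8427008, erf(0) = 0, erf(-x) = -erf(x)
    assert!((erf_approx(1.0) - 0.8427008).abs() < 1e-5);
    assert!(erf_approx(0.0).abs() < 1e-7);
    assert!((erf_approx(-1.0) + 0.8427008).abs() < 1e-5);
}
```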
### stats

Continuous distributions implement `ContinuousDistribution` (`pdf`, `cdf`, `quantile`, `mean`, `variance`). Discrete distributions implement `DiscreteDistribution` (`pmf`, `cdf`, `quantile`, `mean`, `variance`). Implies the `special` feature.

| Distribution | Struct | Parameters |
|---|---|---|
| Normal | `Normal<T>` | mean μ, std dev σ |
| Uniform | `Uniform<T>` | bounds [a, b] |
| Exponential | `Exponential<T>` | rate λ |
| Gamma | `Gamma<T>` | shape α, rate β |
| Beta | `Beta<T>` | shape α, shape β |
| Chi-squared | `ChiSquared<T>` | degrees of freedom k |
| Student's t | `StudentT<T>` | degrees of freedom ν |
| Bernoulli | `Bernoulli<T>` | probability p |
| Binomial | `Binomial<T>` | trials n, probability p |
| Poisson | `Poisson<T>` | rate λ |
### quaternion

`Quaternion<T>` with scalar-first convention `[w, x, y, z]`.

- Construction: `new`, `identity`, `from_axis_angle`, `from_euler`, `from_rotation_matrix`, `rotx`, `roty`, `rotz`
- Operations: `*` (Hamilton product), `* Vector3` (rotation), `conjugate`, `inverse`, `normalize`, `slerp`
- Conversion: `to_rotation_matrix`, `to_axis_angle`, `to_euler`
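The rotation q v q* behind `* Vector3` can be sketched standalone with the same scalar-first [w, x, y, z] convention (function names here are illustrative):

```rust
// Build a unit quaternion from a unit axis and an angle, scalar-first [w, x, y, z].
fn from_axis_angle(axis: [f64; 3], angle: f64) -> [f64; 4] {
    let (s, c) = (angle / 2.0).sin_cos();
    [c, s * axis[0], s * axis[1], s * axis[2]] // axis assumed unit length
}

// Rotate v by unit quaternion q using the expanded form of q * (0, v) * conj(q):
//   t = 2 * (q_vec × v),  v' = v + w*t + q_vec × t
fn rotate(q: [f64; 4], v: [f64; 3]) -> [f64; 3] {
    let (w, x, y, z) = (q[0], q[1], q[2], q[3]);
    let t = [
        2.0 * (y * v[2] - z * v[1]),
        2.0 * (z * v[0] - x * v[2]),
        2.0 * (x * v[1] - y * v[0]),
    ];
    [
        v[0] + w * t[0] + y * t[2] - z * t[1],
        v[1] + w * t[1] + z * t[0] - x * t[2],
        v[2] + w * t[2] + x * t[1] - y * t[0],
    ]
}

fn main() {
    // 90° about z sends the x axis to the y axis
    let q = from_axis_angle([0.0, 0.0, 1.0], std::f64::consts::FRAC_PI_2);
    let r = rotate(q, [1.0, 0.0, 0.0]);
    assert!(r[0].abs() < 1e-12);
    assert!((r[1] - 1.0).abs() < 1e-12);
    assert!(r[2].abs() < 1e-12);
}
```

The cross-product form above avoids building the full 3×3 rotation matrix, which is why quaternion rotation is cheap for one-off vectors.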
### Traits

| Trait | Bounds | Used by |
|---|---|---|
| `Scalar` | `Copy + PartialEq + Debug + Zero + One + Num` | All matrix ops |
| `FloatScalar` | `Scalar + Float + LinalgScalar<Real=Self>` | Quaternions, ordered comparisons |
| `LinalgScalar` | `Scalar` + `modulus`, `conj`, `re`, `lsqrt`, `lln`, `from_real` | Decompositions, norms |
| `MatrixRef<T>` | Read-only `get(row, col)` | Generic algorithms |
| `MatrixMut<T>` | Adds `get_mut(row, col)` | In-place decompositions |
## Module plan

Checked items are implemented; unchecked are potential future work.

- [x] matrix — Fixed-size matrix (stack-allocated, const-generic dimensions), size aliases up to 6×6
- [x] linalg — LU, Cholesky, QR, SVD decompositions; symmetric eigendecomposition; real Schur decomposition; solvers, inverse, determinant; complex support
- [x] quaternion — Unit quaternion for rotations (SLERP, Euler, axis-angle, rotation matrices)
- [x] ode — ODE integration (RK4, 7 adaptive solvers, dense output, RODAS4 stiff solver)
- [x] dynmatrix — Heap-allocated runtime-sized matrix/vector (`alloc` feature)
- [x] interp — Interpolation (linear, Hermite, barycentric Lagrange, natural cubic spline)
- [x] optim — Optimization (Brent, Newton, BFGS, Gauss-Newton, Levenberg-Marquardt)
- [x] estimate — State estimation: EKF, UKF, SR-UKF, CKF, RTS smoother, batch least-squares
- [ ] quad — Numerical quadrature / integration
- [ ] fft — Fast Fourier Transform
- [x] special — Special functions (gamma, lgamma, digamma, beta, lbeta, incomplete gamma/beta, erf, erfc)
- [x] stats — Statistical distributions (Normal, Uniform, Exponential, Gamma, Beta, Chi-squared, Student's t, Bernoulli, Binomial, Poisson)
- [ ] poly — Polynomial operations and root-finding
- [x] control — Digital IIR filters (Butterworth, Chebyshev), PID controller, lead/lag compensators, PID tuning (Ziegler-Nichols, Cohen-Coon, SIMC)
## Performance
numeris is designed for two use cases: no-std embedded systems and high-performance desktop/server computing.
SIMD acceleration is always-on for `f32`/`f64` — no feature flag needed. On aarch64 (NEON) and x86_64 (SSE2/AVX/AVX-512), matrix multiply, dot products, and element-wise operations use hardware SIMD intrinsics via `core::arch`. Matrix multiply uses register-blocked micro-kernels (inspired by nano-gemm by Sarah Quinones) that accumulate MR×NR tiles in SIMD registers with k-blocking (KC=256) for cache locality, writing C only once per tile — reducing memory traffic by up to O(n) vs. naive implementations. Small matrices (4×4 and below) use direct formulas for inverse and determinant, bypassing LU decomposition entirely. Integer and complex types fall back to scalar loops at zero cost (dead code is eliminated at monomorphization via `TypeId` dispatch).

SSE2 and NEON are always-on baselines. AVX and AVX-512 are compile-time opt-in via `-C target-cpu=native`; dispatch selects the widest available ISA.
| Architecture | ISA | f64 tile (MR×NR) | f32 tile (MR×NR) |
|---|---|---|---|
| aarch64 | NEON (128-bit) | 8×4 | 8×4 |
| x86_64 | SSE2 (128-bit) | 4×4 | 8×4 |
| x86_64 | AVX (256-bit) | 8×4 | 16×4 |
| x86_64 | AVX-512 (512-bit) | 16×4 | 32×4 |
| other | scalar fallback | 4×4 | 4×4 |
## Design decisions

- **Column-major storage**: `[[T; M]; N]` (N columns of M rows), matching LAPACK conventions. Stack-allocated, const generics.
- **Heap-allocated**: `DynMatrix` uses `Vec<T>` for runtime dimensions, behind the `alloc` feature.
- **num-traits**: Generic numeric bounds with `default-features = false`.
- **In-place algorithms**: Decompositions operate on `&mut impl MatrixMut<T>`, avoiding allocator/storage trait complexity. Both `Matrix` and `DynMatrix` implement `MatrixMut`, so the same free functions work for both.
- **Integer matrices**: Work with `Scalar` (all basic ops). Float-only operations (det, norms, decompositions) require `LinalgScalar` or `FloatScalar`.
- **Complex support**: Additive, behind a feature flag. Zero cost for real-only usage.
## Acknowledgments

The register-blocked SIMD matrix multiply micro-kernels are inspired by nano-gemm and faer by Sarah Quinones.

## License

MIT