§OxiBLAS - Pure Rust BLAS/LAPACK Implementation
OxiBLAS is a pure Rust implementation of BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage), designed to be the de facto standard for the scirs2 ecosystem.
§Features
- Pure Rust: No C dependencies, fully portable
- SIMD Optimized: Custom SIMD layer using `core::arch` intrinsics
- Cache Aware: BLIS-style blocked algorithms for optimal cache usage
- Parallel: Optional rayon-based parallelization
§Crate Structure
OxiBLAS is organized into several sub-crates:
- `oxiblas_core`: Core traits and SIMD abstractions
- `oxiblas_matrix`: Matrix types (`Mat`, `MatRef`, `MatMut`)
- `oxiblas_blas`: BLAS operations (gemm, etc.)
- `oxiblas_lapack`: LAPACK operations (decompositions, solvers)
- `oxiblas_sparse`: Sparse matrix types and operations (with the `sparse` feature, default)
- `oxiblas_ndarray`: ndarray integration (with the `ndarray` feature)
§Quick Start
```rust
use oxiblas::prelude::*;

// Create matrices
let a: Mat<f64> = Mat::filled(100, 50, 1.0);
let b: Mat<f64> = Mat::filled(50, 80, 2.0);
let mut c: Mat<f64> = Mat::zeros(100, 80);

// GEMM: C = A * B
gemm(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut());

// Each element of C should be 100 (50 * 1.0 * 2.0)
assert!((c[(0, 0)] - 100.0).abs() < 1e-10);
```
§Supported Types
- `f32`: Single precision floating point
- `f64`: Double precision floating point
- `Complex32`: Single precision complex
- `Complex64`: Double precision complex
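All four types implement the `Scalar` trait (listed under Traits below), so generic code can cover them uniformly. A minimal sketch, assuming `Mat::zeros` needs no bound beyond `Scalar` (this page does not spell out its exact bounds):

```rust
use oxiblas::prelude::*;

// Generic over any supported scalar type. The Scalar trait is documented
// below; that Mat::zeros requires exactly this bound is an assumption.
fn zero_square<T: Scalar>(n: usize) -> Mat<T> {
    Mat::zeros(n, n)
}

let a: Mat<f32> = zero_square(8);
let b: Mat<Complex64> = zero_square(8);
```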
§SIMD Support
OxiBLAS automatically detects and uses the best available SIMD:
- x86_64: SSE4.2 (128-bit), AVX2/FMA (256-bit), AVX-512F (512-bit)
- AVX-512BW for byte/word operations (i8, i16 SIMD)
- AVX-512VNNI for neural network acceleration (quantized inference)
- AArch64: NEON (128-bit), SVE (128-2048 bit scalable vectors)
- WASM: SIMD128 (128-bit)
- Fallback: Scalar operations
§Scalar Specialization
For performance-critical code, OxiBLAS provides compile-time type dispatch:
- `SimdCompatible`: SIMD width hints per scalar type
- `ScalarBatch`: Vectorized batch operations (dot, axpy, fma)
- `ExtendedPrecision`: Accumulator type mapping (f32→f64)
- `KahanSum`, `pairwise_sum`: Compensated summation for accuracy
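A sketch of the summation helpers; that `pairwise_sum` takes a slice and that `KahanSum` exposes `new`/`add`/`sum` are assumptions about the API, not details stated on this page:

```rust
use oxiblas::prelude::*;

let data = vec![0.1_f64; 1_000_000];

// Pairwise summation keeps rounding error at O(log n) rather than O(n).
// Assumed signature: pairwise_sum(&[T]) -> T.
let total = pairwise_sum(&data);

// Kahan summation carries a compensation term for lost low-order bits.
// The new/add/sum method names are hypothetical, for illustration only.
let mut acc = KahanSum::new();
for &x in &data {
    acc.add(x);
}
assert!((acc.sum() - total).abs() < 1e-6);
```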
§Feature Flags
- `sparse`: Enable sparse matrix operations (enabled by default)
- `parallel`: Enable rayon-based parallelization
- `ndarray`: Enable ndarray integration for interop with ndarray types
- `nalgebra`: Enable nalgebra integration for interop with nalgebra types
- `mmap`: Enable memory-mapped matrices for large datasets
- `f16`: Enable half-precision (f16) support
- `f128`: Enable quad-precision (f128) support
- `serde`: Enable serialization support for matrix types
- `full`: Enable all features
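For example, a downstream crate that forwards the `parallel` feature can gate its call sites with ordinary Cargo feature flags. The `gemm`/`gemm_with_par` signatures below are inferred from the Quick Start and Parallelization examples, and the lifetime parameter on the view types is an assumption:

```rust
use oxiblas::prelude::*;

// Rayon-backed path, compiled only when the parallel feature is on.
#[cfg(feature = "parallel")]
fn multiply(a: MatRef<'_, f64>, b: MatRef<'_, f64>, c: MatMut<'_, f64>) {
    use oxiblas::core::Par;
    gemm_with_par(1.0, a, b, 0.0, c, Par::Rayon);
}

// Sequential fallback when the feature is off.
#[cfg(not(feature = "parallel"))]
fn multiply(a: MatRef<'_, f64>, b: MatRef<'_, f64>, c: MatMut<'_, f64>) {
    gemm(1.0, a, b, 0.0, c);
}
```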
§SIMD Control Features
These features control SIMD optimization levels (useful for debugging or compatibility):
- `force-scalar`: Disable all SIMD optimizations, use scalar operations only
- `max-simd-128`: Limit SIMD to 128-bit registers (SSE/NEON)
- `max-simd-256`: Limit SIMD to 256-bit registers (AVX2)
Use `detect_simd_level()` to check the active SIMD level, and `detect_simd_level_raw()` to get the actual hardware capability.
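For instance, comparing the two at startup; a minimal sketch that assumes `SimdLevel` implements `Debug`, which this page does not state:

```rust
use oxiblas::prelude::*;

// Effective level: respects the force-scalar / max-simd-* caps.
let active = detect_simd_level();
// Raw level: what the hardware actually supports.
let hardware = detect_simd_level_raw();

// Assumes SimdLevel implements Debug.
println!("active: {:?}, hardware: {:?}", active, hardware);
```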
§ndarray Integration
With the ndarray feature enabled, you can use OxiBLAS with ndarray types:
```rust
use oxiblas::ndarray::prelude::*;
use ndarray::Array2;

let a = Array2::<f64>::from_shape_fn((100, 100), |idx| (idx.0 + idx.1) as f64);
let b = Array2::<f64>::from_shape_fn((100, 100), |idx| (idx.0 * idx.1) as f64);
let c = matmul(&a, &b);
```
§nalgebra Integration
With the nalgebra feature enabled, you can convert between OxiBLAS and nalgebra types:
```rust
use oxiblas::prelude::*;
use oxiblas::{mat_to_dmatrix, dmatrix_to_mat, MatNalgebraExt, DMatrixOxiblasExt};
use nalgebra::DMatrix;

// Convert from nalgebra to OxiBLAS
let na_mat = DMatrix::from_fn(3, 3, |row, col| (row + col) as f64);
let oxi_mat: Mat<f64> = na_mat.to_mat();

// Convert from OxiBLAS to nalgebra
let mat: Mat<f64> = Mat::from_rows(&[&[1.0, 2.0], &[3.0, 4.0]]);
let dm: DMatrix<f64> = mat.to_dmatrix();
```
§Lazy Evaluation
OxiBLAS supports lazy evaluation for matrix operations, enabling expression trees
that are only evaluated when .eval() is called. This allows for operation fusion
and optimization.
```rust
use oxiblas::prelude::*;

let a: Mat<f64> = Mat::from_rows(&[&[1.0, 2.0], &[3.0, 4.0]]);
let b: Mat<f64> = Mat::from_rows(&[&[5.0, 6.0], &[7.0, 8.0]]);

// Build an expression tree (no computation yet)
let expr = a.as_ref().lazy() + b.as_ref().lazy();

// Evaluate to get the result
let result = expr.eval();
assert!((result[(0, 0)] - 6.0).abs() < 1e-10); // 1 + 5

// Chained operations with scaling
let scaled = (a.as_ref().lazy() + b.as_ref().lazy()).scale(2.0);
let result2 = scaled.eval();
assert!((result2[(0, 0)] - 12.0).abs() < 1e-10); // (1 + 5) * 2

// Fused multiply-add: alpha * A + beta * B
let fma_result = lazy_fma(2.0, a.as_ref().lazy(), 3.0, b.as_ref().lazy()).eval();
assert!((fma_result[(0, 0)] - 17.0).abs() < 1e-10); // 2*1 + 3*5
```
Available lazy operations:
- Element-wise: `+`, `-`, negation, `scale()`
- Matrix: `matmul()`, `t()` (transpose)
- Complex: `conj()`, `h()` (Hermitian/conjugate transpose)
- Fused: `lazy_fma()`, `lazy_gemm()`
- Optimizations: `.simplify()` for double-transpose/scale/negation elimination
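Putting a few of these together; this sketch assumes `t()` and `matmul()` compose on lazy expressions exactly as the list above indicates:

```rust
use oxiblas::prelude::*;

let a: Mat<f64> = Mat::from_rows(&[&[1.0, 2.0], &[3.0, 4.0]]);
let b: Mat<f64> = Mat::from_rows(&[&[5.0, 6.0], &[7.0, 8.0]]);

// (A^T)^T * B: simplify() should strip the double transpose before
// evaluation, leaving a plain A * B.
let expr = a.as_ref().lazy().t().t().matmul(b.as_ref().lazy());
let result = expr.simplify().eval();

assert!((result[(0, 0)] - 19.0).abs() < 1e-10); // 1*5 + 2*7
```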
§Performance Guide
§Choosing the Right Algorithm
OxiBLAS automatically selects optimized code paths based on matrix size:
| Size | GEMM Strategy | Expected Performance |
|---|---|---|
| < 32 | Unrolled small-matrix kernels | Minimal overhead |
| 32-512 | BLIS-style blocked GEMM | ~70% peak |
| > 512 | Auto-tuned blocking + parallel | ~80% peak |
| > 2048 | Consider Strassen (optional) | Reduced complexity |
§Memory Layout
- Column-major (default): Best for BLAS/LAPACK compatibility
- Contiguous data: Avoid strided views when possible for 2-10x speedup
- Aligned allocation: `Mat<T>` uses 64-byte alignment for SIMD
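A quick illustration of why loop order matters with column-major storage; keeping the row index in the inner loop visits memory sequentially:

```rust
use oxiblas::prelude::*;

let m: Mat<f64> = Mat::filled(1024, 1024, 1.0);
let mut total = 0.0;

// Column-major: elements of a column are adjacent in memory, so the
// row index i belongs in the inner loop for contiguous access.
for j in 0..1024 {
    for i in 0..1024 {
        total += m[(i, j)];
    }
}
assert_eq!(total, 1024.0 * 1024.0);
```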
§Parallelization
Enable the `parallel` feature and use `Par::Rayon` for large matrices:
```rust
use oxiblas::prelude::*;
use oxiblas::core::Par;

let a: Mat<f64> = Mat::filled(512, 512, 1.0);
let b: Mat<f64> = Mat::filled(512, 512, 1.0);
let mut c: Mat<f64> = Mat::zeros(512, 512);

// Sequential (default)
gemm(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut());

// Parallel - recommended for n >= 256
gemm_with_par(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut(), Par::Rayon);
```
§Cache Optimization Tips
- Blocking parameters: Use `GemmBlocking::auto_tuned()` for runtime tuning
- Panel reuse: Process multiple right-hand sides together when possible
- Prefetching: Enabled automatically in GEMM micro-kernels
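A sketch of obtaining tuned blocking parameters. This page documents `GemmBlocking::auto_tuned()` but not the entry point that consumes it, so the call below is left as hypothetical commented code:

```rust
use oxiblas::prelude::*;

// Probe the cache hierarchy once and reuse the parameters.
let blocking = GemmBlocking::auto_tuned();

// Hypothetical: how blocking parameters are passed to GEMM is not
// documented here, so this line is illustrative only.
// gemm_with_blocking(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut(), &blocking);
let _ = blocking;
```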
§Complex Numbers
For complex GEMM, the 3M algorithm (three real multiplies per complex multiply) is used automatically for matrices of dimension >= 64, providing roughly a 25% speedup over the naive 4-multiply approach.
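The switch is transparent at the call site: the same `gemm` entry point covers both regimes. This sketch assumes complex `gemm` takes complex alpha/beta, mirroring the real-valued signature:

```rust
use oxiblas::prelude::*;

// 128 >= 64, so this multiply should take the 3M path automatically.
let a: Mat<Complex64> = Mat::filled(128, 128, c64(1.0, -1.0));
let b: Mat<Complex64> = Mat::filled(128, 128, c64(0.5, 0.5));
let mut c: Mat<Complex64> = Mat::zeros(128, 128);

gemm(c64(1.0, 0.0), a.as_ref(), b.as_ref(), c64(0.0, 0.0), c.as_mut());
```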
§Migration Guide
§From ndarray-linalg
```rust
// ndarray-linalg (Solve expects a 1-D right-hand side)
use ndarray::{Array1, Array2};
use ndarray_linalg::Solve;

let a = Array2::<f64>::eye(3);
let b = Array1::<f64>::ones(3);
let x = a.solve(&b).unwrap();
```

```rust
// OxiBLAS equivalent
use oxiblas::prelude::*;

let a = MatBuilder::<f64>::identity(3);
let b = MatBuilder::<f64>::ones(3, 1);
let x = solve(a.as_ref(), b.as_ref()).unwrap();
```
§From nalgebra
```rust
// nalgebra
use nalgebra::{DMatrix, DVector};

// A nonsingular matrix, so the LU solve succeeds.
let a = DMatrix::from_fn(3, 3, |i, j| if i == j { 2.0 } else { 1.0 });
let b = DVector::from_element(3, 1.0);
let lu = a.lu();
let x = lu.solve(&b).unwrap();
```

```rust
// OxiBLAS equivalent (with nalgebra feature)
use oxiblas::prelude::*;
use nalgebra::{DMatrix, DVector};

let na_matrix = DMatrix::from_fn(3, 3, |i, j| if i == j { 2.0 } else { 1.0 });
let na_vector = DVector::from_element(3, 1.0);

let a: Mat<f64> = na_matrix.to_mat(); // Convert from nalgebra
let b: Mat<f64> = na_vector.to_mat();
let x = solve(a.as_ref(), b.as_ref()).unwrap();
let result = x.to_dmatrix(); // Convert back if needed
```
§From NumPy (via Rust)
| NumPy | OxiBLAS |
|---|---|
| `np.dot(a, b)` | `gemm(1.0, a, b, 0.0, c)` or `a.matmul(&b)` |
| `np.linalg.solve(a, b)` | `solve(a, b)` |
| `np.linalg.svd(a)` | `Svd::compute(a)` |
| `np.linalg.eig(a)` | `GeneralEvd::compute(a)` |
| `np.linalg.qr(a)` | `Qr::compute(a)` |
| `np.linalg.cholesky(a)` | `Cholesky::compute(a)` |
| `np.linalg.inv(a)` | `inv(a)` |
| `np.linalg.det(a)` | `det(a)` |
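As a concrete instance of the `np.linalg.solve` row, using only API shown elsewhere on this page:

```rust
use oxiblas::prelude::*;

// Solve the diagonal system [[3, 0], [0, 2]] x = [3, 4]^T.
let a: Mat<f64> = Mat::from_rows(&[&[3.0, 0.0], &[0.0, 2.0]]);
let b: Mat<f64> = Mat::from_rows(&[&[3.0], &[4.0]]);

let x = solve(a.as_ref(), b.as_ref()).unwrap();
assert!((x[(0, 0)] - 1.0).abs() < 1e-10); // x[0] = 3/3
assert!((x[(1, 0)] - 2.0).abs() < 1e-10); // x[1] = 4/2
```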
§Architecture
§Crate Hierarchy
```text
oxiblas (umbrella)
├── oxiblas-core     # Scalar traits, SIMD, memory, parallelism
├── oxiblas-matrix   # Mat/MatRef/MatMut, storage formats
├── oxiblas-blas     # BLAS Level 1/2/3 operations
├── oxiblas-lapack   # Decompositions, solvers, eigenvalues
├── oxiblas-sparse   # Sparse formats, iterative solvers
└── oxiblas-ndarray  # ndarray integration
```
§GEMM Optimization Stack
```text
User API: gemm(alpha, a, b, beta, c)
        │
        ▼
Size dispatch: small (<32) → unrolled, large → blocked
        │
        ▼
Blocking: MC×KC×NC panels (auto-tuned to cache hierarchy)
        │
        ▼
Packing: pack_a (MC×KC), pack_b (KC×NC) with SIMD
        │
        ▼
Micro-kernel: MR×NR FMA loop (8×6 f64, 8×8 f32 on NEON/AVX2)
```
§SIMD Abstraction
The oxiblas-core crate provides a unified SIMD abstraction:
- `SimdLevel`: Runtime detection (Scalar, SSE42, AVX2, AVX512, NEON, SVE)
- `F64x2`/`F64x4`: Portable f64 vector types with fallback
- Architecture-specific intrinsics wrapped in safe functions
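A sketch of dispatching on the detected level; the variant names follow the parenthetical list above, and their exact spelling in the enum is an assumption:

```rust
use oxiblas::prelude::*;

// Variant names taken from the runtime-detection list (Scalar, SSE42,
// AVX2, AVX512, NEON, SVE); the exact casing is assumed.
match detect_simd_level() {
    SimdLevel::AVX512 => println!("512-bit kernels"),
    SimdLevel::AVX2 | SimdLevel::NEON => println!("256/128-bit kernels"),
    _ => println!("scalar or other fallback"),
}
```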
§Memory Model
- Arena allocation: Bump allocator for GEMM temporaries (zero malloc)
- Aligned vectors: 64-byte aligned for AVX-512 compatibility
- Copy-on-write: `CowMat<T>` for lazy cloning
- Memory-mapped: `MmapMat` for large datasets (with the `mmap` feature)
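Conceptually, a GEMM call sizes a `StackReq` for its packed panels and draws the scratch space from one `MemStack`. None of the constructor or method names below appear on this page, so the sketch is left as commented pseudocode:

```rust
// Hypothetical names throughout; this page documents only the types
// (StackReq, MemStack) and their purpose.
//
// let req = StackReq::new::<f64>(mc * kc + kc * nc);
// let mut stack = MemStack::new(req);
// ...hand &mut stack to kernels that need temporaries...
```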
Re-exports§
- `pub use oxiblas_blas as blas;`
- `pub use oxiblas_core as core;`
- `pub use oxiblas_lapack as lapack;`
- `pub use oxiblas_matrix as matrix;`
- `pub use oxiblas_sparse as sparse;`
Modules§
- `auto`: Automatic algorithm selection for OxiBLAS.
- `builder`: Matrix builder patterns for ergonomic matrix creation.
- `fluent`: Fluent API for chained matrix operations.
- `gemm`: General Matrix Multiplication (GEMM).
- `prelude`: Prelude module; import everything commonly needed.
Structs§
- `AlignedVec`: A vector with guaranteed alignment and custom allocator support.
- `DiagRef`: A view into the diagonal of a matrix.
- `ExprAdd`: A lazy addition expression.
- `ExprConj`: A lazy complex conjugate expression.
- `ExprFma`: Fused multiply-add expression: alpha * A + beta * B.
- `ExprGemm`: GEMM expression: alpha * A * B + beta * C.
- `ExprHermitian`: A lazy conjugate transpose (Hermitian) expression.
- `ExprLeaf`: A leaf expression wrapping a matrix reference.
- `ExprMul`: A lazy matrix multiplication expression.
- `ExprNeg`: A lazy negation expression.
- `ExprScale`: A lazy scalar multiplication expression.
- `ExprSub`: A lazy subtraction expression.
- `ExprTranspose`: A lazy transpose expression.
- `GemmBlocking`: Blocking parameters for GEMM.
- `KBKSum`: Kahan-Babuska-Klein summation (improved compensated summation).
- `KahanSum`: Kahan summation for improved accuracy.
- `Mat`: An owned, heap-allocated matrix with column-major storage.
- `MatMut`: A mutable view into a matrix.
- `MatRef`: An immutable view into a matrix.
- `MemStack`: A memory stack for temporary allocations.
- `ParThreshold`: Threshold configuration for parallel operations.
- `StackReq`: Represents the memory requirements for an operation.
- `TransposeRef`: A transposed view of a matrix.
Enums§
- `Par`: Parallelization mode.
- `ScalarClass`: Type-level scalar classification for compile-time dispatch.
- `SimdLevel`: SIMD capability level detected at runtime.
Constants§
Traits§
- `ComplexExpr`: Extension trait for complex expressions.
- `ComplexExt`: Extension trait for more ergonomic complex number operations.
- `ComplexScalar`: Marker trait for complex scalar types.
- `Expr`: Trait for lazy matrix expressions.
- `ExtendedPrecision`: Extended precision accumulation support.
- `Field`: Field trait; a complete algebraic structure with all operations.
- `GemmKernel`: Trait for types that have GEMM micro-kernel implementations.
- `HasFastFma`: Marker trait for types with hardware FMA (fused multiply-add) support.
- `LazyExt`: Trait extension for MatRef to create lazy expressions.
- `Real`: Trait for real number types (f32, f64).
- `Scalar`: Base trait for all scalar types used in OxiBLAS.
- `ScalarBatch`: Batch operations on scalar arrays for performance-critical code.
- `ScalarClassify`: Trait for compile-time scalar classification.
- `SimdCompatible`: Marker trait for types that can be efficiently vectorized with SIMD.
- `ToComplex`: Trait for converting real numbers to complex.
- `UnrollHints`: Unrolling hints for vectorized loops.
Functions§
- `c32`: Creates a complex number from real and imaginary parts (f32).
- `c64`: Creates a complex number from real and imaginary parts.
- `detect_simd_level`: Detects the best available SIMD level at runtime.
- `detect_simd_level_raw`: Raw SIMD level detection without feature flag limits.
- `gemm`: GEMM operation: C = alpha * A * B + beta * C.
- `gemm_with_par`: GEMM with parallelization control.
- `lazy_fma`: Creates a fused multiply-add expression: alpha * A + beta * B.
- `lazy_gemm`: Creates a GEMM expression: alpha * A * B + beta * C.
- `pairwise_sum`: Pairwise summation for reduced error accumulation.