OxiBLAS - Pure Rust BLAS/LAPACK Implementation
OxiBLAS is a pure Rust implementation of BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage), designed to be the defacto standard for the scirs2 ecosystem.
Features
- Pure Rust: No C dependencies, fully portable
- SIMD Optimized: Custom SIMD layer using
core::archintrinsics - Cache Aware: BLIS-style blocked algorithms for optimal cache usage
- Parallel: Optional rayon-based parallelization
Crate Structure
OxiBLAS is organized into several sub-crates:
oxiblas_core: Core traits and SIMD abstractionsoxiblas_matrix: Matrix types (Mat,MatRef,MatMut)oxiblas_blas: BLAS operations (gemm, etc.)oxiblas_lapack: LAPACK operations (decompositions, solvers)oxiblas_sparse: Sparse matrix types and operations (withsparsefeature, default)oxiblas_ndarray: ndarray integration (withndarrayfeature)
Quick Start
use *;
// Create matrices
let a: = filled;
let b: = filled;
let mut c: = zeros;
// GEMM: C = A * B
gemm;
// Each element of C should be 100 (50 * 1.0 * 2.0)
assert!;
Supported Types
f32: Single precision floating pointf64: Double precision floating pointComplex32: Single precision complexComplex64: Double precision complex
SIMD Support
OxiBLAS automatically detects and uses the best available SIMD:
- x86_64: SSE4.2 (128-bit), AVX2/FMA (256-bit), AVX-512F (512-bit)
- AVX-512BW for byte/word operations (i8, i16 SIMD)
- AVX-512VNNI for neural network acceleration (quantized inference)
- AArch64: NEON (128-bit), SVE (128-2048 bit scalable vectors)
- WASM: SIMD128 (128-bit)
- Fallback: Scalar operations
Scalar Specialization
For performance-critical code, OxiBLAS provides compile-time type dispatch:
SimdCompatible: SIMD width hints per scalar typeScalarBatch: Vectorized batch operations (dot, axpy, fma)ExtendedPrecision: Accumulator type mapping (f32→f64)KahanSum,pairwise_sum: Compensated summation for accuracy
Feature Flags
sparse: Enable sparse matrix operations (enabled by default)parallel: Enable rayon-based parallelizationndarray: Enable ndarray integration for interop with ndarray typesnalgebra: Enable nalgebra integration for interop with nalgebra typesmmap: Enable memory-mapped matrices for large datasetsf16: Enable half-precision (f16) supportf128: Enable quad-precision (f128) supportserde: Enable serialization support for matrix typesfull: Enable all features
SIMD Control Features
These features control SIMD optimization levels (useful for debugging or compatibility):
force-scalar: Disable all SIMD optimizations, use scalar operations onlymax-simd-128: Limit SIMD to 128-bit registers (SSE/NEON)max-simd-256: Limit SIMD to 256-bit registers (AVX2)
Use detect_simd_level() to check the active SIMD level, and
detect_simd_level_raw() to get the actual hardware capability.
ndarray Integration
With the ndarray feature enabled, you can use OxiBLAS with ndarray types:
use oxiblas::ndarray::prelude::*;
use ndarray::Array2;
let a = Array2::<f64>::from_shape_fn((100, 100), |idx| (idx.0 + idx.1) as f64);
let b = Array2::<f64>::from_shape_fn((100, 100), |idx| (idx.0 * idx.1) as f64);
let c = matmul(&a, &b);
nalgebra Integration
With the nalgebra feature enabled, you can convert between OxiBLAS and nalgebra types:
use oxiblas::prelude::*;
use oxiblas::{mat_to_dmatrix, dmatrix_to_mat, MatNalgebraExt, DMatrixOxiblasExt};
use nalgebra::DMatrix;
// Convert from nalgebra to OxiBLAS
let na_mat = DMatrix::from_fn(3, 3, |row, col| (row + col) as f64);
let oxi_mat: Mat<f64> = na_mat.to_mat();
// Convert from OxiBLAS to nalgebra
let mat: Mat<f64> = Mat::from_rows(&[&[1.0, 2.0], &[3.0, 4.0]]);
let dm: DMatrix<f64> = mat.to_dmatrix();
Lazy Evaluation
OxiBLAS supports lazy evaluation for matrix operations, enabling expression trees
that are only evaluated when .eval() is called. This allows for operation fusion
and optimization.
use *;
let a: = from_rows;
let b: = from_rows;
// Build an expression tree (no computation yet)
let expr = a.as_ref.lazy + b.as_ref.lazy;
// Evaluate to get the result
let result = expr.eval;
assert!; // 1 + 5
// Chained operations with scaling
let scaled = .scale;
let result2 = scaled.eval;
assert!; // (1 + 5) * 2
// Fused multiply-add: alpha * A + beta * B
let fma_result = lazy_fma.eval;
assert!; // 2*1 + 3*5
Available lazy operations:
- Element-wise:
+,-, negation,scale() - Matrix:
matmul(),t()(transpose) - Complex:
conj(),h()(Hermitian/conjugate transpose) - Fused:
lazy_fma(),lazy_gemm() - Optimizations:
.simplify()for double-transpose/scale/negation elimination
Performance Guide
Choosing the Right Algorithm
OxiBLAS automatically selects optimized code paths based on matrix size:
| Size | GEMM Strategy | Expected Performance |
|---|---|---|
| < 32 | Unrolled small-matrix kernels | Minimal overhead |
| 32-512 | BLIS-style blocked GEMM | ~70% peak |
| > 512 | Auto-tuned blocking + parallel | ~80% peak |
| > 2048 | Consider Strassen (optional) | Reduced complexity |
Memory Layout
- Column-major (default): Best for BLAS/LAPACK compatibility
- Contiguous data: Avoid strided views when possible for 2-10x speedup
- Aligned allocation:
Mat<T>uses 64-byte alignment for SIMD
Parallelization
Enable the parallel feature and use Par::Rayon for large matrices:
use oxiblas::prelude::*;
use oxiblas::core::Par;
// Sequential (default)
gemm(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut());
// Parallel - recommended for n >= 256
gemm_with_par(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut(), Par::Rayon);
Cache Optimization Tips
- Blocking parameters: Use
GemmBlocking::auto_tuned()for runtime tuning - Panel reuse: Process multiple right-hand sides together when possible
- Prefetching: Enabled automatically in GEMM micro-kernels
Complex Numbers
For complex GEMM, the 3M algorithm (3 real multiplies) is used automatically for matrices >= 64, providing ~25% speedup over naive 4-multiply approach.
Migration Guide
From ndarray-linalg
// ndarray-linalg
use ndarray::Array2;
use ndarray_linalg::Solve;
let a = Array2::<f64>::eye(3);
let b = Array2::<f64>::ones((3, 1));
let x = a.solve(&b).unwrap();
// OxiBLAS equivalent
use oxiblas::prelude::*;
let a = MatBuilder::<f64>::identity(3);
let b = MatBuilder::<f64>::ones(3, 1);
let x = solve(a.as_ref(), b.as_ref()).unwrap();
From nalgebra
// nalgebra
use nalgebra::{DMatrix, DVector};
let a = DMatrix::from_fn(3, 3, |i, j| (i + j) as f64);
let b = DVector::from_element(3, 1.0);
let lu = a.lu();
let x = lu.solve(&b).unwrap();
// OxiBLAS equivalent (with nalgebra feature)
use oxiblas::prelude::*;
let a: Mat<f64> = na_matrix.to_mat(); // Convert from nalgebra
let b: Mat<f64> = na_vector.to_mat();
let x = solve(a.as_ref(), b.as_ref()).unwrap();
let result = x.to_dmatrix(); // Convert back if needed
From NumPy (via Rust)
| NumPy | OxiBLAS |
|---|---|
np.dot(a, b) |
gemm(1.0, a, b, 0.0, c) or a.matmul(&b) |
np.linalg.solve(a, b) |
solve(a, b) |
np.linalg.svd(a) |
Svd::compute(a) |
np.linalg.eig(a) |
GeneralEvd::compute(a) |
np.linalg.qr(a) |
Qr::compute(a) |
np.linalg.cholesky(a) |
Cholesky::compute(a) |
np.linalg.inv(a) |
inv(a) |
np.linalg.det(a) |
det(a) |
Architecture
Crate Hierarchy
oxiblas (umbrella)
├── oxiblas-core # Scalar traits, SIMD, memory, parallelism
├── oxiblas-matrix # Mat/MatRef/MatMut, storage formats
├── oxiblas-blas # BLAS Level 1/2/3 operations
├── oxiblas-lapack # Decompositions, solvers, eigenvalue
├── oxiblas-sparse # Sparse formats, iterative solvers
└── oxiblas-ndarray # ndarray integration
GEMM Optimization Stack
User API: gemm(alpha, a, b, beta, c)
│
▼
Size dispatch: small (<32) → unrolled, large → blocked
│
▼
Blocking: MC×KC×NC panels (auto-tuned to cache hierarchy)
│
▼
Packing: pack_a (MC×KC), pack_b (KC×NC) with SIMD
│
▼
Micro-kernel: MR×NR FMA loop (8×6 f64, 8×8 f32 on NEON/AVX2)
SIMD Abstraction
The oxiblas-core crate provides a unified SIMD abstraction:
SimdLevel: Runtime detection (Scalar, SSE42, AVX2, AVX512, NEON, SVE)F64x2/F64x4: Portable f64 vector types with fallback- Architecture-specific intrinsics wrapped in safe functions
Memory Model
- Arena allocation: Bump allocator for GEMM temporaries (zero malloc)
- Aligned vectors: 64-byte aligned for AVX-512 compatibility
- Copy-on-write:
CowMat<T>for lazy cloning - Memory-mapped:
MmapMatfor large datasets (withmmapfeature)