Crate oxiblas

§OxiBLAS - Pure Rust BLAS/LAPACK Implementation

OxiBLAS is a pure Rust implementation of BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage), designed to be the de facto standard for the scirs2 ecosystem.

§Features

  • Pure Rust: No C dependencies, fully portable
  • SIMD Optimized: Custom SIMD layer using core::arch intrinsics
  • Cache Aware: BLIS-style blocked algorithms for optimal cache usage
  • Parallel: Optional rayon-based parallelization

§Crate Structure

OxiBLAS is organized into several sub-crates:

  • oxiblas_core: Core traits and SIMD abstractions
  • oxiblas_matrix: Matrix types (Mat, MatRef, MatMut)
  • oxiblas_blas: BLAS operations (gemm, etc.)
  • oxiblas_lapack: LAPACK operations (decompositions, solvers)
  • oxiblas_sparse: Sparse matrix types and operations (with the sparse feature, enabled by default)
  • oxiblas_ndarray: ndarray integration (with the ndarray feature)

§Quick Start

use oxiblas::prelude::*;

// Create matrices
let a: Mat<f64> = Mat::filled(100, 50, 1.0);
let b: Mat<f64> = Mat::filled(50, 80, 2.0);
let mut c: Mat<f64> = Mat::zeros(100, 80);

// GEMM: C = A * B
gemm(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut());

// Each element of C should be 100 (50 * 1.0 * 2.0)
assert!((c[(0, 0)] - 100.0).abs() < 1e-10);

§Supported Types

  • f32: Single precision floating point
  • f64: Double precision floating point
  • Complex32: Single precision complex
  • Complex64: Double precision complex

§SIMD Support

OxiBLAS automatically detects and uses the best available SIMD:

  • x86_64: SSE4.2 (128-bit), AVX2/FMA (256-bit), AVX-512F (512-bit)
    • AVX-512BW for byte/word operations (i8, i16 SIMD)
    • AVX-512VNNI for neural network acceleration (quantized inference)
  • AArch64: NEON (128-bit), SVE (128-2048 bit scalable vectors)
  • WASM: SIMD128 (128-bit)
  • Fallback: Scalar operations

§Scalar Specialization

For performance-critical code, OxiBLAS provides compile-time type dispatch:

  • SimdCompatible: SIMD width hints per scalar type
  • ScalarBatch: Vectorized batch operations (dot, axpy, fma)
  • ExtendedPrecision: Accumulator type mapping (f32→f64)
  • KahanSum, pairwise_sum: Compensated summation for accuracy (see the sketch after this list)
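
As a minimal sketch of the compensated-summation helpers, assuming pairwise_sum takes a slice and returns the scalar sum (the exact signature is not shown on this page):

use oxiblas::prelude::*;

// 0.1 is not exactly representable in binary, so naive left-to-right
// summation accumulates rounding error linearly in n; pairwise summation
// keeps the error growth to O(log n).
// Assumption: pairwise_sum(&[f64]) -> f64.
let values = vec![0.1_f64; 1_000_000];
let total = pairwise_sum(&values);
assert!((total - 100_000.0).abs() < 1e-6);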

§Feature Flags

  • sparse: Enable sparse matrix operations (enabled by default)
  • parallel: Enable rayon-based parallelization
  • ndarray: Enable ndarray integration for interop with ndarray types
  • nalgebra: Enable nalgebra integration for interop with nalgebra types
  • mmap: Enable memory-mapped matrices for large datasets
  • f16: Enable half-precision (f16) support
  • f128: Enable quad-precision (f128) support
  • serde: Enable serialization support for matrix types
  • full: Enable all features

§SIMD Control Features

These features control SIMD optimization levels (useful for debugging or compatibility):

  • force-scalar: Disable all SIMD optimizations, use scalar operations only
  • max-simd-128: Limit SIMD to 128-bit registers (SSE/NEON)
  • max-simd-256: Limit SIMD to 256-bit registers (AVX2)

Use detect_simd_level() to check the active SIMD level, and detect_simd_level_raw() to get the actual hardware capability.
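
A quick sketch of comparing the two, assuming SimdLevel implements Debug:

use oxiblas::prelude::*;

// The active level reflects the force-scalar / max-simd-* limits; the
// raw level reports what the hardware itself supports.
let active = detect_simd_level();
let hardware = detect_simd_level_raw();
println!("active SIMD: {:?}, hardware SIMD: {:?}", active, hardware);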

§ndarray Integration

With the ndarray feature enabled, you can use OxiBLAS with ndarray types:

use oxiblas::ndarray::prelude::*;
use ndarray::Array2;

let a = Array2::<f64>::from_shape_fn((100, 100), |idx| (idx.0 + idx.1) as f64);
let b = Array2::<f64>::from_shape_fn((100, 100), |idx| (idx.0 * idx.1) as f64);
let c = matmul(&a, &b);

§nalgebra Integration

With the nalgebra feature enabled, you can convert between OxiBLAS and nalgebra types:

use oxiblas::prelude::*;
use oxiblas::{mat_to_dmatrix, dmatrix_to_mat, MatNalgebraExt, DMatrixOxiblasExt};
use nalgebra::DMatrix;

// Convert from nalgebra to OxiBLAS
let na_mat = DMatrix::from_fn(3, 3, |row, col| (row + col) as f64);
let oxi_mat: Mat<f64> = na_mat.to_mat();

// Convert from OxiBLAS to nalgebra
let mat: Mat<f64> = Mat::from_rows(&[&[1.0, 2.0], &[3.0, 4.0]]);
let dm: DMatrix<f64> = mat.to_dmatrix();

§Lazy Evaluation

OxiBLAS supports lazy evaluation for matrix operations, enabling expression trees that are only evaluated when .eval() is called. This allows for operation fusion and optimization.

use oxiblas::prelude::*;

let a: Mat<f64> = Mat::from_rows(&[&[1.0, 2.0], &[3.0, 4.0]]);
let b: Mat<f64> = Mat::from_rows(&[&[5.0, 6.0], &[7.0, 8.0]]);

// Build an expression tree (no computation yet)
let expr = a.as_ref().lazy() + b.as_ref().lazy();

// Evaluate to get the result
let result = expr.eval();
assert!((result[(0, 0)] - 6.0).abs() < 1e-10); // 1 + 5

// Chained operations with scaling
let scaled = (a.as_ref().lazy() + b.as_ref().lazy()).scale(2.0);
let result2 = scaled.eval();
assert!((result2[(0, 0)] - 12.0).abs() < 1e-10); // (1 + 5) * 2

// Fused multiply-add: alpha * A + beta * B
let fma_result = lazy_fma(2.0, a.as_ref().lazy(), 3.0, b.as_ref().lazy()).eval();
assert!((fma_result[(0, 0)] - 17.0).abs() < 1e-10); // 2*1 + 3*5

Available lazy operations:

  • Element-wise: +, -, negation, scale()
  • Matrix: matmul(), t() (transpose)
  • Complex: conj(), h() (Hermitian/conjugate transpose)
  • Fused: lazy_fma(), lazy_gemm()
  • Optimizations: .simplify() for double-transpose/scale/negation elimination (sketched below)
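
A minimal sketch of .simplify(), assuming t() and simplify() chain on lazy expressions as the list above suggests:

use oxiblas::prelude::*;

let a: Mat<f64> = Mat::from_rows(&[&[1.0, 2.0], &[3.0, 4.0]]);

// (A^T)^T is A itself; simplify() is documented to eliminate the
// redundant transpose pair before evaluation.
let expr = a.as_ref().lazy().t().t();
let result = expr.simplify().eval();
assert!((result[(0, 1)] - 2.0).abs() < 1e-10); // unchanged element of A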

§Performance Guide

§Choosing the Right Algorithm

OxiBLAS automatically selects optimized code paths based on matrix size:

Size    | GEMM Strategy                  | Expected Performance
< 32    | Unrolled small-matrix kernels  | Minimal overhead
32-512  | BLIS-style blocked GEMM        | ~70% peak
> 512   | Auto-tuned blocking + parallel | ~80% peak
> 2048  | Consider Strassen (optional)   | Reduced complexity

§Memory Layout

  • Column-major (default): Best for BLAS/LAPACK compatibility
  • Contiguous data: Avoid strided views when possible for a 2-10x speedup (see the sketch below)
  • Aligned allocation: Mat<T> uses 64-byte alignment for SIMD
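
A small sketch of the (row, col) indexing used throughout this page and of why column-wise inner loops are cache-friendly in the default layout:

use oxiblas::prelude::*;

// With column-major storage, walking down a column touches adjacent
// memory; indexing is (row, col) as in the Quick Start example.
let m: Mat<f64> = Mat::from_rows(&[&[1.0, 2.0], &[3.0, 4.0]]);
assert!((m[(1, 0)] - 3.0).abs() < 1e-12); // row 1, col 0

let mut col_sum = 0.0;
for i in 0..2 {
    col_sum += m[(i, 0)]; // contiguous accesses down column 0
}
assert!((col_sum - 4.0).abs() < 1e-12); // 1.0 + 3.0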

§Parallelization

Enable the parallel feature and use Par::Rayon for large matrices:

use oxiblas::prelude::*;
use oxiblas::core::Par;

// Sequential (default); a, b, c as in the Quick Start example
gemm(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut());

// Parallel - recommended for n >= 256
gemm_with_par(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut(), Par::Rayon);

§Cache Optimization Tips

  1. Blocking parameters: Use GemmBlocking::auto_tuned() for runtime tuning (see the sketch after this list)
  2. Panel reuse: Process multiple right-hand sides together when possible
  3. Prefetching: Enabled automatically in GEMM micro-kernels
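
A minimal sketch of obtaining the auto-tuned blocking parameters; this page does not show how they are passed to a GEMM call, so the commented line uses a hypothetical gemm_blocked name:

use oxiblas::prelude::*;

// GemmBlocking::auto_tuned() probes the cache hierarchy at runtime.
let blocking = GemmBlocking::auto_tuned();
// Hypothetical variant taking explicit blocking parameters:
// gemm_blocked(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut(), blocking);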

§Complex Numbers

For complex GEMM, the 3M algorithm (three real multiplies instead of four) is used automatically for matrices with dimensions >= 64, providing a ~25% speedup over the naive four-multiply approach.
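
The identity behind the 3M method, as a self-contained scalar sketch (the packed complex kernels apply the same identity to whole panels):

// 3M complex multiply: (ar + i*ai)(br + i*bi) with three real multiplies
// instead of four: real = t1 - t2, imag = t3 - t1 - t2.
fn mul_3m(ar: f64, ai: f64, br: f64, bi: f64) -> (f64, f64) {
    let t1 = ar * br;
    let t2 = ai * bi;
    let t3 = (ar + ai) * (br + bi);
    (t1 - t2, t3 - t1 - t2)
}

assert_eq!(mul_3m(1.0, 2.0, 3.0, 4.0), (-5.0, 10.0)); // (1+2i)(3+4i) = -5+10i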

§Migration Guide

§From ndarray-linalg

// ndarray-linalg
use ndarray::{Array1, Array2};
use ndarray_linalg::Solve;
let a = Array2::<f64>::eye(3);
let b = Array1::<f64>::ones(3); // Solve::solve expects a 1-D right-hand side
let x = a.solve(&b).unwrap();

// OxiBLAS equivalent
use oxiblas::prelude::*;
let a = MatBuilder::<f64>::identity(3);
let b = MatBuilder::<f64>::ones(3, 1);
let x = solve(a.as_ref(), b.as_ref()).unwrap();

§From nalgebra

// nalgebra
use nalgebra::{DMatrix, DVector};
let a = DMatrix::from_fn(3, 3, |i, j| if i == j { 2.0 } else { 1.0 }); // nonsingular
let b = DVector::from_element(3, 1.0);
let lu = a.lu();
let x = lu.solve(&b).unwrap();

// OxiBLAS equivalent (with nalgebra feature)
use oxiblas::prelude::*;
let a: Mat<f64> = na_matrix.to_mat(); // na_matrix: the DMatrix built above
let b: Mat<f64> = na_vector.to_mat(); // na_vector: the DVector built above
let x = solve(a.as_ref(), b.as_ref()).unwrap();
let result = x.to_dmatrix(); // Convert back if needed

§From NumPy (via Rust)

NumPy                 | OxiBLAS
np.dot(a, b)          | gemm(1.0, a, b, 0.0, c) or a.matmul(&b)
np.linalg.solve(a, b) | solve(a, b)
np.linalg.svd(a)      | Svd::compute(a)
np.linalg.eig(a)      | GeneralEvd::compute(a)
np.linalg.qr(a)       | Qr::compute(a)
np.linalg.cholesky(a) | Cholesky::compute(a)
np.linalg.inv(a)      | inv(a)
np.linalg.det(a)      | det(a)
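
A sketch of two of these mappings in use; the return type of Svd::compute is an assumption, everything else follows the examples above:

use oxiblas::prelude::*;

let a: Mat<f64> = Mat::from_rows(&[&[4.0, 1.0], &[1.0, 3.0]]);
let b: Mat<f64> = Mat::from_rows(&[&[1.0], &[2.0]]);

// np.linalg.solve(a, b): x = [1/11, 7/11] for this system
let x = solve(a.as_ref(), b.as_ref()).unwrap();
assert!((x[(0, 0)] - 1.0 / 11.0).abs() < 1e-10);

// np.linalg.svd(a); the result type is not shown on this page
let _svd = Svd::compute(a.as_ref());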

§Architecture

§Crate Hierarchy

oxiblas (umbrella)
├── oxiblas-core     # Scalar traits, SIMD, memory, parallelism
├── oxiblas-matrix   # Mat/MatRef/MatMut, storage formats
├── oxiblas-blas     # BLAS Level 1/2/3 operations
├── oxiblas-lapack   # Decompositions, solvers, eigenvalue
├── oxiblas-sparse   # Sparse formats, iterative solvers
└── oxiblas-ndarray  # ndarray integration

§GEMM Optimization Stack

User API: gemm(alpha, a, b, beta, c)
    │
    ▼
Size dispatch: small (<32) → unrolled, large → blocked
    │
    ▼
Blocking: MC×KC×NC panels (auto-tuned to cache hierarchy)
    │
    ▼
Packing: pack_a (MC×KC), pack_b (KC×NC) with SIMD
    │
    ▼
Micro-kernel: MR×NR FMA loop (8×6 f64, 8×8 f32 on NEON/AVX2)
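
A scalar model of what the micro-kernel stage computes; real kernels keep the MR×NR accumulator block in SIMD registers and use FMA instructions:

// For each k, perform a rank-1 update of the MR x NR accumulator block
// from one packed column of A and one packed row of B.
const MR: usize = 8; // micro-panel rows (the 8x6 f64 kernel from the diagram)
const NR: usize = 6; // micro-panel columns

fn micro_kernel(kc: usize, a_panel: &[f64], b_panel: &[f64], acc: &mut [[f64; NR]; MR]) {
    for k in 0..kc {
        for i in 0..MR {
            let a_ik = a_panel[k * MR + i]; // packed MR x kc panel of A
            for j in 0..NR {
                acc[i][j] += a_ik * b_panel[k * NR + j]; // packed kc x NR panel of B
            }
        }
    }
}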

§SIMD Abstraction

The oxiblas-core crate provides a unified SIMD abstraction:

  • SimdLevel: Runtime detection (Scalar, SSE42, AVX2, AVX512, NEON, SVE)
  • F64x2/F64x4: Portable f64 vector types with fallback
  • Architecture-specific intrinsics wrapped in safe functions

§Memory Model

  • Arena allocation: Bump allocator for GEMM temporaries (zero malloc)
  • Aligned vectors: 64-byte aligned for AVX-512 compatibility
  • Copy-on-write: CowMat<T> for lazy cloning
  • Memory-mapped: MmapMat for large datasets (with mmap feature)

§Re-exports

pub use oxiblas_blas as blas;
pub use oxiblas_core as core;
pub use oxiblas_lapack as lapack;
pub use oxiblas_matrix as matrix;
pub use oxiblas_sparse as sparse;

§Modules

  • auto: Automatic algorithm selection for OxiBLAS.
  • builder: Matrix builder patterns for ergonomic matrix creation.
  • fluent: Fluent API for chained matrix operations.
  • gemm: General Matrix Multiplication (GEMM).
  • prelude: Prelude module - import everything commonly needed.

§Structs

  • AlignedVec: A vector with guaranteed alignment and custom allocator support.
  • DiagRef: A view into the diagonal of a matrix.
  • ExprAdd: A lazy addition expression.
  • ExprConj: A lazy complex conjugate expression.
  • ExprFma: Fused multiply-add expression: alpha * A + beta * B.
  • ExprGemm: GEMM expression: alpha * A * B + beta * C.
  • ExprHermitian: A lazy conjugate transpose (Hermitian) expression.
  • ExprLeaf: A leaf expression wrapping a matrix reference.
  • ExprMul: A lazy matrix multiplication expression.
  • ExprNeg: A lazy negation expression.
  • ExprScale: A lazy scalar multiplication expression.
  • ExprSub: A lazy subtraction expression.
  • ExprTranspose: A lazy transpose expression.
  • GemmBlocking: Blocking parameters for GEMM.
  • KBKSum: Kahan-Babuska-Klein summation (improved compensated summation).
  • KahanSum: Kahan summation for improved accuracy.
  • Mat: An owned, heap-allocated matrix with column-major storage.
  • MatMut: A mutable view into a matrix.
  • MatRef: An immutable view into a matrix.
  • MemStack: A memory stack for temporary allocations.
  • ParThreshold: Threshold configuration for parallel operations.
  • StackReq: Represents the memory requirements for an operation.
  • TransposeRef: A transposed view of a matrix.

§Enums

  • Par: Parallelization mode.
  • ScalarClass: Type-level scalar classification for compile-time dispatch.
  • SimdLevel: SIMD capability level detected at runtime.

§Constants

  • I32: The imaginary unit i (32-bit).
  • I64: The imaginary unit i (64-bit).

§Traits

  • ComplexExpr: Extension trait for complex expressions.
  • ComplexExt: Extension trait for more ergonomic complex number operations.
  • ComplexScalar: Marker trait for complex scalar types.
  • Expr: Trait for lazy matrix expressions.
  • ExtendedPrecision: Extended precision accumulation support.
  • Field: Field trait - complete algebraic structure with all operations.
  • GemmKernel: Trait for types that have GEMM micro-kernel implementations.
  • HasFastFma: Marker trait for types with hardware FMA (fused multiply-add) support.
  • LazyExt: Trait extension for MatRef to create lazy expressions.
  • Real: Trait for real number types (f32, f64).
  • Scalar: Base trait for all scalar types used in OxiBLAS.
  • ScalarBatch: Batch operations on scalar arrays for performance-critical code.
  • ScalarClassify: Trait for compile-time scalar classification.
  • SimdCompatible: Marker trait for types that can be efficiently vectorized with SIMD.
  • ToComplex: Trait for converting real numbers to complex.
  • UnrollHints: Unrolling hints for vectorized loops.

§Functions

  • c32: Creates a complex number from real and imaginary parts (f32).
  • c64: Creates a complex number from real and imaginary parts (f64).
  • detect_simd_level: Detects the best available SIMD level at runtime.
  • detect_simd_level_raw: Raw SIMD level detection without feature flag limits.
  • gemm: GEMM operation: C = alpha * A * B + beta * C.
  • gemm_with_par: GEMM with parallelization control.
  • lazy_fma: Creates a fused multiply-add expression: alpha * A + beta * B.
  • lazy_gemm: Creates a GEMM expression: alpha * A * B + beta * C.
  • pairwise_sum: Pairwise summation for reduced error accumulation.

§Type Aliases

  • C32: 32-bit complex type alias (same as Complex32).
  • C64: 64-bit complex type alias (same as Complex64).