Crate oxiblas

§OxiBLAS - Pure Rust BLAS/LAPACK Implementation

OxiBLAS is a pure Rust implementation of BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage), designed to be the de facto standard for the scirs2 ecosystem.

§Features

  • Pure Rust: No C dependencies, fully portable
  • SIMD Optimized: Custom SIMD layer using core::arch intrinsics
  • Cache Aware: BLIS-style blocked algorithms for optimal cache usage
  • Parallel: Optional rayon-based parallelization

§Crate Structure

OxiBLAS is organized into several sub-crates:

  • oxiblas_core: Core traits and SIMD abstractions
  • oxiblas_matrix: Matrix types (Mat, MatRef, MatMut)
  • oxiblas_blas: BLAS operations (gemm, etc.)
  • oxiblas_lapack: LAPACK operations (decompositions, solvers)
  • oxiblas_sparse: Sparse matrix types and operations (with the sparse feature, enabled by default)
  • oxiblas_ndarray: ndarray integration (with the ndarray feature)

§Quick Start

use oxiblas::prelude::*;

// Create matrices
let a: Mat<f64> = Mat::filled(100, 50, 1.0);
let b: Mat<f64> = Mat::filled(50, 80, 2.0);
let mut c: Mat<f64> = Mat::zeros(100, 80);

// GEMM: C = A * B
gemm(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut());

// Each element of C should be 100 (50 * 1.0 * 2.0)
assert!((c[(0, 0)] - 100.0).abs() < 1e-10);

§Supported Types

  • f32: Single precision floating point
  • f64: Double precision floating point
  • Complex32: Single precision complex
  • Complex64: Double precision complex

§SIMD Support

OxiBLAS automatically detects and uses the best available SIMD:

  • x86_64: SSE4.2 (128-bit), AVX2/FMA (256-bit), AVX-512F (512-bit)
    • AVX-512BW for byte/word operations (i8, i16 SIMD)
    • AVX-512VNNI for neural network acceleration (quantized inference)
  • AArch64: NEON (128-bit), SVE (128-2048 bit scalable vectors)
  • WASM: SIMD128 (128-bit)
  • Fallback: Scalar operations

§Scalar Specialization

For performance-critical code, OxiBLAS provides compile-time type dispatch:

  • SimdCompatible: SIMD width hints per scalar type
  • ScalarBatch: Vectorized batch operations (dot, axpy, fma)
  • ExtendedPrecision: Accumulator type mapping (f32→f64)
  • KahanSum, pairwise_sum: Compensated summation for accuracy (see the sketch after this list)
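
As a minimal sketch of the compensated-summation helpers, assuming pairwise_sum takes a slice and returns the scalar sum (the exact signature is not shown on this page):

use oxiblas::prelude::*;

// 0.1 is not exactly representable in binary, so naive left-to-right
// summation accumulates rounding error linearly in n; pairwise summation
// keeps the error growth to O(log n).
// Assumption: pairwise_sum(&[f64]) -> f64.
let values = vec![0.1_f64; 1_000_000];
let total = pairwise_sum(&values);
assert!((total - 100_000.0).abs() < 1e-6);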

§Feature Flags

  • sparse: Enable sparse matrix operations (enabled by default)
  • parallel: Enable rayon-based parallelization
  • ndarray: Enable ndarray integration for interop with ndarray types
  • nalgebra: Enable nalgebra integration for interop with nalgebra types
  • mmap: Enable memory-mapped matrices for large datasets
  • f16: Enable half-precision (f16) support
  • f128: Enable quad-precision (f128) support
  • serde: Enable serialization support for matrix types
  • full: Enable all features

§SIMD Control Features

These features control SIMD optimization levels (useful for debugging or compatibility):

  • force-scalar: Disable all SIMD optimizations, use scalar operations only
  • max-simd-128: Limit SIMD to 128-bit registers (SSE/NEON)
  • max-simd-256: Limit SIMD to 256-bit registers (AVX2)

Use detect_simd_level() to check the active SIMD level, and detect_simd_level_raw() to get the actual hardware capability.
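
A quick sketch of comparing the two, assuming SimdLevel implements Debug:

use oxiblas::prelude::*;

// The active level reflects the force-scalar / max-simd-* limits; the
// raw level reports what the hardware itself supports.
let active = detect_simd_level();
let hardware = detect_simd_level_raw();
println!("active SIMD: {:?}, hardware SIMD: {:?}", active, hardware);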

§ndarray Integration

With the ndarray feature enabled, you can use OxiBLAS with ndarray types:

use oxiblas::ndarray::prelude::*;
use ndarray::Array2;

let a = Array2::<f64>::from_shape_fn((100, 100), |idx| (idx.0 + idx.1) as f64);
let b = Array2::<f64>::from_shape_fn((100, 100), |idx| (idx.0 * idx.1) as f64);
let c = matmul(&a, &b);

§nalgebra Integration

With the nalgebra feature enabled, you can convert between OxiBLAS and nalgebra types:

use oxiblas::prelude::*;
use oxiblas::{mat_to_dmatrix, dmatrix_to_mat, MatNalgebraExt, DMatrixOxiblasExt};
use nalgebra::DMatrix;

// Convert from nalgebra to OxiBLAS
let na_mat = DMatrix::from_fn(3, 3, |row, col| (row + col) as f64);
let oxi_mat: Mat<f64> = na_mat.to_mat();

// Convert from OxiBLAS to nalgebra
let mat: Mat<f64> = Mat::from_rows(&[&[1.0, 2.0], &[3.0, 4.0]]);
let dm: DMatrix<f64> = mat.to_dmatrix();

§Lazy Evaluation

OxiBLAS supports lazy evaluation for matrix operations, enabling expression trees that are only evaluated when .eval() is called. This allows for operation fusion and optimization.

use oxiblas::prelude::*;

let a: Mat<f64> = Mat::from_rows(&[&[1.0, 2.0], &[3.0, 4.0]]);
let b: Mat<f64> = Mat::from_rows(&[&[5.0, 6.0], &[7.0, 8.0]]);

// Build an expression tree (no computation yet)
let expr = a.as_ref().lazy() + b.as_ref().lazy();

// Evaluate to get the result
let result = expr.eval();
assert!((result[(0, 0)] - 6.0).abs() < 1e-10); // 1 + 5

// Chained operations with scaling
let scaled = (a.as_ref().lazy() + b.as_ref().lazy()).scale(2.0);
let result2 = scaled.eval();
assert!((result2[(0, 0)] - 12.0).abs() < 1e-10); // (1 + 5) * 2

// Fused multiply-add: alpha * A + beta * B
let fma_result = lazy_fma(2.0, a.as_ref().lazy(), 3.0, b.as_ref().lazy()).eval();
assert!((fma_result[(0, 0)] - 17.0).abs() < 1e-10); // 2*1 + 3*5

Available lazy operations:

  • Element-wise: +, -, negation, scale()
  • Matrix: matmul(), t() (transpose)
  • Complex: conj(), h() (Hermitian/conjugate transpose)
  • Fused: lazy_fma(), lazy_gemm()
  • Optimizations: .simplify() for double-transpose/scale/negation elimination (sketched below)
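
A minimal sketch of .simplify(), assuming t() and simplify() chain on lazy expressions as the list above suggests:

use oxiblas::prelude::*;

let a: Mat<f64> = Mat::from_rows(&[&[1.0, 2.0], &[3.0, 4.0]]);

// (A^T)^T is A itself; simplify() is documented to eliminate the
// redundant transpose pair before evaluation.
let expr = a.as_ref().lazy().t().t();
let result = expr.simplify().eval();
assert!((result[(0, 1)] - 2.0).abs() < 1e-10); // unchanged element of A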

§Performance Guide

§Choosing the Right Algorithm

OxiBLAS automatically selects optimized code paths based on matrix size:

Size    | GEMM Strategy                  | Expected Performance
< 32    | Unrolled small-matrix kernels  | Minimal overhead
32-512  | BLIS-style blocked GEMM        | ~70% peak
> 512   | Auto-tuned blocking + parallel | ~80% peak
> 2048  | Consider Strassen (optional)   | Reduced complexity

§Memory Layout

  • Column-major (default): Best for BLAS/LAPACK compatibility
  • Contiguous data: Avoid strided views when possible for a 2-10x speedup (see the sketch below)
  • Aligned allocation: Mat<T> uses 64-byte alignment for SIMD
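
A small sketch of the (row, col) indexing used throughout this page and of why column-wise inner loops are cache-friendly in the default layout:

use oxiblas::prelude::*;

// With column-major storage, walking down a column touches adjacent
// memory; indexing is (row, col) as in the Quick Start example.
let m: Mat<f64> = Mat::from_rows(&[&[1.0, 2.0], &[3.0, 4.0]]);
assert!((m[(1, 0)] - 3.0).abs() < 1e-12); // row 1, col 0

let mut col_sum = 0.0;
for i in 0..2 {
    col_sum += m[(i, 0)]; // contiguous accesses down column 0
}
assert!((col_sum - 4.0).abs() < 1e-12); // 1.0 + 3.0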

§Parallelization

Enable the parallel feature and use Par::Rayon for large matrices:

use oxiblas::prelude::*;
use oxiblas::core::Par;

// Sequential (default); a, b, c as in the Quick Start example
gemm(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut());

// Parallel - recommended for n >= 256
gemm_with_par(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut(), Par::Rayon);

§Cache Optimization Tips

  1. Blocking parameters: Use GemmBlocking::auto_tuned() for runtime tuning (see the sketch after this list)
  2. Panel reuse: Process multiple right-hand sides together when possible
  3. Prefetching: Enabled automatically in GEMM micro-kernels
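
A minimal sketch of obtaining the auto-tuned blocking parameters; this page does not show how they are passed to a GEMM call, so the commented line uses a hypothetical gemm_blocked name:

use oxiblas::prelude::*;

// GemmBlocking::auto_tuned() probes the cache hierarchy at runtime.
let blocking = GemmBlocking::auto_tuned();
// Hypothetical variant taking explicit blocking parameters:
// gemm_blocked(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut(), blocking);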

§Complex Numbers

For complex GEMM, the 3M algorithm (three real multiplies instead of four) is used automatically for matrices with dimensions >= 64, providing a ~25% speedup over the naive four-multiply approach.
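
The identity behind the 3M method, as a self-contained scalar sketch (the packed complex kernels apply the same identity to whole panels):

// 3M complex multiply: (ar + i*ai)(br + i*bi) with three real multiplies
// instead of four: real = t1 - t2, imag = t3 - t1 - t2.
fn mul_3m(ar: f64, ai: f64, br: f64, bi: f64) -> (f64, f64) {
    let t1 = ar * br;
    let t2 = ai * bi;
    let t3 = (ar + ai) * (br + bi);
    (t1 - t2, t3 - t1 - t2)
}

assert_eq!(mul_3m(1.0, 2.0, 3.0, 4.0), (-5.0, 10.0)); // (1+2i)(3+4i) = -5+10i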

§Migration Guide

§From ndarray-linalg

// ndarray-linalg
use ndarray::{Array1, Array2};
use ndarray_linalg::Solve;
let a = Array2::<f64>::eye(3);
let b = Array1::<f64>::ones(3); // Solve::solve expects a 1-D right-hand side
let x = a.solve(&b).unwrap();

// OxiBLAS equivalent
use oxiblas::prelude::*;
let a = MatBuilder::<f64>::identity(3);
let b = MatBuilder::<f64>::ones(3, 1);
let x = solve(a.as_ref(), b.as_ref()).unwrap();

§From nalgebra

// nalgebra
use nalgebra::{DMatrix, DVector};
let a = DMatrix::from_fn(3, 3, |i, j| if i == j { 2.0 } else { 1.0 }); // nonsingular
let b = DVector::from_element(3, 1.0);
let lu = a.lu();
let x = lu.solve(&b).unwrap();

// OxiBLAS equivalent (with nalgebra feature)
use oxiblas::prelude::*;
let a: Mat<f64> = na_matrix.to_mat(); // na_matrix: the DMatrix built above
let b: Mat<f64> = na_vector.to_mat(); // na_vector: the DVector built above
let x = solve(a.as_ref(), b.as_ref()).unwrap();
let result = x.to_dmatrix(); // Convert back if needed

§From NumPy (via Rust)

NumPy                 | OxiBLAS
np.dot(a, b)          | gemm(1.0, a, b, 0.0, c) or a.matmul(&b)
np.linalg.solve(a, b) | solve(a, b)
np.linalg.svd(a)      | Svd::compute(a)
np.linalg.eig(a)      | GeneralEvd::compute(a)
np.linalg.qr(a)       | Qr::compute(a)
np.linalg.cholesky(a) | Cholesky::compute(a)
np.linalg.inv(a)      | inv(a)
np.linalg.det(a)      | det(a)
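
A sketch of two of these mappings in use; the return type of Svd::compute is an assumption, everything else follows the examples above:

use oxiblas::prelude::*;

let a: Mat<f64> = Mat::from_rows(&[&[4.0, 1.0], &[1.0, 3.0]]);
let b: Mat<f64> = Mat::from_rows(&[&[1.0], &[2.0]]);

// np.linalg.solve(a, b): x = [1/11, 7/11] for this system
let x = solve(a.as_ref(), b.as_ref()).unwrap();
assert!((x[(0, 0)] - 1.0 / 11.0).abs() < 1e-10);

// np.linalg.svd(a); the result type is not shown on this page
let _svd = Svd::compute(a.as_ref());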

§Architecture

§Crate Hierarchy

oxiblas (umbrella)
├── oxiblas-core     # Scalar traits, SIMD, memory, parallelism
├── oxiblas-matrix   # Mat/MatRef/MatMut, storage formats
├── oxiblas-blas     # BLAS Level 1/2/3 operations
├── oxiblas-lapack   # Decompositions, solvers, eigenvalue
├── oxiblas-sparse   # Sparse formats, iterative solvers
└── oxiblas-ndarray  # ndarray integration

§GEMM Optimization Stack

User API: gemm(alpha, a, b, beta, c)
    │
    ▼
Size dispatch: small (<32) → unrolled, large → blocked
    │
    ▼
Blocking: MC×KC×NC panels (auto-tuned to cache hierarchy)
    │
    ▼
Packing: pack_a (MC×KC), pack_b (KC×NC) with SIMD
    │
    ▼
Micro-kernel: MR×NR FMA loop (8×6 f64, 8×8 f32 on NEON/AVX2)
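
A scalar model of what the micro-kernel stage computes; real kernels keep the MR×NR accumulator block in SIMD registers and use FMA instructions:

// For each k, perform a rank-1 update of the MR x NR accumulator block
// from one packed column of A and one packed row of B.
const MR: usize = 8; // micro-panel rows (the 8x6 f64 kernel from the diagram)
const NR: usize = 6; // micro-panel columns

fn micro_kernel(kc: usize, a_panel: &[f64], b_panel: &[f64], acc: &mut [[f64; NR]; MR]) {
    for k in 0..kc {
        for i in 0..MR {
            let a_ik = a_panel[k * MR + i]; // packed MR x kc panel of A
            for j in 0..NR {
                acc[i][j] += a_ik * b_panel[k * NR + j]; // packed kc x NR panel of B
            }
        }
    }
}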

§SIMD Abstraction

The oxiblas-core crate provides a unified SIMD abstraction:

  • SimdLevel: Runtime detection (Scalar, SSE42, AVX2, AVX512, NEON, SVE)
  • F64x2/F64x4: Portable f64 vector types with fallback
  • Architecture-specific intrinsics wrapped in safe functions

§Memory Model

  • Arena allocation: Bump allocator for GEMM temporaries (zero malloc)
  • Aligned vectors: 64-byte aligned for AVX-512 compatibility
  • Copy-on-write: CowMat<T> for lazy cloning
  • Memory-mapped: MmapMat for large datasets (with mmap feature)

§Re-exports

pub use oxiblas_blas as blas;
pub use oxiblas_core as core;
pub use oxiblas_lapack as lapack;
pub use oxiblas_matrix as matrix;
pub use oxiblas_sparse as sparse;

§Modules

  • auto: Automatic algorithm selection for OxiBLAS.
  • builder: Matrix builder patterns for ergonomic matrix creation.
  • fluent: Fluent API for chained matrix operations.
  • gemm: General Matrix Multiplication (GEMM).
  • prelude: Prelude module - import everything commonly needed.

§Structs

  • AlignedVec: A vector with guaranteed alignment and custom allocator support.
  • DiagRef: A view into the diagonal of a matrix.
  • ExprAdd: A lazy addition expression.
  • ExprConj: A lazy complex conjugate expression.
  • ExprFma: Fused multiply-add expression: alpha * A + beta * B.
  • ExprGemm: GEMM expression: alpha * A * B + beta * C.
  • ExprHermitian: A lazy conjugate transpose (Hermitian) expression.
  • ExprLeaf: A leaf expression wrapping a matrix reference.
  • ExprMul: A lazy matrix multiplication expression.
  • ExprNeg: A lazy negation expression.
  • ExprScale: A lazy scalar multiplication expression.
  • ExprSub: A lazy subtraction expression.
  • ExprTranspose: A lazy transpose expression.
  • GemmBlocking: Blocking parameters for GEMM.
  • KBKSum: Kahan-Babuska-Klein summation (improved compensated summation).
  • KahanSum: Kahan summation for improved accuracy.
  • Mat: An owned, heap-allocated matrix with column-major storage.
  • MatMut: A mutable view into a matrix.
  • MatRef: An immutable view into a matrix.
  • MemStack: A memory stack for temporary allocations.
  • ParThreshold: Threshold configuration for parallel operations.
  • StackReq: Represents the memory requirements for an operation.
  • TransposeRef: A transposed view of a matrix.

§Enums

  • Par: Parallelization mode.
  • ScalarClass: Type-level scalar classification for compile-time dispatch.
  • SimdLevel: SIMD capability level detected at runtime.

§Constants

  • I32: The imaginary unit i (32-bit).
  • I64: The imaginary unit i (64-bit).

§Traits

  • ComplexExpr: Extension trait for complex expressions.
  • ComplexExt: Extension trait for more ergonomic complex number operations.
  • ComplexScalar: Marker trait for complex scalar types.
  • Expr: Trait for lazy matrix expressions.
  • ExtendedPrecision: Extended precision accumulation support.
  • Field: Field trait - complete algebraic structure with all operations.
  • GemmKernel: Trait for types that have GEMM micro-kernel implementations.
  • HasFastFma: Marker trait for types with hardware FMA (fused multiply-add) support.
  • LazyExt: Trait extension for MatRef to create lazy expressions.
  • Real: Trait for real number types (f32, f64).
  • Scalar: Base trait for all scalar types used in OxiBLAS.
  • ScalarBatch: Batch operations on scalar arrays for performance-critical code.
  • ScalarClassify: Trait for compile-time scalar classification.
  • SimdCompatible: Marker trait for types that can be efficiently vectorized with SIMD.
  • ToComplex: Trait for converting real numbers to complex.
  • UnrollHints: Unrolling hints for vectorized loops.

§Functions

  • c32: Creates a complex number from real and imaginary parts (f32).
  • c64: Creates a complex number from real and imaginary parts (f64).
  • detect_simd_level: Detects the best available SIMD level at runtime.
  • detect_simd_level_raw: Raw SIMD level detection without feature flag limits.
  • gemm: GEMM operation: C = alpha * A * B + beta * C.
  • gemm_with_par: GEMM with parallelization control.
  • lazy_fma: Creates a fused multiply-add expression: alpha * A + beta * B.
  • lazy_gemm: Creates a GEMM expression: alpha * A * B + beta * C.
  • pairwise_sum: Pairwise summation for reduced error accumulation.

§Type Aliases

  • C32: 32-bit complex type alias (same as Complex32).
  • C64: 64-bit complex type alias (same as Complex64).