OxiBLAS - Pure Rust BLAS/LAPACK Implementation

OxiBLAS is a pure Rust implementation of BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage), designed to be the de facto standard for the scirs2 ecosystem.

Features

  • Pure Rust: No C dependencies, fully portable
  • SIMD Optimized: Custom SIMD layer using core::arch intrinsics
  • Cache Aware: BLIS-style blocked algorithms for optimal cache usage
  • Parallel: Optional rayon-based parallelization

Crate Structure

OxiBLAS is organized into several sub-crates:

  • oxiblas_core: Core traits and SIMD abstractions
  • oxiblas_matrix: Matrix types (Mat, MatRef, MatMut)
  • oxiblas_blas: BLAS operations (gemm, etc.)
  • oxiblas_lapack: LAPACK operations (decompositions, solvers)
  • oxiblas_sparse: Sparse matrix types and operations (with the sparse feature, enabled by default)
  • oxiblas_ndarray: ndarray integration (with the ndarray feature)

Quick Start

use oxiblas::prelude::*;

// Create matrices
let a: Mat<f64> = Mat::filled(100, 50, 1.0);
let b: Mat<f64> = Mat::filled(50, 80, 2.0);
let mut c: Mat<f64> = Mat::zeros(100, 80);

// GEMM: C = A * B
gemm(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut());

// Each element of C should be 100 (50 * 1.0 * 2.0)
assert!((c[(0, 0)] - 100.0).abs() < 1e-10);

Supported Types

  • f32: Single precision floating point
  • f64: Double precision floating point
  • Complex32: Single precision complex
  • Complex64: Double precision complex

SIMD Support

OxiBLAS automatically detects and uses the best available SIMD:

  • x86_64: SSE4.2 (128-bit), AVX2/FMA (256-bit), AVX-512F (512-bit)
    • AVX-512BW for byte/word operations (i8, i16 SIMD)
    • AVX-512VNNI for neural network acceleration (quantized inference)
  • AArch64: NEON (128-bit), SVE (128-2048 bit scalable vectors)
  • WASM: SIMD128 (128-bit)
  • Fallback: Scalar operations

Scalar Specialization

For performance-critical code, OxiBLAS provides compile-time type dispatch:

  • SimdCompatible: SIMD width hints per scalar type
  • ScalarBatch: Vectorized batch operations (dot, axpy, fma)
  • ExtendedPrecision: Accumulator type mapping (f32→f64)
  • KahanSum, pairwise_sum: Compensated summation for accuracy
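
For intuition, here is what the compensated summation behind KahanSum computes, written as a free function rather than the crate's actual API:

fn kahan_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0;
    let mut c = 0.0; // running compensation for lost low-order bits
    for &x in values {
        let y = x - c;     // apply the compensation to the next term
        let t = sum + y;   // sum grows; low-order bits of y may be lost
        c = (t - sum) - y; // recover exactly what was lost
        sum = t;
    }
    sum
}

pairwise_sum achieves a similar accuracy improvement by a different route: recursively splitting the slice and summing the halves, which keeps partial sums of comparable magnitude.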

Feature Flags

  • sparse: Enable sparse matrix operations (enabled by default)
  • parallel: Enable rayon-based parallelization
  • ndarray: Enable ndarray integration for interop with ndarray types
  • nalgebra: Enable nalgebra integration for interop with nalgebra types
  • mmap: Enable memory-mapped matrices for large datasets
  • f16: Enable half-precision (f16) support
  • f128: Enable quad-precision (f128) support
  • serde: Enable serialization support for matrix types
  • full: Enable all features
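
As an illustration, enabling a subset of these flags in Cargo.toml might look like this (the version number shown is illustrative):

[dependencies]
oxiblas = { version = "0.1.0", features = ["parallel", "ndarray"] }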

SIMD Control Features

These features control SIMD optimization levels (useful for debugging or compatibility):

  • force-scalar: Disable all SIMD optimizations, use scalar operations only
  • max-simd-128: Limit SIMD to 128-bit registers (SSE/NEON)
  • max-simd-256: Limit SIMD to 256-bit registers (AVX2)

Use detect_simd_level() to check the active SIMD level, and detect_simd_level_raw() to get the actual hardware capability.
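
A minimal check, assuming both helpers are reachable from the prelude and that the returned level type implements Debug:

use oxiblas::prelude::*;

// The active level reflects any force-scalar / max-simd-* caps;
// the raw level reports what the hardware actually supports.
let active = detect_simd_level();
let hardware = detect_simd_level_raw();
println!("active: {:?}, hardware: {:?}", active, hardware);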

ndarray Integration

With the ndarray feature enabled, you can use OxiBLAS with ndarray types:

use oxiblas::ndarray::prelude::*;
use ndarray::Array2;

let a = Array2::<f64>::from_shape_fn((100, 100), |idx| (idx.0 + idx.1) as f64);
let b = Array2::<f64>::from_shape_fn((100, 100), |idx| (idx.0 * idx.1) as f64);
let c = matmul(&a, &b);

nalgebra Integration

With the nalgebra feature enabled, you can convert between OxiBLAS and nalgebra types:

use oxiblas::prelude::*;
use oxiblas::{mat_to_dmatrix, dmatrix_to_mat, MatNalgebraExt, DMatrixOxiblasExt};
use nalgebra::DMatrix;

// Convert from nalgebra to OxiBLAS
let na_mat = DMatrix::from_fn(3, 3, |row, col| (row + col) as f64);
let oxi_mat: Mat<f64> = na_mat.to_mat();

// Convert from OxiBLAS to nalgebra
let mat: Mat<f64> = Mat::from_rows(&[&[1.0, 2.0], &[3.0, 4.0]]);
let dm: DMatrix<f64> = mat.to_dmatrix();

Lazy Evaluation

OxiBLAS supports lazy evaluation for matrix operations, enabling expression trees that are only evaluated when .eval() is called. This allows for operation fusion and optimization.

use oxiblas::prelude::*;

let a: Mat<f64> = Mat::from_rows(&[&[1.0, 2.0], &[3.0, 4.0]]);
let b: Mat<f64> = Mat::from_rows(&[&[5.0, 6.0], &[7.0, 8.0]]);

// Build an expression tree (no computation yet)
let expr = a.as_ref().lazy() + b.as_ref().lazy();

// Evaluate to get the result
let result = expr.eval();
assert!((result[(0, 0)] - 6.0).abs() < 1e-10); // 1 + 5

// Chained operations with scaling
let scaled = (a.as_ref().lazy() + b.as_ref().lazy()).scale(2.0);
let result2 = scaled.eval();
assert!((result2[(0, 0)] - 12.0).abs() < 1e-10); // (1 + 5) * 2

// Fused multiply-add: alpha * A + beta * B
let fma_result = lazy_fma(2.0, a.as_ref().lazy(), 3.0, b.as_ref().lazy()).eval();
assert!((fma_result[(0, 0)] - 17.0).abs() < 1e-10); // 2*1 + 3*5

Available lazy operations:

  • Element-wise: +, -, negation, scale()
  • Matrix: matmul(), t() (transpose)
  • Complex: conj(), h() (Hermitian/conjugate transpose)
  • Fused: lazy_fma(), lazy_gemm()
  • Optimizations: .simplify() for double-transpose/scale/negation elimination
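
For example, .simplify() can fold a double transpose away before any work is done (a sketch built from the operations listed above):

use oxiblas::prelude::*;

let a: Mat<f64> = Mat::from_rows(&[&[1.0, 2.0], &[3.0, 4.0]]);

// (A^T)^T is just A; simplify() removes both transposes from the tree.
let result = a.as_ref().lazy().t().t().simplify().eval();
assert!((result[(0, 0)] - 1.0).abs() < 1e-10);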

Performance Guide

Choosing the Right Algorithm

OxiBLAS automatically selects optimized code paths based on matrix size:

Size     GEMM Strategy                    Expected Performance
< 32     Unrolled small-matrix kernels    Minimal overhead
32-512   BLIS-style blocked GEMM          ~70% peak
> 512    Auto-tuned blocking + parallel   ~80% peak
> 2048   Consider Strassen (optional)     Reduced complexity

Memory Layout

  • Column-major (default): Best for BLAS/LAPACK compatibility
  • Contiguous data: Avoid strided views when possible for 2-10x speedup
  • Aligned allocation: Mat<T> uses 64-byte alignment for SIMD

Parallelization

Enable the parallel feature and use Par::Rayon for large matrices:

use oxiblas::prelude::*;
use oxiblas::core::Par;

let a: Mat<f64> = Mat::filled(512, 512, 1.0);
let b: Mat<f64> = Mat::filled(512, 512, 1.0);
let mut c: Mat<f64> = Mat::zeros(512, 512);

// Sequential (default)
gemm(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut());

// Parallel - recommended for n >= 256
gemm_with_par(1.0, a.as_ref(), b.as_ref(), 0.0, c.as_mut(), Par::Rayon);

Cache Optimization Tips

  1. Blocking parameters: Use GemmBlocking::auto_tuned() for runtime tuning
  2. Panel reuse: Process multiple right-hand sides together when possible
  3. Prefetching: Enabled automatically in GEMM micro-kernels

Complex Numbers

For complex GEMM, the 3M algorithm (3 real multiplies per complex multiply) is used automatically for matrices of dimension >= 64, providing a ~25% speedup over the naive 4-multiply approach.
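
The identity behind 3M, shown for a single complex product (a sketch of the arithmetic, not the crate's internal kernel):

// (ar + i*ai) * (br + i*bi) using three real multiplies instead of four.
fn mul_3m(ar: f64, ai: f64, br: f64, bi: f64) -> (f64, f64) {
    let t1 = ar * br;
    let t2 = ai * bi;
    let t3 = (ar + ai) * (br + bi);
    (t1 - t2, t3 - t1 - t2) // (real part, imaginary part)
}

The saved multiply is paid for with extra additions, which are cheaper than multiplies at GEMM scale.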

Migration Guide

From ndarray-linalg

// ndarray-linalg
use ndarray::{Array1, Array2};
use ndarray_linalg::Solve;
let a = Array2::<f64>::eye(3);
let b = Array1::<f64>::ones(3); // Solve expects a 1-D right-hand side
let x = a.solve(&b).unwrap();

// OxiBLAS equivalent
use oxiblas::prelude::*;
let a = MatBuilder::<f64>::identity(3);
let b = MatBuilder::<f64>::ones(3, 1);
let x = solve(a.as_ref(), b.as_ref()).unwrap();

From nalgebra

// nalgebra
use nalgebra::{DMatrix, DVector};
let a = DMatrix::from_fn(3, 3, |i, j| (i + j) as f64);
let b = DVector::from_element(3, 1.0);
let lu = a.lu();
let x = lu.solve(&b).unwrap();

// OxiBLAS equivalent (with nalgebra feature)
use oxiblas::prelude::*;
use oxiblas::{MatNalgebraExt, DMatrixOxiblasExt};
let a: Mat<f64> = a.to_mat(); // Convert the nalgebra matrix from above
let b: Mat<f64> = b.to_mat();
let x = solve(a.as_ref(), b.as_ref()).unwrap();
let result = x.to_dmatrix(); // Convert back if needed

From NumPy (via Rust)

NumPy                   OxiBLAS
np.dot(a, b)            gemm(1.0, a, b, 0.0, c) or a.matmul(&b)
np.linalg.solve(a, b)   solve(a, b)
np.linalg.svd(a)        Svd::compute(a)
np.linalg.eig(a)        GeneralEvd::compute(a)
np.linalg.qr(a)         Qr::compute(a)
np.linalg.cholesky(a)   Cholesky::compute(a)
np.linalg.inv(a)        inv(a)
np.linalg.det(a)        det(a)

Architecture

Crate Hierarchy

oxiblas (umbrella)
├── oxiblas-core     # Scalar traits, SIMD, memory, parallelism
├── oxiblas-matrix   # Mat/MatRef/MatMut, storage formats
├── oxiblas-blas     # BLAS Level 1/2/3 operations
├── oxiblas-lapack   # Decompositions, solvers, eigenvalue
├── oxiblas-sparse   # Sparse formats, iterative solvers
└── oxiblas-ndarray  # ndarray integration

GEMM Optimization Stack

User API: gemm(alpha, a, b, beta, c)
    │
    ▼
Size dispatch: small (<32) → unrolled, large → blocked
    │
    ▼
Blocking: MC×KC×NC panels (auto-tuned to cache hierarchy)
    │
    ▼
Packing: pack_a (MC×KC), pack_b (KC×NC) with SIMD
    │
    ▼
Micro-kernel: MR×NR FMA loop (8×6 f64, 8×8 f32 on NEON/AVX2)
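
For intuition, here is a scalar stand-in for such a micro-kernel with MR = NR = 4, operating on packed panels as described above (the real kernels use SIMD FMA registers; the names and signature here are illustrative):

fn micro_kernel_4x4(kc: usize, a_panel: &[f64], b_panel: &[f64], c: &mut [f64], ldc: usize) {
    // Keep the 4x4 output block in "registers" (a local array) for the whole k-loop.
    let mut acc = [[0.0f64; 4]; 4];
    for p in 0..kc {
        // Step p consumes one MR-wide slice of packed A and one NR-wide
        // slice of packed B; packing made both contiguous in memory.
        for i in 0..4 {
            for j in 0..4 {
                acc[i][j] += a_panel[p * 4 + i] * b_panel[p * 4 + j];
            }
        }
    }
    // Accumulate into column-major C with leading dimension ldc.
    for j in 0..4 {
        for i in 0..4 {
            c[i + j * ldc] += acc[i][j];
        }
    }
}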

SIMD Abstraction

The oxiblas-core crate provides a unified SIMD abstraction:

  • SimdLevel: Runtime detection (Scalar, SSE42, AVX2, AVX512, NEON, SVE)
  • F64x2/F64x4: Portable f64 vector types with fallback
  • Architecture-specific intrinsics wrapped in safe functions

Memory Model

  • Arena allocation: Bump allocator for GEMM temporaries (zero malloc)
  • Aligned vectors: 64-byte aligned for AVX-512 compatibility
  • Copy-on-write: CowMat<T> for lazy cloning
  • Memory-mapped: MmapMat for large datasets (with mmap feature)