# oxiblas-blas

BLAS (Basic Linear Algebra Subprograms) operations for OxiBLAS.

## Overview

`oxiblas-blas` implements the complete BLAS API in pure Rust with SIMD optimization and optional parallelization. It provides Level 1 (vector-vector), Level 2 (matrix-vector), and Level 3 (matrix-matrix) operations with performance competitive with OpenBLAS.
## Features

### Complete BLAS Implementation
- Level 1: 15+ vector-vector operations
- Level 2: 19+ matrix-vector operations (including banded and packed variants)
- Level 3: 9+ matrix-matrix operations
- Extended Operations: Tensor operations, Einstein summation, batched operations
### High Performance
- SIMD-optimized kernels for x86_64 (AVX2/AVX-512) and AArch64 (NEON)
- Cache-aware blocking for optimal L1/L2/L3 utilization
- Parallel execution via Rayon for large matrices (with the `parallel` feature)
- Competitive with OpenBLAS: 80-172% performance depending on operation and platform
## Installation

```toml
[dependencies]
oxiblas-blas = "0.1"

# With parallelization
oxiblas-blas = { version = "0.1", features = ["parallel"] }
```
## BLAS Level 1 (Vector-Vector)
| Function | Description | Complexity |
|---|---|---|
| `dot` | Dot product: x · y | O(n) |
| `axpy` | y = α×x + y | O(n) |
| `scal` | x = α×x | O(n) |
| `copy` | y = x | O(n) |
| `swap` | Swap x and y | O(n) |
| `nrm2` | Euclidean norm: ‖x‖₂ | O(n) |
| `asum` | Sum of absolute values: Σ\|xᵢ\| | O(n) |
| `iamax` | Index of max absolute value | O(n) |
| `rot` | Apply Givens rotation | O(n) |
| `rotg` | Generate Givens rotation | O(1) |
| `rotm` | Apply modified Givens rotation | O(n) |
| `rotmg` | Generate modified Givens rotation | O(1) |
### Usage Example

```rust
use oxiblas_blas::{axpy, dot, nrm2};

// Call signatures reconstructed from the results shown; see the crate docs.
let x = vec![1.0, 2.0, 3.0, 4.0];
let y = vec![5.0, 6.0, 7.0, 8.0];

// Dot product
let result = dot(&x, &y); // 70.0

// AXPY: y = 2.5*x + y
let mut y = vec![1.0, 2.0, 3.0, 4.0];
axpy(2.5, &x, &mut y);
// y = [3.5, 7.0, 10.5, 14.0]

// Euclidean norm
let norm = nrm2(&x); // sqrt(30) ≈ 5.477
```
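The `rotg`/`rot` pair generates and applies a plane rotation that zeroes one component of a 2-vector. The underlying math can be sketched in plain Rust, independent of the crate's API:

```rust
// Generate a Givens rotation (c, s) that zeroes the second component of
// (a, b), returning (c, s, r) with r = sqrt(a² + b²) -- the math behind rotg.
fn givens(a: f64, b: f64) -> (f64, f64, f64) {
    let r = (a * a + b * b).sqrt();
    if r == 0.0 {
        (1.0, 0.0, 0.0)
    } else {
        (a / r, b / r, r)
    }
}

fn main() {
    let (c, s, r) = givens(3.0, 4.0);
    // Applying the rotation (as rot does) maps (3, 4) to (r, 0)
    let x_new = c * 3.0 + s * 4.0; // ≈ 5.0
    let y_new = -s * 3.0 + c * 4.0; // ≈ 0.0
    println!("c = {c}, s = {s}, r = {r}, rotated = ({x_new}, {y_new})");
}
```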
## BLAS Level 2 (Matrix-Vector)
| Function | Description | Complexity |
|---|---|---|
| `gemv` | General matrix-vector: y = α×A×x + β×y | O(mn) |
| `symv` | Symmetric matrix-vector | O(n²) |
| `hemv` | Hermitian matrix-vector (complex) | O(n²) |
| `trmv` | Triangular matrix-vector | O(n²) |
| `trsv` | Triangular solve: x = A⁻¹×b | O(n²) |
| `ger` | Rank-1 update: A = α×x×yᵀ + A | O(mn) |
| `syr` | Symmetric rank-1: A = α×x×xᵀ + A | O(n²) |
| `her` | Hermitian rank-1 (complex) | O(n²) |
| `syr2` | Symmetric rank-2 | O(n²) |
| `her2` | Hermitian rank-2 (complex) | O(n²) |
Banded & packed variants: `gbmv`, `sbmv`, `hbmv`, `tbmv`, `tbsv`, `spmv`, `hpmv`, `tpmv`, `tpsv`.
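The banded variants work on compressed band storage rather than a full matrix. A plain-Rust reference for the computation behind `gbmv` (this sketch assumes the usual BLAS band layout, `ab[ku + i - j][j] = A[i][j]`; it is an illustration, not the crate's API):

```rust
// Reference banded matrix-vector product y = A·x: an n×n matrix with kl
// sub- and ku super-diagonals is stored in kl+ku+1 rows, so only entries
// inside the band are touched.
fn gbmv_ref(n: usize, kl: usize, ku: usize, ab: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    let mut y = vec![0.0; n];
    for i in 0..n {
        let j_lo = i.saturating_sub(kl);
        let j_hi = (i + ku).min(n - 1);
        for j in j_lo..=j_hi {
            y[i] += ab[ku + i - j][j] * x[j];
        }
    }
    y
}

fn main() {
    // Tridiagonal A = [[2,1,0],[1,2,1],[0,1,2]] in band storage (kl = ku = 1)
    let ab = vec![
        vec![0.0, 1.0, 1.0], // superdiagonal (first entry unused)
        vec![2.0, 2.0, 2.0], // main diagonal
        vec![1.0, 1.0, 0.0], // subdiagonal (last entry unused)
    ];
    let y = gbmv_ref(3, 1, 1, &ab, &[1.0, 1.0, 1.0]);
    println!("{y:?}"); // [3.0, 4.0, 3.0]
}
```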
### Usage Example

```rust
use oxiblas_blas::gemv;
use oxiblas_matrix::Mat;

// Values chosen to match the result below; call signatures illustrative.
let a = Mat::from_rows(&[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]);
let x = vec![1.0, 2.0, 3.0];
let mut y = vec![0.0, 0.0];

// y = A × x  (gemv computes y = α×A×x + β×y)
gemv(1.0, &a, &x, 0.0, &mut y);
// y = [14.0, 32.0]
```
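For comparison, the lower-triangular case of `trsv` is just forward substitution; a minimal plain-Rust reference (not the crate's API):

```rust
// Reference lower-triangular solve: solve L·x = b in place by forward
// substitution -- the computation behind trsv.
fn trsv_lower_ref(l: &[Vec<f64>], b: &mut [f64]) {
    for i in 0..b.len() {
        let mut s = b[i];
        for j in 0..i {
            s -= l[i][j] * b[j]; // subtract already-solved components
        }
        b[i] = s / l[i][i];
    }
}

fn main() {
    let l = vec![vec![2.0, 0.0], vec![1.0, 4.0]];
    let mut b = vec![4.0, 10.0];
    trsv_lower_ref(&l, &mut b);
    println!("{b:?}"); // [2.0, 2.0]
}
```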
## BLAS Level 3 (Matrix-Matrix)
| Function | Description | Complexity | Performance |
|---|---|---|---|
| `gemm` | General matrix multiply: C = α×A×B + β×C | O(n³) | 79-172% of OpenBLAS |
| `symm` | Symmetric matrix multiply | O(n³) | Optimized |
| `hemm` | Hermitian matrix multiply (complex) | O(n³) | Optimized |
| `trmm` | Triangular matrix multiply | O(n³) | 7-11× vs naive |
| `trsm` | Triangular solve, multiple RHS | O(n³) | 10× vs naive |
| `syrk` | Symmetric rank-k: C = α×A×Aᵀ + β×C | O(n²k) | 6-15× vs naive |
| `herk` | Hermitian rank-k (complex) | O(n²k) | 6-12× vs naive |
| `syr2k` | Symmetric rank-2k | O(n²k) | 6-15× vs naive |
| `her2k` | Hermitian rank-2k (complex) | O(n²k) | 6-15× vs naive |
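As a reference for what `syrk` computes, the sketch below (plain Rust, not the crate's API) forms C = α×A×Aᵀ + β×C on the lower triangle only and mirrors it, which is why rank-k updates cost roughly half the flops of a general `gemm`:

```rust
// Reference symmetric rank-k update: C = alpha·A·Aᵀ + beta·C. Only the
// lower triangle is computed, then mirrored, since the result is symmetric.
fn syrk_ref(alpha: f64, a: &[Vec<f64>], beta: f64, c: &mut [Vec<f64>]) {
    let n = a.len();
    let k = a[0].len();
    for i in 0..n {
        for j in 0..=i {
            let mut s = 0.0;
            for p in 0..k {
                s += a[i][p] * a[j][p]; // (A·Aᵀ)[i][j]
            }
            c[i][j] = alpha * s + beta * c[i][j];
            c[j][i] = c[i][j]; // mirror into the upper triangle
        }
    }
}

fn main() {
    let a = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let mut c = vec![vec![0.0; 2]; 2];
    syrk_ref(1.0, &a, 0.0, &mut c);
    println!("{c:?}"); // [[5.0, 11.0], [11.0, 25.0]]
}
```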
### GEMM Example

```rust
use oxiblas_blas::gemm;
use oxiblas_matrix::Mat;

// Values chosen to match the result below; call signatures illustrative.
let a = Mat::from_rows(&[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]);
let b = Mat::from_rows(&[[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]);
let mut c = Mat::zeros(2, 2);

// C = A × B  (gemm computes C = α×A×B + β×C)
gemm(1.0, &a, &b, 0.0, &mut c);
// C = [[58, 64], [139, 154]]
```
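The cache-aware blocking behind `gemm` can be illustrated with a toy blocked triple loop: the matrices are processed in small tiles so that tiles of A, B, and C stay resident in cache. Production kernels add packing and SIMD micro-kernels; `BS = 2` here is an arbitrary tile size for demonstration, not the crate's tuned value:

```rust
// Toy cache-blocked matrix multiply: C += A·B processed in BS×BS tiles.
const BS: usize = 2;

fn gemm_blocked(a: &[Vec<f64>], b: &[Vec<f64>], c: &mut [Vec<f64>]) {
    let (m, k, n) = (a.len(), b.len(), b[0].len());
    for ii in (0..m).step_by(BS) {
        for kk in (0..k).step_by(BS) {
            for jj in (0..n).step_by(BS) {
                // Multiply one BS×BS tile of A into one BS×BS tile of C
                for i in ii..(ii + BS).min(m) {
                    for p in kk..(kk + BS).min(k) {
                        let aip = a[i][p];
                        for j in jj..(jj + BS).min(n) {
                            c[i][j] += aip * b[p][j];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    // A 2×3 times B 3×2
    let a = vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]];
    let b = vec![vec![7.0, 8.0], vec![9.0, 10.0], vec![11.0, 12.0]];
    let mut c = vec![vec![0.0; 2]; 2];
    gemm_blocked(&a, &b, &mut c);
    println!("{c:?}"); // [[58.0, 64.0], [139.0, 154.0]]
}
```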
## Tensor Operations

### Einstein Summation

Supports 24 tensor contraction patterns:
```rust
use oxiblas_blas::einsum;

// Subscript strings follow standard einsum notation; call form illustrative.

// Matrix multiplication: C = A × B
let c = einsum("ij,jk->ik", &[&a, &b])?;

// Outer product: C = x ⊗ y
let c = einsum("i,j->ij", &[&x, &y])?;

// Transpose: Aᵀ
let at = einsum("ij->ji", &[&a])?;

// Matrix trace: tr(A)
let trace = einsum("ii->", &[&a])?;

// Hadamard (element-wise) product: C = A ⊙ B
let c = einsum("ij,ij->ij", &[&a, &b])?;
```
Supported patterns: matmul, transpose, trace, diagonal, tensor transposes, reductions, products, batched operations
### Batched Operations

```rust
use oxiblas_blas::{batched_matmul, Tensor3};

// Batch of matrix multiplications (constructor arguments elided)
let a = Tensor3::from_data(/* batch × m × k data */);
let b = Tensor3::from_data(/* batch × k × n data */);
let c = batched_matmul(&a, &b)?;
// c[i] = a[i] × b[i] for each batch i
```
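The batched semantics are equivalent to a loop of independent multiplies; a plain-Rust reference over `Vec`-of-matrices (illustrative only, not the crate's `Tensor3` type):

```rust
// Reference batched multiply: c[i] = a[i]·b[i] for each batch index i.
fn batched_matmul_ref(a: &[Vec<Vec<f64>>], b: &[Vec<Vec<f64>>]) -> Vec<Vec<Vec<f64>>> {
    a.iter()
        .zip(b)
        .map(|(ai, bi)| {
            let (m, k, n) = (ai.len(), bi.len(), bi[0].len());
            let mut ci = vec![vec![0.0; n]; m];
            for i in 0..m {
                for p in 0..k {
                    for j in 0..n {
                        ci[i][j] += ai[i][p] * bi[p][j];
                    }
                }
            }
            ci
        })
        .collect()
}

fn main() {
    let a = vec![vec![vec![1.0, 2.0]], vec![vec![3.0, 4.0]]]; // two 1×2 matrices
    let b = vec![vec![vec![5.0], vec![6.0]], vec![vec![7.0], vec![8.0]]]; // two 2×1
    let c = batched_matmul_ref(&a, &b);
    println!("{c:?}"); // [[[17.0]], [[53.0]]]
}
```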
## Extended Precision

```rust
use oxiblas_blas::{dot_kahan, dot_pairwise, dsdot};

// Call signatures illustrative; see the crate docs.
let x = vec![1.0, 2.0, 3.0, 4.0];
let y = vec![5.0, 6.0, 7.0, 8.0];

// Kahan (compensated) summation for accuracy
let result_kahan = dot_kahan(&x, &y);

// Pairwise summation (divide-and-conquer)
let result_pairwise = dot_pairwise(&x, &y);

// Mixed precision: f32 computation, f64 accumulation
let x_f32: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0];
let y_f32: Vec<f32> = vec![5.0, 6.0, 7.0, 8.0];
let result_f64 = dsdot(&x_f32, &y_f32);
```
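The compensated summation technique behind `dot_kahan` fits in a few lines of plain Rust: each step carries a correction term that captures the rounding error of the previous addition, so error stays O(1) instead of growing with n:

```rust
// Compensated (Kahan) dot product: a running correction term absorbs the
// low-order bits lost when each product is added to the sum.
fn dot_kahan_ref(x: &[f64], y: &[f64]) -> f64 {
    let mut sum = 0.0;
    let mut comp = 0.0; // running compensation for lost low-order bits
    for (xi, yi) in x.iter().zip(y) {
        let term = xi * yi - comp;
        let t = sum + term;
        comp = (t - sum) - term; // rounding error of this addition
        sum = t;
    }
    sum
}

fn main() {
    let x = [1.0, 2.0, 3.0, 4.0];
    let y = [5.0, 6.0, 7.0, 8.0];
    println!("{}", dot_kahan_ref(&x, &y)); // 70
}
```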
## Performance

### macOS M3 (Apple Silicon NEON)
| Operation | Size | OxiBLAS | OpenBLAS | Ratio |
|---|---|---|---|---|
| DGEMM (f64) | 1024×1024 | 40.25 ms (427 Gf/s) | 40.54 ms (424 Gf/s) | 101% 🟢 |
| SGEMM (f32) | 1024×1024 | 19.18 ms (896 Gf/s) | 32.94 ms (521 Gf/s) | 172% 🟢 |
| DOT (f64) | 1M | 167 µs (6.0 Gelem/s) | 279 µs (3.6 Gelem/s) | 165% 🟢 |
### Linux x86_64 (Intel Xeon AVX2)
| Operation | Size | OxiBLAS | OpenBLAS | Ratio |
|---|---|---|---|---|
| DGEMM (f64) | 1024×1024 | 80.68 ms (213 Gf/s) | 82.51 ms (208 Gf/s) | 102% 🟢 |
| SGEMM (f32) | 64×64 | 16.60 µs (253 Gf/s) | 18.64 µs (225 Gf/s) | 112% 🟢 |
| DGEMM (f64) | 256×256 | 1.220 ms (220 Gf/s) | 1.159 ms (232 Gf/s) | 95% 🟢 |
## Feature Flags

| Feature | Description | Default |
|---|---|---|
| `default` | Core BLAS operations | ✓ |
| `parallel` | Rayon-based parallelization | |
## Architecture

```text
oxiblas-blas/
├── level1/            # Vector-vector operations
│   ├── dot.rs
│   ├── axpy.rs
│   ├── scal.rs
│   └── ...
├── level2/            # Matrix-vector operations
│   ├── gemv.rs
│   ├── ger.rs
│   ├── symv.rs
│   └── ...
├── level3/            # Matrix-matrix operations
│   ├── gemm.rs        # General matrix multiply
│   ├── gemm_kernel.rs # SIMD micro-kernels
│   ├── autotune.rs    # Cache-aware blocking
│   ├── syrk.rs
│   ├── trmm.rs
│   └── ...
└── tensor/            # Tensor operations
    ├── einsum.rs      # Einstein summation
    ├── batched.rs     # Batched operations
    └── tensor3.rs     # 3D tensor type
```
## Examples

```bash
# Run BLAS examples
```

## Benchmarks

```bash
# Level 1 benchmarks
# Level 2 benchmarks
# Level 3 benchmarks
# Compare with OpenBLAS
```
## Related Crates

- `oxiblas-core` - Core traits and SIMD
- `oxiblas-matrix` - Matrix types
- `oxiblas-lapack` - LAPACK decompositions
- `oxiblas` - Meta-crate
## License

Licensed under MIT or Apache-2.0, at your option.