Fast matrix multiplication in Rust, built from scratch.
I built this to understand what makes BLAS fast. It turns out to be mostly three things: cache blocking, SIMD intrinsics, and FMA instructions. This crate implements all three, achieving ~62% of NumPy/OpenBLAS performance.
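As a rough illustration of the cache-blocking part (a simplified sketch, not the crate's actual implementation; the block size of 64 and the function name are arbitrary assumptions), the idea is to tile the loops so that the sub-blocks of A, B, and C being worked on stay resident in cache while they are reused:

// Simplified cache-blocked C += A * B for square n x n row-major matrices.
// Illustrative only: the real crate tunes block sizes for L1/L2 and calls SIMD kernels inside the tile.
const BLOCK: usize = 64; // assumed block size, not the crate's tuned value

fn matmul_blocked_sketch(a: &[f64], b: &[f64], c: &mut [f64], n: usize) {
    for ii in (0..n).step_by(BLOCK) {
        for kk in (0..n).step_by(BLOCK) {
            for jj in (0..n).step_by(BLOCK) {
                // Only this tile of A, B, and C is touched, so it stays cache-resident.
                for i in ii..(ii + BLOCK).min(n) {
                    for k in kk..(kk + BLOCK).min(n) {
                        let aik = a[i * n + k];
                        for j in jj..(jj + BLOCK).min(n) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}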
§Usage
use matmul::multiply;
let a = vec![1.0f64; 256 * 256];
let b = vec![1.0f64; 256 * 256];
let mut c = vec![0.0f64; 256 * 256];
multiply(&a, &b, &mut c, 256, 256, 256);
For large matrices, use the multi-threaded version:
use matmul::multiply_parallel;
let a = vec![1.0f64; 1024 * 1024];
let b = vec![1.0f64; 1024 * 1024];
let mut c = vec![0.0f64; 1024 * 1024];
multiply_parallel(&a, &b, &mut c, 1024, 1024, 1024, 4);
§What’s inside
- 4x4 and 12x4 AVX2 kernels (the sketch after this list shows the basic idea)
- 8x8 AVX-512 kernel
- Cache blocking tuned for L1/L2
- Adaptive multi-threading (scales down for small matrices)
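To show what such a SIMD kernel boils down to, here is a minimal 4x4 AVX2/FMA sketch, assuming x86_64, row-major operands, and dimensions that are multiples of 4; the crate's real kernels also use a 12x4 shape and packed operands, so treat this only as the shape of the technique, not its actual code:

use std::arch::x86_64::*;

// Sketch: C(4x4) += A(4xK) * B(Kx4), with lda/ldb/ldc as row strides in elements.
// x86_64-only and illustrative; not the crate's actual kernel.
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn kernel_4x4_sketch(
    k: usize,
    a: *const f64, lda: usize,
    b: *const f64, ldb: usize,
    c: *mut f64, ldc: usize,
) {
    // One 256-bit accumulator (4 doubles) per row of the 4x4 C tile.
    let mut acc = [_mm256_setzero_pd(); 4];
    for p in 0..k {
        // Load 4 contiguous B values, broadcast one A value per row, then fused multiply-add.
        let bv = _mm256_loadu_pd(b.add(p * ldb));
        for r in 0..4 {
            let av = _mm256_set1_pd(*a.add(r * lda + p));
            acc[r] = _mm256_fmadd_pd(av, bv, acc[r]);
        }
    }
    // Accumulate the register tile back into C.
    for r in 0..4 {
        let cv = _mm256_loadu_pd(c.add(r * ldc));
        _mm256_storeu_pd(c.add(r * ldc), _mm256_add_pd(acc[r], cv));
    }
}

A caller would only dispatch to a kernel like this after confirming CPU support, e.g. via is_x86_feature_detected!("avx2") and is_x86_feature_detected!("fma").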
Re-exports§
pub use matrix::naive_ijk::matmul_naive_ijk;
pub use matrix::naive_ikj::matmul_naive_ikj;
pub use matrix::transpose::transpose;
Modules§
- blocked - Cache-blocked GEMM implementations.
- kernels - SIMD microkernels for the inner loop of matrix multiplication.
- matrix - Basic matrix operations and naive implementations.
- threaded - Multi-threaded GEMM implementations.
Functions§
- multiply - Matrix multiply: C += A * B.
- multiply_parallel - Same as multiply but uses multiple threads.