Fast matrix multiplication in Rust, built from scratch.
I built this to understand what makes BLAS fast. It turns out to be mostly three things: cache blocking, SIMD intrinsics, and FMA instructions. This crate implements all three, achieving ~62% of NumPy/OpenBLAS performance.
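As a rough illustration of the cache-blocking part (a simplified sketch, not the crate's actual implementation; the block size of 64 and the function name are arbitrary assumptions), the idea is to tile the loops so that the sub-blocks of A, B, and C being worked on stay resident in cache while they are reused:

// Simplified cache-blocked C += A * B for square n x n row-major matrices.
// Illustrative only: the real crate tunes block sizes for L1/L2 and calls SIMD kernels inside the tile.
const BLOCK: usize = 64; // assumed block size, not the crate's tuned value

fn matmul_blocked_sketch(a: &[f64], b: &[f64], c: &mut [f64], n: usize) {
    for ii in (0..n).step_by(BLOCK) {
        for kk in (0..n).step_by(BLOCK) {
            for jj in (0..n).step_by(BLOCK) {
                // Only this tile of A, B, and C is touched, so it stays cache-resident.
                for i in ii..(ii + BLOCK).min(n) {
                    for k in kk..(kk + BLOCK).min(n) {
                        let aik = a[i * n + k];
                        for j in jj..(jj + BLOCK).min(n) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}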
§Usage
use matmul::multiply;
let a = vec![1.0f64; 256 * 256];
let b = vec![1.0f64; 256 * 256];
let mut c = vec![0.0f64; 256 * 256];
multiply(&a, &b, &mut c, 256, 256, 256);
For large matrices, use the multi-threaded version:
use matmul::multiply_parallel;
let a = vec![1.0f64; 1024 * 1024];
let b = vec![1.0f64; 1024 * 1024];
let mut c = vec![0.0f64; 1024 * 1024];
multiply_parallel(&a, &b, &mut c, 1024, 1024, 1024, 4);
§What’s inside
- 4x4 and 12x4 AVX2 kernels (the sketch after this list shows the basic idea)
- 8x8 AVX-512 kernel
- Cache blocking tuned for L1/L2
- Adaptive multi-threading (scales down for small matrices)
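To show what such a SIMD kernel boils down to, here is a minimal 4x4 AVX2/FMA sketch, assuming x86_64, row-major operands, and dimensions that are multiples of 4; the crate's real kernels also use a 12x4 shape and packed operands, so treat this only as the shape of the technique, not its actual code:

use std::arch::x86_64::*;

// Sketch: C(4x4) += A(4xK) * B(Kx4), with lda/ldb/ldc as row strides in elements.
// x86_64-only and illustrative; not the crate's actual kernel.
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn kernel_4x4_sketch(
    k: usize,
    a: *const f64, lda: usize,
    b: *const f64, ldb: usize,
    c: *mut f64, ldc: usize,
) {
    // One 256-bit accumulator (4 doubles) per row of the 4x4 C tile.
    let mut acc = [_mm256_setzero_pd(); 4];
    for p in 0..k {
        // Load 4 contiguous B values, broadcast one A value per row, then fused multiply-add.
        let bv = _mm256_loadu_pd(b.add(p * ldb));
        for r in 0..4 {
            let av = _mm256_set1_pd(*a.add(r * lda + p));
            acc[r] = _mm256_fmadd_pd(av, bv, acc[r]);
        }
    }
    // Accumulate the register tile back into C.
    for r in 0..4 {
        let cv = _mm256_loadu_pd(c.add(r * ldc));
        _mm256_storeu_pd(c.add(r * ldc), _mm256_add_pd(acc[r], cv));
    }
}

A caller would only dispatch to a kernel like this after confirming CPU support, e.g. via is_x86_feature_detected!("avx2") and is_x86_feature_detected!("fma").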
Re-exports§
pub use matrix::naive_ijk::matmul_naive_ijk;
pub use matrix::naive_ikj::matmul_naive_ikj;
pub use matrix::transpose::transpose;
Modules§
- blocked - Cache-blocked GEMM implementations.
- kernels - SIMD microkernels for the inner loop of matrix multiplication.
- matrix - Basic matrix operations and naive implementations.
- threaded - Multi-threaded GEMM implementations.
Functions§
- multiply - Matrix multiply: C += A * B.
- multiply_parallel - Same as multiply but uses multiple threads.