
Crate matmul


Fast matrix multiplication in Rust, built from scratch.

I built this to understand what makes BLAS fast. It turns out to be three things: cache blocking, SIMD intrinsics, and FMA instructions. This crate implements all three, achieving roughly 62% of NumPy/OpenBLAS performance.
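As a taste of the FMA ingredient: Rust's `f64::mul_add` performs a fused multiply-add (one rounding instead of two), which the compiler can lower to a hardware FMA instruction when the target supports it. A minimal illustrative dot product — the `dot` function below is a sketch, not part of this crate:

```rust
// Illustrative sketch (not part of this crate): x.mul_add(y, acc) computes
// x * y + acc with a single rounding, and can compile to a hardware FMA
// instruction on targets that support it (e.g. -C target-feature=+fma).
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).fold(0.0, |acc, (&x, &y)| x.mul_add(y, acc))
}

fn main() {
    let a = [1.0, 2.0, 3.0];
    let b = [4.0, 5.0, 6.0];
    // 1*4 + 2*5 + 3*6 = 32
    assert_eq!(dot(&a, &b), 32.0);
}
```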

§Usage

use matmul::multiply;

let a = vec![1.0f64; 256 * 256];
let b = vec![1.0f64; 256 * 256];
let mut c = vec![0.0f64; 256 * 256];

multiply(&a, &b, &mut c, 256, 256, 256); // C += A * B for 256×256 matrices

For large matrices, use the multi-threaded version:

use matmul::multiply_parallel;

let a = vec![1.0f64; 1024 * 1024];
let b = vec![1.0f64; 1024 * 1024];
let mut c = vec![0.0f64; 1024 * 1024];

multiply_parallel(&a, &b, &mut c, 1024, 1024, 1024, 4);

§What’s inside

  • 4×4 and 12×4 AVX2 kernels
  • 8×8 AVX-512 kernel
  • Cache blocking tuned for L1/L2
  • Adaptive multi-threading (scales down for small matrices)
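The cache-blocking idea can be sketched in plain Rust. This is an illustrative square-matrix version, not the crate's internal code; `matmul_blocked` and the `BLK` tile size are made-up names, and real tile sizes are tuned per cache level:

```rust
// Illustrative cache-blocking sketch (not the crate's internal code).
// The matrices are processed in BLK×BLK tiles so that each tile triple's
// working set fits in L1/L2 cache; BLK = 64 is an arbitrary example value.
const BLK: usize = 64;

fn matmul_blocked(a: &[f64], b: &[f64], c: &mut [f64], n: usize) {
    for ii in (0..n).step_by(BLK) {
        for kk in (0..n).step_by(BLK) {
            for jj in (0..n).step_by(BLK) {
                // Micro-loop over one tile: C[i][j] += A[i][k] * B[k][j]
                for i in ii..(ii + BLK).min(n) {
                    for k in kk..(kk + BLK).min(n) {
                        let aik = a[i * n + k];
                        for j in jj..(jj + BLK).min(n) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    // n = 70 deliberately crosses a 64-wide tile boundary.
    let n = 70;
    let a = vec![1.0f64; n * n];
    let b = vec![2.0f64; n * n];
    let mut c = vec![0.0f64; n * n];
    matmul_blocked(&a, &b, &mut c, n);
    // Every entry is the sum over k of 1.0 * 2.0, i.e. 2.0 * n = 140.0
    assert!(c.iter().all(|&x| x == 140.0));
}
```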

Re-exports§

pub use matrix::naive_ijk::matmul_naive_ijk;
pub use matrix::naive_ikj::matmul_naive_ikj;
pub use matrix::transpose::transpose;
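The two re-exported naive kernels differ only in loop order, which matters for cache behaviour: with row-major storage, the ikj order walks B and C contiguously in the inner loop, while ijk strides through B. A self-contained sketch of the two orders — these free functions are illustrative stand-ins, not the re-exported ones:

```rust
// Illustrative stand-ins, not the crate's re-exports. Both compute
// C += A * B for row-major square matrices; only the loop order differs.
fn naive_ijk(a: &[f64], b: &[f64], c: &mut [f64], n: usize) {
    for i in 0..n {
        for j in 0..n {
            let mut sum = 0.0;
            for k in 0..n {
                sum += a[i * n + k] * b[k * n + j]; // B accessed with stride n
            }
            c[i * n + j] += sum;
        }
    }
}

fn naive_ikj(a: &[f64], b: &[f64], c: &mut [f64], n: usize) {
    for i in 0..n {
        for k in 0..n {
            let aik = a[i * n + k];
            for j in 0..n {
                c[i * n + j] += aik * b[k * n + j]; // B and C accessed contiguously
            }
        }
    }
}

fn main() {
    let n = 5;
    let a: Vec<f64> = (0..n * n).map(|x| x as f64).collect();
    let b: Vec<f64> = (0..n * n).map(|x| (x % 7) as f64).collect();
    let (mut c1, mut c2) = (vec![0.0; n * n], vec![0.0; n * n]);
    naive_ijk(&a, &b, &mut c1, n);
    naive_ikj(&a, &b, &mut c2, n);
    assert_eq!(c1, c2); // both orders produce the same result
}
```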

Modules§

blocked
Cache-blocked GEMM implementations.
kernels
SIMD microkernels for the inner loop of matrix multiplication.
matrix
Basic matrix operations and naive implementations.
threaded
Multi-threaded GEMM implementations.

Functions§

multiply
Matrix multiply: C += A * B
multiply_parallel
Same as multiply, but splits the work across multiple threads.
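One way multiply_parallel-style threading can be organised is to partition C's rows across scoped threads, so each thread writes a disjoint slice and no locking is needed. A hedged sketch assuming square matrices; `matmul_parallel` here is an illustrative stand-in, not this crate's function:

```rust
use std::thread;

// Illustrative row-partitioned threading sketch (matmul_parallel here is a
// stand-in, not this crate's function). C's rows are split into disjoint
// mutable chunks via chunks_mut, one chunk per thread, so no locks are needed.
fn matmul_parallel(a: &[f64], b: &[f64], c: &mut [f64], n: usize, threads: usize) {
    let rows_per = (n + threads - 1) / threads; // ceiling division
    thread::scope(|s| {
        for (t, c_chunk) in c.chunks_mut(rows_per * n).enumerate() {
            let a_rows = &a[t * rows_per * n..]; // this thread's rows of A
            s.spawn(move || {
                for (i, c_row) in c_chunk.chunks_mut(n).enumerate() {
                    for k in 0..n {
                        let aik = a_rows[i * n + k];
                        for j in 0..n {
                            c_row[j] += aik * b[k * n + j];
                        }
                    }
                }
            });
        }
    });
}

fn main() {
    let n = 10;
    let a: Vec<f64> = (0..n * n).map(|x| x as f64).collect();
    let b = vec![1.0f64; n * n];
    let mut c = vec![0.0f64; n * n];
    matmul_parallel(&a, &b, &mut c, n, 3);
    // With B all ones, row i of C holds the sum of row i of A in every
    // column; row 0 of A is 0..10, which sums to 45.
    assert_eq!(c[0], 45.0);
}
```

`std::thread::scope` lets the spawned threads borrow `a`, `b`, and the chunks of `c` directly from the caller's stack, which keeps the sketch allocation-free.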