High-performance CPU Matrix Multiplication (GEMM) kernel.
This kernel uses the matrixmultiply crate, which implements a BLIS-style
macro/microkernel approach with cache-aware blocking, SIMD vectorization
(AVX/FMA/SSE2/NEON), and optional multithreading.
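To give a rough feel for what blocking means, here is a minimal plain-Rust sketch of a loop-blocked GEMM over contiguous row-major inputs. It is not the matrixmultiply implementation: the function name `matmul_blocked` and the single `block` parameter are hypothetical, and a real BLIS-style kernel would typically also pack operand panels and run a SIMD microkernel in place of the innermost loops.

```rust
/// Illustrative loop-blocked GEMM: C = A @ B.
/// A is m x k, B is k x n, both contiguous row-major; C is returned as m x n.
/// `block` is a hypothetical tile size shared by all three dimensions.
fn matmul_blocked(m: usize, k: usize, n: usize, a: &[f32], b: &[f32], block: usize) -> Vec<f32> {
    assert!(block > 0, "block size must be non-zero");
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);

    let mut c = vec![0.0f32; m * n];

    // Outer ("macro") loops walk over tiles so that each tile of A, B and C
    // is reused while it is still likely to be in cache.
    for jc in (0..n).step_by(block) {
        let j_end = (jc + block).min(n);
        for pc in (0..k).step_by(block) {
            let p_end = (pc + block).min(k);
            for ic in (0..m).step_by(block) {
                let i_end = (ic + block).min(m);

                // Stand-in for the microkernel: a scalar update of one C tile.
                for i in ic..i_end {
                    for p in pc..p_end {
                        let a_ip = a[i * k + p];
                        for j in jc..j_end {
                            c[i * n + j] += a_ip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
    c
}

fn main() {
    let a: [f32; 6] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]; // 2 x 3
    let b: [f32; 6] = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]; // 3 x 2
    let c = matmul_blocked(2, 3, 2, &a, &b, 2);
    println!("{:?}", c);
}
```

The tile size only changes the order in which partial products are accumulated, not which products are computed.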
The kernel strictly respects memory strides. If a user transposes a tensor (a zero-copy, O(1) operation), the kernel reads the memory in transposed order without allocating a transposed copy of the tensor.
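The stride handling can be illustrated with a naive sketch in plain Rust. This is not the kernel itself: `matmul_strided` and its parameter names are hypothetical, strides are given in elements, and the output is assumed to be contiguous row-major. The point is only that a transposed operand is the same buffer read with its row and column strides swapped.

```rust
/// Naive strided GEMM: C = A @ B, where A is m x k and B is k x n.
/// `rsa`/`csa` (and `rsb`/`csb`) are the row and column strides of A (and B),
/// measured in elements. C is returned as a contiguous row-major m x n buffer.
fn matmul_strided(
    m: usize, k: usize, n: usize,
    a: &[f32], rsa: usize, csa: usize,
    b: &[f32], rsb: usize, csb: usize,
) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                // Element (i, p) of A and (p, j) of B, located via strides.
                acc += a[i * rsa + p * csa] * b[p * rsb + j * csb];
            }
            c[i * n + j] = acc;
        }
    }
    c
}

fn main() {
    // A is 2 x 3, row-major: [[1, 2, 3], [4, 5, 6]] (row stride 3, column stride 1).
    let a: [f32; 6] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];

    // A^T is 3 x 2. It is the same buffer with the strides swapped
    // (row stride 1, column stride 3), so no data is copied.
    let c = matmul_strided(2, 3, 2, &a, 3, 1, &a, 1, 3);

    // C = A @ A^T
    println!("{:?}", c); // prints [14.0, 32.0, 32.0, 77.0]
}
```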
Functions
- matmul_forward - Executes the physical forward pass for 2D matrix multiplication: C = A @ B