General matrix multiplication (GEMM) for f32 and f64 matrices.
Supports matrices with arbitrary row and column strides.
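
To make the stride convention concrete, here is a minimal reference sketch. It assumes that element (i, j) of a matrix is stored at offset `i * row_stride + j * col_stride`. The function `gemm_strided_ref` is a hypothetical helper written for illustration only; it is not this crate's API and makes no attempt at performance.

```rust
/// Naive reference: C = alpha * A * B + beta * C for strided matrices.
/// A is m x k, B is k x n, C is m x n.
/// Element (i, j) of a matrix is read from index `i * rs + j * cs`.
/// Hypothetical helper for illustration; not this crate's API.
pub fn gemm_strided_ref(
    m: usize, k: usize, n: usize,
    alpha: f64,
    a: &[f64], rsa: usize, csa: usize,
    b: &[f64], rsb: usize, csb: usize,
    beta: f64,
    c: &mut [f64], rsc: usize, csc: usize,
) {
    for i in 0..m {
        for j in 0..n {
            // Dot product of row i of A with column j of B.
            let mut acc = 0.0;
            for p in 0..k {
                acc += a[i * rsa + p * csa] * b[p * rsb + j * csb];
            }
            let idx = i * rsc + j * csc;
            c[idx] = alpha * acc + beta * c[idx];
        }
    }
}
```

With this convention, a row-major m x n matrix has strides (n, 1) and a column-major one has strides (1, m), so the two layouts can be mixed in one call, and a transpose is just a swap of the two strides.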
Uses the same microkernel algorithm as BLIS, but in a much simpler and less featureful implementation. See the BLIS multithreading page for a very good diagram of how the algorithm partitions the matrix. (Note: this crate does not implement multithreading.)
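
The partitioning can be sketched as a five-level loop nest around a small microkernel. This is only a rough sketch of the BLIS-style blocking idea, under several assumptions: the matrices are contiguous and row-major, the packing of A and B into contiguous buffers is omitted, and the block sizes `MC`, `KC`, `NC`, `MR`, `NR` are small illustrative values rather than anything this crate uses. The name `gemm_blocked_sketch` is hypothetical.

```rust
// Illustrative block sizes (assumptions); real implementations tune these
// to the cache hierarchy and register file of the target CPU.
const NC: usize = 8; // columns of B / C per outer block
const KC: usize = 4; // depth of each rank-k update
const MC: usize = 4; // rows of A / C per block
const MR: usize = 2; // microkernel tile rows
const NR: usize = 2; // microkernel tile columns

/// C += A * B for contiguous row-major matrices.
/// A is m x k, B is k x n, C is m x n. Packing is omitted for brevity.
pub fn gemm_blocked_sketch(m: usize, k: usize, n: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    for jc in (0..n).step_by(NC) {
        let nc = NC.min(n - jc);
        for pc in (0..k).step_by(KC) {
            let kc = KC.min(k - pc);
            // A full implementation would pack a kc x nc panel of B here.
            for ic in (0..m).step_by(MC) {
                let mc = MC.min(m - ic);
                // A full implementation would pack an mc x kc block of A here.
                for jr in (0..nc).step_by(NR) {
                    let nr = NR.min(nc - jr);
                    for ir in (0..mc).step_by(MR) {
                        let mr = MR.min(mc - ir);
                        // Microkernel: accumulate an mr x nr tile over kc steps.
                        let mut acc = [[0.0f32; NR]; MR];
                        for p in 0..kc {
                            for i in 0..mr {
                                let av = a[(ic + ir + i) * k + (pc + p)];
                                for j in 0..nr {
                                    acc[i][j] += av * b[(pc + p) * n + (jc + jr + j)];
                                }
                            }
                        }
                        // Add the finished tile into C.
                        for i in 0..mr {
                            for j in 0..nr {
                                c[(ic + ir + i) * n + (jc + jr + j)] += acc[i][j];
                            }
                        }
                    }
                }
            }
        }
    }
}
```

The `min` calls handle edge blocks when a dimension is not a multiple of the block size. In an optimized implementation the innermost tile update is the part written with SIMD in mind, since its accumulators are meant to stay in registers.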