pub fn matmul(
    output: &mut [f32],
    input: &[f32],
    weight: &[f32],
    m: usize,
    k: usize,
    n: usize,
)
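General matrix multiply: output[m,n] = input[m,k] * weight^T[k,n], so weight itself has shape [n,k] and each output element is the dot product of one input row with one weight row.
When m = 1 (a single token), the call delegates to the optimized vector version.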
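
The shape convention above can be illustrated with a naive reference loop. This is only a sketch under stated assumptions, not the crate's implementation: the name `matmul_reference` is hypothetical, all three slices are assumed to be row-major, and the m = 1 fast path is not modeled because the optimized vector routine is not shown in the source.

```rust
/// Hypothetical reference version of `matmul` (illustration only).
/// Assumes row-major storage: `input` is [m, k], `weight` is [n, k],
/// and `output` is [m, n].
pub fn matmul_reference(
    output: &mut [f32],
    input: &[f32],
    weight: &[f32],
    m: usize,
    k: usize,
    n: usize,
) {
    // Check that every slice matches its declared shape.
    assert_eq!(input.len(), m * k);
    assert_eq!(weight.len(), n * k);
    assert_eq!(output.len(), m * n);

    for i in 0..m {
        // Row i of the input, k elements long.
        let row = &input[i * k..(i + 1) * k];
        for j in 0..n {
            // Row j of `weight` is column j of weight^T.
            let w_row = &weight[j * k..(j + 1) * k];
            // output[i, j] = dot(input[i, :], weight[j, :])
            output[i * n + j] = row.iter().zip(w_row).map(|(a, b)| a * b).sum();
        }
    }
}
```

Because `weight` is read row by row, both operands of each dot product are contiguous in memory, which is one plausible reason for storing the weight transposed. With m = 2, k = 3, n = 2, an input of `[1, 2, 3, 4, 5, 6]` and a weight of `[1, 0, 0, 0, 1, 0]` select the first two columns of each input row.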