/// Cache-friendly matrix multiplication using i-k-j loop order.
///
/// By swapping the j and k loops, the innermost loop now accesses both
/// B and C sequentially (stride 1). This alone gives ~9× speedup over
/// the naive i-j-k order on large matrices.
///
/// This is the scalar baseline that SIMD kernels are compared against.
///
/// # Arguments
///
/// * `a` - Matrix A (m × k), row-major
/// * `b` - Matrix B (k × n), row-major
/// * `c` - Matrix C (m × n), row-major, accumulated into (C += A * B)
/// * `m` - Rows of A and C
/// * `n` - Columns of B and C
/// * `k` - Columns of A, rows of B
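// A minimal scalar sketch of the kernel documented above; the name
// `matmul_ikj` and the slice-based signature are assumptions, since the
// original function body is not shown here.
pub fn matmul_ikj(a: &[f64], b: &[f64], c: &mut [f64], m: usize, n: usize, k: usize) {
    for i in 0..m {
        for p in 0..k {
            // Hoist a[i][p]; the inner j loop then touches only stride-1 data.
            let a_ip = a[i * k + p];
            for j in 0..n {
                // b[p * n + j] and c[i * n + j] are both sequential in j.
                c[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
}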
/// i-k-j-style multiplication with a pre-transposed B matrix.
///
/// When B is already transposed (stored as B^T), accessing `bt[j * k + p]`
/// becomes sequential in p, so the inner dot-product loop reads both
/// operands at stride 1. Useful when multiplying by the same B many times.
///
/// # Arguments
///
/// * `bt` - Transposed matrix B^T (n × k), row-major
/// * Remaining arguments (`a`, `c`, `m`, `n`, `k`) - As in the variant above
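// A minimal sketch of the transposed-B kernel documented above; the name
// `matmul_ikj_bt` and the slice-based signature are assumptions, since the
// original function body is not shown here.
pub fn matmul_ikj_bt(a: &[f64], bt: &[f64], c: &mut [f64], m: usize, n: usize, k: usize) {
    for i in 0..m {
        for j in 0..n {
            // Dot product of row i of A with row j of B^T; both rows are
            // contiguous, so every read in the inner loop is stride-1.
            let mut sum = 0.0;
            for p in 0..k {
                sum += a[i * k + p] * bt[j * k + p];
            }
            c[i * n + j] += sum;
        }
    }
}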