//! BLIS Microkernels - High-Performance SIMD Compute Kernels
//!
//! This module contains microkernel implementations for several targets:
//! - Scalar reference (correctness validation)
//! - AVX2 intrinsics
//! - AVX2 hand-tuned ASM with software pipelining
//! - ARM NEON
//!
//! # Performance Targets
//!
//! - 70%+ FMA utilization on Haswell+ CPUs
//! - 4-way K unrolling for software pipelining
//! - 10-12 instruction latency hiding
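/*!

As a rough illustration of the 4-way K unrolling target above, here is a
minimal scalar sketch. It is not the AVX2 or NEON kernel from this module.
The names `k_unrolled4_sketch` and `rank1_update`, the `f32` element type,
and the accumulator layout are all assumptions made for illustration. A real
SIMD kernel would interleave the four updates with loads for upcoming
iterations; this sketch shows only the loop structure and the remainder loop.

```rust
/// Hypothetical helper: one rank-1 update, acc += a[:, p] * b[p, :].
/// `a` is packed as K columns of MR values, `b` as K rows of NR values.
fn rank1_update<const MR: usize, const NR: usize>(
    p: usize,
    a: &[f32],
    b: &[f32],
    acc: &mut [[f32; NR]; MR],
) {
    for i in 0..MR {
        for j in 0..NR {
            acc[i][j] += a[p * MR + i] * b[p * NR + j];
        }
    }
}

/// Hypothetical sketch: accumulate an MR x NR tile with the K loop
/// unrolled by 4, followed by a remainder loop.
pub fn k_unrolled4_sketch<const MR: usize, const NR: usize>(
    k: usize,
    a: &[f32],
    b: &[f32],
    acc: &mut [[f32; NR]; MR],
) {
    assert!(a.len() >= MR * k, "packed A too short");
    assert!(b.len() >= NR * k, "packed B too short");

    // Main loop: four K iterations per trip.
    let k_main = k - k % 4;
    for p in (0..k_main).step_by(4) {
        rank1_update(p, a, b, acc);
        rank1_update(p + 1, a, b, acc);
        rank1_update(p + 2, a, b, acc);
        rank1_update(p + 3, a, b, acc);
    }

    // Remainder loop: the trailing k % 4 iterations.
    for p in k_main..k {
        rank1_update(p, a, b, acc);
    }
}
```

Calling `k_unrolled4_sketch(5, &a, &b, &mut acc)` with a 2 x 2 accumulator
runs one unrolled trip of four iterations and one remainder iteration.
*/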
//!
//! # References
//!
//! - Goto, K., & van de Geijn, R. A. (2008). Anatomy of High-Performance Matrix Multiplication. ACM Transactions on Mathematical Software, 34(3).
//! - Fog, A. (2024). Optimizing subroutines in assembly language, Section 12.7.
//! - Intel(R) 64 and IA-32 Architectures Optimization Reference Manual.

// Re-export all public microkernel functions
pub use ;
pub use microkernel_8x8_neon;
use ;

/// Scalar microkernel for correctness validation.
///
/// Computes `C[MR x NR] += A[MR x K] * B[K x NR]`,
/// where `A` is packed column-major and `B` is packed row-major.
///
/// This serves as the reference for validating SIMD microkernels.
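/**

The computation described above can be sketched as follows. This is a
minimal, self-contained illustration, not the implementation in this
module. The name `reference_microkernel_sketch`, the const generics `MR`
and `NR`, the `f32` element type, and the row-major layout of `C` with
leading dimension `ldc` are all assumptions. Packed `A` is read as K
consecutive columns of MR values, and packed `B` as K consecutive rows of
NR values.

```rust
/// Hypothetical reference kernel: C[MR x NR] += A[MR x K] * B[K x NR].
///
/// `c` is assumed to be row-major with leading dimension `ldc`.
pub fn reference_microkernel_sketch<const MR: usize, const NR: usize>(
    k: usize,
    a: &[f32],
    b: &[f32],
    c: &mut [f32],
    ldc: usize,
) {
    assert!(a.len() >= MR * k, "packed A too short");
    assert!(b.len() >= k * NR, "packed B too short");
    assert!(ldc >= NR, "ldc must be at least NR");
    // Same as c.len() >= (MR - 1) * ldc + NR, written to avoid underflow.
    assert!(c.len() + ldc >= MR * ldc + NR, "C too small");

    for p in 0..k {
        let a_col = &a[p * MR..(p + 1) * MR]; // column p of packed A
        let b_row = &b[p * NR..(p + 1) * NR]; // row p of packed B

        // Rank-1 update: C += a_col * b_row.
        for i in 0..MR {
            for j in 0..NR {
                c[i * ldc + j] += a_col[i] * b_row[j];
            }
        }
    }
}
```

With `MR = NR = 2` and `k = 2`, the packed slice `a = [1, 2, 3, 4]` holds the
columns `[1, 2]` and `[3, 4]`, and `b = [5, 6, 7, 8]` holds the rows `[5, 6]`
and `[7, 8]`. A SIMD kernel can be validated by comparing its output tile
against this kind of loop on the same packed inputs.
*/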