macro_rules! simd_8acc_dot_loop {
($a_ptr:expr, $b_ptr:expr, $end:expr,
$zero:expr, $load:ident, $fmadd:ident, $add:ident, $lane:expr) => { ... };
}
8-accumulator unrolled SIMD loop for dot product (ILP optimization).
Processes 8 × lane elements per iteration using 8 independent
accumulators to maximally hide FMA latency on wide-issue CPUs.
Targets AVX-512 kernels for very large vectors (>= 1024 dimensions).
Returns (combined_accumulator, updated_a_ptr, updated_b_ptr).
§Arguments
- $a_ptr, $b_ptr: Starting pointers for the two input vectors
- $end: End pointer for the main loop (aligned to 8 × lane)
- $zero: Zero-init expression (e.g., _mm512_setzero_ps())
- $load: SIMD load intrinsic (e.g., _mm512_loadu_ps)
- $fmadd: FMA intrinsic with signature fmadd(a, b, acc) → a*b + acc
- $add: SIMD add intrinsic (e.g., _mm512_add_ps)
- $lane: Number of f32 elements per SIMD register (16 for AVX-512)
§Safety
Must be invoked inside an unsafe context where the specified
SIMD intrinsics are valid for the current CPU.
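
The 8-accumulator pattern described above can be modelled in portable Rust. The sketch below is not the macro's expansion: it is a scalar stand-in in which each `[f32; LANE]` array plays the role of one SIMD register. The names `dot_8acc`, `LANE`, and `UNROLL` are hypothetical. The pairwise order used to combine the accumulators is an assumption. Unlike the macro, which returns the combined accumulator and the updated pointers, this model also performs the horizontal sum and the scalar tail so that it yields a final value.

```rust
/// Stand-in for the SIMD register width (hypothetical; 16 would model
/// AVX-512 f32 lanes, 4 keeps the example small).
const LANE: usize = 4;
/// Number of independent accumulators, as in the macro's name.
const UNROLL: usize = 8;

/// Portable model of an 8-accumulator unrolled dot product.
fn dot_8acc(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());

    // Largest prefix whose length is a multiple of 8 * LANE; this plays
    // the role of the `$end` bound of the main loop.
    let step = UNROLL * LANE;
    let main_len = a.len() / step * step;

    // Eight independent accumulators, each one "register" wide.
    let mut acc = [[0.0f32; LANE]; UNROLL];

    let mut i = 0;
    while i < main_len {
        for k in 0..UNROLL {
            let base = i + k * LANE;
            for l in 0..LANE {
                // Models fmadd(a, b, acc) = a*b + acc. Accumulator k
                // depends only on its own previous value, so the eight
                // chains can overlap in the pipeline.
                acc[k][l] = a[base + l].mul_add(b[base + l], acc[k][l]);
            }
        }
        i += step;
    }

    // Combine the accumulators lane by lane (pairwise order is an
    // assumption of this sketch).
    let mut combined = [0.0f32; LANE];
    for l in 0..LANE {
        let s01 = acc[0][l] + acc[1][l];
        let s23 = acc[2][l] + acc[3][l];
        let s45 = acc[4][l] + acc[5][l];
        let s67 = acc[6][l] + acc[7][l];
        combined[l] = (s01 + s23) + (s45 + s67);
    }

    // Horizontal sum, then a scalar tail for the leftover elements.
    let mut sum: f32 = combined.iter().sum();
    for j in main_len..a.len() {
        sum += a[j] * b[j];
    }
    sum
}
```

With `LANE = 4`, a 70-element input runs the main loop twice (64 elements) and leaves 6 elements for the tail. Because the summation order differs from a plain left-to-right loop, results for general floating-point inputs may differ in the last bits.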