Macro simd_8acc_dot_loop

macro_rules! simd_8acc_dot_loop {
    ($a_ptr:expr, $b_ptr:expr, $end:expr,
     $zero:expr, $load:ident, $fmadd:ident, $add:ident, $lane:expr) => { ... };
}

8-accumulator unrolled SIMD loop for dot product (ILP optimization).

Each iteration processes 8 × lane elements, issuing one FMA into each of 8 independent accumulators so that consecutive FMAs never wait on one another's results. This hides FMA latency on wide-issue CPUs. Intended for AVX-512 kernels on very large vectors (>= 1024 dimensions).

Returns (combined_accumulator, updated_a_ptr, updated_b_ptr).

§Arguments

$a_ptr, $b_ptr — Starting pointers for the two input vectors
$end — End pointer for the main loop (the main-loop length must be a whole multiple of 8 × lane elements)
$zero — Zero-init expression (e.g., _mm512_setzero_ps())
$load — SIMD load intrinsic (e.g., _mm512_loadu_ps)
$fmadd — FMA intrinsic with signature fmadd(a, b, acc) → a*b + acc
$add — SIMD add intrinsic (e.g., _mm512_add_ps)
$lane — Number of f32 elements per SIMD register (16 for AVX-512)
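
The macro body is elided above, so the following is only a minimal sketch of how a loop with this argument list could work, not the crate's actual expansion. Everything in it is hypothetical: the macro name `simd_8acc_dot_loop_sketch`, the helper `dot_sketch`, and the portable stand-ins `load`, `fmadd` and `add`, which emulate a 4-lane SIMD register with `[f32; 4]` so the example runs on any CPU.

```rust
// Hypothetical stand-in for a SIMD register: 4 f32 lanes in a plain array.
const LANE: usize = 4;
type V = [f32; LANE];

// Stand-ins for the $load, $fmadd and $add intrinsics (assumptions for illustration).
unsafe fn load(p: *const f32) -> V {
    unsafe { std::ptr::read_unaligned(p as *const V) }
}
fn fmadd(a: V, b: V, acc: V) -> V {
    std::array::from_fn(|i| a[i] * b[i] + acc[i])
}
fn add(a: V, b: V) -> V {
    std::array::from_fn(|i| a[i] + b[i])
}

// Illustrative re-implementation with the documented argument list.
macro_rules! simd_8acc_dot_loop_sketch {
    ($a_ptr:expr, $b_ptr:expr, $end:expr,
     $zero:expr, $load:ident, $fmadd:ident, $add:ident, $lane:expr) => {{
        let mut pa = $a_ptr;
        let mut pb = $b_ptr;
        let end_ptr = $end;
        let mut acc = [$zero; 8];
        while pa < end_ptr {
            // Eight independent FMAs per iteration, one per accumulator.
            // Written as a short loop for brevity; a real kernel would
            // likely spell out the eight statements.
            for i in 0..8 {
                let va = $load(pa.add(i * $lane));
                let vb = $load(pb.add(i * $lane));
                acc[i] = $fmadd(va, vb, acc[i]);
            }
            pa = pa.add(8 * $lane);
            pb = pb.add(8 * $lane);
        }
        // Pairwise (tree) reduction of the eight accumulators into one.
        let s0 = $add(acc[0], acc[1]);
        let s1 = $add(acc[2], acc[3]);
        let s2 = $add(acc[4], acc[5]);
        let s3 = $add(acc[6], acc[7]);
        ($add($add(s0, s1), $add(s2, s3)), pa, pb)
    }};
}

// Hypothetical caller: main loop via the macro, scalar loop for the tail.
fn dot_sketch(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let block = 8 * LANE;
    let main_len = a.len() / block * block;
    let (acc, tail_a, _tail_b) = unsafe {
        let end = a.as_ptr().add(main_len);
        simd_8acc_dot_loop_sketch!(
            a.as_ptr(), b.as_ptr(), end,
            [0.0f32; LANE], load, fmadd, add, LANE
        )
    };
    // The returned pointer tells the caller where the remainder starts.
    let done = unsafe { tail_a.offset_from(a.as_ptr()) } as usize;
    let mut sum: f32 = acc.iter().sum();
    for i in done..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```

In this sketch the returned accumulator is still a vector, so the caller performs the horizontal sum and handles the remaining elements using the updated pointers.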

§Safety

Must be invoked inside an unsafe context where the specified SIMD intrinsics are valid for the current CPU.
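
One common way to meet this requirement is to check CPU features at runtime and enter the `unsafe` kernel only when the check passes. The sketch below illustrates that gating pattern under stated assumptions: it swaps in AVX2 + FMA intrinsics with a single accumulator rather than AVX-512 with eight, and the names `dot_scalar`, `dot_avx2_fma` and `dot_checked` are hypothetical. The same structure would apply with an `"avx512f"` check around an AVX-512 kernel.

```rust
// Portable fallback used when the required CPU features are missing.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Hypothetical kernel; only sound to call after the feature check below.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2_fma(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    const LANE: usize = 8; // f32 elements per 256-bit register
    let n = a.len().min(b.len());
    let main_len = n / LANE * LANE;
    let mut sum = 0.0f32;
    unsafe {
        let mut acc = _mm256_setzero_ps();
        let mut i = 0;
        while i < main_len {
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            acc = _mm256_fmadd_ps(va, vb, acc); // va * vb + acc
            i += LANE;
        }
        // Horizontal sum: spill the register and add the lanes.
        let mut lanes = [0.0f32; LANE];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        sum += lanes.iter().sum::<f32>();
    }
    // Scalar tail for the elements that do not fill a register.
    for i in main_len..n {
        sum += a[i] * b[i];
    }
    sum
}

// Safe entry point: the runtime check is what justifies the unsafe call.
pub fn dot_checked(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            return unsafe { dot_avx2_fma(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

Keeping the feature check and the `unsafe` call in one safe wrapper means callers cannot reach the kernel on a CPU that lacks the instructions.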
