Shared SIMD utility helpers for the graph kernels.
This module was physically moved here from mnemosyne_rs::util in
Phase 223 so both the native PyO3 crate and the forthcoming
graph_wasm sub-crate can share the dot-product primitives
without re-rolling their own SIMD. The native crate re-exports
these symbols through a thin bridge module at mnemosyne_rs::util,
so the existing crate::util::dot_simd / dot_and_self_dot call
sites keep working unchanged.
Why a hand-rolled AVX2+FMA path?
Phase 221 shipped wide::f32x8 here. The Phase 222 design-gate
investigation showed that at ≥4096 floats the wide-backed dot
product was up to 6× slower than NumPy’s OpenBLAS sdot. The
root cause is the rustc default target baseline, which on
x86_64-unknown-linux-gnu is just sse2. That forces LLVM to
lower wide::f32x8 (256-bit) into pairs of 128-bit SSE2
instructions with no FMA — about half the throughput of native
AVX2+FMA. See benchmark_results/cosine_simd_investigation.md
for the disassembly and measurements.
The fix is a runtime-dispatched hot path:
- On x86_64 at call time we check is_x86_feature_detected! for avx2 + fma. If both are present we jump into an #[target_feature(enable = "avx2,fma")] function that emits 256-bit vmovups ymm / vfmadd231ps ymm and gets us to BLAS throughput for the per-pair dot product.
- On every other configuration (non-AVX2 x86_64, aarch64, any other target including wasm32) we fall back to the portable wide::f32x8 path. wide emits NEON on aarch64 and SIMD128 on wasm32 so those targets are unaffected.
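The dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual source: the names dot_dispatch, dot_avx2_fma and dot_portable are hypothetical, and the fallback is a plain scalar loop standing in for the wide::f32x8 path so that the sketch needs no external crates.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Portable fallback (hypothetical name). A scalar stand-in for the
/// wide::f32x8 path, used here only to keep the sketch dependency-free.
fn dot_portable(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// AVX2+FMA kernel (hypothetical name). Must only be called after the
/// caller has verified at runtime that both features are available.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2_fma(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let chunks = n / 8;
    let mut acc = _mm256_setzero_ps();
    for i in 0..chunks {
        // Unaligned 256-bit loads of eight f32 lanes each.
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        // acc = va * vb + acc, fused into a single instruction.
        acc = _mm256_fmadd_ps(va, vb, acc);
    }
    // Horizontal sum of the eight lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Scalar tail for lengths that are not a multiple of 8.
    for i in chunks * 8..n {
        sum += a[i] * b[i];
    }
    sum
}

/// Runtime-dispatched dot product (hypothetical name).
pub fn dot_dispatch(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "slices must have equal length");
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: both CPU features were verified just above.
            return unsafe { dot_avx2_fma(a, b) };
        }
    }
    dot_portable(a, b)
}
```

A production kernel would likely use several independent accumulators to hide FMA latency; the single accumulator here keeps the sketch short.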
The dispatch adds one cached atomic load per call (the
is_x86_feature_detected! macro memoises its result). For
4096-float dot products that overhead is ~2 ns out of ~3 µs —
well under 0.1%.
We also expose a fused dot_and_self_dot that computes both
the cross-dot product (with a query) and the self-dot product
(for the row’s L2 norm) in a single pass. The per-query-batch
kernels are memory-bound at ≥10k candidates × 4096 dims (160 MB
is far larger than any single-core L3), so cutting candidate-row
traffic in half by computing both accumulators during the same
row sweep is the second big win on top of AVX2+FMA.
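To illustrate the fused pass, here is a scalar sketch under stated assumptions: the function names are hypothetical, the real kernel is SIMD rather than a scalar loop, and the cosine helper is an assumed use of the two results rather than something this module is documented to expose. The point it shows is that each element of the candidate row b is loaded once and feeds both accumulators.

```rust
/// Returns (dot(a, b), dot(b, b)) while reading each element of `b`
/// exactly once. Hypothetical scalar stand-in for the fused SIMD kernel.
pub fn dot_and_self_dot_sketch(a: &[f32], b: &[f32]) -> (f32, f32) {
    assert_eq!(a.len(), b.len(), "slices must have equal length");
    let mut cross = 0.0f32;
    let mut self_dot = 0.0f32;
    for (&x, &y) in a.iter().zip(b) {
        cross += x * y; // cross-dot with the query
        self_dot += y * y; // squared L2 norm of the row
    }
    (cross, self_dot)
}

/// Assumed usage: cosine similarity of one candidate row against a query
/// whose norm was computed once up front. Returns 0.0 for zero vectors.
pub fn cosine_sketch(query: &[f32], query_norm: f32, row: &[f32]) -> f32 {
    let (cross, row_sq) = dot_and_self_dot_sketch(query, row);
    let denom = query_norm * row_sq.sqrt();
    if denom == 0.0 {
        0.0
    } else {
        cross / denom
    }
}
```

Because the query norm is constant across a batch, only the row has to be streamed from memory for each candidate, which is where the halved traffic comes from.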
No new crates, no baseline bumps, no C deps — the whole fix is
~100 lines of unsafe behind a CPU-feature gate.
Functions
- dot_and_self_dot: Fused (dot(a,b), dot(b,b)): computes both the cross-dot product and the self-dot product of b in a single pass through b.
- dot_simd: SIMD-accelerated dot product.
- l2_norm: L2 norm of an f32 slice. Uses the SIMD dot product internally so we reuse a single tight inner loop for both dot and norm.
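The relationship between l2_norm and the dot product can be sketched as below. The signatures are assumptions (the listing above does not show them), and dot_stand_in is a hypothetical scalar replacement for dot_simd so that the example compiles on its own.

```rust
/// Hypothetical scalar stand-in for dot_simd.
fn dot_stand_in(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// L2 norm expressed as sqrt(dot(v, v)), so the norm reuses the same
/// inner loop as the dot product.
pub fn l2_norm_sketch(v: &[f32]) -> f32 {
    dot_stand_in(v, v).sqrt()
}
```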