Module util

Shared SIMD utility helpers for the graph kernels.

This module was physically moved here from mnemosyne_rs::util in Phase 223 so both the native PyO3 crate and the forthcoming graph_wasm sub-crate can share the dot-product primitives without re-rolling their own SIMD. The native crate re-exports these symbols through a thin bridge module at mnemosyne_rs::util, so the existing crate::util::dot_simd / dot_and_self_dot call sites keep working unchanged.

§Why a hand-rolled AVX2+FMA path?

Phase 221 shipped wide::f32x8 here. The Phase 222 design-gate investigation showed that at ≥4096 floats the wide-backed dot product was up to 6× slower than NumPy’s OpenBLAS sdot. The root cause is the rustc default target baseline, which on x86_64-unknown-linux-gnu is just sse2. That forces LLVM to lower wide::f32x8 (256-bit) into pairs of 128-bit SSE2 instructions with no FMA — about half the throughput of native AVX2+FMA. See benchmark_results/cosine_simd_investigation.md for the disassembly and measurements.

The fix is a runtime-dispatched hot path:

  • On x86_64 at call time we check is_x86_feature_detected! for avx2 + fma. If both are present we jump into an #[target_feature(enable = "avx2,fma")] function that emits 256-bit vmovups ymm / vfmadd231ps ymm and gets us to BLAS throughput for the per-pair dot product.
  • On every other configuration (non-AVX2 x86_64, aarch64, any other target including wasm32) we fall back to the portable wide::f32x8 path. wide emits NEON on aarch64 and SIMD128 on wasm32 so those targets are unaffected.
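The dispatch described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the module's actual source: the names `dot_dispatch`, `dot_avx2_fma` and `dot_portable` are hypothetical, and a plain scalar loop stands in for the `wide::f32x8` fallback so the sketch needs no external crate.

```rust
/// Portable fallback. The real module is described as using `wide::f32x8`
/// here; a scalar loop is substituted so this sketch depends only on std.
fn dot_portable(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Hypothetical AVX2+FMA hot path: 8 floats per iteration, one fused
/// multiply-add per 256-bit register pair.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2_fma(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let n = a.len().min(b.len());
    let chunks = n / 8;
    // SAFETY: every load reads 8 floats starting at i * 8 with
    // i * 8 + 8 <= n, so it stays inside both slices. The caller
    // guarantees that AVX2 and FMA are available.
    unsafe {
        let mut acc = _mm256_setzero_ps();
        for i in 0..chunks {
            let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
            acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb
        }
        // Horizontal sum of the 8 lanes, then the scalar tail.
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        let mut sum: f32 = lanes.iter().sum();
        for i in chunks * 8..n {
            sum += a[i] * b[i];
        }
        sum
    }
}

/// Runtime dispatch: take the AVX2+FMA path only when the CPU reports
/// both features, otherwise use the portable path.
pub fn dot_dispatch(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: both features were confirmed at run time just above.
            return unsafe { dot_avx2_fma(a, b) };
        }
    }
    dot_portable(a, b)
}
```

A production version would likely keep several independent accumulators to hide FMA latency; the single accumulator here keeps the sketch short.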

The dispatch adds one cached atomic load per call (the is_x86_feature_detected! macro memoises its result). For 4096-float dot products that overhead is ~2 ns out of ~3 µs, or roughly 0.07%.

We also expose a fused dot_and_self_dot that computes both the cross-dot product (with a query) and the self-dot product (for the row’s L2 norm) in a single pass. The per-query-batch kernels are memory-bound at ≥10k candidates × 4096 dims: at 4 bytes per f32 that is about 160 MB of candidate rows, far larger than any single-core L3. Computing both accumulators during the same row sweep reads each candidate row once instead of twice, which halves candidate-row traffic and is the second big win on top of AVX2+FMA.
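To make the single-pass idea concrete, here is a scalar sketch of the fusion. It assumes nothing about the real SIMD implementation: `dot_and_self_dot_sketch` and `cosine_sketch` are hypothetical names, and the cosine helper is only an illustration of how the two accumulators might be consumed when the query norm is precomputed.

```rust
/// Returns (dot(a, b), dot(b, b)), reading each element of `b` once.
/// Scalar stand-in for the fused kernel described above.
pub fn dot_and_self_dot_sketch(a: &[f32], b: &[f32]) -> (f32, f32) {
    let mut cross = 0.0f32;
    let mut self_dot = 0.0f32;
    for (x, y) in a.iter().zip(b) {
        cross += x * y; // contribution to dot(a, b)
        self_dot += y * y; // contribution to dot(b, b), same load of y
    }
    (cross, self_dot)
}

/// Hypothetical consumer: cosine similarity of `query` against one
/// candidate `row`, with the query's L2 norm computed once per batch.
pub fn cosine_sketch(query: &[f32], row: &[f32], query_norm: f32) -> f32 {
    let (cross, row_self_dot) = dot_and_self_dot_sketch(query, row);
    let denom = query_norm * row_self_dot.sqrt();
    // Treat a zero-norm vector as having zero similarity.
    if denom == 0.0 { 0.0 } else { cross / denom }
}
```

Without the fusion, the row would be streamed from memory twice (once for the dot product, once for its norm), which is the traffic the paragraph above is about.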

No new crates, no baseline bumps, no C deps — the whole fix is ~100 lines of unsafe behind a CPU-feature gate.

Functions§

dot_and_self_dot
Fused (dot(a,b), dot(b,b)) — computes both the cross-dot product and the self-dot product of b in a single pass through b.
dot_simd
SIMD-accelerated dot product.
l2_norm
L2 norm of an f32 slice. Uses the SIMD dot product internally so we reuse a single tight inner loop for both dot and norm.
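
As a rough illustration of the relationship stated for l2_norm, the norm is the square root of a slice's dot product with itself. The names `dot_for_norm` and `l2_norm_sketch` are hypothetical, and the scalar dot product stands in for the SIMD one.

```rust
/// Scalar stand-in for the SIMD dot product.
fn dot_for_norm(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// L2 norm expressed through the dot product: sqrt(dot(x, x)).
pub fn l2_norm_sketch(x: &[f32]) -> f32 {
    dot_for_norm(x, x).sqrt()
}
```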