// Architecture and the f64 dot/reduction kernel adapted from the `dia`
// project (github — MIT/Apache-2.0), src/ops/.
//! Dot product and `sum_of_squares` dispatchers.
//!
//! Each public fn routes to the best available SIMD backend on this
//! `target_arch` after runtime CPU-feature detection, falling back to
//! [`crate::simd::scalar`] when no SIMD backend applies (non-`aarch64`
//! targets or `--cfg mlxrs_force_scalar`).
use crate::simd::scalar;
use crate;
/// Inner product of two equal-length f64 slices: `Σ a[i] * b[i]`.
///
/// Routes to NEON on `aarch64` (when the CPU reports NEON), else to
/// [`crate::simd::scalar::dot`]. Callers that need byte-identical scalar
/// output across every build configuration should call
/// [`crate::simd::scalar::dot`] directly.
///
/// # Panics
///
/// If `a.len() != b.len()`. The check is enforced **unconditionally**: the
/// NEON kernel reads raw pointers bounded only by `a.len()`, so with a
/// shorter `b` it would otherwise load past the end of `b` in release
/// builds, where its `debug_assert!` is a no-op.
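/*
Illustrative sketch, not part of this module. The dispatch pattern documented
for `dot` might look roughly like the code below. The names `dot_scalar` and
`dot_neon` are hypothetical stand-ins for the real kernels, whose bodies are
not shown here, and the 2-lane NEON loop is an assumption chosen for brevity
rather than the project's actual kernel. Only the unconditional length check,
the runtime NEON detection, and the `mlxrs_force_scalar` cfg come from the
documentation.

```rust
/// Scalar reference kernel (hypothetical stand-in for
/// `crate::simd::scalar::dot`): one fused multiply-add per element.
fn dot_scalar(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).fold(0.0, |acc, (&x, &y)| x.mul_add(y, acc))
}

/// Simplified NEON kernel (assumption: the real one is more elaborate).
/// It reads both slices through raw pointers bounded only by `a.len()`,
/// which is why the dispatcher must verify the lengths first.
#[cfg(all(target_arch = "aarch64", not(mlxrs_force_scalar)))]
#[target_feature(enable = "neon")]
unsafe fn dot_neon(a: &[f64], b: &[f64]) -> f64 {
    use std::arch::aarch64::{vaddvq_f64, vdupq_n_f64, vfmaq_f64, vld1q_f64};
    debug_assert_eq!(a.len(), b.len()); // compiled out in release builds
    let n = a.len();
    unsafe {
        let mut acc = vdupq_n_f64(0.0);
        let mut i = 0;
        // Two f64 lanes per iteration: acc += a[i..i+2] * b[i..i+2].
        while i + 2 <= n {
            let va = vld1q_f64(a.as_ptr().add(i));
            let vb = vld1q_f64(b.as_ptr().add(i));
            acc = vfmaq_f64(acc, va, vb);
            i += 2;
        }
        // Horizontal add of the two lanes, then the scalar tail.
        let mut sum = vaddvq_f64(acc);
        while i < n {
            sum = (*a.as_ptr().add(i)).mul_add(*b.as_ptr().add(i), sum);
            i += 1;
        }
        sum
    }
}

/// Dispatcher: checks lengths unconditionally, then picks a backend.
pub fn dot(a: &[f64], b: &[f64]) -> f64 {
    // `assert_eq!` rather than `debug_assert_eq!`, so the check also
    // runs in release builds.
    assert_eq!(a.len(), b.len(), "dot: slice lengths differ");

    #[cfg(all(target_arch = "aarch64", not(mlxrs_force_scalar)))]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            // SAFETY: NEON was detected at runtime and the lengths match.
            return unsafe { dot_neon(a, b) };
        }
    }

    dot_scalar(a, b)
}
```

With this shape, `dot(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0])` evaluates to `32.0`
on either backend, and a call with mismatched lengths panics before any kernel
runs.
*/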
/// Sum of squares of an f64 slice: `Σ v[i]²`.
///
/// The `b ≡ a` specialization of [`dot`]. Routes to NEON on `aarch64`
/// (when the CPU reports NEON), else to
/// [`crate::simd::scalar::sum_of_squares`].
///
/// On `aarch64` the NEON and scalar paths produce **bit-identical**
/// results: both apply one `f64::mul_add` (FMA) per element and share the
/// same 4-accumulator reduction tree. There is no slice-length
/// precondition (a sum over a single slice cannot be mismatched), so
/// this dispatcher cannot panic.
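/*
Illustrative sketch, not part of this module. A scalar kernel can match a SIMD
kernel bit for bit by keeping the same number of independent accumulators and
combining them in the same fixed order. The function name
`sum_of_squares_scalar`, the tree shape `(acc0 + acc1) + (acc2 + acc3)`, and
the tail handling below are assumptions for illustration. The documentation
only states that there are four accumulators and one `f64::mul_add` per
element.

```rust
/// Sum of squares with four independent accumulators (hypothetical sketch).
pub fn sum_of_squares_scalar(v: &[f64]) -> f64 {
    // One accumulator per lane position within a 4-element chunk.
    let mut acc = [0.0f64; 4];
    let mut chunks = v.chunks_exact(4);
    for chunk in &mut chunks {
        for lane in 0..4 {
            // Fused multiply-add: acc[lane] = chunk[lane]^2 + acc[lane].
            acc[lane] = chunk[lane].mul_add(chunk[lane], acc[lane]);
        }
    }

    // Fixed reduction tree. Floating-point addition is not associative,
    // so two kernels agree bit for bit only if they combine in this order.
    let mut sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    // Leftover elements (fewer than four) are folded in sequentially.
    for &x in chunks.remainder() {
        sum = x.mul_add(x, sum);
    }
    sum
}
```

For `[1.0, 2.0, 3.0, 4.0, 5.0]` the accumulators hold `[1, 4, 9, 16]`, the tree
reduces them to `30`, and the tail element brings the result to `55.0`.
*/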