//! AVX2 utility intrinsics shared across Q4_0, Q8_0, and future kernels.
//!
//! All functions in this module are `unsafe` and require the `avx2` and `fma`
//! CPU features. The feature gate at the top restricts compilation to the
//! correct platform; callers must additionally verify CPU support at runtime.
use *;
/// Horizontal sum of a 256-bit packed-float register.
///
/// Reduces the eight FP32 lanes to a single `f32` scalar using an
/// extract-and-`hadd` sequence, which avoids the slower lane-crossing
/// `_mm256_permutevar8x32_ps` path.
///
/// # Safety
/// Requires the `avx` CPU feature, which is implied by `avx2`.
pub unsafe
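// A minimal sketch of the hadd/extract reduction described above, assuming
// the register arrives as an `__m256`. The names `hsum_f32x8_sketch`,
// `load_and_sum`, and `hsum_checked` are hypothetical and are not taken from
// this module; the safe wrapper only illustrates the runtime CPU check that
// the module docs ask callers to perform.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Hypothetical name. Sums the eight FP32 lanes of `v`.
///
/// # Safety
/// The CPU must support `avx`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
pub unsafe fn hsum_f32x8_sketch(v: __m256) -> f32 {
    // Split the register into its 128-bit halves and add them: 8 lanes -> 4.
    let lo = _mm256_castps256_ps128(v);
    let hi = _mm256_extractf128_ps::<1>(v);
    let sum4 = _mm_add_ps(lo, hi);
    // Two horizontal adds fold 4 lanes -> 2 -> 1; every lane then holds the total.
    let sum2 = _mm_hadd_ps(sum4, sum4);
    let sum1 = _mm_hadd_ps(sum2, sum2);
    // Extract lane 0 as a scalar.
    _mm_cvtss_f32(sum1)
}

/// Hypothetical helper that keeps the vector load inside an AVX-enabled function.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn load_and_sum(lanes: &[f32; 8]) -> f32 {
    // SAFETY: `lanes` points at eight readable floats; the caller guarantees AVX.
    unsafe { hsum_f32x8_sketch(_mm256_loadu_ps(lanes.as_ptr())) }
}

/// Hypothetical safe wrapper: returns `None` when AVX is unavailable
/// (including on non-x86_64 targets).
pub fn hsum_checked(lanes: [f32; 8]) -> Option<f32> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was verified on the line above.
            return Some(unsafe { load_and_sum(&lanes) });
        }
    }
    let _ = lanes;
    None
}
```

// With this sketch, `hsum_checked([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])`
// yields `Some(36.0)` on a CPU with AVX and `None` elsewhere. Note that the
// additions are paired across halves first, so the rounding order differs
// from a left-to-right scalar sum.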
/// Reads two bytes from `bytes` as a little-endian IEEE 754 FP16 value and
/// returns the FP32 equivalent.
///
/// Uses the `half` crate for the conversion, which handles denormals,
/// infinities, and NaNs correctly.
///
/// # Safety
/// `bytes` must be at least 2 bytes long. The caller is responsible for
/// ensuring the slice bounds are valid before calling this function.
pub unsafe
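// A std-only sketch of the little-endian FP16 read described above. The
// function in this module uses the `half` crate; the version below swaps in
// a hand-written bit conversion so that it compiles without dependencies,
// and it returns `Option` instead of relying on a caller-checked length.
// Both function names are hypothetical.

```rust
/// Hypothetical helper: widens raw IEEE 754 binary16 bits to `f32`.
pub fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = u32::from(h >> 15) << 31;
    let exp = u32::from((h >> 10) & 0x1f);
    let frac = u32::from(h & 0x03ff);
    match exp {
        0 => {
            // Zero or subnormal: the magnitude is frac * 2^-24, which f32
            // represents exactly.
            let mag = frac as f32 * f32::from_bits(0x3380_0000); // 2^-24
            if sign != 0 { -mag } else { mag }
        }
        // Infinity (frac == 0) or NaN (frac != 0): keep the payload bits.
        0x1f => f32::from_bits(sign | 0x7f80_0000 | (frac << 13)),
        // Normal number: rebias the exponent (127 - 15 = 112) and widen the
        // 10-bit fraction to 23 bits.
        _ => f32::from_bits(sign | ((exp + 112) << 23) | (frac << 13)),
    }
}

/// Hypothetical bounds-checked reader: interprets the first two bytes of
/// `bytes` as a little-endian FP16 value, or returns `None` if the slice is
/// shorter than two bytes.
pub fn read_f16_le_sketch(bytes: &[u8]) -> Option<f32> {
    match bytes {
        [lo, hi, ..] => Some(f16_bits_to_f32(u16::from_le_bytes([*lo, *hi]))),
        _ => None,
    }
}
```

// For example, the bytes `[0x00, 0x3c]` encode FP16 `0x3c00`, which this
// sketch reads as `1.0`. Returning `Option` trades a length check for a safe
// signature; an unchecked variant would keep the "at least 2 bytes"
// precondition documented above.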