//! Fixed-size array APIs for bitwise Hamming distance.
//!
//! Use this module when the vector size is known at compile time (e.g., 1024-bit
//! embeddings stored as `[u8; 128]`). Because the length is a compile-time
//! constant, this is typically faster than the equivalent
//! [`slice`](crate::slice) API.
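
// Illustrative sketch only, not this crate's implementation. It shows one
// plausible reason a fixed-size API can beat a slice API: with `N` as a const
// generic, the loop bound is a constant the optimizer can unroll against, and
// both inputs have the same length by construction, so no runtime length
// check is needed. `distance_sketch` is a hypothetical name used here for
// illustration.
#[allow(dead_code)]
fn distance_sketch<const N: usize>(a: &[u8; N], b: &[u8; N]) -> u32 {
    // XOR leaves a 1 bit wherever the inputs differ; popcount tallies them.
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}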
// ============================================================================
// PERFORMANCE INVARIANT: AVX-512 Gather Avoidance (load-bearing without LTO)
// ============================================================================
//
// `batch()` iterates over `&[[u8; N]]` — contiguous memory. With AVX-512
// target features available, LLVM can transform that loop to use VPGATHERQQ
// gather instructions, which are 2–10x slower than contiguous VMOVDQU64 loads
// (each element fetched separately, cache locality destroyed, no prefetcher
// win). The asm! barrier on `target` below forces LLVM to use the simple
// load-xor-popcount form instead.
//
// Whether the barrier matters depends on LTO:
//
// - With LTO + multiversion (the recommended config): LLVM inlines across
//   the multiversion dispatch boundary, sees `N` is a compile-time constant,
//   unrolls the inner loop, and never emits gathers in the first place.
//   The barrier is a verified no-op here: assembly is identical with and
//   without it.
//
// - Without LTO + multiversion: each multiversion specialization is a
//   separate translation unit. LLVM can't see N, falls back to outer-loop
//   vectorization, and emits VPGATHERQQ across iterations. Measured: 112
//   such instructions per benchmark binary, ~4x slower than the barriered
//   form on Zen 5. The barrier is the difference between fast and slow here.
//
// The barrier is kept unconditionally as defense for users who don't enable
// LTO (and as insurance against future LLVM versions changing the heuristic
// under LTO too). It has no measurable cost under LTO.
//
// Why not `black_box`? Both prevent the gather, but `black_box` compiles to
// a stack store + reload (~5-cycle store-forwarding penalty per iteration).
// Under LTO + AVX-512 that penalty makes `black_box` ~7x slower than the
// asm! barrier (gather_demo: black_box = 2.85µs vs asm_barrier = 410ns at 64B).
//
// On non-x86 (ARM etc.), no barrier is applied: the VPGATHERQQ lowering
// described above is specific to AVX-512, and `opaque_ptr` is a plain
// identity there.
//
// Verify: inspect AVX-512 assembly under CARGO_PROFILE_BENCH_LTO=false for
// absence of VPGATHERQQ in the asm-barriered batch loop.
// Proof: benches/batch_input_type.rs `gather_demo` (no_barrier / black_box /
// asm_barrier A/B/C comparison).
// ============================================================================
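
// Illustrative sketch of the barrier described above, under stated
// assumptions. `opaque_ptr_sketch` and `batch_sketch` are hypothetical names,
// not this crate's items, and the scalar xor-popcount loop stands in for the
// multiversioned kernel. The asm template holds only an assembler comment
// that names the operand (so the operand counts as used), which means it
// emits no instructions.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[inline(always)]
#[allow(dead_code)]
fn opaque_ptr_sketch<const N: usize>(mut p: *const [u8; N]) -> *const [u8; N] {
    // SAFETY: the asm block touches no memory or stack and leaves `p`
    // unchanged; LLVM merely loses track of where the value came from.
    unsafe {
        core::arch::asm!("/* {0} */", inout(reg) p, options(nomem, nostack, preserves_flags));
    }
    p
}

#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
#[inline(always)]
#[allow(dead_code)]
fn opaque_ptr_sketch<const N: usize>(p: *const [u8; N]) -> *const [u8; N] {
    p // plain identity on other targets
}

#[allow(dead_code)]
fn batch_sketch<const N: usize>(source: &[u8; N], targets: &[[u8; N]], out: &mut [u32]) {
    assert_eq!(out.len(), targets.len());
    for (target, slot) in targets.iter().zip(out.iter_mut()) {
        // SAFETY: the pointer comes from a live reference and the barrier
        // returns it unchanged, so dereferencing it is valid.
        let target: &[u8; N] = unsafe { &*opaque_ptr_sketch(target as *const [u8; N]) };
        // Contiguous load, xor, popcount over one target at a time.
        *slot = source
            .iter()
            .zip(target.iter())
            .map(|(s, t)| (s ^ t).count_ones())
            .sum();
    }
}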
/// Make a pointer opaque to LLVM's stride analysis without the store/reload
/// round trip of `black_box`.
///
/// On x86, uses `asm!` with `nomem` + `nostack` — the pointer stays in a
/// register but LLVM treats it as a new, unknown value (preventing the
/// outer-loop gather vectorization LLVM otherwise picks under no-LTO
/// multiversion builds). On non-x86, returns the pointer unchanged, since
/// the gather lowering this guards against is specific to AVX-512.
///
/// # Safety
///
/// The pointer must be valid. The asm block is a no-op (empty template),
/// so the returned pointer is identical to the input.
unsafe
// ============================================================================
// Public API
// ============================================================================
/// Compute the bitwise Hamming distance between two fixed-size byte arrays.
///
/// This is the recommended API when the vector size is known at compile time.
/// It is faster than [`slice::distance`](crate::slice::distance).
///
/// # Example
///
/// ```
/// use hamming_bitwise_fast::array;
///
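/// let a: [u8; 128] = [0x12; 128]; // 1024-bit
/// let b: [u8; 128] = [0xFE; 128];
/// let distance = array::distance(&a, &b);
/// // 0x12 ^ 0xFE = 0xEC has 5 set bits, so the distance is 5 * 128.
/// assert_eq!(distance, 640);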
/// ```
/// Compute Hamming distance from one source to many targets (one-to-many).
///
/// Faster than calling [`distance`] in a loop for one-to-many comparisons.
///
/// # Panics
///
/// Panics if `out.len() != targets.len()`.
///
/// # Example
///
/// ```
/// use hamming_bitwise_fast::array;
///
/// let source: [u8; 128] = [0; 128];
/// let targets = vec![[1u8; 128], [2u8; 128], [3u8; 128]];
/// let mut distances = vec![0u32; 3]; // pre-allocate and reuse
///
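/// array::batch(&source, &targets, &mut distances);
/// // Against an all-zero source, each distance is the target's popcount.
/// assert_eq!(distances, [128, 128, 256]);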
/// ```