Hamming Bitwise Fast
Fast bitwise Hamming distance for byte arrays/slices, using auto-vectorization with runtime SIMD detection on x86.
Note: This is for comparing bit-vectors, not for comparing strings.
On x86, enable LTO for best performance — see Performance on x86.
Usage
For variable-length slices:
use slice;
// Single comparison
let a: = vec!;
let b: = vec!;
let distance: u32 = distance; // 1024
// Batch comparison (one source vs many targets)
// Pre-allocate result vec once and reuse across calls for best performance
let source: = vec!;
let targets: = vec!;
let mut distances: = vec!;
batch; // [1024, 0]
For fixed-size arrays (recommended when size is known at compile time):
use array;
// Single comparison
let a: = ; // 1024-bit vectors
let b: = ;
let distance: u32 = distance; // 1024
// Batch comparison (one source vs many targets)
// Pre-allocate result vec once and reuse across calls for best performance
let source: = ;
let targets: = vec!;
let mut distances: = vec!;
batch; // [1024, 0]
API
array::distance/slice::distance— single comparisonarray::batch/slice::batch— batch comparison (one source vs many targets)
Benchmarks
Compared against other Hamming distance crates:
simsimd,
hamming,
triple_accel,
hamming_rs (Linux/Windows only; AVX2 on x86 for ≥1024 bytes, v0.2.25+ uses this crate's v1.0 algorithm as fallback)
Single Comparison
ARM (Apple M2 Max)
| Function | 64 bytes | 128 bytes | 256 bytes |
|---|---|---|---|
| array::distance | 1.2ns | 2.4ns | 4.3ns |
| slice::distance | 1.2ns | 2.4ns | 4.2ns |
| v1 (baseline) | 1.5ns | 3.3ns | 6.1ns |
| hamming | 3.9ns | 5.4ns | 19.3ns |
| simsimd | 4.3ns | 6.2ns | 10.1ns |
| triple_accel | 4.8ns | 9.7ns | 19.4ns |
x86 (AMD EPYC 9845 Zen 5, AVX-512 VPOPCNTDQ, with lto = true)
| Function | 64 bytes | 128 bytes | 256 bytes |
|---|---|---|---|
| array::distance | 1.8ns | 2.0ns | 2.8ns |
| slice::distance | 2.7ns | 2.8ns | 3.6ns |
| triple_accel | 2.7ns | 3.3ns | 4.7ns |
| simsimd | 3.2ns | 3.6ns | 4.7ns |
| v1 (baseline) | 4.3ns | 9.2ns | 18.2ns |
| hamming_rs | 4.3ns | 8.4ns | 18.4ns |
| hamming | 48ns | 96ns | 28ns¹ |
¹ The hamming crate's distance_fast hits a vectorized fast path at 256 bytes
when its input happens to be sufficiently aligned. The 64 B and 128 B numbers
reflect its slow path. Treat the 256 B cell as a best-case outlier.
x86 (AMD EPYC 7713 Zen 3, AVX2 only, with lto = true)
For users on Skylake-class Intel or pre-Genoa AMD CPUs that don't have AVX-512 VPOPCNTDQ. Multiversion picks the AVX2+POPCNT path.
| Function | 64 bytes | 128 bytes | 256 bytes |
|---|---|---|---|
| array::distance | 4.3ns | 6.8ns | 11.5ns |
| triple_accel | 5.4ns | 6.5ns² | 10.7ns² |
| hamming_rs | 5.7ns | 11.1ns | 24.8ns |
| v1 (baseline) | 5.9ns | 12.7ns | 23.2ns |
| slice::distance | 6.6ns | 9.8ns | 13.1ns |
| simsimd | 10.2ns | 14.7ns | 26.0ns |
| hamming | 63ns | 123ns | 46ns¹ |
² On Zen 3, triple_accel is competitive with array::distance at 128–256 B
single-call (within 5–8%). For one-shot AVX2-only workloads at these sizes,
either is reasonable. array::batch (below) still dominates by 1.5–2.5×
for the one-to-many case.
Batch Comparison (1000 comparisons)
The batch functions are faster for one-to-many comparisons.
ARM (Apple M2 Max)
| Function | 64 bytes | 128 bytes | 256 bytes |
|---|---|---|---|
| array::batch | 1.3µs | 2.2µs | 4.6µs |
| slice::batch | 1.9µs | 3.0µs | 5.4µs |
| hamming | 4.1µs | 5.6µs | 19.5µs |
| simsimd | 4.6µs | 6.6µs | 10.4µs |
| triple_accel | 7.2µs | 12.0µs | 21.6µs |
| v1 (baseline) | 5.1µs | 9.4µs | 12.3µs |
x86 (AMD EPYC 9845 Zen 5, AVX-512 VPOPCNTDQ, with lto = true)
| Function | 64 bytes | 128 bytes | 256 bytes |
|---|---|---|---|
| array::batch | 414ns | 814ns | 1.7µs |
| triple_accel | 4.0µs | 4.7µs | 6.5µs |
| simsimd | 5.1µs | 5.7µs | 7.5µs |
| slice::batch | 5.2µs | 6.3µs | 7.7µs |
| hamming_rs | 5.9µs | 10.3µs | 19.1µs |
| v1 (baseline) | 9.7µs | 10.8µs | 20.6µs |
| hamming | 49µs | 97µs | 32µs¹ |
¹ See the hamming footnote in the single-comparison table above.
x86 (AMD EPYC 7713 Zen 3, AVX2 only, with lto = true)
| Function | 64 bytes | 128 bytes | 256 bytes |
|---|---|---|---|
| array::batch | 2.7µs | 5.1µs | 10.1µs |
| slice::batch | 4.4µs | 6.9µs | 12.1µs |
| triple_accel | 7.4µs | 9.2µs | 11.9µs |
| simsimd | 9.4µs | 13.8µs | 23.0µs |
| hamming_rs | 9.9µs | 14.2µs | 30.7µs |
| v1 (baseline) | 13.9µs | 19.1µs | 34.7µs |
| hamming | 67µs | 143µs | 51µs¹ |
array::batch is the fastest option when sizes are known at compile time. For
runtime-sized inputs, slice::batch is competitive with simsimd and faster
than every other competitor, but is 5–13× slower than array::batch (see the
LTO note for why).
Running benchmarks
# Run competitor comparison
Methodology note: x86 numbers were measured on dedicated-CPU cloud instances; ARM numbers on a quiescent Apple M2 Max. Cloud-VM placement variance can shift sub-ns single-call timings by 5–15% between physical hosts even at the same SKU. Treat the relative ranking as reliable and the absolute numbers as a representative sample of a well-behaved host.
Performance on x86
Enable LTO
This is the single most impactful optimization. This crate relies on auto-vectorization — the compiler widens u64 operations into SIMD instructions (e.g., AVX-512 VPOPCNTDQ). Without LTO, the compiler can't see through the crate boundary to vectorize effectively.
[]
= true
Use full LTO (true), not thin ("thin") — thin LTO doesn't give LLVM enough
cross-module visibility for auto-vectorization.
Impact on single-call performance (128 bytes, AMD EPYC 9845 Zen 5)
| Configuration | array::distance |
slice::distance |
|---|---|---|
| Default (no LTO) | 6.1ns | 3.4ns |
Default + lto = true |
2.0ns (3.1x) | 2.8ns (1.2x) |
Impact on batch performance
LTO is not universally a win for batch workloads. It modestly improves
array::batch (a few hundred ns at most), but currently hurts slice::batch
by 2–3× because LLVM picks a less favorable vectorization shape when it can
see across the crate boundary.
128-byte vectors, 1000-comparison batch, AMD EPYC 9845 Zen 5:
| Configuration | array::batch |
slice::batch |
|---|---|---|
| Default (no LTO) | 1.10µs | 2.3µs |
Default + lto = true |
814ns (1.4x) | 6.3µs (2.7x slower) |
If slice::batch is your hot path and you can't switch to array::batch,
benchmark with and without LTO for your specific workload. When sizes are
known at compile time, array::batch is always the right answer.
Runtime SIMD detection
Runtime SIMD detection is enabled by default via the multiversion crate. Rust targets baseline x86-64 (SSE2 only) by default; the multiversion_x86 feature compiles multiple code paths and selects the fastest at runtime via CPUID — including AVX-512 VPOPCNTDQ on Intel Ice Lake (2019+) and AMD Zen 4+ (2022+) CPUs.
What the feature buys you
128-byte vectors, lto = true, AMD EPYC 9845 Zen 5:
| Configuration | array::distance |
array::batch (1000) |
|---|---|---|
default-features = false (SSE2 only) |
9.2ns | 8.9µs |
default (multiversion_x86) |
2.0ns (4.7x) | 814ns (11x) |
The dispatch overhead itself is negligible: when both paths produce the same
AVX-512 codegen (e.g., compiled with target-cpu=native), the multiversion'd
function and a statically-compiled equivalent are within 1% of each other.
The win above comes entirely from CPUID picking VPOPCNTDQ over SSE2.
To disable (e.g., for zero dependencies or when using -C target-cpu=native):
[]
= { = "1", = false }
Compile-time CPU targeting
If you know your target CPU, combine lto = true with RUSTFLAGS for the
fastest single-call performance:
# Best single-call performance, but binary only runs on identical CPUs
RUSTFLAGS="-C target-cpu=native"
# Requires AVX-512 (2017+ server CPUs)
RUSTFLAGS="-C target-cpu=x86-64-v4"
Binaries built with compile-time CPU targeting will crash with "illegal instruction" if run on a CPU that doesn't support the required features. For portable deployments, use the default
multiversion_x86feature instead.
Caveat for batch workloads: target-cpu=native makes single-call faster
(~45% on Zen 5) but currently makes the batch APIs slower at larger sizes.
On a 256-byte vector, 1000-comparison batch:
| Configuration | array::batch |
slice::batch |
|---|---|---|
default features + lto = true (multiversion) |
1.7µs | 7.7µs |
target-cpu=native + lto = true |
1.9µs (+7%) | 11.6µs (+50%) |
LLVM picks a different (worse) vectorization shape for batch loops when
compiled with native than when going through the multiversion dispatch
boundary. If batch throughput matters, prefer the default features over
target-cpu=native, or benchmark both for your specific workload.
License
This project is licensed under either of the following licenses, at your option:
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.