hamming-bitwise-fast 1.1.0

Fast bitwise Hamming distance using auto-vectorization with runtime SIMD detection on x86.
Documentation

Hamming Bitwise Fast

Fast bitwise Hamming distance for byte arrays/slices, using auto-vectorization with runtime SIMD detection on x86.

Note: This is for comparing bit-vectors, not for comparing strings.

On x86, enable LTO for best performance — see Performance on x86.

Usage

For variable-length slices:

use hamming_bitwise_fast::slice;

// Single comparison
let a: Vec<u8> = vec![0xFF; 128];
let b: Vec<u8> = vec![0x00; 128];
let distance: u32 = slice::distance(&a, &b); // 1024

// Batch comparison (one source vs many targets)
// Pre-allocate result vec once and reuse across calls for best performance
let source: Vec<u8> = vec![0x00; 128];
let targets: Vec<&[u8]> = vec![&a, &b];
let mut distances: Vec<u32> = vec![0; 2];
slice::batch(&source, &targets, &mut distances); // [1024, 0]

For fixed-size arrays (recommended when size is known at compile time):

use hamming_bitwise_fast::array;

// Single comparison
let a: [u8; 128] = [0xFF; 128];  // 1024-bit vectors
let b: [u8; 128] = [0x00; 128];
let distance: u32 = array::distance(&a, &b); // 1024

// Batch comparison (one source vs many targets)
// Pre-allocate result vec once and reuse across calls for best performance
let source: [u8; 128] = [0x00; 128];
let targets: Vec<[u8; 128]> = vec![a, b];
let mut distances: Vec<u32> = vec![0; 2];
array::batch(&source, &targets, &mut distances); // [1024, 0]

API

  • array::distance / slice::distance — single comparison
  • array::batch / slice::batch — batch comparison (one source vs many targets)

Benchmarks

Compared against other Hamming distance crates: simsimd, hamming, triple_accel, hamming_rs (Linux/Windows only; AVX2 on x86 for ≥1024 bytes, v0.2.25+ uses this crate's v1.0 algorithm as fallback)

Single Comparison

ARM (Apple M2 Max)

Function 64 bytes 128 bytes 256 bytes
array::distance 1.2ns 2.4ns 4.3ns
slice::distance 1.2ns 2.4ns 4.2ns
v1 (baseline) 1.5ns 3.3ns 6.1ns
hamming 3.9ns 5.4ns 19.3ns
simsimd 4.3ns 6.2ns 10.1ns
triple_accel 4.8ns 9.7ns 19.4ns

x86 (AMD EPYC 9845 Zen 5, AVX-512 VPOPCNTDQ, with lto = true)

Function 64 bytes 128 bytes 256 bytes
array::distance 1.8ns 2.0ns 2.8ns
slice::distance 2.7ns 2.8ns 3.6ns
triple_accel 2.7ns 3.3ns 4.7ns
simsimd 3.2ns 3.6ns 4.7ns
v1 (baseline) 4.3ns 9.2ns 18.2ns
hamming_rs 4.3ns 8.4ns 18.4ns
hamming 48ns 96ns 28ns¹

¹ The hamming crate's distance_fast hits a vectorized fast path at 256 bytes when its input happens to be sufficiently aligned. The 64 B and 128 B numbers reflect its slow path. Treat the 256 B cell as a best-case outlier.

x86 (AMD EPYC 7713 Zen 3, AVX2 only, with lto = true)

For users on Skylake-class Intel or pre-Genoa AMD CPUs that don't have AVX-512 VPOPCNTDQ. Multiversion picks the AVX2+POPCNT path.

Function 64 bytes 128 bytes 256 bytes
array::distance 4.3ns 6.8ns 11.5ns
triple_accel 5.4ns 6.5ns² 10.7ns²
hamming_rs 5.7ns 11.1ns 24.8ns
v1 (baseline) 5.9ns 12.7ns 23.2ns
slice::distance 6.6ns 9.8ns 13.1ns
simsimd 10.2ns 14.7ns 26.0ns
hamming 63ns 123ns 46ns¹

² On Zen 3, triple_accel is competitive with array::distance at 128–256 B single-call (within 5–8%). For one-shot AVX2-only workloads at these sizes, either is reasonable. array::batch (below) still dominates by 1.5–2.5× for the one-to-many case.

Batch Comparison (1000 comparisons)

The batch functions are faster for one-to-many comparisons.

ARM (Apple M2 Max)

Function 64 bytes 128 bytes 256 bytes
array::batch 1.3µs 2.2µs 4.6µs
slice::batch 1.9µs 3.0µs 5.4µs
hamming 4.1µs 5.6µs 19.5µs
simsimd 4.6µs 6.6µs 10.4µs
triple_accel 7.2µs 12.0µs 21.6µs
v1 (baseline) 5.1µs 9.4µs 12.3µs

x86 (AMD EPYC 9845 Zen 5, AVX-512 VPOPCNTDQ, with lto = true)

Function 64 bytes 128 bytes 256 bytes
array::batch 414ns 814ns 1.7µs
triple_accel 4.0µs 4.7µs 6.5µs
simsimd 5.1µs 5.7µs 7.5µs
slice::batch 5.2µs 6.3µs 7.7µs
hamming_rs 5.9µs 10.3µs 19.1µs
v1 (baseline) 9.7µs 10.8µs 20.6µs
hamming 49µs 97µs 32µs¹

¹ See the hamming footnote in the single-comparison table above.

x86 (AMD EPYC 7713 Zen 3, AVX2 only, with lto = true)

Function 64 bytes 128 bytes 256 bytes
array::batch 2.7µs 5.1µs 10.1µs
slice::batch 4.4µs 6.9µs 12.1µs
triple_accel 7.4µs 9.2µs 11.9µs
simsimd 9.4µs 13.8µs 23.0µs
hamming_rs 9.9µs 14.2µs 30.7µs
v1 (baseline) 13.9µs 19.1µs 34.7µs
hamming 67µs 143µs 51µs¹

array::batch is the fastest option when sizes are known at compile time. For runtime-sized inputs, slice::batch is competitive with simsimd and faster than every other competitor, but is 5–13× slower than array::batch (see the LTO note for why).

Running benchmarks

# Run competitor comparison
cargo bench --bench competitors

Methodology note: x86 numbers were measured on dedicated-CPU cloud instances; ARM numbers on a quiescent Apple M2 Max. Cloud-VM placement variance can shift sub-ns single-call timings by 5–15% between physical hosts even at the same SKU. Treat the relative ranking as reliable and the absolute numbers as a representative sample of a well-behaved host.

Performance on x86

Enable LTO

This is the single most impactful optimization. This crate relies on auto-vectorization — the compiler widens u64 operations into SIMD instructions (e.g., AVX-512 VPOPCNTDQ). Without LTO, the compiler can't see through the crate boundary to vectorize effectively.

[profile.release]
lto = true

Use full LTO (true), not thin ("thin") — thin LTO doesn't give LLVM enough cross-module visibility for auto-vectorization.

Impact on single-call performance (128 bytes, AMD EPYC 9845 Zen 5)

Configuration array::distance slice::distance
Default (no LTO) 6.1ns 3.4ns
Default + lto = true 2.0ns (3.1x) 2.8ns (1.2x)

Impact on batch performance

LTO is not universally a win for batch workloads. It modestly improves array::batch (a few hundred ns at most), but currently hurts slice::batch by 2–3× because LLVM picks a less favorable vectorization shape when it can see across the crate boundary.

128-byte vectors, 1000-comparison batch, AMD EPYC 9845 Zen 5:

Configuration array::batch slice::batch
Default (no LTO) 1.10µs 2.3µs
Default + lto = true 814ns (1.4x) 6.3µs (2.7x slower)

If slice::batch is your hot path and you can't switch to array::batch, benchmark with and without LTO for your specific workload. When sizes are known at compile time, array::batch is always the right answer.

Runtime SIMD detection

Runtime SIMD detection is enabled by default via the multiversion crate. Rust targets baseline x86-64 (SSE2 only) by default; the multiversion_x86 feature compiles multiple code paths and selects the fastest at runtime via CPUID — including AVX-512 VPOPCNTDQ on Intel Ice Lake (2019+) and AMD Zen 4+ (2022+) CPUs.

cargo add hamming-bitwise-fast

What the feature buys you

128-byte vectors, lto = true, AMD EPYC 9845 Zen 5:

Configuration array::distance array::batch (1000)
default-features = false (SSE2 only) 9.2ns 8.9µs
default (multiversion_x86) 2.0ns (4.7x) 814ns (11x)

The dispatch overhead itself is negligible: when both paths produce the same AVX-512 codegen (e.g., compiled with target-cpu=native), the multiversion'd function and a statically-compiled equivalent are within 1% of each other. The win above comes entirely from CPUID picking VPOPCNTDQ over SSE2.

To disable (e.g., for zero dependencies or when using -C target-cpu=native):

[dependencies]
hamming-bitwise-fast = { version = "1", default-features = false }

Compile-time CPU targeting

If you know your target CPU, combine lto = true with RUSTFLAGS for the fastest single-call performance:

# Best single-call performance, but binary only runs on identical CPUs
RUSTFLAGS="-C target-cpu=native" cargo build --release

# Requires AVX-512 (2017+ server CPUs)
RUSTFLAGS="-C target-cpu=x86-64-v4" cargo build --release

Binaries built with compile-time CPU targeting will crash with "illegal instruction" if run on a CPU that doesn't support the required features. For portable deployments, use the default multiversion_x86 feature instead.

Caveat for batch workloads: target-cpu=native makes single-call faster (~45% on Zen 5) but currently makes the batch APIs slower at larger sizes. On a 256-byte vector, 1000-comparison batch:

Configuration array::batch slice::batch
default features + lto = true (multiversion) 1.7µs 7.7µs
target-cpu=native + lto = true 1.9µs (+7%) 11.6µs (+50%)

LLVM picks a different (worse) vectorization shape for batch loops when compiled with native than when going through the multiversion dispatch boundary. If batch throughput matters, prefer the default features over target-cpu=native, or benchmark both for your specific workload.

License

This project is licensed under either of the following licenses, at your option:

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.