hamming-bitwise-fast 1.1.0

Fast bitwise Hamming distance using auto-vectorization with runtime SIMD detection on x86.
Documentation
# Hamming Bitwise Fast

> Fast bitwise Hamming distance for byte arrays/slices, using auto-vectorization
> with runtime SIMD detection on x86.

**Note:** This is for comparing bit-vectors, _not_ for comparing strings.

On x86, enable LTO for best performance — see [Performance on x86](#performance-on-x86).

## Usage

For variable-length slices:

```rust
use hamming_bitwise_fast::slice;

// Single comparison
let a: Vec<u8> = vec![0xFF; 128];
let b: Vec<u8> = vec![0x00; 128];
let distance: u32 = slice::distance(&a, &b); // 1024

// Batch comparison (one source vs many targets)
// Pre-allocate result vec once and reuse across calls for best performance
let source: Vec<u8> = vec![0x00; 128];
let targets: Vec<&[u8]> = vec![&a, &b];
let mut distances: Vec<u32> = vec![0; 2];
slice::batch(&source, &targets, &mut distances); // [1024, 0]
```

For fixed-size arrays (recommended when size is known at compile time):

```rust
use hamming_bitwise_fast::array;

// Single comparison
let a: [u8; 128] = [0xFF; 128];  // 1024-bit vectors
let b: [u8; 128] = [0x00; 128];
let distance: u32 = array::distance(&a, &b); // 1024

// Batch comparison (one source vs many targets)
// Pre-allocate result vec once and reuse across calls for best performance
let source: [u8; 128] = [0x00; 128];
let targets: Vec<[u8; 128]> = vec![a, b];
let mut distances: Vec<u32> = vec![0; 2];
array::batch(&source, &targets, &mut distances); // [1024, 0]
```

## API

- `array::distance` / `slice::distance` — single comparison
- `array::batch` / `slice::batch` — batch comparison (one source vs many targets)

## Benchmarks

Compared against other Hamming distance crates:
[`simsimd`](https://crates.io/crates/simsimd),
[`hamming`](https://crates.io/crates/hamming),
[`triple_accel`](https://crates.io/crates/triple_accel),
[`hamming_rs`](https://crates.io/crates/hamming_rs) (Linux/Windows only; AVX2 on x86 for ≥1024 bytes, v0.2.25+ uses this crate's v1.0 algorithm as fallback)

### Single Comparison

#### ARM (Apple M2 Max)

| Function            | 64 bytes  | 128 bytes | 256 bytes |
| ------------------- | --------- | --------- | --------- |
| **array::distance** | **1.2ns** | **2.4ns** | **4.3ns** |
| **slice::distance** | **1.2ns** | **2.4ns** | **4.2ns** |
| v1 (baseline)       | 1.5ns     | 3.3ns     | 6.1ns     |
| hamming             | 3.9ns     | 5.4ns     | 19.3ns    |
| simsimd             | 4.3ns     | 6.2ns     | 10.1ns    |
| triple_accel        | 4.8ns     | 9.7ns     | 19.4ns    |

#### x86 (AMD EPYC 9845 Zen 5, AVX-512 VPOPCNTDQ, with `lto = true`)

| Function            | 64 bytes   | 128 bytes  | 256 bytes  |
| ------------------- | ---------- | ---------- | ---------- |
| **array::distance** | **1.8ns**  | **2.0ns**  | **2.8ns**  |
| **slice::distance** | **2.7ns**  | **2.8ns**  | **3.6ns**  |
| triple_accel        | 2.7ns      | 3.3ns      | 4.7ns      |
| simsimd             | 3.2ns      | 3.6ns      | 4.7ns      |
| v1 (baseline)       | 4.3ns      | 9.2ns      | 18.2ns     |
| hamming_rs          | 4.3ns      | 8.4ns      | 18.4ns     |
| hamming             | 48ns       | 96ns       | 28ns¹      |

¹ The `hamming` crate's `distance_fast` hits a vectorized fast path at 256 bytes
when its input happens to be sufficiently aligned. The 64 B and 128 B numbers
reflect its slow path. Treat the 256 B cell as a best-case outlier.

#### x86 (AMD EPYC 7713 Zen 3, AVX2 only, with `lto = true`)

For users on Skylake-class Intel or pre-Genoa AMD CPUs that don't have
AVX-512 VPOPCNTDQ. Multiversion picks the AVX2+POPCNT path.

| Function            | 64 bytes   | 128 bytes  | 256 bytes  |
| ------------------- | ---------- | ---------- | ---------- |
| **array::distance** | **4.3ns**  | **6.8ns**  | **11.5ns** |
| triple_accel        | 5.4ns      | 6.5ns²     | 10.7ns²    |
| hamming_rs          | 5.7ns      | 11.1ns     | 24.8ns     |
| v1 (baseline)       | 5.9ns      | 12.7ns     | 23.2ns     |
| **slice::distance** | **6.6ns**  | **9.8ns**  | **13.1ns** |
| simsimd             | 10.2ns     | 14.7ns     | 26.0ns     |
| hamming             | 63ns       | 123ns      | 46ns¹      |

² On Zen 3, `triple_accel` is competitive with `array::distance` at 128–256 B
single-call (within 5–8%). For one-shot AVX2-only workloads at these sizes,
either is reasonable. `array::batch` (below) still dominates by 1.5–2.5×
for the one-to-many case.

### Batch Comparison (1000 comparisons)

The batch functions are faster for one-to-many comparisons.

#### ARM (Apple M2 Max)

| Function         | 64 bytes  | 128 bytes | 256 bytes |
| ---------------- | --------- | --------- | --------- |
| **array::batch** | **1.3µs** | **2.2µs** | **4.6µs** |
| **slice::batch** | **1.9µs** | **3.0µs** | **5.4µs** |
| hamming          | 4.1µs     | 5.6µs     | 19.5µs    |
| simsimd          | 4.6µs     | 6.6µs     | 10.4µs    |
| triple_accel     | 7.2µs     | 12.0µs    | 21.6µs    |
| v1 (baseline)    | 5.1µs     | 9.4µs     | 12.3µs    |

#### x86 (AMD EPYC 9845 Zen 5, AVX-512 VPOPCNTDQ, with `lto = true`)

| Function         | 64 bytes  | 128 bytes | 256 bytes |
| ---------------- | --------- | --------- | --------- |
| **array::batch** | **414ns** | **814ns** | **1.7µs** |
| triple_accel     | 4.0µs     | 4.7µs     | 6.5µs     |
| simsimd          | 5.1µs     | 5.7µs     | 7.5µs     |
| **slice::batch** | **5.2µs** | **6.3µs** | **7.7µs** |
| hamming_rs       | 5.9µs     | 10.3µs    | 19.1µs    |
| v1 (baseline)    | 9.7µs     | 10.8µs    | 20.6µs    |
| hamming          | 49µs      | 97µs      | 32µs¹     |

¹ See the `hamming` footnote in the single-comparison table above.

#### x86 (AMD EPYC 7713 Zen 3, AVX2 only, with `lto = true`)

| Function         | 64 bytes  | 128 bytes | 256 bytes |
| ---------------- | --------- | --------- | --------- |
| **array::batch** | **2.7µs** | **5.1µs** | **10.1µs**|
| **slice::batch** | **4.4µs** | **6.9µs** | **12.1µs**|
| triple_accel     | 7.4µs     | 9.2µs     | 11.9µs    |
| simsimd          | 9.4µs     | 13.8µs    | 23.0µs    |
| hamming_rs       | 9.9µs     | 14.2µs    | 30.7µs    |
| v1 (baseline)    | 13.9µs    | 19.1µs    | 34.7µs    |
| hamming          | 67µs      | 143µs     | 51µs¹     |

`array::batch` is the fastest option when sizes are known at compile time. For
runtime-sized inputs, `slice::batch` is competitive with `simsimd` and faster
than every other competitor, but is 5–13× slower than `array::batch` (see the
[LTO note](#impact-on-batch-performance) for why).

### Running benchmarks

```sh
# Run competitor comparison
cargo bench --bench competitors
```

> **Methodology note:** x86 numbers were measured on dedicated-CPU cloud
> instances; ARM numbers on a quiescent Apple M2 Max. Cloud-VM placement
> variance can shift sub-ns single-call timings by 5–15% between physical
> hosts even at the same SKU. Treat the relative ranking as reliable and
> the absolute numbers as a representative sample of a well-behaved host.

## Performance on x86

### Enable LTO

**This is the single most impactful optimization.** This crate relies on
auto-vectorization — the compiler widens u64 operations into SIMD instructions
(e.g., AVX-512 VPOPCNTDQ). Without LTO, the compiler can't see through the crate
boundary to vectorize effectively.

```toml
[profile.release]
lto = true
```

Use full LTO (`true`), not thin (`"thin"`) — thin LTO doesn't give LLVM enough
cross-module visibility for auto-vectorization.

#### Impact on single-call performance (128 bytes, AMD EPYC 9845 Zen 5)

| Configuration              | `array::distance` | `slice::distance` |
| -------------------------- | -----------------:| -----------------:|
| Default (no LTO)           | 6.1ns             | 3.4ns             |
| **Default + `lto = true`** | **2.0ns (3.1x)**  | **2.8ns (1.2x)**  |

#### Impact on batch performance

LTO is _not_ universally a win for batch workloads. It modestly improves
`array::batch` (a few hundred ns at most), but currently _hurts_ `slice::batch`
by 2–3× because LLVM picks a less favorable vectorization shape when it can
see across the crate boundary.

128-byte vectors, 1000-comparison batch, AMD EPYC 9845 Zen 5:

| Configuration              | `array::batch`   | `slice::batch`     |
| -------------------------- | ----------------:| ------------------:|
| Default (no LTO)           | 1.10µs           | **2.3µs**          |
| **Default + `lto = true`** | **814ns (1.4x)** | 6.3µs (**2.7x slower**) |

If `slice::batch` is your hot path and you can't switch to `array::batch`,
benchmark with and without LTO for your specific workload. When sizes are
known at compile time, `array::batch` is always the right answer.

### Runtime SIMD detection

Runtime SIMD detection is enabled by default via the [`multiversion`](https://crates.io/crates/multiversion) crate. Rust targets baseline x86-64 (SSE2 only) by default; the `multiversion_x86` feature compiles multiple code paths and selects the fastest at runtime via CPUID — including AVX-512 VPOPCNTDQ on Intel Ice Lake (2019+) and AMD Zen 4+ (2022+) CPUs.

```sh
cargo add hamming-bitwise-fast
```

#### What the feature buys you

128-byte vectors, `lto = true`, AMD EPYC 9845 Zen 5:

| Configuration                  | `array::distance` | `array::batch` (1000) |
| ------------------------------ | -----------------:| ---------------------:|
| `default-features = false` (SSE2 only) | 9.2ns     | 8.9µs                 |
| **default (`multiversion_x86`)**       | **2.0ns (4.7x)** | **814ns (11x)**       |

The dispatch overhead itself is negligible: when both paths produce the same
AVX-512 codegen (e.g., compiled with `target-cpu=native`), the multiversion'd
function and a statically-compiled equivalent are within 1% of each other.
The win above comes entirely from CPUID picking VPOPCNTDQ over SSE2.

To disable (e.g., for zero dependencies or when using `-C target-cpu=native`):

```toml
[dependencies]
hamming-bitwise-fast = { version = "1", default-features = false }
```

### Compile-time CPU targeting

If you know your target CPU, combine `lto = true` with `RUSTFLAGS` for the
fastest single-call performance:

```sh
# Best single-call performance, but binary only runs on identical CPUs
RUSTFLAGS="-C target-cpu=native" cargo build --release

# Requires AVX-512 (2017+ server CPUs)
RUSTFLAGS="-C target-cpu=x86-64-v4" cargo build --release
```

> Binaries built with compile-time CPU targeting will crash with "illegal instruction" if run on a CPU that doesn't support the required features. For portable deployments, use the default `multiversion_x86` feature instead.

**Caveat for batch workloads:** `target-cpu=native` makes single-call faster
(~45% on Zen 5) but currently makes the batch APIs _slower_ at larger sizes.
On a 256-byte vector, 1000-comparison batch:

| Configuration                  | `array::batch` | `slice::batch` |
| ------------------------------ | --------------:| --------------:|
| default features + `lto = true` (multiversion) | **1.7µs**  | **7.7µs**  |
| `target-cpu=native` + `lto = true`             | 1.9µs (+7%) | 11.6µs (+50%) |

LLVM picks a different (worse) vectorization shape for batch loops when
compiled with native than when going through the multiversion dispatch
boundary. If batch throughput matters, prefer the default features over
`target-cpu=native`, or benchmark both for your specific workload.

## License

This project is licensed under either of the following licenses, at your option:

- [Apache License, Version 2.0]LICENSE-APACHE
- [MIT License]LICENSE-MIT

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.