axhash 1.0.0

Simple Rust entrypoint for the AxHash engine.

# Algorithm

AxHash is a 64-bit non-cryptographic hash function optimized for HashMap and persistence workloads.

## Overview

- **Output:** 64-bit digest
- **State:** 64-bit accumulator (`acc`) + 128-bit sponge (for streaming)
- **Seed handling:** Raw seed is mixed with secret constants via `seed_lane()` before use
- **Finalization:** Each length-class branch ends with one or more `folded_multiply`
  operations as the integrated finalizer
- **Cross-device determinism:** All backends (scalar, AArch64 NEON, x86_64 AVX2)
  produce **bit-identical output** for the same input and seed, so hash values
  can be persisted, sharded, or transmitted across machines and still match
  when recomputed on a different CPU.
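
This section names `folded_multiply` without defining it. As a minimal sketch, assuming it follows the common construction of that name (a 64×64→128-bit multiply whose high and low halves are XORed together), it could look like the code below. The body is an assumption for illustration, not taken from the crate's source.

```rust
/// Assumed definition: widen both operands to 128 bits, multiply,
/// then XOR the low and high 64-bit halves of the product.
fn folded_multiply(a: u64, b: u64) -> u64 {
    let product = (a as u128) * (b as u128);
    (product as u64) ^ ((product >> 64) as u64)
}
```

For small operands the high half is zero, so `folded_multiply(3, 5)` is simply `15`; once the product exceeds 64 bits, the high half is folded back into the result instead of being discarded.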

## Input-Length Dispatch

`hash_bytes_core` selects a path based on input length:

| Length | Path | Backend |
|--------|------|---------|
| 0 | Return `acc` | |
| 1–16 | `hash_bytes_short` | Scalar (all platforms) |
| 17–32 | `hash_bytes_17_32` | Scalar (all platforms) |
| 33–64 | `hash_bytes_33_64` | Scalar (all platforms) |
| 65–128 | `hash_bytes_65_128` | Scalar (all platforms) |
| >128 | `hash_bytes_long` | Scalar / NEON / AVX2 — bit-identical |
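
The length classes in the table can be sketched as a single `match`. The `Path` enum and `select_path` function are hypothetical names used only for illustration; the real `hash_bytes_core` calls the per-class functions directly rather than returning a label.

```rust
/// Hypothetical label for the path that `hash_bytes_core` would take.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Path {
    Empty,      // 0 bytes: return `acc`
    Short,      // 1..=16:   hash_bytes_short
    Len17To32,  // 17..=32:  hash_bytes_17_32
    Len33To64,  // 33..=64:  hash_bytes_33_64
    Len65To128, // 65..=128: hash_bytes_65_128
    Long,       // >128:     hash_bytes_long (scalar / NEON / AVX2)
}

/// Map an input length to its length class, mirroring the table.
fn select_path(len: usize) -> Path {
    match len {
        0 => Path::Empty,
        1..=16 => Path::Short,
        17..=32 => Path::Len17To32,
        33..=64 => Path::Len33To64,
        65..=128 => Path::Len65To128,
        _ => Path::Long,
    }
}
```

Only the last arm can reach a SIMD backend, which is why the cross-backend parity concern applies to inputs longer than 128 bytes.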

## Short Paths (≤128 bytes)

All platforms share the same scalar implementation for short inputs. Each path loads fixed-width 64-bit chunks from the start and end of the buffer, mixes them with secret constants via `folded_multiply`, and XOR-rotates the intermediate results.

`hash_bytes_short` (≤16 bytes) has three sub-paths:
- `len ≥ 8` — two 64-bit loads (overlapping at `len-8`)
- `len == 4` — single `u32` fast path with const-folded length mixing
- `len ∈ {1,2,3,5,6,7}` — `read_partial_u64` (branchless byte gather for 1–3 bytes)
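
The overlapping-load and byte-gather ideas in the list above can be sketched as follows. The helper names (`read_u64_le`, `load_head_tail`, `gather_1_to_3`), the little-endian byte order, and the exact gather layout are assumptions for illustration; the crate's `read_partial_u64` may arrange the bytes differently.

```rust
/// Read 8 bytes starting at `offset` as a little-endian u64 (byte order assumed).
fn read_u64_le(data: &[u8], offset: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&data[offset..offset + 8]);
    u64::from_le_bytes(buf)
}

/// For 8 <= len <= 16: one word from the start and one ending at the last byte.
/// The two words overlap whenever len < 16, so every byte is covered
/// without a length-dependent loop.
fn load_head_tail(data: &[u8]) -> (u64, u64) {
    assert!((8..=16).contains(&data.len()));
    (read_u64_le(data, 0), read_u64_le(data, data.len() - 8))
}

/// Hypothetical gather for 1..=3 bytes: first, middle, and last byte.
/// The indices are computed arithmetically, so no branch on the exact length is needed.
fn gather_1_to_3(data: &[u8]) -> u64 {
    let len = data.len();
    assert!((1..=3).contains(&len));
    let first = data[0] as u64;
    let middle = data[len / 2] as u64;
    let last = data[len - 1] as u64;
    first | (middle << 8) | (last << 16)
}
```

With a 10-byte input, the head word covers bytes 0 to 7 and the tail word covers bytes 2 to 9, so bytes 2 to 7 are read twice. That redundancy is harmless because the length is mixed in separately.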

## Long Path (>128 bytes)

The long path uses an 8-accumulator mixing kernel inspired by XXH3. The
algorithm is defined in scalar arithmetic so that every backend can compute
the same value:

- 8 × `u64` accumulators, initialized from the seeded input accumulator and the
  expanded secret stream.
- 64-byte stripes. Per stripe, for each of 8 lanes:
  - `acc[i ^ 1] += data[i]` (cross-add the data word into the paired lane)
  - `acc[i]     += lo32(data[i] ^ secret[i]) * hi32(data[i] ^ secret[i])`
- Secret advances by 8 bytes per stripe within a 384-byte secret stream.
- Every 16 stripes (one block, 1024 bytes), a **scramble step** runs:
  - `acc[i] = (acc[i] ^ (acc[i] >> 47) ^ scramble_secret[i]) * PRIME32_1`
  - This injects non-linearity and stripe-position dependence; without it,
    stripe accumulation is commutative and sparse inputs collide.
- After the streaming stripes, the trailing 64 bytes are mixed with a fixed
  final-stripe secret offset, ensuring the tail influences every input length.
- Final merge:
  - 4 × `folded_multiply` (one per accumulator pair), which preserves more
    entropy than collapsing the pairs first
  - The four folds are summed together with the input length and the seeded
    accumulator
  - A SplitMix64 avalanche tail (`h ^= h >> 33; h *= K1; h ^= h >> 33; h *= K2; h ^= h >> 33`)
    spreads any remaining bias across all 64 output bits
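
The per-stripe update, the scramble step, and the avalanche tail described above can be written in scalar form roughly as follows. This is a sketch under stated assumptions: `PRIME32_1` is taken to be the xxHash constant of that name, wrapping arithmetic is assumed throughout, and `K1`/`K2` are passed in as parameters because their values are not given here. Secret offsets, tail handling, and the final merge are omitted.

```rust
const LANES: usize = 8;
/// Assumed to match xxHash's PRIME32_1.
const PRIME32_1: u64 = 0x9E37_79B1;

/// Mix one 64-byte stripe (already split into eight u64 words) into the accumulators.
fn accumulate_stripe(acc: &mut [u64; LANES], data: &[u64; LANES], secret: &[u64; LANES]) {
    for i in 0..LANES {
        let keyed = data[i] ^ secret[i];
        let lo = keyed & 0xFFFF_FFFF;
        let hi = keyed >> 32;
        // Cross-add the raw data word into the paired lane.
        acc[i ^ 1] = acc[i ^ 1].wrapping_add(data[i]);
        // 32x32->64 multiply of the keyed word's halves; the product cannot overflow.
        acc[i] = acc[i].wrapping_add(lo * hi);
    }
}

/// Scramble step, run once per 16-stripe block.
fn scramble(acc: &mut [u64; LANES], scramble_secret: &[u64; LANES]) {
    for i in 0..LANES {
        let x = acc[i] ^ (acc[i] >> 47) ^ scramble_secret[i];
        acc[i] = x.wrapping_mul(PRIME32_1);
    }
}

/// Avalanche tail with the shift/multiply pattern quoted above.
/// `k1` and `k2` stand in for the crate's K1 and K2 constants.
fn avalanche(mut h: u64, k1: u64, k2: u64) -> u64 {
    h ^= h >> 33;
    h = h.wrapping_mul(k1);
    h ^= h >> 33;
    h = h.wrapping_mul(k2);
    h ^= h >> 33;
    h
}
```

Because every operation here is an XOR, a shift, a wrapping add, or an exact 32×32→64 multiply, a SIMD backend that performs the same lane-wise steps has no room to diverge from the scalar result.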

### SMHasher3 validation

Validated against [SMHasher3](https://gitlab.com/fwojcik/smhasher3): **188/188 tests pass** on
AArch64 (verification values LE `0xF74A4F15`, BE `0xD864AC06`). The full
battery covers avalanche, bias, sparse/cyclic/permutation collision keysets,
two-byte differential, long text, and distribution tests up to 64 bits.

### Backend implementations

All three backends compute the same arithmetic:

- **Scalar** (`backend/scalar.rs`) — the canonical reference. All other
  backends must produce bit-identical output.
- **AArch64 NEON** (`backend/aarch64.rs`) — 4 × `uint64x2_t` accumulators,
  `vmull_u32` for 32×32→64, `veorq_u64`/`vaddq_u64` for XOR/ADD, `vextq_u64`
  for the pair swap.
- **x86_64 AVX2** (`backend/x86_64.rs`) — 2 × `__m256i` accumulators,
  `_mm256_mul_epu32` for 32×32→64, `_mm256_xor_si256`/`_mm256_add_epi64`,
  `_mm256_shuffle_epi32` for the pair swap.
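
To illustrate why the 32×32→64 multiply is safe to vectorize, the sketch below computes `lo32(x) * hi32(x)` for four lanes both in scalar code and with AVX2, where `_mm256_mul_epu32` multiplies the low 32 bits of each 64-bit lane. The function names are hypothetical and this is not the crate's `backend/x86_64.rs`; on targets other than x86_64, or without AVX2 at run time, the wrapper falls back to the scalar version.

```rust
/// Scalar reference: lo32(x) * hi32(x) for each 64-bit lane.
fn lane_products_scalar(keyed: &[u64; 4]) -> [u64; 4] {
    keyed.map(|x| (x & 0xFFFF_FFFF) * (x >> 32))
}

/// AVX2 version of the same arithmetic (hypothetical helper).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn lane_products_avx2(keyed: &[u64; 4]) -> [u64; 4] {
    use std::arch::x86_64::*;
    let mut out = [0u64; 4];
    unsafe {
        let v = _mm256_loadu_si256(keyed.as_ptr() as *const __m256i);
        // Move each lane's high 32 bits into the low half.
        let hi = _mm256_srli_epi64::<32>(v);
        // Multiplies the low 32 bits of each 64-bit lane: lo32(v) * lo32(hi).
        let prod = _mm256_mul_epu32(v, hi);
        _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, prod);
    }
    out
}

/// Use AVX2 when the CPU supports it, otherwise the scalar reference.
fn lane_products(keyed: &[u64; 4]) -> [u64; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 availability was checked on the line above.
            return unsafe { lane_products_avx2(keyed) };
        }
    }
    lane_products_scalar(keyed)
}
```

Both paths perform an exact unsigned multiply with no rounding or saturation, so the results agree bit for bit regardless of which one runs.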

### Why no AES instructions

Prior versions used `vaeseq_u8`/`vaesmcq_u8` on AArch64 and `_mm_aesenc_si128`
on x86_64 as mixing primitives. These are extremely fast, but the x86 and ARM
AES round functions are not bit-equivalent per round (the ordering of
`ShiftRows`, `SubBytes`, `MixColumns`, and `AddRoundKey` differs). Using them
made the hash output device-specific, which broke any use case beyond
in-process `HashMap` lookups. The current backends use only operations whose
results are exact across architectures: XOR, ADD, 32×32→64 multiply, lane
shuffles.

## Cross-backend parity tests

`tests/backend_parity.rs` verifies that the runtime-selected SIMD backend
produces the same hash as the scalar reference for every input length from
129 to 2048 bytes, across diverse seeds and byte patterns. CI runs these on
both AArch64 and x86_64 to detect any future regression.
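
The shape of such a parity check can be sketched as below. This is not the contents of `tests/backend_parity.rs`: the `toy_*` hash functions are stand-ins (an FNV-1a-style loop written two ways, plus a deliberately broken variant) so that the harness is runnable on its own, and `fill_pattern` and `first_mismatch` are hypothetical helpers.

```rust
/// Deterministic filler bytes, so the check needs no external RNG crate.
fn fill_pattern(len: usize, pattern_seed: u64) -> Vec<u8> {
    let mut state = pattern_seed;
    (0..len)
        .map(|_| {
            // LCG step; used only to vary the byte pattern.
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            (state >> 56) as u8
        })
        .collect()
}

/// Compare two backends on every length in 129..=2048 for several seeds.
/// Returns the first (length, seed) pair where they disagree, if any.
fn first_mismatch(
    reference: fn(&[u8], u64) -> u64,
    candidate: fn(&[u8], u64) -> u64,
) -> Option<(usize, u64)> {
    for seed in [0u64, 1, 0xDEAD_BEEF, u64::MAX] {
        for len in 129..=2048usize {
            let input = fill_pattern(len, seed ^ len as u64);
            if reference(&input, seed) != candidate(&input, seed) {
                return Some((len, seed));
            }
        }
    }
    None
}

// Stand-in "scalar reference" (FNV-1a-style loop), not AxHash.
fn toy_reference(data: &[u8], seed: u64) -> u64 {
    let mut h = seed;
    for &b in data {
        h = (h ^ b as u64).wrapping_mul(0x100_0000_01B3);
    }
    h
}

// Stand-in "optimized backend": same arithmetic, different code shape.
fn toy_candidate(data: &[u8], seed: u64) -> u64 {
    data.iter()
        .fold(seed, |h, &b| (h ^ b as u64).wrapping_mul(0x100_0000_01B3))
}

// Deliberately wrong at exactly one length, to show the harness catches it.
fn toy_broken(data: &[u8], seed: u64) -> u64 {
    toy_reference(data, seed) ^ (data.len() == 1000) as u64
}
```

Sweeping every length rather than sampling matters because SIMD bugs tend to sit at stripe and block boundaries, where a single off-by-one length is the only failing case.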