axhash 1.0.0

Simple Rust entrypoint for the AxHash engine.
# Architecture

Codebase structure for `axhash-core`. See [doc/algorithm.md](algorithm.md) for the hash function itself.

## Directory Layout

```
src/
├── lib.rs              # Re-exports + runtime API (RuntimeBackend, runtime_has_simd)
├── constants.rs        # SECRET, STRIPE_SECRET, SECRET_STREAM (384-byte expanded secret)
├── math.rs             # folded_multiply, avalanche, seed_lane
├── memory.rs           # Unsafe unaligned reads (r_u64, NEON r_u64x2 / r_u64x2_aligned)
├── backend/            # Long-input mixing backends — all produce bit-identical output
│   ├── mod.rs          # Dispatch (hash_bytes_core, selected_backend)
│   ├── scalar.rs       # All short paths + canonical long-input kernel (scramble, merge)
│   ├── x86_64.rs       # Portable AVX2 long-input path (no AES instructions)
│   └── aarch64.rs      # Portable NEON long-input path (no AES instructions)
├── hasher/             # Hasher trait + public API
│   ├── mod.rs          # Re-exports
│   ├── core.rs         # AxHasher struct + state
│   ├── build.rs        # AxBuildHasher + seed generation
│   ├── api.rs          # One-shot functions (axhash, axhash_seeded, ...)
│   └── trait_impl.rs   # std::hash::Hasher implementation
└── tests/              # Test modules
    ├── mod.rs          # Shared helpers (DemoRecord, chi_squared, ...)
    ├── determinism.rs
    ├── buildhasher.rs
    ├── trait_contract.rs
    ├── backend_parity.rs # Scalar vs native SIMD: bit-identity for every length 129..2048
    ├── lower_bits.rs
    ├── collisions.rs
    └── predictability.rs
```

## Design Principles

1. **Scalar is canonical.** `backend/scalar.rs` defines the algorithm. Every SIMD backend must produce **bit-identical** output to scalar for every input; otherwise hashes diverge across CPUs. Enforced by `tests/backend_parity.rs` over the full length range 129..2048 with diverse seeds and byte patterns.
2. **Backend dispatch.** `backend/mod.rs` selects scalar, AArch64 NEON, or x86_64 AVX2 at runtime (via `std::arch::is_aarch64_feature_detected!` / `is_x86_feature_detected!`) or at compile time for `no_std` builds (via `target_feature`).
3. **Short inputs are scalar-only.** Lengths ≤ 128 bytes use scalar-only paths (`hash_bytes_short`, `hash_bytes_17_32`, `hash_bytes_33_64`, `hash_bytes_65_128`). Because these paths never dispatch to SIMD code, they are cross-device deterministic by construction.
4. **Long inputs (> 128 bytes) use the same 8-accumulator kernel everywhere.** The kernel is built from operations whose results are exact across architectures: XOR, ADD, 32×32→64 multiply, lane shuffle, constant shift, 64×32 wrapping multiply (SIMD-emulated via two 32×32→64). **AES round instructions are deliberately not used** because x86 `aesenc` and ARM `aese`+`aesmc` are not bit-equivalent per round.
5. **No hidden state.** `AxHasher` is pure: same seed + same input → same output. No process-local randomization, no time-dependent salts.
6. **Validated against SMHasher3** (188/188 PASS on AArch64). The long path includes a scramble step every 16 stripes (1024 bytes) and a SplitMix64 avalanche tail in the final merge to satisfy strict zero-collision keysets (Sparse, OneByte, Long text).
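
The runtime selection described in principle 2 might look roughly like the sketch below. The `Backend` enum and `pick_backend` function are hypothetical names used for illustration, not the crate's actual `RuntimeBackend` or `selected_backend`; only the two `std::arch` detection macros are real standard-library APIs.

```rust
/// Hypothetical stand-in for the crate's backend selector (variant names assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Scalar,
    Neon,
    Avx2,
}

/// Pick the widest backend the current CPU supports, falling back to scalar.
/// Every backend is expected to return the same hash, so this choice only
/// affects speed, never output.
pub fn pick_backend() -> Backend {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Backend::Neon;
        }
    }
    Backend::Scalar
}
```

Because the scalar kernel is canonical, a caller could always force `Backend::Scalar` and still get the same hash values, only more slowly.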

## Key Data Structures

- `SECRET_STREAM: [u64; 48]` — 384 bytes of expanded secret, deterministically derived at compile time via SplitMix64 from `STRIPE_SECRET[0]`. Partitioned into four regions:
  - `[0..8]` — accumulator initialization
  - `[8..32]` — per-stripe secret window (advances 1 u64 per stripe, resets per block)
  - `[32..40]` — scramble step secret
  - `[40..48]` — final-stripe (trailing-64-bytes) secret
- `AxHasher { acc: u64, sponge: u128, sponge_bits: u8 }` — streaming state. The 128-bit sponge buffers primitive writes (`write_u32`, `write_u64`, `write_u128`) for batched flushing into `acc`.
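
As a minimal sketch of how a 48-word stream can be derived at compile time, the code below runs the standard SplitMix64 generator inside a `const fn`. The seed value `DEMO_SEED`, the names `expand_secret` and `DEMO_STREAM`, and the assumption that word `i` is simply the `i`-th SplitMix64 output are all illustrative; the crate's real derivation from `STRIPE_SECRET[0]` may differ.

```rust
use std::ops::Range;

/// One SplitMix64 step: returns (advanced state, output word).
const fn splitmix64(state: u64) -> (u64, u64) {
    let state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    (state, z ^ (z >> 31))
}

/// Expand a single seed into 48 u64 words (384 bytes), usable in const context.
const fn expand_secret(seed: u64) -> [u64; 48] {
    let mut out = [0u64; 48];
    let mut state = seed;
    let mut i = 0;
    while i < 48 {
        let (next, word) = splitmix64(state);
        state = next;
        out[i] = word;
        i += 1;
    }
    out
}

/// Placeholder seed, NOT the crate's `STRIPE_SECRET[0]`.
const DEMO_SEED: u64 = 0x0123_4567_89AB_CDEF;
pub const DEMO_STREAM: [u64; 48] = expand_secret(DEMO_SEED);

// The four regions listed above, as half-open index ranges.
pub const INIT: Range<usize> = 0..8;
pub const STRIPE_WINDOW: Range<usize> = 8..32;
pub const SCRAMBLE: Range<usize> = 32..40;
pub const FINAL: Range<usize> = 40..48;
```

Evaluating the expansion in a `const` means the table is baked into the binary, so no runtime initialization or synchronization is needed.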

## Long-Path Control Flow

```
init_acc(seeded_acc)                              # 8 u64 accumulators
secret_off = 8
for each 64-byte stripe in input:
    mix_stripe(acc, stripe_bytes, secret[secret_off..secret_off+8])
    secret_off += 1
    if 16 stripes processed since last scramble:
        scramble_acc(acc, secret[32..40])         # XOR-shift + XOR-secret + mul PRIME32_1
        secret_off = 8                            # restart secret per block
mix_stripe(acc, last_64_bytes, secret[40..48])    # trailing stripe (may overlap)
merge_acc(acc, len, seeded_acc)                   # 4 folds + sum + SplitMix64 avalanche
```
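
The bookkeeping in the pseudocode above (secret offset, scramble cadence, per-block reset) can be traced with a small sketch. `Step` and `schedule` are hypothetical helpers written for this illustration; they record which secret offset each stripe would use and where a scramble would fire, and they leave out the trailing stripe and the final merge.

```rust
/// Stripes between scrambles: 16 stripes of 64 bytes = 1024 bytes per block.
pub const STRIPES_PER_BLOCK: usize = 16;

/// One event in the long-path loop (illustrative type, not from the crate).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    /// `mix_stripe` using `secret[secret_off..secret_off + 8]`.
    Mix { secret_off: usize },
    /// `scramble_acc` using `secret[32..40]`.
    Scramble,
}

/// List the loop events for `stripes` iterations of the main stripe loop.
pub fn schedule(stripes: usize) -> Vec<Step> {
    let mut steps = Vec::new();
    let mut secret_off = 8;
    let mut since_scramble = 0;
    for _ in 0..stripes {
        steps.push(Step::Mix { secret_off });
        secret_off += 1;
        since_scramble += 1;
        if since_scramble == STRIPES_PER_BLOCK {
            steps.push(Step::Scramble);
            secret_off = 8; // restart the secret window for the next block
            since_scramble = 0;
        }
    }
    steps
}
```

Within a block the offset runs from 8 to 23, so the widest window read is `secret[23..31]`, which stays inside the `[8..32]` region.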

Each `mix_stripe` call updates 8 lanes (`acc[i^1] += data[i]`; `acc[i] += lo32(data[i]^secret[i]) * hi32(data[i]^secret[i])`, all additions wrapping), and its body maps directly to:
- Scalar: 8 iterations of the loop body
- NEON: 4 × `uint64x2_t` pair operations (`vmull_u32`, `veorq_u64`, `vaddq_u64`, `vextq_u64`)
- AVX2: 2 × `__m256i` quad operations (`_mm256_mul_epu32`, `_mm256_xor_si256`, `_mm256_add_epi64`, `_mm256_shuffle_epi32`)
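
As a rough scalar reference for the lane update described above, the loop might be written as follows. The signature and the little-endian word load are assumptions made for this sketch; the crate's `scalar.rs` reads words through its own `memory.rs` helpers and may differ in detail.

```rust
/// Scalar sketch of one 64-byte stripe mix over 8 accumulator lanes.
/// Assumes each 8-byte word of the stripe is loaded little-endian.
pub fn mix_stripe(acc: &mut [u64; 8], stripe: &[u8; 64], secret: &[u64; 8]) {
    for i in 0..8 {
        let mut word = [0u8; 8];
        word.copy_from_slice(&stripe[i * 8..i * 8 + 8]);
        let data = u64::from_le_bytes(word);
        let keyed = data ^ secret[i];

        // Cross-lane add: the raw data word goes into the neighbouring lane.
        acc[i ^ 1] = acc[i ^ 1].wrapping_add(data);

        // 32x32->64 multiply of the keyed word's halves; both factors are
        // below 2^32, so the product itself cannot overflow a u64.
        let product = (keyed & 0xFFFF_FFFF) * (keyed >> 32);
        acc[i] = acc[i].wrapping_add(product);
    }
}
```

Wrapping addition is commutative, so the order in which lanes are visited does not change the result. That is what allows the SIMD backends to process two or four lanes per instruction and still match the scalar output.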

Likewise, `scramble_acc` uses the same XOR/shift/multiply pattern in each backend, with the SIMD versions emulating 64×32 wrapping multiply via two 32×32→64 multiplies + shift + add.
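
The multiply emulation mentioned above rests on a simple identity: writing `x = hi * 2^32 + lo`, the wrapping product `x * p` equals `lo * p + ((hi * p) << 32)` modulo 2^64. The sketch below checks that identity in plain scalar Rust; `mul64x32_emulated` is a hypothetical name, and the real backends would express the same steps with SIMD intrinsics.

```rust
/// 64x32 wrapping multiply built from two 32x32->64 multiplies,
/// one shift, and one wrapping add.
pub fn mul64x32_emulated(x: u64, p: u32) -> u64 {
    let p = p as u64;
    let lo = (x & 0xFFFF_FFFF) * p; // low half times p, exact in 64 bits
    let hi = (x >> 32) * p;         // high half times p, exact in 64 bits
    // Shifting drops the bits of `hi` that would land above bit 63,
    // which matches the wrapping semantics of a full 64-bit multiply.
    lo.wrapping_add(hi << 32)
}
```

Both partial products fit in 64 bits, so each corresponds to a single widening 32-bit multiply such as `vmull_u32` or `_mm256_mul_epu32`, and the result agrees with `u64::wrapping_mul` for every input.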