# Architecture
Codebase structure for `axhash-core`. See [doc/algorithm.md](algorithm.md) for the hash function itself.
## Directory Layout
```
src/
├── lib.rs                  # Re-exports + runtime API (RuntimeBackend, runtime_has_simd)
├── constants.rs            # SECRET, STRIPE_SECRET, SECRET_STREAM (384-byte expanded secret)
├── math.rs                 # folded_multiply, avalanche, seed_lane
├── memory.rs               # Unsafe unaligned reads (r_u64, NEON r_u64x2 / r_u64x2_aligned)
├── backend/                # Long-input mixing backends; all produce bit-identical output
│   ├── mod.rs              # Dispatch (hash_bytes_core, selected_backend)
│   ├── scalar.rs           # All short paths + canonical long-input kernel (scramble, merge)
│   ├── x86_64.rs           # Portable AVX2 long-input path (no AES instructions)
│   └── aarch64.rs          # Portable NEON long-input path (no AES instructions)
├── hasher/                 # Hasher trait + public API
│   ├── mod.rs              # Re-exports
│   ├── core.rs             # AxHasher struct + state
│   ├── build.rs            # AxBuildHasher + seed generation
│   ├── api.rs              # One-shot functions (axhash, axhash_seeded, ...)
│   └── trait_impl.rs       # std::hash::Hasher implementation
└── tests/                  # Test modules
    ├── mod.rs              # Shared helpers (DemoRecord, chi_squared, ...)
    ├── determinism.rs
    ├── buildhasher.rs
    ├── trait_contract.rs
    ├── backend_parity.rs   # Scalar vs native SIMD: bit-identity for every length 129..2048
    ├── lower_bits.rs
    ├── collisions.rs
    └── predictability.rs
```
## Design Principles
1. **Scalar is canonical.** `backend/scalar.rs` defines the algorithm. Every SIMD backend must produce **bit-identical** output to scalar for every input; otherwise hashes diverge across CPUs. Enforced by `tests/backend_parity.rs` over the full length range 129..2048 with diverse seeds and byte patterns.
2. **Backend dispatch.** `backend/mod.rs` selects scalar, AArch64 NEON, or x86_64 AVX2 at runtime (via `std::arch::is_aarch64_feature_detected!` / `is_x86_feature_detected!`) or at compile time for `no_std` builds (via `target_feature`).
3. **Short inputs are scalar-only.** Lengths ≤ 128 bytes use scalar-only paths (`hash_bytes_short`, `hash_bytes_17_32`, `hash_bytes_33_64`, `hash_bytes_65_128`). Because they never touch SIMD, these paths were already deterministic across devices.
4. **Long inputs (> 128 bytes) use the same 8-accumulator kernel everywhere.** The kernel is built from operations whose results are exact across architectures: XOR, ADD, 32×32→64 multiply, lane shuffle, constant shift, 64×32 wrapping multiply (SIMD-emulated via two 32×32→64). **AES round instructions are deliberately not used** because x86 `aesenc` and ARM `aese`+`aesmc` are not bit-equivalent per round.
5. **No hidden state.** `AxHasher` is pure: same seed + same input → same output. No process-local randomization, no time-dependent salts.
6. **Validated against SMHasher3** (188/188 PASS on AArch64). The long path includes a scramble step every 16 stripes (1024 bytes) and a SplitMix64 avalanche tail in the final merge to satisfy strict zero-collision keysets (Sparse, OneByte, Long text).
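
The runtime dispatch from principle 2 can be sketched roughly as below. This is an illustration under stated assumptions, not the crate's code: the `Backend` enum and `select_backend` function are hypothetical stand-ins (the source names `RuntimeBackend` and `selected_backend` but does not show their definitions), and only the `std` path is covered.

```rust
/// Hypothetical backend tag used only for this sketch.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Scalar,
    Neon,
    Avx2,
}

/// Pick a long-input backend at runtime, falling back to scalar
/// when no supported SIMD extension is detected.
fn select_backend() -> Backend {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Backend::Neon;
        }
    }
    Backend::Scalar
}
```

Because every backend must produce bit-identical output, the value returned here should only ever affect speed, never the hash. In a `no_std` build the detection macros are unavailable, so the same choice would be made at compile time from `target_feature`.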
## Key data structures
- `SECRET_STREAM: [u64; 48]`: 384 bytes of expanded secret, deterministically derived at compile time via SplitMix64 from `STRIPE_SECRET[0]`. Partitioned into four regions:
  - `[0..8]`: accumulator initialization
  - `[8..32]`: per-stripe secret window (advances 1 u64 per stripe, resets per block)
  - `[32..40]`: scramble step secret
  - `[40..48]`: final-stripe (trailing-64-bytes) secret
- `AxHasher { acc: u64, sponge: u128, sponge_bits: u8 }`: streaming state. The 128-bit sponge buffers primitive writes (`write_u32`, `write_u64`, `write_u128`) for batched flushing into `acc`.
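
A compile-time expansion of this kind might look like the sketch below. It uses the standard SplitMix64 step; the exact derivation in `constants.rs` is not shown in this document, so `expand_secret`, `regions`, `DEMO_SEED`, and `DEMO_STREAM` are hypothetical names, and the seed is a placeholder rather than the real `STRIPE_SECRET[0]`.

```rust
/// One standard SplitMix64 step: returns (next_state, output).
const fn splitmix64(state: u64) -> (u64, u64) {
    let next = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = next;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    (next, z ^ (z >> 31))
}

/// Expand one seed word into 48 u64 values (384 bytes) in a const context.
const fn expand_secret(seed: u64) -> [u64; 48] {
    let mut out = [0u64; 48];
    let mut state = seed;
    let mut i = 0;
    while i < 48 {
        let (next, value) = splitmix64(state);
        state = next;
        out[i] = value;
        i += 1;
    }
    out
}

// Placeholder seed (assumption): the crate derives from `STRIPE_SECRET[0]`.
const DEMO_SEED: u64 = 0x0123_4567_89AB_CDEF;
const DEMO_STREAM: [u64; 48] = expand_secret(DEMO_SEED);

/// Split the stream into the regions listed above:
/// (accumulator init, per-stripe window, scramble secret, final-stripe secret).
fn regions(s: &[u64; 48]) -> (&[u64], &[u64], &[u64], &[u64]) {
    (&s[0..8], &s[8..32], &s[32..40], &s[40..48])
}
```

Evaluating the expansion in a `const` keeps the secret out of the hot path and guarantees every build sees the same 384 bytes.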
## Long-path control flow
```
init_acc(seeded_acc) # 8 u64 accumulators
secret_off = 8
for each 64-byte stripe in input:
    mix_stripe(acc, stripe_bytes, secret[secret_off..secret_off+8])
    secret_off += 1
    if 16 stripes processed since last scramble:
        scramble_acc(acc, secret[32..40]) # XOR-shift + XOR-secret + mul PRIME32_1
        secret_off = 8 # restart secret per block
mix_stripe(acc, last_64_bytes, secret[40..48]) # trailing stripe (may overlap)
merge_acc(acc, len, seeded_acc) # 4 folds + sum + SplitMix64 avalanche
```
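To make the secret-window bookkeeping concrete, the sketch below replays the control flow above and records which secret offset each stripe would use, without doing any hashing. `Step` and `schedule` are hypothetical names. The sketch also assumes the loop visits every complete 64-byte stripe (`len / 64`); the crate may treat lengths that are exact multiples of 64 differently.

```rust
/// Events of the long-path schedule, in execution order.
#[derive(Debug, PartialEq, Eq)]
enum Step {
    /// Mix stripe `index` with secret[secret_off..secret_off + 8].
    Mix { index: usize, secret_off: usize },
    /// Scramble the accumulators with secret[32..40].
    Scramble,
    /// Mix the last 64 bytes with secret[40..48].
    Tail,
}

const STRIPE_LEN: usize = 64;
const STRIPES_PER_BLOCK: usize = 16;
const WINDOW_START: usize = 8;

/// Replay the long-path loop for an input of `len` bytes.
fn schedule(len: usize) -> Vec<Step> {
    assert!(len > 128, "the long path only applies above 128 bytes");
    let mut steps = Vec::new();
    let mut secret_off = WINDOW_START;
    let mut since_scramble = 0;
    for index in 0..len / STRIPE_LEN {
        steps.push(Step::Mix { index, secret_off });
        secret_off += 1;
        since_scramble += 1;
        if since_scramble == STRIPES_PER_BLOCK {
            steps.push(Step::Scramble);
            secret_off = WINDOW_START; // restart the window for the next block
            since_scramble = 0;
        }
    }
    steps.push(Step::Tail);
    steps
}
```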
Each `mix_stripe` body (8 lanes × `acc[i^1] += data[i]`; `acc[i] += lo32(data^secret) * hi32(data^secret)`) maps directly to:
- Scalar: 8 iterations of the loop body
- NEON: 4 × `uint64x2_t` pair operations (`vmull_u32`, `veorq_u64`, `vaddq_u64`, `vextq_u64`)
- AVX2: 2 × `__m256i` quad operations (`_mm256_mul_epu32`, `_mm256_xor_si256`, `_mm256_add_epi64`, `_mm256_shuffle_epi32`)
Likewise, `scramble_acc` uses the same XOR/shift/multiply pattern in each backend, with the SIMD versions emulating 64×32 wrapping multiply via two 32×32→64 multiplies + shift + add.
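
As a rough scalar illustration of the two building blocks described above, the sketch below writes out the `mix_stripe` lane update and the 64×32 wrapping multiply built from two 32×32→64 products. The signatures are assumptions for this example (the real functions read bytes via `memory.rs` and are not shown here), wrapping addition is assumed to match SIMD lane adds, and `mul64x32_emulated` is a hypothetical name.

```rust
/// Scalar form of the stripe mix: 8 lanes, each feeding its neighbour
/// (`i ^ 1`) with raw data and itself with a 32x32->64 product.
fn mix_stripe(acc: &mut [u64; 8], data: &[u64; 8], secret: &[u64; 8]) {
    for i in 0..8 {
        let key = data[i] ^ secret[i];
        acc[i ^ 1] = acc[i ^ 1].wrapping_add(data[i]);
        // lo32 * hi32 always fits in 64 bits, so this product is exact.
        let product = (key & 0xFFFF_FFFF) * (key >> 32);
        acc[i] = acc[i].wrapping_add(product);
    }
}

/// Wrapping 64x32 multiply expressed with two 32x32->64 multiplies,
/// a shift, and an add, mirroring what the SIMD backends have to do.
fn mul64x32_emulated(x: u64, p: u32) -> u64 {
    let p = p as u64;
    let lo = (x & 0xFFFF_FFFF) * p; // low half times p, exact
    let hi = (x >> 32) * p; // high half times p, exact
    // Bits of `hi` shifted past bit 63 are discarded, which is the wrap.
    lo.wrapping_add(hi << 32)
}
```

For any `x` and `p`, `mul64x32_emulated(x, p)` should equal `x.wrapping_mul(p as u64)`, which is the property that lets the scalar and SIMD scramble steps stay bit-identical.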