innr 0.1.8

SIMD-accelerated vector similarity primitives (dot, cosine, norm, maxsim, matryoshka, clifford rotors)
# innr

[![crates.io](https://img.shields.io/crates/v/innr.svg)](https://crates.io/crates/innr)
[![Documentation](https://docs.rs/innr/badge.svg)](https://docs.rs/innr)
[![CI](https://github.com/arclabs561/innr/actions/workflows/ci.yml/badge.svg)](https://github.com/arclabs561/innr/actions/workflows/ci.yml)

Vector similarity primitives with SIMD dispatch (AVX-512, AVX2+FMA, NEON). Pure Rust, zero dependencies, MSRV 1.75.

Computes dot product, cosine similarity, L2/L1 distance, binary/ternary quantized distances, ColBERT MaxSim, Matryoshka prefix similarity, and batch k-NN over columnar layouts. Runtime CPU detection picks the widest available ISA -- no build-time flags required.

## Quickstart

```toml
[dependencies]
innr = "0.1.7"
```

```rust
use innr::{dot, cosine, norm};

let a = [1.0_f32, 0.0, 0.0];
let b = [0.707, 0.707, 0.0];

let d = dot(&a, &b);      // 0.707
let c = cosine(&a, &b);   // 0.707
let n = norm(&a);         // 1.0
```

## Operations

### Core (always available)

| Function | Description |
|----------|-------------|
| `dot`, `dot_portable` | Inner product (SIMD / portable) |
| `cosine` | Cosine similarity |
| `norm` | L2 norm |
| `l2_distance` | Euclidean distance |
| `l2_distance_squared` | Squared Euclidean distance (avoids sqrt) |
| `l1_distance` | Manhattan distance |
| `angular_distance` | Angular distance (arccos-based) |
| `pool_mean` | Mean pooling over a set of vectors |
| `bilinear` | Scaled dot product (`phi^T * psi / sqrt(d)`) |
| `geometric_outer_product` | Tensor (outer) product of two vectors |
| `metric_residual` | MRN distance (symmetric + asymmetric components) |

### Matryoshka embeddings

| Function | Description |
|----------|-------------|
| `matryoshka_dot` | Dot product on a prefix of the embedding |
| `matryoshka_cosine` | Cosine similarity on a prefix of the embedding |
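
The prefix idea can be sketched in plain Rust. This is an illustration only, not the crate's implementation: `prefix_dot_sketch` and `prefix_cosine_sketch` are hypothetical names, and the clamping and zero-norm handling are assumptions.

```rust
/// Dot product over the first `dim` components of both vectors.
/// Hypothetical helper for illustration; not part of innr.
fn prefix_dot_sketch(a: &[f32], b: &[f32], dim: usize) -> f32 {
    // Clamp so an oversized `dim` falls back to the shared length.
    let d = dim.min(a.len()).min(b.len());
    a[..d].iter().zip(&b[..d]).map(|(x, y)| x * y).sum()
}

/// Cosine similarity over the same prefix.
/// Returns 0.0 when either prefix has zero norm (an assumption of this sketch).
fn prefix_cosine_sketch(a: &[f32], b: &[f32], dim: usize) -> f32 {
    let dot = prefix_dot_sketch(a, b, dim);
    let norm_a = prefix_dot_sketch(a, a, dim).sqrt();
    let norm_b = prefix_dot_sketch(b, b, dim).sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}
```

Note that the prefix has to be re-normalized for cosine: a vector that is unit length in full dimension is generally not unit length on its prefix.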

### Binary quantization (1-bit)

| Type / Function | Description |
|-----------------|-------------|
| `encode_binary` | Quantize `f32` vector to packed bits |
| `PackedBinary` | Packed bit-vector type |
| `binary_dot` | Dot product on packed binary vectors |
| `binary_hamming` | Hamming distance |
| `binary_jaccard` | Jaccard similarity |

### Ternary quantization (1.58-bit)

| Type / Function | Description |
|-----------------|-------------|
| `ternary::encode_ternary` | Quantize `f32` to {-1, 0, +1} |
| `ternary::PackedTernary` | Packed ternary vector type |
| `ternary::ternary_dot` | Inner product on packed ternary vectors |
| `ternary::ternary_hamming` | Hamming distance on ternary vectors |
| `ternary::asymmetric_dot` | Float query x ternary doc product |
| `ternary::sparsity` | Fraction of zero entries |
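
A minimal, unpacked sketch of the ternary idea follows. It stores one `i8` per dimension for readability, whereas the crate packs its values. The symmetric `threshold` parameter and all function names are assumptions for illustration.

```rust
/// Map each component to -1, 0, or +1 using a symmetric threshold.
/// The thresholding rule is an assumption of this sketch.
fn ternarize_sketch(v: &[f32], threshold: f32) -> Vec<i8> {
    v.iter()
        .map(|&x| {
            if x > threshold {
                1
            } else if x < -threshold {
                -1
            } else {
                0
            }
        })
        .collect()
}

/// Float query against a ternary document: only additions and subtractions.
fn asymmetric_dot_sketch(query: &[f32], doc: &[i8]) -> f32 {
    query
        .iter()
        .zip(doc)
        .map(|(&q, &t)| match t {
            1 => q,
            -1 => -q,
            _ => 0.0,
        })
        .sum()
}

/// Fraction of zero entries in a ternary vector.
fn zero_fraction_sketch(doc: &[i8]) -> f32 {
    if doc.is_empty() {
        return 0.0;
    }
    doc.iter().filter(|&&t| t == 0).count() as f32 / doc.len() as f32
}
```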

### Fast approximate math

| Function | Description |
|----------|-------------|
| `fast_cosine` | Approximate cosine via `fast_rsqrt` |
| `fast_rsqrt` | Fast inverse square root (hardware rsqrt + Newton-Raphson) |
| `fast_rsqrt_precise` | Two-iteration Newton-Raphson variant |
| `fast_cosine_distance` | `1 - fast_cosine` |
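
The Newton-Raphson refinement can be illustrated without hardware intrinsics. The sketch below swaps the hardware rsqrt estimate for the classic bit-level initial guess, so it is a stand-in for the technique and not the crate's code path. All names are hypothetical, and the input is assumed to be a positive, finite, normal `f32`.

```rust
/// Rough initial estimate of 1/sqrt(x) via the well-known bit trick.
/// Stand-in for a hardware rsqrt instruction; valid for positive normal x only.
fn rsqrt_estimate_sketch(x: f32) -> f32 {
    let bits = 0x5f37_59df_u32.wrapping_sub(x.to_bits() >> 1);
    f32::from_bits(bits)
}

/// One Newton-Raphson step for f(y) = 1/y^2 - x:
/// y_next = y * (1.5 - 0.5 * x * y^2).
fn newton_step_sketch(x: f32, y: f32) -> f32 {
    y * (1.5 - 0.5 * x * y * y)
}

/// Estimate followed by `iterations` refinement steps.
fn rsqrt_approx_sketch(x: f32, iterations: u32) -> f32 {
    let mut y = rsqrt_estimate_sketch(x);
    for _ in 0..iterations {
        y = newton_step_sketch(x, y);
    }
    y
}
```

Each step roughly squares the relative error, which is why a second iteration is enough for a "precise" variant.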

### Batch operations (PDX-style columnar layout)

| Type / Function | Description |
|-----------------|-------------|
| `batch::VerticalBatch` | Columnar (SoA) vector store |
| `batch::batch_dot` | Batch dot products against a query |
| `batch::batch_l2_squared` | Batch squared L2 distances |
| `batch::batch_cosine` | Batch cosine similarities |
| `batch::batch_norms` | Norms for all vectors in the batch |
| `batch::batch_knn` | Exact k-NN over a batch |
| `batch::batch_knn_adaptive` | Adaptive early-exit k-NN |
| `batch::batch_l2_squared_pruning` | Batch L2 with early termination |
| `batch::BatchKnnResult` | k-NN result (indices + distances) |
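
To make the columnar layout concrete, here is a small sketch of a structure-of-arrays store and a batch dot product over it. `ColumnarSketch` and its methods are hypothetical and only mirror the idea behind `batch::VerticalBatch`; the real type's fields and constructors are not shown in this README.

```rust
/// Column-major store: `data[d * len + i]` holds dimension `d` of vector `i`.
/// Hypothetical type for illustration only.
struct ColumnarSketch {
    dim: usize,
    len: usize,
    data: Vec<f32>,
}

impl ColumnarSketch {
    /// Transpose row-major vectors into column-major order.
    fn from_rows(rows: &[Vec<f32>]) -> Self {
        let len = rows.len();
        let dim = rows.first().map_or(0, |r| r.len());
        let mut data = vec![0.0; dim * len];
        for (i, row) in rows.iter().enumerate() {
            assert_eq!(row.len(), dim, "all rows must share one dimension");
            for (d, &x) in row.iter().enumerate() {
                data[d * len + i] = x;
            }
        }
        Self { dim, len, data }
    }

    /// Dot product of `query` against every stored vector.
    /// The inner loop walks one contiguous column, which is easy to vectorize.
    fn dot_all(&self, query: &[f32]) -> Vec<f32> {
        assert_eq!(query.len(), self.dim);
        let mut scores = vec![0.0f32; self.len];
        if self.len == 0 {
            return scores;
        }
        for (d, column) in self.data.chunks_exact(self.len).enumerate() {
            let q = query[d];
            for (s, &x) in scores.iter_mut().zip(column) {
                *s += q * x;
            }
        }
        scores
    }
}
```

Because partial scores for all vectors are available after each dimension, this layout also suits early-termination schemes such as the pruning variants listed above.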

### Metric traits

| Trait | Description |
|-------|-------------|
| `SymmetricMetric` | Symmetric distance interface (`d(a,b) = d(b,a)`) |
| `Quasimetric` | Directed distance interface (`d(a,b)` may differ from `d(b,a)`) |

### Clifford algebra

| Type / Function | Description |
|-----------------|-------------|
| `clifford::Rotor2D` | 2D rotor (even subalgebra of Cl(2)) |
| `clifford::wedge_2d` | 2D wedge (outer) product |
| `clifford::geometric_product_2d` | 2D geometric product (scalar + bivector) |
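
For two 2D vectors the geometric product splits into a scalar part (the dot product) and a bivector part (the wedge product). The sketch below shows that decomposition using plain arrays. The function names and signatures are hypothetical and are not the `clifford` module's API.

```rust
/// Wedge product of two 2D vectors: the coefficient of the e1^e2 bivector,
/// equal to the signed area of the parallelogram they span.
fn wedge_sketch(a: [f32; 2], b: [f32; 2]) -> f32 {
    a[0] * b[1] - a[1] * b[0]
}

/// Geometric product ab = a.b + a^b, returned as (scalar, bivector).
fn geometric_product_sketch(a: [f32; 2], b: [f32; 2]) -> (f32, f32) {
    let scalar = a[0] * b[0] + a[1] * b[1];
    (scalar, wedge_sketch(a, b))
}
```

The scalar part is symmetric in its arguments and the bivector part is antisymmetric, so swapping `a` and `b` flips only the sign of the second component.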

### Feature-gated

| Function | Feature | Description |
|----------|---------|-------------|
| `sparse_dot`, `sparse_dot_portable` | `sparse` | Sparse vector dot (sorted-index merge) |
| `sparse_maxsim` | `sparse` | Sparse MaxSim scoring |
| `maxsim`, `maxsim_cosine` | `maxsim` | ColBERT late interaction scoring |
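
A sorted-index merge for sparse dot products can be sketched as a two-pointer walk. The parallel index/value slices and the name `sparse_dot_merge_sketch` are assumptions for illustration; the crate's `sparse_dot` signature is not shown in this README.

```rust
use std::cmp::Ordering;

/// Dot product of two sparse vectors stored as (sorted indices, values).
/// Both index slices must be strictly increasing.
fn sparse_dot_merge_sketch(
    idx_a: &[u32],
    val_a: &[f32],
    idx_b: &[u32],
    val_b: &[f32],
) -> f32 {
    let (mut i, mut j) = (0usize, 0usize);
    let mut acc = 0.0f32;
    while i < idx_a.len() && j < idx_b.len() {
        match idx_a[i].cmp(&idx_b[j]) {
            // Advance whichever side has the smaller index.
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            // Matching index: accumulate the product and advance both.
            Ordering::Equal => {
                acc += val_a[i] * val_b[j];
                i += 1;
                j += 1;
            }
        }
    }
    acc
}
```

The walk touches each entry at most once, so the cost is linear in the number of non-zeros and independent of the nominal dimension.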

## SIMD Dispatch

| Architecture | Instructions | Detection |
|--------------|--------------|-----------|
| x86_64 | AVX-512F | Runtime |
| x86_64 | AVX2 + FMA | Runtime |
| aarch64 | NEON | Always |
| Other | Portable fallback | None (relies on LLVM auto-vectorization) |

Vectors with fewer than 16 dimensions always use the portable code path.
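
The dispatch order in the table can be sketched with the standard library's feature-detection macro. This only reports which path would be chosen. `choose_backend_sketch` and its string labels are hypothetical, and the crate's internal dispatch logic may be organized differently.

```rust
/// Return a label for the code path the table above would select.
/// Hypothetical helper; it mirrors the documented order, not innr's internals.
fn choose_backend_sketch(len: usize) -> &'static str {
    // Short vectors skip SIMD entirely.
    if len < 16 {
        return "portable";
    }
    #[cfg(target_arch = "x86_64")]
    {
        // Check the widest instruction set first.
        if std::arch::is_x86_feature_detected!("avx512f") {
            return "avx512f";
        }
        if std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma")
        {
            return "avx2+fma";
        }
    }
    // NEON is a baseline feature on aarch64, so no runtime check is needed.
    if cfg!(target_arch = "aarch64") {
        "neon"
    } else {
        "portable"
    }
}
```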

## Features

- `sparse` -- sparse vector operations
- `maxsim` -- ColBERT late interaction scoring
- `full` -- all features

## Performance

![Benchmark throughput](docs/bench_throughput.png)

*Measured on Apple Silicon (NEON). Run `cargo bench` to reproduce on your hardware.*

For maximum performance, build with native CPU features:

```bash
RUSTFLAGS="-C target-cpu=native" cargo build --release
```

Or target a fixed x86_64 baseline so the binary stays portable across machines:

```bash
# AVX2 (89% of x86_64 CPUs)
RUSTFLAGS="-C target-cpu=x86-64-v3" cargo build --release

# SSE2 only (100% compatible)
RUSTFLAGS="-C target-cpu=x86-64" cargo build --release
```

Run benchmarks:

```bash
cargo bench
```

Generate flamegraphs (requires `cargo-flamegraph`):

```bash
./scripts/profile.sh dense
```

## Examples

[**01_basic_ops.rs**](examples/01_basic_ops.rs) -- The three core similarity metrics (dot product, cosine, L2 distance) and their mathematical relationships. Verifies the identity `L2^2(a,b) = 2(1 - cosine(a,b))` for unit-normalized vectors, which means cosine and L2 produce the same ranking on normalized data.
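
The identity follows from expanding `|a - b|^2 = |a|^2 + |b|^2 - 2 a.b` with `|a| = |b| = 1`. A short numerical check in plain Rust, using hypothetical helper names so it does not depend on the crate, might look like this:

```rust
/// Plain dot product (reference implementation for this sketch).
fn dot_sketch(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Scale a vector to unit length; a zero vector is returned unchanged.
fn normalize_sketch(v: &[f32]) -> Vec<f32> {
    let norm = dot_sketch(v, v).sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

/// Squared Euclidean distance.
fn l2_squared_sketch(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Difference between the two sides of the identity for normalized inputs.
/// For unit vectors, the dot product equals the cosine.
fn identity_gap_sketch(a: &[f32], b: &[f32]) -> f32 {
    let (a, b) = (normalize_sketch(a), normalize_sketch(b));
    (l2_squared_sketch(&a, &b) - 2.0 * (1.0 - dot_sketch(&a, &b))).abs()
}
```

For any pair of non-zero inputs, `identity_gap_sketch` should come out at floating-point rounding level.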

[**batch_demo.rs**](examples/batch_demo.rs) -- PDX-style columnar layout for batch retrieval. Transposes 10K vectors (128d) into column-major order, runs 100 queries, and verifies k-NN results against brute-force. Demonstrates the cache-friendly memory access pattern that enables auto-vectorization.

[**binary_demo.rs**](examples/binary_demo.rs) -- Binary (1-bit) quantization for first-stage retrieval. Quantizes 384d vectors to packed bits (32x memory reduction: about 48 MB vs 1.5 GB of `f32` data for 1M documents), computes Hamming distance and binary dot product, and measures recall@10 against full-precision search.

[**fast_math_demo.rs**](examples/fast_math_demo.rs) -- Newton-Raphson rsqrt approximation for fast cosine similarity. Benchmarks the hot path in ANN search (640 distance calls per query in HNSW at 1M scale), measures 3-10x speedup over standard cosine at `<1e-4` error, and shows architecture-specific SIMD dispatch.

[**matryoshka_search.rs**](examples/matryoshka_search.rs) -- Two-stage retrieval using Matryoshka embeddings. Uses a 128d prefix for coarse filtering (100 candidates from 10K corpus), then rescores with full 768d vectors to produce the final top-10. Measures recall and speedup vs single-stage search.
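
The two-stage flow can be sketched in a few lines: score everything on a short prefix, keep the best candidates, then rescore only those with the full vectors. This sketch uses dot product as the score and hypothetical function names. It is not the example file's code.

```rust
/// Dot product over the first `dim` components.
fn dot_prefix_sketch(a: &[f32], b: &[f32], dim: usize) -> f32 {
    a.iter().zip(b).take(dim).map(|(x, y)| x * y).sum()
}

/// Keep the `k` highest-scoring (index, score) pairs, best first.
fn top_k_sketch(mut scored: Vec<(usize, f32)>, k: usize) -> Vec<(usize, f32)> {
    scored.sort_by(|l, r| r.1.total_cmp(&l.1));
    scored.truncate(k);
    scored
}

/// Coarse pass on a prefix, then exact rescoring of the surviving candidates.
fn two_stage_search_sketch(
    corpus: &[Vec<f32>],
    query: &[f32],
    prefix_dim: usize,
    candidates: usize,
    k: usize,
) -> Vec<(usize, f32)> {
    let coarse: Vec<(usize, f32)> = corpus
        .iter()
        .enumerate()
        .map(|(i, v)| (i, dot_prefix_sketch(v, query, prefix_dim)))
        .collect();
    let shortlist = top_k_sketch(coarse, candidates);
    let rescored: Vec<(usize, f32)> = shortlist
        .into_iter()
        .map(|(i, _)| (i, dot_prefix_sketch(&corpus[i], query, query.len())))
        .collect();
    top_k_sketch(rescored, k)
}
```

Recall depends on the shortlist size: a document that ranks poorly on the prefix is dropped before the full-dimension pass can recover it.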

[**maxsim_colbert.rs**](examples/maxsim_colbert.rs) -- ColBERT-style late interaction scoring. Computes MaxSim (sum of per-query-token maximum similarities across document tokens) for 32 query tokens x 128 doc tokens at 128d. Demonstrates non-commutativity and batch scoring of 1000 documents.
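
MaxSim as described here, the sum over query tokens of the best-matching document token, can be sketched as follows. The function names are hypothetical, dot product is assumed as the token similarity, and an empty document is scored as 0.0 by assumption.

```rust
/// Plain dot product between two token embeddings.
fn token_dot_sketch(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Sum over query tokens of the maximum similarity to any document token.
fn maxsim_sketch(query_tokens: &[Vec<f32>], doc_tokens: &[Vec<f32>]) -> f32 {
    if doc_tokens.is_empty() {
        return 0.0;
    }
    query_tokens
        .iter()
        .map(|q| {
            doc_tokens
                .iter()
                .map(|d| token_dot_sketch(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}
```

Swapping the arguments changes which side is summed over and which side is maximized over, which is why the score is not commutative.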

[**ternary_demo.rs**](examples/ternary_demo.rs) -- Ternary (1.58-bit) quantization for extreme compression. Quantizes 768d vectors to `{-1, 0, +1}` (16-20x memory reduction: roughly 150-190 MB vs 3.1 GB of `f32` data for 1M documents), measures recall trade-offs, and analyzes sparsity patterns.

```bash
cargo run --example 01_basic_ops
cargo run --example batch_demo
cargo run --example binary_demo
cargo run --example fast_math_demo
cargo run --example matryoshka_search
cargo run --example maxsim_colbert --features maxsim
cargo run --example ternary_demo
```

## Tests

```bash
cargo test -p innr
```

## License

Dual-licensed under MIT or Apache-2.0.