
innr

crates.io Documentation CI

Vector similarity primitives with SIMD dispatch (AVX-512, AVX2+FMA, NEON). Pure Rust, zero dependencies, MSRV 1.75.

Computes dot product, cosine similarity, L2/L1 distance, binary/ternary quantized distances, ColBERT MaxSim, Matryoshka prefix similarity, and batch k-NN over columnar layouts. Runtime CPU detection picks the widest available ISA -- no build-time flags required.

Quickstart

[dependencies]
innr = "0.1.9"

use innr::{dot, cosine, norm};

let a = [1.0_f32, 0.0, 0.0];
let b = [0.707, 0.707, 0.0];

let d = dot(&a, &b);      // 0.707
let c = cosine(&a, &b);   // 0.707
let n = norm(&a);         // 1.0

Operations

Core (always available)

Function Description
dot, dot_portable Inner product (SIMD / portable)
cosine Cosine similarity
norm L2 norm
l2_distance Euclidean distance
l2_distance_squared Squared Euclidean distance (avoids sqrt)
l1_distance Manhattan distance
angular_distance Angular distance (arccos-based)
pool_mean Mean pooling over a set of vectors
bilinear Scaled dot product (phi^T * psi / sqrt(d))
geometric_outer_product Tensor (outer) product of two vectors
metric_residual MRN distance (symmetric + asymmetric components)

Matryoshka embeddings

Function Description
matryoshka_dot Dot product on a prefix of the embedding
matryoshka_cosine Cosine similarity on a prefix of the embedding
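
Matryoshka similarity is just the core operation restricted to a leading prefix of the dimensions. A portable sketch of the semantics (the function name and signature here are illustrative, not innr's actual API; see the docs for the real one):

```rust
/// Dot product over the first `prefix` dimensions only.
/// Sketch of the semantics; innr's matryoshka_dot dispatches to SIMD.
fn matryoshka_dot_sketch(a: &[f32], b: &[f32], prefix: usize) -> f32 {
    a[..prefix].iter().zip(&b[..prefix]).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0, 4.0];
    let b = [1.0_f32, 1.0, 1.0, 1.0];
    // Only the first two dimensions contribute.
    assert_eq!(matryoshka_dot_sketch(&a, &b, 2), 3.0);
}
```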

Binary quantization (1-bit)

Type / Function Description
encode_binary Quantize f32 vector to packed bits
PackedBinary Packed bit-vector type
binary_dot Dot product on packed binary vectors
binary_hamming Hamming distance
binary_jaccard Jaccard similarity
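
The idea behind 1-bit quantization: keep only the sign of each component, pack signs into machine words, and compare with XOR + popcount. A minimal sketch (innr's packed layout and API may differ):

```rust
/// Sign-quantize f32s into packed u64 words (1 bit per dimension).
/// Illustrates the idea behind encode_binary; innr's layout may differ.
fn pack_signs(v: &[f32]) -> Vec<u64> {
    let mut words = vec![0u64; (v.len() + 63) / 64];
    for (i, &x) in v.iter().enumerate() {
        if x > 0.0 {
            words[i / 64] |= 1 << (i % 64);
        }
    }
    words
}

/// Hamming distance = popcount of the XOR of the packed words.
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn main() {
    let a = pack_signs(&[0.3, -0.1, 0.9, -0.5]);
    let b = pack_signs(&[0.2, 0.4, 0.8, -0.7]);
    assert_eq!(hamming(&a, &b), 1); // the vectors disagree only in dimension 1
    // For ±1 vectors of dimension d: dot = d - 2 * hamming.
    assert_eq!(4 - 2 * hamming(&a, &b) as i32, 2);
}
```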

Ternary quantization (1.58-bit)

Type / Function Description
ternary::encode_ternary Quantize f32 to {-1, 0, +1}
ternary::PackedTernary Packed ternary vector type
ternary::ternary_dot Inner product on packed ternary vectors
ternary::ternary_hamming Hamming distance on ternary vectors
ternary::asymmetric_dot Float query x ternary doc product
ternary::sparsity Fraction of zero entries
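
Ternary quantization thresholds each component to {-1, 0, +1}, trading a little precision for sparsity. A sketch of the semantics using an unpacked i8 representation (the real `ternary::PackedTernary` packs 2 bits per entry, and `encode_ternary`'s threshold handling may differ):

```rust
/// Quantize to {-1, 0, +1} with a magnitude threshold. Illustrative only.
fn encode_ternary_sketch(v: &[f32], threshold: f32) -> Vec<i8> {
    v.iter()
        .map(|&x| if x > threshold { 1 } else if x < -threshold { -1 } else { 0 })
        .collect()
}

fn ternary_dot_sketch(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// Fraction of zero entries, as reported by ternary::sparsity.
fn sparsity_sketch(v: &[i8]) -> f32 {
    v.iter().filter(|&&x| x == 0).count() as f32 / v.len() as f32
}

fn main() {
    let t = encode_ternary_sketch(&[0.9, -0.05, -0.7, 0.01], 0.1);
    assert_eq!(t, vec![1, 0, -1, 0]);
    assert_eq!(sparsity_sketch(&t), 0.5); // half the entries quantize to zero
    assert_eq!(ternary_dot_sketch(&t, &[1, 1, 1, 1]), 0);
}
```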

Fast approximate math

Function Description
fast_cosine Approximate cosine via fast_rsqrt
fast_rsqrt Fast inverse square root (hardware rsqrt + Newton-Raphson)
fast_rsqrt_precise Two-iteration Newton-Raphson variant
fast_cosine_distance 1 - fast_cosine
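
The refinement step behind `fast_rsqrt` is a Newton-Raphson iteration on an initial estimate. innr seeds it with hardware rsqrt instructions where available; this portable sketch uses the classic bit-trick seed instead, to show the iteration itself:

```rust
/// Bit-trick inverse square root with one Newton-Raphson step.
/// Portable sketch; innr seeds from hardware rsqrt estimates instead.
fn fast_rsqrt_sketch(x: f32) -> f32 {
    let i = 0x5f3759df - (x.to_bits() >> 1); // crude initial estimate
    let y = f32::from_bits(i);
    // One Newton-Raphson iteration: y' = y * (1.5 - 0.5 * x * y * y)
    y * (1.5 - 0.5 * x * y * y)
}

fn main() {
    let approx = fast_rsqrt_sketch(4.0);
    assert!((approx - 0.5).abs() < 1e-2); // exact value is 0.5
}
```

A second iteration (as in `fast_rsqrt_precise`) roughly squares the relative error again.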

Batch operations (PDX-style columnar layout)

Type / Function Description
batch::VerticalBatch Columnar (SoA) vector store
batch::batch_dot Batch dot products against a query
batch::batch_l2_squared Batch squared L2 distances
batch::batch_cosine Batch cosine similarities
batch::batch_norms Norms for all vectors in the batch
batch::batch_knn Exact k-NN over a batch
batch::batch_knn_adaptive Adaptive early-exit k-NN
batch::batch_l2_squared_pruning Batch L2 with early termination
batch::BatchKnnResult k-NN result (indices + distances)
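
The columnar (SoA) layout stores dimension d of every vector contiguously, so the inner loop of a batch operation streams through memory and auto-vectorizes. A simplified sketch of the idea behind `batch::VerticalBatch` (types and method names here are illustrative):

```rust
/// Column-major (SoA) storage: dims[d][v] = component d of vector v.
struct VerticalSketch {
    dims: Vec<Vec<f32>>,
    n: usize,
}

impl VerticalSketch {
    fn from_rows(rows: &[Vec<f32>]) -> Self {
        let d = rows[0].len();
        let dims = (0..d)
            .map(|j| rows.iter().map(|r| r[j]).collect())
            .collect();
        Self { dims, n: rows.len() }
    }

    /// Batch dot: accumulate one dimension at a time across all vectors.
    /// The inner loop is contiguous, which is what enables auto-vectorization.
    fn batch_dot(&self, query: &[f32]) -> Vec<f32> {
        let mut out = vec![0.0; self.n];
        for (col, &q) in self.dims.iter().zip(query) {
            for (o, &x) in out.iter_mut().zip(col) {
                *o += q * x;
            }
        }
        out
    }
}

fn main() {
    let rows = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![1.0, 1.0]];
    let batch = VerticalSketch::from_rows(&rows);
    assert_eq!(batch.batch_dot(&[2.0, 3.0]), vec![2.0, 3.0, 5.0]);
}
```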

Metric traits

Trait Description
SymmetricMetric Symmetric distance interface (d(a,b) = d(b,a))
Quasimetric Directed distance interface (d(a,b) may differ from d(b,a))
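
The distinction in practice: a symmetric metric promises d(a,b) == d(b,a), a quasimetric does not. A hypothetical shape for the symmetric case (see the API docs for the real trait definitions):

```rust
/// Hypothetical trait shape, for illustration only.
trait SymmetricMetricSketch {
    fn distance(&self, a: &[f32], b: &[f32]) -> f32;
}

struct L2;
impl SymmetricMetricSketch for L2 {
    fn distance(&self, a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
    }
}

fn main() {
    let m = L2;
    let (a, b) = ([0.0_f32, 0.0], [3.0_f32, 4.0]);
    assert_eq!(m.distance(&a, &b), m.distance(&b, &a)); // symmetry holds for L2
    assert_eq!(m.distance(&a, &b), 5.0);
}
```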

Clifford algebra

Type / Function Description
clifford::Rotor2D 2D rotor (even subalgebra of Cl(2))
clifford::wedge_2d 2D wedge (outer) product
clifford::geometric_product_2d 2D geometric product (scalar + bivector)
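
The math behind these: the geometric product of two 2D vectors is ab = a·b + a∧b, a scalar (inner product) plus a bivector (wedge product), and that (scalar, bivector) pair is exactly a 2D rotor. A portable sketch (function name illustrative, not innr's signature):

```rust
/// Geometric product of two 2D vectors: returns (scalar, bivector).
fn geometric_product_2d_sketch(a: [f32; 2], b: [f32; 2]) -> (f32, f32) {
    let scalar = a[0] * b[0] + a[1] * b[1];   // inner product a·b
    let bivector = a[0] * b[1] - a[1] * b[0]; // wedge product a∧b
    (scalar, bivector)
}

fn main() {
    // Orthogonal unit vectors: pure bivector (a quarter-turn rotor, up to sign).
    assert_eq!(geometric_product_2d_sketch([1.0, 0.0], [0.0, 1.0]), (0.0, 1.0));
    // Parallel vectors: pure scalar, zero wedge.
    assert_eq!(geometric_product_2d_sketch([2.0, 0.0], [3.0, 0.0]), (6.0, 0.0));
}
```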

Feature-gated

Function Feature Description
sparse_dot, sparse_dot_portable sparse Sparse vector dot (sorted-index merge)
sparse_maxsim sparse Sparse MaxSim scoring
maxsim, maxsim_cosine maxsim ColBERT late interaction scoring
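
MaxSim scores a multi-vector query against a multi-vector document: for each query token, take the maximum similarity over all document tokens, then sum over query tokens. A portable sketch of the scoring (dot product as the token similarity; innr's `maxsim` signature may differ):

```rust
/// ColBERT-style late interaction: sum over query tokens of the max
/// dot product against any document token.
fn maxsim_sketch(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| q.iter().zip(d).map(|(x, y)| x * y).sum::<f32>())
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}

fn main() {
    let query = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let doc = vec![vec![0.9, 0.1], vec![0.2, 0.8]];
    // Query token 0 best-matches doc token 0 (0.9), token 1 matches doc token 1 (0.8).
    assert!((maxsim_sketch(&query, &doc) - 1.7).abs() < 1e-6);
    // Note the roles are asymmetric: swapping query and doc can change the score.
}
```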

SIMD Dispatch

Architecture Instructions Detection
x86_64 AVX-512F Runtime
x86_64 AVX2 + FMA Runtime
aarch64 NEON Always
Other Portable LLVM auto-vec

Vectors < 16 dimensions use portable code.
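
The dispatch pattern in a nutshell: probe CPU features at runtime and route to the widest kernel, falling back to portable code otherwise. A sketch using the standard library's feature-detection macro (innr caches the detection; the kernel calls here are placeholders):

```rust
/// Runtime-dispatched dot product, sketched. On x86_64 the feature checks
/// select a SIMD kernel; everywhere else (and for short vectors) the
/// portable loop runs.
fn dot_dispatch(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx512f") {
            // would call the AVX-512 kernel here
        } else if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // would call the AVX2+FMA kernel here
        }
    }
    // portable fallback
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    assert_eq!(dot_dispatch(&[1.0, 2.0], &[3.0, 4.0]), 11.0);
}
```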

Features

  • sparse -- sparse vector operations
  • maxsim -- ColBERT late interaction scoring
  • full -- all features

Performance

Benchmark throughput measured on Apple Silicon (NEON). Run cargo bench to reproduce on your hardware.

For maximum performance, build with native CPU features:

RUSTFLAGS="-C target-cpu=native" cargo build --release

Or specify a portable baseline with SIMD:

# AVX2 (89% of x86_64 CPUs)
RUSTFLAGS="-C target-cpu=x86-64-v3" cargo build --release

# SSE2 only (100% compatible)
RUSTFLAGS="-C target-cpu=x86-64" cargo build --release

Run benchmarks:

cargo bench

Generate flamegraphs (requires cargo-flamegraph):

./scripts/profile.sh dense

Examples

01_basic_ops.rs -- The three core similarity metrics (dot product, cosine, L2 distance) and their mathematical relationships. Proves the identity L2^2(a,b) = 2(1 - cosine(a,b)) for normalized vectors, showing that cosine and L2 are interchangeable for ranking.

batch_demo.rs -- PDX-style columnar layout for batch retrieval. Transposes 10K vectors (128d) into column-major order, runs 100 queries, and verifies k-NN results against brute-force. Demonstrates the cache-friendly memory access pattern that enables auto-vectorization.

binary_demo.rs -- Binary (1-bit) quantization for first-stage retrieval. Quantizes 384d vectors to packed bits (32x memory reduction: 150 MB vs 4.6 GB for 1M documents), computes Hamming distance and binary dot product, and measures recall@10 against full-precision search.

fast_math_demo.rs -- Newton-Raphson rsqrt approximation for fast cosine similarity. Benchmarks the hot path in ANN search (640 distance calls per query in HNSW at 1M scale), measures 3-10x speedup over standard cosine at <1e-4 error, and shows architecture-specific SIMD dispatch.

matryoshka_search.rs -- Two-stage retrieval using Matryoshka embeddings. Uses a 128d prefix for coarse filtering (100 candidates from 10K corpus), then rescores with full 768d vectors to produce the final top-10. Measures recall and speedup vs single-stage search.

maxsim_colbert.rs -- ColBERT-style late interaction scoring. Computes MaxSim (sum of per-query-token maximum similarities across document tokens) for 32 query tokens x 128 doc tokens at 128d. Demonstrates non-commutativity and batch scoring of 1000 documents.

ternary_demo.rs -- Ternary (1.58-bit) quantization for extreme compression. Quantizes 768d vectors to {-1, 0, +1} (16-20x memory reduction: 90 MB vs 3.1 GB for 1M documents), measures recall trade-offs, and analyzes sparsity patterns.

cargo run --example 01_basic_ops
cargo run --example batch_demo
cargo run --example binary_demo
cargo run --example fast_math_demo
cargo run --example matryoshka_search
cargo run --example maxsim_colbert --features maxsim
cargo run --example ternary_demo
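
The identity quoted for 01_basic_ops can be checked in a few lines of portable Rust, without the crate:

```rust
fn dot_p(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

fn main() {
    // Unit vectors, so cosine(a, b) = dot(a, b).
    let a = [0.6_f32, 0.8, 0.0];
    let b = [1.0_f32, 0.0, 0.0];
    // ||a - b||^2 = 2 * (1 - cos) holds for normalized vectors,
    // which is why cosine and L2 produce the same ranking.
    assert!((l2_sq(&a, &b) - 2.0 * (1.0 - dot_p(&a, &b))).abs() < 1e-6);
}
```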

Tests

cargo test -p innr

License

Dual-licensed under MIT or Apache-2.0.