innr 0.2.0

SIMD-accelerated vector similarity primitives (dot, cosine, norm, maxsim, matryoshka)

innr


Vector similarity primitives with SIMD dispatch (AVX-512, AVX2+FMA, NEON). Pure Rust, zero dependencies, MSRV 1.75.

Computes dot product, cosine similarity, L2/L1 distance, binary/ternary/scalar quantized distances, ColBERT MaxSim, Matryoshka prefix similarity, and batch k-NN (L2, cosine, dot, filtered) over columnar layouts. Runtime CPU detection picks the widest available ISA -- no build-time flags required.

Quickstart

[dependencies]
innr = "0.2.0"

use innr::{dot, cosine, norm};

let a = [1.0_f32, 0.0, 0.0];
let b = [0.707, 0.707, 0.0];

let d = dot(&a, &b);      // 0.707
let c = cosine(&a, &b);   // 0.707
let n = norm(&a);         // 1.0

Operations

Core

Function                 Description
dot, dot_portable        Inner product (SIMD / portable)
cosine, cosine_portable  Cosine similarity (single-pass fused SIMD)
norm                     L2 norm
l2_distance              Euclidean distance
l2_distance_squared      Squared Euclidean distance (avoids sqrt)
l1_distance              Manhattan distance (SIMD-accelerated)
angular_distance         Angular distance (arccos-based)
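
To illustrate what "single-pass fused" means for cosine, here is a minimal portable sketch. The function names are hypothetical and this is not the crate's SIMD code; the zero-vector behavior and the radian output of the angular distance are assumptions.

```rust
/// Single-pass cosine: accumulates a.b, |a|^2 and |b|^2 in the same loop,
/// so each input is read once. Hypothetical helper, not the crate's code.
fn cosine_single_pass(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let (mut ab, mut aa, mut bb) = (0.0_f32, 0.0_f32, 0.0_f32);
    for (&x, &y) in a.iter().zip(b.iter()) {
        ab += x * y;
        aa += x * x;
        bb += y * y;
    }
    let denom = (aa * bb).sqrt();
    // Assumption: a zero vector gives similarity 0.0 instead of NaN.
    if denom == 0.0 { 0.0 } else { ab / denom }
}

/// Angular distance as arccos of the clamped cosine, in radians (assumed unit).
fn angular_distance_sketch(a: &[f32], b: &[f32]) -> f32 {
    cosine_single_pass(a, b).clamp(-1.0, 1.0).acos()
}
```

The clamp guards against rounding pushing the cosine slightly outside [-1, 1], where acos would return NaN.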

Matryoshka embeddings

Function           Description
matryoshka_dot     Dot product on a prefix of the embedding
matryoshka_cosine  Cosine similarity on a prefix of the embedding
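
A rough sketch of prefix similarity, using hypothetical helper names rather than the crate's signatures. It assumes an over-long prefix is clamped to the vector length, which may differ from what innr does.

```rust
/// Dot product over the first `prefix_len` dimensions only.
fn prefix_dot(a: &[f32], b: &[f32], prefix_len: usize) -> f32 {
    // Assumption: an over-long prefix is clamped to the shorter vector.
    let n = prefix_len.min(a.len()).min(b.len());
    a[..n].iter().zip(&b[..n]).map(|(x, y)| x * y).sum()
}

/// Cosine over the prefix. Norms are recomputed on the truncated vectors,
/// since a prefix of a unit vector is generally not unit length.
fn prefix_cosine(a: &[f32], b: &[f32], prefix_len: usize) -> f32 {
    let n = prefix_len.min(a.len()).min(b.len());
    let (a, b) = (&a[..n], &b[..n]);
    let ab: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { ab / (na * nb) }
}
```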

Binary quantization (1-bit)

Type / Function  Description
encode_binary    Quantize f32 vector to packed bits
PackedBinary     Packed bit-vector type
binary_dot       Dot product on packed binary vectors
binary_hamming   Hamming distance
binary_jaccard   Jaccard similarity
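
The idea behind 1-bit quantization can be sketched as follows. The names, the `u64` word layout, and the `> 0.0` sign threshold are assumptions for illustration; `PackedBinary` may be laid out differently.

```rust
/// Packs sign bits into u64 words: bit i is set when v[i] > 0.0 (assumed threshold).
fn pack_sign_bits(v: &[f32]) -> Vec<u64> {
    let mut words = vec![0u64; (v.len() + 63) / 64];
    for (i, &x) in v.iter().enumerate() {
        if x > 0.0 {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Hamming distance: number of differing bits, via XOR + popcount.
fn hamming_sketch(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Jaccard similarity: |A AND B| / |A OR B| over set bits.
fn jaccard_sketch(a: &[u64], b: &[u64]) -> f32 {
    let both: u32 = a.iter().zip(b).map(|(x, y)| (x & y).count_ones()).sum();
    let either: u32 = a.iter().zip(b).map(|(x, y)| (x | y).count_ones()).sum();
    // Assumption: two all-zero vectors have similarity 0.0.
    if either == 0 { 0.0 } else { both as f32 / either as f32 }
}
```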

Ternary quantization (1.58-bit)

Type / Function           Description
ternary::encode_ternary   Quantize f32 to {-1, 0, +1}
ternary::PackedTernary    Packed ternary vector type
ternary::ternary_dot      Inner product on packed ternary vectors
ternary::ternary_hamming  Hamming distance on ternary vectors
ternary::asymmetric_dot   Float query x ternary doc product
ternary::sparsity         Fraction of zero entries
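
A minimal sketch of ternary encoding and the asymmetric product. Trits are kept unpacked as `i8` for readability, whereas the crate packs them. The function names and the symmetric `threshold` parameter are hypothetical.

```rust
/// Maps each value to -1, 0, or +1 using a symmetric dead zone of width `threshold`.
fn encode_ternary_sketch(v: &[f32], threshold: f32) -> Vec<i8> {
    v.iter()
        .map(|&x| if x > threshold { 1 } else if x < -threshold { -1 } else { 0 })
        .collect()
}

/// Float query x ternary doc: no multiplications, only adds and subtracts.
fn asymmetric_dot_sketch(query: &[f32], doc: &[i8]) -> f32 {
    query
        .iter()
        .zip(doc)
        .map(|(&q, &t)| match t {
            1 => q,
            -1 => -q,
            _ => 0.0,
        })
        .sum()
}

/// Fraction of zero trits.
fn sparsity_sketch(doc: &[i8]) -> f32 {
    if doc.is_empty() {
        return 0.0;
    }
    doc.iter().filter(|&&t| t == 0).count() as f32 / doc.len() as f32
}
```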

Scalar quantization (uint8)

Type / Function             Description
scalar::QuantizationParams  Per-collection scale/offset (from fit() or from_range())
scalar::QuantizedU8         Packed u8 vector (4x compression over f32)
scalar::quantize_u8         Quantize f32 vector to u8
scalar::asymmetric_dot_u8   f32 query x u8 doc dot product (SIMD-accelerated)
scalar::query_context       Precompute query sum for batch scoring
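
Why precomputing the query sum helps: under affine quantization x ≈ offset + scale * code, so dot(q, x) ≈ scale * Σ q_i * code_i + offset * Σ q_i, and the second term depends only on the query. The sketch below assumes this affine scheme with round-to-nearest; the struct and function names are hypothetical, not innr's types.

```rust
/// Hypothetical stand-in for per-collection quantization parameters.
struct ParamsSketch {
    scale: f32,
    offset: f32,
}

fn params_from_range(min: f32, max: f32) -> ParamsSketch {
    // Guard against a degenerate range (max == min).
    let scale = ((max - min) / 255.0).max(f32::MIN_POSITIVE);
    ParamsSketch { scale, offset: min }
}

fn quantize_sketch(v: &[f32], p: &ParamsSketch) -> Vec<u8> {
    v.iter()
        .map(|&x| ((x - p.offset) / p.scale).round().clamp(0.0, 255.0) as u8)
        .collect()
}

/// f32 query x u8 codes. `query_sum` is computed once per query and reused
/// for every document in the collection.
fn asymmetric_dot_sketch(query: &[f32], query_sum: f32, codes: &[u8], p: &ParamsSketch) -> f32 {
    let qc: f32 = query.iter().zip(codes).map(|(&q, &c)| q * c as f32).sum();
    p.scale * qc + p.offset * query_sum
}
```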

Fast approximate math

Function            Description
fast_cosine         Approximate cosine via fast_rsqrt
fast_rsqrt          Fast inverse square root (hardware rsqrt + Newton-Raphson)
fast_rsqrt_precise  Two-iteration Newton-Raphson variant
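
The Newton-Raphson refinement for 1/sqrt(x) is y' = y * (1.5 - 0.5 * x * y * y). The sketch below shows one and two iterations. Note the substitution: since portable Rust has no hardware rsqrt instruction, the initial estimate here comes from the classic integer bit trick instead. All names are hypothetical.

```rust
/// Initial estimate via the well-known bit trick (stand-in for hardware rsqrt).
fn rsqrt_estimate(x: f32) -> f32 {
    f32::from_bits(0x5f37_59df_u32.wrapping_sub(x.to_bits() >> 1))
}

/// One Newton-Raphson step for f(y) = 1/y^2 - x.
fn newton_step(x: f32, y: f32) -> f32 {
    y * (1.5 - 0.5 * x * y * y)
}

/// Estimate plus one refinement step.
fn fast_rsqrt_sketch(x: f32) -> f32 {
    newton_step(x, rsqrt_estimate(x))
}

/// Estimate plus two refinement steps.
fn fast_rsqrt_precise_sketch(x: f32) -> f32 {
    newton_step(x, fast_rsqrt_sketch(x))
}

/// Approximate cosine: dot * rsqrt(|a|^2 * |b|^2), avoiding sqrt and division.
fn fast_cosine_sketch(a: &[f32], b: &[f32]) -> f32 {
    let (mut ab, mut aa, mut bb) = (0.0_f32, 0.0_f32, 0.0_f32);
    for (&x, &y) in a.iter().zip(b.iter()) {
        ab += x * y;
        aa += x * x;
        bb += y * y;
    }
    ab * fast_rsqrt_sketch(aa * bb)
}
```

Each step roughly squares the relative error, which is why the second iteration is enough to approach f32 precision.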

Batch operations (PDX-style columnar layout)

Type / Function                  Description
batch::VerticalBatch             Columnar (SoA) vector store
batch::batch_dot                 Batch dot products against a query
batch::batch_l2_squared          Batch squared L2 distances
batch::batch_cosine              Batch cosine similarities
batch::batch_norms               Norms for all vectors in the batch
batch::batch_knn                 Exact k-NN (L2) over a batch
batch::batch_knn_cosine          Top-k by cosine similarity
batch::batch_knn_dot             Top-k by dot product (MIPS)
batch::batch_knn_filtered        k-NN with predicate pushdown
batch::batch_knn_reordered       Exact k-NN with variance-ordered pruning
batch::batch_knn_adaptive        Approximate early-exit k-NN
batch::batch_l2_squared_pruning  Batch L2 with threshold pruning
batch::batch_dimension_variance  Per-dimension variance (for reordering)
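
A rough sketch of the columnar idea: storing dimension d of all vectors contiguously lets the inner loop stream over one column while updating every score. The struct and functions below are hypothetical illustrations, not `VerticalBatch` or its API.

```rust
/// Column-major (SoA) store: dimension d of vector i lives at data[d * n + i].
struct ColumnarSketch {
    n: usize,
    dim: usize,
    data: Vec<f32>,
}

fn transpose(rows: &[Vec<f32>]) -> ColumnarSketch {
    let n = rows.len();
    let dim = rows.first().map_or(0, |r| r.len());
    let mut data = vec![0.0; n * dim];
    for (i, row) in rows.iter().enumerate() {
        assert_eq!(row.len(), dim, "all rows must share one dimension");
        for (d, &x) in row.iter().enumerate() {
            data[d * n + i] = x;
        }
    }
    ColumnarSketch { n, dim, data }
}

/// Dot product of `query` against every stored vector, one column at a time.
fn batch_dot_sketch(batch: &ColumnarSketch, query: &[f32]) -> Vec<f32> {
    assert_eq!(query.len(), batch.dim);
    let mut scores = vec![0.0_f32; batch.n];
    for (d, &q) in query.iter().enumerate() {
        let column = &batch.data[d * batch.n..(d + 1) * batch.n];
        // Contiguous reads and writes: a loop shape compilers can vectorize.
        for (s, &x) in scores.iter_mut().zip(column) {
            *s += q * x;
        }
    }
    scores
}

/// Indices of the k highest dot-product scores (full sort, for clarity only).
fn top_k_by_dot(batch: &ColumnarSketch, query: &[f32], k: usize) -> Vec<usize> {
    let scores = batch_dot_sketch(batch, query);
    let mut order: Vec<usize> = (0..batch.n).collect();
    order.sort_by(|&i, &j| scores[j].total_cmp(&scores[i]));
    order.truncate(k);
    order
}
```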

Sparse vectors

Function                         Description
sparse_dot, sparse_dot_portable  Sparse vector dot (sorted-index merge)
sparse_maxsim                    Sparse MaxSim scoring
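
The sorted-index merge can be sketched as a two-pointer walk over both index lists. The parallel index/value slice representation and the function name are assumptions; innr's actual sparse format may differ.

```rust
use std::cmp::Ordering;

/// Sparse dot product over two vectors given as (sorted indices, values).
/// Both index slices must be strictly increasing.
fn sparse_dot_merge(ai: &[u32], av: &[f32], bi: &[u32], bv: &[f32]) -> f32 {
    let (mut i, mut j, mut acc) = (0usize, 0usize, 0.0f32);
    while i < ai.len() && j < bi.len() {
        match ai[i].cmp(&bi[j]) {
            // Advance whichever side has the smaller index.
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            // Matching index: both vectors are non-zero here.
            Ordering::Equal => {
                acc += av[i] * bv[j];
                i += 1;
                j += 1;
            }
        }
    }
    acc
}
```

The walk visits each entry at most once, so the cost is linear in the total number of non-zeros rather than in the full dimensionality.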

ColBERT late interaction

Function       Description
maxsim         MaxSim dot-product scoring
maxsim_cosine  MaxSim cosine scoring
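
MaxSim sums, over query tokens, the best dot product against any document token. A minimal sketch with hypothetical names follows; the token representation as `Vec<f32>` and the empty-document result are assumptions.

```rust
fn dot_ref(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Sum over query tokens of the maximum similarity to any document token.
fn maxsim_sketch(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    // Assumption: an empty document scores 0.0.
    if doc.is_empty() {
        return 0.0;
    }
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| dot_ref(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}
```

Because the outer sum runs over the first argument only, swapping query and document generally changes the score.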

SIMD Dispatch

Architecture  Instructions  Detection
x86_64        AVX-512F      Runtime
x86_64        AVX2 + FMA    Runtime
aarch64       NEON          Always
Other         Portable      LLVM auto-vec

Vectors with fewer than 16 dimensions use the portable code path.
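
The dispatch order in the table could be sketched roughly as below, using the standard library's `is_x86_feature_detected!` macro. The function names and returned labels are hypothetical; this shows the selection logic only, not how innr wires it to its kernels.

```rust
/// Returns a label for the code path a vector of `len` dimensions would take.
fn pick_backend(len: usize) -> &'static str {
    // Short vectors skip SIMD entirely.
    if len < 16 {
        return "portable";
    }
    detect_isa()
}

#[cfg(target_arch = "x86_64")]
fn detect_isa() -> &'static str {
    // Checked at runtime, widest ISA first.
    if is_x86_feature_detected!("avx512f") {
        "avx512f"
    } else if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        "avx2+fma"
    } else {
        "portable"
    }
}

#[cfg(target_arch = "aarch64")]
fn detect_isa() -> &'static str {
    // NEON is part of the aarch64 baseline, so no runtime check is needed.
    "neon"
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn detect_isa() -> &'static str {
    "portable"
}
```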

Performance

[Figure: benchmark throughput, measured on Apple Silicon (NEON)]

Run cargo bench to reproduce on your hardware.

For maximum performance, build with native CPU features:

RUSTFLAGS="-C target-cpu=native" cargo build --release

Or specify a portable baseline with SIMD:

# AVX2 (89% of x86_64 CPUs)
RUSTFLAGS="-C target-cpu=x86-64-v3" cargo build --release

# SSE2 only (100% compatible)
RUSTFLAGS="-C target-cpu=x86-64" cargo build --release

Run benchmarks:

cargo bench

Generate flamegraphs (requires cargo-flamegraph):

./scripts/profile.sh dense

Examples

01_basic_ops.rs -- The three core similarity metrics (dot product, cosine, L2 distance) and their mathematical relationships. Proves the identity L2^2(a,b) = 2(1 - cosine(a,b)) for normalized vectors, showing that cosine and L2 are interchangeable for ranking.
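
The identity follows from expanding ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, which for unit vectors is 2 - 2 cos(a, b). A small numerical check, written with hypothetical helpers rather than the example's own code, and assuming non-zero inputs:

```rust
fn normalize(v: &[f32]) -> Vec<f32> {
    // Assumption: v is non-zero; a zero vector would yield NaNs.
    let n = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / n).collect()
}

fn dot_ref(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn l2_squared_ref(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Returns (L2^2(a, b), 2 * (1 - cosine(a, b))) after normalizing both inputs.
fn identity_sides(a: &[f32], b: &[f32]) -> (f32, f32) {
    let (a, b) = (normalize(a), normalize(b));
    // For unit vectors the cosine equals the dot product.
    (l2_squared_ref(&a, &b), 2.0 * (1.0 - dot_ref(&a, &b)))
}
```

Since squared L2 is a decreasing function of cosine on unit vectors, ranking by ascending L2 and ranking by descending cosine give the same order.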

batch_demo.rs -- PDX-style columnar layout for batch retrieval. Transposes 10K vectors (128d) into column-major order, runs 100 queries, and verifies k-NN results against brute-force. Demonstrates the cache-friendly memory access pattern that enables auto-vectorization.

binary_demo.rs -- Binary (1-bit) quantization for first-stage retrieval. Quantizes 384d vectors to packed bits (32x memory reduction: 48 MB vs 1.5 GB for 1M documents), computes Hamming distance and binary dot product, and measures recall@10 against full-precision search.

fast_math_demo.rs -- Newton-Raphson rsqrt approximation for fast cosine similarity. Benchmarks the hot path in ANN search (640 distance calls per query in HNSW at 1M scale), measures 3-10x speedup over standard cosine at <1e-4 error, and shows architecture-specific SIMD dispatch.

matryoshka_search.rs -- Two-stage retrieval using Matryoshka embeddings. Uses a 128d prefix for coarse filtering (100 candidates from 10K corpus), then rescores with full 768d vectors to produce the final top-10. Measures recall and speedup vs single-stage search.
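
The two-stage flow can be sketched as follows. This is an illustrative outline with hypothetical names and dot-product scoring, not the example's source. It assumes `prefix_len` does not exceed the vector dimension.

```rust
fn dot_ref(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Indices of the k highest-scoring entries, best first.
fn top_k(mut scored: Vec<(usize, f32)>, k: usize) -> Vec<usize> {
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}

/// Stage 1 scores every document on a short prefix; stage 2 rescores only
/// the surviving candidates with the full vectors.
fn two_stage_search(
    corpus: &[Vec<f32>],
    query: &[f32],
    prefix_len: usize,
    n_candidates: usize,
    k: usize,
) -> Vec<usize> {
    let coarse: Vec<(usize, f32)> = corpus
        .iter()
        .enumerate()
        .map(|(i, d)| (i, dot_ref(&d[..prefix_len], &query[..prefix_len])))
        .collect();
    let candidates = top_k(coarse, n_candidates);

    let fine: Vec<(usize, f32)> = candidates
        .into_iter()
        .map(|i| (i, dot_ref(&corpus[i], query)))
        .collect();
    top_k(fine, k)
}
```

Recall depends on `n_candidates`: a document whose prefix score is poor but whose full score is high is lost if it does not survive stage 1.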

maxsim_colbert.rs -- ColBERT-style late interaction scoring. Computes MaxSim (sum of per-query-token maximum similarities across document tokens) for 32 query tokens x 128 doc tokens at 128d. Demonstrates non-commutativity and batch scoring of 1000 documents.

ternary_demo.rs -- Ternary (1.58-bit) quantization for extreme compression. Quantizes 768d vectors to {-1, 0, +1} (16-20x memory reduction: 90 MB vs 3.1 GB for 1M documents), measures recall trade-offs, and analyzes sparsity patterns.

cargo run --example 01_basic_ops
cargo run --example batch_demo
cargo run --example binary_demo
cargo run --example fast_math_demo
cargo run --example matryoshka_search
cargo run --example maxsim_colbert
cargo run --example ternary_demo

Tests

cargo test -p innr

License

Dual-licensed under MIT or Apache-2.0.