Skip to main content

Module simjoin

Module simjoin 

Source
Expand description

simjoin — exact all-pairs weighted-cosine similarity join: given a corpus of sparse non-negative vectors, find every pair (i, j) whose cosine similarity is ≥ t.

This is the AllPairs / L2AP family (Bayardo et al. WWW’07; Anastasiu & Karypis ICDE’14): an inverted index with prefix filtering so that the vast majority of vector pairs are never compared. It is the principled, exact replacement for shingle-candidate + verify-all near-duplicate detection (e.g. Type-3 code clones = functions × IDF-weighted lines).

§Correctness gate

cosine_join (the indexed algorithm) must return the exact same pair set, with bit-identical similarities, as cosine_join_bruteforce (the naive O(n²) oracle). Both score a pair with the same [cos_full] sorted-merge dot, so the values match to the bit and a pair is never dropped or gained at the threshold. This equality is asserted on fuzzed corpora — the same “two implementations, one answer” discipline the RO path uses.

§Method (this reference)

Vectors are L2-normalised (so cos = dot) and their dimensions relabelled to a global rank by increasing max weight (common low-weight dims first, rare high-weight dims last — so only the rare tail is indexed, keeping postings short). For a probe vector we look up candidates through the index, then verify with the full dot. When indexing a vector we skip the leading prefix whose max possible contribution to any dot stays < t: Σ_{k<b} w_k · maxw[dim_k] < t ⇒ a pair matching only in that prefix can’t reach t (weights ≥ 0), so it is guaranteed to also share an indexed dim. The skipped prefix holds the few common dims (huge postings); only the rarer tail is indexed — that is the whole speed-up. The tighter Cauchy–Schwarz / L2-norm bounds (true L2AP) and accumulation-time pruning are the next optimisation layer on top of this exact base.

Structs§

Corpus
A corpus of L2-normalised sparse non-negative vectors in CSR form, with dimensions relabelled to a global rank (decreasing max weight). Built once; joinable at any threshold.
CosineJoiner
A reusable cosine-join handle that owns the corpus and — under feature = "gpu" on macOS — the Metal device, compiled batch_cosine kernel, and the corpus CSR uploaded to unified memory, all acquired once at construction. Repeated joins at different thresholds then skip the per-call kernel compile + CSR upload that cosine_join_with pays (only the t-specific inverted index is rebuilt each call, on the CPU). Always constructible; degrades to the pure-CPU join when the gpu feature is off or no Metal device is available — mirroring Rationer. This is the right entry point for sweeping thresholds or joining repeatedly.

Functions§

cosine_join
Exact all-pairs cosine join via inverted index + L2AP prefix filtering and accumulation-time pruning. Returns (j, i, cos) with j < i for every pair with cos ≥ t. Bit-identical pair set to cosine_join_bruteforce.
cosine_join_bruteforce
Naive O(n²) oracle: score every pair with [cos_full], keep cos ≥ t. The correctness reference cosine_join is validated against.
cosine_join_with
Run the cosine join under a chosen Concurrency backend. Returns (j, i, cos) pairs with j < i and cos ≥ t, scores as f64 (the Gpu mode’s f32 cosines are widened losslessly).