Expand description
simjoin — exact all-pairs weighted-cosine similarity join: given a corpus of sparse
non-negative vectors, find every pair (i, j) whose cosine similarity is ≥ t.
This is the AllPairs / L2AP family (Bayardo et al. WWW’07; Anastasiu & Karypis ICDE’14): an
inverted index with prefix filtering so that the vast majority of vector pairs are never
compared. It is the principled, exact replacement for shingle-candidate + verify-all
near-duplicate detection (e.g. Type-3 code clones = functions × IDF-weighted lines).
§Correctness gate
cosine_join (the indexed algorithm) must return the exact same pair set, with
bit-identical similarities, as cosine_join_bruteforce (the naive O(n²) oracle). Both score
a pair with the same [cos_full] sorted-merge dot, so the values match to the bit and a pair is
never dropped or gained at the threshold. This equality is asserted on fuzzed corpora — the same
“two implementations, one answer” discipline the RO path uses.
§Method (this reference)
Vectors are L2-normalised (so cos = dot) and their dimensions relabelled to a global rank by
increasing max weight (common low-weight dims first, rare high-weight dims last — so only the
rare tail is indexed, keeping postings short). For a probe vector we look up candidates through
the index, then verify with the full dot. When indexing a vector we skip the leading prefix
whose max possible contribution to any dot stays < t:
Σ_{k<b} w_k · maxw[dim_k] < t ⇒ a pair matching only in that prefix can’t reach t
(weights ≥ 0), so it is guaranteed to also share an indexed dim. The skipped prefix holds the few
common dims (huge postings); only the rarer tail is indexed — that is the whole speed-up. The
tighter Cauchy–Schwarz / L2-norm bounds (true L2AP) and accumulation-time pruning are the next
optimisation layer on top of this exact base.
Structs§
- Corpus
- A corpus of L2-normalised sparse non-negative vectors in CSR form, with dimensions relabelled to a global rank (decreasing max weight). Built once; joinable at any threshold.
- Cosine
Joiner - A reusable cosine-join handle that owns the corpus and — under
feature = "gpu"on macOS — the Metal device, compiledbatch_cosinekernel, and the corpus CSR uploaded to unified memory, all acquired once at construction. Repeatedjoins at different thresholds then skip the per-call kernel compile + CSR upload thatcosine_join_withpays (only thet-specific inverted index is rebuilt each call, on the CPU). Always constructible; degrades to the pure-CPU join when thegpufeature is off or no Metal device is available — mirroringRationer. This is the right entry point for sweeping thresholds or joining repeatedly.
Functions§
- cosine_
join - Exact all-pairs cosine join via inverted index + L2AP prefix filtering and accumulation-time
pruning. Returns
(j, i, cos)withj < ifor every pair withcos ≥ t. Bit-identical pair set tocosine_join_bruteforce. - cosine_
join_ bruteforce - Naive
O(n²)oracle: score every pair with [cos_full], keepcos ≥ t. The correctness referencecosine_joinis validated against. - cosine_
join_ with - Run the cosine join under a chosen
Concurrencybackend. Returns(j, i, cos)pairs withj < iandcos ≥ t, scores asf64(theGpumode’s f32 cosines are widened losslessly).