pub fn cosine_join_with(
c: &Corpus,
t: f64,
mode: Concurrency,
) -> Vec<(usize, usize, f64)>
Run the cosine join under a chosen Concurrency backend. Returns (j, i, cos) pairs with
j < i and cos ≥ t, scores as f64 (the Gpu mode’s f32 cosines are widened losslessly).
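The output contract above can be illustrated with a brute-force reference. This is only a sketch of what the returned triples mean, not the crate's algorithm: it assumes dense `Vec<f64>` rows in place of `Corpus`, and the names `cosine` and `naive_cosine_join` are hypothetical. The emission order shown here is also an assumption.

```rust
/// Cosine of two dense vectors; returns 0.0 if either has zero norm.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Brute-force O(n^2) join (hypothetical helper, not the crate's method):
/// every (j, i, cos) with j < i and cos >= t.
fn naive_cosine_join(rows: &[Vec<f64>], t: f64) -> Vec<(usize, usize, f64)> {
    let mut out = Vec::new();
    for i in 0..rows.len() {
        for j in 0..i {
            let cos = cosine(&rows[j], &rows[i]);
            if cos >= t {
                out.push((j, i, cos));
            }
        }
    }
    out
}
```

Each triple names the lower row index first, so a pair is reported once rather than in both orders.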
- Concurrency::Cpu: runs cosine_join. Exact f64, all-CPU, every platform.
- Concurrency::GpuPlusCpu: exact f64 hybrid. The CPU generates survivor pairs, the GPU f32 cosine filters out the clear rejects, and the CPU recomputes the exact f64 score on what passes. Byte-identical to Cpu; both engines fully used. ~1.7–2× on bandwidth-bound real data.
- Concurrency::Gpu: GPU-dominant f32. The CPU generates survivor pairs, the GPU scores them, and the result is emitted directly (no f64 re-verify). Fastest (~2×); differs from the exact answer only on pairs whose true cosine is within ~1e-6 of t (measured: ≤1 pair in millions).
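The difference between the hybrid and f32-only modes can be sketched on the CPU alone. Everything here is a hypothetical stand-in: dense rows replace the corpus, an explicit candidate list replaces the survivor pairs, an f32 computation replaces the GPU kernel, and the `margin` parameter is an assumed way of deciding what counts as a "clear reject" (the source does not say how the crate decides this).

```rust
fn cosine_f64(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Same cosine computed entirely in f32 (stand-in for the GPU kernel).
fn cosine_f32(a: &[f64], b: &[f64]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| (*x as f32) * (*y as f32)).sum();
    let na = a.iter().map(|x| (*x as f32).powi(2)).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| (*x as f32).powi(2)).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Hybrid sketch: the f32 pass drops clear rejects, f64 makes the final call.
fn hybrid_join(
    rows: &[Vec<f64>],
    candidates: &[(usize, usize)],
    t: f64,
    margin: f32, // assumed safety band around t for the f32 filter
) -> Vec<(usize, usize, f64)> {
    let mut out = Vec::new();
    for &(j, i) in candidates {
        if cosine_f32(&rows[j], &rows[i]) < t as f32 - margin {
            continue; // clearly below threshold even allowing for f32 error
        }
        let exact = cosine_f64(&rows[j], &rows[i]);
        if exact >= t {
            out.push((j, i, exact));
        }
    }
    out
}

/// f32-only sketch: the f32 score is thresholded and emitted, widened to f64.
fn f32_join(rows: &[Vec<f64>], candidates: &[(usize, usize)], t: f64) -> Vec<(usize, usize, f64)> {
    let mut out = Vec::new();
    for &(j, i) in candidates {
        let approx = cosine_f32(&rows[j], &rows[i]);
        if approx >= t as f32 {
            out.push((j, i, f64::from(approx))); // f32 -> f64 is lossless
        }
    }
    out
}
```

In this sketch the hybrid path can only disagree with a pure f64 pass if the f32 error exceeds `margin`, whereas the f32-only path can disagree whenever the true cosine sits within f32 rounding distance of `t`.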
When the gpu feature is off, the target isn’t macOS, or no Metal device can be acquired, the GPU
modes transparently fall back to cosine_join (same as Rationer). This convenience entry point
compiles and uploads the GPU corpus on every call, which is fine for a one-shot join. For repeated
joins on one corpus, build a CosineJoiner once and call CosineJoiner::join, which holds the
device, kernel, and uploaded CSR across calls (and avoids the driver instability of compiling a
Metal library hundreds of times in a tight loop).
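The one-shot versus build-once trade-off can be sketched with a toy type. `ToyJoiner` and `toy_join_once` are hypothetical and do not reflect the signatures of CosineJoiner or cosine_join_with; the per-row normalization done in `new` merely stands in for the real setup cost (device acquisition, kernel compilation, CSR upload).

```rust
/// Hypothetical stand-in for a reusable joiner. Setup work happens once in
/// `new`; `join` can then be called many times with different thresholds.
struct ToyJoiner {
    unit_rows: Vec<Vec<f64>>,
}

impl ToyJoiner {
    /// One-time setup: normalize every row to unit length.
    fn new(rows: &[Vec<f64>]) -> Self {
        let unit_rows = rows
            .iter()
            .map(|r| {
                let norm = r.iter().map(|x| x * x).sum::<f64>().sqrt();
                if norm == 0.0 {
                    r.clone()
                } else {
                    r.iter().map(|x| x / norm).collect()
                }
            })
            .collect();
        ToyJoiner { unit_rows }
    }

    /// Reuses the prepared rows; cosine is a plain dot product of unit rows.
    fn join(&self, t: f64) -> Vec<(usize, usize, f64)> {
        let mut out = Vec::new();
        for i in 0..self.unit_rows.len() {
            for j in 0..i {
                let cos: f64 = self.unit_rows[j]
                    .iter()
                    .zip(&self.unit_rows[i])
                    .map(|(x, y)| x * y)
                    .sum();
                if cos >= t {
                    out.push((j, i, cos));
                }
            }
        }
        out
    }
}

/// Convenience form: pays the setup cost on every call.
fn toy_join_once(rows: &[Vec<f64>], t: f64) -> Vec<(usize, usize, f64)> {
    ToyJoiner::new(rows).join(t)
}
```

A caller sweeping several thresholds over one corpus would construct the joiner once and loop over `join`, while a single query can use the convenience function and discard the setup afterwards.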