[][src]Module probminhash::probminhasher

Implementation of ProbMinHash2, ProbMinHash3 and ProbMinHash3a as described in O. Ertl
https://arxiv.org/abs/1911.00675

  • ProbminHash3a is the fastest but at the cost of some internal storage.
  • Probminhash3 is the same algorithm without the time optimization requiring more storage.
    It can be used in streaming
  • Probminhash2 is statistically equivalent to P-Minhash as described in : Moulton Jiang "Maximally consistent sampling and the Jaccard index of probability distributions" https://ieeexplore.ieee.org/document/8637426 or https://arxiv.org/abs/1809.04052.
    It is given as a fallback in case ProbminHash3* algorithms do not perform well, or for comparison.

The generic type D must satisfy D:Copy+Eq+Hash+Debug
The algorithms requires random generators to be initialized by the objects so we need to map (at least approximately injectively) objects into a u64 so the of objcts must satisfy Hash. If D is of type u64 it is possible to use a NoHasher (cd module nohasher)

Structs

ExpRestricted01

Structure for defining exponential sampling of parameter lambda with support restricted to unit interval [0,1).

FYshuffle

Fisher Yates random permutation generation (sampling without replacement), with lazy generation of an array of size n

ProbMinHash2

implementation of the algorithm ProbMinHash2 as described in Ertl paper.

ProbMinHash3

implementation of the algorithm ProbMinHash3a as described in Etrl.
It needs less memory than Probminhash3 but can be a little slower.
Probminhash3 needs at least 2 hash values to run.

ProbMinHash3a

implementation of the algorithm ProbMinHash3a as described in Etrl.
This version of ProbMinHash3 is faster but needs some more memory as it stores some states between 2 passes on data.

Traits

WeightedSet

A Trait to define association of a weight to an object. Typically we could implement trait WeightedSet for any collection of Object if we have a function giving a weight to each object Then hash_wset function can be used.

Functions

compute_probminhash_jaccard

computes the weighted jaccard index of 2 signatures. The 2 signatures must come from two equivalent instances of the same ProbMinHash algorithm with the same number of hash signatures