[−][src]Module probminhash::probminhasher
Implementation of ProbMinHash2, ProbMinHash3 and ProbMinHash3a as described in O. Ertl
https://arxiv.org/abs/1911.00675
- ProbminHash3a is the fastest but at the cost of some internal storage.
- Probminhash3 is the same algorithm without the time optimization requiring more storage.
It can be used in streaming - Probminhash2 is statistically equivalent to P-Minhash as described in :
Moulton Jiang "Maximally consistent sampling and the Jaccard index of probability distributions"
https://ieeexplore.ieee.org/document/8637426 or https://arxiv.org/abs/1809.04052.
It is given as a fallback in case ProbminHash3* algorithms do not perform well, or for comparison.
The generic type D must satisfy D:Copy+Eq+Hash+Debug
The algorithms requires random generators to be initialized by the objects so we need to map (at least approximately
injectively) objects into a u64 so the of objcts must satisfy Hash.
If D is of type u64 it is possible to use a NoHasher (cd module nohasher)
Structs
ExpRestricted01 | Structure for defining exponential sampling of parameter lambda with support restricted to unit interval [0,1). |
FYshuffle | Fisher Yates random permutation generation (sampling without replacement), with lazy generation of an array of size n |
ProbMinHash2 | implementation of the algorithm ProbMinHash2 as described in Ertl paper. |
ProbMinHash3 | implementation of the algorithm ProbMinHash3a as described in Etrl. |
ProbMinHash3a | implementation of the algorithm ProbMinHash3a as described in Etrl. |
Traits
WeightedSet | A Trait to define association of a weight to an object. Typically we could implement trait WeightedSet for any collection of Object if we have a function giving a weight to each object Then hash_wset function can be used. |
Functions
compute_probminhash_jaccard | computes the weighted jaccard index of 2 signatures. The 2 signatures must come from two equivalent instances of the same ProbMinHash algorithm with the same number of hash signatures |