# Some Minhash related algorithms
This crate provides implementation of some recent algorithms deriving from the original Minhash. They have better performance and are more general.
It implements:
* ProbMinHash2, ProbMinHash3 and ProbMinHash3a as described in O. Ertl paper:
**ProbMinHash. A Class of of Locality-Sensitive Hash Algorithms for the Probability Jaccard Similarity (2020)**
[probminhash Ertl](https://arxiv.org/abs/1911.00675).
These algorithms compute an estimation of the Jaccard weighted index via sensitive hashing.
It is an extension of the Jaccard index to the case where objects have a weight, or a multiplicity associated.
This Jaccard weighted index provides a metric on discrete probability distributions as explained in :
**Moulton Jiang. Maximally consistent sampling and the Jaccard index of probability distributions (2018)**
[Moulton-Jiang-ieee](https://ieeexplore.ieee.org/document/8637426) or [Moulton-Jiang-arxiv](https://arxiv.org/abs/1809.04052)
Noting *Jp* the Jaccard weighted index, then *1. - Jp* defines a metric on finite discrete probabilities.
This module is the core of the crate which has two other modules.
* Superminhash
An implementation of Superminhash :
**A new minwise Hashing Algorithm for Jaccard Similarity Estimation**
Otmar Ertl 2017-2018 Cf [superminhash Ertl](https://arxiv.org/abs/1706.05698)
This algorithm runs on unweighted objects.
The hash values are computed by the *sketch* method or can be computed before entering SuperMinHash methods.
In this case (pre-hashed values) the structure just computes permutation according to the paper.
It runs in one pass on data so it can be used in streaming.
* Invhash
It is just a module providing invertible hash from u32 to u32 or u64 to u64 and can be used to run a prehash on indexes.
(See reference to Thomas Wang's invertible integer hash functions in invhash.rs)
## Some examples
Some example of usage (more in the tests in each module) consisting to estimate intersection of contents of 2 vectors:
* Probminhash
An example of Prominhash3a with an IndexMap
(see test probminhash::tests::test_probminhash3a_count_intersection_unequal_weights)
```rust
type FnvIndexMap<K, V> = IndexMap<K, V, FnvBuildHasher>;
...
let mut wa : FnvIndexMap::<usize,f64> = FnvIndexMap::with_capacity_and_hasher(70, FnvBuildHasher::default());
// initialize wa ...
let mut wb : FnvIndexMap::<usize,f64> = FnvIndexMap::with_capacity_and_hasher(70, FnvBuildHasher::default());
// initialize ...
let mut waprobhash = ProbMinHash3a::<usize, FnvHasher>::new(nbhash, 0);
waprobhash.hash_weigthed_idxmap(&wa);
//
let mut wbprobhash = ProbMinHash3a::<usize, FnvHasher>::new(nbhash, 0);
wbprobhash.hash_weigthed_idxmap(&wb);
//
let siga = waprobhash.get_signature();
let sigb = wbprobhash.get_signature();
let jp_approx = compute_probminhash_jaccard(siga, sigb);
```
An example of Probminhash3 with items sent one by one:
```rust
let set_size = 100;
let mut wa = Vec::<f64>::with_capacity(set_size);
let mut wb = Vec::<f64>::with_capacity(set_size);
// initialize wa, wb
....
// probminhash
let mut waprobhash = ProbMinHash3::<usize, FnvHasher>::new(nbhash, 0);
for i in 0..set_size {
if wa[i] > 0. {
waprobhash.hash_item(i, wa[i]);
}
}
//
let mut wbprobhash = ProbMinHash3::<usize, FnvHasher>::new(nbhash, 0);
for i in 0..set_size {
if wb[i] > 0. {
wbprobhash.hash_item(i, wb[i]);
}
}
let siga = waprobhash.get_signature();
let sigb = wbprobhash.get_signature();
let jp_approx = compute_probminhash_jaccard(siga, sigb);
```
* Superminhash
```rust
let va : Vec<usize> = (0..1000).collect();
let vb : Vec<usize> = (900..2000).collect();
let bh = BuildHasherDefault::<FnvHasher>::default();
let mut sminhash : SuperMinHash<usize, FnvHasher>= SuperMinHash::new(70, &bh);
// now compute sketches
let resa = sminhash.sketch_slice(&va);
// we decide to reuse sminhash instead of allocating another SuperMinHash structure
let ska = sminhash.get_hsketch().clone();
sminhash.reinit();
let resb = sminhash.sketch_slice(&vb);
let skb = sminhash.get_hsketch();
//
let jac = get_jaccard_index_estimate(&ska, &skb).unwrap();
...
```
## License
Licensed under either of
* Apache License, Version 2.0, [LICENSE-APACHE](LICENSE-APACHE) or <http://www.apache.org/licenses/LICENSE-2.0>
* MIT license [LICENSE-MIT](LICENSE-MIT) or <http://opensource.org/licenses/MIT>
at your option.
This software was written on my own while working at [CEA](http://www.cea.fr/)