Locality-sensitive hashing schemes for measuring similarities between sets.
MinHash | MinHash is a locality-sensitive hashing scheme that estimates the Jaccard Similarity
between two sets s1 and s2 . It uses multiple hash functions, and for each hash function h
it finds the minimum hash value obtained by hashing the items in s1 with h and the minimum
hash value obtained by hashing the items in s2 with h . The estimate of the Jaccard Similarity
is the number of hash functions whose two minimum values are equal, divided by the total
number of hash functions used.
|
ShingleIterator | A w-shingle iterator for a list of items.
|
SimHash | SimHash is a locality-sensitive hashing scheme. If two sets s1 and s2 are similar,
SimHash generates hashes for s1 and s2 that have a small Hamming Distance between
them.
|
get_jaccard_similarity | Computes the Jaccard Similarity between the items produced by two
iterators. The Jaccard Similarity is the size of the intersection divided by the size of the union.
|
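
The ideas summarized above can be sketched in a few lines of standalone Python. This is an illustrative sketch only, not this module's implementation or API: the function names (shingles, jaccard, minhash_estimate, simhash, hamming), the use of seeded BLAKE2b as the family of hash functions, the 64-bit hash width, and the convention that two empty sets have similarity 1.0 are all assumptions made for the example.

```python
import hashlib


def shingles(items, w):
    """Return the set of w-shingles (contiguous runs of w items) of a sequence."""
    items = list(items)
    return {tuple(items[i:i + w]) for i in range(len(items) - w + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


def _hash64(item, seed):
    """Hypothetical helper: a 64-bit hash of `item`, varied by `seed`."""
    data = repr((seed, item)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes):
    """One minimum hash value per hash function (assumes a non-empty input)."""
    items = set(items)
    return [min(_hash64(x, seed) for x in items) for seed in range(num_hashes)]


def minhash_estimate(a, b, num_hashes=256):
    """Fraction of hash functions whose minimum values agree for a and b."""
    sig_a = minhash_signature(a, num_hashes)
    sig_b = minhash_signature(b, num_hashes)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / num_hashes


def simhash(items, bits=64):
    """Bitwise majority vote over the item hashes (unweighted variant)."""
    counts = [0] * bits
    for x in set(items):
        h = _hash64(x, 0)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def hamming(x, y):
    """Number of bit positions in which two integers differ."""
    return bin(x ^ y).count("1")


if __name__ == "__main__":
    s1 = shingles("the quick brown fox jumps".split(), 2)
    s2 = shingles("the quick brown dog jumps".split(), 2)
    print("exact Jaccard:   ", jaccard(s1, s2))
    print("MinHash estimate:", minhash_estimate(s1, s2))
    print("SimHash distance:", hamming(simhash(s1), simhash(s2)))
```

The MinHash estimate approaches the exact Jaccard value as the number of hash functions grows, at the cost of more hashing per set. The SimHash sketch uses an unweighted vote per bit; weighted variants exist but are not shown here.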