Module probabilistic_collections::similarity

Module for measuring similarities between sets.

Structs

MinHash

MinHash is a locality sensitive hashing scheme that estimates the Jaccard Similarity between two sets s1 and s2. It uses multiple hash functions. For each hash function h, it finds the minimum hash value obtained by hashing the items of s1 with h and the minimum hash value obtained by hashing the items of s2 with h. The estimate of the Jaccard Similarity is the number of hash functions for which these two minimum values are equal, divided by the total number of hash functions used.
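
The estimation procedure described above can be sketched with the standard library alone. This is an illustration of the technique, not the crate's MinHash type: the names seeded_hash, min_hashes and estimate_similarity are hypothetical, and deriving each hash function by seeding DefaultHasher is an assumption made for the example. It also assumes num_hashes is greater than zero.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hashes `item` with the hash function selected by `seed`.
/// Assumption: mixing a seed into `DefaultHasher` stands in for a
/// family of independent hash functions.
fn seeded_hash<T: Hash>(item: &T, seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    item.hash(&mut hasher);
    hasher.finish()
}

/// For each of `num_hashes` hash functions, returns the minimum hash
/// value over all of `items` (`u64::MAX` if `items` is empty).
fn min_hashes<T: Hash>(items: &[T], num_hashes: u64) -> Vec<u64> {
    (0..num_hashes)
        .map(|seed| {
            items
                .iter()
                .map(|item| seeded_hash(item, seed))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Estimates the Jaccard Similarity as the fraction of hash functions
/// whose minimum values agree for `s1` and `s2`.
fn estimate_similarity<T: Hash>(s1: &[T], s2: &[T], num_hashes: u64) -> f64 {
    let m1 = min_hashes(s1, num_hashes);
    let m2 = min_hashes(s2, num_hashes);
    let matches = m1.iter().zip(&m2).filter(|(a, b)| a == b).count();
    matches as f64 / num_hashes as f64
}
```

Calling estimate_similarity(&[1, 2, 3], &[3, 2, 1], 64) compares two slices holding the same items, so every pair of minimum values agrees. Using more hash functions generally reduces the variance of the estimate at the cost of more hashing work.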

ShingleIterator

A w-shingle iterator for a list of items.
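
A w-shingle is a run of w consecutive items from the list. The following is a minimal sketch of that idea using the standard slice method windows. The function name shingles is hypothetical and this is not the crate's ShingleIterator.

```rust
/// Returns every contiguous run of `w` items (each w-shingle) of
/// `items`, in order. Yields nothing when `w` exceeds the list length.
fn shingles<T>(items: &[T], w: usize) -> Vec<&[T]> {
    // `slice::windows` panics on a window size of 0, so reject it
    // here with a clearer message.
    assert!(w > 0, "shingle width must be positive");
    items.windows(w).collect()
}
```

For the list ["the", "quick", "brown", "fox"] and w = 2, the shingles are the three adjacent word pairs. Sets of shingles are a common input to similarity measures such as the Jaccard Similarity.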

SimHash

SimHash is a locality sensitive hashing scheme. If two sets s1 and s2 are similar, SimHash will generate hashes for s1 and s2 that have a small Hamming Distance between them.
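
One common way to build such a hash is to let every item vote on each bit of a fixed-width fingerprint. The sketch below assumes a 64-bit fingerprint, equal weight for every item, and DefaultHasher as the per-item hash. The names hash_item, sim_hash and hamming_distance are hypothetical, and the crate's SimHash may work differently.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hashes a single item to 64 bits.
fn hash_item<T: Hash>(item: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    item.hash(&mut hasher);
    hasher.finish()
}

/// Builds a 64-bit fingerprint: each item votes +1 on every bit that
/// is set in its hash and -1 on every bit that is clear. A fingerprint
/// bit is set when its votes sum to a positive number.
fn sim_hash<T: Hash>(items: &[T]) -> u64 {
    let mut votes = [0i64; 64];
    for item in items {
        let h = hash_item(item);
        for (bit, vote) in votes.iter_mut().enumerate() {
            if (h >> bit) & 1 == 1 {
                *vote += 1;
            } else {
                *vote -= 1;
            }
        }
    }
    votes.iter().enumerate().fold(0u64, |acc, (bit, &vote)| {
        if vote > 0 {
            acc | (1u64 << bit)
        } else {
            acc
        }
    })
}

/// Number of bit positions in which two fingerprints differ.
fn hamming_distance(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Because the votes are summed, the fingerprint does not depend on the order of the items, and changing a few items tends to flip only a few bits.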

Functions

get_jaccard_similarity

Computes the Jaccard Similarity between two iterators. The Jaccard Similarity is the size of the intersection divided by the size of the union of the items the two iterators yield.
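
As an illustration of the definition, the exact value can be computed with HashSet from the standard library. This is a sketch and not the crate's get_jaccard_similarity. The function name jaccard_similarity and its signature are hypothetical, and returning 1.0 for two empty inputs is an assumption made for the example.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Exact Jaccard Similarity of the distinct items yielded by `a` and
/// `b`: the intersection size divided by the union size.
fn jaccard_similarity<T, I, J>(a: I, b: J) -> f64
where
    T: Eq + Hash,
    I: IntoIterator<Item = T>,
    J: IntoIterator<Item = T>,
{
    let set_a: HashSet<T> = a.into_iter().collect();
    let set_b: HashSet<T> = b.into_iter().collect();
    let intersection = set_a.intersection(&set_b).count();
    // Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|.
    let union = set_a.len() + set_b.len() - intersection;
    if union == 0 {
        // Assumption: two empty inputs are treated as identical.
        return 1.0;
    }
    intersection as f64 / union as f64
}
```

For the inputs {1, 2, 3} and {2, 3, 4}, the intersection {2, 3} has 2 items and the union {1, 2, 3, 4} has 4, so the similarity is 2 / 4 = 0.5.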