pub struct MinHash<T: ?Sized, S = DefaultHashBuilder> { /* private fields */ }alloc only.Expand description
A fixed-size signature that estimates the Jaccard similarity of the set it summarises against another sketch.
MinHash represents a set by the minimum hash value each of k hash functions
produces over the set’s members. The fraction of signature slots two sketches
agree on is an unbiased estimate of the
Jaccard similarity
|A ∩ B| / |A ∪ B| of the two sets — computed in O(k) time and O(k)
space, independent of how large the sets are. More hash functions tighten the
estimate (standard error ≈ 1 / sqrt(k)).
The sketch is generic over the item type T and a
BuildHasher S, defaulting to the deterministic
DefaultHashBuilder. Two sketches are only
comparable when built with the same number of hash functions and the same
hasher.
§Examples
use bloom_lib::MinHash;
let mut a = MinHash::new(256).unwrap();
let mut b = MinHash::new(256).unwrap();
for w in ["the", "quick", "brown", "fox"] {
a.insert(w);
}
for w in ["the", "quick", "red", "fox"] {
b.insert(w);
}
// True Jaccard similarity is 3/5 = 0.6 (shared: the, quick, fox).
let similarity = a.similarity(&b).unwrap();
assert!((0.45..=0.75).contains(&similarity));Implementations§
Source§impl<T: ?Sized> MinHash<T, DefaultHashBuilder>
impl<T: ?Sized> MinHash<T, DefaultHashBuilder>
Sourcepub fn new(num_hashes: usize) -> Result<Self, Error>
pub fn new(num_hashes: usize) -> Result<Self, Error>
Creates a sketch with num_hashes hash functions, using the default
hasher.
More hash functions give a more precise similarity estimate at
proportional memory and update cost. A few hundred is typical; the
standard error of the estimate is about 1 / sqrt(num_hashes).
§Parameters
num_hashes: the signature length. Must be non-zero.
§Errors
Returns Error::InvalidParameter if num_hashes is zero.
§Examples
use bloom_lib::MinHash;
let sketch = MinHash::<&str>::new(128).unwrap();
assert_eq!(sketch.num_hashes(), 128);
assert!(sketch.is_empty());Source§impl<T: ?Sized, S: BuildHasher> MinHash<T, S>
impl<T: ?Sized, S: BuildHasher> MinHash<T, S>
Sourcepub fn with_hasher(num_hashes: usize, hasher: S) -> Result<Self, Error>
pub fn with_hasher(num_hashes: usize, hasher: S) -> Result<Self, Error>
Creates a sketch with num_hashes hash functions and a caller-supplied
hasher.
§Errors
Returns Error::InvalidParameter if num_hashes is zero.
§Examples
use std::collections::hash_map::RandomState;
use bloom_lib::MinHash;
let sketch: MinHash<&str, RandomState> =
MinHash::with_hasher(128, RandomState::new()).unwrap();Sourcepub fn insert(&mut self, item: &T)where
T: Hash,
pub fn insert(&mut self, item: &T)where
T: Hash,
Adds item to the set this sketch summarises.
Each of the k hash functions is evaluated on item and the
corresponding signature slot keeps the running minimum. Re-inserting an
item already present has no effect, so the sketch depends only on the
distinct members of the set.
§Examples
use bloom_lib::MinHash;
let mut sketch = MinHash::new(64).unwrap();
sketch.insert("member");
assert!(!sketch.is_empty());Sourcepub fn similarity(&self, other: &Self) -> Result<f64, Error>
pub fn similarity(&self, other: &Self) -> Result<f64, Error>
Estimates the Jaccard similarity between this sketch and other.
The result is the fraction of signature slots whose minimum hash agrees,
which is an unbiased estimator of |A ∩ B| / |A ∪ B|. Identical sets
estimate 1.0; disjoint sets estimate near 0.0. Two empty sketches
agree on every slot and so report 1.0.
§Errors
Returns Error::IncompatibleParameters if the two sketches have
different signature lengths.
§Examples
use bloom_lib::MinHash;
let mut a = MinHash::new(256).unwrap();
let mut b = MinHash::new(256).unwrap();
for i in 0..1_000u32 {
a.insert(&i);
}
for i in 0..1_000u32 {
b.insert(&i);
}
// Identical sets.
assert_eq!(a.similarity(&b).unwrap(), 1.0);Sourcepub fn merge(&mut self, other: &Self) -> Result<(), Error>
pub fn merge(&mut self, other: &Self) -> Result<(), Error>
Merges other into self, producing the sketch of the union of the
two sets.
Each signature slot becomes the minimum of the two sketches’ slots, which is exactly the slot the union would have produced. Both sketches must have the same signature length.
§Errors
Returns Error::IncompatibleParameters if the signature lengths differ.
§Examples
use bloom_lib::MinHash;
let mut a = MinHash::new(128).unwrap();
let mut b = MinHash::new(128).unwrap();
a.insert("x");
b.insert("y");
a.merge(&b).unwrap();
// `a` now summarises {x, y}.Sourcepub fn num_hashes(&self) -> usize
pub fn num_hashes(&self) -> usize
The number of hash functions in the signature.