Skip to main content

MinHash

Struct MinHash 

Source
pub struct MinHash<T: ?Sized, S = DefaultHashBuilder> { /* private fields */ }
Available on crate feature alloc only.
Expand description

A fixed-size signature that estimates the Jaccard similarity of the set it summarises against another sketch.

MinHash represents a set by the minimum hash value each of k hash functions produces over the set’s members. The fraction of signature slots two sketches agree on is an unbiased estimate of the Jaccard similarity |A ∩ B| / |A ∪ B| of the two sets — computed in O(k) time and O(k) space, independent of how large the sets are. More hash functions tighten the estimate (standard error ≈ 1 / sqrt(k)).

The sketch is generic over the item type T and a BuildHasher S, defaulting to the deterministic DefaultHashBuilder. Two sketches are only comparable when built with the same number of hash functions and the same hasher.

§Examples

use bloom_lib::MinHash;

let mut a = MinHash::new(256).unwrap();
let mut b = MinHash::new(256).unwrap();

for w in ["the", "quick", "brown", "fox"] {
    a.insert(w);
}
for w in ["the", "quick", "red", "fox"] {
    b.insert(w);
}

// True Jaccard similarity is 3/5 = 0.6 (shared: the, quick, fox).
let similarity = a.similarity(&b).unwrap();
assert!((0.45..=0.75).contains(&similarity));

Implementations§

Source§

impl<T: ?Sized> MinHash<T, DefaultHashBuilder>

Source

pub fn new(num_hashes: usize) -> Result<Self, Error>

Creates a sketch with num_hashes hash functions, using the default hasher.

More hash functions give a more precise similarity estimate at proportional memory and update cost. A few hundred is typical; the standard error of the estimate is about 1 / sqrt(num_hashes).

§Parameters
  • num_hashes: the signature length. Must be non-zero.
§Errors

Returns Error::InvalidParameter if num_hashes is zero.

§Examples
use bloom_lib::MinHash;

let sketch = MinHash::<&str>::new(128).unwrap();
assert_eq!(sketch.num_hashes(), 128);
assert!(sketch.is_empty());
Source§

impl<T: ?Sized, S: BuildHasher> MinHash<T, S>

Source

pub fn with_hasher(num_hashes: usize, hasher: S) -> Result<Self, Error>

Creates a sketch with num_hashes hash functions and a caller-supplied hasher.

§Errors

Returns Error::InvalidParameter if num_hashes is zero.

§Examples
use std::collections::hash_map::RandomState;
use bloom_lib::MinHash;

let sketch: MinHash<&str, RandomState> =
    MinHash::with_hasher(128, RandomState::new()).unwrap();
Source

pub fn insert(&mut self, item: &T)
where T: Hash,

Adds item to the set this sketch summarises.

Each of the k hash functions is evaluated on item and the corresponding signature slot keeps the running minimum. Re-inserting an item already present has no effect, so the sketch depends only on the distinct members of the set.

§Examples
use bloom_lib::MinHash;

let mut sketch = MinHash::new(64).unwrap();
sketch.insert("member");
assert!(!sketch.is_empty());
Source

pub fn similarity(&self, other: &Self) -> Result<f64, Error>

Estimates the Jaccard similarity between this sketch and other.

The result is the fraction of signature slots whose minimum hash agrees, which is an unbiased estimator of |A ∩ B| / |A ∪ B|. Identical sets estimate 1.0; disjoint sets estimate near 0.0. Two empty sketches agree on every slot and so report 1.0.

§Errors

Returns Error::IncompatibleParameters if the two sketches have different signature lengths.

§Examples
use bloom_lib::MinHash;

let mut a = MinHash::new(256).unwrap();
let mut b = MinHash::new(256).unwrap();
for i in 0..1_000u32 {
    a.insert(&i);
}
for i in 0..1_000u32 {
    b.insert(&i);
}
// Identical sets.
assert_eq!(a.similarity(&b).unwrap(), 1.0);
Source

pub fn merge(&mut self, other: &Self) -> Result<(), Error>

Merges other into self, producing the sketch of the union of the two sets.

Each signature slot becomes the minimum of the two sketches’ slots, which is exactly the slot the union would have produced. Both sketches must have the same signature length.

§Errors

Returns Error::IncompatibleParameters if the signature lengths differ.

§Examples
use bloom_lib::MinHash;

let mut a = MinHash::new(128).unwrap();
let mut b = MinHash::new(128).unwrap();
a.insert("x");
b.insert("y");

a.merge(&b).unwrap();
// `a` now summarises {x, y}.
Source

pub fn num_hashes(&self) -> usize

The number of hash functions in the signature.

Source

pub fn is_empty(&self) -> bool

Returns true if no items have been inserted.

Source

pub fn clear(&mut self)

Resets the sketch to empty, retaining the allocation.

§Examples
use bloom_lib::MinHash;

let mut sketch = MinHash::new(64).unwrap();
sketch.insert("x");
sketch.clear();
assert!(sketch.is_empty());

Trait Implementations§

Source§

impl<T: Clone + ?Sized, S: Clone> Clone for MinHash<T, S>

Source§

fn clone(&self) -> MinHash<T, S>

Returns a duplicate of the value. Read more
1.0.0 (const: unstable) · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl<T: Debug + ?Sized, S: Debug> Debug for MinHash<T, S>

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
Source§

impl<'de, T: ?Sized, S> Deserialize<'de> for MinHash<T, S>
where S: Default,

Source§

fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where __D: Deserializer<'de>,

Deserialize this value from the given Serde deserializer. Read more
Source§

impl<T: ?Sized, S> Serialize for MinHash<T, S>

Source§

fn serialize<__S>(&self, __serializer: __S) -> Result<__S::Ok, __S::Error>
where __S: Serializer,

Serialize this value into the given Serde serializer. Read more

Auto Trait Implementations§

§

impl<T, S> Freeze for MinHash<T, S>
where S: Freeze, T: ?Sized,

§

impl<T, S> RefUnwindSafe for MinHash<T, S>
where S: RefUnwindSafe, T: ?Sized,

§

impl<T, S> Send for MinHash<T, S>
where S: Send, T: ?Sized,

§

impl<T, S> Sync for MinHash<T, S>
where S: Sync, T: ?Sized,

§

impl<T, S> Unpin for MinHash<T, S>
where S: Unpin, T: ?Sized,

§

impl<T, S> UnsafeUnpin for MinHash<T, S>
where S: UnsafeUnpin, T: ?Sized,

§

impl<T, S> UnwindSafe for MinHash<T, S>
where S: UnwindSafe, T: ?Sized,

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<T> DeserializeOwned for T
where T: for<'de> Deserialize<'de>,