Crate hyperminhash

Crate hyperminhash 

Source
Expand description

Very fast, constant memory-footprint cardinality approximation, including intersection and union operation. A straight port of Hyperminhash.

As with other cardinality estimators, Hyperminhash has two advantages when counting very large sets or streams of elements:

  • It uses a single data structure that never grows while counting elements. The structure consumes 32kb of memory, allocated on the stack.
  • The amount of work done for counting a marginal element stays approximately constant.

For sets smaller than roughly 10^4 to 10^5 unique elements, a std::collections::HashSet is usually faster, albeit using much more memory. When counting streams of millions of elements, Hyperminhash is much faster and uses much less memory.

use hyperminhash::Sketch;

// A `Sketch` can approximate the unique count of elements it has seen over it's lifetime.
let mut sk = Sketch::default();

// After initialization, a `Sketch` will never allocate.

// Elements added to Sketch need to implement `std::hash::Hash`
sk.add("foobar");
sk.add(1);
sk.add(2);
sk.add(vec![3, 4, 5]);
for i in 0..5000 {
    sk.add(i);
}
// There will be some error in the count
assert!(sk.cardinality() > 4_900.0 && sk.cardinality() < 5_100.0);


// Using `std::iter::FromIterator`
let sketch1 = (0..10_000).collect::<Sketch>();
let sketch2 = (5_000..15_000).collect::<Sketch>();
assert!(sketch1.cardinality() > 9_800.0 && sketch1.cardinality() < 10_200.0);
assert!(sketch2.cardinality() > 9_800.0 && sketch2.cardinality() < 10_200.0);

// The intersection of two sketches yields the approximate number of unique
// elements that are in both sets.
let i = sketch1.intersection(&sketch2);
assert!(i > 4_800.0 && i < 5_200.0);

// Comparing both sets, the number of elements that are in both sets is roughly 33%
let j = sketch1.similarity(&sketch2);
assert!(j > 0.33 && j < 0.34);

// Merge sketch1 with sketch2
let mut sketch1 = sketch1;
sketch1.union(&sketch2);
assert!(sketch1.cardinality() > 14_800.0 && sketch1.cardinality() < 15_200.0);


// Sketches can be serialized/deserialized from any `io::Read`/`io::Write`.
let mut buffer = Vec::new();
sketch1.save(&mut buffer).expect("Failed to write");

let sketch2 = Sketch::load(&buffer[..]).expect("Failed to read");
assert_eq!(sketch1.cardinality(), sketch2.cardinality());


// Use custom seed-values to distinguish elements that ordinarily
// hash equally.
const WHITE_SEED: u64 = 0xD1CE_5EED_1234_5678;
const BLUE_SEED: u64 = 0xDEAD_BEEF_1234_5678;
let mut sk_seeded = Sketch::default();
sk_seeded.add_with_seed("foo", WHITE_SEED);
sk_seeded.add_with_seed("foo", BLUE_SEED);
assert!(sk_seeded.cardinality() > 1.0);

Structs§

Sketch
Records the approximate number of unique elements it has seen over it’s lifetime.