Skip to main content

Crate hyperloglockless

Crate hyperloglockless 

Source
Expand description

§hyperloglockless

Github Crates.io docs.rs Downloads

High-performance HyperLogLogs with bias correction and full concurrency support. Used for accurate and space-efficient cardinality estimation.

§Overview

HyperLogLogs are space efficient data structures for the “count-distinct problem”, approximating the number of distinct elements in a multiset. Paper.

hyperloglockless includes a suite of cardinality estimator implementations based on HyperLogLog++ and Log Log Beta. Notable features are:

  • O(1) Count Calls without sacrificing insert throughput. Internal counts are cheaply updated with each insert, hyperloglockless particularly useful for streaming use-cases.
  • Concurrency Support: AtomicHyperLogLog is a drop-in replacement for RwLock<OtherHyperLogLog>: all methods take &self, so you can wrap it in Arc and update it concurrently without &mut.
  • Sparse Representation: HyperLogLogPlus uses a tweaked version of Google’s sparse representation. It’s 5x faster, 100x more accurate, and uses less memory than other crates implementing sparse representations.
  • Accurate: Empirically verified accuracy for trillions of elements; other implementations break down after millions.
  • Tested: Rigorously tested with loom and benchmarked for speed, memory, and accuracy.

§Usage

[dependencies]
hyperloglockless = "0.5.0"

A HyperLogLog with precision p uses 2^p bytes of memory and has an error % of roughly 104 / sqrt(2^p).

use hyperloglockless::HyperLogLog;

let precision = hyperloglockless::precision_for_error(0.01); // 1% error
assert_eq!(precision, 14);

let mut hll = HyperLogLog::new(precision);
hll.insert(&'🦀');
hll.insert_all('a'..='z');

let count = hll.count(); // ~27
assert_eq!(hll.len(), 1 << precision); // 16384 bytes

Use any hasher:

use hyperloglockless::HyperLogLog;
use foldhash::fast::RandomState;

let hll = HyperLogLog::with_hasher(14, RandomState::default());

Full concurrency support: AtomicHyperLogLog is a drop-in replacement for RwLock<OtherHyperLogLog>: all methods take &self.

use hyperloglockless::AtomicHyperLogLog;

let hll = AtomicHyperLogLog::new(14);
hll.insert(&'🦀');
hll.insert_all('a'..='z');

A more compact and accurate “sparse” representation that switches to classic HLL automatically when memory reaches the same as classic HLL:

use hyperloglockless::HyperLogLogPlus;

let mut hll = HyperLogLogPlus::new(14);
hll.insert(&'🦀');
hll.insert_all('a'..='z');

§Performance

An overall benchmark where N items are inserted and a single count call is made afterwards. hyperloglockless has O(1) cardinality queries without sacrificing insert throughput. It excels when there are many cardinality queries and/or when the inserts are <65K. For larger inserts, it keeps up well since internal book-keeping is quick.

fp-micro

§Accuracy

hyperloglockless stays consistently accurate while other implementations break down after millions of items. hyperloglockless’s sparse HLL is ~100x more accurate than other sparse implementations. It achieves high accuracy by cramming more information into each hash encoding and using more accurate error correction models.

fp-micro

§Count Performance

count calls for hyperloglockless’s cardinality estimators are O(1):

fp-micro

§Sparse Representation Memory (HyperLogLogPlus)

Below measures and compares the amortized insert performance of hyperloglockless::HyperLogLogPlus, which first uses a sparse representation then automatically switches to classic “dense” HLL representation after a certain number of inserts. hyperloglockless::HyperLogLogPlus is 5x faster than other sparse implementations while using less memory. It achieves this by eliminating unnecessary hashing, using faster hash encoding, devirtualization avoidance, and smarter memory management.

fp-micro

§Multi-Threaded Performance

hyperloglockless::AtomicHyperLogLog does not require any locking and therefore avoids thread contention.

fp-micro

§Available Features

  • rand - Enabled by default, this has the DefaultHasher source its random state using thread_rng() instead of hardware sources. Getting entropy from a user-space source is considerably faster, but requires additional dependencies to achieve this. Disabling this feature by using default-features = false makes DefaultHasher source its entropy using foldhash, which will have a much simpler code footprint at the expense of speed.
  • serde - HyperLogLogs implement Serialize and Deserialize when possible.
  • loom - AtomicHyperLogLogs use loom atomics, making it compatible with loom testing.

§License

Licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
  • MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

§Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Structs§

AtomicHyperLogLog
HyperLogLog is a data structure for the “count-distinct problem”, approximating the number of distinct elements in a multiset. AtomicHyperLogLog is the thread-safe counterpart of HyperLogLog.
HyperLogLog
HyperLogLog is a data structure for the “count-distinct problem”, approximating the number of distinct elements in a multiset.
HyperLogLogPlus
An implementation of the the HyperLogLog++ data structure.

Enums§

Error

Functions§

error_for_precision
Returns the approximate error of count and raw_count given the precision of a HyperLogLog or AtomicHyperLogLog.
precision_for_error
Returns the HyperLogLog precision that will have the error for calls to count and raw_count.

Type Aliases§

DefaultHasher
The default hasher for crate::HyperLogLog and crate::AtomicHyperLogLog.