rust-hll
A Rust implementation of HyperLogLog that is storage-compatible with the Aggregate Knowledge HLL Storage Spec. It is heavily based on the work done in go-hll.
Overview
HyperLogLog (HLL) is a fixed-size, set-like structure used for distinct value counting with tunable precision. For example, in 1280 bytes HLL can estimate the count of tens of billions of distinct values with only a few percent error.
In addition to the algorithm proposed in the original paper, this implementation is augmented to improve its accuracy and memory use without sacrificing much speed.
Motivation
While there are a handful of existing HLL implementations in Rust, none of them (that I have found) implement the AK Storage Spec.
The unified storage format is useful for reading and writing HLLs in a multi-lingual environment. This implementation
allows for seamless integration with other HLL implementations that follow the AK Storage Spec, such as the PostgreSQL HLL
extension.
Hashing
A good hashing algorithm is crucial to achieving the pseudorandomness that HLL requires in order to perform its calculations. The 64-bit variant of MurmurHash3 is recommended. If using a seed, it must be constant for all inputs to a given HLL. Further, if that HLL is to be unioned, then the same seed must be used for all inputs to the other HLL.
Usage
use ;
Configuration Options
The HLL implementation can be configured through the Settings struct:
log_2m: Determines the number of registers in the HLL (2^log_2m). Must be between 4 and 31.reg_width: Number of bits dedicated to each register value. Must be between 1 and 8.explicit_threshold: Cardinality at which the HLL transitions from explicit to probabilistic storage. Use -1 for auto-calculation.sparse_enabled: Whether to use sparse representation. When true, conversion thresholds are automatically calculated.
Storage Types
The implementation uses three storage types that automatically transition based on the data:
- Explicit: Used for small cardinalities, stores actual values
- Sparse: Used for medium cardinalities, stores only non-zero registers
- Dense: Used for large cardinalities, stores all registers
Additional Resources
- HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
- Understanding the HyperLogLog
License
Released under the MIT license.