Expand description
cardinality-estimator
is a Rust crate designed to estimate the number of distinct elements in a stream or dataset in an efficient manner.
This library uses HyperLogLog++ with an optimized low memory footprint and high accuracy approach, suitable for large-scale data analysis tasks.
Cardinality estimator allows to estimate number of distinct elements in the stream or dataset and is defined with const P
and W
parameters:
P
: precision parameter in [4..18] range, which defines number of bits to use for HyperLogLog register indices.W
: width parameter in [4..6] range, which defines number of bits to use for HyperLogLog register width.
§Data-structure design rationale
§Low memory footprint
For parameters P = 12, W = 6:
- Cardinality in [0..2] range - 8 bytes (small representation)
- Cardinality in [3..4] range - 24 bytes (array representation)
- Cardinality in [5..8] range - 40 bytes (array representation)
- Cardinality in [9..16] range - 72 bytes (array representation)
- …
- Cardinality in [65..128] range - 520 bytes (array representation)
- Cardinality in [129..] range - 3092 bytes (hyperloglog representation)
§Low latency
- Auto-vectorization for slice operations via compiler hints
to use SIMD instructions when using
chunks_exact
. - Number of zero registers and registers’ harmonic sum are
stored and updated dynamically as more data being inserted,
allowing to have truly constant
estimate
operations. - Efficient polynomial computation using Horner’s method.
§High accuracy
- For small cardinality range (<= 128 for P = 12, W = 6) cardinality counted very accurately (within hash collisions chance)
- For large cardinality range HyperLogLog++ is used with LogLog-Beta bias correction.
- Expected error (1.04 / sqrt(2^P)):
- P = 10, W = 5: 0.0325
- P = 12, W = 6: 0.0162
- P = 14, W = 6: 0.0081
- P = 18, W = 6: 0.0020
- Expected error (1.04 / sqrt(2^P)):
§Data storage format
Cardinality estimator stores data in one of the three representations:
Small
representation - seesmall
module for more details.Array
representation - seearray
module for more details.HyperLogLog
representation - seehyperloglog
module for more details
§Data Storage Format
The cardinality estimator stores data in one of three formats: Small
, Array
, and HyperLogLog
.
See corresponding modules (small
, array
, hyperloglog
) for more details.
Re-exports§
pub use estimator::*;