Expand description
SIMD-accelerated implementations of various streaming algorithms.
📦 Crates.io │ 📑 GitHub │ 💬 Chat
This library is a work in progress. PRs are very welcome! Currently implemented algorithms include:
- Count–min sketch
- Top k (Count–min sketch plus a doubly linked hashmap to track heavy hitters / top k keys when ordered by aggregated value)
- HyperLogLog
- Reservoir sampling
A goal of this library is to enable composition of these algorithms; for example Top k + HyperLogLog to enable an approximate version of something akin to SELECT key FROM table GROUP BY key ORDER BY COUNT(DISTINCT value) DESC LIMIT k
.
Run your application with RUSTFLAGS="-C target-cpu=native"
and the nightly
feature to benefit from the SIMD-acceleration like so:
RUSTFLAGS="-C target-cpu=native" cargo run --features "streaming_algorithms/nightly" --release
See this gist for a good list of further algorithms to be implemented. Other resources are Probabilistic data structures – Wikipedia, DataSketches – A similar Java library originating at Yahoo, and Algebird – A similar Java library originating at Twitter.
As these implementations are often in hot code paths, unsafe is used, albeit only when necessary to a) achieve the asymptotically optimal algorithm or b) mitigate an observed bottleneck.
Structs§
- Count
MinSketch - An implementation of a count-min sketch data structure with conservative updating for increased accuracy.
- Hyper
LogLog - An implementation of the HyperLogLog data structure with bias correction.
- Hyper
LogLog Magnitude - Like
HyperLogLog
but implementsOrd
andEq
by using the estimate of the cardinality. - Sample
Total - Given population and sample sizes, returns true if this element is in the sample. Without replacement.
- Sample
Unstable - Reservoir sampling. Without replacement, and the returned order is unstable.
- Top
- This probabilistic data structure tracks the
n
top keys given a stream of(key,value)
tuples, ordered by the sum of the values for each key (the “aggregated value”). It uses onlyO(n)
space. - TopIter
- An iterator over the entries and counts in a
Top
datastructure.
Traits§
- Intersect
- Intersect zero or more
&Self
to createOption<Self>
. - Intersect
Plus Union IsPlus - An optimisation for cases like putting a HyperLogLog inside a Count–min sketch, where intersecting, adding a val, and then unioning that with counters is the same as simply adding the val to the counters.
- New
- New instances are instantiable given a specified input of
<Self as New>::Config
. - Union
Assign - Union
Self
withRhs
in place.