sketches
Probabilistic data structures for scalable approximate analytics in Rust.
This crate gives you memory-efficient sketches for:
- frequency estimation,
- distinct counting,
- set membership,
- set similarity (Jaccard),
- heavy hitter detection,
- quantiles,
- and stream sampling.
All sketches are designed for streaming workloads where exact data structures
(HashMap, full sorted buffers, exact sets) are too expensive in memory or
throughput.
Add To A Project
This repository is currently consumed as a local crate:
[]
= { = "../sketches" }
What Is Included
| Sketch | Module | Use it when | Notes |
|---|---|---|---|
| Bloom Filter | bloom_filter |
You need very fast membership checks and can tolerate false positives | No deletions |
| Cuckoo Filter | cuckoo_filter |
You need membership checks and deletions | Can fail inserts at high load |
| HyperLogLog | hyperloglog |
You need approximate distinct counts (COUNT(DISTINCT ...)) |
Mergeable, tiny memory footprint |
| MinMax Sketch | minmax_sketch |
You need approximate non-negative frequency counts | Conservative updates reduce overestimation |
| Count Sketch | count_sketch |
You need approximate signed frequency updates | Good for turnstile streams (+/- updates) |
| Space-Saving | space_saving |
You need top-k / heavy hitters from a stream | Tracks only bounded number of candidates |
| KLL Sketch | kll |
You need general quantiles (median, p90, p99) | Good default quantile sketch |
| t-digest | tdigest |
You care most about tail quantiles (p95/p99/p999) | Typically stronger tail behavior |
| MinHash | minhash |
You need Jaccard similarity between sets | Best default for similarity tasks |
| Reservoir Sampling | reservoir_sampling |
You need a uniform sample from an unbounded stream | Fixed-size unbiased sample |
| Jaccard trait/helpers | jacard |
You want a shared Jaccard API across sketches | Provides JacardIndex trait |
Which Sketch Should I Use?
If your primary goal is:
- Distinct counting: use
HyperLogLog. - Jaccard similarity: use
MinHashfirst. - Jaccard from existing cardinality pipelines: use
HyperLogLog+jacardhelpers. - Membership without delete: use
BloomFilter. - Membership with delete: use
CuckooFilter. - Approximate frequency (non-negative): use
MinMaxSketch. - Approximate frequency (signed +/- updates): use
CountSketch. - Heavy hitters / top-k: use
SpaceSaving. - General quantiles: use
KllSketch. - Tail-sensitive quantiles: use
TDigest. - Keep a representative stream sample: use
ReservoirSampling.
Quick Examples
Approximate distinct counting:
use HyperLogLog;
let mut hll = new?;
for i in 0_u64..100_000
println!;
# Ok::
Approximate Jaccard similarity (recommended via MinHash):
use JacardIndex;
use MinHash;
let mut left = new?;
let mut right = new?;
for v in 0_u64..10_000
for v in 5_000_u64..15_000
println!;
# Ok::
Run Examples
Validate