sketches 0.1.0

Probabilistic data structures for scalable approximate analytics
Documentation
  • Coverage
  • 92.86%
    143 out of 154 items documented14 out of 129 items with examples
  • Size
  • Source code size: 139.14 kB This is the summed size of all the files inside the crates.io package for this release.
  • Documentation size: 8 MB This is the summed size of all files generated by rustdoc for all configured targets
  • Ø build duration
  • this release: 23s Average build duration of successful builds.
  • all releases: 25s Average build duration of successful builds in releases after 2024-10-23.
  • Links
  • crates.io
  • Dependencies
  • Versions
  • Owners
  • eloraiby

sketches

Probabilistic data structures for scalable approximate analytics in Rust.

This crate gives you memory-efficient sketches for:

  • frequency estimation,
  • distinct counting,
  • set membership,
  • set similarity (Jaccard),
  • heavy hitter detection,
  • quantiles,
  • and stream sampling.

All sketches are designed for streaming workloads where exact data structures (HashMap, full sorted buffers, exact sets) are too expensive in memory or throughput.

Add To A Project

This repository is currently consumed as a local crate:

[dependencies]
sketches = { path = "../sketches" }

What Is Included

Sketch Module Use it when Notes
Bloom Filter bloom_filter You need very fast membership checks and can tolerate false positives No deletions
Cuckoo Filter cuckoo_filter You need membership checks and deletions Can fail inserts at high load
HyperLogLog hyperloglog You need approximate distinct counts (COUNT(DISTINCT ...)) Mergeable, tiny memory footprint
MinMax Sketch minmax_sketch You need approximate non-negative frequency counts Conservative updates reduce overestimation
Count Sketch count_sketch You need approximate signed frequency updates Good for turnstile streams (+/- updates)
Space-Saving space_saving You need top-k / heavy hitters from a stream Tracks only bounded number of candidates
KLL Sketch kll You need general quantiles (median, p90, p99) Good default quantile sketch
t-digest tdigest You care most about tail quantiles (p95/p99/p999) Typically stronger tail behavior
MinHash minhash You need Jaccard similarity between sets Best default for similarity tasks
Reservoir Sampling reservoir_sampling You need a uniform sample from an unbounded stream Fixed-size unbiased sample
Jaccard trait/helpers jacard You want a shared Jaccard API across sketches Provides JacardIndex trait

Which Sketch Should I Use?

If your primary goal is:

  • Distinct counting: use HyperLogLog.
  • Jaccard similarity: use MinHash first.
  • Jaccard from existing cardinality pipelines: use HyperLogLog + jacard helpers.
  • Membership without delete: use BloomFilter.
  • Membership with delete: use CuckooFilter.
  • Approximate frequency (non-negative): use MinMaxSketch.
  • Approximate frequency (signed +/- updates): use CountSketch.
  • Heavy hitters / top-k: use SpaceSaving.
  • General quantiles: use KllSketch.
  • Tail-sensitive quantiles: use TDigest.
  • Keep a representative stream sample: use ReservoirSampling.

Quick Examples

Approximate distinct counting:

use sketches::hyperloglog::HyperLogLog;

let mut hll = HyperLogLog::new(12)?;
for i in 0_u64..100_000 {
    hll.add(&i);
}
println!("distinct ~ {}", hll.count());
# Ok::<(), Box<dyn std::error::Error>>(())

Approximate Jaccard similarity (recommended via MinHash):

use sketches::jacard::JacardIndex;
use sketches::minhash::MinHash;

let mut left = MinHash::new(256)?;
let mut right = MinHash::new(256)?;

for v in 0_u64..10_000 {
    left.add(&v);
}
for v in 5_000_u64..15_000 {
    right.add(&v);
}

println!("jaccard ~ {:.4}", left.jaccard_index(&right)?);
# Ok::<(), Box<dyn std::error::Error>>(())

Run Examples

cargo run --example bloom_filter
cargo run --example cuckoo_filter
cargo run --example hyperloglog
cargo run --example jacard
cargo run --example minhash
cargo run --example minmax_sketch
cargo run --example count_sketch
cargo run --example space_saving
cargo run --example kll
cargo run --example tdigest
cargo run --example reservoir_sampling

Validate

cargo test
cargo check --examples