roughly
Probabilistic data structures for Rust, because sometimes close enough is good enough.
roughly provides a unified, ergonomic API for the three most widely-used probabilistic data structures. Each answers a different question about your data stream:
| Structure | Question | Memory | Error |
|---|---|---|---|
BloomFilter |
Is this item in the set? | O(n) bits | Tunable false-positive rate |
HyperLogLog |
How many unique items? | O(log log n) | ~2% std error |
CountMinSketch |
How often does this item appear? | O(1/ε · log 1/δ) | Tunable ε, δ |
Why roughly?
The Rust ecosystem has several fragmented crates for individual probabilistic structures, but no single crate that:
- Provides all three structures with a consistent, ergonomic API
- Uses builder pattern with human-readable parameters (set false-positive rate, not number of bits)
- Abstracts hashing behind a generic
BuildHashertrait - Exposes common traits (
MembershipSketch,CardinalitySketch,FrequencySketch) for trait-object use - Supports
serdeserialization andno_stdenvironments
Installation
[]
= "0.1"
Quick Start
use *;
// "Have I seen this URL before?"
let mut bloom = builder
.expected_items
.false_positive_rate // 1% false positive rate
.build;
bloom.insert;
assert!; // always true
bloom.contains => probably false
// "How many unique users visited today?"
let mut hll = builder
.std_error // ±1% standard error
.build;
for user_id in 0..1_000_000u64
println!; // ~1_000_000
// "What's the frequency of this search term?"
let mut cms = builder
.error_rate // error ≤ 0.1% of total events
.confidence // holds with 99% probability
.build;
for term in search_log
println!;
Design
Traits
All structures implement one of three traits, enabling polymorphic use:
Custom Hashers
All structures are generic over BuildHasher. The default is AHash:
use RandomState;
let bloom = builder_with_hasher
.expected_items
.false_positive_rate
.build;
Serialization
Enable the serde feature to serialize/deserialize any structure:
= { = "0.1", = ["serde"] }
let serialized = to_string?;
let restored: BloomFilter = from_str?;
no_std
Disable the std feature (requires alloc):
= { = "0.1", = false }
Algorithms
BloomFilter
Uses the Kirsch–Mitzenmacher double-hashing trick to simulate k independent hash functions from two base hashes, avoiding the cost of k full hash evaluations. Optimal m (bits) and k (hash functions) are computed from your target false-positive rate and expected item count.
HyperLogLog
64-bit HyperLogLog with:
- Bias-corrected harmonic mean estimator (Flajolet et al.)
- Small-range linear counting correction
- Large-range logarithmic correction
merge()support for distributed cardinality estimation
CountMinSketch
Standard Count-Min Sketch (Cormode & Muthukrishnan). Width and depth computed from error_rate and confidence:
width = ceil(e / error_rate)
depth = ceil(ln(1/(1-confidence)) / ln(2))
License
Licensed under either of MIT