roughly 0.1.0

Probabilistic data structures for Rust — because sometimes close enough is good enough.
Documentation
  • Coverage
  • 100%
    75 out of 75 items documented4 out of 68 items with examples
  • Size
  • Source code size: 43.05 kB This is the summed size of all the files inside the crates.io package for this release.
  • Documentation size: 4.52 MB This is the summed size of all files generated by rustdoc for all configured targets
  • Ø build duration
  • this release: 2m 23s Average build duration of successful builds.
  • all releases: 2m 23s Average build duration of successful builds in releases after 2024-10-23.
  • Links
  • Homepage
  • suryanox/roughly
    1 0 0
  • crates.io
  • Dependencies
  • Versions
  • Owners
  • suryanox

roughly

Crates.io Docs.rs CI

Probabilistic data structures for Rust, because sometimes close enough is good enough.

roughly provides a unified, ergonomic API for the three most widely-used probabilistic data structures. Each answers a different question about your data stream:

Structure Question Memory Error
BloomFilter Is this item in the set? O(n) bits Tunable false-positive rate
HyperLogLog How many unique items? O(log log n) ~2% std error
CountMinSketch How often does this item appear? O(1/ε · log 1/δ) Tunable ε, δ

Why roughly?

The Rust ecosystem has several fragmented crates for individual probabilistic structures, but no single crate that:

  • Provides all three structures with a consistent, ergonomic API
  • Uses builder pattern with human-readable parameters (set false-positive rate, not number of bits)
  • Abstracts hashing behind a generic BuildHasher trait
  • Exposes common traits (MembershipSketch, CardinalitySketch, FrequencySketch) for trait-object use
  • Supports serde serialization and no_std environments

Installation

[dependencies]
roughly = "0.1"

Quick Start

use roughly::prelude::*;

// "Have I seen this URL before?"

let mut bloom = BloomFilter::builder()
    .expected_items(1_000_000)
    .false_positive_rate(0.01)   // 1% false positive rate
    .build();

bloom.insert(&"https://example.com");
assert!(bloom.contains(&"https://example.com"));  // always true
bloom.contains(&"https://other.com") => probably false
// "How many unique users visited today?"

let mut hll = HyperLogLog::builder()
    .std_error(0.01)   // ±1% standard error
    .build();

for user_id in 0..1_000_000u64 {
    hll.insert(&user_id);
}
println!("~{} unique users", hll.count()); // ~1_000_000
// "What's the frequency of this search term?"

let mut cms = CountMinSketch::builder()
    .error_rate(0.001)   // error ≤ 0.1% of total events
    .confidence(0.99)    // holds with 99% probability
    .build();

for term in search_log {
    cms.insert(&term);
}
println!("'rust' appears ~{} times", cms.estimate(&"rust"));

Design

Traits

All structures implement one of three traits, enabling polymorphic use:

pub trait MembershipSketch<T: Hash> {
    fn insert(&mut self, item: &T);
    fn contains(&self, item: &T) -> bool;
    fn false_positive_rate(&self) -> f64;
    fn len(&self) -> usize;
    fn clear(&mut self);
}

pub trait CardinalitySketch<T: Hash> {
    fn insert(&mut self, item: &T);
    fn count(&self) -> u64;
    fn std_error(&self) -> f64;
    fn merge(&mut self, other: &Self);
    fn clear(&mut self);
}

pub trait FrequencySketch<T: Hash> {
    fn insert(&mut self, item: &T);
    fn insert_many(&mut self, item: &T, count: u64);
    fn estimate(&self, item: &T) -> u64;
    fn error_rate(&self) -> f64;
    fn confidence(&self) -> f64;
    fn total(&self) -> u64;
    fn clear(&mut self);
}

Custom Hashers

All structures are generic over BuildHasher. The default is AHash:

use std::collections::hash_map::RandomState;

let bloom = BloomFilter::builder_with_hasher(RandomState::new())
    .expected_items(10_000)
    .false_positive_rate(0.01)
    .build();

Serialization

Enable the serde feature to serialize/deserialize any structure:

roughly = { version = "0.1", features = ["serde"] }
let serialized = serde_json::to_string(&bloom)?;
let restored: BloomFilter = serde_json::from_str(&serialized)?;

no_std

Disable the std feature (requires alloc):

roughly = { version = "0.1", default-features = false }

Algorithms

BloomFilter

Uses the Kirsch–Mitzenmacher double-hashing trick to simulate k independent hash functions from two base hashes, avoiding the cost of k full hash evaluations. Optimal m (bits) and k (hash functions) are computed from your target false-positive rate and expected item count.

HyperLogLog

64-bit HyperLogLog with:

  • Bias-corrected harmonic mean estimator (Flajolet et al.)
  • Small-range linear counting correction
  • Large-range logarithmic correction
  • merge() support for distributed cardinality estimation

CountMinSketch

Standard Count-Min Sketch (Cormode & Muthukrishnan). Width and depth computed from error_rate and confidence:

width = ceil(e / error_rate)
depth = ceil(ln(1/(1-confidence)) / ln(2))

License

Licensed under either of MIT