bloom-lib 1.0.0

Probabilistic data structure library: Bloom filters, Cuckoo filters, Count-Min Sketch, HyperLogLog, MinHash, and Top-K. Tunable false-positive rates, serializable state, merge support, and streaming-safe updates.
Documentation

What it does

bloom-lib is a collection of probabilistic data structures: compact summaries that answer questions about a data set — membership, cardinality, frequency, similarity — with bounded, tunable error in a fraction of the memory an exact structure would need. They are designed for streaming workloads: insertions are allocation-free, state is serializable, and compatible structures can be merged.

The crate is no_std-friendly (it needs only alloc), carries no required runtime dependency, and is generic over the hash function so you can trade the fast deterministic default for an adversary-resistant one when inputs are untrusted.


Available structures

Structure Answers Deletion Mergeable
BloomFilter Is this item in the set?
CuckooFilter Is this item in the set?
CountMinSketch How often have I seen this item?
HyperLogLog How many distinct items are there?
MinHash How similar are two sets?
TopK What are the most frequent items?

Every structure is generic over the item type and the hash function, supports clear, and (with the serde feature) serializes its state.


Installation

[dependencies]
bloom-lib = "1.0"

# With serialization support:
bloom-lib = { version = "1.0", features = ["serde"] }

# On a heap-capable no_std target:
bloom-lib = { version = "1.0", default-features = false, features = ["alloc"] }

Quick start

use bloom_lib::BloomFilter;

// Size for 100,000 distinct items at a 0.1% false-positive rate.
let mut filter = BloomFilter::new(100_000, 0.001).unwrap();

filter.insert("session-token");

assert!(filter.contains("session-token")); // never a false negative
assert!(!filter.contains("never-seen"));    // very likely absent

insert doubles as a deduplicating primitive: it returns true the first time it sees an item and false when the item was probably already present.

use bloom_lib::BloomFilter;

let mut seen = BloomFilter::new(10_000, 0.01).unwrap();
assert!(seen.insert("first-visit"));   // newly added
assert!(!seen.insert("first-visit"));  // already present

API overview

For the complete reference with parameters and worked examples, see docs/API.md.

  • BloomFilter — set membership with a tunable false-positive rate, merge support, and cardinality estimation.
  • CuckooFilter — set membership that also supports deletion.
  • CountMinSketch — per-item frequency estimation.
  • HyperLogLog — distinct-count estimation in tiny, fixed memory.
  • MinHash — Jaccard similarity estimation between sets.
  • TopK — the most frequent items (heavy hitters) in a stream.
  • Error — the crate-wide error type returned by fallible operations.
  • HashingDefaultHasher, DefaultHashBuilder, and the rationale for deterministic hashing.
  • Prelude — convenient glob re-exports.

Hashing

Every structure is generic over core::hash::BuildHasher and defaults to the deterministic DefaultHashBuilder. A fixed seed makes hashing reproducible: filters built with the default hasher are always mergeable, and serialized state round-trips byte-for-byte across runs and platforms. The trade-off is that a fixed seed offers no defence against an adversary crafting colliding inputs. When inputs are untrusted, supply a randomly-seeded hasher such as std::collections::hash_map::RandomState:

use std::collections::hash_map::RandomState;
use bloom_lib::BloomFilter;

let filter: BloomFilter<&str, RandomState> =
    BloomFilter::with_hasher(1_000, 0.01, RandomState::new()).unwrap();

The default hasher is a fast, non-cryptographic 64-bit hash with strong avalanche behaviour. From a single hashing pass the structures synthesise the many index positions they need using the Kirsch–Mitzenmacher double-hashing scheme, and map hashes into range with a division-free multiply-shift reduction.


Performance

The hot paths are allocation-free and use only integer arithmetic: one hashing pass per item, then a handful of multiply-shift index computations and array probes. Floating-point math runs only at construction, via libm, so a structure's derived geometry is bit-identical on every platform.

Latest local Criterion means (cargo bench --bench probabilistic, Windows x86_64, Rust stable, steady-state on a pre-populated structure):

Operation ns/op
BloomFilter::insert ~6.9
BloomFilter::contains (hit) ~5.5
CuckooFilter::contains (hit) ~6.0
CountMinSketch::increment ~5.8
CountMinSketch::estimate ~6.3
HyperLogLog::insert ~0.9
MinHash::insert (128 hashes) ~24
TopK::insert (k = 100) ~72

MinHash and TopK scale with their configured size (the signature length and k, respectively); the others are effectively constant-time per operation. Run cargo bench to reproduce these on your hardware.


Feature flags

Feature Default Description
std Enables every structure and the std::error::Error impl for Error. Implies alloc.
alloc Enables every structure without std, for heap-capable no_std targets.
serde Derives Serialize/Deserialize for every structure. Implies alloc.

With no features enabled the crate still compiles and exposes VERSION and Error, so it can sit in a no_std dependency tree without pulling in a heap.


Examples

Runnable examples live in examples/, one per structure:

Example Shows
bloom_dedup Deduplicating a stream with a BloomFilter.
membership_with_deletion Granting and revoking membership with a CuckooFilter.
frequency Per-item frequency estimation with a CountMinSketch.
cardinality Counting unique items with HyperLogLog.
similarity Document similarity with MinHash.
top_words Most frequent words with TopK.
cargo run --example bloom_dedup --release

Testing

# Default features
cargo test

# Every feature, including serde round-trips and property tests
cargo test --all-features

# Heap-only no_std build
cargo test --no-default-features --features alloc

# Microbenchmarks
cargo bench --bench probabilistic

Cross-platform support

Tier-1 targets — Linux, macOS, and Windows (x86_64 and aarch64) — compile and pass the full suite on both stable and the MSRV (Rust 1.75). CI runs formatting, clippy with -D warnings, the test suite, and a documentation build with RUSTDOCFLAGS=-D warnings across all three operating systems.


Standards

  • REPS governs every decision. See REPS.md.
  • MSRV: Rust 1.75.
  • Edition: 2021.
  • Cross-platform: Linux, macOS, Windows.

License

Dual-licensed under either of:

at your option.