Bloom Filter
A high-performance, memory-efficient Bloom Filter implementation in Rust.
Get directly from crates.io.
See bloomsrv at github.com and crates.io for bloomlib wrapped into a REST API, providing a language-agnostic service allowing users to create and query multiple bloom filters
What is a Bloom Filter?

A Bloom Filter is a space-efficient probabilistic data structure used to test whether an item is a member of a set. It is designed to rapidly and memory-efficiently check whether an item is definitely not in the set or possibly in the set. A Bloom filter cannot check whether an item certainly is in the set.
The Problem it Solves
When checking for membership in massive datasets (e.g., is a username is taken, is a URL is malicious, or checking a cache), keeping every item in a standard hash table (dictionary) or database index consumes a massive amount of storage space (RAM, disk).
A Bloom Filter solves this by sacrificing 100% accuracy for massive space savings. It uses a bit array and hash functions to represent the set.
Note that:
-
False Positives (FPs) are possible: the filter may say that an item possibly is in the set while the item isn't there.
-
False Negatives (FNs) are impossible: the filter may say that an item is not in the set only if the item indeed is not there.
Implementation Details
This implementation provides the Bloom Filter as a rust library, bloomlib.
It focuses on two main optimizations:
-
Memory efficiency: Instead of
Vec<bool>(a vector of booleans, which takes 1 byte per bit) orVec<usize>, it uses aVec<u64>(a vector of 64-bit unsigned integers). This ensures exactly 1 bit is used per flag, maximizing cache locality and minimizing heap usage. -
Computational efficiency: The implementation uses the Kirsch-Mitzenmacher Optimization (double hashing). A standard Bloom Filter requires $k$ distinct hash functions, which is computationally expensive. This implementation uses double hashing to simulate $k$ hash functions performing only two actual 64-bit hash computations ($h_1$ and $h_2$). The $i$-th hash value is then calculated as:
$$ g_i(x) = (h_1(x) + i \cdot h_2(x)) \mod m $$
where $m$ is the number of flags. This provides similar collision properties to independent hash functions but is significantly faster to compute.
Usage
Import the struct BloomFilter from the bloomlib library:
use BloomFilter;
Below is an example of using the Bloom filter.
Note that BloomFilter may be initialized either with the number of hash functions (with the double hashing
optimization, see above), provided as an integer value, or with th desired FP rate, provided as a floating point
value.
Configuration
The BloomFilter::new constructor is flexible and accepts either:
- False positive rate,
f64: The library calculates the optimal number of bits ($m$) and hashes ($k$) to match this rate. - Hash count,
u32: The library calculates the optimal number of bits ($m$) to satisfy the standard 50% fill-rate assumption for the given $k$, where the theoretical false positive rate is $\approx 2^{-k}$.
Limitations
-
Memory addressing and system architecture: The maximum size of the filter is constrained by the architecture's pointer size (
usize).-
64-bit Systems: The filter can theoretically address up to $2^{64}$ bits (though limited physically by available RAM).
-
32-bit Systems: The filter is limited to $2^{32}$ bits (approx 512MB to 4GB depending on OS memory limits).
-
-
Maximum Items: The
expected_itemsinput is ausize. You cannot create a filter for more items thanusize::MAX. -
Hash Function Limit: The number of hash functions ($k$) is stored as a
u32. While this allows for ~4 billion hashes, practically, a very high $k$ (caused by requesting an extremely low false positive rate like $10^{-20}$) will severely impact performance due to CPU overhead during insertions and lookups.
Testing
The library includes unit tests for initialization, insertion, persistence, and false positive rates.
Benchmarking
The package includes benchmarking code to measure memory footprint, insertion speed, and lookup speed.
Run the benchmark with the release flag for accurate timing:
Example output:
)
|
)
|
)
|
)
|
The performance depends on CPU speed and cache availability.
Note: Lookup of unseen items is faster than insert and lookup of seen items because the filter can return false as
soon as it encounters the first 0 bit, whereas insert and lookup of seen items must set or check, respectively, all
$k$ bits.
Contributions
Please feel free to open an issue or submit a pull request.