fastbloom

The fastest bloom filter in Rust. No accuracy compromises. Compatible with any hasher.

Usage

# Cargo.toml
[dependencies]
fastbloom = "0.3.0"

Basic usage:

use fastbloom::BloomFilter;

let num_bits = 1024;

let mut filter = BloomFilter::builder(num_bits).expected_items(2);
filter.insert("42");
filter.insert("🦀");

Instantiate from a collection of items:

use fastbloom::BloomFilter;

let num_bits = 1024;

let filter = BloomFilter::builder(num_bits).items(["42", "🦀"]);
assert!(filter.contains("42"));
assert!(filter.contains("🦀"));

Use any hasher:

use fastbloom::BloomFilter;
use ahash::RandomState;

let num_bits = 1024;

let filter = BloomFilter::builder(num_bits)
    .hasher(RandomState::default())
    .items(["42", "🦀"]);

Background

Bloom filters are space efficient approximate membership set data structures. False positives from a membership check are possible, but false negatives are not. See more.

Blocked bloom filters are supported by an underlying bit vector, chunked into 512, 256, 128, or 64 bit "blocks", to track item membership. To insert, a number of bits are set at positions based on the item's hash in one of the underlying bit vector's blocks. To check membership, a number of bits are checked at positions based on the item's hash in one of the underlying bit vector's blocks.

hash(4) ──────┬─────┬───────────────┐
              ↓     ↓               ↓
0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0
  ↑           ↑           ↑
  └───────────┴───────────┴──── hash(3) (not in the set)

See more on blocked bloom filters.

Once constructed, neither the bloom filter's underlying memory usage nor number of bits per item change.

Implementation

fastbloom is several times faster than existing bloom filters and scales very well with the number of hashes per item. In all cases, fastbloom maintains competitive false positive rates. fastbloom is blazingly fast because it uses L1 cache friendly blocks, efficiently derives many index bits from only one real hash per item, and leverages other research findings on bloom filters.

fastbloom's default hasher is SipHash-1-3 using randomized keys but can be seeded or configured to use any hasher.

Runtime Performance

SipHash

Runtime comparison to other bloom filter crates (all using SipHash). Note:

The number hashes for all bloom filters is derived to optimize accuracy, meaning fewer items in the bloom filters result in more hashes per item and generally slower performance.
As number of items (input) increases, the accuracy of the bloom filter decreases. 1000 random strings were used to test membership.

Any Hash Goes

The fastbloom-rs crate (similarily named) uses xxhash, which is faster than SipHash, so it is not fair to compare above. However, we can configure fastbloom to use a similarly fast hash, ahash, and compare. 1000 random strings were used to test membership. member-fb non-member-fb

False Positive Performance

fastbloom does not compromise accuracy. Below is a comparison of false positive rates with other bloom filter crates: bloom-fp

The bloom filters and a control hash set were populated with a varying number of random 64 bit integers ("Number of Items"). Then 100,000 random 64 bit integers were checked: false positives are numbers that do NOT exist in the control hash set but do report as existing in the bloom filter.

Comparing Block Sizes: 64, 128, 256, and 512 Bits

fastbloom offers 4 different block sizes. 512 bits is the default. Larger block sizes generally have slower performance but are more accurate.

Runtime Performance

Times are for 1000 random strings. The bloom filters used ahash.

member-fastbloom-blocks non-member-fastbloom-blocks

Accuracy

fastbloom-block-fp

How it Works

fastbloom attributes its performance to two insights:

Only one real hash per item is needed, subsequent hashes can be cheaply derived from the real hash
Many bit positions can be derived from subsequent hashes

One Real Hash Per Item

The item is hashed once, producing the "real" hash. Two original hashes are derived:

The first is the real hash
The second is the last 32 bits for the first hash multiplied by constant number

Subsequent hashes are derived by a common technique explained in depth in this paper. The ith subsequent hash is derived by mixing the first two:

*hash1 = hash1.wrapping_add(hash2).rotate_left(5);
return *hash1

Many Bit Positions Derived from Subsequent Hashes

Instead of deriving a single bit position per hash, a hash with ~N 1 bits set can be formed by chaining bitwise AND and OR operations of the subsequent hashes.

Example

For a bloom filter with a bit vector of size 64 and desired hashes 24, 24 (potentially overlapping) positions in the bit vector are set or checked for each item on insertion or membership check respectively.

Other bloom filters derive 24 positions based on 24 hashes of the item:

hash0(item) % 64
hash1(item) % 64
...
hash23(item) % 64

Instead, fastbloom derives a hash of the item with ~20 bits set and then adds it to the bit vector with a bitwise OR:

hash0(item) & hash1(item) | hash2(item) & hash3(item)

That's 4 hashes versus 24!

Note:

Given 64 bits, and 24 hashes, a bit has probability (63/64)^24 to NOT be set, i.e. 0, after 24 hashes. The expected number of bits to be set for an item is 64 - (64 * (63/64)^24) ~= 20.
A 64 bit hash0(item) provides us with roughly 32 set bits with a binomial distribution. hash0(item) & hash1(item) gives us ~16 set bits, hash0(item) | hash1(item) gives us ~48 set bits, etc.

In reality, the bloom filter may have more than 64 bits of storage. In that case, many underlying u64s in the block are operated on, and the number of hashes is adjusted to be the number of hashes per u64 in the block. Additionally, some bits may be set in the usual way to account for any rounding errors.

References

License

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

fastbloom 0.3.0

fastbloom

Usage

Background

Implementation

Runtime Performance

SipHash

Any Hash Goes

False Positive Performance

Comparing Block Sizes: 64, 128, 256, and 512 Bits

Runtime Performance

Accuracy

How it Works

One Real Hash Per Item

Many Bit Positions Derived from Subsequent Hashes

Example

References

License

Contribution