1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
//! bloom_filter_simple is a library that offers different implementations of a data
//! structure for filtering elements. The data structure is based on the ideas presented by Burton
//! Howard Bloom and is therefore known as bloom filter:
//! > Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun.
//! ACM 13, 7 (July 1970), 422–426. DOI: [https://doi.org/10.1145/362686.362692](https://doi.org/10.1145/362686.362692)
//!
//! # Overview
//! Basic description taken from [Wikipedia](https://en.wikipedia.org/wiki/Bloom_filter):
//!
//! > "A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard
//! Bloom in 1970, that is used to test whether an element is a member of a set. False positive
//! matches are possible, but false negatives are not – in other words, a query returns either
//! "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed
//! (though this can be addressed with the counting Bloom filter variant); the more items added, the
//! larger the probability of false positives." ("Bloom filter". Definition, para. 1. In Wikipedia.
//! Retrieved December 02, 2020, from https://en.wikipedia.org/wiki/Bloom_filter)
//!
//! # Bloom Filter Implementations
//! The library offers two basic types of bloom filter implementations.
//!
//! ## Kirsch-Mitzenmacher Bloom Filter (KMBloomFilter)
//! This type of bloom filter uses two hashers to simulate an arbitrary number of additional hash functions.
//!
//! The implementation is based on the work of [Kirsch and Mitzenmacher](https://doi.org/10.1007/11841036_42) \[1\].
//! In their work, they demonstrated that it is possible to apply simulated hash functions in a bloom
//! filter effectively, i.e., without loss in the asymptotic false positive probability.
//! Given two hash functions *h_1(x)* and *h_2(x)*, an *i*-th additional hash function *g_i(x)* can be
//! simulated as *g_i(x) = h_1(x) + i* \* *h_2(x)*.
//!
//!  > \[1\] Kirsch A., Mitzenmacher M. (2006) Less Hashing, Same Performance: Building a Better Bloom Filter.
//! In: Azar Y., Erlebach T. (eds) Algorithms – ESA 2006. ESA 2006. Lecture Notes in Computer Science, vol 4168.
//! Springer, Berlin, Heidelberg. https://doi.org/10.1007/11841036_42
//!
//! ## Seeded Bloom Filter (SeededBloomFilter)
//! A bloom filter that uses a single Hasher that can be seeded to simulate an arbitrary number of hash functions.
//! Internally, the implementation uses [ahash::AHasher](https://crates.io/crates/ahash).
//!
//! # Examples
//! In the following, you can find simple examples of how to initialize and use the different bloom filter types.
//!
//! ## Default Bloom Filter
//! The crate offers a default type for a KMBloomFilter that uses *ahash::AHasher* and Rust's
//! *std::collections::hash_map::DefaultHasher* to simulate more hash functions. We compared
//! different hash functions for use by KMBloomFilter, and this combination yielded the best results
//! with respect to the filter's false positive probability.
//!
//! We recommend using DefaultBloomFilter for quickly getting started.
//! ```
//! use bloom_filter_simple::{BloomFilter,DefaultBloomFilter};
//!
//! fn main() {
//!     // We plan on storing at most 10,000 elements
//!     let desired_capacity = 10_000;
//!     // The chance of a false positive increases with each inserted element.
//!     // This parameter specifies that the chance should be less than 0.01% (0.0001)
//!     // when the desired capacity has been reached. In other words, the chance
//!     // that the bloom filter returns true when checking whether a novel element
//!     // has been inserted before is less than 0.01% (0.0001).
//!     let desired_fp_probability = 0.0001;
//!
//!     let mut filter = DefaultBloomFilter::new(desired_capacity, desired_fp_probability);
//!
//!     // You can insert any type implementing the Hash trait. The bloom filter does
//!     // not store the inserted elements but only their hashes. Hence, there is no
//!     // transfer of ownership required.
//!     filter.insert(&5i32);
//!     filter.insert(&"Some text");
//!     filter.insert(&10_000usize);
//!
//!     // You can check whether a value has been inserted into the filter before.
//!     assert_eq!(false, filter.contains(&3));
//!     assert_eq!(true, filter.contains(&5));
//!     assert_eq!(true, filter.contains(&"Some text"));
//! }
//! ```
//!
//! ## KMBloomFilter
//! Initialization and application of a KMBloomFilter.
//! ```
//! use bloom_filter_simple::{BloomFilter,KMBloomFilter};
//! use ahash::AHasher;
//! use std::collections::hash_map::DefaultHasher;
//!
//! fn main() {
//!     // We plan on storing at most 10,000 elements
//!     let desired_capacity = 10_000;
//!     // We want to assure that the chance of a false positive is less than 0.01% (0.0001)
//!     // for up to desired_capacity elements.
//!     let desired_fp_probability = 0.0001;
//!
//!     // We initialize a new KMBloomFilter by specifying the desired Hashers as type
//!     // parameters. It is possible to use any type that implements Hasher + Default.
//!     // Default is required to receive a new instance of a hasher after a value was
//!     // hashed, because the Hasher trait does not provide an interface for resetting
//!     // a hasher implementing it. This is required to receive the same hash value
//!     // when inserting or checking the same element multiple times.
//!     let mut filter: KMBloomFilter<AHasher, DefaultHasher> = KMBloomFilter::new(
//!         desired_capacity,
//!         desired_fp_probability
//!     );
//!
//!     // You can insert any type implementing the Hash trait. The bloom filter does not
//!     // store the inserted elements but only their hashes. Hence, there is no transfer
//!     // of ownership required.
//!     filter.insert(&5i32);
//!     filter.insert(&"Some text");
//!     filter.insert(&10_000usize);
//!
//!     // You can check whether a value has been inserted into the filter before.
//!     assert_eq!(false, filter.contains(&3));
//!     assert_eq!(true, filter.contains(&5));
//!     assert_eq!(true, filter.contains(&"Some text"));
//! }
//! ```
//!
//! ## SeededBloomFilter
//! Initialization and application of a SeededBloomFilter.
//! ```
//! use bloom_filter_simple::{BloomFilter,SeededBloomFilter};
//!
//! fn main() {
//!     // We plan on storing at most 10,000 elements
//!     let desired_capacity = 10_000;
//!     // We want to assure that the chance of a false positive is less than 0.0001
//!     // for up to desired_capacity elements.
//!     let desired_fp_probability = 0.0001;
//!
//!     // A SeededBloomFilter uses a single seeded ahash::AHasher internally.
//!     let mut filter = SeededBloomFilter::new(desired_capacity, desired_fp_probability);
//!
//!     // You can insert any type implementing the Hash trait. The bloom filter does
//!     // not store the inserted elements but only their hashes. Hence, there is no
//!     // transfer of ownership required.
//!     filter.insert(&5i32);
//!     filter.insert(&"Some text");
//!     filter.insert(&10_000usize);
//!
//!     // You can check whether a value has been inserted into the filter before.
//!     assert_eq!(false, filter.contains(&3));
//!     assert_eq!(true, filter.contains(&5));
//!     assert_eq!(true, filter.contains(&"Some text"));
//! }
//! ```

use std::{collections::hash_map::DefaultHasher, hash::Hash};

mod bitset;
mod km_bloom_filter;
mod seeded_bloom_filter;

pub use km_bloom_filter::KMBloomFilter;
pub use seeded_bloom_filter::SeededBloomFilter;

/**
 A default implementation of KMBloomFilter using ahash::AHasher and collections::hash_map::DefaultHasher.

 DefaultBloomFilter is implemented as a type definition `type DefaultBloomFilter = KMBloomFilter<ahash::AHasher, DefaultHasher>;`
 # Examples
 ```
 use bloom_filter_simple::{DefaultBloomFilter,BloomFilter};

 fn simple_bloom_filter_test() {
     let desired_capacity = 1_000_000;
     let false_positive_probability = 0.0001;
     let mut bloom_filter = DefaultBloomFilter::new(desired_capacity, false_positive_probability);

     bloom_filter.insert(&"Hello!");
     bloom_filter.insert(&34);

     assert!(bloom_filter.contains(&"Hello!"));
     assert!(bloom_filter.contains(&34));
     assert_eq!(false, bloom_filter.contains(&"Not in filter"));
 }
 ```
*/
pub type DefaultBloomFilter = KMBloomFilter<ahash::AHasher, DefaultHasher>;

/// This trait defines the basic functionality supported by the bloom filters in this library.
///
pub trait BloomFilter {
    /// Insert data into the filter.
    ///
    /// # Intended Behavior
    /// A type implementing BloomFilter should implement *insert* with respect to the following points:
    /// * It should be possible to insert the same element multiple times.
    /// * It should be possible to insert any type implementing Hash.
    ///
    /// # Examples
    /// How *insert* of a type implementing BloomFilter might be used:
    /// ```
    /// use bloom_filter_simple::{BloomFilter, DefaultBloomFilter};
    ///
    /// fn bloom_filter_insert() {
    ///     let mut bloom_filter = DefaultBloomFilter::new(5, 0.001);
    ///     bloom_filter.insert(&"Hello!");
    ///     bloom_filter.insert(&5);
    ///     bloom_filter.insert(&"Hello!");
    ///
    ///     assert_eq!(true, bloom_filter.contains(&"Hello!"));
    /// }
    /// ```
    fn insert<T: Hash>(&mut self, data: &T);

    /// Check whether data is contained in the bloom filter.
    ///
    /// # Intended Behavior
    /// Checking whether data is contained in a bloom filter must never result in a false negative,
    /// i.e., if an element 'x' has been inserted into the filter, contains(&x) will *always* return true.
    ///
    /// In contrast, contains can result in false positive, i.e., contains(&x) can return true, even if
    /// x has not been inserted yet. The chance of this happending depends on the number of elements
    /// in the bloom filter, and the number of hash functions that are used. When initializing one
    /// of the filters provided in this crate, you can specify the desired false positive probability.
    ///
    /// A type implementing BloomFilter should implement *contains* with respect to the following points:
    /// * *contains(&x)* **must** return *true* if *x* has been inserted into the filter
    /// * *contains(&x)* **can** return *true* even if *x* has **not** been inserted into the filter
    /// * It should be possible to check any type implementing Hash.
    ///
    /// # Examples
    /// How contains of a type implementing BloomFilter might be used:
    /// ```
    /// use bloom_filter_simple::{BloomFilter, DefaultBloomFilter};
    /// fn bloom_filter_insert() {
    ///     let mut bloom_filter = DefaultBloomFilter::new(5, 0.001);
    ///     bloom_filter.insert(&"Hello!");
    ///     // This assert will never fail
    ///     assert_eq!(true, bloom_filter.contains(&"Hello!"));
    ///     // This assert can fail with a probability of p(fp) < 0.001
    ///     assert_eq!(false, bloom_filter.contains(&"Goodbye!"));
    /// }
    /// ```
    fn contains<T: Hash>(&self, data: &T) -> bool;
}

/// Calculate the optimal bit count to satisfy the desired constraints.
/// Formula taken from Sagi Kedmi:
/// > S. Kedmi, ["Bloom Filters for the Perplexed"](https://sagi.io/bloom-filters-for-the-perplexed/), July 2017 [Accessed: 02.12.2020]
fn optimal_bit_count(desired_capacity: usize, desired_false_positive_probability: f64) -> usize {
    (-(desired_capacity as f64 * desired_false_positive_probability.ln()) / (2.0f64.ln().powi(2)))
        .ceil() as usize
}

/// Calculate the optimal number of hashers to satisfy the desired constraints.
/// Formula taken from Sagi Kedmi:
/// > S. Kedmi, ["Bloom Filters for the Perplexed"](https://sagi.io/bloom-filters-for-the-perplexed/), July 2017 [Accessed: 02.12.2020]
fn optimal_number_of_hashers(desired_capacity: usize, bit_count: usize) -> usize {
    ((bit_count as f64 / desired_capacity as f64) * 2.0f64.ln()).round() as usize
}

/// Approximate number of elements stored.
/// Formula taken from Wikipedia:
/// > Wikipedia, ["Bloom filter"](https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter) [Accessed: 02.12.2020]
fn approximate_element_count(
    number_of_hashers: usize,
    bits_per_hasher: usize,
    number_of_ones: usize,
) -> f64 {
    -(bits_per_hasher as f64)
        * (1.0 - (number_of_ones as f64) / ((number_of_hashers * bits_per_hasher) as f64)).ln()
}

/// Return the current approximate false positive probability which depends on the current
/// number of elements in the filter.
/// Formula taken from Sagi Kedmi:
/// > S. Kedmi, ["Bloom Filters for the Perplexed"](https://sagi.io/bloom-filters-for-the-perplexed/), July 2017 [Accessed: 02.12.2020]
fn approximate_false_positive_probability(
    number_of_hashers: usize,
    bits_per_hasher: usize,
    element_count: f64,
) -> f64 {
    (1.0 - std::f64::consts::E.powf(-element_count / bits_per_hasher as f64))
        .powf(number_of_hashers as f64)
}