1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274
//! bloom_filter_simple is a library that offers different implementations of a data //! structure for filtering elements. The data structure is based on the ideas presented by Burton //! Howard Bloom and is therefore known as bloom filter: //! > Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. //! ACM 13, 7 (July 1970), 422–426. DOI: [https://doi.org/10.1145/362686.362692](https://doi.org/10.1145/362686.362692) //! //! # Overview //! Basic description taken from [Wikipedia](https://en.wikipedia.org/wiki/Bloom_filter): //! //! > "A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard //! Bloom in 1970, that is used to test whether an element is a member of a set. False positive //! matches are possible, but false negatives are not – in other words, a query returns either //! "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed //! (though this can be addressed with the counting Bloom filter variant); the more items added, the //! larger the probability of false positives." ("Bloom filter". Definition, para. 1. In Wikipedia. //! Retrieved December 02, 2020, from https://en.wikipedia.org/wiki/Bloom_filter) //! //! # Bloom Filter Implementations //! The library offers two basic types of bloom filter implementations. //! //! ## Kirsch-Mitzenmacher Bloom Filter (KMBloomFilter) //! This type of bloom filter uses two hashers to simulate an arbitrary number of additional hash functions. //! //! The implementation is based on the work of [Kirsch and Mitzenmacher](https://doi.org/10.1007/11841036_42) \[1\]. //! In their work, they demonstrated that it is possible to apply simulated hash functions in a bloom //! filter effectively, i.e., without loss in the asymptotic false positive probability. //! Given two hash functions *h_1(x)* and *h_2(x)*, an *i*-th additional hash function *g_i(x)* can be //! simulated as *g_i(x) = h_1(x) + i* \* *h_2(x)*. //! //! > \[1\] Kirsch A., Mitzenmacher M. (2006) Less Hashing, Same Performance: Building a Better Bloom Filter. //! In: Azar Y., Erlebach T. (eds) Algorithms – ESA 2006. ESA 2006. Lecture Notes in Computer Science, vol 4168. //! Springer, Berlin, Heidelberg. https://doi.org/10.1007/11841036_42 //! //! ## Seeded Bloom Filter (SeededBloomFilter) //! A bloom filter that uses a single Hasher that can be seeded to simulate an arbitrary number of hash functions. //! Internally, the implementation uses [ahash::AHasher](https://crates.io/crates/ahash). //! //! # Examples //! In the following, you can find simple examples of how to initialize and use the different bloom filter types. //! //! ## Default Bloom Filter //! The crate offers a default type for a KMBloomFilter that uses *ahash::AHasher* and Rust's //! *std::collections::hash_map::DefaultHasher* to simulate more hash functions. We compared //! different hash functions for use by KMBloomFilter, and this combination yielded the best results //! with respect to the filter's false positive probability. //! //! We recommend using DefaultBloomFilter for quickly getting started. //! ``` //! use bloom_filter_simple::{BloomFilter,DefaultBloomFilter}; //! //! fn main() { //! // We plan on storing at most 10,000 elements //! let desired_capacity = 10_000; //! // The chance of a false positive increases with each inserted element. //! // This parameter specifies that the chance should be less than 0.01% (0.0001) //! // when the desired capacity has been reached. In other words, the chance //! // that the bloom filter returns true when checking whether a novel element //! // has been inserted before is less than 0.01% (0.0001). //! let desired_fp_probability = 0.0001; //! //! let mut filter = DefaultBloomFilter::new(desired_capacity, desired_fp_probability); //! //! // You can insert any type implementing the Hash trait. The bloom filter does //! // not store the inserted elements but only their hashes. Hence, there is no //! // transfer of ownership required. //! filter.insert(&5i32); //! filter.insert(&"Some text"); //! filter.insert(&10_000usize); //! //! // You can check whether a value has been inserted into the filter before. //! assert_eq!(false, filter.contains(&3)); //! assert_eq!(true, filter.contains(&5)); //! assert_eq!(true, filter.contains(&"Some text")); //! } //! ``` //! //! ## KMBloomFilter //! Initialization and application of a KMBloomFilter. //! ``` //! use bloom_filter_simple::{BloomFilter,KMBloomFilter}; //! use ahash::AHasher; //! use std::collections::hash_map::DefaultHasher; //! //! fn main() { //! // We plan on storing at most 10,000 elements //! let desired_capacity = 10_000; //! // We want to assure that the chance of a false positive is less than 0.01% (0.0001) //! // for up to desired_capacity elements. //! let desired_fp_probability = 0.0001; //! //! // We initialize a new KMBloomFilter by specifying the desired Hashers as type //! // parameters. It is possible to use any type that implements Hasher + Default. //! // Default is required to receive a new instance of a hasher after a value was //! // hashed, because the Hasher trait does not provide an interface for resetting //! // a hasher implementing it. This is required to receive the same hash value //! // when inserting or checking the same element multiple times. //! let mut filter: KMBloomFilter<AHasher, DefaultHasher> = KMBloomFilter::new( //! desired_capacity, //! desired_fp_probability //! ); //! //! // You can insert any type implementing the Hash trait. The bloom filter does not //! // store the inserted elements but only their hashes. Hence, there is no transfer //! // of ownership required. //! filter.insert(&5i32); //! filter.insert(&"Some text"); //! filter.insert(&10_000usize); //! //! // You can check whether a value has been inserted into the filter before. //! assert_eq!(false, filter.contains(&3)); //! assert_eq!(true, filter.contains(&5)); //! assert_eq!(true, filter.contains(&"Some text")); //! } //! ``` //! //! ## SeededBloomFilter //! Initialization and application of a SeededBloomFilter. //! ``` //! use bloom_filter_simple::{BloomFilter,SeededBloomFilter}; //! //! fn main() { //! // We plan on storing at most 10,000 elements //! let desired_capacity = 10_000; //! // We want to assure that the chance of a false positive is less than 0.0001 //! // for up to desired_capacity elements. //! let desired_fp_probability = 0.0001; //! //! // A SeededBloomFilter uses a single seeded ahash::AHasher internally. //! let mut filter = SeededBloomFilter::new(desired_capacity, desired_fp_probability); //! //! // You can insert any type implementing the Hash trait. The bloom filter does //! // not store the inserted elements but only their hashes. Hence, there is no //! // transfer of ownership required. //! filter.insert(&5i32); //! filter.insert(&"Some text"); //! filter.insert(&10_000usize); //! //! // You can check whether a value has been inserted into the filter before. //! assert_eq!(false, filter.contains(&3)); //! assert_eq!(true, filter.contains(&5)); //! assert_eq!(true, filter.contains(&"Some text")); //! } //! ``` use std::{collections::hash_map::DefaultHasher, hash::Hash}; mod bitset; mod km_bloom_filter; mod seeded_bloom_filter; pub use km_bloom_filter::KMBloomFilter; pub use seeded_bloom_filter::SeededBloomFilter; /** A default implementation of KMBloomFilter using ahash::AHasher and collections::hash_map::DefaultHasher. DefaultBloomFilter is implemented as a type definition `type DefaultBloomFilter = KMBloomFilter<ahash::AHasher, DefaultHasher>;` # Examples ``` use bloom_filter_simple::{DefaultBloomFilter,BloomFilter}; fn simple_bloom_filter_test() { let desired_capacity = 1_000_000; let false_positive_probability = 0.0001; let mut bloom_filter = DefaultBloomFilter::new(desired_capacity, false_positive_probability); bloom_filter.insert(&"Hello!"); bloom_filter.insert(&34); assert!(bloom_filter.contains(&"Hello!")); assert!(bloom_filter.contains(&34)); assert_eq!(false, bloom_filter.contains(&"Not in filter")); } ``` */ pub type DefaultBloomFilter = KMBloomFilter<ahash::AHasher, DefaultHasher>; /// This trait defines the basic functionality supported by the bloom filters in this library. /// pub trait BloomFilter { /// Insert data into the filter. /// /// # Intended Behavior /// A type implementing BloomFilter should implement *insert* with respect to the following points: /// * It should be possible to insert the same element multiple times. /// * It should be possible to insert any type implementing Hash. /// /// # Examples /// How *insert* of a type implementing BloomFilter might be used: /// ``` /// use bloom_filter_simple::{BloomFilter, DefaultBloomFilter}; /// /// fn bloom_filter_insert() { /// let mut bloom_filter = DefaultBloomFilter::new(5, 0.001); /// bloom_filter.insert(&"Hello!"); /// bloom_filter.insert(&5); /// bloom_filter.insert(&"Hello!"); /// /// assert_eq!(true, bloom_filter.contains(&"Hello!")); /// } /// ``` fn insert<T: Hash>(&mut self, data: &T); /// Check whether data is contained in the bloom filter. /// /// # Intended Behavior /// Checking whether data is contained in a bloom filter must never result in a false negative, /// i.e., if an element 'x' has been inserted into the filter, contains(&x) will *always* return true. /// /// In contrast, contains can result in false positive, i.e., contains(&x) can return true, even if /// x has not been inserted yet. The chance of this happending depends on the number of elements /// in the bloom filter, and the number of hash functions that are used. When initializing one /// of the filters provided in this crate, you can specify the desired false positive probability. /// /// A type implementing BloomFilter should implement *contains* with respect to the following points: /// * *contains(&x)* **must** return *true* if *x* has been inserted into the filter /// * *contains(&x)* **can** return *true* even if *x* has **not** been inserted into the filter /// * It should be possible to check any type implementing Hash. /// /// # Examples /// How contains of a type implementing BloomFilter might be used: /// ``` /// use bloom_filter_simple::{BloomFilter, DefaultBloomFilter}; /// fn bloom_filter_insert() { /// let mut bloom_filter = DefaultBloomFilter::new(5, 0.001); /// bloom_filter.insert(&"Hello!"); /// // This assert will never fail /// assert_eq!(true, bloom_filter.contains(&"Hello!")); /// // This assert can fail with a probability of p(fp) < 0.001 /// assert_eq!(false, bloom_filter.contains(&"Goodbye!")); /// } /// ``` fn contains<T: Hash>(&self, data: &T) -> bool; } /// Calculate the optimal bit count to satisfy the desired constraints. /// Formula taken from Sagi Kedmi: /// > S. Kedmi, ["Bloom Filters for the Perplexed"](https://sagi.io/bloom-filters-for-the-perplexed/), July 2017 [Accessed: 02.12.2020] fn optimal_bit_count(desired_capacity: usize, desired_false_positive_probability: f64) -> usize { (-(desired_capacity as f64 * desired_false_positive_probability.ln()) / (2.0f64.ln().powi(2))) .ceil() as usize } /// Calculate the optimal number of hashers to satisfy the desired constraints. /// Formula taken from Sagi Kedmi: /// > S. Kedmi, ["Bloom Filters for the Perplexed"](https://sagi.io/bloom-filters-for-the-perplexed/), July 2017 [Accessed: 02.12.2020] fn optimal_number_of_hashers(desired_capacity: usize, bit_count: usize) -> usize { ((bit_count as f64 / desired_capacity as f64) * 2.0f64.ln()).round() as usize } /// Approximate number of elements stored. /// Formula taken from Wikipedia: /// > Wikipedia, ["Bloom filter"](https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter) [Accessed: 02.12.2020] fn approximate_element_count( number_of_hashers: usize, bits_per_hasher: usize, number_of_ones: usize, ) -> f64 { -(bits_per_hasher as f64) * (1.0 - (number_of_ones as f64) / ((number_of_hashers * bits_per_hasher) as f64)).ln() } /// Return the current approximate false positive probability which depends on the current /// number of elements in the filter. /// Formula taken from Sagi Kedmi: /// > S. Kedmi, ["Bloom Filters for the Perplexed"](https://sagi.io/bloom-filters-for-the-perplexed/), July 2017 [Accessed: 02.12.2020] fn approximate_false_positive_probability( number_of_hashers: usize, bits_per_hasher: usize, element_count: f64, ) -> f64 { (1.0 - std::f64::consts::E.powf(-element_count / bits_per_hasher as f64)) .powf(number_of_hashers as f64) }