probabilistic_collections/
lib.rs

1//! # probabilistic-collections-rs
2//!
3//! [![probabilistic-collections](http://meritbadge.herokuapp.com/probabilistic-collections)](https://crates.io/crates/probabilistic-collections)
4//! [![Documentation](https://docs.rs/probabilistic-collections/badge.svg)](https://docs.rs/probabilistic-collections)
5//! [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
6//! [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
7//! [![Pipeline Status](https://gitlab.com/jeffrey-xiao/probabilistic-collections-rs/badges/master/pipeline.svg)](https://gitlab.com/jeffrey-xiao/probabilistic-collections-rs/-/commits/master)
8//! [![Coverage Report](https://gitlab.com/jeffrey-xiao/probabilistic-collections-rs/badges/master/coverage.svg)](https://gitlab.com/jeffrey-xiao/probabilistic-collections-rs/-/commits/master)
9//!
10//! `probabilistic-collections` contains various implementations of collections that use
11//! approximations to improve on running time or memory, but introduce a certain amount of error.
12//! The error can be controlled under a certain threshold which makes these data structures
13//! extremely useful for big data and streaming applications.
14//!
15//! The following types of collections are implemented:
16//!
17//! - Approximate Membership in Set: `BloomFilter`, `PartitionedBloomFilter`, `CuckooFilter`,
18//!   `QuotientFilter`
19//! - Scalable Approximate Membership in Set: `ScalableBloomFilter`, `ScalableCuckooFilter`
20//! - Approximate Membership in Stream: `BSBloomFilter`, `BSSDBloomFilter`, `RLBSBloomFilter`
21//! - Approximate Item Count: `CountMinSketch`
22//! - Approximate Distinct Item Count: `HyperLogLog`
23//! - Set similarity: `MinHash`, `SimHash`
24//!
25//! ## Usage
26//!
27//! Add this to your `Cargo.toml`:
28//!
29//! ```toml
30//! [dependencies]
31//! probabilistic-collections = "*"
32//! ```
33//!
34//! For [`serde`](https://github.com/serde-rs/serde) support, include the `serde` feature:
35//!
36//! ```toml
37//! [dependencies]
38//! probabilistic-collections = { version = "*", features = ["serde"] }
39//! ```
40//!
41//! Add this to your crate root if you are using Rust 2015:
42//!
43//! ```rust
44//! extern crate probabilistic_collections;
45//! ```
46//!
47//! ## Changelog
48//!
49//! See [CHANGELOG](CHANGELOG.md) for more details.
50//!
51//! ## References
52//!
53//! - [Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams](https://arxiv.org/abs/1212.3964)
54//!   > Bera, Suman K., Sourav Dutta, Ankur Narang, and Souvik Bhattacherjee. 2012. "Advanced Bloom Filter Based Algorithms for Efficient Approximate Data de-Duplication in Streams." _CoRR_ abs/1212.3964. <http://arxiv.org/abs/1212.3964>.
55//! - [An improved data stream summary: the count-min sketch and its applications](https://dl.acm.org/citation.cfm?id=1073718)
56//!   > Cormode, Graham, and S. Muthukrishnan. 2005. "An Improved Data Stream Summary: The Count-Min Sketch and Its Applications." _J. Algorithms_ 55 (1). Duluth, MN, USA: Academic Press, Inc.: 58--75. <https://doi.org/10.1016/j.jalgor.2003.12.001>.
57//! - [Cuckoo Filter: Practically Better Than Bloom](https://dl.acm.org/citation.cfm?id=2674994)
58//!   > Fan, Bin, Dave G. Andersen, Michael Kaminsky, and Michael D. Mitzenmacher. 2014. "Cuckoo Filter: Practically Better Than Bloom." In _Proceedings of the 10th Acm International on Conference on Emerging Networking Experiments and Technologies_, 75--88. CoNEXT '14. New York, NY, USA: ACM. <https://doi.org/10.1145/2674005.2674994>.
59//! - [Don't thrash: how to cache your hash on flash](https://dl.acm.org/citation.cfm?id=2350275)
60//!   > Bender, Michael A., Martin Farach-Colton, Rob Johnson, Russell Kraner, Bradley C. Kuszmaul, Dzejla Medjedovic, Pablo Montes, Pradeep Shetty, Richard P. Spillane, and Erez Zadok. 2012. "Don'T Thrash: How to Cache Your Hash on Flash." _Proc. VLDB Endow._ 5 (11). VLDB Endowment: 1627--37. <https://doi.org/10.14778/2350229.2350275>.
61//! - [HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm](https://dl.acm.org/citation.cfm?id=2452456)
62//!   > Heule, Stefan, Marc Nunkesser, and Alexander Hall. 2013. "HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm." In _Proceedings of the 16th International Conference on Extending Database Technology_, 683--92. EDBT '13. New York, NY, USA: ACM. <https://doi.org/10.1145/2452376.2452456>.
63//! - [HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)
64//!   > Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. "Hyperloglog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm." In _IN Aofa '07: PROCEEDINGS of the 2007 International Conference on Analysis of Algorithms_.
65//! - [Less hashing, same performance: Building a better Bloom filter](https://dl.acm.org/citation.cfm?id=1400125)
66//!   > Kirsch, Adam, and Michael Mitzenmacher. 2008. "Less Hashing, Same Performance: Building a Better Bloom Filter." _Random Struct. Algorithms_ 33 (2). New York, NY, USA: John Wiley & Sons, Inc.: 187--218. <https://doi.org/10.1002/rsa.v33:2>.
67//! - [Min-wise independent permutations (extended abstract)](https://dl.acm.org/citation.cfm?id=276781)
68//!   > Broder, Andrei Z., Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. 1998. "Min-Wise Independent Permutations (Extended Abstract)." In _Proceedings of the Thirtieth Annual Acm Symposium on Theory of Computing_, 327--36. STOC '98. New York, NY, USA: ACM. <https://doi.org/10.1145/276698.276781>.
69//! - [Probabilistic near-duplicate detection using simhash](https://dl.acm.org/citation.cfm?id=2063737)
70//!   > Sood, Sadhan, and Dmitri Loguinov. 2011. "Probabilistic Near-Duplicate Detection Using Simhash." In _Proceedings of the 20th Acm International Conference on Information and Knowledge Management_, 1117--26. CIKM '11. New York, NY, USA: ACM. <https://doi.org/10.1145/2063576.2063737>.
71//! - [Scalable Bloom Filters](https://dl.acm.org/citation.cfm?id=1224501)
72//!   > Almeida, Paulo Sérgio, Carlos Baquero, Nuno Preguiça, and David Hutchison. 2007. "Scalable Bloom Filters." _Inf. Process. Lett._ 101 (6). Amsterdam, The Netherlands, The Netherlands: Elsevier North-Holland, Inc.: 255--61. <https://doi.org/10.1016/j.ipl.2006.10.007>.
73//!
74//! ## License
75//!
76//! `probabilistic-collections-rs` is dual-licensed under the terms of either the MIT License or the
77//! Apache License (Version 2.0).
78//!
79//! See [LICENSE-APACHE](LICENSE-APACHE) and [LICENSE-MIT](LICENSE-MIT) for more details.
80
81#![warn(missing_docs)]
82
83pub mod bit_array_vec;
84pub mod bit_vec;
85pub mod bloom;
86pub mod count_min_sketch;
87pub mod cuckoo;
88pub mod hyperloglog;
89pub mod quotient;
90pub mod similarity;
91mod util;
92
93pub use self::util::SipHasherBuilder;
94use self::util::{DoubleHasher, HashIter};