ofilter 0.2.0

OFilter is an opinionated Bloom filter.
Documentation
# OFilter

[OFilter](https://gitlab.com/liberecofr/ofilter) is an opinionated Bloom filter implemented in [Rust](https://www.rust-lang.org/).

It implements:

- [a basic Bloom filter]https://docs.rs/ofilter/latest/ofilter/struct.Bloom.html
- [a thread-safe Bloom filter]https://docs.rs/ofilter/latest/ofilter/struct.Stream.html
- [a streaming Bloom filter]https://docs.rs/ofilter/latest/ofilter/struct.Stream.html
- [a thread-safe streaming Bloom filter]https://docs.rs/ofilter/latest/ofilter/struct.SyncStream.html

The basic Bloom filter is inspired from the existing and stable
[bloomfilter crate](https://github.com/jedisct1/rust-bloom-filter) but does not directly depend on it.
The API is slightly different, and makes a few opinionated changes.

- it has a fancy [parameters support]https://docs.rs/ofilter/latest/ofilter/struct.Params.html so
  that you can define a filter by its number of items, false positive ratio, size in bits, number
  of hash, or any combination of them. It figures out the magic.
- it uses the [bitlen crate]https://crates.io/crates/bitlen internally and exports its internal
  bit vector in that format.
- it uses the [siphasher crate]https://crates.io/crates/siphasher as the internal
  hashing func. It is not possible to override this.
- it introduces that notion of "streaming filter" which can be useful when you do not work
  in batches and never know when items are going to stop coming in. As some point it is
  a naive implementation of an aging Bloom filter, though I prefer the term streaming.
- performance-wise, the default settings tend to optimize for less CPU usage but more
  memory footprint. This can be controlled in options but by default the filter can
  more memory than strictly required.

In practice, I am using this filter to implement a self-expiring
[KV store](https://gitlab.com/liberecofr/menhirkv/).

![OFilter icon](https://gitlab.com/liberecofr/ofilter/raw/main/ofilter.png)

# Status

For now this is a toy project, clearly *NOT* suitable for production use.

[![Build Status](https://gitlab.com/liberecofr/ofilter/badges/main/pipeline.svg)](https://gitlab.com/liberecofr/ofilter/pipelines)
[![Crates.io](https://img.shields.io/crates/v/ofilter.svg)](https://crates.io/crates/ofilter)
[![Gitlab](https://img.shields.io/gitlab/last-commit/liberecofr/ofilter)](https://gitlab.com/liberecofr/ofilter/tree/main)
[![License](https://img.shields.io/gitlab/license/liberecofr/ofilter)](https://gitlab.com/liberecofr/ofilter/blob/main/LICENSE)

https://en.wikipedia.org/wiki/Bloom_filter

# Usage

```rust
use ofilter::Bloom;

let mut filter: Bloom<usize> = Filter::new(100);
assert!(!filter.check(&42));
filter.set(&42);
assert!(filter.check(&42));
```

# A bit of theory

This [interactive Bloom filter params calulator](https://hur.st/bloomfilter/?n=300000&p=0.01&m=&k=)
proved very useful while developping this.

Also it recaps the usual formulas:

```
With:
n: number of items in the filter
p: probability of false positives, fraction between 0 and 1
m: number of bits in the filter
k: number of hash functions

We have:
n = ceil(m / (-k / log(1 - exp(log(p) / k))))
p = pow(1 - exp(-k / (m / n)), k)
m = ceil((n * log(p)) / log(1 / pow(2, log(2))));
k = round((m / n) * log(2));
```

A command line equivalent would be this practical
[bloom-filter-calculator.rb gist](https://gist.github.com/brandt/8f9ab3ceae37562a2841)

```ruby
# Optimal bloom filter size and number of hashes

# Tips:
# 1. One byte per item in the input set gives about a 2% false positive rate.
# 2. The optimal number of hash functions is ~0.7x the number of bits per item.
# 3. The number of hashes dominates performance.

# Expected number of items in the collection
# n = (m * ln(2))/k;
n = 300_000

# Acceptable false-positive rate (0.01 = 1%)
# p = e^(-(m/n) * (ln(2)^2));
fpr = 0.01

# Optimal size (number of elements in the bit array)
# m = -((n*ln(p))/(ln(2)^2));
m = (n * Math.log(fpr).abs) / (Math.log(2) ** 2)

# Optimal number of hash functions
# k = (m/n) * ln(2);
k = (m / n) * Math.log(2)

puts "Optimal bloom filter size: #{m.ceil} bits"
puts "Optimal number of hash functions: #{k.ceil}"
```

# Links

* [crate]https://crates.io/crates/ofilter on crates.io
* [doc]https://docs.rs/ofilter/ on docs.rs
* [source]https://gitlab.com/liberecofr/ofilter/tree/main on gitlab.com
* [MenhirKV]https://gitlab.com/liberecofr/menhirkv, project using this filter
* More about Bloom filters:
  * https://en.wikipedia.org/wiki/Bloom_filter
  * https://www.vldb.org/archives/workshop/2010/proceedings/files/vldb_2010_workshop/ADMS_2010/adms10-canim.pdf
  * https://arxiv.org/pdf/2001.03147.pdf
  * https://arxiv.org/pdf/2106.04364.pdf
  * http://www.vldb.org/pvldb/vol13/p2355-liu.pdf
  * https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf
  * https://freecontent.manning.com/all-about-bloom-filters/

# License

OFilter is licensed under the [MIT](https://gitlab.com/liberecofr/ofilter/blob/main/LICENSE) license.