A QuotientFilter is a set-like data structure, that keeps track of elements it has seen without
the need to store them. Looking up values has a certain false positive rate, but a false
negative rate of 0%.
use pdatastructs::filters::Filter;
use pdatastructs::filters::quotientfilter::QuotientFilter;
let bits_quotient = 16;
let bits_remainder = 5;
let mut filter = QuotientFilter::with_params(bits_quotient, bits_remainder);
filter.insert(&"my super long string").unwrap();
assert!(filter.query(&"my super long string"));
assert!(!filter.query(&"another super long string"));
- when a lot of data should be added to the set and a moderate false positive rate is
acceptable, was used for spell checking
- as a pre-filter for more expensive lookups, e.g. in combination with a real set, map or
database, so the final false positive rate is 0%
There are 2^bits_quotient slots, initial empty. For every slot, we store bits_remainder as
fingerprint information, a is_continuation bit, a is_occupied bit and a is_shifted bit.
All bits are initially set to false.
bits_quotient = 3
bits_remainder = 4
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| position || 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| is_occupied || | | | | | | | |
| is_continuation || | | | | | | | |
| is_shifted || | | | | | | | |
| remainder || 0x0 | 0x0 | 0x0 | 0x0 | 0x0 | 0x0 | 0x0 | 0x0 |
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
On insertion, elements are hashed to 64 bits. From these, bits_quotient are used as a
quotient and bits_remainder are used as remainder, the remaining bits are dropped.
The quotient represents the canonical position in which the remainder should be inserted. If is
is free, we use that position, set the is_occupied bit and are done.
x = "foo"
h(x) = 0x0123456789abcda5
h(x) & 0x7f = 0x25
remainder = 0x5
quotient = 0x2
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| position || 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| is_occupied || | | X | | | | | |
| is_continuation || | | | | | | | |
| is_shifted || | | | | | | | |
| remainder || 0x0 | 0x0 | 0x2 | 0x0 | 0x0 | 0x0 | 0x0 | 0x0 |
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
If not, linear probing is applied. If an element with the same quotient is already in the
filter, the so called "run" of it will be extended. For extensions, the is_continuation bit
is set as well as the is_shifted bit because the stored remainder is not in its canonical
position:
x = "bar"
h(x) = 0xad8caa00248af32e
h(x) & 0x7f = 0x2e
remainder = 0xe
quotient = 0x2
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| position || 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| is_occupied || | | X | | | | | |
| is_continuation || | | | X | | | | |
| is_shifted || | | | X | | | | |
| remainder || 0x0 | 0x0 | 0x2 | 0xe | 0x0 | 0x0 | 0x0 | 0x0 |
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| run || [=========] |
+-----------------++-----------------------------------------------|
While doing so, the order of remainders within the run is preserved:
x = "elephant"
h(x) = 0x34235511eeadbc26
h(x) & 0x7f = 0x26
remainder = 0x6
quotient = 0x2
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| position || 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| is_occupied || | | X | | | | | |
| is_continuation || | | | X | X | | | |
| is_shifted || | | | X | X | | | |
| remainder || 0x0 | 0x0 | 0x2 | 0x6 | 0xe | 0x0 | 0x0 | 0x0 |
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| run || [===============] |
+-----------------++-----------------------------------------------|
If a new quotient is inserted but the corresponding run cannot start at the canonical position,
the entire run will be shifted. A sequence of runs is also called "cluster". Even though the
run is shifted, the original position will still be marked as occupied:
x = "banana"
h(x) = 0xdfdfdfdfdfdfdf31
h(x) & 0x7f = 0x31
remainder = 0x1
quotient = 0x3
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| position || 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| is_occupied || | | X | X | | | | |
| is_continuation || | | | X | X | | | |
| is_shifted || | | | X | X | X | | |
| remainder || 0x0 | 0x0 | 0x2 | 0x6 | 0xe | 0x1 | 0x0 | 0x0 |
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| run || [===============] [===] |
| cluster || [=====================] |
+-----------------++-----------------------------------------------|
Remainders may duplicate over multiple runs:
x = "apple"
h(x) = 0x0000000000000072
h(x) & 0x7f = 0x72
remainder = 0x2
quotient = 0x7
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| position || 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| is_occupied || | | X | X | | | | X |
| is_continuation || | | | X | X | | | |
| is_shifted || | | | X | X | X | | |
| remainder || 0x0 | 0x0 | 0x2 | 0x6 | 0xe | 0x1 | 0x0 | 0x2 |
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| run || [===============] [===] [===]|
| cluster || [=====================] [===]|
+-----------------++-----------------------------------------------|
The entire array works like a ring-buffer and operations can over- and underflow:
x = "last"
h(x) = 0x11355343431323f3
h(x) & 0x7f = 0x73
remainder = 0x3
quotient = 0x7
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| position || 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| is_occupied || | | X | X | | | | X |
| is_continuation || X | | | X | X | | | |
| is_shifted || X | | | X | X | X | | |
| remainder || 0x3 | 0x0 | 0x2 | 0x6 | 0xe | 0x1 | 0x0 | 0x2 |
+-----------------++-----+-----+-----+-----+-----+-----+-----+-----+
| run ||====] [===============] [===] [====|
| cluster ||====] [=====================] [====|
+-----------------++-----------------------------------------------|
The lookup basically follows the insertion procedure.
std::collections::HashSet: has a false positive rate of 0%, but also needs to store all
elements