Trait fuzzy_rocks::TableConfig
source · pub trait TableConfig {
type KeyCharT: 'static + Copy + Eq + Hash + Serialize + DeserializeOwned;
type DistanceT: 'static + Copy + Zero + PartialOrd + PartialEq + From<u8>;
type ValueT: 'static + Serialize + DeserializeOwned;
type CoderT: 'static + Coder + Send + Sync;
const UTF8_KEYS: bool = true;
const MAX_DELETES: usize = 2usize;
const MEANINGFUL_KEY_LEN: usize = 12usize;
const GROUP_VARIANT_OVERLAP_THRESHOLD: usize = 5usize;
const DISTANCE_FUNCTION: DistanceFunction<Self::KeyCharT, Self::DistanceT> = Self::levenstein_distance;
// Provided method
fn levenstein_distance(
key_a: &[Self::KeyCharT],
key_b: &[Self::KeyCharT]
) -> Self::DistanceT { ... }
}
Expand description
The TableConfig structure specifies all of the parameters for configuring a FuzzyRocks Table
An example creating a Table using a custom TableConfig
use fuzzy_rocks::{*};
struct Config();
impl TableConfig for Config {
type KeyCharT = char;
type DistanceT = u8;
type ValueT = String;
type CoderT = DefaultCoder;
const UTF8_KEYS : bool = true;
const MAX_DELETES : usize = 2;
const MEANINGFUL_KEY_LEN : usize = 12;
const GROUP_VARIANT_OVERLAP_THRESHOLD : usize = 5;
const DISTANCE_FUNCTION : DistanceFunction<Self::KeyCharT, Self::DistanceT> = Self::levenstein_distance;
}
let mut table = Table::<Config, true>::new("config_example.rocks", Config()).unwrap();
Required Associated Types§
sourcetype KeyCharT: 'static + Copy + Eq + Hash + Serialize + DeserializeOwned
type KeyCharT: 'static + Copy + Eq + Hash + Serialize + DeserializeOwned
A generic type that specifies the unit of deletion for the SymSpell algorithm.
Valid Key types for a Table must be losslessly convertible to and from Vec<KeyCharT>
.
In a typical implementation, KeyCharT
is a char for unicode keys or a u8 for simple ASCII keys,
although it could be a data type of another size. KeyCharT
must implement the Copy trait,
so that keys will be contiguous in memory and may not include any references.
sourcetype DistanceT: 'static + Copy + Zero + PartialOrd + PartialEq + From<u8>
type DistanceT: 'static + Copy + Zero + PartialOrd + PartialEq + From<u8>
A generic type that represents a scalar distance in the Metric Space that contains
all keys in the Table. A DistanceT
is the return type of the DISTANCE_FUNCTION.
DistanceT
may be any scalar type and is not necessarily required to be an integer. It is often desireable to
use fractional types to express more precision than integers, however, the use of floating point types for
DistanceT
is discouraged on account of rounding issues causing problems for equality comparison. Therefore
a fixed-point alternative type is superior.
Confusingly, the MAX_DELETES const configuration parameter is an integer. Conceptually, the SymSpell delete-distance pruning is a filter that reduces the number of candidate keys that must be tested using the DISTANCE_FUNCTION. Records that can’t be reached by deleting MAX_DELETES characters from both the record key and the lookup key will never be evaluated by the distance function and are conceptually “too far away” from the lookup key.
sourcetype ValueT: 'static + Serialize + DeserializeOwned
type ValueT: 'static + Serialize + DeserializeOwned
A generic type that represents a payload value associated with a record. ValueT
must
be able to be serialized and deserialized from the database but otherwise is not constrained.
sourcetype CoderT: 'static + Coder + Send + Sync
type CoderT: 'static + Coder + Send + Sync
The [Coder] interface to use to encode and decode the objects in the database. Currently this should be either BincodeCoder or MsgPackCoder. Alternatively, use DefaultCoder for the coder selected by the crate features
Provided Associated Constants§
sourceconst UTF8_KEYS: bool = true
const UTF8_KEYS: bool = true
A const bool
that specifies whether the keys are UTF-8 encoded Unicode strings or not.
If UTF8_KEYS = true
, common key characters will be encoded to consume only 8 bits, (as opposed to 24 bit
unicode chars, which are often padded to 32 bits), thus improving database size and
performance, but there is additional runtime overhead to encode and decode this format. UTF8_KEYS = true
is only allowed if KeyCharT = char
.
If UTF8_KEYS = false
, all keys are stored as vectors of KeyCharT, meaning there is no cost to encoding
and decoding them, but the database performance may suffer.
In general, if your keys are unicode strings, UTF8_KEYS = true
is advised, and if they aren’t, then
UTF8_KEYS = false
is probably required.
sourceconst MAX_DELETES: usize = 2usize
const MAX_DELETES: usize = 2usize
The number of deletes to store in the database for variants created
by the SymSpell optimization. If MAX_DELETES
is too small, the variant will not be found
and therefore the DISTANCE_FUNCTION will not have an opportunity to evaluate the match. However,
if MAX_DELETES
is too large, it will hurt performance by finding and evaluating too many
candidate keys.
Empirically, values near 2 seem to be good in most situations I have found. I.e. 1 and 3 might be appropriate sometimes. 4 ends up exploding in most cases I’ve seen so the SymSpell logic may not be a good fit if you need to find keys 4 edits away. 0 edits is an exact match.
sourceconst MEANINGFUL_KEY_LEN: usize = 12usize
const MEANINGFUL_KEY_LEN: usize = 12usize
MEANINGFUL_KEY_LEN
controls an optimization where only a subset of the key is used for creating
variants. For example, if MEANINGFUL_KEY_LEN = 10
then only the first 10 characters of the key will be used
to generate and search for variants.
This optimization is predicated on the idea that long key strings will not be very similar to each other, and a certain number of characters is sufficient to substantially narrow down the search.
For example the key incomprehensibilities will cause variants to be generated for incomprehe
with a MEANINGFUL_KEY_LEN
of 10, meaning that a search for incomprehension would find incomprehensibilities
and evauate it with the DISTANCE_FUNCTION even though it is further than MAX_DELETES.
In a dataset where many keys share a common prefix, or where keys are organized into a namespace by concatenating strings, this optimization will cause problems and you should either pass a high number to effectively disable it, or rework the code to use different logic to select a substring
The distance function will always be invoked with the entire key, regardless of the value of
MEANINGFUL_KEY_LEN
.
If this value is set too high, the number of variants in the database will increase. If this value is set too low, the SymSpell filtering will be less effective and the distance function will be invoked unnecessarily, hurting performance. However, the value of this field will not affect the correctness of the results.
sourceconst GROUP_VARIANT_OVERLAP_THRESHOLD: usize = 5usize
const GROUP_VARIANT_OVERLAP_THRESHOLD: usize = 5usize
The number of variants a given key must share with the other keys in an existing key group, in order for the key to be added to the key group rather than being placed into a new separate key group.
Setting this number to 0 puts every key for a record together in a group, while a large number results in each key being assigned to its own group.
Creating excess groups when for keys that will likely be fetched together anyway leads to unnecessay
database fetches and bloats the variant entries with additional KeyGroupID
s. On the other hand,
combining distant keys into together in a key_group leads to unnecessary invocations of the
DistanceFunction.
NOTE: We arrived at the default value (5) empirically by testing a number of different values and
observing the effects on lookup speed, DB construction speed, and DB size. The observed data
points are checked in, in the file: misc/perf_data.txt
sourceconst DISTANCE_FUNCTION: DistanceFunction<Self::KeyCharT, Self::DistanceT> = Self::levenstein_distance
const DISTANCE_FUNCTION: DistanceFunction<Self::KeyCharT, Self::DistanceT> = Self::levenstein_distance
The DISTANCE_FUNCTION
is a DistanceFunction associated with a Table and defines
the Metric Space that contains all Keys in the Table.
Provided Methods§
sourcefn levenstein_distance(
key_a: &[Self::KeyCharT],
key_b: &[Self::KeyCharT]
) -> Self::DistanceT
fn levenstein_distance( key_a: &[Self::KeyCharT], key_b: &[Self::KeyCharT] ) -> Self::DistanceT
An implementation of the basic Levenstein Distance function, which is used by the DefaultTableConfig, and may be used anywhere a distance function is required.
This implementation uses the Wagner-Fischer Algorithm, as it’s described here