pub trait TableConfig {
    type KeyCharT: 'static + Copy + Eq + Hash + Serialize + DeserializeOwned;
    type DistanceT: 'static + Copy + Zero + PartialOrd + PartialEq + From<u8>;
    type ValueT: 'static + Serialize + DeserializeOwned;
    type CoderT: 'static + Coder + Send + Sync;

    const UTF8_KEYS: bool = true;
    const MAX_DELETES: usize = 2usize;
    const MEANINGFUL_KEY_LEN: usize = 12usize;
    const GROUP_VARIANT_OVERLAP_THRESHOLD: usize = 5usize;
    const DISTANCE_FUNCTION: DistanceFunction<Self::KeyCharT, Self::DistanceT> = Self::levenstein_distance;

    // Provided method
    fn levenstein_distance(
        key_a: &[Self::KeyCharT],
        key_b: &[Self::KeyCharT]
    ) -> Self::DistanceT { ... }
}
Expand description

The TableConfig structure specifies all of the parameters for configuring a FuzzyRocks Table

An example creating a Table using a custom TableConfig

use fuzzy_rocks::{*};

struct Config();
impl TableConfig for Config {
    type KeyCharT = char;
    type DistanceT = u8;
    type ValueT = String;
    type CoderT = DefaultCoder;
    const UTF8_KEYS : bool = true;
    const MAX_DELETES : usize = 2;
    const MEANINGFUL_KEY_LEN : usize = 12;
    const GROUP_VARIANT_OVERLAP_THRESHOLD : usize = 5;
    const DISTANCE_FUNCTION : DistanceFunction<Self::KeyCharT, Self::DistanceT> = Self::levenstein_distance;
}
let mut table = Table::<Config, true>::new("config_example.rocks", Config()).unwrap();

Required Associated Types§

source

type KeyCharT: 'static + Copy + Eq + Hash + Serialize + DeserializeOwned

A generic type that specifies the unit of deletion for the SymSpell algorithm.
Valid Key types for a Table must be losslessly convertible to and from Vec<KeyCharT>.

In a typical implementation, KeyCharT is a char for unicode keys or a u8 for simple ASCII keys, although it could be a data type of another size. KeyCharT must implement the Copy trait, so that keys will be contiguous in memory and may not include any references.

source

type DistanceT: 'static + Copy + Zero + PartialOrd + PartialEq + From<u8>

A generic type that represents a scalar distance in the Metric Space that contains all keys in the Table. A DistanceT is the return type of the DISTANCE_FUNCTION.

DistanceT may be any scalar type and is not necessarily required to be an integer. It is often desireable to use fractional types to express more precision than integers, however, the use of floating point types for DistanceT is discouraged on account of rounding issues causing problems for equality comparison. Therefore a fixed-point alternative type is superior.

Confusingly, the MAX_DELETES const configuration parameter is an integer. Conceptually, the SymSpell delete-distance pruning is a filter that reduces the number of candidate keys that must be tested using the DISTANCE_FUNCTION. Records that can’t be reached by deleting MAX_DELETES characters from both the record key and the lookup key will never be evaluated by the distance function and are conceptually “too far away” from the lookup key.

source

type ValueT: 'static + Serialize + DeserializeOwned

A generic type that represents a payload value associated with a record. ValueT must be able to be serialized and deserialized from the database but otherwise is not constrained.

source

type CoderT: 'static + Coder + Send + Sync

The [Coder] interface to use to encode and decode the objects in the database. Currently this should be either BincodeCoder or MsgPackCoder. Alternatively, use DefaultCoder for the coder selected by the crate features

Provided Associated Constants§

source

const UTF8_KEYS: bool = true

A const bool that specifies whether the keys are UTF-8 encoded Unicode strings or not.

If UTF8_KEYS = true, common key characters will be encoded to consume only 8 bits, (as opposed to 24 bit unicode chars, which are often padded to 32 bits), thus improving database size and performance, but there is additional runtime overhead to encode and decode this format. UTF8_KEYS = true is only allowed if KeyCharT = char.

If UTF8_KEYS = false, all keys are stored as vectors of KeyCharT, meaning there is no cost to encoding and decoding them, but the database performance may suffer.

In general, if your keys are unicode strings, UTF8_KEYS = true is advised, and if they aren’t, then UTF8_KEYS = false is probably required.

source

const MAX_DELETES: usize = 2usize

The number of deletes to store in the database for variants created by the SymSpell optimization. If MAX_DELETES is too small, the variant will not be found and therefore the DISTANCE_FUNCTION will not have an opportunity to evaluate the match. However, if MAX_DELETES is too large, it will hurt performance by finding and evaluating too many candidate keys.

Empirically, values near 2 seem to be good in most situations I have found. I.e. 1 and 3 might be appropriate sometimes. 4 ends up exploding in most cases I’ve seen so the SymSpell logic may not be a good fit if you need to find keys 4 edits away. 0 edits is an exact match.

source

const MEANINGFUL_KEY_LEN: usize = 12usize

MEANINGFUL_KEY_LEN controls an optimization where only a subset of the key is used for creating variants. For example, if MEANINGFUL_KEY_LEN = 10 then only the first 10 characters of the key will be used to generate and search for variants.

This optimization is predicated on the idea that long key strings will not be very similar to each other, and a certain number of characters is sufficient to substantially narrow down the search.

For example the key incomprehensibilities will cause variants to be generated for incomprehe with a MEANINGFUL_KEY_LEN of 10, meaning that a search for incomprehension would find incomprehensibilities and evauate it with the DISTANCE_FUNCTION even though it is further than MAX_DELETES.

In a dataset where many keys share a common prefix, or where keys are organized into a namespace by concatenating strings, this optimization will cause problems and you should either pass a high number to effectively disable it, or rework the code to use different logic to select a substring

The distance function will always be invoked with the entire key, regardless of the value of MEANINGFUL_KEY_LEN.

If this value is set too high, the number of variants in the database will increase. If this value is set too low, the SymSpell filtering will be less effective and the distance function will be invoked unnecessarily, hurting performance. However, the value of this field will not affect the correctness of the results.

source

const GROUP_VARIANT_OVERLAP_THRESHOLD: usize = 5usize

The number of variants a given key must share with the other keys in an existing key group, in order for the key to be added to the key group rather than being placed into a new separate key group.

Setting this number to 0 puts every key for a record together in a group, while a large number results in each key being assigned to its own group.

Creating excess groups when for keys that will likely be fetched together anyway leads to unnecessay database fetches and bloats the variant entries with additional KeyGroupIDs. On the other hand, combining distant keys into together in a key_group leads to unnecessary invocations of the DistanceFunction.

NOTE: We arrived at the default value (5) empirically by testing a number of different values and observing the effects on lookup speed, DB construction speed, and DB size. The observed data points are checked in, in the file: misc/perf_data.txt

source

const DISTANCE_FUNCTION: DistanceFunction<Self::KeyCharT, Self::DistanceT> = Self::levenstein_distance

The DISTANCE_FUNCTION is a DistanceFunction associated with a Table and defines the Metric Space that contains all Keys in the Table.

Provided Methods§

source

fn levenstein_distance( key_a: &[Self::KeyCharT], key_b: &[Self::KeyCharT] ) -> Self::DistanceT

An implementation of the basic Levenstein Distance function, which is used by the DefaultTableConfig, and may be used anywhere a distance function is required.

This implementation uses the Wagner-Fischer Algorithm, as it’s described here

Object Safety§

This trait is not object safe.

Implementors§

source§

impl TableConfig for DefaultTableConfig

§

type KeyCharT = char

§

type DistanceT = u8

§

type ValueT = String

§

type CoderT = BincodeCoder

source§

const UTF8_KEYS: bool = true

source§

const MAX_DELETES: usize = 2usize

source§

const MEANINGFUL_KEY_LEN: usize = 12usize

source§

const GROUP_VARIANT_OVERLAP_THRESHOLD: usize = 5usize

source§

const DISTANCE_FUNCTION: DistanceFunction<Self::KeyCharT, Self::DistanceT> = {<table_config::DefaultTableConfig as table_config::TableConfig>::levenstein_distance as for<'a, 'b> fn(&'a [<table_config::DefaultTableConfig as table_config::TableConfig>::KeyCharT], &'b [<table_config::DefaultTableConfig as table_config::TableConfig>::KeyCharT]) -> <table_config::DefaultTableConfig as table_config::TableConfig>::DistanceT}