Type Alias fuzzy_rocks::DistanceFunction

source ·
pub type DistanceFunction<KeyCharT, DistanceT> = fn(key_a: &[KeyCharT], key_b: &[KeyCharT]) -> DistanceT;
Expand description

A type for a function to compute the distance between two keys. Used in a TableConfig

A DistanceFunction can be any function that returns a scalar distance when given two keys. The smaller the distance, the closer the match. Two identical keys must have a distance of zero. The fuzzy methods in this crate, such as lookup_fuzzy, invoke the distance function to determine if two keys adequately match.

This crate includes a simple Levenstein Distance function called levenstein_distance. However, you may often want to use a different function.

One reason to use a custom distance function is to account for expected error patterns. For example: a distance function that considers likely OCR errors might consider ‘lo’ to be very close to ‘b’, ‘0’ to be extremely close to ‘O’, and ‘A’ to be somewhat near to ‘^’, while ‘!’ would be much further from ‘@’ even though the Levenstein distances tell a different story with ‘lo’ being two edits away from ‘b’ and ‘!’ being only one edit away from ‘@’.

In another situation, you may want to use a custom distance function that is aware of key positions on a QWERTY keyboard, and thus able to identify likely typos. In such a distance function, ‘!’ and ‘@’ are now very close because they are adjacent keys.

In another example, a distance function may be used to identify words that are similar in pronunciation, like the Soundex algorithm, or you may have any number of other application-specific requirements.

Another reason for a custom distance function is if your keys are not human-readable strings, in which case you may need a different interpretation of variances between keys. For example DNA snippets could be used as keys to search for genetic mutations.

Distance functions must return a DistanceT.

Any distance function you choose must be compatible with SymSpell’s delete-distance optimization. In other words, you must be able to delete no more than MAX_DELETES characters from both a given record’s key and the lookup key and arrive at identical key-variants. If your distance function is incompatible with this property then the SymSpell optimization won’t work for you and you should use a different fuzzy lookup technique and a different crate.

Here is more information on the SymSpell algorithm.

Once the distance function has been evaluated, its return value is considered the authoritative distance between the two keys, and the delete distance is irrelevant from that point onwards.