Matcher
A high-performance, multi-functional word matcher implemented in Rust.
Features
- Supports Multiple Matching Methods:
  - Simple word matching
  - Regex-based matching
  - Similarity-based matching
- Text Normalization Options:
  - Fanjian (Simplify traditional Chinese characters to simplified ones)
  - Delete (Remove whitespace, punctuation, and non-alphanumeric characters)
  - Normalize (Normalize special characters to identifiable characters)
  - PinYin (Convert Chinese characters to Pinyin for fuzzy matching)
  - PinYinChar (Convert Chinese characters to Pinyin)
- Combination and Repeated Word Matching:
  - Handles combination and repetition of words with specified constraints.
Usage
Adding to Your Project
To use matcher_rs in your Rust project, add the following to your Cargo.toml file:
[dependencies]
matcher_rs = "*"
Explanation of the Configuration
- Matcher's configuration is defined by the MatchTableMap = HashMap<u64, Vec<MatchTable>> type. The key of MatchTableMap is called match_id. Within each match_id, every table_id should be unique, although this is not enforced.
- SimpleMatcher's configuration is defined by the SimpleMatchTableMap = HashMap<SimpleMatchType, HashMap<u64, &'a str>> type. The key of the inner HashMap<u64, &'a str> is called word_id, and word_id must be globally unique.
MatchTable
- table_id: The unique ID of the match table.
- match_table_type: The type of the match table.
- simple_match_type: The type of the simple match (only relevant if match_table_type is Simple).
- word_list: The word list of the match table.
- exemption_simple_match_type: The type of the exemption simple match.
- exemption_word_list: The exemption word list of the match table.
For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
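The exemption rule can be sketched in a few lines. This is a minimal illustration that uses plain substring search from the standard library in place of the real matching engines; the helper name table_hits is hypothetical and is not part of matcher_rs.

```rust
/// Returns true when `text` hits the table: at least one word from
/// `word_list` occurs in it and no word from `exemption_word_list` does.
/// Hypothetical helper; substring search stands in for the real matcher.
fn table_hits(text: &str, word_list: &[&str], exemption_word_list: &[&str]) -> bool {
    let word_hit = word_list.iter().any(|w| text.contains(*w));
    let exempt_hit = exemption_word_list.iter().any(|w| text.contains(*w));
    // An exemption hit overrides a word hit.
    word_hit && !exempt_hit
}
```

With a word list of ["gift"] and an exemption list of ["gift card"], the text "a free gift" hits the table, while "a free gift card" does not.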
MatchTableType
- Simple: Supports simple multiple-pattern matching with text normalization defined by simple_match_type.
  - We offer transformation methods for text normalization, including Fanjian, Normalize, PinYin, and others.
  - It can handle combination patterns and repetition-sensitive matching, with sub-words delimited by ",". For example, "hello,world,hello" will match hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
- SimilarChar: Supports similar character matching using regex.
  - ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld, hollowrd, hi🌍, and any other combination of the words split by "," in the list.
- Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
  - ["h,e,l,l,o", "你,好"] will match "hope, endures, love, lasts, onward." and "你的笑容温暖, 好心情常伴。".
- SimilarTextLevenshtein: Supports similar text matching based on Levenshtein distance (threshold is 0.8).
  - ["helloworld"] will match helloworld, hellowrld, helloworld!, and any other text similar to the words in the list.
- Regex: Supports regex matching.
  - ["h[aeiou]llo", "w[aeiou]rd"] will match hello, word, hillo, wurd, and any other text that matches a regex in the list.
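To make the 0.8 threshold of SimilarTextLevenshtein concrete, here is a minimal sketch of a normalized Levenshtein similarity. It assumes the similarity is computed as 1 - distance / (length of the longer string), counted in characters, and that whole strings are compared; matcher_rs may normalize differently. The function names are illustrative only.

```rust
/// Classic dynamic-programming Levenshtein distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = cur[j] + 1;
            cur.push(substitution.min(deletion).min(insertion));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Assumed normalization: 1 - distance / length of the longer string.
fn similarity(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / longest as f64
}

/// True when the two strings reach the documented 0.8 threshold.
fn is_similar(a: &str, b: &str) -> bool {
    similarity(a, b) >= 0.8
}
```

Under this assumption, helloworld and hellowrld differ by one deletion over ten characters, which gives a similarity of 0.9 and clears the threshold.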
SimpleMatchType
- None: No transformation.
- Fanjian: Traditional Chinese to simplified Chinese transformation.
  - 妳好 -> 你好
  - 現⾝ -> 现身
- Delete: Delete all characters that are neither alphanumeric nor Unicode Chinese characters.
  - hello, world! -> helloworld
  - 《你∷好》 -> 你好
- Normalize: Normalize all English character variations and number variations to basic characters.
  - ℋЀ⒈㈠ϕ -> he11o
  - ⒈Ƨ㊂ -> 123
- PinYin: Convert all Unicode Chinese characters to pinyin with boundaries.
  - 你好 -> ␀ni␀␀hao␀
  - 西安 -> ␀xi␀␀an␀
- PinYinChar: Convert all Unicode Chinese characters to pinyin without boundaries.
  - 你好 -> nihao
  - 西安 -> xian
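As a rough illustration of the Delete transformation, the sketch below keeps only characters that Unicode classifies as alphanumeric, which includes Chinese characters. This is an approximation built on the standard library; matcher_rs may rely on its own character table, and delete_transform is a hypothetical name.

```rust
/// Approximates the Delete transformation by keeping only characters that
/// Unicode classifies as alphanumeric (this includes Chinese characters).
/// Whitespace, punctuation, and symbols are dropped.
fn delete_transform(text: &str) -> String {
    text.chars().filter(|c| c.is_alphanumeric()).collect()
}
```

Applied to the two Delete examples above, this yields helloworld and 你好 respectively.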
You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.
Avoid combining PinYin and PinYinChar, because PinYin is a more limited version of PinYinChar. Without boundaries, a string such as xian can be treated either as two words, xi and an, or as the single word xian.
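The xian ambiguity is easy to reproduce with a toy converter. The sketch below uses a tiny hypothetical pinyin table covering only five characters, and it uses the ␀ symbol as the boundary marker exactly as it is displayed in the examples above; neither reflects the data or the internals of matcher_rs.

```rust
/// Boundary marker as it is displayed in the examples above.
const BOUNDARY: char = '␀';

/// Tiny hypothetical pinyin table covering only the characters used here.
fn pinyin_of(c: char) -> Option<&'static str> {
    match c {
        '你' => Some("ni"),
        '好' => Some("hao"),
        '西' => Some("xi"),
        '安' => Some("an"),
        '先' => Some("xian"),
        _ => None,
    }
}

/// Converts known Chinese characters to pinyin, optionally wrapping each
/// syllable in boundary markers; other characters pass through unchanged.
fn to_pinyin(text: &str, with_boundaries: bool) -> String {
    let mut out = String::new();
    for c in text.chars() {
        match pinyin_of(c) {
            Some(syllable) => {
                if with_boundaries {
                    out.push(BOUNDARY);
                }
                out.push_str(syllable);
                if with_boundaries {
                    out.push(BOUNDARY);
                }
            }
            None => out.push(c),
        }
    }
    out
}
```

With boundaries, 西安 and 先 stay distinct (␀xi␀␀an␀ versus ␀xian␀). Without boundaries, both collapse to xian, which is why the boundary-free form matches more loosely.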
Limitations
- Simple Match handles patterns with at most 32 combined words (beyond 32, matching of the additional combined words is not guaranteed) and at most 8 repetitions of a word (more than 8 repetitions are capped at 8).
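To make "combined words" and "repeated words" concrete, the following sketch checks a comma-delimited pattern against a text by counting how often each sub-word is required and how often it occurs. It is a naive standard-library illustration with a hypothetical function name; it does not model the 32 and 8 limits, and it is not how matcher_rs performs the search.

```rust
use std::collections::HashMap;

/// Checks a comma-delimited combination pattern against `text`: every
/// sub-word must occur at least as many times as it is listed in the pattern.
/// Naive sketch; occurrences are counted without overlap.
fn combination_match(pattern: &str, text: &str) -> bool {
    // Count how many times each sub-word is required.
    let mut required: HashMap<&str, usize> = HashMap::new();
    for word in pattern.split(',').filter(|w| !w.is_empty()) {
        *required.entry(word).or_insert(0) += 1;
    }
    // An empty pattern matches nothing.
    !required.is_empty()
        && required
            .iter()
            .all(|(word, count)| text.matches(*word).count() >= *count)
}
```

For the pattern "hello,world,hello", the sub-word hello is required twice, so hellohelloworld and worldhellohello match while helloworld does not.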
Basic Example
Here’s a basic example of how to use the Matcher struct for text matching:
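use std::collections::HashMap;
use matcher_rs::{MatchTable, MatchTableMap, MatchTableType, Matcher, SimpleMatchType};
// The table contents below are illustrative values.
let match_table_map: MatchTableMap = HashMap::from_iter([(1, vec![MatchTable {
    table_id: 1, match_table_type: MatchTableType::Simple,
    simple_match_type: SimpleMatchType::FanjianDeleteNormalize,
    word_list: vec!["example", "test"],
    exemption_simple_match_type: SimpleMatchType::FanjianDeleteNormalize,
    exemption_word_list: vec![],
}])]);
let matcher = Matcher::new(&match_table_map);
let text = "This is an example text.";
let results = matcher.word_match(text);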
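SimpleMatcher can be used in a similar way:
use std::collections::HashMap;
use matcher_rs::{SimpleMatchType, SimpleMatcher};
let mut simple_match_type_word_map = HashMap::default();
let mut simple_word_map = HashMap::default();
// word_id -> word; the words here are illustrative values.
simple_word_map.insert(1, "你好");
simple_word_map.insert(2, "世界");
simple_match_type_word_map.insert(SimpleMatchType::Fanjian, simple_word_map);
let matcher = SimpleMatcher::new(&simple_match_type_word_map);
let text = "你好,世界!";
let results = matcher.process(text);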
For more detailed usage examples, please refer to the test.rs file.
Benchmarks
The matcher_rs library includes benchmarks to measure the performance of the matcher. You can find the benchmarks in the bench.rs file. To run the benchmarks, use the following command:
cargo bench
Contributing
Contributions to matcher_rs are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.
License
matcher_rs is licensed under the MIT OR Apache-2.0 license.
More Information
For more details, visit the GitHub repository.