Matcher
A high-performance, multi-functional word matcher implemented in Rust.
It is designed to solve AND OR NOT and TEXT VARIATIONS problems in word and word-list matching. For implementation details, see the Design Document.
Features
- Multiple Matching Methods:
  - Simple Word Matching
  - Regex-Based Matching
  - Similarity-Based Matching
- Text Normalization:
  - Fanjian: Simplify traditional Chinese characters to simplified ones.
    Example: 蟲艸 -> 虫艹
  - Delete: Remove specific characters.
    Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
  - Normalize: Normalize special characters to identifiable characters.
    Example: 𝜢𝕰𝕃𝙻Ϙ 𝙒ⓞƦℒ𝒟! -> hello world
  - PinYin: Convert Chinese characters to Pinyin for fuzzy matching.
    Example: 西安 -> /xi//an/, matches 洗按 -> /xi//an/, but not 先 -> /xian/
  - PinYinChar: Convert Chinese characters to Pinyin.
    Example: 西安 -> xian, matches 洗按 and 先 -> xian
- AND OR NOT Word Matching:
  - Takes into account the number of repetitions of words.
  - Example: hello&world matches hello world and world,hello
  - Example: 无&法&无&天 matches 无无法天 (because 无 is repeated twice), but not 无法天
  - Example: hello~helloo~hhello matches hello but not helloo or hhello
- Customizable Exemption Lists: Exclude specific words from matching.
- Efficient Handling of Large Word Lists: Optimized for performance.
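The AND OR NOT rules in the feature list can be sketched with plain substring counting. This is a minimal illustration and not how matcher_rs implements it: matches_pattern is a hypothetical helper, and it assumes that & joins required words, that every word after a ~ is excluded, and that occurrences are counted without overlap.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of matcher_rs): returns true when `text`
/// satisfies `pattern`. Words joined by `&` must each appear at least as many
/// times as they are listed; any word after a `~` must not appear at all.
fn matches_pattern(pattern: &str, text: &str) -> bool {
    // Everything before the first `~` is the AND part; the rest are NOT words.
    let mut parts = pattern.split('~');
    let and_part = parts.next().unwrap_or("");

    // Count how many times each AND word is required.
    let mut required: HashMap<&str, usize> = HashMap::new();
    for word in and_part.split('&').filter(|w| !w.is_empty()) {
        *required.entry(word).or_insert(0) += 1;
    }

    // `str::matches` yields non-overlapping occurrences.
    let all_present = required
        .iter()
        .all(|(word, &count)| text.matches(*word).count() >= count);
    let none_excluded = parts
        .filter(|w| !w.is_empty())
        .all(|word| !text.contains(word));

    all_present && none_excluded
}
```

Under these assumptions, matches_pattern("无&法&无&天", "无无法天") holds because 无 occurs twice, while the same pattern fails on 无法天.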
Usage
Adding to Your Project
To use matcher_rs in your Rust project, run the following command:
cargo add matcher_rs
Explanation of the configuration
- Matcher's configuration is defined by the MatchTableMap = HashMap<u32, Vec<MatchTable>> type. The key of MatchTableMap is called match_id. Within each match_id, every table_id should be unique, although this is not required.
- SimpleMatcher's configuration is defined by the SimpleMatchTableMap = HashMap<SimpleMatchType, HashMap<u32, &'a str>> type. The key of the inner HashMap<u32, &'a str> is called word_id, and word_id must be globally unique.
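Because word_id has to be unique across every SimpleMatchType, it can be worth checking a map before handing it to the matcher. The sketch below uses only the standard library; find_duplicate_word_id is a hypothetical helper, and the outer key is a plain &str standing in for SimpleMatchType so that the example compiles without the crate.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical helper (not part of matcher_rs): returns a word_id that is
/// used more than once across all inner maps, or None if every id is unique.
/// The outer `&str` key is a stand-in for SimpleMatchType.
fn find_duplicate_word_id(map: &HashMap<&str, HashMap<u32, &str>>) -> Option<u32> {
    let mut seen = HashSet::new();
    map.values()
        // Walk every word_id of every inner map.
        .flat_map(|words| words.keys().copied())
        // `insert` returns false when the id was already present.
        .find(|id| !seen.insert(*id))
}
```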
MatchTable
- table_id: The unique ID of the match table.
- match_table_type: The type of the match table.
- word_list: The word list of the match table.
- exemption_simple_match_type: The type of the exemption simple match.
- exemption_word_list: The exemption word list of the match table.
For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
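The exemption rule amounts to "a word hit, and no exemption hit". A minimal sketch, assuming plain substring matching in place of the library's real matchers; TableSketch and table_matches are hypothetical names used only for illustration.

```rust
/// Hypothetical stand-in for a match table, reduced to its two word lists.
struct TableSketch<'a> {
    word_list: Vec<&'a str>,
    exemption_word_list: Vec<&'a str>,
}

/// A table matches when some word hits and no exemption word hits.
fn table_matches(table: &TableSketch, text: &str) -> bool {
    let word_hit = table.word_list.iter().any(|w| text.contains(*w));
    let exempted = table.exemption_word_list.iter().any(|w| text.contains(*w));
    word_hit && !exempted
}
```

With word_list = ["apple"] and exemption_word_list = ["apple pie"], the text "I ate an apple" matches, but "I ate an apple pie" does not.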
MatchTableType
- Simple: Supports simple multiple-pattern matching with text normalization defined by simple_match_type.
  - We offer transformation methods for text normalization, including Fanjian, Normalize, PinYin, and others.
  - It can handle combination patterns and repetition-sensitive matching, delimited by & and ~. For example, hello&world&hello will match hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
- Regex: Supports regex pattern matching.
  - SimilarChar: Supports similar character matching using regex.
    - ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld, hollowrd, hi🌍, and any other combination of the words split by , in the list.
  - Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
    - ["h,e,l,l,o", "你,好"] will match hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。.
  - Regex: Supports regex matching.
    - ["h[aeiou]llo", "w[aeiou]rd"] will match hello, world, hillo, wurld, and any other text that matches a regex in the list.
- Similar: Supports similar text matching based on distance and threshold.
  - Levenshtein: Supports similar text matching based on Levenshtein distance.
  - DamerauLevenshtein: Supports similar text matching based on Damerau-Levenshtein distance.
  - Indel: Supports similar text matching based on Indel distance.
  - Jaro: Supports similar text matching based on Jaro distance.
  - JaroWinkler: Supports similar text matching based on Jaro-Winkler distance.
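To make "distance and threshold" concrete for the Similar type, here is a standard-library sketch of Levenshtein-based similarity. The normalization 1 - distance / max_len and the is_similar helper are assumptions made for illustration; matcher_rs may score similarity differently.

```rust
/// Classic dynamic-programming Levenshtein distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the processed prefix of `a`
    // and the first j characters of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Hypothetical helper: similarity is assumed to be 1 - distance / max_len,
/// and the pair counts as similar when that value reaches `threshold`.
fn is_similar(word: &str, text: &str, threshold: f64) -> bool {
    let max_len = word.chars().count().max(text.chars().count());
    if max_len == 0 {
        return true; // two empty strings are identical
    }
    let similarity = 1.0 - levenshtein(word, text) as f64 / max_len as f64;
    similarity >= threshold
}
```

For instance, hello and hallo are one substitution apart, so under this scoring they pass a threshold of 0.7, whereas hello and world (four substitutions) do not.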
SimpleMatchType
- None: No transformation.
- Fanjian: Traditional Chinese to simplified Chinese transformation. Based on FANJIAN and UNICODE.
  - 妳好 -> 你好
  - 現⾝ -> 现身
- Delete: Delete all punctuation, special characters and white spaces.
  - hello, world! -> helloworld
  - 《你∷好》 -> 你好
- Normalize: Normalize all English character variations and number variations to basic characters. Based on UPPER_LOWER, EN_VARIATION and NUM_NORM.
  - ℋЀ⒈㈠ϕ -> he11o
  - ⒈Ƨ㊂ -> 123
- PinYin: Convert all unicode Chinese characters to pinyin with boundaries. Based on PINYIN.
  - 你好 -> ␀ni␀␀hao␀
  - 西安 -> ␀xi␀␀an␀
- PinYinChar: Convert all unicode Chinese characters to pinyin without boundaries. Based on PINYIN_CHAR.
  - 你好 -> nihao
  - 西安 -> xian
You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.
Avoid combining PinYin and PinYinChar, because PinYin is a more limited version of PinYinChar: in some cases, xian can be treated either as two words, xi and an, or as a single word, xian.
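The difference between the two pinyin transformations comes down to whether syllable boundaries survive. The sketch below uses a hypothetical five-character table and / as the boundary marker; it only illustrates the idea and does not reproduce the library's PINYIN or PINYIN_CHAR data.

```rust
/// Hypothetical five-entry lookup; real pinyin tables are far larger.
fn pinyin_of(c: char) -> Option<&'static str> {
    match c {
        '西' | '洗' => Some("xi"),
        '安' | '按' => Some("an"),
        '先' => Some("xian"),
        _ => None,
    }
}

/// PinYin-style: every syllable is wrapped in a boundary marker.
fn to_pinyin(text: &str) -> String {
    text.chars()
        .map(|c| match pinyin_of(c) {
            Some(p) => format!("/{}/", p),
            None => c.to_string(),
        })
        .collect()
}

/// PinYinChar-style: syllables are concatenated with no boundary.
fn to_pinyin_char(text: &str) -> String {
    text.chars()
        .map(|c| match pinyin_of(c) {
            Some(p) => p.to_string(),
            None => c.to_string(),
        })
        .collect()
}
```

With boundaries, 西安 becomes /xi//an/ and stays distinct from 先 (/xian/); without them, both collapse to xian.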
Delete is technically a combination of TextDelete and WordDelete. We implement different delete methods for text and for words, because we believe CN_SPECIAL and EN_SPECIAL are part of a word but not part of the text. For the text_process and reduce_text_process functions, use TextDelete instead of WordDelete.
- WordDelete: Delete all patterns in PUNCTUATION_SPECIAL.
- TextDelete: Delete all patterns in PUNCTUATION_SPECIAL, CN_SPECIAL, EN_SPECIAL.
Basic Example
Here’s a basic example of how to use the Matcher struct for text matching:
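// Text processing. Argument values throughout these snippets are illustrative.
use matcher_rs::{reduce_text_process, text_process, SimpleMatchType};
let result = text_process(SimpleMatchType::TextDelete, "你好,世界!");
let result = reduce_text_process(SimpleMatchType::FanjianDeleteNormalize, "你好,世界!");

// Matcher: one match_id (1) holding a single Simple match table.
use std::collections::HashMap;
use matcher_rs::{MatchTable, MatchTableMap, MatchTableType, Matcher, SimpleMatchType};
let match_table_map: MatchTableMap = HashMap::from_iter(vec![(1, vec![MatchTable {
    table_id: 1,
    match_table_type: MatchTableType::Simple { simple_match_type: SimpleMatchType::FanjianDeleteNormalize },
    word_list: vec!["example", "test"],
    exemption_simple_match_type: SimpleMatchType::FanjianDeleteNormalize,
    exemption_word_list: vec![],
}])]);
let matcher = Matcher::new(&match_table_map);
let text = "This is an example text.";
let results = matcher.word_match(text);

// SimpleMatcher: word_id -> word, grouped by SimpleMatchType.
use std::collections::HashMap;
use matcher_rs::{SimpleMatchType, SimpleMatcher};
let mut simple_match_type_word_map = HashMap::default();
let mut simple_word_map = HashMap::default();
simple_word_map.insert(1, "你好");
simple_word_map.insert(2, "世界");
simple_match_type_word_map.insert(SimpleMatchType::Fanjian, simple_word_map);
let matcher = SimpleMatcher::new(&simple_match_type_word_map);
let text = "你好,世界!";
let results = matcher.process(text);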
For more detailed usage examples, please refer to the test.rs file.
Benchmarks
The matcher_rs library includes benchmarks to measure the performance of the matcher. You can find the benchmarks in the bench.rs file. To run the benchmarks, use the following command:
cargo bench
Contributing
Contributions to matcher_rs are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.
License
matcher_rs is licensed under the MIT OR Apache-2.0 license.
More Information
For more details, visit the GitHub repository.