Matcher
A high-performance, multi-functional word matcher implemented in Rust.
Features
- Supports Multiple Matching Methods:
  - Simple word matching
  - Regex-based matching
  - Similarity-based matching
- Text Normalization Options:
  - Fanjian (Simplify traditional Chinese characters to simplified ones)
  - Delete (Remove whitespaces, punctuation, and non-alphanumeric characters)
  - Normalize (Normalize special characters to identifiable characters)
  - PinYin (Convert Chinese characters to Pinyin for fuzzy matching)
  - PinYinChar (Convert Chinese characters to Pinyin)
- Combination and Repeated Word Matching:
  - Handles combination and repetition of words with specified constraints.
Usage
Adding to Your Project
To use matcher_rs in your Rust project, add the following to your Cargo.toml file:
[dependencies]
matcher_rs = "*"
Explanation of the Configuration
- Matcher's configuration is defined by the MatchTableMap = HashMap<u64, Vec<MatchTable>> type. The key of MatchTableMap is called match_id. Within each match_id, table_id should be unique, although this is not required.
- SimpleMatcher's configuration is defined by the SimpleMatchTableMap = HashMap<SimpleMatchType, HashMap<u64, &'a str>> type. The key of the inner HashMap<u64, &'a str> is called word_id, and word_id must be globally unique.
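Because word_id must be globally unique across every SimpleMatchType, it can be useful to validate a configuration before building a matcher. The following is a minimal sketch using only the standard library. The SimpleMatchType enum here is a local stand-in with three variants, and find_duplicate_word_id is a hypothetical helper, not part of matcher_rs.

```rust
use std::collections::{HashMap, HashSet};

// Local stand-in for the library's enum; only a few variants are mirrored here.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum SimpleMatchType {
    None,
    Fanjian,
    Delete,
}

// Same shape as the SimpleMatchTableMap type described above.
type SimpleMatchTableMap<'a> = HashMap<SimpleMatchType, HashMap<u64, &'a str>>;

/// Returns a word_id that appears under more than one SimpleMatchType,
/// or None if every word_id is globally unique. (Hypothetical helper.)
fn find_duplicate_word_id(map: &SimpleMatchTableMap) -> Option<u64> {
    let mut seen = HashSet::new();
    for word_map in map.values() {
        for &word_id in word_map.keys() {
            // insert returns false when the id was already recorded.
            if !seen.insert(word_id) {
                return Some(word_id);
            }
        }
    }
    None
}

fn main() {
    let mut map: SimpleMatchTableMap = HashMap::new();
    map.entry(SimpleMatchType::None).or_default().insert(1, "hello");
    map.entry(SimpleMatchType::Fanjian).or_default().insert(2, "你好");
    assert_eq!(find_duplicate_word_id(&map), None);

    // Reusing word_id 1 under a different match type violates the rule.
    map.entry(SimpleMatchType::Delete).or_default().insert(1, "world");
    assert_eq!(find_duplicate_word_id(&map), Some(1));
}
```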
MatchTable
- table_id: The unique ID of the match table.
- match_table_type: The type of the match table.
- simple_match_type: The type of the simple match (only relevant if match_table_type is "simple").
- word_list: The word list of the match table.
- exemption_simple_match_type: The type of the exemption simple match.
- exemption_word_list: The exemption word list of the match table.
For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
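The exemption rule can be sketched in a few lines. This is an illustration only: the Table struct and table_matches function are hypothetical, and plain substring search stands in for the real matching methods.

```rust
// Hypothetical, simplified view of one match table: plain substring lists only.
struct Table<'a> {
    word_list: Vec<&'a str>,
    exemption_word_list: Vec<&'a str>,
}

/// A table matches when some word from word_list occurs in the text
/// and no word from exemption_word_list does.
fn table_matches(table: &Table, text: &str) -> bool {
    let hit = table.word_list.iter().any(|w| text.contains(*w));
    let exempt = table.exemption_word_list.iter().any(|w| text.contains(*w));
    hit && !exempt
}

fn main() {
    let table = Table {
        word_list: vec!["apple"],
        exemption_word_list: vec!["pineapple"],
    };
    // A word hit with no exemption hit is a match.
    assert!(table_matches(&table, "I ate an apple"));
    // The exemption word is present, so the word hit is suppressed.
    assert!(!table_matches(&table, "I ate a pineapple"));
    // No word hit at all.
    assert!(!table_matches(&table, "I ate a banana"));
}
```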
MatchTableType
- Simple: Supports simple multiple-pattern matching with text normalization defined by simple_match_type.
  - We offer transformation methods for text normalization, including Fanjian, Normalize, PinYin, and others.
  - It can handle combination patterns and repetition-sensitive matching, delimited by ",". For example, "hello,world,hello" will match "hellohelloworld" and "worldhellohello", but not "helloworld", because hello is required twice.
- SimilarChar: Supports similar character matching using regex.
  - ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld, hollowrd, hi🌍, and any other combination of the words split by "," in the list.
- Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
  - ["h,e,l,l,o", "你,好"] will match "hope, endures, love, lasts, onward." and "你的笑容温暖, 好心情常伴。".
- SimilarTextLevenshtein: Supports similar text matching based on Levenshtein distance (threshold is 0.8).
  - ["helloworld"] will match helloworld, hellowrld, helloworld!, and any other text similar to the words in the list.
- Regex: Supports regex matching.
  - ["h[aeiou]llo", "w[aeiou]rd"] will match hello, word, hillo, wurd, and any other text that matches a regex in the list.
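To make the 0.8 threshold of SimilarTextLevenshtein concrete, here is a minimal standard-library sketch. It assumes similarity is computed as 1 - distance / length of the longer string and that the whole text is compared against each word. Both are assumptions for illustration, and the library's actual normalization may differ. The function names are hypothetical.

```rust
/// Classic dynamic-programming Levenshtein distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the processed prefix of a and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let best = (prev[j] + cost) // substitution or match
                .min(prev[j + 1] + 1)   // deletion
                .min(cur[j] + 1);       // insertion
            cur.push(best);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Similarity in [0, 1], assumed here to be 1 - distance / longer length.
fn similarity(a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / max_len as f64
}

/// Applies the 0.8 threshold mentioned above.
fn is_similar(word: &str, text: &str) -> bool {
    similarity(word, text) >= 0.8
}

fn main() {
    assert!(is_similar("helloworld", "helloworld"));
    assert!(is_similar("helloworld", "hellowrld")); // one deletion out of 10
    assert!(is_similar("helloworld", "helloworld!")); // one insertion out of 11
    assert!(!is_similar("helloworld", "goodbye"));
}
```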
SimpleMatchType
- None: No transformation.
- Fanjian: Traditional Chinese to simplified Chinese transformation.
  - 妳好 -> 你好
  - 現⾝ -> 现身
- Delete: Delete all non-alphanumeric and non-Unicode Chinese characters.
  - hello, world! -> helloworld
  - 《你∷好》 -> 你好
- Normalize: Normalize all English character variations and number variations to basic characters.
  - ℋЀ⒈㈠ϕ -> he11o
  - ⒈Ƨ㊂ -> 123
- PinYin: Convert all Unicode Chinese characters to pinyin with boundaries.
  - 你好 -> ␀ni␀␀hao␀
  - 西安 -> ␀xi␀␀an␀
- PinYinChar: Convert all Unicode Chinese characters to pinyin without boundaries.
  - 你好 -> nihao
  - 西安 -> xian
You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.
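Combining transformations amounts to applying them one after another. The sketch below illustrates this with two tiny stand-ins: a two-entry Fanjian table and a Delete step approximated with char::is_alphanumeric. The function names, the tables, and the order of application are assumptions for illustration and do not reflect the library's internals.

```rust
/// Tiny illustrative Traditional -> Simplified table (a real mapping is far larger).
fn fanjian(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            '妳' => '你',
            '現' => '现',
            other => other,
        })
        .collect()
}

/// Approximates Delete by keeping only Unicode alphanumeric characters,
/// which includes CJK ideographs.
fn delete(text: &str) -> String {
    text.chars().filter(|c| c.is_alphanumeric()).collect()
}

/// A combined transformation: Fanjian first, then Delete (order assumed).
fn fanjian_delete(text: &str) -> String {
    delete(&fanjian(text))
}

fn main() {
    assert_eq!(fanjian("妳好"), "你好");
    assert_eq!(delete("hello, world!"), "helloworld");
    assert_eq!(delete("《你∷好》"), "你好");
    assert_eq!(fanjian_delete("《妳∷好》"), "你好");
}
```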
Avoid combining PinYin and PinYinChar, because PinYin is a more restrictive version of PinYinChar. Without boundaries, a string such as xian can be read either as two words (xi and an) or as a single word (xian).
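The xian ambiguity can be reproduced with a three-character lookup table. This is a sketch under stated assumptions: pinyin_of, to_pinyin, and to_pinyin_char are hypothetical helpers, the table is illustrative, and the ␀ boundary marker simply copies the notation used in the examples above.

```rust
/// Tiny illustrative pinyin table; returns None for characters it does not know.
fn pinyin_of(c: char) -> Option<&'static str> {
    match c {
        '西' => Some("xi"),
        '安' => Some("an"),
        '先' => Some("xian"),
        _ => None,
    }
}

/// PinYin-style output: each syllable is wrapped in boundary markers.
fn to_pinyin(text: &str) -> String {
    let mut out = String::new();
    for c in text.chars() {
        match pinyin_of(c) {
            Some(p) => {
                out.push('␀');
                out.push_str(p);
                out.push('␀');
            }
            None => out.push(c),
        }
    }
    out
}

/// PinYinChar-style output: syllables are concatenated with no boundaries.
fn to_pinyin_char(text: &str) -> String {
    let mut out = String::new();
    for c in text.chars() {
        match pinyin_of(c) {
            Some(p) => out.push_str(p),
            None => out.push(c),
        }
    }
    out
}

fn main() {
    // With boundaries, the two inputs stay distinct.
    assert_eq!(to_pinyin("西安"), "␀xi␀␀an␀");
    assert_eq!(to_pinyin("先"), "␀xian␀");
    // Without boundaries, both collapse to the same string.
    assert_eq!(to_pinyin_char("西安"), "xian");
    assert_eq!(to_pinyin_char("先"), "xian");
}
```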
Limitations
Simple Match supports at most 32 combined words per pattern (beyond 32, matching of the additional words is not guaranteed) and at most 8 repetitions of a word (a higher repeat count is capped at 8).
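The combination and repetition semantics, together with the cap of 8 repetitions, can be sketched as follows. This is not the library's algorithm: required_counts and combination_matches are hypothetical helpers, occurrences are counted without overlap via str::matches, and the 32-word limit is not modelled.

```rust
use std::collections::HashMap;

// Repetition cap stated above.
const MAX_REPEAT: usize = 8;

/// Parses "hello,world,hello" into required counts: hello -> 2, world -> 1.
/// Counts above MAX_REPEAT are capped.
fn required_counts(pattern: &str) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for word in pattern.split(',').filter(|w| !w.is_empty()) {
        let n = counts.entry(word).or_insert(0);
        *n = (*n + 1).min(MAX_REPEAT);
    }
    counts
}

/// True when every word occurs in the text at least its required number
/// of times, in any order (non-overlapping occurrences).
fn combination_matches(pattern: &str, text: &str) -> bool {
    required_counts(pattern)
        .iter()
        .all(|(word, &need)| text.matches(*word).count() >= need)
}

fn main() {
    let pattern = "hello,world,hello";
    assert!(combination_matches(pattern, "hellohelloworld"));
    assert!(combination_matches(pattern, "worldhellohello"));
    // Only one hello is present, but two are required.
    assert!(!combination_matches(pattern, "helloworld"));

    // Ten repetitions in the pattern are capped to eight.
    let many = ["hello"; 10].join(",");
    assert_eq!(required_counts(&many)["hello"], 8);
}
```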
Basic Example
Here’s a basic example of how to use the Matcher struct for text matching:
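use std::collections::HashMap;
use matcher_rs::{MatchTable, MatchTableMap, MatchTableType, Matcher, SimpleMatchType};
// Reconstructed listing: the import list, the table values, and the borrow passed to Matcher::new are assumptions.
let table = MatchTable {
    table_id: 1, match_table_type: MatchTableType::Simple,
    simple_match_type: SimpleMatchType::FanjianDeleteNormalize, word_list: vec!["example", "test"],
    exemption_simple_match_type: SimpleMatchType::FanjianDeleteNormalize, exemption_word_list: vec![],
};
let match_table_map: MatchTableMap = HashMap::from_iter([(1, vec![table])]);
let matcher = Matcher::new(&match_table_map);
let text = "This is an example text.";
let results = matcher.word_match(text);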
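SimpleMatcher can be used directly in a similar way:
use std::collections::HashMap;
use matcher_rs::{SimpleMatchType, SimpleMatcher, TextMatcherTrait};
// Reconstructed listing: the import list (including the trait), the inserted words, the chosen match type, and the borrow passed to SimpleMatcher::new are assumptions.
let mut simple_match_type_word_map = HashMap::default();
let mut simple_word_map = HashMap::default();
simple_word_map.insert(1, "你好"); // word_id, word
simple_word_map.insert(2, "世界");
simple_match_type_word_map.insert(SimpleMatchType::Fanjian, simple_word_map);
let matcher = SimpleMatcher::new(&simple_match_type_word_map);
let text = "你好,世界!";
let results = matcher.process(text);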
For more detailed usage examples, please refer to the test.rs file.
Benchmarks
The matcher_rs library includes benchmarks to measure the performance of the matcher. You can find the benchmarks in the bench.rs file. To run the benchmarks, use the following command:
cargo bench
Contributing
Contributions to matcher_rs are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.
License
matcher_rs is licensed under the MIT OR Apache-2.0 license.
More Information
For more details, visit the GitHub repository.