Matcher Rust Implementation with PyO3 Binding
A high-performance matcher designed to solve LOGICAL and TEXT VARIATIONS problems in word matching, implemented in Rust.
For detailed implementation, see the Design Document.
Features
- Multiple Matching Methods:
  - Simple Word Matching
  - Regex-Based Matching
  - Similarity-Based Matching
- Text Normalization:
  - Fanjian: Simplify traditional Chinese characters to simplified ones.
    Example: 蟲艸 -> 虫艹
  - Delete: Remove specific characters.
    Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
  - Normalize: Normalize special characters to identifiable characters.
    Example: 𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟! -> hello world!
  - PinYin: Convert Chinese characters to Pinyin for fuzzy matching.
    Example: 西安 -> xi an, matches 洗按 -> xi an, but not 先 -> xian
  - PinYinChar: Convert Chinese characters to Pinyin.
    Example: 西安 -> xian, matches 洗按 and 先 -> xian
- AND OR NOT Word Matching:
  - Takes into account the number of repetitions of words.
  - Example: hello&world matches hello world and world,hello
  - Example: 无&法&无&天 matches 无无法天 (because 无 is repeated twice), but not 无法天
  - Example: hello~helloo~hhello matches hello but not helloo and hhello
- Customizable Exemption Lists: Exclude specific words from matching.
- Efficient Handling of Large Word Lists: Optimized for performance.
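The AND/NOT examples in the list above can be reproduced with a small pure-Python sketch. It is not the library's implementation: combination_matches is a hypothetical helper, and it assumes a pattern has the shape a&b~c~d (required words joined by &, followed by forbidden words each introduced by ~) and that matching is plain substring counting with no text normalization.

```python
from collections import Counter


def combination_matches(pattern: str, text: str) -> bool:
    """Hypothetical helper: evaluate an `&` / `~` pattern against a text.

    Assumption: everything before the first `~` is the AND part, and every
    `~`-separated piece after it is a NOT word.
    """
    required_part, *forbidden = pattern.split("~")
    # Count how often each AND word is listed, e.g. 无&法&无&天 needs 无 twice.
    required = Counter(required_part.split("&"))
    # Any NOT word appearing in the text rejects the match.
    if any(word and word in text for word in forbidden):
        return False
    # Every AND word must occur at least as often as it was listed.
    return all(text.count(word) >= times for word, times in required.items())


print(combination_matches("hello&world", "world,hello"))       # True
print(combination_matches("无&法&无&天", "无法天"))              # False
print(combination_matches("hello~helloo~hhello", "hhello"))    # False
```

Counting occurrences rather than testing mere presence is what makes the repeated 无 in the second pattern significant.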
Installation
Use pip
pip install matcher_py
Install pre-built binary
Visit the release page to download the pre-built binary.
Usage
All relevant types are defined in extension_types.py.
Explanation of the configuration
- Matcher's configuration is defined by the MatchTableMap = Dict[int, List[MatchTable]] type. The key of MatchTableMap is called match_id; within each match_id, every table_id must be unique.
- SimpleMatcher's configuration is defined by the SimpleTable = Dict[ProcessType, Dict[int, str]] type. The key of the inner Dict[int, str] is called word_id, and word_id must be globally unique.
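The two uniqueness rules can be checked before a configuration is handed to the matcher. The sketch below is only an illustration: it uses plain dicts and string keys as stand-ins for MatchTable and ProcessType, and both helper names are hypothetical rather than part of matcher_py.

```python
from typing import Dict, List


def duplicate_table_ids(match_table_map: Dict[int, List[dict]]) -> Dict[int, List[int]]:
    """Return, per match_id, the table_id values that appear more than once."""
    duplicates: Dict[int, List[int]] = {}
    for match_id, tables in match_table_map.items():
        seen = set()
        for table in tables:
            table_id = table["table_id"]
            if table_id in seen:
                duplicates.setdefault(match_id, []).append(table_id)
            seen.add(table_id)
    return duplicates


def duplicate_word_ids(simple_table: Dict[str, Dict[int, str]]) -> List[int]:
    """Return word_id values reused anywhere in the table (global uniqueness)."""
    seen, duplicates = set(), []
    for words in simple_table.values():
        for word_id in words:
            if word_id in seen:
                duplicates.append(word_id)
            seen.add(word_id)
    return duplicates


# The same table_id may appear under different match_id keys, but not twice under one.
print(duplicate_table_ids({1: [{"table_id": 1}, {"table_id": 1}], 2: [{"table_id": 1}]}))
# word_id 1 is reused across two process types, which breaks global uniqueness.
print(duplicate_word_ids({"None": {1: "hello"}, "Delete": {1: "world", 2: "hi"}}))
```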
MatchTable
- table_id: The unique ID of the match table.
- match_table_type: The type of the match table.
- word_list: The word list of the match table.
- exemption_process_type: The type of the exemption simple match.
- exemption_word_list: The exemption word list of the match table.
For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
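The exemption rule amounts to "matched and not exempted". Here is a minimal sketch of that logic, assuming plain substring matching and ignoring process types; table_matches is a hypothetical helper, not a matcher_py function.

```python
from typing import List


def table_matches(text: str, word_list: List[str], exemption_word_list: List[str]) -> bool:
    """Hypothetical helper: a table matches only if a word hits and no exemption hits."""
    word_hit = any(word in text for word in word_list)
    exemption_hit = any(word in text for word in exemption_word_list)
    # An exemption hit overrides any word hit.
    return word_hit and not exemption_hit


print(table_matches("hello", ["hello", "world"], ["word"]))        # True
print(table_matches("hello, word", ["hello", "world"], ["word"]))  # False
```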
MatchTableType
- Simple: Supports simple multiple-pattern matching with text normalization defined by process_type.
  - It handles combination patterns and repetition-sensitive matching, delimited by & and ~. For example, hello&world&hello matches hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
- Regex: Supports regex pattern matching.
  - SimilarChar: Supports similar character matching using regex.
    - ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld!, hollowrd?, hi🌍~ ··· any combination of the words split by , in the list.
  - Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
    - ["h,e,l,l,o", "你,好"] will match hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。.
  - Regex: Supports regex matching.
    - ["h[aeiou]llo", "w[aeiou]rd"] will match hello, world, hillo, wurld ··· any text that matches the regex in the list.
- Similar: Supports similar text matching based on distance and threshold.
  - Levenshtein: Supports similar text matching based on Levenshtein distance.
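To make the SimilarChar description concrete, one way such a word list could be turned into a regex is to build one alternation group per list entry and concatenate the groups. This is a sketch under that assumption (adjacent groups, no separators allowed between them); similar_char_regex is a hypothetical name and the library may construct its patterns differently.

```python
import re
from typing import List


def similar_char_regex(word_list: List[str]) -> re.Pattern:
    """Hypothetical helper: one alternation group per entry, concatenated in order."""
    groups = []
    for entry in word_list:
        # Escape each variant so characters like ? are matched literally.
        alternatives = "|".join(re.escape(variant) for variant in entry.split(","))
        groups.append(f"({alternatives})")
    return re.compile("".join(groups))


pattern = similar_char_regex(["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"])
print(bool(pattern.search("hollowrd?")))     # True
print(bool(pattern.search("hello world!")))  # False: the space breaks adjacency
```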
ProcessType
- None: No transformation.
- Fanjian: Traditional Chinese to simplified Chinese transformation. Based on FANJIAN.
  - 妳好 -> 你好
  - 現⾝ -> 现身
- Delete: Delete all punctuation, special characters and white spaces. Based on TEXT_DELETE and WHITE_SPACE.
  - hello, world! -> helloworld
  - 《你∷好》 -> 你好
- Normalize: Normalize all English character variations and number variations to basic characters. Based on NORM and NUM_NORM.
  - ℋЀ⒈㈠Õ -> he11o
  - ⒈Ƨ㊂ -> 123
- PinYin: Convert all unicode Chinese characters to pinyin with boundaries. Based on PINYIN.
  - 你好 -> ni hao
  - 西安 -> xi an
- PinYinChar: Convert all unicode Chinese characters to pinyin without boundaries. Based on PINYIN.
  - 你好 -> nihao
  - 西安 -> xian
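As a rough illustration of the Delete transformation, the sketch below drops characters by Unicode general category. This is an approximation chosen for the example: the library works from its TEXT_DELETE and WHITE_SPACE tables, which may cover a different set of characters, and delete_noise is a hypothetical name.

```python
import unicodedata


def delete_noise(text: str) -> str:
    """Drop whitespace plus punctuation (P*), symbol (S*) and separator (Z*) characters."""
    return "".join(
        ch
        for ch in text
        if not ch.isspace() and unicodedata.category(ch)[0] not in "PSZ"
    )


print(delete_noise("hello, world!"))  # helloworld
print(delete_noise("《你∷好》"))        # 你好
```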
You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.
Avoid combining PinYin and PinYinChar, because PinYin is a more limited version of PinYinChar: in some cases, such as xian, the text can be treated either as two words (xi and an) or as a single word (xian).
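The boundary difference between the two pinyin modes can be seen with a toy conversion table. The five-entry mapping below is made up for this example and stands in for the real PINYIN data; the function names are likewise hypothetical.

```python
# Tiny illustrative table; the real PINYIN data is far larger.
TOY_PINYIN = {"西": "xi", "安": "an", "洗": "xi", "按": "an", "先": "xian"}


def to_pinyin(text: str) -> str:
    """Convert each character and keep a boundary (a space) between them."""
    return " ".join(TOY_PINYIN.get(ch, ch) for ch in text)


def to_pinyin_char(text: str) -> str:
    """Convert each character and join the results without boundaries."""
    return "".join(TOY_PINYIN.get(ch, ch) for ch in text)


# With boundaries, 西安 and 洗按 agree, while 先 stays distinct.
print(to_pinyin("西安"), "|", to_pinyin("洗按"), "|", to_pinyin("先"))
# Without boundaries, all three collapse to the same string.
print(to_pinyin_char("西安"), "|", to_pinyin_char("洗按"), "|", to_pinyin_char("先"))
```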
Text Process Usage
Here’s an example of how to use the reduce_text_process and text_process functions:
Matcher Basic Usage
The Matcher example covers the following steps:
- Check if a text matches.
- Process a text and get the results as a list.
- Perform word matching and get the results as a dict.
- Perform word matching and get the results as a string.
Simple Matcher Basic Usage
The SimpleMatcher example covers the following steps:
- Check if a text matches.
- Perform simple processing.
Contributing
Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.
License
matcher_py is licensed under the MIT OR Apache-2.0 license.
More Information
For more details, visit the GitHub repository.