Matcher Rust Implementation with PyO3 Binding
A high-performance matcher designed to solve LOGICAL and TEXT VARIATIONS problems in word matching, implemented in Rust with PyO3 bindings.
For detailed implementation, see the Design Document.
Features
- Multiple Matching Methods:
  - Simple Word Matching
  - Regex-Based Matching
  - Similarity-Based Matching
- Text Normalization:
  - Fanjian: Simplify traditional Chinese characters to simplified ones.
    Example: 蟲艸 -> 虫艹
  - Delete: Remove specific characters.
    Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
  - Normalize: Normalize special characters to identifiable characters.
    Example: 𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟! -> hello world!
  - PinYin: Convert Chinese characters to Pinyin for fuzzy matching.
    Example: 西安 -> xi an, matches 洗按 -> xi an, but not 先 -> xian
  - PinYinChar: Convert Chinese characters to Pinyin.
    Example: 西安 -> xian, matches 洗按 and 先 -> xian
- AND OR NOT Word Matching:
  - Takes into account the number of repetitions of words.
  - Example: hello&world matches hello world and world,hello
  - Example: 无&法&无&天 matches 无无法天 (because 无 is repeated twice), but not 无法天
  - Example: hello~helloo~hhello matches hello but not helloo and hhello
- Customizable Exemption Lists: Exclude specific words from matching.
- Efficient Handling of Large Word Lists: Optimized for performance.
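The AND / NOT examples in the list above can be reproduced with a few lines of plain Python. This is only a sketch of the semantics as described here, not the library's implementation: the function names are hypothetical, plain substring counting stands in for the real matching engine, and it assumes that every word before the first ~ is an AND word and every word after a ~ is a NOT word.

```python
from collections import Counter


def parse_pattern(pattern: str):
    """Split 'a&b~c~d' into required word counts and a list of forbidden words.

    Assumption: the part before the first '~' holds the AND words,
    and each later '~' segment is a single NOT word.
    """
    and_part, *not_words = pattern.split("~")
    required = Counter(and_part.split("&"))
    return required, not_words


def combination_match(pattern: str, text: str) -> bool:
    """Return True if text satisfies the AND / NOT pattern."""
    required, forbidden = parse_pattern(pattern)
    # Each AND word must occur at least as often as it is repeated in the pattern.
    if any(text.count(word) < times for word, times in required.items()):
        return False
    # Any occurrence of a NOT word rejects the text.
    return not any(word in text for word in forbidden)


if __name__ == "__main__":
    print(combination_match("hello&world", "world,hello"))
    print(combination_match("无&法&无&天", "无无法天"))
    print(combination_match("无&法&无&天", "无法天"))
    print(combination_match("hello~helloo~hhello", "helloo"))
```

Counting occurrences rather than testing mere presence is what makes 无&法&无&天 reject 无法天: the pattern asks for 无 twice, and the text contains it only once.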
Installation
Use pip
pip install matcher_py
Install pre-built binary
Visit the release page to download the pre-built binary.
Build from source
You need Rust and maturin installed.
# Clone the repository
git clone https://github.com/Lips7/Matcher.git
cd Matcher/matcher_py
# Install maturin
pip install maturin
# Build and install the package
maturin develop --release
Usage
All relevant types are defined in extension_types.py.
Text Process Usage
The reduce_text_process and text_process functions apply text transformations:
- reduce_text_process: Combine and reduce multiple transformations.
- text_process: Perform a single transformation.
Matcher Basic Usage
The Matcher supports the following operations:
- Check if a text matches.
- Process a text and return the results as a list.
- Perform word matching and return the results as a dict.
- Perform word matching and return the results as a string.
Simple Matcher Basic Usage
The SimpleMatcher supports the following operations:
- Check if a text matches.
- Perform simple processing.
Explanation of the configuration
Matcher's configuration is defined by the MatchTableMap = Dict[int, List[MatchTable]] type. The key of MatchTableMap is called match_id; for each match_id, the table_id values inside must be unique.
SimpleMatcher's configuration is defined by the SimpleTable = Dict[ProcessType, Dict[int, str]] type. The key of the inner Dict[int, str] is called word_id, and word_id must be globally unique.
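The two uniqueness rules can be checked before a configuration is handed to the matcher. The sketch below is an illustration only: it uses plain dicts and string keys as stand-ins for the MatchTable and ProcessType types, the function names are hypothetical, and it reads the first rule as "table_id is unique within one match_id".

```python
from typing import Dict, List


def check_match_table_map(match_table_map: Dict[int, List[dict]]) -> None:
    """Raise ValueError if a table_id is repeated inside one match_id."""
    for match_id, tables in match_table_map.items():
        seen = set()
        for table in tables:
            table_id = table["table_id"]
            if table_id in seen:
                raise ValueError(
                    f"table_id {table_id} is repeated in match_id {match_id}"
                )
            seen.add(table_id)


def check_simple_table(simple_table: Dict[str, Dict[int, str]]) -> None:
    """Raise ValueError if a word_id is used more than once across all process types."""
    seen = set()
    for process_type, words in simple_table.items():
        for word_id in words:
            if word_id in seen:
                raise ValueError(
                    f"word_id {word_id} is repeated (seen again under {process_type})"
                )
            seen.add(word_id)


if __name__ == "__main__":
    # Valid: table_id 1 appears under two different match_ids.
    check_match_table_map({1: [{"table_id": 1}, {"table_id": 2}], 2: [{"table_id": 1}]})
    # Valid: word_ids 1, 2 and 3 are distinct across process types.
    check_simple_table({"None": {1: "hello"}, "Delete": {2: "world", 3: "word"}})
    print("configuration shapes look consistent")
```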
MatchTable
- table_id: The unique ID of the match table.
- match_table_type: The type of the match table.
- word_list: The word list of the match table.
- exemption_process_type: The type of the exemption simple match.
- exemption_word_list: The exemption word list of the match table.
For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
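The interaction between word_list and exemption_word_list amounts to a simple boolean rule. The following sketch shows that rule only; the function name is hypothetical, and plain substring tests stand in for whatever matching the table type actually performs.

```python
from typing import List


def table_is_match(text: str, word_list: List[str], exemption_word_list: List[str]) -> bool:
    """Combine word matching and exemption matching for one match table.

    Assumption: substring containment is used as a stand-in for real matching.
    """
    word_hit = any(word in text for word in word_list)
    exemption_hit = any(word in text for word in exemption_word_list)
    # An exemption hit overrides a word hit.
    return word_hit and not exemption_hit


if __name__ == "__main__":
    print(table_is_match("hello world", ["hello"], ["word"]))
    print(table_is_match("hello word", ["hello"], ["word"]))
```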
MatchTableType
- Simple: Supports simple multiple patterns matching with text normalization defined by process_type.
  - It can handle combination patterns and repetition-sensitive matching, delimited by & and ~. For example, hello&world&hello will match hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
- Regex: Supports regex patterns matching.
  - SimilarChar: Supports similar character matching using regex.
    - ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld!, hollowrd?, hi🌍~ ··· any combination of the words split by , in the list.
  - Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
    - ["h,e,l,l,o", "你,好"] will match hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。.
  - Regex: Supports regex matching.
    - ["h[aeiou]llo", "w[aeiou]rd"] will match hello, world, hillo, wurld ··· any text that matches the regex in the list.
- Similar: Supports similar text matching based on distance and threshold.
  - Levenshtein: Supports similar text matching based on Levenshtein distance.
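One way to picture the SimilarChar type is as a regex built from the word list: each entry becomes an alternation of its comma-separated variants, and the alternations are concatenated. The sketch below follows that idea using Python's re module. The function name is hypothetical, and the construction is an assumption made for illustration; the regex the library actually builds may differ, for example in what it allows between the groups.

```python
import re
from typing import List


def similar_char_regex(word_list: List[str]) -> re.Pattern:
    """Build one regex in which each list entry is an alternation of its variants."""
    groups = []
    for entry in word_list:
        variants = entry.split(",")
        # Escape each variant so characters like '?' and '!' are matched literally.
        groups.append("(?:" + "|".join(re.escape(v) for v in variants) + ")")
    return re.compile("".join(groups))


if __name__ == "__main__":
    pattern = similar_char_regex(["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"])
    for text in ["helloworld!", "hollowrd?", "hi🌍~", "hello"]:
        print(text, bool(pattern.search(text)))
```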
ProcessType
- None: No transformation.
- Fanjian: Traditional Chinese to simplified Chinese transformation. Based on FANJIAN.
  - 妳好 -> 你好
  - 現⾝ -> 现身
- Delete: Delete all punctuation, special characters and white spaces. Based on TEXT_DELETE and WHITE_SPACE.
  - hello, world! -> helloworld
  - 《你∷好》 -> 你好
- Normalize: Normalize all English character variations and number variations to basic characters. Based on NORM and NUM_NORM.
  - ℋЀ⒈㈠Õ -> he11o
  - ⒈Ƨ㊂ -> 123
- PinYin: Convert all unicode Chinese characters to pinyin with boundaries. Based on PINYIN.
  - 你好 -> ni hao
  - 西安 -> xi an
- PinYinChar: Convert all unicode Chinese characters to pinyin without boundaries. Based on PINYIN.
  - 你好 -> nihao
  - 西安 -> xian
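To make the Delete transformation concrete, here is a rough approximation in plain Python. It is not the library's implementation: the library works from its own TEXT_DELETE and WHITE_SPACE tables, whereas this sketch assumes that "punctuation, special characters and white spaces" can be approximated by the Unicode general categories P (punctuation), S (symbol) and Z (separator) plus anything str.isspace() reports. The function name is hypothetical.

```python
import unicodedata


def delete_sketch(text: str) -> str:
    """Drop whitespace and every character whose Unicode category is P*, S* or Z*."""
    return "".join(
        ch
        for ch in text
        if not ch.isspace() and unicodedata.category(ch)[0] not in "PSZ"
    )


if __name__ == "__main__":
    for sample in ["hello, world!", "《你∷好》", "*Fu&*iii&^%%*&kkkk"]:
        print(sample, "->", delete_sketch(sample))
```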
You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.
Avoid combining PinYin and PinYinChar: PinYin is a more restrictive version of PinYinChar. Without boundaries, a string such as xian can be read either as two words, xi and an, or as a single word, xian.
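The difference between the two pinyin transformations can be seen with a toy lookup table. The sketch below is illustrative only: TOY_PINYIN is a hypothetical five-character table rather than the library's PINYIN data, the function names are made up, and joining syllables with a single space is an assumption about how boundaries are represented.

```python
# Hypothetical toy table; the real library uses its own PINYIN data.
TOY_PINYIN = {"西": "xi", "安": "an", "洗": "xi", "按": "an", "先": "xian"}


def pinyin_with_boundaries(text: str) -> str:
    """PinYin-style: one syllable per character, separated by spaces."""
    return " ".join(TOY_PINYIN.get(ch, ch) for ch in text)


def pinyin_without_boundaries(text: str) -> str:
    """PinYinChar-style: syllables concatenated with no separator."""
    return "".join(TOY_PINYIN.get(ch, ch) for ch in text)


if __name__ == "__main__":
    for word in ["西安", "洗按", "先"]:
        print(word, "|", pinyin_with_boundaries(word), "|", pinyin_without_boundaries(word))
```

With boundaries, 西安 and 洗按 share a form that 先 does not; without boundaries, all three collapse to the same string, which is why the boundary-free variant matches more loosely.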
Contributing
Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.
License
matcher_py is licensed under the MIT OR Apache-2.0 license.
More Information
For more details, visit the GitHub repository.