Matcher Rust Implementation with PyO3 Binding
Installation
Use pip
pip install matcher_py
Install pre-built binary
Visit the release page to download the pre-built binary.
Usage
The msgspec library is recommended for serializing the matcher configuration due to its performance benefits. You can also use other msgpack serialization libraries like ormsgpack. All relevant types are defined in extension_types.py.
Explanation of the configuration
- Matcher's configuration is defined by the MatchTableMap = Dict[int, List[MatchTable]] type. The key of MatchTableMap is called match_id. For each match_id, the table_id values inside should be unique, although this is not required.
- SimpleMatcher's configuration is defined by the SimpleMatchTableMap = Dict[SimpleMatchType, Dict[int, str]] type. The key of the inner Dict[int, str] is called word_id, and word_id is required to be globally unique.
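The uniqueness rules above can be checked before a configuration is serialized. The following is a minimal sketch, not part of matcher_py: enum members are written as plain strings, match tables are reduced to dicts with only a table_id field, and both helper names are hypothetical.

```python
from typing import Dict, List

# Hypothetical stand-ins: enum members are written as plain strings and
# match tables as dicts that carry only the table_id field.
match_table_map: Dict[int, List[dict]] = {
    1: [{"table_id": 1}, {"table_id": 2}],
    2: [{"table_id": 3}],
}
simple_match_table_map: Dict[str, Dict[int, str]] = {
    "MatchNone": {1: "hello"},
    "MatchFanjian": {2: "你好"},
}


def repeated_table_ids(table_map):
    """Map each match_id to the table_ids that occur more than once under it."""
    repeated = {}
    for match_id, tables in table_map.items():
        ids = [table["table_id"] for table in tables]
        dupes = sorted({i for i in ids if ids.count(i) > 1})
        if dupes:
            repeated[match_id] = dupes
    return repeated


def repeated_word_ids(table_map):
    """Return word_ids that appear under more than one simple match type."""
    seen, repeated = set(), []
    for words in table_map.values():
        for word_id in words:
            if word_id in seen:
                repeated.append(word_id)
            seen.add(word_id)
    return repeated
```

A repeated table_id is only worth a warning, since uniqueness there is recommended rather than required, while a repeated word_id breaks the stated requirement.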
MatchTable
- table_id: The unique ID of the match table.
- match_table_type: The type of the match table.
- simple_match_type: The type of the simple match (only relevant if match_table_type is "simple").
- word_list: The word list of the match table.
- exemption_simple_match_type: The type of the exemption simple match.
- exemption_word_list: The exemption word list of the match table.
For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
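The exemption rule can be pictured with a short sketch. It assumes plain substring checks in place of the real matching engines, and the function name table_matches is hypothetical rather than part of matcher_py.

```python
from typing import List


def table_matches(text: str, word_list: List[str], exemption_word_list: List[str]) -> bool:
    """Apply the exemption rule, with substring checks standing in for real matching."""
    word_hit = any(word in text for word in word_list)
    exemption_hit = any(word in text for word in exemption_word_list)
    # An exemption hit overrides a word hit.
    return word_hit and not exemption_hit
```

With word_list ["hello", "world"] and exemption_word_list ["word"], the text "hello" matches, while "hello, word" does not because the exemption word is present.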
MatchTableType
- Simple: Supports simple multiple patterns matching with text normalization defined by simple_match_type.
  - We offer transformation methods for text normalization, including MatchFanjian, MatchNormalize, MatchPinYin ···.
  - It can handle combination patterns and repetition-sensitive matching, with sub-patterns delimited by ",". For example, hello,world,hello will match hellohelloworld and worldhellohello, but not helloworld, because hello has to appear twice.
- SimilarChar: Supports similar character matching using regex.
  - ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld, hollowrd, hi🌍 ··· any combination of the words split by "," in the list.
- Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
  - ["h,e,l,l,o", "你,好"] will match "hope, endures, love, lasts, onward." and "你的笑容温暖, 好心情常伴。".
- SimilarTextLevenshtein: Supports similar text matching based on Levenshtein distance (threshold is 0.8).
  - ["helloworld"] will match helloworld, hellowrld, helloworld! ··· any text similar to the words in the list.
- Regex: Supports regex matching.
  - ["h[aeiou]llo", "w[aeiou]rd"] will match hello, world, hillo, wurld ··· any text that matches the regex in the list.
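The combination and repetition rule of the Simple table type can be sketched in a few lines. This is an illustration only: it assumes each sub-pattern must occur at least as many times as it is listed, counts non-overlapping occurrences, and skips text normalization. The name combination_match is hypothetical.

```python
from collections import Counter


def combination_match(pattern: str, text: str, delimiter: str = ",") -> bool:
    """Return True if every sub-pattern occurs in text at least as often as listed."""
    required = Counter(pattern.split(delimiter))
    # str.count counts non-overlapping occurrences of each sub-pattern.
    return all(text.count(word) >= times for word, times in required.items())
```

For the pattern hello,world,hello this accepts hellohelloworld and worldhellohello and rejects helloworld, which agrees with the example given for the Simple type.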
SimpleMatchType
- MatchNone: No transformation.
- MatchFanjian: Traditional Chinese to simplified Chinese transformation.
  - 妳好 -> 你好
  - 現⾝ -> 现身
- MatchDelete: Delete all non-alphanumeric and non-unicode Chinese characters.
  - hello, world! -> helloworld
  - 《你∷好》 -> 你好
- MatchNormalize: Normalize all English character variations and number variations to basic characters.
  - ℋЀ⒈㈠ϕ -> he11o
  - ⒈Ƨ㊂ -> 123
- MatchPinYin: Convert all unicode Chinese characters to pinyin with boundaries.
  - 你好 -> ␀ni␀␀hao␀
  - 西安 -> ␀xi␀␀an␀
- MatchPinYinChar: Convert all unicode Chinese characters to pinyin without boundaries.
  - 你好 -> nihao
  - 西安 -> xian
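As an illustration of what a transformation does, the sketch below approximates MatchDelete with a regular expression. It assumes that "unicode Chinese characters" means the CJK Unified Ideographs range U+4E00 to U+9FFF and that alphanumerics are ASCII only. The character set used by the library may differ, and delete_transform is a hypothetical name.

```python
import re

# Everything outside ASCII letters, ASCII digits, and U+4E00..U+9FFF is removed.
_NOT_KEPT = re.compile(r"[^0-9A-Za-z\u4e00-\u9fff]")


def delete_transform(text: str) -> str:
    """Approximate MatchDelete by stripping punctuation, spaces, and symbols."""
    return _NOT_KEPT.sub("", text)
```

Applied to the two MatchDelete examples above, this yields helloworld and 你好.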
You can combine these transformations as needed. Pre-defined combinations like MatchDeleteNormalize and MatchFanjianDeleteNormalize are provided for convenience.
Avoid combining MatchPinYin and MatchPinYinChar, because MatchPinYin is a more limited version of MatchPinYinChar. In some cases, such as xian, the pinyin can be treated either as two words, xi and an, or as a single word, xian.
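The xian ambiguity can be reproduced with a toy pinyin table. The three-entry mapping and both function names are hypothetical, and the ␀ marker simply mirrors the notation of the examples above. The library's real tables and boundary character are not shown here.

```python
# Hypothetical three-entry pinyin table, for illustration only.
PINYIN = {"西": "xi", "安": "an", "先": "xian"}
BOUNDARY = "\u2400"  # the ␀ marker used in the MatchPinYin examples


def pinyin_with_boundaries(text: str) -> str:
    """Wrap each converted character in boundary markers (MatchPinYin-like)."""
    return "".join(
        f"{BOUNDARY}{PINYIN[ch]}{BOUNDARY}" if ch in PINYIN else ch for ch in text
    )


def pinyin_without_boundaries(text: str) -> str:
    """Concatenate the pinyin with no markers (MatchPinYinChar-like)."""
    return "".join(PINYIN.get(ch, ch) for ch in text)
```

Without boundaries, 西安 and 先 both become xian and can no longer be told apart. With boundaries they stay distinct, so the two transformations give conflicting answers for the same text.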
Limitations
- Simple Match can handle words made of at most 32 combined words (beyond 32, the combined words that take effect are not guaranteed) and at most 8 repetitions of a word (more than 8 repetitions are limited to 8).
Matcher Basic Usage
Matcher supports the following operations:
- Check if a text matches.
- Perform word matching as a dict.
- Perform word matching as a string.
- Perform batch processing as a dict using a list.
- Perform batch processing as a string using a list.
- Perform batch processing as a dict using a NumPy array.
- Perform batch processing as a string using a NumPy array.
Simple Matcher Basic Usage
SimpleMatcher supports the following operations:
- Check if a text matches.
- Perform simple processing.
- Perform batch processing using a list.
- Perform batch processing using a NumPy array.
Contributing
Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.
License
matcher_py is licensed under the MIT OR Apache-2.0 license.
For more details, visit the GitHub repository.