matcher_py 0.7.2

A high-performance matcher designed to solve LOGICAL and TEXT VARIATIONS problems in word matching, implemented in Rust.
Documentation
matcher_py-0.7.2 has been yanked.

Matcher Rust Implementation with PyO3 Binding

PyPI - Version PyPI - Python Version PyPI - License

A high-performance matcher designed to solve LOGICAL and TEXT VARIATIONS problems in word matching, implemented in Rust with PyO3 bindings.

For detailed implementation, see the Design Document.

Features

  • Multiple Matching Methods:
    • Simple Word Matching
    • Regex-Based Matching
    • Similarity-Based Matching
  • Text Normalization:
    • Fanjian: Simplify traditional Chinese characters to simplified ones. Example: 蟲艸 -> 虫艹
    • Delete: Remove specific characters. Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
    • Normalize: Normalize special characters to identifiable characters. Example: 𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟! -> hello world!
    • PinYin: Convert Chinese characters to Pinyin for fuzzy matching. Example: 西安 -> xi an, matches 洗按 -> xi an, but not -> xian
    • PinYinChar: Convert Chinese characters to Pinyin. Example: 西安 -> xian, matches 洗按 and -> xian
  • AND OR NOT Word Matching:
    • Takes into account the number of repetitions of words.
    • Example: hello&world matches hello world and world,hello
    • Example: 无&法&无&天 matches 无无法天 (because is repeated twice), but not 无法天
    • Example: hello~helloo~hhello matches hello but not helloo and hhello
  • Customizable Exemption Lists: Exclude specific words from matching.
  • Efficient Handling of Large Word Lists: Optimized for performance.

Installation

Use pip

pip install matcher_py

Install pre-built binary

Visit the release page to download the pre-built binary.

Build from source

You need to have rust and maturin installed.

# Clone the repository
git clone https://github.com/Lips7/Matcher.git
cd Matcher/matcher_py

# Install maturin
pip install maturin

# Build and install the package
maturin develop --release

Usage

All relevant types are defined in extension_types.py.

Text Process Usage

Here’s an example of how to use the reduce_text_process and text_process functions:

from matcher_py import reduce_text_process, text_process
from matcher_py.extension_types import ProcessType

# Combine and reduce multiple transformations
print(reduce_text_process(ProcessType.MatchDeleteNormalize, "hello, world!"))
# Perform a single transformation
print(text_process(ProcessType.MatchDelete, "hello, world!"))

Matcher Basic Usage

Here’s an example of how to use the Matcher:

import json

from matcher_py import Matcher
from matcher_py.extension_types import MatchTable, MatchTableType, ProcessType, RegexMatchType, SimMatchType

matcher = Matcher(
    json.dumps({
        1: [
            MatchTable(
                table_id=1,
                match_table_type=MatchTableType.Simple(process_type=ProcessType.MatchFanjianDeleteNormalize),
                word_list=["hello", "world"],
                exemption_process_type=ProcessType.MatchNone,
                exemption_word_list=["word"],
            ),
            MatchTable(
                table_id=2,
                match_table_type=MatchTableType.Regex(
                    process_type=ProcessType.MatchFanjianDeleteNormalize,
                    regex_match_type=RegexMatchType.MatchRegex
                ),
                word_list=["h[aeiou]llo"],
                exemption_process_type=ProcessType.MatchNone,
                exemption_word_list=[],
            )
        ],
        2: [
            MatchTable(
                table_id=3,
                match_table_type=MatchTableType.Similar(
                    process_type=ProcessType.MatchFanjianDeleteNormalize,
                    sim_match_type=SimMatchType.MatchLevenshtein,
                    threshold=0.5
                ),
                word_list=["halxo"],
                exemption_process_type=ProcessType.MatchNone,
                exemption_word_list=[],
            )
        ]
    }).encode()
)
# Check if a text matches
assert matcher.is_match("hello")
assert not matcher.is_match("word")

# Perform process as a list
result = matcher.process("hello")
print(result)

# Perform word matching as a dict
match_result = matcher.word_match(r"hello, world")
print(match_result)

# Perform word matching as a string
result_str = matcher.word_match_as_string("hello")
print(result_str)

Simple Matcher Basic Usage

Here’s an example of how to use the SimpleMatcher:

import json

from matcher_py import SimpleMatcher
from matcher_py.extension_types import ProcessType

simple_matcher = SimpleMatcher(
    json.dumps(
        {
            ProcessType.MatchNone: {
                1: "hello&world",
                2: "word&word~hello"
            },
            ProcessType.MatchDelete: {
                3: "hallo"
            }
        }
    ).encode()
)
# Check if a text matches
assert simple_matcher.is_match("hello^&!#*#&!^#*()world")
# Perform simple processing
result = simple_matcher.process("hello,world,word,word,hallo")
print(result)

Explanation of the configuration

  • Matcher's configuration is defined by the MatchTableMap = Dict[int, List[MatchTable]] type, the key of MatchTableMap is called match_id, for each match_id, the table_id inside is required to be unique.
  • SimpleMatcher's configuration is defined by the SimpleTable = Dict[ProcessType, Dict[int, str]] type, the value Dict[int, str]'s key is called word_id, word_id is required to be globally unique.

MatchTable

  • table_id: The unique ID of the match table.
  • match_table_type: The type of the match table.
  • word_list: The word list of the match table.
  • exemption_process_type: The type of the exemption simple match.
  • exemption_word_list: The exemption word list of the match table.

For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.

MatchTableType

  • Simple: Supports simple multiple patterns matching with text normalization defined by process_type.
    • It can handle combination patterns and repeated times sensitive matching, delimited by & and ~, such as hello&world&hello will match hellohelloworld and worldhellohello, but not helloworld due to the repeated times of hello.
  • Regex: Supports regex patterns matching.
    • SimilarChar: Supports similar character matching using regex.
      • ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld!, hollowrd?, hi🌍~ ··· any combinations of the words split by , in the list.
    • Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
      • ["h,e,l,l,o", "你,好"] will match hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。.
    • Regex: Supports regex matching.
      • ["h[aeiou]llo", "w[aeiou]rd"] will match hello, world, hillo, wurld ··· any text that matches the regex in the list.
  • Similar: Supports similar text matching based on distance and threshold.
    • Levenshtein: Supports similar text matching based on Levenshtein distance.

ProcessType

  • None: No transformation.
  • Fanjian: Traditional Chinese to simplified Chinese transformation. Based on FANJIAN.
    • 妳好 -> 你好
    • 現⾝ -> 现身
  • Delete: Delete all punctuation, special characters and white spaces. Based on TEXT_DELETE and WHITE_SPACE.
    • hello, world! -> helloworld
    • 《你∷好》 -> 你好
  • Normalize: Normalize all English character variations and number variations to basic characters. Based on NORM and NUM_NORM.
    • ℋЀ⒈㈠Õ -> he11o
    • ⒈Ƨ㊂ -> 123
  • PinYin: Convert all unicode Chinese characters to pinyin with boundaries. Based on PINYIN.
    • 你好 -> ni hao
    • 西安 -> xi an
  • PinYinChar: Convert all unicode Chinese characters to pinyin without boundaries. Based on PINYIN.
    • 你好 -> nihao
    • 西安 -> xian

You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.

Avoid combining PinYin and PinYinChar due to that PinYin is a more limited version of PinYinChar, in some cases like xian, can be treat as two words xi and an, or only one word xian.

Contributing

Contributions to matcher_py are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.

License

matcher_py is licensed under the MIT OR Apache-2.0 license.

More Information

For more details, visit the GitHub repository.