matcher_rs 0.2.6

A high performance multiple functional word matcher

Matcher

A high-performance, multi-functional word matcher implemented in Rust.

Features

  • Supports Multiple Matching Methods:
    • Simple word matching
    • Regex-based matching
    • Similarity-based matching
  • Text Normalization Options:
    • Fanjian (Simplify traditional Chinese characters to simplified ones)
    • Delete (Remove whitespaces, punctuation, and non-alphanumeric characters)
    • Normalize (Normalize special characters to identifiable characters)
    • PinYin (Convert Chinese characters to Pinyin for fuzzy matching)
    • PinYinChar (Convert Chinese characters to Pinyin)
  • Combination and Repeated Word Matching:
    • Handles combination and repetition of words with specified constraints.

Usage

Adding to Your Project

To use matcher_rs in your Rust project, add the following to your Cargo.toml file:

[dependencies]
matcher_rs = "*"

Explanation of the Configuration

  1. Matcher's configuration is defined by the type MatchTableMap = HashMap<u64, Vec<MatchTable>>. Each key of MatchTableMap is called a match_id. Within a single match_id, each table_id should be unique, although this is not required.
  2. SimpleMatcher's configuration is defined by the type SimpleMatchTableMap = HashMap<SimpleMatchType, HashMap<u64, &'a str>>. Each key of the inner HashMap<u64, &'a str> is called a word_id, and every word_id must be globally unique.
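
Because word_id must be globally unique across all SimpleMatchType entries, it can be worth checking a map before building a matcher from it. The following is a minimal sketch of such a check. Kind is a hypothetical stand-in for SimpleMatchType so that the snippet compiles without the crate, and find_duplicate_word_id is an illustrative helper, not part of matcher_rs.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-in for matcher_rs's SimpleMatchType, so this sketch
// compiles on its own.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Kind {
    None,
    Fanjian,
}

// Same shape as the SimpleMatchTableMap described above.
pub type WordMap<'a> = HashMap<Kind, HashMap<u64, &'a str>>;

// Illustrative helper (not a matcher_rs API): returns a word_id that
// appears under more than one match type, if there is one.
pub fn find_duplicate_word_id(map: &WordMap<'_>) -> Option<u64> {
    let mut seen = HashSet::new();
    for words in map.values() {
        for &word_id in words.keys() {
            // HashSet::insert returns false when the id was already present.
            if !seen.insert(word_id) {
                return Some(word_id);
            }
        }
    }
    None
}
```

A map that passes this check (the function returns None) satisfies the uniqueness requirement stated in item 2.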

MatchTable

  • table_id: The unique ID of the match table.
  • match_table_type: The type of the match table.
  • simple_match_type: The type of the simple match (only relevant if match_table_type is "simple").
  • word_list: The word list of the match table.
  • exemption_simple_match_type: The type of the exemption simple match.
  • exemption_word_list: The exemption word list of the match table.

For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching succeeds, the word matching result for that table is False, even when a word in word_list matched.
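
The exemption rule can be sketched in a few lines. This is an illustration of the logic only: plain substring search stands in for the real matching and normalization that matcher_rs performs, and table_matches is a hypothetical name.

```rust
// Illustrative only: substring search replaces the crate's real matching.
// A table reports a match when some word matches and no exemption word does.
pub fn table_matches(text: &str, word_list: &[&str], exemption_word_list: &[&str]) -> bool {
    let hit = word_list.iter().any(|w| text.contains(*w));
    let exempt = exemption_word_list.iter().any(|w| text.contains(*w));
    hit && !exempt
}
```

With word_list ["example"] and exemption_word_list ["text"], the input "This is an example text." is not reported, because the exemption word is present.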

MatchTableType

  • Simple: Supports simple multiple patterns matching with text normalization defined by simple_match_type.
    • We offer transformation methods for text normalization, including Fanjian, Normalize, PinYin ···.
    • It can handle combination patterns and repetition-sensitive matching, with the parts of a pattern delimited by ,. For example, hello,world,hello will match hellohelloworld and worldhellohello, but not helloworld, because hello must appear twice.
  • SimilarChar: Supports similar character matching using regex.
    • ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld, hollowrd, hi🌍 ··· any combinations of the words split by , in the list.
  • Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
    • ["h,e,l,l,o", "你,好"] will match hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。.
  • SimilarTextLevenshtein: Supports similar text matching based on Levenshtein distance (threshold is 0.8).
    • ["helloworld"] will match helloworld, hellowrld, helloworld! ··· any similar text to the words in the list.
  • Regex: Supports regex matching.
    • ["h[aeiou]llo", "w[aeiou]rd"] will match hello, world, hillo, wurld ··· any text that matches the regex in the list.
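
The SimilarTextLevenshtein entry above gives a threshold of 0.8 but not the formula behind it. The sketch below assumes a normalized similarity of 1 - distance / (length of the longer string), which is a common choice; the exact computation used by matcher_rs is not confirmed here, and the function names are illustrative.

```rust
// Classic dynamic-programming Levenshtein distance over Unicode scalar values.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the processed prefix of a and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = cur[j] + 1;
            cur.push(substitution.min(deletion).min(insertion));
        }
        prev = cur;
    }
    prev[b.len()]
}

// Assumed normalization: 1 - distance / length of the longer string.
pub fn similarity(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / longest as f64
}

// Threshold taken from the description above.
pub fn is_similar(text: &str, word: &str) -> bool {
    similarity(text, word) >= 0.8
}
```

Under this assumption, hellowrld is one edit away from helloworld (similarity 0.9) and passes the threshold, while hello (similarity 0.5) does not.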

SimpleMatchType

  • None: No transformation.
  • Fanjian: Traditional Chinese to simplified Chinese transformation.
    • 妳好 -> 你好
    • 現⾝ -> 现身
  • Delete: Delete all characters that are neither alphanumeric nor Unicode Chinese characters.
    • hello, world! -> helloworld
    • 《你∷好》 -> 你好
  • Normalize: Normalize all English character variations and number variations to basic characters.
    • ℋЀ⒈㈠ϕ -> he11o
    • ⒈Ƨ㊂ -> 123
  • PinYin: Convert all Unicode Chinese characters to pinyin with boundaries.
    • 你好 -> ␀ni␀␀hao␀
    • 西安 -> ␀xi␀␀an␀
  • PinYinChar: Convert all Unicode Chinese characters to pinyin without boundaries.
    • 你好 -> nihao
    • 西安 -> xian

You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.
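
To make the idea of combining transformations concrete, here is a toy sketch that chains a Fanjian-like step with a Delete-like step. Everything in it is an assumption for illustration: toy_fanjian covers only two characters, toy_delete approximates Delete with Rust's char::is_alphanumeric (which also keeps CJK ideographs), and the order in which matcher_rs applies combined transformations is not stated in this document.

```rust
// Toy stand-in for Fanjian: a two-entry table, nowhere near the real mapping.
pub fn toy_fanjian(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            '妳' => '你',
            '現' => '现',
            other => other,
        })
        .collect()
}

// Approximation of Delete: keep only characters that Rust classifies as
// alphanumeric, which includes CJK ideographs.
pub fn toy_delete(text: &str) -> String {
    text.chars().filter(|c| c.is_alphanumeric()).collect()
}

// Applies the steps left to right. The order used inside matcher_rs is an
// assumption of this sketch.
pub fn apply_all(text: &str, steps: &[fn(&str) -> String]) -> String {
    steps.iter().fold(text.to_string(), |acc, step| step(&acc))
}
```

Passing the two toy steps to apply_all turns 《妳∷好》 into 你好: the first step rewrites 妳, and the second drops the brackets and the ∷ symbol.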

Avoid combining PinYin and PinYinChar, because PinYin is a more limited version of PinYinChar. Without boundaries, a string such as xian can be treated either as two words, xi and an, or as a single word, xian.
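
The ambiguity can be seen with a toy conversion. The sketch below uses a three-entry pinyin table and '|' as a placeholder boundary marker; both are assumptions for illustration and do not reflect the crate's actual table or marker.

```rust
// Placeholder boundary marker; the marker matcher_rs emits is shown as ␀ in
// the examples above.
const BOUNDARY: char = '|';

// Toy pinyin table with three entries; the real mapping is far larger.
fn toy_pinyin(c: char) -> Option<&'static str> {
    match c {
        '西' => Some("xi"),
        '安' => Some("an"),
        '先' => Some("xian"),
        _ => None,
    }
}

// PinYin-style conversion: every syllable is wrapped in boundary markers.
pub fn with_boundaries(text: &str) -> String {
    let mut out = String::new();
    for c in text.chars() {
        match toy_pinyin(c) {
            Some(p) => {
                out.push(BOUNDARY);
                out.push_str(p);
                out.push(BOUNDARY);
            }
            None => out.push(c),
        }
    }
    out
}

// PinYinChar-style conversion: syllables are concatenated directly.
pub fn without_boundaries(text: &str) -> String {
    let mut out = String::new();
    for c in text.chars() {
        match toy_pinyin(c) {
            Some(p) => out.push_str(p),
            None => out.push(c),
        }
    }
    out
}
```

With boundaries, 西安 and 先 stay distinct (|xi||an| versus |xian|). Without boundaries, both collapse to xian, which is the two-words-or-one ambiguity described above.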

Limitations

  • Simple Match supports at most 32 combined words per pattern (beyond 32, matching of the additional combined words is not guaranteed) and at most 8 repetitions of a word (a word repeated more than 8 times is treated as repeated 8 times).
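
The combination and repetition rules that these limits apply to can be sketched as follows. This is one reading of the documented behavior, not the crate's implementation: counting is done with non-overlapping substring search, the cap of 8 is applied while parsing the pattern, the 32-word limit is not modeled, and both function names are hypothetical.

```rust
use std::collections::HashMap;

// Cap taken from the limit stated above; how matcher_rs enforces it
// internally is an assumption of this sketch.
const MAX_REPEAT: usize = 8;

// Parses "hello,world,hello" into {hello: 2, world: 1}, capping each count.
pub fn required_counts(pattern: &str) -> HashMap<&str, usize> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for word in pattern.split(',').filter(|w| !w.is_empty()) {
        let n = counts.entry(word).or_insert(0);
        *n = (*n + 1).min(MAX_REPEAT);
    }
    counts
}

// A text matches when every word occurs at least as often as required.
// Occurrences are counted without overlap and in any order.
pub fn combination_matches(text: &str, pattern: &str) -> bool {
    required_counts(pattern)
        .iter()
        .all(|(word, &need)| text.matches(*word).count() >= need)
}
```

Under this reading, hello,world,hello matches hellohelloworld and worldhellohello but not helloworld, and a pattern that repeats one word ten times requires only eight occurrences.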

Basic Example

Here’s a basic example of how to use the Matcher struct for text matching:

use std::collections::HashMap;
use matcher_rs::{Matcher, MatchTableMap, MatchTable, MatchTableType, SimpleMatchType};

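// match_id 1 owns a single Simple match table.
let match_table_map: MatchTableMap = HashMap::from_iter(vec![
    (1, vec![MatchTable {
        table_id: 1,
        match_table_type: MatchTableType::Simple,
        // Text normalization used when matching word_list.
        simple_match_type: SimpleMatchType::FanjianDeleteNormalize,
        word_list: vec!["example", "test"],
        // Text normalization used when matching exemption_word_list.
        exemption_simple_match_type: SimpleMatchType::FanjianDeleteNormalize,
        // No exemption words in this example.
        exemption_word_list: vec![],
    }]),
]);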
let matcher = Matcher::new(match_table_map);
let text = "This is an example text.";
let results = matcher.word_match(text);

Here’s a basic example of how to use the SimpleMatcher struct for text matching:

use std::collections::HashMap;
use matcher_rs::{SimpleMatchType, SimpleMatcher};

// Outer map: SimpleMatchType -> (word_id -> word).
let mut simple_match_type_word_map = HashMap::default();
let mut simple_word_map = HashMap::default();

// Each word_id must be globally unique.
simple_word_map.insert(1, "你好");
simple_word_map.insert(2, "世界");

simple_match_type_word_map.insert(SimpleMatchType::Fanjian, simple_word_map);

let matcher = SimpleMatcher::new(simple_match_type_word_map);
let text = "你好,世界!";
let results = matcher.process(text);

For more detailed usage examples, please refer to the test.rs file.

Benchmarks

The matcher_rs library includes benchmarks to measure the performance of the matcher. You can find the benchmarks in the bench.rs file. To run the benchmarks, use the following command:

cargo bench

Contributing

Contributions to matcher_rs are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.

License

matcher_rs is licensed under the MIT OR Apache-2.0 license.

More Information

For more details, visit the GitHub repository.