# Matcher
A high-performance, multi-functional word matcher implemented in Rust.
## Features
- **Supports Multiple Matching Methods**:
  - Simple word matching
  - Regex-based matching
  - Similarity-based matching
- **Text Normalization Options**:
  - Fanjian (Simplify traditional Chinese characters to simplified ones)
  - Delete (Remove whitespace, punctuation, and other non-alphanumeric characters)
  - Normalize (Normalize special characters to identifiable characters)
  - PinYin (Convert Chinese characters to Pinyin with boundaries, for fuzzy matching)
  - PinYinChar (Convert Chinese characters to Pinyin without boundaries)
- **Combination and Repeated Word Matching**:
  - Handles combination and repetition of words with specified constraints.
## Usage
### Adding to Your Project
To use `matcher_rs` in your Rust project, add the following to your `Cargo.toml` file:
```toml
[dependencies]
matcher_rs = "*"
```
### Explanation of the configuration
* `Matcher`'s configuration is defined by the type `MatchTableMap = HashMap<u64, Vec<MatchTable>>`. The key of `MatchTableMap` is called `match_id`. Within each `match_id`, the `table_id` values **should be unique, but this is not required**.
* `SimpleMatcher`'s configuration is defined by the type `SimpleMatchTableMap = HashMap<SimpleMatchType, HashMap<u64, &'a str>>`. The key of the inner `HashMap<u64, &'a str>` is called `word_id`, and **`word_id` must be globally unique**.
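Because `word_id` must be globally unique across every `SimpleMatchType`, it can be worth validating a configuration before building a `SimpleMatcher`. The sketch below is not part of `matcher_rs`: `find_duplicate_word_ids` is a hypothetical helper, and a plain `&str` key stands in for `SimpleMatchType` so that the block stays self-contained.

```rust
use std::collections::{BTreeSet, HashMap, HashSet};

/// Hypothetical helper (not part of `matcher_rs`): returns every `word_id`
/// that appears under more than one key of the outer map.
/// The `&str` outer key is a stand-in for `SimpleMatchType`.
fn find_duplicate_word_ids(
    table: &HashMap<&str, HashMap<u64, &str>>,
) -> BTreeSet<u64> {
    let mut seen = HashSet::new();
    let mut duplicates = BTreeSet::new();
    for word_map in table.values() {
        for &word_id in word_map.keys() {
            // `insert` returns false when the id was already recorded.
            if !seen.insert(word_id) {
                duplicates.insert(word_id);
            }
        }
    }
    duplicates
}
```

An empty result means the configuration satisfies the uniqueness requirement. Any returned ID is used under at least two match types.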
#### MatchTable
* `table_id`: The unique ID of the match table.
* `match_table_type`: The type of the match table.
* `simple_match_type`: The type of the simple match **(only relevant if `match_table_type` is "simple")**.
* `word_list`: The word list of the match table.
* `exemption_simple_match_type`: The type of the exemption simple match.
* `exemption_word_list`: The exemption word list of the match table.
For each match table, word matching is performed over the `word_list`, and exemption word matching is performed over the `exemption_word_list`. If the exemption match succeeds, the table's match result is false, even when a word in `word_list` matched.
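The exemption rule can be sketched as a simple boolean combination. This is an illustration only: `TableSketch` is a hypothetical stand-in for a match table, and plain substring search replaces the library's real matching and text normalization.

```rust
/// Hypothetical stand-in for one match table. Plain substring search
/// replaces the library's matching and text normalization.
struct TableSketch<'a> {
    word_list: Vec<&'a str>,
    exemption_word_list: Vec<&'a str>,
}

impl TableSketch<'_> {
    /// A table matches when some word hits and no exemption word hits.
    fn is_match(&self, text: &str) -> bool {
        let word_hit = self.word_list.iter().any(|w| text.contains(*w));
        let exempt_hit = self
            .exemption_word_list
            .iter()
            .any(|w| text.contains(*w));
        word_hit && !exempt_hit
    }
}
```

With `word_list = ["apple"]` and `exemption_word_list = ["pineapple"]`, the text `an apple a day` matches, while `a pineapple a day` does not, because the exemption word is present.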
#### MatchTableType
* `Simple`: Supports simple multiple-pattern matching with text normalization defined by `simple_match_type`.
  * Transformation methods for text normalization include `Fanjian`, `Normalize`, `PinYin`, and others.
  * It can handle combination patterns and repeat-count-sensitive matching, with sub-words delimited by `,`. For example, `hello,world,hello` matches `hellohelloworld` and `worldhellohello`, but not `helloworld`, because `hello` must appear twice.
* `SimilarChar`: Supports similar character matching using regex.
  * `["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"]` will match `helloworld`, `hollowrd`, `hi🌍`, and any other combination of the words split by `,` in the list.
* `Acrostic`: Supports acrostic matching using regex **(currently only supports Chinese and simple English sentences)**.
  * `["h,e,l,l,o", "你,好"]` will match `hope, endures, love, lasts, onward.` and `你的笑容温暖, 好心情常伴。`.
* `SimilarTextLevenshtein`: Supports similar text matching based on Levenshtein distance **(threshold is 0.8)**.
  * `["helloworld"]` will match `helloworld`, `hellowrld`, `helloworld!`, and any other text similar to the words in the list.
* `Regex`: Supports regex matching.
  * `["h[aeiou]llo", "w[aeiou]rd"]` will match `hello`, `world`, `hillo`, `wurld`, and any other text that matches a regex in the list.
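To give a feel for the `SimilarTextLevenshtein` threshold, here is a minimal sketch of a normalized Levenshtein similarity. The normalization used here, `1 - distance / max(len)`, is an assumption and may differ from what the library computes; the function names are hypothetical.

```rust
/// Classic dynamic-programming Levenshtein distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the processed prefix of `a`
    // and the first j characters of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Assumed normalization: 1 - distance / max(len), in the range [0, 1].
fn similarity(a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / max_len as f64
}

/// Applies the 0.8 threshold mentioned above.
fn is_similar(a: &str, b: &str) -> bool {
    similarity(a, b) >= 0.8
}
```

Under this assumption, `hellowrld` is one edit away from `helloworld` over a maximum length of 10, giving a similarity of 0.9, which clears the threshold.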
#### SimpleMatchType
* `None`: No transformation.
* `Fanjian`: Traditional Chinese to simplified Chinese transformation.
  * `妳好` -> `你好`
  * `現⾝` -> `现身`
* `Delete`: Delete every character that is neither alphanumeric nor a Unicode Chinese character.
  * `hello, world!` -> `helloworld`
  * `《你∷好》` -> `你好`
* `Normalize`: Normalize all English character variations and number variations to basic characters.
  * `ℋЀ⒈㈠ϕ` -> `he11o`
  * `⒈Ƨ㊂` -> `123`
* `PinYin`: Convert all Unicode Chinese characters to pinyin with boundaries.
  * `你好` -> `␀ni␀␀hao␀`
  * `西安` -> `␀xi␀␀an␀`
* `PinYinChar`: Convert all Unicode Chinese characters to pinyin without boundaries.
  * `你好` -> `nihao`
  * `西安` -> `xian`
You can combine these transformations as needed. Pre-defined combinations like `DeleteNormalize` and `FanjianDeleteNormalize` are provided for convenience.
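Combining transformations amounts to applying them one after another. The sketch below illustrates the idea for a `DeleteNormalize`-style combination. It is not the library's implementation: the function names are hypothetical, the `normalize` table only contains the characters from the examples above, and the order in which the two steps are applied is an assumption.

```rust
/// Sketch of `Delete`: keep only alphanumeric characters. The check is
/// Unicode-aware, so Chinese characters are kept. The library's exact
/// character set may differ.
fn delete(text: &str) -> String {
    text.chars().filter(|c| c.is_alphanumeric()).collect()
}

/// Sketch of `Normalize` with a tiny table taken from the examples above.
/// Characters outside the table pass through unchanged.
fn normalize(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            'ℋ' => 'h',
            'Ѐ' => 'e',
            '⒈' | '㈠' => '1',
            'Ƨ' => '2',
            '㊂' => '3',
            'ϕ' => 'o',
            other => other,
        })
        .collect()
}

/// Combined transformation: normalize first, then delete (assumed order).
fn delete_normalize(text: &str) -> String {
    delete(&normalize(text))
}
```

For instance, `delete_normalize("⒈, Ƨ, ㊂!")` first rewrites the text to `1, 2, 3!` and then strips the punctuation and spaces.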
Avoid combining `PinYin` and `PinYinChar`: `PinYin` is a more restrictive version of `PinYinChar`. Without boundaries, a string such as `xian` can be read either as the two syllables `xi` and `an` or as the single syllable `xian`.
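The difference between the two pinyin modes can be illustrated with a toy conversion table. Everything here is a sketch under stated assumptions: the five-entry table and the function names are hypothetical, and the boundary marker is written as the visible `␀` symbol used in the examples above rather than whatever character the library uses internally.

```rust
/// Boundary marker, written as the visible symbol from the examples above.
const BOUNDARY: char = '␀';

/// Tiny hypothetical pinyin table. A real table covers far more characters.
fn pinyin_of(c: char) -> Option<&'static str> {
    match c {
        '你' => Some("ni"),
        '好' => Some("hao"),
        '西' => Some("xi"),
        '安' => Some("an"),
        '先' => Some("xian"),
        _ => None,
    }
}

/// `PinYin`-style conversion: each syllable is wrapped in boundary markers.
fn to_pinyin(text: &str) -> String {
    let mut out = String::new();
    for c in text.chars() {
        match pinyin_of(c) {
            Some(syllable) => {
                out.push(BOUNDARY);
                out.push_str(syllable);
                out.push(BOUNDARY);
            }
            None => out.push(c),
        }
    }
    out
}

/// `PinYinChar`-style conversion: syllables are concatenated without markers.
fn to_pinyin_char(text: &str) -> String {
    let mut out = String::new();
    for c in text.chars() {
        match pinyin_of(c) {
            Some(syllable) => out.push_str(syllable),
            None => out.push(c),
        }
    }
    out
}
```

With boundaries, `西安` and `先` stay distinct (`␀xi␀␀an␀` versus `␀xian␀`). Without boundaries, both collapse to `xian`, which is the ambiguity described above.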
### Limitations
Simple Match supports at most **32** combined words per entry (with more than 32, matching on all combined words is not guaranteed) and at most **8** repetitions of a word (repeat counts above 8 are treated as 8).
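To make "combined" and "repeated" concrete: in `hello,world,hello` there are two distinct sub-words, and `hello` is repeated twice. The sketch below models that semantics with plain substring counting and applies the repeat cap of 8 stated above. It is only an illustration, not the library's algorithm: the function names are hypothetical, non-overlapping counting is an assumption, and the 32-word limit is not modeled.

```rust
use std::collections::HashMap;

/// Maximum repeat count honoured per sub-word, per the limit stated above.
const MAX_REPEAT: usize = 8;

/// Parses a pattern such as `hello,world,hello` into required counts,
/// here {hello: 2, world: 1}, capping each count at `MAX_REPEAT`.
fn required_counts(pattern: &str) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for word in pattern.split(',').filter(|w| !w.is_empty()) {
        let count = counts.entry(word).or_insert(0);
        *count = (*count + 1).min(MAX_REPEAT);
    }
    counts
}

/// True when every sub-word occurs in `text` at least as often as required.
/// Occurrences are counted without overlap (an assumption of this sketch).
fn combination_match(pattern: &str, text: &str) -> bool {
    required_counts(pattern)
        .iter()
        .all(|(word, &needed)| text.matches(*word).count() >= needed)
}
```

Under this model, `hello,world,hello` matches `hellohelloworld` and `worldhellohello` regardless of order, but not `helloworld`, where `hello` occurs only once.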
### Basic Example
Here’s a basic example of how to use the `Matcher` struct for text matching:
```rust
use std::collections::HashMap;
use matcher_rs::{Matcher, MatchTableMap, MatchTable, MatchTableType, SimpleMatchType};

fn main() {
    // One match_id (1) holding a single Simple match table.
    let match_table_map: MatchTableMap = HashMap::from_iter(vec![
        (1, vec![MatchTable {
            table_id: 1,
            match_table_type: MatchTableType::Simple,
            simple_match_type: SimpleMatchType::FanjianDeleteNormalize,
            word_list: vec!["example", "test"],
            exemption_simple_match_type: SimpleMatchType::FanjianDeleteNormalize,
            exemption_word_list: vec![],
        }]),
    ]);
    let matcher = Matcher::new(&match_table_map);

    // Match the text against every configured table.
    let text = "This is an example text.";
    let results = matcher.word_match(text);
}
```
`SimpleMatcher` can also be used on its own:
```rust
use std::collections::HashMap;
use matcher_rs::{SimpleMatchType, SimpleMatcher};

fn main() {
    // Map each word_id to its word, then group the words by match type.
    let mut simple_match_type_word_map = HashMap::default();
    let mut simple_word_map = HashMap::default();
    simple_word_map.insert(1, "你好");
    simple_word_map.insert(2, "世界");
    simple_match_type_word_map.insert(SimpleMatchType::Fanjian, simple_word_map);
    let matcher = SimpleMatcher::new(&simple_match_type_word_map);

    let text = "你好,世界!";
    let results = matcher.process(text);
}
```
For more detailed usage examples, please refer to the [test.rs](./tests/test.rs) file.
## Benchmarks
The `matcher_rs` library includes benchmarks to measure the performance of the matcher. You can find the benchmarks in the [bench.rs](./benches/bench.rs) file. To run the benchmarks, use the following command:
```shell
cargo bench
```
## Contributing
Contributions to `matcher_rs` are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.
## License
`matcher_rs` is licensed under the MIT OR Apache-2.0 license.
## More Information
For more details, visit the [GitHub repository](https://github.com/Lips7/Matcher).