# Matcher
A high-performance matcher designed to solve **LOGICAL** and **TEXT VARIATIONS** problems in word matching, implemented in Rust.
For detailed implementation, see the [Design Document](../DESIGN.md).
## Features
- **Text Transformation**:
  - **Fanjian**: Simplify traditional Chinese characters to simplified ones.
    Example: `蟲艸` -> `虫艹`
  - **Delete**: Remove specific characters.
    Example: `*Fu&*iii&^%%*&kkkk` -> `Fuiiikkkk`
  - **Normalize**: Normalize special characters to identifiable characters.
    Example: `𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟!` -> `hello world!`
  - **PinYin**: Convert Chinese characters to Pinyin with syllable boundaries, for fuzzy matching.
    Example: `西安` -> ` xi an `, matches `洗按` -> ` xi an `, but not `先` -> ` xian `
  - **PinYinChar**: Convert Chinese characters to Pinyin without syllable boundaries.
    Example: `西安` -> `xian`, matches `洗按` and `先` -> `xian`
- **AND OR NOT Word Matching**:
  - Takes into account the number of repetitions of words.
  - Example: `hello&world` matches `hello world` and `world,hello`
  - Example: `无&法&无&天` matches `无无法天` (because `无` is repeated twice), but not `无法天`
  - Example: `hello~helloo~hhello` matches `hello` but not `helloo` or `hhello`
- **Efficient Handling of Large Word Lists**: Optimized for performance.
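
The `&` and `~` examples above can be read as a simple counting rule. The following is a minimal, std-only sketch of that rule, not the crate's actual implementation: `matches_pattern` is a hypothetical helper, it counts non-overlapping substring occurrences with `str::matches`, and it ignores text transformations and OR handling entirely.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of matcher_rs): returns true when `text`
/// satisfies `pattern`, where `&` joins required words and `~` introduces
/// forbidden words.
fn matches_pattern(pattern: &str, text: &str) -> bool {
    // Everything before the first `~` is the AND part; the rest are NOT words.
    let mut parts = pattern.split('~');
    let and_part = parts.next().unwrap_or("");

    // Any NOT word found in the text vetoes the match.
    if parts.any(|w| !w.is_empty() && text.contains(w)) {
        return false;
    }

    // Count how many times each AND word is required.
    let mut required: HashMap<&str, usize> = HashMap::new();
    for word in and_part.split('&').filter(|w| !w.is_empty()) {
        *required.entry(word).or_insert(0) += 1;
    }

    // Each AND word must occur at least as often as it is repeated.
    required
        .iter()
        .all(|(word, &count)| text.matches(*word).count() >= count)
}
```

Under these assumptions, `matches_pattern("无&法&无&天", "无无法天")` is true because `无` occurs twice, while `matches_pattern("无&法&无&天", "无法天")` is false.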
## Usage
### Adding to Your Project
To use `matcher_rs` in your Rust project, run the following command:
```shell
cargo add matcher_rs
```
### Explanation of the configuration
#### ProcessType
* `None`: No transformation.
* `Fanjian`: Traditional Chinese to simplified Chinese transformation. Based on [FANJIAN](./process_map/FANJIAN.txt).
    * `妳好` -> `你好`
    * `現⾝` -> `现身`
* `Delete`: Delete all punctuation, special characters and whitespace. Based on [TEXT_DELETE](./process_map/TEXT-DELETE.txt) and `WHITE_SPACE`.
    * `hello, world!` -> `helloworld`
    * `《你∷好》` -> `你好`
* `Normalize`: Normalize all English character variations and number variations to basic characters. Based on [NORM](./process_map/NORM.txt) and [NUM_NORM](./process_map/NUM-NORM.txt).
    * `ℋЀ⒈㈠Õ` -> `he11o`
    * `⒈Ƨ㊂` -> `123`
* `PinYin`: Convert all Unicode Chinese characters to pinyin with boundaries. Based on [PINYIN](./process_map/PINYIN.txt).
    * `你好` -> ` ni hao `
    * `西安` -> ` xi an `
* `PinYinChar`: Convert all Unicode Chinese characters to pinyin without boundaries. Based on [PINYIN](./process_map/PINYIN.txt).
    * `你好` -> `nihao`
    * `西安` -> `xian`
You can combine these transformations as needed. Pre-defined combinations like `DeleteNormalize` and `FanjianDeleteNormalize` are provided for convenience.
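
Conceptually, a combination applies several transformations to the same text. The sketch below illustrates that idea with toy stand-ins written against the standard library only. `toy_delete`, `toy_normalize` and `apply_all` are hypothetical names, the order of steps is chosen for illustration, and the real crate relies on its mapping tables rather than on these character-class rules.

```rust
/// Toy stand-in for `Delete`: keep only letters and digits.
/// (Assumption: the crate uses TEXT_DELETE and WHITE_SPACE tables instead.)
fn toy_delete(text: &str) -> String {
    text.chars().filter(|c| c.is_alphanumeric()).collect()
}

/// Toy stand-in for `Normalize`: lowercase folding only.
/// (Assumption: the crate uses the NORM and NUM_NORM tables instead.)
fn toy_normalize(text: &str) -> String {
    text.to_lowercase()
}

/// Apply each step in order, feeding the output of one into the next.
fn apply_all(steps: &[fn(&str) -> String], text: &str) -> String {
    steps.iter().fold(text.to_string(), |acc, step| step(&acc))
}
```

With these toy functions, `apply_all(&[toy_delete, toy_normalize], "Hello, World!")` yields `helloworld`.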
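Avoid combining `PinYin` and `PinYinChar`: `PinYin` is a more restrictive version of `PinYinChar`. Without boundaries, a string such as `xian` can be read either as the two syllables `xi` and `an` or as the single syllable `xian`, whereas `PinYin` keeps the syllable boundaries and distinguishes the two.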
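
To make the boundary difference concrete, here is a small std-only sketch. It assumes a three-syllable toy table in place of `PINYIN.txt`; `toy_pinyin` and `to_pinyin` are hypothetical names, and the spacing follows the examples in this README rather than the crate's internal representation.

```rust
/// Toy pinyin table covering only the characters used in this README.
fn toy_pinyin(c: char) -> Option<&'static str> {
    match c {
        '西' | '洗' => Some("xi"),
        '安' | '按' => Some("an"),
        '先' => Some("xian"),
        _ => None,
    }
}

/// Convert mapped characters to pinyin. With `keep_boundaries` set, each
/// syllable is surrounded by spaces (PinYin-like); otherwise syllables are
/// concatenated directly (PinYinChar-like).
fn to_pinyin(text: &str, keep_boundaries: bool) -> String {
    let mut out = String::new();
    for c in text.chars() {
        match toy_pinyin(c) {
            Some(syllable) => {
                // Open a boundary unless the previous syllable already closed one.
                if keep_boundaries && !out.ends_with(' ') {
                    out.push(' ');
                }
                out.push_str(syllable);
                if keep_boundaries {
                    out.push(' ');
                }
            }
            // Characters outside the toy table pass through unchanged.
            None => out.push(c),
        }
    }
    out
}
```

With boundaries, `西安` becomes ` xi an ` and `先` becomes ` xian `, so the two never match as substrings. Without boundaries both collapse to `xian`, which is why `PinYinChar` matches more loosely.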
### Basic Example
The examples below show the standalone text-processing functions first, followed by `SimpleMatcher` for text matching:
```rust
use matcher_rs::{text_process, reduce_text_process, ProcessType};

// Apply a single transformation to the text.
let result = text_process(ProcessType::Delete, "你好,世界!");
// Apply a pre-defined combination of transformations.
let results = reduce_text_process(ProcessType::FanjianDeleteNormalize, "你好,世界!");
```
```rust
use matcher_rs::{ProcessType, SimpleMatcherBuilder};

// Register each word with its process type and a numeric ID.
let matcher = SimpleMatcherBuilder::new()
    .add_word(ProcessType::Fanjian, 1, "你好")
    .add_word(ProcessType::Fanjian, 2, "世界")
    .build();

let text = "你好,世界!";
// Run the matcher over the text and collect the results.
let results = matcher.process(text);
```
For more detailed usage examples, please refer to the [test_simple_matcher.rs](./tests/test_simple_matcher.rs) file.
## Feature Flags
* `runtime_build`: Enable building the process matcher at runtime (increases build time).
* `dfa`: Use a Deterministic Finite Automaton (DFA) for matching. Offers better search speed but significantly higher memory consumption.
* `vectorscan`: Use Intel's Vectorscan (a fork of Hyperscan) for SIMD-accelerated matching. Offers the best performance but requires the Vectorscan library to be installed on the system.
### Feature Comparison & Recommendation
| Feature | Engine | Search Speed | Memory Usage | External Dependency | Best For |
| --- | --- | --- | --- | --- | --- |
| **Default** | Aho-Corasick (NFA) | Good | **Lowest** | None | General purpose, memory-constrained environments. |
| `dfa` | Aho-Corasick (DFA) | **Fast** | Highest | None | Speed-critical apps where external dependencies are a no-go. |
| `vectorscan` | Vectorscan (SIMD) | **Fastest** | Moderate | **Required** | High-throughput production systems requiring max performance. |
## Benchmarks
Benchmarked on **MacBook Air M4 (24GB RAM)**.
Test data: [CN_WORD_LIST_100000](../data/word_list/cn/cn_words_100000.txt) against [CN_HAYSTACK](../data/text/cn/西游记.txt) and [EN_WORD_LIST_100000](../data/word_list/en/en_words_100000.txt) against [EN_HAYSTACK](../data/text/en/sherlock.txt).
Full records are stored in [bench_records/](./bench_records/), named by commit hash. Latest: [6181849.txt](./bench_records/6181849.txt).
## Contributing
Contributions to `matcher_rs` are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.
## License
`matcher_rs` is licensed under the MIT OR Apache-2.0 license.
## More Information
For more details, visit the [GitHub repository](https://github.com/Lips7/Matcher).