Matcher
A high-performance matcher designed to solve LOGICAL and TEXT VARIATIONS problems in word matching, implemented in Rust.
For detailed implementation, see the Design Document.
Features
- Text Transformation:
  - Fanjian: Simplify traditional Chinese characters to simplified ones. Example: 蟲艸 -> 虫艹
  - Delete: Remove specific characters. Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
  - Normalize: Normalize special characters to identifiable characters. Example: 𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟! -> hello world!
  - PinYin: Convert Chinese characters to Pinyin for fuzzy matching. Example: 西安 -> xi an, matches 洗按 -> xi an, but not 先 -> xian
  - PinYinChar: Convert Chinese characters to Pinyin. Example: 西安 -> xian, matches 洗按 and 先 -> xian
- AND OR NOT Word Matching:
  - Takes into account the number of repetitions of words.
  - Example: hello&world matches hello world and world,hello
  - Example: 无&法&无&天 matches 无无法天 (because 无 is repeated twice), but not 无法天
  - Example: hello~helloo~hhello matches hello but not helloo and hhello
- Efficient Handling of Large Word Lists: Optimized for performance.
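The AND and NOT semantics listed above, including the repetition rule, can be illustrated with a minimal sketch. It uses only the standard library and naive substring counting, not matcher_rs itself. The function name rule_matches and the parsing convention (everything before the first ~ is the &-joined required part, and each later ~ segment is one excluded word) are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical helper: returns true when `text` satisfies `rule`.
/// Assumed syntax: `a&b~c~d` means "a AND b, but NOT c and NOT d".
fn rule_matches(rule: &str, text: &str) -> bool {
    let mut segments = rule.split('~');
    let required_part = segments.next().unwrap_or("");

    // A word repeated in the rule must occur at least that many times in the text.
    let mut required: HashMap<&str, usize> = HashMap::new();
    for word in required_part.split('&').filter(|w| !w.is_empty()) {
        *required.entry(word).or_insert(0) += 1;
    }

    // Non-overlapping occurrence count per required word.
    let all_required_present = required
        .iter()
        .all(|(word, &count)| text.matches(*word).count() >= count);

    // Any excluded word found in the text rejects the match.
    let none_excluded = segments
        .filter(|w| !w.is_empty())
        .all(|word| !text.contains(word));

    all_required_present && none_excluded
}

fn main() {
    println!("{}", rule_matches("hello&world", "world,hello")); // true
    println!("{}", rule_matches("无&法&无&天", "无法天")); // false
    println!("{}", rule_matches("hello~helloo~hhello", "helloo")); // false
}
```

This scan is linear in the number of rule words per text, so it only demonstrates the semantics; it says nothing about how the crate achieves its performance.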
Usage
Adding to Your Project
To use matcher_rs in your Rust project, run the following command:
cargo add matcher_rs
Explanation of the configuration
ProcessType
- None: No transformation.
- Fanjian: Traditional Chinese to simplified Chinese transformation. Based on FANJIAN.
  - 妳好 -> 你好
  - 現⾝ -> 现身
- Delete: Delete all punctuation, special characters and white spaces. Based on TEXT_DELETE and WHITE_SPACE.
  - hello, world! -> helloworld
  - 《你∷好》 -> 你好
- Normalize: Normalize all English character variations and number variations to basic characters. Based on NORM and NUM_NORM.
  - ℋЀ⒈㈠Õ -> he11o
  - ⒈Ƨ㊂ -> 123
- PinYin: Convert all unicode Chinese characters to pinyin with boundaries. Based on PINYIN.
  - 你好 -> ni hao
  - 西安 -> xi an
- PinYinChar: Convert all unicode Chinese characters to pinyin without boundaries. Based on PINYIN.
  - 你好 -> nihao
  - 西安 -> xian
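To make the difference between Delete, PinYin and PinYinChar concrete, here is a small standard-library sketch. It is not the crate's implementation: the five-entry pinyin table is a hypothetical stand-in for PINYIN, the alphanumeric filter only approximates the TEXT_DELETE and WHITE_SPACE tables, and the function names delete and to_pinyin are made up for this example.

```rust
use std::collections::HashMap;

/// Delete-style transform: keep only alphanumeric characters.
/// (Assumption: a simple filter standing in for the crate's delete tables.)
fn delete(text: &str) -> String {
    text.chars().filter(|c| c.is_alphanumeric()).collect()
}

/// Tiny hypothetical pinyin table covering only the characters used below.
fn pinyin_table() -> HashMap<char, &'static str> {
    HashMap::from([
        ('西', "xi"),
        ('安', "an"),
        ('洗', "xi"),
        ('按', "an"),
        ('先', "xian"),
    ])
}

/// With `keep_boundaries` the syllables stay separated by a space (PinYin-like);
/// without it they are concatenated (PinYinChar-like).
fn to_pinyin(text: &str, keep_boundaries: bool) -> String {
    let table = pinyin_table();
    let mut out = String::new();
    for c in text.chars() {
        match table.get(&c) {
            Some(syllable) => {
                if keep_boundaries && !out.is_empty() {
                    out.push(' ');
                }
                out.push_str(syllable);
            }
            // Characters outside the table pass through unchanged.
            None => out.push(c),
        }
    }
    out
}

fn main() {
    println!("{}", delete("hello, world!")); // helloworld
    println!("{}", to_pinyin("西安", true)); // xi an
    println!("{}", to_pinyin("西安", false)); // xian
}
```

With boundaries kept, 西安 and 先 produce different strings; without boundaries both collapse to xian, which is why the boundary-free form matches more loosely.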
You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.
Including None in a composite ProcessType keeps the raw-text path alongside transformed
variants. For example, ProcessType::None | ProcessType::PinYin allows one part of a rule to
match the original text while another part matches the Pinyin-transformed text.
Avoid combining PinYin and PinYinChar, because PinYin is a more limited version of PinYinChar: a string such as xian can be treated either as two words (xi and an) or as a single word (xian).
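The idea of combining process types with | can be sketched with a hand-rolled flag type. Everything here is an assumption for illustration: SketchType, its bit values and the variants helper are hypothetical and do not reflect the real ProcessType definition. The sketch only shows why including a None-like flag keeps the raw text available next to a transformed variant.

```rust
use std::ops::BitOr;

/// Hypothetical flag type; the bit values are made up for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SketchType(u8);

impl SketchType {
    const NONE: SketchType = SketchType(0b01);
    const DELETE: SketchType = SketchType(0b10);

    /// True when every bit of `other` is set in `self`.
    fn contains(self, other: SketchType) -> bool {
        (self.0 & other.0) == other.0
    }
}

impl BitOr for SketchType {
    type Output = SketchType;

    fn bitor(self, rhs: SketchType) -> SketchType {
        SketchType(self.0 | rhs.0)
    }
}

/// Collect the text variants a rule with this composite type could match against.
fn variants(text: &str, kind: SketchType) -> Vec<String> {
    let mut out = Vec::new();
    if kind.contains(SketchType::NONE) {
        // The raw-text path is kept only when the None-like flag is present.
        out.push(text.to_string());
    }
    if kind.contains(SketchType::DELETE) {
        out.push(text.chars().filter(|c| c.is_alphanumeric()).collect());
    }
    out
}

fn main() {
    let combined = SketchType::NONE | SketchType::DELETE;
    println!("{:?}", variants("hello, world!", combined));
    println!("{:?}", variants("hello, world!", SketchType::DELETE));
}
```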
Basic Example
Here’s a basic example of how to use the SimpleMatcher for text matching:
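use matcher_rs::{reduce_text_process, text_process, ProcessType};

// Illustrative arguments: a ProcessType and the input text (assumed call shape).
let result = text_process(ProcessType::Delete, "你好,世界!");
let results = reduce_text_process(ProcessType::FanjianDeleteNormalize, "你好,世界!");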
text_process returns only the final transformed text. reduce_text_process returns each
distinct intermediate variant along the pipeline. For shared-prefix multi-variant traversal,
SimpleMatcher uses the internal DAG helpers instead of recomputing each path independently.
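The distinction between "final text only" and "every distinct intermediate variant" can be sketched without the crate. The functions below are hypothetical stand-ins, not the real text_process or reduce_text_process: each step is a plain function, lowercase substitutes for a Normalize-like step, and including the untouched input as the first variant is an assumption.

```rust
/// Stand-in transformation steps (assumptions, not the crate's tables).
fn delete(text: &str) -> String {
    text.chars().filter(|c| c.is_alphanumeric()).collect()
}

fn lowercase(text: &str) -> String {
    text.to_lowercase()
}

/// Apply every step in order and return only the final text.
fn process_final(text: &str, steps: &[fn(&str) -> String]) -> String {
    let mut current = text.to_string();
    for step in steps {
        current = step(&current);
    }
    current
}

/// Apply every step in order and keep each distinct variant seen on the way,
/// starting with the input itself.
fn process_variants(text: &str, steps: &[fn(&str) -> String]) -> Vec<String> {
    let mut variants = vec![text.to_string()];
    let mut current = text.to_string();
    for step in steps {
        current = step(&current);
        if !variants.contains(&current) {
            variants.push(current.clone());
        }
    }
    variants
}

fn main() {
    let steps: [fn(&str) -> String; 2] = [delete, lowercase];
    println!("{}", process_final("Hello, World!", &steps)); // helloworld
    println!("{:?}", process_variants("Hello, World!", &steps));
}
```

A step that leaves the text unchanged adds no new variant, so an input that is already normalized yields a single entry.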
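use matcher_rs::{ProcessType, SimpleMatcherBuilder};

// Illustrative arguments: process type, word id and word (assumed call shape).
let matcher = SimpleMatcherBuilder::new()
    .add_word(ProcessType::Fanjian, 1, "你好")
    .add_word(ProcessType::Fanjian, 2, "世界")
    .build();
let text = "你好,世界!";
let results = matcher.process(text);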
For more detailed usage examples, please refer to the test_simple_matcher.rs file.
Feature Flags
- runtime_build: Build transformation tables from the source text maps at startup instead of embedding precompiled binaries.
- dfa: Use DFA-backed Aho-Corasick automata where applicable. This is enabled by default and improves speed at the cost of higher memory consumption.
- simd_runtime_dispatch: Enabled by default. Selects the best available transform kernel at runtime (AVX2 on x86-64, NEON on ARM64, portable fallback elsewhere).
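Runtime kernel selection of the kind described for simd_runtime_dispatch is commonly built on the standard library's CPU feature detection macros. The sketch below shows only that general pattern; the function best_kernel and the returned labels are hypothetical, and the crate's actual dispatch logic may differ.

```rust
/// Pick a kernel label for the current CPU (hypothetical helper).
fn best_kernel() -> &'static str {
    // On x86-64, probe for AVX2 at runtime.
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
    }
    // On ARM64, probe for NEON at runtime.
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return "neon";
        }
    }
    // Everything else uses the portable fallback.
    "portable"
}

fn main() {
    println!("selected kernel: {}", best_kernel());
}
```

Because the check happens at runtime, one binary can run on older CPUs and still use wider instructions where they exist.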
Feature Comparison & Recommendation
| Feature | Engine | Search Speed | Memory Usage | External Dependency | Best For |
|---|---|---|---|---|---|
| Default | Aho-Corasick (DFA) | Fast | High | None | General purpose use when extra memory is acceptable. |
| simd_runtime_dispatch | Runtime-selected transform kernels | Fastest preprocess | Neutral | None | Portable builds that should exploit the host CPU automatically. |
| --no-default-features | Aho-Corasick (Contiguous NFA) | Good | Lowest | None | Memory-constrained environments. |
| dfa | Aho-Corasick (DFA) | Fast | High | None | Explicitly enabling the default engine in custom feature sets. |
Benchmarks
Benchmarked on MacBook Air M4 (24GB RAM). Test data: CN_WORD_LIST_100000 against CN_HAYSTACK and EN_WORD_LIST_100000 against EN_HAYSTACK.
Full records are stored in bench_records/. Latest: latest.txt.
To compare two benchmark records:
python matcher_rs/scripts/compare_benchmarks.py \
"matcher_rs/bench_records/2026-03-10 12:22:24.txt" \
"matcher_rs/bench_records/2026-03-11 23:16:38.txt"
The script treats the first file as the baseline and prints two sections: Regression and Improvement, using median latency by default.
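The comparison idea (baseline first, median latency, then a Regression or Improvement verdict) can be sketched as follows. This is not a port of compare_benchmarks.py: the names median, Verdict and compare are hypothetical, and treating any difference as a verdict with no noise threshold is a simplifying assumption.

```rust
/// Median of a sample set; None for an empty slice.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Regression,
    Improvement,
    Unchanged,
}

/// Compare candidate latencies against the baseline by median.
/// Higher latency than the baseline counts as a regression.
fn compare(baseline: &[f64], candidate: &[f64]) -> Option<Verdict> {
    let base = median(baseline)?;
    let cand = median(candidate)?;
    Some(if cand > base {
        Verdict::Regression
    } else if cand < base {
        Verdict::Improvement
    } else {
        Verdict::Unchanged
    })
}

fn main() {
    let baseline = [10.0, 12.0, 11.0];
    let candidate = [9.0, 9.5, 10.0];
    println!("{:?}", compare(&baseline, &candidate)); // Some(Improvement)
}
```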
Contributing
Contributions to matcher_rs are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.
License
matcher_rs is licensed under the MIT OR Apache-2.0 license.
More Information
For more details, visit the GitHub repository.