Matcher
A high-performance matcher, implemented in Rust, designed to handle logical combinations and text variations in word matching.
For detailed implementation, see the Design Document.
Features
- Multiple Matching Methods:
  - Simple Word Matching
  - Regex-Based Matching
  - Similarity-Based Matching
- Text Transformation:
  - Fanjian: Simplify traditional Chinese characters to simplified ones. Example: 蟲艸 -> 虫艹
  - Delete: Remove specific characters. Example: *Fu&*iii&^%%*&kkkk -> Fuiiikkkk
  - Normalize: Normalize special characters to identifiable characters. Example: 𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟! -> hello world!
  - PinYin: Convert Chinese characters to Pinyin for fuzzy matching. Example: 西安 -> xi an, matches 洗按 -> xi an, but not 先 -> xian
  - PinYinChar: Convert Chinese characters to Pinyin. Example: 西安 -> xian, matches 洗按 and 先 -> xian
- AND OR NOT Word Matching:
  - Takes into account the number of repetitions of words.
  - Example: hello&world matches hello world and world,hello
  - Example: 无&法&无&天 matches 无无法天 (because 无 is repeated twice), but not 无法天
  - Example: hello~helloo~hhello matches hello but not helloo and hhello
- Customizable Exemption Lists: Exclude specific words from matching.
- Efficient Handling of Large Word Lists: Optimized for performance.
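The AND/NOT examples in the list above can be read as a small counting rule. The following is a minimal sketch of that rule using only the standard library; the function name matches_combination and the exact semantics are assumptions inferred from the examples, not the crate's implementation.

```rust
use std::collections::HashMap;

/// Returns true when `text` satisfies `pattern`.
/// Assumed semantics (inferred from the README examples, not from the crate source):
/// - every `&`-separated word before the first `~` must occur in `text`;
/// - a word listed n times must occur at least n times;
/// - any word after a `~` must not occur at all.
fn matches_combination(pattern: &str, text: &str) -> bool {
    let mut segments = pattern.split('~');
    let and_segment = segments.next().unwrap_or("");

    // NOT words: a single hit rejects the text.
    if segments.any(|w| !w.is_empty() && text.contains(w)) {
        return false;
    }

    // AND words: count how many times each word is required.
    let mut required: HashMap<&str, usize> = HashMap::new();
    for word in and_segment.split('&').filter(|w| !w.is_empty()) {
        *required.entry(word).or_insert(0) += 1;
    }

    // Non-overlapping occurrences in the text must reach the required count.
    !required.is_empty()
        && required
            .iter()
            .all(|(word, &times)| text.matches(*word).count() >= times)
}
```

Under these assumptions, matches_combination("无&法&无&天", "无无法天") holds while matches_combination("无&法&无&天", "无法天") does not, because 无 is required twice.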
Usage
Adding to Your Project
To use matcher_rs in your Rust project, run the following command:
cargo add matcher_rs
Explanation of the configuration
Matcher's configuration is built using MatcherBuilder and MatchTableBuilder. SimpleMatcher's configuration is built using SimpleMatcherBuilder. For each SimpleMatcher, the added word_id is required to be globally unique.
MatchTable
- table_id: The unique ID of the match table.
- match_table_type: The type of the match table.
- word_list: The word list of the match table.
- exemption_process_type: The type of the exemption simple match.
- exemption_word_list: The exemption word list of the match table.
For each match table, word matching is performed over the word_list, and exemption word matching is performed over the exemption_word_list. If the exemption word matching result is True, the word matching result will be False.
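The exemption rule can be illustrated with a small stand-alone sketch. TableSketch is a hypothetical type that only borrows the field names word_list and exemption_word_list from the list above; it uses plain substring checks and is not the crate's MatchTable.

```rust
/// Hypothetical stand-in for one match table; not the crate's MatchTable type.
struct TableSketch<'a> {
    word_list: Vec<&'a str>,
    exemption_word_list: Vec<&'a str>,
}

impl<'a> TableSketch<'a> {
    /// A text matches when some word hits and no exemption word hits.
    fn is_match(&self, text: &str) -> bool {
        let word_hit = self.word_list.iter().any(|w| text.contains(*w));
        let exempted = self.exemption_word_list.iter().any(|w| text.contains(*w));
        word_hit && !exempted
    }
}
```

With word_list set to ["hello"] and exemption_word_list set to ["hello world"], the text "say hello" matches, while "hello world" is exempted and does not.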
MatchTableType
- Simple: Supports simple multiple patterns matching with text normalization defined by process_type.
  - It can handle combination patterns and repeated times sensitive matching, delimited by & and ~, such as hello&world&hello will match hellohelloworld and worldhellohello, but not helloworld due to the repeated times of hello.
- Regex: Supports regex patterns matching.
  - SimilarChar: Supports similar character matching using regex.
    - ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld!, hollowrd?, hi🌍~ ··· any combinations of the words split by , in the list.
  - Acrostic: Supports acrostic matching using regex (currently only supports Chinese and simple English sentences).
    - ["h,e,l,l,o", "你,好"] will match hope, endures, love, lasts, onward. and 你的笑容温暖, 好心情常伴。.
  - Regex: Supports regex matching.
    - ["h[aeiou]llo", "w[aeiou]rd"] will match hello, world, hillo, wurld ··· any text that matches the regex in the list.
- Similar: Supports similar text matching based on distance and threshold.
  - Levenshtein: Supports similar text matching based on Levenshtein distance.
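To make "distance and threshold" concrete, here is a minimal sketch of Levenshtein-based similarity using only the standard library. The normalization (1 minus distance divided by the longer length) and the names similarity and is_similar are assumptions for illustration; the crate may compute its score differently.

```rust
/// Classic dynamic-programming Levenshtein distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Similarity in [0, 1], where 1.0 means identical.
/// This normalization is an assumption, not taken from the crate.
fn similarity(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / longest as f64
}

/// A text is considered similar when its score reaches the threshold.
fn is_similar(a: &str, b: &str, threshold: f64) -> bool {
    similarity(a, b) >= threshold
}
```

For example, hello and hallo differ by one substitution, so is_similar("hello", "hallo", 0.75) holds under this normalization, while hello and world do not pass the same threshold.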
ProcessType
- None: No transformation.
- Fanjian: Traditional Chinese to simplified Chinese transformation. Based on FANJIAN.
  - 妳好 -> 你好
  - 現⾝ -> 现身
- Delete: Delete all punctuation, special characters and white spaces. Based on TEXT_DELETE and WHITE_SPACE.
  - hello, world! -> helloworld
  - 《你∷好》 -> 你好
- Normalize: Normalize all English character variations and number variations to basic characters. Based on NORM and NUM_NORM.
  - ℋЀ⒈㈠Õ -> he11o
  - ⒈Ƨ㊂ -> 123
- PinYin: Convert all unicode Chinese characters to pinyin with boundaries. Based on PINYIN.
  - 你好 -> ni hao
  - 西安 -> xi an
- PinYinChar: Convert all unicode Chinese characters to pinyin without boundaries. Based on PINYIN.
  - 你好 -> nihao
  - 西安 -> xian
You can combine these transformations as needed. Pre-defined combinations like DeleteNormalize and FanjianDeleteNormalize are provided for convenience.
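Combining transformations amounts to feeding the output of one step into the next. The sketch below shows that idea with two simplified, hypothetical steps: Delete is approximated by keeping only letters and digits, and Lowercase is a small stand-in for Normalize. Neither uses the crate's actual tables.

```rust
/// Hypothetical, simplified transformation steps (not the crate's ProcessType).
#[derive(Clone, Copy)]
enum Step {
    /// Drop everything that is not a letter or digit (approximates Delete).
    Delete,
    /// ASCII-lowercase the text (a small stand-in for Normalize).
    Lowercase,
}

/// Applies a single step to the text.
fn apply(step: Step, text: &str) -> String {
    match step {
        Step::Delete => text.chars().filter(|c| c.is_alphanumeric()).collect(),
        Step::Lowercase => text.to_ascii_lowercase(),
    }
}

/// Applies the steps left to right, feeding each output into the next step.
fn apply_all(steps: &[Step], text: &str) -> String {
    steps
        .iter()
        .fold(text.to_string(), |acc, &step| apply(step, &acc))
}
```

Under these assumptions, apply(Step::Delete, "hello, world!") yields helloworld, and chaining Delete with Lowercase on *Fu&*iii&^%%*&kkkk yields fuiiikkkk.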
Avoid combining PinYin and PinYinChar, because PinYin is a more limited version of PinYinChar. In some cases, such as xian, the text can be treated either as two words (xi and an) or as a single word (xian).
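The difference between the two pinyin modes comes down to whether syllable boundaries survive. Here is a minimal sketch with a tiny hypothetical lookup table (the function names and the five-character table are assumptions; the real PINYIN table is far larger):

```rust
/// Tiny hypothetical lookup table covering only the characters used in this README.
fn pinyin_of(c: char) -> Option<&'static str> {
    match c {
        '西' | '洗' => Some("xi"),
        '安' | '按' => Some("an"),
        '先' => Some("xian"),
        _ => None,
    }
}

/// PinYin-style: one syllable per character, separated by spaces.
/// Characters missing from the table are kept unchanged.
fn to_pinyin(text: &str) -> String {
    text.chars()
        .map(|c| pinyin_of(c).map_or_else(|| c.to_string(), str::to_string))
        .collect::<Vec<_>>()
        .join(" ")
}

/// PinYinChar-style: syllables concatenated without boundaries.
fn to_pinyin_char(text: &str) -> String {
    text.chars()
        .map(|c| pinyin_of(c).map_or_else(|| c.to_string(), str::to_string))
        .collect()
}
```

With boundaries, 西安 and 洗按 both become xi an, which does not occur inside xian (the result for 先). Without boundaries, 西安 and 先 both collapse to xian, so they match each other.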
Basic Example
Here’s a basic example of how to use the Matcher struct for text matching:
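// The argument values in these snippets are illustrative assumptions; check the crate documentation for exact signatures.
use matcher_rs::{text_process, reduce_text_process, ProcessType};

let result = text_process(ProcessType::Delete, "你好,世界!");
let result = reduce_text_process(ProcessType::FanjianDeleteNormalize, "你好,世界!");

use matcher_rs::{MatchTableBuilder, MatchTableType, MatcherBuilder, ProcessType};

let table = MatchTableBuilder::new(1, MatchTableType::Simple { process_type: ProcessType::None })
    .add_words(["hello", "world"])
    .build();
let matcher = MatcherBuilder::new()
    .add_table(1, table)
    .build();
let text = "This is an example text.";
let results = matcher.word_match(text);

use matcher_rs::{ProcessType, SimpleMatcherBuilder};

let matcher = SimpleMatcherBuilder::new()
    .add_word(ProcessType::None, 1, "你好")
    .add_word(ProcessType::None, 2, "世界")
    .build();
let text = "你好,世界!";
let results = matcher.process(text);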
For more detailed usage examples, please refer to the test.rs file.
Feature Flags
- runtime_build: When the runtime_build feature is enabled, the process matchers are built at runtime, at the cost of longer build time.
- dfa: When the dfa feature is enabled, simple matching uses a DFA, at the cost of significantly higher memory consumption.
The default feature is dfa.
Benchmarks
Bench against pairs (CN_WORD_LIST_100000, CN_HAYSTACK) and (EN_WORD_LIST_100000, EN_HAYSTACK). Word selection is totally random.
The matcher_rs library includes benchmarks to measure the performance of the matcher. You can find the benchmarks in the bench.rs file. To run the benchmarks, use the following command:
cargo bench
Current default simple match type: ProcessType(None)
Current default simple word map size: 10000
Current default combined times: 3
Timer precision: 41 ns
bench                                 fastest  │ slowest  │ median   │ mean     │ samples │ iters
├─ build                                       │          │          │          │         │
│  ├─ cn_by_combinations                       │          │          │          │         │
│  │  ├─ 1                            7.761 ms │ 11.14 ms │ 8.053 ms │ 8.153 ms │ 100     │ 100
│  │  ├─ 3                            25.6 ms  │ 59.3 ms  │ 28.03 ms │ 29.63 ms │ 100     │ 100
│  │  ╰─ 5                            44.68 ms │ 74.26 ms │ 47.95 ms │ 49.66 ms │ 100     │ 100
│  ├─ cn_by_process_type                       │          │          │          │         │
│  │  ├─ "delete"                     25.37 ms │ 45.72 ms │ 26.11 ms │ 26.57 ms │ 100     │ 100
│  │  ├─ "fanjian"                    25.69 ms │ 55.01 ms │ 27.2 ms  │ 27.64 ms │ 100     │ 100
│  │  ├─ "fanjian_delete_normalize"   25.96 ms │ 48.89 ms │ 27.3 ms  │ 27.88 ms │ 100     │ 100
│  │  ╰─ "none"                       25.94 ms │ 62.33 ms │ 28.24 ms │ 29.9 ms  │ 100     │ 100
│  ├─ cn_by_size                               │          │          │          │         │
│  │  ├─ 1000                         2.261 ms │ 3.293 ms │ 2.311 ms │ 2.36 ms  │ 100     │ 100
│  │  ├─ 10000                        25.48 ms │ 28.64 ms │ 25.91 ms │ 25.96 ms │ 100     │ 100
│  │  ╰─ 50000                        105.3 ms │ 152.1 ms │ 109.2 ms │ 111.9 ms │ 45      │ 45
│  ├─ en_by_combinations                       │          │          │          │         │
│  │  ├─ 1                            9.651 ms │ 10.92 ms │ 9.956 ms │ 9.973 ms │ 100     │ 100
│  │  ├─ 3                            25.42 ms │ 40.48 ms │ 26.35 ms │ 26.62 ms │ 100     │ 100
│  │  ╰─ 5                            43.95 ms │ 73.28 ms │ 46.61 ms │ 48.27 ms │ 100     │ 100
│  ├─ en_by_process_type                       │          │          │          │         │
│  │  ├─ "delete"                     24.87 ms │ 31.21 ms │ 25.66 ms │ 25.9 ms  │ 100     │ 100
│  │  ├─ "delete_normalize"           25.72 ms │ 52.05 ms │ 26.59 ms │ 27.05 ms │ 100     │ 100
│  │  ╰─ "none"                       24.98 ms │ 41.02 ms │ 25.74 ms │ 26.04 ms │ 100     │ 100
│  ╰─ en_by_size                               │          │          │          │         │
│     ├─ 1000                         2.443 ms │ 3.13 ms  │ 2.56 ms  │ 2.575 ms │ 100     │ 100
│     ├─ 10000                        25.07 ms │ 45.75 ms │ 25.94 ms │ 26.23 ms │ 100     │ 100
│     ╰─ 50000                        120.6 ms │ 237.2 ms │ 126.1 ms │ 133.9 ms │ 38      │ 38
├─ search_match                                │          │          │          │         │
│  ├─ cn_by_combinations                       │          │          │          │         │
│  │  ├─ 1                            1.063 ms │ 1.203 ms │ 1.08 ms  │ 1.087 ms │ 100     │ 100
│  │  ├─ 3                            1.059 ms │ 1.147 ms │ 1.076 ms │ 1.079 ms │ 100     │ 100
│  │  ╰─ 5                            1.066 ms │ 1.152 ms │ 1.09 ms  │ 1.093 ms │ 100     │ 100
│  ├─ cn_by_process_type                       │          │          │          │         │
│  │  ├─ "delete"                     27.72 ms │ 41.91 ms │ 28.84 ms │ 29.35 ms │ 100     │ 100
│  │  ├─ "fanjian"                    18.27 ms │ 32.56 ms │ 19.08 ms │ 19.51 ms │ 100     │ 100
│  │  ├─ "fanjian_delete_normalize"   46.26 ms │ 62.02 ms │ 47.74 ms │ 48.44 ms │ 100     │ 100
│  │  ╰─ "none"                       16.23 ms │ 25.45 ms │ 17.66 ms │ 18.2 ms  │ 100     │ 100
│  ├─ cn_by_size                               │          │          │          │         │
│  │  ├─ 1000                         4.043 ms │ 5.396 ms │ 4.145 ms │ 4.214 ms │ 100     │ 100
│  │  ├─ 10000                        16.21 ms │ 29.69 ms │ 17.21 ms │ 17.59 ms │ 100     │ 100
│  │  ╰─ 50000                        73.82 ms │ 99.08 ms │ 78.3 ms  │ 79.59 ms │ 63      │ 63
│  ├─ en_by_combinations                       │          │          │          │         │
│  │  ├─ 1                            1.291 ms │ 1.461 ms │ 1.313 ms │ 1.318 ms │ 100     │ 100
│  │  ├─ 3                            1.903 ms │ 2.698 ms │ 1.938 ms │ 1.971 ms │ 100     │ 100
│  │  ╰─ 5                            2.461 ms │ 3.752 ms │ 2.649 ms │ 2.613 ms │ 100     │ 100
│  ├─ en_by_process_type                       │          │          │          │         │
│  │  ├─ "delete"                     6.257 ms │ 7.893 ms │ 6.38 ms  │ 6.419 ms │ 100     │ 100
│  │  ├─ "delete_normalize"           8.506 ms │ 16.98 ms │ 8.736 ms │ 8.926 ms │ 100     │ 100
│  │  ╰─ "none"                       1.915 ms │ 3.54 ms  │ 1.972 ms │ 2.048 ms │ 100     │ 100
│  ╰─ en_by_size                               │          │          │          │         │
│     ├─ 1000                         1.058 ms │ 1.244 ms │ 1.081 ms │ 1.086 ms │ 100     │ 100
│     ├─ 10000                        1.912 ms │ 2.699 ms │ 1.955 ms │ 1.984 ms │ 100     │ 100
│     ╰─ 50000                        4.652 ms │ 7.093 ms │ 5.15 ms  │ 5.155 ms │ 100     │ 100
╰─ search_no_match                             │          │          │          │         │
   ├─ cn_by_combinations                       │          │          │          │         │
   │  ├─ 1                            736.7 µs │ 865.3 µs │ 761.8 µs │ 766.9 µs │ 100     │ 100
   │  ├─ 3                            749.9 µs │ 818.6 µs │ 770.8 µs │ 774.1 µs │ 100     │ 100
   │  ╰─ 5                            737.8 µs │ 801.4 µs │ 755.4 µs │ 755.8 µs │ 100     │ 100
   ├─ cn_by_process_type                       │          │          │          │         │
   │  ├─ "delete"                     5.502 ms │ 5.986 ms │ 5.551 ms │ 5.57 ms  │ 100     │ 100
   │  ├─ "fanjian"                    1.33 ms  │ 1.398 ms │ 1.348 ms │ 1.352 ms │ 100     │ 100
   │  ├─ "fanjian_delete_normalize"   9.571 ms │ 13.16 ms │ 9.693 ms │ 9.823 ms │ 100     │ 100
   │  ╰─ "none"                       311.4 µs │ 344 µs   │ 318 µs   │ 319.9 µs │ 100     │ 100
   ├─ cn_by_size                               │          │          │          │         │
   │  ├─ 1000                         307.4 µs │ 379.1 µs │ 319.1 µs │ 322.9 µs │ 100     │ 100
   │  ├─ 10000                        308 µs   │ 350.2 µs │ 318.3 µs │ 321.2 µs │ 100     │ 100
   │  ╰─ 50000                        315.7 µs │ 1.691 ms │ 333.2 µs │ 481.2 µs │ 100     │ 100
   ├─ en_by_combinations                       │          │          │          │         │
   │  ├─ 1                            725 µs   │ 810.2 µs │ 741.7 µs │ 744.1 µs │ 100     │ 100
   │  ├─ 3                            738.3 µs │ 828.4 µs │ 758.2 µs │ 764.3 µs │ 100     │ 100
   │  ╰─ 5                            727 µs   │ 787.1 µs │ 739.6 µs │ 742.7 µs │ 100     │ 100
   ├─ en_by_process_type                       │          │          │          │         │
   │  ├─ "delete"                     3.816 ms │ 4.719 ms │ 3.869 ms │ 3.89 ms  │ 100     │ 100
   │  ├─ "delete_normalize"           5.224 ms │ 5.965 ms │ 5.38 ms  │ 5.416 ms │ 100     │ 100
   │  ╰─ "none"                       728.2 µs │ 776.1 µs │ 745.9 µs │ 746.8 µs │ 100     │ 100
   ╰─ en_by_size                               │          │          │          │         │
      ├─ 1000                         729.3 µs │ 818.7 µs │ 743.9 µs │ 748.7 µs │ 100     │ 100
      ├─ 10000                        743.1 µs │ 828.3 µs │ 763.1 µs │ 769.6 µs │ 100     │ 100
      ╰─ 50000                        731.1 µs │ 783.2 µs │ 743.4 µs │ 747.2 µs │ 100     │ 100
Contributing
Contributions to matcher_rs are welcome! If you find a bug or have a feature request, please open an issue on the GitHub repository. If you would like to contribute code, please fork the repository and submit a pull request.
License
matcher_rs is licensed under the MIT OR Apache-2.0 license.
More Information
For more details, visit the GitHub repository.