zhconv-rs: Chinese Simplified/Traditional and Regional Phrase Conversion (中文简繁及地區詞轉換)
zhconv-rs converts Chinese text among traditional/simplified scripts or regional variants (e.g. zh-TW <-> zh-CN <-> zh-HK <-> zh-Hans <-> zh-Hant), backed by rulesets from MediaWiki/Wikipedia and OpenCC.
It uses the Aho-Corasick algorithm, so conversion time is linear in the combined length of the input text and the conversion rules (O(n+m)). It processes dozens of MiB of text per second.
🔗 Web app: https://zhconv.pages.dev (powered by WASM)
⚙️ CLI: cargo install zhconv-cli or check releases.
🦀 Rust crate: cargo add zhconv (check docs for examples)
🐍 Python package w/ wheels via PyO3: pip install zhconv-rs or pip install zhconv-rs-opencc (with rulesets from OpenCC)
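# > pip install zhconv_rs
# Convert with builtin rulesets:
from zhconv_rs import zhconv
assert zhconv("天干物燥 小心火烛", "zh-tw") == "天乾物燥 小心火燭"
assert zhconv("霧失樓臺，月迷津渡", "zh-hans") == "雾失楼台，月迷津渡"
# Passing True as the third argument enables parsing of inline MediaWiki conversion rules:
assert zhconv("《-{zh-hans:三个火枪手;zh-hant:三劍客;zh-tw:三劍客}-》是亞歷山大·仲馬的作品。", "zh-cn", True) == "《三个火枪手》是亚历山大·仲马的作品。"
assert zhconv("-{H|zh-cn:雾都孤儿;zh-tw:孤雛淚;zh-hk:苦海孤雛;zh-sg:雾都孤儿;zh-mo:苦海孤雛;}-《雾都孤儿》是查尔斯·狄更斯的作品。", "zh-tw", True) == "《孤雛淚》是查爾斯·狄更斯的作品。"
# Convert with custom rules:
from zhconv_rs import make_converter
assert make_converter(None, [("天", "地"), ("水", "火")])("甘肅天水") == "甘肅地火"
import io
convert = make_converter("zh-hans", io.StringIO("䖏 处\n罨畫 掩画"))  # or path to rule file
assert convert("秀州西去湖州近 幾䖏樓臺罨畫間") == "秀州西去湖州近 几处楼台掩画间"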
JS (Webpack): npm install zhconv or yarn add zhconv (WASM, instructions)
JS in browser: https://cdn.jsdelivr.net/npm/zhconv-web@latest (WASM)
Supported variants
| Target | Tag | Script | Description |
|---|---|---|---|
| Simplified Chinese / 简体中文 | zh-Hans | SC / 简 | Without substituting region-specific phrases. |
| Traditional Chinese / 繁體中文 | zh-Hant | TC / 繁 | Without substituting region-specific phrases. |
| Chinese (Taiwan) / 臺灣正體 | zh-TW | TC / 繁 | With Taiwan-specific phrases adapted. |
| Chinese (Hong Kong) / 香港繁體 | zh-HK | TC / 繁 | With Hong Kong-specific phrases adapted. |
| Chinese (Macau) / 澳门繁體 | zh-MO | TC / 繁 | Same as zh-HK for now. |
| Chinese (Mainland China) / 大陆简体 | zh-CN | SC / 简 | With mainland China-specific phrases adapted. |
| Chinese (Singapore) / 新加坡简体 | zh-SG | SC / 简 | Same as zh-CN for now. |
| Chinese (Malaysia) / 大马简体 | zh-MY | SC / 简 | Same as zh-CN for now. |
Note: zh-TW and zh-HK are based on zh-Hant, and zh-CN is based on zh-Hans. Currently, zh-MO shares its rulesets with zh-HK, and zh-MY and zh-SG share theirs with zh-CN, unless additional rules are manually configured.
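The relationships in the note above can be written down as a small lookup. This is only an illustrative sketch: the names BASE, ALIAS and ruleset_chain are hypothetical and not part of the zhconv-rs API, and the actual crate may organize its rulesets differently.

```python
# Hypothetical tables mirroring the note above; not zhconv-rs internals.
BASE = {"zh-TW": "zh-Hant", "zh-HK": "zh-Hant", "zh-CN": "zh-Hans"}
ALIAS = {"zh-MO": "zh-HK", "zh-SG": "zh-CN", "zh-MY": "zh-CN"}


def ruleset_chain(target):
    """Return the rulesets a target draws on, script-level ruleset first."""
    target = ALIAS.get(target, target)  # e.g. zh-MO currently reuses zh-HK
    chain = [target]
    while chain[0] in BASE:
        chain.insert(0, BASE[chain[0]])
    return chain


print(ruleset_chain("zh-MO"))    # ['zh-Hant', 'zh-HK']
print(ruleset_chain("zh-Hans"))  # ['zh-Hans']
```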
Performance
cargo bench on AMD EPYC 7B13 (GitPod), measured with v0.3:
load/zh2Hant time: [4.6368 ms 4.6862 ms 4.7595 ms]
load/zh2Hans time: [2.2670 ms 2.2891 ms 2.3138 ms]
load/zh2TW time: [4.7115 ms 4.7543 ms 4.8001 ms]
load/zh2HK time: [5.4438 ms 5.5474 ms 5.6573 ms]
load/zh2MO time: [4.9503 ms 4.9673 ms 4.9850 ms]
load/zh2CN time: [3.0809 ms 3.1046 ms 3.1323 ms]
load/zh2SG time: [3.0543 ms 3.0637 ms 3.0737 ms]
load/zh2MY time: [3.0514 ms 3.0640 ms 3.0787 ms]
zh2CN wikitext basic time: [385.95 µs 388.53 µs 391.39 µs]
zh2TW wikitext basic time: [393.70 µs 395.16 µs 396.89 µs]
zh2TW wikitext extended time: [1.5105 ms 1.5186 ms 1.5271 ms]
zh2CN 天乾物燥 time: [46.970 ns 47.312 ns 47.721 ns]
zh2TW data54k time: [200.72 µs 201.54 µs 202.41 µs]
zh2CN data54k time: [231.55 µs 232.86 µs 234.30 µs]
zh2Hant data689k time: [2.0330 ms 2.0513 ms 2.0745 ms]
zh2TW data689k time: [1.9710 ms 1.9790 ms 1.9881 ms]
zh2Hant data3185k time: [15.199 ms 15.260 ms 15.332 ms]
zh2TW data3185k time: [15.346 ms 15.464 ms 15.629 ms]
zh2TW data55m time: [329.54 ms 330.53 ms 331.58 ms]
is_hans data55k time: [404.73 µs 407.11 µs 409.59 µs]
infer_variant data55k time: [1.0468 ms 1.0515 ms 1.0570 ms]
is_hans data3185k time: [22.442 ms 22.589 ms 22.757 ms]
infer_variant data3185k time: [60.205 ms 60.412 ms 60.627 ms]
With the non-default opencc feature enabled:
load/zh2Hant time: [22.074 ms 22.338 ms 22.624 ms]
load/zh2Hans time: [2.7913 ms 2.8126 ms 2.8355 ms]
load/zh2TW time: [23.068 ms 23.286 ms 23.520 ms]
load/zh2HK time: [23.358 ms 23.630 ms 23.929 ms]
load/zh2MO time: [23.363 ms 23.627 ms 23.913 ms]
load/zh2CN time: [3.6778 ms 3.7222 ms 3.7722 ms]
load/zh2SG time: [3.6522 ms 3.6848 ms 3.7202 ms]
load/zh2MY time: [3.6642 ms 3.7079 ms 3.7545 ms]
zh2CN wikitext basic time: [396.17 µs 402.51 µs 409.36 µs]
zh2TW wikitext basic time: [442.16 µs 447.53 µs 453.27 µs]
zh2TW wikitext extended time: [1.5795 ms 1.6007 ms 1.6233 ms]
zh2CN 天乾物燥 time: [47.884 ns 48.878 ns 49.953 ns]
zh2TW data54k time: [255.25 µs 259.01 µs 262.92 µs]
zh2CN data54k time: [233.74 µs 236.99 µs 240.67 µs]
zh2Hant data689k time: [3.9696 ms 4.0005 ms 4.0327 ms]
zh2TW data689k time: [3.4593 ms 3.4896 ms 3.5203 ms]
zh2Hant data3185k time: [27.710 ms 27.955 ms 28.206 ms]
zh2TW data3185k time: [30.298 ms 30.858 ms 31.428 ms]
zh2TW data55m time: [500.95 ms 515.80 ms 531.34 ms]
is_hans data55k time: [461.22 µs 470.99 µs 481.20 µs]
infer_variant data55k time: [1.1669 ms 1.1759 ms 1.1852 ms]
is_hans data3185k time: [26.609 ms 26.964 ms 27.385 ms]
infer_variant data3185k time: [74.878 ms 76.262 ms 77.818 ms]
By default, only rulesets from MediaWiki are used. The opencc feature can be enabled with zhconv = { version = "...", features = [ "opencc" ] }.
Note that, besides reducing performance, it adds at least several MiB to the build output.
Limitations
Accuracy
A rule-based converter cannot capture every linguistic nuance, so its accuracy is limited. In addition, the converter uses a leftmost-longest matching strategy: it prefers the earliest match in the text and, among matches starting at the same position, the longest one. For instance, if a ruleset includes both 干 -> 幹 and 天干物燥 -> 天乾物燥, the input 天干物燥 is converted to 天乾物燥, because the match for 天干物燥 starts earlier than the match for 干. This approach generally produces accurate results but may occasionally lead to incorrect conversions.
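The strategy can be illustrated with a naive pure-Python scan. This is a sketch of leftmost-longest matching only, not the crate's Aho-Corasick implementation; the function name convert and the two-entry ruleset are made up for the example, and the scan costs O(n*m) for input length n and maximum source word length m.

```python
def convert(text, rules):
    """Replace leftmost-longest matches of rule keys (naive scan, not Aho-Corasick)."""
    max_len = max(map(len, rules), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so longer phrases win at this position.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in rules:
                out.append(rules[piece])
                i += length
                break
        else:
            # No rule matches here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Illustrative ruleset taken from the paragraph above.
rules = {"干": "幹", "天干物燥": "天乾物燥"}
print(convert("天干物燥", rules))  # 天乾物燥 (the longer phrase wins)
print(convert("干活", rules))      # 幹活
print(convert("干燥", rules))      # 幹燥 (wrong: the expected form is 乾燥)
```

The last call shows the limitation: with only these two rules, 干 in 干燥 falls back to the single-character rule, so a ruleset needs a phrase-level entry for every such case.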
Wikitext support
While the implementation supports most MediaWiki conversion rules, it is not fully compliant with the original MediaWiki implementation.
For wikitext inputs containing global conversion rules (e.g., -{H|zh-hans:鹿;zh-hant:马}- in MediaWiki syntax), the time complexity may degrade to O(n*m) in the worst case, where n is the length of the input text and m is the maximum length of source words in the ruleset. This is equivalent to a brute-force approach.
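To make the notion of a global rule concrete, here is a heavily simplified sketch of how such a rule could be collected and applied. It assumes MediaWiki's semicolon-separated form (-{H|zh-hans:鹿;zh-hant:马}-), handles only the H flag, applies the collected rules to the whole text, and ignores fallback between variants. GLOBAL_RULE and apply_global_rules are hypothetical names; this is neither the zhconv-rs implementation nor a faithful model of MediaWiki's converter.

```python
import re

# Hypothetical helper, not part of zhconv-rs: matches hidden global rules
# such as -{H|zh-hans:鹿;zh-hant:马}-
GLOBAL_RULE = re.compile(r"-\{H\|(.*?)\}-")


def apply_global_rules(text, target):
    """Collect -{H|...}- rules, strip them from the text, apply them for `target`."""
    target = target.lower()
    pairs = {}
    for body in GLOBAL_RULE.findall(text):
        variants = {}
        for item in body.split(";"):
            if ":" in item:
                tag, word = item.split(":", 1)
                variants[tag.strip().lower()] = word.strip()
        if target in variants:
            # Words of every other variant are mapped to the target's word.
            for tag, word in variants.items():
                if tag != target and word:
                    pairs[word] = variants[target]
    text = GLOBAL_RULE.sub("", text)
    if not pairs:
        return text
    # Longest alternatives first, so the single regex pass prefers longer words.
    longest_first = sorted(pairs, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(src) for src in longest_first))
    return pattern.sub(lambda m: pairs[m.group(0)], text)


print(apply_global_rules("-{H|zh-hans:鹿;zh-hant:马}-指鹿为马", "zh-hant"))  # 指马为马
```

Because such rules only become known while the text is being read, a matcher prepared in advance from the builtin ruleset cannot cover them, which is one way to see why the worst case falls back to brute-force scanning.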
Credits
Rulesets/Dictionaries: MediaWiki and OpenCC.
References:
- https://github.com/gumblex/zhconv : Python implementation of zhConver{ter,sion}.php.
- https://github.com/BYVoid/OpenCC/ : Widely adopted Chinese converter.
- https://zh.wikipedia.org/wiki/Wikipedia:字詞轉換處理
- https://zh.wikipedia.org/wiki/Help:高级字词转换语法
- https://github.com/wikimedia/mediawiki/blob/master/includes/language/LanguageConverter.php