zhconv 0.1.0-beta

Convert between Traditional and Simplified Chinese, including regional words of Taiwan, Hong Kong, mainland China, and Singapore/Malaysia, based on the Chinese Wikipedia conversion tables.

zhconv-rs: Simplified/Traditional Chinese and regional phrase conversion

zhconv-rs converts Chinese text among several scripts and regional variants (e.g. zh-TW <-> zh-CN <-> zh-HK <-> zh-Hans <-> zh-Hant). It is built on top of the ZhConversion.php conversion tables from MediaWiki, the same tables used on Chinese Wikipedia.

Web App: https://zhconv.pages.dev/ (powered by WASM)

Supported variants

Target | Tag | Script | Description
Simplified Chinese / 简体中文 | zh-Hans | SC / 简 | Without substituting region-specific phrases.
Traditional Chinese / 繁體中文 | zh-Hant | TC / 繁 | Without substituting region-specific phrases.
Chinese (Taiwan) / 臺灣正體 | zh-TW | TC / 繁 | With Taiwan-specific phrases adapted.
Chinese (Hong Kong) / 香港繁體 | zh-HK | TC / 繁 | With Hong Kong-specific phrases adapted.
Chinese (Macau) / 澳门繁體 | zh-MO | TC / 繁 | Same as zh-HK for now.
Chinese (Mainland China) / 大陆简体 | zh-CN | SC / 简 | With mainland China-specific phrases adapted.
Chinese (Singapore) / 新加坡简体 | zh-SG | SC / 简 | Same as zh-CN for now.
Chinese (Malaysia) / 大马简体 | zh-MY | SC / 简 | Same as zh-CN for now.

Note: zh-TW and zh-HK are based on zh-Hant; zh-CN is based on zh-Hans. Currently, zh-MO shares the same conversion table with zh-HK unless additional rules / CGroups are applied; zh-MY and zh-SG share the same conversion table with zh-CN unless additional rules / CGroups are applied.
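
The relationships in the note above can be written down as a small lookup. This is only an illustrative sketch: the `Variant` enum and the `table_owner` and `base_script` helpers are hypothetical names chosen for this example and are not claimed to match the crate's actual API.

```rust
/// Target variants from the table above (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Variant {
    ZhHans,
    ZhHant,
    ZhTW,
    ZhHK,
    ZhMO,
    ZhCN,
    ZhSG,
    ZhMY,
}

impl Variant {
    /// Variant whose conversion table is currently used for `self`
    /// (hypothetical helper): zh-MO reuses zh-HK, zh-SG and zh-MY reuse zh-CN.
    fn table_owner(self) -> Variant {
        match self {
            Variant::ZhMO => Variant::ZhHK,
            Variant::ZhSG | Variant::ZhMY => Variant::ZhCN,
            other => other,
        }
    }

    /// Script-level variant that the regional table builds on
    /// (hypothetical helper): zh-TW / zh-HK -> zh-Hant, zh-CN -> zh-Hans.
    fn base_script(self) -> Variant {
        match self.table_owner() {
            Variant::ZhHant | Variant::ZhTW | Variant::ZhHK => Variant::ZhHant,
            _ => Variant::ZhHans,
        }
    }
}
```

Under these assumptions, `Variant::ZhMO.table_owner()` is `Variant::ZhHK`, and `Variant::ZhMY.base_script()` is `Variant::ZhHans`.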

Performance

cargo bench on Intel(R) Xeon(R) CPU @ 2.80GHz (GitPod), without parsing inline conversion rules:

load zh2Hant            time:   [45.442 ms 45.946 ms 46.459 ms]
load zh2Hans            time:   [8.1378 ms 8.3787 ms 8.6414 ms]
load zh2TW              time:   [60.209 ms 61.261 ms 62.407 ms]
load zh2HK              time:   [89.457 ms 90.847 ms 92.297 ms]
load zh2MO              time:   [96.670 ms 98.063 ms 99.586 ms]
load zh2CN              time:   [27.850 ms 28.520 ms 29.240 ms]
load zh2SG              time:   [28.175 ms 28.963 ms 29.796 ms]
load zh2MY              time:   [27.142 ms 27.635 ms 28.143 ms]
zh2TW data54k           time:   [546.10 us 553.14 us 561.24 us]
zh2CN data54k           time:   [504.34 us 511.22 us 518.59 us]
zh2Hant data689k        time:   [3.4375 ms 3.5182 ms 3.6013 ms]
zh2TW data689k          time:   [3.6062 ms 3.6784 ms 3.7545 ms]
zh2Hant data3185k       time:   [62.457 ms 64.257 ms 66.099 ms]
zh2TW data3185k         time:   [60.217 ms 61.348 ms 62.556 ms]
zh2TW data55m           time:   [1.0773 s 1.0872 s 1.0976 s]

Differences from other tools

  • ZhConver{sion,ter}.php of MediaWiki: zhconv-rs is based only on the conversion tables listed in ZhConversion.php. MediaWiki relies on the PHP built-in function strtr, which is inefficient. zhconv-rs ports part of MediaWiki's implementation to support the same conversion rule syntax with much better efficiency.
  • OpenCC: OpenCC has self-maintained conversion tables that differ from MediaWiki's. Thanks to the efficient Aho-Corasick algorithm, zhconv-rs is much faster in general.

All of these implementations share the same leftmost-longest matching strategy, so conversion results should generally be the same given the same conversion tables.
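
To make the leftmost-longest strategy concrete, here is a minimal sketch using only the standard library. It is a naive scan written for illustration, not the Aho-Corasick-based implementation the crate uses; the function name `convert_leftmost_longest` and the toy table in the note below are assumptions made for this example.

```rust
use std::collections::HashMap;

/// Naive leftmost-longest conversion (illustrative only).
/// At each position, the longest table key that matches is replaced;
/// if no key matches, one character is copied unchanged.
fn convert_leftmost_longest(text: &str, table: &HashMap<&str, &str>) -> String {
    // The longest key (in characters) bounds how far ahead we need to look.
    let max_len = table.keys().map(|k| k.chars().count()).max().unwrap_or(0);
    let chars: Vec<char> = text.chars().collect();
    let mut out = String::with_capacity(text.len());
    let mut i = 0;
    while i < chars.len() {
        let mut matched = false;
        let upper = max_len.min(chars.len() - i);
        // Try candidates from longest to shortest so the longest match wins.
        for len in (1..=upper).rev() {
            let candidate: String = chars[i..i + len].iter().collect();
            if let Some(replacement) = table.get(candidate.as_str()) {
                out.push_str(replacement);
                i += len;
                matched = true;
                break;
            }
        }
        if !matched {
            // No rule applies here: keep the character as-is and move on.
            out.push(chars[i]);
            i += 1;
        }
    }
    out
}
```

With a toy table containing 头发 → 頭髮, 发 → 發 and 头 → 頭, the input 头发发展 converts to 頭髮發展: the two-character key 头发 wins at the first position, whereas a shortest-match scan would produce 頭發發展.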

TODO

  • Support Module:CGroup
  • Propagate errors properly with anyhow and thiserror