Crate zhconv

This crate provides a ZhConverter that converts text between Chinese variants. The implementation is based on the Aho-Corasick algorithm with the leftmost-longest matching strategy, and runs in time linear in the length of the input text and the conversion rules. It ships with several conversion tables extracted from zhConversion.php (maintained by MediaWiki and Chinese Wikipedia) and OpenCC.
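To make the leftmost-longest strategy concrete, here is a minimal sketch that uses only the standard library. It is not the crate's implementation: the function name `convert_leftmost_longest` and the tiny rule table in the note below are assumptions made for illustration, and the naive scan tries every rule at every position, whereas an Aho-Corasick automaton finds the same matches in linear time.

```rust
/// Illustrative only: convert `text` by taking, at each position, the longest
/// rule whose FROM string matches there (leftmost-longest semantics).
/// `rules` is a hypothetical list of (FROM, TO) pairs, not a built-in table.
fn convert_leftmost_longest(text: &str, rules: &[(&str, &str)]) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(ch) = rest.chars().next() {
        // Among all rules that match at the current position, pick the longest.
        let best = rules
            .iter()
            .filter(|(from, _)| !from.is_empty() && rest.starts_with(*from))
            .max_by_key(|(from, _)| from.len());
        match best {
            Some((from, to)) => {
                // Emit the replacement and skip past the matched text.
                out.push_str(to);
                rest = &rest[from.len()..];
            }
            None => {
                // No rule applies here: copy one character unchanged.
                out.push(ch);
                rest = &rest[ch.len_utf8()..];
            }
        }
    }
    out
}
```

With the hypothetical table `[("联合", "聯合"), ("联合酋长国", "聯合大公國"), ("酋长国", "酋長國")]`, the input 阿拉伯联合酋长国 becomes 阿拉伯聯合大公國, because the longer rule starting at 联 takes precedence over the shorter 联合 rule at the same position.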

While the built-in rulesets work well in the general case, the converter is not meant to be 100% accurate, especially for professional text. On Chinese Wikipedia, it is common for editors to apply additional conversion groups and manual conversion rules on a per-article basis. The converter optionally supports the conversion rule syntax used in MediaWiki, in the form -{FOO BAR}-, and can load external rules defined line by line, which are typically extracted and pre-processed from a CGroup on a specific topic. For simplicity, it is also possible to add custom conversions as (FROM, TO) pairs.
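As a rough illustration of the -{...}- syntax, the sketch below resolves inline rules for a single target tag using only the standard library. It is a simplified stand-in, not what the crate does internally: `resolve_inline_rules` and `pick_variant` are hypothetical names, flags such as `H|` are not handled, the surrounding text is left unconverted, and falling back to the first listed value when the target tag is absent is an assumption of this sketch.

```rust
/// Illustrative only: replace each `-{...}-` span with the text chosen for
/// `target` (a tag such as "zh-cn"). Text outside the markup is copied as-is.
fn resolve_inline_rules(text: &str, target: &str) -> String {
    let mut out = String::new();
    let mut rest = text;
    while let Some(start) = rest.find("-{") {
        // Stop at an unterminated rule; the remainder is appended unchanged below.
        let Some(len) = rest[start + 2..].find("}-") else { break };
        out.push_str(&rest[..start]);
        let body = &rest[start + 2..start + 2 + len];
        out.push_str(pick_variant(body, target));
        rest = &rest[start + 2 + len + 2..];
    }
    out.push_str(rest);
    out
}

/// Choose the value for `target` from a body like `zh-tw:A;zh-cn:B`.
/// A body containing an item without a `tag:` prefix is returned verbatim.
fn pick_variant<'a>(body: &'a str, target: &str) -> &'a str {
    let mut first = None;
    for item in body.split(';').map(str::trim).filter(|s| !s.is_empty()) {
        match item.split_once(':') {
            Some((tag, value)) => {
                if tag.trim().eq_ignore_ascii_case(target) {
                    return value.trim();
                }
                // Remember the first listed value as this sketch's fallback.
                first.get_or_insert(value.trim());
            }
            None => return body,
        }
    }
    first.unwrap_or(body)
}
```

In this sketch, a rule without tags, such as -{干}-, simply protects its content from change, while a tagged rule selects the alternative that matches the requested variant.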

§Usage

This crate is on crates.io.

[dependencies]
zhconv = "?"

§Example

Basic conversion:

use zhconv::{zhconv, Variant};
assert_eq!(zhconv("天干物燥 小心火烛", "zh-Hant".parse().unwrap()), "天乾物燥 小心火燭");
assert_eq!(zhconv("鼠曲草", Variant::ZhHant), "鼠麴草");
assert_eq!(zhconv("阿拉伯联合酋长国", Variant::ZhHant), "阿拉伯聯合酋長國");
assert_eq!(zhconv("阿拉伯联合酋长国", Variant::ZhTW), "阿拉伯聯合大公國");

With MediaWiki conversion rules:

use zhconv::{zhconv_mw, Variant};
assert_eq!(zhconv_mw("天-{干}-物燥 小心火烛", "zh-Hant".parse::<Variant>().unwrap()), "天干物燥 小心火燭");
assert_eq!(zhconv_mw("-{zh-tw:鼠麴草;zh-cn:香茅}-是菊科草本植物。", Variant::ZhCN), "香茅是菊科草本植物。");
assert_eq!(zhconv_mw("菊科草本植物包括-{zh-tw:鼠麴草;zh-cn:香茅;}-等。", Variant::ZhTW), "菊科草本植物包括鼠麴草等。");
assert_eq!(zhconv_mw("-{H|zh:馬;zh-cn:鹿;}-馬克思主義", Variant::ZhCN), "鹿克思主义"); // global rule

To load or add additional conversion rules such as CGroups or (FROM, TO) pairs, see ZhConverterBuilder.

Other useful functions:

use zhconv::{is_hans, is_hans_confidence, infer_variant, infer_variant_confidence};
assert!(!is_hans("秋冬濁而春夏清,晞於朝而生於夕"));
assert!(is_hans_confidence("滴瀝明花苑,葳蕤泫竹叢") < 0.5);
println!("{}", infer_variant("錦字緘愁過薊水,寒衣將淚到遼城"));
println!("{:?}", infer_variant_confidence("zhconv-rs 中文简繁及地區詞轉換"));

Re-exports§

Modules§

Structs§

Traits§

  • A helper trait that truncates a str around a specified index in constant time (O(1)), intended to be used with is_hans and similar functions.
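The idea of constant-time truncation can be sketched with the standard library alone. The helper below is hypothetical and is not the crate's trait: the name `truncate_floor` and the choice to round down to the previous character boundary are assumptions. It is O(1) because a UTF-8 encoded character is at most 4 bytes long, so the loop steps back at most 3 times.

```rust
/// Illustrative only: return the longest prefix of `s` that ends at or before
/// byte offset `index` and on a UTF-8 character boundary.
fn truncate_floor(s: &str, index: usize) -> &str {
    if index >= s.len() {
        return s;
    }
    let mut end = index;
    // Step back to the nearest boundary; offset 0 is always a boundary,
    // so the loop terminates without underflow.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

A prefix obtained this way could then be passed to a function such as is_hans when only a sample of a long text needs to be inspected.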

Functions§

  • infer_variant: Determine the Chinese variant of the input text.
  • infer_variant_confidence: Determine the Chinese variant of the input text with confidence.
  • is_hans: Determine whether the given text looks like Simplified Chinese over Traditional Chinese.
  • is_hans_confidence: Determine whether the given text looks like Simplified Chinese over Traditional Chinese.
  • zhconv: Helper function for general conversion using built-in converters.
  • zhconv_mw: Helper function for general conversion, activating conversion rules in MediaWiki syntax.