Crate zhconv


zhconv-rs converts Chinese text between Traditional, Simplified and regional variants. Its rulesets are sourced from zhConversion.php (maintained by MediaWiki and Chinese Wikipedia) and from OpenCC. They are merged, flattened and then precompiled into Aho-Corasick automata by daachorse, so that each conversion is a single pass in linear time.
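
The merge-then-convert idea can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `merge_tables` and `convert` are hypothetical names, the tiny tables are made up for the example, and a naive longest-match scan stands in for the precompiled daachorse automaton (so it does not have the same linear-time guarantee).

```rust
use std::collections::HashMap;

/// Merge several (FROM, TO) rulesets into one flat table.
/// Later rulesets override earlier ones on identical FROM keys.
fn merge_tables(tables: &[&[(&str, &str)]]) -> HashMap<String, String> {
    let mut merged = HashMap::new();
    for table in tables {
        for (from, to) in table.iter() {
            merged.insert(from.to_string(), to.to_string());
        }
    }
    merged
}

/// Convert `text` in one left-to-right pass, always applying the longest
/// rule that matches at the current position.
fn convert(text: &str, table: &HashMap<String, String>) -> String {
    // Longest FROM key, measured in characters, bounds the lookahead.
    let max_len = table.keys().map(|k| k.chars().count()).max().unwrap_or(0);
    let chars: Vec<char> = text.chars().collect();
    let mut out = String::with_capacity(text.len());
    let mut i = 0;
    while i < chars.len() {
        let longest = max_len.min(chars.len() - i);
        let mut matched = false;
        // Try the longest candidate first, then shorter ones.
        for len in (1..=longest).rev() {
            let candidate: String = chars[i..i + len].iter().collect();
            if let Some(to) = table.get(&candidate) {
                out.push_str(to);
                i += len;
                matched = true;
                break;
            }
        }
        if !matched {
            // No rule applies here: copy the character unchanged.
            out.push(chars[i]);
            i += 1;
        }
    }
    out
}
```

With a general table containing 联→聯, 长→長, 国→國 and a regional table containing 酋长国→大公國, the merged table turns 阿拉伯联合酋长国 into 阿拉伯聯合大公國, because the longer regional phrase wins over the per-character rules at that position.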

The non-default feature opencc enables additional OpenCC dictionaries. Unlike in other implementations, individual dictionaries cannot be enabled or disabled at runtime, because they are merged and precompiled into a separate automaton for each target variant.

As with MediaWiki and OpenCC, accuracy is generally acceptable but limited. The converter optionally supports additional conversion rules in MediaWiki syntax (refer to conversion groups and manual conversion rules on Chinese Wikipedia), external rules defined line by line, and custom conversions defined by (FROM, TO) pairs. Prebuilding a converter with custom rules or dictionaries is not yet supported.

§Usage

The crate is on crates.io.

[dependencies]
zhconv = { version = "?", features = ["opencc"] } # enable additional OpenCC dictionaries

§Example

Basic conversion:

use zhconv::{zhconv, Variant};
assert_eq!(zhconv("天干物燥 小心火烛", "zh-Hant".parse().unwrap()), "天乾物燥 小心火燭");
assert_eq!(zhconv("鼠曲草", Variant::ZhHant), "鼠麴草");
assert_eq!(zhconv("阿拉伯联合酋长国", Variant::ZhHant), "阿拉伯聯合酋長國");
assert_eq!(zhconv("阿拉伯联合酋长国", Variant::ZhTW), "阿拉伯聯合大公國");

With MediaWiki conversion syntax:

use zhconv::{zhconv_mw, Variant};
assert_eq!(zhconv_mw("天-{干}-物燥 小心火烛", "zh-Hant".parse::<Variant>().unwrap()), "天干物燥 小心火燭");
assert_eq!(zhconv_mw("-{zh-tw:鼠麴草;zh-cn:香茅}-是菊科草本植物。", Variant::ZhCN), "香茅是菊科草本植物。");
assert_eq!(zhconv_mw("菊科草本植物包括-{zh-tw:鼠麴草;zh-cn:香茅;}-等。", Variant::ZhTW), "菊科草本植物包括鼠麴草等。");
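
To make the inline syntax concrete, here is a simplified sketch of how a `-{ ... }-` span could be resolved for a target variant. It is an assumption-laden illustration, not the crate's parser: `pick_variant` and `render_inline_rules` are hypothetical names, and the sketch ignores flags (such as `H|`), variant fallback, nesting, and the ordinary conversion of the surrounding text.

```rust
/// Pick the text for `target` from the body of an inline rule,
/// e.g. "zh-tw:鼠麴草;zh-cn:香茅;". Returns None if no item matches.
fn pick_variant<'a>(rule_body: &'a str, target: &str) -> Option<&'a str> {
    rule_body
        .split(';')
        // Items without a colon (including the empty trailing item) are skipped.
        .filter_map(|item| item.split_once(':'))
        .find(|(variant, _)| variant.trim().eq_ignore_ascii_case(target))
        .map(|(_, text)| text.trim())
}

/// Replace every `-{ ... }-` span in `text` with the text chosen for `target`.
/// A span with no matching variant is emitted as its raw body.
fn render_inline_rules(text: &str, target: &str) -> String {
    let mut out = String::new();
    let mut rest = text;
    while let Some(start) = rest.find("-{") {
        out.push_str(&rest[..start]);
        let after_open = &rest[start + 2..];
        match after_open.find("}-") {
            Some(end) => {
                let body = &after_open[..end];
                out.push_str(pick_variant(body, target).unwrap_or(body));
                rest = &after_open[end + 2..];
            }
            None => {
                // Unclosed marker: keep the remainder verbatim.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

A body without any `variant:` prefix, as in `-{干}-`, falls through to the raw-body branch. This mirrors how such a span is left unconverted in the example above.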

Set global rules inline (note that such rules always apply globally regardless of their location, unlike in MediaWiki where they affect only the text that follows):

use zhconv::{zhconv_mw, Variant};
assert_eq!(zhconv_mw("-{H|zh:馬;zh-cn:鹿;}-馬克思主義", Variant::ZhCN), "鹿克思主义"); // H flag: add a global rule
assert_eq!(zhconv_mw("&二極體\n-{-|zh-hans:二极管; zh-hant:二極體}-\n", Variant::ZhCN), "&二极体\n\n"); // - flag: remove a global rule

To load or add additional conversion rules such as CGroups or (FROM, TO) pairs, see ZhConverterBuilder.

Other useful functions:

use zhconv::{is_hans, is_hans_confidence, infer_variant, infer_variant_confidence};
assert!(!is_hans("秋冬濁而春夏清,晞於朝而生於夕"));
assert!(is_hans_confidence("滴瀝明花苑,葳蕤泫竹叢") < 0.5);
println!("{}", infer_variant("錦字緘愁過薊水,寒衣將淚到遼城"));
println!("{:?}", infer_variant_confidence("zhconv-rs 中文简繁及地區詞轉換"));

Re-exports§

pub use self::converters::get_builtin_converter;
pub use self::tables::get_builtin_tables;
pub use self::variant::Variant;

Modules§

converters
Built-in converters built from tables.
pagerules
Struct to extract global rules from wikitext.
rule
Structs and functions for processing conversion rules, as defined in ConverterRule.php.
tables
Built-in conversion tables sourced from zhConversion.php (maintained by MediaWiki and Chinese Wikipedia) and OpenCC.
variant
Structs for handling variants and mapping of variants.

Structs§

ZhConverter
A ZhConverter, built by ZhConverterBuilder.
ZhConverterBuilder
A builder that helps build a ZhConverter.

Traits§

TruncatedAround
A helper trait that truncates a str around a specified index in constant time (O(1)), intended to be used with is_hans and similar functions.
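
Constant-time truncation of a UTF-8 string is possible because a code point occupies at most four bytes, so snapping a byte offset to the nearest character boundary takes a bounded number of steps. The sketch below illustrates that idea with a hypothetical free function `truncate_around`. Its signature and exact windowing are assumptions for illustration and are not taken from the trait's actual definition.

```rust
/// Return a slice of `s` covering roughly `radius` bytes on each side of the
/// byte index `center`, widened outward to the nearest UTF-8 char boundaries.
fn truncate_around(s: &str, center: usize, radius: usize) -> &str {
    let mut start = center.min(s.len()).saturating_sub(radius);
    let mut end = center.saturating_add(radius).min(s.len());
    // A UTF-8 code point is at most 4 bytes long, so each loop below runs
    // at most 3 times regardless of the length of `s`.
    while !s.is_char_boundary(start) {
        start -= 1;
    }
    while !s.is_char_boundary(end) {
        end += 1;
    }
    &s[start..end]
}
```

A window like this could be passed to a detection function so that only a bounded part of a long input is examined.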

Functions§

infer_variant
Determine the Chinese variant of the input text.
infer_variant_confidence
Determine the Chinese variant of the input text with confidence.
is_hans
Determine whether the given text looks like Simplified Chinese over Traditional Chinese.
is_hans_confidence
Estimate how strongly the given text looks like Simplified Chinese rather than Traditional Chinese, returning a numeric confidence instead of a bool.
zhconv
Helper function for general conversion using built-in converters.
zhconv_mw
Helper function for general conversion, activating wikitext support.