Crate zhconv

This crate provides a ZhConverter that converts text between Chinese variants. The implementation is based on the Aho-Corasick algorithm with the leftmost-longest matching strategy, and runs in linear time with respect to the length of the input text and the conversion rules. It ships with a set of conversion tables extracted from zhConversion.php (maintained by MediaWiki and Chinese Wikipedia) and OpenCC.
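To make the leftmost-longest strategy concrete, here is a minimal sketch using only the standard library. It is a naive scan, not Aho-Corasick and not this crate's implementation: at each position it tries the longest candidate first, so its cost grows with the longest rule rather than staying linear. The function name `convert_leftmost_longest` and the sample pairs mentioned below are made up for illustration.

```rust
use std::collections::HashMap;

/// Naive leftmost-longest replacement over (FROM, TO) pairs.
/// Hypothetical helper for illustration only; not part of zhconv.
fn convert_leftmost_longest(text: &str, pairs: &[(&str, &str)]) -> String {
    let table: HashMap<&str, &str> = pairs.iter().copied().collect();
    // The longest FROM key (in chars) bounds how far we look ahead.
    let max_len = table.keys().map(|k| k.chars().count()).max().unwrap_or(0);

    let chars: Vec<char> = text.chars().collect();
    let mut out = String::with_capacity(text.len());
    let mut i = 0;
    while i < chars.len() {
        let mut matched = false;
        // Try the longest candidate first so longer rules win over shorter ones.
        let upper = max_len.min(chars.len() - i);
        for len in (1..=upper).rev() {
            let candidate: String = chars[i..i + len].iter().collect();
            if let Some(to) = table.get(candidate.as_str()) {
                out.push_str(to);
                i += len;
                matched = true;
                break;
            }
        }
        if !matched {
            // No rule starts here: copy the character through unchanged.
            out.push(chars[i]);
            i += 1;
        }
    }
    out
}
```

With the made-up pairs ("联合", "聯合"), ("酋长国", "酋長國") and ("阿拉伯联合酋长国", "阿拉伯聯合大公國"), the input 阿拉伯联合酋长国 maps to 阿拉伯聯合大公國 because the longest rule wins at the first position, while 联合酋长国 maps to 聯合酋長國 through the two shorter rules.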

While the built-in rulesets work well for the general case, the converter is never meant to be 100% accurate, especially for professional text. On Chinese Wikipedia, it is common for editors to apply additional conversion groups and manual conversion rules on a per-article basis. The converter optionally supports the conversion rule syntax used in MediaWiki, in the form -{FOO BAR}-, and can load external rules defined line by line, which are typically extracted and pre-processed from a CGroup on a specific topic. For simplicity, it is also possible to add custom conversions as (FROM, TO) pairs.

§Usage

This crate is on crates.io.

[dependencies]
zhconv = "?"

§Example

Basic conversion:

use zhconv::{zhconv, Variant};
assert_eq!(zhconv("天干物燥 小心火烛", "zh-Hant".parse().unwrap()), "天乾物燥 小心火燭");
assert_eq!(zhconv("鼠曲草", Variant::ZhHant), "鼠麴草");
assert_eq!(zhconv("阿拉伯联合酋长国", Variant::ZhHant), "阿拉伯聯合酋長國");
assert_eq!(zhconv("阿拉伯联合酋长国", Variant::ZhTW), "阿拉伯聯合大公國");

With MediaWiki conversion syntax:

use zhconv::{zhconv_mw, Variant};
assert_eq!(zhconv_mw("天-{干}-物燥 小心火烛", "zh-Hant".parse::<Variant>().unwrap()), "天干物燥 小心火燭");
assert_eq!(zhconv_mw("-{zh-tw:鼠麴草;zh-cn:香茅}-是菊科草本植物。", Variant::ZhCN), "香茅是菊科草本植物。");
assert_eq!(zhconv_mw("菊科草本植物包括-{zh-tw:鼠麴草;zh-cn:香茅;}-等。", Variant::ZhTW), "菊科草本植物包括鼠麴草等。");
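How a `-{variant:text;...}-` rule selects its output can be sketched with the standard library alone. This is a simplified illustration, not the parser used by this crate or by MediaWiki: the hypothetical helpers `resolve_inline_rules` and `pick_variant` handle only the plain `variant:text` form, ignore flags such as `H|` and `-|`, apply no variant fallback, and leave the surrounding text unconverted.

```rust
/// Resolve inline `-{...}-` rules for a single target variant.
/// Hypothetical, simplified helper; not part of zhconv.
fn resolve_inline_rules(text: &str, target: &str) -> String {
    let mut out = String::new();
    let mut rest = text;
    while let Some(start) = rest.find("-{") {
        // Copy everything before the rule verbatim.
        out.push_str(&rest[..start]);
        let after_open = &rest[start + 2..];
        match after_open.find("}-") {
            Some(end) => {
                out.push_str(pick_variant(&after_open[..end], target));
                rest = &after_open[end + 2..];
            }
            None => {
                // Unterminated rule: keep the remaining text as-is.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

/// Return the text tagged with `target`. A body with no matching
/// `variant:` tag (including an untagged body) is returned unchanged.
fn pick_variant<'a>(body: &'a str, target: &str) -> &'a str {
    for item in body.split(';') {
        if let Some((variant, value)) = item.split_once(':') {
            if variant.trim() == target {
                return value.trim();
            }
        }
    }
    body
}
```

Under these assumptions, resolving `-{zh-tw:鼠麴草;zh-cn:香茅}-是菊科草本植物。` for `zh-cn` gives 香茅是菊科草本植物。, and an untagged rule such as `天-{干}-物燥` simply yields 天干物燥, which is how the rule protects 干 from conversion.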

Set global rules inline (note that such rules always apply globally regardless of their location, unlike in MediaWiki where they affect only the text that follows):

use zhconv::{zhconv_mw, Variant};
assert_eq!(zhconv_mw("-{H|zh:馬;zh-cn:鹿;}-馬克思主義", Variant::ZhCN), "鹿克思主义"); // add
assert_eq!(zhconv_mw("&二極體\n-{-|zh-hans:二极管; zh-hant:二極體}-\n", Variant::ZhCN), "&二极体\n\n"); // remove

To load or add additional conversion rules such as CGroups or (FROM, TO) pairs, see ZhConverterBuilder.

Other useful functions:

use zhconv::{is_hans, is_hans_confidence, infer_variant, infer_variant_confidence};
assert!(!is_hans("秋冬濁而春夏清,晞於朝而生於夕"));
assert!(is_hans_confidence("滴瀝明花苑,葳蕤泫竹叢") < 0.5);
println!("{}", infer_variant("錦字緘愁過薊水,寒衣將淚到遼城"));
println!("{:?}", infer_variant_confidence("zhconv-rs 中文简繁及地區詞轉換"));

Re-exports§

pub use self::converters::get_builtin_converter;
pub use self::tables::get_builtin_tables;
pub use self::variant::Variant;

Modules§

converters
Converters built from built-in tables.
pagerules
Struct to extract global rules from wikitext.
rule
Structs and functions for processing conversion rules, as defined in ConverterRule.php.
tables
Built-in conversion tables extracted from zhConversion.php (maintained by MediaWiki and Chinese Wikipedia) and OpenCC.
variant
Structs for handling variants and mapping of variants.

Structs§

ZhConverter
A ZhConverter, built by ZhConverterBuilder.
ZhConverterBuilder
A builder that helps build a ZhConverter.

Traits§

TruncatedAround
A helper trait that truncates a str around a specified index in constant time (O(1)), intended to be used with is_hans and similar functions.
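Constant-time truncation of a str is possible because a UTF-8 character occupies at most 4 bytes, so a valid boundary is never more than 3 bytes away from any index. The sketch below illustrates that idea with a hypothetical free function, `truncate_near`, which rounds down to the nearest boundary; it is an assumption about the general technique, not the signature or exact behavior of TruncatedAround.

```rust
/// Truncate `s` to at most `index` bytes without splitting a character.
/// Hypothetical helper for illustration only; not part of zhconv.
fn truncate_near(s: &str, index: usize) -> &str {
    if index >= s.len() {
        return s;
    }
    let mut end = index;
    // Step back to a char boundary. Index 0 is always a boundary, and a
    // UTF-8 character is at most 4 bytes, so this runs at most 3 times.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

For example, each character of 简繁转换 takes 3 bytes, so truncating at byte index 4 yields 简 and truncating at index 6 yields 简繁. A short prefix obtained this way can then be passed to a function such as is_hans instead of the full text.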

Functions§

infer_variant
Determine the Chinese variant of the input text.
infer_variant_confidence
Determine the Chinese variant of the input text with confidence.
is_hans
Determine whether the given text looks like Simplified Chinese over Traditional Chinese.
is_hans_confidence
Determine whether the given text looks like Simplified Chinese over Traditional Chinese, returning a confidence value.
zhconv
Helper function for general conversion using built-in converters.
zhconv_mw
Helper function for general conversion, activating conversion rules in MediaWiki syntax.