This crate provides a ZhConverter that converts text between Chinese variants. The implementation is based on the Aho-Corasick algorithm with the leftmost-longest matching strategy, and runs in time linear in the length of the input text and the conversion rules. It ships with a set of built-in conversion tables extracted from zhConversion.php (maintained by MediaWiki and Chinese Wikipedia) and OpenCC.
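To illustrate what leftmost-longest matching means for conversion rules, here is a minimal sketch that uses only the standard library. It is not the crate's implementation: it swaps the Aho-Corasick automaton for a naive scan that tries every rule at every position, so it is not linear time. The function name `convert_leftmost_longest` and the three rules in the usage note are hypothetical and chosen only for illustration.

```rust
/// Naive leftmost-longest replacement over (FROM, TO) pairs.
/// Illustrative only: every rule is tried at every position, so this is
/// slower than an Aho-Corasick automaton, but it selects the same matches.
fn convert_leftmost_longest(text: &str, pairs: &[(&str, &str)]) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(ch) = rest.chars().next() {
        // Among all rules that match at the current (leftmost) position,
        // pick the one with the longest FROM side.
        let best = pairs
            .iter()
            .filter(|(from, _)| !from.is_empty() && rest.starts_with(*from))
            .max_by_key(|(from, _)| from.len());
        match best {
            Some((from, to)) => {
                out.push_str(to);
                rest = &rest[from.len()..];
            }
            None => {
                // No rule applies here: copy one character through unchanged.
                out.push(ch);
                rest = &rest[ch.len_utf8()..];
            }
        }
    }
    out
}
```

With the hypothetical rules `[("干", "幹"), ("天干物燥", "天乾物燥"), ("烛", "燭")]`, the input `"天干物燥 小心火烛"` becomes `"天乾物燥 小心火燭"`: the four-character rule wins over the single-character rule for 干 because it is the longest match starting at the leftmost position.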
While the built-in rulesets work well for the general case, the converter is never meant to be 100%
accurate, especially for professional text. On Chinese Wikipedia, it is common for editors to apply
additional conversion groups and manual conversion rules on a per-article basis. The converter
optionally supports the conversion rule syntax used in MediaWiki, in the form -{FOO BAR}-, and
loading external rules defined line by line, which are typically extracted and pre-processed from a
CGroup on a specific topic.
For simplicity, it is also possible to add custom conversions as (FROM, TO) pairs.
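As a rough illustration of the inline -{...}- syntax, the following standard-library sketch resolves such markup for a single target variant code. It is an assumption-laden simplification, not the crate's parser: the names `resolve_inline_rules` and `pick_variant` are hypothetical, flags such as `H|` are not handled, raw text containing `:` would be misread as a mapping, and the text outside the markup is left unconverted.

```rust
/// Resolve inline `-{...}-` markup for one target variant code (e.g. "zh-cn").
/// Sketch only: handles `-{text}-` and `-{variant:text;...}-`; flags are ignored.
fn resolve_inline_rules(text: &str, target: &str) -> String {
    let mut out = String::new();
    let mut rest = text;
    while let Some(start) = rest.find("-{") {
        // Copy everything before the rule through unchanged.
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find("}-") {
            Some(end) => {
                out.push_str(pick_variant(&after[..end], target));
                rest = &after[end + 2..];
            }
            None => {
                // Unterminated rule: emit the remainder verbatim and stop.
                out.push_str(&rest[start..]);
                return out;
            }
        }
    }
    out.push_str(rest);
    out
}

/// Pick the text for `target` from a rule body such as `zh-tw:A;zh-cn:B`.
/// Falls back to the first listed value, or to the raw body if it has no mapping.
fn pick_variant<'a>(body: &'a str, target: &str) -> &'a str {
    let mut first: Option<&'a str> = None;
    for item in body.split(';') {
        if let Some((variant, value)) = item.split_once(':') {
            if variant.trim() == target {
                return value;
            }
            if first.is_none() {
                first = Some(value);
            }
        }
    }
    first.unwrap_or(body)
}
```

For example, `resolve_inline_rules("-{zh-tw:鼠麴草;zh-cn:香茅}-是菊科草本植物。", "zh-cn")` yields `"香茅是菊科草本植物。"`, and a rule without a mapping, as in `"天-{干}-物燥"`, simply keeps its body.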
§Usage
This crate is on crates.io.
[dependencies]
zhconv = "?"
§Example
Basic conversion:
use zhconv::{zhconv, Variant};
assert_eq!(zhconv("天干物燥 小心火烛", "zh-Hant".parse().unwrap()), "天乾物燥 小心火燭");
assert_eq!(zhconv("鼠曲草", Variant::ZhHant), "鼠麴草");
assert_eq!(zhconv("阿拉伯联合酋长国", Variant::ZhHant), "阿拉伯聯合酋長國");
assert_eq!(zhconv("阿拉伯联合酋长国", Variant::ZhTW), "阿拉伯聯合大公國");
With MediaWiki conversion rules:
use zhconv::{zhconv_mw, Variant};
assert_eq!(zhconv_mw("天-{干}-物燥 小心火烛", "zh-Hant".parse::<Variant>().unwrap()), "天干物燥 小心火燭");
assert_eq!(zhconv_mw("-{zh-tw:鼠麴草;zh-cn:香茅}-是菊科草本植物。", Variant::ZhCN), "香茅是菊科草本植物。");
assert_eq!(zhconv_mw("菊科草本植物包括-{zh-tw:鼠麴草;zh-cn:香茅;}-等。", Variant::ZhTW), "菊科草本植物包括鼠麴草等。");
assert_eq!(zhconv_mw("-{H|zh:馬;zh-cn:鹿;}-馬克思主義", Variant::ZhCN), "鹿克思主义"); // global rule
To load or add additional conversion rules such as CGroups or (FROM, TO) pairs, see ZhConverterBuilder.
Other useful functions:
use zhconv::{is_hans, is_hans_confidence, infer_variant, infer_variant_confidence};
assert!(!is_hans("秋冬濁而春夏清,晞於朝而生於夕"));
assert!(is_hans_confidence("滴瀝明花苑,葳蕤泫竹叢") < 0.5);
println!("{}", infer_variant("錦字緘愁過薊水,寒衣將淚到遼城"));
println!("{:?}", infer_variant_confidence("zhconv-rs 中文简繁及地區詞轉換"));
Re-exports§
pub use self::converters::get_builtin_converter;
pub use self::tables::get_builtin_tables;
pub use self::variant::Variant;
Modules§
- Converters lazily built from built-in tables.
- Struct to extract global rules from wikitext.
- Structs and functions for processing conversion rules, as defined in ConverterRule.php.
- Built-in conversion tables extracted from zhConversion.php (maintained by MediaWiki and Chinese Wikipedia) and OpenCC.
- Structs for handling variants and mapping of variants.
Structs§
- A ZhConverter, built by ZhConverterBuilder.
- A builder that helps build a ZhConverter.
Traits§
- A helper trait that truncates a str around a specified index in constant time (O(1)), intended to be used with is_hans and similar functions.
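One way such a constant-time truncation could work is sketched below using only the standard library. This is not the crate's trait: the free function `truncate_at_or_before` is a hypothetical name, and the choice to round down to the previous character boundary is an assumption.

```rust
/// Truncate `s` to at most `index` bytes, backing up to the nearest UTF-8
/// character boundary so that the result is always a valid `&str`.
fn truncate_at_or_before(s: &str, index: usize) -> &str {
    if index >= s.len() {
        return s;
    }
    let mut i = index;
    // A UTF-8 code point is at most 4 bytes long, so this loop runs at most
    // 3 times regardless of the length of `s`. Index 0 is always a boundary,
    // so `i` never underflows.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    &s[..i]
}
```

Because each character in `"简繁转换"` occupies 3 bytes, truncating at byte index 4 backs up to index 3 and returns `"简"`. Truncating long input this way before calling a function such as is_hans bounds the amount of text inspected.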
Functions§
- Determine the Chinese variant of the input text.
- Determine the Chinese variant of the input text with confidence.
- Determine whether the given text looks like Simplified Chinese over Traditional Chinese.
- Determine whether the given text looks like Simplified Chinese over Traditional Chinese.
- Helper function for general conversion using built-in converters.
- Helper function for general conversion, activating conversion rules in MediaWiki syntax.