Crate opencc_jieba_rs

§opencc-jieba-rs

opencc-jieba-rs is a high-performance Rust library for Chinese text conversion, segmentation, and keyword extraction. It integrates Jieba for word segmentation and a multi-stage OpenCC-style dictionary system for converting between different Chinese variants.

§Features

  • Simplified ↔ Traditional Chinese conversion (including Taiwan, Hong Kong, Japanese variants)
  • Multi-pass dictionary-based phrase replacement
  • Fast and accurate word segmentation using Jieba
  • Keyword extraction using TF-IDF or TextRank
  • Optional punctuation conversion (e.g., 「」 ↔ “”)

§Example

use opencc_jieba_rs::OpenCC;

let opencc = OpenCC::new();
let s = opencc.s2t("“春眠不觉晓,处处闻啼鸟。”", true);
println!("{}", s); // -> "「春眠不覺曉,處處聞啼鳥。」"

§Use Cases

  • Text normalization for NLP and search engines
  • Cross-regional Chinese content adaptation
  • Automatic subtitle or document localization

§Crate Status

  • 🚀 Fast and parallelized
  • 🧪 Battle-tested on multi-million character corpora
  • 📦 Ready for crates.io and docs.rs publication

§Conversion Overview (OpenCC + Jieba)

opencc_jieba_rs::OpenCC provides a set of high-level helpers that mirror common OpenCC configurations, built on top of:

  • OpenCC dictionaries (character / phrase mappings)
  • Jieba segmentation for phrase-level matching
  • Optional punctuation conversion

All methods take &self and &str input and return a newly allocated String.

§Quick Start

let opencc = opencc_jieba_rs::OpenCC::new();

let s = "这里进行着“汉字转换”测试。";
let t = opencc.s2t(s, false);  // Simplified → Traditional (phrase-level)
let tw = opencc.t2tw(&t);      // Traditional → Taiwan Traditional

§Phrase-Level vs Character-Level

There are two main categories of conversion:

  1. Phrase-level conversions: use Jieba segmentation and multiple dictionaries to correctly handle idioms, multi-character words, and regional preferences.

  2. Character-level conversions: use only character-variant dictionaries (no segmentation); ideal for high-speed normalization where phrase context is unimportant.
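The difference between the two categories shows up with ambiguous characters. A minimal pure-Rust sketch (the mapping table below is a toy stand-in, not the crate's real dictionaries) demonstrates how a character-only table mis-converts a word like 头发 (hair), which needs the context-aware form 頭髮 rather than the character-by-character 頭發:

```rust
use std::collections::HashMap;

// Character-level conversion: map each char independently.
// The table is a toy stand-in for a characters dictionary.
fn char_convert(input: &str, table: &HashMap<char, char>) -> String {
    input.chars().map(|c| *table.get(&c).unwrap_or(&c)).collect()
}

fn main() {
    let table = HashMap::from([('头', '頭'), ('发', '發')]);
    // '发' maps to 發 (to emit) here, but in 头发 (hair) the correct
    // Traditional form is 髮 — only phrase-level context can tell.
    assert_eq!(char_convert("头发", &table), "頭發");
}
```

Phrase-level conversion resolves this by matching the segmented word 头发 as a unit against a phrase dictionary before falling back to per-character mapping.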

§Core Simplified ↔ Traditional

| Direction | Method | Level | Notes |
|-----------|--------|-------|-------|
| S → T | OpenCC::s2t | Phrase | Standard Simplified → Traditional. |
| T → S | OpenCC::t2s | Phrase | Standard Traditional → Simplified. |
| S → T | OpenCC::st | Character | Fast char-only S→T (no segmentation). |
| T → S | OpenCC::ts | Character | Fast char-only T→S (no segmentation). |

§s2t / t2s

  • Use phrase dictionaries + Jieba segmentation.
  • Preserve idioms and phrase-level semantics where possible.
  • Recommended for user-facing text conversion.

§st / ts

  • Use only st_characters / ts_characters dictionaries.
  • Do not segment or match phrases.
  • Ideal for:
    • bulk normalization
    • preprocessing before heavier conversions

§Taiwan Traditional (Tw)

| Direction | Method | Description |
|-----------|--------|-------------|
| T → Tw | OpenCC::t2tw | Standard Traditional → Taiwan variants. |
| T → Tw (phr.) | OpenCC::t2twp | T→Tw with an extra phrase refinement round. |
| Tw → T | OpenCC::tw2t | Taiwan variants → Standard Traditional. |
| Tw → T (phr.) | OpenCC::tw2tp | Tw→T with additional reverse phrase normalization. |
  • t2tw uses tw_variants for Taiwan-specific character/word forms.
  • t2twp performs two rounds: phrases first (tw_phrases), then variants (tw_variants).
  • tw2t and tw2tp are reverse directions, using *_rev dictionaries to normalize back to standard Traditional.
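The two-round flow behind t2twp can be pictured as a simple pipeline: one pass of phrase replacement, then one pass of character-variant mapping. A hedged sketch with toy stand-in dictionaries (the real crate drives tw_phrases and tw_variants through its multi-stage dictionary system, not naive string replacement):

```rust
use std::collections::HashMap;

// Round 1: naive phrase replacement (toy stand-in for tw_phrases).
fn apply_phrases(s: &str, phrases: &HashMap<&str, &str>) -> String {
    let mut out = s.to_string();
    for (from, to) in phrases {
        out = out.replace(from, to);
    }
    out
}

// Round 2: per-character variant mapping (toy stand-in for tw_variants).
fn apply_variants(s: &str, variants: &HashMap<char, char>) -> String {
    s.chars().map(|c| *variants.get(&c).unwrap_or(&c)).collect()
}

fn main() {
    let phrases = HashMap::from([("鼠標", "滑鼠")]);
    let variants = HashMap::from([('裏', '裡')]);
    let input = "鼠標在裏面";
    // Phrases first, then variants — the order t2twp is described to use.
    let out = apply_variants(&apply_phrases(input, &phrases), &variants);
    assert_eq!(out, "滑鼠在裡面");
}
```

Running the variant pass first would be wrong in general: phrase keys are stored in one canonical script, so rewriting characters before phrase matching can cause phrase lookups to miss.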

§Hong Kong Traditional (HK)

| Direction | Method | Description |
|-----------|--------|-------------|
| T → HK | OpenCC::t2hk | Standard Traditional → Hong Kong Traditional. |
| HK → T | OpenCC::hk2t | Hong Kong Traditional → Standard Traditional. |
  • t2hk applies hk_variants (HK-specific variants and preferences).
  • hk2t uses hk_variants_rev_phrases + hk_variants_rev to normalize back to standard Traditional.

§Japanese Kanji (Shinjitai / Kyūjitai)

| Direction | Method | Description |
|-----------|--------|-------------|
| T → JP | OpenCC::t2jp | Traditional → Japanese Shinjitai-like variants (Kanji). |
| JP → T | OpenCC::jp2t | Japanese Shinjitai → Traditional (Kyūjitai-style) mapping. |
  • t2jp uses jp_variants to map Traditional forms to standard Japanese Shinjitai (e.g. 體 → 体, 圖 → 図 where applicable).
  • jp2t combines jps_phrases, jps_characters, and jp_variants_rev to reverse these mappings back to Traditional Chinese.

§Punctuation and Symbols

Most high-level methods enable punctuation conversion by default, using OpenCC’s punctuation dictionaries to normalize:

  • Chinese-style quotes / brackets
  • Full-width / half-width punctuation
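As an illustration, the quote normalization can be sketched as a plain character substitution (a toy version; the crate's punctuation dictionaries cover more symbols than the four quotes shown here):

```rust
// Toy quote normalization in the s2t direction: curly double quotes
// become corner brackets, curly single quotes become white corner brackets.
fn punct_s2t(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            '“' => '「',
            '”' => '」',
            '‘' => '『',
            '’' => '』',
            _ => c,
        })
        .collect()
}

fn main() {
    assert_eq!(punct_s2t("“春眠不觉晓”"), "「春眠不觉晓」");
}
```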

Lower-level helpers inside this crate may expose more granular control if you need to:

  • disable punctuation conversion
  • run custom dictionary pipelines
  • integrate with your own segmentation logic

§When to Use What?

  • Use s2t / t2s for general purpose Simplified/Traditional conversion.
  • Use t2tw / t2twp / tw2t / tw2tp when targeting Taiwan content or normalizing it.
  • Use t2hk / hk2t for Hong Kong–specific localized text.
  • Use t2jp / jp2t for interoperability with Japanese Kanji forms, when only character-shape conversion is desired (not full translation).
  • Use st / ts when you need fast, character-only normalization with minimal overhead.

For segmentation-only or keyword extraction APIs, see the OpenCC struct's segmentation methods together with the Keyword struct, the KeywordMethod enum, and the POS_KEYWORDS constant listed below.

These utilities can be used independently of Chinese variant conversion, or combined with OpenCC::convert results for downstream NLP tasks such as indexing, text analysis, and keyword extraction.

Modules§

dictionary_lib

Structs§

Keyword
A keyword together with its extraction weight.
OpenCC
The main struct for performing Chinese text conversion and segmentation.

Enums§

KeywordMethod
Keyword extraction algorithm.
OpenccConfig
OpenCC conversion configuration (strongly-typed).

Constants§

POS_KEYWORDS
Recommended part-of-speech (POS) tags for keyword extraction.

Functions§

find_max_utf8_length
Returns the maximum valid UTF-8 byte length for a string slice, ensuring no partial characters.
is_delimiter
Tests whether a character is a delimiter.
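The boundary-clamping behavior described for find_max_utf8_length can be sketched in plain Rust; the signature below is an assumption based on the summary above, using std's str::is_char_boundary to back off to a safe split point:

```rust
// Assumed behavior: clamp a byte length down to the nearest char
// boundary so a slice taken at that length never splits a character.
fn find_max_utf8_length(s: &str, max_len: usize) -> usize {
    if max_len >= s.len() {
        return s.len();
    }
    let mut end = max_len;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    end
}

fn main() {
    // "漢" and "字" are 3 bytes each in UTF-8, so a 4-byte cap
    // must fall back to the boundary after the first character.
    assert_eq!(find_max_utf8_length("漢字", 4), 3);
    assert_eq!(find_max_utf8_length("abc", 10), 3);
}
```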