§opencc-jieba-rs
opencc-jieba-rs is a high-performance Rust library for Chinese text conversion,
segmentation, and keyword extraction. It integrates Jieba for word segmentation
and a multi-stage OpenCC-style dictionary system for converting between different Chinese variants.
§Features
- Simplified ↔ Traditional Chinese conversion (including Taiwan, Hong Kong, Japanese variants)
- Multi-pass dictionary-based phrase replacement
- Fast and accurate word segmentation using Jieba
- Keyword extraction using TF-IDF or TextRank
- Optional punctuation conversion (e.g., 「」 ↔ “”)
§Example
```rust
use opencc_jieba_rs::OpenCC;

let opencc = OpenCC::new();
let s = opencc.s2t("“春眠不觉晓,处处闻啼鸟。”", true);
println!("{}", s); // -> "「春眠不覺曉,處處聞啼鳥。」"
```
§Use Cases
- Text normalization for NLP and search engines
- Cross-regional Chinese content adaptation
- Automatic subtitle or document localization
§Crate Status
- 🚀 Fast and parallelized
- 🧪 Battle-tested on multi-million character corpora
- 📦 Ready for crates.io and docs.rs publication
§Conversion Overview (OpenCC + Jieba)
opencc_jieba_rs::OpenCC provides a set of high-level helpers that mirror
common OpenCC configurations, built on top of:
- OpenCC dictionaries (character / phrase mappings)
- Jieba segmentation for phrase-level matching
- Optional punctuation conversion
All methods take &self and &str input and return a newly allocated
String.
§Quick Start
```rust
let opencc = opencc_jieba_rs::OpenCC::new();
let s = "这里进行着“汉字转换”测试。";
let t = opencc.s2t(s, false); // Simplified → Traditional (phrase-level)
let tw = opencc.t2tw(&t);     // Traditional → Taiwan Traditional
```
§Phrase-Level vs Character-Level
There are two main categories of conversion:
- Phrase-level conversions use Jieba segmentation and multiple dictionaries to correctly handle idioms, multi-character words, and regional preferences.
- Character-level conversions use only character-variant dictionaries (no segmentation), ideal for high-speed normalization where phrase context is unimportant.
§Core Simplified ↔ Traditional
| Direction | Method | Level | Notes |
|---|---|---|---|
| S → T | OpenCC::s2t | Phrase | Standard Simplified → Traditional. |
| T → S | OpenCC::t2s | Phrase | Standard Traditional → Simplified. |
| S → T | st | Character | Fast char-only S→T (no segmentation). |
| T → S | ts | Character | Fast char-only T→S (no segmentation). |
§s2t / t2s
- Use phrase dictionaries + Jieba segmentation.
- Preserve idioms and phrase-level semantics where possible.
- Recommended for user-facing text conversion.
§st / ts
- Use only the st_characters / ts_characters dictionaries.
- Do not segment or match phrases.
- Ideal for:
  - bulk normalization
  - preprocessing before heavier conversions
§Taiwan Traditional (Tw)
| Direction | Method | Description |
|---|---|---|
| T → Tw | OpenCC::t2tw | Standard Traditional → Taiwan variants. |
| T → Tw (phr.) | OpenCC::t2twp | T→Tw with an extra phrase refinement round. |
| Tw → T | OpenCC::tw2t | Taiwan variants → Standard Traditional. |
| Tw → T (phr.) | OpenCC::tw2tp | Tw→T with additional reverse phrase normalization. |
- t2tw uses tw_variants for Taiwan-specific character/word forms.
- t2twp performs two rounds: phrases first (tw_phrases), then variants (tw_variants).
- tw2t and tw2tp are reverse directions, using *_rev dictionaries to normalize back to standard Traditional.
§Hong Kong Traditional (HK)
| Direction | Method | Description |
|---|---|---|
| T → HK | OpenCC::t2hk | Standard Traditional → Hong Kong Traditional. |
| HK → T | OpenCC::hk2t | Hong Kong Traditional → Standard Traditional. |
- t2hk applies hk_variants (HK-specific variants and preferences).
- hk2t uses hk_variants_rev_phrases + hk_variants_rev to normalize back to standard Traditional.
§Japanese Kanji (Shinjitai / Kyūjitai)
| Direction | Method | Description |
|---|---|---|
| T → JP | OpenCC::t2jp | Traditional → Japanese Shinjitai-like variants (Kanji). |
| JP → T | OpenCC::jp2t | Japanese Shinjitai → Traditional (Kyūjitai-style) mapping. |
- t2jp uses jp_variants to map Traditional forms to standard Japanese Shinjitai (e.g. 體 → 体, 圖 → 図 where applicable).
- jp2t combines jps_phrases, jps_characters, and jp_variants_rev to reverse these mappings back to Traditional Chinese.
§Punctuation and Symbols
Most high-level methods enable punctuation conversion by default, using OpenCC’s punctuation dictionaries to normalize:
- Chinese-style quotes / brackets
- Full-width / half-width punctuation
Lower-level helpers inside this crate may expose more granular control if you need to:
- disable punctuation conversion
- run custom dictionary pipelines
- integrate with your own segmentation logic
§When to Use What?
- Use s2t / t2s for general-purpose Simplified/Traditional conversion.
- Use t2tw / t2twp / tw2t / tw2tp when targeting Taiwan content or normalizing it.
- Use t2hk / hk2t for Hong Kong–specific localized text.
- Use t2jp / jp2t for interoperability with Japanese Kanji forms, when only character-shape conversion is desired (not full translation).
- Use st / ts when you need fast, character-only normalization with minimal overhead.
For segmentation-only or keyword extraction APIs, see:
- OpenCC::jieba_cut — Jieba segmentation (accurate mode)
- OpenCC::jieba_cut_for_search — Jieba segmentation optimized for search indexing
- OpenCC::jieba_cut_all — Jieba full segmentation mode
- OpenCC::keyword_extract_textrank — keyword extraction using TextRank
- OpenCC::keyword_extract_tfidf — keyword extraction using TF-IDF
These utilities can be used independently of Chinese variant conversion,
or combined with OpenCC::convert results for downstream NLP tasks such
as indexing, text analysis, and keyword extraction.
Modules§
Structs§
- Keyword — Keyword with weight.
- OpenCC — The main struct for performing Chinese text conversion and segmentation.
Enums§
- KeywordMethod — Keyword extraction algorithm.
- OpenccConfig — OpenCC conversion configuration (strongly-typed).
Constants§
- POS_KEYWORDS — Recommended part-of-speech (POS) tags for keyword extraction.
Functions§
- find_max_utf8_length — Returns the maximum valid UTF-8 byte length for a string slice, ensuring no partial characters.
- is_delimiter — Tests whether a character is a delimiter.