Crate opencc_jieba_rs

§opencc-jieba-rs

opencc-jieba-rs is a high-performance Rust library for Chinese text conversion, segmentation, and keyword extraction. It integrates Jieba for word segmentation and a multi-stage OpenCC-style dictionary system for converting between different Chinese variants.

§Features

  • Simplified ↔ Traditional Chinese conversion (including Taiwan, Hong Kong, Japanese variants)
  • Multi-pass dictionary-based phrase replacement
  • Fast and accurate word segmentation using Jieba
  • Keyword extraction using TF-IDF or TextRank
  • Optional punctuation conversion (e.g., 「」 ↔ “”)
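
The multi-pass phrase replacement listed above can be illustrated with a small standalone sketch. This is not the crate's actual implementation: the function names `apply_pass` and `convert_multi_pass`, the use of plain `HashMap`s, and the toy dictionary entries are all assumptions made for illustration. Each pass scans the text left to right and replaces the longest dictionary key that matches at the current position; the output of one pass is the input of the next.

```rust
use std::collections::HashMap;

/// Hypothetical helper: apply a single dictionary pass using greedy
/// longest-match replacement. Characters with no match are copied as-is.
fn apply_pass(text: &str, dict: &HashMap<&str, &str>) -> String {
    let chars: Vec<char> = text.chars().collect();
    // Longest key length, measured in characters (not bytes).
    let max_len = dict.keys().map(|k| k.chars().count()).max().unwrap_or(0);
    let mut out = String::with_capacity(text.len());
    let mut i = 0;
    while i < chars.len() {
        let mut matched = false;
        let upper = max_len.min(chars.len() - i);
        // Try the longest candidate first, then progressively shorter ones.
        for len in (1..=upper).rev() {
            let candidate: String = chars[i..i + len].iter().collect();
            if let Some(replacement) = dict.get(candidate.as_str()) {
                out.push_str(replacement);
                i += len;
                matched = true;
                break;
            }
        }
        if !matched {
            out.push(chars[i]);
            i += 1;
        }
    }
    out
}

/// Hypothetical helper: run several dictionary passes in order,
/// feeding the result of each pass into the next.
fn convert_multi_pass(text: &str, passes: &[HashMap<&str, &str>]) -> String {
    passes
        .iter()
        .fold(text.to_string(), |acc, dict| apply_pass(&acc, dict))
}
```

With a first pass mapping 鼠标 to 鼠標 and a second, regional pass mapping 鼠標 to 滑鼠, `convert_multi_pass` would turn 鼠标 into 滑鼠. Trying longer keys first is what lets a phrase entry such as 头发 take priority over a single-character entry such as 发.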

§Example

use opencc_jieba_rs::OpenCC;

let opencc = OpenCC::new();
let s = opencc.s2t("“春眠不觉晓,处处闻啼鸟。”", true);
println!("{}", s); // -> "「春眠不覺曉,處處聞啼鳥。」"

§Use Cases

  • Text normalization for NLP and search engines
  • Cross-regional Chinese content adaptation
  • Automatic subtitle or document localization

§Crate Status

  • 🚀 Fast and parallelized
  • 🧪 Battle-tested on multi-million character corpora
  • 📦 Ready for crates.io and docs.rs publication

Modules§

dictionary_lib

Structs§

OpenCC
The main struct for performing Chinese text conversion and segmentation.

Functions§

find_max_utf8_length
Returns the maximum valid UTF-8 byte length for a string slice, ensuring that no multi-byte character is split.
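
The signature of `find_max_utf8_length` is not shown on this page, so the following is only a sketch of the general technique under an assumed interface. The name `max_utf8_prefix_len` and the `max_bytes` parameter are hypothetical. The idea is to start from a byte limit and step backwards until the cut point lands on a character boundary, so that slicing at the returned length never splits a multi-byte character.

```rust
/// Hypothetical helper (not the crate's API): return the largest byte
/// length that is at most `max_bytes` and ends on a UTF-8 character
/// boundary of `s`.
fn max_utf8_prefix_len(s: &str, max_bytes: usize) -> usize {
    // The whole string fits within the limit.
    if max_bytes >= s.len() {
        return s.len();
    }
    let mut len = max_bytes;
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(len) {
        len -= 1;
    }
    len
}
```

For example, each character of 春眠 occupies three bytes in UTF-8, so a limit of four bytes is rounded down to three, and `&s[..3]` is then a valid slice containing only 春.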