wetext-rs

A Rust implementation of WeText for text normalization in TTS (Text-to-Speech) applications.

wetext-rs

Background

This project is a Rust port of the Python wetext library, which provides a lightweight runtime for WeTextProcessing without depending on Pynini. The primary motivation for creating this Rust implementation is to:

Integrate with the Candle ecosystem - Enable seamless integration with Rust-based ML frameworks like Candle, eliminating Python dependencies in production deployments
Improve performance - Leverage Rust's memory safety and zero-cost abstractions for faster text processing
Enable standalone deployment - Create a single binary that can be deployed without Python runtime

The original Python implementation uses kaldifst for FST operations. This Rust version uses rustfst, a pure Rust implementation of OpenFST, to achieve the same functionality.

Features

Text Normalization (TN): Convert numbers, dates, currency to spoken form
- "2024年1月15日" → "二零二四年一月十五日"
- "$100" → "one hundred dollars"
Inverse Text Normalization (ITN): Convert spoken form back to written form
- "一百二十三" → "123"
Multi-language support: Chinese (zh), English (en), Japanese (ja)
English contractions expansion: "don't" → "do not"
Various text preprocessing options:
- Traditional to Simplified Chinese conversion
- Full-width to half-width character conversion
- Interjection removal
- Punctuation removal
- Erhua (儿化音) removal

Installation

Add to your Cargo.toml:

[dependencies]
wetext-rs = "0.1"

FST Weight Files

This library requires FST (Finite State Transducer) weight files for text normalization. The weight files can be downloaded from:

ModelScope: pengzhendong/wetext

Download the weight files and organize them in the following structure:

fsts/
├── traditional_to_simple.fst
├── full_to_half.fst
├── remove_interjections.fst
├── remove_puncts.fst
├── tag_oov.fst
├── en/
│   └── tn/
│       ├── tagger.fst
│       └── verbalizer.fst
├── zh/
│   ├── tn/
│   │   ├── tagger.fst
│   │   ├── verbalizer.fst
│   │   └── verbalizer_remove_erhua.fst
│   └── itn/
│       ├── tagger.fst
│       ├── tagger_enable_0_to_9.fst
│       └── verbalizer.fst
└── ja/
    ├── tn/
    │   ├── tagger.fst
    │   └── verbalizer.fst
    └── itn/
        ├── tagger.fst
        ├── tagger_enable_0_to_9.fst
        └── verbalizer.fst

Download Options

Option 1: ModelScope CLI

pip install modelscope
modelscope download --model pengzhendong/wetext --local_dir ./fsts

Option 2: Git LFS

git lfs install
git clone https://www.modelscope.cn/pengzhendong/wetext.git fsts

Usage

Basic Usage

use wetext_rs::{Normalizer, NormalizerConfig, Language, Operator};

// Create normalizer with default settings (Chinese TN, auto language detection)
let mut normalizer = Normalizer::with_defaults("path/to/fsts");

// Normalize text
let result = normalizer.normalize("2024年1月15日").unwrap();
println!("{}", result);  // 二零二四年一月十五日

With Configuration

use wetext_rs::{Normalizer, NormalizerConfig, Language, Operator};

// Configure for specific language and operation
let config = NormalizerConfig::new()
    .with_lang(Language::Zh)
    .with_operator(Operator::Tn)
    .with_fix_contractions(true)
    .with_traditional_to_simple(true);

let mut normalizer = Normalizer::new("path/to/fsts", config);
let result = normalizer.normalize("100元").unwrap();
println!("{}", result);  // 一百元

Inverse Text Normalization (ITN)

use wetext_rs::{Normalizer, NormalizerConfig, Language, Operator};

let config = NormalizerConfig::new()
    .with_lang(Language::Zh)
    .with_operator(Operator::Itn);

let mut normalizer = Normalizer::new("path/to/fsts", config);
let result = normalizer.normalize("一百二十三").unwrap();
println!("{}", result);  // 123

Convenience Function

use wetext_rs::normalize;

let result = normalize("path/to/fsts", "123").unwrap();
println!("{}", result);  // 幺二三

Configuration Options

Option	Default	Description
`lang`	`Auto`	Language: `Auto`, `En`, `Zh`, `Ja`
`operator`	`Tn`	Operation: `Tn` (text normalization), `Itn` (inverse)
`fix_contractions`	`false`	Expand English contractions
`traditional_to_simple`	`false`	Convert Traditional to Simplified Chinese
`full_to_half`	`false`	Convert full-width to half-width characters
`remove_interjections`	`false`	Remove interjections (e.g., "嗯", "啊")
`remove_puncts`	`false`	Remove punctuation marks
`tag_oov`	`false`	Tag out-of-vocabulary words
`enable_0_to_9`	`false`	Enable 0-9 digit conversion in ITN
`remove_erhua`	`false`	Remove erhua (儿化音)

Examples

Chinese Text Normalization

Input	Output
`123`	`幺二三`
`2024年`	`二零二四年`
`2024年1月15日`	`二零二四年一月十五日`
`下午3点30分`	`下午三点三十分`
`100元`	`一百元`
`3/4`	`四分之三`
`1.5`	`一点五`

Chinese Inverse Text Normalization

Input	Output
`一百二十三`	`123`
`二零二四年`	`2024年`
`一点五`	`1.5`

English Text Normalization

Input	Output
`$100`	`one hundred dollars`
`January 15, 2024`	`january fifteenth twenty twenty four`
`3.14`	`three point one four`

Japanese Text Normalization

Input	Output
`100円`	`百円`
`2024年`	`二千二十四年`
`3月15日`	`三月十五日`

Dependencies

Crate	Purpose
rustfst	FST operations (Rust implementation of OpenFST)
thiserror	Error handling
regex	Regular expressions
once_cell	Lazy initialization
serde_json	JSON parsing

Compatibility with Python WeText

This Rust implementation is designed to be compatible with the Python wetext library. The core TN/ITN functionality produces identical results for the same inputs.

Differences from Python version:

Aspect	Python wetext	Rust wetext-rs
Language detection	Chinese/English only	Adds Japanese detection (via Hiragana/Katakana)
Contractions	Runtime loaded	Compile-time embedded
Error handling	Python exceptions	`Result<T, WeTextError>`
FST library	kaldifst	rustfst

Development

Running Tests

# Run all unit and integration tests
cargo test

# Run with verbose output
cargo test -- --nocapture

Consistency Testing with Python WeText

To verify that the Rust implementation produces identical results to the Python version:

Setup Python environment (Python 3.13 recommended):

cd tests
python3.13 -m venv venv
source venv/bin/activate
pip install wetext

Generate reference outputs from Python:

python tests/generate_reference.py

This creates tests/reference_outputs.json with expected outputs from Python wetext.

Run comparison tests:

cargo test test_compare_with_python -- --ignored --nocapture

Expected output:

✓ PASS: '123' (zh/tn) => '幺二三'
✓ PASS: '2024年1月15日' (zh/tn) => '二零二四年一月十五日'
...
Results: 20 passed, 0 failed

Code Quality

# Run clippy linter
cargo clippy -- -D warnings

# Format code
cargo fmt

# Check formatting
cargo fmt -- --check

Credits

Original Python implementation: pengzhendong/wetext
FST weight files: ModelScope - pengzhendong/wetext
WeTextProcessing grammar: wenet-e2e/WeTextProcessing

License

This project is licensed under the Apache-2.0 License.

wetext-rs 0.1.2