piper-phoneme-streaming

piper-phoneme-streaming is a high-performance Rust library for streaming grapheme-to-phoneme (G2P) conversion, that is, turning text into phonemes. It is built to integrate with streaming Text-to-Speech (TTS) engines such as Piper and others based on espeak-ng.

What Problems Does It Solve?

Typical G2P (Grapheme-to-Phoneme) approaches wait for a full sentence or paragraph before converting text to phonemes. In real-time or streaming TTS applications, this introduces unacceptable latency.

piper-phoneme-streaming addresses this by:

Native streaming: Processes text character by character and yields phonemes as soon as there is enough context (e.g., at word boundaries).
Dynamic language detection: Handles mixed-language input on the fly. It detects language boundaries (e.g., English mixed with Vietnamese) and switches phonemization strategies mid-sentence without interrupting the stream.
Text normalization: Built-in strategies expand abbreviations, dates, numbers, and acronyms before phonemization.
espeak-ng parity: Interprets espeak-ng's binary phoneme table and dictionary formats directly, so that generated phonemes match exactly what Piper and other espeak-ng-based models expect.
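
To make the "enough context" idea concrete, here is a minimal, standalone sketch of word-boundary buffering. It does not use this crate and is not its implementation: the WordBuffer type and the stream_words helper are hypothetical names introduced only for illustration, and the boundary rule (whitespace or ASCII punctuation) is an assumption that is much simpler than what a real phonemizer needs.

```rust
/// Hypothetical illustration only, not the crate's implementation.
/// Buffers pushed text and releases a word once a boundary is seen.
#[derive(Default)]
struct WordBuffer {
    pending: String,
}

impl WordBuffer {
    /// Accept a chunk of text and return every word completed by it.
    fn push(&mut self, chunk: &str) -> Vec<String> {
        let mut ready = Vec::new();
        for ch in chunk.chars() {
            if ch.is_whitespace() || ch.is_ascii_punctuation() {
                // A boundary closes the current word, if there is one.
                if !self.pending.is_empty() {
                    ready.push(std::mem::take(&mut self.pending));
                }
            } else {
                self.pending.push(ch);
            }
        }
        ready
    }

    /// Release whatever is still buffered when the stream ends.
    fn finish(&mut self) -> Option<String> {
        if self.pending.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.pending))
        }
    }
}

/// Feed chunks one by one and collect the words in emission order.
fn stream_words(chunks: &[&str]) -> Vec<String> {
    let mut buf = WordBuffer::default();
    let mut out = Vec::new();
    for chunk in chunks {
        out.extend(buf.push(chunk));
    }
    out.extend(buf.finish());
    out
}
```

Calling stream_words(&["Hel", "lo wor", "ld"]) yields the words "Hello" and "world": the first word is released as soon as the space arrives in the second chunk, and the last word is only released by finish, which mirrors why the streaming API needs an explicit end-of-stream call.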

How It Works

The library is built around a push-based architecture, exposed through StreamingG2P:

Text Expansion & Normalization: Input characters are processed by TextExpand, which expands numbers, money amounts, and common abbreviations incrementally as text arrives.
Language Detection: If multiple languages are enabled, dynamic heuristics detect the language of incoming text batches on the fly.
Word Phonemizer: The WordPhonemizer matches the normalized text against the appropriate language's dictionary and runtime rules from embedded espeak-ng data.
Sentence Upgrade: StreamingSentencePhonemeUpgrade applies sentence-level syntax rules, stress assignments, and intonation corrections before finalizing the phoneme token stream.
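
As a rough illustration of the language detection step, the sketch below classifies a batch of text as English or Vietnamese by looking for letters with diacritics. This is an assumed heuristic, not the one the library uses: the Lang enum, is_vietnamese_marker, and detect_batch are hypothetical names, the input is assumed to be precomposed (NFC) text, and the rule would misclassify other accented languages such as French.

```rust
/// Hypothetical language tag for this sketch; the crate's own `Language`
/// enum may look different.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lang {
    English,
    Vietnamese,
}

/// True for precomposed Latin letters with diacritics, which Vietnamese
/// uses heavily and plain English text normally does not.
fn is_vietnamese_marker(ch: char) -> bool {
    let in_range = matches!(
        ch,
        '\u{00C0}'..='\u{024F}'      // Latin-1 letters, Latin Extended-A/B
        | '\u{1EA0}'..='\u{1EF9}'    // Vietnamese part of Latin Extended Additional
    );
    // Filters out the symbols in the first range (e.g. the multiplication sign).
    in_range && ch.is_alphabetic()
}

/// Classify one batch of text. `fallback` is returned when the batch has no
/// letters at all (for example, only punctuation), so the caller can keep
/// using whatever language was active before.
fn detect_batch(batch: &str, fallback: Lang) -> Lang {
    if batch.chars().any(is_vietnamese_marker) {
        Lang::Vietnamese
    } else if batch.chars().any(|c| c.is_ascii_alphabetic()) {
        Lang::English
    } else {
        fallback
    }
}
```

Working on batches rather than single words matters here: a word such as "Xin" contains no diacritics on its own, but in the batch "Xin chào" the second word tips the decision to Vietnamese.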

Usage Examples

Streaming Conversion

The streaming API consumes text incrementally and returns phonemes as soon as they are available.

use piper_phoneme_streaming::{StreamingG2P, Language};

fn main() {
    // Initialize the engine with supported languages
    let g2p = StreamingG2P::with_languages(
        &[Language::English, Language::Vietnamese], 
        Language::English
    ).unwrap();

    // Create a new streaming session (maintains state across pushed chunks)
    let mut session = g2p.new_session();
    let text = "Hello world. Xin chào thế giới.";

    // Push characters individually or in chunks
    for ch in text.chars() {
        let output = g2p.push_text(&mut session, &ch.to_string()).unwrap();
        for phoneme in output {
            print!("{}", phoneme.token);
        }
    }

    // Flush any remaining buffered phonemes once the stream ends
    let tail = g2p.finish(&mut session).unwrap();
    for phoneme in tail {
        print!("{}", phoneme.token);
    }
}

Normal Conversion

If streaming is not required, you can use the full conversion API to process the entire input at once.

use piper_phoneme_streaming::{FullG2p, Language};

fn main() {
    let g2p = FullG2p::new(Language::English).unwrap();
    let out = g2p.g2p("Hello world!").unwrap();
    
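    // Join the tokens into one string; to_string() avoids moving `token`
    // out of a borrowed item, which fails to compile if `token` is a String
    let out_str: String = out.iter().map(|t| t.token.to_string()).collect();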
    println!("{}", out_str);
}

Adding to Your Project

Add the dependency to your Cargo.toml:

[dependencies]
piper-phoneme-streaming = { path = "..." } # Or specify version if published