Text-to-Speech (TTS) module (GH-133).
Provides TTS primitives for:
- Neural TTS synthesis
- Mel spectrogram generation from text
- Vocoder integration (HiFi-GAN, WaveGlow, etc.)
- Multi-speaker synthesis
§Architecture
Text → Text Processing → Acoustic Model → Mel Spectrogram → Vocoder → Audio
              ↓                 ↑
      Phoneme/Grapheme  [Speaker Embedding]
          Encoding       [Prosody Control]
§Example
use aprender::speech::tts::{TtsConfig, SpeechSynthesizer, SynthesisRequest};
let config = TtsConfig::default();
assert_eq!(config.sample_rate, 22050);
assert_eq!(config.n_mels, 80);
§Supported Models
- Tacotron2-style (attention-based)
- FastSpeech2-style (non-autoregressive)
- VITS-style (end-to-end variational)
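The two-stage flow in the Architecture diagram (acoustic model producing a mel spectrogram, vocoder turning it into audio) can be sketched as two traits composed by one function. This is a minimal, self-contained illustration and not this module's API: `AcousticModelSketch`, `VocoderSketch`, `ToyAcoustic`, `ToyVocoder`, `SketchError`, and `synthesize_sketch` are hypothetical names, and the hop length and the 220 Hz tone are assumptions chosen only to make the toy stages produce output.

```rust
use std::f32::consts::PI;

/// Hypothetical error type for this sketch (not the crate's error type).
#[derive(Debug, PartialEq)]
pub enum SketchError {
    EmptyText,
    EmptyMel,
}

/// A mel spectrogram stored as frames x mel bins.
pub type Mel = Vec<Vec<f32>>;

/// First stage of the diagram: text -> mel spectrogram.
pub trait AcousticModelSketch {
    fn text_to_mel(&self, text: &str) -> Result<Mel, SketchError>;
}

/// Second stage of the diagram: mel spectrogram -> audio samples.
pub trait VocoderSketch {
    fn mel_to_audio(&self, mel: &Mel) -> Result<Vec<f32>, SketchError>;
}

/// Toy acoustic model: a fixed number of all-zero frames per character.
pub struct ToyAcoustic {
    pub n_mels: usize,
    pub frames_per_char: usize,
}

impl AcousticModelSketch for ToyAcoustic {
    fn text_to_mel(&self, text: &str) -> Result<Mel, SketchError> {
        let n_chars = text.trim().chars().count();
        if n_chars == 0 {
            return Err(SketchError::EmptyText);
        }
        Ok(vec![vec![0.0; self.n_mels]; n_chars * self.frames_per_char])
    }
}

/// Toy vocoder: `hop_length` samples of a 220 Hz sine per mel frame.
pub struct ToyVocoder {
    pub hop_length: usize,
    pub sample_rate: u32,
}

impl VocoderSketch for ToyVocoder {
    fn mel_to_audio(&self, mel: &Mel) -> Result<Vec<f32>, SketchError> {
        if mel.is_empty() {
            return Err(SketchError::EmptyMel);
        }
        let n_samples = mel.len() * self.hop_length;
        let step = 2.0 * PI * 220.0 / self.sample_rate as f32;
        Ok((0..n_samples).map(|i| (step * i as f32).sin()).collect())
    }
}

/// Chain the two stages; errors propagate with `?` instead of `unwrap()`.
pub fn synthesize_sketch<A, V>(acoustic: &A, vocoder: &V, text: &str) -> Result<Vec<f32>, SketchError>
where
    A: AcousticModelSketch,
    V: VocoderSketch,
{
    let mel = acoustic.text_to_mel(text)?;
    vocoder.mel_to_audio(&mel)
}
```

Keeping the acoustic model and the vocoder behind separate traits lets either stage be swapped independently; an end-to-end model in the VITS style would implement the whole text-to-audio step in one type instead.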
§References
- Wang, Y., et al. (2017). Tacotron: End-to-End Speech Synthesis.
- Ren, Y., et al. (2020). FastSpeech2: Fast and High-Quality TTS.
- Kim, J., et al. (2021). Conditional Variational Autoencoder with Adversarial Learning.
§PMAT Compliance
- Zero unwrap() calls
- All public APIs return Result<T, E> where fallible
Structs§
- AlignmentInfo - Alignment information between text and audio.
- FastSpeech2Synthesizer - FastSpeech2-style TTS synthesizer.
- HifiGanVocoder - HiFi-GAN vocoder.
- SynthesisRequest - A synthesis request with text and optional controls.
- SynthesisResult - Synthesis result containing audio and metadata.
- TtsConfig - TTS configuration.
- VitsSynthesizer - VITS-style end-to-end TTS.
Traits§
- SpeechSynthesizer - Trait for speech synthesis.
- Vocoder - Trait for neural vocoder (mel to audio).
Functions§
- estimate_duration - Estimate synthesis duration from text.
- normalize_text - Normalize text for TTS (lowercase, expand abbreviations, etc.).
- split_sentences - Split text into sentences for chunked synthesis.
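To give a feel for what these text helpers do, here is a rough standalone sketch of each. The signatures of the real functions are not shown on this page, so the `_sketch` names, parameters, and return types below are assumptions, as are the two-entry abbreviation table and the characters-per-second speaking-rate model.

```rust
/// Lowercase the text, expand a few abbreviations, and collapse whitespace.
/// The abbreviation table is illustrative only.
pub fn normalize_text_sketch(text: &str) -> String {
    const ABBREVIATIONS: [(&str, &str); 2] = [("dr.", "doctor"), ("mr.", "mister")];
    text.to_lowercase()
        .split_whitespace()
        .map(|word| match ABBREVIATIONS.iter().find(|(abbr, _)| *abbr == word) {
            Some((_, full)) => *full,
            None => word,
        })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Split on '.', '!' and '?', keeping the terminator with its sentence.
/// Naive: it also splits after abbreviations such as "Dr." and inside "...".
pub fn split_sentences_sketch(text: &str) -> Vec<String> {
    let mut sentences = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        current.push(ch);
        if matches!(ch, '.' | '!' | '?') {
            let trimmed = current.trim();
            if !trimmed.is_empty() {
                sentences.push(trimmed.to_string());
            }
            current.clear();
        }
    }
    // Keep any trailing text that has no terminator.
    let tail = current.trim();
    if !tail.is_empty() {
        sentences.push(tail.to_string());
    }
    sentences
}

/// Estimate duration in seconds as non-whitespace characters divided by an
/// assumed speaking rate. Returns `None` for a non-positive or NaN rate.
pub fn estimate_duration_sketch(text: &str, chars_per_second: f32) -> Option<f32> {
    if !(chars_per_second > 0.0) {
        return None;
    }
    let n_chars = text.chars().filter(|c| !c.is_whitespace()).count();
    Some(n_chars as f32 / chars_per_second)
}
```

In a chunked pipeline one would typically split first and then normalize each sentence, since normalizing first would turn "Dr." into "doctor" and remove one source of false sentence breaks only by accident of ordering.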