Speech processing: VAD, ASR, TTS, diarization (spec §5, GH-133).
This module implements speech processing primitives for Whisper and other
speech models, following the spec in apr-whisper-and-cookbook-support-eoy-2025.md.
§Modules
- vad: Voice Activity Detection (Silero-style or energy-based)
- asr: Automatic Speech Recognition primitives
- diarization: Speaker diarization
- tts: Text-to-Speech primitives
§Example
use aprender::speech::vad::{Vad, VadConfig};
// Create VAD with default config
let config = VadConfig::default();
let vad = Vad::new(config).expect("default config is valid");
// Detect voice activity in audio samples
let samples: Vec<f32> = vec![0.0; 16000]; // 1 second of silence at 16kHz
let segments = vad.detect(&samples, 16000).expect("valid input");
assert!(segments.is_empty()); // No speech detected in silence
§PMAT Compliance
This module enforces zero-tolerance quality rules:
- No unwrap() - audio streams can’t panic
- No panic!() - real-time processing requirement
- All public APIs return Result<T, E>
- #[must_use] on all Results
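The pattern these rules imply can be sketched as follows. Note that the error variants and the validation logic here are illustrative assumptions, not the crate's actual `SpeechError` definition:

```rust
/// Illustrative error type; the real `SpeechError` enum may differ.
#[derive(Debug)]
enum SpeechError {
    InvalidSampleRate(u32),
    EmptyInput,
}

/// Result alias mirroring the `SpeechResult` pattern described above.
type SpeechResult<T> = Result<T, SpeechError>;

/// Validate input up front and return `Err` instead of panicking.
#[must_use = "handle the error instead of unwrapping"]
fn check_input(samples: &[f32], sample_rate: u32) -> SpeechResult<()> {
    if samples.is_empty() {
        return Err(SpeechError::EmptyInput);
    }
    // Hypothetical supported rates, chosen only for this sketch.
    if sample_rate != 8000 && sample_rate != 16000 {
        return Err(SpeechError::InvalidSampleRate(sample_rate));
    }
    Ok(())
}

fn main() {
    // Callers match on the Result rather than calling unwrap().
    match check_input(&[0.0; 160], 44_100) {
        Ok(()) => println!("ok"),
        Err(e) => println!("rejected: {e:?}"),
    }
}
```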
§References
- Silero VAD: https://github.com/snakers4/silero-vad
- WebRTC VAD: Energy-based voice activity detection
- Whisper: https://github.com/openai/whisper
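As a rough illustration of the energy-based approach referenced above (not the crate's actual `Vad` implementation; the frame length and RMS threshold are arbitrary assumptions), a minimal detector can compute per-frame RMS energy and merge consecutive above-threshold frames into segments:

```rust
/// A detected segment, in sample offsets. Hypothetical stand-in for `VoiceSegment`.
#[derive(Debug, PartialEq)]
struct Segment {
    start: usize,
    end: usize,
}

/// Minimal energy-based VAD sketch: frames whose RMS energy exceeds
/// `threshold` count as speech; adjacent speech frames are merged.
fn detect_energy(samples: &[f32], frame_len: usize, threshold: f32) -> Vec<Segment> {
    let mut segments = Vec::new();
    let mut current: Option<Segment> = None;
    for (i, frame) in samples.chunks(frame_len).enumerate() {
        let rms = (frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32).sqrt();
        let start = i * frame_len;
        if rms > threshold {
            match current.as_mut() {
                Some(seg) => seg.end = start + frame.len(),
                None => current = Some(Segment { start, end: start + frame.len() }),
            }
        } else if let Some(seg) = current.take() {
            segments.push(seg);
        }
    }
    if let Some(seg) = current {
        segments.push(seg);
    }
    segments
}

fn main() {
    // 0.5 s of silence followed by 0.5 s of a loud signal, at 16 kHz.
    let mut samples = vec![0.0f32; 8000];
    samples.extend(std::iter::repeat(0.5).take(8000));
    let segs = detect_energy(&samples, 160, 0.01);
    println!("{segs:?}"); // [Segment { start: 8000, end: 16000 }]
}
```

Production VAD (Silero-style) replaces the RMS heuristic with a learned model, but the segment-merging logic is broadly similar.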
Modules§
- asr
- Automatic Speech Recognition (ASR) primitives.
- diarization
- Speaker diarization module.
- tts
- Text-to-Speech (TTS) module (GH-133).
- vad
- Voice Activity Detection (VAD) module.
Structs§
- AlignmentInfo
- Alignment information between text and audio.
- AsrConfig
- ASR configuration for transcription sessions.
- AsrSession
- Stateful ASR session for transcription.
- CrossAttentionWeights
- Cross-attention weights for encoder-decoder alignment (G5).
- DiarizationConfig
- Configuration for speaker diarization.
- DiarizationResult
- Complete diarization result.
- FastSpeech2Synthesizer
- FastSpeech2-style TTS synthesizer.
- HifiGanVocoder
- HiFi-GAN vocoder.
- LanguageDetection
- Language detection result (G2).
- Segment
- A single transcription segment with timing.
- Speaker
- Identified speaker with embedding.
- SpeakerSegment
- A segment attributed to a specific speaker.
- StreamingTranscription
- Iterator for streaming transcription results.
- SynthesisRequest
- A synthesis request with text and optional controls.
- SynthesisResult
- Synthesis result containing audio and metadata.
- Transcription
- Complete transcription result.
- TtsConfig
- TTS configuration.
- Vad
- Voice Activity Detector using energy-based detection.
- VadConfig
- Configuration for Voice Activity Detection.
- VitsSynthesizer
- VITS-style end-to-end TTS.
- VoiceSegment
- A detected voice segment with start and end times.
- WordTiming
- Word-level timing information.
Enums§
- SpeechError
- Speech processing error type.
Constants§
- SUPPORTED_LANGUAGES
- List of supported language codes (ISO 639-1).
Traits§
- AsrModel
- Trait for ASR model implementations
- SpeechSynthesizer
- Trait for speech synthesis.
- Vocoder
- Trait for neural vocoder (mel to audio).
Functions§
- detect_language
- Detect language from audio features (G2).
- estimate_duration
- Estimate synthesis duration from text.
- is_language_supported
- Check if a language code is supported.
- normalize_text
- Normalize text for TTS (lowercase, expand abbreviations, etc.).
- split_sentences
- Split text into sentences for chunked synthesis.
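As a rough sketch of what sentence splitting for chunked synthesis might look like (this is not the crate's actual `split_sentences` implementation; the terminator set `.`, `!`, `?` is an assumption, and a real splitter would also handle abbreviations and decimals):

```rust
/// Naive sentence splitter: breaks on '.', '!' and '?' and trims whitespace.
/// Illustrative only; the crate's `split_sentences` may behave differently.
fn split_sentences(text: &str) -> Vec<String> {
    let mut sentences = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        current.push(ch);
        if matches!(ch, '.' | '!' | '?') {
            let s = current.trim().to_string();
            if !s.is_empty() {
                sentences.push(s);
            }
            current.clear();
        }
    }
    // Keep any trailing text that lacks a terminator.
    let tail = current.trim();
    if !tail.is_empty() {
        sentences.push(tail.to_string());
    }
    sentences
}

fn main() {
    let parts = split_sentences("Hello there! How are you? Fine.");
    println!("{parts:?}"); // ["Hello there!", "How are you?", "Fine."]
}
```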
Type Aliases§
- SpeechResult
- Result type for speech operations.