Speech processing: VAD, ASR, TTS, diarization (spec §5, GH-133).
This module implements speech processing primitives for Whisper and other
speech models, following the spec in apr-whisper-and-cookbook-support-eoy-2025.md.
§Modules
- vad: Voice Activity Detection (Silero-style or energy-based)
- asr: Automatic Speech Recognition primitives
- diarization: Speaker diarization
- tts: Text-to-Speech primitives
§Example
use aprender::speech::vad::{Vad, VadConfig};
// Create VAD with default config
let config = VadConfig::default();
let vad = Vad::new(config).expect("default config is valid");
// Detect voice activity in audio samples
let samples: Vec<f32> = vec![0.0; 16000]; // 1 second of silence at 16kHz
let segments = vad.detect(&samples, 16000).expect("valid input");
assert!(segments.is_empty()); // No speech detected in silence
§PMAT Compliance
This module enforces zero-tolerance quality rules:
- No unwrap() - audio streams can’t panic
- No panic!() - real-time processing requirement
- All public APIs return Result<T, E>
- #[must_use] on all Results
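The pattern these rules imply can be sketched as follows. Note that the error variants and the validation logic here are illustrative assumptions, not the crate's actual `SpeechError` definition:

```rust
/// Illustrative error type; the real `SpeechError` enum may differ.
#[derive(Debug)]
enum SpeechError {
    InvalidSampleRate(u32),
    EmptyInput,
}

/// Result alias mirroring the `SpeechResult` pattern described above.
type SpeechResult<T> = Result<T, SpeechError>;

/// Validate input up front and return `Err` instead of panicking.
#[must_use = "handle the error instead of unwrapping"]
fn check_input(samples: &[f32], sample_rate: u32) -> SpeechResult<()> {
    if samples.is_empty() {
        return Err(SpeechError::EmptyInput);
    }
    // Hypothetical supported rates, chosen only for this sketch.
    if sample_rate != 8000 && sample_rate != 16000 {
        return Err(SpeechError::InvalidSampleRate(sample_rate));
    }
    Ok(())
}

fn main() {
    // Callers match on the Result rather than calling unwrap().
    match check_input(&[0.0; 160], 44_100) {
        Ok(()) => println!("ok"),
        Err(e) => println!("rejected: {e:?}"),
    }
}
```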
§References
- Silero VAD: https://github.com/snakers4/silero-vad
- WebRTC VAD: Energy-based voice activity detection
- Whisper: https://github.com/openai/whisper
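As a rough illustration of the energy-based approach referenced above (not the crate's actual `Vad` implementation; the frame length and RMS threshold are arbitrary assumptions), a minimal detector can compute per-frame RMS energy and merge consecutive above-threshold frames into segments:

```rust
/// A detected segment, in sample offsets. Hypothetical stand-in for `VoiceSegment`.
#[derive(Debug, PartialEq)]
struct Segment {
    start: usize,
    end: usize,
}

/// Minimal energy-based VAD sketch: frames whose RMS energy exceeds
/// `threshold` count as speech; adjacent speech frames are merged.
fn detect_energy(samples: &[f32], frame_len: usize, threshold: f32) -> Vec<Segment> {
    let mut segments = Vec::new();
    let mut current: Option<Segment> = None;
    for (i, frame) in samples.chunks(frame_len).enumerate() {
        let rms = (frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32).sqrt();
        let start = i * frame_len;
        if rms > threshold {
            match current.as_mut() {
                Some(seg) => seg.end = start + frame.len(),
                None => current = Some(Segment { start, end: start + frame.len() }),
            }
        } else if let Some(seg) = current.take() {
            segments.push(seg);
        }
    }
    if let Some(seg) = current {
        segments.push(seg);
    }
    segments
}

fn main() {
    // 0.5 s of silence followed by 0.5 s of a loud signal, at 16 kHz.
    let mut samples = vec![0.0f32; 8000];
    samples.extend(std::iter::repeat(0.5).take(8000));
    let segs = detect_energy(&samples, 160, 0.01);
    println!("{segs:?}"); // [Segment { start: 8000, end: 16000 }]
}
```

Production VAD (Silero-style) replaces the RMS heuristic with a learned model, but the segment-merging logic is broadly similar.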
Modules§
- asr
- Automatic Speech Recognition (ASR) primitives.
- diarization
- Speaker diarization module.
- tts
- Text-to-Speech (TTS) module (GH-133).
- vad
- Voice Activity Detection (VAD) module.
Structs§
- AlignmentInfo
- Alignment information between text and audio.
- AsrConfig
- ASR configuration for transcription sessions.
- AsrSession
- Stateful ASR session for transcription.
- CrossAttentionWeights
- Cross-attention weights for encoder-decoder alignment (G5).
- DiarizationConfig
- Configuration for speaker diarization.
- DiarizationResult
- Complete diarization result.
- FastSpeech2Synthesizer
- FastSpeech2-style TTS synthesizer.
- HifiGanVocoder
- HiFi-GAN vocoder.
- LanguageDetection
- Language detection result (G2).
- Segment
- A single transcription segment with timing.
- Speaker
- Identified speaker with embedding.
- SpeakerSegment
- A segment attributed to a specific speaker.
- StreamingTranscription
- Iterator for streaming transcription results.
- SynthesisRequest
- A synthesis request with text and optional controls.
- SynthesisResult
- Synthesis result containing audio and metadata.
- Transcription
- Complete transcription result.
- TtsConfig
- TTS configuration.
- Vad
- Voice Activity Detector using energy-based detection.
- VadConfig
- Configuration for Voice Activity Detection.
- VitsSynthesizer
- VITS-style end-to-end TTS.
- VoiceSegment
- A detected voice segment with start and end times.
- WordTiming
- Word-level timing information.
Enums§
- SpeechError
- Speech processing error type.
Constants§
- SUPPORTED_LANGUAGES
- List of supported language codes (ISO 639-1).
Traits§
- AsrModel
- Trait for ASR model implementations
- SpeechSynthesizer
- Trait for speech synthesis.
- Vocoder
- Trait for neural vocoder (mel to audio).
Functions§
- detect_language
- Detect language from audio features (G2).
- estimate_duration
- Estimate synthesis duration from text.
- is_language_supported
- Check if a language code is supported.
- normalize_text
- Normalize text for TTS (lowercase, expand abbreviations, etc.).
- split_sentences
- Split text into sentences for chunked synthesis.
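As a rough sketch of what sentence splitting for chunked synthesis might look like (this is not the crate's actual `split_sentences` implementation; the terminator set `.`, `!`, `?` is an assumption, and a real splitter would also handle abbreviations and decimals):

```rust
/// Naive sentence splitter: breaks on '.', '!' and '?' and trims whitespace.
/// Illustrative only; the crate's `split_sentences` may behave differently.
fn split_sentences(text: &str) -> Vec<String> {
    let mut sentences = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        current.push(ch);
        if matches!(ch, '.' | '!' | '?') {
            let s = current.trim().to_string();
            if !s.is_empty() {
                sentences.push(s);
            }
            current.clear();
        }
    }
    // Keep any trailing text that lacks a terminator.
    let tail = current.trim();
    if !tail.is_empty() {
        sentences.push(tail.to_string());
    }
    sentences
}

fn main() {
    let parts = split_sentences("Hello there! How are you? Fine.");
    println!("{parts:?}"); // ["Hello there!", "How are you?", "Fine."]
}
```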
Type Aliases§
- SpeechResult
- Result type for speech operations.