Module speech

Speech processing module for VAD, ASR, TTS, and diarization (spec §5, GH-133).

This module implements speech processing primitives for Whisper and other speech models, following the spec in apr-whisper-and-cookbook-support-eoy-2025.md.

§Modules

  • vad: Voice Activity Detection (Silero-style or energy-based)
  • asr: Automatic Speech Recognition primitives
  • diarization: Speaker diarization
  • tts: Text-to-Speech primitives

§Example

use aprender::speech::vad::{Vad, VadConfig};

// Create VAD with default config
let config = VadConfig::default();
let vad = Vad::new(config).expect("default config is valid");

// Detect voice activity in audio samples
let samples: Vec<f32> = vec![0.0; 16000]; // 1 second of silence at 16kHz
let segments = vad.detect(&samples, 16000).expect("valid input");
assert!(segments.is_empty()); // No speech detected in silence
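
The example above expects no segments for a buffer of silence. As a rough illustration of why an energy-based detector behaves that way, here is a minimal, self-contained sketch of frame-level energy thresholding. It is not the crate's implementation: `SketchSegment`, `detect_by_energy`, the `frame_len` and `threshold` parameters, and the `String` error are all assumptions made for this illustration, and the real `Vad` and `VadConfig` may work differently.

```rust
/// A detected span of speech, in seconds (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct SketchSegment {
    pub start: f64,
    pub end: f64,
}

/// Frame-level energy VAD: a frame counts as speech when its RMS
/// amplitude exceeds `threshold`. Consecutive speech frames are merged
/// into one segment. Invalid parameters yield an error, not a panic.
pub fn detect_by_energy(
    samples: &[f32],
    sample_rate: u32,
    frame_len: usize,
    threshold: f32,
) -> Result<Vec<SketchSegment>, String> {
    if sample_rate == 0 || frame_len == 0 {
        return Err("sample_rate and frame_len must be non-zero".to_string());
    }
    let rate = f64::from(sample_rate);
    let mut segments = Vec::new();
    // Sample index where the currently open speech run began, if any.
    let mut open_start: Option<usize> = None;

    for (i, frame) in samples.chunks(frame_len).enumerate() {
        // `chunks` never yields an empty frame, so the division is safe.
        let mean_sq = frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32;
        let is_speech = mean_sq.sqrt() > threshold;
        let offset = i * frame_len;
        match (is_speech, open_start) {
            (true, None) => open_start = Some(offset),
            (false, Some(start)) => {
                segments.push(SketchSegment {
                    start: start as f64 / rate,
                    end: offset as f64 / rate,
                });
                open_start = None;
            }
            _ => {}
        }
    }
    // Close a run that extends to the end of the buffer.
    if let Some(start) = open_start {
        segments.push(SketchSegment {
            start: start as f64 / rate,
            end: samples.len() as f64 / rate,
        });
    }
    Ok(segments)
}
```

With all-zero input every frame has an RMS of zero, so no frame crosses the threshold and the returned list is empty. A production detector would typically add smoothing such as minimum speech and silence durations, which this sketch omits.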

§PMAT Compliance

This module enforces zero-tolerance quality rules:

  • No unwrap() - audio stream processing must never panic
  • No panic!() - required for real-time processing
  • All public APIs return Result<T, E>
  • #[must_use] on all Results
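
The rules above can be sketched as a small fallible function. This is an illustration only: `SketchError`, `SketchResult`, and `duration_secs` are hypothetical names invented here, and the crate's actual `SpeechError` and `SpeechResult` may be defined differently. Note that the standard library's `Result` is already marked `#[must_use]`, so the attribute on the function mainly adds a more specific warning message.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; the crate's own
/// `SpeechError` may have different variants.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    EmptyInput,
    InvalidSampleRate(u32),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::EmptyInput => write!(f, "input audio is empty"),
            SketchError::InvalidSampleRate(r) => write!(f, "invalid sample rate: {}", r),
        }
    }
}

impl std::error::Error for SketchError {}

/// Hypothetical alias in the spirit of the `SpeechResult` alias.
pub type SketchResult<T> = Result<T, SketchError>;

/// Duration of a sample buffer in seconds.
/// Bad input is reported as an `Err` value; nothing here unwraps or panics.
#[must_use = "the duration (or the error) should be handled"]
pub fn duration_secs(samples: &[f32], sample_rate: u32) -> SketchResult<f64> {
    if samples.is_empty() {
        return Err(SketchError::EmptyInput);
    }
    if sample_rate == 0 {
        return Err(SketchError::InvalidSampleRate(sample_rate));
    }
    Ok(samples.len() as f64 / f64::from(sample_rate))
}
```

A caller would normally propagate the error with `?` or handle it with `match`, so a malformed buffer surfaces as a value rather than aborting a real-time audio thread.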

Modules§

asr
Automatic Speech Recognition (ASR) primitives.
diarization
Speaker diarization module.
tts
Text-to-Speech (TTS) module (GH-133).
vad
Voice Activity Detection (VAD) module.

Structs§

AlignmentInfo
Alignment information between text and audio.
AsrConfig
ASR configuration for transcription sessions
AsrSession
Stateful ASR session for transcription
CrossAttentionWeights
Cross-attention weights for encoder-decoder alignment (G5)
DiarizationConfig
Configuration for speaker diarization
DiarizationResult
Complete diarization result
FastSpeech2Synthesizer
FastSpeech2-style TTS synthesizer.
HifiGanVocoder
HiFi-GAN vocoder.
LanguageDetection
Language detection result (G2)
Segment
A single transcription segment with timing
Speaker
Identified speaker with embedding
SpeakerSegment
A segment attributed to a specific speaker
StreamingTranscription
Iterator for streaming transcription results
SynthesisRequest
A synthesis request with text and optional controls.
SynthesisResult
Synthesis result containing audio and metadata.
Transcription
Complete transcription result
TtsConfig
TTS configuration.
Vad
Voice Activity Detector using energy-based detection.
VadConfig
Configuration for Voice Activity Detection.
VitsSynthesizer
VITS-style end-to-end TTS.
VoiceSegment
A detected voice segment with start and end times.
WordTiming
Word-level timing information

Enums§

SpeechError
Speech processing error type

Constants§

SUPPORTED_LANGUAGES
List of supported language codes (ISO 639-1)

Traits§

AsrModel
Trait for ASR model implementations
SpeechSynthesizer
Trait for speech synthesis.
Vocoder
Trait for neural vocoder (mel to audio).

Functions§

detect_language
Detect language from audio features (G2)
estimate_duration
Estimate synthesis duration from text.
is_language_supported
Check if a language code is supported
normalize_text
Normalize text for TTS (lowercase, expand abbreviations, etc.).
split_sentences
Split text into sentences for chunked synthesis.

Type Aliases§

SpeechResult
Result type for speech operations