VoiRS Recognizer
Automatic Speech Recognition (ASR) and phoneme alignment for VoiRS
VoiRS Recognizer provides comprehensive speech recognition capabilities for the VoiRS ecosystem, enabling accurate transcription, phoneme alignment, and audio analysis for speech synthesis evaluation and training.
Features
🎤 Multi-Model ASR Support
- OpenAI Whisper: State-of-the-art multilingual speech recognition
- Mozilla DeepSpeech: Privacy-focused local speech recognition
- Facebook Wav2Vec2: Self-supervised speech representation learning
- Custom Models: Plugin architecture for additional ASR backends
🔤 Phoneme Recognition & Alignment
- Forced Alignment: Precise time-aligned phoneme segmentation
- Montreal Forced Aligner (MFA): Professional-grade phoneme alignment
- Custom Phoneme Sets: Support for multiple languages and dialects
- Confidence Scoring: Reliability metrics for recognition results
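The alignment output described above can be pictured as a list of time-stamped phoneme segments, each carrying a confidence score. The sketch below is a plain-Rust illustration under stated assumptions: PhonemeSegment, low_confidence, and is_well_formed are hypothetical names, not part of the voirs-recognizer API, and the timings in main are made up.

```rust
/// A hypothetical time-aligned phoneme segment; field names are illustrative
/// and are not taken from the voirs-recognizer API.
#[derive(Debug, Clone, PartialEq)]
struct PhonemeSegment {
    phoneme: String,
    start_s: f64,    // segment start, in seconds
    end_s: f64,      // segment end, in seconds
    confidence: f32, // 0.0 (unreliable) ..= 1.0 (reliable)
}

impl PhonemeSegment {
    fn duration_s(&self) -> f64 {
        self.end_s - self.start_s
    }
}

/// Returns the segments whose confidence is below `threshold`,
/// e.g. to flag them for manual review.
fn low_confidence(segments: &[PhonemeSegment], threshold: f32) -> Vec<&PhonemeSegment> {
    segments.iter().filter(|s| s.confidence < threshold).collect()
}

/// Checks that every segment has start <= end and that segments
/// are ordered in time without overlapping.
fn is_well_formed(segments: &[PhonemeSegment]) -> bool {
    segments.iter().all(|s| s.start_s <= s.end_s)
        && segments.windows(2).all(|w| w[0].end_s <= w[1].start_s)
}

/// Small constructor to keep the examples short.
fn seg(p: &str, start_s: f64, end_s: f64, confidence: f32) -> PhonemeSegment {
    PhonemeSegment { phoneme: p.to_string(), start_s, end_s, confidence }
}

fn main() {
    // Hypothetical alignment of the word "cat" (/k æ t/) with made-up timings.
    let segments = vec![
        seg("k", 0.00, 0.08, 0.95),
        seg("æ", 0.08, 0.21, 0.62),
        seg("t", 0.21, 0.30, 0.91),
    ];
    assert!(is_well_formed(&segments));
    for s in low_confidence(&segments, 0.7) {
        println!("review /{}/ at {:.2}s ({:.2}s long)", s.phoneme, s.start_s, s.duration_s());
    }
}
```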
📊 Audio Analysis
- Quality Assessment: Signal-to-noise ratio (SNR), total harmonic distortion (THD), and spectral analysis
- Prosody Analysis: Pitch, rhythm, stress, and intonation
- Speaker Characteristics: Gender, age, and emotion detection
- Artifact Detection: Clipping, distortion, and noise identification
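Two of the quality checks listed above reduce to simple arithmetic on samples. As a rough sketch, assuming normalized f32 samples in [-1.0, 1.0] and a separate noise-only stretch (such as leading silence), SNR can be estimated as 10 * log10 of the speech-to-noise power ratio, and clipping as the fraction of samples near full scale. The function names and the 0.999 threshold are illustrative assumptions; the crate's own analyzers may compute these differently.

```rust
/// Mean power (mean of squared samples) of a block of audio.
fn mean_power(samples: &[f32]) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    samples.iter().map(|&x| (x as f64) * (x as f64)).sum::<f64>() / samples.len() as f64
}

/// Rough SNR estimate in decibels: power of a speech segment relative to the
/// power of a noise-only segment. Returns None when the noise power is zero.
fn snr_db(speech: &[f32], noise: &[f32]) -> Option<f64> {
    let noise_power = mean_power(noise);
    if noise_power == 0.0 {
        return None;
    }
    Some(10.0 * (mean_power(speech) / noise_power).log10())
}

/// Fraction of samples whose magnitude is at or above `threshold`,
/// a simple clipping indicator for normalized [-1.0, 1.0] audio.
fn clipped_fraction(samples: &[f32], threshold: f32) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    let clipped = samples.iter().filter(|x| x.abs() >= threshold).count();
    clipped as f64 / samples.len() as f64
}

fn main() {
    // Placeholder signals; in practice these come from decoded audio.
    let noise = vec![0.01_f32, -0.01, 0.01, -0.01];
    let speech = vec![0.4_f32, -0.6, 0.5, -0.3];
    if let Some(db) = snr_db(&speech, &noise) {
        println!("SNR ≈ {db:.1} dB");
    }
    println!("clipped: {:.1}%", 100.0 * clipped_fraction(&speech, 0.999));
}
```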
Quick Start
Add VoiRS Recognizer to your Cargo.toml:
[dependencies]
voirs-recognizer = "0.1.0"
# Or, to enable specific ASR models, use this line instead:
# voirs-recognizer = { version = "0.1.0", features = ["whisper", "forced-align"] }
Basic Speech Recognition
use *;
async
Phoneme Alignment
use *;
async
Audio Quality Analysis
use *;
async
Supported ASR Models
Whisper
- Languages: 99+ languages supported
- Model Sizes: tiny, base, small, medium, large
- Features: Multilingual, robust to noise, timestamp accuracy
- Use Case: General-purpose, multilingual applications
DeepSpeech
- Languages: English (primary), with community models for other languages
- Features: Local processing, privacy-focused, customizable
- Use Case: Privacy-sensitive applications, offline deployment
Wav2Vec2
- Languages: English, with multilingual variants available
- Features: Self-supervised learning, fine-tunable
- Use Case: Research applications, custom domain adaptation
Feature Flags
Enable specific functionality through feature flags:
[dependencies.voirs-recognizer]
version = "0.1.0"
features = [
    "whisper",      # OpenAI Whisper support
    "deepspeech",   # Mozilla DeepSpeech support
    "wav2vec2",     # Facebook Wav2Vec2 support
    "forced-align", # Basic forced alignment
    "mfa",          # Montreal Forced Aligner
    "all-models",   # Enable all ASR models
    "gpu",          # GPU acceleration support
]
Performance Optimization
VoiRS Recognizer is designed to meet the following performance targets:
- Real-time factor (RTF) < 0.3 on modern CPUs
- Memory usage < 2GB for largest models
- Startup time < 5 seconds
- Streaming latency < 200ms
Performance Validation
Use the built-in performance validator to check whether your configuration meets these targets:
use *;
use Duration;
async
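The validator's own API is not reproduced here. As a dependency-free illustration of the same kind of check, the hypothetical Measurements struct and missed_targets function below compare measured figures against the four targets listed above; treating "2GB" as 2 × 2^30 bytes is an assumption.

```rust
use std::time::Duration;

/// Hypothetical container for figures you measured yourself.
struct Measurements {
    rtf: f64,
    peak_memory_bytes: u64,
    startup: Duration,
    streaming_latency: Duration,
}

/// Returns the targets that were missed; an empty list means all passed.
fn missed_targets(m: &Measurements) -> Vec<&'static str> {
    const GB: u64 = 1 << 30; // assumption: GB interpreted as GiB
    let mut missed = Vec::new();
    if m.rtf >= 0.3 {
        missed.push("RTF < 0.3");
    }
    if m.peak_memory_bytes >= 2 * GB {
        missed.push("memory < 2GB");
    }
    if m.startup >= Duration::from_secs(5) {
        missed.push("startup < 5s");
    }
    if m.streaming_latency >= Duration::from_millis(200) {
        missed.push("streaming latency < 200ms");
    }
    missed
}

fn main() {
    // Example figures; substitute your own measurements.
    let m = Measurements {
        rtf: 0.25,
        peak_memory_bytes: 1_500 * 1024 * 1024,
        startup: Duration::from_millis(3_200),
        streaming_latency: Duration::from_millis(180),
    };
    let missed = missed_targets(&m);
    if missed.is_empty() {
        println!("all targets met");
    } else {
        println!("missed: {missed:?}");
    }
}
```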
Model Selection for Performance
Choose the appropriate model size based on your performance requirements:
use *;
// Ultra-fast processing (RTF ~0.1, lower accuracy)
let fast_config = WhisperConfig ;
// Balanced performance and accuracy (RTF ~0.3)
let balanced_config = WhisperConfig ;
// High accuracy (RTF ~0.8, higher latency)
let accurate_config = WhisperConfig ;
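The trade-off above amounts to picking the most accurate configuration whose approximate RTF fits your budget. This sketch encodes the three tiers with the RTF figures quoted in the comments; the Tier enum and pick_tier function are hypothetical and only illustrate the selection logic, not how WhisperConfig is built.

```rust
/// Hypothetical tiers mirroring the three configurations above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier {
    Fast,
    Balanced,
    Accurate,
}

impl Tier {
    /// Approximate RTF quoted for each configuration.
    fn approx_rtf(self) -> f64 {
        match self {
            Tier::Fast => 0.1,
            Tier::Balanced => 0.3,
            Tier::Accurate => 0.8,
        }
    }
}

/// Picks the most accurate tier whose approximate RTF fits within `max_rtf`,
/// or None if even the fastest tier is too slow.
fn pick_tier(max_rtf: f64) -> Option<Tier> {
    // Ordered from most to least accurate, so the first fit is the best one.
    [Tier::Accurate, Tier::Balanced, Tier::Fast]
        .iter()
        .copied()
        .find(|t| t.approx_rtf() <= max_rtf)
}

fn main() {
    for budget in [1.0, 0.5, 0.2, 0.05] {
        println!("RTF budget {budget}: {:?}", pick_tier(budget));
    }
}
```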
GPU Acceleration
Enable GPU acceleration for significant performance gains:
use *;
// NVIDIA GPU acceleration
let cuda_config = WhisperConfig ;
// Apple Silicon GPU acceleration
let metal_config = WhisperConfig ;
// CPU with optimizations
let optimized_cpu_config = WhisperConfig ;
Memory Optimization
Reduce memory usage with these techniques:
use *;
// Memory-efficient configuration
let memory_config = WhisperConfig ;
// Enable dynamic quantization for further memory savings
let quantized_config = WhisperConfig ;
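To see why quantization saves memory, consider symmetric per-tensor int8 quantization, one common scheme: each f32 weight (4 bytes) is stored as an i8 (1 byte) plus one shared scale, at the cost of a rounding error of at most half a scale step. This is a generic sketch of weight quantization only; whether the crate's quantization option works this way internally is an assumption.

```rust
/// Symmetric per-tensor int8 quantization:
/// scale = max|w| / 127, q = round(w / scale) clamped to [-127, 127].
fn quantize_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0_f32, |m, w| m.max(w.abs()));
    // Avoid dividing by zero for an all-zero tensor.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Recovers approximate f32 weights from int8 values and the shared scale.
fn dequantize_i8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let weights = [0.8_f32, -0.31, 0.05, -1.2];
    let (q, scale) = quantize_i8(&weights);
    let restored = dequantize_i8(&q, scale);
    println!("int8: {q:?}, scale: {scale}");
    println!("restored: {restored:?}");
    // 4 bytes per f32 weight versus 1 byte per i8 weight (plus one scale).
    println!(
        "bytes: {} -> {}",
        weights.len() * std::mem::size_of::<f32>(),
        q.len() * std::mem::size_of::<i8>()
    );
}
```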
Real-time Processing Optimization
Configure for ultra-low latency streaming:
use *;
use ;
// Ultra-low latency configuration
let streaming_config = StreamingConfig ;
// Balanced latency/accuracy configuration
let balanced_streaming = StreamingConfig ;
// Create streaming ASR with optimized config
let streaming_asr = with_config.await?;
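Streaming latency is bounded below by the chunk size: a chunk cannot be transcribed until it has been fully captured, and processing it then takes roughly chunk duration × RTF. The sketch below assumes a fixed chunk length in milliseconds; chunk_samples and min_latency_ms are hypothetical helpers, not StreamingConfig fields. With 100 ms chunks at an RTF of 0.3, the lower bound is about 130 ms, inside the 200 ms target.

```rust
/// Splits samples into fixed-duration chunks. The final partial chunk is
/// kept so that no audio is dropped.
fn chunk_samples(samples: &[f32], sample_rate: u32, chunk_ms: u32) -> Vec<&[f32]> {
    let chunk_len = (sample_rate as usize * chunk_ms as usize) / 1000;
    assert!(chunk_len > 0, "chunk must contain at least one sample");
    samples.chunks(chunk_len).collect()
}

/// Lower bound on per-chunk latency in milliseconds:
/// time to capture the chunk plus time to process it (chunk_ms * rtf).
fn min_latency_ms(chunk_ms: u32, rtf: f64) -> f64 {
    chunk_ms as f64 * (1.0 + rtf)
}

fn main() {
    let sample_rate = 16_000;
    let audio = vec![0.0_f32; 16_700]; // placeholder: silence instead of real audio
    let chunks = chunk_samples(&audio, sample_rate, 100);
    println!(
        "{} chunks; last has {} samples",
        chunks.len(),
        chunks.last().map_or(0, |c| c.len())
    );
    println!("minimum latency at RTF 0.3: {:.0} ms", min_latency_ms(100, 0.3));
}
```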
Batch Processing
Process multiple files efficiently:
use *;
// Optimal batch processing
let batch_config = BatchProcessingConfig ;
let audio_files = vec!;
// Load audio files
let audio_buffers: = audio_files
.iter
.map
.?;
// Process batch efficiently
let batch_processor = with_config.await?;
let transcripts = batch_processor.process_batch.await?;
// Results are returned in the same order as input
for in transcripts.iter.enumerate
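One way to process a batch in parallel while still returning results in input order is to split the inputs into contiguous slices, run each slice on a scoped thread, and join the threads in spawn order. The sketch below uses std::thread::scope (Rust 1.63+); process_in_order is a hypothetical helper, and the closure passed to it stands in for per-file recognition, not for the crate's batch processor.

```rust
use std::thread;

/// Processes `inputs` on up to `workers` threads and returns the results in
/// the same order as the inputs.
fn process_in_order<T, R, F>(inputs: &[T], workers: usize, process: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let workers = workers.max(1);
    // Ceiling division so every input lands in some slice.
    let chunk_size = ((inputs.len() + workers - 1) / workers).max(1);
    let process = &process; // shared reference that each thread can copy
    thread::scope(|s| {
        // One scoped thread per contiguous slice of the inputs.
        let handles: Vec<_> = inputs
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().map(process).collect::<Vec<R>>()))
            .collect();
        // Joining in spawn order and concatenating preserves the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    let files = ["a.wav", "b.wav", "c.wav"]; // placeholder names
    let name_lengths = process_in_order(&files, 2, |name| name.len());
    println!("{name_lengths:?}");
}
```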
Performance Monitoring
Monitor your application's performance in real-time:
use *;
use Instant;
// Enable performance monitoring
let monitor = new
.with_metrics_collection
.with_real_time_reporting;
// Monitor ASR performance
let start = now;
let audio = from_file?;
let transcript = asr.recognize.await?;
let processing_time = start.elapsed;
// Validate performance
let validator = new.with_verbose;
let = validator.validate_rtf;
let = validator.estimate_memory_usage?;
println!;
println!;
println!;
println!;
println!;
// Log performance metrics for analysis
monitor.log_performance_metrics;
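A minimal, dependency-free version of the RTF bookkeeping described above: time each call with Instant, divide by the duration of the audio processed, and keep running statistics. RtfStats is a hypothetical helper rather than the crate's monitor, and the summation in main is a placeholder workload.

```rust
use std::time::{Duration, Instant};

/// Hypothetical running RTF statistics.
#[derive(Default)]
struct RtfStats {
    samples: Vec<f64>,
}

impl RtfStats {
    /// Records one call: processing time relative to the audio duration.
    /// Zero-length audio is ignored to avoid dividing by zero.
    fn record(&mut self, processing: Duration, audio: Duration) {
        if audio.is_zero() {
            return;
        }
        self.samples.push(processing.as_secs_f64() / audio.as_secs_f64());
    }

    /// Mean RTF over all recorded calls.
    fn mean(&self) -> Option<f64> {
        if self.samples.is_empty() {
            None
        } else {
            Some(self.samples.iter().sum::<f64>() / self.samples.len() as f64)
        }
    }

    /// Worst (highest) RTF seen so far.
    fn worst(&self) -> Option<f64> {
        self.samples.iter().copied().reduce(f64::max)
    }
}

fn main() {
    let mut stats = RtfStats::default();
    let start = Instant::now();
    let checksum: u64 = (0..1_000_000u64).sum(); // placeholder for recognition work
    stats.record(start.elapsed(), Duration::from_secs(1));
    println!("checksum {checksum}, mean RTF {:?}, worst {:?}", stats.mean(), stats.worst());
}
```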
Platform-Specific Optimizations
SIMD Acceleration (Automatic)
VoiRS automatically detects and uses SIMD instructions:
- Intel/AMD: AVX2, AVX-512 when available
- ARM: NEON instructions on ARM64
- Apple: Apple Silicon optimizations
No manual configuration is required; optimizations are applied automatically.
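Automatic SIMD selection of this kind typically relies on runtime CPU feature detection. The sketch below shows the standard-library mechanism (is_x86_feature_detected! and std::arch::is_aarch64_feature_detected!, the latter stable since Rust 1.60); it illustrates the technique and is not VoiRS's internal dispatch code.

```rust
/// Lists SIMD extensions that the current CPU reports at runtime.
/// On architectures other than x86/x86_64/aarch64 the list is empty.
#[allow(unused_mut)]
fn detected_simd() -> Vec<&'static str> {
    let mut found = Vec::new();
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            found.push("avx2");
        }
        if is_x86_feature_detected!("avx512f") {
            found.push("avx512f");
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            found.push("neon");
        }
    }
    found
}

fn main() {
    // A real dispatcher would pick a kernel here based on the detected features.
    println!("SIMD extensions detected: {:?}", detected_simd());
}
```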
Multi-threading
Optimize thread usage for your hardware:
use *;
// Automatic thread optimization
let thread_config = auto_detect;
// Manual thread configuration
let manual_config = ThreadingConfig ;
let asr = with_threading_config.await?;
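Auto-detecting a thread count usually starts from the number of CPUs the operating system makes available. This sketch uses std::thread::available_parallelism (Rust 1.59+); worker_threads and its optional cap are hypothetical and are not the crate's ThreadingConfig.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Chooses a worker-thread count: the parallelism the OS reports, falling
/// back to 1 if it cannot be determined, optionally capped by the caller.
fn worker_threads(cap: Option<usize>) -> usize {
    let available = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    match cap {
        // A cap of 0 is treated as 1 so at least one worker always runs.
        Some(c) => available.min(c.max(1)),
        None => available,
    }
}

fn main() {
    println!("using {} worker threads", worker_threads(Some(8)));
}
```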
Troubleshooting Performance Issues
Common performance problems and solutions:
High Memory Usage
// If memory usage exceeds limits, try:
let low_memory_config = WhisperConfig ;
Poor RTF Performance
// If RTF is too high, try:
let fast_config = WhisperConfig ;
High Streaming Latency
// If streaming latency is too high, try:
let low_latency_config = StreamingConfig ;
Language Support
VoiRS Recognizer supports multiple languages through its ASR backends:
| Language | Whisper | DeepSpeech | Wav2Vec2 | MFA |
|---|---|---|---|---|
| English | ✅ | ✅ | ✅ | ✅ |
| Spanish | ✅ | ❌ | ❌ | ✅ |
| French | ✅ | ❌ | ❌ | ✅ |
| German | ✅ | ❌ | ❌ | ✅ |
| Japanese | ✅ | ❌ | ❌ | ❌ |
| Chinese | ✅ | ❌ | ❌ | ❌ |
| Korean | ✅ | ❌ | ❌ | ❌ |
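The support table can also be encoded as a lookup, for example to fall back to Whisper when a language lacks other backends. Backend, backends_for, and can_align are hypothetical names; the data is copied from the table above, and languages missing from the table return an empty list.

```rust
/// Backends named in the language support table.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Backend {
    Whisper,
    DeepSpeech,
    Wav2Vec2,
    Mfa,
}

/// Backends the table marks as supporting `language`.
fn backends_for(language: &str) -> Vec<Backend> {
    use Backend::*;
    match language {
        "English" => vec![Whisper, DeepSpeech, Wav2Vec2, Mfa],
        "Spanish" | "French" | "German" => vec![Whisper, Mfa],
        "Japanese" | "Chinese" | "Korean" => vec![Whisper],
        _ => Vec::new(),
    }
}

/// True if the table lists MFA support for `language`.
fn can_align(language: &str) -> bool {
    backends_for(language).contains(&Backend::Mfa)
}

fn main() {
    for lang in ["English", "German", "Korean"] {
        println!("{lang}: {:?}, MFA alignment: {}", backends_for(lang), can_align(lang));
    }
}
```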
Configuration
Custom ASR Configuration
use *;
let config = ASRConfig ;
let asr = with_config.await?;
Phoneme Alignment Configuration
use *;
let config = PhonemeConfig ;
let recognizer = with_config.await?;
Error Handling
VoiRS Recognizer provides comprehensive error handling:
use *;
match asr.recognize.await
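Matching on specific error variants lets an application recover from some failures (for example, by resampling) and report the rest. The error type below is a hypothetical stand-in: the variant names, and the assumption that the model expects 16 kHz input, are illustrative and not the crate's actual error enum.

```rust
use std::fmt;

/// Hypothetical error type; the real crate's variants may differ.
#[derive(Debug)]
enum RecognitionError {
    ModelNotFound(String),
    UnsupportedSampleRate(u32),
    Timeout,
}

impl fmt::Display for RecognitionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RecognitionError::ModelNotFound(name) => write!(f, "model not found: {name}"),
            RecognitionError::UnsupportedSampleRate(sr) => write!(f, "unsupported sample rate: {sr} Hz"),
            RecognitionError::Timeout => write!(f, "timed out"),
        }
    }
}

impl std::error::Error for RecognitionError {}

/// Stand-in for a recognition call that validates its input first.
fn recognize(sample_rate: u32) -> Result<String, RecognitionError> {
    if sample_rate != 16_000 {
        return Err(RecognitionError::UnsupportedSampleRate(sample_rate));
    }
    Ok("hello world".to_string())
}

/// Handles recoverable errors specifically and reports everything else.
fn describe(result: Result<String, RecognitionError>) -> String {
    match result {
        Ok(text) => format!("transcript: {text}"),
        Err(RecognitionError::UnsupportedSampleRate(sr)) => {
            format!("resample from {sr} Hz to 16000 Hz and retry")
        }
        Err(e) => format!("recognition failed: {e}"),
    }
}

fn main() {
    println!("{}", describe(recognize(16_000)));
    println!("{}", describe(recognize(44_100)));
    println!("{}", describe(Err(RecognitionError::ModelNotFound("whisper-base".into()))));
    println!("{}", describe(Err(RecognitionError::Timeout)));
}
```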
Examples
Check out the examples directory for more comprehensive usage examples:
- basic_recognition.rs: Simple speech recognition
- phoneme_alignment.rs: Detailed phoneme alignment
- audio_analysis.rs: Audio quality analysis
- batch_processing.rs: Efficient batch processing
- streaming_recognition.rs: Real-time recognition
- multilingual.rs: Multi-language support
Benchmarks
Performance benchmarks on common datasets:
| Model | Dataset | WER | RTF | Memory |
|---|---|---|---|---|
| Whisper-base | LibriSpeech | 5.2% | 0.3x | 1.2GB |
| DeepSpeech | CommonVoice | 8.1% | 0.8x | 800MB |
| Wav2Vec2-base | LibriSpeech | 6.4% | 0.5x | 1.0GB |
RTF = real-time factor (processing time / audio duration); values below 1.0 mean faster than real time.
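Applying the definition to the Whisper-base row: at an RTF of 0.3, transcribing a 60-second clip takes about 18 seconds of processing. The 60-second clip length is only an illustrative input.

```latex
\mathrm{RTF} = \frac{T_{\text{processing}}}{T_{\text{audio}}}
\quad\Longrightarrow\quad
T_{\text{processing}} = \mathrm{RTF} \times T_{\text{audio}} = 0.3 \times 60\ \text{s} = 18\ \text{s}
```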
Community Support
🆘 Getting Help
- GitHub Issues: For bug reports and feature requests
- GitHub Discussions: For questions, ideas, and community chat
🌟 Connect with the Community
- Discord Server: Real-time chat and support
  - 🔗 Join VoiRS Community Discord
  - Channels: #general, #help, #showcase, #development
- Matrix: Bridged with Discord for Matrix users
📚 Learning Resources
- Documentation: docs.rs/voirs-recognizer
- Examples: GitHub Examples
- Tutorials: VoiRS Learning Hub
- Blog: Medium @voirs-dev
🚀 Professional Support
For commercial deployments and enterprise support:
- Email: support@voirs.dev
- Consulting: Available for integration assistance, performance optimization, and custom development
🎯 Roadmap & Planning
- Project Board: GitHub Projects
- Milestones: GitHub Milestones
- Changelog: CHANGELOG.md
🏆 Recognition
- Contributors: All Contributors
- Sponsors: GitHub Sponsors
- Citations: See the Citation section for academic references
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
# Clone the repository
# Install dependencies
# Run tests
# Run benchmarks
License
This project is licensed under either of
- Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Citation
If you use VoiRS Recognizer in your research, please cite: