WaveKat VAD
Voice Activity Detection library for Rust with multiple backend support.
Quick Start
use VoiceActivityDetector;
use ;
let mut vad = new.unwrap;
let samples: = vec!; // 10ms at 16kHz
let probability = vad.process.unwrap;
Backends
| Backend | Feature | Sample Rates | Frame Size | Output |
|---|---|---|---|---|
| WebRTC | webrtc (default) | 8/16/32/48 kHz | 10, 20, or 30 ms | Binary (0.0 or 1.0) |
| Silero | silero | 8/16 kHz | 32 ms (256 or 512 samples) | Continuous (0.0–1.0) |
| TEN-VAD | ten-vad | 16 kHz only | 16 ms (256 samples) | Continuous (0.0–1.0) |
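[dependencies]
wavekat-vad = "0.1" # WebRTC only (default)
wavekat-vad = { version = "0.1", features = ["silero"] }
wavekat-vad = { version = "0.1", features = ["ten-vad"] }
wavekat-vad = { version = "0.1", features = ["webrtc", "silero", "ten-vad"] } # all backends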
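The frame sizes in the Backends table follow from sample rate times frame duration. A minimal sketch of that arithmetic, using a hypothetical helper named samples_per_frame that is not part of the crate's API:

```rust
/// Number of samples in one frame of `frame_ms` milliseconds at `sample_rate` Hz.
/// Hypothetical helper for illustration only; not part of the crate's API.
fn samples_per_frame(sample_rate: u32, frame_ms: u32) -> usize {
    // Widen to u64 so the multiplication cannot overflow for large inputs.
    (sample_rate as u64 * frame_ms as u64 / 1000) as usize
}
```

For example, samples_per_frame(16_000, 32) gives the 512-sample Silero frame, and samples_per_frame(8_000, 32) gives the 256-sample one.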
Benchmarks
Performance measured against the TEN-VAD testset — 30 audio files from LibriSpeech, GigaSpeech, and DNS Challenge with manual speech/non-speech annotations. Threshold: 0.5.
v0.1.7
| Backend | Precision | Recall | F1 Score | Frame Size | Avg Inference | RTF |
|---|---|---|---|---|---|---|
| WebRTC | 0.821 | 0.983 | 0.895 | 480 (30 ms) | 2.6 µs | 0.0001 |
| Silero | 0.938 | 0.938 | 0.938 | 512 (32 ms) | 120.4 µs | 0.0038 |
| TEN-VAD | 0.942 | 0.915 | 0.928 | 256 (16 ms) | 60.7 µs | 0.0038 |
Accuracy metrics are deterministic; inference times are approximate and vary by hardware. Measured with --release on GitHub Actions ubuntu-latest runners. Run locally: make accuracy or make bench.
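The derived columns in the table can be checked by hand. F1 is the harmonic mean of precision and recall. RTF is assumed here to mean inference time divided by the audio duration of one frame, which is the usual definition; the benchmark's exact formula is not shown in this document. The function names below are illustrative only:

```rust
/// Harmonic mean of precision and recall; returns 0.0 when both are zero.
fn f1_score(precision: f64, recall: f64) -> f64 {
    if precision + recall == 0.0 {
        return 0.0;
    }
    2.0 * precision * recall / (precision + recall)
}

/// Real-time factor, assumed to be processing time per frame divided by
/// the audio duration of that frame. Values below 1.0 are faster than real time.
fn real_time_factor(inference_us: f64, frame_ms: f64) -> f64 {
    // Convert the frame duration to microseconds so the units cancel.
    inference_us / (frame_ms * 1000.0)
}
```

Under that assumption, 60.7 µs per 16 ms frame gives roughly 0.0038, matching the TEN-VAD row.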
WebRTC
Google's WebRTC VAD. Fast and lightweight, returns binary speech/silence detection. Supports four aggressiveness modes.
use VoiceActivityDetector;
use ;
// Default 30ms frame duration
let mut vad = new.unwrap;
// Or specify frame duration (10, 20, or 30ms)
let mut vad = with_frame_duration.unwrap;
let samples = vec!; // 20ms at 16kHz
let result = vad.process.unwrap; // 0.0 or 1.0
Silero
Neural network (LSTM) run via ONNX Runtime. Returns a continuous probability and reaches a higher F1 score than WebRTC in the benchmarks above (0.938 vs. 0.895). Supports only 8 kHz and 16 kHz.
use VoiceActivityDetector;
use SileroVad;
let mut vad = new.unwrap;
let samples = vec!; // 32ms at 16kHz
let probability = vad.process.unwrap; // 0.0–1.0
// Or load a custom model
let vad = from_file.unwrap;
TEN-VAD
Agora's TEN-VAD with pure Rust preprocessing (no C dependency). Returns a continuous probability; supports 16 kHz only.
use VoiceActivityDetector;
use TenVad;
let mut vad = new.unwrap;
let samples = vec!; // 16ms at 16kHz
let probability = vad.process.unwrap; // 0.0–1.0
The VoiceActivityDetector Trait
All backends implement a common trait, so you can write code that is generic over backends:
use ;
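To illustrate the pattern, the sketch below defines a stand-in trait named Detector instead of importing the crate's VoiceActivityDetector, whose exact method signatures are not shown here. The trait, the error type SketchError, the toy backend EnergyGate, and the i16 sample type are all assumptions made for the example:

```rust
// Stand-in types for illustration only; the crate's real trait and error
// type may differ in names and signatures.
#[derive(Debug)]
struct SketchError(String);

trait Detector {
    /// Returns a speech probability in 0.0..=1.0 for one frame.
    fn process(&mut self, samples: &[i16]) -> Result<f32, SketchError>;
}

/// Counts frames whose probability meets the threshold, for any backend.
fn count_speech_frames<D: Detector>(
    vad: &mut D,
    frames: &[Vec<i16>],
    threshold: f32,
) -> Result<usize, SketchError> {
    let mut speech = 0;
    for frame in frames {
        // `?` propagates the first backend error to the caller.
        if vad.process(frame)? >= threshold {
            speech += 1;
        }
    }
    Ok(speech)
}

/// Toy backend: reports speech when the peak amplitude exceeds a fixed level.
struct EnergyGate {
    level: u16,
}

impl Detector for EnergyGate {
    fn process(&mut self, samples: &[i16]) -> Result<f32, SketchError> {
        if samples.is_empty() {
            return Err(SketchError("empty frame".into()));
        }
        let peak = samples.iter().map(|s| s.unsigned_abs()).max().unwrap_or(0);
        Ok(if peak > self.level { 1.0 } else { 0.0 })
    }
}
```

Because count_speech_frames only depends on the trait bound, a binary backend and a continuous one can share the same caller; only the threshold matters (0.5 is the value used in the benchmarks).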
FrameAdapter
Real-world audio arrives in arbitrary chunk sizes. FrameAdapter buffers incoming samples and feeds correctly-sized frames to the backend automatically.
use FrameAdapter;
use SileroVad;
let vad = new.unwrap;
let mut adapter = new;
// Feed arbitrary-sized chunks — adapter handles buffering
let chunk = vec!; // not a multiple of 512
// Get all complete frame results at once
let probabilities = adapter.process_all.unwrap;
// Or get just the latest result (convenient for real-time)
let latest = adapter.process_latest.unwrap;
// Or process one frame at a time
let result = adapter.process.unwrap; // Some(prob) or None
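The buffering idea can be sketched independently of the crate. The FrameBuffer type below is hypothetical and is not the crate's FrameAdapter; it only shows how arbitrary chunks can be cut into fixed-size frames while leftovers wait for the next chunk. The i16 sample type is an assumption:

```rust
/// Hypothetical buffer illustrating the idea; not the crate's FrameAdapter.
struct FrameBuffer {
    frame_size: usize,
    pending: Vec<i16>,
}

impl FrameBuffer {
    fn new(frame_size: usize) -> Self {
        assert!(frame_size > 0, "frame size must be non-zero");
        Self { frame_size, pending: Vec::new() }
    }

    /// Appends a chunk and returns every complete frame now available.
    /// Leftover samples stay buffered for the next call.
    fn push(&mut self, chunk: &[i16]) -> Vec<Vec<i16>> {
        self.pending.extend_from_slice(chunk);
        let mut frames = Vec::new();
        while self.pending.len() >= self.frame_size {
            // Drain removes the oldest `frame_size` samples from the front.
            let frame: Vec<i16> = self.pending.drain(..self.frame_size).collect();
            frames.push(frame);
        }
        frames
    }

    /// Number of samples waiting for a frame to fill up.
    fn buffered(&self) -> usize {
        self.pending.len()
    }
}
```

With a 512-sample frame size, pushing a 700-sample chunk yields one frame and leaves 188 samples buffered; each returned frame would then be handed to the backend.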
Preprocessing
Optional audio preprocessing to improve VAD accuracy. Available stages: high-pass filter, noise suppression, and amplitude normalization.
use ;
// Use a preset
let config = PreprocessorConfig::raw_mic(); // 80Hz HP + normalize + denoise
// let config = PreprocessorConfig::telephony(); // 200Hz HP only
// Or configure manually
let config = PreprocessorConfig ;
let mut preprocessor = new;
let raw_audio: = vec!;
let cleaned = preprocessor.process;
// feed `cleaned` to your VAD
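For intuition about two of the stages, here is a rough sketch of a first-order high-pass filter followed by peak normalization. This is a generic textbook formulation, not the crate's implementation; the function names, the f32 sample type, and the choice of peak (rather than RMS) normalization are assumptions:

```rust
use std::f32::consts::PI;

/// First-order RC high-pass filter: attenuates content below `cutoff_hz`.
/// Generic textbook form, not the crate's filter.
fn high_pass(samples: &[f32], sample_rate: f32, cutoff_hz: f32) -> Vec<f32> {
    let rc = 1.0 / (2.0 * PI * cutoff_hz);
    let dt = 1.0 / sample_rate;
    let alpha = rc / (rc + dt);
    let mut out = Vec::with_capacity(samples.len());
    let mut prev_in = 0.0f32;
    let mut prev_out = 0.0f32;
    for &x in samples {
        // y[n] = alpha * (y[n-1] + x[n] - x[n-1])
        let y = alpha * (prev_out + x - prev_in);
        out.push(y);
        prev_in = x;
        prev_out = y;
    }
    out
}

/// Scales the signal in place so its largest absolute value equals `target_peak`.
/// Silent input is left unchanged to avoid dividing by zero.
fn normalize_peak(samples: &mut [f32], target_peak: f32) {
    let peak = samples.iter().fold(0.0f32, |m, s| m.max(s.abs()));
    if peak > 0.0 {
        let gain = target_peak / peak;
        for s in samples.iter_mut() {
            *s *= gain;
        }
    }
}
```

A constant (DC) offset fed through high_pass decays toward zero, which is the effect an 80 Hz or 200 Hz cutoff is meant to have on rumble and hum below the speech band.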
Feature Flags
| Feature | Default | Description |
|---|---|---|
| webrtc | Yes | WebRTC VAD backend |
| silero | No | Silero VAD backend (ONNX model downloaded at build time) |
| ten-vad | No | TEN-VAD backend (ONNX model downloaded at build time) |
| denoise | No | RNNoise-based noise suppression in the preprocessing pipeline |
| serde | No | Serialize/Deserialize for config types |
ONNX Model Downloads
Silero and TEN-VAD models are downloaded automatically at build time. For offline or CI builds, point to a local model file:
SILERO_MODEL_PATH=/path/to/silero_vad.onnx
TEN_VAD_MODEL_PATH=/path/to/ten-vad.onnx
Error Handling
All backends return Result<f32, VadError>. The error type covers:
- VadError::InvalidSampleRate(u32): unsupported sample rate for the backend
- VadError::InvalidFrameSize { got, expected }: wrong number of samples
- VadError::BackendError(String): backend-specific error (e.g., ONNX failure)
Use capabilities() to check a backend's requirements before processing.
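A caller would typically match on the error variants to decide how to react. The sketch below declares a local enum that mirrors the variants listed above so that it compiles on its own; the usize field types for got and expected, and the describe helper, are assumptions rather than the crate's definitions:

```rust
// Local mirror of the documented variants, for illustration only.
// The field types of `got` and `expected` are assumed to be usize.
#[derive(Debug)]
enum VadError {
    InvalidSampleRate(u32),
    InvalidFrameSize { got: usize, expected: usize },
    BackendError(String),
}

/// Turns an error into a message a caller could log or show.
fn describe(err: &VadError) -> String {
    match err {
        VadError::InvalidSampleRate(rate) => {
            format!("unsupported sample rate: {rate} Hz")
        }
        VadError::InvalidFrameSize { got, expected } => {
            format!("frame has {got} samples, backend expects {expected}")
        }
        VadError::BackendError(msg) => format!("backend failure: {msg}"),
    }
}
```

The first two variants indicate caller mistakes that can be avoided by checking the backend's requirements up front, while a backend error usually has to be logged or surfaced.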
vad-lab
Dev tool for live VAD experimentation. Captures audio server-side and streams results to a web UI.
Quick Start
License
Apache-2.0
TEN-VAD model notice
The TEN-VAD ONNX model (used by the ten-vad feature) is licensed under Apache-2.0 with a non-compete clause by the TEN-framework / Agora. It restricts deployment that competes with Agora's offerings and limits deployment to "solely for your benefit and the benefit of your direct End Users." This is not standard open-source despite the Apache-2.0 label. Review the TEN-VAD license before using in production.
Third-party notices
This project uses nnnoiseless (BSD-3-Clause) for noise suppression via the denoise feature.