fast-vad
Extremely fast voice activity detection in Rust with Python bindings and streaming mode support. Significantly faster than WebRTC VAD and orders of magnitude faster than Silero ONNX. See benchmark comparisons.
Supports 16 kHz and 8 kHz sample rates.
Architecture
Audio is split into non-overlapping 32 ms frames (512 samples at 16 kHz, 256 at 8 kHz), Hann-windowed, FFT'd, and collapsed into 8 log-energy bands covering roughly 94-4000 Hz.
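The framing arithmetic above can be checked with a small pure-Python sketch. It is not the crate's implementation: the function names, the periodic Hann window, the naive DFT (standing in for a real FFT), and the equal-width band layout are all assumptions made for illustration. Only the frame length, the band count, and the approximate 94-4000 Hz range come from the description above.

```python
import cmath
import math

FRAME_MS = 32  # frame duration stated in the README


def frame_len(sample_rate):
    """Samples per 32 ms frame: 512 at 16 kHz, 256 at 8 kHz."""
    return sample_rate * FRAME_MS // 1000


def split_frames(samples, sample_rate):
    """Split into non-overlapping frames, dropping any trailing partial frame."""
    n = frame_len(sample_rate)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]


def band_log_energies(frame, sample_rate, num_bands=8, lo_hz=94.0, hi_hz=4000.0):
    """Hann-window one frame and collapse its spectrum into log-energy bands.

    Assumption: bands are equal-width groups of DFT bins. The real band
    edges are not given in the README.
    """
    n = len(frame)
    # Periodic Hann window (assumed variant).
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, s in enumerate(frame)]
    hz_per_bin = sample_rate / n          # 31.25 Hz at both supported rates
    lo_bin = round(lo_hz / hz_per_bin)    # bin 3, about 94 Hz
    hi_bin = round(hi_hz / hz_per_bin)    # bin 128, 4000 Hz
    # Naive DFT over the bins of interest; a real implementation uses an FFT.
    power = []
    for k in range(lo_bin, hi_bin):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(windowed))
        power.append(abs(acc) ** 2)
    width = len(power) / num_bands
    return [math.log(sum(power[int(b * width):int((b + 1) * width)]) + 1e-10)
            for b in range(num_bands)]
```

Both supported sample rates give the same 31.25 Hz bin spacing, which is why the lower edge lands at roughly 94 Hz (bin 3) in either case.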
Per frame, the detector builds 32 features: 8 raw log-energies, 8 noise-normalised values (raw minus a running noise floor), and first- and second-order deltas (8 each). A logistic regression model with weights compiled into the crate scores these features, and the score is compared to a mode-specific threshold. The noise floor is a per-band exponential moving average that only updates on silence frames, so it adapts to background noise without being contaminated by speech.
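As a rough illustration of the feature pipeline described above, the sketch below shows a silence-gated noise floor, a 32-dimensional feature vector, and a logistic score. The class and function names, the smoothing factor `alpha`, the way the floor is seeded, and the choice to take deltas over the raw bands are assumptions. The real weights live inside the crate and are not reproduced here.

```python
import math


class NoiseFloor:
    """Per-band exponential moving average that only tracks silence frames."""

    def __init__(self, initial, alpha=0.1):
        self.floor = list(initial)  # assumption: seeded by the caller
        self.alpha = alpha          # hypothetical smoothing factor

    def update(self, bands, is_speech):
        if is_speech:
            return  # speech frames must not contaminate the floor
        self.floor = [(1 - self.alpha) * f + self.alpha * b
                      for f, b in zip(self.floor, bands)]


def build_features(raw, prev_raw, prev_delta, floor):
    """Return (32 features, current delta) for one frame of 8 raw bands.

    Assumption: deltas are simple frame-to-frame differences of the raw bands.
    """
    normalised = [r - f for r, f in zip(raw, floor)]
    delta = [r - p for r, p in zip(raw, prev_raw)]
    delta2 = [d - p for d, p in zip(delta, prev_delta)]
    return list(raw) + normalised + delta + delta2, delta


def speech_probability(features, weights, bias):
    """Logistic regression score in [0, 1]; compare against a threshold."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

A frame would be labelled speech when `speech_probability(...)` exceeds the threshold of the active mode. The floor is then updated with that label, so the next frame is normalised against the latest silence estimate.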
Raw frame labels are then post-processed: short speech bursts below min_speech_ms are dropped, short silence gaps below min_silence_ms are filled, and voiced regions are extended by hangover_ms to avoid clipping word endings.
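The three post-processing steps can be sketched on a list of per-frame booleans. This is an illustrative reimplementation, not the crate's code: the function names, the rounding of milliseconds up to whole frames, and the decision to fill only gaps that lie between two speech regions are assumptions.

```python
import math


def _runs(labels):
    """Yield (value, start, end) for each run of identical labels."""
    i = 0
    while i < len(labels):
        j = i
        while j < len(labels) and labels[j] == labels[i]:
            j += 1
        yield labels[i], i, j
        i = j


def smooth_labels(labels, min_speech_ms, min_silence_ms, hangover_ms, frame_ms=32):
    """Apply burst removal, gap filling, and hangover, in that order."""
    def to_frames(ms):
        return math.ceil(ms / frame_ms)  # assumed rounding

    out = [bool(x) for x in labels]

    # 1. Drop speech bursts shorter than min_speech_ms.
    for value, start, end in list(_runs(out)):
        if value and end - start < to_frames(min_speech_ms):
            out[start:end] = [False] * (end - start)

    # 2. Fill silence gaps shorter than min_silence_ms between speech regions.
    for value, start, end in list(_runs(out)):
        interior = start > 0 and end < len(out)
        if not value and interior and end - start < to_frames(min_silence_ms):
            out[start:end] = [True] * (end - start)

    # 3. Extend each speech region by hangover_ms to protect word endings.
    for value, start, end in list(_runs(out)):
        if value:
            stop = min(len(out), end + to_frames(hangover_ms))
            out[end:stop] = [True] * (stop - end)

    return out
```

With 32 ms frames, `min_speech_ms=64` removes isolated single-frame detections, and `hangover_ms=32` adds one trailing frame to every voiced region.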
VAD processes all frames in parallel with rayon. VadStateful processes one frame at a time with reused FFT scratch buffers for low-latency streaming. Hot loops are SIMD-accelerated via the wide crate.
Install
Python
Or with uv:
Rust
Build from source
Python
Requires a Rust toolchain and maturin.
Rust
Python usage
fast-vad comes with a few built-in modes. VAD() and VadStateful() both default to fast_vad.mode.normal, for offline and streaming use respectively. To customise parameters, use with_mode, or with_config for finer control.
, =
assert in
# Default (Normal mode)
=
# Explicit mode
= # choose permissive, normal or aggressive
# Custom parameters
=
# Per-sample labels
=
# Per-frame labels
=
# Speech segments as a (N, 2) uint64 numpy array of [start, end] sample indices
=
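To make the relationship between per-frame labels and sample-index segments concrete, here is a small standalone sketch. It does not call fast-vad: `frames_to_segments` is a hypothetical helper, and treating `end` as an exclusive sample index is an assumption, since the README does not say.

```python
def frames_to_segments(frame_labels, frame_len=512):
    """Convert per-frame speech labels to [start, end) sample-index pairs."""
    segments = []
    start = None
    for i, voiced in enumerate(frame_labels):
        if voiced and start is None:
            start = i * frame_len              # speech region opens
        elif not voiced and start is not None:
            segments.append([start, i * frame_len])  # region closes
            start = None
    if start is not None:
        # Speech runs to the end of the audio.
        segments.append([start, len(frame_labels) * frame_len])
    return segments
```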
Streaming
# Default (Normal mode)
=
# Explicit mode
=
# Custom parameters
=
= # 512 at 16 kHz, 256 at 8 kHz
=
# reuse for another stream
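Because the streaming detector consumes one fixed-size frame at a time, audio arriving in arbitrary chunk sizes (from a socket or sound card, for example) usually needs re-chunking first. The buffer below is a hypothetical helper, not part of fast-vad. It only shows one way to produce 512-sample (16 kHz) or 256-sample (8 kHz) frames from uneven input.

```python
class FrameBuffer:
    """Accumulate arbitrary-sized chunks and emit fixed-size frames."""

    def __init__(self, frame_len):
        self.frame_len = frame_len
        self._pending = []

    def push(self, chunk):
        """Add samples and return every complete frame now available."""
        self._pending.extend(chunk)
        frames = []
        while len(self._pending) >= self.frame_len:
            frames.append(self._pending[:self.frame_len])
            del self._pending[:self.frame_len]
        return frames

    def reset(self):
        """Discard leftover samples before starting another stream."""
        self._pending.clear()
```

Each frame returned by `push` can then be handed to the streaming detector in order. Call `reset` on the buffer whenever the detector itself is reset.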
Feature extraction
fast-vad can also be used as a standalone feature extractor.
=
# 8 log-energy band features per frame
= # shape: (num_frames, 8)
# 24-dimensional features per frame: raw bands + first- and second-order deltas
= # shape: (num_frames, 24)
Modes
| Constant | Description |
|---|---|
| fast_vad.mode.permissive | Low false-negative rate; more speech accepted |
| fast_vad.mode.normal | Balanced, general-purpose |
| fast_vad.mode.aggressive | Low false-positive rate; stricter |
The built-in modes were tuned against LibriVAD, so they work best on read speech. For other domains (phone calls, meetings, noisy environments, etc.) you will likely get better results by tuning the with_config() parameters against your own data.
Rust usage
Config is set at construction. VAD::new and VadStateful::new default to Normal
mode; use with_mode or with_config to customise.
use ;
Streaming
use ;
Benchmarking
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option.