fast-vad
Extremely fast voice activity detection in Rust with Python bindings and streaming mode support.
Supports 16 kHz and 8 kHz audio. The frame width is fixed at 32 ms (512 samples at 16 kHz, 256 samples at 8 kHz).
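The frame sizes follow directly from the 32 ms frame width. As a minimal sketch of that arithmetic (the helper names `frame_size` and `num_complete_frames` are illustrative and not part of the fast-vad API):

```python
FRAME_MS = 32  # fixed frame width stated in this README


def frame_size(sample_rate: int) -> int:
    """Samples in one 32 ms frame for a supported sample rate."""
    if sample_rate not in (8000, 16000):
        raise ValueError("only 8 kHz and 16 kHz audio is supported")
    return sample_rate * FRAME_MS // 1000


def num_complete_frames(num_samples: int, sample_rate: int) -> int:
    """How many whole frames fit into a buffer of num_samples samples."""
    return num_samples // frame_size(sample_rate)
```

One second of 16 kHz audio therefore holds 31 complete frames, with 128 samples left over.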
If you are interested in benchmark comparisons, see docs/README.md.
Benchmarking
Python benchmarks live in bench_py/ and Rust benchmarks live in bench_rs/.
Install
Python
Or with uv:
Rust
Architecture
fast_vad is a small fixed-frame DSP pipeline with a hardcoded lightweight classifier.
audio
-> 32 ms frames
   - 16 kHz: 512 samples
   - 8 kHz: 256 samples
-> Hann window
-> real FFT
-> 8 log-energy bands
-> feature engineering
   - raw bands (8)
   - noise-normalized (8)
   - first deltas (8)
   - second deltas (8)
   = 32 total features
-> hardcoded logistic regression
-> threshold + smoothing
-> speech / silence labels
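The front half of the pipeline (Hann window, real FFT, 8 log-energy bands) can be sketched in plain Python. This is an illustration under stated assumptions, not the crate's implementation: the band edges used by fast-vad are not given here, so the sketch assumes eight equal-width bands with the DC bin dropped, and it uses a naive DFT in place of an FFT to stay dependency-free. All function names are hypothetical.

```python
import cmath
import math


def hann_window(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def power_spectrum(frame):
    """Power of DFT bins 0..n/2 of a Hann-windowed real frame.

    A naive O(n^2) DFT is used for clarity; a real FFT yields the same bins.
    """
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hann_window(n))]
    power = []
    for k in range(n // 2 + 1):
        step = -2j * math.pi * k / n
        acc = sum(x * cmath.exp(step * t) for t, x in enumerate(windowed))
        power.append(abs(acc) ** 2)
    return power


def log_band_energies(frame, num_bands=8, eps=1e-10):
    """Log energy in num_bands equal-width bands (band layout is an assumption)."""
    power = power_spectrum(frame)[1:]  # drop the DC bin (assumption)
    width = len(power) // num_bands
    return [
        math.log(sum(power[b * width:(b + 1) * width]) + eps)
        for b in range(num_bands)
    ]
```

With this layout, a 512-sample frame at 16 kHz gives 32 bins (1 kHz) per band, so a 1.5 kHz tone lands in the second band.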
At a glance:
- VAD (offline / batch) splits audio into 32 ms frames and uses rayon to process complete frames in parallel while extracting the 8-band features.
- VadStateful (streaming) runs the same per-frame pipeline one frame at a time and reuses scratch buffers instead of paying thread-pool overhead.
- The detector keeps a running 8-band noise floor, then derives 32 total features from each frame: raw band energies, noise-normalized energies, first-order deltas, and second-order deltas.
- Classification is a tiny hardcoded logistic-regression-style model with fixed weights and bias compiled into the crate.
- The final decision is shaped by simple temporal rules: thresholding, minimum speech length, minimum silence length, and hangover.
- Hot loops are SIMD-accelerated with the wide crate for windowing, spectral power computation, band-energy math, and detector feature math.
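To make the feature layout concrete, here is a hedged sketch of how 8 band energies could become 32 features and a logistic score. The noise-floor update rule (drop immediately, rise slowly at `floor_rise`) and the explicit sigmoid are assumptions, and any weights passed in are placeholders: the crate's actual update rule and compiled-in weights are not reproduced here. `FeatureBuilder` and `speech_probability` are hypothetical names.

```python
import math

NUM_BANDS = 8


class FeatureBuilder:
    """Turns per-frame band energies into 32 features (4 groups of 8)."""

    def __init__(self, floor_rise=0.05):
        self.floor_rise = floor_rise  # assumed upward adaptation rate
        self.noise_floor = None
        self.prev_raw = None
        self.prev_delta = [0.0] * NUM_BANDS

    def push(self, raw):
        raw = list(raw)
        if self.noise_floor is None:  # first frame seeds the state
            self.noise_floor = list(raw)
            self.prev_raw = list(raw)
        # Assumed rule: follow drops at once, creep upward slowly.
        self.noise_floor = [
            r if r < f else f + self.floor_rise * (r - f)
            for r, f in zip(raw, self.noise_floor)
        ]
        normalized = [r - f for r, f in zip(raw, self.noise_floor)]
        delta = [r - p for r, p in zip(raw, self.prev_raw)]
        delta2 = [d - p for d, p in zip(delta, self.prev_delta)]
        self.prev_raw, self.prev_delta = raw, delta
        return raw + normalized + delta + delta2


def speech_probability(features, weights, bias):
    """Linear score plus bias, squashed with a sigmoid."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))
```

Each call to `push` returns the groups in the order raw, noise-normalized, first delta, second delta, so indices 0-7, 8-15, 16-23 and 24-31 map to the four groups.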
frame features (8 bands)
| raw
| raw - noise_floor
| delta
| delta2
v
32 engineered features
v
linear score + bias
v
speech / silence
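The temporal rules that shape the final decision (thresholding, minimum speech length, minimum silence length, hangover) can be illustrated with a small post-processing pass over per-frame probabilities. The order in which the rules are applied and the default values below are assumptions made for this sketch; the crate may combine them differently. `smooth` is a hypothetical helper, with all lengths measured in frames.

```python
def _runs(labels):
    """Yield (value, start, end) for each maximal run; end is exclusive."""
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            yield labels[start], start, i
            start = i


def smooth(probs, threshold=0.5, min_speech=3, min_silence=2, hangover=1):
    """Threshold probabilities, then apply simple temporal rules."""
    labels = [p >= threshold for p in probs]
    n = len(labels)

    # 1) Fill silence gaps shorter than min_silence that sit between speech.
    for value, start, end in list(_runs(labels)):
        if not value and 0 < start and end < n and end - start < min_silence:
            labels[start:end] = [True] * (end - start)

    # 2) Drop speech bursts shorter than min_speech.
    for value, start, end in list(_runs(labels)):
        if value and end - start < min_speech:
            labels[start:end] = [False] * (end - start)

    # 3) Hangover: keep speech active for a few frames after each run.
    for value, start, end in list(_runs(labels)):
        if value:
            stop = min(end + hangover, n)
            labels[end:stop] = [True] * (stop - end)

    return labels
```

Raising `min_speech` or `threshold` makes the output stricter, while a longer `hangover` or `min_silence` keeps more trailing and bridging frames as speech.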
Build from source
Python (with uv)
Requires uv and a Rust toolchain.
The package is then importable inside the virtual environment.
Rust
Add as a dependency in another crate:
[dependencies]
fast-vad = { path = "/path/to/fast-vad" }
Python usage
Config is set at construction time. VAD() and VadStateful() default to Normal
mode; use with_mode or with_config to customise.
, =
assert in
# Default (Normal mode)
=
# Explicit mode
=
# Custom parameters
=
# Per-sample labels
=
# Per-frame labels
=
# Speech segments as a (N, 2) uint64 numpy array of [start, end] sample indices
=
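The relationship between per-frame labels and speech segments can be shown with a small pure-Python conversion. This is a sketch, not the library's code: it assumes the end index is exclusive (this README does not say), and `segments_from_frames` is a hypothetical helper that returns plain lists rather than a uint64 numpy array.

```python
def segments_from_frames(frame_labels, frame_size):
    """Convert per-frame speech flags to [start, end) sample-index pairs."""
    segments = []
    start = None
    for i, is_speech in enumerate(frame_labels):
        if is_speech and start is None:
            start = i  # a speech run begins at this frame
        elif not is_speech and start is not None:
            segments.append([start * frame_size, i * frame_size])
            start = None
    if start is not None:  # speech runs to the end of the audio
        segments.append([start * frame_size, len(frame_labels) * frame_size])
    return segments
```

For example, frame labels `[False, True, True, False, True]` at 16 kHz (512 samples per frame) map to segments `[[512, 1536], [2048, 2560]]` under the exclusive-end assumption.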
Streaming
# Default (Normal mode)
=
# Explicit mode
=
# Custom parameters
=
= # 512 at 16 kHz, 256 at 8 kHz
=
# reuse for another stream
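Because frames have a fixed size (512 samples at 16 kHz, 256 at 8 kHz), audio that arrives in arbitrary chunk sizes may need re-chunking before per-frame processing. The sketch below shows one way to do that in caller code. `FrameBuffer` is a hypothetical helper, not part of fast-vad, and it does not call any fast-vad function.

```python
class FrameBuffer:
    """Accumulates incoming samples and emits complete fixed-size frames."""

    def __init__(self, frame_size):
        self.frame_size = frame_size
        self._pending = []

    def push(self, samples):
        """Add samples; return a list of the complete frames now available."""
        self._pending.extend(samples)
        frames = []
        while len(self._pending) >= self.frame_size:
            frames.append(self._pending[:self.frame_size])
            del self._pending[:self.frame_size]
        return frames

    def reset(self):
        """Discard leftover samples before starting another stream."""
        self._pending.clear()

    @property
    def pending(self):
        """Number of buffered samples not yet forming a full frame."""
        return len(self._pending)
```

Each frame returned by `push` can then be handed to the streaming detector one at a time, and `reset` pairs naturally with resetting the detector between streams.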
Feature extraction
=
= # shape: (num_frames, 8)
Modes
| Constant | Description |
|---|---|
| fast_vad.mode.permissive | Low false-negative rate; more speech accepted |
| fast_vad.mode.normal | Balanced, general-purpose |
| fast_vad.mode.aggressive | Low false-positive rate; stricter |
Rust usage
Config is set at construction. VAD::new and VadStateful::new default to Normal
mode; use with_mode or with_config to customise.
use ;
Streaming
use ;
License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option.