fast-vad 0.2.1

fast-vad

Extremely fast voice activity detection in Rust with Python bindings and streaming mode support. Significantly faster than WebRTC VAD and orders of magnitude faster than Silero ONNX. See benchmark comparisons.

Supports 16 kHz and 8 kHz sample rates.

Architecture

Audio is split into non-overlapping 32 ms frames (512 samples at 16 kHz, 256 at 8 kHz), Hann-windowed, FFT'd, and collapsed into 8 log-energy bands covering roughly 94-4000 Hz.
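The framing step can be illustrated with a small standard-library sketch. It is not the crate's code: the symmetric Hann variant and the dropping of a trailing partial frame are assumptions, and `frame_size`, `hann_window` and `windowed_frames` are hypothetical helper names. The FFT and band-collapsing steps are left out.

```python
import math

FRAME_MS = 32  # frame length stated in the text above


def frame_size(sample_rate):
    """Samples per 32 ms frame: 512 at 16 kHz, 256 at 8 kHz."""
    if sample_rate not in (8000, 16000):
        raise ValueError("only 8 kHz and 16 kHz are supported")
    return sample_rate * FRAME_MS // 1000


def hann_window(n):
    """Symmetric Hann window (assumed variant; the crate may use the periodic form)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def windowed_frames(samples, sample_rate):
    """Split samples into non-overlapping frames and apply the window.

    A trailing partial frame is dropped here, which is an assumption.
    """
    size = frame_size(sample_rate)
    window = hann_window(size)
    frames = []
    for start in range(0, len(samples) - size + 1, size):
        frame = samples[start:start + size]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


# One second of constant signal at 16 kHz yields 31 full frames of 512 samples.
frames = windowed_frames([1.0] * 16000, 16000)
print(len(frames), len(frames[0]))
```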

Per frame, the detector builds 32 features: 8 raw log-energies, 8 noise-normalised values (raw minus a running noise floor), and 8 first-order and 8 second-order deltas. A logistic regression model with weights compiled into the crate scores these features and compares the result to a mode-specific threshold. The noise floor is a per-band exponential moving average that only updates on silence frames, so it adapts to background noise without being contaminated by speech.
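A rough sketch of how the noise floor and the logistic score could interact is shown below. The smoothing factor `ALPHA`, the weights, the bias and the helper names are all hypothetical; the real weights are compiled into the crate and are not reproduced here. Deltas are omitted, so this toy model sees 16 features rather than 32.

```python
import math

NUM_BANDS = 8
ALPHA = 0.05  # hypothetical smoothing factor, not taken from the crate


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score(features, weights, bias):
    """Logistic regression: sigmoid of the weighted feature sum plus a bias."""
    return sigmoid(sum(f * w for f, w in zip(features, weights)) + bias)


def update_noise_floor(floor, bands, is_speech, alpha=ALPHA):
    """Per-band EMA that moves towards the current bands only on silence frames."""
    if is_speech:
        return list(floor)
    return [(1.0 - alpha) * f + alpha * b for f, b in zip(floor, bands)]


def step(floor, bands, weights, bias, threshold):
    """Classify one frame from raw and noise-normalised bands, then adapt the floor."""
    normalised = [b - f for b, f in zip(bands, floor)]
    features = list(bands) + normalised  # deltas omitted to keep the sketch short
    is_speech = score(features, weights, bias) >= threshold
    return is_speech, update_noise_floor(floor, bands, is_speech)


# Toy weights: ignore raw bands, weight each noise-normalised band by 1.0.
weights = [0.0] * NUM_BANDS + [1.0] * NUM_BANDS
floor = [1.0] * NUM_BANDS

loud, floor_after_loud = step(floor, [3.0] * NUM_BANDS, weights, -4.0, 0.5)
quiet, floor_after_quiet = step(floor, [1.2] * NUM_BANDS, weights, -4.0, 0.5)
print(loud, quiet)
```

In this sketch the loud frame is classified as speech and leaves the floor untouched, while the quiet frame is classified as silence and nudges the floor towards its band energies.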

Raw frame labels are then post-processed: short speech bursts below min_speech_ms are dropped, short silence gaps below min_silence_ms are filled, and voiced regions are extended by hangover_ms to avoid clipping word endings.
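The three post-processing passes can be sketched on a plain list of per-frame booleans. This is an illustration under assumptions, not the crate's implementation: the pass order follows the sentence above, a run counts as short when its length times 32 ms is below the limit, silence at the very start or end is never filled, and the hangover is rounded up to whole frames. `smooth` and `_runs` are hypothetical names.

```python
FRAME_MS = 32  # frame duration from the Architecture section


def _runs(labels):
    """Yield (value, start, end) for each run of equal labels; end is exclusive."""
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            yield labels[start], start, i
            start = i


def smooth(labels, min_speech_ms, min_silence_ms, hangover_ms):
    out = list(labels)

    # 1. Drop speech bursts shorter than min_speech_ms.
    for value, start, end in list(_runs(out)):
        if value and (end - start) * FRAME_MS < min_speech_ms:
            out[start:end] = [False] * (end - start)

    # 2. Fill silence gaps shorter than min_silence_ms that sit between speech.
    for value, start, end in list(_runs(out)):
        interior = start > 0 and end < len(out)
        if not value and interior and (end - start) * FRAME_MS < min_silence_ms:
            out[start:end] = [True] * (end - start)

    # 3. Extend each voiced region by the hangover, rounded up to whole frames.
    hang = -(-hangover_ms // FRAME_MS)
    extended = list(out)
    for value, start, end in _runs(out):
        if value:
            for i in range(end, min(end + hang, len(out))):
                extended[i] = True
    return extended


raw = ([False, True, False, False] + [True] * 4 + [False]
       + [True] * 4 + [False] * 12)
smoothed = smooth(raw, min_speech_ms=100, min_silence_ms=300, hangover_ms=100)
print("".join("S" if x else "." for x in smoothed))
```

Here the isolated one-frame burst is removed, the one-frame gap between the two four-frame bursts is filled, and the merged region is extended by four frames.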

VAD processes all frames in parallel with rayon. VadStateful processes one frame at a time with reused FFT scratch buffers for low-latency streaming. Hot loops are SIMD-accelerated via the wide crate.

Install

Python

pip install fast-vad

Or with uv:

uv add fast-vad

Rust

cargo add fast-vad

Build from source

Python

Requires a Rust toolchain and maturin.

git clone https://github.com/AtharvBhat/fast-vad
cd fast-vad
maturin develop --release

Rust

cargo build --release

Python usage

fast-vad ships with a few built-in modes. VAD() and VadStateful() default to fast_vad.mode.normal for offline and streaming use respectively. To customise behaviour, use with_mode to pick a mode or with_config for finer control over individual parameters.

import numpy as np
import soundfile as sf
import fast_vad

audio, sr = sf.read("audio.wav", dtype="float32")
assert sr in (8000, 16000)

# Default (Normal mode)
vad = fast_vad.VAD(sr)

# Explicit mode
vad = fast_vad.VAD.with_mode(sr, fast_vad.mode.aggressive)  # choose permissive, normal or aggressive

# Custom parameters
vad = fast_vad.VAD.with_config(
    sr,
    threshold_probability=0.7,
    min_speech_ms=100,
    min_silence_ms=300,
    hangover_ms=100,
)

# Per-sample labels
labels = vad.detect(audio)

# Per-frame labels
frame_labels = vad.detect_frames(audio)

# Speech segments as a (N, 2) uint64 numpy array of [start, end] sample indices
segments = vad.detect_segments(audio)
for start, end in segments:
    print(f"speech: {start/sr:.2f}s – {end/sr:.2f}s")
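To see how the per-frame and per-segment outputs relate, here is a small sketch that converts frame labels into [start, end] sample ranges. It is not the library's code: `frames_to_segments` is a hypothetical helper, and treating `end` as exclusive is an assumption.

```python
def frames_to_segments(frame_labels, frame_size):
    """Merge consecutive speech frames into (start, end) sample ranges.

    `end` is treated as exclusive, which is an assumption about the convention.
    """
    segments = []
    start = None
    for i, is_speech in enumerate(frame_labels):
        if is_speech and start is None:
            start = i * frame_size
        elif not is_speech and start is not None:
            segments.append((start, i * frame_size))
            start = None
    if start is not None:
        # Speech runs until the last full frame.
        segments.append((start, len(frame_labels) * frame_size))
    return segments


def total_speech_seconds(segments, sample_rate):
    return sum(end - start for start, end in segments) / sample_rate


example = frames_to_segments([False, True, True, False, True], 512)
print(example, total_speech_seconds(example, 16000))
```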

Streaming

# Default (Normal mode)
vad = fast_vad.VadStateful(sr)

# Explicit mode
vad = fast_vad.VadStateful.with_mode(sr, fast_vad.mode.normal)

# Custom parameters
vad = fast_vad.VadStateful.with_config(sr, 0.7, 100, 300, 100)

frame_size = vad.frame_size  # 512 at 16 kHz, 256 at 8 kHz

for i in range(0, len(audio) - frame_size + 1, frame_size):
    is_speech = vad.detect_frame(audio[i : i + frame_size])
    print(f"frame {i // frame_size}: {'speech' if is_speech else 'silence'}")

vad.reset_state()  # reuse for another stream
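Live audio rarely arrives in exact frame-sized chunks. One way to handle that is a small re-chunking buffer in front of detect_frame. `FrameBuffer` below is a hypothetical helper, not part of fast-vad, and it works on plain Python lists to stay dependency-free.

```python
class FrameBuffer:
    """Collects arbitrarily sized chunks and hands back complete frames."""

    def __init__(self, frame_size):
        self.frame_size = frame_size
        self._pending = []

    def push(self, chunk):
        """Add a chunk of samples and return every complete frame now available."""
        self._pending.extend(chunk)
        frames = []
        while len(self._pending) >= self.frame_size:
            frames.append(self._pending[:self.frame_size])
            del self._pending[:self.frame_size]
        return frames

    def clear(self):
        """Discard leftover samples, e.g. when starting a new stream."""
        self._pending.clear()


buffer = FrameBuffer(512)
counts = []
for _ in range(4):
    # Pretend each network packet carries 300 samples.
    counts.append(len(buffer.push([0.0] * 300)))
print(counts)
```

Each returned frame would then be converted to a float32 NumPy array and passed to vad.detect_frame, assuming it accepts the same input type as in the loop above. Calling clear() alongside vad.reset_state() keeps the buffer and the detector in step between streams.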

Feature extraction

fast-vad can also be used on its own as a feature extractor.

fe = fast_vad.FeatureExtractor(sr)

# 8 log-energy band features per frame
features = fe.extract_features(audio)  # shape: (num_frames, 8)

# 24-dimensional features per frame: raw bands + first- and second-order deltas
features = fe.feature_engineer(audio)  # shape: (num_frames, 24)
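The 24-dimensional layout can be pictured with a short sketch. The delta definition used here (a plain difference between consecutive frames, with zeros for the first frame) is an assumption, since the exact formula is not stated, and `deltas` and `with_deltas` are hypothetical names.

```python
def deltas(rows):
    """First-order difference between consecutive frames; the first frame gets zeros."""
    out = [[0.0] * len(rows[0])] if rows else []
    for prev, cur in zip(rows, rows[1:]):
        out.append([c - p for c, p in zip(cur, prev)])
    return out


def with_deltas(bands):
    """Concatenate raw bands, first-order deltas and second-order deltas per frame."""
    d1 = deltas(bands)
    d2 = deltas(d1)
    return [b + x + y for b, x, y in zip(bands, d1, d2)]


# Three frames of 8 bands each, standing in for extract_features output.
bands = [[float(t)] * 8 for t in (0, 1, 3)]
features = with_deltas(bands)
print(len(features), len(features[0]))
```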

Modes

Constant                   Description
fast_vad.mode.permissive   Low false-negative rate; more speech accepted
fast_vad.mode.normal       Balanced, general-purpose
fast_vad.mode.aggressive   Low false-positive rate; stricter

The built-in modes were tuned against LibriVAD, so they work best on read speech. For other domains (phone calls, meetings, noisy environments, etc.) you'll likely get better results tuning with_config() against your own data.
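Tuning against your own data can be as simple as a small grid search scored by frame-level F1. The sketch below is one possible approach, not a recommended procedure: `f1_score`, `grid_search` and the `run` callback are hypothetical, the fixed values for min_speech_ms and min_silence_ms are arbitrary, and the demo uses made-up probabilities in place of real detector output.

```python
from itertools import product


def f1_score(predicted, truth):
    """Frame-level F1 between predicted and reference speech labels."""
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def grid_search(run, truth, thresholds, hangovers):
    """Return (best_f1, best_params); `run` maps a parameter dict to frame labels."""
    best = (-1.0, None)
    for threshold, hangover in product(thresholds, hangovers):
        params = {
            "threshold_probability": threshold,
            "min_speech_ms": 100,   # held fixed in this sketch
            "min_silence_ms": 300,  # held fixed in this sketch
            "hangover_ms": hangover,
        }
        score = f1_score(run(params), truth)
        if score > best[0]:
            best = (score, params)
    return best


# Stand-in for the detector: thresholds made-up per-frame probabilities.
fake_probs = [0.1, 0.6, 0.8, 0.9, 0.75, 0.2]
truth = [False, False, True, True, True, False]


def fake_run(params):
    return [p >= params["threshold_probability"] for p in fake_probs]


best_f1, best_params = grid_search(fake_run, truth, (0.5, 0.7, 0.95), (0, 100))
print(best_f1, best_params)
```

With the real library, `run` could wrap something like fast_vad.VAD.with_config(sr, **params).detect_frames(audio), using the keyword names shown in the Python usage section, with `truth` coming from your own frame-level annotations.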

Rust usage

Config is set at construction. VAD::new and VadStateful::new default to Normal mode; use with_mode or with_config to customise.

use fast_vad::vad::detector::{VAD, VADModes, VadConfig};

fn main() -> Result<(), fast_vad::VadError> {
    let audio = vec![0.0f32; 16000]; // 1 second of silence

    // Default (Normal mode)
    let vad = VAD::new(16000)?;

    // Explicit mode
    let vad = VAD::with_mode(16000, VADModes::Aggressive)?;

    // Custom parameters
    let vad = VAD::with_config(16000, VadConfig {
        threshold_probability: 0.7,
        min_speech_ms: 100,
        min_silence_ms: 300,
        hangover_ms: 100,
    })?;

    let labels = vad.detect(&audio);              // one bool per sample
    let frame_labels = vad.detect_frames(&audio); // one bool per frame
    let segments = vad.detect_segments(&audio);   // Vec<[start, end]>

    Ok(())
}

Streaming

use fast_vad::vad::detector::{VadStateful, VADModes, VadConfig};

fn main() -> Result<(), fast_vad::VadError> {
    let audio = vec![0.0f32; 16000];

    // Default (Normal mode)
    let mut vad = VadStateful::new(16000)?;

    // Explicit mode
    let mut vad = VadStateful::with_mode(16000, VADModes::Normal)?;

    // Custom parameters
    let mut vad = VadStateful::with_config(16000, VadConfig {
        threshold_probability: 0.7,
        min_speech_ms: 100,
        min_silence_ms: 300,
        hangover_ms: 100,
    })?;

    let frame_size = vad.frame_size();
    for frame in audio.chunks_exact(frame_size) {
        let is_speech = vad.detect_frame(frame)?;
        println!("{is_speech}");
    }

    vad.reset_state(); // reuse for another stream
    Ok(())
}

Benchmarking

cargo bench --manifest-path bench_rs/Cargo.toml

License

Licensed under either of

at your option.