wavekat-vad 0.1.7

Unified voice activity detection with multiple backends
Documentation
# WaveKat VAD

[![Crates.io](https://img.shields.io/crates/v/wavekat-vad.svg)](https://crates.io/crates/wavekat-vad)
[![docs.rs](https://docs.rs/wavekat-vad/badge.svg)](https://docs.rs/wavekat-vad)
[![CI](https://github.com/wavekat/wavekat-vad/actions/workflows/ci.yml/badge.svg)](https://github.com/wavekat/wavekat-vad/actions/workflows/ci.yml)

Voice Activity Detection library for Rust with multiple backend support.

## Quick Start

```rust
use wavekat_vad::VoiceActivityDetector;
use wavekat_vad::backends::webrtc::{WebRtcVad, WebRtcVadMode};

let mut vad = WebRtcVad::new(16000, WebRtcVadMode::Quality).unwrap();
let samples: Vec<i16> = vec![0; 160]; // 10ms at 16kHz
let probability = vad.process(&samples, 16000).unwrap();
```

## Backends

| Backend | Feature | Sample Rates | Frame Size | Output |
|---------|---------|-------------|------------|--------|
| WebRTC | `webrtc` (default) | 8/16/32/48 kHz | 10, 20, or 30ms | Binary (0.0 or 1.0) |
| Silero | `silero` | 8/16 kHz | 32ms (256 or 512 samples) | Continuous (0.0–1.0) |
| TEN-VAD | `ten-vad` | 16 kHz only | 16ms (256 samples) | Continuous (0.0–1.0) |

```toml
[dependencies]
wavekat-vad = "0.1"                    # WebRTC only (default)
wavekat-vad = { version = "0.1", features = ["silero"] }
wavekat-vad = { version = "0.1", features = ["ten-vad"] }
wavekat-vad = { version = "0.1", features = ["webrtc", "silero", "ten-vad"] }  # all backends
```

### Benchmarks

Performance measured against the [TEN-VAD testset](https://github.com/TEN-framework/ten-vad/tree/main/testset) — 30 audio files from LibriSpeech, GigaSpeech, and DNS Challenge with manual speech/non-speech annotations. Threshold: 0.5.

<!-- benchmark-table-start -->
*v0.1.7*

| Backend | Precision | Recall | F1 Score | Frame Size | Avg Inference | RTF |
|---------|-----------|--------|----------|------------|---------------|-----|
| WebRTC | 0.821 | 0.983 | 0.895 | 480 (30 ms) | 2.6 µs | 0.0001 |
| Silero | 0.938 | 0.938 | 0.938 | 512 (32 ms) | 120.4 µs | 0.0038 |
| TEN-VAD | 0.942 | 0.915 | 0.928 | 256 (16 ms) | 60.7 µs | 0.0038 |
<!-- benchmark-table-end -->

> Accuracy metrics are deterministic; inference times are approximate and vary by hardware. Measured with `--release` on GitHub Actions `ubuntu-latest` runners. Run locally: `make accuracy` or `make bench`

### WebRTC

Google's WebRTC VAD. Fast and lightweight, returns binary speech/silence detection. Supports four aggressiveness modes.

```rust
use wavekat_vad::VoiceActivityDetector;
use wavekat_vad::backends::webrtc::{WebRtcVad, WebRtcVadMode};

// Default 30ms frame duration
let mut vad = WebRtcVad::new(16000, WebRtcVadMode::Quality).unwrap();

// Or specify frame duration (10, 20, or 30ms)
let mut vad = WebRtcVad::with_frame_duration(16000, WebRtcVadMode::Aggressive, 20).unwrap();

let samples = vec![0i16; 320]; // 20ms at 16kHz
let result = vad.process(&samples, 16000).unwrap(); // 0.0 or 1.0
```

### Silero

Neural network (LSTM) via ONNX Runtime. Returns continuous probability, higher accuracy than WebRTC. Only supports 8kHz and 16kHz.

```rust
use wavekat_vad::VoiceActivityDetector;
use wavekat_vad::backends::silero::SileroVad;

let mut vad = SileroVad::new(16000).unwrap();
let samples = vec![0i16; 512]; // 32ms at 16kHz
let probability = vad.process(&samples, 16000).unwrap(); // 0.0–1.0

// Or load a custom model
let vad = SileroVad::from_file("path/to/model.onnx", 16000).unwrap();
```

### TEN-VAD

Agora's TEN-VAD with pure Rust preprocessing (no C dependency). Returns continuous probability, 16kHz only.

```rust
use wavekat_vad::VoiceActivityDetector;
use wavekat_vad::backends::ten_vad::TenVad;

let mut vad = TenVad::new().unwrap();
let samples = vec![0i16; 256]; // 16ms at 16kHz
let probability = vad.process(&samples, 16000).unwrap(); // 0.0–1.0
```

## The `VoiceActivityDetector` Trait

All backends implement a common trait, so you can write code that is generic over backends:

```rust
use wavekat_vad::{VoiceActivityDetector, VadCapabilities};

fn detect_speech(vad: &mut dyn VoiceActivityDetector, audio: &[i16], sample_rate: u32) {
    let caps = vad.capabilities();
    // caps.sample_rate  — required sample rate
    // caps.frame_size   — required frame size in samples
    // caps.frame_duration_ms — frame duration

    for frame in audio.chunks_exact(caps.frame_size) {
        let probability = vad.process(frame, sample_rate).unwrap();
        if probability > 0.5 {
            println!("Speech detected!");
        }
    }
}
```

## `FrameAdapter`

Real-world audio arrives in arbitrary chunk sizes. `FrameAdapter` buffers incoming samples and feeds correctly-sized frames to the backend automatically.

```rust
use wavekat_vad::FrameAdapter;
use wavekat_vad::backends::silero::SileroVad;

let vad = SileroVad::new(16000).unwrap();
let mut adapter = FrameAdapter::new(Box::new(vad));

// Feed arbitrary-sized chunks — adapter handles buffering
let chunk = vec![0i16; 1000]; // not a multiple of 512

// Get all complete frame results at once
let probabilities = adapter.process_all(&chunk, 16000).unwrap();

// Or get just the latest result (convenient for real-time)
let latest = adapter.process_latest(&chunk, 16000).unwrap();

// Or process one frame at a time
let result = adapter.process(&chunk, 16000).unwrap(); // Some(prob) or None
```

## Preprocessing

Optional audio preprocessing to improve VAD accuracy. Available stages: high-pass filter, noise suppression, and amplitude normalization.

```rust
use wavekat_vad::preprocessing::{Preprocessor, PreprocessorConfig};

// Use a preset
let config = PreprocessorConfig::raw_mic();     // 80Hz HP + normalize + denoise
// let config = PreprocessorConfig::telephony(); // 200Hz HP only

// Or configure manually
let config = PreprocessorConfig {
    high_pass_hz: Some(80.0),       // remove low-frequency rumble
    denoise: false,                  // requires "denoise" feature
    normalize_dbfs: Some(-20.0),     // normalize amplitude
};

let mut preprocessor = Preprocessor::new(&config, 16000);
let raw_audio: Vec<i16> = vec![0; 512];
let cleaned = preprocessor.process(&raw_audio);
// feed `cleaned` to your VAD
```

## Feature Flags

| Feature | Default | Description |
|---------|---------|-------------|
| `webrtc` | Yes | WebRTC VAD backend |
| `silero` | No | Silero VAD backend (ONNX model downloaded at build time) |
| `ten-vad` | No | TEN-VAD backend (ONNX model downloaded at build time) |
| `denoise` | No | RNNoise-based noise suppression in the preprocessing pipeline |
| `serde` | No | `Serialize`/`Deserialize` for config types |

### ONNX Model Downloads

Silero and TEN-VAD models are downloaded automatically at build time. For offline or CI builds, point to a local model file:

```sh
SILERO_MODEL_PATH=/path/to/silero_vad.onnx cargo build --features silero
TEN_VAD_MODEL_PATH=/path/to/ten-vad.onnx cargo build --features ten-vad
```

## Error Handling

All backends return `Result<f32, VadError>`. The error type covers:

- `VadError::InvalidSampleRate(u32)` — unsupported sample rate for the backend
- `VadError::InvalidFrameSize { got, expected }` — wrong number of samples
- `VadError::BackendError(String)` — backend-specific error (e.g., ONNX failure)

Use `capabilities()` to check a backend's requirements before processing.

## vad-lab

Dev tool for live VAD experimentation. Captures audio server-side and streams results to a web UI.

<p align="center">
  <img src="https://raw.githubusercontent.com/wavekat/wavekat-vad/main/docs/images/vad-lab-screenshot.png" alt="vad-lab screenshot" width="700">
  <br>
  <em>vad-lab web interface</em>
</p>

### Quick Start

```sh
make setup         # Install dependencies (once)
make dev-backend   # Terminal 1
make dev-frontend  # Terminal 2
```

## License

Apache-2.0

### TEN-VAD model notice

The TEN-VAD ONNX model (used by the `ten-vad` feature) is licensed under Apache-2.0 with a non-compete clause by the TEN-framework / Agora. It restricts deployment that competes with Agora's offerings and limits deployment to "solely for your benefit and the benefit of your direct End Users." This is **not standard open-source** despite the Apache-2.0 label. Review the [TEN-VAD license](https://github.com/TEN-framework/ten-vad) before using in production.

### Third-party notices

This project uses [nnnoiseless](https://github.com/jneem/nnnoiseless) (BSD-3-Clause) for noise suppression via the `denoise` feature.