rlx-vad
Voice activity detection on RLX — pure Rust inference, no ONNX Runtime or PyTorch.
Two backends ship with embedded weights (no downloads at runtime):
| Backend | Reference | Embedded file | Size | Frame @ 16 kHz |
|---|---|---|---|---|
| Earshot | pykeio/earshot | weights/earshot_weights.bin | ~75 KiB | 256 samples (16 ms) |
| Silero | snakers4/silero-vad | weights/silero_vad_16k.safetensors | ~920 KiB | 512 samples + 64 context |
Facade re-export: rlx_models::vad (see crates/rlx-models/src/lib.rs).
Quick start
# Earshot — embedded bin (~5–6 µs/frame on Apple Silicon)
# Silero — embedded safetensors
# Optional: print segment boundaries in seconds
# Noise / latency / quality bench (assets/jfk + white-noise sweep)
# Sweep RLX device slots (cpu, metal, mlx, wgpu, …)
# or: cargo run -p rlx-vad --example jfk_bench --release --features all-backends -- --devices all
CLI flags: --backend earshot|silero, --wav PATH, --threshold (override preset), --device cpu|metal|…, --weights PATH (Silero override only), --seconds.
Bench flags: --devices all|apple-silicon|cpu,metal,… (see Benchmarks).
Benchmarks (assets/jfk)
Measured with cargo run -p rlx-vad --example jfk_bench --release on Apple Silicon (release, CPU BLAS / Accelerate). Clips: assets/jfk/jfk_rust_speech.wav (12.1 s) and jfk_voice_clone.wav (5.2 s), each wrapped in 1.5 s silence pads for labeled-region scoring.
Latency (clean audio)
| VAD | Mean / frame | p99 / frame | RTF | Notes |
|---|---|---|---|---|
| Earshot | ~5–6 µs | ~6 µs | ~0.00034 | 256-sample hop; BLAS MinGRU |
| Silero | ~365 µs | ~460 µs | ~0.011 | 512-sample hop + 64-sample context |
RTF = wall time ÷ audio duration (lower is faster). Both backends run far below RTF = 1, so either can keep up with a live stream with ample headroom.
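As a cross-check of the table, the RTF column can be reproduced from the mean per-frame time and the hop size. The sketch below is illustrative only: `real_time_factor` is a hypothetical helper, not part of rlx-vad, and it assumes one inference call per hop with no overlap between hops.

```rust
/// Real-time factor for a streaming detector: time spent processing one
/// frame divided by the duration of audio that frame advances the stream.
/// Hypothetical helper for illustration; not an rlx-vad API.
fn real_time_factor(mean_frame_secs: f64, hop_samples: u32, sample_rate_hz: u32) -> f64 {
    // Audio covered by one hop, in seconds (256 samples @ 16 kHz = 16 ms).
    let hop_secs = f64::from(hop_samples) / f64::from(sample_rate_hz);
    mean_frame_secs / hop_secs
}
```

With the figures above, `real_time_factor(5.5e-6, 256, 16_000)` gives about 0.00034 for Earshot and `real_time_factor(365e-6, 512, 16_000)` gives about 0.011 for Silero.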
Quality (clean audio, algorithm-specific segment presets)
Frame metrics use each algorithm’s SegmentParams preset (see below). Segment IoU measures overlap with the labeled speech region — the primary quality metric for Silero, where LSTM state keeps frame probabilities elevated on trailing silence (frame-level silence specificity is misleading).
| VAD | Frame acc | Speech recall | Silence spec | Seg IoU | Mean speech prob |
|---|---|---|---|---|---|
| Earshot | ~70% | ~82% | ~48% | 0.96–0.97 | ~0.67 |
| Silero | ~57% | ~90% | ~0%* | 0.99–1.00 | ~0.82 |
*Silero frame silence specificity is low on padded silence because probabilities decay slowly after speech; segment IoU remains near 1.0 with official min-speech / min-silence settings.
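The column definitions can be made concrete with a small sketch. This is not the bench's code: `FrameCounts`, `count_frames`, and `segment_iou` are hypothetical names, and the exact way jfk_bench derives labels and IoU is an assumption here. It shows why a detector that overshoots into trailing silence can score low on silence specificity while still having high speech recall.

```rust
/// Confusion counts for per-frame speech / silence decisions.
/// Hypothetical type for illustration; not an rlx-vad API.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct FrameCounts {
    tp: usize,  // speech predicted, speech labeled
    tn: usize,  // silence predicted, silence labeled
    fp: usize,  // speech predicted, silence labeled
    fn_: usize, // silence predicted, speech labeled
}

fn ratio(num: usize, den: usize) -> f64 {
    if den == 0 { 0.0 } else { num as f64 / den as f64 }
}

fn count_frames(predicted: &[bool], labeled: &[bool]) -> FrameCounts {
    let mut c = FrameCounts::default();
    for (&p, &l) in predicted.iter().zip(labeled) {
        match (p, l) {
            (true, true) => c.tp += 1,
            (false, false) => c.tn += 1,
            (true, false) => c.fp += 1,
            (false, true) => c.fn_ += 1,
        }
    }
    c
}

impl FrameCounts {
    fn accuracy(&self) -> f64 {
        ratio(self.tp + self.tn, self.tp + self.tn + self.fp + self.fn_)
    }
    fn speech_recall(&self) -> f64 {
        ratio(self.tp, self.tp + self.fn_)
    }
    fn silence_specificity(&self) -> f64 {
        ratio(self.tn, self.tn + self.fp)
    }
}

/// Intersection over union of two half-open sample ranges [start, end).
fn segment_iou(a: (usize, usize), b: (usize, usize)) -> f64 {
    let inter = a.1.min(b.1).saturating_sub(a.0.max(b.0));
    let union = a.1.saturating_sub(a.0) + b.1.saturating_sub(b.0) - inter;
    ratio(inter, union)
}
```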
Noise sweep (jfk_rust_speech.wav, SNR vs white noise)
| SNR | Earshot rec / spec / IoU | Silero rec / IoU |
|---|---|---|
| clean | 82% / 48% / 0.97 | 90% / 0.99 |
| 20 dB | 82% / 47% / 0.97 | 93% / 0.99 |
| 10 dB | 71% / 60% / 0.91 | 93% / 0.99 |
| 5 dB | 75% / 38% / 0.95 | 93% / 0.99 |
| 0 dB | 91% / 26% / 0.96 | 93% / 0.99 |
| −5 dB | 94% / 26% / 0.96 | 89% / 0.99 |
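Earshot is faster and lighter. Silero holds segment IoU at 0.99 across the whole sweep and keeps higher speech recall down to 0 dB, at the cost of ~60× higher per-frame latency. At −5 dB Earshot's recall is higher (94% vs 89%), but its silence specificity falls to 26%.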
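For readers who want to reproduce a sweep like this on their own clips, the sketch below shows one common way to mix white noise at a target SNR. It is an assumption about the general technique, not a copy of jfk_bench: `white_noise`, `noise_gain`, and `mix_at_snr` are hypothetical helpers, the noise is uniform rather than Gaussian, and power is averaged over the whole clip.

```rust
/// Mean power (average squared amplitude) of a signal.
fn mean_power(x: &[f32]) -> f64 {
    if x.is_empty() {
        return 0.0;
    }
    x.iter().map(|&v| f64::from(v) * f64::from(v)).sum::<f64>() / x.len() as f64
}

/// Deterministic uniform white noise in [-1, 1) from a 64-bit LCG,
/// so runs are repeatable without an external RNG crate.
fn white_noise(len: usize, seed: u64) -> Vec<f32> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            // Top 24 bits mapped to [0, 1), then shifted to [-1, 1).
            let unit = (state >> 40) as f32 / (1u32 << 24) as f32;
            2.0 * unit - 1.0
        })
        .collect()
}

/// Gain for the noise so that speech power / scaled-noise power
/// equals 10^(snr_db / 10).
fn noise_gain(speech: &[f32], noise: &[f32], snr_db: f64) -> f64 {
    let noise_power = mean_power(noise);
    if noise_power == 0.0 {
        return 0.0;
    }
    (mean_power(speech) / (noise_power * 10f64.powf(snr_db / 10.0))).sqrt()
}

/// Adds scaled noise to speech sample by sample.
fn mix_at_snr(speech: &[f32], noise: &[f32], snr_db: f64) -> Vec<f32> {
    let gain = noise_gain(speech, noise, snr_db) as f32;
    speech.iter().zip(noise).map(|(&s, &n)| s + gain * n).collect()
}
```

Lower SNR values raise the noise gain; a negative SNR such as −5 dB means the noise carries more power than the speech.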
RLX device compatibility
jfk_bench --devices all validates each RLX backend slot (cpu, metal, mlx, cuda, wgpu, …) and reports identical probabilities across slots (parity checked in tests/backend_quick_check.rs, tolerance < 1e-6).
Streaming inference runs on CPU BLAS for every device slot — 256–512 sample frames make GPU transfer dominate latency (same policy as Whisper decode on Metal/MLX). --device still validates that the requested RLX backend is available in the build.
Segment presets (quality tuning)
CLI and bench pick defaults via SegmentParams::for_algorithm():
| Preset | threshold | neg_threshold | min_speech | min_silence |
|---|---|---|---|---|
| SegmentParams::earshot() | 0.35 | 0.20 | 100 ms | 50 ms |
| SegmentParams::silero() | 0.5 | threshold − 0.15 | 250 ms | 100 ms |
Override on the CLI with --threshold. Library callers can clone a preset and adjust fields.
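To show how the four preset fields interact, here is a minimal hysteresis segmenter. It is a sketch under stated assumptions, not the crate's implementation: `HysteresisParams` and `merge_frames` are hypothetical names, durations are expressed in samples rather than milliseconds, and the rule for closing a segment at end of input is a guess.

```rust
/// Hypothetical stand-in for a segment preset, with durations in samples.
#[derive(Debug, Clone, Copy)]
struct HysteresisParams {
    threshold: f32,             // enter speech at or above this probability
    neg_threshold: f32,         // start counting silence below this probability
    min_speech_samples: usize,  // drop segments shorter than this
    min_silence_samples: usize, // silence needed before a segment is closed
}

/// Merges per-frame speech probabilities into [start, end) sample ranges.
fn merge_frames(probs: &[f32], hop: usize, p: HysteresisParams) -> Vec<(usize, usize)> {
    let mut segments = Vec::new();
    let mut start: Option<usize> = None;
    let mut pending_end: Option<usize> = None;

    for (i, &prob) in probs.iter().enumerate() {
        let pos = i * hop;
        match start {
            None => {
                if prob >= p.threshold {
                    start = Some(pos);
                }
            }
            Some(s) => {
                if prob >= p.threshold {
                    // Speech resumed: forget the tentative end.
                    pending_end = None;
                } else if prob < p.neg_threshold {
                    // Remember where silence began; close once it is long enough.
                    let end = *pending_end.get_or_insert(pos);
                    if pos + hop - end >= p.min_silence_samples {
                        if end - s >= p.min_speech_samples {
                            segments.push((s, end));
                        }
                        start = None;
                        pending_end = None;
                    }
                }
                // Probabilities between the two thresholds leave the state unchanged.
            }
        }
    }

    // Close a segment that is still open when the input ends.
    if let Some(s) = start {
        let end = pending_end.unwrap_or(probs.len() * hop);
        if end - s >= p.min_speech_samples {
            segments.push((s, end));
        }
    }
    segments
}
```

With the Earshot row at 16 kHz (threshold 0.35, neg_threshold 0.20, 1600-sample min_speech, 800-sample min_silence, 256-sample hop), a single low frame inside speech does not split the segment, and a three-frame blip is discarded as too short.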
Cargo features
| Feature | Default | Backend |
|---|---|---|
| earshot | yes | pykeio/earshot CNN + MinGRU (~77 KiB embedded bin) |
| silero | yes | Silero ONNX 16 kHz branch (~944 KiB embedded safetensors; pulls rlx-core) |
| all-backends | no | Forward GPU features to rlx-runtime for --device metal\|cuda\|… validation |
Build one backend only:
enabled_backends() / default_backend() reflect the compile-time VAD algorithm set.
RLX execution devices
--device validates the RLX backend is available (resolve_device). Streaming frame inference runs on CPU BLAS for all device slots; probabilities are identical across slots. See Benchmarks for multi-device bench commands.
Silero embedded weights
What is embedded
The crate embeds weights/silero_vad_16k.safetensors at compile time:
// crates/rlx-vad/src/silero/embedded.rs
const SAFETENSORS: &[u8] = include_bytes!("…/weights/silero_vad_16k.safetensors");
On first use, bytes are parsed with rlx_core::embedded_safetensors::EmbeddedSafetensors and cached in a OnceLock. No filesystem access unless you call SileroWeights::load(path).
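The parse-once pattern described here can be sketched with the standard library alone. Everything named below is a stand-in: `BLOB` replaces the real `include_bytes!` blob, `ParsedWeights` and `parse` replace the real safetensors parsing, and `PARSE_CALLS` exists only to make the single initialization observable. The actual `EmbeddedSafetensors` API is not shown or assumed.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Stand-in for the bytes embedded at compile time.
static BLOB: &[u8] = &[1, 2, 3, 4];

/// Counts how often the (hypothetical) parser runs.
static PARSE_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the parsed weight tensors.
struct ParsedWeights {
    checksum: u32,
}

/// Placeholder parser: a real one would decode tensor headers and data.
fn parse(bytes: &[u8]) -> ParsedWeights {
    PARSE_CALLS.fetch_add(1, Ordering::SeqCst);
    ParsedWeights {
        checksum: bytes.iter().map(|&b| u32::from(b)).sum(),
    }
}

/// Parses the embedded blob on first call and returns the cached result
/// on every later call, from any thread.
fn embedded_weights() -> &'static ParsedWeights {
    static CACHE: OnceLock<ParsedWeights> = OnceLock::new();
    CACHE.get_or_init(|| parse(BLOB))
}
```

`OnceLock::get_or_init` runs the closure at most once even under concurrent first use, which is why no filesystem or lock handling is needed at the call site.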
Not the same as the HF download
Hugging Face hosts a file also named silero_vad_16k.safetensors, but it matches the 8 kHz ONNX branch (STFT (258, 1, 256), conv1 in (128, 129, 3)). That graph is not interchangeable with 16 kHz streaming inference.
The embedded file is exported from the official silero_vad.onnx 16 kHz branch (then_branch when sr == 16000):
| Tensor | Shape | Notes |
|---|---|---|
| stft_conv.weight | (130, 1, 128) | STFT as conv, stride 64, +32 reflect pad |
| conv1.weight / bias | (128, 65, 3) / (128,) | magnitude → 128 ch |
| conv2 … conv4 | … | stride-2 middle layers |
| lstm_cell.weight_ih / weight_hh | (512, 128) | PyTorch LSTM layout |
| lstm_cell.bias_ih / bias_hh | (512,) | |
| final_conv.weight / bias | (1, 128, 1) / (1,) | sigmoid speech prob |
Use the export script below — do not copy the HF artifact into weights/.
Regenerate embedded safetensors
Requires Python 3 + pip install onnx numpy safetensors.
Then rebuild rlx-vad (the new blob is picked up via include_bytes!).
Legacy RLXV blob export (same tensors, custom header) still exists in scripts/export_silero_embedded.py for experiments; safetensors is the supported embed format.
Earshot embedded weights
weights/earshot_weights.bin — custom layout from pykeio/earshot (FFT tables + CNN + MinGRU). Parsed once at startup; no external files.
Library API
use ;
// Earshot — frame-at-a-time
let mut det = default;
let prob = det.predict_f32;
// Silero — streaming session (512-sample frames, 64-sample context)
let mut session = new;
let prob = session.predict_frame?;
// Segments with tuned presets (requires matching Cargo features)
let _dev = resolve_device?;
let segs = speech_segments_earshot;
let segs = speech_segments_silero?;
Segment helpers merge frame scores into [start, end) sample ranges. Use SegmentParams::earshot() or ::silero() rather than bare defaults when quality matters.
Tests
Integration tests use assets/jfk/jfk_rust_speech.wav when present.
Implementation notes
- Shared ops: src/ops.rs provides Conv1d, LSTM cell, BLAS gemv (via rlx-cpu).
- Silero STFT: reflect-pad 32 samples right; conv stride 64; magnitude from 65 bins of 130 STFT channels.
- Streaming: Silero expects context || chunk (576 samples @ 16 kHz per step); LSTM state carried in SileroSession.
- Backends: --device validates RLX backend availability; streaming inference uses CPU BLAS on all slots (see streaming_execution_device).
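The context || chunk layout from the streaming note can be sketched as follows. This is an illustration, not the code inside SileroSession: `ContextBuffer` is a hypothetical type, and starting from an all-zero context is an assumption about the first step.

```rust
const CONTEXT: usize = 64; // samples carried over from the previous chunk
const CHUNK: usize = 512; // new samples per step at 16 kHz

/// Hypothetical helper that builds the 576-sample model input per step.
struct ContextBuffer {
    context: [f32; CONTEXT],
}

impl ContextBuffer {
    /// Assumes the stream starts with a zeroed context.
    fn new() -> Self {
        Self { context: [0.0; CONTEXT] }
    }

    /// Returns context || chunk and keeps the chunk's last 64 samples
    /// as the context for the next step.
    fn push(&mut self, chunk: &[f32; CHUNK]) -> [f32; CONTEXT + CHUNK] {
        let mut input = [0.0f32; CONTEXT + CHUNK];
        input[..CONTEXT].copy_from_slice(&self.context);
        input[CONTEXT..].copy_from_slice(chunk);
        self.context.copy_from_slice(&chunk[CHUNK - CONTEXT..]);
        input
    }
}
```

Carrying the tail of each chunk forward gives the first STFT windows of the next step real signal to look back on instead of an artificial edge.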
See also
- rlx_core::embedded_safetensors: reusable in-memory safetensors loader for other small embedded models
- README.md: workspace overview
- AGENTS.md: agent command cheat sheet