# polyvoice

Speaker diarization for Rust — who spoke when, without Python. Silero VAD + WeSpeaker embeddings + AHC clustering in a single `Pipeline::run()` call.
Input: 14 seconds of two-speaker audio (16 kHz mono WAV). Output:

```
SPEAKER_00: 0.10s - 7.60s
SPEAKER_01: 8.10s - 14.10s
```
## Quick start

1. Add the dependency

```toml
[dependencies]
polyvoice = { version = "0.5", features = ["onnx"] }
```
2. Download models

```sh
# Downloads WeSpeaker ResNet34 (25 MB) and Silero VAD v5 (2.2 MB) to models/
polyvoice download-models
```
3. Run the pipeline

Rust:

```rust
use polyvoice::Pipeline;
use std::path::Path;

// Constructor shown is illustrative; `Pipeline::run()` is the one-call entry point.
let pipeline = Pipeline::new(Path::new("models"))?;
for turn in pipeline.run(Path::new("meeting.wav"))? {
    println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.start, turn.end);
}
```

Python:

```python
import polyvoice

# Names are illustrative; the bindings advertise a 3-line API.
pipeline = polyvoice.Pipeline("models")
segments = pipeline.run("meeting.wav")
```

CLI:

```sh
polyvoice diarize meeting.wav
```
## How it works

```
WAV / PCM audio (16 kHz mono)
        |
        v
+-------------+      +------------------+      +---------+
| Silero VAD  |----->|    WeSpeaker     |----->|   AHC   |---> Speaker turns
|  (speech    |      |     ResNet34     |      | cluster |
|  regions)   |      |  (256-d embed.)  |      |         |
+-------------+      +------------------+      +---------+
                       fbank + CMVN            cosine similarity
                       lock-free pool          threshold merging
```
VAD detects speech regions, skipping silence. WeSpeaker extracts 256-dimensional speaker embeddings from log-mel filterbank features (80-bin, CMVN-normalized). AHC clusters embeddings by cosine similarity into speaker groups. The Pipeline wires it all together.
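The clustering step can be sketched as plain average-linkage AHC over cosine similarity. This is a standalone illustration using the t = 0.4 threshold from the benchmarks below, not the polyvoice internals:

```rust
// Sketch of average-linkage AHC: greedily merge the two most similar
// clusters until no pair's average cosine similarity exceeds the threshold.

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn ahc(embeddings: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    // Each embedding starts in its own cluster.
    let mut clusters: Vec<Vec<usize>> = (0..embeddings.len()).map(|i| vec![i]).collect();
    loop {
        // Find the most similar pair of clusters (average linkage).
        let mut best: Option<(usize, usize, f32)> = None;
        for i in 0..clusters.len() {
            for j in (i + 1)..clusters.len() {
                let mut sim = 0.0;
                for &a in &clusters[i] {
                    for &b in &clusters[j] {
                        sim += cosine(&embeddings[a], &embeddings[b]);
                    }
                }
                sim /= (clusters[i].len() * clusters[j].len()) as f32;
                if best.map_or(true, |(_, _, s)| sim > s) {
                    best = Some((i, j, sim));
                }
            }
        }
        match best {
            Some((i, j, sim)) if sim >= threshold => {
                let merged = clusters.swap_remove(j); // j > i, so index i stays valid
                clusters[i].extend(merged);
            }
            _ => break, // no pair above threshold: done
        }
    }
    // Map each embedding index to its cluster (speaker) id.
    let mut labels = vec![0; embeddings.len()];
    for (id, members) in clusters.iter().enumerate() {
        for &m in members {
            labels[m] = id;
        }
    }
    labels
}

fn main() {
    // Two obvious speaker groups in 2-D (real embeddings are 256-d).
    let embs = vec![
        vec![1.0, 0.0],
        vec![0.9, 0.1],
        vec![0.0, 1.0],
        vec![0.1, 0.9],
    ];
    let labels = ahc(&embs, 0.4);
    assert_eq!(labels, vec![0, 0, 1, 1]);
}
```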
## Why not pyannote?

|  | polyvoice | pyannote |
|---|---|---|
| Language | Rust | Python |
| Runtime | ONNX Runtime | PyTorch |
| GIL-free | Yes | No |
| Binary size | ~30 MB (with models) | ~2 GB (torch + models) |
| Deploy | Single binary / C FFI | Python env + pip |
| Concurrent sessions | Lock-free session pool | Thread-limited |
| Streaming | `OnlineDiarizer` built-in | Third-party wrappers |
pyannote is the gold standard for accuracy. polyvoice trades some accuracy for deployment simplicity: no Python runtime, no GPU required, ~30 MB total.
## Accuracy (DER benchmarks)
Evaluated on AMI test set (Mix-Headset, 16 meetings, 9 hours), 0.25s collar:
| System | DER | Miss | FA | Confusion | Speed |
|---|---|---|---|---|---|
| polyvoice (AHC, t=0.4) | 27.5% | 17.7% | 2.2% | 7.6% | 7x RT (CPU) |
| pyannote 3.0 | ~18% | — | — | — | ~1x RT (GPU) |
| Simple i-vector + AHC | ~33% | — | — | — | — |
polyvoice outperforms traditional i-vector pipelines. The gap to pyannote comes from neural end-to-end training and overlap-aware resegmentation, which polyvoice doesn't do yet.
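DER is the sum of the three error components in the table. A quick sanity check on the reported numbers:

```rust
fn main() {
    // DER = Miss + False Alarm + Speaker Confusion (all as % of scored time).
    let (miss, fa, confusion) = (17.7_f64, 2.2, 7.6);
    let der = miss + fa + confusion;
    assert!((der - 27.5).abs() < 1e-9);
    println!("DER = {der:.1}%"); // prints "DER = 27.5%"
}
```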
```sh
# Reproduce benchmarks
```
## Features

- Pipeline API — `Pipeline::run()` for one-call diarization with VAD + embeddings + clustering.
- Online & Offline — `OnlineDiarizer` for real-time streaming, `OfflineDiarizer` for batch files.
- ONNX-powered — WeSpeaker and ECAPA-TDNN extractors with 80-bin log-mel fbank + CMVN.
- Lock-free session pool — `crossbeam-queue`-backed pool for concurrent ONNX inference.
- Silero VAD — integrated voice activity detection with stateful LSTM context.
- Overlap detection — find regions where multiple speakers talk simultaneously.
- Word alignment — assign speaker IDs to transcript words by timestamp.
- Python bindings — `pip install polyvoice`, 3-line API via PyO3/maturin.
- CLI — `polyvoice diarize meeting.wav` with text/json/rttm output.
- C FFI — drop-in `.so`/`.dylib`/`.dll` for Go, Node.js, C++ callers.
- Safety verified — Miri (memory), Loom (concurrency), cargo-fuzz (inputs), across Linux/macOS/Windows.
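The word-alignment feature can be sketched with a maximum-overlap rule: give each transcript word the speaker turn it overlaps most in time. The types and the `assign_speakers` helper below are illustrative, not the polyvoice API:

```rust
// Illustrative word-to-speaker alignment by timestamp overlap.

struct Turn { speaker: &'static str, start: f32, end: f32 }
struct Word { text: &'static str, start: f32, end: f32 }

fn assign_speakers(words: &[Word], turns: &[Turn]) -> Vec<(&'static str, &'static str)> {
    words.iter().map(|w| {
        // Pick the turn whose time span overlaps this word the most.
        // Assumes at least one turn exists.
        let ov = |t: &Turn| (w.end.min(t.end) - w.start.max(t.start)).max(0.0);
        let best = turns.iter()
            .max_by(|a, b| ov(a).partial_cmp(&ov(b)).unwrap())
            .unwrap();
        (w.text, best.speaker)
    }).collect()
}

fn main() {
    // Turns taken from the example output at the top of this README.
    let turns = [
        Turn { speaker: "SPEAKER_00", start: 0.10, end: 7.60 },
        Turn { speaker: "SPEAKER_01", start: 8.10, end: 14.10 },
    ];
    let words = [
        Word { text: "hello", start: 0.2, end: 0.6 },
        Word { text: "world", start: 8.3, end: 8.9 },
    ];
    let assigned = assign_speakers(&words, &turns);
    assert_eq!(assigned, vec![("hello", "SPEAKER_00"), ("world", "SPEAKER_01")]);
}
```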
## Configuration

```rust
use polyvoice::{DiarizationConfig, VadConfig};

// Field names are illustrative; check the crate docs for the exact set.
let config = DiarizationConfig {
    clustering_threshold: 0.4, // AHC cosine-similarity merge threshold
    ..Default::default()
};
let vad_config = VadConfig {
    speech_threshold: 0.5, // Silero VAD speech-probability cutoff
    ..Default::default()
};
```
## Streaming (real-time)

```rust
use polyvoice::{DiarizationConfig, OnlineDiarizer, SpeakerExtractor};

// Type names, constructor arguments, and the model path are illustrative
// where the original snippet elides them.
let config = DiarizationConfig::default();
let mut diarizer = OnlineDiarizer::new(config);
let extractor = SpeakerExtractor::new("models/wespeaker_resnet34.onnx").unwrap();

// In your audio callback:
let chunk = vec![0.0f32; 512]; // 16 kHz mono samples
let segments = diarizer.feed(&chunk, &extractor).unwrap();
for seg in segments {
    println!("{}: {:.2}s - {:.2}s", seg.speaker, seg.start, seg.end);
}
```
## Verification
| Check | Tool |
|---|---|
| Unsafe memory safety | Miri (nightly CI) |
| Concurrency correctness | Loom model-checking |
| Input fuzzing | cargo-fuzz (4 targets) |
| API stability | cargo-semver-checks |
| Cross-platform | Ubuntu, macOS, Windows CI |
| Dependency audit | cargo-audit |
## Roadmap
- WeSpeaker + ECAPA-TDNN ONNX extractors
- Silero VAD integration
- Agglomerative hierarchical clustering (AHC)
- Pipeline API (VAD + embeddings + AHC)
- C FFI bindings
- Miri / Loom / fuzz verification
- Cross-platform CI
- Python bindings (PyO3 / maturin)
- CLI tool (`polyvoice diarize` / `download-models`)
- DER benchmarks on AMI (27.5% on full test set, 0.25s collar)
- Spectral clustering backend
- PLDA scoring backend
## License
MIT