polyvoice
Speaker diarization for Rust: who spoke when, without Python.
Legacy pipeline: Silero VAD + WeSpeaker embeddings + AHC clustering.
New in v0.6.3: Hybrid pipeline (Powerset VAD + ResNet34 + AHC/K-means) for long-form multi-speaker audio (API-only).
New (unreleased): K-means auto-k clusterer (silhouette-based k selection), which beats AHC by 4.65 percentage points of DER on the full VoxConverse set (14.12% vs 18.77%).
Quick Start
[dependencies]
polyvoice = { version = "0.6", features = ["onnx"] }
Features
- One-call pipeline: `Pipeline::run()` wires VAD → embeddings → AHC or K-means clustering.
- Hybrid pipeline: `HybridPipeline` (v0.6.3, API-only) uses `PowersetSegmenter` as a superior, overlap-aware VAD plus global ResNet34 embedding clustering. Overcomes the 3-speaker limit of local segmentation models on long-form audio.
- Online & offline: `OnlineDiarizer` for streaming, `OfflineDiarizer` for batch.
- CPU-only, ~30 MB: ONNX Runtime, no GPU or Python runtime required.
- Multi-language: Rust library, Python bindings (`pip install polyvoice`), C FFI, CLI.
- Lock-free concurrency: `crossbeam-queue` session pool for parallel inference.
- Parallel embedder: `embed_batch` spreads chunks across CPU cores via `std::thread::scope`.
- AHC O(n²): agglomerative clustering rewritten from cubic to quadratic; handles >500 embeddings on long recordings.
- K-means auto-k: silhouette-based automatic k selection with single-speaker detection. 14.12% DER on VoxConverse full (vs AHC 18.77%).
- Hardened: Miri (memory), Loom (concurrency), cargo-fuzz (4 targets), model signing (Minisign).
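To make the "K-means auto-k" bullet concrete, here is a minimal, self-contained sketch of silhouette-based k selection with a single-speaker fallback. It is not polyvoice's implementation: the function names (`dist`, `silhouette`, `pick_k`), the Euclidean distance, and the `min_score` threshold are assumptions made for illustration, and the candidate labelings are expected to come from a separate K-means run for each k.

```rust
/// Euclidean distance between two embeddings of equal length.
fn dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

/// Mean silhouette score of one labeling (labels must lie in 0..k).
/// Returns None when fewer than two clusters are non-empty.
fn silhouette(points: &[Vec<f32>], labels: &[usize], k: usize) -> Option<f32> {
    let mut sizes = vec![0usize; k];
    for &l in labels {
        sizes[l] += 1;
    }
    if sizes.iter().filter(|&&s| s > 0).count() < 2 {
        return None;
    }
    let mut total = 0.0f32;
    for (i, p) in points.iter().enumerate() {
        let own = labels[i];
        if sizes[own] == 1 {
            continue; // a singleton contributes 0 by convention
        }
        // Summed distance from point i to the members of each cluster.
        let mut sums = vec![0.0f32; k];
        for (j, q) in points.iter().enumerate() {
            if i != j {
                sums[labels[j]] += dist(p, q);
            }
        }
        let a = sums[own] / (sizes[own] - 1) as f32; // mean intra-cluster distance
        let b = (0..k)
            .filter(|&c| c != own && sizes[c] > 0)
            .map(|c| sums[c] / sizes[c] as f32)
            .fold(f32::INFINITY, f32::min); // mean distance to the nearest other cluster
        let denom = a.max(b);
        if denom > 0.0 {
            total += (b - a) / denom;
        }
    }
    Some(total / points.len() as f32)
}

/// Pick the k whose labeling scores highest. Falls back to k = 1 when no
/// candidate exceeds `min_score` (a hypothetical single-speaker threshold).
fn pick_k(points: &[Vec<f32>], candidates: &[(usize, Vec<usize>)], min_score: f32) -> usize {
    let mut best = (1usize, min_score);
    for (k, labels) in candidates {
        if let Some(score) = silhouette(points, labels, *k) {
            if score > best.1 {
                best = (*k, score);
            }
        }
    }
    best.0
}
```

With two well-separated groups of embeddings, the two-cluster labeling scores close to 1 and is selected. When every candidate split scores below the threshold, the sketch reports a single speaker.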
Minimal Example (Legacy Pipeline — CLI / Python default)
use ;
use std::path::Path;
Hybrid Pipeline (API-only, v0.6.3)
The hybrid pipeline is available in Rust via the `pipeline_v2::hybrid` module. It uses `PowersetSegmenter` purely for speech-region detection (including overlaps), then extracts sliding-window ResNet34 embeddings and clusters them globally with AHC or K-means. Because speaker identity comes from the global clustering step rather than from the segmenter, this avoids the 3-speaker hard limit of the Powerset model.
use ModelRegistry;
use polyvoice::pipeline_v2::hybrid::HybridPipeline;
use PowersetSegmenter;
use ResNet34Adapter;
use KMeansClusterer;
use std::path::Path;
Note: The hybrid pipeline is currently API-only. The CLI (`polyvoice diarize`) and Python bindings continue to use the legacy pipeline for stability.
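The sliding-window step described in this section can be sketched as follows. This is an illustration only, not code from polyvoice: the function name `sliding_windows`, the window and hop parameters, and the tail-handling rule are assumptions, since the README does not state the values or policy the library uses.

```rust
/// Cut each speech region (start, end), in seconds, into fixed-length windows.
/// A region no longer than `win` yields a single window covering it. If the
/// regular grid stops short of the region end, one extra window is aligned to
/// the end so that no speech is dropped.
fn sliding_windows(regions: &[(f64, f64)], win: f64, hop: f64) -> Vec<(f64, f64)> {
    assert!(win > 0.0 && hop > 0.0, "window and hop must be positive");
    let mut out = Vec::new();
    for &(start, end) in regions {
        if end <= start {
            continue; // skip empty or inverted regions
        }
        if end - start <= win {
            out.push((start, end));
            continue;
        }
        // Index-based stepping avoids accumulating floating-point error.
        let steps = ((end - start - win) / hop).floor() as usize;
        for i in 0..=steps {
            let s = start + i as f64 * hop;
            out.push((s, s + win));
        }
        let last_end = start + steps as f64 * hop + win;
        if last_end < end {
            out.push((end - win, end));
        }
    }
    out
}
```

Each resulting window would then be passed to the embedding model, and all embeddings from the recording are clustered together. The speaker count is decided by that clustering step, not by the segmenter.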
Python / C FFI
Python bindings use the legacy pipeline (stable default):
=
=
// cargo build --features ffi
// See include/polyvoice.h and examples/ffi_usage.c
;
;
Benchmarks
| Pipeline | Dataset | DER | Speed |
|---|---|---|---|
| Legacy (Silero + ResNet34 + AHC) | VoxConverse (232 files) | ~14% | 10x RT (CPU) |
| Legacy (Silero + ResNet34 + AHC) | AMI (16 meetings) | ~36% | 7x RT (CPU) |
| Hybrid (Powerset VAD + ResNet34 + AHC) | e2e smoke (26 s clip) | 4.43% | — |
| Hybrid (Powerset VAD + ResNet34 + AHC) | VoxConverse (3-file subset) | 8.27% | — |
| Hybrid (Powerset VAD + ResNet34 + AHC) | VoxConverse (10-file subset) | 16.62% | — |
| Hybrid (Powerset VAD + ResNet34 + K-means) | VoxConverse (10-file subset) | 13.48% | — |
| Hybrid (Powerset VAD + ResNet34 + K-means) | VoxConverse (full 232 files) | 14.12% | — |
~80% of pyannote's accuracy at 10× the speed on CPU — no GPU, no Python.
Note on long-form audio: The 10-file VoxConverse subset includes one known outlier (`aorju`: 23 min, 12 speakers, 17% overlap → DER 52.51%). Excluding this file, the average DER drops to ~10.5%. The hybrid pipeline is API-only and optimized for typical conference/meeting recordings; extreme multi-speaker long-form with heavy overlap remains an active research area (VBx/PLDA).
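For readers unfamiliar with the metric, DER (diarization error rate) is the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech. The sketch below computes it at frame level under simplifying assumptions that are not taken from this project's scorer: one speaker per frame (no overlap), no forgiveness collar, and hypothesis labels already mapped onto reference speaker IDs. The function name `der` is hypothetical.

```rust
/// Frame-level diarization error rate.
/// `None` marks a non-speech frame and `Some(id)` a frame attributed to speaker `id`.
/// Returns None if the reference contains no speech.
fn der(reference: &[Option<u32>], hypothesis: &[Option<u32>]) -> Option<f64> {
    assert_eq!(reference.len(), hypothesis.len(), "frame counts must match");
    let (mut missed, mut false_alarm, mut confusion, mut speech) = (0u32, 0u32, 0u32, 0u32);
    for (r, h) in reference.iter().zip(hypothesis) {
        if r.is_some() {
            speech += 1;
        }
        match (r, h) {
            (Some(_), None) => missed += 1,            // speech labelled as silence
            (None, Some(_)) => false_alarm += 1,       // silence labelled as speech
            (Some(a), Some(b)) if a != b => confusion += 1, // wrong speaker
            _ => {}
        }
    }
    if speech == 0 {
        None
    } else {
        Some((missed + false_alarm + confusion) as f64 / speech as f64)
    }
}
```

Because the denominator is reference speech only, false alarms can push DER above 100%. It also means a single long, overlap-heavy file can dominate an average over a small subset.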
License
MIT