polyvoice
Speaker diarization for Rust — who spoke when, without Python.
Beta-quality speaker diarization that runs on CPU and fits in ~30 MB, with automatic K-means speaker count detection. See PRODUCTION-READINESS.md for deployment guidance (GO for desktop and controlled internal use; NO-GO for public multi-tenant APIs).
Speaker_0: 0.0s - 12.3s
Speaker_1: 14.1s - 28.7s
Speaker_0: 31.2s - 45.0s
At a glance
| | polyvoice | pyannote 3.1 | whisperX |
|---|---|---|---|
| VoxConverse DER¹ | 13.83% | ~12% | ~15% |
| Model size | ~30 MB | ~100 MB | ~1 GB |
| Runtime | CPU only | GPU recommended | GPU required |
| Dependencies | No Python / PyTorch² | PyTorch + ONNX | PyTorch + faster-whisper |
| Languages | Rust / Python / C / CLI | Python only | Python only |
| Streaming | Yes | No | No |
Roughly 80% of pyannote's accuracy at about one-tenth the RAM, with no GPU. Runs at ~10× realtime on CPU: 9.3× average over a VoxConverse subset (artifact).
Other Rust diarizers. sherpa-rs (now archived), pyannote-rs, and speakrs are the closest Rust options. None publishes a collar-matched VoxConverse DER, so this table compares only the established Python systems; see Why polyvoice for the maintained / pure-Rust / streaming / four-binding differentiators.
¹ Legacy pipeline, VoxConverse-test (232 files), 0.25 s collar. The 232-file no-collar
figure was not measured, but on a 10-file subset no-collar DER is 25.99% vs 17.43% at
0.25 s collar — expect the strict number several points higher. Competitor figures use their
own conventions and are not collar-matched — compare only on a matched collar. All
polyvoice DER figures are sourced from tests/der_baseline.json;
see the canonical table below.
² The C++ ONNX Runtime is downloaded at build time via the ort crate
(download-binaries); for hermetic builds use a static-linked / vendored ORT (see
PRODUCTION-READINESS.md §2). No Python/PyTorch runtime.
Install
# Rust
# Python
# CLI
Quick start — Rust
Note: the CLI and Python bindings default to the validated legacy pipeline. The builder below is the curated v2 API; for long-form meetings see PRODUCTION-READINESS.md.
use ModelRegistry;
use Pipeline;
use ;
Quick start — Python
=
=
Quick start — CLI
# Download models once
# Diarize
Benchmarks
All figures below are sourced from tests/der_baseline.json
(schema polyvoice-der-baseline-v2) and labeled with pipeline, dataset, file count, and
collar. CI-gated marks rows enforced by the release DER-regression gate.
| Pipeline | Dataset | Files | DER (0.25 s collar) | DER (no-collar) | CI-gated |
|---|---|---|---|---|---|
| Legacy (Silero + AHC) | VoxConverse-test | 232 | 13.83% | not measured | no |
| Legacy (Silero + AHC) | VoxConverse-test subset | 10 | 17.43% | 25.99% | yes |
| Legacy (Silero + AHC) | e2e smoke (26 s clip) | 1 | 6.62% | not measured | yes |
| Legacy (Silero + AHC) | AMI EN2002a (1 meeting) | 1 | 36.30% | 44.73% | yes |
| v2 (Powerset + ResNet34 + AHC) | e2e smoke (26 s clip) | 1 | 4.43% | not measured | yes |
| Hybrid (Powerset + ResNet34 + AHC) | e2e smoke (26 s clip) | 1 | 4.43% | not measured | no |
| Hybrid (Powerset + ResNet34 + AHC) | VoxConverse-test subset | 3 | 8.27% | not measured | no |
| Hybrid (Powerset + ResNet34 + AHC) | VoxConverse-test subset | 10 | 15.03% | not measured | no |
| Hybrid (Powerset + ResNet34 + AHC) | AMI EN2002a (1 meeting) | 1 | 24.95% | not measured | no |
Notes:
- No-collar DER is materially higher than the 0.25 s-collar figure (e.g. the 10-file legacy subset is 17.43% collar vs 25.99% no-collar). Compare against other systems only on a matched collar.
- The previously headlined "14.12% (232-file, Hybrid + K-means)" number had no committed artifact and was withdrawn pending a reproducible, provenance-stamped re-run.
- AMI rows are a single meeting (EN2002a, ~79% overlap), not a multi-meeting average.
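
To make the collar effect in the notes above concrete, here is a minimal frame-based DER sketch. It is not the scorer behind tests/der_baseline.json. It assumes non-overlapping segments, reference and hypothesis labels that already share one ID space (real scorers search for the best speaker mapping), and a collar defined as a half-width on each side of every reference boundary. Collar conventions differ between toolkits, which is one reason to compare only on a matched collar. `Seg`, `speaker_at`, and `der` are hypothetical names.

```rust
/// A labeled segment in seconds. Hypothetical type, not polyvoice's API.
#[derive(Clone, Copy)]
pub struct Seg {
    pub start: f64,
    pub end: f64,
    pub speaker: u32,
}

/// Speaker active at time `t`, if any (assumes segments do not overlap).
fn speaker_at(segs: &[Seg], t: f64) -> Option<u32> {
    segs.iter()
        .find(|s| t >= s.start && t < s.end)
        .map(|s| s.speaker)
}

/// Frame-based DER as a fraction: (miss + false alarm + confusion) / reference speech.
/// Frames whose centre lies within `collar` seconds of a reference boundary are skipped.
pub fn der(reference: &[Seg], hypothesis: &[Seg], collar: f64, step: f64) -> f64 {
    assert!(step > 0.0, "step must be positive");
    let end = reference
        .iter()
        .chain(hypothesis)
        .map(|s| s.end)
        .fold(0.0, f64::max);
    let frames = (end / step).ceil() as usize;
    let (mut errors, mut ref_frames) = (0usize, 0usize);
    for i in 0..frames {
        let t = (i as f64 + 0.5) * step; // frame centre
        let in_collar = reference
            .iter()
            .any(|s| (t - s.start).abs() < collar || (t - s.end).abs() < collar);
        if in_collar {
            continue; // not scored
        }
        let r = speaker_at(reference, t);
        let h = speaker_at(hypothesis, t);
        if r.is_some() {
            ref_frames += 1;
        }
        if r != h {
            errors += 1; // miss, false alarm, or speaker confusion
        }
    }
    if ref_frames == 0 {
        0.0
    } else {
        errors as f64 / ref_frames as f64
    }
}
```

For a 20 s two-speaker reference whose single speaker change is hypothesized 0.2 s late, this sketch reports 1% DER with no collar and 0% with a 0.25 s collar: the collar removes exactly the frames where boundary errors live.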
Automatic speaker count uses silhouette-based k selection with a single-speaker guard (no 20-speaker predictions on 1-speaker files).
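
A rough sketch of what silhouette-based k selection with a single-speaker guard could look like. This is an illustration under assumptions, not polyvoice's implementation: the guard here is a simple threshold on the largest pairwise cosine distance, the caller supplies one candidate labeling per k (from K-means or any other clusterer), and labels are assumed to be contiguous from 0. All function names and the `guard` parameter are hypothetical.

```rust
/// Cosine distance 1 - cos(a, b); assumes non-zero vectors of equal length.
pub fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}

/// Mean silhouette coefficient of a labeling; singleton clusters contribute 0.
pub fn silhouette(points: &[Vec<f32>], labels: &[usize]) -> f32 {
    assert_eq!(points.len(), labels.len());
    let n = points.len();
    if n == 0 {
        return 0.0;
    }
    let k = labels.iter().copied().max().map_or(0, |m| m + 1);
    let mut total = 0.0f32;
    for i in 0..n {
        // Distance sums and member counts per cluster, excluding point i itself.
        let mut sum = vec![0.0f32; k];
        let mut count = vec![0usize; k];
        for j in 0..n {
            if i != j {
                sum[labels[j]] += cosine_distance(&points[i], &points[j]);
                count[labels[j]] += 1;
            }
        }
        let own = labels[i];
        if count[own] == 0 {
            continue; // singleton cluster
        }
        let a = sum[own] / count[own] as f32; // mean intra-cluster distance
        let b = (0..k)
            .filter(|&c| c != own && count[c] > 0)
            .map(|c| sum[c] / count[c] as f32)
            .fold(f32::INFINITY, f32::min); // nearest other cluster
        let denom = a.max(b);
        if b.is_finite() && denom > 0.0 {
            total += (b - a) / denom;
        }
    }
    total / n as f32
}

/// Pick the speaker count: 1 if all embeddings are closer than `guard`,
/// otherwise the k of the candidate labeling with the highest silhouette.
pub fn choose_k(points: &[Vec<f32>], candidates: &[Vec<usize>], guard: f32) -> usize {
    let mut max_dist = 0.0f32;
    for i in 0..points.len() {
        for j in (i + 1)..points.len() {
            max_dist = max_dist.max(cosine_distance(&points[i], &points[j]));
        }
    }
    if max_dist < guard {
        return 1; // single-speaker guard
    }
    let mut best = (1usize, f32::NEG_INFINITY);
    for labels in candidates {
        let k = labels.iter().copied().max().map_or(0, |m| m + 1);
        let score = silhouette(points, labels);
        if score > best.1 {
            best = (k, score);
        }
    }
    best.0
}
```

The guard matters because the silhouette is undefined for k = 1, so a pure silhouette search can never answer "one speaker" on its own.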
What makes it different
- Automatic speaker count — K-means auto-k detects how many speakers are in the recording, matching well-tuned AHC without any manual threshold sweep.
- Single-speaker guardrail — embeddings too similar? Returns 1 speaker instead of hallucinating clusters.
- Overlap-aware — PowersetSegmenter detects overlapping speech regions; embeddings are masked to exclude overlaps before clustering.
- Streaming & batch — OnlineDiarizer for real-time, OfflineDiarizer for files.
- Cross-platform — Linux, macOS, Windows; x86_64 and aarch64.
- Hardened — Miri (memory safety), Loom (concurrency), cargo-fuzz (4 targets), model signing (Minisign).
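
The overlap-aware step in the list above can be sketched as follows. This README does not specify how polyvoice applies the mask, so this assumes the simplest variant: keep only frames where the target speaker is the sole active speaker, then mean-pool frame-level features over the surviving frames. `non_overlap_mask` and `masked_mean` are hypothetical helpers, not part of the crate's API.

```rust
/// `activity[f][s]` is true when local speaker `s` is active in frame `f`.
/// Returns true for frames where `speaker` is active and nobody else is.
pub fn non_overlap_mask(activity: &[Vec<bool>], speaker: usize) -> Vec<bool> {
    activity
        .iter()
        .map(|frame| {
            let active = frame.iter().filter(|&&a| a).count();
            active == 1 && frame.get(speaker).copied().unwrap_or(false)
        })
        .collect()
}

/// Mean of the frame features selected by `mask`; None if no frame survives,
/// in which case a caller might fall back to the unmasked frames.
pub fn masked_mean(features: &[Vec<f32>], mask: &[bool]) -> Option<Vec<f32>> {
    let mut acc: Vec<f32> = Vec::new();
    let mut kept = 0usize;
    for (feat, &keep) in features.iter().zip(mask) {
        if !keep {
            continue;
        }
        if acc.is_empty() {
            acc = vec![0.0; feat.len()];
        }
        for (a, x) in acc.iter_mut().zip(feat) {
            *a += x;
        }
        kept += 1;
    }
    if kept == 0 {
        return None;
    }
    for a in &mut acc {
        *a /= kept as f32;
    }
    Some(acc)
}
```

Excluding overlapped frames keeps a second voice from contaminating the embedding, at the cost of fewer frames per speaker in heavily overlapped audio.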
Why polyvoice
- Maintained, pure-Rust, streaming-capable. The popular sherpa-rs bindings are now archived; polyvoice is an actively-maintained, pure-Rust diarization path (ONNX via ort, no C++ toolkit) with first-class streaming.
- One library, four surfaces. Rust + Python (maturin) + C FFI + CLI from a single crate — most Rust diarizers are Rust-only.
- CPU-first, ~30 MB, MIT. No GPU, no Python runtime, no gated model access.
Honest scope: polyvoice is not the accuracy leader. Like-for-like no-collar VoxConverse DER is in the mid-20s (%), versus ~11% for pyannote community-1 / speakrs. It trades a few DER points for deployability and a maintained, multi-binding SDK. See Benchmarks for the labeled, collar-disclosed numbers.
Brand note (open maintainer decision): the name collides with ByteDance's "PolyVoice" speech-to-speech-translation research, so always refer to this project as "polyvoice — speaker diarization for Rust". Registering a polyvoice-rs alias is an open decision; a crate rename would break downstreams and is out of scope.
Architecture
┌─────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Audio Bytes │ --> │ Embedding       │ --> │ Speaker Cluster │ --> Turns
│ (f32 PCM)   │     │ Extractor       │     │ (AHC or K-means)│
└─────────────┘     └─────────────────┘     └─────────────────┘
       │                     │                       │
       v                     v                       v
  Powerset VAD       WeSpeaker ResNet34      Silhouette auto-k
  (10s windows,      (2s windows, 256-dim)   (pairwise cosine
   1s hop)                                    distance cache)
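
Two pieces of the diagram lend themselves to a short sketch: the sliding windows (10 s with a 1 s hop for segmentation, 2 s for embeddings) and the pairwise distance cache used by auto-k. The code below is an illustration under assumptions rather than the crate's internals: window lengths are given in samples (for example 160 000 samples for 10 s at an assumed 16 kHz), the last window is shortened instead of padded, and the cache stores the upper triangle in condensed form. `sliding_windows` and `DistanceCache` are hypothetical names.

```rust
/// Half-open sample ranges [start, end) of windows of length `window`
/// advanced by `hop`. The final window is shortened to fit the audio.
pub fn sliding_windows(total: usize, window: usize, hop: usize) -> Vec<(usize, usize)> {
    assert!(window > 0 && hop > 0, "window and hop must be positive");
    let mut out = Vec::new();
    let mut start = 0;
    while start < total {
        let end = (start + window).min(total);
        out.push((start, end));
        if end == total {
            break;
        }
        start += hop;
    }
    out
}

/// Symmetric pairwise distances for `n` items, stored once per unordered pair.
pub struct DistanceCache {
    n: usize,
    data: Vec<f32>,
}

impl DistanceCache {
    /// Evaluates `dist(i, j)` once for every pair with i < j.
    pub fn build(n: usize, mut dist: impl FnMut(usize, usize) -> f32) -> Self {
        let mut data = Vec::with_capacity(n * n.saturating_sub(1) / 2);
        for i in 0..n {
            for j in (i + 1)..n {
                data.push(dist(i, j));
            }
        }
        Self { n, data }
    }

    /// Cached distance between items `i` and `j` (0 on the diagonal).
    pub fn get(&self, i: usize, j: usize) -> f32 {
        if i == j {
            return 0.0;
        }
        let (i, j) = if i < j { (i, j) } else { (j, i) };
        // Start of row i in the condensed upper triangle, plus the column offset.
        let idx = i * (2 * self.n - i - 1) / 2 + (j - i - 1);
        self.data[idx]
    }
}
```

Caching pays off because silhouette scoring for several candidate k values reads the same n(n-1)/2 distances repeatedly; with 256-dimensional embeddings, recomputing each cosine distance would dominate the cost.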
License
MIT