polyvoice
Speaker diarization for Rust — who spoke when, without Python.
Production-ready speaker diarization that runs on CPU, fits in 30 MB, and uses K-means with automatic speaker-count detection, which outperforms AHC clustering.
Speaker_0: 0.0s - 12.3s
Speaker_1: 14.1s - 28.7s
Speaker_0: 31.2s - 45.0s
At a glance
| | polyvoice | pyannote 3.1 | whisperX |
|---|---|---|---|
| VoxConverse DER | 14.12% | ~12% | ~15% |
| Model size | ~30 MB | ~100 MB | ~1 GB |
| Runtime | CPU only | GPU recommended | GPU required |
| Dependencies | Zero (ONNX) | PyTorch + ONNX | PyTorch + faster-whisper |
| Languages | Rust / Python / C / CLI | Python only | Python only |
| Streaming | Yes | No | No |
About 2 points of DER behind pyannote 3.1 on VoxConverse (14.12% vs. ~12%), at 10× less RAM and no GPU.
Install
# Rust
# Python
# CLI
Quick start — Rust
use ModelRegistry;
use HybridPipeline;
use PowersetSegmenter;
use ResNet34Adapter;
use KMeansClusterer;
use SampleRate;
Quick start — Python
=
=
Quick start — CLI
# Download models once
# Diarize
Benchmarks
| Pipeline | Dataset | Files | DER | Notes |
|---|---|---|---|---|
| Hybrid + K-means | VoxConverse-test | 232 | 14.12% | Auto-k, no threshold tuning |
| Hybrid + AHC | VoxConverse-test | 232 | 18.77% | Manual threshold 0.40 |
| Legacy (Silero + AHC) | VoxConverse-test | 232 | ~14% | Baseline pipeline |
| Hybrid + K-means | VoxConverse-test | 10 | 13.48% | Subset |
| Hybrid + AHC | VoxConverse-test | 10 | 15.03% | Subset |
| Hybrid + K-means | e2e smoke | 1 | 4.43% | 26 s clip |
K-means auto-k selects k by silhouette score and adds single-speaker detection, so a 1-speaker file no longer yields 20-speaker predictions. On the full VoxConverse benchmark it beats AHC by 4.65 percentage points of DER (14.12% vs. 18.77%) without any manual threshold tuning.
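To make the idea concrete, here is a minimal sketch of silhouette-based k selection with a single-speaker check. It is not polyvoice's implementation: the function names, the `single_speaker_max_dist` threshold, and the choice to pass in precomputed candidate labelings (one per k, e.g. from separate K-means runs) are all assumptions made for illustration.

```rust
/// Cosine distance between two embeddings: 1 - cos(a, b).
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        return 1.0;
    }
    1.0 - dot / (na * nb)
}

/// Mean silhouette coefficient of a labeling with `k` clusters.
/// Points in singleton clusters contribute 0, as is conventional.
fn silhouette(points: &[Vec<f32>], labels: &[usize], k: usize) -> f32 {
    let n = points.len();
    if n == 0 {
        return 0.0;
    }
    let mut total = 0.0f32;
    for i in 0..n {
        // Distance sums and member counts per cluster, excluding point i itself.
        let mut sums = vec![0.0f32; k];
        let mut counts = vec![0usize; k];
        for j in 0..n {
            if i != j {
                sums[labels[j]] += cosine_distance(&points[i], &points[j]);
                counts[labels[j]] += 1;
            }
        }
        let own = labels[i];
        if counts[own] == 0 {
            continue;
        }
        // a: mean distance to own cluster; b: mean distance to nearest other cluster.
        let a = sums[own] / counts[own] as f32;
        let b = (0..k)
            .filter(|&c| c != own && counts[c] > 0)
            .map(|c| sums[c] / counts[c] as f32)
            .fold(f32::INFINITY, f32::min);
        let m = a.max(b);
        if b.is_finite() && m > 0.0 {
            total += (b - a) / m;
        }
    }
    total / n as f32
}

/// Picks k from candidate labelings (hypothetical helper).
/// Returns 1 when every pair of embeddings is closer than
/// `single_speaker_max_dist` (an assumed, illustrative threshold).
fn select_k(points: &[Vec<f32>], candidates: &[Vec<usize>], single_speaker_max_dist: f32) -> usize {
    let mut max_d = 0.0f32;
    for i in 0..points.len() {
        for j in (i + 1)..points.len() {
            max_d = max_d.max(cosine_distance(&points[i], &points[j]));
        }
    }
    if max_d < single_speaker_max_dist {
        return 1;
    }
    let mut best_k = 1;
    let mut best_score = f32::NEG_INFINITY;
    for labels in candidates {
        let k = labels.iter().max().map_or(0, |m| m + 1);
        if k < 2 {
            continue;
        }
        let score = silhouette(points, labels, k);
        if score > best_score {
            best_score = score;
            best_k = k;
        }
    }
    best_k
}
```

The single-speaker check runs first because the silhouette score is undefined for k = 1, so a purely silhouette-driven search would never return a single speaker on its own.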
What makes it different
- Automatic speaker count — K-means auto-k detects how many speakers are in the recording. No more guessing thresholds.
- Single-speaker guardrail — embeddings too similar? Returns 1 speaker instead of hallucinating clusters.
- Overlap-aware — PowersetSegmenter detects overlapping speech regions; embeddings are masked to exclude overlaps before clustering.
- Streaming & batch — OnlineDiarizer for real-time, OfflineDiarizer for files.
- Cross-platform — Linux, macOS, Windows; x86_64 and aarch64.
- Hardened — Miri (memory safety), Loom (concurrency), cargo-fuzz (4 targets), model signing (Minisign).
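The overlap-aware bullet above can be illustrated with a small sketch of masked pooling. This is an assumption about how such masking might work, not polyvoice's actual code: `single_speaker_mask`, `masked_mean`, the per-frame speaker counts, and the `min_frames` cutoff are hypothetical names and parameters.

```rust
/// Marks frames where exactly one speaker is active.
/// `active_speakers[i]` is the number of speakers the segmenter
/// reports for frame i (hypothetical input format).
fn single_speaker_mask(active_speakers: &[u8]) -> Vec<bool> {
    active_speakers.iter().map(|&n| n == 1).collect()
}

/// Mean-pools frame-level features over the kept frames only.
/// Returns None when fewer than `min_frames` frames survive the mask,
/// so the caller can skip that window instead of clustering a noisy embedding.
fn masked_mean(frames: &[Vec<f32>], mask: &[bool], min_frames: usize) -> Option<Vec<f32>> {
    let dim = frames.first()?.len();
    let mut sum = vec![0.0f32; dim];
    let mut kept = 0usize;
    for (frame, &keep) in frames.iter().zip(mask) {
        if keep {
            for (s, x) in sum.iter_mut().zip(frame) {
                *s += *x;
            }
            kept += 1;
        }
    }
    if kept == 0 || kept < min_frames {
        return None;
    }
    Some(sum.into_iter().map(|s| s / kept as f32).collect())
}
```

Dropping overlapped frames before pooling keeps a two-speaker mixture from landing between clusters and being counted as an extra speaker.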
Architecture
┌─────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Audio Bytes │ --> │ Embedding       │ --> │ Speaker Cluster │ --> Turns
│ (f32 PCM)   │     │ Extractor       │     │ (AHC or K-means)│
└─────────────┘     └─────────────────┘     └─────────────────┘
       │                     │                       │
       v                     v                       v
 Powerset VAD       WeSpeaker ResNet34       Silhouette auto-k
 (10s windows,      (2s windows, 256-dim)    (pairwise cosine
  1s hop)                                     distance cache)
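The window and hop sizes in the diagram translate into sample ranges roughly as follows. This is a sketch only: `sliding_windows` is a hypothetical helper, and the choice to keep a shorter final window (rather than pad or drop it) is an assumption, not a description of what polyvoice does.

```rust
/// Splits `num_samples` of audio into [start, end) sample ranges of
/// `window_secs` length, advancing by `hop_secs`. The final window is
/// clipped to the end of the audio, so it may be shorter than the rest.
fn sliding_windows(
    num_samples: usize,
    sample_rate: usize,
    window_secs: f32,
    hop_secs: f32,
) -> Vec<(usize, usize)> {
    let win = (window_secs * sample_rate as f32).round() as usize;
    let hop = (hop_secs * sample_rate as f32).round() as usize;
    let mut out = Vec::new();
    if win == 0 || hop == 0 || num_samples == 0 {
        return out;
    }
    let mut start = 0;
    loop {
        let end = (start + win).min(num_samples);
        out.push((start, end));
        if end == num_samples {
            break;
        }
        start += hop;
    }
    out
}
```

For a 26 s clip at an assumed 16 kHz sample rate, `sliding_windows(26 * 16_000, 16_000, 10.0, 1.0)` yields 17 windows, the last covering samples 256,000 to 416,000.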
License
MIT