# polyvoice
[CI](https://github.com/ekhodzitsky/polyvoice/actions/workflows/ci.yml)
[crates.io](https://crates.io/crates/polyvoice)
[PyPI](https://pypi.org/project/polyvoice)
[docs.rs](https://docs.rs/polyvoice)
[License](LICENSE)
> Speaker diarization for Rust — who spoke when, without Python.
> Legacy pipeline: Silero VAD + WeSpeaker embeddings + AHC clustering.
> **New in v0.6.3**: Hybrid pipeline (Powerset VAD + ResNet34 + AHC/K-means) for long-form multi-speaker audio — API-only.
> **Unreleased**: K-means auto-k clusterer (silhouette-based k selection) lowers DER by 4.65 percentage points relative to AHC on VoxConverse (14.12% vs 18.77%).
## Quick Start
```toml
[dependencies]
polyvoice = { version = "0.6", features = ["onnx"] }
```
```bash
cargo add polyvoice --features onnx
```
## Features
- **One-call pipeline** — `Pipeline::run()` wires VAD → embeddings → AHC or K-means clustering.
- **Hybrid pipeline** — `HybridPipeline` (v0.6.3, API-only) uses PowersetSegmenter as a superior VAD (overlap-aware) + global ResNet34 embedding clustering. Overcomes the 3-speaker limit of local segmentation models on long-form audio.
- **Online & offline** — `OnlineDiarizer` for streaming, `OfflineDiarizer` for batch.
- **CPU-only, ~30 MB** — ONNX Runtime, no GPU or Python runtime required.
- **Multi-language** — Rust library, Python bindings (`pip install polyvoice`), C FFI, CLI.
- **Lock-free concurrency** — `crossbeam-queue` session pool for parallel inference.
- **Parallel embedder** — `embed_batch` spreads chunks across CPU cores via `std::thread::scope`.
- **AHC O(n²)** — agglomerative clustering rewritten from cubic to quadratic; handles >500 embeddings on long recordings.
- **K-means auto-k** — silhouette-based automatic k selection with single-speaker detection. 14.12% DER on VoxConverse full (vs AHC 18.77%).
- **Hardened** — Miri (memory), Loom (concurrency), cargo-fuzz (4 targets), model signing (Minisign).
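
The silhouette-based k selection mentioned in the K-means auto-k bullet can be sketched in plain Rust. This is a minimal illustration, not polyvoice's implementation: the function names (`dist`, `nearest`, `kmeans`, `mean_silhouette`, `select_k`), the Euclidean distance, the farthest-point seeding, and the `min_score` threshold used for single-speaker detection are all assumptions made for the example.

```rust
/// Euclidean distance between two equal-length vectors.
fn dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

/// Index of, and distance to, the centroid closest to `p`.
fn nearest(p: &[f32], centroids: &[Vec<f32>]) -> (usize, f32) {
    let mut best = (0, f32::INFINITY);
    for (c, centroid) in centroids.iter().enumerate() {
        let d = dist(p, centroid);
        if d < best.1 {
            best = (c, d);
        }
    }
    best
}

/// Lloyd's k-means with deterministic farthest-point seeding; one label per point.
fn kmeans(points: &[Vec<f32>], k: usize, iters: usize) -> Vec<usize> {
    // Seed with the first point, then keep adding the point farthest from all seeds.
    let mut centroids = vec![points[0].clone()];
    while centroids.len() < k {
        let mut far = (0, -1.0f32);
        for (i, p) in points.iter().enumerate() {
            let d = nearest(p, &centroids).1;
            if d > far.1 {
                far = (i, d);
            }
        }
        centroids.push(points[far.0].clone());
    }
    let mut labels = vec![0; points.len()];
    for _ in 0..iters {
        // Assignment step: attach every point to its closest centroid.
        for (i, p) in points.iter().enumerate() {
            labels[i] = nearest(p, &centroids).0;
        }
        // Update step: move each non-empty centroid to the mean of its members.
        for (c, centroid) in centroids.iter_mut().enumerate() {
            let members: Vec<&Vec<f32>> =
                (0..points.len()).filter(|&i| labels[i] == c).map(|i| &points[i]).collect();
            if members.is_empty() {
                continue;
            }
            for (d, value) in centroid.iter_mut().enumerate() {
                *value = members.iter().map(|m| m[d]).sum::<f32>() / members.len() as f32;
            }
        }
    }
    labels
}

/// Mean silhouette over all points; singleton clusters contribute 0.
fn mean_silhouette(points: &[Vec<f32>], labels: &[usize], k: usize) -> f32 {
    let n = points.len();
    let mut total = 0.0;
    for i in 0..n {
        // Per-cluster distance sums and member counts, excluding point i itself.
        let (mut sum, mut cnt) = (vec![0.0f32; k], vec![0usize; k]);
        for j in (0..n).filter(|&j| j != i) {
            sum[labels[j]] += dist(&points[i], &points[j]);
            cnt[labels[j]] += 1;
        }
        let own = labels[i];
        if cnt[own] == 0 {
            continue;
        }
        let a = sum[own] / cnt[own] as f32; // cohesion: mean distance within own cluster
        let b = (0..k)
            .filter(|&c| c != own && cnt[c] > 0)
            .map(|c| sum[c] / cnt[c] as f32)
            .fold(f32::INFINITY, f32::min); // separation: nearest other cluster
        if b.is_finite() && a.max(b) > 0.0 {
            total += (b - a) / a.max(b);
        }
    }
    total / n as f32
}

/// Pick the k in 2..=max_k with the best silhouette; fall back to k = 1
/// (single speaker) when no candidate scores above `min_score`.
fn select_k(points: &[Vec<f32>], max_k: usize, min_score: f32) -> usize {
    let mut best = (1, min_score);
    for k in 2..=max_k.min(points.len().saturating_sub(1)) {
        let score = mean_silhouette(points, &kmeans(points, k, 20), k);
        if score > best.1 {
            best = (k, score);
        }
    }
    best.0
}
```

Calling `select_k(&embeddings, 5, 0.5)` on three well-separated groups of vectors returns 3, while a single tight group falls below the threshold and yields 1. Real speaker embeddings are usually compared with cosine distance, so a production clusterer would likely swap `dist` accordingly.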
## Minimal Example (Legacy Pipeline — CLI / Python default)
```rust,no_run
use polyvoice::{Pipeline, DiarizationConfig, VadConfig, FbankOnnxExtractor, SileroVad};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the speaker-embedding extractor and the voice-activity detector.
    let ext = FbankOnnxExtractor::new(Path::new("models/wespeaker_resnet34.onnx"), 256, 4)?;
    let mut vad = SileroVad::new(Path::new("models/silero_vad.onnx"), 512)?;

    // Read the audio samples; the returned sample rate is not used here.
    let (samples, _sr) = polyvoice::wav::read_wav(Path::new("meeting.wav"))?;

    // Run VAD, embedding extraction, and clustering in one call.
    let result = Pipeline::new(DiarizationConfig::default(), VadConfig::default())
        .run(&samples, &ext, &mut vad)?;

    for turn in &result.turns {
        println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.time.start, turn.time.end);
    }
    Ok(())
}
```
## Hybrid Pipeline (API-only, v0.6.3)
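The hybrid pipeline is available in Rust via the `pipeline_v2::hybrid` module. It uses `PowersetSegmenter` purely for speech-region detection (including overlaps), then extracts sliding-window ResNet34 embeddings and clusters them globally with AHC or K-means (the example below uses `KMeansClusterer`). Because speaker identities come from the global clustering rather than from the Powerset model's local output, the pipeline is not bound by that model's hard limit of 3 speakers.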
```rust,no_run
use polyvoice::models::ModelRegistry;
use polyvoice::pipeline_v2::hybrid::HybridPipeline;
use polyvoice::segmentation::PowersetSegmenter;
use polyvoice::embedder::ResNet34Adapter;
use polyvoice::clusterer::KMeansClusterer;
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Resolve the model files for the Balanced profile.
    let registry = ModelRegistry::default()?;
    let models = registry.ensure_for_profile(polyvoice::types::Profile::Balanced)?;

    // Build the three stages: segmentation (used as VAD), embedding, clustering.
    let segmenter = PowersetSegmenter::new(&models.segmenter_path)?;
    let embedder = ResNet34Adapter::new(&models.embedder_path, 4)?;
    let clusterer = KMeansClusterer::new(20);

    let pipeline = HybridPipeline::new(
        Box::new(segmenter),
        Box::new(embedder),
        Box::new(clusterer),
    );

    // This example hard-codes 16 kHz rather than using the rate read from the file.
    let (samples, _sr) = polyvoice::wav::read_wav(Path::new("meeting.wav"))?;
    let sr = polyvoice::types::SampleRate::new(16000).unwrap();
    let result = pipeline.run(&samples, sr)?;

    for turn in &result.turns {
        println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.time.start, turn.time.end);
    }
    Ok(())
}
```
> **Note**: The hybrid pipeline is currently API-only. The CLI (`polyvoice diarize`) and Python bindings continue to use the legacy pipeline for stability.
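
The step between speech-region detection and global clustering, cutting each detected region into overlapping windows that are then embedded, can be sketched as follows. This is an illustration under stated assumptions, not the crate's code: `Span` and `sliding_windows` are hypothetical names, the window and hop lengths are free parameters, and right-aligning the last window so the region tail is covered is a design choice made for the example.

```rust
/// A time span in seconds (hypothetical helper type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: f64,
    end: f64,
}

/// Cut each speech region into windows of `win` seconds advanced by `hop` seconds.
/// Regions no longer than `win` yield a single window covering the whole region.
fn sliding_windows(regions: &[Span], win: f64, hop: f64) -> Vec<Span> {
    assert!(win > 0.0 && hop > 0.0, "window and hop must be positive");
    let mut out = Vec::new();
    for r in regions {
        if r.end <= r.start {
            continue; // skip empty or inverted regions
        }
        if r.end - r.start <= win {
            out.push(*r);
            continue;
        }
        let mut i = 0u32;
        loop {
            let start = r.start + f64::from(i) * hop;
            if start + win >= r.end {
                // Right-align the final window so the tail of the region is covered.
                out.push(Span { start: r.end - win, end: r.end });
                break;
            }
            out.push(Span { start, end: start + win });
            i += 1;
        }
    }
    out
}
```

Each resulting window would be passed to the embedder, and all embeddings from the whole recording would go to the clusterer together. Clustering across the full recording, rather than per region, is what allows the number of speakers to exceed what any single local segment can represent.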
## Python / C FFI
Python bindings use the **legacy** pipeline (stable default):
```python
import polyvoice

pipeline = polyvoice.Pipeline.balanced("models/")
# `samples` is the audio buffer, loaded elsewhere; it must match sample_rate below.
result = pipeline.run(samples, sample_rate=16000)
for turn in result["turns"]:
    print(f"{turn['speaker']}: {turn['start']:.1f}s - {turn['end']:.1f}s")
```
```c
// cargo build --features ffi
// See include/polyvoice.h and examples/ffi_usage.c
polyvoice_pipeline_create(BALANCED, "models/", &handle);
polyvoice_pipeline_run(handle, samples, n, 16000, &json, &len);
```
## Benchmarks
| Pipeline | Dataset | DER | Speed |
|---|---|---|---|
| **Legacy** (Silero + ResNet34 + AHC) | VoxConverse (232 files) | **~14%** | 10x RT (CPU) |
| **Legacy** (Silero + ResNet34 + AHC) | AMI (16 meetings) | **~36%** | 7x RT (CPU) |
| **Hybrid** (Powerset VAD + ResNet34 + AHC) | e2e smoke (26 s clip) | **4.43%** | — |
| **Hybrid** (Powerset VAD + ResNet34 + AHC) | VoxConverse (3-file subset) | **8.27%** | — |
| **Hybrid** (Powerset VAD + ResNet34 + AHC) | VoxConverse (10-file subset) | **16.62%** | — |
| **Hybrid** (Powerset VAD + ResNet34 + **K-means**) | VoxConverse (10-file subset) | **13.48%** | — |
| **Hybrid** (Powerset VAD + ResNet34 + **K-means**) | VoxConverse (full 232 files) | **14.12%** | — |

~80% of pyannote's accuracy at 10× the speed on CPU — no GPU, no Python.
> **Note on long-form audio**: The 10-file VoxConverse subset includes one known
> outlier (`aorju`: 23 min, 12 speakers, 17% overlap → DER 52.51%). Excluding this
> file, the average DER drops to ~10.5%. The hybrid pipeline is API-only and
> optimized for typical conference/meeting recordings; extreme multi-speaker
> long-form with heavy overlap remains an active research area (VBx/PLDA).
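
For readers unfamiliar with the metric, DER (diarization error rate) is the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below shows that arithmetic along with the percentage-point gap used when comparing two systems. `DerParts`, `der`, and `points_gap` are hypothetical names for this example and are not part of the polyvoice API; how the benchmark numbers above were scored (collar, overlap handling) is not specified here.

```rust
/// Error durations in seconds, accumulated over a recording or a dataset.
struct DerParts {
    missed: f64,      // reference speech with no hypothesis speaker
    false_alarm: f64, // hypothesis speech where the reference has none
    confusion: f64,   // speech attributed to the wrong speaker
}

/// DER as a fraction of total reference speech time.
/// Returns `None` when there is no reference speech to normalise by.
fn der(parts: &DerParts, reference_speech: f64) -> Option<f64> {
    if reference_speech <= 0.0 {
        return None;
    }
    Some((parts.missed + parts.false_alarm + parts.confusion) / reference_speech)
}

/// Absolute gap between two DER values, in percentage points.
fn points_gap(baseline_pct: f64, candidate_pct: f64) -> f64 {
    baseline_pct - candidate_pct
}
```

With illustrative inputs of 3 s missed, 2 s false alarm, and 5 s confusion over 100 s of reference speech, `der` gives 0.10, that is 10%. Applied to the figures quoted in this README, `points_gap(18.77, 14.12)` is about 4.65 percentage points, which corresponds to a relative reduction of roughly 25%.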
## License
MIT