polyvoice 0.7.0

Speaker diarization for Rust — who spoke when. ONNX-powered: Silero VAD, WeSpeaker embeddings, Pyannote segmentation, K-means/AHC clustering, overlap detection.

polyvoice

Speaker diarization for Rust — who spoke when, without Python.

Beta-quality speaker diarization that runs on CPU and fits in ~30 MB, with automatic K-means speaker count detection. See PRODUCTION-READINESS.md for deployment guidance (GO for desktop and controlled internal use; NO-GO for public multi-tenant APIs).

Speaker_0: 0.0s - 12.3s
Speaker_1: 14.1s - 28.7s
Speaker_0: 31.2s - 45.0s

At a glance

                  polyvoice                pyannote 3.1     whisperX
VoxConverse DER¹  13.83%                   ~12%             ~15%
Model size        ~30 MB                   ~100 MB          ~1 GB
Runtime           CPU only                 GPU recommended  GPU required
Dependencies      No Python / PyTorch²     PyTorch + ONNX   PyTorch + faster-whisper
Languages         Rust / Python / C / CLI  Python only      Python only
Streaming         Yes                      No               No

~80% of pyannote's accuracy at 10× less RAM and no GPU. Runs at ~10× realtime on CPU — 9.3× average over a VoxConverse subset (artifact).
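
As a rough reading of the throughput figure, the sketch below converts a realtime factor into an estimated processing time. The function name is hypothetical and the estimate assumes the 9.3× subset average carries over to your audio and hardware, which it may not.

```rust
/// Estimated wall-clock seconds needed to process `audio_seconds` of audio,
/// given a speed expressed as a multiple of realtime (9.3 = 9.3x faster than playback).
/// Hypothetical helper for illustration; not part of the polyvoice API.
fn processing_seconds(audio_seconds: f64, realtime_factor: f64) -> Option<f64> {
    if realtime_factor <= 0.0 {
        None
    } else {
        Some(audio_seconds / realtime_factor)
    }
}

fn main() {
    let one_hour = 3600.0;
    if let Some(t) = processing_seconds(one_hour, 9.3) {
        println!("about {:.0} s to process one hour of audio", t);
    }
}
```

At 9.3× realtime, an hour-long recording works out to roughly six and a half minutes of CPU time.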

Other Rust diarizers. sherpa-rs (now archived), pyannote-rs, and speakrs are the closest Rust options. None publishes a collar-matched VoxConverse DER, so this table compares only the established Python systems; see Why polyvoice for the maintained / pure-Rust / streaming / four-binding differentiators.

¹ Legacy pipeline, VoxConverse-test (232 files), 0.25 s collar. The 232-file no-collar figure was not measured, but on a 10-file subset no-collar DER is 25.99% vs 17.43% at 0.25 s collar — expect the strict number several points higher. Competitor figures use their own conventions and are not collar-matched — compare only on a matched collar. All polyvoice DER figures are sourced from tests/der_baseline.json; see the canonical table below.

² The C++ ONNX Runtime is downloaded at build time via the ort crate (download-binaries); for hermetic builds use a static-linked / vendored ORT (see PRODUCTION-READINESS.md §2). No Python/PyTorch runtime.


Install

# Rust
cargo add polyvoice --features "onnx,download"

# Python
pip install polyvoice

# CLI
cargo install polyvoice --features cli

Quick start — Rust

Note: the CLI and Python bindings default to the validated legacy pipeline. The builder below is the curated v2 API; for long-form meetings see PRODUCTION-READINESS.md.

use polyvoice::models::ModelRegistry;
use polyvoice::pipeline_v2::Pipeline;
use polyvoice::types::{Profile, SampleRate};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Models auto-download on first run.
    let registry = ModelRegistry::default()?;

    let pipeline = Pipeline::builder()
        .profile(Profile::Balanced) // auto-k speaker count via the Balanced profile
        .with_models_from(registry)
        .build()?;

    let (samples, sr_hz) = polyvoice::wav::read_wav("meeting.wav")?;
    let sr = SampleRate::new(sr_hz).ok_or("invalid sample rate")?;
    let result = pipeline.run(&samples, sr)?;

    for turn in &result.turns {
        println!("{}: {:.1}s - {:.1}s", turn.speaker, turn.time.start, turn.time.end);
    }
    Ok(())
}
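
Once turns are available, a common next step is totalling talk time per speaker. The sketch below is self-contained and uses a hypothetical `Turn` struct as a stand-in for the crate's own result type, so the field names are assumptions. The sample data mirrors the three turns shown at the top of this README.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one diarization turn (not the crate's own type).
#[derive(Debug, Clone)]
struct Turn {
    speaker: String,
    start: f64, // seconds
    end: f64,   // seconds
}

/// Sum the duration of every turn per speaker label.
fn talk_time(turns: &[Turn]) -> BTreeMap<String, f64> {
    let mut totals = BTreeMap::new();
    for t in turns {
        // Negative durations are clamped to zero rather than subtracted.
        *totals.entry(t.speaker.clone()).or_insert(0.0) += (t.end - t.start).max(0.0);
    }
    totals
}

/// The three example turns from the sample output above.
fn sample_turns() -> Vec<Turn> {
    vec![
        Turn { speaker: "Speaker_0".into(), start: 0.0, end: 12.3 },
        Turn { speaker: "Speaker_1".into(), start: 14.1, end: 28.7 },
        Turn { speaker: "Speaker_0".into(), start: 31.2, end: 45.0 },
    ]
}

fn main() {
    for (speaker, secs) in talk_time(&sample_turns()) {
        println!("{speaker}: {secs:.1}s total");
    }
}
```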

Quick start — Python

import polyvoice

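# Assumes `samples` already holds the decoded audio for the recording
# (loading is not shown) and that it is sampled at 16 kHz.
pipeline = polyvoice.Pipeline.balanced("models/")
result = pipeline.run(samples, sample_rate=16000)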

for turn in result["turns"]:
    print(f"{turn['speaker']}: {turn['start']:.1f}s - {turn['end']:.1f}s")

Quick start — CLI

# Download models once
polyvoice download-models --profile balanced

# Diarize
polyvoice diarize meeting.wav --output meeting.rttm
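
The RTTM file written above can be read back without any dependencies. This is a minimal sketch that assumes the standard ten-field `SPEAKER` line layout (file id, channel, onset, duration, then the speaker name in the eighth field). The `Segment` type and the sample lines are illustrative, and the exact file ids and labels polyvoice emits may differ.

```rust
/// One speaker segment parsed from an RTTM `SPEAKER` line.
#[derive(Debug, PartialEq)]
struct Segment {
    speaker: String,
    start: f64, // seconds
    end: f64,   // seconds
}

/// Parse a single RTTM line; returns None for non-SPEAKER or malformed lines.
fn parse_rttm_line(line: &str) -> Option<Segment> {
    let f: Vec<&str> = line.split_whitespace().collect();
    if f.len() < 8 || f[0] != "SPEAKER" {
        return None;
    }
    // Fields 3 and 4 are onset and duration in seconds; field 7 is the speaker name.
    let start: f64 = f[3].parse().ok()?;
    let dur: f64 = f[4].parse().ok()?;
    Some(Segment { speaker: f[7].to_string(), start, end: start + dur })
}

/// Parse every valid SPEAKER line in a block of RTTM text, skipping the rest.
fn parse_rttm(text: &str) -> Vec<Segment> {
    text.lines().filter_map(parse_rttm_line).collect()
}

fn main() {
    // Illustrative RTTM content, not real polyvoice output.
    let rttm = "SPEAKER meeting 1 0.000 12.300 <NA> <NA> Speaker_0 <NA> <NA>\n\
                SPEAKER meeting 1 14.100 14.600 <NA> <NA> Speaker_1 <NA> <NA>\n";
    for s in parse_rttm(rttm) {
        println!("{}: {:.1}s - {:.1}s", s.speaker, s.start, s.end);
    }
}
```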

Benchmarks

All figures below are sourced from tests/der_baseline.json (schema polyvoice-der-baseline-v2) and labeled with pipeline, dataset, file count, and collar. CI-gated marks rows enforced by the release DER-regression gate.

Pipeline                            Dataset                  Files  DER (0.25 s collar)  DER (no-collar)  CI-gated
Legacy (Silero + AHC)               VoxConverse-test         232    13.83%               not measured     no
Legacy (Silero + AHC)               VoxConverse-test subset  10     17.43%               25.99%           yes
Legacy (Silero + AHC)               e2e smoke (26 s clip)    1      6.62%                not measured     yes
Legacy (Silero + AHC)               AMI EN2002a (1 meeting)  1      36.30%               44.73%           yes
v2 (Powerset + ResNet34 + AHC)      e2e smoke (26 s clip)    1      4.43%                not measured     yes
Hybrid (Powerset + ResNet34 + AHC)  e2e smoke (26 s clip)    1      4.43%                not measured     no
Hybrid (Powerset + ResNet34 + AHC)  VoxConverse-test subset  3      8.27%                not measured     no
Hybrid (Powerset + ResNet34 + AHC)  VoxConverse-test subset  10     15.03%               not measured     no
Hybrid (Powerset + ResNet34 + AHC)  AMI EN2002a (1 meeting)  1      24.95%               not measured     no

Notes:

  • No-collar DER is materially higher than the 0.25 s-collar figure (e.g. the 10-file legacy subset is 17.43% collar vs 25.99% no-collar). Compare against other systems only on a matched collar.
  • The previously headlined "14.12% (232-file, Hybrid + K-means)" number had no committed artifact and was withdrawn pending a reproducible, provenance-stamped re-run.
  • AMI rows are a single meeting (EN2002a, ~79% overlap), not a multi-meeting average.
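
For readers new to the metric, DER is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total scored reference speech. The sketch below shows that arithmetic and why a collar usually lowers the figure. The struct is hypothetical and the durations are made up for illustration; they are not polyvoice measurements.

```rust
/// Error durations (seconds) that make up a diarization error rate.
/// Hypothetical type for illustration only.
struct DerParts {
    missed: f64,      // reference speech the system did not output
    false_alarm: f64, // system speech where the reference has none
    confusion: f64,   // speech attributed to the wrong speaker
    reference: f64,   // total scored reference speech
}

/// DER = (missed + false alarm + confusion) / scored reference speech.
fn der(p: &DerParts) -> Option<f64> {
    if p.reference <= 0.0 {
        return None;
    }
    Some((p.missed + p.false_alarm + p.confusion) / p.reference)
}

fn main() {
    // Made-up durations: every second of the reference is scored.
    let no_collar = DerParts { missed: 5.0, false_alarm: 3.0, confusion: 2.0, reference: 100.0 };
    // With a collar, time near reference boundaries is left unscored, which
    // tends to remove boundary errors faster than it shrinks the denominator.
    let with_collar = DerParts { missed: 2.0, false_alarm: 1.0, confusion: 1.5, reference: 90.0 };

    println!("no collar:   {:.1}%", 100.0 * der(&no_collar).unwrap());
    println!("with collar: {:.1}%", 100.0 * der(&with_collar).unwrap());
}
```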

Automatic speaker count uses silhouette-based k selection with a single-speaker guard (no 20-speaker predictions on 1-speaker files).
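
To make the idea concrete, here is a minimal sketch of silhouette-based selection over cosine distance with a similarity guard. It is not the crate's implementation: the function names, the two thresholds, and the choice to score pre-computed candidate labelings (rather than run K-means inline) are all assumptions made to keep the example short.

```rust
/// Cosine distance between two vectors (1 - cosine similarity).
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 1.0 } else { 1.0 - dot / (na * nb) }
}

/// Mean silhouette coefficient. Labels are assumed dense (0..k).
/// Samples in singleton clusters contribute 0.
fn silhouette(embs: &[Vec<f32>], labels: &[usize]) -> f32 {
    let n = embs.len();
    let k = labels.iter().copied().max().map_or(0, |m| m + 1);
    if n == 0 || k < 2 {
        return 0.0;
    }
    let mut total = 0.0f32;
    for i in 0..n {
        // Distance sums and member counts from sample i to each cluster.
        let mut sums = vec![0.0f32; k];
        let mut counts = vec![0usize; k];
        for j in (0..n).filter(|&j| j != i) {
            sums[labels[j]] += cosine_distance(&embs[i], &embs[j]);
            counts[labels[j]] += 1;
        }
        let own = labels[i];
        if counts[own] == 0 {
            continue; // singleton cluster
        }
        let a = sums[own] / counts[own] as f32; // mean intra-cluster distance
        let b = (0..k) // mean distance to the nearest other cluster
            .filter(|&c| c != own && counts[c] > 0)
            .map(|c| sums[c] / counts[c] as f32)
            .fold(f32::INFINITY, f32::min);
        let denom = a.max(b);
        if b.is_finite() && denom > 0.0 {
            total += (b - a) / denom;
        }
    }
    total / n as f32
}

/// Mean pairwise cosine distance over all embeddings.
fn mean_pairwise_distance(embs: &[Vec<f32>]) -> f32 {
    let n = embs.len();
    if n < 2 {
        return 0.0;
    }
    let mut sum = 0.0f32;
    for i in 0..n {
        for j in (i + 1)..n {
            sum += cosine_distance(&embs[i], &embs[j]);
        }
    }
    sum / (n * (n - 1) / 2) as f32
}

/// Pick a speaker count from candidate labelings (one per tried k).
/// Guard: if embeddings are nearly identical, or no candidate beats
/// `min_score`, report a single speaker. Both thresholds are assumptions.
fn pick_k(embs: &[Vec<f32>], candidates: &[Vec<usize>], same_speaker_dist: f32, min_score: f32) -> usize {
    if mean_pairwise_distance(embs) < same_speaker_dist {
        return 1;
    }
    let mut best = (1usize, min_score);
    for labels in candidates {
        let score = silhouette(embs, labels);
        if score > best.1 {
            best = (labels.iter().copied().max().map_or(1, |m| m + 1), score);
        }
    }
    best.0
}

fn main() {
    let embs: Vec<Vec<f32>> = vec![vec![1.0, 0.0], vec![0.99, 0.1], vec![0.0, 1.0], vec![0.1, 0.99]];
    let candidates: Vec<Vec<usize>> = vec![vec![0, 0, 1, 1], vec![0, 0, 1, 2]];
    println!("estimated speakers: {}", pick_k(&embs, &candidates, 0.05, 0.25));
}
```

The guard runs before any scoring because the silhouette coefficient is scale-invariant: a cloud of near-identical embeddings can still produce a deceptively high score for some arbitrary split.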


What makes it different

  • Automatic speaker count — K-means auto-k detects how many speakers are in the recording, matching well-tuned AHC without any manual threshold sweep.
  • Single-speaker guardrail — embeddings too similar? Returns 1 speaker instead of hallucinating clusters.
  • Overlap-aware — PowersetSegmenter detects overlapping speech regions; embeddings are masked to exclude overlaps before clustering.
  • Streaming & batch — OnlineDiarizer for real-time, OfflineDiarizer for files.
  • Cross-platform — Linux, macOS, Windows; x86_64 and aarch64.
  • Hardened — Miri (memory safety), Loom (concurrency), cargo-fuzz (4 targets), model signing (Minisign).
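
The overlap-aware bullet can be pictured as a frame mask. The sketch below assumes the segmenter yields a per-frame count of active speakers (a hypothetical representation, not the PowersetSegmenter's actual output type) and keeps only contiguous single-speaker runs long enough to embed.

```rust
/// Mark frames where exactly one speaker is active.
/// `active[i]` is the number of speakers talking in frame i (assumed input format).
fn single_speaker_mask(active: &[u8]) -> Vec<bool> {
    active.iter().map(|&c| c == 1).collect()
}

/// Contiguous runs of kept frames as (start, end-exclusive) pairs,
/// dropping runs shorter than `min_len` frames.
fn clean_runs(mask: &[bool], min_len: usize) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &keep) in mask.iter().enumerate() {
        match (keep, start) {
            (true, None) => start = Some(i),
            (false, Some(s)) => {
                if i - s >= min_len {
                    runs.push((s, i));
                }
                start = None;
            }
            _ => {}
        }
    }
    // Close a run that extends to the final frame.
    if let Some(s) = start {
        if mask.len() - s >= min_len {
            runs.push((s, mask.len()));
        }
    }
    runs
}

fn main() {
    // 0 = silence, 1 = one speaker, 2 = overlapping speech.
    let active = [0u8, 1, 1, 1, 2, 2, 1, 1, 0, 1];
    let mask = single_speaker_mask(&active);
    println!("{:?}", clean_runs(&mask, 2));
}
```

Only the frames inside these runs would feed the embedding extractor, so overlapped speech does not blur a speaker's centroid during clustering.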

Why polyvoice

  • Maintained, pure-Rust, streaming-capable. The popular sherpa-rs bindings are now archived; polyvoice is an actively-maintained, pure-Rust diarization path (ONNX via ort, no C++ toolkit) with first-class streaming.
  • One library, four surfaces. Rust + Python (maturin) + C FFI + CLI from a single crate — most Rust diarizers are Rust-only.
  • CPU-first, ~30 MB, MIT. No GPU, no Python runtime, no gated model access.

Honest scope: polyvoice is not the accuracy leader. Its like-for-like no-collar VoxConverse DER is in the mid-20s (percent), versus ~11% for pyannote community-1 / speakrs. It trades a few DER points for deployability and a maintained, multi-binding SDK. See Benchmarks for the labeled, collar-disclosed numbers.

Brand note (open maintainer decision): the name collides with ByteDance's "PolyVoice" speech-to-speech-translation research, so always refer to this project as "polyvoice — speaker diarization for Rust". Registering a polyvoice-rs alias is an open decision; a crate rename would break downstreams and is out of scope.


Architecture

┌─────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ Audio Bytes │ --> │ Embedding       │ --> │ Speaker Cluster │ --> Turns
│ (f32 PCM)   │     │ Extractor       │     │ (AHC or K-means)│
└─────────────┘     └─────────────────┘     └─────────────────┘
       │                    │                       │
       v                    v                       v
  Powerset VAD      WeSpeaker ResNet34      Silhouette auto-k
  (10s windows,     (2s windows, 256-dim)   (pairwise cosine
   1s hop)                                  distance cache)
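
The windowing in the diagram can be sketched as plain index arithmetic. The example assumes 16 kHz audio (as in the Python quick start) and a 26 s clip. The 1 s hop for the 2 s embedding windows is an assumption, since the diagram gives only the window length, and the tail handling here (full windows only) may differ from what the crate does.

```rust
/// Sample ranges (start, end-exclusive) of fixed-length windows over `total` samples.
/// Only full windows are emitted; audio shorter than one window yields a single
/// shortened window. Trailing samples that do not fill a hop are not covered.
fn sliding_windows(total: usize, win: usize, hop: usize) -> Vec<(usize, usize)> {
    assert!(win > 0 && hop > 0, "window and hop must be non-zero");
    if total == 0 {
        return Vec::new();
    }
    if total <= win {
        return vec![(0, total)];
    }
    (0..=total - win).step_by(hop).map(|s| (s, s + win)).collect()
}

fn main() {
    let sr = 16_000; // samples per second
    let total = 26 * sr; // a 26 s clip

    // Segmentation: 10 s windows, 1 s hop (from the diagram).
    let seg = sliding_windows(total, 10 * sr, sr);
    // Embeddings: 2 s windows; the 1 s hop is an assumption.
    let emb = sliding_windows(total, 2 * sr, sr);

    println!("{} segmentation windows, {} embedding windows", seg.len(), emb.len());
}
```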

License

MIT