polyvoice

Speaker diarization for Rust: who spoke when, without Python. Legacy pipeline: Silero VAD + WeSpeaker embeddings + AHC clustering. New in v0.6.3: the Hybrid pipeline (Powerset VAD + ResNet34 + AHC/K-means) for long-form multi-speaker audio, currently API-only. Unreleased: a K-means auto-k clusterer (silhouette-based k selection) that lowers DER on the full VoxConverse set by 4.65 percentage points compared with AHC (14.12% vs 18.77%).

Quick Start

[dependencies]
polyvoice = { version = "0.6", features = ["onnx"] }

Or, from the command line:

cargo add polyvoice --features onnx

Features

One-call pipeline — Pipeline::run() wires VAD → embeddings → AHC or K-means clustering.
Hybrid pipeline — HybridPipeline (v0.6.3, API-only) uses PowersetSegmenter as a superior VAD (overlap-aware) + global ResNet34 embedding clustering. Overcomes the 3-speaker limit of local segmentation models on long-form audio.
Online & offline — OnlineDiarizer for streaming, OfflineDiarizer for batch.
CPU-only, ~30 MB — ONNX Runtime, no GPU or Python runtime required.
Multi-language — Rust library, Python bindings (pip install polyvoice), C FFI, CLI.
Lock-free concurrency — crossbeam-queue session pool for parallel inference.
Parallel embedder — embed_batch spreads chunks across CPU cores via std::thread::scope.
AHC O(n²) — agglomerative clustering rewritten from cubic to quadratic; handles >500 embeddings on long recordings.
K-means auto-k — silhouette-based automatic k selection with single-speaker detection. 14.12% DER on VoxConverse full (vs AHC 18.77%).
Hardened — Miri (memory), Loom (concurrency), cargo-fuzz (4 targets), model signing (Minisign).
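To make the "K-means auto-k" entry above more concrete, here is a minimal, std-only sketch of silhouette-based k selection. It is not polyvoice's implementation: the function names (`kmeans`, `silhouette`, `auto_k`), the Euclidean distance, the farthest-point seeding, and the `min_score` single-speaker threshold are all assumptions chosen for illustration.

```rust
// Illustrative sketch only; every name and constant here is hypothetical.

/// Euclidean distance between two embedding vectors.
fn dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f32>().sqrt()
}

/// Distance from `p` to its closest centroid.
fn nearest(p: &[f32], centroids: &[Vec<f32>]) -> f32 {
    centroids.iter().map(|c| dist(p, c)).fold(f32::MAX, f32::min)
}

/// Lloyd's k-means with deterministic farthest-point seeding.
/// Expects a non-empty `points` slice and k <= points.len().
fn kmeans(points: &[Vec<f32>], k: usize, iters: usize) -> Vec<usize> {
    let dim = points[0].len();
    let mut centroids = vec![points[0].clone()];
    while centroids.len() < k {
        // Next seed: the point farthest from all centroids chosen so far.
        let next = points
            .iter()
            .max_by(|a, b| nearest(a, &centroids).total_cmp(&nearest(b, &centroids)))
            .unwrap()
            .clone();
        centroids.push(next);
    }
    let mut labels = vec![0usize; points.len()];
    for _ in 0..iters {
        // Assignment step: each point goes to its nearest centroid.
        for (i, p) in points.iter().enumerate() {
            labels[i] = (0..k)
                .min_by(|&a, &b| dist(p, &centroids[a]).total_cmp(&dist(p, &centroids[b])))
                .unwrap();
        }
        // Update step: centroid = mean of its members (empty clusters stay put).
        for c in 0..k {
            let mut sum = vec![0.0f32; dim];
            let mut n = 0usize;
            for (p, &l) in points.iter().zip(&labels) {
                if l == c {
                    for (s, x) in sum.iter_mut().zip(p) {
                        *s += x;
                    }
                    n += 1;
                }
            }
            if n > 0 {
                centroids[c] = sum.iter().map(|s| s / n as f32).collect();
            }
        }
    }
    labels
}

/// Mean silhouette coefficient over all points, in [-1, 1]; higher is better.
fn silhouette(points: &[Vec<f32>], labels: &[usize], k: usize) -> f32 {
    let mut total = 0.0;
    for (i, p) in points.iter().enumerate() {
        // Per-cluster distance sums and member counts, excluding point i itself.
        let mut sum = vec![0.0f32; k];
        let mut cnt = vec![0usize; k];
        for (j, q) in points.iter().enumerate() {
            if i != j {
                sum[labels[j]] += dist(p, q);
                cnt[labels[j]] += 1;
            }
        }
        let own = labels[i];
        if cnt[own] == 0 {
            continue; // singleton cluster contributes a score of 0
        }
        let a = sum[own] / cnt[own] as f32; // mean intra-cluster distance
        let b = (0..k)
            .filter(|&c| c != own && cnt[c] > 0)
            .map(|c| sum[c] / cnt[c] as f32)
            .fold(f32::MAX, f32::min); // mean distance to the nearest other cluster
        let denom = a.max(b);
        if b < f32::MAX && denom > 0.0 {
            total += (b - a) / denom;
        }
    }
    total / points.len() as f32
}

/// Try k = 2..=max_k and keep the best-scoring k. If no k scores above
/// `min_score`, report a single speaker (k = 1).
fn auto_k(points: &[Vec<f32>], max_k: usize, min_score: f32) -> (usize, Vec<usize>) {
    let mut best = (1, vec![0; points.len()], min_score);
    for k in 2..=max_k.min(points.len().saturating_sub(1)) {
        let labels = kmeans(points, k, 20);
        let score = silhouette(points, &labels, k);
        if score > best.2 {
            best = (k, labels, score);
        }
    }
    (best.0, best.1)
}

fn main() {
    // Three well-separated 2-D blobs standing in for speaker embeddings.
    let points: Vec<Vec<f32>> = vec![
        vec![0.0, 0.0], vec![0.1, 0.0], vec![0.0, 0.1],
        vec![5.0, 5.0], vec![5.1, 5.0], vec![5.0, 5.1],
        vec![0.0, 9.0], vec![0.1, 9.0], vec![0.0, 9.1],
    ];
    let (k, labels) = auto_k(&points, 5, 0.25);
    println!("k = {k}, labels = {labels:?}"); // selects k = 3 for this toy input
}
```

A real clusterer would likely work on L2-normalized embeddings with cosine distance and would need a tuned single-speaker criterion; the fixed 0.25 threshold here is only a placeholder.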

Minimal Example (Legacy Pipeline — CLI / Python default)

use polyvoice::{Pipeline, DiarizationConfig, VadConfig, FbankOnnxExtractor, SileroVad};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let ext = FbankOnnxExtractor::new(Path::new("models/wespeaker_resnet34.onnx"), 256, 4)?;
    let mut vad = SileroVad::new(Path::new("models/silero_vad.onnx"), 512)?;
    let (samples, _sr) = polyvoice::wav::read_wav(Path::new("meeting.wav"))?;
    let result = Pipeline::new(DiarizationConfig::default(), VadConfig::default())
        .run(&samples, &ext, &mut vad)?;
    for turn in &result.turns {
        println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.time.start, turn.time.end);
    }
    Ok(())
}

Hybrid Pipeline (API-only, v0.6.3)

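The hybrid pipeline is available in Rust via the pipeline_v2::hybrid module. It uses PowersetSegmenter purely for speech-region detection (including overlaps), then extracts sliding-window ResNet34 embeddings and clusters them globally with AHC or K-means (the example below uses KMeansClusterer). Because speaker identities come from the global clustering rather than from the segmenter's local labels, this avoids the 3-speaker hard limit of the Powerset model.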

use polyvoice::models::ModelRegistry;
use polyvoice::pipeline_v2::hybrid::HybridPipeline;
use polyvoice::segmentation::PowersetSegmenter;
use polyvoice::embedder::ResNet34Adapter;
use polyvoice::clusterer::KMeansClusterer;
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let registry = ModelRegistry::default()?;
    let models = registry.ensure_for_profile(polyvoice::types::Profile::Balanced)?;

    let segmenter = PowersetSegmenter::new(&models.segmenter_path)?;
    let embedder = ResNet34Adapter::new(&models.embedder_path, 4)?;
    let clusterer = KMeansClusterer::new(20);

    let pipeline = HybridPipeline::new(
        Box::new(segmenter),
        Box::new(embedder),
        Box::new(clusterer),
    );

    let (samples, _sr) = polyvoice::wav::read_wav(Path::new("meeting.wav"))?;
    // The sample rate read from the file is discarded; this assumes meeting.wav is 16 kHz.
    let sr = polyvoice::types::SampleRate::new(16000).unwrap();
    let result = pipeline.run(&samples, sr)?;
    for turn in &result.turns {
        println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.time.start, turn.time.end);
    }
    Ok(())
}

Note: The hybrid pipeline is currently API-only. The CLI (polyvoice diarize) and Python bindings continue to use the legacy pipeline for stability.
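The "sliding-window embeddings" step described above can be pictured with a small std-only sketch that cuts detected speech regions into overlapping windows, each of which would then be passed to the embedder. The `Span` type, the `sliding_windows` function, and the 1.5 s window / 0.75 s hop are assumptions for illustration; they are not polyvoice's types or defaults.

```rust
/// A half-open range of sample indices [start, end). Hypothetical type.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

/// Cut each speech region into fixed-length windows spaced `hop` samples apart.
/// Regions no longer than one window are emitted unchanged.
fn sliding_windows(regions: &[Span], win: usize, hop: usize) -> Vec<Span> {
    assert!(win > 0 && hop > 0);
    let mut out = Vec::new();
    for r in regions {
        if r.end <= r.start {
            continue; // skip empty or inverted regions
        }
        if r.end - r.start <= win {
            out.push(*r);
            continue;
        }
        let mut start = r.start;
        while start + win <= r.end {
            out.push(Span { start, end: start + win });
            start += hop;
        }
        // Cover any leftover tail with one window aligned to the region end.
        if out.last().map(|w| w.end) != Some(r.end) {
            out.push(Span { start: r.end - win, end: r.end });
        }
    }
    out
}

fn main() {
    let sr = 16_000; // 16 kHz, matching the example above
    let regions = [
        Span { start: 0, end: 4 * sr },               // 4.0 s of speech
        Span { start: 6 * sr, end: 6 * sr + sr / 2 }, // 0.5 s of speech
    ];
    // Assumed values: 1.5 s windows with a 0.75 s hop.
    let windows = sliding_windows(&regions, sr * 3 / 2, sr * 3 / 4);
    for w in &windows {
        println!("{:.2}s - {:.2}s", w.start as f32 / sr as f32, w.end as f32 / sr as f32);
    }
}
```

Working in sample indices rather than floating-point seconds keeps window boundaries exact; each resulting window yields one embedding, and the global clusterer then groups those embeddings by speaker.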

Python / C FFI

Python bindings use the legacy pipeline (stable default):

import polyvoice
pipeline = polyvoice.Pipeline.balanced("models/")
result = pipeline.run(samples, sample_rate=16000)
for turn in result["turns"]:
    print(f"{turn['speaker']}: {turn['start']:.1f}s - {turn['end']:.1f}s")

// cargo build --features ffi
// See include/polyvoice.h and examples/ffi_usage.c
polyvoice_pipeline_create(BALANCED, "models/", &handle);
polyvoice_pipeline_run(handle, samples, n, 16000, &json, &len);

Benchmarks

Pipeline	Dataset	DER	Speed
Legacy (Silero + ResNet34 + AHC)	VoxConverse (232 files)	~14%	10x RT (CPU)
Legacy (Silero + ResNet34 + AHC)	AMI (16 meetings)	~36%	7x RT (CPU)
Hybrid (Powerset VAD + ResNet34 + AHC)	e2e smoke (26 s clip)	4.43%	—
Hybrid (Powerset VAD + ResNet34 + AHC)	VoxConverse (3-file subset)	8.27%	—
Hybrid (Powerset VAD + ResNet34 + AHC)	VoxConverse (10-file subset)	16.62%	—
Hybrid (Powerset VAD + ResNet34 + K-means)	VoxConverse (10-file subset)	13.48%	—
Hybrid (Powerset VAD + ResNet34 + K-means)	VoxConverse (full 232 files)	14.12%	—

~80% of pyannote's accuracy at 10× the speed on CPU — no GPU, no Python.

Note on long-form audio: The 10-file VoxConverse subset includes one known outlier (aorju: 23 min, 12 speakers, 17% overlap → DER 52.51%). Excluding this file, the average DER drops to ~10.5%. The hybrid pipeline is API-only and optimized for typical conference and meeting recordings; extreme multi-speaker long-form audio with heavy overlap remains an active research area (VBx/PLDA).
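For readers new to the metric: diarization error rate is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below shows that formula and the difference between an absolute and a relative DER change, using the 14.12% and 18.77% figures from the Features list. The `DerParts` struct and the example durations are made up for illustration, and the sketch says nothing about how polyvoice's own scoring handles collars or overlap.

```rust
/// Error and reference durations in seconds for one evaluation run.
/// Hypothetical type, for illustration only.
struct DerParts {
    missed: f64,
    false_alarm: f64,
    confusion: f64,
    reference_speech: f64,
}

/// DER = (missed + false alarm + confusion) / total reference speech.
fn der(p: &DerParts) -> f64 {
    (p.missed + p.false_alarm + p.confusion) / p.reference_speech
}

fn main() {
    // Made-up durations: 42 s of errors over 300 s of reference speech.
    let parts = DerParts {
        missed: 12.0,
        false_alarm: 8.0,
        confusion: 22.0,
        reference_speech: 300.0,
    };
    println!("DER = {:.2}%", 100.0 * der(&parts)); // 42 / 300 = 14.00%

    // Absolute vs. relative change between the two full-VoxConverse figures.
    let (ahc, kmeans) = (18.77_f64, 14.12_f64);
    println!("absolute: {:.2} points", ahc - kmeans);
    println!("relative: {:.1}%", 100.0 * (ahc - kmeans) / ahc);
}
```

The gap between the two clusterers is 4.65 percentage points in absolute terms, which corresponds to roughly a quarter fewer errors in relative terms.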

License

MIT

polyvoice 0.6.4