polyvoice

CI Crates.io PyPI Docs.rs License: MIT

Speaker diarization for Rust — who spoke when, without Python.

Silero VAD + WeSpeaker embeddings + AHC clustering in a single Pipeline::run() call.

CLI Demo

Input:  14 seconds of two-speaker audio (16 kHz mono WAV)
Output: SPEAKER_00: 0.10s -  7.60s
        SPEAKER_01: 8.10s - 14.10s

Quick start

1. Add the dependency

[dependencies]
polyvoice = { version = "0.6.0-alpha.3", features = ["onnx"] }

2. Download models

bash scripts/download-models.sh
# Downloads WeSpeaker ResNet34 (25 MB) and Silero VAD v5 (2.2 MB) to models/

3. Run the pipeline

use polyvoice::{
    Pipeline, DiarizationConfig, VadConfig,
    FbankOnnxExtractor, SileroVad,
};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load models
    let extractor = FbankOnnxExtractor::new(
        Path::new("models/wespeaker_resnet34.onnx"),
        256, // embedding dim
        4,   // ONNX session pool size
    )?;
    let mut vad = SileroVad::new(Path::new("models/silero_vad.onnx"), 512)?;

    // Configure and run
    let pipeline = Pipeline::new(
        DiarizationConfig::default(),
        VadConfig::default(),
    );
    let (samples, _sr) = polyvoice::wav::read_wav(Path::new("meeting.wav"))?;
    let result = pipeline.run(&samples, &extractor, &mut vad)?;

    for turn in &result.turns {
        println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.time.start, turn.time.end);
    }
    Ok(())
}
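
Note that the pipeline works on 16 kHz mono audio end to end (see "How it works" below); read_wav returns the file's sample rate alongside the samples, so you can check it and resample first if your input differs.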

Python

pip install polyvoice

Or build from source:

cd python
maturin develop --release

import polyvoice

pipeline = polyvoice.Pipeline("models/")
turns = pipeline("meeting.wav")

for turn in turns:
    print(f"{turn.speaker}: {turn.start:.1f}s - {turn.end:.1f}s")

CLI

cargo install polyvoice --features cli

polyvoice download-models
polyvoice diarize meeting.wav
polyvoice diarize meeting.wav --format json
polyvoice diarize meeting.wav --format rttm --max-speakers 4
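
For reference, RTTM is the interchange format consumed by standard diarization scorers; the layout below is the format's standard field order, not a verbatim capture of polyvoice's output. The first SPEAKER_00 turn from the demo above would serialize roughly as:

SPEAKER meeting 1 0.10 7.50 <NA> <NA> SPEAKER_00 <NA> <NA>

(type, file id, channel, onset in seconds, duration in seconds, placeholder fields, speaker label, more placeholders).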

How it works

WAV / PCM audio (16 kHz mono)
       |
       v
+-------------+     +------------------+     +---------+
|  Silero VAD |---->| WeSpeaker        |---->|   AHC   |---> Speaker turns
|  (speech    |     | ResNet34         |     | cluster |
|   regions)  |     | (256-d embed.)   |     |         |
+-------------+     +------------------+     +---------+
                     fbank + CMVN           cosine similarity
                     lock-free pool         threshold merging

VAD detects speech regions, skipping silence. WeSpeaker extracts 256-dimensional speaker embeddings from log-mel filterbank features (80-bin, CMVN-normalized). AHC clusters embeddings by cosine similarity into speaker groups. The Pipeline wires it all together.
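
For intuition, the clustering step can be sketched in plain Rust. The cosine and ahc functions below are a self-contained illustration of average-linkage AHC with threshold merging, not polyvoice's internal implementation:

// Illustrative only: naive average-linkage AHC over cosine similarity.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

// Returns one cluster label per embedding.
fn ahc(embeddings: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    // Start with a singleton cluster per embedding.
    let mut clusters: Vec<Vec<usize>> = (0..embeddings.len()).map(|i| vec![i]).collect();
    loop {
        // Find the pair of clusters with the highest average pairwise similarity.
        let mut best: Option<(usize, usize, f32)> = None;
        for i in 0..clusters.len() {
            for j in (i + 1)..clusters.len() {
                let (mut sum, mut n) = (0.0f32, 0u32);
                for &a in &clusters[i] {
                    for &b in &clusters[j] {
                        sum += cosine(&embeddings[a], &embeddings[b]);
                        n += 1;
                    }
                }
                let avg = sum / n as f32;
                if best.map_or(true, |(_, _, s)| avg > s) {
                    best = Some((i, j, avg));
                }
            }
        }
        // Merge the best pair while it clears the threshold; otherwise stop.
        match best {
            Some((i, j, s)) if s >= threshold => {
                let merged = clusters.swap_remove(j); // j > i, so index i stays valid
                clusters[i].extend(merged);
            }
            _ => break,
        }
    }
    // Flatten cluster membership into per-embedding labels.
    let mut labels = vec![0usize; embeddings.len()];
    for (id, members) in clusters.iter().enumerate() {
        for &m in members {
            labels[m] = id;
        }
    }
    labels
}

In the real pipeline there is one embedding per sliding analysis window (window_secs / hop_secs in the configuration below), and the resulting labels are post-processed into speaker turns.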

Comparison with pyannote

                      polyvoice                  pyannote
Language              Rust                       Python
Runtime               ONNX Runtime               PyTorch
GIL-free              Yes                        No
Binary size           ~30 MB (with models)       ~2 GB (torch + models)
Deploy                Single binary / C FFI      Python env + pip
Concurrent sessions   Lock-free session pool     Thread-limited
Streaming             OnlineDiarizer built-in    Third-party wrappers

pyannote is the gold standard for accuracy. polyvoice trades some accuracy for deployment simplicity: no Python runtime, no GPU required, ~30 MB total.

Minimum Supported Rust Version (MSRV)

1.85 (Rust 2024 edition).

Accuracy (DER benchmarks)

Evaluated with 0.25s collar on standard diarization benchmarks:

VoxConverse (232 files, 43.5 hours — broadcast, meetings, interviews)

System                          DER    Miss   FA     Confusion   Speed
polyvoice (AHC, t=0.45, me=2)   ~15%   3.9%   3.2%   7.9%        10.6x RT (CPU)
pyannote 3.0                    ~11%   -      -      -           ~1x RT (GPU)

AMI (16 meetings, 9 hours — meeting room recordings)

System                          DER    Miss    FA     Confusion   Speed
polyvoice (AHC, t=0.45, me=2)   ~23%   15.4%   3.5%   4.1%        7x RT (CPU)
pyannote 3.0                    ~18%   -       -      -           ~1x RT (GPU)
Simple i-vector + AHC           ~33%   -       -      -           -

polyvoice delivers ~80% of pyannote's accuracy at 10x the speed on CPU alone — no GPU, no Python, ~30 MB total. The accuracy gap comes from neural end-to-end training and overlap-aware resegmentation, which polyvoice doesn't do yet.

# Reproduce benchmarks
bash scripts/download-ami-test.sh
cargo run --release --features cli --bin polyvoice-bench -- data/ami-test

bash scripts/download-voxconverse-test.sh
cargo run --release --features cli --bin polyvoice-bench -- data/voxconverse-test --threshold 0.4

Features

  • Pipeline API — Pipeline::run() for one-call diarization with VAD + embeddings + clustering.
  • Online & offline — OnlineDiarizer for real-time streaming, OfflineDiarizer for batch files.
  • ONNX-powered — WeSpeaker and ECAPA-TDNN extractors with 80-bin log-mel fbank + CMVN.
  • Lock-free session pool — crossbeam-queue-backed pool for concurrent ONNX inference.
  • Silero VAD — integrated voice activity detection with stateful LSTM context.
  • Overlap detection — find regions where multiple speakers talk simultaneously.
  • Word alignment — assign speaker IDs to transcript words by timestamp (see the sketch after this list).
  • Python bindings — pip install polyvoice, 3-line API via PyO3/maturin.
  • CLI — polyvoice diarize meeting.wav with text/json/rttm output.
  • C FFI — drop-in .so/.dylib/.dll for Go, Node.js, C++ callers.
  • Safety verified — Miri (memory), Loom (concurrency), cargo-fuzz (inputs), across Linux/macOS/Windows.
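
The word-alignment idea is simple enough to sketch. The Word and Turn types and the align function below are hypothetical stand-ins for illustration, not polyvoice's actual API: each transcript word takes the speaker of the turn containing its temporal midpoint, or None if it falls in silence.

// Hypothetical types for illustration; polyvoice's word-alignment API may differ.
struct Word { text: String, start: f32, end: f32 }
struct Turn { speaker: String, start: f32, end: f32 }

/// Assign each word the speaker whose turn contains the word's midpoint.
fn align(words: &[Word], turns: &[Turn]) -> Vec<(String, Option<String>)> {
    words
        .iter()
        .map(|w| {
            let mid = (w.start + w.end) / 2.0;
            let speaker = turns
                .iter()
                .find(|t| t.start <= mid && mid < t.end)
                .map(|t| t.speaker.clone());
            (w.text.clone(), speaker)
        })
        .collect()
}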

Configuration

use polyvoice::{DiarizationConfig, VadConfig, SampleRate};

let config = DiarizationConfig {
    threshold: 0.45,          // cosine similarity threshold
    max_speakers: 64,         // hard speaker limit
    window_secs: 1.5,         // analysis window
    hop_secs: 0.75,           // sliding step
    min_speech_secs: 0.25,    // discard shorter segments
    max_gap_secs: 0.5,        // merge same-speaker gaps under 500 ms
    min_turn_duration_secs: 1.0,  // filter turns shorter than 1s
    min_embeddings_per_speaker: 2, // merge speakers with <2 embeddings
    sample_rate: SampleRate::new(16000).unwrap(),
};

let vad_config = VadConfig {
    frame_size: 512,          // Silero VAD chunk size (32 ms at 16 kHz)
    threshold: 0.5,           // speech probability threshold
    min_silence_ms: 300.0,    // minimum silence to split segments
};
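
Both structs plug straight into the constructor shown in the quick start:

let pipeline = Pipeline::new(config, vad_config);

As a rule of thumb, a higher threshold requires embeddings to be more similar before two clusters merge, so it tends to yield more distinct speakers, while a lower one merges more aggressively.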

Streaming (real-time)

use polyvoice::{OnlineDiarizer, DiarizationConfig, DummyExtractor};

let config = DiarizationConfig::default();
let mut diarizer = OnlineDiarizer::new(config);
let extractor = DummyExtractor::new(256);

// In your audio callback:
let chunk = vec![0.0f32; 4800]; // e.g. 300 ms of 16 kHz audio
let segments = diarizer.feed(&chunk, &extractor).unwrap();
for seg in segments {
    println!("Speaker {:?} at {:.2}s", seg.speaker, seg.time.start);
}
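
DummyExtractor here is a stand-in that keeps the example runnable without model files; in a real application you would pass an ONNX-backed extractor such as the FbankOnnxExtractor from the quick start instead.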

Verification

Check                     Tool
Unsafe memory safety      Miri (nightly CI)
Concurrency correctness   Loom model-checking
Input fuzzing             cargo-fuzz (4 targets)
API stability             cargo-semver-checks
Cross-platform            Ubuntu, macOS, Windows CI
Dependency audit          cargo-audit

Roadmap

  • WeSpeaker + ECAPA-TDNN ONNX extractors
  • Silero VAD integration
  • Agglomerative hierarchical clustering (AHC)
  • Pipeline API (VAD + embeddings + AHC)
  • C FFI bindings
  • Miri / Loom / fuzz verification
  • Cross-platform CI
  • Python bindings (PyO3 / maturin)
  • CLI tool (polyvoice diarize / download-models)
  • DER benchmarks on AMI (~23%) and VoxConverse (~15%), 0.25s collar
  • Spectral clustering backend (experimental)
  • Merge-small-speakers post-processing
  • PLDA scoring backend

Contributing

See CONTRIBUTING.md.

Changelog

See CHANGELOG.md.

License

MIT