polyvoice

CI Crates.io PyPI Docs.rs License: MIT

Speaker diarization for Rust — who spoke when, without Python.

Silero VAD + WeSpeaker embeddings + AHC clustering in a single Pipeline::run() call.

CLI Demo

Input:  14 seconds of two-speaker audio (16 kHz mono WAV)
Output: SPEAKER_00: 0.10s -  7.60s
        SPEAKER_01: 8.10s - 14.10s

Quick start

1. Add the dependency

[dependencies]
polyvoice = { version = "0.6.0-alpha.3", features = ["onnx"] }

2. Download models

bash scripts/download-models.sh
# Downloads WeSpeaker ResNet34 (25 MB) and Silero VAD v5 (2.2 MB) to models/

3. Run the pipeline

use polyvoice::{
    Pipeline, DiarizationConfig, VadConfig,
    FbankOnnxExtractor, SileroVad,
};
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load models
    let extractor = FbankOnnxExtractor::new(
        Path::new("models/wespeaker_resnet34.onnx"),
        256, // embedding dim
        4,   // ONNX session pool size
    )?;
    let mut vad = SileroVad::new(Path::new("models/silero_vad.onnx"), 512)?;

    // Configure and run
    let pipeline = Pipeline::new(
        DiarizationConfig::default(),
        VadConfig::default(),
    );
    let (samples, _sr) = polyvoice::wav::read_wav(Path::new("meeting.wav"))?;
    let result = pipeline.run(&samples, &extractor, &mut vad)?;

    for turn in &result.turns {
        println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.time.start, turn.time.end);
    }
    Ok(())
}
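
Note that the pipeline works on 16 kHz mono audio end to end (see "How it works" below); read_wav returns the file's sample rate alongside the samples, so you can check it and resample first if your input differs.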

Python

pip install polyvoice

Or build from source:

cd python
maturin develop --release

import polyvoice

pipeline = polyvoice.Pipeline("models/")
turns = pipeline("meeting.wav")

for turn in turns:
    print(f"{turn.speaker}: {turn.start:.1f}s - {turn.end:.1f}s")

CLI

cargo install polyvoice --features cli

polyvoice download-models
polyvoice diarize meeting.wav
polyvoice diarize meeting.wav --format json
polyvoice diarize meeting.wav --format rttm --max-speakers 4
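
For reference, RTTM is the interchange format consumed by standard diarization scorers; the layout below is the format's standard field order, not a verbatim capture of polyvoice's output. The first SPEAKER_00 turn from the demo above would serialize roughly as:

SPEAKER meeting 1 0.10 7.50 <NA> <NA> SPEAKER_00 <NA> <NA>

(type, file id, channel, onset in seconds, duration in seconds, placeholder fields, speaker label, more placeholders).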

How it works

WAV / PCM audio (16 kHz mono)
       |
       v
+-------------+     +------------------+     +---------+
|  Silero VAD |---->| WeSpeaker        |---->|   AHC   |---> Speaker turns
|  (speech    |     | ResNet34         |     | cluster |
|   regions)  |     | (256-d embed.)   |     |         |
+-------------+     +------------------+     +---------+
                     fbank + CMVN           cosine similarity
                     lock-free pool         threshold merging

VAD detects speech regions, skipping silence. WeSpeaker extracts 256-dimensional speaker embeddings from log-mel filterbank features (80-bin, CMVN-normalized). AHC clusters embeddings by cosine similarity into speaker groups. The Pipeline wires it all together.
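
For intuition, the clustering step can be sketched in plain Rust. The cosine and ahc functions below are a self-contained illustration of average-linkage AHC with threshold merging, not polyvoice's internal implementation:

// Illustrative only: naive average-linkage AHC over cosine similarity.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

// Returns one cluster label per embedding.
fn ahc(embeddings: &[Vec<f32>], threshold: f32) -> Vec<usize> {
    // Start with a singleton cluster per embedding.
    let mut clusters: Vec<Vec<usize>> = (0..embeddings.len()).map(|i| vec![i]).collect();
    loop {
        // Find the pair of clusters with the highest average pairwise similarity.
        let mut best: Option<(usize, usize, f32)> = None;
        for i in 0..clusters.len() {
            for j in (i + 1)..clusters.len() {
                let (mut sum, mut n) = (0.0f32, 0u32);
                for &a in &clusters[i] {
                    for &b in &clusters[j] {
                        sum += cosine(&embeddings[a], &embeddings[b]);
                        n += 1;
                    }
                }
                let avg = sum / n as f32;
                if best.map_or(true, |(_, _, s)| avg > s) {
                    best = Some((i, j, avg));
                }
            }
        }
        // Merge the best pair while it clears the threshold; otherwise stop.
        match best {
            Some((i, j, s)) if s >= threshold => {
                let merged = clusters.swap_remove(j); // j > i, so index i stays valid
                clusters[i].extend(merged);
            }
            _ => break,
        }
    }
    // Flatten cluster membership into per-embedding labels.
    let mut labels = vec![0usize; embeddings.len()];
    for (id, members) in clusters.iter().enumerate() {
        for &m in members {
            labels[m] = id;
        }
    }
    labels
}

In the real pipeline there is one embedding per sliding analysis window (window_secs / hop_secs in the configuration below), and the resulting labels are post-processed into speaker turns.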

Comparison with pyannote

                      polyvoice                  pyannote
Language              Rust                       Python
Runtime               ONNX Runtime               PyTorch
GIL-free              Yes                        No
Binary size           ~30 MB (with models)       ~2 GB (torch + models)
Deploy                Single binary / C FFI      Python env + pip
Concurrent sessions   Lock-free session pool     Thread-limited
Streaming             OnlineDiarizer built-in    Third-party wrappers

pyannote is the gold standard for accuracy. polyvoice trades some accuracy for deployment simplicity: no Python runtime, no GPU required, ~30 MB total.

Minimum Supported Rust Version (MSRV)

1.85 (Rust 2024 edition).

Accuracy (DER benchmarks)

Evaluated with 0.25s collar on standard diarization benchmarks:

VoxConverse (232 files, 43.5 hours — broadcast, meetings, interviews)

System                          DER    Miss   FA     Confusion   Speed
polyvoice (AHC, t=0.45, me=2)   ~15%   3.9%   3.2%   7.9%        10.6x RT (CPU)
pyannote 3.0                    ~11%   -      -      -           ~1x RT (GPU)

AMI (16 meetings, 9 hours — meeting room recordings)

System                          DER    Miss    FA     Confusion   Speed
polyvoice (AHC, t=0.45, me=2)   ~23%   15.4%   3.5%   4.1%        7x RT (CPU)
pyannote 3.0                    ~18%   -       -      -           ~1x RT (GPU)
Simple i-vector + AHC           ~33%   -       -      -           -

polyvoice delivers ~80% of pyannote's accuracy at 10x the speed on CPU alone — no GPU, no Python, ~30 MB total. The accuracy gap comes from neural end-to-end training and overlap-aware resegmentation, which polyvoice doesn't do yet.

# Reproduce benchmarks
bash scripts/download-ami-test.sh
cargo run --release --features cli --bin polyvoice-bench -- data/ami-test

bash scripts/download-voxconverse-test.sh
cargo run --release --features cli --bin polyvoice-bench -- data/voxconverse-test --threshold 0.4

Features

  • Pipeline API — Pipeline::run() for one-call diarization with VAD + embeddings + clustering.
  • Online & offline — OnlineDiarizer for real-time streaming, OfflineDiarizer for batch files.
  • ONNX-powered — WeSpeaker and ECAPA-TDNN extractors with 80-bin log-mel fbank + CMVN.
  • Lock-free session pool — crossbeam-queue-backed pool for concurrent ONNX inference.
  • Silero VAD — integrated voice activity detection with stateful LSTM context.
  • Overlap detection — find regions where multiple speakers talk simultaneously.
  • Word alignment — assign speaker IDs to transcript words by timestamp (see the sketch after this list).
  • Python bindings — pip install polyvoice, 3-line API via PyO3/maturin.
  • CLI — polyvoice diarize meeting.wav with text/json/rttm output.
  • C FFI — drop-in .so/.dylib/.dll for Go, Node.js, C++ callers.
  • Safety verified — Miri (memory), Loom (concurrency), cargo-fuzz (inputs), across Linux/macOS/Windows.
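
The word-alignment idea is simple enough to sketch. The Word and Turn types and the align function below are hypothetical stand-ins for illustration, not polyvoice's actual API: each transcript word takes the speaker of the turn containing its temporal midpoint, or None if it falls in silence.

// Hypothetical types for illustration; polyvoice's word-alignment API may differ.
struct Word { text: String, start: f32, end: f32 }
struct Turn { speaker: String, start: f32, end: f32 }

/// Assign each word the speaker whose turn contains the word's midpoint.
fn align(words: &[Word], turns: &[Turn]) -> Vec<(String, Option<String>)> {
    words
        .iter()
        .map(|w| {
            let mid = (w.start + w.end) / 2.0;
            let speaker = turns
                .iter()
                .find(|t| t.start <= mid && mid < t.end)
                .map(|t| t.speaker.clone());
            (w.text.clone(), speaker)
        })
        .collect()
}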

Configuration

use polyvoice::{DiarizationConfig, VadConfig, SampleRate};

let config = DiarizationConfig {
    threshold: 0.45,          // cosine similarity threshold
    max_speakers: 64,         // hard speaker limit
    window_secs: 1.5,         // analysis window
    hop_secs: 0.75,           // sliding step
    min_speech_secs: 0.25,    // discard shorter segments
    max_gap_secs: 0.5,        // merge same-speaker gaps under 500 ms
    min_turn_duration_secs: 1.0,  // filter turns shorter than 1s
    min_embeddings_per_speaker: 2, // merge speakers with <2 embeddings
    sample_rate: SampleRate::new(16000).unwrap(),
};

let vad_config = VadConfig {
    frame_size: 512,          // Silero VAD chunk size (32 ms at 16 kHz)
    threshold: 0.5,           // speech probability threshold
    min_silence_ms: 300.0,    // minimum silence to split segments
};
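
Both structs plug straight into the constructor shown in the quick start:

let pipeline = Pipeline::new(config, vad_config);

As a rule of thumb, a higher threshold requires embeddings to be more similar before two clusters merge, so it tends to yield more distinct speakers, while a lower one merges more aggressively.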

Streaming (real-time)

use polyvoice::{OnlineDiarizer, DiarizationConfig, DummyExtractor};

let config = DiarizationConfig::default();
let mut diarizer = OnlineDiarizer::new(config);
let extractor = DummyExtractor::new(256);

// In your audio callback:
let chunk = vec![0.0f32; 4800]; // e.g. 300 ms of 16 kHz audio
let segments = diarizer.feed(&chunk, &extractor).unwrap();
for seg in segments {
    println!("Speaker {:?} at {:.2}s", seg.speaker, seg.time.start);
}
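
DummyExtractor here is a stand-in that keeps the example runnable without model files; in a real application you would pass an ONNX-backed extractor such as the FbankOnnxExtractor from the quick start instead.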

Verification

Check                     Tool
Unsafe memory safety      Miri (nightly CI)
Concurrency correctness   Loom model-checking
Input fuzzing             cargo-fuzz (4 targets)
API stability             cargo-semver-checks
Cross-platform            Ubuntu, macOS, Windows CI
Dependency audit          cargo-audit

Roadmap

  • WeSpeaker + ECAPA-TDNN ONNX extractors
  • Silero VAD integration
  • Agglomerative hierarchical clustering (AHC)
  • Pipeline API (VAD + embeddings + AHC)
  • C FFI bindings
  • Miri / Loom / fuzz verification
  • Cross-platform CI
  • Python bindings (PyO3 / maturin)
  • CLI tool (polyvoice diarize / download-models)
  • DER benchmarks on AMI (~23%) and VoxConverse (~15%), 0.25s collar
  • Spectral clustering backend (experimental)
  • Merge-small-speakers post-processing
  • PLDA scoring backend

Contributing

See CONTRIBUTING.md.

Changelog

See CHANGELOG.md.

License

MIT