polyvoice

Speaker diarization library for Rust — online (streaming) and offline (file-based), ONNX-powered, and ecosystem-agnostic.

polyvoice answers the question "who spoke when?" in audio streams or files. It is designed to be embedded into STT servers, real-time transcription pipelines, or any other Rust application that needs speaker-aware audio processing.

Features

Online (streaming) diarization — process audio chunk-by-chunk in real time.
Offline (file) diarization — process an entire audio buffer with post-processing (segment merging, gap filling).
Sliding-window embeddings — configurable window and hop sizes instead of fixed segments.
ECAPA-TDNN ONNX extractor — built-in 80-bin log-mel filterbank + ONNX inference.
Session pool for ONNX models — no Mutex contention under concurrent load.
VAD integration trait — plug in Silero VAD, Energy VAD, or your own implementation.
Overlap detection — identify regions where multiple speakers are active simultaneously.
Word-level speaker alignment — assign speaker IDs to individual words using timestamps.
Zero Python dependencies — pure Rust + ONNX Runtime.
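
To illustrate the word-level alignment feature, here is a minimal, self-contained sketch of one common approach: give each word the speaker whose turn overlaps it the most. The `Turn` and `Word` structs and the `assign_speaker` function are hypothetical names made up for this example; they are not polyvoice's actual types, and the library may use a different rule.

```rust
/// A diarized speaker turn (hypothetical type, not polyvoice's own).
#[derive(Debug, Clone, PartialEq)]
struct Turn {
    speaker: u32,
    start: f64, // seconds
    end: f64,   // seconds
}

/// A word with timestamps, e.g. as produced by an STT engine (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Word {
    text: String,
    start: f64, // seconds
    end: f64,   // seconds
}

/// Returns the speaker whose turn overlaps the word the most,
/// or `None` if no turn overlaps the word at all.
fn assign_speaker(word: &Word, turns: &[Turn]) -> Option<u32> {
    let mut best: Option<(u32, f64)> = None;
    for t in turns {
        // Length of the intersection of [word.start, word.end] and [t.start, t.end].
        let overlap = word.end.min(t.end) - word.start.max(t.start);
        if overlap > 0.0 && best.map_or(true, |(_, b)| overlap > b) {
            best = Some((t.speaker, overlap));
        }
    }
    best.map(|(speaker, _)| speaker)
}

fn main() {
    let turns = vec![
        Turn { speaker: 0, start: 0.0, end: 2.0 },
        Turn { speaker: 1, start: 2.0, end: 5.0 },
    ];
    let words = vec![
        Word { text: "hello".into(), start: 0.5, end: 1.0 },
        Word { text: "there".into(), start: 1.8, end: 2.5 }, // straddles both turns
        Word { text: "bye".into(), start: 6.0, end: 6.4 },   // outside every turn
    ];
    for w in &words {
        println!("{} -> {:?}", w.text, assign_speaker(w, &turns));
    }
}
```

A word that straddles a speaker change goes to whichever side covers more of it; a word that falls in silence gets no speaker.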

Quick start

Add to your Cargo.toml:

[dependencies]
polyvoice = { git = "https://github.com/ekhodzitsky/polyvoice" }

Offline diarization

use polyvoice::{OfflineDiarizer, DiarizationConfig, DummyExtractor};

let config = DiarizationConfig::default();
let diarizer = OfflineDiarizer::new(config);
let extractor = DummyExtractor::new(256);

let samples: Vec<f32> = vec![0.0; 16000 * 10]; // 10s of 16 kHz mono audio
let result = diarizer.run(&samples, &extractor).unwrap();

for turn in &result.turns {
    println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.time.start, turn.time.end);
}
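
The offline post-processing mentioned above (segment merging, gap filling) can be pictured with the following standalone sketch. It is an assumption about how such a pass typically works, not polyvoice's actual implementation: `Turn` and `post_process` are hypothetical names, and the two parameters mirror the `max_gap_secs` and `min_speech_secs` configuration fields.

```rust
/// A speaker turn in seconds (hypothetical type for illustration).
#[derive(Debug, Clone, PartialEq)]
struct Turn {
    speaker: u32,
    start: f64,
    end: f64,
}

/// Merges consecutive same-speaker turns whose gap is under `max_gap`
/// seconds, then drops turns shorter than `min_speech` seconds.
/// Expects `turns` to be sorted by start time.
fn post_process(turns: &[Turn], max_gap: f64, min_speech: f64) -> Vec<Turn> {
    let mut merged: Vec<Turn> = Vec::new();
    for t in turns {
        if let Some(prev) = merged.last_mut() {
            if prev.speaker == t.speaker && t.start - prev.end < max_gap {
                // Fill the gap by extending the previous turn.
                prev.end = prev.end.max(t.end);
                continue;
            }
        }
        merged.push(t.clone());
    }
    // Discard turns that are too short to be reliable speech.
    merged.retain(|t| t.end - t.start >= min_speech);
    merged
}

fn main() {
    let turns = vec![
        Turn { speaker: 0, start: 0.0, end: 1.0 },
        Turn { speaker: 0, start: 1.3, end: 2.0 }, // 0.3 s gap: merged into the first turn
        Turn { speaker: 1, start: 2.1, end: 2.2 }, // 0.1 s long: discarded
        Turn { speaker: 0, start: 3.0, end: 4.0 },
    ];
    for t in post_process(&turns, 0.5, 0.25) {
        println!("{}: {:.2}s - {:.2}s", t.speaker, t.start, t.end);
    }
}
```

In this sketch, filtering runs after merging, so a short fragment that gets absorbed into a longer same-speaker turn survives, while an isolated short fragment is removed.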

Online diarization

use polyvoice::{OnlineDiarizer, DiarizationConfig, DummyExtractor};

let config = DiarizationConfig::default();
let mut diarizer = OnlineDiarizer::new(config);
let extractor = DummyExtractor::new(256);

// Feed audio chunks as they arrive (e.g. from a WebSocket stream)
let chunk = vec![0.0f32; 16000]; // 1 second
let segments = diarizer.feed(&chunk, &extractor).unwrap();

With ONNX embedding extractor

Enable the onnx feature and use a WeSpeaker or ECAPA-TDNN ONNX model:

[dependencies]
polyvoice = { git = "https://github.com/ekhodzitsky/polyvoice", features = ["onnx"] }

WeSpeaker (raw audio input):

use polyvoice::{OnnxEmbeddingExtractor, OfflineDiarizer, DiarizationConfig};
use std::path::Path;

let config = DiarizationConfig::default();
let extractor = OnnxEmbeddingExtractor::new(
    Path::new("wespeaker_resnet34.onnx"),
    256,              // embedding dimension
    24000,            // window samples (1.5s @ 16kHz)
    4,                // pool size
).unwrap();

let samples: Vec<f32> = vec![0.0; 16000 * 10]; // 10s of 16 kHz mono audio
let diarizer = OfflineDiarizer::new(config);
let result = diarizer.run(&samples, &extractor).unwrap();

ECAPA-TDNN (fbank input):

use polyvoice::{EcapaTdnnExtractor, OfflineDiarizer, DiarizationConfig};
use std::path::Path;

let config = DiarizationConfig::default();
let extractor = EcapaTdnnExtractor::new(
    Path::new("ecapa_tdnn.onnx"),
    192,              // embedding dimension
    4,                // pool size
).unwrap();

let samples: Vec<f32> = vec![0.0; 16000 * 10]; // 10s of 16 kHz mono audio
let diarizer = OfflineDiarizer::new(config);
let result = diarizer.run(&samples, &extractor).unwrap();

Architecture

polyvoice
├── embedding      # EmbeddingExtractor trait + ONNX pool implementation
├── cluster        # Online incremental centroid clustering
├── vad            # Voice Activity Detection trait + utilities
├── online         # OnlineDiarizer (chunk-by-chunk)
├── offline        # OfflineDiarizer (two-pass with post-processing)
├── overlap        # Overlap detection from segment lists
└── types          # Config, SpeakerId, Segment, WordAlignment, etc.
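
As an illustration of what "overlap detection from segment lists" can mean, the sketch below uses a sweep over segment boundaries to find the regions where at least two segments are active. It is a generic technique shown under assumptions, not the code in the `overlap` module: the `(speaker, start, end)` tuple layout and the `overlap_regions` name are invented for this example, and it assumes a single speaker's own segments never overlap each other.

```rust
/// Returns the (start, end) regions, in seconds, where two or more
/// segments are active at once. Each input tuple is (speaker, start, end).
fn overlap_regions(segments: &[(u32, f64, f64)]) -> Vec<(f64, f64)> {
    // Turn each segment into a +1 event at its start and a -1 event at its end.
    let mut events: Vec<(f64, i32)> = Vec::with_capacity(segments.len() * 2);
    for &(_, start, end) in segments {
        if end > start {
            events.push((start, 1));
            events.push((end, -1));
        }
    }
    // Sort by time. At equal times, ends (-1) come before starts (+1),
    // so segments that merely touch are not reported as overlapping.
    events.sort_by(|a, b| a.0.total_cmp(&b.0).then(a.1.cmp(&b.1)));

    let mut regions = Vec::new();
    let mut active = 0;
    let mut region_start = 0.0;
    for (time, delta) in events {
        let before = active;
        active += delta;
        if before < 2 && active >= 2 {
            region_start = time; // a second segment just became active
        } else if before >= 2 && active < 2 {
            regions.push((region_start, time)); // back to one or zero active segments
        }
    }
    regions
}

fn main() {
    let segments = [(0, 0.0, 3.0), (1, 2.0, 5.0), (0, 5.0, 6.0)];
    for (start, end) in overlap_regions(&segments) {
        println!("overlap: {:.2}s - {:.2}s", start, end);
    }
}
```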

Configuration

use polyvoice::{DiarizationConfig, SampleRate};

let config = DiarizationConfig {
    threshold: 0.5,           // cosine similarity threshold
    max_speakers: 64,         // speaker limit
    window_secs: 1.5,         // analysis window
    hop_secs: 0.75,           // sliding step
    min_speech_secs: 0.25,    // discard shorter segments
    max_gap_secs: 0.5,        // merge same-speaker gaps under 500ms
    sample_rate: SampleRate::new(16000).unwrap(),
};
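
To make the roles of `threshold` and `max_speakers` more concrete, here is a standalone sketch of online incremental centroid clustering with cosine similarity. It is written under assumptions and is not the code in the `cluster` module: the `Clusterer` type, its methods, and the exact rule (join the closest centroid when similarity is at least `threshold`, otherwise open a new speaker until `max_speakers` is reached) are illustrative guesses.

```rust
/// Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Hypothetical incremental clusterer: one running-mean centroid per speaker.
struct Clusterer {
    threshold: f32,
    max_speakers: usize,
    centroids: Vec<Vec<f32>>,
    counts: Vec<usize>,
}

impl Clusterer {
    fn new(threshold: f32, max_speakers: usize) -> Self {
        Self { threshold, max_speakers, centroids: Vec::new(), counts: Vec::new() }
    }

    /// Assigns an embedding to a speaker index and updates that speaker's centroid.
    fn assign(&mut self, emb: &[f32]) -> usize {
        // Find the most similar existing centroid, if any.
        let best = self
            .centroids
            .iter()
            .enumerate()
            .map(|(i, c)| (i, cosine(c, emb)))
            .max_by(|a, b| a.1.total_cmp(&b.1));

        match best {
            // Join the closest speaker if it is similar enough, or if the
            // speaker limit has been reached and no new speaker may be opened.
            Some((i, sim)) if sim >= self.threshold || self.centroids.len() >= self.max_speakers => {
                self.counts[i] += 1;
                let n = self.counts[i] as f32;
                for (c, e) in self.centroids[i].iter_mut().zip(emb) {
                    *c += (e - *c) / n; // running mean update
                }
                i
            }
            // Otherwise start a new speaker with this embedding as its centroid.
            _ => {
                self.centroids.push(emb.to_vec());
                self.counts.push(1);
                self.centroids.len() - 1
            }
        }
    }
}

fn main() {
    let mut clusterer = Clusterer::new(0.5, 64);
    for emb in [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]] {
        println!("{:?} -> speaker {}", emb, clusterer.assign(&emb));
    }
}
```

Under this rule, raising the threshold splits voices into more speakers, while lowering it merges them more aggressively.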

Benchmarks

cargo bench --all-features

Measures offline diarization latency and ECAPA fbank throughput on synthetic multi-speaker audio.

Roadmap to 1.0

ECAPA-TDNN ONNX extractor (in addition to WeSpeaker): available as EcapaTdnnExtractor
C FFI bindings
Agglomerative re-clustering pass for offline mode
PLDA scoring backend
no_std support for embedded targets

License

MIT

polyvoice 0.4.1