polyvoice
Speaker diarization library for Rust — online (streaming) and offline (file-based), ONNX-powered, and ecosystem-agnostic.
polyvoice answers the question "who spoke when?" in audio streams or files. It is designed to be embedded into STT servers, real-time transcription pipelines, or any other Rust application that needs speaker-aware audio processing.
Features
- Online (streaming) diarization — process audio chunk-by-chunk in real time.
- Offline (file) diarization — process an entire audio buffer with post-processing (segment merging, gap filling).
- Sliding-window embeddings — configurable window and hop sizes instead of fixed segments.
- ECAPA-TDNN ONNX extractor — built-in 80-bin log-mel filterbank + ONNX inference.
- Session pool for ONNX models — no Mutex contention under concurrent load.
- VAD integration trait — plug in Silero VAD, Energy VAD, or your own implementation.
- Overlap detection — identify regions where multiple speakers are active simultaneously.
- Word-level speaker alignment — assign speaker IDs to individual words using timestamps.
- Zero Python dependencies — pure Rust + ONNX Runtime.
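To make the sliding-window idea concrete, here is a minimal sketch of how window and hop sizes translate into sample ranges. The function name `sliding_windows` and its signature are hypothetical and are not taken from polyvoice's API; the sketch also assumes that a trailing partial window is simply dropped, which may differ from what the library does.

```rust
/// Hypothetical helper (not part of polyvoice): returns the `(start, end)`
/// sample ranges of all full windows of length `window`, advanced by `hop`.
/// A trailing partial window is dropped in this sketch.
pub fn sliding_windows(num_samples: usize, window: usize, hop: usize) -> Vec<(usize, usize)> {
    assert!(window > 0 && hop > 0, "window and hop must be non-zero");
    let mut ranges = Vec::new();
    let mut start = 0;
    // Keep emitting windows while a full window still fits in the buffer.
    while start + window <= num_samples {
        ranges.push((start, start + window));
        start += hop;
    }
    ranges
}
```

With a hop smaller than the window, consecutive windows overlap, so each stretch of audio contributes to several embeddings. For example, a 10 s buffer at 16 kHz (160,000 samples) with a 1.5 s window and a 0.75 s hop yields 12 full windows under these assumptions.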
Quick start
Add to your Cargo.toml:
[dependencies]
polyvoice = { git = "https://github.com/ekhodzitsky/polyvoice" }
Offline diarization
use ;
let config = DiarizationConfig::default();
let diarizer = new;
let extractor = new;
let samples: = vec!; // 10s of 16 kHz mono audio
let result = diarizer.run.unwrap;
for turn in &result.turns
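The post-processing that offline mode applies (segment merging, gap filling) can be pictured with a small standalone sketch. The `Turn` struct, the `merge_turns` function, and the `max_gap` parameter below are illustrative assumptions and do not reflect polyvoice's actual types or its merging rules.

```rust
/// Hypothetical stand-in for a speaker turn (not polyvoice's own type).
#[derive(Debug, Clone, PartialEq)]
pub struct Turn {
    pub speaker: u32,
    pub start: f64, // seconds
    pub end: f64,   // seconds
}

/// Merges consecutive turns of the same speaker when the silence between
/// them is at most `max_gap` seconds, which also fills the gap.
pub fn merge_turns(turns: &[Turn], max_gap: f64) -> Vec<Turn> {
    let mut sorted = turns.to_vec();
    sorted.sort_by(|a, b| a.start.total_cmp(&b.start));

    let mut merged: Vec<Turn> = Vec::new();
    for turn in sorted {
        if let Some(last) = merged.last_mut() {
            // Same speaker and a short enough gap: extend the previous turn.
            if last.speaker == turn.speaker && turn.start - last.end <= max_gap {
                last.end = last.end.max(turn.end);
                continue;
            }
        }
        merged.push(turn);
    }
    merged
}
```

In this sketch only neighbouring turns are merged, so a different speaker in between always keeps the two turns separate.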
Online diarization
use ;
let config = DiarizationConfig::default();
let mut diarizer = new;
let extractor = new;
// Feed audio chunks as they arrive (e.g. from a WebSocket stream)
let chunk = vec!; // 1 second
let segments = diarizer.feed.unwrap;
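Streaming diarization has to assign a speaker to each new embedding without seeing the rest of the audio. One common way to do this is incremental centroid clustering, sketched below. The `OnlineClusterer` type, its `threshold` parameter, and the running-mean update are assumptions made for illustration; polyvoice's cluster module may work differently.

```rust
/// Hypothetical incremental clusterer (not polyvoice's own type).
pub struct OnlineClusterer {
    threshold: f32,           // minimum cosine similarity to join a cluster
    centroids: Vec<Vec<f32>>, // running mean embedding per speaker
    counts: Vec<usize>,       // number of embeddings folded into each centroid
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

impl OnlineClusterer {
    pub fn new(threshold: f32) -> Self {
        Self { threshold, centroids: Vec::new(), counts: Vec::new() }
    }

    /// Returns the speaker index for `embedding`, creating a new speaker
    /// when no existing centroid is similar enough.
    pub fn assign(&mut self, embedding: &[f32]) -> usize {
        let best = self
            .centroids
            .iter()
            .enumerate()
            .map(|(i, c)| (i, cosine(c, embedding)))
            .max_by(|a, b| a.1.total_cmp(&b.1));

        match best {
            Some((i, sim)) if sim >= self.threshold => {
                // Fold the new embedding into the running mean.
                let n = self.counts[i] as f32;
                for (c, e) in self.centroids[i].iter_mut().zip(embedding) {
                    *c = (*c * n + e) / (n + 1.0);
                }
                self.counts[i] += 1;
                i
            }
            _ => {
                self.centroids.push(embedding.to_vec());
                self.counts.push(1);
                self.centroids.len() - 1
            }
        }
    }
}
```

The threshold trades off splitting one speaker into several clusters (too high) against merging distinct speakers (too low), so in practice it would be tuned per embedding model.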
With ONNX embedding extractor
Enable the onnx feature and use a WeSpeaker or ECAPA-TDNN ONNX model:
[dependencies]
polyvoice = { git = "https://github.com/ekhodzitsky/polyvoice", features = ["onnx"] }
WeSpeaker (raw audio input):
use ;
use std::path::Path;
let config = DiarizationConfig::default();
let extractor = new.unwrap;
let diarizer = new;
let result = diarizer.run.unwrap;
ECAPA-TDNN (fbank input):
use ;
use std::path::Path;
let config = DiarizationConfig::default();
let extractor = new.unwrap;
let diarizer = new;
let result = diarizer.run.unwrap;
Architecture
polyvoice
├── embedding # EmbeddingExtractor trait + ONNX pool implementation
├── cluster # Online incremental centroid clustering
├── vad # Voice Activity Detection trait + utilities
├── online # StreamingDiarizer (chunk-by-chunk)
├── offline # OfflineDiarizer (two-pass with post-processing)
├── overlap # Overlap detection from segment lists
└── types # Config, SpeakerId, Segment, WordAlignment, etc.
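As an illustration of what "overlap detection from segment lists" can mean, the sketch below intersects segments of different speakers and merges the resulting intervals. The local `Segment` struct and the `overlap_regions` function are hypothetical stand-ins and are not the definitions in polyvoice's `types` or `overlap` modules.

```rust
/// Hypothetical stand-in for a diarization segment (not polyvoice's own type).
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub speaker: u32,
    pub start: f64, // seconds
    pub end: f64,   // seconds
}

/// Returns the time regions in which at least two different speakers are
/// active, as sorted, non-overlapping `(start, end)` pairs.
pub fn overlap_regions(segments: &[Segment]) -> Vec<(f64, f64)> {
    // Collect pairwise intersections between segments of different speakers.
    let mut raw = Vec::new();
    for (i, a) in segments.iter().enumerate() {
        for b in &segments[i + 1..] {
            if a.speaker == b.speaker {
                continue;
            }
            let start = a.start.max(b.start);
            let end = a.end.min(b.end);
            if start < end {
                raw.push((start, end));
            }
        }
    }

    // Merge intersections that touch or overlap each other.
    raw.sort_by(|x, y| x.0.total_cmp(&y.0));
    let mut merged: Vec<(f64, f64)> = Vec::new();
    for (s, e) in raw {
        if let Some(last) = merged.last_mut() {
            if s <= last.1 {
                last.1 = last.1.max(e);
                continue;
            }
        }
        merged.push((s, e));
    }
    merged
}
```

The pairwise loop is quadratic in the number of segments, which keeps the sketch short; a sweep over sorted start and end events would scale better for long recordings.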
Configuration
use ;
let config = DiarizationConfig ;
Benchmarks
Measures offline diarization latency and ECAPA fbank throughput on synthetic multi-speaker audio.
Roadmap to 1.0
- ECAPA-TDNN ONNX extractor (in addition to WeSpeaker)
- C FFI bindings
- Agglomerative re-clustering pass for offline mode
- PLDA scoring backend
- no_std support for embedded targets
License
MIT