polyvoice
Speaker diarization library for Rust — online (streaming) and offline (file-based), ONNX-powered, and ecosystem-agnostic.
polyvoice answers the question "who spoke when?" in audio streams or files. It is designed to be embedded into STT servers such as gigastt, phostt, nihostt, siamstt, or any other Rust application.
Features
- Online (streaming) diarization — process audio chunk-by-chunk in real time.
- Offline (file) diarization — process an entire audio buffer with post-processing (segment merging, gap filling).
- Sliding-window embeddings — configurable window and hop sizes instead of fixed segments.
- ECAPA-TDNN ONNX extractor — built-in 80-bin log-mel filterbank + ONNX inference.
- Session pool for ONNX models — no Mutex contention under concurrent load.
- VAD integration trait — plug in Silero VAD, Energy VAD, or your own implementation.
- Overlap detection — identify regions where multiple speakers are active simultaneously.
- Word-level speaker alignment — assign speaker IDs to individual words using timestamps.
- Zero Python dependencies — pure Rust + ONNX Runtime.
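To make the sliding-window feature above concrete, here is a minimal sketch of how window and hop sizes translate into sample ranges. The function name `sliding_windows` and the sizes used in the note below are illustrative assumptions, not polyvoice's actual API or defaults.

```rust
/// Returns `(start, end)` sample ranges for fixed-size windows advanced by `hop`.
/// Hypothetical helper for illustration; not part of polyvoice.
/// Trailing samples that do not fill a whole window are dropped in this sketch.
pub fn sliding_windows(num_samples: usize, window: usize, hop: usize) -> Vec<(usize, usize)> {
    assert!(window > 0 && hop > 0, "window and hop must be non-zero");
    let mut ranges = Vec::new();
    let mut start = 0;
    // Keep emitting windows while a full window still fits in the buffer.
    while start + window <= num_samples {
        ranges.push((start, start + window));
        start += hop;
    }
    ranges
}
```

With 10 s of 16 kHz audio (160,000 samples), an assumed 1.5 s window (24,000 samples) and 0.75 s hop (12,000 samples), this yields 12 windows, each overlapping its neighbour by half. A smaller hop gives finer time resolution at the cost of more embedding extractions.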
Quick start
Add to your Cargo.toml:
[dependencies]
polyvoice = { git = "https://github.com/ekhodzitsky/polyvoice" }
Offline diarization
use polyvoice::…;
let config = DiarizationConfig::default();
let diarizer = OfflineDiarizer::new(…);
let extractor = …::new(…);
let samples: Vec<…> = vec![…]; // 10 s of 16 kHz mono audio
let result = diarizer.run(…).unwrap();
for turn in &result.turns { … }
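The post-processing that offline mode applies (segment merging, gap filling) can be pictured with a short sketch. The `Turn` struct, the `merge_turns` function, and the `max_gap` parameter are assumptions made for illustration; polyvoice's own types and merging rules may differ.

```rust
/// Hypothetical speaker turn, with times in seconds. Not polyvoice's own type.
#[derive(Debug, Clone, PartialEq)]
pub struct Turn {
    pub speaker: u32,
    pub start: f64,
    pub end: f64,
}

/// Merges consecutive turns of the same speaker when the gap between them
/// is at most `max_gap` seconds, which also fills that gap.
pub fn merge_turns(mut turns: Vec<Turn>, max_gap: f64) -> Vec<Turn> {
    // Process turns in chronological order.
    turns.sort_by(|a, b| a.start.total_cmp(&b.start));
    let mut merged: Vec<Turn> = Vec::new();
    for turn in turns {
        if let Some(prev) = merged.last_mut() {
            if prev.speaker == turn.speaker && turn.start - prev.end <= max_gap {
                // Same speaker and a short pause: extend the previous turn.
                prev.end = prev.end.max(turn.end);
                continue;
            }
        }
        merged.push(turn);
    }
    merged
}
```

Only adjacent turns are compared, so a turn from another speaker in between keeps two turns of the same speaker separate.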
Online diarization
use polyvoice::…;
let config = DiarizationConfig::default();
let mut diarizer = StreamingDiarizer::new(…);
let extractor = …::new(…);
// Feed audio chunks as they arrive (e.g. from a WebSocket stream)
let chunk = vec![…]; // 1 second
let segments = diarizer.feed(…).unwrap();
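The Architecture section lists online incremental centroid clustering as the clustering approach. As a rough illustration of that general technique, here is a minimal sketch: each incoming embedding joins the most similar centroid when cosine similarity reaches a threshold, and otherwise starts a new speaker. The type `OnlineClusterer`, its methods, and the threshold value are hypothetical and do not reflect polyvoice's actual implementation.

```rust
/// Cosine similarity between two equal-length vectors; 0.0 if either has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Hypothetical incremental clusterer; not polyvoice's own type.
pub struct OnlineClusterer {
    threshold: f32,
    centroids: Vec<Vec<f32>>, // running mean embedding per speaker
    counts: Vec<usize>,       // number of embeddings folded into each centroid
}

impl OnlineClusterer {
    pub fn new(threshold: f32) -> Self {
        Self { threshold, centroids: Vec::new(), counts: Vec::new() }
    }

    /// Assigns `emb` to the most similar centroid, or opens a new speaker
    /// when no centroid reaches the similarity threshold. Returns the speaker index.
    pub fn assign(&mut self, emb: &[f32]) -> usize {
        let best = self
            .centroids
            .iter()
            .enumerate()
            .map(|(i, c)| (i, cosine(c, emb)))
            .max_by(|a, b| a.1.total_cmp(&b.1));
        match best {
            Some((i, sim)) if sim >= self.threshold => {
                // Update the running mean: c += (e - c) / n.
                self.counts[i] += 1;
                let n = self.counts[i] as f32;
                for (c, e) in self.centroids[i].iter_mut().zip(emb) {
                    *c += (e - *c) / n;
                }
                i
            }
            _ => {
                self.centroids.push(emb.to_vec());
                self.counts.push(1);
                self.centroids.len() - 1
            }
        }
    }
}
```

The threshold trades off the two failure modes: too low and distinct speakers are merged, too high and one speaker is split into several.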
With ONNX embedding extractor
Enable the onnx feature and use a WeSpeaker or ECAPA-TDNN ONNX model:
[dependencies]
polyvoice = { git = "https://github.com/ekhodzitsky/polyvoice", features = ["onnx"] }
WeSpeaker (raw audio input):
use polyvoice::…;
use std::path::Path;
let config = DiarizationConfig::default();
let extractor = …::new(…).unwrap();
let diarizer = OfflineDiarizer::new(…);
let result = diarizer.run(…).unwrap();
ECAPA-TDNN (fbank input):
use polyvoice::…;
use std::path::Path;
let config = DiarizationConfig::default();
let extractor = …::new(…).unwrap();
let diarizer = OfflineDiarizer::new(…);
let result = diarizer.run(…).unwrap();
Architecture
polyvoice
├── embedding # EmbeddingExtractor trait + ONNX pool implementation
├── cluster # Online incremental centroid clustering
├── vad # Voice Activity Detection trait + utilities
├── online # StreamingDiarizer (chunk-by-chunk)
├── offline # OfflineDiarizer (two-pass with post-processing)
├── overlap # Overlap detection from segment lists
└── types # Config, SpeakerId, Segment, WordAlignment, etc.
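As an illustration of what overlap detection from a segment list involves, the sketch below intersects every pair of segments from different speakers and merges the resulting intervals. The `Segment` fields and the `overlap_regions` function are assumptions for this example; the actual `overlap` module may use different types and a different algorithm.

```rust
/// Hypothetical diarization segment, with times in seconds. Not polyvoice's own type.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub speaker: u32,
    pub start: f64,
    pub end: f64,
}

/// Returns merged `(start, end)` regions where at least two different speakers are active.
pub fn overlap_regions(segments: &[Segment]) -> Vec<(f64, f64)> {
    // Collect the intersection of every pair of segments from different speakers.
    let mut raw = Vec::new();
    for (i, a) in segments.iter().enumerate() {
        for b in &segments[i + 1..] {
            if a.speaker == b.speaker {
                continue;
            }
            let start = a.start.max(b.start);
            let end = a.end.min(b.end);
            if start < end {
                raw.push((start, end));
            }
        }
    }
    // Merge intersections that touch or overlap each other.
    raw.sort_by(|x, y| x.0.total_cmp(&y.0));
    let mut merged: Vec<(f64, f64)> = Vec::new();
    for (start, end) in raw {
        if let Some(last) = merged.last_mut() {
            if start <= last.1 {
                last.1 = last.1.max(end);
                continue;
            }
        }
        merged.push((start, end));
    }
    merged
}
```

The pairwise comparison is quadratic in the number of segments, which is fine for a sketch; a sweep over sorted start and end events would scale better on long recordings.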
Configuration
use polyvoice::…;
let config = DiarizationConfig { … };
Benchmarks
The benchmarks measure offline diarization latency and ECAPA fbank throughput on synthetic multi-speaker audio.
Roadmap to 1.0
- ECAPA-TDNN ONNX extractor (in addition to WeSpeaker)
- C FFI bindings
- Agglomerative re-clustering pass for offline mode
- PLDA scoring backend
- no_std support for embedded targets
License
MIT