# polyvoice
Speaker diarization for Rust — real-time, accurate, and production-hardened.
Turn any audio stream into a clear timeline of who spoke when.
## What is speaker diarization?
Speech-to-text tells you what was said. Speaker diarization tells you who said it.
```text
Input:  "hello world how are you"

Output: SPEAKER_00: 0.0s - 1.2s  "hello world"
        SPEAKER_01: 1.5s - 2.8s  "how are you"
```
Without diarization, transcripts are a wall of text. With it, every word is attributed to the right person — essential for meeting minutes, call analytics, podcasts, and court recordings.
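The attribution step can be sketched independently of any diarizer: given turns like the output above and word timestamps from an ASR engine, each word goes to the turn that contains its midpoint. A minimal illustration (the `Turn` struct and `align` helper here are hypothetical, not polyvoice's API):

```rust
/// Hypothetical diarization turn: speaker label plus start/end in seconds.
struct Turn {
    speaker: &'static str,
    start: f32,
    end: f32,
}

/// Assign each (word, start, end) triple to the speaker whose turn
/// contains the word's midpoint; unmatched words get "UNKNOWN".
fn align(words: &[(&str, f32, f32)], turns: &[Turn]) -> Vec<(String, String)> {
    words
        .iter()
        .map(|&(word, start, end)| {
            let mid = (start + end) / 2.0;
            let speaker = turns
                .iter()
                .find(|t| t.start <= mid && mid <= t.end)
                .map_or("UNKNOWN", |t| t.speaker);
            (speaker.to_string(), word.to_string())
        })
        .collect()
}

fn main() {
    let turns = [
        Turn { speaker: "SPEAKER_00", start: 0.0, end: 1.2 },
        Turn { speaker: "SPEAKER_01", start: 1.5, end: 2.8 },
    ];
    let words = [("hello", 0.0, 0.4), ("world", 0.5, 1.1), ("how", 1.5, 1.8)];
    for (speaker, word) in align(&words, &turns) {
        println!("{speaker}: {word}");
    }
}
```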
## Why polyvoice?
| You need... | polyvoice delivers |
|---|---|
| Real-time streaming | `OnlineDiarizer` processes audio chunk-by-chunk with sub-second latency |
| File-based batch | `OfflineDiarizer` two-pass pipeline with gap merging and overlap detection |
| No Python in production | Pure Rust + ONNX Runtime. No GIL, no virtualenv, no dependency hell |
| Concurrent inference | Lock-free ONNX session pool — scale to many connections without `Mutex` contention |
| Plug your own model | `EmbeddingExtractor` trait: WeSpeaker, ECAPA-TDNN, or your custom ONNX model |
| C FFI | Drop-in `.so`/`.dylib`/`.dll` for Python, Go, Node.js, or C++ callers |
| Safety guarantees | Verified with Miri (unsafe memory), Loom (concurrency model-checking), and fuzzing |
## Quick start
```toml
[dependencies]
polyvoice = "0.4"
```
### Offline diarization (file / batch)
```rust
// NOTE: reconstructed example; exact type and method names may differ
// from the published API. See the crate docs.
use polyvoice::{DiarizationConfig, OfflineDiarizer};

let config = DiarizationConfig::default();
let diarizer = OfflineDiarizer::new(config);
let extractor = /* any EmbeddingExtractor implementation */;

let samples: Vec<f32> = vec![0.0; 160_000]; // 10 s of 16 kHz mono audio
let result = diarizer.run(&samples, &extractor).unwrap();
for turn in &result.turns {
    println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.start, turn.end);
}
```
### Real-time streaming
```rust
// NOTE: reconstructed example; exact names may differ. See the crate docs.
use polyvoice::{DiarizationConfig, OnlineDiarizer};

let config = DiarizationConfig::default();
let mut diarizer = OnlineDiarizer::new(config);
let extractor = /* any EmbeddingExtractor implementation */;

while let Some(chunk) = microphone.read() {
    // Feed each chunk as it arrives; turns are emitted incrementally.
    let turns = diarizer.process_chunk(&chunk, &extractor);
}
```
### With an ONNX model (WeSpeaker / ECAPA-TDNN)
```toml
[dependencies]
polyvoice = { version = "0.4", features = ["onnx"] }
```
```rust
// NOTE: reconstructed example; extractor type name and model path are
// illustrative. See the crate docs.
use polyvoice::{DiarizationConfig, OfflineDiarizer};
use std::path::Path;

let config = DiarizationConfig::default();
let extractor = OnnxExtractor::new(Path::new("path/to/model.onnx")).unwrap();
let diarizer = OfflineDiarizer::new(config);
let result = diarizer.run(&samples, &extractor).unwrap();
```
## Architecture
```text
Input audio (f32 PCM)
        │
        ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│     VAD      │ --> │  Embedding   │ --> │   Speaker    │ --> Turns / Segments
│  (optional)  │     │  Extractor   │     │   Cluster    │
└──────────────┘     └──────────────┘     └──────────────┘
                       ONNX pool            Incremental
                       (lock-free)          cosine-sim clustering
```
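The "Speaker Cluster" stage can be illustrated with a toy version of incremental cosine-similarity clustering: each new embedding joins the nearest existing centroid, or opens a new speaker when similarity drops below a threshold. This is a self-contained sketch, not polyvoice's actual implementation:

```rust
/// Cosine similarity of two vectors (small epsilon avoids divide-by-zero).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb + 1e-9)
}

/// Assign `embedding` to the nearest centroid (updating it as a running
/// mean) or open a new cluster; returns the speaker index.
fn assign(centroids: &mut Vec<(Vec<f32>, usize)>, embedding: &[f32], threshold: f32) -> usize {
    let best: Option<(usize, f32)> = centroids
        .iter()
        .enumerate()
        .map(|(i, (c, _))| (i, cosine(c, embedding)))
        .max_by(|a, b| a.1.total_cmp(&b.1));
    match best {
        Some((i, sim)) if sim >= threshold => {
            let (c, n) = &mut centroids[i];
            *n += 1;
            let k = *n as f32;
            for (cj, ej) in c.iter_mut().zip(embedding) {
                *cj += (ej - *cj) / k; // running-mean update
            }
            i
        }
        _ => {
            centroids.push((embedding.to_vec(), 1));
            centroids.len() - 1
        }
    }
}

fn main() {
    let mut centroids: Vec<(Vec<f32>, usize)> = Vec::new();
    // Two clearly separated directions -> two speakers.
    assert_eq!(assign(&mut centroids, &[1.0, 0.0], 0.7), 0);
    assert_eq!(assign(&mut centroids, &[0.9, 0.1], 0.7), 0);
    assert_eq!(assign(&mut centroids, &[0.0, 1.0], 0.7), 1);
    println!("speakers: {}", centroids.len());
}
```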
## Key features
- 🎙️ Online & Offline — stream chunks in real time or process entire files in one shot.
- 🧠 ONNX-powered — ECAPA-TDNN and WeSpeaker extractors with built-in 80-bin log-mel filterbank.
- ⚡ Lock-free session pool — a `crossbeam-queue`-backed pool eliminates `Mutex` contention under concurrent load.
- 🔌 VAD trait — plug in Silero VAD, Energy VAD, or your own voice-activity detector.
- 🗣️ Overlap detection — find regions where multiple speakers talk simultaneously.
- 📝 Word alignment — assign speaker IDs to individual transcript words by timestamp.
- 🔒 Memory-safe FFI — C ABI with Miri-verified unsafe code and Valgrind-tested Python bindings.
- 🦀 Pure Rust — zero Python dependencies in production.
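To make the pluggable-VAD idea concrete, here is a toy energy-based detector behind an illustrative `Vad` trait (polyvoice's real trait signature may differ):

```rust
/// Illustrative VAD interface; polyvoice's actual trait may differ.
trait Vad {
    /// Returns true if the frame contains speech.
    fn is_speech(&self, frame: &[f32]) -> bool;
}

/// Flags a frame as speech when its RMS energy exceeds a threshold.
struct EnergyVad {
    threshold: f32,
}

impl Vad for EnergyVad {
    fn is_speech(&self, frame: &[f32]) -> bool {
        if frame.is_empty() {
            return false;
        }
        let rms = (frame.iter().map(|x| x * x).sum::<f32>() / frame.len() as f32).sqrt();
        rms > self.threshold
    }
}

fn main() {
    let vad = EnergyVad { threshold: 0.05 };
    assert!(!vad.is_speech(&[0.001_f32; 160])); // near-silence
    assert!(vad.is_speech(&[0.3_f32; 160]));    // loud frame
    println!("ok");
}
```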
## Production readiness
This crate is hardened for production use:
| Verification | Tool |
|---|---|
| Unsafe memory safety | Miri (nightly CI) |
| Concurrency correctness | Loom model-checking |
| Input fuzzing | cargo-fuzz (4 targets, nightly CI) |
| API stability | cargo-semver-checks |
| Cross-platform | Ubuntu, macOS, Windows CI |
| Dependency audit | cargo-audit |
## Benchmarks
| Benchmark | Metric |
|---|---|
| Offline diarization (10 s) | Latency on synthetic two-speaker audio |
| ECAPA fbank (10 s) | Log-mel throughput |
| DER (10 s) | Diarization Error Rate vs. ground truth |
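For context on the DER row: Diarization Error Rate is (missed speech + false alarm + speaker confusion) divided by total reference speech time. A simplified frame-level sketch, assuming speaker labels are already optimally mapped and ignoring the usual forgiveness collar:

```rust
/// Frame-level DER over pre-mapped label sequences. `None` = non-speech.
/// Real DER tooling also searches the optimal reference/hypothesis
/// speaker mapping and applies a collar around boundaries.
fn der(reference: &[Option<u32>], hypothesis: &[Option<u32>]) -> f32 {
    let mut errors = 0usize;
    let mut speech = 0usize;
    for (r, h) in reference.iter().zip(hypothesis) {
        if r.is_some() {
            speech += 1;
        }
        match (r, h) {
            (Some(a), Some(b)) if a != b => errors += 1, // speaker confusion
            (Some(_), None) => errors += 1,              // missed speech
            (None, Some(_)) => errors += 1,              // false alarm
            _ => {}
        }
    }
    errors as f32 / speech.max(1) as f32
}

fn main() {
    let reference = [Some(0), Some(0), Some(1), Some(1), None];
    let hypothesis = [Some(0), Some(0), Some(0), Some(1), None];
    // 1 confused frame out of 4 speech frames -> DER = 0.25
    println!("DER = {}", der(&reference, &hypothesis));
}
```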
## Configuration
```rust
use polyvoice::DiarizationConfig;

// Start from defaults and override fields as needed (see the
// DiarizationConfig docs for the available fields).
let config = DiarizationConfig { ..DiarizationConfig::default() };
```
## Roadmap
- [x] ECAPA-TDNN ONNX extractor
- [x] C FFI bindings
- [x] Miri / Loom / fuzz verification
- [x] Cross-platform CI
- [ ] Agglomerative re-clustering pass for offline mode
- [ ] PLDA scoring backend
- [ ] `no_std` support for embedded targets
## License
MIT