# polyvoice
[CI](https://github.com/ekhodzitsky/polyvoice/actions/workflows/ci.yml)
[crates.io](https://crates.io/crates/polyvoice)
[docs.rs](https://docs.rs/polyvoice)
[License](LICENSE)
> **Speaker diarization for Rust — real-time, accurate, and production-hardened.**
>
> Turn any audio stream into a clear timeline of *who spoke when*.
## What is speaker diarization?
Speech-to-text tells you **what** was said. Speaker diarization tells you **who** said it.
```
Input:  "hello world how are you"
Output: SPEAKER_00: 0.0s - 1.2s  "hello world"
        SPEAKER_01: 1.5s - 2.8s  "how are you"
```
Without diarization, transcripts are a wall of text. With it, every word is attributed to the right person — essential for meeting minutes, call analytics, podcasts, and court recordings.
## Why polyvoice?
| Feature | Description |
|---|---|
| **Real-time streaming** | `OnlineDiarizer` processes audio chunk-by-chunk with sub-second latency |
| **File-based batch** | `OfflineDiarizer` two-pass pipeline with gap merging and overlap detection |
| **No Python in production** | Pure Rust + ONNX Runtime. No GIL, no virtualenv, no dependency hell |
| **Concurrent inference** | Lock-free ONNX session pool — scale to many connections without `Mutex` contention |
| **Plug your own model** | `EmbeddingExtractor` trait: WeSpeaker, ECAPA-TDNN, or your custom ONNX model |
| **C FFI** | Drop-in `.so`/`.dylib`/`.dll` for Python, Go, Node.js, or C++ callers |
| **Safety guarantees** | Verified with Miri (unsafe memory), Loom (concurrency model-checking), and fuzzing |
## Quick start
```toml
[dependencies]
polyvoice = "0.4"
```
### Offline diarization (file / batch)
```rust
use polyvoice::{OfflineDiarizer, DiarizationConfig, DummyExtractor};
let config = DiarizationConfig::default();
let diarizer = OfflineDiarizer::new(config);
let extractor = DummyExtractor::new(256);
let samples: Vec<f32> = vec![0.0; 16000 * 10]; // 10 s of 16 kHz mono audio
let result = diarizer.run(&samples, &extractor).unwrap();
for turn in &result.turns {
    println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.time.start, turn.time.end);
}
```
### Real-time streaming
```rust
use polyvoice::{OnlineDiarizer, DiarizationConfig, DummyExtractor};
let config = DiarizationConfig::default();
let mut diarizer = OnlineDiarizer::new(config);
let extractor = DummyExtractor::new(256);
// `microphone` is a placeholder for your own audio source; it is not defined in this snippet.
while let Some(chunk) = microphone.read() {
    let segments = diarizer.feed(&chunk, &extractor).unwrap();
    for seg in segments {
        println!("Speaker {:?} from {:.2}s", seg.speaker, seg.time.start);
    }
}
```
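To try the streaming loop without a capture device, one option is to slice a pre-recorded buffer into fixed-duration chunks. The sketch below uses only the standard library; `timed_chunks` and the 100 ms chunk size are illustrative assumptions, not part of polyvoice.

```rust
/// Hypothetical helper: yields `(start_time_secs, chunk)` pairs of roughly
/// `chunk_ms` milliseconds each. The final chunk may be shorter.
fn timed_chunks(
    samples: &[f32],
    sample_rate: u32,
    chunk_ms: u32,
) -> impl Iterator<Item = (f64, &[f32])> {
    let len = (sample_rate as usize * chunk_ms as usize) / 1000;
    assert!(len > 0, "chunk must contain at least one sample");
    samples
        .chunks(len)
        .enumerate()
        .map(move |(i, chunk)| ((i * len) as f64 / sample_rate as f64, chunk))
}

fn main() {
    // 10 s of silence at 16 kHz stands in for recorded audio.
    let samples = vec![0.0f32; 16_000 * 10];
    for (start_secs, chunk) in timed_chunks(&samples, 16_000, 100).take(3) {
        // A real loop would hand `chunk` to the diarizer here.
        println!("chunk at {start_secs:.2}s with {} samples", chunk.len());
    }
}
```

Smaller chunks lower latency at the cost of more calls per second of audio; the right size depends on your source and is not prescribed here.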
### With an ONNX model (WeSpeaker / ECAPA-TDNN)
```toml
[dependencies]
polyvoice = { version = "0.4", features = ["onnx"] }
```
```rust
use polyvoice::{EcapaTdnnExtractor, OfflineDiarizer, DiarizationConfig};
use std::path::Path;
let config = DiarizationConfig::default();
let extractor = EcapaTdnnExtractor::new(
    Path::new("ecapa_tdnn.onnx"),
    192, // embedding dimension
    4,   // session pool size
).unwrap();
let diarizer = OfflineDiarizer::new(config);
let samples: Vec<f32> = vec![0.0; 16000 * 10]; // placeholder: 10 s of 16 kHz mono audio
let result = diarizer.run(&samples, &extractor).unwrap();
```
## Architecture
```
Input audio (f32 PCM)
       │
       ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│     VAD      │ --> │  Embedding   │ --> │   Speaker    │ --> Turns / Segments
│  (optional)  │     │  Extractor   │     │   Cluster    │
└──────────────┘     └──────────────┘     └──────────────┘
                        ONNX pool           Incremental
                       (lock-free)     cosine-sim clustering
```
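The last stage of the diagram, incremental cosine-similarity clustering, can be illustrated with a minimal standalone sketch. It is not the crate's implementation: `IncrementalClusterer`, the running-mean centroid update, and the fallback to the closest speaker once the limit is reached are all assumptions made for illustration. It also assumes `max_speakers` is at least 1.

```rust
/// Cosine similarity between two equal-length vectors; 0.0 if either has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Hypothetical online clusterer that keeps one running-mean centroid per speaker.
struct IncrementalClusterer {
    threshold: f32,
    max_speakers: usize,
    centroids: Vec<Vec<f32>>,
    counts: Vec<usize>,
}

impl IncrementalClusterer {
    fn new(threshold: f32, max_speakers: usize) -> Self {
        Self { threshold, max_speakers, centroids: Vec::new(), counts: Vec::new() }
    }

    /// Returns the speaker index for `embedding`. A new speaker is created when
    /// no centroid reaches the threshold and the speaker limit is not yet hit.
    fn assign(&mut self, embedding: &[f32]) -> usize {
        // Find the most similar existing centroid, if any.
        let best = self
            .centroids
            .iter()
            .enumerate()
            .map(|(i, c)| (i, cosine_similarity(c, embedding)))
            .max_by(|a, b| a.1.total_cmp(&b.1));

        match best {
            Some((i, sim))
                if sim >= self.threshold || self.centroids.len() >= self.max_speakers =>
            {
                // Fold the new embedding into the running mean of the centroid.
                let n = self.counts[i] as f32;
                for (c, e) in self.centroids[i].iter_mut().zip(embedding) {
                    *c = (*c * n + e) / (n + 1.0);
                }
                self.counts[i] += 1;
                i
            }
            _ => {
                // Nothing similar enough: start a new speaker.
                self.centroids.push(embedding.to_vec());
                self.counts.push(1);
                self.centroids.len() - 1
            }
        }
    }
}

fn main() {
    let mut clusterer = IncrementalClusterer::new(0.5, 64);
    let a = clusterer.assign(&[1.0, 0.0]);
    let b = clusterer.assign(&[0.0, 1.0]);
    let c = clusterer.assign(&[0.9, 0.1]);
    println!("{a} {b} {c}"); // prints "0 1 0"
}
```

Because each embedding is assigned as it arrives, early decisions are never revisited; that trade-off is what makes this style of clustering usable on a live stream.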
## Key features
- **🎙️ Online & Offline** — stream chunks in real time or process entire files in one shot.
- **🧠 ONNX-powered** — ECAPA-TDNN and WeSpeaker extractors with built-in 80-bin log-mel filterbank.
- **⚡ Lock-free session pool** — `crossbeam-queue` backed pool eliminates `Mutex` contention under concurrent load.
- **🔌 VAD trait** — plug in Silero VAD, Energy VAD, or your own voice-activity detector.
- **🗣️ Overlap detection** — find regions where multiple speakers talk simultaneously.
- **📝 Word alignment** — assign speaker IDs to individual transcript words by timestamp.
- **🔒 Memory-safe FFI** — C ABI with Miri-verified unsafe code and Valgrind-tested Python bindings.
- **🦀 Pure Rust** — zero Python dependencies in production.
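
Word alignment by timestamp can be pictured as an interval-overlap lookup. The sketch below is standalone and does not use the crate's API: `Turn`, `Word`, and `align_words` are hypothetical names, and "pick the turn with the largest overlap" is one plausible rule rather than a description of what polyvoice does internally.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Turn {
    speaker: usize,
    start: f64,
    end: f64,
}

#[derive(Debug, Clone)]
struct Word {
    text: String,
    start: f64,
    end: f64,
}

/// Length in seconds of the intersection of two intervals (0.0 if disjoint).
fn overlap(a_start: f64, a_end: f64, b_start: f64, b_end: f64) -> f64 {
    (a_end.min(b_end) - a_start.max(b_start)).max(0.0)
}

/// For each word, returns the speaker of the turn it overlaps most,
/// or `None` when the word overlaps no turn at all.
fn align_words(words: &[Word], turns: &[Turn]) -> Vec<Option<usize>> {
    words
        .iter()
        .map(|w| {
            turns
                .iter()
                .map(|t| (t.speaker, overlap(w.start, w.end, t.start, t.end)))
                .filter(|(_, secs)| *secs > 0.0)
                .max_by(|a, b| a.1.total_cmp(&b.1))
                .map(|(speaker, _)| speaker)
        })
        .collect()
}

fn main() {
    let turns = vec![
        Turn { speaker: 0, start: 0.0, end: 1.2 },
        Turn { speaker: 1, start: 1.5, end: 2.8 },
    ];
    let words = vec![
        Word { text: "hello".into(), start: 0.0, end: 0.5 },
        Word { text: "world".into(), start: 0.6, end: 1.2 },
        Word { text: "how".into(), start: 1.5, end: 1.9 },
    ];
    for (word, speaker) in words.iter().zip(align_words(&words, &turns)) {
        println!("{:?} -> {:?}", word.text, speaker);
    }
}
```

Words that fall entirely inside a silence gap map to `None` here; a caller could instead fall back to the nearest turn.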
## Production readiness
This crate is hardened for production use:
| Check | Tooling |
|---|---|
| Unsafe memory safety | **Miri** (nightly CI) |
| Concurrency correctness | **Loom** model-checking |
| Input fuzzing | **cargo-fuzz** (4 targets, nightly CI) |
| API stability | **cargo-semver-checks** |
| Cross-platform | Ubuntu, macOS, Windows CI |
| Dependency audit | **cargo-audit** |
## Benchmarks
```bash
cargo bench --all-features
```
| Benchmark | What it measures |
|---|---|
| Offline diarization (10 s) | Latency on synthetic two-speaker audio |
| ECAPA fbank (10 s) | Log-mel throughput |
| DER (10 s) | Diarization Error Rate vs. ground truth |
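
DER is conventionally the sum of false-alarm, missed-speech, and speaker-confusion time divided by the total reference speech time. The sketch below is a simplified frame-level version, not the crate's benchmark code: `frame_der` is a hypothetical name, it assumes hypothesis speaker labels are already mapped onto reference labels, and it ignores overlapping speech and scoring collars.

```rust
/// Frame-level Diarization Error Rate.
/// Each slice holds one label per frame: `Some(speaker)` or `None` for silence.
/// Returns `None` when the reference contains no speech (DER is undefined).
fn frame_der(reference: &[Option<usize>], hypothesis: &[Option<usize>]) -> Option<f64> {
    assert_eq!(reference.len(), hypothesis.len(), "label sequences must align");
    let mut false_alarm = 0usize;
    let mut missed = 0usize;
    let mut confusion = 0usize;
    let mut ref_speech = 0usize;

    for (r, h) in reference.iter().zip(hypothesis) {
        if r.is_some() {
            ref_speech += 1;
        }
        match (r, h) {
            (None, Some(_)) => false_alarm += 1, // speech predicted during silence
            (Some(_), None) => missed += 1,      // speech not detected
            (Some(a), Some(b)) if a != b => confusion += 1, // wrong speaker
            _ => {}                              // correct frame
        }
    }

    if ref_speech == 0 {
        return None;
    }
    Some((false_alarm + missed + confusion) as f64 / ref_speech as f64)
}

fn main() {
    let s = Some;
    let reference = [s(0), s(0), s(0), s(0), None, s(1), s(1), s(1), s(1), s(1)];
    let hypothesis = [s(0), s(0), s(0), None, s(1), s(1), s(1), s(0), s(1), s(1)];
    // One missed frame, one false alarm, one confusion over nine speech frames.
    println!("DER = {:.3}", frame_der(&reference, &hypothesis).unwrap());
}
```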
## Configuration
```rust
use polyvoice::{DiarizationConfig, SampleRate};
let config = DiarizationConfig {
    threshold: 0.5,        // cosine similarity threshold
    max_speakers: 64,      // hard speaker limit
    window_secs: 1.5,      // analysis window
    hop_secs: 0.75,        // sliding step
    min_speech_secs: 0.25, // discard shorter segments
    max_gap_secs: 0.5,     // merge same-speaker gaps under 500 ms
    sample_rate: SampleRate::new(16000).unwrap(),
};
```
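To make the timing parameters concrete, here is a standalone sketch of how they could translate into window counts and post-processing. It does not call polyvoice: `full_windows` and `merge_and_filter` are hypothetical helpers, and the choices to skip tail padding and to merge gaps before dropping short turns are assumptions, not documented behavior.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Turn {
    speaker: usize,
    start: f64,
    end: f64,
}

/// Number of full analysis windows that fit in `num_samples`,
/// assuming the tail is not padded.
fn full_windows(num_samples: usize, sample_rate: u32, window_secs: f64, hop_secs: f64) -> usize {
    let window = (window_secs * sample_rate as f64).round() as usize;
    let hop = (hop_secs * sample_rate as f64).round() as usize;
    if window == 0 || hop == 0 || num_samples < window {
        return 0;
    }
    (num_samples - window) / hop + 1
}

/// Merges consecutive same-speaker turns whose gap is under `max_gap_secs`,
/// then drops turns shorter than `min_speech_secs`.
/// Assumes `turns` is sorted by start time.
fn merge_and_filter(turns: &[Turn], max_gap_secs: f64, min_speech_secs: f64) -> Vec<Turn> {
    let mut merged: Vec<Turn> = Vec::new();
    for turn in turns {
        let can_merge = merged.last().map_or(false, |prev| {
            prev.speaker == turn.speaker && turn.start - prev.end < max_gap_secs
        });
        if can_merge {
            if let Some(prev) = merged.last_mut() {
                prev.end = prev.end.max(turn.end);
            }
        } else {
            merged.push(turn.clone());
        }
    }
    merged.retain(|t| t.end - t.start >= min_speech_secs);
    merged
}

fn main() {
    // 10 s at 16 kHz with a 1.5 s window and 0.75 s hop.
    println!("windows: {}", full_windows(160_000, 16_000, 1.5, 0.75)); // prints "windows: 12"

    let turns = vec![
        Turn { speaker: 0, start: 0.0, end: 1.0 },
        Turn { speaker: 0, start: 1.3, end: 2.0 }, // 0.3 s gap: merged into the previous turn
        Turn { speaker: 1, start: 2.1, end: 2.2 }, // 0.1 s long: dropped
    ];
    println!("{:?}", merge_and_filter(&turns, 0.5, 0.25));
}
```

With the values shown in the configuration, a 1.5 s window at 16 kHz spans 24,000 samples and each hop advances 12,000 samples, so consecutive windows overlap by half.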
## Roadmap
- [x] ECAPA-TDNN ONNX extractor
- [x] C FFI bindings
- [x] Miri / Loom / fuzz verification
- [x] Cross-platform CI
- [ ] Agglomerative re-clustering pass for offline mode
- [ ] PLDA scoring backend
- [ ] `no_std` support for embedded targets
## License
MIT