# polyvoice
Speaker diarization for Rust — real-time, accurate, and production-hardened.
Turn any audio stream into a clear timeline of who spoke when.
## What is speaker diarization?
Speech-to-text tells you what was said. Speaker diarization tells you who said it.
```text
Input:  "hello world how are you"

Output: SPEAKER_00: 0.0s - 1.2s  "hello world"
        SPEAKER_01: 1.5s - 2.8s  "how are you"
```
Without diarization, transcripts are a wall of text. With it, every word is attributed to the right person — essential for meeting minutes, call analytics, podcasts, and court recordings.
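The attribution step can be sketched independently of any diarizer: given turns like the output above and word timestamps from an ASR engine, each word goes to the turn that contains its midpoint. A minimal illustration (the `Turn` struct and `align` helper here are hypothetical, not polyvoice's API):

```rust
/// Hypothetical diarization turn: speaker label plus start/end in seconds.
struct Turn {
    speaker: &'static str,
    start: f32,
    end: f32,
}

/// Assign each (word, start, end) triple to the speaker whose turn
/// contains the word's midpoint; unmatched words get "UNKNOWN".
fn align(words: &[(&str, f32, f32)], turns: &[Turn]) -> Vec<(String, String)> {
    words
        .iter()
        .map(|&(word, start, end)| {
            let mid = (start + end) / 2.0;
            let speaker = turns
                .iter()
                .find(|t| t.start <= mid && mid <= t.end)
                .map_or("UNKNOWN", |t| t.speaker);
            (speaker.to_string(), word.to_string())
        })
        .collect()
}

fn main() {
    let turns = [
        Turn { speaker: "SPEAKER_00", start: 0.0, end: 1.2 },
        Turn { speaker: "SPEAKER_01", start: 1.5, end: 2.8 },
    ];
    let words = [("hello", 0.0, 0.4), ("world", 0.5, 1.1), ("how", 1.5, 1.8)];
    for (speaker, word) in align(&words, &turns) {
        println!("{speaker}: {word}");
    }
}
```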
## Why polyvoice?
| You need... | polyvoice delivers |
|---|---|
| Real-time streaming | `OnlineDiarizer` processes audio chunk-by-chunk with sub-second latency |
| File-based batch | `OfflineDiarizer` two-pass pipeline with gap merging and overlap detection |
| No Python in production | Pure Rust + ONNX Runtime. No GIL, no virtualenv, no dependency hell |
| Concurrent inference | Lock-free ONNX session pool — scale to many connections without `Mutex` contention |
| Plug your own model | `EmbeddingExtractor` trait: WeSpeaker, ECAPA-TDNN, or your custom ONNX model |
| C FFI | Drop-in `.so`/`.dylib`/`.dll` for Python, Go, Node.js, or C++ callers |
| Safety guarantees | Verified with Miri (unsafe memory), Loom (concurrency model-checking), and fuzzing |
## Quick start
```toml
[dependencies]
polyvoice = "0.4"
```
### Offline diarization (file / batch)
```rust
// NOTE: reconstructed example; exact type and method names may differ
// from the published API. See the crate docs.
use polyvoice::{DiarizationConfig, OfflineDiarizer};

let config = DiarizationConfig::default();
let diarizer = OfflineDiarizer::new(config);
let extractor = /* any EmbeddingExtractor implementation */;

let samples: Vec<f32> = vec![0.0; 160_000]; // 10 s of 16 kHz mono audio
let result = diarizer.run(&samples, &extractor).unwrap();
for turn in &result.turns {
    println!("{}: {:.2}s - {:.2}s", turn.speaker, turn.start, turn.end);
}
```
### Real-time streaming
```rust
// NOTE: reconstructed example; exact names may differ. See the crate docs.
use polyvoice::{DiarizationConfig, OnlineDiarizer};

let config = DiarizationConfig::default();
let mut diarizer = OnlineDiarizer::new(config);
let extractor = /* any EmbeddingExtractor implementation */;

while let Some(chunk) = microphone.read() {
    // Feed each chunk as it arrives; turns are emitted incrementally.
    let turns = diarizer.process_chunk(&chunk, &extractor);
}
```
### With an ONNX model (WeSpeaker / ECAPA-TDNN)
```toml
[dependencies]
polyvoice = { version = "0.4", features = ["onnx"] }
```
```rust
// NOTE: reconstructed example; extractor type name and model path are
// illustrative. See the crate docs.
use polyvoice::{DiarizationConfig, OfflineDiarizer};
use std::path::Path;

let config = DiarizationConfig::default();
let extractor = OnnxExtractor::new(Path::new("path/to/model.onnx")).unwrap();
let diarizer = OfflineDiarizer::new(config);
let result = diarizer.run(&samples, &extractor).unwrap();
```
## Architecture
```text
Input audio (f32 PCM)
        │
        ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│     VAD      │ --> │  Embedding   │ --> │   Speaker    │ --> Turns / Segments
│  (optional)  │     │  Extractor   │     │   Cluster    │
└──────────────┘     └──────────────┘     └──────────────┘
                       ONNX pool            Incremental
                       (lock-free)          cosine-sim clustering
```
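The "Speaker Cluster" stage can be illustrated with a toy version of incremental cosine-similarity clustering: each new embedding joins the nearest existing centroid, or opens a new speaker when similarity drops below a threshold. This is a self-contained sketch, not polyvoice's actual implementation:

```rust
/// Cosine similarity of two vectors (small epsilon avoids divide-by-zero).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb + 1e-9)
}

/// Assign `embedding` to the nearest centroid (updating it as a running
/// mean) or open a new cluster; returns the speaker index.
fn assign(centroids: &mut Vec<(Vec<f32>, usize)>, embedding: &[f32], threshold: f32) -> usize {
    let best: Option<(usize, f32)> = centroids
        .iter()
        .enumerate()
        .map(|(i, (c, _))| (i, cosine(c, embedding)))
        .max_by(|a, b| a.1.total_cmp(&b.1));
    match best {
        Some((i, sim)) if sim >= threshold => {
            let (c, n) = &mut centroids[i];
            *n += 1;
            let k = *n as f32;
            for (cj, ej) in c.iter_mut().zip(embedding) {
                *cj += (ej - *cj) / k; // running-mean update
            }
            i
        }
        _ => {
            centroids.push((embedding.to_vec(), 1));
            centroids.len() - 1
        }
    }
}

fn main() {
    let mut centroids: Vec<(Vec<f32>, usize)> = Vec::new();
    // Two clearly separated directions -> two speakers.
    assert_eq!(assign(&mut centroids, &[1.0, 0.0], 0.7), 0);
    assert_eq!(assign(&mut centroids, &[0.9, 0.1], 0.7), 0);
    assert_eq!(assign(&mut centroids, &[0.0, 1.0], 0.7), 1);
    println!("speakers: {}", centroids.len());
}
```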
## Key features
- 🎙️ Online & Offline — stream chunks in real time or process entire files in one shot.
- 🧠 ONNX-powered — ECAPA-TDNN and WeSpeaker extractors with built-in 80-bin log-mel filterbank.
- ⚡ Lock-free session pool — a `crossbeam-queue`-backed pool eliminates `Mutex` contention under concurrent load.
- 🔌 VAD trait — plug in Silero VAD, Energy VAD, or your own voice-activity detector.
- 🗣️ Overlap detection — find regions where multiple speakers talk simultaneously.
- 📝 Word alignment — assign speaker IDs to individual transcript words by timestamp.
- 🔒 Memory-safe FFI — C ABI with Miri-verified unsafe code and Valgrind-tested Python bindings.
- 🦀 Pure Rust — zero Python dependencies in production.
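To make the pluggable-VAD idea concrete, here is a toy energy-based detector behind an illustrative `Vad` trait (polyvoice's real trait signature may differ):

```rust
/// Illustrative VAD interface; polyvoice's actual trait may differ.
trait Vad {
    /// Returns true if the frame contains speech.
    fn is_speech(&self, frame: &[f32]) -> bool;
}

/// Flags a frame as speech when its RMS energy exceeds a threshold.
struct EnergyVad {
    threshold: f32,
}

impl Vad for EnergyVad {
    fn is_speech(&self, frame: &[f32]) -> bool {
        if frame.is_empty() {
            return false;
        }
        let rms = (frame.iter().map(|x| x * x).sum::<f32>() / frame.len() as f32).sqrt();
        rms > self.threshold
    }
}

fn main() {
    let vad = EnergyVad { threshold: 0.05 };
    assert!(!vad.is_speech(&[0.001_f32; 160])); // near-silence
    assert!(vad.is_speech(&[0.3_f32; 160]));    // loud frame
    println!("ok");
}
```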
## Production readiness
This crate is hardened for production use:
| Verification | Tool |
|---|---|
| Unsafe memory safety | Miri (nightly CI) |
| Concurrency correctness | Loom model-checking |
| Input fuzzing | cargo-fuzz (4 targets, nightly CI) |
| API stability | cargo-semver-checks |
| Cross-platform | Ubuntu, macOS, Windows CI |
| Dependency audit | cargo-audit |
## Benchmarks
| Benchmark | Metric |
|---|---|
| Offline diarization (10 s) | Latency on synthetic two-speaker audio |
| ECAPA fbank (10 s) | Log-mel throughput |
| DER (10 s) | Diarization Error Rate vs. ground truth |
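For context on the DER row: Diarization Error Rate is (missed speech + false alarm + speaker confusion) divided by total reference speech time. A simplified frame-level sketch, assuming speaker labels are already optimally mapped and ignoring the usual forgiveness collar:

```rust
/// Frame-level DER over pre-mapped label sequences. `None` = non-speech.
/// Real DER tooling also searches the optimal reference/hypothesis
/// speaker mapping and applies a collar around boundaries.
fn der(reference: &[Option<u32>], hypothesis: &[Option<u32>]) -> f32 {
    let mut errors = 0usize;
    let mut speech = 0usize;
    for (r, h) in reference.iter().zip(hypothesis) {
        if r.is_some() {
            speech += 1;
        }
        match (r, h) {
            (Some(a), Some(b)) if a != b => errors += 1, // speaker confusion
            (Some(_), None) => errors += 1,              // missed speech
            (None, Some(_)) => errors += 1,              // false alarm
            _ => {}
        }
    }
    errors as f32 / speech.max(1) as f32
}

fn main() {
    let reference = [Some(0), Some(0), Some(1), Some(1), None];
    let hypothesis = [Some(0), Some(0), Some(0), Some(1), None];
    // 1 confused frame out of 4 speech frames -> DER = 0.25
    println!("DER = {}", der(&reference, &hypothesis));
}
```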
## Configuration
```rust
use polyvoice::DiarizationConfig;

// Start from defaults and override fields as needed (see the
// DiarizationConfig docs for the available fields).
let config = DiarizationConfig { ..DiarizationConfig::default() };
```
## Roadmap
- [x] ECAPA-TDNN ONNX extractor
- [x] C FFI bindings
- [x] Miri / Loom / fuzz verification
- [x] Cross-platform CI
- [ ] Agglomerative re-clustering pass for offline mode
- [ ] PLDA scoring backend
- [ ] `no_std` support for embedded targets
## License
MIT