Crate polyvoice

§polyvoice

Speaker diarization library for Rust — online (streaming) and offline (file-based), ONNX-powered, and ecosystem-agnostic.

Designed to be embedded into any Rust application that needs to answer the question “who spoke when?”.

§Quick start

Build a diarization pipeline using Pipeline and ModelRegistry. See the pipeline module for details.

§Module organization

polyvoice carries two parallel module families from an in-progress migration to a trait-based v1.0 architecture. This is deliberate (shared math, a compile-time feature guard), not accidental duplication:

  • v1.0 trait-based (current architecture): embedder (the migration target trait), clusterer, segmentation, resegmentation, silero_vad, and pipeline_v2 (experimental — see its README).
  • Legacy: embedding, ecapa, onnx are #[deprecated] — migrate to the embedder trait. cluster and vad are the legacy clustering/VAD surfaces.
  • Pipeline status (note the inversion): the legacy pipeline is the validated default for the CLI and Python bindings; pipeline_v2 is experimental (opt-in via --v2), reverted from default after the 0.6.1 long-form DER regression.
  • Shared math, reused by both families: ahc, kmeans, spectral, features, der, utils.
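
The split described above (a deprecated legacy surface, a trait-based target, and shared math used by both) can be sketched roughly as follows. This is a minimal illustration only: `LegacyExtractor`, `ToyEmbedder`, `mean_and_energy`, `embed_all`, and the `embed` method signature are hypothetical stand-ins, not polyvoice's actual API. Only the `#[deprecated]` attribute itself is standard Rust.

```rust
// Hypothetical names throughout: these are NOT polyvoice's real signatures.

/// Stand-in for a legacy surface. The attribute makes callers see a
/// compiler warning that points them at the replacement.
#[deprecated(note = "migrate to the `Embedder` trait")]
pub struct LegacyExtractor;

#[allow(deprecated)]
impl LegacyExtractor {
    pub fn extract(&self, samples: &[f32]) -> Vec<f32> {
        mean_and_energy(samples)
    }
}

/// Stand-in for the trait-based surface: any implementor can be plugged in.
pub trait Embedder {
    fn embed(&self, samples: &[f32]) -> Vec<f32>;
}

pub struct ToyEmbedder;

impl Embedder for ToyEmbedder {
    fn embed(&self, samples: &[f32]) -> Vec<f32> {
        mean_and_energy(samples)
    }
}

/// Shared math reused by both families: a toy 2-dimensional "embedding"
/// made of the sample mean and the mean energy.
fn mean_and_energy(samples: &[f32]) -> Vec<f32> {
    if samples.is_empty() {
        return vec![0.0, 0.0];
    }
    let n = samples.len() as f32;
    let mean = samples.iter().sum::<f32>() / n;
    let energy = samples.iter().map(|s| s * s).sum::<f32>() / n;
    vec![mean, energy]
}

/// Code written against the trait does not care which implementation it gets.
pub fn embed_all(embedder: &dyn Embedder, chunks: &[Vec<f32>]) -> Vec<Vec<f32>> {
    chunks.iter().map(|c| embedder.embed(c)).collect()
}

#[allow(deprecated)]
fn main() {
    let chunks = vec![vec![1.0, -1.0], vec![0.5, 0.5]];
    // Both paths go through the same shared math, so they agree.
    let legacy = LegacyExtractor.extract(&chunks[0]);
    let current = embed_all(&ToyEmbedder, &chunks);
    assert_eq!(legacy, current[0]);
    println!("{:?}", current);
}
```

Keeping the numeric helper outside both surfaces is what lets the two families coexist during a migration without duplicating logic.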

Re-exports§

pub use features::FbankConfig;
pub use features::FbankExtractor;
pub use utils::merge_segments;
pub use segmentation::AggregationConfig;
pub use segmentation::Aggregator;
pub use segmentation::FrameLabel;
pub use segmentation::MIN_AUDIO_SAMPLES;
pub use segmentation::PowersetClass;
pub use segmentation::PowersetDecoder;
pub use segmentation::RawSegment;
pub use segmentation::SegmentationError;
pub use segmentation::Segmenter;
pub use segmentation::WindowOutput;
pub use segmentation::PowersetConfig;
pub use segmentation::PowersetSegmenter;
pub use embedder::Embedder;
pub use embedder::EmbedderError;
pub use embedder::EmbedderPool;
pub use embedder::apply_overlap_mask;
pub use embedder::CamPlusPlusExtractor;
pub use embedder::ResNet34Adapter;
pub use clusterer::AhcClusterer;
pub use clusterer::Clusterer;
pub use clusterer::ClustererError;
pub use clusterer::NmeScClusterer;
pub use resegmentation::OverlapRegionInput;
pub use resegmentation::OverlapResegmenter;
pub use resegmentation::ResegmentError;
pub use resegmentation::ResegmentInputs;
pub use resegmentation::Resegmenter;
pub use resegmentation::SpeakerCentroid;
pub use resegmentation::compute_centroids;
pub use resegmentation::extract_overlap_time_ranges;
pub use pipeline::Pipeline;
pub use pipeline::PipelineError;
pub use vad::EnergyVad;
pub use vad::VadConfig;
pub use vad::VadError;
pub use vad::VoiceActivityDetector;
pub use vad::segment_speech;
pub use silero_vad::SileroVad;
pub use onnx::OnnxEmbeddingExtractor; // Deprecated
pub use cluster::SpeakerCluster;
pub use embedding::DummyExtractor; // Deprecated
pub use embedding::EmbeddingError; // Deprecated
pub use embedding::EmbeddingExtractor; // Deprecated
pub use models::ModelRegistry;
pub use models::ProfileModels;
pub use models::RegistryError;
pub use overlap::OverlapRegion;
pub use overlap::detect_overlaps;
pub use types::ClusterConfig;
pub use types::Confidence;
pub use types::DiarizationConfig;
pub use types::DiarizationResult;
pub use types::Profile;
pub use types::SampleRate;
pub use types::Seconds;
pub use types::Segment;
pub use types::SpeakerId;
pub use types::SpeakerIdRemap;
pub use types::SpeakerTurn;
pub use types::TimeRange;
pub use types::WordAlignment;
pub use types::remap_segments;
pub use types::remap_turns;
pub use window::WindowBuffer;
pub use window::WindowIter;
pub use ecapa::FbankOnnxExtractor; // Deprecated

Modules§

ahc
Agglomerative Hierarchical Clustering (AHC) for speaker diarization.
cluster
Speaker clustering with online incremental centroid updates.
clusterer
v1.0 Clusterer trait + concrete clusterers (NME-SC, AHC).
der
Diarization Error Rate (DER) computation.
ecapa
ONNX speaker embedding extractor (WeSpeaker, ECAPA-TDNN, etc.).
embedder
v1.0 Embedder trait + concrete extractors (CAM++, ResNet34) + pool + overlap-mask helper.
embedding
Speaker embedding extraction trait.
features
Log-mel filterbank (fbank) feature extraction for speaker embeddings.
kmeans
K-Means++ clustering with automatic k selection via silhouette score.
models
Model registry — manifest-driven downloads with SHA-256 verification.
onnx
ONNX-based speaker embedding extractor with a session pool.
overlap
Overlap detection: identify frames where multiple speakers may be active.
pipeline
High-level diarization pipeline.
pipeline_v2
Experimental v2 pipeline (milestone M6a), added alongside the legacy pipeline module as polyvoice::pipeline_v2.
resegmentation
v1.0 OverlapResegmenter — overlap-aware post-clustering pass.
rttm
RTTM (Rich Transcription Time Marked) parser and writer.
segmentation
Speaker segmentation: powerset-classifier + sliding-window aggregator.
silero_vad
Silero VAD v5 ONNX integration.
spectral
Spectral clustering for speaker diarization.
streaming
Real-time streaming diarization pipeline.
types
Core types for speaker diarization.
utils
Math utilities for diarization.
vad
Voice Activity Detection (VAD) trait and utilities.
wav
WAV file I/O via the hound crate.
window
Sliding-window utilities for batch and streaming pipelines.
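
The rttm module listed above is described as an RTTM parser and writer. As a rough illustration of the record format it deals with, here is a minimal sketch of reading and writing a single `SPEAKER` record. The `Turn` type and the two functions are hypothetical and are not the module's actual API. The field order (type, file ID, channel, onset, duration, orthography, speaker type, speaker name, confidence, lookahead) follows the conventional 10-column RTTM layout.

```rust
/// One speaker turn taken from an RTTM `SPEAKER` record (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Turn {
    pub file_id: String,
    pub channel: u32,
    pub onset: f64,    // seconds from the start of the recording
    pub duration: f64, // seconds
    pub speaker: String,
}

/// Parse one whitespace-separated RTTM line.
/// Returns `None` for non-SPEAKER records or malformed numeric fields.
pub fn parse_rttm_line(line: &str) -> Option<Turn> {
    let fields: Vec<&str> = line.split_whitespace().collect();
    if fields.len() < 8 || fields[0] != "SPEAKER" {
        return None;
    }
    Some(Turn {
        file_id: fields[1].to_string(),
        channel: fields[2].parse().ok()?,
        onset: fields[3].parse().ok()?,
        duration: fields[4].parse().ok()?,
        speaker: fields[7].to_string(),
    })
}

/// Write a turn back out, filling the unused columns with `<NA>`.
pub fn format_rttm_line(turn: &Turn) -> String {
    format!(
        "SPEAKER {} {} {:.3} {:.3} <NA> <NA> {} <NA> <NA>",
        turn.file_id, turn.channel, turn.onset, turn.duration, turn.speaker
    )
}

fn main() {
    let line = "SPEAKER meeting1 1 0.50 2.25 <NA> <NA> spk0 <NA> <NA>";
    match parse_rttm_line(line) {
        Some(turn) => println!("{}", format_rttm_line(&turn)),
        None => eprintln!("not a SPEAKER record"),
    }
}
```

Each record answers "who spoke when?" for one turn: the speaker name in column 8, with the onset and duration in columns 4 and 5.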