§polyvoice
Speaker diarization library for Rust — online (streaming) and offline (file-based), ONNX-powered, and ecosystem-agnostic.
Designed to be embedded into any Rust application that needs to answer the question “who spoke when?”.
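To make "who spoke when?" concrete, here is a minimal sketch of what a diarization result looks like as data: a list of speaker-labelled time ranges, plus two small helpers that post-process it. Everything in it (`Turn`, `merge_adjacent`, `talk_time`) is a hypothetical stand-in written for illustration; it does not use or reproduce polyvoice's own `Segment`, `SpeakerTurn`, or `merge_segments`.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one diarization result entry:
/// speaker `speaker` was active from `start` to `end` (seconds).
/// This is NOT polyvoice's `Segment` type.
#[derive(Debug, Clone, PartialEq)]
pub struct Turn {
    pub speaker: u32,
    pub start: f64,
    pub end: f64,
}

/// Sort turns by start time, then merge a turn into the previous one when
/// both belong to the same speaker and the silence between them is at most
/// `max_gap` seconds.
pub fn merge_adjacent(mut turns: Vec<Turn>, max_gap: f64) -> Vec<Turn> {
    turns.sort_by(|a, b| a.start.total_cmp(&b.start));
    let mut merged: Vec<Turn> = Vec::with_capacity(turns.len());
    for turn in turns {
        if let Some(prev) = merged.last_mut() {
            if prev.speaker == turn.speaker && turn.start - prev.end <= max_gap {
                // Extend the previous turn instead of starting a new one.
                prev.end = prev.end.max(turn.end);
                continue;
            }
        }
        merged.push(turn);
    }
    merged
}

/// Total speaking time per speaker, ordered by speaker id.
pub fn talk_time(turns: &[Turn]) -> Vec<(u32, f64)> {
    let mut totals: BTreeMap<u32, f64> = BTreeMap::new();
    for t in turns {
        *totals.entry(t.speaker).or_insert(0.0) += t.end - t.start;
    }
    totals.into_iter().collect()
}
```

Calling `merge_adjacent(turns, 0.5)` on the raw output and then `talk_time(&merged)` gives a per-speaker summary; the 0.5 s gap is an arbitrary illustrative threshold, not a library default.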
§Quick start
Build a diarization pipeline using Pipeline and ModelRegistry.
See the pipeline module for details.
§Module organization
polyvoice carries two parallel module families from an in-progress migration to a trait-based v1.0 architecture. This is deliberate (shared math, a compile-time feature guard), not accidental duplication:
- v1.0 trait-based (current architecture): embedder (the migration target trait), clusterer, segmentation, resegmentation, silero_vad, and pipeline_v2 (experimental; see its README).
- Legacy: embedding, ecapa, and onnx are #[deprecated]; migrate to the embedder trait. cluster and vad are the legacy clustering/VAD surfaces.
- Pipeline status (note the inversion): the legacy pipeline is the validated default for the CLI and Python bindings; pipeline_v2 is experimental (opt-in via --v2) and was reverted from default after the 0.6.1 long-form DER regression.
- Shared math, reused by both families: ahc, kmeans, spectral, features, der, utils.
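The general pattern behind such a migration (a deprecated legacy surface kept alive next to a new trait) can be sketched in plain Rust. The names below (`Embed`, `LegacyExtract`, `LegacyAdapter`, `OldMean`) are hypothetical and chosen for illustration; they are not polyvoice's `Embedder` or `EmbeddingExtractor`, whose actual signatures are documented in their own modules.

```rust
/// Hypothetical new-style trait (stand-in for a v1.0-style embedder trait).
pub trait Embed {
    fn embed(&self, samples: &[f32]) -> Vec<f32>;
}

/// Hypothetical legacy trait, kept compiling during the migration but
/// flagged so that new code gets a compiler warning when it uses it.
#[deprecated(note = "implement `Embed` instead")]
pub trait LegacyExtract {
    fn extract(&self, samples: &[f32]) -> Vec<f32>;
}

/// Adapter that lets an existing legacy extractor be used wherever the
/// new trait is expected, so callers can migrate one at a time.
pub struct LegacyAdapter<T>(pub T);

#[allow(deprecated)] // the adapter is the one place allowed to touch the old trait
impl<T: LegacyExtract> Embed for LegacyAdapter<T> {
    fn embed(&self, samples: &[f32]) -> Vec<f32> {
        self.0.extract(samples)
    }
}

/// Toy legacy extractor: a one-dimensional "embedding" holding the mean sample.
pub struct OldMean;

#[allow(deprecated)]
impl LegacyExtract for OldMean {
    fn extract(&self, samples: &[f32]) -> Vec<f32> {
        let n = samples.len().max(1) as f32; // avoid dividing by zero on empty input
        vec![samples.iter().sum::<f32>() / n]
    }
}

/// New-style code depends only on the new trait.
pub fn embedding_dim(e: &dyn Embed, samples: &[f32]) -> usize {
    e.embed(samples).len()
}
```

With this shape, `embedding_dim(&LegacyAdapter(OldMean), &samples)` works today, and the adapter can be deleted once every extractor implements the new trait directly.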
§Re-exports
- pub use features::FbankConfig;
- pub use features::FbankExtractor;
- pub use utils::merge_segments;
- pub use segmentation::AggregationConfig;
- pub use segmentation::Aggregator;
- pub use segmentation::FrameLabel;
- pub use segmentation::MIN_AUDIO_SAMPLES;
- pub use segmentation::PowersetClass;
- pub use segmentation::PowersetDecoder;
- pub use segmentation::RawSegment;
- pub use segmentation::SegmentationError;
- pub use segmentation::Segmenter;
- pub use segmentation::WindowOutput;
- pub use segmentation::PowersetConfig;
- pub use segmentation::PowersetSegmenter;
- pub use embedder::Embedder;
- pub use embedder::EmbedderError;
- pub use embedder::EmbedderPool;
- pub use embedder::apply_overlap_mask;
- pub use embedder::CamPlusPlusExtractor;
- pub use embedder::ResNet34Adapter;
- pub use clusterer::AhcClusterer;
- pub use clusterer::Clusterer;
- pub use clusterer::ClustererError;
- pub use clusterer::NmeScClusterer;
- pub use resegmentation::OverlapRegionInput;
- pub use resegmentation::OverlapResegmenter;
- pub use resegmentation::ResegmentError;
- pub use resegmentation::ResegmentInputs;
- pub use resegmentation::Resegmenter;
- pub use resegmentation::SpeakerCentroid;
- pub use resegmentation::compute_centroids;
- pub use resegmentation::extract_overlap_time_ranges;
- pub use pipeline::Pipeline;
- pub use pipeline::PipelineError;
- pub use vad::EnergyVad;
- pub use vad::VadConfig;
- pub use vad::VadError;
- pub use vad::VoiceActivityDetector;
- pub use vad::segment_speech;
- pub use silero_vad::SileroVad;
- pub use onnx::OnnxEmbeddingExtractor; (Deprecated)
- pub use cluster::SpeakerCluster;
- pub use embedding::DummyExtractor; (Deprecated)
- pub use embedding::EmbeddingError; (Deprecated)
- pub use embedding::EmbeddingExtractor; (Deprecated)
- pub use models::ModelRegistry;
- pub use models::ProfileModels;
- pub use models::RegistryError;
- pub use overlap::OverlapRegion;
- pub use overlap::detect_overlaps;
- pub use types::ClusterConfig;
- pub use types::Confidence;
- pub use types::DiarizationConfig;
- pub use types::DiarizationResult;
- pub use types::Profile;
- pub use types::SampleRate;
- pub use types::Seconds;
- pub use types::Segment;
- pub use types::SpeakerId;
- pub use types::SpeakerIdRemap;
- pub use types::SpeakerTurn;
- pub use types::TimeRange;
- pub use types::WordAlignment;
- pub use types::remap_segments;
- pub use types::remap_turns;
- pub use window::WindowBuffer;
- pub use window::WindowIter;
- pub use ecapa::FbankOnnxExtractor; (Deprecated)
§Modules
- ahc
- Agglomerative Hierarchical Clustering (AHC) for speaker diarization.
- cluster
- Speaker clustering with online incremental centroid updates.
- clusterer
- v1.0 Clusterer trait + concrete clusterers (NME-SC, AHC).
- der
- Diarization Error Rate (DER) computation.
- ecapa
- ONNX speaker embedding extractor (WeSpeaker, ECAPA-TDNN, etc.).
- embedder
- v1.0 Embedder trait + concrete extractors (CAM++, ResNet34) + pool + overlap-mask helper.
- embedding
- Speaker embedding extraction trait.
- features
- Log-mel filterbank (fbank) feature extraction for speaker embeddings.
- kmeans
- K-Means++ clustering with automatic k selection via silhouette score.
- models
- Model registry — manifest-driven downloads with SHA-256 verification.
- onnx
- ONNX-based speaker embedding extractor with a session pool.
- overlap
- Overlap detection: identify frames where multiple speakers may be active.
- pipeline
- High-level diarization pipeline.
- pipeline_v2
- M6a: additive polyvoice::pipeline_v2 module.
- resegmentation
- v1.0 OverlapResegmenter — overlap-aware post-clustering pass.
- rttm
- RTTM (Rich Transcription Time Marked) parser and writer.
- segmentation
- Speaker segmentation: powerset-classifier + sliding-window aggregator.
- silero_vad
- Silero VAD v5 ONNX integration.
- spectral
- Spectral clustering for speaker diarization.
- streaming
- Real-time streaming diarization pipeline.
- types
- Core types for speaker diarization.
- utils
- Math utilities for diarization.
- vad
- Voice Activity Detection (VAD) trait and utilities.
- wav
- WAV file I/O via the hound crate.
- window
- Sliding-window utilities for batch and streaming pipelines.