Crate polyvoice


§polyvoice

Speaker diarization library for Rust — online (streaming) and offline (file-based), ONNX-powered, and ecosystem-agnostic.

Designed to be embedded into any Rust application that needs to answer the question “who spoke when?”.
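To make "who spoke when?" concrete, the sketch below models a diarization answer as a time-ordered list of speaker turns and merges consecutive turns from the same speaker. It is illustrative only: `Turn` and `merge_adjacent` are hypothetical stand-ins defined here, not polyvoice's `Segment` type or its `merge_segments` function, whose signatures are not shown on this page.

```rust
/// Hypothetical stand-in for one diarized span of speech.
/// NOT polyvoice's `Segment`; defined here only for illustration.
#[derive(Debug, Clone, PartialEq)]
struct Turn {
    start: f64,   // start time in seconds
    end: f64,     // end time in seconds
    speaker: u32, // anonymous speaker label
}

/// Merge consecutive turns by the same speaker when the silence between
/// them is at most `max_gap` seconds. Assumes `turns` is sorted by start.
fn merge_adjacent(turns: &[Turn], max_gap: f64) -> Vec<Turn> {
    let mut merged: Vec<Turn> = Vec::new();
    for t in turns {
        if let Some(prev) = merged.last_mut() {
            if prev.speaker == t.speaker && t.start - prev.end <= max_gap {
                // Same speaker and a short gap: extend the previous turn.
                prev.end = prev.end.max(t.end);
                continue;
            }
        }
        // Different speaker or a long gap: start a new turn.
        merged.push(t.clone());
    }
    merged
}
```

With `max_gap = 0.5`, two turns by speaker 0 separated by 0.2 s collapse into one, while a following turn by speaker 1 stays separate.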

§Quick start

Build a diarization pipeline using Pipeline and ModelRegistry. See the pipeline module for details.

§Module organization

polyvoice carries two parallel module families while it migrates to a trait-based v1.0 architecture. The overlap is deliberate (shared math, a compile-time feature guard), not accidental duplication:

v1.0 trait-based (current architecture): embedder (the migration target trait), clusterer, segmentation, resegmentation, silero_vad, and pipeline_v2 (experimental — see its README).
Legacy: embedding, ecapa, and onnx are #[deprecated]; migrate to the Embedder trait in the embedder module. cluster and vad are the legacy clustering and VAD surfaces.
Pipeline status (note the inversion: the legacy module, not the v1.0 one, is the default): the legacy pipeline is the validated default for the CLI and Python bindings; pipeline_v2 is experimental (opt-in via --v2) and was reverted from default after the 0.6.1 long-form DER regression.
Shared math, reused by both families: ahc, kmeans, spectral, features, der, utils.
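The migration pattern described above (a deprecated legacy surface alongside a trait-based replacement) can be sketched in plain Rust. Every name below (`LegacyExtractor`, `NewEmbedder`, `LegacyAdapter`, `MeanExtractor`) is hypothetical and chosen for illustration; none of them are polyvoice's actual traits or signatures. The sketch only shows how `#[deprecated]` steers callers toward a new trait while an adapter keeps old implementations usable in the meantime.

```rust
/// Stand-in for a legacy extraction trait. `#[deprecated]` makes the
/// compiler warn wherever the trait is named.
#[deprecated(note = "migrate to `NewEmbedder`")]
trait LegacyExtractor {
    fn extract(&self, samples: &[f32]) -> Vec<f32>;
}

/// Stand-in for the trait-based replacement, with fallible output.
trait NewEmbedder {
    fn embed(&self, samples: &[f32]) -> Result<Vec<f32>, String>;
}

/// Adapter that lets an existing legacy implementation satisfy the
/// new trait while call sites are migrated.
struct LegacyAdapter<T>(T);

#[allow(deprecated)] // the adapter is the one place allowed to name the old trait
impl<T: LegacyExtractor> NewEmbedder for LegacyAdapter<T> {
    fn embed(&self, samples: &[f32]) -> Result<Vec<f32>, String> {
        if samples.is_empty() {
            return Err("empty input".to_string());
        }
        Ok(self.0.extract(samples))
    }
}

/// Toy legacy implementation: a one-dimensional "embedding" holding the mean.
struct MeanExtractor;

#[allow(deprecated)]
impl LegacyExtractor for MeanExtractor {
    fn extract(&self, samples: &[f32]) -> Vec<f32> {
        let mean = samples.iter().sum::<f32>() / samples.len() as f32;
        vec![mean]
    }
}
```

Call sites written against `NewEmbedder` compile without deprecation warnings, so a remaining warning marks code that still names the legacy trait directly.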

Re-exports§

pub use asr::Asr;
pub use asr::AsrError;
pub use features::FbankConfig;
pub use features::FbankExtractor;
pub use utils::merge_segments;
pub use segmentation::AggregationConfig;
pub use segmentation::Aggregator;
pub use segmentation::FrameLabel;
pub use segmentation::MIN_AUDIO_SAMPLES;
pub use segmentation::PowersetClass;
pub use segmentation::PowersetDecoder;
pub use segmentation::RawSegment;
pub use segmentation::SegmentationError;
pub use segmentation::Segmenter;
pub use segmentation::WindowOutput;
pub use segmentation::PowersetConfig;
pub use segmentation::PowersetSegmenter;
pub use embedder::Embedder;
pub use embedder::EmbedderError;
pub use embedder::EmbedderPool;
pub use embedder::apply_overlap_mask;
pub use embedder::CamPlusPlusExtractor;
pub use embedder::ResNet34Adapter;
pub use clusterer::AhcClusterer;
pub use clusterer::Clusterer;
pub use clusterer::ClustererError;
pub use clusterer::MinClusterSizeClusterer;
pub use clusterer::NmeScClusterer;
pub use resegmentation::OverlapRegionInput;
pub use resegmentation::OverlapResegmenter;
pub use resegmentation::ResegmentError;
pub use resegmentation::ResegmentInputs;
pub use resegmentation::Resegmenter;
pub use resegmentation::SpeakerCentroid;
pub use resegmentation::compute_centroids;
pub use resegmentation::extract_overlap_time_ranges;
pub use pipeline::Pipeline;
pub use pipeline::PipelineError;
pub use vad::EnergyVad;
pub use vad::VadConfig;
pub use vad::VadError;
pub use vad::VoiceActivityDetector;
pub use vad::segment_speech;
pub use silero_vad::SileroVad;
pub use onnx::OnnxEmbeddingExtractor; (Deprecated)
pub use cluster::SpeakerCluster;
pub use embedding::DummyExtractor; (Deprecated)
pub use embedding::EmbeddingError; (Deprecated)
pub use embedding::EmbeddingExtractor; (Deprecated)
pub use models::ModelRegistry;
pub use models::ProfileModels;
pub use models::RegistryError;
pub use overlap::OverlapRegion;
pub use overlap::detect_overlaps;
pub use types::ClusterConfig;
pub use types::Confidence;
pub use types::DiarizationConfig;
pub use types::DiarizationResult;
pub use types::Profile;
pub use types::SampleRate;
pub use types::Seconds;
pub use types::Segment;
pub use types::SpeakerId;
pub use types::SpeakerIdRemap;
pub use types::SpeakerTurn;
pub use types::TimeRange;
pub use types::Transcript;
pub use types::Word;
pub use types::WordAlignment;
pub use types::remap_segments;
pub use types::remap_turns;
pub use window::WindowBuffer;
pub use window::WindowIter;
pub use ecapa::FbankOnnxExtractor; (Deprecated)

Modules§

ahc: Agglomerative Hierarchical Clustering (AHC) for speaker diarization.
asr: ASR (speech-to-text) trait — the stable interface the opt-in polyvoice-asr companion crate implements and the word→speaker join targets.
cluster: Speaker clustering with online incremental centroid updates.
clusterer: v1.0 Clusterer trait + concrete clusterers (NME-SC, AHC).
der: Diarization Error Rate (DER) computation.
ecapa: ONNX speaker embedding extractor (WeSpeaker, ECAPA-TDNN, etc.).
embedder: v1.0 Embedder trait + concrete extractors (CAM++, ResNet34) + pool + overlap-mask helper.
embedding: Speaker embedding extraction trait.
features: Log-mel filterbank (fbank) feature extraction for speaker embeddings.
format: Subtitle / plain-text projections of a diarization result (SRT, WebVTT, TXT).
kmeans: K-Means++ clustering with automatic k selection via silhouette score.
models: Model registry — manifest-driven downloads with SHA-256 verification.
onnx: ONNX-based speaker embedding extractor with a session pool.
overlap: Overlap detection: identify frames where multiple speakers may be active.
pipeline: High-level diarization pipeline.
pipeline_v2: M6a — additive polyvoice::pipeline_v2 module.
resegmentation: v1.0 OverlapResegmenter — overlap-aware post-clustering pass.
rttm: RTTM (Rich Transcription Time Marked) parser and writer.
segmentation: Speaker segmentation: powerset-classifier + sliding-window aggregator.
silero_vad: Silero VAD v5 ONNX integration.
spectral: Spectral clustering for speaker diarization.
streaming: Real-time streaming diarization pipeline.
types: Core types for speaker diarization.
utils: Math utilities for diarization.
vad: Voice Activity Detection (VAD) trait and utilities.
wav: WAV file I/O via the hound crate.
window: Sliding-window utilities for batch and streaming pipelines.
