Crate speakrs

Expand description

speakrs implements the full pyannote community-1 style diarization pipeline in Rust: segmentation, powerset decode, overlap-add aggregation, binarization, embedding, PLDA, and VBx clustering.

There is no Python runtime in the library path. Inference runs on ONNX Runtime or native CoreML, and the rest of the pipeline stays in Rust.

§Usage

# macOS (CoreML)
speakrs = { version = "0.5", features = ["coreml"] }

# NVIDIA GPU
speakrs = { version = "0.5", features = ["cuda"] }

# CPU only
speakrs = "0.5"

# System OpenBLAS
speakrs = { version = "0.5", default-features = false, features = ["online", "openblas-system"] }

# AMD GPU
speakrs = { version = "0.5", features = ["migraphx"] }

§Quick start

use speakrs::{ExecutionMode, OwnedDiarizationPipeline};

fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let mut pipeline = OwnedDiarizationPipeline::from_pretrained(ExecutionMode::CoreMl)?;

    let audio: Vec<f32> = load_your_mono_16khz_audio_here();
    let result = pipeline.run(&audio)?;

    print!("{}", result.rttm("my-audio"));
    Ok(())
}

§Speaker turns


let result = pipeline.run(&audio)?;

for segment in result.discrete_diarization.to_segments() {
    println!("{:.3} - {:.3}  {}", segment.start, segment.end, segment.speaker);
}

§Background queue

QueueSender and QueueReceiver run a background worker. Push audio from any thread and read results as they finish:

use speakrs::{ExecutionMode, OwnedDiarizationPipeline, QueuedDiarizationRequest};

let pipeline = OwnedDiarizationPipeline::from_pretrained(ExecutionMode::CoreMl)?;
let (tx, rx) = pipeline.into_queued()?;

std::thread::spawn(move || {
    for (file_id, audio) in receive_files() {
        tx.push(QueuedDiarizationRequest::new(file_id, audio)).unwrap();
    }
});

for result in rx {
    let result = result?;
    print!("{}", result.result?.rttm(&result.file_id));
}

§Local models

For offline or airgapped setups, load models from a local directory:

use std::path::Path;
use speakrs::{ExecutionMode, OwnedDiarizationPipeline};

let mut pipeline = OwnedDiarizationPipeline::from_dir(
    Path::new("/path/to/models"),
    ExecutionMode::Cpu,
)?;
let result = pipeline.run(&audio)?;

§Choosing a mode

Mode	Backend	Step	Use it for
`cpu`	ONNX Runtime CPU	1s	CPU runs and widest compatibility
`coreml`	Native CoreML	1s	macOS with CoreML acceleration
`coreml-fast`	Native CoreML	2s	macOS with CoreML acceleration and higher throughput
`cuda`	ONNX Runtime CUDA	1s	NVIDIA GPU
`cuda-fast`	ONNX Runtime CUDA	2s	NVIDIA GPU for higher throughput
`migraphx`	ONNX Runtime MIGraphX	1s	AMD GPU

The *-fast modes move the segmentation window every 2 seconds instead of every 1 second. That gives the pipeline fewer windows to score, so it can be much faster, but speaker changes may land a little farther from the exact word or pause where they happened.

Use the 1 second modes when you care about exactly when each speaker starts and stops, short clips, interviews with quick back-and-forth, or audio you plan to subtitle or edit. The 2 second modes are usually worth trying for long recordings where speed matters more than exact speaker-change times, such as meetings, lectures, podcasts, or bulk archives.

§Benchmarks

VoxConverse dev, collar=0ms:

Platform	Implementation	DER	Time	RTFx
Apple M4 Pro	`speakrs` `coreml`	7.1%	138s	529x
Apple M4 Pro	`speakrs` `coreml-fast`	7.4%	169s	434x
Apple M4 Pro	pyannote community-1 (MPS)	7.2%	2999s	24x
RTX 4090	`speakrs` `cuda`	7.0%	1236s	59x
RTX 4090	`speakrs` `cuda-fast`	7.4%	604s	121x
RTX 4090	pyannote community-1 (CUDA)	7.2%	2312s	32x

On VoxConverse test, coreml matches pyannote at 11.1% DER and runs at 631x realtime versus pyannote’s 23x. cuda matches pyannote at 11.1% DER and runs at 50x realtime versus pyannote’s 18x. See benchmarks/ for the full tables across all datasets.

CoreML and ONNX Runtime can differ slightly even in FP32 because the runtime graphs are not identical and floating-point reduction order changes rounding.

§Why not pyannote-rs?

pyannote-rs is the main Rust-only comparison point, but it targets a different tradeoff.

	`speakrs`	`pyannote-rs`
Pipeline	Full pyannote `community-1` style pipeline	Simpler window-level pipeline
Aggregation	Overlap-add plus binarization	No overlap-add or binarization
Clustering	PLDA + VBx	Cosine threshold
Goal	Stay close to pyannote behavior on CPU/CUDA	Lightweight Rust diarization

On the VoxConverse dev subset where pyannote-rs emits output, speakrs CoreML scores 11.5% DER versus 80.2% for pyannote-rs. In that same run, pyannote-rs returned no segments on most files.

§Models

With the default online feature, models download on first use from avencera/speakrs-models. Set SPEAKRS_MODELS_DIR if you want to force a local bundle instead.

§Features and build notes

Common features:

online (default): model download via ModelManager
coreml: native CoreML backend on macOS
cuda: NVIDIA CUDA backend via ONNX Runtime
migraphx: AMD GPU backend via ONNX Runtime MIGraphX
load-dynamic: load the ONNX Runtime library at startup instead of static linking

BLAS backends matter if you disable default features:

x86_64 defaults to statically linked Intel MKL
non-x86_64 defaults to statically linked OpenBLAS and needs a C toolchain
no-default builds must enable exactly one of intel-mkl, openblas-static, or openblas-system

speakrs = { version = "0.5", default-features = false, features = ["online", "intel-mkl"] }
speakrs = { version = "0.5", default-features = false, features = ["online", "openblas-system"] }

The ONNX Runtime dependency (ort 2.0.0-rc.12) is still pre-release.

§Public API

Start here:

OwnedDiarizationPipeline: pipeline entry point
QueueSender and QueueReceiver: background worker interface
DiarizationResult: frame-level activations, segments, clusters, embeddings, RTTM
PipelineConfig and RuntimeConfig: tuning knobs
ModelManager: model download when online is enabled
Segment: a single speaker turn

Re-exports§

pub use inference::CoreMlComputeUnits;
pub use inference::ExecutionMode;
pub use models::ModelBundle;
pub use models::ModelManager;online
pub use pipeline::AhcConfig;
pub use pipeline::BatchInput;
pub use pipeline::BinarizeConfig;
pub use pipeline::DiarizationPipeline;
pub use pipeline::DiarizationResult;
pub use pipeline::OwnedDiarizationPipeline;
pub use pipeline::PipelineBuilder;
pub use pipeline::PipelineConfig;
pub use pipeline::PipelineError;
pub use pipeline::QueueError;
pub use pipeline::QueueReceiver;
pub use pipeline::QueueReceiverIter;
pub use pipeline::QueueSender;
pub use pipeline::QueuedDiarizationJobId;
pub use pipeline::QueuedDiarizationRequest;
pub use pipeline::QueuedDiarizationResult;
pub use pipeline::RuntimeConfig;
pub use pipeline::VbxConfig;
pub use segment::Segment;

Modules§

inference: Segmentation and embedding model wrappers
metrics_metrics: Diarization error rate (DER) evaluation utilities
models: Model paths and HuggingFace download support
pipeline: High-level diarization pipeline and result types
segment: Speaker segments, merging, and RTTM output

Structs§

PowersetMapping_metrics: Maps between powerset class indices and multi-speaker binary activations

Crate speakrs

Crate speakrs Copy item path

§Usage

§Quick start

§Speaker turns

§Background queue

§Local models

§Choosing a mode

§Benchmarks

§Why not pyannote-rs?

§Models

§Features and build notes

§Public API

Re-exports§

Modules§

Structs§

Crate speakrs