rust-dominant-speaker 0.3.0

Pure-Rust port of the Jitsi/mediasoup dominant speaker identification algorithm (Volfin/Cohen 2012).
Documentation

rust-dominant-speaker

CI Crates.io docs.rs License: MIT OR Apache-2.0 MSRV

Pure-Rust library implementing the dominant speaker identification algorithm used by Jitsi Videobridge and mediasoup. No FFI, no WebRTC stack dependencies — just feed it RFC 6464 audio-level observations and it tells you who is talking.

Why this exists

Every Rust SFU currently reimplements this algorithm inline or skips it entirely. This crate is a faithful, library-shaped extraction so you can drop it in and move on.

Usage

Add to your Cargo.toml:

[dependencies]
rust-dominant-speaker = "0.3"
use dominant_speaker::{ActiveSpeakerDetector, SpeakerChange};

let mut detector = ActiveSpeakerDetector::new();

// Timestamps are caller-supplied u64 milliseconds (any epoch).
// In std: let t0 = Instant::now(); let now_ms = t0.elapsed().as_millis() as u64;
// In WASM: performance.now() as u64

// Register participants at t=0.
detector.add_peer(1u64, 0);
detector.add_peer(2u64, 0);

// Feed RFC 6464 audio levels: 0 = loudest, 127 = silent.
let mut t_ms: u64 = 0;
while t_ms < 2000 {
    detector.record_level(1, 5, t_ms);    // peer 1 is speaking
    detector.record_level(2, 127, t_ms);  // peer 2 is silent
    t_ms += 20;
}

// Call tick() on a 300ms timer. Returns Some(SpeakerChange) only on speaker change.
if let Some(change) = detector.tick(300) {
    println!("Dominant speaker: peer {} (confidence: {:.2})", change.peer_id, change.c2_margin);
}

// Query current speaker without advancing the clock.
assert_eq!(detector.current_dominant().copied(), Some(1));

Algorithm

Three time-scale comparison (Volfin & Cohen, 2012): audio levels are bucketed into immediate (20ms), medium (200ms), and long (2s) windows. For each window, a binomial-coefficient activity score is computed. A challenger beats the incumbent only if their log-ratio exceeds all three thresholds simultaneously, preventing brief spikes from triggering spurious speaker changes (hysteresis).

Constants (mediasoup production tuning)

Constant Value Meaning
C1 3.0 Immediate time-scale log-ratio threshold
C2 2.0 Medium time-scale log-ratio threshold
C3 0.0 Long time-scale threshold (disabled in production)
N1 13 Immediate subband count
N2 5 Medium subband count
N3 10 Long subband count
ImmediateBuffLen 50 Immediate ring-buffer slots (1s at 20ms cadence × 5 subbands)
MediumsBuffLen 10 Medium ring-buffer slots
LongsBuffLen 1 Long ring-buffer slots
LevelsBuffLen 50 Raw-level ring-buffer slots
MinActivityScore 1e-10 Score floor; prevents log(0) in ratio
LevelIdleTimeout 40ms Stale level replaced with silence
SpeakerIdleTimeout 1h Idle non-dominant peer marked paused
TICK_INTERVAL 300ms Recommended tick cadence

Prior art

  • mediasoup (C++, ISC) — direct source of constants and algorithm structure. ActiveSpeakerObserver.cpp
  • Jitsi (Java, Apache-2.0) — reference algorithm implementation. DominantSpeakerIdentification.java
  • Signal-Calling-Service (Rust, AGPL-3.0) — a simplified heuristic variant inlined in their backend; not a library. backend/src/audio.rs
  • RFC 6464 — input format: audio level in dBov carried in RTP header extensions.

Extracted from

Originally built as part of OxPulse Chat, published standalone for the broader Rust WebRTC ecosystem.

no_std / WASM

The crate is #![no_std] compatible. To use without the standard library:

[dependencies]
rust-dominant-speaker = { version = "0.3", default-features = false }

Timestamps are caller-supplied u64 milliseconds — no Instant dependency:

// All three methods accept u64 ms in both std and no_std builds.
detector.add_peer(peer_id, now_ms);
detector.record_level(peer_id, level, now_ms);
detector.tick(now_ms);

In a browser AudioWorklet, supply performance.now() as u64 as the timestamp.

Runtime dependencies added for no_std: hashbrown (hash map) and libm (f64 math).

License

Dual-licensed under MIT or Apache-2.0 at your option.