rust-dominant-speaker
Pure-Rust library implementing the dominant speaker identification algorithm used by Jitsi Videobridge and mediasoup. No FFI, no WebRTC stack dependencies — just feed it RFC 6464 audio-level observations and it tells you who is talking.
Why this exists
Every Rust SFU currently reimplements this algorithm inline or skips it entirely. This crate is a faithful, library-shaped extraction so you can drop it in and move on.
Usage
Add to your Cargo.toml:
[]
= "0.3"
use ;
let mut detector = new;
// Timestamps are caller-supplied u64 milliseconds (any epoch).
// In std: let t0 = Instant::now(); let now_ms = t0.elapsed().as_millis() as u64;
// In WASM: performance.now() as u64
// Register participants at t=0.
detector.add_peer;
detector.add_peer;
// Feed RFC 6464 audio levels: 0 = loudest, 127 = silent.
let mut t_ms: u64 = 0;
while t_ms < 2000
// Call tick() on a 300ms timer. Returns Some(SpeakerChange) only on speaker change.
if let Some = detector.tick
// Query current speaker without advancing the clock.
assert_eq!;
Algorithm
Three time-scale comparison (Volfin & Cohen, 2012): audio levels are bucketed into immediate (20ms), medium (200ms), and long (2s) windows. For each window, a binomial-coefficient activity score is computed. A challenger beats the incumbent only if their log-ratio exceeds all three thresholds simultaneously, preventing brief spikes from triggering spurious speaker changes (hysteresis).
Constants (mediasoup production tuning)
| Constant | Value | Meaning |
|---|---|---|
| C1 | 3.0 | Immediate time-scale log-ratio threshold |
| C2 | 2.0 | Medium time-scale log-ratio threshold |
| C3 | 0.0 | Long time-scale threshold (disabled in production) |
| N1 | 13 | Immediate subband count |
| N2 | 5 | Medium subband count |
| N3 | 10 | Long subband count |
| ImmediateBuffLen | 50 | Immediate ring-buffer slots (1s at 20ms cadence × 5 subbands) |
| MediumsBuffLen | 10 | Medium ring-buffer slots |
| LongsBuffLen | 1 | Long ring-buffer slots |
| LevelsBuffLen | 50 | Raw-level ring-buffer slots |
| MinActivityScore | 1e-10 | Score floor; prevents log(0) in ratio |
| LevelIdleTimeout | 40ms | Stale level replaced with silence |
| SpeakerIdleTimeout | 1h | Idle non-dominant peer marked paused |
| TICK_INTERVAL | 300ms | Recommended tick cadence |
Prior art
- mediasoup (C++, ISC) — direct source of constants and algorithm structure. ActiveSpeakerObserver.cpp
- Jitsi (Java, Apache-2.0) — reference algorithm implementation. DominantSpeakerIdentification.java
- Signal-Calling-Service (Rust, AGPL-3.0) — a simplified heuristic variant inlined in their backend; not a library. backend/src/audio.rs
- RFC 6464 — input format: audio level in dBov carried in RTP header extensions.
Extracted from
Originally built as part of OxPulse Chat, published standalone for the broader Rust WebRTC ecosystem.
no_std / WASM
The crate is #![no_std] compatible. To use without the standard library:
[]
= { = "0.3", = false }
Timestamps are caller-supplied u64 milliseconds — no Instant dependency:
// All three methods accept u64 ms in both std and no_std builds.
detector.add_peer;
detector.record_level;
detector.tick;
In a browser AudioWorklet, supply performance.now() as u64 as the timestamp.
Runtime dependencies added for no_std: hashbrown (hash map) and libm (f64 math).
License
Dual-licensed under MIT or Apache-2.0 at your option.