Struct VadConfig 

pub struct VadConfig {
    pub sample_rate: u32,
    pub frame_duration: AudioDuration,
    pub frame_overlap: f32,
    pub energy_smoothing: f32,
    pub flux_smoothing: f32,
    pub energy_floor: f32,
    pub flux_floor: f32,
    pub threshold_smoothing: f32,
    pub activation_margin: f32,
    pub release_margin: f32,
    pub base_threshold: f32,
    pub energy_weight: f32,
    pub flux_weight: f32,
    pub hangover_frames: usize,
    pub min_speech_frames: usize,
    pub stream_start_time: AudioTimestamp,
    pub pre_emphasis: Option<f32>,
}

Configuration for the voice activity detector.

§Performance Characteristics

  • Latency: Typically <2ms of processing per 20ms frame (under 10% of real time)
  • Memory: ~10KB per detector instance (FFT buffers + state)
  • Accuracy: >95% speech detection on clean audio

§Configuration Guidelines

§Quick Start (Use Defaults)

use speech_prep::VadConfig;

let config = VadConfig::default(); // Optimized for 16kHz mono speech

§Advanced Tuning

For Noisy Environments: Increase activation_margin to 1.3-1.5

use speech_prep::VadConfig;

let config = VadConfig {
    activation_margin: 1.4, // Require stronger signal
    hangover_frames: 5,     // Longer trailing silence tolerance
    ..VadConfig::default()
};

For Low-Latency Applications: Reduce frame_duration

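use speech_prep::VadConfig;
// AudioDuration must also be in scope; its module path is not shown on this page.

let config = VadConfig {
    frame_duration: AudioDuration::from_millis(10), // 10ms frames
    ..VadConfig::default()
};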

For Soft/Quiet Speech: Lower activation_margin

use speech_prep::VadConfig;

let config = VadConfig {
    activation_margin: 1.05, // More sensitive
    min_speech_frames: 2,    // Faster activation
    ..VadConfig::default()
};

Fields§

§sample_rate: u32

Expected audio sample rate in Hz.

Default: 16000 (16kHz - optimal for speech)

Valid Range: 8000-48000 Hz

Performance Impact: Higher rates increase FFT computation cost. At 48kHz, expect ~3x slower processing vs 16kHz.

Recommendation: Use 16kHz unless your audio pipeline requires otherwise.

§frame_duration: AudioDuration

Frame duration used for analysis.

Default: 20ms (320 samples at 16kHz)

Valid Range: 10-50ms

Trade-offs:

  • Shorter (10ms): Lower latency, less robust to noise
  • Longer (50ms): Higher latency, more stable detection

Performance Impact: A 20ms frame takes ~1.5ms to process. Cost scales roughly linearly with frame duration: 10ms → ~0.75ms, 50ms → ~3.75ms.

§frame_overlap: f32

Fractional overlap between adjacent frames.

Default: 0.5 (50% overlap)

Valid Range: [0.0, 1.0)

Effect: Higher overlap increases temporal resolution but adds computation cost. 50% overlap means processing twice as many frames for the same audio duration.

Recommendation: 0.5 for balanced accuracy/performance, 0.75 for critical applications requiring precise boundary detection.

§energy_smoothing: f32

Smoothing factor for rolling energy baseline (exponential moving average).

Default: 0.85 (85% history, 15% new observation)

Valid Range: [0.0, 1.0)

Effect: Controls adaptation speed to background noise changes.

  • Higher (0.9-0.95): Slower adaptation, stable in constant noise
  • Lower (0.7-0.8): Faster adaptation, handles dynamic noise

Half-Life: At 0.85, baseline half-life ≈ 4.3 frames (86ms at 20ms/frame).
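The half-life figure follows from solving smoothing^n = 0.5 for n. The sketch below shows that arithmetic together with a single moving-average step. The function names are illustrative and are not part of this crate; the detector's internal update may differ in detail.

```rust
/// One exponential-moving-average step (illustrative helper, not crate API).
/// `smoothing` weights the history; `1 - smoothing` weights the new observation.
fn ema_update(baseline: f32, observation: f32, smoothing: f32) -> f32 {
    smoothing * baseline + (1.0 - smoothing) * observation
}

/// Frames after which an old observation's weight has halved.
/// Solves smoothing^n = 0.5 for n; meaningful for 0 < smoothing < 1.
fn half_life_frames(smoothing: f32) -> f32 {
    0.5_f32.ln() / smoothing.ln()
}
```

Under these definitions, `half_life_frames(0.85)` is about 4.3 and `half_life_frames(0.8)` is about 3.1, which matches the half-lives quoted for the energy and flux baselines.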

§flux_smoothing: f32

Smoothing factor for rolling spectral flux baseline.

Default: 0.8 (80% history, 20% new observation)

Valid Range: [0.0, 1.0)

Effect: Controls adaptation to spectral change patterns. Flux is typically more variable than energy, so the default smoothing is slightly lower.

Half-Life: At 0.8, baseline half-life ≈ 3.1 frames (62ms at 20ms/frame).

§energy_floor: f32

Minimum energy floor to prevent division by zero in normalization.

Default: 1e-4 (0.0001)

Valid Range: >0.0 (typically 1e-6 to 1e-3)

Effect: Prevents numerical instability when audio is completely silent. Value is small enough to not affect real audio.
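As a rough illustration of why a floor is needed, the sketch below divides a frame's energy by a rolling baseline and clamps the denominator. This is an assumed form of the normalization; the crate's actual formula is not shown on this page, and `normalize` is a hypothetical helper.

```rust
/// Normalize a value against a rolling baseline (hypothetical helper).
/// Clamping the denominator to `floor` keeps the result finite even when
/// the baseline has decayed to zero on completely silent input.
fn normalize(value: f32, baseline: f32, floor: f32) -> f32 {
    value / baseline.max(floor)
}
```

With a floor of 1e-4, a silent frame against a zero baseline normalizes to 0.0 instead of producing NaN.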

§flux_floor: f32

Minimum spectral flux floor to prevent division by zero.

Default: 1e-4 (0.0001)

Valid Range: >0.0 (typically 1e-6 to 1e-3)

Effect: Prevents numerical instability in flux calculations.

§threshold_smoothing: f32

Smoothing factor for the dynamic decision threshold.

Default: 0.9 (90% history, 10% new)

Valid Range: [0.0, 1.0)

Effect: Controls how quickly the detector adapts its sensitivity. Higher values make the threshold more stable, preventing rapid oscillations in marginal cases.

§activation_margin: f32

Multiplier applied to dynamic threshold to activate speech detection.

Default: 1.1 (110% of baseline threshold)

Valid Range: ≥1.0

Effect: Creates hysteresis to prevent chattering at boundaries.

  • 1.05-1.1: High sensitivity (detects soft speech, more false positives)
  • 1.2-1.5: Low sensitivity (robust to noise, may miss quiet speech)

Recommendation: Start with 1.1, increase if too many false activations.

§release_margin: f32

Multiplier applied to dynamic threshold when releasing to silence.

Default: 0.9 (90% of baseline threshold)

Valid Range: >0.0, must be ≤ activation_margin

Effect: Creates hysteresis to maintain speech state during brief pauses. The difference between the activation and release margins prevents rapid toggling.

Typical Gap: 0.1-0.3 between margins (e.g., activate=1.2, release=0.9).
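The interaction between the two margins can be sketched as a two-state machine. This is a minimal illustration under the assumption that each frame yields one score compared against the dynamic threshold; `State` and `next_state` are hypothetical names, not crate API.

```rust
/// Detector state for the hysteresis sketch (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Silence,
    Speech,
}

/// Advance the state for one frame.
/// Entering speech requires the score to reach threshold * activation_margin;
/// leaving speech requires it to drop below threshold * release_margin.
fn next_state(
    state: State,
    score: f32,
    threshold: f32,
    activation_margin: f32,
    release_margin: f32,
) -> State {
    match state {
        State::Silence if score >= threshold * activation_margin => State::Speech,
        State::Speech if score < threshold * release_margin => State::Silence,
        // Scores between the two levels leave the state unchanged.
        other => other,
    }
}
```

With a threshold of 0.4 and the default margins, activation needs a score of about 0.44 while release needs a drop below about 0.36, so a score hovering near 0.4 does not toggle the state.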

§base_threshold: f32

Initial baseline threshold before dynamic adaptation kicks in.

Default: 0.4 (40% of normalized scale)

Valid Range: 0.0-1.0

Effect: Starting point for the adaptive threshold. After 10-20 frames, the adaptive algorithm takes over and this value becomes less relevant.

Recommendation: Leave at default unless you know your audio's characteristics.

§energy_weight: f32

Weight applied to normalized energy when combining dual metrics.

Default: 0.6 (60% energy, 40% flux)

Valid Range: 0.0-1.0 (combined with flux_weight should sum to 1.0)

Effect: Energy detects signal presence, flux detects spectral changes. Higher energy weight emphasizes volume-based detection.

Use Cases:

  • 0.7-0.8: Emphasize loudness (good for clean recordings)
  • 0.5-0.6: Balanced (default, works well generally)
  • 0.3-0.4: Emphasize spectral change (noisy environments)

§flux_weight: f32

Weight applied to normalized spectral flux when combining metrics.

Default: 0.4 (40% flux, 60% energy)

Valid Range: 0.0-1.0 (combined with energy_weight should sum to 1.0)

Effect: Flux is more robust to constant background noise but can be fooled by music or non-speech sounds with spectral variation.
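Assuming the two metrics are combined as a plain weighted sum (the wording "combining dual metrics" suggests this, but the exact formula is not shown here), the score could be computed as follows. `combined_score` is a hypothetical helper.

```rust
/// Weighted sum of normalized energy and normalized spectral flux
/// (assumed combination rule; hypothetical helper, not crate API).
/// The weights are expected to sum to 1.0.
fn combined_score(norm_energy: f32, norm_flux: f32, energy_weight: f32, flux_weight: f32) -> f32 {
    energy_weight * norm_energy + flux_weight * norm_flux
}
```

With the default weights, a frame with normalized energy 1.0 and normalized flux 0.5 scores 0.6 * 1.0 + 0.4 * 0.5 = 0.8.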

§hangover_frames: usize

Number of trailing silent frames retained at the end of a speech segment.

Default: 3 frames (60ms at 20ms/frame)

Valid Range: 0-10 frames (typically)

Effect: Prevents premature cutoff of speech segments during brief pauses (e.g., between words). Setting it too high leaves long trailing silence in each segment.

Recommendation:

  • 2-3: Normal speech (default)
  • 5-8: Slow/hesitant speech
  • 0-1: Real-time applications requiring minimal latency

§min_speech_frames: usize

Minimum number of speech frames required to emit a segment.

Default: 3 frames (60ms at 20ms/frame)

Valid Range: 1-10 frames (typically)

Effect: Filters out brief noise spikes mistaken for speech. Too high causes missed short utterances (e.g., “yes”, “no”).
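To make the interplay of hangover_frames and min_speech_frames concrete, here is a minimal sketch that turns per-frame speech decisions into segments. It assumes a segment closes once the silent run exceeds the hangover, and that the minimum counts speech frames only. `segment` is a hypothetical function and not the crate's implementation.

```rust
/// Group per-frame decisions into (start, end) frame ranges, end exclusive.
/// Hypothetical sketch: up to `hangover_frames` trailing silent frames are
/// retained, and segments with fewer than `min_speech_frames` speech frames
/// are dropped.
fn segment(decisions: &[bool], hangover_frames: usize, min_speech_frames: usize) -> Vec<(usize, usize)> {
    let mut segments = Vec::new();
    let mut start: Option<usize> = None;
    let mut speech_count = 0;
    let mut silent_run = 0;

    for (i, &is_speech) in decisions.iter().enumerate() {
        match (start, is_speech) {
            (None, true) => {
                // First speech frame opens a candidate segment.
                start = Some(i);
                speech_count = 1;
                silent_run = 0;
            }
            (Some(_), true) => {
                speech_count += 1;
                silent_run = 0;
            }
            (Some(s), false) => {
                silent_run += 1;
                if silent_run > hangover_frames {
                    // Frame i is the first silent frame beyond the hangover.
                    if speech_count >= min_speech_frames {
                        segments.push((s, i));
                    }
                    start = None;
                }
            }
            (None, false) => {}
        }
    }
    // Close a segment that is still open when the input ends.
    if let Some(s) = start {
        if speech_count >= min_speech_frames {
            segments.push((s, decisions.len()));
        }
    }
    segments
}
```

With both parameters at 3, a one-frame pause inside an utterance is bridged, while an isolated single speech frame is discarded as a noise spike.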

Recommendation:

  • 2-3: Balanced (default)
  • 1: Detect very short sounds
  • 5+: Only long speech segments

§stream_start_time: AudioTimestamp

Absolute start time for the first sample processed by this detector.

Default: AudioTimestamp::EPOCH (zero-based stream time)

Effect: Used for timestamping detected speech segments. Set this to the origin you want segment timestamps to use, or leave it as EPOCH for timestamps relative to the start of processing.
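The timestamp arithmetic can be sketched with std::time::Duration standing in for AudioTimestamp, whose API is not shown on this page. The function below is hypothetical and assumes frame k starts k hops after the stream origin.

```rust
use std::time::Duration;

/// Start time of frame `frame_index`, measured from `origin`
/// (hypothetical helper; Duration stands in for AudioTimestamp).
fn frame_start_time(origin: Duration, frame_index: usize, hop_samples: usize, sample_rate: u32) -> Duration {
    let offset_samples = (frame_index * hop_samples) as u64;
    // Convert the sample offset to nanoseconds with integer arithmetic.
    let nanos = offset_samples * 1_000_000_000 / sample_rate as u64;
    origin + Duration::from_nanos(nanos)
}
```

With a zero origin, a 160-sample hop, and 16kHz audio, frame 10 starts at 100ms; a 5s origin shifts the same frame to 5.1s.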

Use Cases:

  • Live streams: Set to a shared stream origin
  • Batch processing: Keep EPOCH or provide a known offset
  • Testing: Leave as EPOCH for deterministic timestamps

§pre_emphasis: Option<f32>

Optional pre-emphasis coefficient applied before analysis (high-pass filter).

Default: Some(0.97) (standard speech pre-emphasis)

Valid Range: None or Some(0.9-0.99)

Effect: Applies first-order high-pass filter: y[n] = x[n] - α*x[n-1]

  • Boosts high frequencies relative to low frequencies
  • Compensates for typical speech spectral tilt (more energy in low freqs)
  • Improves robustness to low-frequency rumble/hum
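The filter equation above translates directly into a short loop. This is a sketch that assumes the sample before the first one is zero; `pre_emphasis` is an illustrative free function, not the crate's internal routine.

```rust
/// Apply y[n] = x[n] - alpha * x[n-1] to a buffer (illustrative helper).
/// The sample preceding the buffer is assumed to be 0.0.
fn pre_emphasis(samples: &[f32], alpha: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(samples.len());
    let mut prev = 0.0_f32;
    for &x in samples {
        out.push(x - alpha * prev);
        prev = x;
    }
    out
}
```

A constant (DC) input of 1.0 settles at 1 - α, about 0.03 for α = 0.97, while a sample-to-sample alternation between 1.0 and -1.0 settles at magnitude 1 + α. This is the low-frequency attenuation and high-frequency boost described in the list above.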

Recommendation:

  • Some(0.97): Standard for speech (default)
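  • Some(0.95): Milder high-pass (the DC gain is 1 - α, so a lower coefficient attenuates low frequencies less); raise the coefficient toward 0.99 for stronger suppression of low-frequency noise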
  • None: Disable if audio already pre-emphasized or for music/non-speech

Implementations§

impl VadConfig

pub fn validate(&self) -> Result<()>

Validate configuration invariants.
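The exact checks performed by validate are not listed on this page. As a sketch, the documented valid ranges from the Fields section could be enforced as below, using a reduced stand-in struct. `ConfigSketch` and `validate_sketch` are hypothetical, and the real method may check more, fewer, or different conditions.

```rust
/// Reduced stand-in for VadConfig holding a few of the documented fields.
#[derive(Debug, Clone, Copy)]
struct ConfigSketch {
    sample_rate: u32,
    frame_overlap: f32,
    energy_floor: f32,
    activation_margin: f32,
    release_margin: f32,
}

/// Check the ranges documented in the Fields section (assumed invariants).
/// The negated comparisons also reject NaN values.
fn validate_sketch(c: &ConfigSketch) -> Result<(), String> {
    if !(8_000..=48_000).contains(&c.sample_rate) {
        return Err(format!("sample_rate {} outside 8000-48000 Hz", c.sample_rate));
    }
    if !(0.0..1.0).contains(&c.frame_overlap) {
        return Err(format!("frame_overlap {} outside [0.0, 1.0)", c.frame_overlap));
    }
    if !(c.energy_floor > 0.0) {
        return Err(format!("energy_floor {} must be > 0.0", c.energy_floor));
    }
    if !(c.activation_margin >= 1.0) {
        return Err(format!("activation_margin {} must be >= 1.0", c.activation_margin));
    }
    if !(c.release_margin > 0.0 && c.release_margin <= c.activation_margin) {
        return Err(format!("release_margin {} must be in (0.0, activation_margin]", c.release_margin));
    }
    Ok(())
}
```

Under these assumed rules the documented defaults pass, while a release margin above the activation margin or an overlap of exactly 1.0 is rejected.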

pub fn frame_length_samples(&self) -> Result<usize>

Frame length in samples derived from the configured duration and sample rate. Returns an error if the computed frame length exceeds platform limits.

pub fn hop_length_samples(&self) -> Result<usize>

Hop size in samples considering the configured frame overlap.

pub fn fft_size(&self) -> Result<usize>

FFT size for spectral analysis (next power of two of the frame length).
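The three derived sizes relate to each other as sketched below. These are standalone functions written for illustration; they take plain numbers and return usize directly, whereas the methods above take &self and return Result. The rounding of the hop length is an assumption.

```rust
/// Samples per frame for a duration given in milliseconds (illustrative).
fn frame_length_samples(sample_rate: u32, frame_ms: u32) -> usize {
    (sample_rate as u64 * frame_ms as u64 / 1000) as usize
}

/// Samples between the starts of adjacent frames.
/// Rounding to the nearest sample is an assumption; the result is kept at
/// 1 or more so the analysis window always advances.
fn hop_length_samples(frame_len: usize, overlap: f32) -> usize {
    (((frame_len as f32) * (1.0 - overlap)).round() as usize).max(1)
}

/// Next power of two at or above the frame length.
fn fft_size(frame_len: usize) -> usize {
    frame_len.next_power_of_two()
}
```

For the defaults (16kHz, 20ms, 50% overlap) this gives a 320-sample frame, a 160-sample hop, and a 512-point FFT.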

Trait Implementations§

impl Clone for VadConfig

fn clone(&self) -> VadConfig

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for VadConfig

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for VadConfig

fn default() -> Self

Returns the “default value” for a type.

impl Copy for VadConfig

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<F, T> IntoSample<T> for F
where T: FromSample<F>,

fn into_sample(self) -> T

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.