Struct VadConfig

pub struct VadConfig {
    pub sample_rate: u32,
    pub frame_duration: AudioDuration,
    pub frame_overlap: f32,
    pub energy_smoothing: f32,
    pub flux_smoothing: f32,
    pub energy_floor: f32,
    pub flux_floor: f32,
    pub threshold_smoothing: f32,
    pub activation_margin: f32,
    pub release_margin: f32,
    pub base_threshold: f32,
    pub energy_weight: f32,
    pub flux_weight: f32,
    pub hangover_frames: usize,
    pub min_speech_frames: usize,
    pub stream_start_time: AudioTimestamp,
    pub pre_emphasis: Option<f32>,
}

Configuration for the voice activity detector.

§Performance Characteristics

Latency: Typically <2ms per 20ms frame (under 10% of real time)
Memory: ~10KB per detector instance (FFT buffers + state)
Accuracy: >95% speech detection on clean audio

§Configuration Guidelines

§Quick Start (Use Defaults)

use speech_prep::VadConfig;

let config = VadConfig::default(); // Optimized for 16kHz mono speech

§Advanced Tuning

For Noisy Environments: Increase activation_margin to 1.3-1.5

let config = VadConfig {
    activation_margin: 1.4, // Require stronger signal
    hangover_frames: 5,     // Longer trailing silence tolerance
    ..VadConfig::default()
};

For Low-Latency Applications: Reduce frame_duration

let config = VadConfig {
    frame_duration: AudioDuration::from_millis(10), // 10ms frames
    ..VadConfig::default()
};

For Soft/Quiet Speech: Lower activation_margin

let config = VadConfig {
    activation_margin: 1.05, // More sensitive
    min_speech_frames: 2,    // Faster activation
    ..VadConfig::default()
};

Fields§

§sample_rate: u32

Expected audio sample rate in Hz.

Default: 16000 (16kHz - optimal for speech)

Valid Range: 8000-48000 Hz

Performance Impact: Higher rates increase FFT computation cost. At 48kHz, expect ~3x slower processing vs 16kHz.

Recommendation: Use 16kHz unless your audio pipeline requires otherwise.

§frame_duration: AudioDuration

Frame duration used for analysis.

Default: 20ms (320 samples at 16kHz)

Valid Range: 10-50ms

Trade-offs:

Shorter (10ms): Lower latency, less robust to noise
Longer (50ms): Higher latency, more stable detection

Performance Impact: A 20ms frame takes ~1.5ms to process. Cost scales roughly linearly with frame length: 10ms → ~0.75ms, 50ms → ~3.75ms.
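
The sample counts quoted above follow directly from the sample rate and the frame duration. A minimal sketch of that arithmetic, assuming millisecond granularity; `frame_len_samples` is a hypothetical helper for illustration, not the crate's `frame_length_samples` method:

```rust
/// Number of samples in one analysis frame.
/// Hypothetical helper for illustration; not part of the crate's API.
fn frame_len_samples(sample_rate_hz: u32, frame_ms: u32) -> usize {
    // samples = rate (samples/s) * duration (s).
    // Multiply in u64 first so large rates and durations cannot overflow.
    (u64::from(sample_rate_hz) * u64::from(frame_ms) / 1000) as usize
}
```

With the defaults this gives 320 samples (16kHz, 20ms); a 10ms frame at the same rate gives 160.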

§frame_overlap: f32

Fractional overlap between adjacent frames.

Default: 0.5 (50% overlap)

Valid Range: [0.0, 1.0)

Effect: Higher overlap increases temporal resolution but adds computation cost. At 50% overlap the detector processes twice as many frames for the same audio duration.

Recommendation: 0.5 for balanced accuracy/performance, 0.75 for critical applications requiring precise boundary detection.
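
To make the cost of overlap concrete, the hop (the number of samples the analysis window advances per frame) can be derived from the frame length and the overlap fraction. This is a sketch under the assumption that the hop is `frame_len * (1 - overlap)` rounded to the nearest sample; `hop_len_samples` is a hypothetical name and the crate's `hop_length_samples` may round differently:

```rust
/// Samples the analysis window advances between consecutive frames.
/// Returns `None` when `overlap` is outside the documented range [0.0, 1.0).
/// Hypothetical helper for illustration only.
fn hop_len_samples(frame_len: usize, overlap: f32) -> Option<usize> {
    // NaN also fails this check because `contains` uses ordinary comparisons.
    if !(0.0..1.0).contains(&overlap) {
        return None;
    }
    let hop = (frame_len as f32 * (1.0 - overlap)).round() as usize;
    // Never return a zero hop, which would stall the frame loop.
    Some(hop.max(1))
}
```

For a 320-sample frame, 0.5 overlap yields a hop of 160 samples (twice as many frames as no overlap) and 0.75 yields 80 samples (four times as many).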

§energy_smoothing: f32

Smoothing factor for rolling energy baseline (exponential moving average).

Default: 0.85 (85% history, 15% new observation)

Valid Range: [0.0, 1.0)

Effect: Controls adaptation speed to background noise changes.

Higher (0.9-0.95): Slower adaptation, stable in constant noise
Lower (0.7-0.8): Faster adaptation, handles dynamic noise

Half-Life: At 0.85, baseline half-life ≈ 4.3 frames (86ms at 20ms/frame).
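
The half-life figure comes from the standard exponential-moving-average relation: after n updates the old baseline retains a weight of smoothing^n, so the half-life is ln(0.5) / ln(smoothing). A small sketch of both the update and the half-life; the function names are illustrative and not taken from the crate:

```rust
/// One exponential-moving-average step: keep `smoothing` of the history
/// and blend in `1 - smoothing` of the new observation.
fn ema_update(baseline: f32, observation: f32, smoothing: f32) -> f32 {
    smoothing * baseline + (1.0 - smoothing) * observation
}

/// Number of updates after which the old baseline's weight drops to 50%.
/// Only meaningful for 0.0 < smoothing < 1.0.
fn half_life_frames(smoothing: f32) -> f32 {
    0.5f32.ln() / smoothing.ln()
}
```

`half_life_frames(0.85)` evaluates to about 4.27 and `half_life_frames(0.8)` to about 3.11, matching the rounded values quoted for the smoothing defaults.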

§flux_smoothing: f32

Smoothing factor for rolling spectral flux baseline.

Default: 0.8 (80% history, 20% new observation)

Valid Range: [0.0, 1.0)

Effect: Controls adaptation to spectral change patterns. Flux is typically more variable than energy, so the default smoothing is slightly lower.

Half-Life: At 0.8, baseline half-life ≈ 3.1 frames (62ms at 20ms/frame).

§energy_floor: f32

Minimum energy floor to prevent division by zero in normalization.

Default: 1e-4 (0.0001)

Valid Range: >0.0 (typically 1e-6 to 1e-3)

Effect: Prevents numerical instability when audio is completely silent. The default is small enough not to affect real audio.
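
One common way to apply such a floor is to clamp the denominator before dividing. The sketch below assumes normalization means "value divided by rolling baseline"; that is an assumption about the detector's internals, and `normalize` is a hypothetical helper:

```rust
/// Normalize a frame metric against its rolling baseline, clamping the
/// baseline to `floor` so digital silence cannot cause a division by zero.
/// Assumed formula for illustration; the crate may normalize differently.
fn normalize(value: f32, baseline: f32, floor: f32) -> f32 {
    value / baseline.max(floor)
}
```

With a floor of 1e-4, an all-zero frame against an all-zero baseline normalizes to 0.0 instead of NaN.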

§flux_floor: f32

Minimum spectral flux floor to prevent division by zero.

Default: 1e-4 (0.0001)

Valid Range: >0.0 (typically 1e-6 to 1e-3)

Effect: Prevents numerical instability in flux calculations.

§threshold_smoothing: f32

Smoothing factor for the dynamic decision threshold.

Default: 0.9 (90% history, 10% new)

Valid Range: [0.0, 1.0)

Effect: Controls how quickly the detector adapts its sensitivity. Higher values make the threshold more stable, preventing rapid oscillations in marginal cases.

§activation_margin: f32

Multiplier applied to dynamic threshold to activate speech detection.

Default: 1.1 (110% of baseline threshold)

Valid Range: ≥1.0

Effect: Creates hysteresis to prevent chattering at boundaries.

1.05-1.1: High sensitivity (detects soft speech, more false positives)
1.2-1.5: Low sensitivity (robust to noise, may miss quiet speech)

Recommendation: Start with 1.1, increase if too many false activations.

§release_margin: f32

Multiplier applied to dynamic threshold when releasing to silence.

Default: 0.9 (90% of baseline threshold)

Valid Range: >0.0, must be ≤ activation_margin

Effect: Creates hysteresis to maintain speech state during brief pauses. Difference between activation and release margins prevents rapid toggling.

Typical Gap: 0.1-0.3 between margins (e.g., activate=1.2, release=0.9).
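
The two margins together form a hysteresis band around the dynamic threshold. A minimal sketch of that decision rule, assuming the detector compares a single combined score against the scaled threshold; `next_state` and its parameters are illustrative names, not the crate's API:

```rust
/// Decide the next speech/silence state for one frame.
/// While silent, the score must exceed `threshold * activation_margin`;
/// while in speech, it only has to stay at or above `threshold * release_margin`.
fn next_state(
    is_speech: bool,
    score: f32,
    threshold: f32,
    activation_margin: f32,
    release_margin: f32,
) -> bool {
    if is_speech {
        score >= threshold * release_margin
    } else {
        score > threshold * activation_margin
    }
}
```

With a threshold of 0.4 and the default margins (1.1 and 0.9), a score of 0.42 is not enough to enter speech but is enough to stay in it, which is what prevents rapid toggling near the boundary.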

§base_threshold: f32

Initial baseline threshold before dynamic adaptation kicks in.

Default: 0.4 (40% of normalized scale)

Valid Range: 0.0-1.0

Effect: Starting point for the adaptive threshold. After 10-20 frames, the adaptive algorithm takes over and this value becomes less relevant.

Recommendation: Leave at default unless you know the characteristics of your audio.

§energy_weight: f32

Weight applied to normalized energy when combining dual metrics.

Default: 0.6 (60% energy, 40% flux)

Valid Range: 0.0-1.0 (energy_weight + flux_weight should sum to 1.0)

Effect: Energy detects signal presence, flux detects spectral changes. Higher energy weight emphasizes volume-based detection.

Use Cases:

0.7-0.8: Emphasize loudness (good for clean recordings)
0.5-0.6: Balanced (default, works well generally)
0.3-0.4: Emphasize spectral change (noisy environments)
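
The weighting can be read as a convex combination of the two normalized metrics. The following sketch assumes a plain weighted sum, which the field descriptions suggest but do not spell out; both function names are hypothetical:

```rust
/// Combine normalized energy and spectral flux into one decision score.
/// Assumed to be a plain weighted sum for illustration.
fn combined_score(norm_energy: f32, norm_flux: f32, energy_weight: f32, flux_weight: f32) -> f32 {
    energy_weight * norm_energy + flux_weight * norm_flux
}

/// Check that the two weights sum to 1.0 within a small tolerance.
fn weights_sum_to_one(energy_weight: f32, flux_weight: f32) -> bool {
    (energy_weight + flux_weight - 1.0).abs() < 1e-6
}
```

With the default 0.6/0.4 split, a frame with normalized energy 1.0 and normalized flux 0.5 scores 0.8.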

§flux_weight: f32

Weight applied to normalized spectral flux when combining metrics.

Default: 0.4 (40% flux, 60% energy)

Valid Range: 0.0-1.0 (combined with energy_weight should sum to 1.0)

Effect: Flux is more robust to constant background noise but can be fooled by music or non-speech sounds with spectral variation.

§hangover_frames: usize

Number of trailing silent frames retained at the end of a speech segment.

Default: 3 frames (60ms at 20ms/frame)

Valid Range: 0-10 frames (typically)

Effect: Prevents premature cutoff of speech segments during brief pauses (e.g., between words). Setting it too high leaves long trailing silence in each segment.

Recommendation:

2-3: Normal speech (default)
5-8: Slow/hesitant speech
0-1: Real-time applications requiring minimal latency

§min_speech_frames: usize

Minimum number of speech frames required to emit a segment.

Default: 3 frames (60ms at 20ms/frame)

Valid Range: 1-10 frames (typically)

Effect: Filters out brief noise spikes mistaken for speech. Setting it too high causes missed short utterances (e.g., “yes”, “no”).

Recommendation:

2-3: Balanced (default)
1: Detect very short sounds
5+: Only long speech segments
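
How hangover_frames and min_speech_frames interact is easiest to see on a sequence of per-frame decisions. The sketch below is one plausible reading of the two fields, not the crate's implementation: a segment stays open until more than `hangover_frames` consecutive silent frames occur, keeps exactly that many trailing silent frames, and is dropped if it contains fewer than `min_speech_frames` speech frames. `segment_frames` is a hypothetical name.

```rust
/// Group per-frame speech decisions into `(start, end)` frame ranges,
/// where `end` is exclusive. Illustrative sketch only.
fn segment_frames(
    decisions: &[bool],
    hangover_frames: usize,
    min_speech_frames: usize,
) -> Vec<(usize, usize)> {
    let mut segments = Vec::new();
    let mut start: Option<usize> = None; // first frame of the open segment
    let mut speech_count = 0usize;       // speech frames seen in the open segment
    let mut silence_run = 0usize;        // consecutive silent frames since last speech

    for (i, &is_speech) in decisions.iter().enumerate() {
        match (start, is_speech) {
            (None, true) => {
                start = Some(i);
                speech_count = 1;
                silence_run = 0;
            }
            (None, false) => {}
            (Some(_), true) => {
                speech_count += 1;
                silence_run = 0;
            }
            (Some(s), false) => {
                silence_run += 1;
                if silence_run > hangover_frames {
                    // Frame `i` is the first silent frame beyond the hangover,
                    // so the segment ends just before it.
                    if speech_count >= min_speech_frames {
                        segments.push((s, i));
                    }
                    start = None;
                }
            }
        }
    }
    // Flush a segment that is still open when the input ends.
    if let Some(s) = start {
        if speech_count >= min_speech_frames {
            segments.push((s, decisions.len()));
        }
    }
    segments
}
```

In this sketch a one-frame pause between words is absorbed into the surrounding segment when `hangover_frames` is 3, while an isolated single speech frame is discarded when `min_speech_frames` is 3.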

§stream_start_time: AudioTimestamp

Absolute start time for the first sample processed by this detector.

Default: AudioTimestamp::EPOCH (zero-based stream time)

Effect: Used for timestamping detected speech segments. Set this to the origin you want segment timestamps to use, or leave it as EPOCH for timestamps relative to the start of processing.

Use Cases:

Live streams: Set to a shared stream origin
Batch processing: Keep EPOCH or provide a known offset
Testing: Leave as EPOCH for deterministic timestamps
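
As an illustration of how a stream origin shifts segment timestamps, the sketch below computes the start time of a frame from its index. It uses `std::time::Duration` as a stand-in for `AudioTimestamp` and assumes frames are spaced one hop apart; both are assumptions, and `frame_start_time` is a hypothetical helper:

```rust
use std::time::Duration;

/// Timestamp of the first sample of frame `frame_index`, measured from
/// the same origin as `stream_start`. `Duration` stands in for the
/// crate's `AudioTimestamp` in this sketch.
fn frame_start_time(
    stream_start: Duration,
    frame_index: u64,
    hop_samples: u64,
    sample_rate_hz: u32,
) -> Duration {
    let offset_samples = frame_index * hop_samples;
    // Convert samples to nanoseconds in u128 to avoid overflow.
    let nanos = u128::from(offset_samples) * 1_000_000_000 / u128::from(sample_rate_hz);
    stream_start + Duration::from_nanos(nanos as u64)
}
```

With a zero origin, frame 10 at a 160-sample hop and 16kHz starts at 100ms; with an origin of 5s the same frame is stamped 5.1s.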

§pre_emphasis: Option<f32>

Optional pre-emphasis coefficient applied before analysis (high-pass filter).

Default: Some(0.97) (standard speech pre-emphasis)

Valid Range: None or Some(0.9-0.99)

Effect: Applies first-order high-pass filter: y[n] = x[n] - α*x[n-1]

Boosts high frequencies relative to low frequencies
Compensates for typical speech spectral tilt (more energy in low freqs)
Improves robustness to low-frequency rumble/hum

Recommendation:

Some(0.97): Standard for speech (default)
Some(0.95): Milder high-pass (a lower α attenuates low frequencies less; move toward 0.99 instead for heavy low-frequency noise)
None: Disable if audio already pre-emphasized or for music/non-speech
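
The filter equation above translates directly into code. This is a minimal sketch assuming the sample before the first one is zero; `pre_emphasize` is a hypothetical function, not the crate's API:

```rust
/// Apply first-order pre-emphasis, y[n] = x[n] - alpha * x[n-1],
/// treating the sample before the first one as 0.0.
/// `None` returns the input unchanged.
fn pre_emphasize(samples: &[f32], coeff: Option<f32>) -> Vec<f32> {
    let alpha = match coeff {
        Some(a) => a,
        None => return samples.to_vec(),
    };
    let mut prev = 0.0f32; // x[n-1]
    samples
        .iter()
        .map(|&x| {
            let y = x - alpha * prev;
            prev = x;
            y
        })
        .collect()
}
```

A constant (DC) input is scaled by 1 - α once the filter settles, so α = 0.97 leaves about 3% of it, while a signal alternating at the Nyquist rate is scaled by 1 + α.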

Implementations§

impl VadConfig

pub fn validate(&self) -> Result<()>
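
The checks this method performs are not listed on this page. As a rough guide to the constraints stated in the field documentation, the sketch below validates a hypothetical subset of fields; `SketchConfig`, `check_documented_ranges`, the error type, and the tolerance on the weight sum are all assumptions and do not reflect the crate's actual `validate` logic:

```rust
/// Hypothetical subset of the documented fields; not the crate's `VadConfig`.
struct SketchConfig {
    sample_rate: u32,
    frame_overlap: f32,
    activation_margin: f32,
    release_margin: f32,
    energy_weight: f32,
    flux_weight: f32,
}

/// Check the ranges stated in the field documentation.
/// Comparisons are written so that NaN values are rejected.
fn check_documented_ranges(c: &SketchConfig) -> Result<(), String> {
    if !(8_000..=48_000).contains(&c.sample_rate) {
        return Err(format!("sample_rate {} is outside 8000-48000 Hz", c.sample_rate));
    }
    if !(0.0..1.0).contains(&c.frame_overlap) {
        return Err(format!("frame_overlap {} is outside [0.0, 1.0)", c.frame_overlap));
    }
    if !(c.activation_margin >= 1.0) {
        return Err(format!("activation_margin {} is below 1.0", c.activation_margin));
    }
    if !(c.release_margin > 0.0 && c.release_margin <= c.activation_margin) {
        return Err(format!(
            "release_margin {} must be > 0.0 and <= activation_margin",
            c.release_margin
        ));
    }
    // Assumed tolerance; the documentation only says the weights "should" sum to 1.0.
    if !((c.energy_weight + c.flux_weight - 1.0).abs() <= 1e-3) {
        return Err("energy_weight + flux_weight should sum to 1.0".to_string());
    }
    Ok(())
}
```

Run against the documented defaults (16000 Hz, 0.5 overlap, margins 1.1 and 0.9, weights 0.6 and 0.4), this sketch returns `Ok(())`.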

pub fn frame_length_samples(&self) -> Result<usize>

pub fn hop_length_samples(&self) -> Result<usize>

pub fn fft_size(&self) -> Result<usize>

Trait Implementations§

impl Clone for VadConfig

fn clone(&self) -> VadConfig

fn clone_from(&mut self, source: &Self)

impl Debug for VadConfig

fn fmt(&self, f: &mut Formatter<'_>) -> Result

impl Default for VadConfig

fn default() -> Self

impl Copy for VadConfig

Auto Trait Implementations§

impl Freeze for VadConfig

impl RefUnwindSafe for VadConfig

impl Send for VadConfig

impl Sync for VadConfig

impl Unpin for VadConfig

impl UnsafeUnpin for VadConfig

impl UnwindSafe for VadConfig

Blanket Implementations§

impl<T> Any for T where T: 'static + ?Sized

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T where T: ?Sized

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T where T: ?Sized

fn borrow_mut(&mut self) -> &mut T

impl<T> CloneToUninit for T where T: Clone

unsafe fn clone_to_uninit(&self, dest: *mut u8)

impl<T> From<T> for T

fn from(t: T) -> T

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

fn in_current_span(self) -> Instrumented<Self>

impl<T, U> Into<U> for T where U: From<T>

fn into(self) -> U

impl<F, T> IntoSample<T> for F where T: FromSample<F>

fn into_sample(self) -> T

impl<T> ToOwned for T where T: Clone

type Owned = T

fn to_owned(&self) -> T

fn clone_into(&self, target: &mut T)

impl<T, U> TryFrom<U> for T where U: Into<T>

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for T where U: TryFrom<T>

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self> where S: Into<Dispatch>

fn with_current_subscriber(self) -> WithDispatch<Self>
