loqa-voice-dsp
Shared DSP library for voice analysis, providing core digital signal processing functionality for both Loqa backend and VoiceFind mobile app.
Features
- Pitch Detection: YIN and pYIN algorithms for fundamental frequency (F0) estimation
- NEW in v0.3.0: Stateful VoiceAnalyzer API for streaming analysis
- NEW in v0.3.0: pYIN (probabilistic YIN) for better noisy/breathy voice detection
- NEW in v0.3.0: Configurable algorithm selection (Auto/pYIN/YIN/Autocorr)
- NEW in v0.4.0: Custom pYIN implementation optimized for voice (no external dependencies)
- Formant Extraction: Linear Predictive Coding (LPC) for formant analysis
- FFT Utilities: Fast Fourier Transform for spectral analysis
- Spectral Analysis: Spectral centroid, tilt, and rolloff calculations
- HNR (Harmonics-to-Noise Ratio): Breathiness measurement using Boersma's autocorrelation method
- H1-H2 Amplitude Difference: Vocal weight analysis (lighter vs fuller voice quality)
Installation
iOS (CocoaPods)
Add to your Podfile (the pod name shown is a placeholder; check the published podspec for the exact name):

```ruby
pod 'LoqaVoiceDSP', '~> 0.3.0'  # pod name assumed
```

Then run:

```sh
pod install
```
iOS (Swift Package Manager)
In Xcode:
- File → Add Packages
- Enter repository URL: https://github.com/loqalabs/loqa
- Select version: 0.3.0 or later

Or add to Package.swift:

```swift
dependencies: [
    .package(url: "https://github.com/loqalabs/loqa", from: "0.3.0")
]
```
Rust (Cargo)
Add to your Cargo.toml:

```toml
[dependencies]
loqa-voice-dsp = "0.3.0"
```
Usage
Buffer Size Recommendations
Pitch detection algorithms analyze buffers in frames. For best results:
- Recommended: 2048-4096 samples (~46-93ms @ 44100 Hz)
- Minimum: 1024 samples for real-time applications
- Maximum: 4096 samples per frame (accuracy degrades beyond this)
Why this matters:
- Large buffers (>4096) may contain pitch variations, voiced/unvoiced transitions, or multiple syllables
- The algorithm requires buffer size ≥ 2× longest period for the lowest expected frequency:
- For 80 Hz (male voice): minimum ~1103 samples at 44.1 kHz
- For 400 Hz (female voice): minimum ~221 samples at 44.1 kHz
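The two-period rule above is easy to compute directly; a standalone sketch (not part of the library API):

```rust
/// Minimum buffer length: at least two periods of the lowest expected F0.
fn min_buffer_samples(sample_rate: u32, min_freq_hz: f64) -> usize {
    (2.0 * f64::from(sample_rate) / min_freq_hz).ceil() as usize
}
```

At 44.1 kHz this gives 1103 samples for an 80 Hz floor and 221 samples for a 400 Hz floor, matching the figures above; at 16 kHz an 80 Hz floor needs only 400 samples.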
Frame-based analysis for long audio (v0.3.0+):
For buffers larger than 4096 samples, use the new VoiceAnalyzer API:

```rust
// Type and method names reconstructed from context; see the crate docs
use loqa_voice_dsp::{AnalyzerConfig, VoiceAnalyzer};

let config = AnalyzerConfig::default()
    .with_frame_size(2048)
    .with_hop_size(1024); // 50% overlap
let mut analyzer = VoiceAnalyzer::new(config)?;
let results = analyzer.process_stream(&audio_samples);
```
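For planning buffer sizes, the number of frames produced by overlapped analysis follows the usual sliding-window formula (illustrative helper, not a crate function):

```rust
/// Frames produced when sliding a `frame`-sample window forward by `hop` samples.
fn frame_count(total_samples: usize, frame: usize, hop: usize) -> usize {
    if total_samples < frame {
        0 // not enough audio for a single full frame
    } else {
        (total_samples - frame) / hop + 1
    }
}
```

One second of audio at 16 kHz with 2048-sample frames and a 1024-sample hop (50% overlap) yields 14 frames.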
Legacy approach (v0.2.x): split the audio into frames of at most 4096 samples yourself and call the single-shot functions shown under "Legacy API" below.
Rust (Loqa Backend)
New in v0.3.0 - Stateful API:

```rust
// Names reconstructed from context; see the crate docs for exact signatures
use loqa_voice_dsp::{AnalyzerConfig, PitchAlgorithm, VoiceAnalyzer};

let audio_samples: Vec<f32> = /* your audio data */;

// Create analyzer with pYIN algorithm
let config = AnalyzerConfig::default()
    .with_sample_rate(16000)
    .with_frame_size(2048)
    .with_algorithm(PitchAlgorithm::Pyin);
let mut analyzer = VoiceAnalyzer::new(config)?;

// Process single frame
let pitch = analyzer.process_frame(&audio_samples[..2048])?;
println!("Pitch: {} Hz", pitch.frequency);
println!("Confidence: {}", pitch.confidence);
println!("Voiced probability: {}", pitch.voiced_probability);

// Or process a stream
let results = analyzer.process_stream(&audio_samples);
for (i, result) in results.iter().enumerate() {
    // handle each frame's pitch result
}
```
Legacy API (still supported):

```rust
// Signatures reconstructed from the FFI layer; see the crate docs
use loqa_voice_dsp::{calculate_h1h2, calculate_hnr, compute_fft, detect_pitch, extract_formants};

let audio_samples: Vec<f32> = /* your audio data */;
let sample_rate = 16000;

// Pitch detection (single-shot)
let pitch = detect_pitch(&audio_samples, sample_rate, 80.0, 400.0)?;
println!("Pitch: {} Hz (confidence {})", pitch.frequency, pitch.confidence);

// Formant extraction
let formants = extract_formants(&audio_samples, sample_rate, 14)?; // LPC order 14
println!("Formants: {:?}", formants);

// HNR (breathiness)
let hnr = calculate_hnr(&audio_samples, sample_rate, 75.0, 500.0)?;
println!("HNR: {} dB", hnr);

// H1-H2 (vocal weight); pass 0.0 to auto-detect F0
let h1h2 = calculate_h1h2(&audio_samples, sample_rate, 0.0)?;
println!("H1-H2: {} dB", h1h2);

// FFT
let fft_result = compute_fft(&audio_samples, sample_rate, 2048)?;
```
iOS (Swift via FFI)
New in v0.3.0 - Stateful Analyzer:
```swift
// Create analyzer configuration
var config = loqa_analysis_config_default()
config.algorithm = 1 // 0=Auto, 1=PYIN, 2=YIN, 3=Autocorr
config.frame_size = 2048
config.sample_rate = 16000

// Create analyzer
let analyzer = loqa_voice_analyzer_new(config)
defer { loqa_voice_analyzer_free(analyzer) } // Always free

// Process single frame
let pitchResult = samples.withUnsafeBufferPointer { buffer in
    loqa_voice_analyzer_process_frame(
        analyzer,
        buffer.baseAddress!,
        buffer.count
    )
}

if pitchResult.success {
    print("Pitch: \(pitchResult.frequency) Hz")
    print("Confidence: \(pitchResult.confidence)")
    print("Voiced Probability: \(pitchResult.voiced_probability)")
}

// Or process stream
var results = [PitchResultFFI](repeating: PitchResultFFI(), count: 100)
let count = samples.withUnsafeBufferPointer { buffer in
    results.withUnsafeMutableBufferPointer { resultsBuffer in
        loqa_voice_analyzer_process_stream(
            analyzer,
            buffer.baseAddress!,
            buffer.count,
            resultsBuffer.baseAddress!,
            100
        )
    }
}
print("Got \(count) pitch results")
```
Legacy API (still supported):
```swift
// Call C-compatible FFI functions
let samples: [Float] = /* your audio data */

// Pitch detection
let pitchResult = samples.withUnsafeBufferPointer { buffer in
    loqa_detect_pitch(
        buffer.baseAddress!,
        buffer.count,
        16000, // sample rate
        80.0,  // min freq
        400.0  // max freq
    )
}
if pitchResult.success {
    print("Pitch: \(pitchResult.frequency) Hz, Confidence: \(pitchResult.confidence)")
}

// HNR (breathiness)
let hnrResult = samples.withUnsafeBufferPointer { buffer in
    loqa_calculate_hnr(
        buffer.baseAddress!,
        buffer.count,
        16000, // sample rate
        75.0,  // min freq
        500.0  // max freq
    )
}
if hnrResult.success {
    print("HNR: \(hnrResult.hnr) dB, Voiced: \(hnrResult.is_voiced)")
}

// H1-H2 (vocal weight) - pass 0.0 for f0 to auto-detect
let h1h2Result = samples.withUnsafeBufferPointer { buffer in
    loqa_calculate_h1h2(
        buffer.baseAddress!,
        buffer.count,
        16000, // sample rate
        pitchResult.frequency // use detected pitch, or 0.0 to auto-detect
    )
}
if h1h2Result.success {
    print("H1-H2: \(h1h2Result.h1h2) dB")
}
```
Android (Java via JNI)
```java
// Build with --features android-jni
import com.loqalabs.voicefind.VoiceFindDSP; // package path assumed

float[] audioSamples = /* your audio data */;
VoiceFindDSP.PitchResult pitch = VoiceFindDSP.detectPitch(audioSamples, 16000, 80.0f, 400.0f); // method name assumed
System.out.println("Pitch: " + pitch.frequency + " Hz");
```
Note: Android JNI requires building with `--features android-jni`
FFI Safety & Parameter Validation
FFI Safety Requirements
Critical: All FFI structs use `#[repr(C)]` to ensure C-compatible memory layout. Failure to maintain this can cause alignment issues and incorrect values (see historical issues #1, #2, #3).
Memory safety:
- All FFI functions validate null pointers before dereferencing
- FFT results (`loqa_compute_fft`) allocate memory that must be freed using `loqa_free_fft_result`
- Never free FFT results more than once
- Never use FFT result pointers after calling `loqa_free_fft_result`
Swift/iOS example with proper cleanup:
```swift
var fftResult = loqa_compute_fft(buffer, count, sampleRate, fftSize) // var: freed by address below
defer { loqa_free_fft_result(&fftResult) } // Always free, exactly once
if fftResult.success {
    let spectral = loqa_analyze_spectrum(&fftResult)
    // Use spectral features...
}
```
Parameter Validation Ranges
Important: All validation happens in the Rust core. Higher-level layers (Swift/TypeScript) should trust Rust validation rather than implementing their own rules.
Pitch Detection (loqa_detect_pitch)
| Parameter | Valid Range | Recommended | Notes |
|---|---|---|---|
| `buffer_size` | ≥ 100 samples | 2048-4096 samples | See "Buffer Size Recommendations" above |
| `sample_rate` | 8000-96000 Hz | 16000-44100 Hz | Higher rates support higher frequency ranges |
| `min_frequency` | 20-4000 Hz | 80 Hz (male voice) | Must be < `max_frequency` |
| `max_frequency` | 40-8000 Hz | 400 Hz (voice) | Must be > `min_frequency` |
Formant Extraction (loqa_extract_formants)
| Parameter | Valid Range | Recommended | Notes |
|---|---|---|---|
| `buffer_size` | ≥ 2048 samples | 2048-4096 samples | Larger buffers improve formant resolution |
| `sample_rate` | 8000-96000 Hz | 16000-44100 Hz | Higher rates capture higher formants |
| `lpc_order` | 8-24 | 12-16 | NOT `sample_rate / 1000` - use the fixed range instead |
Historical Note: Issue loqa-expo-dsp#8 - TypeScript calculated lpc_order = sample_rate / 1000 + 2 which gave 46 for 44.1kHz. Swift layer rejected this as out of range, causing all calls to fail. Solution: Use fixed range 8-24 for all sample rates.
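The mismatch is easy to reproduce; a sketch contrasting the buggy cross-layer formula with the fixed clamp (function names hypothetical, for illustration only):

```rust
// Buggy cross-layer rule: LPC order grows with sample rate (46 at 44.1 kHz),
// which the Swift layer then rejected as out of range.
fn lpc_order_buggy(sample_rate: u32) -> u32 {
    sample_rate / 1000 + 2
}

// Fixed rule: clamp into the 8-24 range regardless of sample rate.
fn lpc_order_fixed(requested: u32) -> u32 {
    requested.clamp(8, 24)
}
```

Keeping one rule in the Rust core avoids the two layers disagreeing about what is valid.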
FFT (loqa_compute_fft)
| Parameter | Valid Range | Recommended | Notes |
|---|---|---|---|
| `buffer_size` | ≥ `fft_size` | = `fft_size` | Larger buffers are truncated |
| `sample_rate` | 8000-96000 Hz | 16000-48000 Hz | Affects frequency bin resolution |
| `fft_size` | Power of 2: 512-8192 | 2048 or 4096 | Non-power-of-2 may fail (impl-specific) |
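A caller-side pre-check for the `fft_size` constraint might look like the following (illustrative only; actual validation happens in the Rust core):

```rust
/// True when `n` is a power of two within the accepted 512-8192 range.
fn valid_fft_size(n: usize) -> bool {
    n.is_power_of_two() && (512..=8192).contains(&n)
}
```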
HNR Calculation (loqa_calculate_hnr)
| Parameter | Valid Range | Recommended | Notes |
|---|---|---|---|
| `buffer_size` | ≥ 2048 samples | 2048-4096 | Needs multiple pitch periods |
| `sample_rate` | 8000-96000 Hz | 16000 Hz | Standard voice analysis rate |
| `min_frequency` | 50-300 Hz | 75 Hz | Lowest expected F0 |
| `max_frequency` | 200-600 Hz | 500 Hz | Highest expected F0 |
H1-H2 Calculation (loqa_calculate_h1h2)
| Parameter | Valid Range | Recommended | Notes |
|---|---|---|---|
| `buffer_size` | ≥ 2048 samples | 4096 samples | Needs good spectral resolution |
| `sample_rate` | 8000-96000 Hz | 16000-44100 Hz | Higher rates improve harmonic resolution |
| `f0` | 0.0 or 50-800 Hz | Detected pitch | Pass 0.0 for auto-detect, or provide known F0 |
Auto-detect F0: Pass 0.0 (or any negative value) for f0 parameter to automatically detect pitch before calculating H1-H2.
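Conceptually, H1-H2 compares the magnitudes of the first two harmonics of F0 in the spectrum. A simplified sketch (nearest-bin picking only, with no interpolation or formant correction; not the library's implementation):

```rust
/// Nearest FFT bin for a given frequency.
fn harmonic_bin(freq_hz: f64, sample_rate: f64, fft_size: usize) -> usize {
    (freq_hz * fft_size as f64 / sample_rate).round() as usize
}

/// H1-H2 in dB from a magnitude spectrum and a known F0:
/// positive values mean the fundamental dominates (lighter voice quality).
fn h1_h2_db(magnitude: &[f64], f0: f64, sample_rate: f64, fft_size: usize) -> f64 {
    let h1 = magnitude[harmonic_bin(f0, sample_rate, fft_size)];
    let h2 = magnitude[harmonic_bin(2.0 * f0, sample_rate, fft_size)];
    20.0 * (h1 / h2).log10()
}
```

For example, with F0 = 200 Hz at 16 kHz and a 4096-point FFT, H1 sits near bin 51 and H2 near bin 102; a 10:1 magnitude ratio gives H1-H2 = 20 dB.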
Common FFI Pitfalls (Lessons Learned)
1. Struct Alignment Issues (Fixed in v0.2.1)
- Problem: Missing `#[repr(C)]` caused field misalignment
- Symptom: Correct frequency in Rust, wrong value in Swift/Java
- Solution: All FFI structs now have `#[repr(C)]` - verified by CI tests
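For illustration, a C-compatible result struct along the lines of the `PitchResultFFI` used in the Swift examples (field set assumed from those examples; the crate's actual definition may differ):

```rust
/// `#[repr(C)]` guarantees C field ordering, alignment, and padding,
/// so Swift/Java read each field at the offset Rust wrote it to.
#[repr(C)]
pub struct PitchResultFFI {
    pub frequency: f32,
    pub confidence: f32,
    pub voiced_probability: f32,
    pub success: bool, // one byte plus three bytes of tail padding
}
```

Without `#[repr(C)]`, Rust is free to reorder fields, which is exactly the misalignment described above.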
2. Parameter Validation Mismatches (Fixed in v0.2.2)
- Problem: TypeScript/Swift layers calculated different validation rules than Rust
- Symptom: Valid parameters rejected, invalid parameters accepted
- Solution: Single source of truth - Rust core validates, higher layers trust it
3. Buffer Size Confusion (Documented in v0.2.2)
- Problem: Users passing large buffers (16384 samples) got false negatives
- Symptom: Pitch detection failed despite valid audio
- Solution: Documentation + optional validation warnings (see Issue #5)
4. Memory Leaks with FFT (Prevented by design)
- Problem: Forgetting to free FFT results leaks ~16-32KB per call
- Symptom: Gradual memory growth in long-running apps
- Solution: Always use Swift `defer` or RAII patterns to ensure cleanup
Implementation Status
- Crate structure created
- Pitch detection (YIN + autocorrelation)
- Formant extraction (LPC-based)
- FFT utilities
- Spectral analysis (centroid, tilt, rolloff)
- HNR calculation (Boersma's autocorrelation method)
- H1-H2 amplitude difference (vocal weight)
- iOS FFI layer (C exports for all functions)
- Android JNI layer (with jni feature)
- Unit tests (68 passing)
- FFI integration tests (9 passing)
- SVD consistency tests (5 passing)
- Synthetic consistency tests (4 passing)
- Documentation tests (8 passing)
- Benchmarks harness
- Performance benchmarks (validated)
Performance Benchmarks
Validated Performance (2025-11-07) - All targets exceeded ✅
| Operation | Target | Actual (mean) | Result | Speedup |
|---|---|---|---|---|
| Pitch detection (100ms audio) | <20ms | 0.125ms | ✅ PASS | 160x faster |
| Formant extraction (500ms audio) | <50ms | 0.134ms | ✅ PASS | 373x faster |
| FFT (2048 points) | <10ms | ~0.020ms | ✅ PASS | 500x faster |
| Spectral analysis | <5ms | ~0.003ms | ✅ PASS | 1667x faster |
| HNR calculation (100ms window) | <30ms | <1ms | ✅ PASS | >30x faster |
| H1-H2 with F0 provided | <20ms | <1ms | ✅ PASS | >20x faster |
Note: Benchmarks run on Apple M-series silicon. All latency targets easily met with significant performance headroom for real-time voice processing.
Algorithm Details
Custom pYIN Implementation (v0.4.0+)
Starting in v0.4.0, we use a custom pYIN implementation optimized for voice analysis, removing the external pyin crate dependency.
What is pYIN?
pYIN (Mauch & Dixon, 2014) extends the YIN pitch detection algorithm to produce probabilistic pitch estimates, making it more robust for noisy or breathy voice signals.
Key Differences from Standard YIN:
- YIN: Returns single pitch estimate per frame
- pYIN: Returns multiple pitch candidates with probabilities, then uses Hidden Markov Model (HMM) to find the smoothest pitch track
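To make the YIN half concrete, here is a minimal sketch of its core: squared-difference function, cumulative mean normalization, then the first dip under an absolute threshold. It mirrors the idea, not this crate's implementation:

```rust
/// Estimate the pitch period (in samples) of `x` via the YIN procedure.
fn yin_period(x: &[f32], max_tau: usize, threshold: f32) -> Option<usize> {
    let w = x.len() - max_tau; // comparison window length
    // Squared-difference function d(tau)
    let mut d = vec![0.0f32; max_tau + 1];
    for tau in 1..=max_tau {
        for i in 0..w {
            let diff = x[i] - x[i + tau];
            d[tau] += diff * diff;
        }
    }
    // Cumulative mean normalized difference: d'(tau) = d(tau) * tau / sum(d[1..=tau])
    let mut cmnd = vec![1.0f32; max_tau + 1];
    let mut running = 0.0f32;
    for tau in 1..=max_tau {
        running += d[tau];
        cmnd[tau] = d[tau] * tau as f32 / running;
    }
    // Absolute threshold: first local minimum below the threshold
    for tau in 2..max_tau {
        if cmnd[tau] < threshold && cmnd[tau] <= cmnd[tau + 1] {
            return Some(tau);
        }
    }
    None
}

/// Synthetic sine helper for trying the sketch out.
fn synth_sine(freq: f32, sample_rate: f32, len: usize) -> Vec<f32> {
    (0..len)
        .map(|n| (2.0 * std::f32::consts::PI * freq * n as f32 / sample_rate).sin())
        .collect()
}
```

A 200 Hz sine at 16 kHz has an 80-sample period, so `yin_period` returns a lag near 80; the frequency is then `sample_rate / tau`. pYIN replaces the single fixed threshold with many thresholds drawn from the Beta prior, turning each frame into a set of candidates with probabilities.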
Our Voice-Optimized Implementation:
- Two-Stage Process:
  - Stage 1: Generate multiple pitch candidates using a Beta distribution β(2,18) over thresholds
  - Stage 2: Use the Viterbi algorithm on the HMM to find the optimal pitch track
- Voice-Specific Optimizations:
  - Narrower frequency range (80-400 Hz vs. general audio 50-2000 Hz)
  - Tighter HMM transition constraints (voice pitch changes slowly)
  - Voice-tuned Beta distribution concentrating probability near threshold = 0.1
- Benefits:
  - No external dependencies - fully integrated implementation
  - Better handling of breathy voice - multiple candidates provide robustness
  - Smoother pitch tracks - HMM enforces temporal consistency
  - Voiced probability per frame - soft voiced/unvoiced decisions (not just binary)
  - Smaller binary size - only includes what we need for voice
- Performance:
  - ~65-67 µs per 100ms frame (16 kHz sample rate)
  - ~1.5-2x overhead vs. standard YIN (an acceptable tradeoff for improved accuracy)
  - Still meets the 160-500x real-time performance target
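The HMM stage can be sketched as plain Viterbi decoding over per-frame `(frequency, probability)` candidates, with a transition cost that penalizes large pitch jumps (a simplified illustration, not the crate's implementation):

```rust
/// Pick one candidate per frame so that candidate probability is high and
/// frame-to-frame pitch jumps (measured in octaves) are small.
fn viterbi_pitch(frames: &[Vec<(f64, f64)>], octave_penalty: f64) -> Vec<f64> {
    // Initialize with log-probabilities of the first frame's candidates
    let mut score: Vec<f64> = frames[0].iter().map(|&(_, p)| p.ln()).collect();
    let mut back: Vec<Vec<usize>> = Vec::new();
    for t in 1..frames.len() {
        let mut next = vec![f64::NEG_INFINITY; frames[t].len()];
        let mut bp = vec![0usize; frames[t].len()];
        for (j, &(fj, pj)) in frames[t].iter().enumerate() {
            for (i, &(fi, _)) in frames[t - 1].iter().enumerate() {
                // Transition cost grows with the pitch jump in octaves
                let s = score[i] + pj.ln() - octave_penalty * (fj / fi).log2().abs();
                if s > next[j] {
                    next[j] = s;
                    bp[j] = i;
                }
            }
        }
        score = next;
        back.push(bp);
    }
    // Backtrack the best-scoring path
    let mut idx = (0..score.len())
        .max_by(|&a, &b| score[a].partial_cmp(&score[b]).unwrap())
        .unwrap();
    let mut path = vec![idx];
    for bp in back.iter().rev() {
        idx = bp[idx];
        path.push(idx);
    }
    path.reverse();
    path.iter().zip(frames).map(|(&i, f)| f[i].0).collect()
}
```

With the jump penalty enabled, an isolated octave error in the middle of a track is overridden by temporal consistency; with the penalty at zero, the decoder degenerates to per-frame greedy picking.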
References:
- Mauch, M. & Dixon, S. (2014). pYIN: A Fundamental Frequency Estimator Using Probabilistic Threshold Distributions. Proc. ICASSP 2014.
- Boersma, P. (1993). Accurate Short-Term Analysis of the Fundamental Frequency and the Harmonics-to-Noise Ratio of a Sampled Sound. Proc. Institute of Phonetic Sciences, University of Amsterdam, 17.
Acoustic Measures Reference
HNR (Harmonics-to-Noise Ratio)
Measures the ratio of harmonic (periodic) to noise (aperiodic) energy in voice - the primary acoustic indicator of breathiness.
| HNR Range | Interpretation |
|---|---|
| 18-25+ dB | Clear, less breathy voice |
| 12-18 dB | Moderate breathiness |
| <10 dB | Very breathy or pathological voice |
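Boersma's method derives HNR from the height r_max of the normalized autocorrelation peak at the pitch period: HNR = 10·log10(r_max / (1 − r_max)). A direct transcription of that formula:

```rust
/// HNR in dB from the normalized autocorrelation peak (0 < r_max < 1):
/// r_max is the fraction of signal energy that is periodic.
fn hnr_db(r_max: f64) -> f64 {
    10.0 * (r_max / (1.0 - r_max)).log10()
}
```

r_max = 0.99 (99% periodic energy) gives ~20 dB, a clear voice; r_max = 0.5 gives 0 dB, equal harmonic and noise energy.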
H1-H2 (First/Second Harmonic Difference)
Measures the amplitude difference between the fundamental and second harmonic - indicates vocal weight.
| H1-H2 Range | Interpretation |
|---|---|
| >5 dB | Lighter, breathier vocal quality |
| 0-5 dB | Balanced vocal weight |
| <0 dB | Fuller, heavier vocal quality |
Test Data
Saarbrücken Voice Database
This library uses samples from the Saarbrücken Voice Database for consistency validation testing.
License: CC BY 4.0
Attribution: Pützer, M. & Barry, W.J., Former Institute of Phonetics, Saarland University. Available at Zenodo.
The SVD provides lab-quality voice recordings including:
- Sustained vowels (/a:/, /i:/, /u:/) at low, normal, and high pitch
- 851 healthy control speakers
- 1002 speakers with documented voice pathologies
- 50 kHz sample rate, controlled recording conditions
Setting Up Test Data
```sh
# 1. Download SVD from Zenodo (CC BY 4.0 license)
#    https://zenodo.org/records/16874898
# 2. Install conversion dependencies
# 3. Convert SVD files to test format
```
Test Sample Requirements
For comprehensive validation, the library needs test samples with these characteristics:
| Function | Sample Requirements | Recommended Datasets |
|---|---|---|
| Pitch Detection | Male (80-180 Hz), Female (160-300 Hz), varied intonation | Saarbrücken Voice Database, PTDB-TUG |
| Formant Extraction | Sustained vowels /a/, /i/, /u/, /e/, /o/ from multiple speakers | Hillenbrand Vowel Database, VTR-TIMIT |
| HNR | Breathy, modal, and clear voice qualities | Saarbrücken Voice Database |
| H1-H2 | Light to full voice qualities, different phonation types | UCLA Voice Quality Database, VoiceSauce reference recordings |
| Spectral | Dark to bright voice qualities | Voice quality databases with perceptual labels |
Development
```sh
# Build
cargo build --release

# Test
cargo test

# Benchmark
cargo bench

# Documentation
cargo doc --open
```
License
MIT