loqa-voice-dsp
Shared DSP library for voice analysis, providing core digital signal processing functionality for both Loqa backend and VoiceFind mobile app.
Features
- Pitch Detection: YIN and pYIN algorithms for fundamental frequency (F0) estimation
- NEW in v0.3.0: Stateful VoiceAnalyzer API for streaming analysis
- NEW in v0.3.0: pYIN (probabilistic YIN) for better noisy/breathy voice detection
- NEW in v0.3.0: Configurable algorithm selection (Auto/pYIN/YIN/Autocorr)
- NEW in v0.4.0: Custom pYIN implementation optimized for voice (no external dependencies)
- Formant Extraction: Linear Predictive Coding (LPC) for formant analysis
- FFT Utilities: Fast Fourier Transform for spectral analysis
- Spectral Analysis: Spectral centroid, tilt, and rolloff calculations
- HNR (Harmonics-to-Noise Ratio): Breathiness measurement using Boersma's autocorrelation method
- H1-H2 Amplitude Difference: Vocal weight analysis (lighter vs fuller voice quality)
Installation
iOS (CocoaPods)
Add to your Podfile (the pod name shown is a placeholder; check the published podspec for the exact name):

```ruby
pod 'LoqaVoiceDSP', '~> 0.3.0'  # pod name assumed
```

Then run:

```sh
pod install
```
iOS (Swift Package Manager)
In Xcode:
- File → Add Packages
- Enter repository URL: https://github.com/loqalabs/loqa
- Select version: 0.3.0 or later

Or add to Package.swift:

```swift
dependencies: [
    .package(url: "https://github.com/loqalabs/loqa", from: "0.3.0")
]
```
Rust (Cargo)
Add to your Cargo.toml:

```toml
[dependencies]
loqa-voice-dsp = "0.3.0"
```
Usage
Buffer Size Recommendations
Pitch detection algorithms analyze buffers in frames. For best results:
- Recommended: 2048-4096 samples (~46-93ms @ 44100 Hz)
- Minimum: 1024 samples for real-time applications
- Maximum: 4096 samples per frame (accuracy degrades beyond this)
Why this matters:
- Large buffers (>4096) may contain pitch variations, voiced/unvoiced transitions, or multiple syllables
- The algorithm requires buffer size ≥ 2× longest period for the lowest expected frequency:
- For 80 Hz (male voice): minimum ~1103 samples at 44.1 kHz
- For 400 Hz (female voice): minimum ~221 samples at 44.1 kHz
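The two-period rule above is easy to compute directly; a standalone sketch (not part of the library API):

```rust
/// Minimum buffer length: at least two periods of the lowest expected F0.
fn min_buffer_samples(sample_rate: u32, min_freq_hz: f64) -> usize {
    (2.0 * f64::from(sample_rate) / min_freq_hz).ceil() as usize
}
```

At 44.1 kHz this gives 1103 samples for an 80 Hz floor and 221 samples for a 400 Hz floor, matching the figures above; at 16 kHz an 80 Hz floor needs only 400 samples.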
Frame-based analysis for long audio (v0.3.0+):
For buffers larger than 4096 samples, use the new VoiceAnalyzer API:

```rust
// Type and method names reconstructed from context; see the crate docs
use loqa_voice_dsp::{AnalyzerConfig, VoiceAnalyzer};

let config = AnalyzerConfig::default()
    .with_frame_size(2048)
    .with_hop_size(1024); // 50% overlap
let mut analyzer = VoiceAnalyzer::new(config)?;
let results = analyzer.process_stream(&audio_samples);
```
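For planning buffer sizes, the number of frames produced by overlapped analysis follows the usual sliding-window formula (illustrative helper, not a crate function):

```rust
/// Frames produced when sliding a `frame`-sample window forward by `hop` samples.
fn frame_count(total_samples: usize, frame: usize, hop: usize) -> usize {
    if total_samples < frame {
        0 // not enough audio for a single full frame
    } else {
        (total_samples - frame) / hop + 1
    }
}
```

One second of audio at 16 kHz with 2048-sample frames and a 1024-sample hop (50% overlap) yields 14 frames.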
Legacy approach (v0.2.x): split the audio into frames of at most 4096 samples yourself and call the single-shot functions shown under "Legacy API" below.
Rust (Loqa Backend)
New in v0.3.0 - Stateful API:

```rust
// Names reconstructed from context; see the crate docs for exact signatures
use loqa_voice_dsp::{AnalyzerConfig, PitchAlgorithm, VoiceAnalyzer};

let audio_samples: Vec<f32> = /* your audio data */;

// Create analyzer with pYIN algorithm
let config = AnalyzerConfig::default()
    .with_sample_rate(16000)
    .with_frame_size(2048)
    .with_algorithm(PitchAlgorithm::Pyin);
let mut analyzer = VoiceAnalyzer::new(config)?;

// Process single frame
let pitch = analyzer.process_frame(&audio_samples[..2048])?;
println!("Pitch: {} Hz", pitch.frequency);
println!("Confidence: {}", pitch.confidence);
println!("Voiced probability: {}", pitch.voiced_probability);

// Or process a stream
let results = analyzer.process_stream(&audio_samples);
for (i, result) in results.iter().enumerate() {
    // handle each frame's pitch result
}
```
Legacy API (still supported):

```rust
// Signatures reconstructed from the FFI layer; see the crate docs
use loqa_voice_dsp::{calculate_h1h2, calculate_hnr, compute_fft, detect_pitch, extract_formants};

let audio_samples: Vec<f32> = /* your audio data */;
let sample_rate = 16000;

// Pitch detection (single-shot)
let pitch = detect_pitch(&audio_samples, sample_rate, 80.0, 400.0)?;
println!("Pitch: {} Hz (confidence {})", pitch.frequency, pitch.confidence);

// Formant extraction
let formants = extract_formants(&audio_samples, sample_rate, 14)?; // LPC order 14
println!("Formants: {:?}", formants);

// HNR (breathiness)
let hnr = calculate_hnr(&audio_samples, sample_rate, 75.0, 500.0)?;
println!("HNR: {} dB", hnr);

// H1-H2 (vocal weight); pass 0.0 to auto-detect F0
let h1h2 = calculate_h1h2(&audio_samples, sample_rate, 0.0)?;
println!("H1-H2: {} dB", h1h2);

// FFT
let fft_result = compute_fft(&audio_samples, sample_rate, 2048)?;
```
iOS (Swift via FFI)
New in v0.3.0 - Stateful Analyzer:
```swift
// Create analyzer configuration
var config = loqa_analysis_config_default()
config.algorithm = 1 // 0=Auto, 1=PYIN, 2=YIN, 3=Autocorr
config.frame_size = 2048
config.sample_rate = 16000

// Create analyzer
let analyzer = loqa_voice_analyzer_new(config)
defer { loqa_voice_analyzer_free(analyzer) } // Always free

// Process single frame
let pitchResult = samples.withUnsafeBufferPointer { buffer in
    loqa_voice_analyzer_process_frame(
        analyzer,
        buffer.baseAddress!,
        buffer.count
    )
}

if pitchResult.success {
    print("Pitch: \(pitchResult.frequency) Hz")
    print("Confidence: \(pitchResult.confidence)")
    print("Voiced Probability: \(pitchResult.voiced_probability)")
}

// Or process stream
var results = [PitchResultFFI](repeating: PitchResultFFI(), count: 100)
let count = samples.withUnsafeBufferPointer { buffer in
    results.withUnsafeMutableBufferPointer { resultsBuffer in
        loqa_voice_analyzer_process_stream(
            analyzer,
            buffer.baseAddress!,
            buffer.count,
            resultsBuffer.baseAddress!,
            100
        )
    }
}
print("Got \(count) pitch results")
```
Legacy API (still supported):
```swift
// Call C-compatible FFI functions
let samples: [Float] = /* your audio data */

// Pitch detection
let pitchResult = samples.withUnsafeBufferPointer { buffer in
    loqa_detect_pitch(
        buffer.baseAddress!,
        buffer.count,
        16000, // sample rate
        80.0,  // min freq
        400.0  // max freq
    )
}
if pitchResult.success {
    print("Pitch: \(pitchResult.frequency) Hz, Confidence: \(pitchResult.confidence)")
}

// HNR (breathiness)
let hnrResult = samples.withUnsafeBufferPointer { buffer in
    loqa_calculate_hnr(
        buffer.baseAddress!,
        buffer.count,
        16000, // sample rate
        75.0,  // min freq
        500.0  // max freq
    )
}
if hnrResult.success {
    print("HNR: \(hnrResult.hnr) dB, Voiced: \(hnrResult.is_voiced)")
}

// H1-H2 (vocal weight) - pass 0.0 for f0 to auto-detect
let h1h2Result = samples.withUnsafeBufferPointer { buffer in
    loqa_calculate_h1h2(
        buffer.baseAddress!,
        buffer.count,
        16000, // sample rate
        pitchResult.frequency // use detected pitch, or 0.0 to auto-detect
    )
}
if h1h2Result.success {
    print("H1-H2: \(h1h2Result.h1h2) dB")
}
```
Android (Java via JNI)
```java
// Build with --features android-jni
import com.loqalabs.voicefind.VoiceFindDSP; // package path assumed

float[] audioSamples = /* your audio data */;
VoiceFindDSP.PitchResult pitch = VoiceFindDSP.detectPitch(audioSamples, 16000, 80.0f, 400.0f); // method name assumed
System.out.println("Pitch: " + pitch.frequency + " Hz");
```
Note: Android JNI requires building with `--features android-jni`
FFI Safety & Parameter Validation
FFI Safety Requirements
Critical: All FFI structs use `#[repr(C)]` to ensure C-compatible memory layout. Failure to maintain this can cause alignment issues and incorrect values (see historical issues #1, #2, #3).
Memory safety:
- All FFI functions validate null pointers before dereferencing
- FFT results (`loqa_compute_fft`) allocate memory that must be freed using `loqa_free_fft_result`
- Never free FFT results more than once
- Never use FFT result pointers after calling `loqa_free_fft_result`
Swift/iOS example with proper cleanup:
```swift
var fftResult = loqa_compute_fft(buffer, count, sampleRate, fftSize) // var: freed by address below
defer { loqa_free_fft_result(&fftResult) } // Always free, exactly once
if fftResult.success {
    let spectral = loqa_analyze_spectrum(&fftResult)
    // Use spectral features...
}
```
Parameter Validation Ranges
Important: All validation happens in the Rust core. Higher-level layers (Swift/TypeScript) should trust Rust validation rather than implementing their own rules.
Pitch Detection (loqa_detect_pitch)
| Parameter | Valid Range | Recommended | Notes |
|---|---|---|---|
| `buffer_size` | ≥ 100 samples | 2048-4096 samples | See "Buffer Size Recommendations" above |
| `sample_rate` | 8000-96000 Hz | 16000-44100 Hz | Higher rates support higher frequency ranges |
| `min_frequency` | 20-4000 Hz | 80 Hz (male voice) | Must be < `max_frequency` |
| `max_frequency` | 40-8000 Hz | 400 Hz (voice) | Must be > `min_frequency` |
Formant Extraction (loqa_extract_formants)
| Parameter | Valid Range | Recommended | Notes |
|---|---|---|---|
| `buffer_size` | ≥ 2048 samples | 2048-4096 samples | Larger buffers improve formant resolution |
| `sample_rate` | 8000-96000 Hz | 16000-44100 Hz | Higher rates capture higher formants |
| `lpc_order` | 8-24 | 12-16 | NOT `sample_rate / 1000` - use the fixed range instead |
Historical Note: Issue loqa-expo-dsp#8 - TypeScript calculated lpc_order = sample_rate / 1000 + 2 which gave 46 for 44.1kHz. Swift layer rejected this as out of range, causing all calls to fail. Solution: Use fixed range 8-24 for all sample rates.
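The mismatch is easy to reproduce; a sketch contrasting the buggy cross-layer formula with the fixed clamp (function names hypothetical, for illustration only):

```rust
// Buggy cross-layer rule: LPC order grows with sample rate (46 at 44.1 kHz),
// which the Swift layer then rejected as out of range.
fn lpc_order_buggy(sample_rate: u32) -> u32 {
    sample_rate / 1000 + 2
}

// Fixed rule: clamp into the 8-24 range regardless of sample rate.
fn lpc_order_fixed(requested: u32) -> u32 {
    requested.clamp(8, 24)
}
```

Keeping one rule in the Rust core avoids the two layers disagreeing about what is valid.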
FFT (loqa_compute_fft)
| Parameter | Valid Range | Recommended | Notes |
|---|---|---|---|
| `buffer_size` | ≥ `fft_size` | = `fft_size` | Larger buffers are truncated |
| `sample_rate` | 8000-96000 Hz | 16000-48000 Hz | Affects frequency bin resolution |
| `fft_size` | Power of 2: 512-8192 | 2048 or 4096 | Non-power-of-2 may fail (impl-specific) |
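A caller-side pre-check for the `fft_size` constraint might look like the following (illustrative only; actual validation happens in the Rust core):

```rust
/// True when `n` is a power of two within the accepted 512-8192 range.
fn valid_fft_size(n: usize) -> bool {
    n.is_power_of_two() && (512..=8192).contains(&n)
}
```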
HNR Calculation (loqa_calculate_hnr)
| Parameter | Valid Range | Recommended | Notes |
|---|---|---|---|
| `buffer_size` | ≥ 2048 samples | 2048-4096 | Needs multiple pitch periods |
| `sample_rate` | 8000-96000 Hz | 16000 Hz | Standard voice analysis rate |
| `min_frequency` | 50-300 Hz | 75 Hz | Lowest expected F0 |
| `max_frequency` | 200-600 Hz | 500 Hz | Highest expected F0 |
H1-H2 Calculation (loqa_calculate_h1h2)
| Parameter | Valid Range | Recommended | Notes |
|---|---|---|---|
| `buffer_size` | ≥ 2048 samples | 4096 samples | Needs good spectral resolution |
| `sample_rate` | 8000-96000 Hz | 16000-44100 Hz | Higher rates improve harmonic resolution |
| `f0` | 0.0 or 50-800 Hz | Detected pitch | Pass 0.0 for auto-detect, or provide known F0 |
Auto-detect F0: Pass 0.0 (or any negative value) for f0 parameter to automatically detect pitch before calculating H1-H2.
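Conceptually, H1-H2 compares the magnitudes of the first two harmonics of F0 in the spectrum. A simplified sketch (nearest-bin picking only, with no interpolation or formant correction; not the library's implementation):

```rust
/// Nearest FFT bin for a given frequency.
fn harmonic_bin(freq_hz: f64, sample_rate: f64, fft_size: usize) -> usize {
    (freq_hz * fft_size as f64 / sample_rate).round() as usize
}

/// H1-H2 in dB from a magnitude spectrum and a known F0:
/// positive values mean the fundamental dominates (lighter voice quality).
fn h1_h2_db(magnitude: &[f64], f0: f64, sample_rate: f64, fft_size: usize) -> f64 {
    let h1 = magnitude[harmonic_bin(f0, sample_rate, fft_size)];
    let h2 = magnitude[harmonic_bin(2.0 * f0, sample_rate, fft_size)];
    20.0 * (h1 / h2).log10()
}
```

For example, with F0 = 200 Hz at 16 kHz and a 4096-point FFT, H1 sits near bin 51 and H2 near bin 102; a 10:1 magnitude ratio gives H1-H2 = 20 dB.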
Common FFI Pitfalls (Lessons Learned)
1. Struct Alignment Issues (Fixed in v0.2.1)
- Problem: Missing `#[repr(C)]` caused field misalignment
- Symptom: Correct frequency in Rust, wrong value in Swift/Java
- Solution: All FFI structs now have `#[repr(C)]` - verified by CI tests
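For illustration, a C-compatible result struct along the lines of the `PitchResultFFI` used in the Swift examples (field set assumed from those examples; the crate's actual definition may differ):

```rust
/// `#[repr(C)]` guarantees C field ordering, alignment, and padding,
/// so Swift/Java read each field at the offset Rust wrote it to.
#[repr(C)]
pub struct PitchResultFFI {
    pub frequency: f32,
    pub confidence: f32,
    pub voiced_probability: f32,
    pub success: bool, // one byte plus three bytes of tail padding
}
```

Without `#[repr(C)]`, Rust is free to reorder fields, which is exactly the misalignment described above.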
2. Parameter Validation Mismatches (Fixed in v0.2.2)
- Problem: TypeScript/Swift layers calculated different validation rules than Rust
- Symptom: Valid parameters rejected, invalid parameters accepted
- Solution: Single source of truth - Rust core validates, higher layers trust it
3. Buffer Size Confusion (Documented in v0.2.2)
- Problem: Users passing large buffers (16384 samples) got false negatives
- Symptom: Pitch detection failed despite valid audio
- Solution: Documentation + optional validation warnings (see Issue #5)
4. Memory Leaks with FFT (Prevented by design)
- Problem: Forgetting to free FFT results leaks ~16-32KB per call
- Symptom: Gradual memory growth in long-running apps
- Solution: Always use Swift `defer` or RAII patterns to ensure cleanup
Implementation Status
- Crate structure created
- Pitch detection (YIN + autocorrelation)
- Formant extraction (LPC-based)
- FFT utilities
- Spectral analysis (centroid, tilt, rolloff)
- HNR calculation (Boersma's autocorrelation method)
- H1-H2 amplitude difference (vocal weight)
- iOS FFI layer (C exports for all functions)
- Android JNI layer (with jni feature)
- Unit tests (68 passing)
- FFI integration tests (9 passing)
- SVD consistency tests (5 passing)
- Synthetic consistency tests (4 passing)
- Documentation tests (8 passing)
- Benchmarks harness
- Performance benchmarks (validated)
Performance Benchmarks
Validated Performance (2025-11-07) - All targets exceeded ✅
| Operation | Target | Actual (mean) | Result | Speedup |
|---|---|---|---|---|
| Pitch detection (100ms audio) | <20ms | 0.125ms | ✅ PASS | 160x faster |
| Formant extraction (500ms audio) | <50ms | 0.134ms | ✅ PASS | 373x faster |
| FFT (2048 points) | <10ms | ~0.020ms | ✅ PASS | 500x faster |
| Spectral analysis | <5ms | ~0.003ms | ✅ PASS | 1667x faster |
| HNR calculation (100ms window) | <30ms | <1ms | ✅ PASS | >30x faster |
| H1-H2 with F0 provided | <20ms | <1ms | ✅ PASS | >20x faster |
Note: Benchmarks run on Apple M-series silicon. All latency targets easily met with significant performance headroom for real-time voice processing.
Algorithm Details
Custom pYIN Implementation (v0.4.0+)
Starting in v0.4.0, we use a custom pYIN implementation optimized for voice analysis, removing the external pyin crate dependency.
What is pYIN?
pYIN (Mauch & Dixon, 2014) extends the YIN pitch detection algorithm to produce probabilistic pitch estimates, making it more robust for noisy or breathy voice signals.
Key Differences from Standard YIN:
- YIN: Returns single pitch estimate per frame
- pYIN: Returns multiple pitch candidates with probabilities, then uses Hidden Markov Model (HMM) to find the smoothest pitch track
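To make the YIN half concrete, here is a minimal sketch of its core: squared-difference function, cumulative mean normalization, then the first dip under an absolute threshold. It mirrors the idea, not this crate's implementation:

```rust
/// Estimate the pitch period (in samples) of `x` via the YIN procedure.
fn yin_period(x: &[f32], max_tau: usize, threshold: f32) -> Option<usize> {
    let w = x.len() - max_tau; // comparison window length
    // Squared-difference function d(tau)
    let mut d = vec![0.0f32; max_tau + 1];
    for tau in 1..=max_tau {
        for i in 0..w {
            let diff = x[i] - x[i + tau];
            d[tau] += diff * diff;
        }
    }
    // Cumulative mean normalized difference: d'(tau) = d(tau) * tau / sum(d[1..=tau])
    let mut cmnd = vec![1.0f32; max_tau + 1];
    let mut running = 0.0f32;
    for tau in 1..=max_tau {
        running += d[tau];
        cmnd[tau] = d[tau] * tau as f32 / running;
    }
    // Absolute threshold: first local minimum below the threshold
    for tau in 2..max_tau {
        if cmnd[tau] < threshold && cmnd[tau] <= cmnd[tau + 1] {
            return Some(tau);
        }
    }
    None
}

/// Synthetic sine helper for trying the sketch out.
fn synth_sine(freq: f32, sample_rate: f32, len: usize) -> Vec<f32> {
    (0..len)
        .map(|n| (2.0 * std::f32::consts::PI * freq * n as f32 / sample_rate).sin())
        .collect()
}
```

A 200 Hz sine at 16 kHz has an 80-sample period, so `yin_period` returns a lag near 80; the frequency is then `sample_rate / tau`. pYIN replaces the single fixed threshold with many thresholds drawn from the Beta prior, turning each frame into a set of candidates with probabilities.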
Our Voice-Optimized Implementation:
- Two-Stage Process:
  - Stage 1: Generate multiple pitch candidates using a Beta distribution β(2,18) over thresholds
  - Stage 2: Use the Viterbi algorithm on the HMM to find the optimal pitch track
- Voice-Specific Optimizations:
  - Narrower frequency range (80-400 Hz vs. general audio 50-2000 Hz)
  - Tighter HMM transition constraints (voice pitch changes slowly)
  - Voice-tuned Beta distribution concentrating probability near threshold = 0.1
- Benefits:
  - No external dependencies - fully integrated implementation
  - Better handling of breathy voice - multiple candidates provide robustness
  - Smoother pitch tracks - HMM enforces temporal consistency
  - Voiced probability per frame - soft voiced/unvoiced decisions (not just binary)
  - Smaller binary size - only includes what we need for voice
- Performance:
  - ~65-67 µs per 100ms frame (16 kHz sample rate)
  - ~1.5-2x overhead vs. standard YIN (an acceptable tradeoff for improved accuracy)
  - Still meets the 160-500x real-time performance target
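The HMM stage can be sketched as plain Viterbi decoding over per-frame `(frequency, probability)` candidates, with a transition cost that penalizes large pitch jumps (a simplified illustration, not the crate's implementation):

```rust
/// Pick one candidate per frame so that candidate probability is high and
/// frame-to-frame pitch jumps (measured in octaves) are small.
fn viterbi_pitch(frames: &[Vec<(f64, f64)>], octave_penalty: f64) -> Vec<f64> {
    // Initialize with log-probabilities of the first frame's candidates
    let mut score: Vec<f64> = frames[0].iter().map(|&(_, p)| p.ln()).collect();
    let mut back: Vec<Vec<usize>> = Vec::new();
    for t in 1..frames.len() {
        let mut next = vec![f64::NEG_INFINITY; frames[t].len()];
        let mut bp = vec![0usize; frames[t].len()];
        for (j, &(fj, pj)) in frames[t].iter().enumerate() {
            for (i, &(fi, _)) in frames[t - 1].iter().enumerate() {
                // Transition cost grows with the pitch jump in octaves
                let s = score[i] + pj.ln() - octave_penalty * (fj / fi).log2().abs();
                if s > next[j] {
                    next[j] = s;
                    bp[j] = i;
                }
            }
        }
        score = next;
        back.push(bp);
    }
    // Backtrack the best-scoring path
    let mut idx = (0..score.len())
        .max_by(|&a, &b| score[a].partial_cmp(&score[b]).unwrap())
        .unwrap();
    let mut path = vec![idx];
    for bp in back.iter().rev() {
        idx = bp[idx];
        path.push(idx);
    }
    path.reverse();
    path.iter().zip(frames).map(|(&i, f)| f[i].0).collect()
}
```

With the jump penalty enabled, an isolated octave error in the middle of a track is overridden by temporal consistency; with the penalty at zero, the decoder degenerates to per-frame greedy picking.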
References:
- Mauch, M. & Dixon, S. (2014). pYIN: A Fundamental Frequency Estimator Using Probabilistic Threshold Distributions. Proc. ICASSP 2014.
- Boersma, P. (1993). Accurate Short-Term Analysis of the Fundamental Frequency and the Harmonics-to-Noise Ratio of a Sampled Sound. Proc. Institute of Phonetic Sciences, University of Amsterdam, 17.
Acoustic Measures Reference
HNR (Harmonics-to-Noise Ratio)
Measures the ratio of harmonic (periodic) to noise (aperiodic) energy in voice - the primary acoustic indicator of breathiness.
| HNR Range | Interpretation |
|---|---|
| 18-25+ dB | Clear, less breathy voice |
| 12-18 dB | Moderate breathiness |
| <10 dB | Very breathy or pathological voice |
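Boersma's method derives HNR from the height r_max of the normalized autocorrelation peak at the pitch period: HNR = 10·log10(r_max / (1 − r_max)). A direct transcription of that formula:

```rust
/// HNR in dB from the normalized autocorrelation peak (0 < r_max < 1):
/// r_max is the fraction of signal energy that is periodic.
fn hnr_db(r_max: f64) -> f64 {
    10.0 * (r_max / (1.0 - r_max)).log10()
}
```

r_max = 0.99 (99% periodic energy) gives ~20 dB, a clear voice; r_max = 0.5 gives 0 dB, equal harmonic and noise energy.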
H1-H2 (First/Second Harmonic Difference)
Measures the amplitude difference between the fundamental and second harmonic - indicates vocal weight.
| H1-H2 Range | Interpretation |
|---|---|
| >5 dB | Lighter, breathier vocal quality |
| 0-5 dB | Balanced vocal weight |
| <0 dB | Fuller, heavier vocal quality |
Test Data
Saarbrücken Voice Database
This library uses samples from the Saarbrücken Voice Database for consistency validation testing.
License: CC BY 4.0
Attribution: Pützer, M. & Barry, W.J., Former Institute of Phonetics, Saarland University. Available at Zenodo.
The SVD provides lab-quality voice recordings including:
- Sustained vowels (/a:/, /i:/, /u:/) at low, normal, and high pitch
- 851 healthy control speakers
- 1002 speakers with documented voice pathologies
- 50 kHz sample rate, controlled recording conditions
Setting Up Test Data
```sh
# 1. Download SVD from Zenodo (CC BY 4.0 license)
#    https://zenodo.org/records/16874898
# 2. Install conversion dependencies
# 3. Convert SVD files to test format
```
Test Sample Requirements
For comprehensive validation, the library needs test samples with these characteristics:
| Function | Sample Requirements | Recommended Datasets |
|---|---|---|
| Pitch Detection | Male (80-180 Hz), Female (160-300 Hz), varied intonation | Saarbrücken Voice Database, PTDB-TUG |
| Formant Extraction | Sustained vowels /a/, /i/, /u/, /e/, /o/ from multiple speakers | Hillenbrand Vowel Database, VTR-TIMIT |
| HNR | Breathy, modal, and clear voice qualities | Saarbrücken Voice Database |
| H1-H2 | Light to full voice qualities, different phonation types | UCLA Voice Quality Database, VoiceSauce reference recordings |
| Spectral | Dark to bright voice qualities | Voice quality databases with perceptual labels |
Development
```sh
# Build
cargo build --release

# Test
cargo test

# Benchmark
cargo bench

# Documentation
cargo doc --open
```
License
MIT