# Audio Processing Theory
Audio processing is fundamental to speech recognition (ASR), text-to-speech (TTS), and voice applications. This chapter covers the signal processing theory behind Aprender's audio module.
## The Audio Processing Pipeline
Modern ASR systems like Whisper process audio through a standardized pipeline:
```
┌─────────────┐ ┌──────────┐ ┌─────────────┐ ┌───────────┐
│ Raw Audio │───▸│ Resample │───▸│ Mel │───▸│ Neural │
│ (44.1kHz) │ │ (16kHz) │ │ Spectrogram │ │ Network │
└─────────────┘ └──────────┘ └─────────────┘ └───────────┘
```
Each stage transforms the audio into a representation more suitable for machine learning.
## Mel Scale and Human Perception
The **mel scale** is a perceptual scale of pitches that models how humans perceive frequency. It's based on the observation that humans perceive equal intervals between low frequencies (e.g., 100-200 Hz) as larger than equal intervals at high frequencies (e.g., 8000-8100 Hz).
### Hz to Mel Conversion
```
mel = 2595 * log₁₀(1 + f/700)
```
And the inverse:
```
f = 700 * (10^(mel/2595) - 1)
```
| 0 | 0 |
| 500 | 607 |
| 1000 | 1000 |
| 2000 | 1548 |
| 4000 | 2146 |
| 8000 | 2840 |
Notice how 0-1000 Hz spans 1000 mels, but 4000-8000 Hz only spans ~700 mels.
## Mel Filterbank
A **mel filterbank** is a set of triangular filters that convert the linear frequency spectrum to mel scale:
```
Filterbank
▲
│ △ △ △ △ △
│ / \ / \ / \ / \ / \
│ / \ / \ / \ / \ / \
│ / \/ \ / \ / \ / \
└─────────────────────────────────────────────▸
0 500 1000 2000 4000 8000 Hz
```
Each triangular filter:
1. Is centered at a mel-spaced frequency
2. Overlaps with adjacent filters (50%)
3. Sums the power spectrum within its bandwidth
### Slaney Normalization
Aprender uses **Slaney area normalization**, which ensures each filter has unit area:
```
normalization_factor = 2 / (f_high - f_low)
```
This matches librosa's `norm='slaney'` and OpenAI Whisper's filterbank, ensuring consistent outputs across implementations.
## Mel Spectrogram Computation
The mel spectrogram computation follows these steps:
### 1. Frame the Audio
Divide audio into overlapping frames using a Hann window:
```
Frame 0: samples[0:400] ← Apply Hann window
Frame 1: samples[160:560] ← Hop by 160 samples
Frame 2: samples[320:720]
...
```
For Whisper at 16kHz:
- Frame size (n_fft): 400 samples = 25ms
- Hop length: 160 samples = 10ms
- Overlap: 60%
### 2. Apply FFT
Transform each windowed frame to frequency domain:
```
X[k] = Σₙ x[n] · e^(-j2πkn/N)
```
This produces a complex spectrum with N/2+1 frequency bins.
### 3. Compute Power Spectrum
```
### 4. Apply Mel Filterbank
Matrix multiply the power spectrum by the filterbank:
```
mel_energies = filterbank @ power_spectrum
```
This reduces 201 frequency bins (for n_fft=400) to 80 mel channels.
### 5. Log Compression
Apply logarithmic compression for dynamic range:
```
log_mel = log₁₀(max(mel_energy, 1e-10))
```
The floor value (1e-10) prevents log(0).
### 6. Normalize
Whisper-style normalization:
```
normalized = (log_mel.max(max - 8.0) + 4.0) / 4.0
```
## Sample Rate Conversion
### Why Resample?
Different audio sources have different sample rates:
- CD quality: 44,100 Hz
- Professional audio: 48,000 Hz
- Whisper requirement: 16,000 Hz
- Telephone: 8,000 Hz
### Resampling Algorithm
Aprender uses linear interpolation for basic resampling:
```
For each output sample i:
src_pos = i * (from_rate / to_rate)
src_idx = floor(src_pos)
frac = src_pos - src_idx
output[i] = samples[src_idx] * (1 - frac)
+ samples[src_idx + 1] * frac
```
For higher quality, windowed-sinc interpolation minimizes aliasing.
## Audio Validation
### Clipping Detection
Properly normalized audio samples should be in the range [-1.0, 1.0]. Clipping occurs when samples exceed this range:
```
Clipped Audio
▲
1 │──────┬─────────────────────
│ /│\ /│\
│ / │ \ / │ \
│ / │ \ / │ \
│ / │ \ / │ \
──┼─/────┼────\─/────┼────\───▸
│/ │ V │ \
-1│──────┴───────────┴───────
```
Clipping causes:
- Distortion in reconstructed audio
- Poor ASR accuracy
- Incorrect mel spectrogram values
### NaN and Infinity Detection
Invalid floating-point values can propagate through the pipeline:
- **NaN**: Often from 0/0 or sqrt(-1)
- **Infinity**: From division by very small numbers
Aprender validates audio before processing to catch these early.
## Stereo to Mono Conversion
Most ASR models expect mono audio. Stereo conversion averages the channels:
```
mono[i] = (left[i] + right[i]) / 2
```
For interleaved stereo audio [L₀, R₀, L₁, R₁, ...]:
```rust
let mono: Vec<f32> = stereo
.chunks(2)
.map(|chunk| (chunk[0] + chunk[1]) / 2.0)
.collect();
```
## Streaming and Chunking
Real-time ASR requires processing audio in chunks as it arrives:
```
┌─────────────────────────────────────────────────────┐
│ Chunk 1 (30s) │ Chunk 2 (30s) │ ... │
│ │ │ │
│ ◀──────Overlap(1s)──▶ │ │
└─────────────────────────────────────────────────────┘
```
### Overlap Handling
Chunks overlap to avoid boundary artifacts:
1. Process chunk 1, get transcription
2. Keep last 1 second of chunk 1
3. Prepend to chunk 2 for context
4. Merge transcriptions, removing duplicates
### Configuration
| Chunk size | 30 seconds | 5 seconds |
| Overlap | 1 second | 0.5 seconds |
| Latency | N/A | ~5 seconds |
## Platform-Specific Audio Capture
### Backend Architecture
```
┌─────────────────────────────────────────────────────────┐
│ AudioCapture API │
├─────────────────────────────────────────────────────────┤
│ Linux │ macOS │ Windows │ WASM │
│ (ALSA) │ (CoreAudio)│ (WASAPI) │ (WebAudio API) │
└─────────────────────────────────────────────────────────┘
```
Each backend implements the `CaptureBackend` trait:
```rust
pub trait CaptureBackend {
fn open(device: Option<&str>, config: &CaptureConfig) -> Result<Self, AudioError>;
fn read(&mut self, buffer: &mut [f32]) -> Result<usize, AudioError>;
fn close(&mut self) -> Result<(), AudioError>;
}
```
### ALSA (Linux)
ALSA provides low-latency audio on Linux:
- Requires `libasound2-dev` package
- Enable with `audio-alsa` feature
- Captures in S16_LE format, converts to f32
## Configuration Presets
### Whisper (ASR)
```rust
MelConfig {
n_mels: 80, // 80 mel channels
n_fft: 400, // 25ms window at 16kHz
hop_length: 160, // 10ms hop
sample_rate: 16000, // 16kHz required
fmin: 0.0,
fmax: 8000.0, // Nyquist frequency
}
```
### TTS (VITS-style)
```rust
MelConfig {
n_mels: 80,
n_fft: 1024, // 46ms window at 22kHz
hop_length: 256, // 11.6ms hop
sample_rate: 22050, // CD-quality
fmin: 0.0,
fmax: 11025.0,
}
```
## Mathematical Foundations
### Hann Window
The Hann window reduces spectral leakage:
```
w[n] = 0.5 * (1 - cos(2πn / N))
```
It smoothly tapers to zero at the edges, preventing discontinuities.
### Short-Time Fourier Transform (STFT)
The STFT captures both time and frequency information:
```
X[m, k] = Σₙ x[n + m·H] · w[n] · e^(-j2πkn/N)
```
Where:
- m = frame index
- k = frequency bin
- H = hop length
- w[n] = window function
## References
- Radford, A. et al. (2023). "Robust Speech Recognition via Large-Scale Weak Supervision" (Whisper paper)
- Stevens, S., Volkmann, J., & Newman, E. (1937). "A Scale for the Measurement of the Psychological Magnitude Pitch"
- Slaney, M. (1998). "Auditory Toolbox" Technical Report #1998-010