# Speech and Voice Processing Theory
Speech and voice processing enables machines to understand, generate, and manipulate human speech. This chapter covers voice activity detection (VAD), automatic speech recognition (ASR), speaker diarization and verification, text-to-speech (TTS), voice cloning and conversion, voice isolation, and the metrics used to evaluate them.
## Speech Processing Pipeline
```
┌──────────┐      ┌─────┐      ┌─────────────┐      ┌──────────┐
│  Audio   │─────▶│ VAD │─────▶│ ASR/Speaker │─────▶│  Output  │
│  Input   │      │     │      │ Recognition │      │ Text/ID  │
└──────────┘      └─────┘      └─────────────┘      └──────────┘
```
## Voice Activity Detection (VAD)
Detect when speech is present in audio:
### Energy-Based VAD
Simple threshold on frame energy:
```
energy[t] = Σ(samples[t:t+frame]²)
is_speech[t] = energy[t] > threshold
```
**Pros:** Fast, no model needed
**Cons:** Sensitive to noise
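A minimal NumPy sketch of the energy threshold above; the frame length, hop size, and threshold value are illustrative assumptions that would need tuning for a real signal level.

```python
import numpy as np

def energy_vad(samples, sample_rate, frame_ms=25, hop_ms=10, threshold=1e-3):
    """Energy-based VAD: one boolean speech/non-speech decision per frame."""
    frame = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    decisions = []
    for start in range(0, len(samples) - frame + 1, hop):
        window = samples[start:start + frame].astype(np.float64)
        energy = np.sum(window ** 2)        # Σ x² over the frame
        decisions.append(energy > threshold)
    return np.array(decisions)
```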
### Neural VAD (Silero-style)
```
Audio → Mel Spectrogram → LSTM/Conv → [0.0, 1.0]
                                      Speech probability
```
**Pros:** Robust to noise
**Cons:** Requires model inference
### VAD Parameters
| Parameter | Typical value | Purpose |
|-----------|---------------|---------|
| Frame length | 20-30 ms | Resolution |
| Threshold | 0.5 | Sensitivity |
| Min speech | 250 ms | Filter noise |
| Min silence | 300 ms | Merge segments |
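These parameters map directly onto typical neural VAD APIs. As a hedged example, Silero VAD is commonly loaded through torch.hub roughly as below; the utility names and defaults may differ between releases, so treat this as a sketch rather than the definitive API.

```python
import torch

# Load the Silero VAD model plus helper utilities (downloads on first run).
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio('speech.wav', sampling_rate=16000)
segments = get_speech_timestamps(
    wav, model,
    sampling_rate=16000,
    threshold=0.5,                 # speech-probability threshold
    min_speech_duration_ms=250,    # drop very short "speech" bursts
    min_silence_duration_ms=300,   # merge segments separated by short pauses
)
print(segments)  # list of {'start': ..., 'end': ...} in samples
```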
## Automatic Speech Recognition (ASR)
Convert speech to text:
### Traditional Pipeline
```
Audio → MFCC → Acoustic Model → HMM → Language Model → Text
```
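For the front end of this pipeline, MFCCs are usually computed with a library such as librosa; a minimal sketch (the file name and coefficient count are illustrative choices):

```python
import librosa

# Load audio and compute 13 MFCCs per frame (~25 ms window, 10 ms hop at 16 kHz).
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, num_frames)
```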
### End-to-End (Whisper-style)
```
Audio → Mel Spectrogram → Encoder → Decoder → Text
               │             │         │
               └─────────────┴─────────┘
               Transformer Architecture
```
### Whisper Architecture
```
Audio (30s max)
          │
          ▼
Mel Spectrogram (80 mel, 3000 frames)
          │
          ▼
┌─────────────────────┐
│ Encoder             │  (Transformer)
│  - Conv stem        │
│  - Positional enc   │
│  - N layers         │
└─────────────────────┘
          │
          ▼
┌─────────────────────┐
│ Decoder             │  (Transformer)
│  - Text tokens      │
│  - Cross-attention  │
│  - Autoregressive   │
└─────────────────────┘
          │
          ▼
Text tokens → Text
```
### Word-Level Timestamps
Cross-attention alignment:
```
For each word:
1. Find decoder step that generated word
2. Extract cross-attention weights
3. Find peak attention position
4. Map to audio timestamp
```
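A hedged usage sketch with the open-source `whisper` package; the model size and file name are placeholders, and the `word_timestamps` option (which exposes the cross-attention alignment described above) is only available in recent versions.

```python
import whisper

model = whisper.load_model("base")

# Transcribe and ask for per-word timings derived from cross-attention alignment.
result = model.transcribe("meeting.wav", word_timestamps=True)

print(result["text"])
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f"{word['start']:.2f}-{word['end']:.2f}s  {word['word']}")
```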
## Speaker Diarization
"Who spoke when?"
### Pipeline
```
Audio → VAD → Embedding → Clustering → Timeline
                  │           │
                  ▼           ▼
          Speaker Vectors  Speakers
```
### Speaker Embeddings
**X-Vector:**
```
Audio → Frame features → Statistics pooling → DNN → 512-dim
```
**ECAPA-TDNN:**
```
Audio → SE-Res2Net → Attentive Stats → 192-dim
```
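One common way to obtain such embeddings is the pretrained ECAPA-TDNN model shipped with SpeechBrain; the import path and model identifier below reflect common usage but may vary by SpeechBrain version, so this is a sketch, not the definitive API.

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

signal, sr = torchaudio.load("speaker1.wav")         # 16 kHz mono expected
embedding = encoder.encode_batch(signal).squeeze()   # ~192-dim speaker vector
embedding = embedding / embedding.norm()             # L2-normalize for cosine scoring
print(embedding.shape)
```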
### Clustering Methods
| Method | Requires K? | Notes |
|--------|-------------|-------|
| K-Means | Yes | Simple, fast |
| Spectral | Yes | Better for non-spherical |
| Agglomerative | No | Can auto-detect speakers |
| VBx | No | Bayesian, state-of-the-art |
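A sketch of the clustering step with scikit-learn's agglomerative clustering, letting a cosine-distance threshold auto-detect the speaker count. The embeddings file and the 0.7 threshold are assumptions; the threshold must be tuned per embedding model.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# embeddings: (num_segments, dim) speaker vectors, one per VAD segment,
# assumed L2-normalized so cosine distance is meaningful.
embeddings = np.load("segment_embeddings.npy")

clustering = AgglomerativeClustering(
    n_clusters=None,           # let the distance threshold decide the speaker count
    distance_threshold=0.7,
    metric="cosine",           # called `affinity` in older scikit-learn versions
    linkage="average",
)
labels = clustering.fit_predict(embeddings)
print("estimated speakers:", labels.max() + 1)
```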
## Text-to-Speech (TTS)
Convert text to speech:
### Two-Stage Pipeline
```
Text → Acoustic Model → Mel Spectrogram → Vocoder → Waveform
             │                               │
             ▼                               ▼
    Tacotron/FastSpeech              HiFi-GAN/WaveGlow
```
### FastSpeech 2
Non-autoregressive for fast synthesis:
```
Phonemes → Encoder → Variance Adaptor → Mel Decoder → Mel
                            │
           Duration, Pitch, Energy predictors
```
**Variance Adaptor:**
- Duration: how long each phoneme lasts (used to expand phoneme features to frame rate)
- Pitch: the F0 contour
- Energy: frame-level loudness
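A minimal PyTorch sketch of a FastSpeech 2-style variance predictor; the same small network is reused for duration, pitch, and energy. Layer sizes here are illustrative assumptions, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Predicts one scalar (duration, pitch, or energy) per phoneme position."""
    def __init__(self, hidden=256, kernel=3, dropout=0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.drop = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x):                            # x: (batch, time, hidden)
        h = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        h = self.drop(self.norm1(torch.relu(h)))
        h = self.conv2(h.transpose(1, 2)).transpose(1, 2)
        h = self.drop(self.norm2(torch.relu(h)))
        return self.proj(h).squeeze(-1)              # (batch, time)

pred = VariancePredictor()
out = pred(torch.randn(2, 17, 256))                  # e.g. log-durations for 17 phonemes
print(out.shape)                                     # torch.Size([2, 17])
```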
### Vocoders
Convert mel spectrogram to waveform:
| Vocoder | Quality | Speed |
|---------|---------|-------|
| Griffin-Lim | Low | Fast |
| WaveNet | High | Very slow |
| HiFi-GAN | High | Fast |
| WaveGlow | High | Moderate |
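For the simplest row in the table, Griffin-Lim can invert a mel spectrogram without any trained model. A hedged librosa sketch; the STFT parameters are assumptions and must match those used to produce the mel spectrogram.

```python
import librosa
import soundfile as sf

y, sr = librosa.load("reference.wav", sr=22050)

# A power mel spectrogram, standing in for an acoustic model's output.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Invert mel → linear spectrogram → waveform via Griffin-Lim iterations.
wav = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
sf.write("griffin_lim.wav", wav, sr)
```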
## Voice Cloning
Clone a voice from samples:
### Zero-Shot Cloning (YourTTS)
```
Reference Audio → Speaker Encoder → Style Vector
                                         │
                                         ▼
Text ──────────────────────────────▶ TTS Model ──────▶ Cloned Speech
```
Only needs 3-5 seconds of reference audio.
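A hedged sketch with the Coqui TTS package, which ships a YourTTS checkpoint; the model identifier, method names, and file paths reflect common usage but may change between releases.

```python
from TTS.api import TTS

# Multilingual YourTTS model with zero-shot speaker cloning.
tts = TTS("tts_models/multilingual/multi-dataset/your_tts")

tts.tts_to_file(
    text="This sentence is spoken in the reference speaker's voice.",
    speaker_wav="reference_5s.wav",   # 3-5 s of the target voice
    language="en",
    file_path="cloned.wav",
)
```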
### Fine-Tuning Based
1. Pre-train TTS on large corpus
2. Fine-tune on target speaker (15-30 min audio)
3. Generate with fine-tuned model
**Trade-off:** Better quality than zero-shot cloning, but needs far more target-speaker data and a training run.
## Voice Conversion
Change voice identity while preserving content:
### PPG-Based
```
Source Audio → ASR → PPG (Content) ─────┐
                                        │
Target Speaker → Embedding ────────────▶│───▶ Converted
                                        │
Prosody extraction ─────────────────────┘
```
PPG = Phonetic Posteriorgram (content representation)
### Autoencoder-Based
```
Audio → Content Encoder → Content ─────┐
                                       │
Audio → Speaker Encoder → Speaker ────▶│───▶ Decoder → Audio'
                                       │
Audio → Prosody Encoder → Prosody ─────┘
```
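A compact PyTorch sketch of the autoencoder idea above: a content encoder with a narrow bottleneck, a separate speaker embedding, and a decoder that recombines them. All dimensions are illustrative assumptions; real systems such as AutoVC add careful bottleneck tuning, prosody handling, and a neural vocoder.

```python
import torch
import torch.nn as nn

class VoiceConversionAE(nn.Module):
    def __init__(self, n_mels=80, content_dim=16, speaker_dim=64, hidden=256):
        super().__init__()
        # A narrow bottleneck encourages the content code to drop speaker identity.
        self.content_enc = nn.Sequential(
            nn.Conv1d(n_mels, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, content_dim, 5, padding=2),
        )
        self.decoder = nn.Sequential(
            nn.Conv1d(content_dim + speaker_dim, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, n_mels, 5, padding=2),
        )

    def forward(self, mel, speaker_emb):              # mel: (B, n_mels, T)
        content = self.content_enc(mel)               # (B, content_dim, T)
        spk = speaker_emb.unsqueeze(-1).expand(-1, -1, mel.size(-1))
        return self.decoder(torch.cat([content, spk], dim=1))

model = VoiceConversionAE()
mel_src = torch.randn(1, 80, 200)                     # source utterance (mel frames)
spk_tgt = torch.randn(1, 64)                          # target speaker embedding
converted = model(mel_src, spk_tgt)                   # (1, 80, 200) mel in target voice
```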
## Voice Isolation
Separate voice from background:
### Spectral Subtraction
```
Y(f) = Speech(f) + Noise(f)
|Speech(f)| ≈ |Y(f)| − E[|Noise(f)|]    (subtract magnitudes, reuse the noisy phase)
```
Estimate noise from silent segments.
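A runnable sketch of spectral subtraction on STFT magnitudes, estimating the noise spectrum from an assumed-silent leading portion of the file. The noise duration and the 5% spectral floor are illustrative choices.

```python
import numpy as np
import librosa
import soundfile as sf

def spectral_subtraction(y, sr, noise_seconds=0.5, n_fft=512, hop=128):
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(S), np.angle(S)

    # Estimate the noise magnitude spectrum from the leading (assumed silent) frames.
    noise_frames = max(1, int(noise_seconds * sr / hop))
    noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract and apply a floor to limit "musical noise" artifacts.
    clean_mag = np.maximum(mag - noise, 0.05 * mag)
    return librosa.istft(clean_mag * np.exp(1j * phase), hop_length=hop)

y, sr = librosa.load("noisy.wav", sr=16000)
sf.write("denoised.wav", spectral_subtraction(y, sr), sr)
```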
### Neural Source Separation
```
Mixture → U-Net/Conv-TasNet → Separated Sources
                  │
     Mask estimation per source
```
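A toy PyTorch sketch of the mask-estimation idea: a small network predicts one magnitude mask per source from the mixture spectrogram and applies it. The architecture and sizes are illustrative; this is not Conv-TasNet itself.

```python
import torch
import torch.nn as nn

class MaskSeparator(nn.Module):
    def __init__(self, n_freq=257, n_sources=2, hidden=256):
        super().__init__()
        self.n_sources = n_sources
        self.net = nn.Sequential(
            nn.Conv1d(n_freq, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, n_freq * n_sources, 3, padding=1),
        )

    def forward(self, mix_mag):                       # (B, n_freq, T) magnitude STFT
        masks = self.net(mix_mag)
        masks = masks.view(-1, self.n_sources, mix_mag.size(1), mix_mag.size(2))
        masks = torch.softmax(masks, dim=1)           # masks sum to 1 per TF bin
        return masks * mix_mag.unsqueeze(1)           # (B, n_sources, n_freq, T)

sep = MaskSeparator()
sources = sep(torch.rand(1, 257, 300))
print(sources.shape)                                  # torch.Size([1, 2, 257, 300])
```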
## Speaker Verification
"Is this the claimed speaker?"
### Pipeline
```
Enrollment:    Audio → Embedding Model → Reference Vector ────────┐
                                                                  │
                                                                  ▼
Verification:  Audio → Embedding Model → Query Vector ──▶ Cosine Similarity
                                                                  │
                                                                  ▼
                                                            Accept/Reject
```
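A minimal scoring sketch: cosine similarity between the enrollment and query embeddings with a fixed decision threshold. The 0.6 threshold is an illustrative assumption; in practice it is calibrated on a development set (e.g. at the EER operating point).

```python
import numpy as np

def verify(reference: np.ndarray, query: np.ndarray, threshold: float = 0.6) -> bool:
    """Accept if the cosine similarity between embeddings exceeds the threshold."""
    cos = np.dot(reference, query) / (np.linalg.norm(reference) * np.linalg.norm(query))
    return cos > threshold

ref = np.random.randn(192)    # enrollment embedding (e.g. ECAPA-TDNN)
qry = np.random.randn(192)    # verification embedding
print("accept" if verify(ref, qry) else "reject")
```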
### Metrics
| Metric | Description |
|--------|-------------|
| EER | Equal Error Rate (FAR = FRR) |
| minDCF | Detection cost function |
| TAR@FAR | True accept at fixed false accept |
## Prosody Transfer
Transfer speaking style:
```
Source Audio → Style Encoder → Style Vector
                                    │
               ┌────────────────────┘
               ▼
Target Text → TTS → New Audio with Source Style
```
Style includes:
- Speaking rate
- Pitch patterns
- Emphasis
- Emotion
## Quality Metrics
| Metric | Measures | Range |
|--------|----------|-------|
| WER | ASR accuracy | 0-∞ (lower=better) |
| MOS | Subjective quality | 1-5 |
| PESQ | Perceptual quality | -0.5 to 4.5 |
| STOI | Intelligibility | 0-1 |
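WER from the table is word-level edit distance divided by the number of reference words; a small self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```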
## References
- Radford, A., et al. (2023). "Robust Speech Recognition via Large-Scale Weak Supervision." ICML. (Whisper)
- Ren, Y., et al. (2021). "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech." ICLR.
- Kong, J., et al. (2020). "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis." NeurIPS.
- Desplanques, B., et al. (2020). "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification." Interspeech.