wav2vec2-rs 0.1.0

Rust wav2vec2 CTC forced alignment with CPU, wgpu, and CUDA backends.

[![Quality Gate Status](https://sonarcloud.io/api/project_badges/measure?project=Djoe-Denne_wav2vec2-rs&metric=alert_status)](https://sonarcloud.io/summary/new_code?id=Djoe-Denne_wav2vec2-rs) [![Coverage](https://sonarcloud.io/api/project_badges/measure?project=Djoe-Denne_wav2vec2-rs&metric=coverage)](https://sonarcloud.io/summary/new_code?id=Djoe-Denne_wav2vec2-rs) [![Security Rating](https://sonarcloud.io/api/project_badges/measure?project=Djoe-Denne_wav2vec2-rs&metric=security_rating)](https://sonarcloud.io/summary/new_code?id=Djoe-Denne_wav2vec2-rs) [![Maintainability Rating](https://sonarcloud.io/api/project_badges/measure?project=Djoe-Denne_wav2vec2-rs&metric=sqale_rating)](https://sonarcloud.io/summary/new_code?id=Djoe-Denne_wav2vec2-rs)

# wav2vec2-align

A Rust library for **CTC forced alignment** using wav2vec2 acoustic models. Maps a known transcript onto an audio signal at word level, producing millisecond-precision timing boundaries with per-word confidence scores.

Forced alignment solves the problem of synchronizing text with speech: given audio and its transcript, it determines exactly *when* each word is spoken. This is foundational for subtitle generation, speech corpus annotation, audiobook segmentation, and pronunciation analysis.

The library implements the full pipeline, from raw audio to timestamped words, with three interchangeable Viterbi backends (CPU, wgpu, CUDA) and a multi-policy blank expansion strategy that produces word boundaries of Montreal Forced Aligner (MFA) quality.

---

## Acknowledgments

This project was heavily inspired by [wav2vec2aligner](https://github.com/EveryVoiceTTS/wav2vec2aligner) by EveryVoiceTTS. The original Python/TorchAudio implementation served both as a reference for understanding the alignment pipeline and as a comparison baseline for benchmarking. The benchmark harness on the Python side was also built on top of that project (see the patch file `wav2vec2aligner-main.perf-monitoring.patch`). Many thanks to the authors for making their work available.

---

## Features

- **CTC Viterbi forced alignment** in O(T·S) time (each CTC state has at most three predecessors, versus O(T·S²) for a dense-transition Viterbi), with reachability-band pruning to skip unreachable cells
- **Three compute backends**: CPU (always available), wgpu (Vulkan/DX12/Metal), CUDA with zero-copy ORT integration
- **Two model runtimes**: Candle (pure Rust, safetensors) and ONNX Runtime (with CUDA EP support)
- **Zero-copy CUDA path**: when ORT runs on GPU, log-softmax + Viterbi execute entirely on device — only the T-length state path (T×4 bytes) is copied back to host
- **Multi-policy blank expansion** with acoustic-evidence candidate selection for robust word boundaries
- **Deterministic composite confidence scoring**: blends geometric mean emission probability, top-2 margin, p10 log-prob, and boundary blank evidence with piecewise-linear calibration
- **Evaluation and benchmark reporting**: structural, timing, confidence, and performance metrics with per-split aggregation, outlier ranking, and per-word traces

---

## Installation

### Prerequisites

- **Rust 1.75+** (edition 2021)
- A wav2vec2 CTC model in safetensors (Candle) or ONNX format, with `config.json` and `vocab.json`
- Audio must be **16 kHz mono f32** (the standard wav2vec2 sample rate)
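
If your audio is decoded as interleaved 16-bit PCM, it has to be converted before it can be passed in. The helper below is a minimal sketch of that conversion; `pcm16_to_mono_f32` is a hypothetical name, not part of this crate, and it assumes the input is already sampled at 16 kHz (no resampling is performed).

```rust
/// Convert interleaved 16-bit PCM into mono f32 samples in [-1.0, 1.0).
///
/// Hypothetical helper (not part of the crate). Channels are averaged
/// per frame; the sample rate is left untouched, so the input is assumed
/// to be 16 kHz already.
pub fn pcm16_to_mono_f32(interleaved: &[i16], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        // One chunk per audio frame; a trailing partial frame is dropped.
        .chunks_exact(channels)
        .map(|frame| {
            // Scale each i16 sample to [-1.0, 1.0) and average the channels.
            let sum: f32 = frame.iter().map(|&s| f32::from(s) / 32768.0).sum();
            sum / channels as f32
        })
        .collect()
}
```

The resulting `Vec<f32>` is the kind of buffer the `samples` field in the usage example expects.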

### Feature flags

| Feature               | Description                                                                    |
|-----------------------|--------------------------------------------------------------------------------|
| `wgpu-dp`             | wgpu Viterbi backend (Vulkan, DX12, Metal)                                     |
| `cuda-dp`             | CUDA Viterbi backend via cudarc + NVRTC (requires CUDA toolkit)                |
| `onnx`                | ONNX Runtime model backend (CPU or CUDA execution provider)                    |
| `alignment-profiling` | Per-stage timing and memory profiling (benchmark mode)                         |
| `report-cli`          | Enabled for the `alignment_report` binary in the CLI commands of this README   |

CI runs only CPU backends (default + `onnx`); GPU features (`wgpu-dp`, `cuda-dp`) are not tested in CI. For GPU benchmarks in CI, use the manual workflows **Prepare benchmark assets** and **GPU manual (benchmark)**; see [BENCHMARKS.md](BENCHMARKS.md#benchmark-assets-in-ci-github-actions).

### Basic (CPU only, Candle runtime)

```bash
cargo build --release
```

### With GPU Viterbi (wgpu)

```bash
cargo build --release --features wgpu-dp
```

### Full CUDA pipeline (ONNX + CUDA Viterbi zero-copy)

```bash
cargo build --release --features "onnx,cuda-dp"
```

This gives the fastest path: ORT produces logits on GPU → on-device log-softmax kernel → on-device Viterbi → only the state path array is transferred to host.

### With profiling

```bash
cargo build --release --features "onnx,cuda-dp,alignment-profiling"
```

---

## Usage

### As a library

```rust
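// Snippet meant to live inside a function returning `Result`, so that `?` compiles.
// `audio_f32_16khz` is assumed to be a `Vec<f32>` of 16 kHz mono samples loaded by the caller.
use wav2vec2_rs::{ForcedAlignerBuilder, Wav2Vec2Config, AlignmentInput};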

let config = Wav2Vec2Config {
    model_path: "model.safetensors".into(), // or "model.onnx"
    config_path: "config.json".into(),
    vocab_path: "vocab.json".into(),
    device: "cpu".into(), // or "cuda"
    expected_sample_rate_hz: 16_000,
};

let aligner = ForcedAlignerBuilder::new(config)
    // .with_runtime_kind(RuntimeKind::Onnx) // for ONNX backend
    .build()?;

let input = AlignmentInput {
    sample_rate_hz: 16_000,
    samples: audio_f32_16khz,
    transcript: "the quick brown fox".into(),
    normalized: None, // auto-computed; or precompute with normalize_audio()
};

let output = aligner.align(&input)?;
for word in &output.words {
    println!("{}: [{}, {}) ms  conf={:.2}",
        word.word, word.start_ms, word.end_ms,
        word.confidence.unwrap_or(0.0));
}
```

### Examples

Runnable examples are in the `examples/` directory:

- **[examples/build_default.rs](examples/build_default.rs)** — Build with all defaults (Candle runtime, default tokenizer, Viterbi aligner, default word grouper). Update paths to your model files and 16 kHz mono audio to run.
- **[examples/custom_builder_components.rs](examples/custom_builder_components.rs)** — Build with custom injected components: a mock runtime backend, custom tokenizer, and custom word grouper. Runs without real model files.

```bash
cargo run --example build_default      # requires model + config + vocab paths
cargo run --example custom_builder_components
```

### Alignment report CLI (benchmark binary)

The project ships with an `alignment_report` binary that serves as both a quality evaluation tool and a performance benchmarker. It processes LibriSpeech test sets, compares predicted word timings against reference TextGrid files, and optionally produces detailed per-stage performance reports.

#### Generating a quality report (JSON)

```bash
cargo run --release --features "onnx,report-cli" --bin alignment_report -- \
    --model-dir models/onnx_wav2vec2_base_960h \
    --dataset-root test-data \
    --device cuda \
    --runtime onnx \
    --output-format json
```

This produces a JSON report with per-sentence structural, timing, and confidence metrics, plus aggregate statistics across test-clean and test-other splits.

#### Generating TextGrid output

```bash
cargo run --release --features "onnx,report-cli" --bin alignment_report -- \
    --model-dir models/onnx_wav2vec2_base_960h \
    --dataset-root test-data \
    --device cuda \
    --runtime onnx \
    --output-format textgrid
```

Writes `.TextGrid` files alongside each LibriSpeech `.flac` file, with tiers for words, word-confidence, and transcript.

#### Running performance benchmarks

```bash
cargo run --release --features "onnx,cuda-dp,alignment-profiling,report-cli" \
    --bin alignment_report -- \
    --model-dir models/onnx_wav2vec2_base_960h \
    --dataset-root test-data \
    --device cuda \
    --runtime onnx \
    --output-format perf \
    --perf-out target/perf/rust-cuda.jsonl \
    --perf-warmup 10 \
    --perf-repeats 30 \
    --perf-aggregate median \
    --perf-append
```

See [BENCHMARKS.md](./BENCHMARKS.md) for a detailed description of the benchmark methodology, known biases, and how to reproduce results.

#### Key CLI flags

| Flag | Description |
|------|-------------|
| `--model-dir` | Path to model directory (must contain model weights, `config.json`, `vocab.json`) |
| `--dataset-root` | Path to test data root (expects `LibriSpeech/test-clean` and `LibriSpeech/test-other` underneath) |
| `--device` | `cpu` or `cuda` |
| `--runtime` | `onnx` or `candle` |
| `--output-format` | `json` (quality report), `textgrid` (TextGrid files), `perf` (performance only) |
| `--limit` / `--offset` | Process a subset of cases |
| `--cases-file` | Filter to specific utterance IDs |
| `--perf-out` | Output path for perf JSON/JSONL |
| `--perf-warmup` | Number of warm-up iterations (default: 10) |
| `--perf-repeats` | Number of timed repeats per utterance (default: 30) |
| `--perf-aggregate` | `median` or `mean` |
| `--perf-append` | Append JSONL records (one per utterance) instead of writing a single JSON file |
| `--perf-scaling-report` | Print T×S scaling analysis with Pearson correlation |
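
For reference, the Pearson correlation mentioned for `--perf-scaling-report` measures how linearly one series (for example the T×S product per utterance) tracks another (for example the measured DP time). The function below is a generic sketch of that statistic, not the code used by the CLI.

```rust
/// Pearson correlation coefficient between two equally long series.
///
/// Generic sketch, not taken from the `alignment_report` binary.
/// Returns `None` when the inputs differ in length, hold fewer than two
/// points, or when either series has zero variance.
pub fn pearson(xs: &[f64], ys: &[f64]) -> Option<f64> {
    if xs.len() != ys.len() || xs.len() < 2 {
        return None;
    }
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;

    // Accumulate covariance and both variances (unnormalized).
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (&x, &y) in xs.iter().zip(ys) {
        let (dx, dy) = (x - mean_x, y - mean_y);
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }
    Some(sxy / (sxx.sqrt() * syy.sqrt()))
}
```

A value close to 1 would indicate that DP time grows roughly linearly with T×S.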

### Python comparison script

A companion Python script reuses the original [wav2vec2aligner](https://github.com/EveryVoiceTTS/wav2vec2aligner) to generate TextGrid files and perf records on the same LibriSpeech data, enabling direct comparison:

```bash
python scripts/wav2vec2aligner_librispeech_textgrids.py \
    --dataset-root test-data/LibriSpeech \
    --device cuda \
    --perf-out target/perf/python-cuda.jsonl \
    --perf-warmup 10 \
    --perf-repeats 30 \
    --perf-aggregate median \
    --perf-append
```

A patch file (`wav2vec2aligner-main.perf-monitoring.patch`) adds profiling instrumentation to the original Python project so that per-stage timings (forward, post, dp, group, conf) are recorded in the same JSONL schema as the Rust implementation.

### Web tools (separate branches)

Two small web pages for working with alignment and benchmark data live on dedicated branches (not merged into `main`):

| Branch | Description |
|--------|-------------|
| **Alignment Report** | Web tool to read and inspect performance reports generated by the alignment pipeline (e.g. `--output-format perf` / `--perf-out` JSONL). |
| **alignment explorer** | Web tool to compare alignment data produced by this implementation with the official TorchAudio-based pipeline (e.g. Rust vs Python JSONL outputs). |

To use a tool, check out its branch and open the HTML page (with any accompanying assets) in a browser.

---

## Test data

Benchmarks and evaluation use the [LibriSpeech](https://www.openslr.org/12) corpus, specifically the **test-clean** and **test-other** subsets. Download them from:

> **https://www.openslr.org/12**

Extract them under `test-data/LibriSpeech/test-clean` and `test-data/LibriSpeech/test-other`.

---

## Architecture

### Pipeline stages

```
Audio [f32] ──► Normalize ──► Forward Pass ──► Log-softmax ──► Viterbi DP ──► Grouping ──► Words
                  (μ=0,σ=1)   (Candle|ORT)    (CPU|GPU)       (CPU|wgpu|CUDA) (expand+score)
```

Each stage is abstracted behind a trait (`RuntimeBackend`, `SequenceAligner`, `Tokenizer`, `WordGrouper`) and can be replaced via the builder. The default implementations are wired in `pipeline::defaults`.
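
As an illustration of the first stage, the sketch below normalizes a waveform to zero mean and unit variance. It is a stand-alone approximation of what a `(μ=0,σ=1)` step does, not the crate's `normalize_audio()`; the function name and the `1e-7` variance epsilon are assumptions made for this example.

```rust
/// Zero-mean, unit-variance normalization of a waveform.
///
/// Illustrative sketch only: the name and the small epsilon added to the
/// variance are assumptions, not the crate's implementation.
pub fn normalize_zero_mean_unit_var(samples: &[f32]) -> Vec<f32> {
    if samples.is_empty() {
        return Vec::new();
    }
    // Accumulate in f64 to limit rounding error on long recordings.
    let n = samples.len() as f64;
    let mean = samples.iter().map(|&x| x as f64).sum::<f64>() / n;
    let var = samples
        .iter()
        .map(|&x| {
            let d = x as f64 - mean;
            d * d
        })
        .sum::<f64>()
        / n; // population variance
    let inv_std = 1.0 / (var + 1e-7).sqrt();
    samples
        .iter()
        .map(|&x| ((x as f64 - mean) * inv_std) as f32)
        .collect()
}
```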

### Module layout

```
src/
├── alignment/
│   ├── viterbi.rs              # Dispatch: CPU → wgpu → CUDA based on T×S threshold
│   ├── cuda/
│   │   ├── viterbi.cu          # CUDA kernels: log_softmax_rows, viterbi_forward, viterbi_backtrace
│   │   └── viterbi_cuda.rs     # cudarc host code, zero-copy + upload variants
│   ├── gpu/
│   │   ├── viterbi.wgsl        # WGSL compute shader: single-workgroup wavefront
│   │   └── viterbi_gpu.rs      # wgpu host code, buffer management, blocking readback
│   ├── tokenization.rs         # Case-aware CTC token sequence builder (blank-interleaved)
│   ├── grouping/
│   │   ├── path_to_words.rs    # Phase 1: walk Viterbi path → raw word boundaries
│   │   ├── blank_expansion.rs  # Phase 2: expand boundaries (Balanced policy)
│   │   └── mod.rs              # Orchestration, quality confidence, calibration
│   └── report.rs               # Evaluation: structural/timing/confidence metrics
├── model/
│   ├── ctc_model.rs            # Wav2Vec2ForCTC (Candle)
│   ├── encoder.rs              # Transformer encoder with positional conv
│   ├── feature_extractor.rs    # Conv1d stack with weight-norm, GroupNorm/LayerNorm
│   ├── feature_projection.rs   # Linear projection to hidden dim
│   └── layers.rs               # LayerNorm, GroupNorm1d (custom for Candle)
├── pipeline/
│   ├── builder.rs              # ForcedAlignerBuilder: wire config → pipeline
│   ├── runtime.rs              # ForcedAligner: align() and align_profiled()
│   ├── model_runtime.rs        # CandleRuntimeBackend, OnnxRuntimeBackend
│   ├── cuda_forward.rs         # CudaLogProbsBuffer: zero-copy device buffer
│   ├── defaults.rs             # Default trait implementations
│   ├── traits.rs               # RuntimeBackend, Tokenizer, SequenceAligner, WordGrouper
│   └── memory_tracker.rs       # Per-stage RSS + GPU memory profiling
├── config.rs                   # Wav2Vec2Config, Wav2Vec2ModelConfig
├── types.rs                    # AlignmentInput, AlignmentOutput, WordTiming, WordConfidenceStats
└── error.rs                    # AlignmentError (Io, Json, Runtime, InvalidInput)

src/bin/
├── alignment_report.rs         # CLI binary: quality reports, TextGrid generation, perf benchmarks
└── alignment_report/
    ├── json_report_formatter.rs      # Quality report JSON serializer
    ├── perf_report_formatter.rs      # Perf benchmark JSON/JSONL serializer
    └── text_grid_report_formatter.rs # TextGrid output writer

scripts/
└── wav2vec2aligner_librispeech_textgrids.py  # Python comparison benchmark script
```

### CTC Viterbi algorithm

The core DP aligns a CTC token sequence `S` (blank-interleaved: `⟨blank, c₁, blank, |, blank, c₂, blank, ...⟩`) against `T` frames of log-probabilities from the acoustic model.

**State transitions** follow CTC constraints: stay on current state (`s → s`), step forward (`s-1 → s`), or skip (`s-2 → s`, only if `tokens[s] ≠ tokens[s-2]` to prevent skipping blanks between repeated characters).

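**Reachability band pruning** avoids touching unreachable cells: at each time step `t`, only states in `[curr_start, curr_end]` need to be evaluated. The saving matters most for long sequences, where the unreachable corners of the T×S trellis can account for a substantial share of the cells.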
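
One way to derive such a band, sketched under the usual CTC assumptions (a path starts in state 0 or 1, ends in state S-1 or S-2, and advances by at most two states per frame), is shown below. `reachable_band` is a hypothetical helper written for this README; the crate may compute its bounds differently.

```rust
/// Inclusive range of states that can lie on a valid CTC path at frame `t`.
///
/// Hypothetical helper. Assumes paths start in state 0 or 1, end in state
/// S-1 or S-2, and move forward by at most two states per frame. The band
/// is conservative: skips forbidden by repeated characters are not
/// accounted for, so some cells inside it may still be unreachable.
pub fn reachable_band(t: usize, num_frames: usize, num_states: usize) -> Option<(usize, usize)> {
    if num_states == 0 || t >= num_frames {
        return None;
    }
    // Upper bound: starting from state 1, at most two states per frame.
    let end = (2 * t + 1).min(num_states - 1);
    // Lower bound: state S-2 must still be reachable in the remaining frames.
    let remaining = num_frames - 1 - t;
    let start = num_states.saturating_sub(2).saturating_sub(2 * remaining);
    if start <= end {
        Some((start, end))
    } else {
        None // too few frames for this many states
    }
}
```

With `T = 4` and `S = 5`, the bands are `[0, 1]`, `[0, 3]`, `[1, 4]`, `[3, 4]`, so 12 of the 20 cells are visited.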

**Backpointer storage** uses only 2 bits per cell. Backtrace reconstructs the full path in O(T).
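
Two bits are enough because each cell only records which of the three transitions won (stay = 0, step = 1, skip = 2). The sketch below packs four backpointers per byte and walks them backwards; the type and function names are hypothetical and the memory layout is an assumption, not necessarily the one used by the backends.

```rust
/// Backpointer table storing one 2-bit step (0 = stay, 1 = step, 2 = skip)
/// per (frame, state) cell, four cells per byte. Hypothetical layout.
pub struct PackedBackpointers {
    bits: Vec<u8>,
    num_states: usize,
}

impl PackedBackpointers {
    pub fn new(num_frames: usize, num_states: usize) -> Self {
        let cells = num_frames * num_states;
        Self { bits: vec![0u8; (cells + 3) / 4], num_states }
    }

    pub fn set(&mut self, t: usize, s: usize, step: u8) {
        debug_assert!(step <= 2);
        let idx = t * self.num_states + s;
        let shift = (idx % 4) * 2;
        let byte = &mut self.bits[idx / 4];
        // Clear the two target bits, then write the new step.
        *byte = (*byte & !(0b11u8 << shift)) | ((step & 0b11) << shift);
    }

    pub fn get(&self, t: usize, s: usize) -> u8 {
        let idx = t * self.num_states + s;
        (self.bits[idx / 4] >> ((idx % 4) * 2)) & 0b11
    }

    pub fn byte_len(&self) -> usize {
        self.bits.len()
    }
}

/// Reconstruct the state path in O(T) by following steps backwards
/// from `final_state` at the last frame.
pub fn backtrace(bp: &PackedBackpointers, num_frames: usize, final_state: usize) -> Vec<usize> {
    let mut path = vec![0usize; num_frames];
    let mut s = final_state;
    for t in (0..num_frames).rev() {
        path[t] = s;
        s = s.saturating_sub(bp.get(t, s) as usize);
    }
    path
}
```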

### Three Viterbi backends

All three backends implement identical DP logic and produce bit-identical paths. The dispatch in `viterbi.rs` selects a backend from the T×S product: below 40,000 cells, the CPU path is faster because GPU launch overhead dominates.

There is no strong performance reason for having both the wgpu and CUDA backends — they achieve comparable throughput on the same hardware. Both exist because building them was a fun exercise in exploring GPU compute from Rust through two very different APIs (portable graphics API vs. vendor-native toolkit).

**CPU** — Scalar DP with ping-pong score arrays. Two `Vec<f32>` of length S are swapped each time step. Reference implementation, always available.
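
To make the CPU description concrete, here is a compact sketch of a scalar CTC Viterbi with ping-pong score arrays and the three transitions described above. It is written for this README and is not the crate's code: `ctc_viterbi` is a hypothetical name, band pruning is omitted, and backpointers use one byte per cell for readability.

```rust
/// Scalar CTC Viterbi over `log_probs[t][vocab_id]` for a blank-interleaved
/// token sequence. Returns the best state index per frame, or `None` if no
/// valid path exists. Illustrative sketch, not the crate's implementation.
pub fn ctc_viterbi(log_probs: &[Vec<f32>], tokens: &[usize]) -> Option<Vec<usize>> {
    let (t_len, s_len) = (log_probs.len(), tokens.len());
    if t_len == 0 || s_len == 0 {
        return None;
    }
    const NEG_INF: f32 = f32::NEG_INFINITY;
    // Ping-pong score arrays: `prev` holds frame t-1, `curr` holds frame t.
    let mut prev = vec![NEG_INF; s_len];
    let mut curr = vec![NEG_INF; s_len];
    // One byte per cell here; values are 0 (stay), 1 (step), 2 (skip).
    let mut back = vec![0u8; t_len * s_len];

    // A path may start in the leading blank or in the first character.
    prev[0] = log_probs[0][tokens[0]];
    if s_len > 1 {
        prev[1] = log_probs[0][tokens[1]];
    }

    for t in 1..t_len {
        for s in 0..s_len {
            let (mut best, mut step) = (prev[s], 0u8);
            if s >= 1 && prev[s - 1] > best {
                best = prev[s - 1];
                step = 1;
            }
            // Skipping is only legal when the token two states back differs.
            if s >= 2 && tokens[s] != tokens[s - 2] && prev[s - 2] > best {
                best = prev[s - 2];
                step = 2;
            }
            curr[s] = best + log_probs[t][tokens[s]];
            back[t * s_len + s] = step;
        }
        std::mem::swap(&mut prev, &mut curr);
    }

    // A path may end in the trailing blank or in the last character.
    let mut s = s_len - 1;
    if s_len > 1 && prev[s_len - 2] > prev[s_len - 1] {
        s = s_len - 2;
    }
    if prev[s] == NEG_INF {
        return None;
    }
    let mut path = vec![0usize; t_len];
    for t in (0..t_len).rev() {
        path[t] = s;
        s -= back[t * s_len + s] as usize;
    }
    Some(path)
}
```

With a three-symbol vocabulary (blank = 0) and tokens `[0, 1, 0, 2, 0]`, four frames that favor symbol 1, symbol 1, symbol 2, and blank in turn yield the state path `[1, 1, 3, 4]`.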

**wgpu** (`wgpu-dp` feature) — A single compute shader dispatch runs the entire T-step DP in one workgroup of 256 threads using `workgroupBarrier()` synchronization. Only the T-length path buffer is copied back to host. Supports Vulkan, DX12, and Metal.

**CUDA** (`cuda-dp` feature) — Three kernels compiled at runtime via NVRTC: `log_softmax_rows` (shared-memory reduction), `viterbi_forward` (wavefront DP in dynamic shared memory), and `viterbi_backtrace` (single-thread O(T) path extraction). When ORT runs on CUDA, the entire log-softmax → Viterbi → backtrace pipeline executes on device with zero-copy — only the final path array transfers to host.
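
For readers unfamiliar with the first kernel, the following CPU sketch shows what a row-wise log-softmax computes, using the max-subtraction trick for numerical stability. It is a reference for the math only; the function is written for this README and does not mirror the CUDA kernel's memory layout or reduction strategy.

```rust
/// In-place row-wise log-softmax over a row-major `[T, vocab_size]` buffer.
///
/// CPU reference sketch for the math only. Each row ends up holding
/// log-probabilities, i.e. `exp(row)` sums to 1.
pub fn log_softmax_rows(logits: &mut [f32], vocab_size: usize) {
    assert!(vocab_size > 0 && logits.len() % vocab_size == 0);
    for row in logits.chunks_exact_mut(vocab_size) {
        // Subtract the row maximum before exponentiating to avoid overflow.
        let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let sum_exp: f32 = row.iter().map(|&x| (x - max).exp()).sum();
        let log_z = max + sum_exp.ln();
        for x in row.iter_mut() {
            *x -= log_z;
        }
    }
}
```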

### Tokenization

`build_token_sequence_case_aware` detects vocabulary casing (uppercase-only, lowercase-only, or mixed) and normalizes the transcript accordingly. It produces a blank-interleaved token sequence with a parallel `chars` array mapping each position to its character (or `None` for blanks).
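
The idea can be sketched as follows. This is not the signature of `build_token_sequence_case_aware`; the names `detect_case` and `build_tokens`, the `HashMap<char, usize>` vocabulary, the use of `|` as word separator, and the choice to silently skip out-of-vocabulary characters are all assumptions made for illustration.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum VocabCase {
    Upper,
    Lower,
    Mixed,
}

/// Classify a character vocabulary by the casing of its letters.
pub fn detect_case(vocab: &HashMap<char, usize>) -> VocabCase {
    let has_upper = vocab.keys().any(|c| c.is_uppercase());
    let has_lower = vocab.keys().any(|c| c.is_lowercase());
    match (has_upper, has_lower) {
        (true, false) => VocabCase::Upper,
        (false, true) => VocabCase::Lower,
        _ => VocabCase::Mixed,
    }
}

/// Build a blank-interleaved token sequence plus a parallel `chars` array
/// (`None` marks blanks). Words are joined with '|'; characters missing
/// from the vocabulary are skipped (an assumption of this sketch).
pub fn build_tokens(
    transcript: &str,
    vocab: &HashMap<char, usize>,
    blank_id: usize,
) -> (Vec<usize>, Vec<Option<char>>) {
    let text = match detect_case(vocab) {
        VocabCase::Upper => transcript.to_uppercase(),
        VocabCase::Lower => transcript.to_lowercase(),
        VocabCase::Mixed => transcript.to_string(),
    };
    let joined = text.split_whitespace().collect::<Vec<_>>().join("|");

    let mut tokens = vec![blank_id];
    let mut chars = vec![None];
    for c in joined.chars() {
        if let Some(&id) = vocab.get(&c) {
            tokens.push(id);
            chars.push(Some(c));
            tokens.push(blank_id);
            chars.push(None);
        }
    }
    (tokens, chars)
}
```

For an uppercase vocabulary `{A: 1, B: 2, |: 3}` with blank id 0, the transcript `"a b"` becomes tokens `[0, 1, 0, 3, 0, 2, 0]`.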

### Word grouping and blank expansion

Grouping happens in three phases:

1. **Path to raw words** — Walks the Viterbi path frame by frame, building tight `[start_frame, end_frame]` boundaries and accumulating emission log-probs and top-2 margins per word.

2. **Blank expansion** — Three policies (Balanced, ConservativeStart, AggressiveTail) expand boundaries into adjacent blank regions with different trade-offs for left expansion, right pullback, and minimum interior silence.

3. **Candidate selection** — All three candidates are scored using boundary blank evidence, shift penalty, and pause plausibility. The best expansion is selected per word.
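
Phase 1 can be pictured with the sketch below, which walks a state path and emits tight word boundaries. It only covers boundary extraction (no log-prob or margin accumulation, no expansion), and `RawWord` and `path_to_raw_words` are hypothetical names; it also assumes `|` is the word-separator token.

```rust
#[derive(Debug, PartialEq)]
pub struct RawWord {
    pub text: String,
    pub start_frame: usize,
    pub end_frame: usize, // inclusive
}

/// Walk a Viterbi state path and collect tight word boundaries.
///
/// `path[frame]` is a state index and `chars[state]` is the character of
/// that state (`None` for blanks). Sketch only; assumes '|' separates words.
pub fn path_to_raw_words(path: &[usize], chars: &[Option<char>]) -> Vec<RawWord> {
    let mut words = Vec::new();
    let mut current: Option<RawWord> = None;
    let mut prev_state: Option<usize> = None;

    for (frame, &state) in path.iter().enumerate() {
        match chars[state] {
            // Blank frames neither extend nor close a word.
            None => {}
            // A separator closes the word in progress, if any.
            Some('|') => {
                if let Some(word) = current.take() {
                    words.push(word);
                }
            }
            Some(c) => {
                let word = current.get_or_insert_with(|| RawWord {
                    text: String::new(),
                    start_frame: frame,
                    end_frame: frame,
                });
                // A state held over several frames is still one character.
                if prev_state != Some(state) {
                    word.text.push(c);
                }
                word.end_frame = frame;
            }
        }
        prev_state = Some(state);
    }
    if let Some(word) = current.take() {
        words.push(word);
    }
    words
}
```

The blank frames left between consecutive words by this phase are what the expansion policies then distribute.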

### Confidence scoring

Each word receives a composite quality confidence score blending geometric mean emission probability, margin, p10 log-prob, and boundary evidence, passed through piecewise-linear calibration to produce a value in [0, 1].
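
The sketch below shows one way such a blend could be assembled. The weights, the calibration knots, the nearest-rank p10, and the assumption that margins and boundary evidence are already in [0, 1] are all placeholders chosen for illustration; the crate's actual values and formula are not reproduced here.

```rust
/// Per-word evidence; field names are hypothetical.
pub struct WordEvidence {
    pub emission_log_probs: Vec<f32>, // log-prob of the aligned token per frame
    pub top2_margins: Vec<f32>,       // assumed to be in [0, 1]
    pub boundary_blank: f32,          // assumed to be in [0, 1]
}

pub fn geometric_mean_prob(log_probs: &[f32]) -> f32 {
    if log_probs.is_empty() {
        return 0.0;
    }
    // exp(mean(log p)) is the geometric mean of the probabilities.
    (log_probs.iter().sum::<f32>() / log_probs.len() as f32).exp()
}

/// 10th percentile by the nearest-rank method.
pub fn percentile_10(values: &[f32]) -> Option<f32> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let rank = (sorted.len() + 9) / 10; // ceil(n / 10), at least 1
    Some(sorted[rank - 1])
}

/// Piecewise-linear interpolation through `knots` (non-empty, x strictly
/// increasing); clamps to the end values outside the knot range.
pub fn piecewise_linear(x: f32, knots: &[(f32, f32)]) -> f32 {
    let (first, last) = (knots[0], knots[knots.len() - 1]);
    if x <= first.0 {
        return first.1;
    }
    if x >= last.0 {
        return last.1;
    }
    for w in knots.windows(2) {
        let ((x0, y0), (x1, y1)) = (w[0], w[1]);
        if x <= x1 {
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0);
        }
    }
    last.1
}

/// Blend the components and calibrate. Weights and knots are placeholders.
pub fn composite_confidence(e: &WordEvidence) -> f32 {
    const KNOTS: [(f32, f32); 4] = [(0.0, 0.0), (0.5, 0.3), (0.8, 0.8), (1.0, 1.0)];
    let geo = geometric_mean_prob(&e.emission_log_probs);
    let p10 = percentile_10(&e.emission_log_probs).map_or(0.0, f32::exp);
    let margin = if e.top2_margins.is_empty() {
        0.0
    } else {
        (e.top2_margins.iter().sum::<f32>() / e.top2_margins.len() as f32).clamp(0.0, 1.0)
    };
    let raw = 0.4 * geo + 0.2 * margin + 0.2 * p10 + 0.2 * e.boundary_blank.clamp(0.0, 1.0);
    piecewise_linear(raw, &KNOTS)
}
```

Because every step is a pure function of the evidence, the same input always yields the same score, which is what makes the scoring deterministic.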

---

## Development / CI

The project uses **Clippy with `-D warnings`** so that lint fixes are required before merge. Formatting is enforced with `cargo fmt --check`. No global Clippy allow flags are used in CI; a few lints are allowed locally where the team prefers readability (e.g. `clippy::needless_range_loop` in DP loops, `clippy::too_many_arguments` on grouping APIs). To run the same checks locally:

```bash
cargo fmt --all -- --check
cargo clippy --all-targets --features onnx -- -D warnings
cargo test --features onnx
```

Optional **pre-commit hooks** (format + clippy) are in `githooks/`. Enable with `git config core.hooksPath githooks` from the repo root; see [githooks/README.md](githooks/README.md).

Any CI pipeline (e.g. GitHub Actions) should run these three steps as blocking jobs.

### Manual crate publish workflow

A manual GitHub Action is available at `.github/workflows/publish-crate-manual.yml`.

The workflow assumes **the version bump is already committed** on the branch you run it from (branch protection may prevent the action from pushing). It only builds, packages, and publishes:

- **Actions → Publish crate (manual) → Run workflow** (no inputs)

Required secret:

- `CRATES_IO_TOKEN` (crates.io API token)

The workflow runs `cargo build --release`, `cargo package`, and `cargo publish`.

---

## License

This project is licensed under the Mozilla Public License Version 2.0; see the [license text](https://www.mozilla.org/en-US/MPL/2.0/) for details.