transcribe-rs 0.3.11

A simple library to help you transcribe audio
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
# transcribe-rs

Multi-engine speech-to-text library for Rust. Supports Parakeet, Canary, Cohere, Moonshine, SenseVoice, GigaAM, Whisper, Whisperfile, and OpenAI.

## Breaking Changes in 0.3.0

Version 0.3.0 changes the `SpeechModel` trait. If you need the old API, pin to `version = "=0.2.9"`.

- `transcribe()` and `transcribe_file()` now take `&TranscribeOptions` instead of `Option<&str>` for language
- `SpeechModel` requires `Send`, enabling `Box<dyn SpeechModel + Send>` across threads
- `TranscribeOptions` includes a `translate` field for Whisper/Whisperfile translation support
- `WhisperEngine::capabilities()` now returns actual model language support (English-only vs multilingual) instead of always reporting 99 languages

**Note:** 0.3.0 is a large migration. We believe correctness is preserved for all engines, but expect potential issues as this stabilizes. Please report any problems on [GitHub](https://github.com/cjpais/transcribe-rs/issues).

## Installation

```toml
[dependencies]
transcribe-rs = { version = "0.3", features = ["onnx"] }
```

No features are enabled by default. Pick the engines you need:

| Feature | Engines |
|---------|---------|
| `onnx` | Parakeet, Canary, Cohere, Moonshine, SenseVoice, GigaAM (via ONNX Runtime) |
| `whisper-cpp` | Whisper (local, GGML via whisper.cpp with Metal/Vulkan/CUDA) |
| `whisperfile` | Whisperfile (local server wrapper) |
| `openai` | OpenAI API (remote, async) |
| `all` | Everything above |

GPU accelerator features for whisper.cpp:

| Feature | Backend |
|---------|---------|
| `whisper-metal` | Apple Metal (macOS) |
| `whisper-vulkan` | Vulkan (Windows/Linux) |
| `whisper-cuda` | NVIDIA CUDA (requires CUDA Toolkit 12.0+) |

GPU accelerator features for ORT engines:

| Feature | Backend |
|---------|---------|
| `ort-cuda` | NVIDIA CUDA |
| `ort-rocm` | AMD ROCm |
| `ort-directml` | Microsoft DirectML (Windows) |
| `ort-coreml` | Apple CoreML (macOS/iOS) |
| `ort-webgpu` | WebGPU via Dawn |

## Quick Start

```rust
use transcribe_rs::onnx::parakeet::{ParakeetModel, ParakeetParams, TimestampGranularity};
use transcribe_rs::onnx::Quantization;
use std::path::PathBuf;

let mut model = ParakeetModel::load(
    &PathBuf::from("models/parakeet-tdt-0.6b-v3-int8"),
    &Quantization::Int8,
)?;

let samples = transcribe_rs::audio::read_wav_samples(&PathBuf::from("audio.wav"))?;
let result = model.transcribe_with(
    &samples,
    &ParakeetParams {
        timestamp_granularity: Some(TimestampGranularity::Segment),
        ..Default::default()
    },
)?;
println!("{}", result.text);
```

All local engines implement the `SpeechModel` trait. Remote engines (OpenAI) implement `RemoteTranscriptionEngine` separately because they are async and file-based.

## Hardware Acceleration

By default, engines use CPU. To enable GPU acceleration, enable the appropriate feature and set the accelerator preference before loading any models:

```rust
use transcribe_rs::{set_ort_accelerator, OrtAccelerator};

// Use CUDA for all ORT engines (SenseVoice, GigaAM, Parakeet, Moonshine)
set_ort_accelerator(OrtAccelerator::Cuda);

// Or auto-detect the best available GPU
set_ort_accelerator(OrtAccelerator::Auto);
```

For whisper.cpp, GPU backend (Metal, Vulkan, CUDA) is selected at compile time via feature flags. You can control whether GPU is used at runtime:

```rust
use transcribe_rs::{set_whisper_accelerator, WhisperAccelerator};

set_whisper_accelerator(WhisperAccelerator::CpuOnly); // force CPU
```

**DirectML note:** DirectML requires special ORT session settings (`parallel_execution(false)`, `memory_pattern(false)`) that would hurt performance on other backends. Because of this, `Auto` mode does not include DirectML — you must explicitly select it with `OrtAccelerator::DirectMl`.

Query which ORT accelerators are compiled in with `OrtAccelerator::available()`.

## Usage by Engine

### Canary

```rust
use transcribe_rs::onnx::canary::{CanaryModel, CanaryParams};
use transcribe_rs::onnx::Quantization;
use std::path::PathBuf;

let mut model = CanaryModel::load(
    &PathBuf::from("models/canary-1b-v2"),
    &Quantization::Int8,
)?;

let samples = transcribe_rs::audio::read_wav_samples(&PathBuf::from("audio.wav"))?;
let result = model.transcribe_with(
    &samples,
    &CanaryParams {
        language: Some("en".to_string()),
        ..Default::default()
    },
)?;
```

Canary supports translation via `target_language`:

```rust
let result = model.transcribe_with(
    &samples,
    &CanaryParams {
        language: Some("de".to_string()),
        target_language: Some("en".to_string()),
        ..Default::default()
    },
)?;
```

Model variant (Flash vs V2) is auto-detected from vocabulary size. Flash models support en/de/es/fr; V2 supports 25 languages.

**Features:**
- **PnC** (punctuation and capitalization) — enabled by default. When on, the model adds proper punctuation and capitalization. Set `use_pnc: false` for raw output.
- **ITN** (inverse text normalization) — enabled by default. Converts spoken numbers to written form (e.g. "one hundred twenty three" becomes "123"). Set `use_itn: false` to disable. Only supported on V2 models; silently ignored on Flash.
- **Translation** — set `target_language` to translate between supported languages.

### Cohere

```rust
use transcribe_rs::onnx::cohere::{CohereModel, CohereParams};
use transcribe_rs::onnx::Quantization;
use std::path::PathBuf;

let mut model = CohereModel::load(
    &PathBuf::from("models/cohere-int4"),
    &Quantization::Int4,
)?;

let samples = transcribe_rs::audio::read_wav_samples(&PathBuf::from("audio.wav"))?;
let result = model.transcribe_with(
    &samples,
    &CohereParams {
        language: Some("en".to_string()),
        ..Default::default()
    },
)?;
```

Available in int4 and int8 quantizations.

### SenseVoice

```rust
use transcribe_rs::onnx::sense_voice::{SenseVoiceModel, SenseVoiceParams};
use transcribe_rs::onnx::Quantization;
use std::path::PathBuf;

let mut model = SenseVoiceModel::load(
    &PathBuf::from("models/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17"),
    &Quantization::Int8,
)?;

let samples = transcribe_rs::audio::read_wav_samples(&PathBuf::from("audio.wav"))?;
let result = model.transcribe_with(
    &samples,
    &SenseVoiceParams {
        language: Some("en".to_string()),
        ..Default::default()
    },
)?;
```

### Moonshine

```rust
use transcribe_rs::onnx::moonshine::{MoonshineModel, MoonshineVariant};
use transcribe_rs::onnx::Quantization;
use transcribe_rs::SpeechModel;
use std::path::PathBuf;

let mut model = MoonshineModel::load(
    &PathBuf::from("models/moonshine-base"),
    MoonshineVariant::Base,
    &Quantization::default(),
)?;
let result = model.transcribe_file(&PathBuf::from("audio.wav"), &transcribe_rs::TranscribeOptions::default())?;
```

Streaming variant:

```rust
use transcribe_rs::onnx::moonshine::StreamingModel;
use transcribe_rs::onnx::Quantization;
use transcribe_rs::SpeechModel;
use std::path::PathBuf;

let mut model = StreamingModel::load(
    &PathBuf::from("models/moonshine-streaming/moonshine-tiny-streaming-en"),
    4,  // threads
    &Quantization::default(),
)?;
let result = model.transcribe_file(&PathBuf::from("audio.wav"), &transcribe_rs::TranscribeOptions::default())?;
```

### GigaAM

```rust
use transcribe_rs::onnx::gigaam::GigaAMModel;
use transcribe_rs::onnx::Quantization;
use transcribe_rs::SpeechModel;
use std::path::PathBuf;

let mut model = GigaAMModel::load(
    &PathBuf::from("models/giga-am-v3"),
    &Quantization::default(),
)?;
let result = model.transcribe_file(&PathBuf::from("audio.wav"), &transcribe_rs::TranscribeOptions::default())?;
```

### Whisper (whisper.cpp)

```rust
use transcribe_rs::whisper_cpp::{WhisperEngine, WhisperInferenceParams};
use std::path::PathBuf;

let mut engine = WhisperEngine::load(&PathBuf::from("models/whisper-medium-q4_1.bin"))?;

let samples = transcribe_rs::audio::read_wav_samples(&PathBuf::from("audio.wav"))?;
let result = engine.transcribe_with(
    &samples,
    &WhisperInferenceParams {
        initial_prompt: Some("Context prompt here.".to_string()),
        ..Default::default()
    },
)?;
```

### Whisperfile

```rust
use transcribe_rs::whisperfile::{
    WhisperfileEngine, WhisperfileInferenceParams, WhisperfileLoadParams,
};
use std::path::PathBuf;

let mut engine = WhisperfileEngine::load_with_params(
    &PathBuf::from("models/whisperfile-0.9.3"),
    &PathBuf::from("models/ggml-small.bin"),
    WhisperfileLoadParams {
        port: 8080,
        startup_timeout_secs: 60,
        ..Default::default()
    },
)?;

let samples = transcribe_rs::audio::read_wav_samples(&PathBuf::from("audio.wav"))?;
let result = engine.transcribe_with(
    &samples,
    &WhisperfileInferenceParams {
        language: Some("en".to_string()),
        ..Default::default()
    },
)?;
// Server shuts down automatically when engine is dropped.
```

### OpenAI (Remote)

```rust
use transcribe_rs::remote::openai::{self, OpenAIModel, OpenAIRequestParams};
use transcribe_rs::{remote, RemoteTranscriptionEngine};
use std::path::PathBuf;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = openai::default_engine();
    let result = engine
        .transcribe_file(
            &PathBuf::from("audio.wav"),
            OpenAIRequestParams::builder()
                .model(OpenAIModel::Gpt4oMiniTranscribe)
                .timestamp_granularity(remote::openai::OpenAITimestampGranularity::Segment)
                .build()?,
        )
        .await?;
    println!("{}", result.text);
    Ok(())
}
```

## Models

All audio input must be **16 kHz, mono, 16-bit PCM WAV**.

### Model Downloads

| Engine | Download |
|--------|----------|
| Parakeet (int8) | [blob.handy.computer]https://blob.handy.computer/parakeet-v3-int8.tar.gz / [HuggingFace]https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx/tree/main |
| Canary 180M Flash | [HuggingFace]https://huggingface.co/istupakov/canary-180m-flash-onnx |
| Canary 1B Flash | [HuggingFace]https://huggingface.co/istupakov/canary-1b-flash-onnx |
| Canary 1B v2 | [HuggingFace]https://huggingface.co/istupakov/canary-1b-v2-onnx |
| Cohere (int4) | [HuggingFace]https://huggingface.co/cstr/cohere-transcribe-onnx-int4 |
| Cohere (int8) | [HuggingFace]https://huggingface.co/tristanripke/cohere-transcribe-onnx-int8 |
| SenseVoice (int8) | [blob.handy.computer]https://blob.handy.computer/sense-voice-int8.tar.gz / [sherpa-onnx]https://github.com/k2-fsa/sherpa-onnx/releases/tag/asr-models |
| Moonshine | [blob.handy.computer (base)]https://blob.handy.computer/moonshine-base.tar.gz, [blob.handy.computer (tiny streaming en)]https://blob.handy.computer/moonshine-tiny-streaming-en.tar.gz, [blob.handy.computer (small streaming en)]https://blob.handy.computer/moonshine-small-streaming-en.tar.gz, [blob.handy.computer (medium streaming en)]https://blob.handy.computer/moonshine-medium-streaming-en.tar.gz |
| GigaAM | [HuggingFace]https://huggingface.co/istupakov/gigaam-v3-onnx/tree/main |
| Whisper (GGML) | [HuggingFace]https://huggingface.co/ggerganov/whisper.cpp/tree/main |
| Whisperfile binary | [GitHub]https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/whisperfile-0.9.3 |

### Directory Layouts

**Parakeet** (directory):
```
models/parakeet-tdt-0.6b-v3-int8/
├── encoder-model.int8.onnx
├── decoder_joint-model.int8.onnx
├── nemo128.onnx
└── vocab.txt
```

**Canary** (directory):
```
models/canary-1b-v2/
├── encoder-model.int8.onnx
├── decoder-model.int8.onnx
├── nemo128.onnx
└── vocab.txt
```

**Cohere** (directory):
```
models/cohere-int4/
├── cohere-encoder.int4.onnx
├── cohere-encoder.int4.onnx.data
├── cohere-decoder.int4.onnx
├── cohere-decoder.int4.onnx.data
└── tokens.txt
```

**SenseVoice** (directory):
```
models/sense-voice/
├── model.int8.onnx
└── tokens.txt
```

**Moonshine** (directory):
```
models/moonshine-base/
├── encoder_model.onnx
├── decoder_model_merged.onnx
└── tokenizer.json
```

**Moonshine Streaming** (directory):
```
models/moonshine-streaming/moonshine-tiny-streaming-en/
├── encoder.onnx
├── decoder.onnx
├── streaming_config.json
└── tokenizer.json
```

**GigaAM** (directory):
```
models/giga-am-v3/
├── model.onnx          (or model.int8.onnx)
└── vocab.txt
```

**Whisper**: single file (e.g. `whisper-medium-q4_1.bin`).

### Moonshine Variants

| Variant | Language |
|---------|----------|
| Tiny | English |
| TinyAr | Arabic |
| TinyZh | Chinese |
| TinyJa | Japanese |
| TinyKo | Korean |
| TinyUk | Ukrainian |
| TinyVi | Vietnamese |
| Base | English |
| BaseEs | Spanish |

## Examples and Tests

Each engine has an example in `examples/`. Run with the appropriate feature flag:

```bash
cargo run --example parakeet --features onnx
cargo run --example canary --features onnx
cargo run --example cohere --features onnx
cargo run --example sense_voice --features onnx
cargo run --example moonshine --features onnx
cargo run --example moonshine_streaming --features onnx
cargo run --example gigaam --features onnx
cargo run --example whisper --features whisper-cpp
cargo run --example whisperfile --features whisperfile
cargo run --example openai --features openai
```

Tests are also feature-gated. Models must be present locally; tests skip gracefully if not found.

```bash
cargo test --features onnx
cargo test --features whisper-cpp
cargo test --features whisperfile
cargo test --all-features
```

Whisperfile tests look for the binary at `models/whisperfile-0.9.3` (override with `WHISPERFILE_BIN`) and model at `models/ggml-small.bin` (override with `WHISPERFILE_MODEL`). GigaAM tests require `samples/russian.wav`.

Development aliases from `.cargo/config.toml`:

```bash
cargo check-all    # cargo check --all-features
cargo build-all    # cargo build --all-features
cargo test-all     # cargo test --all-features
```

## Performance

Parakeet int8 benchmarks:

| Platform | Speed |
|----------|-------|
| MBP M4 Max | ~30x real-time |
| Zen 3 (5700X) | ~20x real-time |
| Skylake (i5-6500) | ~5x real-time |
| Jetson Nano CPU | ~5x real-time |

## Acknowledgments

- [istupakov]https://github.com/istupakov/onnx-asr for the ONNX Parakeet, Canary, and GigaAM exports
- [NVIDIA]https://github.com/NVIDIA/NeMo for Parakeet and Canary
- [whisper.cpp]https://github.com/ggerganov/whisper.cpp
- [jart]http://github.com/jart / [Mozilla AI]https://github.com/mozilla-ai for [llamafile]https://github.com/mozilla-ai/llamafile and Whisperfile
- [UsefulSensors]https://github.com/usefulsensors for Moonshine
- [FunASR]https://github.com/modelscope/FunASR / [sherpa-onnx]https://github.com/k2-fsa/sherpa-onnx for SenseVoice
- [SberDevices]https://github.com/salute-developers/GigaAM for GigaAM