CPU-only Qwen3-ASR speech recognition in pure Rust.
BLAS and SIMD optimizations are selected automatically at compile time based on the target platform — Accelerate + NEON on macOS/aarch64, OpenBLAS + AVX2 on Linux/x86_64, etc. For best performance on x86_64, build with:
RUSTFLAGS="-C target-cpu=native" cargo build --release
Important: Always build in release mode (--release). Debug builds are
10–50x slower and unusable for real-time inference.
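To illustrate what "selected at compile time" means in practice, here is a minimal sketch that reports which SIMD target features a binary was built with, using only the standard `cfg!` macro and `std::env::consts`. The function names `compiled_simd_features` and `target_summary` are hypothetical and are not part of this crate; the crate's own dispatch logic may look quite different.

```rust
/// Names of the SIMD-related target features this binary was compiled with.
/// `cfg!` is resolved at compile time, so the result is fixed for a given build
/// (for example, `-C target-cpu=native` on an AVX2 machine enables "avx2").
pub fn compiled_simd_features() -> Vec<&'static str> {
    let mut features = Vec::new();
    if cfg!(target_feature = "avx2") {
        features.push("avx2");
    }
    if cfg!(target_feature = "fma") {
        features.push("fma");
    }
    if cfg!(target_feature = "neon") {
        features.push("neon");
    }
    features
}

/// One-line summary such as "<os>/<arch>: [<features>]" for logging.
pub fn target_summary() -> String {
    format!(
        "{}/{}: [{}]",
        std::env::consts::OS,
        std::env::consts::ARCH,
        compiled_simd_features().join(", ")
    )
}
```

Printing `target_summary()` from a debug build and again from a release build made with `RUSTFLAGS="-C target-cpu=native"` is a quick way to confirm that the flags took effect.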
§Quick Start
use qwen_asr::context::QwenCtx;
use qwen_asr::transcribe;
let mut ctx = QwenCtx::load("qwen3-asr-0.6b").expect("model not found");
let text = transcribe::transcribe(&mut ctx, "audio.wav").unwrap();
println!("{text}");
§Forced Alignment
With the aligner model variant you can obtain word-level timestamps for a known transcript:
use qwen_asr::context::QwenCtx;
use qwen_asr::align;
let mut ctx = QwenCtx::load("qwen3-aligner-0.6b").expect("aligner model not found");
let samples: Vec<f32> = vec![]; // 16 kHz mono f32 PCM
let results = align::forced_align(&mut ctx, &samples, "Hello world", "English").unwrap();
for r in &results {
    println!("{}: {:.0} – {:.0} ms", r.text, r.start_ms, r.end_ms);
}
§Module Guide
| Module | Purpose |
|---|---|
| context | Engine state; start here with context::QwenCtx::load |
| transcribe | Offline, segmented, and streaming transcription |
| audio | WAV loading, resampling, mel spectrogram |
| align | Forced alignment (word/character timestamps) |
| config | Model configuration and variant detection |
| tokenizer | GPT-2 byte-level BPE tokenizer |
The remaining modules (encoder, decoder, kernels, safetensors) are
implementation details and not intended for direct use.
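The forced-alignment example expects 16 kHz mono f32 PCM samples. When audio arrives as raw 16-bit integers (for instance from a capture device) rather than through the audio module, a conversion along the following lines may be needed first. This is a minimal sketch with hypothetical helper names (`pcm_i16_to_f32`, `stereo_to_mono`); it does not resample, so it assumes the input is already at 16 kHz, and it is not the crate's own loader.

```rust
/// Convert signed 16-bit PCM samples to f32 in the range [-1.0, 1.0).
/// Dividing by 32768 maps i16::MIN to exactly -1.0.
pub fn pcm_i16_to_f32(pcm: &[i16]) -> Vec<f32> {
    pcm.iter().map(|&s| f32::from(s) / 32768.0).collect()
}

/// Downmix interleaved stereo (L, R, L, R, ...) to mono by averaging each pair.
/// A trailing unpaired sample, if any, is dropped.
pub fn stereo_to_mono(interleaved: &[f32]) -> Vec<f32> {
    interleaved
        .chunks_exact(2)
        .map(|pair| 0.5 * (pair[0] + pair[1]))
        .collect()
}
```

The resulting `Vec<f32>` can then be passed by reference wherever a `&samples` argument is expected, as in the forced-alignment example.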
Modules§
- align
- Forced alignment: word- and character-level timestamps.
- audio
- WAV loading, resampling, and mel spectrogram computation.
- config
- Model configuration and automatic variant detection.
- context
- Top-level engine state (QwenCtx) owning all loaded weights and runtime buffers.
- decoder
- Qwen3 LLM decoder with GQA, KV cache, and generation.
- encoder
- Audio encoder: Conv2D stem + windowed transformer + projection cascade.
- kernels
- BLAS/vDSP bindings, thread pool, and SIMD kernel dispatch.
- safetensors
- Safetensors mmap reader with multi-shard support.
- tokenizer
- GPT-2 byte-level BPE tokenizer for Qwen.
- transcribe
- Offline, segmented, and streaming transcription orchestration.
Functions§
- optimization_flags
- Returns a list of compile-time optimization flags enabled for this build.