# wavekat-asr

Unified streaming speech-to-text for voice pipelines, wrapping multiple ASR engines behind common Rust traits. Same pattern as wavekat-vad, wavekat-turn, and wavekat-tts.
> [!WARNING]
> Pre-1.0. The trait surface may iterate as more backends land. Pin to an exact patch version.
## Backends
| Backend | Feature flag | Transport | Languages | Status | License |
|---|---|---|---|---|---|
| sherpa-onnx (streaming Zipformer / Paraformer) | `sherpa-onnx` | Local ONNX | EN, ZH, EN+ZH | ✅ Available | Apache 2.0 |
Local-first by design: the bundled sherpa-onnx backend ships today and runs entirely on-device.
## Quick start
```rust
// Import paths and exact signatures reconstructed from this README; they may differ slightly.
use wavekat_asr::{AudioFrame, Channel, SherpaOnnxAsr, StreamingAsr};

let (mut asr, rx) = SherpaOnnxAsr::new()?; // auto-downloads bilingual model on first run
let samples = vec![0.0f32; 16_000];        // 1 s of 16 kHz mono audio
let frame = AudioFrame::new(samples, 16_000);
asr.push_audio(frame, Channel::Local)?;
asr.finish()?; // flush at end of stream
for event in rx.try_iter() {
    println!("{event:?}");
}
```
## The `StreamingAsr` trait
All backends implement a common trait so you can write code generic over backends:
Transcript events come back through an `mpsc::Receiver<TranscriptEvent>` the backend hands you at construction time. `Channel::{Local, Remote}` tags which side of a two-channel call each event belongs to — the daemon tees both RTP directions through one ASR instance.
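A sketch of what code generic over the trait can look like, using a toy backend. The trait name, `AudioFrame`, `Channel`, and `TranscriptEvent` come from this README, but the field layouts, error type, and exact method signatures here are assumptions:

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Types assumed from wavekat-core / this README; fields are illustrative.
pub struct AudioFrame {
    pub samples: Vec<f32>,
    pub sample_rate: u32,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Channel { Local, Remote }

#[derive(Debug)]
pub struct TranscriptEvent {
    pub channel: Channel,
    pub text: String,
}

// Hypothetical trait shape matching the push/finish/receiver description.
pub trait StreamingAsr {
    fn push_audio(&mut self, frame: AudioFrame, channel: Channel) -> Result<(), String>;
    fn finish(&mut self) -> Result<(), String>;
}

// Toy backend: emits one event per pushed frame, purely for illustration.
pub struct EchoAsr { tx: Sender<TranscriptEvent> }

impl EchoAsr {
    pub fn new() -> (Self, Receiver<TranscriptEvent>) {
        let (tx, rx) = channel();
        (EchoAsr { tx }, rx)
    }
}

impl StreamingAsr for EchoAsr {
    fn push_audio(&mut self, frame: AudioFrame, channel: Channel) -> Result<(), String> {
        let text = format!("{} samples at {} Hz", frame.samples.len(), frame.sample_rate);
        self.tx.send(TranscriptEvent { channel, text }).map_err(|e| e.to_string())
    }
    fn finish(&mut self) -> Result<(), String> { Ok(()) }
}

fn main() {
    let (mut asr, rx) = EchoAsr::new();
    let frame = AudioFrame { samples: vec![0.0; 16_000], sample_rate: 16_000 };
    asr.push_audio(frame, Channel::Local).unwrap();
    asr.finish().unwrap();
    for event in rx.try_iter() {
        println!("{:?}: {}", event.channel, event.text);
    }
}
```

Swapping `EchoAsr` for a real backend leaves the consumer loop unchanged, which is the point of the shared trait.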
## Architecture
```text
wavekat-vad  → "is someone speaking?"
wavekat-turn → "are they done speaking?"
wavekat-asr  → "what did they say?"
wavekat-tts  → "synthesize the response"
      │                  │                   │                  │
      └──────────────────┴───────────────────┴──────────────────┘
                                  │
                      AudioFrame (wavekat-core)
```
The trait surface stays deliberately small. Backends own their own resampling, network state, and tokenizer.
```text
AudioFrame  ──▶ push_audio(frame, channel) ──▶ ┌───────────┐
                                               │  Backend  │
end of call ──▶ finish() ─────────────────────▶│           │
                                               │           │
        TranscriptEvent ◀──────────────────────│           │
        on Receiver                            └───────────┘
```
Why sync push + receiver, rather than async fn? The intended consumer
already runs an event loop and fans events out to clients; matching that
shape avoids forcing a tokio runtime through the trait. Backends that
need their own runtime spawn one internally.
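The consumer side of that shape is an ordinary blocking loop over the receiver, with no async runtime anywhere. A minimal sketch, using plain `String` events as a stand-in for `TranscriptEvent`:

```rust
use std::sync::mpsc::channel;
use std::thread;

fn main() {
    // Stand-in for the TranscriptEvent receiver a backend hands out.
    let (tx, rx) = channel::<String>();

    // Producer stands in for a backend's internal worker thread.
    let producer = thread::spawn(move || {
        tx.send("partial: hel".to_string()).unwrap();
        tx.send("final: hello".to_string()).unwrap();
        // Dropping tx closes the channel, which ends the loop below.
    });

    // The consumer is a plain blocking iterator: no tokio, no async fn.
    for event in rx {
        println!("{event}");
    }
    producer.join().unwrap();
}
```

An event-loop host can equally poll with `rx.try_iter()` each tick instead of blocking, as in the quick-start snippet.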
## sherpa-onnx backend
Local streaming Zipformer / Paraformer via
sherpa-onnx. Auto-downloads the
selected model from HuggingFace on first use; cached under $HF_HOME/hub/
(default ~/.cache/huggingface/hub/).
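The cache location follows the convention described above. A small sketch of that resolution logic (this is the documented convention, not the crate's actual code; the helper name is hypothetical):

```rust
use std::env;
use std::path::PathBuf;

// $HF_HOME/hub when HF_HOME is set, otherwise ~/.cache/huggingface/hub.
// Parameterized so the logic is testable without touching process env.
fn hf_hub_cache(hf_home: Option<&str>, home: &str) -> PathBuf {
    match hf_home {
        Some(root) => PathBuf::from(root).join("hub"),
        None => PathBuf::from(home)
            .join(".cache")
            .join("huggingface")
            .join("hub"),
    }
}

fn main() {
    let hf_home = env::var("HF_HOME").ok();
    let home = env::var("HOME").unwrap_or_default();
    println!("model cache: {}", hf_hub_cache(hf_home.as_deref(), &home).display());
}
```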
### Model presets
Model choice is a construction-time call — the ONNX files load into the recognizer, so switching models requires rebuilding the backend.
| `WAVEKAT_ASR_PRESET` | Constant | HF repo | Best for |
|---|---|---|---|
| `bilingual` (default) | `BILINGUAL_ZH_EN` | `csukuangfj/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20` | Mixed EN+ZH calls |
| `en` | `ZIPFORMER_EN` | `csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26` | English-only |
| `zh` | `PARAFORMER_ZH` | `csukuangfj/sherpa-onnx-streaming-paraformer-zh` | Mandarin-only (often beats bilingual on ZH WER) |
| `paraformer-zh-en` | `PARAFORMER_BILINGUAL_ZH_EN` | `csukuangfj/sherpa-onnx-streaming-paraformer-bilingual-zh-en` | ZH-leaning bilingual alternative |
## Examples
Two runnable examples ship behind --features sherpa-onnx. First run
auto-downloads the selected model.
```sh
# Transcribe a 16 kHz mono WAV file (example name assumed; check `examples/`)
cargo run --features sherpa-onnx --example transcribe_wav -- path/to/audio.wav

# Live mic transcription (Ctrl-C to stop)
cargo run --features sherpa-onnx --example transcribe_mic

# Pick a different model (default is `bilingual`)
WAVEKAT_ASR_PRESET=en cargo run --features sherpa-onnx --example transcribe_mic
```
## Feature flags
| Flag | Default | Description |
|---|---|---|
| `sherpa-onnx` | No | Local streaming Zipformer / Paraformer via sherpa-onnx; pulls in `hf-hub` for first-run model download |
## Building from source
Enabling sherpa-onnx pulls in sherpa-onnx-sys, which builds vendored
ONNX Runtime through CMake. You'll need:
- A C++ toolchain (`clang` or `gcc`) and `cmake` on PATH.
- Linux only — and only for the `transcribe_mic` example: ALSA dev headers (`libasound2-dev` on Debian/Ubuntu, `alsa-lib-devel` on Fedora). The library itself has no system audio dependency.
The first build of sherpa-onnx-sys is slow (5–10 min); subsequent
builds are cached by Cargo.
## Important notes
- **Sample rate.** The `StreamingAsr` trait accepts any `AudioFrame` sample rate; backends resample internally. The sherpa-onnx backend currently expects 16 kHz f32 input — 8 kHz telephony resampling lands in a follow-up (see `docs/03-sherpa-onnx-backend.md`).
- **Dual-channel routing.** `Channel::{Local, Remote}` is wired through the trait today; per-channel state isolation in sherpa-onnx is Phase 2.
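Until native 8 kHz support lands, a caller can upsample telephony audio before pushing. A naive linear-interpolation sketch (illustration only; a production pipeline should use a proper polyphase resampler such as the `rubato` crate):

```rust
/// Naive 8 kHz → 16 kHz upsample by linear interpolation.
/// Each input sample is followed by the midpoint to the next sample
/// (the last sample is simply repeated).
fn upsample_8k_to_16k(input: &[f32]) -> Vec<f32> {
    let mut out = Vec::with_capacity(input.len() * 2);
    for (i, &s) in input.iter().enumerate() {
        out.push(s);
        let next = input.get(i + 1).copied().unwrap_or(s);
        out.push((s + next) / 2.0);
    }
    out
}

fn main() {
    let telephony = vec![0.0_f32, 1.0, 0.0]; // 8 kHz samples
    let wideband = upsample_8k_to_16k(&telephony);
    println!("{wideband:?}"); // twice as many samples, now at 16 kHz
}
```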
## License
Licensed under Apache 2.0.
Copyright 2026 WaveKat.
## Acknowledgements
- sherpa-onnx — streaming ASR runtime by the k2-fsa team (Apache 2.0)
- Pretrained model checkpoints from the sherpa-onnx pretrained zoo on HuggingFace