# wavekat-asr

Streaming ASR trait surface for voice pipelines, intended to wrap one or more speech-to-text backends behind a common Rust API. It follows the same pattern as `wavekat-vad` and `wavekat-turn`.
> [!WARNING]
> Scaffold release. This crate ships the trait shape, a scripted-event mock
> backend so downstream consumers can wire integration tests against the
> contract, and one real backend (the local `sherpa-onnx` one). The trait may
> still iterate before further backends land. Pin to an exact patch version.
## What's included
| Item | Feature flag |
|---|---|
| `StreamingAsr` trait, `TranscriptEvent`, `Channel`, `AsrError` | always |
| `MockAsr` — scripted partials → final, paired with an `mpsc::Receiver` | `mock` |
| `SherpaOnnxAsr` — local streaming Zipformer (EN+ZH bilingual by default); auto-downloads the model from Hugging Face on first use | `sherpa-onnx` |
## Quick start
```rust
use wavekat_asr::{AudioFrame, Channel, StreamingAsr};
use wavekat_asr::MockAsr; // behind the `mock` feature

// MockAsr pairs the backend with an mpsc::Receiver for TranscriptEvents.
// Constructor arguments (the scripted events) are elided here, and exact
// signatures may shift while the crate is pre-1.0.
let (mut asr, rx) = MockAsr::new();
let samples = vec![0i16; 320]; // 20 ms of 16 kHz mono PCM
let frame = AudioFrame::new(samples, 16_000);
asr.push_audio(frame, Channel::default()).unwrap(); // Default assumed; pick the real call leg
asr.finish().unwrap();
for event in rx.try_iter() {
    println!("{event:?}");
}
```
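The loop above just debug-prints each event. In practice you would branch on the partial/final split that the mock scripts; the variant names below are illustrative assumptions, not a frozen API:

```rust
// Illustrative only: the mock scripts "partials → final", so some
// partial/final split exists, but these variant names are assumed.
match event {
    TranscriptEvent::Partial(text) => print!("\r{text}"), // overwrite the line in place
    TranscriptEvent::Final(text) => println!("\n{text}"), // commit the finished utterance
}
```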
## Examples

Two runnable examples ship behind `--features sherpa-onnx`. The first run
auto-downloads the selected model into `hf-hub`'s cache.
```sh
# Transcribe a 16 kHz mono WAV file
# (example names are illustrative; check examples/ for the actual binaries)
cargo run --features sherpa-onnx --example transcribe -- input.wav

# Live mic transcription (Ctrl-C to stop)
cargo run --features sherpa-onnx --example mic

# Pick a different model (default is `bilingual`)
WAVEKAT_ASR_PRESET=en cargo run --features sherpa-onnx --example mic
```
## Bundled model presets

Model choice is a construction-time call (the ONNX files load into the
recognizer), so switching models requires rebuilding the backend; see the
sketch after the table.
| `WAVEKAT_ASR_PRESET` | Constant | HF repo | Best for |
|---|---|---|---|
| `bilingual` (default) | `BILINGUAL_ZH_EN` | `csukuangfj/sherpa-onnx-streaming-zipformer-bilingual-zh-en-2023-02-20` | Mixed EN+ZH calls |
| `en` | `ZIPFORMER_EN` | `csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26` | English-only |
| `zh` | `PARAFORMER_ZH` | `csukuangfj/sherpa-onnx-streaming-paraformer-zh` | Mandarin-only (often beats bilingual on ZH WER) |
| `paraformer-zh-en` | `PARAFORMER_BILINGUAL_ZH_EN` | `csukuangfj/sherpa-onnx-streaming-paraformer-bilingual-zh-en` | ZH-leaning bilingual alternative |
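As a minimal construction sketch of the table above: this assumes the preset constants are exported at the crate root and that `SherpaOnnxAsr::new` takes a preset and hands back the backend paired with its event receiver; the real signature may differ while the crate is pre-1.0.

```rust
use wavekat_asr::{SherpaOnnxAsr, ZIPFORMER_EN};

// Hypothetical constructor shape: the ONNX files load here, so picking
// a different preset means constructing a new backend instance.
let (mut asr, rx) = SherpaOnnxAsr::new(ZIPFORMER_EN)?;
```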
## Architecture
The crate exposes one trait, `StreamingAsr`, and one event enum,
`TranscriptEvent`. The trait keeps the surface that consumers see as
small as possible; backends will own their own resampling, network
state, and tokenizer.
```text
AudioFrame ──▶ push_audio(frame, channel) ──▶ ┌───────────┐
                                              │  Backend  │
end of call ─▶ finish() ─────────────────────▶│           │
                                              │           │
              TranscriptEvent ◀───────────────│           │
              on Receiver                     └───────────┘
```
Why a sync push + receiver pair rather than `async fn`? The daemon that
will consume this (`wavekat-voice`) already runs an event loop and fans
events out over SSE; matching that shape avoids forcing a tokio runtime
through the trait. Backends that need their own runtime will spawn one
internally.
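To make that shape concrete, here is a minimal sketch of the surface the diagram describes. The trait and method names come from this README; the signatures, `Channel` variants, and `TranscriptEvent` payloads are assumptions that may shift before the first stable release:

```rust
/// Sketch only: field and variant details below are placeholders.
pub struct AudioFrame { /* PCM samples + sample rate */ }
pub enum Channel { Near, Far } // which call leg the audio belongs to (names assumed)
pub enum AsrError { /* backend-specific failures */ }

/// Partials stream in while audio arrives; a Final follows finish().
pub enum TranscriptEvent {
    Partial(String),
    Final(String),
}

/// The one trait the crate exposes. Backend constructors are expected to
/// hand back (impl StreamingAsr, mpsc::Receiver<TranscriptEvent>).
pub trait StreamingAsr {
    /// Push one frame of PCM for the given call leg. Sync by design, so an
    /// existing event loop can drive it without pulling in a tokio runtime.
    fn push_audio(&mut self, frame: AudioFrame, channel: Channel) -> Result<(), AsrError>;
    /// Signal end of call: flush pending audio and emit the final event.
    fn finish(&mut self) -> Result<(), AsrError>;
}
```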
## License

Apache-2.0. See `LICENSE`.