# Tauri Plugin STT (Speech-to-Text)
Cross-platform speech recognition plugin for Tauri 2.x. Desktop targets (Windows, macOS, Linux) use whisper.cpp via `whisper-rs`; mobile targets delegate to the native OS engines (`SFSpeechRecognizer` on iOS, `SpeechRecognizer` on Android).
## Highlights
- One model, 99 languages — Whisper is multilingual; users download a single GGML model and it works for English, Portuguese, Mandarin, …
- No native runtime to ship — `whisper-rs` builds whisper.cpp statically; there is no `libvosk.so`/`.dylib` to install separately.
- Explicit model lifecycle — the host app controls when (and whether) a model is downloaded. `start_listening` returns `ModelNotInstalled` instead of silently pulling hundreds of MB.
- Hardware acceleration — opt-in `metal`/`cuda`/`vulkan` features map straight to the matching whisper.cpp backend.
## Platform Matrix
| Platform | Engine | Model |
|---|---|---|
| iOS | SFSpeechRecognizer (Speech.framework) | OS |
| Android | SpeechRecognizer | OS |
| macOS | whisper.cpp via whisper-rs (Metal opt.) | GGML |
| Windows | whisper.cpp via whisper-rs (CUDA opt.) | GGML |
| Linux | whisper.cpp via whisper-rs (Vulkan opt.) | GGML |
## Installation
```toml
[dependencies]
tauri-plugin-stt = { version = "0.2", features = ["metal"] } # macOS
# or "cuda" / "vulkan" — omit for plain CPU inference
```
Register the plugin in your builder (e.g. `tauri::Builder::default().plugin(tauri_plugin_stt::init())`) and grant the four model-management commands — `list_models`, `install_model`, `remove_model`, `set_active_model` — in your app's capability file.
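A capability entry might look like the fragment below. The `stt:default` permission identifier follows the usual Tauri 2 plugin naming convention and is an assumption, not confirmed by this plugin's docs:

```json
{
  "identifier": "default",
  "windows": ["main"],
  "permissions": ["core:default", "stt:default"]
}
```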
## Model Catalogue
| id | display | size | tier |
|---|---|---|---|
| `tiny` | Tiny | 75 MB | fastest |
| `base` | Base | 142 MB | balanced ⭐ |
| `small` | Small | 466 MB | accurate |
| `medium` | Medium | 1.5 GB | very accurate |
| `large-v3` | Large v3 | 3.0 GB | most accurate |
Files are fetched from `https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-<id>.bin` and stored under `<app_data_dir>/whisper-models/`. The active selection is persisted to `whisper-models/active.txt`.
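The download URL is a pure function of the catalogue id, so a frontend can derive it directly; `modelUrl` below is a hypothetical helper, not part of the plugin API:

```typescript
// Build the Hugging Face download URL for a catalogue id.
// Hypothetical helper, not part of the plugin's API surface.
function modelUrl(id: string): string {
  return `https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-${id}.bin`;
}

console.log(modelUrl("base"));
// → https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin
```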
## Commands
- `list_models()` → `{ models, active, total_disk_bytes }`
- `install_model(id)` — downloads the model, emits `stt://download-progress`
- `remove_model(id)` — deletes the file and clears the active marker if needed
- `set_active_model(id)` — picks which installed model `start_listening` loads
- `start_listening({ language?, max_duration? })` — push-to-talk session
- `stop_listening()` — runs Whisper over the captured audio and emits one final result
- `is_available()` — reports `available: true` only when a model is installed
- `get_supported_languages()` — curated list of UI-facing locales
- `check_permission()` / `request_permission()` — microphone permission helpers
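A typical install-then-activate flow from the frontend might look like the sketch below. The `invoke` parameter stands in for `invoke` from `@tauri-apps/api/core` (injected here so the snippet stays self-contained), and the `plugin:stt|…` command strings and response shape are assumptions based on the command list above, not verified signatures:

```typescript
type InvokeFn = (cmd: string, args?: Record<string, unknown>) => Promise<unknown>;

interface ModelInfo { id: string; installed: boolean; }
interface ModelList { models: ModelInfo[]; active: string | null; }

// Hypothetical helper: make sure `id` is downloaded and active
// before the app calls start_listening.
async function ensureModel(invoke: InvokeFn, id: string): Promise<void> {
  const { models, active } = (await invoke("plugin:stt|list_models")) as ModelList;
  const entry = models.find((m) => m.id === id);
  if (!entry?.installed) {
    // Emits stt://download-progress while it runs.
    await invoke("plugin:stt|install_model", { id });
  }
  if (active !== id) {
    await invoke("plugin:stt|set_active_model", { id });
  }
}
```

In a real app, pass `invoke` from `@tauri-apps/api/core`; injecting it also makes the flow trivial to unit-test with a stub.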
## Events
- `stt://download-progress` — `{ status, modelId, model, progress, downloaded?, total? }`
- `stt://result` — `{ transcript, isFinal, confidence }`
- `stt://error` — `{ code, message }`
- `plugin:stt:result` — same payload as `stt://result` (legacy listener channel)
- `plugin:stt:stateChange` — `{ state, isAvailable, language }`
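Waiting for the single final result of a push-to-talk session might look like this sketch. `listen` stands in for `listen` from `@tauri-apps/api/event`, injected so the snippet is self-contained; the helper name is ours:

```typescript
interface SttResult { transcript: string; isFinal: boolean; confidence: number; }
type Handler<T> = (event: { payload: T }) => void;
type ListenFn = <T>(name: string, handler: Handler<T>) => Promise<() => void>;

// Hypothetical helper: resolve with the final transcript of one session.
async function nextFinalTranscript(listen: ListenFn): Promise<string> {
  let unlisten: (() => void) | undefined;
  const transcript = await new Promise<string>((resolve) => {
    listen<SttResult>("stt://result", (event) => {
      // The plugin emits exactly one final result per stop_listening call.
      if (event.payload.isFinal) resolve(event.payload.transcript);
    }).then((fn) => { unlisten = fn; });
  });
  unlisten?.();
  return transcript;
}
```

In a real app, pass `listen` from `@tauri-apps/api/event`.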
## Behaviour Notes
- Whisper is not a streaming recogniser. The plugin buffers audio while recording and runs a single pass on `stop_listening`; the UX is push-to-talk.
- Audio is captured at the device default rate, downmixed to mono, then decimated to 16 kHz with nearest-neighbour sampling. Whisper is robust enough that a high-quality resampler buys nothing measurable.
- Inference uses `min(available_parallelism(), 4)` threads — beyond that whisper.cpp shows diminishing returns and we want headroom for the UI.
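The downmix-and-decimate step above can be pictured with a small sketch. This is an illustrative TypeScript re-implementation, not the plugin's actual Rust capture code:

```typescript
// Interleaved multi-channel input -> mono (average channels)
// -> 16 kHz by nearest-neighbour sample picking.
function toMono16k(
  interleaved: Float32Array,
  channels: number,
  inputRate: number,
): Float32Array {
  const frames = Math.floor(interleaved.length / channels);
  const mono = new Float32Array(frames);
  for (let f = 0; f < frames; f++) {
    let sum = 0;
    for (let c = 0; c < channels; c++) sum += interleaved[f * channels + c];
    mono[f] = sum / channels; // downmix: per-frame channel average
  }
  const ratio = inputRate / 16000;
  const outLen = Math.floor(frames / ratio);
  const out = new Float32Array(outLen);
  for (let i = 0; i < outLen; i++) {
    // Nearest-neighbour: pick the closest source sample, no interpolation.
    out[i] = mono[Math.min(frames - 1, Math.round(i * ratio))];
  }
  return out;
}
```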
## Mobile
The mobile bridges expose the same JS API surface, but `list_models` returns an empty list and `install_model` / `remove_model` / `set_active_model` are no-ops: the OS engine has no downloadable-model concept. Use `is_available` to gate UI; on iOS / Android it reflects actual recognizer availability.
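Gating UI on availability works the same way on every platform; a minimal sketch, with `invoke` injected and the `plugin:stt|…` command string and response shape assumed rather than verified:

```typescript
type InvokeFn = (cmd: string, args?: Record<string, unknown>) => Promise<unknown>;

// Hypothetical guard: show the mic button only when recognition can run
// (model installed on desktop, recognizer available on mobile).
async function sttReady(invoke: InvokeFn): Promise<boolean> {
  const { available } = (await invoke("plugin:stt|is_available")) as { available: boolean };
  return available;
}
```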
## License
MIT.