# memo-stt
Plug-and-play, local speech-to-text for Rust. Add Whisper transcription to any application in a few lines — no API keys, no cloud calls, automatic GPU acceleration where available, and the model downloads itself on first use.
## Quick start

```toml
[dependencies]
memo-stt = "0.1"
```

```rust
use memo_stt::SttEngine;

// `samples`: 16-bit mono PCM audio captured at the given sample rate.
let mut engine = SttEngine::new_default(16_000)?;
engine.warmup()?;
let text = engine.transcribe(&samples)?;
println!("{text}");
```
On the first call, the default model (`ggml-small.en-q5_1.bin`, ~500 MB) is downloaded to your platform cache directory. Every subsequent run is fully offline.
## Why memo-stt

- Zero configuration. No API keys, no environment variables, no manual model setup.
- Local and private. Audio never leaves the machine.
- Automatic GPU acceleration. Metal on macOS via whisper.cpp; CUDA on Linux/Windows when available; clean CPU fallback otherwise.
- Simple, three-method API. `new_default` / `warmup` / `transcribe`.
- Cross-platform. macOS, Linux, Windows.
## Recommended model

Use `ggml-small.en-q5_1.bin` (the default). It is the best general-purpose choice for almost every use case: ~500 MB on disk, sub-second latency on modern hardware, and accuracy very close to the larger distil models for clean English speech.

You only need a different model if you have a specific reason:

| Model | Size | Typical latency (M1) | When to use |
|---|---|---|---|
| `ggml-small.en-q5_1` (default) | ~500 MB | 200–500 ms | Recommended. Best balance of speed, size, and accuracy. |
| `ggml-distil-large-v3-q5_1` | ~500 MB | 300–600 ms | Noisy audio, accents, harder transcripts. |
| `ggml-distil-large-v3-q8_0` | ~800 MB | 400–800 ms | Maximum accuracy, if you can pay extra latency and disk. |
Models live in your platform cache directory:

- macOS: `~/Library/Caches/memo-stt/models/`
- Linux: `~/.cache/memo-stt/models/`
- Windows: `%LOCALAPPDATA%\memo-stt\models\`
Pre-built models can be downloaded from the whisper.cpp model repository on Hugging Face.
## Quantization, briefly

- `Q5_1` — 5-bit quantization. Smaller, faster, very close to full accuracy for English. This is the recommended default.
- `Q8_0` — 8-bit quantization. Larger and slower, slight accuracy bump.

If you are not sure which to pick, pick `Q5_1`. The `small.en-q5_1` model is the sweet spot for nearly all real-time applications.
## Examples
### Basic transcription

```rust
use memo_stt::SttEngine;

// `samples`: 16-bit mono PCM audio (see "Audio format" below).
let mut engine = SttEngine::new_default(16_000)?;
engine.warmup()?;
let text = engine.transcribe(&samples)?;
```
### Custom model path

```rust
use memo_stt::SttEngine;

// The path below is illustrative; point at any whisper.cpp GGML model file.
let engine = SttEngine::new("models/ggml-distil-large-v3-q5_1.bin", 16_000)?;
```
### Custom vocabulary / context prompt

```rust
use memo_stt::SttEngine;

let mut engine = SttEngine::new_default(16_000)?;
// The prompt string is illustrative; seed it with your own domain vocabulary.
engine.set_prompt(Some("memo-stt, whisper.cpp, GGML, quantization"));
engine.warmup()?;
```
More examples live in the `examples/` directory.
## API reference

`SttEngine` — the main transcription engine.

| Method | Purpose |
|---|---|
| `SttEngine::new_default(sample_rate)` | Create with the default model (auto-downloaded). |
| `SttEngine::new(model_path, sample_rate)` | Create with a custom model file. |
| `engine.warmup()` | Pre-initialize GPU state to reduce first-call latency. |
| `engine.transcribe(&samples)` | Run inference on 16-bit mono PCM samples. |
| `engine.set_prompt(Some(text))` | Seed transcription with custom vocabulary. |
Full rustdoc is published at docs.rs/memo-stt.
## Audio format

- 16-bit signed PCM (`i16`)
- Mono
- Any sample rate (specified to `new` / `new_default`); resampled to 16 kHz internally
- Minimum length: roughly 1 second
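Many capture APIs hand you interleaved stereo `f32` samples, so a small downmix step is often needed before calling `transcribe`. A minimal sketch, assuming `-1.0..=1.0` input (the helper name `to_mono_i16` is ours, not part of memo-stt):

```rust
// Downmix interleaved stereo f32 samples (range -1.0..=1.0) into the
// mono i16 PCM that `transcribe` expects.
fn to_mono_i16(interleaved: &[f32]) -> Vec<i16> {
    interleaved
        .chunks_exact(2)
        .map(|frame| {
            let mono = (frame[0] + frame[1]) * 0.5; // average the two channels
            (mono.clamp(-1.0, 1.0) * i16::MAX as f32) as i16
        })
        .collect()
}

fn main() {
    // Two stereo frames: a mid-level frame and a full-scale negative one.
    let mono = to_mono_i16(&[0.5, 0.5, -1.0, -1.0]);
    assert_eq!(mono, vec![16383, -32767]);
}
```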
## Platform support

| Feature | macOS | Linux | Windows |
|---|---|---|---|
| Library / `SttEngine` | ✓ | ✓ | ✓ |
| GPU acceleration | Metal | CUDA (if installed) | CUDA (if installed) |
| Standalone binary (mic + hotkeys) | ✓ | ✓ | ✓ |
| Active-application context | ✓ | — | — |
## Requirements
- Rust 1.74 or newer
- ~500 MB of free disk space for the default model
- Internet connection for the one-time model download
Standalone binary (optional)
memo-stt also ships a CLI with hotkey-driven recording, microphone capture,
and BLE-device support. It is gated behind the binary feature so it does
not pull heavy dependencies into library consumers.
Then select an input source via the environment, for example:

```sh
INPUT_SOURCE=ble memo-stt
```
## CLI features

- Push-to-talk recording with a configurable hotkey (default: `Fn`)
- Hold-to-lock continuous recording (`Fn+Control`)
- Optional BLE audio input from `memo_`-prefixed devices
- Real-time 7-bar waveform output for desktop UI integration
- Active application + window title capture on macOS
- Structured JSON output for downstream tools
## CLI output

The CLI prints one JSON object per transcription to stdout.
## CLI environment variables

| Variable | Values | Description |
|---|---|---|
| `INPUT_SOURCE` | `system` (default), `ble`, `radio` | Audio input source. |
| `MEMO_AUDIO_LEVELS_INTERVAL_MS` | `0` (default) or a millisecond value | Throttle `AUDIO_LEVELS:` waveform lines. `0` emits every callback. |
## Desktop integration protocol

When embedded in a desktop app, the CLI writes a few well-known stdout lines:

- `AUDIO_LEVELS:<json array>` — 7 waveform values in `0..=1`
- `BLE_PRESS_ENTER` — emitted on BLE control `0x03` (second tap after stop)
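A consumer process can pick these lines out of the CLI's stdout with plain string handling. A sketch, assuming only the line shape described above (hand-rolled to stay dependency-free; a JSON parser works equally well):

```rust
// Parse an `AUDIO_LEVELS:` line into its waveform values, or None for
// any other line (e.g. `BLE_PRESS_ENTER` or regular JSON output).
fn parse_audio_levels(line: &str) -> Option<Vec<f32>> {
    let body = line.strip_prefix("AUDIO_LEVELS:")?.trim();
    let inner = body.strip_prefix('[')?.strip_suffix(']')?;
    inner.split(',').map(|v| v.trim().parse::<f32>().ok()).collect()
}

fn main() {
    let levels =
        parse_audio_levels("AUDIO_LEVELS:[0.0, 0.2, 0.6, 1.0, 0.6, 0.2, 0.0]").unwrap();
    assert_eq!(levels.len(), 7);
    assert!(parse_audio_levels("BLE_PRESS_ENTER").is_none());
}
```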
## Framework integration

`SttEngine` is `Send` and reusable across calls; create it once and reuse it.
### Tauri

```rust
use memo_stt::SttEngine;
```
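One possible shape for wiring the engine into Tauri is managed state; a sketch in which the wrapper type `SttState`, the command name `transcribe_cmd`, and the error-to-`String` conversion are all illustrative, not part of memo-stt:

```rust
use std::sync::Mutex;

use memo_stt::SttEngine;

// Illustrative wrapper: Tauri managed state is shared across threads, and
// `transcribe` takes `&mut self`, so the engine goes behind a Mutex.
struct SttState(Mutex<SttEngine>);

#[tauri::command]
fn transcribe_cmd(state: tauri::State<'_, SttState>, samples: Vec<i16>) -> Result<String, String> {
    let mut engine = state.0.lock().map_err(|e| e.to_string())?;
    engine.transcribe(&samples).map_err(|e| e.to_string())
}
```

Register the state once at startup (e.g. via `.manage(SttState(...))`) so every invocation reuses the same warmed-up engine.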
### egui / iced / any GUI framework

```rust
use memo_stt::SttEngine;

// Create the engine once in your app state and reuse it.
let mut engine = SttEngine::new_default(16_000)?;
engine.warmup()?;

// In your event/button handler:
let text = engine.transcribe(&samples)?;
```
## Contributing

Issues and pull requests are welcome at [github.com/oliverbhull/memo-stt](https://github.com/oliverbhull/memo-stt). Please run `cargo fmt`, `cargo clippy`, and `cargo test` before submitting.
## License

MIT — see LICENSE.
## Acknowledgments

- whisper-rs — Rust bindings for whisper.cpp
- whisper.cpp — Whisper inference in C/C++
- OpenAI Whisper — the original model