# memo-stt
Plug-and-play speech-to-text for Rust. Add local transcription to any app in a
few lines, with automatic GPU acceleration, zero configuration, and no
expensive cloud API calls.
[crates.io](https://crates.io/crates/memo-stt) ·
[docs.rs](https://docs.rs/memo-stt) ·
[CI](https://github.com/oliverbhull/memo-stt/actions/workflows/ci.yml) ·
[MIT license](https://opensource.org/licenses/MIT)
## Quick start
```toml
[dependencies]
memo-stt = "0.1"
```
```rust
use memo_stt::SttEngine;
let mut engine = SttEngine::new_default(16000)?;
engine.warmup()?;
// `audio_samples` is 16-bit mono PCM (`Vec<i16>`); see "Audio format" below.
let text = engine.transcribe(&audio_samples)?;
println!("Transcribed: {}", text);
```
The first call to `SttEngine::new_default` downloads the default model
(`ggml-small.en-q5_1.bin`, ~500 MB) to your platform cache directory. Every
subsequent run is fully offline.
## Why memo-stt
- **Zero configuration.** No API keys, no environment variables, no
manual model setup.
- **Local and private.** Audio never leaves the machine.
- **Automatic GPU acceleration.** Metal on macOS; CUDA on Linux/Windows when
available; clean CPU fallback otherwise.
- **Simple, three-method API.** `new_default` / `warmup` / `transcribe`.
- **Cross-platform.** macOS, Linux, Windows.
## Recommended model
> **Use `ggml-small.en-q5_1.bin` (the default).** It is the best general-purpose
> choice for almost every use case: ~500 MB on disk, sub-second latency on
> modern hardware, and accuracy that is very close to the larger distil models
> for clean English speech.
You only need a different model if you have a specific reason:
| Model | Disk size | Typical latency | When to use |
|---|---|---|---|
| `ggml-small.en-q5_1` *(default)* | ~500 MB | 200–500 ms | **Recommended.** Best balance of speed, size, and accuracy. |
| `ggml-distil-large-v3-q5_1` | ~500 MB | 300–600 ms | Noisy audio, accents, harder transcripts. |
| `ggml-distil-large-v3-q8_0` | ~800 MB | 400–800 ms | Maximum accuracy, if you can pay extra latency and disk. |
Models live in your platform cache directory:
- **macOS**: `~/Library/Caches/memo-stt/models/`
- **Linux**: `~/.cache/memo-stt/models/`
- **Windows**: `%LOCALAPPDATA%\memo-stt\models\`
Pre-built models can be downloaded from the
[model repository on Hugging Face](https://huggingface.co/ggerganov/whisper.cpp).
### Quantization, briefly
- **Q5_1** — 5-bit quantization. Smaller, faster, very close to full accuracy
for English. This is the recommended default.
- **Q8_0** — 8-bit quantization. Larger and slower, slight accuracy bump.
If you are not sure which to pick, pick **Q5_1**. The small.en-q5_1 model is
the sweet spot for nearly all real-time applications.
## Examples
### Basic transcription
```rust
use memo_stt::SttEngine;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut engine = SttEngine::new_default(16000)?;
    engine.warmup()?; // optional, reduces first-call latency

    let samples: Vec<i16> = vec![/* 16 kHz mono PCM */];
    let text = engine.transcribe(&samples)?;
    println!("{}", text);
    Ok(())
}
```
### Custom model path
```rust
use memo_stt::SttEngine;
let engine = SttEngine::new("models/ggml-small.en-q5_1.bin", 16000)?;
```
### Custom vocabulary / context prompt
```rust
use memo_stt::SttEngine;
let mut engine = SttEngine::new_default(16000)?;
engine.set_prompt(Some("Rust, cargo, crates.io, tokio".to_string()));
engine.warmup()?;
```
More examples live in the [`examples/`](examples/) directory.
## API reference
`SttEngine` — the main transcription engine.
| Method | Description |
|---|---|
| `SttEngine::new_default(sample_rate)` | Create with the default model (auto-downloaded). |
| `SttEngine::new(model_path, sample_rate)` | Create with a custom model file. |
| `engine.warmup()` | Pre-initialize GPU state to reduce first-call latency. |
| `engine.transcribe(&samples)` | Run inference on 16-bit mono PCM samples. |
| `engine.set_prompt(Some(text))` | Seed transcription with custom vocabulary. |
Full rustdoc is published at [docs.rs/memo-stt](https://docs.rs/memo-stt).
## Audio format
- 16-bit signed PCM (`i16`)
- Mono
- Any sample rate (pass it to `new` / `new_default`); input is resampled to
  16 kHz internally
- Minimum length: roughly 1 second
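Many audio capture backends deliver interleaved stereo `f32` samples, while `transcribe` expects mono `i16`. A minimal conversion sketch; the helper name is illustrative, not part of the crate's API:

```rust
/// Downmix interleaved stereo f32 samples (range -1.0..=1.0) into mono i16 PCM.
/// Illustrative helper, not part of memo-stt.
fn stereo_f32_to_mono_i16(interleaved: &[f32]) -> Vec<i16> {
    interleaved
        .chunks_exact(2) // one [left, right] frame at a time
        .map(|frame| {
            let mono = (frame[0] + frame[1]) / 2.0;
            // Clamp before scaling so out-of-range input cannot overflow.
            (mono.clamp(-1.0, 1.0) * i16::MAX as f32) as i16
        })
        .collect()
}

fn main() {
    let stereo = [0.5_f32, 0.5, -1.0, -1.0];
    let mono = stereo_f32_to_mono_i16(&stereo);
    println!("{:?}", mono); // → [16383, -32767]
}
```

The resulting `Vec<i16>` can be passed straight to `engine.transcribe(&mono)`.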
## Platform support
| Feature | macOS | Linux | Windows |
|---|---|---|---|
| Library / `SttEngine` | ✓ | ✓ | ✓ |
| GPU acceleration | Metal | CUDA (if installed) | CUDA (if installed) |
| Standalone binary (mic + hotkeys) | ✓ | ✓ | ✓ |
| Active-application context | ✓ | — | — |
## Requirements
- Rust **1.74** or newer
- ~500 MB of free disk space for the default model
- Internet connection for the one-time model download
## Standalone binary (optional)
memo-stt also ships a CLI with hotkey-driven recording, microphone capture,
and BLE-device support. It is gated behind the `binary` feature so it does
not pull heavy dependencies into library consumers.
```bash
cargo install memo-stt --features binary
```
Then:
```bash
memo-stt # default: system mic + Fn hotkey
memo-stt --hotkey Control # use a different trigger key
INPUT_SOURCE=ble memo-stt # use a paired BLE audio device
```
### CLI features
- Push-to-talk recording with a configurable hotkey (default: `Fn`)
- Hold-to-lock continuous recording (`Fn` + `Control`)
- Optional BLE audio input from `memo_`-prefixed devices
- Real-time 7-bar waveform output for desktop UI integration
- Active application + window title capture on macOS
- Structured JSON output for downstream tools
### CLI output
The CLI prints a JSON object per transcription:
```json
{
"rawTranscript": "Hello world",
"processedText": "Hello world",
"wasProcessedByLLM": false,
"appContext": {
"appName": "Terminal",
"windowTitle": "~/dev/memo-stt"
}
}
```
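A downstream tool can pull individual fields out of each JSON line. The sketch below keeps dependencies at zero with a naive string scan (it ignores escape sequences); a real consumer should use a proper JSON parser such as `serde_json`. The helper name is ours, not part of the CLI:

```rust
/// Extract the value of a top-level string field from a flat JSON object.
/// Naive sketch: does not handle escaped quotes or nested objects.
fn extract_string_field(json: &str, key: &str) -> Option<String> {
    let marker = format!("\"{}\":", key);
    let start = json.find(&marker)? + marker.len();
    let rest = json[start..].trim_start().strip_prefix('"')?;
    let end = rest.find('"')?;
    Some(rest[..end].to_string())
}

fn main() {
    let line = r#"{"rawTranscript": "Hello world", "wasProcessedByLLM": false}"#;
    let transcript = extract_string_field(line, "rawTranscript");
    println!("{:?}", transcript); // → Some("Hello world")
}
```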
### CLI environment variables
| Variable | Values | Description |
|---|---|---|
| `INPUT_SOURCE` | `system` (default), `ble`, `radio` | Audio input source. |
| `MEMO_AUDIO_LEVELS_INTERVAL_MS` | `0` (default) or ms | Throttle `AUDIO_LEVELS:` waveform lines. `0` emits every callback. |
### Desktop integration protocol
When embedded in a desktop app, the CLI writes a few well-known stdout lines:
- `AUDIO_LEVELS:<json array>` — 7 waveform values in `0..=1`
- `BLE_PRESS_ENTER` — emitted on BLE control `0x03` (second tap after stop)
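A host app reading the CLI's stdout can recognize these lines by prefix. A minimal sketch, assuming the waveform payload is a flat JSON array of numbers as described above; the helper is illustrative, not part of the CLI:

```rust
/// Parse an `AUDIO_LEVELS:` stdout line into its waveform values.
/// Assumes a flat array payload like `AUDIO_LEVELS:[0.0,0.1,...]`.
fn parse_audio_levels(line: &str) -> Option<Vec<f32>> {
    let body = line.strip_prefix("AUDIO_LEVELS:")?;
    let inner = body.trim().strip_prefix('[')?.strip_suffix(']')?;
    inner
        .split(',')
        .map(|v| v.trim().parse::<f32>().ok())
        .collect() // None if any element fails to parse
}

fn main() {
    let line = "AUDIO_LEVELS:[0.0,0.1,0.2,0.3,0.4,0.5,0.6]";
    let levels = parse_audio_levels(line).expect("valid waveform line");
    println!("{} bars", levels.len()); // → 7 bars
}
```

Lines that do not carry the prefix (e.g. `BLE_PRESS_ENTER`) simply return `None` and can be routed to other handlers.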
## Framework integration
`SttEngine` is `Send` and reusable across calls; create it once and reuse it.
### Tauri
```rust
use memo_stt::SttEngine;
#[tauri::command]
fn transcribe_audio(samples: Vec<i16>) -> Result<String, String> {
    // Created per call here for brevity; in a real app, keep one engine in
    // Tauri managed state and reuse it across commands.
    let mut engine = SttEngine::new_default(16000).map_err(|e| e.to_string())?;
    engine.transcribe(&samples).map_err(|e| e.to_string())
}
```
### egui / iced / any GUI framework
```rust
use memo_stt::SttEngine;
// Create the engine once in your app state and reuse it.
let mut engine = SttEngine::new_default(16000)?;
engine.warmup()?;
// In your event/button handler:
let text = engine.transcribe(&audio_samples)?;
```
## Contributing
Issues and pull requests are welcome at
[github.com/oliverbhull/memo-stt](https://github.com/oliverbhull/memo-stt).
Please run `cargo fmt`, `cargo clippy`, and `cargo test` before submitting.
## License
MIT — see [LICENSE](LICENSE).
## Acknowledgments
Built on open-source local speech-recognition runtimes and model tooling.