
memo-stt

Plug-and-play speech-to-text for Rust. Add local transcription to any app in a few lines, with automatic GPU acceleration, zero configuration, and no per-request API costs.

Quick start

Add the crate to your Cargo.toml:

[dependencies]
memo-stt = "0.1"

Then call the engine:

use memo_stt::SttEngine;

let mut engine = SttEngine::new_default(16000)?;
engine.warmup()?;
let text = engine.transcribe(&audio_samples)?;
println!("Transcribed: {}", text);

On the first call, the default model (ggml-small.en-q5_1.bin, ~500 MB) is downloaded to your platform cache directory. Every subsequent run is fully offline.

Why memo-stt

  • Zero configuration. No API keys, no environment variables, no manual model setup.
  • Local and private. Audio never leaves the machine.
  • Automatic GPU acceleration. Metal on macOS; CUDA on Linux/Windows when available; clean CPU fallback otherwise.
  • Simple API. The core workflow is just new_default / warmup / transcribe.
  • Cross-platform. macOS, Linux, Windows.

Recommended model

Use ggml-small.en-q5_1.bin (the default). It is the best general-purpose choice for almost every use case: ~500 MB on disk, sub-second latency on modern hardware, and accuracy that is very close to the larger distil models for clean English speech.

You only need a different model if you have a specific reason:

| Model | Size | Typical latency (M1) | When to use |
|---|---|---|---|
| ggml-small.en-q5_1 (default) | ~500 MB | 200–500 ms | Recommended. Best balance of speed, size, and accuracy. |
| ggml-distil-large-v3-q5_1 | ~500 MB | 300–600 ms | Noisy audio, accents, harder transcripts. |
| ggml-distil-large-v3-q8_0 | ~800 MB | 400–800 ms | Maximum accuracy, at the cost of extra latency and disk. |

Models live in your platform cache directory:

  • macOS: ~/Library/Caches/memo-stt/models/
  • Linux: ~/.cache/memo-stt/models/
  • Windows: %LOCALAPPDATA%\memo-stt\models\
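If you want to locate the downloaded model yourself (for example, to pre-seed it in CI), the cache path can be computed with std alone. `models_dir` below is a hypothetical helper for illustration, not part of the memo-stt API, and it does not handle the `XDG_CACHE_HOME` override on Linux:

```rust
use std::path::{Path, PathBuf};

/// Resolve the memo-stt model cache directory under a given home directory.
/// (Hypothetical helper; memo-stt resolves this internally.)
fn models_dir(home: &Path) -> PathBuf {
    if cfg!(target_os = "macos") {
        home.join("Library/Caches/memo-stt/models")
    } else if cfg!(target_os = "windows") {
        // %LOCALAPPDATA% is usually <home>\AppData\Local.
        home.join("AppData").join("Local").join("memo-stt").join("models")
    } else {
        // Linux and other Unix: XDG default location.
        home.join(".cache/memo-stt/models")
    }
}

fn main() {
    if let Some(home) = std::env::var_os("HOME") {
        println!("{}", models_dir(Path::new(&home)).display());
    }
}
```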

Pre-built models can be downloaded from the model repository on Hugging Face.

Quantization, briefly

  • Q5_1 — 5-bit quantization. Smaller, faster, very close to full accuracy for English. This is the recommended default.
  • Q8_0 — 8-bit quantization. Larger and slower, slight accuracy bump.

If you are not sure which to pick, pick Q5_1. The small.en-q5_1 model is the sweet spot for nearly all real-time applications.

Examples

Basic transcription

use memo_stt::SttEngine;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut engine = SttEngine::new_default(16000)?;
    engine.warmup()?; // optional, reduces first-call latency

    let samples: Vec<i16> = vec![/* 16 kHz mono PCM */];
    let text = engine.transcribe(&samples)?;
    println!("{}", text);
    Ok(())
}
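transcribe expects roughly a second of audio or more, so wiring tests are easier with synthetic samples before real microphone input is plumbed in. The tone generator below is illustrative, not part of the crate:

```rust
/// Generate `seconds` of a 440 Hz sine tone as 16-bit mono PCM at the
/// given sample rate -- a handy placeholder signal for integration tests.
fn sine_pcm(sample_rate: u32, seconds: f32) -> Vec<i16> {
    let n = (sample_rate as f32 * seconds) as usize;
    (0..n)
        .map(|i| {
            let t = i as f32 / sample_rate as f32;
            let s = (2.0 * std::f32::consts::PI * 440.0 * t).sin();
            // Scale to 30% of full range to leave headroom.
            (s * 0.3 * i16::MAX as f32) as i16
        })
        .collect()
}

fn main() {
    let samples = sine_pcm(16_000, 1.2); // > 1 s, as transcribe requires
    println!("{} samples", samples.len());
}
```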

Custom model path

use memo_stt::SttEngine;

let engine = SttEngine::new("models/ggml-small.en-q5_1.bin", 16000)?;

Custom vocabulary / context prompt

use memo_stt::SttEngine;

let mut engine = SttEngine::new_default(16000)?;
engine.set_prompt(Some("Rust, cargo, crates.io, tokio".to_string()));
engine.warmup()?;

More examples live in the examples/ directory.

API reference

SttEngine — the main transcription engine.

| Method | Purpose |
|---|---|
| SttEngine::new_default(sample_rate) | Create with the default model (auto-downloaded). |
| SttEngine::new(model_path, sample_rate) | Create with a custom model file. |
| engine.warmup() | Pre-initialize GPU state to reduce first-call latency. |
| engine.transcribe(&samples) | Run inference on 16-bit mono PCM samples. |
| engine.set_prompt(Some(text)) | Seed transcription with custom vocabulary. |

Full rustdoc is published at docs.rs/memo-stt.

Audio format

  • 16-bit signed PCM (i16)
  • Mono
  • Any sample rate (specified to new / new_default); resampled to 16 kHz internally
  • Minimum length: roughly 1 second
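Capture libraries such as cpal often deliver interleaved f32 frames, frequently in stereo; a small converter to the i16 mono layout described above (a sketch, not part of the crate's API):

```rust
/// Downmix interleaved stereo f32 samples in [-1.0, 1.0] to i16 mono.
fn stereo_f32_to_mono_i16(frames: &[f32]) -> Vec<i16> {
    frames
        .chunks_exact(2)
        .map(|lr| {
            let mixed = (lr[0] + lr[1]) * 0.5;    // average the two channels
            let clamped = mixed.clamp(-1.0, 1.0); // guard against clipping
            (clamped * i16::MAX as f32) as i16
        })
        .collect()
}

fn main() {
    let mono = stereo_f32_to_mono_i16(&[0.5, 0.5, -1.0, -1.0]);
    println!("{:?}", mono);
}
```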

Platform support

| Feature | macOS | Linux | Windows |
|---|---|---|---|
| Library / SttEngine | ✓ | ✓ | ✓ |
| GPU acceleration | Metal | CUDA (if installed) | CUDA (if installed) |
| Standalone binary (mic + hotkeys) | ✓ | | |
| Active-application context | ✓ | | |

Requirements

  • Rust 1.74 or newer
  • ~500 MB of free disk space for the default model
  • Internet connection for the one-time model download

Standalone binary (optional)

memo-stt also ships a CLI with hotkey-driven recording, microphone capture, and BLE-device support. It is gated behind the binary feature so it does not pull heavy dependencies into library consumers.

cargo install memo-stt --features binary

Then:

memo-stt                          # default: system mic + Fn hotkey
memo-stt --hotkey Control         # use a different trigger key
INPUT_SOURCE=ble memo-stt         # use a paired BLE audio device

CLI features

  • Push-to-talk recording with a configurable hotkey (default: Fn)
  • Hold-to-lock continuous recording (Fn + Control)
  • Optional BLE audio input from memo_-prefixed devices
  • Real-time 7-bar waveform output for desktop UI integration
  • Active application + window title capture on macOS
  • Structured JSON output for downstream tools

CLI output

The CLI prints a JSON object per transcription:

{
  "rawTranscript": "Hello world",
  "processedText": "Hello world",
  "wasProcessedByLLM": false,
  "appContext": {
    "appName": "Terminal",
    "windowTitle": "~/dev/memo-stt"
  }
}
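Downstream tools will usually parse this output with serde_json. For a dependency-free sketch, a naive extractor for top-level string fields is shown below; it does no escape handling and is illustrative only:

```rust
/// Extract a top-level string field from a flat JSON object.
/// Naive: assumes the value is a plain quoted string with no escapes.
/// Use serde_json for real parsing.
fn json_string_field<'a>(json: &'a str, key: &str) -> Option<&'a str> {
    let needle = format!("\"{}\"", key);
    let start = json.find(&needle)? + needle.len();
    let rest = &json[start..];
    let colon = rest.find(':')?;
    let rest = rest[colon + 1..].trim_start();
    let rest = rest.strip_prefix('"')?;
    let end = rest.find('"')?;
    Some(&rest[..end])
}

fn main() {
    let line = r#"{"rawTranscript":"Hello world","wasProcessedByLLM":false}"#;
    println!("{:?}", json_string_field(line, "rawTranscript"));
}
```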

CLI environment variables

| Variable | Values | Description |
|---|---|---|
| INPUT_SOURCE | system (default), ble, radio | Audio input source. |
| MEMO_AUDIO_LEVELS_INTERVAL_MS | 0 (default) or a millisecond interval | Throttles AUDIO_LEVELS: waveform lines; 0 emits on every audio callback. |

Desktop integration protocol

When embedded in a desktop app, the CLI writes a few well-known stdout lines:

  • AUDIO_LEVELS:<json array> — 7 waveform values in 0..=1
  • BLE_PRESS_ENTER — emitted on BLE control 0x03 (second tap after stop)
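A host app can watch the child process's stdout for these prefixes. Since the AUDIO_LEVELS payload is a flat float array, it can be parsed without a JSON library; a sketch under that assumption:

```rust
/// Parse an "AUDIO_LEVELS:[0.1,0.4,...]" stdout line into waveform values.
/// Returns None for lines that don't carry audio levels (or are malformed).
fn parse_audio_levels(line: &str) -> Option<Vec<f32>> {
    let payload = line.strip_prefix("AUDIO_LEVELS:")?;
    let inner = payload.trim().strip_prefix('[')?.strip_suffix(']')?;
    inner
        .split(',')
        .map(|v| v.trim().parse::<f32>().ok())
        .collect()
}

fn main() {
    let levels = parse_audio_levels("AUDIO_LEVELS:[0.0,0.2,0.9,1.0,0.5,0.1,0.0]");
    println!("{:?}", levels);
}
```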

Framework integration

SttEngine is Send and reusable across calls; create it once and reuse it.

Tauri

use std::sync::Mutex;

use memo_stt::SttEngine;

// Register the engine once at startup so each command reuses it:
// .manage(Mutex::new(SttEngine::new_default(16000)?))
#[tauri::command]
fn transcribe_audio(
    samples: Vec<i16>,
    engine: tauri::State<Mutex<SttEngine>>,
) -> Result<String, String> {
    let mut engine = engine.lock().map_err(|e| e.to_string())?;
    engine.transcribe(&samples).map_err(|e| e.to_string())
}

egui / iced / any GUI framework

use memo_stt::SttEngine;

// Create the engine once in your app state and reuse it.
let mut engine = SttEngine::new_default(16000)?;
engine.warmup()?;

// In your event/button handler:
let text = engine.transcribe(&audio_samples)?;
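To keep the UI responsive, move transcription onto a worker thread and communicate over channels. In this sketch the engine call is replaced by a placeholder stub so the example stays self-contained; a real app would create SttEngine::new_default(16_000) inside the worker and call engine.transcribe there:

```rust
use std::sync::mpsc;
use std::thread;

enum Job {
    Transcribe(Vec<i16>),
    Shutdown,
}

/// Stand-in for engine.transcribe(&samples) so the sketch needs no model.
fn transcribe_stub(samples: &[i16]) -> String {
    format!("{} samples", samples.len())
}

fn main() {
    let (job_tx, job_rx) = mpsc::channel::<Job>();
    let (text_tx, text_rx) = mpsc::channel::<String>();

    // The worker owns the engine for its whole lifetime.
    let worker = thread::spawn(move || {
        while let Ok(job) = job_rx.recv() {
            match job {
                Job::Transcribe(samples) => {
                    let text = transcribe_stub(&samples);
                    let _ = text_tx.send(text);
                }
                Job::Shutdown => break,
            }
        }
    });

    // UI side: send audio, then poll text_rx without blocking the frame loop.
    job_tx.send(Job::Transcribe(vec![0; 16_000])).unwrap();
    println!("{}", text_rx.recv().unwrap());
    job_tx.send(Job::Shutdown).unwrap();
    worker.join().unwrap();
}
```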

Contributing

Issues and pull requests are welcome at github.com/oliverbhull/memo-stt. Please run cargo fmt, cargo clippy, and cargo test before submitting.

License

MIT — see LICENSE.

Acknowledgments

Built on open-source local speech-recognition runtimes and model tooling.