Scribble

Scribble is a fast, lightweight transcription engine written in Rust, with a built-in Whisper backend and a backend trait for custom implementations.

Scribble demuxes and decodes audio or video containers (MP4, MP3, WAV, FLAC, OGG, WebM, MKV, etc.), downmixes to mono, and resamples to 16 kHz, so no preprocessing is required.
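To make the normalization step concrete, here is a minimal sketch of downmixing interleaved samples to mono and resampling to a target rate. It is not Scribble's implementation: the function names `downmix_to_mono` and `resample_linear` are hypothetical, and linear interpolation stands in for whatever resampler the engine actually uses (a production resampler would low-pass filter before decimating).

```rust
/// Hypothetical helper: average each interleaved frame into one mono sample.
/// Trailing samples that do not form a full frame are ignored.
fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Hypothetical helper: resample mono audio using linear interpolation.
/// This sketch skips the anti-aliasing filter a real resampler would apply.
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == 0 || to_hz == 0 {
        return Vec::new();
    }
    let out_len = (input.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    let step = from_hz as f64 / to_hz as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            // Position of output sample `i` on the input timeline.
            let pos = i as f64 * step;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            let a = input[idx.min(last)];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}
```

For example, one second of 48 kHz stereo would pass through `downmix_to_mono(&samples, 2)` and then `resample_linear(&mono, 48_000, 16_000)`, yielding 16,000 mono samples.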
Goals
- Provide a clean, idiomatic Rust API for audio transcription
- Support multiple output formats (JSON, VTT, plain text, etc.)
- Work equally well as a CLI tool or embedded service
- Be streaming-first: designed to support incremental, chunk-based transcription pipelines (live audio, long-running streams, and low-latency workflows)
- Enable composable pipelines: VAD → transcription → encoding, with clear extension points for streaming and real-time use cases
- Keep the core simple, explicit, and easy to extend
Scribble is built with streaming and real-time transcription in mind, even when operating on static files today.
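As an illustration of the VAD → transcription → encoding shape described in the goals, the sketch below wires three small traits into a chunk-based loop. Every name here (`Vad`, `Backend`, `Encoder`, `Segment`, `run_pipeline`, and the toy implementations) is hypothetical and is not Scribble's API; the energy threshold is a stand-in for a real VAD model, and the stub backend returns a placeholder instead of running Whisper.

```rust
// All names in this sketch are hypothetical; they are not Scribble's API.
const SAMPLE_RATE: u64 = 16_000;

struct Segment {
    start_ms: u64,
    end_ms: u64,
    text: String,
}

trait Vad {
    fn is_speech(&self, chunk: &[f32]) -> bool;
}

trait Backend {
    fn transcribe(&mut self, chunk: &[f32], offset_ms: u64) -> Vec<Segment>;
}

trait Encoder {
    fn encode(&mut self, segment: &Segment) -> String;
}

/// Feed fixed-size chunks through VAD, then the backend, then the encoder.
fn run_pipeline(
    samples: &[f32],
    chunk_len: usize,
    vad: &dyn Vad,
    backend: &mut dyn Backend,
    encoder: &mut dyn Encoder,
) -> Vec<String> {
    assert!(chunk_len > 0, "chunk length must be non-zero");
    let mut out = Vec::new();
    for (i, chunk) in samples.chunks(chunk_len).enumerate() {
        // Non-speech chunks are dropped before they reach the backend.
        if !vad.is_speech(chunk) {
            continue;
        }
        let offset_ms = (i * chunk_len) as u64 * 1_000 / SAMPLE_RATE;
        for segment in backend.transcribe(chunk, offset_ms) {
            out.push(encoder.encode(&segment));
        }
    }
    out
}

/// Toy VAD: mean absolute amplitude above a threshold counts as speech.
struct EnergyVad {
    threshold: f32,
}

impl Vad for EnergyVad {
    fn is_speech(&self, chunk: &[f32]) -> bool {
        if chunk.is_empty() {
            return false;
        }
        let mean_abs = chunk.iter().map(|s| s.abs()).sum::<f32>() / chunk.len() as f32;
        mean_abs > self.threshold
    }
}

/// Stub backend: emits one placeholder segment per chunk instead of real text.
struct StubBackend;

impl Backend for StubBackend {
    fn transcribe(&mut self, chunk: &[f32], offset_ms: u64) -> Vec<Segment> {
        let duration_ms = chunk.len() as u64 * 1_000 / SAMPLE_RATE;
        vec![Segment {
            start_ms: offset_ms,
            end_ms: offset_ms + duration_ms,
            text: "<speech>".to_string(),
        }]
    }
}

/// Toy encoder: one plain-text line per segment.
struct PlainText;

impl Encoder for PlainText {
    fn encode(&mut self, segment: &Segment) -> String {
        format!("[{}-{} ms] {}", segment.start_ms, segment.end_ms, segment.text)
    }
}
```

Keeping each stage behind its own trait means a stage can be swapped (a different VAD, a custom backend, another output format) without touching the loop, which is the kind of extension point the goals describe.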
Installation
Clone the repository and build the binaries:
This will produce the following binaries:
- scribble-cli: transcribe audio/video (decodes and normalizes to mono 16 kHz)
- model-downloader: download Whisper and VAD models
model-downloader
model-downloader is a small helper CLI for downloading known-good Whisper and Whisper-VAD models into a local directory.
List available models
Example output:
Whisper models:
- tiny
- base.en
- large-v3-turbo
- large-v3-turbo-q8_0
...
VAD models:
- silero-v5.1.2
- silero-v6.2.0
Download a model
By default, models are downloaded into ./models.
Download into a custom directory
Downloads are performed safely:
- written to *.part
- fsynced
- atomically renamed into place
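The three steps above can be sketched with the Rust standard library alone. This is an illustration of the pattern, not the downloader's actual code: `write_atomically` is a hypothetical name, and the bytes are passed in as a slice here, whereas a real downloader would stream them from the network.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper: write `bytes` to `dest` so that readers never see a
/// partially written file at the final path.
fn write_atomically(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    // Stage next to the destination so the rename stays on one filesystem.
    let mut part_name = dest.as_os_str().to_owned();
    part_name.push(".part");
    let part = PathBuf::from(part_name);

    let mut file = File::create(&part)?;
    file.write_all(bytes)?;
    // Flush file contents and metadata to disk before publishing the file.
    file.sync_all()?;
    drop(file);

    // Move the finished file into place under its final name.
    fs::rename(&part, dest)
}
```

If the process dies mid-download, only the `.part` file is left behind, so an interrupted download cannot be mistaken for a complete model.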
scribble-cli
scribble-cli is the main transcription CLI.
It accepts audio or video containers and normalizes them to Whisper’s required mono 16 kHz internally. Provide:
- an input media path (e.g. MP4, MP3, WAV, FLAC, OGG, WebM, MKV) or - to stream from stdin
- a Whisper model
- a Whisper-VAD model (used when --enable-vad is set)
Basic transcription (VTT output)
Output is written to stdout in WebVTT format by default.
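For readers unfamiliar with WebVTT, the following sketch shows how timed segments map onto the format: a `WEBVTT` header line, then one cue per segment with `HH:MM:SS.mmm --> HH:MM:SS.mmm` timing followed by the cue text. The `Segment` type and both function names are hypothetical and are not taken from Scribble's encoder.

```rust
/// Hypothetical segment type: times in milliseconds plus the transcribed text.
struct Segment {
    start_ms: u64,
    end_ms: u64,
    text: String,
}

/// Format milliseconds as a WebVTT timestamp (HH:MM:SS.mmm).
fn vtt_timestamp(ms: u64) -> String {
    let hours = ms / 3_600_000;
    let minutes = ms / 60_000 % 60;
    let seconds = ms / 1_000 % 60;
    let millis = ms % 1_000;
    format!("{hours:02}:{minutes:02}:{seconds:02}.{millis:03}")
}

/// Render segments as a WebVTT document, with a blank line before each cue.
fn to_webvtt(segments: &[Segment]) -> String {
    let mut out = String::from("WEBVTT\n");
    for segment in segments {
        out.push('\n');
        out.push_str(&format!(
            "{} --> {}\n{}\n",
            vtt_timestamp(segment.start_ms),
            vtt_timestamp(segment.end_ms),
            segment.text
        ));
    }
    out
}
```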
JSON output
Enable voice activity detection (VAD)
When VAD is enabled:
- non-speech regions are suppressed
- if no speech is detected, no output is produced
Specify language explicitly
If --language is omitted, Whisper will auto-detect.
Write output to a file
Library usage
Scribble is also designed to be embedded as a library.
High-level usage looks like:
use scribble::{…};
use std::fs::File;

let mut scribble = Scribble::new(…)?;
let mut input = File::open(…)?;
let mut output = Vec::new();
let opts = Opts { … };
scribble.transcribe(…)?;
let json = String::from_utf8(output)?;
println!("{json}");
TODOs
- Expand testing (goal of 80%+ test coverage)
- Update VAD to utilize streaming approach
- Implement the webserver
- Streaming / incremental transcription support
Status
Scribble is under active development. The API is not yet stable, but the foundations are in place and evolving quickly.
License
MIT