Scribble

Scribble is a fast, lightweight transcription engine written in Rust, with a built-in Whisper backend and a backend trait for custom implementations.

Scribble demuxes and decodes audio or video containers (MP4, MP3, WAV, FLAC, OGG, WebM, MKV, etc.), downmixes to mono, and resamples to 16 kHz, so no preprocessing is required.
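To make the normalization step concrete, here is a minimal sketch of downmixing interleaved samples to mono and resampling to 16 kHz. It is an illustration only: the function names are hypothetical and not part of Scribble's API, and linear interpolation is a simple stand-in technique with no anti-alias filtering, so a real pipeline would likely use a proper resampler.

```rust
/// Average interleaved multi-channel frames down to a single mono channel.
/// (Hypothetical helper, not part of Scribble's API.)
fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Resample mono audio with linear interpolation.
/// Deliberately simple: no low-pass filtering, so downsampling can alias.
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == to_hz {
        return input.to_vec();
    }
    let out_len = (input.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    let step = from_hz as f64 / to_hz as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            // Position of output sample `i` on the input timeline.
            let pos = i as f64 * step;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            let a = input[idx.min(last)];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}

fn main() {
    // One second of 48 kHz stereo test data (interleaved L/R).
    let stereo: Vec<f32> = (0..48_000 * 2)
        .map(|i| if i % 2 == 0 { 0.5 } else { -0.5 })
        .collect();
    let mono = downmix_to_mono(&stereo, 2);
    let resampled = resample_linear(&mono, 48_000, 16_000);
    println!("{} mono samples -> {} samples at 16 kHz", mono.len(), resampled.len());
}
```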

Goals

Provide a clean, idiomatic Rust API for audio transcription
Support multiple output formats (JSON, VTT, plain text, etc.)
Work equally well as a CLI tool or embedded service
Be streaming-first: designed to support incremental, chunk-based transcription pipelines (live audio, long-running streams, and low-latency workflows)
Enable composable pipelines: VAD → transcription → encoding, with clear extension points for streaming and real-time use cases
Keep the core simple, explicit, and easy to extend

Scribble is built with streaming and real-time transcription in mind, even when operating on static files today.
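The chunk-based VAD → transcription → encoding pipeline named in the goals could be shaped roughly as follows. This is a sketch under stated assumptions, not Scribble's actual design: the Vad, Transcriber, and Encoder traits, the Segment type, and run_pipeline are all hypothetical names, and the energy threshold and echo transcriber are placeholders standing in for Silero VAD and Whisper.

```rust
/// A transcribed span of audio. Hypothetical type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    start_ms: u64,
    end_ms: u64,
    text: String,
}

/// Stage 1 (hypothetical): decide whether a chunk contains speech.
trait Vad {
    fn is_speech(&mut self, chunk: &[f32]) -> bool;
}

/// Stage 2 (hypothetical): turn a chunk of mono samples into text.
trait Transcriber {
    fn transcribe(&mut self, chunk: &[f32]) -> String;
}

/// Stage 3 (hypothetical): serialize a segment into an output format.
trait Encoder {
    fn encode(&mut self, segment: &Segment) -> String;
}

/// Feed chunks through the three stages, one chunk at a time.
fn run_pipeline<I>(
    chunks: I,
    sample_rate: u64,
    vad: &mut dyn Vad,
    stt: &mut dyn Transcriber,
    enc: &mut dyn Encoder,
) -> Vec<String>
where
    I: IntoIterator<Item = Vec<f32>>,
{
    let mut out = Vec::new();
    let mut cursor = 0u64; // samples consumed so far
    for chunk in chunks {
        let start_ms = cursor * 1000 / sample_rate;
        cursor += chunk.len() as u64;
        let end_ms = cursor * 1000 / sample_rate;
        if !vad.is_speech(&chunk) {
            continue; // non-speech chunks produce no output
        }
        let segment = Segment { start_ms, end_ms, text: stt.transcribe(&chunk) };
        out.push(enc.encode(&segment));
    }
    out
}

// Placeholder stages so the sketch runs without any models.
struct EnergyVad { threshold: f32 }
impl Vad for EnergyVad {
    fn is_speech(&mut self, chunk: &[f32]) -> bool {
        if chunk.is_empty() {
            return false;
        }
        let mean_abs = chunk.iter().map(|s| s.abs()).sum::<f32>() / chunk.len() as f32;
        mean_abs > self.threshold
    }
}

struct EchoTranscriber;
impl Transcriber for EchoTranscriber {
    fn transcribe(&mut self, chunk: &[f32]) -> String {
        format!("<{} samples>", chunk.len())
    }
}

struct PlainEncoder;
impl Encoder for PlainEncoder {
    fn encode(&mut self, s: &Segment) -> String {
        format!("[{}-{} ms] {}", s.start_ms, s.end_ms, s.text)
    }
}

fn main() {
    // One silent second followed by one "loud" second at 16 kHz.
    let chunks: Vec<Vec<f32>> = vec![vec![0.0; 16_000], vec![0.5; 16_000]];
    let lines = run_pipeline(
        chunks,
        16_000,
        &mut EnergyVad { threshold: 0.01 },
        &mut EchoTranscriber,
        &mut PlainEncoder,
    );
    for line in lines {
        println!("{line}");
    }
}
```

Keeping each stage behind its own trait is one way to get the extension points the goals describe: a stage can be swapped without touching the loop that drives the chunks.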

Installation

Clone the repository and build the binaries:

cargo build --release

This will produce the following binaries:

scribble-cli — transcribe audio/video (decodes + normalizes to mono 16 kHz)
model-downloader — download Whisper and VAD models

model-downloader

model-downloader is a small helper CLI for downloading known-good Whisper and Whisper-VAD models into a local directory.

List available models

cargo run --bin model-downloader -- --list

Example output:

Whisper models:
  - tiny
  - base.en
  - large-v3-turbo
  - large-v3-turbo-q8_0
  ...

VAD models:
  - silero-v5.1.2
  - silero-v6.2.0

Download a model

cargo run --bin model-downloader -- --name large-v3-turbo

By default, models are downloaded into ./models.

Download into a custom directory

cargo run --bin model-downloader -- \
  --name silero-v6.2.0 \
  --dir /opt/scribble/models

Downloads are performed safely:

written to *.part
fsynced
atomically renamed into place
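The three steps above can be sketched with the standard library alone. This is a minimal illustration, not model-downloader's actual code: write_atomically is a hypothetical helper, and an in-memory byte slice stands in for the HTTP response body.

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::{Path, PathBuf};

/// Stream `source` into `dest` via a temporary `*.part` sibling file.
/// (Hypothetical helper; the real downloader may differ.)
fn write_atomically(mut source: impl Read, dest: &Path) -> io::Result<u64> {
    // Step 1: write to "<dest>.part" so a partial file never sits at `dest`.
    let mut part = dest.as_os_str().to_owned();
    part.push(".part");
    let part = PathBuf::from(part);

    let mut file = File::create(&part)?;
    let bytes = io::copy(&mut source, &mut file)?;

    // Step 2: fsync so the data is on disk before the rename.
    file.sync_all()?;
    drop(file);

    // Step 3: rename into place. On POSIX systems this is atomic when both
    // paths are on the same filesystem.
    fs::rename(&part, dest)?;
    Ok(bytes)
}

fn main() -> io::Result<()> {
    let dest = std::env::temp_dir().join("scribble-atomic-demo.bin");
    let written = write_atomically(&b"fake model bytes"[..], &dest)?;
    println!("wrote {written} bytes to {}", dest.display());
    fs::remove_file(&dest)?;
    Ok(())
}
```

With this ordering, an interrupted download leaves only a stray .part file behind, so readers never see a truncated model at the final path.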

scribble-cli

scribble-cli is the main transcription CLI.

It accepts audio or video containers and normalizes them internally to the mono 16 kHz audio that Whisper requires. Provide:

an input media path (e.g. MP4, MP3, WAV, FLAC, OGG, WebM, MKV) or - to stream from stdin
a Whisper model
a Whisper-VAD model (used when --enable-vad is set)

Basic transcription (VTT output)

cargo run --bin scribble-cli -- \
  --model ./models/ggml-large-v3-turbo.bin \
  --vad-model ./models/ggml-silero-v6.2.0.bin \
  --input ./input.mp4

Output is written to stdout in WebVTT format by default.
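For readers unfamiliar with WebVTT, the sketch below shows the general shape of the format: a WEBVTT header followed by cues with hh:mm:ss.mmm timestamps. The helper names and the tuple-based cue representation are assumptions made for illustration; Scribble's exact cue layout is not documented here and may differ.

```rust
/// Format a millisecond offset as a WebVTT timestamp (hh:mm:ss.mmm).
fn vtt_timestamp(ms: u64) -> String {
    let h = ms / 3_600_000;
    let m = ms / 60_000 % 60;
    let s = ms / 1_000 % 60;
    let millis = ms % 1_000;
    format!("{h:02}:{m:02}:{s:02}.{millis:03}")
}

/// Render (start_ms, end_ms, text) cues as a WebVTT document.
/// (Hypothetical helper, not part of Scribble's API.)
fn to_webvtt(cues: &[(u64, u64, &str)]) -> String {
    let mut out = String::from("WEBVTT\n");
    for (start, end, text) in cues {
        // Each cue is separated from the previous block by a blank line.
        out.push_str(&format!(
            "\n{} --> {}\n{}\n",
            vtt_timestamp(*start),
            vtt_timestamp(*end),
            text
        ));
    }
    out
}

fn main() {
    let cues = [(0, 1_500, "Hello there."), (1_500, 4_250, "This is a test.")];
    print!("{}", to_webvtt(&cues));
}
```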

JSON output

cargo run --bin scribble-cli -- \
  --model ./models/ggml-large-v3-turbo.bin \
  --vad-model ./models/ggml-silero-v6.2.0.bin \
  --input ./input.wav \
  --output-type json

Enable voice activity detection (VAD)

cargo run --bin scribble-cli -- \
  --model ./models/ggml-large-v3-turbo.bin \
  --vad-model ./models/ggml-silero-v6.2.0.bin \
  --enable-vad \
  --input ./input.wav

When VAD is enabled:

non-speech regions are suppressed
if no speech is detected, no output is produced

Specify language explicitly

cargo run --bin scribble-cli -- \
  --model ./models/ggml-large-v3-turbo.bin \
  --vad-model ./models/ggml-silero-v6.2.0.bin \
  --input ./input.wav \
  --language en

If --language is omitted, Whisper auto-detects the language.

Write output to a file

cargo run --bin scribble-cli -- \
  --model ./models/ggml-large-v3-turbo.bin \
  --vad-model ./models/ggml-silero-v6.2.0.bin \
  --input ./input.wav \
  --output-type vtt \
  > transcript.vtt

Library usage

Scribble is also designed to be embedded as a library.

High-level usage looks like:

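use scribble::{opts::Opts, output_type::OutputType, scribble::Scribble};
use std::error::Error;
use std::fs::File;

fn main() -> Result<(), Box<dyn Error>> {
    // Load the Whisper model and the VAD model from disk.
    let mut scribble = Scribble::new(
        "./models/ggml-large-v3-turbo.bin",
        "./models/ggml-silero-v6.2.0.bin",
    )?;

    // Input media file and an in-memory buffer to receive the transcript.
    let mut input = File::open("audio.wav")?;
    let mut output = Vec::new();

    let opts = Opts {
        enable_translate_to_english: false,
        enable_voice_activity_detection: true,
        language: None, // None lets Whisper auto-detect the language
        output_type: OutputType::Json,
        incremental_min_window_seconds: 1,
    };

    // Decode, normalize, and transcribe; the encoded result lands in `output`.
    scribble.transcribe(&mut input, &mut output, &opts)?;

    let json = String::from_utf8(output)?;
    println!("{json}");
    Ok(())
}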

TODOs

Expand testing (goal: 80%+ test coverage)
Update VAD to use a streaming approach
Implement the web server
Add streaming / incremental transcription support

Status

Scribble is under active development. The API is not yet stable, but the foundations are in place and evolving quickly.

License

MIT

scribble 0.2.0