vtt-rs
A Rust library and command-line utility for real-time audio transcription using OpenAI-compatible APIs. Perfect for adding situational awareness to AI agents through speech recognition.
Documentation
- API Documentation - Full API reference (auto-generated from code)
- Integration Guide - Comprehensive guide for AI agent integration
Or build locally:
Configuration
- Set OPENAI_API_KEY in your environment before running anything.
- The binary accepts an optional JSON configuration file (default vtt.config.json in the current directory, or pass an alternate path as the first argument).
- Supported keys (all optional; sensible defaults exist):
  - chunk_duration_secs: duration of each captured audio block that is transcribed.
  - model: which OpenAI transcription model to use.
  - endpoint: custom transcription endpoint, e.g. for a proxy service.
  - out_file: path to which every transcription (chunk ID + contents) is appended.
  - on_device: optional block that turns on the bundled Candle Whisper runner.
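For orientation, a configuration file using the keys listed above might look like the following fragment. The values are illustrative assumptions, not the tool's documented defaults: the 5-second chunk length and the transcripts.log path are made up for the example, whisper-1 and the endpoint URL are OpenAI's hosted transcription model and route, and only the enabled field of on_device is shown because it is the only one named in this README.

```json
{
  "chunk_duration_secs": 5,
  "model": "whisper-1",
  "endpoint": "https://api.openai.com/v1/audio/transcriptions",
  "out_file": "transcripts.log",
  "on_device": {
    "enabled": false
  }
}
```

Since every key is optional, a file containing only the keys you want to override should be sufficient.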
On-Device Whisper
Set on_device.enabled to true in your config to run Whisper locally without
calling the OpenAI API. You can pick from the built-in model shortcuts
("tiny", "small", etc.), force CPU execution, and optionally select a
specific input device.
Usage as a Library
Add vtt-rs to your Cargo.toml:
[dependencies]
vtt-rs = { git = "https://github.com/geoffsee/vtt-rs" }
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
Basic Example
use ;
async
AI Agent Integration
The library is designed to give AI agents "ears" - the ability to perceive and respond to their audio environment. Check out the examples:
- examples/ai_agent.rs - Basic AI agent with audio awareness
- examples/streaming_agent.rs - Advanced agent with temporal context
Run examples with:
OPENAI_API_KEY=sk-...
Usage as a CLI
OPENAI_API_KEY=sk-...
- Omit the CLI argument to let the tool load vtt.config.json from the current directory if it exists; otherwise it runs with defaults.
- Transcripts are printed live and, when out_file is set, appended to that file in addition to the console output.
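The config lookup rule described above (explicit argument first, then vtt.config.json if present, otherwise defaults) can be sketched with the standard library alone. This is a minimal illustration of the rule, not the binary's actual code; resolve_config_path and DEFAULT_CONFIG are hypothetical names.

```rust
use std::path::{Path, PathBuf};

/// Default config file name mentioned in this README.
const DEFAULT_CONFIG: &str = "vtt.config.json";

/// Hypothetical helper: decide which config file to load, if any.
/// `None` means "run with built-in defaults".
/// `exists` is injected so the rule can be exercised without touching the disk.
fn resolve_config_path(
    cli_arg: Option<&str>,
    exists: impl Fn(&Path) -> bool,
) -> Option<PathBuf> {
    match cli_arg {
        // An explicit first argument always wins.
        Some(path) => Some(PathBuf::from(path)),
        // Otherwise use the default file only if it is present.
        None => {
            let default = PathBuf::from(DEFAULT_CONFIG);
            if exists(default.as_path()) {
                Some(default)
            } else {
                None
            }
        }
    }
}

fn main() {
    // First CLI argument after the program name, if any.
    let arg = std::env::args().nth(1);
    match resolve_config_path(arg.as_deref(), |p| p.exists()) {
        Some(path) => println!("loading config from {}", path.display()),
        None => println!("no config file found, using defaults"),
    }
}
```

Passing the existence check as a closure keeps the decision logic separate from filesystem access, which makes the three cases easy to check in isolation.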
Features
- Real-time transcription: Continuously captures and transcribes audio
- Event-driven API: React to transcriptions as they happen
- Configurable chunking: Adjust audio chunk duration for your needs
- OpenAI compatible: Works with OpenAI Whisper and compatible APIs
- Async/await: Built on Tokio for efficient async processing
- Type-safe: Strongly typed events and configuration
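To make the event-driven and type-safe points concrete, here is a minimal sketch of consuming typed transcription events from a channel. It deliberately uses only the standard library; the TranscriptionEvent enum, its fields, and the helper functions are hypothetical stand-ins and are not vtt-rs's actual API, which is built on Tokio.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical event type; the library's real event enum may differ.
#[derive(Debug, Clone, PartialEq)]
enum TranscriptionEvent {
    /// Text recognized for one audio chunk.
    Transcription { chunk_id: usize, text: String },
    /// A chunk that failed to transcribe.
    Error { chunk_id: usize, message: String },
}

/// Format one event as a single line (chunk ID followed by contents).
fn format_event(event: &TranscriptionEvent) -> String {
    match event {
        TranscriptionEvent::Transcription { chunk_id, text } => {
            format!("[chunk {chunk_id}] {text}")
        }
        TranscriptionEvent::Error { chunk_id, message } => {
            format!("[chunk {chunk_id}] error: {message}")
        }
    }
}

/// Drain events until every sender is dropped, returning the formatted lines.
fn collect_lines(rx: mpsc::Receiver<TranscriptionEvent>) -> Vec<String> {
    rx.iter().map(|event| format_event(&event)).collect()
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Stand-in producer: a real service would capture audio and call the API.
    let producer = thread::spawn(move || {
        tx.send(TranscriptionEvent::Transcription {
            chunk_id: 0,
            text: "hello".to_string(),
        })
        .unwrap();
        tx.send(TranscriptionEvent::Error {
            chunk_id: 1,
            message: "timeout".to_string(),
        })
        .unwrap();
        // `tx` is dropped here, which ends the consumer's iteration.
    });

    for line in collect_lines(rx) {
        println!("{line}");
    }
    producer.join().unwrap();
}
```

Because the events form an enum, the compiler forces the consumer to handle both the success and the error case, which is the practical benefit of strongly typed events.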