vtt-rs

A Rust library and command-line utility for real-time audio transcription using OpenAI-compatible APIs. It is intended for giving AI agents situational awareness through speech recognition.

Documentation

- API Documentation - Full API reference (auto-generated from code)
- Integration Guide - Comprehensive guide for AI agent integration

Or build locally:

```sh
cargo doc --no-deps --open
```

Configuration

If your endpoint requires authentication, set OPENAI_API_KEY in your environment. For local OpenAI-compatible servers that don't require auth, this can be omitted.
The binary reads an optional JSON configuration file. By default it looks for vtt.config.json in the current directory; pass an alternate path as the first argument to override this.
Supported keys (all optional; sensible defaults exist):
```json
{
  "chunk_duration_secs": 5,
  "model": "whisper-1",
  "endpoint": "https://api.openai.com/v1/audio/transcriptions",
  "out_file": "transcripts.log",
  "on_device": {
    "enabled": false,
    "model": "tiny.en",
    "cpu": true
  }
}
```
- chunk_duration_secs: duration, in seconds, of each captured audio chunk that is sent for transcription.
- model: name of the transcription model to request from the endpoint.
- endpoint: URL of the transcription endpoint, e.g. a proxy service or a local server.
- out_file: path of a file to which every transcription (chunk ID and text) is appended.
- on_device: optional block that enables the bundled Candle Whisper runner.
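
The keys above could be mirrored in a plain Rust struct. The following is only a sketch under stated assumptions: `ConfigSketch` and `OnDeviceSketch` are hypothetical names (not the crate's `Config` type), the sample values shown above are assumed to double as defaults, and `out_file` is assumed to be unset unless configured.

```rust
/// Hypothetical mirror of the `on_device` block; not the crate's actual type.
#[derive(Debug, Clone, PartialEq)]
pub struct OnDeviceSketch {
    pub enabled: bool,
    pub model: String,
    pub cpu: bool,
}

/// Hypothetical mirror of the top-level configuration keys.
#[derive(Debug, Clone, PartialEq)]
pub struct ConfigSketch {
    pub chunk_duration_secs: u64,
    pub model: String,
    pub endpoint: String,
    pub out_file: Option<String>,
    pub on_device: Option<OnDeviceSketch>,
}

impl Default for ConfigSketch {
    /// Assumption: the sample values from the README are the defaults.
    fn default() -> Self {
        Self {
            chunk_duration_secs: 5,
            model: "whisper-1".to_string(),
            endpoint: "https://api.openai.com/v1/audio/transcriptions".to_string(),
            out_file: None,
            on_device: None,
        }
    }
}

impl ConfigSketch {
    /// Number of mono samples in one chunk at the given sample rate.
    pub fn samples_per_chunk(&self, sample_rate_hz: u32) -> u64 {
        self.chunk_duration_secs * u64::from(sample_rate_hz)
    }

    /// Basic sanity checks a loader might apply before starting capture.
    pub fn validate(&self) -> Result<(), String> {
        if self.chunk_duration_secs == 0 {
            return Err("chunk_duration_secs must be greater than zero".to_string());
        }
        let is_http = self.endpoint.starts_with("http://")
            || self.endpoint.starts_with("https://");
        if !is_http {
            return Err(format!("endpoint is not an HTTP(S) URL: {}", self.endpoint));
        }
        Ok(())
    }
}
```

For instance, `ConfigSketch::default().samples_per_chunk(16_000)` gives the chunk size in samples for 16 kHz mono audio; the sample rate actually used by the capture pipeline is not stated here and is an assumption.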

On-Device Whisper

Set on_device.enabled to true in your config to run Whisper locally without calling the OpenAI API. You can pick from the built-in model shortcuts ("tiny", "small", etc.), force CPU execution, and optionally select a specific input device.
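
To illustrate how the on_device.enabled switch might route transcription, here is a minimal sketch. `Backend`, `OnDeviceOptions`, and `choose_backend` are hypothetical names introduced for illustration; the crate's internal dispatch may look different.

```rust
/// Hypothetical backend choice; the crate's internal types may differ.
#[derive(Debug, PartialEq)]
pub enum Backend {
    /// Run Whisper locally instead of calling a remote API.
    OnDevice { model: String, force_cpu: bool },
    /// Send audio to an OpenAI-compatible HTTP endpoint.
    Remote { endpoint: String, model: String },
}

/// Hypothetical view of the `on_device` config block.
pub struct OnDeviceOptions<'a> {
    pub enabled: bool,
    pub model: &'a str,
    pub cpu: bool,
}

/// Picks the local runner only when the block is present and enabled;
/// otherwise falls back to the remote endpoint.
pub fn choose_backend(
    on_device: Option<&OnDeviceOptions<'_>>,
    endpoint: &str,
    remote_model: &str,
) -> Backend {
    match on_device {
        Some(opts) if opts.enabled => Backend::OnDevice {
            model: opts.model.to_string(),
            force_cpu: opts.cpu,
        },
        _ => Backend::Remote {
            endpoint: endpoint.to_string(),
            model: remote_model.to_string(),
        },
    }
}
```

With this shape, a missing on_device block and a block with enabled set to false behave identically: both select the remote endpoint.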

Local MLX Parakeet (no API key)

You can use a local OpenAI-compatible server that serves the MLX model mlx-community/parakeet-tdt-0.6b-v2. Point endpoint to your server and set model accordingly. No OPENAI_API_KEY is required when the server does not enforce auth.

Example config snippet:

```json
{
  "chunk_duration_secs": 3,
  "model": "mlx-community/parakeet-tdt-0.6b-v2",
  "endpoint": "http://localhost:8000/v1/audio/transcriptions",
  "out_file": "transcripts.log"
}
```

Then run the CLI without setting OPENAI_API_KEY:

```sh
cargo run -- vtt.config.json
```

Notes:

- Ensure your local server implements an OpenAI-compatible audio transcription endpoint and understands the model identifier.
- On-device mode in this repo currently supports Whisper via Candle. Parakeet support is provided via the remote endpoint path as shown above.
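
The optional API key can be sketched as follows. OpenAI-style endpoints take the key as an `Authorization: Bearer <key>` header; how vtt-rs builds its requests internally is not shown here, so `bearer_header` and `bearer_header_from_env` are hypothetical helpers, not part of the crate.

```rust
/// Returns the value for an `Authorization` header, or `None` when no
/// usable key is configured (for example, a local server without auth).
pub fn bearer_header(api_key: Option<&str>) -> Option<String> {
    let key = api_key?.trim();
    if key.is_empty() {
        None
    } else {
        Some(format!("Bearer {}", key))
    }
}

/// Reads OPENAI_API_KEY from the environment; an unset variable is not an error.
pub fn bearer_header_from_env() -> Option<String> {
    bearer_header(std::env::var("OPENAI_API_KEY").ok().as_deref())
}
```

A caller would attach the header only when the function returns `Some`, so the same code path serves both authenticated and unauthenticated endpoints.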

Usage as a Library

Add vtt-rs to your Cargo.toml:

```toml
[dependencies]
vtt-rs = { git = "https://github.com/geoffsee/vtt-rs" }
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
```

Basic Example

```rust
use vtt_rs::{Config, TranscriptionEvent, TranscriptionService};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let config = Config::default();
    // This example requires the key to be set; it exits with an error otherwise.
    let api_key = std::env::var("OPENAI_API_KEY")?;

    let mut service = TranscriptionService::new(config, api_key)?;
    // Keep `_stream` alive for as long as events should keep arriving.
    let (mut receiver, _stream) = service.start().await?;

    // Process transcription events
    while let Some(event) = receiver.recv().await {
        match event {
            TranscriptionEvent::Transcription { text, .. } => {
                println!("Heard: {}", text);
                // Feed this to your AI agent for situational awareness
            }
            TranscriptionEvent::Error { error, .. } => {
                eprintln!("Error: {}", error);
            }
        }
    }

    Ok(())
}
```

AI Agent Integration

The library is designed to give AI agents "ears": the ability to perceive and respond to their audio environment. The repository includes two examples:

- examples/ai_agent.rs - Basic AI agent with audio awareness
- examples/streaming_agent.rs - Advanced agent with temporal context

Run examples with:

```sh
OPENAI_API_KEY=sk-... cargo run --example ai_agent
```
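
One way an agent might keep temporal context is a bounded buffer of recent utterances. The sketch below is not taken from the example files: `Event` is a local stand-in for the crate's event type so the block compiles on its own, `RollingContext` is a hypothetical helper, and the `usize` chunk ID type is an assumption.

```rust
use std::collections::VecDeque;

/// Local stand-in for the crate's event type (field types are assumed).
#[derive(Debug, Clone)]
pub enum Event {
    Transcription { chunk_id: usize, text: String },
    Error { chunk_id: usize, error: String },
}

/// Keeps the most recent `capacity` utterances for use as agent context.
pub struct RollingContext {
    capacity: usize,
    utterances: VecDeque<(usize, String)>,
}

impl RollingContext {
    /// A capacity of zero is raised to one so the buffer is never unusable.
    pub fn new(capacity: usize) -> Self {
        let capacity = capacity.max(1);
        Self { capacity, utterances: VecDeque::with_capacity(capacity) }
    }

    /// Records transcriptions; errors and blank text are skipped.
    pub fn observe(&mut self, event: Event) {
        if let Event::Transcription { chunk_id, text } = event {
            let text = text.trim();
            if text.is_empty() {
                return;
            }
            // Evict the oldest utterance once the buffer is full.
            if self.utterances.len() == self.capacity {
                self.utterances.pop_front();
            }
            self.utterances.push_back((chunk_id, text.to_string()));
        }
    }

    /// Renders the buffer oldest-first, one utterance per line.
    pub fn as_prompt(&self) -> String {
        self.utterances
            .iter()
            .map(|(id, text)| format!("[{}] {}", id, text))
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

In an event loop, each received event would be passed to `observe`, and `as_prompt` would be called whenever the agent needs a summary of what was recently heard.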

Usage as a CLI

```sh
# With OpenAI or any endpoint requiring auth
OPENAI_API_KEY=sk-... cargo run -- vtt.config.json

# With a local server that does not require auth
cargo run -- vtt.config.json
```

Omit the CLI argument to let the tool load vtt.config.json from the current directory if it exists, otherwise it runs with defaults.
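
The lookup order described above (explicit argument, then vtt.config.json if present, then built-in defaults) can be sketched like this. `resolve_config_path` and `resolve_from_process_args` are hypothetical helpers, not the binary's actual code; the existence check is passed in as a closure so the logic can be tested without touching the filesystem.

```rust
use std::path::{Path, PathBuf};

pub const DEFAULT_CONFIG: &str = "vtt.config.json";

/// Returns the config file to load, or `None` to run with built-in defaults.
/// `cli_arg` is the first positional argument, if any.
pub fn resolve_config_path<F>(cli_arg: Option<&str>, exists: F) -> Option<PathBuf>
where
    F: Fn(&Path) -> bool,
{
    match cli_arg {
        // An explicit path always wins, even if the file turns out to be missing.
        Some(arg) => Some(PathBuf::from(arg)),
        None => {
            let default = PathBuf::from(DEFAULT_CONFIG);
            if exists(&default) {
                Some(default)
            } else {
                None
            }
        }
    }
}

/// Convenience wrapper using the real process arguments and filesystem.
pub fn resolve_from_process_args() -> Option<PathBuf> {
    let arg = std::env::args().nth(1);
    resolve_config_path(arg.as_deref(), |p| p.exists())
}
```

How the real binary reacts to an explicit path that does not exist is not specified here; this sketch simply returns the path and leaves the error to the loader.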
Transcripts are printed live and, when out_file is set, appended to that file in addition to the console output.
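
Appending to out_file might look like the following sketch. The exact line layout is not specified beyond "chunk ID plus contents", so the `[id] text` format, the `usize` chunk ID, and the helper names `format_log_line` and `append_transcript` are all assumptions.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical line layout: chunk ID in brackets, then the trimmed text.
pub fn format_log_line(chunk_id: usize, text: &str) -> String {
    format!("[{}] {}", chunk_id, text.trim())
}

/// Appends one transcript line, creating the file on first use.
pub fn append_transcript(path: &Path, chunk_id: usize, text: &str) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{}", format_log_line(chunk_id, text))
}
```

Opening the file in append mode on every call keeps the sketch simple; a long-running service would more likely hold the handle open and wrap it in a buffered writer.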

Features

- Real-time transcription: Continuously captures and transcribes audio
- Event-driven API: React to transcriptions as they happen
- Configurable chunking: Adjust audio chunk duration for your needs
- OpenAI compatible: Works with OpenAI Whisper and compatible APIs
- Async/await: Built on Tokio for efficient async processing
- Type-safe: Strongly typed events and configuration

vtt-rs 0.1.3