voxtty
The power of whisper — your voice commands.
voxtty turns your voice into action. Dictate, code, command, and converse — switch modes hands-free, choose your backend (local Whisper or cloud AI), and type anywhere. Private by default, limitless by design.
Built in Rust. Runs offline. Your voice never leaves unless you say so.
🚀 Quick Start
Install
# Option 1: One-line installer (downloads pre-built binary or builds from source)
# Option 2: Cargo (requires Rust toolchain + system deps)
# Option 3: Build from source
Setup
# 1. Setup ydotool (required for typing into applications)
# add to ~/.bashrc
# 2. Start Speaches backend (Docker)
# 3. Test your microphone
# 4. Start voice typing
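A sketch of the four steps with assumed commands (the Speaches Docker image tag and the socket path are assumptions; the voxtty flags are documented later in this README):

```shell
# 1. ydotool socket (add the export to ~/.bashrc)
export YDOTOOL_SOCKET="$HOME/.ydotool_socket"

# 2. Start the Speaches backend (image name is an assumption)
docker run -d -p 8000:8000 ghcr.io/speaches-ai/speaches:latest-cpu

# 3. Test your microphone
voxtty --echo-test

# 4. Start voice typing with the tray icon
voxtty --speaches --tray
```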
That's it! Click the tray icon to toggle voice typing on/off.
Optional: AI Assistant Mode
# 1. Select an AI model (interactive)
# Choose from: OpenAI, Anthropic, Google, DeepSeek, Ollama (local/free), OpenRouter
# 2. Use assistant mode with wake words OR tray menu
# Voice commands: Say "hey assistant" for writing help, "code mode" for code
# GUI: Click tray icon → Select mode (Dictation/Assistant/Code)
# Note: Privacy warnings shown automatically when using cloud AI (OpenAI, etc.)
📖 For detailed configuration, model selection, and troubleshooting, see the sections below in this README.
🎯 Inspiration & Evolution
voxtty was inspired by themanyone/voice_typing, a brilliant bash-based voice typing solution. While the original project demonstrated the power of offline voice typing, voxtty takes it further by:
- Rewritten in Rust - Memory-safe, fast, and reliable
- Speaches Backend - Uses Speaches AI for a more extensible and maintainable transcription backend
- Better Performance - ~2 seconds latency on i7 CPU (no GPU required) with basic model
- Enhanced UX - System tray integration for seamless control
- Production Ready - Proper error handling, device selection, and configuration options
Why Speaches?
Speaches provides a superior backend option compared to direct whisper.cpp integration:
- OpenAI-Compatible API - Standard REST interface for easy integration
- Docker Support - Run in containers with consistent environments
- CPU Optimized - Excellent performance even without GPU (~2s latency on i7)
- Model Flexibility - Easy switching between different Whisper model sizes
- Network Ready - Can run locally or on a dedicated transcription server
- Better Extensibility - Clean API makes it easy to add features and improvements
Backend Comparison
| Feature | whisper.cpp | Speaches | Realtime (WebSocket) |
|---|---|---|---|
| Setup | Manual build | Docker one-liner | API key or self-hosted |
| Latency | ~3-4s | ~2s | ~150ms |
| Privacy | 100% offline | 100% offline | Depends on provider |
| Providers | Local only | Local only | Speaches, ElevenLabs, OpenAI |
| Best For | Minimal setup | Production use | Lowest latency |
Realtime Providers:
- Speaches - Self-hosted, free, ~150ms latency, 🔒 100% LOCAL (privacy-preserving)
- ElevenLabs - Cloud, excellent accuracy, requires API key, ☁️ CLOUD (sends audio to third-party)
- OpenAI - Cloud, GPT-4o transcription, requires API key, ☁️ CLOUD (sends audio to third-party)
✨ Features
🔒 Privacy First
- 100% Offline Processing - All transcription happens locally using Whisper AI
- No Cloud Services - Your voice never leaves your machine (for dictation mode)
- Privacy Warnings - Automatic alerts when using cloud AI services (Assistant/Code modes)
- No Data Collection - Zero telemetry, zero tracking
- Self-Hosted - Run Speaches backend in Docker on your own hardware
- Network Isolated - Works completely offline, no internet required
- Local AI Option - Use Ollama for 100% offline Assistant/Code modes
⚡ Realtime Streaming Mode
- ~150ms Latency - WebSocket-based streaming for near-instant transcription
- Multiple Providers - ElevenLabs, OpenAI Realtime, or Speaches WebSocket
- Auto-Reconnection - Automatically reconnects if connection drops
- Connection Status - Tray tooltip shows [Disconnected] when offline
🎤 Smart Voice Detection
- Voice Activity Detection (VAD) - Automatically detects when you start and stop speaking
- WebRTC VAD Engine - Industry-standard voice detection with low false positives
- Amplitude Threshold - Dual detection system for reliable speech capture
- Configurable Silence Detection - Customizable pause duration before transcription
⌨️ System-Wide Integration
- Universal Typing - Works in any application via ydotool
- No GUI Required - Runs in TTY, X11, Wayland, or any Linux environment
- Instant Text Insertion - Transcribed text appears directly where you're typing
🎛️ Flexible Control
- System Tray Icon - Quick toggle on/off with visual status indicator (click to enable/disable)
- GUI Mode Switching - Switch between Dictation/Assistant/Code modes from the tray menu (when --assistant is enabled)
- Voice Commands - Wake words for hands-free mode switching ("hey assistant", "code mode", "dictation mode")
- Audio Feedback - Notification sounds for pause/resume/mode changes
- Multiple Backends - Support for whisper.cpp, Speaches, OpenAI, or ElevenLabs
- Interactive Device Selection - Choose your preferred microphone
- Always Available - Runs in background, ready when you need it
🔧 Developer Friendly
- Echo Test Mode - Built-in --echo-test CLI flag to verify audio input with instant playback
- Debug Mode - Detailed logging for troubleshooting with the --debug flag
- Flexible Configuration - Environment variables and CLI flags for easy customization
- Backend Agnostic - Switch backends with a single flag, no code changes
- Clean Rust Codebase - Modern, safe, and maintainable
📋 Requirements
Core Dependencies
- Whisper AI Backend (choose one):
- whisper.cpp - Lightweight C++ server
- Speaches - Modern API server (recommended)
- Audio: ALSA (libasound2-dev)
- System Integration: ydotool (for typing), DBus (for system tray)
- Build Tools: Rust 1.70+, cargo, pkg-config
Optional: AI Assistant Mode
- LLM Provider (choose one): OpenAI, Anthropic, Google, DeepSeek, Ollama (local, free), or OpenRouter
Runtime Dependencies
- ydotool - System-wide input simulation
- alsa-utils - Audio utilities
- pulseaudio or pipewire-pulse - Audio server (recommended)
🚀 Installation
Choose Your Installation Method
| Method | Best For | Requirements |
|---|---|---|
| curl \| bash | Quick install, end users | curl or wget |
| cargo install | Rust users | Rust toolchain + system deps |
| Build from source | Development, contributing | Rust toolchain + system deps |
| Debian Package | Custom packaging | debhelper, cargo |
Option 1: One-line Installer (Recommended)
Downloads a pre-built binary from GitHub Releases, or builds from source as a fallback.
Installs to ~/.local/bin by default. Override with INSTALL_DIR:
INSTALL_DIR=/usr/local/bin
Option 2: Cargo Install
# Requires: pkg-config libasound2-dev libatk1.0-dev libgtk-3-dev
Option 3: Build from Source
Build Dependencies (Ubuntu/Debian)
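A sketch of the from-source build; the dependency list mirrors the one noted under the Cargo option, and the repository URL comes from the Links section:

```shell
REPO_URL="https://github.com/jflaflamme/voxtty.git"

# Build dependencies (Ubuntu/Debian)
sudo apt-get install -y pkg-config libasound2-dev libatk1.0-dev libgtk-3-dev

# Clone and build
git clone "$REPO_URL"
cd voxtty
cargo build --release

# Put the binary on your PATH
install -Dm755 target/release/voxtty "$HOME/.local/bin/voxtty"
```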
Systemd User Service (Auto-start on Login)
For automatic startup with realtime transcription:
# 1. Install binary
# 2. Create environment file with API keys
# 3. Create systemd service
# 4. Enable and start
# 5. Check status
Service management:
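A sketch of steps 2 through 5 plus service management; the unit name, env-file location, variable names, and ExecStart flags are all assumptions to adapt:

```shell
# 2. Environment file with API keys (variable name is an assumption)
mkdir -p ~/.config/voxtty
cat > ~/.config/voxtty/voxtty.env <<'EOF'
ELEVENLABS_API_KEY=your-key-here
EOF

# 3. Systemd user unit
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/voxtty.service <<'EOF'
[Unit]
Description=voxtty voice typing

[Service]
EnvironmentFile=%h/.config/voxtty/voxtty.env
ExecStart=%h/.local/bin/voxtty --realtime --speaches --tray
Restart=on-failure

[Install]
WantedBy=default.target
EOF

# 4. Enable and start
systemctl --user daemon-reload
systemctl --user enable --now voxtty.service

# 5. Check status / manage the service
systemctl --user status voxtty.service --no-pager
journalctl --user -u voxtty.service --no-pager
```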
Optional: Desktop Entry (App Menu)
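A minimal desktop entry sketch (the icon name and flags are assumptions):

```shell
mkdir -p ~/.local/share/applications
cat > ~/.local/share/applications/voxtty.desktop <<'EOF'
[Desktop Entry]
Type=Application
Name=voxtty
Comment=Voice typing with Whisper
Exec=voxtty --speaches --tray
Icon=audio-input-microphone
Terminal=false
Categories=Utility;Accessibility;
EOF
```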
Option 4: Build Your Own Debian Package
For: Creating custom packages or contributing
Prerequisites
# Install build dependencies
Build Process
The Debian package is built using standard Debian packaging tools:
# Method 1: Use the build script (recommended)
# Method 2: Manual build (same as the script)
# The package will be created in the parent directory
What Gets Built
The build process creates:
- Binary package: voxtty_0.1.0-1_amd64.deb - the main installable package
- Debug symbols: voxtty-dbgsym_0.1.0-1_amd64.deb - debug symbols (optional)
- Build artifacts: .buildinfo and .changes files - build metadata
Package Contents
/usr/bin/voxtty # Main executable
/usr/share/doc/voxtty/ # Documentation
/usr/share/man/man1/ # Man pages (if included)
Install Your Custom Package
# Install the package
# If dependencies are missing, fix them
# Verify installation
Debian Packaging Files
The package is configured via files in the debian/ directory:
- debian/control - Package metadata, dependencies, and description
- debian/rules - Build instructions (uses dh with Cargo)
- debian/changelog - Version history and release notes
- debian/install - Files to include in the package
- debian/compat - Debhelper compatibility level
Customizing the Package
To modify the package:
- Change version: edit debian/changelog
- Add dependencies: edit debian/control (e.g. Depends: ${shlibs:Depends}, ${misc:Depends}, your-new-dependency)
- Modify build: edit debian/rules
- Rebuild the package
Troubleshooting Package Build
Build fails with missing dependencies?
Clean build artifacts?
Test package without installing?
⚙️ Setup
1. Install Whisper Backend
Option A: whisper.cpp (lightweight local option)
# Clone and build whisper.cpp
# Download a model (tiny.en is fastest, small.en is more accurate)
# Start the server
Option B: Speaches API (Recommended - Better Performance & Extensibility)
Speaches provides superior performance and flexibility compared to whisper.cpp. On an i7 CPU (no GPU), expect ~2 second latency with the basic model.
# Quick start with Docker (CPU-only)
# Or with docker-compose
# IMPORTANT: Initial Speaches Configuration
# After starting Speaches for the first time, you must configure it:
# 1. Set the base URL for Speaches API
# 2. Check available models in the registry
# 3. Download and activate your chosen model (first-time setup)
# 4. Configure voxtty to use Speaches
# 5. Test the connection (requires a test audio file)
Initial Setup Notes:
- The /v1/registry endpoint lists all available Whisper models
- A POST to /v1/models/{model_name} downloads and activates a model
- Model download happens once; subsequent starts use the cached model
- Choose a model size based on your needs: tiny.en (fastest) → small.en (balanced) → medium.en (accurate)
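The configuration steps above, sketched as curl calls against the documented endpoints (exact request shapes are assumptions; the model name is voxtty's default from its config):

```shell
# 1. Base URL for the Speaches API
SPEACHES_URL="http://localhost:8000"

# 2. Check available models in the registry
curl -s "$SPEACHES_URL/v1/registry"

# 3. Download and activate a model (one-time setup)
curl -s -X POST "$SPEACHES_URL/v1/models/Systran/faster-distil-whisper-small.en"

# 5. Test the connection with a sample audio file
curl -s "$SPEACHES_URL/v1/audio/transcriptions" \
  -F "file=@test.wav" \
  -F "model=Systran/faster-distil-whisper-small.en"

# 4. Then point voxtty at Speaches
voxtty --speaches
```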
Performance Notes:
- CPU-only (i7): ~2 seconds latency with basic model
- GPU-enabled: Sub-second latency possible
- Model sizes: tiny.en (fastest) → small.en (balanced) → medium.en (accurate)
- 100% Offline: No internet required after initial model download
2. Configure ydotool
# Add to ~/.bashrc or ~/.zshrc
# Enable and start ydotool service
# Verify it's running
# Test typing
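A sketch of the ydotool steps (the socket path and service name vary by install; ydotoold.service is an assumption):

```shell
# Add to ~/.bashrc or ~/.zshrc
echo 'export YDOTOOL_SOCKET="$HOME/.ydotool_socket"' >> ~/.bashrc
export YDOTOOL_SOCKET="$HOME/.ydotool_socket"

# Enable and start the ydotool daemon
systemctl --user enable --now ydotoold.service

# Verify it's running
systemctl --user status ydotoold.service --no-pager

# Test typing
ydotool type "hello from voxtty"
```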
3. Verify Audio Input (Important!)
Before using voxtty, verify your microphone works correctly with the built-in echo test:
# Run echo test - speak and hear your voice played back
# Select specific device interactively, then test
# Test with debug output to see audio levels
Echo Test Mode: Speak into your microphone, pause, and you'll hear your recording played back. This confirms:
- ✅ Microphone is working
- ✅ Audio levels are correct
- ✅ VAD (Voice Activity Detection) is triggering properly
- ✅ No audio driver issues
🎯 Usage
Basic Usage
# Start with default settings (whisper.cpp backend)
# Start with system tray icon
# Use Speaches API backend
# Enable debug output
Advanced Usage
# Interactive device selection with debug output
# Echo test with specific device
# Speaches API with tray control
Realtime Streaming Mode
For the lowest latency (~150ms), use realtime WebSocket streaming:
# Realtime with Speaches (self-hosted, free)
# Realtime with ElevenLabs (cloud, requires API key)
# Realtime with OpenAI (cloud, requires API key)
# Realtime with voice commands (pause/resume/mode switching)
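Sketched invocations for the variants above; the API-key environment variable names are assumptions:

```shell
# Realtime with Speaches (self-hosted, free)
voxtty --realtime --speaches

# Realtime with ElevenLabs (cloud)
export ELEVENLABS_API_KEY="your-key"
voxtty --realtime --elevenlabs

# Realtime with OpenAI (cloud)
export OPENAI_API_KEY="your-key"
voxtty --realtime --openai

# Realtime with voice commands (pause/resume/mode switching)
voxtty --realtime --speaches --auto
```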
Realtime Features:
- Audio feedback sounds for pause/resume/mode changes
- Auto-reconnects if WebSocket connection drops
- Tray tooltip shows connection status
- Voice commands work continuously (no need to wait for silence)
Command-Line Options
| Option | Description |
|---|---|
| --echo-test | Run audio echo test: speak and hear playback to verify the microphone |
| --select-device | Interactively choose the audio input device |
| --debug | Enable detailed debug logging (shows VAD triggers, audio levels) |
| --speaches | Use the Speaches backend instead of whisper.cpp (the default) |
| --tray | Enable system tray icon with click-to-toggle control |
| --tui | Enable Terminal UI (TUI) mode: a full-screen terminal interface |
| --realtime | Enable realtime WebSocket streaming (~150ms latency) |
| --elevenlabs | Use ElevenLabs for realtime transcription (requires API key) |
| --openai | Use OpenAI for transcription |
| --assistant | Enable assistant modes with wake-word activation |
| --auto | Enable voice commands without full assistant mode |
| --tui-output | Enable text output in TUI mode (types into the active application) |
| --mcp | Enable MCP tool calling (loads ~/.config/voxtty/mcp_servers.toml or .mcp.json) |
| --mock-mcp | Use the built-in mock MCP server for testing (cannot combine with --mcp) |
| --bidirectional | Enable bidirectional conversation with TTS responses |
Configuration Priority: CLI flags → environment variables → config file → auto-detection → built-in defaults
Terminal UI (TUI) Mode
Launch voxtty with a consolidated single-screen dashboard:
# Launch TUI in demo mode
# TUI with specific backend
Single Dashboard View:
┌─────────────────────────────────────────────────┐
│ voxtty | Dictation | Speaches | LISTENING │ ← Status bar
├──────────────────┬──────────────────────────────┤
│ Live Audio │ Configuration │
│ ████████ │ Model: GPT-4o-mini │
│ VAD: ● ACTIVE │ [m] Select Model │
│ Device: Default │ [d] Select Device │
├──────────────────┴──────────────────────────────┤
│ Last Transcription (5s ago) │
│ "testing one two three..." │
├──────────────────┬──────────────────────────────┤
│ Mode Selection │ Actions │
│ [1] ▶ Dictation │ [p] Pause │
│ [2] Assistant │ [e] Echo Test │
│ [3] Code │ │
└──────────────────┴──────────────────────────────┘
[q]Quit [?]Help [1-3]Mode [p]Pause [e]Echo
Everything At a Glance:
- Live audio visualization - Real-time voice level bar graph
- VAD indicator - Voice Activity Detection status (● ACTIVE / ○ Inactive)
- Last transcription - Most recent text with timestamp
- Mode switcher - Quick [1-3] keys to switch modes
- Quick actions - One-key access to echo test, pause, device selection
- Model info - Current AI model configuration
Keyboard Shortcuts:
- 1-3 - Switch mode (Dictation/Assistant/Code)
- p or Space - Pause/resume listening
- e - Run echo test
- m - Select AI model
- d - Select audio device
- ? or h - Toggle help screen
- q or Esc - Quit
No Navigation Needed - All controls visible on one screen!
🎮 Controls
System Tray Icon
The tray icon shows a colored circle with a letter indicating the current mode:
| Icon | Color | Meaning |
|---|---|---|
| D | 🟢 Green | Dictation mode (active) |
| A | 🔵 Blue | Assistant mode (active) |
| C | 🟣 Purple | Code mode (active) |
| D/A/C | 🟠 Orange | Paused (listening for "resume") |
| D/A/C | ⚫ Gray | Disabled (click to enable) |
- Left Click - Toggle voice typing on/off
- Right Click - Menu to switch modes (when --assistant or --auto is enabled)
Voice Commands
Wake words for hands-free control (requires --auto or --assistant flag):
| Command | Wake Words |
|---|---|
| Dictation Mode | "dictation mode", "normal mode", "typing mode", "type mode" |
| Assistant Mode | "hey assistant", "assistant mode" |
| Code Mode | "code mode", "coding mode", "write code" |
| Pause | "pause", "stop listening", "go to sleep" |
| Resume | "resume", "start listening", "wake up" |
🔧 Configuration
voxtty uses a layered configuration system for maximum flexibility:
Configuration Layers (Priority Order)
1. CLI Flags (highest)
2. Environment Variables
3. Config File (~/.config/voxtty/config.toml)
4. Auto-Detection (ydotool socket)
5. Built-in Defaults (lowest)
Config File
voxtty automatically creates ~/.config/voxtty/config.toml on first run:
# NOTE: key names below are illustrative; check the generated config.toml for the exact names

# ydotool socket path (auto-detected if not specified)
socket_path = "/run/user/1000/.ydotool_socket"

# Speaches backend
speaches_url = "http://localhost:8000/v1/audio/transcriptions"
speaches_model = "Systran/faster-distil-whisper-small.en"

# whisper.cpp backend
whisper_url = "http://127.0.0.1:7777/inference"
Environment Variables (Override Config File)
Backend Selection
| Backend | CLI Flag | Default URL | Configuration |
|---|---|---|---|
| whisper.cpp | (default) | http://127.0.0.1:7777/inference | Config file or env var |
| Speaches | --speaches | http://localhost:8000/v1/audio/transcriptions | Config file or env var |
Privacy Summary by Component
Quick reference for privacy-conscious users:
| Component | Backend | Privacy | Internet Required | CLI Flag |
|---|---|---|---|---|
| Transcription | whisper.cpp | 🔒 100% Local | No | (default) |
| Transcription | Speaches | 🔒 100% Local | No | --speaches |
| Transcription | Speaches Realtime | 🔒 100% Local | No | --realtime --speaches |
| Transcription | OpenAI Realtime | ☁️ Cloud | Yes | --realtime --openai |
| Transcription | ElevenLabs | ☁️ Cloud | Yes | --realtime --elevenlabs |
| LLM (Assistant/Code) | Ollama | 🔒 100% Local | No | --llm ollama |
| LLM (Assistant/Code) | Anthropic Claude | ☁️ Cloud | Yes | --llm anthropic |
| LLM (Assistant/Code) | OpenAI GPT | ☁️ Cloud | Yes | --llm openai |
| LLM (Assistant/Code) | Google Gemini | ☁️ Cloud | Yes | --llm google |
| LLM (Assistant/Code) | DeepSeek | ☁️ Cloud | Yes | --llm deepseek |
Privacy Tip: For complete privacy, use:
# 100% offline voice typing
# 100% offline with AI assistance
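For example (both flags appear in the tables above; the provider value is parameterized for clarity):

```shell
LLM_PROVIDER="ollama"   # local provider keeps the LLM offline too

# 100% offline voice typing
voxtty --speaches --tray

# 100% offline with AI assistance
voxtty --speaches --assistant --llm "$LLM_PROVIDER"
```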
⚠️ Important: ydotool Setup
Ubuntu's packaged ydotool is BROKEN. You must build it from source.
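A from-source sketch; upstream ydotool (ReimuNotMoe/ydotool) is CMake-based, and the dependency list is an assumption for Ubuntu:

```shell
YDOTOOL_REPO="https://github.com/ReimuNotMoe/ydotool.git"

sudo apt-get install -y cmake build-essential scdoc
git clone "$YDOTOOL_REPO"
cd ydotool
mkdir -p build && cd build
cmake .. && make -j"$(nproc)"
sudo make install
```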
📖 See the relevant sections in this README for setup and configuration details.
Audio Tuning
If you experience issues with voice detection:
-
Recording never stops - Microphone volume too high
- Lower mic volume in system settings
- Increase silence threshold in code
-
Recording doesn't start - Microphone volume too low
- Increase mic volume in system settings
- Decrease amplitude threshold in code
-
Background noise triggers recording - Environment too noisy
- Use push-to-talk via hotkey toggle
- Increase VAD sensitivity
🏗️ Architecture
What voxtty IS (and what it's NOT)
voxtty is a voice-to-text application that listens to your microphone and types text system-wide. It's designed for direct user interaction, not as a protocol server.
MCP Tool Calling
voxtty supports MCP (Model Context Protocol) tool calling, allowing the LLM to invoke external tools during conversation: check the weather, roll dice, run calculations, control smart home devices, and more.
Quick Start
# Test with built-in mock MCP server
# Use your own MCP servers
MCP Server Configuration
voxtty supports two config formats:
Option 1: Native TOML (~/.config/voxtty/mcp_servers.toml):
# (table and key names below are illustrative; check your version's example config for exact names)
[[servers]]
name = "weather"
command = "python3"
args = ["-m", "weather_mcp_server"]

[servers.env]
API_KEY = "your-api-key"

[[servers]]
name = "home-assistant"
command = "node"
args = ["path/to/ha-mcp-server.js"]
Option 2: Claude Code format (.mcp.json in project directory — auto-detected as fallback):
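A minimal .mcp.json in the Claude Code format might look like this, mirroring the TOML example above (the weather server module is hypothetical):

```json
{
  "mcpServers": {
    "weather": {
      "command": "python3",
      "args": ["-m", "weather_mcp_server"],
      "env": { "API_KEY": "your-api-key" }
    },
    "home-assistant": {
      "command": "node",
      "args": ["path/to/ha-mcp-server.js"]
    }
  }
}
```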
Each server is spawned as a child process and communicates via JSON-RPC over stdio. On startup, voxtty discovers available tools via tools/list and makes them available to the LLM alongside built-in tools (speak, type_text, switch_mode, process_command).
How It Works
- voxtty spawns each MCP server and sends initialize + tools/list
- Discovered tools are added to the LLM's tool definitions
- When the LLM calls an MCP tool, voxtty executes it via tools/call and feeds the result back
- The LLM uses the result to formulate a spoken or typed response
- A maximum of 5 tool-call iterations per turn prevents infinite loops
Built-in Mock Server (--mock-mcp)
For testing without external servers:
- get_time — Current date/time
- calculate — Math expressions (sqrt, trig, pi)
- weather — Mock weather for common cities
- random_fact — Random fun facts
- dice_roll — Standard notation (2d6, 1d20+3)
- echo — Echo test
Writing Your Own MCP Server
Any program that reads JSON-RPC from stdin and writes to stdout works. See test_mcp_server.py for a complete Python example. The protocol follows the MCP specification.
Barge-in (TTS Interruption)
When using bidirectional mode, speaking while the AI is talking interrupts playback immediately. The interrupt triggers on partial transcription (~200ms), enabling natural conversational flow without waiting for the AI to finish.
Core Components
- Audio Capture - CPAL for cross-platform audio input
- Voice Detection - WebRTC VAD + amplitude threshold
- Transcription - Whisper.cpp or Speaches API
- Text Input - ydotool for system-wide typing
- UI Controls - ksni (system tray)
- MCP Tools - External tool integration via stdio JSON-RPC
Audio Pipeline
Microphone → CPAL → VAD → WAV Buffer → Whisper AI → ydotool → Text Output
Detection Algorithm
1. Capture audio in 30ms frames at 16kHz
2. Run WebRTC VAD on each frame
3. Check the amplitude threshold (>1000)
4. Require 200ms of speech to start
5. Wait for 1000ms of silence to stop
6. Transcribe and type the result
🐛 Troubleshooting
Quick Fixes
Audio not working?
Transcription failing?
# Check backend is running
Text not typing?
# Check ydotool
Need more help?
# Run with debug output
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Development Setup
# Clone repository
# Build in debug mode
# Run with debug output
# Run tests
# Check code quality
📚 Documentation
All documentation is contained in this README. Additional detailed guides coming soon!
📝 License
This project is licensed under the GNU General Public License v2.0 - see the LICENSE file for details.
🙏 Acknowledgments
- OpenAI Whisper - State-of-the-art speech recognition
- whisper.cpp - Efficient C++ implementation
- ydotool - Generic command-line automation tool
- WebRTC VAD - Voice activity detection
🔗 Links
- Repository: https://github.com/jflaflamme/voxtty
- Issues: https://github.com/jflaflamme/voxtty/issues
- Releases: https://github.com/jflaflamme/voxtty/releases
Made with ❤️ by Jeff Laflamme
The power of whisper — your voice commands