# voxtty
[](https://github.com/jflaflamme/voxtty/actions/workflows/ci.yml)
**The power of whisper — your voice commands.**
voxtty turns your voice into action. Dictate, code, command, and converse — switch modes hands-free, choose your backend (local Whisper or cloud AI), and type anywhere. Private by default, limitless by design.
Built in Rust. Runs offline. Your voice never leaves unless you say so.
## 🚀 Quick Start
### Install
```bash
# Option 1: One-line installer (downloads pre-built binary or builds from source)
# Option 2: Cargo (requires Rust toolchain + system deps)
cargo install --git https://github.com/jflaflamme/voxtty
# Option 3: Build from source
git clone https://github.com/jflaflamme/voxtty.git
cd voxtty && cargo build --release
sudo cp target/release/voxtty /usr/local/bin/
```
### Setup
```bash
# 1. Setup ydotool (required for typing into applications)
sudo systemctl enable --now ydotool.service
export YDOTOOL_SOCKET=/tmp/.ydotool_socket # add to ~/.bashrc
# 2. Start Speaches backend (Docker)
docker run -d --name speaches -p 8000:8000 \
ghcr.io/speaches-ai/speaches:latest
# 3. Test your microphone
voxtty --echo-test
# 4. Start voice typing
voxtty --speaches --tray
```
**That's it!** Click the tray icon to toggle voice typing on/off.
### Optional: AI Assistant Mode
```bash
# 1. Select an AI model (interactive)
voxtty --select-model
# Choose from: OpenAI, Anthropic, Google, DeepSeek, Ollama (local/free), OpenRouter
# 2. Use assistant mode with wake words OR tray menu
voxtty --assistant --tray
# Voice commands: Say "hey assistant" for writing help, "code mode" for code
# GUI: Click tray icon → Select mode (Dictation/Assistant/Code)
# Note: Privacy warnings shown automatically when using cloud AI (OpenAI, etc.)
```
📖 **For detailed configuration, model selection, and troubleshooting**, see the sections below in this README.
## 🎯 Inspiration & Evolution
voxtty was inspired by [themanyone/voice_typing](https://github.com/themanyone/voice_typing), a brilliant bash-based voice typing solution. While the original project demonstrated the power of offline voice typing, voxtty takes it further by:
- **Rewritten in Rust** - Memory-safe, fast, and reliable
- **Speaches Backend** - Uses [Speaches AI](https://github.com/speaches-ai/speaches) for a more extensible and maintainable transcription backend
- **Better Performance** - ~2 seconds latency on i7 CPU (no GPU required) with basic model
- **Enhanced UX** - System tray integration for seamless control
- **Production Ready** - Proper error handling, device selection, and configuration options
### Why Speaches?
[Speaches](https://github.com/speaches-ai/speaches) provides a superior backend option compared to direct whisper.cpp integration:
- **OpenAI-Compatible API** - Standard REST interface for easy integration
- **Docker Support** - Run in containers with consistent environments
- **CPU Optimized** - Excellent performance even without GPU (~2s latency on i7)
- **Model Flexibility** - Easy switching between different Whisper model sizes
- **Network Ready** - Can run locally or on a dedicated transcription server
- **Better Extensibility** - Clean API makes it easy to add features and improvements
### Backend Comparison
| **Setup** | Manual build | Docker one-liner | API key or self-hosted |
| **Latency** | ~3-4s | ~2s | **~150ms** |
| **Privacy** | 100% offline | 100% offline | Depends on provider |
| **Providers** | Local only | Local only | Speaches, ElevenLabs, OpenAI |
| **Best For** | Minimal setup | Production use | **Lowest latency** |
**Realtime Providers:**
- **Speaches** - Self-hosted, free, ~150ms latency, 🔒 **100% LOCAL** (privacy-preserving)
- **ElevenLabs** - Cloud, excellent accuracy, requires API key, ☁️ **CLOUD** (sends audio to third-party)
- **OpenAI** - Cloud, GPT-4o transcription, requires API key, ☁️ **CLOUD** (sends audio to third-party)
## ✨ Features
### 🔒 Privacy First
- **100% Offline Processing** - All transcription happens locally using Whisper AI
- **No Cloud Services** - Your voice never leaves your machine (for dictation mode)
- **Privacy Warnings** - Automatic alerts when using cloud AI services (Assistant/Code modes)
- **No Data Collection** - Zero telemetry, zero tracking
- **Self-Hosted** - Run Speaches backend in Docker on your own hardware
- **Network Isolated** - Works completely offline, no internet required
- **Local AI Option** - Use Ollama for 100% offline Assistant/Code modes
### ⚡ Realtime Streaming Mode
- **~150ms Latency** - WebSocket-based streaming for near-instant transcription
- **Multiple Providers** - ElevenLabs, OpenAI Realtime, or Speaches WebSocket
- **Auto-Reconnection** - Automatically reconnects if connection drops
- **Connection Status** - Tray tooltip shows `[Disconnected]` when offline
### 🎤 Smart Voice Detection
- **Voice Activity Detection (VAD)** - Automatically detects when you start and stop speaking
- **WebRTC VAD Engine** - Industry-standard voice detection with low false positives
- **Amplitude Threshold** - Dual detection system for reliable speech capture
- **Configurable Silence Detection** - Customizable pause duration before transcription
### ⌨️ System-Wide Integration
- **Universal Typing** - Works in any application via ydotool
- **No GUI Required** - Runs in TTY, X11, Wayland, or any Linux environment
- **Instant Text Insertion** - Transcribed text appears directly where you're typing
### 🎛️ Flexible Control
- **System Tray Icon** - Quick toggle on/off with visual status indicator (click to enable/disable)
- **GUI Mode Switching** - Switch between Dictation/Assistant/Code modes from tray menu (when `--assistant` enabled)
- **Voice Commands** - Wake words for hands-free mode switching ("hey assistant", "code mode", "dictation mode")
- **Audio Feedback** - Notification sounds for pause/resume/mode changes
- **Multiple Backends** - Support for whisper.cpp, Speaches, OpenAI, or ElevenLabs
- **Interactive Device Selection** - Choose your preferred microphone
- **Always Available** - Runs in background, ready when you need it
### 🔧 Developer Friendly
- **Echo Test Mode** - Built-in `--echo-test` CLI flag to verify audio input with instant playback
- **Debug Mode** - Detailed logging for troubleshooting with `--debug` flag
- **Flexible Configuration** - Environment variables and CLI flags for easy customization
- **Backend Agnostic** - Switch backends with a single flag, no code changes
- **Clean Rust Codebase** - Modern, safe, and maintainable
## 📋 Requirements
### Core Dependencies
- **Whisper AI Backend** (choose one):
- [whisper.cpp](https://github.com/ggerganov/whisper.cpp) - Lightweight C++ server
- [Speaches](https://github.com/speaches-ai/speaches) - Modern API server (recommended)
- **Audio**: ALSA (libasound2-dev)
- **System Integration**: ydotool (for typing), DBus (for system tray)
- **Build Tools**: Rust 1.70+, cargo, pkg-config
### Optional: AI Assistant Mode
- **LLM Provider** (choose one):
- [Ollama](https://ollama.com/) - Free, local AI models (recommended for privacy)
- [OpenAI](https://openai.com/) - GPT-4o, GPT-4o-mini (API key required)
- [Anthropic](https://anthropic.com/) - Claude models (API key required)
- [Google](https://ai.google.dev/) - Gemini models (API key required)
- [DeepSeek](https://deepseek.com/) - Affordable models (API key required)
- [OpenRouter](https://openrouter.ai/) - Access multiple providers (API key required)
### Runtime Dependencies
- `ydotool` - System-wide input simulation
- `alsa-utils` - Audio utilities
- `pulseaudio` or `pipewire-pulse` - Audio server (recommended)
## 🚀 Installation
### Choose Your Installation Method
| **curl \| bash** | Quick install, end users | curl or wget |
| **cargo install** | Rust users | Rust toolchain + system deps |
| **Build from source** | Development, contributing | Rust toolchain + system deps |
| **Debian Package** | Custom packaging | debhelper, cargo |
### Option 1: One-line Installer (Recommended)
Downloads a pre-built binary from GitHub Releases, or builds from source as fallback:
```bash
Installs to `~/.local/bin` by default. Override with `INSTALL_DIR`:
```bash
INSTALL_DIR=/usr/local/bin curl -fsSL https://raw.githubusercontent.com/jflaflamme/voxtty/main/install.sh | bash
```
### Option 2: Cargo Install
```bash
# Requires: pkg-config libasound2-dev libatk1.0-dev libgtk-3-dev
cargo install --git https://github.com/jflaflamme/voxtty
```
### Option 3: Build from Source
```bash
git clone https://github.com/jflaflamme/voxtty.git
cd voxtty
cargo build --release
sudo cp target/release/voxtty /usr/local/bin/
```
### Build Dependencies (Ubuntu/Debian)
```bash
sudo apt install pkg-config libasound2-dev libatk1.0-dev libgtk-3-dev libgdk-pixbuf2.0-dev libssl-dev
```
### Option 3: Systemd User Service (Auto-start on Login)
For automatic startup with realtime transcription:
```bash
# 1. Install binary
sudo cp target/release/voxtty /usr/local/bin/
# 2. Create environment file with API keys
mkdir -p ~/.config/voxtty
cat > ~/.config/voxtty/env << 'EOF'
ELEVENLABS_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here
EOF
chmod 600 ~/.config/voxtty/env
# 3. Create systemd service
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/voxtty.service << 'EOF'
[Unit]
Description=voxtty voice typing
After=graphical-session.target
[Service]
EnvironmentFile=%h/.config/voxtty/env
ExecStart=/usr/local/bin/voxtty --tray --auto --realtime --elevenlabs
Restart=on-failure
RestartSec=5
[Install]
WantedBy=default.target
EOF
# 4. Enable and start
systemctl --user daemon-reload
systemctl --user enable --now voxtty
# 5. Check status
systemctl --user status voxtty
```
**Service management:**
```bash
systemctl --user status voxtty # Check status
systemctl --user stop voxtty # Stop
systemctl --user restart voxtty # Restart
journalctl --user -u voxtty -f # Watch logs
```
### Optional: Desktop Entry (App Menu)
```bash
cat > ~/.local/share/applications/voxtty.desktop << 'EOF'
[Desktop Entry]
Name=voxtty
Comment=Voice to text typing
Exec=voxtty --tray --auto
Icon=audio-input-microphone
Terminal=false
Type=Application
Categories=Utility;Accessibility;
EOF
```
### Option 4: Build Your Own Debian Package
**For**: Creating custom packages or contributing
#### Prerequisites
```bash
# Install build dependencies
sudo apt install debhelper-compat cargo rustc pkg-config libasound2-dev libdbus-1-dev
```
#### Build Process
The Debian package is built using standard Debian packaging tools:
```bash
# Method 1: Use the build script (recommended)
./packaging/scripts/build-deb.sh
# Method 2: Manual build (same as the script)
cargo build --release
dpkg-buildpackage -rfakeroot -us -uc
# The package will be created in the parent directory
ls -lh ../voxtty_*.deb
```
#### What Gets Built
The build process creates:
- **Binary package**: `voxtty_0.1.0-1_amd64.deb` - Main installable package
- **Debug symbols**: `voxtty-dbgsym_0.1.0-1_amd64.deb` - Debug symbols (optional)
- **Build artifacts**: `.buildinfo`, `.changes` files - Build metadata
#### Package Contents
```
/usr/bin/voxtty # Main executable
/usr/share/doc/voxtty/ # Documentation
/usr/share/man/man1/ # Man pages (if included)
```
#### Install Your Custom Package
```bash
# Install the package
sudo dpkg -i ../voxtty_*.deb
# If dependencies are missing, fix them
sudo apt-get install -f
# Verify installation
which voxtty
voxtty --version
```
#### Debian Packaging Files
The package is configured via files in the `debian/` directory:
- **`debian/control`** - Package metadata, dependencies, and description
- **`debian/rules`** - Build instructions (uses dh with Cargo)
- **`debian/changelog`** - Version history and release notes
- **`debian/install`** - Files to include in the package
- **`debian/compat`** - Debhelper compatibility level
#### Customizing the Package
To modify the package:
1. **Change version**: Edit `debian/changelog`
```bash
dch -v 0.2.0-1 "New release with feature X"
```
2. **Add dependencies**: Edit `debian/control`
```
Depends: ${shlibs:Depends}, ${misc:Depends}, your-new-dependency
```
3. **Modify build**: Edit `debian/rules`
```makefile
override_dh_auto_build:
cargo build --release --features your-feature
```
4. **Rebuild package**:
```bash
./packaging/scripts/build-deb.sh
```
#### Troubleshooting Package Build
**Build fails with missing dependencies?**
```bash
sudo apt-get build-dep .
```
**Clean build artifacts?**
```bash
cargo clean
debian/rules clean
rm -f ../voxtty_*
```
**Test package without installing?**
```bash
dpkg-deb --contents ../voxtty_*.deb
dpkg-deb --info ../voxtty_*.deb
```
## ⚙️ Setup
### 1. Install Whisper Backend
#### Option A: whisper.cpp (Recommended for local use)
```bash
# Clone and build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make
# Download a model (tiny.en is fastest, small.en is more accurate)
bash ./models/download-ggml-model.sh tiny.en
# Start the server
./server -l en -m models/ggml-tiny.en.bin --port 7777 --convert
```
#### Option B: Speaches API (Recommended - Better Performance & Extensibility)
Speaches provides superior performance and flexibility compared to whisper.cpp. On an i7 CPU (no GPU), expect ~2 second latency with the basic model.
```bash
# Quick start with Docker (CPU-only)
docker run -d \
--name speaches \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/speaches-ai/speaches:latest
# Or with docker-compose
cat > docker-compose.yml <<EOF
services:
speaches:
image: ghcr.io/speaches-ai/speaches:latest
ports:
- "8000:8000"
volumes:
- ~/.cache/huggingface:/root/.cache/huggingface
environment:
- TRANSCRIPTION_MODEL_ID=Systran/faster-distil-whisper-small.en
EOF
docker-compose up -d
# IMPORTANT: Initial Speaches Configuration
# After starting Speaches for the first time, you must configure it:
# 1. Set the base URL for Speaches API
export SPEACHES_BASE_URL="http://localhost:8000"
# 2. Check available models in the registry
curl "$SPEACHES_BASE_URL/v1/registry"
# 3. Download and activate your chosen model (first-time setup)
curl "$SPEACHES_BASE_URL/v1/models/Systran/faster-distil-whisper-small.en" -X POST
# 4. Configure voxtty to use Speaches
export SPEACHES_BASE_URL="http://localhost:8000/v1/audio/transcriptions"
export TRANSCRIPTION_MODEL_ID="Systran/faster-distil-whisper-small.en"
# 5. Test the connection (requires a test audio file)
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-F "file=@test.wav" \
-F "model=Systran/faster-distil-whisper-small.en"
```
**Initial Setup Notes:**
- The `/v1/registry` endpoint shows all available Whisper models
- The `/v1/models/{model_name}` POST endpoint downloads and activates a model
- Model download happens once; subsequent starts use the cached model
- Choose model size based on your needs: `tiny.en` (fastest) → `small.en` (balanced) → `medium.en` (accurate)
**Performance Notes:**
- **CPU-only (i7)**: ~2 seconds latency with basic model
- **GPU-enabled**: Sub-second latency possible
- **Model sizes**: tiny.en (fastest) → small.en (balanced) → medium.en (accurate)
- **100% Offline**: No internet required after initial model download
### 2. Configure ydotool
```bash
# Add to ~/.bashrc or ~/.zshrc
export YDOTOOL_SOCKET=/tmp/.ydotool_socket
# Enable and start ydotool service
sudo systemctl enable ydotool.service
sudo systemctl start ydotool.service
# Verify it's running
sudo systemctl status ydotool.service
# Test typing
ydotool type "Hello, World!"
```
### 3. Verify Audio Input (Important!)
Before using voxtty, verify your microphone works correctly with the built-in echo test:
```bash
# Run echo test - speak and hear your voice played back
voxtty --echo-test
# Select specific device interactively, then test
voxtty --select-device --echo-test
# Test with debug output to see audio levels
voxtty --echo-test --debug
```
**Echo Test Mode**: Speak into your microphone, pause, and you'll hear your recording played back. This confirms:
- ✅ Microphone is working
- ✅ Audio levels are correct
- ✅ VAD (Voice Activity Detection) is triggering properly
- ✅ No audio driver issues
## 🎯 Usage
### Basic Usage
```bash
# Start with default settings (whisper.cpp backend)
voxtty
# Start with system tray icon
voxtty --tray
# Use Speaches API backend
voxtty --speaches
# Enable debug output
voxtty --debug
```
### Advanced Usage
```bash
# Interactive device selection with debug output
voxtty --select-device --debug
# Echo test with specific device
voxtty --select-device --echo-test
# Speaches API with tray control
voxtty --speaches --tray
```
### Realtime Streaming Mode
For the lowest latency (~150ms), use realtime WebSocket streaming:
```bash
# Realtime with Speaches (self-hosted, free)
voxtty --realtime --speaches --tray
# Realtime with ElevenLabs (cloud, requires API key)
export ELEVENLABS_API_KEY=your_key_here
voxtty --realtime --elevenlabs --tray
# Realtime with OpenAI (cloud, requires API key)
export OPENAI_API_KEY=your_key_here
voxtty --realtime --openai --tray
# Realtime with voice commands (pause/resume/mode switching)
voxtty --realtime --speaches --auto --tray
```
**Realtime Features:**
- Audio feedback sounds for pause/resume/mode changes
- Auto-reconnects if WebSocket connection drops
- Tray tooltip shows connection status
- Voice commands work continuously (no need to wait for silence)
### Command-Line Options
| `--echo-test` | **Run audio echo test** - Speak and hear playback to verify microphone |
| `--select-device` | Interactively choose audio input device |
| `--debug` | Enable detailed debug logging (shows VAD triggers, audio levels) |
| `--speaches` | **Use Speaches backend** instead of whisper.cpp (default) |
| `--tray` | Enable system tray icon with click-to-toggle control |
| `--tui` | **Enable Terminal UI (TUI) mode** - Full-screen terminal interface |
| `--realtime` | **Enable realtime WebSocket streaming** (~150ms latency) |
| `--elevenlabs` | Use ElevenLabs for realtime transcription (requires API key) |
| `--openai` | Use OpenAI for transcription |
| `--assistant` | Enable assistant modes with wake word activation |
| `--auto` | Enable voice commands without full assistant mode |
| `--tui-output` | Enable text output in TUI mode (types to active application) |
| `--mcp` | **Enable MCP tool calling** (loads `~/.config/voxtty/mcp_servers.toml` or `.mcp.json`) |
| `--mock-mcp` | Use built-in mock MCP server for testing (cannot combine with `--mcp`) |
| `--bidirectional` | Enable bidirectional conversation with TTS responses |
**Configuration Priority**: CLI flags → Environment variables → Defaults
### Terminal UI (TUI) Mode
Launch voxtty with a consolidated single-screen dashboard:
```bash
# Launch TUI in demo mode
voxtty --tui
# TUI with specific backend
voxtty --tui --speaches
voxtty --tui --realtime --elevenlabs
```
**Single Dashboard View:**
```
┌─────────────────────────────────────────────────┐
│ Live Audio │ Configuration │
│ ████████ │ Model: GPT-4o-mini │
│ VAD: ● ACTIVE │ [m] Select Model │
│ Device: Default │ [d] Select Device │
├──────────────────┴──────────────────────────────┤
│ Last Transcription (5s ago) │
│ "testing one two three..." │
├──────────────────┬──────────────────────────────┤
│ Mode Selection │ Actions │
│ [1] ▶ Dictation │ [p] Pause │
│ [2] Assistant │ [e] Echo Test │
│ [3] Code │ │
└──────────────────┴──────────────────────────────┘
[q]Quit [?]Help [1-3]Mode [p]Pause [e]Echo
```
**Everything At a Glance:**
- **Live audio visualization** - Real-time voice level bar graph
- **VAD indicator** - Voice Activity Detection status (● ACTIVE / ○ Inactive)
- **Last transcription** - Most recent text with timestamp
- **Mode switcher** - Quick [1-3] keys to switch modes
- **Quick actions** - One-key access to echo test, pause, device selection
- **Model info** - Current AI model configuration
**Keyboard Shortcuts:**
- `1-3` - Switch mode (Dictation/Assistant/Code)
- `p` or `Space` - Pause/Resume listening
- `e` - Run echo test
- `m` - Select AI model
- `d` - Select audio device
- `?` or `h` - Toggle help screen
- `q` or `Esc` - Quit
**No Navigation Needed** - All controls visible on one screen!
## 🎮 Controls
### System Tray Icon
The tray icon shows a colored circle with a letter indicating the current mode:
| **D** | 🟢 Green | Dictation mode (active) |
| **A** | 🔵 Blue | Assistant mode (active) |
| **C** | 🟣 Purple | Code mode (active) |
| **D/A/C** | 🟠 Orange | Paused (listening for "resume") |
| **D/A/C** | ⚫ Gray | Disabled (click to enable) |
- **Left Click** - Toggle voice typing on/off
- **Right Click** - Menu to switch modes (when `--assistant` or `--auto` enabled)
### Voice Commands
Wake words for hands-free control (requires `--auto` or `--assistant` flag):
| **Dictation Mode** | "dictation mode", "normal mode", "typing mode", "type mode" |
| **Assistant Mode** | "hey assistant", "assistant mode" |
| **Code Mode** | "code mode", "coding mode", "write code" |
| **Pause** | "pause", "stop listening", "go to sleep" |
| **Resume** | "resume", "start listening", "wake up" |
## 🔧 Configuration
voxtty uses a **layered configuration system** for maximum flexibility:
### Configuration Layers (Priority Order)
```
1. CLI Flags (highest)
2. Environment Variables
3. Config File (~/.config/voxtty/config.toml)
4. Auto-Detection (ydotool socket)
5. Built-in Defaults (lowest)
```
### Config File
voxtty automatically creates `~/.config/voxtty/config.toml` on first run:
```toml
# ydotool socket path (auto-detected if not specified)
ydotool_socket = "/run/user/1000/.ydotool_socket"
# Speaches backend
speaches_base_url = "http://localhost:8000/v1/audio/transcriptions"
transcription_model_id = "Systran/faster-distil-whisper-small.en"
# whisper.cpp backend
whisper_url = "http://127.0.0.1:7777/inference"
```
### Environment Variables (Override Config File)
```bash
export YDOTOOL_SOCKET=/run/user/$(id -u)/.ydotool_socket
export SPEACHES_BASE_URL=http://localhost:8000/v1/audio/transcriptions
export TRANSCRIPTION_MODEL_ID=Systran/faster-distil-whisper-small.en
```
### Backend Selection
| whisper.cpp | (default) | `http://127.0.0.1:7777/inference` | Config file or env var |
| Speaches | `--speaches` | `http://localhost:8000/v1/audio/transcriptions` | Config file or env var |
### Privacy Summary by Component
Quick reference for privacy-conscious users:
| **Transcription** | whisper.cpp | 🔒 100% Local | No | (default) |
| | Speaches | 🔒 100% Local | No | `--speaches` |
| | Speaches Realtime | 🔒 100% Local | No | `--realtime --speaches` |
| | OpenAI Realtime | ☁️ Cloud | Yes | `--realtime --openai` |
| | ElevenLabs | ☁️ Cloud | Yes | `--realtime --elevenlabs` |
| **LLM (Assistant/Code)** | Ollama | 🔒 100% Local | No | `--llm ollama` |
| | Anthropic Claude | ☁️ Cloud | Yes | `--llm anthropic` |
| | OpenAI GPT | ☁️ Cloud | Yes | `--llm openai` |
| | Google Gemini | ☁️ Cloud | Yes | `--llm google` |
| | DeepSeek | ☁️ Cloud | Yes | `--llm deepseek` |
**Privacy Tip**: For complete privacy, use:
```bash
# 100% offline voice typing
voxtty --speaches --tray
# 100% offline with AI assistance
voxtty --speaches --assistant --llm ollama --tray
```
### ⚠️ Important: ydotool Setup
**Ubuntu's ydotool package is BROKEN**. You MUST build from source:
```bash
git clone https://github.com/ReimuNotMoe/ydotool.git
cd ydotool && mkdir build && cd build
cmake -DSYSTEMD_USER_SERVICE=ON ..
make -j $(nproc) && sudo make install
systemctl --user enable --now ydotoold.service
```
📖 **See the relevant sections in this README for setup and configuration details.**
### Audio Tuning
If you experience issues with voice detection:
1. **Recording never stops** - Microphone volume too high
- Lower mic volume in system settings
- Increase silence threshold in code
2. **Recording doesn't start** - Microphone volume too low
- Increase mic volume in system settings
- Decrease amplitude threshold in code
3. **Background noise triggers recording** - Environment too noisy
- Use push-to-talk via hotkey toggle
- Increase VAD sensitivity
## 🏗️ Architecture
### What voxtty IS (and what it's NOT)
voxtty is a **voice-to-text application** that listens to your microphone and types text system-wide. It's designed for direct user interaction, not as a protocol server.
voxtty supports **MCP (Model Context Protocol)** tool calling, allowing the LLM to invoke external tools during conversation — check the weather, roll dice, run calculations, control smart home devices, and more.
#### Quick Start
```bash
# Test with built-in mock MCP server
./target/release/voxtty --mock-mcp --assistant --bidirectional --realtime --elevenlabs
# Use your own MCP servers
./target/release/voxtty --mcp --assistant --bidirectional --realtime --elevenlabs
```
#### MCP Server Configuration
voxtty supports two config formats:
**Option 1: Native TOML** (`~/.config/voxtty/mcp_servers.toml`):
```toml
[[servers]]
name = "weather"
command = "python3"
args = ["-m", "weather_mcp_server"]
[servers.env]
API_KEY = "your-api-key"
[[servers]]
name = "home-assistant"
command = "node"
args = ["path/to/ha-mcp-server.js"]
```
**Option 2: Claude Code format** (`.mcp.json` in project directory — auto-detected as fallback):
```json
{
"mcpServers": {
"weather": {
"type": "stdio",
"command": "python3",
"args": ["-m", "weather_mcp_server"],
"env": { "API_KEY": "your-api-key" }
}
}
}
```
Each server is spawned as a child process and communicates via JSON-RPC over stdio. On startup, voxtty discovers available tools via `tools/list` and makes them available to the LLM alongside built-in tools (speak, type_text, switch_mode, process_command).
#### How It Works
1. voxtty spawns each MCP server and sends `initialize` + `tools/list`
2. Discovered tools are added to the LLM's tool definitions
3. When the LLM calls an MCP tool, voxtty executes it via `tools/call` and feeds the result back
4. The LLM uses the result to formulate a spoken or typed response
5. Max 5 tool call iterations per turn (prevents infinite loops)
#### Built-in Mock Server (`--mock-mcp`)
For testing without external servers:
- **get_time** — Current date/time
- **calculate** — Math expressions (sqrt, trig, pi)
- **weather** — Mock weather for common cities
- **random_fact** — Random fun facts
- **dice_roll** — Standard notation (2d6, 1d20+3)
- **echo** — Echo test
#### Writing Your Own MCP Server
Any program that reads JSON-RPC from stdin and writes to stdout works. See `test_mcp_server.py` for a complete Python example. The protocol follows the [MCP specification](https://spec.modelcontextprotocol.io/).
### Barge-in (TTS Interruption)
When using bidirectional mode, speaking while the AI is talking interrupts playback immediately. The interrupt triggers on partial transcription (~200ms), enabling natural conversational flow without waiting for the AI to finish.
### Core Components
- **Audio Capture** - CPAL for cross-platform audio input
- **Voice Detection** - WebRTC VAD + amplitude threshold
- **Transcription** - Whisper.cpp or Speaches API
- **Text Input** - ydotool for system-wide typing
- **UI Controls** - ksni (system tray)
- **MCP Tools** - External tool integration via stdio JSON-RPC
### Audio Pipeline
```
Microphone → CPAL → VAD → WAV Buffer → Whisper AI → ydotool → Text Output
```
### Detection Algorithm
1. Capture audio in 30ms frames at 16kHz
2. Run WebRTC VAD on each frame
3. Check amplitude threshold (>1000)
4. Require 200ms of speech to start
5. Wait 1000ms of silence to stop
6. Transcribe and type result
## 🐛 Troubleshooting
### Quick Fixes
**Audio not working?**
```bash
voxtty --echo-test
```
**Transcription failing?**
```bash
# Check backend is running
```
**Text not typing?**
```bash
# Check ydotool
systemctl --user status ydotoold.service
ydotool type "test"
```
**Need more help?**
```bash
# Run with debug output
voxtty --debug --speaches
```
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
### Development Setup
```bash
# Clone repository
git clone https://github.com/jflaflamme/voxtty.git
cd voxtty
# Build in debug mode
cargo build
# Run with debug output
cargo run -- --debug --echo-test
# Run tests
cargo test
# Check code quality
cargo clippy
cargo fmt --check
```
## 📚 Documentation
All documentation is contained in this README. Additional detailed guides coming soon!
## 📝 License
This project is licensed under the GNU General Public License v2.0 - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- [OpenAI Whisper](https://github.com/openai/whisper) - State-of-the-art speech recognition
- [whisper.cpp](https://github.com/ggerganov/whisper.cpp) - Efficient C++ implementation
- [ydotool](https://github.com/ReimuNotMoe/ydotool) - Generic command-line automation tool
- [WebRTC VAD](https://github.com/wiseman/py-webrtcvad) - Voice activity detection
## 🔗 Links
- **Repository**: https://github.com/jflaflamme/voxtty
- **Issues**: https://github.com/jflaflamme/voxtty/issues
- **Releases**: https://github.com/jflaflamme/voxtty/releases
---
**Made with ❤️ by Jeff Laflamme**
*The power of whisper — your voice commands*