voxtty 0.1.0

The power of whisper — your voice commands. Dictate, code, command, and converse on Linux.
voxtty-0.1.0 is not a library.

voxtty

CI

The power of whisper — your voice commands.

voxtty turns your voice into action. Dictate, code, command, and converse — switch modes hands-free, choose your backend (local Whisper or cloud AI), and type anywhere. Private by default, limitless by design.

Built in Rust. Runs offline. Your voice never leaves unless you say so.

🚀 Quick Start

Install

# Option 1: One-line installer (downloads pre-built binary or builds from source)
curl -fsSL https://raw.githubusercontent.com/jflaflamme/voxtty/main/install.sh | bash

# Option 2: Cargo (requires Rust toolchain + system deps)
cargo install --git https://github.com/jflaflamme/voxtty

# Option 3: Build from source
git clone https://github.com/jflaflamme/voxtty.git
cd voxtty && cargo build --release
sudo cp target/release/voxtty /usr/local/bin/

Setup

# 1. Setup ydotool (required for typing into applications)
sudo systemctl enable --now ydotool.service
export YDOTOOL_SOCKET=/tmp/.ydotool_socket  # add to ~/.bashrc

# 2. Start Speaches backend (Docker)
docker run -d --name speaches -p 8000:8000 \
  ghcr.io/speaches-ai/speaches:latest

# 3. Test your microphone
voxtty --echo-test

# 4. Start voice typing
voxtty --speaches --tray

That's it! Click the tray icon to toggle voice typing on/off.

Optional: AI Assistant Mode

# 1. Select an AI model (interactive)
voxtty --select-model
# Choose from: OpenAI, Anthropic, Google, DeepSeek, Ollama (local/free), OpenRouter

# 2. Use assistant mode with wake words OR tray menu
voxtty --assistant --tray
# Voice commands: Say "hey assistant" for writing help, "code mode" for code
# GUI: Click tray icon → Select mode (Dictation/Assistant/Code)
# Note: Privacy warnings shown automatically when using cloud AI (OpenAI, etc.)

📖 For detailed configuration, model selection, and troubleshooting, see the sections below in this README.

🎯 Inspiration & Evolution

voxtty was inspired by themanyone/voice_typing, a brilliant bash-based voice typing solution. While the original project demonstrated the power of offline voice typing, voxtty takes it further by:

  • Rewritten in Rust - Memory-safe, fast, and reliable
  • Speaches Backend - Uses Speaches AI for a more extensible and maintainable transcription backend
  • Better Performance - ~2 seconds latency on i7 CPU (no GPU required) with basic model
  • Enhanced UX - System tray integration for seamless control
  • Production Ready - Proper error handling, device selection, and configuration options

Why Speaches?

Speaches provides a superior backend option compared to direct whisper.cpp integration:

  • OpenAI-Compatible API - Standard REST interface for easy integration
  • Docker Support - Run in containers with consistent environments
  • CPU Optimized - Excellent performance even without GPU (~2s latency on i7)
  • Model Flexibility - Easy switching between different Whisper model sizes
  • Network Ready - Can run locally or on a dedicated transcription server
  • Better Extensibility - Clean API makes it easy to add features and improvements

Backend Comparison

Feature whisper.cpp Speaches Realtime (WebSocket)
Setup Manual build Docker one-liner API key or self-hosted
Latency ~3-4s ~2s ~150ms
Privacy 100% offline 100% offline Depends on provider
Providers Local only Local only Speaches, ElevenLabs, OpenAI
Best For Minimal setup Production use Lowest latency

Realtime Providers:

  • Speaches - Self-hosted, free, ~150ms latency, 🔒 100% LOCAL (privacy-preserving)
  • ElevenLabs - Cloud, excellent accuracy, requires API key, ☁️ CLOUD (sends audio to third-party)
  • OpenAI - Cloud, GPT-4o transcription, requires API key, ☁️ CLOUD (sends audio to third-party)

✨ Features

🔒 Privacy First

  • 100% Offline Processing - All transcription happens locally using Whisper AI
  • No Cloud Services - Your voice never leaves your machine (for dictation mode)
  • Privacy Warnings - Automatic alerts when using cloud AI services (Assistant/Code modes)
  • No Data Collection - Zero telemetry, zero tracking
  • Self-Hosted - Run Speaches backend in Docker on your own hardware
  • Network Isolated - Works completely offline, no internet required
  • Local AI Option - Use Ollama for 100% offline Assistant/Code modes

⚡ Realtime Streaming Mode

  • ~150ms Latency - WebSocket-based streaming for near-instant transcription
  • Multiple Providers - ElevenLabs, OpenAI Realtime, or Speaches WebSocket
  • Auto-Reconnection - Automatically reconnects if connection drops
  • Connection Status - Tray tooltip shows [Disconnected] when offline

🎤 Smart Voice Detection

  • Voice Activity Detection (VAD) - Automatically detects when you start and stop speaking
  • WebRTC VAD Engine - Industry-standard voice detection with low false positives
  • Amplitude Threshold - Dual detection system for reliable speech capture
  • Configurable Silence Detection - Customizable pause duration before transcription

⌨️ System-Wide Integration

  • Universal Typing - Works in any application via ydotool
  • No GUI Required - Runs in TTY, X11, Wayland, or any Linux environment
  • Instant Text Insertion - Transcribed text appears directly where you're typing

🎛️ Flexible Control

  • System Tray Icon - Quick toggle on/off with visual status indicator (click to enable/disable)
  • GUI Mode Switching - Switch between Dictation/Assistant/Code modes from tray menu (when --assistant enabled)
  • Voice Commands - Wake words for hands-free mode switching ("hey assistant", "code mode", "dictation mode")
  • Audio Feedback - Notification sounds for pause/resume/mode changes
  • Multiple Backends - Support for whisper.cpp, Speaches, OpenAI, or ElevenLabs
  • Interactive Device Selection - Choose your preferred microphone
  • Always Available - Runs in background, ready when you need it

🔧 Developer Friendly

  • Echo Test Mode - Built-in --echo-test CLI flag to verify audio input with instant playback
  • Debug Mode - Detailed logging for troubleshooting with --debug flag
  • Flexible Configuration - Environment variables and CLI flags for easy customization
  • Backend Agnostic - Switch backends with a single flag, no code changes
  • Clean Rust Codebase - Modern, safe, and maintainable

📋 Requirements

Core Dependencies

  • Whisper AI Backend (choose one):
  • Audio: ALSA (libasound2-dev)
  • System Integration: ydotool (for typing), DBus (for system tray)
  • Build Tools: Rust 1.70+, cargo, pkg-config

Optional: AI Assistant Mode

  • LLM Provider (choose one):
    • Ollama - Free, local AI models (recommended for privacy)
    • OpenAI - GPT-4o, GPT-4o-mini (API key required)
    • Anthropic - Claude models (API key required)
    • Google - Gemini models (API key required)
    • DeepSeek - Affordable models (API key required)
    • OpenRouter - Access multiple providers (API key required)

Runtime Dependencies

  • ydotool - System-wide input simulation
  • alsa-utils - Audio utilities
  • pulseaudio or pipewire-pulse - Audio server (recommended)

🚀 Installation

Choose Your Installation Method

Method Best For Requirements
curl | bash Quick install, end users curl or wget
cargo install Rust users Rust toolchain + system deps
Build from source Development, contributing Rust toolchain + system deps
Debian Package Custom packaging debhelper, cargo

Option 1: One-line Installer (Recommended)

Downloads a pre-built binary from GitHub Releases, or builds from source as fallback:

curl -fsSL https://raw.githubusercontent.com/jflaflamme/voxtty/main/install.sh | bash

Installs to ~/.local/bin by default. Override with INSTALL_DIR:

INSTALL_DIR=/usr/local/bin curl -fsSL https://raw.githubusercontent.com/jflaflamme/voxtty/main/install.sh | bash

Option 2: Cargo Install

# Requires: pkg-config libasound2-dev libatk1.0-dev libgtk-3-dev
cargo install --git https://github.com/jflaflamme/voxtty

Option 3: Build from Source

git clone https://github.com/jflaflamme/voxtty.git
cd voxtty
cargo build --release
sudo cp target/release/voxtty /usr/local/bin/

Build Dependencies (Ubuntu/Debian)

sudo apt install pkg-config libasound2-dev libatk1.0-dev libgtk-3-dev libgdk-pixbuf2.0-dev libssl-dev

Option 3: Systemd User Service (Auto-start on Login)

For automatic startup with realtime transcription:

# 1. Install binary
sudo cp target/release/voxtty /usr/local/bin/

# 2. Create environment file with API keys
mkdir -p ~/.config/voxtty
cat > ~/.config/voxtty/env << 'EOF'
ELEVENLABS_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here
EOF
chmod 600 ~/.config/voxtty/env

# 3. Create systemd service
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/voxtty.service << 'EOF'
[Unit]
Description=voxtty voice typing
After=graphical-session.target

[Service]
EnvironmentFile=%h/.config/voxtty/env
ExecStart=/usr/local/bin/voxtty --tray --auto --realtime --elevenlabs
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target
EOF

# 4. Enable and start
systemctl --user daemon-reload
systemctl --user enable --now voxtty

# 5. Check status
systemctl --user status voxtty

Service management:

systemctl --user status voxtty    # Check status
systemctl --user stop voxtty      # Stop
systemctl --user restart voxtty   # Restart
journalctl --user -u voxtty -f    # Watch logs

Optional: Desktop Entry (App Menu)

cat > ~/.local/share/applications/voxtty.desktop << 'EOF'
[Desktop Entry]
Name=voxtty
Comment=Voice to text typing
Exec=voxtty --tray --auto
Icon=audio-input-microphone
Terminal=false
Type=Application
Categories=Utility;Accessibility;
EOF

Option 4: Build Your Own Debian Package

For: Creating custom packages or contributing

Prerequisites

# Install build dependencies
sudo apt install debhelper-compat cargo rustc pkg-config libasound2-dev libdbus-1-dev

Build Process

The Debian package is built using standard Debian packaging tools:

# Method 1: Use the build script (recommended)
./packaging/scripts/build-deb.sh

# Method 2: Manual build (same as the script)
cargo build --release
dpkg-buildpackage -rfakeroot -us -uc

# The package will be created in the parent directory
ls -lh ../voxtty_*.deb

What Gets Built

The build process creates:

  • Binary package: voxtty_0.1.0-1_amd64.deb - Main installable package
  • Debug symbols: voxtty-dbgsym_0.1.0-1_amd64.deb - Debug symbols (optional)
  • Build artifacts: .buildinfo, .changes files - Build metadata

Package Contents

/usr/bin/voxtty              # Main executable
/usr/share/doc/voxtty/       # Documentation
/usr/share/man/man1/            # Man pages (if included)

Install Your Custom Package

# Install the package
sudo dpkg -i ../voxtty_*.deb

# If dependencies are missing, fix them
sudo apt-get install -f

# Verify installation
which voxtty
voxtty --version

Debian Packaging Files

The package is configured via files in the debian/ directory:

  • debian/control - Package metadata, dependencies, and description
  • debian/rules - Build instructions (uses dh with Cargo)
  • debian/changelog - Version history and release notes
  • debian/install - Files to include in the package
  • debian/compat - Debhelper compatibility level

Customizing the Package

To modify the package:

  1. Change version: Edit debian/changelog

    dch -v 0.2.0-1 "New release with feature X"
    
  2. Add dependencies: Edit debian/control

    Depends: ${shlibs:Depends}, ${misc:Depends}, your-new-dependency
    
  3. Modify build: Edit debian/rules

    override_dh_auto_build:
        cargo build --release --features your-feature
    
  4. Rebuild package:

    ./packaging/scripts/build-deb.sh
    

Troubleshooting Package Build

Build fails with missing dependencies?

sudo apt-get build-dep .

Clean build artifacts?

cargo clean
debian/rules clean
rm -f ../voxtty_*

Test package without installing?

dpkg-deb --contents ../voxtty_*.deb
dpkg-deb --info ../voxtty_*.deb

⚙️ Setup

1. Install Whisper Backend

Option A: whisper.cpp (Recommended for local use)

# Clone and build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make

# Download a model (tiny.en is fastest, small.en is more accurate)
bash ./models/download-ggml-model.sh tiny.en

# Start the server
./server -l en -m models/ggml-tiny.en.bin --port 7777 --convert

Option B: Speaches API (Recommended - Better Performance & Extensibility)

Speaches provides superior performance and flexibility compared to whisper.cpp. On an i7 CPU (no GPU), expect ~2 second latency with the basic model.

# Quick start with Docker (CPU-only)
docker run -d \
  --name speaches \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/speaches-ai/speaches:latest

# Or with docker-compose
cat > docker-compose.yml <<EOF
services:
  speaches:
    image: ghcr.io/speaches-ai/speaches:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - TRANSCRIPTION_MODEL_ID=Systran/faster-distil-whisper-small.en
EOF

docker-compose up -d

# IMPORTANT: Initial Speaches Configuration
# After starting Speaches for the first time, you must configure it:

# 1. Set the base URL for Speaches API
export SPEACHES_BASE_URL="http://localhost:8000"

# 2. Check available models in the registry
curl "$SPEACHES_BASE_URL/v1/registry"

# 3. Download and activate your chosen model (first-time setup)
curl "$SPEACHES_BASE_URL/v1/models/Systran/faster-distil-whisper-small.en" -X POST

# 4. Configure voxtty to use Speaches
export SPEACHES_BASE_URL="http://localhost:8000/v1/audio/transcriptions"
export TRANSCRIPTION_MODEL_ID="Systran/faster-distil-whisper-small.en"

# 5. Test the connection (requires a test audio file)
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@test.wav" \
  -F "model=Systran/faster-distil-whisper-small.en"

Initial Setup Notes:

  • The /v1/registry endpoint shows all available Whisper models
  • The /v1/models/{model_name} POST endpoint downloads and activates a model
  • Model download happens once; subsequent starts use the cached model
  • Choose model size based on your needs: tiny.en (fastest) → small.en (balanced) → medium.en (accurate)

Performance Notes:

  • CPU-only (i7): ~2 seconds latency with basic model
  • GPU-enabled: Sub-second latency possible
  • Model sizes: tiny.en (fastest) → small.en (balanced) → medium.en (accurate)
  • 100% Offline: No internet required after initial model download

2. Configure ydotool

# Add to ~/.bashrc or ~/.zshrc
export YDOTOOL_SOCKET=/tmp/.ydotool_socket

# Enable and start ydotool service
sudo systemctl enable ydotool.service
sudo systemctl start ydotool.service

# Verify it's running
sudo systemctl status ydotool.service

# Test typing
ydotool type "Hello, World!"

3. Verify Audio Input (Important!)

Before using voxtty, verify your microphone works correctly with the built-in echo test:

# Run echo test - speak and hear your voice played back
voxtty --echo-test

# Select specific device interactively, then test
voxtty --select-device --echo-test

# Test with debug output to see audio levels
voxtty --echo-test --debug

Echo Test Mode: Speak into your microphone, pause, and you'll hear your recording played back. This confirms:

  • ✅ Microphone is working
  • ✅ Audio levels are correct
  • ✅ VAD (Voice Activity Detection) is triggering properly
  • ✅ No audio driver issues

🎯 Usage

Basic Usage

# Start with default settings (whisper.cpp backend)
voxtty

# Start with system tray icon
voxtty --tray

# Use Speaches API backend
voxtty --speaches

# Enable debug output
voxtty --debug

Advanced Usage

# Interactive device selection with debug output
voxtty --select-device --debug

# Echo test with specific device
voxtty --select-device --echo-test

# Speaches API with tray control
voxtty --speaches --tray

Realtime Streaming Mode

For the lowest latency (~150ms), use realtime WebSocket streaming:

# Realtime with Speaches (self-hosted, free)
voxtty --realtime --speaches --tray

# Realtime with ElevenLabs (cloud, requires API key)
export ELEVENLABS_API_KEY=your_key_here
voxtty --realtime --elevenlabs --tray

# Realtime with OpenAI (cloud, requires API key)
export OPENAI_API_KEY=your_key_here
voxtty --realtime --openai --tray

# Realtime with voice commands (pause/resume/mode switching)
voxtty --realtime --speaches --auto --tray

Realtime Features:

  • Audio feedback sounds for pause/resume/mode changes
  • Auto-reconnects if WebSocket connection drops
  • Tray tooltip shows connection status
  • Voice commands work continuously (no need to wait for silence)

Command-Line Options

Option Description
--echo-test Run audio echo test - Speak and hear playback to verify microphone
--select-device Interactively choose audio input device
--debug Enable detailed debug logging (shows VAD triggers, audio levels)
--speaches Use Speaches backend instead of whisper.cpp (default)
--tray Enable system tray icon with click-to-toggle control
--tui Enable Terminal UI (TUI) mode - Full-screen terminal interface
--realtime Enable realtime WebSocket streaming (~150ms latency)
--elevenlabs Use ElevenLabs for realtime transcription (requires API key)
--openai Use OpenAI for transcription
--assistant Enable assistant modes with wake word activation
--auto Enable voice commands without full assistant mode
--tui-output Enable text output in TUI mode (types to active application)
--mcp Enable MCP tool calling (loads ~/.config/voxtty/mcp_servers.toml or .mcp.json)
--mock-mcp Use built-in mock MCP server for testing (cannot combine with --mcp)
--bidirectional Enable bidirectional conversation with TTS responses

Configuration Priority: CLI flags → Environment variables → Defaults

Terminal UI (TUI) Mode

Launch voxtty with a consolidated single-screen dashboard:

# Launch TUI in demo mode
voxtty --tui

# TUI with specific backend
voxtty --tui --speaches
voxtty --tui --realtime --elevenlabs

Single Dashboard View:

┌─────────────────────────────────────────────────┐
│ voxtty | Dictation | Speaches | LISTENING       │  ← Status bar
├──────────────────┬──────────────────────────────┤
│ Live Audio       │ Configuration                │
│ ████████         │ Model: GPT-4o-mini           │
│ VAD: ● ACTIVE    │ [m] Select Model             │
│ Device: Default  │ [d] Select Device            │
├──────────────────┴──────────────────────────────┤
│ Last Transcription (5s ago)                     │
│ "testing one two three..."                      │
├──────────────────┬──────────────────────────────┤
│ Mode Selection   │ Actions                      │
│ [1] ▶ Dictation  │ [p] Pause                    │
│ [2]   Assistant  │ [e] Echo Test                │
│ [3]   Code       │                              │
└──────────────────┴──────────────────────────────┘
[q]Quit  [?]Help  [1-3]Mode  [p]Pause  [e]Echo

Everything At a Glance:

  • Live audio visualization - Real-time voice level bar graph
  • VAD indicator - Voice Activity Detection status (● ACTIVE / ○ Inactive)
  • Last transcription - Most recent text with timestamp
  • Mode switcher - Quick [1-3] keys to switch modes
  • Quick actions - One-key access to echo test, pause, device selection
  • Model info - Current AI model configuration

Keyboard Shortcuts:

  • 1-3 - Switch mode (Dictation/Assistant/Code)
  • p or Space - Pause/Resume listening
  • e - Run echo test
  • m - Select AI model
  • d - Select audio device
  • ? or h - Toggle help screen
  • q or Esc - Quit

No Navigation Needed - All controls visible on one screen!

🎮 Controls

System Tray Icon

The tray icon shows a colored circle with a letter indicating the current mode:

Icon Color Meaning
D 🟢 Green Dictation mode (active)
A 🔵 Blue Assistant mode (active)
C 🟣 Purple Code mode (active)
D/A/C 🟠 Orange Paused (listening for "resume")
D/A/C ⚫ Gray Disabled (click to enable)
  • Left Click - Toggle voice typing on/off
  • Right Click - Menu to switch modes (when --assistant or --auto enabled)

Voice Commands

Wake words for hands-free control (requires --auto or --assistant flag):

Command Wake Words
Dictation Mode "dictation mode", "normal mode", "typing mode", "type mode"
Assistant Mode "hey assistant", "assistant mode"
Code Mode "code mode", "coding mode", "write code"
Pause "pause", "stop listening", "go to sleep"
Resume "resume", "start listening", "wake up"

🔧 Configuration

voxtty uses a layered configuration system for maximum flexibility:

Configuration Layers (Priority Order)

1. CLI Flags (highest)
2. Environment Variables
3. Config File (~/.config/voxtty/config.toml)
4. Auto-Detection (ydotool socket)
5. Built-in Defaults (lowest)

Config File

voxtty automatically creates ~/.config/voxtty/config.toml on first run:

# ydotool socket path (auto-detected if not specified)
ydotool_socket = "/run/user/1000/.ydotool_socket"

# Speaches backend
speaches_base_url = "http://localhost:8000/v1/audio/transcriptions"
transcription_model_id = "Systran/faster-distil-whisper-small.en"

# whisper.cpp backend
whisper_url = "http://127.0.0.1:7777/inference"

Environment Variables (Override Config File)

export YDOTOOL_SOCKET=/run/user/$(id -u)/.ydotool_socket
export SPEACHES_BASE_URL=http://localhost:8000/v1/audio/transcriptions
export TRANSCRIPTION_MODEL_ID=Systran/faster-distil-whisper-small.en

Backend Selection

Backend CLI Flag Default URL Configuration
whisper.cpp (default) http://127.0.0.1:7777/inference Config file or env var
Speaches --speaches http://localhost:8000/v1/audio/transcriptions Config file or env var

Privacy Summary by Component

Quick reference for privacy-conscious users:

Component Backend Privacy Internet Required CLI Flag
Transcription whisper.cpp 🔒 100% Local No (default)
Speaches 🔒 100% Local No --speaches
Speaches Realtime 🔒 100% Local No --realtime --speaches
OpenAI Realtime ☁️ Cloud Yes --realtime --openai
ElevenLabs ☁️ Cloud Yes --realtime --elevenlabs
LLM (Assistant/Code) Ollama 🔒 100% Local No --llm ollama
Anthropic Claude ☁️ Cloud Yes --llm anthropic
OpenAI GPT ☁️ Cloud Yes --llm openai
Google Gemini ☁️ Cloud Yes --llm google
DeepSeek ☁️ Cloud Yes --llm deepseek

Privacy Tip: For complete privacy, use:

# 100% offline voice typing
voxtty --speaches --tray

# 100% offline with AI assistance
voxtty --speaches --assistant --llm ollama --tray

⚠️ Important: ydotool Setup

Ubuntu's ydotool package is BROKEN. You MUST build from source:

git clone https://github.com/ReimuNotMoe/ydotool.git
cd ydotool && mkdir build && cd build
cmake -DSYSTEMD_USER_SERVICE=ON ..
make -j $(nproc) && sudo make install
systemctl --user enable --now ydotoold.service

📖 See the relevant sections in this README for setup and configuration details.

Audio Tuning

If you experience issues with voice detection:

  1. Recording never stops - Microphone volume too high

    • Lower mic volume in system settings
    • Increase silence threshold in code
  2. Recording doesn't start - Microphone volume too low

    • Increase mic volume in system settings
    • Decrease amplitude threshold in code
  3. Background noise triggers recording - Environment too noisy

    • Use push-to-talk via hotkey toggle
    • Increase VAD sensitivity

🏗️ Architecture

What voxtty IS (and what it's NOT)

voxtty is a voice-to-text application that listens to your microphone and types text system-wide. It's designed for direct user interaction, not as a protocol server.

voxtty supports MCP (Model Context Protocol) tool calling, allowing the LLM to invoke external tools during conversation — check the weather, roll dice, run calculations, control smart home devices, and more.

Quick Start

# Test with built-in mock MCP server
./target/release/voxtty --mock-mcp --assistant --bidirectional --realtime --elevenlabs

# Use your own MCP servers
./target/release/voxtty --mcp --assistant --bidirectional --realtime --elevenlabs

MCP Server Configuration

voxtty supports two config formats:

Option 1: Native TOML (~/.config/voxtty/mcp_servers.toml):

[[servers]]
name = "weather"
command = "python3"
args = ["-m", "weather_mcp_server"]

[servers.env]
API_KEY = "your-api-key"

[[servers]]
name = "home-assistant"
command = "node"
args = ["path/to/ha-mcp-server.js"]

Option 2: Claude Code format (.mcp.json in project directory — auto-detected as fallback):

{
  "mcpServers": {
    "weather": {
      "type": "stdio",
      "command": "python3",
      "args": ["-m", "weather_mcp_server"],
      "env": { "API_KEY": "your-api-key" }
    }
  }
}

Each server is spawned as a child process and communicates via JSON-RPC over stdio. On startup, voxtty discovers available tools via tools/list and makes them available to the LLM alongside built-in tools (speak, type_text, switch_mode, process_command).

How It Works

  1. voxtty spawns each MCP server and sends initialize + tools/list
  2. Discovered tools are added to the LLM's tool definitions
  3. When the LLM calls an MCP tool, voxtty executes it via tools/call and feeds the result back
  4. The LLM uses the result to formulate a spoken or typed response
  5. Max 5 tool call iterations per turn (prevents infinite loops)

Built-in Mock Server (--mock-mcp)

For testing without external servers:

  • get_time — Current date/time
  • calculate — Math expressions (sqrt, trig, pi)
  • weather — Mock weather for common cities
  • random_fact — Random fun facts
  • dice_roll — Standard notation (2d6, 1d20+3)
  • echo — Echo test

Writing Your Own MCP Server

Any program that reads JSON-RPC from stdin and writes to stdout works. See test_mcp_server.py for a complete Python example. The protocol follows the MCP specification.

Barge-in (TTS Interruption)

When using bidirectional mode, speaking while the AI is talking interrupts playback immediately. The interrupt triggers on partial transcription (~200ms), enabling natural conversational flow without waiting for the AI to finish.

Core Components

  • Audio Capture - CPAL for cross-platform audio input
  • Voice Detection - WebRTC VAD + amplitude threshold
  • Transcription - Whisper.cpp or Speaches API
  • Text Input - ydotool for system-wide typing
  • UI Controls - ksni (system tray)
  • MCP Tools - External tool integration via stdio JSON-RPC

Audio Pipeline

Microphone → CPAL → VAD → WAV Buffer → Whisper AI → ydotool → Text Output

Detection Algorithm

  1. Capture audio in 30ms frames at 16kHz
  2. Run WebRTC VAD on each frame
  3. Check amplitude threshold (>1000)
  4. Require 200ms of speech to start
  5. Wait 1000ms of silence to stop
  6. Transcribe and type result

🐛 Troubleshooting

Quick Fixes

Audio not working?

voxtty --echo-test

Transcription failing?

# Check backend is running
docker ps | grep speaches          # For Speaches
curl http://127.0.0.1:7777/        # For whisper.cpp

Text not typing?

# Check ydotool
systemctl --user status ydotoold.service
ydotool type "test"

Need more help?

# Run with debug output
voxtty --debug --speaches

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Development Setup

# Clone repository
git clone https://github.com/jflaflamme/voxtty.git
cd voxtty

# Build in debug mode
cargo build

# Run with debug output
cargo run -- --debug --echo-test

# Run tests
cargo test

# Check code quality
cargo clippy
cargo fmt --check

📚 Documentation

All documentation is contained in this README. Additional detailed guides coming soon!

📝 License

This project is licensed under the GNU General Public License v2.0 - see the LICENSE file for details.

🙏 Acknowledgments

🔗 Links


Made with ❤️ by Jeff Laflamme

The power of whisper — your voice commands