voxtty
The power of whisper — your voice commands.
voxtty turns your voice into action. Dictate, code, command, and converse — switch modes hands-free, choose your backend (local Whisper or cloud AI), and type anywhere. Private by default, limitless by design.
Built in Rust. Runs offline. Your voice never leaves unless you say so.
🚀 Quick Start
Install
# Option 1: One-line installer (downloads pre-built binary or builds from source)
# Option 2: Cargo (requires Rust toolchain + system deps)
# Option 3: Build from source
Setup
# 1. Setup ydotool (required for typing into applications)
# add to ~/.bashrc
# 2. Start Speaches backend (Docker)
# 3. Test your microphone
# 4. Start voice typing
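A sketch of the four steps with assumed commands (the Speaches Docker image tag and the socket path are assumptions; the voxtty flags are documented later in this README):

```shell
# 1. ydotool socket (add the export to ~/.bashrc)
export YDOTOOL_SOCKET="$HOME/.ydotool_socket"

# 2. Start the Speaches backend (image name is an assumption)
docker run -d -p 8000:8000 ghcr.io/speaches-ai/speaches:latest-cpu

# 3. Test your microphone
voxtty --echo-test

# 4. Start voice typing with the tray icon
voxtty --speaches --tray
```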
That's it! Click the tray icon to toggle voice typing on/off.
Optional: AI Assistant Mode
# 1. Select an AI model (interactive)
# Choose from: OpenAI, Anthropic, Google, DeepSeek, Ollama (local/free), OpenRouter
# 2. Use assistant mode with wake words OR tray menu
# Voice commands: Say "hey assistant" for writing help, "code mode" for code
# GUI: Click tray icon → Select mode (Dictation/Assistant/Code)
# Note: Privacy warnings shown automatically when using cloud AI (OpenAI, etc.)
📖 For detailed configuration, model selection, and troubleshooting, see the sections below in this README.
🎯 Inspiration & Evolution
voxtty was inspired by themanyone/voice_typing, a brilliant bash-based voice typing solution. While the original project demonstrated the power of offline voice typing, voxtty takes it further by:
- Rewritten in Rust - Memory-safe, fast, and reliable
- Speaches Backend - Uses Speaches AI for a more extensible and maintainable transcription backend
- Better Performance - ~2 seconds latency on i7 CPU (no GPU required) with basic model
- Enhanced UX - System tray integration for seamless control
- Production Ready - Proper error handling, device selection, and configuration options
Why Speaches?
Speaches provides a superior backend option compared to direct whisper.cpp integration:
- OpenAI-Compatible API - Standard REST interface for easy integration
- Docker Support - Run in containers with consistent environments
- CPU Optimized - Excellent performance even without GPU (~2s latency on i7)
- Model Flexibility - Easy switching between different Whisper model sizes
- Network Ready - Can run locally or on a dedicated transcription server
- Better Extensibility - Clean API makes it easy to add features and improvements
Backend Comparison
| Feature | whisper.cpp | Speaches | Realtime (WebSocket) |
|---|---|---|---|
| Setup | Manual build | Docker one-liner | API key or self-hosted |
| Latency | ~3-4s | ~2s | ~150ms |
| Privacy | 100% offline | 100% offline | Depends on provider |
| Providers | Local only | Local only | Speaches, ElevenLabs, OpenAI |
| Best For | Minimal setup | Production use | Lowest latency |
Realtime Providers:
- Speaches - Self-hosted, free, ~150ms latency, 🔒 100% LOCAL (privacy-preserving)
- ElevenLabs - Cloud, excellent accuracy, requires API key, ☁️ CLOUD (sends audio to third-party)
- OpenAI - Cloud, GPT-4o transcription, requires API key, ☁️ CLOUD (sends audio to third-party)
✨ Features
🔒 Privacy First
- 100% Offline Processing - All transcription happens locally using Whisper AI
- No Cloud Services - Your voice never leaves your machine (for dictation mode)
- Privacy Warnings - Automatic alerts when using cloud AI services (Assistant/Code modes)
- No Data Collection - Zero telemetry, zero tracking
- Self-Hosted - Run Speaches backend in Docker on your own hardware
- Network Isolated - Works completely offline, no internet required
- Local AI Option - Use Ollama for 100% offline Assistant/Code modes
⚡ Realtime Streaming Mode
- ~150ms Latency - WebSocket-based streaming for near-instant transcription
- Multiple Providers - ElevenLabs, OpenAI Realtime, or Speaches WebSocket
- Auto-Reconnection - Automatically reconnects if connection drops
- Connection Status - Tray tooltip shows [Disconnected] when offline
🎤 Smart Voice Detection
- Voice Activity Detection (VAD) - Automatically detects when you start and stop speaking
- WebRTC VAD Engine - Industry-standard voice detection with low false positives
- Amplitude Threshold - Dual detection system for reliable speech capture
- Configurable Silence Detection - Customizable pause duration before transcription
⌨️ System-Wide Integration
- Universal Typing - Works in any application via ydotool
- No GUI Required - Runs in TTY, X11, Wayland, or any Linux environment
- Instant Text Insertion - Transcribed text appears directly where you're typing
🎛️ Flexible Control
- System Tray Icon - Quick toggle on/off with visual status indicator (click to enable/disable)
- GUI Mode Switching - Switch between Dictation/Assistant/Code modes from the tray menu (when --assistant is enabled)
- Voice Commands - Wake words for hands-free mode switching ("hey assistant", "code mode", "dictation mode")
- Audio Feedback - Notification sounds for pause/resume/mode changes
- Multiple Backends - Support for whisper.cpp, Speaches, OpenAI, or ElevenLabs
- Interactive Device Selection - Choose your preferred microphone
- Always Available - Runs in background, ready when you need it
🔧 Developer Friendly
- Echo Test Mode - Built-in --echo-test CLI flag to verify audio input with instant playback
- Debug Mode - Detailed logging for troubleshooting with the --debug flag
- Flexible Configuration - Environment variables and CLI flags for easy customization
- Backend Agnostic - Switch backends with a single flag, no code changes
- Clean Rust Codebase - Modern, safe, and maintainable
📋 Requirements
Core Dependencies
- Whisper AI Backend (choose one):
- whisper.cpp - Lightweight C++ server
- Speaches - Modern API server (recommended)
- Audio: ALSA (libasound2-dev)
- System Integration: ydotool (for typing), DBus (for system tray)
- Build Tools: Rust 1.70+, cargo, pkg-config
Optional: AI Assistant Mode
- LLM Provider (choose one): OpenAI, Anthropic, Google, DeepSeek, Ollama (local, free), or OpenRouter
Runtime Dependencies
- ydotool - System-wide input simulation
- alsa-utils - Audio utilities
- pulseaudio or pipewire-pulse - Audio server (recommended)
🚀 Installation
Choose Your Installation Method
| Method | Best For | Requirements |
|---|---|---|
| curl \| bash | Quick install, end users | curl or wget |
| cargo install | Rust users | Rust toolchain + system deps |
| Build from source | Development, contributing | Rust toolchain + system deps |
| Debian Package | Custom packaging | debhelper, cargo |
Option 1: One-line Installer (Recommended)
Downloads a pre-built binary from GitHub Releases, or builds from source as a fallback.
Installs to ~/.local/bin by default. Override with INSTALL_DIR:
INSTALL_DIR=/usr/local/bin
Option 2: Cargo Install
# Requires: pkg-config libasound2-dev libatk1.0-dev libgtk-3-dev
Option 3: Build from Source
Build Dependencies (Ubuntu/Debian)
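A sketch of the from-source build; the dependency list mirrors the one noted under the Cargo option, and the repository URL comes from the Links section:

```shell
REPO_URL="https://github.com/jflaflamme/voxtty.git"

# Build dependencies (Ubuntu/Debian)
sudo apt-get install -y pkg-config libasound2-dev libatk1.0-dev libgtk-3-dev

# Clone and build
git clone "$REPO_URL"
cd voxtty
cargo build --release

# Put the binary on your PATH
install -Dm755 target/release/voxtty "$HOME/.local/bin/voxtty"
```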
Systemd User Service (Auto-start on Login)
For automatic startup with realtime transcription:
# 1. Install binary
# 2. Create environment file with API keys
# 3. Create systemd service
# 4. Enable and start
# 5. Check status
Service management:
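A sketch of steps 2 through 5 plus service management; the unit name, env-file location, variable names, and ExecStart flags are all assumptions to adapt:

```shell
# 2. Environment file with API keys (variable name is an assumption)
mkdir -p ~/.config/voxtty
cat > ~/.config/voxtty/voxtty.env <<'EOF'
ELEVENLABS_API_KEY=your-key-here
EOF

# 3. Systemd user unit
mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/voxtty.service <<'EOF'
[Unit]
Description=voxtty voice typing

[Service]
EnvironmentFile=%h/.config/voxtty/voxtty.env
ExecStart=%h/.local/bin/voxtty --realtime --speaches --tray
Restart=on-failure

[Install]
WantedBy=default.target
EOF

# 4. Enable and start
systemctl --user daemon-reload
systemctl --user enable --now voxtty.service

# 5. Check status / manage the service
systemctl --user status voxtty.service --no-pager
journalctl --user -u voxtty.service --no-pager
```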
Optional: Desktop Entry (App Menu)
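A minimal desktop entry sketch (the icon name and flags are assumptions):

```shell
mkdir -p ~/.local/share/applications
cat > ~/.local/share/applications/voxtty.desktop <<'EOF'
[Desktop Entry]
Type=Application
Name=voxtty
Comment=Voice typing with Whisper
Exec=voxtty --speaches --tray
Icon=audio-input-microphone
Terminal=false
Categories=Utility;Accessibility;
EOF
```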
Option 4: Build Your Own Debian Package
For: Creating custom packages or contributing
Prerequisites
# Install build dependencies
Build Process
The Debian package is built using standard Debian packaging tools:
# Method 1: Use the build script (recommended)
# Method 2: Manual build (same as the script)
# The package will be created in the parent directory
What Gets Built
The build process creates:
- Binary package: voxtty_0.1.0-1_amd64.deb - the main installable package
- Debug symbols: voxtty-dbgsym_0.1.0-1_amd64.deb - debug symbols (optional)
- Build artifacts: .buildinfo and .changes files - build metadata
Package Contents
/usr/bin/voxtty # Main executable
/usr/share/doc/voxtty/ # Documentation
/usr/share/man/man1/ # Man pages (if included)
Install Your Custom Package
# Install the package
# If dependencies are missing, fix them
# Verify installation
Debian Packaging Files
The package is configured via files in the debian/ directory:
- debian/control - Package metadata, dependencies, and description
- debian/rules - Build instructions (uses dh with Cargo)
- debian/changelog - Version history and release notes
- debian/install - Files to include in the package
- debian/compat - Debhelper compatibility level
Customizing the Package
To modify the package:
- Change version: edit debian/changelog
- Add dependencies: edit debian/control (e.g. Depends: ${shlibs:Depends}, ${misc:Depends}, your-new-dependency)
- Modify build: edit debian/rules
- Rebuild the package
Troubleshooting Package Build
Build fails with missing dependencies?
Clean build artifacts?
Test package without installing?
⚙️ Setup
1. Install Whisper Backend
Option A: whisper.cpp (lightweight local option)
# Clone and build whisper.cpp
# Download a model (tiny.en is fastest, small.en is more accurate)
# Start the server
Option B: Speaches API (Recommended - Better Performance & Extensibility)
Speaches provides superior performance and flexibility compared to whisper.cpp. On an i7 CPU (no GPU), expect ~2 second latency with the basic model.
# Quick start with Docker (CPU-only)
# Or with docker-compose
# IMPORTANT: Initial Speaches Configuration
# After starting Speaches for the first time, you must configure it:
# 1. Set the base URL for Speaches API
# 2. Check available models in the registry
# 3. Download and activate your chosen model (first-time setup)
# 4. Configure voxtty to use Speaches
# 5. Test the connection (requires a test audio file)
Initial Setup Notes:
- The /v1/registry endpoint lists all available Whisper models
- A POST to /v1/models/{model_name} downloads and activates a model
- Model download happens once; subsequent starts use the cached model
- Choose a model size based on your needs: tiny.en (fastest) → small.en (balanced) → medium.en (accurate)
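The configuration steps above, sketched as curl calls against the documented endpoints (exact request shapes are assumptions; the model name is voxtty's default from its config):

```shell
# 1. Base URL for the Speaches API
SPEACHES_URL="http://localhost:8000"

# 2. Check available models in the registry
curl -s "$SPEACHES_URL/v1/registry"

# 3. Download and activate a model (one-time setup)
curl -s -X POST "$SPEACHES_URL/v1/models/Systran/faster-distil-whisper-small.en"

# 5. Test the connection with a sample audio file
curl -s "$SPEACHES_URL/v1/audio/transcriptions" \
  -F "file=@test.wav" \
  -F "model=Systran/faster-distil-whisper-small.en"

# 4. Then point voxtty at Speaches
voxtty --speaches
```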
Performance Notes:
- CPU-only (i7): ~2 seconds latency with basic model
- GPU-enabled: Sub-second latency possible
- Model sizes: tiny.en (fastest) → small.en (balanced) → medium.en (accurate)
- 100% Offline: No internet required after initial model download
2. Configure ydotool
# Add to ~/.bashrc or ~/.zshrc
# Enable and start ydotool service
# Verify it's running
# Test typing
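A sketch of the ydotool steps (the socket path and service name vary by install; ydotoold.service is an assumption):

```shell
# Add to ~/.bashrc or ~/.zshrc
echo 'export YDOTOOL_SOCKET="$HOME/.ydotool_socket"' >> ~/.bashrc
export YDOTOOL_SOCKET="$HOME/.ydotool_socket"

# Enable and start the ydotool daemon
systemctl --user enable --now ydotoold.service

# Verify it's running
systemctl --user status ydotoold.service --no-pager

# Test typing
ydotool type "hello from voxtty"
```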
3. Verify Audio Input (Important!)
Before using voxtty, verify your microphone works correctly with the built-in echo test:
# Run echo test - speak and hear your voice played back
# Select specific device interactively, then test
# Test with debug output to see audio levels
Echo Test Mode: Speak into your microphone, pause, and you'll hear your recording played back. This confirms:
- ✅ Microphone is working
- ✅ Audio levels are correct
- ✅ VAD (Voice Activity Detection) is triggering properly
- ✅ No audio driver issues
🎯 Usage
Basic Usage
# Start with default settings (whisper.cpp backend)
# Start with system tray icon
# Use Speaches API backend
# Enable debug output
Advanced Usage
# Interactive device selection with debug output
# Echo test with specific device
# Speaches API with tray control
Realtime Streaming Mode
For the lowest latency (~150ms), use realtime WebSocket streaming:
# Realtime with Speaches (self-hosted, free)
# Realtime with ElevenLabs (cloud, requires API key)
# Realtime with OpenAI (cloud, requires API key)
# Realtime with voice commands (pause/resume/mode switching)
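Sketched invocations for the variants above; the API-key environment variable names are assumptions:

```shell
# Realtime with Speaches (self-hosted, free)
voxtty --realtime --speaches

# Realtime with ElevenLabs (cloud)
export ELEVENLABS_API_KEY="your-key"
voxtty --realtime --elevenlabs

# Realtime with OpenAI (cloud)
export OPENAI_API_KEY="your-key"
voxtty --realtime --openai

# Realtime with voice commands (pause/resume/mode switching)
voxtty --realtime --speaches --auto
```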
Realtime Features:
- Audio feedback sounds for pause/resume/mode changes
- Auto-reconnects if WebSocket connection drops
- Tray tooltip shows connection status
- Voice commands work continuously (no need to wait for silence)
Command-Line Options
| Option | Description |
|---|---|
| --echo-test | Run audio echo test: speak and hear playback to verify the microphone |
| --select-device | Interactively choose the audio input device |
| --debug | Enable detailed debug logging (shows VAD triggers, audio levels) |
| --speaches | Use the Speaches backend instead of whisper.cpp (the default) |
| --tray | Enable system tray icon with click-to-toggle control |
| --tui | Enable Terminal UI (TUI) mode: a full-screen terminal interface |
| --realtime | Enable realtime WebSocket streaming (~150ms latency) |
| --elevenlabs | Use ElevenLabs for realtime transcription (requires API key) |
| --openai | Use OpenAI for transcription |
| --assistant | Enable assistant modes with wake-word activation |
| --auto | Enable voice commands without full assistant mode |
| --tui-output | Enable text output in TUI mode (types into the active application) |
| --mcp | Enable MCP tool calling (loads ~/.config/voxtty/mcp_servers.toml or .mcp.json) |
| --mock-mcp | Use the built-in mock MCP server for testing (cannot combine with --mcp) |
| --bidirectional | Enable bidirectional conversation with TTS responses |
Configuration Priority: CLI flags → environment variables → config file → auto-detection → built-in defaults
Terminal UI (TUI) Mode
Launch voxtty with a consolidated single-screen dashboard:
# Launch TUI in demo mode
# TUI with specific backend
Single Dashboard View:
┌─────────────────────────────────────────────────┐
│ voxtty | Dictation | Speaches | LISTENING │ ← Status bar
├──────────────────┬──────────────────────────────┤
│ Live Audio │ Configuration │
│ ████████ │ Model: GPT-4o-mini │
│ VAD: ● ACTIVE │ [m] Select Model │
│ Device: Default │ [d] Select Device │
├──────────────────┴──────────────────────────────┤
│ Last Transcription (5s ago) │
│ "testing one two three..." │
├──────────────────┬──────────────────────────────┤
│ Mode Selection │ Actions │
│ [1] ▶ Dictation │ [p] Pause │
│ [2] Assistant │ [e] Echo Test │
│ [3] Code │ │
└──────────────────┴──────────────────────────────┘
[q]Quit [?]Help [1-3]Mode [p]Pause [e]Echo
Everything At a Glance:
- Live audio visualization - Real-time voice level bar graph
- VAD indicator - Voice Activity Detection status (● ACTIVE / ○ Inactive)
- Last transcription - Most recent text with timestamp
- Mode switcher - Quick [1-3] keys to switch modes
- Quick actions - One-key access to echo test, pause, device selection
- Model info - Current AI model configuration
Keyboard Shortcuts:
- 1-3 - Switch mode (Dictation/Assistant/Code)
- p or Space - Pause/resume listening
- e - Run echo test
- m - Select AI model
- d - Select audio device
- ? or h - Toggle help screen
- q or Esc - Quit
No Navigation Needed - All controls visible on one screen!
🎮 Controls
System Tray Icon
The tray icon shows a colored circle with a letter indicating the current mode:
| Icon | Color | Meaning |
|---|---|---|
| D | 🟢 Green | Dictation mode (active) |
| A | 🔵 Blue | Assistant mode (active) |
| C | 🟣 Purple | Code mode (active) |
| D/A/C | 🟠 Orange | Paused (listening for "resume") |
| D/A/C | ⚫ Gray | Disabled (click to enable) |
- Left Click - Toggle voice typing on/off
- Right Click - Menu to switch modes (when --assistant or --auto is enabled)
Voice Commands
Wake words for hands-free control (requires --auto or --assistant flag):
| Command | Wake Words |
|---|---|
| Dictation Mode | "dictation mode", "normal mode", "typing mode", "type mode" |
| Assistant Mode | "hey assistant", "assistant mode" |
| Code Mode | "code mode", "coding mode", "write code" |
| Pause | "pause", "stop listening", "go to sleep" |
| Resume | "resume", "start listening", "wake up" |
🔧 Configuration
voxtty uses a layered configuration system for maximum flexibility:
Configuration Layers (Priority Order)
1. CLI Flags (highest)
2. Environment Variables
3. Config File (~/.config/voxtty/config.toml)
4. Auto-Detection (ydotool socket)
5. Built-in Defaults (lowest)
Config File
voxtty automatically creates ~/.config/voxtty/config.toml on first run:
# NOTE: key names below are illustrative; check the generated config.toml for the exact names

# ydotool socket path (auto-detected if not specified)
socket_path = "/run/user/1000/.ydotool_socket"

# Speaches backend
speaches_url = "http://localhost:8000/v1/audio/transcriptions"
speaches_model = "Systran/faster-distil-whisper-small.en"

# whisper.cpp backend
whisper_url = "http://127.0.0.1:7777/inference"
Environment Variables (Override Config File)
Backend Selection
| Backend | CLI Flag | Default URL | Configuration |
|---|---|---|---|
| whisper.cpp | (default) | http://127.0.0.1:7777/inference | Config file or env var |
| Speaches | --speaches | http://localhost:8000/v1/audio/transcriptions | Config file or env var |
Privacy Summary by Component
Quick reference for privacy-conscious users:
| Component | Backend | Privacy | Internet Required | CLI Flag |
|---|---|---|---|---|
| Transcription | whisper.cpp | 🔒 100% Local | No | (default) |
| Transcription | Speaches | 🔒 100% Local | No | --speaches |
| Transcription | Speaches Realtime | 🔒 100% Local | No | --realtime --speaches |
| Transcription | OpenAI Realtime | ☁️ Cloud | Yes | --realtime --openai |
| Transcription | ElevenLabs | ☁️ Cloud | Yes | --realtime --elevenlabs |
| LLM (Assistant/Code) | Ollama | 🔒 100% Local | No | --llm ollama |
| LLM (Assistant/Code) | Anthropic Claude | ☁️ Cloud | Yes | --llm anthropic |
| LLM (Assistant/Code) | OpenAI GPT | ☁️ Cloud | Yes | --llm openai |
| LLM (Assistant/Code) | Google Gemini | ☁️ Cloud | Yes | --llm google |
| LLM (Assistant/Code) | DeepSeek | ☁️ Cloud | Yes | --llm deepseek |
Privacy Tip: For complete privacy, use:
# 100% offline voice typing
# 100% offline with AI assistance
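For example (both flags appear in the tables above; the provider value is parameterized for clarity):

```shell
LLM_PROVIDER="ollama"   # local provider keeps the LLM offline too

# 100% offline voice typing
voxtty --speaches --tray

# 100% offline with AI assistance
voxtty --speaches --assistant --llm "$LLM_PROVIDER"
```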
⚠️ Important: ydotool Setup
Ubuntu's packaged ydotool is BROKEN. You must build it from source.
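A from-source sketch; upstream ydotool (ReimuNotMoe/ydotool) is CMake-based, and the dependency list is an assumption for Ubuntu:

```shell
YDOTOOL_REPO="https://github.com/ReimuNotMoe/ydotool.git"

sudo apt-get install -y cmake build-essential scdoc
git clone "$YDOTOOL_REPO"
cd ydotool
mkdir -p build && cd build
cmake .. && make -j"$(nproc)"
sudo make install
```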
📖 See the relevant sections in this README for setup and configuration details.
Audio Tuning
If you experience issues with voice detection:
-
Recording never stops - Microphone volume too high
- Lower mic volume in system settings
- Increase silence threshold in code
-
Recording doesn't start - Microphone volume too low
- Increase mic volume in system settings
- Decrease amplitude threshold in code
-
Background noise triggers recording - Environment too noisy
- Use push-to-talk via hotkey toggle
- Increase VAD sensitivity
🏗️ Architecture
What voxtty IS (and what it's NOT)
voxtty is a voice-to-text application that listens to your microphone and types text system-wide. It's designed for direct user interaction, not as a protocol server.
MCP Tool Calling
voxtty supports MCP (Model Context Protocol) tool calling, allowing the LLM to invoke external tools during conversation: check the weather, roll dice, run calculations, control smart home devices, and more.
Quick Start
# Test with built-in mock MCP server
# Use your own MCP servers
MCP Server Configuration
voxtty supports two config formats:
Option 1: Native TOML (~/.config/voxtty/mcp_servers.toml):
# (table and key names below are illustrative; check your version's example config for exact names)
[[servers]]
name = "weather"
command = "python3"
args = ["-m", "weather_mcp_server"]

[servers.env]
API_KEY = "your-api-key"

[[servers]]
name = "home-assistant"
command = "node"
args = ["path/to/ha-mcp-server.js"]
Option 2: Claude Code format (.mcp.json in project directory — auto-detected as fallback):
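A minimal .mcp.json in the Claude Code format might look like this, mirroring the TOML example above (the weather server module is hypothetical):

```json
{
  "mcpServers": {
    "weather": {
      "command": "python3",
      "args": ["-m", "weather_mcp_server"],
      "env": { "API_KEY": "your-api-key" }
    },
    "home-assistant": {
      "command": "node",
      "args": ["path/to/ha-mcp-server.js"]
    }
  }
}
```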
Each server is spawned as a child process and communicates via JSON-RPC over stdio. On startup, voxtty discovers available tools via tools/list and makes them available to the LLM alongside built-in tools (speak, type_text, switch_mode, process_command).
How It Works
- voxtty spawns each MCP server and sends initialize + tools/list
- Discovered tools are added to the LLM's tool definitions
- When the LLM calls an MCP tool, voxtty executes it via tools/call and feeds the result back
- The LLM uses the result to formulate a spoken or typed response
- A maximum of 5 tool-call iterations per turn prevents infinite loops
Built-in Mock Server (--mock-mcp)
For testing without external servers:
- get_time — Current date/time
- calculate — Math expressions (sqrt, trig, pi)
- weather — Mock weather for common cities
- random_fact — Random fun facts
- dice_roll — Standard notation (2d6, 1d20+3)
- echo — Echo test
Writing Your Own MCP Server
Any program that reads JSON-RPC from stdin and writes to stdout works. See test_mcp_server.py for a complete Python example. The protocol follows the MCP specification.
Barge-in (TTS Interruption)
When using bidirectional mode, speaking while the AI is talking interrupts playback immediately. The interrupt triggers on partial transcription (~200ms), enabling natural conversational flow without waiting for the AI to finish.
Core Components
- Audio Capture - CPAL for cross-platform audio input
- Voice Detection - WebRTC VAD + amplitude threshold
- Transcription - Whisper.cpp or Speaches API
- Text Input - ydotool for system-wide typing
- UI Controls - ksni (system tray)
- MCP Tools - External tool integration via stdio JSON-RPC
Audio Pipeline
Microphone → CPAL → VAD → WAV Buffer → Whisper AI → ydotool → Text Output
Detection Algorithm
1. Capture audio in 30ms frames at 16kHz
2. Run WebRTC VAD on each frame
3. Check the amplitude threshold (>1000)
4. Require 200ms of speech to start
5. Wait for 1000ms of silence to stop
6. Transcribe and type the result
🐛 Troubleshooting
Quick Fixes
Audio not working?
Transcription failing?
# Check backend is running
Text not typing?
# Check ydotool
Need more help?
# Run with debug output
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Development Setup
# Clone repository
# Build in debug mode
# Run with debug output
# Run tests
# Check code quality
📚 Documentation
All documentation is contained in this README. Additional detailed guides coming soon!
📝 License
This project is licensed under the GNU General Public License v2.0 - see the LICENSE file for details.
🙏 Acknowledgments
- OpenAI Whisper - State-of-the-art speech recognition
- whisper.cpp - Efficient C++ implementation
- ydotool - Generic command-line automation tool
- WebRTC VAD - Voice activity detection
🔗 Links
- Repository: https://github.com/jflaflamme/voxtty
- Issues: https://github.com/jflaflamme/voxtty/issues
- Releases: https://github.com/jflaflamme/voxtty/releases
Made with ❤️ by Jeff Laflamme
The power of whisper — your voice commands