car-voice
Voice I/O capability for the Common Agent Runtime (CAR).
What it does
Channel-neutral microphone capture, voice activity detection, speech-to-text, text-to-speech, and audio playback. Any CAR-based agent or channel (CLI, GUI, IDE plug-in) can consume this crate without pulling in a UI shell.
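
As a rough illustration of what channel-neutral consumption could look like, the sketch below models the four `VoiceEvent` variants named in the module map and renders them as plain status lines that a CLI, GUI, or IDE plug-in could each display in its own way. The `String` payload on `Transcript`, the `render` and `drain` helpers, and the use of a standard-library channel are all assumptions made for this example, not the crate's actual API.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical mirror of the crate's event enum. The variant names come
/// from the module map; the `String` payload is an assumption.
#[derive(Debug, Clone, PartialEq)]
pub enum VoiceEvent {
    SpeechStart,
    SpeechEnd,
    Transcript(String),
    BargeIn,
}

/// Turn one event into a UI-agnostic status line (illustrative helper).
pub fn render(event: &VoiceEvent) -> String {
    match event {
        VoiceEvent::SpeechStart => "[listening]".to_string(),
        VoiceEvent::SpeechEnd => "[processing]".to_string(),
        VoiceEvent::Transcript(text) => format!("you: {text}"),
        VoiceEvent::BargeIn => "[interrupted]".to_string(),
    }
}

/// Drain every event currently queued on the channel without blocking.
pub fn drain(rx: &Receiver<VoiceEvent>) -> Vec<String> {
    rx.try_iter().map(|event| render(&event)).collect()
}
```

Because the handler only sees events and returns strings, the same loop can back any front end; the channel decides how to present the lines.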
Module map
| Module | Purpose |
|---|---|
| config | VoiceConfig: provider selection, VAD tuning, mode |
| error | VoiceError |
| events | VoiceEvent enum: SpeechStart / SpeechEnd / Transcript / BargeIn |
| stt | SttProvider trait |
| tts | Speaker trait + raw playback helper |
| provider | Factories that build STT/TTS from config |
| elevenlabs_stt / elevenlabs_tts | ElevenLabs cloud providers |
| whisper_cpp_stt | In-process Whisper STT via whisper.cpp (Metal on Apple Silicon) |
| local_tts | Local OpenAI-compatible TTS (MLX-Whisper, mlx-audio Kokoro/Qwen3-TTS) |
| listener | Listener trait + cross-platform CpalListener |
| voice_processing_listener | macOS VoiceProcessingIO listener: hardware AEC, AGC, barge-in |
| voice_audio_mixer | Software mixer feeding VPIO bus 0 reference signal so AEC has something to subtract |
| vad | Energy-based VAD with adaptive noise floor + runtime threshold boost |
| enrollment | Speaker voiceprint enrollment + per-segment role classification |
| narration | TARS-style commentary helpers, pure functions |
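
To make the `vad` row concrete, here is a minimal sketch of an energy-based detector with an adaptive noise floor and a runtime threshold boost. It is not the crate's implementation: the `EnergyVad` type, its field names, and every constant (adaptation rate, speech-to-noise ratio, minimum floor) are hypothetical choices for illustration.

```rust
/// Smallest noise floor allowed, so the threshold never collapses to zero.
const MIN_FLOOR: f32 = 1e-4;

/// Hypothetical energy-based VAD; names and constants are illustrative only.
pub struct EnergyVad {
    noise_floor: f32, // running estimate of background RMS
    adapt_rate: f32,  // exponential-moving-average rate for the floor
    ratio: f32,       // a frame counts as speech above noise_floor * ratio
    boost: f32,       // runtime multiplier applied on top of the ratio
}

impl EnergyVad {
    pub fn new(initial_floor: f32) -> Self {
        Self {
            noise_floor: initial_floor.max(MIN_FLOOR),
            adapt_rate: 0.05,
            ratio: 3.0,
            boost: 1.0,
        }
    }

    /// Raise (or reset to 1.0) the threshold at runtime; values below 1.0
    /// are clamped so the boost can only make detection stricter.
    pub fn set_boost(&mut self, boost: f32) {
        self.boost = boost.max(1.0);
    }

    pub fn threshold(&self) -> f32 {
        self.noise_floor * self.ratio * self.boost
    }

    /// Classify one frame. The noise floor adapts only on non-speech
    /// frames, so speech does not drag the background estimate upward.
    pub fn is_speech(&mut self, frame: &[f32]) -> bool {
        let energy = rms(frame);
        let speech = energy > self.threshold();
        if !speech {
            let updated = self.noise_floor + self.adapt_rate * (energy - self.noise_floor);
            self.noise_floor = updated.max(MIN_FLOOR);
        }
        speech
    }
}

/// Root-mean-square energy of a frame of samples in [-1.0, 1.0].
pub fn rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = frame.iter().map(|s| s * s).sum();
    (sum_sq / frame.len() as f32).sqrt()
}
```

One plausible use of the boost, assumed here rather than stated by the source, is to raise the threshold while the agent's own TTS is playing so that residual echo is less likely to register as user speech, then reset it to 1.0 afterwards.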
Where it fits
Foundation for car-meeting (multi-source meeting capture) and the WebSocket voice.* methods. Speech runtime install / doctor / smoke commands live in car speech (see car-cli).