adk-audio
Audio intelligence and pipeline orchestration for ADK-Rust agents.
Provides unified traits for Text-to-Speech (TTS), Speech-to-Text (STT), music generation, audio FX/DSP processing, and Voice Activity Detection (VAD), with a composable pipeline system for building voice agent loops, podcast production, transcription, and generative soundscapes.
Installation
[dependencies]
adk-audio = "0.6.0"
Or via the umbrella crate (experimental):
[dependencies]
adk-rust = { version = "0.6.0", features = ["audio"] }
Feature Flags
| Feature | Description | Key dependencies |
|---|---|---|
| `tts` (default) | Cloud TTS providers (ElevenLabs, OpenAI, Gemini, Cartesia) | `reqwest`, `base64` |
| `stt` (default) | Cloud STT providers (Whisper API, Deepgram, AssemblyAI) | `reqwest`, `tokio-tungstenite` |
| `music` | Music generation providers | `reqwest` |
| `fx` | DSP processors (normalizer, resampler, noise, compressor, trimmer, pitch) | `rubato`, `dasp` |
| `vad` | Voice Activity Detection | `webrtc-vad` |
| `mlx` | Local inference model loading (tokenizers + HF Hub, cross-platform) | `tokenizers`, `hf-hub` |
| `onnx` | ONNX Runtime local inference (cross-platform) | `ort`, `tokenizers`, `hf-hub` |
| `kokoro` | Kokoro-82M ONNX TTS with espeak-ng phonemizer | `espeak-rs`, `ndarray` (implies `onnx`) |
| `chatterbox` | Chatterbox ONNX TTS | implies `onnx` |
| `whisper-onnx` | Whisper ONNX STT (base/small/medium/large) | implies `onnx` |
| `distil-whisper` | Distil-Whisper ONNX STT | implies `onnx` |
| `moonshine` | Moonshine ONNX STT | implies `onnx` |
| `qwen3-tts` | Qwen3-TTS native Candle-based TTS (0.6B / 1.7B) | `qwen_tts`, `candle-core`, `hf-hub` |
| `all-onnx` | All ONNX backends (STT + TTS) | combines the above |
| `livekit` | adk-realtime bridge | `livekit-api`, `adk-realtime` |
| `desktop-audio` | Desktop mic capture, speaker playback, VAD turn-taking (PipeWire/ALSA/CoreAudio/WASAPI) | `cpal` (implies `vad`) |
| `streaming` | Streaming support marker | — |
| `all` | All portable features — safe for CI on any platform | everything above |
Core Types
AudioFrame
The canonical audio buffer used throughout the crate — raw PCM-16 LE samples with metadata:
use adk_audio::AudioFrame;
// Create from raw PCM data (duration computed automatically)
let samples: Vec<i16> = vec![0i16; 16_000]; // one second of raw PCM-16 data
let frame = AudioFrame::new(samples, 16_000, 1); // 16kHz mono (argument order illustrative)
println!("duration: {:?}", frame.duration);
// Access raw i16 samples
let samples: &[i16] = &frame.samples;
// Generate silence
let silence = AudioFrame::silence(500, 24_000); // 500ms at 24kHz
// Merge multiple frames into one
let merged = AudioFrame::merge_frames(&[frame, silence]);
Codec
Encode/decode between AudioFrame and external formats:
use adk_audio::{decode, encode, AudioFormat};
// Encode to WAV
let wav_bytes = encode(&frame, AudioFormat::Wav)?;
// Decode from WAV
let frame = decode(&wav_bytes, AudioFormat::Wav)?;
Currently supports Pcm16 (passthrough) and Wav. Other formats (Opus, Mp3, Flac, Ogg) are defined but not yet implemented — check AudioFormat::supports_encode() / supports_decode().
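For formats that are defined but not yet implemented, the support flags let callers fall back gracefully. A minimal sketch, assuming `encode` takes `(&AudioFrame, AudioFormat)` and returns `AudioResult<Vec<u8>>`:

```rust
use adk_audio::{encode, AudioFormat, AudioFrame, AudioResult};

// Encode into the requested format, falling back to WAV when the codec is not implemented yet.
fn save_as(frame: &AudioFrame, format: AudioFormat) -> AudioResult<Vec<u8>> {
    if format.supports_encode() {
        encode(frame, format)
    } else {
        encode(frame, AudioFormat::Wav)
    }
}
```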
Cloud Providers
TTS Providers
All cloud TTS providers implement the TtsProvider trait with synthesize() (batch) and synthesize_stream() (streaming) methods.
| Provider | Description | Env var | Feature |
|---|---|---|---|
| `ElevenLabsTts` | High-quality multilingual voices | `ELEVENLABS_API_KEY` | `tts` |
| `OpenAiTts` | TTS-1 and TTS-1-HD models | `OPENAI_API_KEY` | `tts` |
| `GeminiTts` | Native audio via generateContent | `GEMINI_API_KEY` | `tts` |
| `CartesiaTts` | Sonic-2 low-latency streaming | `CARTESIA_API_KEY` | `tts` |
use adk_audio::{ElevenLabsTts, TtsProvider, TtsRequest};
let tts = ElevenLabsTts::from_env()?;
let request = TtsRequest {
    text: "Hello from adk-audio!".to_string(),
    // optional fields (voice, language, pitch, emotion, output_format) left at their defaults
    ..Default::default()
};
let frame = tts.synthesize(request).await?;
println!("synthesized {} samples", frame.samples.len());
TtsRequest also supports optional language, pitch, emotion (enum: Neutral, Happy, Sad, Angry, Whisper, Excited, Calm), and output_format fields.
All cloud providers accept a CloudTtsConfig for API key and optional base URL override.
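For lower latency, synthesize_stream() yields audio incrementally instead of waiting for the full utterance. A hedged sketch, assuming the stream yields `AudioResult<AudioFrame>` chunks:

```rust
use futures::StreamExt;

// `tts` and `request` as constructed above
let mut stream = tts.synthesize_stream(request).await?;
while let Some(chunk) = stream.next().await {
    let frame = chunk?;
    // forward each partial AudioFrame to the playback sink as it arrives
    println!("chunk: {} samples", frame.samples.len());
}
```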
STT Providers
All cloud STT providers implement the SttProvider trait with transcribe() (batch) and transcribe_stream() (streaming) methods.
| Provider | Description | Env var | Feature |
|---|---|---|---|
| `WhisperApiStt` | OpenAI Whisper transcription | `OPENAI_API_KEY` | `stt` |
| `DeepgramStt` | Nova-2 with diarization and streaming | `DEEPGRAM_API_KEY` | `stt` |
| `AssemblyAiStt` | Universal model with async jobs and streaming | `ASSEMBLYAI_API_KEY` | `stt` |
use adk_audio::{DeepgramStt, SttOptions, SttProvider};
let stt = DeepgramStt::from_env()?;
let opts = SttOptions::default(); // language, diarization, etc. are optional
let transcript = stt.transcribe(frame, opts).await?;
println!("{}", transcript.text);
The Transcript result includes text, per-Word timestamps with confidence, Speaker diarization, overall confidence, and language_detected.
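A hedged sketch of consuming a Transcript, assuming the per-word list is exposed as a `words` field:

```rust
// `transcript` from the example above
println!("{}", transcript.text);
println!("detected language: {:?}", transcript.language_detected);

// per-word timestamps and confidence (field names assumed)
for word in &transcript.words {
    println!("{:?}-{:?} {} ({:.2})", word.start, word.end, word.text, word.confidence);
}
```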
Local Inference
ONNX TTS (cross-platform)
Runs TTS models via ONNX Runtime with CUDA, CoreML, or CPU execution providers:
use adk_audio::OnnxTtsProvider;
let tts = OnnxTtsProvider::default_kokoro().await?;
The OnnxTtsProvider is generic over a Preprocessor trait:
- `TokenizerPreprocessor` — default, uses the HuggingFace `tokenizer.json`
- `KokoroPreprocessor` — espeak-ng phonemizer for Kokoro-82M (requires the `kokoro` feature + system espeak-ng)
Execution providers: OnnxExecutionProvider::Cpu, Cuda, CoreMl, DirectMl.
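A hedged sketch of picking a non-default execution provider, assuming the ONNX TTS side exposes the same config-builder shape as the STT example below (type and setter names illustrative):

```rust
use adk_audio::{OnnxExecutionProvider, OnnxTtsConfig, OnnxTtsProvider};

// request CUDA; unavailable providers typically fall back to CPU
let config = OnnxTtsConfig::builder()
    .execution_provider(OnnxExecutionProvider::Cuda)
    .build();

let tts = OnnxTtsProvider::new(config).await?;
```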
ONNX STT (cross-platform)
Three STT backends behind separate feature flags:
| Backend | Feature | Type |
|---|---|---|
| Whisper (base/small/medium/large) | `whisper-onnx` | `SttBackend::Whisper(WhisperModelSize)` |
| Distil-Whisper (small/medium/large-v3) | `distil-whisper` | `SttBackend::DistilWhisper(DistilWhisperVariant)` |
| Moonshine (tiny/base) | `moonshine` | `SttBackend::Moonshine(MoonshineVariant)` |
use adk_audio::{OnnxSttConfig, OnnxSttProvider, SttBackend, WhisperModelSize};
let config = OnnxSttConfig::builder() // config/provider type names illustrative
    .backend(SttBackend::Whisper(WhisperModelSize::Base))
    .build();
let stt = OnnxSttProvider::new(config).await?;
// Or use the default:
let stt = OnnxSttProvider::default_whisper().await?;
MLX (Apple Silicon)
Local TTS and STT using tokenizers + HF Hub. Full Metal GPU inference via mlx-rs is planned.
use adk_audio::MlxTtsProvider; // provider type name illustrative
let tts = MlxTtsProvider::default_kokoro().await?;
Configurable via MlxTtsConfig and MlxSttConfig, with MlxQuantization options.
Qwen3-TTS (Candle-based)
Native Candle-based TTS supporting 10 languages with predefined speaker voices. Runs on CPU, Metal (macOS), or CUDA.
use adk_audio::{Qwen3Tts, Qwen3TtsVariant, TtsRequest}; // type names illustrative
let tts = Qwen3Tts::new(Qwen3TtsVariant::Small).await?; // 0.6B
let request = TtsRequest {
    text: "Hello from Qwen3-TTS".to_string(),
    voice: Some("vivian".to_string()), // any predefined speaker listed below
    ..Default::default()
};
let frame = tts.synthesize(request).await?;
Variants: Small (0.6B, faster) and Large (1.7B, higher quality). Predefined speakers: vivian, serena, dylan, eric, ryan, aiden (en), uncle_fu (zh), ono_anna (ja), sohee (ko).
Chatterbox TTS
ONNX-based TTS via the chatterbox feature:
use adk_audio::ChatterboxTts; // type name illustrative
Model Registry
LocalModelRegistry handles downloading and caching model weights from HuggingFace Hub:
use adk_audio::LocalModelRegistry;
let registry = LocalModelRegistry::default(); // ~/.cache/adk-audio/models/
let path = registry.get_or_download("org/model-id").await?; // HuggingFace Hub repo id
Custom cache directory: LocalModelRegistry::new("/my/cache").
DSP Processors
Behind the fx feature. All implement the AudioProcessor trait.
| Processor | Description |
|---|---|
| `LoudnessNormalizer` | EBU R128 loudness normalization |
| `Resampler` | Sample rate conversion (8kHz–96kHz) |
| `NoiseSuppressor` | Spectral noise reduction |
| `DynamicRangeCompressor` | Dynamic range compression |
| `SilenceTrimmer` | Leading/trailing silence removal |
| `PitchShifter` | Voice pitch adjustment |
FxChain
Chain processors in series — output of stage N feeds into stage N+1:
use adk_audio::{FxChain, LoudnessNormalizer, SilenceTrimmer};
let chain = FxChain::new()
    .push(SilenceTrimmer::default())
    .push(LoudnessNormalizer::default());
let output = chain.process(frame).await?;
Mixer
Multi-track audio mixer with per-track volume control:
use adk_audio::Mixer;
let mut mixer = Mixer::new(24_000); // output sample rate (constructor args illustrative)
mixer.add_track("voice", 1.0); // track name, volume
mixer.add_track("music", 0.4);
mixer.push_frame("voice", voice_frame);
mixer.push_frame("music", music_frame);
let mixed = mixer.mix()?;
Pipeline System
The AudioPipelineBuilder composes providers, processors, and agents into async processing topologies. Each pipeline returns a PipelineHandle with input_tx / output_rx channels, real-time metrics, and a shutdown() method.
Pipeline Topologies
| Builder method | Flow | Required components |
|---|---|---|
| `build_tts()` | Text → TTS → Audio | `tts` |
| `build_stt()` | Audio → STT → Transcript | `stt` |
| `build_voice_agent()` | Audio → VAD → STT → Agent → TTS → Audio | `tts`, `stt`, `vad`, `agent` |
| `build_transform()` | Audio → FxChain → Audio | `pre_fx` (optional) |
| `build_music()` | Text → MusicProvider → Audio | `music` |
use adk_audio::{AudioPipelineBuilder, PipelineInput, PipelineOutput};
let mut handle = AudioPipelineBuilder::new()
    .tts(tts)
    .stt(stt)
    .vad(vad)
    .agent(agent)
    .pre_fx(pre_chain)   // optional: applied before STT
    .post_fx(post_chain) // optional: applied after TTS
    .buffer_size(32)     // channel buffer (default 32)
    .build_voice_agent()?;
// Send input (PipelineInput/PipelineOutput variant names illustrative)
handle.input_tx.send(PipelineInput::Audio(frame_a)).await?;
handle.input_tx.send(PipelineInput::Audio(frame_b)).await?;
// Receive output
if let Some(PipelineOutput::Audio(reply)) = handle.output_rx.recv().await {
    // play or store the synthesized reply
}
// Shutdown
handle.shutdown();
SentenceChunker
Buffers LLM tokens and emits complete sentences at delimiter boundaries (.!?;\n), reducing time-to-first-audio in voice agent pipelines:
use adk_audio::SentenceChunker;
let mut chunker = SentenceChunker::new();
let sentences = chunker.push("Hello world. How are");
// sentences == ["Hello world."]
let more = chunker.push(" you? Fine.");
// more == ["How are you?", "Fine."]
let remaining = chunker.flush();
// remaining == None (buffer empty)
Preset Pipelines
Factory functions for common topologies:
use adk_audio::presets::*; // module path and argument lists shown are illustrative
let handle = ivr_pipeline(tts, stt, vad, agent)?; // voice agent
let handle = podcast_pipeline(tts)?;              // TTS only
let handle = transcription_pipeline(stt)?;        // STT only
let handle = enhance_pipeline(fx_chain)?;         // FX transform
Pipeline Metrics
PipelineMetrics tracks real-time latency and quality:
| Field | Description |
|---|---|
| `tts_latency_ms` | TTS synthesis latency |
| `stt_latency_ms` | STT transcription latency |
| `llm_latency_ms` | Agent reasoning latency |
| `total_audio_ms` | Total audio processed |
| `vad_speech_ratio` | Speech-to-total frame ratio (0.0–1.0) |
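A hedged sketch of reading a snapshot from a running pipeline, assuming PipelineHandle exposes it via a `metrics()` accessor:

```rust
// `handle` is the PipelineHandle returned by the builder
let m = handle.metrics();
println!(
    "tts {} ms | stt {} ms | llm {} ms | audio {} ms | speech ratio {:.2}",
    m.tts_latency_ms, m.stt_latency_ms, m.llm_latency_ms, m.total_audio_ms, m.vad_speech_ratio
);
```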
Agent Tools
Four tools implement adk_core::Tool for LLM agent integration:
| Tool | Description | Required input |
|---|---|---|
| `SpeakTool` | Synthesize text to speech | `{text, voice?, emotion?}` |
| `TranscribeTool` | Transcribe audio to text | `{audio_data (base64), sample_rate?, language?}` |
| `ApplyFxTool` | Apply a named FX chain | `{audio_data (base64), chain, sample_rate?}` |
| `GenerateMusicTool` | Generate music from prompt | `{prompt, duration_secs, genre?}` |
use adk_audio::{SpeakTool, TranscribeTool};
let speak = SpeakTool::new(tts);
let transcribe = TranscribeTool::new(stt);
// register on an adk-core agent builder (builder type name illustrative)
let agent = LlmAgent::builder()
    .tool(speak)
    .tool(transcribe)
    .build()?;
VAD (Voice Activity Detection)
The VadProcessor trait (behind the vad feature) provides:
- `is_speech(&AudioFrame) -> bool` — binary speech detection
- `segment(&AudioFrame) -> Vec<SpeechSegment>` — identify speech segments with start/end timestamps
Used by the voice agent pipeline to gate STT inference to speech-only segments.
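A minimal sketch of that gating, using only the two trait methods above (any VadProcessor implementation works here):

```rust
use adk_audio::VadProcessor;

// `vad` implements VadProcessor, `frame` is the incoming AudioFrame
if vad.is_speech(&frame) {
    let segments = vad.segment(&frame);
    println!("found {} speech segment(s)", segments.len());
    // hand only these windows to the STT provider instead of the whole frame
}
```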
Realtime Bridge
Behind the livekit feature, RealtimeBridge converts between adk-realtime base64-encoded PCM16 audio streams and pipeline PipelineInput/PipelineOutput:
use adk_audio::RealtimeBridge;
let bridge = RealtimeBridge::new(24_000, 1); // 24kHz mono (constructor args illustrative)
let input_stream = bridge.from_realtime(realtime_audio_rx); // base64 → PipelineInput
let output_stream = bridge.to_realtime(handle.output_rx);   // PipelineOutput → base64
Error Handling
All operations return AudioResult<T> (alias for Result<T, AudioError>). Error variants:
| Variant | Description |
|---|---|
| `Tts { provider, message }` | TTS provider error |
| `Stt { provider, message }` | STT provider error |
| `Music(String)` | Music generation error |
| `Fx(String)` | Audio processing error |
| `PipelineClosed(String)` | Pipeline misconfigured or shut down |
| `Vad(String)` | Voice activity detection error |
| `Codec(String)` | Encode/decode error |
| `ModelDownload { model_id, message }` | Model download or registry error |
| `Io(std::io::Error)` | I/O error |
| `Network(reqwest::Error)` | HTTP error (feature-gated) |
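A hedged sketch of matching on a few variants (struct-variant field names as listed above):

```rust
use adk_audio::AudioError;

match tts.synthesize(request).await {
    Ok(frame) => println!("{} samples", frame.samples.len()),
    Err(AudioError::Tts { provider, message }) => {
        eprintln!("TTS failed on {provider}: {message}")
    }
    Err(AudioError::Network(err)) => eprintln!("HTTP error: {err}"),
    Err(other) => eprintln!("audio error: {other}"),
}
```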
Related Crates
- adk-rust — Umbrella crate
- adk-core — `Tool` trait, `Agent` trait
- adk-realtime — Real-time audio/video streaming
- adk-tool — Additional tool utilities
License
See LICENSE in the repository root.