# adk-audio

Audio intelligence and pipeline orchestration for ADK-Rust agents.

Provides unified traits for Text-to-Speech (TTS), Speech-to-Text (STT), music generation, audio FX/DSP processing, and Voice Activity Detection (VAD), plus a composable pipeline system for building voice agent loops, podcast production, transcription, and generative soundscapes.
## Features

| Feature | Description | Dependencies |
|---|---|---|
| `tts` (default) | Cloud TTS providers (ElevenLabs, OpenAI, Gemini, Cartesia) | `reqwest`, `base64` |
| `stt` (default) | Cloud STT providers (Whisper API, Deepgram, AssemblyAI) | `reqwest`, `tokio-tungstenite` |
| `music` | Music generation providers | `reqwest` |
| `fx` | DSP processors (normalizer, resampler, noise, compressor, trimmer, pitch) | `rubato`, `dasp` |
| `vad` | Voice Activity Detection | `webrtc-vad` |
| `opus` | Opus codec encode/decode | `audiopus` (requires cmake) |
| `mlx` | Apple Silicon local inference (macOS only) | `mlx-rs`, `tokenizers`, `hf-hub` |
| `onnx` | ONNX Runtime local inference (cross-platform) | `ort`, `tokenizers`, `hf-hub` |
| `livekit` | `adk-realtime` bridge | `livekit-api`, `adk-realtime` |
| `all` | All non-platform features (no `mlx`/`onnx`) | — |
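A minimal `Cargo.toml` entry enabling a subset of features might look like this (version number illustrative):

```toml
[dependencies]
adk-audio = { version = "0.1", features = ["tts", "stt", "fx"] }
```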
## Quick Start

### Cloud TTS

```rust
// Import path, type names, and fields below are illustrative; check the crate docs.
use adk_audio::tts::{ElevenLabsTts, Tts, TtsRequest};

let tts = ElevenLabsTts::from_env()?; // reads ELEVENLABS_API_KEY
let request = TtsRequest {
    text: "Hello from adk-audio!".into(),
    ..Default::default()
};
let frame = tts.synthesize(request).await?;
println!("synthesized {} bytes", frame.data.len());
```
### Cloud STT

```rust
// Import path and type names are illustrative; check the crate docs.
use adk_audio::stt::{Stt, WhisperStt};

let stt = WhisperStt::from_env()?; // reads OPENAI_API_KEY
let transcript = stt.transcribe(frame).await?; // `frame` holds the captured audio
println!("{}", transcript.text);
```
### Pipeline

```rust
// Builder method names are illustrative; check the crate docs.
use adk_audio::pipeline::AudioPipelineBuilder;

let handle = AudioPipelineBuilder::new()
    .tts(tts) // a TTS provider built as above
    .build_tts()?;
```
## Cloud Providers

### TTS

- **ElevenLabs** — High-quality multilingual voices (`ELEVENLABS_API_KEY`)
- **OpenAI** — TTS-1 and TTS-1-HD models (`OPENAI_API_KEY`)
- **Gemini** — Native audio via generateContent (`GEMINI_API_KEY`)
- **Cartesia** — Sonic-2 low-latency streaming (`CARTESIA_API_KEY`)

### STT

- **Whisper API** — OpenAI Whisper transcription (`OPENAI_API_KEY`)
- **Deepgram** — Nova-2 with diarization and streaming (`DEEPGRAM_API_KEY`)
- **AssemblyAI** — Universal model with async jobs and streaming (`ASSEMBLYAI_API_KEY`)
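Each provider reads its credentials from the environment variable listed above, for example:

```shell
# Placeholder values; substitute real keys.
export OPENAI_API_KEY="sk-placeholder"   # shared by the OpenAI TTS and Whisper STT providers
export DEEPGRAM_API_KEY="dg-placeholder"
echo "keys configured"
```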
## Local Inference

### MLX (Apple Silicon)

Runs TTS and STT models on the Metal GPU with zero-copy unified memory:

```rust
// Import path and type name are illustrative; check the crate docs.
use adk_audio::mlx::MlxTts;

let tts = MlxTts::default_kokoro().await?;
```

### ONNX (Cross-Platform)

Runs TTS models via ONNX Runtime with CUDA, CoreML, or CPU execution providers:

```rust
// Import path and type name are illustrative; check the crate docs.
use adk_audio::onnx::OnnxTts;

let tts = OnnxTts::default_kokoro().await?;
```
## DSP Processors

Behind the `fx` feature:

- `LoudnessNormalizer` — EBU R128 loudness normalization
- `Resampler` — Sample rate conversion (8 kHz–96 kHz)
- `NoiseSuppressor` — Spectral noise reduction
- `DynamicRangeCompressor` — Dynamic range compression
- `SilenceTrimmer` — Leading/trailing silence removal
- `PitchShifter` — Voice pitch adjustment
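As a rough sketch of the buffer-in/buffer-out shape these processors share, here is a standalone peak normalizer (a simplification: the crate's `LoudnessNormalizer` targets EBU R128 loudness, not peaks, and the function name here is illustrative):

```rust
/// Scale a buffer of f32 samples so the loudest peak sits at `target`.
/// Buffers containing only silence are left untouched.
fn peak_normalize(samples: &mut [f32], target: f32) {
    let peak = samples.iter().fold(0.0_f32, |acc, s| acc.max(s.abs()));
    if peak > 0.0 {
        let gain = target / peak;
        for s in samples.iter_mut() {
            *s *= gain;
        }
    }
}

fn main() {
    let mut buf = vec![0.1_f32, -0.5, 0.25];
    peak_normalize(&mut buf, 0.9);
    // The loudest sample (-0.5) is now scaled to -0.9.
    assert!((buf[1] + 0.9).abs() < 1e-6);
    println!("{buf:?}");
}
```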
## License

See LICENSE in the repository root.