Unified text-to-speech for voice pipelines, wrapping multiple TTS engines behind common Rust traits. Same pattern as wavekat-vad and wavekat-turn.
[!WARNING] Early development. API may change between minor versions.
Backends
| Backend | Feature flag | Status | License |
|---|---|---|---|
| Qwen3-TTS (VoiceDesign 1.7B) | qwen3-tts | ✅ Available | Apache 2.0 |
| Qwen3-TTS (Voice Clone 0.6B) | qwen3-tts | ✅ Available | Apache 2.0 |
| CosyVoice | cosyvoice | 🚧 Planned | Apache 2.0 |
Model weights
ONNX-converted weights are published under the wavekat organization on Hugging Face.
| Backend | Repository | Precision |
|---|---|---|
| Qwen3-TTS VoiceDesign | wavekat/Qwen3-TTS-1.7B-VoiceDesign-ONNX | FP32, INT4 |
| Qwen3-TTS Voice Clone | wavekat/Qwen3-TTS-0.6B-Base-ONNX | FP32, INT4 |
Quick start
VoiceDesign (prompt-based styling)
use ;
use Qwen3Tts;
// use wavekat_tts::backends::qwen3_tts::{ModelConfig, ModelPrecision, ExecutionProvider};
Voice Clone (reference-audio cloning)
Requires a reference WAV file (ref.wav): a short mono clip of the voice you want to clone, plus a transcript of what is spoken in the clip.
use AudioFrame;
use ;
// use wavekat_tts::backends::qwen3_tts::{ModelConfig, ModelPrecision};
Model files are cached by the HF Hub client at $HF_HOME/hub/ (default ~/.cache/huggingface/hub/).
Set WAVEKAT_MODEL_DIR to load from a local directory and skip all downloads.
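The lookup order described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the crate's actual code: `ModelRoot` and `resolve_model_root` are hypothetical names, and the environment is passed in as a closure so the logic can be exercised without touching process state.

```rust
use std::path::PathBuf;

/// Where model files would be looked up (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum ModelRoot {
    /// WAVEKAT_MODEL_DIR is set: load from this directory, no downloads.
    Local(PathBuf),
    /// Otherwise: the Hugging Face Hub cache directory.
    HubCache(PathBuf),
}

/// Resolve the model root from environment variables.
/// `env` maps a variable name to its value, e.g. `|k| std::env::var(k).ok()`.
fn resolve_model_root(env: impl Fn(&str) -> Option<String>) -> ModelRoot {
    // An explicit local directory takes priority over any cache.
    if let Some(dir) = env("WAVEKAT_MODEL_DIR").filter(|d| !d.is_empty()) {
        return ModelRoot::Local(PathBuf::from(dir));
    }
    // $HF_HOME if set, else the default ~/.cache/huggingface.
    let hf_home = env("HF_HOME")
        .filter(|d| !d.is_empty())
        .map(PathBuf::from)
        .unwrap_or_else(|| {
            let home = env("HOME").unwrap_or_default();
            PathBuf::from(home).join(".cache").join("huggingface")
        });
    ModelRoot::HubCache(hf_home.join("hub"))
}
```

In a real program the closure would typically be `|k| std::env::var(k).ok()`.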
All backends produce AudioFrame<'static> from wavekat-core, the same type consumed by wavekat-vad and wavekat-turn.
Architecture
wavekat-vad  → "is someone speaking?"
wavekat-turn → "are they done speaking?"
wavekat-tts  → "synthesize the response"
│                   │                     │
└───────────────────┴─────────────────────┘
                    │
        AudioFrame (wavekat-core)
Two trait families:
- TtsBackend: batch synthesis, text → AudioFrame<'static>
- StreamingTtsBackend: streaming, text → iterator of AudioFrame<'static> chunks
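To make the split between the two trait families concrete, here is a self-contained sketch. Only the trait names come from this README. The method names (`synthesize`, `synthesize_stream`), the `Frame` stand-in for `AudioFrame<'static>`, the `String` error type, and the toy `SilenceBackend` are all assumptions made for illustration; the crate's real signatures may differ.

```rust
/// Simplified stand-in for wavekat-core's `AudioFrame<'static>` (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
struct Frame {
    samples: Vec<f32>,
    sample_rate: u32,
}

/// Batch synthesis: the whole utterance is returned at once.
trait TtsBackend {
    fn synthesize(&mut self, text: &str) -> Result<Frame, String>;
}

/// Streaming synthesis: audio is yielded chunk by chunk.
trait StreamingTtsBackend {
    fn synthesize_stream(&mut self, text: &str)
        -> Result<Box<dyn Iterator<Item = Frame>>, String>;
}

/// Toy backend that emits 10 ms of silence per input character.
struct SilenceBackend {
    sample_rate: u32,
}

impl SilenceBackend {
    /// Number of samples in one 10 ms chunk.
    fn chunk_len(&self) -> usize {
        self.sample_rate as usize / 100
    }
}

impl TtsBackend for SilenceBackend {
    fn synthesize(&mut self, text: &str) -> Result<Frame, String> {
        if text.is_empty() {
            return Err("empty input".to_string());
        }
        let len = text.chars().count() * self.chunk_len();
        Ok(Frame { samples: vec![0.0; len], sample_rate: self.sample_rate })
    }
}

impl StreamingTtsBackend for SilenceBackend {
    fn synthesize_stream(&mut self, text: &str)
        -> Result<Box<dyn Iterator<Item = Frame>>, String> {
        if text.is_empty() {
            return Err("empty input".to_string());
        }
        let (chunk, sample_rate) = (self.chunk_len(), self.sample_rate);
        let n = text.chars().count();
        // One chunk per character; the concatenation equals the batch result.
        Ok(Box::new((0..n).map(move |_| Frame { samples: vec![0.0; chunk], sample_rate })))
    }
}
```

A caller that needs low latency would consume the streaming iterator and start playback on the first chunk, while a caller that only needs the final audio can use the batch method.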
Examples
Generate a WAV file from text (model files are auto-downloaded on first run):
# VoiceDesign (1.7B)
cargo run --example synthesize --features qwen3-tts -- --precision fp32 'Hello, world!'
# Voice Clone (0.6B)
cargo run --example synthesize_clone --features qwen3-tts -- --precision fp32 \
  --ref-audio ref.wav --ref-text "Transcript." "Text to synthesize."
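For readers wondering what "generate a WAV file" involves once samples are available, the following sketch encodes mono `f32` samples as a 16-bit PCM WAV buffer using only the standard library. It is not taken from the example binaries; `encode_wav_mono16` is a hypothetical helper, and the bundled examples may well use a dedicated WAV crate instead.

```rust
/// Encode mono f32 samples in [-1.0, 1.0] as a 16-bit PCM WAV byte buffer.
/// Hypothetical helper for illustration; not part of the wavekat-tts API.
fn encode_wav_mono16(samples: &[f32], sample_rate: u32) -> Vec<u8> {
    let data_len = (samples.len() * 2) as u32; // 2 bytes per sample
    let mut out = Vec::with_capacity(44 + samples.len() * 2);

    // RIFF container header.
    out.extend_from_slice(b"RIFF");
    out.extend_from_slice(&(36 + data_len).to_le_bytes());
    out.extend_from_slice(b"WAVE");

    // "fmt " chunk: PCM, 1 channel, 16 bits per sample.
    out.extend_from_slice(b"fmt ");
    out.extend_from_slice(&16u32.to_le_bytes()); // fmt chunk size
    out.extend_from_slice(&1u16.to_le_bytes()); // audio format: PCM
    out.extend_from_slice(&1u16.to_le_bytes()); // channels: mono
    out.extend_from_slice(&sample_rate.to_le_bytes());
    out.extend_from_slice(&(sample_rate * 2).to_le_bytes()); // byte rate
    out.extend_from_slice(&2u16.to_le_bytes()); // block align
    out.extend_from_slice(&16u16.to_le_bytes()); // bits per sample

    // "data" chunk: little-endian i16 samples.
    out.extend_from_slice(b"data");
    out.extend_from_slice(&data_len.to_le_bytes());
    for &s in samples {
        let v = (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16;
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}
```

The returned bytes can be written to disk with `std::fs::write("out.wav", bytes)`.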
Performance
| Backend | Precision | Provider | Hardware | RTF short | RTF medium | RTF long |
|---|---|---|---|---|---|---|
| qwen3-tts | int4 | CPU | Standard_NC4as_T4_v3 | 1.98 | 2.04 | 2.34 |
| qwen3-tts | int4 | CUDA | Standard_NC4as_T4_v3 | 0.78 | 0.85 | 1.07 |
RTF (real-time factor) is synthesis time divided by the duration of the generated audio. RTF < 1.0 means faster than real time; lower is better.
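RTF is conventionally computed as wall-clock synthesis time over the duration of the audio produced. A minimal sketch of that arithmetic follows; the function name and signature are assumptions for illustration and are not taken from the project's benchmark code.

```rust
use std::time::Duration;

/// Real-time factor: synthesis time divided by the duration of the audio produced.
/// Returns `None` when no audio was produced or the sample rate is zero.
fn real_time_factor(
    synthesis_time: Duration,
    num_samples: usize,
    sample_rate: u32,
) -> Option<f64> {
    if num_samples == 0 || sample_rate == 0 {
        return None;
    }
    let audio_secs = num_samples as f64 / sample_rate as f64;
    Some(synthesis_time.as_secs_f64() / audio_secs)
}
```

For instance, spending 780 ms to produce one second of audio gives an RTF of 0.78, which is faster than real time.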
To update: run make bench-csv-cuda on target hardware, then commit bench/results/.
Feature flags
Backends
| Flag | Default | Description |
|---|---|---|
| qwen3-tts | off | Qwen3-TTS local ONNX inference |
| cosyvoice | off | CosyVoice local ONNX inference (planned) |
Execution providers
Composable with any backend flag. Selects the inference hardware at build time.
| Flag | Description | Status |
|---|---|---|
| cuda | NVIDIA CUDA GPU | ✅ Working |
| tensorrt | NVIDIA TensorRT | 🚧 Not configured |
| coreml | Apple CoreML (macOS) | 🚧 Not configured |
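Build-time selection of this kind is usually expressed with Cargo features and `cfg!` checks. The sketch below shows the general idea using the flag names from the table. The `Provider` enum, the `compiled_providers` function, and the preference order are assumptions for illustration; they are not the crate's `ExecutionProvider` type or its actual selection logic.

```rust
/// Execution providers named in the table above, plus a CPU fallback.
/// Hypothetical stand-in, not the crate's own `ExecutionProvider`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Provider {
    Cpu,
    Cuda,
    TensorRt,
    CoreMl,
}

/// Providers compiled into this build, in an assumed preference order.
/// Each `cfg!` is resolved at compile time from the enabled Cargo features.
fn compiled_providers() -> Vec<Provider> {
    let mut out = Vec::new();
    if cfg!(feature = "tensorrt") {
        out.push(Provider::TensorRt);
    }
    if cfg!(feature = "cuda") {
        out.push(Provider::Cuda);
    }
    if cfg!(feature = "coreml") {
        out.push(Provider::CoreMl);
    }
    // CPU is always available as the last resort.
    out.push(Provider::Cpu);
    out
}
```

Built with `--features qwen3-tts,cuda`, a function like this would list CUDA ahead of the CPU fallback; built without any provider flag, it would return only the CPU entry.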
License
Licensed under Apache 2.0.
Copyright 2026 WaveKat.