wavekat-tts 0.0.2

Unified text-to-speech for voice pipelines, wrapping multiple TTS engines behind common Rust traits. Same pattern as wavekat-vad and wavekat-turn.

[!WARNING] Early development. API may change between minor versions.

Backends

Backend	Feature flag	Status	License
Qwen3-TTS	`qwen3-tts`	✅ Available	Apache 2.0
CosyVoice	`cosyvoice`	🚧 Planned	Apache 2.0

Quick start

cargo add wavekat-tts --features qwen3-tts

use wavekat_tts::{TtsBackend, SynthesizeRequest};
use wavekat_tts::backends::qwen3_tts::{Qwen3Tts, ModelConfig, ModelPrecision, ExecutionProvider};

// Auto-downloads INT4 model files on first run, runs on CPU (default):
let tts = Qwen3Tts::new()?;

// Or FP32 on CPU:
// let tts = Qwen3Tts::from_config(ModelConfig::default().with_precision(ModelPrecision::Fp32))?;

// Or INT4 from a local directory on CUDA:
// let tts = Qwen3Tts::from_config(
//     ModelConfig::default()
//         .with_dir("models/qwen3-tts-1.7b")
//         .with_execution_provider(ExecutionProvider::Cuda),
// )?;

let request = SynthesizeRequest::new("Hello, world")
    .with_instruction("Speak naturally and clearly.");
let audio = tts.synthesize(&request)?;

// Save to WAV (wavekat-core includes WAV I/O via the `wav` feature):
audio.write_wav("output.wav")?;

println!("{}s at {} Hz", audio.duration_secs(), audio.sample_rate());

Model files are cached by the HF Hub client at $HF_HOME/hub/ (default ~/.cache/huggingface/hub/). Set WAVEKAT_MODEL_DIR to load from a local directory and skip all downloads.

All backends produce AudioFrame<'static> from wavekat-core — the same type consumed by wavekat-vad and wavekat-turn.

Architecture

wavekat-vad   →  "is someone speaking?"
wavekat-turn  →  "are they done speaking?"
wavekat-tts   →  "synthesize the response"
     │                   │                     │
     └───────────────────┴─────────────────────┘
                         │
                   AudioFrame (wavekat-core)

Two trait families:

TtsBackend — batch synthesis: text → AudioFrame<'static>
StreamingTtsBackend — streaming: text → iterator of AudioFrame<'static> chunks

Examples

Generate a WAV file from text (model files are auto-downloaded on first run):

cargo run --example synthesize --features qwen3-tts -- "Hello, world\!"
cargo run --example synthesize --features qwen3-tts -- --instruction "Speak in a warm, friendly tone." "Give every small business the voice of a big one."
cargo run --example synthesize --features qwen3-tts -- --precision fp32 "Hello"
cargo run --example synthesize --features qwen3-tts -- --model-dir /path/to/model --output hello.wav "Hello"

Performance

Backend	Precision	Provider	Hardware	RTF short	RTF medium	RTF long
qwen3-tts	int4	CPU	Standard_NC4as_T4_v3	1.98	2.04	2.34
qwen3-tts	int4	CUDA	Standard_NC4as_T4_v3	0.78	0.85	1.07

RTF < 1.0 = faster-than-real-time. Lower is better.
To update: run make bench-csv-cuda on target hardware, then commit bench/results/.

Feature flags

Backends

Flag	Default	Description
`qwen3-tts`	off	Qwen3-TTS local ONNX inference
`cosyvoice`	off	CosyVoice local ONNX inference (planned)

Execution providers

Composable with any backend flag. Selects the inference hardware at build time.

Flag	Description	Status
`cuda`	NVIDIA CUDA GPU	✅ Working
`tensorrt`	NVIDIA TensorRT	🚧 Not configured
`coreml`	Apple CoreML (macOS)	🚧 Not configured

License

Licensed under Apache 2.0.