# Qwen3-TTS-RS
A Rust implementation of the Qwen3-TTS text-to-speech model using the Candle ML framework.

## Features
- Complete implementation of the Qwen3-TTS architecture
- Speaker encoder (ECAPA-TDNN) for voice cloning
- 12Hz audio tokenizer (V2) for high-quality audio generation
- Three synthesis modes:
  - CustomVoice: Use predefined speaker voices
  - VoiceDesign: Create voices from natural language descriptions
  - VoiceClone: Clone voices from reference audio
- Batch processing for multiple texts
- Voice prompt caching for faster repeated generation
- URL-based audio loading for voice cloning
- Standalone tokenizer CLI for audio codec testing
- Full control over generation parameters
- Multi-language support: Chinese, English, Japanese, Korean, French, German, Spanish (+ auto-detect)

## Architecture Overview
Qwen3-TTS uses a hierarchical generation approach:
- Speaker Encoder: Extracts speaker embeddings from reference audio using ECAPA-TDNN
- Talker Model: Generates semantic tokens (codebook 0) using multimodal RoPE
- Code Predictor: Generates acoustic tokens (codebooks 1-31)
- Audio Tokenizer: Decodes all 32 codebooks to audio waveforms
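A minimal sketch of how the codebooks described above might fit together per frame. All names here (`Frame`, `generate_frames`, the closure parameters) are hypothetical and are not taken from this crate. The closures stand in for the Talker and Code Predictor networks, and the sketch leaves out the speaker encoder, the audio tokenizer's decoding step, and any conditioning on earlier frames.

```rust
/// Number of codebooks per audio frame, as described above.
const NUM_CODEBOOKS: usize = 32;

/// One frame of codes: index 0 is the semantic token, 1..=31 are acoustic.
type Frame = [u32; NUM_CODEBOOKS];

/// Hypothetical driver loop: `talker` yields the semantic token for a step,
/// and `code_predictor` yields the 31 acoustic tokens conditioned on it.
fn generate_frames(
    steps: usize,
    mut talker: impl FnMut(usize) -> u32,
    mut code_predictor: impl FnMut(u32) -> [u32; NUM_CODEBOOKS - 1],
) -> Vec<Frame> {
    (0..steps)
        .map(|step| {
            let semantic = talker(step);
            let acoustic = code_predictor(semantic);
            let mut frame = [0u32; NUM_CODEBOOKS];
            frame[0] = semantic; // codebook 0
            frame[1..].copy_from_slice(&acoustic); // codebooks 1-31
            frame
        })
        .collect()
}

fn main() {
    // Stand-in closures; the real models would run neural networks here.
    let frames = generate_frames(
        3,
        |step| step as u32,
        |semantic| [semantic + 100; NUM_CODEBOOKS - 1],
    );
    for frame in &frames {
        println!("codebook 0 = {}, codebooks 1-31 = {:?}", frame[0], &frame[1..]);
    }
}
```

The resulting frames are what an audio tokenizer would decode into a waveform.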

## CLI Usage

### Basic Text-to-Speech

- Using a predefined speaker (CustomVoice mode)
- With language specification

### Synthesis Modes

#### CustomVoice (Predefined Speakers)

Use built-in speaker voices with optional instructions.

#### VoiceDesign (Natural Language Description)

Create a voice from a text description.

#### VoiceClone (Reference Audio)

Clone a voice from reference audio:

- X-vector only mode (faster, uses only the speaker embedding): omit `--ref-text "..."`
- From a local file
- From a URL

### Batch Processing

#### Using TXT

Create a text file (for example, `inputs.txt`) with one text per line.
This generates `outputs/output_0.wav`, `outputs/output_1.wav`, and so on, one file per line.

#### Using JSON

For more control, use JSON format (detected automatically from the `.json` extension).

### Voice Prompt Caching

Save computed voice prompts for reuse (avoids recomputing speaker embeddings):

- Save the voice prompt while generating
- Reuse a saved prompt (faster, no reference audio needed)
- Create a prompt without generating audio

### Generation Parameters

#### Talker Parameters (Semantic Token Generation)

- Greedy decoding (deterministic)
- Set a random seed for reproducibility
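
#### Max Tokens (default: 2048)

To generate long-form speech, raise `--max-tokens`. At the default of 2048, the 12Hz model stops after roughly 2m 44s of audio.

The "Hz" in the model names is the token rate in tokens per second (the 12Hz model runs at 12.5 tokens per second):

- v1 25Hz = 25 tokens/second = 40ms per token
- v2 12Hz = 12.5 tokens/second = 80ms per token

Given `tokens = duration_seconds × token_rate_hz`, the maximum duration for a token budget is `max_tokens / token_rate_hz`: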

| max_tokens | 12Hz (v2) | 25Hz (v1) |
|---|---|---|
| 2,000 | 2m 40s | 1m 20s |
| 4,000 | 5m 20s | 2m 40s |
| 8,000 | 10m 40s | 5m 20s |
| 16,000 | 21m 20s | 10m 40s |
| 32,000 | 42m 40s | 21m 20s |
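
The conversion from a token budget to a maximum duration can be sketched as follows. The function and constant names are illustrative and not part of this crate; the rates are the 12.5 and 25 tokens per second stated above.

```rust
/// Token rates in tokens per second, as stated in the text.
const RATE_12HZ_V2: f64 = 12.5;
const RATE_25HZ_V1: f64 = 25.0;

/// Maximum audio duration, in whole seconds, for a given token budget.
fn max_duration_secs(max_tokens: u32, token_rate_hz: f64) -> u64 {
    (max_tokens as f64 / token_rate_hz).floor() as u64
}

/// Formats a number of seconds as "Xm Ys".
fn format_duration(secs: u64) -> String {
    format!("{}m {}s", secs / 60, secs % 60)
}

fn main() {
    for max_tokens in [2_000u32, 4_000, 8_000, 16_000, 32_000] {
        println!(
            "{:>6} tokens: {} at 12Hz (v2), {} at 25Hz (v1)",
            max_tokens,
            format_duration(max_duration_secs(max_tokens, RATE_12HZ_V2)),
            format_duration(max_duration_secs(max_tokens, RATE_25HZ_V1)),
        );
    }
}
```

Running the same calculation in reverse (`duration_seconds × token_rate_hz`) gives the `--max-tokens` value needed for a target duration.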

##### Per Page

Estimated speech duration and token budget by page count, assuming:

- Average: 500 words per page
- Speech rate: 150 words per minute (conversational pace)

| Pages | Words | Duration | Tokens (12Hz) | Tokens (25Hz) |
|---|---|---|---|---|
| 1 | 500 | 3.3 min | 2,500 | 5,000 |
| 5 | 2,500 | 16.7 min | 12,500 | 25,000 |
| 12 | 6,000 | 40 min | 30,000 | 60,000 |
| 25 | 12,500 | 83.3 min | 62,500 | 125,000 |
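
As a rough planning aid, the page-based estimate can be computed directly from the two assumptions above (500 words per page, 150 words per minute). This is a sketch with illustrative names, not a function provided by this crate. It works from the unrounded duration, so its results can differ slightly from figures derived from a rounded number of minutes.

```rust
/// Assumptions taken from the text: page length and speaking pace.
const WORDS_PER_PAGE: u32 = 500;
const WORDS_PER_MINUTE: f64 = 150.0;

/// Tokens needed to speak `words` words at `token_rate_hz` tokens per second.
/// Multiplying before dividing keeps the arithmetic exact for round inputs.
fn tokens_for_words(words: u32, token_rate_hz: f64) -> u32 {
    (words as f64 * 60.0 * token_rate_hz / WORDS_PER_MINUTE).ceil() as u32
}

/// Tokens needed for a number of pages of text.
fn tokens_for_pages(pages: u32, token_rate_hz: f64) -> u32 {
    tokens_for_words(pages * WORDS_PER_PAGE, token_rate_hz)
}

fn main() {
    for pages in [1u32, 5, 12, 25] {
        println!(
            "{:>2} pages: {} tokens at 12Hz, {} tokens at 25Hz",
            pages,
            tokens_for_pages(pages, 12.5),
            tokens_for_pages(pages, 25.0),
        );
    }
}
```

Treat the result as a lower bound for `--max-tokens`; actual speech pace varies with the text and the voice.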
#### Subtalker Parameters (Acoustic Token Generation)
Control the code predictor that generates codebooks 1-31:
```bash
cargo run --release -- \
  --text "Hello, world!" \
  --speaker vivian \
  --subtalker-temperature 0.9 \
  --subtalker-top-k 50 \
  --subtalker-top-p 1.0 \
  --output output.wav

# Disable subtalker sampling (greedy)
cargo run --release -- \
  --text "Hello, world!" \
  --speaker vivian \
  --no-subtalker-sample \
  --output output.wav
```
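
For readers unfamiliar with these knobs, the sketch below shows the standard meaning of temperature, top-k, and top-p (nucleus) filtering, plus greedy selection. It is a generic illustration with hypothetical function names, and it assumes non-empty logits and a positive temperature. It is not taken from this crate's code predictor, which may differ in detail.

```rust
/// Returns the sampling distribution over token ids after applying
/// temperature, top-k, and top-p filtering to raw logits.
/// Filtered-out tokens get probability 0.
fn filtered_probs(logits: &[f32], temperature: f32, top_k: usize, top_p: f32) -> Vec<f32> {
    // Temperature-scaled softmax, shifted by the max logit for stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    let probs: Vec<f32> = exps.iter().map(|e| e / sum).collect();

    // Token ids sorted by descending probability.
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));

    // Keep at most top_k tokens, stopping once cumulative mass reaches top_p.
    let mut kept = vec![0.0f32; probs.len()];
    let mut cumulative = 0.0f32;
    for &id in order.iter().take(top_k.max(1)) {
        kept[id] = probs[id];
        cumulative += probs[id];
        if cumulative >= top_p {
            break;
        }
    }

    // Renormalize the surviving tokens so they sum to 1.
    kept.iter().map(|p| p / cumulative).collect()
}

/// Greedy decoding: always pick the highest-scoring token.
fn greedy(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &l) in logits.iter().enumerate() {
        if l > logits[best] {
            best = i;
        }
    }
    best
}

fn main() {
    let logits = [2.0f32, 1.0, 0.0, -1.0];
    // Same values as the command above: temperature 0.9, top-k 50, top-p 1.0.
    println!("sampling distribution: {:?}", filtered_probs(&logits, 0.9, 50, 1.0));
    println!("greedy choice: token {}", greedy(&logits));
}
```

A sampler would then draw a token id from the returned distribution. With top-p at 1.0 and top-k larger than the vocabulary, only temperature reshapes the distribution.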

### Hardware Options

- Use CPU
- Use CUDA GPU
- Use Metal (macOS)
- Set data type

## Tokenizer CLI

Standalone CLI for audio encoding/decoding (codec testing):

- Encode audio to codes
- Decode codes back to audio
- Round-trip test (encode then decode)

The codes JSON format: