§Qwen3-TTS
Pure Rust inference for Qwen3-TTS, a high-quality text-to-speech model from Alibaba.
§Features
- CPU inference with optional MKL/Accelerate for faster BLAS operations
- CUDA support for NVIDIA GPU acceleration
- Metal support for Apple Silicon
- Streaming-friendly architecture with incremental token generation
- Voice cloning via ECAPA-TDNN speaker encoder (Base models)
- Auto-detection of model variant from `config.json`
§Quick Start
```rust
use qwen3_tts::{Qwen3TTS, SynthesisOptions, auto_device};

// Load model — variant auto-detected from config.json
let device = auto_device()?;
let model = Qwen3TTS::from_pretrained("path/to/model", device)?;

// Synthesize speech with default settings
let audio = model.synthesize("Hello, world!", None)?;
audio.save("output.wav")?;

// Or with custom options
let options = SynthesisOptions {
    temperature: 0.8,
    top_k: 30,
    ..Default::default()
};
let audio = model.synthesize("Custom settings!", Some(options))?;
```
§Architecture
The TTS pipeline consists of three stages:
1. TalkerModel: Transformer that generates semantic tokens from text autoregressively. Uses dual embeddings (text + codec) with MRoPE (multimodal rotary position encoding) across all variants.
2. CodePredictor: For each semantic token, generates 15 acoustic tokens using a 5-layer autoregressive decoder. The code predictor always has `hidden_size=1024` regardless of the talker size; 1.7B models use a `small_to_mtp_projection` layer to bridge the gap.
3. Decoder12Hz: Converts the 16-codebook codec tokens to an audio waveform at 24kHz. Uses ConvNeXt blocks and transposed convolutions for upsampling. Shared across all model variants.
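Because each codec frame decodes to a fixed 1920 samples at 24kHz (80ms), the output duration is fully determined by the number of frames the talker emits. A small sketch of that arithmetic (constants taken from this crate's documented values):

```rust
/// Samples produced per codec frame by the decoder (80 ms at 24 kHz).
const SAMPLES_PER_FRAME: u32 = 1_920;
/// Output sample rate of the decoder.
const SAMPLE_RATE: u32 = 24_000;

/// Audio duration in seconds for a given number of codec frames.
fn duration_secs(num_frames: u32) -> f64 {
    (num_frames * SAMPLES_PER_FRAME) as f64 / SAMPLE_RATE as f64
}

fn main() {
    assert!((duration_secs(1) - 0.08).abs() < 1e-9); // one frame = 80 ms
    assert_eq!(duration_secs(125), 10.0); // 125 frames -> 10 s of audio
}
```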
§Model Variants
Five official variants exist in two size classes:
| Variant | Size | Talker hidden | Speaker conditioning | HuggingFace ID |
|---|---|---|---|---|
| 0.6B Base | 1.8 GB | 1024 | Voice cloning (ECAPA-TDNN) | Qwen/Qwen3-TTS-12Hz-0.6B-Base |
| 0.6B CustomVoice | 1.8 GB | 1024 | 9 preset speakers | Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice |
| 1.7B Base | 3.9 GB | 2048 | Voice cloning (ECAPA-TDNN) | Qwen/Qwen3-TTS-12Hz-1.7B-Base |
| 1.7B CustomVoice | 3.9 GB | 2048 | 9 preset speakers | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice |
| 1.7B VoiceDesign | 3.8 GB | 2048 | Text-described voices | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign |
Base: Includes a speaker encoder for voice cloning from reference audio. Supports x_vector_only (speaker embedding) and ICL (in-context learning with reference audio + text) modes.
CustomVoice: 9 preset speakers (Serena, Vivian, Ryan, Aiden, etc.) with no speaker encoder. Uses discrete speaker token IDs for voice selection.
VoiceDesign: Creates novel voices from text descriptions (e.g., “a deep male voice”). No speaker encoder or preset speakers.
All variants share the same speech tokenizer and decoder weights. The code predictor architecture is identical (1024 hidden, 5 layers, 16 heads) across all variants.
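The variant table above can be summarized as two axes: talker width (1024 vs. 2048) and speaker-conditioning mode. The crate detects this from `config.json`; the sketch below only illustrates the mapping, and its enum and function names are hypothetical, not this crate's API:

```rust
// Hypothetical sketch of variant labeling. The crate actually infers the
// variant from config.json; names here are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Conditioning {
    VoiceCloning,   // Base: ECAPA-TDNN speaker encoder
    PresetSpeakers, // CustomVoice: discrete speaker token IDs
    VoiceDesign,    // VoiceDesign: text-described voices
}

fn variant_label(talker_hidden: usize, cond: Conditioning) -> String {
    let size = match talker_hidden {
        1024 => "0.6B",
        2048 => "1.7B",
        _ => "?",
    };
    let kind = match cond {
        Conditioning::VoiceCloning => "Base",
        Conditioning::PresetSpeakers => "CustomVoice",
        Conditioning::VoiceDesign => "VoiceDesign",
    };
    format!("{size} {kind}")
}

fn main() {
    assert_eq!(variant_label(2048, Conditioning::VoiceDesign), "1.7B VoiceDesign");
    assert_eq!(variant_label(1024, Conditioning::VoiceCloning), "0.6B Base");
}
```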
§Sample Rate
Output audio is always 24kHz mono. Use `audio::resample()` if you need a different sample rate.
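For intuition about what resampling involves, here is a minimal linear-interpolation sketch. It is not this crate's `audio::resample()` implementation (which may use a higher-quality algorithm), just a self-contained illustration:

```rust
/// Naive linear-interpolation resampler sketch (illustrative only).
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == to_hz {
        return input.to_vec();
    }
    let ratio = from_hz as f64 / to_hz as f64;
    let out_len = (input.len() as f64 / ratio).round() as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * ratio;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            // Clamp indices so the last output samples stay in bounds.
            let a = input[idx.min(input.len() - 1)];
            let b = input[(idx + 1).min(input.len() - 1)];
            a + (b - a) * frac
        })
        .collect()
}

fn main() {
    // Upsampling 2 Hz -> 4 Hz doubles the sample count.
    let out = resample_linear(&[0.0, 1.0, 2.0, 3.0], 2, 4);
    assert_eq!(out.len(), 8);
    assert_eq!(out[1], 0.5); // interpolated halfway between 0.0 and 1.0
}
```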
§Re-exports
pub use audio::AudioBuffer;
pub use models::config::Qwen3TTSConfig;
pub use generation::SamplingContext;
pub use models::talker::codec_tokens;
pub use models::talker::special_tokens;
pub use models::talker::tts_tokens;
pub use models::talker::Language;
pub use models::talker::Speaker;
pub use models::CodePredictor;
pub use models::CodePredictorConfig;
pub use models::ModelType;
pub use models::ParsedModelConfig;
pub use models::SpeakerEncoderConfig;
pub use models::TalkerConfig;
pub use models::TalkerModel;
§Modules
- audio: Audio processing utilities for Qwen3-TTS
- generation: Generation and sampling utilities for Qwen3-TTS
- models: Neural network models for Qwen3-TTS
- profiling: Feature-gated profiling support via `tracing-chrome`
- tokenizer: Text tokenization for Qwen3-TTS
§Structs
- Qwen3TTS: Main TTS interface using the full autoregressive pipeline.
- StreamingSession: Streaming synthesis session.
- SynthesisOptions: Options for speech synthesis.
- SynthesisTiming: Per-stage timing breakdown from a synthesis run.
- VoiceClonePrompt: Reference audio prompt for voice cloning.
§Constants
- CODEC_EOS_TOKEN_ID: The codec end-of-sequence token ID (2150).
- SAMPLES_PER_FRAME: Number of audio samples per codec frame at 24kHz (1920 = 80ms per frame at 12Hz).
§Functions
- auto_device: Select the best available compute device for inference.
- codes_to_tensor: Convert a slice of codec frames into a tensor of shape `[1, 16, T]`.
- compute_dtype_for_device: Return the recommended compute dtype for the given device.
- device_info: Human-readable label for a `Device`.
- parse_device: Parse a device string into a `Device`.
- sync_device: Force the GPU to complete all pending work before returning.
§Type Aliases
- FrameCodes: A sequence of codec frames, where each frame contains 16 codebook values (1 semantic + 15 acoustic, formatted as `[semantic, acoustic_0..14]`).