§Qwen3-TTS
Pure Rust inference for Qwen3-TTS, a high-quality text-to-speech model from Alibaba.
§Features
- CPU inference with optional MKL/Accelerate for faster BLAS operations
- CUDA support for NVIDIA GPU acceleration
- Metal support for Apple Silicon
- Streaming-friendly architecture with incremental token generation
- Voice cloning via ECAPA-TDNN speaker encoder (Base models)
- Auto-detection of model variant from `config.json`
§Quick Start
```rust
use qwen3_tts::{Qwen3TTS, SynthesisOptions, auto_device};

// Load model — variant auto-detected from config.json
let device = auto_device()?;
let model = Qwen3TTS::from_pretrained("path/to/model", device)?;

// Synthesize speech with default settings
let audio = model.synthesize("Hello, world!", None)?;
audio.save("output.wav")?;

// Or with custom options
let options = SynthesisOptions {
    temperature: 0.8,
    top_k: 30,
    ..Default::default()
};
let audio = model.synthesize("Custom settings!", Some(options))?;
```
§Architecture
The TTS pipeline consists of three stages:
1. TalkerModel: Transformer that generates semantic tokens from text autoregressively. Uses dual embeddings (text + codec) with MRoPE (multimodal rotary position encoding) across all variants.
2. CodePredictor: For each semantic token, generates 15 acoustic tokens using a 5-layer autoregressive decoder. The code predictor always has `hidden_size=1024` regardless of the talker size; 1.7B models use a `small_to_mtp_projection` layer to bridge the gap.
3. Decoder12Hz: Converts the 16-codebook codec tokens to an audio waveform at 24kHz. Uses ConvNeXt blocks and transposed convolutions for upsampling. Shared across all model variants.
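Because each codec frame decodes to a fixed 1920 samples at 24kHz (80ms), the output duration is fully determined by the number of frames the talker emits. A small sketch of that arithmetic (constants taken from this crate's documented values):

```rust
/// Samples produced per codec frame by the decoder (80 ms at 24 kHz).
const SAMPLES_PER_FRAME: u32 = 1_920;
/// Output sample rate of the decoder.
const SAMPLE_RATE: u32 = 24_000;

/// Audio duration in seconds for a given number of codec frames.
fn duration_secs(num_frames: u32) -> f64 {
    (num_frames * SAMPLES_PER_FRAME) as f64 / SAMPLE_RATE as f64
}

fn main() {
    assert!((duration_secs(1) - 0.08).abs() < 1e-9); // one frame = 80 ms
    assert_eq!(duration_secs(125), 10.0); // 125 frames -> 10 s of audio
}
```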
§Model Variants
Five official variants exist in two size classes:
| Variant | Size | Talker hidden | Speaker conditioning | HuggingFace ID |
|---|---|---|---|---|
| 0.6B Base | 1.8 GB | 1024 | Voice cloning (ECAPA-TDNN) | Qwen/Qwen3-TTS-12Hz-0.6B-Base |
| 0.6B CustomVoice | 1.8 GB | 1024 | 9 preset speakers | Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice |
| 1.7B Base | 3.9 GB | 2048 | Voice cloning (ECAPA-TDNN) | Qwen/Qwen3-TTS-12Hz-1.7B-Base |
| 1.7B CustomVoice | 3.9 GB | 2048 | 9 preset speakers | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice |
| 1.7B VoiceDesign | 3.8 GB | 2048 | Text-described voices | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign |
Base: Includes a speaker encoder for voice cloning from reference audio. Supports x_vector_only (speaker embedding) and ICL (in-context learning with reference audio + text) modes.
CustomVoice: 9 preset speakers (Serena, Vivian, Ryan, Aiden, etc.) with no speaker encoder. Uses discrete speaker token IDs for voice selection.
VoiceDesign: Creates novel voices from text descriptions (e.g., “a deep male voice”). No speaker encoder or preset speakers.
All variants share the same speech tokenizer and decoder weights. The code predictor architecture is identical (1024 hidden, 5 layers, 16 heads) across all variants.
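The variant table above can be summarized as two axes: talker width (1024 vs. 2048) and speaker-conditioning mode. The crate detects this from `config.json`; the sketch below only illustrates the mapping, and its enum and function names are hypothetical, not this crate's API:

```rust
// Hypothetical sketch of variant labeling. The crate actually infers the
// variant from config.json; names here are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Conditioning {
    VoiceCloning,   // Base: ECAPA-TDNN speaker encoder
    PresetSpeakers, // CustomVoice: discrete speaker token IDs
    VoiceDesign,    // VoiceDesign: text-described voices
}

fn variant_label(talker_hidden: usize, cond: Conditioning) -> String {
    let size = match talker_hidden {
        1024 => "0.6B",
        2048 => "1.7B",
        _ => "?",
    };
    let kind = match cond {
        Conditioning::VoiceCloning => "Base",
        Conditioning::PresetSpeakers => "CustomVoice",
        Conditioning::VoiceDesign => "VoiceDesign",
    };
    format!("{size} {kind}")
}

fn main() {
    assert_eq!(variant_label(2048, Conditioning::VoiceDesign), "1.7B VoiceDesign");
    assert_eq!(variant_label(1024, Conditioning::VoiceCloning), "0.6B Base");
}
```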
§Sample Rate
Output audio is always 24kHz mono. Use `audio::resample()` if you need a different sample rate.
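For intuition about what resampling involves, here is a minimal linear-interpolation sketch. It is not this crate's `audio::resample()` implementation (which may use a higher-quality algorithm), just a self-contained illustration:

```rust
/// Naive linear-interpolation resampler sketch (illustrative only).
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == to_hz {
        return input.to_vec();
    }
    let ratio = from_hz as f64 / to_hz as f64;
    let out_len = (input.len() as f64 / ratio).round() as usize;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * ratio;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            // Clamp indices so the last output samples stay in bounds.
            let a = input[idx.min(input.len() - 1)];
            let b = input[(idx + 1).min(input.len() - 1)];
            a + (b - a) * frac
        })
        .collect()
}

fn main() {
    // Upsampling 2 Hz -> 4 Hz doubles the sample count.
    let out = resample_linear(&[0.0, 1.0, 2.0, 3.0], 2, 4);
    assert_eq!(out.len(), 8);
    assert_eq!(out[1], 0.5); // interpolated halfway between 0.0 and 1.0
}
```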
§Re-exports
pub use audio::AudioBuffer;
pub use models::config::Qwen3TTSConfig;
pub use generation::SamplingContext;
pub use models::talker::codec_tokens;
pub use models::talker::special_tokens;
pub use models::talker::tts_tokens;
pub use models::talker::Language;
pub use models::talker::Speaker;
pub use models::CodePredictor;
pub use models::CodePredictorConfig;
pub use models::ModelType;
pub use models::ParsedModelConfig;
pub use models::SpeakerEncoderConfig;
pub use models::TalkerConfig;
pub use models::TalkerModel;
§Modules
- audio: Audio processing utilities for Qwen3-TTS
- generation: Generation and sampling utilities for Qwen3-TTS
- models: Neural network models for Qwen3-TTS
- profiling: Feature-gated profiling support via `tracing-chrome`
- tokenizer: Text tokenization for Qwen3-TTS
§Structs
- Qwen3TTS: Main TTS interface using the full autoregressive pipeline.
- StreamingSession: Streaming synthesis session.
- SynthesisOptions: Options for speech synthesis.
- SynthesisTiming: Per-stage timing breakdown from a synthesis run.
- VoiceClonePrompt: Reference audio prompt for voice cloning.
§Constants
- CODEC_EOS_TOKEN_ID: The codec end-of-sequence token ID (2150).
- SAMPLES_PER_FRAME: Number of audio samples per codec frame at 24kHz (1920 = 80ms per frame at 12Hz).
§Functions
- auto_device: Select the best available compute device for inference.
- codes_to_tensor: Convert a slice of codec frames into a tensor of shape `[1, 16, T]`.
- compute_dtype_for_device: Return the recommended compute dtype for the given device.
- device_info: Human-readable label for a `Device`.
- parse_device: Parse a device string into a `Device`.
- sync_device: Force the GPU to complete all pending work before returning.
§Type Aliases
- FrameCodes: A sequence of codec frames, where each frame contains 16 codebook values (1 semantic + 15 acoustic, formatted as `[semantic, acoustic_0..14]`).