Crate qwen3_tts


§Qwen3-TTS

Pure Rust inference for Qwen3-TTS, a high-quality text-to-speech model from Alibaba.

§Features

  • CPU inference with optional MKL/Accelerate for faster BLAS operations
  • CUDA support for NVIDIA GPU acceleration
  • Metal support for Apple Silicon
  • Streaming-friendly architecture with incremental token generation
  • Voice cloning via ECAPA-TDNN speaker encoder (Base models)
  • Auto-detection of model variant from config.json

§Quick Start

use qwen3_tts::{Qwen3TTS, SynthesisOptions, auto_device};

// Load model — variant auto-detected from config.json
let device = auto_device()?;
let model = Qwen3TTS::from_pretrained("path/to/model", device)?;

// Synthesize speech with default settings
let audio = model.synthesize("Hello, world!", None)?;
audio.save("output.wav")?;

// Or with custom options
let options = SynthesisOptions {
    temperature: 0.8,
    top_k: 30,
    ..Default::default()
};
let audio = model.synthesize("Custom settings!", Some(options))?;

§Architecture

The TTS pipeline consists of three stages:

  1. TalkerModel: Transformer that generates semantic tokens from text autoregressively. Uses dual embeddings (text + codec) with MRoPE (multimodal rotary position encoding) across all variants.

  2. CodePredictor: For each semantic token, generates 15 acoustic tokens using a 5-layer autoregressive decoder. The code predictor always has hidden_size=1024 regardless of the talker size; 1.7B models use a small_to_mtp_projection layer to bridge the gap.

  3. Decoder12Hz: Converts the 16-codebook codec tokens to audio waveform at 24kHz. Uses ConvNeXt blocks and transposed convolutions for upsampling. Shared across all model variants.
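The frame arithmetic implied by these stages can be checked with a short sketch (the function name here is illustrative, not part of the crate's API): each semantic token produces one 16-codebook frame (1 semantic + 15 acoustic tokens), and each frame decodes to 1920 samples at 24 kHz.

```rust
/// Samples produced per codec frame at 24 kHz (80 ms of audio).
const SAMPLES_PER_FRAME: usize = 1920;
const SAMPLE_RATE: usize = 24_000;
const CODEBOOKS: usize = 16; // 1 semantic + 15 acoustic

/// Given a number of semantic tokens, report the total codec-token
/// count and the duration of the decoded waveform in milliseconds.
fn synthesis_budget(semantic_tokens: usize) -> (usize, usize) {
    let codec_tokens = semantic_tokens * CODEBOOKS;
    let samples = semantic_tokens * SAMPLES_PER_FRAME;
    let duration_ms = samples * 1000 / SAMPLE_RATE;
    (codec_tokens, duration_ms)
}

fn main() {
    // 125 frames -> 2000 codec tokens and 10 seconds of audio.
    let (tokens, ms) = synthesis_budget(125);
    println!("{tokens} codec tokens, {ms} ms of audio");
}
```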

§Model Variants

Five official variants exist in two size classes:

| Variant | Size | Talker hidden | Speaker conditioning | HuggingFace ID |
|---|---|---|---|---|
| 0.6B Base | 1.8 GB | 1024 | Voice cloning (ECAPA-TDNN) | Qwen/Qwen3-TTS-12Hz-0.6B-Base |
| 0.6B CustomVoice | 1.8 GB | 1024 | 9 preset speakers | Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice |
| 1.7B Base | 3.9 GB | 2048 | Voice cloning (ECAPA-TDNN) | Qwen/Qwen3-TTS-12Hz-1.7B-Base |
| 1.7B CustomVoice | 3.9 GB | 2048 | 9 preset speakers | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice |
| 1.7B VoiceDesign | 3.8 GB | 2048 | Text-described voices | Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign |

Base: Includes a speaker encoder for voice cloning from reference audio. Supports x_vector_only (speaker embedding) and ICL (in-context learning with reference audio + text) modes.

CustomVoice: 9 preset speakers (Serena, Vivian, Ryan, Aiden, etc.) with no speaker encoder. Uses discrete speaker token IDs for voice selection.

VoiceDesign: Creates novel voices from text descriptions (e.g., “a deep male voice”). No speaker encoder or preset speakers.

All variants share the same speech tokenizer and decoder weights. The code predictor architecture is identical (1024 hidden, 5 layers, 16 heads) across all variants.
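The hidden-size bridge mentioned above can be pictured as a plain linear map. The sketch below is a conceptual stand-in for the 1.7B models' small_to_mtp_projection, not the crate's implementation: the real layer is a learned weight matrix loaded from the checkpoint, and the uniform weights here are placeholders chosen only to make the dimensions concrete.

```rust
/// Conceptual stand-in for small_to_mtp_projection: a dense linear
/// layer mapping the 2048-dim talker hidden state down to the code
/// predictor's fixed 1024-dim input. Placeholder weights; the real
/// layer's weights come from the model checkpoint.
fn project(hidden: &[f32], weight: &[Vec<f32>]) -> Vec<f32> {
    weight
        .iter()
        .map(|row| row.iter().zip(hidden).map(|(w, h)| w * h).sum())
        .collect()
}

fn main() {
    let talker_hidden = vec![1.0f32; 2048]; // 1.7B talker output
    // 1024 output rows, each with 2048 input weights.
    let weight = vec![vec![1.0 / 2048.0; 2048]; 1024];
    let projected = project(&talker_hidden, &weight);
    assert_eq!(projected.len(), 1024); // matches code predictor hidden size
    println!("projected dim = {}", projected.len());
}
```

The 0.6B models need no such bridge, since their talker hidden size already equals the code predictor's 1024.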

§Sample Rate

Output audio is always 24kHz mono. Use audio::resample() if you need a different sample rate.
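The crate's audio::resample() covers this; purely as an illustration of the idea (not the crate's implementation, which may use a higher-quality filter), a minimal linear-interpolation resampler looks like:

```rust
/// Minimal linear-interpolation resampler; illustrative only.
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == to_hz {
        return input.to_vec();
    }
    let out_len = (input.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    (0..out_len)
        .map(|i| {
            // Fractional position of this output sample in the input.
            let pos = i as f64 * from_hz as f64 / to_hz as f64;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            let a = input[idx];
            let b = input[(idx + 1).min(input.len() - 1)];
            a + (b - a) * frac
        })
        .collect()
}

fn main() {
    // Upsample a short ramp from 24 kHz to 48 kHz: length doubles.
    let samples = vec![0.0, 1.0, 2.0, 3.0];
    let out = resample_linear(&samples, 24_000, 48_000);
    assert_eq!(out.len(), 8);
    println!("{out:?}");
}
```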

Re-exports§

pub use audio::AudioBuffer;
pub use models::config::Qwen3TTSConfig;
pub use generation::SamplingContext;
pub use models::talker::codec_tokens;
pub use models::talker::special_tokens;
pub use models::talker::tts_tokens;
pub use models::talker::Language;
pub use models::talker::Speaker;
pub use models::CodePredictor;
pub use models::CodePredictorConfig;
pub use models::ModelType;
pub use models::ParsedModelConfig;
pub use models::SpeakerEncoderConfig;
pub use models::TalkerConfig;
pub use models::TalkerModel;

Modules§

audio
Audio processing utilities for Qwen3-TTS
generation
Generation and sampling utilities for Qwen3-TTS
models
Neural network models for Qwen3-TTS
profiling
Feature-gated profiling support via tracing-chrome.
tokenizer
Text tokenization for Qwen3-TTS

Structs§

Qwen3TTS
Main TTS interface driving the full autoregressive pipeline.
StreamingSession
Streaming synthesis session.
SynthesisOptions
Options for speech synthesis
SynthesisTiming
Per-stage timing breakdown from a synthesis run.
VoiceClonePrompt
Reference audio prompt for voice cloning.

Constants§

CODEC_EOS_TOKEN_ID
The codec end-of-sequence token ID (2150).
SAMPLES_PER_FRAME
Number of audio samples per codec frame at 24kHz (1920 = 80ms per frame at 12Hz).

Functions§

auto_device
Select the best available compute device for inference.
codes_to_tensor
Convert a slice of codec frames into a tensor of shape [1, 16, T].
compute_dtype_for_device
Return the recommended compute dtype for the given device.
device_info
Human-readable label for a Device.
parse_device
Parse a device string into a Device.
sync_device
Force the GPU to complete all pending work before returning.

Type Aliases§

FrameCodes
A sequence of codec frames, where each frame contains 16 codebook values (1 semantic + 15 acoustic, formatted as [semantic, acoustic_0..14]).
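As a concrete picture of that ordering (illustrative only; the crate's actual FrameCodes element type may differ, so check the source before relying on it), a frame can be split into its semantic and acoustic parts like this:

```rust
/// Illustrative frame layout: index 0 holds the semantic token and
/// indices 1..16 hold the 15 acoustic tokens, mirroring the
/// [semantic, acoustic_0..14] ordering described for FrameCodes.
type Frame = [u32; 16];

fn split_frame(frame: &Frame) -> (u32, &[u32]) {
    (frame[0], &frame[1..])
}

fn main() {
    let mut frame: Frame = [0; 16];
    frame[0] = 42; // semantic token from the talker
    for (i, slot) in frame[1..].iter_mut().enumerate() {
        *slot = 100 + i as u32; // acoustic_0..14 from the code predictor
    }
    let (semantic, acoustic) = split_frame(&frame);
    assert_eq!(semantic, 42);
    assert_eq!(acoustic.len(), 15);
}
```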