# Qwen3 TTS - Rust CLI tools
A Rust implementation of Qwen3 Text-to-Speech (TTS) model inference. Provides three cross-platform CLI tools, suitable as agentic skills for AI agents and bots.
- tts — generate speech from text with named speaker voices
- voice_clone — clone a voice from reference audio
- api_server — OpenAI-compatible HTTP API server
Supports two backends: libtorch (via the tch crate, cross-platform with optional CUDA) and MLX (Apple Silicon native via Metal GPU).
Learn more:
- A Rust implementation / CLI for Qwen3's ASR (Automatic Speech Recognition or Speech-to-Text) models
- An OpenAI compatible API server for audio / speech
- An OpenClaw SKILL for voice generation. Copy and paste it into your lobster to install it
## Quick Start
Install binaries, models, and reference audio for your platform:
The installer detects your OS, CPU, and NVIDIA GPU (if present), then sets up everything in ./qwen3_tts_rs/.
### Text-to-Speech
Generate speech with a named speaker using the CustomVoice model:
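A hypothetical invocation following the argument order documented in the Reference section (the model directory path is an assumption, not a path the project guarantees):

```shell
# Illustrative path; point the first argument at your CustomVoice model directory
tts ./models/customvoice "Hello! This is Qwen3 TTS." Vivian english
```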
Output: `output.wav` (24 kHz)
### Voice Cloning
Clone a voice from reference audio using the Base model (ICL mode):
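A hypothetical invocation (paths are assumptions); passing the reference transcript as the final argument enables the higher-quality ICL mode:

```shell
# Illustrative paths; ref.wav must be mono 24 kHz 16-bit PCM
voice_clone ./models/base ref.wav "This is spoken in the cloned voice." english \
  "Transcript of what is said in ref.wav"
```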
Output: `output_voice_clone.wav` (24 kHz)
### API Server
Start the OpenAI-compatible API server with the CustomVoice model:
Then call the endpoint:
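A hypothetical session (model path assumed); the request body fields are documented in the Reference section:

```shell
# Start the server on the default host/port (illustrative model path)
api_server ./models/customvoice --host 127.0.0.1 --port 8080 &

# Request speech from the OpenAI-compatible endpoint
curl http://127.0.0.1:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello from the API!", "voice": "alloy", "response_format": "wav"}' \
  -o speech.wav
```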
## Reference
### tts — Text-to-Speech

```
tts <model_path> [text] [speaker] [language] [instruction]
```
| Argument | Default | Description |
|---|---|---|
| `model_path` | (required) | Path to model directory |
| `text` | "Hello! This is a test..." | Text to synthesize (max ~4096 chars) |
| `speaker` | Vivian | Speaker name (see below) |
| `language` | english | Language: english, chinese, japanese, korean |
| `instruction` | (empty) | Voice style instruction (1.7B models only) |
Output: output.wav (24 kHz, 16-bit PCM)
Available speakers (CustomVoice models): Vivian, Serena, Ryan, Aiden, Uncle_fu, Ono_anna, Sohee, Eric, Dylan
Instruction examples (1.7B CustomVoice only):
"Speak in an urgent and excited voice""Speak happily and joyfully""Speak slowly and calmly""Speak in a whisper"
### voice_clone — Voice Cloning

```
voice_clone <model_path> <ref_audio> [text] [language] [ref_text]
```
| Argument | Default | Description |
|---|---|---|
| `model_path` | (required) | Path to model directory |
| `ref_audio` | (required) | Path to reference WAV file |
| `text` | "Hello! This is a test..." | Text to synthesize |
| `language` | english | Language |
| `ref_text` | (none) | Transcript of reference audio (enables ICL mode, higher quality) |
Output: output_voice_clone.wav (24 kHz, 16-bit PCM)
Preparing reference audio: Must be mono 24 kHz 16-bit WAV. Convert with ffmpeg:
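A typical conversion (the input filename is illustrative): downmix to one channel, resample to 24 kHz, and write signed 16-bit PCM:

```shell
ffmpeg -i input.mp3 -ac 1 -ar 24000 -sample_fmt s16 ref.wav
```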
Example:
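A hypothetical end-to-end run (paths and texts are assumptions):

```shell
# Clone from ref.wav; the transcript argument enables ICL mode
voice_clone ./models/base ref.wav \
  "The cloned voice reads this new sentence." english \
  "The exact words spoken in ref.wav"
# Writes output_voice_clone.wav
```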
### api_server — OpenAI-Compatible API

```
api_server <model_path> [--host 127.0.0.1] [--port 8080]
```
| Option | Default | Description |
|---|---|---|
| `model_path` | (required) | Path to model directory |
| `--host` | 127.0.0.1 | Bind address |
| `--port` | 8080 | Listen port |
Endpoints:
| Method | Path | Description |
|---|---|---|
| POST | `/v1/audio/speech` | Generate speech (OpenAI-compatible) |
| GET | `/v1/models` | List available models |
| GET | `/health` | Health check |
API Request: `POST /v1/audio/speech`
| Field | Type | Default | Description |
|---|---|---|---|
| `input` | string | (required) | Text to synthesize (max 4096 chars) |
| `voice` | string | "alloy" | OpenAI name or Qwen3 speaker name (see mapping below) |
| `model` | string | — | Accepted for compatibility, ignored |
| `response_format` | string | "wav" | "wav", "pcm", "mp3", "flac", "ogg", or "opus" |
| `speed` | float | 1.0 | Speed multiplier (0.25–4.0) |
| `stream` | bool | false | Enable SSE streaming (requires "pcm") |
| `language` | string | "english" | english, chinese, japanese, korean, auto |
| `instructions` | string | — | Voice style instruction (1.7B models only) |
| `audio_sample` | string | — | Base64-encoded reference WAV for voice cloning |
| `audio_sample_text` | string | — | Transcript of reference audio (required with audio_sample) |
Voice name mapping (OpenAI → Qwen3):
| OpenAI | Qwen3 |
|---|---|
| alloy | serena |
| echo | ryan |
| fable | vivian |
| onyx | eric |
| nova | ono_anna |
| shimmer | sohee |
You can also pass Qwen3 speaker names directly (e.g., "voice": "vivian").
Streaming: When stream: true and response_format: "pcm", the server returns Server-Sent Events with base64-encoded PCM chunks:
```
data: {"type":"speech.audio.delta","delta":"<base64 PCM>"}
data: {"type":"speech.audio.done"}
```
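A client concatenates the decoded delta payloads until the done event arrives. A minimal Python sketch (the helper name and sample bytes are illustrative, not part of this crate):

```python
import base64
import json

def parse_sse_pcm(lines):
    """Collect raw PCM bytes from speech.audio.delta SSE events."""
    pcm = bytearray()
    for line in lines:
        if not line.startswith("data: "):
            continue
        event = json.loads(line[len("data: "):])
        if event["type"] == "speech.audio.delta":
            pcm.extend(base64.b64decode(event["delta"]))
        elif event["type"] == "speech.audio.done":
            break
    return bytes(pcm)

# Two events shaped like the examples above ("AAE=" encodes bytes 0x00 0x01)
sample = [
    'data: {"type":"speech.audio.delta","delta":"AAE="}',
    'data: {"type":"speech.audio.done"}',
]
print(parse_sse_pcm(sample))  # b'\x00\x01'
```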
Voice cloning via API:
```shell
# Encode reference audio as base64
REF_B64=
```
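A hedged completion of the truncated snippet above (GNU `base64 -w0`; on macOS use `base64 -i ref.wav`; server address and texts are illustrative):

```shell
# Encode reference audio as base64 (GNU coreutils; macOS: base64 -i ref.wav)
REF_B64=$(base64 -w0 ref.wav)

curl http://127.0.0.1:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d "{\"input\": \"This uses the cloned voice.\", \"audio_sample\": \"$REF_B64\", \"audio_sample_text\": \"Transcript of ref.wav\"}" \
  -o cloned.wav
```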
## Build from Source
### macOS (MLX backend)
Requires Apple Silicon Mac, Xcode, and CMake.
### Linux (libtorch backend)
1. Download libtorch from libtorch-releases. Builds are available for:
   - Linux x86_64 (CPU)
   - Linux x86_64 (CUDA 12.6)
   - Linux ARM64 (CPU)
   - Linux ARM64 (CUDA 12.6 / Jetson)
2. Set environment and build:
Alternatively, use pip-installed PyTorch instead of downloading libtorch:
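A sketch of the environment setup, assuming the standard `tch` crate conventions (`LIBTORCH`, `LD_LIBRARY_PATH`, and `LIBTORCH_USE_PYTORCH`); the libtorch path is a placeholder:

```shell
# Option A: point the tch crate at a downloaded libtorch
export LIBTORCH=/path/to/libtorch
export LD_LIBRARY_PATH=$LIBTORCH/lib:$LD_LIBRARY_PATH
cargo build --release

# Option B: reuse a pip-installed PyTorch instead of a separate download
export LIBTORCH_USE_PYTORCH=1
cargo build --release
```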
### Download models and generate tokenizer
After building, download models and generate tokenizer.json for each:
Binaries are in `target/release/`: `tts`, `voice_clone`, `api_server`.
## Rust library usage
Add to your Cargo.toml:
```toml
[dependencies]
qwen3-tts-rs = "0.2"

# Or for the MLX backend:
# qwen3-tts-rs = { version = "0.2", default-features = false, features = ["mlx"] }
```
See the API documentation on docs.rs for library usage examples.
## Performance (Apple M4 Mac, MLX backend)
Test sentences: ~15–20 words in English ("The quick brown fox..." / "Scientists have discovered...") and Chinese. RTF = Real-Time Factor (wall time / audio duration). Lower is better; < 1.0 means faster than real-time.
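As a sanity check on the tables below, RTF for the first 0.6B CLI row works out as follows (a small illustrative helper, not part of this crate):

```python
def rtf(wall_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: seconds of wall time per second of generated audio."""
    return wall_seconds / audio_seconds

# First 0.6B CLI row below: 10.93 s of wall time for 5.92 s of audio
print(round(rtf(10.93, 5.92), 2))  # 1.85
```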
### 0.6B CustomVoice

#### CLI (tts / voice_clone)
| Test | Speaker | Language | Audio | Wall Time | RTF |
|---|---|---|---|---|---|
| Preset voice | Vivian | English | 5.92s | 10.93s | 1.85x |
| Preset voice | Ryan | English | 8.16s | 14.15s | 1.73x |
| Preset voice | Vivian | Chinese | 6.64s | 11.26s | 1.70x |
| Voice clone (ICL) | ref audio | English | 7.04s | 16.77s | 2.38x |
#### API server (after warmup)
| Test | Voice | Mode | Audio | Wall Time | RTF |
|---|---|---|---|---|---|
| Non-streaming WAV | alloy (serena) | full | 6.40s | 9.90s | 1.55x |
| Non-streaming WAV | echo (ryan) | full | 8.00s | 12.70s | 1.59x |
| Streaming PCM | alloy (serena) | stream | ~6.4s | 10.22s | ~1.60x |
| Voice clone WAV | alloy + ref | full | 9.04s | 20.11s | 2.22x |
### 1.7B CustomVoice

#### CLI (tts)
| Test | Speaker | Language | Audio | Wall Time | RTF |
|---|---|---|---|---|---|
| Preset voice | Vivian | English | 6.24s | 18.92s | 3.03x |
| Preset voice | Ryan | English | 8.64s | 27.50s | 3.18x |
| Preset voice | Vivian | Chinese | 6.24s | 18.31s | 2.93x |
| Preset + instruction | Vivian | English | 8.80s | 29.97s | 3.41x |
#### API server (after warmup)
| Test | Voice | Mode | Audio | Wall Time | RTF |
|---|---|---|---|---|---|
| Non-streaming WAV | alloy (serena) | full | 5.52s | 15.14s | 2.74x |
| Non-streaming WAV | echo (ryan) | full | 8.64s | 26.31s | 3.05x |
| Streaming PCM | alloy (serena) | stream | 8.16s | 19.79s | 2.43x |
### 0.6B vs 1.7B comparison
| Metric | 0.6B avg RTF | 1.7B avg RTF | Slowdown |
|---|---|---|---|
| CLI preset voice | 1.76x | 3.05x | ~1.7x |
| API non-streaming | 1.57x | 2.90x | ~1.8x |
## Architecture
```
Text → Tokenizer → Dual-stream Embeddings → TalkerModel (28-layer Transformer)
                                                   ↓
                                         codec_head → Code 0
                                                   ↓
                               CodePredictor (5-layer Transformer) → Codes 1-15
                                                   ↓
                                       Vocoder → 24 kHz Waveform
```
## License
Apache-2.0
## Credits
Based on the original Python implementation by the Alibaba Qwen team.