§any-tts
A Rust text-to-speech library powered primarily by the candle ML framework. Provides a unified trait-based API with pluggable model backends, including native Candle implementations and adapters for official upstream runtimes.
§Supported Models
- Kokoro-82M — 82M parameter StyleTTS2 model with ISTFTNet decoder for fast, high-quality speech
- OmniVoice — native Candle implementation of the OmniVoice zero-shot TTS model
- Qwen3-TTS-12Hz-1.7B-CustomVoice — 1.7B parameter multi-codebook LM for 10 languages
- Qwen3-TTS-12Hz-1.7B-VoiceDesign — 1.7B model with natural language voice descriptions
- VibeVoice-1.5B — native Candle implementation of Microsoft’s multi-speaker speech diffusion model
- VibeVoice-Realtime-0.5B — native Candle implementation of Microsoft’s cached-prompt realtime TTS model
- Voxtral-4B-TTS-2603 — native Candle implementation of Mistral’s 4B TTS model
§Feature Flags
- cuda — Enable CUDA GPU acceleration
- metal — Enable Metal GPU acceleration (macOS/iOS)
- accelerate — Enable the Apple Accelerate framework
- kokoro — Build Kokoro model support (default)
- omnivoice — Build native OmniVoice support (default)
- qwen3-tts — Build Qwen3-TTS model support (default)
- vibevoice — Build native VibeVoice support (default)
- voxtral — Build native Voxtral support (default)
- download — Enable automatic model downloading from the Hugging Face Hub (default)
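The flags above can be combined in a Cargo.toml dependency entry. A minimal sketch (the version string is a placeholder, not taken from this page):

```toml
[dependencies]
# Opt out of the default feature set and enable only one backend,
# plus downloads and CUDA acceleration.
any-tts = { version = "*", default-features = false, features = ["qwen3-tts", "download", "cuda"] }
```

Leaving default-features on builds every model backend, which increases compile time; trimming to the backends you actually load is the usual pattern.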
§Quick Start
use any_tts::{TtsModel, TtsConfig, SynthesisRequest, ModelType};
// Load a model
let config = TtsConfig::new(ModelType::Qwen3Tts)
.with_model_path("/path/to/model");
let model = any_tts::load_model(config).unwrap();
// Synthesize speech
let request = SynthesisRequest::new("Hello, world!")
.with_language("en");
let audio = model.synthesize(&request).unwrap();
// audio.samples contains f32 PCM data at model.sample_rate() Hz
let wav_bytes = audio.get_wav();
let _ = wav_bytes;

Re-exports§
pub use audio::AudioSamples;
pub use audio::DenoiseOptions;
pub use config::preferred_runtime_choice;
pub use config::preferred_runtime_choices;
pub use config::DType;
pub use config::ModelAsset;
pub use config::ModelAssetBundle;
pub use config::ModelAssetDir;
pub use config::ModelFiles;
pub use config::RuntimeChoice;
pub use config::TtsConfig;
pub use device::DeviceSelection;
pub use error::TtsError;
pub use mel::MelConfig;
pub use mel::MelSpectrogram;
pub use models::ModelAssetRequirement;
pub use models::ModelType;
pub use traits::ModelInfo;
pub use traits::ReferenceAudio;
pub use traits::SynthesisRequest;
pub use traits::TtsModel;
pub use traits::VoiceCloning;
pub use traits::VoiceEmbedding;
Modules§
- audio
- Audio output types and utilities.
- config
- Configuration types for TTS models.
- device
- Device selection utilities.
- download
- Built-in Hugging Face model download utilities.
- error
- Error types for any-tts.
- layers
- Shared neural-network building blocks used by model backends.
- mel
- Mel spectrogram extraction for voice cloning and audio analysis.
- models
- Model backends for TTS synthesis.
- tensor_utils
- Shared tensor utilities for model implementations.
- tokenizer
- Text tokenizer wrapper.
- traits
- Core TTS trait and request/response types.
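The Quick Start shows that synthesized audio is f32 PCM and that get_wav produces WAV bytes. For intuition about what that conversion involves, here is a self-contained sketch (independent of this crate, not its actual implementation) that wraps f32 samples in a 16-bit PCM RIFF/WAVE container:

```rust
// Minimal WAV encoder sketch: wraps mono f32 PCM in a 16-bit PCM RIFF/WAVE container.
fn pcm_f32_to_wav(samples: &[f32], sample_rate: u32) -> Vec<u8> {
    let num_channels: u16 = 1;
    let bits_per_sample: u16 = 16;
    let byte_rate = sample_rate * num_channels as u32 * bits_per_sample as u32 / 8;
    let block_align = num_channels * bits_per_sample / 8;
    let data_len = (samples.len() * 2) as u32;

    let mut out = Vec::with_capacity(44 + samples.len() * 2);
    out.extend_from_slice(b"RIFF");
    out.extend_from_slice(&(36 + data_len).to_le_bytes()); // remaining file size
    out.extend_from_slice(b"WAVE");
    out.extend_from_slice(b"fmt ");
    out.extend_from_slice(&16u32.to_le_bytes()); // fmt chunk size
    out.extend_from_slice(&1u16.to_le_bytes()); // PCM format tag
    out.extend_from_slice(&num_channels.to_le_bytes());
    out.extend_from_slice(&sample_rate.to_le_bytes());
    out.extend_from_slice(&byte_rate.to_le_bytes());
    out.extend_from_slice(&block_align.to_le_bytes());
    out.extend_from_slice(&bits_per_sample.to_le_bytes());
    out.extend_from_slice(b"data");
    out.extend_from_slice(&data_len.to_le_bytes());
    for &s in samples {
        // Clamp to [-1, 1] and quantize to i16, little-endian.
        let v = (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16;
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

fn main() {
    let samples = vec![0.0f32, 0.5, -0.5];
    let wav = pcm_f32_to_wav(&samples, 24_000);
    assert_eq!(&wav[0..4], b"RIFF");
    assert_eq!(wav.len(), 44 + samples.len() * 2);
    println!("wav bytes: {}", wav.len());
}
```

In practice you would use AudioSamples::get_wav rather than hand-rolling the header; the sketch only illustrates the 44-byte header plus interleaved little-endian samples that a WAV byte stream consists of.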
Functions§
- load_model
- Load a TTS model based on the provided configuration.
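Several of the supported models are zero-shot voice cloners, and the crate re-exports ReferenceAudio and VoiceCloning from traits. A hedged sketch of how they might combine with load_model — the from_wav_file constructor and with_reference_audio builder below are hypothetical names, not confirmed by this page, and the ModelType::OmniVoice variant is likewise assumed:

```rust
use any_tts::{ModelType, ReferenceAudio, SynthesisRequest, TtsConfig};

// Load a backend that supports zero-shot cloning (see Supported Models).
let config = TtsConfig::new(ModelType::OmniVoice) // variant name assumed
    .with_model_path("/path/to/model");
let model = any_tts::load_model(config).unwrap();

// Hypothetical API: attach reference audio to the synthesis request.
let reference = ReferenceAudio::from_wav_file("speaker.wav").unwrap(); // assumed constructor
let request = SynthesisRequest::new("Cloned voice says hello.")
    .with_language("en")
    .with_reference_audio(reference); // assumed builder method

let audio = model.synthesize(&request).unwrap();
std::fs::write("cloned.wav", audio.get_wav()).unwrap();
```

Consult the traits module documentation for the actual VoiceCloning surface; only load_model, TtsConfig, SynthesisRequest, synthesize, and get_wav are confirmed by this page.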