Crate any_tts

§any-tts

A Rust text-to-speech library powered primarily by the Candle ML framework. It provides a unified, trait-based API with pluggable model backends, including native Candle implementations and adapters for official upstream runtimes.
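To make the "unified trait with pluggable backends" idea concrete, here is a minimal, self-contained sketch of the pattern. Every name in it (`Request`, `Backend`, `ToneBackend`, `load_backend`) is hypothetical and chosen for illustration; these are not the crate's actual `TtsModel` or `SynthesisRequest` definitions, and the toy backend emits a sine tone rather than speech.

```rust
use std::f32::consts::PI;

/// Hypothetical stand-in for a synthesis request (not the crate's `SynthesisRequest`).
struct Request {
    text: String,
}

/// Hypothetical stand-in for the unified trait (not the crate's `TtsModel`).
trait Backend {
    fn sample_rate(&self) -> u32;
    fn synthesize(&self, request: &Request) -> Result<Vec<f32>, String>;
}

/// Toy backend: emits 10 ms of a 440 Hz tone per input character.
struct ToneBackend {
    sample_rate: u32,
}

impl Backend for ToneBackend {
    fn sample_rate(&self) -> u32 {
        self.sample_rate
    }

    fn synthesize(&self, request: &Request) -> Result<Vec<f32>, String> {
        if request.text.is_empty() {
            return Err("empty text".to_string());
        }
        // sample_rate / 100 samples correspond to 10 ms of audio.
        let per_char = (self.sample_rate / 100) as usize;
        let total = per_char * request.text.chars().count();
        Ok((0..total)
            .map(|i| (2.0 * PI * 440.0 * i as f32 / self.sample_rate as f32).sin())
            .collect())
    }
}

/// Callers hold only a trait object, so backends can be swapped by name.
fn load_backend(name: &str) -> Option<Box<dyn Backend>> {
    match name {
        "tone" => Some(Box::new(ToneBackend { sample_rate: 24_000 })),
        _ => None,
    }
}
```

Calling `load_backend("tone")` and then `synthesize` on the returned trait object works without the caller knowing which concrete type sits behind it, which is the property a pluggable-backend design relies on.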

§Supported Models

  • Kokoro-82M — 82M-parameter StyleTTS2 model with an ISTFTNet decoder for fast, high-quality speech
  • OmniVoice — native Candle implementation of the OmniVoice zero-shot TTS model
  • Qwen3-TTS-12Hz-1.7B-CustomVoice — 1.7B-parameter multi-codebook LM supporting 10 languages
  • Qwen3-TTS-12Hz-1.7B-VoiceDesign — 1.7B-parameter model that accepts natural-language voice descriptions
  • VibeVoice-1.5B — native Candle implementation of Microsoft’s multi-speaker speech diffusion model
  • VibeVoice-Realtime-0.5B — native Candle implementation of Microsoft’s cached-prompt realtime TTS model
  • Voxtral-4B-TTS-2603 — native Candle implementation of Mistral’s 4B TTS model

§Feature Flags

  • cuda — Enable CUDA GPU acceleration
  • metal — Enable Metal GPU acceleration (macOS/iOS)
  • accelerate — Enable Apple Accelerate framework
  • kokoro — Build Kokoro model support (default)
  • omnivoice — Build native OmniVoice support (default)
  • qwen3-tts — Build Qwen3-TTS model support (default)
  • vibevoice — Build native VibeVoice support (default)
  • voxtral — Build native Voxtral support (default)
  • download — Enable automatic model downloading from the Hugging Face Hub (default)
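
As a rough illustration of how the flags above might be combined, the fragment below disables the default feature set and opts back in to a single model backend plus GPU support. The package name `any-tts` (taken from the heading of this page) and the wildcard version are assumptions, not values confirmed here; pin a real published version before relying on it, and note that whether a given backend also needs `download` is not stated on this page.

```toml
[dependencies]
# Assumption: package name and version are placeholders.
# default-features = false drops all default model backends and `download`;
# the features list then re-enables only what this build needs.
any-tts = { version = "*", default-features = false, features = ["kokoro", "download", "cuda"] }
```

Trimming the defaults this way mainly reduces compile time and binary size when only one model family is used.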

§Quick Start

use any_tts::{TtsModel, TtsConfig, SynthesisRequest, ModelType};

// Load a model
let config = TtsConfig::new(ModelType::Qwen3Tts)
    .with_model_path("/path/to/model");
let model = any_tts::load_model(config).unwrap();

// Synthesize speech
let request = SynthesisRequest::new("Hello, world!")
    .with_language("en");
let audio = model.synthesize(&request).unwrap();

// audio.samples contains f32 PCM data at model.sample_rate() Hz
let wav_bytes = audio.get_wav();
let _ = wav_bytes;
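
For readers who want to see what turning f32 PCM samples into WAV bytes involves, the following std-only sketch writes a canonical 44-byte header followed by 16-bit mono samples. The function `encode_wav_pcm16` is a hypothetical helper written for this illustration; it is not the crate's `get_wav()` implementation, and the mono, 16-bit layout is an assumption.

```rust
/// Encode mono f32 samples as a 16-bit PCM WAV byte buffer.
/// Samples outside [-1.0, 1.0] are clamped before quantization.
fn encode_wav_pcm16(samples: &[f32], sample_rate: u32) -> Vec<u8> {
    let channels: u16 = 1; // assumption: mono output
    let bits_per_sample: u16 = 16;
    let block_align = channels * bits_per_sample / 8; // bytes per frame
    let byte_rate = sample_rate * u32::from(block_align);
    let data_len = (samples.len() * usize::from(block_align)) as u32;

    let mut out = Vec::with_capacity(44 + data_len as usize);
    // RIFF container header; the size field excludes the first 8 bytes.
    out.extend_from_slice(b"RIFF");
    out.extend_from_slice(&(36 + data_len).to_le_bytes());
    out.extend_from_slice(b"WAVE");
    // "fmt " chunk describing the sample layout.
    out.extend_from_slice(b"fmt ");
    out.extend_from_slice(&16u32.to_le_bytes()); // fmt chunk size
    out.extend_from_slice(&1u16.to_le_bytes()); // format tag 1 = integer PCM
    out.extend_from_slice(&channels.to_le_bytes());
    out.extend_from_slice(&sample_rate.to_le_bytes());
    out.extend_from_slice(&byte_rate.to_le_bytes());
    out.extend_from_slice(&block_align.to_le_bytes());
    out.extend_from_slice(&bits_per_sample.to_le_bytes());
    // "data" chunk with little-endian i16 samples.
    out.extend_from_slice(b"data");
    out.extend_from_slice(&data_len.to_le_bytes());
    for &s in samples {
        let v = (s.clamp(-1.0, 1.0) * 32767.0).round() as i16;
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}
```

The resulting buffer can be written to disk with `std::fs::write` and opened by ordinary audio players, provided the sample rate passed in matches the rate the model actually produced.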

Re-exports§

pub use audio::AudioSamples;
pub use audio::DenoiseOptions;
pub use config::preferred_runtime_choice;
pub use config::preferred_runtime_choices;
pub use config::DType;
pub use config::ModelAsset;
pub use config::ModelAssetBundle;
pub use config::ModelAssetDir;
pub use config::ModelFiles;
pub use config::RuntimeChoice;
pub use config::TtsConfig;
pub use device::DeviceSelection;
pub use error::TtsError;
pub use mel::MelConfig;
pub use mel::MelSpectrogram;
pub use models::ModelAssetRequirement;
pub use models::ModelType;
pub use traits::ModelInfo;
pub use traits::ReferenceAudio;
pub use traits::SynthesisRequest;
pub use traits::TtsModel;
pub use traits::VoiceCloning;
pub use traits::VoiceEmbedding;

Modules§

audio
Audio output types and utilities.
config
Configuration types for TTS models.
device
Device selection utilities.
download
Built-in Hugging Face model download utilities.
error
Error types for any-tts.
layers
Shared neural-network building blocks used by model backends.
mel
Mel spectrogram extraction for voice cloning and audio analysis.
models
Model backends for TTS synthesis.
tensor_utils
Shared tensor utilities for model implementations.
tokenizer
Text tokenizer wrapper.
traits
Core TTS trait and request/response types.

Functions§

load_model
Load a TTS model based on the provided configuration.