any-tts
The Flow-Like icon at the top of this README is intentional.
any-tts is a Rust text-to-speech library built around Candle with one trait-based API for multiple open-weight model families. You can point it at local files, hand it explicit paths from your own cache, feed it named in-memory byte assets from an object store, or let it resolve missing assets from Hugging Face and keep the synthesis call site unchanged.
If you want one Rust TTS surface for small local models, multilingual research checkpoints, and agent-oriented voice stacks without rewriting your application around each model family, this is the repo.
For Flow-Like specifically: every public backend can now load from relative-path byte assets, so `object_store` reads can go straight into `TtsConfig` without writing temp files first.
Why this repo exists
- One API for Kokoro, OmniVoice, Qwen3-TTS, VibeVoice, and Voxtral.
- Native Rust backends across the public model surface.
- Local path loading, in-memory byte bundles, per-file wiring, or Hugging Face fallback.
- CPU first, GPU when available: CUDA, Metal, and Accelerate build targets.
- Request-level control for `language`, `voice`, `instruct`, `max_tokens`, `temperature`, and `cfg_scale`.
- WAV output everywhere, with built-in WAV and MP3 input decoding for cleanup and reference-audio workflows.
Public model support
| Model | Status in any-tts | Default upstream | Best at | Main tradeoff | Model license |
|---|---|---|---|---|---|
| Kokoro-82M | Public, native, lightweight | `hexgrad/Kokoro-82M` | Fast local TTS with small weights | Uses an in-tree pure-Rust phonemizer compatible with Kokoro's current public language set; parity tuning is still ongoing | Apache-2.0 |
| OmniVoice | Public, native | `k2-fsa/OmniVoice` | Huge language coverage and instruct-driven voice design | The current Rust backend does not yet expose upstream zero-shot cloning | Apache-2.0 |
| Qwen3-TTS-12Hz-1.7B | Public, native | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | Strong multilingual control, named speakers, and instruct handling | Heavy weights and extra speech-tokenizer assets | Apache-2.0 |
| VibeVoice-1.5B | Public, native | `microsoft/VibeVoice-1.5B` | Long-form multi-speaker speech diffusion with native Rust inference | Still early; currently optimized for single-request parity work rather than streaming performance | MIT |
| Voxtral-4B-TTS-2603 | Public, native | `mistralai/Voxtral-4B-TTS-2603` | Production-style voice agents, preset voices, low-latency oriented stack | Largest backend here and not commercially permissive | CC BY-NC 4.0 |
Important: the Rust crate is dual licensed under MIT OR Apache-2.0. The model weights are not. Always check the model-specific license before shipping. Voxtral is the one that changes the deployment story the most because its published checkpoint is CC BY-NC 4.0.
Referenced but not yet public API
| Model | Where it appears in this repo | Current status | Upstream license | Notes |
|---|---|---|---|---|
| KugelAudio-0-Open | `src/models/kugelaudio/` and `examples/generate_kugelaudio.rs` | In-tree experiment, not exported from the public `ModelType` enum | MIT | Focused on 24 European languages and pre-encoded voices, but the current example targets a model variant that is not part of the crate's exported API surface yet. |
That split matters. The rest of this README treats Kokoro, OmniVoice, Qwen3-TTS, VibeVoice, and Voxtral as supported top-level backends, and it treats KugelAudio as work in progress.
Installation
For CPU-only builds, a recent stable Rust toolchain is enough. For GPU builds, compile the feature set that matches your machine. Kokoro no longer requires a system espeak-ng install: the repo now ships an in-tree pure-Rust phonemizer with an espeak-rs-compatible interface for the language set exposed by the current Kokoro backend.
Add the crate from crates.io:
```toml
[dependencies]
any-tts = "0.1"
```
Or opt into a smaller feature set:
```toml
[dependencies]
any-tts = { version = "0.1", default-features = false, features = ["kokoro", "download", "metal"] }
```
Feature flags
By default the crate enables `qwen3-tts`, `kokoro`, `omnivoice`, `vibevoice`, `voxtral`, and `download`.
| Feature | What it does |
|---|---|
| `kokoro` | Enables the Kokoro backend. |
| `omnivoice` | Enables the native OmniVoice backend. |
| `qwen3-tts` | Enables the Qwen3-TTS backend. |
| `vibevoice` | Enables the native VibeVoice backend. |
| `voxtral` | Enables the native Voxtral backend. |
| `download` | Allows missing model files to be pulled from Hugging Face Hub through the crate's built-in downloader. |
| `cuda` | Builds Candle with CUDA support. |
| `metal` | Builds Candle with Metal support for Apple GPUs. |
| `accelerate` | Enables Apple Accelerate support for CPU-heavy Apple builds. |
Backend selection
- `DeviceSelection::Auto` tries CUDA first, then Metal, then CPU.
- `DeviceSelection::Cpu`, `DeviceSelection::Cuda(0)`, and `DeviceSelection::Metal(0)` let you force the runtime target.
- `preferred_runtime_choice(ModelType::...)` returns the fastest safe device and dtype for the current machine.
- `TtsConfig::with_preferred_runtime()` applies that runtime choice in one builder call.
- `DType` can be set to `F32`, `F16`, or `BF16`.
- On CPU, models that cannot safely run BF16 fall back to `F32`.
- The native OmniVoice helper prefers `cuda:0` (bf16), then `metal:0` (f32), then `cpu` (f32).
Quick start
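A minimal end-to-end call, sketched against the type and builder names used throughout this README. The constructor and setter signatures here (`TtsConfig::new()`, `SynthesisRequest::new()`, `with_language()`, `with_voice()`) are assumptions, not verified against the released crate:

```rust
use any_tts::{load_model, ModelType, SynthesisRequest, TtsConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Resolve Kokoro from a local directory; missing files fall back to
    // Hugging Face when the `download` feature is enabled.
    let config = TtsConfig::new(ModelType::Kokoro).with_model_path("models/kokoro");
    let model = load_model(config)?;

    let request = SynthesisRequest::new("Hello from any-tts!")
        .with_language("en")
        .with_voice("af_heart"); // voice name depends on the voices/ packs you have

    let audio = model.synthesize(&request)?;
    audio.save_wav("output/hello.wav")?;
    Ok(())
}
```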
Byte-first loading
If your runtime already has model artifacts in memory, use `ModelAssetBundle` or the `with_*_bytes()` builders instead of writing them to disk first.
The relative paths in the asset bundle should match the model layout documented below, for example `config.json`, `audio_tokenizer/model.safetensors`, or `voice_embedding/Aurora.pt`.
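When the artifacts come out of an object store, the flow might look like this sketch; `ModelAssetBundle::new()`/`insert()` and the builder signatures are assumed shapes for the byte-first API, not confirmed ones:

```rust
use any_tts::{load_model, ModelAssetBundle, ModelType, TtsConfig};

fn load_from_memory(
    config_bytes: Vec<u8>,
    weights_bytes: Vec<u8>,
) -> Result<(), Box<dyn std::error::Error>> {
    // Relative paths mirror the on-disk layouts documented below.
    let mut bundle = ModelAssetBundle::new();
    bundle.insert("config.json", config_bytes);
    bundle.insert("model.safetensors", weights_bytes);

    let config = TtsConfig::new(ModelType::Kokoro).with_asset_bundle(bundle);
    let _model = load_model(config)?;
    Ok(())
}
```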
Audio bytes
Generated audio already comes back as AudioSamples, so output does not need filesystem paths either.
- `audio.get_wav()` returns a complete WAV file as `Vec<u8>` for every backend.
- `audio.save_wav()` is a convenience helper on top of the same byte encoder.
Audio cleanup
The denoiser auto-detects WAV and MP3 input streams and applies a speech-band filter plus a short-time spectral gate. It is useful for attenuating steady background noise and background music, but it is not a full voice-isolation or source-separation model.
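The auto-detection step can be illustrated with plain magic-byte checks. This standalone sketch is not the crate's API; it only shows the standard WAV and MP3 signatures such a sniffer relies on:

```rust
// Detect WAV vs MP3 from leading magic bytes, the same kind of check a
// denoiser front-end performs before choosing a decoder.
fn sniff_format(bytes: &[u8]) -> Option<&'static str> {
    if bytes.len() >= 12 && &bytes[0..4] == b"RIFF" && &bytes[8..12] == b"WAVE" {
        Some("wav") // RIFF container declaring a WAVE payload
    } else if bytes.len() >= 3 && &bytes[0..3] == b"ID3" {
        Some("mp3") // MP3 with a leading ID3v2 tag
    } else if bytes.len() >= 2 && bytes[0] == 0xFF && (bytes[1] & 0xE0) == 0xE0 {
        Some("mp3") // raw MPEG audio frame sync word
    } else {
        None
    }
}

fn main() {
    assert_eq!(sniff_format(b"RIFF\x24\x08\x00\x00WAVEfmt "), Some("wav"));
    assert_eq!(sniff_format(b"ID3\x04\x00\x00"), Some("mp3"));
    assert_eq!(sniff_format(&[0xFF, 0xFB, 0x90, 0x00]), Some("mp3"));
    assert_eq!(sniff_format(b"OggS"), None);
    println!("format sniffing ok");
}
```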
File resolution flow
any-tts resolves model assets in four tiers, in this order:
1. Explicit files you set on `TtsConfig` with methods like `with_config_file()` or `with_weight_file()`.
2. Auto-discovery from `with_asset_bundle()` or `with_asset_bytes()` using model-relative paths.
3. Auto-discovery from `with_model_path()` using the expected filenames for that backend.
4. Hugging Face fallback through the `download` feature.
That means you can mix strategies. A service with its own artifact cache can hand over a few exact files and let the crate discover or download the rest.
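A sketch of that mixing, with the `ModelType::Qwen3Tts` variant name, builder signatures, and cache paths all assumed for illustration:

```rust
use any_tts::{load_model, ModelType, TtsConfig};

fn load_from_cache() -> Result<(), Box<dyn std::error::Error>> {
    // Tier 1: pin the two files our artifact cache already holds.
    // Tiers 3 and 4 discover or download everything else.
    let config = TtsConfig::new(ModelType::Qwen3Tts)
        .with_config_file("/var/cache/models/qwen3/config.json")
        .with_weight_file("/var/cache/models/qwen3/model.safetensors");
    let _model = load_model(config)?;
    Ok(())
}
```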
Model asset layouts
You can inspect the documented manifest programmatically through `ModelType::asset_requirements()`. The expected relative paths are:
| Model | Required asset patterns | Optional asset patterns |
|---|---|---|
| Kokoro | `config.json`, `model.safetensors` or `*.pth` | `voices/*.pt` |
| OmniVoice | `config.json`, `tokenizer.json`, `model.safetensors` or `model-*-of-*.safetensors`, `audio_tokenizer/config.json`, `audio_tokenizer/model.safetensors` or `audio_tokenizer/model-*-of-*.safetensors` | `generation_config.json` |
| Qwen3-TTS | `config.json`, `tokenizer.json`, `model.safetensors` or `model-*-of-*.safetensors`, `speech_tokenizer/model.safetensors` or `speech_tokenizer/model-*-of-*.safetensors` | `speech_tokenizer/config.json`, `generation_config.json` |
| VibeVoice | `config.json`, `tokenizer.json`, `model.safetensors` or `model-*-of-*.safetensors` | `preprocessor_config.json`, `generation_config.json` |
| Voxtral | `params.json`, `tekken.json`, `consolidated.safetensors`, `voice_embedding/*.pt` | none |
Using these exact relative paths makes the byte-based API predictable, because the same names are what `with_model_path()` auto-discovery expects on disk.
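Checking a manifest before staging bytes might look like this sketch; the item type returned by `asset_requirements()` is not specified in this README, so the loop only assumes it is debug-printable:

```rust
use any_tts::ModelType;

fn main() {
    // List what Voxtral expects before uploading assets to a bundle.
    for requirement in ModelType::Voxtral.asset_requirements() {
        println!("{requirement:?}");
    }
}
```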
Request controls
SynthesisRequest keeps the per-call control surface stable across models.
| Field | Purpose | Notes |
|---|---|---|
| `text` | Input text to synthesize | Required for every backend. |
| `language` | Language tag or model-specific language name | Supports ISO tags in several backends and `auto` where available. |
| `voice` | Named speaker or preset voice | Works for Kokoro, Qwen3 CustomVoice, and Voxtral. OmniVoice rejects named voices. |
| `instruct` | Natural-language style control | Most useful on OmniVoice and Qwen3. |
| `max_tokens` | Upper bound on generated codec/audio tokens | Helpful for latency testing and smoke tests. |
| `temperature` | Sampling temperature | Supported where the backend uses it. |
| `cfg_scale` | Classifier-free guidance scale | Used by OmniVoice and other backends that expose CFG-like control. |
| `reference_audio` | Reference clip for voice cloning | Only partially supported today; unsupported backends return an explicit error. |
| `voice_embedding` | Precomputed embedding payload | Currently reusable with backends that accept embeddings directly. |
Examples in this repo
These are the example entry points that match the current public crate surface:
Outputs are written under output/ by the example binaries.
`generate_vibevoice` keeps writing the main raw render to the configured `VIBEVOICE_OUTPUT` path and also writes `*_base.wav`, `*_denoised_default.wav`, and `*_denoised_aggressive.wav` under `output/denoise/` by default. You can override that folder with `VIBEVOICE_DENOISE_DIR`.
`generate_comparison_suite` writes a shared English and German comparison set under `output/model_comparison/cpu/` and `output/model_comparison/metal/`, plus `report.json` files with per-model load time, per-sample synthesis time, audio duration, and realtime factor. It loads one model at a time so the full suite can run sequentially on tighter memory budgets.
Note on KugelAudio: there is an in-tree `generate_kugelaudio` example, but it currently targets a non-exported `ModelType` variant and should be treated as experimental repo work rather than public API.
Model guide
Kokoro-82M
What it is
Kokoro is the compact option in this repo: an 82M-parameter StyleTTS2 plus ISTFTNet stack with Apache-licensed weights. In practice, it is the backend you reach for when you want a fast local model, simple deployment, and a much smaller download than the larger multilingual checkpoints.
What works in any-tts today
- Native Rust inference.
- Default output at 24 kHz.
- Named preset voices discovered from the `voices/` directory.
- Language tags exposed by the current backend: `en`, `ja`, `zh`, `ko`, `fr`, `de`, `it`, `pt`, `es`, `hi`.
- Optional voice-cloning support only when a checkpoint includes style-encoder weights.
Pros
- Small enough to be the practical local-first choice.
- Apache-2.0 model license makes deployment straightforward.
- Good fit for desktop apps, tools, and low-latency local generation.
- Simple model layout relative to the larger codebook-based stacks.
Cons
- The pure-Rust phonemizer used for text input is still being tuned for parity with upstream espeak-based phonemization.
- The common open release is mostly about preset voice packs, not raw zero-shot cloning.
- Less expressive control than the bigger instruct-heavy model families.
License
- Upstream model weights: Apache-2.0.
- Crate code using the model:
MIT OR Apache-2.0.
OmniVoice
What it is
OmniVoice is the ambition play in this repo. Upstream, it is a diffusion language model TTS stack aimed at omnilingual zero-shot speech generation with voice design and massive language coverage.
What works in any-tts today
- Native Candle backend.
- `language`, `instruct`, `cfg_scale`, and `max_tokens` request controls.
- Automatic runtime preference selection for CPU, CUDA, or Metal.
- Repo-exposed language set: `auto`, `en`, `zh`, `ja`, `ko`, `de`, `fr`, `es`, `pt`, `ru`, `it`.
What does not work yet in the Rust backend
- Named voices.
- Reference-audio voice cloning.
- Reusable voice embeddings.
The code returns explicit errors for those cases instead of silently falling back to Python.
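For example, a named-voice request against an already-loaded OmniVoice backend should surface as an explicit `Err`. In this sketch, `model` is assumed to exist and the builder names follow the request table earlier in this README:

```rust
use any_tts::SynthesisRequest;

// `model` is an already-loaded OmniVoice backend.
let request = SynthesisRequest::new("Bonjour tout le monde")
    .with_language("fr")
    .with_voice("narrator"); // named voices are unsupported on this backend

match model.synthesize(&request) {
    Ok(audio) => audio.save_wav("output/omnivoice.wav")?,
    Err(err) => eprintln!("backend refused the request: {err}"),
}
```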
Pros
- Strong upstream story for language coverage.
- Good fit for instruction-driven voice design.
- Benchmark helper in this repo already makes backend comparisons easy.
- Apache-2.0 model license.
Cons
- The current Rust implementation exposes less than the upstream model card promises.
- If your main requirement is zero-shot cloning from reference audio, this backend is not there yet in this crate.
- Heavier than Kokoro and less turnkey than the small local-first path.
License
- Upstream model weights: Apache-2.0.
- Crate code using the model:
MIT OR Apache-2.0.
Qwen3-TTS
What it is
Qwen3-TTS is the control-heavy multilingual option. It uses a discrete multi-codebook language model plus a speech-tokenizer decoder and is designed for named speakers, instruction-following, and multiple TTS operating modes.
What works in any-tts today
- Native Rust backend.
- Default path points to `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`.
- Named speaker generation for CustomVoice checkpoints.
- VoiceDesign checkpoints also work when selected through `with_hf_model_id()` or local files.
- 24 kHz output with the extra speech-tokenizer weights resolved alongside the main model.
- Repo-level language support tracks the current checkpoint config and includes `auto`.
Example: switch from CustomVoice to VoiceDesign
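A sketch of that switch; the VoiceDesign repo id shown here is illustrative, and the `ModelType::Qwen3Tts` variant and builder signatures are assumptions:

```rust
use any_tts::{load_model, ModelType, SynthesisRequest, TtsConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Point the same backend at a VoiceDesign checkpoint instead of the
    // default CustomVoice repo.
    let config = TtsConfig::new(ModelType::Qwen3Tts)
        .with_hf_model_id("Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign");
    let model = load_model(config)?;

    let audio = model.synthesize(
        &SynthesisRequest::new("Design me a voice.")
            .with_instruct("A calm, low-pitched narrator with a slight rasp."),
    )?;
    audio.save_wav("output/qwen3_voicedesign.wav")?;
    Ok(())
}
```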
Pros
- Best overall control surface in the current public crate API.
- Strong multilingual coverage: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian upstream.
- Named speakers are easy to use from a single request builder.
- VoiceDesign gives you a second mode without changing the main crate API.
Cons
- Large download and memory footprint compared with Kokoro.
- Requires the separate speech-tokenizer decoder assets in addition to the main weights.
- Upstream Base-model voice cloning exists, but reference-audio cloning is not implemented in this crate yet.
License
- Upstream model weights: Apache-2.0.
- Crate code using the model:
MIT OR Apache-2.0.
Voxtral-4B-TTS-2603
What it is
Voxtral is the biggest public backend in the repo and the most obviously voice-agent-oriented. It pairs a language model with acoustic generation and preset voice embeddings, and the published checkpoint is tuned for multilingual, low-latency TTS deployment scenarios.
What works in any-tts today
- Native Rust backend.
- Preset voice selection from the checkpoint's `voice_embedding/` assets.
- Optional direct `voice_embedding` reuse when you already have a compatible embedding.
- Repo-exposed languages: `en`, `fr`, `es`, `de`, `it`, `pt`, `nl`, `ar`, `hi`.
- Default sample rate is resolved from the model config; the published checkpoint outputs at 24 kHz.
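Preset selection can be sketched like this, assuming the builder names used elsewhere in this README; "Aurora" maps to the `voice_embedding/Aurora.pt` asset named in the layout section:

```rust
use any_tts::{load_model, ModelType, SynthesisRequest, TtsConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = TtsConfig::new(ModelType::Voxtral).with_model_path("models/voxtral");
    let model = load_model(config)?;

    // Preset voices map to the checkpoint's voice_embedding/*.pt files.
    let audio = model.synthesize(
        &SynthesisRequest::new("How can I help you today?")
            .with_language("en")
            .with_voice("Aurora"),
    )?;
    audio.save_wav("output/voxtral_aurora.wav")?;
    Ok(())
}
```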
Pros
- Best fit here for production-style voice-agent workloads.
- Upstream model card emphasizes streaming and low time-to-first-audio.
- Comes with preset voices and a clear multilingual story.
Cons
- The open checkpoint does not ship reference-audio encoder weights, so raw voice cloning is unavailable.
- This is the heaviest public backend in the crate.
- The published model license is
CC BY-NC 4.0, so commercial deployment needs extra care or a different model choice.
License
- Upstream model weights: CC BY-NC 4.0.
- Crate code using the model:
MIT OR Apache-2.0.
KugelAudio-0-Open
What it is
KugelAudio is the European-language-focused experimental backend that already lives in the repo tree but is not yet wired into the public load_model() surface. The upstream project positions it as an open-source AR plus diffusion TTS stack trained for 24 European languages with pre-encoded voices.
What the repo tells us today
- There is a full in-tree Rust model implementation.
- The example targets `KugelAudio-0-Open` and expects a large model footprint.
- The Rust code explicitly states that raw reference-audio cloning is not implemented yet.
- The public API does not currently export the model, so treat it as active development rather than a supported stable backend.
Pros
- Strong European language positioning.
- MIT-licensed upstream software.
- Clear room for a future public backend if the exported API catches up with the in-tree implementation.
Cons
- Not part of the exported crate surface yet.
- Large model size and memory requirements.
- Documentation and examples in this repo should currently be read as experimental for this model.
License
- Upstream software repository: MIT.
- Public model and deployment terms should still be verified case by case before production use.
Usage patterns that hold across models
Local directory loading
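A sketch of directory-based loading, with the `ModelType::VibeVoice` variant and builder signatures assumed:

```rust
use any_tts::{load_model, ModelType, TtsConfig};

fn load_local() -> Result<(), Box<dyn std::error::Error>> {
    // Every asset is auto-discovered from one directory that follows the
    // layouts documented in the asset table above.
    let config = TtsConfig::new(ModelType::VibeVoice).with_model_path("models/vibevoice");
    let _model = load_model(config)?;
    Ok(())
}
```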
Explicit file-path loading
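A sketch of per-file wiring, with the paths and builder signatures assumed for illustration:

```rust
use any_tts::{load_model, ModelType, TtsConfig};

fn load_explicit() -> Result<(), Box<dyn std::error::Error>> {
    // Hand over exact files instead of relying on discovery.
    let config = TtsConfig::new(ModelType::Kokoro)
        .with_config_file("/opt/models/kokoro/config.json")
        .with_weight_file("/opt/models/kokoro/model.safetensors");
    let _model = load_model(config)?;
    Ok(())
}
```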
Device and dtype selection
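A sketch of forcing a device and dtype; `with_device()` and `with_dtype()` are assumed setter names inferred from the `DeviceSelection` and `DType` types described in the backend-selection section:

```rust
use any_tts::{load_model, DType, DeviceSelection, ModelType, TtsConfig};

fn load_on_gpu() -> Result<(), Box<dyn std::error::Error>> {
    // Force Metal with f16; with_preferred_runtime() would instead let the
    // crate pick the fastest safe device and dtype for this machine.
    let config = TtsConfig::new(ModelType::Qwen3Tts)
        .with_model_path("models/qwen3")
        .with_device(DeviceSelection::Metal(0)) // setter name assumed
        .with_dtype(DType::F16);                // setter name assumed
    let _model = load_model(config)?;
    Ok(())
}
```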
Repo health files
This repo now includes the standard GitHub community files you would expect for an active project:
- Contributing
- Code of Conduct
- Security Policy
- Support Guide
- Issue templates under `.github/ISSUE_TEMPLATE/`
- A pull request template at `.github/PULL_REQUEST_TEMPLATE.md`
Contributing
If you are adding a backend, model variant, or new loading flow, keep the public story honest: unsupported features should fail explicitly, examples should match exported API, and docs should separate experimental repo work from supported top-level surfaces.
The short version is in CONTRIBUTING.md.
Security
Model weights, runtime backends, and artifact loading all change the risk profile of TTS systems. Please read SECURITY.md before disclosing a vulnerability publicly.
License
The crate metadata declares MIT OR Apache-2.0 for this repository's Rust code. That does not supersede the terms attached to any model weights you download and run through it.
Status
This repo already has a strong shape: five public native backends, one obvious experimental sixth backend, trait-based loading, and example coverage for the core synthesis paths. The right way to think about it is not "a single-model wrapper" but "a Rust TTS platform layer that is learning how to speak multiple open ecosystems without hiding their differences."