kittentts-rs
A Rust port of KittenTTS — an ultra-lightweight, CPU-only text-to-speech engine based on ONNX models.
Screenshots
_Screenshots: iOS and Android (images not available)._
Features
- ONNX Runtime inference: uses `ort` (ORT 2.0 bindings) for fast CPU inference
- Full text preprocessing: numbers, currencies, abbreviations, ordinals, units, etc. → spoken words
- espeak-ng phonemisation: identical IPA output to the Python library
- Same ONNX models: works with all KittenTTS HuggingFace checkpoints
- Automatic chunking: long texts are split into ≤400-char sentence chunks, and the generated audio is concatenated
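The automatic chunking step can be pictured as greedy sentence packing. The sketch below is a hypothetical stand-in for `chunk_text()`, not its actual code. It assumes sentences end at `.`, `!` or `?`, counts length in Unicode characters, packs whole sentences into chunks of at most `max_chars` characters, and hard-splits any single sentence longer than the limit. The name `chunk_sentences` is illustrative.

```rust
/// Hypothetical sketch of sentence-based chunking (not the crate's `chunk_text()`).
/// Splits `text` after `.`, `!` or `?`, then greedily packs sentences into
/// chunks whose length in chars never exceeds `max_chars`.
pub fn chunk_sentences(text: &str, max_chars: usize) -> Vec<String> {
    assert!(max_chars > 0, "max_chars must be positive");

    // 1. Split into sentences, keeping the terminating punctuation.
    let mut sentences: Vec<String> = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        current.push(ch);
        if matches!(ch, '.' | '!' | '?') {
            let s = current.trim();
            if !s.is_empty() {
                sentences.push(s.to_string());
            }
            current.clear();
        }
    }
    let tail = current.trim();
    if !tail.is_empty() {
        sentences.push(tail.to_string());
    }

    // 2. Greedily pack sentences into chunks of at most `max_chars` chars.
    let mut chunks: Vec<String> = Vec::new();
    let mut chunk = String::new();
    for sentence in sentences {
        // A single over-long sentence is hard-split into `max_chars` pieces.
        let pieces: Vec<String> = if sentence.chars().count() > max_chars {
            let chars: Vec<char> = sentence.chars().collect();
            chars
                .chunks(max_chars)
                .map(|c| c.iter().collect::<String>())
                .collect()
        } else {
            vec![sentence]
        };
        for piece in pieces {
            let len = chunk.chars().count();
            // +1 accounts for the space used to join sentences.
            let extra = piece.chars().count() + usize::from(len > 0);
            if len > 0 && len + extra > max_chars {
                chunks.push(std::mem::take(&mut chunk));
            }
            if !chunk.is_empty() {
                chunk.push(' ');
            }
            chunk.push_str(&piece);
        }
    }
    if !chunk.is_empty() {
        chunks.push(chunk);
    }
    chunks
}
```

Calling `chunk_sentences(text, 400)` mirrors the ≤400-char limit above; each chunk would then be synthesised separately and the waveforms joined in order.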
Prerequisites
espeak-ng must be on $PATH for phonemisation:
```bash
# Alpine Linux
apk add espeak-ng

# Debian / Ubuntu
sudo apt-get install espeak-ng

# macOS
brew install espeak-ng
```
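Because phonemisation shells out to espeak-ng, a missing binary is the most likely setup failure. The following std-only sketch checks the prerequisite and calls espeak-ng directly. The helper names are hypothetical, and the flags (`-q` to suppress audio, `--ipa` to print IPA phonemes, `-v en-us` for the voice) are standard espeak-ng options chosen here as an assumption; the exact invocation in `phonemize.rs` may differ.

```rust
use std::io;
use std::process::Command;

/// Returns true if an `espeak-ng` binary can be launched from `$PATH`.
pub fn espeak_available() -> bool {
    Command::new("espeak-ng")
        .arg("--version")
        .output()
        .map(|out| out.status.success())
        .unwrap_or(false)
}

/// Hypothetical helper: phonemise `text` to IPA by running
/// `espeak-ng -q --ipa -v en-us <text>` and capturing stdout.
/// The flags kittentts-rs itself passes are an assumption here.
pub fn phonemize_ipa(text: &str) -> io::Result<String> {
    let out = Command::new("espeak-ng")
        .args(["-q", "--ipa", "-v", "en-us"])
        .arg(text)
        .output()?;
    if !out.status.success() {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            String::from_utf8_lossy(&out.stderr).into_owned(),
        ));
    }
    // espeak-ng may print several lines; collapse all whitespace to single spaces.
    let stdout = String::from_utf8_lossy(&out.stdout);
    Ok(stdout.split_whitespace().collect::<Vec<_>>().join(" "))
}
```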
Installation
Via crates.io
Or add it manually to your Cargo.toml:
```toml
[dependencies]
= "0.2"
```
Via GitHub
To use the latest unreleased code directly from the repository:
```toml
# Clone and use as a local path dependency
[dependencies]
= { path = "../kittentts-rs" }
```
Or reference it as a git dependency without cloning manually:
```toml
[dependencies]
= { git = "https://github.com/eugenehp/kittentts-rs" }
# Pin to a specific branch or tag
= { git = "https://github.com/eugenehp/kittentts-rs", branch = "main" }
= { git = "https://github.com/eugenehp/kittentts-rs", tag = "v0.2.0" }
```
Quick Start
Add to Cargo.toml:
```toml
[dependencies]
= "0.2"
```
```rust
use download;
use std::path::Path;
```
Run the bundled example (`examples/basic.rs`):

```bash
cargo run --example basic
```
Available Models
| Model | Params | Size |
|---|---|---|
| `KittenML/kitten-tts-mini-0.8` | 80M | 80 MB |
| `KittenML/kitten-tts-micro-0.8` | 40M | 41 MB |
| `KittenML/kitten-tts-nano-0.8-fp32` | 15M | 56 MB |
| `KittenML/kitten-tts-nano-0.8-int8` | 15M | 25 MB |
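Each model ID above is a HuggingFace Hub repository. The Hub serves raw repository files at `https://huggingface.co/{repo_id}/resolve/{revision}/{filename}`, so a downloader mainly needs to build such URLs. The helper below is a hypothetical sketch of that step, not the crate's `download.rs`; it only validates the `owner/name` shape of the ID. `config.json` appears in the example because the crate parses it; no other file names are assumed.

```rust
/// Hypothetical helper: build the direct-download URL for `filename`
/// in a HuggingFace Hub model repo at the given `revision` (branch, tag or commit).
/// Returns `None` if `repo_id` is not of the form "owner/name".
pub fn hub_file_url(repo_id: &str, revision: &str, filename: &str) -> Option<String> {
    let mut parts = repo_id.split('/');
    match (parts.next(), parts.next(), parts.next()) {
        (Some(owner), Some(name), None) if !owner.is_empty() && !name.is_empty() => Some(format!(
            "https://huggingface.co/{owner}/{name}/resolve/{revision}/{filename}"
        )),
        _ => None,
    }
}
```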
Available Voices (v0.8)
Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, Leo
API
```rust
// Load from HuggingFace Hub
let tts = load_from_hub?;
// Load from local files (if you already have the ONNX + voices.npz)
let tts = load?;
// Generate audio → Vec<f32> at 24 kHz
let audio: Vec<f32> = tts.generate?;
// Generate and save to WAV
tts.generate_to_file?;
// Available voices
println!;
```
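Since `generate` returns raw samples, writing them out yourself needs only a RIFF/WAVE header. The sketch below is a std-only, hypothetical helper (`write_wav_pcm16`) that encodes mono `f32` samples as 16-bit PCM; the sample format that `generate_to_file` actually writes is not documented here and may well be 32-bit float.

```rust
use std::io::{self, Write};

/// Hypothetical sketch: encode mono f32 samples (nominally in [-1.0, 1.0])
/// as a 16-bit PCM WAV stream. Not necessarily what kittentts-rs writes.
pub fn write_wav_pcm16<W: Write>(mut w: W, samples: &[f32], sample_rate: u32) -> io::Result<()> {
    const CHANNELS: u16 = 1;
    const BITS: u16 = 16;
    let block_align = CHANNELS * BITS / 8; // bytes per frame
    let byte_rate = sample_rate * u32::from(block_align);
    let data_len = (samples.len() * usize::from(block_align)) as u32;

    // RIFF container header
    w.write_all(b"RIFF")?;
    w.write_all(&(36 + data_len).to_le_bytes())?;
    w.write_all(b"WAVE")?;
    // "fmt " chunk describing integer PCM
    w.write_all(b"fmt ")?;
    w.write_all(&16u32.to_le_bytes())?; // fmt chunk size
    w.write_all(&1u16.to_le_bytes())?; // audio format 1 = PCM
    w.write_all(&CHANNELS.to_le_bytes())?;
    w.write_all(&sample_rate.to_le_bytes())?;
    w.write_all(&byte_rate.to_le_bytes())?;
    w.write_all(&block_align.to_le_bytes())?;
    w.write_all(&BITS.to_le_bytes())?;
    // "data" chunk with the samples
    w.write_all(b"data")?;
    w.write_all(&data_len.to_le_bytes())?;
    for &s in samples {
        // Clamp to [-1, 1], then scale to the i16 range.
        let v = (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16;
        w.write_all(&v.to_le_bytes())?;
    }
    Ok(())
}
```

For a file, wrap it in `std::io::BufWriter` before passing it in, e.g. `write_wav_pcm16(BufWriter::new(File::create("out.wav")?), &audio, 24_000)`, since the helper writes one sample at a time.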
Architecture
```text
Input text
  ↓ TextPreprocessor (preprocess.rs)
      • numbers / currency / percentages / ordinals → words
      • contractions, units, scientific notation, fractions, …
  ↓ chunk_text() (model.rs)
      • split into ≤400-char sentence chunks
  ↓ espeak-ng subprocess (phonemize.rs)
      • text → IPA phoneme string (en-us, with stress)
  ↓ ipa_to_ids() (tokenize.rs)
      • IPA chars → integer token IDs (fixed vocab, same as Python)
      • prepend/append pad token 0
  ↓ tract-onnx inference (model.rs)
      • inputs: input_ids [1, T], style [1, D], speed [1]
      • output: audio waveform [samples]
  ↓ tail trim (drop the last 5,000 samples) + chunk concatenation
  ↓ Vec<f32> @ 24 kHz or WAV file
```
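Two of the integer-level stages in this pipeline, token-ID mapping with pad token 0 and the final tail trim plus concatenation, are small enough to sketch. In this hypothetical version the vocabulary is passed in as a `HashMap<char, i64>` (the real fixed vocabulary is not reproduced here), unknown IPA characters are skipped, IDs are `i64` on the assumption that the ONNX `input_ids` input is int64, and the trim is applied to each chunk before concatenation. None of these choices is confirmed by the crate's source.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of `ipa_to_ids()`: map each IPA char through `vocab`,
/// silently skip chars missing from the vocabulary, and wrap the sequence
/// in pad token 0 on both ends.
pub fn ipa_to_ids(ipa: &str, vocab: &HashMap<char, i64>) -> Vec<i64> {
    let mut ids = vec![0]; // leading pad
    ids.extend(ipa.chars().filter_map(|c| vocab.get(&c).copied()));
    ids.push(0); // trailing pad
    ids
}

/// Hypothetical sketch of the last step: drop `trim` samples from the end of
/// each chunk's waveform (or all of it, if shorter), then concatenate in order.
pub fn trim_and_concat(chunks: &[Vec<f32>], trim: usize) -> Vec<f32> {
    let mut out = Vec::new();
    for wave in chunks {
        let keep = wave.len().saturating_sub(trim);
        out.extend_from_slice(&wave[..keep]);
    }
    out
}
```

With the figures in the diagram, `trim_and_concat(&waves, 5_000)` removes about 0.21 s (5,000 / 24,000) from the end of every chunk.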
Crate Structure
| File | Role |
|---|---|
| `src/lib.rs` | Public API & re-exports |
| `src/preprocess.rs` | Text preprocessing pipeline (mirrors `preprocess.py`) |
| `src/phonemize.rs` | eSpeak-NG subprocess wrapper |
| `src/tokenize.rs` | IPA character → token ID (mirrors `TextCleaner`) |
| `src/npz.rs` | Hand-written NPY/NPZ loader (no `ndarray-npy` needed) |
| `src/model.rs` | ONNX inference via tract, chunking, WAV output |
| `src/download.rs` | HuggingFace Hub model download + `config.json` parsing |
| `examples/basic.rs` | CLI example |
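`src/npz.rs` loads `voices.npz` without `ndarray-npy`. An `.npz` file is a ZIP archive of `.npy` members, and each `.npy` member begins with a magic string, a version, a little-endian header length and a Python-dict header. The sketch below parses that header following the NumPy `.npy` format description. It is a hypothetical illustration rather than the crate's loader, it assumes the dict uses NumPy's usual single-quoted keys, and unzipping the archive is left out.

```rust
/// Parsed subset of a NumPy `.npy` header.
#[derive(Debug, PartialEq)]
pub struct NpyHeader {
    pub descr: String,      // dtype string, e.g. "<f4" = little-endian f32
    pub fortran_order: bool,
    pub shape: Vec<usize>,
    pub data_offset: usize, // byte offset where the array data starts
}

/// Hypothetical sketch (not the crate's npz.rs): parse the header of an
/// in-memory `.npy` file. Version 1.x stores the header length as u16,
/// versions 2.x/3.x as u32.
pub fn parse_npy_header(bytes: &[u8]) -> Result<NpyHeader, String> {
    if bytes.len() < 10 || &bytes[..6] != b"\x93NUMPY" {
        return Err("not an .npy file".into());
    }
    let (header_len, start) = match bytes[6] {
        1 => (u16::from_le_bytes([bytes[8], bytes[9]]) as usize, 10),
        2 | 3 => {
            if bytes.len() < 12 {
                return Err("truncated header".into());
            }
            (u32::from_le_bytes([bytes[8], bytes[9], bytes[10], bytes[11]]) as usize, 12)
        }
        v => return Err(format!("unsupported .npy version {v}")),
    };
    let end = start + header_len;
    let header = bytes.get(start..end).ok_or("truncated header")?;
    let header = std::str::from_utf8(header).map_err(|e| e.to_string())?;

    // The header is a Python dict literal, e.g.
    // {'descr': '<f4', 'fortran_order': False, 'shape': (8, 256), }
    let descr = header
        .split("'descr': '")
        .nth(1)
        .and_then(|s| s.split('\'').next())
        .ok_or("missing descr")?
        .to_string();
    let fortran_order = header.contains("'fortran_order': True");
    let shape_src = header
        .split("'shape': (")
        .nth(1)
        .and_then(|s| s.split(')').next())
        .ok_or("missing shape")?;
    let shape = shape_src
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty()) // handles "(256,)" and "()"
        .map(|s| s.parse::<usize>().map_err(|e| e.to_string()))
        .collect::<Result<Vec<_>, _>>()?;

    Ok(NpyHeader { descr, fortran_order, shape, data_offset: end })
}
```

After parsing, a `descr` of `<f4` means the bytes from `data_offset` onward hold `shape.iter().product()` little-endian `f32` values.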
Running Tests

```bash
cargo test
```
Citation
If you use kittentts-rs in your research or project, please cite:
If you also use the underlying KittenTTS models, please additionally cite the original library:
Changelog
See CHANGELOG.md for a full history of releases and changes.
License
This project is licensed under the Apache License 2.0.

