kokoroxide [WIP]
A high-performance Rust implementation of Kokoro TTS (Text-to-Speech) synthesis that uses ONNX Runtime for neural speech generation. Text is converted to phonemes with espeak-ng, and built-in logic then converts the result into the Misaki phoneme notation that Kokoro models expect. Distributed under a dual MIT/Apache-2.0 license to match the broader Rust ecosystem.
Note: Currently only supports and has been tested with American English. Contributions for different languages are very welcome!
Features
- 🎨 Voice Style Control - Customize voice characteristics with style vectors
- 🔤 Phoneme Support - Direct phoneme input for precise pronunciation control
- ⚡ Speed Control - Adjust speech rate dynamically
- 🔧 Flexible API - Multiple generation methods for different use cases
Installation
Add this to your Cargo.toml:
[dependencies]
kokoroxide = "0.1.1"
Quick Start
use ;
For a complete runnable example pointing at real assets, see the kokoroxide-demo sample project in this workspace (kokoroxide-demo/src/main.rs).
API Overview
Core Types
KokoroTTS
The main TTS engine that handles text-to-speech conversion.
// Create with default config
let tts = new?;
// Create with custom config
let config = new
.with_max_tokens_length
.with_sample_rate;
let tts = with_config?;
VoiceStyle
Represents voice characteristics as a style vector. Voice files contain multiple style vectors indexed by token length.
// Load from binary file
let voice = load_voice_style?;
// Create custom voice with vector size
let custom_voice = new;
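To make "indexed by token length" concrete, here is a minimal sketch of how one style vector might be picked out of a flat buffer of f32 values. The 256-value vector width, the select_style name, and the clamping of out-of-range lengths are all assumptions made for illustration; none of them are taken from kokoroxide's source.

```rust
/// Assumed width of a single style vector (hypothetical; check your voice file).
const STYLE_DIM: usize = 256;

/// Returns the style vector stored at row `token_len` of a flat, row-major buffer.
/// Lengths past the last row are clamped to the last row (an assumption of this sketch).
fn select_style(styles: &[f32], token_len: usize) -> Option<&[f32]> {
    let rows = styles.len() / STYLE_DIM;
    if rows == 0 {
        return None; // the buffer holds no complete vector
    }
    let row = token_len.min(rows - 1);
    let start = row * STYLE_DIM;
    Some(&styles[start..start + STYLE_DIM])
}
```

Called with a buffer of three rows and a token length of 1, this returns the second row; an empty buffer yields None.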
GeneratedAudio
Contains the generated audio samples and metadata.
let audio = tts.speak?;
println!;
println!;
audio.save_to_wav?;
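The duration of a clip follows from its sample count and sample rate. The sketch below shows that arithmetic on plain values; duration_seconds is a hypothetical helper, not part of the crate, and it assumes mono audio.

```rust
/// Duration in seconds of mono audio: number of samples divided by samples per second.
fn duration_seconds(sample_count: usize, sample_rate: u32) -> f32 {
    if sample_rate == 0 {
        return 0.0; // avoid dividing by zero on a bad configuration
    }
    sample_count as f32 / sample_rate as f32
}
```

For example, 48,000 samples at a rate of 24,000 Hz correspond to 2 seconds of audio.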
Generation Methods
1. Simple Text-to-Speech
let audio = tts.speak?;
2. With Speed Control
let audio = tts.generate_speech?; // 1.5x speed
3. From Phonemes
let audio = tts.generate_speech_from_phonemes?;
4. From Token IDs
let tokens = vec!; // Pre-tokenized input
let audio = tts.generate_from_tokens?;
Configuration
TTSConfig Options
use GraphOptimizationLevel;
let config = new
.with_max_tokens_length // Maximum token sequence length
.with_sample_rate // Audio sample rate in Hz
.with_graph_optimization_level; // ONNX graph optimization
Graph Optimization Levels
The with_graph_optimization_level() method allows you to control ONNX Runtime's graph optimization:
- GraphOptimizationLevel::Disable - No optimizations
- GraphOptimizationLevel::Level1 - Basic optimizations
- GraphOptimizationLevel::Level2 - Extended optimizations
- GraphOptimizationLevel::Level3 - Maximum optimizations (default)
System Requirements
Prerequisites
- Rust 1.70+
- espeak-ng (required for text-to-phoneme conversion):
  - Ubuntu/Debian: sudo apt-get install espeak-ng libespeak-ng-dev
  - macOS: brew install espeak-ng
  - Windows: Download from espeak-ng releases
  - Arch Linux: sudo pacman -S espeak-ng
- ONNX Runtime (automatically downloaded via the ort crate)
- Kokoro model files:
  - Model file (e.g., kokoro-v0_19.onnx)
  - Tokenizer configuration (tokenizer.json)
  - Voice style files (.bin format)
  - Downloaded at runtime or managed outside the crate package to keep the published crate lightweight
Build Configuration
The crate automatically links to espeak-ng based on your platform:
- macOS: Looks for espeak-ng in /opt/homebrew/lib (Homebrew default)
- Linux: Uses system library paths
If espeak-ng is installed in a non-standard location, you may need to set:
# Linux
# macOS
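For readers curious how platform-specific linking of this kind is usually done, the following is a generic sketch of the logic a Cargo build script might contain. It is not kokoroxide's actual build.rs. The link_directives function and its custom_dir override are hypothetical; the cargo:rustc-link-search and cargo:rustc-link-lib directives are standard Cargo build-script output.

```rust
/// Builds the lines a build script would print for Cargo.
/// `target_os` is the value Cargo exposes as CARGO_CFG_TARGET_OS; `custom_dir` is a
/// hypothetical user-supplied override for non-standard install locations.
fn link_directives(target_os: &str, custom_dir: Option<&str>) -> Vec<String> {
    let mut out = Vec::new();
    if let Some(dir) = custom_dir {
        out.push(format!("cargo:rustc-link-search=native={dir}"));
    } else if target_os == "macos" {
        // Homebrew's default library directory on Apple Silicon.
        out.push("cargo:rustc-link-search=native=/opt/homebrew/lib".to_string());
    }
    // On Linux the system linker paths are used, so no search path is added.
    out.push("cargo:rustc-link-lib=espeak-ng".to_string());
    out
}
```

In a real build.rs, main would print each returned line with println! so that Cargo picks it up.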
Environment Variables
- DEBUG_PHONEMES - Enable phoneme debugging output: DEBUG_PHONEMES=1
  This will print:
  - Input text
  - Espeak IPA output
  - Converted Misaki phonemes
- DEBUG_TOKENS - Enable token debugging output: DEBUG_TOKENS=1
  This will print:
  - Generated token IDs array
- DEBUG_TIMING - Enable performance timing logs: DEBUG_TIMING=1
  This will print:
  - Phoneme tokenization time
  - Espeak IPA conversion time
  - Total tokenization time
- All debug modes: DEBUG_PHONEMES=1 DEBUG_TOKENS=1 DEBUG_TIMING=1
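Switches like these are typically read with std::env::var. The sketch below shows one way a NAME=1 flag can be checked in Rust; is_on and flag_enabled are hypothetical helpers, and treating only the value 1 as "on" is an assumption rather than kokoroxide's documented behavior.

```rust
use std::env;

/// Interprets a raw variable value: only "1" counts as enabled (an assumption of this sketch).
fn is_on(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Reads `name` from the process environment and applies `is_on`.
/// Unset or non-UTF-8 values are treated as disabled.
fn flag_enabled(name: &str) -> bool {
    is_on(env::var(name).ok().as_deref())
}
```

An application could apply the same pattern to its own diagnostics, for example by wrapping extra logging in if flag_enabled("DEBUG_TIMING") { ... }.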
Model Files
Download the Kokoro model files from the official repository:
- Model: Kokoro-82M ONNX
- Required files:
  - *.onnx - The model file
  - tokenizer.json - Tokenizer configuration
  - Voice files (*.bin) - Style vectors for different voices
- Provide these assets at runtime (they are not packaged with the crate to keep the published tarball lightweight)
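Because the assets are supplied at runtime, it can help to verify that they exist before constructing the engine. This is a minimal sketch using only the standard library; missing_assets is a hypothetical helper, not part of the crate, and the paths passed to it are up to the caller.

```rust
use std::path::{Path, PathBuf};

/// Returns every path in `required` that does not exist as a regular file.
fn missing_assets<P: AsRef<Path>>(required: &[P]) -> Vec<PathBuf> {
    let mut missing = Vec::new();
    for candidate in required {
        let path: &Path = candidate.as_ref();
        if !path.is_file() {
            missing.push(path.to_path_buf());
        }
    }
    missing
}
```

An empty result means every listed file was found; otherwise the returned paths can be reported to the user before any model loading is attempted.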
Examples
Basic TTS Application
use ;
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Acknowledgments
This project implements the Kokoro TTS model in Rust, providing a high-performance alternative to Python implementations.