# kokoroxide [WIP]
A high-performance Rust implementation of Kokoro TTS (Text-to-Speech) synthesis, leveraging ONNX Runtime for efficient neural speech generation. Uses espeak-ng for text-to-phoneme conversion, with built-in conversion logic into Misaki phoneme notation expected by Kokoro models. Distributed under a dual MIT/Apache-2.0 license to match the broader Rust ecosystem.
> **Note:** Currently only supports and has been tested with American English. Contributions for different languages are very welcome!
## Features
- 🎨 **Voice Style Control** - Customize voice characteristics with style vectors
- 🔤 **Phoneme Support** - Direct phoneme input for precise pronunciation control
- ⚡ **Speed Control** - Adjust speech rate dynamically
- 🔧 **Flexible API** - Multiple generation methods for different use cases
## Installation
Add this to your `Cargo.toml`:
```toml
[dependencies]
kokoroxide = "0.1.3"
```
## Quick Start
```rust
use kokoroxide::{load_voice_style, KokoroTTS, TTSConfig};
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Configure the ONNX model + tokenizer that Kokoro requires.
// These files live outside the crate; download them from Kokoro's distribution (https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX).
let config = TTSConfig::new("path/to/kokoro.onnx", "path/to/tokenizer.json")
.with_sample_rate(24000)
.with_max_tokens_length(512)
.with_graph_optimization_level(kokoroxide::GraphOptimizationLevel::Disable);
// Build the speech engine with the explicit configuration so advanced knobs are available.
let tts_service = KokoroTTS::with_config(config)?;
// Load a voice style vector (.bin) that controls prosody and speaker identity.
let voice = load_voice_style("path/to/voice.bin")?;
// Generate speech at 1.0x speed for the requested text.
let text = "Hello, this is a text-to-speech synthesis example.";
let audio = tts_service.generate_speech(text, &voice, 1.0)?;
// Persist the synthesized waveform to a WAV file for playback.
audio.save_to_wav("path/to/output.wav")?;
Ok(())
}
```
For a complete runnable example pointing at real assets, see the `kokoroxide-demo` sample project in this workspace (`kokoroxide-demo/src/main.rs`).
## API Overview
### Core Types
#### `KokoroTTS`
The main TTS engine that handles text-to-speech conversion.
```rust
// Create with default config
let tts = KokoroTTS::new(model_path, tokenizer_path)?;
// Create with custom config
let config = TTSConfig::new(model_path, tokenizer_path)
.with_max_tokens_length(128)
.with_sample_rate(24000);
let tts = KokoroTTS::with_config(config)?;
```
#### `VoiceStyle`
Represents voice characteristics as a style vector. Voice files contain multiple style vectors indexed by token length.
```rust
// Load from binary file
let voice = load_voice_style("voice.bin")?;
// Create custom voice with vector size
let custom_voice = VoiceStyle::new(vec![0.1, 0.2, ...], 256);
```
#### `GeneratedAudio`
Contains the generated audio samples and metadata.
```rust
let audio = tts.speak("Hello!", &voice)?;
println!("Duration: {} seconds", audio.duration_seconds);
println!("Sample rate: {} Hz", audio.sample_rate);
audio.save_to_wav("output.wav")?;
```
### Generation Methods
#### 1. Simple Text-to-Speech
```rust
let audio = tts.speak("Hello, world!", &voice)?;
```
#### 2. With Speed Control
```rust
let audio = tts.generate_speech("Speak faster!", &voice, 1.5)?; // 1.5x speed
```
#### 3. From Phonemes
```rust
let audio = tts.generate_speech_from_phonemes("həˈloʊ wɜːld", &voice, 1.0)?;
```
#### 4. From Token IDs
```rust
let tokens = vec![101, 2234, 1567, 102]; // Pre-tokenized input
let audio = tts.generate_from_tokens(&tokens, &voice, 1.0)?;
```
## Configuration
### TTSConfig Options
```rust
use ort::{execution_providers::CoreMLExecutionProviderOptions, ExecutionProvider, GraphOptimizationLevel};
let config = TTSConfig::new(model_path, tokenizer_path)
.with_max_tokens_length(512) // Maximum token sequence length
.with_sample_rate(24000) // Audio sample rate in Hz
.with_graph_optimization_level(GraphOptimizationLevel::Level3)
.with_execution_providers(vec![
ExecutionProvider::CoreML(CoreMLExecutionProviderOptions::default()),
]); // Optional hardware acceleration
```
If you don't need custom providers, you can skip the call to `with_execution_providers` and the default CPU provider will be used.
#### Graph Optimization Levels
The `with_graph_optimization_level()` method allows you to control ONNX Runtime's graph optimization:
- `GraphOptimizationLevel::Disable` - No optimizations
- `GraphOptimizationLevel::Level1` - Basic optimizations
- `GraphOptimizationLevel::Level2` - Extended optimizations
- `GraphOptimizationLevel::Level3` - Maximum optimizations (default)
## System Requirements
### Prerequisites
1. **Rust 1.70+**
2. **espeak-ng** (required for text-to-phoneme conversion):
- **Ubuntu/Debian**: `sudo apt-get install espeak-ng libespeak-ng-dev`
- **macOS**: `brew install espeak-ng`
- **Windows**: Download from [espeak-ng releases](https://github.com/espeak-ng/espeak-ng/releases)
- **Arch Linux**: `sudo pacman -S espeak-ng`
3. **ONNX Runtime** (automatically downloaded via `ort` crate)
4. **Kokoro model files**:
- Model file (e.g., `kokoro-v0_19.onnx`)
- Tokenizer configuration (`tokenizer.json`)
- Voice style files (`.bin` format)
- Downloaded at runtime or managed outside the crate package to keep the published crate lightweight
### Build Configuration
The crate automatically links to espeak-ng based on your platform:
- **macOS**: Looks for espeak-ng in `/opt/homebrew/lib` (Homebrew default)
- **Linux**: Uses system library paths
If espeak-ng is installed in a non-standard location, you may need to set:
```bash
export LD_LIBRARY_PATH=/path/to/espeak-ng/lib:$LD_LIBRARY_PATH # Linux
export DYLD_LIBRARY_PATH=/path/to/espeak-ng/lib:$DYLD_LIBRARY_PATH # macOS
```
### Environment Variables
- **`DEBUG_PHONEMES`** - Enable phoneme debugging output:
```bash
DEBUG_PHONEMES=1 cargo run
```
This will print:
- Input text
- Espeak IPA output
- Converted Misaki phonemes
- **`DEBUG_TOKENS`** - Enable token debugging output:
```bash
DEBUG_TOKENS=1 cargo run
```
This will print:
- Generated token IDs array
- **`DEBUG_TIMING`** - Enable performance timing logs:
```bash
DEBUG_TIMING=1 cargo run
```
This will print:
- Phoneme tokenization time
- Espeak IPA conversion time
- Total tokenization time
- **All debug modes**:
```bash
DEBUG_PHONEMES=1 DEBUG_TOKENS=1 DEBUG_TIMING=1 cargo run
```
## Model Files
Download the Kokoro model files from the official repository:
- Model: [Kokoro-82M ONNX](https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX)
- Required files:
- `*.onnx` - The model file
- `tokenizer.json` - Tokenizer configuration
- Voice files (`*.bin`) - Style vectors for different voices
- Provide these assets at runtime (they are not packaged with the crate to keep the published tarball lightweight)
## Examples
### Basic TTS Application
```rust
use kokoroxide::{KokoroTTS, load_voice_style};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let tts = KokoroTTS::new("model.onnx", "tokenizer.json")?;
let voice = load_voice_style("voice.bin")?;
let text = "Welcome to kokoroxide TTS!";
let audio = tts.generate_speech(text, &voice, 1.0)?;
audio.save_to_wav("welcome.wav")?;
println!("Generated {} seconds of audio", audio.duration_seconds);
Ok(())
}
```
## License
Licensed under either of:
- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Acknowledgments
This project implements the Kokoro TTS model in Rust, providing a high-performance alternative to Python implementations.