# rust-tts-wrapper
Cross-platform TTS (Text-to-Speech) wrapper with C ABI. Mirrors [js-tts-wrapper](https://github.com/AACTools/js-tts-wrapper) and [swift-tts-wrapper](https://github.com/AACTools/swift-tts-wrapper).
## Engines (21 total)
| System (speech-dispatcher) | Local | None | — | — | Estimated | — |
| Sherpa-ONNX | Local (191 models) | None | Simulated* | Speakers | Estimated | — |
| Azure | Cloud | Key + Region | Chunked | API | Estimated | Platform-aware |
| Google Cloud | Cloud | API Key | Chunked | API | **Real** (v1beta1 timepoints) | Platform-aware |
| OpenAI | Cloud | API Key | Chunked | — | Estimated | Platform-aware |
| ElevenLabs | Cloud | API Key | Chunked | API | Estimated | Platform-aware |
| Cartesia | Cloud | API Key | Chunked | API | Estimated | Platform-aware |
| Deepgram | Cloud | API Key | Chunked | — | Estimated | Platform-aware |
| PlayHT | Cloud | API Key + User ID | Chunked | — | Estimated | Platform-aware |
| Fish Audio | Cloud | API Key | Chunked | — | Estimated | Platform-aware |
| Hume AI | Cloud | API Key | Chunked | — | Estimated | Platform-aware |
| Mistral | Cloud | API Key | Chunked | — | Estimated | Platform-aware |
| Murf | Cloud | API Key | Chunked | — | Estimated | Platform-aware |
| Resemble AI | Cloud | API Key | Chunked | — | Estimated | Platform-aware |
| Unreal Speech | Cloud | API Key | Chunked | — | Estimated | Platform-aware |
| UpliftAI | Cloud | API Key | Chunked | — | Estimated | Platform-aware |
| Amazon Polly | Cloud | Key + Secret + Region | Chunked | — | Estimated | Platform-aware |
| IBM Watson | Cloud | Key + Region + Instance | Chunked | — | Estimated | Platform-aware |
| Wit.ai | Cloud | Token | Chunked | — | Estimated | Platform-aware |
| xAI | Cloud | API Key | Chunked | — | Estimated | Platform-aware |
| ModelsLab | Cloud | API Key | Chunked | — | Estimated | Platform-aware |
- **Streaming**: Cloud engines stream audio in 8KB chunks via the `on_audio` callback. Sherpa-ONNX delivers all audio at once after synthesis (*simulated).
- **Voice List**: Engines with "API" can enumerate voices from the provider's API.
- **Word Boundaries**: Google returns real timing via v1beta1 timepoints with SSML marks. All others use word-length-adjusted estimation (150 WPM baseline, configurable).
- **Speech Markdown**: Auto-detected and converted to platform-specific SSML via [speechmarkdown-rust](https://github.com/AACTools/speechmarkdown-rust). Azure gets Microsoft SSML, Google gets Assistant SSML, others get Alexa SSML.
## Rust API
### `TtsEngine` Trait
```rust
pub trait TtsEngine: Send + Sync + Debug {
// Speaking
fn speak(&self, text: &str, voice: Option<&str>, rate: f32, pitch: f32, volume: f32,
on_audio: Option<OnAudioCallback>, on_boundary: Option<OnBoundaryCallback>) -> TtsResult<()>;
fn speak_with_options(&self, text: &str, options: Option<&SpeakOptions>,
on_audio: Option<OnAudioCallback>, on_boundary: Option<OnBoundaryCallback>) -> TtsResult<()>;
fn speak_sync(&self, text: &str, voice: Option<&str>, rate: f32, pitch: f32, volume: f32,
on_audio: Option<OnAudioCallback>, on_boundary: Option<OnBoundaryCallback>) -> TtsResult<()>;
// Synthesis (no playback)
fn synth_to_bytes(&self, text: &str, voice: Option<&str>, rate: f32, pitch: f32, volume: f32) -> TtsResult<Vec<u8>>;
fn synth_to_bytes_with_options(&self, text: &str, options: Option<&SpeakOptions>) -> TtsResult<Vec<u8>>;
fn synth_with_boundaries(&self, text: &str, voice: Option<&str>, rate: f32, pitch: f32, volume: f32) -> TtsResult<(Vec<u8>, Vec<WordBoundary>)>;
// Control
fn stop(&self) -> TtsResult<()>;
fn pause(&self) -> TtsResult<()>;
fn resume(&self) -> TtsResult<()>;
// Introspection
fn get_voices(&self) -> TtsResult<Vec<Voice>>;
fn engine_id(&self) -> &'static str;
fn check_credentials(&self) -> TtsResult<bool>;
}
```
### Callback Types
```rust
pub type OnAudioCallback<'a> = &'a mut dyn FnMut(&[u8]);
pub type OnBoundaryCallback<'a> = &'a mut dyn FnMut(&str, f32, f32); // word, start_s, end_s
pub type OnStartCallback<'a> = &'a mut dyn FnMut();
pub type OnEndCallback<'a> = &'a mut dyn FnMut();
pub type OnErrorCallback<'a> = &'a mut dyn FnMut(&str);
```
### Core Types
```rust
pub struct Voice {
pub id: String,
pub name: String,
pub gender: Gender, // Male | Female | Unknown
pub provider: String,
pub language_codes: Vec<LanguageCode>,
}
pub struct LanguageCode {
pub bcp47: String, // "en-US"
pub iso639_3: String, // "eng"
pub display: String, // "English (United States)"
}
pub struct WordBoundary {
pub text: String,
pub offset: u64, // milliseconds
pub duration: u64, // milliseconds
}
pub struct SpeakOptions {
pub rate: Option<f32>,
pub speech_rate: Option<SpeechRate>, // XSlow | Slow | Medium | Fast | XFast
pub pitch: Option<f32>,
pub speech_pitch: Option<SpeechPitch>, // XLow | Low | Medium | High | XHigh
pub volume: Option<f32>,
pub voice: Option<String>,
pub format: Option<AudioFormat>, // Mp3 | Wav | Ogg | Opus | Aac | Flac | Pcm
pub use_speech_markdown: bool,
pub use_word_boundary: bool,
pub raw_ssml: bool,
pub extra: HashMap<String, String>,
}
pub enum Gender { Male, Female, Unknown }
pub enum AudioFormat { Mp3, Wav, Ogg, Opus, Aac, Flac, Pcm }
pub enum SpeechRate { XSlow, Slow, Medium, Fast, XFast }
pub enum SpeechPitch { XLow, Low, Medium, High, XHigh }
```
### Utility Functions
```rust
// Word boundary estimation (matches Swift WordTimingEstimator)
pub fn estimate_word_boundaries(text: &str) -> Vec<WordBoundary>;
pub fn estimate_word_boundaries_with_wpm(text: &str, words_per_minute: f64) -> Vec<WordBoundary>;
// Speech Markdown preprocessing
pub fn preprocess_speech_markdown(text: &str, platform: &str) -> (String, bool);
// Gender normalization
pub fn normalize_gender(value: &str) -> Gender;
```
### Factory
```rust
pub fn create_engine(engine_id: &str, credentials_json: &str) -> Option<Box<dyn TtsEngine>>;
pub fn engine_count() -> usize;
pub fn engine_list() -> Vec<EngineDescriptor>;
```
## C API
All functions are `extern "C"`, `#[no_mangle]`:
| `tts_create(engine_id, credentials_json)` | Create engine, returns opaque `tts_ctx*` |
| `tts_destroy(ctx)` | Free engine context |
| `tts_speak(ctx, text)` | Speak async, returns 0/-1 |
| `tts_speak_sync(ctx, text)` | Speak sync (blocking) |
| `tts_stop(ctx)` | Stop speech |
| `tts_pause(ctx)` | Pause in-progress speech |
| `tts_resume(ctx)` | Resume paused speech |
| `tts_synth_to_bytes(ctx, text, out_bytes, out_len)` | Synth to buffer (returns 0/-1) |
| `tts_free_bytes(bytes, len)` | Free buffer from tts_synth_to_bytes |
| `tts_get_voices(ctx, out_voices, out_count)` | Get voice list |
| `tts_free_voices(voices, count)` | Free voice array |
| `tts_set_voice(ctx, voice_id)` | Set voice |
| `tts_set_rate(ctx, rate)` | Set rate (1.0 = normal) |
| `tts_set_pitch(ctx, pitch)` | Set pitch (1.0 = normal) |
| `tts_set_volume(ctx, volume)` | Set volume (1.0 = normal) |
| `tts_set_on_audio(ctx, cb, userdata)` | Set streaming audio callback |
| `tts_set_on_boundary(ctx, cb, userdata)` | Set word boundary callback |
| `tts_get_engine_count()` | Count registered engines |
| `tts_get_engines(out_engines)` | Get engine descriptors |
| `tts_free_engine_info(engines, count)` | Free engine info |
| `tts_get_last_error()` | Get last error message |
### C Example
```c
#include "tts_wrapper.h"
#include <stdio.h>
void on_audio(const uint8_t* chunk, uintptr_t size, void* userdata) {
printf("Audio chunk: %zu bytes\n", size);
}
void on_boundary(const char* word, float start, float end, void* userdata) {
printf("Word '%s' %.3f-%.3f\n", word, start, end);
}
int main() {
tts_ctx* ctx = tts_create("openai", "{\"apiKey\":\"your-key\"}");
tts_set_on_audio(ctx, on_audio, NULL);
tts_set_on_boundary(ctx, on_boundary, NULL);
tts_set_voice(ctx, "alloy");
tts_speak_sync(ctx, "Hello world");
tts_destroy(ctx);
}
```
### Rust Example
```rust
use rust_tts_wrapper::{factory, types::SpeakOptions};
let engine = factory::create_engine("openai", r#"{"apiKey":"key"}"#).unwrap();
// Simple speak
engine.speak("Hello", Some("alloy"), 1.0, 1.0, 1.0, None, None).unwrap();
// With callbacks
let mut audio_cb = |chunk: &[u8]| println!("{} bytes", chunk.len());
let mut boundary_cb = |word: &str, s: f32, e: f32| println!("{}: {:.3}-{:.3}", word, s, e);
engine.speak_sync("Hello world", Some("alloy"), 1.0, 1.0, 1.0,
Some(&mut audio_cb), Some(&mut boundary_cb)).unwrap();
// With SpeakOptions
let opts = SpeakOptions { voice: Some("alloy".into()), ..Default::default() };
engine.speak_with_options("Hello", Some(&opts), None, None).unwrap();
// Synth to bytes
let audio = engine.synth_to_bytes("Hello", Some("alloy"), 1.0, 1.0, 1.0).unwrap();
// Get voices
for v in engine.get_voices().unwrap() {
println!("{} ({}) - {}", v.name, v.gender, v.primary_language());
}
// Check credentials
assert!(engine.check_credentials().unwrap());
```
## Build
```bash
cargo build --all-features
```
### Features
- `system` — speech-dispatcher (Linux system TTS)
- `cloud` — all 19 cloud engines via HTTP + speechmarkdown-rust + base64
- `sherpaonnx` — Sherpa-ONNX offline TTS (191 models)
### Lint & Test
```bash
cargo fmt --all -- --check
cargo clippy --all-features -- -D warnings
cargo test --all-features
```
## Bindings
### Python (`bindings/python/tts_wrapper.py`)
```python
from tts_wrapper import TTSClient
client = TTSClient("openai", {"apiKey": "your-key"})
client.on_audio(lambda chunk: print(f"{len(chunk)} bytes"))
client.on_boundary(lambda word, s, e: print(f"{word}: {s:.3f}-{e:.3f}"))
client.set_voice("alloy")
client.speak_sync("Hello world")
client.stop()
```
### .NET (`bindings/dotnet/TtsClient.cs`)
```csharp
using TtsWrapper;
var client = new TtsClient("openai", new() { {"apiKey", "your-key"} });
client.SetVoice("alloy");
client.SetRate(1.0f);
client.SetPitch(1.0f);
client.SetVolume(1.0f);
client.SpeakSync("Hello world");
client.Stop();
```
### Swift (`bindings/swift/TtsClient.swift`)
```swift
let client = TTSClient(engineId: "openai", credentials: ["apiKey": "your-key"])
client.setVoice("alloy")
client.setRate(1.0)
client.speakSync("Hello world")
client.stop()
```
## Architecture
```
TtsEngine (trait)
|
+--------------+--------------+
| | |
SystemEngine CloudEngine SherpaOnnxEngine
(speech- (19 cloud (191 local
dispatcher) providers) models)
```
Cloud engines use provider-specific `CloudConfig`:
- **Azure**: SSML XML body with prosody tags, XML escaping
- **Google**: JSON body with base64 audio, v1beta1 timepoint support
- **All others**: Standard JSON bodies
## Sherpa-ONNX Models
191 models from the bundled `merged_models.json` registry. Models are loaded from `~/.rust-tts-wrapper/sherpaonnx/`.
## License
MIT