rust-tts-wrapper 0.1.0

# rust-tts-wrapper

Cross-platform TTS (Text-to-Speech) wrapper with C ABI. Mirrors [js-tts-wrapper](https://github.com/AACTools/js-tts-wrapper) and [swift-tts-wrapper](https://github.com/AACTools/swift-tts-wrapper).

## Engines (21 total)

| Engine | Type | Credentials | Streaming | Voice List | Word Boundaries | Speech Markdown |
|--------|------|-------------|-----------|------------|-----------------|-----------------|
| System (speech-dispatcher) | Local | None ||| Estimated ||
| Sherpa-ONNX | Local (191 models) | None | Simulated* | Speakers | Estimated ||
| Azure | Cloud | Key + Region | Chunked | API | Estimated | Platform-aware |
| Google Cloud | Cloud | API Key | Chunked | API | **Real** (v1beta1 timepoints) | Platform-aware |
| OpenAI | Cloud | API Key | Chunked || Estimated | Platform-aware |
| ElevenLabs | Cloud | API Key | Chunked | API | Estimated | Platform-aware |
| Cartesia | Cloud | API Key | Chunked | API | Estimated | Platform-aware |
| Deepgram | Cloud | API Key | Chunked || Estimated | Platform-aware |
| PlayHT | Cloud | API Key + User ID | Chunked || Estimated | Platform-aware |
| Fish Audio | Cloud | API Key | Chunked || Estimated | Platform-aware |
| Hume AI | Cloud | API Key | Chunked || Estimated | Platform-aware |
| Mistral | Cloud | API Key | Chunked || Estimated | Platform-aware |
| Murf | Cloud | API Key | Chunked || Estimated | Platform-aware |
| Resemble AI | Cloud | API Key | Chunked || Estimated | Platform-aware |
| Unreal Speech | Cloud | API Key | Chunked || Estimated | Platform-aware |
| UpliftAI | Cloud | API Key | Chunked || Estimated | Platform-aware |
| Amazon Polly | Cloud | Key + Secret + Region | Chunked || Estimated | Platform-aware |
| IBM Watson | Cloud | Key + Region + Instance | Chunked || Estimated | Platform-aware |
| Wit.ai | Cloud | Token | Chunked || Estimated | Platform-aware |
| xAI | Cloud | API Key | Chunked || Estimated | Platform-aware |
| ModelsLab | Cloud | API Key | Chunked || Estimated | Platform-aware |

- **Streaming**: Cloud engines stream audio in 8 KB chunks via the `on_audio` callback. Sherpa-ONNX synthesizes the full utterance first and then delivers all audio at once, which the table marks as Simulated*.
- **Voice List**: Engines with "API" can enumerate voices from the provider's API.
- **Word Boundaries**: Google returns real timing via v1beta1 timepoints with SSML marks. All others use word-length-adjusted estimation (150 WPM baseline, configurable).
- **Speech Markdown**: Auto-detected and converted to platform-specific SSML via [speechmarkdown-rust](https://github.com/AACTools/speechmarkdown-rust). Azure gets Microsoft SSML, Google gets Assistant SSML, others get Alexa SSML.
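
The dialect rule in the last bullet can be sketched as a small lookup. This is an illustration only, not the crate's code: the `SsmlDialect` enum, the `dialect_for_engine` function, and the engine identifiers `"azure"` and `"google"` are assumptions (only `"openai"` appears as an identifier in this README).

```rust
/// SSML dialect that Speech Markdown is rendered into (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SsmlDialect {
    Microsoft,
    GoogleAssistant,
    Alexa,
}

/// Pick a dialect from an engine identifier.
/// The identifiers "azure" and "google" are assumed, not taken from the crate.
pub fn dialect_for_engine(engine_id: &str) -> SsmlDialect {
    match engine_id.to_ascii_lowercase().as_str() {
        "azure" => SsmlDialect::Microsoft,
        "google" => SsmlDialect::GoogleAssistant,
        // Every other engine falls back to Alexa-flavoured SSML.
        _ => SsmlDialect::Alexa,
    }
}
```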

## Rust API

### `TtsEngine` Trait

```rust
pub trait TtsEngine: Send + Sync + Debug {
    // Speaking
    fn speak(&self, text: &str, voice: Option<&str>, rate: f32, pitch: f32, volume: f32,
             on_audio: Option<OnAudioCallback>, on_boundary: Option<OnBoundaryCallback>) -> TtsResult<()>;
    fn speak_with_options(&self, text: &str, options: Option<&SpeakOptions>,
                          on_audio: Option<OnAudioCallback>, on_boundary: Option<OnBoundaryCallback>) -> TtsResult<()>;
    fn speak_sync(&self, text: &str, voice: Option<&str>, rate: f32, pitch: f32, volume: f32,
                  on_audio: Option<OnAudioCallback>, on_boundary: Option<OnBoundaryCallback>) -> TtsResult<()>;

    // Synthesis (no playback)
    fn synth_to_bytes(&self, text: &str, voice: Option<&str>, rate: f32, pitch: f32, volume: f32) -> TtsResult<Vec<u8>>;
    fn synth_to_bytes_with_options(&self, text: &str, options: Option<&SpeakOptions>) -> TtsResult<Vec<u8>>;
    fn synth_with_boundaries(&self, text: &str, voice: Option<&str>, rate: f32, pitch: f32, volume: f32) -> TtsResult<(Vec<u8>, Vec<WordBoundary>)>;

    // Control
    fn stop(&self) -> TtsResult<()>;
    fn pause(&self) -> TtsResult<()>;
    fn resume(&self) -> TtsResult<()>;

    // Introspection
    fn get_voices(&self) -> TtsResult<Vec<Voice>>;
    fn engine_id(&self) -> &'static str;
    fn check_credentials(&self) -> TtsResult<bool>;
}
```

### Callback Types

```rust
pub type OnAudioCallback<'a>    = &'a mut dyn FnMut(&[u8]);
pub type OnBoundaryCallback<'a> = &'a mut dyn FnMut(&str, f32, f32);  // word, start_s, end_s
pub type OnStartCallback<'a>    = &'a mut dyn FnMut();
pub type OnEndCallback<'a>      = &'a mut dyn FnMut();
pub type OnErrorCallback<'a>    = &'a mut dyn FnMut(&str);
```
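
Because the callbacks are `&mut dyn FnMut` references, a closure can borrow and mutate local state while audio arrives. The sketch below shows that pattern with the 8 KB chunking described in the Engines section. It is self-contained and does not call the crate; `deliver_in_chunks` and `chunk_sizes` are hypothetical helpers, and the alias is redeclared locally so the block compiles on its own.

```rust
/// Same shape as the `OnAudioCallback` alias above, redeclared for this sketch.
pub type OnAudioCallback<'a> = &'a mut dyn FnMut(&[u8]);

/// Chunk size described in the Engines section (8 KB).
pub const CHUNK_SIZE: usize = 8 * 1024;

/// Hypothetical helper: hand `audio` to the callback in fixed-size chunks.
/// Returns the number of chunks delivered (0 when no callback is set).
pub fn deliver_in_chunks(audio: &[u8], on_audio: Option<OnAudioCallback>) -> usize {
    let Some(cb) = on_audio else { return 0 };
    let mut delivered = 0;
    for chunk in audio.chunks(CHUNK_SIZE) {
        cb(chunk);
        delivered += 1;
    }
    delivered
}

/// Usage: record each chunk's size through a closure that borrows `sizes`.
pub fn chunk_sizes(audio: &[u8]) -> Vec<usize> {
    let mut sizes = Vec::new();
    let mut cb = |chunk: &[u8]| sizes.push(chunk.len());
    deliver_in_chunks(audio, Some(&mut cb));
    sizes
}
```

The last chunk is simply whatever remains, so consumers should not assume every chunk is exactly 8 KB.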

### Core Types

```rust
pub struct Voice {
    pub id: String,
    pub name: String,
    pub gender: Gender,             // Male | Female | Unknown
    pub provider: String,
    pub language_codes: Vec<LanguageCode>,
}

pub struct LanguageCode {
    pub bcp47: String,              // "en-US"
    pub iso639_3: String,           // "eng"
    pub display: String,            // "English (United States)"
}

pub struct WordBoundary {
    pub text: String,
    pub offset: u64,                // milliseconds
    pub duration: u64,              // milliseconds
}

pub struct SpeakOptions {
    pub rate: Option<f32>,
    pub speech_rate: Option<SpeechRate>,      // XSlow | Slow | Medium | Fast | XFast
    pub pitch: Option<f32>,
    pub speech_pitch: Option<SpeechPitch>,    // XLow | Low | Medium | High | XHigh
    pub volume: Option<f32>,
    pub voice: Option<String>,
    pub format: Option<AudioFormat>,          // Mp3 | Wav | Ogg | Opus | Aac | Flac | Pcm
    pub use_speech_markdown: bool,
    pub use_word_boundary: bool,
    pub raw_ssml: bool,
    pub extra: HashMap<String, String>,
}

pub enum Gender { Male, Female, Unknown }
pub enum AudioFormat { Mp3, Wav, Ogg, Opus, Aac, Flac, Pcm }
pub enum SpeechRate { XSlow, Slow, Medium, Fast, XFast }
pub enum SpeechPitch { XLow, Low, Medium, High, XHigh }
```

### Utility Functions

```rust
// Word boundary estimation (matches Swift WordTimingEstimator)
pub fn estimate_word_boundaries(text: &str) -> Vec<WordBoundary>;
pub fn estimate_word_boundaries_with_wpm(text: &str, words_per_minute: f64) -> Vec<WordBoundary>;

// Speech Markdown preprocessing
pub fn preprocess_speech_markdown(text: &str, platform: &str) -> (String, bool);

// Gender normalization
pub fn normalize_gender(value: &str) -> Gender;
```
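
The internals of the estimators are not shown here, so the following is only a sketch of what a word-length-adjusted estimate could look like, not the crate's algorithm. It assumes the total duration is `word_count / wpm` minutes and that each word receives a share proportional to its character count. `estimate_boundaries_sketch` is a hypothetical name, and `WordBoundary` is redeclared locally.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct WordBoundary {
    pub text: String,
    pub offset: u64,   // milliseconds
    pub duration: u64, // milliseconds
}

/// Sketch of a word-length-adjusted estimator (an assumption, not the crate's code).
/// Total time is `word_count / wpm` minutes, shared among the words in
/// proportion to their character counts.
pub fn estimate_boundaries_sketch(text: &str, words_per_minute: f64) -> Vec<WordBoundary> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let total_chars: usize = words.iter().map(|w| w.chars().count()).sum();
    if total_chars == 0 || words_per_minute <= 0.0 {
        return Vec::new();
    }
    let total_ms = words.len() as f64 * 60_000.0 / words_per_minute;
    let mut elapsed_chars = 0usize;
    let mut out = Vec::with_capacity(words.len());
    for w in words {
        // Cumulative character count decides where each word starts and ends.
        let start = (total_ms * elapsed_chars as f64 / total_chars as f64).round() as u64;
        elapsed_chars += w.chars().count();
        let end = (total_ms * elapsed_chars as f64 / total_chars as f64).round() as u64;
        out.push(WordBoundary { text: w.to_string(), offset: start, duration: end - start });
    }
    out
}
```

Computing both ends from the cumulative character count keeps consecutive boundaries contiguous, so rounding never opens a gap between words.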

### Factory

```rust
pub fn create_engine(engine_id: &str, credentials_json: &str) -> Option<Box<dyn TtsEngine>>;
pub fn engine_count() -> usize;
pub fn engine_list() -> Vec<EngineDescriptor>;
```
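
Since `create_engine` returns an `Option`, an unknown identifier gives the caller no error message. One way to handle that is to convert the `Option` into a `Result` at the call site. The sketch below uses a stand-in trait and stand-in engines rather than the crate itself; `EngineSketch`, `create_engine_sketch`, `require_engine`, and the `"system"` identifier are all assumptions.

```rust
use std::fmt::Debug;

/// Stand-in for `TtsEngine`, reduced to one method for illustration.
pub trait EngineSketch: Debug {
    fn engine_id(&self) -> &'static str;
}

#[derive(Debug)]
struct SystemSketch;
#[derive(Debug)]
struct OpenAiSketch;

impl EngineSketch for SystemSketch {
    fn engine_id(&self) -> &'static str { "system" }
}
impl EngineSketch for OpenAiSketch {
    fn engine_id(&self) -> &'static str { "openai" }
}

/// Hypothetical factory: unknown identifiers yield `None`.
pub fn create_engine_sketch(engine_id: &str) -> Option<Box<dyn EngineSketch>> {
    match engine_id {
        "system" => Some(Box::new(SystemSketch)),
        "openai" => Some(Box::new(OpenAiSketch)),
        _ => None,
    }
}

/// Turn the `Option` into a `Result` so callers get a message instead of a panic.
pub fn require_engine(engine_id: &str) -> Result<Box<dyn EngineSketch>, String> {
    create_engine_sketch(engine_id).ok_or_else(|| format!("unknown engine id: {engine_id}"))
}
```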

## C API

All functions are exported as `extern "C"` with `#[no_mangle]`:

| Function | Description |
|----------|-------------|
| `tts_create(engine_id, credentials_json)` | Create engine, returns opaque `tts_ctx*` |
| `tts_destroy(ctx)` | Free engine context |
| `tts_speak(ctx, text)` | Speak async, returns 0/-1 |
| `tts_speak_sync(ctx, text)` | Speak sync (blocking) |
| `tts_stop(ctx)` | Stop speech |
| `tts_pause(ctx)` | Pause in-progress speech |
| `tts_resume(ctx)` | Resume paused speech |
| `tts_synth_to_bytes(ctx, text, out_bytes, out_len)` | Synth to buffer (returns 0/-1) |
| `tts_free_bytes(bytes, len)` | Free buffer from `tts_synth_to_bytes` |
| `tts_get_voices(ctx, out_voices, out_count)` | Get voice list |
| `tts_free_voices(voices, count)` | Free voice array |
| `tts_set_voice(ctx, voice_id)` | Set voice |
| `tts_set_rate(ctx, rate)` | Set rate (1.0 = normal) |
| `tts_set_pitch(ctx, pitch)` | Set pitch (1.0 = normal) |
| `tts_set_volume(ctx, volume)` | Set volume (1.0 = normal) |
| `tts_set_on_audio(ctx, cb, userdata)` | Set streaming audio callback |
| `tts_set_on_boundary(ctx, cb, userdata)` | Set word boundary callback |
| `tts_get_engine_count()` | Count registered engines |
| `tts_get_engines(out_engines)` | Get engine descriptors |
| `tts_free_engine_info(engines, count)` | Free engine info |
| `tts_get_last_error()` | Get last error message |
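
The pairing of `tts_synth_to_bytes` with `tts_free_bytes(bytes, len)` suggests the usual ownership hand-off across a C ABI: the library allocates, the caller receives a pointer and a length, and the same library frees. The sketch below shows one common way to implement that pattern on the Rust side. The `demo_*` functions are hypothetical and are not the crate's implementation. A real export would also carry `#[no_mangle]`, which is omitted here so the block stays edition-agnostic.

```rust
use std::ptr;

/// Hypothetical analogue of `tts_synth_to_bytes`: writes a heap buffer and its
/// length through the out-parameters; returns 0 on success, -1 on failure.
pub unsafe extern "C" fn demo_synth_to_bytes(out_bytes: *mut *mut u8, out_len: *mut usize) -> i32 {
    if out_bytes.is_null() || out_len.is_null() {
        return -1;
    }
    // Placeholder payload standing in for synthesized audio.
    let boxed: Box<[u8]> = vec![1u8, 2, 3, 4].into_boxed_slice();
    let len = boxed.len();
    // Ownership leaves Rust here; the caller must hand it back to be freed.
    let data = Box::into_raw(boxed) as *mut u8;
    unsafe {
        *out_bytes = data;
        *out_len = len;
    }
    0
}

/// Hypothetical analogue of `tts_free_bytes`: rebuilds the box and drops it.
pub unsafe extern "C" fn demo_free_bytes(bytes: *mut u8, len: usize) {
    if bytes.is_null() {
        return;
    }
    unsafe { drop(Box::from_raw(ptr::slice_from_raw_parts_mut(bytes, len))) };
}

/// Round trip as a C caller would perform it: synthesize, copy, free.
pub fn round_trip() -> Vec<u8> {
    let mut data: *mut u8 = ptr::null_mut();
    let mut len: usize = 0;
    let rc = unsafe { demo_synth_to_bytes(&mut data, &mut len) };
    assert_eq!(rc, 0);
    let copy = unsafe { std::slice::from_raw_parts(data, len) }.to_vec();
    unsafe { demo_free_bytes(data, len) };
    copy
}
```

Passing the length back to the free function is what lets the allocation be reconstructed exactly, which is why a buffer returned this way should not be released with C's `free`.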

### C Example

```c
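#include "tts_wrapper.h"
#include <stdint.h>
#include <stdio.h>

/* Called for each streamed audio chunk. */
void on_audio(const uint8_t* chunk, uintptr_t size, void* userdata) {
    (void)chunk;
    (void)userdata;
    printf("Audio chunk: %zu bytes\n", (size_t)size);
}

/* Called for each word boundary. */
void on_boundary(const char* word, float start, float end, void* userdata) {
    (void)userdata;
    printf("Word '%s' %.3f-%.3f\n", word, start, end);
}

int main(void) {
    tts_ctx* ctx = tts_create("openai", "{\"apiKey\":\"your-key\"}");
    if (ctx == NULL) {
        /* Assumes failure is reported as NULL and that
           tts_get_last_error() returns a C string. */
        fprintf(stderr, "tts_create failed: %s\n", tts_get_last_error());
        return 1;
    }
    tts_set_on_audio(ctx, on_audio, NULL);
    tts_set_on_boundary(ctx, on_boundary, NULL);
    tts_set_voice(ctx, "alloy");
    tts_speak_sync(ctx, "Hello world");
    tts_destroy(ctx);
    return 0;
}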
```

### Rust Example

```rust
use rust_tts_wrapper::{factory, types::SpeakOptions};

let engine = factory::create_engine("openai", r#"{"apiKey":"key"}"#).unwrap();

// Simple speak
engine.speak("Hello", Some("alloy"), 1.0, 1.0, 1.0, None, None).unwrap();

// With callbacks
let mut audio_cb = |chunk: &[u8]| println!("{} bytes", chunk.len());
let mut boundary_cb = |word: &str, s: f32, e: f32| println!("{}: {:.3}-{:.3}", word, s, e);
engine.speak_sync("Hello world", Some("alloy"), 1.0, 1.0, 1.0,
    Some(&mut audio_cb), Some(&mut boundary_cb)).unwrap();

// With SpeakOptions
let opts = SpeakOptions { voice: Some("alloy".into()), ..Default::default() };
engine.speak_with_options("Hello", Some(&opts), None, None).unwrap();

// Synth to bytes
let audio = engine.synth_to_bytes("Hello", Some("alloy"), 1.0, 1.0, 1.0).unwrap();

// Get voices
for v in engine.get_voices().unwrap() {
    println!("{} ({}) - {}", v.name, v.gender, v.primary_language());
}

// Check credentials
assert!(engine.check_credentials().unwrap());
```

## Build

```bash
cargo build --all-features
```

### Features

- `system` — speech-dispatcher (Linux system TTS)
- `cloud` — all 19 cloud engines via HTTP + speechmarkdown-rust + base64
- `sherpaonnx` — Sherpa-ONNX offline TTS (191 models)

### Lint & Test

```bash
cargo fmt --all -- --check
cargo clippy --all-features -- -D warnings
cargo test --all-features
```

## Bindings

### Python (`bindings/python/tts_wrapper.py`)

```python
from tts_wrapper import TTSClient

client = TTSClient("openai", {"apiKey": "your-key"})
client.on_audio(lambda chunk: print(f"{len(chunk)} bytes"))
client.on_boundary(lambda word, s, e: print(f"{word}: {s:.3f}-{e:.3f}"))
client.set_voice("alloy")
client.speak_sync("Hello world")
client.stop()
```

### .NET (`bindings/dotnet/TtsClient.cs`)

```csharp
using TtsWrapper;

var client = new TtsClient("openai", new() { {"apiKey", "your-key"} });
client.SetVoice("alloy");
client.SetRate(1.0f);
client.SetPitch(1.0f);
client.SetVolume(1.0f);
client.SpeakSync("Hello world");
client.Stop();
```

### Swift (`bindings/swift/TtsClient.swift`)

```swift
let client = TTSClient(engineId: "openai", credentials: ["apiKey": "your-key"])
client.setVoice("alloy")
client.setRate(1.0)
client.speakSync("Hello world")
client.stop()
```

## Architecture

```
               TtsEngine (trait)
                     |
      +--------------+--------------+
      |              |              |
  SystemEngine   CloudEngine   SherpaOnnxEngine
  (speech-       (19 cloud      (191 local
  dispatcher)    providers)     models)
```

Cloud engines use provider-specific `CloudConfig`:
- **Azure**: SSML XML body with prosody tags, XML escaping
- **Google**: JSON body with base64 audio, v1beta1 timepoint support
- **All others**: Standard JSON bodies
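
To make the Azure entry concrete, the sketch below builds an SSML body with a `<prosody>` tag and escapes the text first. It is an illustration under stated assumptions, not the crate's `CloudConfig` code: `xml_escape` and `build_azure_ssml` are hypothetical names, and rendering the rate multiplier as a relative percentage is one possible mapping.

```rust
/// Escape the five XML special characters so user text cannot break the SSML body.
pub fn xml_escape(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            other => out.push(other),
        }
    }
    out
}

/// Hypothetical builder: `rate` is a multiplier (1.0 = normal) rendered as a
/// relative percentage for the `<prosody>` tag, e.g. 1.2 becomes "+20%".
pub fn build_azure_ssml(text: &str, voice: &str, lang: &str, rate: f32) -> String {
    let rate_pct = ((rate - 1.0) * 100.0).round() as i32;
    format!(
        "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"{lang}\">\
         <voice name=\"{voice}\"><prosody rate=\"{rate_pct:+}%\">{body}</prosody></voice></speak>",
        lang = xml_escape(lang),
        voice = xml_escape(voice),
        rate_pct = rate_pct,
        body = xml_escape(text),
    )
}
```

Escaping happens before interpolation, so an ampersand or angle bracket in the spoken text ends up as an entity rather than as markup.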

## Sherpa-ONNX Models

The bundled `merged_models.json` registry lists 191 models. Models are loaded from `~/.rust-tts-wrapper/sherpaonnx/`.

## License

MIT