rust-tts-wrapper

Cross-platform TTS (Text-to-Speech) wrapper with C ABI. Mirrors js-tts-wrapper and swift-tts-wrapper.

Engines (21 total)

Engine	Type	Credentials	Streaming	Voice List	Word Boundaries	Speech Markdown
System (speech-dispatcher)	Local	None	—	—	Estimated	—
Sherpa-ONNX	Local (191 models)	None	Simulated*	Speakers	Estimated	—
Azure	Cloud	Key + Region	Chunked	API	Estimated	Platform-aware
Google Cloud	Cloud	API Key	Chunked	API	Real (v1beta1 timepoints)	Platform-aware
OpenAI	Cloud	API Key	Chunked	—	Estimated	Platform-aware
ElevenLabs	Cloud	API Key	Chunked	API	Estimated	Platform-aware
Cartesia	Cloud	API Key	Chunked	API	Estimated	Platform-aware
Deepgram	Cloud	API Key	Chunked	—	Estimated	Platform-aware
PlayHT	Cloud	API Key + User ID	Chunked	—	Estimated	Platform-aware
Fish Audio	Cloud	API Key	Chunked	—	Estimated	Platform-aware
Hume AI	Cloud	API Key	Chunked	—	Estimated	Platform-aware
Mistral	Cloud	API Key	Chunked	—	Estimated	Platform-aware
Murf	Cloud	API Key	Chunked	—	Estimated	Platform-aware
Resemble AI	Cloud	API Key	Chunked	—	Estimated	Platform-aware
Unreal Speech	Cloud	API Key	Chunked	—	Estimated	Platform-aware
UpliftAI	Cloud	API Key	Chunked	—	Estimated	Platform-aware
Amazon Polly	Cloud	Key + Secret + Region	Chunked	—	Estimated	Platform-aware
IBM Watson	Cloud	Key + Region + Instance	Chunked	—	Estimated	Platform-aware
Wit.ai	Cloud	Token	Chunked	—	Estimated	Platform-aware
xAI	Cloud	API Key	Chunked	—	Estimated	Platform-aware
ModelsLab	Cloud	API Key	Chunked	—	Estimated	Platform-aware

Streaming: Cloud engines stream audio in 8KB chunks via the on_audio callback. Sherpa-ONNX delivers all audio at once after synthesis (*simulated).
Voice List: Engines with "API" can enumerate voices from the provider's API.
Word Boundaries: Google returns real timing via v1beta1 timepoints with SSML marks. All others use word-length-adjusted estimation (150 WPM baseline, configurable).
Speech Markdown: Auto-detected and converted to platform-specific SSML via speechmarkdown-rust. Azure gets Microsoft SSML, Google gets Assistant SSML, others get Alexa SSML.

Rust API

`TtsEngine` Trait

pub trait TtsEngine: Send + Sync + Debug {
    // Speaking
    fn speak(&self, text: &str, voice: Option<&str>, rate: f32, pitch: f32, volume: f32,
             on_audio: Option<OnAudioCallback>, on_boundary: Option<OnBoundaryCallback>) -> TtsResult<()>;
    fn speak_with_options(&self, text: &str, options: Option<&SpeakOptions>,
                          on_audio: Option<OnAudioCallback>, on_boundary: Option<OnBoundaryCallback>) -> TtsResult<()>;
    fn speak_sync(&self, text: &str, voice: Option<&str>, rate: f32, pitch: f32, volume: f32,
                  on_audio: Option<OnAudioCallback>, on_boundary: Option<OnBoundaryCallback>) -> TtsResult<()>;

    // Synthesis (no playback)
    fn synth_to_bytes(&self, text: &str, voice: Option<&str>, rate: f32, pitch: f32, volume: f32) -> TtsResult<Vec<u8>>;
    fn synth_to_bytes_with_options(&self, text: &str, options: Option<&SpeakOptions>) -> TtsResult<Vec<u8>>;
    fn synth_with_boundaries(&self, text: &str, voice: Option<&str>, rate: f32, pitch: f32, volume: f32) -> TtsResult<(Vec<u8>, Vec<WordBoundary>)>;

    // Control
    fn stop(&self) -> TtsResult<()>;
    fn pause(&self) -> TtsResult<()>;
    fn resume(&self) -> TtsResult<()>;

    // Introspection
    fn get_voices(&self) -> TtsResult<Vec<Voice>>;
    fn engine_id(&self) -> &'static str;
    fn check_credentials(&self) -> TtsResult<bool>;
}

Callback Types

pub type OnAudioCallback<'a>    = &'a mut dyn FnMut(&[u8]);
pub type OnBoundaryCallback<'a> = &'a mut dyn FnMut(&str, f32, f32);  // word, start_s, end_s
pub type OnStartCallback<'a>    = &'a mut dyn FnMut();
pub type OnEndCallback<'a>      = &'a mut dyn FnMut();
pub type OnErrorCallback<'a>    = &'a mut dyn FnMut(&str);

Core Types

pub struct Voice {
    pub id: String,
    pub name: String,
    pub gender: Gender,             // Male | Female | Unknown
    pub provider: String,
    pub language_codes: Vec<LanguageCode>,
}

pub struct LanguageCode {
    pub bcp47: String,              // "en-US"
    pub iso639_3: String,           // "eng"
    pub display: String,            // "English (United States)"
}

pub struct WordBoundary {
    pub text: String,
    pub offset: u64,                // milliseconds
    pub duration: u64,              // milliseconds
}

pub struct SpeakOptions {
    pub rate: Option<f32>,
    pub speech_rate: Option<SpeechRate>,      // XSlow | Slow | Medium | Fast | XFast
    pub pitch: Option<f32>,
    pub speech_pitch: Option<SpeechPitch>,    // XLow | Low | Medium | High | XHigh
    pub volume: Option<f32>,
    pub voice: Option<String>,
    pub format: Option<AudioFormat>,          // Mp3 | Wav | Ogg | Opus | Aac | Flac | Pcm
    pub use_speech_markdown: bool,
    pub use_word_boundary: bool,
    pub raw_ssml: bool,
    pub extra: HashMap<String, String>,
}

pub enum Gender { Male, Female, Unknown }
pub enum AudioFormat { Mp3, Wav, Ogg, Opus, Aac, Flac, Pcm }
pub enum SpeechRate { XSlow, Slow, Medium, Fast, XFast }
pub enum SpeechPitch { XLow, Low, Medium, High, XHigh }

Utility Functions

// Word boundary estimation (matches Swift WordTimingEstimator)
pub fn estimate_word_boundaries(text: &str) -> Vec<WordBoundary>;
pub fn estimate_word_boundaries_with_wpm(text: &str, words_per_minute: f64) -> Vec<WordBoundary>;

// Speech Markdown preprocessing
pub fn preprocess_speech_markdown(text: &str, platform: &str) -> (String, bool);

// Gender normalization
pub fn normalize_gender(value: &str) -> Gender;

Factory

pub fn create_engine(engine_id: &str, credentials_json: &str) -> Option<Box<dyn TtsEngine>>;
pub fn engine_count() -> usize;
pub fn engine_list() -> Vec<EngineDescriptor>;

C API

All functions are extern "C", #[no_mangle]:

Function	Description
`tts_create(engine_id, credentials_json)`	Create engine, returns opaque `tts_ctx*`
`tts_destroy(ctx)`	Free engine context
`tts_speak(ctx, text)`	Speak async, returns 0/-1
`tts_speak_sync(ctx, text)`	Speak sync (blocking)
`tts_stop(ctx)`	Stop speech
`tts_pause(ctx)`	Pause in-progress speech
`tts_resume(ctx)`	Resume paused speech
`tts_synth_to_bytes(ctx, text, out_bytes, out_len)`	Synth to buffer (returns 0/-1)
`tts_free_bytes(bytes, len)`	Free buffer from tts_synth_to_bytes
`tts_get_voices(ctx, out_voices, out_count)`	Get voice list
`tts_free_voices(voices, count)`	Free voice array
`tts_set_voice(ctx, voice_id)`	Set voice
`tts_set_rate(ctx, rate)`	Set rate (1.0 = normal)
`tts_set_pitch(ctx, pitch)`	Set pitch (1.0 = normal)
`tts_set_volume(ctx, volume)`	Set volume (1.0 = normal)
`tts_set_on_audio(ctx, cb, userdata)`	Set streaming audio callback
`tts_set_on_boundary(ctx, cb, userdata)`	Set word boundary callback
`tts_get_engine_count()`	Count registered engines
`tts_get_engines(out_engines)`	Get engine descriptors
`tts_free_engine_info(engines, count)`	Free engine info
`tts_get_last_error()`	Get last error message

C Example

#include "tts_wrapper.h"
#include <stdio.h>

void on_audio(const uint8_t* chunk, uintptr_t size, void* userdata) {
    printf("Audio chunk: %zu bytes\n", size);
}

void on_boundary(const char* word, float start, float end, void* userdata) {
    printf("Word '%s' %.3f-%.3f\n", word, start, end);
}

int main() {
    tts_ctx* ctx = tts_create("openai", "{\"apiKey\":\"your-key\"}");
    tts_set_on_audio(ctx, on_audio, NULL);
    tts_set_on_boundary(ctx, on_boundary, NULL);
    tts_set_voice(ctx, "alloy");
    tts_speak_sync(ctx, "Hello world");
    tts_destroy(ctx);
}

Rust Example

use rust_tts_wrapper::{factory, types::SpeakOptions};

let engine = factory::create_engine("openai", r#"{"apiKey":"key"}"#).unwrap();

// Simple speak
engine.speak("Hello", Some("alloy"), 1.0, 1.0, 1.0, None, None).unwrap();

// With callbacks
let mut audio_cb = |chunk: &[u8]| println!("{} bytes", chunk.len());
let mut boundary_cb = |word: &str, s: f32, e: f32| println!("{}: {:.3}-{:.3}", word, s, e);
engine.speak_sync("Hello world", Some("alloy"), 1.0, 1.0, 1.0,
    Some(&mut audio_cb), Some(&mut boundary_cb)).unwrap();

// With SpeakOptions
let opts = SpeakOptions { voice: Some("alloy".into()), ..Default::default() };
engine.speak_with_options("Hello", Some(&opts), None, None).unwrap();

// Synth to bytes
let audio = engine.synth_to_bytes("Hello", Some("alloy"), 1.0, 1.0, 1.0).unwrap();

// Get voices
for v in engine.get_voices().unwrap() {
    println!("{} ({}) - {}", v.name, v.gender, v.primary_language());
}

// Check credentials
assert!(engine.check_credentials().unwrap());

Build

cargo build --all-features

Features

system — speech-dispatcher (Linux system TTS)
cloud — all 19 cloud engines via HTTP + speechmarkdown-rust + base64
sherpaonnx — Sherpa-ONNX offline TTS (191 models)

Lint & Test

cargo fmt --all -- --check
cargo clippy --all-features -- -D warnings
cargo test --all-features

Bindings

Python (`bindings/python/tts_wrapper.py`)

from tts_wrapper import TTSClient

client = TTSClient("openai", {"apiKey": "your-key"})
client.on_audio(lambda chunk: print(f"{len(chunk)} bytes"))
client.on_boundary(lambda word, s, e: print(f"{word}: {s:.3f}-{e:.3f}"))
client.set_voice("alloy")
client.speak_sync("Hello world")
client.stop()

.NET (`bindings/dotnet/TtsClient.cs`)

using TtsWrapper;

var client = new TtsClient("openai", new() { {"apiKey", "your-key"} });
client.SetVoice("alloy");
client.SetRate(1.0f);
client.SetPitch(1.0f);
client.SetVolume(1.0f);
client.SpeakSync("Hello world");
client.Stop();

Swift (`bindings/swift/TtsClient.swift`)

let client = TTSClient(engineId: "openai", credentials: ["apiKey": "your-key"])
client.setVoice("alloy")
client.setRate(1.0)
client.speakSync("Hello world")
client.stop()

Architecture

               TtsEngine (trait)
                     |
      +--------------+--------------+
      |              |              |
  SystemEngine   CloudEngine   SherpaOnnxEngine
  (speech-       (19 cloud      (191 local
  dispatcher)    providers)     models)

Cloud engines use provider-specific CloudConfig:

Azure: SSML XML body with prosody tags, XML escaping
Google: JSON body with base64 audio, v1beta1 timepoint support
All others: Standard JSON bodies

Sherpa-ONNX Models

191 models from the bundled merged_models.json registry. Models are loaded from ~/.rust-tts-wrapper/sherpaonnx/.

License

MIT

rust-tts-wrapper 0.1.0