# espeak-ng-rs

[![crates.io](https://img.shields.io/crates/v/espeak-ng.svg)](https://crates.io/crates/espeak-ng)
[![docs.rs](https://docs.rs/espeak-ng/badge.svg)](https://docs.rs/espeak-ng)
[![License: GPL-3.0-or-later](https://img.shields.io/badge/license-GPL--3.0--or--later-blue.svg)](https://www.gnu.org/licenses/gpl-3.0.html)

A pure-Rust port of [eSpeak NG](https://github.com/espeak-ng/espeak-ng) text-to-speech,
built with a test-first, bottom-up approach.

The C library is used as an oracle: the Rust implementation must produce
**bit-identical output** for every input.

---

## Status

| Module         | Status        | Notes                                                                  |
|----------------|---------------|------------------------------------------------------------------------|
| `encoding`     | ✅ Complete   | All codepages (UTF-8, ISO-8859-*, KOI8-R, ISCII, …)                   |
| `phoneme`      | ✅ Complete   | Phoneme table loader, IPA rendering, instruction scanner               |
| `dictionary`   | ✅ Complete   | Hash lookup, rule engine, suffix stripping, `SetWordStress`            |
| `translate`    | ✅ Complete   | Full text → IPA pipeline, multi-language, numbers, punctuation         |
| `synthesize`   | ✅ Complete   | Full harmonic synthesis reading espeak-ng binary phoneme data          |

**319 tests passing** — 27/27 IPA oracle comparisons + 10 synthesis integration tests.

**Two synthesis paths are available:**

| Path | How | Quality |
|------|-----|---------|
| `Synthesizer::synthesize(ipa_str)` | Hand-coded IPA → 3-formant cascade | Works anywhere, generic voice |
| `Synthesizer::synthesize_codes(codes, phdata)` | Phoneme bytecodes → real espeak-ng frame data → harmonic synth | Requires espeak-ng data files, authentic espeak-ng character |
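
To give a feel for what one stage of a formant cascade does, here is a minimal sketch of a textbook two-pole (Klatt-style) resonator, with stages chained in series. The type name `SketchResonator`, its constructor, and the coefficient formulas are illustrative assumptions and are not taken from this crate's source; the crate's own resonator may be structured differently.

```rust
use std::f64::consts::PI;

/// Hypothetical two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2].
/// Not the crate's type; a standalone illustration only.
struct SketchResonator {
    a: f64,
    b: f64,
    c: f64,
    y1: f64, // y[n-1]
    y2: f64, // y[n-2]
}

impl SketchResonator {
    /// `freq_hz` is the formant centre frequency, `bw_hz` its bandwidth.
    fn new(freq_hz: f64, bw_hz: f64, sample_rate: f64) -> Self {
        let t = 1.0 / sample_rate;
        let r = (-PI * bw_hz * t).exp(); // pole radius, always < 1
        let c = -(r * r);
        let b = 2.0 * r * (2.0 * PI * freq_hz * t).cos();
        let a = 1.0 - b - c; // normalises the gain at 0 Hz to 1
        Self { a, b, c, y1: 0.0, y2: 0.0 }
    }

    /// Process one input sample and return one output sample.
    fn tick(&mut self, x: f64) -> f64 {
        let y = self.a * x + self.b * self.y1 + self.c * self.y2;
        self.y2 = self.y1;
        self.y1 = y;
        y
    }
}

/// Pass one sample through the resonators in series (a "cascade"):
/// the output of each stage is the input of the next.
fn cascade_tick(stages: &mut [SketchResonator], x: f64) -> f64 {
    stages.iter_mut().fold(x, |acc, r| r.tick(acc))
}
```

Because each stage has unity gain at 0 Hz, feeding a constant input through any number of stages settles back to that constant, which makes a convenient sanity check.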

---

## Quick start

```bash
# Run all tests (unit + integration + oracle)
cargo test

# Run oracle comparison tests with verbose output
cargo test --test oracle_comparison -- --nocapture

# Run benchmarks (requires espeak-ng binary on PATH for C baseline)
./benches/bench.sh
```

---

## Usage

```rust
// Text → IPA phonemes
let ipa = espeak_ng::text_to_ipa("en", "hello world")?;
assert_eq!(ipa, "həlˈəʊ wˈɜːld");

// More examples
espeak_ng::text_to_ipa("en", "42")?;          // "fˈɔːti tˈuː"
espeak_ng::text_to_ipa("en", "walked")?;      // "wˈɔːkt"
espeak_ng::text_to_ipa("en", "happily")?;     // "hˈapɪli"
espeak_ng::text_to_ipa("de", "schön")?;       // "ʃˈøːn"
espeak_ng::text_to_ipa("fr", "bonjour")?;     // "bɔ̃ʒˈuːɹ"

// Text → raw PCM (22050 Hz, mono, 16-bit) — not yet implemented
let (samples, rate) = espeak_ng::text_to_pcm("en", "hello world")?;
```

---

## What is implemented

### `encoding/`
- Full UTF-8 encode/decode (`utf8_decode_one`, `encode_one`)
- All eSpeak NG codepage tables: ISO-8859-1 through -16, KOI8-R, ISCII
- `Encoding::from_name()` lookup matching C's `encoding.c`
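
As an illustration of what decoding a single UTF-8 scalar involves, here is a minimal strict decoder. The name `decode_one_sketch` and its `Option<(u32, usize)>` return type are assumptions made for this example; the crate's `utf8_decode_one` may have a different signature and may treat malformed input differently.

```rust
/// Decode one UTF-8 scalar value from the front of `bytes`.
/// Returns the code point and the number of bytes consumed, or `None`
/// for empty, truncated, or malformed input.
/// Hypothetical helper, not the crate's API.
fn decode_one_sketch(bytes: &[u8]) -> Option<(u32, usize)> {
    let first = *bytes.first()?;
    // The lead byte fixes the sequence length and the initial payload bits.
    let (len, mut cp) = match first {
        0x00..=0x7F => return Some((first as u32, 1)),
        0xC2..=0xDF => (2, (first & 0x1F) as u32),
        0xE0..=0xEF => (3, (first & 0x0F) as u32),
        0xF0..=0xF4 => (4, (first & 0x07) as u32),
        _ => return None, // stray continuation byte or invalid lead byte
    };
    // `get` returns None if the sequence is truncated.
    let tail = bytes.get(1..len)?;
    for &b in tail {
        if b & 0xC0 != 0x80 {
            return None; // not a continuation byte
        }
        cp = (cp << 6) | (b & 0x3F) as u32;
    }
    // Reject overlong forms, surrogates, and values above U+10FFFF.
    let min = match len {
        2 => 0x80,
        3 => 0x800,
        _ => 0x1_0000,
    };
    if cp < min || (0xD800..=0xDFFF).contains(&cp) || cp > 0x10_FFFF {
        return None;
    }
    Some((cp, len))
}
```

For example, `decode_one_sketch("ə".as_bytes())` yields `Some((0x259, 2))`, since U+0259 is encoded as the two bytes `C9 99`.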

### `phoneme/`
- Binary phoneme table loader (`ph_data` files, `phonindex`)
- Per-language table selection (`select_table_by_name`)
- Phoneme attribute access: type, flags, mnemonic, program address
- IPA string extraction via bytecode scanner (`phoneme_ipa_string`):
  - Handles `i_IPA_NAME` instructions
  - Correctly handles synthesis-only phonemes (first instruction ≥ `i_FMT`)
  - Language-specific scanning depth to avoid bleed-through

### `dictionary/`
- Binary `en_dict`-format reader (`Dictionary::from_bytes`)
- `TransposeAlphabet` decompression for Latin-script entries
- Hash-based word lookup (`hash_word`, `lookup`)
- Full rule engine (`TranslateRules` / `MatchRule`):
  - Pre/post context matching (letter groups, syllable counts, stress, …)
  - `RULE_ENDING` suffix detection with `end_type` and separated `end_phonemes`
  - `RULE_NO_SUFFIX`, `RULE_DOUBLE`, `RULE_LETTERGP`, `RULE_DOLLAR`, …
  - Score-based rule selection, condition bitmask, spell-word flag
- `SetWordStress` — full port of the C function:
  - Vowel stress array construction (`GetVowelStress`)
  - All stress placement strategies (trochaic, iambic, left-to-right, …)
  - `$strend` / `$strend2` end-stress promotion
  - Clause-level final-stress demotion
- Suffix stripping (SUFX_I): re-translates stem with `FLAG_SUFFIX_REMOVED`,
  combining stem phonemes + suffix phonemes correctly
- Word-final devoicing for German / Dutch / Afrikaans / Slovak / Slovenian / Albanian
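
Hash-based lookup works by reducing each word to a small bucket index and scanning only that bucket. The sketch below shows a 10-bit rolling hash of the kind eSpeak NG's C code is believed to use. The constants, and the name `hash_word_sketch`, are assumptions for illustration and should be checked against the C source and this crate's `hash_word` before being relied on.

```rust
/// Number of hash buckets assumed for this sketch (10-bit hash).
const N_HASH_BUCKETS: usize = 1024;

/// Hypothetical 10-bit word hash. Input is the raw bytes of the word;
/// hashing stops at the first NUL byte, mirroring C string semantics.
fn hash_word_sketch(word: &[u8]) -> usize {
    let mut hash: u32 = 0;
    let mut chars: u32 = 0;
    for &c in word.iter().take_while(|&&c| c != 0) {
        hash = hash * 8 + c as u32;
        // Fold the high bits back into the low 10 bits.
        hash = (hash & 0x3ff) ^ (hash >> 8);
        chars += 1;
    }
    // Mix in the length and clamp to the bucket range.
    ((hash + chars) & 0x3ff) as usize
}

/// Pick the bucket for a word in a table of `N_HASH_BUCKETS` buckets.
fn bucket_for<'a, T>(buckets: &'a [Vec<T>], word: &[u8]) -> Option<&'a Vec<T>> {
    debug_assert_eq!(buckets.len(), N_HASH_BUCKETS);
    buckets.get(hash_word_sketch(word))
}
```

With these constants the single byte `b"a"` (97) hashes to 98: the fold leaves 97 unchanged and the length adds 1.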

### `translate/`
- `text_to_ipa(lang, text) → String` public API
- Tokeniser: words, parsed number tokens, punctuation, clause boundaries
- `word_to_phonemes`: dictionary lookup → suffix stripping → translation rules
- Typed number grammar:
  - `NumberGrammar` models ordinal parsing, tens ordering, hundreds, and thousands behavior per language
  - `NumberToken` distinguishes cardinals, decimals, and ordinals before phoneme rendering
  - `Pronunciation` builder handles `END_WORD` (||) separators without manual byte trimming in each branch
- Number-to-phonemes:
  - Integers 0–999 999 999 999 via grouped scale dict entries (`_0`–`_19`, `_NX`, `_0C`, `_0M1`, `_0M2`, `_0M3`)
  - `NUM_1900` year format (1900 → "nineteen hundred")
  - Decimal numbers: integer + point + individual digits
  - Per-language number grammar for conjunctions, units-before-tens ordering, and omitted `one` prefixes
  - Ordinal numbers via `_#<suffix>` dict entries, language ordinal indicators, ordinal-dot languages, and scale-aware ordinal composition
- IPA rendering (`phonemes_to_ipa_full`):
  - Primary (ˈ) / secondary (ˌ) stress marks before vowels
  - `END_WORD` → word-boundary space
  - Language-specific overrides (English schwa, French 'r' → ʁ, …)
  - Context-sensitive phonemes: `d#` → 't'/'d', `z#` → 's'/'z' based on voicing
  - French liaison phoneme suppression at word-final position
  - German word-final devoicing (Auslautverhärtung)
- Multi-clause stress promotion (mirrors `phonemelist.c`)
- Language routing: en, fr, de, es, and many more via data files
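
The grouped scale composition mentioned under number-to-phonemes can be pictured as splitting an integer into three-digit groups, each tagged with its scale. The following is a minimal sketch of that splitting step only; `scale_groups` is a hypothetical helper, not a function from this crate, and the real implementation also has to apply the per-language grammar described above.

```rust
/// Split `n` into three-digit groups, most significant first.
/// Each item is (group value 0..=999, scale index), where scale 0 is
/// units, 1 is thousands, 2 is millions, 3 is billions, and so on.
/// All-zero groups are skipped, except that 0 itself yields one (0, 0) group.
/// Hypothetical helper for illustration.
fn scale_groups(mut n: u64) -> Vec<(u16, u8)> {
    if n == 0 {
        return vec![(0, 0)];
    }
    let mut groups = Vec::new();
    let mut scale = 0u8;
    while n > 0 {
        let group = (n % 1000) as u16;
        if group != 0 {
            groups.push((group, scale));
        }
        n /= 1000;
        scale += 1;
    }
    groups.reverse(); // collected least significant first
    groups
}
```

A renderer would then speak each group as a number below 1000 and append the scale word for any non-zero scale index, looked up from the dictionary's scale entries.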

---

## Oracle test coverage

| Test                             | Text                         | Expected IPA                          |
|----------------------------------|------------------------------|---------------------------------------|
| `en_hello`                       | "hello"                      | hɛlˈəʊ                                |
| `en_hello_world`                 | "hello world"                | hɛlˈəʊ wˈɜːld                        |
| `en_silent_e`                    | "cake"                       | kˈeɪk                                 |
| `en_gh_digraph`                  | "night"                      | nˈaɪt                                 |
| `en_silent_consonants`           | "pneumonia"                  | njuːmˈəʊniə                           |
| `en_suffixes`                    | "walked", "happily", …       | wˈɔːkt, hˈapɪli, …                    |
| `en_numbers_cardinal`            | "0" … "1000000"              | zˈiəɹəʊ … wˈɒn mˈɪliən               |
| `en_numbers_with_decimal`        | "3.14", "0.5"                | θɹˈiː pɔɪnt wˈɒn fˈɔː, …             |
| `en_sentence_period`             | "Hello. Goodbye."            | hɛlˈəʊ ɡʊdbˈaɪ                       |
| `en_comma`                       | "yes, no, maybe"             | jˈɛs nˈəʊ mˈeɪbi                     |
| `de_guten_tag`                   | "guten Tag"                  | ɡˈuːtən tˈaːk                        |
| `de_umlauts`                     | "über", "schön", "müde"      | ˈyːbɜ, ʃˈøːn, mˈyːdə                 |
| `de_ch_digraph`                  | "Bach", "ich"                | bˈax, ˈɪç                             |
| `es_hola`                        | "hola"                       | ˈola                                  |
| `es_ll_digraph`                  | "llamar"                     | ʎamˈaɾ                                |
| `fr_bonjour`                     | "bonjour"                    | bɔ̃ʒˈuːɹ                              |
| `fr_nasal_vowels`                | "bon"                        | bˈɔ̃                                   |
| `fr_liaison`                     | "les amis"                   | le-z amˈi                             |

---

## Testing approach

Tests are written before the implementation (TDD).

```
tests/
  encoding_integration.rs   22 golden-value tests for all encodings
  oracle_comparison.rs      27 tests comparing Rust ↔ C oracle output
  common/mod.rs             shared helpers (espeak_available, try_espeak_ipa, …)
```

Oracle tests use an `assert_matches_oracle!` macro with three outcomes:

| Condition                       | Result                                   |
|---------------------------------|------------------------------------------|
| `espeak-ng` not on PATH         | Skip with `[SKIP]` notice                |
| Rust returns `NotImplemented`   | Print C oracle value as a target, pass   |
| Rust returns a real string      | Must exactly match C oracle output       |

This means all comparison tests can be written now, run in any environment,
and automatically start enforcing correctness as each module is implemented.
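
The three outcomes can be modelled as a small decision function. This is a sketch of the logic only, not the macro itself: the `RustResult` and `Outcome` types and the `judge` function are hypothetical names introduced for this example, and the mismatch case is shown as a value where the real macro would fail the test.

```rust
/// What the Rust implementation produced (hypothetical stand-in type).
#[derive(Debug, PartialEq)]
enum RustResult {
    NotImplemented,
    Output(String),
}

/// Result of comparing against the C oracle (hypothetical type).
#[derive(Debug, PartialEq)]
enum Outcome {
    /// `espeak-ng` was not available, so nothing was compared.
    Skipped,
    /// Rust has no answer yet; the oracle value is recorded as a target.
    TargetRecorded(String),
    /// Rust output is exactly equal to the oracle output.
    Matched,
    /// Rust output differs; a real test would fail here.
    Mismatch { expected: String, actual: String },
}

/// `oracle` is `None` when the C binary is not on PATH.
fn judge(oracle: Option<&str>, rust: RustResult) -> Outcome {
    match (oracle, rust) {
        (None, _) => Outcome::Skipped,
        (Some(expected), RustResult::NotImplemented) => {
            Outcome::TargetRecorded(expected.to_string())
        }
        (Some(expected), RustResult::Output(actual)) => {
            if actual.as_str() == expected {
                Outcome::Matched
            } else {
                Outcome::Mismatch { expected: expected.to_string(), actual }
            }
        }
    }
}
```

Only the `Mismatch` case should fail a test; `Skipped` and `TargetRecorded` both count as passes, which is what lets the suite run on machines without the C binary.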

---

## Data directory

The crate reads compiled eSpeak NG data files at runtime. It searches the following locations in order and uses the first one found:

1. `ESPEAK_DATA_PATH` environment variable
2. `espeak-ng-data/` next to the running executable
3. `espeak-ng-data/` in the current working directory
4. `/usr/share/espeak-ng-data` (system installation)
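
The search order above can be sketched with the standard library alone. The function names `candidate_data_dirs` and `resolve_data_dir` are hypothetical and not part of this crate's API; the sketch also assumes that "found" simply means the path exists as a directory.

```rust
use std::env;
use std::path::PathBuf;

/// Build the candidate list in the documented order. Inputs are passed in
/// explicitly so the ordering can be tested without touching the real
/// environment. Hypothetical helper for illustration.
fn candidate_data_dirs(
    env_path: Option<PathBuf>,
    exe_dir: Option<PathBuf>,
    cwd: Option<PathBuf>,
) -> Vec<PathBuf> {
    let mut dirs = Vec::new();
    // 1. ESPEAK_DATA_PATH, if set
    dirs.extend(env_path);
    // 2. espeak-ng-data/ next to the running executable
    dirs.extend(exe_dir.map(|d| d.join("espeak-ng-data")));
    // 3. espeak-ng-data/ in the current working directory
    dirs.extend(cwd.map(|d| d.join("espeak-ng-data")));
    // 4. system installation
    dirs.push(PathBuf::from("/usr/share/espeak-ng-data"));
    dirs
}

/// Return the first candidate that exists as a directory, if any.
fn resolve_data_dir() -> Option<PathBuf> {
    let exe_dir = env::current_exe()
        .ok()
        .and_then(|p| p.parent().map(|d| d.to_path_buf()));
    candidate_data_dirs(
        env::var_os("ESPEAK_DATA_PATH").map(PathBuf::from),
        exe_dir,
        env::current_dir().ok(),
    )
    .into_iter()
    .find(|p| p.is_dir())
}
```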

A complete copy of the compiled data directory (from eSpeak NG 1.52.0 + additional
language files from 1.52.0.1) is bundled at `espeak-ng-data/` in this repository.
This makes the crate fully self-contained without requiring a system eSpeak NG
installation.

The bundle contains:
- **114 compiled dictionaries** (`*_dict` files), one per language
- **145 language definition files** (`lang/`) — includes ps, rup, crh, mn not in 1.52.0
- **200 voice definition files** (`voices/`) — includes asia/ps, ps voices
- **Binary phoneme data** (`phondata`, `phonindex`, `phontab`, `intonations`)

For selective embedding, the repository also contains per-language dictionary
crates under `data-crates/espeak-ng-data-dict-<lang>` in addition to the
aggregate `espeak-ng-data-dicts` crate.

```bash
# Use bundled data explicitly
ESPEAK_DATA_PATH=/path/to/espeak-ng-rs/espeak-ng-data cargo test
```

---

## Features

| Feature           | What it does                                                                 |
|-------------------|------------------------------------------------------------------------------|
| `c-oracle`        | Links `libespeak-ng` via FFI; enables the `oracle` module for comparison tests and benchmarks. Requires `libespeak-ng` to be installed (`pkg-config: espeak-ng`). |
| `bundled-data`    | Embeds the full eSpeak NG dataset via the aggregate data crates and enables `install_bundled_data()`. |
| `bundled-data-<lang>` | Embeds phoneme data plus a single language dictionary crate and enables selective installers such as `install_bundled_language()`. |
| `bundled-espeak`  | Downloads eSpeak NG 1.52.0 from GitHub, builds it with CMake, and bakes the binary/data paths into the benchmarks. Requires `cmake`, a C compiler, `curl`/`wget`, `tar`. |

```bash
# FFI oracle
cargo test --features c-oracle

# Full embedded data
cargo test --features bundled-data

# Selective embedded data
cargo test --features bundled-data-en,bundled-data-uk

# Selective bundled-data demo
cargo run --example bundled_data_selective_demo --features bundled-data-en,bundled-data-uk

# Bundled build (no system install needed)
cargo bench --features bundled-espeak
cargo bench --features bundled-espeak,c-oracle   # both
```

Selective bundled-data helpers exposed by the main crate:
- `espeak_ng::bundled_languages()`
- `espeak_ng::has_bundled_language("uk")`
- `espeak_ng::install_bundled_language(&data_dir, "uk")`
- `espeak_ng::install_bundled_languages(&data_dir, &["en", "uk"])`

---

## Publishing checklist

Before anything is published, make sure tests are valid and passing.

```bash
# 1) Baseline test suite
cargo test

# 2) Oracle + bundled-espeak path
cargo test --features "c-oracle,bundled-espeak"

# 3) Optional selective bundled-data checks
cargo test --test bundled_data_selective --features bundled-data-en,bundled-data-de

# 4) Preview publish order/commands
python3 scripts/publish_all_crates.py

# 5) Dry-run publish checks (local changes allowed)
python3 scripts/publish_all_crates.py --execute --dry-run --allow-dirty

# 6) Actual publish (when ready)
python3 scripts/publish_all_crates.py --execute
```

`scripts/publish_all_crates.py --execute` runs these preflight checks
before any crate is published and aborts on the first failure.

The same gates are enforced in CI on push/PR by
`.github/workflows/ci.yml`.

Use [PUBLISHING.md](PUBLISHING.md) for full publication details.

---

## Benchmarks

![Benchmark chart](benches/results/benchmark_chart.svg)

| Metric | Rust | C subprocess | Speedup |
|--------|------|-------------|---------|
| First-phoneme latency | **~606 ns** | ~5.5 ms | **~9 000×** |
| Synthesizer throughput | **380× real-time** | — | — |
| Resonator DSP (per sample) | **3.2 ns** | — | — |
| Encoding name lookup | **3.0 ns** | — | — |

The Rust speedup over the C subprocess comes entirely from eliminating process-spawn and
shared-library initialisation overhead: the in-process dictionary lookup and rule engine
return the first phoneme in under a microsecond.
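
The headline figures in the table follow from simple ratios. The sketch below reproduces that arithmetic from the rounded values quoted above; `speedup` and `wall_time_for` are hypothetical helper names, and the inputs are the approximate table values rather than raw Criterion measurements.

```rust
use std::time::Duration;

/// How many times faster `candidate` is than `baseline`.
fn speedup(baseline: Duration, candidate: Duration) -> f64 {
    baseline.as_secs_f64() / candidate.as_secs_f64()
}

/// Wall-clock time needed to synthesize `audio` at a given real-time factor
/// (a factor of 380 means 380 seconds of audio per second of compute).
fn wall_time_for(audio: Duration, real_time_factor: f64) -> Duration {
    audio.div_f64(real_time_factor)
}
```

With a 5.5 ms baseline and a 606 ns candidate, `speedup` comes out at roughly 9 000, and at 380× real-time one second of audio takes about 2.6 ms to synthesize.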

See [BENCHMARK.md](BENCHMARK.md) for the full Criterion HTML report.

```bash
./benches/bench.sh               # run + snapshot + generate BENCHMARK.md
./benches/bench.sh --no-run      # regenerate BENCHMARK.md from last run
./benches/bench.sh --filter resonator   # one group only
```

Benchmark groups:

| Group                     | What is measured                                           |
|---------------------------|------------------------------------------------------------|
| `encoding/utf8_decode`    | UTF-8 decode throughput across scripts and input sizes     |
| `encoding/name_lookup`    | `Encoding::from_name()` lookup latency                     |
| `synthesize/resonator`    | Single resonator DSP filter tick (`Resonator::tick()`)     |
| `text_to_ipa/rust`        | Full Rust pipeline: text → IPA                             |
| `text_to_ipa/c_cli`       | C subprocess baseline (process spawn included)             |
| `latency/first_phoneme`   | First-phoneme latency: Rust vs C subprocess                |
| `text_to_ipa/ffi_vs_rust` | Rust vs C FFI baseline (`--features c-oracle`)             |

---

## Project layout

```
espeak-ng-rs/
├── src/
│   ├── lib.rs              public API + module declarations
│   ├── error.rs            EspeakError enum, Result alias
│   ├── encoding/
│   │   ├── mod.rs          Encoding enum, TextDecoder, utf8_decode_one/encode_one
│   │   └── codepages.rs    ISO-8859-*, KOI8-R, ISCII lookup tables
│   ├── phoneme/
│   │   ├── mod.rs          PhonemeType, PhonemeFlags, PhonemeTable
│   │   ├── load.rs         Binary phoneme table loader
│   │   ├── table.rs        Table selection, mnemonic access
│   │   └── feature.rs      Phoneme feature extraction
│   ├── dictionary/
│   │   ├── mod.rs          Constants, flag definitions
│   │   ├── file.rs         Dictionary binary parser, group index
│   │   ├── lookup.rs       Hash-based word lookup
│   │   ├── rules.rs        MatchRule + TranslateRules engine
│   │   ├── stress.rs       SetWordStress, GetVowelStress
│   │   ├── phonemes.rs     Phoneme encoding helpers
│   │   └── transpose.rs    TransposeAlphabet decompression
│   ├── translate/
│   │   ├── mod.rs          Translator, text_to_ipa, word_to_phonemes,
│   │   │                   tokeniser, number-to-phonemes, IPA renderer
│   │   └── ipa_table.rs    Kirschenbaum → IPA lookup, mnemonic overrides
│   ├── synthesize/
│   │   ├── mod.rs          Synthesizer API, synthesize_codes() high-quality path
│   │   ├── engine.rs       IPA → cascade formant synthesizer (generic path)
│   │   ├── targets.rs      IPA → FormantTarget table (60 phonemes)
│   │   ├── phondata.rs     Binary SPECT_SEQ / frame_t parser from phondata
│   │   ├── bytecode.rs     Phoneme bytecode scanner (finds i_FMT address)
│   │   ├── wavegen.rs      Harmonic synthesizer (PeaksToHarmspect + wavegen loop)
│   │   └── sintab_data.rs  2048-entry sine lookup table (from sintab.h)
│   └── oracle/mod.rs       FFI to libespeak-ng  (feature = c-oracle)
├── tests/
│   ├── common/mod.rs
│   ├── encoding_integration.rs
│   ├── dictionary_integration.rs
│   └── oracle_comparison.rs
├── benches/
│   ├── vs_c.rs             Criterion benchmark suite
│   ├── bench.sh            Run benchmarks + generate BENCHMARK.md
│   └── results/            Criterion JSON + SVG snapshots (committed)
├── build.rs                pkg-config link (c-oracle) + CMake build (bundled-espeak)
├── Cargo.toml
├── BENCHMARK.md
└── README.md
```

---

## Known limitations

- **Number translation** uses typed per-language grammar plus grouped scale composition through billions, but it still does not cover every `numbers.c` feature and format.
- **Ordinal numbers** are supported via `_#<suffix>` dict entries (English "1st", Spanish "1º", …), language ordinal indicators (Dutch "1e"), and ordinal-dot languages (German "3.").
- **Prefix stripping** is not yet implemented (very rare in English).
- **`phonSWITCH`** (mid-word language switching) is not yet handled.

---

## Licence

GPL-3.0-or-later — same as eSpeak NG.

---

## Authors

- [Eugene Hauptmann](https://github.com/eugenehp)

---

## Copyright

Copyright © 2026 Eugene Hauptmann

This project is a from-scratch pure-Rust reimplementation and does not copy
C source from eSpeak NG, but it is licensed under the same terms:
[GPL-3.0-or-later](https://www.gnu.org/licenses/gpl-3.0.html).

Source: <https://github.com/eugenehp/espeak-ng-rs>