neutts-rs

Rust port of NeuTTS — on-device voice-cloning TTS built on a GGUF LLM backbone and the NeuCodec neural audio codec.

Pure Rust — no ONNX Runtime, no native ML dependencies.
The codec runs as a self-contained CPU/GPU inference engine (safetensors + ndarray + rustfft, with optional burn/wgpu GPU path).

Pure-Rust phonemisation — the espeak feature uses the espeak-ng crate with bundled data for all 114 languages. No system library installation required.

Quick start

1. Convert codec weights (one-time, ~2 min)

pip install torch huggingface_hub safetensors
python scripts/convert_weights.py

Downloads neuphonic/neucodec, extracts the decoder weights, and saves them as models/neucodec_decoder.safetensors.

2. Build

cargo build --features espeak

No system dependencies needed — espeak-ng is pure Rust with bundled data.

3. Clone a voice and synthesise

The simplest path — point at any WAV file and say what you want:

cargo run --example speak --features espeak -- \
  --wav      my_voice.wav \
  --ref-text "Exactly what I said in the recording." \
  --text     "Hello, this is my cloned voice."

On the first run the reference WAV is encoded via the Python neucodec package and cached as my_voice.npy beside the WAV. Every subsequent run loads the cache and skips encoding entirely.

One-time Python install for encoding:

pip install neucodec huggingface_hub torchaudio

Examples

Example	What it does
`speak`	Recommended. WAV in → WAV out. Encodes on first run, caches `.npy`. Supports `--list-models`, `--list-files`, `--gguf-file`.
`stream_pcm`	Streaming. Preloads models once, streams raw PCM to stdout as audio is synthesised in chunks.
`basic`	Synthesise from a pre-encoded `.npy` reference
`clone_voice`	Full voice cloning — `.npy` or raw WAV + SHA-256 cache
`encode_reference`	Stub — returns a helpful error; use Python for now
`download_models`	Download / stage weights
`test_pipeline`	Smoke-test every component without model files

speak

# Minimal — encodes reference on first run, cached after that
cargo run --example speak --features espeak -- \
  --wav      my_voice.wav \
  --ref-text "What I said in the recording." \
  --text     "Hello, this is my cloned voice."

# Use a bundled sample voice (pre-encoded, no Python needed)
cargo run --example speak --features espeak -- \
  --wav      samples/jo.wav \
  --ref-text samples/jo.txt \
  --text     "Hello from Jo."

# Skip directly to a pre-encoded .npy
cargo run --example speak --features espeak -- \
  --codes    samples/dave.npy \
  --ref-text samples/dave.txt \
  --text     "Hello from Dave."

# List all known backbone models
cargo run --example speak --features espeak -- --list-models

# List GGUF files available in a specific repo
cargo run --example speak --features espeak -- \
  --backbone neuphonic/neutts-nano-q4-gguf --list-files

# Pick a specific GGUF quantisation
cargo run --example speak --features espeak -- \
  --wav       my_voice.wav \
  --ref-text  "What I said." \
  --text      "Hello." \
  --backbone  neuphonic/neutts-nano-q4-gguf \
  --gguf-file neutts-nano-Q4_K_M.gguf

# Different language backbone
cargo run --example speak --features espeak -- \
  --backbone neuphonic/neutts-nano-german-q4-gguf \
  --wav      samples/greta.wav \
  --ref-text samples/greta.txt \
  --text     "Hallo aus Rust."

# CPU-only (no wgpu)
cargo run --example speak --no-default-features --features espeak -- \
  --wav my_voice.wav --ref-text "..." --text "Hello."

speak flags:

Flag	Short	Purpose
`--wav PATH`	`-w`	WAV file of the voice to clone
`--codes PATH`	`-c`	Pre-encoded `.npy` (skips encoding)
`--ref-text TEXT\|PATH`	`-r`	Transcript of the reference WAV (file or literal string). Auto-detected from `<stem>.txt` if omitted.
`--text TEXT`	`-t`	Text to synthesise
`--out PATH`	`-o`	Output WAV (default: `output.wav`)
`--backbone REPO`	`-b`	HuggingFace backbone repo (see `--list-models`)
`--gguf-file FILE`	`-g`	Specific `.gguf` filename within the repo
`--list-files`		Print all `.gguf` files in `--backbone` and exit
`--list-models`		Print table of all known backbone repos and exit

stream_pcm

Preloads the backbone and codec once at startup, then drives the backbone in streaming mode: speech tokens are forwarded to the codec in chunks as they are generated, and raw signed 16-bit little-endian PCM is written to stdout as each chunk is ready. Timing diagnostics (time-to-first-audio, RTF) go to stderr so they do not corrupt the byte stream.

# Linux — aplay
cargo run --example stream_pcm --features espeak -- \
  --codes samples/dave.npy --ref-text samples/dave.txt \
  --text "Hello, streaming audio." | \
  aplay -f S16_LE -r 24000 -c 1

# macOS — sox
cargo run --example stream_pcm --features espeak -- \
  --codes samples/dave.npy --ref-text samples/dave.txt \
  --text "Hello, streaming audio." | \
  sox -t raw -r 24000 -e signed -b 16 -c 1 - -d

# Cross-platform — ffplay
cargo run --example stream_pcm --features espeak -- \
  --codes samples/dave.npy --ref-text samples/dave.txt \
  --text "Hello, streaming audio." | \
  ffplay -f s16le -ar 24000 -ac 1 -nodisp -

# Save to file, then convert
cargo run --example stream_pcm --features espeak -- \
  --codes samples/dave.npy --ref-text samples/dave.txt \
  --text "Hello." > output.pcm
sox -t raw -r 24000 -e signed -b 16 -c 1 output.pcm output.wav

stream_pcm flags:

Flag	Short	Default	Purpose
`--codes PATH`	`-c`	(required)	Pre-encoded `.npy` reference codes
`--text TEXT`	`-t`	(required)	Text to synthesise
`--ref-text TEXT\|PATH`	`-r`	auto	Transcript of the reference recording
`--backbone REPO`	`-b`	nano-q4	HuggingFace backbone repo
`--gguf-file FILE`	`-g`	auto	Specific `.gguf` filename
`--chunk N`	`-k`	`25`	Tokens per decode chunk (25 ≈ 500 ms)

Chunk size trade-off:

`--chunk`	Audio buffered	Latency (TTFA)	Quality at boundaries
10	~200 ms	lowest	mild artefacts possible
25	~500 ms	balanced	good (default)
50	~1 s	higher	best

Each chunk is decoded independently by the NeuCodec transformer, so very small chunks lose cross-chunk attention context. Values ≥ 25 are recommended.

Available models

Run --list-models to see the full table at any time:

cargo run --example speak -- --list-models

Repo	Name	Language	Params	GGUF
`neuphonic/neutts-nano-q4-gguf`	NeuTTS Nano Q4	en-us	0.2B	✅
`neuphonic/neutts-nano-q8-gguf`	NeuTTS Nano Q8	en-us	0.2B	✅
`neuphonic/neutts-nano`	NeuTTS Nano (full)	en-us	0.2B
`neuphonic/neutts-air-q4-gguf`	NeuTTS Air Q4	en-us	0.7B	✅
`neuphonic/neutts-air-q8-gguf`	NeuTTS Air Q8	en-us	0.7B	✅
`neuphonic/neutts-air`	NeuTTS Air (full)	en-us	0.7B
`neuphonic/neutts-nano-german-q4-gguf`	NeuTTS Nano German Q4	de	0.2B	✅
`neuphonic/neutts-nano-german-q8-gguf`	NeuTTS Nano German Q8	de	0.2B	✅
`neuphonic/neutts-nano-german`	NeuTTS Nano German (full)	de	0.2B
`neuphonic/neutts-nano-french-q4-gguf`	NeuTTS Nano French Q4	fr-fr	0.2B	✅
`neuphonic/neutts-nano-french-q8-gguf`	NeuTTS Nano French Q8	fr-fr	0.2B	✅
`neuphonic/neutts-nano-french`	NeuTTS Nano French (full)	fr-fr	0.2B
`neuphonic/neutts-nano-spanish-q4-gguf`	NeuTTS Nano Spanish Q4	es	0.2B	✅
`neuphonic/neutts-nano-spanish-q8-gguf`	NeuTTS Nano Spanish Q8	es	0.2B	✅
`neuphonic/neutts-nano-spanish`	NeuTTS Nano Spanish (full)	es	0.2B

To discover which specific GGUF quantisation variants are in a repo:

cargo run --example speak -- \
  --backbone neuphonic/neutts-nano-q4-gguf --list-files

Then pick one with --gguf-file:

cargo run --example speak --features espeak -- \
  --backbone  neuphonic/neutts-nano-q4-gguf \
  --gguf-file neutts-nano-Q4_K_M.gguf \
  --wav my_voice.wav --ref-text "..." --text "Hello."

Bundled Languages (114)

The espeak feature bundles phoneme data for all 114 espeak-ng languages:

af, am, an, ar, as, az, ba, be, bg, bn, bpy, bs, ca, chr, cmn, cs, cv, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, ga, gd, gn, grc, gu, hak, haw, he, hi, hr, ht, hu, hy, ia, id, io, is, it, ja, jbo, ka, kk, kl, kn, ko, kok, ku, ky, la, lb, lfn, lt, lv, mi, mk, ml, mr, ms, mt, mto, my, nci, ne, nl, no, nog, om, or, pa, pap, piqd, pl, pt, py, qdb, qu, quc, qya, ro, ru, sd, shn, si, sjn, sk, sl, smj, sq, sr, sv, sw, ta, te, th, ti, tk, tn, tr, tt, ug, uk, ur, uz, vi, yue

Note: 4 languages (bs, io, lfn, pap) have missing phoneme tables in espeak-ng 0.1.0. 17 languages with non-Latin scripts may return empty IPA for some inputs (upstream limitation).

Architecture

text ──► espeak-ng (pure Rust) ──► IPA ──┐
                                          ├──► prompt builder ──► GGUF backbone ──► speech tokens
ref_codes (.npy) ─────────────────────────┘                          (llama-cpp-4)        │
                                                                                           ▼
                                                                               NeuCodec decoder
                                                                          (Burn wgpu GPU  ──or──
                                                                           Burn NdArray CPU ──or──
                                                                           raw ndarray CPU)
                                                                                           │
                                                                                           ▼
                                                                               audio (Vec<f32>, 24 kHz)

GGUF backbone

Small causal LM in GGUF format, run via llama-cpp-4. Takes a phonemized text prompt and pre-encoded reference speaker codes, generates <|speech_N|> tokens one at a time. The generate_streaming API forwards each token to a callback immediately, enabling low-latency audio delivery.

NeuCodec decoder (pure Rust)

XCodec2-based architecture loaded at runtime from models/neucodec_decoder.safetensors. With the wgpu feature the full forward pass runs on the GPU (Metal / Vulkan / DX12); the final ISTFT always runs on CPU.

codes [T]
   └─► FSQ decode  (integer → 8 scaled digits → project_out Linear 8→2048)
         │
    fc_post_a  (Linear 2048→1024)
         │
   VocosBackbone
    ├─ Conv1d(k=7)
    ├─ 2 × ResnetBlock  (GroupNorm → SiLU → Conv1d)
    ├─ 12 × TransformerBlock  (RMSNorm → MHA + RoPE → SiLU MLP)
    │        └─ RoPE tables pre-computed at load time (see `fast`/`precise` features)
    └─ 2 × ResnetBlock + LayerNorm
         │
   ISTFTHead
    ├─ Linear(1024 → n_fft+2)
    └─ ISTFT (same padding, Hann window, always CPU)
         │
   audio [T × hop_length]  (24 kHz)

Property	Value
Output sample rate	24 000 Hz
Tokens / second	50
Samples / token	480 (hop_length)
FSQ codebook size	4⁸ = 65 536 codes
Encoder input	16 000 Hz mono WAV

Bundled reference voices

Five pre-encoded voices are included and work without any Python encoding step:

Files	Voice	Language
`samples/jo.*`	Jo	English
`samples/dave.*`	Dave	English
`samples/juliette.*`	Juliette	French
`samples/greta.*`	Greta	German
`samples/mateo.*`	Mateo	Spanish

Each has a .wav (original audio), .npy (pre-encoded tokens), and .txt (transcript).

Feature flags

Feature	Default	Description
`backbone`	✓	GGUF backbone via `llama-cpp-4` (requires cmake + C++)
`espeak`		Raw-text input via pure-Rust `espeak-ng` (114 bundled languages, no system deps)
`wgpu`		GPU-accelerated codec via Burn wgpu (Metal/Vulkan/DX12); auto-falls back to Burn NdArray CPU, then raw ndarray
`metal`		macOS Metal GPU for the backbone
`cuda`		NVIDIA CUDA for the backbone
`vulkan`		Vulkan GPU for the backbone (Linux/Windows, requires `libvulkan1`)
`fast`	✓	RoPE: degree-7/6 Horner polynomial (~1 × 10⁻⁴ error, no transcendental calls)
`precise`		RoPE: stdlib `f32::sin_cos()`, correctly rounded; mutually exclusive with `fast`

fast vs precise: Both affect how sin/cos values are computed when building the Rotary Positional Embedding tables in the NeuCodec transformer. The polynomial path (fast) avoids transcendental function calls — 6 FMAs per value — and is measurably faster at load time on platforms where hardware sin/cos is slow. The accuracy difference is imperceptible in speech synthesis. Pass --features precise to opt into full IEEE 754 accuracy:

cargo run --example speak --features "espeak,precise" -- ...

Setting both fast and precise simultaneously is a compile-time error.

Without backbone — codec-only mode; use NeuCodecDecoder::decode() directly.

Without espeak — pass pre-phonemized IPA via tts.infer_from_ipa().

Performance

Measured on a MacBook Pro M2 with the wgpu feature (Metal GPU) and neutts-nano-Q4_0.gguf (0.2B parameters, 372 reference tokens, ~125 output tokens):

Version	Synth time	RTF	Notes
0.0.1	4.45 s	1.79×	GPU init (1.72 s) counted against synthesis
0.0.2	~2.7 s	~1.1×	GPU init moved to load time; RoPE uploads eliminated

Build requirements

Platform	Backbone	Codec	Phonemizer (`espeak` feature)
All	cmake + C++ (auto)	pure Rust	None — pure Rust with bundled data

The espeak feature requires zero system dependencies. The backbone feature requires cmake + a C++ compiler for llama.cpp.

Cross-platform builds

Since phonemisation is now pure Rust, cross-compilation is straightforward:

# iOS
rustup target add aarch64-apple-ios
cargo build --target aarch64-apple-ios --features espeak

# Android
rustup target add aarch64-linux-android
cargo build --target aarch64-linux-android --features espeak

# Linux aarch64
rustup target add aarch64-unknown-linux-gnu
cargo build --target aarch64-unknown-linux-gnu --features espeak

# Windows (from any host)
rustup target add x86_64-pc-windows-gnu
cargo build --target x86_64-pc-windows-gnu --features espeak

No ESPEAK_LIB_DIR, no sysroot, no cross-compiled C library needed.

Using the library

Full pipeline

use neutts::{NeuTTS, download};
use std::path::Path;

// Download backbone from HuggingFace (cached after first run).
// Pass None to auto-select the first GGUF in the repo,
// or Some("filename.gguf") to pick a specific quantisation.
let tts = download::load_from_hub_cb(
    "neuphonic/neutts-nano-q4-gguf",
    None,       // or Some("neutts-nano-Q4_K_M.gguf")
    |_| {},
).unwrap();

// Load pre-encoded reference codes
let ref_codes = tts.load_ref_codes(Path::new("samples/jo.npy")).unwrap();

// Synthesise — returns Vec<f32> at 24 kHz mono
let audio = tts.infer(
    "Hello from Rust!",
    &ref_codes,
    "Transcript of the reference recording.",
).unwrap();

// Save to WAV
tts.write_wav(&audio, Path::new("output.wav")).unwrap();

Streaming synthesis

The backbone exposes a token-by-token streaming API that lets you start decoding audio before the model has finished generating:

use neutts::{tokens, download};
use std::io::Write as _;

let tts = download::load_from_hub_cb("neuphonic/neutts-nano-q4-gguf", None, |_| {}).unwrap();
let ref_codes = tts.load_ref_codes("samples/dave.npy".as_ref()).unwrap();
let ref_ipa   = neutts::phonemize::phonemize("Reference transcript.", "en-us").unwrap();
let input_ipa = neutts::phonemize::phonemize("Hello, streaming!", "en-us").unwrap();
let prompt    = tokens::build_prompt(&ref_ipa, &input_ipa, &ref_codes);

let mut pending   = Vec::<i32>::new();
let     codec     = &tts.codec;
let     stdout    = std::io::stdout();
let mut out       = std::io::BufWriter::new(stdout.lock());

tts.backbone.generate_streaming(&prompt, 2048, |piece| {
    pending.extend(tokens::extract_ids(piece));

    if pending.len() >= 25 {          // 25 tokens ≈ 500 ms of audio
        let audio = codec.decode(&pending)?;
        for s in &audio {
            let s16 = (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16;
            out.write_all(&s16.to_le_bytes())?;
        }
        out.flush()?;
        pending.clear();
    }
    Ok(())
})?;

See examples/stream_pcm.rs for the full self-contained example with timing diagnostics and player instructions.

Discover models programmatically

use neutts::download::{BACKBONE_MODELS, list_gguf_files, find_model};

// Iterate the registry
for m in BACKBONE_MODELS {
    println!("{} ({}) — GGUF: {}", m.repo, m.language, m.is_gguf);
}

// Find a specific repo
if let Some(info) = find_model("neuphonic/neutts-nano-q4-gguf") {
    println!("language: {}", info.language); // "en-us"
}

// List GGUF files available in a repo (network call)
let files = list_gguf_files("neuphonic/neutts-nano-q4-gguf").unwrap();
for f in &files { println!("{f}"); }

IPA passthrough (without espeak)

let audio = tts.infer_from_ipa(
    "hɛloʊ fɹʌm ɹʌst",    // input IPA
    &ref_codes,
    "wɪ ɑːɹ tɛstɪŋ ðɪs",  // reference IPA
).unwrap();

Decoder only

use neutts::NeuCodecDecoder;

let dec = NeuCodecDecoder::new().unwrap();
println!("backend: {}", dec.backend_name()); // e.g. "burn/wgpu (GPU)"
println!("{} samples/token", dec.hop_length());

let codes: Vec<i32> = vec![/* speech token IDs */];
let audio: Vec<f32> = dec.decode(&codes).unwrap();

Reference-code cache

use neutts::RefCodeCache;
use std::path::Path;

let cache = RefCodeCache::new()?;
if let Some((codes, outcome)) = cache.try_load(Path::new("reference.wav"))? {
    println!("{outcome}"); // "Cache hit (SHA-256: …)"
}

Mobile / C FFI

A practical mobile architecture runs the backbone server-side and only the NeuCodec decoder on-device:

NeuTtsHandle *codec = neutts_model_load("/path/to/neucodec_decoder.safetensors");

float *audio = neutts_decode_tokens(codec, codes, num_codes, &n_samples);
neutts_write_wav(audio, n_samples, "/path/to/output.wav");
neutts_free_audio(audio, n_samples);
neutts_model_free(codec);

See include/neutts.h for the full C header.

Pipeline stages

Text preprocessing — numbers, currencies, abbreviations → spoken words
Phonemisation — pure-Rust espeak-ng converts text to IPA phonemes
Prompt construction — reference codes + IPA → GGUF prompt
Backbone inference — GGUF LLM generates <|speech_N|> tokens
Token extraction — regex extracts integer IDs from generated text
Codec decode — NeuCodec decoder converts IDs to 24 kHz audio

Migration from C `libespeak-ng`

This crate previously used C FFI bindings to libespeak-ng with a 749-line build.rs for native library detection and cross-compilation. It now uses the pure-Rust espeak-ng crate instead:

No system library required — brew install espeak-ng / apt install libespeak-ng-dev no longer needed
No C compiler needed for phonemisation — no cmake, no gcc, no build scripts
No unsafe code in phonemisation — the entire FFI layer was removed
build.rs reduced from 749 lines to 48 — only RoPE checks and weight detection remain
Cross-compilation just works — no ESPEAK_LIB_DIR, no ESPEAK_SYSROOT, no NDK toolchain setup
114 languages bundled — all espeak-ng languages ship as embedded data

Status

Component	Status
GGUF backbone inference	✅
NeuCodec decoder (pure Rust, safetensors)	✅
NeuCodec encoder (pure Rust)	⏳ `speak` example falls back to Python `neucodec`
GPU-accelerated codec (`wgpu` feature)	✅ Metal / Vulkan / DX12 via Burn
Streaming backbone API (`generate_streaming`)	✅
Streaming PCM output (`stream_pcm` example)	✅
Pure-Rust phonemisation (114 languages)	✅
English backbones (Nano / Air, Q4 / Q8)	✅
German / French / Spanish backbones	✅
Cross-platform builds	✅ No native espeak deps
Test suite (unit + integration + e2e)	✅ 114 tests
iOS / Android build	✅ codec + phonemiser are pure Rust; backbone needs cross-compile

Citation

If you use this software in your research or project, please cite it as:

@software{hauptmann2026neuttsrs,
  author       = {Hauptmann, Eugene},
  title        = {{neutts}: Rust port of {NeuTTS} — on-device voice-cloning {TTS}
                  with {GGUF} backbone and {NeuCodec} decoder},
  year         = {2026},
  version      = {0.1.0},
  license      = {MIT},
  url          = {https://github.com/eugenehp/neutts-rs}
}

If you also use the underlying NeuTTS model or NeuCodec, please cite those works directly via their respective HuggingFace repositories at huggingface.co/neuphonic.

License

MIT

neutts 0.1.1