neutts 0.0.2

Rust port of NeuTTS — on-device voice-cloning TTS with GGUF backbone and NeuCodec decoder
Documentation

neutts-rs

Rust port of NeuTTS — on-device voice-cloning TTS built on a GGUF LLM backbone and the NeuCodec neural audio codec.

Pure Rust — no ONNX Runtime, no native ML dependencies.
The codec runs as a self-contained CPU/GPU inference engine (safetensors + ndarray + rustfft, with optional burn/wgpu GPU path).


Quick start

1. Install system dependencies

# macOS
brew install espeak-ng

# Ubuntu / Debian
apt install espeak-ng

# Alpine
apk add espeak-ng

2. Convert codec weights (one-time, ~2 min)

pip install torch huggingface_hub safetensors
python scripts/convert_weights.py

Downloads neuphonic/neucodec, extracts the decoder weights, and saves them as models/neucodec_decoder.safetensors.

3. Build

cargo build --features espeak

4. Clone a voice and synthesise

The simplest path — point at any WAV file and say what you want:

cargo run --example speak --features espeak -- \
  --wav      my_voice.wav \
  --ref-text "Exactly what I said in the recording." \
  --text     "Hello, this is my cloned voice."

On the first run the reference WAV is encoded via the Python neucodec package and cached as my_voice.npy beside the WAV. Every subsequent run loads the cache and skips encoding entirely.

One-time Python install for encoding:

pip install neucodec huggingface_hub torchaudio

Examples

Example What it does
speak Recommended. WAV in → WAV out. Encodes on first run, caches .npy. Supports --list-models, --list-files, --gguf-file.
stream_pcm Streaming. Preloads models once, streams raw PCM to stdout as audio is synthesised in chunks.
basic Synthesise from a pre-encoded .npy reference
clone_voice Full voice cloning — .npy or raw WAV + SHA-256 cache
encode_reference Stub — returns a helpful error; use Python for now
download_models Download / stage weights
test_pipeline Smoke-test every component without model files

speak

# Minimal — encodes reference on first run, cached after that
cargo run --example speak --features espeak -- \
  --wav      my_voice.wav \
  --ref-text "What I said in the recording." \
  --text     "Hello, this is my cloned voice."

# Use a bundled sample voice (pre-encoded, no Python needed)
cargo run --example speak --features espeak -- \
  --wav      samples/jo.wav \
  --ref-text samples/jo.txt \
  --text     "Hello from Jo."

# Skip directly to a pre-encoded .npy
cargo run --example speak --features espeak -- \
  --codes    samples/dave.npy \
  --ref-text samples/dave.txt \
  --text     "Hello from Dave."

# List all known backbone models
cargo run --example speak --features espeak -- --list-models

# List GGUF files available in a specific repo
cargo run --example speak --features espeak -- \
  --backbone neuphonic/neutts-nano-q4-gguf --list-files

# Pick a specific GGUF quantisation
cargo run --example speak --features espeak -- \
  --wav       my_voice.wav \
  --ref-text  "What I said." \
  --text      "Hello." \
  --backbone  neuphonic/neutts-nano-q4-gguf \
  --gguf-file neutts-nano-Q4_K_M.gguf

# Different language backbone
cargo run --example speak --features espeak -- \
  --backbone neuphonic/neutts-nano-german-q4-gguf \
  --wav      samples/greta.wav \
  --ref-text samples/greta.txt \
  --text     "Hallo aus Rust."

# CPU-only (no wgpu)
cargo run --example speak --no-default-features --features espeak -- \
  --wav my_voice.wav --ref-text "..." --text "Hello."

speak flags:

Flag Short Purpose
--wav PATH -w WAV file of the voice to clone
--codes PATH -c Pre-encoded .npy (skips encoding)
--ref-text TEXT|PATH -r Transcript of the reference WAV (file or literal string). Auto-detected from <stem>.txt if omitted.
--text TEXT -t Text to synthesise
--out PATH -o Output WAV (default: output.wav)
--backbone REPO -b HuggingFace backbone repo (see --list-models)
--gguf-file FILE -g Specific .gguf filename within the repo
--list-files Print all .gguf files in --backbone and exit
--list-models Print table of all known backbone repos and exit

stream_pcm

Preloads the backbone and codec once at startup, then drives the backbone in streaming mode: speech tokens are forwarded to the codec in chunks as they are generated, and raw signed 16-bit little-endian PCM is written to stdout as each chunk is ready. Timing diagnostics (time-to-first-audio, RTF) go to stderr so they do not corrupt the byte stream.

# Linux — aplay
cargo run --example stream_pcm --features espeak -- \
  --codes samples/dave.npy --ref-text samples/dave.txt \
  --text "Hello, streaming audio." | \
  aplay -f S16_LE -r 24000 -c 1

# macOS — sox
cargo run --example stream_pcm --features espeak -- \
  --codes samples/dave.npy --ref-text samples/dave.txt \
  --text "Hello, streaming audio." | \
  sox -t raw -r 24000 -e signed -b 16 -c 1 - -d

# Cross-platform — ffplay
cargo run --example stream_pcm --features espeak -- \
  --codes samples/dave.npy --ref-text samples/dave.txt \
  --text "Hello, streaming audio." | \
  ffplay -f s16le -ar 24000 -ac 1 -nodisp -

# Save to file, then convert
cargo run --example stream_pcm --features espeak -- \
  --codes samples/dave.npy --ref-text samples/dave.txt \
  --text "Hello." > output.pcm
sox -t raw -r 24000 -e signed -b 16 -c 1 output.pcm output.wav

stream_pcm flags:

Flag Short Default Purpose
--codes PATH -c (required) Pre-encoded .npy reference codes
--text TEXT -t (required) Text to synthesise
--ref-text TEXT|PATH -r auto Transcript of the reference recording
--backbone REPO -b nano-q4 HuggingFace backbone repo
--gguf-file FILE -g auto Specific .gguf filename
--chunk N -k 25 Tokens per decode chunk (25 ≈ 500 ms)

Chunk size trade-off:

--chunk Audio buffered Latency (TTFA) Quality at boundaries
10 ~200 ms lowest mild artefacts possible
25 ~500 ms balanced good (default)
50 ~1 s higher best

Each chunk is decoded independently by the NeuCodec transformer, so very small chunks lose cross-chunk attention context. Values ≥ 25 are recommended.


Available models

Run --list-models to see the full table at any time:

cargo run --example speak -- --list-models
Repo Name Language Params GGUF
neuphonic/neutts-nano-q4-gguf NeuTTS Nano Q4 en-us 0.2B
neuphonic/neutts-nano-q8-gguf NeuTTS Nano Q8 en-us 0.2B
neuphonic/neutts-nano NeuTTS Nano (full) en-us 0.2B
neuphonic/neutts-air-q4-gguf NeuTTS Air Q4 en-us 0.7B
neuphonic/neutts-air-q8-gguf NeuTTS Air Q8 en-us 0.7B
neuphonic/neutts-air NeuTTS Air (full) en-us 0.7B
neuphonic/neutts-nano-german-q4-gguf NeuTTS Nano German Q4 de 0.2B
neuphonic/neutts-nano-german-q8-gguf NeuTTS Nano German Q8 de 0.2B
neuphonic/neutts-nano-german NeuTTS Nano German (full) de 0.2B
neuphonic/neutts-nano-french-q4-gguf NeuTTS Nano French Q4 fr-fr 0.2B
neuphonic/neutts-nano-french-q8-gguf NeuTTS Nano French Q8 fr-fr 0.2B
neuphonic/neutts-nano-french NeuTTS Nano French (full) fr-fr 0.2B
neuphonic/neutts-nano-spanish-q4-gguf NeuTTS Nano Spanish Q4 es 0.2B
neuphonic/neutts-nano-spanish-q8-gguf NeuTTS Nano Spanish Q8 es 0.2B
neuphonic/neutts-nano-spanish NeuTTS Nano Spanish (full) es 0.2B

To discover which specific GGUF quantisation variants are in a repo:

cargo run --example speak -- \
  --backbone neuphonic/neutts-nano-q4-gguf --list-files

Then pick one with --gguf-file:

cargo run --example speak --features espeak -- \
  --backbone  neuphonic/neutts-nano-q4-gguf \
  --gguf-file neutts-nano-Q4_K_M.gguf \
  --wav my_voice.wav --ref-text "..." --text "Hello."

Architecture

text ──► espeak-ng ──► IPA ──┐
                              ├──► prompt builder ──► GGUF backbone ──► speech tokens
ref_codes (.npy) ─────────────┘                          (llama-cpp-2)        │
                                                                               ▼
                                                                   NeuCodec decoder
                                                              (Burn wgpu GPU  ──or──
                                                               Burn NdArray CPU ──or──
                                                               raw ndarray CPU)
                                                                               │
                                                                               ▼
                                                                   audio (Vec<f32>, 24 kHz)

GGUF backbone

Small causal LM in GGUF format, run via llama-cpp-2. Takes a phonemized text prompt and pre-encoded reference speaker codes, generates <|speech_N|> tokens one at a time. The generate_streaming API forwards each token to a callback immediately, enabling low-latency audio delivery.

NeuCodec decoder (pure Rust)

XCodec2-based architecture loaded at runtime from models/neucodec_decoder.safetensors. With the wgpu feature the full forward pass runs on the GPU (Metal / Vulkan / DX12); the final ISTFT always runs on CPU.

codes [T]
   └─► FSQ decode  (integer → 8 scaled digits → project_out Linear 8→2048)
         │
    fc_post_a  (Linear 2048→1024)
         │
   VocosBackbone
    ├─ Conv1d(k=7)
    ├─ 2 × ResnetBlock  (GroupNorm → SiLU → Conv1d)
    ├─ 12 × TransformerBlock  (RMSNorm → MHA + RoPE → SiLU MLP)
    │        └─ RoPE tables pre-computed at load time (see `fast`/`precise` features)
    └─ 2 × ResnetBlock + LayerNorm
         │
   ISTFTHead
    ├─ Linear(1024 → n_fft+2)
    └─ ISTFT (same padding, Hann window, always CPU)
         │
   audio [T × hop_length]  (24 kHz)
Property Value
Output sample rate 24 000 Hz
Tokens / second 50
Samples / token 480 (hop_length)
FSQ codebook size 4⁸ = 65 536 codes
Encoder input 16 000 Hz mono WAV

Bundled reference voices

Five pre-encoded voices are included and work without any Python encoding step:

Files Voice Language
samples/jo.* Jo English
samples/dave.* Dave English
samples/juliette.* Juliette French
samples/greta.* Greta German
samples/mateo.* Mateo Spanish

Each has a .wav (original audio), .npy (pre-encoded tokens), and .txt (transcript).


Feature flags

Feature Default Description
backbone GGUF backbone via llama-cpp-2 (requires cmake + C++)
espeak Raw-text input via libespeak-ng
wgpu GPU-accelerated codec via Burn wgpu (Metal/Vulkan/DX12); auto-falls back to Burn NdArray CPU, then raw ndarray
metal macOS Metal GPU for the backbone
cuda NVIDIA CUDA for the backbone
fast RoPE: degree-7/6 Horner polynomial (~1 × 10⁻⁴ error, no transcendental calls)
precise RoPE: stdlib f32::sin_cos(), correctly rounded; mutually exclusive with fast

fast vs precise: Both affect how sin/cos values are computed when building the Rotary Positional Embedding tables in the NeuCodec transformer. The polynomial path (fast) avoids transcendental function calls — 6 FMAs per value — and is measurably faster at load time on platforms where hardware sin/cos is slow. The accuracy difference is imperceptible in speech synthesis. Pass --features precise to opt into full IEEE 754 accuracy:

cargo run --example speak --features "espeak,precise" -- ...

Setting both fast and precise simultaneously is a compile-time error.

Without backbone — codec-only mode; use NeuCodecDecoder::decode() directly.

Without espeak — pass pre-phonemized IPA via tts.infer_from_ipa().


Performance

Measured on a MacBook Pro M2 with the wgpu feature (Metal GPU) and neutts-nano-Q4_0.gguf (0.2B parameters, 372 reference tokens, ~125 output tokens):

Version Synth time RTF Notes
0.0.1 4.45 s 1.79× GPU init (1.72 s) counted against synthesis
0.0.2 ~2.7 s ~1.1× GPU init moved to load time; RoPE uploads eliminated

What changed in 0.0.2:

  • Eager GPU initNeuCodecDecoder::from_file() now initialises the Burn wgpu backend immediately, moving the ~1.7 s GPU upload from synthesis latency into model loading time (reported in "loaded in X s").
  • Pre-computed RoPE tables — cos/sin tables for up to 2048 positions are computed once at weight-load time and stored as device tensors. This eliminates the 24 CPU→GPU uploads that previously occurred on every decode call (12 transformer blocks × Q and K projections each).

Build requirements

Platform Backbone Codec Phonemizer
Linux / macOS cmake + C++ (auto) pure Rust libespeak-ng-dev / brew install espeak-ng
iOS / Android cross-compile llama.cpp pure Rust cross-compile espeak-ng; set ESPEAK_LIB_DIR

Using the library

Full pipeline

use neutts::{NeuTTS, download};
use std::path::Path;

// Download backbone from HuggingFace (cached after first run).
// Pass None to auto-select the first GGUF in the repo,
// or Some("filename.gguf") to pick a specific quantisation.
let tts = download::load_from_hub_cb(
    "neuphonic/neutts-nano-q4-gguf",
    None,       // or Some("neutts-nano-Q4_K_M.gguf")
    |_| {},
).unwrap();

// Load pre-encoded reference codes
let ref_codes = tts.load_ref_codes(Path::new("samples/jo.npy")).unwrap();

// Synthesise — returns Vec<f32> at 24 kHz mono
let audio = tts.infer(
    "Hello from Rust!",
    &ref_codes,
    "Transcript of the reference recording.",
).unwrap();

// Save to WAV
tts.write_wav(&audio, Path::new("output.wav")).unwrap();

Streaming synthesis

The backbone exposes a token-by-token streaming API that lets you start decoding audio before the model has finished generating:

use neutts::{tokens, download};
use std::io::Write as _;

let tts = download::load_from_hub_cb("neuphonic/neutts-nano-q4-gguf", None, |_| {}).unwrap();
let ref_codes = tts.load_ref_codes("samples/dave.npy".as_ref()).unwrap();
let ref_ipa   = neutts::phonemize::phonemize("Reference transcript.", "en-us").unwrap();
let input_ipa = neutts::phonemize::phonemize("Hello, streaming!", "en-us").unwrap();
let prompt    = tokens::build_prompt(&ref_ipa, &input_ipa, &ref_codes);

let mut pending   = Vec::<i32>::new();
let     codec     = &tts.codec;
let     stdout    = std::io::stdout();
let mut out       = std::io::BufWriter::new(stdout.lock());

tts.backbone.generate_streaming(&prompt, 2048, |piece| {
    pending.extend(tokens::extract_ids(piece));

    if pending.len() >= 25 {          // 25 tokens ≈ 500 ms of audio
        let audio = codec.decode(&pending)?;
        for s in &audio {
            let s16 = (s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16;
            out.write_all(&s16.to_le_bytes())?;
        }
        out.flush()?;
        pending.clear();
    }
    Ok(())
})?;

See examples/stream_pcm.rs for the full self-contained example with timing diagnostics and player instructions.

Discover models programmatically

use neutts::download::{BACKBONE_MODELS, list_gguf_files, find_model};

// Iterate the registry
for m in BACKBONE_MODELS {
    println!("{} ({}) — GGUF: {}", m.repo, m.language, m.is_gguf);
}

// Find a specific repo
if let Some(info) = find_model("neuphonic/neutts-nano-q4-gguf") {
    println!("language: {}", info.language); // "en-us"
}

// List GGUF files available in a repo (network call)
let files = list_gguf_files("neuphonic/neutts-nano-q4-gguf").unwrap();
for f in &files { println!("{f}"); }

IPA passthrough (without espeak)

let audio = tts.infer_from_ipa(
    "hɛloʊ fɹʌm ɹʌst",    // input IPA
    &ref_codes,
    "wɪ ɑːɹ tɛstɪŋ ðɪs",  // reference IPA
).unwrap();

Decoder only

use neutts::NeuCodecDecoder;

let dec = NeuCodecDecoder::new().unwrap();
println!("backend: {}", dec.backend_name()); // e.g. "burn/wgpu (GPU)"
println!("{} samples/token", dec.hop_length());

let codes: Vec<i32> = vec![/* speech token IDs */];
let audio: Vec<f32> = dec.decode(&codes).unwrap();

Reference-code cache

use neutts::RefCodeCache;
use std::path::Path;

let cache = RefCodeCache::new()?;
if let Some((codes, outcome)) = cache.try_load(Path::new("reference.wav"))? {
    println!("{outcome}"); // "Cache hit (SHA-256: …)"
}

Mobile / C FFI

A practical mobile architecture runs the backbone server-side and only the NeuCodec decoder on-device:

NeuTtsHandle *codec = neutts_model_load("/path/to/neucodec_decoder.safetensors");

float *audio = neutts_decode_tokens(codec, codes, num_codes, &n_samples);
neutts_write_wav(audio, n_samples, "/path/to/output.wav");
neutts_free_audio(audio, n_samples);
neutts_model_free(codec);

See include/neutts.h for the full C header.


Pipeline stages

  1. Text preprocessing — numbers, currencies, abbreviations → spoken words
  2. Phonemisation — espeak-ng converts text to IPA phonemes
  3. Prompt construction — reference codes + IPA → GGUF prompt
  4. Backbone inference — GGUF LLM generates <|speech_N|> tokens
  5. Token extraction — regex extracts integer IDs from generated text
  6. Codec decode — NeuCodec decoder converts IDs to 24 kHz audio

Status

Component Status
GGUF backbone inference
NeuCodec decoder (pure Rust, safetensors)
NeuCodec encoder (pure Rust) speak example falls back to Python neucodec
GPU-accelerated codec (wgpu feature) ✅ Metal / Vulkan / DX12 via Burn
Streaming backbone API (generate_streaming)
Streaming PCM output (stream_pcm example)
English backbones (Nano / Air, Q4 / Q8)
German / French / Spanish backbones
iOS / Android build ✅ codec is pure Rust; backbone needs cross-compile

Citation

If you use this software in your research or project, please cite it as:

@software{hauptmann2026neuttsrs,
  author       = {Hauptmann, Eugene},
  title        = {{neutts}: Rust port of {NeuTTS} — on-device voice-cloning {TTS}
                  with {GGUF} backbone and {NeuCodec} decoder},
  year         = {2026},
  version      = {0.0.2},
  license      = {MIT},
  url          = {https://github.com/eugenehp/neutts-rs}
}

If you also use the underlying NeuTTS model or NeuCodec, please cite those works directly via their respective HuggingFace repositories at huggingface.co/neuphonic.


License

MIT