neutts-rs
Rust port of NeuTTS — on-device voice-cloning TTS built on a GGUF LLM backbone and the NeuCodec neural audio codec.
Pure Rust — no ONNX Runtime, no native ML dependencies.
The codec runs as a self-contained CPU/GPU inference engine
(safetensors + ndarray + rustfft, with optional burn/wgpu GPU path).
Quick start
1. Install system dependencies
# macOS
# Ubuntu / Debian
# Alpine
2. Convert codec weights (one-time, ~2 min)
Downloads neuphonic/neucodec, extracts the decoder weights, and saves them as
models/neucodec_decoder.safetensors.
3. Build
4. Clone a voice and synthesise
The simplest path — point at any WAV file and say what you want:
On the first run the reference WAV is encoded via the Python neucodec package
and cached as my_voice.npy beside the WAV. Every subsequent run loads the
cache and skips encoding entirely.
One-time Python install for encoding:
Examples
| Example | What it does |
|---|---|
speak |
Recommended. WAV in → WAV out. Encodes on first run, caches .npy. Supports --list-models, --list-files, --gguf-file. |
stream_pcm |
Streaming. Preloads models once, streams raw PCM to stdout as audio is synthesised in chunks. |
basic |
Synthesise from a pre-encoded .npy reference |
clone_voice |
Full voice cloning — .npy or raw WAV + SHA-256 cache |
encode_reference |
Stub — returns a helpful error; use Python for now |
download_models |
Download / stage weights |
test_pipeline |
Smoke-test every component without model files |
speak
# Minimal — encodes reference on first run, cached after that
# Use a bundled sample voice (pre-encoded, no Python needed)
# Skip directly to a pre-encoded .npy
# List all known backbone models
# List GGUF files available in a specific repo
# Pick a specific GGUF quantisation
# Different language backbone
# CPU-only (no wgpu)
speak flags:
| Flag | Short | Purpose |
|---|---|---|
--wav PATH |
-w |
WAV file of the voice to clone |
--codes PATH |
-c |
Pre-encoded .npy (skips encoding) |
--ref-text TEXT|PATH |
-r |
Transcript of the reference WAV (file or literal string). Auto-detected from <stem>.txt if omitted. |
--text TEXT |
-t |
Text to synthesise |
--out PATH |
-o |
Output WAV (default: output.wav) |
--backbone REPO |
-b |
HuggingFace backbone repo (see --list-models) |
--gguf-file FILE |
-g |
Specific .gguf filename within the repo |
--list-files |
Print all .gguf files in --backbone and exit |
|
--list-models |
Print table of all known backbone repos and exit |
stream_pcm
Preloads the backbone and codec once at startup, then drives the backbone in streaming mode: speech tokens are forwarded to the codec in chunks as they are generated, and raw signed 16-bit little-endian PCM is written to stdout as each chunk is ready. Timing diagnostics (time-to-first-audio, RTF) go to stderr so they do not corrupt the byte stream.
# Linux — aplay
| \
# macOS — sox
| \
# Cross-platform — ffplay
| \
# Save to file, then convert
stream_pcm flags:
| Flag | Short | Default | Purpose |
|---|---|---|---|
--codes PATH |
-c |
(required) | Pre-encoded .npy reference codes |
--text TEXT |
-t |
(required) | Text to synthesise |
--ref-text TEXT|PATH |
-r |
auto | Transcript of the reference recording |
--backbone REPO |
-b |
nano-q4 | HuggingFace backbone repo |
--gguf-file FILE |
-g |
auto | Specific .gguf filename |
--chunk N |
-k |
25 |
Tokens per decode chunk (25 ≈ 500 ms) |
Chunk size trade-off:
--chunk |
Audio buffered | Latency (TTFA) | Quality at boundaries |
|---|---|---|---|
| 10 | ~200 ms | lowest | mild artefacts possible |
| 25 | ~500 ms | balanced | good (default) |
| 50 | ~1 s | higher | best |
Each chunk is decoded independently by the NeuCodec transformer, so very small chunks lose cross-chunk attention context. Values ≥ 25 are recommended.
Available models
Run --list-models to see the full table at any time:
| Repo | Name | Language | Params | GGUF |
|---|---|---|---|---|
neuphonic/neutts-nano-q4-gguf |
NeuTTS Nano Q4 | en-us | 0.2B | ✅ |
neuphonic/neutts-nano-q8-gguf |
NeuTTS Nano Q8 | en-us | 0.2B | ✅ |
neuphonic/neutts-nano |
NeuTTS Nano (full) | en-us | 0.2B | |
neuphonic/neutts-air-q4-gguf |
NeuTTS Air Q4 | en-us | 0.7B | ✅ |
neuphonic/neutts-air-q8-gguf |
NeuTTS Air Q8 | en-us | 0.7B | ✅ |
neuphonic/neutts-air |
NeuTTS Air (full) | en-us | 0.7B | |
neuphonic/neutts-nano-german-q4-gguf |
NeuTTS Nano German Q4 | de | 0.2B | ✅ |
neuphonic/neutts-nano-german-q8-gguf |
NeuTTS Nano German Q8 | de | 0.2B | ✅ |
neuphonic/neutts-nano-german |
NeuTTS Nano German (full) | de | 0.2B | |
neuphonic/neutts-nano-french-q4-gguf |
NeuTTS Nano French Q4 | fr-fr | 0.2B | ✅ |
neuphonic/neutts-nano-french-q8-gguf |
NeuTTS Nano French Q8 | fr-fr | 0.2B | ✅ |
neuphonic/neutts-nano-french |
NeuTTS Nano French (full) | fr-fr | 0.2B | |
neuphonic/neutts-nano-spanish-q4-gguf |
NeuTTS Nano Spanish Q4 | es | 0.2B | ✅ |
neuphonic/neutts-nano-spanish-q8-gguf |
NeuTTS Nano Spanish Q8 | es | 0.2B | ✅ |
neuphonic/neutts-nano-spanish |
NeuTTS Nano Spanish (full) | es | 0.2B |
To discover which specific GGUF quantisation variants are in a repo:
Then pick one with --gguf-file:
Architecture
text ──► espeak-ng ──► IPA ──┐
├──► prompt builder ──► GGUF backbone ──► speech tokens
ref_codes (.npy) ─────────────┘ (llama-cpp-4) │
▼
NeuCodec decoder
(Burn wgpu GPU ──or──
Burn NdArray CPU ──or──
raw ndarray CPU)
│
▼
audio (Vec<f32>, 24 kHz)
GGUF backbone
Small causal LM in GGUF format, run via llama-cpp-4. Takes a phonemized text
prompt and pre-encoded reference speaker codes, generates <|speech_N|> tokens
one at a time. The generate_streaming API forwards each token to a callback
immediately, enabling low-latency audio delivery.
NeuCodec decoder (pure Rust)
XCodec2-based architecture loaded at runtime from models/neucodec_decoder.safetensors.
With the wgpu feature the full forward pass runs on the GPU (Metal / Vulkan / DX12);
the final ISTFT always runs on CPU.
codes [T]
└─► FSQ decode (integer → 8 scaled digits → project_out Linear 8→2048)
│
fc_post_a (Linear 2048→1024)
│
VocosBackbone
├─ Conv1d(k=7)
├─ 2 × ResnetBlock (GroupNorm → SiLU → Conv1d)
├─ 12 × TransformerBlock (RMSNorm → MHA + RoPE → SiLU MLP)
│ └─ RoPE tables pre-computed at load time (see `fast`/`precise` features)
└─ 2 × ResnetBlock + LayerNorm
│
ISTFTHead
├─ Linear(1024 → n_fft+2)
└─ ISTFT (same padding, Hann window, always CPU)
│
audio [T × hop_length] (24 kHz)
| Property | Value |
|---|---|
| Output sample rate | 24 000 Hz |
| Tokens / second | 50 |
| Samples / token | 480 (hop_length) |
| FSQ codebook size | 4⁸ = 65 536 codes |
| Encoder input | 16 000 Hz mono WAV |
Bundled reference voices
Five pre-encoded voices are included and work without any Python encoding step:
| Files | Voice | Language |
|---|---|---|
samples/jo.* |
Jo | English |
samples/dave.* |
Dave | English |
samples/juliette.* |
Juliette | French |
samples/greta.* |
Greta | German |
samples/mateo.* |
Mateo | Spanish |
Each has a .wav (original audio), .npy (pre-encoded tokens), and .txt (transcript).
Feature flags
| Feature | Default | Description |
|---|---|---|
backbone |
✓ | GGUF backbone via llama-cpp-4 (requires cmake + C++) |
espeak |
Raw-text input via libespeak-ng |
|
wgpu |
GPU-accelerated codec via Burn wgpu (Metal/Vulkan/DX12); auto-falls back to Burn NdArray CPU, then raw ndarray | |
metal |
macOS Metal GPU for the backbone | |
cuda |
NVIDIA CUDA for the backbone | |
vulkan |
Vulkan GPU for the backbone (Linux/Windows, requires libvulkan1) |
|
fast |
✓ | RoPE: degree-7/6 Horner polynomial (~1 × 10⁻⁴ error, no transcendental calls) |
precise |
RoPE: stdlib f32::sin_cos(), correctly rounded; mutually exclusive with fast |
fast vs precise: Both affect how sin/cos values are computed when
building the Rotary Positional Embedding tables in the NeuCodec transformer.
The polynomial path (fast) avoids transcendental function calls — 6 FMAs per
value — and is measurably faster at load time on platforms where hardware sin/cos
is slow. The accuracy difference is imperceptible in speech synthesis. Pass
--features precise to opt into full IEEE 754 accuracy:
Setting both fast and precise simultaneously is a compile-time error.
Without backbone — codec-only mode; use NeuCodecDecoder::decode() directly.
Without espeak — pass pre-phonemized IPA via tts.infer_from_ipa().
Performance
Measured on a MacBook Pro M2 with the wgpu feature (Metal GPU) and
neutts-nano-Q4_0.gguf (0.2B parameters, 372 reference tokens, ~125 output tokens):
| Version | Synth time | RTF | Notes |
|---|---|---|---|
| 0.0.1 | 4.45 s | 1.79× | GPU init (1.72 s) counted against synthesis |
| 0.0.2 | ~2.7 s | ~1.1× | GPU init moved to load time; RoPE uploads eliminated |
What changed in 0.0.2:
- Eager GPU init —
NeuCodecDecoder::from_file()now initialises the Burn wgpu backend immediately, moving the ~1.7 s GPU upload from synthesis latency into model loading time (reported in "loaded in X s"). - Pre-computed RoPE tables — cos/sin tables for up to 2048 positions are computed once at weight-load time and stored as device tensors. This eliminates the 24 CPU→GPU uploads that previously occurred on every decode call (12 transformer blocks × Q and K projections each).
Build requirements
| Platform | Backbone | Codec | Phonemizer (espeak feature) |
|---|---|---|---|
| macOS | cmake + C++ (auto) | pure Rust | brew install espeak-ng — or bash scripts/build-espeak.sh |
| Linux | cmake + C++ (auto) | pure Rust | apt install libespeak-ng-dev — or bash scripts/build-espeak.sh |
| Windows MSVC | cmake + MSVC (auto) | pure Rust | .\scripts\build-espeak-windows.ps1 (see below) |
| Windows GNU | cmake + MinGW (auto) | pure Rust | bash scripts/build-espeak.sh (WSL or Git Bash) |
| iOS / Android | cross-compile llama.cpp | pure Rust | Cross-compile espeak-ng; set ESPEAK_LIB_DIR |
Windows (MSVC) — quick start
# Prerequisites (one-time):
winget install Kitware.CMake Git.Git LLVM.LLVM Ninja-build.Ninja
# From a "Developer PowerShell for VS 2022":
.\scripts\build-espeak-windows.ps1
# Build:
$env:ESPEAK_LIB_DIR = "espeak-static\lib"
cargo build --features espeak
The PowerShell script builds espeak-ng from source into espeak-static\lib\espeak-ng-merged.lib
and copies the language data to espeak-static\share\espeak-ng-data\.
Path-length note: if your project tree is deeply nested, pass -BuildRoot C:\es to use a
short build directory and avoid Windows MAX_PATH issues:
.\scripts\build-espeak-windows.ps1 -BuildRoot C:\es
Alternatively, set ESPEAK_BUILD_DIR=C:\es before cargo build and the build script will use
that directory automatically.
Cross-compiling Linux/macOS → Windows x86_64-pc-windows-gnu
Option A — MinGW-w64 (Ubuntu / Debian / macOS):
# Install MinGW-w64 cross-compiler:
# Ubuntu: sudo apt install gcc-mingw-w64-x86-64 g++-mingw-w64-x86-64
# macOS: brew install mingw-w64
# Build espeak-ng for Windows-GNU target:
CROSS_TARGET=x86_64-w64-mingw32
# Add the Rust target and build:
ESPEAK_LIB_DIR=espeak-static/lib \
CARGO_TARGET_X86_64_PC_WINDOWS_GNU_LINKER=x86_64-w64-mingw32-gcc \
Option B — Zig as cross-compiler (Alpine Linux / any host with Zig):
When MinGW-w64 is unavailable (e.g. Alpine Linux), Zig ships a built-in
x86_64-windows-gnu cross-compiler that can substitute:
# Install Zig (0.12+)
# Alpine: apk add zig
# Others: https://ziglang.org/download/
# Create thin wrapper scripts so cc-rs can find the MinGW triple:
# Codec-only build (no espeak / cmake required):
# → target/x86_64-pc-windows-gnu/release/libneutts.a (PE/COFF x86-64)
See .cargo/config.toml for the pre-wired linker/ar configuration.
Using the library
Full pipeline
use ;
use Path;
// Download backbone from HuggingFace (cached after first run).
// Pass None to auto-select the first GGUF in the repo,
// or Some("filename.gguf") to pick a specific quantisation.
let tts = load_from_hub_cb.unwrap;
// Load pre-encoded reference codes
let ref_codes = tts.load_ref_codes.unwrap;
// Synthesise — returns Vec<f32> at 24 kHz mono
let audio = tts.infer.unwrap;
// Save to WAV
tts.write_wav.unwrap;
Streaming synthesis
The backbone exposes a token-by-token streaming API that lets you start decoding audio before the model has finished generating:
use ;
use Write as _;
let tts = load_from_hub_cb.unwrap;
let ref_codes = tts.load_ref_codes.unwrap;
let ref_ipa = phonemize.unwrap;
let input_ipa = phonemize.unwrap;
let prompt = build_prompt;
let mut pending = Vec::new;
let codec = &tts.codec;
let stdout = stdout;
let mut out = new;
tts.backbone.generate_streaming?;
See examples/stream_pcm.rs for the full
self-contained example with timing diagnostics and player instructions.
Discover models programmatically
use ;
// Iterate the registry
for m in BACKBONE_MODELS
// Find a specific repo
if let Some = find_model
// List GGUF files available in a repo (network call)
let files = list_gguf_files.unwrap;
for f in &files
IPA passthrough (without espeak)
let audio = tts.infer_from_ipa.unwrap;
Decoder only
use NeuCodecDecoder;
let dec = new.unwrap;
println!; // e.g. "burn/wgpu (GPU)"
println!;
let codes: = vec!;
let audio: = dec.decode.unwrap;
Reference-code cache
use RefCodeCache;
use Path;
let cache = new?;
if let Some = cache.try_load?
Mobile / C FFI
A practical mobile architecture runs the backbone server-side and only the NeuCodec decoder on-device:
NeuTtsHandle *codec = ;
float *audio = ;
;
;
;
See include/neutts.h for the full C header.
Pipeline stages
- Text preprocessing — numbers, currencies, abbreviations → spoken words
- Phonemisation — espeak-ng converts text to IPA phonemes
- Prompt construction — reference codes + IPA → GGUF prompt
- Backbone inference — GGUF LLM generates
<|speech_N|>tokens - Token extraction — regex extracts integer IDs from generated text
- Codec decode — NeuCodec decoder converts IDs to 24 kHz audio
Status
| Component | Status |
|---|---|
| GGUF backbone inference | ✅ |
| NeuCodec decoder (pure Rust, safetensors) | ✅ |
| NeuCodec encoder (pure Rust) | ⏳ speak example falls back to Python neucodec |
GPU-accelerated codec (wgpu feature) |
✅ Metal / Vulkan / DX12 via Burn |
Streaming backbone API (generate_streaming) |
✅ |
Streaming PCM output (stream_pcm example) |
✅ |
| English backbones (Nano / Air, Q4 / Q8) | ✅ |
| German / French / Spanish backbones | ✅ |
| Windows cross-compilation (Zig or MinGW-w64) | ✅ |
| Test suite (unit + integration + e2e) | ✅ 106 tests, no model files required |
| iOS / Android build | ✅ codec is pure Rust; backbone needs cross-compile |
Citation
If you use this software in your research or project, please cite it as:
If you also use the underlying NeuTTS model or NeuCodec, please cite those works directly via their respective HuggingFace repositories at huggingface.co/neuphonic.
License
MIT