voxcpm-rs
Pure-Rust inference for VoxCPM2 — a zero-shot text-to-speech model with voice cloning — built on top of the Burn ML framework.
Runs locally on your machine via Vulkan (AMD, NVIDIA, Intel) or a pure-CPU fallback. No Python, no CUDA, no ONNX runtime — just a cargo dependency.
let model: = from_local?;
let wav = model.generate?;
write_wav?;
Contents
- Why
- Quick start
- Backends & features
- API tour
- Architecture
- Examples
- Contributing
- Related projects
- License
Why
The upstream VoxCPM2 reference is Python + PyTorch + CUDA. That is a heavy dependency tree to ship inside a desktop app, a game, a CLI tool, or any other Rust project where you want offline, on-device TTS.
voxcpm-rs is a single cargo add away and runs on:
- Any Vulkan-capable GPU (AMD, NVIDIA, Intel, Apple via MoltenVK).
- Pure CPU with SIMD elementwise ops, optionally with vendored OpenBLAS for multi-core matmul — no system libraries required.
It aims to stay faithful to the official implementation (see vendor/VoxCPM) while
exposing a small, idiomatic Rust API.
Quick start
- Grab a checkpoint. Download the VoxCPM2 weights from Hugging Face:
  You should end up with a directory containing config.json, tokenizer.json, model.safetensors, and audiovae.pth. The crate consumes this layout as-shipped; no manual weight conversion step is required. See Model files below for the full accepted layout.
- Add the crate:
  # Cargo.toml
  [dependencies]
  voxcpm-rs = { version = "0.1", default-features = false, features = ["wgpu"] }
- Synthesize something:
  use ; type B = Wgpu;
  VoxCPM::generate takes &self, so one loaded model can serve any number of sequential synthesis calls without reloading. Note, however, that VoxCPM<B> is not Sync: burn's Param<Tensor<...>> wraps a std::cell::OnceCell for lazy device materialization, which transitively makes the whole model !Sync. To share a single loaded model across threads or async tasks, wrap it in Arc<Mutex<VoxCPM<B>>> (or Arc<parking_lot::Mutex<...>>) and serialize generate calls; for true parallel inference, load one VoxCPM<B> per worker.
- Or just run the bundled example:
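The thread-sharing advice from the Synthesize step can be sketched without the crate itself. The snippet below uses a hypothetical FakeModel stand-in (not part of voxcpm-rs) that is Send but not Sync because it holds a OnceCell, then shares it through Arc<Mutex<...>> so that calls from several threads are serialized. Treat it as an illustration of the pattern under those assumptions, not as the crate's API:

```rust
use std::cell::OnceCell;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for `VoxCPM<B>`: `Send` but not `Sync`,
/// because it holds a `OnceCell` for lazily initialized state.
struct FakeModel {
    weights: OnceCell<Vec<f32>>,
}

impl FakeModel {
    fn new() -> Self {
        Self { weights: OnceCell::new() }
    }

    /// Mimics a `generate(&self, ..)` call: lazily materializes state,
    /// then returns one fake sample per input byte.
    fn generate(&self, text: &str) -> Vec<f32> {
        let w = self.weights.get_or_init(|| vec![0.5; 4]);
        text.bytes().map(|b| f32::from(b) * w[0]).collect()
    }
}

/// Runs one synthesis per input on its own thread; the mutex serializes them.
/// Returns the number of samples produced for each input, in input order.
fn synthesize_all(model: Arc<Mutex<FakeModel>>, texts: Vec<String>) -> Vec<usize> {
    let handles: Vec<_> = texts
        .into_iter()
        .map(|text| {
            let model = Arc::clone(&model);
            thread::spawn(move || {
                // Only one thread holds the lock (and the model) at a time.
                let guard = model.lock().expect("model mutex poisoned");
                guard.generate(&text).len()
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}

fn main() {
    let model = Arc::new(Mutex::new(FakeModel::new()));
    let lens = synthesize_all(model, vec!["hello".to_string(), "hi".to_string()]);
    println!("{:?}", lens);
}
```

Swapping FakeModel for a real loaded model should follow the same shape, provided the model type is Send. Because the lock is held for the whole call, throughput stays that of a single worker.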
Model files
VoxCPM::from_local expects a directory with:
| File | Purpose | Format accepted |
|---|---|---|
| config.json | Model architecture config | JSON |
| tokenizer.json | HuggingFace tokenizer | JSON |
| model.safetensors / model.pth | LM + DiT backbone weights | SafeTensors preferred, .pth/.pt fallback |
| audiovae.safetensors / audiovae.pth | AudioVAE decoder weights | SafeTensors preferred, .pth fallback |
The upstream HF repo currently ships model.safetensors + audiovae.pth; both
work directly with no conversion. PyTorch `state_dict.` / `model.` / `module.`
top-level container prefixes are stripped automatically.
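As a rough illustration of what stripping those container prefixes means for a checkpoint key, here is a minimal sketch. The function name strip_container_prefixes is hypothetical, and repeated stripping of nested prefixes is an assumption; the crate's actual loader may behave differently:

```rust
/// Container prefixes named in the text above.
const CONTAINER_PREFIXES: [&str; 3] = ["state_dict.", "model.", "module."];

/// Strips leading container prefixes from a checkpoint key.
/// Assumption: prefixes are removed repeatedly until none matches.
fn strip_container_prefixes(key: &str) -> &str {
    let mut rest = key;
    loop {
        let mut stripped = false;
        for prefix in CONTAINER_PREFIXES.iter() {
            if let Some(tail) = rest.strip_prefix(prefix) {
                rest = tail;
                stripped = true;
            }
        }
        if !stripped {
            return rest;
        }
    }
}

fn main() {
    let keys = ["module.encoder.weight", "state_dict.model.lm.bias", "lm.bias"];
    for key in keys.iter() {
        println!("{} -> {}", key, strip_container_prefixes(key));
    }
}
```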
Weight loading takes ~20–25 s on first call (a 4.3 GB BF16 backbone is upcast
to F32 for the wgpu backend — WGSL has no BF16 type). The cost is paid
once per from_local; subsequent generate() calls are free of any I/O.
Load-phase progress is reported via the log
crate, so wiring up env_logger / tracing-log surfaces it.
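To make the BF16 to F32 upcast concrete: BF16 is the upper 16 bits of an IEEE-754 single-precision float, so widening is a 16-bit left shift and every element grows from 2 to 4 bytes. The sketch below shows that conversion on a little-endian byte buffer. The helper names are illustrative and this is not the crate's loading code:

```rust
/// Widens one BF16 value (given as raw bits) to f32.
/// BF16 is the top 16 bits of an f32, so the upcast is a plain shift.
fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits(u32::from(bits) << 16)
}

/// Upcasts a little-endian BF16 byte buffer to a vector of f32 values.
/// A trailing odd byte, if any, is ignored.
fn upcast_bf16_le(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(2)
        .map(|pair| bf16_bits_to_f32(u16::from_le_bytes([pair[0], pair[1]])))
        .collect()
}

fn main() {
    // 0x3F80 encodes 1.0 and 0xC000 encodes -2.0 in BF16.
    let raw: [u8; 4] = [0x80, 0x3F, 0x00, 0xC0];
    let values = upcast_bf16_le(&raw);
    println!("{:?}", values);
    // Each element doubles in size when widened to f32.
    println!("{} bytes -> {} bytes", raw.len(), values.len() * 4);
}
```

The conversion is lossless, since every BF16 value is exactly representable in F32.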
Backends & features
Pick exactly one backend:
| Feature | Backend | Notes |
|---|---|---|
| cpu (default) | burn-ndarray + SIMD | Works everywhere. Matmul is single-threaded. |
| cpu-blas | cpu + vendored OpenBLAS | Multi-core matmul. Builds OpenBLAS from source (no system deps). |
| wgpu | Vulkan / Metal / DX12 | Recommended for GPUs. Fast cold start. |
| wgpu-fast | wgpu + fusion + autotune | ~5–7% faster steady-state; pays a one-time autotune cost (cached). |
# CPU + BLAS
# Vulkan, tuned
Tip: with wgpu-fast, set CUBECL_AUTOTUNE_LEVEL=minimal to shrink the first-run autotune cost. Results are cached in target/autotune/.
API tour
Zero-shot synthesis
let wav = model.generate?;
Voice cloning
Provide a short reference clip (ideally a few seconds of clean speech):
use Prompt;
let opts = builder
.prompt
.build;
let wav = model.generate?;
Or continue from an existing utterance (the model picks up after audio):
let opts = builder
.prompt
.build;
Audio from memory
Prompt audio doesn't have to live on disk. PromptAudio
accepts three sources — a path, already-encoded bytes, or raw PCM samples — so
you can plug the model into an in-memory pipeline (microphone capture, HTTP
upload, another TTS stage, …):
use ;
// 1. From a file path (the default — `Into<PromptAudio>` is implemented for
// `&str`, `&Path` and `PathBuf`):
let a = Reference ;
// 2. From encoded bytes in memory (any format Symphonia supports):
let bytes: = read?;
let b = Reference ;
// 3. From raw mono f32 PCM you already have:
let c = Reference ;
Symmetrically, audio::load_audio_bytes /
audio::load_audio_bytes_as let you decode encoded audio
buffers without touching the filesystem.
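The three-source idea can be pictured as an enum with From conversions for the path-like types. The sketch below is a self-contained stand-in: the variant names, the fields, and the describe helper are all hypothetical and are not the crate's real PromptAudio definition. Only the set of path conversions (&str, &Path, PathBuf) is taken from the text above:

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the crate's `PromptAudio`; variant and
/// field names are illustrative only.
#[derive(Debug, PartialEq)]
enum PromptAudio {
    Path(PathBuf),
    EncodedBytes(Vec<u8>),
    Pcm { samples: Vec<f32>, sample_rate: u32 },
}

impl<'a> From<&'a str> for PromptAudio {
    fn from(p: &'a str) -> Self {
        PromptAudio::Path(PathBuf::from(p))
    }
}

impl<'a> From<&'a Path> for PromptAudio {
    fn from(p: &'a Path) -> Self {
        PromptAudio::Path(p.to_path_buf())
    }
}

impl From<PathBuf> for PromptAudio {
    fn from(p: PathBuf) -> Self {
        PromptAudio::Path(p)
    }
}

/// Accepts anything convertible into a prompt source and describes it.
fn describe(audio: impl Into<PromptAudio>) -> String {
    match audio.into() {
        PromptAudio::Path(p) => format!("file {}", p.display()),
        PromptAudio::EncodedBytes(b) => format!("{} encoded bytes", b.len()),
        PromptAudio::Pcm { samples, sample_rate } => {
            format!("{} PCM samples at {} Hz", samples.len(), sample_rate)
        }
    }
}

fn main() {
    println!("{}", describe("ref.wav"));
    println!("{}", describe(PromptAudio::EncodedBytes(vec![0u8; 16])));
    println!(
        "{}",
        describe(PromptAudio::Pcm { samples: vec![0.0; 160], sample_rate: 16_000 })
    );
}
```

Accepting impl Into<...> at the call site is what lets a plain string literal stand in for a path while in-memory sources are passed explicitly.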
Streaming
For real-time playback, network streaming, or just to start hearing audio
before the whole utterance is ready, use
VoxCPM::generate_stream. It returns an iterator
of Result<Vec<f32>> chunks at model.sample_rate():
let opts = builder
.chunk_patches // ~400 ms / chunk at the default model config
.build;
for chunk in model.generate_stream?
Concatenating every chunk yields exactly the same waveform generate()
would have returned — chunk boundaries are seamless because the AudioVAE
decoder is causal. chunk_patches trades latency for throughput: smaller
→ lower per-chunk latency, larger → fewer chunks. The default 5 is a
sensible balance for live playback.
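The latency side of that trade-off is simple arithmetic. The sketch below assumes the approximate figure of 80 ms of audio per latent patch quoted elsewhere in this README; the helper names are illustrative and not part of the crate:

```rust
/// Approximate audio covered by one latent patch, in milliseconds
/// (the ~80 ms figure quoted in this README; treat it as an estimate).
const PATCH_MS: u32 = 80;

/// Approximate audio duration of one streamed chunk, in milliseconds.
fn chunk_duration_ms(chunk_patches: u32) -> u32 {
    chunk_patches * PATCH_MS
}

/// Number of chunks needed to emit `total_patches` (the last one may be short).
fn chunk_count(total_patches: u32, chunk_patches: u32) -> u32 {
    assert!(chunk_patches > 0, "chunk_patches must be positive");
    (total_patches + chunk_patches - 1) / chunk_patches
}

fn main() {
    for &c in [1u32, 5, 10].iter() {
        println!(
            "chunk_patches={}: ~{} ms per chunk, {} chunks for 100 patches",
            c,
            chunk_duration_ms(c),
            chunk_count(100, c)
        );
    }
}
```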
See examples/tts_stream.rs for an end-to-end
run with per-chunk timing.
Implementation note. The autoregressive loop (LM + DiT) runs incrementally with KV-cache, so streaming adds no AR overhead compared to generate(). The AudioVAE decoder, however, is currently stateless across chunks: each chunk re-decodes the cumulative latent and emits only the new tail samples, making total VAE work O(N²/chunk_patches) over an utterance instead of O(N). AR cost dominates in practice, so the difference is rarely visible.
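The quadratic term can be checked with a small counting model. The sketch below counts decoded patches as a unit of work and compares cumulative re-decoding with a hypothetical stateful decoder that only touches new patches. It models the cost described in the note, not the crate's decoder:

```rust
/// Total patches decoded when every chunk re-decodes the cumulative latent.
fn stateless_decode_work(total_patches: u64, chunk_patches: u64) -> u64 {
    assert!(chunk_patches > 0, "chunk_patches must be positive");
    let mut decoded_so_far: u64 = 0;
    let mut work: u64 = 0;
    while decoded_so_far < total_patches {
        decoded_so_far = std::cmp::min(decoded_so_far + chunk_patches, total_patches);
        // Re-decode everything generated so far, then keep only the new tail.
        work += decoded_so_far;
    }
    work
}

/// Work for a hypothetical stateful decoder that decodes each patch once.
fn incremental_decode_work(total_patches: u64) -> u64 {
    total_patches
}

fn main() {
    let n = 100;
    for &c in [5u64, 10].iter() {
        println!(
            "N={}, chunk_patches={}: stateless={}, incremental={}",
            n,
            c,
            stateless_decode_work(n, c),
            incremental_decode_work(n)
        );
    }
}
```

For N = 100 and chunk_patches = 5 the stateless count is 5 + 10 + ... + 100 = 1050, close to N²/(2 × chunk_patches) = 1000, which matches the stated O(N²/chunk_patches) growth.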
Tuning knobs
All options flow through the fluent builder:
let opts = builder
.cfg // classifier-free guidance; 1.5–3.0 is typical
.timesteps // diffusion Euler steps; fewer = faster, <6 degrades
.min_len
.max_len // hard cap on generated latent patches (~80 ms each)
.chunk_patches // patches per chunk in `generate_stream`
.build;
Cancellation
Long generations can be cancelled cooperatively from another thread via
CancelToken. The autoregressive loop polls the token between every
diffusion step, so cancel latency is bounded by one step
(~200 ms on wgpu at default timesteps=10).
use ;
use ;
let cancel = new;
let opts = builder.cancel.build;
match model.generate
CancelToken is Clone + Send + Sync (an Arc<AtomicBool> underneath),
so you can hand copies to as many watchers as you like.
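For readers unfamiliar with the pattern, a cooperative cancel flag built on Arc<AtomicBool> can be sketched as follows. This is a stand-in written for illustration: the method names cancel and is_cancelled and the run_steps loop are assumptions, not the crate's actual CancelToken API:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the crate's `CancelToken`; method names are assumed.
#[derive(Clone, Default)]
struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    fn cancel(&self) {
        self.0.store(true, Ordering::Relaxed);
    }

    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::Relaxed)
    }
}

/// Runs up to `steps` units of work, polling the token before each one.
/// Returns `Err(done)` with the number of completed steps if cancelled.
fn run_steps(
    token: &CancelToken,
    steps: usize,
    mut step: impl FnMut(usize),
) -> Result<usize, usize> {
    for i in 0..steps {
        if token.is_cancelled() {
            return Err(i);
        }
        step(i);
    }
    Ok(steps)
}

fn main() {
    let token = CancelToken::default();
    let watcher = token.clone(); // clones share the same flag

    // Cancel from inside step 2 to keep the demo deterministic; in real use
    // the clone would live on another thread (UI, timeout, signal handler).
    let result = run_steps(&token, 10, |i| {
        if i == 2 {
            watcher.cancel();
        }
    });
    println!("{:?}", result);
}
```

Because the flag is only checked between steps, a step that has already started always runs to completion, which is why cancel latency is bounded by one step.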
Architecture
VoxCPM2 is a cascade of four components — each lives in its own module:
text ──► tokenizer ──► minicpm4 (LM backbone) ──► locenc ──► locdit (diffusion) ──► audiovae ──► wav
| Module | Role |
|---|---|
| tokenizer | HF tokenizers wrapper for the LlamaTokenizerFast vocab. |
| minicpm4 | Decoder-only LM backbone (rotary attention + KV cache). |
| locenc | Local encoder: conditions the diffusion head on LM hidden states. |
| locdit | Local DiT + conditional flow-matching sampler. |
| audiovae | VAE decoder that turns FSQ patches into 16 kHz audio. |
| voxcpm2 | Glue + convenient VoxCPM façade. |
Weights are loaded directly from .safetensors or .pth via
burn-store with the PyTorchToBurnAdapter,
so HuggingFace checkpoints drop in with no manual conversion step.
Examples
Browse examples/ for standalone binaries:
- tts.rs: end-to-end synthesis.
- tts_stream.rs: chunked streaming synthesis with per-chunk latency logging.
- clone.rs: voice cloning from a reference wav.
- lm_check.rs, vae_check.rs, feat_check.rs: per-component parity checks against the reference implementation.
- bench_rmsnorm.rs: microbench for hot kernels.
Contributing
Contributions are very welcome — especially:
- Bug reports with a minimal repro and the backend/feature flags you used.
- Performance PRs (kernels, memory layout, KV cache, sampler).
- New backends supported by Burn (CUDA, Metal direct, etc.).
Before opening a PR:
- cargo fmt --all and cargo clippy --all-targets.
- cargo test --no-default-features --features cpu.
- If you touched a numeric path, run the matching *_check example against a real checkpoint and include the RTF / parity numbers in the PR description.
Keep PRs focused — one feature or fix per PR makes review much easier.
Related projects
- VoxCPM (official, Python): the reference implementation this crate tracks. A copy lives under vendor/VoxCPM for parity testing.
- Burn: the ML framework powering all the tensor math here.
- cubecl: the GPU kernel compiler behind Burn's wgpu backend.
License
Licensed under the Apache License, Version 2.0. The vendored reference
implementation under vendor/VoxCPM/ (kept in the repository for parity testing,
not shipped on crates.io) retains its own license — see the
upstream LICENSE.