Voxtral-4B-TTS on RLX — Ministral LM + acoustic flow matching + codec decode.
Native Rust port of vLLM-Omni VoxtralTTSAudioGeneration (no Python at inference).
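To make the acoustic flow-matching stage concrete, here is a minimal, self-contained sketch of seeded Gaussian noise followed by fixed-step Euler integration of a velocity field. It is not this crate's API: `Pcg32`, `gaussian_noise`, and `euler_sample` are hypothetical names, the velocity is passed in as a plain closure where the real model would use the acoustic transformer, and the direction convention (noise at t = 0, sample at t = 1) and the Euler solver itself are assumptions.

```rust
/// Minimal PCG32 (XSH-RR variant) so the noise is reproducible from a seed.
/// Hypothetical stand-in for a seeded generator; not the crate's `rng` module.
struct Pcg32 {
    state: u64,
    inc: u64,
}

impl Pcg32 {
    fn new(seed: u64, stream: u64) -> Self {
        let mut rng = Pcg32 { state: 0, inc: (stream << 1) | 1 };
        rng.next_u32();
        rng.state = rng.state.wrapping_add(seed);
        rng.next_u32();
        rng
    }

    fn next_u32(&mut self) -> u32 {
        let old = self.state;
        self.state = old
            .wrapping_mul(6364136223846793005)
            .wrapping_add(self.inc);
        let xorshifted = (((old >> 18) ^ old) >> 27) as u32;
        let rot = (old >> 59) as u32;
        xorshifted.rotate_right(rot)
    }

    /// Uniform sample that is strictly positive and at most 1, so `ln` stays finite.
    fn next_unit(&mut self) -> f32 {
        ((self.next_u32() >> 8) as f32 + 0.5) / 16_777_216.0
    }

    /// Standard normal sample via the Box-Muller transform.
    fn next_gaussian(&mut self) -> f32 {
        let u1 = self.next_unit();
        let u2 = self.next_unit();
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f32::consts::PI * u2).cos()
    }
}

/// Draw `len` Gaussian values from a fixed seed (same seed, same noise).
fn gaussian_noise(seed: u64, len: usize) -> Vec<f32> {
    let mut rng = Pcg32::new(seed, 0);
    (0..len).map(|_| rng.next_gaussian()).collect()
}

/// Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with `steps` Euler steps.
/// Assumption: noise lives at t = 0 and the generated sample at t = 1.
fn euler_sample<F>(noise: Vec<f32>, steps: usize, velocity: F) -> Vec<f32>
where
    F: Fn(&[f32], f32) -> Vec<f32>,
{
    let dt = 1.0 / steps as f32;
    let mut x = noise;
    for k in 0..steps {
        let t = k as f32 * dt;
        let v = velocity(&x, t);
        for (xi, vi) in x.iter_mut().zip(&v) {
            *xi += dt * vi;
        }
    }
    x
}
```

As a usage sketch, `euler_sample(gaussian_noise(42, 8), 8, |x, t| ...)` with a toy velocity `(target - x) / (1 - t)` moves the noise onto `target`. Calling `gaussian_noise` twice with the same seed returns identical vectors, which is what makes a run repeatable.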
Re-exports
pub use backbone::CompiledMinistralLm;
pub use backbone::MinistralLm;
pub use backbone::NativeTtsEngine;
pub use bench::VoxtralTtsBenchReport;
pub use codec::CodecDecoder;
pub use config::HF_MODEL_ID;
pub use config::VoxtralTtsConfig;
pub use generation::GenerationConfig;
pub use load::VoxtralTtsWeightStore;
pub use lora::load_lora_bank;
pub use options::VoxtralTtsOptions;
pub use options::VoxtralTtsRunnerBuilder;
pub use prompt_tokens::load_prompt_tokens;
pub use runner::VoxtralTtsRunner;
pub use runner::parse_codes_file;
pub use runner::write_wav_mono;
pub use tokens::PRESET_VOICES;
pub use voice::VoiceEmbedding;
pub use voice_clone::VoiceCloneSupport;
pub use voice_clone::clone_from_reference_audio;
pub use voice_clone::encode_reference_wav;
pub use voice_clone::encode_reference_wav_to_file;
pub use voice_clone::max_reference_seconds;
pub use voice_clone::voice_clone_support;
Modules
- acoustic: Flow-matching acoustic transformer (vLLM-Omni `FlowMatchingAudioTransformer`).
- acoustic_compiled: Compiled acoustic velocity stack (3-token FM sequence, bidirectional attention).
- acoustic_engine: Acoustic head backend (eager CPU reference or RLX-compiled stack).
- acoustic_flow: Compiled acoustic velocity stack (3-token bidirectional transformer, no attention RoPE).
- backbone
- bench: Stage timing for native TTS (LM prefill / decode, acoustic, codec).
- cli
- codec
- config: Voxtral-4B-TTS config (`params.json` / HF layout).
- decode_shard_layer: Decode layer for wgpu LM shards (global checkpoint keys, local `past_k_*` inputs).
- generation: Generation options (parity with vLLM-Omni sampling).
- lm_flow: Compiled Ministral graphs (`inputs_embeds` prefill/decode, no LM head).
- load: Mmap-backed weight access for `consolidated.safetensors`.
- lora: LoRA adapters on Ministral attention + FFN projections (inference merge + eager apply).
- math: Small ndarray helpers for eager CPU inference.
- options: Runner options (device and eager fallbacks).
- prompt_tokens: Load prompt token ids exported by the Docker tools image.
- rng: Reproducible Gaussian noise for flow-matching (seeded PCG).
- runner: End-to-end TTS runner (native Rust only).
- speech_tokenizer: Native Tekken speech prompt tokenization (replaces Docker `mistral_common`).
- tokens: Audio + text special tokens (vLLM-Omni `AudioSpecialTokens`).
- voice: Preset voice embeddings (`voice_embedding/*.pt` or converted `.f32`).
- voice_clone: Reference-audio voice cloning via trained codec encoder weights.
- voice_pt: Convert HuggingFace `voice_embedding/*.pt` (bf16 zip) to native `.f32`.
- weights: Map Voxtral-4B-TTS checkpoint keys → Llama builder keys.
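The `lora` entry above mentions two ways of using an adapter: merging it into the base weights for inference, or applying it eagerly next to them. The sketch below illustrates why the two give the same output for a single linear projection. It is an illustration only: `Mat`, `merge_lora`, and `apply_lora_eager` are hypothetical names rather than this crate's API, and the `alpha / rank` scaling is the common LoRA convention, assumed here rather than confirmed for this crate.

```rust
/// Row-major dense matrix: element (r, c) lives at `data[r * cols + c]`.
/// Hypothetical helper type for this sketch only.
struct Mat {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

impl Mat {
    /// Matrix-vector product `self * x`.
    fn matvec(&self, x: &[f32]) -> Vec<f32> {
        assert_eq!(x.len(), self.cols);
        self.data
            .chunks(self.cols)
            .map(|row| row.iter().zip(x).map(|(w, xi)| w * xi).sum::<f32>())
            .collect()
    }
}

/// Inference merge: W' = W + (alpha / rank) * B * A.
/// Shapes: W is out x in, A is rank x in, B is out x rank.
fn merge_lora(w: &Mat, a: &Mat, b: &Mat, alpha: f32) -> Mat {
    let rank = a.rows;
    assert_eq!(b.cols, rank);
    assert_eq!(a.cols, w.cols);
    assert_eq!(b.rows, w.rows);
    let scale = alpha / rank as f32;
    let mut data = w.data.clone();
    for i in 0..w.rows {
        for j in 0..w.cols {
            let mut delta = 0.0f32;
            for k in 0..rank {
                delta += b.data[i * rank + k] * a.data[k * a.cols + j];
            }
            data[i * w.cols + j] += scale * delta;
        }
    }
    Mat { rows: w.rows, cols: w.cols, data }
}

/// Eager apply: y = W x + (alpha / rank) * B (A x), leaving W untouched.
fn apply_lora_eager(w: &Mat, a: &Mat, b: &Mat, alpha: f32, x: &[f32]) -> Vec<f32> {
    let scale = alpha / a.rows as f32;
    let base = w.matvec(x);
    let low_rank = b.matvec(&a.matvec(x));
    base.iter()
        .zip(&low_rank)
        .map(|(y, d)| y + scale * d)
        .collect()
}
```

Merging costs one pass over the weights up front and adds nothing per token afterwards. The eager path keeps the base weights unchanged, which is convenient when switching between several adapters, at the cost of two extra small matrix-vector products per projection.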