Crate rlx_voxtral_tts

Voxtral-4B-TTS on RLX — Ministral LM + acoustic flow matching + codec decode.

Native Rust port of vLLM-Omni VoxtralTTSAudioGeneration (no Python at inference).

Re-exports

pub use backbone::CompiledMinistralLm;
pub use backbone::MinistralLm;
pub use backbone::NativeTtsEngine;
pub use bench::VoxtralTtsBenchReport;
pub use codec::CodecDecoder;
pub use config::HF_MODEL_ID;
pub use config::VoxtralTtsConfig;
pub use generation::GenerationConfig;
pub use load::VoxtralTtsWeightStore;
pub use lora::load_lora_bank;
pub use options::VoxtralTtsOptions;
pub use options::VoxtralTtsRunnerBuilder;
pub use prompt_tokens::load_prompt_tokens;
pub use runner::VoxtralTtsRunner;
pub use runner::parse_codes_file;
pub use runner::write_wav_mono;
pub use tokens::PRESET_VOICES;
pub use voice::VoiceEmbedding;
pub use voice_clone::VoiceCloneSupport;
pub use voice_clone::clone_from_reference_audio;
pub use voice_clone::encode_reference_wav;
pub use voice_clone::encode_reference_wav_to_file;
pub use voice_clone::max_reference_seconds;
pub use voice_clone::voice_clone_support;
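
For orientation, the kind of work a helper such as `write_wav_mono` performs can be sketched with the standard library alone. The function name `write_wav_mono_sketch`, its signature, and the 16-bit PCM encoding below are assumptions made for illustration; this page does not show the crate's actual signature, output sample rate, or sample format.

```rust
use std::io::{self, Write};

/// Hypothetical sketch (not the crate's API): write mono f32 samples in
/// [-1.0, 1.0] as a 16-bit PCM WAV stream with the canonical 44-byte header.
fn write_wav_mono_sketch<W: Write>(
    mut out: W,
    samples: &[f32],
    sample_rate: u32,
) -> io::Result<()> {
    // Two bytes per sample, one channel.
    let data_len = (samples.len() * 2) as u32;

    // RIFF container header.
    out.write_all(b"RIFF")?;
    out.write_all(&(36 + data_len).to_le_bytes())?;
    out.write_all(b"WAVE")?;

    // "fmt " chunk: PCM, mono, 16 bits per sample.
    out.write_all(b"fmt ")?;
    out.write_all(&16u32.to_le_bytes())?; // fmt chunk size
    out.write_all(&1u16.to_le_bytes())?; // audio format 1 = PCM
    out.write_all(&1u16.to_le_bytes())?; // channel count
    out.write_all(&sample_rate.to_le_bytes())?;
    out.write_all(&(sample_rate * 2).to_le_bytes())?; // byte rate
    out.write_all(&2u16.to_le_bytes())?; // block align
    out.write_all(&16u16.to_le_bytes())?; // bits per sample

    // "data" chunk: clamp, scale, and quantize each sample.
    out.write_all(b"data")?;
    out.write_all(&data_len.to_le_bytes())?;
    for &s in samples {
        let q = (s.clamp(-1.0, 1.0) * 32767.0).round() as i16;
        out.write_all(&q.to_le_bytes())?;
    }
    Ok(())
}
```

Because the sketch is generic over `Write`, it accepts a `std::fs::File`, a `BufWriter`, or an in-memory `Vec<u8>`. A call such as `write_wav_mono_sketch(&mut buf, &samples, 24_000)` uses 24 kHz purely as a placeholder rate.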

Modules

acoustic: Flow-matching acoustic transformer (vLLM-Omni FlowMatchingAudioTransformer).
acoustic_compiled: Compiled acoustic velocity stack (3-token FM sequence, bidirectional attention).
acoustic_engine: Acoustic head backend — eager CPU reference or RLX-compiled stack.
acoustic_flow: Compiled acoustic velocity stack (3-token bidirectional transformer, no attention RoPE).
backbone
bench: Stage timing for native TTS (LM prefill / decode, acoustic, codec).
cli
codec
config: Voxtral-4B-TTS config (params.json / HF layout).
decode_shard_layer: Decode layer for wgpu LM shards — global checkpoint keys, local past_k_* inputs.
generation: Generation options (parity with vLLM-Omni sampling).
lm_flow: Compiled Ministral graphs (inputs_embeds prefill/decode, no LM head).
load: Mmap-backed weight access for consolidated.safetensors.
lora: LoRA adapters on Ministral attention + FFN projections (inference merge + eager apply).
math: Small ndarray helpers for eager CPU inference.
options: Runner options — device and eager fallbacks.
prompt_tokens: Load prompt token ids exported by the Docker tools image.
rng: Reproducible Gaussian noise for flow-matching (seeded PCG).
runner: End-to-end TTS runner — native Rust only.
speech_tokenizer: Native Tekken speech prompt tokenization (replaces Docker mistral_common).
tokens: Audio + text special tokens (vLLM-Omni AudioSpecialTokens).
voice: Preset voice embeddings (voice_embedding/*.pt or converted .f32).
voice_clone: Reference-audio voice cloning via trained codec encoder weights.
voice_pt: Convert HuggingFace voice_embedding/*.pt (bf16 zip) to native .f32.
weights: Map Voxtral-4B-TTS checkpoint keys → Llama builder keys.
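
The `rng` and `acoustic` entries above describe seeded Gaussian noise feeding a flow-matching velocity model. The sketch below illustrates that idea in isolation: a PCG32 generator with Box-Muller sampling, followed by fixed-step Euler integration of a velocity field from t = 0 to t = 1. Every name here (`Pcg32`, `next_gaussian`, `euler_integrate`) is hypothetical, and the Euler solver, the PCG variant, and the Box-Muller transform are assumptions; this page does not state which solver, step count, or sampling method the crate uses.

```rust
/// Hypothetical seeded PCG32 generator (the standard PCG-XSH-RR variant).
struct Pcg32 {
    state: u64,
    inc: u64,
}

impl Pcg32 {
    fn new(seed: u64, stream: u64) -> Self {
        let mut rng = Pcg32 { state: 0, inc: (stream << 1) | 1 };
        rng.next_u32();
        rng.state = rng.state.wrapping_add(seed);
        rng.next_u32();
        rng
    }

    fn next_u32(&mut self) -> u32 {
        let old = self.state;
        self.state = old
            .wrapping_mul(6364136223846793005)
            .wrapping_add(self.inc);
        let xorshifted = (((old >> 18) ^ old) >> 27) as u32;
        let rot = (old >> 59) as u32;
        xorshifted.rotate_right(rot)
    }

    /// Uniform sample in [0, 1) built from the top 24 bits.
    fn next_f32(&mut self) -> f32 {
        (self.next_u32() >> 8) as f32 / (1u32 << 24) as f32
    }

    /// Standard normal sample via the Box-Muller transform.
    fn next_gaussian(&mut self) -> f32 {
        let u1 = 1.0 - self.next_f32(); // in (0, 1], so ln is finite
        let u2 = self.next_f32();
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f32::consts::PI * u2).cos()
    }
}

/// Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with `steps`
/// equal Euler steps, starting from `x0`.
fn euler_integrate<F>(x0: &[f32], steps: usize, velocity: F) -> Vec<f32>
where
    F: Fn(&[f32], f32) -> Vec<f32>,
{
    let dt = 1.0 / steps as f32;
    let mut x = x0.to_vec();
    for k in 0..steps {
        let t = k as f32 * dt;
        let v = velocity(&x, t);
        for (xi, vi) in x.iter_mut().zip(v.iter()) {
            *xi += dt * vi;
        }
    }
    x
}
```

In a real pipeline the velocity closure would call the acoustic transformer. For a quick check, a toy field such as `(target - x) / (1 - t)` can stand in for it, since Euler steps under that field carry any starting noise to `target` at t = 1. Seeding the generator identically reproduces the same noise, which is what makes runs comparable.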
