Mistral Voxtral: a Whisper-style audio encoder, a 4× downsampling projector, and a Llama text decoder.
Weights: HuggingFace safetensors (mistralai/Voxtral-Mini-3B-2507) with
audio_tower.*, multi_modal_projector.*, and language_model.* tensors.
Audio and text embeddings are fused additively at [audio_token_id] placeholders
before the Llama trunk runs (see [embed::fuse_inputs_embeds]).
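
The fusion step can be illustrated with a minimal sketch. It is not the code behind [embed::fuse_inputs_embeds]: the function name `fuse_additive`, the plain `Vec<Vec<f32>>` layout, and the error type are assumptions made for this example only. It assumes each placeholder position consumes one projected audio row, in order, and that the row is added to the text embedding at that position.

```rust
/// Hypothetical illustration of additive fusion at placeholder tokens.
/// `text_embeds[i]` is the text embedding of `token_ids[i]`; `audio_embeds`
/// holds one projected audio row per placeholder, in sequence order.
pub fn fuse_additive(
    token_ids: &[u32],
    text_embeds: &[Vec<f32>],
    audio_embeds: &[Vec<f32>],
    audio_token_id: u32,
) -> Result<Vec<Vec<f32>>, String> {
    if token_ids.len() != text_embeds.len() {
        return Err("token_ids and text_embeds differ in length".to_string());
    }

    let mut fused = text_embeds.to_vec();
    let mut next_audio = 0usize; // index of the next unused audio row

    for (pos, &tok) in token_ids.iter().enumerate() {
        if tok != audio_token_id {
            continue; // ordinary text token: keep its embedding unchanged
        }
        let audio_row = audio_embeds
            .get(next_audio)
            .ok_or_else(|| "more placeholders than audio rows".to_string())?;
        if audio_row.len() != fused[pos].len() {
            return Err("hidden size mismatch between audio and text".to_string());
        }
        // Element-wise add of the audio row into the placeholder's embedding.
        for (dst, src) in fused[pos].iter_mut().zip(audio_row) {
            *dst += *src;
        }
        next_audio += 1;
    }

    if next_audio != audio_embeds.len() {
        return Err("more audio rows than placeholders".to_string());
    }
    Ok(fused)
}
```

Checking that the number of placeholders equals the number of audio rows catches a prompt template and an encoder output that disagree about audio length before the decoder runs.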