Shared base types for vision-language and omni runners (PLAN.md M7).
rlx-qwen3-vl, rlx-lfm-vl, and rlx-nemotron-omni all need the
same shape of plumbing: a per-image preprocessor (resize +
patchify), a vision-tower trait, an MLP projector trait, and a
multimodal turn interleaver that mixes image / text / (audio)
into a single LM token stream. This crate hosts those traits so
the family crates stay thin.
Status: TYPE SKELETON. The traits and supporting structs are in place; implementations land alongside the per-family crates as M7 progresses.
Structs§
- ImagePatches
- One image as the preprocessor sees it after resize + patchify. patches.len() == grid_h * grid_w * channels * patch_h * patch_w; the exact layout depends on the family.
- MultimodalPrompt
- Multimodal prompt: a turn-ordered list of (modality, payload) chunks. The runner consumes this and assembles the LM token stream by interleaving text token ids with image/audio embeddings after passing each non-text chunk through the relevant encoder + projector.
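The length invariant on ImagePatches can be illustrated with a small patchify routine. This is a minimal sketch, not the crate's implementation: `ImagePatchesSketch`, `patchify`, the `f32` pixel type, the HWC input layout, and the `[grid_h, grid_w, channels, patch_h, patch_w]` output order are all assumptions made for illustration, since the real layout depends on the family.

```rust
/// Hypothetical stand-in for the crate's `ImagePatches`; the real field
/// names, element type, and memory layout may differ.
#[derive(Debug, Clone)]
pub struct ImagePatchesSketch {
    pub grid_h: usize,
    pub grid_w: usize,
    pub channels: usize,
    pub patch_h: usize,
    pub patch_w: usize,
    pub patches: Vec<f32>,
}

impl ImagePatchesSketch {
    /// Length implied by the documented invariant.
    pub fn expected_len(&self) -> usize {
        self.grid_h * self.grid_w * self.channels * self.patch_h * self.patch_w
    }
}

/// Splits an already-resized image (assumed row-major HWC) into square
/// patches. Returns `None` if the dimensions do not divide evenly or the
/// pixel buffer has the wrong length.
pub fn patchify(
    pixels: &[f32],
    height: usize,
    width: usize,
    channels: usize,
    patch: usize,
) -> Option<ImagePatchesSketch> {
    if patch == 0
        || height % patch != 0
        || width % patch != 0
        || pixels.len() != height * width * channels
    {
        return None;
    }
    let (grid_h, grid_w) = (height / patch, width / patch);
    let mut patches = Vec::with_capacity(pixels.len());
    // Assumed output order: [grid_h, grid_w, channels, patch_h, patch_w].
    for gy in 0..grid_h {
        for gx in 0..grid_w {
            for c in 0..channels {
                for py in 0..patch {
                    for px in 0..patch {
                        let y = gy * patch + py;
                        let x = gx * patch + px;
                        patches.push(pixels[(y * width + x) * channels + c]);
                    }
                }
            }
        }
    }
    Some(ImagePatchesSketch {
        grid_h,
        grid_w,
        channels,
        patch_h: patch,
        patch_w: patch,
        patches,
    })
}
```

Because every input pixel is copied exactly once, the output length always equals the product in the invariant, whatever ordering a given family chooses.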
Enums§
- Modality
- Modality tag for one chunk of a multimodal prompt. Lives next to the LM token stream so the runner knows when to invoke the vision tower / audio encoder instead of consuming raw token ids.
- PromptChunk
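The interleaving that the runner performs over a multimodal prompt might look roughly like the following. This is a sketch under stated assumptions: `ModalitySketch`, `PromptChunkSketch`, `LmInput`, and `interleave` are hypothetical names, the variant payloads are guesses, and the encoder + projector stage is collapsed into a single caller-supplied closure.

```rust
/// Hypothetical modality tag; the crate's `Modality` may have other variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModalitySketch {
    Text,
    Image,
    Audio,
}

/// Hypothetical chunk type: text carries token ids, non-text chunks carry
/// a raw payload that still has to go through an encoder + projector.
#[derive(Debug, Clone, PartialEq)]
pub enum PromptChunkSketch {
    Text(Vec<u32>),
    Image(Vec<f32>),
    Audio(Vec<f32>),
}

/// One position in the assembled LM input stream.
#[derive(Debug, Clone, PartialEq)]
pub enum LmInput {
    Token(u32),
    Embedding(Vec<f32>),
}

/// Walks the chunks in turn order. Text token ids pass through unchanged;
/// image and audio payloads are handed to `encode`, which returns one
/// embedding row per output position.
pub fn interleave<F>(prompt: &[PromptChunkSketch], mut encode: F) -> Vec<LmInput>
where
    F: FnMut(ModalitySketch, &[f32]) -> Vec<Vec<f32>>,
{
    let mut out = Vec::new();
    for chunk in prompt {
        match chunk {
            PromptChunkSketch::Text(ids) => {
                out.extend(ids.iter().copied().map(LmInput::Token));
            }
            PromptChunkSketch::Image(payload) => {
                let rows = encode(ModalitySketch::Image, payload.as_slice());
                out.extend(rows.into_iter().map(LmInput::Embedding));
            }
            PromptChunkSketch::Audio(payload) => {
                let rows = encode(ModalitySketch::Audio, payload.as_slice());
                out.extend(rows.into_iter().map(LmInput::Embedding));
            }
        }
    }
    out
}
```

Keeping the modality tag next to the payload lets the loop decide per chunk whether to pass ids through or call an encoder, which is the role the Modality description assigns to it.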
Traits§
- AudioEncoder
- Audio encoder for omni models. Mel features → hidden embeddings. Reuse rlx-whisper’s mel encoder where possible; this trait is the contract a family crate adapts to.
- ImagePreprocessor
- Image preprocessor. Implementations resize/letterbox/normalise per the family’s training pipeline (Qwen3-VL uses SigLIP norms, LFM2.5-VL uses its own, etc.).
- Projector
- Projector — maps vision-tower embeddings into the LM’s embedding space (so they slot in next to text token embeddings). Typically a 2-layer MLP with GeLU.
- VisionTower
- Vision tower: embeds patches into the model’s hidden dim. Output shape is [num_patches, hidden].
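To make the shape contract between the vision tower and the projector concrete, here is a toy sketch. The trait names `VisionTowerSketch` and `ProjectorSketch`, their method signatures, and the `Vec<Vec<f32>>` tensor representation are assumptions for illustration only; the crate's actual traits are not shown on this page and may look quite different. The projector follows the "2-layer MLP with GELU" description, using the tanh approximation of GELU.

```rust
/// Hypothetical trait shape: `[num_patches, patch_dim]` in,
/// `[num_patches, hidden]` out.
pub trait VisionTowerSketch {
    fn embed(&self, patches: &[Vec<f32>]) -> Vec<Vec<f32>>;
}

/// Hypothetical trait shape: `[num_patches, hidden]` in,
/// `[num_patches, lm_dim]` out.
pub trait ProjectorSketch {
    fn project(&self, vision: &[Vec<f32>]) -> Vec<Vec<f32>>;
}

/// tanh approximation of GELU.
pub fn gelu(x: f32) -> f32 {
    let c = (2.0_f32 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}

/// Dense layer; `w` is row-major `[out][in]`, `b` has `out` entries.
fn linear(w: &[Vec<f32>], b: &[f32], x: &[f32]) -> Vec<f32> {
    w.iter()
        .zip(b)
        .map(|(row, bias)| row.iter().zip(x).map(|(wi, xi)| wi * xi).sum::<f32>() + bias)
        .collect()
}

/// Toy tower: a single linear map per patch (a real tower is a transformer).
pub struct LinearTower {
    pub w: Vec<Vec<f32>>,
    pub b: Vec<f32>,
}

impl VisionTowerSketch for LinearTower {
    fn embed(&self, patches: &[Vec<f32>]) -> Vec<Vec<f32>> {
        patches.iter().map(|p| linear(&self.w, &self.b, p)).collect()
    }
}

/// Toy projector: linear, GELU, linear, applied row by row.
pub struct MlpProjector {
    pub w1: Vec<Vec<f32>>,
    pub b1: Vec<f32>,
    pub w2: Vec<Vec<f32>>,
    pub b2: Vec<f32>,
}

impl ProjectorSketch for MlpProjector {
    fn project(&self, vision: &[Vec<f32>]) -> Vec<Vec<f32>> {
        vision
            .iter()
            .map(|row| {
                let h: Vec<f32> = linear(&self.w1, &self.b1, row)
                    .into_iter()
                    .map(gelu)
                    .collect();
                linear(&self.w2, &self.b2, &h)
            })
            .collect()
    }
}

/// Helper that builds a constant-filled `[rows][cols]` weight matrix.
pub fn filled(rows: usize, cols: usize, v: f32) -> Vec<Vec<f32>> {
    vec![vec![v; cols]; rows]
}
```

The number of rows is preserved end to end, so each patch yields exactly one embedding in the LM's space, ready to sit alongside text token embeddings.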