Crate rlx_vlm_base

Shared base types for vision-language and omni runners (PLAN.md M7).

rlx-qwen3-vl, rlx-lfm-vl, and rlx-nemotron-omni all need the same plumbing: a per-image preprocessor (resize + patchify), a vision-tower trait, an MLP projector trait, and a multimodal turn interleaver that merges image, text, and (for omni models) audio chunks into a single LM token stream. This crate hosts those traits so the family crates stay thin.

Status: TYPE SKELETON. The traits and supporting structs are in place; implementations land alongside the per-family crates as M7 progresses.

Structs§

ImagePatches: One image as the preprocessor emits it after resize + patchify. Invariant: patches.len() == grid_h * grid_w * channels * patch_h * patch_w. The ordering of elements within that buffer depends on the family.
MultimodalPrompt: Turn-ordered list of (modality, payload) chunks. The runner consumes this and assembles the LM token stream: text chunks contribute token ids, while each non-text chunk is passed through the relevant encoder + projector and contributes image/audio embeddings at its position in the turn.
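
The length invariant on ImagePatches can be illustrated with a small sketch. Everything below is an assumption made for illustration: the field names are taken from the formula above, but the field types, the `patchify` helper, and the chosen element ordering (grid row, grid column, channel, row in patch, column in patch) are hypothetical and may not match the crate or any particular family.

```rust
/// Hypothetical stand-in for `ImagePatches`. Field names follow the length
/// formula in the docs; the types and the layout are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct ImagePatches {
    pub patches: Vec<f32>,
    pub grid_h: usize,
    pub grid_w: usize,
    pub channels: usize,
    pub patch_h: usize,
    pub patch_w: usize,
}

impl ImagePatches {
    /// Element count required by the documented invariant.
    pub fn expected_len(&self) -> usize {
        self.grid_h * self.grid_w * self.channels * self.patch_h * self.patch_w
    }

    /// True when `patches.len()` matches the invariant.
    pub fn is_consistent(&self) -> bool {
        self.patches.len() == self.expected_len()
    }
}

/// Splits a row-major HWC image into non-overlapping patches (hypothetical helper).
/// Assumed output order: [grid_row][grid_col][channel][row_in_patch][col_in_patch].
/// Returns `None` if the image size is not a multiple of the patch size or the
/// pixel buffer has the wrong length.
pub fn patchify(
    pixels: &[f32],
    height: usize,
    width: usize,
    channels: usize,
    patch_h: usize,
    patch_w: usize,
) -> Option<ImagePatches> {
    if patch_h == 0 || patch_w == 0 || height % patch_h != 0 || width % patch_w != 0 {
        return None;
    }
    if pixels.len() != height * width * channels {
        return None;
    }
    let (grid_h, grid_w) = (height / patch_h, width / patch_w);
    let mut patches = Vec::with_capacity(pixels.len());
    for gy in 0..grid_h {
        for gx in 0..grid_w {
            for c in 0..channels {
                for py in 0..patch_h {
                    for px in 0..patch_w {
                        let y = gy * patch_h + py;
                        let x = gx * patch_w + px;
                        patches.push(pixels[(y * width + x) * channels + c]);
                    }
                }
            }
        }
    }
    Some(ImagePatches { patches, grid_h, grid_w, channels, patch_h, patch_w })
}
```

Under these assumptions, a 4x4 single-channel image with 2x2 patches yields a 2x2 grid and a 16-element buffer, so the invariant holds by construction. A real preprocessor would resize or letterbox first so that the divisibility check cannot fail.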

Enums§

Modality: Modality tag for one chunk of a multimodal prompt. Lives next to the LM token stream so the runner knows when to invoke the vision tower / audio encoder instead of consuming raw token ids.
PromptChunk
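
To make the interleaving described for MultimodalPrompt and Modality concrete, here is a minimal sketch. The variants of `Modality` and `PromptChunk`, the `chunks` field, the `StreamItem` type, and the `assemble` function are all hypothetical; the crate's real definitions are not shown on this page and may differ. The encoder + projector step is abstracted as a closure that returns one embedding per image or audio position.

```rust
/// Hypothetical modality tag; the real enum's variants may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Modality {
    Text,
    Image,
    Audio,
}

/// Hypothetical chunk payloads; the real `PromptChunk` variants are assumptions here.
#[derive(Debug, Clone, PartialEq)]
pub enum PromptChunk {
    Text(Vec<u32>),  // token ids
    Image(Vec<f32>), // stands in for preprocessed patches
    Audio(Vec<f32>), // stands in for mel features
}

impl PromptChunk {
    pub fn modality(&self) -> Modality {
        match self {
            PromptChunk::Text(_) => Modality::Text,
            PromptChunk::Image(_) => Modality::Image,
            PromptChunk::Audio(_) => Modality::Audio,
        }
    }
}

/// Hypothetical prompt: chunks in turn order.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct MultimodalPrompt {
    pub chunks: Vec<PromptChunk>,
}

/// One position in the assembled LM input stream (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum StreamItem {
    Token(u32),
    Embedding(Vec<f32>),
}

/// Walks the chunks in order. Text contributes token ids; image and audio
/// chunks are passed to the given encoder closures, which return one
/// embedding vector per position.
pub fn assemble<F, G>(
    prompt: &MultimodalPrompt,
    mut embed_image: F,
    mut embed_audio: G,
) -> Vec<StreamItem>
where
    F: FnMut(&[f32]) -> Vec<Vec<f32>>,
    G: FnMut(&[f32]) -> Vec<Vec<f32>>,
{
    let mut stream = Vec::new();
    for chunk in &prompt.chunks {
        match chunk {
            PromptChunk::Text(ids) => {
                stream.extend(ids.iter().copied().map(StreamItem::Token))
            }
            PromptChunk::Image(data) => {
                stream.extend(embed_image(data.as_slice()).into_iter().map(StreamItem::Embedding))
            }
            PromptChunk::Audio(data) => {
                stream.extend(embed_audio(data.as_slice()).into_iter().map(StreamItem::Embedding))
            }
        }
    }
    stream
}
```

Keeping the modality tag derivable from the chunk itself means the runner can dispatch on it without a separate parallel list. Whether the real crate stores the tag separately or derives it this way is not stated in the source.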

Traits§

AudioEncoder: Audio encoder for omni models. Mel features → hidden embeddings. Reuse rlx-whisper’s mel encoder where possible — this trait is the contract a family crate adapts to.
ImagePreprocessor: Image preprocessor. Implementations resize/letterbox/normalise per the family’s training pipeline (Qwen3-VL uses SigLIP norms, LFM2.5-VL uses its own, etc.).
Projector: Maps vision-tower embeddings into the LM’s embedding space so they slot in next to text token embeddings. Typically a 2-layer MLP with a GELU activation.
VisionTower: Vision tower — embeds patches into the model’s hidden dim. Output shape is [num_patches, hidden].
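
The way a vision tower and a projector compose can be sketched as follows. The trait method names and signatures (`hidden_dim`, `embed`, `project`), the flat row-major `Vec<f32>` buffers, and the `Linear`, `MlpProjector`, `MeanTower`, and `encode_image` items are hypothetical and chosen only for illustration; the crate's real traits may look quite different. The projector follows the "2-layer MLP with GELU" description, using the common tanh approximation of GELU.

```rust
/// Hypothetical trait shapes; the real signatures in this crate may differ.
pub trait VisionTower {
    fn hidden_dim(&self) -> usize;
    /// Returns a row-major `[num_patches, hidden]` buffer.
    fn embed(&self, patches: &[f32], num_patches: usize) -> Vec<f32>;
}

pub trait Projector {
    /// Maps row-major `[num_patches, in_dim]` to `[num_patches, out_dim]`.
    fn project(&self, vision: &[f32], num_patches: usize) -> Vec<f32>;
}

/// GELU, tanh approximation.
pub fn gelu(x: f32) -> f32 {
    const SQRT_2_OVER_PI: f32 = 0.797_884_56;
    0.5 * x * (1.0 + (SQRT_2_OVER_PI * (x + 0.044_715 * x * x * x)).tanh())
}

/// Dense layer with row-major `[out_dim, in_dim]` weights; `in_dim` must be non-zero.
pub struct Linear {
    pub weight: Vec<f32>,
    pub bias: Vec<f32>,
    pub in_dim: usize,
    pub out_dim: usize,
}

impl Linear {
    pub fn forward(&self, x: &[f32]) -> Vec<f32> {
        (0..self.out_dim)
            .map(|o| {
                let row = &self.weight[o * self.in_dim..(o + 1) * self.in_dim];
                self.bias[o] + row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>()
            })
            .collect()
    }
}

/// Two-layer MLP: linear, GELU, linear, applied to each patch embedding.
pub struct MlpProjector {
    pub fc1: Linear,
    pub fc2: Linear,
}

impl Projector for MlpProjector {
    fn project(&self, vision: &[f32], num_patches: usize) -> Vec<f32> {
        let mut out = Vec::with_capacity(num_patches * self.fc2.out_dim);
        for row in vision.chunks(self.fc1.in_dim).take(num_patches) {
            let hidden: Vec<f32> = self.fc1.forward(row).into_iter().map(gelu).collect();
            out.extend(self.fc2.forward(&hidden));
        }
        out
    }
}

/// Toy tower for the example: broadcasts each patch's mean across `hidden` dims.
pub struct MeanTower {
    pub hidden: usize,
}

impl VisionTower for MeanTower {
    fn hidden_dim(&self) -> usize {
        self.hidden
    }

    fn embed(&self, patches: &[f32], num_patches: usize) -> Vec<f32> {
        let per_patch = (patches.len() / num_patches.max(1)).max(1);
        patches
            .chunks(per_patch)
            .take(num_patches)
            .flat_map(|p| {
                let mean = p.iter().sum::<f32>() / p.len() as f32;
                std::iter::repeat(mean).take(self.hidden)
            })
            .collect()
    }
}

/// Tower followed by projector: patches -> `[num_patches, hidden]` -> LM embedding space.
pub fn encode_image(
    tower: &dyn VisionTower,
    proj: &dyn Projector,
    patches: &[f32],
    num_patches: usize,
) -> Vec<f32> {
    let hidden = tower.embed(patches, num_patches);
    proj.project(&hidden, num_patches)
}
```

Splitting the tower and the projector into separate traits, as the listing above does, lets a family crate swap either piece independently; the toy `MeanTower` here exists only so the sketch can be exercised end to end.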