Multimodal turn assembly (PLAN.md M7).
Replaces llama-cpp-4’s MtmdContext end-to-end. The runner
receives a list of MtmdTurns — text + images + audio — and
produces an AssembledTurn the per-family VL/Omni runner
consumes via the [rlx_vlm_base] traits.
Status: TYPE SKELETON. The shape is in place so skill can
write code against MtmdContext::build_turn(..) today; the
actual image-loading / audio-resampling implementations land
alongside the per-family runners in M7.
Structs
- AssembledTurn - Result of assembling a turn list into something the per-family
  runner can feed into prefill. text_tokens is the chat-template output run
  through the tokenizer; image_refs / audio_refs retain order so the runner
  knows where to insert the embeddings.
- MtmdContext - Context for assembling multimodal turns. Holds the chat template
  and (eventually) the tokenizer; per-family runners hand the resulting
  AssembledTurn into their prefill path.
- MtmdTurn - One turn in a multimodal conversation. text is rendered through the
  same ChatTemplate as the text-only path; images / audio are interleaved into
  the LM stream by the per-family runner.
Enums
- MediaSource - Where one image / audio chunk lives.
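
To make the shapes listed above concrete, here is a minimal sketch of how the four types could fit together. It is not the crate's actual definition. The field names text_tokens, image_refs, audio_refs, text, images and audio come from the descriptions above; everything else is an assumption made for illustration: the MediaSource variants, the role field, the build_turn signature, the "<role>text" rendering and the byte-level stand-in tokenizer.

```rust
use std::path::PathBuf;

/// Where one image / audio chunk lives.
/// Both variants are hypothetical; the real enum may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum MediaSource {
    Path(PathBuf),
    Bytes(Vec<u8>),
}

/// One turn in a multimodal conversation.
#[derive(Debug, Clone, Default)]
pub struct MtmdTurn {
    pub role: String, // assumption: the source does not name a role field
    pub text: String,
    pub images: Vec<MediaSource>,
    pub audio: Vec<MediaSource>,
}

/// What the per-family runner would feed into prefill.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct AssembledTurn {
    pub text_tokens: Vec<u32>,
    pub image_refs: Vec<MediaSource>,
    pub audio_refs: Vec<MediaSource>,
}

/// Stand-in context. The real one holds the chat template and,
/// eventually, the tokenizer; this sketch hard-codes both.
#[derive(Debug, Default)]
pub struct MtmdContext;

impl MtmdContext {
    /// Hypothetical signature: renders every turn, tokenizes the result,
    /// and collects media references in the order they were supplied.
    pub fn build_turn(&self, turns: &[MtmdTurn]) -> AssembledTurn {
        let mut rendered = String::new();
        let mut out = AssembledTurn::default();
        for turn in turns {
            // Toy chat template: "<role>text\n" (an assumption).
            rendered.push_str(&format!("<{}>{}\n", turn.role, turn.text));
            // Order is preserved so a runner can tell where embeddings go.
            out.image_refs.extend(turn.images.iter().cloned());
            out.audio_refs.extend(turn.audio.iter().cloned());
        }
        // Toy tokenizer: one token per UTF-8 byte (an assumption).
        out.text_tokens = rendered.bytes().map(u32::from).collect();
        out
    }
}
```

In this sketch a caller would build a Vec of MtmdTurn values, call build_turn(&turns) on an MtmdContext, and pass the returned AssembledTurn to a runner. Keeping image_refs and audio_refs as flat, ordered lists mirrors the "retain order" wording above; how the real runner maps each reference to a position in the token stream is not specified here.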