Module mtmd

Multimodal turn assembly (PLAN.md M7).

Replaces llama-cpp-4’s MtmdContext end-to-end. The runner receives a list of MtmdTurns (text, images, and audio) and produces an AssembledTurn, which the per-family VL/Omni runner consumes via the rlx_vlm_base traits.

Status: TYPE SKELETON. The shape is in place so skill can write code against MtmdContext::build_turn(..) today; the actual image-loading / audio-resampling implementations land alongside the per-family runners in M7.

Structs

AssembledTurn
Result of assembling a turn list into something the per-family runner can feed into prefill. text_tokens is the chat-template output run through the tokenizer; image_refs / audio_refs retain order so the runner knows where to insert the embeddings.
MtmdContext
Context for assembling multimodal turns. Holds the chat template and (eventually) the tokenizer; per-family runners hand the resulting AssembledTurn into their prefill path.
MtmdTurn
One turn in a multimodal conversation. text is rendered through the same ChatTemplate as the text-only path; images / audio are interleaved into the LM stream by the per-family runner.

Enums

MediaSource
Where one image / audio chunk lives.
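
To make the relationship between these types concrete, here is a minimal sketch of how they might fit together. It is not the module's actual code. The field names text, images, audio, text_tokens, image_refs, and audio_refs come from the descriptions above. Everything else is an assumption made for illustration: the MediaSource variants, the String stand-in for ChatTemplate, the "{text}" placeholder, the byte-per-token placeholder tokenizer, and the signature of build_turn.

```rust
use std::path::PathBuf;

/// Where one image / audio chunk lives.
/// ASSUMPTION: both variants are hypothetical; the real enum may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum MediaSource {
    Path(PathBuf),
    Bytes(Vec<u8>),
}

/// One turn in a multimodal conversation.
#[derive(Debug, Clone, Default)]
pub struct MtmdTurn {
    pub text: String,
    pub images: Vec<MediaSource>,
    pub audio: Vec<MediaSource>,
}

/// Result of assembling a turn list for the per-family runner.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct AssembledTurn {
    pub text_tokens: Vec<u32>,
    pub image_refs: Vec<MediaSource>,
    pub audio_refs: Vec<MediaSource>,
}

/// Context for assembling multimodal turns.
pub struct MtmdContext {
    // ASSUMPTION: a plain string with a "{text}" placeholder stands in
    // for the real ChatTemplate type.
    chat_template: String,
}

impl MtmdContext {
    pub fn new(chat_template: impl Into<String>) -> Self {
        Self { chat_template: chat_template.into() }
    }

    /// ASSUMPTION: the real signature of `build_turn` is not documented
    /// here; taking a slice of turns and returning an `AssembledTurn`
    /// is a guess based on the module description.
    pub fn build_turn(&self, turns: &[MtmdTurn]) -> AssembledTurn {
        let mut rendered = String::new();
        let mut out = AssembledTurn::default();
        for turn in turns {
            // Render the text through the (stand-in) chat template.
            rendered.push_str(&self.chat_template.replace("{text}", &turn.text));
            // Keep media references in conversation order so the runner
            // knows where to insert the embeddings.
            out.image_refs.extend(turn.images.iter().cloned());
            out.audio_refs.extend(turn.audio.iter().cloned());
        }
        // Placeholder tokenizer: one "token" per UTF-8 byte.
        out.text_tokens = rendered.bytes().map(u32::from).collect();
        out
    }
}
```

Under these assumptions, a caller would construct an MtmdContext with a template such as "<user>{text}</user>", pass a slice of MtmdTurn values to build_turn, and hand the resulting AssembledTurn to the per-family runner's prefill path. Collecting media references in turn order is what lets the runner match each embedding to its position in the conversation.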