//! Model-family traits beyond `DecoderOnlyLLM`.
//!
//! Extension points only at this stage — no model in the current tree
//! implements them yet. Landing order in Phase D:
//!
//! - `MultimodalLLM`: Qwen-VL / LLaVA (ViT backbone + LLM decoder)
//! - `EncoderDecoderLM`: Whisper (encoder hidden + decoder loop)
//! - `EmbeddingModel`: BERT / E5 / multilingual-e5 (single forward → hidden)
//! - `Transcriber`: Whisper CLI-facing API
//! - `TtsModel`: Qwen3-TTS (talker + vocoder pipeline)
//!
//! Each trait is written so it composes with the existing
//! `DecoderOnlyLLM` where appropriate (Multimodal reuses decoder loop,
//! Transcriber wraps an EncoderDecoderLM + mel frontend, etc.).
use crate::DecoderOnlyLLM;
/// Opaque block of visual tokens produced by a vision encoder.
/// Exact shape depends on the model; commonly `[num_patches, hidden]`.
pub type VisualTokens = ;
/// Opaque block of audio tokens (mel spectrogram features or encoder output).
pub type AudioTokens = ;
/// Opaque sample buffer in bytes (image pixel data).
pub type ImageBuffer = Vec<u8>;
/// PCM audio buffer — f32 mono samples.
pub type PcmSamples = Vec<f32>;
/// One output segment from a transcriber (start/end seconds + text).
/// One synthesized audio chunk (stereo not supported yet).
/// Optional reference for voice-cloning-style TTS.
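// A minimal sketch of how the three records described above might look.
// The struct and field names (`SegmentSketch`, `AudioChunkSketch`,
// `VoiceReferenceSketch`, `sample_rate`, `transcript`) are assumptions made
// for illustration; they are not taken from this crate.

```rust
/// Hypothetical transcriber output: start/end in seconds plus text.
#[derive(Debug, Clone, PartialEq)]
pub struct SegmentSketch {
    pub start_s: f32,
    pub end_s: f32,
    pub text: String,
}

/// Hypothetical synthesized chunk: mono f32 samples at a known rate.
#[derive(Debug, Clone, PartialEq)]
pub struct AudioChunkSketch {
    pub samples: Vec<f32>,
    pub sample_rate: u32,
}

impl AudioChunkSketch {
    /// Chunk duration in seconds; zero when the sample rate is unset.
    pub fn duration_s(&self) -> f32 {
        if self.sample_rate == 0 {
            0.0
        } else {
            self.samples.len() as f32 / self.sample_rate as f32
        }
    }
}

/// Hypothetical voice-cloning reference: a short clip, optionally with
/// the text spoken in it.
#[derive(Debug, Clone, PartialEq)]
pub struct VoiceReferenceSketch {
    pub samples: Vec<f32>,
    pub sample_rate: u32,
    pub transcript: Option<String>,
}
```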
// ── Multimodal LLM ──────────────────────────────────────────────────────
//
// A multimodal LLM is a decoder-only LLM that additionally accepts visual
// and/or audio inputs (Qwen-VL, LLaVA, etc.). The image/audio encoders
// typically share the Backend trait but have dedicated model code
// (separate file per model family).
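// A rough sketch of the "decoder plus vision encoder" shape described
// above. `DecoderLoopSketch` stands in for `DecoderOnlyLLM`, whose real
// methods are not shown here; `MultimodalSketch`, `ToyVisionLm`, and
// `splice_visual_rows` are likewise hypothetical. The splice helper shows
// one common approach: visual rows replace an image placeholder position
// in the text embedding sequence before the usual decoder loop runs.

```rust
/// Stand-in for the decoder-only trait (assumed, not the crate's API).
pub trait DecoderLoopSketch {
    fn hidden_size(&self) -> usize;
}

/// Hypothetical multimodal extension: adds an image encoder on top.
pub trait MultimodalSketch: DecoderLoopSketch {
    /// Returns one row of `hidden_size()` floats per image patch.
    fn encode_image(&self, rgb: &[u8], width: usize, height: usize) -> Vec<Vec<f32>>;
}

/// Toy model: every patch row is filled with the image's mean brightness.
pub struct ToyVisionLm {
    pub hidden: usize,
    pub patch: usize,
}

impl DecoderLoopSketch for ToyVisionLm {
    fn hidden_size(&self) -> usize {
        self.hidden
    }
}

impl MultimodalSketch for ToyVisionLm {
    fn encode_image(&self, rgb: &[u8], width: usize, height: usize) -> Vec<Vec<f32>> {
        let mean = if rgb.is_empty() {
            0.0
        } else {
            rgb.iter().map(|&b| b as f32).sum::<f32>() / rgb.len() as f32 / 255.0
        };
        let patch = self.patch.max(1); // avoid dividing by zero
        let patches = (width / patch) * (height / patch);
        vec![vec![mean; self.hidden]; patches]
    }
}

/// Replace the row at `placeholder` with all visual rows, keeping order.
/// Returns `None` when the placeholder index is out of range.
pub fn splice_visual_rows(
    text_rows: &[Vec<f32>],
    visual_rows: &[Vec<f32>],
    placeholder: usize,
) -> Option<Vec<Vec<f32>>> {
    if placeholder >= text_rows.len() {
        return None;
    }
    let mut out = Vec::with_capacity(text_rows.len() - 1 + visual_rows.len());
    out.extend_from_slice(&text_rows[..placeholder]);
    out.extend_from_slice(visual_rows);
    out.extend_from_slice(&text_rows[placeholder + 1..]);
    Some(out)
}
```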
// ── Encoder + Decoder ────────────────────────────────────────────────────
//
// Encoder-decoder models (Whisper, T5, BART) keep encoder hidden state
// around for the duration of decode. The encoder state is opaque to the
// engine — each model defines its own.
/// Encoder-side state handed back from `encode()` and passed into
/// `decode_step()`. Opaque to the engine.
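// A minimal sketch of the encode-once, decode-many pattern described
// above, using an associated type so the encoder state stays opaque to the
// caller. `EncDecSketch`, `greedy_decode`, and `CountingModel` are
// hypothetical names; the real trait may take different arguments. For
// brevity each step here sees only the previous token, whereas a real
// decoder would also carry its own KV cache.

```rust
/// Hypothetical encoder-decoder trait with a model-defined state type.
pub trait EncDecSketch {
    type EncoderState;

    /// Run the encoder once over the input features.
    fn encode(&self, features: &[f32]) -> Self::EncoderState;

    /// Produce the next token given the encoder state and previous token.
    fn decode_step(&self, state: &Self::EncoderState, prev_token: u32) -> u32;
}

/// Generic driver: never inspects the state, only passes it back in.
pub fn greedy_decode<M: EncDecSketch>(
    model: &M,
    features: &[f32],
    bos: u32,
    eos: u32,
    max_tokens: usize,
) -> Vec<u32> {
    let state = model.encode(features);
    let mut out = Vec::new();
    let mut prev = bos;
    for _ in 0..max_tokens {
        let next = model.decode_step(&state, prev);
        if next == eos {
            break;
        }
        out.push(next);
        prev = next;
    }
    out
}

/// Toy model: counts upward until it has emitted one token per feature.
pub struct CountingModel;

impl EncDecSketch for CountingModel {
    type EncoderState = u32;

    fn encode(&self, features: &[f32]) -> u32 {
        features.len() as u32
    }

    fn decode_step(&self, state: &u32, prev_token: u32) -> u32 {
        if prev_token < *state {
            prev_token + 1
        } else {
            u32::MAX // treated as end-of-sequence by the caller below
        }
    }
}
```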
// ── Embedding Model ──────────────────────────────────────────────────────
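// A minimal sketch of the "single forward, then hidden" shape for
// embedding models. `EmbeddingSketch`, `forward_hidden`, `mean_pool_l2`,
// and `ToyEmbedder` are hypothetical names. Mean pooling followed by L2
// normalization is assumed here as one common choice; some models pool
// differently (for example by taking the first token's hidden state).

```rust
/// Average the rows, then scale the result to unit L2 norm.
/// Returns an empty vector when there are no rows.
pub fn mean_pool_l2(rows: &[Vec<f32>]) -> Vec<f32> {
    let dim = match rows.first() {
        Some(row) => row.len(),
        None => return Vec::new(),
    };
    let mut pooled = vec![0.0f32; dim];
    for row in rows {
        for (acc, &x) in pooled.iter_mut().zip(row.iter()) {
            *acc += x;
        }
    }
    let n = rows.len() as f32;
    for acc in pooled.iter_mut() {
        *acc /= n;
    }
    let norm = pooled.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for acc in pooled.iter_mut() {
            *acc /= norm;
        }
    }
    pooled
}

/// Hypothetical embedding trait: one forward pass, one pooled vector.
pub trait EmbeddingSketch {
    /// Per-token hidden states, one row per input token.
    fn forward_hidden(&self, token_ids: &[u32]) -> Vec<Vec<f32>>;

    /// Default pooling; implementors may override.
    fn embed(&self, token_ids: &[u32]) -> Vec<f32> {
        mean_pool_l2(&self.forward_hidden(token_ids))
    }
}

/// Toy model: each token's hidden row is `[id, 0]`.
pub struct ToyEmbedder;

impl EmbeddingSketch for ToyEmbedder {
    fn forward_hidden(&self, token_ids: &[u32]) -> Vec<Vec<f32>> {
        token_ids.iter().map(|&t| vec![t as f32, 0.0]).collect()
    }
}
```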
// ── Transcriber ──────────────────────────────────────────────────────────
//
// Higher-level audio-to-text API. Wraps an internal encoder-decoder model
// plus mel-spectrogram frontend + sampler; CLI only sees this trait.
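// A minimal sketch of the CLI-facing surface described above: PCM in,
// timed text segments out. `TranscriberSketch`, `TimedSegment`, and
// `WindowedToy` are hypothetical names. The toy implementation only splits
// the input into fixed windows and reports timestamps; a real one would run
// the mel frontend, encoder, and decoder for each window.

```rust
/// Hypothetical output record: start/end in seconds plus text.
#[derive(Debug, Clone, PartialEq)]
pub struct TimedSegment {
    pub start_s: f32,
    pub end_s: f32,
    pub text: String,
}

/// Hypothetical high-level trait; hides the model behind one call.
pub trait TranscriberSketch {
    fn transcribe(&self, pcm: &[f32], sample_rate: u32) -> Vec<TimedSegment>;
}

/// Toy transcriber: one placeholder segment per fixed-length window.
pub struct WindowedToy {
    pub window_s: u32,
}

impl TranscriberSketch for WindowedToy {
    fn transcribe(&self, pcm: &[f32], sample_rate: u32) -> Vec<TimedSegment> {
        // Window length in samples; at least 1 so `chunks` never panics.
        let window = (self.window_s as usize * sample_rate as usize).max(1);
        let rate = sample_rate.max(1) as f32;
        pcm.chunks(window)
            .enumerate()
            .map(|(i, chunk)| {
                let start = i * window;
                TimedSegment {
                    start_s: start as f32 / rate,
                    end_s: (start + chunk.len()) as f32 / rate,
                    text: format!("<{} samples>", chunk.len()),
                }
            })
            .collect()
    }
}
```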
// ── TTS Model ────────────────────────────────────────────────────────────
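// A minimal sketch of a two-stage talker + vocoder pipeline, assuming the
// talker emits discrete frames and the vocoder turns frames into mono f32
// samples. `TtsSketch`, `talk`, `vocode`, and `ToyTts` are hypothetical
// names, and the toy stages below only illustrate the data flow.

```rust
/// Hypothetical TTS trait: text -> frames -> PCM samples.
pub trait TtsSketch {
    /// Stage 1 (talker): text to a sequence of discrete frames.
    fn talk(&self, text: &str) -> Vec<u16>;

    /// Stage 2 (vocoder): frames to mono f32 samples.
    fn vocode(&self, frames: &[u16]) -> Vec<f32>;

    /// Full pipeline; the default simply chains the two stages.
    fn synthesize(&self, text: &str) -> Vec<f32> {
        let frames = self.talk(text);
        self.vocode(&frames)
    }
}

/// Toy pipeline: one frame per input byte, a constant level per frame.
pub struct ToyTts {
    pub samples_per_frame: usize,
}

impl TtsSketch for ToyTts {
    fn talk(&self, text: &str) -> Vec<u16> {
        text.bytes().map(u16::from).collect()
    }

    fn vocode(&self, frames: &[u16]) -> Vec<f32> {
        frames
            .iter()
            .flat_map(|&frame| {
                // Byte-derived frames are at most 255, so level is in [0, 1].
                let level = frame as f32 / 255.0;
                std::iter::repeat(level).take(self.samples_per_frame)
            })
            .collect()
    }
}
```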