//! `DecoderOnlyLLM` trait — the "model family" interface that every
//! decoder-only language model (Qwen3 / Llama / Mistral / DeepSeek / ...)
//! implements, independent of backend and weight format.
//!
//! `LlmExecutor` (living in `ferrum-engine`) holds a `Box<dyn DecoderOnlyLLM>`
//! and adapts it to the `ModelExecutor` trait that the scheduler calls.
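//!
//! # Example
//!
//! A minimal sketch of that layering, under stated assumptions: `SketchLlm`,
//! `SketchExecutor`, `Request`, and the `step` method are hypothetical
//! stand-ins invented for illustration. They are not the real
//! `DecoderOnlyLLM`, `ModelExecutor`, or `ferrum-engine` API, none of whose
//! signatures appear in this file.
//!
//! ```rust
//! // Hypothetical stand-in for the model-family trait (signatures assumed).
//! trait SketchLlm {
//!     fn prefill(&mut self, cache_id: u64, tokens: &[u32]) -> Vec<f32>;
//!     fn decode(&mut self, cache_id: u64, token: u32, pos: usize) -> Vec<f32>;
//!     fn release(&mut self, cache_id: u64);
//! }
//!
//! // Hypothetical scheduler-facing request type.
//! enum Request<'a> {
//!     Prefill { cache_id: u64, tokens: &'a [u32] },
//!     Decode { cache_id: u64, token: u32, pos: usize },
//!     Finish { cache_id: u64 },
//! }
//!
//! // Hypothetical stand-in for the trait the scheduler calls.
//! trait SketchExecutor {
//!     fn step(&mut self, req: Request<'_>) -> Option<Vec<f32>>;
//! }
//!
//! // The adapter owns a boxed trait object, so it works with any model family.
//! struct LlmExecutorSketch {
//!     model: Box<dyn SketchLlm>,
//! }
//!
//! impl SketchExecutor for LlmExecutorSketch {
//!     fn step(&mut self, req: Request<'_>) -> Option<Vec<f32>> {
//!         match req {
//!             Request::Prefill { cache_id, tokens } => {
//!                 Some(self.model.prefill(cache_id, tokens))
//!             }
//!             Request::Decode { cache_id, token, pos } => {
//!                 Some(self.model.decode(cache_id, token, pos))
//!             }
//!             Request::Finish { cache_id } => {
//!                 self.model.release(cache_id);
//!                 None
//!             }
//!         }
//!     }
//! }
//!
//! // Trivial model used only to exercise the adapter: prefill reports the
//! // prompt length and decode reports the position, each as a single logit.
//! struct CountingModel {
//!     live: usize,
//! }
//!
//! impl SketchLlm for CountingModel {
//!     fn prefill(&mut self, _cache_id: u64, tokens: &[u32]) -> Vec<f32> {
//!         self.live += 1;
//!         vec![tokens.len() as f32]
//!     }
//!     fn decode(&mut self, _cache_id: u64, _token: u32, pos: usize) -> Vec<f32> {
//!         vec![pos as f32]
//!     }
//!     fn release(&mut self, _cache_id: u64) {
//!         self.live -= 1;
//!     }
//! }
//! ```
//!
//! The point of the indirection is that the scheduler-facing side never names
//! a concrete model type; swapping model families only changes what is boxed.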
/// Runtime configuration every decoder-only LLM must expose.
///
/// This is the *execution-facing* config — the bare minimum the surrounding
/// engine needs (KV cache sizing, sampler vocab bounds, scheduler quotas).
/// It deliberately does not include architecture details like `num_heads`
/// or `intermediate_size`; those stay private to the model implementation.

/// A decoder-only language model.
///
/// Contract:
/// - `prefill` processes a batch of prompt tokens, initializes whatever KV
///   cache the model maintains internally (keyed by `cache_id`), and returns
///   logits for the *last* token.
/// - `decode` processes a single generated token at position `pos` and
///   returns logits for the next step.
/// - `release` frees the KV cache for a completed sequence.
///
/// Today the model owns its KV cache. Integration with `ferrum-kv`'s paged
/// KV manager is a Phase D concern; the trait is kept minimal so it can
/// evolve then without a full refactor.
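///
/// # Example
///
/// A minimal sketch of the call sequence the contract above implies. The
/// method signatures, the `u64` cache id, the `generate` helper, and the toy
/// `EchoModel` are assumptions made for illustration; the real trait may
/// differ.
///
/// ```rust
/// use std::collections::HashMap;
///
/// // Hypothetical stand-in for the trait; signatures are assumed.
/// trait SketchLlm {
///     fn prefill(&mut self, cache_id: u64, tokens: &[u32]) -> Vec<f32>;
///     fn decode(&mut self, cache_id: u64, token: u32, pos: usize) -> Vec<f32>;
///     fn release(&mut self, cache_id: u64);
/// }
///
/// const VOCAB: usize = 8;
///
/// // Toy model: its "KV cache" is the list of tokens seen so far, and its
/// // logits always favor (last token + 1) mod VOCAB.
/// #[derive(Default)]
/// struct EchoModel {
///     caches: HashMap<u64, Vec<u32>>,
/// }
///
/// impl EchoModel {
///     fn logits_after(last: u32) -> Vec<f32> {
///         let mut logits = vec![0.0; VOCAB];
///         logits[(last as usize + 1) % VOCAB] = 1.0;
///         logits
///     }
/// }
///
/// impl SketchLlm for EchoModel {
///     fn prefill(&mut self, cache_id: u64, tokens: &[u32]) -> Vec<f32> {
///         self.caches.insert(cache_id, tokens.to_vec());
///         Self::logits_after(*tokens.last().expect("non-empty prompt"))
///     }
///     fn decode(&mut self, cache_id: u64, token: u32, pos: usize) -> Vec<f32> {
///         let cache = self.caches.get_mut(&cache_id).expect("prefill first");
///         assert_eq!(pos, cache.len(), "pos must be the next cache slot");
///         cache.push(token);
///         Self::logits_after(token)
///     }
///     fn release(&mut self, cache_id: u64) {
///         self.caches.remove(&cache_id);
///     }
/// }
///
/// // Index of the largest logit (first one wins on ties).
/// fn argmax(logits: &[f32]) -> u32 {
///     let mut best = 0;
///     for (i, &v) in logits.iter().enumerate() {
///         if v > logits[best] {
///             best = i;
///         }
///     }
///     best as u32
/// }
///
/// // Greedy generation: one prefill, then one decode per extra token, then
/// // release so the model can drop the sequence's cache.
/// fn generate(
///     model: &mut dyn SketchLlm,
///     cache_id: u64,
///     prompt: &[u32],
///     max_new: usize,
/// ) -> Vec<u32> {
///     let mut out = Vec::with_capacity(max_new);
///     let mut logits = model.prefill(cache_id, prompt);
///     for step in 0..max_new {
///         let token = argmax(&logits);
///         out.push(token);
///         if step + 1 == max_new {
///             break;
///         }
///         logits = model.decode(cache_id, token, prompt.len() + step);
///     }
///     model.release(cache_id);
///     out
/// }
/// ```
///
/// With the toy model, `generate(&mut EchoModel::default(), 7, &[1, 2, 3], 3)`
/// yields `[4, 5, 6]` and leaves no cache entry behind.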