pub trait DecoderOnlyLLM: Send + Sync {
    // Required methods
    fn config(&self) -> &LlmRuntimeConfig;
    fn prefill(&mut self, cache_id: &str, tokens: &[u32]) -> Vec<f32>;
    fn decode(&mut self, cache_id: &str, token: u32, pos: u32) -> Vec<f32>;
    fn release(&mut self, cache_id: &str);

    // Provided methods
    fn decode_batch(&mut self, batch: &[(String, u32, u32)]) -> Vec<Vec<f32>> { ... }
    fn reset(&mut self) { ... }
}
A decoder-only language model.
Contract:
- prefill processes a batch of prompt tokens and returns logits for the last token, along with initializing whatever KV cache the model maintains internally (keyed by cache_id).
- decode processes a single generated token at position pos and returns logits for the next step.
- release frees the KV cache for a completed sequence.
Today the model owns its KV cache. Integration with ferrum-kv’s paged
KV manager is a Phase D concern; the trait is kept minimal so it can
evolve then without a full refactor.
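
The contract above can be sketched as a greedy generation loop: prefill once, decode one token at a time, then release. This is a minimal illustration, not the crate's own code. ToyModel, generate_greedy, argmax, and live_caches are hypothetical names; the vocab_size field on LlmRuntimeConfig is an assumption, since the real fields are not shown here; and the trait is re-declared locally with only its required methods so the block compiles on its own. The sketch also assumes that pos is the 0-based absolute index of the token being fed to decode.

```rust
use std::collections::HashMap;

/// Stand-in for the real LlmRuntimeConfig; the `vocab_size` field is an assumption.
pub struct LlmRuntimeConfig {
    pub vocab_size: usize,
}

/// Local copy of the required methods so the sketch is self-contained.
pub trait DecoderOnlyLLM: Send + Sync {
    fn config(&self) -> &LlmRuntimeConfig;
    fn prefill(&mut self, cache_id: &str, tokens: &[u32]) -> Vec<f32>;
    fn decode(&mut self, cache_id: &str, token: u32, pos: u32) -> Vec<f32>;
    fn release(&mut self, cache_id: &str);
}

/// Hypothetical toy model: always predicts (last token + 1) mod vocab_size.
/// Its "KV cache" is just the sequence length stored per cache_id.
pub struct ToyModel {
    cfg: LlmRuntimeConfig,
    cache: HashMap<String, u32>,
}

impl ToyModel {
    pub fn new(vocab_size: usize) -> Self {
        ToyModel { cfg: LlmRuntimeConfig { vocab_size }, cache: HashMap::new() }
    }

    /// Number of sequences that still hold cache state.
    pub fn live_caches(&self) -> usize {
        self.cache.len()
    }

    fn one_hot_next(&self, token: u32) -> Vec<f32> {
        let mut logits = vec![0.0; self.cfg.vocab_size];
        logits[(token as usize + 1) % self.cfg.vocab_size] = 1.0;
        logits
    }
}

impl DecoderOnlyLLM for ToyModel {
    fn config(&self) -> &LlmRuntimeConfig {
        &self.cfg
    }

    fn prefill(&mut self, cache_id: &str, tokens: &[u32]) -> Vec<f32> {
        // Initialise the per-sequence cache, then return logits for the last prompt token.
        self.cache.insert(cache_id.to_string(), tokens.len() as u32);
        self.one_hot_next(*tokens.last().expect("prompt must not be empty"))
    }

    fn decode(&mut self, cache_id: &str, token: u32, pos: u32) -> Vec<f32> {
        let len = self.cache.get_mut(cache_id).expect("unknown cache_id");
        *len = pos + 1; // the cache now covers positions 0..=pos
        self.one_hot_next(token)
    }

    fn release(&mut self, cache_id: &str) {
        self.cache.remove(cache_id);
    }
}

/// Index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Greedy decoding driven purely through the trait: prefill, decode, release.
pub fn generate_greedy(
    model: &mut dyn DecoderOnlyLLM,
    cache_id: &str,
    prompt: &[u32],
    max_new: usize,
) -> Vec<u32> {
    let mut out = Vec::with_capacity(max_new);
    let mut logits = model.prefill(cache_id, prompt);
    let mut pos = prompt.len() as u32; // ASSUMPTION: first generated token sits at index prompt.len()
    for step in 0..max_new {
        let next = argmax(&logits);
        out.push(next);
        if step + 1 < max_new {
            logits = model.decode(cache_id, next, pos);
            pos += 1;
        }
    }
    model.release(cache_id); // free the cache once the sequence is complete
    out
}
```

Calling generate_greedy with a ToyModel, a cache_id such as "req-0", and a short prompt exercises all three contract methods in order. Because the caller only sees the trait, the same loop would keep working if cache ownership later moves out of the model.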
Required Methods
fn config(&self) -> &LlmRuntimeConfig
Runtime-facing configuration.
fn prefill(&mut self, cache_id: &str, tokens: &[u32]) -> Vec<f32>
Prefill the model with a prompt. Returns [vocab_size] logits for
the last prompt token.
Provided Methods
fn decode_batch(&mut self, batch: &[(String, u32, u32)]) -> Vec<Vec<f32>>
Decode multiple concurrent requests in a single forward pass.
Each entry is (cache_id, token, pos), the per-request state. Returns
one [vocab_size] logits vec per request, in the same order as the input.
The default implementation calls decode once per entry, sequentially.
Backends that implement true batched decode (one GEMM with m=batch,
per-item attention loop) should override it for a concurrency speedup.
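
The sequential fallback described here could look roughly like the following. This is a sketch of the behaviour, not the crate's actual source: BatchDecode is a reduced, hypothetical stand-in for the trait that keeps only decode and decode_batch, and EchoBackend is a made-up backend whose "logits" are a two-element vector echoing its inputs, so the ordering is easy to see.

```rust
/// Reduced, hypothetical stand-in for the trait: only the two decode methods.
pub trait BatchDecode {
    fn decode(&mut self, cache_id: &str, token: u32, pos: u32) -> Vec<f32>;

    /// Sequential fallback: one `decode` call per entry, results in input order.
    fn decode_batch(&mut self, batch: &[(String, u32, u32)]) -> Vec<Vec<f32>> {
        batch
            .iter()
            .map(|(cache_id, token, pos)| self.decode(cache_id, *token, *pos))
            .collect()
    }
}

/// Made-up backend that echoes (token, pos) in place of real logits
/// and counts how many times `decode` was invoked.
pub struct EchoBackend {
    pub calls: usize,
}

impl BatchDecode for EchoBackend {
    fn decode(&mut self, _cache_id: &str, token: u32, pos: u32) -> Vec<f32> {
        self.calls += 1;
        vec![token as f32, pos as f32]
    }
}
```

A backend with a real batched kernel would override decode_batch and issue a single forward pass instead. Callers would not need to change, since the signature and the ordering guarantee stay the same.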