Trait DecoderOnlyLLM 

pub trait DecoderOnlyLLM: Send + Sync {
    // Required methods
    fn config(&self) -> &LlmRuntimeConfig;
    fn prefill(&mut self, cache_id: &str, tokens: &[u32]) -> Vec<f32>;
    fn decode(&mut self, cache_id: &str, token: u32, pos: u32) -> Vec<f32>;
    fn release(&mut self, cache_id: &str);

    // Provided methods
    fn decode_batch(&mut self, batch: &[(String, u32, u32)]) -> Vec<Vec<f32>> { ... }
    fn reset(&mut self) { ... }
}

A decoder-only language model.

Contract:

  • prefill processes the prompt tokens for one sequence, initializes whatever KV cache the model maintains internally (keyed by cache_id), and returns logits for the last prompt token.
  • decode processes a single generated token at position pos and returns logits for the next step.
  • release frees the KV cache for a completed sequence.

Today the model owns its KV cache. Integration with ferrum-kv’s paged KV manager is a Phase D concern; the trait is kept minimal so it can evolve then without a full refactor.
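The contract above can be sketched as a greedy generation loop. This is only an illustration under stated assumptions: the `LlmRuntimeConfig` struct below is a hypothetical stand-in (its real fields are not shown on this page), `ToyModel`, `argmax`, and `generate_greedy` are invented for the example, and the trait is a trimmed copy containing only the required methods.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the crate's `LlmRuntimeConfig` (real fields unknown).
pub struct LlmRuntimeConfig {
    pub vocab_size: usize,
}

/// Trimmed copy of the trait: required methods only.
pub trait DecoderOnlyLLM: Send + Sync {
    fn config(&self) -> &LlmRuntimeConfig;
    fn prefill(&mut self, cache_id: &str, tokens: &[u32]) -> Vec<f32>;
    fn decode(&mut self, cache_id: &str, token: u32, pos: u32) -> Vec<f32>;
    fn release(&mut self, cache_id: &str);
}

/// Toy model (illustrative): always predicts `(last_token + 1) % vocab_size`.
pub struct ToyModel {
    cfg: LlmRuntimeConfig,
    lens: HashMap<String, u32>, // cache_id -> tokens consumed so far
}

impl ToyModel {
    pub fn new(vocab_size: usize) -> Self {
        Self { cfg: LlmRuntimeConfig { vocab_size }, lens: HashMap::new() }
    }
    pub fn live_caches(&self) -> usize {
        self.lens.len()
    }
    fn one_hot_next(&self, token: u32) -> Vec<f32> {
        let mut logits = vec![0.0; self.cfg.vocab_size];
        logits[(token as usize + 1) % self.cfg.vocab_size] = 1.0;
        logits
    }
}

impl DecoderOnlyLLM for ToyModel {
    fn config(&self) -> &LlmRuntimeConfig {
        &self.cfg
    }
    fn prefill(&mut self, cache_id: &str, tokens: &[u32]) -> Vec<f32> {
        self.lens.insert(cache_id.to_string(), tokens.len() as u32);
        self.one_hot_next(*tokens.last().expect("non-empty prompt"))
    }
    fn decode(&mut self, cache_id: &str, token: u32, pos: u32) -> Vec<f32> {
        let len = self.lens.get_mut(cache_id).expect("unknown cache_id");
        // `pos` is the number of tokens already consumed for this sequence.
        assert_eq!(*len, pos, "pos must equal tokens already consumed");
        *len += 1;
        self.one_hot_next(token)
    }
    fn release(&mut self, cache_id: &str) {
        self.lens.remove(cache_id);
    }
}

/// Index of the largest logit (greedy sampling).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// prefill once, decode one token at a time, then release the cache.
pub fn generate_greedy(
    model: &mut dyn DecoderOnlyLLM,
    cache_id: &str,
    prompt: &[u32],
    max_new: usize,
) -> Vec<u32> {
    let mut out = Vec::with_capacity(max_new);
    if max_new == 0 || prompt.is_empty() {
        return out;
    }
    let mut logits = model.prefill(cache_id, prompt);
    let mut pos = prompt.len() as u32; // position of the first generated token
    loop {
        let next = argmax(&logits);
        out.push(next);
        if out.len() == max_new {
            break;
        }
        logits = model.decode(cache_id, next, pos);
        pos += 1;
    }
    model.release(cache_id);
    out
}
```

In this sketch the caller tracks `pos` itself, starting at the prompt length, and calls `release` exactly once when the sequence is finished. A real caller would likely also stop on an end-of-sequence token, which is omitted here.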

Required Methods

fn config(&self) -> &LlmRuntimeConfig

Runtime-facing configuration.

fn prefill(&mut self, cache_id: &str, tokens: &[u32]) -> Vec<f32>

Prefill the model with a prompt. Returns [vocab_size] logits for the last prompt token.

fn decode(&mut self, cache_id: &str, token: u32, pos: u32) -> Vec<f32>

Advance the model by one generated token. pos is the zero-based position of token in the sequence, which equals the number of tokens already consumed. Returns [vocab_size] logits for the next step.

fn release(&mut self, cache_id: &str)

Release the KV cache for a completed sequence.

Provided Methods

fn decode_batch(&mut self, batch: &[(String, u32, u32)]) -> Vec<Vec<f32>>

Decode multiple concurrent requests in a single forward pass.

Each entry is (cache_id, token, pos), the per-request state. Returns one [vocab_size] logits vec per request, in the same order as batch.

The default implementation calls decode sequentially for each entry. Backends that implement true batched decode (one GEMM with m=batch, per-item attention loop) can override it for a concurrency speedup.
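A sequential default of the kind described here might look like the following. This is a sketch, not the crate's actual source: the trait is a trimmed copy with only `decode` and `decode_batch`, and `EchoModel` is a hypothetical stub used to show that output order matches input order.

```rust
/// Trimmed copy of the trait: only the methods needed for this illustration.
pub trait DecoderOnlyLLM: Send + Sync {
    fn decode(&mut self, cache_id: &str, token: u32, pos: u32) -> Vec<f32>;

    /// Sequential fallback: one `decode` call per entry, results in input order.
    fn decode_batch(&mut self, batch: &[(String, u32, u32)]) -> Vec<Vec<f32>> {
        batch
            .iter()
            .map(|(cache_id, token, pos)| self.decode(cache_id, *token, *pos))
            .collect()
    }
}

/// Hypothetical stub that records each call and echoes its arguments as "logits".
pub struct EchoModel {
    pub calls: Vec<(String, u32, u32)>,
}

impl DecoderOnlyLLM for EchoModel {
    fn decode(&mut self, cache_id: &str, token: u32, pos: u32) -> Vec<f32> {
        self.calls.push((cache_id.to_string(), token, pos));
        vec![token as f32, pos as f32]
    }
}
```

A backend with a real batched kernel would override `decode_batch` while keeping the same ordering guarantee, so callers do not need to know which path is in use.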

fn reset(&mut self)

Drop all cached state (useful for tests and hot-reload).
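One way a model that owns its KV cache could back release and reset is a map keyed by cache_id. The types below (`SeqCache`, `CacheTable`) are hypothetical and stand in for whatever state a real backend keeps; the sketch only illustrates the difference between dropping one sequence and dropping everything.

```rust
use std::collections::HashMap;

/// Hypothetical per-sequence entry; a real backend would hold key/value tensors.
pub struct SeqCache {
    pub len: u32, // tokens cached so far
}

/// Hypothetical model-owned cache table, keyed by `cache_id`.
#[derive(Default)]
pub struct CacheTable {
    seqs: HashMap<String, SeqCache>,
}

impl CacheTable {
    /// Called from `prefill`: create (or replace) the entry for a sequence.
    pub fn start(&mut self, cache_id: &str, prompt_len: u32) {
        self.seqs.insert(cache_id.to_string(), SeqCache { len: prompt_len });
    }

    /// Tokens cached for a sequence, if it is still live.
    pub fn cached_len(&self, cache_id: &str) -> Option<u32> {
        self.seqs.get(cache_id).map(|s| s.len)
    }

    /// Backs `release`: drop one sequence; unknown ids are ignored.
    pub fn release(&mut self, cache_id: &str) {
        self.seqs.remove(cache_id);
    }

    /// Backs `reset`: drop every sequence at once.
    pub fn reset(&mut self) {
        self.seqs.clear();
    }

    /// Number of sequences that still hold cache state.
    pub fn live(&self) -> usize {
        self.seqs.len()
    }
}
```

Treating release of an unknown id as a no-op is a design choice made for this sketch; the trait documentation does not say how implementors should handle that case.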

Implementors