pub struct PrefixCachedEngine<'a> {
pub inner: InferenceEngine<'a>,
pub prefix_cache: PrefixAwarePrefill,
}
An InferenceEngine augmented with prefix KV-cache reuse.
On each generate call the engine:
- Resets the model’s KV cache (single-engine, sequential request model).
- Looks up the longest cached prefix in the trie.
- Injects the matched KV blocks back into the model’s CPU cache.
- Runs prefill only on the uncached suffix at the correct pos_start.
- Extracts any newly produced full blocks of KV state and stores them in the trie for subsequent requests (skipped on GPU tiers where the CPU cache stays empty).
- Sample-decodes new tokens up to params.max_tokens or EOS.
- Releases the session (decrements ref counts) when done.
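The trie lookup in the steps above can be sketched as follows. This is an illustrative, self-contained model only: `BlockTrie` and its methods are hypothetical stand-ins for `PrefixAwarePrefill`'s internals (a real node would also hold a KV block handle, omitted here), but it shows the key property that only complete BLOCK_SIZE-token blocks are matched, which determines `pos_start` for prefill.

```rust
use std::collections::HashMap;

const BLOCK_SIZE: usize = 16; // tokens per KV block, per the docs below

// Hypothetical trie keyed by full token blocks (illustration only).
#[derive(Default)]
struct BlockTrie {
    children: HashMap<[u32; BLOCK_SIZE], BlockTrie>,
}

impl BlockTrie {
    // Record every complete block of a prompt; a trailing partial
    // block is never cached.
    fn insert(&mut self, tokens: &[u32]) {
        let mut node = self;
        for chunk in tokens.chunks_exact(BLOCK_SIZE) {
            let key: [u32; BLOCK_SIZE] = chunk.try_into().unwrap();
            node = node.children.entry(key).or_default();
        }
    }

    // Length (in tokens) of the longest cached prefix, rounded down
    // to whole blocks: only complete blocks are ever reused.
    fn longest_prefix(&self, tokens: &[u32]) -> usize {
        let mut node = self;
        let mut matched = 0;
        for chunk in tokens.chunks_exact(BLOCK_SIZE) {
            let key: [u32; BLOCK_SIZE] = chunk.try_into().unwrap();
            match node.children.get(&key) {
                Some(next) => {
                    node = next;
                    matched += BLOCK_SIZE;
                }
                None => break,
            }
        }
        matched
    }
}

fn main() {
    let mut trie = BlockTrie::default();
    let prompt_a: Vec<u32> = (0..48).collect(); // 3 full blocks
    trie.insert(&prompt_a);

    // A new prompt sharing the first 32 tokens reuses two blocks;
    // prefill then starts at pos_start = 32.
    let mut prompt_b: Vec<u32> = (0..32).collect();
    prompt_b.extend([999u32; 20]);
    println!("pos_start = {}", trie.longest_prefix(&prompt_b)); // prints 32
}
```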
Fields

inner: InferenceEngine<'a>
The underlying inference engine.

prefix_cache: PrefixAwarePrefill
Prefix-cache-aware prefill helper with the block trie.
Implementations

impl<'a> PrefixCachedEngine<'a>

pub fn new(engine: InferenceEngine<'a>, max_cache_blocks: usize) -> Self
Wrap an existing InferenceEngine with a prefix cache.
Derives num_layers, num_kv_heads, and head_dim directly from
the engine’s model configuration, so no manual wiring is required.
Parameters

- engine: the inference engine to wrap.
- max_cache_blocks: maximum number of simultaneously live cache blocks. Each block holds BLOCK_SIZE (16) tokens of KV data for every layer; memory per block is approximately 2 × num_layers × num_kv_heads × head_dim × 16 × 4 bytes.
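The per-block formula above makes it easy to size max_cache_blocks against a memory budget. A minimal sketch, using an assumed 7B-class model shape (32 layers, 32 KV heads, head_dim 128) and an assumed 2 GiB budget; neither number comes from this crate:

```rust
const BLOCK_SIZE: usize = 16; // tokens per KV block

// 2 (K and V) × num_layers × num_kv_heads × head_dim × BLOCK_SIZE
// × 4 bytes (f32), exactly the formula given in the docs.
fn bytes_per_block(num_layers: usize, num_kv_heads: usize, head_dim: usize) -> usize {
    2 * num_layers * num_kv_heads * head_dim * BLOCK_SIZE * 4
}

fn main() {
    // Hypothetical 7B-class configuration, for illustration only.
    let per_block = bytes_per_block(32, 32, 128);
    let budget_bytes = 2usize << 30; // assumed 2 GiB cache budget
    let max_cache_blocks = budget_bytes / per_block;
    println!("{} bytes/block, {} blocks fit", per_block, max_cache_blocks);
    // prints: 16777216 bytes/block, 128 blocks fit
}
```

So at 16 MiB per block for this shape, each cached block covers 16 tokens of every layer's K and V state; a 2 GiB budget caches at most 128 × 16 = 2048 prompt tokens.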
pub fn generate(
    &mut self,
    prompt_tokens: &[u32],
    params: &SamplingParams,
) -> Vec<u32>
Generate tokens from prompt_tokens, reusing any cached prefix.
Returns the generated token IDs (not including the prompt). On any
internal error the method logs via tracing::warn and returns an
empty vector — generate itself is infallible from the caller’s
perspective so it can be dropped into batch pipelines.
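The infallible contract described above can be sketched as an error-swallowing wrapper. `try_generate` and the logging call are stand-ins for the engine's real internals (which log via tracing::warn); only the shape of the pattern is being illustrated:

```rust
// Hypothetical fallible inner step (illustration only).
fn try_generate(prompt: &[u32]) -> Result<Vec<u32>, String> {
    if prompt.is_empty() {
        return Err("empty prompt".into());
    }
    Ok(vec![prompt[0] + 1]) // dummy "generation"
}

// Infallible outer API: internal errors are logged and mapped to an
// empty Vec, so batch pipelines never need a Result path.
fn generate(prompt: &[u32]) -> Vec<u32> {
    match try_generate(prompt) {
        Ok(tokens) => tokens,
        Err(e) => {
            // the real engine uses tracing::warn! here
            eprintln!("generate failed: {e}");
            Vec::new()
        }
    }
}

fn main() {
    assert_eq!(generate(&[7]), vec![8]);
    assert!(generate(&[]).is_empty()); // error swallowed, empty output
    println!("ok");
}
```

The trade-off of this design is that callers must treat an empty return as "possibly failed" rather than "generated nothing"; the log stream is the only place the distinction survives.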
pub fn cache_stats(&self) -> PrefixCacheStats
Return a snapshot of the current prefix-cache statistics.
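A sketch of how such a snapshot might be consumed. The fields shown here are hypothetical; the real `PrefixCacheStats` may expose different counters. The point is that a hit rate falls out of hit/miss token counts:

```rust
// Hypothetical stats shape (the crate's real fields may differ).
#[derive(Debug, Default)]
struct PrefixCacheStats {
    hit_tokens: u64,  // prompt tokens served from cached blocks
    miss_tokens: u64, // prompt tokens that required fresh prefill
}

impl PrefixCacheStats {
    // Fraction of prompt tokens served from the cache; 0.0 when no
    // requests have been seen yet.
    fn hit_rate(&self) -> f64 {
        let total = self.hit_tokens + self.miss_tokens;
        if total == 0 {
            0.0
        } else {
            self.hit_tokens as f64 / total as f64
        }
    }
}

fn main() {
    let stats = PrefixCacheStats { hit_tokens: 96, miss_tokens: 32 };
    println!("hit rate: {:.2}", stats.hit_rate()); // 96 / 128 = 0.75
}
```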
pub fn clear_cache(&mut self)
Clear all entries from the prefix cache.
Does not reset the inner engine’s KV cache.
Auto Trait Implementations
impl<'a> Freeze for PrefixCachedEngine<'a>
impl<'a> RefUnwindSafe for PrefixCachedEngine<'a>
impl<'a> Send for PrefixCachedEngine<'a>
impl<'a> Sync for PrefixCachedEngine<'a>
impl<'a> Unpin for PrefixCachedEngine<'a>
impl<'a> UnsafeUnpin for PrefixCachedEngine<'a>
impl<'a> UnwindSafe for PrefixCachedEngine<'a>
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,

fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true, otherwise into a Right variant.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true, otherwise into a Right variant.