Prefix-cache-aware inference engine wrapper.
PrefixCachedEngine wraps an InferenceEngine and transparently
intercepts the prefill phase: identical prompt prefixes (e.g. a shared
system prompt) are served from the KV-cache trie rather than being
re-processed by the model, cutting prefill cost to near-zero for cached
prefixes.
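The lookup the trie performs can be sketched as a token-level longest-cached-prefix match. This is an illustrative sketch only: the `TrieNode` type and the `insert`/`longest_cached_prefix` names are hypothetical, not the crate's actual API.

```rust
use std::collections::HashMap;

// Hypothetical sketch: a token-level trie recording which prompt prefixes
// have cached KV entries. Not oxibonsai's actual types.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u32, TrieNode>,
    // True if KV state for the prefix ending at this node is cached.
    cached: bool,
}

impl TrieNode {
    fn insert(&mut self, tokens: &[u32]) {
        let mut node = self;
        for &t in tokens {
            node = node.children.entry(t).or_default();
        }
        node.cached = true;
    }

    /// Length of the longest cached prefix of `tokens`.
    fn longest_cached_prefix(&self, tokens: &[u32]) -> usize {
        let (mut node, mut best, mut depth) = (self, 0, 0);
        for &t in tokens {
            match node.children.get(&t) {
                Some(next) => {
                    depth += 1;
                    node = next;
                    if node.cached {
                        best = depth;
                    }
                }
                None => break,
            }
        }
        best
    }
}

fn main() {
    let mut trie = TrieNode::default();
    trie.insert(&[1, 2, 3]); // e.g. a shared system prompt
    // Only tokens past the cached prefix need real prefill.
    assert_eq!(trie.longest_cached_prefix(&[1, 2, 3, 4, 5]), 3);
    assert_eq!(trie.longest_cached_prefix(&[9, 9]), 0);
}
```

Under this scheme, a prompt of length `n` with a cached prefix of length `k` only pays prefill cost for the trailing `n - k` tokens.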
§Usage

```rust
use oxibonsai_core::config::Qwen3Config;
use oxibonsai_runtime::engine::InferenceEngine;
use oxibonsai_runtime::prefix_cache_engine::PrefixCachedEngine;
use oxibonsai_runtime::sampling::SamplingParams;

let config = Qwen3Config::tiny_test();
let engine = InferenceEngine::new(config, SamplingParams::default(), 42);
let mut cached = PrefixCachedEngine::new(engine, 64);

let tokens = cached.generate(&[1, 2, 3, 4], &SamplingParams::default());
let stats = cached.cache_stats();
println!("hit rate: {:.1}%", stats.hit_rate * 100.0);
```

§Limitations
Real prefix-cache reuse is only effective when the engine’s forward
path populates the CPU oxibonsai_model::KvCache. On Metal/CUDA tiers
the GPU keeps its own KV state separate from the CPU cache; in that
case the post-prefill extraction would yield all-zero tensors. This
engine detects that case (the real_cpu_kv check below) and falls back
to plain prefill without poisoning the trie. The session bookkeeping
(hit-rate stats) still runs.
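The guard described above boils down to checking whether the extracted CPU KV tensors contain real data before inserting them into the trie. A minimal sketch, assuming the check is an all-zeros test (the function name and logic here are paraphrased, not the crate's actual `real_cpu_kv` code):

```rust
// Hypothetical sketch of the fallback guard: if the post-prefill CPU KV
// extraction is all zeros (a GPU tier kept KV state on-device), skip
// caching so the trie is never poisoned with empty tensors.
fn is_real_cpu_kv(kv: &[f32]) -> bool {
    // An untouched CPU cache reads back as exact zeros, so any nonzero
    // element means the forward pass really wrote into it.
    !kv.is_empty() && kv.iter().any(|&x| x != 0.0)
}

fn main() {
    let gpu_tier_extract = vec![0.0f32; 8];       // all zeros: not cacheable
    let cpu_tier_extract = vec![0.0, 0.5, -1.25]; // real data: cacheable
    assert!(!is_real_cpu_kv(&gpu_tier_extract));
    assert!(is_real_cpu_kv(&cpu_tier_extract));
}
```

When the check fails, the engine runs plain prefill and only updates the session hit-rate bookkeeping, as the paragraph above notes.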
Structs§
- PrefixCachedEngine
  An InferenceEngine augmented with prefix KV-cache reuse.