Module prefix_cache_engine

Prefix-cache-aware inference engine wrapper.

PrefixCachedEngine wraps an InferenceEngine and transparently intercepts the prefill phase: identical prompt prefixes (e.g. a shared system prompt) are served from the KV-cache trie rather than being re-processed by the model, cutting prefill cost to near-zero for cached prefixes.
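
The lookup this relies on is a longest-prefix match over previously cached token sequences. The sketch below illustrates that idea with hypothetical names (PrefixTrie, longest_prefix) and with the per-node KV tensors omitted; it is not this module's actual data structure.

use std::collections::HashMap;

// Hypothetical, simplified trie keyed by token id; real nodes would also
// carry the KV-cache entries for the tokens along the path.
#[derive(Default)]
struct PrefixTrie {
    children: HashMap<u32, PrefixTrie>,
}

impl PrefixTrie {
    // Record a token sequence so later prompts can reuse its prefix.
    fn insert(&mut self, tokens: &[u32]) {
        let mut node = self;
        for &t in tokens {
            node = node.children.entry(t).or_default();
        }
    }

    // Count how many leading tokens of `prompt` are already cached.
    fn longest_prefix(&self, prompt: &[u32]) -> usize {
        let mut node = self;
        let mut matched = 0;
        for &t in prompt {
            match node.children.get(&t) {
                Some(next) => {
                    node = next;
                    matched += 1;
                }
                None => break,
            }
        }
        matched
    }
}

Only the suffix beyond the matched prefix then needs to go through model prefill; the prefix's KV state comes from the cache.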

§Usage

use oxibonsai_core::config::Qwen3Config;
use oxibonsai_runtime::engine::InferenceEngine;
use oxibonsai_runtime::sampling::SamplingParams;
use oxibonsai_runtime::prefix_cache_engine::PrefixCachedEngine;

// Build a small test engine and wrap it with prefix KV-cache reuse.
let config = Qwen3Config::tiny_test();
let engine = InferenceEngine::new(config, SamplingParams::default(), 42);
let mut cached = PrefixCachedEngine::new(engine, 64);

// Generate as usual; cached prompt prefixes skip model prefill.
let tokens = cached.generate(&[1, 2, 3, 4], &SamplingParams::default());

// Inspect cache effectiveness.
let stats = cached.cache_stats();
println!("hit rate: {:.1}%", stats.hit_rate * 100.0);

§Limitations

Real prefix-cache reuse is only effective when the engine’s forward path populates the CPU oxibonsai_model::KvCache. On Metal/CUDA tiers the GPU keeps its own KV state separate from the CPU cache; in that case the post-prefill extraction would yield all-zero tensors. This engine detects that case (the real_cpu_kv check below) and falls back to plain prefill without poisoning the trie. The session bookkeeping (hit-rate stats) still runs.
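
A rough sketch of the kind of guard described above, with illustrative names only (the actual check is the real_cpu_kv test in this module's source):

// Illustrative only; not the module's real types or fields.
struct PrefixCacheState {
    cached_prompts: Vec<Vec<u32>>, // stand-in for the KV-cache trie
}

impl PrefixCacheState {
    fn after_prefill(&mut self, prompt: &[u32], real_cpu_kv: bool) {
        if !real_cpu_kv {
            // GPU-resident KV state: extraction would yield all-zero tensors,
            // so skip the insert and leave the trie untouched. Hit-rate
            // bookkeeping still happens elsewhere.
            return;
        }
        // The CPU KvCache holds the true post-prefill state: safe to cache.
        self.cached_prompts.push(prompt.to_vec());
    }
}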

Structs§

PrefixCachedEngine
An InferenceEngine augmented with prefix KV-cache reuse.