Semantic caching layer for LLM inference.
Returns cached responses for semantically similar queries (above a cosine similarity threshold), avoiding redundant model inference. The cache uses TF-IDF embeddings and cosine similarity for semantic matching, with LRU-style eviction and TTL-based expiry.
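The matching step described above can be illustrated with a minimal sketch. This is not the crate's implementation: the helper names `term_frequencies`, `cosine_similarity`, and `is_semantic_hit` are hypothetical, the tokenizer is a naive split on non-alphanumeric characters, and plain term counts stand in for full TF-IDF weights (the IDF factor is omitted for brevity).

```rust
use std::collections::HashMap;

/// Builds a sparse term-frequency vector: lowercased tokens mapped to counts.
/// (Assumption: a real TF-IDF embedding would also scale each count by an IDF weight.)
fn term_frequencies(text: &str) -> HashMap<String, f64> {
    let mut tf = HashMap::new();
    for token in text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
    {
        *tf.entry(token.to_lowercase()).or_insert(0.0) += 1.0;
    }
    tf
}

/// Cosine similarity between two sparse vectors; returns 0.0 if either is empty.
fn cosine_similarity(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    // Only terms present in both vectors contribute to the dot product.
    let dot: f64 = a
        .iter()
        .filter_map(|(term, wa)| b.get(term).map(|wb| wa * wb))
        .sum();
    let norm_a = a.values().map(|w| w * w).sum::<f64>().sqrt();
    let norm_b = b.values().map(|w| w * w).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Hypothetical hit test: a cached query counts as a match when its
/// similarity to the new query reaches the threshold.
fn is_semantic_hit(query: &str, cached_query: &str, threshold: f64) -> bool {
    let q = term_frequencies(query);
    let c = term_frequencies(cached_query);
    cosine_similarity(&q, &c) >= threshold
}
```

Under this simplified weighting, two queries that share only a couple of terms score well below 1.0, so whether they count as a hit depends entirely on the chosen threshold.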
§Example
use oxibonsai_runtime::semantic_cache::{CachedInference, SemanticCacheConfig};

let config = SemanticCacheConfig::default();
let ci = CachedInference::new(config);

let (response, was_hit) = ci.run_or_cache(
    "What is Rust programming language?",
    || "Rust is a systems programming language focused on safety.".to_string(),
);
assert!(!was_hit);

let (response2, was_hit2) = ci.run_or_cache(
    "Tell me about the Rust language",
    || "Rust is a memory-safe systems language.".to_string(),
);
// May or may not be a hit depending on similarity
let _ = (response2, was_hit2);

Structs§
- CachedInference - Middleware wrapper that checks the semantic cache before running inference.
- CachedResponse - A cached LLM response returned on a semantic cache hit.
- SemanticCache - Semantic cache using TF-IDF embeddings and cosine similarity.
- SemanticCacheConfig - Configuration for semantic caching.
- SemanticCacheStats - Statistics about the cache, suitable for monitoring and dashboards.
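
The LRU-style eviction and TTL-based expiry mentioned in the module description can be sketched roughly as follows. This is an illustration under stated assumptions, not the crate's code: `TinyCache` and `Entry` are hypothetical types, keys are matched exactly rather than semantically, and the capacity is assumed to be at least 1.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical cache entry: the stored response plus bookkeeping.
struct Entry {
    response: String,
    inserted_at: Instant, // used for TTL expiry
    last_used: u64,       // logical timestamp used for LRU eviction
}

/// Hypothetical exact-match cache with a capacity limit and a TTL.
struct TinyCache {
    entries: HashMap<String, Entry>,
    capacity: usize,
    ttl: Duration,
    clock: u64, // incremented on every access
}

impl TinyCache {
    fn new(capacity: usize, ttl: Duration) -> Self {
        Self { entries: HashMap::new(), capacity, ttl, clock: 0 }
    }

    /// Returns the cached response, dropping the entry if its TTL has elapsed.
    fn get(&mut self, key: &str) -> Option<String> {
        self.clock += 1;
        let expired = match self.entries.get(key) {
            Some(e) => e.inserted_at.elapsed() >= self.ttl,
            None => return None,
        };
        if expired {
            self.entries.remove(key);
            return None;
        }
        let entry = self.entries.get_mut(key)?;
        entry.last_used = self.clock; // mark as most recently used
        Some(entry.response.clone())
    }

    /// Inserts a response, evicting the least recently used entry when full.
    fn insert(&mut self, key: &str, response: String) {
        self.clock += 1;
        if !self.entries.contains_key(key) && self.entries.len() >= self.capacity {
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            if let Some(k) = oldest {
                self.entries.remove(&k);
            }
        }
        self.entries.insert(
            key.to_string(),
            Entry { response, inserted_at: Instant::now(), last_used: self.clock },
        );
    }
}
```

The linear scan for the least recently used entry keeps the sketch short; a production cache would more likely use an ordered structure to make eviction cheaper.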