# llm-edge-cache
[Crates.io](https://crates.io/crates/llm-edge-cache)
[Documentation](https://docs.rs/llm-edge-cache)
[License: Apache-2.0](https://github.com/globalbusinessadvisors/llm-edge-agent/blob/main/LICENSE)
Multi-tier caching system for LLM Edge Agent with intelligent cache hierarchy and performance monitoring.
## Features
- **Multi-Tier Architecture**: L1 (in-memory) + L2 (distributed) caching for optimal performance
- **High Performance**: Sub-millisecond L1 latency, 1-2ms L2 latency
- **Intelligent Eviction**: TinyLFU algorithm for L1 cache with configurable TTL/TTI
- **Redis-Backed L2**: Distributed caching for multi-instance deployments
- **SHA-256 Key Generation**: Collision-resistant cache keys with parameter normalization
- **Comprehensive Metrics**: Prometheus-compatible metrics for monitoring and observability
- **Graceful Degradation**: Automatic fallback to L1-only mode if L2 is unavailable
- **Type-Safe API**: Strongly typed request/response structures with full async/await support
## Architecture
```text
┌─────────────────────────────────────────────────────────────────┐
│                        Cache Lookup Flow                         │
└─────────────────────────────────────────────────────────────────┘

  Request
     │
     ▼
┌─────────┐
│L1 Cache │  In-Memory (Moka)
│ Lookup  │  Target: <1ms (typically <100μs)
└────┬────┘
     │
  ┌──┴──┐
  │ HIT │──────────────────────────────► Return (0.1ms)
  └──┬──┘
     │
  ┌──▼──┐
  │MISS │
  └──┬──┘
     │
     ▼
┌─────────┐
│L2 Cache │  Distributed (Redis)
│ Lookup  │  Target: 1-2ms
└────┬────┘
     │
  ┌──┴──┐
  │ HIT │──► Populate L1 ──────────────► Return (2ms)
  └──┬──┘
     │
  ┌──▼──┐
  │MISS │
  └──┬──┘
     │
     ▼
┌─────────┐
│Provider │  LLM API Call
│Execution│  Target: 500-2000ms
└────┬────┘
     │
     ▼
┌─────────┐
│ Write   │  Async write to L1 + L2
│ L1 + L2 │  (non-blocking)
└────┬────┘
     │
     ▼
   Return
```
## Installation
Add this to your `Cargo.toml`:
```toml
[dependencies]
llm-edge-cache = "0.1.0"
```
## Usage
### Basic Usage (L1 Only)
```rust
use llm_edge_cache::{CacheManager, key::CacheableRequest, l1::CachedResponse};

#[tokio::main]
async fn main() {
    // Create cache manager with default L1 configuration
    let cache = CacheManager::new();

    // Create a cacheable request
    let request = CacheableRequest::new("gpt-4", "What is the meaning of life?")
        .with_temperature(0.7)
        .with_max_tokens(100);

    // Check cache
    let result = cache.lookup(&request).await;
    match result {
        llm_edge_cache::CacheLookupResult::L1Hit(response) => {
            println!("Cache hit! Response: {}", response.content);
        }
        llm_edge_cache::CacheLookupResult::Miss => {
            println!("Cache miss - calling LLM provider");

            // Call your LLM provider here...
            let response = CachedResponse {
                content: "42".to_string(),
                tokens: Some(llm_edge_cache::l1::TokenUsage {
                    prompt_tokens: 10,
                    completion_tokens: 5,
                    total_tokens: 15,
                }),
                model: "gpt-4".to_string(),
                cached_at: chrono::Utc::now().timestamp(),
            };

            // Store in cache
            cache.store(&request, response).await;
        }
        _ => {}
    }
}
```
### Advanced Usage (L1 + L2)
```rust
use llm_edge_cache::{CacheManager, l2::L2Config};

#[tokio::main]
async fn main() {
    // Configure L2 cache (Redis)
    let l2_config = L2Config {
        redis_url: "redis://127.0.0.1:6379".to_string(),
        ttl_seconds: 3600, // 1 hour
        connection_timeout_ms: 1000,
        operation_timeout_ms: 100,
        key_prefix: "llm_cache:".to_string(),
    };

    // Create cache manager with L1 + L2
    let cache = CacheManager::with_l2(l2_config).await;

    // Use the cache (same API as L1-only)
    // ...
}
```
### Custom L1 Configuration
```rust
use llm_edge_cache::{CacheManager, l1::L1Config};

let l1_config = L1Config {
    max_capacity: 10_000, // 10k entries
    ttl_seconds: 600,     // 10 minutes
    tti_seconds: 300,     // 5 minutes idle
};

// Note: For a custom L1 config, you'll need to construct the cache manually,
// or use the builder pattern if available in your version.
```
### Health Checks
```rust
// Check cache health
let health = cache.health_check().await;
println!("L1 healthy: {}", health.l1_healthy);
println!("L2 healthy: {}", health.l2_healthy);
println!("L2 configured: {}", health.l2_configured);
if health.is_fully_healthy() {
println!("All cache tiers operational");
}
```
### Metrics and Monitoring
```rust
// Get metrics snapshot
let metrics = cache.metrics_snapshot();
println!("L1 hits: {}", metrics.l1_hits);
println!("L1 misses: {}", metrics.l1_misses);
println!("L1 hit rate: {:.2}%", metrics.l1_hit_rate() * 100.0);
println!("L2 hits: {}", metrics.l2_hits);
println!("L2 misses: {}", metrics.l2_misses);
println!("L2 hit rate: {:.2}%", metrics.l2_hit_rate() * 100.0);
println!("Overall hit rate: {:.2}%", metrics.overall_hit_rate() * 100.0);
// Get cache sizes
println!("L1 entries: {}", cache.l1_entry_count());
if let Some(l2_size) = cache.l2_approximate_size().await {
println!("L2 entries: {}", l2_size);
}
```
### Cache Invalidation
```rust
// Invalidate specific entry
cache.invalidate(&request).await;
// Clear all caches (use with caution!)
cache.clear_all().await;
```
### Custom TTL for L2
```rust
// Store with custom L2 TTL (7 days for this response)
cache.store_with_ttl(&request, response, 7 * 24 * 3600).await;
```
## Performance Targets
| Metric | Target | Typical |
|--------|--------|---------|
| L1 Latency | <1ms | <100μs |
| L2 Latency | 1-2ms | ~1.5ms |
| Overall Hit Rate (MVP) | >50% | 55-60% |
| Overall Hit Rate (Beta) | >70% | 75-80% |
| L1 Eviction Algorithm | TinyLFU | - |
| L2 Persistence | Redis TTL | - |
### Default Configuration
| Setting | L1 (In-Memory) | L2 (Redis) |
|---------|----------------|------------|
| TTL | 300s (5 min) | 3600s (1 hour) |
| TTI | 120s (2 min) | N/A |
| Max Capacity | 1,000 entries | Limited by Redis memory |
| Eviction Policy | TinyLFU (LFU + LRU) | Redis TTL |
| Key Prefix | N/A | `llm_cache:` |
## Cache Key Generation
Cache keys are generated using SHA-256 hashing of the following components:
- Model name
- Prompt content
- Temperature (normalized to 2 decimal places)
- Max tokens
- Additional parameters (sorted for consistency)
```rust
use llm_edge_cache::key::{generate_cache_key, CacheableRequest};

let request = CacheableRequest::new("gpt-4", "Hello, world!")
    .with_temperature(0.7)
    .with_max_tokens(100);

let cache_key = generate_cache_key(&request);
// Returns a 64-character, hex-encoded SHA-256 hash
```
**Note**: Temperature values are normalized to 2 decimal places to avoid floating-point precision issues. For example, `0.7` and `0.700001` will produce the same cache key.
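As a quick illustration (a minimal sketch using only the `CacheableRequest` and `generate_cache_key` API shown above), two requests whose temperatures differ only beyond the second decimal place should hash to the same key:

```rust
use llm_edge_cache::key::{generate_cache_key, CacheableRequest};

// Temperatures that differ only past two decimal places are normalized,
// so both requests should produce identical cache keys.
let a = CacheableRequest::new("gpt-4", "Hello, world!").with_temperature(0.7);
let b = CacheableRequest::new("gpt-4", "Hello, world!").with_temperature(0.700001);

assert_eq!(generate_cache_key(&a), generate_cache_key(&b));
```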
## Prometheus Metrics
The crate exports the following Prometheus-compatible metrics:
- `llm_edge_cache_hits_total{tier="l1|l2"}` - Total cache hits per tier
- `llm_edge_cache_misses_total{tier="l1|l2"}` - Total cache misses per tier
- `llm_edge_cache_writes_total{tier="l1|l2"}` - Total cache writes per tier
- `llm_edge_cache_latency_ms{tier="l1|l2"}` - Cache operation latency histogram
- `llm_edge_cache_size_entries{tier="l1|l2"}` - Current cache size in entries
- `llm_edge_cache_memory_bytes{tier="l1|l2"}` - Current cache memory usage
- `llm_edge_requests_total` - Total requests processed
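How these metrics are surfaced depends on your service. If you want to expose the counters yourself, here is a minimal sketch that uses only the `metrics_snapshot()` fields from the monitoring example above; the exposition text is hand-rolled for illustration, not a crate API:

```rust
use llm_edge_cache::CacheManager;

// Build Prometheus exposition text from the snapshot fields shown in the
// monitoring example (l1_hits, l1_misses, l2_hits, l2_misses).
// Serve the resulting string from your /metrics endpoint.
let cache = CacheManager::new();
let m = cache.metrics_snapshot();
let body = format!(
    "llm_edge_cache_hits_total{{tier=\"l1\"}} {}\n\
     llm_edge_cache_hits_total{{tier=\"l2\"}} {}\n\
     llm_edge_cache_misses_total{{tier=\"l1\"}} {}\n\
     llm_edge_cache_misses_total{{tier=\"l2\"}} {}\n",
    m.l1_hits, m.l2_hits, m.l1_misses, m.l2_misses
);
println!("{body}");
```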
## Error Handling
The crate uses a graceful degradation model:
- If L2 (Redis) is unavailable at startup, the system falls back to L1-only mode
- If L2 becomes unavailable during operation, errors are logged but don't affect L1 operations
- All L2 writes are fire-and-forget (non-blocking)
- Timeouts are enforced on all Redis operations (default: 100ms)
```rust
// L2 errors don't crash the application
let cache = CacheManager::with_l2(l2_config).await;
// Even if Redis is down, this will succeed with L1-only mode
// Check if L2 is actually available
if cache.has_l2() {
    println!("L2 cache is available");
} else {
    println!("Running in L1-only mode");
}
```
## Testing
Run the test suite:
```bash
# Unit tests (no Redis required)
cargo test
# Integration tests (requires Redis)
docker run -d -p 6379:6379 redis:7-alpine
cargo test -- --ignored
```
## Performance Considerations
### L1 Cache (Moka)
- **Pros**: Extremely fast (<100μs), no network overhead, TinyLFU eviction
- **Cons**: Per-instance (not shared), limited capacity, lost on restart
- **Best for**: Hot data, frequently accessed prompts, high-throughput scenarios
### L2 Cache (Redis)
- **Pros**: Shared across instances, persistent, larger capacity
- **Cons**: Network latency (1-2ms), requires Redis infrastructure
- **Best for**: Warm data, multi-instance deployments, cost reduction
### Optimization Tips
1. **Adjust L1 capacity** based on your working set size and memory constraints
2. **Tune TTL values** based on your use case (longer for stable prompts, shorter for dynamic content)
3. **Monitor hit rates** and adjust configuration accordingly
4. **Use custom TTLs** for responses that should be cached longer (e.g., documentation lookups)
5. **Consider L1-only mode** for single-instance deployments to reduce infrastructure complexity
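Putting tips 2 and 3 together, a tuned setup might look like the sketch below. The values are illustrative only, and the sketch assumes the `L2Config` fields and `CacheManager` methods shown earlier:

```rust
use llm_edge_cache::{CacheManager, l2::L2Config};

#[tokio::main]
async fn main() {
    // Tip 2: longer L2 TTL for a workload dominated by stable prompts.
    let l2_config = L2Config {
        redis_url: "redis://127.0.0.1:6379".to_string(),
        ttl_seconds: 6 * 3600, // 6 hours instead of the 1-hour default
        connection_timeout_ms: 1000,
        operation_timeout_ms: 100,
        key_prefix: "llm_cache:".to_string(),
    };
    let cache = CacheManager::with_l2(l2_config).await;

    // Tip 3: watch hit rates and adjust the TTLs above accordingly.
    let m = cache.metrics_snapshot();
    println!("Overall hit rate: {:.2}%", m.overall_hit_rate() * 100.0);

    // Tip 4: pin long-lived responses (e.g. documentation lookups) with a
    // custom per-entry TTL via `store_with_ttl` (see the example above).
}
```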
## Examples
See the [examples directory](../../examples/) for complete examples:
- `basic_cache.rs` - Simple L1-only caching
- `distributed_cache.rs` - L1 + L2 setup with Redis
- `metrics_monitoring.rs` - Prometheus metrics integration
## Contributing
Contributions are welcome! Please see the [contributing guidelines](https://github.com/globalbusinessadvisors/llm-edge-agent/blob/main/CONTRIBUTING.md) for more information.
## License
Licensed under the Apache License, Version 2.0. See [LICENSE](https://github.com/globalbusinessadvisors/llm-edge-agent/blob/main/LICENSE) for details.
## Links
- [Repository](https://github.com/globalbusinessadvisors/llm-edge-agent)
- [Documentation](https://docs.rs/llm-edge-cache)
- [Crates.io](https://crates.io/crates/llm-edge-cache)
- [Issue Tracker](https://github.com/globalbusinessadvisors/llm-edge-agent/issues)