# llm-edge-cache
Multi-tier caching system for LLM Edge Agent with intelligent cache hierarchy and performance monitoring.
## Features
- Multi-Tier Architecture: L1 (in-memory) + L2 (distributed) caching for optimal performance
- High Performance: Sub-millisecond L1 latency, 1-2ms L2 latency
- Intelligent Eviction: TinyLFU algorithm for L1 cache with configurable TTL/TTI
- Redis-Backed L2: Distributed caching for multi-instance deployments
- SHA-256 Key Generation: Collision-resistant cache keys with parameter normalization
- Comprehensive Metrics: Prometheus-compatible metrics for monitoring and observability
- Graceful Degradation: Automatic fallback to L1-only mode if L2 is unavailable
- Type-Safe API: Strongly typed request/response structures with full async/await support
## Architecture

```text
                          Cache Lookup Flow
Request
│
▼
┌─────────┐
│L1 Cache │ In-Memory (Moka)
│ Lookup │ Target: <1ms (typically <100μs)
└────┬────┘
│
┌──┴──┐
│ HIT │──────────────────────────────► Return (0.1ms)
└──┬──┘
│
┌──▼──┐
│MISS │
└──┬──┘
│
▼
┌─────────┐
│L2 Cache │ Distributed (Redis)
│ Lookup │ Target: 1-2ms
└────┬────┘
│
┌──┴──┐
│ HIT │──► Populate L1 ──────────────► Return (2ms)
└──┬──┘
│
┌──▼──┐
│MISS │
└──┬──┘
│
▼
┌─────────┐
│Provider │ LLM API Call
│Execution│ Target: 500-2000ms
└────┬────┘
│
▼
┌─────────┐
│ Write │ Async Write to L1 + L2
│L1 + L2 │ (non-blocking)
└────┬────┘
│
▼
Return
```
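From the caller's perspective, the flow above reduces to "try the cache, call the provider only on a full miss, then store". A minimal sketch, assuming the `CacheManager`/`CacheRequest` names used in the Usage examples below and a hypothetical `call_llm_provider` helper:

```rust
use llm_edge_cache::{CacheManager, CacheRequest};

// Sketch only: `lookup`, `store`, and `call_llm_provider` are illustrative names,
// not confirmed crate APIs.
async fn respond(cache: &CacheManager, request: &CacheRequest) -> String {
    if let Some(cached) = cache.lookup(request).await {
        return cached;                                 // L1 (<1ms) or L2 (1-2ms) hit
    }
    let response = call_llm_provider(request).await;   // provider call: 500-2000ms
    cache.store(request, &response).await;             // non-blocking write to L1 + L2
    response
}
```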
## Installation

Add this to your `Cargo.toml`:

```toml
[dependencies]
llm-edge-cache = "0.1.0"
```
## Usage

### Basic Usage (L1 Only)

```rust
use llm_edge_cache::{CacheManager, CacheRequest};

// Names below are illustrative; check the crate docs for the exact API.
async fn basic() {
    // L1-only cache with the default configuration
    let cache = CacheManager::new();

    let request = CacheRequest::new("gpt-4", "Explain Rust ownership");
    if let Some(response) = cache.lookup(&request).await {
        println!("cache hit: {response}");
    }
}
```
### Advanced Usage (L1 + L2)

```rust
use llm_edge_cache::CacheManager;

// Names below are illustrative; check the crate docs for the exact API.
async fn advanced() {
    // L1 + L2: point the cache at Redis. If Redis is unreachable at startup,
    // construction still succeeds and the cache runs in L1-only mode.
    let cache = CacheManager::with_l2("redis://127.0.0.1:6379").await;
}
```
### Custom L1 Configuration

```rust
use std::time::Duration;
use llm_edge_cache::L1Config;

// Field names are illustrative; defaults are listed under "Default Configuration" below.
let l1_config = L1Config {
    max_capacity: 10_000,
    ttl: Duration::from_secs(600),
    tti: Duration::from_secs(120),
    ..Default::default()
};

// Note: for a custom L1 config, you'll need to construct the cache manually
// or use the builder pattern if available in your version.
```
### Health Checks

```rust
// Check cache health (field and method names are illustrative)
let health = cache.health_check().await;

println!("L1 healthy: {}", health.l1_healthy);
println!("L2 healthy: {}", health.l2_healthy);
println!("L2 configured: {}", health.l2_configured);

if health.is_fully_healthy() {
    println!("All cache tiers are operational");
}
```
### Metrics and Monitoring

```rust
// Get a point-in-time metrics snapshot (field names are illustrative)
let metrics = cache.metrics_snapshot();

println!("Total requests: {}", metrics.total_requests);
println!("L1 hits: {}", metrics.l1_hits);
println!("L1 misses: {}", metrics.l1_misses);
println!("L2 hits: {}", metrics.l2_hits);
println!("L2 misses: {}", metrics.l2_misses);
println!("L1 hit rate: {:.1}%", metrics.l1_hit_rate * 100.0);
println!("Overall hit rate: {:.1}%", metrics.overall_hit_rate * 100.0);

// Get cache sizes
println!("L1 entries: {}", cache.l1_size());
if let Some(l2_size) = cache.l2_approximate_size().await {
    println!("L2 entries (approximate): {l2_size}");
}
```
### Cache Invalidation

```rust
// Invalidate a specific entry
cache.invalidate(&request).await;

// Clear all caches (use with caution!)
cache.clear_all().await;
```
### Custom TTL for L2

```rust
use std::time::Duration;

// Store with a custom L2 TTL (7 days for this response);
// the argument order shown here is illustrative.
cache.store_with_ttl(&request, &response, Duration::from_secs(7 * 24 * 60 * 60)).await;
```
## Performance Targets
| Metric | Target | Typical |
|---|---|---|
| L1 Latency | <1ms | <100μs |
| L2 Latency | 1-2ms | ~1.5ms |
| Overall Hit Rate (MVP) | >50% | 55-60% |
| Overall Hit Rate (Beta) | >70% | 75-80% |
| L1 Eviction Algorithm | TinyLFU | - |
| L2 Persistence | Redis TTL | - |
## Default Configuration
| Parameter | L1 Default | L2 Default |
|---|---|---|
| TTL | 300s (5 min) | 3600s (1 hour) |
| TTI | 120s (2 min) | N/A |
| Max Capacity | 1,000 entries | Limited by Redis memory |
| Eviction Policy | TinyLFU (LFU + LRU) | Redis TTL |
| Key Prefix | N/A | llm_cache: |
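For reference, the L1 defaults above map onto Moka's builder settings. A standalone sketch (illustrative only; the crate configures this internally via `L1Config`):

```rust
use std::time::Duration;
use moka::future::Cache;

// L1 defaults from the table above, expressed against Moka's builder
let l1: Cache<String, String> = Cache::builder()
    .max_capacity(1_000)                        // 1,000 entries
    .time_to_live(Duration::from_secs(300))     // TTL: 5 minutes
    .time_to_idle(Duration::from_secs(120))     // TTI: 2 minutes
    .build();
```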
## Cache Key Generation
Cache keys are generated using SHA-256 hashing of the following components:
- Model name
- Prompt content
- Temperature (normalized to 2 decimal places)
- Max tokens
- Additional parameters (sorted for consistency)
```rust
use llm_edge_cache::{generate_cache_key, CacheRequest};

// `CacheRequest` and its constructor arguments are illustrative; `generate_cache_key`,
// `with_temperature`, and `with_max_tokens` are the crate's names.
let request = CacheRequest::new("gpt-4", "Explain Rust ownership")
    .with_temperature(0.7)
    .with_max_tokens(512);

let cache_key = generate_cache_key(&request);
// Returns a 64-character hex-encoded SHA-256 hash
```
Note: Temperature values are normalized to 2 decimal places to avoid floating-point precision issues. For example, 0.7 and 0.700001 will produce the same cache key.
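As a rough sketch of that derivation (not the crate's exact implementation), using the `sha2` and `hex` crates:

```rust
use sha2::{Digest, Sha256};

// Illustrative key derivation: hash the normalized request components with SHA-256.
fn sketch_cache_key(model: &str, prompt: &str, temperature: f64, max_tokens: u32) -> String {
    let mut hasher = Sha256::new();
    hasher.update(model.as_bytes());
    hasher.update(prompt.as_bytes());
    // Normalize temperature to 2 decimal places so 0.7 and 0.700001 hash identically
    hasher.update(format!("{:.2}", temperature).as_bytes());
    hasher.update(max_tokens.to_le_bytes());
    // Additional parameters would be serialized in sorted key order here (omitted)
    hex::encode(hasher.finalize()) // 64 hex characters
}
```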
## Prometheus Metrics
The crate exports the following Prometheus-compatible metrics:
- `llm_edge_cache_hits_total{tier="l1|l2"}` - Total cache hits per tier
- `llm_edge_cache_misses_total{tier="l1|l2"}` - Total cache misses per tier
- `llm_edge_cache_writes_total{tier="l1|l2"}` - Total cache writes per tier
- `llm_edge_cache_latency_ms{tier="l1|l2"}` - Cache operation latency histogram
- `llm_edge_cache_size_entries{tier="l1|l2"}` - Current cache size in entries
- `llm_edge_cache_memory_bytes{tier="l1|l2"}` - Current cache memory usage
- `llm_edge_requests_total` - Total requests processed
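If the crate registers these metrics in the default Prometheus registry (an assumption; check how the crate exposes its registry), they can be rendered in the text exposition format like this:

```rust
use prometheus::{Encoder, TextEncoder};

// Render all metrics registered in the default registry as Prometheus text format
let encoder = TextEncoder::new();
let metric_families = prometheus::gather();
let mut buffer = Vec::new();
encoder.encode(&metric_families, &mut buffer).unwrap();
println!("{}", String::from_utf8(buffer).unwrap());
```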
## Error Handling
The crate uses a graceful degradation model:
- If L2 (Redis) is unavailable at startup, the system falls back to L1-only mode
- If L2 becomes unavailable during operation, errors are logged but don't affect L1 operations
- All L2 writes are fire-and-forget (non-blocking)
- Timeouts are enforced on all Redis operations (default: 100ms)
```rust
// L2 errors don't crash the application: even if Redis is down,
// this succeeds and the cache falls back to L1-only mode.
let cache = CacheManager::with_l2("redis://127.0.0.1:6379").await;

// Check if L2 is actually available (log messages are illustrative)
if cache.has_l2() {
    println!("Running with L1 + L2");
} else {
    println!("Running in L1-only mode (L2 unavailable)");
}
```
## Testing

Run the test suite:

```bash
# Unit tests (no Redis required)
cargo test

# Integration tests (requires a running Redis instance; the exact invocation
# depends on how the integration tests are gated in your checkout, e.g.:)
cargo test -- --ignored
```
## Performance Considerations
### L1 Cache (Moka)
- Pros: Extremely fast (<100μs), no network overhead, TinyLFU eviction
- Cons: Per-instance (not shared), limited capacity, lost on restart
- Best for: Hot data, frequently accessed prompts, high-throughput scenarios
### L2 Cache (Redis)
- Pros: Shared across instances, persistent, larger capacity
- Cons: Network latency (1-2ms), requires Redis infrastructure
- Best for: Warm data, multi-instance deployments, cost reduction
### Optimization Tips
- Adjust L1 capacity based on your working set size and memory constraints (see the sketch after this list)
- Tune TTL values based on your use case (longer for stable prompts, shorter for dynamic content)
- Monitor hit rates and adjust configuration accordingly
- Use custom TTLs for responses that should be cached longer (e.g., documentation lookups)
- Consider L1-only mode for single-instance deployments to reduce infrastructure complexity
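For example, the first two tips might translate into something like the following (field names follow the Custom L1 Configuration sketch above and are assumptions about the crate's API):

```rust
use std::time::Duration;
use llm_edge_cache::L1Config;

// Larger working set and a longer TTL for mostly-static prompts
let tuned = L1Config {
    max_capacity: 50_000,
    ttl: Duration::from_secs(1_800),   // 30 minutes
    tti: Duration::from_secs(300),     // 5 minutes
    ..Default::default()
};
```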
## Examples

See the `examples/` directory for complete examples:

- `basic_cache.rs` - Simple L1-only caching
- `distributed_cache.rs` - L1 + L2 setup with Redis
- `metrics_monitoring.rs` - Prometheus metrics integration
## Contributing
Contributions are welcome! Please see the contributing guidelines for more information.
## License
Licensed under the Apache License, Version 2.0. See LICENSE for details.