# llm-edge-cache
Multi-tier caching system for LLM Edge Agent with intelligent cache hierarchy and performance monitoring.
## Features
- Multi-Tier Architecture: L1 (in-memory) + L2 (distributed) caching for optimal performance
- High Performance: Sub-millisecond L1 latency, 1-2ms L2 latency
- Intelligent Eviction: TinyLFU algorithm for L1 cache with configurable TTL/TTI
- Redis-Backed L2: Distributed caching for multi-instance deployments
- SHA-256 Key Generation: Collision-resistant cache keys with parameter normalization
- Comprehensive Metrics: Prometheus-compatible metrics for monitoring and observability
- Graceful Degradation: Automatic fallback to L1-only mode if L2 is unavailable
- Type-Safe API: Strongly typed request/response structures with full async/await support
## Architecture

```text
                          Cache Lookup Flow
Request
│
▼
┌─────────┐
│L1 Cache │ In-Memory (Moka)
│ Lookup │ Target: <1ms (typically <100μs)
└────┬────┘
│
┌──┴──┐
│ HIT │──────────────────────────────► Return (0.1ms)
└──┬──┘
│
┌──▼──┐
│MISS │
└──┬──┘
│
▼
┌─────────┐
│L2 Cache │ Distributed (Redis)
│ Lookup │ Target: 1-2ms
└────┬────┘
│
┌──┴──┐
│ HIT │──► Populate L1 ──────────────► Return (2ms)
└──┬──┘
│
┌──▼──┐
│MISS │
└──┬──┘
│
▼
┌─────────┐
│Provider │ LLM API Call
│Execution│ Target: 500-2000ms
└────┬────┘
│
▼
┌─────────┐
│ Write │ Async Write to L1 + L2
│L1 + L2 │ (non-blocking)
└────┬────┘
│
▼
Return
```
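From the caller's perspective, the flow above reduces to "try the cache, call the provider only on a full miss, then store". A minimal sketch, assuming the `CacheManager`/`CacheRequest` names used in the Usage examples below and a hypothetical `call_llm_provider` helper:

```rust
use llm_edge_cache::{CacheManager, CacheRequest};

// Sketch only: `lookup`, `store`, and `call_llm_provider` are illustrative names,
// not confirmed crate APIs.
async fn respond(cache: &CacheManager, request: &CacheRequest) -> String {
    if let Some(cached) = cache.lookup(request).await {
        return cached;                                 // L1 (<1ms) or L2 (1-2ms) hit
    }
    let response = call_llm_provider(request).await;   // provider call: 500-2000ms
    cache.store(request, &response).await;             // non-blocking write to L1 + L2
    response
}
```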
## Installation

Add this to your `Cargo.toml`:

```toml
[dependencies]
llm-edge-cache = "0.1.0"
```
## Usage

### Basic Usage (L1 Only)

```rust
use llm_edge_cache::{CacheManager, CacheRequest};

// Names below are illustrative; check the crate docs for the exact API.
async fn basic() {
    // L1-only cache with the default configuration
    let cache = CacheManager::new();

    let request = CacheRequest::new("gpt-4", "Explain Rust ownership");
    if let Some(response) = cache.lookup(&request).await {
        println!("cache hit: {response}");
    }
}
```
### Advanced Usage (L1 + L2)

```rust
use llm_edge_cache::CacheManager;

// Names below are illustrative; check the crate docs for the exact API.
async fn advanced() {
    // L1 + L2: point the cache at Redis. If Redis is unreachable at startup,
    // construction still succeeds and the cache runs in L1-only mode.
    let cache = CacheManager::with_l2("redis://127.0.0.1:6379").await;
}
```
### Custom L1 Configuration

```rust
use std::time::Duration;
use llm_edge_cache::L1Config;

// Field names are illustrative; defaults are listed under "Default Configuration" below.
let l1_config = L1Config {
    max_capacity: 10_000,
    ttl: Duration::from_secs(600),
    tti: Duration::from_secs(120),
    ..Default::default()
};

// Note: for a custom L1 config, you'll need to construct the cache manually
// or use the builder pattern if available in your version.
```
### Health Checks

```rust
// Check cache health (field and method names are illustrative)
let health = cache.health_check().await;

println!("L1 healthy: {}", health.l1_healthy);
println!("L2 healthy: {}", health.l2_healthy);
println!("L2 configured: {}", health.l2_configured);

if health.is_fully_healthy() {
    println!("All cache tiers are operational");
}
```
### Metrics and Monitoring

```rust
// Get a point-in-time metrics snapshot (field names are illustrative)
let metrics = cache.metrics_snapshot();

println!("Total requests: {}", metrics.total_requests);
println!("L1 hits: {}", metrics.l1_hits);
println!("L1 misses: {}", metrics.l1_misses);
println!("L2 hits: {}", metrics.l2_hits);
println!("L2 misses: {}", metrics.l2_misses);
println!("L1 hit rate: {:.1}%", metrics.l1_hit_rate * 100.0);
println!("Overall hit rate: {:.1}%", metrics.overall_hit_rate * 100.0);

// Get cache sizes
println!("L1 entries: {}", cache.l1_size());
if let Some(l2_size) = cache.l2_approximate_size().await {
    println!("L2 entries (approximate): {l2_size}");
}
```
### Cache Invalidation

```rust
// Invalidate a specific entry
cache.invalidate(&request).await;

// Clear all caches (use with caution!)
cache.clear_all().await;
```
### Custom TTL for L2

```rust
use std::time::Duration;

// Store with a custom L2 TTL (7 days for this response);
// the argument order shown here is illustrative.
cache.store_with_ttl(&request, &response, Duration::from_secs(7 * 24 * 60 * 60)).await;
```
## Performance Targets
| Metric | Target | Typical |
|---|---|---|
| L1 Latency | <1ms | <100μs |
| L2 Latency | 1-2ms | ~1.5ms |
| Overall Hit Rate (MVP) | >50% | 55-60% |
| Overall Hit Rate (Beta) | >70% | 75-80% |
| L1 Eviction Algorithm | TinyLFU | - |
| L2 Persistence | Redis TTL | - |
## Default Configuration
| Parameter | L1 Default | L2 Default |
|---|---|---|
| TTL | 300s (5 min) | 3600s (1 hour) |
| TTI | 120s (2 min) | N/A |
| Max Capacity | 1,000 entries | Limited by Redis memory |
| Eviction Policy | TinyLFU (LFU + LRU) | Redis TTL |
| Key Prefix | N/A | llm_cache: |
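For reference, the L1 defaults above map onto Moka's builder settings. A standalone sketch (illustrative only; the crate configures this internally via `L1Config`):

```rust
use std::time::Duration;
use moka::future::Cache;

// L1 defaults from the table above, expressed against Moka's builder
let l1: Cache<String, String> = Cache::builder()
    .max_capacity(1_000)                        // 1,000 entries
    .time_to_live(Duration::from_secs(300))     // TTL: 5 minutes
    .time_to_idle(Duration::from_secs(120))     // TTI: 2 minutes
    .build();
```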
## Cache Key Generation
Cache keys are generated using SHA-256 hashing of the following components:
- Model name
- Prompt content
- Temperature (normalized to 2 decimal places)
- Max tokens
- Additional parameters (sorted for consistency)
```rust
use llm_edge_cache::{generate_cache_key, CacheRequest};

// `CacheRequest` and its constructor arguments are illustrative; `generate_cache_key`,
// `with_temperature`, and `with_max_tokens` are the crate's names.
let request = CacheRequest::new("gpt-4", "Explain Rust ownership")
    .with_temperature(0.7)
    .with_max_tokens(512);

let cache_key = generate_cache_key(&request);
// Returns a 64-character hex-encoded SHA-256 hash
```
Note: Temperature values are normalized to 2 decimal places to avoid floating-point precision issues. For example, 0.7 and 0.700001 will produce the same cache key.
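As a rough sketch of that derivation (not the crate's exact implementation), using the `sha2` and `hex` crates:

```rust
use sha2::{Digest, Sha256};

// Illustrative key derivation: hash the normalized request components with SHA-256.
fn sketch_cache_key(model: &str, prompt: &str, temperature: f64, max_tokens: u32) -> String {
    let mut hasher = Sha256::new();
    hasher.update(model.as_bytes());
    hasher.update(prompt.as_bytes());
    // Normalize temperature to 2 decimal places so 0.7 and 0.700001 hash identically
    hasher.update(format!("{:.2}", temperature).as_bytes());
    hasher.update(max_tokens.to_le_bytes());
    // Additional parameters would be serialized in sorted key order here (omitted)
    hex::encode(hasher.finalize()) // 64 hex characters
}
```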
## Prometheus Metrics
The crate exports the following Prometheus-compatible metrics:
- `llm_edge_cache_hits_total{tier="l1|l2"}` - Total cache hits per tier
- `llm_edge_cache_misses_total{tier="l1|l2"}` - Total cache misses per tier
- `llm_edge_cache_writes_total{tier="l1|l2"}` - Total cache writes per tier
- `llm_edge_cache_latency_ms{tier="l1|l2"}` - Cache operation latency histogram
- `llm_edge_cache_size_entries{tier="l1|l2"}` - Current cache size in entries
- `llm_edge_cache_memory_bytes{tier="l1|l2"}` - Current cache memory usage
- `llm_edge_requests_total` - Total requests processed
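If the crate registers these metrics in the default Prometheus registry (an assumption; check how the crate exposes its registry), they can be rendered in the text exposition format like this:

```rust
use prometheus::{Encoder, TextEncoder};

// Render all metrics registered in the default registry as Prometheus text format
let encoder = TextEncoder::new();
let metric_families = prometheus::gather();
let mut buffer = Vec::new();
encoder.encode(&metric_families, &mut buffer).unwrap();
println!("{}", String::from_utf8(buffer).unwrap());
```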
## Error Handling
The crate uses a graceful degradation model:
- If L2 (Redis) is unavailable at startup, the system falls back to L1-only mode
- If L2 becomes unavailable during operation, errors are logged but don't affect L1 operations
- All L2 writes are fire-and-forget (non-blocking)
- Timeouts are enforced on all Redis operations (default: 100ms)
```rust
// L2 errors don't crash the application: even if Redis is down,
// this succeeds and the cache falls back to L1-only mode.
let cache = CacheManager::with_l2("redis://127.0.0.1:6379").await;

// Check if L2 is actually available (log messages are illustrative)
if cache.has_l2() {
    println!("Running with L1 + L2");
} else {
    println!("Running in L1-only mode (L2 unavailable)");
}
```
## Testing

Run the test suite:

```bash
# Unit tests (no Redis required)
cargo test

# Integration tests (requires a running Redis instance; the exact invocation
# depends on how the integration tests are gated in your checkout, e.g.:)
cargo test -- --ignored
```
## Performance Considerations
### L1 Cache (Moka)
- Pros: Extremely fast (<100μs), no network overhead, TinyLFU eviction
- Cons: Per-instance (not shared), limited capacity, lost on restart
- Best for: Hot data, frequently accessed prompts, high-throughput scenarios
### L2 Cache (Redis)
- Pros: Shared across instances, persistent, larger capacity
- Cons: Network latency (1-2ms), requires Redis infrastructure
- Best for: Warm data, multi-instance deployments, cost reduction
### Optimization Tips
- Adjust L1 capacity based on your working set size and memory constraints (see the sketch after this list)
- Tune TTL values based on your use case (longer for stable prompts, shorter for dynamic content)
- Monitor hit rates and adjust configuration accordingly
- Use custom TTLs for responses that should be cached longer (e.g., documentation lookups)
- Consider L1-only mode for single-instance deployments to reduce infrastructure complexity
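For example, the first two tips might translate into something like the following (field names follow the Custom L1 Configuration sketch above and are assumptions about the crate's API):

```rust
use std::time::Duration;
use llm_edge_cache::L1Config;

// Larger working set and a longer TTL for mostly-static prompts
let tuned = L1Config {
    max_capacity: 50_000,
    ttl: Duration::from_secs(1_800),   // 30 minutes
    tti: Duration::from_secs(300),     // 5 minutes
    ..Default::default()
};
```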
## Examples

See the `examples/` directory for complete examples:

- `basic_cache.rs` - Simple L1-only caching
- `distributed_cache.rs` - L1 + L2 setup with Redis
- `metrics_monitoring.rs` - Prometheus metrics integration
## Contributing
Contributions are welcome! Please see the contributing guidelines for more information.
## License
Licensed under the Apache License, Version 2.0. See LICENSE for details.