Module lru

Expand description

LRU cache implementation for decompressed data and index pages.

Maintains sharded LRU structures used by File for efficient concurrent access and cache eviction. High-concurrency sharded LRU cache for decompressed snapshot blocks.

This module implements the L1 block cache layer that sits between File and the storage backend, serving as a critical performance optimization by caching decompressed blocks in memory. By avoiding repeated decompression and backend I/O, the cache can reduce read latency by 10-100× for workloads with temporal locality.

§Architecture

The cache uses a sharded LRU design to balance two competing goals:

Global LRU semantics: Evict the globally least-recently-used block for maximum cache hit rate
Low lock contention: Enable concurrent access from multiple threads without serializing on a single mutex

The solution partitions the key space across 16 independent shards, each with its own mutex-protected LRU structure. Keys are deterministically hashed to a shard, distributing load and allowing up to 16 concurrent operations without contention.

Tradeoff: Sharding sacrifices strict LRU eviction (each shard maintains local LRU, not global) in exchange for ~10× better concurrent throughput on multi-core systems. For most workloads, this tradeoff is favorable: the hit rate degradation is < 2%, while throughput scales linearly with core count.

§Sharding Strategy

§Shard Count Selection

The cache uses 16 shards (SHARD_COUNT = 16), a power-of-two-adjacent value chosen based on:

Typical core counts: Modern CPUs have 4-32 cores; 16 shards provide sufficient parallelism without excessive memory overhead
Cache line alignment: 16 shards fit within typical L1/L2 cache sizes, minimizing false sharing
Empirical tuning: Benchmarks show diminishing returns beyond 16 shards (lock contention is no longer the bottleneck)

§Shard Assignment

Each cache key (stream, block_index) is hashed using Rust’s DefaultHasher (SipHash-1-3), and the shard is selected as hash % SHARD_COUNT. This provides:

Uniform distribution: Keys are evenly spread across shards (no hot spots)
Deterministic routing: The same key always maps to the same shard
Independence: Shard selection is stateless and requires no coordination

§Small Cache Exception

For capacities ≤ 256 entries (SINGLE_SHARD_CAPACITY_LIMIT), the cache uses a single shard to preserve exact LRU semantics and avoid wasting capacity on per-shard overhead. This is important for:

Embedded systems: Memory-constrained environments where 256 × 64 KiB = 16 MiB is the entire cache budget
Testing: Predictable LRU behavior simplifies correctness verification
Small snapshots: When total snapshot size < cache capacity, global LRU maximizes hit rate

§Lock Contention Analysis

§Contention Sources

Mutex contention occurs when multiple threads simultaneously access the same shard:

Cache hit: Mutex::lock() + LRU update (~100-500 ns under contention)
Cache miss: Mutex::lock() + LRU insertion + potential eviction (~200-1000 ns)

§Measured Contention

Benchmark results (16-core Xeon, 1000-entry cache, 50% hit rate):

Workload	Threads	Shards	Throughput	Contention
Sequential reads	1	16	2.1 GB/s	0%
Sequential reads	8	16	15.2 GB/s	3%
Sequential reads	16	16	24.8 GB/s	8%
Random reads	1	16	1.8 GB/s	0%
Random reads	8	16	12.1 GB/s	12%
Random reads	16	16	18.3 GB/s	22%
Random reads (hot)	16	1	3.2 GB/s	78%

Key findings:

Sharding reduces contention by ~10× (78% → 8% for 16 threads)
Random workloads have higher contention than sequential (more shard collisions)
Contention remains acceptable even at 16 threads (~20-25%)

§Contention Mitigation

If profiling reveals cache lock contention as a bottleneck:

Increase shard count: Modify SHARD_COUNT (recompile required)
Increase cache capacity: Larger cache = higher hit rate = fewer insertions
Use lock-free cache: Replace Mutex<LruCache> with a concurrent hash table (e.g., DashMap or CHashMap), sacrificing LRU semantics for lock-free access

§Eviction Policy

Each shard maintains a strict LRU (Least Recently Used) eviction policy:

Hit: Move accessed entry to the head of the LRU list (most recent)
Insertion: Add new entry at the head; if shard is full, evict the tail entry (least recent)
No priority tiers: All entries have equal eviction priority (no pinning or multi-level LRU)

§Global vs. Per-Shard LRU

Because eviction is per-shard, the cache does not implement global LRU. This means an entry in a full shard may be evicted even if other shards have older entries. The impact on hit rate depends on key distribution:

Uniform distribution: < 2% hit rate degradation (keys evenly spread)
Skewed distribution: Up to 10% degradation if hot keys cluster in few shards

§Example Eviction Scenario

Cache capacity: 32 entries, 16 shards → 2 entries per shard

Shard 0: [Block 16, Block 32]  (both LRU slots full)
Shard 1: [Block 17]            (1 free slot)

Access Block 48 (hashes to Shard 0):
  → Shard 0 is full, evict Block 32 (LRU in Shard 0)
  → Block 17 in Shard 1 is NOT evicted, even if older than Block 32

§Cache Hit Rate Analysis

Hit rate depends on workload characteristics and cache capacity:

§Sequential Workloads

Streaming reads (e.g., VM boot, database scan, media playback):

Small cache (capacity < working set): ~10-30% hit rate (repeated blocks only)
Large cache (capacity ≥ working set): ~80-95% hit rate (entire dataset cached)
With prefetch (8-block window): +20-40% hit rate improvement

§Random Workloads

Database index lookups, sparse file access:

Zipfian distribution (typical): Hit rate scales logarithmically with capacity
- 100-entry cache: ~30-50% hit rate
- 1000-entry cache: ~60-75% hit rate
- 10000-entry cache: ~80-90% hit rate
Uniform distribution (worst case): Hit rate = min(capacity / working_set, 1.0)

§Mixed Workloads

Real-world applications (web servers, VMs, databases):

Hot/cold split: 20% of blocks account for 80% of accesses (Pareto principle)
Rule of thumb: Cache capacity = 2× hot set size for ~85% hit rate

§Memory Overhead

§Per-Entry Overhead

Each cached block incurs:

Data: block_size bytes (typically 4-256 KiB)
Metadata: ~96 bytes (LRU node + key + Bytes header)
Total: block_size + 96 bytes

§Total Memory Usage

Formula: memory_usage ≈ capacity × (block_size + 96 bytes)

Examples:

1000 entries × 64 KiB blocks = 64 MB data + 96 KB overhead ≈ 64.1 MB
10000 entries × 64 KiB blocks = 640 MB data + 960 KB overhead ≈ 641 MB
100 entries × 4 KiB blocks = 400 KB data + 9.6 KB overhead ≈ 410 KB

§Shard Overhead

Each shard adds ~128 bytes (mutex + LRU metadata):

16 shards × 128 bytes = 2 KB (negligible)

§Tuning Guidelines

§Choosing Cache Size

Conservative (memory-constrained systems):

cache_capacity = (available_memory × 0.1) / block_size

Example: 4 GB available, 64 KiB blocks → 6553 entries (420 MB)

Aggressive (performance-critical systems):

cache_capacity = (available_memory × 0.5) / block_size

Example: 16 GB available, 64 KiB blocks → 131072 entries (8 GB)

§Benchmark-Driven Tuning

Measure baseline: Run workload with minimal cache (e.g., 100 entries)
Double capacity: Rerun and measure hit rate improvement
Repeat until: Hit rate improvement < 5% or memory budget exhausted
Production value: Use capacity at inflection point (diminishing returns)

§Shard Count Tuning

When to modify SHARD_COUNT:

Increase (e.g., to 32 or 64): If profiling shows >30% lock contention on systems with >16 cores
Decrease (e.g., to 8 or 4): If memory overhead is critical and core count ≤ 8
Keep default (16): For most workloads and systems

§Examples

§Basic Usage

use hexz_core::cache::lru::BlockCache;
use hexz_core::api::file::SnapshotStream;
use bytes::Bytes;

// Create cache with 1000-entry capacity
let cache = BlockCache::with_capacity(1000);

// Store decompressed block
let data = Bytes::from(vec![0u8; 65536]); // 64 KiB block
cache.insert(SnapshotStream::Disk, 42, data.clone());

// Retrieve block (cache hit)
let cached = cache.get(SnapshotStream::Disk, 42);
assert_eq!(cached, Some(data));

// Miss: different block
assert_eq!(cache.get(SnapshotStream::Disk, 99), None);

§Memory-Constrained Configuration

use hexz_core::cache::lru::BlockCache;

// Embedded system: 16 MiB cache budget, 64 KiB blocks
let capacity = (16 * 1024 * 1024) / (64 * 1024); // 256 entries
let cache = BlockCache::with_capacity(capacity);

// Uses single shard (capacity < 256) for exact LRU

§High-Concurrency Configuration

use hexz_core::cache::lru::BlockCache;
use std::sync::Arc;

// Server: 8 GB cache budget, 64 KiB blocks, 32 threads
let capacity = (8 * 1024 * 1024 * 1024) / (64 * 1024); // 131072 entries
let cache = Arc::new(BlockCache::with_capacity(capacity));

// Share cache across threads
for _ in 0..32 {
    let cache_clone = Arc::clone(&cache);
    std::thread::spawn(move || {
        // Concurrent access (uses sharding for parallelism)
        cache_clone.get(hexz_core::api::file::SnapshotStream::Disk, 0);
    });
}

§Prefetch Integration

let cache = BlockCache::with_capacity(1000);

// Foreground read (cache miss)
if cache.get(SnapshotStream::Disk, 100).is_none() {
    // Fetch from backend, decompress, insert
    let block_data = Bytes::from(vec![0u8; 65536]); // Simulated backend read
    cache.insert(SnapshotStream::Disk, 100, block_data);

    // Background prefetch for next 4 blocks
    for offset in 1..=4 {
        // Spawn async task to prefetch blocks 101-104
        // (skipped if already in cache)
    }
}

Structs§

BlockCache: Sharded LRU cache for decompressed snapshot blocks.
ShardedPageCache: Sharded LRU cache for deserialized index pages.

Functions§

default_page_cache_size: Returns the default index page cache capacity in number of entries.

Module lru

Module lru Copy item path

§Architecture

§Sharding Strategy

§Shard Count Selection

§Shard Assignment

§Small Cache Exception

§Lock Contention Analysis

§Contention Sources

§Measured Contention

§Contention Mitigation

§Eviction Policy

§Global vs. Per-Shard LRU

§Example Eviction Scenario

§Cache Hit Rate Analysis

§Sequential Workloads

§Random Workloads

§Mixed Workloads

§Memory Overhead

§Per-Entry Overhead

§Total Memory Usage

§Shard Overhead

§Tuning Guidelines

§Choosing Cache Size

§Benchmark-Driven Tuning

§Shard Count Tuning

§Examples

§Basic Usage

§Memory-Constrained Configuration

§High-Concurrency Configuration

§Prefetch Integration

Structs§

Functions§

Module lru