dualcache-ff 0.2.3

A wait-free, high-performance concurrent cache optimized for extreme read-to-write ratios.
Documentation

DualCache-FF (Fast and Furious) 0.2.3

A highly opinionated, absolutely wait-free concurrent cache in Rust, optimized for extreme read-to-write ratios and scan-resistance. Now with full no_std support (including allocation-free static caches and zero-overhead stubs), progressive CPU spin-then-yield optimizations, graceful lifecycle shutdown, dynamic thread ID recycling, and zero external dependencies.

DualcacheFF is not a general-purpose cache. It is a specialized, high-density concurrent primitive built on CQRS (Command Query Responsibility Segregation), QSBR (Quiet State Based Reclamation), and a novel Avg-based Clock Eviction Algorithm.

By deliberately abandoning heavy API contracts (like strict linearizability and global LFU history) in favor of CPU spatial locality and wait-free semantics, DualcacheFF achieves up to 80x higher throughput than standard W-TinyLFU implementations (like Moka) under hostile workloads.

📊 Performance Benchmark Summary (0.2.3)

DualCacheFF v0.2.3 integrates Dynamic Thread ID Recycling, Cold-Start L1 Filter Bypass, and strict capacity budget alignments, along with robust static allocations for memory-constrained targets. It maintains a stable, verified cache hit rate under heavy concurrent read/write workloads at 85.65% (outperforming Moka and TinyUFO), while sustaining an extreme wait-free throughput of 55.3M - 86.8M ops/s (up to 45x faster than Moka and 18x faster than TinyUFO). For the full detailed analysis, see PERF.md.

Throughput & Hit Rate (50% Read / 50% Write Zipf Workload)

Cache Throughput (ops/s) Hit Rate
DualCacheFF (v0.2.3) 78,964,875 85.66%
TinyUFO 10,514,318 81.58%
Moka 2,980,172 82.16%

Latency Distribution (Zipf, 2M ops)

Metric DualCacheFF Moka TinyUFO
P50 42 ns 292 ns 84 ns
P99 458 ns 9,083 ns 2,125 ns
P99.9 1,083 ns 97,500 ns 6,041 ns

Memory Efficiency (per item, 1M items)

Cache Bytes per Item
DualCacheFF 50.32 B
TinyUFO 201.22 B
Moka 232.00 B

Why the leap in 0.2.0? By replacing standard channels with a State-Gated LossyQueue, we eliminated the Parker/Unparker wake-up overhead and false sharing in TLS buffers. The read path remains 100% wait-free.

🔌 Embedded no_std Options (v0.2.3)

In addition to the high-performance concurrent dynamic DualCacheFF, version 0.2.3 introduces two specialized interfaces under the static_cache module for memory-constrained and bare-metal environments:

  1. StaticDualCache<K, V, const N: usize> (Zero-Allocation Static Cache):
    • 100% alloc-free: Avoids dynamic heap allocations and fragmentation.
    • Thread-Safe: Uses slot-level isolation via atomic spinlocks.
    • API Parity: Fully supports get, insert, remove, clear, and sync.
  2. DualCacheStub<K, V> (Zero-Overhead Facade):
    • Direct No-Op: All methods are marked #[inline(always)] and evaluate to no-ops.
    • Zero Physical Footprint: Optimizes to exactly 0 bytes and 0 CPU cycles in Release mode, allowing caching to be compiled out cleanly.

⚖️ The 0.2.0 Evolution

  1. Zero External Dependencies: Removed crossbeam-channel and crossbeam-utils. All concurrency primitives (LossyQueue, CachePadded, OneshotAck) are now custom-built for zero overhead.
  2. no_std & Embedded Support: The core engine is now no_std compatible (requires alloc). Optimized for RTOS environments where the Daemon can be scheduled as a manual task.
  3. State-Gated Lossy Pipeline: Uses a 3-state turnstile (EMPTY/WRITING/READY) to ensure the Daemon never reads partially written data while maintaining a completely non-blocking writer path.
  4. Time-Based TLS Flush: Introduced daemon_tick to force-flush TLS buffers even under low write pressure, eliminating the "split-brain" visibility lag in high-concurrency scenarios.

🧠 Internal Architecture

1. Three-Tier Promotion System

  • T1 (Hot Cache): A high-speed AtomicPtr slot array mapping to Cache indices. Holds the most frequently accessed nodes for instant lookup.
  • T2 (Secondary Filter): A larger slot array for capturing secondary heat patterns.
  • Cache (Main Storage): The source of truth. Uses an open-addressed index (Linear Probing) and a flat AtomicPtr<Node> array for zero-allocation wait-free reads.

2. The Circular Clock with Revolution Shield

  • Revolution Shield: Hit items are refilled to MAX_RANK (3), granting them multiple sweeps of guaranteed survival.
  • Avg-based Decay: Elements with rank below the global average are evicted; others are decayed, ensuring instant adaptation to workload shifts.

3. High-Performance Admission

  • Ghost Set Resurrection: Evicted items leave a 16-bit fingerprint. Re-inserting a "ghost" item bypasses all filters for immediate promotion.
  • TLS Local Probation: A zero-contention 4KB Bloom-like filter in every thread filters out "one-hit wonders" before they reach the Daemon.
  • Batched Telemetry: Hits are buffered in TLS (64 items) and flushed as a single batch using our custom LossyQueue.

📜 License

This project is licensed under the MIT License.


Project co-developed and optimized with Antigravity.