DualCache-FF (Fast and Furious) 0.2.3
A highly opinionated, absolutely wait-free concurrent cache in Rust, optimized for extreme read-to-write ratios and scan-resistance. Now with full
no_stdsupport (including allocation-free static caches and zero-overhead stubs), progressive CPU spin-then-yield optimizations, graceful lifecycle shutdown, dynamic thread ID recycling, and zero external dependencies.
DualcacheFF is not a general-purpose cache. It is a specialized, high-density concurrent primitive built on CQRS (Command Query Responsibility Segregation), QSBR (Quiet State Based Reclamation), and a novel Avg-based Clock Eviction Algorithm.
By deliberately abandoning heavy API contracts (like strict linearizability and global LFU history) in favor of CPU spatial locality and wait-free semantics, DualcacheFF achieves up to 80x higher throughput than standard W-TinyLFU implementations (like Moka) under hostile workloads.
📊 Performance Benchmark Summary (0.2.3)
DualCacheFF v0.2.3 integrates Dynamic Thread ID Recycling, Cold-Start L1 Filter Bypass, and strict capacity budget alignments, along with robust static allocations for memory-constrained targets. It maintains a stable, verified cache hit rate under heavy concurrent read/write workloads at 85.65% (outperforming Moka and TinyUFO), while sustaining an extreme wait-free throughput of 55.3M - 86.8M ops/s (up to 45x faster than Moka and 18x faster than TinyUFO). For the full detailed analysis, see PERF.md.
Throughput & Hit Rate (50% Read / 50% Write Zipf Workload)
| Cache | Throughput (ops/s) | Hit Rate |
|---|---|---|
| DualCacheFF (v0.2.3) | 78,964,875 | 85.66% |
| TinyUFO | 10,514,318 | 81.58% |
| Moka | 2,980,172 | 82.16% |
Latency Distribution (Zipf, 2M ops)
| Metric | DualCacheFF | Moka | TinyUFO |
|---|---|---|---|
| P50 | 42 ns | 292 ns | 84 ns |
| P99 | 458 ns | 9,083 ns | 2,125 ns |
| P99.9 | 1,083 ns | 97,500 ns | 6,041 ns |
Memory Efficiency (per item, 1M items)
| Cache | Bytes per Item |
|---|---|
| DualCacheFF | 50.32 B |
| TinyUFO | 201.22 B |
| Moka | 232.00 B |
Why the leap in 0.2.0? By replacing standard channels with a State-Gated LossyQueue, we eliminated the Parker/Unparker wake-up overhead and false sharing in TLS buffers. The read path remains 100% wait-free.
🔌 Embedded no_std Options (v0.2.3)
In addition to the high-performance concurrent dynamic DualCacheFF, version 0.2.3 introduces two specialized interfaces under the static_cache module for memory-constrained and bare-metal environments:
StaticDualCache<K, V, const N: usize>(Zero-Allocation Static Cache):- 100%
alloc-free: Avoids dynamic heap allocations and fragmentation. - Thread-Safe: Uses slot-level isolation via atomic spinlocks.
- API Parity: Fully supports
get,insert,remove,clear, andsync.
- 100%
DualCacheStub<K, V>(Zero-Overhead Facade):- Direct No-Op: All methods are marked
#[inline(always)]and evaluate to no-ops. - Zero Physical Footprint: Optimizes to exactly 0 bytes and 0 CPU cycles in Release mode, allowing caching to be compiled out cleanly.
- Direct No-Op: All methods are marked
⚖️ The 0.2.0 Evolution
- Zero External Dependencies: Removed
crossbeam-channelandcrossbeam-utils. All concurrency primitives (LossyQueue, CachePadded, OneshotAck) are now custom-built for zero overhead. no_std& Embedded Support: The core engine is nowno_stdcompatible (requiresalloc). Optimized for RTOS environments where the Daemon can be scheduled as a manual task.- State-Gated Lossy Pipeline: Uses a 3-state turnstile (EMPTY/WRITING/READY) to ensure the Daemon never reads partially written data while maintaining a completely non-blocking writer path.
- Time-Based TLS Flush: Introduced
daemon_tickto force-flush TLS buffers even under low write pressure, eliminating the "split-brain" visibility lag in high-concurrency scenarios.
🧠 Internal Architecture
1. Three-Tier Promotion System
- T1 (Hot Cache): A high-speed
AtomicPtrslot array mapping to Cache indices. Holds the most frequently accessed nodes for instant lookup. - T2 (Secondary Filter): A larger slot array for capturing secondary heat patterns.
- Cache (Main Storage): The source of truth. Uses an open-addressed index (Linear Probing) and a flat
AtomicPtr<Node>array for zero-allocation wait-free reads.
2. The Circular Clock with Revolution Shield
- Revolution Shield: Hit items are refilled to
MAX_RANK(3), granting them multiple sweeps of guaranteed survival. - Avg-based Decay: Elements with rank below the global average are evicted; others are decayed, ensuring instant adaptation to workload shifts.
3. High-Performance Admission
- Ghost Set Resurrection: Evicted items leave a 16-bit fingerprint. Re-inserting a "ghost" item bypasses all filters for immediate promotion.
- TLS Local Probation: A zero-contention 4KB Bloom-like filter in every thread filters out "one-hit wonders" before they reach the Daemon.
- Batched Telemetry: Hits are buffered in TLS (64 items) and flushed as a single batch using our custom
LossyQueue.
📜 License
This project is licensed under the MIT License.
Project co-developed and optimized with Antigravity.