dualcache-ff 0.2.2

A wait-free, high-performance concurrent cache optimized for extreme read-to-write ratios.
Documentation
# DualCache-FF Specification (v0.2.2)

---

### 1. Core Facade and Global Configuration (`lib.rs`)

```rust
pub struct Config {
    pub capacity: usize,
    pub t1_slots: usize,
    pub t2_slots: usize,
    pub duration: u32,
    pub threads: usize,
    pub poll_us: u64,             // Daemon polling interval (µs)
    pub flush_tick_threshold: u64, // TLS forced flush threshold (ticks)
}

pub struct DualCacheFF<K, V, S = RandomState> {
    pub hasher: S,
    pub t1: Arc<T1<K, V>>,
    pub t2: Arc<T2<K, V>>,
    pub cache: Arc<Cache<K, V>>,
    pub cmd_tx: Arc<LossyQueue<Command<K, V>>>, // Custom Wait-Free queue
    pub hit_tx: Arc<LossyQueue<[usize; 64]>>,
    pub epoch: Arc<AtomicU32>,
    pub worker_states: Arc<[WorkerState]>,
    pub miss_buffers: Arc<[WorkerSlot<K, V>]>,
    pub daemon_tick: Arc<AtomicU64>,             // Background counter
}
```

### 2. Memory Access Path (`storage.rs`, `arena.rs`, `cache_padded.rs`)

**Physical Constraints:**
- **Alignment**: Uses `cache_padded::CachePadded` to prevent False Sharing on `WorkerState`. ARM/Apple Silicon uses 128-byte alignment.
- **Wait-Free Read**: The read path is 100% lock-free, based on `AtomicPtr` and **QSBR (Quiet State Based Reclamation)**.
- **no_std**: The core engine code is fully compatible with `no_std` + `alloc`.

```rust
// ─── Node: Physical Data Node ───
pub struct Node<K, V> {
    pub key: K,
    pub value: V,
    pub expire_at: u32,
    pub g_idx: u32,
}

// ─── LossyQueue: State Turnstile Queue ───
// Status Codes: EMPTY(0) -> WRITING(1) -> READY(2)
pub struct LossyQueue<T> {
    tail: AtomicUsize, // Producer FAA
    head: AtomicUsize, // Consumer
    buffer: Box<[Slot<T>]>,
}
```

### 3. Lifecycle Management (`daemon.rs`)

```rust
pub struct Daemon<K, V, S> {
    pub arena: Arena,
    pub cmd_rx: Arc<LossyQueue<Command<K, V>>>,
    pub hit_rx: Arc<LossyQueue<[usize; 64]>>,
    pub daemon_tick: Arc<AtomicU64>,
    // ...
}
```

---

# Execution Flow

## Physical Constraints and Concurrency Model
*   **Worker (Frontend)**: 100% Wait-Free. Employs `L1_FILTER` for local admission control.
*   **Time-based Flush**: The worker compares its local `LAST_FLUSH_TICK` with the global `daemon_tick`. If the interval exceeds the threshold, it forces a flush of the TLS buffer to resolve visibility delays under low-frequency write scenarios.
*   **Daemon (Background)**: The single writer. Executes maintenance tasks at intervals defined by `poll_us`. In `no_std` mode, this can be driven externally by an RTOS scheduler.

---

## Phase 1: Frontend Read Pipeline (Worker Read Path)
1.  **[QSBR Check-in]**: Marks the `local_epoch` with the current global epoch.
2.  **[Hierarchical Lookup]**: Searches T1 (L1) -> T2 (L2) -> Cache (L3).
3.  **[QSBR Check-out]**: Clears the `local_epoch` marker (resets to 0).
4.  **[Hit Telemetry]**: Buffers the hit index in TLS. Flushes to `hit_tx` when the buffer reaches 64 elements or the flush threshold expires.

---

## Phase 2: Background Maintenance Pipeline (Daemon Maintenance)
1.  **[Hit Resolution]**: Consumes hit indices in batches from `hit_rx`, updates the `Arena` ranks, and populates T1/T2.
2.  **[QSBR Reclamation]**: Checks all registered worker epochs and safely deallocates expired nodes in the `garbage_queue`.
3.  **[Tick Advancement]**: Increments `daemon_tick` to trigger frontend TLS forced flushes.

---

## Phase 3: Background Eviction Pipeline (Daemon Eviction)
1.  **[Averaged Scan]**: Scans the `Arena` using a circular Clock algorithm.
2.  **[Revolution Shield]**: High-frequency items gain rank protection. Nodes with ranks below the global average are evicted into the `garbage_queue`.

---

## Phase 4: Background Insertion Pipeline (Daemon Insertion)
1.  **[Admission Control]**: Checks the `Ghost Set` (Resurrection filter).
2.  **[Lossy Enqueue]**: Pushes insertion requests through the MPSC `LossyQueue`. If `compare_exchange` fails due to queue saturation, the request is immediately dropped to protect frontend latency.

---

## Phase 5: Lifecycle Destruction & CPU Yielding
1. **[Lock-Free Safe Destruction]**: The `Drop` implementation of `DualCacheFF` monitors `Arc` strong counts. When the last controller instance is dropped, it automatically dispatches `Command::Shutdown` to `cmd_tx` to safely terminate the background Daemon's polling loop.
2. **[Zero-Leak Memory Safety]**:
   - The `Drop` implementation of `Cache` iterates through all internal pointers and safely deallocates (`Box::from_raw`) any remaining physical `Node` elements.
   - The `Drop` implementation of `Daemon` safely deallocates all pending `Node` memory held in the `garbage_queue` awaiting QSBR reclamation.
3. **[Progressive CPU Yielding (Spin-then-Yield)]**: In `std` mode, `OneshotAck::wait()` and `LossyQueue::send_blocking()` utilize progressive spinning. The first 100 iterations use `core::hint::spin_loop()`, after which they transition to yielding via `std::thread::yield_now()` to prevent high CPU utilization (busy waiting) under extreme contention. In `no_std` mode, they maintain pure wait-free spin behavior.

---

## Phase 6: Concurrency Safety, Loom Checking & Timeout Watchdogs

1. **[Loom Coroutine Stack Overflow Prevention]**:
   - In `loom` model checking, the constrained virtual coroutine stack (2 KB - 4 KB) is highly susceptible to stack overflows when copying large structures like `BatchBuf` (`[MaybeUninit; 32]`, ~768+ bytes) in `new_headless`.
   - The framework has been updated under the `loom` feature flag to heap-allocate `BatchBuf` inside `WorkerSlot` via `UnsafeCell<Box<BatchBuf<K, V>>>`. By accessing the buffer via `**ptr` under Loom, the hot read path's API remains perfectly compatible while completely resolving virtual stack overflows.
2. **[Active Timeout Watchdogs]**:
   - **Integration Tests**: All 6 integration test suites (`concurrent`, `pressure`, `robust`, `stability`, `test_hash`, `unsafe_spec`) are wrapped in a custom watchdog executor (`run_with_timeout`) with limits ranging from 5 to 30 seconds. Exceeding these limits triggers an immediate panic to prevent indefinite hangs in case of deadlocks or livelocks.
   - **Benchmarks**: All 5 benchmark suites (`capex`, `latency`, `memory`, `throughput`, `rw_ratio`) spawn a background asynchronous watchdog (`start_timeout_watchdog`) upon startup (120s limit for `rw_ratio`, 60s for others). If a benchmark hangs, the watchdog terminates the process via `std::process::exit(101)` to prevent zombie processes.

---

## Phase 7: Dynamic Thread ID Recycling, Cold Start & Hit Rate Restoration

1. **[Dynamic Thread ID Recycling]**:
   - **Context**: Previously, thread-local `WORKER_ID` allocations relied on a simple monotonic atomic increment (`fetch_add(1)`). In test and benchmark runs where threads are repeatedly spawned and destroyed, IDs exceeded `config.threads`, causing those threads to degrade to "overflow threads" that bypass L1 admission control and hit rate tracking.
   - **Optimization**: We introduced a global `IdAllocator` and a thread-local `ThreadIdGuard` implementing `Drop`. When a worker thread terminates, its ID is automatically pushed back to a global thread-safe `free_list` for reuse. This keeps active concurrent IDs strictly bounded by the physical concurrent thread peak, eliminating overflow degradation.
2. **[Cold Start & Update Lookup Bypass]**:
   - Under heavy write ratios, the L1 probation filter (designed to drop one-hit wonders) would incorrectly drop high-frequency update elements before they reached the Daemon, hurting hit rates.
   - We introduced a global `is_cold_start` flag. When the cache is in a cold-start phase, or if the key already exists in `t1`, `t2`, or `cache` (L3), the frontend `insert` bypasses the L1 filter entirely, permitting direct insertion/updates. This restored hit rates to their peak of **84.5% - 84.7%** under all workloads (Zipf and varied read/write ratios).
3. **[Graceful Overflow Safety]**:
   - Lookups and insertions on T1, T2, and Cache are strictly guarded under `if id_opt.is_some() { ... }`.
   - If an unexpected thread burst exceeds the registered `config.threads` capacity, the thread gracefully degrades to a safe cache miss with lossy background updates, fully preserving memory safety and eliminating Use-After-Free (UAF) risks.