# DualCache-FF Specification (v0.2.1)
---
### 1. Core Facade and Global Configuration (`lib.rs`)
```rust
pub struct Config {
pub capacity: usize,
pub t1_slots: usize,
pub t2_slots: usize,
pub duration: u32,
pub threads: usize,
pub poll_us: u64, // Daemon polling interval (µs)
pub flush_tick_threshold: u64, // TLS forced flush threshold (ticks)
}
pub struct DualCacheFF<K, V, S = RandomState> {
pub hasher: S,
pub t1: Arc<T1<K, V>>,
pub t2: Arc<T2<K, V>>,
pub cache: Arc<Cache<K, V>>,
pub cmd_tx: Arc<LossyQueue<Command<K, V>>>, // Custom Wait-Free queue
pub hit_tx: Arc<LossyQueue<[usize; 64]>>,
pub epoch: Arc<AtomicU32>,
pub worker_states: Arc<[WorkerState]>,
pub miss_buffers: Arc<[WorkerSlot<K, V>]>,
pub daemon_tick: Arc<AtomicU64>, // Background counter
}
```
### 2. Memory Access Path (`storage.rs`, `arena.rs`, `cache_padded.rs`)
**Physical Constraints:**
- **Alignment**: Uses `cache_padded::CachePadded` to prevent False Sharing on `WorkerState`. ARM/Apple Silicon uses 128-byte alignment.
- **Wait-Free Read**: The read path is 100% lock-free, based on `AtomicPtr` and **QSBR (Quiet State Based Reclamation)**.
- **no_std**: The core engine code is fully compatible with `no_std` + `alloc`.
```rust
// ─── Node: Physical Data Node ───
pub struct Node<K, V> {
pub key: K,
pub value: V,
pub expire_at: u32,
pub g_idx: u32,
}
// ─── LossyQueue: State Turnstile Queue ───
// Status Codes: EMPTY(0) -> WRITING(1) -> READY(2)
pub struct LossyQueue<T> {
tail: AtomicUsize, // Producer FAA
head: AtomicUsize, // Consumer
buffer: Box<[Slot<T>]>,
}
```
### 3. Lifecycle Management (`daemon.rs`)
```rust
pub struct Daemon<K, V, S> {
pub arena: Arena,
pub cmd_rx: Arc<LossyQueue<Command<K, V>>>,
pub hit_rx: Arc<LossyQueue<[usize; 64]>>,
pub daemon_tick: Arc<AtomicU64>,
// ...
}
```
---
# Execution Flow
## Physical Constraints and Concurrency Model
* **Worker (Frontend)**: 100% Wait-Free. Employs `L1_FILTER` for local admission control.
* **Time-based Flush**: The worker compares its local `LAST_FLUSH_TICK` with the global `daemon_tick`. If the interval exceeds the threshold, it forces a flush of the TLS buffer to resolve visibility delays under low-frequency write scenarios.
* **Daemon (Background)**: The single writer. Executes maintenance tasks at intervals defined by `poll_us`. In `no_std` mode, this can be driven externally by an RTOS scheduler.
---
## Phase 1: Frontend Read Pipeline (Worker Read Path)
1. **[QSBR Check-in]**: Marks the `local_epoch` with the current global epoch.
2. **[Hierarchical Lookup]**: Searches T1 (L1) -> T2 (L2) -> Cache (L3).
3. **[QSBR Check-out]**: Clears the `local_epoch` marker (resets to 0).
4. **[Hit Telemetry]**: Buffers the hit index in TLS. Flushes to `hit_tx` when the buffer reaches 64 elements or the flush threshold expires.
---
## Phase 2: Background Maintenance Pipeline (Daemon Maintenance)
1. **[Hit Resolution]**: Consumes hit indices in batches from `hit_rx`, updates the `Arena` ranks, and populates T1/T2.
2. **[QSBR Reclamation]**: Checks all registered worker epochs and safely deallocates expired nodes in the `garbage_queue`.
3. **[Tick Advancement]**: Increments `daemon_tick` to trigger frontend TLS forced flushes.
---
## Phase 3: Background Eviction Pipeline (Daemon Eviction)
1. **[Averaged Scan]**: Scans the `Arena` using a circular Clock algorithm.
2. **[Revolution Shield]**: High-frequency items gain rank protection. Nodes with ranks below the global average are evicted into the `garbage_queue`.
---
## Phase 4: Background Insertion Pipeline (Daemon Insertion)
1. **[Admission Control]**: Checks the `Ghost Set` (Resurrection filter).
2. **[Lossy Enqueue]**: Pushes insertion requests through the MPSC `LossyQueue`. If `compare_exchange` fails due to queue saturation, the request is immediately dropped to protect frontend latency.
---
## Phase 5: Lifecycle Destruction & CPU Yielding
1. **[Lock-Free Safe Destruction]**: The `Drop` implementation of `DualCacheFF` monitors `Arc` strong counts. When the last controller instance is dropped, it automatically dispatches `Command::Shutdown` to `cmd_tx` to safely terminate the background Daemon's polling loop.
2. **[Zero-Leak Memory Safety]**:
- The `Drop` implementation of `Cache` iterates through all internal pointers and safely deallocates (`Box::from_raw`) any remaining physical `Node` elements.
- The `Drop` implementation of `Daemon` safely deallocates all pending `Node` memory held in the `garbage_queue` awaiting QSBR reclamation.
3. **[Progressive CPU Yielding (Spin-then-Yield)]**: In `std` mode, `OneshotAck::wait()` and `LossyQueue::send_blocking()` utilize progressive spinning. The first 100 iterations use `core::hint::spin_loop()`, after which they transition to yielding via `std::thread::yield_now()` to prevent high CPU utilization (busy waiting) under extreme contention. In `no_std` mode, they maintain pure wait-free spin behavior.
---
## Phase 6: Concurrency Safety, Loom Checking & Timeout Watchdogs
1. **[Loom Coroutine Stack Overflow Prevention]**:
- In `loom` model checking, the constrained virtual coroutine stack (2 KB - 4 KB) is highly susceptible to stack overflows when copying large structures like `BatchBuf` (`[MaybeUninit; 32]`, ~768+ bytes) in `new_headless`.
- The framework has been updated under the `loom` feature flag to heap-allocate `BatchBuf` inside `WorkerSlot` via `UnsafeCell<Box<BatchBuf<K, V>>>`. By accessing the buffer via `**ptr` under Loom, the hot read path's API remains perfectly compatible while completely resolving virtual stack overflows.
2. **[Active Timeout Watchdogs]**:
- **Integration Tests**: All 6 integration test suites (`concurrent`, `pressure`, `robust`, `stability`, `test_hash`, `unsafe_spec`) are wrapped in a custom watchdog executor (`run_with_timeout`) with limits ranging from 5 to 30 seconds. Exceeding these limits triggers an immediate panic to prevent indefinite hangs in case of deadlocks or livelocks.
- **Benchmarks**: All 5 benchmark suites (`capex`, `latency`, `memory`, `throughput`, `rw_ratio`) spawn a background asynchronous watchdog (`start_timeout_watchdog`) upon startup (120s limit for `rw_ratio`, 60s for others). If a benchmark hangs, the watchdog terminates the process via `std::process::exit(101)` to prevent zombie processes.
---
## Phase 7: Dynamic Thread ID Recycling, Cold Start & Hit Rate Restoration
1. **[Dynamic Thread ID Recycling]**:
- **Context**: Previously, thread-local `WORKER_ID` allocations relied on a simple monotonic atomic increment (`fetch_add(1)`). In test and benchmark runs where threads are repeatedly spawned and destroyed, IDs exceeded `config.threads`, causing those threads to degrade to "overflow threads" that bypass L1 admission control and hit rate tracking.
- **Optimization**: We introduced a global `IdAllocator` and a thread-local `ThreadIdGuard` implementing `Drop`. When a worker thread terminates, its ID is automatically pushed back to a global thread-safe `free_list` for reuse. This keeps active concurrent IDs strictly bounded by the physical concurrent thread peak, eliminating overflow degradation.
2. **[Cold Start & Update Lookup Bypass]**:
- Under heavy write ratios, the L1 probation filter (designed to drop one-hit wonders) would incorrectly drop high-frequency update elements before they reached the Daemon, hurting hit rates.
- We introduced a global `is_cold_start` flag. When the cache is in a cold-start phase, or if the key already exists in `t1`, `t2`, or `cache` (L3), the frontend `insert` bypasses the L1 filter entirely, permitting direct insertion/updates. This restored hit rates to their peak of **84.5% - 84.7%** under all workloads (Zipf and varied read/write ratios).
3. **[Graceful Overflow Safety]**:
- Lookups and insertions on T1, T2, and Cache are strictly guarded under `if id_opt.is_some() { ... }`.
- If an unexpected thread burst exceeds the registered `config.threads` capacity, the thread gracefully degrades to a safe cache miss with lossy background updates, fully preserving memory safety and eliminating Use-After-Free (UAF) risks.