Crate nexus_queue

Traditional SPSC ring buffers coordinate through shared head and tail indices. This implementation uses per-slot lap counters instead:

Per-slot sequencing (nexus):
┌──────────────────────────────────────────────────────────┐
│ buffer[0]: { lap: AtomicUsize, data: T }                 │
│ buffer[1]: { lap: AtomicUsize, data: T }                 │
│ ...                                                      │
└──────────────────────────────────────────────────────────┘
   Lap counter + data on SAME cache line
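
In Rust terms, each slot might look roughly like this (an illustrative sketch; the names and layout are assumptions, not the crate's actual definitions):

use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;

// Lap counter and payload side by side, so checking readiness and touching
// the data usually hit the same cache line.
struct Slot<T> {
    lap: AtomicUsize,                 // which lap of the ring last wrote this slot
    data: UnsafeCell<MaybeUninit<T>>, // payload, written before the lap is published
}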

§Why Per-Slot Sequencing Wins

§1. Cache Locality

When checking if a slot is ready, we load slot.lap. The subsequent read of slot.data is on the same cache line - already in L1. Traditional designs require fetching from 3 separate cache lines (head, tail, data).

§2. No Stale Cache Problem

Traditional designs cache the remote index to avoid atomic loads:

// Traditional push (rtrb-style): the remote index is cached locally so the
// common case avoids an atomic load.
if head - cached_tail >= capacity {
    cached_tail = tail.load(Ordering::Acquire);  // "slow path": refresh the cached index
    if head - cached_tail >= capacity {
        return Err(Full);
    }
}

In ping-pong benchmarks (1 message in flight), the cache is always stale, so every operation hits the “slow path”. Per-slot sequencing has no cache to refresh - just check the slot directly.
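
A pop check in that style might look like this (a hypothetical helper built on the Slot sketch above, using one possible lap convention):

use std::sync::atomic::{fence, Ordering};

// Hypothetical: try to take the value at index `head`, where `shift` is
// log2(capacity) and a published slot carries lap == head/capacity + 1.
// There is no cached remote index to refresh - only the slot's own counter.
unsafe fn try_pop_slot<T>(slot: &Slot<T>, head: usize, shift: usize) -> Option<T> {
    if slot.lap.load(Ordering::Relaxed) != (head >> shift) + 1 {
        return None;          // producer has not published this slot yet
    }
    fence(Ordering::Acquire); // pairs with the producer's release fence
    Some((*slot.data.get()).assume_init_read())
}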

§3. Simpler Control Flow

Fewer branches mean better branch prediction. Our branch miss rate is ~1.6%, comparable to rtrb's, but spread over fewer total branches, so fewer mispredictions overall.

§Optimization Journey

We started by cloning rtrb’s design, then systematically tested changes. Using rtrb as baseline (p50 ≈ 200 cycles):

Change                   Result     Cycles        Notes
Clone rtrb exactly       Baseline   ~200          Starting point
Per-slot lap counters    +25%       ~150          Biggest win
Division → bit shift     +15%       ~150 → ~130   tail / cap → tail >> shift
repr(C) field ordering   +5%        -             Hot fields first
Manual fencing           ~0%        -             No change on x86, helps ARM
Const generics           -20%       -             Surprisingly worse!
CachePadded slots        ~0%        -             Not needed for our layout

§What Worked

Per-slot sequencing: The single biggest win. Eliminated the stale cache problem entirely and improved cache locality.

Bit shift for lap calculation: Division is 20-90 cycles on x86. Since capacity is always a power of two, tail / capacity becomes tail >> shift. Saved ~20-50 cycles per operation.
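
Concretely, under the power-of-two assumption (a small illustration, not the crate's code):

// With a power-of-two capacity, division and modulo reduce to a shift and a mask.
let capacity: usize = 1024;
let shift = capacity.trailing_zeros() as usize;  // 10 for a 1024-slot buffer
let mask = capacity - 1;

let tail: usize = 5_000;
assert_eq!(tail / capacity, tail >> shift);      // lap number without a divide
assert_eq!(tail % capacity, tail & mask);        // slot index without a modulo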

Field ordering with repr(C): Placing hot-path fields (local_tail, buffer, mask) first improves prefetching. ~5% throughput improvement.
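
A sketch of that ordering, reusing the hypothetical Slot type above (field names are illustrative, not the crate's real struct):

// #[repr(C)] makes declaration order the layout order, so the fields touched
// on every push sit together at the start of the struct.
#[repr(C)]
struct ProducerSketch<T> {
    local_tail: usize,    // hot: read and written on every push
    buffer: *mut Slot<T>, // hot: dereferenced on every push
    mask: usize,          // hot: slot index = tail & mask
    shift: usize,         // hot: lap = tail >> shift
    capacity: usize,      // colder: rarely needed after construction
    // ...cold shared-state handles follow
}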

Instruction ordering: Updating local_tail after the atomic store gives cache coherency a head start, reducing latency variance.
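
In sketch form (a hypothetical end-of-push helper built on the Slot type above):

use std::sync::atomic::{fence, Ordering};

// Hypothetical end of push: publish to the shared slot first, then do the
// producer-local bookkeeping, so the lap store can start propagating while
// we touch only non-shared state.
unsafe fn publish_slot<T>(slot: &Slot<T>, local_tail: &mut usize, tail: usize, shift: usize, value: T) {
    (*slot.data.get()).write(value);                        // 1. write the payload
    fence(Ordering::Release);                               // 2. payload happens-before the lap store
    slot.lap.store((tail >> shift) + 1, Ordering::Relaxed); // 3. publish the slot
    *local_tail = tail + 1;                                 // 4. local update last
}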

§What Didn’t Work

Const generics for capacity: We expected const CAP: usize to help by making the mask a compile-time constant. Instead, throughput regressed 20%! Likely causes: monomorphization code bloat, different inlining decisions.

CachePadded lap counters: With small T, multiple slots share a cache line. We tested padding each slot to 64 bytes. No improvement - the true sharing on the active slot dominates, not false sharing between slots.

Cached indices: The traditional “optimization” that rtrb uses. In our testing, it’s actually slower for latency-sensitive workloads because the cache is always stale in ping-pong scenarios.

§Example

use nexus_queue;

let (mut producer, mut consumer) = nexus_queue::ring_buffer::<u64>(1024);

// Bounded push - returns error if full
producer.push(1).unwrap();
producer.push(2).unwrap();

assert_eq!(consumer.pop(), Some(1));
assert_eq!(consumer.pop(), Some(2));

§Benchmarking Methodology

Latency benchmarks use ping-pong between two threads pinned to separate physical cores (avoiding hyperthreading). We measure round-trip time with rdtscp and divide by 2 for one-way latency. 10,000 warmup iterations followed by 100,000 measured samples.
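
The round-trip measurement is roughly this shape (a sketch, assuming x86-64 and the `__rdtscp` intrinsic; the actual harness may differ):

#[cfg(target_arch = "x86_64")]
fn timestamp() -> u64 {
    let mut aux = 0u32;
    // RDTSCP waits for earlier instructions to finish before reading the TSC.
    // Assumes a CPU with the rdtscp feature (virtually all modern x86-64 parts).
    unsafe { core::arch::x86_64::__rdtscp(&mut aux) }
}

// Ping-pong loop, in outline (queues in each direction are hypothetical here):
// let start = timestamp();
// to_peer.push(i).unwrap();                // ping
// while from_peer.pop().is_none() {}       // spin until the pong arrives
// let sample = (timestamp() - start) / 2;  // halve round-trip for one-way latency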

Throughput benchmarks measure time to transfer 10 million messages through a 1024-slot buffer with producer and consumer on separate cores.

All benchmarks run with turbo boost disabled for consistent results:

echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
sudo taskset -c 0,2 ./target/release/deps/perf_spsc_bounded_latency-*

§Memory Ordering

The implementation uses manual fencing for clarity and portability:

  • Producer: fence(Release) before storing lap counter
  • Consumer: fence(Acquire) after loading lap counter

On x86, these compile to no instructions (strong memory model), but they’re required for correctness on ARM and other weakly-ordered architectures.
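
A minimal, self-contained illustration of that pairing (not the crate's code; an atomic stands in for the slot payload):

use std::sync::atomic::{fence, AtomicU64, AtomicUsize, Ordering};
use std::thread;

// `DATA` plays the role of slot.data and `LAP` of slot.lap.
static DATA: AtomicU64 = AtomicU64::new(0);
static LAP: AtomicUsize = AtomicUsize::new(0);

let producer = thread::spawn(|| {
    DATA.store(42, Ordering::Relaxed);  // write the "payload"
    fence(Ordering::Release);           // payload happens-before the lap store
    LAP.store(1, Ordering::Relaxed);    // publish
});

let consumer = thread::spawn(|| {
    while LAP.load(Ordering::Relaxed) != 1 {}  // wait for the publication
    fence(Ordering::Acquire);                  // pairs with the producer's release fence
    assert_eq!(DATA.load(Ordering::Relaxed), 42);
});

producer.join().unwrap();
consumer.join().unwrap();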

§When to Use This vs Other Queues

Use nexus::spsc when:

  • You have exactly one producer and one consumer
  • You need the lowest possible latency
  • You want both bounded and overwriting modes in one queue

Consider alternatives when:

  • Multiple producers → use MPSC
  • Multiple consumers → use SPMC or MPMC
  • You need async/await integration → use tokio::sync::mpsc

Structs§

Consumer
The consumer half of an SPSC ring buffer.
Full
Error returned when the ring buffer is full.
Producer
The producer half of an SPSC ring buffer.

Functions§

ring_buffer
Creates a new SPSC ring buffer with the given capacity.