Crate nexus_queue

Traditional SPSC ring buffers coordinate through shared head and tail indices. This implementation uses per-slot lap counters instead:

Per-slot sequencing (nexus):
┌──────────────────────────────────────────────────────────┐
│ buffer[0]: { lap: AtomicUsize, data: T }                 │
│ buffer[1]: { lap: AtomicUsize, data: T }                 │
│ ...                                                      │
└──────────────────────────────────────────────────────────┘
   Lap counter + data on SAME cache line
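
In Rust terms, each slot might look roughly like this (an illustrative sketch; the names and layout are assumptions, not the crate's actual definitions):

use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;

// Lap counter and payload side by side, so checking readiness and touching
// the data usually hit the same cache line.
struct Slot<T> {
    lap: AtomicUsize,                 // which lap of the ring last wrote this slot
    data: UnsafeCell<MaybeUninit<T>>, // payload, written before the lap is published
}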

§Why Per-Slot Sequencing Wins

§1. Cache Locality

When checking if a slot is ready, we load slot.lap. The subsequent read of slot.data is on the same cache line - already in L1. Traditional designs require fetching from 3 separate cache lines (head, tail, data).

§2. No Stale Cache Problem

Traditional designs cache the remote index to avoid atomic loads:

// Traditional push (rtrb-style): the remote index is cached locally so the
// common case avoids an atomic load.
if head - cached_tail >= capacity {
    cached_tail = tail.load(Ordering::Acquire);  // "slow path": refresh the cached index
    if head - cached_tail >= capacity {
        return Err(Full);
    }
}

In ping-pong benchmarks (1 message in flight), the cache is always stale, so every operation hits the “slow path”. Per-slot sequencing has no cache to refresh - just check the slot directly.
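
A pop check in that style might look like this (a hypothetical helper built on the Slot sketch above, using one possible lap convention):

use std::sync::atomic::{fence, Ordering};

// Hypothetical: try to take the value at index `head`, where `shift` is
// log2(capacity) and a published slot carries lap == head/capacity + 1.
// There is no cached remote index to refresh - only the slot's own counter.
unsafe fn try_pop_slot<T>(slot: &Slot<T>, head: usize, shift: usize) -> Option<T> {
    if slot.lap.load(Ordering::Relaxed) != (head >> shift) + 1 {
        return None;          // producer has not published this slot yet
    }
    fence(Ordering::Acquire); // pairs with the producer's release fence
    Some((*slot.data.get()).assume_init_read())
}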

§3. Simpler Control Flow

Fewer branches mean better branch prediction. Our branch miss rate is ~1.6%, comparable to rtrb's, but spread over fewer total branches, so fewer mispredictions overall.

§Optimization Journey

We started by cloning rtrb’s design, then systematically tested changes. Using rtrb as baseline (p50 ≈ 200 cycles):

Change                   Result     Cycles        Notes
Clone rtrb exactly       Baseline   ~200          Starting point
Per-slot lap counters    +25%       ~150          Biggest win
Division → bit shift     +15%       ~150 → ~130   tail / cap → tail >> shift
repr(C) field ordering   +5%        -             Hot fields first
Manual fencing           ~0%        -             No change on x86, helps ARM
Const generics           -20%       -             Surprisingly worse!
CachePadded slots        ~0%        -             Not needed for our layout

§What Worked

Per-slot sequencing: The single biggest win. Eliminated the stale cache problem entirely and improved cache locality.

Bit shift for lap calculation: Division is 20-90 cycles on x86. Since capacity is always a power of two, tail / capacity becomes tail >> shift. Saved ~20-50 cycles per operation.
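
Concretely, under the power-of-two assumption (a small illustration, not the crate's code):

// With a power-of-two capacity, division and modulo reduce to a shift and a mask.
let capacity: usize = 1024;
let shift = capacity.trailing_zeros() as usize;  // 10 for a 1024-slot buffer
let mask = capacity - 1;

let tail: usize = 5_000;
assert_eq!(tail / capacity, tail >> shift);      // lap number without a divide
assert_eq!(tail % capacity, tail & mask);        // slot index without a modulo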

Field ordering with repr(C): Placing hot-path fields (local_tail, buffer, mask) first improves prefetching. ~5% throughput improvement.
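
A sketch of that ordering, reusing the hypothetical Slot type above (field names are illustrative, not the crate's real struct):

// #[repr(C)] makes declaration order the layout order, so the fields touched
// on every push sit together at the start of the struct.
#[repr(C)]
struct ProducerSketch<T> {
    local_tail: usize,    // hot: read and written on every push
    buffer: *mut Slot<T>, // hot: dereferenced on every push
    mask: usize,          // hot: slot index = tail & mask
    shift: usize,         // hot: lap = tail >> shift
    capacity: usize,      // colder: rarely needed after construction
    // ...cold shared-state handles follow
}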

Instruction ordering: Updating local_tail after the atomic store gives cache coherency a head start, reducing latency variance.
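
In sketch form (a hypothetical end-of-push helper built on the Slot type above):

use std::sync::atomic::{fence, Ordering};

// Hypothetical end of push: publish to the shared slot first, then do the
// producer-local bookkeeping, so the lap store can start propagating while
// we touch only non-shared state.
unsafe fn publish_slot<T>(slot: &Slot<T>, local_tail: &mut usize, tail: usize, shift: usize, value: T) {
    (*slot.data.get()).write(value);                        // 1. write the payload
    fence(Ordering::Release);                               // 2. payload happens-before the lap store
    slot.lap.store((tail >> shift) + 1, Ordering::Relaxed); // 3. publish the slot
    *local_tail = tail + 1;                                 // 4. local update last
}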

§What Didn’t Work

Const generics for capacity: We expected const CAP: usize to help by making the mask a compile-time constant. Instead, throughput regressed 20%! Likely causes: monomorphization code bloat, different inlining decisions.

CachePadded lap counters: With small T, multiple slots share a cache line. We tested padding each slot to 64 bytes. No improvement - the true sharing on the active slot dominates, not false sharing between slots.

Cached indices: The traditional “optimization” that rtrb uses. In our testing, it’s actually slower for latency-sensitive workloads because the cache is always stale in ping-pong scenarios.

§Example

use nexus_queue;

let (mut producer, mut consumer) = nexus_queue::ring_buffer::<u64>(1024);

// Bounded push - returns error if full
producer.push(1).unwrap();
producer.push(2).unwrap();

assert_eq!(consumer.pop(), Some(1));
assert_eq!(consumer.pop(), Some(2));

§Benchmarking Methodology

Latency benchmarks use ping-pong between two threads pinned to separate physical cores (avoiding hyperthreading). We measure round-trip time with rdtscp and divide by 2 for one-way latency. 10,000 warmup iterations followed by 100,000 measured samples.
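
The round-trip measurement is roughly this shape (a sketch, assuming x86-64 and the `__rdtscp` intrinsic; the actual harness may differ):

#[cfg(target_arch = "x86_64")]
fn timestamp() -> u64 {
    let mut aux = 0u32;
    // RDTSCP waits for earlier instructions to finish before reading the TSC.
    // Assumes a CPU with the rdtscp feature (virtually all modern x86-64 parts).
    unsafe { core::arch::x86_64::__rdtscp(&mut aux) }
}

// Ping-pong loop, in outline (queues in each direction are hypothetical here):
// let start = timestamp();
// to_peer.push(i).unwrap();                // ping
// while from_peer.pop().is_none() {}       // spin until the pong arrives
// let sample = (timestamp() - start) / 2;  // halve round-trip for one-way latency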

Throughput benchmarks measure time to transfer 10 million messages through a 1024-slot buffer with producer and consumer on separate cores.

All benchmarks run with turbo boost disabled for consistent results:

echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
sudo taskset -c 0,2 ./target/release/deps/perf_spsc_bounded_latency-*

§Memory Ordering

The implementation uses manual fencing for clarity and portability:

  • Producer: fence(Release) before storing lap counter
  • Consumer: fence(Acquire) after loading lap counter

On x86, these compile to no instructions (strong memory model), but they’re required for correctness on ARM and other weakly-ordered architectures.
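
A minimal, self-contained illustration of that pairing (not the crate's code; an atomic stands in for the slot payload):

use std::sync::atomic::{fence, AtomicU64, AtomicUsize, Ordering};
use std::thread;

// `DATA` plays the role of slot.data and `LAP` of slot.lap.
static DATA: AtomicU64 = AtomicU64::new(0);
static LAP: AtomicUsize = AtomicUsize::new(0);

let producer = thread::spawn(|| {
    DATA.store(42, Ordering::Relaxed);  // write the "payload"
    fence(Ordering::Release);           // payload happens-before the lap store
    LAP.store(1, Ordering::Relaxed);    // publish
});

let consumer = thread::spawn(|| {
    while LAP.load(Ordering::Relaxed) != 1 {}  // wait for the publication
    fence(Ordering::Acquire);                  // pairs with the producer's release fence
    assert_eq!(DATA.load(Ordering::Relaxed), 42);
});

producer.join().unwrap();
consumer.join().unwrap();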

§When to Use This vs Other Queues

Use nexus::spsc when:

  • You have exactly one producer and one consumer
  • You need the lowest possible latency
  • You want both bounded and overwriting modes in one queue

Consider alternatives when:

  • Multiple producers → use MPSC
  • Multiple consumers → use SPMC or MPMC
  • You need async/await integration → use tokio::sync::mpsc

Structs§

Consumer
The consumer half of an SPSC ring buffer.
Full
Error returned when the ring buffer is full.
Producer
The producer half of an SPSC ring buffer.

Functions§

ring_buffer
Creates a new SPSC ring buffer with the given capacity.