nexus-queue 0.1.0

High-performance lock-free ring buffer queues for trading systems

Single-Producer Single-Consumer (SPSC) ring buffer with per-slot sequencing.

This module provides a high-performance SPSC queue that outperforms traditional implementations like rtrb on both latency and throughput.

Performance Summary

Benchmarked on Intel Core Ultra 7 155H @ 4.8GHz, pinned to cores 0,2:

Metric          nexus spsc         rtrb               Improvement
Latency p50     ~152 cycles        ~202 cycles        25% faster
Latency p99     ~255 cycles        ~300 cycles        15% faster
Throughput      449 ms / 10M ops   587 ms / 10M ops   31% faster
IPC             0.19               0.14               36% better

At 4.8GHz, 152 cycles ≈ 32 nanoseconds one-way latency.

Design: Per-Slot Sequencing

Traditional SPSC queues (like rtrb) use separate atomic head/tail indices with cached copies to reduce atomic operations:

Traditional (rtrb-style):
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ head (atomic)   │  │ tail (atomic)   │  │ buffer[N]       │
│ cached_tail     │  │ cached_head     │  │ (just data)     │
└─────────────────┘  └─────────────────┘  └─────────────────┘
   Cache line 1         Cache line 2         Cache line 3+

This implementation uses per-slot lap counters instead:

Per-slot sequencing (nexus):
┌──────────────────────────────────────────────────────────┐
│ buffer[0]: { lap: AtomicUsize, data: T }                 │
│ buffer[1]: { lap: AtomicUsize, data: T }                 │
│ ...                                                      │
└──────────────────────────────────────────────────────────┘
   Lap counter + data on SAME cache line
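
For concreteness, a minimal sketch of what such a slot could look like; the field names are illustrative rather than the crate's actual definitions:

use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;

// Sketch of a per-slot layout: the lap counter sits next to the payload,
// so for small T the readiness check and the data share a cache line.
struct Slot<T> {
    lap: AtomicUsize,                 // sequencing: bumped when the slot changes hands
    data: UnsafeCell<MaybeUninit<T>>, // payload, written before the lap is published
}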

Why Per-Slot Sequencing Wins

1. Cache Locality

When checking if a slot is ready, we load slot.lap. The subsequent read of slot.data is on the same cache line - already in L1. Traditional designs require fetching from 3 separate cache lines (head, tail, data).
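A sketch of that read path, reusing the Slot layout above. The lap arithmetic is one possible scheme (the producer stores 2*lap + 1 when it fills a slot, the consumer stores 2*lap + 2 when it frees one), not necessarily the crate's exact protocol:

use std::sync::atomic::Ordering;

fn try_pop<T>(slots: &[Slot<T>], head: &mut usize, mask: usize, shift: u32) -> Option<T> {
    let slot = &slots[*head & mask];
    let lap = *head >> shift;
    // Readiness check: a single atomic load on the slot itself.
    if slot.lap.load(Ordering::Acquire) != 2 * lap + 1 {
        return None; // producer has not filled this slot for the current lap
    }
    // SAFETY (sketch): the check above means the producer fully initialised
    // the payload; it sits on the cache line the load just brought into L1.
    let value = unsafe { (*slot.data.get()).assume_init_read() };
    slot.lap.store(2 * lap + 2, Ordering::Release); // hand the slot back
    *head += 1;
    Some(value)
}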

2. No Stale Cache Problem

Traditional designs cache the remote index to avoid atomic loads:

// Traditional push (rtrb-style): compare against a cached copy of the
// consumer's index, and only reload the atomic when the queue looks full.
if head - cached_tail >= capacity {
    cached_tail = tail.load(Ordering::Acquire); // "slow path" refresh
    if head - cached_tail >= capacity {
        return Err(Full); // genuinely full
    }
}

In ping-pong benchmarks (1 message in flight), the cache is always stale, so every operation hits the "slow path". Per-slot sequencing has no cache to refresh - just check the slot directly.
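For contrast, the per-slot version of the same check, under the illustrative lap scheme from the pop sketch above; there is no cached index to refresh:

use std::sync::atomic::Ordering;

fn can_push<T>(slots: &[Slot<T>], tail: usize, mask: usize, shift: u32) -> bool {
    let slot = &slots[tail & mask];
    let lap = tail >> shift;                     // how many times the producer has wrapped
    slot.lap.load(Ordering::Acquire) == 2 * lap  // consumer has freed the slot for this lap
}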

3. Simpler Control Flow

Fewer branches mean better branch prediction. Our branch-miss rate (~1.6%) is similar to rtrb's, but it applies to fewer total branches, so each operation suffers fewer mispredictions overall.

Optimization Journey

We started by cloning rtrb's design, then systematically tested changes. Using rtrb as baseline (p50 ≈ 200 cycles):

Change                   Result     Cycles        Notes
Clone rtrb exactly       Baseline   ~200          Starting point
Per-slot lap counters    +25%       ~150          Biggest win
Division → bit shift     +15%       ~150 → ~130   tail/cap → tail>>shift
repr(C) field ordering   +5%        -             Hot fields first
Manual fencing           ~0%        -             No change on x86, helps ARM
Const generics           -20%       -             Surprisingly worse!
CachePadded slots        ~0%        -             Not needed for our layout

What Worked

Per-slot sequencing: The single biggest win. Eliminated the stale cache problem entirely and improved cache locality.

Bit shift for lap calculation: Division is 20-90 cycles on x86. Since capacity is always a power of two, tail / capacity becomes tail >> shift. Saved ~20-50 cycles per operation.
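As a concrete illustration (helper name and signature are hypothetical), the divide and modulo collapse to a shift and a mask when capacity is a power of two:

// Lap = how many full wraps, index = position within the buffer.
fn lap_and_index(pos: usize, capacity: usize) -> (usize, usize) {
    debug_assert!(capacity.is_power_of_two());
    let shift = capacity.trailing_zeros(); // e.g. 1024 -> 10
    let mask = capacity - 1;
    (pos >> shift, pos & mask)             // same as (pos / capacity, pos % capacity)
}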

Field ordering with repr(C): Placing hot-path fields (local_tail, buffer, mask) first improves prefetching. ~5% throughput improvement.
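An illustrative layout, reusing the Slot sketch from earlier; the field names are assumptions rather than the crate's actual ones:

// With repr(C), declaration order is memory order, so the fields read on
// every push share the leading cache line of the struct.
#[repr(C)]
struct Producer<T> {
    local_tail: usize,    // hot: advanced on every push
    buffer: *mut Slot<T>, // hot: dereferenced on every push
    mask: usize,          // hot: index arithmetic
    shift: u32,           // hot: lap arithmetic
    capacity: usize,      // cold: rarely needed after construction
}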

Instruction ordering: Updating local_tail after the atomic store lets the lap-counter store begin propagating to the other core while we finish local bookkeeping, which reduces latency variance.

What Didn't Work

Const generics for capacity: We expected const CAP: usize to help by making the mask a compile-time constant. Instead, throughput regressed 20%! Likely causes: monomorphization code bloat, different inlining decisions.
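For clarity, this is the shape of variant meant by "const generics for capacity" (a sketch, not the code that was tested): capacity becomes a type parameter, so the mask and shift are compile-time constants.

// Reuses the Slot sketch from earlier; names are illustrative.
struct ConstProducer<T, const CAP: usize> {
    local_tail: usize,
    buffer: *mut Slot<T>,
}

impl<T, const CAP: usize> ConstProducer<T, CAP> {
    const MASK: usize = CAP - 1;              // known at compile time
    const SHIFT: u32 = CAP.trailing_zeros();  // known at compile time
}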

CachePadded lap counters: With small T, multiple slots share a cache line. We tested padding each slot to 64 bytes. No improvement - the true sharing on the active slot dominates, not false sharing between slots.

Cached indices: The traditional "optimization" that rtrb uses. In our testing, it's actually slower for latency-sensitive workloads because the cache is always stale in ping-pong scenarios.

Example

// Create a bounded SPSC ring buffer with 1024 slots.
let (mut producer, mut consumer) = nexus_queue::ring_buffer::<u64>(1024);

// Bounded push - returns error if full
producer.push(1).unwrap();
producer.push(2).unwrap();

assert_eq!(consumer.pop(), Some(1));
assert_eq!(consumer.pop(), Some(2));
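
The queue is meant to be used across threads; a sketch of cross-thread usage follows, assuming the producer and consumer handles are Send and that push/pop behave as in the example above (the spin loops are for brevity only):

use std::thread;

let (mut producer, mut consumer) = nexus_queue::ring_buffer::<u64>(1024);

// Producer thread: spin when the buffer is momentarily full (a real producer
// might back off or yield instead).
let handle = thread::spawn(move || {
    for i in 0..100_000u64 {
        while producer.push(i).is_err() {}
    }
});

// Consume on this thread until every message has arrived, checking order.
let mut expected = 0u64;
while expected < 100_000 {
    if let Some(v) = consumer.pop() {
        assert_eq!(v, expected);
        expected += 1;
    }
}
handle.join().unwrap();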

Benchmarking Methodology

Latency benchmarks use ping-pong between two threads pinned to separate physical cores (avoiding hyperthreading). We measure round-trip time with rdtscp and divide by 2 for one-way latency. 10,000 warmup iterations followed by 100,000 measured samples.
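A sketch of the timestamp primitive this kind of measurement typically uses, assuming x86_64 (the helper name is illustrative):

#[cfg(target_arch = "x86_64")]
fn cycles() -> u64 {
    let mut aux = 0u32; // receives IA32_TSC_AUX (core id); unused here
    // RDTSCP reads the TSC only after earlier instructions have executed.
    unsafe { std::arch::x86_64::__rdtscp(&mut aux) }
}

// One-way latency ≈ (difference between two cycles() readings across a
// full ping-pong round trip) / 2.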

Throughput benchmarks measure time to transfer 10 million messages through a 1024-slot buffer with producer and consumer on separate cores.

All benchmarks run with turbo boost disabled for consistent results:

echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
sudo taskset -c 0,2 ./target/release/deps/perf_spsc_bounded_latency-*

Memory Ordering

The implementation uses manual fencing for clarity and portability:

  • Producer: fence(Release) before storing lap counter
  • Consumer: fence(Acquire) after loading lap counter

On x86, these compile to no instructions (strong memory model), but they're required for correctness on ARM and other weakly-ordered architectures.
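A sketch of that fence placement, with illustrative function names; the atomic accesses themselves can stay Relaxed because the fences supply the ordering (the crate's real code may arrange this differently):

use std::sync::atomic::{fence, AtomicUsize, Ordering};

// Producer side: payload writes are ordered before the lap store.
fn publish_lap(lap: &AtomicUsize, next: usize) {
    fence(Ordering::Release);
    lap.store(next, Ordering::Relaxed); // a plain store on x86
}

// Consumer side: payload reads are ordered after the lap load.
fn read_lap(lap: &AtomicUsize) -> usize {
    let seen = lap.load(Ordering::Relaxed);
    fence(Ordering::Acquire);
    seen
}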

When to Use This vs Other Queues

Use nexus::spsc when:

  • You have exactly one producer and one consumer
  • You need the lowest possible latency
  • You want both bounded and overwriting modes in one queue

Consider alternatives when:

  • Multiple producers → use MPSC
  • Multiple consumers → use SPMC or MPMC
  • You need async/await integration → use tokio::sync::mpsc