nexus-queue 0.1.0

High-performance lock-free ring buffer queues for trading systems

Single-Producer Single-Consumer (SPSC) ring buffer with per-slot sequencing.

This module provides a high-performance SPSC queue that outperforms traditional implementations like rtrb on both latency and throughput.

Performance Summary

Benchmarked on Intel Core Ultra 7 155H @ 4.8GHz, pinned to cores 0,2:

Metric          nexus spsc         rtrb               Improvement
Latency p50     ~152 cycles        ~202 cycles        25% faster
Latency p99     ~255 cycles        ~300 cycles        15% faster
Throughput      449 ms / 10M ops   587 ms / 10M ops   31% faster
IPC             0.19               0.14               36% better

At 4.8GHz, 152 cycles ≈ 32 nanoseconds one-way latency.

Design: Per-Slot Sequencing

Traditional SPSC queues (like rtrb) use separate atomic head/tail indices with cached copies to reduce atomic operations:

Traditional (rtrb-style):
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ head (atomic)   │  │ tail (atomic)   │  │ buffer[N]       │
│ cached_tail     │  │ cached_head     │  │ (just data)     │
└─────────────────┘  └─────────────────┘  └─────────────────┘
   Cache line 1         Cache line 2         Cache line 3+

This implementation uses per-slot lap counters instead:

Per-slot sequencing (nexus):
┌──────────────────────────────────────────────────────────┐
│ buffer[0]: { lap: AtomicUsize, data: T }                 │
│ buffer[1]: { lap: AtomicUsize, data: T }                 │
│ ...                                                      │
└──────────────────────────────────────────────────────────┘
   Lap counter + data on SAME cache line
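
For concreteness, a minimal sketch of what such a slot could look like; the field names are illustrative rather than the crate's actual definitions:

use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;

// Sketch of a per-slot layout: the lap counter sits next to the payload,
// so for small T the readiness check and the data share a cache line.
struct Slot<T> {
    lap: AtomicUsize,                 // sequencing: bumped when the slot changes hands
    data: UnsafeCell<MaybeUninit<T>>, // payload, written before the lap is published
}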

Why Per-Slot Sequencing Wins

1. Cache Locality

When checking if a slot is ready, we load slot.lap. The subsequent read of slot.data is on the same cache line - already in L1. Traditional designs require fetching from 3 separate cache lines (head, tail, data).
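A sketch of that read path, reusing the Slot layout above. The lap arithmetic is one possible scheme (the producer stores 2*lap + 1 when it fills a slot, the consumer stores 2*lap + 2 when it frees one), not necessarily the crate's exact protocol:

use std::sync::atomic::Ordering;

fn try_pop<T>(slots: &[Slot<T>], head: &mut usize, mask: usize, shift: u32) -> Option<T> {
    let slot = &slots[*head & mask];
    let lap = *head >> shift;
    // Readiness check: a single atomic load on the slot itself.
    if slot.lap.load(Ordering::Acquire) != 2 * lap + 1 {
        return None; // producer has not filled this slot for the current lap
    }
    // SAFETY (sketch): the check above means the producer fully initialised
    // the payload; it sits on the cache line the load just brought into L1.
    let value = unsafe { (*slot.data.get()).assume_init_read() };
    slot.lap.store(2 * lap + 2, Ordering::Release); // hand the slot back
    *head += 1;
    Some(value)
}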

2. No Stale Cache Problem

Traditional designs cache the remote index to avoid atomic loads:

// Traditional push (rtrb-style): compare against a cached copy of the
// consumer's index, and only reload the atomic when the queue looks full.
if head - cached_tail >= capacity {
    cached_tail = tail.load(Ordering::Acquire); // "slow path" refresh
    if head - cached_tail >= capacity {
        return Err(Full); // genuinely full
    }
}

In ping-pong benchmarks (1 message in flight), the cache is always stale, so every operation hits the "slow path". Per-slot sequencing has no cache to refresh - just check the slot directly.
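For contrast, the per-slot version of the same check, under the illustrative lap scheme from the pop sketch above; there is no cached index to refresh:

use std::sync::atomic::Ordering;

fn can_push<T>(slots: &[Slot<T>], tail: usize, mask: usize, shift: u32) -> bool {
    let slot = &slots[tail & mask];
    let lap = tail >> shift;                     // how many times the producer has wrapped
    slot.lap.load(Ordering::Acquire) == 2 * lap  // consumer has freed the slot for this lap
}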

3. Simpler Control Flow

Fewer branches mean better branch prediction. Our branch-miss rate (~1.6%) is similar to rtrb's, but it applies to fewer total branches, so each operation suffers fewer mispredictions overall.

Optimization Journey

We started by cloning rtrb's design, then systematically tested changes. Using rtrb as baseline (p50 ≈ 200 cycles):

Change                   Result     Cycles        Notes
Clone rtrb exactly       Baseline   ~200          Starting point
Per-slot lap counters    +25%       ~150          Biggest win
Division → bit shift     +15%       ~150 → ~130   tail/cap → tail>>shift
repr(C) field ordering   +5%        -             Hot fields first
Manual fencing           ~0%        -             No change on x86, helps ARM
Const generics           -20%       -             Surprisingly worse!
CachePadded slots        ~0%        -             Not needed for our layout

What Worked

Per-slot sequencing: The single biggest win. Eliminated the stale cache problem entirely and improved cache locality.

Bit shift for lap calculation: Division is 20-90 cycles on x86. Since capacity is always a power of two, tail / capacity becomes tail >> shift. Saved ~20-50 cycles per operation.
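As a concrete illustration (helper name and signature are hypothetical), the divide and modulo collapse to a shift and a mask when capacity is a power of two:

// Lap = how many full wraps, index = position within the buffer.
fn lap_and_index(pos: usize, capacity: usize) -> (usize, usize) {
    debug_assert!(capacity.is_power_of_two());
    let shift = capacity.trailing_zeros(); // e.g. 1024 -> 10
    let mask = capacity - 1;
    (pos >> shift, pos & mask)             // same as (pos / capacity, pos % capacity)
}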

Field ordering with repr(C): Placing hot-path fields (local_tail, buffer, mask) first improves prefetching. ~5% throughput improvement.
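An illustrative layout, reusing the Slot sketch from earlier; the field names are assumptions rather than the crate's actual ones:

// With repr(C), declaration order is memory order, so the fields read on
// every push share the leading cache line of the struct.
#[repr(C)]
struct Producer<T> {
    local_tail: usize,    // hot: advanced on every push
    buffer: *mut Slot<T>, // hot: dereferenced on every push
    mask: usize,          // hot: index arithmetic
    shift: u32,           // hot: lap arithmetic
    capacity: usize,      // cold: rarely needed after construction
}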

Instruction ordering: Updating local_tail after the atomic store lets the lap-counter store begin propagating to the other core while we finish local bookkeeping, which reduces latency variance.

What Didn't Work

Const generics for capacity: We expected const CAP: usize to help by making the mask a compile-time constant. Instead, throughput regressed 20%! Likely causes: monomorphization code bloat, different inlining decisions.
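For clarity, this is the shape of variant meant by "const generics for capacity" (a sketch, not the code that was tested): capacity becomes a type parameter, so the mask and shift are compile-time constants.

// Reuses the Slot sketch from earlier; names are illustrative.
struct ConstProducer<T, const CAP: usize> {
    local_tail: usize,
    buffer: *mut Slot<T>,
}

impl<T, const CAP: usize> ConstProducer<T, CAP> {
    const MASK: usize = CAP - 1;              // known at compile time
    const SHIFT: u32 = CAP.trailing_zeros();  // known at compile time
}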

CachePadded lap counters: With small T, multiple slots share a cache line. We tested padding each slot to 64 bytes. No improvement - the true sharing on the active slot dominates, not false sharing between slots.

Cached indices: The traditional "optimization" that rtrb uses. In our testing, it's actually slower for latency-sensitive workloads because the cache is always stale in ping-pong scenarios.

Example

// Create a bounded SPSC ring buffer with 1024 slots.
let (mut producer, mut consumer) = nexus_queue::ring_buffer::<u64>(1024);

// Bounded push - returns error if full
producer.push(1).unwrap();
producer.push(2).unwrap();

assert_eq!(consumer.pop(), Some(1));
assert_eq!(consumer.pop(), Some(2));
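
The queue is meant to be used across threads; a sketch of cross-thread usage follows, assuming the producer and consumer handles are Send and that push/pop behave as in the example above (the spin loops are for brevity only):

use std::thread;

let (mut producer, mut consumer) = nexus_queue::ring_buffer::<u64>(1024);

// Producer thread: spin when the buffer is momentarily full (a real producer
// might back off or yield instead).
let handle = thread::spawn(move || {
    for i in 0..100_000u64 {
        while producer.push(i).is_err() {}
    }
});

// Consume on this thread until every message has arrived, checking order.
let mut expected = 0u64;
while expected < 100_000 {
    if let Some(v) = consumer.pop() {
        assert_eq!(v, expected);
        expected += 1;
    }
}
handle.join().unwrap();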

Benchmarking Methodology

Latency benchmarks use ping-pong between two threads pinned to separate physical cores (avoiding hyperthreading). We measure round-trip time with rdtscp and divide by 2 for one-way latency. 10,000 warmup iterations followed by 100,000 measured samples.
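A sketch of the timestamp primitive this kind of measurement typically uses, assuming x86_64 (the helper name is illustrative):

#[cfg(target_arch = "x86_64")]
fn cycles() -> u64 {
    let mut aux = 0u32; // receives IA32_TSC_AUX (core id); unused here
    // RDTSCP reads the TSC only after earlier instructions have executed.
    unsafe { std::arch::x86_64::__rdtscp(&mut aux) }
}

// One-way latency ≈ (difference between two cycles() readings across a
// full ping-pong round trip) / 2.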

Throughput benchmarks measure time to transfer 10 million messages through a 1024-slot buffer with producer and consumer on separate cores.

All benchmarks run with turbo boost disabled for consistent results:

echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
sudo taskset -c 0,2 ./target/release/deps/perf_spsc_bounded_latency-*

Memory Ordering

The implementation uses manual fencing for clarity and portability:

  • Producer: fence(Release) before storing lap counter
  • Consumer: fence(Acquire) after loading lap counter

On x86, these compile to no instructions (strong memory model), but they're required for correctness on ARM and other weakly-ordered architectures.
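A sketch of that fence placement, with illustrative function names; the atomic accesses themselves can stay Relaxed because the fences supply the ordering (the crate's real code may arrange this differently):

use std::sync::atomic::{fence, AtomicUsize, Ordering};

// Producer side: payload writes are ordered before the lap store.
fn publish_lap(lap: &AtomicUsize, next: usize) {
    fence(Ordering::Release);
    lap.store(next, Ordering::Relaxed); // a plain store on x86
}

// Consumer side: payload reads are ordered after the lap load.
fn read_lap(lap: &AtomicUsize) -> usize {
    let seen = lap.load(Ordering::Relaxed);
    fence(Ordering::Acquire);
    seen
}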

When to Use This vs Other Queues

Use nexus::spsc when:

  • You have exactly one producer and one consumer
  • You need the lowest possible latency
  • You want both bounded and overwriting modes in one queue

Consider alternatives when:

  • Multiple producers → use MPSC
  • Multiple consumers → use SPMC or MPMC
  • You need async/await integration → use tokio::sync::mpsc