Single-Producer Single-Consumer (SPSC) ring buffer with per-slot sequencing.
This module provides a high-performance SPSC queue that outperforms traditional
implementations like rtrb on both latency and throughput.
Performance Summary
Benchmarked on Intel Core Ultra 7 155H @ 4.8GHz, pinned to cores 0,2:
| Metric | nexus spsc | rtrb | Improvement |
|---|---|---|---|
| Latency p50 | ~152 cycles | ~202 cycles | 25% faster |
| Latency p99 | ~255 cycles | ~300 cycles | 15% faster |
| Throughput | 449ms/10M ops | 587ms/10M ops | 31% faster |
| IPC | 0.19 | 0.14 | 36% better |
At 4.8GHz, 152 cycles ≈ 32 nanoseconds one-way latency.
Design: Per-Slot Sequencing
Traditional SPSC queues (like rtrb) use separate atomic head/tail indices with cached copies to reduce atomic operations:
Traditional (rtrb-style):
```text
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ head (atomic)   │     │ tail (atomic)   │     │ buffer[N]       │
│ cached_tail     │     │ cached_head     │     │ (just data)     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
   Cache line 1            Cache line 2            Cache line 3+
```
This implementation uses per-slot lap counters instead:
Per-slot sequencing (nexus):
```text
┌───────────────────────────────────────────┐
│ buffer[0]: { lap: AtomicUsize, data: T }  │
│ buffer[1]: { lap: AtomicUsize, data: T }  │
│ ...                                       │
└───────────────────────────────────────────┘
    Lap counter + data on SAME cache line
```
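In Rust terms, the layout sketched above looks roughly like this; the field and type names are illustrative rather than the crate's exact definitions:

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;

// One ring slot: the lap counter and the payload live side by side, so for
// typical small T a readiness check pulls the data into L1 along with it.
struct Slot<T> {
    lap: AtomicUsize,                 // which "lap" of the ring last wrote this slot
    data: UnsafeCell<MaybeUninit<T>>, // payload, written before lap is bumped
}

struct Ring<T> {
    buffer: Box<[Slot<T>]>, // length is always a power of two
    mask: usize,            // capacity - 1, for cheap index wrapping
    shift: u32,             // log2(capacity), for lap = index >> shift
}
```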
Why Per-Slot Sequencing Wins
1. Cache Locality
When checking if a slot is ready, we load slot.lap. The subsequent read of
slot.data is on the same cache line - already in L1. Traditional designs
require fetching from 3 separate cache lines (head, tail, data).
2. No Stale Cache Problem
Traditional designs cache the remote index to avoid atomic loads:
```rust
// Traditional push (rtrb-style), simplified
if head - cached_tail >= capacity {
    cached_tail = tail.load(Ordering::Acquire); // "Slow path" refresh
    if head - cached_tail >= capacity {
        return Err(Full);
    }
}
```
In ping-pong benchmarks (1 message in flight), the cache is always stale, so every operation hits the "slow path". Per-slot sequencing has no cache to refresh - just check the slot directly.
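Continuing the sketch above, the consumer's readiness check touches only the slot it is about to read; there is no cached remote index to refresh:

```rust
use std::sync::atomic::Ordering;

impl<T> Ring<T> {
    // Consumer-side readiness check (illustrative). The slot at `head` is
    // readable once its lap counter shows the producer finished this lap.
    // An acquire load is shown for brevity; the crate's manual-fence variant
    // is described in the Memory Ordering section below.
    fn slot_ready(&self, head: usize) -> bool {
        let slot = &self.buffer[head & self.mask];
        let expected_lap = (head >> self.shift) + 1; // producer bumps lap after writing
        slot.lap.load(Ordering::Acquire) == expected_lap
        // slot.data sits on the same cache line, so the follow-up read is an L1 hit.
    }
}
```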
3. Simpler Control Flow
Fewer branches mean better branch prediction. Our branch miss rate (~1.6%) is comparable to rtrb's, but we execute fewer branches in total, so fewer mispredictions overall.
Optimization Journey
We started by cloning rtrb's design, then systematically tested changes. Using rtrb as baseline (p50 ≈ 200 cycles):
| Change | Result | Cycles | Notes |
|---|---|---|---|
| Clone rtrb exactly | Baseline | ~200 | Starting point |
| Per-slot lap counters | +25% | ~150 | Biggest win |
| Division → bit shift | +15% | ~150→~130 | tail/cap → tail>>shift |
| repr(C) field ordering | +5% | - | Hot fields first |
| Manual fencing | ~0% | - | No change on x86, helps ARM |
| Const generics | -20% | - | Surprisingly worse! |
| CachePadded slots | ~0% | - | Not needed for our layout |
What Worked
Per-slot sequencing: The single biggest win. Eliminated the stale cache problem entirely and improved cache locality.
Bit shift for lap calculation: Division is 20-90 cycles on x86. Since
capacity is always a power of two, tail / capacity becomes tail >> shift.
Saved ~20-50 cycles per operation.
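Because capacity is a power of two, both the lap number and the slot index reduce to a shift and a mask computed once at construction; a small demonstration:

```rust
// Set up once at construction; capacity must be a power of two.
let capacity: usize = 1024;
let shift = capacity.trailing_zeros(); // 10
let mask = capacity - 1;

// Hot path: no hardware divide or modulo needed.
let tail: usize = 5_000;
assert_eq!(tail / capacity, tail >> shift); // lap number
assert_eq!(tail % capacity, tail & mask);   // slot index
```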
Field ordering with repr(C): Placing hot-path fields (local_tail, buffer,
mask) first improves prefetching. ~5% throughput improvement.
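As a sketch of what that ordering looks like, building on the Slot type above (the exact field set is illustrative):

```rust
use std::ptr::NonNull;

// repr(C) pins the field order exactly as declared, so the fields touched on
// every push sit at the front of the struct and prefetch together.
#[repr(C)]
struct Producer<T> {
    local_tail: usize,        // hot: bumped on every push
    buffer: NonNull<Slot<T>>, // hot: dereferenced on every push
    mask: usize,              // hot: index wrapping
    shift: u32,               // hot: lap calculation
    // ...colder fields (capacity, shared-state handle, etc.) follow
}
```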
Instruction ordering: Updating local_tail after the atomic store
gives cache coherency a head start, reducing latency variance.
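Concretely, the tail end of push publishes the slot first and only then touches the local index. A sketch continuing the types above; a release store stands in here for the fence-based version shown later:

```rust
use std::sync::atomic::Ordering;

impl<T> Producer<T> {
    // Tail end of push (illustrative): the payload has already been written
    // into slot.data at this point.
    fn finish_push(&mut self, slot: &Slot<T>, tail: usize) {
        let lap = tail >> self.shift;
        slot.lap.store(lap + 1, Ordering::Release); // publish: coherency traffic starts now
        self.local_tail = tail.wrapping_add(1);     // purely local bookkeeping, done last
    }
}
```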
What Didn't Work
Const generics for capacity: We expected const CAP: usize to help by
making the mask a compile-time constant. Instead, throughput regressed 20%!
Likely causes: monomorphization code bloat, different inlining decisions.
CachePadded lap counters: With small T, multiple slots share a cache
line. We tested padding each slot to 64 bytes. No improvement - the true
sharing on the active slot dominates, not false sharing between slots.
Cached indices: The traditional "optimization" that rtrb uses. In our testing, it's actually slower for latency-sensitive workloads because the cache is always stale in ping-pong scenarios.
Example
```rust
use nexus_queue::spsc;

// Constructor name and capacity here are illustrative; see the crate docs
// for the exact signature.
let (mut producer, mut consumer) = spsc::channel(4);

// Bounded push - returns an error if the queue is full
producer.push(1).unwrap();
producer.push(2).unwrap();

assert_eq!(consumer.pop(), Ok(1));
assert_eq!(consumer.pop(), Ok(2));
```
Benchmarking Methodology
Latency benchmarks use ping-pong between two threads pinned to separate
physical cores (avoiding hyperthreading). We measure round-trip time with
rdtscp and divide by 2 for one-way latency. 10,000 warmup iterations
followed by 100,000 measured samples.
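The timestamp read itself is the x86 rdtscp intrinsic; the surrounding harness (core pinning, warmup, percentile aggregation) is omitted here, and the queue handles in the usage sketch are illustrative:

```rust
#[cfg(target_arch = "x86_64")]
fn timestamp() -> u64 {
    let mut aux = 0u32;
    // __rdtscp waits for earlier instructions to execute before reading the TSC.
    unsafe { core::arch::x86_64::__rdtscp(&mut aux) }
}

// One ping-pong sample (sketch):
//   let start = timestamp();
//   ping_producer.push(token).unwrap();
//   while pong_consumer.pop().is_err() {}           // spin until the echo returns
//   let one_way_cycles = (timestamp() - start) / 2; // round trip / 2
```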
Throughput benchmarks measure time to transfer 10 million messages through a 1024-slot buffer with producer and consumer on separate cores.
All benchmarks run with turbo boost disabled for consistent results.
Memory Ordering
The implementation uses manual fencing for clarity and portability:
- Producer: `fence(Release)` before storing the lap counter
- Consumer: `fence(Acquire)` after loading the lap counter
On x86, these compile to no instructions (strong memory model), but they're required for correctness on ARM and other weakly-ordered architectures.
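A sketch of the fence placement using relaxed atomics plus explicit fences as described above; the names continue the earlier Slot sketch and the lap convention is illustrative:

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct Slot<T> {
    lap: AtomicUsize,
    data: UnsafeCell<MaybeUninit<T>>,
}

// Producer side: the payload write must be visible before the lap store.
fn publish<T>(slot: &Slot<T>, value: T, lap: usize) {
    unsafe { (*slot.data.get()).write(value) }; // plain write of the payload
    fence(Ordering::Release);                   // order: payload before lap
    slot.lap.store(lap + 1, Ordering::Relaxed); // make the slot visible
}

// Consumer side: only read the payload once the lap counter is right.
fn try_consume<T>(slot: &Slot<T>, expected_lap: usize) -> Option<T> {
    if slot.lap.load(Ordering::Relaxed) != expected_lap {
        return None;                            // producer hasn't finished this lap yet
    }
    fence(Ordering::Acquire);                   // order: lap check before payload read
    Some(unsafe { (*slot.data.get()).assume_init_read() })
}
```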
When to Use This vs Other Queues
Use nexus::spsc when:
- You have exactly one producer and one consumer
- You need the lowest possible latency
- You want both bounded and overwriting modes in one queue
Consider alternatives when:
- Multiple producers → use MPSC
- Multiple consumers → use SPMC or MPMC
- You need async/await integration → use `tokio::sync::mpsc`