# nexus-queue
A high-performance SPSC (Single-Producer Single-Consumer) ring buffer for Rust, optimized for ultra-low-latency messaging.
## Performance

Benchmarked against rtrb and crossbeam on an Intel Core Ultra 7 155H, pinned to physical cores 0 and 2 with turbo boost disabled:
| Metric | nexus-queue | rtrb | crossbeam ArrayQueue |
|---|---|---|---|
| p50 latency | 425 cycles (158 ns) | 560 cycles (208 ns) | 1073 cycles (398 ns) |
| p99 latency | 681 cycles (253 ns) | 894 cycles (332 ns) | 1598 cycles (593 ns) |
| Throughput | 117 M msgs/sec | 50 M msgs/sec | 32 M msgs/sec |
## Usage
```rust
use nexus_queue;

// Constructor and method names below are assumed; see the crate docs for the exact API.
let (mut producer, mut consumer) = nexus_queue::channel::<u64>(1024);

// Producer thread
producer.push(42).unwrap();

// Consumer thread
assert_eq!(consumer.pop(), Ok(42));
```
### Handling backpressure
```rust
// Spin until space is available
while producer.push(42).is_err() {
    core::hint::spin_loop();
}
// Or handle the full case explicitly (error type assumed)
match producer.push(42) {
    Ok(()) => { /* value enqueued */ }
    Err(_) => { /* queue full: retry, back off, or drop */ }
}
```
### Disconnection detection
```rust
// Check if the other end has been dropped
if consumer.is_disconnected() { /* producer gone: drain remaining items */ }
if producer.is_disconnected() { /* consumer gone: stop producing */ }
```
## Design

### Per-Slot Sequencing
Traditional SPSC queues use separate atomic head/tail indices:
Traditional (rtrb-style):

```
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ head (atomic)   │   │ tail (atomic)   │   │ buffer[N]       │
│ cached_tail     │   │ cached_head     │   │ (just data)     │
└─────────────────┘   └─────────────────┘   └─────────────────┘
   Cache line 1           Cache line 2          Cache line 3+
```
nexus-queue uses per-slot lap counters instead:
Per-slot sequencing (nexus):

```
┌──────────────────────────────────────────────────────────┐
│ buffer[0]: { lap: AtomicUsize, data: T }                 │
│ buffer[1]: { lap: AtomicUsize, data: T }                 │
│ ...                                                      │
└──────────────────────────────────────────────────────────┘
           Lap counter + data on SAME cache line
```
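To make the layout concrete, here is a minimal sketch of what a per-slot ring buffer can look like; the field and type names are illustrative, not the crate's actual definitions:

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;

// Hypothetical layout matching the diagram above: the lap counter that
// synchronizes producer and consumer sits next to the payload it guards,
// so a single cache-line fetch observes both.
#[repr(C)]
struct Slot<T> {
    lap: AtomicUsize,                  // which "lap" of the ring last touched this slot
    data: UnsafeCell<MaybeUninit<T>>,  // payload, written only by the producer
}

struct RingBuffer<T> {
    slots: Box<[Slot<T>]>, // capacity kept to a power of two so index math is a shift/mask
    // head/tail positions live on the producer and consumer handles (not shown)
}
```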
### Why This Wins
- Cache Locality: The lap counter and data share a cache line. One fetch gets both the synchronization state and the payload.
- No Stale Cache Problem: Traditional designs cache the remote index to avoid atomic loads, but in ping-pong scenarios the cache is always stale. Per-slot sequencing checks the slot directly (see the sketch below).
- Simpler Control Flow: Fewer branches means better branch prediction.
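The stale-cache point is easiest to see in code. A rough sketch of how the consumer decides whether a slot is readable in each design (function names and signatures are hypothetical, shown only for contrast):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Index-based SPSC: consult a cached copy of the producer's tail, refreshing it
// with an atomic load of the shared index when the queue looks empty. In a
// ping-pong workload the cached value is stale almost every time, so the
// cross-core refresh happens on nearly every pop.
fn can_pop_indexed(head: usize, cached_tail: &mut usize, shared_tail: &AtomicUsize) -> bool {
    if *cached_tail == head {
        *cached_tail = shared_tail.load(Ordering::Acquire); // refresh from the other core
    }
    *cached_tail != head
}

// Per-slot sequencing: ask the slot itself. The lap counter lives on the same
// cache line as the data the consumer is about to read anyway.
fn can_pop_per_slot(slot_lap: &AtomicUsize, expected_lap: usize) -> bool {
    slot_lap.load(Ordering::Acquire) == expected_lap
}
```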
## Optimization Journey
Starting from an rtrb clone (p50 ≈ 560 cycles):
| Change | Impact | Notes |
|---|---|---|
| Per-slot lap counters | +25% | Biggest win - eliminates stale cache |
| Division → bit shift | +15% | `tail / cap` → `tail >> shift` (see sketch below) |
| `repr(C)` field ordering | +5% | Hot fields first for prefetching |
| Manual fencing | ~0% | Required for ARM correctness |
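A small illustration of the division-to-shift change from the table above (variable names are made up; only the arithmetic matters): with a power-of-two capacity, the slot index and lap come from a mask and a shift instead of a remainder and a division.

```rust
let capacity: usize = 1024;             // must be a power of two
let shift = capacity.trailing_zeros();  // log2(capacity)
let mask = capacity - 1;

let tail: usize = 3075; // monotonically increasing position counter

// Before: remainder + division
let slot = tail % capacity;
let lap = tail / capacity;

// After: mask + shift, same results without the div
assert_eq!(slot, tail & mask);
assert_eq!(lap, tail >> shift);
```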
What didn't work:
- Const generics: -20% regression (monomorphization bloat)
- CachePadded slots: No improvement (true sharing dominates)
- Cached indices: Slower in latency-sensitive workloads
## Benchmarking
For accurate results, disable turbo boost and pin to physical cores:
```bash
# Disable turbo boost (Intel)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# Run benchmark pinned to cores 0 and 2
taskset -c 0,2 cargo bench

# Re-enable turbo boost
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
```
Verify your core topology with `lscpu -e`; you want cores with different `CORE` numbers to avoid hyperthreading siblings.
## Memory Ordering
The implementation uses manual fencing for clarity and portability:
- Producer: `fence(Release)` before storing the lap counter
- Consumer: `fence(Acquire)` after loading the lap counter, `fence(Release)` before clearing
On x86 these compile to no instructions (strong memory model), but they're required for correctness on ARM and other weakly-ordered architectures.
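Roughly, the pattern looks like the following sketch. The slot layout and lap convention are assumed for illustration; this is not the crate's actual push/pop, only the ordering of operations described above.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{fence, AtomicUsize, Ordering};

struct Slot {
    lap: AtomicUsize,
    data: UnsafeCell<u64>,
}
unsafe impl Sync for Slot {} // relies on the single-producer / single-consumer guarantee

// Producer: write the payload, then publish it.
fn write_slot(slot: &Slot, value: u64, full_lap: usize) {
    unsafe { *slot.data.get() = value };         // plain write of the payload
    fence(Ordering::Release);                    // payload write cannot sink below the store
    slot.lap.store(full_lap, Ordering::Relaxed); // publish: slot is now full
}

// Consumer: observe the flag, read the payload, hand the slot back.
fn read_slot(slot: &Slot, full_lap: usize, empty_lap: usize) -> Option<u64> {
    if slot.lap.load(Ordering::Relaxed) != full_lap {
        return None;                              // nothing published yet
    }
    fence(Ordering::Acquire);                     // payload write happens-before this read
    let value = unsafe { *slot.data.get() };
    fence(Ordering::Release);                     // finish reading before marking empty
    slot.lap.store(empty_lap, Ordering::Relaxed); // producer may now reuse the slot
    Some(value)
}
```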
## When to Use This
Use nexus-queue when:
- You have exactly one producer and one consumer
- You need the lowest possible latency
- You're building trading systems, audio pipelines, or real-time applications
Consider alternatives when:
- Multiple producers → use MPSC queues
- Multiple consumers → use MPMC queues
- You need async/await → use `tokio::sync::mpsc`
## Acknowledgments

Inspired by the design of crossbeam's `ArrayQueue` and its use of per-slot sequence counters.
## License

MIT OR Apache-2.0