# nexus-queue
A high-performance SPSC (Single-Producer Single-Consumer) ring buffer for Rust, optimized for ultra-low-latency messaging.
## Performance
Benchmarked against `rtrb` on a dual-socket Intel Xeon 8124M @ 3.00GHz, pinned to physical cores with turbo boost disabled:
### Latency (ping-pong, 25 runs)
| Metric | nexus-queue | rtrb | Δ |
|---|---|---|---|
| p50 best | 346 cycles | 375 cycles | -8% |
| p50 median | ~370 cycles | ~430 cycles | -14% |
| p99 typical | ~600 cycles | ~700 cycles | -14% |
25/25 wins on p50 latency.
### Throughput
| Metric | nexus-queue | rtrb |
|---|---|---|
| Throughput | 294 M msgs/sec | 127 M msgs/sec |
2.3x throughput advantage.
## Usage
```rust
use nexus_queue::spsc;

// Create a bounded queue. The constructor name and capacity argument
// are assumptions; check the crate docs for the exact signature.
let (mut tx, mut rx) = spsc::channel(1024);

// Producer thread
tx.push(42).unwrap();

// Consumer thread
assert_eq!(rx.pop(), Ok(42));
```
### Handling backpressure
```rust
use nexus_queue::Full;

// Spin until space is available
while tx.push(42).is_err() {
    std::hint::spin_loop();
}

// Or handle the full case explicitly. `Full` carrying back the rejected
// value is an assumption, mirroring other SPSC crates.
match tx.push(42) {
    Ok(()) => {}
    Err(Full(_rejected)) => { /* queue full: value handed back, retry later */ }
}
```
### Disconnection detection
```rust
// Check if the other end has been dropped
if rx.is_disconnected() {
    // Producer is gone; drain any remaining items, then stop.
}
if tx.is_disconnected() {
    // Consumer is gone; stop producing.
}
```
## Design
Two implementations are available with different cache line ownership patterns:
### Index-based (default)
```text
┌───────────────────────────────────────────────────────┐
│ Shared:                                               │
│   tail: CachePadded<AtomicUsize>   ← Producer writes  │
│   head: CachePadded<AtomicUsize>   ← Consumer writes  │
│   buffer: *mut T                                      │
└───────────────────────────────────────────────────────┘
```
Producer and consumer write to separate cache lines. Each endpoint keeps a local copy of the other's index and refreshes it from the shared atomic only when the cached value suggests the queue is full (producer) or empty (consumer).
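A minimal sketch of the producer's fast path under this scheme; every name here is illustrative, not nexus-queue's actual internals:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative cached-index producer state (field names assumed).
struct ProducerSide {
    tail: usize,        // producer-owned write index (monotonic)
    cached_head: usize, // last observed value of the consumer's index
    capacity: usize,
}

impl ProducerSide {
    fn has_space(&mut self, shared_head: &AtomicUsize) -> bool {
        // Fast path: decide from the cached copy; no cross-core traffic.
        if self.tail.wrapping_sub(self.cached_head) < self.capacity {
            return true;
        }
        // Looks full: refresh the cache with one shared load, then recheck.
        self.cached_head = shared_head.load(Ordering::Acquire);
        self.tail.wrapping_sub(self.cached_head) < self.capacity
    }
}
```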
### Slot-based
```text
┌───────────────────────────────────────────┐
│ buffer[0]: { lap: AtomicUsize, data: T }  │
│ buffer[1]: { lap: AtomicUsize, data: T }  │
│ ...                                       │
└───────────────────────────────────────────┘
```
Producer and consumer write to the same cache line (the slot's lap counter). The synchronization word and data share a cache line for locality.
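A sketch of the slot layout, again with assumed names:

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;

// Illustrative slot layout (not nexus-queue's actual definition).
// The lap counter and the payload sit side by side, so a single cache
// line carries both the synchronization word and the data it guards.
#[repr(C)]
struct Slot<T> {
    lap: AtomicUsize,                 // bumped by producer on write, consumer on read
    data: UnsafeCell<MaybeUninit<T>>, // payload; validity tracked by `lap`
}
```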
### Trade-offs
| | index (default) | slot |
|---|---|---|
| Cache line writes | Unidirectional | Bidirectional |
| Multi-socket/NUMA | ✓ Better | Worse |
| Shared L3 (single socket) | Good | ✓ Better |
Which performs better depends on your hardware topology; benchmark both on the machine you intend to deploy on.
```toml
# Use the slot-based implementation
[dependencies]
nexus-queue = { version = "0.3", features = ["slot-based"] }
```
Both implementations are always available as submodules for benchmarking:
```rust
// Submodule and constructor names assumed from the section titles above.
use nexus_queue::{index, slot};

let (index_tx, index_rx) = index::channel(1024);
let (slot_tx, slot_rx) = slot::channel(1024);
```
## Benchmarking
For accurate results, disable turbo boost and pin to physical cores:
```sh
# Disable turbo boost (Intel)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# Run the benchmark pinned to cores 0 and 2 (bench command assumed)
taskset -c 0,2 cargo bench

# Re-enable turbo boost
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
```
Verify your core topology with `lscpu -e`; pick cores with different `CORE` values so you don't land on hyperthread siblings.
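For example, to list just the relevant columns:

```sh
# Show each logical CPU with its physical core and socket
lscpu -e=CPU,CORE,SOCKET
```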
## Memory Ordering
Both implementations use manual fencing for clarity and portability:
- Producer: `fence(Release)` before publishing
- Consumer: `fence(Acquire)` after reading, `fence(Release)` before advancing
On x86 these compile to no instructions (strong memory model), but they're required for correctness on ARM and other weakly-ordered architectures.
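A sketch of the producer-side pattern (names illustrative, not nexus-queue's actual code):

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// The Release fence orders the payload write before the relaxed index
// store that publishes it to the consumer.
unsafe fn publish<T>(slot: *mut T, value: T, tail: &AtomicUsize, next: usize) {
    unsafe { slot.write(value) };        // 1. write the payload into the slot
    fence(Ordering::Release);            // 2. keep the write above the store below
    tail.store(next, Ordering::Relaxed); // 3. advance the published index
}
```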
## When to Use This
Use nexus-queue when:
- You have exactly one producer and one consumer
- You need the lowest possible latency
- You're building trading systems, audio pipelines, or real-time applications
Consider alternatives when:
- Multiple producers → use MPSC queues
- Multiple consumers → use MPMC queues
- You need async/await → use `tokio::sync::mpsc`
## License
MIT OR Apache-2.0