Photon Ring
Ultra-low-latency SPMC inter-thread messaging using seqlock-stamped ring buffers.
Photon Ring is a single-producer, multi-consumer (SPMC) pub/sub library for Rust.
no_std compatible (requires alloc), zero-allocation hot path, ~98 ns cross-thread
latency, and ~3 ns publish cost.
use ;
// Low-level SPMC channel
let = ;
let mut sub = subscribers.subscribe;
publisher.publish;
assert_eq!;
// Named-topic bus
let bus = new;
let mut pub_ = bus.publisher;
let mut sub = bus.subscribe;
pub_.publish;
assert_eq!;
The Problem
Inter-thread communication is the dominant cost in concurrent systems. Traditional approaches pay for at least one of:
| Approach | Write cost | Read cost | Allocation |
|---|---|---|---|
std::sync::mpsc |
Lock + CAS | Lock + CAS | Per-message |
Mutex<VecDeque> |
Lock acquisition | Lock acquisition | Dynamic ring growth |
| Crossbeam bounded channel | CAS on head | CAS on tail | None (pre-allocated) |
| LMAX Disruptor | Sequence claim + barrier | Sequence barrier spin | None (pre-allocated) |
The Disruptor eliminated allocation overhead and demonstrated that pre-allocated ring buffers with sequence barriers could achieve 8-32 ns latency. But it still relies on sequence barriers (shared atomic cursors) that create cache-line contention between producer and consumers.
The Solution: Seqlock-Stamped Slots
Photon Ring takes a different approach. Instead of sequence barriers, each slot in the ring buffer carries its own seqlock stamp co-located with the payload:
64 bytes (one cache line)
┌─────────────────────────────────────────────────────┐
│ stamp: AtomicU64 │ value: T │
│ (seqlock) │ (Copy, no Drop) │
└─────────────────────────────────────────────────────┘
For T <= 56 bytes, stamp and value share one cache line.
Larger T spills to additional lines (still correct, slightly slower).
Write Protocol (Publisher)
1. stamp = seq * 2 + 1 (odd = write in progress)
2. fence(Release) (stamp visible before data)
3. memcpy(slot.value, data) (direct write, no allocation)
4. stamp = seq * 2 + 2 (even = write complete, Release)
5. cursor = seq (Release — consumers can proceed)
Read Protocol (Subscriber)
1. s1 = stamp.load(Acquire)
2. if odd → spin (writer active)
3. if s1 < expected → Empty (not yet published)
4. if s1 > expected → Lagged (slot reused, consult head cursor)
5. value = memcpy(slot) (direct read, T: Copy)
6. fence(Acquire)
7. s2 = stamp.load()
8. if s1 == s2 → return (consistent read)
9. else → retry (torn read detected)
Why This Is Fast
-
No shared mutable state on the read path. Each subscriber has its own cursor (a local
u64, not an atomic). Subscribers never write to memory that anyone else reads. Zero cache-line bouncing between consumers. -
Stamp-in-slot co-location. For payloads up to 56 bytes, the seqlock stamp and payload share the same cache line. A reader loads the stamp and the data in a single cache-line fetch. The Disruptor pattern requires reading a separate sequence barrier (different cache line) before accessing the slot.
-
No allocation, ever. The ring is pre-allocated at construction. Publish is a
memcpyinto a pre-existing slot. NoArc, noBox, no heap allocation on the hot path. -
T: Copyenables torn-read detection without resource leaks. BecauseThas no destructor, a torn read (partial overwrite during read) never causes double-free or resource leaks. The stamp check detects the inconsistency and the read is retried. See Soundness for the full discussion. -
Single-producer by type system.
Publisher::publishtakes&mut self, enforced by the Rust borrow checker. No CAS, no lock, no sequence claiming on the write side.
Benchmark Results
Benchmark Environment
- CPU: Intel Core i7-10700KF @ 3.80 GHz (8 cores / 16 threads)
- OS: Linux 6.8.0-101-generic (Ubuntu)
- Rust: 1.93.1 (stable),
--releaseprofile (opt-level 3) - Turbo Boost: enabled; SMT: enabled; Core pinning: none
- Framework: Criterion, 100 samples, 3-second warmup
Numbers are medians from the machine above. Your results will vary — run
cargo bench on your own hardware for authoritative numbers.
Cross-Thread Latency (the metric that matters)
Both libraries measured with publisher and consumer on separate OS threads, busy-spin wait strategy, ring size 4096. This is the apples-to-apples comparison.
| Benchmark | Photon Ring | disruptor 4.0 |
|---|---|---|
| Cross-thread roundtrip | 98 ns | 133 ns |
| Publish only (write cost) | 3 ns | 24 ns |
Cross-thread latency is dominated by the CPU's cache coherence protocol (MESI/MOESI). Both libraries are close to the hardware floor. The publish-only difference reflects Photon Ring's simpler write path (one seqlock stamp vs sequence claim + barrier).
Photon Ring Detailed Benchmarks
| Operation | Latency | Notes |
|---|---|---|
publish (write only) |
3 ns | Single slot seqlock write |
publish + try_recv (1 sub, same thread) |
2.5 ns | Stamp-only fast path |
| Fanout: 2 subscribers | 4 ns | ~1.1 ns per additional sub |
| Fanout: 5 subscribers | 8 ns | Linear scaling |
| Fanout: 10 subscribers | 14 ns | Linear scaling |
try_recv (empty channel) |
< 1 ns | Single atomic load |
| Batch publish 64 + drain | 155 ns | 2.4 ns/msg amortized |
| Struct roundtrip (24B payload) | 4.4 ns | Realistic payload size |
| Cross-thread latency | 98 ns | Inter-core cache transfer |
Throughput
The market_data example publishes 500,000 messages per topic across 4 independent
SPMC topics (4 publishers, 4 subscribers):
2,000,000 messages in 12.5 ms = 160M msg/s
Soundness
Test Suite
- 26 correctness tests covering basic pub/sub, multi-subscriber fanout, ring overflow
with lag detection,
latest()under contention, batch publish, cross-thread SPMC, and a 1M-message stress test verified across 5 consecutive runs. - 3 doc-tests verifying all README-facing code examples compile and run.
MIRI Verification
22 single-threaded tests pass under Miri with no undefined behavior detected. Multi-threaded tests are excluded because Miri's thread scheduling is non-deterministic and the tests contain spin loops.
MIRI verifies the single-threaded unsafe operations (pointer reads/writes, MaybeUninit
handling, UnsafeCell access patterns) but does not verify the concurrent seqlock
protocol, which relies on hardware memory ordering guarantees beyond what the abstract
memory model formalizes.
The Seqlock Memory Model Question
Seqlocks involve an optimistic read pattern: the reader copies data that may be concurrently modified by the writer, then verifies consistency via the stamp. Under the C++20/Rust abstract memory model, concurrent non-atomic reads and writes to the same memory location constitute a data race, which is undefined behavior — even if the result is discarded on mismatch.
This is a known open problem in language-level memory models. The pattern is universally used in practice:
- The Linux kernel uses seqlocks pervasively (
seqlock_t) for read-heavy data likejiffies, namespace counters, and filesystem metadata. - Facebook/Meta's Folly library implements
folly::SharedMutexusing the same pattern. - The C++ standards committee (WG21) has acknowledged this gap. Papers like
P1478R7 (
std::byte-based seqlock support) and discussions aroundstd::start_lifetime_asaim to formalize seqlock semantics.
Why T: Copy is necessary but not sufficient:
The T: Copy bound ensures no destructor runs on a torn read, preventing resource leaks
and double-free. However, certain Copy types have validity invariants — for example,
bool (must be 0 or 1), NonZero<u32> (must be non-zero), or reference types. A torn
read of these types could produce a value that violates the type's invariant, which is
undefined behavior regardless of whether the value is later discarded.
Recommended payload types: Use plain numeric types (u8..u128, f32, f64),
fixed-size arrays of numerics, or #[repr(C)] structs composed exclusively of such types.
These have no validity invariants beyond alignment and can safely tolerate torn reads.
In practice, on all mainstream architectures (x86, ARM, RISC-V), torn reads of naturally-aligned types produce a valid-but-meaningless bit pattern that is always detected and discarded by the stamp check. No undefined CPU state, trap, or signal is produced.
API
Low-Level Channel
use ;
let = ; // capacity must be power of 2
// Subscribe (future messages only)
let mut sub = subs.subscribe;
// Or subscribe from oldest available message still in the ring
let mut sub_old = subs.subscribe_from_oldest;
// Publish
pub_.publish;
pub_.publish_batch;
// Receive (non-blocking)
match sub.try_recv
// Blocking receive (busy-spins until data is available)
let value = sub.recv;
// Skip to latest (discard intermediate messages)
if let Some = sub.latest
// Query state
let n = sub.pending; // messages available (capped at capacity)
let n = pub_.published; // total messages published
Important: recv() busy-spins (core::hint::spin_loop) and will consume 100% of one
CPU core while waiting. Use try_recv() with your own backoff strategy if this is
unacceptable.
Named-Topic Bus
use Photon;
let bus = new;
// Each topic is an independent SPMC ring.
// publisher() can only be called once per topic (panics on second call).
let mut prices_pub = bus.publisher;
let mut prices_sub = bus.subscribe;
// Multiple subscribers per topic
let mut logger_sub = bus.subscribe;
prices_pub.publish;
Design Constraints
| Constraint | Rationale |
|---|---|
T: Copy |
Enables torn-read detection without resource leaks; see Soundness |
| Power-of-two capacity | Bitmask modulo (seq & mask) instead of expensive % division |
| Single producer | Seqlock invariant requires exclusive write access; enforced via &mut self |
| Lossy on overflow | When the ring wraps, oldest messages are silently overwritten; consumers detect via Lagged |
Busy-spin recv() |
Lowest latency; use try_recv() with your own backoff if CPU usage matters |
Comparison with Existing Work
| Photon Ring | disruptor-rs (v4) | bus (jonhoo) | crossbeam bounded | |
|---|---|---|---|---|
| Pattern | SPMC seqlock ring | SP/MP sequence barriers | SPMC broadcast | MPMC bounded queue |
| Cross-thread latency | 98 ns | 133 ns | — | — |
| Publish cost | 3 ns | 24 ns | — | — |
| Allocation | None | None | None | None (bounded) |
| Consumer model | Poll (try_recv) |
Callback + Poller API | Poll | Poll |
| Overflow | Lossy (Lagged) | Backpressure (blocks) | Backpressure | Backpressure |
| Multi-producer | No | Yes | No | Yes |
no_std |
Yes | No | No | No |
| Dependencies | 2 (hashbrown, spin) | 4 | 0 | 3 |
Note: Crossbeam bounded channels use backpressure (the sender blocks when the buffer is full), which prevents message loss but adds latency under contention. Photon Ring uses lossy semantics — the producer never blocks, but slow consumers miss messages.
Running Benchmarks
# Full benchmark suite (includes disruptor comparison)
# Market data throughput example
# Run the test suite
# MIRI soundness check (requires nightly)
License
Licensed under either of Apache License, Version 2.0 or MIT License at your option.