# Photon Ring
Ultra-low-latency SPMC/MPMC pub/sub using seqlock-stamped ring buffers.
Photon Ring is a zero-allocation pub/sub crate for Rust built around pre-allocated ring buffers, per-slot seqlock stamps, and T: Pod payloads. It targets the part of concurrent systems where queueing overhead dominates: market data, telemetry fanout, staged pipelines, and other hot-path broadcast workloads where every subscriber should see every message.
It is no_std compatible with alloc, supports named-topic buses and typed buses, and includes a pipeline builder for multi-stage thread topologies on supported desktop/server platforms.
> [!IMPORTANT]
> The default `channel()` is lossy on overflow: the publisher never blocks, and slow subscribers detect drops via `TryRecvError::Lagged`. If lossless delivery matters more than raw latency, use `channel_bounded()`.
## Quick start
```rust
use photon_ring::{channel, channel_mpmc, Photon};

// SPMC channel (single producer, multiple consumers)
let (mut publisher, subs) = channel::<u64>(1024);
let mut sub = subs.subscribe();
publisher.publish(42);
assert_eq!(sub.try_recv(), Ok(42));

// MPMC channel (multiple producers, clone-able)
let (mp_pub, _mp_subs) = channel_mpmc::<u64>(1024);
let mp_pub2 = mp_pub.clone();
mp_pub.publish(1);
mp_pub2.publish(2);

// Named-topic bus
let bus = Photon::<u64>::new(1024);
let mut p = bus.publisher("ticks");
let mut s = bus.subscribe("ticks");
p.publish(7);
assert_eq!(s.try_recv(), Ok(7));
```
## Installation

```toml
[dependencies]
photon-ring = "2.1.0"
```
Optional features:

- `derive`: enables `#[derive(photon_ring::DerivePod)]` for user-defined `Pod` types.
- `hugepages`: enables Linux memory controls such as `mlock`, `prefault`, and NUMA helpers.
Rust 1.70+ is supported.
## The problem
Inter-thread communication is often the dominant cost in concurrent systems. Traditional messaging designs usually pay for at least one expensive property on the hot path:
| Approach | Write cost | Read cost | Allocation |
|---|---|---|---|
| `std::sync::mpsc` | Lock + CAS | Lock + CAS | Per-message |
| `Mutex<VecDeque>` | Lock acquisition | Lock acquisition | Dynamic growth |
| `crossbeam-channel` | CAS on head | CAS on tail | None |
| LMAX Disruptor pattern | Sequence claim + barrier | Sequence barrier spin | None |
The Disruptor showed that pre-allocated rings can remove allocator overhead and drive very low latency, but its shared sequence barriers still create cache-line contention between producers and consumers.
## The solution: seqlock-stamped slots
Photon Ring moves synchronization into each slot. Every slot carries its own seqlock stamp beside the payload, so readers validate the data they just loaded instead of bouncing a shared barrier cache line.
```text
              64 bytes (one cache line)
+-----------------------------------------------------+
| stamp: AtomicU64  |  value: T                       |
| (seqlock)         |  (Pod - all bit patterns valid) |
+-----------------------------------------------------+

For T <= 56 bytes, stamp and value share one cache line.
Larger T spills to additional lines (still correct, slightly slower).
```
### Write protocol

```text
1. stamp  = seq * 2 + 1      (odd = write in progress)
2. fence(Release)            (stamp visible before data)
3. memcpy(slot.value, data)
4. stamp  = seq * 2 + 2      (even = write complete, Release)
5. cursor = seq              (Release - consumers can proceed)
```
### Read protocol

```text
1. s1 = stamp.load(Acquire)
2. if odd           -> spin
3. if s1 < expected -> Empty
4. if s1 > expected -> Lagged
5. value = memcpy(slot)      (direct read, T: Pod)
6. s2 = stamp.load(Acquire)
7. if s1 == s2      -> return value
8. else             -> retry
```
## Why this is fast

- No shared mutable state on the read path. Each subscriber keeps its own local cursor. Readers do not publish progress into a shared hot cache line unless bounded backpressure tracking is enabled.
- Stamp and payload are co-located. For `T <= 56` bytes, the stamp check and payload read hit the same cache line.
- No allocation on publish or receive. The ring is fixed at construction time, and hot-path operations are direct slot reads and writes.
- `T: Pod` makes torn reads safe to reject. Every bit pattern is valid, so an optimistic torn read is harmless and discarded by the stamp re-check.
- Single-producer SPMC avoids write-side contention. `Publisher::publish` takes `&mut self`, so the type system enforces one producer without CAS. `MpPublisher` adds an MPMC path when you need multiple concurrent writers.
## Benchmarks

Measured with Criterion on an Intel i7-10700KF (8C/16T, 3.80 GHz, Linux 6.8, Rust 1.93.1) and Apple M1 Pro (8C, macOS 26.3, Rust 1.92.0), `--release`, 100 samples, no core pinning unless stated.
> [!NOTE]
> These numbers use `Pod` payloads and compare concrete implementations, not abstract algorithms. Scheduler noise, pinning, CPU generation, and payload layout all matter, so treat them as reproducible snapshots rather than universal constants.
### Against disruptor-rs

- Publish: 2.8 ns (Intel) / 2.4 ns (M1 Pro), versus 30.6 ns / 15.3 ns for `disruptor-rs`
- Cross-thread roundtrip: 95 ns (Intel) / 130 ns (M1 Pro), versus 138 ns / 186 ns for `disruptor-rs`
### Core operations

```text
                                 i7-10700KF    M1 Pro
                                 ──────────    ──────
Publish only                         2.8 ns    2.4 ns
Roundtrip (1 sub, same thread)       2.7 ns    8.8 ns
Fanout (10 independent subs)        17.0 ns   27.7 ns
SubscriberGroup (any N, O(1))        2.6 ns    8.8 ns
MPMC (1 pub, 1 sub)                 12.1 ns   10.6 ns
Empty poll                           0.9 ns    1.1 ns
Batch 64 + drain                      158 ns    282 ns
Struct roundtrip (24B Pod)           4.8 ns    9.3 ns
Cross-thread roundtrip                95 ns     130 ns
One-way latency (RDTSC)           48 ns p50         —
```
### Throughput

- Sustained throughput: about 300M msg/s on Intel and 88M msg/s on M1 Pro
- Payload scaling: Photon Ring outperformed `disruptor-rs` across 8-byte to 4 KiB `Pod` payloads; see `docs/payload-scaling.md`
## Comparison

| | Photon Ring | disruptor-rs (v4) | crossbeam-channel | bus |
|---|---|---|---|---|
| Delivery | Broadcast | Broadcast | Point-to-point queue | Broadcast |
| Publish cost | 2.8 ns / 12.1 ns (MPMC) | 30.6 ns | - | - |
| Cross-thread | 95 ns | 138 ns | - | - |
| Throughput | ~300M msg/s | - | - | - |
| Topology builder | Yes | Yes | No | No |
| Batch APIs | Yes | Yes | Iterator | No |
| Topic bus | Yes | No | No | No |
| Backpressure | Optional | Default | Default | Default |
| `no_std` | Yes | No | No | No |
| Multi-producer | Yes | Yes | Yes | No |
Use crossbeam-channel when one receiver should own each message. Use Photon Ring when each subscriber should observe the same stream with minimal coordination overhead.
## API overview
Channels are the lowest-level interface. `channel::<T>(capacity)` creates the fastest single-producer path and returns a `Publisher<T>` plus a cloneable `Subscribable<T>`. `channel_bounded::<T>(capacity, watermark)` adds optional backpressure; `Publisher::try_publish` returns `PublishError::Full(value)` instead of overwriting unread slots. `channel_mpmc::<T>(capacity)` returns `MpPublisher<T>`, which is `Clone + Send + Sync` and uses atomic sequence claiming for concurrent producers. On the write side, the important APIs are `publish`, `publish_with` for in-place construction, `publish_batch` on `Publisher`, and `published`/`capacity` for lightweight counters.

Subscribers are independent and contention-free by default. `Subscribable::subscribe()` starts from future messages only, while `subscribe_from_oldest()` starts at the oldest message still retained in the ring. `Subscriber<T>` exposes `try_recv`, `recv`, `recv_with`, `latest`, `pending`, `recv_batch`, and `drain`, plus observability counters through `total_received`, `total_lagged`, and `receive_ratio`. If multiple logical consumers are polled on the same thread, `subscribe_group::<N>()` creates a `SubscriberGroup<T, N>` that performs one ring read for the whole group instead of one per logical subscriber.

For topic routing, `Photon<T>` provides a string-keyed bus where all topics share the same payload type, and `TypedBus` allows a different `T: Pod` per topic. Both lazily create topics and expose `publisher`, `try_publisher`, `subscribe`, and `subscribable`. `publisher()` will panic if the publisher for that topic was already taken, and `TypedBus` also panics on type mismatches for an existing topic.

Pipelines build dedicated-thread processing graphs on supported OS targets. `topology::Pipeline::builder().capacity(...).input::<T>()` returns an input publisher plus a typed builder; `.then(...)` chains stages, `.fan_out(...)` creates a diamond, `.then_a(...)` and `.then_b(...)` extend either branch, and `.build()` returns the final subscriber plus a `Pipeline` handle. The handle supports `shutdown`, `join`, `panicked_stages`, `is_healthy`, and `stage_count`. For manual shutdown outside topology, use `Shutdown`.

Wait behavior is explicit. `recv_with` accepts `WaitStrategy::BusySpin`, `YieldSpin`, `BackoffSpin`, `Adaptive`, `MonitorWait`, or `MonitorWaitFallback` depending on whether you want the absolute lowest wakeup latency or better core sharing. `MonitorWait` uses Intel `UMONITOR`/`UMWAIT` (Alder Lake+) for near-zero power wakeup (~30 ns), with automatic fallback to `PAUSE` on older x86 or `WFE` on ARM; construct it safely via `WaitStrategy::monitor_wait(&stamp)`. `MonitorWaitFallback` uses `TPAUSE` without requiring an address. On supported platforms, the crate also includes affinity helpers for CPU pinning; with the `hugepages` feature on Linux, you can use `Publisher::mlock`, `Publisher::prefault`, and `mem::{set_numa_preferred, reset_numa_policy}` to reduce page-fault and NUMA noise.
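The spin tiers can be approximated in portable `std` code; the `UMWAIT`/`WFE` paths are hardware-specific and omitted. `backoff_wait` and its tier thresholds below are illustrative, not crate APIs.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Spin until the stamp advances past `current`, escalating from raw spinning
// (lowest wakeup latency) to spin-loop hints (PAUSE/YIELD) to thread yields
// (better core sharing) -- roughly BusySpin, BackoffSpin, YieldSpin in order.
fn backoff_wait(stamp: &AtomicU64, current: u64) -> u64 {
    let mut spins = 0u32;
    loop {
        let s = stamp.load(Ordering::Acquire);
        if s > current {
            return s;
        }
        spins += 1;
        if spins < 64 {
            // BusySpin tier: burn cycles, re-check immediately.
        } else if spins < 1024 {
            std::hint::spin_loop(); // BackoffSpin tier: CPU relax hint
        } else {
            std::thread::yield_now(); // YieldSpin tier: give up the core
        }
    }
}
```

The tier boundaries (64, 1024) are arbitrary illustration values; a real adaptive strategy would tune them per workload.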
## Design constraints
| Constraint | Rationale |
|---|---|
| `T: Pod` | Every bit pattern must be valid, which makes optimistic torn reads safe to reject. |
| Power-of-two capacity | Indexing uses `seq & mask` instead of `%`. |
| Single producer by default | The fastest path relies on `&mut self` rather than write-side atomics. |
| Lossy overflow by default | The publisher never blocks; subscribers detect drops through `Lagged`. |
| 64-bit atomics required | The core algorithm depends on `AtomicU64`. |
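The power-of-two constraint is what lets slot indexing replace an integer division with a mask. A quick illustration (`slot_index` is an illustrative helper, not a crate API):

```rust
// With a power-of-two capacity, `seq & (capacity - 1)` computes the same
// slot index as `seq % capacity`, without a division on the hot path.
fn slot_index(seq: u64, capacity: u64) -> u64 {
    debug_assert!(capacity.is_power_of_two());
    seq & (capacity - 1)
}
```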
## Platform support
| Platform | Core ring | Affinity | Topology | Hugepages |
|---|---|---|---|---|
| x86_64 Linux | Yes | Yes | Yes | Yes |
| x86_64 macOS / Windows | Yes | Yes | Yes | No |
| aarch64 Linux | Yes | Yes | Yes | Yes |
| aarch64 macOS (Apple Silicon) | Yes | Yes | Yes | No |
| wasm32 | Yes | No | No | No |
| FreeBSD / NetBSD / Android | Yes | Yes | Yes | No |
| 32-bit ARM (Cortex-M) | No | No | No | No |
## Soundness and Pod

> [!WARNING]
> Photon Ring uses a seqlock-style optimistic read: a subscriber may speculatively copy a slot while a writer is updating it, then reject the value if the stamp changed. This pattern is sound for `T: Pod`, but not for arbitrary `Copy` types. If a torn bit pattern could be invalid for `T`, the read would be undefined behavior before Photon Ring had a chance to discard it.
The `Pod` trait means more than `Copy`: every possible bit pattern of the payload must be valid. Primitive numerics, arrays of `Pod`, and tuples of `Pod` are already supported. For your own structs, use `#[repr(C)]`, stick to `Pod` fields, and implement `Pod` manually or via the `derive` feature when appropriate.
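For example, a `#[repr(C)]` message built only from plain numeric fields. `Tick` is a hypothetical type, and the 56-byte bound mirrors the single-cache-line layout described earlier; the `Pod` impl itself (manual or derived) is omitted here.

```rust
// All fields are plain numerics, so every bit pattern of `Tick` is valid.
// #[repr(C)] pins the layout so the ring can memcpy it safely.
#[repr(C)]
#[derive(Clone, Copy, PartialEq, Debug)]
struct Tick {
    price: f64,      // 8 bytes
    qty: u64,        // 8 bytes
    instrument: u32, // 4 bytes
    side: u8,        // 0 = Buy, 1 = Sell (enum flattened at the boundary)
    _pad: [u8; 3],   // explicit padding keeps the layout obvious
}

// 24 bytes <= 56: stamp and payload share one cache line.
const _: () = assert!(std::mem::size_of::<Tick>() <= 56);
```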
| Type | Why it is not `Pod` | Use instead |
|---|---|---|
| `bool` | Only `0` and `1` are valid | `u8` |
| `char` | Must be a valid Unicode scalar | `u32` |
| `NonZero<u32>` | `0` is invalid | `u32` |
| `Option<T>` | The discriminant has invalid patterns | Sentinel integer |
| Rust `enum` | Only declared variants are valid | `u8` or `u32` |
| `&T`, `&str` | Pointers must be valid | Value types only |
| `String`, `Vec<_>` | Heap-owning, has `Drop` | Fixed `[u8; N]` buffer |
> [!TIP]
> Keep rich domain types at the edges and publish compact `Pod` messages in the middle. Convert enums, `Option`, booleans, and strings into explicit numeric fields or fixed-size buffers before calling `publish`.
Examples of safe boundary conversions:
- `bool` -> `u8` (`0` = false, `1` = true)
- `enum Side { Buy, Sell }` -> `u8` (`0` = Buy, `1` = Sell)
- `Option<u32>` -> `u32` (`0` = None, nonzero = `Some`)
- `String`/`&str` -> `[u8; N]`
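These conversions can be written as small boundary helpers (illustrative names, not crate APIs):

```rust
// bool <-> u8 at the publish/receive boundary.
fn bool_to_wire(b: bool) -> u8 { b as u8 }
fn bool_from_wire(w: u8) -> bool { w != 0 }

// Option<u32> <-> u32 with 0 as the None sentinel (payload values must be nonzero).
fn opt_to_wire(v: Option<u32>) -> u32 { v.unwrap_or(0) }
fn opt_from_wire(w: u32) -> Option<u32> { if w == 0 { None } else { Some(w) } }

// &str -> fixed [u8; N], zero-padded and truncated at N bytes.
// (Byte-level truncation may split a multi-byte UTF-8 character.)
fn str_to_wire<const N: usize>(s: &str) -> [u8; N] {
    let mut buf = [0u8; N];
    let n = s.len().min(N);
    buf[..n].copy_from_slice(&s.as_bytes()[..n]);
    buf
}
```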
## Running
## License
Licensed under Apache-2.0. See LICENSE-APACHE.