photon-ring 2.2.0

Ultra-low-latency SPMC pub/sub using seqlock-stamped ring buffers. no_std compatible.
Documentation

Photon Ring

Crates.io docs.rs License no_std CI

Ultra-low-latency SPMC/MPMC pub/sub using seqlock-stamped ring buffers.

Photon Ring is a zero-allocation pub/sub crate for Rust built around pre-allocated ring buffers, per-slot seqlock stamps, and T: Pod payloads. It targets the part of concurrent systems where queueing overhead dominates: market data, telemetry fanout, staged pipelines, and other hot-path broadcast workloads where every subscriber should see every message.

It is no_std compatible with alloc, supports named-topic buses and typed buses, and includes a pipeline builder for multi-stage thread topologies on supported desktop/server platforms.

[!IMPORTANT] The default channel() is lossy on overflow: the publisher never blocks, and slow subscribers detect drops via TryRecvError::Lagged. If lossless delivery matters more than raw latency, use channel_bounded().

Quick start

use photon_ring::{channel, channel_mpmc, Photon};

// SPMC channel (single producer, multiple consumers)
let (mut pub_, subs) = channel::<u64>(1024);
let mut sub = subs.subscribe();
pub_.publish(42);
assert_eq!(sub.try_recv(), Ok(42));

// MPMC channel (multiple producers, clone-able)
let (mp_pub, subs) = channel_mpmc::<u64>(1024);
let mp_pub2 = mp_pub.clone();
mp_pub.publish(1);
mp_pub2.publish(2);

// Named-topic bus
let bus = Photon::<u64>::new(1024);
let mut p = bus.publisher("prices");
let mut s = bus.subscribe("prices");
p.publish(100);
assert_eq!(s.try_recv(), Ok(100));

Installation

[dependencies]
photon-ring = "2.1.0"

Optional features:

  • derive: enables #[derive(photon_ring::DerivePod)] for user-defined Pod types.
  • hugepages: enables Linux memory controls such as mlock, prefault, and NUMA helpers.

Rust 1.70+ is supported.

The problem

Inter-thread communication is often the dominant cost in concurrent systems. Traditional messaging designs usually pay for at least one expensive property on the hot path:

Approach Write cost Read cost Allocation
std::sync::mpsc Lock + CAS Lock + CAS Per-message
Mutex<VecDeque> Lock acquisition Lock acquisition Dynamic growth
crossbeam-channel CAS on head CAS on tail None
LMAX Disruptor pattern Sequence claim + barrier Sequence barrier spin None

The Disruptor showed that pre-allocated rings can remove allocator overhead and drive very low latency, but its shared sequence barriers still create cache-line contention between producers and consumers.

The solution: seqlock-stamped slots

Photon Ring moves synchronization into each slot. Every slot carries its own seqlock stamp beside the payload, so readers validate the data they just loaded instead of bouncing a shared barrier cache line.

                        64 bytes (one cache line)
    +-----------------------------------------------------+
    |  stamp: AtomicU64  |  value: T                      |
    |  (seqlock)         |  (Pod - all bit patterns valid)|
    +-----------------------------------------------------+
    For T <= 56 bytes, stamp and value share one cache line.
    Larger T spills to additional lines (still correct, slightly slower).

Write protocol

1. stamp = seq * 2 + 1     (odd = write in progress)
2. fence(Release)          (stamp visible before data)
3. memcpy(slot.value, data)
4. stamp = seq * 2 + 2     (even = write complete, Release)
5. cursor = seq            (Release - consumers can proceed)

Read protocol

1. s1 = stamp.load(Acquire)
2. if odd -> spin
3. if s1 < expected -> Empty
4. if s1 > expected -> Lagged
5. value = memcpy(slot)    (direct read, T: Pod)
6. s2 = stamp.load(Acquire)
7. if s1 == s2 -> return
8. else -> retry

Why this is fast

  1. No shared mutable state on the read path. Each subscriber keeps its own local cursor. Readers do not publish progress into a shared hot cache line unless bounded backpressure tracking is enabled.
  2. Stamp and payload are co-located. For T <= 56 bytes, the stamp check and payload read hit the same cache line.
  3. No allocation on publish or receive. The ring is fixed at construction time, and hot-path operations are direct slot reads and writes.
  4. T: Pod makes torn reads safe to reject. Every bit pattern is valid, so an optimistic torn read is harmless and discarded by the stamp re-check.
  5. Single-producer SPMC avoids write-side contention. Publisher::publish takes &mut self, so the type system enforces one producer without CAS. MpPublisher adds an MPMC path when you need multiple concurrent writers.

Benchmarks

Measured with Criterion on an Intel i7-10700KF (8C/16T, 3.80 GHz, Linux 6.8, Rust 1.93.1) and Apple M1 Pro (8C, macOS 26.3, Rust 1.92.0), --release, 100 samples, no core pinning unless stated.

[!NOTE] These numbers use Pod payloads and compare concrete implementations, not abstract algorithms. Scheduler noise, pinning, CPU generation, and payload layout all matter, so treat them as reproducible snapshots rather than universal constants.

Against disruptor-rs

  • Publish: 2.8 ns (Intel) / 2.4 ns (M1 Pro), versus 30.6 ns / 15.3 ns for disruptor-rs
  • Cross-thread roundtrip: 95 ns (Intel) / 130 ns (M1 Pro), versus 138 ns / 186 ns for disruptor-rs

Core operations

                                    i7-10700KF     M1 Pro
                                    ──────────     ──────
  Publish only                        2.8 ns      2.4 ns
  Roundtrip (1 sub, same thread)      2.7 ns      8.8 ns
  Fanout (10 independent subs)       17.0 ns     27.7 ns
  SubscriberGroup (any N, O(1))       2.6 ns      8.8 ns
  MPMC (1 pub, 1 sub)                12.1 ns     10.6 ns
  Empty poll                          0.9 ns      1.1 ns
  Batch 64 + drain                    158 ns      282 ns
  Struct roundtrip (24B Pod)          4.8 ns      9.3 ns
  Cross-thread roundtrip               95 ns      130 ns
  One-way latency (RDTSC)             48 ns p50     —

Throughput

  • Sustained throughput: about 300M msg/s on Intel and 88M msg/s on M1 Pro
  • Payload scaling: Photon Ring outperformed disruptor-rs across 8-byte to 4 KiB Pod payloads; see docs/payload-scaling.md

Comparison

Photon Ring disruptor-rs (v4) crossbeam-channel bus
Delivery Broadcast Broadcast Point-to-point queue Broadcast
Publish cost 2.8 ns / 12.1 ns (MPMC) 30.6 ns - -
Cross-thread 95 ns 138 ns - -
Throughput ~300M msg/s - - -
Topology builder Yes Yes No No
Batch APIs Yes Yes Iterator No
Topic bus Yes No No No
Backpressure Optional Default Default Default
no_std Yes No No No
Multi-producer Yes Yes Yes No

Use crossbeam-channel when one receiver should own each message. Use Photon Ring when each subscriber should observe the same stream with minimal coordination overhead.

API overview

Channels are the lowest-level interface. channel::<T>(capacity) creates the fastest single-producer path and returns a Publisher<T> plus a cloneable Subscribable<T>. channel_bounded::<T>(capacity, watermark) adds optional backpressure; Publisher::try_publish returns PublishError::Full(value) instead of overwriting unread slots. channel_mpmc::<T>(capacity) returns MpPublisher<T>, which is Clone + Send + Sync and uses atomic sequence claiming for concurrent producers. On the write side, the important APIs are publish, publish_with for in-place construction, publish_batch on Publisher, and published/capacity for lightweight counters.

Subscribers are independent and contention-free by default. Subscribable::subscribe() starts from future messages only, while subscribe_from_oldest() starts at the oldest message still retained in the ring. Subscriber<T> exposes try_recv, recv, recv_with, latest, pending, recv_batch, and drain, plus observability counters through total_received, total_lagged, and receive_ratio. If multiple logical consumers are polled on the same thread, subscribe_group::<N>() creates a SubscriberGroup<T, N> that performs one ring read for the whole group instead of one per logical subscriber.

For topic routing, Photon<T> provides a string-keyed bus where all topics share the same payload type, and TypedBus allows a different T: Pod per topic. Both lazily create topics and expose publisher, try_publisher, subscribe, and subscribable. publisher() will panic if the publisher for that topic was already taken, and TypedBus also panics on type mismatches for an existing topic.

Pipelines build dedicated-thread processing graphs on supported OS targets. topology::Pipeline::builder().capacity(...).input::<T>() returns an input publisher plus a typed builder; .then(...) chains stages, .fan_out(...) creates a diamond, .then_a(...) and .then_b(...) extend either branch, and .build() returns the final subscriber plus a Pipeline handle. The handle supports shutdown, join, panicked_stages, is_healthy, and stage_count. For manual shutdown outside topology, use Shutdown.

Wait behavior is explicit. recv_with accepts WaitStrategy::BusySpin, YieldSpin, BackoffSpin, Adaptive, MonitorWait, or MonitorWaitFallback depending on whether you want the absolute lowest wakeup latency or better core sharing. MonitorWait uses Intel UMONITOR/UMWAIT (Alder Lake+) for near-zero power wakeup (~30 ns), with automatic fallback to PAUSE on older x86 or WFE on ARM; construct it safely via WaitStrategy::monitor_wait(&stamp). MonitorWaitFallback uses TPAUSE without requiring an address. On supported platforms, the crate also includes affinity helpers for CPU pinning; with the hugepages feature on Linux, you can use Publisher::mlock, Publisher::prefault, and mem::{set_numa_preferred, reset_numa_policy} to reduce page-fault and NUMA noise.

Design constraints

Constraint Rationale
T: Pod Every bit pattern must be valid, which makes optimistic torn reads safe to reject.
Power-of-two capacity Indexing uses seq & mask instead of %.
Single producer by default The fastest path relies on &mut self rather than write-side atomics.
Lossy overflow by default The publisher never blocks; subscribers detect drops through Lagged.
64-bit atomics required The core algorithm depends on AtomicU64.

Platform support

Platform Core ring Affinity Topology Hugepages
x86_64 Linux Yes Yes Yes Yes
x86_64 macOS / Windows Yes Yes Yes No
aarch64 Linux Yes Yes Yes Yes
aarch64 macOS (Apple Silicon) Yes Yes Yes No
wasm32 Yes No No No
FreeBSD / NetBSD / Android Yes Yes Yes No
32-bit ARM (Cortex-M) No No No No

Soundness and Pod

[!WARNING] Photon Ring uses a seqlock-style optimistic read: a subscriber may speculatively copy a slot while a writer is updating it, then reject the value if the stamp changed. This pattern is sound for T: Pod, but not for arbitrary Copy types. If a torn bit pattern could be invalid for T, the read would be undefined behavior before Photon Ring had a chance to discard it.

The Pod trait means more than Copy: every possible bit pattern of the payload must be valid. Primitive numerics, arrays of Pod, and tuples of Pod are already supported. For your own structs, use #[repr(C)], stick to Pod fields, and implement Pod manually or via the derive feature when appropriate.

Type Why it is not Pod Use instead
bool Only 0 and 1 are valid u8
char Must be a valid Unicode scalar u32
NonZero<u32> 0 is invalid u32
Option<T> The discriminant has invalid patterns Sentinel integer
Rust enum Only declared variants are valid u8 or u32
&T, &str Pointers must be valid Value types only
String, Vec<_> Heap-owning, has Drop Fixed [u8; N] buffer

[!TIP] Keep rich domain types at the edges and publish compact Pod messages in the middle. Convert enums, Option, booleans, and strings into explicit numeric fields or fixed-size buffers before calling publish.

Examples of safe boundary conversions:

  • bool -> u8 (0 = false, 1 = true)
  • enum Side { Buy, Sell } -> u8 (0 = Buy, 1 = Sell)
  • Option<u32> -> u32 (0 = None, nonzero = Some)
  • String / &str -> [u8; N]

Running

cargo test
cargo bench
cargo bench --bench payload_scaling
cargo +nightly miri test --test correctness -- --test-threads=1
cargo run --release --example market_data
cargo run --release --example pipeline
cargo run --release --example backpressure

License

Licensed under Apache-2.0. See LICENSE-APACHE.