# gil
Get In Line - A fast single-producer single-consumer (SPSC) queue with sync and async support.
Inspired by `ProducerConsumerQueue` from Facebook's folly.

⚠️ WIP: things WILL change a lot without warning, even in minor updates, until v1. Use at your own risk.
## Features
- Lock-free: Uses atomic operations for synchronization
- Single-producer, single-consumer: Optimized for this specific use case
- Thread-safe: Producer and consumer can run on different threads
- Blocking and non-blocking operations: Both sync and async APIs
- Batch operations: Send and receive multiple items efficiently
- Zero-copy operations: Direct buffer access for maximum performance
- High performance (probably): Competitive with Facebook's folly implementation
## Installation

Add this to your `Cargo.toml`:

```toml
[dependencies]
gil = "0.2"
```
## Usage

The producer and consumer can run on different threads, but there can be only one producer and one consumer. Either handle can be moved between threads, but neither can be shared between threads. The queue has a fixed capacity that must be specified when creating the channel.

The consumer blocks until there is a value on the queue; use `Receiver::try_recv` for the non-blocking version. Similarly, the producer blocks until there is a free slot on the queue; use `Sender::try_send` for the non-blocking version.
### Basic Example (Synchronous)

A minimal sketch; the blocking `send`/`recv` names and their `Result`-returning signatures are assumed from the `try_send`/`try_recv` API mentioned above.

```rust
use std::thread;

use gil::channel;

fn main() {
    // Fixed capacity, chosen here for illustration.
    let (tx, rx) = channel(1024);

    // Producer thread: blocks whenever the queue is full.
    let producer = thread::spawn(move || {
        for i in 0..100u64 {
            tx.send(i).unwrap();
        }
    });

    // Consumer: blocks until a value is available.
    for _ in 0..100 {
        let value = rx.recv().unwrap();
        println!("got {value}");
    }

    producer.join().unwrap();
}
```
### Async Example

A sketch assuming a Tokio runtime; the async `send`/`recv` methods (and whether the same `channel` constructor serves both the sync and async APIs) are assumptions, not confirmed signatures.

```rust
use gil::channel;

#[tokio::main]
async fn main() {
    let (tx, rx) = channel(1024);

    let producer = tokio::spawn(async move {
        for i in 0..100u64 {
            // Assumed: awaits a free slot instead of blocking the thread.
            tx.send(i).await.unwrap();
        }
    });

    for _ in 0..100 {
        // Assumed: awaits the next available value.
        let value = rx.recv().await.unwrap();
        println!("got {value}");
    }

    producer.await.unwrap();
}
```
### Non-blocking Operations

The error variants below are elided with `_`, since the exact error types aren't shown here.

```rust
use gil::channel;

fn main() {
    let (tx, rx) = channel(1024);

    // Try to send without blocking; fails immediately if the queue is full.
    match tx.try_send(42u64) {
        Ok(()) => println!("sent"),
        Err(_) => println!("queue is full"),
    }

    // Try to receive without blocking; fails immediately if the queue is empty.
    match rx.try_recv() {
        Ok(value) => println!("got {value}"),
        Err(_) => println!("queue is empty"),
    }
}
```
### Batch Operations

Batch operations are more efficient than individual sends/receives because they amortize the cost of atomic operations across multiple items. The batch method names below (`send_batch`, `recv_batch`) are placeholders for illustration; check the crate docs for the actual API.

```rust
use std::collections::VecDeque;
use std::thread;

use gil::channel;

fn main() {
    let (tx, rx) = channel(1024);

    // Producer: push items in chunks rather than one at a time.
    thread::spawn(move || {
        let items: Vec<u64> = (0..1000).collect();
        for chunk in items.chunks(128) {
            tx.send_batch(chunk); // hypothetical batch send
        }
    });

    // Consumer: drain whatever is available into a local buffer.
    let mut buffer = VecDeque::new();
    let mut received = 0;
    while received < 1000 {
        received += rx.recv_batch(&mut buffer); // hypothetical batch receive returning a count
    }
}
```
### Zero-Copy Operations

For maximum performance, you can directly access the internal buffer. The `get_write_slice`/`commit` and `get_read_slice`/`advance` calls are the crate's; the exact signatures around them are reconstructed assumptions.

```rust
use std::ptr;

use gil::channel;

fn main() {
    let (tx, rx) = channel(1024);

    // Zero-copy write: copy directly into the queue's buffer, then publish.
    let data = [1u64, 2, 3, 4];
    let slice = tx.get_write_slice();
    let count = data.len().min(slice.len());
    unsafe {
        ptr::copy_nonoverlapping(data.as_ptr(), slice.as_mut_ptr(), count);
    }
    tx.commit(count); // make `count` items visible to the consumer

    // Zero-copy read: inspect items in place, then release the slots.
    let slice = rx.get_read_slice();
    let count = slice.len();
    for &value in slice {
        println!("got {value}");
    }
    rx.advance(count); // hand the consumed slots back to the producer
}
```
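Note the pairing: nothing written via `get_write_slice` becomes visible to the consumer until `commit` runs, and slots read via `get_read_slice` are only recycled once `advance` is called. This is the usual reserve/commit pattern for zero-copy ring buffers; consult the crate docs for the exact contract (e.g. whether partial commits are allowed).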
## Performance

The queue achieves high throughput through several optimizations (a layout sketch follows the list):
- Cache-line alignment: Head and tail pointers are on separate cache lines to prevent false sharing
- Local caching: Each side caches the other side's position to reduce atomic operations
- Batch operations: Amortize atomic operation costs across multiple items
- Zero-copy API: Direct buffer access eliminates memory copies
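As a rough illustration of the first two points, here is a minimal sketch of the layout idea. The field names, the 64-byte alignment, and the exact caching scheme are illustrative, not the crate's actual internals:

```rust
use std::sync::atomic::AtomicUsize;

// Each index lives on its own cache line: the producer writes `tail`, the
// consumer writes `head`, and neither write invalidates the other's line.
#[repr(align(64))]
struct CachePadded(AtomicUsize);

struct Inner {
    head: CachePadded,  // next slot to read; advanced by the consumer
    tail: CachePadded,  // next slot to write; advanced by the producer
    buffer: Box<[u64]>, // fixed-capacity ring storage
}

// On top of this, each endpoint keeps a stale local copy of the *other*
// side's index and only reloads it atomically when the cached value makes
// the queue look full (producer) or empty (consumer).
```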
On an Apple M3, the queue can achieve roughly 50 GB/s of throughput with batching and zero-copy operations. Latency is around 80 ns, but depends on which cores the producer and consumer run on.
## Type Constraints

The queue works with:

- `u128` on aarch64 (ARM64) architectures
- `u64` on x86_64 architectures
This allows the queue to store values that fit within these sizes directly. For larger types, consider using indices or pointers with an external storage mechanism.
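As a hedged illustration of the pointer route, one option is to box a large value and send its address through a `u64` slot, with the consumer taking over ownership. The `send`/`recv` names are assumed as in the earlier examples:

```rust
use gil::channel;

struct Large {
    data: [u8; 4096], // too big to fit in a u64/u128 slot directly
}

fn main() {
    let (tx, rx) = channel(16);

    // Producer: heap-allocate and send the address as a u64.
    let boxed = Box::new(Large { data: [0u8; 4096] });
    tx.send(Box::into_raw(boxed) as u64).unwrap();

    // Consumer: rebuild the Box from the address; it now owns the value
    // and will free it on drop.
    let addr = rx.recv().unwrap();
    let large = unsafe { Box::from_raw(addr as *mut Large) };
    assert_eq!(large.data[0], 0);
}
```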
## Safety
The code has been verified using:
## Future Improvements

- More comprehensive benchmarks
- Support for generic types (not just `u64`/`u128`) using custom arena allocators
- Optimize for x86
- Try benchmarking with Intel x86's `cldemote` instruction
- Run and benchmark on NVIDIA Grace (or any NVLink-C2C system), just for fun and to see how fast this can really go. In theory, NVIDIA Grace should go up to 900 GB/s.
## Contributing
Contributions are welcome! Please feel free to submit issues or pull requests.
## License
MIT License - see LICENSE file for details.