multiqueue2 0.1.3

A fast mpmc broadcast queue
# MultiQueue2: Fast MPMC Broadcast Queue

MultiQueue2 is a fast bounded MPMC queue that supports broadcast-style operations [![Build Status](https://travis-ci.org/schets/multiqueue.svg?branch=master)](https://travis-ci.org/abbychau/multiqueue)

[MultiQueue](https://github.com/schets/multiqueue) is an MPMC library developed by Sam Schetterer, but it has not been updated for some time. I found it very useful as it implements `futures`. However, it relies on a few outdated library APIs, and its use of spin locks takes 100% CPU in many cases. This version tries to fix these problems.

All compiler warnings are fixed, and the crate is upgraded to the Rust 2018 edition as well.

TOC: [Overview](#over) | [Queue Model](#model) | [Examples](#examples) | [MPMC Mode](#mpmc) | [Futures Mode](#futures) | [Benchmarks](#bench) | [FAQ](#faq) | [Footnotes](#footnotes)

## <a name = "over">Overview</a>


Multiqueue is based on the queue design from the LMAX Disruptor, with a few improvements:
  * It can act as a futures stream/sink, so you can set up high-performance computing pipelines with ease
  * It can dynamically add/remove producers, and each [stream](#model) can have multiple consumers
  * It has fast fallbacks for whenever there's a single consumer and/or a single producer and can detect switches at runtime
  * It works on 32 bit systems without any performance or capability penalty
  * In most cases, one can view data written directly into the queue without copying it

One can think of MultiQueue as a sort of [souped up channel/sync_channel](#bench),
with the additional ability to have multiple independent consumers each receiving the same [stream](#model) of data.
So why would you choose MultiQueue over the built-in channels?
  * MultiQueue supports broadcasting elements to multiple readers with a single push into the queue
  * MultiQueue allows reading elements in-place in the queue in most cases, so you can broadcast elements without lots of copying
  * MultiQueue can act as a futures stream and sink
  * MultiQueue does not allocate on push/pop unlike channel, leading to much more predictable latencies
  * MultiQueue is practically lockless<sup>[1](#ft1)</sup> unlike sync_channel, and fares decently under contention

On the other hand, you would want to use a channel/sync_channel if you:
  * Truly want an unbounded queue, although you should probably handle backlog instead
  * Need senders to block when the queue is full and can't use the futures api
  * Don't want the memory usage of a large buffer
  * Need a oneshot queue
  * Very frequently add/remove producers/consumers

Otherwise, in most cases, MultiQueue should be a good replacement for channels.
In general, this will function very well as a normal bounded queue, with performance
approaching that of hand-written queues for single/multiple consumers/producers,
even without taking advantage of the broadcast.

## <a name = "model">Queue Model</a>

MultiQueue functions similarly to the LMAX Disruptor from a high level view.
There's an incoming FIFO data stream that is broadcast to a set of subscribers
as if there were multiple streams being written to.
There are two main differences:
  * MultiQueue transparently supports switching between single and multiple producers.
  * Each broadcast stream can be shared among multiple consumers.

The last part makes the model a bit confusing, since there's a difference between a stream of data
and something consuming that stream. To make things worse, each consumer may not actually see each
value on the stream. Instead, multiple consumers may act on a single stream each getting unique access
to certain elements.

A helpful mental model may be to think about this as if each stream was really just an mpmc
queue that was getting pushed to, and the MultiQueue structure just assembled a bunch together behind the scenes.
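
As a rough illustration of this mental model, here is a toy, single-threaded sketch that uses only the standard library. It is not how the crate is implemented (the real queue shares one buffer between streams), and the names `ToyBroadcast`, `add_stream`, `push`, `pop` and `demo_streams` are hypothetical.

```rust
use std::collections::VecDeque;

/// Toy model only: one FIFO per stream. The real queue shares a single buffer.
struct ToyBroadcast<T: Clone> {
    streams: Vec<VecDeque<T>>,
}

impl<T: Clone> ToyBroadcast<T> {
    fn new() -> Self {
        ToyBroadcast { streams: Vec::new() }
    }

    /// Adds a stream and returns its index.
    fn add_stream(&mut self) -> usize {
        self.streams.push(VecDeque::new());
        self.streams.len() - 1
    }

    /// A single push shows up once on every stream.
    fn push(&mut self, item: T) {
        for stream in &mut self.streams {
            stream.push_back(item.clone());
        }
    }

    /// Any consumer of `stream` may call this; each element is handed out
    /// to exactly one consumer of that stream.
    fn pop(&mut self, stream: usize) -> Option<T> {
        self.streams[stream].pop_front()
    }
}

/// Pushes 1 and 2, then reads stream `a` with one consumer and
/// stream `b` with two competing consumers.
fn demo_streams() -> (Vec<i32>, Vec<i32>, Vec<i32>) {
    let mut q = ToyBroadcast::new();
    let a = q.add_stream();
    let b = q.add_stream();
    q.push(1);
    q.push(2);

    // The only consumer of stream `a` sees every element.
    let mut only_consumer = Vec::new();
    while let Some(v) = q.pop(a) {
        only_consumer.push(v);
    }

    // Two consumers share stream `b`, so each gets a different element.
    let first: Vec<i32> = q.pop(b).into_iter().collect();
    let second: Vec<i32> = q.pop(b).into_iter().collect();
    (only_consumer, first, second)
}

fn main() {
    let (a, b1, b2) = demo_streams();
    println!("stream a: {:?}, stream b consumers: {:?} and {:?}", a, b1, b2);
}
```

The `T: Clone` bound appears here for the same reason it does on the broadcast queue: one pushed value has to be visible on several streams.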

The diagram below represents a general use case of a broadcast queue where each consumer has unique access to a stream.
Each # stands in for a producer and each @ stands in for the consumer of a stream, each with a label.
The lines are meant to show the data flow through the queue.

```
-> #        @-1
    \      /
     -> -> -> @-2
    /      \
-> #        @-3
```
This is a pretty standard broadcast queue setup:
each element sent in is seen on each stream by that stream's consumer.


However, in MultiQueue, each logical consumer might actually be demultiplexed across many actual consumers, like below.
```
-> #        @-1
    \      /
     -> -> -> @-2' (really @+@+@ each compete for a spot)
    /      \
-> #        @-3
```

If this diagram is redrawn with each of the producers sending in a sequenced element (time goes left to right):


```
t=1|t=2|    t=3    | t=4|
1 -> #              @-1 (1, 2)
      \            /
       -> 2 -> 1 -> -> @-2' (really @ (1) + @ (2) + @ (nothing yet))
      /            \
2 -> #              @-3 (1, 2)
```

If one imagines this as a webserver, the streams for @-1 and @-3 might be doing random webservery work like some logging
or metrics gathering and can handle the workload completely on one core, while @-2 is doing expensive work handling requests
and is split into multiple workers dealing with the data stream.

Since those drawings probably made no sense, here are some examples.

## <a name = "examples">Examples</a>

### Single-producer single-stream

This is about as simple as it gets for a queue. Fast, one writer, one reader, simple to use.
```rust
extern crate multiqueue2 as multiqueue;

use std::thread;

let (send, recv) = multiqueue::mpmc_queue(10);

thread::spawn(move || {
    for val in recv {
        println!("Got {}", val);
    }
});

for i in 0..10 {
    send.try_send(i).unwrap();
}

// Drop the sender to close the queue
drop(send);

// prints
// Got 0
// Got 1
// Got 2
// etc


// some join mechanics here
```
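
The "join mechanics" left out above could look like the following sketch. To stay self-contained it uses `std::sync::mpsc::sync_channel` as a stand-in for the queue, and `send_and_join` is a hypothetical helper. The same pattern (keep the `JoinHandle`, drop the sender, then join) should carry over to the queue, though that is an assumption rather than something tested here.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Sends 0..n to a consumer thread and returns what the consumer received.
fn send_and_join(n: u32) -> Vec<u32> {
    // Stand-in for the queue: a bounded channel from the standard library.
    let (send, recv) = sync_channel(10);

    // Keep the handle so the thread can be joined later.
    let handle = thread::spawn(move || {
        let mut seen = Vec::new();
        for val in recv {
            seen.push(val);
        }
        seen
    });

    for i in 0..n {
        send.send(i).unwrap();
    }

    // Dropping the sender ends the consumer's `for` loop.
    drop(send);

    // Wait for the consumer to finish and take its result.
    handle.join().unwrap()
}

fn main() {
    println!("{:?}", send_and_join(10));
}
```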

### Single-producer double stream.

Let's send the values to two different streams:
```rust
extern crate multiqueue2 as multiqueue;

use std::thread;

let (send, recv) = multiqueue::broadcast_queue(4);

for i in 0..2 { // or n
    let cur_recv = recv.add_stream();
    thread::spawn(move || {
        for val in cur_recv {
            println!("Stream {} got {}", i, val);
        }
    });
}

// Take notice that I drop the reader - this removes it from
// the queue, meaning that the readers in the new threads
// won't get starved by the lack of progress from recv
recv.unsubscribe();

for i in 0..10 {
    // Don't do this busy loop in real stuff unless you're really sure
    loop {
        if send.try_send(i).is_ok() {
            break;
        }
    }
}

// Drop the sender to close the queue
drop(send);

// prints along the lines of
// Stream 0 got 0
// Stream 0 got 1
// Stream 1 got 0
// Stream 0 got 2
// Stream 1 got 1
// etc

// some join mechanics here
```

### Single-producer broadcast, 2 consumers per stream
Let's take the above and make each stream consumed by two consumers:
```rust
extern crate multiqueue2 as multiqueue;

use std::thread;

let (send, recv) = multiqueue::broadcast_queue(4);

for i in 0..2 { // or n
    let cur_recv = recv.add_stream();
    for j in 0..2 {
        let stream_consumer = cur_recv.clone();
        thread::spawn(move || {
            for val in stream_consumer {
                println!("Stream {} consumer {} got {}", i, j, val);
            }
        });
    }
    // cur_recv is dropped here
}

// Take notice that I drop the reader - this removes it from
// the queue, meaning that the readers in the new threads
// won't get starved by the lack of progress from recv
recv.unsubscribe();

for i in 0..10 {
    // Don't do this busy loop in real stuff unless you're really sure
    loop {
        if send.try_send(i).is_ok() {
            break;
        }
    }
}
drop(send);

// prints along the lines of
// Stream 0 consumer 1 got 2
// Stream 0 consumer 0 got 0
// Stream 1 consumer 0 got 0
// Stream 0 consumer 1 got 1
// Stream 1 consumer 1 got 1
// Stream 1 consumer 0 got 2
// etc

// some join mechanics here
```

### Something wacky
Has anyone really been far even as decided to use even go want to do look more like?

```rust
extern crate multiqueue2 as multiqueue;

use std::thread;

let (send, recv) = multiqueue::broadcast_queue(4);

// start like before
for i in 0..2 { // or n
    let cur_recv = recv.add_stream();
    for j in 0..2 {
        let stream_consumer = cur_recv.clone();
        thread::spawn(move || {
            for val in stream_consumer {
                println!("Stream {} consumer {} got {}", i, j, val);
            }
        });
    }
    // cur_recv is dropped here
}

// On this stream, since there's only one consumer,
// the receiver can be made into a SingleReceiver
// which can view items inline in the queue
let single_recv = recv.add_stream().into_single().unwrap();

thread::spawn(move || {
    for val in single_recv.iter_with(|item_ref| 10 * *item_ref) {
        println!("{}", val);
    }
});

// Same as above, except this time we just want to iterate until the receiver is empty
let single_recv_2 = recv.add_stream().into_single().unwrap();

thread::spawn(move || {
    for val in single_recv_2.try_iter_with(|item_ref| 10 * *item_ref) {
        println!("{}", val);
    }
});

// Take notice that I drop the reader - this removes it from
// the queue, meaning that the readers in the new threads
// won't get starved by the lack of progress from recv
recv.unsubscribe();

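// Many senders to give all the receivers something
for _ in 0..3 {
    let cur_send = send.clone();
    // thread::spawn takes a closure; `move` hands this clone to the new thread
    thread::spawn(move || {
        for i in 0..10 {
            // Don't do this busy loop in real stuff unless you're really sure
            loop {
                if cur_send.try_send(i).is_ok() {
                    break;
                }
            }
        }
    });
}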
drop(send);
```

## <a name = "mpmc">MPMC Mode</a>
One might notice that the broadcast queue modes require that a type be Clone,
and the single-reader inplace variants require that a type be Sync as well.
This is only required for broadcast queues and not normal mpmc queues,
so there's an mpmc api as well. It doesn't require that a type be Clone or Sync
for any api, and also moves items directly out of the queue instead of cloning them.
There's basically no api difference aside from that, so I'm not going to have a huge
section on them.
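
The difference in trait bounds follows from what each mode has to do with an element. The sketch below is only an illustration with hypothetical helpers (`deliver_to_streams`, `deliver_to_one`) and a hypothetical `Job` type; it is not the crate's API. Handing one value to several streams needs `Clone`, while handing it to a single consumer can simply move it.

```rust
/// A payload that deliberately does not implement `Clone`.
struct Job {
    id: u32,
}

/// Broadcast-style delivery: every stream gets its own copy, hence `T: Clone`.
fn deliver_to_streams<T: Clone>(item: &T, streams: usize) -> Vec<T> {
    (0..streams).map(|_| item.clone()).collect()
}

/// MPMC-style delivery: exactly one consumer takes the value, so it is moved
/// out of the slot and no `Clone` bound is needed.
fn deliver_to_one<T>(slot: &mut Option<T>) -> Option<T> {
    slot.take()
}

fn main() {
    let copies = deliver_to_streams(&7u32, 3);
    println!("{:?}", copies);

    let mut slot = Some(Job { id: 1 });
    let job = deliver_to_one(&mut slot).unwrap();
    println!("job {}", job.id);
    // deliver_to_streams(&job, 2); // would not compile: `Job` is not `Clone`
}
```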

## <a name = "futures">Futures Mode</a>
For both mpmc and broadcast, a futures mode is supported. The data structures are quite
similar to the normal ones, except they implement the Futures Sink/Stream traits for
senders and receivers. This comes at a bit of a performance cost, which is why the
futures types are separate.


## <a name = "bench">Benchmarks</a>

### Throughput

More are coming, but here are some basic throughput benchmarks done on an Intel(R) Xeon(R) CPU E3-1240 v5.
These were done using the busywait method to block, but using a blocking wait on receivers in practice will
be more than fast enough for most use cases.
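
For readers unfamiliar with the term, a busy-wait receive loop polls the queue instead of sleeping. The sketch below shows the general shape using `std::sync::mpsc` as a stand-in; it is not the benchmark code, and `busy_wait_sum` is a hypothetical helper.

```rust
use std::sync::mpsc::{sync_channel, TryRecvError};
use std::thread;

/// Receives 0..n from a producer thread by polling, and returns the sum.
fn busy_wait_sum(n: u64) -> u64 {
    let (send, recv) = sync_channel(64);

    let producer = thread::spawn(move || {
        for i in 0..n {
            send.send(i).unwrap();
        }
        // `send` is dropped here, which disconnects the channel.
    });

    let mut sum = 0;
    loop {
        match recv.try_recv() {
            Ok(v) => sum += v,
            // Nothing yet: hint to the CPU that we are spinning, then retry.
            Err(TryRecvError::Empty) => std::hint::spin_loop(),
            // Sender gone and buffer drained: we are done.
            Err(TryRecvError::Disconnected) => break,
        }
    }

    producer.join().unwrap();
    sum
}

fn main() {
    println!("{}", busy_wait_sum(1000));
}
```

Spinning like this trades a fully occupied core for lower wake-up latency, which is why a blocking wait is usually the better default outside of benchmarks.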


Single Producer Single Consumer: 50-70 million ops per second. In this case, channels do ~8-11 million ops per second
```
# -> -> @
```
____
Single Producer Single Consumer, broadcast to two different streams: 28 million ops per second<sup>[2](#ft2)</sup>
```
         @
        /
# -> ->
        \
         @
```
____
Single Producer Single Consumer, broadcast to three different streams: 25 million ops per second<sup>[2](#ft2)</sup>
```
         @
        /
# -> -> -> @
        \
         @
```
____
Multi Producer Single Consumer: 9 million ops per second. In this case, channels do ~8-9 million ops per second.
```
#
 \
  -> -> @
 /
#
```
____
Multi Producer Single Consumer, broadcast to two different streams: 8 million ops per second
```
#        @
 \      /
  -> ->
 /      \
#        @
```
### Latency

I need to rewrite the latency benchmark tool, but latencies will be approximately the
inter core communication delay, about 40-70 ns on a single socket machine.
These will be somewhat higher with multiple producers
and multiple consumers since each one must perform an RMW before finishing a write or read.
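
To illustrate the read-modify-write (RMW) mentioned above: when several producers share a queue, each one has to claim its slot atomically. The following is a generic sketch using `std::sync::atomic`, not the queue's actual code, and `claim_slots` is a hypothetical helper.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Each of `producers` threads claims `per_producer` slots with an atomic RMW.
/// Returns every claimed slot index, sorted.
fn claim_slots(producers: usize, per_producer: usize) -> Vec<usize> {
    let tail = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..producers)
        .map(|_| {
            let tail = Arc::clone(&tail);
            thread::spawn(move || {
                (0..per_producer)
                    // fetch_add is the RMW: it returns an index that no
                    // other thread can receive.
                    .map(|_| tail.fetch_add(1, Ordering::Relaxed))
                    .collect::<Vec<usize>>()
            })
        })
        .collect();

    let mut claimed: Vec<usize> = handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect();
    claimed.sort_unstable();
    claimed
}

fn main() {
    println!("{} unique slots claimed", claim_slots(4, 100).len());
}
```

A lone producer could keep a plain counter instead, which is one reason the single-producer cases in the throughput numbers are faster.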

## <a name = "faq">FAQ</a>

#### My type isn't Clone, can I use the queue?
You can use the MPMC portions of the queue, but you can't broadcast anything.

#### Why can't senders block even though readers can?
It's sensible for a reader to block if there is truly nothing for it to do, while the equivalent
isn't true for senders. If a sender blocks, that means that the system is backlogged and
something should be done about it (like starting another worker that consumes from the backed up stream)!
Furthermore, it puts more of a performance penalty on the queue than I really wanted and the latency
hit for notifying senders comes before the queue action is finished, while notifying readers happens
after the value has been sent.
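
A non-blocking sender therefore has to decide what to do with elements that do not fit. The sketch below shows the idea with `std::sync::mpsc::sync_channel` as a stand-in, since its `try_send` also hands the rejected element back. The helper `send_or_report` is hypothetical, and the error type of this crate's own `try_send` is not shown here.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Tries to send every item without blocking and returns the ones that
/// were rejected because the bounded channel was full.
fn send_or_report(capacity: usize, items: Vec<u32>) -> Vec<u32> {
    let (send, recv) = sync_channel(capacity);
    let mut backlog = Vec::new();

    for item in items {
        match send.try_send(item) {
            Ok(()) => {}
            // Full: the caller gets the item back and can react to the
            // backlog, e.g. by starting another worker or shedding load.
            Err(TrySendError::Full(item)) => backlog.push(item),
            // No receiver left: nothing more can be delivered.
            Err(TrySendError::Disconnected(_)) => break,
        }
    }

    // The receiver never reads in this sketch; it is only kept alive
    // until here so that the channel stays connected.
    drop(recv);
    backlog
}

fn main() {
    println!("{:?}", send_or_report(2, vec![1, 2, 3, 4, 5]));
}
```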

#### Why can the futures sender park even though senders can't block?
It's required for the futures api to work sensibly: when a future can't send into the queue,
it expects that the task will be parked and awoken by some other process (if this is wrong, please let me know!).
That makes sense as well since other events will be handled during that time instead of plain blocking.
I'm probably going to add a futures api that just spins on the queue for people who want the niceness of
the futures api but don't want the performance hit.

#### I want to know which stream is the farthest behind when there's backlog, can I do that?
As of now, that's not possible to do. In general, that sort of question is difficult to concretely
answer because any attempt to answer it will be racing against writer updates, and there's also no way
to transform the idea of 'which stream is behind' into something actionable by a program.

#### Is it possible to select from a set of MultiQueues?
No, not currently. Since performance is a key feature of the queue, I wouldn't want
to make changes to allow select if they would negatively affect performance.

#### What happens if consumers of one stream fall behind?
The queue won't overwrite a datapoint until all streams have advanced past it,
so writes to the queue would fail. Depending on your goals, this is either a good or a bad thing.
On one hand, nobody likes getting blocked/starved of updates because of some dumb slow thread.
On the other hand, this basically enforces a sort of system-wide backlog control. If you want
an example of why that's needed, NYSE occasionally does not keep the consolidated feed
up to date with the individual feeds and markets fall into disarray.
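
The rule "no overwrite until all streams have advanced" can be modelled with one write counter and one read cursor per stream. This is a toy, single-threaded sketch under that assumption, not the crate's implementation; `ToyRing` and `demo_backlog` are hypothetical names.

```rust
/// Toy model of a bounded broadcast ring that tracks only positions.
struct ToyRing {
    capacity: usize,
    written: usize,      // total elements written so far
    cursors: Vec<usize>, // total elements read so far, one per stream
}

impl ToyRing {
    fn new(capacity: usize, streams: usize) -> Self {
        ToyRing { capacity, written: 0, cursors: vec![0; streams] }
    }

    /// Fails while the slowest stream still needs the slot that would be reused.
    fn try_write(&mut self) -> bool {
        let slowest = self.cursors.iter().copied().min().unwrap_or(self.written);
        if self.written - slowest >= self.capacity {
            return false;
        }
        self.written += 1;
        true
    }

    /// Advances one stream by one element if it has something to read.
    fn read(&mut self, stream: usize) -> bool {
        if self.cursors[stream] < self.written {
            self.cursors[stream] += 1;
            true
        } else {
            false
        }
    }
}

/// Two streams, capacity two: stream 1 lags and stalls the writer.
fn demo_backlog() -> Vec<bool> {
    let mut ring = ToyRing::new(2, 2);
    let mut log = Vec::new();
    log.push(ring.try_write()); // ok
    log.push(ring.try_write()); // ok, ring is now full
    ring.read(0); // the fast stream advances
    log.push(ring.try_write()); // fails: stream 1 has not read anything yet
    ring.read(1); // the slow stream catches up by one
    log.push(ring.try_write()); // ok again
    log
}

fn main() {
    println!("{:?}", demo_backlog());
}
```

Note that the third write fails even though stream 0 has made progress; only the slowest stream matters.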

## <a name = "footnotes">Footnotes</a>

<a name = "ft1">1</a>. The queue is technically not lockless - a writer which has claimed a write spot
but then gotten stuck will block readers from progressing. There is a lockless MPMC bounded queue,
but it requires a statically known maximum number of senders and I don't think it can be extended to broadcast.
In practice, it will rarely ever matter.

<a name = "ft2">2</a>. These benchmarks had extremely varying results, so I took the upper bound. On some other machines
they showed only minor performance differences compared to the spsc case, so
I think in practice the effective throughput will be much higher.