ring-buffer-macro 0.2.0

A procedural macro for creating ring buffer (circular buffer) data structures at compile time.

# ring-buffer-macro Internals

A deep-dive into how this crate works, why it works, and what all the proc macro
nonsense is actually doing under the hood.

---

## Table of Contents

1. [What Even Is a Proc Macro?](#what-even-is-a-proc-macro)
2. [The Big Picture](#the-big-picture)
3. [Project Structure](#project-structure)
4. [The Entry Point: `src/lib.rs`](#the-entry-point-srclibrs)
5. [Parsing: `src/parser.rs`](#parsing-srcparserrs)
6. [Error Handling: `src/error.rs`](#error-handling-srcerrorrs)
7. [Standard Mode: `src/generator.rs`](#standard-mode-srcgeneratorrs)
8. [SPSC Mode: `src/spsc.rs`](#spsc-mode-srcspscrs)
9. [MPSC Mode: `src/mpsc.rs`](#mpsc-mode-srcmpscrs)
10. [The Macro Expansion Lifecycle](#the-macro-expansion-lifecycle)
11. [Memory Ordering (the hard part)](#memory-ordering-the-hard-part)
12. [Key Design Decisions](#key-design-decisions)

---

## What Even Is a Proc Macro?

Before diving into the code, you need to understand what a procedural macro (proc
macro) actually is, because it's genuinely one of the weirder things in Rust.

### Normal Code vs. Macro Code

When you write normal Rust code, the compiler reads your `.rs` files, parses them
into an AST (Abstract Syntax Tree), type-checks everything, and emits machine code.
Straightforward.

A proc macro is a **Rust program that runs at compile time**. It takes Rust source
code as input (as tokens), does whatever it wants with that input, and produces new
Rust source code as output (also as tokens). The compiler then takes that output and
continues compiling it as if you'd written it by hand.

So when you write:

```rust
#[ring_buffer(5)]
struct MyBuffer(i32);
```

The compiler sees the `#[ring_buffer(5)]` attribute, goes "oh, that's a proc macro",
and hands two things to our macro:

1. **The arguments**: `5` (what's inside the parentheses)
2. **The input**: `struct MyBuffer(i32);` (the item the attribute is attached to)

Our macro then runs, processes these, and returns a completely new chunk of Rust code
that replaces the original. That new code includes the struct (now with different
fields) and a full `impl` block with all the ring buffer methods.

### The Three Crates

Proc macros live in a special kind of crate (`proc-macro = true` in `Cargo.toml`).
This crate depends on three core libraries:

- **`proc_macro`** (shipped with the compiler): Provides the raw `TokenStream` type
  that the compiler hands to your macro. This is the "real" token stream that the
  compiler understands.

- **`syn`**: A parsing library. It takes raw tokens and parses them into a strongly
  typed AST. Instead of dealing with raw tokens like `Ident("struct")`, `Ident("MyBuffer")`,
  `Punct('(')`, etc., `syn` gives you nice Rust types like `DeriveInput` with fields
  like `.ident`, `.data`, `.generics`. It's basically a Rust parser in library form.

- **`quote`**: The inverse of `syn`. Where `syn` goes from tokens to structured data,
  `quote` goes from structured data back to tokens. The `quote!` macro lets you write
  Rust-looking code with interpolation holes (using `#variable_name`) and it produces
  a `TokenStream`.

The flow is: **raw tokens -> syn (parse) -> your logic -> quote (generate) -> raw tokens**.

### Attribute Macro vs. Derive Macro

There are actually multiple kinds of proc macros. The two most common are:

- **Derive macros**: `#[derive(Debug, Clone)]` -- these _add_ code alongside your
  struct. Your struct stays exactly as-is, and the macro adds an `impl Debug for ...`
  block.

- **Attribute macros**: `#[ring_buffer(5)]` -- these _replace_ the item they're
  attached to. The macro gets the entire struct, can modify it however it wants, and
  returns the replacement.

This crate uses an **attribute macro** because we need to _transform_ the struct.
We take a tuple struct `struct MyBuffer(i32)` and replace it with a completely
different named struct with fields like `data`, `head`, `tail`, etc. A derive macro
can't do that -- it can only add code, not modify existing code.

---

## The Big Picture

Here's what happens when someone writes:

```rust
#[ring_buffer(5)]
struct MyBuffer(i32);
```

**Step 1: Parse the arguments**
The `5` gets parsed into a `RingBufferArgs` struct: `{ capacity: 5, mode: Standard, power_of_two: false, cache_padded: false, blocking: false }`.

**Step 2: Parse the input**
The `struct MyBuffer(i32)` gets parsed into `syn::DeriveInput`. This is a structured
representation of the struct -- its name (`MyBuffer`), its visibility (`pub` or not),
its generics (none here), and its fields (one unnamed field of type `i32`).

**Step 3: Extract the element type**
We dig into the parsed struct and pull out `i32` as the element type. This is what
the buffer will store.

**Step 4: Transform the struct**
We replace the tuple struct's single unnamed field with named fields:

```rust
struct MyBuffer {
    data: Vec<i32>,
    capacity: usize,
    head: usize,
    tail: usize,
    size: usize,
}
```

This happens by mutating the `DeriveInput` in place. We literally rip out the old
`Fields::Unnamed` and replace it with `Fields::Named`.

**Step 5: Generate the implementation**
We use `quote!` to generate a full `impl MyBuffer { ... }` block with `new()`,
`enqueue()`, `dequeue()`, and all the other methods. The `#element_type` placeholder
gets replaced with `i32`, `#capacity` with `5`, etc.

**Step 6: Return everything**
We combine the modified struct definition and the generated impl block into one
`TokenStream` and hand it back to the compiler.

The compiler then compiles the output as if you had hand-written a 200+ line struct
with all those methods. The user just sees a 2-line annotation.

---

## Project Structure

```
src/
  lib.rs        -- Entry point. The #[proc_macro_attribute] function lives here.
  parser.rs     -- Parses macro arguments and extracts the element type from the struct.
  error.rs      -- Custom error types that produce nice compile-time error messages.
  generator.rs  -- Generates code for standard (single-threaded) mode.
  spsc.rs       -- Generates code for SPSC (single-producer, single-consumer) mode.
  mpsc.rs       -- Generates code for MPSC (multi-producer, single-consumer) mode.
```

Each mode gets its own file because the generated code is significantly different
between modes. Standard mode uses `Vec<T>` and simple integer indices. SPSC uses
`UnsafeCell<Vec<MaybeUninit<T>>>` and atomic indices. MPSC adds CAS loops and
per-slot `AtomicBool` flags. They share the same parser and error handling, but
the generation logic is mode-specific.

---

## The Entry Point: `src/lib.rs`

```rust
#[proc_macro_attribute]
pub fn ring_buffer(args: TokenStream, input: TokenStream) -> TokenStream {
    let args = parse_macro_input!(args as RingBufferArgs);
    let mut input = parse_macro_input!(input as DeriveInput);

    match expand_ring_buffer(args, &mut input) {
        Ok(tokens) => tokens,
        Err(e) => e.to_compile_error().into(),
    }
}
```

This is the actual macro function. The `#[proc_macro_attribute]` annotation tells the
Rust compiler "this function is an attribute macro". It takes two `TokenStream`
arguments:

- `args`: Everything inside the parentheses, e.g. `5` or `capacity = 1024, mode = "spsc"`.
- `input`: The struct definition the attribute is attached to.

`parse_macro_input!` is a `syn` macro that parses raw tokens into a typed struct. If
parsing fails, it automatically returns a compile error. `RingBufferArgs` is our
custom type (defined in `parser.rs`), and `DeriveInput` is syn's representation of
any item that could have a derive attribute (structs, enums, unions).

The `expand_ring_buffer` function does the real work. It's separated out so we can
use `?` for error handling (the main macro function can't use `?` because its return
type is `TokenStream`, not `Result`).

Inside `expand_ring_buffer`:

```rust
fn expand_ring_buffer(args: RingBufferArgs, input: &mut DeriveInput) -> Result<TokenStream> {
    let element_type = find_element_type(input)?;

    let expanded = match args.mode {
        BufferMode::Standard => { /* ... */ }
        BufferMode::Spsc => { /* ... */ }
        BufferMode::Mpsc => { /* ... */ }
    };

    Ok(expanded.into())
}
```

Each mode follows the same two-step pattern:

1. **`add_*_fields()`** -- Mutates the struct to replace tuple fields with named fields
2. **`generate_*_impl()`** -- Generates the `impl` block with all methods

The `quote! { #input #implementation }` at the end combines both pieces: the modified
struct definition and the generated implementation.

---

## Parsing: `src/parser.rs`

This file handles two things: parsing the macro arguments and extracting the element
type from the struct.

### `RingBufferArgs`

```rust
pub struct RingBufferArgs {
    pub capacity: usize,
    pub mode: BufferMode,
    pub power_of_two: bool,
    pub cache_padded: bool,
    pub blocking: bool,
}
```

This struct holds everything the user configured. The `Parse` trait implementation
handles two syntaxes:

**Simple syntax**: `#[ring_buffer(5)]`

The parser peeks at the first token. If it's a `LitInt` (integer literal), it takes
the simple path: parse the number, set defaults for everything else.

**Named parameter syntax**: `#[ring_buffer(capacity = 1024, mode = "spsc")]`

If the first token is an `Ident` (identifier), it enters a loop that parses
`key = value` pairs separated by commas. Each key is matched against known parameter
names.

The `peek` / `lookahead` pattern is how `syn` does speculative parsing. Instead of
trying to parse and handling errors, you look ahead at what the next token _is_ and
branch on that. This avoids consuming tokens you can't put back.

### `find_element_type`

```rust
pub fn find_element_type(input: &DeriveInput) -> Result<Type> {
    match &input.data {
        Data::Struct(data_struct) => match &data_struct.fields {
            Fields::Unnamed(fields) if fields.unnamed.len() == 1 => {
                Ok(fields.unnamed.first().unwrap().ty.clone())
            }
            // ... error cases
        },
        _ => Err(Error::not_a_struct(input.ident.span())),
    }
}
```

This digs through syn's AST to find the type inside the tuple struct. The path is:

`DeriveInput` -> `Data::Struct` -> `DataStruct.fields` -> `Fields::Unnamed` -> first
field -> `.ty` (the `Type`)

For `struct MyBuffer(i32)`, this returns the `Type` representing `i32`. For
`struct GenericBuffer<T: Clone>(Vec<T>)`, this returns the `Type` representing `Vec<T>`.

The function validates that:
- The item is a struct (not an enum or union)
- The struct uses tuple syntax (not named fields or unit struct)
- There's exactly one field (not zero, not two)

If any of these fail, it returns a descriptive error that shows up as a compile-time
error pointing at the right span (location in the source code).

---

## Error Handling: `src/error.rs`

```rust
pub enum Error {
    NotAStruct(Span),
    NotTupleStruct(Span),
    InvalidTupleStruct(Span),
    Syn(SynError),
}
```

Each variant carries a `Span`, which is a location in the user's source code. When
we convert an error to a compile error via `to_compile_error()`, the error message
points at the right line and column in the user's code, not somewhere in our macro
internals.

For example, if someone writes:

```rust
#[ring_buffer(5)]
enum NotAStruct { A, B }
```

They'll see: `error: ring_buffer can only be applied to structs` pointing at the
`enum` keyword. The span makes this possible.

The `From<SynError>` impl lets us use `?` with syn's parsing functions, which return
`syn::Error`. Those automatically get wrapped in our `Error::Syn` variant.

---

## Standard Mode: `src/generator.rs`

This generates the simplest version of the ring buffer -- single-threaded, no atomics,
no unsafe.

### `add_fields`

This function transforms:

```rust
struct MyBuffer(i32);
```

into:

```rust
struct MyBuffer {
    data: Vec<i32>,
    capacity: usize,
    head: usize,
    tail: usize,
    size: usize,
}
```

It does this by:

1. Creating new `syn::Field` values using `syn::parse_quote!`. This macro lets you
   write Rust syntax that gets parsed into syn types. So `syn::parse_quote! { data: Vec<#element_type> }`
   creates a field named `data` of type `Vec<i32>` (with `#element_type` interpolated).

2. Building a `FieldsNamed` struct and pushing the fields into it.

3. Replacing `data_struct.fields` (which was `Fields::Unnamed`) with `Fields::Named`.

If `cache_padded = true`, the `head` and `tail` fields use a generated cache-padded
wrapper type instead of plain `usize`. This type is `#[repr(C, align(64))]` which
forces 64-byte alignment (a typical cache line size). This prevents false sharing
when head and tail are accessed from different threads (more relevant in the
concurrent modes, but available here too for consistency).
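
The alignment effect is easy to check with a toy wrapper (illustrative type with the same repr attributes; not the macro's actual generated wrapper):

```rust
use std::mem::{align_of, size_of};

// Illustrative wrapper; the generated type uses the same repr attributes.
#[repr(C, align(64))]
struct CachePadded(usize);

fn main() {
    // The wrapper is padded out to a full 64-byte cache line,
    // so two CachePadded fields can never share a line.
    assert_eq!(align_of::<CachePadded>(), 64);
    assert_eq!(size_of::<CachePadded>(), 64);
    // A bare usize is far smaller, so adjacent head/tail fields
    // would land on the same cache line and false-share.
    assert!(size_of::<usize>() < 64);
}
```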

### `generate_impl`

This is where the bulk of the generated code lives. The function builds an entire
`impl` block using `quote!`.

Key things happening:

**Index calculation**: The `next_head` and `next_tail` expressions compute the next
index with wraparound. Normally this is `(index + 1) % capacity`. If `power_of_two`
is enabled, it's `(index + 1) & mask` where `mask = capacity - 1`. The bitwise AND
is faster than modulo on most CPUs because modulo requires a division instruction.
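
The equivalence only holds when the capacity is a power of two, which is quick to verify (standalone check, not the generated code):

```rust
// Wraparound via mask; valid only when capacity is a power of two.
fn next_masked(index: usize, capacity: usize) -> usize {
    (index + 1) & (capacity - 1)
}

fn main() {
    // For a power-of-two capacity, mask and modulo agree at every index...
    for i in 0..32 {
        assert_eq!(next_masked(i, 8), (i + 1) % 8);
    }
    // ...but for capacity 5 at index 5: (5 + 1) & 4 = 4, while (5 + 1) % 5 = 1.
    assert_ne!(next_masked(5, 5), (5 + 1) % 5);
}
```

This is why the macro validates the capacity when `power_of_two` is requested: the mask trick silently produces wrong indices for any other capacity.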

**`enqueue`**: Checks if full, writes to the tail slot, advances tail with wraparound,
increments size. If the backing `Vec` hasn't been filled to this index yet, it uses
`push()`; otherwise it overwrites with direct indexing. This is because the `Vec`
starts empty and grows lazily.

**`dequeue`**: Checks if empty, clones the item at head (this is why standard mode
requires `Clone`), advances head with wraparound, decrements size. The `where T: Clone`
bound is placed on the method, not the struct, so you can create a buffer of
non-Clone types -- you just can't dequeue from it (which is enforced at compile time).
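
The method-level bound is visible on a simplified stand-in type (hypothetical `Buf`, not the generated struct, but the bound placement is the same):

```rust
// Hypothetical stand-in type to show where the bound lives.
struct Buf<T> {
    items: Vec<T>,
}

impl<T> Buf<T> {
    fn enqueue(&mut self, item: T) {
        self.items.push(item);
    }

    // The Clone bound is on the method, not on `struct Buf<T>`.
    fn dequeue(&mut self) -> Option<T>
    where
        T: Clone,
    {
        if self.items.is_empty() {
            return None;
        }
        let item = self.items[0].clone(); // clone out, like standard mode
        self.items.remove(0);
        Some(item)
    }
}

struct NotClone;

fn main() {
    // A buffer of a non-Clone type compiles and can enqueue...
    let mut b = Buf { items: Vec::new() };
    b.enqueue(NotClone);
    // ...but calling `b.dequeue()` here would fail to compile:
    // "the trait bound `NotClone: Clone` is not satisfied".

    let mut nums = Buf { items: Vec::new() };
    nums.enqueue(5);
    assert_eq!(nums.dequeue(), Some(5));
}
```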

**`drain`**: This creates a separate struct (`MyBufferDrain`) that acts as an iterator.
It holds a raw pointer to the buffer and calls `dequeue()` on each `next()`. The
`Drop` impl ensures remaining items are drained even if the iterator is dropped early.
The raw pointer is necessary because you can't hold a `&mut` reference to the buffer
while also having the drain struct (which borrows the buffer) -- Rust's borrow checker
doesn't allow it. This is a common pattern in std (see `Vec::drain()`).
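
Here's a minimal sketch of that pattern on a hypothetical two-method buffer (the generated drain has the same shape but walks the ring indices instead of a plain `Vec`):

```rust
// Hypothetical simplified buffer standing in for the generated struct.
struct Buf {
    items: Vec<i32>,
}

impl Buf {
    fn dequeue(&mut self) -> Option<i32> {
        if self.items.is_empty() { None } else { Some(self.items.remove(0)) }
    }

    fn drain(&mut self) -> BufDrain {
        BufDrain { buffer: self as *mut Buf }
    }
}

struct BufDrain {
    // The raw pointer sidesteps the borrow checker: the drain struct
    // logically borrows the buffer mutably, but without a lifetime tie.
    buffer: *mut Buf,
}

impl Iterator for BufDrain {
    type Item = i32;
    fn next(&mut self) -> Option<i32> {
        unsafe { (*self.buffer).dequeue() }
    }
}

impl Drop for BufDrain {
    fn drop(&mut self) {
        // Consume anything left, even if iteration stopped early.
        while self.next().is_some() {}
    }
}

fn main() {
    let mut buf = Buf { items: vec![1, 2, 3] };
    let drained: Vec<i32> = buf.drain().collect();
    assert_eq!(drained, vec![1, 2, 3]);
    assert!(buf.items.is_empty());
}
```

The safety obligation (don't touch the buffer while a drain is live) falls on the generated code's structure rather than the type system, which is exactly the trade `Vec::drain` makes internally.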

**Visibility propagation**: The `#vis` interpolation ensures generated methods and
types inherit the same visibility as the original struct. If you write
`pub struct MyBuffer(i32)`, the generated `new()`, `enqueue()`, etc. are all `pub` too.

**Generics propagation**: `#impl_generics`, `#ty_generics`, and `#where_clause` are
produced by `generics.split_for_impl()`. This is syn's standard way of handling
generics in generated code. For a struct like `GenericBuffer<T: Clone>`, these expand
to:

- `impl_generics`: `<T: Clone>` (goes after `impl`)
- `ty_generics`: `<T>` (goes after the struct name)
- `where_clause`: empty here (but would hold additional `where` bounds)

---

## SPSC Mode: `src/spsc.rs`

SPSC (Single-Producer, Single-Consumer) mode generates a lock-free ring buffer that
can be safely used from two threads: one writing, one reading.

### How It's Different From Standard Mode

The fundamental difference is **no locks, no mutexes**. Standard mode uses `&mut self`
for enqueue/dequeue, which means only one thread can touch the buffer at a time. SPSC
mode uses shared references (`&self`) and atomic operations to allow concurrent access.

The generated struct looks like:

```rust
struct SpscBuffer {
    data: UnsafeCell<Vec<MaybeUninit<i32>>>,
    head: AtomicUsize,        // only the consumer modifies this
    tail: AtomicUsize,        // only the producer modifies this
    _marker: PhantomData<i32>,
}
```

Key differences from standard mode:

- **`UnsafeCell`**: This is Rust's escape hatch for interior mutability. Normally,
  you can't mutate data through a `&` reference. `UnsafeCell` says "I know what I'm
  doing, let me mutate through a shared reference." This is safe here because the
  producer only writes to slots the consumer isn't reading, and vice versa.

- **`MaybeUninit<T>`**: Instead of cloning items out like standard mode, SPSC mode
  _moves_ items. A slot contains `MaybeUninit<T>` -- it might have a valid value or
  it might be garbage. When the producer writes, it puts a real value in with
  `MaybeUninit::new()`. When the consumer reads, it takes the value out with
  `assume_init_read()`. No clone needed, which means `T` doesn't need `Clone` --
  it just needs `Send` (can be transferred between threads).

- **`AtomicUsize`**: The head and tail indices are atomic, meaning they can be read
  and written from different threads without data races. The specific memory orderings
  used (Acquire/Release) ensure that when the consumer sees a new tail value, all the
  data the producer wrote to that slot is actually visible.

- **`PhantomData<T>`**: The struct doesn't directly own a `T` (it owns
  `UnsafeCell<Vec<MaybeUninit<T>>>`), so the compiler doesn't know it logically owns
  `T` values. `PhantomData` tells the compiler "pretend this struct owns `T`" so that
  the drop checker and variance rules work correctly.
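
Putting those pieces together, a hand-written miniature of the SPSC core looks roughly like this. It's a sketch, not the generated code: the element type is fixed to `i32`, and it uses the common one-slot-left-empty convention to tell full from empty, which may differ in detail from what the macro emits:

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering};

struct Spsc {
    data: UnsafeCell<Vec<MaybeUninit<i32>>>,
    head: AtomicUsize, // written only by the consumer
    tail: AtomicUsize, // written only by the producer
}

// Promise the compiler that the access protocol makes sharing safe.
unsafe impl Sync for Spsc {}

impl Spsc {
    fn new(capacity: usize) -> Self {
        let mut v = Vec::with_capacity(capacity);
        v.resize_with(capacity, MaybeUninit::uninit);
        Spsc {
            data: UnsafeCell::new(v),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    // Producer side only.
    fn try_enqueue(&self, item: i32) -> Result<(), i32> {
        let cap = unsafe { (*self.data.get()).len() };
        let tail = self.tail.load(Ordering::Relaxed); // sole writer of tail
        let next = (tail + 1) % cap;
        if next == self.head.load(Ordering::Acquire) {
            return Err(item); // full (one slot deliberately kept empty)
        }
        unsafe { (*self.data.get())[tail] = MaybeUninit::new(item) };
        self.tail.store(next, Ordering::Release); // publish the slot
        Ok(())
    }

    // Consumer side only.
    fn try_dequeue(&self) -> Option<i32> {
        let cap = unsafe { (*self.data.get()).len() };
        let head = self.head.load(Ordering::Relaxed); // sole writer of head
        if head == self.tail.load(Ordering::Acquire) {
            return None; // empty
        }
        let item = unsafe { (*self.data.get())[head].assume_init_read() };
        self.head.store((head + 1) % cap, Ordering::Release); // free the slot
        Some(item)
    }
}

fn main() {
    let buf = Spsc::new(4);
    assert!(buf.try_enqueue(1).is_ok());
    assert!(buf.try_enqueue(2).is_ok());
    assert_eq!(buf.try_dequeue(), Some(1));
    assert_eq!(buf.try_dequeue(), Some(2));
    assert_eq!(buf.try_dequeue(), None);
}
```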

### The Split Pattern

Instead of directly calling `enqueue`/`dequeue` on the buffer, SPSC mode uses a
split pattern:

```rust
let buf = SpscBuffer::new();
let (producer, consumer) = buf.split();
```

`split()` returns two lightweight handle structs: `SpscBufferProducer` and
`SpscBufferConsumer`. Each holds a `&SpscBuffer` reference. The producer can only
enqueue; the consumer can only dequeue. This separation is enforced at the type level.

Why not just have `enqueue` and `dequeue` on the buffer directly? Because then
there's nothing stopping you from calling both from the same thread, or having two
threads both enqueue. The split pattern makes it structurally impossible to misuse --
you physically can't call `try_dequeue` on a producer.

### The `unsafe impl Send/Sync` Block

```rust
unsafe impl Send for SpscBuffer where i32: Send {}
unsafe impl Sync for SpscBuffer where i32: Send {}
unsafe impl<'a> Send for SpscBufferProducer<'a> where i32: Send {}
unsafe impl<'a> Send for SpscBufferConsumer<'a> where i32: Send {}
```

The compiler can't automatically verify that our `UnsafeCell`-based concurrent access
is safe, so we manually promise it with `unsafe impl`. The `where T: Send` bound
ensures we only do this for types that are safe to transfer between threads (which
excludes things like `Rc<T>`).

`Sync` on the buffer means it can be shared between threads via `&` references (which
is necessary for `Arc<SpscBuffer>` to work). `Send` on the handles means they can be
moved to other threads.

### Blocking Mode

When `blocking = true`, three extra fields are added:

```rust
mutex: Mutex<()>,
not_empty: Condvar,
not_full: Condvar,
```

The `Mutex` doesn't actually protect any data (it guards `()`, nothing). It exists
purely because `Condvar::wait()` requires a `MutexGuard`. The condition variables
provide efficient waiting:

- `enqueue_blocking()`: If the buffer is full, the producer locks the mutex, calls
  `not_full.wait()`, and goes to sleep. The OS wakes it up when a consumer calls
  `not_full.notify_one()` after dequeueing an item.

- `dequeue_blocking()`: Same idea in reverse. If the buffer is empty, the consumer
  waits on `not_empty` until a producer notifies it.

This is much better than busy-spinning (which wastes CPU) but has higher latency
than pure lock-free (because waking a thread involves a syscall).
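
A stripped-down version of that handshake, with an atomic counter standing in for the buffer's occupancy (hypothetical names, not the generated API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Hypothetical stand-in: `size` plays the role of the buffer's occupancy.
struct Signal {
    size: AtomicUsize,
    mutex: Mutex<()>, // guards no data; exists only for the Condvar
    not_empty: Condvar,
}

// Consumer side: sleep until there is something to take.
fn dequeue_blocking(s: &Signal) -> usize {
    let mut guard = s.mutex.lock().unwrap();
    // Re-check in a loop: Condvar wakeups can be spurious.
    while s.size.load(Ordering::Acquire) == 0 {
        guard = s.not_empty.wait(guard).unwrap();
    }
    s.size.fetch_sub(1, Ordering::AcqRel) // returns the previous value
}

// Producer side: publish the item, then wake one sleeping consumer.
fn enqueue_blocking(s: &Signal) {
    s.size.fetch_add(1, Ordering::Release);
    let _guard = s.mutex.lock().unwrap(); // pair with the waiter's lock
    s.not_empty.notify_one();
}

fn main() {
    let s = Arc::new(Signal {
        size: AtomicUsize::new(0),
        mutex: Mutex::new(()),
        not_empty: Condvar::new(),
    });
    let consumer = {
        let s = Arc::clone(&s);
        thread::spawn(move || dequeue_blocking(&s))
    };
    enqueue_blocking(&s);
    assert_eq!(consumer.join().unwrap(), 1); // saw exactly the one item
}
```

Note the producer takes the (empty) lock before notifying: checking the condition and sleeping both happen under the lock on the consumer side, so this ordering is what rules out a lost wakeup.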

---

## MPSC Mode: `src/mpsc.rs`

MPSC (Multi-Producer, Single-Consumer) adds support for multiple threads enqueuing
concurrently. This is significantly more complex than SPSC.

### The Problem

In SPSC, only one thread writes the tail index, so a simple atomic store is enough.
In MPSC, multiple threads are competing to write to the next slot. If Thread A and
Thread B both read `tail = 5` and both try to write to slot 5, you get a data race.

### The Solution: CAS (Compare-And-Swap)

The core of MPSC is a CAS loop in `try_enqueue`:

```rust
loop {
    let tail = self.buffer.tail.load(Ordering::Relaxed);
    let next_tail = (tail + 1) % capacity;
    let head = self.buffer.head.load(Ordering::Acquire);

    if next_tail == head { return Err(item); }  // Full

    match self.buffer.tail.compare_exchange_weak(
        tail,
        next_tail,
        Ordering::AcqRel,  // success: see prior producers' data, publish ours
        Ordering::Relaxed, // failure: we just retry
    ) {
        Ok(_) => {
            // We won! Write our data to slot `tail`
            data[tail] = MaybeUninit::new(item);
            written[tail].store(true, Ordering::Release);
            return Ok(());
        }
        Err(_) => continue,  // Someone else got there first, retry
    }
}
```

`compare_exchange_weak` is the atomic CAS operation. It says: "If `tail` is still
the value I read earlier, atomically change it to `next_tail`. If someone else already
changed it, fail and tell me the new value."

So if Thread A and Thread B both read `tail = 5`:
- Thread A does CAS(5 -> 6) -- succeeds, writes to slot 5
- Thread B does CAS(5 -> 6) -- fails (tail is now 6), loops back
- Thread B reads `tail = 6`, does CAS(6 -> 7) -- succeeds, writes to slot 6

This guarantees that every producer gets a unique slot to write to.
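
That uniqueness guarantee can be demonstrated in isolation (standalone demo of the CAS loop; no buffer, no wraparound):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Claim the next slot index via CAS. The point is only that concurrent
// callers never receive the same index.
fn claim_slot(tail: &AtomicUsize) -> usize {
    loop {
        let current = tail.load(Ordering::Relaxed);
        match tail.compare_exchange_weak(
            current,
            current + 1,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return current,
            Err(_) => continue, // another producer advanced tail first
        }
    }
}

fn main() {
    let tail = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let tail = Arc::clone(&tail);
            thread::spawn(move || (0..100).map(|_| claim_slot(&tail)).collect::<Vec<_>>())
        })
        .collect();

    let mut slots: Vec<usize> = handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect();
    slots.sort();
    // 400 total claims across 4 threads -> indices 0..400, each exactly once.
    assert_eq!(slots, (0..400).collect::<Vec<_>>());
}
```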

### The Written Flags

There's a subtle problem: after a producer wins the CAS, there's a gap between
"I claimed slot 5" and "I finished writing data to slot 5". If the consumer sees
`tail = 6` and tries to read slot 5, the producer might not have finished writing yet.

The `written` array (`Box<[AtomicBool]>`) solves this. Each slot has a boolean flag:

1. Producer claims slot via CAS on tail
2. Producer writes data to the slot
3. Producer sets `written[slot] = true` (with Release ordering)
4. Consumer checks `written[slot]` before reading (with Acquire ordering)
5. Consumer reads data
6. Consumer sets `written[slot] = false`

This ensures the consumer never reads partially-written data.

### Producer vs Consumer Handles

Unlike SPSC's `split()` which returns both handles at once, MPSC has separate methods:

```rust
let producer = buffer.producer();  // Can call this multiple times / clone
let consumer = buffer.consumer();  // Should only have one
```

The producer handle is `#[derive(Clone)]` because you need multiple producers. The
consumer handle is not clonable because the protocol only supports one consumer (the
consumer doesn't use CAS for the head pointer, so two consumers would race).

Note: there's no runtime enforcement of "only one consumer". It's documented but
not prevented at the type level. If you call `consumer()` twice and use both from
different threads, you'll get data races. This is a known limitation.

---

## The Macro Expansion Lifecycle

To make this concrete, here's exactly what `#[ring_buffer(5)] struct Buf(i32);`
expands to in standard mode (simplified):

```rust
// The struct (transformed from tuple to named):
struct Buf {
    data: Vec<i32>,
    capacity: usize,
    head: usize,
    tail: usize,
    size: usize,
}

// The generated impl:
impl Buf {
    fn new() -> Self {
        Self {
            data: Vec::with_capacity(5),
            capacity: 5,
            head: 0,
            tail: 0,
            size: 0,
        }
    }

    fn enqueue(&mut self, item: i32) -> Result<(), i32> {
        if self.is_full() { return Err(item); }
        let tail = self.tail;
        if self.data.len() <= tail {
            self.data.push(item);
        } else {
            self.data[tail] = item;
        }
        self.tail = (self.tail + 1) % self.capacity;
        self.size += 1;
        Ok(())
    }

    fn dequeue(&mut self) -> Option<i32> where i32: Clone {
        if self.is_empty() { return None; }
        let head = self.head;
        let item = self.data[head].clone();
        self.head = (self.head + 1) % self.capacity;
        self.size -= 1;
        Some(item)
    }

    fn peek(&self) -> Option<&i32> { /* ... */ }
    fn peek_mut(&mut self) -> Option<&mut i32> { /* ... */ }
    fn peek_back(&self) -> Option<&i32> { /* ... */ }
    fn is_full(&self) -> bool { self.size == self.capacity }
    fn is_empty(&self) -> bool { self.size == 0 }
    fn len(&self) -> usize { self.size }
    fn capacity(&self) -> usize { self.capacity }
    fn clear(&mut self) { /* ... */ }
    fn iter(&self) -> impl Iterator<Item = &i32> { /* ... */ }
    fn drain(&mut self) -> BufDrain { /* ... */ }
}

struct BufDrain { buffer: *mut Buf }
impl Iterator for BufDrain { /* ... */ }
impl Drop for BufDrain { /* ... */ }
```

You write 2 lines. The macro generates ~100 lines. That's the point.

---

## Memory Ordering (the hard part)

If you're not familiar with atomics, this section will hurt. But it's the most
important part of understanding the concurrent modes.

### Why Memory Ordering Matters

CPUs don't execute instructions in order. They reorder reads and writes for
performance. On a single thread, this is invisible -- the CPU guarantees the
_illusion_ of sequential execution. But across threads, reordering is visible
and causes bugs.

Example without ordering guarantees:

```
Thread A (producer):              Thread B (consumer):
  data[5] = 42;                     if tail == 6 {
  tail = 6;                           x = data[5];  // might see garbage!
                                    }
```

The CPU might reorder Thread A's operations, writing `tail = 6` _before_ `data[5] = 42`.
Thread B sees the new tail, reads slot 5, and gets garbage.

### Acquire and Release

This crate primarily uses **Acquire** and **Release** ordering:

- **Release** (on stores): "All writes I did before this store are visible to anyone
  who does an Acquire load of this value."

- **Acquire** (on loads): "I can see all writes that happened before the Release store
  that wrote the value I just read."

Together, they form a _happens-before_ relationship. In the SPSC code:

```rust
// Producer:
data[tail] = MaybeUninit::new(item);           // Write data
self.buffer.tail.store(next_tail, Release);     // Release: data is visible

// Consumer:
let tail = self.buffer.tail.load(Acquire);      // Acquire: sees the data
let item = data[head].assume_init_read();        // Read data (safe now)
```

The Release store on tail _synchronizes with_ the Acquire load on tail. Everything
the producer wrote before the Release (including the data) is guaranteed to be visible
to the consumer after the Acquire.
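
The guarantee is observable in a tiny message-passing program: the payload itself is written with a plain Relaxed store, and only the Release/Acquire pair on the flag makes reading it safe (standalone demo, not the generated code):

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);    // write the payload
        READY.store(true, Ordering::Release); // publish: everything above is now visible
    });
    let consumer = thread::spawn(|| {
        // Spin until the Acquire load observes the Release store...
        while !READY.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        // ...at which point this read is guaranteed to see 42.
        DATA.load(Ordering::Relaxed)
    });
    producer.join().unwrap();
    assert_eq!(consumer.join().unwrap(), 42);
}
```

With both orderings weakened to Relaxed, the assert could legitimately fail on a weakly-ordered CPU (e.g. ARM); the happens-before edge is what forbids that.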

### Relaxed

**Relaxed** ordering means "just do the atomic operation, no ordering guarantees."
It's the cheapest ordering. We use it when:

- A thread is reading its own index (the producer reads `tail` with Relaxed because
  it's the only writer of `tail`)
- Ordering doesn't matter for correctness (e.g., approximate `len()` calculations)

### AcqRel (in MPSC)

`compare_exchange_weak` in the MPSC producer uses `AcqRel` (Acquire + Release
combined). This is because the CAS both reads the old tail value (needs Acquire to see
the previous producer's data) and writes the new tail value (needs Release so the
consumer can see our data).

---

## Key Design Decisions

### Why a tuple struct as input?

The macro requires `struct Buffer(i32)`, not `struct Buffer { element_type: i32 }`.
This is purely ergonomic. The tuple struct syntax `(T)` is a concise way to specify
"this buffer holds T". The single field is immediately extracted and thrown away -- the
actual struct gets completely different fields. The tuple struct is just a vehicle for
specifying the element type and the struct name.

### Why `Vec<T>` in standard mode instead of an array?

Arrays in Rust need a const generic size: `[T; N]`. While this would work fine with
our known-at-compile-time capacity, `Vec` is simpler to generate and avoids
complications with uninitialized memory. The `Vec` is pre-allocated to full capacity
on creation (`Vec::with_capacity(N)`) so there's no reallocation. The trade-off is
one heap allocation instead of inline storage, but for a runtime-created buffer
this is fine.
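
The no-reallocation claim is checkable: filling a pre-allocated `Vec` up to its capacity leaves the backing storage at the same address (standalone check):

```rust
fn main() {
    let mut v: Vec<i32> = Vec::with_capacity(5);
    let ptr = v.as_ptr();
    for i in 0..5 {
        v.push(i);
    }
    // Pushing up to the pre-allocated capacity never reallocates,
    // so the buffer address is stable.
    assert_eq!(v.as_ptr(), ptr);
    assert!(v.capacity() >= 5); // with_capacity guarantees *at least* 5
}
```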

### Why `MaybeUninit<T>` in concurrent modes?

Standard mode uses `Vec<T>` and clones items on dequeue. This requires `T: Clone`.
Concurrent modes use `MaybeUninit<T>` and _move_ items (write with
`MaybeUninit::new()`, read with `assume_init_read()`). This drops the `Clone`
requirement to just `Send`, which is important for performance-critical concurrent
code where cloning in a hot loop is unacceptable.

### Why `UnsafeCell` instead of `Mutex`?

The whole point of lock-free data structures is to avoid mutex overhead. A `Mutex`
would make the implementation trivially safe but would serialize all access through
a single lock. `UnsafeCell` lets us do concurrent access without locks, at the cost
of us being responsible for correctness (via atomic orderings and careful protocol
design).

### Why no runtime enforcement of "one consumer" in MPSC?

The split pattern in SPSC structurally prevents misuse (you get one producer, one
consumer, that's it). MPSC doesn't do this for the consumer -- you can call
`consumer()` twice. Adding runtime enforcement (e.g., an `AtomicBool` flag) would
add overhead to a path that's supposed to be zero-cost. The choice was to document
the constraint rather than enforce it.

### Why not use `crossbeam` or `std::sync::mpsc`?

Because the point of this crate is compile-time code generation with zero dependencies
(beyond the proc-macro tooling). The generated code is self-contained -- it doesn't
pull in any runtime library. This makes it suitable for environments where you want
minimal dependency trees or need to understand exactly what code is running.

---

That's the full picture. The crate is fundamentally a code generator: it takes a
two-line struct annotation and emits a complete, specialized ring buffer implementation.
The proc macro machinery (`syn`/`quote`) is the plumbing; the actual ring buffer
algorithms (standard FIFO, lock-free SPSC, CAS-based MPSC) are the substance.